Big picture

Binary categorical variables

Suppose that a mechanical engineer wishes to relate the effective life of a cutting tool (\(y\)) used on a lathe to the lathe speed in revolutions per minute (\(x_1\)) and the type of cutting tool used.

\[x_2 = \left\{ \begin{array}{ll} \text{0} & {\text{ if the observation is from tool type A}}\\ \text{1} & {\text{ if the observation is from tool type B}} \end{array} \right.\]

Then we assume the usual MLR \[\begin{equation} \label{eq:MLR0} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon. \quad\quad(1) \end{equation}\]

Q1. How to interpret \((\beta_0, \beta_1, \beta_2)\)?
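
As a quick illustration, here is a minimal sketch that simulates data from Model (1) and fits it; the speed range and all coefficient values are made up purely for illustration.

set.seed(1)
n  <- 40
x1 <- runif(n, 500, 1000)         # lathe speed in rpm (hypothetical range)
x2 <- rep(c(0, 1), each = n / 2)  # 0 = tool type A, 1 = tool type B
y  <- 40 - 0.02 * x1 + 15 * x2 + rnorm(n, sd = 3)  # made-up coefficients
fit0 <- lm(y ~ x1 + x2)
coef(fit0)  # two parallel lines in x1; beta2 is the A-to-B vertical shift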

Interaction

What if we do not want to enforce a common slope? We can use an interaction term to achieve this goal. Specifically, we assume a more flexible MLR model \[\begin{equation} \label{eq:MLR1} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon, \quad\quad(2) \end{equation}\]

where \(x_3 = x_1 x_2\) is the interaction between \(x_1\) and \(x_2\).

Q2. How to interpret all \(\beta\)’s in Model (2)?

  • \(\beta_0\): the intercept for tool Type A (baseline).

  • \(\beta_1\): the slope for tool Type A (baseline).

  • \(\beta_2\): the difference in the intercept resulting from changing from tool Type A to tool Type B.

  • \(\beta_3\): the difference in the slope resulting from changing from tool Type A to tool Type B.
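
Continuing the sketch above, adding the interaction lets each tool type have its own slope; the coefficients are again hypothetical.

y2 <- 40 - 0.02 * x1 + 15 * x2 - 0.01 * x1 * x2 + rnorm(n, sd = 3)
fit1 <- lm(y2 ~ x1 * x2)  # x1 * x2 expands to x1 + x2 + x1:x2
coef(fit1)                # the x1:x2 estimate is the slope difference beta3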

Categorical variables with more than two levels

Suppose a factor \(F\) has three levels A, B, and C, with A as the baseline. We define two dummy variables: \[x_1 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{B} \\ 0 & \text{otherwise} \end{array} \right., \quad\quad x_2 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{C} \\ 0 & \text{otherwise} \end{array} \right.\] Then we use the MLR model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i.\]

As a result, we have: \[\begin{array}{lll} \text{E}\left[Y~|~F = A \right] = & \text{E}\left[Y~|~ x_1 = 0 \text{ and } x_2 = 0 \right] = & \beta_0 \\ \text{E}\left[Y~|~F = B \right] = & \text{E}\left[Y~|~x_1 = 1 \text{ and } x_2 = 0 \right] = & \beta_0 + \beta_1 \\ \text{E}\left[Y~|~F = C \right] = & \text{E}\left[Y~|~x_1 = 0 \text{ and } x_2 = 1 \right] = & \beta_0 + \beta_2 \end{array} \]

Q3. How to interpret \((\beta_0, \beta_1, \beta_2)\)?

Q4. Based on the interpretation above, what will be the estimates of \((\beta_0, \beta_1, \beta_2)\)?

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Y <- iris$Sepal.Length  # response: sepal length
F <- iris$Species       # three-level factor (note: this masks R's shorthand F for FALSE)

table(F)
## F
##     setosa versicolor  virginica 
##         50         50         50

We first create dummy variables then fit a linear regression model.

x1 <- ifelse(F == "versicolor", 1, 0)  # dummy variable for versicolor
x2 <- ifelse(F == "virginica", 1, 0)   # dummy variable for virginica
fit <- lm(Y ~ x1 + x2)
summary(fit)
## 
## Call:
## lm(formula = Y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## x1            0.9300     0.1030   9.033 8.77e-16 ***
## x2            1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

The coefficient estimates match our expectation in Q4.

Ybarsetosa <- mean(Y[x1 == 0 & x2 == 0])
Ybarversicolor <- mean(Y[x1 == 1 & x2 == 0])
Ybarvirginica <- mean(Y[x1 == 0 & x2 == 1])

## beta0hat, beta1hat, beta2hat 
c(Ybarsetosa, Ybarversicolor - Ybarsetosa, Ybarvirginica - Ybarsetosa)
## [1] 5.006 0.930 1.582

In fact, lm() automatically creates the dummy variables for a categorical (factor) variable.

fit <- lm(Y ~ F)
summary(fit)
## 
## Call:
## lm(formula = Y ~ F)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## Fversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Fvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
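
To see the coding R chose, inspect the design matrix; picking one flower of each species (rows 1, 51, and 101) should reproduce the dummy variables we built by hand.

model.matrix(~ F)[c(1, 51, 101), ]
## columns: (Intercept), Fversicolor, Fvirginica
## setosa -> (1, 0, 0); versicolor -> (1, 1, 0); virginica -> (1, 0, 1)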

Sometimes a categorical variable is stored numerically; use factor(x) to tell R to treat \(x\) as a factor.
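
For instance, if the species had been stored as the numeric codes 1, 2, 3 (a hypothetical coding), lm() would otherwise treat the variable as a single continuous predictor.

g <- as.numeric(F)  # hypothetical numeric coding: 1 = setosa, 2 = versicolor, 3 = virginica
lm(Y ~ g)           # wrong: fits a single common slope in g
lm(Y ~ factor(g))   # correct: recreates the two dummy variables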

Next, let’s do some inference.

boxplot(Y ~ F) # boxplot for visualization

Q5. Test the hypothesis that there are no differences among species mean lengths.

Q6. Test the hypothesis that there are no differences between the mean lengths of setosa and versicolor.
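
One way to carry out both tests in R: Q5 is the overall \(F\)-test from the ANOVA table (the same \(F = 119.3\) reported by summary(fit)), and Q6 is the \(t\)-test of \(H_0: \beta_1 = 0\), i.e., the Fversicolor row of the coefficient table above.

fit <- lm(Y ~ F)
anova(fit)  # Q5: overall F-test of equal species means
summary(fit)$coefficients["Fversicolor", ]  # Q6: t-test of beta1 = 0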

One-way ANOVA

Consider the one-way ANOVA model \[y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, \ldots, K, \quad j = 1, \ldots, m,\] where \(\mu_i\) is the mean of the \(i\)-th group (for the iris data, \(K = 3\) and \(m = 50\)).

Q7. What is the relationship between \(\mu_i\) and the \(\beta\)’s?

Notations:

\[ n = K \cdot m, \quad \bar{y}_{i+} = \frac{1}{m} \sum_{j = 1}^{m} y_{ij}, \quad \bar{y}_{++} = \frac{1}{K} \sum_{i = 1}^K \bar{y}_{i+} = \frac{1}{n} \sum_{i = 1}^K \sum_{j = 1}^{m} y_{ij}.\]

Then it follows that

\[SS_T = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{++} \right)^2, \quad df = n - 1;\] \[SS_R = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(\bar{y}_{i+} - \bar{y}_{++} \right)^2, \quad df = K - 1, \quad MS_R = SS_R / (K - 1);\]

\[SS_{Res} = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{i+} \right)^2, \quad df = n - K, \quad MS_{Res} = SS_{Res} / (n - K).\]

In order to test the hypotheses \[H_0: \mu_1 = \ldots = \mu_K, \quad \text{vs.} \quad H_1: \text{Not } H_0,\]

we use the \(F\)-statistic \[F = MS_{R}/MS_{Res},\] which follows the \(F(K - 1, n - K)\) distribution under \(H_0\); we reject \(H_0\) for large values of \(F\).

summary(aov(Y ~ F))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## F             2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
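
As a sanity check, we can compute the entries of this table directly from the formulas above; the results should match 63.21, 38.96, and 119.3.

K <- nlevels(F); m <- 50; n <- K * m
ybar_i <- ave(Y, F)                  # group mean, repeated for each observation
SSR    <- sum((ybar_i - mean(Y))^2)  # between-group (regression) sum of squares
SSRes  <- sum((Y - ybar_i)^2)        # within-group (residual) sum of squares
Fstat  <- (SSR / (K - 1)) / (SSRes / (n - K))
c(SSR, SSRes, Fstat)
pf(Fstat, K - 1, n - K, lower.tail = FALSE)  # p-value of the overall F-test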

The following parameterization is widely used: \[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}\]
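
This model has \(K + 1\) mean parameters \((\mu, \tau_1, \ldots, \tau_K)\) but only \(K\) group means, so a side constraint is needed for identifiability. Two standard choices are \[\sum_{i=1}^{K} \tau_i = 0 \quad \left(\text{so } \mu = \frac{1}{K}\sum_{i=1}^{K} \mu_i \text{ and } \tau_i = \mu_i - \mu\right), \qquad \text{or} \qquad \tau_1 = 0 \quad (\text{so } \mu = \mu_1, \text{ R's default treatment coding}),\] where \(\mu_i = \mu + \tau_i\) is the mean of group \(i\).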

Summary