Suppose that a mechanical engineer wishes to relate the effective life of a cutting tool (\(y\)) used on a lathe to the lathe speed in revolutions per minute (\(x_1\)) and the type of cutting tool used.
Define a dummy variable
\[x_2 = \left\{ \begin{array}{ll} 0 & \text{if the observation is from tool type A}\\ 1 & \text{if the observation is from tool type B} \end{array} \right.\]
Then we assume the usual MLR model
\[\begin{equation} \label{eq:MLR0} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon. \quad\quad(1) \end{equation}\]
Q1. How to interpret \((\beta_0, \beta_1, \beta_2)\)?
This model implies that the two tool types lead to two parallel regression lines with a common slope \(\beta_1\): the two types differ only in the intercept.
\(\beta_0\): the mean tool life for tool Type A at lathe speed \(x_1 = 0\).
\(\beta_1\): the change in mean tool life per unit increase in lathe speed, for either tool type.
\(\beta_2\): the difference in mean tool life resulting from changing from tool Type A to tool Type B.
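To make these interpretations concrete, here is a minimal simulation sketch for Model (1); the sample size, coefficient values, and noise level below are all made up for illustration:
set.seed(1)
n <- 40
x1 <- runif(n, 500, 1000)                          # hypothetical lathe speeds (rpm)
x2 <- rep(c(0, 1), each = n / 2)                   # 0 = tool type A, 1 = tool type B
y <- 40 - 0.02 * x1 + 15 * x2 + rnorm(n, sd = 2)   # true (beta0, beta1, beta2) = (40, -0.02, 15)
coef(lm(y ~ x1 + x2))                              # estimates should be close to the truth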
If the two tool types may also differ in slope, we extend Model (1) by adding an interaction term:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon, \quad\quad(2)\]
where \(x_3 = x_1 x_2\) is the interaction between \(x_1\) and \(x_2\); call this Model 2.
Q2. How to interpret all \(\beta\)’s in Model 2?
\(\beta_0\): the intercept for tool Type A (baseline).
\(\beta_1\): the slope for tool Type A (baseline).
\(\beta_2\): the difference in the intercept resulting from changing from tool Type A to tool Type B.
\(\beta_3\): the difference in the slope resulting from changing from tool Type A to tool Type B.
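Continuing the simulation sketch above (again with a made-up interaction coefficient), Model 2 can be fit with lm(y ~ x1 * x2), which R expands to x1 + x2 + x1:x2:
set.seed(2)
y2 <- 40 - 0.02 * x1 + 15 * x2 - 0.01 * x1 * x2 + rnorm(n, sd = 2)  # true beta3 = -0.01
coef(lm(y2 ~ x1 * x2))                             # the x1:x2 coefficient estimates beta_3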
Now suppose the regressor of interest is a categorical variable (factor) \(F\) with three levels A, B, and C. We create two dummy variables
\[x_1 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{B} \\ 0 & \text{otherwise} \end{array} \right., \quad\quad x_2 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{C} \\ 0 & \text{otherwise} \end{array} \right. \]
Then we use the MLR model
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i.\]
As a result, we have: \[\begin{array}{lll} \text{E}\left[Y~|~F = A \right] = & \text{E}\left[Y~|~ x_1 = 0 \text{ and } x_2 = 0 \right] = & \beta_0 \\ \text{E}\left[Y~|~F = B \right] = & \text{E}\left[Y~|~x_1 = 1 \text{ and } x_2 = 0 \right] = & \beta_0 + \beta_1 \\ \text{E}\left[Y~|~F = C \right] = & \text{E}\left[Y~|~x_1 = 0 \text{ and } x_2 = 1 \right] = & \beta_0 + \beta_2 \end{array} \]
Q3. How to interpret \((\beta_0, \beta_1, \beta_2)\)?
Q4. Based on the interpretation above, what will be the estimates of \((\beta_0, \beta_1, \beta_2)\)?
data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Y <- iris$Sepal.Length
F <- iris$Species
table(F)
## F
##     setosa versicolor  virginica
##         50         50         50
We first create the dummy variables by hand and then fit a linear regression model.
x1 <- ifelse(F == "versicolor", 1, 0)
x2 <- ifelse(F == "virginica", 1, 0)
fit <- lm(Y ~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = Y ~ x1 + x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6880 -0.3285 -0.0060  0.3120  1.3120
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## x1            0.9300     0.1030   9.033 8.77e-16 ***
## x2            1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
The coefficient estimates match our expectations from Q4: each estimate is a group sample mean or a difference of group sample means.
Ybarsetosa <- mean(Y[x1 == 0 & x2 == 0])
Ybarversicolor <- mean(Y[x1 == 1 & x2 == 0])
Ybarvirginica <- mean(Y[x1 == 0 & x2 == 1])
## beta0hat, beta1hat, beta2hat
c(Ybarsetosa, Ybarversicolor - Ybarsetosa, Ybarvirginica - Ybarsetosa)
## [1] 5.006 0.930 1.582
The lm() function automatically creates dummy variables for a categorical variable.
fit <- lm(Y ~ F)
summary(fit)
##
## Call:
## lm(formula = Y ~ F)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6880 -0.3285 -0.0060  0.3120  1.3120
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## Fversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Fvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
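To see the dummy variables that R constructed behind the scenes, inspect the design matrix:
head(model.matrix(fit))   # columns: (Intercept), Fversicolor, Fvirginica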
Sometimes you need to use factor(x) to tell R that \(x\) is a categorical variable, as in the sketch below.
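For instance (a hypothetical illustration), suppose the species labels were stored as the numbers 1, 2, 3:
g <- as.numeric(iris$Species)   # 1 = setosa, 2 = versicolor, 3 = virginica
coef(lm(Y ~ g))                 # treats g as numeric: a single slope (not what we want)
coef(lm(Y ~ factor(g)))         # treats g as categorical: same fit as lm(Y ~ F)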
Next, let’s do some inference.
boxplot(Y ~ F) # boxplot for visualization
Q5. Test the hypothesis that there are no differences among species mean lengths.
Q6. Test the hypothesis that there are no differences between the mean lengths of setosa and versicolor.
Q7. What is the relationship between the species means \(\mu_i\) and the \(\beta\)’s?
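Q5 is answered by the one-way ANOVA \(F\)-test developed below. For Q6, one option (a sketch using the fit of lm(Y ~ F) above) is the \(t\)-test on the Fversicolor coefficient, since \(\beta_1 = 0\) is equivalent to equal mean lengths for setosa and versicolor:
summary(fit)$coefficients["Fversicolor", ]   # t-test of H0: beta1 = 0
For Q7, compare with the table of expected values above: \(\mu_1 = \beta_0\), \(\mu_2 = \beta_0 + \beta_1\), \(\mu_3 = \beta_0 + \beta_2\).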
Notation: suppose there are \(K\) groups with \(m\) observations each (a balanced design), and let \(y_{ij}\) denote the \(j\)-th observation in group \(i\). Define
\[ n = K \cdot m, \quad \bar{y}_{i+} = \frac{1}{m} \sum_{j = 1}^{m} y_{ij}, \quad \bar{y}_{++} = \frac{1}{K} \sum_{i = 1}^K \bar{y}_{i+} = \frac{1}{n} \sum_{i = 1}^K \sum_{j = 1}^{m} y_{ij}.\]
Define the following sums of squares, which satisfy the decomposition \(SS_T = SS_R + SS_{Res}\):
\[SS_T = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{++} \right)^2, \quad df = n - 1;\] \[SS_R = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(\bar{y}_{i+} - \bar{y}_{++} \right)^2, \quad df = K - 1, \quad MS_R = SS_R / (K - 1);\]
\[SS_{Res} = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{i+} \right)^2, \quad df = n - K, \quad MS_{Res} = SS_{Res} / (n - K).\]
In order to test the hypotheses \[H_0: \mu_1 = \ldots = \mu_K \quad \text{vs.} \quad H_1: \text{Not } H_0,\]
we use the \(F\)-test \[F = MS_{R}/MS_{Res}, \quad \text{which follows an } F(K - 1, n - K) \text{ distribution under } H_0.\]
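We can compute this test by hand and then compare with aov() below; a minimal sketch for the iris data, where \(K = 3\) and \(m = 50\):
K <- 3; m <- 50; n <- K * m
ybar_group <- tapply(Y, F, mean)               # group means, ybar_{i+}
SSR <- m * sum((ybar_group - mean(Y))^2)       # between-group sum of squares
SSRes <- sum((Y - ybar_group[F])^2)            # within-group sum of squares
Fstat <- (SSR / (K - 1)) / (SSRes / (n - K))
pf(Fstat, K - 1, n - K, lower.tail = FALSE)    # p-value; matches the aov() table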
summary(aov(Y ~ F))
##              Df Sum Sq Mean Sq F value Pr(>F)
## F             2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following parameterization is also widely used: \[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}, \quad i = 1, \ldots, K, \quad j = 1, \ldots, m,\] where \(\mu\) is an overall mean and \(\tau_i\) is the effect of group \(i\). A constraint such as \(\sum_{i=1}^K \tau_i = 0\) (or \(\tau_1 = 0\), R’s default) is needed to make the parameters identifiable.
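In R, the sum-to-zero constraint corresponds to contr.sum coding; a quick sketch:
fit_sum <- lm(Y ~ F, contrasts = list(F = contr.sum))
coef(fit_sum)   # intercept estimates mu (the average of the three group means);
                # the two contrast coefficients estimate tau_1 and tau_2,
                # with tau_3 = -(tau_1 + tau_2)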