Big picture

Binary categorical variables

Suppose that a mechanical engineer wishes to relate the effective life of a cutting tool (\(y\)) used on a lathe to the lathe speed in revolutions per minute (\(x_1\)) and the type of cutting tool used.

\[x_2 = \left\{ \begin{array}{ll} \text{0} & {\text{ if the observation is from tool type A}}\\ \text{1} & {\text{ if the observation is from tool type B}} \end{array} \right.\]

Then we assume the usual MLR \[\begin{equation} \label{eq:MLR0} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon. \quad\quad(1) \end{equation}\]

Q1. How to interpret \((\beta_0, \beta_1, \beta_2)\)?
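
As a quick illustration, here is a minimal sketch that simulates data from Model (1) and fits it; the speed range and all coefficient values are made up purely for illustration.

set.seed(1)
n  <- 40
x1 <- runif(n, 500, 1000)         # lathe speed in rpm (hypothetical range)
x2 <- rep(c(0, 1), each = n / 2)  # 0 = tool type A, 1 = tool type B
y  <- 40 - 0.02 * x1 + 15 * x2 + rnorm(n, sd = 3)  # made-up coefficients
fit0 <- lm(y ~ x1 + x2)
coef(fit0)  # two parallel lines in x1; beta2 is the A-to-B vertical shift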

Interaction

What if we do not want to enforce a common slope? We can use an interaction term to achieve this goal. Specifically, we assume a more flexible MLR model \[\begin{equation} \label{eq:MLR1} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon, \quad\quad(2) \end{equation}\]

where \(x_3 = x_1 x_2\) is the interaction between \(x_1\) and \(x_2\).

Q2. How to interpret all \(\beta\)’s in Model (2)?

  • \(\beta_0\): the intercept for tool Type A (baseline).

  • \(\beta_1\): the slope for tool Type A (baseline).

  • \(\beta_2\): the difference in the intercept resulting from changing from tool Type A to tool Type B.

  • \(\beta_3\): the difference in the slope resulting from changing from tool Type A to tool Type B.
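
Continuing the sketch above, adding the interaction lets each tool type have its own slope; the coefficients are again hypothetical.

y2 <- 40 - 0.02 * x1 + 15 * x2 - 0.01 * x1 * x2 + rnorm(n, sd = 3)
fit1 <- lm(y2 ~ x1 * x2)  # x1 * x2 expands to x1 + x2 + x1:x2
coef(fit1)                # the x1:x2 estimate is the slope difference beta3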

Categorical variables with more than two levels

Suppose a factor \(F\) has three levels A, B, and C, with A as the baseline. We define two dummy variables: \[x_1 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{B} \\ 0 & \text{otherwise} \end{array} \right., \quad\quad x_2 = \left\{ \begin{array}{ll} 1 & \text{if } F = \text{C} \\ 0 & \text{otherwise} \end{array} \right.\] Then we use the MLR model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i.\]

As a result, we have: \[\begin{array}{lll} \text{E}\left[Y~|~F = A \right] = & \text{E}\left[Y~|~ x_1 = 0 \text{ and } x_2 = 0 \right] = & \beta_0 \\ \text{E}\left[Y~|~F = B \right] = & \text{E}\left[Y~|~x_1 = 1 \text{ and } x_2 = 0 \right] = & \beta_0 + \beta_1 \\ \text{E}\left[Y~|~F = C \right] = & \text{E}\left[Y~|~x_1 = 0 \text{ and } x_2 = 1 \right] = & \beta_0 + \beta_2 \end{array} \]

Q3. How to interpret \((\beta_0, \beta_1, \beta_2)\)?

Q4. Based on the interpretation above, what will be the estimates of \((\beta_0, \beta_1, \beta_2)\)?

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Y <- iris$Sepal.Length  # response: sepal length
F <- iris$Species       # three-level factor (note: this masks R's shorthand F for FALSE)

table(F)
## F
##     setosa versicolor  virginica 
##         50         50         50

We first create dummy variables then fit a linear regression model.

x1 <- ifelse(F == "versicolor", 1, 0)  # dummy variable for versicolor
x2 <- ifelse(F == "virginica", 1, 0)   # dummy variable for virginica
fit <- lm(Y ~ x1 + x2)
summary(fit)
## 
## Call:
## lm(formula = Y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## x1            0.9300     0.1030   9.033 8.77e-16 ***
## x2            1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

The coefficient estimates match our expectation in Q4.

Ybarsetosa <- mean(Y[x1 == 0 & x2 == 0])
Ybarversicolor <- mean(Y[x1 == 1 & x2 == 0])
Ybarvirginica <- mean(Y[x1 == 0 & x2 == 1])

## beta0hat, beta1hat, beta2hat 
c(Ybarsetosa, Ybarversicolor - Ybarsetosa, Ybarvirginica - Ybarsetosa)
## [1] 5.006 0.930 1.582

In fact, lm() automatically creates the dummy variables for a categorical (factor) variable.

fit <- lm(Y ~ F)
summary(fit)
## 
## Call:
## lm(formula = Y ~ F)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
## Fversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Fvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
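
To see the coding R chose, inspect the design matrix; picking one flower of each species (rows 1, 51, and 101) should reproduce the dummy variables we built by hand.

model.matrix(~ F)[c(1, 51, 101), ]
## columns: (Intercept), Fversicolor, Fvirginica
## setosa -> (1, 0, 0); versicolor -> (1, 1, 0); virginica -> (1, 0, 1)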

Sometimes a categorical variable is stored numerically; use factor(x) to tell R to treat \(x\) as a factor.
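
For instance, if the species had been stored as the numeric codes 1, 2, 3 (a hypothetical coding), lm() would otherwise treat the variable as a single continuous predictor.

g <- as.numeric(F)  # hypothetical numeric coding: 1 = setosa, 2 = versicolor, 3 = virginica
lm(Y ~ g)           # wrong: fits a single common slope in g
lm(Y ~ factor(g))   # correct: recreates the two dummy variables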

Next, let’s do some inference.

boxplot(Y ~ F) # boxplot for visualization

Q5. Test the hypothesis that there are no differences among species mean lengths.

Q6. Test the hypothesis that there are no differences between the mean lengths of setosa and versicolor.
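
One way to carry out both tests in R: Q5 is the overall \(F\)-test from the ANOVA table (the same \(F = 119.3\) reported by summary(fit)), and Q6 is the \(t\)-test of \(H_0: \beta_1 = 0\), i.e., the Fversicolor row of the coefficient table above.

fit <- lm(Y ~ F)
anova(fit)  # Q5: overall F-test of equal species means
summary(fit)$coefficients["Fversicolor", ]  # Q6: t-test of beta1 = 0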

One-way ANOVA

Consider the one-way ANOVA model \[y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, \ldots, K, \quad j = 1, \ldots, m,\] where \(\mu_i\) is the mean of the \(i\)-th group (for the iris data, \(K = 3\) and \(m = 50\)).

Q7. What is the relationship between \(\mu_i\) and the \(\beta\)’s?

Notations:

\[ n = K \cdot m, \quad \bar{y}_{i+} = \frac{1}{m} \sum_{j = 1}^{m} y_{ij}, \quad \bar{y}_{++} = \frac{1}{K} \sum_{i = 1}^K \bar{y}_{i+} = \frac{1}{n} \sum_{i = 1}^K \sum_{j = 1}^{m} y_{ij}.\]

Then it follows that

\[SS_T = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{++} \right)^2, \quad df = n - 1;\] \[SS_R = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(\bar{y}_{i+} - \bar{y}_{++} \right)^2, \quad df = K - 1, \quad MS_R = SS_R / (K - 1);\]

\[SS_{Res} = \sum_{i=1}^{K} \sum_{j=1}^{m} \left(y_{ij} - \bar{y}_{i+} \right)^2, \quad df = n - K, \quad MS_{Res} = SS_{Res} / (n - K).\]

In order to test the hypotheses \[H_0: \mu_1 = \ldots = \mu_K, \quad \text{vs.} \quad H_1: \text{Not } H_0,\]

we use the \(F\)-statistic \[F = MS_{R}/MS_{Res},\] which follows the \(F(K - 1, n - K)\) distribution under \(H_0\); we reject \(H_0\) for large values of \(F\).

summary(aov(Y ~ F))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## F             2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
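
As a sanity check, we can compute the entries of this table directly from the formulas above; the results should match 63.21, 38.96, and 119.3.

K <- nlevels(F); m <- 50; n <- K * m
ybar_i <- ave(Y, F)                  # group mean, repeated for each observation
SSR    <- sum((ybar_i - mean(Y))^2)  # between-group (regression) sum of squares
SSRes  <- sum((Y - ybar_i)^2)        # within-group (residual) sum of squares
Fstat  <- (SSR / (K - 1)) / (SSRes / (n - K))
c(SSR, SSRes, Fstat)
pf(Fstat, K - 1, n - K, lower.tail = FALSE)  # p-value of the overall F-test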

The following parameterization is widely used: \[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}\]
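
This model has \(K + 1\) mean parameters \((\mu, \tau_1, \ldots, \tau_K)\) but only \(K\) group means, so a side constraint is needed for identifiability. Two standard choices are \[\sum_{i=1}^{K} \tau_i = 0 \quad \left(\text{so } \mu = \frac{1}{K}\sum_{i=1}^{K} \mu_i \text{ and } \tau_i = \mu_i - \mu\right), \qquad \text{or} \qquad \tau_1 = 0 \quad (\text{so } \mu = \mu_1, \text{ R's default treatment coding}),\] where \(\mu_i = \mu + \tau_i\) is the mean of group \(i\).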

Summary