Load data and visualization

library(UsingR)
data(father.son)
x <- father.son$fheight
y <- father.son$sheight
n <- length(x)
plot(x, y, xlab = "father's height (inches)", ylab = "son's height (inches)")

Statistical modeling: simple linear regression

xbar <- mean(x); ybar <- mean(y)
Sxy <- sum((x - xbar) * (y - ybar))
Sxx <- sum((x - xbar)^2)
beta1hat <- Sxy / Sxx
beta0hat <- ybar - beta1hat * xbar
yhat <- beta0hat + beta1hat * x
res <- y - yhat
SS_res <- sum(res^2)       # residual sum of squares
MS_res <- SS_res / (n - 2) # residual mean square
sigma2hat <- MS_res        # estimate of the error variance
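As a sanity check, the closed-form estimates above can be compared against R's built-in lm(). A minimal sketch on simulated data (the simulated heights below are illustrative stand-ins for the father.son columns):

```r
# Sketch: verify the closed-form SLR estimates against lm() on simulated data
set.seed(1)
x_sim <- rnorm(100, mean = 68, sd = 3)           # stand-in for fathers' heights
y_sim <- 34 + 0.5 * x_sim + rnorm(100, sd = 2)   # stand-in for sons' heights

# closed-form least-squares estimates, as in the chunk above
b1 <- sum((x_sim - mean(x_sim)) * (y_sim - mean(y_sim))) / sum((x_sim - mean(x_sim))^2)
b0 <- mean(y_sim) - b1 * mean(x_sim)

fit <- lm(y_sim ~ x_sim)
all.equal(unname(coef(fit)), c(b0, b1))            # same estimates as lm()
all.equal(summary(fit)$sigma^2,
          sum(resid(fit)^2) / (100 - 2))           # summary()$sigma^2 equals MS_res
```

The same comparison works on the father.son data itself: `lm(y ~ x)` reproduces beta0hat and beta1hat, and `summary(lm(y ~ x))$sigma^2` reproduces sigma2hat.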

Model checking

Let us check the histogram and QQ plot.

par(mfrow = c(1, 2))
hist(res, breaks = 50)
qqnorm(res, main = "Normal QQ Plot of Residuals") # use '?qqnorm' to see the help document
qqline(res, col = "blue") # add a straight line 
abline(a = 0, b = 1, col = 'red') # add a reference line 

Q1: Do we expect the straight line to have intercept 0 and slope 1 when the normality assumption holds?

To remove the scale of the residuals, we may use the standardized residuals \(s_i = e_i/\hat{\sigma}\).

s = res/sqrt(sigma2hat)
qqnorm(s, main = "Normal QQ Plot of Standardized Residuals")
qqline(s, col = "blue")
abline(a = 0, b = 1, col = 'red')

Q2: Does the normality assumption seem acceptable here?

We can use a powerful tool, simulation, to investigate the typical behavior of QQ plots for various distributions (heavy/light tails and skewed distributions) as follows.

set.seed(201710) # set seed for pseudo-random number generator 

my_plot = function(sample, my_title){
  qqnorm(sample, main = my_title); qqline(sample, lwd = 2, col = "blue")
}
par(mfrow = c(2, 2))
my_plot(rnorm(n), "QQ plot of a normal sample")
my_plot(rt(n, df = 2), "QQ plot of a t2 sample")
my_plot(runif(n), "QQ plot of a uniform sample")
my_plot(rchisq(n, df = 2), "QQ plot of a chi-squared sample")

We can also conduct a formal hypothesis test of the normality assumption; the Kolmogorov-Smirnov test is one example. (A caveat: ks.test assumes a fully specified null distribution, while we standardized the residuals using the estimated sigma2hat, so the p-value below is only approximate.)

ks.test(s, pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  s
## D = 0.02208, p-value = 0.6694
## alternative hypothesis: two-sided
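Another normality test in base R is the Shapiro-Wilk test, which, unlike the plain KS test, is calibrated for the case where the mean and variance are estimated from the sample. A short illustration on simulated data (applying it to s works the same way; shapiro.test accepts samples of size 3 to 5000, and here n = 1078):

```r
# Shapiro-Wilk normality test (base R), illustrated on simulated samples
set.seed(2)
z <- rnorm(200)                            # a genuinely normal sample
shapiro.test(z)$p.value                    # typically large: no evidence against normality
shapiro.test(rchisq(200, df = 2))$p.value  # heavily skewed sample: very small p-value
```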
par(mfrow = c(1, 2))
plot(yhat, s, main = "Standardized residuals vs. yhat"); abline(h = 0, lwd = 2, col = "blue")
plot(x, s, main = "Standardized residuals vs. x"); abline(h = 0, lwd = 2, col = "blue")

Q3. Why do the two plots look identical, differing only in the horizontal scale?

This is generally the case in simple linear regression, so we only need to plot one of them. In multiple linear regression, the plot against the fitted values is more convenient. (Why?)
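One reason: in multiple regression a single plot against the fitted values summarizes all predictors at once, whereas plotting against each predictor requires one plot per predictor. A minimal sketch with two simulated predictors (all names and coefficients here are illustrative); note that rstandard() additionally divides each residual by its leverage factor, a refinement of the simple s used above:

```r
# Sketch: standardized residuals vs. fitted values in multiple linear regression
set.seed(3)
n_sim <- 200
x1 <- rnorm(n_sim)
x2 <- runif(n_sim)
y_sim <- 1 + 2 * x1 - 3 * x2 + rnorm(n_sim)   # true model with two predictors
fit <- lm(y_sim ~ x1 + x2)
plot(fitted(fit), rstandard(fit),
     main = "Standardized residuals vs. fitted values")
abline(h = 0, lwd = 2, col = "blue")
```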

There are no apparent patterns in the residual plots, so we are satisfied with the fit of the model.

Q4. How can we use simulation to see what the residual plots look like when the true model is a simple linear regression model, or some more complicated model?

y <- beta0hat + beta1hat * x + rnorm(n, sd = sqrt(sigma2hat)) # generate observations from the fitted SLR

# rerun SLR on (x, y), where y is the new simulated response
xbar <- mean(x); ybar <- mean(y)
Sxy <- sum((x - xbar) * (y - ybar))
Sxx <- sum((x - xbar)^2)
beta1hat <- Sxy / Sxx
beta0hat <- ybar - beta1hat * xbar
yhat <- beta0hat + beta1hat * x
res <- y - yhat
SS_res <- sum(res^2)       # residual sum of squares
MS_res <- SS_res / (n - 2) # residual mean square
sigma2hat <- MS_res
s <- res / sqrt(sigma2hat)
plot(yhat, s, main = "Residuals vs. yhat when the model is correctly specified")
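To see the other side of Q4, we can generate data from a more complicated model, fit a straight line anyway, and inspect the residual plot: the structure the linear fit cannot capture shows up as a clear pattern. A sketch with a quadratic true model (the coefficients and noise level are illustrative choices):

```r
# Sketch: residual plot when the true model is quadratic but we fit SLR
set.seed(4)
x_sim <- runif(200, -2, 2)
y_quad <- 1 + x_sim + 2 * x_sim^2 + rnorm(200, sd = 0.5)  # true model is quadratic
fit <- lm(y_quad ~ x_sim)                                  # misspecified linear fit
plot(fitted(fit), rstandard(fit),
     main = "Residuals vs. yhat when the model is misspecified")
abline(h = 0, lwd = 2, col = "blue")  # residuals trace out a U-shaped pattern
```

The curvature in this plot, absent from the correctly specified simulation above, is exactly the kind of "apparent pattern" that residual plots are designed to reveal.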

Summary:

  • We have learned various residual plots for model checking.
  • We use theory and simulations to provide a useful reference for what is satisfactory and what is not.
  • Next, we will zoom in to the level of individual observations and discuss outliers and leverage points.