rho = 1
p = 5
# Equicorrelation matrix: 1 on the diagonal, rho everywhere off the diagonal
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
# solve(XtX) would fail here: with rho = 1 the matrix is exactly singular
eigen(XtX)$values
## [1] 5.000000e+00 1.776357e-15 0.000000e+00 0.000000e+00 0.000000e+00
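With \(\rho = 1\) every entry of \(X'X\) equals 1, so the matrix has rank 1 (consistent with the single nonzero eigenvalue above) and cannot be inverted. A quick check, not part of the original code:

qr(XtX)$rank  # equals 1: XtX is rank-deficient, so solve(XtX) would error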

Now set \(\rho = 0.9999\), so that \(X'X\) is nearly (but not exactly) singular, and check its inverse and the eigenvalues of \(X'X\):

rho = 0.9999
p = 5
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
solve(XtX)
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,]  8000.04 -1999.96 -1999.96 -1999.96 -1999.96
## [2,] -1999.96  8000.04 -1999.96 -1999.96 -1999.96
## [3,] -1999.96 -1999.96  8000.04 -1999.96 -1999.96
## [4,] -1999.96 -1999.96 -1999.96  8000.04 -1999.96
## [5,] -1999.96 -1999.96 -1999.96 -1999.96  8000.04
lambda = eigen(XtX)$values
lambda 
## [1] 4.9996 0.0001 0.0001 0.0001 0.0001
lambda[1]/lambda 
## [1]     1 49996 49996 49996 49996
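The ratio \(\lambda_{\max}/\lambda_{\min}\) computed above is the condition number of \(X'X\). As a cross-check (not in the original code), base R can compute it directly:

kappa(XtX, exact = TRUE)  # ratio of largest to smallest singular value, about 5e4 here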

Consequences of a large condition number:

  • Inflated variance of \(\hat{\beta}_j\)

Next we check \(\text{Var}(\hat{\beta}_1)\) as a function of \(\rho\), the common correlation among the regressors. Up to the factor \(\sigma^2\), this variance is the (1,1) entry of \((X'X)^{-1}\), which the loop below extracts.

p = 5; ret = rep(NA, 100)
k_list = seq(from = 1, to = 9, length.out = 100)  # exponents k: rho = 1 - 10^(-k)
for (i in 1:100){
  rho = 1 - 10^(-k_list[i])
  XtX = matrix(NA, p, p)
  XtX[row(XtX) != col(XtX)] <- rho
  XtX[row(XtX) == col(XtX)] <- 1
  ret[i] = abs(solve(XtX)[1])  # (1,1) entry of the inverse: Var(beta1_hat)/sigma^2
}

plot(k_list, ret, xlab = "k  (rho = 1 - 10^-k)", ylab = "Var(beta1_hat)/sigma^2")

We also check the expected squared distance of \(\hat{\beta}\) from \(\beta\), using the identity \(E\|\hat{\beta} - \beta\|^2 = \sigma^2 \sum_{j = 1}^p 1/\lambda_j\), where \(\lambda_1, \dots, \lambda_p\) are the eigenvalues of \(X'X\).

p = 5; ret = rep(NA, 100)
k_list = seq(from = 1, to = 9, length.out = 100)  # exponents k: rho = 1 - 10^(-k)
for (i in 1:100){
  rho = 1 - 10^(-k_list[i])
  XtX = matrix(NA, p, p)
  XtX[row(XtX) != col(XtX)] <- rho
  XtX[row(XtX) == col(XtX)] <- 1
  ret[i] = sum(1/eigen(XtX)$values)  # E||beta_hat - beta||^2 / sigma^2
}

plot(k_list, ret, xlab = "k  (rho = 1 - 10^-k)", ylab = "sum of 1/lambda_j")

Example: Bridge construction

The following example is adapted from Tryfos (1998, pp. 130–1). According to Tryfos:

Before construction begins, a bridge project goes through a number of stages of production, one of which is the design stage. This phase is composed of various activities, each of which contributes directly to the overall design time. … In short, predicting the design time is helpful for budgeting and internal as well as external scheduling purposes.

Information from 45 bridge projects was compiled for use in this study. The data are partially listed in Table 6.3 of the book and can be found on the book web site in the file bridge.txt. The response variable is Time (design time); the predictors are DArea (deck area), CCost (construction cost), Dwgs (number of structural drawings), Length (bridge length), and Spans (number of spans).

bridge <- read.table("bridge.txt", header=TRUE)
attach(bridge)
# Scatter plot matrix of the response variable and each of the regressors 
pairs(Time~DArea+CCost+Dwgs+Length+Spans,data=bridge,cex.labels=1.4)

After a log-transformation, there is no clear evidence of nonconstant variance; see the top row of the following plot.

# Scatter plot matrix of the log-transformed data 
pairs(log(Time)~log(DArea)+log(CCost)+log(Dwgs)+log(Length)+log(Spans),data=bridge)

Linear regression and standardized residual plots

m1 <- lm(log(Time)~log(DArea)+log(CCost)+log(Dwgs)+log(Length)+log(Spans))
StanRes1 <- rstandard(m1)
par(mfrow=c(2,3))
plot(log(DArea),StanRes1, ylab="Standardized Residuals")
plot(log(CCost),StanRes1, ylab="Standardized Residuals")
plot(log(Dwgs),StanRes1, ylab="Standardized Residuals")
plot(log(Length),StanRes1, ylab="Standardized Residuals")
plot(log(Spans),StanRes1, ylab="Standardized Residuals")
plot(m1$fitted.values,StanRes1, ylab="Standardized Residuals",xlab="Fitted values")

We can see that each residual plot shows a random pattern; thus, our fitted model appears to be a valid model for this dataset.

# log(Time) vs. fitted values 
par(mfrow=c(1,1))
plot(m1$fitted.values,log(Time),xlab="Fitted Values")
abline(lsfit(m1$fitted.values,log(Time)))

The straight-line fit in the plot above provides further evidence that our model is a valid model for this dataset.

# Diagnostic plots from R 
par(mfrow=c(2,2))
plot(m1)
abline(v=2*6/45,lty=2)  # leverage cut-off 2(p+1)/n, drawn on the residuals-vs-leverage panel

Diagnostic plots provided by R confirm our previous conclusion that the model is valid. The dashed vertical line in the bottom right-hand plot is the usual cut-off for declaring a point of high leverage, i.e., \(2(p + 1)/n = 2 \times 6/45 \approx 0.267\). By this criterion there is a bad leverage point (case 22) that requires further investigation.
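As a quick supplementary check (not part of the original code), we can list the cases whose leverage exceeds this cut-off:

which(hatvalues(m1) > 2*6/45)  # flags the high-leverage cases, including case 22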

# Regression output 
summary(m1)
## 
## Call:
## lm(formula = log(Time) ~ log(DArea) + log(CCost) + log(Dwgs) + 
##     log(Length) + log(Spans))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68394 -0.17167 -0.02604  0.23157  0.67307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.28590    0.61926   3.691 0.000681 ***
## log(DArea)  -0.04564    0.12675  -0.360 0.720705    
## log(CCost)   0.19609    0.14445   1.358 0.182426    
## log(Dwgs)    0.85879    0.22362   3.840 0.000440 ***
## log(Length) -0.03844    0.15487  -0.248 0.805296    
## log(Spans)   0.23119    0.14068   1.643 0.108349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3139 on 39 degrees of freedom
## Multiple R-squared:  0.7762, Adjusted R-squared:  0.7475 
## F-statistic: 27.05 on 5 and 39 DF,  p-value: 1.043e-11
  • Notice that while the overall F-test for the model is highly statistically significant (i.e., it has a very small p-value), only one of the estimated regression coefficients is statistically significant (log(Dwgs), with a p-value < 0.001).
  • Even more troubling, the estimated regression coefficients for log(DArea) and log(Length) have the wrong sign (i.e., negative): longer bridges, or bridges with larger deck area, should take more time to design, not less.

In general,

  • When two or more highly correlated predictor variables are included in a regression model, they effectively carry very similar information about the response variable. It is then difficult for least squares to distinguish their separate effects on the response variable.
  • In this situation the overall F-test can be highly statistically significant while few, if any, of the individual regression coefficients are.
  • Another consequence of highly correlated predictor variables is that some of the coefficients in the regression model come out with the opposite sign to the one expected. The simulation below illustrates both phenomena.
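A small simulation, not from the original notes, makes these points concrete: two nearly identical predictors, each with a true positive effect, are supplied to least squares.

set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # cor(x1, x2) is essentially 1
y  <- 1 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))

In a typical run the overall F-statistic is highly significant, yet neither slope passes its individual t-test, and one of the two slope estimates may even be negative.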

The output from R below gives the correlations between the predictors. Notice how most of the correlations are greater than 0.8.

# Correlation matrix of regressors 
logDArea <- log(DArea)
logCCost <- log(CCost)
logDwgs <- log(Dwgs)
logLength <- log(Length)
logSpans <- log(Spans)
X <- cbind(logDArea,logCCost,logDwgs,logLength,logSpans)
corX <- cor(X)  # avoid the name "c", which masks the base function c()
round(corX, 3)
##           logDArea logCCost logDwgs logLength logSpans
## logDArea     1.000    0.909   0.801     0.884    0.782
## logCCost     0.909    1.000   0.831     0.890    0.775
## logDwgs      0.801    0.831   1.000     0.752    0.630
## logLength    0.884    0.890   0.752     1.000    0.858
## logSpans     0.782    0.775   0.630     0.858    1.000
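Connecting back to the first part of these notes, the eigenvalues of this correlation matrix tell the same story (a supplementary check, not in the original code): a wide spread between the largest and smallest eigenvalues signals near-collinearity among the predictors.

round(eigen(corX)$values, 3)  # a smallest eigenvalue near 0 indicates near-collinearity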

The variance inflation factors for the bridge construction example are as follows:

library(car)
vif(m1)
##  log(DArea)  log(CCost)   log(Dwgs) log(Length)  log(Spans) 
##    7.164619    8.483522    3.408900    8.014174    3.878397

A number of these variance inflation factors exceed 5, a commonly used cut-off, and so the associated regression coefficients are poorly estimated due to multicollinearity.
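Recall that \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing the \(j\)th predictor on the remaining predictors. As a sanity check (not part of the original code), we can reproduce the first VIF by hand:

r2 <- summary(lm(logDArea ~ logCCost + logDwgs + logLength + logSpans))$r.squared
1/(1 - r2)  # should match the 7.164619 reported by vif(m1) above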

detach(bridge) # detach the dataset 

What’s next:

  • We shall learn model-building techniques to select the best subset of variables. For example, if two variables carry similar information, we include only one of them in the model.
  • We shall use regularization in model fitting, which lets us keep all the variables in the model while avoiding poorly estimated parameters; a sketch follows below.
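As a preview of the second bullet, here is a minimal ridge-regression sketch. It assumes the glmnet package, which is not used elsewhere in these notes; MASS::lm.ridge would work similarly.

library(glmnet)
Xmat <- model.matrix(m1)[, -1]           # the five log-transformed predictors
y <- log(bridge$Time)
ridge <- cv.glmnet(Xmat, y, alpha = 0)   # alpha = 0 selects the ridge penalty
coef(ridge, s = "lambda.min")            # all five predictors kept, coefficients shrunk toward 0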