rho = 1
p = 5
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
# solve(XtX)
eigen(XtX)$values
## [1] 5.000000e+00 1.776357e-15 0.000000e+00 0.000000e+00 0.000000e+00
Check the eigenvalues of \(X'X\):
rho = 0.9999
p = 5
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
solve(XtX)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8000.04 -1999.96 -1999.96 -1999.96 -1999.96
## [2,] -1999.96 8000.04 -1999.96 -1999.96 -1999.96
## [3,] -1999.96 -1999.96 8000.04 -1999.96 -1999.96
## [4,] -1999.96 -1999.96 -1999.96 8000.04 -1999.96
## [5,] -1999.96 -1999.96 -1999.96 -1999.96 8000.04
lambda = eigen(XtX)$values
lambda
## [1] 4.9996 0.0001 0.0001 0.0001 0.0001
lambda[1]/lambda
## [1] 1 49996 49996 49996 49996
Consequencies of large conditional number: - Inflated variance of \(\hat{\beta}_j\)
Next we check \(\text{Var}(\hat{\beta}_1)\) as a function of \(\rho\) (the correlation between two regressors).
p = 5; ret = rep(NA, 100)
rho_list = seq(from = 1, to = 9, length = 100)
for (i in 1:100){
rho = 1 - 10^(-rho_list[i]);
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
ret[i] = abs(solve(XtX)[1])
}
plot(rho_list, ret)
We check the expected squared deviation of \(\hat{\beta}_j\) from \(\beta_j\) by using \(\sigma^2 \sum_{j = 1}^p 1/\lambda_j\).
p = 5; ret = rep(NA, 100)
rho_list = seq(from = 1, to = 9, length = 100)
for (i in 1:100){
rho = 1 - 10^(-rho_list[i]);
XtX = matrix(NA, p, p)
XtX[row(XtX) != col(XtX)] <- rho
XtX[row(XtX) == col(XtX)] <- 1
ret[i] = sum(1/eigen(XtX)$values)
}
plot(rho_list, ret)
The following example is adapted from Tryfos (1998, pp. 130–1). According to Tryfos:
Before construction begins, a bridge project goes through a number of stages of production, one of which is the design stage. This phase is composed of various activities, each of which contributes directly to the overall design time. ….In short, predicting the design time is helpful for budgeting and internal as well as external scheduling purposes.
Information from 45 bridge projects was compiled for use in this study. The data are partially listed in Table 6.3 below and can be found on the book web site in the file bridge.txt. The response and predictor variables are as follows:
bridge <- read.table("bridge.txt", header=TRUE)
attach(bridge)
# Scatter plot matrix of the response variable and each of the regressors
pairs(Time~DArea+CCost+Dwgs+Length+Spans,data=bridge,cex.labels=1.4)
After a log-transformation, there is no clear evidence of nonconstant variance; see the top row of the following plot.
# Scatter plot matrix of the log-transformed data
pairs(log(Time)~log(DArea)+log(CCost)+log(Dwgs)+log(Length)+log(Spans),data=bridge)
m1 <- lm(log(Time)~log(DArea)+log(CCost)+log(Dwgs)+log(Length)+log(Spans))
StanRes1 <- rstandard(m1)
par(mfrow=c(2,3))
plot(log(DArea),StanRes1, ylab="Standardized Residuals")
plot(log(CCost),StanRes1, ylab="Standardized Residuals")
plot(log(Dwgs),StanRes1, ylab="Standardized Residuals")
plot(log(Length),StanRes1, ylab="Standardized Residuals")
plot(log(Spans),StanRes1, ylab="Standardized Residuals")
plot(m1$fitted.values,StanRes1, ylab="Standardized Residuals",xlab="Fitted values")
We can see that each residual plot shows a random pattern, thus out built model appears to be a valid model for this dataset.
# log(Time) vs. fitted values
par(mfrow=c(1,1))
plot(m1$fitted.values,log(Time),xlab="Fitted Values")
abline(lsfit(m1$fitted.values,log(Time)))
The staightline fit in the plot above provides further evidence that our model is a valid model for this dataset.
# Diagnostic plots from R
par(mfrow=c(2,2))
plot(m1)
abline(v=2*6/45,lty=2)
Diagnostic plots provided by R confirm our previous conclusion that the model is valid. The dashed vertical line in the bottom right-hand plot is the usual cut-off for declaring a point of high leverage (i.e.,\(2 \times (p + 1)/n\) = 12/45 = 0.267). Thus, there is a bad leverage point (i.e., case 22) that requires further investigation.
# Regression output
summary(m1)
##
## Call:
## lm(formula = log(Time) ~ log(DArea) + log(CCost) + log(Dwgs) +
## log(Length) + log(Spans))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68394 -0.17167 -0.02604 0.23157 0.67307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.28590 0.61926 3.691 0.000681 ***
## log(DArea) -0.04564 0.12675 -0.360 0.720705
## log(CCost) 0.19609 0.14445 1.358 0.182426
## log(Dwgs) 0.85879 0.22362 3.840 0.000440 ***
## log(Length) -0.03844 0.15487 -0.248 0.805296
## log(Spans) 0.23119 0.14068 1.643 0.108349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3139 on 39 degrees of freedom
## Multiple R-squared: 0.7762, Adjusted R-squared: 0.7475
## F-statistic: 27.05 on 5 and 39 DF, p-value: 1.043e-11
In general,
The output from R below gives the correlations between the predictors. Notice how most of the correlations are greater than 0.8.
# Correlation matrix of regressors
logDArea <- log(DArea)
logCCost <- log(CCost)
logDwgs <- log(Dwgs)
logLength <- log(Length)
logSpans <- log(Spans)
X <- cbind(logDArea,logCCost,logDwgs,logLength,logSpans)
c <- cor(X)
round(c,3)
## logDArea logCCost logDwgs logLength logSpans
## logDArea 1.000 0.909 0.801 0.884 0.782
## logCCost 0.909 1.000 0.831 0.890 0.775
## logDwgs 0.801 0.831 1.000 0.752 0.630
## logLength 0.884 0.890 0.752 1.000 0.858
## logSpans 0.782 0.775 0.630 0.858 1.000
The variance inflation factors for the bridge construction example are as follows:
library(car)
## Warning: package 'car' was built under R version 3.4.1
vif(m1)
## log(DArea) log(CCost) log(Dwgs) log(Length) log(Spans)
## 7.164619 8.483522 3.408900 8.014174 3.878397
A number of these variance inflation factors exceed 5, the cut-off often used, and so the associated regression coefficients are poorly estimated due to multicollinearity.
detach(bridge) # detach the dataset