
STATS 330: Lecture 8



1. STATS 330: Lecture 8. Collinearity

2. Collinearity. Aims of today's lecture: to explain the idea of collinearity and its connection with the precision of estimated regression coefficients, and to discuss added variable plots, a graphical method for deciding whether a variable should be added to a regression.

3. Variance of regression coefficients • We saw in Lecture 6 how the standard errors of the regression coefficients depend on the error variance σ²: the bigger σ², the bigger the standard errors. • We also suggested that the standard errors depend on the arrangement of the x's. • In today's lecture, we explore this idea a bit further.

4. Example • Suppose we have a regression relationship of the form Y = 1 + 2x − w + ε between a response variable Y and two explanatory variables x and w. • Consider two data sets, A and B, each following the model above.

5. Data sets A & B: x, w data

6. [Figure: fitted regression planes for the two data sets]

7. Conclusion: • The greater the correlation between x and w, the more variable the fitted plane. • In fact, for the coefficient b of x, Var(b) = σ² / [(1 − r²) Σ_i (x_i − x̄)²], where r is the correlation between x and w.
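A small simulation illustrates this effect (a sketch only; data sets A and B themselves are not reproduced here, and the variable names below are ours):

> set.seed(330)
> n <- 50
> x <- rnorm(n)
> w.A <- rnorm(n)                          # data set A: w nearly uncorrelated with x
> w.B <- 0.95 * x + 0.05 * rnorm(n)        # data set B: w highly correlated with x
> y.A <- 1 + 2 * x - w.A + rnorm(n)
> y.B <- 1 + 2 * x - w.B + rnorm(n)
> summary(lm(y.A ~ x + w.A))$coef["x", "Std. Error"]   # small standard error for b
> summary(lm(y.B ~ x + w.B))$coef["x", "Std. Error"]   # much larger standard error for b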

8. Generalization. If we have k explanatory variables, then the variance of the jth estimated coefficient is Var(b_j) = σ² / [(1 − R_j²) Σ_i (x_ij − x̄_j)²], where R_j² is the R² if we regress variable j on the other explanatory variables.
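This formula can be verified numerically (a sketch, reusing the simulated data set B above; the unknown σ² is replaced by its estimate, so we compare with the variance reported by lm):

> fit.B <- lm(y.B ~ x + w.B)
> s2 <- summary(fit.B)$sigma^2                    # estimated error variance
> Rj2 <- summary(lm(x ~ w.B))$r.squared           # R² from regressing x on the other explanatory variable
> s2 / ((1 - Rj2) * sum((x - mean(x))^2))         # variance of the coefficient of x, by the formula
> vcov(fit.B)["x", "x"]                           # the same value, as computed by lm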

9. Best case • If x_j is orthogonal to (uncorrelated with) the other explanatory variables, then R_j² is zero and the variance is the smallest possible, i.e. Var(b_j) = σ² / Σ_i (x_ij − x̄_j)².

10. Variance inflation factor. The factor 1 / (1 − R_j²) represents the increase in variance caused by correlation between the explanatory variables and is called the variance inflation factor (VIF).
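As a quick illustration of this definition (a sketch; it anticipates the petrol-data example used on the next slides and assumes vapour.df is loaded):

> tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)   # regress t.vp on the other explanatory variables
> 1 / (1 - summary(tvp.reg)$r.squared)                             # VIF for t.vp; should agree with the value computed on the next two slides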

11. Calculating the VIF: theory. To calculate the VIF for the jth explanatory variable, use the relationship VIF_j = var(x_j) / var(e_j), where e_j are the residuals from regressing the jth explanatory variable on the other explanatory variables.

12. Calculating the VIF: example. For the petrol data, calculate the VIF for t.vp (tank vapour pressure):

> attach(vapour.df)
> tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)
> var(t.vp) / var(residuals(tvp.reg))
[1] 66.13817

Correlation increases the variance by a factor of about 66.

13. Calculating the VIF: quick method. A useful mathematical relationship: if we calculate the inverse of the correlation matrix of the explanatory variables, then the VIFs are its diagonal elements.

> X <- vapour.df[, -5]          # delete 5th column (hc, the response)
> VIF <- diag(solve(cor(X)))
> VIF
   t.temp    p.temp      t.vp      p.vp
11.927292  5.615662 66.138172 60.938695
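This is easy to wrap in a one-line helper (a sketch; the name vif.quick is ours, not part of the lecture code). For a fitted lm object, the vif() function in the car package computes the same quantities.

> vif.quick <- function(X) diag(solve(cor(X)))   # X = data frame of explanatory variables only
> vif.quick(vapour.df[, -5])                     # same values as above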

14. Pairs plot

15. Collinearity • If one or more variables in a regression have big VIFs, the regression is said to be collinear. • Collinearity is caused by one or more variables being almost linear combinations of the others. • It is sometimes indicated by high correlations between the explanatory variables. • It results in imprecise estimation of the regression coefficients. • Standard errors are high, so t-statistics are small and variables are often non-significant (the data are insufficient to detect an effect).

16. Non-significance • If a variable has a non-significant t-statistic, then either: • the variable is not related to the response, or • the variable is related to the response, but it is not required in the regression because it is strongly related to a third variable that is in the regression, so we don't need both. • First case: small t-value, small VIF, small correlation with the response. • Second case: small t-value, big VIF, big correlation with the response.

17. Remedy • The usual remedy is to drop one or more variables from the model. • This “breaks” the linear relationship between the variables. • It leads to the problem of “subset selection”: which subset should we choose? • See Lectures 14 and 15.

18. Example: Cement data • Measurements on batches of cement. • Response variable: Heat (heat emitted). • Explanatory variables: • X1: amount of tricalcium aluminate (%) • X2: amount of tricalcium silicate (%) • X3: amount of tetracalcium aluminoferrite (%) • X4: amount of dicalcium silicate (%)

19. Example: Cement data

 Heat  X1  X2  X3  X4
 78.5   7  26   6  60
 74.3   1  29  15  52
104.3  11  56   8  20
 87.6  11  31   8  47
 95.9   7  52   6  33
109.2  11  55   9  22
102.7   3  71  17   6
 72.5   1  31  22  44
 93.1   2  54  18  22
115.9  21  47   4  26
 83.8   1  40  23  34
113.3  11  66   9  12
109.4  10  68   8  12
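For readers who want to reproduce the output on the following slides, the data frame can be entered directly from the table above (a sketch; the lecture itself presumably reads cement.df from a file):

> cement.df <- data.frame(
+   Heat = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4),
+   X1   = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
+   X2   = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
+   X3   = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
+   X4   = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12))
> summary(lm(Heat ~ X1 + X2 + X3 + X4, data = cement.df))   # output shown on the next slide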

20. Example: Cement data

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  62.4054    70.0710   0.891   0.3991
X1            1.5511     0.7448   2.083   0.0708 .
X2            0.5102     0.7238   0.705   0.5009
X3            0.1019     0.7547   0.135   0.8959
X4           -0.1441     0.7091  -0.203   0.8441

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

Note the large p-values for every coefficient, despite the big R-squared.

> round(cor(cement.df), 2)
      Heat    X1    X2    X3    X4
Heat  1.00  0.73  0.82 -0.53 -0.82
X1    0.73  1.00  0.23 -0.82 -0.25
X2    0.82  0.23  1.00 -0.14 -0.97
X3   -0.53 -0.82 -0.14  1.00  0.03
X4   -0.82 -0.25 -0.97  0.03  1.00

Note the big correlations, particularly between X2 and X4, and between X1 and X3.

21. Cement data: VIFs

> diag(solve(cor(cement.df[, -1])))    # omit Heat, the response
       X1        X2        X3        X4
 38.49621 254.42317  46.86839 282.51286

The VIFs for X2 and X4 are very large. The reason: the four percentages add up to nearly 100 for every batch.

> cement.df$X1 + cement.df$X2 + cement.df$X3 + cement.df$X4
 [1] 99 97 95 97 98 97 97 98 96 98 98 98 98

22. Drop X4

> diag(solve(cor(cement.df[, -c(1, 5)])))
      X1       X2       X3
3.251068 1.063575 3.142125

The VIFs are now small.

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
X1           1.69589    0.20458   8.290 1.66e-05 ***
X2           0.65691    0.04423  14.851 1.23e-07 ***
X3           0.25002    0.18471   1.354    0.209

Residual standard error: 2.312 on 9 degrees of freedom
Multiple R-Squared: 0.9823, Adjusted R-squared: 0.9764
F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08

X1 and X2 are now significant.
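The reduced model behind this output can be refitted in one line (a sketch, assuming cement.df has been entered as above):

> cement3.lm <- lm(Heat ~ X1 + X2 + X3, data = cement.df)   # X4 dropped
> summary(cement3.lm)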

23. Added variable plots (AVPs). To see if a variable, say x, is needed in a regression: • Step 1: Calculate the residuals from regressing the response on all the explanatory variables except x. • Step 2: Calculate the residuals from regressing x on the other explanatory variables. • Step 3: Plot the first set of residuals versus the second set. NB: These are also called partial regression plots in some books.

24. Rationale • The first set of residuals represents the variation in y not explained by the other explanatory variables. • The second set of residuals represents the part of x not explained by the other explanatory variables. • If there is a relationship between the two sets, there is a relationship between x and the response that is not accounted for by the other explanatory variables. • Thus, if we see a relationship in the plot, x is needed in the regression.

25. Example: the petrol data. Let's do an AVP for tank vapour pressure, t.vp.

> rest.reg <- lm(hc ~ t.temp + p.temp + p.vp, data = vapour.df)
> y.res <- residuals(rest.reg)
> tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)
> tvp.res <- residuals(tvp.reg)
> plot(tvp.res, y.res, xlab = "Tank vapour pressure",
+      ylab = "Hydrocarbon emission",
+      main = "AVP for Tank vapour pressure")

26. [AVP for tank vapour pressure] Hint of a relationship, so the variable is required?

27. Short cut in R. There is a function added.variable.plots in R that draws the plots automatically. It is one of the functions in the R330 package, which must be installed before the function can be used. Note the useful trick in the first line below: hc ~ . regresses hc on all the other variables in the data frame.

> vapour.lm <- lm(hc ~ ., data = vapour.df)
> par(mfrow = c(2, 2))        # 2 x 2 array of plots
> added.variable.plots(vapour.lm)
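If the R330 package is not available, the avPlots() function in the car package produces similar added variable plots (an alternative, not part of the lecture):

> library(car)
> avPlots(vapour.lm)    # one added variable plot for each explanatory variable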

28. [Added variable plots for the petrol data] Not significant in the regression (Lecture 7).

29. Some curious facts about AVPs • Assuming a constant term in both regressions, a least squares line fitted through the AVP goes through the origin. • The slope of this line is the fitted regression coefficient of the variable in the original regression. • The residuals from this line are the same as the residuals from the original regression.
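These facts are easy to check numerically, using the petrol-data AVP from slide 25 (a sketch, assuming vapour.df and the residual vectors y.res and tvp.res computed there):

> avp.lm <- lm(y.res ~ tvp.res)
> coef(avp.lm)                       # intercept essentially zero; note the slope
> full.lm <- lm(hc ~ ., data = vapour.df)
> coef(full.lm)["t.vp"]              # same value as the AVP slope
> all.equal(unname(residuals(avp.lm)), unname(residuals(full.lm)))   # should be TRUE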
