
Stepwise Regression



Presentation Transcript


  1. Stepwise Regression
     • Y may depend on many independent variables.
     • How do we find the subset of X’s that best predicts Y?
     • There are several criteria for model selection (e.g., adjusted R², AIC, BIC, the likelihood ratio test) and many algorithms for including or excluding X’s in the model: forward selection, backward elimination, stepwise regression, etc.
     • With the availability of statistical packages, stepwise regression is now the most commonly used.
     [Diagram: candidate predictors X1 to X6 and the response Y]

  2. A Data Set for Multiple Regression
     Measurements on men involved in a physical fitness course at N. C. State University. Fitness is typically measured by oxygen intake rate (oxy), which is difficult (or at least cumbersome while one is exercising) to measure. The study goal is to develop an equation that predicts oxy from exercise tests rather than from oxygen consumption measurements. The dataset has 31 observations.
     The variables in the data set are:
     • age (in years)
     • weight (in kg)
     • oxy (oxygen intake rate, ml per kg body weight per minute)
     • runtime (time to run 1.5 miles, in minutes)
     • rstpulse (heart rate while resting)
     • runpulse (heart rate while running, at the same time when oxygen rate was measured)
     • maxpulse (maximum heart rate recorded while running)

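  3. R Functions
     # myD is the data frame holding the seven variables of the fitness data
     library(Hmisc)
     cor(myD, method="pearson")                     # or method="spearman"
     pairs(~age+weight+runtime+rstpulse+runpulse+maxpulse+oxy, data=myD)
     rmat<-rcorr(as.matrix(myD), type="pearson")    # or type="spearman"
     rmat
     print(rmat$r, digits=5)                        # correlations only, 5 digits
     fit<-lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
     anova(fit)
     summary(fit)
     # Backward elimination, starting from the full model
     full.model<-lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
     best.model<-step(full.model, direction="backward")
     # Forward selection, starting from the intercept-only model
     min.model<-lm(oxy~1, data=myD)
     best.model<-step(min.model, direction="forward", scope=~age+weight+runtime+rstpulse+runpulse+maxpulse)
     # Confidence and prediction intervals for new observations
     new<-data.frame(...)                           # placeholder: specify predictor values here
     predict(fit, new, interval="confidence")
     predict(fit, new, interval="prediction")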
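The selection calls listed above can be tried end to end even without the fitness data. The following is a minimal sketch that assumes the built-in mtcars data as a stand-in; the choice of columns and the names d, back.model and fwd.model are illustrative and not part of the original slides.

```r
# Sketch only: mtcars stands in for the fitness data, which is not included here.
d <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec", "drat")]

# Full model: the response regressed on every candidate predictor.
full.model <- lm(mpg ~ wt + hp + disp + qsec + drat, data = d)

# Backward elimination: drop terms for as long as AIC keeps decreasing.
back.model <- step(full.model, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms.
min.model <- lm(mpg ~ 1, data = d)
fwd.model <- step(min.model, direction = "forward",
                  scope = ~ wt + hp + disp + qsec + drat, trace = 0)

# Compare the selected formulas and their AIC on step()'s scale.
print(formula(back.model))
print(formula(fwd.model))
print(c(full     = extractAIC(full.model)[2],
        backward = extractAIC(back.model)[2],
        forward  = extractAIC(fwd.model)[2]))
```

Setting trace = 0 suppresses the step-by-step tables; leave it at the default to see output in the same layout as the traces in this presentation.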

  4. Correlation matrix
     (first line of each pair: correlation coefficient; second line: p-value)

                     age    weight       oxy   runtime  rstpulse  runpulse  maxpulse
     age         1.00000  -0.23354  -0.30459   0.18875  -0.16410  -0.33787  -0.43292
                            0.2061    0.0957    0.3092    0.3777    0.0630    0.0150
     weight     -0.23354   1.00000  -0.16275   0.14351   0.04397   0.18152   0.24938
                  0.2061              0.3817    0.4412    0.8143    0.3284    0.1761
     oxy        -0.30459  -0.16275   1.00000  -0.86219  -0.39936  -0.39797  -0.23674
                  0.0957    0.3817              <.0001    0.0260    0.0266    0.1997
     runtime     0.18875   0.14351  -0.86219   1.00000   0.45038   0.31365   0.22610
                  0.3092    0.4412    <.0001              0.0110    0.0858    0.2213
     rstpulse   -0.16410   0.04397  -0.39936   0.45038   1.00000   0.35246   0.30512
                  0.3777    0.8143    0.0260    0.0110              0.0518    0.0951
     runpulse   -0.33787   0.18152  -0.39797   0.31365   0.35246   1.00000   0.92975
                  0.0630    0.3284    0.0266    0.0858    0.0518              <.0001
     maxpulse   -0.43292   0.24938  -0.23674   0.22610   0.30512   0.92975   1.00000
                  0.0150    0.1761    0.1997    0.2213    0.0951    <.0001
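The p-values in this matrix can be recovered from the correlations themselves. The sketch below assumes the usual two-sided t-test for a Pearson correlation with n - 2 degrees of freedom (the slides do not state which test was used) and applies it to the age/oxy pair; the variable names are illustrative.

```r
n <- 31          # number of observations in the fitness data
r <- -0.30459    # correlation between age and oxy, from the matrix

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom.
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value.
p.value <- 2 * pt(-abs(t.stat), df = n - 2)
print(round(p.value, 4))
```

The result should land close to the 0.0957 reported for that pair.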

  5. Scatterplot matrix

  6. rcorr in Hmisc
     Correlations (r):
                   oxy      age   weight  runtime rstpulse runpulse maxpulse
     oxy          1.00    -0.30    -0.16    -0.86    -0.40    -0.40    -0.24
     age         -0.30     1.00    -0.23     0.19    -0.16    -0.34    -0.43
     weight      -0.16    -0.23     1.00     0.14     0.04     0.18     0.25
     runtime     -0.86     0.19     0.14     1.00     0.45     0.31     0.23
     rstpulse    -0.40    -0.16     0.04     0.45     1.00     0.35     0.31
     runpulse    -0.40    -0.34     0.18     0.31     0.35     1.00     0.93
     maxpulse    -0.24    -0.43     0.25     0.23     0.31     0.93     1.00

     P
                   oxy      age   weight  runtime rstpulse runpulse maxpulse
     oxy                 0.0957   0.3817   0.0000   0.0260   0.0266   0.1997
     age        0.0957            0.2061   0.3092   0.3777   0.0630   0.0150
     weight     0.3817   0.2061            0.4412   0.8143   0.3284   0.1761
     runtime    0.0000   0.3092   0.4412            0.0110   0.0858   0.2213
     rstpulse   0.0260   0.3777   0.8143   0.0110            0.0518   0.0951
     runpulse   0.0266   0.0630   0.3284   0.0858   0.0518            0.0000
     maxpulse   0.1997   0.0150   0.1761   0.2213   0.0951   0.0000

     > print(rmat)
     (prints the same correlation matrix as above)

  7. Backward elimination
     Start:  AIC=58.16
     oxy ~ age + weight + runtime + rstpulse + runpulse + maxpulse

                 Df Sum of Sq    RSS    AIC
     - rstpulse   1     0.571 129.41 56.299
     <none>                   128.84 58.162
     - weight     1     9.911 138.75 58.459
     - maxpulse   1    26.491 155.33 61.958
     - age        1    27.746 156.58 62.208
     - runpulse   1    51.058 179.90 66.510
     - runtime    1   250.822 379.66 89.664

     Step:  AIC=56.3
     oxy ~ age + weight + runtime + runpulse + maxpulse

                 Df Sum of Sq    RSS    AIC
     <none>                   129.41 56.299
     - weight     1      9.52 138.93 56.499
     - maxpulse   1     26.83 156.23 60.139
     - age        1     27.37 156.78 60.247
     - runpulse   1     52.60 182.00 64.871
     - runtime    1    320.36 449.77 92.917

     Note: the <none> row is the current model; in the first table that is the model without eliminating rstpulse.
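The AIC column in this trace can be reproduced from the RSS column. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients including the intercept. The helper name step.aic below is hypothetical and used only for illustration.

```r
# AIC on the scale reported by step() for a linear model:
#   n * log(RSS / n) + 2 * p,  p = number of estimated coefficients
step.aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 31                                    # observations in the fitness data
aic.full    <- step.aic(128.84, n, p = 7)  # intercept + 6 predictors
aic.reduced <- step.aic(129.41, n, p = 6)  # after dropping rstpulse

print(round(c(full = aic.full, reduced = aic.reduced), 2))
# full 58.16, reduced 56.30
```

Dropping rstpulse raises RSS only slightly (128.84 to 129.41) while saving one parameter, which is why the AIC falls and the term is eliminated.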

  8. Forward addition
     Start:  AIC=104.7
     oxy ~ 1

                 Df Sum of Sq    RSS     AIC
     + runtime    1    632.90 218.48  64.534
     + rstpulse   1    135.78 715.60 101.313
     + runpulse   1    134.84 716.54 101.354
     + age        1     78.99 772.39 103.681
     <none>                   851.38 104.699
     + maxpulse   1     47.72 803.67 104.911
     + weight     1     22.55 828.83 105.867

     Step:  AIC=64.53
     oxy ~ runtime

                 Df Sum of Sq    RSS    AIC
     + age        1   17.7656 200.72 63.905
     + runpulse   1   15.3621 203.12 64.274
     <none>                   218.48 64.534
     + maxpulse   1    1.5674 216.91 66.311
     + weight     1    1.3236 217.16 66.346
     + rstpulse   1    0.1301 218.35 66.516

     Step:  AIC=63.9
     oxy ~ runtime + age

                 Df Sum of Sq    RSS    AIC
     + runpulse   1    39.885 160.83 59.037
     + maxpulse   1    14.885 185.83 63.516
     <none>                   200.72 63.905
     + weight     1     5.605 195.11 65.027
     + rstpulse   1     2.641 198.07 65.494

     Step:  AIC=59.04
     oxy ~ runtime + age + runpulse

                 Df Sum of Sq    RSS    AIC
     + maxpulse   1   21.9007 138.93 56.499
     <none>                   160.83 59.037
     + weight     1    4.5958 156.24 60.139
     + rstpulse   1    0.4901 160.34 60.943

     Step:  AIC=56.5
     oxy ~ runtime + age + runpulse + maxpulse

     Note: in each table, the IVs listed above <none> are those whose addition will improve fit (lower AIC); the IVs listed below <none> are those whose addition will make it worse.
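Each table in this trace corresponds to one round of candidate evaluation. A single forward step can be sketched by hand with add1(), again assuming the built-in mtcars data as a stand-in for the fitness data; the names d, current, candidates and best are illustrative.

```r
# Sketch only: mtcars stands in for the fitness data.
d <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]

# Current model: intercept only.
current <- lm(mpg ~ 1, data = d)

# Evaluate every single-term addition; the "<none>" row is the current model.
candidates <- add1(current, scope = ~ wt + hp + disp + qsec)
candidates <- candidates[order(candidates$AIC), ]
print(candidates)

# Add the best candidate only if it beats the current model on AIC.
best <- rownames(candidates)[1]
if (best != "<none>") {
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
print(formula(current))
```

Repeating the add1() call on the updated model, until "<none>" sorts to the top, is what step(direction="forward") automates.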

  9. Package leaps
     # Assumes oxy is the first column of myD and the six predictors are columns 2 to 7
     x<-as.matrix(myD)
     DV<-x[,1]
     IV<-x[,2:7]
     library(leaps)
     leaps(IV, DV, names=names(myD)[2:7], method="Cp")
     leaps(IV, DV, names=names(myD)[2:7], method="adjr2")
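The idea behind an all-subsets search can be illustrated without the leaps package. The following base-R sketch is a brute-force enumeration, not the branch-and-bound algorithm that leaps uses; it assumes the built-in mtcars data as a stand-in, and the names d, subsets and results are illustrative.

```r
# Base-R sketch of an exhaustive search; mtcars stands in for the fitness data.
d <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
response <- "mpg"
predictors <- setdiff(names(d), response)

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(lapply(seq_along(predictors), function(k)
  combn(predictors, k, simplify = FALSE)), recursive = FALSE)

# Fit each subset and record its adjusted R-squared.
results <- data.frame(
  model = vapply(subsets, paste, character(1), collapse = " + "),
  size  = lengths(subsets),
  adjr2 = vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = d)
    summary(fit)$adj.r.squared
  }, numeric(1))
)

# Best models first.
results <- results[order(-results$adjr2), ]
print(head(results, 3))
```

With k predictors this fits 2^k - 1 models, so it is only practical for small k; that cost is the reason packages such as leaps prune the search.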

  10. Criteria used in model selection
     • Adjusted R² (Ra²)
     • Cp
     • SBC (BIC)
     • AIC
     • Significance level
     Reference: Burnham, K. P. and D. R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. Springer. (Best book on model selection)
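For reference, the first four criteria can be written in their common textbook forms for a least-squares fit. The notation is an assumption and does not come from the slides: n is the number of observations, p the number of estimated coefficients including the intercept, RSS_p the residual sum of squares of the candidate model, TSS the total sum of squares, and \hat{\sigma}^2 the residual mean square of the full model. AIC and BIC are given up to an additive constant, which is the scale R's step() reports.

```latex
\begin{align*}
R_a^2 &= 1 - \frac{\mathrm{RSS}_p / (n - p)}{\mathrm{TSS} / (n - 1)} \\
C_p &= \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p \\
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}_p}{n}\right) + 2p \\
\mathrm{BIC} &= n \ln\!\left(\frac{\mathrm{RSS}_p}{n}\right) + p \ln n
\end{align*}
```

Larger values of adjusted R² are better, while smaller values of Cp, AIC and BIC are better. BIC penalizes each extra coefficient by ln n instead of 2, so for n = 31 (ln 31 ≈ 3.43) it favors smaller models than AIC does.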
