Choosing the “best” model

1. Choosing the “best” model (Session 08)

2. Learning Objectives
At the end of this session, you will be able to
• use a simple descriptive approach to select the most appropriate subset of explanatory variables
• apply methods of variable selection (based on statistical tests) in a meaningful way to get the “best” model
• appreciate the effect on t-probabilities when x’s are added to or dropped from a model
• understand the dangers of using automatic selection procedures

3. Example of choosing “best” set of x’s
Consider data (fictitious) from a retrospective study of patients who survived less than 4 months after being diagnosed with acute leukaemia.
Objective: To identify factors affecting survival time.
Variables were:
y = survival time (days) after diagnosis
x1 = no. of chemotherapy sessions
x2 = total volume of blood transfused
x3 = no. of days of hospital care
x4 = age of patient (years)

5. Summary statistics for all regressions
How many possible regression models exist?
Example with x1 and x3 to show summaries:

---------+---------------------------------------------
  Source |        SS    df        MS      F    Prob>F
---------+---------------------------------------------
   Model |  1488.691     2   744.346   6.07    0.0188
Residual |  1227.072    10   122.707
---------+---------------------------------------------
   Total |  2715.763    12   226.314
---------+---------------------------------------------

No. of parameters fitted (p) = 3
R2p = 1488.69 / 2715.76 = 0.5482
Adjusted R2p = 1 - 122.71 / 226.31 = 0.4578
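The summary quantities on this slide follow directly from the ANOVA table. The short Python sketch below recomputes them; the function name `regression_summaries` is ours, not from the slides. It also counts the candidate models, assuming that every subset of the four x's (including the constant-only model) counts as one model.

```python
def regression_summaries(model_ss, resid_ss, model_df, resid_df):
    """Recompute R2, adjusted R2 and F from ANOVA sums of squares."""
    total_ss = model_ss + resid_ss
    total_df = model_df + resid_df
    resid_ms = resid_ss / resid_df      # residual mean square
    total_ms = total_ss / total_df      # total mean square
    r2 = model_ss / total_ss
    adj_r2 = 1 - resid_ms / total_ms
    f_stat = (model_ss / model_df) / resid_ms
    return r2, adj_r2, f_stat

# Values copied from the x1, x3 ANOVA table on this slide.
r2, adj_r2, f_stat = regression_summaries(1488.691, 1227.072, 2, 10)
print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2))

# Assumption: each of the 4 x's is simply in or out of the model.
n_models = 2 ** 4
print(n_models, "subsets, of which", n_models - 1, "contain at least one x")
```

Note that R2 uses sums of squares while adjusted R2 uses mean squares, which is why adjusted R2 can fall when a weak x is added.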

6. Descriptive approach (all regressions)

7. A descriptive approach… continued Plot R2 versus no. of parameters (p) in model Which model would you select on the basis of these results?

8. A descriptive approach… continued Alternatively, plot residual mean square. Small residual mean square is good! Which model would you select on the basis of the residual mean square?
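The "all regressions" summaries behind these plots can be produced by fitting every subset of x's and recording R2, adjusted R2 and the residual mean square. The sketch below is one way to do that in plain Python. The data are made up purely for illustration (the leukaemia data are not reproduced in this transcript), and the helper names `solve`, `residual_ss` and `all_subsets` are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def residual_ss(columns, y):
    """Residual SS from least squares of y on a constant plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def all_subsets(data, y):
    """Return (subset, p, R2, adjusted R2, residual MS) for every subset of x's."""
    n = len(y)
    ybar = sum(y) / n
    total_ss = sum((yi - ybar) ** 2 for yi in y)
    rows = []
    for k in range(len(data) + 1):
        for subset in combinations(sorted(data), k):
            rss = residual_ss([data[v] for v in subset], y)
            p = k + 1                       # parameters fitted, incl. constant
            rms = rss / (n - p)
            rows.append((subset, p, 1 - rss / total_ss,
                         1 - rms / (total_ss / (n - 1)), rms))
    return rows

# Made-up illustrative data, NOT the data from the slides.
data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [2, 1, 4, 3, 6, 5, 8, 7]}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]
rows = all_subsets(data, y)
for subset, p, r2, adj_r2, rms in rows:
    print(subset, p, round(r2, 4), round(adj_r2, 4), round(rms, 4))
```

Plotting R2 or the residual mean square from `rows` against p gives the kind of picture the slides describe. The normal-equations solver is adequate for a small illustration; a statistics package would normally be used in practice.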

9. An inferential approach…
• Use a sequential procedure to select variables that contribute most, and significantly, to the regression model.
• Three popular methods exist:
  • Forward selection
  • Backward elimination
  • Stepwise regression

10. Forward selection …
Select the “best” single variable - see slide 6
Ask, “Is it contributing significantly?” Answer: Yes (see below)

-----------------------------------------
     y |    Coef.  Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.73816    .1546    -4.77  0.001
const. |   117.57   5.2622    22.34  0.000
-----------------------------------------

Now consider 2-variable models with x4.
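The t and P>|t| columns in the x4 table can be reproduced from the coefficient and its standard error. The sketch below does this with the standard library only, integrating the t density numerically. It assumes n = 13 observations, which we infer from the total df of 12 on slide 5 and which the slides do not state directly. The function names are ours.

```python
import math

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = 2 * total * h / 3     # probability between -|t| and |t|
    return 1 - central

# x4 row of the single-variable model on this slide.
coef, std_err = -0.73816, 0.1546
n, p = 13, 2                        # n is an assumption; p = slope + constant
t_stat = coef / std_err
p_value = two_sided_p(t_stat, n - p)
print(round(t_stat, 2), round(p_value, 3))
```

The residual degrees of freedom, n - p, shrink as more x's are fitted, which is one reason the same t value gives a larger p-value in a bigger model.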

11. Two-variable models with x4

-----------------------------------------
     y |    Coef.  Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.61395   .04864   -12.62  0.000
    x1 |   1.4400   .13842    10.40  0.000
const. |   103.10   2.1240    48.54  0.000
-----------------------------------------
    x4 |  -.45694   .69595    -0.66  0.526
    x2 |   .31090   .74861     0.42  0.687
const. |   94.160   56.627     1.66  0.127
-----------------------------------------
    x4 |  -.72460   .07233   -10.02  0.000
    x3 |  -1.1999   .18902    -6.35  0.000
const. |   131.28   3.2748    40.09  0.000
-----------------------------------------

12. Three-variable models with x4, x1

-----------------------------------------
     y |    Coef.  Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.23654   .17329    -1.37  0.205
    x1 |   1.4519   .11700    12.41  0.000
    x2 |   .41611   .18561     2.24  0.052
const. |   71.648   14.142     5.07  0.001
-----------------------------------------
    x4 |  -.64280   .04454   -14.43  0.000
    x1 |   1.0519   .22368     4.70  0.001
    x3 |  -.41004   .19923    -2.06  0.070
const. |   111.68   4.5625    24.48  0.000
-----------------------------------------

Model with x1, x2 and x4 would be selected! - despite x4 now being non-significant!
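The forward selection walk-through on slides 10 to 12 can be traced mechanically. The sketch below does not refit any model: it looks up the t and p values printed on the slides (plus the x3 row of the four-variable fit on slide 13). It starts from x4 because the other one-variable results are not reproduced in this transcript. The entry threshold of 0.10 is our assumption, since the slides do not state one, and `forward_select` is a hypothetical name.

```python
# (t, p) of each candidate x when added, copied from the slide tables and
# keyed by the set of x's in the fitted model.
FITS = {
    frozenset({"x4", "x1"}): {"x1": (10.40, 0.000)},
    frozenset({"x4", "x2"}): {"x2": (0.42, 0.687)},
    frozenset({"x4", "x3"}): {"x3": (-6.35, 0.000)},
    frozenset({"x4", "x1", "x2"}): {"x2": (2.24, 0.052)},
    frozenset({"x4", "x1", "x3"}): {"x3": (-2.06, 0.070)},
    frozenset({"x4", "x1", "x2", "x3"}): {"x3": (0.14, 0.896)},
}

def forward_select(start, candidates, fits, alpha_enter):
    """Add the candidate with the largest |t| until none passes alpha_enter."""
    selected = list(start)
    remaining = [v for v in candidates if v not in selected]
    while remaining:
        # Statistics of each remaining x when added to the current model.
        trial = {v: fits[frozenset(selected + [v])][v] for v in remaining}
        best = max(trial, key=lambda v: abs(trial[v][0]))
        if trial[best][1] > alpha_enter:
            break                   # best candidate is not significant: stop
        selected.append(best)
        remaining.remove(best)
    return selected

all_x = ["x1", "x2", "x3", "x4"]
print(forward_select(["x4"], all_x, FITS, alpha_enter=0.10))
print(forward_select(["x4"], all_x, FITS, alpha_enter=0.05))
```

With the assumed threshold of 0.10 the trace ends at x4, x1, x2, matching the slide. With 0.05, x2 (p = 0.052) would not enter and the procedure would stop at x4, x1, so the outcome is sensitive to the chosen threshold. Note also that forward selection never re-examines x4 once it is in the model.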

13. Backward elimination gives x1, x2

---------------------------------------
   y |    Coef.  Std.Err.       t  P>|t|
-----+---------------------------------
  x1 |   1.5511   .74477     2.08  0.071
  x2 |   .51017    .7238     0.70  0.501
  x3 |   .10191    .7547     0.14  0.896
  x4 |  -.14406    .7091    -0.20  0.844
---------------------------------------
  x1 |   1.4519   .11700    12.41  0.000
  x2 |   .41611   .18561     2.24  0.052
  x4 |  -.23654   .17329    -1.37  0.205
---------------------------------------
  x1 |   1.4683   .12130    12.10  0.000
  x2 |   .66225   .04585    14.44  0.000
---------------------------------------
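The three tables on this slide are successive steps of backward elimination: start with all four x's and repeatedly drop the one with the largest p-value until every remaining x is significant. The sketch below traces that logic using only the p-values printed above. The removal threshold of 0.10 is our assumption, and `backward_eliminate` is a hypothetical name.

```python
# p-values copied from the three tables on this slide, keyed by the set of
# x's in the fitted model.
P_VALUES = {
    frozenset({"x1", "x2", "x3", "x4"}):
        {"x1": 0.071, "x2": 0.501, "x3": 0.896, "x4": 0.844},
    frozenset({"x1", "x2", "x4"}): {"x1": 0.000, "x2": 0.052, "x4": 0.205},
    frozenset({"x1", "x2"}): {"x1": 0.000, "x2": 0.000},
}

def backward_eliminate(start, p_values, alpha_remove):
    """Drop the x with the largest p-value while it exceeds alpha_remove."""
    current = set(start)
    while current:
        # Only models printed on the slide can be looked up; others raise KeyError.
        stats = p_values[frozenset(current)]
        worst = max(sorted(current), key=lambda v: stats[v])
        if stats[worst] <= alpha_remove:
            break                   # everything left is significant: stop
        current.remove(worst)
    return sorted(current)

print(backward_eliminate(["x1", "x2", "x3", "x4"], P_VALUES, alpha_remove=0.10))
```

The trace drops x3 (p = 0.896), then x4 (p = 0.205), and stops at x1, x2. Notice how the p-values for x1 and x2 change sharply once the other x's are removed.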

14. Stepwise selection procedure…
This is similar to forward selection, but at each stage of the process, all x’s already in the model are re-assessed to check whether those that entered at an earlier stage still remain “important”.
Note: Software packages allow automatic use of any one of these three procedures, with pre-specified p-values for the selection and deletion of variables. They are usually available only with quantitative x’s.

15. Discussion… in small groups
• Look back at the results. What do you observe with the forward and backward procedures? Do they give the same results?
• Did the selection using the forward procedure seem sensible, given that the p-value for x4 is 0.205?
• Can you work out what model would result from a stepwise selection procedure?
• Is it a good idea to use such automatic selection procedures available in software packages? If not, why not?

16. Discussion continued… Suppose a medical researcher told you that a model without x2 was not meaningful. How would you proceed with your model selection? What other latent (lurking) variables, measurable or non-measurable, might affect y? What further steps would you undertake before accepting the final model?