
Building useful models: Some new developments and easily avoidable errors


Presentation Transcript


  1. Building useful models: Some new developments and easily avoidable errors Michael Babyak, PhD

  2. What is a model? Y = f(x1, x2, x3, …, xn); Y = a + b1x1 + b2x2 + … + bnxn; Y = e^(a + b1x1 + b2x2 + … + bnxn)

  3. “All models are wrong, some are useful” -- George Box • A useful model is • Not very biased • Interpretable • Replicable (predicts in a new sample)

  4. Some Premises • “Statistics” is a cumulative, evolving field • Newer is not necessarily better, but should be entertained in the context of the scientific question at hand • Data analytic practice resides along a continuum, from exploratory to confirmatory. Both are important, but the difference has to be recognized. • There’s no substitute for thinking about the problem

  5. Statistics is a cumulative, evolving field: How do we know this stuff? • Theory • Simulation

  6. Concept of Simulation: Y = b*X + error. [Diagram: repeated samples from the model yield estimates bs1, bs2, bs3, bs4, …, bsk-1, bsk]

  7. Concept of Simulation: Y = b*X + error. [Diagram: the collection of estimates bs1 … bsk is then evaluated against the true b]

  8. Simulation Example: Y = .4*X + error. [Diagram: repeated samples yield estimates bs1, bs2, bs3, bs4, …, bsk-1, bsk]

  9. Simulation Example: Y = .4*X + error. [Diagram: the estimates are evaluated against the true value of .4]

  10. True Model: Y = .4*x1 + e
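
To make the simulation idea on slides 6-10 concrete, here is a minimal sketch, assuming the true model Y = .4*x1 + e with a standard-normal predictor and error (the sample size and number of replications are arbitrary choices for illustration): draw many samples, refit the regression in each, and see how the estimated slopes scatter around the true value.

```python
import numpy as np

rng = np.random.default_rng(42)
true_b, n_obs, n_reps = 0.4, 100, 5000

# Draw repeated samples from the true model and refit each time.
estimates = []
for _ in range(n_reps):
    x = rng.normal(size=n_obs)
    y = true_b * x + rng.normal(size=n_obs)        # Y = .4*x1 + e
    b_hat = np.polyfit(x, y, deg=1)[0]             # OLS slope for this sample
    estimates.append(b_hat)

estimates = np.array(estimates)
print(f"mean of bs1...bsk: {estimates.mean():.3f}")   # should sit close to .4
print(f"SD of the estimates: {estimates.std():.3f}")  # sampling variability
```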

  11. Ingredients of a Useful Model: correct probability model; based on theory; good measures / no loss of information; comprehensive; parsimonious; tested fairly; flexible.

  12. Correct Model • Gaussian: General Linear Model • Multiple linear regression • Binary (or ordinal): Generalized Linear Model • Logistic Regression • Proportional Odds/Ordinal Logistic • Time to event: • Cox Regression or parametric survival models

  13. Generalized Linear Model: Normal outcome → General Linear Model / Linear Regression (ANOVA/t-test, ANCOVA, regression with transformed DV); Binary/Binomial outcome → Logistic Regression (chi-square); Count, heavy skew, lots of zeros → Poisson, ZIP, negative binomial, gamma. Can be applied to clustered (e.g., repeated measures) data.
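
As a rough illustration of the outcome-to-model mapping above, a sketch using statsmodels with simulated data (the setup and coefficient values are made up for the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)

# Normal outcome -> general linear model (identity link)
y_normal = 2 + 0.5 * x + rng.normal(size=n)
print(sm.GLM(y_normal, X, family=sm.families.Gaussian()).fit().params)

# Binary outcome -> logistic regression (logit link)
p = 1 / (1 + np.exp(-0.5 * x))
y_binary = rng.binomial(1, p)
print(sm.GLM(y_binary, X, family=sm.families.Binomial()).fit().params)

# Count outcome -> Poisson regression (log link)
y_count = rng.poisson(np.exp(0.2 + 0.3 * x))
print(sm.GLM(y_count, X, family=sm.families.Poisson()).fit().params)
```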

  14. Factor Analytic Family: Structural Equation Models; Partial Least Squares; Latent Variable Models (Confirmatory Factor Analysis); Multiple Regression; Common Factor Analysis; Principal Components.

  15. Use Theory • Theory and expert information are critical in helping sift out artifact • Numbers can look very systematic when they are in fact random • http://www.tufts.edu/~gdallal/multtest.htm

  16. Measure well • Adequate range • Representative values • Watch for ceiling/floor effects

  17. Using all the information • Preserving cases in data sets with missing data • Conventional approaches: • Use only complete cases • Fill in with mean or median • Use a missing data indicator in the model

  18. Missing Data • Imputation or related approaches are almost ALWAYS better than deleting incomplete cases • Multiple Imputation • Full Information Maximum Likelihood

  19. Multiple Imputation

  20. Modern Missing Data Techniques • Preserve more information from original sample • Incorporate uncertainty about missingness into final estimates • Produce better estimates of population (true) values
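
A minimal sketch of multiple imputation by chained equations, here using statsmodels' MICE implementation on simulated data with made-up variable names; the missingness mechanism and number of imputations are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 0.4 * x1 + 0.3 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Punch holes in x2 so roughly 30% of cases are incomplete.
df.loc[rng.random(n) < 0.3, "x2"] = np.nan

# Chained-equations imputation: each draw fills in x2, the analysis model
# is fit to each completed data set, and the results are pooled, so the
# uncertainty about the missing values carries into the final estimates.
imp = mice.MICEData(df)
fit = mice.MICE("y ~ x1 + x2", sm.OLS, imp).fit(n_burnin=10, n_imputations=20)
print(fit.summary())
```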

  21. Don’t throw away information from variables • Use all the information about the variables of interest • Don’t create “clinical cutpoints” before modeling • Model with ALL the data first, then use prediction to make decisions about cutpoints

  22. Dichotomizing for Convenience = Dubious Practice (C.R.A.P.*) • Convoluted Reasoning and Anti-intellectual Pomposity • Streiner & Norman: Biostatistics: The Bare Essentials

  23. Implausible measurement assumption. [Figure: persons A, B, and C on a continuous depression score, split into “not depressed” vs. “depressed” by a single cutpoint]

  24. Loss of power (http://psych.colorado.edu/~mcclella/MedianSplit/). Sometimes, through sampling error, you can get a ‘lucky cut’ (http://www.bolderstats.com/jmsl/doc/medianSplit.html).

  25. Dichotomization, by definition, reduces the magnitude of the estimate by a minimum of about 30% Dear Project Officer, In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about 3 or 4 hundred thousand dollars worth of subject recruitment and testing money, we are confident that you will understand. Sincerely, Dick O. Tomi, PhD Prof. Richard Obediah Tomi, PhD

  26. Power to detect non-zero b-weight when x is continuous versus dichotomized. True model: y = .4x + e
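
A sketch of the power comparison behind this slide, assuming the true model y = .4x + e: in each replication the same data are analyzed twice, once with x left continuous and once after a median split, and the proportion of significant results is compared. The sample size and replication count are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_reps, alpha = 80, 2000, 0.05
hits_cont = hits_dich = 0

for _ in range(n_reps):
    x = rng.normal(size=n)
    y = 0.4 * x + rng.normal(size=n)          # true model: y = .4x + e

    # Continuous predictor: test the Pearson correlation
    _, p_cont = stats.pearsonr(x, y)

    # Median-split predictor: two-sample t-test on the halves
    hi = x > np.median(x)
    _, p_dich = stats.ttest_ind(y[hi], y[~hi])

    hits_cont += p_cont < alpha
    hits_dich += p_dich < alpha

print(f"power, continuous x : {hits_cont / n_reps:.2f}")
print(f"power, median split : {hits_dich / n_reps:.2f}")   # noticeably lower
```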

  27. Dichotomizing will obscure non-linearity. [Figure: outcome plotted against CESD score dichotomized into Low vs. High]

  28. Dichotomizing will obscure non-linearity: Same data as previous slide modeled continuously

  29. Type I error rates for the relation between x2 and y after dichotomizing two continuous predictors. Maxwell and Delaney calculated the effect of dichotomizing two continuous predictors as a function of the correlation between them. The true model is y = .5x1 + 0x2, where all variables are continuous. If x1 and x2 are dichotomized, the error rate for the relation between x2 and y increases as the correlation between x1 and x2 increases.
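
A small simulation in the spirit of the Maxwell and Delaney result described above (the correlation of .7 and the sample size are illustrative choices, not their exact design): y depends only on x1, x1 and x2 are correlated, both predictors are median-split, and the null predictor x2 is then tested.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, n_reps, alpha, rho = 100, 2000, 0.05, 0.7
false_pos = 0

for _ in range(n_reps):
    # Two correlated continuous predictors; only x1 affects y
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 0.5 * x1 + rng.normal(size=n)          # true model: y = .5*x1 + 0*x2

    # Median-split both predictors, then test x2 in the two-predictor model
    d1 = (x1 > np.median(x1)).astype(float)
    d2 = (x2 > np.median(x2)).astype(float)
    X = sm.add_constant(np.column_stack([d1, d2]))
    p_x2 = sm.OLS(y, X).fit().pvalues[2]
    false_pos += p_x2 < alpha

# With continuous predictors this would hover near alpha (.05);
# after dichotomizing, the spurious x2 effect shows up far more often.
print(f"Type I error rate for x2: {false_pos / n_reps:.3f}")
```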

  30. Is it ever a good idea to categorize quantitatively measured variables? • Yes: • when the variable is truly categorical • for descriptive/presentational purposes • for hypothesis testing, if enough categories are made. • However, using many categories can lead to problems of multiple significance tests and still run the risk of misclassification

  31. CONCLUSIONS • Cutting: • Doesn’t always make measurement sense • Almost always reduces power • Can fool you with too much power in some instances • Can completely miss important features of the underlying function • Modern computing/statistical packages can “handle” continuous variables • Want to make good clinical cutpoints? Model first, decide on cuts afterward.

  32. Sample size and the problem of underfitting vs overfitting • Model assumption is that “ALL” relevant variables be included—the “antiparsimony principle” • Tempered by fact that estimating too many unknowns with too little data will yield junk

  33. Sample Size Requirements • Linear regression • minimum of N = 50 + 8 per predictor (Green, 1990) • Logistic Regression • minimum of 10-15 per predictor among the smallest outcome group (Peduzzi et al., 1990a) • Survival Analysis • minimum of 10-15 events per predictor (Peduzzi et al., 1990b)
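
These rules of thumb are easy to encode; a small helper, with function names of my own invention, that applies Green's 50 + 8 per predictor rule and the 10-15 events-per-predictor guideline:

```python
def green_min_n(n_predictors: int) -> int:
    """Green (1990) rule of thumb for linear regression: N >= 50 + 8 per predictor."""
    return 50 + 8 * n_predictors

def peduzzi_max_predictors(n_events: int, events_per_predictor: int = 10) -> int:
    """Peduzzi et al. guideline: roughly 1 predictor per 10-15 events
    (events = the smaller outcome group for logistic regression,
    observed events for survival analysis)."""
    return n_events // events_per_predictor

print(green_min_n(6))                 # linear regression, 6 predictors -> N >= 98
print(peduzzi_max_predictors(120))    # 120 events -> at most about 12 predictors
```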

  34. Consequences of inadequate sample size • Lack of power for individual tests • Unstable estimates • Spurious good fit—lots of unstable estimates will produce spurious ‘good-looking’ (big) regression coefficients

  35. All-noise, but good fit. [Figure: R-squares from a population model of completely random variables, plotted against the events-per-predictor ratio]
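
The "good fit from pure noise" phenomenon is easy to reproduce; a sketch that regresses a random outcome on varying numbers of random predictors and reports the average apparent R-square (the sample size and predictor counts are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n, n_reps = 50, 500

# Fit pure-noise predictors to a pure-noise outcome and record R-square.
for k in (2, 5, 10, 25):                      # i.e. N/p ratios of 25, 10, 5, 2
    r2 = []
    for _ in range(n_reps):
        X = rng.normal(size=(n, k))
        y = rng.normal(size=n)                # y is unrelated to every column of X
        r2.append(LinearRegression().fit(X, y).score(X, y))
    print(f"{k:2d} noise predictors (N/p = {n // k:2d}): mean R^2 = {np.mean(r2):.2f}")
```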

  36. Simulation: number of events/predictor ratio. Y = .5*x1 + 0*x2 + .2*x3 + 0*x4, where corr(x1, x4) = .4; N/p = 3, 5, 10, 20, 50
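
A minimal sketch of this simulation as I read the slide: the stated true model with corr(x1, x4) = .4, refit at several N/p ratios, summarizing how stable the estimate of b1 is (the error variance and replication count are my own choices).

```python
import numpy as np

rng = np.random.default_rng(9)
p, n_reps = 4, 1000
true_b = np.array([0.5, 0.0, 0.2, 0.0])       # Y = .5*x1 + 0*x2 + .2*x3 + 0*x4

# Correlation matrix with corr(x1, x4) = .4, other predictors uncorrelated
R = np.eye(p)
R[0, 3] = R[3, 0] = 0.4
L = np.linalg.cholesky(R)

for ratio in (3, 5, 10, 20, 50):
    n = ratio * p
    b1_hats = []
    for _ in range(n_reps):
        X = rng.normal(size=(n, p)) @ L.T     # correlated predictors
        y = X @ true_b + rng.normal(size=n)
        Xd = np.column_stack([np.ones(n), X])
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
        b1_hats.append(beta[1])
    print(f"N/p = {ratio:2d} (n = {n:3d}): "
          f"mean b1 = {np.mean(b1_hats):.2f}, SD = {np.std(b1_hats):.2f}")
```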

  37. Parameter stability and n/p ratio

  38. Peduzzi’s Simulation: number of events/predictor ratio. logit[P(survival)] = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC -- Events/p = 2, 5, 10, 15, 20, 25 -- % relative bias = ((estimated b – true b)/true b)*100

  39. Simulation results: number of events/predictor ratio

  40. Simulation results: number of events/predictor ratio

  41. Approaches to variable selection • “Stepwise” automated selection • Pre-screening using univariate tests • Combining or eliminating redundant predictors • Fixing some coefficients • Theory, expert opinion and experience • Penalization/Random effects • Propensity Scoring • “Matches” individuals on multiple dimensions to improve “baseline balance” • Tibshirani’s “Lasso”
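
Of the alternatives listed above, penalization is the simplest to sketch; an illustrative example of Tibshirani's lasso via scikit-learn on simulated data, where only the first three of twenty candidate predictors are authentic (the setup is invented for the example):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(13)
n, p = 200, 20
X = rng.normal(size=(n, p))

# Only the first three candidate predictors are authentic; the rest are noise.
true_b = np.zeros(p)
true_b[:3] = [0.5, 0.3, 0.2]
y = X @ true_b + rng.normal(size=n)

# The L1 penalty shrinks coefficients toward zero and drops many of the
# noise predictors entirely, instead of keeping whichever ones happened
# to look best in this particular sample.
lasso = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(lasso.coef_ != 0)
print("penalty chosen by CV:", round(lasso.alpha_, 3))
print("predictors kept:", kept)
print("their coefficients:", np.round(lasso.coef_[kept], 2))
```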

  42. Any variable selection technique based on looking at the data first will likely be biased

  43. “I now wish I had never written the stepwise selection code for SAS.” • --Frank Harrell, author of the forward and backward selection algorithms for SAS PROC REG

  44. Automated Selection: Derksen and Keselman (1992) Simulation Study • Studied backward and forward selection • Some authentic variables and some noise variables among candidate variables • Manipulated correlation among candidate predictors • Manipulated sample size

  45. Automated Selection: Derksen and Keselman (1992) Simulation Study • “The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.” • “The greater the number of candidate predictors, the greater the number of noise variables were included in the model.” • “Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”
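
A rough sketch of a Derksen-and-Keselman-style experiment (not their exact design): 20 candidate predictors of which 3 are authentic, naive forward selection by p-value, and a count of how many noise variables end up in the final model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(17)
n, p_authentic, p_noise, n_reps = 100, 3, 17, 200
noise_counts = []

for _ in range(n_reps):
    X = rng.normal(size=(n, p_authentic + p_noise))
    y = X[:, :p_authentic] @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)

    # Naive forward selection: keep adding the predictor with the smallest
    # p-value until nothing left is "significant" at .05.
    selected, remaining = [], list(range(p_authentic + p_noise))
    while remaining:
        pvals = {}
        for j in remaining:
            Xj = sm.add_constant(X[:, selected + [j]])
            pvals[j] = sm.OLS(y, Xj).fit().pvalues[-1]
        best = min(pvals, key=pvals.get)
        if pvals[best] > 0.05:
            break
        selected.append(best)
        remaining.remove(best)

    noise_counts.append(sum(j >= p_authentic for j in selected))

print(f"average number of noise variables selected: {np.mean(noise_counts):.1f}")
```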

  46. Simulation results: number of noise variables included. [Figure: results by sample size; 20 candidate predictors; 100 samples]

  47. Simulation results: R-square from noise variables. [Figure: results by sample size; 20 candidate predictors; 100 samples]

  48. Simulation results: R-square from noise variables. [Figure: results by sample size; 20 candidate predictors; 100 samples]

  49. SOME of the problems with stepwise variable selection:
  1. It yields R-squared values that are badly biased high.
  2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
  3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine).
  4. It yields p-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
  5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
  6. It has severe problems in the presence of collinearity.
  7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test pre-specified hypotheses.
  8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
  9. It allows us to not think about the problem.
  10. It uses a lot of paper.
