Handling Missing Data

SSC Case Study 2002: Handling Missing Data
Tao Sun, Lena Zhang, Yaqing Chen, Francisco Aguirre

Presentation Outline
• Preliminary analysis
  • Various plots
• Assessing the missing pattern
  • Spearman rank correlation, logistic regression
• Data analysis with missing data: multiple imputation
  • Random hot deck imputation with bootstrap
  • PROC MI and MIANALYZE (SAS)
  • Transcan function (Hmisc library in S-Plus or R)
• Conclusions
• Further work
Objective: Compare different approaches to handling missing data from a practitioner’s point of view.
SSC Conference, Hamilton, Ontario, May 2002

Preliminary analysis
RESPONSE OVERVIEW
Sample size: 2389
Males: 1097 (45.9%)
Females: 1292 (54.1%)
Observed: 1691
Missing: 698 (28.8%)
Mean: 0.9129
• The response variable is highly skewed to the left.
[Figure: Histogram of observed responses, DVHST94]

Preliminary analysis
• 8 covariates in total; the first 4 are shown here.
• The response DVHST94 appears to form two clusters (below 0.5 and above 0.5).
• DVBMI94 has some “wild” values (= 96):
  • 43 observations, all males (3.9% of the male sample).
  • The wild values were replaced with the mean DVBMI94 of males.
• DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 – 22)
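
The BMI clean-up on this slide can be sketched in a few lines of Python. This is only an illustration: the record layout (`sex`, `bmi`), the function name, and the toy values are hypothetical, and the sketch assumes the male mean is computed after excluding the wild values, which the slide does not state.

```python
from statistics import mean

WILD_BMI = 96    # the "wild" value reported on the slide
BMI_CENTER = 22  # constant used in the slide's transformation


def clean_and_transform_bmi(records):
    """Replace wild BMI values with the male mean, then apply abs(BMI - 22).

    `records` is a list of dicts with keys 'sex' and 'bmi' (hypothetical layout).
    Assumption: the male mean excludes the wild values themselves.
    """
    valid_male = [r["bmi"] for r in records
                  if r["sex"] == "M" and r["bmi"] != WILD_BMI]
    male_mean = mean(valid_male)
    transformed = []
    for r in records:
        bmi = male_mean if r["bmi"] == WILD_BMI else r["bmi"]
        transformed.append(abs(bmi - BMI_CENTER))
    return transformed


# Toy data, not taken from the case study.
toy = [
    {"sex": "M", "bmi": 24.0},
    {"sex": "M", "bmi": 28.0},
    {"sex": "M", "bmi": 96.0},  # wild value, replaced by the male mean (26.0)
    {"sex": "F", "bmi": 20.0},
]
new_bmi = clean_and_transform_bmi(toy)
```

Taking the absolute distance from 22 turns BMI into a "distance from a reference value" covariate, so that both low and high BMI map to larger values.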

Preliminary analysis
• There are no obvious linear patterns between the covariates and the response DVHST94.
• DVPP94 is recoded as the dichotomous NEW.DVPP94, with levels “= 0” (91% of observations) and “> 0” (9% of observations).
• The AGEGRP covariate is recoded to NEW.AGE: NEW.AGE = mid-range value of AGEGRP – 20
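
As a rough sketch of the two recodings above: the age-group bounds in `AGE_BOUNDS` are invented for illustration (the slide does not give the actual AGEGRP coding), and the function names are hypothetical.

```python
# Hypothetical age-group bounds; the real AGEGRP coding is not shown on the slide.
AGE_BOUNDS = {1: (20, 24), 2: (25, 29), 3: (30, 34)}


def recode_age(agegrp):
    """NEW.AGE = mid-range value of the age group minus 20."""
    low, high = AGE_BOUNDS[agegrp]
    return (low + high) / 2 - 20


def recode_dvpp(dvpp):
    """Dichotomize DVPP94 into the two levels used on the slide."""
    return "> 0" if dvpp > 0 else "= 0"


new_age = [recode_age(g) for g in (1, 2, 3)]
new_dvpp = [recode_dvpp(v) for v in (0, 0.3, 0)]
```

Subtracting 20 shifts the age scale so that, under these assumed bounds, the youngest group sits close to zero, which makes the intercept of a later regression easier to read.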

Preliminary analysis
[Figure: Mean DVHST94]

Preliminary analysis
• Strength of marginal relationships between the covariates and the response using generalized Spearman chi-square

Assessing the missing pattern
• The missing pattern of the response does not appear to depend on the sampling weights

Assessing the missing pattern
• The missing values depend on age

Assessing the missing pattern
LOGISTIC REGRESSION
Coefficients:
                  Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)      -5.058793   0.367083    -13.781  < 2e-16  ***
NEW.AGE           0.181625   0.007524     24.140  < 2e-16  ***
SEXMale          -0.847947   0.131475     -6.450  1.12e-10 ***
DVHHIN94          0.047828   0.026768      1.787  0.0740   .
DVSMKT94         -0.015131   0.031662     -0.478  0.6327
NEW.DVPP94 = 0    0.233188   0.226732      1.028  0.3037
NUMCHRON         -0.087992   0.048783     -1.804  0.0713   .
VISITS            0.012483   0.006563      1.902  0.0572   .
NEW.WT6          -0.043935   0.077407     -0.568  0.5703
NEW.DVBMI94      -0.015622   0.017299     -0.903  0.3665
% missing for males: 24%
% missing for females: 34%
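
To read the table above, it can help to recompute the Wald z statistics and convert the estimates to odds ratios. The sketch below does this for two rows, using only the estimates and standard errors printed on the slide. It assumes the fitted response is the indicator that DVHST94 is missing, which the slide implies but does not state; the function name is illustrative.

```python
import math

# (estimate, standard error) copied from the slide's coefficient table.
coefs = {
    "NEW.AGE": (0.181625, 0.007524),
    "SEXMale": (-0.847947, 0.131475),
}


def wald_summary(estimate, std_error):
    """Return (z, odds ratio, two-sided normal p-value) for one coefficient."""
    z = estimate / std_error
    odds_ratio = math.exp(estimate)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, odds_ratio, p_value


summary = {name: wald_summary(est, se) for name, (est, se) in coefs.items()}
for name, (z, odds_ratio, p_value) in summary.items():
    print(f"{name}: z = {z:.3f}, odds ratio = {odds_ratio:.3f}, p = {p_value:.3g}")
```

Under that assumption, each one-unit increase in NEW.AGE multiplies the odds of a missing response by roughly 1.2, and males have less than half the odds of females, which agrees with the lower percentage missing for males.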

Multiple imputation
Methods:
• Random Hot Deck MI with Bootstrap
• SAS PROC MI and PROC MIANALYZE
• Function TRANSCAN in S-Plus from the Hmisc library (Frank Harrell)

Multiple Imputation
• IMPUTATION:
  • Impute the missing entries of the incomplete data set B times, resulting in B complete data sets.
• ANALYSIS:
  • Analyze each of the B completed data sets using weighted least squares.
• POOLING:
  • Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
[Diagram: INCOMPLETE DATA → (imputation) → IMPUTED DATA → (analysis) → ANALYSIS RESULTS → (pooling) → FINAL RESULTS]
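
The pooling step is commonly done with Rubin's rules, which are presumably the "simple rules" this slide refers to. A minimal sketch for a single coefficient follows; the function name and the toy numbers are illustrative only.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Combine B point estimates and their variances using Rubin's rules.

    Returns (pooled estimate, total variance), where
    total variance = mean within variance + (1 + 1/B) * between variance.
    """
    B = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, denominator B - 1
    total = within + (1 + 1 / B) * between
    return pooled, total


# Toy values for one coefficient across B = 3 imputed data sets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what makes the pooled standard error larger than the one obtained from any single completed data set: it carries the extra uncertainty due to the missing values.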

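Random hot-deck MI with Bootstrap
• B = 1000 replicates
• The data are split into observed and missing responses
• Missing responses are filled by choosing randomly, with replacement, from the observed responses (probability ~ weights), giving a complete data set
• Each complete data set yields estimates (within variance, R-square); the same procedure is repeated for every replicate
• Compute 95% CI for judging significance of predictors
[Diagram: Observed response / Missing response → Complete data → Estimated (within variance, R-square)]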
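
A minimal sketch of the hot-deck draw described on this slide, under one reading of "probability ~ weights": each missing response is filled with an observed response drawn with replacement, with probability proportional to the donor's sampling weight. The function names, the use of `None` for missing values, and the toy data are hypothetical, and the sketch covers only the imputation step, not the regression fitted to each replicate.

```python
import random


def hot_deck_impute(responses, weights, rng):
    """Fill None entries with weighted random draws from the observed responses."""
    donor_values = [y for y in responses if y is not None]
    donor_weights = [w for y, w in zip(responses, weights) if y is not None]
    n_missing = sum(y is None for y in responses)
    # Draw with replacement; selection probability is proportional to the weight.
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if y is None else y for y in responses]


def multiple_hot_deck(responses, weights, B, seed=0):
    """Return B completed copies of the response vector."""
    rng = random.Random(seed)
    return [hot_deck_impute(responses, weights, rng) for _ in range(B)]


# Toy data, not taken from the case study.
y = [0.9, None, 0.4, None, 1.0]
w = [1.0, 2.0, 3.0, 1.5, 0.5]
completed = multiple_hot_deck(y, w, B=5)
```

Note that the draw ignores the covariates of the record being imputed: every record with a missing response receives a donor value from the same pool.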

PROC MI & MIANALYZE Method
• PROC MI
  • By default generates 5 imputed values for each missing value
  • Imputation method: MCMC (Markov chain Monte Carlo)
  • The EM algorithm determines the initial values
  • MCMC repeatedly simulates the distribution of interest, from which the imputed values are drawn
  • Assumption: the data follow a multivariate normal distribution
• PROC REG
  • Fits five weighted linear regression models to the five complete data sets obtained from PROC MI (using the BY _Imputation_ statement)
• PROC MIANALYZE
  • Reads the parameter estimates and associated covariance matrices from the analyses of the multiply imputed data sets and derives valid statistics for the parameters

TRANSCAN (S-Plus, Hmisc; Frank Harrell)
Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.
• Advantages:
  • Does not need a normality assumption or symmetry of residuals.
  • Applies shrinkage to avoid overfitting.
• Disadvantage:
  • “Freezes” the imputation model before drawing the multiple imputations.
• It approximates the multiple imputation algorithm using Rubin’s approximate Bayesian bootstrap:
  • Draws a sample of size r, with replacement, from the r non-missing residuals.
  • Chooses a sample of size m, with replacement, from this sample of size r, where m is the number of missing values.
  • Generates imputed values with the linear imputation model and the bootstrapped residuals.
This algorithm is repeated B times to obtain the multiply imputed data sets, which are analyzed using WLS with the function lm.
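
The two-stage residual draw can be sketched as follows. This is not Hmisc's code, only a plain-Python illustration of the approximate Bayesian bootstrap step; the function names, residuals, and predictions are hypothetical.

```python
import random


def abb_residual_draw(observed_residuals, m, rng):
    """Approximate Bayesian bootstrap: two nested draws with replacement."""
    r = len(observed_residuals)
    # Stage 1: resample r residuals from the r non-missing residuals.
    pool = rng.choices(observed_residuals, k=r)
    # Stage 2: draw m residuals from that resampled pool.
    return rng.choices(pool, k=m)


def impute_missing(predicted_missing, observed_residuals, rng):
    """Imputed value = model prediction + bootstrapped residual."""
    draws = abb_residual_draw(observed_residuals, len(predicted_missing), rng)
    return [p + e for p, e in zip(predicted_missing, draws)]


rng = random.Random(1)
residuals = [-0.25, 0.0, 0.125, 0.25]  # toy residuals from observed cases
predictions = [0.5, 0.75]              # toy predictions for the missing cases
draws = abb_residual_draw(residuals, len(predictions), rng)
imputed = impute_missing(predictions, residuals, rng)
```

The first stage is what distinguishes this from simply sampling residuals: resampling the pool before drawing from it adds variability between imputations, approximating the uncertainty about the residual distribution itself.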

Comparing imputation methods
• Ranking, best first:
  • TRANSCAN (advantage: shrinkage correction to prevent overfitting)
  • PROC MI (drawback: normality assumption)
  • Bootstrap random hot deck (does not use the information in the covariates)

Significant variables

Conclusions about the missing pattern
• The missing values of the response variable DVHST94 are not missing completely at random (MCAR). The probability of a missing response depends primarily on the age and sex covariates, which are observed, so the missing values are treated as missing at random (MAR).

Conclusions about multiple imputation
• The Transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given its non-normality.
• Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for data that are MAR.

Conclusions about the data analysis
• The health status of the population tends to decrease with age.
• People with higher income tend to have better health than people with lower income.
• People with lower health status demand more medical services (visits to a doctor).
• People who are prone to depression have lower health status.
• Smoking does not appear to have a decisive influence on health status.

Future work
• A GLM (a multinomial logistic model) could be used to model the categorical response GQ.H1 and to impute the missing categorical responses.
• Interactions of the significant variables with the insignificant variables should be explored to further assess concomitant effects (e.g. smoking and depression).

Thank you!
Acknowledgements: Special thanks to Professors Peggy Ng and George Monette for their support.
