Jenny H. Qin and Mike Singleton Kentucky CODES Kentucky Injury Prevention & Research Center

Performing Sensitivity Analyses of Imputed Missing Values
Jenny H. Qin and Mike Singleton
Kentucky CODES, Kentucky Injury Prevention & Research Center, University of Kentucky
www.kiprc.uky.edu

Multiple Imputation Publications
• Multiple Imputation in Public Health Research
• Handling Missing Data in Nursing Research with Multiple Imputation
• NHTSA: Transitioning to Multiple Imputation! A new Method to Impute Missing BAC values in FARS
• Application of Multiple Imputation in Medical Studies: from AIDS to NHANES

Questions???
• May I use MI to deal with missing data problems for my data sets?
• How can I believe that the MI will give me better analysis results?
• What should I do to get good results from MI?

Sensitivity Analyses on Imputed Values
A sensitivity analysis tests whether our study results are sensitive to the assumptions (missing data mechanism), data conditions (missing data rate), and choices (imputation model or number of imputations) made in obtaining them.
[Diagram: Questions, passed through sensitivity analyses on imputed values, lead to Answers.]

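MI Process
[Diagram: the data set of interest is passed to Proc MI, which produces imputed data sets Set 1, Set 2, Set 3, ..., Set n. The analysis model is fitted to each set to give Results 1, Results 2, Results 3, ..., Results n, which Proc MIANALYZE combines. Four factors are varied in the sensitivity analyses: 1 Missing Data Mechanism, 2 Missing Data Rate, 3 Imputation Model, 4 Proc MI Options.]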

CODES Application
Research Question: What was the relationship between driving under the influence of drugs and/or alcohol, and being killed or hospitalized in a crash, for motorcycle riders in Kentucky in 2001?
Outcome (Dependent Variable): Killed or Hospitalized (K/H)
Risk Factor Candidates (Independent Variables): Age, gender, suspected DUI, posted speed limit, helmet use, fixed object, head-on collision, collision time, rural vs. urban

Analysis Model
• Logistic Regression Model: K/H = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
• Total records in our study data set: 1,226
• Records with missing values: 14 (1.1%)

Results for the Gold Standard
This Gold Standard result is the reference against which all other results are compared.
Conclusion: the odds of being killed or hospitalized for motorcyclists with DUI are 2.5 times the odds for motorcyclists without DUI, when other factors are controlled.
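The odds ratio on this slide comes from exponentiating the DUI coefficient of the logistic regression. The sketch below shows that conversion in Python. The coefficient 0.92 and standard error 0.236 are assumptions, back-calculated so that the output lands near the reported Gold Standard odds ratio; they are not values taken from the slides.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    return (
        math.exp(beta),           # point estimate of the odds ratio
        math.exp(beta - z * se),  # lower confidence limit
        math.exp(beta + z * se),  # upper confidence limit
    )

# Assumed (back-calculated) inputs, for illustration only.
or_dui, lower, upper = odds_ratio_with_ci(beta=0.92, se=0.236)
print(f"OR = {or_dui:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

The interval is symmetric on the log-odds scale, which is why it looks asymmetric around the odds ratio itself.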

Imputation Model
• Analysis Model: K/H = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
• Imputation Model: K/H DUI Speed Fixed Head-On
• Note: The imputation model does not have to be identical to the analysis model, but it should include at least all of the analysis covariates. You can add any additional variables that are correlated with the variables that have missing values.

SA: 1 Missing Data Mechanism
[Diagram: the MI process (Study Data Set, Proc MI, Data Analysis with the analysis model, Proc MIANALYZE, Results) with factor 1, Missing Data Mechanism (MCAR, MAR, NMAR), highlighted. The other factors are 2 Missing Data Rate, 3 Imputation Model, and 4 Proc MI options.]

SA: 1 Missing Data Mechanism
• Missing Completely At Random (MCAR)
  • DFN: the missing data values are a simple random sample of all data values.
  • We simulated this condition by using SAS Proc SurveySelect to pick a random sample from the study data set, then setting DUI = missing for the selected cases.
• Missing At Random (MAR)
  • DFN: the probability of missing values on one variable is unrelated to the values of this variable, after controlling for other variables in the analysis.
  • We simulated this condition by setting DUI = missing for riders aged 46 or older.
• Not Missing At Random (NMAR)
  • DFN: the probability of missing values on one variable is related to the values of this variable, even if we control for other variables in the analysis.
  • We simulated this condition by setting DUI = missing for uninjured riders who were not suspected of DUI (DUI='NO').
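The three simulated conditions above can be sketched in a few lines of Python. This is a rough illustration rather than the SAS code used in the study: the `make_riders` generator, the field names `age`, `dui`, and `injured`, and the probabilities inside it are all hypothetical stand-ins for the crash data.

```python
import random

def make_riders(n, seed=0):
    """Build a hypothetical rider data set (not the Kentucky CODES data)."""
    rng = random.Random(seed)
    riders = []
    for _ in range(n):
        dui = "YES" if rng.random() < 0.10 else "NO"
        injured = rng.random() < (0.60 if dui == "YES" else 0.35)
        riders.append({"age": rng.randint(16, 75), "dui": dui, "injured": injured})
    return riders

def set_mcar(riders, rate, seed=1):
    """MCAR: blank DUI for a simple random sample of records."""
    rng = random.Random(seed)
    out = [dict(r) for r in riders]
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i]["dui"] = None
    return out

def set_mar(riders, min_age=46):
    """MAR: missingness depends only on an observed variable (age)."""
    return [dict(r, dui=None) if r["age"] >= min_age else dict(r) for r in riders]

def set_nmar(riders):
    """NMAR: missingness depends on the DUI value itself (uninjured, DUI='NO')."""
    return [
        dict(r, dui=None) if (not r["injured"] and r["dui"] == "NO") else dict(r)
        for r in riders
    ]

riders = make_riders(1000)
for name, data in [("MCAR", set_mcar(riders, 0.25)),
                   ("MAR", set_mar(riders)),
                   ("NMAR", set_nmar(riders))]:
    n_missing = sum(r["dui"] is None for r in data)
    print(name, "missing DUI:", n_missing)
```

Each function returns a copy, so the same complete data set can serve as the reference for all three conditions.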


Sensitivity analysis on missing data mechanism
1. Missing Data Mechanism: Different
2. Missing Data Rate (25%): Same
3. Imputation Model: Same
4. Proc MI Options: Same
What is the result?

Conclusions of SA on Missing Data Mechanism
• Even with the simplest imputation model, MI produced results consistent with the Gold Standard when the missing data mechanism was MCAR or MAR, but not when it was NMAR.
• Under NMAR, we would estimate the odds ratio of death or hospitalization for riders suspected of DUI to be 1.78 (1.15, 2.76), while the Gold Standard estimate is 2.51 (1.58, 3.98).

SA: 2 Missing Data Rate
[Diagram: the MI process (Study Data Set, Proc MI, Data Analysis with the analysis model, Proc MIANALYZE, Results) with factor 2, Missing Data Rate (6%, 25%, 50%), highlighted. The other factors are 1 Missing Data Mechanism, 3 Imputation Model, and 4 Proc MI options.]

SA: 2 Missing Data Rate
• Data sets with MCAR (tested with 6%, 25%, and 50% of DUI values missing)
• Data sets with MAR (tested with 6%, 25%, and 50% of DUI values missing)

Sensitivity analysis on Missing Data Rate
1. Missing Data Mechanism (MCAR or MAR): Same
2. Missing Data Rate: Different
3. Imputation Model: Same
4. Proc MI Options: Same
What is the result?

Conclusions of SA on Missing Data Rate
• For both missing data mechanisms, the 50% missing case produced the DUI parameter estimate farthest from the Gold Standard estimate, as well as the widest 95% CI. However, for MCAR the difference from the Gold Standard estimate was -7%, whereas for MAR it was 42%. In addition, the 95% CI for 50% MCAR was 19% wider than the Gold Standard 95% CI, whereas for 50% MAR it was 106% wider.
• This shows that the simplest imputation model is not sufficient to handle very high missing data rates, particularly under MAR.
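The comparisons above are relative differences from the Gold Standard in the point estimate and in the confidence interval width. A minimal Python sketch of that arithmetic follows. The slides do not say whether the percentages were computed on the odds ratio scale or the log-odds scale; this sketch assumes the odds ratio scale and uses the NMAR result from the missing data mechanism conclusions (1.78 (1.15, 2.76) versus 2.51 (1.58, 3.98)) purely as sample input.

```python
def pct_diff(value, reference):
    """Percent difference of a value from a reference value."""
    return 100.0 * (value - reference) / reference

def ci_width(lower, upper):
    """Width of a confidence interval."""
    return upper - lower

# Gold Standard and NMAR odds ratios with their intervals, as reported on the slides.
gold = {"est": 2.51, "lower": 1.58, "upper": 3.98}
nmar = {"est": 1.78, "lower": 1.15, "upper": 2.76}

est_change = pct_diff(nmar["est"], gold["est"])
width_change = pct_diff(ci_width(nmar["lower"], nmar["upper"]),
                        ci_width(gold["lower"], gold["upper"]))
print(f"Estimate differs by {est_change:.1f}%")
print(f"CI width differs by {width_change:.1f}%")
```

A negative value means the imputed result is smaller (or the interval narrower) than the Gold Standard.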

SA: 3 Imputation Model
[Diagram: the MI process (Study Data Set, Proc MI, Data Analysis with the analysis model, Proc MIANALYZE, Results) with factor 3, Imputation Model (Model1, Model2, Model3, Model4), highlighted. The other factors are 1 Missing Data Mechanism, 2 Missing Data Rate, and 4 Proc MI options.]

SA: 3 Imputation Model
• Data set with MAR and 50% of DUI values missing
• Tests on the following 4 imputation models:
  • Model1: K/H DUI Speed Fixed Head-on (Model1 = analysis model; it is the simplest imputation model)
  • Model2: Model1 + age_group + colltime (Categorical)
  • Model3: Model1 + age_group + hour (Continuous)
  • Model4: Model1 + age_group + hour_normal (Continuous)
We are adding age and collision time to help predict DUI in Model2, Model3, and Model4.

Sensitivity analysis on Imputation Model
1. Missing Data Mechanism (MAR): Same
2. Missing Data Rate (50%): Same
3. Imputation Models: Different
4. Proc MI Options: Same
What is the result?

Conclusions of SA on Imputation Models
• Models 2, 3, and 4 are all improvements over Model 1, and produced DUI parameter estimates and 95% CI widths close to those of the Gold Standard.
• So even with 50% missing values (MAR), we are able to get a good result by using a richer imputation model.
• The higher the percentage of missing values (MAR) in your data set, the more important it is to include additional predictors in the imputation model.

Comparison of No MI and Model 4 to the Gold Standard
[Figure: results from No MI and from MI (Model 4), each shown against the Gold Standard (G.S.).]

SA: 4 Proc MI: Number of Imputations
[Diagram: the MI process (Study Data Set, Proc MI, Data Analysis with the analysis model, Proc MIANALYZE, Results) with factor 4, Proc MI: number of imputations (N=0, N=2, N=5, N=10, N=20), highlighted. The other factors are 1 Missing Data Mechanism, 2 Missing Data Rate, and 3 Imputation Model.]

SA: 4 Proc MI: Number of Imputations
• Data set with MAR and 50% of DUI values missing; Model4 is used for MI
• Tests on different numbers of imputations:
  • N=0
  • N=2
  • N=5
  • N=10
  • N=20

Sensitivity analysis on Number of Imputations
1. Missing Data Mechanism (MAR): Same
2. Missing Data Rate (50%): Same
3. Imputation Model: Same
4. Number of Imputations: Different
What is the result?

Conclusions of SA on Number of Imputations
• In our example, n=5 to 10 imputations is enough to get good results for a data set with 50% MAR on DUI.
• With No MI (complete cases only), we would conclude that motorcyclists with DUI had 4.2 (2.1, 8.4) times the odds of being killed or hospitalized compared with motorcyclists without DUI. From the Gold Standard, however, the OR is 2.5 (1.5, 4.0).
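One commonly cited reason that a small number of imputations can suffice is Rubin's relative efficiency formula, 1 / (1 + λ/m), where m is the number of imputations and λ is the fraction of missing information. The Python sketch below evaluates it for the imputation counts tested here. The value λ = 0.5 is an assumption chosen for illustration; the fraction of missing information is not reported on the slides and is generally not the same as the missing data rate.

```python
def relative_efficiency(m, fmi):
    """Rubin's relative efficiency of m imputations versus infinitely many,
    given the fraction of missing information fmi (between 0 and 1)."""
    if m < 1:
        raise ValueError("m must be at least 1 (N=0 means no imputation)")
    return 1.0 / (1.0 + fmi / m)

ASSUMED_FMI = 0.5  # hypothetical value, not taken from the study
for m in (2, 5, 10, 20):
    print(f"m={m:2d}: relative efficiency = {relative_efficiency(m, ASSUMED_FMI):.3f}")
```

Under this assumption the gain from going beyond 10 imputations is small, which fits the pattern reported above, although efficiency of the point estimate is only one criterion for choosing m.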

Summary: Answers?
• May I use MI to deal with missing data problems for my data sets?
  It seems a good idea to try MI. The answer depends on the missing data mechanisms of the variables with missing values in your data sets (however, even our results with MI for NMAR were better than with No MI).
• How can I believe that the MI will give me better analysis results?
  We found that using MI on our example gave us much better analysis results than No MI (complete cases only).
• How can I get better analysis results by using MI?
  Understand the relationships between variables in your data sets; know the missing data mechanisms of the variables; determine the percent of missing information; build a reasonable imputation model; use Proc MI options wisely.

Missing Data Problems Everywhere
Poll Results
Q1. I like Denver.
Q2. I like TRF.
Q3. I liked the talk.
Q4. I will use the MI.

Acknowledgment
Special thanks to Dr. Mike McGlincy, who gave us helpful suggestions during our study of sensitivity analyses on imputed values and insightful comments on the analysis results.

Thank You

Questions?

Can We Improve Analysis Results for NMAR by Using a More Complex Imputation Model?
• No MI = Complete cases only
• Model1 = K/H + DUI + Speed + Fixed + Head-on
• Model4 = Model1 + age + hour
• Model5 = Model1 + age + hour + gender + safety

Multiple Imputation inference involves three distinct phases:
1. The missing data are filled in m times to generate m complete data sets (using the imputation model).
2. The m complete data sets are analyzed by using standard procedures (using the analysis model).
3. The results from the m complete data sets are combined for the inference.
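The third phase, combining the m sets of results, is usually done with Rubin's rules, which is the kind of pooling Proc MIANALYZE performs. The Python sketch below shows only the point estimate and total variance; it is not the SAS implementation, and the five log-odds estimates and standard errors are hypothetical numbers, not results from this study.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine m per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)                    # average of the m estimates
    within = mean(se ** 2 for se in std_errors) # average within-imputation variance
    between = variance(estimates)               # between-imputation variance (m-1 denominator)
    total = within + (1 + 1 / m) * between      # total variance
    return pooled, math.sqrt(total)

# Hypothetical DUI log-odds estimates from m = 5 imputed data sets.
betas = [0.90, 0.95, 0.88, 0.97, 0.92]
ses = [0.24, 0.25, 0.24, 0.25, 0.24]
beta_pooled, se_pooled = pool_rubin(betas, ses)
print(f"pooled beta = {beta_pooled:.3f}, pooled SE = {se_pooled:.3f}")
```

The between-imputation term is what makes the pooled standard error larger than a single completed data set would suggest; it carries the extra uncertainty due to the missing values.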

Statistical Assumptions for Multiple Imputation
1. The MI procedure assumes that the data are from a continuous multivariate distribution. It also assumes that the data are from a multivariate normal distribution when the MCMC method is used.
  • According to Schafer's MI FAQ page, MI tends to be quite forgiving of departures from the normality assumption. For example, when working with binary or ordered categorical variables, it is often acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category.
  • Variables whose distributions are heavily skewed may be transformed to approximate normality and then transformed back to their original scale after imputation.
2. Proc MI and Proc MIANALYZE assume that the missing data are Missing At Random (MAR).
  • MCAR is unlikely for real-world crash data sets.
  • NMAR may be shifted toward MAR by using a richer imputation model to help predict missing values, because crash data sets include many related variables that can help predict each other.
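The two workarounds mentioned for the normality assumption, rounding imputed values to the nearest category and transforming skewed variables before imputation, can be sketched in Python as follows. The function names are hypothetical, and the log transform is only one possible choice of normalizing transformation, assumed here for illustration.

```python
import math

def round_to_category(imputed_value, categories=(0, 1)):
    """Map a continuous imputed value to the nearest allowed category code."""
    return min(categories, key=lambda c: abs(c - imputed_value))

def to_normal_scale(x):
    """Log-transform a positive, right-skewed value before imputation."""
    return math.log(x)

def to_original_scale(z):
    """Back-transform an imputed value to the original scale."""
    return math.exp(z)

# A binary variable such as DUI coded 0/1, imputed under normality.
print(round_to_category(0.73))   # nearest category to 0.73
print(round_to_category(-0.20))  # values outside the range snap to the closest code

# A skewed variable is imputed on the log scale, then converted back.
z = to_normal_scale(12.5)
print(round(to_original_scale(z), 6))
```

Rounding is a convenience rather than a requirement of MI, so it is worth checking that it does not distort the proportions of the rounded variable.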
