Multiple Imputation with Large Proportions of Missing Data: How Much Is Too Much?

Jin. Designed by Dr. Huber, Texas A&M HSC

Motivation: Motivations and Examples
Example: Korean female colon cancer
* Is smoking protective? We cannot be sure, because a huge proportion of the data is missing.

Background: Types of Missing Data
1. Missing Completely At Random (MCAR): missingness depends on neither the observed nor the missing values.
2. Missing At Random (MAR): missingness depends only on the observed values.
3. Not Missing At Random (NMAR): missingness depends on both the observed and the missing values.
* The types differ by why the data are missing.
* The type affects the effectiveness and the bias of methods for handling missing data.

Background: Methods of Handling Missing Data
1. Complete Case Analysis (CCA)
2. Available Case Analysis (ACA)
3. Mean imputation
4. Expectation-Maximization (EM)
5. Multiple Imputation (MI)
* Categories: older methods, single imputation, multiple imputation.
* This study considers only CCA and MI.

Background: Methods of Handling Missing Data, 1. Complete Case Analysis (CCA)
1. Delete all cases with missing values on Y1, Y2, Y3.
2. Analyze the remaining cases.
* CCA amounts to NOT using any method for handling missing data.
* Deleting cases decreases power, because the sample size is reduced.
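
The two CCA steps can be sketched in a few lines. This is only an illustration: the 8-case dataset, the positions of the missing values, and the choice of column means as the "analysis" are assumptions, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 cases on Y1, Y2, Y3, with NaN marking a missing value.
data = rng.normal(size=(8, 3))
data[1, 0] = np.nan   # case 1 is missing Y1
data[4, 2] = np.nan   # case 4 is missing Y3
data[6, 1] = np.nan   # case 6 is missing Y2

# Step 1: delete every case (row) with a missing value on any variable.
complete_rows = ~np.isnan(data).any(axis=1)
complete_cases = data[complete_rows]

# Step 2: analyze the remaining cases (here, simply the column means).
cca_means = complete_cases.mean(axis=0)

print("cases kept:", complete_cases.shape[0], "of", data.shape[0])
print("CCA means:", cca_means)
```

Three of the eight cases are dropped even though each lacks only one value, which is where the loss of power comes from.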

Background: Methods of Handling Missing Data, 2. Multiple Imputation (MI)
MI has 3 steps:
(1) Imputation step
(2) Analysis step
(3) Combination step

Background: Methods of Handling Missing Data, 2. MI (1) Imputation Step
* The missing values are imputed to produce "5 complete datasets".
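
A minimal sketch of an imputation step, assuming a simplified regression imputation: the missing variable is regressed on the observed ones, and each missing value is replaced by its prediction plus random noise. The function name `impute_regression`, the toy data, and the 30% missing rate are hypothetical. A full MI procedure would also draw the regression parameters at random for each dataset, which this sketch omits.

```python
import numpy as np

def impute_regression(z, X, m, rng):
    """Return m completed copies of z (NaN = missing), imputed from X."""
    missing = np.isnan(z)
    design = np.column_stack([np.ones(len(z)), X])   # intercept + predictors
    # Fit ordinary least squares on the cases where z is observed.
    beta, *_ = np.linalg.lstsq(design[~missing], z[~missing], rcond=None)
    residuals = z[~missing] - design[~missing] @ beta
    dof = (~missing).sum() - design.shape[1]
    sigma = np.sqrt(residuals @ residuals / dof)     # residual standard deviation
    completed = []
    for _ in range(m):
        z_filled = z.copy()
        # Prediction plus fresh noise, so the m datasets differ from each other.
        noise = rng.normal(0.0, sigma, size=missing.sum())
        z_filled[missing] = design[missing] @ beta + noise
        completed.append(z_filled)
    return completed

# Hypothetical toy data: z depends linearly on two observed predictors.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
z = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=0.5, size=n)
z[rng.choice(n, size=60, replace=False)] = np.nan    # 30% missing

datasets = impute_regression(z, X, m=5, rng=rng)
print(len(datasets), "complete datasets")
```

Observed values are identical in all 5 datasets; only the imputed values vary, and that variation is what the later steps use to express imputation uncertainty.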

Background: Methods of Handling Missing Data, 2. MI (2) Analysis Step
* A standard statistical procedure (for example, regression) is run on each of the 5 complete datasets separately.
* The data are therefore analyzed 5 times.

Background: Methods of Handling Missing Data, 2. MI (3) Combination Step
> The results from the 5 datasets are combined into ONE result with the combination equations.
Quantities given by the combination equations:
* Combined estimate
* Total variance
* Within-imputation variance
* Between-imputation variance
* Degrees of freedom (DF)
* Fraction of missing information
* Confidence interval
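
A minimal sketch of the standard combination rules (Rubin's rules), assuming these are the equations the slide refers to. The function name `pool_rubin` and the five example estimates and variances are hypothetical. Two simplifications are labeled in the code: the fraction of missing information uses the large-sample form, and the confidence interval uses a normal quantile, whereas the usual interval uses a t quantile with the computed degrees of freedom.

```python
from math import inf, sqrt
from statistics import NormalDist, mean, variance

def pool_rubin(estimates, variances, level=0.95):
    """Combine m point estimates and their variances into one result."""
    m = len(estimates)
    q_bar = mean(estimates)            # combined estimate
    w = mean(variances)                # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divides by m - 1)
    t = w + (1 + 1 / m) * b            # total variance
    if b > 0:
        r = (1 + 1 / m) * b / w        # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2
    else:
        df = inf                       # no between-imputation variability
    fmi = (1 + 1 / m) * b / t          # fraction of missing information (large-sample form)
    # Assumption: normal quantile instead of the t quantile with df degrees of freedom.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sqrt(t)
    return {
        "estimate": q_bar,
        "within": w,
        "between": b,
        "total": t,
        "df": df,
        "fmi": fmi,
        "ci": (q_bar - half_width, q_bar + half_width),
    }

# Hypothetical results from 5 completed datasets.
pooled = pool_rubin(
    estimates=[10.2, 9.8, 10.5, 10.1, 9.9],
    variances=[0.25, 0.24, 0.26, 0.25, 0.25],
)
for name, value in pooled.items():
    print(name, value)
```

The total variance is larger than the within-imputation variance alone, because the spread among the 5 estimates is added on top of it.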

Background: Methods of Handling Missing Data
* Comparison of methods to handle missing values
* MI gives excellent estimation.
* MI captures the variance among the M estimates, because the data are multiply imputed without deleting any cases.
* MI is the BEST of the methods compared.

Background: Imputation Mechanisms
* In (1) the imputation step of MI, imputation mechanisms are used to substitute the missing values.
* MCMC has NOT been tested for univariate missing data.

Data: Simulated Data
* 3000 observations are generated on Z1 and X1,…,X6 (all variables are continuous).
  (X1,…,X6: observed variables; Z1: partly missing variable.)
* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with means = 0 and the correlation matrix shown on the slide.
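
A sketch of how such a dataset could be generated. The study's correlation matrix is not available here, so the common correlation `rho = 0.5` between all pairs of variables is a hypothetical stand-in, as is the random seed. Unit variances are assumed, so the correlation matrix doubles as the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs = 3000
n_vars = 7                       # Z1 plus X1..X6

# Hypothetical correlation structure: every pair of variables correlates at rho.
rho = 0.5
corr = np.full((n_vars, n_vars), rho)
np.fill_diagonal(corr, 1.0)

# Means of 0 for all variables; unit variances, so cov equals corr.
sample = rng.multivariate_normal(mean=np.zeros(n_vars), cov=corr, size=n_obs)

z1 = sample[:, 0]                # the variable that will be made partly missing
xs = sample[:, 1:]               # the fully observed variables X1..X6

print("shape:", sample.shape)
print("sample means:", sample.mean(axis=0).round(3))
print("sample correlations with Z1:", np.corrcoef(sample, rowvar=False)[0, 1:].round(3))
```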

Data: Example Data ("A Predictive Study of Coronary Heart Disease")
* 3154 observations (all variables are continuous).
  - Missing variable: Systolic Blood Pressure, SBP (mean: 128.63)
  - Observed variables (means): DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and cholesterol (226.37)
* Correlation: the matrix shown on the slide.

Method
1. Missing mechanisms
   1) MCAR: values of Z1 (SBP) are deleted at random.
   2) MAR: after sorting by one of the observed variables X, values of Z1 (SBP) are deleted.
   3) NMAR: after sorting by Z1 (SBP) itself, values of Z1 (SBP) are deleted.
   Missing proportions: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%.
2. Bias is mainly measured by RMSE (Root Mean Square Error) = sqrt(Variance of estimates + Bias^2), which captures both the accuracy and the variability of the estimates and compares them in the same units.
   * True value = mean of Z1 (SBP) at 0% missing
   * Estimate = mean of Z1 (SBP) at 10% to 80% missing after MI
   * A smaller RMSE means a better estimate.
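
The three deletion schemes can be sketched as below. The slide does not say which end of the sorted data is deleted, so deleting the cases with the largest values of the sorting variable is an assumption; the function names, the toy data, and the 30% missing rate are also hypothetical.

```python
import numpy as np

def delete_mcar(z, prop, rng):
    """MCAR: delete a random share `prop` of z."""
    out = z.copy()
    k = int(round(prop * len(z)))
    out[rng.choice(len(z), size=k, replace=False)] = np.nan
    return out

def delete_sorted(z, key, prop):
    """Delete z for the share `prop` of cases with the largest values of `key`.

    key = an observed X gives MAR; key = z itself gives NMAR.
    """
    out = z.copy()
    k = int(round(prop * len(z)))
    if k > 0:
        out[np.argsort(key)[-k:]] = np.nan
    return out

# Hypothetical toy data: z is correlated with one observed variable x.
rng = np.random.default_rng(7)
n = 3000
x = rng.normal(size=n)
z = 0.6 * x + 0.8 * rng.normal(size=n)

z_mcar = delete_mcar(z, 0.3, rng)
z_mar = delete_sorted(z, x, 0.3)     # sorted by the observed variable
z_nmar = delete_sorted(z, z, 0.3)    # sorted by the missing variable itself

true_mean = z.mean()                 # the "true value" at 0% missing
for name, zz in [("MCAR", z_mcar), ("MAR", z_mar), ("NMAR", z_nmar)]:
    print(name, "complete-case mean error:", round(np.nanmean(zz) - true_mean, 3))
```

In this toy setup the complete-case mean stays close to the true mean only under MCAR, and it is furthest off under NMAR.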

Method
3. Methods for dealing with missing values (to measure the effectiveness of MI): Complete Case Analysis (CCA) and Multiple Imputation (MI)
4. Number of imputations: M = 10, 20, 30, 40, and 50
5. Imputation models
   * (z1 = x1 x2 x3 x4 x5 x6): all variables
   * (z1 = x1 x2 x5): variables highly correlated with z1
   * (z1 = x3 x4 x6): variables weakly correlated with z1
   The z1 = x1 x2 x5 model is the best model because it has the smallest RMSE.

Method
6. Imputation mechanisms: regression method, PMM, MCMC
7. 500 repetitions of each MI (to reduce the random variability of imputation)
   ex) M=10 * 500 reps. → average them → mean of estimates for M=10
       …
       M=50 * 500 reps. → average them → mean of estimates for M=50
8. Statistical software: STATA11 (Multiple Imputation)
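
A sketch of the repetition scheme and the RMSE calculation, under one reading of the slides: MI is repeated on the same incomplete dataset, each repetition yields one MI estimate of the mean, and RMSE is computed from the variance and bias of those estimates. Everything concrete here is an assumption: the toy data, the 50% MAR deletion, the simplified regression imputation in `mi_mean`, and 50 repetitions instead of 500 to keep the run short. The study itself used STATA11, not this code.

```python
import numpy as np

def mi_mean(z, X, m, rng):
    """MI estimate of the mean of z: average of the m completed-data means."""
    missing = np.isnan(z)
    design = np.column_stack([np.ones(len(z)), X])
    beta, *_ = np.linalg.lstsq(design[~missing], z[~missing], rcond=None)
    resid = z[~missing] - design[~missing] @ beta
    sigma = np.sqrt(resid @ resid / ((~missing).sum() - design.shape[1]))
    means = []
    for _ in range(m):
        filled = z.copy()
        # Simplified imputation: prediction plus noise, parameters held fixed.
        filled[missing] = design[missing] @ beta + rng.normal(0.0, sigma, missing.sum())
        means.append(filled.mean())
    return float(np.mean(means))

def rmse(estimates, true_value):
    """RMSE = sqrt(variance of estimates + bias^2)."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    return float(np.sqrt(estimates.var() + bias ** 2))

# Hypothetical toy data with 50% of z deleted after sorting by x (MAR).
rng = np.random.default_rng(3)
n = 3000
x = rng.normal(size=n)
z_full = 0.6 * x + 0.8 * rng.normal(size=n)
true_value = z_full.mean()                    # mean at 0% missing
z = z_full.copy()
z[np.argsort(x)[-n // 2:]] = np.nan

reps = 50                                     # the study used 500
estimates = [mi_mean(z, x.reshape(-1, 1), m=10, rng=rng) for _ in range(reps)]

rmse_mi = rmse(estimates, true_value)
cca_error = abs(np.nanmean(z) - true_value)   # CCA gives a single estimate
print("mean of estimates for M=10:", round(float(np.mean(estimates)), 4))
print("RMSE of MI:", round(rmse_mi, 4))
print("absolute error of CCA:", round(cca_error, 4))
```

Because the variance in `rmse` divides by the number of estimates, the result equals the square root of the mean squared difference between the estimates and the true value.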

Result (simulated data) 1. CCA vs. MI by RMSE
(Figure: three panels, x-axis: proportion of missing data; lower RMSE is better.)
* Under MCAR and MAR, both CCA and MI are good.
* After changing the scale of the Y axis: under all missing mechanisms, MI is better than CCA.
* As the percent of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.
> With a high amount of missing data, use Multiple Imputation.

Result (simulated data) 2. Number of imputations
(Figure: three panels, x-axis: proportion of missing data; the curves are similar.)
* Under MCAR and MAR, MI is good.
* Under NMAR, MI gives biased estimates at 80% missing, because the RMSE is large, about 1 SD of the data (0.99), regardless of the number of imputations.
* The 5 lines (M=10 to M=50) run together and look like 1 line.
> No difference among the numbers of imputations M = 10, 20, 30, 40, 50.

Result (simulated data) 3. Regression, PMM, MCMC
(Figure: three panels, x-axis: proportion of missing data; curve label: MCMC/Reg.)
1. Under MCAR and MAR, the regression method should theoretically be better because of normality, but all methods are good. The regression method is, however, slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method is better than PMM.
* The normality assumption may not be important under NMAR.
* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.

Result (example data) 1. CCA vs. MI by RMSE
(Figure: three panels, x-axis: proportion of missing data; lower RMSE is better.)
* Under MCAR and MAR, both CCA and MI are good.
* After changing the scale of the Y axis: under MCAR, MAR, and NMAR, MI produced significantly less biased values than CCA.
* As the percent of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.
> With a high amount of missing data, Multiple Imputation is preferable.

Result (example data) 2. Number of imputations
(Figure: three panels, x-axis: proportion of missing data; the curves are similar.)
* Under MCAR and MAR, MI produces unbiased estimates.
* Under NMAR, MI did not do well at 80% missing, because the RMSE is large, about 1 SD of the data (15.11) (regardless of the number of imputations and the percent of missing data).
* No difference among the numbers of imputations M = 10, 20, 30, 40, 50.
> Increasing the number of imputations has no significant effect on correcting bias for data with these characteristics.

Result (example data) 3. Regression, PMM, MCMC
(Figure: three panels, x-axis: proportion of missing data; curve label: MCMC/Reg.)
1. Under MCAR and MAR, PMM should theoretically be better because the normality assumption is violated, but all methods are good. PMM is, however, slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method has a lower RMSE than PMM.
* The normality assumption may be important only under MAR.
* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data but also for univariate, continuous missing data.

Conclusion
1. Multiple Imputation (MI) is always better than Complete Case Analysis (CCA).
2. There is no significant difference among the numbers of imputations in my data.
3. Under MCAR and MAR, MI produces unbiased estimates even with a high amount of missing data.
4. However, under NMAR, the MI estimates are also biased with a high amount of missing data.
5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.

Thank you
