A REVIEW
By Chi-Ming Kam and Surajit Ray, April 23, 2001

Imputation Techniques Implemented in SOLAS 3.0
• SINGLE IMPUTATION
  • Hot Decking
  • Predicted Mean Imputation
  • Last Value Carried Forward
• MULTIPLE IMPUTATIONS
  • Propensity Score Based Imputation
  • Predictive Model Based Imputation
Method 1: Propensity Score Based Imputation
• This was the only method in Version 1.
• The method is similar to Lavori, Dawson, and Shera (1995), “A multiple imputation strategy for clinical trials with truncation of patient data”.
• GOAL: To impute missing values under minimal distributional assumptions.
How it Works
• Let R be the indicator for the missingness pattern (R = 0 or 1).
• Model R from X1, X2, ..., XP using logistic regression.
• Compute p = Prob(R = 1 | X1, X2, ..., XP) for each case, yielding N propensity scores pi.
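Written out, the propensity model in the bullets above takes the usual logistic form. The coding R_i = 1 for a missing Y_i is an assumption made here for illustration (the slide does not say which value marks missingness), and the coefficient names b_0, ..., b_P are notation only.

```latex
\[
  p_i \;=\; \Pr(R_i = 1 \mid X_{i1}, \dots, X_{iP})
      \;=\; \frac{\exp\!\left(b_0 + b_1 X_{i1} + \dots + b_P X_{iP}\right)}
                 {1 + \exp\!\left(b_0 + b_1 X_{i1} + \dots + b_P X_{iP}\right)},
  \qquad i = 1, \dots, N .
\]
```

Each case receives one fitted value of p_i, so the N cases can then be ranked and grouped by their scores.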
How it Works, continued (approximate Bayesian bootstrap, Rubin, 1987)
• Group the units by the quintiles of p (the grouping is user specified).
• Suppose that within a particular group there are n1 observed and n0 missing values.
• Sample n1+n0 units with replacement from the observed values.
• From the sampled pool, subsample n0 units with replacement.
• Use these n0 units as the imputed values for the n0 missing values.
• Repeat the procedure m times to get m imputations.
[Figure: n1 observed values, then a pool of n0+n1 values drawn with replacement, then n0 imputed values drawn with replacement from the pool]
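The grouping and two-stage resampling steps above can be sketched in a few lines of Python. This is an illustrative sketch, not SOLAS code: the function names (`quintile_groups`, `abb_draw`, `impute_by_propensity`) are hypothetical, missing values are assumed to be coded as `None`, and the propensity scores are assumed to have been estimated already. The pool size defaults to n1+n0 as stated on the slide; Rubin's formulation of the approximate Bayesian bootstrap draws a pool of n1 values, which can be obtained by passing `pool_size=len(observed)`.

```python
import random


def quintile_groups(scores, n_groups=5):
    """Assign each case to a group (0..n_groups-1) by rank of its score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(scores)
    return groups


def abb_draw(observed, n_missing, rng, pool_size=None):
    """Two-stage resampling for one group: build a pool, then subsample it."""
    if not observed:
        raise ValueError("group has no observed values to draw from")
    if pool_size is None:
        pool_size = len(observed) + n_missing  # n1 + n0, as on the slide
    pool = rng.choices(observed, k=pool_size)  # stage 1: with replacement
    return rng.choices(pool, k=n_missing)      # stage 2: with replacement


def impute_by_propensity(y, scores, m=5, seed=0):
    """Return m completed copies of y; None marks a missing value."""
    rng = random.Random(seed)
    groups = quintile_groups(scores)
    completed_sets = []
    for _ in range(m):
        completed = list(y)
        for g in sorted(set(groups)):
            members = [i for i in range(len(y)) if groups[i] == g]
            observed = [y[i] for i in members if y[i] is not None]
            missing = [i for i in members if y[i] is None]
            if missing:
                draws = abb_draw(observed, len(missing), rng)
                for i, value in zip(missing, draws):
                    completed[i] = value
        completed_sets.append(completed)
    return completed_sets
```

Because the imputed values are always copies of observed values from the same propensity group, no distributional form is imposed on Y, which matches the stated goal of the method.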
Theoretical Justification
• It produces an imputed distribution of Y that has been corrected for biases due to missingness related to X.
• It is similar in spirit to reweighting, but here we have a multiple imputation version of it.
• The method produces unbiased estimates for the marginal distribution of Y.
Problems/Drawbacks
• The method does not preserve the association between Y and the individual Xi’s.
• Reasoning: The only aspect of the Xi’s used here is the linear predictor (b0 + b1X1 + b2X2 + ... + bpXp) from the logistic model. This function predicts the missingness of Y (that is, R), not Y itself.
Problems/Drawbacks (Continued)
• Suppose X1 is highly correlated with Y but is unrelated to P(R=1).
• X1 will drop out of the logistic model, so it is not used in the imputation.
• As a result, the imputed data will misrepresent the correlation between X1 and Y.
• Also, by not using X1 in the imputation, we fail to impute Y efficiently.
Simulation Results Using SOLAS 1.1
• Data generation mechanism: Y = X + Z + e, where … and e ~ N(0, 1)
• Source: Paul D. Allison, “Multiple Imputation for Missing Data: A Cautionary Tale”
Some Comments About the Propensity Score Based Method
• The method can provide valid but possibly inefficient inferences about the marginal distribution of Y.
• The method can lead to very misleading inferences about the relationships between Y and other variables.
Method 2: Predictive Model Based Multiple Imputation
This method is implemented in SOLAS 2.0 and 3.0.
HOW IT WORKS:
• Regress Y on X1, X2, ..., Xp.
• Get the estimates of b0, b1, b2, ..., bp and s2.
• Draw b0*, b1*, b2*, ..., bp*, s2* from an approximate posterior distribution.
• Impute Y* = b0* + b1* X1 + b2* X2 + ... + bp* Xp + e*, where e* ~ Normal(0, s2*).
• Repeat m times to get the m imputed datasets.
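The steps above can be illustrated for the simplest case of a single predictor. The sketch below is not the SOLAS implementation: the function name `regression_imputations` is hypothetical, and the posterior draw assumes the standard noninformative prior, under which s2* equals the residual sum of squares divided by a chi-square draw and the coefficients are normal around their least-squares estimates. Centring X makes the intercept and slope draws independent, which keeps the example free of matrix algebra.

```python
import math
import random


def regression_imputations(x_obs, y_obs, x_mis, m=5, seed=0):
    """Return m lists of imputed Y values for the cases in x_mis.

    Assumes one predictor and a noninformative prior (an illustrative
    choice, not taken from the slides).
    """
    n = len(x_obs)
    if n < 3:
        raise ValueError("need at least 3 complete cases")
    rng = random.Random(seed)

    # Least-squares fit of Y on centred X: Y = a + b1 * (X - x_bar) + e.
    x_bar = sum(x_obs) / n
    y_bar = sum(y_obs) / n
    sxx = sum((x - x_bar) ** 2 for x in x_obs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_obs, y_obs))
    b1_hat = sxy / sxx
    sse = sum((y - y_bar - b1_hat * (x - x_bar)) ** 2
              for x, y in zip(x_obs, y_obs))
    df = n - 2  # residual degrees of freedom

    imputations = []
    for _ in range(m):
        # Draw s2* = SSE / chi-square(df); chi-square(df) is Gamma(df/2, scale 2).
        s2_star = sse / rng.gammavariate(df / 2.0, 2.0)
        # Draw the coefficients given s2*; independent because X is centred.
        a_star = rng.gauss(y_bar, math.sqrt(s2_star / n))
        b1_star = rng.gauss(b1_hat, math.sqrt(s2_star / sxx))
        # Impute Y* = a* + b1* (X - x_bar) + e*, with e* ~ Normal(0, s2*).
        sd_star = math.sqrt(s2_star)
        imputations.append([a_star + b1_star * (x - x_bar) + rng.gauss(0.0, sd_star)
                            for x in x_mis])
    return imputations
```

Drawing new parameters for each of the m repetitions, rather than reusing the point estimates, is what lets the m datasets reflect uncertainty about the regression itself as well as residual noise.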
Good points
• The method provides correct model-based MI under the regression model and MAR.
• It also preserves the correlation between the Xi's and Y.
What is the difference from NORM?
• NORM does the same thing with MCMC.
• Under the multivariate normal model, both methods give the same results.
Which Software is More General?
• NORM: “I work for arbitrary missingness patterns.”
• SOLAS: “I work for a non-linear relation of Y on X.”
• But that is probably very similar to NORM with rounding.
Concluding Remarks
• SOLAS is the first commercial missing data software.
• It has a good graphical interface.
• It offers easy data import from and export to other software.
• It performs well under a monotone missingness pattern.
• Its estimates are not always unbiased.