
Missing Data Analysis: Multiple Imputation


Presentation Transcript


    1. Missing Data Analysis: Multiple Imputation. Ming-Yu Fan, PhD. April 30, 2008

    2. Outline
        - Missing mechanism
        - Multiple imputation
        - Propensity scores
        - Applications: SOLAS, Stata, SAS

    3. How to deal with missing data
        - Do nothing
            - Exclude subjects with missing values → expand the results from the sub-sample to the whole sample
        - Make a guess and replace the missing values with the guessed values
            - Fill in with a simple guess, e.g. the sample mean → expand the results from the sub-sample to the whole sample → similar to doing nothing
            - Fill in with better guessed values → imputation

    4. Missing pattern (1)

    5. Missing pattern (1), cont.
        - Do nothing: Figure 1.1 has the same mixed color as Figure 1.2
        - Fill in with the sample mean: the two mixed colors are still identical
        - Fill in with better guessed values: nice but not necessary

    6. Missing pattern (2)

    7. Missing pattern (2), cont.
        - Do nothing: Figure 1.1 and Figure 1.2 have different mixed colors
        - Fill in with the sample mean: the two figures will still have different mixed colors
        - Fill in with better guessed values: necessary
            - If we can correctly identify the slices, we can make a better guess at a missing value from the observed values in the same slice
            - The final mixed colors might then be similar

    8. Missing pattern (3)

    9. Missing pattern (3), cont.
        - Do nothing: Figures 3.1 and 3.2 have different mixed colors
        - Fill in with the sample mean: the two figures will still have different mixed colors
        - Fill in with better guessed values: even if we can identify the slices, we won't be able to guess the missing value correctly
            - Example: we won't be able to guess the missing brown piece from the observed grey piece

    10. Missing mechanism
        - Missing Completely At Random (MCAR): the best scenario; simple approaches can yield unbiased results
        - Missing At Random (MAR): a less ideal scenario; more advanced approaches are necessary, but they can yield unbiased results
        - Not Missing At Random (NMAR): the worst scenario; no approach can correct the biased results

    11. Missing mechanism, cont.
        - How do we determine the missing mechanism?
            - Since the missing information is not observed, we don't know what the complete sample looks like, so we can't say for sure whether the missing data are MCAR, MAR, or NMAR
        - Can we guess?
            - E.g. (1): income is missing because the patient's income is extremely high
            - E.g. (2): gender is missing because the reviewer forgot to fill in the information
            - E.g. (3): SCL-20 items are missing because older men don't like to answer some of the questions

    13. Missing mechanism, cont.
        - Respondents and non-respondents differ in some baseline characteristics
            - Probably not MCAR
            - MAR or NMAR? The truth is that most of the time we can't really determine that
        - A common approach: unless a missing value is clearly NMAR (e.g. income), assume the missing data are MAR and apply methods based on this assumption (e.g. propensity score weighting, multiple imputation)
        - In reality, MCAR is uncommon, so the "do nothing" and "fill in with the sample mean" approaches are likely to introduce bias
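The effect of the missing mechanism on the "do nothing" approach can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x` and `y`, the sample size, and the observation probabilities are all assumptions chosen for illustration, not values from the presentation.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample: x is a binary baseline characteristic,
# y is the outcome and depends on x.
n = 20000
x = [random.randint(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]


def mean(values):
    return sum(values) / len(values)


full_mean = mean(y)  # what we would estimate with no missing data

# MCAR: every outcome has the same 50% chance of being observed.
mcar_obs = [random.random() < 0.5 for _ in range(n)]
mcar_mean = mean([yi for yi, seen in zip(y, mcar_obs) if seen])

# MAR: the chance of being observed depends only on the observed x.
mar_obs = [random.random() < (0.2 if xi == 1 else 0.8) for xi in x]
mar_cc_mean = mean([yi for yi, seen in zip(y, mar_obs) if seen])

# Adjusting within slices of x: average the observed outcomes in each
# slice, then weight each slice by its share of the full sample.
mar_adjusted = 0.0
for level in (0, 1):
    slice_obs = [yi for xi, yi, seen in zip(x, y, mar_obs) if xi == level and seen]
    slice_share = sum(1 for xi in x if xi == level) / n
    mar_adjusted += slice_share * mean(slice_obs)

print(f"full sample mean:          {full_mean:.2f}")
print(f"MCAR, complete cases only: {mcar_mean:.2f}")
print(f"MAR, complete cases only:  {mar_cc_mean:.2f}")
print(f"MAR, adjusted by slice:    {mar_adjusted:.2f}")
```

Under these assumptions the complete-case mean stays close to the full-sample mean when the data are MCAR, drifts away from it when they are MAR, and is pulled back once the slices defined by `x` are taken into account.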

    14. Imputation
        - Assumption: MAR
        - Challenges:
            - How to identify the slices
            - How to guess the missing values

    15. Imputation example
        - How to impute the missing SCL for patient #5?
            - Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
            - By age: (3.8 + 0.6)/2 = 2.2
            - By sex: 1.1
            - By education: 1.3
            - By race: (3.8 + 0.6 + 1.3)/3 = 1.9
            - By ADL: (1.1 + 1.3)/2 = 1.2
        - Who is/are in the same slice as #5?
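The arithmetic on this slide can be reproduced in a few lines. The sketch below only uses the donor values quoted on the slide; the patient table itself is not part of the transcript, so the dictionary keys are illustrative labels rather than the original column names.

```python
from statistics import mean

# Observed SCL values of the patients who share each characteristic with
# patient #5, copied from the slide. The keys are illustrative labels.
donors_by_slice = {
    "sample": [3.8, 0.6, 1.1, 1.3],
    "age": [3.8, 0.6],
    "sex": [1.1],
    "education": [1.3],
    "race": [3.8, 0.6, 1.3],
    "ADL": [1.1, 1.3],
}

# Candidate imputed value for each choice of slice: the mean of its donors.
imputed = {name: round(mean(values), 1) for name, values in donors_by_slice.items()}

for name, value in imputed.items():
    print(f"{name:>9}: {value}")
```

Each choice of slice gives a different guess for the same missing value, which is why the question of who belongs in the same slice as #5 matters.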

    16. Propensity score
        - Measure similarity by the likelihood of the outcome being observed/missing
        - Use a logistic regression model to estimate this likelihood
            - Dependent variable: Z = 1 if a subject's outcome is observed, Z = 0 if a subject's outcome is missing
            - Independent variables: anything that might be associated with the outcome being missing (Z = 0)
                - Demographic information
                - Baseline characteristics

    17. Propensity score, cont.
        - Model: $p = \Pr(Z = 1)$, with $\log\bigl(p/(1-p)\bigr) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
        - Z has no missing values
        - $X_1, \dots, X_k$ must all have non-missing values
        - Statistical significance of the $\beta$'s is not important
        - The predicted p's derived from the model are the propensity scores

    18. Propensity score example
        - Dependent variable: Y = 12-month SCL score; Z = 1 if Y is observed, Z = 0 if Y is missing
        - Independent variables:
            - $X_1$ = age (Age)
            - $X_2$ = sex, 1 if male and 0 if female (Sex)
            - $X_3$ = number of chronic conditions (NumC)
            - $X_4$ = baseline SCL score (SCL00)
        - Model: $\log\bigl(p/(1-p)\bigr) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$
        - Result: $\beta_0 = 0.31$; $\beta_1 = 0.003$; $\beta_2 = -0.58$; $\beta_3 = -0.25$; $\beta_4 = 0.25$

    19. Propensity score example
        - $\log\bigl(p/(1-p)\bigr) = 0.31 + 0.003\,\mathrm{Age} - 0.58\,\mathrm{Sex} - 0.25\,\mathrm{NumC} + 0.25\,\mathrm{SCL00}$
        - Derive the propensity scores for subjects A and B:
            - Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7
                - (0.31) + (0.003)(70) + (-0.58)(1) + (-0.25)(3) + (0.25)(1.7) = -0.385
                - $\log\bigl(p/(1-p)\bigr) = -0.385 \Rightarrow p = 0.405$
            - Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7
                - (0.31) + (0.003)(85) + (-0.58)(0) + (-0.25)(4) + (0.25)(0.7) = -0.26
                - $\log\bigl(p/(1-p)\bigr) = -0.26 \Rightarrow p = 0.435$
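The two worked examples can be checked by evaluating the fitted model directly. The sketch below takes the coefficients from the slide as given; the function name `propensity` and its argument names are chosen here for illustration.

```python
import math

# Fitted coefficients as reported on the slide.
COEFS = {"intercept": 0.31, "Age": 0.003, "Sex": -0.58, "NumC": -0.25, "SCL00": 0.25}


def propensity(age, sex, num_c, scl00):
    """Return (log-odds, probability) that the 12-month SCL is observed."""
    logit = (
        COEFS["intercept"]
        + COEFS["Age"] * age
        + COEFS["Sex"] * sex
        + COEFS["NumC"] * num_c
        + COEFS["SCL00"] * scl00
    )
    # Invert log(p / (1 - p)) = logit to recover p.
    p = 1.0 / (1.0 + math.exp(-logit))
    return logit, p


logit_a, p_a = propensity(age=70, sex=1, num_c=3, scl00=1.7)  # subject A
logit_b, p_b = propensity(age=85, sex=0, num_c=4, scl00=0.7)  # subject B

print(f"Subject A: log-odds = {logit_a:.3f}, p = {p_a:.3f}")
print(f"Subject B: log-odds = {logit_b:.3f}, p = {p_b:.3f}")
```

The step from the log-odds to p is the inverse logit, p = 1/(1 + exp(-log-odds)), which is where 0.405 and 0.435 come from.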

    20. Propensity score, cont.
        - We can compute the propensity score for every subject, including those with a missing outcome
        - We already know whether a subject's outcome is observed or missing
            - Propensity scores do not predict the probability of a missing outcome in the sample
            - They estimate the probability of having the outcome observed for ANY subject with a similar background, as measured by the independent variables
        - Subjects with close propensity scores are considered similar (in the same slice)

    21. Imputation: hot-deck
        - How to impute the missing SCL for patient #5?
            - 4 strata → closest to #2 → impute with 0.6
            - 2 strata → closest to both #2 and #3 → impute with a value randomly selected from (0.6, 1.1)
        - The method is called hot-deck; #2 and #3 are called donors
        - Common approach:
            - Stratify the sample by the propensity scores (e.g. 5 strata)
            - Randomly select a donor from the same stratum and impute the missing value with the donor's observed value
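A minimal sketch of the common approach described on this slide follows. The propensity scores in `patients` are invented for illustration, and the assignment of 3.8 and 1.3 to patients #1 and #4 is an assumption (the slide only ties 0.6 to #2 and 1.1 to #3). Strata are formed by cutting the sorted scores into roughly equal-sized chunks, which is one simple choice among several.

```python
import math
import random


def hot_deck_impute(records, n_strata=5, rng=None):
    """Fill missing 'scl' values with a random donor from the same stratum.

    records: list of dicts with 'ps' (propensity score) and 'scl'
    (None when missing). Strata are equal-sized chunks of the records
    sorted by 'ps', so there are at most n_strata of them.
    """
    if not records:
        return []
    rng = rng or random.Random()
    ordered = sorted(records, key=lambda r: r["ps"])
    size = math.ceil(len(ordered) / n_strata)
    completed = []
    for start in range(0, len(ordered), size):
        stratum = ordered[start:start + size]
        donors = [r["scl"] for r in stratum if r["scl"] is not None]
        for r in stratum:
            if r["scl"] is None and donors:
                completed.append({**r, "scl": rng.choice(donors), "imputed": True})
            else:
                # Observed values, or missing values with no donor, are kept as-is.
                completed.append({**r, "imputed": False})
    return completed


# Hypothetical propensity scores; SCL values follow the slides where known.
patients = [
    {"id": 1, "ps": 0.80, "scl": 3.8},
    {"id": 2, "ps": 0.40, "scl": 0.6},
    {"id": 3, "ps": 0.50, "scl": 1.1},
    {"id": 4, "ps": 0.70, "scl": 1.3},
    {"id": 5, "ps": 0.45, "scl": None},
]

for row in hot_deck_impute(patients, n_strata=2, rng=random.Random(1)):
    print(row)
```

With two strata, patient #5 shares a stratum with #2 and #3, so the imputed value is drawn from (0.6, 1.1); with narrower strata only #2 remains as a donor.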

    22. Imputation: regression
        - Model: $\mathrm{SCL} = b_0 + b_1\,\mathrm{Age} + b_2\,\mathrm{Sex} + b_3\,\mathrm{Edu} + b_4\,\mathrm{Race} + b_5\,\mathrm{ADL} + b_6\,\mathrm{Pain} + b_7\,\mathrm{Comorb}$
        - Fit the model to the observed data: $b_0 = -0.8$, $b_1 = 0.02$, $b_2 = -0.5$, $b_3 = 0.05$, $b_4 = -0.6$, $b_5 = 0.1$, $b_6 = 0.1$, $b_7 = 0.05$
        - Plug in the information for #5 to derive the predicted value:
            - (-0.8) + (0.02)(70) + (-0.5)(0) + (0.05)(21) + (-0.6)(1) + (0.1)(2) + (0.1)(4) + (0.05)(3) = 1.8 = predicted SCL
        - Notes:
            - Predicted values might fall outside the natural range of the outcome (e.g. SCL > 4 or SCL < 0)
            - For discrete outcomes, the predicted values might not be plausible (e.g. number of people living in the house = 2.7)
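The prediction for patient #5 can be reproduced as below. The coefficients and covariate values are those on the slide; clipping the prediction to the 0 to 4 range is an assumption added here as one simple way to handle the out-of-range issue raised in the notes, not something the presentation prescribes.

```python
# Fitted coefficients and patient #5's covariates, as given on the slide.
INTERCEPT = -0.8
SLOPES = {"Age": 0.02, "Sex": -0.5, "Edu": 0.05, "Race": -0.6,
          "ADL": 0.1, "Pain": 0.1, "Comorb": 0.05}
patient_5 = {"Age": 70, "Sex": 0, "Edu": 21, "Race": 1,
             "ADL": 2, "Pain": 4, "Comorb": 3}


def predict_scl(patient, lower=0.0, upper=4.0):
    """Predicted SCL from the linear model, clipped to [lower, upper].

    The clipping bounds are an illustrative assumption based on the
    0-4 range mentioned in the slide's notes.
    """
    raw = INTERCEPT + sum(SLOPES[name] * patient[name] for name in SLOPES)
    return min(max(raw, lower), upper)


predicted = predict_scl(patient_5)
print(f"Predicted SCL for patient #5: {predicted:.1f}")
```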

    23. Multiple imputation
        - For each missing value, impute m data points
            - m > 1, usually m = 5
            - For single imputation, m = 1
        - What's wrong with single imputation?
            - Imputed values are derived from the observed sample, so the imputed sample is more homogeneous
            - Variances are under-estimated
                - (10, 20, 30) → mean = 20, variance = 100
                - (10, 20, 30, 20, 20, 20) → mean = 20, variance = 40
            - More likely to yield biased results
        - Advantage of multiple imputation
            - Adds the variation across the m data sets back into the estimate of the variance
            - Results are less likely to be biased
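The variance example on this slide can be verified with the standard library. The sketch assumes three missing values that are all filled in with the observed mean, which is what the second list on the slide represents.

```python
from statistics import mean, variance

observed = [10, 20, 30]

# Single imputation with the sample mean: three missing values are all
# replaced by the observed mean (20).
completed = observed + [mean(observed)] * 3

print("observed :", mean(observed), variance(observed))
print("completed:", mean(completed), variance(completed))
```

The mean is unchanged, but the sample variance (divisor n - 1) falls from 100 to 40 because the imputed points add no spread of their own.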

    24. Multiple imputation, cont.
        - To impute is easy: repeat m times
        - To analyze is more complicated
            - Suppose m = 5
            - $\theta$ = the quantity of interest (mean, median, proportion, etc.)
            - $s$ = squared standard error $= se^2 = sd^2/N$
            - Derive $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ and $(s_1, s_2, s_3, s_4, s_5)$ from the 5 imputed data sets
        - Rubin (1987)
            - The combined $\bar\theta = (\theta_1 + \theta_2 + \theta_3 + \theta_4 + \theta_5)/5$
            - The combined $s = v_1 + [1 + (1/m)]\,v_2$
                - $v_1 = (s_1 + s_2 + s_3 + s_4 + s_5)/5$
                - $v_2$ = variance across $(\theta_1, \dots, \theta_5)$ $= \{(\theta_1 - \bar\theta)^2 + (\theta_2 - \bar\theta)^2 + (\theta_3 - \bar\theta)^2 + (\theta_4 - \bar\theta)^2 + (\theta_5 - \bar\theta)^2\}/(5 - 1)$
        - For more complicated analyses we need statistical software
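The combining rules on this slide translate almost directly into code. In the sketch below the function name `combine_rubin` and the five estimates and squared standard errors are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from statistics import mean, variance


def combine_rubin(estimates, squared_ses):
    """Combine m estimates and their squared standard errors.

    Returns the combined estimate and the combined squared standard
    error, v1 + (1 + 1/m) * v2, following the formulas on the slide.
    """
    m = len(estimates)
    combined_estimate = mean(estimates)
    v1 = mean(squared_ses)    # average within-imputation variance
    v2 = variance(estimates)  # between-imputation variance, divisor m - 1
    combined_s = v1 + (1 + 1 / m) * v2
    return combined_estimate, combined_s


# Hypothetical results from m = 5 imputed data sets.
thetas = [1.8, 2.0, 2.2, 1.9, 2.1]
s_values = [0.04, 0.04, 0.04, 0.04, 0.04]

theta_bar, s_combined = combine_rubin(thetas, s_values)
print(f"combined estimate: {theta_bar:.2f}")
print(f"combined squared standard error: {s_combined:.3f}")
```

With these made-up inputs, v1 = 0.04 and v2 = 0.025, so the combined squared standard error is 0.04 + 1.2 × 0.025 = 0.07, larger than any single data set would report on its own.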

    25. Multiple imputation: SOLAS
        - SOLAS 3.2 (Statistical Solutions Ltd.)
        - ~ $1000, no need for renewal
        - Recommended by Dr. Rubin
        - Can impute longitudinal data with both item missing and wave missing
        - Can impute many variables with missing data simultaneously (internal algorithm to form a monotone missing pattern)

    30. Multiple imputation: Stata
        - Stata version 7.0 and above (Stata Corporation, College Station, TX)
        - ~ $100 for UW faculty/students, no need for renewal
        - Free download: ice (or mice for version 7+) to impute missing values and micombine to analyze multiply imputed data sets (macros written by Dr. Patrick Royston)
        - Help → Search → choose "Search all" and type the keywords "multiple imputation" → click the links to download the macros

    34. Multiple imputation: SAS
        - SAS version 9.1+ (SAS Institute Inc., Cary, NC)
        - ~ $100 for UW faculty/students, needs to be renewed every year
        - PROC MI to impute missing values
        - PROC MIANALYZE to analyze multiply imputed data sets

    41. Summary
        - Results from data with missing values might be biased
        - Multiple imputation needs additional work but generally yields better results
        - Many statistical software packages have programs for imputing missing values and analyzing imputed data
