1. 1 Missing Data Analysis Multiple Imputation Ming-Yu Fan, PhD
April 30, 2008
2. 2 Outline Missing Mechanism
Multiple imputation
Propensity scores
Applications
SOLAS
Stata
SAS
3. 3 How to deal with missing data Do nothing
Exclude subjects with missing values
→ Expand the results from the sub-sample to
the whole sample
Make a guess, replace with the guessed values
Fill in with a simple guess, e.g. sample mean
→ Expand the results from the sub-sample to
the whole sample
→ Similar to do nothing
Fill in with better guessed values
→ Imputation
4. 4 Missing pattern (1)
5. 5 Missing pattern (1) cont. Do nothing
Figure 1.1 has the same mixed color
as figure 1.2
Fill in with sample mean
The two mixed colors are still identical
Fill in with better guessed values
Nice but not necessary
6. 6 Missing pattern (2)
7. 7 Missing pattern (2) cont. Do nothing
Figure 1.1 and figure 1.2 have different mixed
colors
Fill in with sample mean
The two figures will have different mixed colors
Fill in with better guessed values
Necessary
If we can correctly identify the slices, we can
better guess a missing value according to the
observed value in the same slice
The final mixed colors might be similar
8. 8 Missing pattern (3)
9. 9 Missing pattern (3) cont. Do nothing
Figures 3.1 and 3.2 have different mixed colors
Fill in with sample mean
Two figures will have different mixed colors
Fill in with better guessed values
Even if we can identify the slices, we won't be
able to correctly guess the missing value
Ex: we won't be able to guess the missing
brown piece based on the grey observed piece
10. 10 Missing Mechanism Missing Completely At Random (MCAR)
The best scenario
Simple approaches can yield unbiased results
Missing At Random (MAR)
The less ideal scenario
More advanced approaches are necessary; can yield
unbiased results
Not Missing At Random (NMAR)
The worst scenario
No approach can correct the biased results
11. 11 Missing Mechanism - cont. How do we determine the missing mechanism?
Since missing information is not observed, we really
don't know what the complete sample looks like, and
thus we can't say for sure whether the missing data are MCAR,
MAR, or NMAR
Can we guess?
E.g. (1): income missing because the patient's income is
extremely high
E.g. (2): gender missing because the reviewer forgot to
fill in the information
E.g. (3): SCL-20 items missing because older men don't
like to answer some of the questions
13. 13 Missing Mechanism - cont. Respondents and non-respondents are different in some
baseline characteristics
Probably not MCAR
MAR or NMAR? The truth is, most of the time we can't really
determine that
A common approach: unless a missing value is clearly NMAR
(e.g. income), we would assume MAR for the missing data
and apply methods that are based on this assumption (e.g.
propensity score weighting, multiple imputation)
In reality, MCAR is uncommon, hence the "do
nothing" and "fill in with sample mean" approaches are likely
to introduce bias
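
The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the covariate x, the outcome y, the missingness rules, and the helper names (cc_mean, stratified_mean) are all hypothetical and do not come from the slides. It compares the "do nothing" (complete-case) mean under MCAR, MAR, and NMAR, and shows that under MAR an estimate built within slices of the observed covariate gets close to the full-sample mean again.

```python
import random

rng = random.Random(2008)  # fixed seed so the run is repeatable
N = 20000

# Hypothetical data: x is always observed, y is the outcome that may be missing.
data = []
for _ in range(N):
    x = rng.random()
    y = x + rng.random() - 0.5
    data.append((x, y))

def clamp01(v):
    return min(max(v, 0.0), 1.0)

def cc_mean(rows):
    """Complete-case ("do nothing") mean of y."""
    return sum(y for _, y in rows) / len(rows)

full_mean = cc_mean(data)

# Assumed missingness rules: each gives the chance that y is observed.
mcar = [(x, y) for x, y in data if rng.random() < 0.5]               # ignores x and y
mar = [(x, y) for x, y in data if rng.random() < 1.0 - x]            # depends on observed x
nmar = [(x, y) for x, y in data if rng.random() < 1.0 - clamp01(y)]  # depends on y itself

def stratified_mean(observed, full, n_bins=10):
    """Average the observed y within slices of x, weighted by full-sample slice size."""
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        n_full = sum(1 for x, _ in full if lo <= x < hi)
        ys = [y for x, y in observed if lo <= x < hi]
        if ys:  # a slice with no observed y is skipped (would need care in practice)
            total += n_full * sum(ys) / len(ys)
    return total / len(full)

print("full sample      :", round(full_mean, 3))
print("MCAR, do nothing :", round(cc_mean(mcar), 3))
print("MAR, do nothing  :", round(cc_mean(mar), 3))
print("MAR, by slice    :", round(stratified_mean(mar, data), 3))
print("NMAR, do nothing :", round(cc_mean(nmar), 3))
```

In this setup the complete-case mean is close to the full-sample mean only under MCAR; under MAR the slice-based estimate recovers it, while under NMAR slicing on x does not remove the dependence on y itself.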
14. 14 Imputation Assumption: MAR
Challenges:
How to identify the slices
How to guess the missing values
15. 15 Imputation example How to impute the missing SCL for patient # 5?
Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
By age: (3.8+0.6)/2 = 2.2
By sex: 1.1
By education: 1.3
By race: (3.8 + 0.6 + 1.3)/3 = 1.9
By ADL: (1.1 + 1.3)/2 = 1.2
Who is/are in the same slice with #5?
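
The candidate imputations above are just means of different donor groups. A minimal sketch that reproduces the arithmetic, using only the SCL values printed on the slide (the dictionary name and layout are assumptions made for illustration):

```python
# Observed SCL values of the donors used for each candidate "slice",
# copied from the arithmetic on the slide.
donor_groups = {
    "sample mean": [3.8, 0.6, 1.1, 1.3],
    "age": [3.8, 0.6],
    "sex": [1.1],
    "education": [1.3],
    "race": [3.8, 0.6, 1.3],
    "ADL": [1.1, 1.3],
}

# The imputed value for patient #5 is the mean of the chosen donor group.
candidates = {
    name: round(sum(values) / len(values), 1)
    for name, values in donor_groups.items()
}

for name, value in candidates.items():
    print(f"{name:12s} -> {value}")
```

The spread of the results (1.1 to 2.2) is the point of the slide: the imputed value depends heavily on which subjects are treated as being in the same slice.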
16. 16 Propensity score Measure the similarity by the likelihood of being
observed/missing
Use logistic regression models to estimate this
likelihood
Dependent variable Z =
1 if a subject's outcome is observed
0 if a subject's outcome is missing
Independent variables = anything that might be
associated with whether the outcome is observed (Z=1) or missing (Z=0)
Demographic information
Baseline characteristics
17. 17 Propensity score cont. Model:
p = prob(Z=1)
log(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk
Z has no missing values
X1~Xk all must have non-missing values
Statistical significance of the βs is not important
The predicted p's derived from the model are the
propensity scores
18. 18 Propensity score example Dependent variable:
Y = 12-month SCL score
Z = 1 if Y is observed, Z = 0 if Y is missing
Independent variable:
X1 = age = Age
X2 = sex ( = 1 if male, = 0 if female) = Sex
X3 = number of chronic conditions = NumC
X4 = baseline SCL score = SCL00
Model:
log(p/(1-p)) = β0 + β1X1 + β2X2 + β3X3 + β4X4
Result:
β0 = 0.31; β1 = 0.003; β2 = -0.58; β3 = -0.25; β4 = 0.25
19. 19 Propensity score example log(p/(1-p)) =
(0.31) + (0.003)Age + (-0.58)Sex + (-0.25)NumC + (0.25)SCL00
Derive the propensity scores for subject A & B:
Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7
(0.31)+(0.003)*70+(-0.58)*1+(-0.25)*3+(0.25)*1.7 = -0.385
log(p/(1-p)) = -0.385 → p = 0.405
Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7
(0.31)+(0.003)*85+(-0.58)*0+(-0.25)*4+0.25*0.7 = -0.26
log(p/(1-p)) = -0.26 → p = 0.435
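
The two calculations above can be checked with a few lines of code. This sketch uses the fitted coefficients from the slide; the function name and the dictionary layout are assumptions made for illustration.

```python
import math

# Fitted coefficients as reported on the slide.
COEF = {"intercept": 0.31, "Age": 0.003, "Sex": -0.58, "NumC": -0.25, "SCL00": 0.25}

def propensity(age, sex, numc, scl00):
    """Return (log-odds, probability) that the 12-month SCL score is observed."""
    logit = (COEF["intercept"]
             + COEF["Age"] * age
             + COEF["Sex"] * sex        # 1 = male, 0 = female
             + COEF["NumC"] * numc
             + COEF["SCL00"] * scl00)
    p = 1.0 / (1.0 + math.exp(-logit))  # inverse of log(p/(1-p))
    return logit, p

logit_a, p_a = propensity(age=70, sex=1, numc=3, scl00=1.7)  # subject A
logit_b, p_b = propensity(age=85, sex=0, numc=4, scl00=0.7)  # subject B

print(round(logit_a, 3), round(p_a, 3))
print(round(logit_b, 3), round(p_b, 3))
```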
20. 20 Propensity score cont. We can compute the propensity score for every
subject, including those with missing outcome
We already know whether a subject's outcome is
observed or missing
Propensity scores do not predict the probability of
missing outcome in the sample
They estimate the likelihood/probability of having
the outcome observed for ANY subject with a similar
background measured by the independent variables
Subjects with close propensity scores are considered
similar (in the same slice)
21. 21 Imputation hot-deck How to impute the missing SCL for patient # 5?
4 strata → closest to #2 → impute with 0.6
2 strata → closest to both #2 and #3 → impute with a randomly
selected value from (0.6, 1.1)
The method is called Hot-Deck; #2, #3 are called donors
Common approach:
Stratify the sample by the propensity scores (e.g. 5 strata)
Randomly select a donor from the same stratum and impute the
missing value with the donor's observed value
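
The common approach can be sketched as follows. This is an illustration, not the algorithm of any particular package: the record layout, the equal-size strata, the function name hot_deck_impute, and the sample values are all assumptions.

```python
import random

def hot_deck_impute(records, n_strata=5, seed=0):
    """Impute missing 'y' with a random donor from the same propensity stratum.

    records: list of dicts with keys 'pscore' and 'y' (None when missing).
    Strata are equal-size groups after sorting by propensity score.
    Returns new records (sorted by pscore) with an 'imputed' flag.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["pscore"])
    n = len(ordered)
    result = []
    for k in range(n_strata):
        stratum = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        donors = [r["y"] for r in stratum if r["y"] is not None]
        for r in stratum:
            y, imputed = r["y"], False
            if y is None and donors:  # stays missing if the stratum has no donor
                y, imputed = rng.choice(donors), True
            result.append({**r, "y": y, "imputed": imputed})
    return result

# Hypothetical sample: two subjects have a missing outcome.
sample = [
    {"pscore": 0.1, "y": 3.0}, {"pscore": 0.2, "y": None},
    {"pscore": 0.3, "y": 2.0}, {"pscore": 0.4, "y": 2.5},
    {"pscore": 0.6, "y": 1.0}, {"pscore": 0.7, "y": None},
    {"pscore": 0.8, "y": 0.5}, {"pscore": 0.9, "y": 1.5},
]

completed = hot_deck_impute(sample, n_strata=2, seed=1)
for row in completed:
    print(row)
```

Running the function several times with different seeds gives different donors, which is what multiple imputation relies on to reflect the uncertainty of the imputed values.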
22. 22 Imputation regression Model:
SCL = b0 + b1Age + b2Sex + b3Edu + b4Race + b5ADL + b6Pain + b7Comorb
Fit the model to the observed data:
b0=-0.8, b1=0.02, b2=-0.5, b3=0.05, b4=-0.6, b5=0.1, b6=0.1, b7=0.05
Plug in the information of #5 to derive the predicted value:
(-0.8) + (0.02)70 + (-0.5)0 + (0.05)21 + (-0.6)1 + (0.1)2
+ (0.1)4 + (0.05)3
= 1.8 = predicted SCL
Notes:
Predicted values might be out of the natural range of the outcome
(e.g. SCL > 4 or SCL < 0)
For ordinal outcomes, the predicted values might not be plausible
(e.g. number of people living in the house = 2.7)
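
The plug-in step for patient #5 can be reproduced directly from the fitted coefficients. The clipping helper at the end is an assumption added here to illustrate the first note (keeping predictions inside the 0 to 4 range); the slides do not prescribe it.

```python
# Fitted coefficients and patient #5's covariates, as given on the slide.
coef = {"intercept": -0.8, "Age": 0.02, "Sex": -0.5, "Edu": 0.05,
        "Race": -0.6, "ADL": 0.1, "Pain": 0.1, "Comorb": 0.05}
patient5 = {"Age": 70, "Sex": 0, "Edu": 21, "Race": 1,
            "ADL": 2, "Pain": 4, "Comorb": 3}

# Predicted SCL = intercept + sum of coefficient * covariate.
predicted = coef["intercept"] + sum(coef[name] * value
                                    for name, value in patient5.items())

def clip_to_range(value, low=0.0, high=4.0):
    """Assumed fix-up: force a prediction back into the natural SCL range."""
    return min(max(value, low), high)

print(round(predicted, 2))         # matches the slide's predicted SCL
print(clip_to_range(4.7))          # an out-of-range prediction is pulled back
```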
23. 23 Multiple imputation For each missing value, impute m data points
m >1, usually m = 5
For single imputation m = 1
What's wrong with single imputation?
Imputed values are derived from the observed sample,
and thus the imputed sample is more homogeneous
Variances are under-estimated
(10, 20, 30) → mean = 20, variance = 100
(10, 20, 30, 20, 20, 20) → mean = 20, variance = 40
More likely to yield biased results
Advantage of multiple imputation
Add the variation across the m data sets back to the
estimation of variance
Result is less likely to be biased
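
The variance shrinkage in the example above is easy to verify. A minimal sketch, assuming the three missing values are each filled in with the observed sample mean:

```python
import statistics

observed = [10, 20, 30]

# Single imputation with the observed mean: three missing values become 20.
fill = statistics.mean(observed)
single_imputed = observed + [fill] * 3

var_observed = statistics.variance(observed)        # sample variance, n - 1 denominator
var_imputed = statistics.variance(single_imputed)

print(statistics.mean(single_imputed), var_observed, var_imputed)
```

The mean is unchanged, but the variance drops from 100 to 40 because the filled-in values add no spread of their own.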
24. 24 Multiple imputation cont. Imputing is easy: repeat m times
Analyzing is more complicated
Suppose m = 5
θ = the estimate of interest (mean, median, proportion, etc)
s = squared standard error = se² = sd²/N
Derive (θ1, θ2, θ3, θ4, θ5) and (s1, s2, s3, s4, s5) from the 5
imputed data sets
Rubin 1987
The combined θ = (θ1 + θ2 + θ3 + θ4 + θ5)/5
The combined s = v1 + [1+(1/m)]v2
v1 = (s1 + s2 + s3 + s4 + s5)/5
v2 = variance across (θ1, θ2, θ3, θ4, θ5)
= {(θ1-θ)² + (θ2-θ)² + (θ3-θ)² + (θ4-θ)² + (θ5-θ)²}/(5-1)
For more complicated analyses we need statistical
software
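
For a single estimate, the combining rules above fit in a short function. This is a sketch for any m > 1; the function name and the input numbers are made up for illustration and are not results from the slides.

```python
import statistics

def combine_rubin(estimates, squared_ses):
    """Combine m estimates and their squared standard errors (Rubin 1987).

    Returns (combined estimate, combined squared standard error).
    """
    m = len(estimates)
    combined = statistics.mean(estimates)
    v1 = statistics.mean(squared_ses)        # average within-imputation variance
    v2 = statistics.variance(estimates)      # between-imputation variance, m - 1 denominator
    return combined, v1 + (1 + 1 / m) * v2

# Hypothetical results from m = 5 imputed data sets.
estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
squared_ses = [0.5, 0.5, 0.5, 0.5, 0.5]

theta, s = combine_rubin(estimates, squared_ses)
print(theta, s, s ** 0.5)  # estimate, squared standard error, standard error
```

The combined squared standard error is larger than any single-data-set value because the between-imputation variance is added back.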
25. 25 Multiple imputation - SOLAS SOLAS 3.2 (Statistical Solutions Ltd.)
~ $1000, no need for renewal
Recommended by Dr. Rubin
Can impute longitudinal data with both item
missing and wave missing
Can impute many variables with missing data
simultaneously (internal algorithm to form
monotone missing pattern)
30. 30 Multiple imputation - Stata Stata version 7.0 and above
(Stata Corporation, College Station TX)
~ $100 for UW faculty/students, no need for renewal
Free download: ice (or mice for version 7+) to
impute missing values and micombine to analyze
multiple imputed data sets
(macros written by Dr. Patrick Royston)
Help → Search → choose "Search all" and type
keywords "multiple imputation" → click the links to
download the macros
34. 34 Multiple imputation - SAS SAS version 9.1+
(SAS Institute Inc., Cary, NC)
~ $100 for UW faculty/students, need to
renew every year
PROC MI to impute missing values
PROC MIANALYZE to analyze multiple
imputed data sets
41. 41 Summary Results from data with missing values might be biased
Multiple imputation requires additional
work but generally yields better results
Many statistical software packages have programs
available for imputing missing values and
analyzing imputed data