
Handling Missing Data


Presentation Transcript


  1. Handling Missing Data Estie Hudes Tor Neilands UCSF Center for AIDS Prevention Studies March 16, 2007

  2. Presentation Overview • Overview of concepts and approaches to handling missing data • Missing data mechanisms - how data came to be missing • Problems with popular ad-hoc missing data handling methods • A more modern, better approach: Maximum likelihood (FIML/direct ML) • More on modern approaches: the EM algorithm • Another modern approach: Multiple Imputation (MI) • Extensions and Conclusions

  3. Types of Missing Data • Item-missing: respondent is retained in the study, but does not answer all questions • Wave-missing: respondent is observed at intermittent waves • Drop-out: respondent ceases participation and is never observed again • Combinations of the above

  4. Methods of Handling Missing Data • First method: Prevention of missing cases (e.g., loss to follow-up) and individual item non-response • Second method: Ad-hoc approaches (e.g., listwise/casewise deletion) • Third method: Maximum likelihood-based approaches (e.g., direct ML) and related approaches (e.g., restricted ML)

  5. Prevention of Missing Data • Minimize individual item non-response • CASI and A-CASI may prove helpful • Interviewer-administered surveys • Avoid self-administered surveys where possible • Minimize loss to follow-up in longitudinal studies by incorporating good participant tracking protocols, appropriate use of incentives, and reducing respondent burden

  6. Ad-hoc Approaches to Handling Missing Data • Listwise deletion (a.k.a. complete-case analysis) • Pairwise deletion (a.k.a. available-case analysis) • Dummy variable adjustment (Cohen & Cohen) • Single imputation • Replacement with variable or participant means • Regression • Hot deck

  7. Modern Approaches of Handling Missing Data • Maximum likelihood (FIML/direct ML) • EM algorithm • Multiple imputation (MI) • Selection models and pattern-mixture models for non-ignorable data • Weighting • We will confine our discussion to Direct ML, EM algorithm and Multiple Imputation

  8. A Tour of Missing Data Mechanisms • How did the data become incomplete or missing? • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Not Ignorable Non-Response (NMAR; non-ignorable missingness; informative missingness) • Influential article: Rubin (1976) in Biometrika

  9. Missing Data Mechanisms: Missing Completely at Random • Pr(Y is missing|X,Y) = Pr(Y is missing) • If incomplete data are MCAR, the cases with complete data are then a random subset of the original sample. • A good situation to be in if you have missing data, because listwise deletion of the cases with incomplete data is generally justified. • A downside is loss of statistical power, especially when the number of cases with complete data is a small fraction of the original number of cases.

  10. Missing Data Mechanisms: Missing at Random • Pr(Y is missing|X,Y) = Pr(Y missing|X) • Within each level of X, the probability that Y is missing does not depend on the numerical value of Y. • Data are MCAR within each level of X. • MAR is a much less restrictive assumption than MCAR.

  11. Missing Data Mechanisms: Not Missing at Random • If incomplete data are neither MCAR nor MAR, the data are considered NMAR or non-ignorable. • Missing data mechanism must be modeled to obtain good parameter estimates. • Heckman’s selection model is one example of NMAR modeling. Pattern mixture models are another NMAR approach. • Disadvantages of NMAR modeling: Requires high level of knowledge about missingness mechanism; results often highly sensitive to the choice of NMAR model selected.

  12. Missing Data Mechanisms: Examples (1) • Scenario: Measuring systolic blood pressure (SBP) in January and February (Schafer and Graham, 2002, Psychological Methods, 7(2), 147-177) • MCAR: Data missing in February at random, unrelated to SBP level in January or February or any other variable in the study - missing cases are a random subset of the original sample’s cases. • MAR: Data missing in February because the January measurement did not exceed 140 - cases are randomly missing data within the two groups: SBP > 140 and SBP <= 140. • NMAR: Data missing in February because the February SBP measurement did not exceed 140. (SBP taken, but not recorded if it is <= 140.) Cases’ data are not missing at random.
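A small simulation can make these three mechanisms concrete. The sketch below is an illustration (not from the original slides): it generates the January/February SBP scenario with numpy, imposes each mechanism, and shows how the observed-data mean is unbiased under MCAR but pushed upward under MAR and NMAR, because low readings are preferentially missing.

```python
import numpy as np

rng = np.random.default_rng(2007)
n = 10_000

# Correlated January/February systolic blood pressure (SBP) readings.
jan = rng.normal(125, 15, n)
feb = 0.7 * jan + rng.normal(37.5, 10, n)

# MCAR: February values missing with a fixed probability of 0.3.
mcar = np.where(rng.random(n) < 0.3, np.nan, feb)

# MAR: February missing when the *January* reading was <= 140.
mar = np.where(jan <= 140, np.nan, feb)

# NMAR: February missing when the *February* reading itself was <= 140.
nmar = np.where(feb <= 140, np.nan, feb)

for label, y in [("full", feb), ("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{label:>5}: mean of observed Feb SBP = {np.nanmean(y):6.1f}")
# MCAR tracks the full-data mean; the MAR and NMAR observed means are
# inflated because only (mostly) high-SBP cases remain observed.
```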

  13. Missing Data Mechanisms: Examples (2 ) • Scenario: Measuring Body Mass Index (BMI) of ambulance drivers in a longitudinal context (Heitjan, 1997, AJPH, 87(4), 548-550). • MCAR: Data missing at follow-up because participants were out on call at time of scheduled measurement, i.e., reason for data missingness is unrelated to outcome or other measured variables - missing cases are a random subset of the population of all cases. • MAR: Data missing at follow-up because of high BMI and embarrassment at initial visit, regardless of whether participant gained or lost weight since baseline, i.e., reason for data missingness is related to BMI, a measured variable in the study. • NMAR: Data missing at follow-up because of weight gain since last visit (assuming weight gain is unrelated to other measured variables in the study).

  14. More on Missing Data Mechanisms • Ignorable data missingness - occurs when data are incomplete due to MCAR or MAR process • If incomplete data arise from an MCAR or MAR data missingness mechanism, there is no need for the analyst to explicitly model the missing data mechanism (in the likelihood function), as long as the analyst uses software programs that take the missingness mechanism into account internally (several of these will be mentioned later) • Even if data missingness is not fully MAR, methods that assume MAR usually (though not always) offer lower expected parameter estimate bias than methods that assume MCAR (Muthén, Kaplan, & Hollis, Psychometrika, 1987).

  15. Ad-hoc Methods Unraveled (1) • Listwise deletion: delete all cases with a missing value on any of the variables in the analysis; use only complete cases. • OK if missing data are MCAR • Parameter estimates unbiased • Standard errors appropriate • But can result in substantial loss of statistical power • Biased parameter estimates if data are MAR • Robust to NMAR for predictor variables; in logistic regression models, also robust to NMAR for the outcome variable (slopes only)

  16. Ad-hoc Methods Unraveled (2) • Pairwise deletion: use all available cases for computation of any sample moment • For computation of means, use all available data for each variable; • For computation of covariances, use all available data on pairs of variables. • Can lead to non-positive definite var-cov matrices because it uses different pairs of cases for each entry. • Can lead to biased standard errors under MAR.
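The non-positive-definite risk is easy to demonstrate. In the hedged sketch below (illustrative data, not from the slides), each pair of variables is observed on a different subset of cases; pandas' DataFrame.cov(), which computes each entry from the available pairs, then returns a matrix that is not a valid covariance matrix.

```python
import numpy as np
import pandas as pd

# Three variables; each pair is observed on a *different* subset of cases.
df = pd.DataFrame({
    "x": [1, 2, 3, 1, 2, 3, np.nan, np.nan, np.nan],
    "y": [1, 2, 3, np.nan, np.nan, np.nan, 1, 2, 3],
    "z": [np.nan, np.nan, np.nan, 1, 2, 3, 3, 2, 1],
})

# pandas computes each covariance from whichever cases have both
# variables present (pairwise / available-case deletion).
cov = df.cov()
print(cov)
print("eigenvalues:", np.linalg.eigvalsh(cov.values))
# One eigenvalue is negative, so the pairwise matrix is not positive
# definite (the implied correlations even exceed 1 in magnitude), and
# regression or factor analysis based on it can fail outright.
```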

  17. Ad-hoc Methods Unraveled (3) • Dummy variable adjustment • Advocated by Cohen & Cohen (1985) 1. When X has missing values, create a dummy variable D to indicate a complete case versus a case with missing data. 2. When X is missing, fill in a constant c. 3. Regress Y on X and D (and other non-missing predictors). • Produces biased coefficient estimates (see Jones' 1996 JASA article)

  18. Ad-hoc Methods Unraveled (4) • Single imputation (of missing values) • Mean substitution - by variable or by observation • Regression imputation (i.e., replacement with conditional means) • Hot deck: Pick "donor" cases within homogeneous strata of observed data to provide data for cases with unobserved values. • These methods lead to biased parameter estimates (e.g., means, regression coefficients) and to variance and standard error estimates that are biased downward. One exception: Rubin (1987) provides a hot-deck based method of multiple imputation that may return unbiased parameter estimates under MAR. • Otherwise, these methods are not recommended.
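The downward bias in variance from mean substitution is easy to see directly. The sketch below (illustrative, not from the slides) mean-imputes 30% MCAR missingness and compares variances; the imputed points sit exactly at the mean and contribute no spread.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 5_000)           # true variance = 100

miss = rng.random(x.size) < 0.30        # 30% MCAR missingness
x_obs = np.where(miss, np.nan, x)

# Mean substitution: replace every missing value with the observed mean.
x_imp = np.where(miss, np.nanmean(x_obs), x_obs)

print("variance, full data    :", round(np.var(x, ddof=1), 1))
print("variance, mean-imputed :", round(np.var(x_imp, ddof=1), 1))
# The imputed column's variance is roughly 30% too small, matching the
# fraction of values pinned at the mean.
```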

  19. Modern Methods: Maximum Likelihood (1) When there are no missing data: • Uses the likelihood function to express the probability of the observed data, given the parameters, as a function of the unknown parameter values. • Example: L(θ) = p(x1,y1|θ) p(x2,y2|θ) ··· p(xn,yn|θ) = ∏i p(xi,yi|θ), where p(x,y|θ) is the (joint) probability of observing (x,y) given a parameter θ, for a sample of n independent observations. The likelihood function is the product of the separate contributions to the likelihood from each observation. • MLEs are the values of the parameters which maximize the probability of the observed data (the likelihood).

  20. Modern Methods: Maximum Likelihood (2) • Under ordinary conditions, ML estimates are: • consistent (approximately unbiased in large samples) • asymptotically efficient (have the smallest possible variance) • asymptotically normal (one can use normal theory to construct confidence intervals and p-values). • The ML approach can be easily extended to MAR situations: the contribution to the likelihood from an observation with X missing is the marginal, obtained by summing (or integrating) the joint probability over the unobserved X: g(yj|θ) = Σx p(x,yj|θ) • This likelihood may be maximized like any other likelihood function. Often labeled FIML or direct ML.

  21. Modern Methods: Maximum Likelihood (3) • Available software to perform FIML estimation: • AMOS - Analysis of Moment Structures • Commercial program licensed as part of SPSS (CAPS has a 10-user license for this product) • Fits a wide variety of univariate and multivariate linear regression, ANOVA, ANCOVA, and structural equation (SEM) models. • http://www.smallwaters.com • Mx - Similar to AMOS in capabilities, less user-friendly • Freeware: http://views.vcu.edu/mx • LISREL - Similar to AMOS, more features, less user-friendly • Commercial program: http://www.ssicentral.com

  22. Modern Methods: Maximum Likelihood (4) • Available software: • lEM - Loglinear and Event history analysis with Missing data (Jeroen Vermunt) • Freeware DOS program downloadable from the Internet • http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html • Fits log-linear, logit, latent class, and event history models with categorical predictors. • Mplus • Similar capabilities to AMOS (commercial) • Less easy to use than AMOS, but more general modeling features. • http://www.statmodel.com

  23. Modern Methods: Maximum Likelihood (5) • Longitudinal data analysis software options (not discussed): • Normally distributed outcomes • SAS PROC MIXED • S-PLUS LME • Stata XTREG and XTREGAR and XTMIXED • Poisson • Stata XTPOIS • Negative Binomial • Stata XTNBREG • Logistic • Stata XTLOGIT

  24. Modern Methods: Maximum Likelihood (6) • Software for longitudinal analyses (continued) • General modeling of clustered and longitudinal data • Stata GLLAMM add-on command • SAS PROC NLMIXED • S-PLUS NLME • What about Generalized Estimating Equations (GEE) for analysis of longitudinal or clustered data with missing observations? • Assumes incomplete data are MCAR. See Hedeker & Gibbons, 1997, Psychological Methods, p. 65, and Heitjan, AJPH, 1997, 87(4), 548-550. • Can be extended to accommodate the MAR assumption via a weighting approach developed by Robins, Rotnitzky, & Zhao (JASA, 1995), but it has limited applicability.

  25. Maximum Likelihood Example (1): 2 x 2 Table with missing data

                Vote (Y=V)
 Sex (X=S)    Yes    No   Missing  (complete)   Cell probabilities
 Male          28    45      10       (73)        p11   p12
 Female        22    52      15       (74)        p21   p22
 Total         50    97      25      (147)        (probabilities sum to 1)

Likelihood function:
 L(p11, p12, p21, p22) = (p11)^28 (p12)^45 (p21)^22 (p22)^52 (p11+p12)^10 (p21+p22)^15
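This likelihood can be maximized numerically with any general-purpose optimizer. The minimal sketch below (an illustration with scipy, not the lEM run shown on the following slides) maximizes exactly the likelihood above, with the margin-only counts contributing through their row sums; it reproduces the full-information MLEs.

```python
import numpy as np
from scipy.optimize import minimize

# Observed counts: complete cases plus sex-only (vote missing) margins.
n11, n12, n21, n22 = 28, 45, 22, 52   # (male/female) x (yes/no)
m1, m2 = 10, 15                        # vote missing, by sex

def neg_loglik(p):
    p11, p12, p21, p22 = p
    return -(n11 * np.log(p11) + n12 * np.log(p12) +
             n21 * np.log(p21) + n22 * np.log(p22) +
             m1 * np.log(p11 + p12) + m2 * np.log(p21 + p22))

res = minimize(neg_loglik, x0=[0.25] * 4, method="SLSQP",
               bounds=[(1e-9, 1)] * 4,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1})
print(res.x.round(4))   # approx. [0.1851 0.2975 0.1538 0.3636]
```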

  26. Maximum Likelihood Example (2): 2 x 2 Table with missing data (slide contents were a figure; not reproduced in the transcript)

  27. Maximum Likelihood Example (3): Using lEM for the 2 x 2 Table

Input (partial):
 * R = response (NM) indicator
 * S = sex; V = vote
 man 2        * 2 manifest variables
 res 1        * 1 response indicator
 dim 2 2 2    * with two levels
 lab R S V    * and label them
 sub SV S     * defines these two subgroups
 mod SV       * model for complete data
 dat [28 45 22 52    * subgroup SV
      10 15]         * subgroup S

Output (partial):
 *** (CONDITIONAL) PROBABILITIES ***
 * P(SV) *    (second column: complete data only)
 1 1   0.1851 (0.0311)   0.1905 (0.0324)
 1 2   0.2975 (0.0361)   0.3061 (0.0380)
 2 1   0.1538 (0.0297)   0.1497 (0.0294)
 2 2   0.3636 (0.0384)   0.3537 (0.0394)
 * P(R) *
 1   0.8547
 2   0.1453

  28. Maximum Likelihood Example (1): Continuous outcome & multiple predictors • Data on American colleges and universities from US News and World Report • N = 1302 colleges • Available from http://lib.stat.cmu.edu/datasets/colleges • Described on p. 21 of Allison (2001)

  29. Maximum Likelihood Example (2): Continuous outcome & multiple predictors • Outcome: gradrat - graduation rate (1,204 non-missing cases) • Predictors • csat - combined average scores on verbal and math SAT (779 non-missing cases) • lenroll - natural log of the number of enrolling freshmen (1,297 non-missing cases) • private - 1 = private; 0 = public (1,302 non-missing cases) • stufac - ratio of students to faculty (x 100; 1,300 non-missing cases) • rmbrd - total annual cost of room and board (thousands of dollars; 1,300 non-missing cases) • act - mean ACT scores (714 non-missing cases)

  30. Maximum Likelihood Example (3) Continuous outcome & multiple predictors • Predict graduation rate from • Combined SAT • Number of enrolling freshmen on log scale • Student-faculty ratio • Private or public institution classification • Room and board costs • Use a linear regression model • ACT score included as an auxiliary variable • Use AMOS and Mplus to illustrate direct ML

  31. Maximum Likelihood Example (4): Continuous outcome & multiple predictors AMOS: Two methods for model specification • Graphical user interface • AMOS BASIC programming language

Results (assuming joint MVN):

 Regression Weights       Estimate     S.E.      C.R.        P
 GradRat <-- CSAT           0.0669   0.0048   13.9488   0.0000
 GradRat <-- LEnroll        2.0832   0.5953    3.4995   0.0005
 GradRat <-- StuFac        -0.1814   0.0922   -1.9678   0.0491
 GradRat <-- Private       12.9144   1.2769   10.1142   0.0000
 GradRat <-- RMBRD          2.4040   0.5481    4.3856   0.0000

  32. Maximum Likelihood Example (5): Continuous outcome & multiple predictors Mplus example (assuming joint MVN)

INPUT INSTRUCTIONS
  TITLE:    P. Allison 6/2002 Oakland, CA Missing Data Workshop non-normal example
  DATA:     FILE IS D:\My Documents\Papers\Allison-Paul\usnews.txt;
  VARIABLE: NAMES ARE csat act stufac gradrat rmbrd private lenroll;
            USEVARIABLES ARE csat act stufac gradrat rmbrd private lenroll;
            MISSING ARE ALL . ;
  ANALYSIS: TYPE = general missing h1 ;
            ESTIMATOR = ML ;
  MODEL:    gradrat ON csat lenroll stufac private rmbrd ;
            gradrat WITH act ;
            csat WITH lenroll stufac private rmbrd act ;
            lenroll WITH stufac private rmbrd act ;
            stufac WITH private rmbrd act ;
            private WITH rmbrd act ;
            rmbrd WITH act ;
  OUTPUT:   patterns ;

  33. Maximum Likelihood Example (6): Continuous outcome & multiple predictors Mplus results (assuming joint MVN)

MODEL RESULTS
                Estimates    S.E.   Est./S.E.
 GRADRAT ON
    CSAT           0.067   0.005      13.954
    LENROLL        2.083   0.595       3.501
    STUFAC        -0.181   0.092      -1.969
    PRIVATE       12.914   1.276      10.118
    RMBRD          2.404   0.548       4.387

  34. Maximum Likelihood Example (7): Continuous outcome & multiple predictors • Mplus example for continuous, non-normal data • Uses a sandwich estimator robust to non-normality • Specify MLR instead of ML as the estimator • The Mplus MLR estimator assumes MCAR missingness and finite fourth-order moments (i.e., finite kurtosis); initial simulation studies show low bias with MAR data

                Estimates    S.E.   Est./S.E.
 GRADRAT ON
    CSAT           0.067   0.005      13.312
    LENROLL        2.083   0.676       3.083
    STUFAC        -0.181   0.093      -1.950
    PRIVATE       12.914   1.327       9.735
    RMBRD          2.404   0.570       4.215

  35. Maximum Likelihood Summary • ML advantages: • Provides a single, deterministic set of results appropriate under MAR data missingness. • Well-accepted method for handling missing values (e.g., for grant writing). • Generally fast and convenient. • ML disadvantages: • Parametric: may not always be robust to violations of distributional assumptions (e.g., multivariate normality). • Only available for some models via canned software (would need to program other models). • Most readily available for continuous outcomes and ordered categorical outcomes. • Available for Poisson or Cox regression with continuous predictors in Mplus, but requires numerical integration, which is time-consuming and can be challenging to use, especially with large numbers of variables.

  36. Modern Methods: EM Algorithm (1) • EM algorithm proceeds in two steps to generate ML estimates for incomplete data: Expectation and Maximization. The steps alternate iteratively until convergence is attained. • Seminal article by Dempster, Laird, & Rubin (1977), Journal of the Royal Statistical Society, Series B, 39, 1-38. Early treatment by H.O. Hartley (1958), Biometrics, 14(2), 174-194. • Goal is to estimate sufficient statistics that can then be used for substantive analyses. In normal theory applications these would be the means, variances and covariances of the variables (first and second moments of the normal distributions of the variables). • Example from Allison, pp. 19-20: For a normal theory regression scenario, consider four variables X1 - X4 that have some missing data on X3 and X4.

  37. Modern Methods: EM Algorithm (2) • Starting Step (0): • Generate starting values for the means and covariance matrix. Can use the usual formulas with listwise or pairwise deletion. • Use these values to calculate the linear regression of X3 on X1 and X2. Similarly for X4. • Expectation Step (1): • Use the linear regression coefficients and the observed data for X1 and X2 to generate imputed values of X3 and X4.

  38. Modern Methods: EM Algorithm (3) • Maximization Step (2): • Use the newly imputed data along with the original data to compute new estimates of the sufficient statistics (e.g., means, variances, and covariances) • Use the usual formula to compute the mean • Use modified formulas to compute variances and covariances that correct for the usual underestimation of variances that occurs in single imputation approaches. • Cycle through the expectation and maximization steps until convergence is attained (i.e., sufficient statistic values change only negligibly from one iteration to the next).
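For the 2 x 2 voting table from slide 25, EM takes an especially transparent form and converges to the same estimates as direct ML. The sketch below is an illustration (not from the slides): the E-step allocates each sex's margin-only counts across Yes/No in proportion to the current probabilities, and the M-step re-estimates the cell probabilities from the completed table.

```python
import numpy as np

# Complete-case counts and sex-only margins (vote missing), from slide 25.
counts = np.array([[28.0, 45.0], [22.0, 52.0]])  # rows: male/female; cols: yes/no
margins = np.array([10.0, 15.0])                  # vote missing, by sex

p = np.full((2, 2), 0.25)   # starting values
for _ in range(200):
    # E-step: split each sex's margin-only cases across yes/no in
    # proportion to the current within-row (conditional) probabilities.
    expected = counts + margins[:, None] * p / p.sum(axis=1, keepdims=True)
    # M-step: re-estimate cell probabilities from the completed table.
    p_new = expected / expected.sum()
    if np.abs(p_new - p).max() < 1e-10:   # converged: changes negligible
        break
    p = p_new

print(p.round(4))   # approx. [[0.1851 0.2975] [0.1538 0.3636]], as in lEM
```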

  39. Modern Methods: EM Algorithm (4) • EM Advantages: • Only needs to assume incomplete data arise from a MAR process, not MCAR • Fast (relative to MCMC-based multiple imputation approaches) • Applicable to a wide range of data analysis scenarios • Uses all available data to estimate sufficient statistics • Fairly robust to departures from joint multivariate normality • Provides a single, deterministic set of results • May be all that is needed for non-inferential analyses (e.g., Cronbach's alpha or exploratory factor analysis) • Lots of software (commercial and freeware)

  40. Modern Methods: EM Algorithm (5) • EM Disadvantage: • Produces correct parameter estimates, but standard errors for inferential analyses will be biased downward because analyses of EM-generated data assume all data arise from a complete data set without missing information. The analyses of the EM-based data do not properly account for the uncertainty inherent in imputing missing data. • Recent work by Meng provides a method by which appropriate standard errors may be generated for EM-based parameter estimates • Bootstrapping may also be used to overcome this limitation

  41. Modern Methods: Multiple Imputation (1) • What is unique about MI: We impute multiple data sets to analyze, not a single data set as in single imputation approaches • Use the EM algorithm to obtain starting values for MI • The differences between the imputed data sets capture the uncertainty due to imputing values • The actual values in the imputed data sets are less important than analysis results combined across all data sets • Several MI advantages: • MI yields consistent, asymptotically efficient, and asymptotically normal estimators under MAR (same as direct ML) • MI-generated data sets may be used with any kind of software or model

  42. Modern Methods: Multiple Imputation (2) • The MI point estimate is the mean of the estimates across the M imputed data sets: Q̄ = (1/M) Σ Qi • The MI variance estimate is the sum of Within and Between imputation variation: T = W + (1 + 1/M) B, where W = (1/M) Σ Vi (within) and B = (1/(M-1)) Σ (Qi - Q̄)² (between) • (Qi and Vi are the parameter estimate and its variance in the ith imputed dataset)

  43. Modern Methods: Multiple Imputation (3) • Imputation model vs. analysis model • The imputation model should include any auxiliary variables (i.e., variables that are correlated with other variables that have incomplete data; variables that predict data missingness) • The analysis model should contain a subset of the variables from the imputation model and address issues such as categorical and non-normal data • Texts that discuss MI in detail: • Little & Rubin (2002, John Wiley and Sons): A seminal classic • Rubin (1987, John Wiley and Sons): Non-response in surveys • J. L. Schafer (1997, Chapman & Hall): Modern and updated • P. Allison (2001, Sage Publications series #136): A readable and practical overview of and introduction to MI and missing data handling approaches

  44. Modern Methods: Multiple Imputation (4) • Multivariate normal imputation approach • MI approaches exist for multivariate normal data, categorical data, mixed categorical and normal variables, and longitudinal/clustered/panel data. • The MV normal approach is most popular because it performs well in most applications, even with somewhat non-normal input variables (Schafer, 1997) • Variable transformations can further improve imputations • For each variable with missing data, estimate the linear regression of that variable on all other variables in the data set. • Using a (typically noninformative) Bayesian prior distribution, regression parameters are drawn from their Bayesian posterior distribution. The estimated regression equations are used to generate predicted values for missing data points.

  45. Modern Methods: Multiple Imputation (5) • Multivariate normal imputation approach (continued) • Add to each predicted value a random draw from the residual normal distribution to reflect uncertainty due to incomplete data. • Obtaining Bayesian posterior random draws is the most complex part of the procedure. Two approaches: • Data augmentation - implemented in NORM and PROC MI • Uses a Markov-Chain Monte Carlo (MCMC) approach to generate the imputed values • A variant of Data augmentation - implemented in ice (and MICE) • Uses a Gibbs sampler and switching regressions approach (Fully Conditional Specification - FCS) to generate the imputed values (van Buuren) • Sampling Importance/Resampling (SIR) - implemented in Amelia and a user-written macro in SAS (sirnorm.sas); claimed to be faster than data augmentation-based approaches. • “The relative superiority of these methods is far from settled” (Allison, 2001, p. 34)
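The core of one proper-imputation draw can be sketched in a few lines. The toy below is an illustration under simplifying assumptions: it conditions on point estimates of the regression parameters, whereas full data augmentation also draws those parameters from their posterior. The key idea from the slide survives intact: the imputed value is a prediction plus a random residual draw.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
y[rng.random(n) < 0.3] = np.nan          # 30% of y missing (MCAR here)

obs = ~np.isnan(y)
# Fit the imputation regression of y on x using the observed cases.
X = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
resid_sd = np.std(y[obs] - X @ beta, ddof=2)

# Proper imputation: predicted value PLUS a random residual draw.
miss = ~obs
y_imp = y.copy()
y_imp[miss] = (beta[0] + beta[1] * x[miss]
               + rng.normal(0, resid_sd, miss.sum()))
print("variance of y, imputed vs observed:",
      round(np.var(y_imp, ddof=1), 2), round(np.var(y[obs], ddof=1), 2))
# Without the residual draw (plain regression imputation), the imputed
# y's would sit on the regression line and understate the variance.
```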

  46. Modern Methods: Multiple Imputation (6) • Steps in using MI • Select variables for the imputation model - use all variables in the analysis model, including any dependent variable(s), and any variables that are associated with variables that have missing data or the probability of those variables having missing data (auxiliary variables), in part or in whole. • Transform non-normal continuous variables to attain normality (e.g., skewed variables) • Select a random number seed for imputations (if possible) • Choose number of imputations to generate • Typically 5 to 10: > 90% coverage & efficiency with 90% or less missing information in large sample scenarios with M = 5 imputations (Rubin, 1987) • Sometimes, however, you may need more imputations (e.g., 20 or more for some longitudinal scenarios). • You can compute the relative efficiency of parameter estimates as: relative efficiency = (1 / (1 + rate of missing information / number of imputations)) X 100. Several MI software programs output the missing information rates for parameters, allowing the analyst to easily compute relative efficiencies
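As a quick check of the relative efficiency formula above, the small helper below (illustrative) shows how efficiency degrades gracefully even with few imputations.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Rubin's relative efficiency (%) for a missing-information rate
    gamma (between 0 and 1) and m imputations."""
    return 100 / (1 + gamma / m)

for gamma in (0.1, 0.3, 0.5, 0.9):
    print(f"gamma={gamma:.1f}: M=5 -> {relative_efficiency(gamma, 5):.1f}%, "
          f"M=20 -> {relative_efficiency(gamma, 20):.1f}%")
# Even with 50% missing information, M = 5 imputations retain about
# 91% efficiency relative to an infinite number of imputations.
```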

  47. Modern Methods: Multiple Imputation (7) • Steps in using MI (continued): • Produce the multiply imputed data sets • Estimated parameters must be independent of initial values • Assess independence via autocorrelation and time series plots (when using MCMC-based MI programs) • Back-transform any previously transformed variables and round imputations for discrete variables. • Analyze each imputed data set using standard statistical approaches. If you generated M imputations (e.g., 5), you would perform M separate, but identical analyses (e.g., 5). • Combine results from the M multiply imputed analyses (using NORM, SAS PROC MIANALYZE, or Stata miest or micombine) using Rubin’s (1987) formulas to obtain a single set of parameter estimates and standard errors. Both p-values and confidence intervals may be generated.
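A produce-and-analyze cycle of this kind might look like the following sketch. It uses scikit-learn's IterativeImputer, a chained-equations/FCS-style imputer, purely as a stand-in for NORM or PROC MI; sample_posterior=True makes each of the M runs a distinct random draw rather than a deterministic fill. It assumes a pandas DataFrame df whose outcome column is named "y" (names are illustrative). Combining the resulting estimates with Rubin's rules follows in the sketch after the next slide.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def analyze_imputed(df: pd.DataFrame, m: int = 5):
    """Impute df m times and regress 'y' on the other columns each time.
    Returns lists of coefficient vectors and their standard errors."""
    estimates, std_errors = [], []
    for i in range(m):
        # A different random_state per run acts as the random number seed.
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(df),
                                 columns=df.columns)
        # Identical analysis model fit to each completed data set.
        X = sm.add_constant(completed.drop(columns="y"))
        fit = sm.OLS(completed["y"], X).fit()
        estimates.append(fit.params.to_numpy())
        std_errors.append(fit.bse.to_numpy())
    return estimates, std_errors
```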

  48. Modern Methods: Multiple Imputation (8) • Steps in using MI (continued) • Rules for combining parameter estimates and standard errors • A parameter estimate is the mean of the parameter estimates from the multiple analyses you performed. • The standard error is computed as follows: • Square the standard errors from the individual analyses. • Calculate the variance from the squared SEs across the M imputations. • Add the results of the previous two steps together, applying a small correction factor to the variance in the second step, and take the square root. • There is a separate F-statistic available for multiparameter inference (i.e., multi-DF tests of several parameters at once). • It is also possible to combine chi-square tests from the analysis of multiply imputed data sets.
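These combining rules are mechanical enough to write down directly. The sketch below (illustrative) pools the estimates and standard errors produced by the previous sketch using Rubin's (1987) formulas from slide 42.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine M sets of estimates/SEs with Rubin's rules.
    estimates, std_errors: lists of 1-D arrays, one per imputed data set."""
    q = np.asarray(estimates)       # shape (M, n_params)
    se = np.asarray(std_errors)
    m = q.shape[0]
    q_bar = q.mean(axis=0)          # pooled point estimate (mean of Qi)
    w = (se ** 2).mean(axis=0)      # within-imputation variance (mean Vi)
    b = q.var(axis=0, ddof=1)       # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance, with the 1/m correction
    return q_bar, np.sqrt(t)        # pooled estimates and standard errors

# Usage with the previous sketch:
# est, ses = analyze_imputed(df, m=5)
# coef, se = pool_rubin(est, ses)
```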

  49. Modern Methods: Multiple Imputation (9) • Is it wrong to impute the DV? • Yes, if performing single, deterministic imputation (methods historically used by econometricians) • No, if using the random draw approach of Rubin. In fact, leaving the DV out will bias the coefficients towards zero. • Given that the goal of MI is to reproduce all the relationships in the data as closely as possible, this can only be accomplished if the dependent variable(s) are included in the imputation process.

  50. Modern Methods: Multiple Imputation (10) • Available imputation software for data augmentation: • SAS: PROC MI and PROC MIANALYZE (demonstrated) • MI produces imputations • MIANALYZE combines results from analyses of imputed data into a single set of hypothesis tests • NORM - for MV normal data (J. L. Schafer) • Windows freeware • S-Plus MISSING library • R (add-in file) • CAT, MIX, and PAN - for categorical data, mixed categorical/normal data, and longitudinal or clustered panel data respectively (J. L. Schafer) • S-Plus MISSING library • R (add-in file) • LISREL - http://www.ssicentral.com (Windows, commercial)
