
Todd D. Little University of Kansas Director, Quantitative Training Program

On the Merits of Planning and Planning for Missing Data* *You’re a fool for not using planned missing data designs. Todd D. Little University of Kansas Director, Quantitative Training Program Director, Center for Research Methods and Data Analysis





Presentation Transcript


  1. On the Merits of Planning and Planning for Missing Data* • *You’re a fool for not using planned missing data designs Todd D. Little University of Kansas Director, Quantitative Training Program Director, Center for Research Methods and Data Analysis Director, Undergraduate Social and Behavioral Sciences Methodology Minor Member, Developmental Psychology Training Program Workshop presented 05-21-2012 @ Max Planck Institute for Human Development in Berlin, Germany Very Special Thanks to: Mijke Rhemtulla & Wei Wu crmda.KU.edu


  7. Learn about the different types of missing data • Learn about ways in which the missing data process can be recovered • Understand why imputing missing data is not cheating • Learn why NOT imputing missing data is more likely to lead to errors in generalization! • Learn about intentionally missing designs • Discuss imputation with large longitudinal datasets • Introduce a simple method for significance testing Road Map crmda.KU.edu

  8. Key Considerations • Recoverability • Is it possible to recover what the sufficient statistics would have been had there been no missing data? • (sufficient statistics = means, variances, and covariances) • Is it possible to recover what the parameter estimates of a model would have been had there been no missing data? • Bias • Are the sufficient statistics/parameter estimates systematically different from what they would have been had there not been any missing data? • Power • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data? crmda.KU.edu

  9. Types of Missing Data • Missing Completely at Random (MCAR) • No association with unobserved variables (selective process) and no association with observed variables • Missing at Random (MAR) • No association with unobserved variables, but maybe related to observed variables • Random in the statistical sense of predictable • Non-random (Selective) Missing (MNAR) • Some association with unobserved variables and maybe with observed variables crmda.KU.edu
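The three mechanisms on this slide can be made concrete with a small simulation. The Python sketch below (hypothetical variables, not from the workshop materials) generates two correlated variables and deletes values on y under each mechanism; only under MCAR does the mean of the observed y values stay unbiased, which is why complete-case analysis misleads under MAR and MNAR:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Two correlated variables: x is always observed, y may go missing.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# MCAR: missingness is a pure coin flip, unrelated to anything.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness on y depends only on the observed x.
mar_mask = rng.random(n) < 1 / (1 + np.exp(-x))

# MNAR: missingness on y depends on the unobserved y itself.
mnar_mask = rng.random(n) < 1 / (1 + np.exp(-y))

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name}: mean of observed y = {y[~mask].mean():+.3f} (true mean ~ 0)")
```

Under MAR the observed mean is biased too, but the bias is recoverable because the cause of missingness (x) is in the data; under MNAR it is not.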

  10. Effects of imputing missing data crmda.KU.edu

  11. Effects of imputing missing data Statistical Power: Will always be greater when missing data are imputed! crmda.KU.edu

  12. Bad Missing Data Corrections • List-wise Deletion • If a single data point is missing, delete subject • N is uniform but small • Variances biased, means biased • Acceptable only if power is not an issue and the incomplete data is MCAR • Pair-wise Deletion • If a data point is missing, delete paired data points when calculating the correlation • N varies per correlation • Variances biased, means biased • Matrix often non-positive definite • Acceptable only if power is not an issue and the incomplete data is MCAR www.crmda.ku.edu
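The N behavior of the two deletion methods is easy to see in pandas, which computes `DataFrame.corr()` pair-wise by default. A minimal sketch with simulated (hypothetical) data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["a", "b", "c"])

# Punch MCAR holes in each column independently (20% each).
for col in df.columns:
    df.loc[rng.random(n) < 0.2, col] = np.nan

# List-wise deletion: any row with a missing value is dropped entirely,
# so N is uniform but small.
listwise_n = len(df.dropna())

# Pair-wise deletion: each correlation uses whatever pairs are complete,
# so the effective N differs from pair to pair.
n_ab = df[["a", "b"]].dropna().shape[0]
n_ac = df[["a", "c"]].dropna().shape[0]

print("rows surviving list-wise deletion:", listwise_n)
print("pairs available for r(a,b):", n_ab, "| r(a,c):", n_ac)

pairwise_corr = df.corr()           # pair-wise complete observations
listwise_corr = df.dropna().corr()  # list-wise (complete cases only)
```

The pair-wise matrix mixes correlations computed on different subsamples, which is how it can end up non-positive definite.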

  13. Bad Imputation Techniques • Sample-wise Mean Substitution • Use the mean of the sample for any missing value of a given individual • Variances reduced • Correlations biased • Subject-wise Mean Substitution • Use the mean score of other items for a given missing value • Depends on the homogeneity of the items used • Is like regression imputation with regression weights fixed at 1.0 www.crmda.ku.edu
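The variance shrinkage from sample-wise mean substitution is easy to demonstrate; a Python sketch with simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=10, scale=2, size=1000)  # true variance = 4

# Make 30% of the values missing completely at random.
missing = rng.random(x.size) < 0.3
observed = x[~missing]

# Sample-wise mean substitution: plug the observed mean into every hole.
imputed = x.copy()
imputed[missing] = observed.mean()

# The mean is preserved, but a block of identical values at the mean
# shrinks the variance by roughly the missing-data fraction.
print("variance of observed values:     ", observed.var(ddof=1))
print("variance after mean substitution:", imputed.var(ddof=1))
```

With 30% of values replaced by a constant, the variance drops to roughly 70% of its correct value, and every correlation involving this variable is attenuated.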

  14. Questionable Imputation Techniques • Regression Imputation – Focal Item Pool • Regress the variable with missing data onto the other items selected for a given analysis • Variances reduced • Assumes data are MCAR or MAR • Regression Imputation – Full Item Pool • Variances reduced • Attempts to account for MNAR in as much as items in the pool correlate with the unobserved variables responsible for the missingness www.crmda.ku.edu

  15. Modern Missing Data Analysis MI or FIML • Rubin first suggested Multiple Imputation (MI) in 1978 and developed it more fully in 1987. • An approach especially well suited for use with large public-use databases. • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm. • Beginning in the 1980s, likelihood approaches developed. • Multiple-group SEM • Full Information Maximum Likelihood (FIML). • An approach well suited to more circumscribed models crmda.KU.edu

  16. Full Information Maximum Likelihood • FIML evaluates a casewise likelihood for every observation, using only the variables that observation actually has: the model-implied mean vector and covariance matrix are subset to each case’s observed variables. • Because each case’s likelihood is based on its own unique response pattern, there is no need to fill in the missing data. • The individual log-likelihoods are then summed to form the likelihood for the whole data set, which is maximized (equivalently, the −2 log-likelihood is minimized). • Cases with greater amounts of missing data contribute fewer terms to the combined likelihood than those with a more complete response pattern, so the loss of information is reflected in the estimates and standard errors. • Formally, FIML maximizes the sum of the casewise log-likelihoods, log Li = Ki − ½ log|Σi| − ½ (xi − μi)′ Σi⁻¹ (xi − μi), where μi and Σi contain only the elements corresponding to case i’s observed variables and Ki is a constant depending on the number of values observed for case i. crmda.KU.edu
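The casewise-likelihood idea can be sketched in a few lines of Python: subset the mean vector and covariance matrix to each case’s observed variables, then evaluate the multivariate normal log-density there. This is an illustrative sketch with made-up numbers, not the author’s code:

```python
import numpy as np

def casewise_loglik(row, mu, sigma):
    """Log-likelihood of one case using only its observed variables:
    subset mu and sigma to the non-missing entries, then evaluate the
    multivariate normal log-density (the FIML casewise contribution)."""
    obs = ~np.isnan(row)
    x = row[obs]
    mu_i = mu[obs]
    sigma_i = sigma[np.ix_(obs, obs)]
    diff = x - mu_i
    _, logdet = np.linalg.slogdet(sigma_i)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma_i, diff))

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

complete = np.array([0.3, -0.2])
partial = np.array([0.3, np.nan])   # second variable missing

ll_complete = casewise_loglik(complete, mu, sigma)
ll_partial = casewise_loglik(partial, mu, sigma)
print(ll_complete, ll_partial)
```

Note that the partial case contributes a univariate density (one term), the complete case a bivariate one, so no filling-in ever happens; each case is scored on exactly what it provides.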

  17. Multiple Imputation • Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences. • By filling in m separate estimates for each missing value we can account for the uncertainty in that datum’s true population value. • Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong’s (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II. • SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates). • After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin’s Rules (Rubin, 1987). crmda.KU.edu
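Rubin’s rules themselves are short enough to state in code. A sketch with illustrative numbers (not results from the slides):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their sampling variances with
    Rubin's (1987) rules: pooled estimate, total variance, and the
    within/between variance components."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    qbar = estimates.mean()        # pooled point estimate
    w = variances.mean()           # within-imputation variance
    b = estimates.var(ddof=1)      # between-imputation variance
    t = w + (1 + 1 / m) * b        # total variance
    return qbar, t, w, b

# Hypothetical results for one parameter across m = 5 imputed datasets.
est = [0.52, 0.49, 0.55, 0.50, 0.53]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
qbar, t, w, b = pool_rubin(est, var)
print(f"pooled estimate = {qbar:.3f}, total SE = {np.sqrt(t):.3f}")
```

The between-imputation component b is what carries the uncertainty about the missing values; with no disagreement across imputations, t collapses to the ordinary complete-data variance w.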

  18. Good Data Imputation Techniques • (But only if variables related to missingness are included in the analysis, or missingness is MCAR) • EM Imputation • Iterates between estimating the missing values and re-estimating the parameters • The E(xpectation) step fills in the missing values via regression-based prediction, given the current estimates of the means and covariance matrix • The M(aximization) step recalculates the complete-data means and covariance matrix from those estimated values • The E-step is then repeated for each variable, now regressing on the covariance matrix from the previous M-step • The two steps are repeated until the imputed estimates don’t differ from one iteration to the next • MCMC imputation is a more flexible (but computer-intensive) algorithm. crmda.KU.edu

  19. Good Data Imputation Techniques • (But only if variables related to missingness are included in the analysis, or missingness is MCAR) • Multiple (EM or MCMC) Imputation • Impute m (say 20) datasets • Each data set is based on a resampling plan of the original sample • Mimics a random selection of another sample from the population • Run your analyses m times • Pool the m sets of results (e.g., with Rubin’s rules) to get combined estimates and standard errors crmda.KU.edu

  20. Fraction Missing • Fraction Missing is a measure of efficiency lost due to missing data: the extent to which parameter estimates have greater standard errors than they would have had all data been observed. • It is a ratio of variances: the between-imputation variance (with its 1/m correction) divided by the total parameter variance. crmda.KU.edu

  21. Fraction Missing • Fraction of Missing Information (asymptotic formula): FMI ≈ (B + B/m) / T, where B is the between-imputation variance, m the number of imputations, and T = W + (1 + 1/m)B the total variance (W = within-imputation variance) • Varies by parameter in the model • Is typically smaller for MCAR than MAR data crmda.KU.edu
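The fraction of missing information for a parameter can be computed directly from its within- and between-imputation variances; a Python sketch using the standard asymptotic Rubin (1987) formula, with illustrative numbers:

```python
import numpy as np

def fraction_missing_info(estimates, variances):
    """Asymptotic fraction of missing information: the share of total
    variance attributable to between-imputation variance,
    FMI ~ (B + B/m) / T with T = W + (1 + 1/m) B."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    w = variances.mean()           # within-imputation variance
    b = estimates.var(ddof=1)      # between-imputation variance
    t = w + (1 + 1 / m) * b        # total variance
    return (1 + 1 / m) * b / t

# Hypothetical estimates of one parameter across m = 5 imputations.
fmi = fraction_missing_info([0.52, 0.49, 0.55, 0.50, 0.53],
                            [0.010, 0.011, 0.009, 0.010, 0.012])
print(f"fraction of missing information ~ {fmi:.3f}")
```

Because B is estimated from only m draws, FMI is noisy for small m; this is one reason modern advice favors more imputations (20–100) than the old default of 5.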

  22. Estimate Missing Data With SAS

  Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
    1     65     95     95    100     23     25     25     27
    2     10     10     40     25     25     27     28     27
    3     95    100    100    100     27     29     29     28
    4     90    100    100    100     30     30     27     29
    5     30     80     90    100     23     29     29     30
    6     40     50      .      .     28     27      3      3
    7     40     70    100     95     29     29     30     30
    8     95    100    100    100     28     30     29     30
    9     50     80     75     85     26     29     27     25
   10     55    100    100    100     30     30     30     30
   11     50    100    100    100     30     27     30     24
   12     70     95    100    100     28     28     28     29
   13    100    100    100    100     30     30     30     30
   14     75     90    100    100     30     30     29     30
   15      0      5     10      .      3      3      3      .

  crmda.KU.edu

  23. PROC MI

  PROC MI data=sample out=outmi seed=37851 nimpute=100;
     EM maxiter=1000;
     MCMC initial=em (maxiter=1000);
     Var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
  run;

  out=      Designates output file for imputed data
  nimpute=  # of imputed datasets (default is 5)
  Var       Variables to use in imputation

  crmda.KU.edu

  24. PROC MI output: Imputed dataset

  Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
    1             1     65     95     95    100     23     25     25     27
    2             1     10     10     40     25     25     27     28     27
    3             1     95    100    100    100     27     29     29     28
    4             1     90    100    100    100     30     30     27     29
    5             1     30     80     90    100     23     29     29     30
    6             1     40     50     21     12     28     27      3      3
    7             1     40     70    100     95     29     29     30     30
    8             1     95    100    100    100     28     30     29     30
    9             1     50     80     75     85     26     29     27     25
   10             1     55    100    100    100     30     30     30     30
   11             1     50    100    100    100     30     27     30     24
   12             1     70     95    100    100     28     28     28     29
   13             1    100    100    100    100     30     30     30     30
   14             1     75     90    100    100     30     30     29     30
   15             1      0      5     10      8      3      3      3      2

  crmda.KU.edu

  25. What to Say to Reviewers: • I pity the fool who does not impute • Mr. T • If you compute you must impute • Johnny Cochran • Go forth and impute with impunity • Todd Little • If math is God’s poetry, then statistics are God’s elegantly reasoned prose • Bill Bukowski crmda.KU.edu

  26. Planned missing data designs • In planned missing data designs, participants are randomly assigned to conditions in which they do not respond to all items, all measures, and/or all measurement occasions • Why would you want to do this? • Long assessments can reduce data quality • Repeated assessments can induce practice effects • Collecting data can be time- and cost-intensive • Less taxing assessments may reduce unplanned missingness crmda.KU.edu

  27. Planned missing data designs • Cross-Sectional Designs • Matrix sampling (brief) • Three-Form Design (and Variations) • Two-Method Measurement (very cool) • Longitudinal Designs • Developmental Time-Lag • Wave- to Age-based designs • Monotonic Sample Reduction • Growth-Curve Planned Missing crmda.KU.edu

  28. Multiple matrix sampling crmda.KU.edu

  29. Multiple matrix sampling Test a few participants on full item bank crmda.KU.edu

  30. Multiple matrix sampling Or, randomly sample items and people… crmda.KU.edu

  31. Multiple matrix sampling • Assumptions • The K items are a random sample from a population of items (just as N participants are a random sample from a population) • Limitations • Properties of individual items or relations between items are not of interest • Not used much outside of the ability-testing domain. crmda.KU.edu

  32. 3-Form Intentionally Missing Design • Graham, Taylor, Olchowski, & Cumsille (2006) • Raghunathan & Grizzle (1995) “split questionnaire design” • Wacholder et al. (1994) “partial questionnaire design” crmda.KU.edu

  33. 3-form design • What goes in the Common Set? crmda.KU.edu

  34. 3-form design: Example • 21 questions made up of 7 3-question subtests crmda.KU.edu

  35. 3-form design: Example • Common Set (X) crmda.KU.edu

  36. 3-form design: Example • Common Set (X) crmda.KU.edu

  37. 3-form design: Example • Set A I start conversations. I get stressed out easily. I am always prepared. I have a rich vocabulary. I am interested in people. crmda.KU.edu

  38. 3-form design: Example • Set B I am the life of the party. I get irritated easily. I like order. I have excellent ideas. I have a soft heart. crmda.KU.edu

  39. 3-form design: Example • Set C I am comfortable around people. I have frequent mood swings. I pay attention to details. I have a vivid imagination. I take time out for others. crmda.KU.edu
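The form construction in slides 34–39 can be laid out programmatically. The sketch below uses hypothetical item labels: each form carries the common set X plus two of the three rotating sets {A, B, C}, and the final check confirms that every pair of items co-occurs on at least one form, which is what keeps all pairwise covariances estimable under MAR:

```python
import itertools

# Hypothetical item sets for a 3-form design.
item_sets = {
    "X": ["x1", "x2", "x3"],             # common set, on every form
    "A": ["a1", "a2", "a3", "a4", "a5"],
    "B": ["b1", "b2", "b3", "b4", "b5"],
    "C": ["c1", "c2", "c3", "c4", "c5"],
}

# Each form = common set X + two of the three rotating sets.
forms = {
    "Form 1": item_sets["X"] + item_sets["A"] + item_sets["B"],
    "Form 2": item_sets["X"] + item_sets["A"] + item_sets["C"],
    "Form 3": item_sets["X"] + item_sets["B"] + item_sets["C"],
}

# Every pair of items must appear together on at least one form,
# so every pairwise covariance can be recovered from somebody's data.
all_items = [i for s in item_sets.values() for i in s]
for i, j in itertools.combinations(all_items, 2):
    assert any(i in f and j in f for f in forms.values()), (i, j)

for name, items in forms.items():
    print(name, "->", len(items), "items")
```

Each participant answers only 13 of the 18 items, yet no covariance is lost entirely; the missingness is MCAR by design because forms are randomly assigned.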


  44. Expansions of 3-Form Design • (Graham, Taylor, Olchowski, & Cumsille, 2006) crmda.KU.edu

  45. Expansions of 3-Form Design • (Graham, Taylor, Olchowski, & Cumsille, 2006) crmda.KU.edu

  46. 2-Method Planned Missing Design crmda.KU.edu

  47. 2-Method Measurement • Expensive Measure 1 • Gold standard: a highly valid (unbiased) measure of the construct under investigation • Problem: Measure 1 is time-consuming and/or costly to collect, so it is not feasible to collect from a large sample • Inexpensive Measure 2 • Practical: inexpensive and/or quick to collect on a large sample • Problem: Measure 2 is systematically biased, so it is not ideal crmda.KU.edu

  48. 2-Method Measurement • e.g., measuring stress • Expensive Measure 1 = collect spit samples, measure cortisol • Inexpensive Measure 2 = survey querying stressful thoughts • e.g., measuring intelligence • Expensive Measure 1 = WAIS IQ scale • Inexpensive Measure 2 = multiple choice IQ test • e.g., measuring smoking • Expensive Measure 1 = carbon monoxide measure • Inexpensive Measure 2 = self-report • e.g., Student Attention • Expensive Measure 1 = Classroom observations • Inexpensive Measure 2 = Teacher report crmda.KU.edu

  49. 2-Method Measurement • How it works • ALL participants receive Measure 2 (the cheap one) • A subset of participants also receive Measure 1 (the gold standard) • Using both measures (on a subset of participants) enables us to estimate and remove the bias from the inexpensive measure (for all participants) using a latent variable model crmda.KU.edu
