Missing Data in Epidemiology: Issues & Approaches. N. Birkett, September 4, 2014. Presented to EPI8166 (PhD seminar course).
“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns; the ones we don’t know we don’t know.” U.S. Secretary of Defense Donald H. Rumsfeld, Department of Defense news briefing, February 12, 2002
Example (1) • Full data: RR = 1.5 • Now, assume 50% of the females refuse to give you information about their final outcome (they decline that question but continue in the study). • Available data: RR = 1.5
Example (2) • We are missing the outcome status on 50% of the females • Using available data, we find: • Overall estimate of the rate of disease is biased • The RR for risk in females compared to males is OK • Why? • Subjects missing the outcome status are a random subset of all females • Female-specific incidence risk is correct • Prevalence of female sex is lower in study ‘complete cases’ • fails to reflect the 50:50 distribution of sex in the target population • External validity
Example (3) • Full data: RR = 1.5 • Now, assume 50% of the females refuse to give you information about their final outcome, BUT only people who do not develop the outcome refuse. • Available data: RR = 3.0
Example (4) • The chance the outcome data is missing depends on the true status of the outcome • Using available data, we find: • Overall estimate of the rate of disease is biased • The RR for risk in females compared to males is biased • Why? • Female-specific incidence risk is biased • Over-estimated • Prevalence of female sex is lower in study ‘complete cases’ • Fails to reflect the 50:50 distribution of sex in the target population
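The arithmetic behind Examples (1) to (4) can be sketched as follows. The cohort counts are hypothetical (the slide tables are not reproduced here); they are chosen only so that the full-data RR is 1.5, and expected counts are used instead of random draws.

```python
def risk_ratio(cases_f, n_f, cases_m, n_m):
    """Risk ratio for females relative to males."""
    return (cases_f / n_f) / (cases_m / n_m)

# Hypothetical full cohort (assumed counts): 50:50 sex distribution.
n_m, cases_m = 1000, 100   # male risk 0.10
n_f, cases_f = 1000, 150   # female risk 0.15

rr_full = risk_ratio(cases_f, n_f, cases_m, n_m)
risk_full = (cases_f + cases_m) / (n_f + n_m)

# Examples (1)-(2): a random 50% of females have no outcome recorded,
# so half of the female cases and half of the female non-cases are lost.
rand_n_f, rand_cases_f = n_f * 0.5, cases_f * 0.5
rr_random = risk_ratio(rand_cases_f, rand_n_f, cases_m, n_m)
risk_random = (rand_cases_f + cases_m) / (rand_n_f + n_m)

# Examples (3)-(4): 50% of females are missing, all of them non-cases,
# so every female case is retained but 500 non-cases are lost.
dep_n_f, dep_cases_f = n_f - 500, cases_f
rr_dependent = risk_ratio(dep_cases_f, dep_n_f, cases_m, n_m)
risk_dependent = (dep_cases_f + cases_m) / (dep_n_f + n_m)

for label, rr, risk in [
    ("full data", rr_full, risk_full),
    ("missing at random within females", rr_random, risk_random),
    ("missing only among non-cases", rr_dependent, risk_dependent),
]:
    print(f"{label}: RR = {rr:.2f}, overall risk = {risk:.3f}")
```

Under these assumed counts the RR stays at 1.5 when the missing females are a random subset and doubles to 3.0 when only non-cases are missing, while the overall risk differs from the full-data value in both scenarios.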
Why missing data matters (1) • All studies have missing data • People drop out of studies • People decline one of several questionnaires • People decline to complete certain questions (e.g. income) • People miss questions (pages get stuck together) • Lab tests fail • biological levels are ‘below threshold of detection’ • Missing data is usually not the focus of a study • In many cases, missing data is just ignored
Why missing data matters (2) • Failing to adjust properly for missing data can cause serious problems. • Introduce potential bias in parameter estimation • Weaken the generalizability of the results • Ignoring cases with missing data leads to the loss of information • Decreases statistical power • Increases standard errors • Failing to adjust data properly for missing values can make the data unsuitable for a statistical procedure • Can also make the statistical analyses vulnerable to violations of assumptions
Levels of missing data • Data can be missing at two ‘levels’ • Unit-level non-response • A subject included in the study declines to take part and provides no information at all. • Serious issue in much research • Mainly affects external generalizability • Not the focus of further discussions • Item-level non-response • Subject participates in the study • Fails to provide information for some items • Applies a skip sequence wrongly • Two pages get stuck together
Types of missing data patterns (1) • Three patterns are generally recognized: • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Missing Not at Random (MNAR or NMAR)
Types of missing data patterns (2) • Missing Completely at Random (MCAR) • The probability of a data value being missing is independent of all observed and non-observed data. • Missing data is a random sample of all data • Observed data is an unbiased estimator of the results from total data • Complete-case (listwise deletion) methods work fine • Can check for departures from MCAR by comparing the observed variables of cases with and without missing data • Example • Biosamples collected for genotyping • Some results are missing because the instrument failed for one batch of samples
Types of missing data patterns (3) • Missing at Random (MAR) • The probability of a data value being missing is related to observed data but not to non-observed data. • Can be analyzed using Multiple Imputation methods or likelihood-based methods • Example • Looking at prognostic value of SNPs for sub-types of breast cancer • Eligible subjects with advanced stage breast cancer (III/IV) were more likely to be missing SNP information • Subjects with advanced disease are less cooperative with the study. • Conditional on disease stage, the probability of missing the SNP is unrelated to the value of the SNP.
Types of missing data patterns (4) • Missing Not at Random (MNAR or NMAR) • The probability of a data value being missing is related to the unobserved values. • e.g. high values are more likely to be missing than low values • Can be analyzed using Multiple Imputation methods or likelihood-based methods • much more complex to use • requires modeling the process yielding the missing values • Example • Looking at study which requires measurement of tumor size. • Smaller tumors are less likely to have size recorded • Harder to measure size of small tumors • Requires more complex methods (e.g. MRI or PET scanning). • Probability of size being missing relates to the size of the tumor
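A small simulation may help separate the three mechanisms. This is a sketch under assumed values: the binary covariate `x` (think of it as advanced stage), the outcome model for `y`, and all missingness probabilities are hypothetical and not taken from the studies mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# x is always observed; y is the variable that can be missing.
x = rng.integers(0, 2, size=n)
y = 1.0 * x + rng.normal(size=n)          # true mean of y is 0.5

# MCAR: missingness ignores everything.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed x.
miss_mar = rng.random(n) < np.where(x == 1, 0.6, 0.1)
# MNAR: missingness depends on the (unobserved) value of y itself.
miss_mnar = rng.random(n) < np.where(y > 0.5, 0.6, 0.1)

def observed_mean(values, missing, subset=None):
    """Mean of the non-missing values, optionally within a subset."""
    keep = ~missing if subset is None else (~missing & subset)
    return values[keep].mean()

mean_mcar = observed_mean(y, miss_mcar)
mean_mar = observed_mean(y, miss_mar)
mean_mnar = observed_mean(y, miss_mnar)

# Conditional on x, MAR leaves the observed y values representative;
# MNAR does not.
mar_within_x1 = observed_mean(y, miss_mar, x == 1)    # true value is 1.0
mnar_within_x1 = observed_mean(y, miss_mnar, x == 1)  # true value is 1.0

print(f"complete-case mean of y: MCAR {mean_mcar:.3f}, "
      f"MAR {mean_mar:.3f}, MNAR {mean_mnar:.3f}")
print(f"mean of y given x = 1: MAR {mar_within_x1:.3f}, "
      f"MNAR {mnar_within_x1:.3f}")
```

In this setup the overall complete-case mean is close to the truth only under MCAR. Under MAR the bias disappears once the analysis conditions on `x`, which is the property that imputation and likelihood methods exploit; under MNAR it remains.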
Another classification of Patterns • Univariate missing data • Data are missing on only one variable in the analysis set • Monotonic missing data • You can rearrange the data so the following is true: • If a subject is missing data on variable ‘i’, then they are missing data on all variables after that • Longitudinal study with drop-outs. • Arbitrary missing data • Doesn’t meet the above conditions.
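Whether a data set has a monotonic pattern can be checked mechanically. The sketch below assumes a boolean matrix (rows are subjects, columns are variables, `True` means missing); the function name `is_monotone` is illustrative. It reorders the columns by how often they are missing and then checks that, within each subject, once a value is missing all later values are missing too.

```python
import numpy as np

def is_monotone(missing):
    """Return True if the columns can be ordered so that a subject who is
    missing variable i is also missing every variable after it."""
    missing = np.asarray(missing, dtype=bool)
    # If a monotone ordering exists, the sets of subjects missing each
    # variable are nested, so sorting by missing count recovers it.
    order = np.argsort(missing.sum(axis=0), kind="stable")
    m = missing[:, order].astype(int)
    # Within each row the indicator must never go from 1 back to 0.
    return bool(np.all(m[:, 1:] >= m[:, :-1]))

# Longitudinal drop-out: visits 1..3, nobody returns after dropping out.
dropout = [[False, False, False],
           [False, False, True],
           [False, True, True]]

# Same data with the columns stored in reverse order.
dropout_reversed = [row[::-1] for row in dropout]

# Arbitrary pattern: each subject misses a different single variable.
arbitrary = [[False, True, False],
             [True, False, False]]

print(is_monotone(dropout), is_monotone(dropout_reversed), is_monotone(arbitrary))
```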
Ignorability • And now, some confusing terminology • Rubin introduced the term ‘ignorability’ • If data is MCAR or MAR, then the mechanism which produces the missing data is not important and can be ignored in analysis. • He called this ‘Ignorability’ • This does not mean that the missing data can be ignored!
Missing data in the literature (1) • Peng et al (2006) • Education & psychology journals • 36% had no missing data • 48% had missing data • 16% were unclear • 97% used listwise deletion or pairwise deletion methods.
Missing data in the literature (2) • Klebanoff & Cole (2008) • Looked at the use of multiple imputation methods • 2 years of articles from Amer J Epidem, Annals Epi, Epidemiology & Int J Epidem • 1,105 original research articles • 16 papers (1.4%) used one of • Multiple Imputation (n=12) • Inverse probability weighting • Expectation-maximization algorithm • 99 papers contained the text string ‘imput’
Missing data in the literature (3) • Desai et al (2011) • Focused on molecular epidemiology studies in Cancer Epidemiology, Biomarkers and Prevention • 15-month period (2009-2010) • 278 eligible articles • 95% either had missing data or excluded cases with missing data • Only 23 papers (13%) used missing data methods for analysis • 9 dealt with ‘assays below detection limit’ • Single imputation • 7 used ‘missing data indicators’ • 26 (14%) reported differences between subjects with and without missing data.
Methods to handle missing data (1) • Need to decide on a model for missing data • MCAR • MAR • MNAR • If MNAR, how is the missingness related to the unobserved value? • Set a statistical model for the full data • Commonly assumed to be multivariate normal • Limiting, especially for categorical data • Some other form
Methods to handle missing data (2) • Complete Case (Listwise deletion) • Pairwise deletion (e.g. PROC CORR) • Corrected complete case method • Weighted regression model with complete cases • Weights related to inverse of probability that a case is complete • Fill the contingency table • Allocate subjects with missing values of a row/column to cells in proportion to the complete cases. • Replacement with the frequency or mean of complete cases • For categorical variables, create multiple variables (one per level) • Impute the percent of the group at each level • Indicator variable for missing data
Methods to handle missing data (3) • Simple/Single imputation • Multiple imputation • Full MLE methods • SAS can use FIML (Full Information Maximum Likelihood) • Assumes multivariate normality and MAR • Linked to Structural Equation Modeling (PROC CALIS) • Reweighting estimation equations • Used in complex survey studies • Sample weights are adjusted to reflect missing data patterns.
Complete Case (listwise deletion) • Subjects missing any values for any variable included in the analysis or model are excluded. • Most commonly used method (‘the default’) • Usually used without any thought to missing data patterns, etc. • Acceptable if data is MCAR • Leads to loss of sample size and reduced power/precision • Often produces reasonable results • especially if the amount of missing data is small • Can be strongly biased if data is MAR • Methodological results from • multiple papers and • theory
Pairwise deletion • Similar to casewise deletion BUT only subjects with missing data for variables involved in the specific analysis are subject to exclusion. • Consider a case where x1 is missing some data but x2 and x3 are complete. Suppose the analysis looks at these two models: • Model 1: Y = B2 * x2 + B3 * x3 • Model 2: Y = B1 * x1 + B2 * x2 + B3 * x3 • In the complete case method, subjects missing x1 will be excluded for both models. • Pairwise deletion: • All subjects would be used in model 1; • Some cases would be excluded in model 2. • Leads to different sub-sets being used for different analyses • Complicates interpretation. • PROC CORR in SAS uses this approach
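The difference in analysis samples between the two approaches can be seen in a few lines of pandas. The data frame below is a made-up toy example in which only x1 has missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): x1 is missing for two subjects.
df = pd.DataFrame({
    "y":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x1": [0.5, np.nan, 1.5, np.nan, 2.5, 3.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
})

model_1 = ["y", "x2", "x3"]          # does not involve x1
model_2 = ["y", "x1", "x2", "x3"]    # involves x1
all_vars = sorted(set(model_1) | set(model_2))

# Complete case: one analysis sample, defined by every variable in use.
complete_cases = df.dropna(subset=all_vars)
n_listwise = {"model_1": len(complete_cases), "model_2": len(complete_cases)}

# Pairwise deletion: each model keeps everyone with data on its own variables.
n_pairwise = {
    "model_1": len(df.dropna(subset=model_1)),
    "model_2": len(df.dropna(subset=model_2)),
}

print("listwise:", n_listwise)
print("pairwise:", n_pairwise)
```

Here the complete-case analysis uses 4 subjects for both models, while pairwise deletion uses 6 subjects for model 1 and 4 for model 2. pandas' `DataFrame.corr` likewise excludes missing values pair by pair, so each correlation in its output can rest on a different subset.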
Corrected Complete Case Method • Subjects missing any values for any variable included in analysis or model are excluded. • Regression models use weighted regression. • Weights are computed to reflect the inverse of the probability that a subject will have complete data. • Works OK if data is MAR but can be seriously biased if MAR does not hold. • Figuring out the weights is difficult • Finding SEs can be difficult • Results from Vach et al, 1991
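A minimal sketch of the weighting idea, under assumed values: the covariate `x` is missing more often when the fully observed outcome `y` is high (MAR), the probability of being complete is estimated within two groups of `y`, and complete cases are weighted by its inverse. The simulated model, the cut-point, and the helper `fit_line` are all hypothetical; this is not the estimator studied by Vach et al.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)     # true intercept 1, slope 2

# MAR: x is missing with a probability that depends only on the observed y.
high_y = y > 1.0
complete = rng.random(n) >= np.where(high_y, 0.8, 0.2)

def fit_line(x, y, w=None):
    """(Weighted) least squares fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(np.ones_like(y) if w is None else w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Unweighted complete-case fit.
beta_cc = fit_line(x[complete], y[complete])

# Estimate P(complete) within each group of y, then weight by its inverse.
p_complete = np.where(high_y, complete[high_y].mean(), complete[~high_y].mean())
weights = 1.0 / p_complete[complete]
beta_ipw = fit_line(x[complete], y[complete], weights)

print("complete case (intercept, slope):", np.round(beta_cc, 3))
print("inverse-probability weighted:    ", np.round(beta_ipw, 3))
```

In this simulation the weights are easy to estimate because the missingness model is known by construction; in real data, specifying that model is the hard part, and the naive standard errors from a weighted fit are not valid without further adjustment.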
Fill the Contingency Table • Under MAR, the distribution of subjects with missing data across the 4 cells in a contingency table is the same as the distribution of the complete cases. • Modify the Contingency table by allocating ‘counts’ of missing subjects to the table • Similar to the ‘corrected complete case’ method. • Leads to non-integer counts in the cells • Computing variance is tricky because standard formulae don’t work • Logistic Regression needs integer counts in the tables. • Results from Vach et al, 1991
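As an illustration of the allocation step, the sketch below takes a hypothetical 2x2 table of complete cases plus a count of subjects whose row is known but whose column is missing, and spreads those subjects across the cells of their row in proportion to the complete cases. All counts are made up.

```python
import numpy as np

# Complete cases (hypothetical): rows = row variable, columns = column variable.
complete = np.array([[30.0, 50.0],
                     [20.0, 70.0]])

# Subjects with a known row but a missing column value (hypothetical).
missing_col = np.array([10.0, 15.0])

# Within each row, the share of complete cases falling in each column.
row_props = complete / complete.sum(axis=1, keepdims=True)

# Allocate the missing subjects to cells in those proportions.
filled = complete + row_props * missing_col[:, None]

def odds_ratio(t):
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

print(np.round(filled, 2))
print("row totals:", filled.sum(axis=1))
print("OR complete:", round(odds_ratio(complete), 3),
      "OR filled:", round(odds_ratio(filled), 3))
```

The filled cells are non-integer, which is the source of the variance and software difficulties noted above. In this simple one-variable case each row is just rescaled, so the odds ratio is unchanged.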
Replacement with the frequency or mean of complete cases • Really a type of single imputation • For each subject with missing data, replace the missing value by the mean of the complete cases • For categorical data, define indicator variables • 0/1 if there is valid data • If data is missing, use the proportion of the complete cases with that level of the variable. • Leads to indicator variables which have non-integer components. • Strongly biased method, even with MAR • more biased than Complete Case method • Henry et al, 2013
Indicator variable for missing data • Treat ‘missing values’ as if they are a valid response to the questionnaire • Assign them a code value • Example (Do you drink alcohol?): • Yes: 1 • No: 2 • Missing: 3 • Analysis is done using three levels • 2 dummy variables • This is a very bad method which is strongly biased.
Indicator variable for missing data • Commonly used and commonly taught in epidemiology courses. • Studied by multiple authors (Vach, Greenland) • Very strongly biased in every study, including theoretical analyses • Consider two situations: • Variable is the main effect of interest:
Full population data: OR = 5.44 • Now, assume 30% of data is missing, MCAR. • Define the ‘missing data’ indicator variable. • What is the OR of Exp +ve to Exp –ve? It is still 5.44.
Confounding example (1) • So, we gain nothing by defining the missing category. • But, suppose the missing data is in a confounder. • Here is the population data. Crude table is as before (OR = 5.44): • Level 1: OR = 9.0 • Level 2: OR = 9.0 • Adjusted OR would be 9.0: strong confounding.
Confounding example (2) • Now, 30% of the data on the confounder is missing. We create the missing value indicator level. • Means we now have three 2x2 tables for our confounding analysis. • Level 1: OR = 9.0 • Level 2: OR = 9.0 • Level 3 (missing): OR = 5.44
Confounding example (3) • When there is no missing data, the ORs are as follows. Clearly, there is confounding, with the adjusted OR being 9.0. • When we have the missing indicator in the data, the adjusted OR is not 9.0 but around 8. Very strongly biased.
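The direction of this bias can be reproduced with a short calculation. The 2x2 counts below are hypothetical: they are chosen only so that each confounder level has OR = 9.0 and the crude table has OR = 5.44, and they are not the tables from the slides. The adjusted OR is computed with the Mantel-Haenszel estimator, which is an assumption about the adjustment method; MCAR is represented by expected counts (70% observed, 30% missing in every cell).

```python
import numpy as np

def odds_ratio(t):
    """OR for a 2x2 table [[a, b], [c, d]] (rows: exposed/unexposed,
    columns: cases/non-cases)."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio across strata."""
    num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)
    den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)
    return num / den

# Hypothetical population tables, one per confounder level.
level_1 = np.array([[50.0, 10.0], [50.0, 90.0]])   # OR = 9.0
level_2 = np.array([[90.0, 50.0], [10.0, 50.0]])   # OR = 9.0
crude = level_1 + level_2                          # OR = 5.44

or_adjusted_full = mantel_haenszel_or([level_1, level_2])

# 30% of confounder values missing completely at random (expected counts):
# each level keeps 70% of its subjects, and the 'missing' level is a
# 30% sample of the crude table.
observed_1 = 0.7 * level_1
observed_2 = 0.7 * level_2
missing_level = 0.3 * crude

or_adjusted_indicator = mantel_haenszel_or([observed_1, observed_2, missing_level])

print(f"crude OR: {odds_ratio(crude):.2f}")
print(f"adjusted OR, no missing data: {or_adjusted_full:.2f}")
print(f"adjusted OR, missing-indicator level: {or_adjusted_indicator:.2f}")
```

With these assumed counts the missing-indicator analysis gives an adjusted OR of about 7.45 instead of 9.0: the 'missing' stratum carries the crude, confounded OR and pulls the summary estimate toward it, even though the data are MCAR.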
Indicator Variable for Missing Data • This method has no role in handling missing data • Is strongly biased, even with MCAR data. • One core requirement for any method to address missing data is that it gives the ‘right’ answer for MCAR data.
Single Imputation (1) • Replace a missing value with an estimate of what the value should have been • Various methods are possible • Overall mean • Group-specific mean • Last observation carried forward (in follow-up studies) • An extreme value (e.g. missing = heavy alcohol use) • Regression modeling • Works best with monotonic missing data. • To impute Yj, regress Yj on Y1 to Yj-1 for all subjects with valid data for Yj • This gives a group of Betas with SEs. • Select a value of each beta at random from the distributions. • For single imputation, you often use the actual estimated Beta values • Use the regression equation to estimate the mean value of Yj for subjects with missing data. • Hot-deck imputation • MCMC methods
Single Imputation (2) • Hard to generate valid variance estimates • Greenland found regression-based single imputation to be subject to serious errors in the face of mis-specified models.
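A minimal sketch of regression-based single imputation, using the estimated betas directly (no random draw). The simulated model, the 30% MCAR missingness, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x = rng.normal(size=n)
y_true = 2.0 + 3.0 * x + rng.normal(size=n)

# 30% of y is missing completely at random.
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y_true)

# Step 1: regress y on x among subjects with valid y.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[~missing], y_obs[~missing], rcond=None)

# Step 2: replace each missing y by its predicted mean.
y_imputed = y_obs.copy()
y_imputed[missing] = X[missing] @ beta

var_true = y_true.var(ddof=1)
var_imputed = y_imputed.var(ddof=1)

print("estimated betas:", np.round(beta, 3))
print(f"variance of y: true {var_true:.2f}, after imputation {var_imputed:.2f}")
```

Because every imputed value sits exactly on the regression line, the completed data set is less variable than the real data would have been. Treating the imputed values as if they were observed therefore understates uncertainty, which is one reason valid variance estimates are hard to obtain from single imputation.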
Multiple Imputation (1) • MI handles missing data in three steps: • Impute missing data ‘m’ times to produce ‘m’ complete data sets; • Analyze each data set using a standard statistical procedure; • Combine the ‘m’ results into one using formulae from Rubin (1987) or Schafer (1997). • Most MI methods assume • MAR • Multivariate normality • If the assumptions are met, and if these three steps are done correctly, multiple imputation produces estimates that have nearly optimal statistical properties. They are: • Consistent (and, hence, approximately unbiased in large samples), • Asymptotically efficient (almost), and • Asymptotically normal.
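The combining step can be sketched with Rubin's rules for a single parameter: the pooled estimate is the average of the m estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name `pool_rubin` and the five example estimates are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors
    using Rubin's rules. Returns (pooled estimate, pooled SE)."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical results from m = 5 imputed data sets:
# a regression coefficient and its squared SE from each analysis.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

For these made-up inputs the pooled SE (about 0.26) exceeds the SE from any single imputed data set (0.20), because the disagreement between imputations is counted as extra uncertainty.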
Multiple Imputation (2) • One common method uses regression models in step #1 • Three kinds of variables are included in an imputation regression model: • Variables that are of theoretical interest, • Variables that are associated with the missing mechanism, & • Variables that are correlated with the variables with missing data. • Consider adding interaction terms for continuous variables. • Bayesian ideas can be used in step #1 • Regression based • Set a prior distribution for the regression parameters and error term • Fit model to generate posterior distribution • Select at random from posterior distribution to generate several imputation equations
Multiple Imputation (2, continued) • Bayesian ideas can be used in step #1 • MCMC (Markov Chain Monte Carlo) • Divide sample into subsets with the same missing data for variables • e.g. Group #1: missing x1 & x2; Group #2: missing x1, x3 & x4 • Fit regression models within each pattern of missingness • Impute using these models • Uses full data set to update means, variances and covariances • Make a random selection from the posterior distribution of these parameters • Update the regression models • Repeat • FCS (Fully Conditional Specification) • Similar to above but handles categorical data better • No strong theoretical justification
Multiple Imputation (3) • Most MI models assume variables are multivariate normal • Issues arise with categorical variables • Can treat as continuous and then round to generate a suitable categorical value • Round based on the normal approximation to the binomial distribution • Most studies find MI methods to be the most valid of the missing data methods • Some issues/questions • How many replicates (multiples) to include? • What variables to include in model? • How to handle non-normal variables, including categorical variables? • Software limitations
Full MLE methods (1) • Suppose we have a data set and we want to fit a regression model (could be linear, logistic, etc.) • With no missing data, we use Maximum Likelihood methods • n observations on k variables: • Based on regression model assumptions, the likelihood of the data can be given as: • θ is the set of parameters to estimate • We find the values of θ to make ‘L’ as big as possible
Full MLE methods (2) • What if we have some missing data? • Suppose y1 & y2 have missing data which is MAR • For a subject with missing values, we can not generate the likelihood contribution since we don’t know y1 & y2 • Instead, consider all possible values which they might have, combined with the probability of those values. • Add up the likelihood contribution for every possible value: • Substitute this into the MLE equation and estimate ‘θ’
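One standard way to write the likelihoods described in the two slides above is sketched here. The notation is an assumption: f is the joint density implied by the regression model, f* is the contribution of a subject missing y1 and y2, and subjects are ordered so that the first m have complete data. For continuous variables the sums become integrals.

```latex
% Complete data: n independent observations on k variables
L(\theta) = \prod_{i=1}^{n} f(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta)

% Subject i is missing y_1 and y_2 (MAR): add up the likelihood
% contribution over every possible value of the missing variables
f^{*}(y_{i3}, \ldots, y_{ik}; \theta)
  = \sum_{y_{1}} \sum_{y_{2}} f(y_{1}, y_{2}, y_{i3}, \ldots, y_{ik}; \theta)

% Likelihood to maximize: m complete subjects, n - m subjects missing y_1 and y_2
L(\theta) = \prod_{i=1}^{m} f(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta)
            \prod_{i=m+1}^{n} f^{*}(y_{i3}, \ldots, y_{ik}; \theta)
```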
Full MLE methods (3) • FIML is one way to do this in SAS • Part of PROC CALIS • Assumes multivariate normality • MPlus • Software which can handle non-linear models • Can use various regression models • Logistic • Poisson • Tobit • Cox • Etc.
Reweighting estimation equations • Discussed by Henry et al (2013) • Applies to complex surveys • Differential probability of selection from target population • Analysis requires ‘weights’ to adjust for this • Standard weights are proportional to the inverse of the probability of selection • With missing data, complete case analysis leads to different subsets for each set of variables • weights are incorrect • Adjust each weight to account for probability of being a complete case • Do analysis using new weights and complete cases only • Henry shows it produces very good estimates • Limited area of application
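The weight adjustment can be illustrated with a simple weighting-class scheme: within each class, the base weights of complete cases are divided by the weighted proportion of the class that is complete. This is one common form of nonresponse adjustment, shown as a sketch; it is not necessarily the estimator evaluated by Henry et al, and the weights, classes, and completeness flags are made up.

```python
import numpy as np

# Hypothetical survey records.
base_w = np.array([10.0, 10.0, 20.0, 20.0, 5.0, 5.0, 5.0, 5.0])  # 1 / P(selection)
cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])                         # weighting class
complete = np.array([True, False, True, True, True, True, False, False])

adj_w = np.zeros_like(base_w)
for c in np.unique(cls):
    in_class = cls == c
    keep = in_class & complete
    # Weighted proportion of the class with complete data.
    rate = base_w[keep].sum() / base_w[in_class].sum()
    # Complete cases absorb the weight of the incomplete cases in their class.
    adj_w[keep] = base_w[keep] / rate

for c in np.unique(cls):
    in_class = cls == c
    print(f"class {c}: base total = {base_w[in_class].sum():.1f}, "
          f"adjusted total over complete cases = {adj_w[in_class & complete].sum():.1f}")
```

After the adjustment, the complete cases in each class represent the same population total as the full class did under the base weights; incomplete cases keep a weight of zero and drop out of the analysis.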
Summary • Missing data can be very important • More than 5-10% of data missing is considered a potential source of serious bias • Need to consider the model which produces the missing data • ‘ad hoc’ methods are poor and should not be used • Multiple Imputation or Full MLE methods give excellent results in most situations • If missing data is MNAR, need to consider the model which gives rise to the missing data • If missingness is strongly related to value of variable, problem is complex
One suggested approach (1) § • Describe target population • Clearly describe derivation of analytic data set • Describe population characteristics of analytic data set, including missing values • Describe differences in population characteristics for subjects with valid and missing data for key variables §adapted from Desai et al Cancer Epidemiol Biomarkers Prev; 20(8), 2011
One suggested approach (2) • Investigate possible assumptions for missing data • Assume MCAR if • no data to suggest it is violated & • no mechanism to generate MNAR • Assume MAR if • MCAR is not acceptable, • no mechanism to generate MNAR & • candidate ancillary variables exist • Assume MNAR if • a priori knowledge exists that missing data are related to unknown values • Conduct a CC analysis
One suggested approach (3) • Choose an additional analysis as appropriate • For MAR, • Use Multiple Imputation with suitable ancillary variables • For MNAR, • Use Multiple Imputation, • Need to model the method which generated the missing data. • If a variable is limited by sensitivity of a lab detection device, • Use a likelihood-based method • Implement the additional analysis • Include all potential ancillary variables • Use SAS if you can postulate a joint distribution for ancillary variables • Use STATA or R (fully conditional method).
One suggested approach (4) • Perform sensitivity analyses • Do both CC & MI • Use different subsets of ancillary variables for MI • Use different models for MNAR missing generation • Interpret the results • If all analyses give same results, this is easy • If they differ, need to present a more complex result in the paper.