Missing data – issues and extensions

• For multilevel data we need to impute missing data for variables defined at higher levels.
• We need a valid procedure for discrete variables.
• It is useful to include sampling weights.
• Can we deal with partially missing data?

Consider the imputation stage with a set of multivariate responses
• We illustrate first with a simple model where the joint distribution of the responses is MVN and there are responses at 2 levels.
• To illustrate how such a model is specified, consider repeated measures of children’s heights (level 1), with the child’s adult height as the level 2 response.

Child heights + adult height
• Child height is modelled as a cubic polynomial with the intercept and slope random at level 2.
• Both of these random effects are correlated with the adult height random effect, giving a 3-variate normal distribution at level 2.
• This allows us to model level 1 and level 2 variables with missing data jointly (see Goldstein and Kounali, JRSSA, 2009).
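A sketch of how the model described above might be written. The notation is an assumption made for illustration and is not taken from the slides: y_ij is the height of child j at occasion i, t_ij is the polynomial argument (presumably age), and y^(2)_j is the adult height of child j.

```latex
\begin{aligned}
y_{ij} &= \sum_{h=0}^{3} \beta_h t_{ij}^{h} + u_{0j} + u_{1j} t_{ij} + e_{ij}, \\
y^{(2)}_{j} &= \alpha + u_{2j}, \\
(u_{0j},\, u_{1j},\, u_{2j})^{T} &\sim N_3(0,\, \Omega_u), \qquad e_{ij} \sim N(0,\, \sigma^2_e).
\end{aligned}
```

Under this reading, the off-diagonal terms of \Omega_u carry the correlation between the growth intercept and slope and the adult height effect, which is what lets a missing value at one level borrow information from the other.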

Results: Thus, if data are missing at either level 1 or level 2, they are imputed via the MCMC algorithm.

Mixed response types
• For ordered or unordered categorical data we can specify corresponding ‘latent normal’ distributions.
• For an ordered response we can consider a ‘probit’ threshold model: the cumulative probability of being in one of the categories 1,…,s is the standard normal distribution function evaluated at a threshold for category s, offset by the linear predictor, and the associated latent normal model has a standard normal residual.
• For a p-category unordered response we can define a latent (p-1)-variate normal.
• We can define MCMC steps that, given the observed categorical responses, sample an underlying normal or MVN. Note that these are further conditioned on the remaining set of (correlated) normal variables.
For details see Goldstein, H., Carpenter, J., Kenward, M., Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling (to appear).
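As a minimal sketch of the ordered (probit threshold) case, the latent normal value for an observed category can be drawn from a normal distribution truncated to the interval between that category's thresholds. The function names, the threshold values and the unit residual variance are assumptions made for illustration. The sketch also ignores the conditioning on the other correlated normal variables, which in practice would enter through the conditional mean and variance.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the probit link


def category_bounds(category, thresholds):
    """Return the (lower, upper) latent interval for a 1-based ordered category."""
    cuts = [-math.inf, *thresholds, math.inf]
    return cuts[category - 1], cuts[category]


def sample_latent(category, thresholds, mean, rng):
    """Draw z ~ N(mean, 1) truncated to the interval implied by the category."""
    lower, upper = category_bounds(category, thresholds)
    p_lower = STD_NORMAL.cdf(lower - mean)
    p_upper = STD_NORMAL.cdf(upper - mean)
    # Inverse-CDF draw; keep u strictly inside (0, 1) so inv_cdf is defined.
    u = rng.uniform(p_lower, p_upper)
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(u)


def to_category(z, thresholds):
    """Reverse transformation: map a latent value back to its ordered category."""
    return 1 + sum(z > cut for cut in thresholds)


# Hypothetical example: 3 ordered categories with two cut points.
rng = random.Random(0)
thresholds = [-0.5, 0.8]
z = sample_latent(2, thresholds, mean=0.1, rng=rng)
print(z, to_category(z, thresholds))
```

The same `to_category` mapping is what turns an imputed latent value back into a category once the MVN draw has been made.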

Imputation
• So now, with any mixture of categorical and normal variables at any level, we sample, for each MCMC iteration, a MVN set of variables including imputed values.
• Thus imputation is standard, and the reverse transformation is used to obtain imputed variables on the categorical scales.
• For non-normal continuous data we can use, e.g., a Box-Cox normalising transformation to sample a latent normal. Further extensions for Poisson and other discrete distributions are also available.
• Release 2.10 of MLwiN has a link to REALCOM that allows these extensions.
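For the non-normal continuous case, a minimal sketch of the Box-Cox transformation and its reverse is given below. The function names are illustrative, and the power parameter `lam` is assumed to be known; how it is chosen or estimated is not covered here.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with power parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a value on the transformed scale back to the original scale."""
    if lam == 0:
        return math.exp(z)
    base = lam * z + 1.0
    if base <= 0:
        raise ValueError("z is outside the range of the Box-Cox transform")
    return base ** (1.0 / lam)


# Hypothetical example: transform, then recover the original value.
z = box_cox(7.5, 0.5)
print(z, inverse_box_cox(z, 0.5))
```

An imputed value drawn on the transformed scale would be passed through `inverse_box_cox` to return it to the original measurement scale.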

Partially observed (coarsened) data:
• Where we have a prior (estimated) probability distribution (PD) for a missing discrete (or continuous) variable value, we simply insert an extra MCMC step that accepts the ‘standard’ MI value with a probability that is just the probability given by the PD. A corresponding step is used for normal data.
• This uses all of the data efficiently: no data are discarded so long as it is possible to assign a PD.
• Applications include record matching, rating scales with uncertain responses, etc.
• Several completed data sets are produced and combined as in standard MI.
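One way to read the extra acceptance step for a discrete variable is sketched below. This is an illustration under assumptions, not the MLwiN-REALCOM implementation: `draw_candidate` stands in for the ‘standard’ MI draw, `prior_pd` is the assigned PD, and a rejected candidate is simply redrawn. Under that reading, accepted values are distributed in proportion to the imputation-model probability times the PD probability.

```python
import random


def draw_with_prior(draw_candidate, prior_pd, rng, max_tries=10_000):
    """Redraw the standard MI value until one is accepted under the prior PD."""
    for _ in range(max_tries):
        candidate = draw_candidate()
        # Accept with probability equal to the PD value for this candidate.
        if rng.random() < prior_pd.get(candidate, 0.0):
            return candidate
    raise RuntimeError("no candidate accepted; check that the PD has support")


# Hypothetical example: the imputation model favours category 1,
# but the PD rules it out and splits its mass between 2 and 3.
rng = random.Random(42)
categories, model_probs = [1, 2, 3], [0.5, 0.3, 0.2]
prior_pd = {1: 0.0, 2: 0.5, 3: 0.5}


def standard_mi_draw():
    return rng.choices(categories, weights=model_probs)[0]


value = draw_with_prior(standard_mi_draw, prior_pd, rng)
print(value)
```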

Sampling weights (briefly)
• Consider a 2-level model.
• Write weights for the level 2 units and, for the j-th level 2 unit, weights for its level 1 units; these combine to give the final level 1 weights.
• A term based on the final level 1 weights is used as the level 1 random part explanatory variable instead of the constant (= 1).
• This will be used for imputation and for the model of interest (MOI).
• Ongoing work to incorporate this into MLwiN-REALCOM.
