
UNECE workshop on data editing and imputation, Vienna 22. April 2008

Evaluating Different Approaches for Multiple Imputation Under Linear Constraints. Jörg Drechsler (Institute for Employment Research, Germany) & Trivellore Raghunathan (University of Michigan).




Presentation Transcript


  1. Evaluating Different Approaches for Multiple Imputation Under Linear Constraints Jörg Drechsler (Institute for Employment Research, Germany) & Trivellore Raghunathan (University of Michigan) UNECE workshop on data editing and imputation, Vienna 22. April 2008

  2. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  3. The Problem • Some variables Y1, Y2, …, Yk have to sum up to a given total Yt • Examples • - turnover in different regions • - number of employees with different qualification levels • - investment in different subcategories

  4. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  5. The Data • The IAB Establishment Panel • The number of employees • with • - Yt: total number of employees • - Ywork: number of blue-collar + white-collar workers • - Ytrain: number of trainees • - Yexec: number of executives • - Yown: number of owners + working family members • - Ymarg: number of “marginal” workers not covered by social security • - Yother: number of other employees

  6. The Data • Summary Statistics • - data are heavily skewed • - most variables are semi-continuous • - low variation for the number of owners • - additional constraint: all variables >= 0

  7. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  8. A Little Background on Multiple Imputation • Generate random draws from the predictive distribution of the missing values given the observed values • Imputation in two steps • 1. Generate random draws for θ from its posterior distribution given the observed values • 2. Generate random draws for the missing values from the conditional predictive distribution given the drawn parameters • Drawing from 1. can be difficult • Solution: MCMC techniques
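The two steps can be illustrated with a deliberately small sketch. It assumes a univariate normal model with a noninformative prior, which is not the model used in the talk; the function name `impute_normal` and the toy data are hypothetical.

```python
import math
import random


def impute_normal(observed, n_missing, rng):
    """One imputation under a normal model with a noninformative prior (assumption)."""
    n = len(observed)
    mean = sum(observed) / n
    ss = sum((y - mean) ** 2 for y in observed)

    # Step 1: draw theta = (mu, sigma^2) from its posterior given the observed values.
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)  # chi-square draw with n - 1 df
    sigma2 = ss / chi2
    mu = rng.gauss(mean, math.sqrt(sigma2 / n))

    # Step 2: draw the missing values from the predictive distribution given theta.
    return [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n_missing)]


rng = random.Random(0)
observed = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]  # toy data, not from the survey
imputations = [impute_normal(observed, 2, rng) for _ in range(5)]  # m = 5 data sets
```

Because the parameters are redrawn for every imputed data set, the imputations reflect parameter uncertainty as well as residual noise.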

  9. Gibbs Sampling • Generate random draws from conditional univariate distributions • P(Y1|Y-1,θ1) • … • P(Yk|Y-k,θk) • Iteration provides draws from the joint distribution • Imputation in two steps for every univariate distribution • Imputation model can vary for different variable types
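A minimal sketch of this cycling scheme for two variables follows. It assumes each conditional is a simple linear regression with a flat prior; the helper `impute_from_regression` and the toy data are hypothetical, and the talk's actual models are much richer.

```python
import math
import random


def impute_from_regression(target, predictor, missing, rng):
    """Redraw target[i] for i in `missing` from a regression on `predictor`."""
    obs = [i for i in range(len(target)) if i not in missing]
    n = len(obs)
    xbar = sum(predictor[i] for i in obs) / n
    ybar = sum(target[i] for i in obs) / n
    sxx = sum((predictor[i] - xbar) ** 2 for i in obs)
    slope = sum((predictor[i] - xbar) * (target[i] - ybar) for i in obs) / sxx
    rss = sum((target[i] - ybar - slope * (predictor[i] - xbar)) ** 2 for i in obs)

    # Step 1: draw the regression parameters from their posterior (flat prior).
    sigma2 = rss / rng.gammavariate((n - 2) / 2, 2.0)
    intercept = rng.gauss(ybar, math.sqrt(sigma2 / n))
    beta = rng.gauss(slope, math.sqrt(sigma2 / sxx))

    # Step 2: draw the missing values given the drawn parameters.
    for i in missing:
        mean = intercept + beta * (predictor[i] - xbar)
        target[i] = rng.gauss(mean, math.sqrt(sigma2))


rng = random.Random(1)
y1 = [2.0, None, 3.1, 4.2, 5.0, None, 6.8, 8.1]  # toy data
y2 = [1.1, 1.9, None, 4.1, 4.8, 6.2, None, 7.9]
miss1 = {i for i, v in enumerate(y1) if v is None}
miss2 = {i for i, v in enumerate(y2) if v is None}

# Starting values: fill each gap with the observed mean of its variable.
for y, miss in ((y1, miss1), (y2, miss2)):
    start = sum(v for v in y if v is not None) / (len(y) - len(miss))
    for i in miss:
        y[i] = start

for _ in range(20):  # cycle through the univariate conditionals
    impute_from_regression(y1, y2, miss1, rng)  # P(Y1 | Y2, theta1)
    impute_from_regression(y2, y1, miss2, rng)  # P(Y2 | Y1, theta2)
```

Swapping the regression for a logit or other model inside one of the calls is what allows the imputation model to vary by variable type.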

  10. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  11. The Methodology • Five imputation methods • - simple imputation of all variables • - independent imputation considering semi-continuity • - nested imputation of the proportions • - non-Bayesian Dirichlet imputation • - Bayesian Dirichlet/Multinomial imputation

  12. Simple Imputation • Impute all variables independently • Transform all continuous variables by taking the cube root • Ignore semi-continuity • Use simple linear models • Use same models as for independent imputation under semi-continuity • Fulfill constraints by: • setting if • Down-weighting all imputed subcategories if Yt is observed or
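The correction formulas did not survive in the transcript, so the following is only one plausible reading: negative imputed values are set to zero, and the imputed subcategories are rescaled proportionally so that everything adds up to the observed total. The function `enforce_total` is hypothetical, and it would also scale imputed values up if they fell short of the total.

```python
def enforce_total(values, imputed, total):
    """Rescale imputed components so that all components sum to `total`.

    Assumed rule, not taken from the slides: observed components stay fixed,
    negative imputed values become 0, the remaining imputed values are
    scaled by a common factor.
    """
    clipped = [max(v, 0.0) if imp else v for v, imp in zip(values, imputed)]
    observed_sum = sum(v for v, imp in zip(clipped, imputed) if not imp)
    imputed_sum = sum(v for v, imp in zip(clipped, imputed) if imp)
    if imputed_sum == 0:
        return clipped  # nothing to rescale
    factor = (total - observed_sum) / imputed_sum
    return [v * factor if imp else v for v, imp in zip(clipped, imputed)]


# Toy unit: the first component is observed, the other three are imputed.
corrected = enforce_total([30.0, 6.0, -1.0, 9.0], [False, True, True, True], 40.0)
```

This sketch does not handle the case where the observed components alone already exceed the total.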

  13. Independent Imputation • Impute all variables independently • Run a logit regression for all variables to address semi-continuity • Outcome: 1 if Yij>0, 0 otherwise • Run a linear regression only for the units with Yij>0 and impute only for missing units with positive outcome in the logit regression • Set all other values to 0 • Depending on number of units with Yij>0 stratify for Western/Eastern Germany and two quantiles for establishment size • Use only 20 explanatory variables for number of executives and other workers, ≈ 100 variables for all other dependent variables • Use same correction methods afterwards
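The two-part logic for a semi-continuous variable can be sketched as follows. The coefficients are hypothetical placeholders for fitted (and, in a proper imputation, posterior-drawn) parameters, there is a single covariate `x`, and the linear part is assumed to work on the cube-root scale.

```python
import math
import random


def impute_semicontinuous(x, logit_coef, lin_coef, sigma, rng):
    """Impute one value: first decide zero vs. positive, then draw the size."""
    a0, a1 = logit_coef
    p_positive = 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

    # Part 1: logit model for the indicator "value > 0".
    if rng.random() >= p_positive:
        return 0.0

    # Part 2: linear model for the positive values (cube-root scale, assumption).
    b0, b1 = lin_coef
    z = b0 + b1 * x + rng.gauss(0.0, sigma)
    return max(z, 0.0) ** 3  # back-transform; keep the result non-negative


rng = random.Random(7)
covariates = [0.5, 1.0, 2.0, 3.5, 5.0]  # toy covariate values
draws = [
    impute_semicontinuous(x, (-1.0, 0.8), (1.0, 0.4), 0.3, rng) for x in covariates
]
```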

  14. Nested Imputation of Proportions • Address semi-continuity with logit-model • Calculate proportions of the total for all subcategories with positive outcome • Use a logit transformation on the proportions • Transformed variables range over ]-Inf;Inf[ • Impute variables with linear models • Use almost the same models as for independent imputation under semi-continuity • Nested Imputation: after imputing number of workers define proportions as • After imputation transform variables back and multiply with totals • Use same correction methods afterwards
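The transformation round trip can be sketched as below. The nesting formula is missing from the transcript, so the second half assumes that each later proportion is taken relative to what is left of the total after the earlier categories; the numbers are toy values.

```python
import math


def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map a real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


total = 120.0
workers = 90.0

# Proportion of the total, transformed to an unbounded scale for a linear model.
z_workers = logit(workers / total)

# After imputation: transform back and multiply by the total.
restored_workers = inv_logit(z_workers) * total

# Nested step (assumed form): trainees as a share of the remainder.
trainees = 12.0
remainder = total - workers
z_trainees = logit(trainees / remainder)
restored_trainees = inv_logit(z_trainees) * remainder
```

The logit is undefined for proportions of exactly 0 or 1, which is one reason the zero cases are handled separately by the logit model for semi-continuity.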

  15. Non-Bayesian Dirichlet Distribution • Following an idea by Tempelman (2007) • Ignore semi-continuity • Calculate nested proportions again • Assume Dirichlet distribution for the proportions • Generate starting values using the EM-Algorithm for the Dirichlet Distribution

  16. Non-Bayesian Dirichlet Distribution II • Imputation Algorithm (Data Augmentation): - draw new values for from obtained by Maximum-Likelihood-Estimation - draw new values for mi number of observations to impute for unit i - Calculate • Not fully Bayesian since the distribution of is only approximated • Use same correction methods afterwards

  17. Bayesian Dirichlet/Multinomial Imputation • Generate starting values using the simple imputation approach • For each unit generate a random draw from the Dirichlet distribution with • For each unit generate a random draw from a multinomial distribution with and • weighted vector p for missing obs, • Use same correction methods afterwards
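The distribution parameters are missing from the transcript, so the sketch below fills them with explicit assumptions: the Dirichlet parameters are the unit's current counts plus a prior of 1, and the multinomial redistributes only the head count not explained by observed categories, using the drawn probabilities of the missing categories as weights. All function names are hypothetical.

```python
import random
from collections import Counter


def draw_dirichlet(alpha, rng):
    """Dirichlet draw via normalized gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_multinomial(n, weights, rng):
    """Multinomial draw of size n; weights need not sum to 1."""
    picks = Counter(rng.choices(range(len(weights)), weights=weights, k=n))
    return [picks.get(j, 0) for j in range(len(weights))]


def impute_unit(counts, missing, total, rng, prior=1.0):
    """Redraw the missing categories of one unit so that the total is kept."""
    p = draw_dirichlet([c + prior for c in counts], rng)  # assumed parameters
    miss_idx = [j for j, m in enumerate(missing) if m]
    remaining = total - sum(c for c, m in zip(counts, missing) if not m)
    draws = draw_multinomial(remaining, [p[j] for j in miss_idx], rng)
    result = list(counts)
    for j, d in zip(miss_idx, draws):
        result[j] = d
    return result


rng = random.Random(11)
# Current values for one unit; categories 1 and 3 hold starting imputations.
imputed_unit = impute_unit([50, 5, 3, 2], [False, True, False, True], 60, rng)
```

Because the multinomial size is the remaining head count, the sum constraint holds by construction in this sketch.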

  18. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  19. The Simulation Design • Use fully observed survey data (n=11536) • Generate a random sample with replacement of size n • Generate ≈30% missing values for each variable (MAR) • Impute missing values with different approaches (m=10, iterations=20) • Calculate different quantities of interest • Repeat whole process of sampling and imputation 100 times

  20. Generating missing values • X1: expected development for the number of employees in the next five years (6 categories) • X2: number of unskilled workers • X3: industry-wide wage agreement (1=Yes) • An increase in any X leads to a decrease in pmis
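One way to obtain such a missingness mechanism is a logistic model with negative coefficients on all three covariates. The slides do not give the functional form or the coefficients, so everything below (the logistic link, the values in `coef`, the toy covariate ranges) is an assumption for illustration.

```python
import math
import random


def p_missing(x1, x2, x3, coef=(0.5, -0.2, -0.02, -0.4)):
    """Missingness probability; negative slopes make it fall as any X rises."""
    b0, b1, b2, b3 = coef  # hypothetical values
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2 + b3 * x3)))


rng = random.Random(42)
# Toy covariates: X1 in 1..6, X2 a small count, X3 a 0/1 indicator.
units = [
    (rng.randint(1, 6), rng.randint(0, 50), rng.randint(0, 1)) for _ in range(1000)
]
is_missing = [rng.random() < p_missing(*u) for u in units]
share_missing = sum(is_missing) / len(units)
```

Since the probability depends only on fully observed covariates, the mechanism is MAR; in practice the intercept would be tuned until `share_missing` is close to the targeted 30%.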

  21. Quality measures • For all estimates of interest: • Compute the estimate from the original survey • Compute the average estimate across the 100 samples • Compute the average estimate across the 100 imputed samples • Compute the 95% coverage rate for the fully observed samples and the imputed samples • Compute • Compute • Compute the average confidence interval overlap for the fully observed sample and the imputed sample
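The slides do not spell out how the intervals for the imputed samples are built. A common choice is to pool the m imputed-data estimates with Rubin's combining rules and then check how often the interval covers the original-survey value. The sketch assumes that approach, uses a normal approximation instead of the t reference distribution, and works on toy numbers.

```python
import math


def pool(estimates, variances):
    """Rubin's combining rules: pooled estimate and total variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # pooled point estimate
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    return q_bar, u_bar + (1.0 + 1.0 / m) * b


def interval(estimate, variance, z=1.96):
    """Approximate 95% interval (normal approximation, assumption)."""
    half = z * math.sqrt(variance)
    return estimate - half, estimate + half


def coverage_rate(intervals, truth):
    """Share of intervals that contain the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


q_bar, t_var = pool([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])  # toy estimates, m = 3
ci = interval(q_bar, t_var)
rate = coverage_rate([(0.0, 2.0), (1.0, 3.0), (2.5, 4.0)], 2.0)
```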

  22. Confidence interval overlap • Suggested by Karr et al. (2006) • Measure the overlap of CIs from the original data and CIs from the imputed data • The higher the overlap, the higher the data utility • Compute the average relative CI overlap for any estimate of interest • Figure labels: CI for the imputed data, CI for the original data
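The measure averages the length of the overlapping region relative to each of the two intervals. A short sketch, with a hypothetical function name and toy intervals:

```python
def ci_overlap(orig, imputed):
    """Average relative overlap of two confidence intervals (lower, upper)."""
    lower = max(orig[0], imputed[0])
    upper = min(orig[1], imputed[1])
    if upper <= lower:
        return 0.0  # the intervals do not overlap
    width = upper - lower
    return 0.5 * (width / (orig[1] - orig[0]) + width / (imputed[1] - imputed[0]))


half_overlap = ci_overlap((0.0, 10.0), (5.0, 15.0))
full_overlap = ci_overlap((2.0, 4.0), (2.0, 4.0))
no_overlap = ci_overlap((0.0, 1.0), (2.0, 3.0))
```

Identical intervals give 1, disjoint intervals give 0, and an imputed-data interval that is much wider than the original one is penalized even if it fully contains it.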

  23. Estimates of Interest • Mean (Yi) in the 16 German Länder • Logit regression to explain collective wage agreements by establishment size • Use number of employees covered by social security in 6 categories (employees covered by social security = workers + trainees): • Y~emp<10+emp<50+emp<100+emp<250+emp<750+emp>750+industry.dummies • Compare the estimates for the establishment size from the different imputation methods

  24. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  25. Example for the results

  26. Results Averaged Over Different Regions

  27. Results Averaged Over Different Regions

  28. Results Averaged Over Different Regions

  29. Average absolute deviation

  30. Results for the regression

  31. Results for the regression II

  32. Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work

  33. Conclusions • All methods provide good repeated sampling properties • Differences between the approaches are relatively small • Dirichlet and proportions approach tend to introduce more variability • Dirichlet and proportions approach don’t work very well for owners and others • The simple approach seems to work best with high coverage and low additional variability • Future Work • Compare the same approaches for more equally distributed subcategories

  34. Thank you for your attention
