Calibrated imputation of numerical data under linear edit restrictions


Presentation Transcript


  1. Calibrated imputation of numerical data under linear edit restrictions • Jeroen Pannekoek, Natalie Shlomo, Ton de Waal

  2. Missing data • Data may be missing from collected data sets • Unit non-response: data from entire units are missing; often dealt with by means of weighting • Item non-response: some items from units are missing; usually dealt with by means of imputation

  3. Linear edit restrictions • Data often have to satisfy edit restrictions • For numerical data most edits are linear • Balance equations: $a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b = 0$ • Inequalities: $a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b \ge 0$

  4. Totals • Sometimes totals are also known

  5. Eliminating balance equations • We can “eliminate balance equations” • Example: set of edits • net + tax – gross = 0 • net ≥ tax • net ≥ 0 • Eliminating the balance equations • net = gross – tax • gross – tax ≥ tax • gross – tax ≥ 0

  7. Eliminating balance equations • By eliminating all balance equations we only have to deal with inequality edits • If we sequentially impute variables, we only have to ensure that imputed values lie in an interval • $L_i \le x_i \le U_i$ • We can now focus on satisfying totals
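
As a small illustration of the interval idea, the sketch below takes the net/tax/gross example and derives the feasible interval for tax when gross is observed. After eliminating the balance equation, the edits gross – tax ≥ tax and gross – tax ≥ 0 give tax ≤ gross/2 and tax ≤ gross. The lower bound tax ≥ 0 is an assumption borrowed from the edit set of the evaluation study, and the function names are hypothetical.

```python
def tax_interval(gross):
    """Feasible interval [L, U] for tax given an observed gross value.

    Edits after eliminating net = gross - tax:
      gross - tax >= tax  ->  tax <= gross / 2
      gross - tax >= 0    ->  tax <= gross
    Assumed extra edit (from the evaluation study): tax >= 0.
    """
    lower = 0.0
    upper = min(gross / 2.0, gross)
    if upper < lower:
        raise ValueError("edits cannot be satisfied for this gross value")
    return lower, upper


def net_from_balance(gross, tax):
    """Once tax is imputed, net follows from the balance equation."""
    return gross - tax
```

Any tax value imputed inside the returned interval yields a net value that satisfies all three original edits.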

  8. Imputation methods • Adjusted predicted mean imputation • Adjusted predicted mean imputation with random residuals • MCMC approach

  9. Adjusted predicted mean imputation • We use sequential imputation • All missing values for a variable (the target variable) are imputed simultaneously • We impute target column $x_t$ • We use the model $x_t = \beta_0 + \beta x_p + e$ • We impute $x_t = \beta_0 + \beta x_p$ • Imputed values generally satisfy neither the edits nor the totals

  10. Satisfying totals • The totals of the missing data for the target variable ($X_{t,\mathrm{mis}}$) as well as the predictor ($X_{p,\mathrm{mis}}$) are known • We construct the following model for the observed data and the known total • $x_{t,\mathrm{obs}} = \beta_0 + \beta x_{p,\mathrm{obs}} + e$ • $X_{t,\mathrm{mis}} = \beta_1 m + \beta X_{p,\mathrm{mis}}$ • $m$ is the number of missing values • We apply OLS to estimate the model parameters • We impute $x_{t,\mathrm{mis}} = \beta_1 + \beta x_{p,\mathrm{mis}}$ • The sum of the imputed values then equals the known value of this total
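
The benchmarking step can be sketched for a single predictor as follows. Because β1 appears only in the equation for the total, OLS on the combined model fits that equation exactly. The slope β and intercept β0 are therefore the usual least-squares estimates from the observed records, and β1 = (Xt,mis – β Xp,mis) / m. The function names and the single-predictor restriction are assumptions made for illustration.

```python
def fit_benchmarked(xp_obs, xt_obs, xp_mis, total_t_mis):
    """Return (beta0, beta, beta1) for the benchmarked regression model."""
    n = len(xp_obs)
    m = len(xp_mis)
    mean_p = sum(xp_obs) / n
    mean_t = sum(xt_obs) / n
    # Ordinary least-squares slope and intercept on the observed records.
    sxx = sum((p - mean_p) ** 2 for p in xp_obs)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(xp_obs, xt_obs))
    beta = sxy / sxx
    beta0 = mean_t - beta * mean_p
    # Intercept for the missing records, chosen so the total equation holds.
    beta1 = (total_t_mis - beta * sum(xp_mis)) / m
    return beta0, beta, beta1


def impute_benchmarked(xp_mis, beta1, beta):
    """Imputed target values for the records with a missing target."""
    return [beta1 + beta * p for p in xp_mis]
```

With observed pairs (1, 2), (2, 4), (3, 6), predictor values 1 and 2 for the missing records and a known total of 10, this gives β = 2, β1 = 2 and imputations 4 and 6, which sum to the known total.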

  11. Satisfying totals and intervals (edits) • We impute $x_{t,\mathrm{mis}} = \beta_1 + \beta x_{p,\mathrm{mis}} + a_t$ • The adjustments $a_{t,i}$ are chosen in such a way that • Imputed values lie in their feasible intervals • $\sum_i a_{t,i} = 0$ • Appropriate values for $a_{t,i}$ can be found by means of an operations research technique • For a simple alternative technique, see the paper
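
One way to find such adjustments is sketched below. This is not necessarily the operations research technique or the simple alternative from the paper; it is a hypothetical clip-and-redistribute scheme used only to make the two requirements concrete. Values are first clipped to their intervals, and the remaining gap to the known total is then spread equally over the values that can still move in the required direction.

```python
def adjust_to_total(values, lower, upper, total, tol=1e-9):
    """Move values into [lower, upper] while making them sum to total."""
    if not (sum(lower) <= total <= sum(upper)):
        raise ValueError("total is not attainable within the intervals")
    x = [min(max(v, lo), hi) for v, lo, hi in zip(values, lower, upper)]
    # Each pass either closes the gap or pins at least one value to a bound,
    # so len(x) + 1 passes are enough when the total is attainable.
    for _ in range(len(x) + 1):
        gap = total - sum(x)
        if abs(gap) <= tol:
            break
        if gap > 0:
            free = [i for i in range(len(x)) if x[i] < upper[i]]
        else:
            free = [i for i in range(len(x)) if x[i] > lower[i]]
        if not free:
            break
        share = gap / len(free)
        for i in free:
            x[i] = min(max(x[i] + share, lower[i]), upper[i])
    return x
```

If the input values already sum to the total, the differences between the returned and the input values play the role of the adjustments and sum to zero.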

  12. Satisfying totals and intervals (edits) • Alternatively, draw $m$ residuals by acceptance/rejection sampling from a normal distribution (zero mean and the residual variance of the regression model) such that the interval constraints are satisfied • Adjust the random residuals to meet the sum constraint, as carried out for $a_{t,i}$
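
The acceptance/rejection step can be sketched as below: residuals are drawn from a zero-mean normal distribution and kept only if the predicted value plus the residual lies in the feasible interval. The function name, the `max_tries` cap, and the use of Python's `random.Random` are assumptions. The subsequent adjustment to meet the sum constraint is a separate step and is not shown.

```python
import random


def draw_residual(pred, lower, upper, sigma, rng, max_tries=100000):
    """Draw e ~ N(0, sigma^2) such that lower <= pred + e <= upper."""
    for _ in range(max_tries):
        e = rng.gauss(0.0, sigma)
        if lower <= pred + e <= upper:
            return e  # accepted draw
    raise RuntimeError("no residual accepted; interval may be too narrow")


def draw_residuals(preds, lowers, uppers, sigma, seed=0):
    """One accepted residual per record with a missing target value."""
    rng = random.Random(seed)
    return [draw_residual(p, lo, hi, sigma, rng)
            for p, lo, hi in zip(preds, lowers, uppers)]
```

Rejection sampling becomes slow when an interval sits far in the tail of the residual distribution, which is why the sketch caps the number of attempts.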

  13. MCMC approach • Start with pre-imputed consistent dataset • Randomly select two records • We select a variable in these records. Note that we know the sum of these two values of this variable for the two records

  14. MCMC approach • We then apply the following two steps • Step 1: we determine intervals for the two values • Step 2: we draw a value for one missing value; the other value then follows immediately from the known sum • Repeat Steps 1 and 2 until “convergence” • In Step 2 we draw a value from the posterior predictive distribution implied by a linear regression model under an uninformative prior, conditional on the value lying inside the corresponding interval
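
A single update of this kind can be sketched as follows. For two records with values x_i and x_j and known sum s, the edit intervals [L_i, U_i] and [L_j, U_j] imply max(L_i, s – U_j) ≤ x_i ≤ min(U_i, s – L_j). In the sketch, a plain normal distribution with a supplied mean and standard deviation stands in for the posterior predictive distribution, which is a simplifying assumption. All names are hypothetical.

```python
import random


def pair_interval(total, lo_i, hi_i, lo_j, hi_j):
    """Step 1: interval for x_i given x_i + x_j = total and both edit intervals."""
    lo = max(lo_i, total - hi_j)
    hi = min(hi_i, total - lo_j)
    if lo > hi:
        raise ValueError("the two records are not jointly feasible")
    return lo, hi


def update_pair(x_i, x_j, bounds_i, bounds_j, mean_i, sd, rng, max_tries=100000):
    """Step 2: redraw x_i inside its interval; x_j follows from the fixed sum."""
    total = x_i + x_j
    lo, hi = pair_interval(total, bounds_i[0], bounds_i[1], bounds_j[0], bounds_j[1])
    for _ in range(max_tries):
        new_i = rng.gauss(mean_i, sd)  # stand-in for the posterior predictive draw
        if lo <= new_i <= hi:
            return new_i, total - new_i
    return x_i, x_j  # no draw accepted: keep the current consistent values
```

Because the sum of the pair never changes, the column total is preserved at every step, and the interval keeps both records consistent with the edits.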

  15. Evaluation study: methods • Evaluated imputation methods: • UPMA: unbenchmarked simple predictive mean imputation with adjustments to imputations that satisfy interval constraints • BPMA: benchmarked predictive mean imputation with adjustments to imputations that satisfy interval constraints and totals • MCMC: BPMA with adjustments was used as pre-imputed data set for MCMC approach

  16. Evaluation study: data set • 11,907 individuals aged 15 and over that responded to all questions in 2005 Israel Income Survey and earned more than 1000 Israel Shekels for their monthly gross income • Item non-response was introduced randomly to income variables • 20% of records were selected randomly and their net income variable deleted • 20% of records were selected randomly and their tax variable deleted while 10% of those records were in common with the missing net income variable • Totals of each of the income variables are known

  17. Evaluation study: data set • We focus on three variables from the Income Survey: • gross: gross income from earnings • net: net income from earnings • tax: tax paid • Edits: • net + tax = gross • net ≥ tax • gross ≥ 3 × tax • gross ≥ 0, net ≥ 0, tax ≥ 0 • Log transform was carried out on variables to ensure normality of data

  18. Evaluation criteria • $d_{L1}$: average distance between imputed and true values • Z: number of imputed records on boundary of feasible region defined by edits • K-S (Kolmogorov-Smirnov): compares empirical distribution of original values to empirical distribution of imputed values • Sign: sign test carried out on difference between original value and imputed value • Kappa: Kappa statistic for 2-dimensional contingency table; compares agreement against that which might be expected by chance
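
Two of these criteria are easy to make concrete. The sketch below computes an unweighted mean absolute difference as a stand-in for the dL1 distance and the standard two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The exact definitions used in the study may differ, for example through weighting, so both functions should be read as assumptions.

```python
from bisect import bisect_right


def d_l1(true_values, imputed_values):
    """Mean absolute difference between true and imputed values (unweighted)."""
    pairs = list(zip(true_values, imputed_values))
    return sum(abs(t - i) for t, i in pairs) / len(pairs)


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The gap can only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )
```

Identical samples give a K-S statistic of 0, and completely separated samples give 1.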

  19. Results

  20. Results

  21. Conclusions • MCMC approach does worse than the other methods on all criteria except the number of records that lie on the boundary • However, MCMC allows multiple imputation in order to take imputation uncertainty into account in variance estimation • BPMA appears to be slightly better than UPMA except on the K-S statistic • Number of records that lie on the boundary for UPMA is cause for concern • MCMC approach does slightly better than BPMA in this respect

  22. Future research • Improving MCMC approach • Carrying out multiple imputation using MCMC approach to obtain proper variance estimation • In our study a log transformation was carried out on variables to ensure normality of data • Correction factor was introduced into constant term of regression model to correct for this log transformation • Better approach to this problem will be investigated • Extending problem to situations where one has non-equal sampling weights