
Treatment of Missing Data

Wayne Jiang, FCAS, Safeco Insurance Companies


Presentation Transcript


  1. Treatment of Missing Data • Wayne Jiang, FCAS • Safeco Insurance Companies

  2. Why handling missing data is important • If not properly handled, missing data can lead to biased, invalid, or insignificant results.

  3. Different kinds of missing data • Missing completely at random (MCAR). • The probability that an observation is missing is unrelated to the value of the variable or to the value of any other variable; that is, missing values are randomly distributed across all observations.

  4. Different kinds of missing data • Missing at random (MAR). • The probability that a value is missing does not depend on the value itself after controlling for the other, observed variables. Put another way, missingness is random within subgroups defined by those variables.
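The difference between MCAR and MAR can be illustrated with a small simulation. This is a minimal Python sketch, not part of the original slides; the variables (`ages`, `incomes`), the missing rates, and the age cutoff are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: age is always observed, income rises with age.
ages = [random.uniform(20, 70) for _ in range(20000)]
incomes = [1000 * a + random.gauss(0, 5000) for a in ages]

def observed_mean(values, keep_flags):
    """Mean of the values that were not made missing."""
    return statistics.fmean([v for v, keep in zip(values, keep_flags) if keep])

# MCAR: every income has the same 30% chance of being missing.
mcar_keep = [random.random() > 0.3 for _ in ages]

# MAR: the chance of being missing depends only on the observed age
# (60% when age > 45, 10% otherwise), not on the income itself.
mar_keep = [random.random() > (0.6 if a > 45 else 0.1) for a in ages]

true_mean = statistics.fmean(incomes)
mcar_mean = observed_mean(incomes, mcar_keep)
mar_mean = observed_mean(incomes, mar_keep)

print(f"true mean      {true_mean:10.0f}")
print(f"MCAR observed  {mcar_mean:10.0f}")
print(f"MAR observed   {mar_mean:10.0f}")
```

Under MCAR the mean of the observed incomes stays close to the true mean. Under MAR the naive observed mean is pulled down because older, higher-income records are missing more often, even though missingness is random within each age group.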

  5. Different kinds of missing data • Missing not at random (MNAR). • Neither MCAR nor MAR. • Very hard to analyze.

  6. Pattern of missing data • Monotone: • When more than one variable can be missing, the variables can be ordered so that if one variable is missing for an observation, every later variable in the order is missing too.
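A monotone pattern can be checked mechanically. The following is a minimal Python sketch, assuming missing values are stored as `None`; the function name `is_monotone` and the variables `y1`, `y2`, `y3` are hypothetical.

```python
def is_monotone(rows, order):
    """Return True if, in every row, once a variable in `order` is missing
    (None), all later variables in `order` are missing too."""
    for row in rows:
        seen_missing = False
        for name in order:
            if row[name] is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value follows a missing one
    return True

# Hypothetical records with three variables measured in sequence.
monotone_rows = [
    {"y1": 1.0, "y2": 2.0, "y3": 3.0},
    {"y1": 1.5, "y2": 2.5, "y3": None},
    {"y1": 0.5, "y2": None, "y3": None},
]
# Adding a record with a gap in the middle breaks the pattern.
arbitrary_rows = monotone_rows + [{"y1": 2.0, "y2": None, "y3": 4.0}]

print(is_monotone(monotone_rows, ["y1", "y2", "y3"]))   # True
print(is_monotone(arbitrary_rows, ["y1", "y2", "y3"]))  # False
```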

  7. Dealing with missing data • If the data set is large and only a few points are missing at random, the problem is not serious. • In a smaller data set with a non-random distribution of missing values, the problem may be serious.

  8. Some ways to deal with the missing data problem (separate category) • Treat missing as its own category. • This could group very dissimilar classes together. • Severe bias could result.

  9. Some ways to deal with the missing data problem (deletion) • Listwise deletion. • Any record with a missing value is deleted. • Yields unbiased parameter estimates if the data are MCAR. • Sacrifices predictive power because fewer data points are used. • SAS PROC REG uses this by default.
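Listwise deletion is simple to express in code. This is a minimal Python sketch with hypothetical records and field names, using `None` to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "claims": 1},
    {"age": None, "income": 61000, "claims": 0},
    {"age": 45, "income": None, "claims": 2},
    {"age": 29, "income": 48000, "claims": 0},
]

def listwise_delete(records):
    """Keep only the records in which every field is present."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = listwise_delete(rows)
print(len(rows), "records ->", len(complete), "complete records")
```

Half of the records are lost here even though only two of the twelve individual values are missing, which is the loss of data the slide refers to.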

  10. Some ways to deal with the missing data problem (deletion) • Pairwise deletion. • All available data are used when calculating correlation matrices. • Creates a sample size problem, since each correlation may be based on a different number of records, and the resulting matrix may not be positive definite. • SAS PROC CORR uses this by default.
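The sample size problem with pairwise deletion shows up as soon as each correlation is computed on its own subset of records. This is a minimal Python sketch with hypothetical data; `pairwise_corr` is an illustrative helper, not a SAS or library function.

```python
import math

def pairwise_corr(xs, ys):
    """Pearson correlation using only rows where both values are present.
    Returns (correlation, number_of_rows_used)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Hypothetical variables; None marks a missing value.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2.1, None, 6.2, 7.9, None, 12.1]
c = [None, 5.0, 4.1, None, 2.2, 0.9]

for name, (x, y) in {"a,b": (a, b), "a,c": (a, c), "b,c": (b, c)}.items():
    r, n = pairwise_corr(x, y)
    print(f"corr({name}) = {r:+.3f} using n = {n}")
```

Each entry of the correlation matrix ends up based on a different set of rows (4, 4, and 2 here), so the entries need not be mutually consistent.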

  11. Some ways to deal with the missing data problem (substitution) • Mean substitution • Replace missing data with the global mean. • Simple approach. • Underestimates the error. • Hot deck method • Simple approach. Replace each missing value with the value from a similar record. • Has randomness built in. • Still underestimates the error.
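The way mean substitution underestimates the error can be seen by comparing standard deviations before and after filling. This is a minimal Python sketch on simulated data; the distribution and the 40% missing rate are hypothetical.

```python
import random
import statistics

random.seed(1)

# Hypothetical variable with true standard deviation 15.
full = [random.gauss(100, 15) for _ in range(5000)]

# Make about 40% of the values missing completely at random (None).
observed = [x if random.random() > 0.4 else None for x in full]

# Mean substitution: every missing value is replaced by the observed mean.
present = [x for x in observed if x is not None]
mean_obs = statistics.fmean(present)
filled = [mean_obs if x is None else x for x in observed]

sd_present = statistics.stdev(present)
sd_filled = statistics.stdev(filled)

print(f"sd of observed values       {sd_present:6.2f}")
print(f"sd after mean substitution  {sd_filled:6.2f}")
```

The filled-in values add records without adding any spread, so the standard deviation shrinks and any standard error computed from the filled data is too small.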

  12. Some ways to deal with the missing data problem (imputation) • Regression • Replace missing data with values predicted from the other variables. • An improvement over the global mean. • Still underestimates the error.
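Regression imputation can be sketched with a single predictor. This is a minimal Python example on simulated data; the model, the 50% missing rate, and the helper names `fit_line` and `residual_sd` are hypothetical.

```python
import random
import statistics

random.seed(2)

# Hypothetical data: y depends linearly on x, with noise of standard deviation 1.
n = 4000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# About half of y is made missing completely at random (None marks missing).
y_obs = [yi if random.random() > 0.5 else None for yi in y]

def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Fit on the complete cases, then fill each missing y with its prediction.
cx = [xi for xi, yi in zip(x, y_obs) if yi is not None]
cy = [yi for yi in y_obs if yi is not None]
slope, intercept = fit_line(cx, cy)
y_imputed = [intercept + slope * xi if yi is None else yi
             for xi, yi in zip(x, y_obs)]

def residual_sd(xs, ys):
    """Spread of ys around the fitted line."""
    return statistics.pstdev([yi - (intercept + slope * xi)
                              for xi, yi in zip(xs, ys)])

sd_true = residual_sd(x, y)
sd_imputed = residual_sd(x, y_imputed)
print(f"residual sd, true data     {sd_true:5.2f}")
print(f"residual sd, imputed data  {sd_imputed:5.2f}")
```

The imputed points lie exactly on the fitted line, so the filled data scatter less around it than the true data do. That is the sense in which the error is still underestimated.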

  13. Multiple imputation • A Monte Carlo technique in which the missing values are replaced by 3-10 simulated versions. Each simulated data set is analyzed, and the results are combined to produce estimates that incorporate the uncertainty due to missing data. • More complicated, but much less biased. • SAS users can use PROC MI and PROC MIANALYZE.

  14. Three steps of multiple imputation • Impute data. • The data are assumed to be multivariate normal. Parameters are first estimated from the complete cases. Imputed values are then drawn at random from the fitted distribution. The parameters are re-estimated and another round of imputation follows. This repeats until the parameter estimates converge. Multiple data sets are then drawn at random from the distribution.

  15. Three steps of multiple imputation • Analyze data • Each imputed data set is analyzed using any preferred method. • Proc ####; BY _Imputation_; …; Run; • Save the parameter estimates in a data set.
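The analysis step amounts to running the same analysis once per imputation index and keeping each estimate together with its variance. The following Python sketch mimics the BY `_Imputation_` grouping on a tiny stacked data set; the data values, the variable `y`, and the choice of a simple mean as the analysis are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical stacked output of the imputation step: the third value of y
# was missing and has been imputed differently in each of three data sets.
stacked = [
    {"_Imputation_": 1, "y": 10.0}, {"_Imputation_": 1, "y": 12.0},
    {"_Imputation_": 1, "y": 11.0}, {"_Imputation_": 1, "y": 13.0},
    {"_Imputation_": 2, "y": 10.0}, {"_Imputation_": 2, "y": 12.0},
    {"_Imputation_": 2, "y": 14.0}, {"_Imputation_": 2, "y": 13.0},
    {"_Imputation_": 3, "y": 10.0}, {"_Imputation_": 3, "y": 12.0},
    {"_Imputation_": 3, "y": 9.0}, {"_Imputation_": 3, "y": 13.0},
]

# Group the rows by imputation index, as a BY statement would.
groups = defaultdict(list)
for row in stacked:
    groups[row["_Imputation_"]].append(row["y"])

# Analyze each data set separately. The "analysis" here is just the mean of y
# and the squared standard error of that mean.
results = []
for imputation, values in sorted(groups.items()):
    estimate = statistics.fmean(values)
    variance = statistics.variance(values) / len(values)
    results.append({"_Imputation_": imputation,
                    "estimate": estimate,
                    "variance": variance})

for r in results:
    print(r)
```

The `results` list plays the role of the saved parameter data set: one estimate and one variance per imputation, ready to be combined.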

  16. Three steps of multiple imputation • Combine results • Estimate = mean of the estimates from all m imputed data sets. • Total variance = (average within-imputation variance) + (1 + 1/m) × (between-imputation variance), where m is the number of imputations. • Proc MIAnalyze parms=####; Run;
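The combining formulas above can be worked through in a few lines. This is a minimal Python sketch; the function name `combine` and the five estimates and variances are hypothetical, and the between-imputation variance is taken as the sample variance of the estimates (dividing by m - 1).

```python
import statistics

def combine(estimates, variances):
    """Pool m estimates and their variances using the formulas on the slide.
    Returns (pooled_estimate, total_variance)."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # mean of all estimates
    within = statistics.fmean(variances)        # average within variance
    between = statistics.variance(estimates)    # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.06, 0.05]

pooled, total = combine(estimates, variances)
print(f"pooled estimate = {pooled:.3f}")   # 1.000
print(f"total variance  = {total:.3f}")    # 0.078
```

Here the average within variance is 0.048 and the between variance is 0.025, so the total is 0.048 + 1.2 × 0.025 = 0.078. The between-imputation term is what carries the uncertainty due to the missing data.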

  17. Reference • SAS online manual: http://support.sas.com/rnd/app/papers/miv802.pdf • Carpenter, J and Kenward, M http://www.lshtm.ac.uk/msu/missingdata/start.html

  18. Questions?

  19. Thank you!
