1 / 18

Sequential Regression – A Method for Multipe Imputations of Missing Data

Sequential Regression – A Method for Multipe Imputations of Missing Data. Seminar of the Working Group on Composite Indices 27 March 2002. Tanja Srebotnjak, United Nations Statistics Division, 2002. Procedure:. A multivariate technique for multiply imputing missing data

kayo
Télécharger la présentation

Sequential Regression – A Method for Multipe Imputations of Missing Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequential Regression – A Method for Multipe Imputations of Missing Data Seminar of the Working Group on Composite Indices 27 March 2002 Tanja Srebotnjak, United Nations Statistics Division, 2002 WGCI - Seminar

  2. Procedure: • A multivariate technique for multiply imputing missing data • Uses a sequence of regression models • Developed by T. E. Raghunathan, J. M. Lepkowski, Peter Solenberger and John Van Hoewyk, [University of Michigan] WGCI - Seminar

  3. Procedure (contd.): • Assume a dataset of dimension (np) with item non-response/missingness • Partition the dataset into n1 variables with no missing obs, say X=(X1,X2,…,Xn1) and (n- n1) with missing values Y=(Y1,Y2,…,Yn-n1) • Y is ordered by degree of missingness, from least to most WGCI - Seminar

  4. Procedure (contd.): • Then, the conditional distribution of Y1, i=1,2,…,n-n1, given the observed values is modeled as a regression model of Yi on X, e.g. E(Y1|X)=X + e • Missing values are imputed using this model • Once, Y1 is imputed, it is used as a predictor for Y2, i.e. X=(X1, X2, …, Xn, Y1) WGCI - Seminar

  5. Procedure (contd.): • The algorithm continues cycling through this series of regression models (using updated predictor sets until X=(X1,X2,…,Xn1,Y1,…,Yn-n1) • Now, a new round begins, using the full dataset as predictor for Y1 again, thus updating the regression coefficients • The algorithm is repeated until convergence in the regression coefficients is achieved, i.e. change below a specified margin  WGCI - Seminar

  6. Procedure (contd.): • Finally, the missing values for each Yi are imputed using the corresponding converged regression model • In order to yield multiple imputations, the complete algorithm is repeated m times, resulting in m completed datasets • The m datasets are analyzed and the results combined to yield final parameter estimates WGCI - Seminar

  7. Flexibility: • Basically all model types can be fitted • Stepwise regressions possible to ensure that only most important predictors enter the model • Introduction of randomization in the imputation process through prior distribution on regression coefficients, and/or perturbation of imputed values • No assumption of specific distribution for joint distribution necessary WGCI - Seminar

  8. Computational realization: • SAS callable IVEware (Imputation, Variance Estimation software) developed by Raghunathan, Solenberger, Van Hoewyk • Allows for the modeling of linear, logistic, Poisson and generalized logit distributions • Prior distributions on coefficients • Stepwise regressions with specified R2 • Inclusion of interaction terms in model • Restrictions on imputed variables WGCI - Seminar

  9. Example: • The status of “Least Developed Country” is based on assessment of several requirements, including data containing 4 key indicators. • Missing values in these 4 key indicators render the assessment more difficult. • Hence, imputation of missing values would be desirable. • Objective: Compare estimated values from LDC dataset with imputations using a srmi approach. WGCI - Seminar

  10. Application to LDC dataset: • LDC dataset consists of 4 variables: • Child mortality rate per 1000, 1995-2000 [ChldMort] • Calory intake as % of requirements, 1995/1997 [CalInt] • 1st and 2nd level gross enrollment ratio, 1995 [EnrRatio] • Adult literacy rate in %, 1995 [LitRate] WGCI - Seminar

  11. Application to LDC dataset (contd.): • Distributions skewed and not normal, high intercorrelations >cor(ldcdata[,2:6],na.method="available") ChldMort CalInt EnrRatio AdtLitRate LDCInd ChldMort 1.00 -0.62 -0.73 -0.81 0.73 CalInt 1.00 0.59 0.45 -0.55 EnrRatio 1.00 0.88 -0.66 AdtLitRate 1.00 -0.67 LDCInd 1.00 WGCI - Seminar

  12. Application to LDC dataset (contd.): Literacy rate: Calory intake Child mortality Enrollment ratio WGCI - Seminar

  13. Application to LDC dataset (contd.): • Missingness distribution: Variable Observed Imputed Double counted EnrRatio 90 38 0 LitRate 92 36 0 ChldMort 118 10 0 CalInt 119 9 0 WGCI - Seminar

  14. Application to LDC dataset (contd.): • Model specifications: • All variables are continuous • Bounds on imputed values derived from observed values (to guarantee sensible imputations) • Min R2 for inclusion of a variable in model 0.15 • M=5 imputations • Convergence diagnostics: • Stabilization of coefficients and mean, variance of imputed variable WGCI - Seminar

  15. Application to LDC dataset (contd.): • Largest deviations observed between imputed and estimated values for ChldMort1, although only 10 values were missing. Sqrt(deviations^2) ChldMort CalInt EnrRatio LitRate 155.89 48.84 82.14 96.57 • This is reflected also in the standard deviation of ChldMort Note: the range of ChldMort is [0,500], thus larger deviations are possible compared to the other variables. WGCI - Seminar

  16. Application to LDC dataset (contd.): Variable ChldMort Obs Imputed Combined No 118 10 128 Min 6.06 4.40 4.40 Max 262.59 187.249 262.59 Mean 88.20 110.118 89.91 StdDev 65.68 66.2078 65.73 Variable EnrRatio Obs Imputed Combined No 90 38 128 Min 19 13.72 13.72 Max 104 107.73 107.73 Mean 71.81 65.11 69.82 StdDev 22.89 23.47 23.18 WGCI - Seminar

  17. Application to LDC dataset (contd.): Variable LitRate Obs Imputed Combined No 92 36 128 Min 13.6 11.85 11.85 Max 98.2 97.93 98.2 Mean 69.47 56.92 65.94 StdDev 22.74 23.84 23.65 Variable CalInt Obs Imputed Combined No 119 9 128 Min 75 100.80 75 Max 164 162.8 164 Mean 115.48 124.79 116.13 StdDev 20.26 21.85 20.42 WGCI - Seminar

  18. Application to LDC dataset (contd.): • Srmi provides a flexible method for multiply imputing missing data without requiring the assumption of a specific joint distribution of the data at hand. • The srmi approach reflects intuitively the idea to use regression models for imputation purposes. • However, it is more powerful than single regression through the full exploitation of the covariance structure. WGCI - Seminar

More Related