Sequential Regression – A Method for Multipe Imputations of Missing Data

Sequential Regression – A Method for Multipe Imputations of Missing Data Seminar of the Working Group on Composite Indices 27 March 2002 Tanja Srebotnjak, United Nations Statistics Division, 2002 WGCI - Seminar

Procedure: • A multivariate technique for multiply imputing missing data • Uses a sequence of regression models • Developed by T. E. Raghunathan, J. M. Lepkowski, Peter Solenberger and John Van Hoewyk, [University of Michigan] WGCI - Seminar

Procedure (contd.): • Assume a dataset of dimension (np) with item non-response/missingness • Partition the dataset into n1 variables with no missing obs, say X=(X1,X2,…,Xn1) and (n- n1) with missing values Y=(Y1,Y2,…,Yn-n1) • Y is ordered by degree of missingness, from least to most WGCI - Seminar

Procedure (contd.): • Then, the conditional distribution of Y1, i=1,2,…,n-n1, given the observed values is modeled as a regression model of Yi on X, e.g. E(Y1|X)=X + e • Missing values are imputed using this model • Once, Y1 is imputed, it is used as a predictor for Y2, i.e. X=(X1, X2, …, Xn, Y1) WGCI - Seminar

Procedure (contd.): • The algorithm continues cycling through this series of regression models (using updated predictor sets until X=(X1,X2,…,Xn1,Y1,…,Yn-n1) • Now, a new round begins, using the full dataset as predictor for Y1 again, thus updating the regression coefficients • The algorithm is repeated until convergence in the regression coefficients is achieved, i.e. change below a specified margin  WGCI - Seminar

Procedure (contd.): • Finally, the missing values for each Yi are imputed using the corresponding converged regression model • In order to yield multiple imputations, the complete algorithm is repeated m times, resulting in m completed datasets • The m datasets are analyzed and the results combined to yield final parameter estimates WGCI - Seminar

Flexibility: • Basically all model types can be fitted • Stepwise regressions possible to ensure that only most important predictors enter the model • Introduction of randomization in the imputation process through prior distribution on regression coefficients, and/or perturbation of imputed values • No assumption of specific distribution for joint distribution necessary WGCI - Seminar

Computational realization: • SAS callable IVEware (Imputation, Variance Estimation software) developed by Raghunathan, Solenberger, Van Hoewyk • Allows for the modeling of linear, logistic, Poisson and generalized logit distributions • Prior distributions on coefficients • Stepwise regressions with specified R2 • Inclusion of interaction terms in model • Restrictions on imputed variables WGCI - Seminar

Example: • The status of “Least Developed Country” is based on assessment of several requirements, including data containing 4 key indicators. • Missing values in these 4 key indicators render the assessment more difficult. • Hence, imputation of missing values would be desirable. • Objective: Compare estimated values from LDC dataset with imputations using a srmi approach. WGCI - Seminar

Application to LDC dataset: • LDC dataset consists of 4 variables: • Child mortality rate per 1000, 1995-2000 [ChldMort] • Calory intake as % of requirements, 1995/1997 [CalInt] • 1st and 2nd level gross enrollment ratio, 1995 [EnrRatio] • Adult literacy rate in %, 1995 [LitRate] WGCI - Seminar

Application to LDC dataset (contd.): • Distributions skewed and not normal, high intercorrelations >cor(ldcdata[,2:6],na.method="available") ChldMort CalInt EnrRatio AdtLitRate LDCInd ChldMort 1.00 -0.62 -0.73 -0.81 0.73 CalInt 1.00 0.59 0.45 -0.55 EnrRatio 1.00 0.88 -0.66 AdtLitRate 1.00 -0.67 LDCInd 1.00 WGCI - Seminar

Application to LDC dataset (contd.): Literacy rate: Calory intake Child mortality Enrollment ratio WGCI - Seminar

Application to LDC dataset (contd.): • Missingness distribution: Variable Observed Imputed Double counted EnrRatio 90 38 0 LitRate 92 36 0 ChldMort 118 10 0 CalInt 119 9 0 WGCI - Seminar

Application to LDC dataset (contd.): • Model specifications: • All variables are continuous • Bounds on imputed values derived from observed values (to guarantee sensible imputations) • Min R2 for inclusion of a variable in model 0.15 • M=5 imputations • Convergence diagnostics: • Stabilization of coefficients and mean, variance of imputed variable WGCI - Seminar

Application to LDC dataset (contd.): • Largest deviations observed between imputed and estimated values for ChldMort1, although only 10 values were missing. Sqrt(deviations^2) ChldMort CalInt EnrRatio LitRate 155.89 48.84 82.14 96.57 • This is reflected also in the standard deviation of ChldMort Note: the range of ChldMort is [0,500], thus larger deviations are possible compared to the other variables. WGCI - Seminar

Application to LDC dataset (contd.): Variable ChldMort Obs Imputed Combined No 118 10 128 Min 6.06 4.40 4.40 Max 262.59 187.249 262.59 Mean 88.20 110.118 89.91 StdDev 65.68 66.2078 65.73 Variable EnrRatio Obs Imputed Combined No 90 38 128 Min 19 13.72 13.72 Max 104 107.73 107.73 Mean 71.81 65.11 69.82 StdDev 22.89 23.47 23.18 WGCI - Seminar

Application to LDC dataset (contd.): Variable LitRate Obs Imputed Combined No 92 36 128 Min 13.6 11.85 11.85 Max 98.2 97.93 98.2 Mean 69.47 56.92 65.94 StdDev 22.74 23.84 23.65 Variable CalInt Obs Imputed Combined No 119 9 128 Min 75 100.80 75 Max 164 162.8 164 Mean 115.48 124.79 116.13 StdDev 20.26 21.85 20.42 WGCI - Seminar

Application to LDC dataset (contd.): • Srmi provides a flexible method for multiply imputing missing data without requiring the assumption of a specific joint distribution of the data at hand. • The srmi approach reflects intuitively the idea to use regression models for imputation purposes. • However, it is more powerful than single regression through the full exploitation of the covariance structure. WGCI - Seminar

Sequential Regression – A Method for Multipe Imputations of Missing Data

Sequential Regression – A Method for Multipe Imputations of Missing Data

Presentation Transcript

Logistic Regression

Statistics and Data Analysis

Logistic and Poisson Regression: Modeling Binary and Count Data LISA Short Course Series

Multiple Linear Regression and the General Linear Model

Sequential sums of squares

Sequential MOS Logic Circuits

6-3 Multiple Regression

Approximate Mining of Consensus Sequential Patterns

FUNDAMENTALS of ENGINEERING SEISMOLOGY

Logistic Regression

Simple Linear Regression

Chapter 9

Lecture 11: Sequential Circuit Design

ECIV 301

FORECASTING WITH REGRESSION MODELS TREND ANALYSIS

Microarrays: Common Analysis Approaches

Prof. Qing Wang

Linear Regression

The Wealth of Nations

Regression Analysis and Multiple Regression

Missing Data in Randomized Control Trials