A Randomized Experiment Comparing Random to Nonrandom Assignment
William R. Shadish, University of California, Merced, and M. H. Clark, Southern Illinois University, Carbondale
Paper presented at the Society for Research on Educational Effectiveness, December 2006
Nonrandomized Experiments
• A central hypothesis about the use of nonrandomized experiments is that their results can closely approximate results from randomized experiments,
• especially when the results of the nonrandomized experiment are appropriately adjusted by, for example, selection bias modeling or propensity score analysis.
• I take the goal of such adjustments to be to estimate what the effect would have been if the nonrandomly assigned participants had instead been randomly assigned to the same conditions and assessed on the same outcome measures.
• The latter is a counterfactual that cannot actually be observed.
• So how is it possible to study whether these adjustments work?
Kinds of Empirical Comparisons of Randomized to Nonrandomized Experiments
• In general, there have been three different ways to study this question:
  • Computer simulations
  • Single-study comparisons
  • Meta-analytic comparisons
• None of these methods provides a good estimate of that ideal counterfactual.
• We can, however, estimate that counterfactual in the usual recommended way, with a randomized experiment:
N = 445 undergraduate psychology students were first randomly assigned to one of two arms:
• Randomized experiment (N = 235): randomly assigned to mathematics training (N = 119) or vocabulary training (N = 116)
• Nonrandomized experiment (N = 210): self-selected into mathematics training (N = 79) or vocabulary training (N = 131)
All participants were post-tested on both vocabulary and mathematics outcomes.
More on the Design
• All participants pretested on a host of covariates
• Chose math and vocab training because:
  • Good analogue to educational interventions
  • Relevant to college students
  • Easy to control effect size with item difficulty
  • Math phobia makes selection bias plausible
• All participants treated together, without knowledge of the different conditions.
• All participants posttested on both math and vocab outcomes.
Unadjusted Results: Effects of Math Training on Math Outcome

                                    Math Tng   Vocab Tng   Mean   Absolute
                                    Mean       Mean        Diff   Bias
Unadjusted randomized experiment    11.35      7.16        4.19
Unadjusted quasi-experiment         12.38      7.37        5.01   .82

• Conclusions:
• The effect of math training on math scores was larger when participants could self-select into math training.
• The 4.19-point effect (out of 18 possible points) in the randomized experiment was overestimated by 19.6% (.82 points) in the nonrandomized experiment.
Unadjusted Results: Effects of Vocab Training on Vocab Outcome

                                    Vocab Tng   Math Tng   Mean   Absolute
                                    Mean        Mean       Diff   Bias
Unadjusted randomized experiment    16.19       8.08       8.11
Unadjusted quasi-experiment         16.75       7.75       9.00   .89

• Conclusions:
• The effect of vocab training on vocab scores was larger (9.00 of 30 points) when participants could self-select into vocab training.
• The 8.11-point effect (out of 30 possible points) in the randomized experiment was overestimated by 11% (.89 points) in the nonrandomized experiment.
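In both tables, the absolute bias is simply the quasi-experimental effect minus the randomized effect, and the percent overestimate divides that bias by the randomized benchmark. For the math outcome, for example:

\[ \text{bias} = 5.01 - 4.19 = 0.82, \qquad 0.82 / 4.19 \approx 19.6\%, \]

and for the vocabulary outcome, \( 9.00 - 8.11 = 0.89 \), with \( 0.89 / 8.11 \approx 11\% \).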
Adjusted Quasi-Experiments
• It is no surprise that randomized and nonrandomized experiments might yield different answers.
• The more important question is whether statistical adjustments, in this case propensity score adjustments, can improve the quasi-experimental estimate.
Estimation of Propensity Scores
• Used stepwise logistic regression, with subsequent forced entry of variables that remained out of balance (a sketch of this estimation step follows the list)
• For example: math and vocabulary pretest scores, ACT, GPA, measures of previous exposure to math courses, math anxiety, and demographics
• But also the "Big Five" personality traits (extraversion, emotional stability, agreeableness, intellect, and conscientiousness)
• Checked balance with a 2 x 5 ANOVA (condition by propensity score quintile), recomputed, rechecked.
• Initially, 10 of 30 covariates were out of balance.
• At the end, 1 of 30 interactions was significant and 0 of 30 main effects were significant.
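A minimal sketch of this estimation step in Python, assuming a pandas DataFrame `df` with a 0/1 indicator `math_tng` (math vs. vocab training) and purely illustrative covariate names; plain logistic regression stands in here for the paper's stepwise procedure:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical covariate names -- the paper used roughly 30 covariates,
# including pretests, ACT, GPA, math anxiety, and Big Five traits.
covariates = ["math_pre", "vocab_pre", "act", "gpa",
              "math_anxiety", "extraversion", "conscientiousness"]

X = sm.add_constant(df[covariates])          # design matrix with intercept
fit = sm.Logit(df["math_tng"], X).fit()      # logistic regression
df["pscore"] = fit.predict(X)                # estimated propensity scores

# Crude balance check within propensity score quintiles (the paper used
# a 2 x 5 ANOVA): print treated-minus-control covariate means per stratum.
df["quintile"] = pd.qcut(df["pscore"], 5, labels=False)
for cov in covariates:
    means = df.groupby(["quintile", "math_tng"])[cov].mean().unstack()
    print(cov, (means[1] - means[0]).round(2).tolist())
```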
Methods for Propensity Score Adjustments
• ANCOVA, with the propensity score as a covariate
  • Sensitive to nonlinearities
• Stratification on propensity score quintiles
• Weighting
  • Each observation is weighted by the inverse of its propensity score (treatment) or by the inverse of 1 - ps (control), and a standard weighted average is then computed
• The three estimators are sketched below; the results of the three adjustments follow.
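Minimal sketches of the three estimators, continuing the illustrative `df`, `pscore`, `math_tng`, and `math_out` names from the previous sketch (these show the general techniques, not the paper's exact analyses):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 1. ANCOVA: regress the outcome on treatment plus the propensity score;
#    the treatment coefficient is the adjusted effect estimate.
ancova = smf.ols("math_out ~ math_tng + pscore", data=df).fit()
effect_ancova = ancova.params["math_tng"]

# 2. Stratification: average the within-quintile treatment-control
#    differences across the five propensity score strata.
df["quintile"] = pd.qcut(df["pscore"], 5, labels=False)
strata = df.groupby(["quintile", "math_tng"])["math_out"].mean().unstack()
effect_strat = (strata[1] - strata[0]).mean()

# 3. Inverse-propensity weighting: treated cases get weight 1/ps,
#    controls 1/(1 - ps); then compare weighted outcome means.
t = (df["math_tng"] == 1).to_numpy()
y = df["math_out"].to_numpy()
w = np.where(t, 1 / df["pscore"], 1 / (1 - df["pscore"]))
effect_ipw = np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])
```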
Percent Bias Remaining After Adjustments
There are a number of interesting features of these results:
Percent Bias Remaining After Adjustments: Stratification
Rosenbaum and Rubin (1984) note that propensity score stratification should yield approximately a 90% reduction in bias in the outcome due to each of the variables contributing to the propensity score. We obtained bias reductions of 90% and 91% for the mathematics and vocabulary outcomes, respectively. (When was the last time you saw a point prediction like that confirmed?)
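Stated as a formula (the notation is mine, not the paper's), percent bias reduction compares the adjusted and unadjusted estimates to the randomized benchmark:

\[ \%\ \text{bias reduction} = 100 \times \left( 1 - \frac{\lvert \hat{\Delta}_{\text{adjusted}} - \hat{\Delta}_{\text{randomized}} \rvert}{\lvert \hat{\Delta}_{\text{unadjusted}} - \hat{\Delta}_{\text{randomized}} \rvert} \right) \]

For the math outcome, a 90% reduction of the .82-point unadjusted bias leaves roughly .08 points of residual bias.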
Percent Bias Remaining After Adjustments: ANCOVA
Bias reduction (in percentage terms) was approximately the same across the two outcomes for both propensity score stratification (90%, 91%) and propensity score weighting (70%, 74%). Bias reduction differed over outcomes only when we used the propensity score as a covariate. The reason is that the relationship between propensity scores and the mathematics outcome was not linear:
Relationship Between Propensity Scores and Outcomes
For the vocabulary outcome, even a lowess smoother could not identify a function that fit much better than a linear one. For the mathematics outcome, however, the relationship had a cubic component. So redoing the ANCOVA for math to include quadratic and cubic components yielded much better adjustment (a sketch of both steps follows; the results are on the next slide):
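A minimal sketch of this diagnostic-and-refit sequence, continuing the earlier illustrative names: first inspect the propensity-outcome relationship with a lowess smoother, then refit the ANCOVA with quadratic and cubic propensity terms.

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.nonparametric.smoothers_lowess import lowess

# Lowess check: does a straight line describe the propensity-outcome
# relationship, or is there visible curvature?
smoothed = lowess(df["math_out"], df["pscore"])  # sorted (x, fitted) pairs
plt.scatter(df["pscore"], df["math_out"], s=10, alpha=0.4)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.xlabel("Propensity score")
plt.ylabel("Mathematics outcome")
plt.show()

# Nonlinear ANCOVA: add squared and cubed propensity score terms.
model = smf.ols(
    "math_out ~ math_tng + pscore + I(pscore ** 2) + I(pscore ** 3)",
    data=df).fit()
effect_nonlinear = model.params["math_tng"]
```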
Percent Bias Remaining After Adjustments: Nonlinear ANCOVA
Bias in the mathematics outcome was reduced by 97.5%. Adding the same terms to the adjustment for the vocabulary outcome had virtually no effect, though perhaps we did not add the "right" nonlinear terms. And now each method yielded about the same bias reduction for each dependent variable. But why did weighting do comparatively poorly?
Extreme Propensity Scores
The reason may be that n = 15 propensity scores were less than .01 (they were arbitrarily truncated to .01) and another n = 15 were greater than .9999 (they were arbitrarily truncated to .9999). Rubin (2001) notes that the estimation of propensity scores "can generate unrealistically extreme weights when an estimated propensity score is near zero or one" (p. 184). I am open to suggestions and comments on this or other possible explanations for the poor performance of weighting.
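A minimal sketch of the truncation step, and of why extreme scores break weighting, again with the illustrative names used above:

```python
import numpy as np

# Clip estimated propensity scores to the bounds given on the slide.
ps = df["pscore"].clip(lower=0.01, upper=0.9999)

# Even after truncation, a control case with ps = .9999 receives an
# inverse-propensity weight of 1 / (1 - .9999) = 10,000, so a handful
# of such cases can dominate the weighted mean -- Rubin's warning.
t = (df["math_tng"] == 1).to_numpy()
w = np.where(t, 1 / ps, 1 / (1 - ps))
```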
What Is Impressive Is That Four Different Rubin Predictions Were Supported
• A specific point prediction that propensity score stratification should reduce bias by 90%, which it did, twice (math and vocab).
  • Indeed, if it were always exactly 90%, we could compute the exact effect (observed bias reduction / .9).
• When propensity scores have no nonlinear relationship with the outcome, ANCOVA will reduce bias (it did, at 90%, for vocab).
• When propensity scores have nonlinearities that are not modeled, ANCOVA will perform more poorly (the initial math ANCOVA).
• When those nonlinearities are modeled well, ANCOVA will reduce bias (98% in the nonlinear math ANCOVA).
Predictors of Convenience
• Bad practice: we also tested the effectiveness of propensity score adjustments based only on predictors of convenience (sex, age, ethnicity, marital status).
• Depending on how we did the analyses, bias reduction ranged from 12% to 30% at best.
• This underscores the importance of thoughtful selection of covariates in the design of the study.
Discussion
• Results suggest more optimistic conclusions than those reached by many recent researchers:
  • Propensity score adjustments to nonrandomized experiments might yield a reasonable estimate of what the effect would have been if the same participants had instead been randomly assigned to the same conditions using the same measures.
• Jason Luellen's dissertation:
  • Methods for constructing propensity scores are crucial.
  • With a large sample size, logistic regression nearly always came closer to the right answer.
Conclusions
• The laboratory analogue method is a novel test of the effects of assignment method, and of adjustments to quasi-experiments, with some advantages over other methods.
• Propensity score adjustments show promise.