Estimating Causal Effects: Using Experimental and Observational Designs Anna Haskins, Nick Mader, and Hilary Shager ITP Seminar October 26, 2007
Overview • What are causal effects and how can we measure them? • Randomized controlled trials—benefits and limitations • Quasi-experimental methods for estimating causal effects • Propensity score matching, regression discontinuity design, fixed effects, instrumental variables, others • Applying what we’ve learned • Benefits and limitations of using large-scale databases • Practical resources for YOU
Why focus on causal effects? • There have been many *bad* research papers produced, and we need to do a better job of conducting high quality research in education • Within the world of researchers, educators, and policymakers there is a lack of clarity regarding which analytic or methodological approaches are most appropriate for making causal inferences about the effectiveness of educational interventions
Institutional support for RCTs • IES shows clear preference for Randomized Control Trials (RCTs) • NCLB • What Works Clearinghouse • ITP grant and other gov’t funded research • NRC reports • IERI
Randomized Control Trials (RCTs) “When correctly implemented, the randomized controlled experiment is the most powerful design for detecting treatment effects…” -Schneider et al., p. 11 • Why are they so great, you ask? When implemented correctly… • Assures treatment group assignment is independent of the pretreatment characteristics of group members • Measures the effect of an intervention or “cause” by washing out every other cause (no confounding of treatment effect)
However… “My opinion about RCTs is that they are underutilized in educational research and overemphasized in political discussions.” -Gerald E. Sroufe, director of government relations for AERA
Limitations of RCTs • What you can’t learn from RCTs… • Mechanisms (how/why the treatment worked) • External validity (generalizability) often an issue • Ignores real world selection processes • Why not always feasible? • Logistical issues • Ethical issues • Time and money constraints
In reality… • Most of us will do at least some quasi-experimental research • Quasi-experiments are comparative studies that carefully attempt to isolate the effect of an intervention through means other than randomization • There is a need for other methods to inform randomized trials • Defining relevant outcomes • Identifying promising interventions • Targeting populations of interest • Suggesting causal mechanisms
Motivation behind AERA white paper • There is an important role for quasi-experimental methods in education research • Large-scale, longitudinal databases, like those available from NCES, are excellent resources for this work • But we need to remember that we still want to strive for causal inference
Criteria for Making Causal Inferences • Causal Relativity • The effect of a cause must always be evaluated relative to another cause (causal questions ask the effectiveness of a treatment relative to some control or other treatment) • Causal Manipulation • Each participant must be potentially exposable to the causes under consideration (this excludes attributes such as race or gender as cause since these are typically not manipulable)
Criteria for Making Causal Inferences • Temporal Ordering • Exposure to a cause must occur at a specific time or within a specific time period (so that pre- and post-exposure measurements can be taken to determine the magnitude of the effect) • Elimination of Alternative Explanations • Alternative explanations for the relationship between possible causes (treatments) and their effects must be ruled out. This is usually done through random assignment, which ensures that any differences in outcomes between the treatment and control groups can be attributed to differences in treatment assignment.
Methods for Observational Data • Four methods approximating RCTs using observational data + assumptions • Propensity Score Matching • Regression Discontinuity • Fixed Effects • Instrumental Variables • Control Functions (my addition)
[Diagram: The Problem in Causal Inference. Elements: Observed Factor, Unobserved Factor, Confounding Influence, Treatment, Outcome]
[Diagram: RCT Solution. Elements: Observed Factor, Unobserved Factor, Confounding Influence, Treatment, Outcome]
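The confounding problem and the RCT solution can be illustrated with a small simulation. This is only a sketch on made-up data: the effect size of 2.0, the coefficient of 3.0 on the unobserved factor, and the function names are all assumptions chosen for illustration. It compares a naive treated-versus-untreated difference when units select into treatment with the same difference under coin-flip assignment.

```python
import random


def mean(values):
    return sum(values) / len(values)


def difference_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    y1 = [y for y, t in zip(outcomes, treated) if t == 1]
    y0 = [y for y, t in zip(outcomes, treated) if t == 0]
    return mean(y1) - mean(y0)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Return (naive estimate under self-selection, estimate under randomization)."""
    rng = random.Random(seed)
    # Hypothetical unobserved factor (e.g., motivation) that also raises the outcome.
    unobserved = [rng.gauss(0, 1) for _ in range(n)]

    def outcome(t, u):
        return true_effect * t + 3.0 * u + rng.gauss(0, 1)

    # Observational setting: units with a high unobserved factor select into treatment.
    t_selected = [1 if u + rng.gauss(0, 1) > 0 else 0 for u in unobserved]
    y_selected = [outcome(t, u) for t, u in zip(t_selected, unobserved)]

    # RCT: a coin flip assigns treatment, independent of the unobserved factor.
    t_random = [rng.randint(0, 1) for _ in range(n)]
    y_random = [outcome(t, u) for t, u in zip(t_random, unobserved)]

    return (difference_in_means(y_selected, t_selected),
            difference_in_means(y_random, t_random))


naive_estimate, rct_estimate = simulate()
print(f"naive: {naive_estimate:.2f}  RCT: {rct_estimate:.2f}  truth: 2.00")
```

In this setup the naive comparison overstates the effect because the treated group also has more of the unobserved factor, while the randomized comparison lands near the assumed true effect.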
Propensity Score Matching • Idea • Compares outcomes of similar units where the only difference is treatment; discards the rest • Example • Low ability students will have lower future achievement, and are also likely to be retained in grade • Naïve comparison of untreated/treated students creates bias, where the untreated do better in the post period • Matching methods make the proper comparison
[Diagram: Propensity Score Matching. Elements: Observed Factor, Unobserved Factor, Confounding Influence, Treatment, Outcome]
Propensity Score Matching • Advantages • Draws inference from only proper comparisons • Focuses on population of interest • Use of propensity score solves the dimensionality problem in matching • Limitations • Cannot correct for unobserved characteristics influencing the outcome
Propensity Score Matching • Implementation • First stage: regress treatment on observables • Second stage: form each unit’s predicted probability of treatment (the propensity score) and keep only observations in the region where treated and untreated scores overlap • Third stage: compare outcomes of treated observations to similar non-treated observations. The weight a comparison (between treated and control units) gets in the analysis decreases as the units get less similar. This can be done with “bins” or kernel functions.
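The three stages can be sketched on simulated data for the grade-retention example. Everything here is an assumption for illustration: the data-generating process, the true effect of 1.0, and the helper names. For brevity the third stage uses nearest-neighbor matching (all weight on the single closest control) instead of the bins or kernel weights mentioned on the slide, and there is only one observable, so the sketch does not show the dimensionality benefit.

```python
import bisect
import math
import random


def fit_logit(x, t, iterations=25):
    """First stage: logistic regression of treatment on one observable (Newton's method)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        g0 = sum(ti - pi for ti, pi in zip(t, p))
        g1 = sum((ti - pi) * xi for ti, pi, xi in zip(t, p, x))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate(n=5000, true_effect=1.0, seed=1):
    """Return (naive difference in means, matched estimate) on made-up data."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    # Low-ability students are more likely to be retained (treated).
    retained = [1 if rng.random() < 1.0 / (1.0 + math.exp(1.5 * a)) else 0
                for a in ability]
    score = [true_effect * t + 2.0 * a + rng.gauss(0, 1)
             for t, a in zip(retained, ability)]

    treated_y = [y for y, t in zip(score, retained) if t == 1]
    control_y = [y for y, t in zip(score, retained) if t == 0]
    naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

    # Second stage: predicted probability of treatment for every unit.
    b0, b1 = fit_logit(ability, retained)
    pscore = [1.0 / (1.0 + math.exp(-(b0 + b1 * a))) for a in ability]

    # Third stage: match each treated unit to the control with the closest score.
    controls = sorted((s, y) for s, y, t in zip(pscore, score, retained) if t == 0)
    control_scores = [s for s, _ in controls]
    gaps = []
    for s, y, t in zip(pscore, score, retained):
        if t == 0 or not (control_scores[0] <= s <= control_scores[-1]):
            continue  # skip controls and treated units outside the overlap region
        i = bisect.bisect_left(control_scores, s)
        nearest = min((j for j in (i - 1, i) if 0 <= j < len(controls)),
                      key=lambda j: abs(control_scores[j] - s))
        gaps.append(y - controls[nearest][1])
    return naive, sum(gaps) / len(gaps)


naive_estimate, matched_estimate = simulate()
print(f"naive: {naive_estimate:.2f}  matched: {matched_estimate:.2f}  truth: 1.00")
```

In this simulation the naive comparison makes retention look harmful, because retained students start with lower ability, while the matched comparison recovers something close to the assumed effect. The correction works only because ability is observed here.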
Regression Discontinuity Design • Idea • Focuses on a subsample for which assignment to the treatment is random • Example • Low ability students will have lower future achievement, and are also likely to be retained in grade • Naïve comparison of untreated/treated creates bias, where the untreated do better in the post period • RDD compares the outcomes of students whose characteristics are in the neighborhood of a sharp cutoff in a retention policy (e.g., held back if scoring 49, promoted if scoring 50)
[Diagram: Regression Discontinuity Design. Elements: Observed Factor, Unobserved Factor, Sample at Policy Threshold, Confounding Influence, Treatment, Outcome]
Regression Discontinuity Design • Advantages • As with RCTs, (as-good-as) random assignment, here among units near the cutoff, is used to eliminate confounding factors • Unlike RCTs, can give priority to certain units when phasing in treatment • Limitations • Selected subsample may not be the full population of interest • Focus on select subsample reduces sample size • Need for a sharp policy assignment cutoff
Regression Discontinuity Design • Implementation • Determine trade-off between tight “bandwidth” for arguing randomness, and wide bandwidth for statistical power • We can handle “fuzzy” design: estimated effect = (Y+ − Y−) / (p+ − p−) • Y is the outcome, p is the probability of treatment. + indicates an outcome or probability just above the cutoff; − indicates the same, but just below
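A minimal sketch of the fuzzy-design calculation, assuming the standard ratio of the jump in outcomes to the jump in treatment probability at the cutoff. The data are simulated: the cutoff of 50, the retention probabilities (0.8 below, 0.1 above), the effect of -3.0, and the function names are illustrative assumptions. The limits just above and just below the cutoff are estimated with a separate straight-line fit on each side within the bandwidth.

```python
import random


def intercept_at_zero(x, y):
    """Fit y = a + b*x by least squares and return a, the fitted value at x = 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return my - (sxy / sxx) * mx


def fuzzy_rdd(score, treated, outcome, cutoff, bandwidth):
    """Jump in the outcome at the cutoff divided by the jump in treatment probability."""
    rows = list(zip(score, treated, outcome))
    above = [(s - cutoff, t, y) for s, t, y in rows if cutoff <= s < cutoff + bandwidth]
    below = [(s - cutoff, t, y) for s, t, y in rows if cutoff - bandwidth <= s < cutoff]

    def limits(side):
        x = [r[0] for r in side]
        y_limit = intercept_at_zero(x, [r[2] for r in side])
        p_limit = intercept_at_zero(x, [r[1] for r in side])
        return y_limit, p_limit

    y_plus, p_plus = limits(above)
    y_minus, p_minus = limits(below)
    return (y_plus - y_minus) / (p_plus - p_minus)


def simulate(n=40000, true_effect=-3.0, cutoff=50.0, bandwidth=10.0, seed=4):
    rng = random.Random(seed)
    score = [rng.uniform(0, 100) for _ in range(n)]
    # Fuzzy rule: most (not all) students below the cutoff are retained.
    retained = [1 if rng.random() < (0.8 if s < cutoff else 0.1) else 0 for s in score]
    outcome = [0.05 * s + true_effect * t + rng.gauss(0, 1)
               for s, t in zip(score, retained)]
    return fuzzy_rdd(score, retained, outcome, cutoff, bandwidth)


estimate = simulate()
print(f"fuzzy RDD estimate: {estimate:.2f}  truth: -3.00")
```

Widening `bandwidth` uses more observations and lowers the variance, at the cost of relying more heavily on the straight-line approximation away from the cutoff, which is the trade-off named on the slide.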
Fixed Effects • Idea • Eliminates alternative explanations that are “fixed” across units • Example • Students with good backgrounds (family, IQ) elect to attend college, and college increases wages • If student backgrounds are not perfectly observed, there will be a residual correlation between college attendance and wages leading to a biased finding • Using FE at the family level, we can “soak up” this influence to the extent that family quality is fixed
[Diagram: Fixed Effects Solution. Elements: Observed Factor, Unobserved Factor, Fixed Influences, Confounding Influence, Treatment, Outcome]
Fixed Effects • Advantages • (As with RCTs) we do not need to observe these confounding influences • Limitations • Cannot control for varying (non-fixed) influences • Family example: parents get divorced, family finances change • Data Demands: can only control for fixed influences at a level higher than the level of treatment • May reduce sample size • Can be solved with better, often longitudinal, data • Sample may no longer be representative • Bias is towards no finding. FE eliminates portions of true effect, but not noise.
Fixed Effects • Implementation notes • Only requires insertion of dummy variables at the level of the effect. In our example, a dummy variable for each family. Other examples: district level, school level, individual level • Correlation between T_jt and f_j represents the trend that good (or bad) families generally take the treatment. If f_j is treated as an error term, there is endogeneity of T_jt. By controlling for it, the only potential correlation is between T_jt and e_ijt in the sense of time-varying unobservables.
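A sketch of the family fixed-effects idea on simulated sibling data. The data-generating process, the effect of 1.0, and the names are assumptions for illustration. Instead of literally adding one dummy per family, the code subtracts family means from treatment and outcome (the "within" transformation), which yields the same coefficient on treatment as the dummy-variable regression.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (regression with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean from its members' values."""
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


def simulate(n_families=3000, true_effect=1.0, seed=2):
    """Return (pooled OLS estimate, family fixed-effects estimate)."""
    rng = random.Random(seed)
    family, college, wage = [], [], []
    for j in range(n_families):
        f_j = rng.gauss(0, 1)  # unobserved family quality, fixed within the family
        for _ in range(2):     # two siblings per family
            t = 1 if f_j + rng.gauss(0, 1) > 0 else 0
            family.append(j)
            college.append(t)
            wage.append(true_effect * t + f_j + rng.gauss(0, 1))

    pooled = slope(college, wage)  # ignores f_j, so it is confounded
    within = slope(demean_by_group(college, family),
                   demean_by_group(wage, family))
    return pooled, within


pooled_estimate, fe_estimate = simulate()
print(f"pooled: {pooled_estimate:.2f}  fixed effects: {fe_estimate:.2f}  truth: 1.00")
```

Only families whose siblings differ in college attendance contribute to the within estimate, which is one concrete way the effective sample shrinks and may stop being representative.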
Instrumental Variables • Idea • Determines observed versus unobserved explanations for taking treatment, and only uses observed portion • Example • Students with good backgrounds (family, IQ) elect to attend college, and college increases wages • If student backgrounds are not perfectly observed, there will be a residual correlation between college attendance and wages leading to a biased finding • IV substitutes actual college attendance with college attendance predicted by observables (i.e., with unobserved factors subtracted out)
[Diagram: Instrumental Variables. Elements: Observed Factor, Unobserved Factor, Confounding Influence, Treatment, Outcome, Instrumental Variable(s)]
Instrumental Variables • Advantages • Relies on trustworthy (observed) variation in treatment • Can use prior RCTs to find valid instruments, e.g., Nye et al. (2004) • Limitations • Difficult to find valid instruments • Cannot determine whether a variable is truly exogenous • Works only to the extent that the instrument is exogenous and strongly correlated with treatment
Instrumental Variables • Implementation • First stage: regress treatment* on observables and the instrument(s) • Second stage: run outcome regression, substituting treatment variable with predicted (by only observables) treatment • If the treatment is not continuous, a Heckit procedure may be more appropriate. More on this later. * In the implementation presented, the treatment should be continuous, such as “hours tutoring received” or “hours instructed with particular curriculum”. Note, however, that the logic of Instrumental Variables methods can be modified to fit any application, such as binary treatment classes.
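The two stages can be sketched for a continuous treatment such as hours of tutoring. The data are simulated, and the instrument `z`, the effect of 0.5, and all coefficients are assumptions made for illustration. For brevity there are no other observables, so the first stage regresses treatment on the instrument alone.

```python
import random


def ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def simulate(n=20000, true_effect=0.5, seed=3):
    """Return (naive OLS estimate, two-stage estimate) on made-up data."""
    rng = random.Random(seed)
    z, hours, outcome = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)  # hypothetical instrument: shifts treatment only
        u = rng.gauss(0, 1)   # unobserved background: raises treatment AND outcome
        h = 5.0 + zi + u + rng.gauss(0, 1)
        z.append(zi)
        hours.append(h)
        outcome.append(true_effect * h + 2.0 * u + rng.gauss(0, 1))

    _, naive = ols(hours, outcome)

    # First stage: predict treatment from the instrument.
    a1, b1 = ols(z, hours)
    predicted_hours = [a1 + b1 * zi for zi in z]

    # Second stage: regress the outcome on predicted treatment.
    _, two_stage = ols(predicted_hours, outcome)
    return naive, two_stage


naive_estimate, iv_estimate = simulate()
print(f"OLS: {naive_estimate:.2f}  IV: {iv_estimate:.2f}  truth: 0.50")
```

The second-stage standard errors from a plain regression on predicted values are not valid as printed by ordinary regression software, so in practice a dedicated two-stage least squares routine is the safer choice; this sketch only shows where the point estimate comes from.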
Control Function Approach (e.g. Heckit) • Idea • Determines observed versus unobserved explanations for taking treatment, and uses this to model the confounding influence • In approach, this is very similar to IV, but plumbs information about unobserved factors • Example • Students with good backgrounds (family, IQ) elect to attend college, and college increases wages • If student backgrounds are not perfectly observed, there will be a residual correlation between college attendance and wages leading to a biased finding • Control functions study how likely a student is to obtain treatment to determine whether there’s an important unobserved influence (poor, urban minority student attending Harvard), and adjusts expectations in outcome equation
[Diagram: Control Functions. Elements: Observed Factor, Unobserved Factor, Confounding Influence, Treatment, Outcome, Factors Determining Treatment]
Control Function Approach (e.g. Heckit) • Advantages • We can understand the choice process (and confirm our prior expectations) • Limitations • Parametric assumptions may be inappropriate in drawing inference on unobserved factors • Non-parametric approaches are available
Control Function Approach (e.g. Heckit) • Implementation* • First stage: regress treatment on observables and the instrument(s) (a probit in the Heckit case) • Second stage: run outcome regression, adding a control term computed from the first stage (in the Heckit case, the inverse Mills ratio) rather than substituting the treatment variable with predicted treatment * This implementation is the seminal one considered in Heckman (1979), where outcomes are observed only for units who receive treatment. The logic of this method can be broadly applied and is similar across applications.
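A stripped-down sketch of the two-step logic in the sample-selection setting, where the outcome is seen only for units that take the treatment. The simulated data, the population mean of 1.0, the error correlation of 0.7, and the function names are assumptions for illustration. The first stage is a probit for treatment on a hypothetical instrument `z`; the second stage regresses the observed outcome on the inverse Mills ratio from the first stage, so that the intercept estimates the mean outcome for the whole population.

```python
import math
import random


def normal_pdf(w):
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def normal_cdf(w):
    return 0.5 * math.erfc(-w / math.sqrt(2.0))


def fit_probit(z, d, iterations=15):
    """First stage: probit of treatment d on z with an intercept (Newton's method)."""
    g0 = g1 = 0.0
    for _ in range(iterations):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for zi, di in zip(z, d):
            q = 2 * di - 1
            index = g0 + g1 * zi
            lam = q * normal_pdf(index) / normal_cdf(q * index)
            weight = lam * (lam + index)
            s0 += lam
            s1 += lam * zi
            h00 += weight
            h01 += weight * zi
            h11 += weight * zi * zi
        det = h00 * h11 - h01 * h01
        g0 += (h11 * s0 - h01 * s1) / det
        g1 += (h00 * s1 - h01 * s0) / det
    return g0, g1


def ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def simulate(n=10000, population_mean=1.0, rho=0.7, seed=5):
    """Return (naive mean among the treated, corrected mean, coefficient on the control term)."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        u = rng.gauss(0, 1)  # unobserved factor in the treatment decision
        e = rho * u + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)  # correlated outcome error
        z.append(zi)
        d.append(1 if zi + u > 0 else 0)
        y.append(population_mean + e)  # only used below when d == 1

    g0, g1 = fit_probit(z, d)
    observed_y = [yi for yi, di in zip(y, d) if di == 1]
    # Control term: inverse Mills ratio, large for units unlikely to be treated.
    mills = [normal_pdf(g0 + g1 * zi) / normal_cdf(g0 + g1 * zi)
             for zi, di in zip(z, d) if di == 1]

    naive = sum(observed_y) / len(observed_y)
    corrected, control_coefficient = ols(mills, observed_y)
    return naive, corrected, control_coefficient


naive_mean, corrected_mean, control_coef = simulate()
print(f"naive: {naive_mean:.2f}  corrected: {corrected_mean:.2f}  truth: 1.00")
```

A clearly nonzero coefficient on the control term is the signal that unobserved factors in the treatment decision also move the outcome. The correction leans on the normality assumption built into the probit and the Mills ratio, which is the parametric limitation noted on the slide.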
Return to NRC’s original questions • Is there a systematic effect? • Maybe experiments are our “best” tool to answer this question • But concerns remain… • Practicality • Access • Ethics • Timeliness • External validity
And there are two other questions… 2) What is happening? 3) Why or how is it happening? • These questions are central to the design of experiments • Also important for development of theory • Also of great interest to policy makers and educational practitioners • Large-scale database research can help us answer these questions
What are the benefits and limitations of large-scale data sets? • Benefits • Widely accessible • Wealth of contextual information and ability to consider multiple counterfactuals • Large samples allow for comparisons across sub-groups of interest • Can be linked with other datasets • Limitations • Missing data • Design/instruments often developed based on precedent rather than need
“However, even with these data, which arguably are among the best we have, the findings have not consistently yielded information that could substantially improve our schools and change the educational opportunities of students, especially those who attend high-poverty schools and whose families have limited social resources.” --Schneider et al., p. 111
AERA board recommendations • Employ decision rules to assess strength of quasi-experimental designs • Move beyond simple OLS to get at causation • E.g., see p. 113-116 of Schneider et al. or What Works Clearinghouse guidelines • Strengthen future data collection efforts • Embed RCTs within longitudinal studies • Don’t rely on precedent to develop surveys • Don’t ignore processes in favor of products
Large-scale data set resources • Data sources on campus • DISC (3308 Sewell Social Science Building) • Data sources on-line • NCES (http://nces.ed.gov/) • ICPSR (http://www.icpsr.umich.edu/) • OPR (http://opr.princeton.edu/) • High quality research collections and guidelines • WWC (http://ies.ed.gov/ncee/wwc/) • C2 (http://www.campbellcollaboration.org/) • Product from today…
A turn to the practical… • What causal question (of interest to IES…) haunts your discipline? • What database(s) might be used to answer it? • Is there any existing RCT data that might be mined? • Which quasi-experimental method(s) might be used to approach causal inference?
References
Heckman, J.J. and J.A. Smith. 1995. “Assessing the Case for Social Experiments.” The Journal of Economic Perspectives, 9(2): 85-110.
Holland, P.W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association, 81: 945-970.
Magnuson, K.A., Ruhm, C., and J. Waldfogel. 2007. “Does Prekindergarten Improve School Preparation and Performance?” Economics of Education Review, 26: 33-51.
Morris, P., Gennetian, L., Duncan, G., and A. Huston. 2007. “How Welfare Policies Affect Child and Adolescent Development: Investigating Pathways of Influence with Experimental Data.” Presented at University of Kentucky Center for Poverty Research, 12 April.
Nye, B., Konstantopoulos, S., and L.V. Hedges. 2004. “How Large Are Teacher Effects?” Educational Evaluation and Policy Analysis, 26: 237-257.
Raudenbush, S.W. 2005. “Learning from Attempts to Improve Schooling: The Contribution of Methodological Diversity.” Educational Researcher, 34(5): 25-31.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W.H., and R.J. Shavelson. 2007. Estimating Causal Effects: Using Experimental and Observational Designs. AERA: Washington, D.C.
Todd, P.E. and K.I. Wolpin. 2006. “Assessing the Impact of a School Subsidy Program in Mexico: Using a Social Experiment to Validate a Dynamic Behavioral Model of Child Schooling and Fertility.” American Economic Review, 96: 1384-1417.
Viadero, D. 2007. “‘Scientific’ Label in Law Stirs Debate.” Education Week, 27(8): 1, 23.