Propensity Score Matching: A Primer for Educational Researchers

Forrest Lane, Ph.D., Department of Educational Studies & Research

Aims • Recognize the implications of self-selection and non-randomization in quasi-experimental research. • Understand the key terms and theory behind propensity score matching. • Identify strategies and resources for implementing propensity score matching in research.

Overview • Theoretical Framework • Propensity Score Matching Process • Implications & Practical Guidance

Introduction Experimental design has historically been considered the “gold standard” for causal inference (West, 2009).

Introduction The problem is that experimental designs may not be possible in practice. There are many ethical, political, or financial arguments against them (Cook, 2002). Some suggest experimental designs: • Can rarely be mounted in schools • Sacrifice external for internal validity • Create a rational decision-making model that does not describe how schools actually make decisions

Introduction “Interventions conducted under laboratory conditions with selective participant criteria do not necessarily generalize well in real world of human services” (Levant & Hasan, 2008, p. 658).

Quasi-Experiment Alternative • Quasi-experiments allow for group comparisons but do not allow for causal inferences. • Groups may differ systematically from one another on a number of covariates and therefore cannot be directly compared. • Non-randomized studies may lead to effect size bias when interpreting treatment effects.

Problem Increasing calls for evidence of a program’s or intervention’s effectiveness. • Psychology: Bauer (2007); Collins, Leffingwell, & Belar (2007); Levant & Hasan (2008) • Education: Rudd & Johnson (2008); Slavin (2002) Quasi-experiments may not meet this aim.

Experimental • Better estimates of treatment effects with limited generalizability Quasi-Experimental • Biased estimates of treatment effects with greater generalizability

Counterfactuals • A conceptual framework for investigating causality. • Two well-known frameworks are the approaches taken by Campbell (1957) and Rubin (1974; 2005).

*Table taken from West and Thoemmes (2010)

Propensity Score Matching Propensity score matching (PSM) is a statistical technique that aims to control for self-selection bias and thus extend causal inference to non-randomized or quasi-experimental studies (Rosenbaum & Rubin, 1983). It is grounded in the Rubin (1974; 2005) counterfactual framework.

Propensity Score Matching The method uses statistical techniques to reduce differences in the likelihood of group assignment by matching participants on their likelihood of group assignment. PSM assumes, once groups are well matched, systematic differences between groups have been removed and causal inference can be extended.

Propensity Score Matching “For more than two decades, advanced statistical methods known as propensity score (PS) techniques, have been available to aid in the evaluation of cause-effect hypotheses in observational studies. None the less, PS techniques have not yet been used widely in psychological research” (Harder, Stuart, & Anthony, 2010).

Articles Using PSM Figure taken from Thoemmes & Kim (2011)

PSM in the Literature Grunwald & Mayhew (2008) examined the development of moral reasoning in young adults and demonstrated a significant reduction in the overestimation of effects. Morgan (2001) used propensity score matching and demonstrated that the effect of private school education on math and reading achievement is actually larger than findings in non-matched samples. Similar results have been demonstrated in economics (Dehejia & Wahba, 2002), medicine (Schafer & Kang, 2008), and sociology (Morgan & Harding, 2006).

Defining a Propensity Score Defined as the conditional probability of assignment to a particular treatment or control given a set of covariates (Rosenbaum & Rubin, 1983).
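In symbols, the definition above can be written as follows. The notation is an assumption added for illustration: Z is the treatment indicator (1 = treatment, 0 = comparison), X is the vector of observed covariates, and e(x) denotes the propensity score.

```latex
e(x_i) = P(Z_i = 1 \mid X_i = x_i)
```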

Propensity Scores • Propensity scores combine the covariates into a single scalar variable ranging from 0 to 1, which can then be used to match participants across treatment groups. • Once participants are matched, treatment effects should be more reflective of the true effect and can be interpreted analogously to those from randomized designs.

Propensity Score Matching Process

PSM Assumptions • Strongly ignorable treatment assignment • Assumes all systematic differences in group assignment have been removed (Rosenbaum, 2010). • Matching techniques control only for systematic differences due to observable covariates, not unobservable covariates (Guo & Fraser, 2010).

Random Assignment • To apply the Rubin counterfactual model, the assumption of strongly ignorable treatment assignment must be met. • In other words, conditional on a set of covariates, the outcome for a participant must be independent of treatment assignment (Guo & Fraser, 2010)
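The assumption can be stated compactly in potential-outcomes notation. The symbols are assumptions added for illustration: Y(1) and Y(0) are a participant's outcomes under treatment and comparison, Z is the treatment indicator, and X is the set of observed covariates.

```latex
\bigl(Y_i(0),\, Y_i(1)\bigr) \;\perp\; Z_i \mid X_i,
\qquad 0 < P(Z_i = 1 \mid X_i) < 1
```

The second condition requires that every participant has some chance of landing in either group; without it, no comparable counterpart exists to match against.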


Propensity Score Estimation The most commonly used method is logistic regression (Thoemmes & Kim, 2011). Other methods include probit regression, classification trees, and ensemble methods such as bagging, boosted regression trees, and random forests (Shadish, Luellen, & Clark, 2006).
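A minimal sketch of propensity score estimation with logistic regression, written in plain Python so that it runs without extra packages. The function names, the gradient-ascent fitting routine, and the simulated data are all assumptions made for illustration; in practice a statistical package would normally do the fitting.

```python
import math
import random

def sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    ev = math.exp(v)
    return ev / (1.0 + ev)

def fit_logistic(X, z, lr=0.1, n_iter=2000):
    """Fit P(Z = 1 | x) by batch gradient ascent on the mean log-likelihood.

    X is a list of covariate rows, z a list of 0/1 treatment indicators.
    Returns coefficients with the intercept in position 0.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, zi in zip(X, z):
            eta = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            err = zi - sigmoid(eta)
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    # Predicted probability of treatment for every participant.
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            for row in X]

# Simulated (hypothetical) data: two covariates drive self-selection.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
z = [1 if random.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0
     for x1, x2 in X]

beta = fit_logistic(X, z)
ps = propensity_scores(X, beta)
```

Each value in `ps` is the estimated probability that the participant is in the treatment group given the two covariates.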

Modeling Strategy • Non-Parsimonious • All theoretically related variables included in PS estimation • Parsimonious • Some variables can be ignored as a source of potential bias • Hierarchical Regression • Stepwise Regression

Conditioning Strategy • Matching • One-to-one, One-to-many, Caliper • Stratification • Stratification across quintiles may remove approximately 90% of the bias due to covariates (Shadish, Luellen, & Clark, 2005). • Regression Adjustment • The PS may be used as a covariate in ANCOVA, but the assumptions of that analysis must be met.
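The stratification strategy can be sketched as follows. This is an illustrative assumption about one simple way to do it (rank-based quintiles, stratum-size weights), not a prescribed procedure; the function names are hypothetical.

```python
def quintile_strata(ps):
    """Assign each propensity score to one of five strata (0 to 4) by rank."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = min(4, rank * 5 // len(ps))
    return strata

def stratified_effect(y, z, ps):
    """Average the treated-minus-comparison mean difference across quintiles.

    Each stratum is weighted by its size; strata that lack either group
    are skipped because no within-stratum comparison is possible.
    """
    strata = quintile_strata(ps)
    total, weight = 0.0, 0
    for s in range(5):
        treated = [yi for yi, zi, si in zip(y, z, strata) if si == s and zi == 1]
        control = [yi for yi, zi, si in zip(y, z, strata) if si == s and zi == 0]
        if treated and control:
            n_s = len(treated) + len(control)
            diff = sum(treated) / len(treated) - sum(control) / len(control)
            total += n_s * diff
            weight += n_s
    return total / weight if weight else float("nan")
```

Comparing groups only within strata of similar propensity scores is what removes the bias attributable to the observed covariates.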

Balance Evaluation • The standardized difference in the mean propensity score in the two groups should be near zero (d < .20) • The ratio of the variance of the propensity score and continuous covariates in the two groups should be near one, preferably between 0.80 and 1.25
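A minimal sketch of the two balance checks above. The pooled standard deviation used in the denominator (the square root of the average of the two group variances) is an assumption; other definitions of the standardized difference exist. Function names are hypothetical.

```python
import math
import statistics

def standardized_difference(treated, control):
    """Mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

def variance_ratio(treated, control):
    """Ratio of the treated-group variance to the comparison-group variance."""
    return statistics.variance(treated) / statistics.variance(control)

def is_balanced(treated, control, d_max=0.20, vr_low=0.80, vr_high=1.25):
    """Apply the cut-offs quoted above: |d| < .20 and 0.80 <= ratio <= 1.25."""
    d = standardized_difference(treated, control)
    vr = variance_ratio(treated, control)
    return abs(d) < d_max and vr_low <= vr <= vr_high
```

The same functions can be applied to the propensity score itself and to each continuous covariate, before and after matching.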

Balance Evaluation • Multivariate Measures • Hansen and Bowers (2008) provide one test that simultaneously assesses “whether any variable or linear combination of variables was significantly unbalanced after matching” using a χ² distribution (Thoemmes, 2012, p. 9). • The L1 measure, which assesses the balance of all covariates including interaction effects, may also be used (Iacus, King, & Porro, 2011).

Estimating Treatment Effects • Treatment effects can be estimated by testing group differences on the outcome variable(s) in the newly matched sample, using a t-test or an appropriate multi-group equivalent.
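As a sketch of this step, the function below computes a paired t statistic on one-to-one matched pairs. Treating matched pairs as dependent observations is an assumption; an independent-samples test is also used in practice. The function name is hypothetical.

```python
import math
import statistics

def paired_t(treated_outcomes, control_outcomes):
    """Return (mean difference, t statistic, degrees of freedom).

    The two lists must be aligned so that position k holds the outcomes
    of the k-th matched pair.
    """
    diffs = [t - c for t, c in zip(treated_outcomes, control_outcomes)]
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / se, len(diffs) - 1
```

The t statistic is then compared against a t distribution with the returned degrees of freedom.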

Common Support Region • The overlap between groups in the distribution of propensity scores. • The common support region defines the range over which causal effects may be estimated and inferred.
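One simple way to locate the common support region is to compare the minimum and maximum propensity scores in each group. The sketch below assumes that rule; other definitions (for example, trimming on estimated densities) exist, and the function names are hypothetical.

```python
def common_support(ps_treated, ps_control):
    """Return the (low, high) bounds of the overlap between the two groups."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    return low, high

def in_support(ps, low, high):
    """Flag which participants fall inside the common support region."""
    return [low <= p <= high for p in ps]

# Hypothetical propensity scores for three treated and three comparison participants.
low, high = common_support([0.3, 0.5, 0.9], [0.1, 0.4, 0.6])
flags = in_support([0.3, 0.5, 0.9], low, high)
```

Here the treated participant with a score of 0.9 has no comparison participant with a similar score, so any effect estimate would not apply to that participant.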

Hidden Bias • Two participants with the same values on the observed covariates (x) should have the same probability (P) of group assignment. • When true, the ratio of their odds of group assignment (the probability of assignment relative to non-assignment) should be close to one. • If false, the odds of group assignment differ by a multiplier or factor of up to Γ.
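The bound implied by Γ can be written out explicitly. The notation is an assumption added for illustration: π_j and π_k are the probabilities of group assignment for two participants j and k who have identical observed covariates.

```latex
\frac{1}{\Gamma} \;\le\; \frac{\pi_j \,(1 - \pi_k)}{\pi_k \,(1 - \pi_j)} \;\le\; \Gamma,
\qquad \Gamma \ge 1
```

Γ = 1 corresponds to no hidden bias, and Γ = 2 means that one of the two participants could be up to twice as likely, in odds terms, to receive the treatment because of an unobserved covariate.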

Hidden Bias • Rosenbaum (2010) suggested that a Wilcoxon signed rank test may be used to statistically test the impact of various levels of Γ on the interpretation of the treatment effect (i.e., sensitivity analysis).
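A minimal sketch of such a sensitivity analysis for matched pairs. It assumes the large-sample normal approximation to the signed rank statistic and reports the upper bound on the one-sided p-value for a given Γ; the function names and the example data are hypothetical.

```python
import math

def signed_ranks(diffs):
    """Rank |d| for the non-zero pair differences, averaging tied ranks."""
    nz = [d for d in diffs if d != 0]
    order = sorted(range(len(nz)), key=lambda i: abs(nz[i]))
    ranks = [0.0] * len(nz)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nz[order[j + 1]]) == abs(nz[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return nz, ranks

def upper_bound_p_value(diffs, gamma):
    """Upper bound on the one-sided p-value when hidden bias is at most gamma."""
    nz, ranks = signed_ranks(diffs)
    t_plus = sum(r for r, d in zip(ranks, nz) if d > 0)
    p_plus = gamma / (1 + gamma)  # largest chance that a difference is positive
    mean = p_plus * sum(ranks)
    var = p_plus * (1 - p_plus) * sum(r * r for r in ranks)
    z_score = (t_plus - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z_score / math.sqrt(2))

# Hypothetical treated-minus-comparison outcome differences for ten matched pairs.
pair_diffs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, -0.5]
bounds = {g: upper_bound_p_value(pair_diffs, g) for g in (1, 2, 3)}
```

With Γ = 1 the result is the ordinary signed rank p-value. The value of Γ at which the bound first exceeds the chosen significance level indicates how much hidden bias would be needed to change the conclusion.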

Heuristic Scenario • The content area reading strategies (CARS) program was implemented within Florida schools to improve basic reading skills. • Students were taught three animal science lessons from the state-approved curriculum, covering anatomy and physiology, nutrition, and reproduction. • “The lessons were taught over the course of 23 school days, or nearly 1600 minutes of instruction” (Park & Osborne, 2007, p. 57).

Heuristic Scenario • The problem is that students could not be randomly assigned to treatment and comparison groups. • Park and Osborne (2007) also suggested student pre-test scores, grade level, grade point average, gender, ethnicity, and standardized reading levels were statistically significant predictors of agricultural posttest scores (= .67).

Arguments Against ANCOVA • ANCOVA is inappropriate when differences between groups on covariates are large (Hinkle, Wiersma, & Jurs, 2003). • The outcome variable in ANCOVA is an adjusted score, which makes interpretation difficult. • There is a potential mismatch between the research question and the analytic technique, or Type IV error (Fraas, Newman, & Pool, 2007).

Arguments Against ANCOVA • The use of ANCOVA and propensity score matching may result in a different interpretation of the treatment effect (Fraas, Newman, & Pool, 2007).

Method • Logistic regression was used to estimate propensity scores. • One-to-one matching was then conducted using a caliper width of 0.25 standard deviations of the logit transformation of the propensity score (Stuart & Rubin, 2007). • Matched pairs exceeding the caliper width were discarded from the analysis. • Balance was then examined on continuous variables using null hypothesis significance testing (NHST) and effect sizes.
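The matching step can be sketched as a greedy nearest-neighbor search on the logit of the propensity score. Several details here are assumptions made for illustration: matching is without replacement, treated participants are processed in list order, and the caliper uses the standard deviation of the logit across the whole sample. The function names are hypothetical.

```python
import math
import statistics

def logit(p):
    return math.log(p / (1 - p))

def caliper_match(ps_treated, ps_control, caliper_sd=0.25):
    """Greedy one-to-one matching without replacement on the logit scale.

    Returns (treated index, comparison index) pairs. Treated participants
    whose nearest available comparison lies outside the caliper are dropped.
    """
    lt = [logit(p) for p in ps_treated]
    lc = [logit(p) for p in ps_control]
    caliper = caliper_sd * statistics.stdev(lt + lc)
    available = list(range(len(lc)))
    pairs = []
    for i, value in enumerate(lt):
        if not available:
            break
        j = min(available, key=lambda k: abs(lc[k] - value))
        if abs(lc[j] - value) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

# Hypothetical propensity scores.
pairs = caliper_match([0.60, 0.70, 0.95], [0.10, 0.58, 0.72, 0.30])
```

In this example the treated participant with a score of 0.95 has no comparison participant within the caliper and is therefore discarded, which is how matching reduces the usable sample.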

Pre-Matching Treatment Effect [Figure: biased treatment effect on a 0-to-1 scale; Comparison (0.06), Treatment (0.64)]

Likelihood of Receiving Treatment [Figure: amount of bias in the likelihood of receiving treatment; Comparison (.36), Treatment (.59). The axis runs from 0 (unlikely to be in the treatment group) to 1 (likely to be in the treatment group)]

Matching Algorithms • R • MatchIt (Ho, Imai, King, & Stuart, 2007) • Matching (Sekhon, 2011) • Stata • PSMATCH2 (Leuven & Sianesi, 2004) • Pscore (Becker & Ichino, 2002) • SAS • SUGI 214-26 “GREEDY” (D’Agostino, 1998) • SPSS • PSM Matching_2.spd (Thoemmes, 2012)

Assessing Balance • The standardized difference in the mean propensity score in the two groups should be near zero (d < .20). • The ratio of the variance of the propensity score in the two groups should be near one, preferably between 0.80 and 1.25 (Rubin, 2001).

Pre-Matching Group Differences [Figure: amount of bias before matching; Comparison (.36), Treatment (.59). The axis runs from 0 (unlikely to be in the treatment group) to 1 (likely to be in the treatment group)]

Post-Matching Group Differences [Figure: amount of bias after matching; group values (.44) and (.46). The axis runs from 0 (unlikely to be in the treatment group) to 1 (likely to be in the treatment group)]

Pre-Matching Treatment Effect [Figure: biased treatment effect on a 0-to-1 scale; Comparison (0.06), Treatment (0.64)]

Post-Matching Treatment Effect [Figure: unbiased treatment effect on a 0-to-1 scale; group values (0.14) and (0.43)]

Practical Guidance • Some participants will be discarded as a result of poor matching. • As a result, larger samples are generally needed for PSM (Luellen, Shadish, & Clark, 2005; Yanovitzky, Zanutto, & Hornik, 2005). • How many participants are needed is unclear (Luellen et al., 2005, p. 548). • N > 100 may be too small (Akers, 2010), particularly as prediction of group assignment improves (Lane, 2011).

Practical Guidance • Examine improvement in prediction relative to the null model, as there is some evidence to suggest this reduces model sensitivity to hidden bias (Lane, 2011). • The Pearson goodness-of-fit test, the Hosmer-Lemeshow goodness-of-fit test, and pseudo R² have also been suggested for use in evaluating propensity scores (Guo & Fraser, 2010). • The I index (Huberty & Holmes, 1983; Huberty & Lowman, 2000) may also provide a measure of effect size.

Practical Guidance • Methods beyond logistic regression are available for estimating propensity scores, including classification trees, bagging, and boosted regression trees (Austin, 2008; Shadish et al., 2006). • Each of these estimation methods was created to help better inform covariate selection.

Practical Guidance • Matching strategies seem to vary greatly in the literature. • However, other strategies exist (e.g., one-to-many matching) that may retain more participants, improving statistical power and perhaps generalizability of treatment results.

Useful Literature • Caliendo and Kopeinig (2008) and Stuart (2010) provide a thorough discussion of the implementation of different matching methods. • Thoemmes and Kim (2011) present a systematic review of the various strategies employed by social science researchers using PSM. • Guo and Fraser (2010) provide an entire text dedicated to propensity score matching.
