Propensity Score

Propensity Score • Overview: • What do we use a propensity score for? • How do we construct the propensity score? • How do we implement propensity score estimation in STATA?

Joke (kind of…) • Two heart surgeons (Jack and Jill) walk into a bar. • Jack: “I just finished my 100th heart surgery!” • Jill: “I finished my 100th heart surgery last week. Which probably means I’m a better heart surgeon. How many of your patients died within 3 months of surgery? I’ve only had 10 die.” • Jack: “Five. So I’m probably the better surgeon.” • Jill: “Or maybe mine are older and have a higher risk than your patients”. • There may be differences in the patients’ characteristics between Jack and Jill • We want to show the difference due to treatment (Jill) • We want to compare apples to apples – not apples to oranges

Purpose of propensity scores • It can produce apples-to-apples comparisons when treatment is non-random (non-ignorable treatment assignment) • Provides a way to summarize covariate information about treatment selection into a single number (scalar) • Can be used to adjust for differences via study design, or matching, or during estimation of the treatment effect (e.g., subclassification or regression)

Propensity score estimation • Some caveats • This is only relevant for selection on observables • If you cannot write down a conditioning strategy such that conditioning on X will satisfy the backdoor criterion, then this is not the research design to choose • You need to identify the confounders, X, that will block all back doors – based on economic theory – and you will need data on them

Better example: a case in which the propensity score is useful for causal inference • Suppose that we are interested in whether a scholarship program caused children to spend more years in high school (grades 9-12). • Suppose every 8th grade graduate is eligible for this program • You have data on every child, including test scores, family income, age, gender, etc. • Scholarships are awarded based on some combination of test scores, family income, gender, etc., but you don’t know the exact formula.

Motivation (cont.) • Ignorable treatment assignment: Scholarships are assigned to students randomly, independent of how a student is expected to perform in high school • Calculate ATE by estimating the simple difference in mean outcomes: E[Y | D=1] – E[Y | D=0] • But what if ignorability is violated? • For instance, assume you know that children with higher test scores are more likely to get the scholarship (positive selection), but you don’t know how important this and other factors are; you just know that the decision is based on information you have (X) and some randomness. • What can you do with this information?
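
The simple difference in mean outcomes can be sketched in a few lines of Python. This is only an illustration: the data and the function name `difference_in_means` are hypothetical, and the estimate has a causal reading only when treatment assignment is ignorable.

```python
from statistics import mean

def difference_in_means(outcomes, treated):
    """Simple difference in mean outcomes: E[Y | D=1] - E[Y | D=0]."""
    y1 = [y for y, d in zip(outcomes, treated) if d == 1]
    y0 = [y for y, d in zip(outcomes, treated) if d == 0]
    if not y1 or not y0:
        raise ValueError("need at least one treated and one control unit")
    return mean(y1) - mean(y0)

# Hypothetical data: years of high school completed and scholarship status.
years = [4, 3, 4, 2, 3, 1]
scholarship = [1, 1, 1, 0, 0, 0]

estimate = difference_in_means(years, scholarship)
print(estimate)
```

If scholarships went disproportionately to high scorers, this number would mix the scholarship effect with the effect of being a high scorer.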

Motivation (cont.) • In principle, you could estimate it using OLS controlling for X, e.g. Y = α + δD + Xβ + ε, where X is a matrix of covariates that you think affect the probability of receiving a scholarship. • OLS consistently estimates the conditional mean, but if the probability of getting a scholarship is not a linear function of X, this conditional mean estimate may not be informative. • Usually, we won’t know how the selection depended on X, only that it did. • For instance, they may use discrete cutoffs rather than a linear function

Motivation (cont.) • Suppose your variables are not continuous, but they are categories (somewhat arbitrarily). • E.g. family income above or below $50 per week, scores above or below the mean, sex, age, etc. • Now, you could put in dummy variables for each category and interaction between all dummies. This would distinguish every group formed by the categories. • Or you could run separate regressions for each group • This is more flexible since it allows the effect of the scholarship to differ by group. • These methods are in principle correct, but they are only feasible if you have a lot of data and few categories.
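
A quick count shows why the fully interacted approach needs a lot of data. The sketch below uses four hypothetical binary categories and a hypothetical sample of 400 students; none of these numbers come from the slides.

```python
from itertools import product

# Hypothetical binary categories, loosely following the examples above.
categories = {
    "income_above_cutoff": [0, 1],
    "score_above_mean": [0, 1],
    "female": [0, 1],
    "older_than_14": [0, 1],
}

# Every combination of category values defines one group (cell).
cells = list(product(*categories.values()))
n_cells = len(cells)

n_students = 400  # hypothetical sample size
avg_per_cell = n_students / n_cells

# The number of cells doubles with every additional binary category.
growth = [2 ** k for k in (4, 8, 12)]

print(n_cells, avg_per_cell, growth)
```

Each cell must contain both scholarship recipients and non-recipients to allow a within-cell comparison, so the usable sample per cell shrinks quickly as categories are added.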

Constructing the Propensity Score • Estimation of average treatment effects based on propensity score estimation can handle sparseness and ignorance about the functional form associated with treatment assignment. • You will first need to have a selection into the treatment (in our case the scholarship) that is based on observables, or “selection on observables”. • The following gives a brief overview of how the propensity score is constructed. • In practice, you can download a canned Stata command that will do all of this for you.

Definition and General Idea • Definition: The propensity score is the conditional probability of being assigned to the treatment group (e.g., the grade 9-12 scholarship), conditional on the particular covariates (X). • Pr(D=1|X) is a conditional probability, a single number between 0 and 1 for each unit (e.g., 0.55) • The idea is to compare units who, based solely on their observables, had very similar probabilities of being placed into treatment • If conditional on X, two units have a similar probability of treatment, then we say they have similar propensity scores • We then think that all the difference in the outcome variable is due to the treatment. • If we compare a unit in the treatment group to a control group unit with a similar propensity score, then conditional on the propensity score, all remaining variation in treatment status between these two is randomness, provided selection on observables holds

First stage • Estimation using this method is a two-stage procedure • First stage: estimate the propensity score • Second stage: calculate the average causal effect of interest by averaging differences in outcomes over units with similar propensity scores • First stage: estimate the propensity score: • First, estimate the following equation with binary treatment (D) on the LHS, and covariates (X) that determine selection into treatment on the RHS, using a logit or probit model: Pr(D=1|X) = F(Xβ), where F is the logistic or standard normal CDF • Second, using the estimated coefficients, calculate the predicted LHS • The propensity score is just the predicted conditional probability of treatment (using estimated coefficients on X) for each unit
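
The first stage can be sketched without any statistics package. The code below fits a logit by plain gradient ascent on the log-likelihood, which is a stand-in chosen for transparency and not what a canned logit routine does internally. The data (a standardized test score and scholarship status) and the names `fit_logit` and `predict_pscore` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(X, d, lr=0.5, n_iter=2000):
    """Fit Pr(D=1|X) = sigmoid(b0 + X b) by gradient ascent.

    X is a list of covariate rows (no intercept column), d a list of 0/1.
    Returns [intercept, b1, ..., bk].
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, di in zip(X, d):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            err = di - p  # score contribution of this unit
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict_pscore(beta, X):
    """Predicted probability of treatment for each unit."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            for row in X]

# Hypothetical data: one standardized test score per student.
X = [[-1.5], [-1.0], [-0.5], [-0.2], [0.0], [0.3], [0.6], [1.0], [1.4], [1.8]]
d = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

beta = fit_logit(X, d)
pscore = predict_pscore(beta, X)
print(beta)
print([round(p, 3) for p in pscore])
```

With an intercept in the model, the fitted scores average to the treated share in the sample, which is a handy sanity check on convergence.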

Algorithm • Sort your data by the propensity score and divide it into blocks (groups) of observations with similar propensity scores. • Within each block, test (using a t-test) whether the means of the covariates are equal in the treatment and control group. If so, stop; you’re done with the first stage • If a particular block has one or more unbalanced covariates, divide that block into finer blocks and re-evaluate • If a particular covariate is unbalanced for multiple blocks, modify the initial logit or probit equation by including higher order terms and/or interactions with that covariate and start again.
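
The within-block balance test can be sketched as follows. This is a minimal illustration under stated assumptions: the block edges, the data, and the function names are hypothetical, the t-statistic is the unequal-variance (Welch) version, and the 1.96 cutoff is a large-sample approximation to a proper t critical value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / se

def balance_by_block(pscore, treated, covariate, edges):
    """t-statistic of a covariate, treated vs. control, in each block.

    Blocks are [edges[i], edges[i+1]); a block with fewer than two
    treated or two control units gets None (no test possible).
    """
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xt = [x for p, dd, x in zip(pscore, treated, covariate)
              if lo <= p < hi and dd == 1]
        xc = [x for p, dd, x in zip(pscore, treated, covariate)
              if lo <= p < hi and dd == 0]
        if len(xt) < 2 or len(xc) < 2:
            results.append((lo, hi, None))
        else:
            results.append((lo, hi, welch_t(xt, xc)))
    return results

# Hypothetical data: propensity score, treatment status, test score.
ps = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
treat = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
score = [50, 52, 51, 49, 53, 51, 70, 72, 71, 69, 73, 71]

results = balance_by_block(ps, treat, score, edges=[0.0, 0.5, 1.0])
for lo, hi, t in results:
    balanced = t is not None and abs(t) < 1.96
    print(lo, hi, t, "balanced" if balanced else "split or respecify")
```

In this toy sample the test score differs a lot across blocks but very little between treated and control units inside each block, which is the pattern the algorithm is looking for.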

Second Stage • In the second stage, we look at the effect of treatment on the outcome (in our example of getting the scholarship on years of schooling), using the propensity score. • Once you have determined your propensity score with the procedure above, there are several ways to use it. I’ll present two of them (canned version in Stata for both): • Stratifying on the propensity score • Divide the data into blocks based on the propensity score (blocks are determined with the algorithm). Run the second stage regression within each block. Calculate the weighted mean of the within-block estimates to get the average treatment effect. • Matching on the propensity score • Match each treatment observation with one or more control observations, based on similar propensity scores. You then include a dummy for each matched group, which controls for everything that is common within that group.
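
Stratifying on the propensity score can be sketched like this. Within a block, a regression of the outcome on the treatment dummy alone reproduces the difference in means, so the sketch uses that difference directly. Weighting blocks by their total size is one choice (weighting by the number of treated units would instead target the effect on the treated); the data, block edges, and function name are hypothetical.

```python
from statistics import mean

def stratified_ate(pscore, treated, outcome, edges):
    """Block-size-weighted mean of within-block differences in means."""
    total = 0
    weighted = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        block = [(dd, y) for p, dd, y in zip(pscore, treated, outcome)
                 if lo <= p < hi]
        y1 = [y for dd, y in block if dd == 1]
        y0 = [y for dd, y in block if dd == 0]
        if not y1 or not y0:
            continue  # no treated-control comparison possible here
        weighted += len(block) * (mean(y1) - mean(y0))
        total += len(block)
    if total == 0:
        raise ValueError("no block contains both treated and control units")
    return weighted / total

# Hypothetical data: years of high school by scholarship status.
ps = [0.1, 0.2, 0.3, 0.4, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]
treat = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
years = [2, 2, 2, 3, 4, 2, 4, 4, 2, 4]

ate = stratified_ate(ps, treat, years, edges=[0.0, 0.5, 1.0])
print(ate)
```

Blocks that lack either treated or control units are dropped here, which quietly changes the population the estimate refers to.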

Balancing within blocks • Sort the data by the propensity score • Divide the data into groups called “blocks” that have similar propensity scores (e.g., 0.001 to 0.10, 0.10 to 0.20, etc.) • For each block, test whether the means of the covariates are equal for treatment and control using a t-test • If they are, you are done with the first stage • If a particular block has one or more unbalanced covariates (X), divide that block into finer blocks and re-evaluate • If a particular covariate is unbalanced for multiple blocks, modify the initial logit or probit equation by including higher order terms and/or interactions with that covariate and start again

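Implementation in STATA • Multiple methods for estimating the propensity score • Download “psmatch2” from ssc: ssc install psmatch2, replace • Note that the pscore, attr, and atts commands below are not part of psmatch2; they come from the separate user-written pscore package by Becker and Ichino • First stage: pscore treat X1 X2 X3…, pscore(scorename) • Second stage: attr (for radius matching) or atts (for stratifying): attr outcome treat, pscore(scorename)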

General Remarks • The propensity score approach becomes more appropriate the more we have randomness determining who gets treatment (closer to randomized experiment). • The propensity score doesn’t work very well if almost everyone with a high propensity score gets treatment and almost everyone with a low score doesn’t: • we need to be able to compare people with similar propensities who did and did not get treatment. • The propensity score approach doesn’t correct for unobservable variables that affect whether observations receive treatment.
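
Whether there are comparable treated and untreated units can be checked before any estimation. The sketch below uses the min/max rule, which is one common convention among several; the data and function names are hypothetical.

```python
def common_support(pscore, treated):
    """Overlap region of treated and control propensity scores (min/max rule)."""
    pt = [p for p, d in zip(pscore, treated) if d == 1]
    pc = [p for p, d in zip(pscore, treated) if d == 0]
    lo = max(min(pt), min(pc))
    hi = min(max(pt), max(pc))
    return lo, hi

def units_on_support(pscore, treated):
    """Indices of units whose score lies inside the overlap region."""
    lo, hi = common_support(pscore, treated)
    return [i for i, p in enumerate(pscore) if lo <= p <= hi]

# Hypothetical scores: controls mostly low, treated mostly high.
ps = [0.05, 0.10, 0.30, 0.50, 0.40, 0.60, 0.90, 0.95]
treat = [0, 0, 0, 0, 1, 1, 1, 1]

support = common_support(ps, treat)
kept = units_on_support(ps, treat)
print(support, kept, len(kept) / len(ps))
```

Here only two of eight units survive, which is the situation described above: high scores almost always mean treatment, so few comparisons are possible.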

NSW example • Comparison of propensity score matching with experimental results

NSW program • During the mid-1970s, Manpower Demonstration Research Corporation (MDRC) operated the National Supported Work Demonstration (NSW) • NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment • Unlike other federally sponsored employment and training programs, though, the NSW program assigned qualified applicants to training positions randomly • Treatment group: received all the benefits of the NSW program • Control group: left to fend for themselves • NSW admitted into the program AFDC women, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes

NSW Program • Treatment group members were: • guaranteed a job for 9-18 months depending on the target group and site • divided into crews of 3-5 participants who worked together and met frequently with an NSW counselor to discuss grievances and performance • paid for their work • Wage schedule offered the trainees lower wage rates than they would’ve received on a regular job, but allowed their earnings to increase for satisfactory performance and attendance • After their term expired, they were forced to find regular employment • The type of work varied within sites – gas station attendant, working at a printer shop – and males and females were frequently performing different kinds of work • This was why the program costs varied across sites and target groups • The program cost $9,100 per AFDC participant and approximately $6,800 for other target groups’ trainees in 1982 dollars (US)

NSW Program • MDRC collected earnings and demographic information from both treatment and control at baseline and every 9 months thereafter • Conducted up to 4 post-baseline interviews

LaLonde (1986) study • LaLonde, Robert J. (1986). “Evaluating the Econometric Evaluations of Training Programs with Experimental Data”. American Economic Review. 76(4): 604-620. • LaLonde’s ideas: • Outcome variable: Annual earnings in 1978 • Get unbiased estimate of the job training program’s effects using randomized control group • Compare that with what you get by selecting a control group from the entire population that looks like the treatment group using various causal inference methods

Need for a control group • The fundamental problem of causal inference is that a causal effect is defined as the difference between two potential outcomes, but for each individual, we only observe one of these. • We are missing data on each trainee’s counterfactual – what they would’ve earned had they not been in the NSW program
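
In potential outcomes notation the missing-data problem can be written out as below. The subscript i and the symbol δ_i are notational assumptions added here; Y1, Y0, and D follow the notation used later in these slides.

```latex
\begin{aligned}
\delta_i &= Y_{1i} - Y_{0i}
  && \text{(individual causal effect)} \\
Y_i &= D_i\, Y_{1i} + (1 - D_i)\, Y_{0i}
  && \text{(observed outcome)}
\end{aligned}
```

For a trainee (D_i = 1) the data reveal Y_{1i} but never Y_{0i}, so δ_i cannot be computed for any single person and a control group has to stand in for the missing term.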

Choice of a control group • Best option: Randomize so that independence is satisfied • Control group and treatment group are different only by random chance • Eliminates bias due to baseline differences between the two groups and the heterogeneous treatment effects bias • Oftentimes these kinds of randomized controls aren’t available so labor economists would instead sample from various datasets to create (non-experimental) control groups • So LaLonde sampled a non-experimental control group from two surveys: the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID) • Sampled the entire working population • Sampled those not working in 1976 • Sampled those not working in 1975 or 1976

Similarity of treatment and control groups • Treatment and control groups need to be similar. But in what way should they be similar? • Most importantly, they need to be similar with regard to pre-treatment income, since income is what we’ll be examining post-treatment. • So what did LaLonde find? • First column is treatment group earnings in 1978 • Second column is the randomized control group • All remaining columns are the non-random control groups

Lessons • What were the take-aways? • Fairly pessimistic findings – observational data and causal inference methods available at that time performed poorly when trying to reproduce the known ATE from the randomization • What did he do? • Linear regression, fixed effects, latent variable selection modeling • His estimated treatment effect for women tended to overestimate the impact of the program – “positive self-selection” • But it tended to underestimate the impact of the program for men – “negative self-selection” • Why should you care? • Even though the control group might seem like a good guess for the treatment group, your answers may still be significantly biased

Dehejia and Wahba (1999; 2002) • Dehejia, Rajeev H. and Sadek Wahba (1999). “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs”. Journal of the American Statistical Association, 94(448): 1053-1062. • Dehejia, Rajeev H. and Sadek Wahba (2002). “Propensity Score-Matching Methods for Nonexperimental Causal Studies”. The Review of Economics and Statistics, 84(1): 151-161. • These two studies introduce propensity score matching methods to economists and perform a kind of replication of LaLonde’s study

Dehejia and Wahba (1999) • DW (1999) re-analyze the data using propensity score matching and stratification • These were new at the time to economists, although the method was first established in Rosenbaum and Rubin (1983) • Identifying assumptions: • (Y0, Y1) ⊥ D | p(X) – conditional independence, where p(X) is the “propensity score” • 0 < Pr(D=1|X) < 1 – “Common support” • Stable unit treatment value assumption (SUTVA) • The response of subject i to the treatment D doesn’t depend on the treatment given to any other subject

Assumptions • e(X) = Pr(D=1|X), which is the conditional probability of treatment • Also called the “propensity score” • This is a scalar summary of all observed covariates, X • Key result: the propensity score is a balancing score • X ⊥ D | e(X) • Pr[D=1 | X, e(X)] = Pr[D=1 | e(X)] • ATE at e(X) is the average difference between the observed responses in each treatment group at e(X) • E[Y1 – Y0 | e(X)] = E[Y | e(X), D=1] – E[Y | e(X), D=0]

Interpretation • The overall estimated ATE from this method is the treatment effect at each value of e(X), averaged over the distribution of e(X)
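
The step from the assumptions to the overall ATE can be written out as a short derivation. It is a sketch that assumes conditional independence given e(X) and common support, and that the observed outcome is Y = D·Y1 + (1 – D)·Y0.

```latex
\begin{aligned}
E[Y_1 - Y_0 \mid e(X)]
  &= E[Y_1 \mid e(X), D=1] - E[Y_0 \mid e(X), D=0] \\
  &= E[Y \mid e(X), D=1] - E[Y \mid e(X), D=0], \\[4pt]
\text{ATE}
  &= E_{e(X)}\!\Big[\, E[Y \mid e(X), D=1] - E[Y \mid e(X), D=0] \,\Big].
\end{aligned}
```

The first line uses conditional independence to condition each potential outcome on the group in which it is observed; the second replaces potential outcomes with observed ones; the last averages over the distribution of e(X).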

Analytical use of propensity score • Matching – subsets consisting of both treatment and control subjects with the same (or nearly the same) propensity score are matched • Stratification – Data are divided into several “strata” (or “blocks”) based on the propensity score, then regular analysis is carried out within each stratum

Implementation • Include as many observed pretreatment variables (“covariates”) as possible • The statistical significance of individual terms isn’t important • Functional form of covariates • Consider higher order polynomials as well as interaction terms. Why? • BALANCE BETWEEN TREATMENT AND CONTROL • Selection of the model • Probit or logit

Matching algorithm • Nearest neighbor algorithm • Iteratively find the pair of subjects with the shortest “distance” • Easy to understand and implement; offers good results in practice; fast running time; rarely offers the best matching results compared to some optimal matching procedure
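
A greedy version of nearest neighbor matching on the propensity score can be sketched as below: it repeatedly takes the closest still-unmatched treated-control pair. This is one reading of the slide (1:1 matching without replacement); the scores and the function name are hypothetical, and an optimal matching procedure could produce different pairs.

```python
def greedy_nearest_neighbor(ps_treated, ps_control):
    """Greedy 1:1 matching without replacement on the propensity score.

    Returns a list of (treated_index, control_index, distance) tuples,
    in the order in which the pairs were formed.
    """
    # All candidate pairs, closest first.
    candidates = sorted(
        (abs(pt - pc), i, j)
        for i, pt in enumerate(ps_treated)
        for j, pc in enumerate(ps_control)
    )
    used_t, used_c, pairs = set(), set(), []
    for dist, i, j in candidates:
        if i not in used_t and j not in used_c:
            pairs.append((i, j, dist))
            used_t.add(i)
            used_c.add(j)
    return pairs

# Hypothetical estimated propensity scores.
ps_treated = [0.62, 0.35, 0.80]
ps_control = [0.30, 0.60, 0.45, 0.90]

pairs = greedy_nearest_neighbor(ps_treated, ps_control)
for i, j, dist in pairs:
    print(f"treated {i} <-> control {j} (distance {dist:.2f})")
```

Because each control is used at most once, an early close match can force a later treated unit onto a poorer match, which is why greedy matching is rarely optimal.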

Implementation • Choices of distance • Exact match not possible because propensity score is a continuous variable and the probability of having the same value of a continuous score is zero • Use one distance measure to summarize the information • Mahalanobis distance • Propensity score • Mahalanobis distance with propensity score caliper • Any distance with the requirement of exact match on a specific variable
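
As an illustration of one of these choices, the sketch below computes a Mahalanobis distance on two covariates and applies a propensity score caliper: pairs whose scores differ by more than the caliper get an infinite distance and so can never be matched. It is restricted to two covariates so that the covariance matrix can be inverted by hand; the data, caliper width, and function names are hypothetical.

```python
import math
from statistics import mean

def covariance_2d(rows):
    """Sample covariance entries (sxx, sxy, syy) for two covariates."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between two points with two covariates."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

def caliper_distance(a, b, ps_a, ps_b, cov, caliper):
    """Mahalanobis distance, or infinity if the scores are too far apart."""
    if abs(ps_a - ps_b) > caliper:
        return math.inf
    return mahalanobis_2d(a, b, cov)

# Hypothetical covariates: (test score, weekly family income).
rows = [(50, 10), (60, 12), (70, 11), (80, 15)]
cov = covariance_2d(rows)

d_close = caliper_distance(rows[0], rows[1], 0.30, 0.35, cov, caliper=0.1)
d_far = caliper_distance(rows[0], rows[3], 0.30, 0.50, cov, caliper=0.1)
print(d_close, d_far)
```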

Software • R functions by Ben Hansen • http://www.stat.lsa.umich.edu/~bbh# • STATA functions • STATA 13 has new “treatment effects” methods built into it, which include nearest neighbor matching as well as propensity score matching methods • Pre-STATA 13: user-written commands psmatch2, pscore, nnmatch

Procedures for PSM • Identify the propensity score model (e.g., logit or probit; covariates) • Estimate the propensity score with all the data • Compute the distance between any two subjects • Create matched pairs/groups using a specific matching algorithm • Check covariate balance between the treatment and control group among matched subjects; if not good enough, go back to improve the propensity score model • Contrast treated and control subjects within each pair/group • Obtain the ATE by averaging over all pairs/groups

Why are we doing this? • Remember the goal of DW: • The goal is to investigate the credibility of the conventional analytical results from non-experimental data • So the authors compared the results from the experimental data to the results from the non-experimental data by combining the treatment group with a comparable control dataset

Checking the balance after matching

Comparison of the analytical results

Observations • The results after propensity score matching/stratification were much closer to the truth (if we assume the randomized experiment is the correct benchmark) • The variances seem to be larger due to the loss of data • The results aren’t very sensitive to the functional form of the chosen covariates in the propensity score model; however, they are sensitive to the selection of covariates included in the propensity score model

Comments • Limitations of the propensity score method • Relies on an unverifiable assumption – conditional independence, or “selection on observables” • Unlike randomization, propensity score matching cannot be used if there are unobserved confounders, or “selection on unobservables” • Overlap • You need substantial overlap between the treatment and the control groups; otherwise, it may result in significant loss of data in your analysis
