The logic of Counterfactual Impact Evaluation

To understand counterfactuals, it is necessary to understand impacts

Impacts differ in one fundamental way from outputs and results: outputs and results are observable quantities

Can we observe an impact? No, we can’t

As output indicators measure outputs and result indicators measure results, so impact indicators measure impacts? Sorry, they don’t

Almost everything about programmes can be observed (at least in principle):
• outputs (beneficiaries served, activities done, training courses offered, km of roads built, sewage cleaned)
• outcomes/results (income levels, inequality, well-being of the population, pollution, congestion, inflation, unemployment, birth rate)

What is needed for M&E of outputs and results are BITs (baselines, indicators, and targets)

Unlike outputs and results, to define, detect, understand, and measure impacts one needs to deal with causality

“Causality is in the mind” J.J. Heckman

Why this focus on causality? Because, unless we can attribute changes (or differences) to policies, we do not know whether the intervention “works”, “for whom” it works, and even less “why” it works (or does not).
Causal questions represent a bigger challenge than non-causal questions (descriptive, normative, exploratory)

The social science community defines impact/effect as “the difference between a situation observed after a stimulus has been applied and the situation that would have occurred without such stimulus”

A very intuitive example of the role of causality in producing credible evidence for policy decisions

Does playing chess have an impact on math learning?

Policy-relevant question: Should we make chess part of the regular curriculum in elementary schools, to improve mathematics achievement? Which kind of evidence do we need to make this decision in an informed way? We can think of three types of evidence, from the most naive to the most credible

1. The naive evidence: pre-post difference
• Take a sample of pupils in fourth grade
• Measure their achievement in math at the beginning of the year
• Teach them to play chess during the year
• Test them again at the end of the year

Results for the pre-post difference
Pupils at the beginning of the year: average score = 40 points
The same pupils at the end of the year: average score = 52 points
Difference = 12 points = +30%
Question: what are the implications for making chess compulsory in schools? Have we proven anything?

Can we attribute the increase in test score to playing chess? OBVIOUSLY NOT
• The data tell us that the effect is between zero and 12 points
• There is no doubt that many more factors are at play
• So we must dismiss the increase of 12 points as unable to tell us anything about impact
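The pre-post arithmetic can be reproduced in a few lines. This is a minimal sketch that uses only the two averages reported for the chess example; the variable names are illustrative and do not come from the slides.

```python
# Average math scores reported for the same pupils at the start
# and at the end of the school year.
score_before = 40
score_after = 52

# Raw pre-post change, in points and as a share of the starting score.
pre_post_diff = score_after - score_before
pre_post_pct = 100 * pre_post_diff / score_before

print(f"Pre-post difference: {pre_post_diff} points ({pre_post_pct:.0f}%)")

# The raw change mixes any effect of chess with everything else that
# happened during the year, so it is not an estimate of impact.
```

The computation is trivial on purpose: nothing in it separates the contribution of chess from the pupils' natural progress over the year.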

The great pre-post temptation
• Pre-post comparisons have a great advantage: they seem obvious (the “pop” definition of impact coincides with the pre-post difference)
• They are particularly tempting when the intervention is big and theory suggests that the outcomes should be affected
• This is not the case here, but in general we should be careful about making causal inferences from pre-post comparisons

The risky alternative: with-without difference
Impact = difference between treated and not treated?
Compare math test scores for kids who have learned chess by themselves and kids who have not

Not really
Average score of pupils who already play chess on their own (25% of the total) = 66 points
Average score of pupils who DO NOT play chess on their own (75% of the total) = 45 points
Difference = 21 points = +47%
This difference is OBJECTIVE, but what does it mean, really? Does it have any implication for policy?

This evidence tells us almost nothing about making chess compulsory for all students
The data tell us that the effect of playing chess is between zero and 21 points. Why?
The observed difference could be entirely due to differences in mathematical ability that exist before the courses, between the two groups

66 – 45: real effect or the fruit of sorting?
[Diagram: Play chess and Math test scores are linked by the question “Does it have an impact on?”; Innate math ability is linked to Math test scores (DIRECT INFLUENCE) and to Play chess (SELECTION PROCESS)]
Ignoring math ability could severely bias the results, if we intend to interpret them as causal effects

Counterfeit Counterfactual
Both the raw difference between self-selected participants and non-participants and the raw change between pre and post are caricatures of the counterfactual logic
• In the case of raw differences, the problem is selection bias (predetermined differences)
• In the case of raw changes, the problem is maturation bias (a.k.a. natural dynamics)

The modern way to understand causality is to think in terms of POTENTIAL OUTCOMES
Let us imagine we know the score that kids would get if they played chess and the score they would get if they did not
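Potential outcomes are often written in a compact notation. The symbols below are an addition to the slides (an assumption of this sketch): D_i = 1 if pupil i plays chess and 0 otherwise, Y_i(1) is the score the pupil would get when playing chess, and Y_i(0) is the score the same pupil would get without chess.

```latex
\begin{align*}
\text{impact for pupil } i:\quad & \tau_i = Y_i(1) - Y_i(0) \\
\text{observed score}:\quad & Y_i = D_i\,Y_i(1) + (1 - D_i)\,Y_i(0)
\end{align*}
```

Only one of the two potential outcomes is ever observed for a given pupil, so the individual impact cannot be read directly from the data.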

Let’s say there are three levels of ability
• Kids in the top quartile (top 25%) learn to play chess on their own
• Kids in the two middle quartiles learn if they are taught in school
• Kids in the bottom quartile (last 25%) never learn to play chess

High math ability (25%): play chess by themselves
Mid math ability (50%): do not play chess unless taught in school
Low math ability (25%): never learn to play

Potential outcomes
                     If they do play chess   If they do NOT play chess   Impact = gain from playing chess
High math ability             66                        56                            10
Mid math ability              54                        48                             6
Low math ability              40                        40                             0
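The potential outcomes for the three ability groups can be held in a small table, from which the impact for each group follows as a simple difference. This is a minimal sketch; the dictionary layout and names are illustrative, and only the scores come from the chess example.

```python
# Potential outcomes by ability group:
# (score if they play chess, score if they do NOT play chess).
potential_outcomes = {
    "high": (66, 56),
    "mid": (54, 48),
    "low": (40, 40),
}

# The impact for each group is the gap between its two potential outcomes.
impact = {
    group: score_play - score_no_play
    for group, (score_play, score_no_play) in potential_outcomes.items()
}

for group, gain in impact.items():
    print(f"{group} ability: gain from playing chess = {gain} points")
```

In real data only one of the two scores in each pair is observed, which is why this table can be written down only in a thought experiment.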

Observed outcomes
                                For those who play chess   For those who do not play chess
High math ability                          66                            -
Mid math ability                            -                           48
Low math ability                            -                           40
Mid/low math ability combined               -                           45
The difference of 21 points is NOT an IMPACT, it is just an OBSERVED difference

The problem: we do not observe the counterfactual(s)
• For the treated, the counterfactual is 56, but we do not see it
• The true impact is 10, but we do not see it
• Still, we cannot use 45, which is the observed outcome of the untreated
We can think of decomposing the 66 – 45 difference as the sum of the true impact on the treated and the effect of sorting
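The decomposition can also be written in general form. The notation is an addition to the slides (an assumption of this sketch): D = 1 marks pupils who play chess, and Y(1) and Y(0) are the scores with and without chess. The second line plugs in the numbers of the chess example.

```latex
\begin{align*}
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{observed difference}}
  &= \underbrace{E[Y(1) \mid D=1] - E[Y(0) \mid D=1]}_{\text{impact on the treated}}
   + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}} \\
66 - 45 &= (66 - 56) + (56 - 45)
\end{align*}
```

The middle term E[Y(0) | D=1] is the unobserved counterfactual: it is added and subtracted, so the identity always holds, but neither component on the right can be computed from observed data alone.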

Decomposing the observed difference
                        If do not play chess   If play chess
High math ability                56                 66         66 – 56 = 10 (impact for players)
Low/mid math ability             45                  -
Observed difference: 66 – 45 = 21
Preexisting differences: 56 – 45 = 11
21 = 10 + 11

21 = 10 + 11
Observed differences = Impact + Preexisting differences (selection bias)
The heart of impact evaluation is getting rid of selection bias, by using experiments or by using some non-experimental methods
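The same decomposition can be checked numerically from the potential outcomes and the population shares of the chess example. This is a minimal sketch with illustrative names. It works with unrounded means, so it gives about 20.7 = 10 + 10.7, which the slides round to 21 = 10 + 11.

```python
# Potential outcomes and population shares by ability group.
# y1 = score if they play chess, y0 = score if they do not.
groups = {
    "high": {"share": 0.25, "y1": 66, "y0": 56},  # play on their own
    "mid": {"share": 0.50, "y1": 54, "y0": 48},   # play only if taught
    "low": {"share": 0.25, "y1": 40, "y0": 40},   # never learn to play
}

# Without any programme, only the high-ability group plays chess.
treated_mean = groups["high"]["y1"]
untreated_share = groups["mid"]["share"] + groups["low"]["share"]
untreated_mean = (
    groups["mid"]["share"] * groups["mid"]["y0"]
    + groups["low"]["share"] * groups["low"]["y0"]
) / untreated_share

# What the data show: players versus non-players.
observed_diff = treated_mean - untreated_mean
# True impact on the players (needs the unobserved counterfactual, 56).
att = groups["high"]["y1"] - groups["high"]["y0"]
# Gap that would exist even if the players did not play chess.
selection_bias = groups["high"]["y0"] - untreated_mean

print(f"{observed_diff:.1f} = {att} + {selection_bias:.1f}")
```

The split is only possible here because the counterfactual score of 56 is assumed to be known; with real data it has to be estimated.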

Experimental evidence to the rescue
Schools get a free instructor to teach chess to one class, if they agree to select the class at random among the fourth grade classes
Now we have the following situation

Results of the randomized experiment
Pupils in the selected classes: average score of randomized chess players = 60 points
Pupils in the excluded classes: average score of NON chess players = 52 points
Difference = 8 points
Question: what does this difference tell us?

The results tell us that teaching chess truly improves math performance (by 8 points, about 15%)
Thus we are able to isolate the effect of chess from other factors (but some problems remain)

ATE
                 If they do play chess   If they do NOT play chess   Composition of population
High ability              66                        56                          25%
Mid ability               54                        48                          50%
Low ability               40                        40                          25%
Averages                  54                        48                         100%
Impact = 54 – 48 = 6 = Average Treatment Effect
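The averages in the ATE table are population-weighted means of the potential outcomes. The sketch below recomputes them from the group scores and shares; the names are illustrative. Without rounding, the average score if everyone plays is 53.5 and the ATE is 5.5, which the slide reports as 54 and 6.

```python
# Potential outcomes and population shares by ability group.
groups = {
    "high": {"share": 0.25, "play": 66, "no_play": 56},
    "mid": {"share": 0.50, "play": 54, "no_play": 48},
    "low": {"share": 0.25, "play": 40, "no_play": 40},
}

# Population-weighted average of each potential outcome.
mean_play = sum(g["share"] * g["play"] for g in groups.values())
mean_no_play = sum(g["share"] * g["no_play"] for g in groups.values())

# ATE: average gain if every pupil in the population were taught chess.
ate = mean_play - mean_no_play

print(f"Average if all play: {mean_play}")
print(f"Average if none play: {mean_no_play}")
print(f"ATE: {ate}")
```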

[Diagram relating Play chess, Math ability, and Math test scores]
Note that the experiment does not solve all the cognitive problems related to policy design: for example, it does not identify impact heterogeneity (“for whom it works”)

The ATE is the average effect if every member of the population is treated
Generally there is more policy interest in the Average Treatment Effect on the Treated (ATT)
In the chess example ATT = 10, while ATE = 6 (we ran an experiment and got an impact of 8. Can you think why this happens?)

                 Schools that volunteered   Schools that DID NOT volunteer   True impact
High ability              50%                          -                         10
Mid ability               50%                         50%                         6
Low ability                -                          50%                         0
EXPERIMENTAL: mean of 66 and 54 = 60
CONTROL: mean of 56 and 48 = 52
Impact = 60 – 52 = 8
Internal validity, little external validity
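One way to see why the experiment returns 8 while ATT = 10 and ATE = 6 is to treat each figure as a weighted average of the same group impacts, with weights given by who is in the population being studied. This is a minimal sketch; the labels and the helper function are illustrative, and the unrounded ATE is 5.5.

```python
# True impact of chess by ability group in the chess example.
impact = {"high": 10, "mid": 6, "low": 0}

# Share of each ability group in three different populations.
weights = {
    "ATT (pupils who play on their own)": {"high": 1.00, "mid": 0.00, "low": 0.00},
    "ATE (all pupils)": {"high": 0.25, "mid": 0.50, "low": 0.25},
    "Experiment (volunteering schools)": {"high": 0.50, "mid": 0.50, "low": 0.00},
}


def average_impact(group_shares):
    """Weighted average of the group impacts for one population."""
    return sum(group_shares[g] * impact[g] for g in impact)


estimands = {name: average_impact(w) for name, w in weights.items()}

for name, value in estimands.items():
    print(f"{name}: {value}")
```

The experiment is internally valid for the schools that volunteered, but their mix of pupils differs from the whole population, which is why its result does not carry over to everyone.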

Lessons learned
• Impacts are differences, but not all differences are impacts
• Differences (and changes) have many causes, but we do not need to understand all the causes
• We are especially interested in one cause, the policy, and we would like to eliminate all the confounding causes of the difference (or change)
• Internal vs. external validity

An example of a real ERDF policy: grants to small enterprises to invest in R&D

To design an impact evaluation, one needs to answer three important questions
1. Impact of what?
2. Impact for whom?
3. Impact on what?

R&D EXPENDITURES AMONG THE FIRMS RECEIVING GRANTS
Is 10.000 the true average impact of the grant?

The fundamental challenge to this assumption is the well-known fact that things change over time by “natural dynamics”
How do we disentangle the change due to the policy from the myriad changes that would have occurred anyway?

IS 15.000 THE TRUE IMPACT OF THE POLICY?

WITH-WITHOUT (I.A.: NO PRE-INTERVENTION DIFFERENCES)

DECOMPOSITION OF WITH-WITHOUT DIFFERENCES

We cannot use experiments with firms, for obvious (?) political reasons
The good news is that there are lots of non-experimental counterfactual methods
