Doing good evaluations: what does it mean, what does it take?



  1. Angus Deaton, Princeton University
     India International Center, October 15th, 2012
     Doing good evaluations: what does it mean, what does it take?

  2. Evidence for policy
     • Everyone agrees that policies should be based on evidence
     • There is much less agreement about the nature of the evidence
       • What methods should be used?
       • Is there a hierarchy of evidence? Are some kinds of evidence better than others?
       • Are randomized controlled trials the gold standard?
     • How do we move from evidence to policy?
       • Rigorous evidence is of limited value if the step to policy is not well justified
       • Two steps: developing the evidence, then adapting it to policy; the outcome depends on the weakest link

  3. Running examples
     • Building dams
       • Do dams lead to poverty reduction?
     • Sanitation
       • The Total Sanitation Campaign (TSC) and its effects on child mortality and child health
       • How should such schemes be implemented?
     • Microfinance
       • Is microfinance (MF) an effective tool for poverty reduction?
     • Food subsidies
       • In kind versus cash? The Public Distribution System (PDS) versus conditional cash transfers (CCTs)
     • In general: “finding out what works”
       • “Rigorous evaluation of CCTs has shown that they work”
       • Is this true, and if so, what does it mean for India? Or anywhere else?

  4. Background
     • The “failure” of development economics and the whole development project
     • Cycling fashions at the World Bank
       • Infrastructure, structural adjustment, education, health, women, political economy, governance . . . infrastructure
     • Not just the Bank, but the development community (or at least the community of “developers”)
       • Unconstrained by evidence
     • The Bank was unable to document its contribution, if any
       • Deep skepticism about its own internal evaluations
     • Many argued that there had been little or no progress
       • Much less so now, though it remains unclear whether the development effort by rich countries was positive

  5. Diagnosing the problem
     • Many possible stories for this state of affairs
     • One story is a failure to learn from experience
       • No systematic, “rigorous” evaluation procedure for projects
       • Casual empirical evaluation does not give credible answers
     • We need “rigorous” and “credible” evidence on what works
       • If the Bank had done this on all of its projects in the past, we would know what works by now, and poverty would be history
     • Is this just the latest turn of the wheel of fashion, or is there some truth to this?

  6. Better empirical analysis
     • Certainly true that the quality of empirical analysis was often weak
       • Correlations that were obviously not causation
       • Chinese railways
     • Randomized controlled trials seem to offer solutions to these issues
       • They establish causality
       • A solution to the statistical problems of bias, selection, omitted variables (confounding), etc.
     • These arguments have been very successful
       • In the World Bank, among foundations
       • J-PAL and others doing many experiments

  7. Chorus of approval
     • “The World Bank is finally embracing science” (Lancet editorial, 2004)
     • “Creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th.” (Esther Duflo, 2004)
       • Did RCTs revolutionize medicine?
     • “Britain has given the world Shakespeare, Newtonian physics, the theory of evolution, parliamentary democracy—and the randomized trial” (BMJ editorial, 2001)

  8. What is an RCT?
     • The trial population is randomly divided into two groups, experimentals and controls
       • Experimentals get the treatment; controls get none
     • The average outcome in the experimental group minus the average outcome in the control group tells us whether the treatment works, and by how much on average
     • An RCT estimates an average treatment effect (formalized in the sketch below)
       • In general, each person (unit) will have a different treatment effect
       • We cannot observe these for each individual
       • But the RCT gives us the average for the group, which is a lot!
     • Minimal assumptions, absence of bias, and establishing causality are big advantages
     • But is this really the only “rigorous” evaluation?
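
To pin down what "average treatment effect" means here, a minimal sketch in standard potential-outcomes (Rubin) notation; the notation is an editorial addition, not from the slide:

```latex
% Each unit i has two potential outcomes: Y_i(1) if treated, Y_i(0) if not.
% The individual treatment effect is never observed for any single unit:
\tau_i = Y_i(1) - Y_i(0)
% The RCT estimand is the average treatment effect over the trial population:
\mathrm{ATE} = \mathbb{E}[\tau_i] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]
% Randomization makes treatment assignment independent of potential outcomes,
% so the simple difference in group means is an unbiased estimator:
\widehat{\mathrm{ATE}} = \bar{Y}_{\mathrm{treated}} - \bar{Y}_{\mathrm{control}}
```

The last line is what the difference-in-means comparison of the two arms actually estimates; everything about the individual effects beyond their mean is left unidentified, which is the point of the bullet above.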

  9. Examples again
     • CCTs in Mexico (Progresa): some villages got CCTs, some did not
       • Better average outcomes for treatment villages
       • Random selection means it must have been the CCT, not something else
     • What do we learn?
       • Will it work in India? External validity.
       • Will it work for a specific village in Mexico?
       • Why did it work? If we knew, we could answer the two questions above
       • The controls knew they were going to get CCTs later; does that matter?
       • Mexico had a system of clinics: it is hard to take kids to a non-existent clinic
       • A big issue for Santiago Levy at the IADB today
     • Dams: not possible to do randomized dam construction!!
       • So RCTs cannot be done in all cases
       • Some have argued that policies should not be implemented in these cases
       • Yet we routinely do many things for which there have been no RCTs!

  10. Alternative methods
     • Rohini Pande and Esther Duflo’s work on dams used the placement of dams and NSS data on poverty
     • Dean Spears’ work on the TSC uses NFHS and other survey data on health in conjunction with administrative data
     • These are alternative methods of estimating average treatment effects
     • Weaker than RCTs in some respects
       • Causality, selection, and bias are not handled automatically and must be argued
       • More assumptions
     • Stronger in other respects
       • They can access the distribution of treatment effects, not just the average
       • Usually much larger samples
       • Triangulation helps to pin down the mechanisms at work
     • RCTs are good at saying what happened, not good at saying why
       • Ex post fairy stories (just-so stories) without evidence

  11. Small RCTs
     • Are often not large enough to be reliable
       • Expensive to do, so this is not a matter that is easily fixed
     • In a small trial, a few outliers can wreak havoc (see the simulation below)
       • An example might be microfinance, where one or two women might do really well, and the rest not at all
       • We get lots of weird and counterintuitive results
       • No idea whether they are real, or whether the method is just broken
       • Doubt one can learn anything from a trial of 10 experimental villages and 10 control villages in a CCT experiment
     • The experiment is often conducted on a convenience sample
       • Not easy to get cooperation from all relevant units: e.g., in looking at CCTs, those opposed to the idea might be less willing to cooperate
       • Results are correct only for the convenience population
       • Not for the population that would be affected by the policy
     • Gold-standard rhetoric protects results from questioning
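
A hypothetical simulation of the outlier point (an illustration, not from the talk; the microfinance numbers are invented, not from any real trial):

```python
# Toy model of a small microfinance trial: most borrowers gain nothing,
# but a few do extremely well, so the estimated average treatment effect
# is dominated by whether an outlier happens to land in the sample.
import numpy as np

rng = np.random.default_rng(0)

def small_trial(n_per_arm=25):
    control = rng.normal(100, 10, n_per_arm)    # income without a loan
    treated = rng.normal(100, 10, n_per_arm)    # baseline income with a loan
    n_outliers = rng.binomial(n_per_arm, 0.05)  # ~5% of borrowers take off
    treated[:n_outliers] += rng.exponential(500, n_outliers)
    return treated.mean() - control.mean()      # estimated ATE

estimates = np.array([small_trial() for _ in range(1000)])
print(f"mean of estimates: {estimates.mean():.1f}")
print(f"5th/95th percentiles: {np.percentile(estimates, [5, 95]).round(1)}")
```

In this toy model the true average effect is about 25 (5% of borrowers gaining 500 on average), yet single small trials routinely estimate anything from roughly zero to several times that: the "weird and counterintuitive results" problem above.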

  12. Large-scale RCTs
     • Use all of the units in a country
       • A PDS/CCT experiment for all of rural India
     • Comparable to the large social experiments in the US in the 1970s
       • The New Jersey negative-income-tax experiment, SIME/DIME
       • The RAND Health Insurance Experiment
       • The RAND experiment is an important part of the debate today; the others are not
     • Ex post data mining
       • A null result is never acceptable to the sponsors
       • Enormous pressure on investigators to find something
       • Usually by subgroup analysis, or by looking for other outcomes (see the illustration below)
       • Moving to Opportunity (MTO) has now examined thousands of outcomes
       • Some of the statistically significant ones are spurious
       • And we are back to the small-sample problem again
     • Large experiments are not decisive either
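
A hypothetical illustration of the subgroup/multiple-outcomes problem (an editorial sketch, not from the talk): when a treatment truly does nothing, scanning many outcomes still yields "significant" findings at the expected false-positive rate.

```python
# Multiple testing with no true effects: the treatment does nothing to
# any outcome, yet some outcomes still test "significant" by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_arm, n_outcomes = 200, 100

significant = 0
for _ in range(n_outcomes):
    treated = rng.normal(0, 1, n_per_arm)  # true treatment effect is zero
    control = rng.normal(0, 1, n_per_arm)
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        significant += 1

# Expect roughly 5 spurious "effects" out of 100 at the 5% level.
print(f"{significant} of {n_outcomes} outcomes significant at p < 0.05")
```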

  13. Dynamic effects
     • Many policies take time to work out
       • Lots of things work as intended in the short run and fail later
       • People learn to “work the system”
     • Food rationing in Britain during the war:
       • Excellent at first: big nutritional benefits, solidarity
       • Crooks (“spivs”) learned to exploit it and create a black market
       • Support eventually vanished when it was continued for too long
     • Old-age pensions in South Africa: a cash transfer
       • Burial insurers were allowed on site to get first access to recipients
     • Higher-level corruption: banks?
       • Procurement and supply effects in food policy
     • What would an RCT show?
       • “It works!” It is expensive and unethical to continue the experiment long enough to see the failure
       • We get the wrong answer, or only part of the answer
       • An issue in medicine too

  14. Taking evidence to policy

  15. Using a perfect evaluation
     • Suppose we have a result, e.g.
       • On average, CCTs make people happier than the PDS
       • On average, dams increase poverty
       • On average, reducing open defecation improves child health and reduces mortality
     • Suppose also that these evaluations were all done perfectly, so there is no dispute about the conclusions
       • Which, of course, never happens!
     • What use can we make of those results in policy?
       • Should the Planning Commission ban new dams?
       • Should the MRD encourage better sanitation?
       • Should we replace the PDS with CCTs?
     • That dams don’t work on average tells us little about any individual dam
       • It is an individual dam that comes up for approval, not all dams!
       • We need to know more: why dams cause poverty, and under what circumstances, none of which comes from an RCT

  16. What should a village do?
     • Or any local authority that decides
     • Given an RCT on CCTs versus the PDS
       • Again, the average is useful but not decisive
       • Will it have the same effect for us? We are not the average village
       • Again, we need to know why it works, not just whether it works
     • A neighboring village tried it and is happy with the outcome
       • Perhaps this is just an anecdote (“your uncle likes his new TV”)
       • But for the village, the average outcome is an anecdote too
       • Perhaps the authorities should visit their neighbors, see what is going on, and see whether it would work for them
     • The average is more useful for a public health policy that will be applied to the whole country
       • Sanitation?

  17. Finding out what works?
     • A trial-and-error process
       • But T & E is NOT the same as an RCT
     • T & E, endless tinkering, is a good description of the Industrial Revolution
       • How to invent a steam engine, or a toaster
     • It is also how medical science works on procedures and devices
       • For which trials are close to irrelevant, and in many cases have never been done
     • T & E, using knowledge and intelligence, can solve the dimensionality problem

  18. Seeing into the machine
     • Allows a village, the ministry, or the Planning Commission to make a better choice
       • It may be able to see whether it would work for them
       • It may be able to see places where they could adapt it and make it better
     • The hope is to understand the process and how it would work in context
       • Trial and error, plus local knowledge and hard thought
       • Experimentation, but not necessarily RCTs
     • What are the “helping factors” that made a trial work?
       • E.g., clinics in Mexico!
     • This can teach us why things work, which is generalizable knowledge

  19. Causality & helping factors • Do not RCTs reveal causality? • It was the treatment that did it! Not something else • Is this not particularly helpful in policy? Yes and no. • Causality, by itself, is not always useful • The house burned down because the TV was left on • Causal, but not general: TVs do not usually burn down houses • RCT would show this causal effect • But TVs need “helping factors” like bad wiring, or inflammable material left nearby • We have to think about what are the helping factors, how they work, and whether they will work for us • Will a CCT work in a particular village, or during food price inflation, or in a competent v a corrupt state • Does it need banks, or clinics to make it work? • Does it matter who gets it? Men and women: gender issues in India v Latin America • Replication of an RCT is not useful, because get different results in different contexts with or without helping factors • Causality is “local”
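
A toy simulation of local causality (an editorial sketch, not Deaton's; the "helping factor" and all parameters are invented): an identical randomized design estimates a real effect in one context and none in another, and both answers are correct where they were obtained.

```python
# The same RCT in two contexts: the treatment only works where the
# helping factor (say, functioning clinics) is present, so the estimated
# causal effect is local to the context of the trial.
import numpy as np

rng = np.random.default_rng(1)

def run_rct(helping_factor_present, n=5000):
    treated = rng.random(n) < 0.5              # randomize treatment
    effect = 2.0 if helping_factor_present else 0.0
    outcome = rng.normal(10.0, 3.0, n) + effect * treated
    return outcome[treated].mean() - outcome[~treated].mean()

print(f"estimated effect, helping factor present: {run_rct(True):.2f}")
print(f"estimated effect, helping factor absent:  {run_rct(False):.2f}")
```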

  20. Cartwright: Local causality
     Open window A, and fly kite B. String C opens door D, which allows moths E to escape and eat shirt F. The lighter shirt lowers shoe G on to switch H, which heats iron I, which burns pants J. Smoke K enters tree L and smokes out possum M into basket N, pulling rope O and lifting cage P, allowing woodpecker Q to chew pencil R. (Emergency knife S in case woodpecker or possum gets sick and can’t work.)

  21. Expanding literature
     • We now have enough RCT papers to judge their quality and the evidence that they claim
       • Some are excellent, some terrible
       • Just like other empirical papers in development
     • But they must be judged case by case, like all other empirical work
       • There is no free pass just because they are RCTs
       • Using “rigorous evaluation” as a code word for RCT is without justification
     • Yet right now, in the economics and aid literatures, they are being given a free pass
       • Sometimes absurd generalizations are based on small, special RCTs
     • RCTs have no monopoly on rigour; there is no gold standard
