
Building a Strong Outcome Portfolio: What Researchers Actually Do

This presentation reviews the strategies researchers use to build evidence for program outcomes, including evaluation strategies, individual-level measurement, and evaluation design. It also discusses the importance of process evaluation and the challenges of outcome evaluation.



Presentation Transcript


  1. Building a Strong Outcome Portfolio Section 4: What Researchers Actually Do Jeffrey A. Butts, Ph.D. Research and Evaluation Center John Jay College of Criminal Justice City University of New York September 2018

  2. What it Takes to Build Evidence… • Evaluation Strategies: Process: Did we do What was Planned & Intended? Outcome: Did we See the Changes we Hoped to See? Impact: Can we Claim to Have Caused Those Changes?

  3. What it Takes to Build Evidence… • Individual-level measurement of: Inputs: Services, activities, program efforts Outputs: Service participation, activities completed Outcomes: Youth behaviors and accomplishments
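
  To make individual-level measurement concrete, here is a minimal sketch in Python of what one youth's record might look like; the field names and values are hypothetical illustrations, not a prescribed data standard.

    # Hypothetical individual-level record; field names are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class YouthRecord:
        youth_id: str
        inputs: list = field(default_factory=list)    # services and activities offered (program efforts)
        outputs: list = field(default_factory=list)   # participation: sessions attended, activities completed
        outcomes: dict = field(default_factory=dict)  # youth behaviors and accomplishments

    record = YouthRecord(
        youth_id="Y-001",
        inputs=["mentoring referral", "job-readiness workshop"],
        outputs=["attended 8 of 10 mentoring sessions"],
        outcomes={"enrolled_in_school": True, "new_arrest_within_6_months": False},
    )
    print(record)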

  4. What it Takes to Build Evidence… [Chart: an individual-level grid tracking each youth's Service Provision, Service Participation, and Behaviors/Accomplishments]

  5. Designing Your Evaluation • A design must fit the local context and situation • Experimental designs are preferred, but rarely feasible • Instead, choose the most rigorous, realistic design • Key stakeholders should be involved early, to solicit their views and to gain their support for the eventual design • Criticisms should be anticipated and dealt with early

  6. Experimental, Random-Assignment Design [Flow diagram] • Client referrals are screened for eligibility for randomization, determined by evaluators or using guidelines from evaluators • The random assignment process creates a treatment group, which begins services, and a control group, which receives no services or different services • Equivalent outcome data are collected for both groups at Time 1 through 6, then differences, effect sizes, etc. are analyzed • Issues: No-services group? Eligibility? How to randomize? Equivalent data collection?
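
  A minimal sketch of the random-assignment logic on simulated data, assuming a two-group t-test on a single outcome score; the sample size, score distribution, and assumed treatment advantage are all illustrative, not drawn from any real program.

    # Simulated random-assignment evaluation; all numbers are assumptions for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Eligible client referrals are randomized: half to treatment, half to control.
    n = 200
    assignment = rng.permutation([1] * (n // 2) + [0] * (n // 2))

    # Hypothetical outcome scores; the treatment group is assumed to average 5 points higher.
    outcomes = rng.normal(loc=50 + 5 * assignment, scale=15)

    treated = outcomes[assignment == 1]
    control = outcomes[assignment == 0]

    t_stat, p_value = stats.ttest_ind(treated, control)
    print(f"Mean difference: {treated.mean() - control.mean():.2f}, p = {p_value:.3f}")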

  7. Quasi-Experimental Design – Matched Comparison [Flow diagram] • Client referrals form the treatment group • A matching process draws a comparison group from a pool of potential comparison cases, matched according to sex, race, age, prior services, scope of problems, etc. • Equivalent outcome data are collected for both groups at Time 1 through 6, then differences, effect sizes, etc. are analyzed • Issues: Matched on what? Where do comparison cases come from? Equivalent data collection? Control services to comparison cases?
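
  One way to sketch the matching step is exact matching on a few covariates, shown below with hypothetical cases; real matched-comparison designs often use propensity scores or nearest-neighbor matching instead.

    # Illustrative exact matching on sex, age band, and prior services; data are hypothetical.
    treatment_group = [
        {"id": "T1", "sex": "F", "age_band": "14-15", "prior_services": True},
        {"id": "T2", "sex": "M", "age_band": "16-17", "prior_services": False},
    ]
    comparison_pool = [
        {"id": "C1", "sex": "M", "age_band": "16-17", "prior_services": False},
        {"id": "C2", "sex": "F", "age_band": "14-15", "prior_services": True},
        {"id": "C3", "sex": "F", "age_band": "16-17", "prior_services": True},
    ]

    def match_key(case):
        return (case["sex"], case["age_band"], case["prior_services"])

    matched_pairs, used = [], set()
    for t in treatment_group:
        for c in comparison_pool:
            if c["id"] not in used and match_key(c) == match_key(t):
                matched_pairs.append((t["id"], c["id"]))
                used.add(c["id"])
                break

    print(matched_pairs)  # [('T1', 'C2'), ('T2', 'C1')]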

  8. Quasi-Experimental Design – Staggered Start [Flow diagram] • Client referrals are divided into Groups 1, 2, and 3, which begin the intervention at staggered times • Outcome data are collected at common data collection points (Time 1 through 7) before and after each group's start • Issues: Requires more time and program cooperation

  9. Process Evaluation “Now that this bill is the law of the land, let’s hope we can get our government to carry it out.” - President John F. Kennedy * Quoted in Rossi et al., p. 170

  10. Avoiding the “Black Box” Problem [Illustration: a car wash as a “black box”: cars go in and come out, but what happens inside is hidden]

  11. Process Evaluation • Describes how a program is operating at a specific moment (or over a specific time period) • Assesses how well a program performs its intended functions • Identifies the critical components, functions and relationships necessary for the program to be effective • Does not estimate outcomes or impact

  12. Process Evaluation • Many steps required to take a program from concept to full operation • Much effort is needed to keep it true to its original design and purposes • Whether any program is fully carried out as envisioned by its sponsors and managers is always problematic

  13. Process Evaluation Serves Multiple Purposes • Establish program credibility • Provide feedback for managerial purposes • Demonstrate accountability to sponsors or decision makers • Provide a freestanding process evaluation • Augment an impact evaluation Key Words: Appropriate, Adequate, Sufficient, Satisfactory, Reasonable, Intended

  14. Process Evaluation is Built on Theory • What “should” the program be doing? • Do critical events or service activities take place? - Often a matter of degree rather than all or none - Quality and appropriateness count too • Stakeholders should be deeply involved in defining program theory and, therefore, process evaluation goals Werner, Alan. A Guide to Implementation Research. 2004, pp. 114–116

  15. Outcome Evaluation • An outcome is the state of the target population or the social conditions that a program is expected to have changed. • Outcomes are observed characteristics of the target population or social conditions, not of the program, and the definition of an outcome makes no direct reference to program action • Outcomes must relate to the benefits that products and services might have, not simply their receipt (Rossi et al., pp. 204–205)

  16. Outcome Evaluation • The challenge for evaluators is to assess not only outcomes but also the degree to which a change in outcomes may be attributable to a program or policy • Outcome Level is the status of an outcome at some point in time (e.g., the amount of smoking among teenagers) • Outcome Change is the difference between outcome levels at different points in time or between groups • Program Effect is the portion of a change in outcome that can be attributed to a program or policy rather than to other factors
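
  These three terms can be illustrated with hypothetical numbers. The sketch below uses a simple difference-in-differences comparison, which is one way (not the only way) to separate a program effect from change that would have happened anyway.

    # Hypothetical outcome levels, e.g., percent of teens who smoke; all figures are invented.
    program_before, program_after = 30.0, 22.0        # outcome levels in the program group
    comparison_before, comparison_after = 31.0, 28.0  # outcome levels in a comparison group

    outcome_change_program = program_after - program_before           # -8.0 points
    outcome_change_comparison = comparison_after - comparison_before  # -3.0 points

    # Program effect: the portion of the change beyond what the comparison group experienced.
    program_effect = outcome_change_program - outcome_change_comparison  # -5.0 points

    print(f"Outcome change (program group): {outcome_change_program:+.1f}")
    print(f"Estimated program effect: {program_effect:+.1f}")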

  17. Answering the Question, “Did it Work?” • Evaluation is an effort to test the effects of a social program or policy change • We try to intervene in some problem or condition, and then review our success • What is the “effect” of the program or policy, and compared to what? • Statistical significance is a limited concept for this purpose.

  18. Statistical Significance • Statistical significance is not a direct indicator of the size of the program or policy effect • Statistical significance is a function of - sample size - effect size - p level • A study’s ability to detect a difference is “Power” • Even a well-designed study can end up having low power… the program effect may be there but the study can’t see it Partially adapted from James Neill (2007). Why use Effect Sizes instead of Significance Testing in Program Evaluation? (http://wilderdom.com/research/effectsizes.html)
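
  The interplay of sample size, effect size, and the significance threshold can be seen in a quick power calculation. The sketch below uses the statsmodels power routines with assumed numbers: a modest effect (d = 0.3) and 50 cases per group.

    # Power calculation for a two-group comparison; effect size and alpha are assumptions.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # With a modest effect (d = 0.3) and 50 cases per group, power is low (roughly 0.3):
    # a real program effect of this size would usually go undetected.
    power = analysis.power(effect_size=0.3, nobs1=50, alpha=0.05)
    print(f"Power with 50 cases per group: {power:.2f}")

    # Cases per group needed to reach 80% power for the same effect (roughly 175).
    n_needed = analysis.solve_power(effect_size=0.3, power=0.8, alpha=0.05)
    print(f"Cases per group for 80% power: {n_needed:.0f}")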

  19. Difference and Statistical Significance [Chart: score scale from 200 to 1,000; the treatment group scores 710, the general population scores 675] Is it significant? Is it important? Is it worth the cost?

  20. Difference and Statistical Significance [Chart repeated from slide 19: treatment group 710 vs. general population 675] Is it significant? Is it important? Is it worth the cost?
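
  The chart's three questions can be separated with a quick calculation: given a large enough sample, the 35-point difference (710 vs. 675) will be statistically significant, but whether it is important or worth the cost is a separate judgment. The standard deviation and sample size below are assumptions for illustration.

    # Hypothetical treatment-group scores; an SD of 100 and n of 400 are assumed.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    treatment_scores = rng.normal(loc=710, scale=100, size=400)

    # Test the treatment group against the general-population score of 675.
    t_stat, p_value = stats.ttest_1samp(treatment_scores, popmean=675)
    standardized_diff = (treatment_scores.mean() - 675) / treatment_scores.std(ddof=1)

    print(f"p = {p_value:.4f}")                                 # easily "significant" with n = 400
    print(f"Standardized difference: {standardized_diff:.2f}")  # about 0.35, a modest effect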

  21. Statistical Significance • Statistical significance is BINARY • Significant or “not significant” • No way to say “how significant” or how much better a treatment was than a control group • p values are set in advance as a yes/no test • p < .01 is NOT “more significant” than p < .05

  22. Effect Size • A more flexible measure of program effect • Effect size statistics account for the amount of variance in both the treatment group and the control group • Effect size is often stated in terms of percentages - the program accounted for 20% of the change - treatment reduced drug use 15% more than expected - probation was responsible for 25% of the improvement • Does not require measures from the general population • Is more easily applied to policy questions (Rossi et al., Chapter 10)
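
  A minimal sketch of one common standardized effect size, Cohen's d with a pooled standard deviation; the two small samples are invented for illustration.

    # Cohen's d with a pooled standard deviation; both samples are hypothetical.
    import numpy as np

    def cohens_d(treatment, control):
        t, c = np.asarray(treatment, dtype=float), np.asarray(control, dtype=float)
        pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / (len(t) + len(c) - 2)
        return (t.mean() - c.mean()) / np.sqrt(pooled_var)

    treatment = [12, 15, 14, 10, 13, 16, 14, 11]  # e.g., months arrest-free after the program
    control = [10, 11, 12, 9, 10, 13, 11, 10]

    print(f"Cohen's d = {cohens_d(treatment, control):.2f}")  # positive values favor the treatment group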

  23. Planning Process to Create an Evidence Base • Create a priority list of topics and questions - Open and inclusive brainstorming session - Identify uncertainties and unknowns - Group topics into areas (cost, impact, equity, etc.) • Sort the list by difficulty - Easiest: administrative data already exist, no client contact required, outcomes measurable using one agency data system alone - Hardest: have to create new data, data only available with client contact, follow-up period required (e.g., recidivism), data required from multiple agencies

  24. Contact Jeffrey A. Butts, Ph.D. Director, Research and Evaluation Center John Jay College of Criminal Justice City University of New York 524 W. 59th Street, Suite BMW605 New York, NY 10019 www.JohnJayREC.nyc
