Validity in an Era of Accountability

Daniel Koretz, CRESST/Harvard Graduate School of Education

Two types of issues raised by testing for accountability (TFA) • Behavioral issues • Arise from behavioral responses to testing other than those that improve learning • Non-behavioral issues • Not stemming from behavioral responses to testing

Some non-behavioral issues raised by TFA • Error: • Sampling error in aggregate statistics used for accountability • Error in PACs • Error in value-added estimates for teachers or schools • Reporting issues: • Choice of aggregate reporting metric • Issues raised by standards-based reporting • Causal inference: ascertaining effectiveness of programs, teachers, and schools
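The sampling-error point can be illustrated with a minimal sketch. It assumes the tested cohort behaves like a simple random sample of students and uses a hypothetical within-school SD of 1.0 (scores in student-SD units). Neither assumption comes from the slides.

```python
import math

def standard_error_of_mean(student_sd, n_students):
    """Standard error of a school mean under simple random sampling (an assumption)."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return student_sd / math.sqrt(n_students)

# Hypothetical cohort sizes; the SD of 1.0 expresses scores in student-SD units.
for n in (25, 100, 400):
    se = standard_error_of_mean(1.0, n)
    print(f"n={n:4d}  SE of school mean = {se:.3f} SD")
```

Under these assumptions, a school testing 25 students has a mean four times as noisy as a school testing 400, which is one reason year-to-year changes in small schools are hard to interpret.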

Behavioral issues raised by TFA • “Right-hand side gaming:” affects who is tested • Exclusion • Reclassification of students • Retention in grade • “Left-hand side gaming:” affects the scores of tested students • Inappropriate test preparation and score inflation

Examples of Bob Linn’s work on TFA • Reasonableness and robustness of performance standards • Causal inference from test scores • Inconsistencies among performance metrics • Reliability of aggregate performance estimates • Numerous aspects of accountability system design, e.g., AYP • Score inflation from high stakes

Where does research on TFA fit in the measurement field? “Traditional psychometrics plus” • Veneer of research responding to TFA • Considerable amount on non-behavioral issues • Much less on behavioral issues • Practice has not changed sufficiently • Some areas extended, e.g., treatment of reliability • Impact of work on non-behavioral issues limited by client demands (e.g., standards) • Behavioral issues have been largely ignored

Why worry about behavioral issues? • Research shows a major threat to validity: bias of .50-.75 SD • Bias is inconsistent in size across schools • Little is known about the distribution of the bias • Cannot evaluate overall improvement • Kids are left behind, despite illusion of progress • Cannot evaluate relative improvement • Cannot identify schools for reward, corrective action, or emulation
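Why inconsistent bias defeats relative comparisons can be shown with a small sketch. The three schools and their numbers are hypothetical, and the additive model (observed gain = true gain + inflation) is a simplifying assumption, not something the slides state.

```python
# Hypothetical schools: (true gain, inflation), both in student-level SD units.
schools = {
    "A": (0.30, 0.00),
    "B": (0.10, 0.60),
    "C": (0.20, 0.25),
}

def observed_gain(true_gain, inflation):
    """Assumed additive model: what the accountability system sees."""
    return true_gain + inflation

# Rank schools from largest to smallest gain, by true and by observed gain.
rank_by_true = sorted(schools, key=lambda s: schools[s][0], reverse=True)
rank_by_observed = sorted(
    schools, key=lambda s: observed_gain(*schools[s]), reverse=True
)

print("Ranked by true gain:    ", rank_by_true)
print("Ranked by observed gain:", rank_by_observed)
```

In this made-up case the school with the largest true gain ranks last on observed gain. If inflation were the same size everywhere, the two rankings would agree, so it is the inconsistency that matters here.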

One example of the threat to validity: grade 4 KIRIS reading • Source: Hambleton et al., 1995

How are scores inflated? • RHS gaming: • Obvious: exclude low-scoring kids • LHS gaming: • Reallocation • Coaching • Cheating

Two characteristics of tests that underlie LHS gaming • Tests are like polls: small samples of a larger domain • Even if well aligned, tests omit relevant content • Scores only matter if they represent the domain • Tests have recurrences • In details of content (included and excluded) • In forms of presentation • In scoring rubrics
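The poll analogy can be made concrete with a rough sketch. The domain size, the tested elements, and both mastery patterns are hypothetical. The only point is that a score on a recurring sample tracks domain mastery when instruction is broad and stops tracking it when instruction narrows to the sample.

```python
def proportion_mastered(mastered, elements):
    """Share of the given content elements that the student has mastered."""
    elements = list(elements)
    return sum(1 for e in elements if e in mastered) / len(elements)

domain = range(100)         # hypothetical domain of 100 content elements
tested = range(0, 100, 5)   # the same 20 elements recur on every form

# Broad instruction: half the domain is mastered, without regard to the test.
broad = set(range(50))
# Narrowed instruction: every tested element, plus only 10 untested ones.
narrowed = set(tested) | {1, 2, 3, 4, 6, 7, 8, 9, 11, 12}

for label, mastered in (("broad", broad), ("narrowed", narrowed)):
    score = proportion_mastered(mastered, tested)
    truth = proportion_mastered(mastered, domain)
    print(f"{label:9s} test score = {score:.2f}  domain mastery = {truth:.2f}")
```

With these invented numbers the narrowed case scores 1.00 on the test while mastering 0.30 of the domain, whereas the broad case scores 0.50 on both.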

Reallocation • Shifting instructional resources among substantive areas • Within subject • Between subjects • Results in reallocating achievement • Within subjects, can lead to either meaningful change or inflation

Coaching • Focuses on details of the test • Unimportant substantive details • Non-substantive details, such as item formats and scoring rubrics • Includes test-taking tricks (e.g., process of elimination, plugging in)

Two ways that validity is undermined • Coaching and cheating: performance on measured elements is biased upward • Test-taking tricks • “Teaching to the rubric” • Focusing on details of presentation • Reallocation: performance on individual elements is accurately measured but no longer represents domain • If deemphasized material matters for inference

Biased estimates of element-level performance (Princeton Review’s Cracking the MCAS) • Plugging in: • “Rather than doing a problem like this in your head or trying to solve it algebraically, the easiest and fastest way to solve it is to plug in a number for x.” • Process of elimination: • “Sometimes the best way to solve a problem is to figure out what the…wrong answers are and eliminate them….It’s often easier to identify the wrong answers than to find the correct one.” • Pythagorean theorem: • “Popular Pythagorean ratios include the 3:4:5 (and its multiples) and the 5:12:13 (and its multiples).”
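How plugging in combines with process of elimination can be sketched on a hypothetical multiple-choice item. The item, its answer choices, and the function names below are invented for illustration and do not come from the cited book.

```python
# Hypothetical item (not from the source):
# "Which expression is equivalent to 2(x + 3) - 4?"
def stem(x):
    return 2 * (x + 3) - 4

# Hypothetical answer choices, written as functions of x.
choices = {
    "A": lambda x: 2 * x + 2,
    "B": lambda x: 2 * x - 1,
    "C": lambda x: 2 * x + 6,
    "D": lambda x: x + 2,
}

def plug_in(value):
    """Return the choices that match the stem at one plugged-in number."""
    target = stem(value)
    return [label for label, expr in choices.items() if expr(value) == target]

print(plug_in(5))  # only the equivalent expression survives
print(plug_in(0))  # a poorly chosen number can leave a wrong choice standing
```

No algebraic simplification is performed at any point, so a correct response obtained this way says little about the algebra the item was meant to sample.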

Coaching or cheating? • The… review sheet…reads in part: “The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b)=12000/b.” • The question on the actual test reads in part: “The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n)=420/n.” • Source: Strauss, V., The Washington Post, July 10, 2001, p. A09

Homework • Download the technical report for your state test • Find the section on validity • Look for discussion of evidence relevant to these threats to validity

Why traditional validation is insufficient for TFA • Cross-sectional and generally correlational • Insensitive to changes in levels of performance • Assumes stability in relationships between tested and untested aspects of performance • Ignores omissions and recurrences in tests • Ignores behavioral responses to high-stakes testing

What needs to be done? • More research on TFA • Changes to the practice of measurement • Expanded approach to validation • New approaches to test design in response to issues of incentives and accountability • Possible changes to ‘operational’ procedures, such as linking

Additional research needed • More research on methods to disentangle inflation from meaningful gains • More research exploring the extent and distribution of inflation, e.g., across types of schools or students • More research exploring the variables shaping incentives in TFA, e.g., • Characteristics of tests • Measures of performance employed • Rate of expected change • Evaluations of new designs for tests and accountability systems

Options for changes in test design • To better estimate true gain and to create better incentives (less incentive to narrow or coach) • Maximize breadth of coverage (matrix sample?) • Minimize unnecessary repetition, e.g., repetition of: • Details of content • Styles of presentation • Non-substantive task demands • Build in audit items to better estimate real gains
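One way audit items might be used is sketched below. It assumes that gains on audit items are free of inflation and that both gains are expressed in baseline SD units. The function names and all numbers are hypothetical, and this is not presented as the author's procedure.

```python
def standardized_gain(mean_before, mean_after, sd_before):
    """Gain between two administrations, in baseline SD units."""
    if sd_before <= 0:
        raise ValueError("sd_before must be positive")
    return (mean_after - mean_before) / sd_before

def inflation_estimate(operational_gain, audit_gain):
    """Assumed estimator: the part of the operational gain not echoed on audit items."""
    return operational_gain - audit_gain

# Hypothetical scale scores for two administrations of the same test.
operational = standardized_gain(mean_before=50.0, mean_after=58.0, sd_before=10.0)
audit = standardized_gain(mean_before=50.0, mean_after=52.0, sd_before=10.0)

print(f"operational gain = {operational:.2f} SD")
print(f"audit gain       = {audit:.2f} SD")
print(f"inflation        = {inflation_estimate(operational, audit):.2f} SD")
```

With these invented numbers, 0.60 SD of the 0.80 SD operational gain fails to appear on the audit items. Whether the audit items themselves stay uninflated over time is a separate question that this sketch does not address.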

Expanded approach to validation • Cannot stop with initial quality of tests and inferences • Must consider validity of inferences about gains after stakes have been imposed • Will require expanded and more routine auditing of gains • Should be treated as a core aspect of validity, e.g., • In tech reports and texts
