Validity in an Era of Accountability • Daniel Koretz • CRESST / Harvard Graduate School of Education
Two types of issues raised by testing for accountability (TFA) • Behavioral issues • Arise from behavioral responses to testing other than those that improve learning • Non-behavioral issues • Not stemming from behavioral responses to testing
Some non-behavioral issues raised by TFA • Error: • Sampling error in aggregate statistics used for accountability • Error in PACs • Error in value-added estimates for teachers or schools • Reporting issues: • Choice of aggregate reporting metric • Issues raised by standards-based reporting • Causal inference: ascertaining effectiveness of programs, teachers, and schools
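As a rough illustration of the first error source above, the sketch below shows how sampling error in a school mean shrinks with the number of students tested. It assumes simple random sampling and a known student-level standard deviation; the function name and the numbers are hypothetical and are not taken from the slides.

```python
import math


def standard_error_of_mean(student_sd: float, n_students: int) -> float:
    """Standard error of a school mean, assuming simple random sampling."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return student_sd / math.sqrt(n_students)


# Hypothetical values: scores scaled so the student-level SD is 1.0.
student_sd = 1.0
for n in (25, 100, 400):
    se = standard_error_of_mean(student_sd, n)
    print(f"n={n:4d}  standard error of school mean = {se:.3f} SD")
```

Under these assumptions, quadrupling the number of tested students only halves the standard error, which is one reason aggregate statistics for small schools or subgroups can be noisy.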
Behavioral issues raised by TFA • “Right-hand side gaming:” affects who is tested • Exclusion • Reclassification of students • Retention in grade • “Left-hand side gaming:” affects the scores of tested students • Inappropriate test preparation and score inflation
Examples of Bob Linn’s work on TFA • Reasonableness and robustness of performance standards • Causal inference from test scores • Inconsistencies among performance metrics • Reliability of aggregate performance estimates • Numerous aspects of accountability system design, e.g., AYP • Score inflation from high stakes
Where does research on TFA fit in the measurement field? “Traditional psychometrics plus” • Veneer of research responding to TFA • Considerable amount on non-behavioral issues • Much less on behavioral issues • Practice has not changed sufficiently • Some areas extended, e.g., treatment of reliability • Impact of work on non-behavioral issues limited by client demands (e.g., standards) • Behavioral issues have been largely ignored
Why worry about behavioral issues? • Research shows a major threat to validity: bias of .50-.75 SD • Bias is inconsistent in size across schools • Little is known about the distribution of the bias • Cannot evaluate overall improvement • Kids are left behind, despite illusion of progress • Cannot evaluate relative improvement • to identify schools for reward, corrective action, emulation
One example of the threat to validity: grade 4 KIRIS reading • Source: Hambleton et al., 1995
How are scores inflated? • RHS gaming: • Obvious: exclude low-scoring kids • LHS gaming: • Reallocation • Coaching • Cheating
Two characteristics of tests that underlie LHS gaming • Tests are like polls: small samples of a larger domain • Even if well aligned, tests omit relevant content • Scores only matter if they represent the domain • Tests have recurrences • In details of content (included and excluded) • In forms of presentation • In scoring rubrics
Reallocation • Shifting instructional resources among substantive areas • Within subject • Between subjects • Results in reallocating achievement • Within subjects, can lead to either meaningful change or inflation
Coaching • Focuses on details of the test • Unimportant substantive details • Non-substantive details, such as item formats and scoring rubrics • Includes test-taking tricks (e.g., process of elimination (POE), plugging in)
Two ways that validity is undermined • Coaching and cheating: performance on measured elements is biased upward • Test-taking tricks • “Teaching to the rubric” • Focusing on details of presentation • Reallocation: performance on individual elements is accurately measured but no longer represents domain • If deemphasized material matters for inference
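The reallocation mechanism above can be sketched with a toy example. The ten-element domain, the four tested elements, and all mastery values are hypothetical numbers chosen only for illustration; they are not data from the slides.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical domain of 10 content elements; the test samples only 4 of them.
N_ELEMENTS = 10
TESTED = [0, 1, 2, 3]

# Proportion of students mastering each element (assumed values).
before = [0.5] * N_ELEMENTS
# After reallocation: instruction shifts toward tested elements,
# and untested elements are deemphasized.
after = [0.8 if i in TESTED else 0.4 for i in range(N_ELEMENTS)]


def score_on_test(mastery):
    """Mean mastery on the tested elements only (what the test reports)."""
    return mean([mastery[i] for i in TESTED])


def score_on_domain(mastery):
    """Mean mastery over the whole domain (the target of inference)."""
    return mean(mastery)


test_gain = score_on_test(after) - score_on_test(before)
domain_gain = score_on_domain(after) - score_on_domain(before)
print(f"gain on tested elements: {test_gain:.2f}")
print(f"gain on full domain:     {domain_gain:.2f}")
```

In this sketch every tested element is measured accurately, yet the test gain is several times the domain gain, because the tested sample no longer represents the domain.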
Biased estimates of element-level performance (Princeton Review’s Cracking the MCAS) • Plugging in: • “Rather than doing a problem like this in your head or trying to solve it algebraically, the easiest and fastest way to solve it is to plug in a number for x.” • Process of elimination: • “Sometimes the best way to solve a problem is to figure out what the…wrong answers are and eliminate them….It’s often easier to identify the wrong answers than to find the correct one.” • Pythagorean theorem: • “Popular Pythagorean ratios include the 3:4:5 (and its multiples) and the 5:12:13 (and its multiples).”
Coaching or cheating? • The… review sheet…reads in part: “The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b)=12000/b.” • The question on the actual test reads in part: “The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n)=420/n.” • Source: Strauss, V., The Washington Post, July 10, 2001, p. A09
Homework • Download the technical report for your state test • Find the section on validity • Look for discussion of evidence relevant to these threats to validity
Why traditional validation is insufficient for TFA • Cross-sectional and generally correlational • Insensitive to changes in levels of performance • Assumes stability in relationships between tested and untested aspects of performance • Ignores omissions and recurrences in tests • Ignores behavioral responses to high-stakes testing
What needs to be done? • More research on TFA • Changes to the practice of measurement • Expanded approach to validation • New approaches to test design in response to issues of incentives and accountability • Possible changes to ‘operational’ procedures, such as linking
Additional research needed • More research on methods to disentangle inflation from meaningful gains • More research exploring the extent and distribution of inflation, e.g., across types of schools or students • More research exploring the variables shaping incentives in TFA, e.g., • Characteristics of tests • Measures of performance employed • Rate of expected change • Evaluations of new designs for tests and accountability systems
Options for changes in test design • To better estimate true gain and to create better incentives (less incentive to narrow or coach) • Maximize breadth of coverage (matrix sample?) • Minimize unnecessary repetition, e.g., repetition of: • Details of content • Styles of presentation • Non-substantive task demands • Build in audit items to better estimate real gains
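One simple way to use an audit measure, sketched here under stated assumptions, is to compare the standardized gain on the high-stakes test with the standardized gain on audit items that were not the target of test preparation. The function names, score scales, and numbers are hypothetical, and this is not presented as the method used in the research cited in these slides.

```python
def standardized_gain(mean_before: float, mean_after: float, sd_before: float) -> float:
    """Gain expressed in baseline standard-deviation units."""
    if sd_before <= 0:
        raise ValueError("sd_before must be positive")
    return (mean_after - mean_before) / sd_before


def inflation_estimate(high_stakes_gain: float, audit_gain: float) -> float:
    """Rough inflation estimate: gain on the high-stakes test beyond the audit gain."""
    return high_stakes_gain - audit_gain


# Hypothetical scale scores for two cohorts on each measure.
high_stakes = standardized_gain(mean_before=500.0, mean_after=535.0, sd_before=50.0)
audit = standardized_gain(mean_before=200.0, mean_after=204.0, sd_before=40.0)

print(f"high-stakes gain: {high_stakes:.2f} SD")
print(f"audit gain:       {audit:.2f} SD")
print(f"rough inflation:  {inflation_estimate(high_stakes, audit):.2f} SD")
```

The comparison is only as good as the assumption that the audit items measure the same domain and were not themselves coached, so the difference is a rough indicator, not a precise estimate of bias.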
Expanded approach to validation • Cannot stop with initial quality of tests and inferences • Must consider validity of inferences about gains after stakes have been imposed • Will require expanded and more routine auditing of gains • Should be treated as a core aspect of validity, e.g., • In tech reports and texts