Large-scale testing: Uses and abuses

Richard P. Phelps
Universidad Finis Terrae, Santiago, Chile
January 7, 2014

Large-scale testing: Uses and abuses
1. Three types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected

1. Three types of large-scale tests
- Achievement
- Aptitude
- Non-cognitive

Achievement tests
Historically, were larger versions of classroom tests
~1900 - “scientific” achievement tests developed (Germany & USA)
J.M. Rice - systematically analyzed test structures & effects
E.L. Thorndike - developed scoring scales
SOURCE: Phelps, Standardized Testing Primer, 2007

Achievement tests
Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)
Requires a mastery of content prior to test.
Fairness assumes that all have the same opportunity to learn content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests
1890s – A. Binet & T. Simon (France)
• Pre-school children with mental disabilities
• Achievement test not possible
• Developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)
1917 – Adapted by U.S. Army to select and assign soldiers in World War I
1930s – Harvard University president J. Conant wanted a new admission test to identify students from lower social classes with the potential to succeed at Harvard; developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests
Purpose: predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)
Content independent. Measures:
… what student does with content provided
… how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either…
… not known in advance,
… basic, broad, commonly known by all, curriculum-free;
… less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests
Aptitude tests can identify:
- Students bored in school who study what interests them on their own
- Students not well adapted to high school, but well adapted to university
- Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007

Comparing Achievement & Aptitude tests

Non-cognitive tests
More recently developed – measure values, attitudes, preferences
Types:
- integrity tests
- career exploration
- matchmaking
- employment “fit”

Non-cognitive tests
Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated: success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)

Comparing Achievement, Aptitude, & Non-Cognitive Tests

2. Measuring test quality
Test reports can be “data dumps”
3 measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences

Predictive validity (values from -1.0 to +1.0)
…measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.

A positive correlation between two measures Source: NIST, Engineering Statistics Handbook

A negative correlation between two measures Source: NIST, Engineering Statistics Handbook

No correlation between two measures Source: NIST, Engineering Statistics Handbook

How does one measure predictive capacity?
Correlation coefficient:
-1 I--------------------- 0 ---------------------I +1
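To make the correlation coefficient concrete, here is a minimal sketch of how a predictive validity could be computed as a Pearson correlation between admission test scores and a later university outcome. The function name `pearson_r` and the two data lists are hypothetical, invented for illustration only; they are not taken from the Pearson report or from any real SAT or PSU data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data (assumption): admission scores for eight students
# and their first-year university grades on a 1-7 scale.
admission_scores = [450, 520, 580, 610, 660, 700, 740, 790]
first_year_grades = [4.1, 4.6, 4.4, 5.0, 5.3, 5.1, 5.8, 6.2]

r = pearson_r(admission_scores, first_year_grades)
print(f"predictive validity (r) = {r:.2f}")
```

A value near +1 would mean higher admission scores go with better later outcomes, a value near 0 would mean the test says little about those outcomes, and a value near -1 would mean higher scores go with worse outcomes.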

Predictive validities: SAT and PSU SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013

Predictive validities: SAT and PSU (faculty: Administracion) SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
