Co-calibration and Equating in Educational Testing: Definitions and Designs

Test co-calibration and equating • Paul K. Crane, MD MPH • General Internal Medicine, University of Washington

Outline • Definitions and motivation • Educational testing literature • Concurrent administration designs • Separate administration designs • PARSCALE coding considerations • Illustration with CSI ‘D’ and CASI • Coming attractions; comments

Definition • Distinction between “equating” and “co-calibration” • We almost always mean “co-calibration” • General idea is to get all tests of a kind on the same metric • Error terms will likely differ, but tests are trying to measure the same thing

Five requirements for “equating” • Scales measure the same concept • Scales have the same level of precision • Conversion from scale A to B is the inverse of the conversion from scale B to A • Distribution of scores should be identical for individuals at a given level of the measured concept • Equating function should be population invariant • (Linn, 1993; Mislevy, 1992; Dorans, 2000)
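A minimal sketch of the symmetry requirement (A to B is the inverse of B to A). Linear equating, which matches means and standard deviations, is not named in these slides; it is used here only because it is the simplest conversion with this property. The scores and the function name `linear_equate` are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(scores_from, scores_to):
    """Build a conversion from one scale to another by matching mean and SD.

    Illustrative only: a real equating study would also have to check the
    other requirements (same concept, same precision, population invariance).
    """
    mu_from, sd_from = mean(scores_from), pstdev(scores_from)
    mu_to, sd_to = mean(scores_to), pstdev(scores_to)
    slope = sd_to / sd_from

    def convert(x):
        return mu_to + slope * (x - mu_from)

    return convert


# Hypothetical scores for the same group of people on two scales.
scale_a = [10, 12, 15, 18, 20, 25]
scale_b = [40, 45, 52, 58, 61, 74]

a_to_b = linear_equate(scale_a, scale_b)
b_to_a = linear_equate(scale_b, scale_a)

# Symmetry: converting A -> B -> A should return the starting score.
round_trip = [b_to_a(a_to_b(x)) for x in scale_a]
symmetric = all(abs(r - x) < 1e-9 for r, x in zip(round_trip, scale_a))
print(symmetric)
```

By contrast, a regression of B on A and a regression of A on B are generally not inverses of each other, which is one reason prediction alone does not count as equating under the list above.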

Motivation for co-calibration • Many tests measure “the same thing” • MMSE, 3MS, CASI, CSI ‘D’, Hasegawa, Blessed…. • PRIME-MD, CESD, HAM-D, BDI, SCID…. • Literature only interpretable if one is familiar with the nuances of the test(s) used • Studies that employ multiple measures (such as the CHS) face difficulty in incorporating all their data into their analyses • In sum: facilitates interpretation and analysis

Educational literature • Distinct problems: • Multiple levels of same topic, e.g. 4th grade math, 5th grade math, etc. (“vertical” equating) • Multiple forms of same test, e.g. dozens of forms of SAT, GRE to prevent cheating (“horizontal” equating) • Making sure item difficulty is constant year to year (item drift analyses)

Strategies are the same • Need either common items administered to different populations, or common people taking different tests • Analyze one large dataset that contains all items and all people • Verify that the common elements (people or items) behave as expected

Concurrent administration • Common population design:

Separate administration • Anchor test design – e.g., McHorney

Item bank development

Comments • Fairly simple; God is in the details! • Afternoon workgroup will address the details • Illustration to follow

PARSCALE code • For concurrent administration, it’s as if there is a single longer test • For separate administration, items a person was not given are treated as missing, so the combined dataset has a lot of missing data • Once data are in the correct format, PARSCALE does the rest
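The data layout described above can be sketched as follows, assuming an anchor-item design with two hypothetical studies that share items `c1` and `c2`. The item names, person IDs, the not-administered code `9`, and the record layout are all assumptions for illustration; the actual input format and missing-data codes should be taken from the PARSCALE manual.

```python
# Column order for the combined "single longer test".
# a* = test A only, c* = common (anchor) items, b* = test B only.
ALL_ITEMS = ["a1", "a2", "a3", "c1", "c2", "b1", "b2"]
NOT_ADMINISTERED = "9"  # hypothetical code for items a person never saw

# Hypothetical responses: study 1 took test A plus the anchor items,
# study 2 took the anchor items plus test B.
study_1 = {
    "p001": {"a1": 1, "a2": 0, "a3": 1, "c1": 1, "c2": 0},
    "p002": {"a1": 0, "a2": 0, "a3": 1, "c1": 1, "c2": 1},
}
study_2 = {
    "p101": {"c1": 0, "c2": 1, "b1": 1, "b2": 1},
    "p102": {"c1": 1, "c2": 1, "b1": 0, "b2": 1},
}


def stack(*studies):
    """Put every person on the full item list; unseen items become None."""
    rows = {}
    for study in studies:
        for person, responses in study.items():
            rows[person] = [responses.get(item) for item in ALL_ITEMS]
    return rows


def to_record(person, row):
    """Render one person as an ID followed by one character per item."""
    codes = "".join(NOT_ADMINISTERED if r is None else str(r) for r in row)
    return f"{person} {codes}"


combined = stack(study_1, study_2)
records = [to_record(person, row) for person, row in combined.items()]
for record in records:
    print(record)
```

Only the anchor columns (`c1`, `c2`) are filled in for everyone; those shared responses are what tie the two tests to one metric.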

Illustration: CSI‘D’ and CASI

Information curves

SEM

Relative information
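For reading the three plots above, the standard IRT relationships are sketched below. This assumes the slides follow the usual definitions; the symbols are not taken from the slides, and which test sits in the numerator of the relative information plot is not stated in the source.

```latex
% Test information is the sum of the item information functions
I_{\text{test}}(\theta) = \sum_{i=1}^{n} I_i(\theta)

% Standard error of measurement at a given trait level
\mathrm{SEM}(\theta) = \frac{1}{\sqrt{I_{\text{test}}(\theta)}}

% Relative information of test A compared with test B
\mathrm{RI}_{A:B}(\theta) = \frac{I_A(\theta)}{I_B(\theta)}
```

Under these definitions, a relative information value above 1 at a given trait level means test A measures more precisely than test B at that level.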

Coming attractions • Optimizing screening tests from a pool of items (on Friday) • Item banking and computer adaptive testing (PROMIS initiative) • Incorporation of DIF assessment (tomorrow) • Comments and questions
