A Balancing Act: Common Items Nonequivalent Groups (CING) Equating Item Selection

Tia Sukin, Jennifer Dunn, Wonsuk Kim, Robert Keller
July 24, 2009

Background
• Equating using a CING design requires the creation of an anchor set
• Angoff (1968) developed guidelines for developing the anchor set
  • Length: 20% of operational test (OT) or 20 items
  • Content: Proportionate to OT by strand
  • Statistical Properties: Same mean / S.D.
  • Contextual Effects: Same locations, formats, key, etc.
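The content guideline above (anchor items proportionate to the OT by strand) can be sketched as a simple allocation. This is a minimal illustration, not the authors' procedure: the strand names, the item counts, and the function name `allocate_by_proportion` are hypothetical, and the largest-remainder rounding rule is an assumption chosen only so the counts sum to the requested anchor length.

```python
from math import floor

def allocate_by_proportion(strand_counts, anchor_len):
    """Split anchor_len items across strands in proportion to the OT.

    strand_counts maps strand name -> number of operational items.
    Rounding uses the largest-remainder rule (an assumption, not from the slides).
    """
    total = sum(strand_counts.values())
    quotas = {s: anchor_len * n / total for s, n in strand_counts.items()}
    alloc = {s: floor(q) for s, q in quotas.items()}
    leftover = anchor_len - sum(alloc.values())
    # Give the remaining items to the strands with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical 50-item operational test with a 10-item (20%) anchor set.
operational = {"Number Sense": 15, "Algebra": 20, "Geometry": 10, "Data": 5}
print(allocate_by_proportion(operational, 10))
# {'Number Sense': 3, 'Algebra': 4, 'Geometry': 2, 'Data': 1}
```

The anchor length is passed in rather than computed, because the slide's "20% of OT or 20 items" can be read more than one way.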

Background
• The majority of the research provides support for these guidelines (e.g., Vale et al., 1981; Klein & Jarjoura, 1985; Kingston & Dorans, 1984)
• Research has included robustness studies (e.g., Wingersky & Lord, 1984; Beguin, 2002; Sinharay & Holland, 2007)

Background
• Most research has used placement (e.g., AP), admissions (e.g., SAT), and military (e.g., ASVAB) exams for empirical and informed simulation studies
• Research using statewide accountability exams is limited (e.g., Haertel, 2004; Michaelides & Haertel, 2004)

Background
• General Science tests are administered in all states for all grade levels except:
  • 19 states offer end-of-course (EOC) Science exams in high school
    • 10 offer more than one EOC Science exam
    • 5 offer more than two

Research Questions
• Do the long-established guidelines for maintaining content representation (i.e., proportion by number) hold in creating an anchor set across all major subject areas (i.e., Mathematics, Reading, Science)?
• Are there significant changes between expected raw scores and proficiency classification when different methods for maintaining content representation are used?

Design
5 Methods of Anchor Set Construction
• Operational
• Proportion by Number of Items/Strand
• G Theory
• ICCs
• Construct Underrepresentation
3 Subjects (2 States, 3 Grades)
• Math
• Reading
• Science

Variance Calculation – G Theory
Multivariate Design
• p × i with content strand as a fixed facet
Multivariate Benefit
• Covariance components are calculated for every pair of strands
Item Variance Component
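For readers unfamiliar with the item variance component, the sketch below estimates variance components for a single strand under a crossed persons × items (p × i) design, using the standard expected-mean-squares estimators. It covers only the univariate, within-strand case. The between-strand covariance components of the multivariate design are not shown, and how the slides convert these components into anchor item counts is not stated. The function name and the score matrix are hypothetical.

```python
def pxi_variance_components(scores):
    """Estimate p x i variance components for one strand.

    scores is a list of rows (persons), each a list of item scores.
    Returns estimates for persons, items, and the residual (pi,e).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for the two main effects and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    return {
        "person": (ms_p - ms_pi) / n_i,
        "item": (ms_i - ms_pi) / n_p,
        "residual": ms_pi,
    }

# Hypothetical 4 persons x 3 items, scored 0/1.
strand_scores = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(pxi_variance_components(strand_scores))
```

With real data, small samples can produce negative estimates; this sketch does not handle that case.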

Variance Calculation – ICC
• Use the median P(θ) as the average in calculating within-strand variability
[Figure: item characteristic curves, P(θ) plotted against θ]
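One possible reading of the ICC-based calculation is sketched below: at each θ on a grid, take the median P(θ) across a strand's items as the center, then average the squared deviations of the item curves from that center. Everything beyond "use the median P(θ) as the average" is an assumption: the 2PL model, the θ grid, the squared-deviation measure, and the item parameters are all hypothetical choices for illustration.

```python
from math import exp
from statistics import median

def p_2pl(theta, a, b):
    """2PL item characteristic curve (model choice is an assumption)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def within_strand_variability(items, thetas):
    """Mean squared deviation of item ICCs from the median ICC.

    items is a list of (a, b) parameter pairs for one strand.
    thetas is the grid of ability values to evaluate.
    """
    total = 0.0
    for theta in thetas:
        probs = [p_2pl(theta, a, b) for a, b in items]
        center = median(probs)  # median P(theta) used in place of the mean
        total += sum((p - center) ** 2 for p in probs) / len(probs)
    return total / len(thetas)

# Hypothetical strand with three items, evaluated from theta = -3 to 3.
grid = [t / 2 for t in range(-6, 7)]
strand_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 1.0)]
print(within_strand_variability(strand_items, grid))
```

A strand whose items share the same curve yields zero under this measure, and more heterogeneous strands yield larger values.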

Equating Item Selection Example:

Equating Item Selection
• Percentage of strands that differ by more than one item between selection methods (excluding the construct underrepresentation method):
  • Math: 13%
  • Reading: 52%
  • Science: 20%
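The statistic above can be illustrated with a small sketch. It assumes that a strand "differs by more than one item" when the largest and smallest item counts assigned to it across methods are more than one apart. The slides do not spell out the comparison rule, and the method names and counts below are hypothetical, not the study's data.

```python
def pct_strands_differing(allocations, threshold=1):
    """Percentage of strands whose item counts vary by more than threshold.

    allocations maps method name -> {strand name -> anchor item count}.
    The max-minus-min comparison across methods is an assumption.
    """
    methods = list(allocations.values())
    strands = list(methods[0].keys())
    differing = 0
    for strand in strands:
        counts = [m[strand] for m in methods]
        if max(counts) - min(counts) > threshold:
            differing += 1
    return 100.0 * differing / len(strands)

# Hypothetical anchor allocations for one test under three methods.
example = {
    "proportion": {"A": 3, "B": 4, "C": 2, "D": 1},
    "g_theory": {"A": 3, "B": 2, "C": 3, "D": 2},
    "icc": {"A": 4, "B": 3, "C": 2, "D": 1},
}
print(pct_strands_differing(example))  # 25.0 (only strand B differs by 2)
```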

Example Results – Scoring Category Distributions

Discussion
• Equating is highly robust to the selection process used for creating anchor sets, with two exceptions:
  • Choosing equating items from only 1-2 strands is discouraged
  • More caution may be needed with Science
• Item selection mattered for 22% of the conditions
  • 2/18 for Math: both were the construct underrepresentation condition
  • 3/18 for Reading: all were the construct underrepresentation condition
  • 7/18 for Science: 2 construct underrepresentation / 5 ICC and G theory
• Content balance is important and can be conceptualized in different ways without impacting the equating
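The 22% figure follows directly from the per-subject counts reported on this slide, as the short check below shows. The variable names are illustrative, and the only inputs are the counts given above.

```python
# Conditions where item selection mattered, as reported per subject.
affected = {"Math": 2, "Reading": 3, "Science": 7}
conditions_per_subject = 18

total_affected = sum(affected.values())
total_conditions = conditions_per_subject * len(affected)
share = total_affected / total_conditions

print(f"{total_affected}/{total_conditions} = {share:.1%}")  # 12/54 = 22.2%
```

Science accounts for 7 of the 12 affected conditions, which is consistent with the caution about Science noted above.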

Future Study
• A simulation study is needed so that raw score and proficiency categorizations using the different item selection methods can be compared to truth
• A meta-analysis of published and unpublished studies that provide evidence for or against the robustness of CING equating designs

Thank you 
