Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster* *Louis Roussos retains all rights to the title
Overview of Equating Designs and Methods • Designs • Single Group • Random Groups • Common Item Nonequivalent Groups (CING) • Methods • Mean • Linear • Equipercentile • IRT True or Observed
Guidelines for Selecting Common Items for Multiple-Choice (MC) Only Exams • Representative of the total test (Kolen & Brennan, 2004) • 20% of the total test • Same item positions • Similar average/spread of item difficulties (Dorans, Kubiak, & Melican, 1997) • Content representative (Klein & Jarjoura, 1985)
Challenges in Equating Mixed-Format Tests (Kolen & Brennan, 2004; Muraki, Hombo, & Lee, 2000) • Constructed Response (CR) scored by raters • Small number of tasks • Inadequate sampling of construct • Changes in construct across forms • Common Items • Content/difficulty balance of common items • MC only may result in inadequate representation of groups/construct • IRT • Small number of tasks may result in unstable parameter estimates • Typically assume a single dimension underlies both item types • Format Effects
Current Research • Number of CR Items • Smaller RMSD with larger numbers of items and/or score points (Li & Yin, 2008; Fitzpatrick & Yen, 2001) • Misclassification (Fitzpatrick & Yen, 2001) • With fewer than 12 items, more score points resulted in smaller error rates • With more than 12 items, error rates were less than 10% regardless of score points • Trend Scoring (Tate, 1999, 2000; Kim, Walker, & McHale, 2008) • Rescoring samples of CR items • Smaller bias and equating error
Cont. • Format Effects (FE) • MC and CR measure similar constructs (Ercikan et al., 1993; Traub, 1993) • Males scored higher on MC; females higher on CR (DeMars, 1998; Garner & Engelhard, 1999) • Kim & Kolen, 2006 • Narrow-range tests (e.g., credentialing) • Wide-range tests (e.g., achievement) • Individual Consistency Index (Tatsuoka & Tatsuoka, 1982) • Detecting aberrant response patterns • Not specifically studied in the context of mixed-format tests
Purpose and Research Questions Purpose: Examine the impact of equating mixed-format tests when student subscores differ across item types. Specifically, • To what extent does the intra-individual consistency of examinee responses across item formats impact equating results? • How does the selection of common items differentially impact equating results with varying levels of intra-individual consistency?
Data • “Old Form” (OL) treated as “truth” • Large-scale 6th grade testing program • Mathematics • 54 point test • 34 multiple choice (MC) • 5 short answer (SA) • 5 constructed response (CR) worth 4 points each • Approx. 70,000 examinees • “New Form” (NE) • Exactly the same items as OL • Samples of examinees from OL
[Diagram] NE (new form): samples of 3,000 examinees • OL (old form): all examinees • Both forms use the same 2006-07 scoring test (39 items) • The only difference between the forms is the examinees
Intra-Individual Consistency • Consistency of student responses across formats • Regression of polytomous item subscores (CR) onto dichotomous item subscores (MC and SA) • Standardized residuals • Range from approximately -4.00 to +8.00 • Example: Index of +2.00 • Student subscore on CR under-predicted by two standard deviations based on MC subscores
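The index described above can be sketched in a few lines: regress each examinee's CR subscore on the dichotomous subscore and standardize the residuals. This is an illustrative reconstruction, not the study's actual code; all variable names and the simulated scores are assumptions.

```python
import numpy as np

# Hypothetical data standing in for examinee subscores.
rng = np.random.default_rng(0)
n = 1000
mc_sa = rng.integers(0, 35, size=n).astype(float)            # dichotomous (MC + SA) subscore
cr = np.clip(0.5 * mc_sa + rng.normal(0, 3, size=n), 0, 20)  # polytomous (CR) subscore

# Ordinary least-squares fit: cr = b0 + b1 * mc_sa
b1, b0 = np.polyfit(mc_sa, cr, 1)
residuals = cr - (b0 + b1 * mc_sa)

# Standardized residuals: the intra-individual consistency index.
# An index of +2.00 means the CR subscore is two standard deviations
# higher than predicted from the dichotomous subscore.
index = residuals / residuals.std(ddof=1)
```

Examinees with large positive indices perform better on CR than their MC/SA performance predicts; large negative indices indicate the reverse.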
Samples • Three groups of examinees based on intra-individual consistency index • Below -1.50 (NEG) • -1.50 to +1.50 (MID) • Above +1.50 (POS) • 3,000 examinees per sample • Sampled from each group based on percentages • Samples selected to have same quartiles and median as whole group of examinees
Sampling Conditions • 60/20/20 • 60% sampled from one of the groups (i.e., NEG, MID, POS) • 20% sampled from each of the remaining groups • Repeated for each of the three groups • 40/30/30 • 40% sampled from one group; 30% from each of the remaining groups
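One sampling condition can be sketched as a stratified draw from the three consistency groups. The group cut points (-1.50, +1.50) and the 60/20/20 weights come from the slides; the placeholder index values and function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
index = rng.normal(0, 1.5, size=70000)  # placeholder consistency indices for the pool

# Group examinees by intra-individual consistency index.
neg = np.flatnonzero(index < -1.50)
mid = np.flatnonzero((index >= -1.50) & (index <= 1.50))
pos = np.flatnonzero(index > 1.50)

def draw_sample(focal, other1, other2, n=3000, weights=(0.6, 0.2, 0.2)):
    """Draw an n-examinee sample with the focal group oversampled."""
    parts = [rng.choice(g, size=int(n * w), replace=False)
             for g, w in zip((focal, other1, other2), weights)]
    return np.concatenate(parts)

# 60% NEG, 20% MID, 20% POS; repeat with each group as focal for the
# other conditions, and with weights=(0.4, 0.3, 0.3) for 40/30/30.
sample_neg60 = draw_sample(neg, mid, pos)
```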
Common Items • Six sets of common items • MC only (12 points) • CR only (12 points) • MC (4) and CR (8) • MC (8) and CR (4) • MC (4), CR (4), and SA (4) • MC (7), CR (4), and SA (1) • Representative of total test in terms of content, difficulty and length
Equating • Common-item nonequivalent groups design • Item parameters calibrated using Parscale 4.1 • 3-parameter logistic model (3PL) for MC items • 2PL model for SA items • Graded Response Model for CR items • IRT scale transformation • Mean/mean, mean/sigma, Stocking-Lord, and Haebara • IRT true score equating
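Of the four scale transformations listed, mean/sigma is the simplest to show: it matches the mean and standard deviation of the common-item difficulty estimates across calibrations. A minimal sketch, with hypothetical difficulty values (Stocking-Lord and Haebara instead minimize characteristic-curve loss criteria):

```python
import numpy as np

# Hypothetical common-item difficulty estimates from the two calibrations.
b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])  # items on the OL scale
b_new = np.array([-1.0, -0.2, 0.3, 1.0, 1.9])  # same items on the NE scale

# Mean/sigma transformation, new scale -> old scale:
# theta_old = A * theta_new + B
A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

# NE difficulties placed on the OL scale.
b_transformed = A * b_new + B
```

After transformation the common-item difficulties have the same mean and standard deviation on both scales, which is the defining property of the method.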
[Diagram] Equating OL and NE • All items shared in common: "truth" established by equating NE to OL using all items as common items • "Common" items: study equatings conducted using only a selected set of items treated as common
Evaluation • Bias and RMSE • At each score point • Averaged over score points • Classification Consistency
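The evaluation criteria can be sketched as follows: compare the study equating to the criterion ("truth") equating at each raw score point, then average. The arrays below are hypothetical stand-ins, not study results; with a single replication the per-point RMSE reduces to the absolute error.

```python
import numpy as np

# Hypothetical equated scores on the 0-54 raw score scale.
true_equiv = np.linspace(0, 54, 55)  # criterion ("truth") equated scores
est_equiv = true_equiv + np.random.default_rng(2).normal(0, 0.3, 55)

# At each score point.
bias_per_point = est_equiv - true_equiv
abs_error_per_point = np.abs(est_equiv - true_equiv)  # per-point RMSE, one replication

# Averaged over score points.
mean_bias = bias_per_point.mean()
mean_rmse = np.sqrt(np.mean((est_equiv - true_equiv) ** 2))
```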
Discussion • Different equating results based on sampling conditions • Differences more exaggerated when using common-item sets with mostly CR items • MID 60/20/20 condition most similar to the full data, with small differences across common-item selections
Limitations and Implications • Limitations • Sampling conditions • Common item selections • Only one equating method • Implications for future research • Sampling conditions, common item selections, additional equating methods • Other content areas and grade levels • Other testing programs • Simulation studies
Thanks! • Rob Keller • Mike, Louis, Won, Candy, and Jessalyn