Scaling and Equating • Joe Willhoft, Assistant Superintendent of Assessment and Student Information • Yoonsun Lee, Director of Assessment and Psychometrics • Office of Superintendent of Public Instruction
Overview • Scaling • Definition • Purposes • Equating • Definition • Purposes • Designs • Procedures • Vertical Scale
What is Scaling? • Scaling is the process of associating numbers with the performance of examinees • What does 400 mean in WASL? It is not a raw score but a scaled score.
Primary Score Scale • Many educational tests use one primary score scale for reporting scores • Raw scores, scaled scores, percentile • WASL and WLPT-II use scaled scores
Activity Grade 3 Mathematics Items
Why Use a Scaled Score? • Minimizing misinterpretations, e.g. Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why? The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?
Why Use a Scaled Score? • Facilitate meaningful interpretation • Comparison of examinees’ performance on different forms • Tracking of trends in group performance over time • Comparison of examinees’ performance on different difficulty levels of a test
Raw Score and Scaled Score • Monotonically related • Based on Item Response Theory Ability Scale • Each observed performance corresponds to an ability value (theta) • Scaled score = a + b*(theta)
Linear Transformation Simple linear transformation: Scaled Score = a + b*(ability) Two parameters are used to describe that relationship: a and b. We obtain some sample data and find the values of a and b that best fit the data to the linear regression model.
WASL 400 = a + b*(theta 1) 375 = a + b*(theta 2) • Theta 1 and theta 2 are established by the standard setting committees. • a and b are determined by solving the equations above.
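A minimal sketch of how the two equations above pin down a and b: with two (theta, scaled score) pairs, the slope and intercept follow directly. The theta values used here are hypothetical placeholders, not the committees' actual cut points, and `solve_scale` is an illustrative name only.

```python
def solve_scale(theta_hi, score_hi, theta_lo, score_lo):
    """Solve score = a + b*theta from two (theta, score) anchor points."""
    if theta_hi == theta_lo:
        raise ValueError("the two theta values must differ")
    b = (score_hi - score_lo) / (theta_hi - theta_lo)  # slope
    a = score_hi - b * theta_hi                        # intercept
    return a, b


# Hypothetical theta cut points (assumed for illustration only).
a, b = solve_scale(theta_hi=0.5, score_hi=400, theta_lo=-0.1, score_lo=375)
print(round(a, 3), round(b, 3))
```

The same function would apply to any pair of fixed points, for example a minimum and maximum scaled score of 300 and 900.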
WLPT-II • Min Scaled Score = 300 • Max Scaled Score = 900 300 = a + b*(theta 1) 900 = a + b*(theta 2)
WASL Scaling • 375 is the cut between level 1 and level 2 for all grade levels and content areas • 400 is the cut between level 2 and level 3 for all grade levels and content areas. • Each grade/content has a separate scale (WASL) • All grade levels are in the same scale (WLPT-II) - vertically linked
WASL [Figure: a separate scale for each grade (G3, G4, G5, G6, G7, G8, HS), each marked with cut scores at 375 and 400]
WLPT-II (Vertical Scale) [Figure: a single scale running from 300 to 900 across grades K through 12]
Purpose of Equating • Large scale testing programs use multiple forms of the same test • Differences in item and test difficulties across forms must be controlled • Equating is used to ensure that scale scores are equivalent across tests
Requirements of Equating Four necessary conditions for equating (Lord, 1980): • Ability - Equated tests must measure the same construct (ability) • Equity - After transformation, the conditional frequencies for each test are the same • Population invariance • Symmetry
Ability - Equated Tests Must Measure the Same Construct (Ability) • Item and test specifications are based on definitions of the abilities to be assessed • Item specifications define how the abilities are shown • Test specifications ensure representation of all aspects of the construct • Tests to be equated should measure the same abilities in the same ways
Equity • Scales on the tests to be equated should be strictly parallel after equating • Frequency distributions should be roughly equivalent after transformation
Population Invariance • The outcome of the transformation must be the same regardless of which group is used as the anchor • If score Y1 on Y is equated to score X1 on X, the result should be the same as if score X1 is equated to score Y1 • If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor)
Symmetry • The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale • If the 2007 Mathematics scale is equated to 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale
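The symmetry requirement can be illustrated with a linear equating function, which is one simple assumed form. The slope and intercept below are made-up values, not WASL results. Converting X to Y and then back with the inverse function should return the original score.

```python
def make_linear_equating(slope, intercept):
    """Return (x_to_y, y_to_x), where y_to_x is the exact inverse of x_to_y."""
    if slope == 0:
        raise ValueError("slope must be nonzero for the function to be invertible")

    def x_to_y(x):
        return slope * x + intercept

    def y_to_x(y):
        return (y - intercept) / slope

    return x_to_y, y_to_x


# Hypothetical coefficients, for illustration only.
x_to_y, y_to_x = make_linear_equating(slope=1.05, intercept=0.5)
for x in (0, 10, 25):
    assert abs(y_to_x(x_to_y(x)) - x) < 1e-9  # round trip recovers x
```

A regression of Y on X would generally fail this check, because the regression of X on Y is not its inverse.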
Equating Design Used in WASL • Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995) • A set of items in common (anchor items) • Different groups of examinees (in different years)
Equating Method • Item Response Theory Equating uses a transformation from one scale to the other • to make score scales comparable • to make item parameters comparable
Equating of WASL • The items on a WASL test differ from year-to-year (within grade and content area) • Some items on the WASL have appeared in earlier forms of the test, and item calibrations (“b” difficulty/step values) were established. These are called “Anchor Items”. • Each year’s WASL is equated to the previous year’s scale using these anchor items.
Equating Procedure • Step 1: Identify anchor item difficulties from the bank. • Step 2: Calibrate all items on the current test form without fixing anchor item difficulties. • Step 3: Calculate the mean of the anchor items using bank difficulties. • Step 4: Calculate the mean of the anchor items using calibrated difficulties from the current test. • Step 5: Add a constant to the current test difficulties so their anchor mean equals the mean from the bank values.
Equating Procedure (continued) • Step 6: For each anchor item, subtract the current difficulty (after adding the constant) from the bank difficulty. • Step 7: If the largest absolute difference is greater than 0.3, drop that item from consideration as an anchor item. • Step 8: Repeat steps 3-7 using the remaining anchor items.
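The procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the operational WASL code: the item IDs and difficulty values are invented, `equate_with_anchors` is a hypothetical name, and the final step simply shifts every current difficulty by the constant rather than re-running the calibration with anchors fixed.

```python
def equate_with_anchors(bank, current, threshold=0.3):
    """Mean-shift equating with iterative removal of unstable anchor items.

    bank:    dict item_id -> bank difficulty (anchor items only)
    current: dict item_id -> freely calibrated difficulty on the current form
    Returns (constant, kept_anchor_ids, equated_difficulties).
    """
    anchors = [item for item in bank if item in current]
    while True:
        if not anchors:
            raise ValueError("no anchor items left")
        bank_mean = sum(bank[i] for i in anchors) / len(anchors)
        current_mean = sum(current[i] for i in anchors) / len(anchors)
        constant = bank_mean - current_mean
        # Displacement of each anchor after shifting onto the bank scale.
        diffs = {i: bank[i] - (current[i] + constant) for i in anchors}
        worst = max(anchors, key=lambda i: abs(diffs[i]))
        if abs(diffs[worst]) <= threshold:
            break
        anchors.remove(worst)  # drop only the single worst item, then repeat
    equated = {i: d + constant for i, d in current.items()}
    return constant, anchors, equated


# Invented difficulties: A4 has drifted and should be dropped as an anchor.
bank = {"A1": -0.50, "A2": 0.20, "A3": 1.00, "A4": 0.40}
current = {"A1": -0.70, "A2": 0.00, "A3": 0.80, "A4": 1.00, "N1": 0.30}
constant, kept, equated = equate_with_anchors(bank, current)
print(round(constant, 3), kept)
```

Dropping one item per pass matters because removing an unstable anchor changes both means, and therefore every remaining difference.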
Equating Example • Item calibrations before equating (anchor items flagged on right with “Y”)
Equating Example • Item #17 was removed as an anchor item; other anchors were kept.
Equating Example • Item calibrations after equating (anchor items fixed with “A” in Measure column)
Transformed Scores: Raw-to-Theta-to-Scale Procedures • Calibration software provides a Raw-to-Theta look-up table. • A Theta-to-Scale Score transformation is applied, derived from the theta values at the 3 cut-points from the Standard Setting committee: (L2) 375, (L3) 400, and (L4) SS, obtained by solving for (L4) in SS = m*(theta) + b, where m and b are derived from (L2) and (L3).
Transformed Scores Example • In Grade 4 Mathematics, the Standard Setting Committee established the following cut-scores: • Setting (L2) = 375 and (L3) = 400 establishes this Theta-to-SS formula: SS = (37.76435 * theta) + 378.3988 • Solving for (L4), SS(L4) = 427.115
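As a quick check on the Grade 4 Mathematics formula, the sketch below applies it and also inverts it. The theta values it prints are back-computed from the slope, intercept, and scale scores stated above. They are not taken from the committee's cut-score table, which is not reproduced in this text. The function names are illustrative.

```python
SLOPE = 37.76435      # from the Grade 4 Mathematics formula above
INTERCEPT = 378.3988


def theta_to_ss(theta):
    """Apply SS = SLOPE * theta + INTERCEPT."""
    return SLOPE * theta + INTERCEPT


def ss_to_theta(ss):
    """Invert the formula to recover the theta behind a scale score."""
    return (ss - INTERCEPT) / SLOPE


# Back-computed thetas at the three reported scale-score cut points.
theta_l2 = ss_to_theta(375)
theta_l3 = ss_to_theta(400)
theta_l4 = ss_to_theta(427.115)
print(round(theta_l2, 3), round(theta_l3, 3), round(theta_l4, 3))
```

Only (L2) and (L3) are fixed by design at 375 and 400. The (L4) scale score falls wherever its theta lands on the resulting line, which is why it is not a round number.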
Theta-to-SS Transformations • The current Theta-to-SS transformations:
Transformed Scores • Raw-to-Scale Score table from equating report
How to Determine Cut Score (Until 2006) • If there is 400, the cut score is 400 • If 400 does not exist, the nearest score becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 401 is the cut score - 399, 402, 405: 399 is the cut score
How to Determine Cut Score (2007) • If there is 400, the cut score is 400 • If 400 does not exist, the next lowest score becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 398 is the cut score - 399, 402, 405: 399 is the cut score
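The two cut-score rules (until 2006 and 2007) can be compared side by side in a short sketch. The function names are illustrative. The tie-breaking choice in the older rule is an assumption, since the slides do not say what happens when two attainable scores are equally close to 400.

```python
def cut_score_until_2006(scores, target=400):
    """Use the target if attainable; otherwise the attainable score nearest to it."""
    if target in scores:
        return target
    # Assumption: a tie in distance goes to the lower score (not stated in the source).
    return min(scores, key=lambda s: (abs(s - target), s))


def cut_score_2007(scores, target=400):
    """Use the target if attainable; otherwise the highest attainable score below it."""
    if target in scores:
        return target
    below = [s for s in scores if s < target]
    if not below:
        raise ValueError("no attainable score below the target")
    return max(below)


# The three examples from the slides.
for scores in ([397, 400, 402], [398, 401, 403], [399, 402, 405]):
    print(scores, cut_score_until_2006(scores), cut_score_2007(scores))
```

The two rules differ only in the second example, where 401 is nearest to 400 but 398 is the next lowest score.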
Vertical Scale • Examinee performance across grade levels on a single scale • Measure individual student growth • Locate all items across grade level on a single scale • Proficiency standard from different grade levels to a single scale
Vertical Scaling vs. Equating • Equating: scores on different test forms to be used interchangeably within grade level • Vertical scaling: • Performance across all grade levels on the same scale • Measure students’ growth • Not equating
Data Collection Design • Common item design • Common items between adjacent grade levels • Select items at the appropriate level for each grade • Equivalent group design • Same examinees • Take on-grade test or off-grade test (usually lower grade test)
Previous Vertical Linking Study • Math in Grades 3, 4, and 5 • Purpose of the study • How much are students growing over time? • What is the precision of these estimates?
Data • The data consists of items used in the pilot test for Grades 3 and 5 in 2004 and 2005 • Operational data for Grade 4 in 2005
Linking Design • Items across all forms in three grades • Each form within grade includes a common block of items • Common item non-equivalent groups design
Results • Comparing the p-values for the linking items across grades suggests some instability • Growth is larger from grades 3 to 4 than grades 4 to 5 • Pilot data vs. operational data • Motivation factor (G4 to G5) • Backward Equating
Future Plan • Vertical linking study will be conducted in January 2008 using the 2007 reading WASL. • The results will be presented next year.