Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time. Yeow Meng Thum Hye Sook Shin. UCLA Graduate School of Education & Information Studies National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time • Yeow Meng Thum, Hye Sook Shin • UCLA Graduate School of Education & Information Studies, National Center for Research on Evaluation, Standards, and Student Testing (CRESST) • CRESST Conference 2004, Los Angeles
Rationale • Research shows that cut-scores vary as a function of many factors: raters, procedures, and time. • How does one defend a particular cut-score? Averaging several values and using collateral information are the current options. • High-stakes accountability hinges on the comparability of performance standards over time. • Some method is required to monitor cut-scores for consistency across groups and over time. (Green, et al.)
Purpose of Study • Develop an approach for estimating the impact of procedural factors, rater characteristics, and time on cut-scores. • Monitor the consistency of cut-scores across several groups.
Transforming Judgments into Scale Scores • Figure 1: Working with the Grade 3 SAT-9 mathematics scale
Performance Distribution for Four Urban Schools • Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools
Potential Impact of Revising a Cut-score • Table 1: Potential impact on school performance when cut-score changes
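To illustrate why a small cut-score revision can matter for a school, the sketch below computes the percentage of students at or above a cut-score for a few candidate values. The scores, the candidate cut-scores, and the function name `percent_at_or_above` are all hypothetical; nothing here reproduces Table 1.

```python
def percent_at_or_above(scores, cut):
    """Return the percentage of scores at or above the cut-score."""
    if not scores:
        raise ValueError("scores must be non-empty")
    passing = sum(1 for s in scores if s >= cut)
    return 100.0 * passing / len(scores)


# Hypothetical scale scores for one school (not data from the study).
school_scores = [590, 600, 605, 612, 618, 619, 621, 625, 633, 640]

# Hypothetical candidate cut-scores, a few scale points apart.
for cut in (615, 619, 623):
    print(cut, percent_at_or_above(school_scores, cut))
```

With these made-up scores, moving the cut-score by a few scale points changes the passing percentage considerably, which is the kind of sensitivity the table is meant to convey.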
Data & Model • Simulate data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995) • Factors of the standard-setting study • Rater Dimensions (Teacher, Non-Teacher, etc.) • Procedural Factors/Treatments • Type of Feedback (outcome or impact feedback, “Yes” or “No”, etc.) • Item Sampling in Booklet (number of items, etc.) • Type of Task (a modified Angoff, a contrasting-groups approach, the Bookmark method, etc.)
Treating Binary Outcomes • Binary outcome • (1) • (pass if rater j thinks the passing candidate has a good chance of getting the ith item right in session t) • Logit link function • (2)
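A conventional way to write a binary judgment and its logit link is sketched below. The symbols $y_{ijt}$, $p_{ijt}$, and $\eta_{ijt}$ are assumed here for illustration and may not match the notation of equations (1) and (2).

```latex
y_{ijt} =
\begin{cases}
1 & \text{if rater } j \text{ judges item } i \text{ a pass in session } t,\\
0 & \text{otherwise,}
\end{cases}
\qquad
p_{ijt} = \Pr(y_{ijt} = 1),
\qquad
\eta_{ijt} = \log\frac{p_{ijt}}{1 - p_{ijt}} .
```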
IRT Model for Cut-score - I • Item Response Theory (IRT) Model • (3) • Procedural Factors Impacting a Rater’s Cut-scores • (4) • Where • is the fixed effect due to session characteristics s • is a random effect, which evolves over time (ROUNDjt) and is a function of rater characteristics, Xpj
IRT Model for Cut-score - II • Estimating Factors Impacting A Rater’s Cut-scores • (5) • are distributed bivariate normal with means (0, 0) and variance-covariances
Likelihood • Conditional on , y has probability • Prior distribution of j • (6) • Conditional posterior of the rater random effects j is • (7) • where • Joint marginal likelihood • (8)
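The generic structure of these quantities for a mixed-effects model with binary outcomes can be sketched as follows. This assumes conditionally independent judgments given the rater's random effects and a normal prior on them; the symbols $\mathbf{y}_j$, $\mathbf{u}_j$, $p_{ijt}$, and $\boldsymbol{\Sigma}$ are illustrative and may differ from the notation of equations (6) to (8).

```latex
\begin{aligned}
p(\mathbf{y}_j \mid \mathbf{u}_j)
  &= \prod_{t}\prod_{i} p_{ijt}^{\,y_{ijt}} (1 - p_{ijt})^{1 - y_{ijt}}, \\
\mathbf{u}_j &\sim N(\mathbf{0}, \boldsymbol{\Sigma}), \\
p(\mathbf{u}_j \mid \mathbf{y}_j)
  &= \frac{p(\mathbf{y}_j \mid \mathbf{u}_j)\, p(\mathbf{u}_j)}
          {\int p(\mathbf{y}_j \mid \mathbf{u})\, p(\mathbf{u})\, d\mathbf{u}}, \\
L &= \prod_{j} \int p(\mathbf{y}_j \mid \mathbf{u}_j)\, p(\mathbf{u}_j)\, d\mathbf{u}_j .
\end{aligned}
```

The integral over the random effects has no closed form for a logit link, so it has to be approximated numerically; SAS PROC NLMIXED does this by default with adaptive Gaussian quadrature.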
Multiple Studies: Consistency & Stability • Procedural Factors Impacting a Rater’s Cut-scores for separate study g (g = 1, 2, 3, …, G) • (9) • Where • is the fixed effect due to session characteristics s • is a random effect, which evolves over time (SESSIONjt) and is a function of rater characteristics, Xpj • Group Factors Impacting a Rater’s Severity • (10)
Simulation • SAS PROC NLMIXED • 150 raters, each randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors. • Session Factor 1: Feedback type • Session Factor 2: Item Targeting in Booklet • Session Factor 3: Type of Standard Setting Task • Rater Characteristics: Teacher, Non-Teacher • Change over Round (time)
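A data-generating process along these lines can be sketched in a few lines of Python. This is a minimal illustration, not the authors' SAS code: it assumes a Rasch-type response function, two-level session factors assigned at random each round, a random intercept and a random slope over rounds drawn from a bivariate normal, and made-up parameter values. The function name `simulate_judgments`, the item count, and every coefficient are hypothetical.

```python
import math
import random


def simulate_judgments(n_raters=150, n_rounds=4, n_items=20, seed=0):
    """Simulate binary pass/fail judgments; returns (rater, round, item, teacher, factors, y) rows."""
    rng = random.Random(seed)

    # Hypothetical item locations on a logit scale.
    item_loc = [rng.gauss(0.0, 1.0) for _ in range(n_items)]

    # Hypothetical fixed effects (assumed values, not estimates from the study).
    beta_factors = (0.3, -0.2, 0.1)  # feedback, item targeting, task type
    beta_teacher = 0.4               # teacher vs. non-teacher
    beta_round = -0.1                # average change per round

    # Hypothetical random-effect SDs and correlation (intercept, slope).
    sd0, sd1, rho = 0.5, 0.2, 0.3

    rows = []
    for j in range(n_raters):
        teacher = j % 2  # alternate teacher / non-teacher raters
        # Correlated bivariate normal draw built from two independent normals.
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        u0 = sd0 * z0
        u1 = sd1 * (rho * z0 + math.sqrt(1.0 - rho ** 2) * z1)
        for t in range(n_rounds):
            # Each session factor takes level 0 or 1 at random (an assumption).
            factors = tuple(rng.randint(0, 1) for _ in range(3))
            cut = (beta_teacher * teacher + u0 + (beta_round + u1) * t
                   + sum(b * f for b, f in zip(beta_factors, factors)))
            for i, loc in enumerate(item_loc):
                # Rasch-type probability that the rater passes item i.
                p = 1.0 / (1.0 + math.exp(-(cut - loc)))
                y = 1 if rng.random() < p else 0
                rows.append((j, t, i, teacher, factors, y))
    return rows


rows = simulate_judgments()
print(len(rows))  # 150 raters x 4 rounds x 20 items = 12000 rows
```

The rows are in long format (one judgment per row), which is the layout a mixed-effects fitting routine would typically expect.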
Selected Results • Model (reasonably) recovers parameters within sampling uncertainty across 3 studies. • Average cut-score (All Teachers) for each rater group at the last Round is not significantly different from 619, while the first Round results were significantly different. • Results from the model for multiple studies are similarly encouraging.
Suggestions • Large-scale testing programs should monitor their cut-score estimates for consistency and stability. • For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time. • The model in this paper can be modified based on actual data so that we can verify and balance out the variation due to the relevant factors of the study.