
Hierarchical Linear Modeling for Detecting Cheating and Aberrance


Presentation Transcript


  1. Hierarchical Linear Modeling for Detecting Cheating and Aberrance. Statistical Detection of Potential Test Fraud, May 2012, Lawrence, KS. William Skorupski, University of Kansas; Karla Egan, CTB/McGraw-Hill

  2. Purpose of the Study • “Cheating” as a paradigm for psychometric research has focused on individuals. • Our purpose is to identify groups of cheaters, based on the premise that teachers and administrators may be motivated to inappropriately influence students’ scores.

  3. Background • Importance of cheating detection • Cheating as classroom-, school-, or even district-wide phenomenon • Results of many large-scale educational assessments are tied to incentives, e.g., merit-based pay, accountability, AYP targets from NCLB • Teachers may be tempted to “teach to the test,” provide inappropriate materials, alter students’ answer sheets

  4. Previous Study • Skorupski & Egan (2011) demonstrated a Bayesian hierarchical modeling approach for group-level aberrance (real data). • Cross-validation with external reports of impropriety. • Reasonable detection rates, difficult to verify results.

  5. Findings • Relatively large aberrance for a few schools at certain Time points suggested that this approach may be useful for flagging potentially cheating schools. • The present simulation study was planned to evaluate detection power.

  6. Two “Non-Aberrant” Schools

  7. Two Flagged Schools

  8. Goals of the study • Evaluate the robustness of the Bayesian HLM approach for detecting group-level cheating through Monte Carlo simulation. • Develop heuristics for flagging known “cheaters” from the analysis

  9. Cheating & Aberrance • Certain kinds of aberrance may be evidence of cheating • Answer copying • Model-data misfit • In our analysis: unusually high group performance at given time, given marginal group & time effects • i.e., Large positive interaction effect

  10. Important Note • No cheating/aberrance detection method can “prove” cheating, but merely flag unusual individuals or groups for further review. • Our goal is to demonstrate detection of known group-level cheating with adequate power while maintaining an acceptable Type I error rate.

  11. Methods – Data Simulation • Data created to emulate a vertically scaled statewide assessment (SWA) • 3 linked administrations, with means increasing by 0.5σ between each Time: μt = 0, 0.5, 1 • 60 Groups, with within-group sample sizes N(g) ranging from 10 to 260 (Total N = 4,650)

  12. 51 of the 60 group means at Time 1 were drawn from μ(g) ~ N(0, 1) • The remaining 3 × 3 = 9 groups crossed N(g) = 10, 60, 110 with μ(g) = −1, 0, 1 • These 9 groups (3 at each Time, so 5% of all Group-by-Time combinations) will be the "cheaters"
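
As a concrete illustration, a minimal NumPy sketch of this design follows. The variable names, random seed, and the uniform draw for group sizes are assumptions; the slides specify only the ranges and the fixed 3 × 3 crossing.

```python
# Sketch of the simulated group-mean design (names and seed are assumptions).
import numpy as np

rng = np.random.default_rng(2012)

n_groups = 60
mu_time = np.array([0.0, 0.5, 1.0])              # Time means, 0.5 SD apart

# 51 group means drawn from N(0, 1); the last 9 are fixed at the
# 3 x 3 crossing of mu(g) = -1, 0, 1 with N(g) = 10, 60, 110.
mu_group = rng.normal(0.0, 1.0, size=n_groups)
mu_group[-9:] = np.repeat([-1.0, 0.0, 1.0], 3)

n_within = rng.integers(10, 261, size=n_groups)  # 10..260 (distribution assumed)
n_within[-9:] = np.tile([10, 60, 110], 3)
```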

  13. Simulate Individual Scores • θ ~ MVN(0, R): 0 a vector of zeros, R a correlation matrix with off-diagonals = 0.77 (based on the real-data study) • Each individual score Yigt was created by taking θigt and adding its respective Time and Group mean • At this point, all scores are "non-aberrant"; main effects alone account for differences
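
A sketch of this step is below, continuing the (assumed) names from the previous block; this is an illustration of the slide's recipe, not the authors' code.

```python
# Sketch: draw theta ~ MVN(0, R) per person, then add Time and Group means.
import numpy as np

R = np.full((3, 3), 0.77)        # correlations among the 3 Time points
np.fill_diagonal(R, 1.0)

def simulate_group(n, mu_g, mu_time, rng):
    """Return an (n x 3) matrix of scores Y for one group: theta + main effects."""
    theta = rng.multivariate_normal(np.zeros(3), R, size=n)
    return theta + mu_time + mu_g    # broadcasting adds the Time mean per column
```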

  14. Simulate "Cheating" • For cheating groups, an additional interaction effect is added to Yigt • 3 at each Time, for μ(g) = −1, 0, or 1 and N(g) = 10, 60, or 110 • Group-by-Time (60 × 3) matrix of effects: if GTgt = 0 → no cheating; if GTgt > 0 → cheating • GTgt = 1 for simulated cheaters (i.e., the Group mean is +1σ above the main effects)
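
The injection step might look like the following sketch; the GT matrix layout follows the slide, while the function interface is an assumption.

```python
# Sketch: add the Group-by-Time interaction ("cheating") effect to the scores.
import numpy as np

def add_cheating(Y, g, GT):
    """Y: (n x 3) scores for group g; GT: (60 x 3) matrix of interaction effects.
    GT[g, t] = 0 means no cheating; GT[g, t] = 1 shifts group g up 1 SD at Time t."""
    return Y + GT[g]    # adds the group's row of effects across the Time columns
```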

  15. [Figures: simulated cheating patterns at Time 1, Time 2, and Time 3] Each of these 3 patterns was crossed with 3 group sizes: N = 10, 60, 110

  16. Notes on Simulation • Forms must be linked over Time • In this analysis, scale scores were simulated directly (treating scores as measured without error); in practice, item response data would first be obtained and linked onto a vertical scale • Examinees are nested within groups, and Time points are nested within individuals

  17. [Diagram: nesting structure. Groups g = 1, …, G; Individuals i = 1, …, N(g) within each Group; linked Time points t = 1, 2, 3 give each Person ig the scores Yig1, Yig2, Yig3]

  18. Methods – Analysis: Hierarchical Growth Model • Model: scale scores for individuals (i) within groups (g) over time (t): Yigt = β0 + β1g + β2t + β3gt + εigt • εigt ~ N(0, σ²) • Fully Bayesian estimation (MCMC) using WinBUGS (Lunn et al., 2000) • 50 replications
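
The study estimated this model in WinBUGS; as an illustration only, here is a rough re-expression of the same growth model in PyMC. The priors, sampler settings, and function interface are assumptions, not taken from the study.

```python
# A rough PyMC sketch of the hierarchical growth model (priors assumed;
# the study itself used WinBUGS). Identification constraints are omitted.
import numpy as np
import pymc as pm

def fit_growth_model(y, group, time, n_groups=60, n_times=3):
    """y: scores; group, time: integer index arrays of the same length as y."""
    with pm.Model():
        b0 = pm.Normal("b0", 0.0, 10.0)
        b1 = pm.Normal("b1", 0.0, 10.0, shape=n_groups)             # Group effects
        b2 = pm.Normal("b2", 0.0, 10.0, shape=n_times)              # Time effects
        b3 = pm.Normal("b3", 0.0, 10.0, shape=(n_groups, n_times))  # interactions
        sigma = pm.HalfNormal("sigma", 5.0)
        mu = b0 + b1[group] + b2[time] + b3[group, time]
        pm.Normal("y_obs", mu, sigma, observed=y)
        return pm.sample(draws=5000, tune=2000, chains=2)
```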

  19. Baseline Model • Only Time- and Group-level effects are estimated as differences in intercepts (plus interaction term) • With real data, other models could also incorporate covariates (SES, etc.) at any level of the model

  20. Outcomes • The parameter estimates β3gt (Group-by-Time interactions) are used to infer aberrant group performance at a given Time • β1g (the main effect for Group) could also be used to detect systematic aberrance • Effect-size (d) values for the parameter estimates, plus the "Posterior Probability of Cheating" (PPoC)

  21. Outcomes • PPoC = the proportion of posterior draws (MCMC samples from the posterior) above zero • Criteria for flagging: PPoC ≥ 0.75 and a standardized effect size for the interaction of d ≥ 0.5 (the previous study found d ≥ 0.5 to be a reasonable criterion)
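
A sketch of how these outcomes might be computed from the posterior draws of one interaction term is below; the exact standardization behind d is an assumption (here, the posterior mean divided by the posterior mean of σ).

```python
# Sketch: the slides' flag rule applied to posterior draws (names assumed).
import numpy as np

def flag_interaction(b3_draws, sigma_draws):
    """b3_draws: posterior samples of one beta3[g, t]; sigma_draws: samples of sigma."""
    ppoc = np.mean(b3_draws > 0)                  # Posterior Probability of Cheating
    d = np.mean(b3_draws) / np.mean(sigma_draws)  # standardized effect size (assumed form)
    return ppoc >= 0.75 and d >= 0.5              # flag as aberrant?
```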

  22. Cross-validation • Any Group-by-Time interaction effect with d ≥ 0.5 and PPoC ≥ 0.75 was flagged as aberrant (i.e., potentially cheating) • Over replications, correctly identified groups entered the Power calculation; false-positive flags entered the Type I error rate
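
The bookkeeping over replications might look like the following sketch (array shapes and names are assumptions).

```python
# Sketch: power and Type I error across replications (shapes/names assumed).
import numpy as np

def power_and_type1(flags, truth):
    """flags: (reps x 60 x 3) boolean flag decisions; truth: (60 x 3) boolean
    matrix of simulated cheating cells, identical across replications."""
    truth = np.broadcast_to(truth, flags.shape)
    power = flags[truth].mean()     # flag rate among true cheating cells
    type1 = flags[~truth].mean()    # flag rate among non-cheating cells
    return power, type1
```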

  23. Results • MCMC: 2 chains, 30,000 iterations each, burn-in=25,000 • Very good convergence of solutions • Main effects for Time and Group were well recovered. • Detection power was very good at Times 2 & 3, quite low for Time 1 • Acceptable Type I error rate

  24. Flag Criteria: d ≥ .5, PPoC ≥ .75. Marginal Power = .59, Type I = .04

  25. Flag Criteria: d ≥ .5, PPoC ≥ .75. Marginal Power = .59, Type I = .04

  26. Flag Criteria: d ≥ .5, PPoC ≥ .75. Time 1 Power = .07, Type I = .04

  27. Flag Criteria: d ≥ .5, PPoC ≥ .75. Time 2 Power = .71, Type I = .04

  28. Flag Criteria: d ≥ .5, PPoC ≥ .75. Time 3 Power = 1.00, Type I = .05

  29. Discussion • Overall power is quite good, though very poor at Time 1 • The Type I error rate is acceptable • These are encouraging results; more simulations and replications are planned • More conditions with various effect sizes, sample sizes, non-linear trends, etc.

  30. How might this method be used in practice? • Flagged groups may be compared to the overall growth trajectory to infer the aberrance of their performance • Flagged groups must then be investigated further • Unusual performance could be caused by cheating, or it could indicate something exemplary! • Commend or condemn?

  31. Thanks! wps@ku.edu
