The Influence of Item Calibration Error on Variable-Length Computerized Adaptive Testing
Ying Cheng
2/5/2012
Outline
• Introduction
• Review of prior research on the effects of item calibration error on related measurement procedures
• Purpose
• Method
• Results
• Conclusions
Introduction
• Variable-length computerized adaptive testing (VL-CAT): adaptive in terms of both item selection and test length.
• Any CAT program requires a bank of previously calibrated item parameters, and these are often treated as the true values.
• However, only estimates of the item parameters are available, and because adaptive item selection involves optimization, capitalization on chance may occur.
• van der Linden and Glas (2000) examined the effects of capitalization on item calibration error in fixed-length CAT.
Calibration error: the magnitude of sampling variability in item parameter estimates as determined by the size of the calibration sample for a given method of calibration and distribution of the latent trait.
Termination Criteria in VL-CAT
• Conditional standard error (CSE): a test ends when the standard error of the latent trait estimate falls below a predetermined threshold.
  • Achieves roughly uniform measurement precision across the range of ability.
  • Test length depends largely on the examinee’s latent trait level.
  • Examinees with extreme true θ values will tend to have long tests.
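The CSE rule above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes a 2PLM without a scaling constant, represents each administered item as a hypothetical (a, b) tuple, and borrows the default threshold of 0.316 from the Method slide.

```python
import math

def p_2pl(theta, a, b):
    """2PLM probability of a correct response (no scaling constant; an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a single 2PLM item at theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def standard_error(theta_hat, administered):
    """Asymptotic SE of the ability estimate: 1 / sqrt(test information)."""
    info = sum(item_information(theta_hat, a, b) for a, b in administered)
    return float("inf") if info == 0 else 1.0 / math.sqrt(info)

def cse_should_stop(theta_hat, administered, threshold=0.316):
    """CSE rule: stop once the SE falls below the threshold."""
    return standard_error(theta_hat, administered) < threshold
```

With identical items (a = 1, b = 0) and an estimate of 0, each item contributes 0.25 units of information, so the rule needs 41 items before the SE drops below 0.316. Less informative items, as are typical at extreme θ values, push that count higher.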
• Ability confidence interval (ACI): a test stops when the confidence interval (e.g., 95%) for θ falls entirely above or below the cut point.
  • Test length depends on the location of true ability relative to the cut point.
  • Examinees with true θ values near the cut will tend to have very long tests.
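A minimal sketch of the ACI rule follows. The function name and return values are hypothetical; z = 1.96 corresponds to a two-sided 95% interval, and the default minimum of 15 items is taken from the Figure 3c & 3d slide.

```python
def aci_decision(theta_hat, se, cut, n_items, z=1.96, min_items=15):
    """Return 'pass', 'fail', or None (continue testing) under the ACI rule."""
    if n_items < min_items:
        return None  # never stop before the minimum test length
    lower = theta_hat - z * se
    upper = theta_hat + z * se
    if lower > cut:
        return "pass"  # whole interval lies above the cut
    if upper < cut:
        return "fail"  # whole interval lies below the cut
    return None  # interval still covers the cut, so keep testing
```

For example, with a cut of 0.5, an estimate of 0.6 with SE 0.3 gives an interval that still contains the cut, so testing continues. This is why examinees near the cut receive long tests.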
Calibration Error and Latent Trait Estimation
• In item response theory (IRT), the true item parameters are assumed to be known, so the SE of the latent trait estimate reflects only measurement error.
• In practice, only estimates of the item parameters can be obtained, so the SE is underestimated when this additional source of error is ignored.
• Cheng and Yuan (2010) proposed an “upward correction” to the asymptotic SE of the maximum likelihood ability estimate.
• The corrected SE (SE*) will be larger than the SE based on test information alone.
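For reference, the SE based on test information alone can be written out as follows for the 2PLM. This is the standard IRT expression, not the Cheng and Yuan correction itself; the notation (a_i, b_i, no scaling constant) is an assumption, since the slides do not give formulas.

```latex
P_i(\theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}, \qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr), \qquad
\mathrm{SE}(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i} I_i(\hat{\theta})}}
```

Because the information grows with the square of a_i, an overestimated a_i inflates the information and shrinks the reported SE.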
Capitalization on Calibration Error via Item Selection
• Items with large a values are generally preferred in two- or three-parameter logistic models (2PLM or 3PLM) when maximum item information is used for item selection.
• Calibration sample size: the smaller the sample, the larger the calibration error, and the larger the effect of capitalization on calibration error.
• The ratio of item bank size to test length: the larger the ratio, the greater the likelihood of selecting only items with large estimation error.
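The mechanism can be illustrated with a toy simulation. This sketch is not the study's design: it selects items once at a single θ rather than running a full CAT, uses the 2PLM, and replaces real calibration with Gaussian noise of standard deviation noise_sd added to the true parameters (a crude, hypothetical stand-in for a smaller calibration sample). All names and default values are illustrative.

```python
import math
import random

def info_2pl(theta, a, b):
    """Fisher information of a 2PLM item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def relative_efficiency(theta, bank_size=500, test_length=20, noise_sd=0.3, seed=1):
    """Ratio of estimated to true information for the items picked by
    maximum *estimated* information at theta."""
    rng = random.Random(seed)
    # Hypothetical true item bank: a ~ U(0.5, 2.0), b ~ N(0, 1).
    true_items = [(rng.uniform(0.5, 2.0), rng.gauss(0.0, 1.0))
                  for _ in range(bank_size)]
    # "Calibrated" bank: true values plus noise; a is kept positive.
    est_items = [(max(0.05, a + rng.gauss(0.0, noise_sd)),
                  b + rng.gauss(0.0, noise_sd))
                 for a, b in true_items]
    # Rank items by the information their estimated parameters promise.
    ranked = sorted(range(bank_size),
                    key=lambda i: info_2pl(theta, *est_items[i]),
                    reverse=True)
    chosen = ranked[:test_length]
    est_info = sum(info_2pl(theta, *est_items[i]) for i in chosen)
    true_info = sum(info_2pl(theta, *true_items[i]) for i in chosen)
    return est_info / true_info
```

With noise_sd=0.0 the estimated and true banks coincide and the ratio is exactly 1. With noise, the selected items tend to be those whose errors happened to inflate their apparent information, so the ratio rises above 1. Shrinking bank_size toward test_length leaves less room for this selection effect.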
Purpose of Study
• Manipulate the magnitude of calibration error via the calibration sample size, and examine the effects on average test length and classification accuracy in several realistic VL-CAT scenarios.
Method
• Independent variables
  • IRT model: 2PLM or 3PLM
  • Termination criterion: CSE (threshold of 0.316) or ACI (95%)
  • Calibration sample size: N = ∞, 2500, 1000, or 500
• Dependent variables
  • Average test length
  • Empirical bias of the latent trait estimate
  • Percentage of correctly classified examinees at each true value of θ
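Two of the dependent variables can be computed as below for the simulated examinees at one true θ. This is a sketch under assumptions: the function names are hypothetical, and treating an examinee exactly at the cut as a true "pass" is a convention chosen here, not one stated in the slides.

```python
def empirical_bias(theta_true, theta_hats):
    """Mean ability estimate minus the true ability, over replications."""
    return sum(theta_hats) / len(theta_hats) - theta_true

def percent_correct(theta_true, decisions, cut):
    """Percentage of 'pass'/'fail' decisions that match the true status.
    Assumption: theta_true >= cut counts as a true pass."""
    truth = "pass" if theta_true >= cut else "fail"
    hits = sum(1 for d in decisions if d == truth)
    return 100.0 * hits / len(decisions)
```

Repeating both calculations over a grid of true θ values would give conditional bias and conditional classification accuracy of the kind the results slides summarize.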
Results
• Relative Test Efficiency and Test Length
  • Whether the maximum information criterion capitalized on calibration error (ACI): Figure 2
  • Implications of capitalization on chance for test length: Figure 3
• Ability Recovery and Classification Accuracy
  • Conditional bias: Figure 4
  • The effect of calibration error on classification accuracy: Tables 3 & 4
Discussion
• CSE Termination Rule
  • Test length was sensitive to the magnitude of item calibration error.
  • The maximum likelihood ability estimator may exhibit non-negligible bias in the presence of calibration error.
  • Classification accuracy tended to suffer for small calibration samples. However, because the bias in the latent trait estimate was not large in the vicinity of the cut, the reduction in classification accuracy was small (no more than 5%).
• ACI Termination Rule
  • Test length was clearly robust to the magnitude of calibration error.
  • The pattern and magnitude of bias were similar for all values of N, so there was no strong or systematic effect of N on classification accuracy.
  • Because the ACI rule is sensitive to the cut location, we suspect that the robustness of bias and classification accuracy to the magnitude of calibration error may hold even for more extreme cut locations.
Limitations of the Current Study
• Whether alternative item selection criteria would also be sensitive to capitalization on chance.
• Whether alternative stopping rules would also be sensitive to capitalization on chance.
• Imposing non-statistical constraints on item selection (e.g., exposure control and content balancing).
• Using the upward-corrected SE; this method is not currently feasible in adaptive testing scenarios.
Figure 2
• Regardless of IRT model, the true and “estimated” item parameters are identical in the N = ∞ conditions, so relative efficiency is equal to one for all values of θ.
• As N decreases, relative efficiency steadily increases: overestimation of item information becomes more severe.
• The problem of capitalization on chance is greater for smaller calibration samples and for the more complex model.
Figure 3a & 3b
• Tests tend to be spuriously short for small values of N, regardless of IRT model.
• The effect of N on test length is relatively uniform for the 2PLM conditions, whereas the effect varies quite a bit for the 3PLM conditions.
Figure 3c & 3d
• Tests are quite long near the cut, whereas only the minimum of 15 items is required farther from the cut.
• There is only a negligible effect of N on the average test length, save for a small region near the cut in the 3PLM conditions.
Figure 4a & 4b
• As N decreases, a systematic relationship emerges between bias and ability. In particular, the magnitude of bias is greatest at the extremes.
Figure 4c & 4d
• Bias is negative below the cut (θ = 0.5) and positive above it, and this trend is most apparent in the region near the cut (i.e., 0 < θ < 1).
Table 3
• In general, classification accuracy decreases as N decreases, but this is not always the case.
Table 4
• There is no consistent relationship between N and classification accuracy.
• The relationship between bias and true ability depends on the location of the cut.