Automated Scoring is a Policy and Psychometric Decision

Christina Schneider
The National Center for the Improvement of Educational Assessment
cschneider@nciea.org
803-530-8136

Automated Scoring is Complex
• For the same models to work in different administrations:
  • the general ability levels of examinees must be constant,
  • the features of submissions must be constant, and
  • the human rating standards must be constant (Trapani, Bridgeman, and Breyer, 2011).
• Which population should be used to train engines: field test or operational (OP)?
• Should models that work for a consortium average be applied to individual states? To subgroups within states?

Understanding system flags is essential to understanding a system
• Always ask how and where in the process a system flags responses for quality control purposes and for scoring purposes (gaming attempts or unscorable responses).
• Flags come from examining different combinations of features for outliers:
  • Sophisticated words, good organization and content, but many grammar and spelling errors could indicate a child who is dyslexic. The response is routed to a human. Is the system set up for administrators to pre-flag students for human scoring?
  • Sophisticated words, good organization, overly long development, and grammar and other issues could be a gaming attempt. The response is routed to a human.
  • Redundant word choice could be a gaming attempt. The response is routed to a human.
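The flagging patterns above can be sketched as simple routing rules. This is only an illustrative sketch: the feature names, the 0-to-1 scaling, and every threshold are assumptions made for the example, not the rules of any actual scoring engine.

```python
from dataclasses import dataclass


@dataclass
class ResponseFeatures:
    # Hypothetical features; real engines define and scale their own.
    vocabulary: float        # word sophistication, 0..1
    organization: float      # organization and content quality, 0..1
    mechanics_errors: float  # grammar and spelling error rate, 0..1
    length_ratio: float      # length relative to a typical response to the prompt
    repetition: float        # share of redundant word choices, 0..1


def route(features: ResponseFeatures, pre_flagged: bool = False):
    """Return (destination, reason) using illustrative thresholds."""
    if pre_flagged:
        # An administrator asked for human scoring in advance.
        return "human", "pre-flagged by administrator"
    strong = features.vocabulary >= 0.7 and features.organization >= 0.7
    if strong and features.length_ratio >= 2.0:
        # Strong surface features plus overly long development.
        return "human", "possible gaming: overly long development"
    if strong and features.mechanics_errors >= 0.5:
        # Unusual profile: strong content but weak mechanics.
        return "human", "strong content with many mechanics errors"
    if features.repetition >= 0.6:
        return "human", "possible gaming: redundant word choice"
    return "engine", "no flag raised"
```

In this sketch every flag routes the response to a human, which matches the slide; the reason string is kept so that flag counts can later be reported by type.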

Formative Systems
• Flagging rules for gaming are a quality control method applied after model building has occurred. There is tension between policy and psychometric needs on the formative side: teachers are not always happy to score many papers by hand, yet students will often use a formative system to practice gaming.
• For young students, spelling errors can trigger large numbers of flags.
• Teacher training on how to use the system wisely is essential.

Hybrid: Either Engine or Human
• Understanding the conditions under which scoring a paper with an automated system is not optimal is important for establishing and providing validity evidence for the scoring process, but this is not a component of many technical reports.
• Investigate which students are flagged (i.e., their demographic characteristics) and how the demographic percentages of flagged students compare to those of the population.
• The comparability focus is the accuracy of the score, not whether a human or an engine is scoring. This requires good communication with stakeholders.
• This is a good study for states interested in moving to all automated scoring sometime in the future. The best scoring was obtained when 20% of responses were routed to humans.
• It may be best to plan for up to 20% human scoring when automated scoring (AS) is planned as Reader 1.
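The demographic comparison suggested above could be tabulated along these lines. This is a minimal sketch: the function name, the input format, and the group labels are hypothetical, and a real study would use the program's own student records.

```python
from collections import Counter


def flag_share_by_group(students):
    """Compare each group's share of the population with its share of flags.

    students: iterable of (group, flagged) pairs, where flagged is a bool.
    Returns {group: (share_of_population, share_of_flagged)}.
    """
    students = list(students)
    population = Counter(group for group, _ in students)
    flagged = Counter(group for group, is_flagged in students if is_flagged)
    n_population = len(students)
    n_flagged = sum(flagged.values())
    return {
        group: (
            population[group] / n_population,
            flagged[group] / n_flagged if n_flagged else 0.0,
        )
        for group in population
    }
```

A group whose share of flagged responses is well above its share of the population is a candidate for closer review of the flagging rules.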

Using a Weighted Approach
• A weighted hybrid approach uses both the automated score and the human score, with one counting more than the other depending on the task.
• This is a promising approach for improving the accuracy of automated scoring for writing, based on the writing genre.
• Results were for an ELL population. The approach needs to be studied with Common Core items.
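A weighted hybrid score can be illustrated as a weighted average of the two scores. The sketch below is an assumption about the general form only: the function name and the example weights are hypothetical and are not taken from the study behind this slide.

```python
def weighted_hybrid_score(human_score, engine_score, human_weight):
    """Combine a human score and an automated score.

    human_weight is the fraction of the final score that comes from the
    human rating; the engine receives the remaining 1 - human_weight.
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human_score + (1.0 - human_weight) * engine_score


# Hypothetical genre-specific weights, for illustration only.
HUMAN_WEIGHT_BY_GENRE = {"narrative": 0.75, "informative": 0.5}
```

For example, with the hypothetical narrative weight of 0.75, a human score of 4 and an engine score of 2 combine to 3.5.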

Hybrid Scoring
• Hybrid scoring is likely the best practice of the future.
• Cost savings come from using engines.
• Hybrid scoring includes humans reading too.
• Begin with a program of research that can improve human scoring as well as automated scoring.
• Look at work on human scorer interaction with rubrics (Leacock, 2013 & 2014) and how this influences engines.
• Early indications show the hybrid approach improved reliability above that of two humans (Kieftenbeld & Barrett, 2014) for AS-scorable prompts.
• Not all prompts can be AS scored.
• Need to investigate why: engine functioning is often related to the size of the training and validation set.
