
What can we learn from the application of computer based assessment to the military?

Daniel O. Segall and Kathleen E. Moreno, Defense Manpower Data Center



  1. What can we learn from the application of computer based assessment to the military? Daniel O. Segall Kathleen E. Moreno Defense Manpower Data Center • Invited presentation at the conference on Computers and Their Impact on State Assessment: Recent History and Predictions for the Future, University of Maryland, October 18–19, 2010 • Views expressed here are those of the authors and not necessarily those of the DoD or U.S. Government.

  2. Presentation Outline • Provide some history of CBT research and operational use in the military • Discuss some lessons learned over the past three decades • Many of these lessons deal not only with computer-based testing generally but with computerized adaptive testing (CAT) specifically • End with some expectations about what lessons are yet to be learned.

  3. ASVAB History • Armed Services Vocational Aptitude Battery (ASVAB) • Before 1976, each military Service administered its own battery • Starting in 1976, a single ASVAB was administered to all military applicants • Used to qualify applicants for entry into the military and for select jobs within each Service • The ASVAB is a good predictor of training success.

  4. ASVAB Compromise • Early ASVAB (late 1970s) – Prone to compromise and coaching • On-demand scheduling at over 500 testing locations • Cheating was suspected by both applicants and recruiters • Congressional hearings were held on the topic of ASVAB compromise • One proposed solution: a computerized adaptive testing version of the ASVAB • Physical loss of CAT items is less likely than loss of P&P test booklets • Sharing information about the test items is less profitable for CAT than for P&P

  5. Initiation of CAT-ASVAB Research • Marine Corps Exploratory Development Project – 1977 • Research Questions • First, could a suitable adaptive-testing delivery system be developed? • Second, would empirical data confirm the anticipated benefits? • Findings • Data from recruits confirmed CAT’s increased measurement efficiency • Hardware suitability? • Minicomputers slow and expensive

  6. Joint Service CAT-ASVAB Project • Initiated in 1979 • Provide additional information about CAT • Anticipated benefits: • Test Compromise • Shorter tests • Greater precision • Flexible start/stop times • Online calibration • Standardized test administration (instructions/time-limits) • Reduced scoring errors (from hand or scanner scoring) • Possibility of administering new types of tests

  7. Early CAT-ASVAB Development • Early development (1979) was divided into two projects: • Contractor delivery-system development (hardware and software to administer CAT-ASVAB) • Commercially available hardware was inadequate for CAT-ASVAB • There was competition among vendors to develop suitable hardware • The competition was abandoned by the mid-1980s because by then commercially available computers were suitable for CAT-ASVAB • Psychometric development and evaluation of CAT-ASVAB

  8. ASVAB Miscalibration • A faulty equating of the first ASVAB in 1976 led to the enlistment of over 350,000 unqualified recruits over a five-year period. • As a result, a congressionally mandated oversight committee was commissioned: the Defense Advisory Committee on Military Personnel Testing. • A central focus of the committee and of military test development was to implement sound equating and score-scaling methodologies. • Random equivalent-groups equating methodology was implemented for the development of ASVAB forms and was used as the “gold standard” for all future ASVAB equatings. • This heightened sensitivity to score-scale and context effects formed the backdrop for the next three decades of computer-based test development: • Mode of administration • CAT-to-paper equating • Effects of different computer hardware on test scores

  9. Experimental CAT-ASVAB System • Developed 1979 – 1985 • The experimental CAT-ASVAB was used to study adaptive testing algorithms and test-development procedures • Full-battery CAT version of the P&P-ASVAB for experimental use • Development Efforts • Psychometric development • Item pool development • Delivery system development • The experimental system used Bayesian ability estimation, maximum-likelihood item selection, and a rudimentary exposure control algorithm

  10. Joint-Service Validity Study • Large-Scale Validity Study: 1982–1984 • Sample • Predictor and training-success data • N = 7,500 recruits training in one of 23 military jobs • Results showed that • CAT-ASVAB and P&P-ASVAB predict training success equally well • Equivalent validity could be obtained by CAT, which administered about 40 percent fewer items than its P&P counterpart • Strong support for the operational implementation of CAT-ASVAB.

  11. Operational CAT System Development • 1985 – Present • Addressed a number of Issues: • Item Pools • Exposure Control • Calibration Medium • Item Selection • Time Limits • Penalty for Incomplete Tests • Seeding Tryout Items • Hardware Requirements • Usability Considerations • Reliability and Construct Validity • Equating • Hardware Effects • Test Compromise • Score Scale • New Form Development • Internet Testing • Software/Hardware Maintenance Issues • Multi-Mode Testing Programs

  12. Item Pool Development (1980s) • CAT-ASVAB Forms 1 and 2: the first two operational forms • The P&P reference form (8A) formed the basis of the test specifications, but alterations were made • The adaptive pools spanned a wider range of item difficulties • Pretest items: about 3,600 items • Items were screened on the basis of small-sample IRT item-parameter estimates • The surviving 2,118 items were administered to a large applicant sample: N = 137,000 • Items were divided into two pools with about 100 items per subtest

  13. Item Pool Features • CAT item pools do not need to be extraordinarily large to obtain adequate precision and security • Exposure can be managed by a combination of exposure control imposed during item selection and the use of multiple test forms consisting of multiple (distinct) item pools • The use of multiple item pools (with examinees assigned at random to the pools) is an effective way to reduce item exposure rates and overlap among examinees.

  14. Exposure Control • Experimental CAT-ASVAB system – Some items had very high exposure rates • 5-4-3-2-1 strategy (Wetzel & McBride, 1985) • Guards against remembering response sequences • Does not guard against strategy sharing • Sympson and Hetter • Places an upper limit on the exposure rate of the most informative items and reduces the predictability of item presentation • Usage of items of moderate difficulty is reduced; little or no usage restriction for items of extreme difficulty or lesser discrimination • Only a small loss of precision when compared to optimal, unrestricted item selection
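
A minimal Python sketch of the Sympson-Hetter administration step, assuming exposure-control parameters k (one value per item, estimated beforehand through simulation); the function name and data layout here are illustrative, not the operational CAT-ASVAB code:

```python
import random

def sympson_hetter_pick(eligible, info, k):
    """Try items in descending order of information; administer an item
    only if it passes its probabilistic exposure screen k[item]."""
    for item in sorted(eligible, key=lambda i: info[i], reverse=True):
        if random.random() <= k[item]:  # exposure control screen
            return item
    # If every candidate fails its screen, fall back to the most informative.
    return max(eligible, key=lambda i: info[i])
```

Frequently selected items of moderate difficulty end up with k values well below 1, which is what caps their usage, while rarely selected extreme items keep k near 1.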

  15. Calibration Medium • Calibration Medium Concern • Could data collected from paper-and-pencil booklets be used to calibrate items that would eventually be administered in a computerized adaptive testing format? • Because CAT was not yet implemented, calibration of CAT items on computers was not feasible • Some favorable results existed from other adaptive tests that had relied on P&P calibrations • A systematic treatment of this issue was conducted for the development of the operational CAT-ASVAB forms using data collected from 3,000 recruits • Finding: calibration medium has no practical impact on the distributions or precision of adaptive test scores

  16. Calibration Medium • Reading speed is a primary cause of medium effects • Viewing/reading questions on computer is generally slower than viewing/reading the same questions in a printed paper-based format • To the degree that tests are speeded (time-pressured), medium is likely to have a larger impact • To the degree that tests are speeded, greater within-medium effects can also occur • ASVAB approach: for power tests, reduce the time pressure by extending the time limits • Reducing time pressure for ASVAB power tests did not alter the construct measured – verified by a cross-correlation-check study

  17. Item Selection Rules • Based on maximum item information (contingent upon passing an exposure control screen) • Some consideration given to content balancing, but a primary emphasis was given to measurement precision • More recently, provisions have been made for item enemies • Maximizing precision was – and remains – a primary emphasis of the CAT-ASVAB item selection algorithm
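
The selection rule above can be sketched in Python, assuming a 3PL item pool stored as (a, b, c) parameter tuples; the enemy-item screen is represented by a simple exclusion set, and this is a sketch of the general technique rather than the operational algorithm:

```python
import numpy as np

def fisher_info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (D = 1.7)."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def next_item(theta_hat, pool, used, enemies):
    """Maximum-information selection over unadministered, non-enemy items."""
    eligible = [i for i in range(len(pool))
                if i not in used and i not in enemies]
    return max(eligible, key=lambda i: fisher_info_3pl(theta_hat, *pool[i]))
```

In operation this step would sit behind the exposure-control screen sketched earlier.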

  18. Time Limits • CAT-ASVAB Time Limits • Administrative requirements • Separate time limit for each adaptive power test • IRT Model • The standard IRT model does not capture the effects of time pressure on item responding • Alternate Approaches for Specifying Time Limits • Use the per-item time allowed on the P&P-ASVAB • Use the distribution of completion times from an experimental version (which was untimed) and set the limits so that 95% of the group would finish
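
The second approach amounts to taking the 95th percentile of the untimed completion-time distribution. A toy illustration with simulated times (real data would come from the untimed experimental administration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for untimed completion times (minutes) on one power subtest.
completion_times = rng.lognormal(mean=2.0, sigma=0.35, size=5000)

# Set the limit so that roughly 95% of examinees would finish in time.
time_limit = np.percentile(completion_times, 95)
print(f"Suggested time limit: {time_limit:.1f} minutes")
```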

  19. Specifying Time Limits • Untimed Pilot Study • Supported the use of the longer limits • For reasoning tests, high-ability examinees took more time than low-ability examinees • High-ability examinees would be most affected by shortened time limits, since they received more difficult questions, which required more time to answer • This is the opposite of the relation between ability and test time observed in most traditional P&P tests • In linear testing, low-ability examinees generally take longer than high-ability examinees

  20. Penalty for Incomplete Tests • Penalty Procedure Required for Incomplete Tests • Due to the implementation of time limits • Bayesian Estimates • Biased in the direction of the population mean • Bias is stronger for shorter tests • Potential Compromise Strategy • Below-average applicants could answer only the minimum number of items, letting the bias toward the mean inflate their scores • Penalty Procedure • Used to score incomplete adaptive tests • Discourages the potential compromise strategy • Provides a final ability estimate equivalent to the expected score obtained by guessing at random on the unanswered items

  21. Penalty for Incomplete Tests • Simulations • Used to determine penalty functions (for each subtest and possible test length) • Penalty Procedure Features • The size of the penalty is correlated with the number of unfinished items • Applicants who have answered the same number of items and have the same provisional ability estimate receive the same penalty • With this approach, test-takers should be indifferent between guessing and leaving answers blank when time has nearly expired • Generous Time Limits Implemented • Permit over 98 percent of test-takers to complete • Avoids disproportionately punishing high-ability test-takers
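
One way to realize the penalty described above is by Monte Carlo: rescore the incomplete record many times with random guesses filled in for the unanswered items and average the results. This is a sketch of the idea, not the operational procedure (which tabulated penalty functions from simulations per subtest and test length); rescore is a hypothetical callback that maps a complete response vector to an ability estimate:

```python
import numpy as np

def penalized_score(answered, unanswered_options, rescore, n_sims=2000, seed=0):
    """Expected ability estimate had the examinee guessed at random on the
    unanswered items (an item with m options is correct with prob. 1/m)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_sims):
        guesses = [rng.random() < 1.0 / m for m in unanswered_options]
        estimates.append(rescore(list(answered) + guesses))
    return float(np.mean(estimates))
```

Because the expected value of random guessing is the floor of what answering could achieve, leaving items blank confers no advantage.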

  22. Seeding Tryout Items • CAT-ASVAB administers unscored tryout items • Tryout items are administered as the 2nd, 3rd, or 4th item in the adaptive sequence • Item position • Randomly determined • Advantages over historical ASVAB tryout methods • Tryout data come from operationally motivated examinees • No booklet printing required • No special data-collection study required
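
The seeding rule itself is simple; a Python sketch with an illustrative helper (treating the sequence as a list for clarity, although in CAT the operational items are actually chosen on the fly):

```python
import random

def seed_tryout(operational_sequence, tryout_item):
    """Insert one unscored tryout item at a randomly chosen position
    (2nd, 3rd, or 4th) in the adaptive sequence."""
    seq = list(operational_sequence)
    seq.insert(random.randint(1, 3), tryout_item)  # 0-based index 1..3
    return seq
```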

  23. Hardware Requirements • Customized Hardware Platform – 1984 • Abandoned in favor of an off-the-shelf system • The Hewlett-Packard (HP) Integral computer • Selected as the first operational system • Superior portability (17 pounds) • Large random-access memory (1.5 megabytes) • Fast CPU (8 MHz Motorola 68000) • Advanced graphics capability (9-inch electroluminescent display with a resolution of 512 by 255 pixels) • UNIX-based operating system • Supported the C programming language • Floppy diskette drive (no internal hard drive) • Cost about $5,000 (in 1984 dollars) • Lesson: today’s computers can easily handle the item selection and scoring calculations required by CAT • Challenge: even though today’s computers are thousands of times more powerful, they are not proportionately cheaper than the computers of yesteryear

  24. User Acceptance Testing • Importance of User Acceptance Testing • Software development is obviously important • User Acceptance Testing is equally important • Acceptance Testing versus Software Testing • Software Testing – Typically performed by software developers • Acceptance Testing – Typically performed by those who are most familiar with the system requirements • CAT-ASVAB Development • Time spent on acceptance testing exceeded that spent by programmers developing and debugging code

  25. Usability • Computer Usage: 1975 – 1985 • Limited primarily to those with specialized interests • Concerns • Deficient computer experience would lower CAT-ASVAB reliability and validity • Although instructions had been tested on recruits, they had not been tested with applicants, many of whom scored in the lower ability ranges • In addition, instructions had been revised extensively from the experimental system • Approach • Test instructions on a broad, representative group of test-takers who had no prior exposure to the ASVAB

  26. Usability • Usability Study (1986) • 231 military applicants and 73 high school students • Issues Addressed • Computer familiarity, instruction clarity, and attitudes toward CAT-ASVAB • Method of Data Collection • Questionnaire and structured interviews • Findings • Test-takers felt very comfortable using the computer, exhibited positive attitudes toward CAT-ASVAB, and preferred a computerized test over P&P – regardless of their level of computer experience • Test-takers strongly agreed that the instructions were easy to understand • Negative outcome: most test-takers wanted the ability to review and modify previously answered questions • Because of the requirements of the adaptive testing algorithm, this feature was not implemented • Lesson: today, with a well-designed interface, variation in computer familiarity among (young adult) test-takers should not be an impediment to computer-based testing

  27. Usability Lessons • Stay in tune with the computer proficiency of the test-takers • Tailor instructions accordingly • Do not give verbal instructions • Keep all instructions on the computer • Keep user interface simple and intuitive

  28. Reliability and Construct Validity • CAT Reliability and Validity • Contents and quality of the item pool • Item selection, scoring, and exposure algorithms • Clarity of test instructions • Item Response Theory • Provides a basis for making theoretical predictions about these psychometric properties • However, most assumptions are violated, at least to some degree • Empirical Test of Assumptions • To test the validity of key model-based assumptions, an empirical verification of CAT-ASVAB’s precision and construct equivalence with the P&P-ASVAB was conducted • If assumptions held true, then large amounts of predictive validity evidence accumulated on the P&P version would apply directly to CAT-ASVAB • Construct equivalence would also support the exchangeability of CAT-ASVAB and P&P-ASVAB versions

  29. Reliability and Construct Validity • Study Design • Two Random Equivalent Groups • Group 1 (N = 1,033) received two P&P-ASVAB forms • Group 2 (N = 1,057) received two CAT-ASVAB forms • All participants also received an operational P&P-ASVAB • Analyses • Alternate-forms correlations were used to estimate reliabilities • Construct equivalence was evaluated from disattenuated correlations between the CAT-ASVAB and operational P&P-ASVAB versions
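
For reference, the disattenuated (corrected-for-attenuation) correlation used in these analyses divides the observed correlation between the two versions by the square root of the product of their alternate-forms reliabilities:

```latex
\hat{\rho}_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}
```

A value near 1.0 indicates that, apart from measurement error, the two versions rank examinees identically, i.e., measure the same construct.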

  30. Reliability and Construct Validity • Results – Reliability • Seven of the ten CAT-ASVAB tests displayed significantly higher reliability coefficients than their P&P-ASVAB counterparts • The three other subtests displayed non-significant differences • Results – Construct Validity • All but one disattenuated correlation between CAT-ASVAB and P&P-ASVAB was equal to 1.0 • Coding Speed displayed a disattenuated correlation substantially less than one (.86) • However, composites containing this subtest had disattenuated correlations approaching 1.0 • Discussion • Results confirmed the expectations based on theoretical IRT predictions • CAT-ASVAB measured the same constructs as P&P-ASVAB with equivalent or greater precision

  31. Equating CAT and P&P Versions • 1980s – Equating was viewed as a major psychometric hurdle to CAT-ASVAB implementation • Scale Differences between CAT-ASVAB and P&P-ASVAB • P&P-ASVAB used a number-correct score scale • CAT-ASVAB produces scores on the natural (IRT) ability metric • Equating must be done to place CAT-ASVAB scores on the P&P-ASVAB scale • Equating Objective • Transform CAT-ASVAB scores so that their distributions match the P&P-ASVAB score distributions • The transformation would allow scores on the two versions to be used interchangeably, without affecting applicant qualification rates

  32. Equating Concerns • Overall Qualification Rates • Equipercentile equating procedure used to obtain the required transformations • Distribution smoothing procedures • Equivalence of composite distributions verified • Distributions of composites were sufficiently similar across P&P-ASVAB and CAT-ASVAB
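
A minimal, unsmoothed Python sketch of the equipercentile mapping (the operational work smoothed both distributions first); the score arrays and layout are illustrative:

```python
import numpy as np

def equipercentile_transform(cat_scores, pp_scores):
    """Map each CAT score to the P&P score with the same percentile rank."""
    cat_scores = np.asarray(cat_scores, dtype=float)
    # Percentile rank of each CAT score within the CAT distribution
    ranks = (np.searchsorted(np.sort(cat_scores), cat_scores, side="right")
             / cat_scores.size)
    # Corresponding quantile of the reference P&P distribution
    return np.quantile(np.sort(pp_scores), np.clip(ranks, 0.0, 1.0))
```

After this transformation the CAT score distribution matches the P&P distribution, so cut scores defined on the P&P scale apply unchanged.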

  33. Equating Concerns • Subgroup Differences • Concern that subgroup members not be placed at a disadvantage by CAT-ASVAB relative to P&P-ASVAB • Existing subgroup differences might be magnified by precision and dimensionality differences between CAT and P&P versions • Approach • Apply the equating transformation (based on the entire group) to subgroup members taking CAT-ASVAB • Compare subgroup means across CAT and P&P versions • Results • No practical significance for qualification rates was found

  34. Online Calibration and Equating • All data can be collected seamlessly in an operational environment to both calibrate and equate new CAT-ASVAB forms • This is in contrast to earlier form development, which required special data collections using special populations

  35. Hardware Effects Study • Hardware Effects Concern (1990s) • Differences among computer hardware could influence item functioning • Speeded tests are especially sensitive to small changes in test presentation format • Dependent Measures • Score scale • Precision • Construct validity • Sample • Data were gathered from 3,062 subjects • Each subject was randomly assigned to one of 13 conditions • Hardware Dimensions • Input device • Color scheme • Monitor type • CPU speed • Portability • Results • Adaptive power tests were robust to differences among computer hardware • Speeded tests are likely to be affected by several hardware characteristics.

  36. Stakes by Medium Interaction • Equating Study • Equating of Desktop and Notebook computers • Two Phases • Recruits – Develop a provisional transformation; random groups design with about 2,500 respondents per form • Applicants – Develop a final transformation from applicants to provide operational scores. Sample size for this second phase was about 10,000 per form

  37. Stakes by Medium Interaction • Comparison of equatings based on recruits (nonoperational motivation) and applicants (operational motivation) • Differences in the CAT–P&P equating transformations were observed • The difference was in a direction suggesting that in the first (nonoperational) equating, CAT examinees were more motivated than P&P examinees (possibly due to shorter test lengths or the novel/interactive medium) • It was hypothesized that motivation/fatigue differed between the CAT and P&P groups in the nonoperational recruit sample but not in the operational applicant sample • Findings suggested that the results of a cross-medium equating may differ depending on whether the respondents are operationally motivated • For future equatings, this problem was avoided by: • Performing equatings in operationally motivated samples, or • Performing within-medium equatings when test-takers were nonoperationally motivated (using a chained transformation if necessary to link back to the desired cross-medium scale)

  38. Test Compromise Concerns • Sympson-Hetter algorithm assumes a particular known ability distribution • Usage rates might be higher for some items if the actual ability distribution departs from the assumed distribution • Since CAT tests tend to be shorter than P&P tests, each adaptively administered item might have a greater impact on final score • So preview of a fixed number of CAT items may result in a larger score gain than preview of the same number of P&P items

  39. Test Compromise Simulations • Simulation Study – Conditions • Transmittal mechanism (sharing among friends or item banking) • Correlation between the cheater and informant ability levels • Method used by the informant to select items for disclosure • Dependent Measure • Score gain (mean gain for a group of cheaters over a group of non-cheaters at the same fixed ability level) • Results • Score gains for CAT were larger than those for the corresponding P&P conditions • Implications • More stringent exposure controls should be imposed on CAT-ASVAB items • The introduction of a third item pool (with examinees randomly assigned to one of three pools) reduced score gains for CAT to levels equivalent to or less than those observed for six forms of the P&P-ASVAB under all compromise strategies • These results led to the decision to implement an additional CAT-ASVAB form
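
The dependent measure can be sketched as follows, where simulate_cat is a hypothetical simulator that returns a final score for a given true ability and, optionally, a set of previewed items it treats as answered correctly:

```python
import numpy as np

def mean_score_gain(simulate_cat, theta, previewed, n_reps=2000, seed=1):
    """Mean final-score gain of cheaters (who know the `previewed` items)
    over non-cheaters at the same fixed true ability."""
    rng = np.random.default_rng(seed)
    honest = [simulate_cat(theta, known=frozenset(), rng=rng)
              for _ in range(n_reps)]
    cheat = [simulate_cat(theta, known=previewed, rng=rng)
             for _ in range(n_reps)]
    return float(np.mean(cheat) - np.mean(honest))
```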

  40. Score Scale • For testing programs that run two parallel modes of administration (i.e., paper-based and CAT), equating and measurement precision can be enhanced by scoring all tests (including the paper-based test) by IRT methods • IRT scoring of the paper-based tests provides distributions of test scores that more closely match their CAT counterparts (i.e., helps make them more normal) • IRT scoring also reduces the ceiling and floor effects of paper-based number-right distributions, which can attenuate the precision of equated CAT-ASVAB scores • An underlying theta (natural ability) scale can facilitate equating and new-form development
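
As one concrete realization, a fixed paper form can be scored by EAP (expected a posteriori) estimation under the same 3PL model used for CAT; this is a sketch with a standard normal prior on a quadrature grid, not the operational scoring code:

```python
import numpy as np

def eap_score(responses, params):
    """EAP ability estimate for a fixed (paper) form under a 3PL model.
    responses: 0/1 vector; params: list of (a, b, c) tuples per item."""
    grid = np.linspace(-4.0, 4.0, 81)       # quadrature points
    posterior = np.exp(-0.5 * grid**2)      # standard normal prior (unnormalized)
    for u, (a, b, c) in zip(responses, params):
        p = c + (1 - c) / (1 + np.exp(-1.7 * a * (grid - b)))
        posterior *= p if u else (1.0 - p)  # multiply in the item likelihood
    return float(np.sum(grid * posterior) / np.sum(posterior))
```

Scoring both modes on the same theta metric yields score distributions of similar shape and simplifies the equating described earlier.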

  41. New Form Development • The implementation of CAT-ASVAB on a large-scale has enabled considerable streamlining of new form development • DoD has eliminated all special form-development data-collection studies by replacing them with online calibration and equating • According to this approach, new item data is collected by seeding tryout items among operational items • These data are used to estimate IRT item parameters • These parameters are in turn used to construct future forms, and to estimate provisional equating transformations • These provisional (theoretical) equatings are then updated after they are used operationally to test random equivalent groups • Thus, the entire cycle of form development is seamlessly integrated into operational test administrations

  42. Internet Testing • DoD Internet Testing • Defense Language Proficiency Tests • Defense Language Aptitude Battery • Armed Forces Qualification Test • CAT-ASVAB • Implications for Cost-Benefits • Implications for Software Development • Desktop Lockdown • Client Side: Unanticipated effects on test delivery from browser, operating system, and security updates • Server Side: • Unanticipated effects of operating system updates on test delivery • Interactions with other applications running on the same server

  43. Internet Testing • Internet testing can defray much of the cost of computer-based testing, since the cost of computers and their maintenance is shared or eliminated. • Strict standardization of the administration format (line breaks, resolution, etc.) is difficult (and sometimes impossible) to enforce in Internet testing. • Lesson: with Internet testing you do not have to pay for the computers’ purchase and maintenance, but you do pay a price in lost control over the system.

  44. Software/Hardware Maintenance Issues • Generations of CAT-ASVAB Hardware/Software • Apple III • HP • DOS • Windows I • Windows II • Internet • Early generations of hardware/software could be treated as static entities, much like test booklets • Later Windows and Internet generations require treatment more like living entities – they require continuous care and attention (security, operating system, and software updates)

  45. Multi-Mode Testing Programs • When transitioning from paper-based to computer-based testing, decide ahead of time whether the two mediums of administration will run in parallel for an extended period, or whether paper-based testing will be phased out after a fixed period of time • If the latter, make sure this is communicated and that there is strong policy support for the elimination of all paper-based testing • There are different resourcing requirements and cost drivers for dual- and single-mode testing programs

  46. Future Lessons? • Intranet versus Internet-based testing • Computer hardware effects on test scores • How to test speeded abilities on unstandardized hardware? • Can emerging technologies (such as mobile computing devices) provide additional or different benefits (above and beyond computers) for large-scale assessments?
