This presentation discusses the comparison between PISA and ITC guidelines for translating and adapting achievement tests. It explores a test translation review form and sample studies to detect problematic items. The myths and misconceptions surrounding test adaptation are also addressed.
Translating/Adapting Achievement Tests: PISA Guidelines, ITC Guidelines, or a Mixture? Ronald K. Hambleton and Seongeun Hong, Center for Educational Assessment, University of Massachusetts Amherst, USA. PISA Conference, Paris, 2018
Background • Interest in test translations and adaptations has increased tremendously in the past 30 years: --Several IQ and personality tests have now been adapted into more than 100 languages. --Achievement tests for large-scale international assessments (e.g., PISA, TIMSS) are administered in over 40 languages.
Background --International use of credentialing exams is expanding (e.g., see Microsoft, HP). --Many high school graduation/college admissions tests are offered in multiple languages (e.g., in Israel, South Africa, the USA). --Health scientists' "Quality of Life" measures are in wide use in many languages and cultures. --Marketing researchers are doing more of this work as well.
Purposes of the Presentation • Sharing a comparison we did between the PISA and ITC guidelines. • Consideration of a test translation review form building on validated questions. • A review of small sample studies to detect problematic items.
Example 1 (IEA Study in Reading) Are these words similar in meaning? Pessimistic -- Sanguine
Pessimistic -- Sanguine Adapted to Pessimistic -- Optimistic
Example 2 (1995 TIMSS Pilot) Alex read his book for 1 hour and then used a bookmark to keep his place. How much longer will it take him to finish the book? A. ½ hour B. 2 hours C. 5 hours D. 10 hours
Example 3 Out of sight, out of mind (Back translated from French) invisible, insane
ITC Guidelines for Test Adaptation • The first edition of the Guidelines has already been cited in more than 500 journal articles, reports, and papers. • A Google search by April Zenisky turned up over 1,000 references on test adaptation, and we are compiling a reference list of both substantive and methodological studies.
Five Common Myths About Adapting Tests Across Languages and Cultures 1. Know two languages and you can be a translator. -Not true! Recall Ype Poortinga’s comment: about 80% of cross-cultural research prior to 1995 was seriously flawed because of poor test adaptation procedures.
Selection of Translators • Knowledgeable in the languages • Knowledgeable in the cultures • Some knowledge in the subject matter • Some knowledge about the principles of test development, item writing, and scoring rubrics, etc.
Five Common Myths 2. A good literal translation guarantees validity. -It does not! Format unsuitability, restrictive time limits, unclear directions, inappropriate content, and more, can still be problematic, even with a good literal translation.
Five Common Myths 3. Judgmental reviews are sufficient to identify problems in a test translation and adaptation; translators are excellent. -Reality: Translators can overlook many important features. Many items still show up in empirical studies of DIF. -Analogy to item writing—field-testing turns up many unanticipated problems with test items.
Five Common Myths 4. The common strategy of (1) a back-translations design, and (2) the use of a bilingual design to compile empirical data, is sufficient to validate a test for use in a second language. (One major problem: target language versions of tests are not even reviewed.)
They are not! -Back translation does not involve reviewing the target-language version of the test. -Bilingual candidates are not representative of monolingual candidates in either language, so the generalizability of any findings obtained from bilinguals can be problematic.
Five Common Myths 5. Constructs are universal, so all tests can be translated into other languages and cultures. -They are not! Well-known exceptions include intelligence tests and quality of life measures. With achievement tests and credentialing exams—content domain may not be completely relevant in a different country.
Summary of the Myths • All too often, test translation/adaptation process is not understood or implemented correctly (e.g., hiring one translator, use of back translations only). --Limited empirical work is common (e.g., item analysis, reliability assessment) --Literal translations are common.
Reasons for the Second Edition 1. Understanding of the test adaptation field has advanced considerably since the ITC began work on the first edition of the Guidelines in 1994. (Thousands of articles on methodology and substantive findings have been published.) 2. The guidelines may be clear, but it is not always clear how they can be applied.
Reasons for the Second Edition 3. Valuable methodological advances have been made: --better judgmental designs (e.g., PISA), --more useful statistical methods (e.g., van de Vijver & Tanzer, 1997; Hambleton, Merenda, & Spielberger, 2005). That work, and more, needs to be used to update the ITC Guidelines for Test Adaptation.
Reasons for the Second Edition 4. Many constructive criticisms of the Guidelines have appeared, and suggestions for improving them have been made (e.g., Yu, Slater, Jeanrie, Bertrand, Tanzer, Sim, and others).
Committee Members • David Bartram, England • Giray Berberoglu, Turkey • Jacques Gregoire, Belgium • Ronald Hambleton, USA • Jose Muniz, Spain • Fons van de Vijver, the Netherlands
Test Translation vs. Test Adaptation? “Test Adaptation” is more descriptive of the process that usually takes place—revising/adapting directions, formats, contexts, timing, etc. “Test Translation” is usually only part of the process that must take place.
Test Adaptation Guideline A practice judged to be important for conducting and evaluating the adaptation or parallel development of psychological and educational tests for use in different populations.
Six Organizational Categories for the 18 Guidelines • Pre-Condition Guidelines (decisions made before test development begins.) • Test Development Guidelines (everything from test planning to final production, scoring guides, etc.) • Confirmation Guidelines (compilation of empirical evidence such as reliability, item analysis, and validity)
Six Organizational Categories for the 18 ITC Guidelines • Administration Guidelines (associated with the administration of the test, and scoring guidelines). • Score Scales and Interpretations Guidelines (explanations of score scales, and uses of the scores). • Documentation Guidelines (addressing work that was completed).
Pre-Condition PC-1 (1) Obtain the necessary permission from the holder of the intellectual property rights relating to the test before carrying out any adaptation. PC-2 (2) Evaluate that the amount of overlap in the definition and content of the construct measured by the test and the item content in the populations of interest is sufficient for the intended use (or uses) of the scores.
PC-3 (3) Minimize the influences of any cultural and linguistic differences that are irrelevant to the intended uses of the test in the populations of interest.
Test Development Guidelines TD-1 (4) Ensure that the adaptation process considers linguistic, psychological, and cultural differences in the intended populations through the choice of experts with relevant expertise. TD-2 (5) Use appropriate translation designs and procedures to maximize the suitability of the test adaptation in the intended populations.
TD-3 (6) Provide evidence that the test instructions and item content have similar meaning for all intended populations. TD-4 (7) Provide evidence that the item formats, rating scales, scoring rubrics, test conventions, modes of administration, and other procedures are suitable for all intended populations.
TD-5 (8) Collect pilot data on the adapted test to enable item analysis and reliability assessment to be carried out, and other small-scale validity studies, as deemed useful, so that any necessary revisions to the adapted test can be made.
Confirmation Guidelines C-1 (9) Select samples with characteristics that are relevant for the intended use of the test and of sufficient size and relevance for the empirical analyses. C-2 (10) Provide relevant statistical evidence about the construct equivalence, method equivalence, and item equivalence for all intended populations.
C-3 (11) Compile evidence supporting norms, reliability and validity of the adapted version of the test in the intended populations. C-4 (12) Use an appropriate equating design and data analysis procedures when linking score scales from different language versions of a test.
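Guideline C-4 calls for an appropriate equating design when linking score scales across language versions, but does not prescribe a method. As an illustration only, here is a minimal mean-sigma linear linking sketch in Python; the data, function name, and common-item design it assumes are all hypothetical:

```python
import statistics

def mean_sigma_link(ref_scores, new_scores):
    """Mean-sigma linear linking: place scores from a new (e.g., translated)
    version on the reference-version scale via y = A*x + B, matching the
    anchor-item score mean and standard deviation. Illustrative sketch only;
    operational linking would use an IRT- or equipercentile-based design."""
    A = statistics.stdev(ref_scores) / statistics.stdev(new_scores)
    B = statistics.mean(ref_scores) - A * statistics.mean(new_scores)
    return lambda x: A * x + B

# Hypothetical anchor-item scores for the two language versions
ref = [10, 12, 14, 16, 18]
new = [8, 9, 10, 11, 12]
link = mean_sigma_link(ref, new)   # link(x) maps new-version scores to the reference scale
```

The transformation is determined entirely by the two means and standard deviations, which is why the choice of equating design (who takes which items) matters so much in practice.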
Administration Guidelines A-1 (13) Prepare administration materials and instructions to minimize any culture- and language-related problems that are caused by administration procedures and response modes that can affect the validity of the inferences drawn from the scores. A-2 (14) Specify testing conditions that should be followed closely in all populations of interest.
Score Scales and Interpretation Guidelines SSI-1 (15) Interpret any group score differences with reference to all relevant available information. SSI-2 (16) Only compare scores across populations when the level of invariance has been established on the scale on which scores are reported.
Documentation Guidelines Doc-1 (17) Provide technical documentation of any changes, including an account of the evidence obtained to support equivalence, when a test is adapted for use in another population. Doc-2 (18) Provide documentation for test users that will support good practice in the use of an adapted test with people in the context of the new population.
2017 PISA Guidelines for Test Translation Some version of the PISA guidelines has been in use since the 1990s. Whereas the ITC has 18 guidelines, each followed by a description and possibilities for implementation (30 pages), PISA has 129 guidelines, nearly all very specific about what needs to be done (29 pages).
Comparison of Guidelines (ITC guideline: number of PISA guidelines covering it; labels use the sequential 1–18 numbering) PC-1: 0, PC-2: 0, PC-3: 6, TD-4: 7, TD-5: 61, TD-6: 0, TD-7: 19, TD-8: 0
Comparison of Guidelines (ITC guideline: number of PISA guidelines covering it) C-9: 0, C-10: 0, C-11: 0, C-12: 0, A-13: 0, A-14: 0
Comparison of Guidelines (ITC guideline: number of PISA guidelines covering it) SS-15: 0, SS-16: 0, D-17: 3, D-18: 8
Conclusions One new idea for the ITC guidelines: given the importance of the translation guidelines, it may be best to seek out at least two types of translators: one to handle the linguistic and cultural aspects, and a second to handle language, item-writing mechanics, and item-objective match.
Conclusions PISA has put forward a new and potentially highly worthwhile judgmental design: two starting versions, one English and one French. Countries start from these two places and see whether their target (home) language versions, translated from the English and the French versions, converge to the same place.
Conclusions The PISA guidelines are very specific about improving the translation process. This level of detail appears highly worthwhile for training, for review, and for completing test development well. If empirical evidence will be difficult to compile, extra focus on the translation process can reduce some of the problems that would otherwise require empirical evidence to detect.
Conclusions It is clear that both sets of guidelines could be improved by a review of each, but in the main, the huge differences observed have everything to do with the ITC emphasis on compiling empirical evidence to support the validity of the adapted (target-language) version of the test.
Conclusions The case for the importance of empirical evidence is clear: Translators/content reviewers are rarely able to spot all of the item problems that show up in statistical analyses.
Conclusions Expand and improve the role of translators (e.g., more comprehensive reviews). Expand the role of preliminary statistical evidence; small-scale studies seem possible. We recognize that to do more might be impossible in practice.
Improvements in the Item Review Form Hambleton and Zenisky published a chapter in a volume edited by Matsumoto and van de Vijver in which they designed a review form that included only questions linked, by published evidence, to item features that can affect item performance: 1. Is the item format, including physical layout, the same in the two language versions?
Improvements in the Item Review Form 2. Have terms in the item in one language version been suitably adapted to the cultural environment of the second language version? Each of the 25 questions on the form was written to address flaws found in empirical analyses published in the professional journals. There is evidence that PISA is doing this already, but we expect more can be done.
Small Sample Statistical Studies Cognitive labs have great promise. Several simple statistics could be revealing too, though time would be needed in the schedule: -Delta statistics -Conditional p-value comparisons
Checking for Item Equivalence: Delta Plots (look for outliers)
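To make the delta-plot idea concrete, here is a minimal sketch in Python. It assumes classical p-values (proportion correct) are available for each item in both language versions, converts them to Angoff's delta scale (13 + 4z, where z is the normal deviate for the proportion answering incorrectly), fits the principal-axis line through the paired deltas, and flags items far from that line. The function names and the 1.5 cutoff are illustrative, not part of the presentation:

```python
import numpy as np
from scipy.stats import norm

def delta_values(p):
    """Angoff deltas from item p-values: 13 + 4*z for the proportion incorrect.
    Higher delta means a harder item."""
    p = np.clip(np.asarray(p, dtype=float), 0.01, 0.99)  # guard extreme p-values
    return 13.0 + 4.0 * norm.ppf(1.0 - p)

def delta_plot_outliers(p_source, p_target, threshold=1.5):
    """Flag items whose (source, target) deltas fall far from the
    principal-axis line; these are candidates for translation DIF."""
    x, y = delta_values(p_source), delta_values(p_target)
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    cov = np.cov(x, y, ddof=1)[0, 1]
    # Principal-axis (major-axis) slope and intercept for the delta plot
    b = (sy**2 - sx**2 + np.sqrt((sy**2 - sx**2) ** 2 + 4.0 * cov**2)) / (2.0 * cov)
    a = y.mean() - b * x.mean()
    # Signed perpendicular distance of each item from the line y = a + b*x
    d = (b * x - y + a) / np.sqrt(b**2 + 1.0)
    return np.abs(d) > threshold, d
```

With identical p-values in both versions except one item that is much harder in the target language, only that item exceeds the threshold, which is exactly the "look for outliers" inspection the delta plot supports.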