Measuring Success in English for Young People
Annabelle G. Simpson, Director, Channel Management, ETS Global Division
Outline • Who is ETS? • Two Families of Products: TOEFL® & TOEIC® • How does ETS develop quality tests? • What is TOEIC® Bridge? • What is TOEFL® Junior?
ETS: Our Mission To Advance Quality and Equity in Education for All People Worldwide • We do this by providing: • Fair, valid and reliable assessments • Education research • Products and services that measure knowledge and skills, promote learning and educational performance, and support education and professional development
Two Families of English Assessments: TOEFL® & TOEIC® • TOEFL family: TOEFL iBT, TOEFL ITP, TOEFL Junior • TOEIC family: TOEIC L&R, TOEIC S&W, TOEIC Bridge • Coming soon…
The Origins of ETS Work with Young People • English proficiency is an increasingly important skill for students and young adults worldwide • - Expanding access to educational, personal and professional opportunities • EFL instruction is beginning at earlier ages • English-medium instructional environments take many forms internationally: • - Public and private schools in English-dominant countries • - International schools in non-English-dominant countries • - Schools in any country using bilingual or CLIL approaches • - Vocational schools • Responds to the aspirations of students as they attain English-language proficiency
Overview • Before discussing how ETS develops quality tests, I will first explain what we mean by “quality” in testing. • Then I will walk through the major steps in test development that are required to create a high-quality test.
What Is a Quality Test? • A quality test must be • Reliable • Valid • Fair • Practical
Reliable • A test is only a sample. • The items are a sample of all the items that could be asked. • The time of testing is a sample of all the times that the test could be given. • The person scoring the essay is a sample of all possible scorers.
Reliability Is Consistency • If a test taker’s knowledge is constant, how consistent would the scores be if: • The samples changed and parallel items were used? • The test was taken on a different day? • Different judges were used for scoring essays? • The higher the reliability, the more consistent the scores will be.
Factors That Determine Reliability • All other things being equal, • the more independently scored items, the higher the reliability • the more the items correlate with each other, the higher the reliability • the greater the variability of scores, the higher the reliability
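A common way to quantify this internal-consistency notion of reliability is Cronbach’s alpha, which rises with the number of items and with how strongly the items correlate. A minimal sketch in Python (the response matrix is invented illustration data, not from any ETS test):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented data: 5 examinees x 4 dichotomously scored items
responses = [[1, 1, 1, 0],
             [1, 0, 1, 1],
             [0, 0, 1, 0],
             [1, 1, 1, 1],
             [0, 0, 0, 0]]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Adding more independently scored items that correlate with the rest of the test pushes alpha upward, which is exactly what the first two bullets above describe.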
Validity • Most important indicator of test quality • Extent to which inferences based on test scores are appropriate & supported by evidence • Requires evidence to support the use of the test for the intended purpose
Evidence of Validity • Qualifications of test designers • Process used to develop test • Qualifications of item writers and reviewers • Statistical indicators of item quality and fairness • Expert judgments of test content
Evidence of Validity • Match of items to content standards • Relations among parts of the test • Relations of scores with other variables • Results fit with theories • Claims for use of test are met • Good consequences
Fairness = Validity for All • Fairness is an aspect of validity. • Tests that show valid differences across groups are fair. • Tests that cause invalid differences across groups are not fair.
Practicality • Tests must be affordable in dollar costs and in time used. • Scores must be understandable & helpful to score-users. • Items must be acceptable to diverse constituencies. • Every test is a compromise among competing demands.
Major Steps in Test Development • 1) Make Initial Plan for Test • 2) Involve External Experts • 3) Write/Review Items • 4) Pretest Items (Whenever Possible) • 5) Review Data & Revise Items • 6) Assemble Final Test
Major Steps (continued) • 7) Administer Tests • 8) Checks Before Scoring • 9) Scaling & Equating • 10) Test Analyses • 11) Report Scores • 12) Begin Planning for Next Form
1) Plan Test • Purpose • What is the test used for? • What decisions are made on the basis of the scores? • Population • What are the characteristics of the test takers? • Construct • Content & skills
Plan Test • What constraints on test design? • Time, cost, format, scoring, etc. • Initial plan for test development work • Major tasks, schedule, staff • Evidence-Centered Design • What claims about test takers? • What evidence supports claims? • What tasks provide evidence?
2) Involve External Experts • Diverse (demographic, geographic, institutional, point of view) external contributors are required in test design, item writing and reviewing. • Diverse experts help establish acceptability, validity and fairness.
Tasks of External Experts • Set/approve test specifications • What content to measure? • What skills to measure? • What statistical properties? • Write and review test items • Select items for final form
3) Write/Review Items • Make item-writing assignments • Write items to meet specifications • Write overage for attrition • Internal & external reviews & revisions • At least 2 independent content reviews per item • Separate editorial review • Separate fairness review
3) Write/Review Items: the item workflow • Question (item) author → Artwork/graphics → Content reviewer 1 → Content reviewer 2 → Content reviewer 3 → Edit → Fairness → Resolver → Studio recording → Lock
4) Pretest • When possible, try out items before operational use. This gives information to: • Identify problem items (ambiguous, wrong difficulty, poor discrimination; for MC: no key, multiple keys, bad distractors) • Pick the most appropriate items to meet specifications • Estimate final form characteristics from item data
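The two classical pretest statistics mentioned here, difficulty (proportion correct) and discrimination (corrected item-total correlation), can be computed directly from a 0/1 response matrix. A minimal sketch with invented data:

```python
import numpy as np

def item_stats(scores):
    """Classical difficulty (p) and discrimination (corrected item-total
    point-biserial r) for a 0/1 examinees-by-items matrix."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    results = []
    for i in range(scores.shape[1]):
        p = scores[:, i].mean()                    # proportion correct
        rest = total - scores[:, i]                # total score excluding item i
        r = np.corrcoef(scores[:, i], rest)[0, 1]  # discrimination
        results.append((p, r))
    return results

# Invented pretest responses: 6 examinees x 3 items
data = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
for i, (p, r) in enumerate(item_stats(data), start=1):
    print(f"item {i}: difficulty p = {p:.2f}, discrimination r = {r:.2f}")
```

An item with p near 0 or 1 tells you little about most examinees; an item with a low or negative r may have no key or multiple keys.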
Use Differential Item Functioning (DIF) • DIF = a statistical measure of how people in different groups, matched on overall ability, perform on an item. • DIF helps spot items that may be unfair. • DIF is NOT proof of bias.
Uses of DIF • If data are available, tests are assembled with low-DIF items. • If no data are available at assembly, DIF is calculated after administration. • High-DIF items are reviewed and, if judged unfair, removed before the test is scored. • External people are involved in the reviews.
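One standard DIF statistic is the Mantel-Haenszel delta (MH D-DIF), which compares the odds of answering correctly for reference- and focal-group members matched on total score. A minimal sketch; the per-stratum counts are invented, and ETS’s operational procedure involves additional significance tests and classification rules:

```python
import math

def mh_d_dif(strata):
    """Mantel-Haenszel D-DIF from per-stratum 2x2 counts.

    Each stratum holds (ref_right, ref_wrong, focal_right, focal_wrong)
    for examinees matched on total test score."""
    num = den = 0.0
    for a, b, c, d in strata:          # a, b = reference group; c, d = focal group
        n = a + b + c + d              # examinees in this score stratum
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den               # common odds ratio across strata
    return -2.35 * math.log(alpha_mh)  # rescaled to the ETS delta metric

# Invented counts for one item across three score strata
strata = [(30, 10, 20, 15), (45, 5, 35, 10), (20, 2, 18, 4)]
print(f"MH D-DIF = {mh_d_dif(strata):.2f}")  # large absolute values trigger fairness review
```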
5) Review Data & Revise Items • Review test items based on data • Ensure accuracy, clarity • Appropriate difficulty • Acceptable discrimination • Revise or drop problem items • Write new items if necessary to meet specifications
6) Assemble Final Test • Choose a set of items from the pool according to specifications • Perform test reviews • Meet content, skill & statistical specifications • Check for overlap and cueing of keys • Check correctness of keys
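Some of these reviews can be automated. As one example, a quick script can verify that no answer-key position dominates the form, which could otherwise cue test-wise examinees. A minimal sketch (the key string and the threshold are invented illustrations, not ETS rules):

```python
from collections import Counter

def check_key_balance(keys, max_share=0.35):
    """Flag a form whose multiple-choice answer keys are unevenly distributed.

    max_share is an illustrative threshold, not an operational ETS rule."""
    counts = Counter(keys)
    for option in sorted(counts):
        share = counts[option] / len(keys)
        flag = "  <-- review" if share > max_share else ""
        print(f"key {option}: {counts[option]:2d} items ({share:.0%}){flag}")

# Invented answer keys for a 20-item form
check_key_balance("ABCDDACBADCCABDBACDA")
```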
7) Test Administration • Print or format for computer • Quality control checks • Ship securely • Administer test • Acceptable conditions (space, comfort, light, temperature) • Security (copying, impersonation, prior knowledge)
8) Checks Before Scoring • Investigate complaints & reports • Preliminary Item Analysis (PIA) • Identify “problem” items based on statistics (too hard, too easy, poor discrimination, change from pretest) • Review items to decide whether to keep them in the test or drop them before scoring • DIF, if not done previously
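A preliminary item analysis can be expressed as simple flag rules that compare operational statistics against acceptable ranges and against pretest values. A minimal sketch; all thresholds and item IDs are illustrative assumptions, not ETS’s operational criteria:

```python
def pia_flags(items, p_range=(0.2, 0.9), min_r=0.15, max_drift=0.15):
    """Flag items whose operational statistics look problematic.

    items maps an item ID to (operational p, item-total r, pretest p).
    All thresholds here are illustrative, not ETS's operational rules."""
    flags = {}
    for item_id, (p, r, p_pre) in items.items():
        reasons = []
        if not p_range[0] <= p <= p_range[1]:
            reasons.append("too easy" if p > p_range[1] else "too hard")
        if r < min_r:
            reasons.append("poor discrimination")
        if abs(p - p_pre) > max_drift:
            reasons.append("changed since pretest")  # possible exposure or key error
        if reasons:
            flags[item_id] = reasons
    return flags

# Invented operational statistics: (p, item-total r, pretest p)
stats = {"Q07": (0.95, 0.30, 0.85),
         "Q12": (0.55, 0.05, 0.52),
         "Q19": (0.48, 0.35, 0.70)}
print(pia_flags(stats))
```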
Checks Before Scoring • Check for anomalies (sudden drops or increases in scores) that may indicate problems
9) Scaling & Equating • Raw scores are the number right or percent right on a particular test form. • 50% right on a hard test form may take more knowledge & skill than 60% right on an easy test form. • Raw scores mean different things on different test forms. • ETS very rarely reports raw scores.
Scaling & Equating • A scale is an arbitrary range of numbers used to report scores, e.g., 200-800 for the SAT, 150-190 for the PPST. • Equating is a statistical adjustment for differences in the difficulty of different forms of the same test. • Equating allows us to treat the scores on different forms of a test as though they meant the same thing.
Scaling & Equating • If a form happens to be a little harder than the others, it will take fewer raw score points to reach a particular scale score point. • If a form happens to be a little easier than the others, it will take more raw score points to reach a particular scale score point. • Scaled scores, after equating, mean the same thing on each form.
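A minimal sketch of one classical method, linear (mean-sigma) equating, which places raw scores from a new form onto a reference form’s metric by matching means and standard deviations. Operational equating designs (anchor items, equipercentile methods) are more elaborate, and the score distributions below are invented:

```python
import statistics

def linear_equate(raw_x, new_form_scores, old_form_scores):
    """Map a raw score on the new form onto the old form's raw-score metric
    by matching means and standard deviations (mean-sigma method)."""
    mx = statistics.mean(new_form_scores)
    sx = statistics.stdev(new_form_scores)
    my = statistics.mean(old_form_scores)
    sy = statistics.stdev(old_form_scores)
    return my + (sy / sx) * (raw_x - mx)

# Invented raw scores from equivalent groups of examinees
new_form = [52, 60, 47, 55, 63, 49, 58]   # harder form: scores run lower
old_form = [58, 66, 53, 61, 69, 55, 64]   # easier reference form
print(f"raw 55 on the new form ~ raw {linear_equate(55, new_form, old_form):.0f} on the old form")
```

Because the new form is harder, 55 raw points on it correspond to about 61 on the reference form; fewer raw points are needed on the harder form to reach the same scaled score, exactly as the slide states.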
10) Test Analyses • Analysis of final form characteristics. • Distribution of item difficulty & discrimination • Reliability • Speededness • Did test meet content & statistical specifications? If not, where were problems?
11) Report Scores • Explain what the scores mean so they are understandable to test users • Indicate the Standard Error of Measurement on the score report
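The Standard Error of Measurement ties back to reliability: SEM = SD × √(1 − reliability). A minimal sketch with invented scale values:

```python
import math

def sem(scale_sd, reliability):
    """Standard Error of Measurement from the scale SD and score reliability."""
    return scale_sd * math.sqrt(1 - reliability)

# Invented values: scale SD of 25 points, reliability of 0.91
s = sem(25, 0.91)
print(f"SEM = {s:.1f}; report a score of 500 as roughly 500 +/- {s:.0f}")
```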
12) Plan Next Form • What was learned from this administration to make the next administration of the test better? • What has to change for next form?
A TOEFL® Product for a Younger Generation • A distinct product within the growing TOEFL® family of products • A natural extension of the TOEFL brand, but specifically geared to the language-learning needs of middle-grade students • - Informed by reviews of research and relevant standards • - Based on years of experience developing international assessments of English language proficiency for both adults and K-12 students • Meets the ETS Standards for Quality and Fairness • Builds upon ETS’s expertise in English language assessment for young learners • TOEFL® products set the standard for English proficiency worldwide
The Paper-Based Test Is Designed to Provide Useful Information • Purpose is to assess the degree to which students aged 11-15 have attained language proficiency representative of middle school English-medium instruction
TOEFL Junior Structure • Format: • Paper • Three Sections: • Listening • Reading • Language Form and Meaning
TOEFL Junior Structure • Listening Comprehension: • This section tests how well students understand spoken English. • Number of Questions: 42 • The section is administered by CD; students answer questions based on a variety of statements, questions, conversations and talks recorded in English. • Total time: approximately 35–40 minutes. • Question Types • Classroom Instruction • Short Conversations • Academic Listening
Sample Listening Item • (Narrator): Listen to a high school principal talking to the school’s students. • (Man): I have a very special announcement to make. This year, not just one, but three of our students will be receiving national awards for their academic achievements. Krista Conner, Martin Chan, and Shriya Patel have all been chosen for their hard work and consistently high marks. It is very unusual for one school to have so many students receive this award in a single year. • (Narrator): What is the subject of the announcement? • What is the subject of the announcement? • (A) The school will be adding new classes. • (B) Three new teachers will be working at the school. • (C) Some students have received an award. • (D) The school is getting its own newspaper.
TOEFL Junior PBT Structure • Reading Comprehension: • - This section tests how well students read and comprehend written English. Students read a variety of materials. • - Number of Questions: 42 • Total time: 50 minutes. • Question Types • - Non-academic • - Academic