Investigating the Reliability and Validity of High-stakes ESL Tests

Dr. Paula Winke Michigan State University winke@msu.edu Investigating the Reliability and Validity of High-stakes ESL Tests March 22, 2013, TESOL Dallas

Background I can fly planes! • Tests must be valid. • Validity = evidence that a test actually measures what it is intended to measure.

Background • Validity can be measured: • Quantitatively • Psychometric measures of reliability • Qualitatively • Social, ethical, practical, and consequential considerations (e.g. Messik, 1980, 1989, 1994; Moss, 1998; McNamara & Roever, 2006; Ryan, 2002; Shohamy, 2001, 2006) Tests should be reliable and consistent. Narrow Tests should be fair, meaningful, cost efficient, developmentally appropriate, and do no harm. Broad

K-12, NCLB ELPA in Michigan

Who determines reliability & validity? • Test developers • Researchers • Test Stakeholders Validity is a multifaceted concept that includes value judgments and social consequences. (Messick, 1980; 1989; 1994; 1995; McNamara & Roever, 2006) “Justifying the validity of test use is the responsibility of all test users.” (Chapelle, 1999, p. 258)

Study 1: The ELPA in Michigan • The purpose of this study was to evaluate the perceived effectiveness of the English Language Proficiency Assessment (ELPA), which is used in the state of Michigan to fulfill the No Child Left Behind (NCLB) requirements. • In particular, I wanted to look at the test validity of the ELPA.

Goal of this Study • Investigate the views of teachers and school administrators (educators) on the administration of the ELPA in Michigan

ELPA in Michigan • Administered to 70,000 English Language Learners (ELLs) in K-12 annually since spring of 2006 • Fulfillment of No Child Left Behind (NCLB) and Federal Title I and Title III requirements

The ELPA • Based on standards adopted by the State • Subtests • Listening • Reading • Writing • Speaking • Comprehension • Scoring • Basic • Intermediate • Proficient

Levels of the 2007 ELPA • Level I for kindergarteners • Level II for grades 1 and 2 • Level III for grades 3 through 5 • Level IV for grades 6 through 8 • Level V for grades 9 through 12

ELPA Technical Manual • 2006 MI-ELPA Technical Manual claimed ELPA is valid. A. Item writers were trained B. Items and test blueprints were reviewed by content experts C. Item discrimination indices were calculated D. Item response theory was used to measure item fit and correlations among items and test sections • Validity argument did not consider the test’s consequences, fairness, meaningfulness, or cost and efficiency, all of which are part of a test’s validation criteria (Linn et al., 1991)

Research Questions 1. What are educators’ opinions about the ELPA and its administration? 2. Do educators’ opinions vary according to the demographic or teaching environment in which the ELPA was administered?

267 Participants

Materials • 3-part online survey Part 1: 6 items on demographic information Part 2: 40 belief statements + comments Part 3: 5 open-ended questions

Procedure ELPA Testing Window 3/19-4/27 Online Survey Window 3/29-5/20 MITESOL Listserv email sent 3/29 Names and emails culled off Web 3/30-4/7 Reminder email to MITESOL and culled list sent 5/14

Analysis • Quantitative: factor analysis of the data generated from the 40 belief items; derived factor scores and used those to see if the ELL demographic or teaching environment affected survey outcomes • Qualitative: deductive-analytic, qualitative analysis of answers to the open-ended questions and comments on the 40 closed items.

How Factor Analysis Works Item 1 Factor 1 • After items are clustered together with other items that were answered in the same way, we can look at the items in the cluster and label the cluster—what is the theme of the cluster? Speaking ability • Items that don’t correlate with any larger factor (that don’t fit into a cluster) can be dropped from further analysis or discussion. • Factor analysis runs correlations among the answered items to see what items are related—that is, what questions tap into the same underlying construct. Item 2 Item 3 Factor 2 Item 4 Item 5 Listening ability Item 6 Item 7 Item 8 Item 9 Item 10

Results: 1. Factor Analysis

ANOVA results Factor 2 (affective administration) by ELL concentration

ANOVA results Factor 4 (speaking test) by ELL concentration

Qualitative Results • A. Perceptions on the ELPA subtests • Factor 1. Reading and writing • Factor 4. Speaking • Factor 5. Listening • B. Logistics • Factor 2. Effective Administration • C. Test impacts • Factor 3. Impacts

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • 65 wrote that the test was too difficult for lower grade levels. • Out of the 145 responders who administered ELPA Level I, the test for kindergarteners, 30 commented that it was too hard or inappropriate. • Most of these comments centered on the reading and writing portions of the test.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 1: Having given the ELPA grades K-4, I feel it is somewhat appropriate at levels 1-4. However I doubt many of our American, English speaking K students could do well on the K test. This test covered many literacy skills that are not part of our K curriculum. It was upsetting for many of my K students. The grade 1 test was very difficult for my non-readers who often just stopped working on it. I felt stopping was preferable to having them color in circles randomly. Educator 130, ESL teacher, administered Levels I , II, & III.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Speaking subsection of the ELPA was too subjective. • 31 educators wrote the rubrics did not contain example responses or did not provide enough levels to allow for an accurate differentiation among student abilities. • 4 educators wrote ELLs who were shy or did not know the test administrator did poorly on the speaking test. • 15 mentioned problems with two-question prompts that were designed to elicit two-part responses; some learners only answered the second question, resulting in low scores.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 2: [I]t [the rubric] focuses on features of language that are not important and does not focus on language features that are important; some items test way too many features at a time; close examination of the rubric language makes it impossible to make a decision to give 2 or 3 points because it is too subjective and not quantifiable enough because too many features are being assessed at a time Educator 79, ESL teacher, administered Levels I, II, & III.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 3: For some this was adequate but for those students who are shy it didn't give an accurate measure at all of their true ability. I have a student who is an excellent English speaker but painfully shy and she scored primarily zeros because she was nervous and too shy to speak. Educator 30, ESL teacher, administered Levels III & IV.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 4: When you say in succession: "What would you say to give your partner directions? What pictures would be fun to make? The students are only going to answer the last question and 100% of mine did! So it is already half wrong. Too many questions had a two part answer but only one bubble to fill in how they did. Not very adequate to me! Educator 133, ESL and bilingual teacher, administered Levels I & II.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • 30 of the 151 educators who administered lower levels of the ELPA commented that the listening section was unable to hold the students’ attention because it was too long and repetitive or had bland topics. • 9 complained of the reliance on memory and reading skills to answer questions.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 5: Some of the stories were very long for kindergarten and first graders to sit through and then they were repeated! The kids didn't listen well the second time it was read and had trouble with the questions because of the length of the story. Educator 176, Reading recovery and literacy coach, administered Levels I, II, & III.

A. Perceptions on the ELPA subtests (Factors 1, 4 and 5) • Example 6: The kindergarten level is where I saw the most difference between actual ability and test results. Here students could give me the correct answer orally but mark the wrong answer in the test booklet. Educator 80, ESL teacher, administered Levels I, II, & III.

B. Effective Administration(Factor 2) • There was not adequate space or time for testing. • Test materials did not arrive on time or were missing. • Other national and state tests were being conducted at the same time, which overburdened the test administrators and the ELLs.

B. Effective Administration(Factor 2) • Example 7: There are only six teachers to test over 600 students in 30 schools. We had to [administer the ELPA] ourselves because it was during IOWA testing and that was where the focus was. Because we are itinerant in our district we were given whatever hallway or closet was available. Educator 249, ESL teacher, administered all Levels.

B. Effective Administration(Factor 2) • Example 8: The CD's and cassettes were late. Several of the boxes of materials were also late and I spent a lot of time trying to track them. NONE of the boxes came to the ESL office as requested. It was difficult with four grades in a building to find the right time to pull students and not mess with classwork and schedules. Educator 151, school administrator, administered all Levels.

C. Impacts (Factor 3) • 49 educators wrote that the test did not directly impact the ESL curricular content. • 86 educators wrote that the test reduced the quantity and quality of ESL services during all or part of the ELPA test window. • 79 educators commented on the negative psychological impact on students. • 7 out of the 79 wondered about how much money and resources the ELPA cost Michigan.

C. Impacts (Factor 3) • Example 9: I feel it makes those students stand out from the rest of the school because they have to be pulled from their classes. They thoroughly despise the test . One student told me and I quote, "I wish I wasn't Caldean* so I didn't have to take this" and another stated she would have her parents write her a note stating that she was not Hispanic (which she is) Educator 71, Title 1 teacher, administered Levels I , II, III, & IV. *Chaldeans speak Aramaic, are from the part of Iraq originally called Mesopotamia. As Christians, they are a minority among Iraqis. Many have immigrated to Michigan.

C. Impacts (Factor 3) • Example 10: This test took too long to administer and beginning students, who did not even understand the directions, were very frustrated and in some cases crying because they felt so incapable. Educator 109, ELL and resource room teacher, administered Levels II & III.

C. Impacts (Factor 3) • Example 11: Just that the ELPA is a really bad idea and bad test... we spend the entire year building students confidence in using the English language, and the ELPA successfully spends one week destroying it again Educator 171, ESL teacher, administered Level IV.

Discussion • Research question 1 What are educators’ opinions about the ELPA and its administration? • Answers • Q1. 1. Difficult for lower grade students • Q1. 2. Speaking tests are problematic for youngsters • Q1. 3. Logistic problems with the administration

Q1. 1. Difficult for lower grade students The educators may be right. What do we know about young language learners? • The attention span of young learners in the early years of schooling is short, 10 to 15 minutes. • They are easily diverted and distracted. • They may drop out of a task when they find it difficult, though they are often willing to try a task in order to please the teacher (McKay, 2006, p. 6).

Q1. 1. Difficult for lower grade students • Children under 8 are unable to use language to talk about language. (They have no metalanguage.) (McKay, 2006) • Children do not develop the ability to read silently to themselves until between the ages of 7 and 9 (Pukett & Black, 2000). Before that, children are just starting to understand how writing and reading work.

Q1. 1. Difficult for lower grade students • Children between 5 and 7 are only beginning to develop feelings of independence. They may become anxious when separated from familiar people and places. Having unfamiliar adults administer tests in unfamiliar settings might be introducing a testing environment that is not “psychologically safe” (McKay, 2006) for children. • Children between 5 and 7 are still developing their gross and fine motor skills (McKay, 2006). They may not be able to fill in bubble-answer-sheets.

Q1. 1. Difficult for lower grade students Recommendations • Test administrators should read directions aloud and be allowed to clarify them if necessary. • Students should be allowed to give verbal answers in the listening and reading subsections. • Better would be to have a test that could be stopped if the educator felt like it should stop.

Computer Adaptive Testing for kids? • May be extremely problematic… • http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320/

http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320 http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320/ http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320/

Q1. 2. Speaking tests are unreliable Recommendations • Stop the standardized, high-stakes, oral proficiency testing of young children. • If it must be done, use a puppet to reduce test anxiety and stranger-danger. • Allow non-verbal responses to count as responses.

Q1. 3. Effective Administration What do test logistics have to do with validity? • A test cannot be reliable if the physical context of the exam is not in order (Brown 2004). Materials and equipment should be ready, audio should be clear, and the classroom should be quiet, well lit, and a comfortable temperature. • If a test is not reliable, it is not valid (Bachman, 1990; Brown, 2004; Chapelle, 1999; McNamara, 2000).

Investigating the Reliability and Validity of High-stakes ESL Tests

Investigating the Reliability and Validity of High-stakes ESL Tests

Presentation Transcript

Reliability and Validity

Reliability and Validity

Reliability and Validity

Reliability and Validity

VALIDITY AND RELIABILITY

Reliability and Validity

Validity and Reliability

Reliability and Validity

Validity and reliability

Validity and Reliability

Reliability and Validity

High Stakes Math Tests

Validity and Reliability

Validity and Reliability

Reliability and Validity

Validity and Reliability

Reliability and Validity

Validity and Reliability of Analytical Tests

Reliability and Validity

Validity and Reliability

Investigating the Reliability and Validity of High-stakes ESL Tests