
Item Response Theory in the Secondary Classroom: What Rasch Modeling Can Reveal About Teachers, Students, and Tests.








  1. Item Response Theory in the Secondary Classroom: What Rasch Modeling Can Reveal About Teachers, Students, and Tests. T. Jared Robinson tjaredrobinson.com David O. McKay School of Education Brigham Young University NRMERA, 2012, Park City, UT

  2. Purpose
  • My purpose is to show how Rasch modeling can be applied in certain secondary education situations, and how teachers, students, and tests might benefit.
  • This case study examines a high school biology exam using the Rasch model in order to demonstrate some possible implications of item response theory (IRT) in a secondary setting.
  • It also provides a very brief and basic introduction to IRT/Rasch modeling.

  3. Context
  • Brookhart (2003)—Measurement theory developed for large-scale assessments is not appropriate for classroom assessment.
  • McMillan (2003)—Measurement specialists need to adapt to be more relevant to classroom assessment.
  • Smith (2003)—Traditional notions of reliability are not appropriate for classrooms.
  • Plake (1993), Stiggins (1991, 1995)—Teachers are empirically under-trained in assessment and testing.
  • Newfields (2006)—It is still important for teachers to develop assessment literacy.
  • Rudner & Schafer (2002)—Teachers need to understand reliability and validity now more than ever.

  4. Some of my assumptions
  • While many types of classroom assessment defy the application of measurement theory, teachers still use summative assessment in classroom settings.
  • To the extent that thinking about such assessments in terms of measurement theory provides utility for teachers and students, it should be explored.

  5. How big of an N is big enough? • Source: John Michael Linacre, http://www.rasch.org/rmt/rmt74m.htm

  6. Case Study Design
  • This study used data from a biology test given to sophomores at a suburban high school in the mountain west.
  • The test consisted of 35 multiple-choice and true/false questions.
  • The study analyzed data for 115 students from four different sections of a biology class, all taught by the same teacher.
  • A Rasch analysis of the data was conducted using the WINSTEPS software. Results were used to identify strengths and weaknesses of the test, as well as the general knowledge of the students.

  7. Classical Test Theory Reliability
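For a test scored 0/1 like this one, the usual classical test theory reliability statistic is KR-20 (equivalent to Cronbach's alpha for dichotomous items). A minimal sketch in Python, using an invented score matrix rather than the study's actual data:

```python
import statistics

def kr20(scores):
    """KR-20 reliability for a list of per-student item-score rows (0/1)."""
    k = len(scores[0])                                   # number of items
    n = len(scores)                                      # number of students
    p = [sum(row[i] for row in scores) / n for i in range(k)]  # item p-values
    item_var = sum(pi * (1 - pi) for pi in p)            # sum of item variances (pq)
    totals = [sum(row) for row in scores]                # each student's total score
    total_var = statistics.variance(totals)              # sample variance of totals
    return (k / (k - 1)) * (1 - item_var / total_var)
```

Each row is one student and each column one item; KR-20 is high when items covary strongly relative to the spread of total scores.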

  8. Basics of IRT/Rasch Modeling
  • IRT/Rasch modeling has several advantages over classical test theory. One is that we get much more information about how each individual item interacts with students as a function of their ability.
  • Instead of reporting student ability scores on a percent scale of 0-100, these models report scores on a logit scale centered at 0, with most scores ranging from -3 to +3 (although on this test, some students scored above 5). Students with positive logit scores are more able than average, and students with negative logit scores are less able than average.
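The logit scale comes from the Rasch model itself, in which the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. A minimal sketch:

```python
import math

def rasch_prob(theta, b):
    """Rasch model: probability that a person of ability theta (logits)
    answers an item of difficulty b (logits) correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly 0.5:
rasch_prob(0.0, 0.0)   # 0.5
# A person 2 logits above an item's difficulty succeeds about 88% of the time:
rasch_prob(1.0, -1.0)  # ~0.88
```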

  9. Scalogram
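A scalogram of the kind shown on this slide rearranges the 0/1 response matrix so that higher-scoring students sit at the top and easier items at the left; a clean Guttman "staircase" pattern is consistent with the Rasch model, while stray 0s and 1s flag unexpected responses. A sketch of the sorting step, with an invented matrix:

```python
def scalogram(scores):
    """Sort a students-x-items 0/1 matrix: high scorers first, easy items first."""
    person_order = sorted(range(len(scores)),
                          key=lambda r: -sum(scores[r]))          # by total score
    item_order = sorted(range(len(scores[0])),
                        key=lambda c: -sum(row[c] for row in scores))  # by item p-value
    return [[scores[r][c] for c in item_order] for r in person_order]
```

Applied to data that fit perfectly, the result is a staircase of 1s shrinking row by row.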

  10. What Rasch modeling can teach teachers about their tests
  • One useful thing about IRT is that item difficulty estimates are also computed on the logit scale. Thus, we can easily compare item difficulties with student abilities, as in the chart on the next slide.

  11. What does this mean?
  • WINSTEPS centers the logit scale by setting the mean of the item difficulties to 0. The chart visually demonstrates that most of the questions are much easier than the ability levels of these students.
  • Students like this because it means they get a good grade on the test. But it is not a good situation from a measurement perspective: a test with a pattern like this cannot reliably distinguish the ability levels of most of the students.
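The chart discussed here is an item-person map (Wright map), which stacks person measures and item difficulties on the same logit scale so a targeting gap is visible at a glance. A crude text sketch of the idea, using invented measures rather than the study's data or actual WINSTEPS output:

```python
def item_person_map(abilities, difficulties, lo=-3, hi=3):
    """Return one text row per 1-logit bin (highest first): X = person, # = item."""
    rows = []
    for bottom in range(hi - 1, lo - 1, -1):
        top = bottom + 1
        persons = sum(1 for a in abilities if bottom <= a < top)
        items = sum(1 for b in difficulties if bottom <= b < top)
        rows.append(f"{bottom:+d}..{top:+d} | {'X' * persons:<6}| {'#' * items}")
    return rows

abilities = [2.1, 1.8, 1.5, 1.2, 0.9, 0.4]   # person measures (logits), invented
difficulties = [-1.8, -1.2, -0.5, 0.2]       # item difficulties (logits), invented
for row in item_person_map(abilities, difficulties):
    print(row)
```

In a well-targeted test the X and # columns overlap; here, as on the slide, the items sit below most of the persons.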

  12. Test Information Function

  13. What does this mean?
  • This graph illustrates that the test gives a lot of information about students with ability scores between about -2 and +2; the amount of information drops off sharply outside that range.
  • In areas of the graph where information is high, there is little error in measuring student scores. In areas where information is low, there is a lot of error in estimating student scores.

  14. What does this tell us about this test?
  • This mismatch between student ability and item difficulty leads to low score reliability. In this case, the reliability of the student ability estimates is just .34; ideally it would be much closer to .90 or even higher.
  • For example, 18 of the 115 students got a score of 32/35, or 91%. In reality, these estimates are rough, because the test has no questions at that difficulty level. Those students are probably not identical in ability or knowledge, but the test is designed in a way that prevents us from knowing their ability with any precision.
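The .34 figure reported here is a Rasch person reliability: the proportion of observed variance in the ability estimates that is not measurement error. A sketch of how it is computed, with invented ability measures and standard errors rather than the study's values:

```python
import statistics

def person_reliability(abilities, std_errors):
    """Rasch person reliability: (observed variance - mean error variance) / observed variance."""
    observed_var = statistics.pvariance(abilities)            # spread of ability estimates
    error_var = statistics.mean(se ** 2 for se in std_errors) # average squared standard error
    return (observed_var - error_var) / observed_var

# Large standard errors relative to the spread of abilities drive reliability down:
person_reliability([0.5, 1.0, 1.5, 2.0, 2.5], [0.5] * 5)  # 0.5
```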

  15. Limitations
  • There is evidence of multi-dimensionality, violating some key assumptions of Cronbach's alpha and Rasch measurement.
  • The study only looks at one limited case.
  • The difficulty-level gap might be non-representative.
  • Rasch modeling might be less appropriate in other schools with different testing procedures.
  • The approach is only useful to the extent that it is plausible for teachers to access and understand these tools.

  16. Conclusions
  • This case is one example of where Rasch modeling has utility in understanding a test and the students who took it.
  • Rasch software presents visual interpretation tools that may be easier for teachers to interpret than traditional reliability statistics.
  • In instances where teachers teach multiple sections of one subject, or where assessments are common across teachers, Rasch modeling can be used to produce stable estimates in secondary settings.
