Testing Teaching Methods: Profitable or Pointless?

Efforts to teach in a way that tests can detect: Pointless or profitable? Megan Welsh, Neag School of Education. NERA Conference, 10/21/11.

The Title: Pointless or Profitable? In 1989, Mehrens and Kaminski published a paper entitled “Methods for Improving Standardized Test Scores: Fruitful, Fruitless or Fraudulent?” It addressed the “old, but increasingly relevant issue of teaching to the test” (p. 21). They concluded that at least some test preparation efforts are both fruitless and fraudulent.

This talk • Explores the assumptions underlying many current uses of test scores • Provides some preliminary evidence about the relationship between test-focused instruction and student performance • Discusses implications for next generation assessments

Then (1989) • Norm-referenced tests are widely used • Expectation that teachers do not know content of test; the test samples from a content area domain and that content area knowledge will generalize to test performance • Accountability=parent/community perceptions of schools • Test scores are considered to gauge minimum competency in a subject, but are not typically used to inform curriculum or to reflect on specific lessons

Now • Standards-based assessments are criterion-referenced • Both the test and teaching are expected to closely align with state standards/Common Core • High-stakes accountability based on test scores assumes test scores are reflective of instructional quality • Test scores are used to reflect on instruction and curriculum for specific topics

New uses of large-scale tests: 1. Support accountability

New uses of large-scale tests: 2. Reflect on instruction • Question #: 15 • Question Type: Multiple Choice • Topic: Number Sense • Shutesbury (correct): 6% • Massachusetts (correct): 49% • Correct Answer: C • 61% selected A, 22% selected B, while only 6% selected the correct answer C.

New uses of large-scale tests: 2. Reflect on instruction • Question #: 22 • Question Type: Multiple Choice • Topic: Number Sense • Shutesbury (correct): 56% • Massachusetts (correct): 72% • Correct Answer: C • 22% selected A, 22% selected B.
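One way to read the two item reports above side by side is to compute the gap between school and state percent correct for each question. The sketch below uses only the percentages shown on the slides; the dictionary layout and the function name `percent_correct_gap` are illustrative choices, not part of any reporting system.

```python
# Percent-correct values copied from the two item reports above.
items = {
    15: {"school": 6, "state": 49},
    22: {"school": 56, "state": 72},
}


def percent_correct_gap(item):
    """Return state minus school percent correct, in percentage points."""
    return item["state"] - item["school"]


gaps = {number: percent_correct_gap(values) for number, values in items.items()}

# List the questions with the largest shortfall first.
for number, gap in sorted(gaps.items(), key=lambda pair: pair[1], reverse=True):
    print(f"Question {number}: school trails state by {gap} percentage points")
```

A large gap on a single item is what prompts the "reflect on instruction" use of scores; whether that gap actually reflects instruction is the question the rest of the talk takes up.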

Should standards-based assessments be used in these ways? Is teaching to the test now appropriate? Does teaching to the test improve scores?

Should standards-based assessments be used in these ways? Is teaching to the test now appropriate? Does teaching to the test improve scores? Are tests sensitive to instructional efforts?

Test scores might reflect • Instruction focused on standards • Teaching skill • Attainment of standards due to experiences outside of school • Test-wiseness • Situational anomalies (illness, distractions, mood, etc.) • Aptitude

If tests are insensitive to instruction

If tests are insensitive to instruction • Question #: 15 • Question Type: Multiple Choice • Topic: Number Sense • Shutesbury (correct): 6% • Massachusetts (correct): 49% • Correct Answer: C • 61% selected A, 22% selected B, while only 6% selected the correct answer C. • Waste of time

If tests are insensitive to instruction: Why teach to the test?

Exploring instructional sensitivity A series of studies conducted in one suburban school district located in the Southwest. Participants • 16 third- and 20 fifth-grade mathematics classes in 13 schools • 784 students • Relatively white, high-performing district with moderate SES • Teachers were relatively experienced (M = 13.9, SD = 9.9) • District used standards-based report cards • Districtwide mathematics curriculum uniformly implemented

Data Collection • Teachers interviewed for approximately two hours about: instruction and assessment of two performance objectives; grading practices; likelihood that students will correctly answer state test items relating to the objectives • Student mathematics scores on the state test • End-of-year grades • Demographics

Research questions Is teaching to the test now appropriate? Does teaching to the test improve scores? Are tests sensitive to instructional efforts?

Is teaching to the test appropriate?

My thoughts… General instruction on tested objectives Teaching test taking skills Decontextualized practice Practice on the operational (real) test Instruction on tested objectives using examples similar to the test

Is teaching to the test effective? First need to gauge teaching to the test. 1. Asked teachers about their test preparation practices. 2. Teachers participated in a blind review of mathematics tests containing items from their own and other state tests. They identified items their students could answer and commented on sources of difficulty.

Participants: This analysis • 31 teachers (12 third-grade, 19 fifth-grade) • 711 students • Students were low-performing relative to the district

Frequency of test preparation practices

Item review  State test awareness

Analysis • Conducted a multilevel analysis with students nested within classrooms • Predicted mathematics achievement on the state standards-based assessment, standardized relative to statewide test performance and pooled across grades • Controlled for student-level minority status, ELL status, and special education status • Teacher-level main effects: teaching-to-the-test categories compared to general instruction on tested objectives; state test awareness categories compared to test-averse teachers
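For readers who want the model written out, the analysis described above can be sketched as a two-level random-intercept model. The slides do not give the equations, so everything here is an assumption: the symbols, the random-intercept-only structure, and the dummy coding of the two teacher-level predictors (with general instruction and test-averse as the reference categories, as the slide states). The block assumes the amsmath package.

```latex
\begin{align*}
% Outcome: student score standardized against the statewide distribution
% for the student's grade g (assumed form of "standardized relative to
% statewide test performance").
z_{ij} &= \frac{x_{ij} - \mu^{\text{state}}_{g}}{\sigma^{\text{state}}_{g}} \\[4pt]
% Level 1: student i in classroom j, with student-level controls.
z_{ij} &= \beta_{0j} + \beta_{1}\,\text{Minority}_{ij}
          + \beta_{2}\,\text{ELL}_{ij}
          + \beta_{3}\,\text{SpEd}_{ij} + r_{ij} \\[4pt]
% Level 2: classroom intercept predicted by dummy-coded teacher categories.
\beta_{0j} &= \gamma_{00}
          + \sum_{k} \gamma_{0k}\,\text{Prep}_{kj}
          + \sum_{m} \delta_{0m}\,\text{Aware}_{mj} + u_{0j}
\end{align*}
```

Under this sketch, each $\gamma_{0k}$ is the adjusted difference between a teaching-to-the-test category and general instruction on tested objectives, and each $\delta_{0m}$ is the adjusted difference between a state test awareness category and test-averse teachers, both in statewide standard deviation units.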

Results After controlling for student demographics • teaching to the test did not predict achievement • being test-secure did predict achievement; students of test-secure teachers performed half a standard deviation better on the state test than students of test-averse teachers • there was no difference in performance between students whose teachers were test-averse and those whose teachers were state test focused or out-of-state test focused

Final Model Predicting Mathematics Achievement * indicates statistically significant relationship at p<0.05.

Possible interpretations • Teaching to the test does not work • The teachers are teaching state standards in a relatively uniform way • The test does not detect instructional efforts

Instructional sensitivity The degree to which a test can detect differences in the instruction students receive. With teachers who do not teach state standards With teachers who teach state standards well

Big question… • How do we know what instruction has occurred? (opportunity to learn) • Instructional sensitivity: the degree of correspondence between opportunity to learn and test performance

Measuring opportunity to learn • Teaching to the test is one (gross) approach • Alignment: How consistent were test items and instructional efforts in terms of content and cognitive demand? • Emphasis: Were most heavily tested concepts fully addressed? • The interaction of alignment and emphasis is perhaps the best estimate and should also correlate with achievement
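The idea that alignment, emphasis, and their interaction should each correlate with achievement can be illustrated with a small calculation. This is a minimal sketch, not the study's analysis (which was multilevel): the numeric codes and classroom means below are hypothetical values invented for illustration, and `center` and `pearson` are helper names introduced here.

```python
from math import sqrt


def center(values):
    """Subtract the mean from each value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson(xs, ys):
    """Pearson correlation computed from centered cross-products."""
    cx, cy = center(xs), center(ys)
    numerator = sum(a * b for a, b in zip(cx, cy))
    denominator = sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return numerator / denominator


# Hypothetical classroom-level codes, NOT data from the study.
alignment = [1, 2, 3, 3, 2, 1]   # e.g. 1 = some, 2 = close, 3 = perfect
emphasis = [2, 4, 6, 7, 3, 1]    # e.g. ordinal frequency-of-instruction code
mean_score = [-0.3, 0.0, 0.4, 0.5, 0.1, -0.4]  # classroom mean z-score

# Interaction term: product of the centered predictors.
interaction = [a * e for a, e in zip(center(alignment), center(emphasis))]

print("alignment vs. score:  ", round(pearson(alignment, mean_score), 2))
print("emphasis vs. score:   ", round(pearson(emphasis, mean_score), 2))
print("interaction vs. score:", round(pearson(interaction, mean_score), 2))
```

Centering before multiplying is a common choice that keeps the product term from simply restating the two main effects; a simple correlation like this ignores the nesting of students in classrooms and the demographic controls.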

Alignment as Opportunity to Learn [Figure: diagram contrasting the test with instruction; recoverable labels: "Test", "teach skill unlike test"]

My instructional sensitivity study Based on interviews with teachers about their teaching and assessment of the two objectives most heavily emphasized on the state test

Measuring alignment

For example The teacher who drew these examples was coded as having “close alignment” to AIMS because she required students to solve problems involving three sets of items using a tree diagram. She did not, however, present students with tree diagrams that they had to interpret (required for “perfect” alignment).

Distribution of alignment scores by grade level [Figure; categories: Perfect Alignment, Close Alignment, Some Alignment]

Distribution of emphasis scores by grade level [Figure; frequency of instruction categories: Daily, Weekly, Every other week, Monthly, 2 weeks per year, 1 week per year, 1-2 lessons, Not taught]

Analysis • Conducted a multilevel analysis with students nested within classrooms • Predicted mathematics achievement on the state standards-based assessment, standardized relative to statewide test performance, run separately by grade level • Controlled for student-level minority status, ELL status, special education status, teacher experience and education, school-level free lunch eligibility, and prior achievement on a norm-referenced test • Teacher-level main effects: alignment; emphasis; alignment x emphasis interaction

Results • None of the main effects predicted achievement after controlling for prior achievement and demographics at fifth grade • Alignment predicted achievement at third grade after accounting for prior achievement and free lunch eligibility; students whose teachers were a standard deviation above the mean in alignment scored a tenth of a standard deviation above the sample mean on the state test

Final Model Predicting Mathematics Achievement, Third Grade * indicates statistically significant relationship at p<0.05.

Possible interpretations • Test is instructionally sensitive to a limited degree at one grade level, but not the other • Objectives selected impacted results; third-grade objectives comprised less of the curriculum (to teach them you had to be very aware of their presence on the test), while fifth-grade objectives reflect commonly taught skills.

Implications • Need to evaluate instructional sensitivity if we want to use large-scale assessments for accountability or to guide instruction • Sensitivity of total test scores • Review item sensitivity during test development

Exploring item sensitivity Two approaches recommended by Popham and Kaase (2009): 1. Judgmental review of test items 2. Differential item functioning based on content teachers report teaching well and teaching poorly. So far, only approach 2 has been studied. Found no relationship between content teachers said they taught badly (or didn’t teach) and item functioning.

Another approach Combines both approaches… • Teachers review a test and identify items they consider problematic • Compare classroom level and statewide item difficulties across the entire test • Determine if teacher-identified items perform differently
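The comparison step in this combined approach can be sketched as follows: compute the classroom-minus-state difference in proportion correct for every item, then check whether the items the teacher flagged stand apart from the rest. This is a descriptive sketch only, not a formal DIF statistic, and all proportions and the flagged item set below are hypothetical values, not data from the study.

```python
# Hypothetical proportions correct per item, NOT data from the study.
state_p = {1: 0.80, 2: 0.65, 3: 0.49, 4: 0.72, 5: 0.55}
class_p = {1: 0.78, 2: 0.60, 3: 0.10, 4: 0.70, 5: 0.50}
flagged = {3}  # items the teacher identified as problematic (hypothetical)


def difficulty_gaps(classroom, state):
    """Classroom minus state proportion correct for each item."""
    return {item: classroom[item] - state[item] for item in state}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


gaps = difficulty_gaps(class_p, state_p)
flagged_gap = mean(gaps[i] for i in gaps if i in flagged)
other_gap = mean(gaps[i] for i in gaps if i not in flagged)

print(f"mean gap on flagged items: {flagged_gap:+.3f}")
print(f"mean gap on other items:   {other_gap:+.3f}")
```

If teacher-identified items perform differently, the flagged mean should sit well away from the mean for the remaining items; with classroom samples of 19 to 30 students, any such difference would need to be read cautiously.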

Visual Analysis: An Example of DIF

Participants: This analysis • 10 third-grade and 12 fifth-grade teachers from the same data collection • Number of student test scores per classroom ranged from 19 to 30

Teacher who reported instructional alignment

Teacher concerned with a few items

Teacher concerned with test emphasis
