Growth Scales and Pathways. William D. Schafer University of Maryland and Jon S. Twing Pearson Educational Measurement. NCLB leaves some unmet policy needs. Assessment of student-level growth. Sensitivity to change within achievement levels. Assessment and accountability at all grades.
Descriptions of what students are able to do in terms of next steps
How can we meet these needs? Our approach starts with measurement of growth through cross-grade scaling of achievement
Current work centers on vertical scales, in which common items for adjacent grades are used to generate a common scale across grades.
Another approach is grade-equivalents. Both are continuous cross-grade scales.
We only have three problems with continuous cross-grade scales: • The Past • The Present • The Future
Why the Past? Ignores instructional history. The same student score should be interpreted differently depending on the grade level of the student.
Why the Present? Relationships among items may (probably do) differ depending on grade level of the student. (e.g., easy fifth grade items may be difficult for fourth graders) Lack of true equating. It is better for fourth graders to take fourth grade tests and for fifth graders to take fifth grade tests.
Why the Future? Instructional expectations differ. A score of GE = 5.0 (or VS = 500) carries different growth expectations from a fifth-grade experience next year for a current fifth grader than for a current fourth grader.
We do need to take seriously the interests of policymakers in continuous scaling. But the problems with grade-equivalents and vertical scaling may be too severe to recommend them. Here are seven criteria that an alternate system should demonstrate.
1. Implement the Fundamental Accountability Mission Test all students on what they are supposed to be learning.
2. Assess all contents at all grades. Educators should be accountable for all public expenditures. Apply this principle at least to all non-affective outcomes of schooling.
3. Define tested domains explicitly. Teachers need to understand their learning targets in terms of • Knowledge (what students know): factual, conceptual, procedural • Cognition (what they do with it)
4. Base test interpretations on the future. We can’t change the past, but we can design the future. It can be more meaningful to think about what students are prepared for than about what they have learned.
5. Inform decision making about students, teachers, and programs. Within the limits of privacy, gathering data for accountability judgments about everyone and everything (within reason) will help decision makers reach the most informed decisions. This also means that we will associate assessments with those who are responsible for improving them.
6. Emphasize predictive evidence of validity. Basing assessment interpretations on the future (see point 4) suggests that our best evidence to validate our interpretations is how well they predicted in the past.
7. Capitalize on both criterion and norm referencing. Score reports need to satisfy the needs of the recipients. Both criterion-referencing (what students are prepared to do) and norm-referencing (how many are as, more, and less prepared) convey information that is useful. Other things equal, more information is better than less.
Our Approach to the Criteria Many of the criteria are self-satisfying. Some recent and new concepts are needed. Four recent or new concepts: • Socially moderated standard setting • Operationally defined exit competencies • Growth scaling • Growth pathways
Socially Moderated Standard Setting (Ferrara, Johnson, & Chen, 2005). Judges set achievement level cut points where students have the prerequisites for the same achievement level next year. Note the future orientation of the achievement levels. This concept also underlies Lissitz & Huynh’s (2003) concept of vertically moderated standards.
Operationally Defined Exit Competencies If we implement socially moderated standards, where do the cut points for the 12th grade come from? Our suggestion is to base them on what the students are prepared for, such as (1) college credit, (2) ready for college, (3) needs college remediation, (4) satisfies federal ability-to-benefit rules, (5) capable of independent living, (6) below. Modify as needed for lower grades (e.g., fewer levels) and certain contents (e.g., athletics, music)
Growth Scaling Some elements of this have been used in Texas and Washington State. Test at each grade level separately for any content (i.e., only grade-level items). Report using a three-digit scale. The first digit is the grade level. The last two digits are a linear transform of the within-grade score that fixes the lower “proficient” (e.g., 40) and “advanced” (e.g., 60) cut points. With more than three levels, could transform non-linearly to fix all cut points.
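The three-digit reporting scale can be sketched in a few lines. This is a minimal illustration, not the procedure used in Texas or Washington State: the function name, the raw-score cut points, and the clamping to 0-99 are assumptions, and the sketch only handles single-digit grades.

```python
def growth_scale_score(grade, raw, proficient_cut, advanced_cut,
                       proficient_target=40, advanced_target=60):
    """Return a three-digit score: grade digit followed by a two-digit within-grade score.

    Assumptions (not from the slides): raw-score cut points are supplied by the
    caller, the result is clamped to 0-99, and only grades 1-9 are supported.
    """
    if not 1 <= grade <= 9:
        raise ValueError("this sketch only handles single-digit grades")
    # Linear transform that sends the proficient cut to 40 and the advanced cut to 60.
    slope = (advanced_target - proficient_target) / (advanced_cut - proficient_cut)
    within = proficient_target + slope * (raw - proficient_cut)
    within = max(0, min(99, round(within)))
    return grade * 100 + within


# Hypothetical grade 5 test with raw cut points of 30 (proficient) and 45 (advanced).
at_proficient = growth_scale_score(5, 30, proficient_cut=30, advanced_cut=45)
at_advanced = growth_scale_score(5, 45, proficient_cut=30, advanced_cut=45)
```

Under these assumed cut points, a fifth grader exactly at the proficient cut is reported as 540 and one exactly at the advanced cut as 560.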
Growth Pathways Given that content is backmapped (Wiggins & McTighe, 1998) and achievement levels are socially moderated, we can express achievement results in terms of readiness for growth (next year, at 12th grade, or both). We can generate transition matrices to express the likelihoods of various futures for students.
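A transition matrix of this kind could be estimated from matched student records roughly as follows. The level names and the six matched pairs are made up for illustration, and `transition_matrix` is a hypothetical helper rather than anything described in the slides.

```python
from collections import Counter


def transition_matrix(pairs, levels):
    """Estimate P(next-year level | this-year level) from matched (this, next) pairs."""
    counts = Counter(pairs)
    matrix = {}
    for start in levels:
        row_total = sum(counts[(start, end)] for end in levels)
        # Each row holds the observed proportions of students moving to each level.
        matrix[start] = {
            end: (counts[(start, end)] / row_total if row_total else 0.0)
            for end in levels
        }
    return matrix


# Made-up matched records: (level this year, level next year).
levels = ["basic", "proficient", "advanced"]
pairs = [
    ("basic", "basic"), ("basic", "proficient"),
    ("proficient", "proficient"), ("proficient", "proficient"),
    ("proficient", "advanced"), ("advanced", "advanced"),
]
matrix = transition_matrix(pairs, levels)
```

Each row of `matrix` can be read as the likelihoods of next-year outcomes for students at a given level this year. Multiplying such matrices across successive grades would give likelihoods of 12th-grade outcomes, assuming the transitions are stable over time.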
Adequate Yearly Progress Capitalizing on Hill et al. (2005), we can use growth pathways as the basis for expectations and award points to students who meet, fall below, or exceed their expectations based on year-ago achievement levels.
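One way such point awards might look is sketched below. The point values (0, 1, 2) and the rule that the expectation is to hold the same achievement level as a year ago are assumptions made for illustration; they are not taken from Hill et al. (2005) or from the slides.

```python
def ayp_points(previous_level, current_level, levels, awards=None):
    """Award points by comparing this year's level with the year-ago level.

    Assumption: the expectation is to stay at the same (socially moderated) level.
    The default point values are hypothetical.
    """
    if awards is None:
        awards = {"below": 0, "met": 1, "above": 2}
    expected = levels.index(previous_level)
    actual = levels.index(current_level)
    if actual < expected:
        return awards["below"]
    if actual > expected:
        return awards["above"]
    return awards["met"]


levels = ["basic", "proficient", "advanced"]
# Made-up (year-ago level, current level) pairs for three students.
history = [("basic", "proficient"), ("proficient", "proficient"), ("advanced", "proficient")]
school_total = sum(ayp_points(prev, cur, levels) for prev, cur in history)
```

A school or program total is then just the sum (or mean) of student awards, which keeps the accountability index sensitive to movement at every achievement level rather than only at the passing cut.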
Existing Empirical State Data Using existing data, we explored some of these concepts. Two data sets were used from Texas. • All data is in the public domain and can be obtained from the Texas website. • Current Texas data is used: TAKS • Previous Texas data is used: TAAS
Immediate Observations - TAAS Data Passing standards appear to be relatively lenient. • Actual standards were set in fall of 1989. • Curriculum change occurred in 2000. Texas Learning Index (TLI) • Is a variation of the “Growth Scaling” model previously discussed. • Will be discussed in more detail shortly. Despite the leniency of the standard, average cross-sectional gain is shown with the TLI. • About a 2.5 TLI value gain on average (across grades).
Immediate Observations - TAKS Data Passing standards appear to be more severe than TAAS, but the majority of students still pass for the most part. • Standards were set using Item Mapping and field-test data in 2003. • Standards were “phased in” by the SBOE. • “Passing” is labeled as “Met the Standard”. Scale scores are transformations of Rasch calibrations carried out within each grade and subject. • Scales were set such that 2100 is always “passing”. • “Socially moderated” expectation that a 2100 this year is equal to a 2100 next year. • We will look at this in another slide shortly.
Immediate Observations - TAKS Data Some Issues/Problems seem obvious: • Use of field-test data and the lack of student motivation the first year. • Phase-in of the standards makes the meaning of “passing” difficult to understand. • Construct changes between grades 8 and 9. • Math increases in difficulty across the grades. • Cross-sectional gain scores show some progress, with between 20 and 35 point gains in average scaled score across grades and subjects. • Finally, the percentage of classifications (impact) resulting from the Item Mapping standard setting is quite varied.
A Pre-Organizer • Socially Moderated Standard Setting • Really sets the expectation of student performance in the next grade. • Growth Scaling • A different definition of growth. • Growth by fiat. • Operationally Defined Exit Competencies • How does a student exit the program? • How to migrate this definition down to other grades. • Growth Pathways • Cumulative probability of success. • Not addressed in this paper with Texas data.
Socially Moderated Standard Setting Consider the TAKS data in light of Socially Moderated Standard Setting. • The cut scores were determined separately by grade and subject using an Item Mapping procedure. • 2100 was selected as the transformation of the Rasch theta scale associated with passing. • 2100 became the passing standard for all grades and subjects. • Similar to the “quasi-vertical scale scores” procedure described by Ferrara et al. (2005).
Socially Moderated Standard Setting Despite implementation procedures, the standard setting yielded a somewhat inconsistent set of cut scores. • Panels consisted of on-grade and adjacent-grade educators. • Performance level descriptors were discussed both for the current grade and the next. • A review panel was convened to ensure continuity between grades within subjects. • This review panel comprised educators from all grades participating in the standard setting and used impact data for all grades as well as traditionally estimated vertical scaling information.
Socially Moderated Standard Setting Yet, some inconsistencies are hard to explain. • For example, the standards yielded the following passing rates for Reading: Grade 3: 81; Grade 4: 76; Grade 5: 67; Grade 6: 71. • Clearly, “social moderation” did not occur: • Differences in content standards from grade to grade. • Lack of a clearly defined procedure setting up expectation at the next grade. • Mitigating factors (i.e., “kids cry”; raw score percent correct, etc.).
Socially Moderated Standard Setting What about unanticipated consequences? • Are teachers, parents and the public calculating “gain score” differences between the grades based on these horizontal scale scores? • Will the expectation not be “2100 this year = 2100 next year”? This is similar to one of the concerns in Ferrara et al. (2005) that prohibited the research from being conducted. • In fact, based on simple regression using matched cohorts, the expectation is that a student with a scaled score of 2100 in grade 3 reading will earn a 2072 in grade 4 reading on average.
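The kind of matched-cohort check described in the last bullet can be sketched with an ordinary least-squares line. The five score pairs below are made up so that the arithmetic is easy to follow; they are not Texas data and do not reproduce the 2072 figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up matched scores for the same students in grade 3 and grade 4.
grade3 = [1900, 2000, 2100, 2200, 2300]
grade4 = [1950, 2010, 2070, 2130, 2190]

slope, intercept = fit_line(grade3, grade4)
# Expected grade 4 score for a student who scored exactly 2100 in grade 3.
expected_at_2100 = slope * 2100 + intercept
```

If the expected grade 4 score at 2100 falls below 2100, as it does in this made-up example, then “2100 this year = 2100 next year” does not hold on average for the matched cohort.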
Growth Scaling The TAAS TLI is an example of this type of “growth scale”. • A standard setting was performed for the “Exit Level” TAAS test. • This cut score was expressed in standard deviation units above or below the mean (i.e., a standard score). • This same distance was then articulated down to other grades. • The logic was one defining growth in terms of maintaining relative status as students move across the grades. • For example, if the passing standard was 1.0 standard deviation above the mean at Exit Level, then students who are 1.0 standard deviation above the mean in the lower grade distributions are “on track” to pass the Exit Level test provided they maintain their current standing / progress.
Growth Scaling • For convenience, the scales were transformed such that the passing standards were at 70. • Grade level designations were then added to further enhance the meaning of the score. • This score had some appealing reporting properties: • Passing was 70 at each grade. • Since the TLI is a standard score, gain measures could be calculated for “value added” statements.
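The logic of articulating the exit-level cut down to other grades can be sketched as a standard-score transform. This is an illustration of the idea, not the published TLI formula: the function name, the 15 points per standard deviation, and the grade-level mean and standard deviation are assumptions.

```python
def tli_like_score(raw, grade_mean, grade_sd, exit_cut_z,
                   passing=70, points_per_sd=15):
    """Rescale a raw score so that the articulated passing standard lands at 70.

    exit_cut_z is the exit-level cut score in standard deviation units.
    points_per_sd is an assumed scaling constant, not taken from the slides.
    """
    z = (raw - grade_mean) / grade_sd  # the student's relative standing within the grade
    return round(passing + points_per_sd * (z - exit_cut_z))


# Hypothetical grade distribution (mean 30, SD 8) and an exit-level cut 1.0 SD above the mean.
on_track = tli_like_score(38, grade_mean=30, grade_sd=8, exit_cut_z=1.0)
at_mean = tli_like_score(30, grade_mean=30, grade_sd=8, exit_cut_z=1.0)
```

A student who keeps the same relative standing from grade to grade keeps the same score under this transform, which is the sense in which growth is defined as maintaining relative status.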
Growth Scaling Some concerns were also noted: • Outside of the first cut score, the TLI was essentially “content standard” free. • Because it was based on distribution statistics, the distributions (like norms) would become dated. • Differences in the shapes of the distributions (e.g., test difficulty) would have an unknown impact on students actually being able to “hold their own”. • Differences in the content being measured across the grades are essentially irrelevant.
Operationally Defined Exit Competencies The TAKS actually has such a component at the Exit Level. • This is called the “Higher Education Readiness Component (HERC)” Standard. • Students must reach this standard to earn “dual college credit” and to be allowed credit for college level work. • Two types of research were conducted to provide information for “traditional” standard setting: • Correlations with existing measures (ACT & SAT). • Empirical study examining how well second semester freshmen performed on the Exit Level TAKS test.
Operationally Defined Exit Competencies This research yielded the following:
Operationally Defined Exit Competencies Some interesting observations: • HERC standard was taken to be 2200, different from that needed to graduate. • Second semester college freshmen did “marginally better” than the required passing standard for TAKS to graduate. • Predicted ACT and SAT scores support the notion that the TAKS passing standards are “moderately” difficult. • Given the content of the TAKS assessments, how could this standard be articulated down to lower grades?
Concluding Remarks Three possible enhancements that may or may not be intriguing for policymakers: • Grades as Achievement Levels • Information Rich Classrooms • Monetary Metric
Grades as Achievement Levels Associating letter grades with achievement levels would: • Provide meaningful interpretations for grades • Provide consistent meanings for grades • Force use as experts recommend • Enable concurrent evaluations of grades • Enable predictive evaluations of grades • Require help for teachers to implement
Information Rich Classrooms Concept is from Schafer & Moody (2004). Achievement goals would be clarified through test maps. Progress would be tracked at the content strand level throughout the year using combinations of formative and summative assessments (heavy role for computers). Achievement level assignments would occur incrementally throughout the year.
Monetary Metric for Value Added Economists would establish the value of each exit achievement level by estimating lifetime earned income. The earnings would be amortized across grade levels and contents. The “value added” for each student each year is the sum across contents of the products of the achievement level times the vector of probabilities of exit achievement levels times the vector of amortized monetary values. Enables cost-benefit analysis of education in a consistent metric for inputs and outputs.
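One reading of this computation is sketched below: for each content, look up the probabilities of the exit achievement levels given the student's current level, weight the amortized dollar values by those probabilities, and sum over contents. All names and numbers are hypothetical, and the slide's wording leaves room for other interpretations.

```python
def expected_value(level_probs, amortized_values):
    """Dot product of exit-level probabilities and amortized dollar values."""
    return sum(p * v for p, v in zip(level_probs, amortized_values))


def value_added(student_levels, exit_probs, amortized_values):
    """Sum the expected amortized values across contents for one student-year.

    student_levels: content -> current achievement level
    exit_probs: content -> level -> probabilities of each exit achievement level
    amortized_values: content -> dollar value of each exit achievement level
    """
    total = 0.0
    for content, level in student_levels.items():
        total += expected_value(exit_probs[content][level], amortized_values[content])
    return total


# Made-up inputs for one student with three exit levels per content.
student_levels = {"reading": "proficient", "math": "basic"}
exit_probs = {
    "reading": {"proficient": [0.2, 0.5, 0.3]},
    "math": {"basic": [0.6, 0.3, 0.1]},
}
amortized_values = {
    "reading": [100.0, 200.0, 400.0],
    "math": [100.0, 200.0, 400.0],
}
total_value = value_added(student_levels, exit_probs, amortized_values)
```

Because the result is expressed in dollars, it could in principle be set against per-student expenditures in the same metric, which is the cost-benefit comparison the slide has in mind.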