Carole Gallagher Director of Research - Standards, Assessment, and Accountability Services

Principles of Evidence-Centered Design: Practical Considerations for State Assessment Development Efforts Carole Gallagher Director of Research - Standards, Assessment, and Accountability Services cgallag@wested.org Deb Sigman Director, Center on Standards and Assessment Implementation dsigman@wested.org December 8, 2017

Principles of Evidence-Centered Design (ECD) • Grounded in the fundamentals of cognitive research and theory that emerged more than 20 years ago, particularly in response to emerging technology-supported assessment practices • National Research Council, Mislevy et al. (at ETS-CBAL), Haertel (at SRI-PADI), Herman et al. (at CRESST) • Intended to ensure a rigorous process/framework for developing tests that (a) measure the constructs developers claim they measure and (b) yield the necessary evidence to support the validity of inferences or claims drawn from results

Principles of ECD • Focuses on defining explicit claims and pairing these claims with evidence of learning to develop a system of claim-evidence pairs to guide test development • ECD proponents view assessment as an evidentiary argument (Mislevy et al.) or a systematic pathway for linking propositions (e.g., standards define learning expectations), design claims (e.g., learning expectations are clear and realistic), and types of evidence (e.g., expert review, psychometric analyses) (Herman et al.)
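
The idea of a system of claim-evidence pairs can be sketched as a small data structure. This is only an illustration: the `Claim` class, the `unsupported_claims` helper, and the second claim are hypothetical, and the evidence types simply echo the examples on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A statement the developer wants to make about the assessment."""
    text: str
    # Evidence types paired with this claim (hypothetical free-text labels).
    evidence: list[str] = field(default_factory=list)


def unsupported_claims(claims: list[Claim]) -> list[str]:
    """Return the text of every claim that has no evidence paired with it yet."""
    return [c.text for c in claims if not c.evidence]


# Hypothetical claims for illustration only.
claims = [
    Claim("Learning expectations are clear and realistic",
          evidence=["expert review", "psychometric analyses"]),
    Claim("Items measure the intended standards"),
]

print(unsupported_claims(claims))
```

Listing claims that still lack evidence is one simple way to keep the pairing visible while a test is being developed.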

Core Elements of ECD • To meet the expectations for ECD, five interrelated activities (“layers”) are required: • Domain Analysis • Domain Modeling • Conceptual Assessment Framework • Assessment Implementation • Assessment Delivery

Layer 1: Domain Analysis • Identifies the elements (each with multiple levels of detail) of the content that will be assessed • Generally requires complex maps of the content domain that include information about the concepts, terminology, representational forms (e.g., content standards, algebraic notation), or ways of interacting that professionals working in the domain use, as well as how knowledge is constructed, acquired, used, and communicated • Goal is to understand the knowledge people use in a domain and the features of situations that call for the use of valued knowledge, procedures, and strategies

Layer 2: Domain Modeling • Goal is to organize information and relationships uncovered during Layer 1 into assessment argument schemas based on the big ideas of the domain and “design patterns” (recurring themes) • What does the assessment intend to measure? (KSAs) • How will it do so? (characteristic and variable task features) • What types of evidence can be identified from work products? • Generally relies on representations that convey information to, and capture information from, students

Layer 3: Conceptual Assessment Framework • Focuses on design structures (student, evidence, and task models), evaluation procedures, and the measurement model • Goal is to develop specifications or templates for items, tasks, tests, and test assembly • Generally, domain information is combined with goals, constraints, and logic to support blueprint development • Includes student, evidence, and task models, item generation models, generic rubrics, and algorithms for automated scoring

Layer 4: Assessment Implementation • Goal is to implement assessment, including presentation-ready tasks and calibrated measurement models • Includes item or task development, scoring methods, statistical models, and production of test forms or algorithms, all done in accordance with specifications • Pilot test data may be used to refine evaluation procedures and fit measurement models

Layer 5: Assessment Delivery • This is where students interact with tasks, have their performances evaluated, and receive feedback • Administration, scoring, and reporting of results at item/task and test levels • Work products are evaluated, scores assigned, data files created, and reports distributed to examinees

Constraints of ECD in Large-Scale Assessment • Ideal for research studies exploring innovative assessment activities • Complexity of language in approach, however, does not translate easily to standard practice (e.g., assessment as argument) • Very resource intensive in terms of time and cost • Not all developers have strong knowledge base about ECD, but nearly all understand the essential need for documentation and collection of evidence to support test use

So…enter, Evidence-Based Approach (EBA) • Familiar language, as it appears in Standards for Educational and Psychological Testing, peer review guidance, and assessment technical reports • Same focus on defining the explicit claims that developers seek to make and matching these claims with evidence of learning • Same focus on strategic collection of evidence from a range of sources to support test use for a specific purpose: construct, content, consequential, and predictive validity; reliability; fairness; feasibility

Evidence-Based Approach • Like Universal Design (ud vs UD), not all evidence-based approaches are ECD • But EBA does ensure ongoing, systematic collection and evaluation of information to support claims about the trustworthiness of inferences drawn from the results of an assessment (Kane, 2002; Mislevy, 2007; Sireci, 2007) • Validation is still intended to help developers build a logical, coherent argument for test use that is supported by particular types of evidence

Meets Highest Expectations in Current Era • With NGSS-based assessments, per BOTA (2014, Recommendation 3.1): Assessment designers should follow a systematic and principled approach, such as evidence-centered design or construct modeling. Multiple forms of evidence need to be assembled to support the validity argument for an assessment’s intended interpretative use and to ensure equity and fairness.

Key Reminders from Joint Standards • The different types of evidence are collected on an ongoing basis, through all phases of design, development, and implementation of the assessment • A test is not considered “validated” at any particular point in time; rather, it is expected that developers gather information systematically and use it in thoughtful and intentional ways • Gradual accumulation of a body of specific types of evidence that can be presented in defense of test use for a particular purpose

Application of EBA to Test Development: Use of an Assessment Framework • Useful tool for defining and clarifying what the standards at each grade mean and for supporting a common understanding of the grade-specific priorities for assessment • Ensures that developers begin collecting evidence from the outset to support the claim that assessments measure what they claim to measure fairly and reliably • Enables alignment among standards, assessments, and instruction and ensures transparency about what is tested • Ensures a purposeful and systematic analysis of key elements prior to development of any blueprints or items

Assessment Framework • Lays out the content in a domain and defines for test developers which content is eligible for assessment at each grade and how content will be assessed • Describes how items will be developed so that they have the specific characteristics (“specifications”) needed to measure content in the domain • Leads to development of blueprints for an item pool or test form that ensure the right types and numbers of test items are developed to fully measure assessable standards in the content area and that define intended emphases and/or assessment objectives for each grade • Outlines expected test administration practices (e.g., allowable accommodations, policy on calculator use)

Framework Step 1: Focus on Content • Making expectations for testing clear by specifying the content to be assessed. Guiding questions: • What are the indicators of college- and career-readiness at each grade? • What are the state’s priorities for assessment at each grade? What are the grade-specific outcomes it is seeking? Has the state adequately defined the content and constructs that are the targets for assessment? • Which standards will be assessed on the summative assessment? • What steps will be taken to ensure that the state has addressed the full range (depth and breadth) of the standards intended at each grade?

Framework Step 2: Measurement Model • Generally, developers seek to measure a particular characteristic (e.g., understanding of a scientific concept) and use the results to make some sort of decision about students’ level of this characteristic • This is the foundation for a measurement model (or theory of action) that describes how expertise or competence in a content domain (e.g., science) is measured • A measurement model holds that students have different levels of this characteristic and, when measured appropriately, their “score” on this characteristic will fall along a continuum from least to most

Framework Step 2: Measurement Model • The measurement model is closely linked to content to be assessed and to the specific item types used to measure the characteristics of interest, i.e., that can be used to elicit meaningful information (responses) from students about the precise location of their level of expertise (score) for that characteristic • According to the National Research Council (2001), demonstrating understanding of the measurement model underlying item and test development is an important piece of evidence to support the validity of results that emerge from these assessments
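
The slides do not name a specific measurement model. As one hedged illustration of a score falling along a continuum, the sketch below assumes a Rasch (one-parameter logistic) model, a widely used choice that is an assumption here rather than something the source prescribes. The function name `p_correct` and the ability values are hypothetical.

```python
import math


def p_correct(theta: float, difficulty: float) -> float:
    """Rasch model: probability that a student located at `theta` on the
    continuum answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Hypothetical student locations, from lower to higher on the continuum.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, difficulty=0.0), 3))
```

Under this assumed model, a student whose location equals the item difficulty has an even chance of answering correctly, and the probability rises as the student's location moves up the continuum.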

Framework Step 2: Measurement Model • Guiding questions to consider: • How will each intended outcome defined above be measured? How will the state confirm that its measures address all aspects of the intended outcome? • What is the theoretical foundation or theory of action for the state’s measurement plan? How is it linked to expertise in this domain? • How will the state know if students have mastered/are proficient in the instructional priorities it identified? How will levels of achievement for each indicator be differentiated? • How will the state test the accuracy and usefulness of its measurement model for the purpose intended? What evidence should be collected?

Framework Step 3: Item Types • Determine which item types are developmentally appropriate for students in tested grades and most effective for measuring the content in the domain • e.g., open ended, selected response, short answer, extended answer, essay, presentation, demonstration, production, performance task • Each item type will be explicitly linked to a particular standard or group of standards and will fit the proposed measurement model • Clear links among the target content, the measurement model, and the item types must be evident • Key consideration: accessibility for all student subgroups

Framework Step 3: Item Types • Guiding questions to consider: • What item characteristics are needed to effectively measure the intended content of the state assessment? • Which item types most strongly demonstrate those identified characteristics? What evidence supports claims that these item types are appropriate for a specific assessment purpose? • What evidence links each item type with the measurement model? With a standard or cluster of standards? • Who will design the templates for each item type?

Framework Step 4: Specifications • Describe the important criteria or dimensions (e.g., complexity, length, number of response options) for each item type that are needed to effectively measure different standards • Bring consistency and quality assurance to the ways in which items will be presented, formatted, and used on each assessment • Help to ensure that all subsequent decisions reached during item development by teachers, expert panels, and/or state leaders in diverse locations are guided by the same set of pre-established guidelines • Item-type specifications include guidelines for selection of associated stimuli and for the administration and scoring of each item type • E.g., types and level of complexity of passages, graphics, or other support materials • May include allowable strategies for modifying/differentiating item types to meet the assessment needs of special student populations

Framework Step 4: Specifications • Guiding questions include the following: • What are the presentation and/or formatting characteristics that should be prescribed for each item type? • What assessment practices (e.g., calculator use) will be allowed? • What types of stimuli (e.g., passages) will be used? • What are allowable strategies for adapting or modifying items to maximize access for special student populations?
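
Item-type specifications of the kind described above can be written down as data and checked automatically. The following is a minimal sketch under stated assumptions: the `ItemSpec` fields, the `violations` helper, and the sample item are all hypothetical and cover only two of the criteria the slides mention (length and number of response options).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemSpec:
    """Hypothetical specification for one item type."""
    item_type: str
    n_options: int        # required number of response options
    max_stem_words: int   # upper limit on stem length, in words


def violations(spec: ItemSpec, stem: str, options: list[str]) -> list[str]:
    """Return a description of each way a draft item departs from the spec."""
    problems = []
    if len(options) != spec.n_options:
        problems.append(f"expected {spec.n_options} options, found {len(options)}")
    if len(stem.split()) > spec.max_stem_words:
        problems.append(f"stem exceeds {spec.max_stem_words} words")
    return problems


spec = ItemSpec("selected response", n_options=4, max_stem_words=40)
stem = "Which value of x satisfies 2x + 3 = 11?"
print(violations(spec, stem, ["2", "4", "7"]))
```

Applying the same check to every draft item is one way to keep developers in different locations working to the same pre-established guidelines.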

Framework Step 5: Blueprints (finally!) • Describes the overall design for the test or item pool • Prescribes the numbers of each item type needed, the balance of representation (e.g., percentages of items for each standard or groups of standards), and levels of cognitive demand (e.g., DOK 1, 2, 3, 4) to be assessed • Defines test or item pool breadth, depth, and length/size • In keeping with the measurement model, this combination of item types is expected to provide a comprehensive and coherent picture of what students know and can do in relation to the characteristic of interest
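
A blueprint of this kind can be sketched as a table of rows, one per combination of standard group, item type, and DOK level. Everything in the example is hypothetical: the `BlueprintRow` fields, the group labels, and the item counts are made up to show how the balance of representation might be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlueprintRow:
    standard_group: str  # hypothetical label for a standard or group of standards
    item_type: str       # e.g., "selected response"
    dok: int             # Depth of Knowledge level, 1 to 4
    n_items: int         # number of items prescribed


def percent_by_group(rows: list[BlueprintRow]) -> dict[str, float]:
    """Share of the total item count devoted to each standard group, in percent."""
    total = sum(r.n_items for r in rows)
    shares: dict[str, float] = {}
    for r in rows:
        shares[r.standard_group] = (
            shares.get(r.standard_group, 0.0) + 100.0 * r.n_items / total
        )
    return shares


# Hypothetical blueprint for a single test form.
blueprint = [
    BlueprintRow("Group A", "selected response", 1, 10),
    BlueprintRow("Group A", "short answer", 2, 6),
    BlueprintRow("Group B", "selected response", 2, 12),
    BlueprintRow("Group B", "performance task", 4, 2),
    BlueprintRow("Group C", "extended answer", 3, 10),
]

print(sum(r.n_items for r in blueprint))  # test length
print(percent_by_group(blueprint))        # balance of representation
```

The same rows could be grouped by `item_type` or `dok` to check the other dimensions the blueprint prescribes.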

Framework Step 5: Blueprints • Guiding questions to consider: • How many of each item type/template will be needed for each assessment or item pool at each grade? • How will the items be distributed across the standards at each grade? • How will different item templates be combined on state measures to address the full range of the standards at each grade? Have we addressed the full range (depth and breadth) of the standards intended at each grade? • How can we ensure a balance between necessary test length (for full coverage of the standards and to ensure sufficient reliability) and burden to students and schools?
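
The trade-off between test length and reliability raised in the last guiding question can be explored with the Spearman-Brown prophecy formula. The slides do not mention this formula; it is a standard result from classical test theory and assumes that any added items are parallel to the existing ones. The reliability values and the 30-item starting length below are hypothetical.

```python
import math


def predicted_reliability(rho: float, k: float) -> float:
    """Spearman-Brown: reliability after changing test length by a factor k."""
    return k * rho / (1.0 + (k - 1.0) * rho)


def length_factor(rho: float, target: float) -> float:
    """Factor by which the test must be lengthened to reach the target reliability."""
    return target * (1.0 - rho) / (rho * (1.0 - target))


# Hypothetical case: a 30-item test with reliability 0.80, target 0.90.
k = length_factor(0.80, 0.90)
print(round(k, 2), math.ceil(30 * k))
```

In this hypothetical case the test would need to be about 2.25 times as long, or 68 items, which makes the burden on students and schools explicit.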

Framework Step 6: Item Development • Develop plan and schedule for subsequent item development that will yield sufficient numbers of the right types of items with the right specifications to meet blueprint needs • Should describe how an EBA will be used to ensure that each item is strongly linked to the content intended to be assessed through the measurement model • Conduct inventory of items in the existing bank so a development target can be set for each grade • Target should provide for item overdevelopment, as nearly half of the new items may not be usable after review and refinement • Identify who will do the work (e.g., teachers? content specialists? contractor?), as well as how (multiple rounds of review?) and when
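
The overdevelopment target can be worked out with simple arithmetic. The sketch below assumes a survival rate of 0.5, reflecting the slide's statement that nearly half of new items may not be usable. The function name and the item counts are hypothetical.

```python
import math


def development_target(needed: int, in_bank: int, survival_rate: float = 0.5) -> int:
    """Number of new items to write so that enough survive review.

    needed:        items the blueprint requires
    in_bank:       usable items already in the existing bank
    survival_rate: assumed share of new items still usable after review
    """
    if not 0.0 < survival_rate <= 1.0:
        raise ValueError("survival_rate must be in (0, 1]")
    gap = max(needed - in_bank, 0)
    return math.ceil(gap / survival_rate)


# Hypothetical grade: blueprint needs 120 items, bank already holds 45.
print(development_target(needed=120, in_bank=45))
```

With these hypothetical numbers the gap is 75 items, so roughly 150 would need to be drafted.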

Framework Step 6: Item Development • Guiding questions to consider: • How will the state ensure development of sufficient numbers of different item templates to match blueprint needs? • How many of each item type will need to be developed to augment the existing item pool at each grade? • Who will develop the items necessary to meet the target number at each grade? How will they be recruited? • For each item template, to what general and specific quality standards should all developers be held? Will developers be expected to meet quotas?

Document, Document, Document • Must document all decisions made during every step in this process • Rationales for key decisions may be provided to support the defensibility of each in relation to the high stakes involved and to reinforce the state’s commitment to transparency in testing • Guiding questions to consider: • How will decisions and processes be documented? • Who will review and approve all decisions? How will changes be recorded? • How will decisions based on reviews, pilot testing, and field testing be documented and their impact monitored?
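
One lightweight way to address these questions is a structured decision log. This is a sketch only: the `Decision` fields, the `pending_approval` helper, and both log entries are hypothetical and not drawn from the source.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Decision:
    """One documented decision; frozen so entries are not edited after the fact."""
    made_on: date
    step: str
    decision: str
    rationale: str
    approved_by: Optional[str] = None  # None until someone reviews and approves


def pending_approval(log: list[Decision]) -> list[Decision]:
    """Return the decisions that have not yet been reviewed and approved."""
    return [d for d in log if d.approved_by is None]


# Hypothetical entries for illustration only.
log = [
    Decision(date(2017, 12, 8), "Step 4: Specifications",
             "Allow calculator use in grade 8", "Hypothetical rationale",
             approved_by="Review panel"),
    Decision(date(2017, 12, 8), "Step 5: Blueprints",
             "Set test length at 40 items", "Hypothetical rationale"),
]

for d in pending_approval(log):
    print(d.step, "-", d.decision)
```

Recording a change as a new entry rather than editing an old one keeps a trail of how and when each decision evolved.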

Selected References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Haertel, G., Haydel DeBarger, A., Villalba, S., Hamel, L., & Mitman Colker, A. (2010). Integration of Evidence-Centered Design and Universal Design Principles Using PADI, an Online Assessment Design System. Assessment for Students with Disabilities Technical Report 3. Menlo Park, CA: SRI International.
Herman, J. & Linn, R. (2015). Evidence-Centered Design: A Summary. Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Kane, M. (2002, Spring). Validating high stakes testing programs. Educational Measurement: Issues & Practice, 21(1), 31–41.
Mislevy, R. & Haertel, G. (2006). Implications for evidence-centered assessment design for educational assessment. Educational Measurement: Issues & Practice, 25(4), 6–20.
Mislevy, R. & Riconscente, M. (2005). Evidence-Centered Assessment Design: Layers, Structures, and Terminology. PADI Technical Report Series. Menlo Park, CA: SRI International.
National Research Council. (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
National Research Council. (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., and Glaser, R., editors. Board on Testing and Assessment, Center for Education. Division of Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
Sireci, S. (2007). On validity theory and test validation. Educational Researcher, 36(8), 477–481.
Snow, E., Fulkerson, D., Feng, M., Nichols, P., Mislevy, R., & Haertel, G. (2010). Leveraging Evidence-Centered Design in Large-Scale Test Development (Large-Scale Assessment Technical Report 4). Menlo Park, CA: SRI International.

This document is produced by the Center on Standards and Assessment Implementation (CSAI). CSAI, a collaboration between WestEd and CRESST, provides state education agencies (SEAs) and Regional Comprehensive Centers (RCCs) with research support, technical assistance, tools, and other resources to help inform decisions about standards, assessment, and accountability. Visit www.csai-online.org for more information. This document was produced under prime award #S283B050022A between the U.S. Department of Education and WestEd. The findings and opinions expressed herein are those of the author(s) and do not reflect the positions or policies of the U.S. Department of Education.
