Validity in Action: State Assessment Validity Compliance with NCLB

Validity in Action: State Assessment Validity Evidence for Compliance with NCLB

William D. Schafer, Joyce Wang, and Vivian Wang, University of Maryland

Objectives • review the evidence that state testing programs provide to the United States Department of Education on the validity of their assessments • examine in detail the validity evidence that certain selected states provided for their peer reviews • make recommendations for improving the evidence submissions supporting validity for state assessments

Data Sources • official decision letters on each state's final assessment system under NCLB from USED; publicly available at www.ed.gov • peer review reports for five selected states • technical reports for available states that have received full approval from USED; downloaded from the web sites of each state

Types of Validity Evidence • the AERA/APA/NCME Standards lists five types of validity evidence • content-based evidence • response-process-based evidence • evidence based on internal structure • evidence based on relationships with other variables • evidence based on consequences • we will look at the judgments that each type should support in the context of statewide assessments of educational achievement

Content-Based Evidence judgments that need to be supported: • the domain is described in the academic content standards at the grade level • the test items sample that content domain appropriately • achievement level descriptions refer back to the content domain of the test

Response-Process-Based Evidence judgment that needs to be supported: • the activities the test demands of students are consistent with the cognitive processes the test is supposed to represent (as implied by the content standards)

Evidence Based on Internal Structure judgment that needs to be supported: • test score relationships are consistent with the strand structures of the academic content standards

Evidence Based on Relationships with Other Variables judgments that need to be supported: • higher correlations occur when traits are more similar • low correlations (perhaps partial correlations controlling for ability) exist with construct-irrelevant traits (e.g., gender, race-ethnicity, disability)
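The second judgment, a low correlation with a student characteristic once ability is taken into account, can be illustrated with a first-order partial correlation. This is a minimal sketch, not a prescribed method: the function names and the four-student data set are hypothetical, chosen so the arithmetic can be checked by hand. A real study would use a full student file and an independent ability measure of the state's choosing.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, holding z constant."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical toy data for four students.
score = [2, 1, 2, 5]      # state test score
trait = [0, 1, 0, 1]      # a 0/1-coded student characteristic
ability = [1, 2, 3, 4]    # an independent ability measure

print("raw r:", round(pearson(score, trait), 3))
print("partial r:", round(partial_corr(score, trait, ability), 3))
```

With these numbers the raw correlation is 1/3, while the partial correlation is zero: the apparent relationship between score and characteristic is fully accounted for by ability.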

Evidence Based on Consequences judgments that need to be supported: • test use maximizes positive outcomes • test use minimizes negative outcomes

Decision Letters • decision letters were viewed at the USED web site; they are public documents • 19 of the states were required to provide additional validity evidence • the evidence was not classified by USED, but we classified it into the five types to make the project manageable • evidence required in a decision letter is mandatory, so these elements may be thought of as necessary for states to submit

Content-Based Evidence • evidence to show that assessments measure the academic content standards and not characteristics outside the academic content standards or grade-level expectations • blueprints, item specifications, and test development procedures • evidence of alignment with content standards – this is an emphasis in peer review • explanations of design and scoring • standard setting process, results, and impact

Response-Process-Based Evidence • evidence to show that items are tapping the intended cognitive processes – this sort of evidence is commonly a part of alignment studies

Evidence Based on Internal Structure • item interrelationships • subscale score correlations showing that they are consistent with the structures inherent to the academic content standards • scoring and reporting are consistent with the subdomain structure of the content standards • justification of score use given the observed threat that subdomain correlations are higher between content areas than within content areas
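The threat named in the last bullet, subdomain correlations that are higher between content areas than within them, can be checked by averaging the off-diagonal entries of a subscore correlation matrix. A minimal sketch, assuming the matrix has already been computed; the four subscores and their correlations are hypothetical.

```python
from itertools import combinations

def mean_within_between(labels, corr):
    """Mean off-diagonal subscore correlation within and between content areas.

    labels: content area of each subscore (e.g. "math", "reading").
    corr:   symmetric subscore correlation matrix, rows in the same order.
    """
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            within.append(corr[i][j])
        else:
            between.append(corr[i][j])
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical subscores: two math strands and two reading strands.
labels = ["math", "math", "reading", "reading"]
corr = [
    [1.00, 0.70, 0.72, 0.71],
    [0.70, 1.00, 0.69, 0.68],
    [0.72, 0.69, 1.00, 0.66],
    [0.71, 0.68, 0.66, 1.00],
]
within, between = mean_within_between(labels, corr)
if between >= within:
    print(f"threat: between-content mean {between:.2f} >= within-content mean {within:.2f}")
```

Raw means ignore differences in subscore reliability; a fuller analysis might correct the correlations for attenuation before comparing them.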

Evidence Based on Relationships with Other Variables • criterion validity • relationships between test scores and external variables

Evidence Based on Consequences • studies of intended and unintended consequences

Evidence from State Submissions • each state submitted voluminous evidence to USED • the Peer Review Reports included descriptions of the evidence submitted • we had sets of Reports for five states • this evidence may be over and above what is actually required

Evidence of Purposes • each state was asked to provide evidence about the purposes of their assessments • each state did that • this is an important part of Kane’s (2006) concept of a validity argument • because it does not fall into the categories of validity evidence in the USED Peer Review Guidance, we did not include it in our review

Content-Based Evidence • test blueprints & construction process • alignment reports • categorical concurrence (each content strand has enough items for a subscore report) • range of knowledge (the number of content elements in each strand that have items associated with them) • balance of representation (the distribution of items across the content elements within each strand) • achievement level descriptions (ALDs) compared with the strand structure
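The three alignment statistics defined above can be computed from a content-standards outline and the item categorizations. The sketch below is modeled loosely on Webb-style alignment indices. The balance formula (one minus half the summed absolute differences between an even share of the strand's items and each sampled element's actual share) and the cutoffs in `meets_criteria` (6 items, 50% range, 0.70 balance) are commonly quoted defaults, included here as assumptions rather than USED requirements. Strand, element, and item names are hypothetical.

```python
from collections import Counter

def alignment_indices(standards, item_map):
    """Per-strand alignment statistics.

    standards: {strand: [element, ...]} taken from the content standards.
    item_map:  {item_id: (strand, element)} taken from item categorization.
    """
    results = {}
    for strand, elements in standards.items():
        # Number of items mapped to each element of this strand.
        hits = Counter(el for s, el in item_map.values() if s == strand)
        n_items = sum(hits.values())   # items in the strand (categorical concurrence)
        n_hit = len(hits)              # elements with at least one item
        balance = 0.0
        if n_hit:
            # Equals 1.0 when items are spread evenly over the elements that are hit.
            balance = 1 - sum(abs(1 / n_hit - k / n_items) for k in hits.values()) / 2
        results[strand] = {
            "items": n_items,
            "range": n_hit / len(elements),   # share of elements sampled
            "balance": balance,
        }
    return results

def meets_criteria(stats, min_items=6, min_range=0.5, min_balance=0.7):
    """Apply assumed cutoffs to one strand's statistics."""
    return (stats["items"] >= min_items
            and stats["range"] >= min_range
            and stats["balance"] >= min_balance)

# Hypothetical standards and item categorizations.
standards = {"Algebra": ["A1", "A2", "A3", "A4"], "Geometry": ["G1", "G2"]}
item_map = {
    "i1": ("Algebra", "A1"), "i2": ("Algebra", "A1"), "i3": ("Algebra", "A1"),
    "i4": ("Algebra", "A2"), "i5": ("Algebra", "A2"), "i6": ("Algebra", "A3"),
    "i7": ("Geometry", "G1"), "i8": ("Geometry", "G1"),
}
for strand, stats in alignment_indices(standards, item_map).items():
    print(strand, stats, meets_criteria(stats))
```

Here Algebra has six items covering three of four elements (range 0.75, balance about 0.83) and passes. Geometry has only two items, both on one element, so it fails categorical concurrence even though its balance is 1.0.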

Response-Process-Based Evidence • alignment reports • depth of knowledge (compares the cognitive demand of each item with that implied by the content-standard element the item is associated with) • think-aloud studies (proposed)
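The depth-of-knowledge comparison can be summarized as the share of items whose coded level is at or above the level of the element they are mapped to. This sketch assumes reviewers have already coded elements and items on a 1 to 4 scale. The names and codes are hypothetical, and any acceptance cutoff (a 50% share is sometimes quoted) would be an assumption.

```python
def dok_consistency(element_dok, items):
    """Share of items at or above the DOK level of their content-standard element.

    element_dok: {element: DOK level} coded by content reviewers.
    items:       [(element, item DOK level), ...] coded by item reviewers.
    """
    if not items:
        raise ValueError("no items to evaluate")
    at_or_above = sum(1 for element, dok in items if dok >= element_dok[element])
    return at_or_above / len(items)

# Hypothetical codes for one strand.
element_dok = {"A1": 2, "A2": 3, "A3": 1}
items = [("A1", 1), ("A1", 2), ("A1", 3), ("A2", 2), ("A2", 3), ("A3", 1)]
print(f"{dok_consistency(element_dok, items):.0%} of items at or above the standard's DOK")
```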

Evidence Based on Internal Structure • dimensionality analysis at the item level • principal components analysis • dimensionality hypothesis testing • intercorrelations among the subtest scores
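One quick summary used alongside principal components analysis is the share of total variance carried by the first component. For a p by p correlation matrix the eigenvalues sum to p, so that share is the largest eigenvalue divided by p. The sketch below finds the largest eigenvalue by power iteration so it needs only the standard library. The two-strand matrix is hypothetical; in practice a linear-algebra library and a formal dimensionality test would be used.

```python
import math

def first_component(corr, iterations=500):
    """Largest eigenvalue and unit eigenvector of a correlation matrix by power iteration.

    Assumes a correlation matrix with positive entries, so the all-positive
    starting vector is not orthogonal to the first eigenvector.
    """
    n = len(corr)
    v = [1 / math.sqrt(n)] * n                      # unit-length starting vector
    lam = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(v, w))      # Rayleigh quotient v'Av
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return lam, v

# Hypothetical subscores: two strands, each with two strongly related subscores.
corr = [
    [1.0, 0.8, 0.2, 0.2],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
lam, _ = first_component(corr)
print(f"first eigenvalue {lam:.2f}, share of variance {lam / len(corr):.0%}")
```

For this matrix the largest eigenvalue is 2.2, so the first component carries 55% of the variance. That figure alone does not establish the strand structure; the second eigenvalue (1.4 here, obtainable by deflation or a library routine) is what points to a second, strand-level dimension.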

Evidence Based on Relationships with Other Variables • correlations with external tests of similar constructs (and dissimilar constructs) • correlations with student demographics and course-taking patterns • choosing and implementing accommodations for disabilities and limited English proficiency • bias studies (e.g., DIF) and passage reviews • universal design principles • monitoring of test administration procedures
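Of the bias studies listed, DIF analysis is the most computational. A common choice is the Mantel-Haenszel procedure: examinees are matched on total score, a 2 by 2 table (group by right/wrong) is formed at each score level, and the common odds ratio is converted to the ETS delta scale as MH D-DIF = -2.35 ln(alpha_MH). The sketch below assumes that procedure; the counts are hypothetical, and significance testing and classification rules are omitted.

```python
import math

def mh_d_dif(strata):
    """Mantel-Haenszel DIF statistic on the ETS delta scale.

    strata: one tuple per matched total-score level k, holding
        (ref_right, ref_wrong, focal_right, focal_wrong).
    Raises ZeroDivisionError if no stratum contributes to the denominator.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue                      # empty score level contributes nothing
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                  # common odds ratio across strata
    return -2.35 * math.log(alpha_mh)     # negative values favor the reference group

# Hypothetical counts at two matched score levels.
strata = [(30, 20, 25, 25), (40, 10, 35, 15)]
print(f"MH D-DIF = {mh_d_dif(strata):.2f}")
```

With these counts alpha_MH is 13.5/8.5, about 1.59, giving MH D-DIF of about -1.09; the negative sign means the item is relatively harder for the focal group at matched score levels.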

Evidence Based on Consequences • longitudinal change in dropout and graduation rates and NAEP results • use of results to evaluate schools and districts • use of test data to improve curriculum & instruction • use of adequate yearly progress reports • use of tests to make promotion & graduation decisions

Synthesis of Evidentiary Needs • it would be useful to have a minimum list for state regulatory submissions • can we use these studies to generate a list? • a list based on our evidence would most likely be over-inclusive • as soon as we propose one, it will surely be challenged • it seems reasonable to submit the following • for each test series (e.g., regular, alternate) • for each tested content and grade combination

Content Evidence • content standards • test blueprint • item (and passage) development process • item categorization rules and process • forms development process (e.g., item sampling; item location; section timing) • results of alignment studies

Process Evidence • test blueprint (if it has a process dimension) • item categorization rules and method (if items are categorized by process) • results of alignment studies • results of other studies, such as think-alouds

Internal Structure Evidence • subscore correlations • item-subscore correlations • dimensionality analyses
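Item-subscore correlations are often computed in corrected form: each item is correlated with the subscore built from the remaining items of its strand, so the item does not inflate its own correlation. A minimal sketch, assuming dichotomous item scores; the five-student, three-item strand is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_item_subscore(responses, strand_items):
    """Correlate each item with the sum of the *other* items in its strand.

    responses:    one list of item scores per student.
    strand_items: column indices of the items belonging to the strand.
    """
    result = {}
    for item in strand_items:
        item_scores = [r[item] for r in responses]
        rest = [sum(r[j] for j in strand_items if j != item) for r in responses]
        result[item] = pearson(item_scores, rest)
    return result

# Hypothetical 0/1 scores: five students by three items in one strand.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
]
for item, r in corrected_item_subscore(responses, [0, 1, 2]).items():
    print(f"item {item}: corrected item-subscore r = {r:.2f}")
```

Here the first item's corrected correlation is essentially zero, so it would be flagged for review as not hanging together with the rest of its strand; the other two items correlate at about 0.65 and 0.33.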

Relations with Other Variables • convergent evidence • correlations with independent, standardized measures • correlations with within-class variables, such as grades • discriminant evidence • correlations with standardized tests of other traits (e.g., math with reading) • correlations with within-class variables, such as grades in other contents • correlations with irrelevant student characteristics (e.g., gender) • item-level (e.g., DIF) studies

Consequential Evidence • purposes of the test – as they describe intended consequences • uses of results by educators • trends over time • studies that generate and evaluate positive and negative aspects from user input
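Because the list is meant to be repeated for each test series and for each tested content and grade combination, the submission space is a Cartesian product. The sketch below enumerates it and reports which evidence items have not yet been documented. The item labels abbreviate the lists above; the series, content, and grade values and the `submitted` record format are hypothetical.

```python
from itertools import product

# Minimum evidence list, abbreviated from the five categories above.
MINIMUM_EVIDENCE = {
    "content": ["content standards", "test blueprint", "item and passage development",
                "item categorization rules", "forms development", "alignment results"],
    "process": ["blueprint process dimension", "process categorization rules",
                "alignment results", "other process studies"],
    "internal structure": ["subscore correlations", "item-subscore correlations",
                           "dimensionality analyses"],
    "other variables": ["convergent correlations", "discriminant correlations",
                        "irrelevant-characteristic correlations", "item-level DIF"],
    "consequences": ["test purposes", "educator uses", "trends over time",
                     "user-input studies"],
}

def missing_evidence(series, contents, grades, submitted):
    """Return every (series, content, grade, category, item) not yet in `submitted`."""
    missing = []
    for s, c, g in product(series, contents, grades):
        for category, items in MINIMUM_EVIDENCE.items():
            for item in items:
                if (s, c, g, category, item) not in submitted:
                    missing.append((s, c, g, category, item))
    return missing

# Hypothetical program: two series, two contents, one grade;
# content evidence has been documented for one cell only.
submitted = {("regular", "math", 3, "content", item) for item in MINIMUM_EVIDENCE["content"]}
gaps = missing_evidence(["regular", "alternate"], ["math", "reading"], [3], submitted)
print(len(gaps), "evidence items still to document")
```

Conditional items, such as a blueprint's process dimension, would be dropped for cells where they do not apply.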

Validity in the Accountability Context – Role of Processes • majority of the evidence submitted capitalizes on well-known methods for study of the validity of a particular test form – a product • but object of study in accountability is actually a process by which tests are developed & used • a test form is important only as a representative of a process of test development • programs are expected to engage in a continual process of self-evaluation and improvement

Process Evidence • we assume it is useful to distinguish between product evidence and process evidence • product evidence focuses on a particular test • process evidence focuses on a testing program • we will review and extend some suggestions for process evidence that were originally proposed in the context of state assessment and accountability peer reviews

What is a Process? • a recurring activity that takes material, operates on it, and produces a product • concept is borrowed from project management • could be as large as the entire assessment and accountability program • could be as small as, say, the production of a test item • one challenge is to organize the activities of a program into useful processes

Is Validity a Process Concept? • i.e., is there a sense in which we can use the concept of the validity of a process? • validity is justification for an interpretation of a score • a test form is a static element that can contribute support for an interpretation • a process is a dynamic element that can contribute support for future interpretations • so we give this one a tentative “yes”

Elements of Process Evidence • process • the process is described • the inputs and operating rules are laid out • product • the results of the process are presented or described • evaluation (how these questions are considered) • is the process adequate? • can (or how can) it be improved? • should it be improved (e.g., do the benefits justify the costs)? • improvement (how the results of the evaluation are acted on) • the recommendations from the evaluation are considered for implementation in order to improve the process

Examples of Process Evidence • three examples of these four elements of process evidence follow • they vary markedly in scope • small to large • illustrate the nature of process evidence for different contexts within an assessment and accountability program

Bias and Sensitivity Committee Selection • process. desired composition, generation of committee members, contacting potential members, proposed meeting schedule, etc. • product. committee composition, especially the constituencies represented. • evaluation. comparison of actual with desired composition, follow up with persons who declined, suggestions for improvement. • improvement. who has responsibility to consider the recommendations generated by the evaluation, how they go about their analysis, how change is implemented in the system, examples of changes that were made in the past to document responsiveness

Alignment • process. test blueprint, items, item categorizations, sampling processes • product. a test form • evaluation. alignment study • improvement. review of study recommendations, plan for future

Psychometric Adequacy of a Test Form • process. the analyses that are performed. • product. technical manual • evaluation. review by a group such as a TAC, recommendations for the manual as well as the testing program • improvement. consideration of recommendations, plan for future
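The four elements of process evidence lend themselves to a simple record that a program could keep for each process, which also makes undocumented elements easy to spot. The class and field names below are hypothetical; the example fills in the Alignment process from the examples above and deliberately leaves its improvement element empty.

```python
from dataclasses import dataclass, field
from typing import List

ELEMENTS = ("process", "product", "evaluation", "improvement")

@dataclass
class ProcessEvidence:
    """The four elements of process evidence for one recurring process."""
    name: str
    process: str = ""       # description, inputs, and operating rules
    product: str = ""       # results the process produced
    evaluation: str = ""    # adequacy, possible improvements, cost-benefit
    improvement: str = ""   # who considers recommendations and how changes are made
    past_changes: List[str] = field(default_factory=list)  # documented responsiveness

    def missing_elements(self) -> List[str]:
        """Names of the four elements that have not been documented."""
        return [e for e in ELEMENTS if not getattr(self, e).strip()]

alignment = ProcessEvidence(
    name="Alignment",
    process="test blueprint, items, item categorizations, sampling processes",
    product="a test form",
    evaluation="alignment study",
)
print(alignment.name, "is missing:", alignment.missing_elements())
```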

Making Judgments About Processes • two typically independent layers of judgment • first layer is an evaluation that makes recommendations about improvement • second layer considers them • in many cases, second layer would be an excellent way for a state to use its TAC

Judging Process Evidence • process evidence by definition describes processes • it should be judged by how well it describes processes that support interpretations based on future assessments • it should also be judged on how well it describes processes that lead to improvements in the program

Possible Criteria for Process Evidence • data are collected from all relevant sources • data are reported completely and efficiently • reviewed by persons with appropriate expertise • review is conducted fairly • review results are reported completely and efficiently • recommendations are suggested in the reports • consideration given to the recommendations • past actions based on the recommendations are presented as evidence that the process results in improvement
