
ACCESS for ELLs® Scores, Reliability and Validity



Presentation Transcript


  1. ACCESS for ELLs®Scores, Reliability and Validity Prepared by Dorry Kenyon, CAL ISBE Meeting, Chicago, IL February 21, 2007 Developed by the Center for Applied Linguistics

  2. Outline of my presentation • What do scores on ACCESS for ELLs® mean? • What do we know about the reliability of ACCESS for ELLs® scores? • What do we know about the validity of ACCESS for ELLs® scores? • So what does this mean for using scores on ACCESS for ELLs®?

  3. 1. What do scores on ACCESS for ELLs® mean?

  4. Two types of scores • WIDA ACCESS for ELLs® Scale Scores = psychometrically-derived measure • WIDA ACCESS for ELLs® Proficiency Level Scores = socially-derived interpretation of the scale score in terms of the WIDA Standards’ Proficiency Level Definitions

  5. What is measured? • Scale Scores (and interpretive Proficiency Level Scores) are given for measures in the four domains • Listening • Speaking • Reading • Writing • Scale Scores are combined into four composite scores (which are also interpreted in Proficiency Level Scores) • Oral (listening and speaking) • Literacy (reading and writing) • Comprehension (listening and reading) • Overall Composite (listening, speaking, reading, and writing)

  6. Weighting of the overall composite • Scale Scores of the four domains are weighted differently in the overall composite score • Listening (15%) • Speaking (15%) • Reading (35%) • Writing (35%)

  7. ACCESS administration times and composite score weights • Listening (15%): 20-25 minutes, machine scored • Reading (35%): 35-40 minutes, machine scored • Writing (35%): Up to 1 hour, rater scored • Speaking (15%): Up to 15 minutes, administrator scored
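The weighting in the two slides above can be sketched in Python. This is an illustrative calculation only; the function name and the sample scores are hypothetical, not part of the operational scoring system:

```python
def overall_composite(listening, speaking, reading, writing):
    """Combine the four domain scale scores into the overall composite
    using the weights from the slide: Listening 15%, Speaking 15%,
    Reading 35%, Writing 35%."""
    return (0.15 * listening + 0.15 * speaking
            + 0.35 * reading + 0.35 * writing)

# Hypothetical domain scale scores for one student
print(overall_composite(340, 320, 360, 350))  # weighted average, approximately 347.5
```

Because reading and writing each carry 35% of the weight, a change in those domains moves the overall composite more than an equal change in listening or speaking.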

  8. Scale Scores vs. Proficiency Level Scores • The WIDA ACCESS for ELLs® Scale Scores are the psychometrically derived measures of student proficiency • Range from 100 to 600 • One scale applies to all grades through vertical equating of tests • Vertical scale score takes into account that assessment tasks taken by students in the grade 9-12 cluster are more challenging than the assessment tasks taken by students in the grade 1-2 cluster • Average scale scores consistently show an increase from grade to grade

  9. 2005-2006 Overall Composite Scale Scores

  10. Scale Scores vs. Proficiency Level Scores • Proficiency Level Scores are socially-derived interpretations of the WIDA ACCESS for ELLs® Scale Scores in terms of the six proficiency levels defined in the WIDA Standards • Comprised of two numbers, e.g. 2.5 • First number indicates the proficiency level into which the student’s scale score places him or her (e.g. 2 = Beginning) • Second number indicates how far, in tenths, the student’s scale score places him or her between the lower and the higher cut score of the proficiency level (e.g. 2.5 = 5/10 or ½ of the way between the cut score for level 2 and for level 3) • The same scale score is interpreted differently based on the grade level cluster a student is in • The same proficiency level score corresponds to different scale scores based on the grade level cluster
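The two-part proficiency level score described above can be illustrated with a short sketch. The cut scores below are hypothetical placeholders; the real cuts differ by grade level cluster and are documented in the WIDA technical reports:

```python
def proficiency_level_score(scale_score, cuts):
    """Interpret a scale score as a proficiency level score.

    `cuts` maps each proficiency level to its lower cut score.
    The whole number is the level the scale score falls into; the
    tenths digit shows how far, in tenths, the score lies between
    that level's cut and the next level's cut."""
    levels = sorted(cuts)
    for lo, hi in zip(levels, levels[1:]):
        if cuts[lo] <= scale_score < cuts[hi]:
            fraction = (scale_score - cuts[lo]) / (cuts[hi] - cuts[lo])
            return round(lo + fraction, 1)
    return float(levels[-1])  # at or above the top cut score

# Hypothetical cut scores for one grade level cluster
cuts = {1: 120, 2: 200, 3: 250, 4: 300, 5: 350, 6: 400}
print(proficiency_level_score(225, cuts))  # 2.5: halfway between the cuts for levels 2 and 3
```

Running the same scale score through a different cluster's cut-score table would yield a different proficiency level score, which is exactly the point of the last two bullets above.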

  11. Example: Scale score of 350 (chart: the same scale score of 350 shown for each grade level cluster)

  12. Example: Overall composite proficiency level score 6.0 corresponds to different scale scores by grade level cluster: 1-2 = 354, 3-5 = 394, 6-8 = 410, 9-12 = 429 (chart: vertical scale from 100 to 600; easy items and less proficient students at the bottom, hard items and more proficient students at the top)

  13. How are proficiency level scores derived? • While Proficiency Level Scores are socially-derived interpretations, they are not arbitrary • Set by panels of content experts • Set following best technical practices • Set by consensus building procedures (standard setting studies) • Set by carefully documented replicable procedures • For WIDA ACCESS for ELLs®, these were set by panels of experts in April of 2004, for each grade level cluster (see WIDA Technical Report #1 for complete details)

  14. Originally WIDA had grade level cluster cuts (chart: cut scores for proficiency levels 1-6 by grade level cluster)

  15. Grade level cuts are being introduced this year (chart: cut scores for proficiency levels 1-6 by grade level)

  16. Cluster vs. grade level cuts

  17. 2005-2006 Overall Composite Scale Scores

  18. Effect of grade level cut scores (chart: proficiency level scores)

  19. 2. What do we know about the reliability of ACCESS for ELLs® scores?

  20. What is reliability? • Psychometrically speaking, reliability refers to the consistency of test scores. • What evidence is there that a test score is not just a chance occurrence, but would have been obtained had the student been tested on multiple occasions or been scored by different raters?

  21. Multiple forms of ACCESS for ELLs® • In the Annual Technical Report, the reliability of each of the 44 separate test forms of ACCESS for ELLs® is reported.

  22. Types of reliability reported • For all test forms, internal consistency (coefficient alpha) is reported • For writing, agreement between operational raters is also reported (20%) • For speaking, agreement between administrators from field test data is currently given; a larger study is underway • Reliabilities for domain scores based on the individual forms for Series 100 (2004-2005) are within expected and acceptable ranges
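Coefficient alpha, the internal-consistency index named above, can be computed from a students-by-items score matrix. This is a minimal sketch of the standard formula, not the operational scoring code, and the toy data are invented:

```python
def cronbach_alpha(scores):
    """Coefficient (Cronbach's) alpha for a list of per-student item
    score lists: alpha = k/(k-1) * (1 - sum of item variances /
    variance of total scores), using sample (n-1) variances."""
    k = len(scores[0])  # number of items
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = sum(var([s[i] for s in scores]) for i in range(k))
    total_var = var([sum(s) for s in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Four students' scores on three dichotomous items (toy data)
print(round(cronbach_alpha([[1, 1, 1], [0, 0, 0], [1, 0, 1], [1, 1, 0]]), 3))
```

Alpha rises when item scores vary together (students strong on one item tend to be strong on the others), which is why it is read as evidence that a form's items measure a common trait.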

  23. Reliability of the overall composite • Results indicate that the reliability of the overall composite score across tiers is similar and very high across all grade level clusters (Series 100).

  24. The most important reliability index • For tests like ACCESS for ELLs®, where decisions are based on a student’s classification into proficiency levels, the accuracy of classification is perhaps the most important reliability index. • This index estimates how reliably a student was classified as at or above a given proficiency level (versus below it).
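The operational accuracy-of-classification indices come from psychometric models documented in the technical report, but the underlying idea can be sketched as a simple consistency check: how often two parallel measurements place a student on the same side of a cut score. The cut value and scores below are hypothetical:

```python
def classification_consistency(form_a, form_b, cut):
    """Proportion of students classified the same way (at/above vs.
    below the cut score) by two parallel measurements."""
    agree = sum((a >= cut) == (b >= cut) for a, b in zip(form_a, form_b))
    return agree / len(form_a)

# Hypothetical scale scores from two parallel forms, cut score 300
print(classification_consistency([350, 300, 280, 400],
                                 [355, 290, 310, 410], 300))  # 0.5
```

Note that students far from the cut score are almost always classified consistently; it is scores near the cut that drive this index down, which is why classification accuracy depends on where the cuts fall, not only on overall score reliability.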

  25. Accuracy of classification indices (Series 100)

  26. 3. What do we know about the validity of ACCESS for ELLs® scores?

  27. What is validity? • Validity refers to an evaluative judgment of the degree to which theoretical rationales and empirical evidence support the adequacy and appropriateness of inferences and actions made on the basis of test scores.

  28. Validity issues for ACCESS for ELLs® • Issues related to ACCESS for ELLs® include • Do the described proficiency levels exist? • How does the test relate to other measures of English language proficiency? • How confident are we in the cut scores that place students into the various levels, that they really define the levels? • Do we know that ACCESS for ELLs® tests the languageneeded for academic success and is not a content test? • And so on…

  29. Study 1: Do the levels of the Standards really exist? Reading and Listening Selected Response Type Items • SI = Social and Instructional Language • LA = language of Language Arts • MA = language of Math • SC = language of Science • SS = language of Social Studies

  30. The Standards guide test development • ACCESS for ELLS® makes the WIDA Standards operational • WIDA Standards provide • Content (What?) • Performance Levels (How well?)

  31. Large-scale Standards: SC reading

  32. Large-Scale standards: SC reading Classify living organisms (such as birds and mammals) by using pictures or icons

  33. Large-scale Standards: SC reading Interpret data presented in text and tables in scientific studies

  34. At the given level of English language proficiency, English language learners will process, understand, produce, or use: 5: technical language of the content areas 2: general language of the content areas 1: pictorial or graphic representation of the language of the content areas

  35. Validation issues • Validity is about the adequacy and appropriateness of inferences about students made on the basis of test scores. • The WIDA Standards make claims about what students at five different proficiency levels can do. • Can those claims be substantiated empirically?

  36. Research study questions • Are the ACCESS for ELLs™ items empirically ordered by difficulty as predicted by the WIDA Standards? • Does that ordering differ by domain (listening or reading)? • Does that ordering differ by standard (SI, LA, MA, SC, SS)?

  37. Data • Results from ACCESS for ELLs™ field test • Fall 2004 • Over 6500 students grades 1 to 12 • 8 WIDA states • About 3.5% proportional representation

  38. Method • Items were vertically scaled across grade levels using common item equating • Item difficulty was determined using the Rasch measurement model • Items that did not meet the requirements of the model were eliminated from the analysis • Average item difficulties were calculated by proficiency level
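The Rasch model named in the method above expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both on the same logit scale. This sketch shows only the model's core equation, not the full estimation procedure used in the study:

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# A person whose ability matches the item's difficulty succeeds half the time
print(rasch_probability(1.0, 1.0))  # 0.5
# Harder items (higher difficulty) lower the probability of success
print(rasch_probability(1.0, 2.0) < 0.5)  # True
```

Because difficulty and ability share one scale, average item difficulties can be compared across the five proficiency levels, which is exactly the comparison the study's research questions require.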

  39. Number of items used = 651

  40. Results

  41. Conclusions

  42. 1. Are the ACCESS for ELLs™ items empirically ordered by difficulty as predicted by the WIDA Standards? • Yes. WIDA Standards (MPIs) provided sufficient content and rationale to develop specifications that operationalized the five proficiency levels through listening and reading selected response items.

  43. 2. Does that ordering differ by domain (listening or reading)? • No. The general ordering was similar across listening and reading. Some difference between listening level 5 and reading level 5 was observed.

  44. 3. Does that ordering differ by standard (SI, LA, MA, SC, SS)? • Yes. SI (social and instructional language) items showed a clear tendency to be easier than items assessing language in the content areas, particularly at higher proficiency levels. • Items assessing language in the content areas were similar except at level 5 where language arts appeared easier than expected.

  45. Discussion 1. While many additional validation issues remain, this preliminary empirical analysis based on the field test data indicates that the WIDA Standards provide a strong basis for distinguishing among proficiency levels of ELLs.

  46. Discussion 2. The operational plan for ongoing WIDA assessment item renewal and development provides opportunity to tighten item specifications based on empirical research while operationalizing the WIDA Standards.

  47. Process of test development 1. Theory and Research 2. Standards 3. Specifications 4. Assessment
