
Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments

Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments. An Empirical Assessment Based on Four Recent Evaluations. IES Research Conference, June 28th, 2010. Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong. MDRC.


Presentation Transcript


  1. Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments An Empirical Assessment Based on Four Recent Evaluations IES Research Conference June 28th, 2010 Marie-Andrée Somers (Presenter) Pei Zhu Edmond Wong MDRC

  2. Two key concerns with using state tests in an evaluation… • They may not be suitable for the evaluation • Validity concerns: They may not be aligned with outcomes of interest (do not provide a valid inference about program impacts) • Reliability concerns: They may be too difficult for low-performing students (unreliable) • Variation in scale/content of state tests also complicates the task of combining impact findings across states and grades

  3. About This Study • Funded by Institute of Education Sciences (IES) • Purpose is to “bring data to bear” on several topics covered in May et al. discussion paper: • Are state tests suitable for evaluation purposes? • As a measure of the outcome(s) of interest? • As a measure of student achievement at baseline? • How should impacts on state tests be pooled? • Are impact findings sensitive to methods of rescaling and aggregating test scores across states and/or grades?

  4. Overview of Analytical Approach • We identified 4 large-scale randomized experiments where achievement was measured using both (i) state tests AND (ii) a study test • The study test provides a benchmark for gauging the suitability of state tests • Two types of analyses: • Impact analyses: We compared estimated impacts on state tests and on the "benchmark" study test • Descriptive analyses: We also examined published information on the characteristics/content of tests

  5. Data and Samples • Studies represent diversity with respect to grade levels and outcomes • Analysis sample includes students with a state test score and a study test score

  6. Approach for Estimating Impacts • Impact on state tests: • Rescaling: Scores are z-scored by state and grade using the sample mean and standard deviation • Pooling approach: Impacts by state and grade are aggregated using precision weighting • Impact on the study test: • Rescaled/pooled using the same approach for comparability
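The rescaling and pooling steps on slide 6 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes z-scores are computed from the full analysis sample within each state-by-grade block, and that "precision weighting" means inverse-variance weighting of the block-level impact estimates. The function names and input layout are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def zscore_by_block(records):
    """Z-score test scores within each (state, grade) block.

    records: list of (state, grade, score) tuples.
    Returns z-scores in the same order as the input.
    Assumption: the block's sample mean and sample SD are used, and each
    block has at least two students with different scores.
    """
    groups = defaultdict(list)
    for state, grade, score in records:
        groups[(state, grade)].append(score)
    stats = {key: (mean(vals), stdev(vals)) for key, vals in groups.items()}
    return [
        (score - stats[(state, grade)][0]) / stats[(state, grade)][1]
        for state, grade, score in records
    ]


def precision_weighted_pool(estimates):
    """Pool block-level impacts with inverse-variance (precision) weights.

    estimates: list of (impact, standard_error) tuples, one per block.
    Returns (pooled_impact, pooled_standard_error).
    Assumption: block estimates are independent of one another.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * impact for w, (impact, _) in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)


# Made-up inputs for illustration only; these are not values from the study.
scores = [("X", 3, 10), ("X", 3, 20), ("X", 3, 30),
          ("Y", 4, 100), ("Y", 4, 200), ("Y", 4, 300)]
z_scores = zscore_by_block(scores)
pooled_impact, pooled_se = precision_weighted_pool([(0.10, 0.05), (0.30, 0.10)])
```

Because each block is standardized separately, scores from tests on different scales become comparable in effect-size units, and blocks with smaller standard errors contribute more to the pooled estimate.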

  7. Criteria for Assessing “Suitability” • Two dimensions of suitability: • Validity: Whether the content of state tests is aligned with the outcomes of interest in the evaluation • Reliability: Whether state tests provide a reliable measure of achievement for the target population (in this case, low-performing students) • A key concern: State tests may have low reliability and may not yield valid inferences about program effectiveness

  8. Criteria for Assessing “Suitability” • Implications for the impact findings: • Poor Validity: • Could fail to detect impacts on the outcome of interest (invalid inference about program effectiveness) • Affects the magnitude of the estimated impact on state tests • Low Reliability: • Student achievement is estimated with greater error • Affects the standard error of the estimated impact on state tests

  9. Criteria for Assessing “Suitability” • Reliability: Compare the standard error of the estimated impact on state tests vs. the study test • Smaller standard error is better (more precision) • Validity: Compare the magnitude of the impact estimates, in light of estimation error… • Compare the statistical significance of the impact findings (i.e., conclusions about program effectiveness based on p-value) • If both estimates are statistically significant, then also compare their magnitudes
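The two comparisons on slide 9 can be illustrated with a short sketch. It is an assumption-laden example rather than the study's procedure: p-values here come from a two-sided normal approximation, whereas the evaluations may have used other test statistics, and the function names, the 0.05 threshold, and the input numbers are all hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate/se under a normal approximation."""
    z = abs(estimate / se)
    return erfc(z / sqrt(2.0))


def compare_tests(state, study, alpha=0.05):
    """Compare impact estimates on state tests and on the study test.

    state, study: (impact_estimate, standard_error) tuples.
    Returns the state-to-study standard error ratio (reliability check)
    and whether both tests lead to the same significance conclusion
    (first step of the validity check).
    """
    state_p = two_sided_p(*state)
    study_p = two_sided_p(*study)
    return {
        "se_ratio": state[1] / study[1],  # above 1 means state tests are less precise
        "state_significant": state_p < alpha,
        "study_significant": study_p < alpha,
        "same_conclusion": (state_p < alpha) == (study_p < alpha),
    }


# Made-up numbers for illustration only; these are not results from the study.
result = compare_tests(state=(0.09, 0.055), study=(0.10, 0.05))
```

Following the slide's logic, magnitudes would only be compared in a further step when both estimates are statistically significant.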

  10. Criteria for Assessing Validity • The extent to which the magnitudes of the impact estimates are expected to differ depends on the outcome that state tests are intended to measure • Two types of intervention: • Targeted outcome is general achievement (Studies A and B) • The outcome of interest is “general achievement” in math or reading • Both state tests and the study test measure the targeted outcome (general achievement) • If state tests are valid, then the impact on the study test and state tests should be similar

  11. Criteria for Assessing Validity • Two types of intervention (cont.) • Targeted outcome is a specific skill (Studies C and D) • There are two outcomes of interest: • Targeted skill (short-term) and • General achievement (longer-term) • Study test is used to measure the short-term outcome (specific skill), while state tests are used to measure the longer-term outcome (general achievement) • If state tests are valid, then the impact on state tests should be smaller than the impact on the study test

  12. Benchmark: Impact on the Study Test

  13. P-Value & Magnitude (Validity) • Targeted Outcome is General Achievement • Chart labels: p = 0.119, p = 0.055

  14. P-Value & Magnitude (Validity) • Targeted Outcome is General Achievement • Chart labels: p = 0.119, p = 0.189, p = 0.055, p = 0.229

  15. P-Value & Magnitude (Validity) • Targeted Outcome is a Specific Skill • Chart labels: p = 0.002, p = 0.578

  16. P-Value & Magnitude (Validity) • Targeted Outcome is a Specific Skill • Chart labels: p = 0.002, p = 0.007, p = 0.578

  17. P-Value & Magnitude (Validity) • Targeted Outcome is a Specific Skill • Chart labels: p = 0.002, p = 0.007, p = 0.578, p = 0.219

  18. Standard Errors (Reliability)

  19. Standard Errors (Reliability) • State-Study Ratio: 1.20, 1.07, 1.04, 1.03

  20. Conclusion • Findings suggest that state tests can be used as a complement to a study-administered test • State tests are suitable (valid and reliable) in 3 of 4 studies • Whether state tests can be used as a substitute for a study test is an open question • Limited availability in some grades and subjects • Available for all states/grades in only 1 of 4 studies • May not be able to use them to measure a specific targeted skill • Possibly less reliable • Findings from descriptive analysis lead to the same conclusions as the impact analysis…

  21. Questions? • Marie-Andrée Somers • marie-andree.somers@mdrc.org • Pei Zhu • pei.zhu@mdrc.org • Edmond Wong • edmond.wong@mdrc.org
