
Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations for Educational Screening & Diagnostic Assessments. A discussion of methodological applications that have long existed in the literature and are used in other disciplines, but are only now emerging in education. Yaacov Petscher, Ph.D.


Presentation Transcript


  1. Statistical Considerations for Educational Screening & Diagnostic Assessments A discussion of methodological applications that have long existed in the literature and are used in other disciplines, but are only now emerging in education Yaacov Petscher, Ph.D. Florida Center for Reading Research Florida State University

  2. Discussion Points • Assessment Assumptions • Contexts of Assessments • Statistical Considerations • Reliability • Validity • Benchmarking • “Disclaimer” • Focusing on Breadth not Depth • Based on applied contract and grant research • One slide of equations

  3. Assumptions of Assessment - Researchers • Constructs exist but we can’t see them • Constructs can be measured • Although we can measure constructs, our measurement is not perfect • There are different ways to measure any given construct • All assessment procedures have strengths and limitations

  4. Assumptions of Assessment - Practitioner • Multiple sources of information should be part of the assessment process • Performance on tests can be generalized to non-test behaviors • Assessment can provide information that helps educators make better educational decisions • Assessment can be conducted in a fair manner • Testing and assessment can benefit our educational institutions and society as a whole

  5. Contexts of Assessments • Instructional • Formative • Interim • Summative • Research • Individual Differences • Group Differences (RCT) • Growth • Legislative Initiatives • NCLB • Reading First • Race to the Top • Common Core

  6. Common Core Adoption

  7. PARCC

  8. Smarter Balanced

  9. Within Common Core • USDOE • PARCC Assessments • Smarter Balanced Assessments • Reading for Understanding Assessments • I3 Assessments • Private Sector

  10. Underlying “Code” of Assumptions • Researcher: Constructs exist but we can’t see them; constructs can be measured; although we can measure constructs, our measurement is not perfect; there are different ways to measure any given construct; all assessment procedures have strengths and limitations • Practitioner: Multiple sources of information should be part of the assessment process; performance on tests can be generalized to non-test behaviors; assessment can provide information that helps educators make better educational decisions; assessment can be conducted in a fair manner; testing and assessment can benefit our educational institutions and society as a whole

  11. Statistical Considerations - Reliability • Stability, accuracy, or consistency of test scores • Many types • Internal consistency • Retest • Parallel-form • Split-half • Should not be viewed as interchangeable • One could have very high stability but very poor internal consistency • Date of Birth/Height/SSN
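Because these reliability types answer different questions, they can disagree on the same data. Below is a minimal sketch, in Python, of two of them computed on simulated dichotomous item responses; all names and values are illustrative, not from the talk.

```python
# Internal consistency vs. split-half reliability on one simulated
# item-response matrix (illustrative data, not from the presentation).
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(size=500)                     # examinee abilities
difficulty = np.linspace(-1.5, 1.5, 20)            # 20 item difficulties
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
X = (rng.uniform(size=p.shape) < p).astype(float)  # 500 x 20 scored responses

def cronbach_alpha(X):
    """Internal consistency: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

def split_half(X):
    """Odd-even split-half correlation, stepped up with the Spearman-Brown formula."""
    r = np.corrcoef(X[:, ::2].sum(axis=1), X[:, 1::2].sum(axis=1))[0, 1]
    return 2 * r / (1 + r)

print(f"alpha = {cronbach_alpha(X):.3f}, split-half = {split_half(X):.3f}")
```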

  12. Statistical Considerations - Reliability • Most frequently used framework is classical test theory: X = T + e (observed score = true score + error) • What does this assume?
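A minimal simulation of that decomposition makes the assumptions concrete: under classical test theory, errors have mean zero, are uncorrelated with true scores, and have the same variance for every examinee, so reliability is the share of observed-score variance due to true scores. All numbers below are illustrative.

```python
# CTT decomposition X = T + e on simulated scores (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(50, 10, size=10_000)  # true scores
e = rng.normal(0, 5, size=10_000)    # error: mean 0, independent of T,
                                     # and the SAME variance for everyone
X = T + e                            # observed scores

reliability = T.var() / X.var()      # var(T) / (var(T) + var(e))
print(f"reliability ~ {reliability:.3f}")  # ~ 100 / 125 = .80
```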

  13. Benefits of IRT • Puts persons and items on the same scale • CTT looks at total score by p-value (difficulty) • Can result in shorter tests • CTT reliability increases with more items • Can estimate the precision of scores at the individual level • CTT assumes error is the same for all examinees
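For concreteness, here is a sketch of the 2PL item response function behind these claims: person ability (theta) and item difficulty (b) share one logit scale, so the success probability depends on their difference. Parameter values are hypothetical.

```python
# 2PL item response function: persons and items on the same scale.
import numpy as np

def p_correct(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])     # low / average / high ability
print(p_correct(theta, a=1.2, b=0.5))  # one item, three examinees
```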

  14. Item Difficulty by Total Score Decile Groups

  15. Item Difficulty by Ability

  16. Items Don’t Always Do What We Want

  17. Item Information

  18. Test Information – Standard Error
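The figures themselves are not reproduced in the transcript, but the relationship they illustrate can be sketched: under the 2PL, item information is I(theta) = a^2 * P * (1 - P), test information sums the item informations, and SE(theta) = 1 / sqrt(test information), so measurement error varies across the ability range instead of being constant as in CTT. The parameters below are hypothetical.

```python
# Test information and ability-specific standard errors under the 2PL.
import numpy as np

def item_information(theta, a, b):
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

a = np.array([1.0, 1.5, 0.8, 2.0])   # discriminations (hypothetical)
b = np.array([-1.0, 0.0, 0.5, 1.0])  # difficulties (hypothetical)

for theta in (-2.0, 0.0, 2.0):
    info = item_information(theta, a, b).sum()  # test information at theta
    print(f"theta={theta:+.1f}  info={info:.2f}  SE={1 / np.sqrt(info):.2f}")
```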

  19. Precision/Reliability

  20. Statistical Considerations - Reliability • While precision improves on the idea of reliability, can precision itself be improved further? • Account for context effects (Wainer et al., 2000) • Petscher & Foorman, 2011 • Account for time (Verhelst, Verstralen, & Jansen, 1997) • Prindle, Petscher, & Mitchell, 2013

  21. Statistical Considerations - Reliability • Context effects • Any influence or interpretation that an item may acquire as a result of its relationship to other items • A greater problem in CAT, where each examinee receives a unique set of items • Emerges as both an item-level and a passage-level problem

  22. Statistical Considerations - Reliability Common stimulus

  23. Statistical Considerations - Reliability “If several questions within a test are experimentally linked so that the reaction to one question influences the reaction to another, the entire group of questions should be treated preferably as an ‘item’ when the data arising from application of split-half or appropriate analysis-of-variance methods are reported in the test manual” APA Standards of Educational and Psychological Testing (1966)

  24. Expressed in IRT
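The slide’s equation is not reproduced in the transcript. One common way to express the APA recommendation in IRT, following the testlet approach of Wainer et al. cited above, is to add a person-by-passage effect to the 2PL so that items sharing a passage are no longer conditionally independent given ability alone. The sketch below is illustrative, not the exact model from the talk.

```python
# Testlet-augmented 2PL: a person-specific effect (gamma) for each passage
# absorbs the dependency among items that share a common stimulus.
import numpy as np

def p_correct_testlet(theta, gamma_d, a, b):
    """2PL with a testlet effect for the passage the item belongs to."""
    return 1 / (1 + np.exp(-a * (theta - b - gamma_d)))

theta = 0.5                                  # person ability
gamma = {"passage1": 0.4, "passage2": -0.3}  # hypothetical person-by-passage effects
print(p_correct_testlet(theta, gamma["passage1"], a=1.2, b=0.0))
print(p_correct_testlet(theta, gamma["passage2"], a=1.2, b=0.0))
```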

  25. Study 1: Reading Comprehension in Florida

  26. Precision – After 3 passages

  27. FAIR Technical Manual

  28. Simulations are all well and good… How does accounting for item dependency improve testing in the real world?

  29. RCT • N ≈ 800, randomly assigned to testing condition • Control was current 2PL scoring • Experimental was unrestricted bi-factor • Evaluate • Precision • # of passages • Prediction to state achievement

  30. What this suggests • “Newer” models help us model the data more appropriately • Precision/reliability improve just by modeling the context effect • The efficiency and precision of a computer-adaptive test improve by modeling the item dependency

  31. Study 2: Morphology CAT

  32. Accounting for Time • Somewhat similar to the item dependency model • IRT models are concerned with accuracy • What about fluency? • CBM (DIBELS, AIMSweb, easyCBM) • Brief assessments (TOWRE, TOSREC, etc.) • Prindle, Petscher, & Mitchell (2013) • N = 200 • Word knowledge test • Limited to 60 sec • Compared a 1PL model with a 1PL response-time model

  33. Results • 1PL marginal α = .80 • 1PL-RT marginal α = .87
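For context, marginal reliability summarizes score-specific IRT precision in a single alpha-like number. One common form divides ability variance by ability variance plus the mean squared standard error; the sketch below uses hypothetical numbers, not the study’s data.

```python
# One common marginal reliability estimate from IRT scoring output.
import numpy as np

rng = np.random.default_rng(2)
theta_hat = rng.normal(0, 1, size=200)  # estimated abilities (hypothetical)
se = rng.uniform(0.3, 0.5, size=200)    # score-specific standard errors

err_var = np.mean(se**2)                # average error variance
rel = theta_hat.var(ddof=1) / (theta_hat.var(ddof=1) + err_var)
print(f"marginal reliability ~ {rel:.2f}")
```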

  34. What this suggests • Accounting for response time of items can improve precision for most participants • Limitations • More difficult to do with younger children • Requires computer delivery to record accuracy and time • Cannot be done with connected text

  35. Validity

  36. Statistical Considerations – Factor Validity • Assessments are measures of hypothetical constructs • Assessments are measured with error • Use a latent variable to leverage the common variance • How is this modeled? • Unidimensional • Multidimensional • Three illustrations • Petscher & Foorman, 2012 (Syntactic Awareness) • Kieffer & Petscher, 2013 (Morphology/Vocabulary) • Justice, Petscher, & Pentimonti, 2013 (Early Literacy)
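As a concrete, simulated illustration of leveraging common variance with a latent variable, the sketch below fits a unidimensional factor model with scikit-learn’s FactorAnalysis. This is a stand-in for exposition only, not the SEM software or data used in the cited studies.

```python
# Unidimensional factor model: four error-prone indicators of one construct.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
f = rng.normal(size=(300, 1))                # one latent construct
loadings = np.array([[0.8, 0.7, 0.9, 0.6]])  # hypothetical loadings
X = f @ loadings + rng.normal(0, 0.5, size=(300, 4))

fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)       # estimated loadings on the single factor
scores = fa.transform(X)    # factor scores, one per examinee
```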

  37. Study 1: Syntactic Awareness

  38. Distribution of Ability

  39. Precision (reliability) of Ability Scores

  40. Predictive Validity of Factor Scores

  41. Study 2: Morphological Awareness/Vocabulary

  42. Morphological Awareness (MA) predicts Reading Comprehension (RC) • We have long known that MA is correlated with reading comprehension (e.g., Carlisle, 2000; Freyd & Baron, 1982; Tyler & Nagy, 1990) • Path diagram: MA → RC
