
Application of Item Response Theory to PRO Development


Presentation Transcript


  1. Application of Item Response Theory to PRO Development Michael A. Kallen, PhD, MPH Associate Professor, Research Faculty Department of Medical Social Sciences Feinberg School of Medicine Northwestern University Chicago, Illinois

  2. Outline • The new measurement • Constructing measures • Extending the usefulness of measurement

  3. Outline • The new measurement • Constructing measures • Extending the usefulness of measurement

  4. To see what’s new… • The Lower Extremity Function Scale (LEFS) • from a “classical” perspective

  5. Original LEFS Instructions “We are interested in knowing whether you are having any difficulty at all with the activities listed below because of your lower limb problem for which you are currently seeking attention. Please provide an answer for each activity.” “Today, do you or would you have any difficulty at all with: (Circle one number on each line)”

  6. Measurement products needed? • All of a patient's individual item scores, • so as to obtain the patient's total score

  7. The burden of measurement • For the healthcare provider/researcher: • instrument administration • score summation • score interpretation • For the patient: • The actual time (and thoughtfulness) it requires to respond • Being asked inappropriate questions

  8. The cost of this burden? • For the healthcare provider/researcher: • Perhaps a loss of willingness to use the instrument to collect data, score responses, or interpret a patient’s self-reported condition. • For the patient: • Perhaps a loss of willingness to respond in a focused, honest way to an instrument that seems unresponsive or even annoying.

  9. Classical psychometrics • Its beginnings go back to the turn of the 20th century. • Consider Spearman’s work in disattenuating the correlation coefficient. • Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 1907.

  10. Psychometrics’ First “Golden Years” • The zenith of classical psychometrics arguably occurred during the time period surrounding the publication of Gulliksen’s book, Theory of Mental Tests (1950). • This time period saw the development of many of the best and brightest aspects of classical psychometrics: • e.g., 1) that measurements, not instruments, have psychometric properties; 2) the introduction of alpha reliability; 3) validity as the sine qua non of measurement.

  11. Classical Test Theory (CTT) • CTT is a cornerstone of classical psychometrics. • It is theory-based measurement: • Individual scores are theory-defined as composed of a true score component and an error component. • That is, • observed score = true score + error
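The decomposition above can be illustrated numerically; a minimal simulation sketch, in which the score distributions, sample size, and variable names are invented for illustration (not taken from the presentation):

```python
import random

# CTT sketch: observed score = true score + error.
# Reliability is the share of observed-score variance due to true scores.
random.seed(0)
n = 1000
true_scores = [random.gauss(50, 10) for _ in range(n)]  # fixed person traits
errors = [random.gauss(0, 5) for _ in range(n)]         # measurement noise
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# With true-score SD 10 and error SD 5, reliability is about
# 100 / (100 + 25) = 0.8.
reliability = variance(true_scores) / variance(observed)
```

With these invented SDs, the ratio lands near 0.8: the proportion of observed-score variance that is true-score variance.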

  12. CTT as Theory • Because it is theory, CTT or “true score theory” does not provide: • hypotheses that are testable, or • models that are falsifiable.

  13. CTT and Circular Definitions • In CTT, item difficulty is defined as: • the proportion of examinees in a group of interest who answer an item correctly. • Thus, an item being “hard” or “easy” depends on the ability of the group of examinees measured. • As a result, it is a challenge to determine score meaning: • Examinee and test characteristics are entangled; • each can be interpreted only in the context of the other.
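The circularity is easy to see in code; a sketch using two hypothetical response patterns (the groups and values are invented):

```python
# CTT item difficulty: the proportion of the group answering correctly.
# The SAME item is scored against two hypothetical groups
# (1 = correct, 0 = incorrect).
def p_value(responses):
    return sum(responses) / len(responses)

high_ability_group = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
low_ability_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

difficulty_in_high = p_value(high_ability_group)  # item looks "easy": 0.8
difficulty_in_low = p_value(low_ability_group)    # item looks "hard": 0.3
```

The identical item scores p = 0.8 in the able group and p = 0.3 in the less able one, so "difficulty" describes the pairing of item and group, not the item alone.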

  14. Scores depend on the difficulty of test items • [Figure: the same person has an expected score of 8 on a very easy test, 5 on a medium test, and 0 on a very hard test] • Reprinted with permission from: Wright, B.D. & Stone, M. (1979). Best test design. Chicago: MESA Press, p. 5.
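The same point can be reproduced with a small calculation; a sketch using Rasch-style endorsement probabilities, where the item difficulty values are invented for illustration:

```python
import math

def expected_score(theta, difficulties):
    """Expected raw score: sum of Rasch endorsement probabilities."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

person = 0.0  # one fixed person, at trait level 0
easy_test = [-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5]
medium_test = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
hard_test = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

easy = expected_score(person, easy_test)      # close to the maximum of 8
medium = expected_score(person, medium_test)  # mid-range, about 4
hard = expected_score(person, hard_test)      # close to 0
```

The person's trait never changes, yet the expected raw score swings from near the ceiling to near the floor depending only on which test is administered.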

  15. The New Measurement • In the past 40-plus years, measurement has undergone a quiet evolution.

  16. Modern Psychometrics • In 1968 Lord and Novick’s book, Statistical Theories of Mental Test Scores, was published. • In it, model-based measurement, a foundation of modern psychometrics, was introduced.

  17. Item Response Theory (IRT) • IRT, a form of model-based measurement, now plays a fundamental role in modern psychometrics. • It addresses CTT’s major shortcomings: • By providing test-independent and group-independent measurement; • by employing models that can be tested and falsified. • If a proposed IRT model does NOT adequately explain the data at hand, it can be determined whether assumptions were met or an inappropriate model was used.

  18. CTT vs. IRT • CTT, from classical psychometrics, is • theory-based, • sample and test-specific, and • focuses on test performance. • IRT, from modern psychometrics, is • model-based, • ability or functioning level-specific, and • focuses on item performance.
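IRT's focus on item performance is typically expressed through an item response function; a sketch of the common two-parameter logistic (2PL) model, with parameter values invented for illustration:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of endorsing an item at
    trait level theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta equal to the item's difficulty, endorsement probability is 0.5.
p_at_difficulty = icc_2pl(theta=0.0, a=1.5, b=0.0)

# A harder item (b = 1.0) is less likely to be endorsed by the same person.
p_harder_item = icc_2pl(theta=0.0, a=1.5, b=1.0)
```

Because each item carries its own parameters, item and person characteristics are separated: the curve describes the item's behavior at any trait level, independent of which group happens to respond.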

  19. Outline • The new measurement • Constructing measures • Extending the usefulness of measurement

  20. IRT approach • For any approach • Getting the measure right helps in getting the measurement right • For the IRT approach • There are now accepted, recommended steps for getting a measure right

  21. Adapteval-HIV Project (Yang, Kallen) • Goal: To develop and evaluate a set of HIV-specific item banks to support • IRT-based CAT assessment • Implementation on multiple technology platforms (web/phone/PDA)

  22. Item Response Theory • [Figure: persons and items placed along the same latent trait continuum; persons range from poor to good trait status, and item locations range from easy to hard]

  23. IRT assumptions • Unidimensionality • Measure one “thing” only • Monotonicity • The “better” the trait status on a single scale item, the “better” the trait status on the overall scale • Local independence • Items are independent of each other statistically, after controlling for shared dimensionality

  24. Monotonicity • The “better” the trait status on a single scale item, the “better” the trait status on the overall scale
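A rough way to screen an item for monotonicity; a sketch that sorts each item's scores by the rest score (total minus that item), using invented response data (rows are patients, columns are 0-4 item responses):

```python
# Monotonicity check sketch: an item's score should not decrease as the
# rest score increases. All data invented for illustration.
data = [
    [0, 1, 0, 1],
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 3, 4, 3],
    [4, 4, 3, 4],
]

def is_monotone(data, item):
    """True if the item's score never decreases as the rest score rises."""
    pairs = sorted((sum(row) - row[item], row[item]) for row in data)
    scores = [s for _, s in pairs]
    return all(a <= b for a, b in zip(scores, scores[1:]))

ok_item = is_monotone(data, 0)       # behaves monotonically
flagged_item = is_monotone(data, 2)  # score dips at the top: flag it
```

Operational checks (e.g., Mokken scaling) are more elaborate, but the logic is the same: patients with better overall trait status should not score worse on the item.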

  25. Local Independence • Items are independent of each other statistically, after controlling for shared dimensionality • Components of a scale item’s variance • Required: shared variance, in common with the scale’s unidimensional factor • Expected: some residual or noise/error variance • Problem: If the residual variance of one item is correlated with that of another item, at some point the variance is no longer just noise
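The residual-correlation idea can be sketched with simulated data. For simplicity this sketch partials out the common factor directly (in practice the factor would be estimated); all values are invented:

```python
import random

# Local dependence sketch: two items that share variance beyond the common
# factor (e.g., near-identical wording) stay correlated even after the
# factor is controlled for.
random.seed(1)
n = 500
factor = [random.gauss(0, 1) for _ in range(n)]
extra = [random.gauss(0, 1) for _ in range(n)]  # wording-specific variance

item_a = [f + 0.8 * e + random.gauss(0, 0.5) for f, e in zip(factor, extra)]
item_b = [f + 0.8 * e + random.gauss(0, 0.5) for f, e in zip(factor, extra)]
item_c = [f + random.gauss(0, 0.5) for f in factor]  # a "clean" item

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y with the control variable z partialled out."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / (((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5)

dependent_pair = partial_corr(item_a, item_b, factor)  # stays large
clean_pair = partial_corr(item_a, item_c, factor)      # near zero
```

After the shared dimension is removed, the locally dependent pair still correlates substantially, while the clean pair's residual correlation hovers near zero; that leftover association is the "no longer just noise" problem from the slide.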

  26. Advantages of IRT approach to measurement • Focus is on the item, not the scale • Each item possesses trait estimation capacities • Provides item- and group-independent measurement • Not tied to sample or particular items used • Makes computer adaptive testing a reality • Accumulation of detailed knowledge about individual items and their functioning • Customized item presentation reduces the number of patient responses needed to achieve measurements of similar quality

  27. Adapteval-HIV: 14 item “sets” • Pain • Fatigue • Sleep • Emotion Functioning • Cognitive Functioning • Self Care/Daily Living • Physical Act./Leisure • Life Satisfaction • Body Image • Physical Symptoms • Sat w/ Medical Care • Work/Employment • Neg. Social Issues • Pos. Social Exp.

  28. Measure development process • Identify initial set of items (adapted from 9 existing instruments) • (248 items) • Panel (20 docs/nurses/patients/psychosocial) evaluation • (226 items) • Panel evaluation (+ input from 3 psychosocial researchers) • (192 items) • Pilot test: 50 patients • Analysis of floor/ceiling, missing data, low SD • (146 items) • Primary data collection: 400 patients • Primary psychometric analysis • (107 items) • IRT parameter calibration to implement CAT

  29. Psychometric analyses: item exclusion/inclusion • Primary Criteria • Missingness > 20% (2 items excluded) • CFA factor loadings < .50 (16 excluded) • Local dependence > .20 (5 excluded) • Lack of monotonicity (9 excluded) • Potential item bias (3 excluded) • Secondary Criteria • Multi-dimensionality (3 excluded) • Failed IRT parameter convergence (2 excluded)
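Applying the primary criteria can be sketched as a simple filter. The thresholds below follow the slide; the item names and per-item statistics are invented for illustration:

```python
# Hypothetical item table: each dict holds one item's screening statistics.
items = [
    {"id": "pain_01", "missing": 0.03, "loading": 0.72,
     "local_dep": 0.05, "monotone": True},
    {"id": "pain_02", "missing": 0.25, "loading": 0.61,
     "local_dep": 0.04, "monotone": True},   # fails missingness
    {"id": "pain_03", "missing": 0.01, "loading": 0.42,
     "local_dep": 0.08, "monotone": True},   # fails CFA loading
    {"id": "pain_04", "missing": 0.02, "loading": 0.66,
     "local_dep": 0.31, "monotone": False},  # fails dependence + monotonicity
]

def passes_primary_criteria(item):
    return (item["missing"] <= 0.20       # missingness > 20% -> exclude
            and item["loading"] >= 0.50   # CFA loading < .50 -> exclude
            and item["local_dep"] <= 0.20 # local dependence > .20 -> exclude
            and item["monotone"])         # non-monotone -> exclude

retained = [it["id"] for it in items if passes_primary_criteria(it)]
```

Of the four hypothetical items, only the first survives all primary criteria; in the actual project, the same style of screening reduced the pool from 146 to 107 items.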

  30. Project History • Phase I • 50 patients at Northwestern • Web/phone • Prelim psychometric analyses (cognitive interviews, etc.) • Phase II • 225 each at Northwestern and UIC • Web/phone/PDA • Complete psychometric analyses

  31. Ethnicity/race and gender characteristics of primary data collection (NU + UIC)

  32. Adapteval-HIV question distribution • Overall Total: Pool (107)

  33. Ceiling and Floor Effects • Aiming at < 20% *Entries in red denote the values falling within the expected ranges.

  34. CFA Goodness of Fit (Standards for Unidimensionality) • Root Mean Square Error of Approximation (RMSEA) < 0.10 • Comparative Fit Index (CFI) > 0.90 • Standardized Root Mean Square Residual (SRMR) < 0.08 *Entries in red denote the values falling within the expected ranges.

  35. Scale Inter-Correlations • Inter-correlations ranged from 0.069 to 0.710 • *Entries in red denote the values falling within the expected ranges (i.e., < 0.90).

  36. Internal Consistency • Group comparison: Cronbach's alpha > 0.7 • Individual comparison: Cronbach's alpha > 0.9 *Entries in red denote the values falling within the expected ranges.
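Cronbach's alpha can be computed directly from a patients-by-items score matrix; a sketch with invented data (5 patients, 3 items):

```python
def cronbach_alpha(rows):
    """Cronbach's alpha from a list of per-patient item-score rows."""
    k = len(rows[0])  # number of items
    def var(xs):      # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in rows]) for i in range(k)]
    total_var = var([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [  # hypothetical responses on a 1-5 scale
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 4],
    [4, 4, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(scores)
```

These three tightly related items yield an alpha above 0.9, clearing even the individual-comparison threshold from the slide; real scales rarely cohere this neatly.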

  37. Computerized Adaptive Testing • 1. Begin with initial trait estimate • 2. Select & present optimal scale item • 3. Record and score response • 4. Re-estimate trait • 5. Is stopping rule satisfied? If no, return to step 2; if yes, end of assessment (6) • 7. End of battery? If no, administer next scale item (8); if yes, stop (9) • Source: Wainer et al. (2000). Computerized Adaptive Testing: A Primer, 2nd Ed. Mahwah, NJ: LEA.
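The flowchart's inner loop can be sketched end to end; a toy 2PL CAT with maximum-information item selection, a coarse grid-search trait estimate, and a standard-error stopping rule. The item bank, SE cutoff, and simulated respondent are all invented for illustration:

```python
import math

# Hypothetical 5-item 2PL bank: a = discrimination, b = difficulty.
BANK = [{"a": a, "b": b} for a, b in
        [(1.8, -1.5), (1.5, -0.5), (2.0, 0.0), (1.6, 0.7), (1.9, 1.4)]]

def prob(theta, item):
    """2PL endorsement probability at trait level theta."""
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))

def info(theta, item):
    """Fisher information the item contributes at theta."""
    p = prob(theta, item)
    return item["a"] ** 2 * p * (1 - p)

def estimate(responses):
    """Grid-search likelihood estimate of theta and its standard error."""
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(t):
        return sum(math.log(prob(t, it)) if r else math.log(1 - prob(t, it))
                   for it, r in responses)
    theta = max(grid, key=loglik)
    se = 1.0 / math.sqrt(sum(info(theta, it) for it, _ in responses))
    return theta, se

def run_cat(answer, se_cutoff=0.7):
    theta, responses, remaining = 0.0, [], list(BANK)  # step 1
    while remaining:
        item = max(remaining, key=lambda it: info(theta, it))  # step 2
        remaining.remove(item)
        responses.append((item, answer(item)))                 # step 3
        theta, se = estimate(responses)                        # step 4
        if se < se_cutoff:                                     # step 5
            break
    return theta, len(responses)

# Deterministic "respondent": endorses any item with difficulty below 0.5.
theta_hat, n_used = run_cat(lambda item: item["b"] < 0.5)
```

With this bank and respondent, the SE rule fires after four of the five items, mirroring the CAT-vs-full-pool savings reported on the following slides.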

  38. CAT performance: # of questions • Overall Total: Pool (107), Average (61.1)

  39. Static (Full) vs. CAT/IRT • Correlation of assessment scores between full instrument and CAT/IRT implementation • Body Image (4 items, avg. 3.7): .99 • Cognitive Functioning (9 / 4.9): .91 • Emotional Functioning (23 / 4.9): .87

  40. Relief of measurement burden • For the healthcare provider/researcher, computerizing the test can relieve some of the burden. • Use a scoring algorithm. • Deliver score-specific interpretation. • This could be accomplished with a computer-administered test.

  41. Relief for respondents? • Yes, if the measure is presented as a computer adaptive test (CAT), • i.e., the test adapts itself or customizes itself according to the responses presented to it by an individual patient.

  42. CATs and Item Structure • Item presentation order • CTT: standardized • All patients start at Item #1 and complete all items in order • IRT: customized • Patients are presented new items based on their responses to previous items • IRT “logic” • There is an underlying hierarchy of activities. • Activities can be ordered (“calibrated”) from easiest to hardest.

  43. LEFS item hierarchy (calibration: 0-100 scale) • Making sharp turns while running fast • Running on uneven ground • Running on even ground • Hopping • Walking a mile • Performing your usual hobbies, recreational or sporting activities • Squatting • Standing for 1 hour • Performing heavy activities around your home • Going up or down 10 stairs (about 1 flight of stairs) • Performing any of your usual work, housework, or school activities • Walking 2 blocks • Lifting an object, like a bag of groceries from the floor • Getting into or out of a car • Getting into or out of the bath • Performing light activities around your home • Putting on your shoes or socks • Walking between rooms

  44. The same LEFS item hierarchy, viewed over the 30-65 range of the calibration scale

  45. CATs • CATs have starting rules. • LEFS CAT: Begin with an item of moderate difficulty. • And CATs have stopping rules. • LEFS CAT: • When the SEM < 4 (score range: 0-100), or • when the average score change for last 3 score estimates < 1, or • when all LEFS items are completed.

  46. CAT simulation: Focus on item selection • Item Pool: 18 items from the Lower Extremity Function Scale (LEFS)
