Introduction to Item Response Theory (IRT)

Presentation Transcript


  1. Introduction to Item Response Theory (IRT) Friday Harbor 2009 Paul K. Crane, MD MPH Dan Mungas, PhD

  2. Disclaimer • Funding for this conference was made possible, in part, by Grant R13 AG030995 from the National Institute on Aging. • The views expressed do not necessarily reflect official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.

  3. Topics • Application: brief introduction to neuropsychology • Application: educational testing • Similarities and differences • IRT is a single factor CFA model • Cool things you can do with IRT that you can’t do easily with classical test theory • Limitations of IRT; extensions to deal with those limitations

  4. Neuropsychology • Tests administered to understand brain functioning • Usually multiple domains are assessed • Typical clinical questions: • Is this person impaired? • If so, what is the diagnosis? • This person has been treated with (medication X, cognitive therapy Y); is it working?

  5. Neuropsychological test interpretation • Consider predicted (“premorbid”) ability and use it as a benchmark for interpreting testing results • Based on history, such as occupational attainment, educational background, or rank in the military • Based on vocabulary testing; vocabulary is relatively preserved in Alzheimer’s disease (AD), though not in other pathologies • Generally assume premorbid ability is consistent across all cognitive domains • Compare results to estimated premorbid ability

  6. Neuropsychological tests • Most developed decades ago, primarily used in clinical rather than research settings • Patterns of deficits • More recent development of epidemiological studies with large batteries of tests • Potential opportunity for modern test theory to improve validity of statistical inference from these sorts of data

  7. Educational testing • A century+ of educational tests • Lord and Novick’s Statistical Theories of Mental Test Scores (1968) • Psychometrics: educational testing as a discipline • Testing companies (ETS, ACT) • State-level testing • Constant generation of new items • Tests change all the time • IRT has emerged as the dominant paradigm for educational tests

  8. Similarities and differences • In both cases, we care about the mental processes that lead to item responses rather than the item responses themselves • Items as indicators of a latent trait or ability • Multiple choice format more common in educational tests • Ordinal formats more common in neuropsychological tests • Item generation less common in neuropsychological tests • Generally many more indicators of each latent trait in educational tests than neuropsychological tests

  9. Can IRT add to usual neuropsychological praxis? • YES!!!

  10. IRT is a single factor Confirmatory Factor Analysis (CFA) model • Developed from separate lines of thinking • Mathematically equivalent • IRT has better developed infrastructure for addressing measurement precision / reliability • Better for interpretability of an individual test taker’s score • Facilitates CAT (computerized adaptive testing, to be discussed later) • Structural equation modeling (SEM) has better developed infrastructure for addressing violations of IRT’s assumptions and for “the structural part,” which may be the primary interest from a research perspective • Infrastructure for addressing measurement precision is being developed (Dr. Curtis)
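To make the mathematical equivalence concrete, here is a minimal sketch (assuming the standard normal-ogive equivalence for a standardized single-factor model, per Takane & de Leeuw, 1987; the example loading and threshold are made up for illustration) converting a CFA loading λ and item threshold τ into IRT discrimination a and difficulty b:

```python
import math

def cfa_to_irt(loading, threshold):
    """Convert a standardized single-factor CFA loading and threshold into
    normal-ogive IRT parameters (assumes the Takane & de Leeuw equivalence;
    the loading must lie strictly between -1 and 1)."""
    a = loading / math.sqrt(1.0 - loading ** 2)  # discrimination (slope)
    b = threshold / loading                      # difficulty (location)
    return a, b

# Hypothetical item with loading 0.7 and threshold 0.35 (illustrative only)
a, b = cfa_to_irt(0.7, 0.35)
print(f"a = {a:.2f}, b = {b:.2f}")  # a ~ 0.98, b = 0.50
```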

  11. Single factor CFA model

  12. IRT assumption: unidimensionality

  13. IRT assumption: local independence

  14. Item characteristic curves

  15. Item characteristic curves: the b parameter (difficulty)

  16. Item characteristic curves: the a parameter (slope)
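Since the figures on these slides do not reproduce here, a minimal sketch of the two-parameter logistic (2PL) item characteristic curve they depict, assuming the usual scaling constant D = 1.7 so the logistic closely approximates the normal ogive; b shifts the curve left and right, and a controls its steepness:

```python
import numpy as np

D = 1.7  # scaling constant; makes the logistic approximate the normal ogive

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(icc_2pl(theta, a=1.0, b=0.0))  # S-shaped curve centered at b = 0
print(icc_2pl(theta, a=2.0, b=0.0))  # larger a: steeper curve, same location
print(icc_2pl(theta, a=1.0, b=1.0))  # larger b: curve shifted right (harder item)
```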

  17. Where do item parameters come from? • Large data set • Need a good distribution of the thing measured by the test (some high functioning, some low functioning) • Do not need to be representative of anything in particular (not “norms”) • IRT package figures out where the people are and where the items are • Iterative EM algorithm • Think of items as beads on horizontal strings that can be moved left-right depending on difficulty
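Calibration software of the kind described here consumes a persons-by-items matrix of 0/1 responses. A minimal sketch of simulating such a matrix under the 2PL model (the sample size and item parameters are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1.7

n_people = 1000
item_a = np.array([1.0, 1.5, 0.8])    # hypothetical slopes
item_b = np.array([-1.0, 0.0, 1.0])   # hypothetical difficulties
theta = rng.normal(size=n_people)     # well-spread abilities, not "norms"

# 2PL probability of a correct response for every person-item pair
p = 1.0 / (1.0 + np.exp(-D * item_a * (theta[:, None] - item_b)))
responses = (rng.random(p.shape) < p).astype(int)  # 1000 x 3 matrix of 0/1s
print(responses[:5])
```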

  18. What do we do with ICCs? • Important picture: the test characteristic curve • TCC is the sum of all of the item characteristic curves • Plot of the standard score associated with each value of the underlying thing measured by the test • Next slides are TCCs of imaginary tests made up of dichotomous items

  19. 4 items each at 0.5 increments
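A minimal sketch of how such a TCC is built: sum the 2PL item characteristic curves, here assuming four equally discriminating items at each difficulty from −1.5 to 1.5 (the exact parameter values are assumptions for illustration):

```python
import numpy as np

D = 1.7

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical bank: 4 items at each difficulty from -1.5 to 1.5 in 0.5 steps
difficulties = np.repeat(np.arange(-1.5, 2.0, 0.5), 4)
theta = np.linspace(-3, 3, 121)

# The TCC is the expected total (standard) score at each ability level
tcc = sum(icc_2pl(theta, a=1.0, b=b) for b in difficulties)
for t in (-2, 0, 2):
    print(t, round(float(np.interp(t, theta, tcc)), 1))  # nearly linear in the middle
```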

  20. Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test!

  21. 2 items each at 0.5 increments

  22. Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test! • But that’s what we said about the last one and it had twice as many items!

  23. Why might we want twice as many items? • Measurement precision / reliability • CTT: summarized in a single number: alpha • IRT: conceptualized as a quantity that may vary across the range of the test • Information • Mathematical relationship between information and standard error of measurement • Intuitively makes sense that a test with 2x the items will measure more precisely / more reliably than a test with 1x the items

  24. Test information curves for those two tests

  25. Standard errors of measurement for the two tests

  26. Comments about these information and SEM curves • Information curves look more different than the SEM curves • Inverse square root relationship • TIC 100 → SEM 0.10 (1/10) • TIC 25 → SEM 0.20 (1/5) • TIC 16 → SEM 0.25 (1/4) • TIC 9 → SEM 0.33 (1/3) • TIC 4 → SEM 0.50 (1/2) • Trade-off between test length and measurement precision • CAT discussion later
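A minimal sketch checking the inverse square root relationship, SEM(θ) = 1/√I(θ), against the values above:

```python
import math

# SEM(theta) = 1 / sqrt(I(theta)) for each information value on the slide
for info in (100, 25, 16, 9, 4):
    print(f"TIC {info:>3} -> SEM {1 / math.sqrt(info):.2f}")
# TIC 100 -> SEM 0.10; TIC 25 -> 0.20; TIC 16 -> 0.25; TIC 9 -> 0.33; TIC 4 -> 0.50
```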

  27. These were highly selected “tests” • It would be possible to design such a test if we started with a robust item pool • Almost certainly not going to happen by accident / history • What are more realistic tests? • First example: items bunched up in the middle

  28. Test characteristic curves for two 26-item dichotomous tests

  29. Comments on these TCCs • Same number of items but very different shapes • Now it may matter whether you use an IRT score or a standard score in analyses • Both ceiling and floor effects

  30. TICs

  31. SEMs

  32. Comments on the TICs and SEMs • Comparing the red test and the blue test: the red test is better for people of moderate ability (more items close to where they are) • For people right in the middle, measurement precision is just as good as with a test twice as long • Items far away from your ability level don’t improve your standard error • The blue test is better for people at the extremes (more items close to where they are)

  33. Where do information curves come from? • Item information curves use the same parameters as the item characteristic curves (difficulty level, b, and strength of association with latent trait or ability, a) (see next slides) • Test information is the sum of all of the item information curves • We can do that because of local independence

  34.–38. Item information curves (figure slides, all built from the same formula): I(θ) = D²·a²·P(θ)·Q(θ), where P(θ) is the probability of a correct response and Q(θ) = 1 − P(θ)
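A minimal sketch of that item information function for a single 2PL item; summing these curves across items (licensed by local independence) yields the test information curve:

```python
import numpy as np

D = 1.7

def item_information(theta, a, b):
    """I(theta) = D^2 * a^2 * P(theta) * Q(theta) for a 2PL item."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)
info = item_information(theta, a=1.2, b=0.5)
print(round(float(theta[np.argmax(info)]), 2))  # peaks at the item's difficulty, b = 0.5
```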

  39. More thoughts on IICs • The b parameters shift the location of the curve left and right (where is the bead on the string?) • The a parameters modify the height of the curve • But not by much; location, location, location • IRT gives us tools to evaluate and manage measurement precision

  40. What about real tests? MMSE

  41. Global cognitive tests: TCC

  42. Global cognitive tests: not IRT

  43. Global cognitive tests: SEM

  44. Comments on global cognitive tests • None of these have hard items, so people with average and high levels of cognition are measured with low precision • Curvilinearity in these tests may be a problem for longitudinal data analyses, especially when people start at different places • For example, education as a risk factor • Varying levels of measurement precision across time

  45. What about specific domains? • Baseline data from ADNI • Assigned items to batteries • Figured out how to generate a TCC vs. a z-score composite (which was not as simple as I had thought)

  46. ADNI memory TCC (figure): from θ = 1 to 0, the composite changes by 0.8 z units; from 0 to −1, by 0.7; from −1 to −2, by 0.5

  47. ADNI Executive Functioning TCC (figure): THAT looks pretty linear

  48. ADNI memory and EF TICs (figure): so EF has linear scaling, but does not have much measurement precision

  49. Summary of this section • IRT provides tools to evaluate and manage measurement precision / standard error of measurement • Test characteristic curve and the test information curve are very helpful in understanding how a test is working • Alongside a histogram of abilities from a population of interest, which are on the same metric

  50. Topics • Application: brief introduction to neuropsychology • Application: educational testing • Similarities and differences • IRT is a single factor CFA model • Cool things you can do with IRT that you can’t do easily with classical test theory • Limitations of IRT; extensions to deal with those limitations
