Test Validity S-005
Validity of measurement
• Reliability refers to consistency
  • Are we getting something stable over time?
  • Internally consistent?
• Validity refers to accuracy
  • Is the measure accurate?
  • Are we really measuring what we want?
Important distinction! The term “validity” is used in two different ways
• Validity of an assessment or method of collecting data
  • The validity of a test or questionnaire or interview
• Validity of a research study
  • Was the entire study of high quality?
  • Did it have high internal and external validity?
Important distinction! The term “validity” is used in two different ways
• Referring to entire studies or research reports:
  • OK: “We examined the internal validity of the study.”
  • OK: “We looked for the threats to validity.”
  • OK: “That study involved randomly assigning students to groups, so it had strong internal validity, but it was carried out in a special school, so it is weak on external validity.”
• Referring to a test or questionnaire or some assessment:
  • OK: “The test is a widely used and well-validated measure of student achievement.”
  • OK: “The checklist they used seemed reasonable, but they did not present any information on its reliability or validity.”
  • NOT: “The test lacked internal validity.” (This sounds very strange to me.)
Types of validity
• Validity – the extent to which the instrument (test, questionnaire, etc.) is measuring what it intends to measure
• Examples:
  • Math test
    • Is it covering the right content and concepts?
    • Is it also influenced by reading level or background knowledge?
  • Attitude assessment
    • Are the questions appropriate?
    • Does it assess different dimensions of attitudes (intensity, direction, etc.)?
• Validity is also assessed in a particular context
  • A test may be valid in some contexts and not in others
  • A questionnaire may be useful with some populations and not so useful with other groups
  • Not: “The test has high validity.”
  • OK: “The test has been useful in assessing early reading skills among native speakers of English.”
Types of validity
• Content validity
  • The extent to which the items reflect a specific domain of content
  • Is the sample of items really representative?
  • Often a matter of judgment
  • Experts may be asked to rate the relevance and appropriateness of the items or questions
    • e.g., rate each item: very important / nice to know / not important
    • (one common way to summarize such ratings is sketched below)
• “Face validity” refers to whether the items appear to be valid (to the test taker or test user)
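One common way to turn expert ratings like “very important / nice to know / not important” into a number is Lawshe's content validity ratio (CVR). The slide does not prescribe any particular statistic, so the panel size and item ratings below are purely hypothetical; this is a minimal sketch of the idea, not a recommended procedure.

```python
# Minimal sketch: Lawshe's content validity ratio (CVR) for summarizing
# expert judgments of item relevance. Panel size and ratings are hypothetical.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR = (n_e - N/2) / (N/2); ranges from -1 to +1, where +1 means
    every panelist rated the item essential."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts: how many rated each item "very important"
ratings = {"item_1": 9, "item_2": 6, "item_3": 3}

for item, n_essential in ratings.items():
    print(f"{item}: CVR = {content_validity_ratio(n_essential, 10):+.2f}")
```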
Types of validity: Criterion-related validity
• Concurrent validity
  • Agreement with a separate measure collected at about the same time
  • Common in educational assessments
  • e.g., Bayley Scales and the Stanford-Binet (S-B) IQ test
  • Complete version and screening test version
  • Issue: Is there really a strong existing measure, a “gold standard,” that we can use for validating a new measure?
• Predictive validity
  • Agreement with some future measure
  • SAT scores and college GPA
  • GRE scores and graduate school performance
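Criterion-related validity is usually reported as a correlation between the measure and its criterion. Below is a minimal sketch of a predictive-validity coefficient using invented SAT scores and later college GPAs; concurrent validity would be computed the same way, just with a criterion collected at about the same time as the test.

```python
import math

# Minimal sketch of a criterion-related (predictive) validity coefficient:
# correlate test scores with a later criterion. The SAT scores and GPAs
# below are invented for illustration.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

sat = [1050, 1210, 980, 1340, 1120, 1480, 1260, 1010]  # predictor (test score)
gpa = [2.8, 3.1, 2.5, 3.6, 3.0, 3.8, 3.3, 2.7]         # criterion (later GPA)

print(f"Predictive validity: r = {pearson_r(sat, gpa):.2f}")
```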
Types of validity (cont.): Construct validity
• Does the measure appear to produce results that are consistent with our theories about the construct?
  • Example: We have a “stage model” of development, so does our measure produce scores/results that look like “stages”?
• Convergent validity
  • Does our measure converge (or agree) with other measures that should be similar?
• Discriminant validity
  • Does our measure disagree (or diverge) where it should be different?
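Convergent and discriminant validity are often summarized by comparing correlations: the new measure should correlate strongly with measures of similar constructs and only weakly with measures of unrelated ones. In the sketch below, the scores and the "anxiety" and "vocabulary" measures are invented placeholders, not data from any real instrument.

```python
import statistics  # statistics.correlation requires Python 3.10+

# Sketch of convergent vs. discriminant validity with invented scores:
# our new scale should agree with a theoretically similar measure and
# not with a theoretically unrelated one.

new_anxiety   = [12, 18, 7, 22, 15, 9, 20, 11]    # our new measure
other_anxiety = [14, 20, 8, 24, 13, 10, 21, 12]   # theoretically similar measure
vocabulary    = [30, 35, 28, 33, 40, 31, 29, 36]  # theoretically unrelated measure

convergent = statistics.correlation(new_anxiety, other_anxiety)
discriminant = statistics.correlation(new_anxiety, vocabulary)

print(f"Convergent validity (should be high):  r = {convergent:.2f}")
print(f"Discriminant validity (should be low): r = {discriminant:.2f}")
```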
McCarthy Screening Test example
• A test for preschool children (ages 2.5 – 8.5)
• Six subtests:
  • Verbal, perceptual-performance, quantitative, general cognitive (composite), memory, motor
• Reliability evidence for using a short version as a screening test
  • Split-half correlations for several scales (r = .60 to .80) (see the sketch below)
  • Test-retest reliability for other scales (on a subset of children) showed a range of correlations, from .32 to .70
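Split-half figures like those above are typically computed by correlating scores on two halves of the test and then applying the Spearman-Brown correction to estimate the reliability of the full-length version. The half scores below are invented for illustration, not McCarthy data.

```python
import statistics  # statistics.correlation requires Python 3.10+

# Sketch of a split-half reliability estimate: correlate odd-item and
# even-item half scores, then apply the Spearman-Brown correction.
# The half scores are invented, not McCarthy data.

odd_half  = [14, 9, 17, 12, 20, 7, 15, 11, 18, 13]
even_half = [13, 10, 16, 11, 19, 8, 14, 12, 17, 12]

r_half = statistics.correlation(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown correction

print(f"Half-test correlation:             r = {r_half:.2f}")
print(f"Estimated full-length reliability: r = {r_full:.2f}")
```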
McCarthy Scales of Children’s Abilities
• Reliability
  • The internal consistency coefficients for the General Cognitive Index (GCI) averaged .93 across 10 age groups between 2.5 and 8.5 years (see the sketch below for how such coefficients are computed).
  • Test-retest reliability of the GCI over a one-month interval was .80. Stability coefficients of the cognitive scales ranged from .62 to .76, with the Motor Scale emerging as the only scale that lacked stability (r = .33).
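Internal consistency coefficients such as the .93 reported for the GCI are typically Cronbach's alpha values. The sketch below shows the computation on a small invented item-by-person score matrix; it does not reproduce any McCarthy data.

```python
import statistics

# Sketch of Cronbach's alpha, the usual internal-consistency coefficient.
# The 8-person, 5-item score matrix below is invented for illustration.

scores = [  # rows = test takers, columns = items
    [3, 4, 3, 5, 4],
    [2, 2, 3, 2, 3],
    [4, 5, 4, 5, 5],
    [1, 2, 1, 2, 2],
    [3, 3, 4, 4, 3],
    [5, 4, 5, 5, 4],
    [2, 3, 2, 3, 2],
    [4, 4, 4, 3, 5],
]

k = len(scores[0])  # number of items
item_vars = [statistics.pvariance(col) for col in zip(*scores)]
total_var = statistics.pvariance([sum(row) for row in scores])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```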
A short version developed as a screening test
Validity information for the short version
• A sample of 60 children with learning disabilities
• On the full version of the entire test:
  • 53 out of 60 (88%) failed at least 2 of the 6 subtests
• On the short version (the proposed screening version):
  • 40 out of 60 (67%) failed (and would be identified)
• Is this enough information?
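One way to see why these marginal pass/fail counts are not enough is to lay out the full cross-tabulation of screening result against full-test result and compute sensitivity and specificity. The slide reports only the marginals (53/60 failed the full test, 40/60 failed the short version), so the cell counts in the sketch below are hypothetical, chosen only to be consistent with those totals.

```python
# Sketch: judging the screening version requires the cross-tabulation of
# short-version results against full-version results. The cell counts below
# are hypothetical, chosen to match the reported marginals (53/60 and 40/60).

true_pos  = 38  # failed both versions (correctly flagged by the screener)
false_pos = 2   # failed the short version but passed the full version
false_neg = 15  # passed the short version but failed the full version (missed)
true_neg  = 5   # passed both versions

sensitivity = true_pos / (true_pos + false_neg)  # flagged among true cases
specificity = true_neg / (true_neg + false_pos)  # cleared among non-cases

print(f"Sensitivity = {sensitivity:.2f}")  # about .72 under these assumed cells
print(f"Specificity = {specificity:.2f}")  # about .71 under these assumed cells
```

Different hypothetical splits of the same marginals give quite different sensitivity and specificity, which is why the reported percentages by themselves do not tell us how well the short version identifies the children flagged by the full test.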