
Characteristics of Successful Assessment Measures


Presentation Transcript


  1. Characteristics of Successful Assessment Measures • Reliable • Valid • Efficient • - Time • - Money • - Resources • Don’t result in complaints

  2. Reliability

  3. What Do We Mean by Reliability? • The extent to which a score from a test is consistent and free from errors of measurement

  4. Methods of Determining Reliability • Test-retest (temporal stability) • Alternate forms (form stability) • Internal reliability (item stability) • Interrater Agreement

  5. Reliability: Test-Retest

  6. Test-Retest Reliability • Measures Temporal Stability • Stable measures • Measures expected to vary • Administration • Same participants • Same test • Two testing periods

  7. Test-Retest Reliability: Scoring • To obtain the reliability of an instrument, the scores at time one are correlated with the scores at time two • The higher the correlation, the more reliable the test
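
To make the scoring step concrete, here is a minimal sketch in Python; the score arrays are invented for illustration, standing in for the two administrations:

```python
# Test-retest reliability: correlate time-1 scores with time-2 scores.
from scipy.stats import pearsonr

time1 = [82, 75, 91, 68, 88, 79, 95, 72]  # same participants, first session
time2 = [80, 78, 89, 70, 85, 81, 93, 74]  # same test, second session

r, _ = pearsonr(time1, time2)  # the Pearson r is the reliability estimate
print(f"Test-retest reliability: r = {r:.2f}")
```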

  8. Test-Retest Reliability: Problems • Sources of measurement error: • - The characteristic or attribute being measured may change over time • - Reactivity • - Carryover effects • Practical problems: • - Time consuming • - Expensive • - Inappropriate for some types of tests

  9. Standard Error of Measurement • Provides a range of estimated accuracy around an obtained score • 1 SEM = 68% confident • 1.96 SEM = 95% confident • The higher the reliability of a test, the lower the standard error of measurement • Formula: SEM = SD × √(1 − reliability)

  10. Example: Mean = 70, SD = 10
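
As a worked illustration (the reliability value is hypothetical here, since the slide preserves only the mean and SD; assume reliability = .90): SEM = 10 × √(1 − .90) ≈ 3.16. An obtained score of 70 would then carry a 68% band of about 66.8 to 73.2 (70 ± 1 SEM) and a 95% band of about 63.8 to 76.2 (70 ± 1.96 SEM).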

  11. Practice Exercise

  12. Exercise Answers

  13. Serial Killer IQ Exercise: Mean = 100, SD = 15, Reliability = .90; IQ of 70 for the death penalty

  14. Serial Killer IQ Answers: Mean = 100, SD = 15, Reliability = .90; IQ of 70 for the death penalty
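
A worked sketch using the SEM formula above: SEM = 15 × √(1 − .90) ≈ 4.74, so a 95% band around an obtained IQ of 70 is 70 ± 1.96 × 4.74 ≈ 70 ± 9.3, i.e., roughly 60.7 to 79.3. An obtained score at the cutoff is therefore consistent, at 95% confidence, with a true IQ on either side of 70.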

  15. Reliability: Alternate Forms

  16. Alternate Forms Reliability • Establishes form stability • Used when there are two or more forms of the same test • - Different questions • - Same questions, but different order • - Different administration or response method (e.g., computer, oral) • Why have alternate forms? • - Prevent cheating • - Prevent carryover effects for people who take a test more than once • GRE or SAT • Promotion exams • Employment tests

  17. Alternate Forms Reliability: Administration • Two forms of the same test are developed and, to the highest degree possible, are equivalent in terms of content, response process, and statistical characteristics • One form is administered to examinees, and at some later date, the same examinees take the second form

  18. Alternate Forms Reliability: Counterbalancing (half the examinees take Form A first and half take Form B first, so that order effects cancel out)

  19. Alternate Forms Reliability: Scoring • Scores from the first form of the test are correlated with scores from the second form • If the scores are highly correlated, the test has form stability

  20. Difference Between Parallel and Equivalent

  21. Alternate Forms Reliability: Disadvantages • Difficult to develop • Content sampling errors • Time sampling errors

  22. What the Research Shows • Computer vs. Paper-Pencil • - Few test score differences • - Cognitive ability scores are lower on the computer for speed tests but not power tests • Item order • - Few differences • Video vs. Paper-Pencil • - Little difference in scores • - Video reduces adverse impact

  23. Reliability: Internal

  24. Internal Reliability • Defines measurement error strictly in terms of consistency or inconsistency in the content of the test • With this form of reliability, the test is administered only once; the analysis measures item stability

  25. Determining Internal Reliability: Split-Half Method • Test items are divided into two equal parts • Scores for the two parts are correlated to get a measure of internal reliability • Need to adjust for the smaller number of items • Spearman-Brown prophecy formula: • (2 × split-half reliability) ÷ (1 + split-half reliability)

  26. Spearman-Brown Formula • Corrected reliability = (2 × split-half correlation) ÷ (1 + split-half correlation) • If we have a split-half correlation of .60, the corrected reliability would be: (2 × .60) ÷ (1 + .60) = 1.2 ÷ 1.6 = .75

  27. Spearman-Brown Formula: Estimating the Reliability of a Longer Test • Estimated reliability = (L × r) ÷ (1 + (L − 1) × r) • L = the number of times longer the new test will be • r = the reliability of the current test

  28. Example • Suppose you have a test with 20 items and it has a reliability of .50. You wonder if using a 60-item test would result in acceptable reliability. • L = 60 ÷ 20 = 3 • Estimated new reliability = (3 × .50) ÷ (1 + (3 − 1) × .50) = 1.5 ÷ 2.0 = .75
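
As a minimal sketch of the prophecy formula in Python (the function name is mine, not from a library), reproducing both worked examples above:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Estimated reliability of a test made `length_factor` times longer."""
    L = length_factor
    return (L * reliability) / (1 + (L - 1) * reliability)

# Split-half correction is the special case L = 2 (slide 26):
print(round(spearman_brown(0.60, 2), 2))  # 0.75
# Tripling a 20-item test with reliability .50 (slide 28):
print(round(spearman_brown(0.50, 3), 2))  # 0.75
```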

  29. Practice

  30. Practice Answers

  31. Common Methods to Determine Internal Reliability • Cronbach’s Coefficient Alpha • - Used with ratio or interval data • Kuder-Richardson Formula (KR-20) • - Used for tests with dichotomous items • yes-no • true-false • right-wrong
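
As a sketch, coefficient alpha can be computed directly from an item-response matrix (the data below are invented); for dichotomous 0/1 items the same computation reduces to KR-20:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: one row per respondent, one column per test item."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Invented ratings from four respondents on a four-item scale:
scores = np.array([[4, 5, 4, 3],
                   [2, 2, 3, 2],
                   [5, 4, 5, 5],
                   [3, 3, 2, 3]])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```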

  32. Interrater Reliability • Used when human judgment of performance is involved in the selection process • Refers to the degree of agreement between 2 or more raters • 3 common methods used to determine interrater reliability • Percent agreement • Correlation • Cohen’s Kappa

  33. Interrater Reliability Methods: Percent Agreement • Determined by dividing the total number of agreements by the total number of observations • Problems • - Exact match? • - Very high or very low frequency behaviors can inflate agreement
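
A minimal sketch of the percent-agreement computation (the ratings are invented), counting only exact matches:

```python
rater_a = [3, 4, 2, 5, 3, 4, 1, 3]
rater_b = [3, 4, 3, 5, 3, 4, 1, 2]

# Agreements divided by total observations:
matches = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Percent agreement: {matches / len(rater_a):.0%}")  # 75%
```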

  34. Interrater Reliability Methods: Correlation • Ratings of two judges are correlated • Pearson for interval or ratio data and Spearman for ordinal data (ranks) • Problems • - Shows pattern similarity but not similarity of actual ratings

  35. Interrater Reliability Methods: Cohen’s Kappa • Measures agreement after adjusting for the level of agreement that would be expected by chance • A kappa of .70 or higher is considered acceptable agreement
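
A minimal sketch of Cohen’s kappa from first principles, with invented ratings (scikit-learn’s cohen_kappa_score gives the same result):

```python
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]

n = len(rater_a)
# Observed agreement:
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
# Agreement expected by chance, from each rater's marginal frequencies:
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"kappa = {kappa:.2f}")
```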

  36. [Worked example: agreement table for Forensic Examiner A vs. Forensic Examiner B]

  37. Example

  38. Demonstration

  39. Increasing Rater Reliability • Have clear guidelines regarding various levels of performance • Train raters • Practice rating and provide feedback

  40. Scorer Reliability • Allard, Butler, Faust, & Shea (1995) • - 53% of hand-scored personality tests contained at least one error • - 19% contained enough errors to alter a clinical diagnosis

  41. Validity • The degree to which inferences from scores on tests or assessments are justified by the evidence

  42. Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. ... The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way, each intended interpretation must be validated. Sources of validity evidence include, but are not limited to: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on consequences of testing. (Standards for Educational and Psychological Testing, 1999)

  43. Common Methods of Determining Validity • Content Validity • Criterion Validity • Construct Validity • Known Group Validity • Face Validity

  44. Validity: Content Validity

  45. Content Validity • The extent to which test items sample the content that they are supposed to measure • In industry, the appropriate content of a test or test battery is determined by a job analysis • Considerations • - The content that is actually in the test • - The content that is not in the test • - The knowledge and skill needed to answer the questions

  46. Test of Logic • Stag is to deer as ___ is to human • Butch is to Sundance as ___ is to Sinatra • Porsche is to cars as Gucci is to ____ • Puck is to hockey as ___ is to soccer • What is the content of this exam?

  47. Messick (1995)Sources of Invalidity • Construct underrepresentation • Construct-irrelevant variance • Construct-irrelevant difficulty • Construct-irrelevant easiness

  48. [Diagram: overlap between Domain Content and Test Content]

  49. Validity: Criterion Validity

  50. Criterion Validity • Criterion validity refers to the extent to which a test score is related to some measure of job performance called a criterion • Established using one of the following research designs: • - Concurrent Validity • - Predictive Validity • - Validity Generalization
