
Lesson 7a: Validity and Reliability


Presentation Transcript


  1. Lesson 7a: Validity and Reliability

  2. Lesson Objectives • Define the terms validity and reliability as they relate to tests. • Identify specific types of test validity and test reliability.

  3. Validity and Reliability of Tests • Why should we be concerned about the validity and reliability of tests? • Currently, a great deal of emphasis is placed on assessment. • If we are not sure that a test (or other type of assessment) is reliable and valid, then we cannot accurately use the results. Results from research using such a test would not be trustworthy and therefore not useful.

  4. Validity and Reliability of Tests • You must prove the validity (accuracy) and reliability (consistency) of the tests you use in your research for your findings to be acceptable.

  5. Test Validity • The validity question for a test/assessment is: Does it measure what it is supposed to measure? • A spelling test should measure spelling ability, a math test math ability, and so on. • An IQ test is supposed to measure intelligence, but some researchers say that it measures the degree of acculturation (how much has been learned about the dominant culture). • Personality tests are used to measure specific facets of our psychological structure.

  6. Test Validity • Suppose a math test had only addition problems. Would it be a good measure of math achievement? • Suppose an AP geography exam had only questions about North America. Would it be a good measure of geography knowledge? • What if a personality test had only self-report items? • OK…you get the idea.

  7. Test Validity • So how is test validity determined (other than common sense)? Answer: By demonstrating that the test has several types of validity.

  8. Test Validity • Types of (and methods of determining) test validity: • Face • Content-related • Criterion-related concurrent • Criterion-related predictive

  9. Test Validity: Face • Face validity is how the test looks. Does it “appear” to measure what you intend?

  10. Test Validity: Face • Suppose you and your colleagues have created a new creativity test (how creative of you!). • How would you show that your test has face validity? • Think about it for a moment, then go to the next slide for some ideas.

  11. Test Validity: Face • Face validity is the weakest, most basic type of validity. If a test simply looks like it measures what you say it measures, it has face validity. • To see if a test of creativity has face validity, you would read it to see if all the questions ask about aspects of creativity and not math problems, historical facts, or driving a car (although you could be creative in all those areas).

  12. Test Validity: Content-related • Content-related validity refers to how well the test questions represent the domain that is being measured. • This type of validity is judged by experts in the field. A test publisher would hire one or more consultants to assure that a new test is valid. • It is also judged by a physical comparison of the content of the test items and the content of the source (e.g., the lesson that was taught, the book that was used, etc.).

  13. Test Validity: Content-related • How would you show that your test has content-related validity? • Think about it for a moment, then go to the next slide for some ideas.

  14. Test Validity: Content-related • You would probably need to hire experts (consultants or maybe professors) who can show they have expertise in the assessment of creativity. • They would look at your test and verify (or not) that it really measures creativity – that it has content-related validity.

  15. Test Validity: Content-related • If you were conducting a study on improving creativity, you could place the test next to your instruction plan, PowerPoint slides, lecture notes, reading assignments, or whatever materials were used to teach creativity in this example.

  16. Test Validity: Concurrent • Concurrent validity (also called criterion-related concurrent validity) means that your test measures the same way as another test in the same area that has already been proven to be valid. • This type of validity is established by giving a group of people two tests – yours and the one already validated – and correlating the scores. • A strong, significant correlation means your test is valid.

  17. Test Validity: Concurrent • Suppose you have created a new reading comprehension test for special education students. • How will you show concurrent validity? • Think about it for a moment and look on the next slide for possible answers.

  18. Test Validity: Concurrent • The simplest way to establish concurrent validity for your new test is to get a group of special education students and give them two tests – yours and another, established reading test that would be appropriate for them. • A correlation would be calculated, and if it is significant, your test would have concurrent validity.
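
A minimal sketch of that calculation in Python, assuming scipy is available; the scores below are invented purely for illustration:

```python
# Hypothetical example: concurrent validity as a correlation between a new
# test and an already-validated test, taken by the same group of students.
from scipy.stats import pearsonr

new_test    = [12, 18, 9, 22, 15, 20, 11, 17]   # scores on the new reading test
established = [14, 19, 10, 24, 13, 21, 12, 18]  # scores on the validated test

r, p = pearsonr(new_test, established)
print(f"r = {r:.2f}, p = {p:.3f}")  # a strong, significant r supports concurrent validity
```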

  19. Test Validity: Predictive • Predictive validity (i.e., criterion-related validity – predictive) is when you have evidence that the scores on a test accurately predict another outcome or a relationship.

  20. Test Validity: Predictive • Some of the most common “tests” used to predict specific outcomes include: • A typing/keyboarding test to predict how fast and accurately someone applying for a transcriber’s position could transcribe recorded meeting notes. If the test results were strongly correlated with the job output of the person hired, we would be able to say that the test had strong predictive validity. • SAT scores are used to predict college success. And they are predictive, but only for the first year or so (so they have predictive validity for the first year, but not for college success overall).
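
As a sketch, the predictive case is checked the same way, correlating the predictor with the later outcome (numbers invented, scipy assumed):

```python
# Hypothetical example: predictive validity — do typing-test scores at hiring
# (the predictor) correlate with later on-the-job output (the criterion)?
from scipy.stats import pearsonr

typing_test = [45, 60, 52, 70, 38, 65]   # words per minute on the hiring test
job_output  = [48, 63, 50, 72, 40, 61]   # pages transcribed per week, months later

r, p = pearsonr(typing_test, job_output)
print(f"predictive validity coefficient: r = {r:.2f} (p = {p:.3f})")
```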

  21. Self-Check • Which of the following is the best question to ask yourself about test validity? • Does the test measure what it is supposed to measure? • Can I apply the test results to my situation?

  22. Self-Check • Which of the following is the best question to ask yourself about test validity? • Does the test measure what it is supposed to measure? • Can I apply the test results to my situation? Although this may be a valid question, the first choice is the “official” definition of test validity.

  23. Self-Check • You want to establish concurrent (criterion-related concurrent) validity for your new test. What should you do? • Ask experts in the field to look at your test. • Give people your test and another, already validated test that measures the same thing. • Perform a literature review to show the validity of your test.

  24. Self-Check • You want to establish concurrent (criterion-related concurrent) validity for your new test. What should you do? • Ask experts in the field to look at your test. • Give people your test and another, already validated test that measures the same thing. (correct answer) • Perform a literature review to show the validity of your test.

  25. Self-Check • Which threat to internal validity comes from using a test that is not valid? • Selection • Instrumentation • Pretesting

  26. Self-Check • Which threat to internal validity comes from using a test that is not valid? • Selection • Instrumentation (correct answer) • Pretesting

  27. Test Reliability • The reliability of a test is related to its consistency. • Think of your car. You want it to be reliable – start all the time, turn when you turn the steering wheel, stop when you press the brakes. • A reliable test is consistent in the way it measures.

  28. Test Reliability • Test reliability is usually expressed as the degree of relationship between two sets of scores (a correlation). • Correlations range from –1 to 1, but test reliability measures are positive values only. • The number that is reported for test reliability is generally a positive decimal number such as .78 or .36. • The higher the number, the more reliable the test.

  29. Test Reliability • There are several types of and ways to measure reliability: • Stability (also called Test/Retest) • Equivalence • Equivalence & Stability • Internal Consistency • Interrater Reliability (when rating scales are used)

  30. Test Reliability • Stability means that as a test is given over and over, the way it measures remains consistent (in other words, the results remain stable). • It is established by giving a test twice to a single group of people (with some time in between) and correlating the two sets of scores. • If the correlation is significant, then the test is reliable in terms of stability.
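
A minimal sketch of the test-retest calculation, assuming numpy and invented scores:

```python
# Hypothetical example: stability (test-retest) reliability — one group takes
# the same test twice, with time in between, and the two sets of scores are correlated.
import numpy as np

time1 = [80, 72, 91, 65, 88, 77]   # first administration
time2 = [82, 70, 89, 68, 85, 79]   # second administration, weeks later

r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")
```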

  31. Test Reliability • Equivalence is a kind of reliability that needs to be established when there are several forms of the same test. • Having multiple forms of a test is very common among widely used commercial tests (GRE, SAT, WISC, WRAT, etc.).

  32. Test Reliability • Equivalence is established by giving two (or more) forms of the test to one group of people. • The scores are correlated—a significant correlation means that the various forms of the test are equivalent.

  33. Test Reliability • Equivalence and Stability means just what you think – two or more forms of a test are given twice (with time in between) to the same group of people. • This procedure demonstrates that the various forms of the test are stable over time and that they are equivalent to each other.

  34. Test Reliability • Equivalence and Stability • There is also a very weak form of equivalence and stability, one that I recommend you never use: giving one form of a test to a group of people and giving another/different form of the test to the same group at a later date. I’ve included this for information only, in case you read about it. It truly is a very weak method for proving reliability.

  35. Test Reliability • Internal Consistency is a popular form of reliability, because it involves giving a test only once to one group of people and conducting a statistical analysis of their answers. • There are several variations: • Split-Half • Kuder-Richardson • Cronbach Alpha • (and others)

  36. Test Reliability • Split-Half reliability is determined by giving a test to a group of people, then splitting the questions in half in some way (e.g., every other question). • The scores on the two halves are correlated, and a significant correlation means a reliable test.
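
A sketch of the split-half calculation (odd vs. even items) with invented 0/1 item data; the final step is the standard Spearman-Brown correction for halving the test, which the slide does not mention but which is usually applied:

```python
# Hypothetical example: split-half reliability on a right/wrong test.
import numpy as np

# Rows = people, columns = items (1 = correct, 0 = wrong); data invented.
item_scores = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0, 1, 0],
])

odd_half  = item_scores[:, 0::2].sum(axis=1)   # total on items 1, 3, 5, 7
even_half = item_scores[:, 1::2].sum(axis=1)   # total on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```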

  37. Test Reliability • Kuder-Richardson (KR) reliability begins the same way – giving one test to one group of people. • NOTE: To use KR, the test must have right and wrong answers. • A computer then calculates, in effect, the average correlation across every possible way of splitting the questions in half, producing a single reliability coefficient.
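
A hedged sketch of the standard KR-20 formula (data invented; conventions for the variance denominator vary slightly between sources):

```python
# Hypothetical example: KR-20 for a dichotomous (right/wrong) test.
import numpy as np

def kr20(item_scores):
    """Kuder-Richardson 20; rows = people, columns = items scored 0/1."""
    k = item_scores.shape[1]                         # number of items
    p = item_scores.mean(axis=0)                     # proportion correct per item
    q = 1 - p                                        # proportion wrong per item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

scores = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1]])
print(f"KR-20 = {kr20(scores):.2f}")
```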

  38. Test Reliability • Cronbach Alpha (α) reliability works the same way as Kuder-Richardson, but is used on tests which have no right or wrong answers (agree/disagree, etc.).
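
A matching sketch of Cronbach Alpha, which substitutes item variances for the p × q term so it also works on rating-scale items (data invented):

```python
# Hypothetical example: Cronbach Alpha for agree/disagree-style (Likert) items.
import numpy as np

def cronbach_alpha(item_scores):
    """Rows = people, columns = items (e.g., 1-5 ratings)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```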

  39. Test Reliability • Interrater reliability is used to establish the reliability of a rating scale (such as a rubric or behavior observation). • The rating scale is used by several raters on several individuals or projects. • If the scores from the various raters are significantly correlated, the rating scale is said to be reliable.
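
A minimal two-rater sketch (scores invented); with more than two raters, an intraclass correlation or a kappa statistic would be the more usual tool:

```python
# Hypothetical example: interrater reliability — two raters apply the same
# rubric to the same six projects, and their ratings are correlated.
from scipy.stats import pearsonr

rater_a = [4, 3, 5, 2, 4, 3]   # rater A's rubric scores
rater_b = [4, 2, 5, 3, 4, 3]   # rater B's scores for the same projects

r, p = pearsonr(rater_a, rater_b)
print(f"interrater reliability: r = {r:.2f} (p = {p:.3f})")
```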

  40. Self-Check • Which of the following best defines test reliability? • Test measures consistently over time • Test measures what it is supposed to measure

  41. Self-Check • Which of the following best defines test reliability? • Test measures consistently over time (correct answer) • Test measures what it is supposed to measure

  42. Self-Check • Which of the following methods of determining reliability is done by giving the same test twice to one group of people? • Stability • Equivalence • Internal Consistency

  43. Self-Check • Which of the following methods of determining reliability is done by giving the same test twice to one group of people? • Stability (correct answer) • Equivalence • Internal Consistency

  44. Self-Check • True or false: To determine reliability through internal consistency, it is only necessary to give a test once to one group of people. • True • False

  45. Self-Check • True or false: To determine reliability through internal consistency, it is only necessary to give a test once to one group of people. • True (correct answer) • False

  46. Practice Problems: Validity For each of the following problems, read the scenario and figure out which type of validity is described. The correct answer will appear on the next slide.

  47. Practice Problems: Validity 1. A teacher compares his curriculum objectives to the new district-wide authentic assessment instrument to evaluate whether it is a good match for what he teaches.

  48. Practice Problems: Validity • Content-related

  49. Practice Problems: Validity 2. Mr. Crossly sees two geology questions on his 11th grade American History final exam.

  50. Practice Problems: Validity • This test does NOT have face validity.
