
Validity/Reliability



  1. Validity/Reliability

  2. Reliability From the perspective of classical test theory, an examinee's obtained test score (X) is composed of two components, a true score component (T) and an error component (E): X=T+E

  3. Reliability The true score component reflects the examinee's status with regard to the attribute that is measured by the test, while the error component represents measurement error. Measurement error is random error. It is due to factors that are irrelevant to what is being measured by the test and that have an unpredictable (unsystematic) effect on an examinee's test score.

  4. Reliability The score you obtain on a test is likely to be due both to the knowledge you have about the topics addressed by exam items (T) and the effects of random factors (E) such as the way test items are written, any alterations in anxiety, attention, or motivation you experience while taking the test, and the accuracy of your "educated guesses."

  5. Reliability Whenever we administer a test to examinees, we would like to know how much of their scores reflects "truth" and how much reflects error. It is a measure of reliability that provides us with an estimate of the proportion of variability in examinees' obtained scores that is due to true differences among examinees on the attribute(s) measured by the test.

  6. Reliability When a test is reliable, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym for reliability (e.g., Anastasi, 1988). Consistency = Reliability

  7. The Reliability Coefficient Ideally, a test's reliability would be calculated by dividing true score variance by the obtained (total) variance to derive a reliability index. This index would indicate the proportion of observed variability in test scores that reflects true score variability. True Score Variance/Total Variance = Reliability Index
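
The logic of this ratio can be illustrated with a small simulation in which the true scores are known by construction. This is a hypothetical sketch (the NumPy code and the chosen variances are illustrative assumptions, not part of the original presentation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 examinees: known true scores plus random measurement error.
true_scores = rng.normal(loc=50, scale=10, size=1000)   # T
error = rng.normal(loc=0, scale=5, size=1000)           # E
obtained = true_scores + error                          # X = T + E

# Reliability index: true score variance divided by obtained (total) variance.
reliability_index = true_scores.var() / obtained.var()
print(round(reliability_index, 2))  # close to 100 / (100 + 25) = 0.80
```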

  8. The Reliability Coefficient A test's true score variance is not known, however, so reliability must be estimated rather than calculated directly. There are several ways to estimate a test's reliability. Each involves assessing the consistency of an examinee's scores over time, across different content samples, or across different scorers. The common assumption underlying each of these methods is that consistent variability is true score variability, while inconsistent variability reflects random error.

  9. The Reliability Coefficient Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient that ranges in value from 0.0 to + 1.0. When a test's reliability coefficient is 0.0, this means that all variability in obtained test scores is due to measurement error. Conversely, when a test's reliability coefficient is + 1.0, this indicates that all variability in scores reflects true score variability.

  10. The Reliability Coefficient The reliability coefficient is symbolized with the letter "r" and a subscript that contains two of the same letters or numbers (e.g., "rxx"). The subscript indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.

  11. The Reliability Coefficient Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score variability. For example, as depicted in Figure 1, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 - .84) is due to measurement error. [Figure 1. Proportion of variability in test scores: true score variability (84%), error (16%).]

  12. The Reliability Coefficient Note that a reliability coefficient does not provide any information about what is actually being measured by a test! A reliability coefficient only indicates whether the attribute measured by the test, whatever it is, is being assessed in a consistent, precise way. Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test's validity.

  13. The Reliability Coefficient Study Tip: Remember that, in contrast to other correlation coefficients, the reliability coefficient is never squared to interpret it but is interpreted directly as a measure of true score variability. A reliability coefficient of .89 means that 89% of variability in obtained scores is true score variability.

  14. Methods for Estimating Reliability The selection of a method for estimating reliability depends on the nature of the test. Each method not only entails different procedures but is also affected by different sources of error. For many tests, more than one method should be used.

  15. 1. Test-Retest Reliability: The test-retest method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. When using this method, the reliability coefficient indicates the degree of stability (consistency) of examinees' scores over time and is also known as the coefficient of stability.
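
In practice, the coefficient of stability is simply the Pearson correlation between the two administrations. A minimal sketch in Python, using invented scores for five examinees:

```python
import numpy as np

# Scores for the same five examinees on two administrations of the same test.
time_1 = np.array([72, 85, 90, 64, 78])
time_2 = np.array([70, 88, 87, 66, 80])

# Test-retest reliability (coefficient of stability) is the correlation
# between the two sets of scores.
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 2))
```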

  16. Test-Retest Reliability The primary sources of measurement error for test-retest reliability are any random factors related to the time that passes between the two administrations of the test. These time sampling factors include random fluctuations in examinees over time (e.g., changes in anxiety or motivation) and random variations in the testing situation. Memory and practice also contribute to error when they have random carryover effects; i.e., when they affect many or all examinees but not in the same way.

  17. Test-Retest Reliability Test-retest reliability is appropriate for determining the reliability of tests designed to measure attributes that are relatively stable over time and that are not affected by repeated measurement. It would be appropriate for a test of aptitude, which is a stable characteristic, but not for a test of mood, since mood fluctuates over time, or a test of creativity, which might be affected by previous exposure to test items.

  18. 2. Alternate (Equivalent, Parallel) Forms Reliability: To assess a test's alternate forms reliability, two equivalent forms of the test are administered to the same group of examinees and the two sets of scores are correlated. Alternate forms reliability indicates the consistency of responding to different item samples (the two test forms) and, when the forms are administered at different times, the consistency of responding over time.

  19. Alternate (Equivalent, Parallel) Forms Reliability • The alternate forms reliability coefficient is also called the coefficient of equivalence when the two forms are administered at about the same time; • and the coefficient of equivalence and stability when a relatively long period of time separates administration of the two forms.

  20. Alternate (Equivalent, Parallel) Forms Reliability The primary source of measurement error for alternate forms reliability is content sampling, or error introduced by an interaction between different examinees' knowledge and the different content assessed by the items included in the two forms (e.g., Form A and Form B).

  21. Alternate (Equivalent, Parallel) Forms Reliability The items in Form A might be a better match of one examinee's knowledge than items in Form B, while the opposite is true for another examinee. In this situation, the two scores obtained by each examinee will differ, which will lower the alternate forms reliability coefficient. When administration of the two forms is separated by a period of time, time sampling factors also contribute to error.

  22. Alternate (Equivalent, Parallel) Forms Reliability Like test-retest reliability, alternate forms reliability is not appropriate when the attribute measured by the test is likely to fluctuate over time (and the forms will be administered at different times) or when scores are likely to be affected by repeated measurement.

  23. Alternate (Equivalent, Parallel) Forms Reliability • If the same strategies required to solve problems on Form A are used to solve problems on Form B, even if the problems on the two forms are not identical, there are likely to be practice effects. • When these effects differ for different examinees (i.e., are random), practice will serve as a source of measurement error. • Although alternate forms reliability is considered by some experts to be the most rigorous (and best) method for estimating reliability, it is not often assessed due to the difficulty in developing forms that are truly equivalent.

  24. 3. Internal Consistency Reliability: Reliability can also be estimated by measuring the internal consistency of a test. Split-half reliability and coefficient alpha are two methods for evaluating internal consistency. Both involve administering the test once to a single group of examinees, and both yield a reliability coefficient that is also known as the coefficient of internal consistency.

  25. Internal Consistency Reliability To determine a test's split-half reliability, the test is split into equal halves so that each examinee has two scores (one for each half of the test). Scores on the two halves are then correlated. Tests can be split in several ways, but probably the most common way is to divide the test on the basis of odd- versus even-numbered items.

  26. Internal Consistency Reliability A problem with the split-half method is that it produces a reliability coefficient that is based on test scores that were derived from one-half of the entire length of the test. If a test contains 30 items, each score is based on 15 items. Because reliability tends to decrease as the length of a test decreases, the split-half reliability coefficient usually underestimates a test's true reliability. For this reason, the split-half reliability coefficient is ordinarily corrected using the Spearman-Brown prophecy formula, which provides an estimate of what the reliability coefficient would have been had it been based on the full length of the test.
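
A sketch of the split-half procedure on a small, invented matrix of dichotomously scored items; with two equal-length halves, the Spearman-Brown correction reduces to 2r / (1 + r):

```python
import numpy as np

# Rows = examinees, columns = items (1 = correct, 0 = incorrect); invented data.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

# Split on the basis of odd- versus even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```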

  27. Internal Consistency Reliability Cronbach's coefficient alpha also involves administering the test once to a single group of examinees. However, rather than splitting the test in half, a special formula is used to determine the average degree of inter-item consistency. One way to interpret coefficient alpha is as the average reliability that would be obtained from all possible splits of the test. Coefficient alpha tends to be conservative and can be considered the lower boundary of a test's reliability (Novick and Lewis, 1967). When test items are scored dichotomously (right or wrong), a variation of coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20) can be used.
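
Coefficient alpha can be computed from the item variances and the variance of the total scores: alpha = (k / (k − 1)) × (1 − Σ item variances / total-score variance), where k is the number of items. A sketch assuming dichotomously scored items, in which case the same computation yields KR-20:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinee-by-item score matrix.

    For dichotomously scored (0/1) items this value equals KR-20.
    """
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented examinee-by-item matrix (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
])
print(round(cronbach_alpha(scores), 2))
```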

  28. Internal Consistency Reliability Content sampling is a source of error for both split-half reliability and coefficient alpha. • For split-half reliability, content sampling refers to the error resulting from differences between the content of the two halves of the test (i.e., the items included in one half may better fit the knowledge of some examinees than items in the other half); • for coefficient alpha, content (item) sampling refers to differences between individual test items rather than between test halves.

  29. Internal Consistency Reliability Coefficient alpha has an additional source of error: the heterogeneity of the content domain. A test is heterogeneous with regard to content domain when its items measure several different domains of knowledge or behavior.

  30. Internal Consistency Reliability The greater the heterogeneity of the content domain, the lower the inter-item correlations and the lower the magnitude of coefficient alpha. Coefficient alpha could be expected to be smaller for a 200-item test that contains items assessing knowledge of test construction, statistics, ethics, epidemiology, environmental health, social and behavioral sciences, rehabilitation counseling, etc. than for a 200-item test that contains questions on test construction only.

  31. Internal Consistency Reliability The methods for assessing internal consistency reliability are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. They are not appropriate for assessing the reliability of speed tests because, for these tests, they tend to produce spuriously high coefficients. (For speed tests, alternate forms reliability is usually the best choice.)

  32. 4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability: Inter-rater reliability is of concern whenever test scores depend on a rater's judgment. A test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate inter-rater reliability. This type of reliability is assessed either by calculating a correlation coefficient (e.g., a kappa coefficient or coefficient of concordance) or by determining the percent agreement between two or more raters.

  33. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability Although the latter technique is frequently used, it can lead to erroneous conclusions since it does not take into account the level of agreement that would have occurred by chance alone. This is a particular problem for behavioral observation scales that require raters to record the frequency of a specific behavior. In this situation, the degree of chance agreement is high whenever the behavior has a high rate of occurrence, and percent agreement will provide an inflated estimate of the measure's reliability.
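
The contrast between raw percent agreement and a chance-corrected index such as Cohen's kappa can be seen in a short sketch (the ratings are invented; note how a frequently occurring behavior produces high percent agreement but a much lower kappa):

```python
import numpy as np

# Two raters classifying the same 10 observations (invented categorical ratings).
rater_a = np.array(["on-task", "on-task", "off-task", "on-task", "on-task",
                    "on-task", "off-task", "on-task", "on-task", "on-task"])
rater_b = np.array(["on-task", "on-task", "on-task", "on-task", "on-task",
                    "on-task", "off-task", "off-task", "on-task", "on-task"])

# Percent agreement ignores the agreement expected by chance alone.
observed_agreement = np.mean(rater_a == rater_b)

# Cohen's kappa subtracts chance agreement: (observed - chance) / (1 - chance).
categories = np.union1d(rater_a, rater_b)
chance_agreement = sum(
    np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories
)
kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(observed_agreement, 2), round(kappa, 2))  # 0.8 vs. roughly 0.38
```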

  34. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability Sources of error for inter-rater reliability include factors related to the raters, such as lack of motivation and rater biases, as well as characteristics of the measuring device. An inter-rater reliability coefficient is likely to be low, for instance, when rating categories are not exhaustive (i.e., don't include all possible responses or behaviors) and/or are not mutually exclusive.

  35. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability The inter-rater reliability of a behavioral rating scale can also be affected by consensual observer drift, which occurs when two (or more) observers working together influence each other's ratings so that they both assign ratings in a similarly idiosyncratic way. (Observer drift can also affect a single observer's ratings when he or she assigns ratings in a consistently deviant way.) Unlike other sources of error, consensual observer drift tends to artificially inflate inter-rater reliability.

  36. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability The reliability (and validity) of ratings can be improved in several ways: • Consensual observer drift can be eliminated by having raters work independently or by alternating raters. • Rating accuracy is also improved when raters are told that their ratings will be checked. • Overall, the best way to improve both inter- and intra-rater accuracy is to provide raters with training that emphasizes the distinction between observation and interpretation (Aiken, 1985).

  37. RELIABILITY AND VALIDITY Study Tip: Remember that the Spearman-Brown formula is associated with split-half reliability and that KR-20 is a variation of coefficient alpha. Also know that alternate forms reliability is the most rigorous method for estimating reliability and that internal consistency reliability is not appropriate for speed tests.

  38. Factors That Affect The Reliability Coefficient The magnitude of the reliability coefficient is affected not only by the sources of error discussed earlier, but also by the length of the test, the range of the test scores, and the probability that the correct response to items can be selected by guessing. • Test Length • Range of Test Scores • Guessing

  39. 1. Test Length: The larger the sample of the attribute being measured by a test, the less the relative effects of measurement error and the more likely the sample will provide dependable, consistent information. Consequently, a general rule is that the longer the test, the larger the test's reliability coefficient.

  40. Test Length The Spearman-Brown prophecy formula is most associated with split-half reliability but can actually be used whenever a test developer wants to estimate the effects of lengthening or shortening a test on its reliability coefficient. For instance, if a 100-item test has a reliability coefficient of .84, the Spearman-Brown formula could be used to estimate the effects of increasing the number of items to 150 or reducing the number to 50. A problem with the Spearman-Brown formula is that it does not always yield an accurate estimate of reliability: In general, it tends to overestimate a test's true reliability (Gay, 1992).

  41. Test Length This is most likely to be the case when the added items do not measure the same content domain as the original items and/or are more susceptible to the effects of measurement error. Note that when the formula is used to correct the split-half reliability coefficient, the situation is more complex and this generalization does not always apply: when the two halves are not equivalent in terms of their means and standard deviations, the Spearman-Brown formula may either over- or underestimate the test's actual reliability.
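
A worked sketch of the prophecy formula applied to the 100-item example above; the general form is r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened or shortened:

```python
def spearman_brown(r_original: float, length_factor: float) -> float:
    """Estimate the reliability coefficient after changing test length by a given factor."""
    return (length_factor * r_original) / (1 + (length_factor - 1) * r_original)

# A 100-item test with r = .84, lengthened to 150 items or shortened to 50 items.
print(round(spearman_brown(0.84, 1.5), 2))  # about 0.89 for 150 items
print(round(spearman_brown(0.84, 0.5), 2))  # about 0.72 for 50 items
```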

  42. 2. Range of Test Scores: Since the reliability coefficient is a correlation coefficient, it is maximized when the range of scores is unrestricted. The range is directly affected by the degree of similarity of examinees with regard to the attribute measured by the test.

  43. Range of Test Scores When examinees are heterogeneous, the range of scores is maximized. The range is also affected by the difficulty level of the test items. When all items are either very difficult or very easy, all examinees will obtain either low or high scores, resulting in a restricted range. Therefore, the best strategy is to choose items so that the average difficulty level is in the mid-range (p = .50).

  44. 3. Guessing: A test's reliability coefficient is also affected by the probability that examinees can guess the correct answers to test items. As the probability of correctly guessing answers increases, the reliability coefficient decreases. All other things being equal, a true/false test will have a lower reliability coefficient than a four-alternative multiple-choice test which, in turn, will have a lower reliability coefficient than a free recall test.

  45. The Interpretation of Reliability The interpretation of a test's reliability entails considering its effects on the scores achieved by a group of examinees as well as the score obtained by a single examinee.

  46. Interpretation of Reliability Coefficient The Reliability Coefficient: As discussed previously, a reliability coefficient is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability. A reliability coefficient of .84 indicates that 84% of variability in test scores is due to true score differences among examinees, while the remaining 16% is due to measurement error. While different types of tests can be expected to have different levels of reliability, for most tests in the social sciences, reliability coefficients of .80 or larger are considered acceptable.

  47. The Interpretation of Reliability When interpreting a reliability coefficient, it is important to keep in mind that there is no single index of reliability for a given test. Instead, a test's reliability coefficient can vary from situation to situation and sample to sample. Ability tests, for example, typically have different reliability coefficients for groups of individuals of different ages or ability levels.

  48. Interpretation of Standard Error of Measurement While the reliability coefficient is useful for estimating the proportion of true score variability in a set of test scores, it is not particularly helpful for interpreting an individual examinee's obtained test score. When an examinee receives a score of 80 on a 100-item test that has a reliability coefficient of .84, for instance, we can only conclude that, since the test is not perfectly reliable, the examinee's obtained score might or might not be his or her true score.

  49. Interpretation of Standard Error of Measurement A common practice when interpreting an examinee’s obtained score is to construct a confidence interval around that score. The confidence interval helps a test user estimate the range within which an examinee's true score is likely to fall given his or her obtained score. This range is calculated using the standard error of measurement, which is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test. (When raw scores have been converted to percentile ranks, the confidence interval is referred to as a percentile band.)

  50. Interpretation of Standard Error of Measurement The following formula is used to estimate the standard error of measurement: Formula 1: SEmeas = SDx × √(1 − rxx), where SEmeas = standard error of measurement, SDx = standard deviation of test scores, and rxx = reliability coefficient.
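
A numeric sketch combining Formula 1 with the confidence-interval procedure described earlier. The obtained score of 80 and the reliability coefficient of .84 come from the running example; the standard deviation of 10 is an assumed value for illustration:

```python
import math

sd_x = 10.0          # standard deviation of test scores (assumed for illustration)
r_xx = 0.84          # reliability coefficient
obtained_score = 80  # examinee's obtained score

# Formula 1: SEmeas = SDx * sqrt(1 - rxx)
se_meas = sd_x * math.sqrt(1 - r_xx)

# 95% confidence interval around the obtained score (obtained +/- 1.96 * SEmeas).
lower = obtained_score - 1.96 * se_meas
upper = obtained_score + 1.96 * se_meas
print(round(se_meas, 2), (round(lower, 1), round(upper, 1)))
```

With these assumed values, SEmeas = 4, so the 95% confidence interval runs from roughly 72 to 88; we can be reasonably confident that the examinee's true score falls within that range.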
