
THE EFFECT OF CRITERION RELIABILITY ON MEANS AND INTERACTIONS IN META-ANALYSIS


Presentation Transcript


  1. THE EFFECT OF CRITERION RELIABILITY ON MEANS AND INTERACTIONS IN META-ANALYSIS LAWRENCE R. JAMES PSYCHOLOGY AND MANAGEMENT GEORGIA INSTITUTE OF TECHNOLOGY

  2. META-ANALYSIS • Correlations involving the same or very similar predictors and criteria are retrieved from prior studies. • This set of validities constitutes a distribution that can be summarized statistically using standard descriptors such as the mean and the variance.

  3. VALIDITY GENERALIZATION • Archival information pertaining to statistical artifacts that might affect each validity is obtained (e.g., sampling error, reliability of criterion and predictor, range restriction). • Distributional summary statistics are corrected for artifacts to provide estimates of the mean true (population) validity and the variance among the mean true (population) validities.

  4. WHY VALIDITY GENERALIZATION? • Validity generalization is founded on the possibility that true validities from different populations may be equal, and yet the sample validities may vary because of the operation of statistical artifacts (Hunter, Schmidt, & Jackson, 1982). (This is a question of interaction.) • There is also the strong likelihood that true validities are underestimated by sample validities due to unreliability and range restriction.

  5. RESULTS OF VALIDITY GENERALIZATION Meta-analyses based on validity generalization (VG) procedures continue to be impressive.

  6. ILLUSTRATIVE RESULTS • General intellectual ability is said to have an average corrected validity of .53 in predicting job performance (Hunter & Hunter, 1984). • Structured interviews can attain corrected validities in the .47 to .60 range against job performance (Huffcutt & Arthur, 1994). • Perceptual speed has an average corrected validity of .47 against clerical performance (Schmidt, 1992). • Integrity tests have average corrected validities of .40 against job performance (applicant samples) and .47 against counterproductive behaviors (all samples) (Ones, Viswesvaran, & Schmidt, 1993).

  7. INFERENCES • Many VG studies suggest that a single intellectual, cognitive, or personality trait can account for upwards of 16% to 36% of the variance in some aspect of job performance. • The days when 16% of the variance (or a validity of .40) was the maximum expected for a trait (Ghiselli & Brown, 1955) are gone, as are the days when validities in the .20s and .30s were commonplace in the reports of “well-done” validity studies.

  8. QUESTION What precipitated this boost in validities and accountable variance?

  9. BETTER SCIENCE? • Improved measurement instruments? • More sophisticated sampling techniques? • Superior research designs?

  10. Well, not really. We still rely on • the same measurement procedures • the same small samples • the same bivariate correlation designs

  11. Then what gave rise to this bountiful enhancement in validities?

  12. ENHANCEMENT IN VALIDITIES • The boosts in validities come from correcting the observed validities, which have stayed pretty much the same, for attenuation due to unreliability in the criterion (and sometimes the predictor) and direct range restriction in the predictor.

  13. WHAT CHANGED? • Change was not due to improvements in science. • What changed was the erosion of our historical cautiousness in applying correction equations to validity coefficients.

  14. A CULTURE OF CORRECTIONS The genesis of this “culture of corrections” can be traced to desires to estimate relationships devoid of statistical artifacts.

  15. A FORERUNNER: LATENT VARIABLES For example, latent variable procedures such as LISREL frame the opportunity to employ estimates of perfectly reliable variables in studies of covariation as a major advance in science.

  16. ANOTHER FORM OF LATENT VARIABLE No less dedicated to the pursuit of truth and scientific principle is VG (Schmidt, 1992), the objective being to estimate correlations among true scores (i.e., latent variables) unencumbered by statistical artifacts (e.g., unreliability).

  17. RECEPTIVENESS TO CORRECTIONS • It is the idea that corrected coefficients give greater insight into scientific truths that engendered the current culture of corrections. • Investigators are prone to compute corrected coefficients, and editors, reviewers, and readers tend to be receptive to them.

  18. OUR GOALS • It is not our intent to stand between scientists and the seeking of truth via corrected coefficients. • We do feel that it is reasonable, however, to inquire about the statistical values that are being used to make the corrections. • We are specifically interested in corrections for attenuation due to unreliability in criteria assessed via ratings of job performance. • We also study the effects these corrections have on estimates of the mean true validity and of the variance among estimated true validities from separate populations.

  19. INTERRATER RELIABILITY FOR RATINGS Viswesvaran, Ones, & Schmidt (1996) concluded that • job performance is typically assessed by ratings, • the reliability of ratings should be estimated via an interrater reliability analysis, and • the mean interrater reliability for job performance ratings over studies is approximately .52.

  20. WHERE AND WHEN TO USE .52 • If a given study in a VG analysis fails to report criterion reliability, and the criterion is based on ratings, then the best estimate of the missing interrater reliability is .52. • If one is using one of the myriad of VG equations to estimate means and variances of true correlations, and interrater reliability for ratings is missing from many studies (as is often the case), then .52 is the value to insert into the estimating equations for mean observed criterion reliability.

  21. CONSEQUENCE OF USING .52 It is instructive to illustrate the product of using .52 as an estimate of interrater reliability. Using the standard correction for attenuation • an observed validity of .25 becomes a .35 (i.e., .25/√.52), • .30 becomes a .42, • .35 becomes a .49, • .40 becomes a .55.

  22. MAGNITUDE OF INCREASE So, simply by correcting for attenuation based on an interrater reliability of .52, we obtain an 89% increase (i.e., [.55² − .40²]/.40²) in what is regarded as the maximum expected variance accounted for by a single predictor (i.e., .16 to .30).
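To make the arithmetic on the last two slides concrete, here is a minimal Python sketch (not part of the original presentation) of the standard correction for attenuation in the criterion, r_corrected = r_observed / √r_yy, followed by the increase in variance accounted for computed from the rounded corrected validity of .55, as on the slide.

    import math

    CRITERION_RELIABILITY = 0.52  # mean interrater reliability for ratings

    def correct_for_attenuation(r_observed, r_yy=CRITERION_RELIABILITY):
        # Standard correction for attenuation in the criterion only.
        return r_observed / math.sqrt(r_yy)

    for r in (0.25, 0.30, 0.35, 0.40):
        print(f"observed r = {r:.2f} -> corrected r = {correct_for_attenuation(r):.2f}")

    # Increase in maximum expected variance accounted for, using the
    # rounded corrected validity of .55 quoted on the slide: about 89%.
    print(f"{(0.55**2 - 0.40**2) / 0.40**2:.0%}")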

  23. AN ADVANCE IN SCIENCE? To what extent is this 89% increase in maximum expected variance accounted for reflective of science?

  24. COMPARISONS TO OTHER VARIABLES • Where else in personnel research do we accept, and use, measurement procedures that produce variables with reliabilities of .52? • Is it not true that almost every conceivable variable except performance ratings would be cast out of personnel research if its reliability were .52?

  25. NUNNALLY & BERNSTEIN, 1994 “A reliability of .80 may not be nearly high enough in making decisions about individuals…. If important decisions are being made with respect to specific test scores, a reliability of .90 is the bare minimum, and a reliability of .95 should be considered the desirable standard.” (p. 265)

  26. DESIRABLE STANDARD FOR PERFORMANCE RATINGS If we desire a .95 reliability for the test scores that are used to hire people for jobs, it seems reasonable to expect the same standard of reliability for the ratings that are used to determine whether people keep their jobs.

  27. PRACTICAL CONSIDERATIONS • Many reliabilities for scores used to make decisions about individuals are not in the .90s. • Many, however, are in the .80s. • With the exception of performance ratings, almost none are in the .50s.

  28. QUESTIONS • Why are performance ratings allowed to survive in spite of what most would agree is questionable measurement? • How do we allow observed validities to be corrected for unreliability in what appear to be flawed variables, and then act as if these corrected validities actually convey some sort of credible scientific information?

  29. QUESTIONS (continued) • Does anyone really believe that it makes sense to talk about a "perfectly reliable criterion" when the observed criterion begins with an interrater reliability of .52? • How exactly does a variable in which almost one-half of the observed variance is some form of bias or error become perfectly reliable?

  30. WHERE IS THE NEW TECHNOLOGY? • It would seem that researchers would have instituted the necessary improvements, given that problems with performance ratings were documented as early as 50 years ago in Guilford’s (1954) classic text in psychometrics. • Have not hundreds of articles been written on the biases and errors that affect performance ratings, especially after the classic articles on rating problems by Feldman and by Landy and Farr? • We know what the problems are. Why have we not fixed them?

  31. IS THE PROBLEM INTRACTABLE? • Maybe it is not possible to build ratings that can achieve high interrater reliabilities. • If we admit that this is true, then should we also not admit that we cannot justify inserting .52 in corrections for attenuation because we know that “theoretically perfectly reliable” is not going to be even remotely approximated?

  32. Is .52 an accurate estimate of interrater reliability? • This issue is currently being debated elsewhere (LeBreton, Kaiser, Burgess, Atchley, & James, 2001; Murphy & DeShon, 2000a, 2000b; Schmidt, Viswesvaran, & Ones, 2000). • If this estimate is later shown to be inaccurate or ill-founded, then a different debate ensues. • However, for now, let us assume that the .52 estimate is legitimate and accurate.

  33. THE ISSUE We may then deal with the issue of concern here, which is basing substantive scientific judgments on corrections that employ a below-threshold criterion reliability to produce an enhanced, sometimes greatly enhanced, estimate of corrected validity.

  34. Is 40 years of research wrong, and is job satisfaction really correlated with job performance? • Judge, Thoresen, Bono, and Patton (2001) used .52 as an estimate of criterion reliability to repudiate 40 years of research findings and previous meta-analyses that concluded that job satisfaction has a low correlation with overall job performance. • A mean observed correlation of .18 was corrected to a mean (estimated) true correlation of .30. Correction for unreliability in the criterion accounted for approximately 60% of this increase. • The use of .52 in the correction for attenuation was justified by arguing that this approach was “consistent with all contemporary (post-1990) meta-analytic studies involving job performance.” (p. 384)

  35. A COMPARISON • Had criterion reliability been .85 instead of .52, the corrected correlation would have been approximately .23 (job satisfaction reliability was set at .74). Had the reliabilities for both variables been .85, the corrected correlation would have been approximately .21. • Neither of these correlations suggests a substantial linear, additive relationship between job satisfaction and job performance. • Are we going to change this conclusion based on corrections engendered by not being able to measure job satisfaction particularly well and performance hardly at all?
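As a check on these figures, a minimal Python sketch (not part of the original presentation) of the standard double correction for attenuation reproduces the corrected correlations quoted on the last two slides.

    import math

    def disattenuate(r_observed, r_xx, r_yy):
        # Correct an observed correlation for unreliability in both measures.
        return r_observed / math.sqrt(r_xx * r_yy)

    r_obs = 0.18   # mean observed satisfaction-performance correlation
    r_sat = 0.74   # job satisfaction reliability used by Judge et al. (2001)
    print(round(disattenuate(r_obs, r_sat, 0.52), 2))  # 0.29, i.e., roughly the reported .30
    print(round(disattenuate(r_obs, r_sat, 0.85), 2))  # 0.23 if criterion reliability were .85
    print(round(disattenuate(r_obs, 0.85, 0.85), 2))   # 0.21 if both reliabilities were .85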

  36. STATISTICAL DYSFUNCTIONS OF CORRECTING FOR LOW RELIABILITIES • At this juncture, I hope that you realize that we have a problem. We cannot base our science on large corrections engendered by poor measurement. • If you have yet to be convinced, then allow me to proceed to demonstrate some unanticipated dysfunctions of inserting low reliabilities into correction equations. • Statistics are based on a working paper by James, LeBreton, and Ladd.

  37. A SINGLE VG ANALYSIS A meta-analysis is conducted on the correlations between scores on a structured interview and ratings of overall job performance. • The mean observed correlation is .35. • Mean criterion reliability is set at .52. • Mean predictor reliability is set at .80. • The ratio between the restricted and unrestricted standard deviations on the predictor is set at .71 (a common value).

  38. Result of a Single VG Analysis The estimate of mean true validity is .67 (Raju, Burke, Normand, & Langlois, 1991, Equation 2).
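Raju et al.'s Equation 2 is not reproduced here. As a rough illustration only, the Python sketch below applies one common correction sequence (disattenuation for unreliability in both measures, then Thorndike's Case II correction for direct range restriction) and lands at approximately the same value.

    import math

    def illustrative_true_validity(r_obs, r_yy, r_xx, u):
        # Disattenuate for unreliability in both measures, then correct for
        # direct range restriction (Thorndike Case II). This is an
        # illustrative stand-in, not Raju et al.'s (1991) Equation 2.
        r = r_obs / math.sqrt(r_yy * r_xx)
        U = 1.0 / u                       # unrestricted-to-restricted SD ratio
        return U * r / math.sqrt((U**2 - 1.0) * r**2 + 1.0)

    print(round(illustrative_true_validity(0.35, 0.52, 0.80, 0.71), 2))  # ~0.67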

  39. ADDITIONAL PREDICTORS Three additional predictors are chosen to contribute unique variance to prediction. • intelligence test • integrity test • biographical questionnaire

  40. PSYCHOMETRICS OF SEPARATE PREDICTORS • Each additional predictor has an observed validity of .35 against job performance, correlates .20 with each of the other predictors, and has a reliability of .80. • The ratio between the restricted and unrestricted standard deviations is again set at .71.

  41. RESULTS OF THREE ADDITIONAL VG ANALYSES The estimate of mean true validity in each additional VG analysis is .67.

  42. MULTIPLE CORRELATION ANALYSIS • Our four separate VG analyses each furnish an impressive increase in validity from .35 to .67. • Now let’s compute a multiple correlation by inserting the results of each separate VG analysis into a multiple correlation analysis.

  43. RESULTS The squared multiple correlation (R²) is 1.03. We account for more than 100% of the variance in the job performance ratings.

  44. COMPARATIVE RESULTS-1 A multiple correlation analysis based on the observed or uncorrected data produces an R² of .31.
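A minimal Python sketch of this comparison (not from the original slides; it assumes the .20 predictor intercorrelations are likewise disattenuated for predictor unreliability, .20/.80 = .25, when the corrected validities are used) reproduces both R² values.

    import numpy as np

    def squared_multiple_r(validity, intercorrelation, k=4):
        # R^2 = r' R^{-1} r for k predictors with equal validities
        # and equal intercorrelations.
        R = np.full((k, k), intercorrelation)
        np.fill_diagonal(R, 1.0)
        r = np.full(k, validity)
        return float(r @ np.linalg.solve(R, r))

    print(round(squared_multiple_r(0.67, 0.25), 2))  # 1.03 with corrected values
    print(round(squared_multiple_r(0.35, 0.20), 2))  # 0.31 with observed values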

  45. COMPARATIVE RESULTS-2 If all corrections remained the same except that the performance ratings were given a reliability of .80 rather than .52, then • the mean estimated true validity for each of the four variables would have been .54. • the R² would have been .67.

  46. COMPARATIVE RESULTS-3 If all corrections remained the same except that the performance ratings were given a reliability of .70, which is often considered the lower bound for reliability (Nunnally & Bernstein, 1994), then • the mean estimated true validity for each of the four variables would have been .57. • the R² would have been .77.

  47. IMPROPER TERRITORY • With reasonable values for criterion reliability set by accepted standards in psychometrics, corrected coefficients provide R²s in proper ranges. • When accepted standards are suspended, the R² may wander off into improper territory.

  48. SLIPPERY SLOPE • We typically do not see r²s greater than 1.0 in bivariate studies. • Investigators have thus failed to realize that once one begins to suspend judgment about acceptable thresholds for criterion reliability and to allow a value as low as .52 into correction equations, one is on a slippery slope. • The multiple correlation analysis picked up on the slippery slope by producing an improper R². It follows that the bivariate corrections that engendered this improper value have a tenuous foundation.

  49. VARIANCES • Heretofore we have focused on the mean of a distribution of validities and the estimate of the mean true validity. • It is also possible to focus on the variance of a distribution of validities and the estimate of the variance among true validities.

  50. ESTIMATED VARIANCE AMONG TRUE VALIDITIES • Each sample validity is corrected for artifacts. This provides an estimate of the true validity for the population from which that sample was drawn. • The variance among the estimated true validities is calculated. • This variance is adjusted for sampling error (Raju et al., 1991). • If artifact data are not available for each sample, estimating equations are available.
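As a bare-bones sketch of this logic (Python, with hypothetical study data; it ignores range restriction and uses a simple Hunter-Schmidt-style individual-correction approach rather than the exact Raju et al., 1991, estimators):

    import math

    # Hypothetical per-study inputs: observed validity, sample size,
    # criterion reliability, predictor reliability.
    studies = [(0.28, 120, 0.52, 0.80),
               (0.35,  75, 0.60, 0.85),
               (0.22, 200, 0.52, 0.80)]

    corrected, error_vars = [], []
    for r, n, r_yy, r_xx in studies:
        a = math.sqrt(r_yy * r_xx)              # attenuation factor for this study
        corrected.append(r / a)                 # estimated true validity for this study
        error_vars.append((1 - r**2)**2 / (n - 1) / a**2)  # sampling error variance of corrected r

    mean_true = sum(corrected) / len(corrected)
    var_corrected = sum((x - mean_true)**2 for x in corrected) / len(corrected)
    var_true = max(var_corrected - sum(error_vars) / len(error_vars), 0.0)  # adjust for sampling error
    print(round(mean_true, 3), round(var_true, 4))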
