Inter-rater Reliability of Clinical Ratings: A Brief Primer on Kappa

Daniel H. Mathalon, Ph.D., M.D.
Department of Psychiatry
Yale University School of Medicine

Inter-rater Reliability of Clinical Interview-Based Measures
• Ratings of clinical severity for specific symptom domains (e.g., PANSS, BPRS, SAPS, SANS)
  • Continuous scales
  • Use intraclass correlations to assess inter-rater reliability.
• Diagnostic Assessment
  • Categorical Data / Nominal Scale Data
  • How do we quantify reliability between diagnosticians?
  • Percent Agreement, Chi-Square, Kappa

Two raters classify n cases into k mutually exclusive categories, giving a k × k table with Rater 1’s category as rows and Rater 2’s category as columns.
• $n_{ij}$ = number of cases falling into cell (i, j) = frequency of joint event ij
• $n_{..}$ = total number of cases
• $p_{ij} = n_{ij} / n_{..}$ = proportion of cases falling into a particular cell
Reliability by Percentage Agreement = $\sum_i p_{ii} = \frac{1}{n_{..}} \sum_i n_{ii}$
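As a minimal sketch of the percentage-agreement formula above, the following sums the diagonal of a table of counts and divides by the total. The function name and the 2 × 2 counts are hypothetical, chosen only for illustration.

```python
def percent_agreement(table):
    """Proportion of cases on the main diagonal of a square table of counts."""
    total = sum(sum(row) for row in table)               # n..
    agree = sum(table[i][i] for i in range(len(table)))  # sum of n_ii
    return agree / total

# Hypothetical counts: rows = Rater 1, columns = Rater 2.
counts = [[40, 5],
          [10, 45]]
print(percent_agreement(counts))  # (40 + 45) / 100 = 0.85
```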

Percent Agreement Fails to Consider Agreement by Chance
• Assume that two raters whose judgments are completely independent (i.e., not influenced by the true diagnostic status of the patient) each diagnose 90% of cases as having schizophrenia and 10% of cases as not having schizophrenia (i.e., Other).
• Expected agreement by chance for each category is obtained by multiplying the two raters’ marginal probabilities together:
    Schizophrenia: .90 × .90 = .81
    Other: .10 × .10 = .01
    Proportion Agreement = .81 + .01 = .82
• Can get Percentage Agreement of 82% strictly by chance.
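The chance-agreement arithmetic above can be checked with a short sketch. The function name is an assumption; the marginal probabilities are the 90% / 10% split from the slide.

```python
def chance_agreement(marginals_1, marginals_2):
    """Sum over categories of the product of the two raters' marginal probabilities."""
    return sum(a * b for a, b in zip(marginals_1, marginals_2))

# Both raters call 90% of cases schizophrenia and 10% Other, independently.
p_chance = chance_agreement([0.90, 0.10], [0.90, 0.10])
print(round(p_chance, 2))  # 0.82
```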

Chi-Square Test of Association as Proposed Solution
• Can perform a Chi-Square Test of Association to test the null hypothesis that the two raters’ judgments are independent.
• To reject independence, show that observed agreement departs from what would be expected by chance alone.
    $\chi^2 = \sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• Problem: In the example below, we have a perfect association between the raters with zero agreement. Chi-Square is a test of association, not agreement. It is sensitive to any departure from chance agreement, even when the dependency between the raters’ judgments involves perfect non-agreement.
• So, we cannot use the Chi-Square Test to assess agreement between raters.
[Table: Rater 1 × Rater 2 cross-classification]
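To illustrate why association is not agreement, here is a sketch that computes the Pearson chi-square statistic (no continuity correction) for a hypothetical 2 × 2 table in which the raters never agree. The counts are invented for illustration and are not the slide's original table.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of counts (rows x columns)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical table: every case rated A by Rater 1 is rated B by Rater 2, and vice versa.
disagree = [[0, 50],
            [50, 0]]
agreement = sum(disagree[i][i] for i in range(2)) / 100
print(chi_square(disagree), agreement)  # 100.0 0.0
```

The statistic is as large as it can be for 100 cases in a 2 × 2 table, yet the proportion of agreement is zero.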

Kappa Coefficient (Cohen, 1960)
• High reliability requires that the frequencies along the diagonal be > chance and the off-diagonal frequencies be < chance.
• Use marginal frequencies/probabilities to estimate chance agreement.
Proportion agreement observed: $p_o = \sum_i p_{ii} = \frac{1}{n_{..}} \sum_i n_{ii}$
Proportion agreement expected by chance: $p_c = \sum_i p_{i.} \, p_{.i}$
Kappa: $K = \frac{p_o - p_c}{1 - p_c}$
Example (Rater 1 × Rater 2, three categories):
    Diagonal proportions $p_{ii}$: .53, .14, .03
    Chance products $p_{i.} \times p_{.i}$: .39, .075, .01
    $p_o = .53 + .14 + .03 = .70$
    $p_c = .39 + .075 + .01 = .475$
    $K = \frac{.70 - .475}{1 - .475} = .429$
K = 1, perfect agreement
K = 0, chance agreement
K < 0, agreement worse than chance
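A minimal sketch of the kappa calculation from a full table of counts follows. The slide text preserves only the diagonal proportions and the chance products, so the off-diagonal counts and the total of 100 cases are hypothetical, chosen so that the diagonal (.53, .14, .03) and the marginal products (.39, .075, .01) match the example.

```python
def cohen_kappa(table):
    """Return (p_o, p_c, kappa) for a square table of counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    row_p = [sum(row) / n for row in table]        # p_i. (Rater 1 marginals)
    col_p = [sum(col) / n for col in zip(*table)]  # p_.i (Rater 2 marginals)
    p_o = sum(table[i][i] for i in range(k)) / n
    p_c = sum(row_p[i] * col_p[i] for i in range(k))
    return p_o, p_c, (p_o - p_c) / (1 - p_c)

# Hypothetical counts (n = 100); only the diagonal and marginal products come from the example.
counts = [[53, 5, 2],
          [11, 14, 5],
          [1, 6, 3]]
p_o, p_c, kappa = cohen_kappa(counts)
print(round(p_o, 3), round(p_c, 3), round(kappa, 3))  # 0.7 0.475 0.429
```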

Interpretations of Kappa
$K = \frac{p_o - p_c}{1 - p_c}$, with $p_o = .53 + .14 + .03 = .70$ and $p_c = .39 + .075 + .01 = .475$, so $K = \frac{.70 - .475}{1 - .475} = .429$.
• K = P(agreement | no agreement by chance)
  • $1 - p_c = 1 - .475 = .525$ is the proportion of cases with no agreement by chance.
  • $p_o - p_c = .70 - .475 = .225$ is the proportion of cases in which the observers agreed beyond chance.
  • Kappa is the probability that judges will agree given no agreement by chance.
• Can test H0 that Kappa = 0. Kappa is approximately normally distributed in large samples, so significance can be tested using the normal distribution.
• Can construct confidence intervals for Kappa.
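The test and confidence interval mentioned above can be sketched as follows. The slides do not give a standard-error formula or a sample size, so this uses the simple large-sample approximations from Cohen (1960) as an assumption, and n = 100 is hypothetical. Later work refined these standard errors, so treat the numbers as illustrative only.

```python
import math

def kappa_inference(p_o, p_c, n, z_crit=1.96):
    """Approximate z test of H0: kappa = 0 and a 95% CI (Cohen, 1960 approximations)."""
    kappa = (p_o - p_c) / (1 - p_c)
    se_null = math.sqrt(p_c / (n * (1 - p_c)))              # SE assuming kappa = 0
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_c) ** 2))  # SE for the interval
    z = kappa / se_null
    ci = (kappa - z_crit * se, kappa + z_crit * se)
    return kappa, z, ci

# p_o and p_c are from the example; n = 100 is a hypothetical sample size.
kappa, z, ci = kappa_inference(0.70, 0.475, n=100)
print(round(kappa, 3), round(z, 2), tuple(round(x, 3) for x in ci))
```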

Weighted Kappa Coefficient
• Can assign weights, $w_{ij}$, to classification errors according to their seriousness using ratio scale weights.
$K_w = \frac{p_{o(w)} - p_{c(w)}}{1 - p_{c(w)}}$
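A sketch of weighted kappa under stated assumptions: the slide does not define the weighted proportions, so this takes the common agreement-weight form, with $w_{ii} = 1$, smaller weights for more serious disagreements, $p_{o(w)} = \sum_{ij} w_{ij} p_{ij}$, and $p_{c(w)} = \sum_{ij} w_{ij} p_{i.} p_{.j}$. The counts and the linear weights are hypothetical.

```python
def weighted_kappa(table, weights):
    """Weighted kappa using agreement weights (1 on the diagonal, less off it)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    p_o_w = sum(weights[i][j] * table[i][j] / n for i, j in cells)
    p_c_w = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)
    return (p_o_w - p_c_w) / (1 - p_c_w)

# Hypothetical counts and hypothetical linear weights for three ordered categories.
counts = [[53, 5, 2],
          [11, 14, 5],
          [1, 6, 3]]
linear = [[1.0, 0.5, 0.0],
          [0.5, 1.0, 0.5],
          [0.0, 0.5, 1.0]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(round(weighted_kappa(counts, linear), 3))
print(round(weighted_kappa(counts, identity), 3))  # identity weights give unweighted kappa
```

With identity weights every disagreement counts equally, so the result reduces to the unweighted kappa.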

Kappa Rules of Thumb
• K ≥ .75 is considered excellent agreement.
• K ≤ .46 is considered poor agreement.

Weighted Kappa and the ICC
• Weighted kappa is an intraclass correlation coefficient (except for a factor of 1/n) when the weights have the following property:
    $w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2}$
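The weight formula above can be tabulated with a short sketch. The function name is an assumption, and the category indices i and j run from 1 to k, which presumes the categories are ordered.

```python
def quadratic_weights(k):
    """Agreement weights w_ij = 1 - (i - j)^2 / (k - 1)^2 for k ordered categories."""
    return [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(1, k + 1)]
            for i in range(1, k + 1)]

for row in quadratic_weights(3):
    print(row)
# [1.0, 0.75, 0.0]
# [0.75, 1.0, 0.75]
# [0.0, 0.75, 1.0]
```

Adjacent categories keep most of the credit (.75 when k = 3), while the most distant pair gets none.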

Problems with Kappa
• Affected by base rates of diagnoses.
• Can’t easily compare across studies that have different base rates, either in the population or in the reliability study.
• Chance agreement is a problem?
  • When the null hypothesis of rater independence is not met (which is most of the time), the estimate of chance agreement is inaccurate and possibly inappropriate.
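The base-rate problem can be illustrated with a small sketch. It assumes a hypothetical model that is not in the slides: two raters with the same sensitivity and specificity (.90 each) whose errors are independent given the patient's true status. Only the prevalence of the diagnosis changes between the two runs.

```python
def kappa_at_prevalence(prevalence, sensitivity=0.9, specificity=0.9):
    """Kappa for two equally accurate raters who err independently given true status."""
    pos, neg = prevalence, 1 - prevalence
    both_pos = pos * sensitivity ** 2 + neg * (1 - specificity) ** 2
    both_neg = pos * (1 - sensitivity) ** 2 + neg * specificity ** 2
    p_o = both_pos + both_neg
    marg_pos = pos * sensitivity + neg * (1 - specificity)  # same for both raters
    p_c = marg_pos ** 2 + (1 - marg_pos) ** 2
    return (p_o - p_c) / (1 - p_c)

for prev in (0.5, 0.1):
    print(prev, round(kappa_at_prevalence(prev), 2))
# 0.5 0.64
# 0.1 0.39
```

Under these assumptions the raters are equally accurate in both settings and observed agreement is .82 in both, yet kappa falls from about .64 to about .39 when the base rate drops from 50% to 10%.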
