Propensity Scores Methodology for Receiver Operating Characteristic (ROC) Analysis.

Marina Kondratovich, Ph.D.
U.S. Food and Drug Administration, Center for Devices and Radiological Health
September 2003
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.

Outline
• Introduction
• Place for propensity scores
• Distributions of covariates (details)
• Distributions of New Test results (details)
• Bias of naïve AUC estimation
• Matching for one covariate
• Weighted ROC analysis
• Stratification for one covariate
• Relationship between AUC by matching and by stratification
• Propensity score: pre-test risk of disease
• Conjunction of a New Test with other diagnostic tests

ROC Analysis
The New Test is quantitative. New Test variable: X for the Diseased population, Y for the Non-Diseased population.
ROC curve = relationship between sensitivity and specificity of the New Test over all possible cut-off values.
The AUC (area under the curve) is the most common measure of test performance.
• AUC = sensitivity averaged over all values of specificity (equivalently, specificity averaged over all values of sensitivity);
• AUC = P{X > Y}: the probability that a randomly selected Diseased subject has a larger test value than a randomly chosen Non-Diseased subject.

To estimate the diagnostic accuracy of a New Test correctly, we should compare the values of the New Test for subjects when Diseased with the values of the New Test for the same subjects when Non-Diseased.
• Each subject has two potential values of the New Test: a value X that would be observed if the subject were Diseased and a value Y that would be observed if the subject were Non-Diseased.
• But X and Y cannot be observed jointly for the same subject.
• Subject = {New Test, Covariates (e.g., C1 = Age, C2 = BMI)}
• If we were able to assign subjects randomly to the Diseased and Non-Diseased clinical states, then the Diseased and Non-Diseased groups would be comparable with respect to covariates, and the diagnostic accuracy of the Test would be evaluated correctly.
• But such a random assignment is impossible.

Problem: Consider M randomly selected Diseased subjects and N randomly selected Non-Diseased subjects. Naïve estimation of AUC is biased (usually overstated).
Biased estimators of AUC occur if
I. Distributions of covariates are different for the Diseased and Non-Diseased study groups; and
II. Distributions of New Test results are different for different sets of covariates.
Consider these two situations in more detail for one covariate, Age.

I. Different Age distributions in Diseased and Non-Diseased study groups.
Target population: Age distribution (t1, t2, t3) with t1 = 0.5; t2 = 0.3; t3 = 0.2.
Pre-test risk of Disease (Age):
  Age1   π1 = 0.1
  Age2   π2 = 0.3
  Age3   π3 = 0.5
πpopulation = π1·t1 + π2·t2 + π3·t3 = 0.24

Age distributions
I. Study groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
  Diseased                      Non-Diseased
  M = m1 + m2 + m3              N = n1 + n2 + n3
  E[mi/M] = pi                  E[ni/N] = qi
  p1=0.21; p2=0.38; p3=0.41     q1=0.59; q2=0.28; q3=0.13
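The age distributions p and q of the two study groups follow from the target population by Bayes' rule. A minimal sketch, assuming the target-population values quoted earlier (pre-test risks 0.1, 0.3, 0.5 and age shares 0.5, 0.3, 0.2); the function name `group_age_distributions` is illustrative and not part of the presentation:

```python
def group_age_distributions(risk, age_share):
    """Return (prevalence, p, q) from pre-test risks pi_i and age shares t_i.

    p[i] = P(Age_i | Diseased), q[i] = P(Age_i | Non-Diseased), by Bayes' rule.
    """
    prevalence = sum(r * t for r, t in zip(risk, age_share))
    p = [r * t / prevalence for r, t in zip(risk, age_share)]
    q = [(1 - r) * t / (1 - prevalence) for r, t in zip(risk, age_share)]
    return prevalence, p, q


prevalence, p, q = group_age_distributions([0.1, 0.3, 0.5], [0.5, 0.3, 0.2])
print(round(prevalence, 2))
print([round(v, 3) for v in p])
print([round(v, 3) for v in q])
```

Up to rounding, the result agrees with the quoted p and q: older subjects are over-represented among the Diseased and younger subjects among the Non-Diseased.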

I. Study groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
The pre-test risk of Disease in the study for Agei is related to the pre-test risk of Disease in the population, πi: it is a monotonic function of πi that depends on πpopulation and πstudy.
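One way to make this relationship explicit is sketched below. It is a reconstruction from the definitions on these slides, not the slide's own formula. It assumes πstudy = M/(M+N), expected counts E[mi] = M·pi and E[ni] = N·qi, and Bayes' rule pi = πi·ti/πpopulation, qi = (1−πi)·ti/(1−πpopulation):

```latex
\pi_i^{study} \approx \frac{M p_i}{M p_i + N q_i}
 = \frac{\pi_{study}\,\pi_i/\pi_{population}}
        {\pi_{study}\,\pi_i/\pi_{population} + (1-\pi_{study})(1-\pi_i)/(1-\pi_{population})},
\qquad
\frac{\pi_i^{study}}{1-\pi_i^{study}}
 = \frac{\pi_i}{1-\pi_i}\cdot
   \frac{\pi_{study}/(1-\pi_{study})}{\pi_{population}/(1-\pi_{population})}.
```

In the odds form, the study odds are the population odds multiplied by a constant that does not depend on age, which is why the study risk is a monotonic function of πi.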

II. The distribution of the Test variable depends on Age.
The New Test variables of
• Diseased subjects: X1, X2, X3 with c.d.f. F1(x), F2(x), F3(x);
• Non-Diseased subjects: Y1, Y2, Y3 with c.d.f. G1(y), G2(y), G3(y).
Example. Disease = Fracture, Non-Diseased = No Fracture, New Test = ultrasound test for a body site. The figure shows a hypothetical relationship between the average ultrasound test value and age. Ultrasound values usually decrease with increasing age.
Other examples: PSA test values (for prostate cancer) increase with age; BNP test values (for congestive heart failure) increase with age.

This is a typical picture of the data (ultrasound test for bone status).
I. The age distributions for Diseased and Non-Diseased subjects are different.
II. The values of the New Test depend on age.
Similarly, prostate cancer is more prevalent in older men; congestive heart failure is more prevalent in older people.

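PROBLEM: Naïve estimation of AUC is biased (usually overstated).
Indeed, the Wilcoxon-Mann-Whitney statistic is
  WMW = (1/(M·N)) · Σi Σj Ψ(Xi, Yj),   where Ψ(A,B) = 1 if A > B; ½ if A = B; 0 if A < B,
and its expected value is
  E[WMW] = Σk Σs pk·qs·AUCk,s = pT AUC q,
where AUCk,s is the area under the ROC curve when the Diseased subjects are Agek years old and the Non-Diseased subjects are Ages years old.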
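The kernel Ψ and the resulting empirical AUC can be sketched in a few lines. The function names and the sample values below are hypothetical, chosen only to show the tie handling:

```python
def psi(a, b):
    """Kernel of the Wilcoxon-Mann-Whitney statistic: 1, 1/2 or 0."""
    if a > b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0


def wmw_auc(diseased, non_diseased):
    """Empirical AUC: psi averaged over all Diseased/Non-Diseased pairs."""
    total = sum(psi(x, y) for x in diseased for y in non_diseased)
    return total / (len(diseased) * len(non_diseased))


# Hypothetical test values, for illustration only.
x_values = [3.0, 2.0, 2.0]
y_values = [1.0, 2.0]
auc = wmw_auc(x_values, y_values)
print(auc)
```

This naïve estimate pools all subjects regardless of age, which is exactly the situation in which the bias described here arises.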

Example. The New Test has no diagnostic ability: it cannot discriminate Diseased and Non-Diseased subjects within any age group.
X1, Y1 ~ N(1, 1/4); X2, Y2 ~ N(2, 1/4); X3, Y3 ~ N(3, 1/4)
Age distribution of the Diseased subjects is pT = (0.21; 0.38; 0.41); age distribution of Non-Diseased subjects is qT = (0.59; 0.28; 0.13).
AUC matrix: rows are Diseased Age1, Age2, Age3; columns are Non-Diseased Age1, Age2, Age3.
The two groups, Diseased and Non-Diseased, appear different with respect to the values of the New Test.

Example (continued).
AUC matrix: rows are Diseased Age1, Age2, Age3; columns are Non-Diseased Age1, Age2, Age3.
If the age distribution of the Diseased subjects is pT = (0.21; 0.38; 0.41) and the age distribution of Non-Diseased subjects is qT = (0.59; 0.28; 0.13), then the mean value of the Wilcoxon-Mann-Whitney statistic, pT AUC q, is 0.68.
The matrix element AUC3,1 = 0.98, which pairs the largest age group of Diseased subjects (p3 = 0.41) with the largest age group of Non-Diseased subjects (q1 = 0.59), makes the largest contribution to the bilinear form pT AUC q.
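The AUC matrix and the bilinear form in this example can be recomputed under the stated normal model, where AUCk,s = Φ((μk − μs)/√(2σ²)) for independent normals with common variance σ². The sketch below is hedged on one point: reading N(k, 1/4) as variance 1/4 gives a naïve expectation of about 0.70, while the quoted values AUC3,1 = 0.98 and pT AUC q = 0.68 are matched when each variance is taken as 1/2. Which parameterization the slide intended is not stated, so both are computed. Function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def auc_matrix(means, variance):
    """AUC[k][s] = P(X_k > Y_s) for independent normals with a common variance."""
    sd_diff = math.sqrt(2.0 * variance)
    return [[normal_cdf((mk - ms) / sd_diff) for ms in means] for mk in means]


def bilinear(p, auc, q):
    """Expected naive AUC: sum over k, s of p_k * AUC[k][s] * q_s."""
    return sum(p[k] * auc[k][s] * q[s]
               for k in range(len(p)) for s in range(len(q)))


p = [0.21, 0.38, 0.41]   # age distribution, Diseased
q = [0.59, 0.28, 0.13]   # age distribution, Non-Diseased
means = [1.0, 2.0, 3.0]  # common mean of X_k and Y_k in each age group

for variance in (0.25, 0.5):  # 0.5 is an assumed alternative reading
    auc = auc_matrix(means, variance)
    print(variance, round(auc[2][0], 3), round(bilinear(p, auc, q), 3))
```

Under either reading the diagonal of the matrix is exactly 0.5 (no diagnostic ability within an age group), yet the naïve expectation is well above 0.5 because p and q differ.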

Adjustments for one covariate
Three common methods of adjusting for one confounding covariate:
• Matching
• Stratification
• Covariate adjustment through logistic regression

Matching
Matching of Diseased and Non-Diseased subjects means that the age distributions of these subjects are the same.
Let the Diseased and Non-Diseased subjects be matched with common age distribution φT = (φ1, φ2, φ3).
Theorem. Suppose a New Test cannot discriminate Diseased and Non-Diseased populations within each age group. Then the expected value of the Mann-Whitney statistic is 0.5 for any age distribution in the age-matched samples of Diseased and Non-Diseased subjects.
The Wilcoxon-Mann-Whitney statistic correctly evaluates the test performance (area under the ROC curve) only for age-matched samples.
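A short derivation of why the theorem holds, sketched under two assumptions: Diseased and Non-Diseased test values are independent, and within each age group k the two distributions coincide (Fk = Gk). Then Xs and Ys have the same distribution, as do Xk and Yk, so AUCs,k = 1 − AUCk,s, and:

```latex
\begin{aligned}
AUC_{k,k} &= \tfrac12, \qquad AUC_{k,s} + AUC_{s,k} = 1 \quad (k \ne s),\\
\varphi^{T} AUC\, \varphi
 &= \sum_k \varphi_k^2\, AUC_{k,k} + \sum_{k<s} \varphi_k \varphi_s \,(AUC_{k,s} + AUC_{s,k})\\
 &= \tfrac12 \sum_k \varphi_k^2 + \sum_{k<s} \varphi_k \varphi_s
  = \tfrac12 \Big(\sum_k \varphi_k\Big)^2 = \tfrac12 .
\end{aligned}
```

The cancellation relies on both groups sharing the same φ. With different age distributions p and q the off-diagonal terms no longer pair up, which is the source of the bias in the naïve estimate.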

Matching (continued)
By matching, we create a “quasi-randomized” experiment. That is, if we find two subjects, one in the Diseased and one in the Non-Diseased group, with the same pre-test risk of Disease (same age), then we can imagine that there was one subject for whom the value of the New Test was observed both when Diseased and when Non-Diseased.
The age-matched study groups are similar with respect to Age (AUC for the covariate Age is exactly 0.5). Then we are sure that the difference in the New Test distributions for the Diseased and Non-Diseased groups is not due to a difference in age.
Problem: The data of unmatched subjects are not used in the AUC. In that case, weighted ROC analysis should be used.

Weighted ROC Analysis
Data set: Diseased and Non-Diseased subjects are not age-matched.
We want these two samples to be age-matched with the common age distribution φ, where φk = dk/D, dk = min(mk, nk), and D = d1 + d2 + d3.
          Diseased   Non-Diseased   Matched   Age distribution for matching
  Age1    m1 = 3     n1 = 5         d1 = 3    φ1 = 3/7
  Age2    m2 = 4     n2 = 3         d2 = 3    φ2 = 3/7
  Age3    m3 = 2     n3 = 1         d3 = 1    φ3 = 1/7

Weighted ROC Analysis (continued)
For each age Agek, we can take
• some set of size dk of the mk Diseased subjects (there are different variants);
• some set of size dk of the nk Non-Diseased subjects (there are different variants).
For Age1, 10 variants; for Age2, 4 variants; for Age3, 2 variants. Total number of different matched sets: 80 (= 10 x 4 x 2).
Using a particular age-matched set of D Diseased and D Non-Diseased subjects, we can estimate the age-matched AUC using the Wilcoxon statistic. Then we consider all possible matched sets, estimate the AUC for each set, and take the average of the AUC over all these sets.

Weighted ROC Analysis (continued)
This is equivalent to calculating the AUC with all M Diseased subjects with weights wi = dk/mk and all N Non-Diseased subjects with weights vj = dk/nk (k is the age group of the subject):
  AUCweighted = Σi Σj wi·vj·Ψ(Xi, Yj) / (Σi wi · Σj vj)
The weighted ROC analysis is equivalent to considering all possible variants of age-matching with common age distribution φ.
The weighted estimate of the AUC can also be obtained using the bootstrap technique.

Weighted ROC Analysis (continued)
          Diseased                       Non-Diseased                         Matched   Age distribution for matching
  Age1    m1 = 3                         n1 = 5                               d1 = 3    φ1 = 3/7
          weights 1, 1, 1                weights 3/5, 3/5, 3/5, 3/5, 3/5
  Age2    m2 = 4                         n2 = 3                               d2 = 3    φ2 = 3/7
          weights 3/4, 3/4, 3/4, 3/4     weights 1, 1, 1
  Age3    m3 = 2                         n3 = 1                               d3 = 1    φ3 = 1/7
          weights 1/2, 1/2               weights 1
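The equivalence between averaging over all matched sets and the weighted calculation can be checked numerically. The sketch below uses the group sizes from the table (3/5, 4/3, 2/1) with made-up test values; all function names and data are hypothetical. It enumerates every matched set, averages the Wilcoxon estimates, and compares the result with the weighted AUC using weights dk/mk and dk/nk.

```python
import itertools


def psi(a, b):
    return 1.0 if a > b else 0.5 if a == b else 0.0


def wmw_auc(xs, ys):
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def weighted_auc(diseased, non_diseased):
    """Weighted AUC with weights d_k/m_k (Diseased) and d_k/n_k (Non-Diseased)."""
    xs, ys = [], []
    for k in diseased:
        m_k, n_k = len(diseased[k]), len(non_diseased[k])
        d_k = min(m_k, n_k)
        xs += [(x, d_k / m_k) for x in diseased[k]]
        ys += [(y, d_k / n_k) for y in non_diseased[k]]
    num = sum(w * v * psi(x, y) for x, w in xs for y, v in ys)
    return num / (sum(w for _, w in xs) * sum(v for _, v in ys))


def average_over_matched_sets(diseased, non_diseased):
    """Average Wilcoxon AUC over every possible age-matched set; also return the count."""
    d = {k: min(len(diseased[k]), len(non_diseased[k])) for k in diseased}
    x_opts = [list(itertools.combinations(diseased[k], d[k])) for k in diseased]
    y_opts = [list(itertools.combinations(non_diseased[k], d[k])) for k in diseased]
    aucs = []
    for x_pick in itertools.product(*x_opts):
        for y_pick in itertools.product(*y_opts):
            xs = [x for group in x_pick for x in group]
            ys = [y for group in y_pick for y in group]
            aucs.append(wmw_auc(xs, ys))
    return sum(aucs) / len(aucs), len(aucs)


# Hypothetical test values with the group sizes from the table.
diseased = {"Age1": [2.1, 1.4, 1.9],
            "Age2": [2.6, 2.2, 3.1, 2.4],
            "Age3": [3.3, 2.9]}
non_diseased = {"Age1": [1.0, 1.6, 0.8, 1.9, 1.2],
                "Age2": [2.0, 2.4, 1.7],
                "Age3": [3.0]}

average, n_sets = average_over_matched_sets(diseased, non_diseased)
weighted = weighted_auc(diseased, non_diseased)
print(n_sets, average, weighted)
```

The two numbers agree because each Diseased subject in Agek appears in a fraction dk/mk of the matched sets, each Non-Diseased subject in a fraction dk/nk, and the two selections are made independently.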

Weighted ROC Analysis (continued)
The weighted AUC is an unbiased estimate of the φ-age-matched AUC.
Variance of the weighted estimate: if dk ≤ min(mk, nk) (no weight exceeds 1), this variance is smaller than the variance for a single matched set.

Stratification
The strata are defined, and Diseased and Non-Diseased subjects who are in the same stratum are compared.
          Diseased   Non-Diseased   Stratum AUC
  Age1    m1 = 3     n1 = 5         AUC1,1
  Age2    m2 = 4     n2 = 3         AUC2,2
  Age3    m3 = 2     n3 = 1         AUC3,3

Stratification (continued)
The overall diagnostic accuracy of the New Test can be taken as a weighted average of AUC1,1, AUC2,2, and AUC3,3. We can consider the linear combination
  AUCstratification = φ1·AUC1,1 + φ2·AUC2,2 + φ3·AUC3,3,
where φ is the same as in matching, φk = dk/D (dk = min(mk, nk)).
If AUC1,1 = AUC2,2 = AUC3,3 = AUC, then the weights φk are similar to the weights inversely proportional to the variances of the stratum AUCs.
Is there a relationship between AUC by matching and AUC by stratification?
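The stratified estimate is straightforward to compute from data: estimate the AUC within each stratum and combine with the weights φk = dk/D. A minimal sketch with hypothetical test values and illustrative function names, using the same group sizes as the stratification table:

```python
def psi(a, b):
    return 1.0 if a > b else 0.5 if a == b else 0.0


def wmw_auc(xs, ys):
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def stratified_auc(diseased, non_diseased):
    """Sum over strata of phi_k * AUC_kk, with phi_k = d_k / D and d_k = min(m_k, n_k)."""
    d = {k: min(len(diseased[k]), len(non_diseased[k])) for k in diseased}
    total = sum(d.values())
    return sum(d[k] / total * wmw_auc(diseased[k], non_diseased[k])
               for k in diseased)


# Hypothetical test values, for illustration only.
diseased = {"Age1": [2, 3, 4], "Age2": [5, 6, 7, 8], "Age3": [9, 10]}
non_diseased = {"Age1": [1, 2, 3, 4, 5], "Age2": [4, 6, 9], "Age3": [9.5]}

result = stratified_auc(diseased, non_diseased)
print(result)
```

Only within-stratum pairs enter this estimate, whereas the matched (weighted) estimate also uses cross-stratum pairs. That difference is what the relationship between the two quantities describes.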

Example. New Test = ultrasound test for bone status. The results of the ultrasound test are normal variables with means that are different for different ages and with the same standard deviation of 130 m/sec.
Matrix AUC, with φT = (0.2; 0.5; 0.3):
  AUC by matching: φT AUC φ = 0.624
  AUC by stratification: 0.639

Relationship between AUC by matching and AUC by stratification
Theorem. Let φT = (φ1, φ2, φ3) be the age distribution in the age-matched Diseased and Non-Diseased groups. Then AUC by matching and AUC by stratification are related by [equation not recovered], where Δ is a symmetric matrix with elements [not recovered].
Matrix Δ from the previous Example: [values not recovered].
For a broad class of distributions, AUC by matching ≤ AUC by stratification.
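One identity consistent with this statement is sketched below. It is a reconstruction and an assumption: the author's definition of Δ may differ in sign or scaling. It uses only Σk φk = 1 and a symmetrization of the double sum:

```latex
\begin{aligned}
\sum_k \varphi_k\, AUC_{k,k} - \varphi^{T} AUC\, \varphi
 &= \sum_{k,s} \varphi_k \varphi_s \,\big(AUC_{k,k} - AUC_{k,s}\big)
  = \varphi^{T} \Delta\, \varphi,\\
\Delta_{k,s} &= \tfrac12\big(AUC_{k,k} + AUC_{s,s} - AUC_{k,s} - AUC_{s,k}\big).
\end{aligned}
```

With this choice Δ is symmetric with zero diagonal, and AUC by matching ≤ AUC by stratification whenever φT Δ φ ≥ 0, for instance when every Δk,s is non-negative.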

Covariates (C1, C2, …, CL)
• Matching based on many covariates is difficult.
• Stratification: as the number of covariates increases, the number of strata grows exponentially.

Propensity Scores
Replace the collection of confounding covariates with one scalar function of these covariates: the propensity score.
Propensity score (PS): the conditional probability of being in the Diseased group rather than the Non-Diseased group, given a collection of observed covariates.
PS(C1, C2, …, CL) = Pr(Disease | C1, C2, …, CL).
Propensity Score = pre-test risk of Disease given a collection of covariates C1, C2, …, CL.

Construction of propensity score (pre-test risk)
Method: logistic regression or others (neural networks, ...).
Outcome: Disease = 1, Non-Disease = 0.
Predictors: all measured covariates, some interaction terms or squared terms, and so on. The New Test is not included.
The AUC for the combined covariates is a measure of covariate imbalance.
The distributions of the X and Y variables, the values of a New Test for the Diseased and Non-Diseased groups, depend on the covariates, but this dependence is approximated well through the pre-test risk:
F(x, C1, C2, …, CL) = F(x, PS(C1, C2, …, CL));
G(y, C1, C2, …, CL) = G(y, PS(C1, C2, …, CL)).

Propensity Scores (continued)
• Calculate estimated propensity scores (pre-test risk) for all subjects using the propensity score model.
• Sort all subjects by propensity score.
• Divide subjects into strata that have similar PS.
• Estimate AUC by matching (use weighted AUC) or AUC by stratification.
[Figure: five strata based on a logistic regression model of age and BMI (linear terms), with mk Diseased and nk Non-Diseased subjects per stratum; axes are Age and BMI.]
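The first three steps can be sketched as follows. Everything here is illustrative: the subjects are synthetic, the coefficients that generate them are invented, and the small gradient-ascent fitter stands in for whatever logistic regression routine is actually used. As on the slides, the predictors are age and BMI (linear terms) and the New Test is not included in the model.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def standardize(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def fit_logistic(rows, labels, steps=500, rate=0.5):
    """Batch gradient ascent on the logistic log-likelihood (intercept first)."""
    beta = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, y in zip(rows, labels):
            err = y - sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta


def propensity_scores(rows, beta):
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))
            for row in rows]


def stratify(scores, labels, n_strata=5):
    """Sort by score, cut into equal-sized strata, count Diseased (m) and Non-Diseased (n)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = []
    for s in range(n_strata):
        lo = s * len(order) // n_strata
        hi = (s + 1) * len(order) // n_strata
        members = order[lo:hi]
        m = sum(labels[i] for i in members)
        strata.append({"members": members, "m": m, "n": len(members) - m})
    return strata


# Synthetic subjects (hypothetical): disease risk rises with age and BMI.
random.seed(1)
ages = [random.uniform(40, 80) for _ in range(300)]
bmis = [random.gauss(27, 4) for _ in range(300)]
labels = [int(random.random() < sigmoid(0.06 * (a - 60) + 0.1 * (b - 27) - 0.5))
          for a, b in zip(ages, bmis)]

rows = list(zip(standardize(ages), standardize(bmis)))
beta = fit_logistic(rows, labels)
scores = propensity_scores(rows, beta)
strata = stratify(scores, labels)
for k, stratum in enumerate(strata, start=1):
    print(k, stratum["m"], stratum["n"])
```

The per-stratum counts mk and nk then play the same role as the age-group counts in the one-covariate case: dk = min(mk, nk) gives the weights for the weighted AUC, or the stratum AUCs can be combined with φk = dk/D.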

Propensity Scores (continued). Example: conjunction of a New Test with other diagnostic tests
A New Test is used in conjunction with other clinical tests to detect the clinical state “Disease”. The propensity score technique is a convenient tool for matching based on all available prior information (covariates) about the subjects.
Example: “Disease” = any stenosis during coronary angiography; New Test;
  C1 = Age;
  C2 = Gender;
  C3 = Total cholesterol;
  C4 = HDL (“good” cholesterol);
  C5 = LDL (“bad” cholesterol).
To evaluate the diagnostic ability of a New Test correctly, matched AUC analysis should be performed. Matching based on propensity score is recommended.

Use of matched ROC analysis when New Test results do not depend on the covariates.
Suppose the distribution of the New Test results is the same in every stratum (F1 = F2 = F3 = F, G1 = G2 = G3 = G), but we do not know this and use matched ROC analysis anyway. How is the matched estimate of the AUC related to the usual empirical estimate?
Theorem. The matched estimate of the AUC is an unbiased estimate of the AUC, but the variance of the matched estimate is inflated.
The proof is based on Hölder’s inequality (see [1]).

Summary
• If the results of a New Test depend on covariates and the distributions of covariates in the Diseased and Non-Diseased groups are different, then only matched ROC analysis correctly evaluates the diagnostic accuracy of the New Test.
• Matching based on propensity scores (pre-test risk of Disease) reduces bias. The propensity score is seriously degraded when important covariates influencing pre-test risk have not been collected.
• Weighted ROC analysis uses all the data more effectively.

References
The propensity score technique is well developed in the context of observational studies and studies of therapeutic devices. In the context of diagnostic studies, however, there have been few papers.
1. Kondratovich, Marina V. (2000). Methodology of removing the effect of confounding variables in receiver operating characteristic (ROC) analysis. Proceedings of the 2000 Joint Statistical Meeting, Biopharmaceutical Section, Indianapolis, IN.
2. Kondratovich, Marina V. (2002). Matched receiver operating characteristic (ROC) analysis and propensity scores. Proceedings of the 2002 Joint Statistical Meeting, Biopharmaceutical Section, New York, NY.
3. Zweig, M.H. and Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, p. 561-577.

References for the propensity score technique
• Rubin, DB, Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127:757-763
• Grunkemeier, GL, et al., Propensity score analysis of stroke after off-pump coronary artery bypass grafting, Ann Thorac Surg 2002; 74:301-305
• Winkelmayer, WC, et al., Comparing mortality of elderly patients on hemodialysis versus peritoneal dialysis: A propensity score approach, J Am Soc Nephrol 2002; 13:2353-2362
• Rosenbaum, PR, Rubin, DB, Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:516-524
• Blackstone, EH, Comparing apples and oranges, J. Thoracic and Cardiovascular Surgery, January 2002; 1:8-15
• D’Agostino, RB, Jr., Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group, Statistics in Medicine, 1998, 17:2265-2281
