Testing the Incremental Predictive Accuracy of New Markers

Presenter: Colin Begg. Co-Authors: Mithat Gonen, Venkatraman Seshan. Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center. 5th Annual UPenn Conference on Statistical Issues in Clinical Trials, April 2012.

The Issue • In evaluating the incremental predictive or diagnostic accuracy of a new marker it is commonplace to fit nested models and then compare the AUCs derived from the baseline model and from the model augmented with the new marker • Often two tests are performed: • A Wald test or equivalent from the regression • A DeLong test of the AUCs

From Kwon et al., Radiology 2011: Detection of Coronary Artery Disease

General Observations • The Wald test and the ROC AUC test would appear to be testing the same hypothesis • Doesn’t feel right to require a marker to survive 2 significance tests • Anecdotal evidence that ROC test is frequently non-significant for markers significant in regression tests • The use of one test (or testing strategy) with clear statistical properties seems advisable • But which test should be used?

Simulations • Standard bivariate normal markers generated with mean (μ1,μ2) and correlation ρ • μ2 = effect of new marker • μ1 = effect (collective) of existing predictors • Generate datasets and analyze the incremental effect of μ2 using the Wald test and the DeLong et al. AUC test
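A minimal sketch of one such simulated dataset and its Wald test, in Python with NumPy. The slide does not spell out the design, so this assumes controls are drawn from N((0,0), R) and cases from N((μ1,μ2), R) with unit variances, 50% prevalence, and a logistic model fitted by Newton-Raphson. The function names are illustrative and are not the authors' code.

```python
import numpy as np

def simulate_markers(n, mu1, mu2, rho, rng):
    """Assumed design: cases are shifted by (mu1, mu2); controls are centered at 0."""
    y = rng.integers(0, 2, size=n)                  # assumed 50% prevalence
    cov = np.array([[1.0, rho], [rho, 1.0]])
    m = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    m = m + np.outer(y, [mu1, mu2])                 # shift the cases only
    return y, m[:, 0], m[:, 1]

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; returns coefficients and their covariance."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)

def wald_z(y, m1, m2):
    """Wald z statistic for the new marker m2 in the augmented model."""
    X = np.column_stack([np.ones_like(m1), m1, m2])
    beta, cov = fit_logistic(X, y)
    return beta[2] / np.sqrt(cov[2, 2])

rng = np.random.default_rng(0)
y0, a0, b0 = simulate_markers(500, 0.3, 0.0, 0.0, rng)   # null new marker
z_null = wald_z(y0, a0, b0)
y1, a1, b1 = simulate_markers(500, 0.3, 1.0, 0.0, rng)   # strong new marker
z_alt = wald_z(y1, a1, b1)
print(f"Wald z under null: {z_null:.2f}, under mu2=1.0: {z_alt:.2f}")
```

Repeating the two calls over many seeds and counting how often |z| exceeds 1.96 would give the empirical size and power of the Wald test under these assumptions.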

From Vickers et al., BMC Med Res Meth 2011: the AUC test is exceptionally conservative with very low power.

The DeLong et al. (1988) AUC Test • Simulations from Venkatraman and Begg (Biometrika 1996) • Key assumption: observations (pairs) are i.i.d.
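For reference, the DeLong et al. statistic for two correlated AUCs can be sketched as follows. This is a compact NumPy version of the placement-value form of the variance estimator, written for illustration; the function names are hypothetical. It treats the (case, control) observations as i.i.d., which is the key assumption noted on this slide.

```python
import numpy as np
from math import erfc, sqrt

def placements(cases, controls):
    """psi(x, y) = 1 if x > y, 0.5 if tied, else 0, averaged per case and per control."""
    diff = cases[:, None] - controls[None, :]
    psi = (diff > 0) + 0.5 * (diff == 0)
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y, a, b):
    """Compare the AUCs of two scores a and b measured on the same subjects."""
    y = np.asarray(y)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    v10_a, v01_a = placements(a[y == 1], a[y == 0])
    v10_b, v01_b = placements(b[y == 1], b[y == 0])
    auc_a, auc_b = v10_a.mean(), v10_b.mean()
    s10 = np.cov(np.vstack([v10_a, v10_b]))      # 2x2 covariance across cases
    s01 = np.cov(np.vstack([v01_a, v01_b]))      # 2x2 covariance across controls
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / len(v10_a)
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / len(v01_a))
    z = (auc_b - auc_a) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))                   # two-sided normal p-value
    return float(auc_a), float(auc_b), float(z), float(p)

# Two distinct markers on the same subjects: a is noise, b is informative.
rng = np.random.default_rng(2)
y = np.tile([0, 1], 100)
a = rng.normal(size=200)
b = rng.normal(size=200) + 1.5 * y
auc_a, auc_b, z, p_value = delong_test(y, a, b)
print(f"AUC a={auc_a:.3f}, AUC b={auc_b:.3f}, z={z:.2f}, p={p_value:.4f}")
```

The example compares two distinct markers, the setting the test was designed for. It is not meant to show the nested-model use that the following slides criticize.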

Why Is the AUC Test Invalid in this Context? • Notation • Baseline predictor – m1i (could be multivariate) • New marker – m2i • Outcome (binary) – yi • Model • Derived predictors – w1i and w2i • AUC approach derives the AUC of (yi,w1i) and the AUC of (yi,w2i) and compares them using the DeLong et al. (1988) test • This test accommodates the fact that w1i and w2i are correlated. • What’s wrong with this approach?

Problems with AUC Test in Nested Models • Two fundamental problems: • (w1i, w2i) are not independent between patients • corr(w1i, w1j), corr(w2i, w2j) and corr(w1i, w2j) are all typically strongly positive • In a later example (n=55, μ1=0.3, μ2=0.0) these are, respectively, 0.50, 0.41 and 0.35 • Derived AUCs are strongly influenced by known directionality

What is Known Directionality? • Predictors from regression models inherently strive to improve predictability • If a new marker is truly null, it is equally likely to produce a positive or a negative regression parameter estimate • Either way the model interprets a non-zero parameter estimate as a contribution of predictive information, and this will generally increment the AUC upwards regardless of the sign of the effect • In contrast, if we are comparing distinct markers as in a conventional ROC comparison of diagnostic tests, the AUC of the new marker is equally likely to be smaller or larger than the control AUC under the null
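The effect described here can be checked with a small simulation. The sketch below assumes a logistic model, 50% prevalence, unit-variance normal markers, and a truly null new marker; none of these settings are taken from the talk's own code. It records the sign of the fitted coefficient for the null marker and the change in in-sample AUC when that marker is added.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    return beta

def auc(y, score):
    """Empirical AUC (Mann-Whitney form), ties counted as 0.5."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

rng = np.random.default_rng(1)
n, mu1, n_sim = 200, 0.3, 200
increments, positive_coef = [], []
for _ in range(n_sim):
    y = rng.integers(0, 2, size=n)
    m1 = rng.normal(size=n) + mu1 * y        # baseline predictor with a real effect
    m2 = rng.normal(size=n)                  # new marker, truly null
    X1 = np.column_stack([np.ones(n), m1])
    X2 = np.column_stack([X1, m2])
    b1, b2 = fit_logistic(X1, y), fit_logistic(X2, y)
    increments.append(auc(y, X2 @ b2) - auc(y, X1 @ b1))
    positive_coef.append(b2[2] > 0)

mean_increment = float(np.mean(increments))
frac_positive_coef = float(np.mean(positive_coef))
print(f"share of positive coefficients: {frac_positive_coef:.2f}")
print(f"mean in-sample AUC increment:   {mean_increment:.4f}")
```

Under these assumptions the coefficient sign should split roughly evenly, while the AUC increment should be positive on average.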

Consequences for AUC Test • Solid curve – observed test statistic from 5000 simulations • Dashed curve – asymptotic null distribution • μ1 = 0.3, μ2 = 0.0, ρ = 0, n = 500 • Although the test statistic is biased upwards, its greatly reduced variance (relative to the asymptotic variance) makes it unlikely that the statistic falls in the critical region.

Consequences: Impact of Known Directionality

A Valid Area Test • Define w1={m1i} and w2={m2i} • Construct an orthogonal decomposition of w2 • w2 = Pw2 + (I – P)w2 • P = X(X’X)^-1 X’, where X = (1 w1 z) and z represents other covariates • w2c = (I – P)w2 forms an exchangeable sequence under the null hypothesis • To create a valid reference distribution we can permute w2c, regenerate w2 = Pw2 + permuted w2c, perform the regressions and the AUC test • Simulations show that this has size and power similar to (slightly lower than) the Wald test.
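The steps on this slide can be sketched in NumPy as follows. This is an illustrative reading and not the authors' implementation. It assumes no extra covariates z, a logistic model fitted by Newton-Raphson, the in-sample difference in AUCs as the test statistic, and a one-sided permutation p-value. All function names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=15):
    """Newton-Raphson logistic fit; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    return beta

def auc(y, score):
    """Empirical AUC (Mann-Whitney form), ties counted as 0.5."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def fitted_auc(y, X):
    """In-sample AUC of the linear predictor from a logistic fit on X."""
    return auc(y, X @ fit_logistic(X, y))

def projection_auc_test(y, w1, w2, n_perm, rng):
    X = np.column_stack([np.ones_like(w1), w1])      # X = (1 w1); z omitted here
    coef, *_ = np.linalg.lstsq(X, w2, rcond=None)    # least squares gives P w2
    p_w2 = X @ coef                                  # P w2
    w2c = w2 - p_w2                                  # (I - P) w2
    base = fitted_auc(y, X)
    observed = fitted_auc(y, np.column_stack([X, w2])) - base
    perm = np.empty(n_perm)
    for k in range(n_perm):
        w2_star = p_w2 + rng.permutation(w2c)        # regenerate w2 from permuted w2c
        perm[k] = fitted_auc(y, np.column_stack([X, w2_star])) - base
    p_value = (1 + np.sum(perm >= observed)) / (n_perm + 1)
    return observed, float(p_value), X, w2c

rng = np.random.default_rng(3)
n = 300
y = rng.integers(0, 2, size=n)
w1 = rng.normal(size=n) + 0.3 * y
w2 = rng.normal(size=n) + 0.8 * y                    # informative new marker
observed, p_value, X, w2c = projection_auc_test(y, w1, w2, 199, rng)
print(f"AUC increment = {observed:.3f}, permutation p = {p_value:.3f}")
```

Using a least-squares solve avoids forming the n x n matrix P explicitly; the residual w2c is orthogonal to the columns of X by construction.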

Power of the “Projection AUC” Test (n=500; 5000 simulations) • The Projection AUC Test has power approaching that of the Wald test

Conclusions • The asymptotic reference distribution of the DeLong et al. AUC test is grossly invalid when the markers being compared are derived predictors from nested regression models. • The test statistic (difference in AUCs) is fine: with a valid reference distribution its power approaches that of the Wald test. • Two natural follow-up questions: • Independent validation samples • These are widely used to correct for over-confidence in risk prediction models • Is the strategy of comparing AUCs from “validation” samples appropriate and valid? • Non-nested models • Is the AUC test valid in this context?

Independent Validation Samples • The samples being compared are new cases to which the predictors (and parameter estimates) from the former models are applied • One cannot perform a Wald-type test on the predictors, though one could repeat the Wald analysis on the entire validation dataset • One could take these predictors and perform an AUC test • But this is problematic -- Since the AUC test is a rank test we are in effect comparing {w1i*} versus {w2i*} where -- Under the null we are therefore comparing {w1i} with the same marker plus additional noise -- This does not feel entirely satisfactory

Size of AUC Test in Validation Samples

Power of AUC Test in Validation Samples • Comment: Testing the AUC increment using the AUC test does not appear to be a sensitive approach, especially as the baseline predictiveness increases.

Is the Test Valid for Non-Nested Models? (n=500)

General Conclusions Regarding Testing • The ROC curve (and the AUC measure) is one of many measures that have been proposed for characterizing predictive accuracy • However, ROC curves derived from predictors from regression models must be used with caution • They are affected by optimism bias • Predictors cannot be used as original data to perform conventional tests of incremental predictive accuracy • Use of validation samples • AUC difference is still not strictly unbiased • Confirming a positive result is much more powerful using Wald • Non-nested models • AUC test is invalid as in the nested setting

Measuring Improvements in Prediction • ROC curves have been promoted for use as descriptive tools for characterizing the degree of improved predictive power • Several other measures have also been promoted in recent years, notably: • NRI – net reclassification improvement • IDI – integrated discrimination improvement • Are these tools affected by similar biases?

Net Reclassification Improvement (NRI) • The general goal of this measure is to determine the extent to which the new predictive rule improves the classification of patients into clinically distinct categories • There are various reclassification indices that depend on pre-defined classification points • Here we use the “continuous” version (Pencina Stat Med 2008) • NRI = number of times the new predictor improves upon the old predictor less the number of times the new predictor is inferior. The contribution of cases is weighted by (prevalence)^-1 and of controls by (1 – prevalence)^-1. An improvement is defined by • if subject is a case • if subject is a control
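A small worked sketch of the continuous NRI in plain Python. The defining inequalities are not reproduced in the text above, so this assumes the usual convention: for a case, an improvement means the new predicted risk is higher than the old one; for a control, it means the new predicted risk is lower. Dividing the weighted count by n is also an assumption, chosen so that the result equals the familiar sum of net proportions in cases and controls. The function name is illustrative.

```python
def continuous_nri(y, p_old, p_new):
    """Continuous NRI from outcomes y (0/1) and old/new predicted risks."""
    n = len(y)
    prev = sum(y) / n                      # prevalence of cases
    total = 0.0
    for yi, old, new in zip(y, p_old, p_new):
        if new == old:
            continue                       # no movement, no contribution
        moved_up = new > old
        if yi == 1:
            # cases: upward movement is an improvement
            total += (1.0 if moved_up else -1.0) / prev
        else:
            # controls: downward movement is an improvement
            total += (-1.0 if moved_up else 1.0) / (1.0 - prev)
    return total / n

y = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.7, 0.6, 0.3, 0.6]
nri = continuous_nri(y, p_old, p_new)
print(nri)  # 1.0: both cases move up (+1), controls split one down, one up (0)
```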

Plot of Wald Statistic Versus Standardized NRI: Null Effect • [Figure: axes NRI and Wald] • Model fitting ensures that there is a strong positive bias in the NRI. • Simulations: n=250; μ1 = 0.3; μ2 = 0.0; ρ = 0.

Corresponding Densities • Red curve – Wald statistic • Black curve – NRI statistic

Integrated Discrimination Improvement (IDI) • This measure is an average of the improvements in sensitivity and specificity due to the additional marker • IDI = mean of • where is (prevalence)^-1 for cases and –(1 – prevalence)^-1 for controls
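A minimal sketch of the IDI in plain Python, under one reading of the weighted-mean form on this slide: each subject contributes a weight times the change in predicted risk, with weight 1/prevalence for cases and -1/(1 - prevalence) for controls. The expression being averaged is not reproduced in the text above, so the change in predicted risk (new minus old) is an assumption, as are the function and variable names.

```python
def idi(y, p_old, p_new):
    """IDI as a weighted mean of changes in predicted risk."""
    n = len(y)
    prev = sum(y) / n                                  # prevalence of cases
    total = 0.0
    for yi, old, new in zip(y, p_old, p_new):
        weight = 1.0 / prev if yi == 1 else -1.0 / (1.0 - prev)
        total += weight * (new - old)
    return total / n

y = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.75, 0.75, 0.25, 0.5]
value = idi(y, p_old, p_new)
print(value)  # 0.375: mean rise in cases (0.25) minus mean change in controls (-0.125)
```

With these weights the result equals the mean change in predicted risk among cases minus the mean change among controls.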

Plot of Wald Statistic Versus Standardized IDI: Null Effect • [Figure: axes IDI and Wald] • Clearly the two statistics are closely related.

Corresponding Densities • Red curve – Wald statistic • Black curve – IDI statistic

IDI and |Wald| Are Closely Related • Red curve – Wald statistic • Black curve – IDI statistic • Green curve – |Wald| statistic

Conclusions • Biases induced by known directionality affect all measures of discrimination performance • The use of risk predictors derived from regression models as data inputs for subsequent tests or measures of discrimination leads to profound bias, especially when the effects of new markers are close to the null • Tests of new markers (in a nested framework) should be accomplished using traditional Wald or likelihood ratio tests from the regression model. • Use of independent validation samples is essential for measuring the magnitude of the impact of new markers

References • The work in this talk is mostly based on material in the following two articles: • Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol. 2011 Jan 28;11:13. • Seshan VE, Gonen M, Begg CB. Comparing ROC curves derived from regression models. Manuscript under review. Available at http://www.bepress.com/mskccbiostat/paper20/
