Imperfect Gold Standards for Biomarker Evaluation

Rebecca A. Betensky, Conference on Statistical Issues in Clinical Trials, University of Pennsylvania, April 18, 2012

Outline • Motivation: need for kidney injury biomarkers for diagnosis of acute kidney injury (AKI) • Impact of imperfect gold standard on apparent sensitivity and specificity of perfect biomarker • Examine conditional independence assumption: implicit restrictions • Bounds on true sensitivity and specificity

Serum creatinine (SCr) for AKI • Clinicians have used SCr to diagnose AKI for decades. • Acknowledged as an inadequate gold standard: • Poor specificity in some settings that are not associated with kidney injury • Poor sensitivity in the setting of adequate renal reserve • Relatively slow kinetics after injury • Considerable interest in identifying better biomarkers of tubular injury: potentially more accurate and earlier diagnosis.

How to evaluate new biomarkers? • Studies have used changes in SCr as the gold standard against which to test novel tubular injury biomarkers. • Aside from its problems of specificity and sensitivity, • SCr does not directly reflect tubular function or injury • SCr-based diagnosis depends on a cutoff, which affects its true spec and sens, and thus the apparent spec and sens of the novel marker.

Conceptual framework • Actual disease that is the target of the diagnostic test (AKI) is not synonymous with clinical conditions identified by imperfect gold standard (SCr). • AKI is difficult to establish without invasive and risky histopathological assessment. • Using imperfect gold standard (i.e., imperfect reference test) may distort apparent diagnostic performance of novel biomarker.

Idealized example of perfect novel biomarker

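Disease prevalence = 20%; imperfect gold standard sensitivity = 80%, specificity = 80%. Per 100 patients, the gold standard is positive in 16 of the 20 with disease and in 16 of the 80 without, and negative in the remaining 4 and 64. Relative to the imperfect gold standard, a perfect novel biomarker will therefore have apparent sensitivity of 16/32 = 50% and apparent specificity of 64/68 = 94%.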

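At lower prevalence, the dominant effect of the imperfect gold standard is on the perfect biomarker’s apparent sensitivity. With p = P(D=1): apparent sens = P(D=1|G=1) = p × sens_G / [p × sens_G + (1-p) × (1-spec_G)]; apparent spec = P(D=0|G=0) = (1-p) × spec_G / [(1-p) × spec_G + p × (1-sens_G)].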
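The arithmetic of the idealized example can be checked with a minimal sketch. For a perfect biomarker (B = D), the apparent sensitivity relative to G is P(D=1|G=1) and the apparent specificity is P(D=0|G=0), that is, the predictive values of G. The function name `apparent_performance` and the 5% prevalence used for the low-prevalence case are illustrative assumptions, not values from the talk.

```python
def apparent_performance(prevalence, sens_g, spec_g):
    """Apparent (sens, spec) of a PERFECT biomarker judged against gold standard G."""
    true_pos_g = prevalence * sens_g                # P(D=1, G=1)
    false_pos_g = (1 - prevalence) * (1 - spec_g)   # P(D=0, G=1)
    true_neg_g = (1 - prevalence) * spec_g          # P(D=0, G=0)
    false_neg_g = prevalence * (1 - sens_g)         # P(D=1, G=0)
    apparent_sens = true_pos_g / (true_pos_g + false_pos_g)  # P(D=1 | G=1)
    apparent_spec = true_neg_g / (true_neg_g + false_neg_g)  # P(D=0 | G=0)
    return apparent_sens, apparent_spec


# Idealized example from the text: prevalence 20%, G sens = spec = 80%.
sens_20, spec_20 = apparent_performance(0.20, 0.80, 0.80)

# Hypothetical lower prevalence (5%), same gold standard; an assumption for illustration.
sens_05, spec_05 = apparent_performance(0.05, 0.80, 0.80)

print(f"prevalence 20%: apparent sens={sens_20:.3f}, apparent spec={spec_20:.3f}")
print(f"prevalence  5%: apparent sens={sens_05:.3f}, apparent spec={spec_05:.3f}")
```

Under these assumptions the apparent sensitivity drops sharply as prevalence falls, while the apparent specificity moves toward 1.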

This is similar to using “need for dialysis” as the imperfect gold standard. At prevalence of 20%, apparent sensitivity of perfect biomarker is 100% and apparent specificity is 84%. The bounds of the apparent AUC are 0.84-1.00. Even rare false positives (imperfect gold standard spec=99%) lower the apparent sensitivity to 86% and give bounds of apparent AUC of 0.72-0.98.
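The figures above can be reproduced with a short sketch, under one loudly stated assumption: the text does not give the sensitivity of the "need for dialysis" gold standard, and the value of 25% used here (`ASSUMED_SENS_G`) is a guess chosen because it matches the quoted numbers. The AUC bounds are taken to be sens × spec (lower) and sens + (1 - sens) × spec (upper), which is what the quoted ranges imply. Function names are hypothetical.

```python
def predictive_values(prevalence, sens_g, spec_g):
    """P(D=1|G=1) and P(D=0|G=0): apparent sens/spec of a perfect biomarker."""
    ppv = prevalence * sens_g / (
        prevalence * sens_g + (1 - prevalence) * (1 - spec_g)
    )
    npv = (1 - prevalence) * spec_g / (
        (1 - prevalence) * spec_g + prevalence * (1 - sens_g)
    )
    return ppv, npv


def auc_bounds(sens, spec):
    """Lower and upper bounds on the AUC of an ROC curve through one (sens, spec) point."""
    return sens * spec, sens + (1 - sens) * spec


ASSUMED_SENS_G = 0.25  # assumption: not stated in the text
PREVALENCE = 0.20      # stated in the text

results = {}
for spec_g in (1.00, 0.99):
    app_sens, app_spec = predictive_values(PREVALENCE, ASSUMED_SENS_G, spec_g)
    lower, upper = auc_bounds(app_sens, app_spec)
    results[spec_g] = (app_sens, app_spec, lower, upper)
    print(
        f"spec_G={spec_g:.2f}: apparent sens={app_sens:.2f}, "
        f"apparent spec={app_spec:.2f}, AUC in [{lower:.2f}, {upper:.2f}]"
    )
```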

Cut-offs for SCr • Recent clinical studies of novel AKI biomarkers have used a variety of SCr criteria to define AKI. • These examples illustrate that different choices of cut-offs can lead to hugely different apparent properties of a novel biomarker.

What if new biomarker is not perfect? • Need assumptions on relationship between new biomarker and imperfect gold standard and disease to evaluate new biomarker. • Conditional independence is convenient; allows for latent class models. • However, it introduces implicit restrictions.

What can we learn for imperfect novel biomarker? • Previous illustration assumes perfect novel biomarker. • Common assumption is conditional independence: P(B=b|G=g,D=d)=P(B=b|D=d) • With p = P(D=1), apparent sensitivity of B relative to G: P(B=1|G=1) = [p × sens_G × sens_B + (1-p) × (1-spec_G) × (1-spec_B)] / [p × sens_G + (1-p) × (1-spec_G)] • Apparent specificity of B relative to G: P(B=0|G=0) = [(1-p) × spec_G × spec_B + p × (1-sens_G) × (1-sens_B)] / [(1-p) × spec_G + p × (1-sens_G)] • Use these to solve for “true sensitivity” and specificity of B relative to D • Bounds on apparent AUC: • Apparent AUC ≥ apparent sens × apparent spec • Apparent AUC ≤ apparent sens + (1-apparent sens) × apparent spec
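The "solve for true sensitivity and specificity" step can be sketched as follows. Under conditional independence, P(B=1|G=1) and P(B=0|G=0) are linear in the true sensitivity and specificity of B once the prevalence and the operating characteristics of G are fixed, so two apparent values give a 2×2 linear system. This is an illustrative sketch, not the authors' code; the function names and the round-trip values are assumptions.

```python
def apparent_under_ci(p, sens_g, spec_g, sens_b, spec_b):
    """Apparent sens/spec of B relative to G, assuming B and G independent given D."""
    p_g1 = p * sens_g + (1 - p) * (1 - spec_g)   # P(G=1)
    p_g0 = 1 - p_g1                              # P(G=0)
    app_sens = (p * sens_g * sens_b
                + (1 - p) * (1 - spec_g) * (1 - spec_b)) / p_g1
    app_spec = ((1 - p) * spec_g * spec_b
                + p * (1 - sens_g) * (1 - sens_b)) / p_g0
    return app_sens, app_spec


def true_from_apparent(p, sens_g, spec_g, app_sens, app_spec):
    """Invert the two equations above for (sens_b, spec_b) by Cramer's rule."""
    p_g1 = p * sens_g + (1 - p) * (1 - spec_g)
    p_g0 = 1 - p_g1
    # Row 1: a11*sens_b + a12*spec_b = r1 ; Row 2: a21*sens_b + a22*spec_b = r2
    a11, a12 = p * sens_g, -(1 - p) * (1 - spec_g)
    a21, a22 = -p * (1 - sens_g), (1 - p) * spec_g
    r1 = app_sens * p_g1 - (1 - p) * (1 - spec_g)
    r2 = app_spec * p_g0 - p * (1 - sens_g)
    det = a11 * a22 - a12 * a21   # equals p(1-p)(sens_g + spec_g - 1)
    if abs(det) < 1e-12:
        raise ValueError("G is uninformative or prevalence is 0 or 1; no unique solution")
    sens_b = (r1 * a22 - a12 * r2) / det
    spec_b = (a11 * r2 - a21 * r1) / det
    return sens_b, spec_b


# Round trip with hypothetical values: generate apparent values, then recover the truth.
app = apparent_under_ci(p=0.3, sens_g=0.9, spec_g=0.95, sens_b=0.7, spec_b=0.9)
recovered = true_from_apparent(0.3, 0.9, 0.95, *app)
print(app, recovered)
```

The solution depends on the assumed prevalence and on conditional independence holding; values outside [0, 1] signal that the inputs are not compatible with that assumption.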

Problems with conditional independence • May not be plausible from mechanistic or physiological perspective; the two tests measure related phenomena. • May be association between disease severity and test results; two tests may be conditionally independent given disease severity, but not conditionally independent given presence or absence of disease. • Assumption of conditional independence constrains the disease prevalence; may not be plausible.

Conditional Independence: disease severity • Independence given disease severity X: P(G=1, B=1|D=1,X)=P(G=1|D=1,X)×P(B=1|D=1,X) • This does not imply independence given disease alone: P(G=1,B=1|D=1)=P(G=1|D=1)×P(B=1|D=1)
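A small numerical counterexample makes this concrete. All numbers below are hypothetical: diseased patients are split evenly between mild and severe disease, and both tests are more likely to be positive in severe disease. The tests are independent within each severity level by construction, yet dependent once severity is averaged out.

```python
# Hypothetical values, chosen only for illustration.
severity_weights = {"mild": 0.5, "severe": 0.5}   # P(X=x | D=1)
p_g_pos = {"mild": 0.5, "severe": 0.9}            # P(G=1 | D=1, X=x)
p_b_pos = {"mild": 0.5, "severe": 0.9}            # P(B=1 | D=1, X=x)

# Joint probability given disease, using independence WITHIN each severity level.
joint_given_d = sum(
    w * p_g_pos[x] * p_b_pos[x] for x, w in severity_weights.items()
)

# Marginal probabilities given disease, averaging over severity.
marg_g = sum(w * p_g_pos[x] for x, w in severity_weights.items())
marg_b = sum(w * p_b_pos[x] for x, w in severity_weights.items())
product_of_marginals = marg_g * marg_b

print(f"P(G=1,B=1|D=1)        = {joint_given_d:.2f}")
print(f"P(G=1|D=1) P(B=1|D=1) = {product_of_marginals:.2f}")
```

The joint probability exceeds the product of the marginals, so G and B are positively associated given D=1 even though they are independent given (D=1, X).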

Conditional Independence: disease prevalence Conditional independence may not be possible at a given disease prevalence.

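Bounds on prevalence under conditional independence Under conditional independence, the observed cross-tabulation of G and B splits into two tables, one for D=1 and one for D=0, with some constraints. If the D=1 table has cells a, b, c, d, then p = P(D=1) = a + b + c + d.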

Example Ignoring sampling variability, for p ∈ (0.285, 0.715), conditional independence is not possible.

Other dependence assumptions • With more tests, some methods model relationships between some tests. This is arbitrary, and cannot be tested without a rich enough study. • Discrepant resolution method; disfavored due to bias. • Composite reference method; success depends on reliability of reference tests.

Bounds on true sensitivity and specificity of a new biomarker • Explore information available from the comparison of B and G, when no assumptions are made regarding their dependence. • Assume operating characteristics of G are known. • Derive bounds for operating characteristics of B.

Idea • Simply by bounding cells in cross tabulation of G and (B,D) to be between 0 and 1 we derive bounds for • P(D=1, B=1|G=1) • P(D=0, B=0|G=0) • True sensitivity and specificity of B are maximized at the maxima of these and minimized at the minima of these.

Example • Apparent sens=25/35=71% • Apparent spec=60/65=92% • Suppose sens of G is 90% and spec of G is 95% • True sens of B is (61%,81%) • True spec of B is (87%,98%) • These bounds are reasonably narrow.
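One way to sketch such bounds, which may differ in detail from the authors' derivation, is to recover the prevalence from P(G=1) and the known operating characteristics of G, then bound each unknown cell P(B=b, D=d, G=g) by the usual limits on a joint probability given its two margins within each G stratum. The function names are hypothetical. Applied to the example above (35 of 100 gold-standard positive), it gives roughly 0.62-0.81 for sensitivity and 0.87-0.98 for specificity, within a percentage point of the quoted bounds.

```python
def cell_bounds(x, y, total):
    """Range of P(A and C) when P(A)=x and P(C)=y inside a stratum of mass `total`."""
    return max(0.0, x + y - total), min(x, y)


def bounds_for_b(p_g1, app_sens, app_spec, sens_g, spec_g):
    """Bounds on the true sens/spec of B with no assumption on B-G dependence."""
    p_g0 = 1 - p_g1
    # Prevalence implied by P(G=1) = p*sens_g + (1-p)*(1-spec_g).
    p = (p_g1 - (1 - spec_g)) / (sens_g + spec_g - 1)
    # Joint distribution of (D, G), fixed by p and the known properties of G.
    d1g1, d1g0 = p * sens_g, p * (1 - sens_g)
    d0g1, d0g0 = (1 - p) * (1 - spec_g), (1 - p) * spec_g
    # Observed joint distribution of (B, G).
    b1g1 = app_sens * p_g1
    b0g1 = p_g1 - b1g1
    b0g0 = app_spec * p_g0
    b1g0 = p_g0 - b0g0
    # Sensitivity of B: P(B=1, D=1) / p, summed over the two G strata.
    lo1, hi1 = cell_bounds(b1g1, d1g1, p_g1)
    lo0, hi0 = cell_bounds(b1g0, d1g0, p_g0)
    sens_range = ((lo1 + lo0) / p, (hi1 + hi0) / p)
    # Specificity of B: P(B=0, D=0) / (1-p), summed over the two G strata.
    lo1, hi1 = cell_bounds(b0g1, d0g1, p_g1)
    lo0, hi0 = cell_bounds(b0g0, d0g0, p_g0)
    spec_range = ((lo1 + lo0) / (1 - p), (hi1 + hi0) / (1 - p))
    return sens_range, spec_range


# First example from the text: 35 of 100 gold-standard positive,
# apparent sens 25/35, apparent spec 60/65, sens_G = 90%, spec_G = 95%.
sens_range, spec_range = bounds_for_b(0.35, 25 / 35, 60 / 65, 0.90, 0.95)
print("sens of B in", sens_range)
print("spec of B in", spec_range)
```

The two G strata can be bounded separately because the cells in one stratum place no constraint on the cells in the other.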

Example • Apparent sens=50% • Apparent spec=75% • Suppose sens of G is 90% and spec of G is 95% • The true sens of B is (33%,67%) • True spec of B is (71%,78%) • Bound for sens is quite wide, ranging from poor test to possibly adequate; bound for spec is narrow.

Conclusions • Low sensitivity of a promising kidney injury biomarker when expected prevalence of disease is low (e.g., contrast nephropathy – NGAL sensitivity=78%) raises the question of imperfect specificity of the “gold standard”. • Likewise, low specificity when expected prevalence is high (e.g., ICU with hypotension and sepsis – NGAL spec=76% when applied to critically ill patients) raises the question of imperfect sensitivity of the gold standard.

Conclusions • Need “hard” clinical endpoints for use as gold standard, but even these have potential problems (e.g., long latency, confounding by other risk factors). • Could use exposure status, such as exposure to a nephrotoxic drug, to avoid SCr. • Amount of information in comparing new biomarker to imperfect gold standard may not be very high, even if imperfect gold standard is a good test itself. • Conditional independence is problematic – physiologically and technically. • Nonparametric bounds may or may not be useful, but they certainly reflect the true information content. • Ultimate validation of a biomarker’s utility is demonstration in a randomized clinical trial that it alters clinical management and improves clinical outcomes.

Acknowledgments • Sarah Emerson, PhD • Sushrut Waikar, MD • Joseph Bonventre, MD Waikar SS, Betensky RA, Emerson SC, Bonventre JV (2012). Imperfect gold standards for kidney injury biomarker evaluation. J Am Soc Nephrol 23: 13-21. Emerson SC, Waikar SS, Bonventre JV, Betensky RA (2012). Biomarker validation with an imperfect reference: issues and bounds. Unpublished manuscript.

With low prevalence, maintaining high specificity is more important than high sensitivity.
