Revising FDA’s “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests”

FDA/Industry Statistics Workshop, September 28-29, 2006
Kristen Meier, Ph.D.
Mathematical Statistician, Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, FDA

Outline • Background of guidance development • Overview of comments • STARD Initiative and definitions • Choice of comparative benchmark and implications • Agreement measures – pitfalls • Bias • Estimating performance without a “perfect” [reference] standard - latest research • Reporting recommendations

Background • Motivated by CDC concerns with IVDs for sexually transmitted diseases • Joint meeting of four FDA device panels (2/11/98): Hematology/Pathology, Clinical Chemistry/Toxicology, Microbiology and Immunology • Provide recommendations on “appropriate data collection, analysis, and resolution of discrepant results, using sound scientific and statistical analysis to support indications for use of in vitro diagnostic devices when the new device is compared to another device, a recognized reference method or ‘gold standard,’ or other procedures not commonly used, and/or clinical criteria for diagnosis”

Statistical Guidance Developed “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests: Draft Guidance for Industry and FDA Reviewers” • issued on Mar. 12, 2003 with a 90-day comment period • http://www.fda.gov/cdrh/osb/guidance/1428.html • for all diagnostic products, not just in vitro diagnostics • only addresses diagnostic devices with 2 possible outcomes (positive/negative) • does not address design and monitoring of clinical studies for diagnostic devices

Dichotomous Diagnostic Test Performance
Study Population
                      TRUTH
                      Truth+         Truth−
New     Test+         TP (true+)     FP (false+)
Test    Test−         FN (false−)    TN (true−)
estimate:
sensitivity (sens) = Pr(Test+|Truth+) ≈ 100%×TP/(TP+FN)
specificity (spec) = Pr(Test−|Truth−) ≈ 100%×TN/(FP+TN)
“Perfect” test: sens=spec=100% (FP=FN=0)

Example Data: 220 Subjects
New test vs. TRUTH (unbiased estimates)
                 TRUTH
                 +       −
New     +        44      1
Test    −        7       168
total            51      169
Sens 86.3% (44/51)
Spec 99.4% (168/169)
New test vs. Imperfect Standard (biased* estimates)
                 Imperfect Standard
                 +       −
New     +        40      5
Test    −        4       171
total            44      176
Sens 90.9% (40/44)
Spec 97.2% (171/176)
* Misclassification bias (see Begg 1987)
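The two sets of estimates on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `sens_spec` and the dictionary layout are illustrative choices, not part of the guidance.

```python
def sens_spec(tp, fp, fn, tn):
    """Return (sensitivity, specificity) in percent from 2x2 cell counts."""
    sens = 100.0 * tp / (tp + fn)   # 100% x TP / (TP + FN)
    spec = 100.0 * tn / (fp + tn)   # 100% x TN / (FP + TN)
    return sens, spec

# New test compared with truth (slide's first table).
truth = dict(tp=44, fp=1, fn=7, tn=168)
# Same 220 subjects, new test compared with the imperfect standard.
imperfect = dict(tp=40, fp=5, fn=4, tn=171)

for label, cells in [("truth", truth), ("imperfect standard", imperfect)]:
    sens, spec = sens_spec(**cells)
    print(f"{label}: sens={sens:.1f}%  spec={spec:.1f}%")
```

The same new-test results give different "sensitivity" and "specificity" figures only because the comparator changed, which is the misclassification bias the slide flags.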

Recalculation of Performance Using “Discrepant Resolution”
STAGE 1 – retest discordants using a “resolver” test
                 Imperfect Standard
                 +              −
New     +        40             5 (5+, 0−)
Test    −        4 (1+, 3−)     171
total            44             176
STAGE 2 – revise 2x2* based on resolver result
                 Resolver/imperfect std.
                 “+”            “−”
New     +        45             0
Test    −        1              174
total            46             174
“sens” 90.9% (40/44) → 97.8% (45/46)
“spec” 97.2% (171/176) → 100% (174/174)
* assumes concordant = “correct”
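The arithmetic behind the revised table can be sketched as follows. The function name and argument names are hypothetical; the point is that only the discordant cells are ever re-labelled, so the revised figures can only move toward the new test.

```python
def discrepant_resolution(a, b, c, d, b_resolved_pos, c_resolved_pos):
    """Revise a 2x2 table by re-labelling discordant cells only.

    a: new+/std+   b: new+/std-   c: new-/std+   d: new-/std-
    b_resolved_pos: number of the b subjects the resolver calls positive
    c_resolved_pos: number of the c subjects the resolver calls positive
    Concordant cells (a, d) are never retested; they are assumed correct.
    """
    new_a = a + b_resolved_pos            # new+ subjects now counted as "+"
    new_b = b - b_resolved_pos            # new+ subjects still counted as "-"
    new_c = c_resolved_pos                # new- subjects still counted as "+"
    new_d = d + (c - c_resolved_pos)      # new- subjects now counted as "-"
    return new_a, new_b, new_c, new_d

# Slide data: resolver calls all 5 of b positive, and 1 of the 4 in c positive.
a2, b2, c2, d2 = discrepant_resolution(40, 5, 4, 171,
                                       b_resolved_pos=5, c_resolved_pos=1)
print(a2, b2, c2, d2)                      # revised 2x2 cells
print(f'"sens" = {100 * a2 / (a2 + c2):.1f}%')
print(f'"spec" = {100 * d2 / (b2 + d2):.1f}%')
```

Because the 211 concordant subjects are never checked, any subject on whom both methods are wrong stays wrong in the revised table.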

Topics for Guidance Realization: • Problems are much larger than “discrepant resolution” • 2x2 is an oversimplification, but still useful to start Provide guidance: • What constitutes “truth”? • What to do if we don’t know truth? • What name do we give performance measures when we don’t have truth? • Describing study design: how were subjects, specimens, measurements, labs collected/chosen?

Comments on Guidance FDA received comments from 11 individuals/organizations: • provide guidance on what constitutes “perfect standard” • remove “perfect/imperfect standard” concept and include and define “reference/non-reference standard” concept (STARD) • reference and use STARD concepts • provide approach for indeterminate, inconclusive, equivocal, etc… results • minimal recommendations • discuss methods for estimating sens and spec when a perfect [reference] standard is not used • cite new literature • include more discussion on bias, including verification bias • some discussion added, add more references • add glossary

STARD Initiative STAndards for Reporting of Diagnostic Accuracy Initiative • effort by international working group to improve quality of reporting of studies of diagnostic accuracy • checklist of 25 items to include when reporting results • provide definitions for terminology • http://www.consort-statement.org/stardstatement.htm

STARD Definitions Adopted Purpose of a qualitative diagnostic test is to determine whether a target condition is present or absent in a subject from the intended use population • Target condition (condition of interest) – can refer to a particular disease, a disease stage, health status, or any other identifiable condition within a patient, such as staging a disease already known to be present, or a health condition that should prompt clinical action, such as the initiation, modification or termination of treatment • Intended use population (target population) – those subjects/patients for whom the test is intended to be used

Reference Standard (STARD) Move away from notion of a fixed, theoretical “Truth” • “considered to be the best available method for establishing the presence or absence of the target condition…it can be a single test or method, or a combination of methods and techniques, including clinical follow-up” • dichotomous - divides the intended use population into condition present or absent • does not consider outcome of new test under evaluation

Reference Standard (FDA) What constitutes “best available method”/reference method? • opinion and practice within the medical, laboratory and regulatory community • several possible methods could be considered • maybe no consensus reference standard exists • maybe reference standard exists but for a non-negligible % of the intended use population, the reference standard is known to be in error FDA ADVICE: • consult with FDA on choice of reference standard before beginning your study • performance measures must be interpreted in context: report reference standard along with performance measures

Benchmarks for Assessing Diagnostic Performance NEW: FDA recognizes 2 major categories of benchmarks • reference standard (STARD) • non-reference standard (a method or predicate other than a reference standard; 510(k) regulations) OLD: “perfect standard” and “imperfect standard”, “gold standard” – concepts and terms deleted Choice of comparative method determines which performance measures can be reported

Comparison with Benchmark • If a reference standard is available: use it • If a reference standard is available, but impractical: use it to the extent possible • If a reference standard is not available or unacceptable for your situation: consider constructing one • If a reference standard is not available and cannot be constructed, use a non-reference standard and report agreement

Naming Performance Measures: Depends on Benchmarks Terminology is important – help ensure correct interpretation Reference standard (STARD) • a lot of literature on studies of diagnostic accuracy (Pepe 2003, Zhou et al. 2002) • report sensitivity, specificity (and corresponding CIs), predictive values of positive and negative results Non-reference standard (due to 510(k) regulations) • report positive percent agreement and negative percent agreement • NEW: include corresponding CIs (consider score CIs) • interpret with care – many pitfalls!
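For the "score CIs" recommendation, one common choice is the Wilson score interval for a binomial proportion. The sketch below assumes that is the interval intended (the slide does not name a specific method) and uses a two-sided 95% level; the function name is illustrative.

```python
import math

def wilson_score_interval(successes, n, z=1.959964):
    """Wilson score confidence interval for a binomial proportion.

    z is the standard normal quantile (1.959964 for a two-sided 95% interval).
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# Positive percent agreement of 40/44 from the deck's example data.
lo, hi = wilson_score_interval(40, 44)
print(f"PPA = {100 * 40 / 44:.1f}%, 95% score CI = ({100 * lo:.1f}%, {100 * hi:.1f}%)")
```

Unlike the simple Wald interval, the score interval stays inside [0, 1] and does not collapse to zero width when the observed proportion is 0% or 100%, which matters for small denominators such as 174/174.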

Agreement
Study Population
                      Non-Reference Standard
                      +       −
New     Test+         a       b
Test    Test−         c       d
Positive percent agreement (new/non ref. std.) = 100%×a/(a+c)
Negative percent agreement (new/non ref. std.) = 100%×d/(b+d)
[overall percent agreement = 100%×(a+d)/(a+b+c+d)]
“Perfect” new test: PPA≠100% and NPA≠100%

Pitfalls of Agreement • agreement as defined here is not symmetric: calculation is different depending on which marginal total you use for the denominator • overall percent agreement is symmetric, but can be misleading (very different 2x2 data can give the same overall agreement) • agreement ≠ “correct” • overall agreement, PPA and NPA can change (possibly a lot) depending on the prevalence (relative frequency of target condition in intended use population)
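The prevalence pitfall can be illustrated numerically. The sketch below borrows the stratified counts from the deck's "Agreement ≠ Correct" example and re-weights the two reference-standard strata to a different prevalence. It assumes, purely for illustration, that the joint behaviour of the two tests within each stratum stays fixed as prevalence changes; the function name and the 5% prevalence are hypothetical.

```python
def agreement_at_prevalence(pos_stratum, neg_stratum, prevalence):
    """PPA, NPA and overall agreement (percent) at a given prevalence.

    pos_stratum, neg_stratum: (a, b, c, d) counts among subjects with and
    without the target condition, where a = new+/std+, b = new+/std-,
    c = new-/std+, d = new-/std-.
    Assumption: within-stratum cell proportions do not depend on prevalence.
    """
    n_pos, n_neg = sum(pos_stratum), sum(neg_stratum)
    a, b, c, d = (prevalence * p / n_pos + (1.0 - prevalence) * q / n_neg
                  for p, q in zip(pos_stratum, neg_stratum))
    ppa = 100.0 * a / (a + c)
    npa = 100.0 * d / (b + d)
    overall = 100.0 * (a + d)      # the four cell proportions sum to 1
    return ppa, npa, overall

ref_pos = (39, 5, 1, 6)      # condition present (51 subjects)
ref_neg = (1, 0, 3, 165)     # condition absent (169 subjects)

for prev in (51 / 220, 0.05):
    ppa, npa, overall = agreement_at_prevalence(ref_pos, ref_neg, prev)
    print(f"prevalence={prev:.3f}: PPA={ppa:.1f}%  NPA={npa:.1f}%  overall={overall:.1f}%")
```

At the observed prevalence (51/220) the function returns the pooled PPA and NPA; at the lower prevalence the same two tests show a noticeably lower PPA, even though neither test has changed.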

Overall Agreement Misleading
Data set 1
                 Non-Ref Standard
                 +       −
New     +        40      1
Test    −        19      512
total            59      513
overall agreement = 96.5% ((40+512)/572)
PPA = 67.8% (40/59)
NPA = 99.8% (512/513)
Data set 2
                 Non-Ref Standard
                 +       −
New     +        40      19
Test    −        1       512
total            41      531
overall agreement = 96.5% ((40+512)/572)
PPA = 97.6% (40/41)
NPA = 96.4% (512/531)
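A short check of the slide's figures, using the a, b, c, d layout of the agreement table. The function name `agreement` is an illustrative choice.

```python
def agreement(a, b, c, d):
    """Return (PPA, NPA, overall agreement) in percent.

    a = new+/std+, b = new+/std-, c = new-/std+, d = new-/std-,
    where "std" is the non-reference standard.
    """
    ppa = 100.0 * a / (a + c)
    npa = 100.0 * d / (b + d)
    overall = 100.0 * (a + d) / (a + b + c + d)
    return ppa, npa, overall

first = agreement(40, 1, 19, 512)    # 19 subjects: new test -, standard +
second = agreement(40, 19, 1, 512)   # 19 subjects: new test +, standard -

for name, (ppa, npa, overall) in [("data set 1", first), ("data set 2", second)]:
    print(f"{name}: PPA={ppa:.1f}%  NPA={npa:.1f}%  overall={overall:.1f}%")
```

Swapping the two discordant cells leaves overall agreement untouched while PPA moves by roughly 30 percentage points, which is why the slide recommends reporting PPA and NPA rather than overall agreement alone.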

Agreement ≠ Correct
Original data:
                 Non-Reference Standard
                 +       −
New     +        40      5
Test    −        4       171
Stratify data above by Reference Standard outcome:
Reference Std +
                 Non-Ref Std
                 +       −
New     +        39      5
Test    −        1       6
Reference Std −
                 Non-Ref Std
                 +       −
New     +        1       0
Test    −        3       165
tests agree and are wrong for 6+1 = 7 subjects
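The count of "agree but wrong" subjects follows directly from the two strata. The sketch below uses the slide's counts; the dictionary keys are an illustrative labelling of the four cells.

```python
# Cell labels: (new test result, non-reference standard result).
ref_pos = {("+", "+"): 39, ("+", "-"): 5, ("-", "+"): 1, ("-", "-"): 6}
ref_neg = {("+", "+"): 1, ("+", "-"): 0, ("-", "+"): 3, ("-", "-"): 165}

# Pooling the strata recovers the original 2x2 table.
pooled = {cell: ref_pos[cell] + ref_neg[cell] for cell in ref_pos}

total = sum(pooled.values())
agree = pooled[("+", "+")] + pooled[("-", "-")]

# Both tests negative when the condition is present, or both positive
# when it is absent: the tests agree with each other and are both wrong.
agree_wrong = ref_pos[("-", "-")] + ref_neg[("+", "+")]
agree_right = agree - agree_wrong

print(pooled)
print(f"agree: {agree}/{total}, of which wrong: {agree_wrong}")
print(f"overall agreement = {100 * agree / total:.1f}%")
print(f"agree and correct = {100 * agree_right / total:.1f}%")
```

Without the reference standard, the 7 jointly wrong subjects are indistinguishable from the 204 jointly correct ones, so agreement statistics alone cannot reveal them.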

Bias Unknown and non-quantified uncertainty • Often existence, size (magnitude), and direction of bias cannot be determined • Increasing overall number of subjects reduces statistical uncertainty (confidence interval widths) but may do nothing to reduce bias

Some Types of Bias • error in reference standard • use test under evaluation to establish diagnosis • spectrum bias – do not choose the “right” subjects • verification bias – only a non-representative subset of subjects evaluated by reference standard, no statistical adjustments made to estimates • many other types of bias See Begg (1987), Pepe (2003), Zhou et al. (2002)

Estimating Sens and Spec Without a Reference Standard • Model-based approaches: latent class models and Bayesian models. See Pepe (2003), and Zhou et al. (2002) • Albert and Dodd (2004) • incorrect model leads to biased sens and spec estimates • different models can fit data equally well, yet produce very different estimates of sens and spec • FDA concerns & recommendations: • difficult to verify that model and assumptions are correct • try a range of models and assumptions and report range of results

Reference Standard Outcomes on a Subset • Albert and Dodd (2006, under review) • use info from verified and non-verified subjects • choosing between competing models is easier • explore subset choice (random, test dependent) • Albert (2006, under review) • estimation via imputation • study design implications (Albert, 2006) • Kondratovich (2003; 2002-Mar-8 FDA Microbiology Devices Panel Meeting) • estimation via imputation

Practices to Avoid • using terms “sensitivity” and “specificity” if reference standard is not used • discarding equivocal results in data presentations and calculations • using data altered or updated by discrepant resolution • using the new test as part of the comparative benchmark

External validity A study has high external validity if the study results are sufficiently reflective of the “real world” performance of the device in the intended use population

External validity FDA recommends • include appropriate subjects and/or specimens • use final version of the device according to the final instructions for use • use several of these devices in your study • include multiple users with relevant training and range of expertise • cover a range of expected use and operating conditions

Reporting Recommendations • CRITICAL - need sufficient detail to be able to assess potential bias and external validity • just as (more?) important as computing CI’s correctly • see guidance for specific recommendations

References
Albert, P. S. (2006). Imputation approaches for estimating diagnostic accuracy for multiple tests from partially verified designs. Technical Report 042, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).
Albert, P. S., & Dodd, L. E. (2004). A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.
Albert, P. S., & Dodd, L. E. (2006). On estimating diagnostic accuracy with multiple raters and partial gold standard evaluation. Technical Report 041, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).
Begg, C. B. (1987). Biases in the assessment of diagnostic tests. Statistics in Medicine, 6, 411–423.
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., Lijmer, J. G., Moher, D., Rennie, D., & deVet, H. C. W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 326(7379), 41–44)

References (continued)
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., Moher, D., Rennie, D., deVet, H. C. W., & Lijmer, J. G. (2003). The STARD statement for reporting studies of diagnostic accuracy: Explanation and elaboration. Clinical Chemistry, 49(1), 7–18. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 326(7379), 41–44)
Kondratovich, M. (2003). Verification bias in the evaluation of diagnostic devices. Proceedings of the 2003 Joint Statistical Meetings, Biopharmaceutical Section, San Francisco, CA.
Lang, T. A., & Secic, M. (1997). How to report statistics in medicine. Philadelphia: American College of Physicians.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press.
Zhou, X. H., Obuchowski, N. A., & McClish, D. K. (2002). Statistical methods in diagnostic medicine. New York: John Wiley & Sons.
