
Statistical Considerations in the Evaluation of Digital Pathology Devices






Presentation Transcript


  1. Statistical Considerations in the Evaluation of Digital Pathology Devices Hematology and Pathology Devices Panel Meeting October 22-23, 2009 Shanti Gomatam, Ph.D. Mathematical Statistician FDA/CDRH/OSB/DBS

  2. Outline (Q 0) • Intended Use • Clinical Study Design Issues • Study Design Examples • Assessing Results • Precision Studies

  3. Intended Use The intended use under discussion is primary diagnosis of surgical pathology microscope slides in lieu of optical microscopy (OM). Broad application -- not organ or disease specific. The Intended Use Population (IUP) is the population of subjects on whom the device is intended to be used.

  4. Supporting Evidence Sponsors would be required to provide evidence to support the safety and effectiveness of whole slide imaging (WSI) under its intended use. • Clinical studies assess how well WSI performs with respect to OM under clinical use. • Precision studies characterize imprecision (variability) in WSI results.

  5. Supporting Evidence [Flowchart: Analyze Results → Establish Performance]

  6. Bias and Variance [Figure: target diagrams illustrating low bias/high variance, large bias/low variance, and low bias/low variance]

  7. Bias and Variance • Bias is about hitting the right target. • Variance or imprecision is about how close together your repeated attempts are. • Right data (right study design) helps reduce bias; more data does not help. • More data can help reduce uncertainty (imprecision).
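
The point above can be illustrated with a small simulation. This is a sketch with purely assumed numbers (a true sensitivity of 0.90, and an inflated 0.97 under a non-representative design that under-samples difficult cases): increasing the sample size shrinks the variance of the estimate but leaves the bias untouched.

```python
import random

random.seed(0)

TRUE_SENSITIVITY = 0.90       # assumed true sensitivity in the IUP (illustrative)
EASY_CASE_SENSITIVITY = 0.97  # apparent sensitivity when hard cases are excluded

def biased_sample(n):
    # Non-representative design: the estimate targets the wrong quantity.
    return sum(random.random() < EASY_CASE_SENSITIVITY for _ in range(n)) / n

def unbiased_sample(n):
    # Representative design: the estimate targets the true sensitivity.
    return sum(random.random() < TRUE_SENSITIVITY for _ in range(n)) / n

def mean_and_var(sampler, n, reps=500):
    # Repeat the study many times to see the sampling distribution.
    ests = [sampler(n) for _ in range(reps)]
    mean = sum(ests) / reps
    var = sum((e - mean) ** 2 for e in ests) / reps
    return mean, var

for n in (50, 5000):
    mb, vb = mean_and_var(biased_sample, n)
    mu, vu = mean_and_var(unbiased_sample, n)
    print(f"n={n:5d}  biased: mean={mb:.3f} var={vb:.6f}   "
          f"unbiased: mean={mu:.3f} var={vu:.6f}")
```

At n = 5000 the variance of both estimators is tiny, but the biased design still centers near 0.97 rather than the true 0.90: more data sharpened the wrong answer.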

  8. Clinical Study Design Issues Factors to consider: • Diagnostic reference standard • Time of specimen collection • Comparing modalities • Paired design • Reader design • Sample Selection

  9. Clinical Study Design Diagnostic Reference Standard (Reference Diagnosis) • Diagnostic accuracy is based on determination of “truth” via a diagnostic reference standard (see FDA Diagnostic Guidance¹). • The diagnostic reference standard allows determination of accuracy (e.g., TP, FP, TN, FN). • The diagnostic reference standard should not be based on the device being evaluated for accuracy. • When the diagnostic reference standard is based on the control device (OM), the comparison can be biased. ¹ FDA Guidance document: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests.

  10. Q 3.2 Clinical Study Design Time of Specimen Collection Prospective Studies Prospective studies are those in which specimens (cases/slides) are prospectively collected and assessed by each modality (WSI or OM). • Prospective planning required. • Common protocol used across specimens. • Prospective studies are less likely to be biased. • Study duration is potentially longer. • The final collection of study specimens may not contain all specimens of interest.

  11. Q 3.2 Clinical Study Design Time of Specimen Collection Retrospective Studies Retrospective studies are based on specimens that were previously collected from patients. • Easier to enrich. • Potential for bias: selection criteria; hidden missing sample/data issues; variation in pre-analytical processes. • Potential lack of clinical, demographic, and other information for specimens (cases/slides).

  12. Clinical Study Design Comparing Modalities • Best to compare WSI to OM (“control”) on the same samples. • Avoids potential bias due to changes in clinical practice or other time- or location-dependent factors. • Difficult to evaluate WSI without comparison to the control device, OM.

  13. Clinical Study Design Paired Designs When each specimen (case/slide) is tested with both WSI and OM, the study design is paired. • Paired designs have good statistical properties. Design considerations: • Memory of the first reading can affect the next reading (insufficient washout). • The order of WSI and OM readings should be randomized.
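
The order randomization above can be sketched as a balanced allocation: half the specimens are read OM-first and half WSI-first, with a washout period between the two reads of each specimen. The specimen IDs and arm labels here are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the allocation is reproducible

# Hypothetical specimen IDs; in practice these come from the study database.
specimens = [f"S{i:03d}" for i in range(1, 9)]

# Balanced allocation of read order, then shuffled so the assignment is random.
orders = (["OM -> washout -> WSI"] * (len(specimens) // 2) +
          ["WSI -> washout -> OM"] * (len(specimens) - len(specimens) // 2))
random.shuffle(orders)

schedule = dict(zip(specimens, orders))
for specimen, order in schedule.items():
    print(specimen, order)
```

Shuffling a fixed half-and-half list (rather than flipping a coin per specimen) guarantees the two read orders are exactly balanced across the study.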

  14.–19. Clinical Study Design Paired Designs [Animation sequence building up the paired-design diagram: each specimen receives both an OM reading and a WSI reading]

  20. Clinical Study Design Paired Designs [Diagram: specimens are split into two arms, one read OM then WSI and the other WSI then OM, with a washout time between the two reads]

  21. Clinical Study Design Reader Design • Pathologists are the “readers” for this indication. • The reader effect makes a difference to the results obtained. • Reader designs range from: • every reader reads every specimen under every modality • … • each reader reads a different subset of specimens under a single modality. • The first design is the most efficient.

  22. Q 3.2 Clinical Study Design Sample Selection • Non-representative samples may lead to conclusions that are not generalizable to the IUP (bias may be high and variance estimates may be incorrect). • Random selection from the IUP is the preferred statistical choice. Consecutive (sequential) selection from the IUP may be reasonable (under suitable conditions). • Enrichment may be necessary to have rare conditions represented in sufficient numbers.

  23. Q 3.2 Clinical Study Design Sample Selection • Adequate representation of non-disease and benign disease cases is needed. • Factors to be considered while picking the sample: • Organ/disease for which specimens are collected; • Type of specimen (needle biopsy, resection, etc.); • Potential spectrum effect (level of difficulty -- case mix); • Clinical center/site from which samples are obtained. • Ideally, the statistical mechanism for drawing specimens does not introduce bias; pre-specification is preferred.

  24. Potential Study Design Examples

  25. Study Design Examples Common Elements of all Design Examples • Specimens picked from regular clinical practice at multiple sites. • Paired design; Specimen order and order of read are randomized. • Diagnostic reference standard available for statistical analysis. • Readers read de-identified specimens. • Results from specimens are compared on diagnoses.

  26. Study Design Examples Study I: Prospective Clinical Study • Prospective study using consecutive clinical specimens. • R pathologists at each site read all specimens at that site with WSI and OM, with appropriate washout.

  27. Study Design Examples Study II: Retrospective Enriched Clinical Study • Prospectively planned retrospective study using enriched clinical specimens randomly picked from those available. • R pathologists at each site read all specimens at all sites with WSI and OM. • A non-study pathologist reads specimens to implement enrichment; study pathologists are blinded to the enrichment read.

  28. Study Design Examples Study III: Retrospective Clinical Study • Prospectively planned retrospective study using consecutive clinical specimens. • R pathologists at each site read all specimens at all sites with WSI and OM.

  29. Study Design Examples Study I: Prospective Clinical Study Pros: • Representative of intended use • Ensures planning (prospective) • Common protocol (prospective) • Reduction in bias (prospective) Cons: • Reader design not as efficient • Potential implementation challenges (prospective) • May take longer (non-enriched, prospective) • Reader behavior could be affected (multiple reads)

  30. Study Design Examples Study II: Retrospective Enriched Clinical Study Pros: • Easier to implement (retrospective) • Potentially smaller sample size (enrichment) • Ensures some planning (prospectively planned) • Reader design efficient (all cases read with both) Cons: • Lack of common protocol (retrospective) • Potential bias (retrospective) • Reader behavior could be affected (enrichment + multiple reads)

  31. Study Design Examples Study III: Retrospective Clinical Study Pros: • Ensures some planning • Potentially shorter duration (retrospective) • Potentially larger sample size (non-enriched) • Reader design efficient Cons: • Lack of common protocol (retrospective) • Potential bias (retrospective) • Reader behavior could be affected (multiple reads)

  32. Additional Clinical Design Issues: Assessing Results

  33. Assessing Results Assessing Results • Attributes/measurements to be evaluated • Hypotheses on Attributes • Study success criterion • Study sizing

  34. Assessing Results Examples Two organ systems will be used as examples in the following slides. • Breast: CAP Breast IC protocol checklist • Lung: CAP Lung IC Biopsy protocol checklist

  35. Assessing Results CAP Protocol for Breast IC Macroscopic Elements • Specimen Type • Lymph Node Sampling • Specimen Size • Laterality • Tumor Site

  36. Assessing Results CAP Protocol for Breast IC (cont.) Microscopic Elements • Size of invasive component • Histologic Type (check all that apply): • ___ Noninvasive carcinoma (NOS) • ___ Ductal carcinoma in situ • ___ Lobular carcinoma in situ • … • ___ Other(s) (specify): ____________________________ • ___ Carcinoma, type cannot be determined

  37. Assessing Results CAP Protocol for Breast IC (cont.) Microscopic Elements • Histologic Grade: • Nottingham Histologic Score (tubule formation; nuclear pleomorphism; mitotic count) OR • Other Grading System + Mitotic Count • Pathologic Staging • Margins • Venous/Lymphatic Invasion • Microcalcifications • Additional Pathologic Findings

  38. Assessing Results CAP Protocol for Lung IC Biopsy Microscopic Elements • Histologic Type: • ___ Carcinoma, non-small cell type • ___ Small cell carcinoma • ___ Squamous cell carcinoma • … • ___ Other(s) (specify): ____________________________ • ___ Carcinoma, type cannot be determined

  39. Assessing Results CAP Protocol for Lung IC Biopsy Microscopic Elements • Histologic Grade: • ___ Not applicable • ___ GX: Cannot be assessed • ___ G1: Well differentiated • ___ G2: Moderately differentiated • ___ G3: Poorly differentiated • ___ G4: Undifferentiated • ___ Other (specify): ______ • Visceral Pleura Invasion • Venous Invasion • Lymphatic Invasion • Additional Pathologic Findings

  40. Assessing Results Measurements • Measurements vary by tissue type. • Measurements vary by pathological findings. • There are many potential measurements per specimen. • What results/findings should one use to assess device performance?

  41. Assessing Results Selecting Measurements • Should one assess on the basis of a case (multiple slides) or a single whole slide? • Should microscopic and/or macroscopic findings be assessed? • The pathology report has multiple “lines” of results, each potentially containing information on type, grade, size, … On how many “lines” is it sufficient to assess agreement?

  42. Assessing Results Selecting Measurements • What fields within each “line” should be compared? • Histologic type • Histologic grade • Histologic determination of size (for case) using multiple slides • … • Results are tissue-type/disease dependent.

  43. Assessing Results Potential Measurements for Performance Comparison • Disease/non-disease status • Primary diagnosis only (main diagnosis for specimen); some diagnoses from pathological evaluation; all diagnoses from pathological evaluation • Any of the above will involve multiple measurements of different kinds: type is nominal, grade is ordinal, size is interval, …

  44. Q 3.5 Assessing Results “Primary” and “Secondary” Measurements • Agreement on which measurements is key for regulatory decisions? (“Primary” measurements) • What additional comparisons are useful to report? (“Secondary” measurements)

  45. Assessing Results Assessing Accuracy: Scales Accuracy and comparative performance can be assessed at various levels and for different outcomes: • On a binary scale (e.g., disease/non-disease) • On a nominal scale (e.g., histologic type) • On an ordinal scale (e.g., histologic grade) • On a continuous scale (e.g., tumor size or probability of being diseased)

  46. Assessing Results More on Assessing Accuracy • Sensitivity and specificity can be used for assessments on the binary scale. • Agreement on the ordinal scale can be evaluated using sensitivities/specificities conditional on category, and using ROC-based methods. • Many methods exist for assessing agreement on a continuous scale. • Nonparametric methods exist for assessing diagnostic accuracy on all scales*. * Obuchowski (2005), Acad. Radiol.
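
On the binary scale, sensitivity and specificity follow directly from the 2×2 cross-tabulation of each modality against the reference standard. A minimal sketch; all counts below are hypothetical, for illustration only.

```python
def binary_accuracy(tp, fp, tn, fn):
    """Sensitivity and specificity from 2x2 counts against a reference standard."""
    sensitivity = tp / (tp + fn)  # P(test positive | reference positive)
    specificity = tn / (tn + fp)  # P(test negative | reference negative)
    return sensitivity, specificity

# Hypothetical counts: each modality cross-tabulated against the reference diagnosis.
wsi = {"tp": 88, "fp": 6, "tn": 94, "fn": 12}
om  = {"tp": 90, "fp": 5, "tn": 95, "fn": 10}

for name, counts in (("WSI", wsi), ("OM", om)):
    se, sp = binary_accuracy(**counts)
    print(f"{name}: sensitivity={se:.3f} specificity={sp:.3f}")
```

Reporting both modalities against the same reference, on the same specimens, is what the paired design above makes possible.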

  47. Assessing Results Assessing Nominal Accuracy • Histologic type is an important attribute for performance assessment. • K×K* tables for nominal types: • WSI vs. Reference and • OM vs. Reference • Example using Breast IC histologic types. * K is the number of types of responses

  48. Assessing Results Assessing Nominal Accuracy: K by K Tables [Tables: WSI vs. reference and OM vs. reference by histologic type. Abbreviations: NIC: non-invasive carcinoma; DCIS: ductal carcinoma in situ; C,ND: carcinoma, type not determined]

  49. Assessing Results Assessing Nominal Accuracy • Can use percent “correct” calls for each of the K types. • If K is large, a large N is needed to adequately power the estimates. • K can also be reduced by combining categories into subgroups.
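
The percent-“correct” idea above can be sketched as follows. The paired calls are hypothetical, with category labels loosely echoing the Breast IC abbreviations (NIC, DCIS, LCIS); in a real study each pair would come from the K×K table of device call versus reference diagnosis.

```python
from collections import Counter

# Hypothetical paired calls: (reference type, device type) per specimen.
pairs = [
    ("DCIS", "DCIS"), ("DCIS", "DCIS"), ("DCIS", "LCIS"),
    ("LCIS", "LCIS"), ("LCIS", "LCIS"),
    ("NIC",  "NIC"),  ("NIC",  "DCIS"), ("NIC", "NIC"),
]

def percent_correct_by_category(pairs):
    """Fraction of device calls matching the reference, per reference category."""
    totals, correct = Counter(), Counter()
    for ref, dev in pairs:
        totals[ref] += 1
        if dev == ref:
            correct[ref] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}

print(percent_correct_by_category(pairs))
```

Each category's estimate is powered only by the specimens in that category, which is why a large K demands a large N, and why combining rare categories into subgroups can help.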

  50. Assessing Results Assessing Nominal Accuracy • If ordinal subgroups are possible, ordinal analyses can be performed. • It may also be possible to define differences between categories in terms of clinical importance; this could reduce table size and create ordinal categories. • However, the loss of information should be considered when combining categories.
