Presentation Transcript


  1. Introduction to translational and clinical bioinformatics: Connecting complex molecular information to clinically relevant decisions using molecular profiles. Alexander Statnikov, Ph.D., Director, Computational Causal Discovery Laboratory; Assistant Professor, NYU Center for Health Informatics and Bioinformatics, General Internal Medicine. Constantin F. Aliferis, M.D., Ph.D., FACMI, Director, NYU Center for Health Informatics and Bioinformatics; Informatics Director, NYU Clinical and Translational Science Institute; Director, Molecular Signatures Laboratory; Associate Professor, Department of Pathology; Adjunct Associate Professor in Biostatistics and Biomedical Informatics, Vanderbilt University.

  2. Overview • Session #1: Basic Concepts • Session #2: High-throughput assay technologies • Session #3: Computational data analytics • Session #4: Case study / practical applications • Session #5: Hands-on computer lab exercise

  3. Building Cancer Diagnostic Models from Microarray Gene Expression Data. [Workflow diagram: the Clinical Researcher sends samples to the Microarray Lab and receives back microarray data; the microarray data and a hypothesis go to a Biostatistician or Bioinformatician/Computer Scientist who, with a Programmer, carries out experimental design, planning and execution of experiments, and analysis of results; the results, a diagnostic (dx) model, and a written report flow back to the Clinical Researcher. Turnaround: weeks or months.]

  4. Automated Diagnostic System GEMS: Gene Expression Model Selector. [Workflow diagram: GEMS replaces the manual analysis loop of the previous slide; the Clinical Researcher submits microarray data and a hypothesis directly to the automated system and receives back a report and a diagnostic (dx) model, cutting turnaround from weeks or months to minutes or hours.] The automated system: • Outputs high-quality models for cancer diagnosis from gene expression data; • Produces reliable performance estimates; • Allows models to be applied to unseen patients; • Discovers target biomarker candidates; • Implements the best known diagnostic methodologies; • Uses sound techniques for model selection and performance estimation.

  5. Other Systems for Supervised Analysis of Microarray Data. There exist many good software packages for supervised analysis of microarray data, but… • None of these systems provides a protocol for data analysis that precludes overfitting. • A typical package offers either an overabundance of algorithms or algorithms with unknown performance; thus it is not clear to the user how to choose an optimal algorithm for a given data analysis task. • The packages address the needs of experienced analysts, yet there is also a need for users who know little about data analysis (e.g., biologists and clinicians) to use such software and still achieve good results.

  6. What Does Prior Research Suggest About the Best Performing Methods? • Limited range of methods and datasets per study; • No description of parameter optimization of the learners; • Different experimental designs are employed across studies; • Overfitting is widespread [Ntzani et al., Lancet 2003]: among 193 primary studies (with 2 meta-analyses), 74% used no validation, 13% used incomplete cross-validation, and only 13% implemented cross-validation correctly; • The available meta-analyses are not aimed at identifying the best performing methodologies. Conclusion: prior research cannot specify a small set of best performing diagnostic algorithms, so the evaluation had to be performed de novo.

  7. Algorithmic Evaluations to Inform Development of the System

  8. 1st Algorithmic Evaluation Study. Main Goal: Investigate which ones among the many powerful classifiers currently available for gene expression diagnosis perform the best across many datasets and cancer types. [Chart: accuracy (%) of the multiclass SVMs of this study vs. the multiple specialized classification methods of the original primary studies.] Results: • Multiclass SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV; • Gene selection in some cases improves the classification performance of all classifiers, especially of non-SVM algorithms; • Ensemble classification does not improve performance; • The obtained results compare favorably with the literature. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005;21:631-643.
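The study's headline comparison is easy to approximate with modern tooling. Below is a minimal sketch of a one-versus-rest multiclass SVM evaluated by cross-validation, using scikit-learn and synthetic data in place of a real gene expression matrix; it illustrates the evaluation design only and is not the original study's code.

```python
# Sketch only: scikit-learn one-versus-rest multiclass SVM with
# cross-validated accuracy; synthetic data stands in for gene expression.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 100 patients x 1,000 genes, 4 tumor types.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=20, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# One-versus-rest decomposition of a linear SVM, with per-gene scaling.
ovr_svm = make_pipeline(StandardScaler(),
                        OneVsRestClassifier(SVC(kernel="linear", C=1.0)))

# 5-fold cross-validated accuracy.
scores = cross_val_score(ovr_svm, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```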

  9. 2nd Algorithmic Evaluation Study. Main Goal: Determine feature selection algorithms (applicable to high-dimensional microarray gene expression or mass-spectrometry data) that significantly reduce the number of predictors while maintaining optimal classification performance. Result: • Markov blanket techniques (e.g., HITON) provide the smallest subsets of predictors that achieve optimal classification performance. Aliferis CF, Tsamardinos I, Statnikov A. HITON: a novel Markov blanket algorithm for optimal variable selection. AMIA Symposium, 2003.
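HITON itself relies on statistical conditional-independence tests and is not reproduced here. Purely as a loose illustration of the grow/shrink idea behind Markov blanket induction, the toy sketch below uses a fixed partial-correlation threshold as a stand-in for a proper independence test; the function names and the alpha cutoff are illustrative.

```python
# Toy illustration of grow/shrink Markov-blanket-style selection.
# NOT the actual HITON algorithm: a fixed partial-correlation threshold
# stands in for proper conditional-independence tests.
import numpy as np

def partial_corr(x, y, Z):
    """Correlation of x and y after regressing both on the columns of Z."""
    if Z.shape[1] == 0:
        return np.corrcoef(x, y)[0, 1]
    Z1 = np.column_stack([np.ones(len(x)), Z])
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def toy_markov_blanket(X, y, alpha=0.2):
    marginal = np.abs([np.corrcoef(X[:, j], y)[0, 1]
                       for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(-marginal):          # grow: admit associated features
        if abs(partial_corr(X[:, j], y, X[:, selected])) > alpha:
            selected.append(j)
    for j in list(selected):                 # shrink: drop features that become
        rest = [k for k in selected if k != j]   # independent given the rest
        if abs(partial_corr(X[:, j], y, X[:, rest])) <= alpha:
            selected.remove(j)
    return sorted(int(j) for j in selected)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 2 * X[:, 3] - X[:, 17] + 0.5 * rng.standard_normal(200)
print(toy_markov_blanket(X, y))              # likely recovers columns 3 and 17
```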

  10. Algorithms Implemented in GEMS • Cross-validation designs: N-fold cross-validation, leave-one-out cross-validation (LOOCV), nested N-fold cross-validation, nested LOOCV. • Classifiers (MC-SVM): one-versus-rest, one-versus-one, DAGSVM, method by Weston & Watkins (WW), method by Crammer & Singer (CS). • Gene selection methods: S2N one-versus-rest, S2N one-versus-one, non-parametric ANOVA, BW ratio, HITON_PC, HITON_MB. • Normalization techniques: rescaling to [a, b], (x - MEAN(x)) / STD(x), x / STD(x), x / MEAN(x), x / MEDIAN(x), x / NORM(x), x - MEAN(x), x - MEDIAN(x), ABS(x), x + ABS(x). • Performance metrics: accuracy, relative classifier information (RCI), area under the ROC curve (AUC ROC).
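Two of the listed ingredients are simple enough to sketch. Below, per-gene standardization, i.e. (x - MEAN(x)) / STD(x), and S2N (signal-to-noise) gene ranking in its one-versus-rest form are shown in a few lines of Python; this illustrates the formulas only, and the exact GEMS implementation may differ.

```python
# Sketch of two listed ingredients: per-gene standardization and
# one-versus-rest signal-to-noise (S2N) gene ranking. Illustrative only;
# the exact GEMS implementation may differ.
import numpy as np

def standardize(X):
    """Per-gene normalization: (x - MEAN(x)) / STD(x); columns = genes."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def s2n_one_versus_rest(X, y, target_class):
    """S2N per gene: (mu_class - mu_rest) / (sd_class + sd_rest)."""
    in_cls, rest = X[y == target_class], X[y != target_class]
    return ((in_cls.mean(axis=0) - rest.mean(axis=0)) /
            (in_cls.std(axis=0) + rest.std(axis=0)))

rng = np.random.default_rng(0)
X = standardize(rng.standard_normal((60, 500)))   # 60 patients x 500 genes
y = rng.integers(0, 3, size=60)                   # three tumor classes
scores = np.abs(s2n_one_versus_rest(X, y, target_class=0))
top_genes = np.argsort(-scores)[:50]              # keep 50 top-ranked genes
print(top_genes[:10])
```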

  11. System Description

  12. GEMS Inputs & Outputs. Inputs: 1. Dataset and outcome or diagnostic labels; 2. Optional: gene names and/or accession numbers; 3. Various choices of parameters for the analysis (with defaults); 4. In application mode: a previously saved model. Outputs: 1. Classification model; 2. Performance estimate; 3. In application mode: the model's diagnoses/predictions and overall performance; 4. A reduced set of genes; 5. Links from the genes to the literature and other resources.

  13. Cross-Validation Design. Performance estimation by cross-validation: the data are split into folds, and in each round one fold is held out as the test set while the remaining folds form the training set. What if the classifier is parametric? Then model selection / parameter optimization is done by nested cross-validation: each training set is further split into training and validation parts, and parameters are chosen on the validation parts before the held-out test fold is ever touched.
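A minimal sketch of this design, using scikit-learn rather than GEMS itself: the inner loop optimizes the SVM cost parameter C on train/validation splits, and the outer loop estimates the performance of the entire selection procedure on held-out test folds.

```python
# Nested cross-validation sketch (illustrates the design, not GEMS's code).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=15, random_state=0)

# Inner loop: model selection / parameter optimization on train+validation.
inner = GridSearchCV(SVC(kernel="linear"),
                     param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)

# Outer loop: performance estimation on test folds the inner loop never saw.
outer_scores = cross_val_score(inner, X, y, cv=10)
print("Nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```

Tuning parameters and estimating performance on the same folds would bias the estimate upward; keeping the two loops nested is the safeguard against overfitting emphasized throughout.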

  14. Software Architecture. Client (wizard-like user interface): the user can (I) generate a classification model, (II) generate a classification model and estimate its performance, or apply an existing model to a new set of patients. Computational Engine: • Cross-validation loop for performance estimation (N-fold CV, LOOCV); • Cross-validation loop for model selection (N-fold CV, LOOCV); • Normalization; • Gene selection: S2N one-versus-rest, S2N one-versus-one, non-parametric ANOVA, BW ratio, HITON_PC, HITON_MB; • Classification by MC-SVM: one-versus-rest, one-versus-one, DAGSVM, method by WW, method by CS; • Performance computation: accuracy, RCI, AUC ROC; • Report generator.

  15. User Interface

  16. Steps in the User Interface: task selection; dataset specification; cross-validation design; normalization; logging; performance metric; gene selection; classification; report generation; analysis execution.

  17. An Evaluation of the System: • Apply GEMS to datasets not involved in the algorithmic evaluations and compare the results with those obtained by human analysts and published in the literature; • Verify the generalizability of models produced by GEMS in cross-dataset applications. Statnikov A, Tsamardinos I, Aliferis CF. GEMS: a system for decision support and discovery from array gene expression data. International Journal of Medical Informatics, 2005 (in press).

  18. Evaluation Using New Datasets. [Table: the new datasets and a comparison with results from the literature.] Analyses were completed within 10-30 minutes with GEMS.

  19. Verify Generalizability of Models in Cross-Dataset Applications

  20. Live Demonstration of GEMS

  21. Scenario 1: Binary classification model development and evaluation using a lung cancer microarray gene expression dataset.

  22. Live Demo of GEMS (Scenario 1): Binary classification model development and evaluation. [Diagram: the user supplies the dataset and various parameters to GEMS; GEMS returns a classification model, a cross-validation performance estimate (AUC), and a list of genes.] Lung cancer dataset from Bhattacharjee, 2001. • Diagnostic task: lung cancer vs. normal tissues; • Microarray platform: Affymetrix U95A; • Number of oligonucleotides: 12,600*; • Number of patients: 203.
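What the scenario computes can be sketched as follows, with a synthetic matrix standing in for the Bhattacharjee data (not reproduced here): a cross-validated AUC for a binary cancer-vs-normal classifier.

```python
# Sketch of Scenario 1's output: cross-validated AUC for a binary task.
# Synthetic data stands in for the 203-patient, 12,600-probe matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=203, n_features=12600,
                           n_informative=30, random_state=0)
auc = cross_val_score(SVC(kernel="linear"), X, y, cv=10, scoring="roc_auc")
print("Cross-validated AUC: %.3f" % auc.mean())
```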

  23. Scenario 2: Multicategory classification model development and evaluation using a small round blue cell tumor microarray gene expression dataset.

  24. Live Demo of GEMS (Scenario 2): Multicategory classification model development and evaluation. [Diagram: the user supplies the dataset and various parameters to GEMS; GEMS returns a classification model, a cross-validation performance estimate (AUC), and a list of genes.] Small round blue cell tumor dataset from Khan, 2001. • Diagnostic task: Ewing sarcoma vs. rhabdomyosarcoma vs. Burkitt lymphoma vs. neuroblastoma; • Microarray platform: cDNA; • Number of probes: 2,308; • Number of patients: 63.
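For a multicategory task the AUC must be aggregated across classes; a common choice (assumed here, GEMS may aggregate differently) is one-versus-rest averaging. A sketch with synthetic data in place of the 63-patient SRBCT matrix:

```python
# Sketch: one-versus-rest AUC averaged over four classes (assumed metric).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=63, n_features=2308,
                           n_informative=20, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
clf = SVC(kernel="linear", probability=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
print("Mean one-vs-rest AUC: %.3f"
      % roc_auc_score(y, proba, multi_class="ovr"))
```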

  25. Scenario 3: Validating the reproducibility of genes selected in Scenario 1 using another lung cancer microarray gene expression dataset.

  26. Live Demo of GEMS (Scenario 3): Are the selected genes reproducible in another dataset? [Diagram: the user supplies the list of genes from Scenario 1 (Bhattacharjee's data), the new dataset, and various parameters to GEMS; GEMS returns a classification model and a cross-validation performance estimate (AUC).] Lung cancer dataset from Beer, 2002. • Diagnostic task: lung cancer vs. normal tissues; • Microarray platform: Affymetrix HuGeneFL; • Number of oligonucleotides: 7,129*; • Number of patients: 96.
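The reproducibility question reduces to comparing gene lists across datasets. A trivial sketch with hypothetical gene names (in practice the probes must first be mapped between the Affymetrix U95A and HuGeneFL platforms):

```python
# Hypothetical gene lists standing in for the actual GEMS selections.
genes_scenario1 = {"gene_A", "gene_B", "gene_C", "gene_D", "gene_E"}
genes_scenario3 = {"gene_B", "gene_C", "gene_E", "gene_F"}

overlap = genes_scenario1 & genes_scenario3
jaccard = len(overlap) / len(genes_scenario1 | genes_scenario3)
print("Shared genes:", sorted(overlap))
print("Jaccard overlap: %.2f" % jaccard)
```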

  27. Scenario 4: Verifying the generalizability of the classification model produced in Scenario 1 using another lung cancer microarray gene expression dataset.

  28. Live Demo of GEMS (Scenario 4): Is the constructed classification model generalizable to another microarray dataset? [Diagram: the user supplies the classification model from Scenario 1 (Bhattacharjee's data), the new dataset, and various parameters to GEMS; GEMS returns a performance estimate (AUC).] Lung cancer dataset from Beer, 2002. • Diagnostic task: lung cancer vs. normal tissues; • Microarray platform: Affymetrix HuGeneFL; • Number of oligonucleotides: 7,129*; • Number of patients: 96.
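Mechanically, the cross-dataset test fits (or loads) a model on one cohort and scores it unchanged on another. A sketch with one synthetic pool split to mimic the two cohorts (real probe matching between the two platforms is assumed away):

```python
# Sketch of the cross-dataset test: fit on one cohort, score frozen on another.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One synthetic pool split to mimic a 203-patient training cohort
# and a 96-patient external test cohort.
X, y = make_classification(n_samples=299, n_features=5000,
                           n_informative=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=96, random_state=0, stratify=y)

model = SVC(kernel="linear").fit(X_train, y_train)   # "Scenario 1" model
auc = roc_auc_score(y_test, model.decision_function(X_test))
print("Cross-dataset AUC: %.3f" % auc)
```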

  29. GEMS in a Nutshell • The system is fully automated, yet provides many optional features for the seasoned analyst. • The system is based on a nested cross-validation design that avoids overfitting. • GEMS's algorithms were chosen on the basis of the two extensive algorithmic evaluations described above. • After the system was built, it was validated in cross-dataset applications and on new datasets. • GEMS has an intuitive wizard-like user interface that abstracts the data analysis process. • GEMS has a convenient client-server architecture.
