Constantin F. Aliferis M.D., Ph.D., FACMI

Introduction to translational and clinical bioinformatics: Connecting complex molecular information to clinically relevant decisions using molecular profiles
Alexander Statnikov, Ph.D.: Director, Computational Causal Discovery Laboratory; Assistant Professor, NYU Center for Health Informatics and Bioinformatics, General Internal Medicine
Constantin F. Aliferis, M.D., Ph.D., FACMI: Director, NYU Center for Health Informatics and Bioinformatics; Informatics Director, NYU Clinical and Translational Science Institute; Director, Molecular Signatures Laboratory; Associate Professor, Department of Pathology; Adjunct Associate Professor in Biostatistics and Biomedical Informatics, Vanderbilt University

Goals • Understand spectrum of Bioinformatics and Medical informatics activities • Understand basic concepts of clinical/translational Bioinformatics • Understand basic concepts of molecular profiling • Introduction to high-throughput assays enabling molecular profiling • Introduction to computational data analytics/bioinformatics enabling molecular profiling • Understand analytic challenges and pitfalls/interpretation issues • Discuss case study of profiles used to diagnose/treat patients • Perform hands-on development of a molecular profile, finding novel biomarkers and testing profile/marker accuracy Discussion supported by general literature and heavily grounded on: • NYUMC informatics experts/research projects/grants/papers/entities/software systems • Commercially available modalities & assays

Overview • Session #1: Basic Concepts • Session #2: High-throughput assay technologies • Session #3: Computational data analytics • Session #4: Case study / practical applications • Session #5: Hands-on computer lab exercise

Session #1: Basic Concepts • Understand spectrum of Bioinformatics and Medical informatics activities - NYUMC informatics • Understand basic concepts of clinical/translational Bioinformatics • Understand basic concepts of molecular profiling • ALSO: - emails/names/interests - adjustments to plan

NYU Center for Health Informatics & Bioinformatics: Broad Plan
Health Informatics
• Educational informatics
• Evidence based medicine, and Information retrieval Informatics
• Library Collaborative
• Data Integration & Mining: data warehouse & interfacing with EMR; Omics LIMS; Genomic EMR; biospecimen management; research protocol database systems and management team; data mining service; data mining software
Infrastructure & Integrative Methods/Activities
• CTSI
• High Performance Computing Facility
• Research labs: Kluger; Molecular Signatures; EBM, IR & Scientometrics; Computational Causal Discovery
• MS/PhD (& Post-doc Fellowship) Program
• Continuing Education
• Workshops & tutorials
• Paper digest
• Research Colloquium
• Invited Speakers
• Integrate/Focus Existing Informatics and Increase Collaborations
Bioinformatics
• BPIC (Best Practices Integrative Consultation Core/Service): literature synthesis & benchmarking studies; method-problem “matchmaking”; design and execution of studies
• Microarray Informatics: upstream; differential expression; pathway inference; molecular profiles
• Next-gen sequencing informatics: upstream analyses (i. ChIP-seq, ii. RNA-seq, iii. epigenetics, iv. microbiomics, v. microRNA studies, vi. CNV & splice variation studies, vii. digital RNA, viii. de novo sequencing & re-sequencing); downstream analyses
• Genetics-Genomics
• COEs
• Cancer Center
• Proteomics Informatics
• Multi-modal & Integrative studies

Current Capabilities: Areas (same capability map as the Broad Plan slide above)

Current & Future capabilities Health Informatics Educational informatics Content management, medical simulations Evidence based medicine, and Information retrieval Informatics • Filter Medline according to content and quality • Filter Web for health advice quality • Predict future citations of articles • Classify individual citations as instrumental or not • Identify special types of articles • Construct citation histories & Analyze impact of articles • Integrate and manage queries and related content • Combine and optimize knowledge source searches • New “find a researcher” • “Find a collaborator” Library Collaborative Apply, evaluate, refine next-gen IR methods • Data Integration & Mining: • Data warehouse & interfacing with EMR • Omics LIMS • Genomic EMR • Biospecimen management • research protocol database systems and management team • Data mining service • Data Mining software • Data warehouse needs; software acquisition; implementation • OMICS LIMS needs capture; vendor product assessment; funds; software purchase and implementation; integration with billing and EMR • Biospecimen management • Research protocol database system (eVelos) • Database management team • Data mining service • Data mining engine: faculty; funds; prototype; implementation; evaluation

Current & Future capabilities Infrastructure & Integrative Methods/Activities CTSI (supported by rest of objectives) High Performance Computing Facility Sequencing server; hectar1; hectar2; Funds; needs; grants; personnel post; specs; room/networking/access; Personnel hires; hw install; licenses; BP; launch • Research labs • Kluger • Molecular Signatures • EBM, IR & Scientometrics • Computational Causal Discovery • Kluger → TF/regulation studies; high-throughput outcome prediction, specialized clustering methods • Molecular Signatures → development of molecular signatures for diagnosis, outcome prediction and personalized medicine, discovery of diagnostic/imaging biomarkers and putative drug targets, deployment of signatures, automated software, new methods • EBM, IR & Scientometrics → development and evaluation of next-gen IR and scientometric models and studies • Computational Causal Discovery → discovery of pathways; studies of causal validity of bioinformatics discovery methods, multiplicity studies, automated software, active learning/experiment number minimization MS/PhD (& Post-doc Fellowship) Program Formal Training in Biomedical Informatics at pre and post-doctoral levels • Continuing Education • Workshops & tutorials • Paper digest • Research Colloquium • Invited Speakers Integrate/Focus Existing Informatics and Increase Collaborations Faculty and Staff career development; Informatics Affiliates; Working Collaborations with Courant, Polytechnic, NYC Informatics and other non-NYUMC entities

Current & Future capabilities Bioinformatics BPIC (Best Practices Integrative Consultation Core/Service) • Literature Synthesis & Benchmarking studies • Method-problem “matchmaking” • Design and execution of studies • Study publication assistance Area-specific (Disease, Assay) Informatics Microarray Informatics: Experiment design, assay execution, differential expression, pathway mapping, pathway-specific testing (GSEA/GSA), de novo pathway discovery, phylogeny, clustering, hybrid experimental/observational designs; SNP arrays; ChIP-on-chip analyses, aCGH, tiled arrays, etc… Genetics-Genomics COEs Sequencing Informatics: ChIP-seq analysis, digital gene expression, de novo sequence assembly & reassembly, CNV analysis, epigenomic studies, microbiomics Cancer Center Proteomics Informatics: platform-specific pre-processing, differential abundance, peptide-protein mapping, protein identification, de novo protein interaction network inference, protein modification and structure studies,… • Multi-modal Integrative and Higher-level Informatics: • Molecular Signatures & linking high-dimensional data to phenotype → development of molecular signatures for diagnosis, outcome prediction and personalized medicine; in silico signature scanning, in silico signature equivalence, discovery of diagnostic/imaging biomarkers and putative drug targets, deployment of signatures, automated software, novel methods • Mechanistic/causative studies → discovery of pathways; multiplicity studies, TS/DBN designs, automated software, active learning/experiment number minimization • Integrating clinical lab, text, imaging and high throughput data in CTs/prospective studies or exploratory retrospective ones

Summary Contacts (Until Centralized Consultation Service is Launched)
• Management of Clinical and protocol data → James Robinson
• Educational Informatics → Mark Triola
• Next-Gen Information Retrieval → Lawrence Fu, Constantin Aliferis, TBD
• Informatics for Data Mining → Alexander Statnikov, Constantin Aliferis
• Data Integration & Warehousing → John Chelico, Ross Smith, Constantin Aliferis
• High Performance Computing → Constantin Aliferis, Ross Smith
• Best Practices in Bioinformatics → Constantin Aliferis, Alexander Statnikov
• Sequencing Informatics, upstream → Stuart Brown, Alexander Alekseyenko, Yuval Kluger, Jinhua Wang, TBD, TBD
• Sequencing Informatics, downstream → Alexander Alekseyenko, Yuval Kluger, Jinhua Wang, Alexander Statnikov, Constantin Aliferis
• Microarray Informatics → Jiri Zavadil, Yuval Kluger, Jinhua Wang, Constantin Aliferis, Alexander Statnikov
• Cancer Informatics → Yuval Kluger, Jinhua Wang, Stuart Brown, Jiri Zavadil, Constantin Aliferis
• Proteomics Informatics → Stuart Brown, Jinhua Wang, Constantin Aliferis, Alexander Statnikov, TBD
• General Tools → Stuart Brown
• Specialized applications (Genetics, Regulation, Pathways…) → Stuart Brown, Yuval Kluger, Alexander Statnikov, Constantin Aliferis
• Molecular Signatures development, biomarker discovery, multi-modal and integrative studies → Constantin Aliferis, Alexander Statnikov, Yuval Kluger

Molecular Signatures Definition: computational or mathematical models that link high-dimensional molecular information to a phenotype of interest
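As a minimal sketch of this definition, a signature can be viewed as a function that maps a molecular profile to a phenotype estimate. The gene names, weights, intercept, and threshold below are hypothetical and chosen only for illustration; a real signature would be learned from patient data and could use a very different model family.

```python
import math

# Hypothetical signature: gene -> weight, plus an intercept (illustrative values only).
WEIGHTS = {"GENE_A": 1.5, "GENE_B": -2.0, "GENE_C": 0.5}
INTERCEPT = -0.25


def signature_score(profile):
    """Map an expression profile (gene -> value) to a score between 0 and 1."""
    linear = INTERCEPT + sum(w * profile.get(gene, 0.0) for gene, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-linear))  # logistic link


def predict_phenotype(profile, threshold=0.5):
    """Turn the score into a phenotype call using an assumed cut-off."""
    return "high risk" if signature_score(profile) >= threshold else "low risk"


patient = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 1.0}
print(round(signature_score(patient), 3), predict_phenotype(patient))
```

The point of the sketch is only the shape of the object: many molecular measurements go in, one clinically meaningful estimate comes out.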

Molecular Signatures Gene markers New drug targets

Molecular Signatures: Main Uses • Direct benefits: Models of disease phenotype/clinical outcome & estimation of the model performance • Diagnosis • Prognosis, long-term disease management • Personalized treatment (drug selection, titration) (“predictive” models) • Ancillary benefits 1: Biomarkers for diagnosis, or outcome prediction • Make the above tasks resource efficient, and easy to use in clinical practice • Helps next-generation molecular imaging • Leads for potential new drug candidates • Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types) • Leads for potential new drug candidates

Molecular Signatures The FDA calls them “in vitro diagnostic multivariate index assays” (IVDMIA)
1. “Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis”: • addresses device classification
2. “The Critical Path to New Medical Products”: • identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine
3. “Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers” & “Guidance for Industry: Pharmacogenomic Data Submissions”: • identifies 3 main goals (dose, ADEs, responders) • defines IVDMIA • encourages “fault-free” sharing of pharmacogenomic data • separates “probable” from “valid” biomarkers • focuses on genomics (and not other omics)

Less Conventional Uses of Molecular Signatures • Increased clinical trial sample efficiency, decreased costs, or both, using placebo-responder signatures; • In silico signature-based candidate drug screening; • Drug “resurrection”; • Establishing existence of biological signal in very small sample situations where univariate signals are too weak; • Assessing importance of markers and of the mechanisms involving them; • Choosing the right animal model; • …?

Recent molecular signatures available for patient care Agendia Clarient Prediction Sciences LabCorp University Genomics Genomic Health Veridex BioTheranostics Applied Genomics Power3 OvaSure Correlogic Systems

Molecular signatures in the market (examples)

MammaPrint • Developed by Agendia (www.agendia.com) • 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease • Independently validated in >1,000 patients • So far performed 12,000 tests • Cost of the test is $3,200 • In February, 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node negative women under 61 years of age with tumors of less than 5 cm. • TIME Magazine’s 2007 “medical invention of the year”.

CupPrint • Developed by Agendia (www.agendia.com) • ~500-gene (~1900 probes) signature to identify primary site of 49 different types of carcinomas as well as other types of cancer such as sarcoma and melanoma • Several independent validation studies

ColoPrint • In development & validation by Agendia (www.agendia.com) • Multi-gene expression signature to determine the risk for recurrence in colorectal cancer patients • Planning to seek FDA approval References: • http://cancergenetics.wordpress.com/category/coloprint/ • http://www.bioarraynews.com/issues/7_34/features/141935-1.html • http://life-science-ventures.com/downloads/PressreleaseColoPrintfinalJuly10th2007.pdf

Oncotype DX Development synopsis Main reference: Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004; 351(27):2817-26. • Developed by Genomic Health (www.genomichealth.com ) • 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse • Independently validated in >1,000 patients • So far performed 55,000 tests • Cost of the test is $3,650 • Reimbursement. Information about reimbursement for molecular signatures from Aetna: http://www.aetna.com/cpb/medical/data/300_399/0352.html • Oncotype DX did not undergo FDA review. Here is an article that mentions FDA review of Oncotype DX (slightly outdated): http://www.sciencemag.org/cgi/content/full/303/5665/1754 • The following paper shows the health benefits and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

CancerType ID • Developed by AviaraDX (www.aviaradx.com) • 92-gene signature to classify 39 tumor types • Signature developed by GA/KNN • “Compressed version” of CupPrint

Breast Cancer Index • Developed by AviaraDX (www.aviaradx.com) • Uses 7 genes (combines 5-gene MGI signature and 2-gene H/I signature) • Stratifies breast cancer patients into groups with low or high risk of cancer recurrence and good or poor response to endocrine therapy • Validated in thousands of patients (treated & untreated)

GeneSearch Breast Lymph Node (BLN) Assay • Developed by Veridex (www.veridex.com), a Johnson & Johnson company • Test to detect if breast cancer has spread to the lymph nodes • Uses real-time reverse transcriptase-polymerase chain reaction (RT-PCR) to detect mammaglobin (MG) and cytokeratin 19 (CK 19) in lymph nodes • FDA approved • Featured in TIME’s 2007 Top 10 Medical Breakthroughs list

MammoStrat • Developed by Applied Genomics (http://www.applied-genomics.com) • Based on 5 biomarkers • Used to classify individual patients as having an AGI-defined high, moderate, or low risk of breast cancer recurrence following surgical removal of their primary tumor and treatment with tamoxifen alone • Independently validated in >1,000 patients

NuroPro • Developed by Power3 (http://www.power3medical.com/) • Early detection of neurodegenerative diseases: Alzheimer’s disease, ALS (Lou Gehrig’s disease), and Parkinson’s disease • Validation study in progress • Based on 59 proteins

BC-SeraPro • Developed by Power3 (http://www.power3medical.com/) • Test for diagnosis of breast cancer (breast cancer case vs. control) • Validation study in progress • Based on 22 proteins • Uses linear discriminant analysis; outputs a probability score

Key ingredients for developing a molecular signature • Well-defined clinical problem & access to patients • High-throughput assays • Computational & Biostatistical Analysis → Molecular Signature

Challenges in Computational Analysis of omics data for development of molecular signatures • Relatively easy to develop a predictive model + even easier to believe that a model is good when it is not → false sense of security • Several problems exist: some theoretical and some practical • Omics data has many special characteristics and is tricky to analyze!

OvaCheck • Developed by Correlogic (www.correlogic.com) • Blood test for the early detection of epithelial ovarian cancer • Failed to obtain FDA approval • Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood • Signature developed by genetic algorithm • Significant artifacts in data collection & analysis called the validity of the signature into question: • Results are not reproducible • Data collected differently for different groups of patients Reference: http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

Problem with OvaCheck [Figure: panels A-F, from Baggerly et al. (Bioinformatics, 2004)]


Brief History of main “omics” technology: gene expression microarrays • 1988: Edwin Southern files UK patent applications for in situ synthesized, oligo-nucleotide microarrays • 1991: Stephen Fodor and colleagues publish photolithographic array fabrication method • 1992: Undeterred by NIH naysayers, Patrick Brown develops spotted arrays • 1993: Affymax begets Affymetrix • 1995: Mark Schena publishes first use of microarrays for gene expression analysis • Edwin Southern founds Oxford Gene Technology • 1996: First human gene expression microarray study published • Affymetrix releases its first catalog GeneChip microarray, for HIV, in April • 1997: Stanford researchers publish the first whole-genome microarray study, of yeast

Brief History of main “omics” technology: gene expression microarrays (The Scientist, 2005) • 1998: Brown's lab develops CLUSTER, a statistical tool for microarray data analysis; red and green "thermal plots" start popping up everywhere • 1999: Todd Golub and colleagues use microarrays to classify cancers, sparking widespread interest in clinical applications • 2000: Affymetrix spins off Perlegen, to sequence multiple human genomes and identify genetic variation using arrays • 2001: The Microarray Gene Expression Data Society develops MIAME standard for the collection and reporting of microarray data • 2003: Joseph DeRisi uses a microarray to identify the SARS virus • Affymetrix, Applied Biosystems, and Agilent Technologies individually array human genome on a single chip • 2004: Roche releases Amplichip CYP450, the first FDA-approved microarray for diagnostic purposes

An early kind of analysis: learning disease sub-types by clustering patient profiles [Figure: patient profiles; axes: p53, Rb]

Clustering: seeking ‘natural’ groupings & hoping that they will be useful… [Figure: patient profiles; axes: p53, Rb]

E.g., for treatment [Figure: axes: p53, Rb; labels: “Respond to treatment Tx1”, “Do not respond to treatment Tx1”]

E.g., for diagnosis [Figure: axes: p53, Rb; labels: “Adenocarcinoma”, “Squamous carcinoma”]

Another use of clustering • Cluster genes (instead of patients): • Genes that cluster together may belong to the same pathways • Genes that cluster apart may be unrelated

Unfortunately, clustering is a non-specific method and falls into the ‘one solution fits all’ trap when used for prediction [Figure: axes: p53, Rb; labels: “Do not respond to treatment Tx2”, “Respond to treatment Tx2”]
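The trap can be sketched with a small, fully synthetic example. The values below are hypothetical stand-ins for the two plot axes: the data form two tight groups along the first axis, so k-means recovers those groups, while the (assumed) response to Tx2 depends only on the second axis. The clustering never sees the phenotype, so the grouping it finds is no better than a coin flip for predicting it.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points: alternate assignment and centroid update."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids


# Hypothetical profiles: (first-axis value, second-axis value) per patient.
points = [(x, y) for x in (0.8, 1.0, 1.2, 4.8, 5.0, 5.2) for y in (1.0, 2.0, 3.0, 4.0)]
# Assumed ground truth: response to Tx2 depends only on the second axis.
responds_tx2 = [y > 2.5 for _, y in points]

clusters, _ = kmeans(points, centroids=[points[0], points[-1]])

# Agreement between cluster membership and response, under the better of the
# two possible cluster-to-label mappings.
matches = sum((c == 1) == r for c, r in zip(clusters, responds_tx2))
agreement = max(matches, len(points) - matches) / len(points)
print(agreement)
```

The same clusters might line up well with a different phenotype (such as response to Tx1 in the earlier slide), which is exactly why a single unsupervised grouping cannot serve every prediction task.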

Clustering is also non-specific when used to discover pathway membership, regulatory control, or other causation-oriented relationships. In this simple illustrative counter-example, it is entirely possible for G3 (a gene causally unrelated to the phenotype) to be more strongly associated with the phenotype (or its surrogate genes), and thus to cluster with it more tightly, than the true oncogenic genes G1 and G2. [Figure: graph over nodes G1, G2, Ph, G3]
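One way such a counter-example can arise is sketched below. The wiring is an assumption made for illustration, since the exact structure of the original figure is not recoverable from the text: the phenotype Ph appears when either oncogene G1 or G2 is active, and G3 is a passenger gene switched on by the same two genes without acting on Ph. All four on/off combinations of G1 and G2 are weighted equally.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Enumerate every on/off combination of the two causal genes.
states = list(product((0, 1), repeat=2))
g1 = [a for a, _ in states]
g2 = [b for _, b in states]
ph = [a | b for a, b in states]  # assumed: phenotype if either oncogene is active
g3 = [a | b for a, b in states]  # assumed: passenger gene driven by G1 or G2, no effect on Ph

print("G1 vs Ph:", round(pearson(g1, ph), 3))
print("G2 vs Ph:", round(pearson(g2, ph), 3))
print("G3 vs Ph:", round(pearson(g3, ph), 3))
```

Under these assumptions G3 tracks the phenotype more closely than either true cause does, so any association-based grouping would place G3 nearest to Ph even though intervening on G3 would change nothing.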

Two improved classes of methods • Supervised learning → predictive signatures and markers • Regulatory network reverse engineering → pathways

Supervised learning: use the known phenotypes (a.k.a. “labels”) in training data to build signatures or find markers highly specific for that phenotype
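A minimal sketch of the idea, using a nearest-centroid classifier as a stand-in for whatever supervised method is actually chosen. The two-gene profiles and the phenotype labels are hypothetical. Unlike clustering, the labels are used during training, so the model is built specifically for this phenotype; geometrically, the decision boundary is the line equidistant from the two class centroids.

```python
def fit_centroids(profiles, labels):
    """Average the training profiles of each labeled phenotype."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, profile):
    """Assign the phenotype whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda label: sum((v - c) ** 2 for v, c in zip(profile, centroids[label])))


# Hypothetical two-gene training profiles with known phenotype labels.
train_profiles = [(1.0, 4.0), (1.5, 3.5), (5.0, 3.8), (4.0, 1.0), (4.5, 1.5), (0.8, 1.2)]
train_labels = ["responder", "responder", "responder",
                "non-responder", "non-responder", "non-responder"]

centroids = fit_centroids(train_profiles, train_labels)
print(predict(centroids, (3.0, 3.5)))
print(predict(centroids, (3.0, 1.0)))
```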

Regulatory network reverse engineering

Supervised learning: a geometrical interpretation

In 2-D this looks good, but what happens in: • 10,000-50,000 dimensions (regular gene expression microarrays, aCGH, and early SNP arrays) • >500,000 (tiled microarrays, new SNP arrays) • 10,000-300,000 (regular MS proteomics) • >10,000,000 (LC-MS proteomics) This is the ‘curse of dimensionality’ problem

High-dimensionality (especially with small samples) causes: • Some methods do not run at all (classical regression) • Some methods give bad results (KNN, Decision trees) • Very slow analysis • Very expensive/cumbersome clinical application • Tends to “overfit”
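The tendency to overfit with many variables and few samples can be sketched with pure noise. In the simulation below (sample size, feature count, and seed are arbitrary choices for illustration), no feature carries any real signal, yet some of them separate the two classes perfectly just by chance. For 5 patients per class, a single noise feature does so with probability 2 / C(10, 5) = 1/126, so thousands of features are expected to yield dozens of spurious "markers".

```python
import random
from math import comb

random.seed(0)  # fixed seed so the run is repeatable

N_PER_CLASS = 5    # patients per class (small sample)
N_FEATURES = 5000  # measured variables, all pure noise here


def perfectly_separates(values_a, values_b):
    """True if one class lies entirely above the other on this feature."""
    return min(values_a) > max(values_b) or min(values_b) > max(values_a)


chance_markers = 0
for _ in range(N_FEATURES):
    class_a = [random.gauss(0.0, 1.0) for _ in range(N_PER_CLASS)]
    class_b = [random.gauss(0.0, 1.0) for _ in range(N_PER_CLASS)]
    if perfectly_separates(class_a, class_b):
        chance_markers += 1

# Expected number of chance separators: N_FEATURES * 2 / C(10, 5).
expected = N_FEATURES * 2 / comb(2 * N_PER_CLASS, N_PER_CLASS)
print("observed:", chance_markers, "expected about:", round(expected, 1))
```

Scaling the feature count up to the ranges listed earlier makes the number of chance separators grow proportionally, which is why apparent accuracy on the original data says little without evaluation on fresh data.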

Two (very real and very unpleasant) problems: Over-fitting & Under-fitting • Over-fitting (a model to your data) = building a model that is good in the original data but fails to generalize well to fresh data • Under-fitting (a model to your data) = building a model that is poor in both the original data and fresh data

Intuitive explanation of overfitting & underfitting • Play the game: find rule to predict who are the instructors in any given class (use today’s class to find a general rule)

Over/under-fitting are directly related to the complexity of the decision surface and how well the training data is fit [Figure: Outcome of Interest Y vs. Predictor X, showing training data and future data; annotations: “This line is good!”, “This line overfits!”]
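The link between model complexity and over/under-fitting can be sketched on a tiny synthetic problem. Everything below is an assumption made for illustration: one predictor x, a true rule "outcome is 1 when x >= 0.5", and a few deliberately flipped labels standing in for measurement noise. Three models of decreasing complexity are compared on the training data and on fresh data.

```python
def true_label(x):
    return 1 if x >= 0.5 else 0


def flip(labels, indices):
    """Simulate noise by flipping the labels at the given positions."""
    return [1 - y if i in indices else y for i, y in enumerate(labels)]


train_x = [i / 20 for i in range(20)]
fresh_x = [x + 0.02 for x in train_x]
train_y = flip([true_label(x) for x in train_x], {3, 7})
fresh_y = flip([true_label(x) for x in fresh_x], {12, 16})


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def overfit_model(x):
    """1-nearest neighbour: memorises every training label, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# A single threshold chosen to minimise training error.
best_t = max(train_x, key=lambda t: accuracy(lambda x: int(x >= t), train_x, train_y))


def good_model(x):
    return int(x >= best_t)


majority = int(sum(train_y) * 2 >= len(train_y))


def underfit_model(x):
    """Ignores x entirely and always predicts the majority training label."""
    return majority


for name, model in [("overfit", overfit_model), ("good", good_model),
                    ("underfit", underfit_model)]:
    print(name, accuracy(model, train_x, train_y), accuracy(model, fresh_x, fresh_y))
```

The memorising model is perfect on the training data and loses accuracy on fresh data, the constant model is poor on both, and the single-threshold model performs the same on both, mirroring the two definitions given earlier.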
