Discovering Combinatorial Biomarkers Vipin Kumar kumar@cs.umn.edu http://www.cs.umn.edu/~kumar

Discovering Combinatorial Biomarkers
Vipin Kumar, kumar@cs.umn.edu, http://www.cs.umn.edu/~kumar
Department of Computer Science and Engineering
ICCABS, Feb 2012

High-throughput technologies
Data types: clinical data (e.g. brain imaging), gene expression & non-coding RNA, SNP, metabolites, structural variation, proteins, DNA methylation (adapted from E. Schadt)
• Data mining offers a potential solution for the analysis of these large-scale datasets
• Novel associations between genotypes and phenotypes
• Biomarker discovery for complex diseases
• Personalized Medicine – Automated analysis of patients' history for customized treatment

Biomarker Discovery and its Impact
Biomarkers:
• Genes: BRCA1 (breast cancer) [Miki et al. 1994]
• Protein variants: IVS5-13insC (type 2 diabetes) [Chiefari et al. 2011]
• Pathways/networks: P53 (cancers) [Oren et al. 2010]
• fMRI: schizophrenia vs. controls [Lim et al.]
Clinical impact: diagnosis, prognosis, treatment

SNP as an illustration
Published Genome-wide Associations through 06/2010: 1,904 published GWA at p ≤ 5×10^-8 for 165 traits
NHGRI GWA Catalog, www.genome.gov/GWAStudies

SNP as an illustration
Published Genome-wide Associations through 06/2011: 1,449 published GWA at p ≤ 5×10^-8 for 237 traits
50% increase in one year
NHGRI GWA Catalog, www.genome.gov/GWAStudies

Challenge: Limitations of Single-locus Association Test
• High coverage but low odds ratio (1.2)
• High odds ratio (15.9) but low coverage (7%)
• No significant associations
• Many other studies

An Example where Single-locus Test Led to No Significant Associations
Myeloma Survival Data [Van Ness et al. 2008]
• Given a SNP data set of myeloma patients, find SNPs that are associated with short vs. long survival
• 3404 SNPs selected from various regions of the chromosome
• 70 cases (patients who survived less than 1 year)
• 73 controls (patients who survived longer than 3 years)
Top-ranked SNP: -log10(P-value) = 3.8; odds ratio = 3.7
Myeloma SNP data has signal → need to discover combinations of SNPs

Single-locus Tests Ignore Genetic Interaction
Non-additive effect: “genetic interaction”
Ripke et al. 2011
Extensively observed in model organisms, e.g. yeast, C. elegans, fly
Costanzo et al. 2010; Scholl et al. 2009; Ruzankina et al. 2009; Kamath, 2003

The focus of this talk: Higher-order Combinatorial Biomarker
Complex biological system, complex human diseases
Higher-order genetic buffering
A synthetic pattern (disease vs. control): triple mutations only exist in disease subjects

Discovering High-order Combinatorial Biomarkers
Challenge I: Computational Efficiency
Given n features, there are 2^n candidates! How to effectively handle the combinatorial search space?
• Brute-force search, e.g. MDR, can only handle 10~100 SNPs [Ritchie et al. 2001]
• The Apriori framework for efficient search of exponential space [Agrawal et al. 1994]: millions of users, thousands of items
• Support-based pruning: once an itemset is disqualified, prune all of its supersets
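
The support-based pruning idea can be illustrated with a minimal sketch of the generic Apriori procedure. This is a textbook-style version, not the talk's own algorithm; the function name `apriori`, the toy transactions, and the threshold are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset whose count
    reaches min_support. Generic Apriori sketch (illustrative only)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Support count = number of transactions containing the candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: s for c, s in count(items).items() if s >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are
        # frequent, so every superset of a disqualified itemset is skipped.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: five "transactions" over items A, B, C.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent = apriori(transactions, min_support=3)
```

With this toy input, the three single items and the three pairs reach the threshold, while the triple {A, B, C} occurs only twice and is dropped.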

Discovering High-order Combinatorial Biomarkers
Challenge I: Computational Efficiency
• Traditional Apriori-based pattern mining techniques
  • Designed for sparse data
• Unique challenges of genomic datasets
  • High density
    • A SNP dataset has a density of 33.33%
    • Three binary columns per SNP → the three genotypes
  • High dimensionality
    • Makes the search more challenging
  • Disease heterogeneity
    • Each combination supported by a small fraction of subjects
A novel anti-monotonic objective function designed for mining low-support discriminative patterns from dense and high-dimensional data [Fang et al. TKDE 2010]
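
The 33.33% density follows directly from the encoding: each SNP becomes three binary columns and every subject has exactly one of them set. The sketch below shows this, assuming genotypes are coded 0/1/2 with no missing values; the function names and the toy genotypes are hypothetical.

```python
def encode_genotypes(genotypes):
    """Expand each SNP into three binary columns, one per genotype.

    genotypes: list of subjects, each a list of genotype codes 0, 1 or 2
    (an assumed coding; missing values are not handled in this sketch).
    """
    rows = []
    for subject in genotypes:
        row = []
        for g in subject:
            row.extend(1 if g == code else 0 for code in (0, 1, 2))
        rows.append(row)
    return rows


def density(matrix):
    """Fraction of cells in the binary matrix that are 1."""
    total_cells = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / total_cells


# Hypothetical toy data: two subjects, three SNPs.
genotypes = [[0, 1, 2], [2, 2, 0]]
binary = encode_genotypes(genotypes)
snp_density = density(binary)
```

Because exactly one of every three columns is 1 per subject, the density is one third regardless of the genotype frequencies, far denser than typical market-basket data.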

Discovering High-order Combinatorial Biomarkers
Challenge II: Statistical Power
• Computational challenges can be addressed by
  • Better algorithm design, e.g. Apriori-based
  • High-performance computing
• Statistical challenges call for additional efforts
  • Limited sample size
  • Huge number of hypothesis tests
Many combinations are trivial extensions of their subsets
[Figure panels: Myeloma Survival Data, Kidney Rejection Data, Lung Cancer Data; subsets having lower association vs. subsets having higher association]
Targeting patterns with better association than their subsets reduces the number of hypothesis tests
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness, Kumar, PLoS ONE, 2012]
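
One way to read "better association than their subsets" is as a gap between a pattern's association score and the best score among its proper subsets. The sketch below uses the difference in support between cases and controls as the association score; that choice, the function names, and the toy data are assumptions for illustration and may differ from the measure used in the cited paper.

```python
from itertools import combinations


def support(pattern, subjects):
    """Fraction of subjects that carry every feature in the pattern."""
    return sum(1 for s in subjects if pattern <= s) / len(subjects)


def diff_sup(pattern, cases, controls):
    """Assumed association score: support in cases minus support in controls."""
    return support(pattern, cases) - support(pattern, controls)


def jump(pattern, cases, controls):
    """Score of the pattern minus the best score of any proper, non-empty subset.

    Expects a pattern with at least two features.
    """
    pattern = frozenset(pattern)
    best_subset = max(
        diff_sup(frozenset(sub), cases, controls)
        for k in range(1, len(pattern))
        for sub in combinations(pattern, k)
    )
    return diff_sup(pattern, cases, controls) - best_subset


# Hypothetical toy data: each subject is the set of features it carries.
cases = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
controls = [{"a"}, {"b"}, {"a"}, {"b"}]
pair_jump = jump({"a", "b"}, cases, controls)
```

Here each single feature has a score of 0.25 and the pair has 0.5, so the pair shows a positive gap of 0.25. Testing only patterns with a positive gap skips combinations that merely extend an already associated subset.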

High-order Combinatorial Biomarkers: an example
Lung Cancer Data (data from Church et al. 2010)
[Figure labels: Size-5, Best size-4, Best size-3, Best size-2, Best size-1, Jump; Patients vs. Control, all heavy smokers]
The five genes are functionally related (www.ingenuity.com)
[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

Insights on High-order Functional Interactions
Patterns with positive Jump are functionally more coherent
[Figure labels, lung cancer dataset: Size-5, Best size-4, Best size-3, Best size-2, Best size-1, Jump; Lung cancer vs. Control]
[Figure panels: Kidney Rejection Data, Combined, Lung Cancer Data]
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

High-order Combinations Discovered from Different Types of Data
• mRNA: breast cancer (data from Vijver et al. 2002)
• SNP: acute kidney rejection (data from Oetting et al. 2008)
• Metabolites: COPD (data from Wendt et al. 2010)
[Figure labels: AE COPD vs. Stable COPD; Rejection vs. No-rejection; Survived (5-year) vs. Control]
The proposed framework is general enough to handle different types of data
[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

Biomarker Discovery using Error-tolerant Patterns
[Figure: example binary data matrices, marked X and √]
• True patterns are fragmented due to noise and variability
• Possible solution: error-tolerant patterns
• These patterns differ in the way errors/noise in the data are tolerated
• [Yang et al. 2001]; [Pei et al. 2001]; [Seppanen et al. 2004]; [Liu et al. 2006]; [Cheng et al. 2006]; [Gupta et al., KDD 2008]; [Poernomo et al. 2009]
See Gupta et al. KDD 2008 for a survey
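
To make the idea of tolerating errors concrete, here is a minimal sketch of one simple relaxation: a row supports a set of columns if it is missing at most a given fraction of them. The cited methods each define tolerance differently, so this is only an illustrative variant; the function name, the parameter `max_missing_fraction`, and the toy matrix are assumptions.

```python
def tolerant_support(rows, columns, max_missing_fraction):
    """Indices of rows that contain all but a tolerated fraction of the columns.

    rows: list of 0/1 lists; columns: list of column indices.
    With max_missing_fraction = 0.0 this reduces to exact (traditional) support.
    """
    allowed = int(max_missing_fraction * len(columns))
    return [
        i for i, row in enumerate(rows)
        if sum(1 for c in columns if row[c] == 0) <= allowed
    ]


# Hypothetical toy matrix: rows 1 and 2 each lost a single 1 to noise.
rows = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
columns = [0, 1, 2, 3]
exact = tolerant_support(rows, columns, 0.0)
tolerant = tolerant_support(rows, columns, 0.25)
```

Exact matching keeps only the first row, while allowing one missing value in four recovers the three rows that plausibly belong to the same underlying pattern.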

Error-tolerant pattern vs. Traditional association patterns
• Four breast cancer gene-expression data sets are used for experiments: GSE1456 + GSE7390 + GSE6532 + GSE3494 (158 cases, 433 controls)
• Cases: patients with metastasis within 5 years of follow-up
• Controls: patients with no metastasis within 8 years of follow-up
• Discriminative (case/control) error-tolerant and traditional association patterns are discovered and evaluated by enrichment analysis using MSigDB gene sets
Error-tolerant patterns vs. traditional patterns:
• A greater fraction of error-tolerant patterns enrich at least one gene set (higher precision)
• A greater fraction of gene sets are enriched by at least one error-tolerant pattern (higher recall)
Gupta et al. BICoB 2010; Gupta et al. BMC Bioinformatics 2011
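
Enrichment of a pattern's genes in a gene set is commonly scored with a hypergeometric tail probability. The slides do not say which test was used, so the sketch below is an assumption, and the function name and all numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(universe, gene_set, pattern, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    universe: number of genes measured
    gene_set: number of those genes in the annotated gene set
    pattern:  number of genes in the discovered pattern
    overlap:  number of pattern genes that fall in the gene set
    """
    upper = min(gene_set, pattern)
    total = comb(universe, pattern)
    tail = sum(
        comb(gene_set, k) * comb(universe - gene_set, pattern - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 100 genes measured, a 10-gene set, a 5-gene pattern.
p_any = enrichment_p_value(100, 10, 5, 0)   # overlap of 0 or more
p_full = enrichment_p_value(100, 10, 5, 5)  # all five pattern genes in the set
```

In practice the p-value would also be corrected for the number of gene sets tested, for example with a Bonferroni or false discovery rate procedure.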

Differential Coexpression Patterns
• Differential Expression (DE)
  • Traditional analysis targets changes of expression level
• Differential Coexpression (DC)
  • Changes of the coherence of gene expression
• Combinatorial search
• Genetic heterogeneity
  • Calls for subspace analysis
[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al. 2009] etc.
[Eisen et al. 1999], [Golub et al., 1999], [Pan 2002], [Cui and Churchill, 2003] etc.
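
The contrast between DE and DC can be shown with a small sketch that scores the coherence of a gene set as its mean pairwise Pearson correlation and compares the two classes. This is one simple coherence measure chosen for illustration, not necessarily the one used in the cited work; the function names and toy expression values are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_coexpression(expr):
    """Mean pairwise correlation; expr maps gene -> expression values per sample."""
    pairs = list(combinations(expr.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def differential_coexpression(expr_a, expr_b):
    """Coherence of the gene set in class A minus its coherence in class B."""
    return mean_coexpression(expr_a) - mean_coexpression(expr_b)


# Hypothetical toy data: both genes have the same mean in both classes
# (no differential expression), but they only move together in the controls.
controls = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 4]}
cases = {"g1": [1, 2, 3, 4], "g2": [4, 1, 3, 2]}
dc = differential_coexpression(controls, cases)
```

A per-gene test on expression level would see nothing here, since the means are identical across classes, while the coherence score drops from 1.0 in controls to -0.4 in cases.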

Subspace Differential Coexpression Analysis
• Three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]
[Figure labels: ≈ 60%, ≈ 10%]
Enriched with the TNF-α/NFkB signaling pathway (6/10 overlap with the pathway, corrected p-value: 1.4×10^-3)
Suggests that dysregulation of the TNF-α/NFkB pathway may be related to lung cancer
Selected for highlight talk, RECOMB SB 2010; Best Network Model award, Sage Congress, 2010
[Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]

Combinatorial Biomarkers: Summary
• Higher-order combinations
  • Important for understanding complex human diseases
• A novel framework
  • Improved computational efficiency
  • Enhanced statistical power
  • Naturally handles disease heterogeneity
  • Error-tolerance
  • Different types of differentiation: coexpression
• General enough to handle different types of data
  • SNP
  • Gene expression
  • Metabolomic data
  • Brain imaging data (e.g. fMRI)

References
• G. Fang, R. Kuang, G. Pandey, M. Steinbach, C.L. Myers, and V. Kumar. Subspace differential coexpression analysis: problem definition and a general approach. Pacific Symposium on Biocomputing, 15:145-156, 2010.
• G. Fang, G. Pandey, W. Wang, M. Gupta, M. Steinbach, and V. Kumar. Mining low-support discriminative patterns from dense and high-dimensional data. IEEE TKDE, 24(2):279-294, 2012.
• G. Fang, M. Haznadar, W. Wang, H. Yu, M. Steinbach, T. Church, W. Oetting, B. Van Ness, and V. Kumar. High-order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions. PLoS ONE, in press, 2012.
• R. Gupta, N. Rao, and V. Kumar. Discovery of error-tolerant biclusters from noisy gene expression data. BMC Bioinformatics, 12(S12):S1, 2011.
• R. Gupta, S. Agrawal, N. Rao, Z. Tian, R. Kuang, and V. Kumar. Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining. In Proc. of the International Conference on Bioinformatics and Computational Biology (BICoB), 2010.
• G. Atluri, R. Gupta, G. Fang, G. Pandey, M. Steinbach, and V. Kumar. Association Analysis Techniques for Bioinformatics Problems. Proceedings of the 1st International Conference on Bioinformatics and Computational Biology (BICoB), pp. 1-13, 2009.
• S. Landman, V. Kumar, M. Steinbach, and H. Yu. Identification of Co-occurring Insertions in Cancer Genomes Using Association Analysis. International Journal of Data Mining and Bioinformatics, in press, 2012.
• M. Steinbach, H. Yu, G. Fang, and V. Kumar. Using constraints to generate and explore higher order discriminative patterns. Advances in Knowledge Discovery and Data Mining, pages 338-350, 2011.
• S. Dey, G. Atluri, M. Steinbach, A. MacDonald, K. Lim, and V. Kumar. A pattern mining based integrative framework for biomarker discovery. Tech report, Department of Computer Science, University of Minnesota, (002), 2012.
• G. Pandey, C. Myers, and V. Kumar. Incorporating functional inter-relationships into protein function prediction algorithms. BMC Bioinformatics, 10(1):142, 2009.
• G. Pandey, B. Zhang, A.N. Chang, C.L. Myers, J. Zhu, V. Kumar, and E.E. Schadt. An integrative multi-network and multi-classifier approach to predict genetic interactions. PLoS Computational Biology, 6(9):e1000928, 2010 (cited as one of the major computational biology breakthroughs of 2010 by a Nature Biotechnology feature article).
• J. Bellay, G. Atluri, T.L. Sing, K. Toufighi, M. Costanzo, P.S.M. Ribeiro, G. Pandey, J. Baller, B. VanderSluis, M. Michaut, et al. Putting genetic interactions in context through a global modular decomposition. Genome Research, 21(8):1375-1387, 2011.

Acknowledgement
• Kumar Lab, Data Mining: Gang Fang, Wen Wang, Vanja Paunic, Yi Yang, Benjamin Oatley, Xiaoye Liu, Sanjoy Dey, Gowtham Atluri, Gaurav Pandey, Michael Steinbach
• Myers Lab, FuncGenomics: Jeremy Bellay, Chad Myers
• Van Ness Lab, Myeloma: Brian Van Ness
• Lim Lab, Brain Imaging: Kelvin Lim
• Kuang Lab, Compbio: TaeHyun Hwang, Rui Kuang
• MacDonald Lab, Behavior: Angus MacDonald
• Masonic Cancer Center: Tim Church, Bill Oetting
• Wendt Lab, Lung Disease: Chris Wendt
Mayo Clinic-IBM-UMR fellowship, Walter Barnes Lang fellowship, NSF: #IIS0916439, UMII seed grant, BICB seed grant. Computations enabled by the Minnesota Supercomputing Institute. BioMedical Genomics Center at University of Minnesota, International Myeloma Foundation, Etiology and Early Marker Study program of the Prostate Lung Colorectal and Ovarian Cancer Screening Trial.

Thanks!
