Download
functions networks and phenotypes by integrative genomics analysis n.
Skip this Video
Loading SlideShow in 5 Seconds..
Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California PowerPoint Presentation
Download Presentation
Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California

Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California

172 Vues Download Presentation
Télécharger la présentation

Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Functions, Networks, and Phenotypes by Integrative Genomics Analysis Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California IMS, NUS, Singapore, July 18, 2007

  2. 2001 Human 30~100K genes 1995 Bacteria ~1.6K genes 1997 Eukaryote ~6K genes 1998 Animal ~20K genes Objective: $1000 human genome • Gene Expression Datasets: the Transcriptome • Oligonucleotide Array • cDNA Array Protein abundance measurement (Mass Spec) Protein interactions (yeast 2-hybrid system, protein arrays) Protein complexes (Mass Spec) Information Flow Cellular systems DNA RNA Protein

  3. Rapid accumulation of microarray data in public repositories • NCBI Gene Expression Omnibus • EBI Array Express 137231 experiments 55228 experiments The public microarray data increases by 3 folds per year

  4. Multiple Microarray Technology Platforms Microarray Platforms

  5. Coexpression Networks Recurrent Patterns Annotation Transcriptional Annotation f ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- f j TF j a h a h c c e e b h k b k e d i g d g i h e f i f g j j a a h h c c e g i e b b k k d i g d i g Functional Annotation f f j j a f a h h c c a e e b b k k b d d i g i g d f f j j a a h h c c e e Gene Ontology b b k k d d i g i g Graph-based Approach for the Integrative Microarray Analysis Datasets experiments gene

  6. Frequent Subgraph Mining Problem is hard! Problem formulation: Given n graphs, identify subgraphs which occur in at least m graphs (m  n) Our graphs are massive!(>10,000 nodes and >1 million edges) The traditional pattern growth approach (expand frequent subgraph of k edges to k+1 edges) would not work, since the time and memory requirements increase exponentially with increasing size of patterns and increasing number of networks.

  7. Novel Algorithms to identify diverse frequent network patterns • CoDense (Hu et al. ISMB 2005) • identify frequent coherent dense subgraphs across many massive graphs • Network Biclustering (Huang et al, ISMB 2007) • identify frequent subgraphs across many massive graphs • Network Modules (NeMo) (Yan et al. ISMB 2007) • identify frequent dense vertex sets across many massive graphs

  8. CODENSE: identify frequent coherent dense subgraphs across massive graphs Hu et al, ISMB 2005

  9. Identify frequent co-expression clusters across multiple microarray data sets f j a h c1 c2… cm g1 .1 .2… .2 g2 .4 .3… .4 … c e b k d g i f f f j j a j a h a c h c h c f j e e c1 c2… cm g1 .8 .6… .2 g2 .2 .3… .4 … e a c e h b b k b b k k k d i g d d i g i g d i g . . . . . . . . . c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … f j a h c e f b k j d i g a h c e f c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … j a b h c k e d b i k g d i g

  10. The common pattern growth approach Find a frequent subgraph of k edges, and expand it to k+1 edge to check occurrence frequency • Koyuturk M., Grama A. & Szpankowski W. An efficient algorithm for detecting frequent subgraphs in biological networks. ISMB 2004 • Yan, Zhou, and Han. Mining Closed Relational Graphs with Connectivity Constraints. ICDE 2005

  11. Problem of the Pattern-growth approach The time and memory requirements increase exponentially with increasing size of patterns and increasing number of networks. The number of frequent dense subgraphs is explosive when there are very large frequent dense subgraphs, e.g., subgraphs with hundreds of edges.

  12. Problem of the Pattern-growth approach Pattern Expansion k k+1 f f f f j j f a a j h h j a c a h c j h c a c h c e e e e e b b k b k b k b k d i k d i g g d i d i g g d i g f f j f j f a j a f j a c j a c h h c a c h h c e e h e e e b b k b k b k b k k d d i g i g d d i g i g d i g f f j j f f a j a j f h h c a j c a h c h a c h c e e e e e b b k b k b k k b g i g d k i d g i g d i d g i d f f f f j j f a j a j h h a j c a c h h a c c h c e e e e e b b k b k b k b k k d d d i g i d g i d g i g i g

  13. Our solution We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics: • All edges occur in >= k graphs (frequency) • All edges should exhibit correlated occurrences in the given graph set. (coherency) • The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes

  14. f f f a a a h h c c c h e e e b b b f d d d i i i g g g a h c G1 G2 G3 e b f f f d a a i g h h h c c c a summary graph Ĝ e e e b b b d d d i g i i g g G4 G5 G6 CODENSE: Mine coherent dense subgraph (1) Builds a summary graph by eliminating infrequent edges

  15. CODENSE: Mine coherent dense subgraph (2) Identify dense subgraphs of the summary graph f f a Step 2 h h c c e e b d MODES i i g g summary graph Ĝ Sub(Ĝ) Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse conclusion is not true.

  16. CODENSE: Mine coherent dense subgraph (3) Construct the edge occurrence profiles for each dense summary subgraph E G1 G2 G3 G4 G5 G6 f c-e 0 0 1 1 0 1 h c Step 3 c-f 0 1 0 1 1 1 e c-h 0 0 0 1 1 1 i g c-i 0 0 1 1 1 0 Sub(Ĝ) e-f 0 0 0 1 1 1 … … … … … … … edge occurrence profiles

  17. CODENSE: Mine coherent dense subgraph (4) builds a second-order graph for each dense summary subgraph g-h f-i E G1 G2 G3 G4 G5 G6 e-i h-i c-e 0 0 1 1 1 1 Step 4 c-f 0 1 0 1 1 1 g-i e-g c-h 0 0 0 1 1 1 e-h c-i 0 0 1 1 1 0 c-h f-h e-f 0 0 0 1 1 1 c-f e-f … … … … … … … c-i edge occurrence profiles c-e second-order graph S

  18. CODENSE: Mine coherent dense subgraph (5) Identify dense subgraphs of the second-order graph g-h g-h f-i e-i h-i e-i h-i Step 4 g-i e-g g-i e-g e-h e-h c-h f-h c-h f-h c-f e-f c-f e-f c-i c-e c-e second-order graph S Sub(S) Observation: if a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense.

  19. CODENSE: Mine coherent dense subgraph (6) Identify the coherent dense subgraphs g-h h e-i h-i e Step 5 i g g-i e-g e-h f h c c-h f-h e c-f e-f c-e Sub(G) Sub(S)

  20. Our solution The identified subgraphs by definition satisfy the three criteria: • All edges occur in >= k graphs (frequency) • All edges should exhibit correlated occurrences in the given graph set. (coherency) • The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes

  21. CODENSE: Mine coherent dense subgraph a f f f a a h h c c c h b e e e b b f f d d d i i i g g g a Step 2 Step 1 h h c c G1 G2 G3 e e b a f f f d MODES Add/Cut a a i i g g h h h c c c summary graph Ĝ b Sub(Ĝ) e e e b b d d d i g i i g g G4 G5 G6 Step 3 g-h g-h f-i h e-i e-i h-i h-i E G1 G2 G3 G4 G5 G6 e c-e 0 0 1 1 1 1 Step 6 Step 5 Step 4 i g g-i g-i e-g e-g c-f 0 1 0 1 1 1 c-h 0 0 0 1 1 1 e-h e-h f MODES c-i 0 0 1 1 1 0 c-h c-h Restore G and MODES f-h h c f-h e-f 0 0 0 1 1 1 c-f c-f e-f e e-f … … … … … … … c-i c-e c-e edge occurrence profiles second-order graph S Sub(G) Sub(S)

  22. CODENSE The design of CODENSE can solve the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs (the summary graph and the second-order graph) and performs clustering in these two graphs only. Thus, CODENSE can handle any large number of networks.

  23. j j j a a h f f h b h b Step 1 Step 2 V c i c i e i e g g g d d a f h f h Step 4 h b Step 3 V c i e i i e d MODES: Mine overlapped dense subgraph G Sub(G) G’ condense HCS’ HCS’ Sub(G’) HCS’ restore Hu et al. ISMB 2005

  24. Applying CoDense to 39 yeast microarray data sets f f j j a h a h c1 c2… cm g1 .1 .2… .2 g2 .4 .3… .4 … c c e e b b k k d i d g i g f f j j a e c1 c2… cm g1 .8 .6… .2 g2 .2 .3… .4 … a c c h h e b b k k d d i i g g c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … f f j j a a h c h c e e b b k k i d g d i g f f j c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … j a a h h c c e e b b k k d d i g i g

  25. Functional annotation Annotation

  26. Functional Annotation (Validation) Method: leave-one-out approach - masking a known gene to be unknown, and assign its function based on the other genes in the subgraph pattern. Functional categories: 166 functional categories at GO level at least 6 Results: 448 predictions with accuracy of 50%

  27. Functional Annotation (Prediction) We made functional predictions for 169 genes, covering a wide range of functional categories, e.g. amino acid biosynthesis, ATP biosynthesis, ribosome biogenesis, vitamin biosynthesis, etc. A significant number of our predictions can be supported by literature.

  28. However… • How about frequent non-dense graphs? • Many biological modules may form paths • How about subgraphs which are coherent across only a subset of the graphs? • Not all modules are activated across all conditions, and genes may form modules with diff. other genes under diff. conditions

  29. Network Biclustering:Identify frequent subgraphs across massive graphs Huang et al, ISMB 2007

  30. Using 65 human co-expression network as an illustration example • 65 co-expression networks generated from 65 microarray data sets • each graph contains 8297 genes, and 1%-10% edges of a complete graph

  31. Basically, it is a biclustering problem graphs edges

  32. Network Biclustering • Objective function graphs c’: number of 1 in the bicluster c: number of 1 in the whole matrix mn: size of the bicluster : regularization factor edges However, the matrix is very large with millions of edges … We will first identify robust seed to narrow down the search space

  33. Hence, each graph can be treated as a collection of items Thus, Frequent subgraph Mining can be modeled as frequent item set mining Problem: current frequent item set mining algorithms can only efficiently mine across many small item sets In our problem, we have 65 very large item set… We use a trick…. Identify Bicluster seed The property of relation graphs: edge labels are unique.

  34. Graph set with more than 5 members and with > 1000 common edges Frequent pattern tree common edges common edges common edges {G1, G3, G5, G6, G7,…} {G2, G3, G5, G7, G8,…} …. {G8, G9, G15, G26, G29} {e1, e10, e56, e100, e1000,…} {e4, e12, e33, e56, e890,…} …. {e99, e220, e1545, e2629,…} Identify Bicluster seed E G1 G2 G3 G4 G5 G6 ... G60 G61 G62 G63 G64 G65 e1 1 1 0 1 0 1 ... 0 0 1 1 1 1 e2 1 1 0 1 0 1 … 0 1 0 1 1 1 e3 1 1 1 1 1 1 … 0 0 0 1 1 1 e4 0 0 1 1 1 0 ... 0 0 1 1 1 0 Edge occurrence profiles: e5 1 1 0 1 0 1 … 0 0 0 1 1 1 … … … … … … … … … … … … … … Very time consuming! It takes more than 2 weeks on 40 Pentium IV nodes Huang et al. ISMB 2007

  35. Simulated Annealing E G1 G2 G3 G4 G5 G6 G7 … G61 G62 G63 G64 G65 e1 1 1 0 1 1 1 0 0 0 1 1 1 1 e2 1 1 1 1 1 0 1 0 1 0 1 1 1 e3 1 1 1 1 1 1 0 0 0 0 1 1 1 e4 1 1 1 1 1 1 .0 0 0 1 1 1 0 e5 1 0 0 1 0 1 1 0 0 0 1 1 1 … … … … … … … … … … … … … … Identify connected components Expanding the Biclusters E G1 G2 G3 G4 G5 G6 G7 … G61 G62 G63 G64 G65 e1 1 1 0 1 1 1 0 0 0 1 1 1 1 e2 1 1 1 1 1 0 1 0 1 0 1 1 1 e3 1 1 1 1 1 1 0 0 0 0 1 1 1 e4 1 1 1 1 1 1 .0 0 0 1 1 1 0 e5 1 0 0 1 0 1 1 0 0 0 1 1 1 … … … … … … … … … … … … … …

  36. Systematic identification of functional modules in human genome • We identified 143,400 network modules with recurrence >= 5. They vary in size from 4 to 180. • 77.0% of the patterns are functionally homogenous (GO hyper-geometric P-value less than 0.01) • The figure shows that the functional homogeneity of modules increase with their recurrences.

  37. Results (I): Examples of highly recurring modules Recurrence 20 Defense against oxygen and nitrogen species Recurrence 18 Involved in spermatogenesis

  38. Loosely connected network patterns with high recurrence can represent functional modules

  39. Functional annotation Annotation We made functional predictions for 779 known and 116 unknown genes by random forest classification with 71% accuracy. Variables for random forest classification: functional enrichment P-value network topology score network connectivity pattern recurrence numbers average node degree unknown gene ratio Network size

  40. Network Modules (NeMo) Identify frequent dense vertex sets across many massive graphs Yan et al. ISMB 2007

  41. 105 human microarray data sets NeMo 6477 recurrent coexpression clusters (density > 0.7 and support > 10) Validation based on ChIp-chip data (9176 target genes for 20 TFs) Validation based on human-mouse Conserved Transfac prediction (7720 target genes for 407 TFs) 12.5% homogenous clusters (vs. 3.3% by randomization test) 15.4% homogenous clusters (vs. 0.2% by randomization test)

  42. Percentage of potential transcription modules validated by ChIP-Chip data increase with cluster density and recurrence

  43. Expression data Phenotype information Phenotype Concepts (e.g. diseases, perturbations, tissues ) in Unified Medical Language System (UMLS) Microarray Data

  44. Classifying microarray data based on phenotype … Adenocarcinoma Arthritis Asthma Glaucoma HIV For example, the current NCBI GEO database contains >60cancer datasets, among which 11leukemia datasets.

  45. Cancer Cancer Cancer Cancer Frequent pattern mining Module 1, Module 2, Module 3, … Module k Identify phenotype-specific functional or transcriptional modules • Unsupervised approach Module 1

  46. A total of 459 of the clusters are statistically significant with a hypergeometric P-value <0.01 in a specific phenotype: e.g. malignant neoplasms, cardiovascular diseases or nervous system disorders. 105 microarray data sets NeMo 6477 recurrent coexpression clusters (density > 0.7 and support > 10)

  47. An example 5 out of the 9 support datasets are leukemia datasets (P-value 0.0039). It is potentially regulated by E2F4, and majority genes are involved in cell cycle and DNA repair.

  48. Non-Adenocarcinoma Functional and transcriptional modules which are active ONLY in Adenocarcinoma Related data sets Identify phenotype-specific functional or transcriptional modules • Supervised approach … Adenocarcinoma Arthritis Asthma Glaucoma HIV

  49. f j a h c e b k d g i f j e a c h b k d i g f j a h c e b k d i g f f f j j j a a a h h h c c c e e e b b b k k k d d d i i i g g g A case study: Identify Network Modules Characterizing Cancer 32 Cancer Datasets c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … 25 Non-cancer Datasets c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … … c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 …

  50. Examples of identified modules Cell adhesion across all solid tumor datasets PDGF-signaling in breast cancer datasets Cell cycle Module across all cancer datasets