Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower, 01-06-1011. Harvard School of Public Health, Department of Biostatistics.

Presentation Transcript


  1. Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 01-06-1011. Harvard School of Public Health, Department of Biostatistics.

  2. What tools enable biological discoveries? Our job is to create computational microscopes: To ask and answer specific biomedical questions using millions of experimental results

  3. Outline. 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

  4. A computational definition of functional genomics. Inputs: prior knowledge, genomic data. Mappings: Gene → Function; Gene → Gene; Data → Function; Function → Function.

  5. A framework for functional genomics. 100Ms of gene pairs, 1Ks of datasets. Example output: P(G2-G5|Data) = 0.85. [Figure: per-dataset frequency histograms over Low Correlation/High Correlation, Not let./Let., and Dissim./Similar (Low Similarity/High Similarity) bins, combined into one estimate.]
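
One way to read the posterior on this slide, P(G2-G5|Data) = 0.85, is as the result of combining per-dataset evidence with a naive Bayes rule. The sketch below illustrates that reading only; it is an assumption, not the talk's implementation. The function name `posterior_related`, the prior, and every likelihood value are hypothetical.

```python
def posterior_related(prior, likelihoods):
    """Posterior probability that a gene pair is functionally related.

    prior: P(related) before seeing any data (assumed value).
    likelihoods: one (P(obs | related), P(obs | unrelated)) pair per dataset,
    e.g. read off histograms of binned correlation or similarity scores.
    Datasets are treated as conditionally independent (naive Bayes assumption).
    """
    p_rel = prior
    p_unrel = 1.0 - prior
    for given_related, given_unrelated in likelihoods:
        p_rel *= given_related
        p_unrel *= given_unrelated
    return p_rel / (p_rel + p_unrel)


# Hypothetical evidence for one gene pair from three datasets.
evidence = [(0.6, 0.2), (0.5, 0.4), (0.7, 0.3)]
print(round(posterior_related(0.1, evidence), 3))
```

Because each dataset contributes only a pair of likelihoods, the cost per gene pair grows linearly with the number of datasets, which is one reason this kind of model is a plausible fit for 100Ms of pairs and 1Ks of datasets.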

  6. Functional network prediction and analysis. Global interaction network. HEFalMp: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. Networks shown: carbon metabolism network, extracellular signaling network, gut community network.

  7. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007). Simple regression: All datasets are equally accurate. Random effects: Variation within and among datasets and interactions.
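
The contrast between the two models can be sketched with a textbook meta-analysis of correlations. The code below is a generic illustration, assuming Fisher z-transformed correlations and the DerSimonian-Laird estimator of between-dataset variance; the slide does not say which estimator the cited work uses, and `combine_correlations` and the input values are hypothetical.

```python
import math


def combine_correlations(correlations, sizes):
    """Pool one gene pair's correlation across datasets.

    correlations: Pearson r for the pair in each dataset (|r| < 1).
    sizes: number of conditions in each dataset (each must exceed 3).
    Returns (fixed_effect_r, random_effects_r, tau2), where tau2 is the
    DerSimonian-Laird estimate of variance among datasets.
    """
    z = [math.atanh(r) for r in correlations]   # Fisher z-transform
    v = [1.0 / (n - 3) for n in sizes]          # variance within each dataset
    w = [1.0 / vi for vi in v]

    # Fixed effect: inverse-variance weights, no variance among datasets.
    fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

    # Heterogeneity statistic Q and the moment estimate of tau^2.
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, z))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0

    # Random effects: add tau^2 to every dataset's variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    return math.tanh(fixed), math.tanh(pooled), tau2


# Hypothetical pair: strongly correlated in one dataset, weakly in another.
print(combine_correlations([0.9, 0.1, 0.5], [50, 50, 10]))
```

When the datasets disagree, tau2 is positive and the random-effects weights become more even, so no single large dataset dominates the pooled estimate.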

  8. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007). [Figure: datasets combined, "= +".]

  9. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/

  10. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. [Figure annotations: X, ?] Graphle: http://huttenhower.sph.harvard.edu/graphle/

  11. Outline. 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

  12. What to do with your metagenome? Who’s there? What are they doing? What do functional genomic data tell us about microbiomes? What can our microbiomes tell us about us?* (x1010) A metagenome is a reservoir of gene and protein functional information; a comprehensive snapshot of microbial ecology and evolution; a public health tool monitoring population health and interactions; and a diagnostic or prognostic biomarker for host disease. *Using terabases of sequence and thousands of experimental results

  13. The Human Microbiome Project (2007 - ongoing). All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic resistant infection… • 300 “normal” adults, 18-40 • 16S rDNA + WGS • 5 sites/18 samples + blood • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth • Skin: ears, inner elbows • Nasal cavity • Gut: stool • Vagina: introitus, mid, fornix • Reference genomes (~200+800). Kolenbrander, 2010; Hamady, 2009.

  14. HMP Organisms: Everyone and everywhere is different. Every microbiome is surprisingly different. Even common organisms vary tremendously in abundance among individuals. There are few, if any, organismal biotypes in health. Most organisms are rare in most places. Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants. [Figure: organisms (taxa) by body sites + individuals; sites shown: gut, nose, mouth, arm, vagina, ear, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue.]

  15. HMP: Metabolic reconstruction. Data: 300 subjects, 1-3 visits/subject, ~6 body sites/visit, 10-200M reads/sample, 100bp reads. Functional seq.: KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS… Flow: WGS reads → Genes (KOs) via BLAST; Genes → Pathways (KEGGs) via MinPath (Ye 2009); Pathways/modules. Steps: Smoothing (Witten-Bell). Taxonomic limitation: Rem. paths in taxa < ave. ? Xipe: Distinguish zero/low (Rodriguez-Mueller in review). Gap filling: c(g) = max( c(g), median ).
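
The gap-filling rule c(g) = max( c(g), median ) can be sketched in a few lines. The slide does not say over what the median is taken; the sketch assumes it is the median abundance of the genes in one pathway. The name `gap_fill` and the gene labels and counts are hypothetical.

```python
from statistics import median


def gap_fill(counts):
    """Apply c(g) = max(c(g), median) to the genes of one pathway.

    counts: mapping from gene identifier to abundance within one pathway.
    Assumption: the median is taken over the genes of that same pathway.
    """
    m = median(counts.values())
    return {gene: max(c, m) for gene, c in counts.items()}


# Hypothetical pathway in which one gene has no reads assigned.
pathway = {"gene_a": 0.0, "gene_b": 4.0, "gene_c": 6.0, "gene_d": 5.0, "gene_e": 7.0}
filled = gap_fill(pathway)
print(filled)
```

Under this reading, a gene that is missing or very low (for example because of poor homology) is raised to the pathway's typical level, while genes at or above the median are left unchanged.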

  16. HMP: Metabolic reconstruction Pathway coverage Pathway abundance

  17. HMP: Metabolic reconstruction. Pathway abundance. [Figure: pathways by samples.]

  18. HMP: Metabolic reconstruction. Pathway coverage. [Figure: pathways by samples; groups marked: all body sites (“core”), aerobic body sites, gastrointestinal body sites.]

  19. Metagenomic biomarker discovery. Intervention/perturbation: Healthy/IBD, BMI, Diet. Data: Gene expression, Taxa & pathways, SNP genotypes, Niches & Phylogeny. Test for correlates. Confounds/stratification/environment: Batch effects? Population structure? Feature selection (p >> n). Multiple hypothesis correction. Cross-validate. Independent sample. Biological story?
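
The multiple hypothesis correction step can be illustrated with the Benjamini-Hochberg procedure. The slide does not name a specific method, so this choice is an assumption; `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from testing four taxa against a phenotype.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

With thousands of taxa and pathways tested per study (the p >> n setting on this slide), controlling the false discovery rate in this way is usually less conservative than a Bonferroni correction.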

  20. LEfSe: Metagenomic class comparison and explanation. LEfSe: LDA + Effect Size. Nicola Segata. http://huttenhower.sph.harvard.edu/lefse

  21. LEfSe: The TRUC murine colitis microbiota With Wendy Garrett

  22. MetaHIT: The gut microbiome and IBD. With Ramnik Xavier, Joshua Korzenik. 124 subjects: 99 healthy, 21 UC + 4 CD (Qin 2010). Taxa: Phymm (Brady 2009). WGS reads ReBLASTed against KEGG since published data obfuscates read counts. Genes (KOs); Pathways (KEGGs); Pathways/modules.

  23. MetaHIT: Taxonomic CD biomarkers Up in CD Down in CD Firmicutes UC Enterobacteriaceae

  24. MetaHIT: Functional CD biomarkers Subset of enriched pathways in CD patients Subset of enriched modules in CD patients Up in CD Down in CD Growth/replication Motility Transporters Sugar metabolism

  25. MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome. Up in CD; Down in CD. Enzyme families; Inferred metabolites. Growth/replication; Motility; Transporters; Sugar metabolism.

  26. Outline. 1. Data mining: Integrating very large genomic data compendia • Network framework for scalable data integration • HEFalMp: human data integration • Meta-analysis for unsupervised functional network integration. 2. Metagenomics: Modeling microbial communities for public health • HMP: microbiome in health, 18 body sites in 300 subjects • HUMAnN: metagenomic metabolic and functional pathway reconstruction • LEfSe: biologically relevant community differences

  27. Thanks! Human Microbiome Project George Weinstock Jennifer Wortman Owen White Makedonka Mitreva Erica Sodergren Vivien Bonazzi Jane Peterson Lita Proctor Sahar Abubucker Yuzhen Ye Beltran Rodriguez-Mueller Jeremy Zucker Qiandong Zeng Mathangi Thiagarajan Brandi Cantarel Maria Rivera Barbara Methe Bill Klimke Daniel Haft Dirk Gevers Jacques Izard Nicola Segata Pinaki Sarder Ramnik Xavier HMP Metabolic Reconstruction Wendy Garrett Sarah Fortune Bruce Birren Mark Daly Doyle Ward Eric Alm Ashlee Earl Lisa Cosimi Levi Waldron Larisa Miropolsky Interested? We’re recruiting students and postdocs! http://huttenhower.sph.harvard.edu http://huttenhower.sph.harvard.edu/sleipnir

  28. Functional network prediction from diverse microbial data 486 bacterial expression experiments 310 postprocessed datasets 304 normalized coexpression networks in 27 species 876 raw datasets 307 bacterial interaction experiments 114786 postprocessed interactions Integrated functional interaction networks in 15 species 154796 raw interactions E. coli Integration ← Precision ↑, Recall ↓

  29. Predicting gene function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  30. Predicting gene function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  31. Predicting gene function Predicted relationships between genes Low Confidence High Confidence These edges provide a measure of how likely a gene is to specifically participate in the process of interest. Cell cycle genes

  32. Comprehensive validation of computational predictions With David Hess, Amy Caudy Genomic data Prior knowledge Computational Predictions of Gene Function SPELL Hibbs et al 2007 bioPIXIE Myers et al 2005 MEFIT Retraining Genes predicted to function in mitochondrion organization and biogenesis New known functions for correctly predicted genes Laboratory Experiments Growth curves Petite frequency Confocal microscopy

  33. Evaluating the performance of computational predictions Genes involved in mitochondrion organization and biogenesis 106 Original GO Annotations 135 Under-annotations 82 Novel Confirmations, First Iteration 17 Novel Confirmations, Second Iteration 340 total: >3x previously known genes in ~5 person-months

  34. Evaluating the performance of computational predictions Genes involved in mitochondrion organization and biogenesis Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated. 106 Original GO Annotations 95 Under-annotations 40 Confirmed Under-annotations 80 Novel Confirmations First Iteration 17 Novel Confirmations Second Iteration 340 total: >3x previously known genes in ~5 person-months

  35. Functional mapping: mining integrated networks Predicted relationships between genes The strength of these relationships indicates how cohesive a process is. Low Confidence High Confidence Chemotaxis

  36. Functional mapping: mining integrated networks Predicted relationships between genes Low Confidence High Confidence Chemotaxis

  37. Functional mapping: mining integrated networks Predicted relationships between genes The strength of these relationships indicates how associated two processes are. Low Confidence High Confidence Chemotaxis Flagellar assembly
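
The two quantities on these slides, how cohesive a process is and how associated two processes are, can be sketched as averages over predicted edge weights. The slides say only that edge strength indicates these properties; using a plain mean, and the names `mean_within` and `mean_between`, are assumptions, and the toy network is hypothetical.

```python
from itertools import combinations, product


def mean_within(weights, genes):
    """Average predicted edge weight among the genes of one process."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(weights[frozenset(p)] for p in pairs) / len(pairs)


def mean_between(weights, genes_a, genes_b):
    """Average predicted edge weight between the genes of two processes."""
    pairs = [(a, b) for a, b in product(genes_a, genes_b) if a != b]
    return sum(weights[frozenset(p)] for p in pairs) / len(pairs)


# Hypothetical network: confidence for every pair of four genes.
weights = {
    frozenset({"g1", "g2"}): 0.9,
    frozenset({"g3", "g4"}): 0.8,
    frozenset({"g1", "g3"}): 0.2,
    frozenset({"g1", "g4"}): 0.1,
    frozenset({"g2", "g3"}): 0.3,
    frozenset({"g2", "g4"}): 0.2,
}
chemotaxis = {"g1", "g2"}   # illustrative gene sets, not real annotations
flagellar = {"g3", "g4"}
print(mean_within(weights, chemotaxis), mean_between(weights, chemotaxis, flagellar))
```

In practice such scores would be compared against a genomic background, as the later legend slides indicate with their "Baseline (genomic background)" category.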

  38. Functional mapping: Associations among processes. Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance. Legend: Edges: associations between processes (Moderately Strong to Very Strong).

  39. Functional mapping: Associations among processes. Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance. Legend: Edges: associations between processes (Moderately Strong to Very Strong). Borders: data coverage of processes (Sparsely Covered to Well Covered).

  40. Functional mapping: Associations among processes. Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance. Legend: Edges: associations between processes (Moderately Strong to Very Strong). Nodes: cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive). Borders: data coverage of processes (Sparsely Covered to Well Covered).

  41. Functional mapping: Associations among processes. Legend: Edges: associations between processes (Moderately Strong to Very Strong). Nodes: cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive). Borders: data coverage of processes (Sparsely Covered to Well Covered).