
Omics data integration & mining

BK21 BT·IT Integrationist Program • Omics data integration & mining • The Sixth Sino-Japan-Korea Bioinformatics Training Course, Shanghai, March 27-30, 2007 • 2007. 3. 29 • Sangsoo Kim & KOBIC Omics Team


Presentation Transcript


  1. BK21 BT·IT Integrationist Program Omics data integration & mining The Sixth Sino-Japan-Korea Bioinformatics Training Course Shanghai, March 27-30, 2007 2007. 3. 29 Sangsoo Kim & KOBIC Omics Team

  2. What is the goal of Biosciences? • Ultimately, the complete understanding of life phenomena • Complex organization • Regulatory mechanism (homeostasis) • Growth & development • Energy utilization • Response to environmental stimuli • Reproduction (DNA guarantees exact replication) • Evolution (capacity of species to change over time)

  3. Spider Silk: Stronger than Steel • Life’s diversity results from the variety of molecules in cells • A spider’s web-building skill depends on its DNA molecules • DNA also determines the structure of silk proteins • These make a spiderweb strong and resilient

  4. The capture strand contains a single coiled silk fiber coated with a sticky fluid • The coiled fiber unwinds to capture prey and then recoils rapidly • Figure labels: coiled fiber of silk protein; coating of capture strand

  5. Evidence from flagelliform silk cDNA for the structural basis of elasticity and modular nature of spider silks. J Mol Biol. 1998 Feb 6;275(5):773-84 • The authors report the cloning of substantial cDNA for flagelliform gland silk protein, which forms the core fiber of the catching spiral • The dominant repeat of this protein is Gly-Pro-Gly-Gly-X, which can appear up to 63 times in tandem arrays • They propose that a spring-like helix is the basis for the elasticity of silk

  6. Central dogma of molecular biology: DNA → RNA → protein

  7. Paradigm Shift in Biosciences • So far, biologists have focused on certain phenotypes and hunted for the genes responsible, one at a time • The new trend is to • Catalog all the parts: genes and proteins • Understand how each part works • Model & simulate the collective behavior of the parts • Slide labels: Genomics & Proteomics; Functional Genomics; Systems Biology

  8. Central dogma of molecular biology: DNA → RNA → protein • Central dogma of bioinformatics and genomics: genome → transcriptome → proteome

  9. [Figure: base pairs of DNA (billions) and sequences (millions) by year, 1982-2002]

  10. With $1,000 genome sequencing technologies expected within 10 years, coupled with functional data, we need better IT solutions!

  11. Proliferation of Genomics • Explosion of data • Human genes: 25,000 • Human genome: 3 × 10^9 bp • DNA-protein or protein-protein interactions could increase data dramatically • Chimpanzee, mouse, rat, dog, cow, chicken, insects, worms, plants, fungi, algae, bacteria, archaea, viruses …
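To give a rough sense of scale for the figures on this slide, here is a small back-of-the-envelope calculation. The 2-bits-per-base encoding and the idea of counting every unordered gene pair as a candidate interaction are assumptions made for illustration, not numbers from the talk.

```python
import math

HUMAN_GENES = 25_000      # gene count quoted on the slide
GENOME_BP = 3 * 10**9     # genome size quoted on the slide

# Assumption: 2 bits per base (A, C, G, T), ignoring ambiguity codes and metadata.
genome_bytes = GENOME_BP * 2 // 8

# Assumption: every unordered pair of genes is a candidate interaction.
candidate_pairs = math.comb(HUMAN_GENES, 2)

print(genome_bytes)     # 750000000 bytes, i.e. about 750 MB
print(candidate_pairs)  # 312487500 candidate pairs
```

Even under these simple assumptions, the number of candidate pairwise interactions is four orders of magnitude larger than the number of genes, which is why interaction data can grow so quickly.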

  12. Genome Projects (385 finished) as of June 4, 2006 • Ongoing projects: 608 eukaryotes, 989 prokaryotes

  13. Top ten challenges for bioinformatics [1] Precise models of where and when transcription will occur in a genome (initiation and termination) [2] Precise, predictive models of alternative RNA splicing [3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli [4] Determining protein:DNA, protein:RNA, protein:protein recognition codes [5] Accurate ab initio protein structure prediction

  14. Top ten challenges for bioinformatics [6] Rational design of small molecule inhibitors of proteins [7] Mechanistic understanding of protein evolution [8] Mechanistic understanding of speciation [9] Development of effective gene ontologies: systematic ways to describe gene and protein function [10] Education: development of bioinformatics curricula Source: Ewan Birney, Chris Burge, Jim Fickett

  15. Functional Genomics & Systems Biology • New data types: • Sequences • Structures • High-throughput expression profiles in matrix form (10,000 × 100) • Interactions, Pathways, Networks • Mathematical modeling & simulation of biological processes • Algorithms • Graphical visualization
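As a minimal sketch of what working with an expression matrix looks like, the snippet below stores a tiny genes × conditions matrix and ranks genes by variance, a common first filtering step. The gene identifiers, values, and the helper name `most_variable_genes` are hypothetical; a real matrix would have on the order of 10,000 rows and 100 columns.

```python
from statistics import pvariance

def most_variable_genes(gene_ids, matrix, top_n):
    """Return the top_n gene ids ranked by variance across conditions."""
    ranked = sorted(
        zip(gene_ids, matrix),
        key=lambda pair: pvariance(pair[1]),
        reverse=True,
    )
    return [gene for gene, _ in ranked[:top_n]]

# Hypothetical toy matrix: one row per gene, one column per condition.
gene_ids = ["gA", "gB", "gC"]
matrix = [
    [1.0, 1.0, 1.0, 1.0],    # flat profile, variance 0
    [0.0, 10.0, 0.0, 10.0],  # strongly varying profile
    [1.0, 2.0, 3.0, 4.0],    # mildly varying profile
]

top = most_variable_genes(gene_ids, matrix, top_n=2)
print(top)  # ['gB', 'gC']
```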

  16. [Figure labels: 18C, 19C, 20C. Credit: K-JIST]

  17. Terminology • DNA: Genome, Genomics • RNA: Transcriptome, Transcriptomics • Protein: Proteome, Proteomics • Metabolite: Metabolome, Metabolomics • More than 50 -omes, including the “Unknownome” (credit: K-JIST)

  18. Omics data • In the Omics era, we see a proliferation of genome- and proteome-wide high-throughput data available in public archives • Comparative genome sequences • Sequence variation & phenotypes • Epigenetics & chromatin structure • Regulatory elements & gene expression • Protein expression, modification & localization • Protein domain, structure, interaction • Metabolic, signal, regulatory pathways • Drug, toxicogenomics, toxicoproteomics

  19. Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  20. Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  21. Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  22. Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  23. Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  24. As an example, • Suppose you are interested in how well CDK2 transcriptional control is conserved; you may need • Orthologs in various model organisms • Genome alignments of promoter regions among phylogenetic cousins • Among mammals or vertebrates • Among yeast subspecies • A Transfac-type TF binding database • ChIP-chip data for each organism • An orthology map of the TFs, and so on • You may add proteome and interactome data • Only some of these are available at NCBI • The rest are available in the public domain as supplementary materials or at the authors’ web sites
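The kind of query described in this example can be sketched as a join across three small tables: the ortholog of the gene in each species, the TFs bound at each ortholog, and an orthology map of the TFs. All identifiers and values below are hypothetical placeholders, and the function name `conserved_regulators` is illustrative rather than part of any existing tool.

```python
def conserved_regulators(gene_by_species, bound_tfs, tf_group):
    """Return TF ortholog groups bound in every species, plus the shared fraction."""
    per_species = []
    for species, gene in gene_by_species.items():
        tfs = bound_tfs.get((species, gene), set())
        # Translate species-specific TF names into shared ortholog groups.
        groups = {tf_group[(species, tf)] for tf in tfs if (species, tf) in tf_group}
        per_species.append(groups)
    if not per_species:
        return set(), 0.0
    shared = set.intersection(*per_species)
    union = set.union(*per_species)
    fraction = len(shared) / len(union) if union else 0.0
    return shared, fraction

# Hypothetical toy inputs (not real binding data).
gene_by_species = {"human": "CDK2", "mouse": "Cdk2", "yeast": "yeast_ortholog"}
bound_tfs = {
    ("human", "CDK2"): {"hTF1", "hTF2", "hTF3"},
    ("mouse", "Cdk2"): {"mTF1", "mTF2"},
    ("yeast", "yeast_ortholog"): {"yTF1"},
}
tf_group = {
    ("human", "hTF1"): "G1", ("human", "hTF2"): "G2", ("human", "hTF3"): "G3",
    ("mouse", "mTF1"): "G1", ("mouse", "mTF2"): "G2",
    ("yeast", "yTF1"): "G1",
}

shared, fraction = conserved_regulators(gene_by_species, bound_tfs, tf_group)
print(shared)  # {'G1'}
```

In practice each of the three inputs comes from a different source (ortholog databases, ChIP-chip supplements, TF orthology maps), which is exactly why integration is the hard part.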

  25. Integration of Omics data • Systematic mining • Cross-knowledge domain validation • Cross-species interpolation • Generation of hypotheses that can be tested • Biologically very interesting queries • Requires cross-functional knowledge • The way to go

  26. Organization of data

  27. Where to look • Nature provides an omics section • www.nature.com/omics • Science • Cell • PLoS Biology • Genes & Development • Stem Cell • Relevant articles (PubMed, Google Scholar)

  28. ENCyclopedia Of DNA Elements (ENCODE) funded by NHGRI

  29. NHGRI Current Topics in Genome Analysis 2006

  30. NHGRI Current Topics in Genome Analysis 2006

  31. ENCODE: genomes to sequence

  32. Phase 1 of ENCODE • NHGRI’s ENCODE project generates such data at a pilot scale • The data are deposited and integrated into the UCSC Genome Browser • It offers data mining capability via the Table Browser • There are no ‘biological links’ among the 3,000+ tables (Ensembl’s BioMart is more ‘biological’) • It is up to users to decide how to combine the tables • It is limited to genomic coordinates and is not intended for proteome work
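Because the tables share only genomic coordinates, combining them usually comes down to an interval-overlap join. The following is a minimal sketch, assuming each table has been exported as rows of (chromosome, start, end, name) with 0-based, half-open coordinates as in BED files. The table contents and the function name `overlap_join` are hypothetical.

```python
from collections import defaultdict

def overlap_join(table_a, table_b):
    """Pair the names of rows from two tables whose intervals overlap."""
    by_chrom = defaultdict(list)
    for row in table_b:
        by_chrom[row[0]].append(row)
    pairs = []
    for chrom, start, end, name in table_a:
        for _, b_start, b_end, b_name in by_chrom[chrom]:
            # Half-open intervals overlap when each starts before the other ends.
            if start < b_end and b_start < end:
                pairs.append((name, b_name))
    return pairs

# Hypothetical rows standing in for two exported tables.
conserved = [
    ("chr7", 1000, 3000, "cons1"),
    ("chr7", 5000, 5500, "cons2"),
    ("chr1", 100, 200, "cons3"),
]
pol2_sites = [
    ("chr7", 2500, 2800, "pol2_peak1"),
    ("chr7", 5500, 6000, "pol2_peak2"),  # adjacent to cons2, not overlapping
    ("chr2", 100, 200, "pol2_peak3"),
]

hits = overlap_join(conserved, pol2_sites)
print(hits)  # [('cons1', 'pol2_peak1')]
```

This nested scan is fine for small tables; for genome-scale tables a sorted sweep or an interval tree would be the usual choice.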

  33. ENCODE Data Integrated in UCSC Genome Browser

  34. A ~2kb conserved, transcribable, Ac-histone, pol2-binding element in the 1st intron of ST7

  35. Turned out to be a pseudogene!

  36. And also duplicated in other parts of the genome!

  37. Omics Dataset Example

  38. Application Examples • Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857

  39. Protein-DNA Interaction & Transcriptomics • Yeast rich medium gene modules network • ChIP-chip location and expression data • 106 modules containing 655 genes regulated by 68 TFs
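The idea of combining location (ChIP-chip) and expression data into gene modules can be illustrated with a deliberately simplified sketch: group genes bound by the same set of TFs, then keep a group only if its members are co-expressed. This is not the algorithm behind the 106 modules cited on the slide; the threshold, the data, and the function names are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def simple_modules(binding, expression, min_r=0.8):
    """Group genes sharing a bound TF set; keep groups whose pairs all correlate."""
    groups = {}
    for gene, tfs in binding.items():
        if tfs:
            groups.setdefault(frozenset(tfs), []).append(gene)
    modules = {}
    for tfs, genes in groups.items():
        if len(genes) < 2:
            continue
        rs = [pearson(expression[a], expression[b]) for a, b in combinations(genes, 2)]
        if min(rs) >= min_r:
            modules[tfs] = sorted(genes)
    return modules

# Hypothetical toy data.
binding = {
    "g1": {"TF_A"}, "g2": {"TF_A"},
    "g3": {"TF_A", "TF_B"}, "g4": {"TF_A", "TF_B"},
    "g5": set(),
}
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0], "g2": [2.0, 4.0, 6.0, 9.0],
    "g3": [1.0, 2.0, 3.0, 4.0], "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [1.0, 1.0, 2.0, 2.0],
}

modules = simple_modules(binding, expression)
print(modules)  # {frozenset({'TF_A'}): ['g1', 'g2']}
```

Genes g3 and g4 share the same TFs but are anti-correlated, so they do not form a module, which shows why binding data alone is not sufficient.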

  40. Protein-DNA Interaction & Transcriptomics

  41. Predicting Protein-Protein Interaction by combining multiple datasets

  42. Predicting Protein-Protein Interaction by combining multiple datasets

  43. Predicting Protein-Protein Interaction by combining multiple datasets
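The slides give only the title here, so the following is one possible illustration rather than the method shown in the talk. A common way to combine several weak evidence sources is a naive Bayes scheme: assuming the sources are conditionally independent, the prior odds of an interaction are multiplied by one likelihood ratio per data set. All numbers and evidence names below are hypothetical.

```python
def combined_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each evidence likelihood ratio (naive Bayes)."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds

def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)

# Hypothetical values for one candidate protein pair.
prior_odds = 1 / 100
evidence = {
    "coexpression": 5.0,
    "shared_localization": 4.0,
    "two_hybrid_hit": 10.0,
}

posterior_odds = combined_odds(prior_odds, evidence.values())
posterior = odds_to_probability(posterior_odds)
print(round(posterior, 3))  # 0.667
```

No single data set here would push the pair past even odds, but together they do, which is the basic argument for integrating multiple data sets.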

  44. How to participate • Domain knowledge group • Monitoring papers and websites of relevant data • Collect the omics data and transform into common formats • Develop hypotheses & mining strategies • Data integration group • Develop DB schema • Integration with bio-matrix & bio-engine • Querying biological concepts • Graphic visualization

  45. Practice Session - Cytoscape • Installation • One of the most widely used and broadly accessible software packages designed to facilitate omics data integration and analysis • Tutorials • Interaction network display • Expression analysis • Literature searching
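For the interaction network display step, a network needs to be in a format Cytoscape can import. A minimal sketch, assuming the simple interaction format (SIF), where each line is `source`, interaction type, `target` separated by tabs. The node names are hypothetical, and `pd` / `pp` are used here as conventional labels for protein-DNA and protein-protein edges.

```python
def to_sif(edges):
    """Serialize (source, interaction_type, target) triples as tab-delimited SIF text."""
    return "".join(f"{source}\t{kind}\t{target}\n" for source, kind, target in edges)

# Hypothetical edges: a TF binding two genes, and one protein-protein interaction.
edges = [
    ("TF_A", "pd", "g1"),
    ("TF_A", "pd", "g2"),
    ("g1", "pp", "g2"),
]

sif_text = to_sif(edges)
print(sif_text, end="")
```

Writing `sif_text` to a file with a `.sif` extension gives something that can be loaded through Cytoscape's network import dialog; expression values would be loaded separately as node attributes.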
