
Managing Gene Annotation Information the search is over … one problem solved … another begins

Observations from a foot soldier in the bio-information (r)evolution. Bill Farmerie, ICBR Genomics Group, Interdisciplinary Center for Biotechnology Research.


Presentation Transcript


  1. Managing Gene Annotation Information: the search is over … one problem solved … another begins. Observations from a foot soldier in the bio-information (r)evolution. Bill Farmerie -- ICBR Genomics Group

  2. Interdisciplinary Center for Biotechnology Research • Established at the University of Florida in 1987 by the Florida Legislature • centralized organization of biomedical core facilities • supporting biotechnology-based research • How did information management become my problem?

  3. 1998 GSAC Miami Beach

  4. Why should I care about this problem? • Because my paycheck depends on it. • Avoid fatal failure in the funding loop. Funding loop diagram labels: PI has $ for large gene-based project; Other PIs think this looks like a good idea; PI applies for new funding; Core Lab generates data; Downstream data management & analysis; PI writes papers, gives talks

  5. From Sequence to Function • The genomic sequence identifies the 'parts' • The next trick is understanding gene function • Post-genomic era = functional genomics • Critical concept: genes of similar sequence may have similar functions • Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

  6. BLAST • Most common starting point for gene identification • Similarity search of sequence repository (GenBank) • Output • Calculated scores (bit score and e-value) • Text string (definition line), ID Reference Tag • Sequence alignment • Advantages • Fast algorithm, very good at finding close homologs • Disadvantages • Not good at finding distant relatives • Cluster and Grid-enabled versions available
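
The two calculated scores on the BLAST slide are related by the standard Karlin-Altschul statistic, E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the database length. The sketch below is only an illustration of that relationship: the function name is hypothetical, and real BLAST uses edge-corrected "effective" lengths, so reported E-values will differ somewhat from this estimate.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate BLAST E-value from a bit score.

    Uses E = m * n * 2**(-S'), with raw (not effective) lengths,
    so this is an estimate rather than what BLAST would print.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Illustrative numbers only: a 1,000-residue query against a 1e9-residue database.
    for score in (30.0, 40.0, 50.0, 100.0):
        print(f"bit score {score:5.1f} -> E ~ {expected_hits(score, 1000, 10**9):.3g}")
```

Each additional bit of score halves the expected number of chance hits, which is why filtering on either the bit score or the E-value gives similar rankings within a single search.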

  7. HMMER • HMMER developed by Sean Eddy • Uses Hidden Markov Models • Searches unknown protein query sequence against a database of protein family models • Statistical models constructed from alignment of conserved protein regions (Pfam) • Advantages • Superior to BLAST for discovering more distant homology relations • Disadvantages • More computationally intensive than BLAST • GRID enabled

  8. OK! Great! Sequencing done. Homology searches complete. But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?

  9. Search for summarizing information that restores sanity CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG

  10. BlastQuest A small idea with a big mission

  11. BlastQuest Requirements • Accessible to research groups at remote locations • Privacy-constrained sharing of results among the scientists • Selective browsing of BLAST homology search results • Selective data filtering on statistical criteria • E-value or bit score • Selective data grouping on criteria such as GI number, or a defined number of top-scoring results • Ad hoc search capability on user-determined criteria: • text terms • Boolean logic. From a computational point of view, BlastQuest is embarrassingly simple. However, it solved our problems of information storage, selective retrieval, and distribution.
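
The filtering, grouping and ad hoc search requirements listed on this slide can be sketched in a few lines. This is not BlastQuest's actual code or schema, which the slides do not show: the `Hit` record, the function names, and the sample rows (including the GI strings) are all invented placeholders for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit; field names are hypothetical, not BlastQuest's schema."""
    query_id: str
    gi: str
    definition: str
    evalue: float
    bit_score: float


def filter_hits(hits, max_evalue=1e-5, min_bit_score=0.0):
    """Keep hits that pass both statistical cutoffs."""
    return [h for h in hits if h.evalue <= max_evalue and h.bit_score >= min_bit_score]


def top_n_per_query(hits, n=3):
    """Group hits by query and keep the n highest bit scores in each group."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[h.query_id].append(h)
    return {q: sorted(hs, key=lambda h: h.bit_score, reverse=True)[:n]
            for q, hs in grouped.items()}


def text_search(hits, all_of=(), none_of=()):
    """Boolean search of definition lines: every all_of term AND NOT any none_of term."""
    def matches(h):
        text = h.definition.lower()
        return (all(t.lower() in text for t in all_of)
                and not any(t.lower() in text for t in none_of))
    return [h for h in hits if matches(h)]


# Invented sample rows, for illustration only.
HITS = [
    Hit("contig1", "gi|0001", "heat shock protein 70 [Example organism]", 1e-80, 300.0),
    Hit("contig1", "gi|0002", "hypothetical protein", 1e-3, 40.0),
    Hit("contig2", "gi|0003", "putative heat shock protein, partial", 1e-20, 95.0),
    Hit("contig2", "gi|0004", "DNA polymerase III subunit", 1e-50, 200.0),
]

if __name__ == "__main__":
    print(len(filter_hits(HITS, max_evalue=1e-10)))
    print({q: [h.gi for h in hs] for q, hs in top_n_per_query(HITS, n=1).items()})
    print([h.gi for h in text_search(HITS, all_of=("heat shock",), none_of=("putative",))])
```

In a shared, multi-project setting these operations would more plausibly be database queries than in-memory list comprehensions; the sketch only shows the selection logic.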

  12. Overview of BlastQuest Architecture

  13. Welcome to BlastQuest

  14. Choose among client projects

  15. Results Selection

  16. Grouped Results

  17. Ad Hoc Text Searching

  18. Internal BLAST Searches

  19. Viewing a Gene Ontology Tree

  20. Viewing a Gene Ontology Tree

  21. Viewing a Gene Ontology Tree

  22. KEGG Classification • Kyoto Encyclopedia of Genes and Genomes • “Wiring diagrams of life” • KEGG Protein Networks • Metabolic pathways • Regulatory pathways • Molecular complexes • Network-network relations • Network-environment relations

  23. Common to both Unique to non-Unigene Unique to Unigene

  24. Bacterial Genome Annotation Workbench Another simple idea driven by necessity

  25. Start

  26. Project Summary

  27. Contig Browser

  28. Contig summary

  29. Physical map linked to annotation

  30. Simple problems. Simple solutions. Why are these simple ideas important?

  31. Human Genome Project • HGP drove innovation in biotechnology • 2 major technological benefits • stimulated development of high-throughput methods • reliance on computational tools for data mining and visualization of biological information

  32. The HGP and the cost of DNA sequencing • “finished” quality DNA sequence • a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000 • also known as phred > 40 • contiguous DNA sequence of phred > 40 usually achieved by multifold sequencing of the same region; typically 7-10X coverage • 1985: $10 per finished base • 2001: $1 per 10 finished bases
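
The "1 in 10,000" and "phred 40" figures on this slide are the same number under the standard phred definition, Q = −10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion, with hypothetical function names, plus the fold change implied by the two costs quoted on the slide:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a phred quality score: P = 10**(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def cost_fold_reduction(old_cost_per_base: float, new_cost_per_base: float) -> float:
    """How many times cheaper the new cost per base is than the old one."""
    return old_cost_per_base / new_cost_per_base


if __name__ == "__main__":
    print(phred_to_error_prob(40))        # error probability at phred 40
    print(error_prob_to_phred(1 / 10_000))  # phred score at 1 error in 10,000
    # Slide figures: $10 per finished base (1985) vs. $1 per 10 finished bases (2001).
    print(cost_fold_reduction(10.0, 1.0 / 10.0))
```

A score above 40 therefore corresponds to an error probability below 1 in 10,000, matching the slide's definition of "finished" sequence.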

  33. GenBank, August 22, 2005: Public Collections of DNA and RNA Sequence Reach 100 Gigabases

  34. Trends in the cost efficiency of DNA sequencing§ §Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: methods and goals” Nature Reviews Genetics 5:335

  35. 454 Life Sciences Corporation The first commercial, massively parallel, DNA sequencing technology

  36. 454 Technology • Cyclic-array sequencing on in vitro amplified DNA molecules • Individual molecules must be amplified to give a detectable sequencing signal • Instead of biological cloning, we amplify individual DNA fragments on solid-state beads using PCR • Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order • “sequencing by synthesis”

  37. 454 Process Overview

  38. The bottom line … • efficiency of DNA sequencing increased 100X • cost per finished base declined 10- to 30-fold … so what happens next? • The “democratization” of large-scale genomic biology • Many projects are now possible that were once fiscally inviable • We must deal with basic local data management and information issues or lose this opportunity

  39. If you thought bioinformatics was important before … • By terminator-based sequencing we at UF produce 60-70 Mbp per year • By synthesis-based sequencing we produce 60-70 Mbp per day