
National Center for Biotechnology Information

National Center for Biotechnology Information. A Field Guide to GenBank and NCBI’s Molecular Biology Resources. University of Colorado Health Sciences Center. August 30, 2005. Topics: About NCBI; GenBank overview; Primary vs derivative databases; The Reference Sequence (RefSeq) project.


Presentation Transcript


  1. National Center for Biotechnology Information
     A Field Guide to GenBank and NCBI’s Molecular Biology Resources
     University of Colorado Health Sciences Center
     August 30, 2005

  2. Topics
     • About NCBI
     • GenBank overview
     • Primary vs derivative databases
     • The Reference Sequence (RefSeq) project
     • Entrez databases
     • Genome resources
     • Bookshelf
     -break-
     • Entrez text searching
     • BLAST sequence searching
     • VAST structure searching
     • An integrated example

  3. The National Institutes of Health, Bethesda, MD

  4. The National Center for Biotechnology Information
     • Accepts submissions of primary data
     • Develops tools to analyze these data
     • Creates derivative databases based on the primary data
     • Provides free search, link, and retrieval of these data, primarily through the Entrez system

  5. NCBI WWW Users per Day

  6. Number of Users Per Day
     [Figure: users per day, 1997 to 2003, with the Christmas & New Year period annotated]

  7. Homepage - accessing the data all[filter]

  8. all[filter] 1/11/2005 3/15/2005 8/15/2005

  9. Entrez Nucleotide: number of records
     Primary Data
     • GenBank / DDBJ / EMBL: 57.3 million (97.4%)
     Derivative Data
     • RefSeq: 1.47 million (2.5%)
     • RefSeq reviewed: 60,000
     • PDB (structures): 5,973
     “Total”: 59 million
     GenBank

  10. GenBank: NCBI’s Primary Sequence Database
      Release 149, August 2005
      • 47 x 10^6 records
      • 52 x 10^9 nucleotides
      • 195 gigabytes
      • 816 files
      Over 100 billion bases!
      • full release every two months
      • incremental and cumulative updates daily
      • available only through the internet
      • release notes: gbrel.txt
      ftp://ftp.ncbi.nih.gov/genbank/
      ftp://genbank.sdsc.edu/pub
      ftp://bio-mirror.net/biomirror/genbank
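The Release 149 figures above give a rough sense of scale. The short calculation below is only a back-of-the-envelope sketch: it divides the totals quoted on the slide to get an average record length and an average file size, and the variable names are illustrative, not NCBI terminology.

```python
# Release 149 totals as quoted on the slide (August 2005).
records = 47 * 10**6          # sequence records
nucleotides = 52 * 10**9      # total bases
gigabytes = 195               # size of the full release
files = 816                   # number of release files

# Averages derived from those totals; these are not figures from the slide.
mean_record_length = nucleotides / records
mean_file_size_gb = gigabytes / files

print(f"mean record length: {mean_record_length:.0f} nt")
print(f"mean file size: {mean_file_size_gb:.3f} GB")
```

An average of roughly a thousand bases per record is consistent with a database dominated by short single-read entries such as ESTs.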

  11. What is GenBank?
      • Nucleotide-only sequence database
      • Archival in nature
      • GenBank Data
        • Direct submissions (traditional records)
        • Batch submissions (EST, GSS, STS)
        • ftp accounts (genome data)
      • Three collaborating databases
        • GenBank
        • DNA Database of Japan (DDBJ)
        • European Molecular Biology Laboratory (EMBL) Database

  12. GenBank Divisions
      “Organismal”
        PRI (28)   Primate
        ROD (15)   Rodent
        PLN (13)   Plant and Fungal
        BCT (11)   Bacterial/Archaeal
        INV (7)    Invertebrate
        VRT (7)    Other Vertebrate
        VRL (4)    Viral
        MAM (2)    Mammalian
        PHG (1)    Phage
        SYN (1)    Synthetic
        UNA (1)    Unannotated
      • Organized by taxonomy (sort of)
      • Direct submissions (Sequin/BankIt)
      • Accurate (~1 error per 10,000 bp)
      • Well characterized
      “Functional”
        EST (377)  Expressed Sequence Tag
        GSS (138)  Genome Survey Sequence
        HTG (63)   High Throughput Genomic
        PAT (17)   Patent
        STS (9)    Sequence Tagged Site
        CON (1)    Contigs, virtual
      • Organized by sequence type
      • Batch submissions (ftp/email)
      • Inaccurate
      • Poorly characterized

  13. GenBank Functional (Bulk) Divisions
      • Expressed Sequence Tag (EST)
        • 1st pass single read cDNA
      • Genome Survey Sequence (GSS)
        • 1st pass single read gDNA
      • High Throughput Genomic (HTG)
        • incomplete sequences of genomic clones
      • Sequence Tagged Site (STS)
        • PCR-based mapping reagents
      Whole Genome Shotgun
      [Diagram labels: EST, GSS, HTG, STS, GenBank]

  14. EST Division: Expressed Sequence Tags
      [Diagram labels: nucleus; 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; 5’ and 3’ ends]
      • isolate unique clones
      • sequence once from each end
      Example reads: gatccantgccatacg, ctcgccaattcnntcg

      >IMAGE:275615 5' mRNA sequence
      GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
      TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
      TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
      GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
      TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
      AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
      TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

      >IMAGE:275615 3', mRNA sequence
      NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
      TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
      AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
      CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
      GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
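Single-pass EST reads like the ones above often contain ambiguous base calls, written as N. The following is a minimal sketch, not an NCBI tool, of how such reads could be loaded from FASTA text and screened for their fraction of N calls. The two short reads are the example fragments from the slide; their headers (`read_a`, `read_b`) and the function names are made up for illustration.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading '>'.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are concatenated in order.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def ambiguous_fraction(seq):
    """Fraction of positions called as N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


# Hypothetical headers; the sequences are the short reads shown on the slide.
EXAMPLE = """>read_a
gatccantgccatacg
>read_b
ctcgccaattcnntcg
"""

reads = parse_fasta(EXAMPLE)
for name, seq in reads.items():
    print(name, len(seq), ambiguous_fraction(seq))
```

The same two functions would work unchanged on the full IMAGE:275615 reads if they were pasted in as FASTA text.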

  15. GSS, WGS, HTG
      [Diagram labels: Whole BAC insert (or genome); shred; isolate clones; sequence; GSS division or trace archive; assembly; Draft sequence (HTG division); whole genome shotgun assemblies (traditional division)]

  16. HTG Example: Honeybee Draft Sequences
      LOCUS       AC141845  147720 bp  DNA  linear  HTG  19-MAR-2004
      DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
      ACCESSION   AC141845
      VERSION     AC141845.1  GI:29124029
      KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
      • Unfinished sequences of BACs
      • Gaps and unordered pieces
      • Finished sequences (Phase 3) move to traditional GenBank division
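The header fields of the honeybee record can be pulled apart with a few lines of code. This is a simplified sketch, not a full GenBank flat-file parser: it assumes that each field starts with an upper-case keyword in the first column, that indented lines continue the previous field, and that the LOCUS line has the same field order as in the example. The DEFINITION line is wrapped here only to exercise the continuation handling, and the function names are illustrative.

```python
HEADER = """\
LOCUS       AC141845  147720 bp  DNA  linear  HTG  19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14
            unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
"""


def parse_header(text):
    """Map each header keyword to its (possibly multi-line) value."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: continuation of the previous field.
            if key is not None:
                fields[key] += " " + line.strip()
        else:
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields


def parse_locus(value):
    """Split a LOCUS value, assuming the field order shown in the example."""
    parts = value.split()
    return {
        "name": parts[0],
        "length_bp": int(parts[1]),
        "molecule": parts[3],
        "topology": parts[4],
        "division": parts[5],
        "date": parts[6],
    }


fields = parse_header(HEADER)
locus = parse_locus(fields["LOCUS"])
keywords = [k.strip() for k in fields["KEYWORDS"].rstrip(".").split(";")]
is_draft = "HTGS_DRAFT" in keywords

print(locus["division"], locus["length_bp"], keywords, is_draft)
```

Checking the KEYWORDS field for a phase keyword is one simple way to tell an unfinished HTG record from a finished sequence.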

  17. Whole Genome Shotgun Projects
      • 351 projects
        • Bacteria (251)
        • Environmental sequences (6)
        • Archaea (6)
        • Eukaryotes (88), including:
          • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
          • Pufferfish (2)
          • Honeybee, Anopheles, Fruit Flies (3), Silkworm
          • Nematode (2)
          • Yeasts (8), Aspergillus (2)
          • Rice (2)

  18. Whole Genome Shotgun (WGS) Projects wgs master[properties]

  19. Derivative Databases
      [Diagram labels: Sequencing Centers; Labs; GenBank (EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT), updated ONLY by submitters; UniGene, UniSTS, RefSeq, updated by NCBI; RefSeq: Entrez Gene and annotation pipelines]

  20. Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]

  21. Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]

  22. Query: human[organism] AND lipase[title] AND endothelial[title]
      [Screenshot: result records of 3927 bp, 4150 bp, 2323 bp, 3927 bp, and 261 bp]
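The queries on these slides combine field-tagged terms with AND. As a sketch of how the same query strings could be assembled in code, the helper below joins (text, field) pairs into an Entrez term. The second function, which wraps the term in an E-utilities `esearch` URL, goes beyond what the slides show: the endpoint, the `db` and `term` parameters, and the helper names are assumptions of this example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (an assumption of this sketch, not from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(*clauses):
    """Join (text, field) pairs into an Entrez query such as a[x] AND b[y]."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term):
    """Return a URL-encoded esearch request; this function does not send it."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = build_term(
    ("human", "organism"),
    ("lipase", "title"),
    ("endothelial", "title"),
)
url = esearch_url("nucleotide", term)

print(term)
print(url)
```

Adding the third clause narrows the search in the same way the slides narrow the lipase query to endothelial lipase records.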

  23. RefSeq Benefits
      genomes, transcripts, proteins
      • non-redundant; best representative
      • updates to reflect current sequence data and biology
      • distinct, stable accession series

  24. Reference Sequence: RefSeq
      Accession          Sequence Type
      NM_123456789       mRNA
      NP_123456789       protein, from NM_
      NR_123456          non-coding RNA
      XM_123456          predicted mRNA
      XP_123456          predicted protein
      XR_123456          predicted non-coding RNA
      ZP_12345678        predicted from NZ_
      NC_123456          genomic, e.g., chromosomes
      NG_123455          genomic, incomplete region
      NT_123456          genomic, BAC assembly
      NW_123456          genomic, WGS assembly
      NZ_ABCD12345678    genomic, WGS collection
      blue = curated (color coding not preserved in this transcript)
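Because the RefSeq prefix encodes the record type, an accession can be classified with a simple lookup. The table below is transcribed from the slide; the curated flag is taken from the Annotation Process slide (NM, NR, and NP curated; XM, XR, and XP models) and is left as `None` where this transcript does not say. The function name `classify` is illustrative.

```python
# prefix -> (sequence type from the slide, curated flag or None if not stated)
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", True),
    "NP_": ("protein, from NM_", True),
    "NR_": ("non-coding RNA", True),
    "XM_": ("predicted mRNA", False),
    "XP_": ("predicted protein", False),
    "XR_": ("predicted non-coding RNA", False),
    "ZP_": ("predicted from NZ_", None),
    "NC_": ("genomic, e.g., chromosomes", None),
    "NG_": ("genomic, incomplete region", None),
    "NT_": ("genomic, BAC assembly", None),
    "NW_": ("genomic, WGS assembly", None),
    "NZ_": ("genomic, WGS collection", None),
}


def classify(accession):
    """Return (sequence type, curated flag), or None for non-RefSeq accessions."""
    prefix, sep, _ = accession.strip().upper().partition("_")
    if not sep:
        # No underscore, so this is not a RefSeq-style accession.
        return None
    return REFSEQ_PREFIXES.get(prefix + "_")


for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"):
    print(acc, classify(acc))
```

A GenBank accession such as AC141845 has no underscore, which is what makes the RefSeq series easy to tell apart.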

  25. Annotation Process
      [Diagram labels: Genomic DNA (NC, NT, NW); Scanning....; Model mRNA (XM) (XR); Model protein (XP); Curated mRNA (NM) (NR); Curated Protein (NP); RefSeq; GenBank Sequences]

  26. Creating NM_ Records
      • NM’s must have cDNA support
      [Diagram labels: Genome annotation; transcript variant 1; transcript variant 2; transcript variant 3; Longest mRNA]

  27. Where is RefSeq?

  28. The Entrez System
      [Diagram labels: Entrez; GENSAT; PubChem; Gene; UniGene; CancerChromosomes; UniSTS; HomoloGene; SNP; PopSet; Genome; Nucleotide; GEO; Books; Taxonomy; PubMed; MeSH; OMIM; Protein; PMC; Journals; Domains; 3D Domains; Structure]

  29. A Few Entrez Databases
      UniGene  Clusters of ESTs, mRNAs
      dbSNP    Single Nucleotide Polymorphisms
      GEO      Gene Expression Omnibus: microarray and other expression data
      CDD      Conserved Domain Database: protein families (COGs and KOGs), single domains (PFAM, SMART, CD)

  30. UniGene
      Gene-oriented clusters of expressed sequences
      • Automatic clustering using MegaBlast
      • Each cluster represents a unique gene
      • Informed by genome hits
      • Information on tissue types and map locations
      • Useful for gene discovery and selection of mapping reagents

  31. A Cluster of ESTs query 5’ EST hits 3’ EST hits

  32. UniGene Collections

  33. Example UniGene Cluster

  34. Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)

  35. UniGene Cluster Hs.95351 SELECTED PROTEIN SIMILARITIES

  36. UniGene Cluster Hs.95351 GENE EXPRESSION

  37. UniGene Cluster Hs.95351: expression

  38. UniGene Cluster Hs.95351: seqs

  39. Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

  40. Entrez GEO

  41. NCBI’s SNP Database
      • Primary and derivative (RefSNP)
      • Single nucleotide polymorphisms
      • Repeat polymorphisms
      • Insertion-deletion polymorphisms
      • Over 19 million refSNPs (rsXXXXXXX) (August 2005)

  42. Searching dbSNP

  43. RefSNP

  44. RefSNP

  45. RefSNP

  46. RefSNP Search Mouse SNP between strains

  47. RefSNP
      [Screenshot labels: MapView; No 3D; OMIM; SeqView; GeneView]

  48. RefSNP

  49. Entrez GEO

  50. Entrez GEO
      GDS  Grouping of experiments (Entrez GEO Datasets)
      GSE  Grouping of slide/chip data, “a single experiment” (GEO SEries: set of related samples)
      GPL  Platform descriptions
      GSM  Raw/processed spot intensities from a single slide/chip (GEO SaMple: experimental conditions)
      [Diagram labels: Submitted by Experimentalists; Curated by NCBI; Submitted by Manufacturer*]
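The relationship between the GEO record types can be sketched as a small data model: a sample (GSM) is measured on one platform (GPL), and a series (GSE) groups related samples. The classes below only illustrate that hierarchy and are not GEO's actual schema. The class names, field names, and accessions such as `GPL_EXAMPLE` are hypothetical, and the link from each sample to its platform is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Platform:
    """GPL: platform description."""
    accession: str
    description: str


@dataclass(frozen=True)
class Sample:
    """GSM: data from a single slide/chip, plus its experimental conditions."""
    accession: str
    platform: Platform
    conditions: str


@dataclass
class Series:
    """GSE: a set of related samples, 'a single experiment'."""
    accession: str
    samples: list = field(default_factory=list)

    def platforms(self):
        """Sorted accessions of the platforms used by this series."""
        return sorted({s.platform.accession for s in self.samples})


# Hypothetical records for illustration only.
chip = Platform("GPL_EXAMPLE", "example expression array")
series = Series(
    "GSE_EXAMPLE",
    [
        Sample("GSM_EXAMPLE_1", chip, "control"),
        Sample("GSM_EXAMPLE_2", chip, "treated"),
    ],
)

print(series.accession, len(series.samples), series.platforms())
```

A GDS record would sit one level above this model, as a grouping of experiments.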
