A Minimal Guide to NCBI Nucleotide Resources

Presentation Transcript


  1. A Minimal Guide to NCBI Nucleotide Resources

  2. Types of Databases
     • Primary Databases: original submissions by experimentalists; content controlled by the submitter. Examples: GenBank, SNP, GEO
     • Derivative Databases: built from primary data; content controlled by a third party (NCBI). Examples: RefSeq, TPA, RefSNP, UniGene, GEO DataSets, NCBI Protein, Structure, Conserved Domain

  3. Accessing the Data: Entrez all[filter]

  4. International Sequence Database Collaboration
     • GenBank (NCBI, NIH): submissions and updates; accessed through Entrez
     • EMBL (EBI): submissions and updates; accessed through SRS
     • DDBJ (CIB, NIG): submissions and updates; accessed through getentry

  5. GenBank: NCBI’s Primary Sequence Database
     Release 142 (June 2004): 35,532,003 records; 40,325,321,348 nucleotides; >140,000 species; 153 gigabytes; 634 files
     • full release every two months
     • incremental and cumulative updates daily
     • available only through the Internet
     • release notes: gbrel.txt
     ftp://ftp.ncbi.nih.gov/genbank/
     ftp://genbank.sdsc.edu/pub
     ftp://bio-mirror.net/biomirror/genbank

  6. A GenBank Record
     LOCUS       NM_000588  924 bp  mRNA  linear  PRI  07-APR-2003
     DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
     ACCESSION   NM_000588
     VERSION     NM_000588.3  GI:28416914
     KEYWORDS    .
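The VERSION line in this record packs three identifiers into one field: the accession, its version number, and the GI. As a minimal sketch, the snippet below pulls them apart with a regular expression. The function name `parse_version_line` and the column spacing of the sample header are assumptions made for illustration; only the field values come from the slide.

```python
import re

# Header lines taken from the slide; the exact column spacing is an assumption.
RECORD_HEADER = """\
LOCUS       NM_000588     924 bp    mRNA    linear   PRI 07-APR-2003
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
"""


def parse_version_line(header: str) -> dict:
    """Return accession, version number, and GI found on the VERSION line."""
    for line in header.splitlines():
        if line.startswith("VERSION"):
            # Expected shape: VERSION <accession>.<version> GI:<number>
            match = re.match(r"VERSION\s+(\w+)\.(\d+)\s+GI:(\d+)", line)
            if match:
                accession, version, gi = match.groups()
                return {
                    "accession": accession,
                    "version": int(version),
                    "gi": int(gi),
                }
    raise ValueError("no parsable VERSION line found")


info = parse_version_line(RECORD_HEADER)
print(info)
```

The version number is what changes when a sequence is revised, which is why the accession alone is not enough to pin down one exact sequence.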

  7. GenBank Record: Feature Table
     GenPept identifiers: /protein_id="NP_000579.2" /db_xref="GI:28416915"

  8. GenBank Record, Cont’d

  9. Sequence Revision History

  10. Sequence Revision History: choose records NM_000588

  11. Display and Save Options

  12. FASTA format (NCBI)

  13. Abstract Syntax Notation: ASN.1 (diagram: ASN.1 shown alongside the GenBank, GenPept, FASTA Nucleotide, and FASTA Protein formats)

  14. Bulk Divisions
      • Expressed Sequence Tag: 1st-pass single-read cDNA
      • Genome Survey Sequence: 1st-pass single-read gDNA
      • High Throughput Genomic: incomplete sequences of genomic clones
      • Sequence Tagged Site: PCR-based mapping reagents
      • Batch submissions (email and ftp)
      • Inaccurate
      • Poorly characterized

  15. NCBI’s Derivative Sequence Databases

  16. Primary vs. Derivative Databases (diagram)
      • Primary: GenBank, fed by labs and sequencing centers; updated ONLY by submitters. Divisions shown: EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT
      • Derivative: RefSeq (Annotation Pipeline; LocusLink and Genomes Pipelines), UniGene, UniSTS; produced by curators and algorithms; updated continually by NCBI

  17. Why Make Reference Sequences?
      Entrez Protein query: topoisomerase II alpha[title] AND human[organism]
      (figure labels: RefSeq protein, AAC77388, P11388, three splice variants, Δ = 5 aa)

  18. RefSeq Benefits (genomes, transcripts, proteins)
      • non-redundant, best representative
      • updates to reflect current sequence data and biology
      • distinct, stable accession series

  19. Reference Sequence: RefSeq Key (blue = curated on the original slide)
      Accession          Sequence Type
      NM_123456789       mRNA
      NP_123456789       protein, from NM_
      NR_123456          non-coding RNA
      XM_123456          predicted mRNA
      XP_123456          predicted protein
      XR_123456          predicted non-coding RNA
      ZP_12345678        predicted from NZ_
      NC_123456          genomic, e.g., chromosomes
      NG_123455          genomic, incomplete region
      NT_123456          genomic, BAC assembly
      NW_123456          genomic, WGS assembly
      NZ_ABCD12345678    genomic, WGS collection
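Because the two-letter prefix encodes the sequence type, the key on this slide can be turned into a simple lookup. The sketch below is illustrative only: the table holds just the prefixes listed on the slide, and the names `REFSEQ_PREFIXES` and `refseq_type` are hypothetical.

```python
from typing import Optional

# Prefix-to-type table transcribed from the RefSeq key on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession: str) -> Optional[str]:
    """Return the slide's sequence type for a RefSeq accession, else None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # No underscore, so this is not a RefSeq-style accession.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000588.3", "NP_000579.2", "AAC77388"):
    print(acc, "->", refseq_type(acc))
```

An accession without an underscore, such as AAC77388, falls through to `None`, which matches the slide's point that RefSeq uses a distinct accession series.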

  20. RefSeq Status Codes
      • REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features.
      • VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review, at which time additional functional information may be provided.
      • PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
      • PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.
      • INFERRED: by genome sequence analysis.
      • MODEL: provided via automated processing and not subjected to individual review or revision between builds.

  21. Third Party Annotation (TPA) Database
      • Annotations of existing GenBank sequences
      • Allows for community annotation of genomes
      • Direct submissions: BankIt, Sequin

  22. Other Databases at the NCBI
      • dbSNP: nucleotide polymorphisms
      • GEO: Gene Expression Omnibus; microarray and other expression data
      • GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples
      • Structure: imported structures (PDB); Cn3D viewer; NCBI curation
      • CDD: conserved domain database; protein families (COGs and KOGs); single domains (PFAM, SMART, CD)

  23. NCBI’s SNP Database • Primary and derivative (RefSNP) • Single nucleotide polymorphisms • Repeat polymorphisms • Insertion-deletion polymorphisms • 24 Species • Over 11 million refSNPs (rsXXXXXXX)

  24. RefSNP • Non-redundant • Computational Analysis • BLAST hits to genome, mRNA, protein

  25. Using Entrez An integrated database search and retrieval system

  26. Entrez: Database Integration (diagram: linked databases and the relationship computed within each)
      • PubMed abstracts: word weight
      • 3-D Structure: VAST
      • Taxonomy: phylogeny
      • Protein sequences: BLAST
      • Nucleotide sequences: BLAST
      • Genomes

  27. Home Page: Global Entrez Portal (search term: hfe)

  28. Global Entrez Search: HFE

  29. Entrez Nucleotide: HFE (218 records; slide annotations: "[Title]", "Not HFE")

  30. Smarter Query: hfe[title] AND human[orgn] (39 records). Curated HFE splice variants (11 total)

  31. hfe[title] AND human[orgn] (cont’d): primary data

  32. Finding Primary Sequences
      • Entrez Nucleotide
        99+% GenBank (primary data): srcdb ddbj/embl/genbank[properties] = 39,849,856 records
        <1% RefSeq (curated data): srcdb refseq[properties] = 304,945 records
      • Useful search terms in [Properties]:
        srcdb: source database (e.g., srcdb genbank[prop])
        gbdiv: GenBank division (e.g., gbdiv est[prop])
        biomol: biomolecule type (e.g., biomol mrna[prop])
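The [Properties] terms follow a regular "key value[prop]" pattern, so they are easy to compose programmatically. Here is a small sketch, assuming nothing beyond the query syntax shown on this slide; the helper names `prop_term` and `and_all` are hypothetical, and the code only builds strings without contacting NCBI.

```python
def prop_term(key: str, value: str) -> str:
    """Format one Properties term, e.g. prop_term('srcdb', 'refseq')."""
    return f"{key} {value}[prop]"


def and_all(*terms: str) -> str:
    """Join query terms with the Entrez boolean operator AND."""
    return " AND ".join(terms)


base = and_all("hfe[title]", "human[orgn]")

# Restrict the same base query to curated RefSeq records, to the EST
# division, or to mRNA molecules.
refseq_only = and_all(base, prop_term("srcdb", "refseq"))
est_only = and_all(base, prop_term("gbdiv", "est"))
mrna_only = and_all(base, prop_term("biomol", "mrna"))

for query in (refseq_only, est_only, mrna_only):
    print(query)
```

Keeping the base query separate from the property filter mirrors how the slides reuse one search and narrow it in different ways.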

  33. Database Queries
      #1  hfe                                      116
      #2  hfe[title] AND human[orgn]                42
      #3  #2 AND srcdb refseq[prop]                 11
      #4  #2 AND srcdb ddbj/embl/genbank[prop]      31
      #5  #2 AND gbdiv pri[prop]                    29
      #6  #2 AND gbdiv est[prop]                     2
      Primate division: gbdiv pri[prop]
      EST division: gbdiv est[prop]

  34. Molecule Queries
      #1  hfe                              116
      #2  hfe[title] AND human[orgn]        42
      #3  #2 AND biomol mrna[prop]          29
      #4  #2 AND biomol genomic[prop]       13
      Genomic DNA: biomol genomic[prop]
      cDNA: biomol mrna[prop]

  35. More Queries…
      • Gene symbol: human hemochromatosis (HFE)
        hfe[sym] AND human[organism]
      • Protein name: topoisomerase genes from Archaea
        topoisomerase[gene/protein name] AND archaea[organism]
      • Chromosome, Links: genes on human chromosome 2 with OMIM links
        2[chromosome] AND gene omim[filter] AND human[organism]
      • RefSeq status, variants: reviewed RefSeqs with transcript variants
        srcdb refseq reviewed[prop] AND has transcript variants[prop]
      • Disease and Gene Ontology: membrane proteins linked to cancer
        integral to plasma membrane[gene ontology] AND cancer[dis]

  36. Other Entrez Databases
      • UniGene: rat clusters that have at least one mRNA
        rat[organism] NOT 0[mrna count]
      • SNP: uniquely mapped microsatellites on human chr2
        microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
      • UniSTS: markers on the Genethon map of human chromosome 12
        Genethon[Map Name] AND human[organism] AND 12[chromosome]
      • Structure: structures of bacterial kinases with resolutions below 2 Å
        bacteria[organism] AND kinase AND 000.00:002.00[resolution]
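Fielded queries like these can also be sent to Entrez programmatically through NCBI's E-utilities, which the slides do not cover. The sketch below only builds an ESearch URL and makes no network request. It assumes the public `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters; the database name `nucleotide` and the helper `esearch_url` are illustrative choices, and field names or databases from this 2004 presentation may no longer be accepted by the current service.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint (assumed; not part of the presentation).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for an Entrez database and a fielded query."""
    # urlencode percent-escapes the square brackets and encodes spaces as '+'.
    params = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_BASE}?{params}"


url = esearch_url("nucleotide", "hfe[title] AND human[orgn]")
print(url)
```

Building the URL separately from fetching it makes the query easy to inspect before any request is sent.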

  37. Search by Sequence

  38. Related Sequences Most similar Least similar

  39. Search by Sequence: protein

  40. BLink (BLAST Link)

  41. BLink Output

  42. BLink → Multiple sequence alignment
