NCBI is vast. Site map: ncbi.nlm.nih/Sitemap/index.html

Presentation Transcript

  1. NCBI is vast. Site map: ncbi.nlm.nih/Sitemap/index.html

  2. NCBI database overview

  3. Some things to keep in mind about NCBI databases:
     • Scope of data can be intimidating. You just need a little orientation.
     • Constantly growing and evolving.
     • Vestigial features:
       • some things are there for historical reasons and are not useful anymore
       • some things have changed or lost meaning
       • example: accession numbers:
         • in the past, they provided some information about the sequence
         • now, they’re just a unique identifier
     • When in doubt: resort to the abundant help pages.

  4. GenBank – Foundation of NCBI databases & resources
     • Database of all primary DNA sequences
     • Contains everything:
       • genomes
       • plasmids
       • synthetic sequences
       • fragments (partial gene sequences)
       • ESTs
       • STSs
     • Many redundant/overlapping sequences.
     • Released every 2 months.
     • Most journals require submission of new sequences to GenBank.

  5. GenBank: composed of three nucleotide databases

  6. International Nucleotide Sequence Database Collaboration
     • All nucleotide sequences are shared between three sites:
       • GenBank (US)
       • EMBL (Europe) – European Molecular Biology Laboratory
       • DDBJ (Japan) – DNA Data Bank of Japan
     • The sites have exchanged nucleotide records daily since 1986:
       • Data stored in a common format (machine & human readable)
       • ASN.1 format for machine exchange
       • GenBank record format for human reading
     • Can access/submit data from any of the sites

  7. GenBank records are more than raw sequence data
     • Each record includes annotations of the sequence – basic information about:
       • gene structure
       • coding features
       • regulatory sites
       • functions
       • db cross references
       • citations
       • authorship (submitter)
       • date
       • identifiers
       • etc.

  8. Chromosomal DNA region containing the CFTR gene
     Try it – look up via the Gene DB

  9. Locus name: unique identifier. Usually just the accession number.
     Seq length: number of nucleotides (bp) in the record.
     Molecule type
     Revision date: date of the most recent update to the sequence record.
     GenBank division: not that important. Tells which FTP site the record is stored on.
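The fields on this slide all sit on the first line of a nucleotide flat-file record. As a minimal sketch, assuming that line is whitespace-separated in the order name, length, `bp`, molecule type, optional topology, division, date, it could be split as follows. The function name `parse_locus` and the sample line are hypothetical and made up for illustration; real records may need a stricter, column-based parser.

```python
def parse_locus(line):
    """Split a nucleotide LOCUS line into its fields (simplified sketch)."""
    parts = line.split()
    if len(parts) < 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("not a recognizable LOCUS line: %r" % line)
    return {
        "locus_name": parts[1],
        "length_bp": int(parts[2]),
        "molecule_type": parts[4],
        # Assumption: an 8-field line carries a topology word (linear/circular).
        "topology": parts[5] if len(parts) == 8 else None,
        "division": parts[-2],
        "revision_date": parts[-1],
    }


# Hypothetical line, not taken from a real record.
example = "LOCUS       DEMO0001     1200 bp    DNA     linear   PRI 01-JAN-2006"
fields = parse_locus(example)
print(fields["locus_name"], fields["length_bp"], fields["division"])
```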

  10. Definition: basic information about the sequence.
      Accession number: unique identifier for each nucleotide sequence. (May have multiple versions as separate records.)
      Region: refers to a subsection of the complete sequence.
      • GI: GeneInfo Identifier. Also a unique sequence record identifier.
        • Redundant with the version number.
        • Assigned consecutively with each new sequence deposit or update.
        • New GI for any deposit or change.
      • Version: format is “ACC#.VER#”.
        • Version numbers are incremented with each sequence revision. The accession number doesn’t change.
        • Unique for each sequence record.
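The “ACC#.VER#” convention is easy to handle in code. The sketch below splits an identifier at the dot and checks whether two identifiers are revisions of the same accession. The helper names and the sample identifiers are hypothetical, chosen only to illustrate the format.

```python
def split_accession_version(identifier):
    """Split 'ACC.VER' into (accession, version number)."""
    accession, dot, version = identifier.partition(".")
    if not accession or not dot or not version.isdigit():
        raise ValueError("expected ACC#.VER#, got %r" % identifier)
    return accession, int(version)


def same_accession(id_a, id_b):
    """True when both identifiers are versions of one accession number."""
    return split_accession_version(id_a)[0] == split_accession_version(id_b)[0]


# Hypothetical identifiers: two revisions of one record, plus an unrelated one.
print(split_accession_version("AB000001.2"))
print(same_accession("AB000001.1", "AB000001.2"))
print(same_accession("AB000001.1", "AB000002.1"))
```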

  11. [Diagram: a unique nucleotide sequence has one accession number and locus name; each of its sequence records has its own version number and GI (GeneInfo identifier).]

  12. Source: free-form information. Usually the scientific and common name of the organism.
      Organism: scientific name and taxonomic position.
      Reference: journal references. Numbered sequentially for cross-reference. Ordered chronologically.

  13. Feature Table:
      • Contains multiple entries in the form:
        • Feature key - a single word or abbreviation indicating function type
        • Location - instructions for finding the feature
        • Qualifiers - auxiliary information about a feature
      feature_key: up to 20 letters or numbers
      location: nucleotide position. Can use operators: (.), (..), (>), (<), (join), (complement), etc.
      qualifiers: slash (/), then the qualifier name, then equal sign (=), then the text description (in quotes, possibly over multiple lines)
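To make the location operators concrete, here is a minimal sketch of a parser for the simplest forms: a single position, a `start..end` range, the `<` and `>` partial markers, and an outer `join(...)` or `complement(...)`. The function name `parse_location` is hypothetical. The sketch deliberately ignores the single-dot operator, nested `complement` inside `join`, and references to other records, and it discards the partial markers instead of recording them.

```python
import re

# One span: optional '<', a start, then optionally '..', optional '>', an end.
_SPAN = re.compile(r"<?(\d+)(?:\.\.>?(\d+))?")


def parse_location(location):
    """Return (strand, [(start, end), ...]) for a simple location string."""
    text = location.strip()
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = _SPAN.fullmatch(part.strip())
        if match is None:
            raise ValueError("unsupported location: %r" % location)
        start = int(match.group(1))
        # A bare position such as '467' is treated as a one-base span.
        end = int(match.group(2)) if match.group(2) else start
        spans.append((start, end))
    return strand, spans


print(parse_location("340..565"))
print(parse_location("join(12..78,134..202)"))
print(parse_location("complement(34..126)"))
```

Positions are kept 1-based and inclusive, matching the numbering used in the record itself.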

  14. /db_xref: cross-references into other related databases
      • 5’ to 3’ sequence data.
      • Starts after the ORIGIN key.
      • Locations and qualifiers in the feature table refer to this.
      • Numbering is only relevant to this particular record.
      For a detailed example and description of GenBank flat file format see:
      For more information on Feature Table syntax see:
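Because feature locations refer to the numbering of the sequence block, pulling a feature out of a record comes down to reading everything after ORIGIN and slicing it. The sketch below assumes the usual layout of numbered lines with space-separated blocks of bases and a `//` terminator. The function names and the toy record are hypothetical.

```python
def read_origin(record_text):
    """Collect the bases that follow the ORIGIN key, dropping numbers and spaces."""
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases)


def slice_feature(sequence, start, end):
    """Extract a 1-based, inclusive span, as used in feature locations."""
    return sequence[start - 1:end]


# Toy record fragment, not a real entry.
toy_record = """ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct
//
"""
sequence = read_origin(toy_record)
print(len(sequence))
print(slice_feature(sequence, 11, 20))
```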

  15. FASTA format: useful for raw, unannotated sequence
      Try it – select FASTA from the display menu
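A FASTA file is just a `>` header line followed by one or more lines of sequence, which is why it suits raw sequence so well. As a rough sketch, a reader for it fits in a few lines. The function name `parse_fasta` and the sample text are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up sequences for illustration only.
sample = """>seq1 first toy sequence
ATGCGT
ACGT
>seq2 second toy sequence
TTGACA
"""
for name, seq in parse_fasta(sample):
    print(name, len(seq))
```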

  16. Searching GenBank
      • Entrez (primary route). Includes CoreNucleotide, dbEST, dbSTS, and dbGSS. Can access Entrez by selecting nucleotide from virtually any search box menu.
      • BLAST (select “nr” for comprehensive nucleotide search). Example: try blasting CFTR mRNA against the mouse genome.
      • Direct search of component databases (dbEST, dbSTS, dbGSS)
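The slides search Entrez through the web pages. For scripted searches, NCBI also offers the E-utilities web service, and the sketch below only builds an `esearch` request URL without sending it. The database name `nuccore`, the field tags in the query, and the helper name `build_esearch_url` are assumptions for illustration, so check them against the current E-utilities documentation before relying on them.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Build (but do not send) an Entrez search request URL."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query


# Assumed query syntax: field tags in square brackets, combined with AND.
url = build_esearch_url("CFTR[Gene Name] AND Homo sapiens[Organism]", retmax=5)
print(url)

# Round-trip check: decoding the query string gives back the original term.
decoded = parse_qs(urlparse(url).query)
print(decoded["term"][0])
```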

  17. RefSeq – a derivative database
      • Curated, non-redundant database:
        • includes genomic DNA, transcript (RNA), and protein products for major organisms.
        • derived from GenBank primary data; the annotation is computational, from published literature, or from domain experts.
      • Why RefSeq when we already have GenBank?
        • GenBank is a mess – overlapping, partial, redundant sequences
        • Example: search CFTR (human) in RefSeq (via Gene DB) vs. GenBank (via nucleotide query)

  18. The main features of the RefSeq collection include:
      • non-redundant
      • explicitly linked nucleotide and protein sequences
      • data validation and format consistency
      • distinct accession series
      • ongoing curation by NCBI staff and collaborators: updates to reflect current knowledge of sequence data and biology
      March 20, 2006: RefSeq Release 16:
      • Proteins: 2,520,485
      • Organisms: 3,397
      RefSeq records have a standard accession number format (start with N or X)
      • Examples:
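The distinct accession series makes RefSeq records easy to spot in a mixed list. The sketch below follows the slide's rule (first letter N or X) and additionally assumes a two-letter prefix followed by an underscore, which is how RefSeq accessions commonly look but is not stated on the slide. The function name and the sample strings are for illustration only, and prefixes outside N and X are not covered.

```python
import re

# Assumed shape: N or X, one more letter, underscore, identifier, optional ".version".
_REFSEQ_LIKE = re.compile(r"^[NX][A-Z]_[A-Z0-9]+(\.\d+)?$")


def looks_like_refseq(accession):
    """Heuristic check that an accession follows the RefSeq-style format."""
    return _REFSEQ_LIKE.match(accession) is not None


# Sample strings used only to exercise the pattern.
for candidate in ["NM_000492.3", "XP_123456", "M28668", "NM000492"]:
    print(candidate, looks_like_refseq(candidate))
```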

  19. Searching RefSeq:
      • Gene DB – access from the pull-down menu next to the search box
        • Example: CFTR search
      • Entrez limits: restrict nucleotide or protein searches to RefSeq
        • Using Entrez Limits:
      More info on RefSeq and searching:

  20. Third Party Annotation (TPA) Database
      • User-provided annotation for sequence data
      • Derived from GenBank primary data.
      • Kind of like RefSeq, but built on user submissions.
      • Helps with the curation bottleneck at NCBI by taking advantage of the community's expertise and knowledge.

  21. Other useful DBs at NCBI (many are derivative DBs)
      Access via Entrez:
      • Gene: curated gene-centric summary of gene information and related sequences
      • UniGene: gene-centered clusters of transcript sequences from GenBank
      • Genomes: whole genome sequences – no nucleotide fragments
      • Taxonomy: classifications of organisms
        • Example: find relatives of Homo sapiens and blast CFTR
      • Map Viewer: graphical display of gene records on chromosomes
      • OMIM: Online Mendelian Inheritance in Man. Catalog of human genetic disorders.
      • dbSNP: database of Single Nucleotide Polymorphisms
      • PubMed: database of biomedical journal abstracts
      And many more. See Entrez (All Databases link from NCBI main page).

  22. Additional NCBI Databases and Tools

  23. Getting help:
      • General help with NCBI searching:
      • Handbook:
      • NCBI:
      • Resource Guide/Overview:
      • Tools:
      • Tutorials and Exercises:

  24. Other useful databases:
      • KEGG (Kyoto Encyclopedia of Genes and Genomes)
      • EcoCyc (Encyclopedia of E. coli Genes and Metabolism)
      • BIND (Biomolecular Interaction Network Database)
      • UCSC Genome Browser
      • TIGR databases
      • Many Microbes Microarray Database
      And the list goes on and on and on …