National Center for Biotechnology Information

A Field Guide to GenBank and NCBI’s Molecular Biology Resources
University of Colorado Health Sciences Center
August 30, 2005

Topics
• About NCBI
• GenBank overview
• Primary vs derivative databases
• The Reference Sequence (RefSeq) project
• Entrez databases
• Genome resources
• Bookshelf
-break-
• Entrez text searching
• BLAST sequence searching
• VAST structure searching
• An integrated example

The National Institutes of Health, Bethesda, MD

The National Center for Biotechnology Information
• Accepts submissions of primary data
• Develops tools to analyze these data
• Creates derivative databases based on the primary data
• Provides free search, link, and retrieval of these data, primarily through the Entrez system

NCBI WWW Users per Day
[Figure: number of users per day, 1997 to 2003, with the Christmas & New Year period marked]

Homepage - accessing the data all[filter]

all[filter] 1/11/2005 3/15/2005 8/15/2005

Entrez Nucleotide: # records
Primary Data
• GenBank / DDBJ / EMBL: 57.3 million (97.4 %)
Derivative Data
• RefSeq: 1.47 million (2.5 %)
• RefSeq reviewed: 60,000
• PDB (structures): 5,973
“Total”: 59 million
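The shares quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "RefSeq reviewed" is a subset of RefSeq (so it is not added to the total) and uses the rounded counts from the slide; it therefore comes close to, but does not exactly reproduce, the quoted 97.4 % and 2.5 %.

```python
# Record counts as listed for Entrez Nucleotide (rounded values from the slide).
# Assumption: "RefSeq reviewed" is already counted inside RefSeq.
counts = {
    "GenBank / DDBJ / EMBL": 57_300_000,
    "RefSeq": 1_470_000,
    "PDB (structures)": 5_973,
}


def shares(counts):
    """Return each source's share of the total record count, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total = sum(counts.values())
pct = shares(counts)
for name, n in counts.items():
    print(f"{name:24s} {n:>12,d}  {pct[name]:5.1f} %")
print(f"{'Total':24s} {total:>12,d}")
```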

GenBank: NCBI’s Primary Sequence Database
Release 149, August 2005
• 47 x 10^6 records
• 52 x 10^9 nucleotides
• 195 gigabytes
• 816 files
Over 100 billion bases!
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank

What is GenBank?
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  • Direct submissions (traditional records)
  • Batch submissions (EST, GSS, STS)
  • ftp accounts (genome data)
• Three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL) Database

GenBank Divisions

“Organismal”
PRI (28)   Primate
ROD (15)   Rodent
PLN (13)   Plant and Fungal
BCT (11)   Bacterial/Archeal
INV (7)    Invertebrate
VRT (7)    Other Vertebrate
VRL (4)    Viral
MAM (2)    Mammalian
PHG (1)    Phage
SYN (1)    Synthetic
UNA (1)    Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized

“Functional”
EST (377)  Expressed Sequence Tag
GSS (138)  Genome Survey Sequence
HTG (63)   High Throughput Genomic
PAT (17)   Patent
STS (9)    Sequence Tagged Site
CON (1)    Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized

GenBank Functional (Bulk) Divisions
• Expressed Sequence Tag (EST)
  • 1st pass single read cDNA
• Genome Survey Sequence (GSS)
  • 1st pass single read gDNA
• High Throughput Genomic (HTG)
  • incomplete sequences of genomic clones
• Sequence Tagged Site (STS)
  • PCR-based mapping reagents
• Whole Genome Shotgun

EST Division: Expressed Sequence Tags
• nucleus: 30,000 genes
• RNA gene products
• make cDNA library: 80-100,000 unique cDNA clones in library
• isolate unique clones
• sequence once from each end (example reads: gatccantgccatacg, ctcgccaattcnntcg)

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
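Because ESTs are single-pass reads, they often contain N where a base could not be called. The sketch below parses FASTA text and reports the fraction of N per read. The function names `parse_fasta` and `n_fraction` are illustrative, and the input is only the final line of each read shown above, not the full records.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Fraction of positions called as N (base could not be determined)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


# Excerpt only: the last sequence line of the 5' and 3' reads.
excerpt = """>IMAGE:275615 5' (excerpt)
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3' (excerpt)
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
"""

for header, seq in parse_fasta(excerpt):
    print(f"{header}: {len(seq)} bases, {n_fraction(seq):.1%} N")
```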

GSS, WGS, HTG
[Diagram: a whole BAC insert (or genome) is shredded, clones are isolated and sequenced (GSS division or trace archive), and the reads go through assembly into a draft sequence (HTG division) or whole genome shotgun assemblies (traditional division)]

HTG Example: Honeybee Draft Sequences

LOCUS       AC141845  147720 bp  DNA  linear  HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.

• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division
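The header fields of this record can be pulled apart with simple string handling. The following is a rough sketch, not a full GenBank flat-file parser: it assumes each keyword starts in the first column, continuation lines are indented, and the LOCUS line has exactly the seven whitespace-separated tokens seen in this example. The helper names are made up for illustration.

```python
def parse_header(text):
    """Map each top-level keyword (LOCUS, ACCESSION, ...) to its value text."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and key is not None:
            # Indented continuation line: append to the preceding keyword.
            fields[key] += " " + line.strip()
        else:
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields


def parse_locus(value):
    """Split a LOCUS value into named parts (assumes the 7-token layout of this example)."""
    name, length, _unit, molecule, topology, division, date = value.split()
    return {"name": name, "length": int(length), "molecule": molecule,
            "topology": topology, "division": division, "date": date}


record = """\
LOCUS       AC141845  147720 bp  DNA  linear  HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14
            unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
"""

fields = parse_header(record)
locus = parse_locus(fields["LOCUS"])
keywords = [k.strip() for k in fields["KEYWORDS"].rstrip(".").split(";")]
print(locus["division"], locus["length"], keywords)
```

The HTGS_PHASE1 keyword is what marks this entry as an unfinished draft.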

Whole Genome Shotgun Projects
• 351 projects
  • Bacteria (251)
  • Environmental sequences (6)
  • Archaea (6)
  • Eukaryotes (88), including:
    • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
    • Pufferfish (2)
    • Honeybee, Anopheles, Fruit Flies (3), Silkworm
    • Nematode (2)
    • Yeasts (8), Aspergillus (2)
    • Rice (2)

Whole Genome Shotgun (WGS) Projects wgs master[properties]

Derivative Databases
[Diagram: labs and sequencing centers submit to GenBank (EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT), which is updated ONLY by submitters. The derivative databases UniGene, UniSTS, and RefSeq (RefSeq: Entrez Gene and annotation pipelines) are updated by NCBI.]

Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]

Entrez Nucleotide query: human[organism] AND lipase[title] AND endothelial[title]
[Screenshot: retrieved records of 3927 bp, 4150 bp, 2323 bp, 3927 bp, and 261 bp]
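Fielded Entrez queries like the ones on these slides are plain strings, so they are easy to compose in code. The sketch below builds the two queries and a search URL without sending any request. The E-utilities ESearch endpoint and the database name "nucleotide" are assumptions that do not appear in the slides, and the helper names are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: NCBI's public E-utilities ESearch endpoint (not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join (text, field) pairs into a query such as 'human[organism] AND lipase[title]'."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for a database and query term; no request is sent."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


broad = entrez_term(("human", "organism"), ("lipase", "title"))
narrow = entrez_term(("human", "organism"), ("lipase", "title"),
                     ("endothelial", "title"))

# Assumption: "nucleotide" as the database name for Entrez Nucleotide.
url = esearch_url("nucleotide", narrow)
print(broad)
print(narrow)
print(url)
```

Adding one more title term narrows the search, yet several records for what looks like the same gene can still come back, which is the redundancy RefSeq is meant to address.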

RefSeq Benefits
Genomes, transcripts, proteins
• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series

Reference Sequence: RefSeq

Accession         Sequence Type
NM_123456789      mRNA
NP_123456789      protein, from NM_
NR_123456         non-coding RNA
XM_123456         predicted mRNA
XP_123456         predicted protein
XR_123456         predicted non-coding RNA
ZP_12345678       predicted from NZ_
NC_123456         genomic, e.g., chromosomes
NG_123455         genomic, incomplete region
NT_123456         genomic, BAC assembly
NW_123456         genomic, WGS assembly
NZ_ABCD12345678   genomic, WGS collection

(On the original slide, blue = curated.)
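Since the two-letter prefix encodes the sequence type, a record can be classified from its accession alone. This is a minimal lookup sketch built only from the prefixes listed on this slide; the function name `refseq_type` is illustrative, and prefixes not on the slide are rejected rather than guessed.

```python
# Prefix-to-type table copied from the slide above.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, based on its prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a RefSeq accession from the table: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


# Placeholder accessions from the slide, plus a GenBank accession for contrast.
for acc in ["NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"]:
    try:
        print(acc, "->", refseq_type(acc))
    except ValueError as err:
        print(acc, "->", err)
```

The underscore is what separates RefSeq accessions from GenBank accessions such as AC141845, which is why the lookup rejects the last example.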

Annotation Process
[Diagram: GenBank sequences and genomic DNA (NC, NT, NW) are scanned to give model mRNA (XM, XR) and model protein (XP) records; curated mRNA (NM, NR) and curated protein (NP) records complete RefSeq]

Creating NM_ Records
• Genome annotation
• NM’s must have cDNA support
[Diagram: transcript variant 1, transcript variant 2, transcript variant 3; longest mRNA]

Where is RefSeq?

The Entrez System
[Diagram of linked Entrez databases: GENSAT, PubChem, Gene, UniGene, CancerChromosomes, UniSTS, Homologene, SNP, PopSet, Genome, Nucleotide, GEO, Books, Taxonomy, PubMed, MeSH, OMIM, Protein, PMC, Journals, Domains, 3D Domains, Structure]

A Few Entrez Databases
• UniGene: Clusters of ESTs, mRNAs
• dbSNP: Single Nucleotide Polymorphisms
• GEO: Gene Expression Omnibus
  • microarray and other expression data
• CDD: Conserved Domain Database
  • protein families (COGs and KOGs)
  • single domains (PFAM, SMART, CD)

UniGene
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents
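To make the idea of gene-oriented clustering concrete, here is a toy sketch that groups sequences from a list of pairwise hits using union-find (single-linkage grouping). It is a deliberate simplification and not UniGene's actual procedure: the hits are assumed to come from some prior similarity search, and the ids and hits below are made up for illustration.

```python
def cluster(ids, hits):
    """Group ids so that two ids share a cluster if a chain of hits links them."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sequence ids and pairwise hits.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster(ids, hits))
```

Here est1 and est2 never hit each other directly but land in the same cluster because both hit mrna1.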

A Cluster of ESTs
[Figure: a query sequence aligned with its 5’ EST hits and 3’ EST hits]

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)

UniGene Cluster Hs.95351 SELECTED PROTEIN SIMILARITIES

UniGene Cluster Hs.95351 GENE EXPRESSION

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Entrez GEO

NCBI’s SNP Database
• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• Over 19 million refSNPs (rsXXXXXXX) (August, 2005)

Searching dbSNP

RefSNP

RefSNP Search Mouse SNP between strains

RefSNP
[Screenshot labels: MapView, No 3D, OMIM, SeqView, GeneView]

RefSNP

Entrez GEO

Entrez GEO
• Submitted by Experimentalists
  • GSE (GEO SEries): grouping of slide/chip data, “a single experiment”; set of related samples
  • GSM (GEO SaMple): raw/processed spot intensities from a single slide/chip; experimental conditions
• Submitted by Manufacturer*
  • GPL: platform descriptions
• Curated by NCBI
  • GDS (Entrez GEO Datasets): grouping of experiments
