NCBI Molecular Biology Resources

NCBI Databases, November 2008

The National Center for Biotechnology Information (NCBI), Bethesda, MD
• Created in 1988 as a part of the National Library of Medicine at NIH
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information

Web Access: www.ncbi.nlm.nih.gov

NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
  • PubMed: free Medline (3 million searches per day)
  • PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service (100-200 K searches per day)
• VAST: structure similarity searches
• Software and Databases

Types of Databases
• Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP, GEO
• Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

NCBI Nucleotide Sequences
Primary
• GenBank / EMBL / DDBJ      149,949,987
Derivative
• RefSeq                       3,457,825
• Third Party Annotation           6,378
• PDB                              9,021
Total                        153,423,040

What is GenBank? NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
  • Historical
  • Reflective of submitter point of view (subjective)
  • Redundant
• GenBank Data
  • Direct submissions (traditional records)
  • Batch submissions (EST, GSS, STS)
  • ftp accounts (genome data)
• Three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration
[Diagram: GenBank (NCBI, NIH; accessed via Entrez), EMBL (EBI; accessed via SRS), and DDBJ (CIB, NIG; accessed via getentry). Each database receives submissions and updates, and the three exchange data with one another.]

GenBank: NCBI’s Primary Sequence Database
Release 168, October 2008:  96,400,790 records;   97,381,682,336 bases
Whole Genome Shotgun:       46,108,952 records;  136,085,973,423 bases
Total:                     142,509,742 records;  233,467,655,759 bases
• Full release every two months
• Incremental updates daily
• Available only via ftp: ftp.ncbi.nih.gov/genbank/
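The totals on this slide can be checked with simple arithmetic. The sketch below only re-adds the figures quoted above; the dictionary layout and the function name are illustrative choices, not part of any NCBI tool.

```python
# Figures copied from the Release 168 (October 2008) slide.
RELEASE_168 = {
    "GenBank release": {"records": 96_400_790, "bases": 97_381_682_336},
    "Whole Genome Shotgun": {"records": 46_108_952, "bases": 136_085_973_423},
}


def totals(sections):
    """Sum the record and base counts across all sections."""
    return {
        "records": sum(s["records"] for s in sections.values()),
        "bases": sum(s["bases"] for s in sections.values()),
    }


if __name__ == "__main__":
    t = totals(RELEASE_168)
    print(f"{t['records']:,} records, {t['bases']:,} bases")
```

The sums agree with the "Total Records" and "Total Bases" values printed on the slide.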

The Growth of GenBank (November 2008)
• GenBank release: 97 billion bases
• WGS: 136 billion bases
• Doubling time: 12-14 months
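A doubling time translates into a growth curve of the form N(t) = N0 * 2^(t / T). The sketch below assumes steady exponential growth, which is a simplification of the plotted curve; the function names are hypothetical.

```python
def projected_bases(current_bases, months_ahead, doubling_months):
    """Project database size assuming steady exponential growth.

    Uses N(t) = N0 * 2 ** (t / T), where T is the doubling time in months.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


def monthly_growth_rate(doubling_months):
    """Fractional growth per month implied by a given doubling time."""
    return 2 ** (1 / doubling_months) - 1


if __name__ == "__main__":
    # 97 billion bases is the GenBank release size quoted on the slide.
    for doubling in (12, 14):
        size = projected_bases(97e9, 24, doubling)
        rate = monthly_growth_rate(doubling)
        print(f"T={doubling} months: {size:.3g} bases after 24 months, "
              f"{rate:.1%} growth per month")
```

A doubling time of 12 to 14 months corresponds to roughly 5 to 6 percent growth per month under this assumption.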

Organization of GenBank: Traditional Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian
PHG  Phage
SYN  Synthetic (cloning vectors)
ENV  Environmental Samples
UNA  Unannotated
• Traditional Divisions:
  • Direct submissions (Sequin and BankIt)
  • Accurate
  • Well characterized
Entrez query: gbdiv_xxx[Properties]

Organization of GenBank: Bulk Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA
PAT  Patent
• Bulk Divisions:
  • Batch submission (email and FTP)
  • Inaccurate
  • Poorly characterized
Entrez query: gbdiv_xxx[Properties]
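The Entrez query pattern gbdiv_xxx[Properties] shown on the division slides can be assembled programmatically. The sketch below builds the query string and, as an assumption that is not on the slides, wraps it in an E-utilities ESearch URL; the database name "nucleotide" and the helper names are illustrative. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Division codes as listed on the two "Organization of GenBank" slides.
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "ENV", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC", "PAT"}


def division_query(code):
    """Return the Entrez term that restricts a search to one division."""
    code = code.upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


def esearch_url(term, db="nucleotide"):
    """Build (but do not fetch) an ESearch URL for the term.

    Assumption: the public E-utilities endpoint and this database name.
    """
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': term})}"


if __name__ == "__main__":
    term = division_query("PLN")
    print(term)
    print(esearch_url(term))
```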

A Traditional GenBank Record: The Flatfile Format
Sections: Header, Feature Table, Sequence

LOCUS       AF124527                2540 bp    mRNA    linear   PLN 29-JAN-2004
DEFINITION  Prunus persica ethylene receptor (ETR1) mRNA, complete cds.
ACCESSION   AF124527
VERSION     AF124527.1  GI:6841074
KEYWORDS    .
SOURCE      Prunus persica (peach)
  ORGANISM  Prunus persica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids I; Rosales; Rosaceae; Amygdaloideae; Prunus.
REFERENCE   1  (bases 1 to 2540)
  AUTHORS   Bassett,C.L., Artlip,T.S. and Callahan,A.M.
  TITLE     Characterization of the peach homologue of the ethylene receptor,
            PpETR1, reveals some unusual features regarding transcript
            processing
  JOURNAL   Planta 215 (4), 679-688 (2002)
   PUBMED   12172852
REFERENCE   2  (bases 1 to 2540)
  AUTHORS   Bassett,C.B., Artlip,T.S. and Nickerson,M.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1999) Appalachian Fruit Research Station,
            USDA-ARS, 45 Wiltshire Road, Kearneysville, WV 25430, USA
FEATURES             Location/Qualifiers
     source          1..2540
                     /organism="Prunus persica"
                     /mol_type="mRNA"
                     /cultivar="Loring"
                     /db_xref="taxon:3760"
                     /dev_stage="III B/C fruit"
     gene            1..2540
                     /gene="ETR1"
     CDS             269..2485
                     /gene="ETR1"
                     /codon_start=1
                     /product="ethylene receptor"
                     /protein_id="AAF28893.1"
                     /db_xref="GI:6841075"
                     /translation="MEACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVK
                     KSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSMHSRTVAIVMTTAKVLTAVVSCATA
                     LMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRH
                     TILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVF
                     SSNRALKISPNSPVARMRPLAGKHMPGEVVAVRVPLLHLSNFQINDWPELSTKRYALM
                     VLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLAR
                     REAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTPEQRLMVETILKSSHLL
                     ATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNLIKPVASVKKLSVSLNLAADLP
                     VQAVGDEKRLMQIVLNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPEFFPAQSDNH
                     FYLRVQVKDSGSGINPQDIPKLFTKFAQTQSLATRNSGGSGLGLAICKRFVNLMEGHI
                     WIESEGPGKGCTAIFIVKLGFAERSNESKLPFLTKVQANHVQTNFPGLKVLVMDDNGS
                     VTKGLLVHLGCDVTTVSSIDEFLHVISQEHKVVFMDVCMPGIDGYELAVRIHEKFTKR
                     HERPVLVALTGNIDKMTKENCMRVGMDGVILKPVSVDKMRSVLSELLEHRVLFEAM"
ORIGIN
        1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
       61 gaagaagaag cttctttgat gtgttggggt gccaatctaa agaggaagaa gaaggcctct
      121 aatgtattga ggtcggctgt ctgggctgcc gatctgtgtt gaatggatag tttggtagag
      181 atgcttcaac gacatagggt ggctgaaaag ggtttgaaga aagtgaagga ggaaaccaag
      ...
     2401 tatactgaaa cctgtctcag ttgataaaat gaggagtgtt ttatcagaac tgttggagca
     2461 tcgagtttta tttgaggcta tgtaagatat aggaaaattg ttctagtgaa ggaaagattt
     2521 aaatggaaaa aaaaaaaaaa
//
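The LOCUS line at the top of the flatfile packs several fields into one row. The sketch below splits the LOCUS line of the peach ETR1 record into named fields. It assumes exactly the eight whitespace-separated tokens seen in this record; real records can vary (for example, a strandedness prefix on the molecule type), so this is an illustration rather than a general parser. The class and function names are hypothetical.

```python
from datetime import date
from typing import NamedTuple

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


class Locus(NamedTuple):
    name: str
    length: int
    molecule: str
    topology: str
    division: str
    modified: date


def parse_locus(line):
    """Parse a LOCUS line laid out like the one in the example record."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    day, month, year = tokens[7].split("-")
    modified = date(int(year), MONTHS.index(month.upper()) + 1, int(day))
    return Locus(tokens[1], int(tokens[2]), tokens[4],
                 tokens[5], tokens[6], modified)


if __name__ == "__main__":
    line = ("LOCUS       AF124527                2540 bp    mRNA    "
            "linear   PLN 29-JAN-2004")
    print(parse_locus(line))
```

The division field (PLN here) is the same three-letter code used in the gbdiv_xxx[Properties] query.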

Traditional GenBank Record
ACCESSION   U07418
VERSION     U07418.1  GI:466461
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
• Well annotated
• The sequence is the data
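The accession.version identifier separates the stable accession from the version suffix that tracks sequence changes. A minimal sketch of how a script might use that split follows; the function names are hypothetical, and U07418.2 is an invented later version used only for illustration.

```python
def split_version(identifier):
    """Split 'U07418.1' into the stable accession and the integer version."""
    accession, sep, version = identifier.partition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def sequence_changed(old, new):
    """True when two identifiers name the same record with different versions."""
    old_acc, old_ver = split_version(old)
    new_acc, new_ver = split_version(new)
    if old_acc != new_acc:
        raise ValueError("identifiers refer to different records")
    return old_ver != new_ver


if __name__ == "__main__":
    print(split_version("U07418.1"))
    # U07418.2 is hypothetical; it stands in for a later sequence update.
    print(sequence_changed("U07418.1", "U07418.2"))
```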

Bulk Divisions
• Batch submission and htg (email and ftp)
• Inaccurate
• Poorly characterized
• Expressed Sequence Tag: 1st pass single read cDNA
• Genome Survey Sequence: 1st pass single read gDNA
• High Throughput Genomic: incomplete sequences of genomic clones
• Sequence Tagged Site: PCR-based mapping reagents

GenBank Bulk Sequence: EST poorly characterized

Expressed Sequence Tags in Entrez
Total               59 million records
Human               8.1 million
Mouse               4.9 million
Pig                 2.2 million
Maize               2.0 million
Arabidopsis         1.5 million
Cow                 1.5 million
Zebrafish           1.4 million
Soybean             1.4 million
Xenopus tropicalis  1.3 million
Rice                1.2 million
Ciona intestinalis  1.2 million
Wheat               1.0 million
Rat                 1.0 million

Whole Genome Shotgun Projects
• >900 projects
• >800 taxa
• 585 bacteria
• 8 archaea
• 17 metagenomes
• 255 eukaryotes
• 86 fungi
• 89 animals
• 7 flowering plants
ftp.ncbi.nih.gov/genbank/wgs/

Mammalian WGS • Now 50 species, including… • Duck-billed platypus • Nine-banded armadillo • Northern tree shrew • Domestic rabbit • Pika • Guinea pig • Mouse • Rat • Thirteen-lined ground squirrel • Small-eared galago • Mouse lemur • Orangutan • Human • Chimpanzee • Gorilla • Rhesus macaque • Tenrec • African elephant • Dog • Cat • Horse • European hedgehog • Eurasian shrew • Little brown bat • Cow • Gray short-tailed opossum

Plant WGS

Derivative Databases

Entrez Protein: Derivative Database

GenPept: GenBank CDS translations

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
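The CDS coordinates in a feature table determine how long the GenPept translation is. The sketch below handles only simple "start..end" locations, as in the two CDS features shown in these slides, and assumes the CDS ends with a stop codon that is not translated. The function names are hypothetical.

```python
def parse_simple_location(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(start), int(end)


def cds_protein_length(location, has_stop=True):
    """Number of amino acids encoded by a simple CDS location.

    Assumes the CDS length is a multiple of three and, by default,
    that the last codon is a stop codon.
    """
    start, end = parse_simple_location(location)
    nucleotides = end - start + 1
    if nucleotides % 3:
        raise ValueError("CDS length is not a multiple of three")
    codons = nucleotides // 3
    return codons - 1 if has_stop else codons


if __name__ == "__main__":
    print(cds_protein_length("22..2292"))    # human MLH1 CDS
    print(cds_protein_length("269..2485"))   # peach ETR1 CDS
```

Under these assumptions the MLH1 CDS (22..2292) encodes 756 residues and the peach ETR1 CDS (269..2485) encodes 738.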

Redundant Proteins: 20 proteins, etc.

GenPept
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

NCBI RefSeq
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Swiss-Prot
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

PRF
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
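The FASTA deflines above encode the GI number, the source database tag, and the accession, separated by vertical bars. The sketch below parses those deflines and groups records that carry identical sequences, which is one simple way to see the redundancy this slide illustrates. The function names are hypothetical, and the sequences in the usage example are short placeholders because the slide truncates the real ones.

```python
from collections import defaultdict


def parse_defline(defline):
    """Parse a '>gi|number|db|accession|name title' FASTA defline."""
    header = defline.lstrip(">")
    ident, _, title = header.partition(" ")
    fields = ident.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected defline: {defline!r}")
    return {
        "gi": int(fields[1]),
        "db": fields[2],
        # Some databases leave one of the trailing fields empty.
        "ids": [f for f in fields[3:] if f],
        "title": title,
    }


def group_identical(records):
    """Map each distinct sequence to the database tags that carry it."""
    groups = defaultdict(list)
    for defline, sequence in records:
        groups[sequence].append(parse_defline(defline)["db"])
    return dict(groups)


if __name__ == "__main__":
    # Placeholder sequences; the slide shows only truncated sequences.
    records = [
        (">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...", "MSFVAG"),
        (">gi|4557757|ref|NP_000240.1| MutL protein homolog 1...", "MSFVAG"),
        (">gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...", "MSFVAG"),
    ]
    print(group_identical(records))
```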

Protein Sequences from Structures

>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ

RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish, Arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled genomic regions (contigs)
  • human genome
  • mouse genome
  • rat genome
• Chromosome records
  • human genome
  • microbial
  • organelle
  • chicken
  • honeybee
  • sea urchin
Entrez query: srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/

Genomes: Two Paths
[Diagram with two paths. NCBI eukaryotic genomes (since 1999): Map Viewer, UniGene, HomoloGene; products are contigs, transcripts and proteins. Microbial genomes and outside eukaryotic genomes (plants, fungi) (since 1993): comparative proteomics, Clusters of Orthologous Groups (COGs), Protein Clusters; products are chromosomes and proteins.]

Selected RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted mRNA
  XP_123456  Predicted protein
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference genomic sequence
Chromosome
  NC_123456  Microbial replicons, organelle genomes, human chromosomes
Assemblies
  NT_123456  Contig
  NW_123456  WGS supercontig
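Because each RefSeq accession series has a distinct two-letter prefix, a record type can be read straight from the accession. The sketch below covers only the prefixes listed on this slide; the function name and the regular expression are illustrative assumptions about the accession layout (two letters, underscore, digits, optional version).

```python
import re

# Prefix meanings as listed on the "Selected RefSeq Accession Numbers" slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "chromosome (microbial replicon, organelle genome, human chromosome)",
    "NT": "contig",
    "NW": "WGS supercontig",
}

_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if not RefSeq."""
    match = _ACCESSION.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_000249", "NP_000240.1", "NC_000003", "AF124527"):
        print(acc, "->", classify_refseq(acc))
```

GenBank accessions such as AF124527 have no underscore-separated prefix, so the function returns None for them.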

Two Paths to RefSeq
[Diagram comparing two routes to RefSeq records: submitted genomes and annotation (Arabidopsis MLH1 sequences) and NCBI annotated genomes and selected model organisms (human MLH1 sequences). Track labels: Genomic, mRNA, Annotations, Protein, Transcript. Accessions shown, in slide order: AC006583, AC011816, AL161471, AL161472 ... AL161595, AL161596, U07343, AU127758, BC006850, AJ270058, AJ270060, CAB78038, NM_116983, NT_022517 (36974983..37032341), NM_000249, NC_000003 (37009983..37067341), NC_003075.]

GenBank to RefSeq: NCBI Organisms

RefSeqs: Annotation Reagents
[Diagram: GenBank sequences feed RefSeq curated mRNAs (NM, NR) and curated proteins (NP). Scanning genomic DNA (NC, NT, NW) produces model mRNAs (XM, XR) and model proteins (XP), which are then compared with the curated records ("= ?").]

RefSeq Benefits • Non-redundancy • Explicitly linked nucleotide and protein sequences • Updates to reflect current sequence data and biology • Data validation • Format consistency • Distinct accession series • Stewardship by NCBI staff and collaborators

Mouse Assembly
[Figure labels: UniGene Transcript; Other GenBank; RefSeq Contig; BAC; RefSeq Transcript.]

Expressed Sequences: UniGene, GEO

NCBI Expressed Sequences
62,282,583 mRNA sequences
• 60,705,055 GenBank (58,955,534 EST Division)
• 1,575,789 Reference Sequences

What is UniGene? A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Now informed by genome hits
• Nonredundant set of gene-oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
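The idea of clustering sequences into gene-oriented groups can be illustrated with a toy example. The sketch below is not UniGene's pipeline: it assumes a list of pairwise similarity hits is already available (for example from a sequence search) and simply merges linked sequences with a union-find structure. All identifiers are hypothetical.

```python
def cluster(ids, hits):
    """Group sequence ids into clusters connected by pairwise hits.

    ids  -- iterable of sequence identifiers
    hits -- iterable of (id_a, id_b) pairs judged similar enough to link
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    # Sort clusters by their sorted members for a deterministic result.
    return sorted(groups.values(), key=lambda g: sorted(g))


if __name__ == "__main__":
    # Hypothetical identifiers: one mRNA, two ESTs that hit it, one orphan EST.
    ids = ["mRNA_A", "est_1", "est_2", "est_3"]
    hits = [("mRNA_A", "est_1"), ("est_1", "est_2")]
    print(cluster(ids, hits))
```

In this toy run the mRNA and two ESTs form one cluster and the unlinked EST forms a cluster of its own, mirroring how a cluster can contain a known gene or only uncharacterized ESTs.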

EST hits: Human Thrombin mRNA
[Figure: thrombin mRNA with 5’ EST hits and 3’ EST hits.]

UniGene
[Figure: organism groups covered: chordates, plants, invertebrates, fungi et al.]

Gene Catalog: Fathead Minnow MLH1 Cluster (Uncharacterized ESTs)

Associating Sequences: Human Thrombin

Expression Data

Other NCBI Databases
• Structure: imported structures (PDB); Cn3D viewer, NCBI curation
• CDD: conserved domain database
  • Protein families (COGs and KOGs)
  • Single domains (PFAM, SMART, CD)
• dbSNP: nucleotide polymorphism
• Gene: gene records; unifies LocusLink and Microbial Genomes
• HomoloGene: neighboring function for Gene

MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Value added to PDB records including:
  • Addition of explicit chemical graph information
  • Validation (secondary structure elements)
  • Inclusion of Taxonomy, Citation
  • Conversion to ASN.1 data description language
• Structure neighbors determined by Vector Alignment Search Tool (VAST)

Cn3D 4.1: Bacillus thuringiensis Toxin

VAST: Structure Neighbors (Vector Alignment Search Tool)
• For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors.
• Align the vectors.
[Figure: human IL-4 and leptin, with secondary structure elements numbered 1 to 6.]
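Representing a secondary structure element as a vector can be pictured with a few lines of geometry. The sketch below is not the VAST algorithm; it only shows one plausible way to turn the start and end coordinates of an SSE into a vector and to compare the directions of two such vectors. The coordinates and function names are hypothetical.

```python
import math


def sse_vector(start, end):
    """Vector from the first to the last coordinate of an SSE."""
    return tuple(e - s for s, e in zip(start, end))


def angle_degrees(u, v):
    """Angle between two 3D vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # Hypothetical helix end points in angstroms.
    helix_a = sse_vector((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    helix_b = sse_vector((2.0, 1.0, 0.0), (2.0, 9.0, 0.0))
    print(angle_degrees(helix_a, helix_b))
```

Comparing sets of such vectors between two chains is the kind of information a vector-based structure comparison can work from.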

Protein Domains
• Structural Domain
  • Discrete independently folding unit of a protein
• Conserved Domain (sequence-based)
  • Protein region with recognizable position-specific pattern of sequence conservation
• Sequence-based domains often roughly correspond to structural domains
• Domains often have distinct, identifiable functions

NCBI’s Conserved Domain Database
• PSI-BLAST-based score matrices
• Searchable with RPS-BLAST
• Sources
  • SMART
  • PFAM
  • COGs
  • NCBI curated domains
  • structure informed alignments
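A position-specific score matrix assigns a different score to each residue at each position of a domain model. The sketch below is a toy illustration of scoring a sequence against such a matrix; it is not RPS-BLAST, and the three-column matrix, the default score, and the function names are all invented for the example.

```python
def pssm_score(pssm, window, default=-1):
    """Score one window of residues against a position-specific matrix.

    pssm   -- list of dicts, one per position, mapping residue -> score
    window -- string with the same length as the matrix
    """
    if len(window) != len(pssm):
        raise ValueError("window length must equal matrix width")
    return sum(column.get(residue, default)
               for column, residue in zip(pssm, window))


def best_window(pssm, sequence, default=-1):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (pssm_score(pssm, sequence[i:i + width], default), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Toy three-position matrix; scores are hypothetical.
    toy_pssm = [{"G": 3, "A": 1}, {"E": 2}, {"V": 2, "I": 1}]
    print(best_window(toy_pssm, "AAGEVAA"))
```

Real domain searches use much wider matrices, gapped alignment, and statistical significance estimates, none of which are modeled here.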

Src Domains
• Four 3D domains
• Three conserved domains

Structure vs Conserved Domain
[Figure: Cn3D view with domains labeled SH3, SH2, and TyrKC, highlighting conserved phosphotyrosine binding residues in SH2.]

NCBI’s SNP Database
• Primary Database and Derivative (RefSNP)
• Single Nucleotide Polymorphism
• Repeat polymorphisms
• Insertion-Deletion Polymorphisms
• 29 species
• Over 46 million submissions (submitted SNPs)
• Over 26 million reference SNPs
