NCBI Molecular Biology Resources

Presentation Transcript

  1. NCBI Molecular Biology Resources: NCBI Databases (November 2008)

  2. The National Center for Biotechnology Information (Bethesda, MD)
      • Created in 1988 as a part of the National Library of Medicine at NIH
      • Establish public databases
      • Research in computational biology
      • Develop software tools for sequence analysis
      • Disseminate biomedical information

  3. Web Access: www.ncbi.nlm.nih.gov

  4. NCBI Databases and Services
      • GenBank: primary sequence database
      • Free public access to biomedical literature
          • PubMed: free Medline (3 million searches per day)
          • PubMed Central: full text online access
      • Entrez: integrated molecular and literature databases
      • BLAST: highest volume sequence search service (100-200 K searches per day)
      • VAST: structure similarity searches
      • Software and Databases

  5. Types of Databases
      • Primary Databases
          • Original submissions by experimentalists
          • Content controlled by the submitter
          • Examples: GenBank, SNP, GEO
      • Derivative Databases
          • Built from primary data
          • Content controlled by third party (NCBI)
          • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

  6. NCBI Nucleotide Sequences
      Primary
          GenBank / EMBL / DDBJ      149,949,987
      Derivative
          RefSeq                       3,457,825
          Third Party Annotation           6,378
          PDB                              9,021
      Total                          153,423,040

  7. What is GenBank? NCBI’s Primary Sequence Database
      • Nucleotide only sequence database
      • Archival in nature
          • Historical
          • Reflective of submitter point of view (subjective)
          • Redundant
      • GenBank Data
          • Direct submissions (traditional records)
          • Batch submissions (EST, GSS, STS)
          • ftp accounts (genome data)
      • Three collaborating databases
          • GenBank
          • DNA Database of Japan (DDBJ)
          • European Molecular Biology Laboratory (EMBL) Database

  8. International Sequence Database Collaboration
      [Diagram: three collaborating databases, each receiving submissions and updates]
      • GenBank (NCBI, NIH); retrieval via Entrez
      • EMBL (EBI); retrieval via SRS
      • DDBJ (CIB, NIG); retrieval via getentry

  9. GenBank: NCBI’s Primary Sequence Database
      Release 168 (October 2008)     96,400,790 Records     97,381,682,336 Bases
      Whole Genome Shotgun           46,108,952 Records    136,085,973,423 Bases
      Total                         142,509,742 Records    233,467,655,759 Bases
      • full release every two months
      • incremental updates daily
      • available only via ftp: ftp.ncbi.nih.gov/genbank/

  10. The Growth of GenBank (November 2008)
      [Chart: GenBank growth over time]
      • GenBank Release: 97 billion bases
      • WGS: 136 billion bases
      • Doubling time: 12-14 months
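The doubling time quoted on this slide implies roughly exponential growth. The following is a minimal sketch of that arithmetic, assuming the doubling time stays constant (an idealization, not a claim from the slides); the function name `projected_bases` is hypothetical.

```python
def projected_bases(current_bases: float, months_ahead: float, doubling_months: float) -> float:
    """Project a database size assuming a constant doubling time (idealized)."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_bases * 2 ** (months_ahead / doubling_months)


# Figures taken from the slide: 97 billion bases in the GenBank release,
# doubling every 12 to 14 months.
release_bases = 97e9
for doubling in (12, 14):
    estimate = projected_bases(release_bases, 24, doubling)
    print(f"doubling every {doubling} months: about {estimate / 1e9:.0f} billion bases after 24 months")
```

The two doubling times give noticeably different projections after only two years, which is why the slide quotes a range rather than a single figure.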

  11. Organization of GenBank: Traditional Divisions
      Records are divided into 18 Divisions: 12 Traditional and 6 Bulk.
          PRI  Primate
          PLN  Plant and Fungal
          BCT  Bacterial and Archaeal
          INV  Invertebrate
          ROD  Rodent
          VRL  Viral
          VRT  Other Vertebrate
          MAM  Mammalian
          PHG  Phage
          SYN  Synthetic (cloning vectors)
          ENV  Environmental Samples
          UNA  Unannotated
      • Traditional Divisions:
          • Direct Submissions (Sequin and BankIt)
          • Accurate
          • Well characterized
      Entrez query: gbdiv_xxx[Properties]
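The `gbdiv_xxx[Properties]` pattern can be filled in mechanically from the division codes listed on this slide and the Bulk Divisions slide. The sketch below only builds query strings and does not contact Entrez. The helper names `division_query` and `restrict_to_division` are hypothetical, and writing the code in lowercase is an assumption about convention.

```python
# Division codes as listed on the Traditional and Bulk Divisions slides.
TRADITIONAL_DIVISIONS = {
    "PRI": "Primate", "PLN": "Plant and Fungal", "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate", "ROD": "Rodent", "VRL": "Viral",
    "VRT": "Other Vertebrate", "MAM": "Mammalian", "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)", "ENV": "Environmental Samples",
    "UNA": "Unannotated",
}
BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA", "PAT": "Patent",
}


def division_query(code: str) -> str:
    """Return the Entrez term that limits a search to one GenBank division."""
    code = code.strip().upper()
    if code not in TRADITIONAL_DIVISIONS and code not in BULK_DIVISIONS:
        raise ValueError(f"unknown GenBank division: {code!r}")
    # Lowercase is a stylistic assumption, not something the slides specify.
    return f"gbdiv_{code.lower()}[Properties]"


def restrict_to_division(term: str, code: str) -> str:
    """Combine a free-text term with a division limit using AND."""
    return f"({term}) AND {division_query(code)}"


print(restrict_to_division("ethylene receptor", "PLN"))
```

Rejecting unknown codes up front keeps a typo from silently turning into a query that matches nothing.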

  12. Organization of GenBank: Bulk Divisions
      Records are divided into 18 Divisions: 12 Traditional and 6 Bulk.
          EST  Expressed Sequence Tag
          GSS  Genome Survey Sequence
          HTG  High Throughput Genomic
          STS  Sequence Tagged Site
          HTC  High Throughput cDNA
          PAT  Patent
      • Bulk Divisions:
          • Batch Submission (Email and FTP)
          • Inaccurate
          • Poorly characterized
      Entrez query: gbdiv_xxx[Properties]

  13. A Traditional GenBank Record: The Flatfile Format
      Sections marked on the slide: Header, Feature Table, Sequence

      LOCUS       AF124527                2540 bp    mRNA    linear   PLN 29-JAN-2004
      DEFINITION  Prunus persica ethylene receptor (ETR1) mRNA, complete cds.
      ACCESSION   AF124527
      VERSION     AF124527.1  GI:6841074
      KEYWORDS    .
      SOURCE      Prunus persica (peach)
        ORGANISM  Prunus persica
                  Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
                  Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
                  rosids; eurosids I; Rosales; Rosaceae; Amygdaloideae; Prunus.
      REFERENCE   1  (bases 1 to 2540)
        AUTHORS   Bassett,C.L., Artlip,T.S. and Callahan,A.M.
        TITLE     Characterization of the peach homologue of the ethylene receptor,
                  PpETR1, reveals some unusual features regarding transcript
                  processing
        JOURNAL   Planta 215 (4), 679-688 (2002)
         PUBMED   12172852
      REFERENCE   2  (bases 1 to 2540)
        AUTHORS   Bassett,C.B., Artlip,T.S. and Nickerson,M.L.
        TITLE     Direct Submission
        JOURNAL   Submitted (29-JAN-1999) Appalachian Fruit Research Station,
                  USDA-ARS, 45 Wiltshire Road, Kearneysville, WV 25430, USA
      FEATURES             Location/Qualifiers
           source          1..2540
                           /organism="Prunus persica"
                           /mol_type="mRNA"
                           /cultivar="Loring"
                           /db_xref="taxon:3760"
                           /dev_stage="III B/C fruit"
           gene            1..2540
                           /gene="ETR1"
           CDS             269..2485
                           /gene="ETR1"
                           /codon_start=1
                           /product="ethylene receptor"
                           /protein_id="AAF28893.1"
                           /db_xref="GI:6841075"
                           /translation="MEACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVK
                           KSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSMHSRTVAIVMTTAKVLTAVVSCATA
                           LMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRH
                           TILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVF
                           SSNRALKISPNSPVARMRPLAGKHMPGEVVAVRVPLLHLSNFQINDWPELSTKRYALM
                           VLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLAR
                           REAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTPEQRLMVETILKSSHLL
                           ATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNLIKPVASVKKLSVSLNLAADLP
                           VQAVGDEKRLMQIVLNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPEFFPAQSDNH
                           FYLRVQVKDSGSGINPQDIPKLFTKFAQTQSLATRNSGGSGLGLAICKRFVNLMEGHI
                           WIESEGPGKGCTAIFIVKLGFAERSNESKLPFLTKVQANHVQTNFPGLKVLVMDDNGS
                           VTKGLLVHLGCDVTTVSSIDEFLHVISQEHKVVFMDVCMPGIDGYELAVRIHEKFTKR
                           HERPVLVALTGNIDKMTKENCMRVGMDGVILKPVSVDKMRSVLSELLEHRVLFEAM"
      ORIGIN
              1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
             61 gaagaagaag cttctttgat gtgttggggt gccaatctaa agaggaagaa gaaggcctct
            121 aatgtattga ggtcggctgt ctgggctgcc gatctgtgtt gaatggatag tttggtagag
            181 atgcttcaac gacatagggt ggctgaaaag ggtttgaaga aagtgaagga ggaaaccaag
            ...
           2401 tatactgaaa cctgtctcag ttgataaaat gaggagtgtt ttatcagaac tgttggagca
           2461 tcgagtttta tttgaggcta tgtaagatat aggaaaattg ttctagtgaa ggaaagattt
           2521 aaatggaaaa aaaaaaaaaa
      //
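The header of a flatfile record starts with the LOCUS line, which packs several fields into one row. The sketch below splits the LOCUS line from this slide into named fields. It assumes all eight whitespace-separated fields are present, as they are here; records in other layouts would need a stricter parser. The names `Locus` and `parse_locus` are hypothetical.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    unit: str
    mol_type: str
    topology: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line on whitespace (assumes the eight-field layout shown on the slide)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    _, name, length, unit, mol_type, topology, division, date = fields
    return Locus(name, int(length), unit, mol_type, topology, division, date)


record = parse_locus(
    "LOCUS       AF124527                2540 bp    mRNA    linear   PLN 29-JAN-2004"
)
print(record.name, record.length, record.division)
```

The division field (PLN here) is the same three-letter code used in the `gbdiv_xxx[Properties]` query.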

  14. Traditional GenBank Record
      ACCESSION   U07418
      VERSION     U07418.1  GI:466461
      • Accession
          • Stable
          • Reportable
          • Universal
      • Version: tracks changes in sequence
      • GI number: NCBI internal use
      • Well annotated
      • The sequence is the data
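The accession stays fixed while the version suffix tracks sequence changes, so the two parts are often handled separately. Below is a minimal sketch that splits the `ACCESSION.VERSION` identifier shown on the slide; the function names are hypothetical.

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'U07418.1' into the stable accession and the integer version."""
    accession, sep, version = identifier.partition(".")
    if not accession or not sep or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {identifier!r}")
    return accession, int(version)


def same_record(a: str, b: str) -> bool:
    """True when two versioned identifiers refer to the same accession."""
    return split_version(a)[0] == split_version(b)[0]


print(split_version("U07418.1"))
print(same_record("AF124527.1", "AF124527.2"))
```

Comparing only the accession answers "is this the same record?", while comparing the version answers "has the sequence changed?".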

  15. Bulk Divisions
      • Batch Submission and htg (email and ftp)
      • Inaccurate
      • Poorly Characterized
      • Expressed Sequence Tag
          • 1st pass single read cDNA
      • Genome Survey Sequence
          • 1st pass single read gDNA
      • High Throughput Genomic
          • incomplete sequences of genomic clones
      • Sequence Tagged Site
          • PCR-based mapping reagents

  16. GenBank Bulk Sequence: EST poorly characterized

  17. Expressed Sequence Tags in Entrez
      Total                 59 million records
      Human                 8.1 million
      Mouse                 4.9 million
      Pig                   2.2 million
      Maize                 2.0 million
      Arabidopsis           1.5 million
      Cow                   1.5 million
      Zebrafish             1.4 million
      Soybean               1.4 million
      Xenopus tropicalis    1.3 million
      Rice                  1.2 million
      Ciona intestinalis    1.2 million
      Wheat                 1.0 million
      Rat                   1.0 million

  18. Whole Genome Shotgun Projects
      • >900 Projects
      • >800 Taxa
          • 585 Bacteria
          • 8 Archaea
          • 17 metagenomes
          • 255 eukaryotes
              • 86 fungi
              • 89 animals
              • 7 flowering plants
      ftp.ncbi.nih.gov/genbank/wgs/

  19. Mammalian WGS • Now 50 species, including… • Duck-billed platypus • Nine-banded armadillo • Northern tree shrew • Domestic rabbit • Pika • Guinea pig • Mouse • Rat • Thirteen-lined ground squirrel • Small-eared galago • Mouse lemur • Orangutan • Human • Chimpanzee • Gorilla • Rhesus macaque • Tenrec • African elephant • Dog • Cat • Horse • European hedgehog • Eurasian shrew • Little brown bat • Cow • Gray short-tailed opossum

  20. Plant WGS

  21. Derivative Databases

  22. Entrez Protein: Derivative Database

  23. GenPept: GenBank CDS translations

      FEATURES             Location/Qualifiers
           source          1..2484
                           /organism="Homo sapiens"
                           /mol_type="mRNA"
                           /db_xref="taxon:9606"
                           /chromosome="3"
                           /map="3p22-p23"
           gene            1..2484
                           /gene="MLH1"
           CDS             22..2292
                           /gene="MLH1"
                           /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                           Number P14242), S. cerevisiae MLH1 (GenBank Accession
                           Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                           P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                           Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                           Accession Number P14160)"
                           /codon_start=1
                           /product="DNA mismatch repair protein homolog"
                           /protein_id="AAC50285.1"
                           /db_xref="GI:463989"
                           /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                           TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                           ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                           TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

      >gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

  24. Redundant Proteins (20 Proteins)

      GenPept
      >gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
      >gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
      >gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

      NCBI RefSeq
      >gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

      Swiss-Prot
      >gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

      PRF
      >gi|741682|prf||2007430A DNA mismatch repair protei...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

      Etc.
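The definition lines on this slide follow a pipe-delimited pattern: `gi`, the GI number, a source database tag, an accession, then free text. The sketch below tallies the six deflines by source database, which makes the redundancy visible. It handles only the `>gi|...` layout shown here, and the names `Defline` and `parse_defline` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Defline(NamedTuple):
    gi: int
    db: str
    accession: str
    rest: str


def parse_defline(line: str) -> Defline:
    """Parse a '>gi|number|db|accession|text' definition line as shown on the slide."""
    if not line.startswith(">gi|"):
        raise ValueError(f"not a gi-style defline: {line!r}")
    fields = line[1:].split("|", 4)
    if len(fields) != 5:
        raise ValueError(f"too few fields: {line!r}")
    _, gi, db, accession, rest = fields
    return Defline(int(gi), db, accession, rest.strip())


# Deflines copied from the slide (descriptions are truncated there too).
deflines = [
    ">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...",
    ">gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...",
    ">gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...",
    ">gi|4557757|ref|NP_000240.1| MutL protein homolog 1...",
    ">gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...",
    ">gi|741682|prf||2007430A DNA mismatch repair protei...",
]
sources = Counter(parse_defline(d).db for d in deflines)
print(dict(sources))
```

Note that the PRF entry has an empty accession field, with its identifier in the following field, so a parser should not assume the fourth field is always populated.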

  25. Protein Sequences from Structures

      >gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
      SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
      ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
      HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
      ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
      KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ

  26. RefSeq: NCBI’s Derivative Sequence Database
      • Curated transcripts and proteins
          • reviewed
          • human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
      • Model transcripts and proteins
      • Assembled Genomic Regions (contigs)
          • human genome
          • mouse genome
          • rat genome
      • Chromosome records
          • Human genome
          • microbial
          • organelle
          • chicken
          • honeybee
          • sea urchin
      Entrez query: srcdb_refseq[Properties]
      ftp://ftp.ncbi.nih.gov/refseq/release/

  27. Genomes: Two Paths
      [Diagram]
      • NCBI Eukaryotic Genomes (since 1999): Map Viewer, UniGene, HomoloGene; Contigs, Transcripts and Proteins
      • Microbial Genomes and Outside Eukaryotic Genomes (Plants, Fungi) (since 1993): Comparative Proteomics, Clusters of Orthologous Groups (COGs), Protein Clusters; Chromosomes and Proteins

  28. Selected RefSeq Accession Numbers
      mRNAs and Proteins
          NM_123456  Curated mRNA
          NP_123456  Curated Protein
          NR_123456  Curated non-coding RNA
          XM_123456  Predicted mRNA
          XP_123456  Predicted Protein
          XR_123456  Predicted non-coding RNA
      Gene Records
          NG_123456  Reference Genomic Sequence
      Chromosome
          NC_123455  Microbial replicons, organelle genomes, human chromosomes
      Assemblies
          NT_123456  Contig
          NW_123456  WGS Supercontig
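Because each RefSeq series has its own two-letter prefix, the record type can be read directly from the accession. The sketch below is a simple lookup built only from the prefixes listed on this slide; RefSeq has other prefixes that the slide does not list, and those return `None` here. The function name `describe_refseq` is hypothetical.

```python
from typing import Optional

# Prefix meanings exactly as given on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the slide's description for a RefSeq accession, or None if the prefix is not listed."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not in the RefSeq accession style
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000249", "NP_000240.1", "NT_022517", "U07343"):
    print(acc, "->", describe_refseq(acc))
```

The underscore is what separates the RefSeq series from GenBank accessions such as U07343, which is why the lookup returns `None` for it.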

  29. Two Paths to RefSeq
      [Diagram; accession labels listed as they appear on the slide]
      • Submitted Genomes and Annotation: Arabidopsis MLH1 Sequences
          • Genomic: AC006583, AC011816; Genomic Annotations: AL161471, AL161472, ..., AL161595, AL161596
          • Other labels: AJ270058, AJ270060, CAB78038, NM_116983, NC_003075
      • NCBI Annotated Genomes and Selected Model Organisms: Human MLH1 Sequences
          • U07343, AU127758, BC006850
          • Transcript: NM_000249
          • NT_022517 (36974983..37032341)
          • NC_000003 (37009983..37067341)

  30. GenBank to RefSeq: NCBI Organisms

  31. RefSeqs: Annotation Reagents
      [Diagram: RefSeq and GenBank sequences scanned against genomic DNA]
      • Genomic DNA (NC, NT, NW)
      • Model mRNA (XM), (XR); Model protein (XP)
      • Curated mRNA (NM), (NR); Curated Protein (NP)
      • Model and curated records compared ("= ?")

  32. RefSeq Benefits • Non-redundancy   • Explicitly linked nucleotide and protein sequences • Updates to reflect current sequence data and biology • Data validation • Format consistency • Distinct accession series • Stewardship by NCBI staff and collaborators

  33. Mouse Assembly UniGene Transcript Other GenBank RefSeq Contig BAC RefSeq Transcript

  34. Expressed Sequences UniGene GEO

  35. NCBI Expressed Sequences
      62,282,583 mRNA sequences
          60,705,055 GenBank (58,955,534 EST Division)
           1,575,789 Reference Sequences

  36. What is UniGene? A gene-oriented view of sequence entries
      • MegaBlast based automated sequence clustering
      • Now informed by genome hits
      • Nonredundant set of gene oriented clusters
      • Each cluster a unique gene
      • Information on tissue types and map locations
      • Includes known genes and uncharacterized ESTs
      • Useful for gene discovery and selection of mapping reagents
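The idea of grouping sequences into gene-oriented clusters can be illustrated with single-linkage clustering over pairwise similarity hits. This is a generic union-find sketch, not UniGene's actual pipeline. It assumes the pairwise hits (for example, from a MegaBlast run) are already available as ID pairs, and all sequence IDs below are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cluster_by_hits(ids: Sequence[str], hits: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Group IDs so that any two sequences connected by a chain of hits share a cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups: Dict[str, List[str]] = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical IDs: two ESTs hit the same mRNA, two other ESTs hit only each other.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster_by_hits(ids, hits))
```

In this toy example the cluster containing `mrna1` corresponds to a known gene, while the cluster of `est3` and `est4` corresponds to the "uncharacterized ESTs" case mentioned on the slide.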

  37. EST hits: Human mRNA Thrombin mRNA 5’ EST hits 3’ EST hits

  38. Chordates Plants Invertebrates Fungi et al. UniGene

  39. Gene Catalog: Fathead Minnow MLH1 Cluster (Uncharacterized ESTs)

  40. Associating Sequences: Human Thrombin

  41. Expression Data

  42. Other NCBI Databases
      • Structure: imported structures (PDB)
          • Cn3D viewer, NCBI curation
      • CDD: conserved domain database
          • Protein families (COGs and KOGs)
          • Single domains (PFAM, SMART, CD)
      • dbSNP: nucleotide polymorphism
      • Gene: gene records
          • Unifies LocusLink and Microbial Genomes
      • HomoloGene: neighboring function for Gene

  43. MMDB: Molecular Modeling Data Base
      • Derived from experimentally determined PDB records
      • Value added to PDB records including:
          • Addition of explicit chemical graph information
          • Validation (secondary structure elements)
          • Inclusion of Taxonomy, Citation
          • Conversion to ASN.1 data description language
      • Structure neighbors determined by Vector Alignment Search Tool (VAST)

  44. Cn3D 4.1: Bacillus thuringiensis Toxin

  45. VAST: Structure Neighbors (Vector Alignment Search Tool)
      [Figure: IL-4 & Leptin; Human IL-4 with secondary structure elements numbered 1-6]
      For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors, then align the vectors.
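The step of representing secondary structure elements as vectors can be illustrated with a small geometric sketch. This is not the VAST algorithm. It only shows how an SSE might be reduced to a vector between its two endpoints and how the relative orientation of two such vectors could be measured. The coordinates and function names are made up for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sse_vector(start: Vec3, end: Vec3) -> Vec3:
    """Represent an SSE by the vector from its first to its last endpoint."""
    return (end[0] - start[0], end[1] - start[1], end[2] - start[2])


def angle_degrees(u: Vec3, v: Vec3) -> float:
    """Angle between two SSE vectors, in degrees (0 = parallel, 180 = antiparallel)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector")
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical endpoints for two helices packed roughly antiparallel.
helix_a = sse_vector((0.0, 0.0, 0.0), (0.0, 0.0, 15.0))
helix_b = sse_vector((8.0, 0.0, 14.0), (8.0, 1.0, 0.0))
print(f"inter-helix angle: {angle_degrees(helix_a, helix_b):.1f} degrees")
```

Comparing sets of such vectors, rather than every atom, is what makes a vector-based structure comparison cheap enough to run across a whole structure database.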

  46. Protein Domains
      • Structural Domain
          • Discrete independently folding unit of a protein
      • Conserved Domain (sequence-based)
          • Protein region with recognizable position-specific pattern of sequence conservation
      • Sequence-based domains often roughly correspond to structural domains
      • Domains often have distinct, identifiable functions

  47. NCBI’s Conserved Domain Database
      • PSI-BLAST-based score matrices
      • Searchable with RPS-BLAST
      • Sources
          • SMART
          • PFAM
          • COGs
          • NCBI curated domains
          • structure informed alignments

  48. Src Domains Four 3d domains Three conserved domains

  49. Cn3D: Structure vs Conserved Domain
      [Figure: domains labeled SH3, SH2, SH2, TyrKC; conserved phosphotyrosine binding residues highlighted]

  50. NCBI’s SNP Database
      • Primary Database and Derivative (RefSNP)
      • Single Nucleotide Polymorphism
      • Repeat polymorphisms
      • Insertion-Deletion Polymorphisms
      • 29 Species
      • Over 46 million submissions (submitted SNPs)
      • Over 26 million reference SNPs