NCBI Molecular Biology Resources

A Field Guide, Part 1 • January 12, 2007

NCBI Resources • The NCBI Entrez System • NCBI Sequence Databases • Primary data: GenBank • Derivative data: RefSeq, Gene • Protein Structure and Function • Sequence polymorphisms and phenotypes ** Intermission ** • NCBI Genomic Resources • BLAST

The National Center for Biotechnology Information, Bethesda, MD. Created in 1988 as a part of the National Library of Medicine at NIH • national resource for molecular biology information (biological information direct from organisms) • gather data both nationally and internationally • develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease

Data sources: traditional literature and data obtained from the direct study of organisms. The information landscape in biological and medical research has grown far beyond literature to include a wide variety of databases generated by research fields such as molecular biology and genomics. NCBI: • accepts submissions of bibliographic records and primary research data (for example, the nucleotide sequence for the colon cancer gene MLH1) • organizes the information into databases, maintains them, and makes them available to the world • develops software to retrieve and analyze the data • conducts basic research to make new biological discoveries using the databases and software tools. Figure 1 from Geer RC. Broad issues to consider for library involvement in bioinformatics. J Med Libr Assoc. 2006 Jul;94(3):286–98, E152–5. PMID: 16888662

What does NCBI do? • NCBI accepts submissions of primary data • NCBI develops tools to analyze these data • NCBI uses these tools to create derivative databases based on the primary data • NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system

Web Access: www.ncbi.nlm.nih.gov • Text query: Entrez • Sequence query: BLAST • Protein structure query: VAST • Small-molecule structure query: PubChem

The NCBI ftp site 30,000 files per day 620 Gigabytes per day

Help for Programmers • NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into your own programs. • Three main parts: Data Model, Data Encoding, and Programming Libraries. • Examples: BLAST, Cn3D, Sequin, data format conversion scripts • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi • E-Utilities: Guidelines for Entrez “URL calls” used to access data. • Designed for use in scripts. • Examples: ESearch, EPost, ESummary, EFetch, and ELink • http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html • Caution: Overuse may result in blocked IPs!
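The caution about blocked IPs suggests spacing out scripted requests. The following is a minimal sketch of one way to do that. The class name `Throttle` and the 3-second default interval are illustrative assumptions, not limits stated on the slide; check the current E-Utilities guidelines for the actual policy.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (hypothetical helper)."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; assumed value, not an NCBI rule
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous call, None at start

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A script would call `throttle.wait()` immediately before each request, so the pacing lives in one place rather than being scattered through the code.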

Global Entrez Search Page All[Filter]

What is Entrez? • A system of 31 linked databases • A text search engine • A tool for finding biologically linked data • A retrieval engine • A virtual workspace for manipulating large datasets

Entrez Databases • Each record is assigned a UID • unique integer identifier for internal tracking • GI number for Nucleotide • Each record is given a Document Summary • a summary of the record’s content (DocSum) • Each record is assigned links to biologically related UIDs • Each record is indexed by data fields • [author], [title], [organism], and many others

Linking in Entrez: follow links to related data in the same database or in others! • Hard Links: Curated links based on biology • nucleotide → taxonomy (based on organism identifier) • protein → domain relatives (based on domain assignment) • domains → pubmed (based on supporting literature) • pcsubstance → structures/mmdb (based on source information) • Soft Links: Pre-computed analyses • nucleotide → related sequences (BLAST neighbors) • protein → conserved domains (CDD/RPS-BLAST search) • pccompound → pccompound (structure-based neighboring)

Entrez: Database Integration [Figure: Entrez databases and how records are neighbored. PubMed abstracts (word weight) • 3-D Structure (VAST, related structures) • Taxonomy and Genomes (phylogeny) • Protein sequences (BLAST, related sequences, BLink, domains) • Nucleotide sequences (BLAST, related sequences) • hard links between the databases]

Links: Database Integration at NCBI [Figure: links among the Gene, Nucleotide, Protein, Structure, CDD, SNP, Taxonomy, and PubMed databases]

Types of Databases • Primary Databases • Original submissions by experimentalists • Content controlled by the submitter • Examples: GenBank, dbSNP, GEO, PubChem Substance, and PubChem BioAssay • Derivative Databases • Built from primary data • Content controlled by third party (NCBI) • Examples: RefSeq, RefSNP, GEO Datasets, PubChem Compound

An Entrez Database - Nucleotide • GenBank: Primary Data (98.2%) • original submissions by experimentalists • submitters retain editorial control of records • archival in nature • RefSeq: Derivative Data (1.8%) • curated by NCBI staff • NCBI retains editorial control of records • record content is updated continually

Literature Databases

NM_000249: PubMed Books

Books Link

A part of the NCBI Bookshelf • Part 1. The Databases • Part 2. Data Flow and Processing • Part 3. Querying and Linking the Data • Part 4. User Support

PubMed Central PubMed Central is a digital archive of life sciences journal literature. Integrated into the Entrez retrieval system, PMC provides free and unrestricted access to the full text of over 160 life sciences journals, with more to come.

NCBI Journal Database Detailed journal information

OMIM - A catalogue of genes involved with human disease processes - Detailed clinical and reference information - Curated and maintained by Johns Hopkins - Links to PubMed and sequence databases

Primary vs. Derivative Databases [Figure: Sequencing centers and labs submit sequences to GenBank (divisions EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT), which is updated ONLY by submitters. Algorithms and curators build the derivative databases, which are updated continually by NCBI: UniGene, UniSTS, and RefSeq (annotation pipeline; gene and genomes pipelines)]

What is GenBank? NCBI’s Primary Sequence Database • Nucleotide-only sequence database • Archival in nature • Each record is assigned a stable accession number • GenBank Data • Direct submissions (traditional records) • Batch submissions (EST, GSS, STS) • ftp accounts (genome data) • Three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory (EMBL) Database

The International Sequence Database Collaboration [Figure: three partner databases, each accepting submissions and updates. GenBank (NCBI, NIH): submissions via Sequin, BankIt, and ftp; retrieval via Entrez • EMBL (EBI): retrieval via SRS • DDBJ (CIB, NIG): retrieval via getentry]

GenBank Releases (non-WGS): Release 156, October 2006 • 62,765,195 records • 66,925,938,907 nucleotides • >150,000 species • 245 Gigabytes • 1032 files • full release every two months • incremental and cumulative updates daily • available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/

The Growth of GenBank (Release 152) • WGS: 63.2 billion bases • Non-WGS: 59.8 billion bases

GenBank Divisions
Traditional: PRI Primate • ROD Rodent • PLN Plant and Fungal • BCT Bacterial/Archaeal • VRT Other Vertebrate • INV Invertebrate • VRL Viral • MAM Mammalian • PHG Phage • SYN Synthetic • UNA Unannotated
• Direct submissions (Sequin/BankIt) • Accurate (~1 error per 10,000 bp) • Well characterized • Organized by taxonomy
Bulk: EST Expressed Sequence Tag • GSS Genome Survey Sequence • HTG High Throughput Genomic • PAT Patent sequences • STS Sequence Tagged Site • HTC High Throughput cDNA • CON Constructed entries
• From sequencing projects • Batch submissions (ftp/email) • Inaccurate • Poorly characterized • Organized by sequence type

Entrez Nucleotide Subsets • CoreNucleotide 29,225,247 • EST 39,288,168 • GSS 15,655,087 • TOTAL 84,168,502

A Traditional GenBank Record: The Flatfile Format
Sections: Header • Feature Table • Sequence

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
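The LOCUS line at the top of the header packs several fields into one row. As a rough sketch, it can be split on whitespace when it has the eight-token shape shown in this record. The function name `parse_locus` and the dictionary keys are illustrative assumptions, and real records may use other LOCUS layouts that this sketch rejects.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_locus(line):
    """Parse a LOCUS line shaped like the AY182241 example (hypothetical helper)."""
    parts = line.split()
    # Expected tokens: LOCUS, name, length, 'bp', molecule, topology, division, date
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: %r" % line)
    day, month, year = parts[7].split("-")
    return {
        "name": parts[1],
        "length": int(parts[2]),       # sequence length in base pairs
        "mol_type": parts[4],          # e.g. mRNA
        "topology": parts[5],          # e.g. linear
        "division": parts[6],          # GenBank division code, e.g. PLN
        "date": date(int(year), MONTHS.index(month.upper()) + 1, int(day)),
    }


record = parse_locus("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(record["name"], record["length"], record["division"], record["date"])
```

For production work an established parser would be a safer choice than hand-splitting lines; the sketch is only meant to show how the header fields map onto the text.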

An Example Record – M17755: Indexing for Nucleotide UID 4680720
Field: Indexed Terms
[primary accession]: M17755
[title]: Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]: Homo sapiens
[sequence length]: 3060
[modification date]: 1999/04/26
[properties]: biomol mrna • gbdiv pri • srcdb genbank
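Field indexing of this kind can be pictured as an inverted index that maps each (field, term) pair to the set of UIDs containing it. The sketch below is a toy model for illustration only; the class `FieldIndex` and its methods are assumptions and say nothing about how Entrez is actually implemented. The UID and terms come from the M17755 slide.

```python
from collections import defaultdict


class FieldIndex:
    """Toy inverted index: (field, term) -> set of UIDs (hypothetical model)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, uid, field, term):
        # Normalize case so lookups are case-insensitive.
        self._postings[(field.lower(), term.lower())].add(uid)

    def lookup(self, term, field):
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get((field.lower(), term.lower()), set()))


index = FieldIndex()
for field, term in [
    ("primary accession", "M17755"),
    ("organism", "Homo sapiens"),
    ("sequence length", "3060"),
    ("modification date", "1999/04/26"),
    ("properties", "biomol mrna"),
    ("properties", "gbdiv pri"),
    ("properties", "srcdb genbank"),
]:
    index.add(4680720, field, term)

print(index.lookup("homo sapiens", "organism"))
```

A field-limited query such as `Homo sapiens[organism]` then corresponds to a single lookup, and Boolean operators correspond to set operations on the returned UID sets.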

M17755: Feature Table • TPO [gene name] • thyroiditis [text word] • thyroid peroxidase [protein name] • CDS position in bp • protein accession

Sequence: 99.99% Accurate The sequence itself is not indexed… Use BLAST for that!

Entrez Protein • GenPept (DDBJ, EMBL, GenBank) 6,259,705 • RefSeq 2,997,502 • Swiss-Prot 236,666 • PDB 86,934 • PIR 30,413 • PRF 12,079 • Third Party Annotation 4,969 • Total 9,628,271

Protein Sources and Links • PIR: no mRNA! • RefSeq → NM_000547 • SWISS-PROT: no mRNA! • GenPept → M17755

Sequence Revisions • First seen at NCBI, not first seen at GenBank! • Version and GI change only if the sequence changes • The accession number always retrieves the most recent version
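The versioning rules can be sketched in a few lines. In this illustration, `split_accession` and `resolve` are hypothetical helpers, and the version table is hand-built from the AY182241 record: GI 32265057 belongs to AY182241.2, and GI 27804758 is assumed to belong to version 1 because the record's COMMENT says it was replaced.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    if not accession or (dot and not version.isdigit()):
        raise ValueError("malformed identifier: %r" % identifier)
    return accession, (int(version) if dot else None)


def resolve(identifier, versions):
    """Return (accession, version, gi); a bare accession gets the newest version."""
    accession, version = split_accession(identifier)
    known = versions[accession]      # mapping of version number -> GI
    if version is None:
        version = max(known)         # accession alone retrieves the latest version
    return accession, version, known[version]


# Hand-built table; the version 1 GI is an inference from the COMMENT line.
VERSIONS = {"AY182241": {1: 27804758, 2: 32265057}}

print(resolve("AY182241", VERSIONS))
print(resolve("AY182241.1", VERSIONS))
```

The design mirrors the slide: a sequence change adds a new version with a new GI, while the accession itself stays stable.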

Update without a Sequence Change June 15, 1989! GenBank came to NCBI in 1992!

Update with a Sequence Change

GenBank File Formats • ASN.1 (the raw data) • flat file • XML • FASTA
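Of these formats, FASTA is the simplest: a header line beginning with ">" followed by the sequence. As a hedged sketch, the ORIGIN block of a flat file can be turned into FASTA by dropping the position numbers and spaces. The function name `origin_to_fasta`, the bare accession.version header, and the 60-column wrap are assumptions made for illustration; NCBI's own FASTA headers carry more information.

```python
def origin_to_fasta(header, origin_lines, width=60):
    """Convert GenBank ORIGIN lines to a FASTA string (hypothetical helper)."""
    residues = []
    for line in origin_lines:
        # Keep letters only: this drops the leading position and the spaces
        # that separate the 10-base groups.
        residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues)
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"


# First two ORIGIN lines of AY182241 as shown in the flat file example.
lines = [
    "        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat",
    "       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg",
]
print(origin_to_fasta("AY182241.2", lines))
```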

NCBI Toolbox: Toolbox Sources

ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input", "stdin", NULL, NULL, TRUE, 'a', ARG_FILE_IN, 0.0, 0, NULL},
    {"Input is a Seq-entry", "F", NULL, NULL, TRUE, 'e', ARG_BOOLEAN, 0.0, 0, NULL},
    {"Input asnfile in binary mode", "F", NULL, NULL, TRUE, 'b', ARG_BOOLEAN, 0.0, 0, NULL},
    {"Output Filename", "stdout", NULL, NULL, TRUE, 'o', ARG_FILE_OUT, 0.0, 0, NULL},
    {"Show Sequence?", "T", NULL, NULL, TRUE, 'h', ARG_BOOLEAN, 0.0, 0, NULL},

Text Queries in Entrez • Simple: term1 term2 • General form: term1[limit] OP term2[limit] OP … • limit = Entrez indexing field (organism, author, …) • OP = Boolean operator = AND, OR, NOT • Wildcards: cancer[title] vs. cancer*[title] • Ranges: 1:200[MW] • Complex queries: ((A[limit1] OR B[limit2]) AND C[limit3]) NOT D[limit4]
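The query grammar above lends itself to small helper functions that assemble query strings. This is a minimal sketch; the names `term` and `combine` are hypothetical, and quoting multi-word phrases is a choice made here for clarity rather than something the slide requires.

```python
def term(text, field=None):
    """Build one query term, with an optional [field] limit (hypothetical helper)."""
    # Quote multi-word phrases so they are treated as a unit (an assumption).
    quoted = '"%s"' % text if " " in text else text
    return "%s[%s]" % (quoted, field) if field else quoted


def combine(op, *parts):
    """Join two or more sub-queries with a Boolean operator, in parentheses."""
    if op not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    if len(parts) < 2:
        raise ValueError("need at least two sub-queries")
    return "(" + (" %s " % op).join(parts) + ")"


# The slide's complex-query pattern, built from the inside out.
query = combine(
    "NOT",
    combine("AND", combine("OR", "A[limit1]", "B[limit2]"), "C[limit3]"),
    "D[limit4]",
)
print(query)
print(combine("AND", term("Homo sapiens", "Organism"), term("cancer*", "title")))
```

Building queries this way keeps the parentheses balanced, which matters once several operators are nested.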

Entrez Tabs • Limits: Provides a simple form for applying commonly used Entrez limits • Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries • History: Provides access to previous searches in the current Entrez database • Clipboard: A temporary storage area for selected records • Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches

Programming Entrez: E-Utilities (http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html) • ESearch: Entrez query → UID list or History • ESummary: UID list or History → Document summaries • EFetch: UID list or History → Formatted data • ELink: UID list or History → UID list or History • EPost: UID list → History
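Since the E-Utilities are plain URL calls, a script mostly needs to assemble query strings. The sketch below builds ESearch and EFetch URLs without sending any request. The base URL and the helper names are assumptions not taken from the slide, which cites only the help page; confirm the endpoint and parameters against the current E-Utilities documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL; verify against the E-Utilities help page before use.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, query, retmax=20, use_history=False):
    """Build an ESearch URL: Entrez query in, UID list (or History) out."""
    params = {"db": db, "term": query, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"   # ask the server to keep results in History
    return EUTILS_BASE + "/esearch.fcgi?" + urlencode(params)


def efetch_url(db, uids, rettype="gb", retmode="text"):
    """Build an EFetch URL: UID list in, formatted records out."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return EUTILS_BASE + "/efetch.fcgi?" + urlencode(params)


print(esearch_url("nucleotide", "human[organism] AND thyroid peroxidase"))
print(efetch_url("nucleotide", [4680720]))
```

Keeping URL construction separate from the network call makes it easy to log, test, and rate-limit requests.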

Finding Primary Sequences • Search Entrez CoreNucleotide • 94.8% GenBank (primary data) • 5.2% RefSeq (curated data)
Possible queries we’ve seen so far: • M17755 [primary accession] • TPO [gene name] • thyroid peroxidase [title] • thyroiditis [text word] • Homo sapiens [organism] • thyroid peroxidase [protein name] • 3060 [sequence length] • 1999/04/26 [modification date] • biomol mrna [properties] • gbdiv pri [properties] • srcdb genbank [properties]

A Starting Query: find nucleotide records for human thyroid peroxidase
• Query: human thyroid peroxidase (276 records). Parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
• With a field limit: human[organism] AND thyroid peroxidase (262 records). Parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
• 14 records aren’t human sequences!!
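The comparison between the two queries can also be done in a script by reading the record count from each search result. The sketch below parses ESearch-style XML with the standard library. The two XML strings are hand-written stand-ins that reuse the counts from the slide, not captured server output, and `parse_esearch` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parse_esearch(xml_text):
    """Return (count, uid_list) from an ESearch-style XML result."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return count, uids


# Hand-written stand-ins using the counts quoted on the slide (not real output).
BROAD = ("<eSearchResult><Count>276</Count><RetMax>0</RetMax>"
         "<RetStart>0</RetStart><IdList/></eSearchResult>")
LIMITED = ("<eSearchResult><Count>262</Count><RetMax>0</RetMax>"
           "<RetStart>0</RetStart><IdList/></eSearchResult>")

broad_count, _ = parse_esearch(BROAD)
limited_count, _ = parse_esearch(LIMITED)
print("records lost to the field limit:", broad_count - limited_count)
```

To list those extra records rather than just count them, one option is to run a query that combines the broad search with NOT human[organism].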
