Evolution of Sequence Databases: A Comprehensive Overview of Nucleotide and Protein Data

Sequence databases and retrieval systems • Guy Perrière [replaced by Manolo Gouy] • Pôle Bio-Informatique Lyonnais, Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS n° 5558, Université Claude Bernard – Lyon 1

In the beginning • First paper compilation in 1965 (Atlas of Protein Sequence and Structure). • Development of real databanks at the beginning of the 1980s: • Fast access. • They make possible analyses that require a lot of data: • Codon usage. • Molecular phylogeny.

General databanks • Nucleotide sequences: • EMBL/GenBank/DDBJ. • Protein sequences: • Simple translations of coding regions: • GenPept (from GenBank). • TrEMBL (from EMBL). • Systems containing additional data: • SWISS-PROT. • PIR.

EMBL • Created in 1980 at the European Molecular Biology Laboratory in Heidelberg. • Maintained since 1994 at the European Bioinformatics Institute (EBI) near Cambridge. • Web server: http://www.ebi.ac.uk/embl

GenBank • Set up in 1979 at the Los Alamos National Laboratory in New Mexico, US. • Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in Bethesda. • Web server: http://www.ncbi.nlm.nih.gov/Genbank/index.html

DDBJ • Active since 1984 at the National Institute of Genetics (NIG) in Mishima, Japan. • Web server: http://www.ddbj.nig.ac.jp

EMBL / GenBank / DDBJ • The International Nucleotide Sequence Database Collaboration: EMBL / GenBank / DDBJ. • New sequences are exchanged daily between the three centers, so the three banks have identical content. • Data are mainly provided by direct submissions from the authors over the Internet: • Web forms. • Email.

Data growth [Figure: growth of GenBank, EMBL, PIR and SWISS-PROT from 03/83 to 03/03; vertical axis: log (number of residues), from 5 to 11.]

GenBank/EMBL size (April 2003) • 31 × 10^9 nucleotides. • 24 × 10^6 sequences. • 1.8 million genes (proteins and RNA). • 313,000 bibliographic references. • 100 gigabytes on disk. • Growth of 63 % in 12 months.
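The growth figure can be turned into a doubling time. The short sketch below assumes that the 63 % yearly growth stays constant (exponential growth), which is an assumption made for illustration and not a claim from the slide; the function name is hypothetical.

```python
import math


def doubling_time_months(annual_growth: float) -> float:
    """Months needed to double in size at a constant yearly growth rate.

    annual_growth is a fraction, e.g. 0.63 for 63 % per year.
    """
    return 12 * math.log(2) / math.log(1 + annual_growth)


months = doubling_time_months(0.63)
print(f"Doubling time at 63 % per year: {months:.1f} months")
```

Under that assumption the banks would double in size in roughly 17 months.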

Taxonomic sampling (April 2003) • There are 135,560 species for which at least one sequence is available. • Nine species (0.007 %) correspond to 62 % of the total. • 77,900 species are represented by only one sequence! • The nine most represented species in GenBank/EMBL: Homo sapiens (27.3 %), Mus musculus (20.1 %), Zea mays (3.0 %), Rattus norvegicus (2.9 %), Brassica oleracea (2.3 %), Arabidopsis thaliana (2.0 %), Danio rerio (2.0 %), Drosophila melanogaster (1.4 %), Oryza sativa (0.9 %).

Distribution format • The banks are distributed as a set of text files called divisions (292 for EMBL). • A division contains sequences related to: • A taxon (e.g., bacteria, invertebrates, mammals). • A class of sequences (EST, HTG, GSS). • Within a division, each sequence is called an entry.

Entry structure • Information is introduced in structured fields. • The format differs in its form between EMBL and GenBank/DDBJ … • but not in substance.

ID, AC, SV and DT fields • These contain the identifiers and the creation and last-modification dates of the entry.

    ID   BSAMYL     standard; DNA; PRO; 2680 BP.
    XX
    AC   V00101; J01547
    XX
    SV   V00101.1
    XX
    DT   13-JUL-1983 (Rel. 03, Created)
    DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
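Because every line of an entry starts with a two-letter code, the fields can be collected with very little code. The following is a minimal sketch, assuming that the code occupies the first two columns and that XX lines are only spacers; the function name is hypothetical and this is not the parser used by any of the databanks.

```python
from collections import defaultdict

ENTRY_TEXT = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
"""


def group_by_line_code(text):
    """Map each two-letter line code to the list of its content strings."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():  # spacer or blank line
            continue
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(ENTRY_TEXT)
# Accession numbers are separated by semicolons on the AC line.
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
entry_name = fields["ID"][0].split()[0]
print(entry_name, accessions, fields["SV"][0], len(fields["DT"]))
```

Keeping a list per code preserves repeated lines such as the two DT lines.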

DE, KW, OS and OC fields • Definition, keywords and taxonomy.

    DE   Bacillus subtilis amylase gene.
    XX
    KW   amyE gene; amylase; amylase-alpha;
    KW   regulatory region; signal peptide.
    XX
    OS   Bacillus subtilis
    OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
    OC   Bacillus/Staphylococcus group; Bacillus.

• The NCBI maintains a unified taxonomy, largely based on sequence information.

RN, RX, RA and RT fields • These contain bibliographic information.

    RN   [1]
    RP   1-2680
    RX   MEDLINE; 83143299.
    RA   Yang M., Galizzi A., Henner D.J.;
    RT   "Nucleotide sequence of the amylase gene from
    RT   Bacillus subtilis";
    RL   Nucleic Acids Res. 11:237-249(1983).
    …

FT field • Contains the descriptions of functional regions. Each feature has a key, a location and qualifiers.

    FT   promoter        369..374
    FT                   /note="put. promoter sequence P2 [3] (amyR1)"
    FT   RBS             414..419
    FT                   /note="rRNA-binding site rbs-1 [3]"
    FT   CDS             498..2480
    FT                   /gene="amyE"
    FT                   /db_xref="SWISS-PROT:P00691"
    FT                   /product="alpha-amylase precursor"
    FT                   /EC_number="3.2.1.1"
    FT                   /protein_id="CAA23437.1"
    FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
    FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG
    ...

Intron/exon structure • A subsequence of the entry can be described by joining several intervals of the sequence.

    FT   CDS             join(242..610,3397..3542,5100..5351)
    FT                   /codon_start=1
    FT                   /db_xref="SWISS-PROT:P01308"
    FT                   /note="precursor"
    FT                   /gene="INS"
    FT                   /product="insulin"
    ...
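A location such as join(242..610,3397..3542,5100..5351) can be decomposed into its intervals with a regular expression. This is a sketch for simple locations only: it assumes plain start..end intervals and ignores complement(), partial ends written with < or >, and references to other entries. The function names are hypothetical.

```python
import re


def parse_location(location):
    """Return the (start, end) pairs found in a simple location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the segments (inclusive coordinates)."""
    return sum(end - start + 1 for start, end in segments)


exons = parse_location("join(242..610,3397..3542,5100..5351)")
single = parse_location("498..2480")
print(exons, spliced_length(exons))
print(single, spliced_length(single))
```

Coordinates are 1-based and inclusive, hence the "+ 1" when computing lengths.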

SQ field • Contains the sequence itself.

    SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
         gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
         agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
         ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
         gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
         (...)
         gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
         tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
         cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
    //
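The SQ header carries the length and base composition, so it can serve as a consistency check on the sequence lines. The sketch below is only an illustration: the function names are hypothetical and the regular expressions assume the header layout shown in this entry.

```python
import re

SQ_HEADER = "SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;"


def parse_sq_header(line):
    """Return (declared_length, {base: count}) from an SQ header line."""
    length = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    pairs = re.findall(r"(\d+)\s+(A|C|G|T|other);", line)
    counts = {name: int(number) for number, name in pairs}
    return length, counts


def clean_sequence_line(line):
    """Drop blanks and the trailing position counter from one sequence line."""
    return re.sub(r"[^a-zA-Z]", "", line)


length, counts = parse_sq_header(SQ_HEADER)
first_line = clean_sequence_line(
    "gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat 60"
)
print(length, sum(counts.values()), len(first_line))
```

For this entry the declared composition adds up to the declared length, and the first sequence line holds 60 bases, as its counter indicates.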

Errors in databanks • There are a lot of errors in the nucleotide sequence databanks: • In annotations: • Inaccuracies, omissions, and even mistakes. • Inconsistencies between entries. • In the sequences themselves: • Sequencing errors. • Cloning vectors inserted.

Redundancy • Another major problem is redundancy. • A lot of entries are partially or entirely duplicated: • 20% of vertebrate sequences in GenBank. • Duplicated entries often differ in their sequence. [Figure: partial and complete sequence duplications.]

Protein sequence databases • Translation of coding DNA sequences (CDS) from EMBL/GenBank/DDBJ. • Consultation of publications or patents. • Very small number of direct protein sequence submissions by authors. • In SWISS-PROT and PIR: additional annotations.
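Translating a CDS amounts to reading it codon by codon against a genetic code table. The sketch below assumes the standard genetic code and a reading frame starting at the first base; real entries may specify another code or frame through qualifiers, which is not handled here. The DNA string is hypothetical: it was written by hand to spell the first eight residues of the amylase translation shown earlier and is not taken from the entry.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS up to the first stop codon; unknown codons give X."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical CDS fragment followed by a stop codon.
protein = translate("ATGTTTGCAAAACGATTCAAAACCTAA")
print(protein)
```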

SWISS-PROT • Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva. • Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio, and, very recently, by NIH. • Web server: http://www.expasy.ch/sprot/sprot-top.html

SWISS-PROT characteristics • Almost no redundancy. • Cross-references with 60 other databanks. • High-quality annotations: • Systematic control by a team of annotators. • Help from a set of > 200 volunteer experts. • Embedded in ExPASy, a WWW proteomics server (http://www.expasy.org).

Annotations • Protein function. • Post-translational modifications. • Structural or functional domains. • Secondary and quaternary structures. • Similarities with other proteins. • Conflicts between positions for CDS. • Disease-related mutations.

Associated databanks • TrEMBL, built using only annotated CDS from the EMBL data library. • ENZYME, for the international enzyme nomenclature. • PROSITE, for biologically significant sites, patterns and profiles. • SWISS-2DPAGE, for two-dimensional polyacrylamide gel electrophoresis maps.

PIR • PIR (the Protein Information Resource) grew out of the Atlas of Protein Sequence and Structure, first compiled by Margaret Dayhoff in 1965. • Aims: • To provide exhaustive and non-redundant protein sequence data. • To give a classification using taxonomic and similarity data: entries grouped in superfamilies, families and subfamilies.

Data maintenance • Three organizations collect and organize the data introduced in PIR: • The National Biomedical Research Foundation (NBRF) in the United States. • The Munich Information Center for Protein Sequences (MIPS) in Martinsried, Germany. • The Japan International Protein Sequence Information Database (JIPID) in Japan.

Results • Coverage is no better than that of SWISS-PROT + TrEMBL. • Still contains redundancy. • Less comprehensive annotation. • Low number of cross-references. • PIR has recently joined forces with EBI and SIB to establish UniProt (the Universal Protein Resource), a central resource of protein sequence and function.

Specialized databanks • A lot of specialized databanks have been developed, which are devoted to: • Complete genomes. • Families of homologous genes. • Non-sequence data. • These systems are under the responsibility of curators: • Data quality and homogeneity control.

Complete genomes • There is a large number of databanks devoted to specific organisms. • These banks are associated with sequencing or mapping projects. • For some model organisms there are often several competing systems.

Examples (organism: available databanks)
• Bacillus subtilis: NRSub (Non-Redundant B. subtilis), SubtiList.
• Escherichia coli: Colibri, EcoGene (E. coli Gene Database), ECDC (E. coli Database Collection).
• Various prokaryotes: CMR (Comprehensive Microbial Resource), EMGLib (Enhanced Microbial Genomes Library), Micado (Microbial Advanced Database Organization).
• Saccharomyces cerevisiae: MYGD (MIPS Yeast Genome Database), SGD (Saccharomyces Genome Database), YPD (Yeast Proteome Database).
• Drosophila melanogaster: FlyBase.
• Plasmodium falciparum: PlasmoDB (P. falciparum Database).
• Caenorhabditis elegans: WormBase, WormPD (Worm Protein Database).
• Arabidopsis thaliana: TAIR (The Arabidopsis Information Resource).

Gene family databanks • Built with automated procedures: • Similarity search between sets of proteins (BLASTP, FASTP, Smith-Waterman). • Clustering into homologous families using similarity criteria. • Include various data: • Protein (and sometimes nucleotide) sequences. • Multiple sequence alignments and trees. • Taxonomy.
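The clustering step can be sketched with single linkage: two proteins end up in the same family as soon as a chain of sufficiently similar pairs connects them. This is only one possible criterion, chosen here for brevity; the databanks presented in this section apply their own rules. Identifiers, scores and the threshold below are hypothetical.

```python
def cluster_families(proteins, similar_pairs):
    """Group proteins into families by single linkage using union-find."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical pairwise similarity scores (fraction of identical residues).
hits = [("P1", "P2", 0.82), ("P2", "P3", 0.55), ("P4", "P5", 0.91), ("P3", "P4", 0.20)]
threshold = 0.50
pairs = [(a, b) for a, b, score in hits if score >= threshold]
families = cluster_families(["P1", "P2", "P3", "P4", "P5", "P6"], pairs)
print(families)
```

Here P1 and P3 are grouped through P2 even though they were never compared directly, which is the characteristic behaviour (and the main weakness) of single linkage.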

ProtFam • Developed at MIPS. • Built with PIR sequences. • Includes four levels of classification: • Superfamilies (based on function and similarity criteria). • Families (50% similarity). • Subfamilies (80% similarity). • Entries (≥95% similarity).

ProtFam characteristics • Allows visualization of alignments and dendrograms for the families. • Integrates Pfam domains. • Allows users to classify their own protein sequences. • Web server: http://mips.gsf.de

ProtoMap • Initially developed at the Hebrew University of Jerusalem; now hosted at Cornell University. • Built with SWISS-PROT & TrEMBL sequences. • Combines 3 sequence similarity measures (BLASTP, FASTA and Smith-Waterman).

ProtoMap characteristics • Alignments and trees are visualized with Java applets. • Users can submit sequences and classify them. • Web server: http://protomap.cornell.edu/index.html

Specialized systems • HOVERGEN (Homologous Vertebrate Genes Database): • Based on GenBank CDS. • HOBACGEN (Homologous Bacterial Genes Database) for prokaryotes and yeast: • Based on SWISS-PROT/TrEMBL. • HOBACGEN-CG for completely sequenced genomes: • Based on SWISS-PROT/TrEMBL.

Other specialized systems • COG (Clusters of Orthologous Groups), also for complete genomes: • Based on GenBank CDS. • NuReBase (Nuclear Receptors Database) for mammalian nuclear receptors: • Based on EMBL CDS. • RTKdb (Tyrosine Kinase Receptors): • Based on EMBL CDS.

Are COGs real orthologs? [Figure: phylogenetic tree of the glutamate synthase large subunit, with bootstrap support values on the branches and a "Reciprocal best BLAST hit" annotation. Leaves include GLTB_ECOLI, GLTB_BACSU, GLTB_SYNY3, GLTS_SYNY3 and GLT1_YEAST. Species legend: Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae, Synechocystis sp.]
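The reciprocal best hit criterion mentioned on this slide can be sketched for two genomes as follows. This is a pairwise simplification written for illustration, not the procedure used to build COGs; gene names and scores are hypothetical.

```python
def best_hits(scores):
    """For each query, keep the subject with the highest score."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b1"): 500, ("a2", "b2"): 450}
b_vs_a = {("b1", "a1"): 880, ("b1", "a2"): 510, ("b2", "a2"): 440, ("b2", "a1"): 310}
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In this toy case a2 prefers b1 while b2 prefers a2, so neither pair is reciprocal. A tree such as the one on the slide is what tells whether a reciprocal pair really consists of orthologs.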

Beyond protein families • ProtFam, HOVERGEN, HOBACGEN and COG gather protein sequences that are homologous over their whole length. • Patterns, profiles, domains, … are covered in Terry Attwood’s lecture.

Non-sequence data (data: available systems)
• Gene expression: GXD (Mouse Gene Expression Database), The Stanford Microarray Database.
• Mapping: GDB (Genome Data Base), EMG (Encyclopedia of Mouse Genome), MGD (Mouse Genome Database), INE (Integrated Rice Genome Explorer).
• Protein quantification: SWISS-2DPAGE, PDD (Protein Disease Database), Sub2D (B. subtilis 2D Protein Index).
• 3D structures: PDB (Protein Data Bank), MMDB (Molecular Modelling Data Base), NRL_3D (Non-Redundant Library of 3D Structures), SCOP (Structural Classification of Proteins).
• Polymorphism: ALFRED (Allele Frequency Database).
• Molecular interactions: DIP (Database of Interacting Proteins), BIND (Biomolecular Interaction Network Database).

Sequence data retrieval • Done mainly over the Internet: • With client software (e.g., Entrez, HobacFetch). • By remote connections to servers providing on-line access to the banks (INFOBIOGEN). • Using World-Wide Web servers and browsers.

Advantages and limitations • Users do not have to cope with the usual database problems: • Storing of large amounts of data. • Daily updates. • Software upgrades. • Simplicity of use. • Net access is sometimes very slow at peak hours: • Consider using other servers besides NCBI.

The ACNUC retrieval system • Direct access to functional regions described in feature tables (CDS, tRNA, rRNA). • Selection of entries using various criteria: • Sequence names and accession numbers. • Bibliographic criteria. • Keywords. • Taxonomy. • Organelle. • Developed at Lyon University.
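Selecting entries on several criteria at once comes down to intersecting the sets of entries that satisfy each criterion. The sketch below illustrates the idea on a tiny in-memory index; it is not ACNUC's implementation or query language. The index layout, the function name and the entries HYPO1 and HYPO2 are hypothetical, while the BSAMYL annotations are taken from the entry shown earlier.

```python
# Hypothetical in-memory index: entry name -> annotations.
ENTRIES = {
    "BSAMYL": {
        "keywords": {"amylase", "signal peptide"},
        "taxonomy": ["Bacteria", "Firmicutes", "Bacillus"],
        "features": {"CDS", "promoter", "RBS"},
    },
    "HYPO1": {
        "keywords": {"amylase"},
        "taxonomy": ["Eukaryota", "Metazoa"],
        "features": {"CDS"},
    },
    "HYPO2": {
        "keywords": {"tRNA"},
        "taxonomy": ["Bacteria", "Proteobacteria"],
        "features": {"tRNA"},
    },
}


def select(entries, keyword=None, taxon=None, feature=None):
    """Return the sorted names of entries matching every criterion given."""
    names = set(entries)
    if keyword is not None:
        names &= {n for n, e in entries.items() if keyword in e["keywords"]}
    if taxon is not None:
        names &= {n for n, e in entries.items() if taxon in e["taxonomy"]}
    if feature is not None:
        names &= {n for n, e in entries.items() if feature in e["features"]}
    return sorted(names)


print(select(ENTRIES, keyword="amylase"))
print(select(ENTRIES, keyword="amylase", taxon="Bacteria"))
print(select(ENTRIES, taxon="Bacteria", feature="tRNA"))
```

Matching a taxon against the whole lineage, rather than the species name alone, is what makes a query on a higher-level taxon such as Bacteria possible.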

ACNUC: access options • Graphical interface distributed along with the databases themselves: http://pbil.univ-lyon1.fr/databases/acnuc.html • Web access at Pôle Bio-Informatique Lyonnais (PBIL): http://pbil.univ-lyon1.fr/search/query.html

ACNUC characteristics • Can query any bank in PIR, SWISS-PROT, EMBL, or GenBank formats. • Keywords and species browsing. • Complex queries. • Links with sequence analysis programs on the Web server (alignment, codon usage).

The Query form

Building queries to the sequence databases
