
BINF6201/8201 Biological Sequence Databases 09-30-2010


mikaia

Presentation Transcript


  1. BINF6201/8201 Biological Sequence Databases 09-30-2010

  2. Types of Biological Sequence Databases
     Bioinformatics analyses rely on various information in public databases.
     Biological sequence databases:
     • General purpose sequence databases
       --- Nucleic acid sequence databases
       --- Protein sequence databases
     • Special purpose sequence databases
       --- Protein sequence motif and domain databases
       --- Protein structure databases
       --- Specialized genome databases
       --- Organism-specific databases

  3. Nucleotide Sequence Databases
     International Nucleotide Sequence Database (INSD):
     • GenBank at NCBI
     • European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI)
     • DNA Data Bank of Japan (DDBJ)
     Features:
     • All published nucleotide sequences are requested to be deposited in one of these three databases;
     • Data are exchanged among the three databases on a daily basis;
     • Sequences stored in GenBank at NCBI can be downloaded by anonymous FTP at ftp://ftp.ncbi.nih.gov.

  4. Nucleotide Sequence Databases
     • DNA sequence file formats: FASTA format

     >gi|46048570|ref|NM_203332.1| Rattus norvegicus glutamine/glutamic acid-rich protein A (Grpcb), mRNA
     CAGTTGGGAGAAAGTCCATAGACTCCTCCAAGATGCTTGTGGTCCTGCTCACAGCAGCCTTGCTGGCTCT
     GAGCTCAGCTCAGGGCACGGATGAAGAGGTCAACAATGCTGAGACCAGTGATGTACCAGCAGATTCTGAA
     CAACAACCCGTGGACTCGGGTTCAGATCCACCTTCTGCTGATGCAGATGCAGAGAATGTTCAAGAGGGTG
     AATCAGCCCCACCAGCAAATGAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGA
     ACCCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGAA
     CCCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGCAAC
     CCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGAATC
     CACACAGGCAGAGAATCAAGAGCCTTCTGACTCTGCTGGGGAAGGACAGGAAACTCAACCTGAGGAAGGA
     AATGTAGAGTCACCTCCCTCTTCTCCTGAAAACTCACAAGAACAACCACAGCAAACAAATCCAGAGGAGA
     AACCGCCTGCTCCTAAGACTCAGGAAGAGCCACAGCACTATAGAGGTCGTCCTCCAAAGAAGATTTTTCC
     TTTTTTCATTTACAGAGGAAGACCAGTAGTAGTATTCAGGCTCGAGCCTAGGAATCCATTCGCCAGAAGA
     TTTTAGAGAGTACCTGAGAAGATTATGACCTTCAGATGTGTAGGTCAACAAATCACTGTTGATTGTCTAT
     AATATTCCAATAAAAATTTTCAGCATGC
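To make the FASTA layout concrete, here is a minimal Python sketch that reads FASTA text and splits a legacy NCBI header of the form `gi|number|database|accession| description`. The function names `read_fasta` and `parse_ncbi_header` are hypothetical helpers written for this illustration, and the example uses only the first two sequence lines of the record shown on the slide.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


def parse_ncbi_header(header: str) -> Dict[str, str]:
    """Split a legacy 'gi|<number>|<db>|<accession>| description' header."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": description,
    }


# First two sequence lines of the slide's record (truncated for brevity).
example = """>gi|46048570|ref|NM_203332.1| Rattus norvegicus glutamine/glutamic acid-rich protein A (Grpcb), mRNA
CAGTTGGGAGAAAGTCCATAGACTCCTCCAAGATGCTTGTGGTCCTGCTCACAGCAGCCTTGCTGGCTCT
GAGCTCAGCTCAGGGCACGGATGAAGAGGTCAACAATGCTGAGACCAGTGATGTACCAGCAGATTCTGAA
"""

records = list(read_fasta(example))
header, sequence = records[0]
info = parse_ncbi_header(header)
print(info["gi"], info["accession"], len(sequence))
```

The parser assumes the four-field `gi|...|ref|...|` layout shown on the slide; headers from other sources may use different field layouts and would need their own handling.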

  5. Nucleotide Sequence Databases • DNA sequence file formats: GenBank format

  6. Protein Sequence Databases
     The NCBI Entrez Database
     • Stores translated protein sequences of nucleotide sequences from GenBank/EMBL/DDBJ;
     • Also incorporates protein sequences from SWISS-PROT and PIR;
     • Each protein record has a unique GenInfo Identifier (gi) number;
     • Contains redundant sequences; thus multiple gi's can be associated with the same sequence;
     • The most complete protein sequence database;
     • Annotated by submitters;
     • Sequences are linked to other NCBI databases (PubMed, taxonomy);
     • NR (non-redundant) is based on Entrez and contains a unique set of these sequences;
     • RefSeq contains non-redundant sequences from major organisms for which sufficient data are available, in particular those with sequenced genomes.

  7. Protein Sequence Databases
     PIR-PSD (Protein Information Resource-Protein Sequence Database)
     • The world's first database of classified and functionally annotated protein sequences; it grew out of the Atlas of Protein Sequence and Structure (1965-1978) edited by Margaret Dayhoff;
     • Produced and distributed by the Protein Information Resource in collaboration with MIPS (Munich Information Center for Protein Sequences) and JIPID (Japan International Protein Information Database);
     • PIR-PSD has been the most comprehensive and expert-curated protein sequence database in the public domain for over 20 years;
     • In 2002, PIR joined EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to form the UniProt consortium;
     • PIR-PSD sequences and annotations have been integrated into the UniProt Knowledgebase.

  8. Protein Sequence Databases
     SWISS-PROT and TrEMBL Databases
     • Swiss-Prot: a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases; originally created by Amos Bairoch at the Swiss Institute of Bioinformatics.
     • TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
     • Since 2002, both SWISS-PROT and TrEMBL have been part of the UniProt databases.

  9. Protein Sequence Databases
     • UniProt databases (Universal Protein Resource)
       --- Created by unifying the information in three well-known protein databases, with funding from the NIH:
       1. PIR (Protein Information Resource)
          --- Heir to the oldest protein sequence database, Margaret Dayhoff's Atlas of Protein Sequence and Structure
       2. SWISS-PROT
          --- Created by Amos Bairoch at the Swiss Institute of Bioinformatics
       3. TrEMBL
       --- Most comprehensive catalogue of information on proteins

  10. Protein Sequence Databases
     UniProt database (Universal Protein Resource)
     • UniProt comprises three components, each optimized for a different purpose:
       1. The UniProt Knowledgebase (UniProtKB)
       2. The UniProt Reference Clusters (UniRef) databases
       3. The UniProt Archive (UniParc)
     • Nucleic Acids Res. 2007 Jan;35(Database issue):D193-7. Epub 2006 Nov 16.

  11. Protein Sequence Databases
     UniProt Knowledgebase (UniProtKB)
     • In addition to capturing the core data mandatory for each UniProt entry, other available information is also added, including biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.
     • The UniProtKB consists of two sections:
       --- UniProtKB/Swiss-Prot: manually annotated records supported by literature and curator-evaluated computational analysis
       --- UniProtKB/TrEMBL: computationally analyzed records that await full manual annotation

  12. Protein Sequence Databases
     UniProt Reference Clusters (UniRef) Databases
     • The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view;
     • The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProt entries, and links to the corresponding UniProt and UniParc records;
     • UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues, so that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence.
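The UniRef100 rule on this slide (merge identical sequences and sub-fragments of 11 or more residues under one representative) can be sketched as a toy Python function. This is an illustration under stated assumptions, not the UniRef production pipeline: the function name `merge_identical_and_fragments`, the accessions P1 to P5, and the sequences are all invented, and a sub-fragment is treated simply as an exact substring of a longer sequence.

```python
from typing import Dict, List

MIN_FRAGMENT_LENGTH = 11  # threshold quoted on the slide


def merge_identical_and_fragments(entries: Dict[str, str]) -> List[dict]:
    """Group accessions whose sequences are identical to, or exact
    sub-fragments of, a longer representative sequence."""
    clusters: List[dict] = []
    # Visit the longest sequences first so that representatives are chosen
    # before their fragments are encountered.
    ordered = sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for accession, seq in ordered:
        if len(seq) < MIN_FRAGMENT_LENGTH:
            continue  # shorter than the slide's 11-residue rule
        for cluster in clusters:
            if seq in cluster["representative_sequence"]:
                cluster["members"].append(accession)
                break
        else:
            # No existing representative contains this sequence.
            clusters.append({
                "representative": accession,
                "representative_sequence": seq,
                "members": [accession],
            })
    return clusters


# Invented accessions and sequences, for illustration only.
entries = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQ",   # identical to P1
    "P3": "AKQRQISFVKSH",             # 12-residue fragment of P1
    "P4": "MSDNELLKQAWERTYPLKH",      # unrelated sequence
    "P5": "AKQRQ",                    # too short to be merged
}
clusters = merge_identical_and_fragments(entries)
for cluster in clusters:
    print(cluster["representative"], cluster["members"])
```

UniRef90 and UniRef50 would additionally need a sequence identity measure and a clustering algorithm, which this sketch deliberately leaves out.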

  13. Protein Sequence Databases
     UniProt Archive (UniParc)
     • UniParc is a comprehensive non-redundant protein sequence collection;
     • Protein sequences are loaded daily from many different publicly accessible sources to include all known sequences whose host is known;
     • While a protein sequence may exist in multiple databases, and even more than once in a given database (with different identifiers), UniParc stores each unique sequence only once and assigns it a unique UniParc identifier;
     • Cross-references back to the source databases are provided, and include source accession numbers, sequence versions, and status (active or obsolete).
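The "store each unique sequence once, keep cross-references to every source" idea can be sketched with a small Python class. Everything here is an assumption made for illustration: the class name `SequenceArchive`, the `ARCH000001`-style identifiers, the use of a SHA-256 checksum as the lookup key, and the accessions in the example are invented and do not reflect how UniParc is actually implemented.

```python
import hashlib
from typing import Dict, List, Tuple


class SequenceArchive:
    """Toy archive: one record per unique sequence, many cross-references."""

    def __init__(self) -> None:
        self._id_by_checksum: Dict[str, str] = {}
        self._xrefs: Dict[str, List[Tuple[str, str, str]]] = {}

    def add(self, sequence: str, source_db: str, accession: str,
            status: str = "active") -> str:
        """Register a source record and return the archive identifier."""
        checksum = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        archive_id = self._id_by_checksum.get(checksum)
        if archive_id is None:
            # First time this sequence is seen: mint a new identifier
            # (hypothetical numbering scheme).
            archive_id = f"ARCH{len(self._id_by_checksum) + 1:06d}"
            self._id_by_checksum[checksum] = archive_id
            self._xrefs[archive_id] = []
        # Always record where the sequence came from.
        self._xrefs[archive_id].append((source_db, accession, status))
        return archive_id

    def cross_references(self, archive_id: str) -> List[Tuple[str, str, str]]:
        return list(self._xrefs[archive_id])


archive = SequenceArchive()
# Invented accessions, for illustration only.
a = archive.add("MKTAYIAKQR", "UniProtKB/Swiss-Prot", "X00001")
b = archive.add("MKTAYIAKQR", "RefSeq", "Y00002")
c = archive.add("MSDNELLKQA", "UniProtKB/TrEMBL", "Z00003", status="obsolete")
print(a, b, c, archive.cross_references(a))
```

Keying on a checksum rather than the raw string keeps the lookup table small for long sequences; a production system would also have to consider checksum collisions and sequence versions.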

  14. Protein Structure Databases
     • Protein Data Bank (PDB)
     • Stores 3-D structures of biological macromolecules determined experimentally (by X-ray crystallography and NMR).
     http://www.pdb.org/pdb/static.do?p=general_information/pdb_statistics/index.html

  15. Protein Structure Databases
     The SCOP (Structural Classification of Proteins) Database
     • Developed by Alexey Murzin and Cyrus Chothia.
     • Hierarchical classification of the structural domains of individual PDB entries;
     • SCOP is organized as a tree with the following hierarchy:
       --- Class: according to the arrangement of secondary structure
       --- Fold: proteins have the same major secondary structures in the same arrangement with the same topological connections
       --- Superfamily: low sequence identities, but structures suggest that a common evolutionary origin is probable
       --- Family: cluster of homologous proteins

  16. SCOP Classes
     Numbers are labelled on the slide as [protein domains] (folds).
     • All alpha proteins [46456] (226)
     • All beta proteins [48724] (149)
     • Alpha and beta proteins (a/b) [51349] (134): mainly parallel beta sheets (beta-alpha-beta units)
     • Alpha plus beta proteins (a+b) [53931] (286): mainly antiparallel beta sheets (segregated alpha and beta regions)
     • Multi-domain proteins (alpha and beta) [56572] (48): folds consisting of two or more domains belonging to different classes
     • Membrane and cell surface proteins and peptides [56835] (49): does not include proteins in the immune system
     • Small proteins [56992] (79): usually dominated by metal ligand, heme, and/or disulfide bridges
     • Coiled coil proteins [57942] (7): not a true class
     • Low resolution protein structures [58117] (24): not a true class
     • Peptides [58231] (116): peptides and fragments; not a true class
     • Designed proteins [58788] (42): experimental structures of proteins with essentially non-natural sequences

  17. Protein Classification Databases
     The CATH Database
     • Developed by Janet Thornton and Christine A. Orengo.
     • Hierarchical classification of the structural domains of individual PDB entries;
     • CATH is organized as a tree with the following hierarchy:
       --- Class: according to the arrangement of secondary structure
       --- Architecture: orientation of secondary structure elements
       --- Topology: topological connections between secondary structure elements
       --- Homologous superfamily: cluster of homologous proteins

  18. Protein Sequence Motif Databases
     • Protein sequence motif: a set of conserved amino acid residues that are important for protein function and are located within a short distance of one another.
     • The EF-hand motif:
       |-helixI-----|------loop1------|--helixII--|
       En**nn**-nX**Y-*Z*G*Ix**zn**nn**n
     Legend: E = glutamate; n = hydrophobic residue; * = any residue; X = first calcium ligand; Y = second calcium ligand; Z = third calcium ligand; G = glycine; # = fourth calcium ligand, provided by a backbone carbonyl; I = isoleucine (although other aliphatic residues are also found at this position); -X = fifth calcium ligand; -Z = sixth and seventh calcium ligands, provided by a bidentate glutamate or aspartate residue.

  19. Protein Sequence Motif Databases
     The PROSITE Database
     • Created by Amos Bairoch at the Swiss Institute of Bioinformatics;
     • Consists of a large collection of biologically meaningful signatures that are described as patterns or profiles;
     • Release 20.17 of 24-Jul-2007 contains:
       --- 1489 documentation entries
       --- 1319 patterns
       --- 739 profiles
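A PROSITE pattern is essentially a compact regular expression over amino acids, so scanning a sequence with one can be sketched by translating it into Python's `re` syntax. The converter below is a simplified illustration (the name `prosite_to_regex` is hypothetical) that handles `x`, `[...]`, `{...}`, repeat counts such as `x(2,4)`, and the `<` and `>` anchors; it does not cover every corner of the pattern syntax. The pattern string is an illustrative example in PROSITE notation, in the style of the C2H2 zinc finger signature, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns conventionally end with a period
    regex_parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count,
        # e.g. "x(2,4)" -> ("x", "2,4") and "[LIVM]" -> ("[LIVM]", None).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residues, repeat = match.group(1), match.group(2)
        if residues == "x":
            residues = "."                            # any residue
        elif residues.startswith("{"):
            residues = "[^" + residues[1:-1] + "]"    # any residue except these
        part = residues + ("{" + repeat + "}" if repeat else "")
        regex_parts.append(
            ("^" if anchored_start else "") + part + ("$" if anchored_end else "")
        )
    return "".join(regex_parts)


pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
regex = prosite_to_regex(pattern)

# Invented sequence that contains one instance of the pattern.
toy_sequence = "MK" + "CAAC" + "AAA" + "F" + "A" * 8 + "H" + "AAA" + "H" + "GS"
hit = re.search(regex, toy_sequence)
print(regex, hit.span() if hit else None)
```

Profiles, the other PROSITE signature type, are position-specific scoring tables and cannot be expressed as regular expressions, so they are outside the scope of this sketch.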

  20. Protein Sequence Motif Databases
     The PRINTS Database
     • PRINTS is a compendium of protein fingerprints;
     • A fingerprint is a group of conserved motifs used to characterize a protein family;
     • Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space;
     • Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.
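The idea of a fingerprint as several separated, non-overlapping motifs can be illustrated with a short Python sketch. This is a deliberate simplification: PRINTS itself scores sequences against aligned motifs, whereas the sketch below swaps in fixed-length regular expressions and only checks that every motif occurs, in order and without overlap. The function name `match_fingerprint`, the three motifs, and the test sequence are all invented for the example.

```python
import re
from typing import List, Optional, Tuple


def match_fingerprint(sequence: str,
                      motifs: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return the (start, end) span of each motif if all motifs occur in the
    given order without overlapping; return None otherwise.

    Assumes each motif is a fixed-length regular expression, so the leftmost
    match is also the one that ends earliest.
    """
    spans = []
    position = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, position)
        if hit is None:
            return None  # one missing motif means no fingerprint match
        spans.append(hit.span())
        position = hit.end()  # the next motif must start after this one ends
    return spans


# Invented motifs and sequence, for illustration only.
fingerprint = ["G[ST]GK", "DE[AV]D", "W..F"]
sequence = "MAGSGKTTLLDEADRRAWQQFP"

spans = match_fingerprint(sequence, fingerprint)
reversed_spans = match_fingerprint(sequence, list(reversed(fingerprint)))
print(spans, reversed_spans)
```

Requiring all motifs together is what makes a fingerprint more specific than any single motif: a sequence that matches only one of the three elements is rejected.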

  21. Protein Domain Databases
     • Protein domain: a structurally compact, independently folding unit that forms a stable 3-D structure and shows some evolutionary conservation.
     • Genetically mobile: domain recombination is one of the major forces that create novel proteins in evolution.

  22. Protein Domain Databases • Classification of protein domains from multiple sequence alignments

  23. Protein Domain Databases
     The Pfam Database
     • Pfam is a collection of protein family alignments. Pfam-A alignments are constructed semi-automatically using hidden Markov models (HMMs);
     • Sequences that are not covered by Pfam-A are clustered and aligned automatically, and are released as Pfam-B;
     • Pfam-A has been incorporated into EBI's InterPro and NCBI's CDD;
     • Version 22.0 (June 2007) contains alignments and models for 9318 protein families, based on the Swiss-Prot 51.7 and TrEMBL 34.7 protein sequence databases;
     • The HMMER package provides tools for:
       --- scanning a sequence against the Pfam database
       --- creating a new HMM from a multiple sequence alignment

  24. Protein Domain Databases
     The SMART Database
     • SMART (Simple Modular Architecture Research Tool): HMMs based on high-quality, manually derived alignments of protein domain families;
     • Contains more than 500 domain families found in signaling, extracellular and chromatin-associated proteins;
     • Domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues;
     • Has been incorporated into EBI's InterPro and NCBI's CDD.

  25. Protein Domain Databases
     The ProDom Database
     http://www.toulouse.inra.fr/prodom.html
     • The ProDom protein domain database consists of an automatic compilation of homologous domains;
     • Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches;
     • The ProDom database has been designed as a tool to help analyze domain arrangements of proteins and protein families;
     • Strong emphasis has been put on the graphical user interface, which allows for interactive analysis of protein homology relationships;
     • Has been incorporated into EBI's InterPro.

  26. Protein Domain Databases
     The SUPERFAMILY Database
     • A library of profile hidden Markov models that represent all proteins of known structure;
     • The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to;
     • SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes;
     • Has been incorporated into EBI's InterPro.

  27. Protein Domain Databases
     The Gene3D Database
     • Gene3D is a library of hidden Markov models that represent all proteins of known structure;
     • The seed alignments for the models are derived from the proteins found within the homologous superfamily classification level in CATH, which groups together domains that are thought to share a common ancestor;
     • In CATH, similarities at the homologous superfamily level are identified first by sequence comparisons and subsequently by structure comparisons using SSAP;
     • Gene3D has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the Gene3D website;
     • Has been incorporated into EBI's InterPro.

  28. Protein Domain Databases
     The InterPro Database (Integrated Resource of Protein Families, Domains and Sites)
     • InterPro is the unification of the following databases for protein domain and family classification:
       --- PROSITE: regular expressions and profiles.
       --- PRINTS: fingerprints (groups of aligned, un-weighted motifs).
       --- ProDom: uses PSI-BLAST to find homologous sequences, which are clustered in the same ProDom entry.
       --- Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, PANTHER, Gene3D: hidden Markov models (HMMs).
     • InterProScan program: for scanning the InterPro database.
     • The whole database can be downloaded.

  29. The number of unique protein folds and families is limited
     • The number of unique protein folds has been saturated since 2008.
       http://www.pdb.org/pdb/static.do?p=general_information/pdb_statistics/index.html
     • The number of unique protein families has been saturated since 2008.
       http://www.pdb.org/pdb/static.do?p=general_information/pdb_statistics/index.html
