
UniProt

Protein Sequence Database: UniProt. Jennifer McDowall. Overview: the UniProt databases; UniProt/SwissProt annotation; UniProt/TrEMBL automatic annotation; using the uniprot.org website; computational access.


Presentation Transcript


  1. Protein Sequence Database: UniProt Jennifer McDowall

  2. Overview • The UniProt databases • UniProt/SwissProt annotation • UniProt/TrEMBL automatic annotation • Using the uniprot.org website • Computational access

  3. 1) The UniProt databases

  4. Source of protein sequence data • Protein sequencing is rare • Most protein sequence is derived from nucleotide data • [Diagram] Large-scale sequencing projects, individual scientists and patent offices submit nucleotide sequencing results to a nucleotide sequence database, from which protein sequence is derived for the protein sequence database; protein sequencing results are submitted directly to the protein sequence database

  5. Protein sequence is mainly derived data • Submit: DNA sequence ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT • Transcribe: derived mRNA sequence AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC • Translate: derived protein sequence MRSDECCCAMSC

  6. Protein sequence is mainly derived data • Submit: DNA sequence ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT • Predicted start, predicted stop and predicted splice sites may not have direct evidence • Transcribe: derived mRNA sequence AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC • Translate: derived protein sequence MRSDECCCAMSC
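
The "translate" step on slides 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not part of the presentation: the function names `transcribe` and `translate` are hypothetical, the codon table is the standard genetic code, and splicing is ignored. Note that `transcribe` expects a coding strand, which the DNA string on the slide is not, so only the slide's mRNA is used here. Translating that mRNA with the standard code gives MRSDECCCAMSC, because the codon GAU encodes aspartate (D).

```python
BASES = "UCAG"
# Standard genetic code, ordered by first, second, third base in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# codon -> one-letter amino acid ('*' marks a stop codon)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


slide_mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(slide_mrna))  # MRSDECCCAMSC
```

Because the start codon, stop codon and reading frame are all inferred from the nucleotide sequence, an error in any of those predictions changes the derived protein, which is the point slide 6 makes.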

  7. How to find the information you need? • High quality protein sequence: non-redundant data; splice isoforms, disease variants, PTMs; sequence archiving essential • Protein identification: stable identifiers; consistent nomenclature • Protein annotation: information on protein function, biological processes, molecular interactions, pathways

  8. UniProt • Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD • Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database • http://www.uniprot.org/

  9. UniProt Consortium

  10. Where does the data come from? • Sequence sources: ENA • ENA and UniParc exchange data daily

  11. Where does the data come from? • Sequence sources: ENA, PDB, RefSeq, Ensembl, VEGA, patents, model organisms, more… • UniParc: history of sequences • Metagenomic & environmental sequences: UniMES • Taxonomy known: UniProtKB/TrEMBL • Manual annotation and redundancy removal: UniProtKB/SwissProt (high quality annotation)

  12. Where does the data come from? • Sequence sources: ENA, PDB, RefSeq, Ensembl, VEGA, patents, model organisms, more… • UniParc • Metagenomic & environmental sequences: UniMES, grouped into UniMES clusters • Taxonomy known: UniProtKB/TrEMBL and UniProtKB/SwissProt, grouped into UniRef clusters

  13. 4 components of UniProt • UniParc: complete history of sequences (no annotation); cross-links to external sequence sources • UniProtKB: Swiss-Prot (non-redundant, manual annotation) and TrEMBL (redundant, automatic annotation) • UniMES: sequences from metagenomic projects • UniRef: combines sequences (speed searching); UniRef100, UniRef90, UniRef50
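
The UniParc idea of storing each distinct sequence once, with cross-links back to every source database that holds it, can be sketched as a small data structure. This is an illustration under stated assumptions, not UniParc's implementation: the class `SequenceArchive`, the `ARCH…` identifiers, the SHA-256 checksum and the accessions in the usage lines are all hypothetical.

```python
import hashlib


class SequenceArchive:
    """Toy non-redundant archive: one record per unique sequence, no annotation."""

    def __init__(self):
        self._id_by_checksum = {}  # sequence checksum -> archive identifier
        self._xrefs = {}           # archive identifier -> list of cross-references

    def add(self, sequence, database, accession):
        """Register a sequence seen in a source database and return its archive id."""
        checksum = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        archive_id = self._id_by_checksum.get(checksum)
        if archive_id is None:
            # First time this exact sequence is seen: create a new record.
            archive_id = f"ARCH{len(self._id_by_checksum) + 1:06d}"
            self._id_by_checksum[checksum] = archive_id
            self._xrefs[archive_id] = []
        self._xrefs[archive_id].append(
            {"database": database, "accession": accession, "active": True}
        )
        return archive_id

    def retire(self, database, accession):
        """Mark a source entry as deleted; the cross-reference itself is kept."""
        for xrefs in self._xrefs.values():
            for xref in xrefs:
                if (xref["database"], xref["accession"]) == (database, accession):
                    xref["active"] = False

    def xrefs(self, archive_id):
        """Return all cross-references (active and deleted) for one record."""
        return list(self._xrefs[archive_id])


archive = SequenceArchive()
first = archive.add("MRSDECCCAMSC", "ENA", "HYPOTHETICAL_1")
second = archive.add("MRSDECCCAMSC", "RefSeq", "HYPOTHETICAL_2")
archive.retire("ENA", "HYPOTHETICAL_1")
print(first == second)       # True: identical sequences share one record
print(archive.xrefs(first))
```

Keeping retired cross-references rather than deleting them is what preserves the history of a sequence in this sketch.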

  14. Browsing a UniParc entry • Accession • Download data • List of databases containing sequence • Deleted entries identified (greyed out) • Navigate to individual entries • Sequence

  15. Browsing a UniProtKB/SwissProt entry • Download data • Names (synonyms) and taxonomy • Protein attributes • General information • Annotation • Ontologies • Protein interactions • Splice variants • Sequence features • Sequence • References • Navigate to external data sources, e.g. Ensembl

  16. Browsing a UniRef90 entry • Faster and more sensitive sequence search with no loss of information • Status (SwissProt and/or TrEMBL) • Cluster name • List of entries in cluster • Taxonomy of each entry • % identity of sequences in cluster
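
A rough feel for how sequences are combined into clusters at a given identity level (100%, 90%, 50%) can be had from the toy sketch below. It is not the procedure UniRef uses: the `identity` function is a deliberately crude, ungapped, left-aligned comparison standing in for a real alignment, the greedy strategy is one common way to build such clusters, and the sequences and names are made up.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions, ungapped and left-aligned,
    relative to the longer sequence (toy stand-in for a real alignment)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longer


def greedy_cluster(sequences: dict, threshold: float) -> dict:
    """Assign each sequence to the first cluster whose representative is at
    least `threshold` identical to it; otherwise it founds a new cluster."""
    clusters = {}  # representative id -> list of member ids
    # Longest sequences first, so the longest member represents its cluster.
    for seq_id in sorted(sequences, key=lambda s: (-len(sequences[s]), s)):
        for representative in clusters:
            if identity(sequences[representative], sequences[seq_id]) >= threshold:
                clusters[representative].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]
    return clusters


toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",  # identical to A
    "C": "MKTAYIAKQK",  # 9 of 10 positions match A
    "D": "MKTAYLLLLL",  # 5 of 10 positions match A
}
print(greedy_cluster(toy, 1.0))  # {'A': ['A', 'B'], 'C': ['C'], 'D': ['D']}
print(greedy_cluster(toy, 0.9))  # {'A': ['A', 'B', 'C'], 'D': ['D']}
print(greedy_cluster(toy, 0.5))  # {'A': ['A', 'B', 'C', 'D']}
```

Searching only the cluster representatives is what makes a search faster, while the member list keeps every original entry reachable.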

  17. Taxonomic distribution of species • All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%) • Within Eukaryota: Other mammals (27%), Viridiplantae (18%), Fungi (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%)

  18. SwissProt – most represented species Mainly model organisms

  19. Protein Existence tag !! Not sequence validation !! Protein existence level (Total): • Evidence at protein level: 13% • Evidence at transcript level: 12% • Inferred from homology: 70% • Predicted: 5% • Uncertain (mainly TrEMBL): -

  20. Protein existence categories !! Not sequence validation !! Protein existence level (Human): • Evidence at protein level: 59% • Evidence at transcript level: 37.5% • Inferred from homology: 1% • Predicted: 0.5% • Uncertain (mainly TrEMBL): 2%
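
In UniProtKB FASTA files the protein existence level travels with each sequence as a `PE=` field in the header line, numbered 1 to 5 in the order the slides list the categories. The sketch below reads that field; the function name `parse_header` and the regular expression are assumptions made for illustration, and the example header is written from the documented layout and should be treated as illustrative rather than as a copy of the live entry.

```python
import re

# Protein existence (PE) levels as numbered in UniProtKB; labels as on the slides.
PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed header layout: >db|accession|entry_name description ... PE=n SV=m
HEADER_PATTERN = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s.*\bPE=([1-5])\b")


def parse_header(header: str) -> dict:
    """Extract section, accession, entry name and PE level from a FASTA header."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"not a recognised UniProtKB FASTA header: {header!r}")
    database, accession, entry_name, pe = match.groups()
    return {
        "section": "Swiss-Prot" if database == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        "protein_existence": PROTEIN_EXISTENCE[int(pe)],
    }


example = (
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
)
print(parse_header(example))
```

As the slides stress, the level says how well the existence of the protein is supported; it says nothing about whether the sequence itself is correct.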

  21. 2) UniProtKB/SwissProt annotation

  22. Annotation sources for UniProtKB • Manual curation • Literature-based annotation • Sequence analysis • Automated annotation • Some data sources for annotation: GO (functional info), PRIDE (protein identification data), InterPro (protein families and domains), IntAct (molecular interactions), IntEnz (enzymes), HAMAP (microbial protein families), RESID (post-translational modifications), protein classification • Predictions: InterPro classification, signal prediction, transmembrane prediction, other predictions

  23. Features of UniProtKB • Nomenclature • Annotations • Ontologies • Splice variants • Sequence features • Sequence • References

  24. A wealth of external links (125 links!) • Organism-specific DBs: DictyBase, AGD, EchoBASE, CGD, EcoGene, CTD, euHCVdb, CYGD, FlyBase, HGNC, GeneCards, HPA, GeneFarm, MGI, Gramene, MIM, H-InvDB, RGD, LegioList, SGD, Leproma, TAIR, ListiList, ZFIN, MaizeGDB, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, SagaList, SubtiList, TubercuList, WormBase, WormPep, Xenbase, GeneDB_Spombe, ArachnoServer, BuruList • Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB • Proteomic DBs: PeptideAtlas, PRIDE, ProMEX • Genome annotation DBs: Ensembl, KEGG, GeneID, NMPDR, VectorBase, UCSC, GenomeReviews, TIGR • Family and domain DBs: Gene3D, PIRSF, HAMAP, PRINTS, InterPro, ProDom, PANTHER, PROSITE, Pfam, TIGRFAMs, SMART • Phylogenomic DBs: HOGENOM, OMA, HOVERGEN, PhylomeDB, InParanoid, OrthoDB • Polymorphism DBs: dbSNP • Ontologies: GO • 2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE • 3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR • Gene expression DBs: ArrayExpress, Bgee, GermOnline, CleanEx, Genevestigator • Protein-protein interaction DBs: DIP, IntAct, STRING • PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite • Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene • Protein family/group DBs: CAZy, MEROPS, PeroxiBase, REBASE, PptaseDB, TCDB • Others: BindingDB, PMAP-CutDB, DrugBank, NextBio

  25. SwissProt manual annotation Protein sequence • Merge available CDS (coding sequence) • Annotate sequence discrepancies • Report sequencing errors... Biological information • Extract literature information • Orthologue data propagation • Protein sequence analysis...

  26. Problem #1: sequence correction • ~20% of Swiss-Prot entries required correction • Typical problems: unsolved conflicts (sequencing errors); erroneous gene model predictions; wrong initiation sites; frameshifts...

  27. Sequence quality from genome projects • Drosophila: well-curated; 1.8% of gene models incorrect • Arabidopsis: annotated when sequenced, but no update; 19.5% of gene models incorrect • Tetraodon nigroviridis: automatic run through (no manual intervention); >90% of gene models incorrect

  28. Sequence curation Sequencing errors Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines

  29. Problem #2: proteome complexity • 1 SwissProt entry = 1 gene (1 species) • Genome: ~20,000 human protein-coding genes • Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...) • Proteome: >1,000,000 human proteins (post-translational modification) • Annotation of sequence differences

  30. Merging entries Because of: • Errors: erroneous gene model predictions; sequence errors • Natural variation: polymorphisms; alternative start sites; alternative splicing • Redundancy: multiple entries for the same protein exist in TrEMBL • Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
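
The merging rule on slide 30 (identical sequences merge directly, anything else goes to a curator) can be sketched as follows. This is a simplified illustration rather than the Swiss-Prot workflow: the function names, the dictionary layout and the accessions are hypothetical, the first entry is arbitrarily taken as the reference, and differences are only listed when the two sequences have the same length, since anything else would need an alignment.

```python
def sequence_differences(reference: str, other: str) -> list:
    """List (position, reference residue, other residue), 1-based,
    for two sequences of equal length."""
    if len(reference) != len(other):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        (position, ref_residue, other_residue)
        for position, (ref_residue, other_residue)
        in enumerate(zip(reference, other), start=1)
        if ref_residue != other_residue
    ]


def merge_entries(entries: list) -> dict:
    """Merge entries for one gene in one species.

    `entries` is a list of (accession, sequence); the first one is the reference.
    """
    reference_accession, reference_sequence = entries[0]
    result = {
        "reference": reference_accession,
        "merged_automatically": [],  # 100% identical to the reference
        "for_curation": {},          # accession -> differences, or None if lengths differ
    }
    for accession, sequence in entries[1:]:
        if sequence == reference_sequence:
            result["merged_automatically"].append(accession)
        elif len(sequence) == len(reference_sequence):
            result["for_curation"][accession] = sequence_differences(
                reference_sequence, sequence
            )
        else:
            # Possible alternative start site or splicing: needs an alignment.
            result["for_curation"][accession] = None
    return result


entries = [
    ("ENTRY_1", "MKTAYIAK"),
    ("ENTRY_2", "MKTAYIAK"),    # identical
    ("ENTRY_3", "MKTSYIAK"),    # one substitution at position 4
    ("ENTRY_4", "MKTAYIAKQQ"),  # longer sequence
]
print(merge_entries(entries))
```

The listed differences are the raw material a curator would then classify as sequencing errors, polymorphisms or isoforms.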

  31. Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):

  32. Merging entries

  33. Sequence curation Alternative Splicing

  34. Sequence curation Alternative Splicing

  35. Sequence curation Alternative Splicing

  36. Sequence curation Alternative Splicing

  37. Sequence curation Alternative Splicing

  38. Sequence curation Identification of amino acid variants ....and of PTMs ....and also

  39. Sequence curation Domain annotation Binding sites

  40. SwissProt manual annotation Protein sequence • Merge available CDS (coding sequence) • Annotate sequence discrepancies • Report sequencing errors... Biological information • Extract literature information • Orthologue data propagation • Protein sequence analysis...

  41. Sources of annotated information UniProtKB/SwissProt gathers information from multiple sources: • Publications (literature/PubMed) • Prediction programs (Prosite, Anabelle) • Contact with experts • Other databases • Nomenclature committees

  42. Nomenclature Synonyms useful for literature searching

  43. Nomenclature Provides synonyms and cleavage products of bifunctional proteins

  44. Annotation comments >30 comment fields Controlled vocabularies used whenever possible…

  45. Disease association • Mendelian Inheritance in Man provides information on genetic disease associations • Pharmacogenomics database

  46. Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…

  47. Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence

  48. Gene Ontology 1. Biological Process (a commonly recognized series of events) • Cell division • Mitosis • Organelle fission 2. Molecular Function (an elemental activity or task or job) • Protein kinase activity • Insulin binding • Insulin receptor activity 3. Cellular Component (where a gene product is located) • Mitochondrion • Mitochondrial matrix • Mitochondrial membrane

  49. Gene Ontology Annotation for human Rhodopsin:

  50. Imported annotation Binary interactions are taken from the database Interactors of human p53
