Proteomics Resources at the EBI

Sandra Orchard, EMBL-EBI

What do protein scientists require?
1. Protein identification: a high-quality, non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs, to act as a reference set. Stable identifiers and sequence archiving are essential.
2. Protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources.
3. Reference data sets: comparative datasets for comparing tissue specificity patterns and normal/disease protein sets.

Where do we go from here?
• Sequence similarity programs run against UniProt

What is UniProt?
• Based on the original work on PIR, Swiss-Prot and TrEMBL
• Funded mainly by NIH
• Collaboration between EBI, SIB and PIR

UniProt Consortium

UniProt data sources and data flow
[Figure: data-flow diagram. Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, FlyBase, WormBase, Patent Data, Sub/Peptide Data. Other boxes in the diagram: UniParc, UniProtKB, UniSave, UniMES, UniRef 100, UniRef 90, UniRef 50, IPI, Proteome Sets.]

UniProtKB
• UniProt Knowledgebase (www.uniprot.org)
• Aims to describe in a single record all protein products derived from a given gene in a given species
• 2 sections:
  • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
  • UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

What does UniProtKB give you?
1. Curated protein sequences: correction of frameshifts, premature stop sites, incorrect initiator methionines and similar errors. Stable identifiers, with archiving and versioning.
2. Consistent nomenclature, plus synonyms.
3. Identification of splice variants and/or alternative promoter usage. Stable identifiers, with archiving and versioning.

What does UniProtKB give you?
4. Identification of variants (at amino acid level) and of PTMs; where known, the consequence is given. Stable identifiers, with archiving and versioning.
5. Annotation of experimental data from the literature in 27 defined fields. Increasing use of controlled vocabularies, without loss of detail.

What does UniProtKB give you?
6. Extensive cross-referencing: a central portal to a wealth of external resources, with 81 external databases cross-referenced to UniProtKB.

Simple Text Search

1. Sequence curation, stable identifiers, versioning and archiving www.ebi.ac.uk/uniprot/unisave
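The idea of a stable identifier with archived versions can be sketched in a few lines. This is only an illustration of the concept, not how UniSave is implemented; the class name `EntryArchive`, its methods and the accession `ACC_0001` are all hypothetical.

```python
class EntryArchive:
    """Toy archive: one stable accession, every sequence version kept."""

    def __init__(self):
        self._versions = {}  # accession -> list of sequences, oldest first

    def save(self, accession, sequence):
        """Store a new version only if the sequence changed; return the version number."""
        history = self._versions.setdefault(accession, [])
        if not history or history[-1] != sequence:
            history.append(sequence)
        return len(history)

    def get(self, accession, version=None):
        """Return the latest sequence, or a specific 1-based version."""
        history = self._versions[accession]
        return history[-1] if version is None else history[version - 1]


archive = EntryArchive()
first = archive.save("ACC_0001", "MKT")     # hypothetical accession and sequences
second = archive.save("ACC_0001", "MKTA")   # curated sequence becomes version 2
unchanged = archive.save("ACC_0001", "MKTA")  # same sequence: no new version
```

The accession never changes, so earlier results can still be traced back to the exact sequence version they were based on.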

Sequence curation, stable identifiers, versioning and archiving
• For example: erroneous gene model predictions, frameshifts, premature stop codons, readthroughs, erroneous initiator methionines

2. Consistent nomenclature (& synonyms)

3. Identification of splice variants

4. Identification of variants (at amino acid level) and of PTMs, and also:
• Domain annotation
• Binding sites
• Splice variants
• Experimental mutations
• Sequence conflicts

5. Annotation of literature experimental data in 27 defined fields.

Controlled vocabularies are used whenever possible, but the ability to further describe each specific situation is retained

Disease-specific annotation is added to human entries, with supporting cross-referencing

6. Extensive cross-referencing, a central portal to a wealth of external resources
• Additional annotation (Gene Ontology)

Reactome

wwPDB

InterPro – defines protein family membership and enables domain annotation

UniProtKB/TrEMBL
• Redundant: only 100% identical sequences are merged
• Automated clean-up of annotation from the original nucleotide sequence entry
• Additional value added by automatic annotation
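To make "only 100% identical sequences merged" concrete, here is a minimal sketch that groups records by exact sequence. It is an illustration under simplifying assumptions (it ignores organism and any other merge criteria), and the record names and sequences are invented.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (source_id, sequence) pairs so each distinct sequence appears once."""
    merged = defaultdict(list)
    for source_id, sequence in records:
        merged[sequence.upper()].append(source_id)
    return dict(merged)


records = [
    ("nuc_entry_A", "MKTAYIAKQR"),
    ("nuc_entry_B", "MKTAYIAKQR"),  # 100% identical to A: merged
    ("nuc_entry_C", "MKTAYIAKQK"),  # one residue differs: kept separate
]
merged = merge_identical(records)
```

A single differing residue is enough to keep two entries apart, which is why this section remains redundant.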

Automatic Annotation
• Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot
• Identifies all members of this family using patterns/motifs/HMMs in InterPro
• Transfers common annotation to related family members in TrEMBL
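The three steps above can be sketched as follows. This is a simplified illustration, not the actual annotation pipeline: the function name, the accessions, the signature names and the annotation terms are all hypothetical, and "common annotation" is assumed here to mean terms shared by every reviewed family member.

```python
def transfer_annotation(family_signature, common_annotation, unreviewed_entries):
    """Return {accession: annotations} for unreviewed entries matching the signature."""
    transferred = {}
    for accession, signatures in unreviewed_entries.items():
        if family_signature in signatures:  # step 2: entry is a family member
            transferred[accession] = sorted(common_annotation)  # step 3: transfer
    return transferred


# Step 1: annotation shared by all reviewed members of the family.
reviewed_family = {
    "REVIEWED_1": {"Kinase", "ATP-binding"},
    "REVIEWED_2": {"Kinase", "ATP-binding", "Membrane"},
}
common = set.intersection(*reviewed_family.values())

# Unreviewed entries with the signatures that hit them (invented data).
unreviewed = {
    "UNREVIEWED_1": {"SIG_KINASE"},
    "UNREVIEWED_2": {"SIG_OTHER"},
}
result = transfer_annotation("SIG_KINASE", common, unreviewed)
```

Only the annotation common to the whole family is transferred, so a term found on a single reviewed entry ("Membrane" here) is not propagated.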

Protein Sequence Characterisation
[Diagram labels: BLAST; basic information; more sequences; conserved signatures; build up consensus sequences of families, domains, motifs or sites]

Finding Conserved Signatures
• Pattern
• Fingerprint
• Sequence clustering
• Profile
• HMM
[Scale shown alongside the list: simplest (limited) → more information]
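A pattern is the simplest of these signature types and maps almost directly onto a regular expression. The sketch below converts a pattern written in PROSITE-style syntax into a Python regex. It handles only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, `<` and `>`), the function name is hypothetical, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
sequence = "AACPECGKSFSRSDELTRHIRTHTG"  # invented sequence for illustration
match = re.search(regex, sequence)
```

A pattern either matches or it does not, which is the limitation the slide refers to: profiles and HMMs score a match instead and so tolerate more divergence.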

Foundations of InterPro
• Integration of signatures
• Manual curation

Integration Process
[Diagram: four cases comparing a PROSITE signature and a PFAM signature, with overlap percentages (100, 50) and the resulting InterPro entries (IPR000001, IPR000002).]
1) Same positions, same protein hits
2) Same positions, different protein hits
3) Different positions, same protein hits
4) Different positions

Signature Relationships
1) Parent - Child (subgroup of more closely related proteins)
• Applies to domains and families
• Children: no proteins in common
[Diagram: the Protein kinase signature (PFAM) is the parent (*); the Serine kinase and Tyrosine kinase signatures (SMART, PROSITE) are its children. Overlap percentages shown: 100, 75, 25.]
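The parent-child relationship can be expressed in terms of the sets of proteins each signature hits. The sketch below is an illustration, assuming that a child hits a proper subset of its parent's proteins and that sibling children do not overlap; the protein identifiers are invented, and the set sizes were chosen to reproduce the 75 and 25 shown on the slide.

```python
def is_child_of(child_hits, parent_hits):
    """A child signature hits a subgroup (proper subset) of the parent's proteins."""
    return child_hits < parent_hits


def share_no_proteins(children):
    """True if no protein is hit by more than one child signature."""
    seen = set()
    for hits in children.values():
        if seen & hits:
            return False
        seen |= hits
    return True


parent = {"P1", "P2", "P3", "P4"}  # proteins hit by the Protein kinase signature
children = {
    "Serine kinase": {"P1", "P2", "P3"},
    "Tyrosine kinase": {"P4"},
}
# Percentage of the parent's proteins covered by each child.
coverage = {name: 100 * len(hits) // len(parent) for name, hits in children.items()}
```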

Signature Relationships
2) Contains - Found in (describes domain composition)
• Both families and domains can contain domains
[Diagram: the Receptor family (PFAM) contains an N-terminal domain and a C-terminal domain (SMART and PROSITE signatures); each domain is found in the Receptor family (PFAM).]
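"Contains" and "Found in" describe the same link from opposite ends, so one can be derived from the other. A minimal sketch, assuming the relationship is stored as a simple mapping (the function name and data layout are hypothetical; the entry names come from the slide):

```python
from collections import defaultdict


def invert_contains(contains):
    """Derive the 'found in' view from a 'contains' mapping."""
    found_in = defaultdict(list)
    for container, parts in contains.items():
        for part in parts:
            found_in[part].append(container)
    return dict(found_in)


contains = {"Receptor family": ["N-terminal domain", "C-terminal domain"]}
found_in = invert_contains(contains)
```

Because domains can themselves contain domains, the same mapping can hold a domain as a key as well as a value.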

Specialisation of Databases

Structural Representation in InterPro
[Diagram labels: PDB sequence; UniProt amino acid position; MSD residue-by-residue mapping; InterPro sequence-structure comparison]

Structural Representation
• PDB structures displayed as striped patterns
• Structural classification in CATH and SCOP
• Homology models from SWISS-MODEL and ModBase

Sequence-Structure Display
• Signatures predictive of protein annotation
• Structural data for specific proteins
