The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation

The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation Icobicobi 2004 Angra Dos Reis, RJ, Brasil Darren A. Natale, Ph.D. Project Manager and Senior Scientist, PIR Research Assistant Professor, GUMC

1) UniProt Overview 2) PIRSF Protein Classification System 3) Family-Driven Protein Annotation Major Topics

Central Resource of Protein Sequence and Function International Consortium: PIR, EBI, SIB Unifies PIR-PSD, Swiss-Prot, TrEMBL UniProt: Universal Protein Resource http://www.uniprot.org

UniParc: Comprehensive Sequence Archive with Sequence History UniProt:Knowledgebase with Full Classification and Functional Annotation UniRef:Condensed Reference Databases for Sequence Search UniProt Databases

An archive for tracking protein sequences Comprehensive: All published protein sequences Non-Redundant: Merge identical sequence strings Traceable: Versioned, with ‘Active’ or ‘Obsolete’ status tag Concise: no annotation of function, species, tissue, etc. 2.5 million unique entries from 6 million source-database entries UniParc

Annotated: Fully manually-curated (Swiss-Prot section) and automatically-annotated based on family-driven rules (TrEMBL section) Cross-referenced: Links to over 50 external databases (classification, domain, structure, genome, functional, boutique) Non-redundant: Merge in a single record all protein products derived from a certain gene in a given species UniProt Knowledgebase • High Information Content: • Isoform Presentation: Alternatively Spliced Forms, Proteolytic Cleavage, and Post-Translational Modification (each with FTid) • Nomenclature: Gene/Protein Names (Nomenclature Committees) • Family Classification and Domain Identification: InterPro and PIRSF • Functional Annotation: Function, Functional Site, Developmental Stage, Catalytic Activity, Modification, Regulation, Induction, Pathway, Tissue Specificity, Sub-cellular Location, Disease, Process

Activity • Pathway • Disease • Cross-Refs UniProt Report Modified Swiss-Prot “NiceProt” view • ID & Accession • Name & Taxon • References

Position-specific features: • Active sites • Binding sites • Modified residues • Sequence variations • Additional Info • Expanded detail UniProt Report (II)

Non-Redundant: Merge sequences and subsequences UniRef100: 100% sequence identity from all species, including sub-fragments Superset of Knowledgebase: Includes splice variants and selected UniParc sources (e.g. EnsEMBL, IPI, and patent data) Optimized: For Faster Searches using Reduced Data Sets UniRef90: 90% sequence identity (36% size reduction) UniRef50: 50% sequence identity (63% size reduction) UniRef Databases

100% sequence identity from all species, including sub-fragments Splice Variants as separate entries UniRef100 Report Sub-fragments Splice variants

Phenylalanine hydroxylase & Tryptophan hydroxylase Merged sequences likely have the same function 50% UniRef90/50 Reports 90% Representative sequence

Publicly available Dec. 15, 2003 Text/Sequence Searches against UniProt, UniRef, UniParc Links to Useful Tools Download UniProt, UniRefs FAQs and Information User Help/feedback forms UniProt Web Site http://www.uniprot.org

Problem: • Most new protein sequences come from genome sequencing projects • Many have unknown functions • Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect Solution: • Highly curated and annotated protein classification system Facilitates: • Automatic annotation of sequences based on protein families • Systematic correction of annotation errors • Name standardization in UniProt • Functional predictions for uncharacterized proteins The Need for Classification This all works only if the system is optimized for annotation

Levels of Protein Classification

Domain shuffling What about these? Protein Evolution Sequence changes With enough similarity, one can trace back to a common origin

CM? PDH? PDT? CM/PDH? CM/PDT? Consequences of Domain Shuffling PIRSF006786 PIRSF001501 CM = chorismate mutase PDH = prephenate dehydrogenase PDT = prephenate dehydratase ACT = regulatory domain CM (AroQ type) PDH CM (AroQ type) PDH PIRSF001499 ACT PDH PIRSF005547 PDT ACT PIRSF001424 CM (AroQ type) PDT ACT PIRSF001500

- - - - Acylphosphatase ZnF ZnF YrdC Peptidase M22 Whole Protein = Sum of its Parts? PIRSF006256 On the basis of domain composition alone, biological function was predicted to be: ● RNA-binding translation factor ● maturation protease Actual function: ● [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme

Classification Goals We strive to reconstruct the natural classification of proteins to the fullest possible extent BUT Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity) THUS The further we extend the classification, the finer is the domain structure we need to consider SO We need to compromise between the depth of analysis and protein integrity OR Credit: Dr. Y. Wolf, NCBI

Domain Classification Allows ahierarchythat can trace evolution to thedeepest possible level, the last point of traceable homology and common origin Can usually annotate onlygeneral biochemical function Whole-protein Classification Cannot build a hierarchy deep along the evolutionary tree because ofdomain shuffling Can usually annotatespecific biological function(preferred to annotate individual proteins) Complementary Approaches • Can map domains onto proteins • Can classify proteins even when domains are not defined

Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology The Ideal System… • Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins) • Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families • Expertly curated membership, family name, function, background, etc. • Evidence attribution (experimental vs predicted)

PIRSF: Anetworkstructure fromsuperfamiliestosubfamilies Reflectsevolutionary relationshipsof full-lengthproteins PIRSF Classification System • Definitions: • Homeomorphic Family:Basic Unit • Homologous: Common ancestry, inferred by sequence similarity • Homeomorphic: Full-length similarity & common domain architecture • Network Structure: Flexible number of levels with varying degrees of sequence conservation; allows multiple parents • Advantages: • Annotate both general biochemical and specific biological functions • Accurate propagation of annotation and development of standardized protein nomenclature and ontology

PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.

1. Variable number ofrepeats Variable Domain Architecture Domain architecture can not be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

2. Presence/absence of auxiliary domains Easily lost or acquired Usually small mobile domains Different versions of domain architecture arising many times Variable Domain Architecture Domain architecture can not be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

3. Domain duplication Variable Domain Architecture Domain architecture can not be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

Classification Tool: BlastClust • Curator-guided clustering • Retrieve all proteins sharing a common domain • Single-linkage clustering using BlastClust • Fixed-length coverage enforces homeomorphicity • Iterative procedure allows tree view

Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF Phylogenetic tree and alignment view allows further sequence analysis PIRSF Family Report (I) Curated family name Description of family Sequence analysis tools

PIRSF Family Report (II) Integrated value-added information from other databases Mapping to other protein classification databases

Improve Annotation Quality Annotate biological function of whole proteins Annotate uncharacterized hypothetical proteins (functional predictions helped by newly-detected family relationships) Correct annotation errors Improve under- or over-annotated proteins Standardize Protein Names in UniProt Site annotation PIRSF Protein Classification provides a platform for UniProt protein annotation Family-Driven Protein Annotation

Enhanced Annotations in UniProt Corrections Upgraded underannotations Predicted functions for “hypothetical” proteins

PIRSF Classification Name Reflects the function when possible Indicates the maximum specificity that still describes the entire group Standardized format Name tags: validated, tentative, predicted, functionally heterogeneous Family-Driven Protein Annotation Objective: Optimize for protein annotation • PIRSF Classification Name • Hierarchy • Hierarchy • Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase) • Name Rules • Name Rules • Define conditions under which names propagate to individual proteins • Enable further specificity based on taxonomy or motifs • Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement) • Site Rules • Site Rules • Define conditions under which features propagate to individual proteins

PIR Name Rules • Account for functional variations within one PIRSF, including: • Lack of active site residues necessary for enzymatic activity • Certain activities relevant only to one part of the taxonomic tree • Evolutionarily-related proteins whose biochemical activities are known to differ Monitor such variables to ensure accurate propagation • Propagate other properties that describe function: EC, GO terms, misnomer info, pathway • Name Rule types: • “Zero” Rule • Default rule (only condition is membership in the appropriate family) • Information is suitable for every member • “Higher-Order” Rule • Has requirements in addition to membership • Can have multiple rules that may or may not have mutually exclusive conditions

Example Name Rules Note the lack of a zero rule for PIRSF000881

Name Rule in Action at UniProt • Current: • Automatic annotations (AA) are in a separate field • AA only visible from www.ebi.uniprot.org • Future: • Automatic name annotations will become DE line if DE line • will improve as a result • AA will be visible from all consortium-hosted web sites

Name Rule Propagation Pipeline Affiliation of Sequence: Homeomorphic Family or Subfamily (whichever PIRSF is the lowest possible node) Nothing to propagate No Name rule exists? Yes Protein fits criteria for any higher-order rule? Assign name from Name Rule 1 (or 2 etc) No Yes Assign name from Name Rule 0 No Yes PIRSF has zero rule? Nothing to propagate

Position-Specific Site Features: active sites binding sites modified amino acids PIR Site Rules • Current requirements: • at least one PDB structure • experimental data on functional sites: CATRES database (Thornton) • Rule Definition: • Select template structure • Align PIRSF seed members with structural template • Edit MSA to retain conserved regions covering all site residues • Build Site HMM from concatenated conserved regions

Propagate Information Feature annotation using controlled vocabulary Evidence attribution (experimental vs. computational prediction) Attribute sources and strengths of evidence Site Rule Algorithm • Match Rule Conditions • Membership Check (PIRSF HMM threshold) • Ensures that the annotation is appropriate • Conserved Region Check (site HMM threshold) • Site Residue Check (all position-specific residues in HMMAlign)

Match Rule Conditions Only propagate site annotation if all rule conditions are met

PIRSF Family Report (III) Defined rules for annotation Site rules allow precise annotation of features for UniProt proteins within the PIRSF

Site Rules Feed Name Rules • Functional variation within one PIRSF: • binding sites with different specificity drive choice of applicable rule to ensure appropriate annotation Functional Site rule: tags active site, binding, other residue-specific information ? Functional Annotationrule: gives name, EC, other activity-specific information

Dr. Cathy Wu, Director Curation team Dr. Winona Barker Dr. Darren Natale Dr. CR Vinayaka Dr. Zhangzhi Hu Dr. Anastasia Nikolskaya Dr. Xianying Wei Dr. Raja Mazumder Dr. Sona Vasudevan Dr. Lai-Su Yeh Informatics team Dr. Leslie Arminski Yongxing Chen, M.S. Jian Zhang, M.S. Dr. Hsing-Kuo Hua Sehee Chung, M.S. Amar Kalelkar Dr. Hongzhan Huang Baris Suzek, M.S. Students Jorge Castro-Alvear Vincent Hormoso Rathi Thiagarajan Christina Fang Natalia Petrova UniProt Collaborators Dr. Rolf Apweiler/EBI Dr. Amos Bairoch/SIB PIR Team

Curator’s Decision Maker

The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation

The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation

Presentation Transcript

Protein Classification

Protein classification

Protein structure Classification

Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Pro

Protein Classification II

UniProt - The Universal Protein Resource

Protein structure classification

PROTEIN STRUCTURE CLASSIFICATION

Protein Classification

Protein Classification

UniProt: Universal Protein Resource

UniProt - The Universal Protein Resource

Gene/Protein Function Annotation

Protein Classification

Genome Annotation: A Protein-centric Perspective

PIRSF Classification System

SCOP – Protein structure classification CATH – Protein structure classification

PIRSF Classification System

Protein classification

Automated Protein Crystallization

Protein Functional Annotation