Prediction of protein function
Presentation Transcript

  1. Prediction of protein function Lars Juhl Jensen, EMBL Heidelberg

  2. Overview • Part 1 • Homology-based transfer of annotation • Function prediction from protein domains • Part 2 • Prediction of functional motifs from sequence • Feature-based prediction of protein function • Part 3 • Prediction of functional interaction networks

  3. Why do we need to predict function?

  4. What do we mean by function? • The concept “function” is not clearly defined • A structural biologist, a cell biologist, and a medical doctor will have very different views • Many levels of granularity • For the overall definition of “function”, the knowledge and description can be more or less specific • Functional categories are somewhat artificial • People like to put things in boxes …

  5. Descriptions of protein function • Controlled vocabularies • Gene Ontology • SwissProt keywords • KEGG pathways • EcoCyc pathways • Interaction networks • More accurate data models • Reactome • Systems Biology Markup Language (SBML)

  6. Molecular function • Molecular function describes activities, such as catalytic or binding activities, at the molecular level • GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place • Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity

  7. Biological process • A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions • An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport • It can be difficult to distinguish between a biological process and a molecular function

  8. Cellular component • A cellular component is just that, a component of a cell that is part of some larger object • It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) • The cellular component categories are probably the best defined categories since they correspond to actual entities

  9. Homology-based transfer of annotation Lars Juhl Jensen, EMBL Heidelberg

  10. Detection of homologs • Pairwise sequence similarity searches • BLAST (fastest) • FASTA • Full Smith-Waterman (most sensitive) • Profile-based similarity searches • PSI-BLAST • Hidden Markov Models (HMMs) • Sequence similarity should always be evaluated at the protein level
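
As an illustration of the full Smith-Waterman search mentioned on this slide, here is a minimal sketch of the local alignment score computation. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are simplifying assumptions for illustration; real protein searches would normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    chosen only to keep the example short.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTYY", "ACGT"))
```

The exhaustive fill of the matrix is what makes this approach sensitive but slow compared with heuristic searches such as BLAST.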

  11. Sequence similarity, sequence homology, and functional homology • Sequence similarity means that the sequences are similar – no more, no less • Sequence homology implies that the proteins are encoded by genes that share a common ancestry • Functional homology means that two proteins from two organisms have the same function • Sequence similarity or sequence homology does not guarantee functional homology

  12. Orthologs vs. paralogs

  13. Functional consequences of gene duplication • Neofunctionalization • One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) • The other copy has changed its function and behaves much like a paralog • Subfunctionalization • Each copy has taken on a part of the ancestral function • A functional homolog cannot be defined • Each ortholog typically has the same molecular function in a different sub-process or location

  14. 1–to–1 orthology • A single gene in one organism corresponds to a single gene in another organism • These can generally be assumed to encode functionally equivalent proteins • Same molecular function • Same biological process • Same localization • 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms

  15. 1–to–many orthology • A single gene in one organism corresponds to multiple genes in another organism • Any mixture of neo- and sub-functionalizations can have occurred • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • 1–to–many orthology is very common between simple model organisms and higher eukaryotes

  16. Many–to–many orthology • Many genes in each organism have arisen from a single gene in their last common ancestor • Different neo- and sub-functionalizations have likely taken place in each lineage • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • Many–to–many orthology is common between higher eukaryotes that are distantly related
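
The three orthology classes from the preceding slides can be told apart by counting how many genes each organism contributes to an orthologous relationship. The sketch below assumes the gene lists have already been produced by some ortholog detection method; the function name and return labels are illustrative only.

```python
def orthology_type(genes_a, genes_b):
    """Classify an orthology relationship by the number of genes per organism.

    genes_a and genes_b are the co-orthologous genes found in organism A and
    organism B (hypothetical inputs, assumed to come from an upstream method).
    """
    if not genes_a or not genes_b:
        return "none"
    if len(genes_a) == 1 and len(genes_b) == 1:
        return "1-to-1"
    if len(genes_a) == 1 or len(genes_b) == 1:
        return "1-to-many"
    return "many-to-many"


if __name__ == "__main__":
    print(orthology_type(["geneA"], ["geneB1", "geneB2"]))
```

Only the 1-to-1 case supports direct transfer of molecular function, biological process, and localization; for the other two classes the slides suggest transferring molecular function with more confidence than process or localization.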

  17. Detection of orthologs • Reconstruction of phylogenetic trees • The theoretically most correct way • Works for analyzing particular genes of interest • Methods based on reciprocal matches • What currently works at the genomic scale • Manual curation • Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account

  18. Construction of gene trees • Identify the relevant proteins • Sequence similarity and possibly additional information • Construct a blocked multiple sequence alignment • Use, for example, Muscle and Gblocks • Reconstruct the most likely phylogenetic tree • Use, for example, PhyML • Orthologs and paralogs can be trivially extracted based on a gene tree

  19. Reciprocal matches • Simple “best reciprocal match” is a bad choice • Can only deal with one-to-one orthology • Detection of in-paralogs • Similarity higher within species than between species • Orthologs can now be detected based on best reciprocal matches between in-paralogous groups • One or more out-group organisms can optionally be used to improve the definition of orthologs
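
The idea on this slide can be sketched as follows: find best reciprocal matches between two organisms, then extend each matched pair with in-paralogs, meaning genes that are more similar to the seed gene within the same species than the seed is to its match in the other species. This is a simplified illustration, not any specific published tool; the function names and the input format (a dictionary of pairwise similarity scores) are assumptions.

```python
def score(scores, x, y):
    """Look up a symmetric similarity score; missing pairs count as 0."""
    return scores.get((x, y), scores.get((y, x), 0.0))


def best_hit(gene, candidates, scores):
    """Return the candidate most similar to gene, or None if nothing matches."""
    hits = [(score(scores, gene, c), c) for c in candidates if score(scores, gene, c) > 0]
    return max(hits)[1] if hits else None


def ortholog_groups(genes_a, genes_b, scores):
    """Best reciprocal matches between two organisms, extended with in-paralogs."""
    groups = []
    for a in genes_a:
        b = best_hit(a, genes_b, scores)
        if b is None or best_hit(b, genes_a, scores) != a:
            continue  # not a best reciprocal match
        seed = score(scores, a, b)
        # In-paralogs: within-species similarity exceeds the between-species seed score.
        in_a = [a] + [x for x in genes_a if x != a and score(scores, a, x) > seed]
        in_b = [b] + [y for y in genes_b if y != b and score(scores, b, y) > seed]
        groups.append((in_a, in_b))
    return groups


if __name__ == "__main__":
    toy_scores = {("a1", "b1"): 100, ("a1", "b2"): 90, ("b1", "b2"): 150}
    print(ortholog_groups(["a1"], ["b1", "b2"], toy_scores))
```

In the toy data, `b2` joins the group because it is more similar to `b1` than `b1` is to `a1`, so the result is a 1-to-many relationship that a plain best reciprocal match would have missed.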

  20. Orthologous groups • Orthologs and paralogs are in principle always defined with respect to two organisms • Orthologous groups instead try to encompass an entire set of organisms • The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover

  21. Definition of orthologous groups

  22. COGs, KOGs, and NOGs • The COGs and KOGs were manually curated • These were automatically expanded to more species • Tri-clustering • Detection of in-paralogs • Identification of triangles of best reciprocal matches • Merging of triangles that share an edge • Broad phylogenetic coverage • COGs and NOGs cover all three domains of life • KOGs cover all eukaryotes
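
The triangle steps of tri-clustering can be sketched as below. The input is assumed to be a list of best reciprocal matches between genes of different species (in-paralog detection is taken as already done), and the function names are illustrative rather than taken from the COG software.

```python
from itertools import combinations


def find_triangles(edges):
    """Return all sets of three genes that are pairwise connected by an edge."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    triangles = set()
    for x, y in edges:
        for z in neighbors[x] & neighbors[y]:
            triangles.add(frozenset((x, y, z)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share an edge (two genes) into orthologous groups."""
    triangles = sorted(triangles, key=sorted)  # fixed order for reproducibility
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edge_owner = {}
    for idx, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            if edge in edge_owner:
                parent[find(idx)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = idx
    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    matches = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"), ("b1", "d1"), ("c1", "d1")]
    print(merge_triangles(find_triangles(matches)))
```

Requiring a shared edge rather than a single shared gene keeps loosely connected families apart, which is why two triangles that touch at only one gene stay in separate groups.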

  23. Clustering based on similarity • All-against-all sequence similarity is calculated • A standard clustering method is applied to define groups of homologous genes • TribeMCL • Hierarchical clustering • These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
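
A minimal sketch of similarity-based clustering is single-linkage clustering at a fixed score threshold, which corresponds to cutting a single-linkage hierarchical clustering at one level. This is a stand-in chosen for brevity; TribeMCL uses Markov clustering, which is not shown here. The function name, input format, and threshold are assumptions.

```python
def single_linkage_clusters(genes, scores, threshold):
    """Group genes connected by any chain of pairs scoring at least threshold."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (x, y), s in scores.items():
        if s >= threshold:
            parent[find(x)] = find(y)
    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    toy_scores = {("a", "b"): 80, ("b", "c"): 60, ("c", "d"): 10}
    print(single_linkage_clusters(["a", "b", "c", "d"], toy_scores, threshold=50))
```

Note that the clustering uses similarity alone and no species information, which illustrates why such groups contain homologs in general and cannot separate orthologs from paralogs.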

  24. Meta-servers • Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge • These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest • However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies

  25. Function prediction from protein domains Lars Juhl Jensen, EMBL Heidelberg

  26. When homology searches fail • Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function • No functional information can thus be transferred based on simple sequence homology • By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function

  27. Protein domains • Many eukaryotic proteins consist of multiple globular domains that can fold independently • These domains have been mixed and matched through evolution • Each type of domain contributes towards the molecular function of the complete protein • Numerous resources are able to identify such domains from sequence alone using HMMs

  28. Which domain resource should I use? • SMART is focused on signal transduction domains • Pfam is very actively developed and thus tends to have the most up-to-date domain collection • InterPro is useful for genome annotation since the domains are annotated with GO terms • CDD is conveniently integrated with the NCBI BLAST web interface
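
Because domains in a resource such as InterPro carry GO annotations, a first-pass function prediction can be sketched as collecting the GO terms of every domain found in a protein. The domain names and the mapping below are hypothetical placeholders, not real database entries; in practice both the domain hits and the mapping would come from the domain resource itself.

```python
# Hypothetical domain-to-GO mapping for illustration only.
TOY_DOMAIN_TO_GO = {
    "toy_kinase_domain": ["protein kinase activity", "ATP binding"],
    "toy_binding_domain": ["protein binding"],
}


def predict_go_terms(domain_hits, domain_to_go=TOY_DOMAIN_TO_GO):
    """Return the sorted union of GO terms annotated to the detected domains."""
    terms = set()
    for domain in domain_hits:
        # Domains without an annotation contribute nothing.
        terms.update(domain_to_go.get(domain, ()))
    return sorted(terms)


if __name__ == "__main__":
    print(predict_go_terms(["toy_kinase_domain", "toy_binding_domain"]))
```

Taking the plain union treats each domain independently; it reflects the slide's point that each domain contributes to the molecular function, but it ignores any effect of domain order or combination.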

  29. Predicting globular domains and intrinsically disordered regions • Not all globular domains have been discovered and the databases are thus not comprehensive • Methods exist for predicting from sequence which regions are globular and which are disordered • GlobPlot uses a simple propensity scale • DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks • Many disordered regions are important for protein function and they should thus not be ignored
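
To make the notion of a propensity scale concrete, the sketch below averages a per-residue disorder propensity over a sliding window and flags positions whose average exceeds a threshold. This is not GlobPlot's actual algorithm or scale: the propensity values, window size, and threshold are invented for illustration.

```python
# Hypothetical propensities: positive favors disorder, negative favors globular.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.8, "G": 0.7, "E": 0.6, "K": 0.6,
    "W": -1.0, "F": -0.9, "I": -0.9, "L": -0.7, "V": -0.7,
}


def disordered_positions(seq, scale=TOY_PROPENSITY, window=5, threshold=0.5):
    """Return one boolean per residue: True where the windowed mean exceeds threshold."""
    half = window // 2
    flagged = []
    for i in range(len(seq)):
        # The window is truncated at the sequence ends.
        segment = seq[max(0, i - half): i + half + 1]
        mean = sum(scale.get(r, 0.0) for r in segment) / len(segment)
        flagged.append(mean > threshold)
    return flagged


if __name__ == "__main__":
    flags = disordered_positions("PPPPPPLLLLLL")
    print("".join("D" if f else "-" for f in flags))
```

Residues missing from the scale are treated as neutral (0.0), so a real application would need a scale covering all twenty amino acids.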