Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction

Stefan Reckow, Max Planck Institute of Psychiatry • Volker Tresp, Siemens, Corporate Technology

Proteins and Protein Ontologies

Protein and Protein Functions • motivation • proteins – molecular machines in any organism • understanding protein function is essential for all areas of bio-sciences • diverse sources of knowledge about proteins • challenges • experimental determination of functions difficult and expensive • homologies can be misleading • most proteins have several functions

Protein function prediction What function does this protein have? catalytic activity (catalyzes a reaction) isomerase activity intramolecular oxidoreductase activity specificity intramolecular oxidoreductase activity, interconverting aldoses and ketoses triose-phosphate isomerase activity (catalyzes a very specific reaction)

“Function” Ontologies • ontologies are a way of bringing order in the function of proteins • an ontology is a description of concepts of a domain and their relationships • hierarchical representation (subclass-relationship) • tree • directed, acyclic graph

“Complex” Ontology • complex: structure formed by a group of two or more proteins to perform certain functions concertedly

Ontologies as a Great Source of Prior Knowledge in Machine Learning • A considerable amount of community effort is invested in designing ontologies • Typically this prior knowledge is deterministic (logical constraints) • Machine Learning should be able to exploit this knowledge • Protein interactions are an important source of information for predicting function, which motivates statistical relational learning

Statistical Relational Learning with the IHRM

Statistical Relational Learning (SRL) • SRL generalizes standard Machine Learning to domains where relations between entities (and not just entity attributes) play a significant role • Examples: PRM, DAPER, MLN, RMN, RDN • The IHRM is an easily applicable general model that performs a cluster analysis of relational domains and requires no structural learning • Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proc. 22nd UAI, 2006 • C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. AAAI 2006

Standard Latent Model for Proteins: Mixture Models [Figure: Protein1, Protein2] • In a Bayesian approach, we can permit an infinite number of states in the latent variables and obtain a Dirichlet Process Mixture Model (DPM) • Advantage: the model only uses a finite number of those states, so no time-consuming structural optimization is required
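The claim that an infinite mixture only ever occupies a finite number of states can be illustrated with the Chinese restaurant process, a standard way of drawing cluster assignments from a Dirichlet process prior. This is a minimal sketch, not the sampler used in the talk; the function name `crp_assignments`, the concentration value `alpha=1.0` and the item count are assumptions chosen for illustration.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Draw cluster assignments for n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter; larger values favor more clusters.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

# Hypothetical run: 1000 items, alpha = 1.0.
assignments, counts = crp_assignments(n_items=1000, alpha=1.0)
print("occupied clusters:", len(counts))
```

Although the prior allows unboundedly many clusters, the number actually occupied can never exceed the number of items and in practice grows only slowly with it, which is why no separate search over the number of states is needed.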

Infinite Hidden Relational Model (IHRM) • Permits us to include protein-protein interactions into the model [Figure: Protein1, Protein2 and Protein3, linked pairwise by interact relations]
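One way to read the relational part of such a model: each protein has a latent cluster, and whether two proteins interact depends only on their pair of clusters. The sketch below scores a given clustering under that assumption. It covers only the interaction likelihood and omits the Dirichlet process priors, the attribute models and the sampling procedure of the actual IHRM; the names `z`, `phi` and `interaction_log_likelihood`, and all numbers, are hypothetical.

```python
import math

def interaction_log_likelihood(z, phi, interactions, n_proteins):
    """Log-likelihood of an undirected interaction set given cluster assignments.

    z[i]         : cluster index of protein i
    phi[a][b]    : assumed probability that a protein in cluster a interacts
                   with a protein in cluster b (symmetric, strictly in (0, 1))
    interactions : set of pairs (i, j) with i < j that are observed to interact
    """
    total = 0.0
    for i in range(n_proteins):
        for j in range(i + 1, n_proteins):
            p = phi[z[i]][z[j]]
            observed = (i, j) in interactions
            # Bernoulli likelihood for each unordered protein pair.
            total += math.log(p if observed else 1.0 - p)
    return total

# Toy data: proteins 0 and 1 interact, protein 2 interacts with nobody.
phi = [[0.9, 0.1],
       [0.1, 0.5]]
interactions = {(0, 1)}

ll_good = interaction_log_likelihood([0, 0, 1], phi, interactions, 3)
ll_bad = interaction_log_likelihood([0, 1, 1], phi, interactions, 3)
print(ll_good > ll_bad)
```

Putting the two interacting proteins into the same cluster yields the higher likelihood here, which is the mechanism by which interaction data can shape the clustering.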

Ground Network [Figure: latent variables Z1, Z2 and Z3, each with function, motif and complex attributes, connected by interact relations]

Experimental Results KDD Cup 2001 • Yeast genome data • 1243 genes/proteins: 862 (training) / 381 (test) • Attributes • Chromosome • Motif (351) [1-6]: a gene might contain one or more characteristic motifs (information about the amino acid sequence of the protein) • Essential • Structural class (24) [1-2]: the protein coded by the gene might belong to one or more structural categories • Phenotype (11) [1-6]: observed phenotypes in the organism • Interaction • Complex (56) [1-3]: the expression of the gene can complex with others to form a larger protein • Function (14) [1-4] (cell growth, cell organization, transport, …) • genes were anonymous

Results: Comparison with Supervised Models [Figure: ROC curve; table of accuracy by model]

IHRM Result • Node: gene • Link: interaction • Color: cluster

Integrating Ontological Prior Knowledge into the IHRM

Integration of ontologies • Deductive closure
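Deductive closure over a subclass hierarchy means that every annotation also implies all of its ancestor concepts. A minimal sketch follows, assuming the ontology is given as a child-to-parents mapping over a directed acyclic graph. The function name `deductive_closure` and the `parents` dictionary are illustrative; the concept chain is the molecular-function example from the earlier slide, read here as ordered from general to specific.

```python
def deductive_closure(annotations, parents):
    """Return the annotations together with every ancestor concept they imply.

    parents maps a concept to the list of its direct superclasses;
    concepts without an entry are treated as roots.
    """
    closed = set()
    stack = list(annotations)
    while stack:
        concept = stack.pop()
        if concept not in closed:
            closed.add(concept)
            stack.extend(parents.get(concept, ()))
    return closed

# Assumed hierarchy, taken from the function-prediction example slide.
parents = {
    "triose-phosphate isomerase activity": [
        "intramolecular oxidoreductase activity, interconverting aldoses and ketoses"
    ],
    "intramolecular oxidoreductase activity, interconverting aldoses and ketoses": [
        "intramolecular oxidoreductase activity"
    ],
    "intramolecular oxidoreductase activity": ["isomerase activity"],
    "isomerase activity": ["catalytic activity"],
}

closure = deductive_closure({"triose-phosphate isomerase activity"}, parents)
for concept in sorted(closure):
    print(concept)
```

Applying the closure before learning turns one specific annotation into several consistent ones, so the learner sees the logical constraints as ordinary data rather than having to enforce them itself.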

Integration of ontologies [Figure: latent variable Zi with function, motif and complex attributes; the complex concepts translocon, cytoskeleton, actin filaments, microtubules and signal peptidase are marked as independent or dependent concepts]

Experiments: Including “Complex” Ontology • Data collected from CYGD of MIPS • 1000 genes/proteins: 800 (training) / 200 (test) • Attributes • chromosome, motif, essential, structural class, phenotype, interaction, complex, function • interactions from DIP • usage of ontological knowledge on complex • five hierarchy levels • in our model 258 nodes (concepts) using 66 top-level categories • every protein has at least one complex annotation • After including ontological constraints: about three annotations per protein on average

Results (AUC) • 800 (training) / 200 (test): w/o ontology 0.895, with ontology 0.928 • 200 (training) / 200 (test): w/o ontology 0.832, with ontology 0.894
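For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is not the evaluation code behind the numbers above, and the scores and labels are made-up toy values.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy example: hypothetical prediction scores for four proteins.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(toy_auc)  # 0.75
```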

Results explicit modeling of dependencies

Results • Grey: in test set • proteins concerned with secretion and transportation • The "Golgi apparatus" works together with the "endoplasmic reticulum (ER)" as the transport and delivery system of the cell. • "SNARE" proteins help to direct material to the correct destination • Test proteins also "cellular transport" • proteins acting in cell division • control proteins • "Septins": Septins have several roles throughout the cell cycle and carry out essential functions in cytokinesis • The three highlighted proteins fit into this cluster ("cell fate" and "cell type differentiation")

Results sampling convergence

Results Distribution of proteins in the clusters

Results • Grey: former singletons • Cellular Transport Cluster • The former singleton "Clathrin light chain", as a major constituent of coated vesicles (a component for transport) fits into this cluster quite well • Tasks occurring during DNA replication • The former singleton "DNA polymerase", as a main actor in replication, obviously is assigned the correct cluster here

Conclusion • application of the IHRM to function prediction • competitive with supervised learning methods • insights into the solution • advantages of integrating ontological knowledge • improvement of the clustering structure • robustness: stable results with varying parameterization • deductive closure prior to learning is a general powerful principle • future challenges • usage of several or more complex ontologies • further analysis of dependent vs. independent concepts • Acknowledgements: Karsten Borgwardt (MPIs Tübingen); Hans-Peter Kriegel (LMU)
