Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction
Stefan Reckow, Max Planck Institute of Psychiatry
Volker Tresp, Siemens, Corporate Technology
Protein and Protein Functions
• motivation
  • proteins: molecular machines in any organism
  • understanding protein function is essential for all areas of the biosciences
  • diverse sources of knowledge about proteins
• challenges
  • experimental determination of functions is difficult and expensive
  • homologies can be misleading
  • most proteins have several functions
Protein function prediction
What function does this protein have? Candidate annotations, in order of increasing specificity:
• catalytic activity (catalyzes a reaction)
• isomerase activity
• intramolecular oxidoreductase activity
• intramolecular oxidoreductase activity, interconverting aldoses and ketoses
• triose-phosphate isomerase activity (catalyzes a very specific reaction)
“Function” Ontologies
• ontologies are a way of bringing order to the functions of proteins
• an ontology is a description of the concepts of a domain and their relationships
• hierarchical representation (subclass relationship)
  • tree
  • directed acyclic graph
“Complex” Ontology
• complex: a structure formed by a group of two or more proteins that perform certain functions in concert
Ontologies as a Great Source of Prior Knowledge in Machine Learning
• A considerable amount of community effort is invested in designing ontologies
• Typically this prior knowledge is deterministic (logical constraints)
• Machine learning should be able to exploit this knowledge
• Protein interactions are an important source of information for predicting function, which motivates statistical relational learning
Statistical Relational Learning (SRL)
• SRL generalizes standard machine learning to domains where relations between entities (and not just entity attributes) play a significant role
• Examples: PRM, DAPER, MLN, RMN, RDN
• The IHRM is an easily applicable general model that performs a cluster analysis of relational domains and requires no structural learning
• References
  • Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proc. 22nd UAI, 2006
  • C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proc. AAAI, 2006
Standard Latent Model for Protein Mixture Models
[Figure: latent-variable model for Protein1 and Protein2]
• In a Bayesian approach, we can permit an infinite number of states in the latent variables and obtain a Dirichlet Process Mixture Model (DPM)
• Advantage: the model uses only a finite number of those states, so no time-consuming structural optimization is required
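The claim that only finitely many states are ever used can be illustrated with the Chinese restaurant process, a standard view of the partition prior behind a DPM. This is a minimal sketch, not the sampler used in this work: the function name `crp_assignments`, the concentration `alpha = 1.0` and the item count are assumptions chosen for illustration.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels from a Chinese restaurant process (illustrative)."""
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

# alpha and n_items are arbitrary illustrative values.
labels, counts = crp_assignments(n_items=1000, alpha=1.0)
print(len(counts), "clusters used for", len(labels), "items")
```

Although the prior allows unboundedly many clusters, any finite data set occupies at most as many clusters as it has items, and typically far fewer, so the number of states does not have to be fixed or searched for in advance.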
Infinite Hidden Relational Model (IHRM)
• Permits us to include protein-protein interactions in the model
[Figure: Protein1, Protein2 and Protein3 connected by "interact" links]
Ground Network
[Figure: ground network with latent variables Z1, Z2 and Z3, each with function, motif and complex attributes, connected by "interact" links]
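One way to read the ground network is as a generative process: each protein gets a latent cluster, its attributes depend on that cluster, and each "interact" link depends on the clusters of both proteins. The sketch below follows that reading under several assumptions that are not stated in the slides: a single-valued function attribute, uniform Dirichlet and Beta(1, 1) priors, symmetric interactions, and the hypothetical name `sample_ihrm_toy`.

```python
import random

def sample_ihrm_toy(n_proteins=6, alpha=1.0, n_functions=3, seed=0):
    """Draw a toy data set from an IHRM-style generative process (illustrative)."""
    rng = random.Random(seed)

    # 1. Latent cluster Z_i for every protein, via a Chinese restaurant process.
    z, counts = [], []
    for _ in range(n_proteins):
        k = rng.choices(range(len(counts) + 1), weights=counts + [alpha])[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        z.append(k)
    n_clusters = len(counts)

    # 2. Per-cluster distribution over function labels
    #    (normalised Gamma(1, 1) draws give a uniform Dirichlet sample).
    theta = []
    for _ in range(n_clusters):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(n_functions)]
        total = sum(g)
        theta.append([x / total for x in g])

    # 3. Interaction probability for every pair of clusters (assumed symmetric).
    phi = [[0.0] * n_clusters for _ in range(n_clusters)]
    for k in range(n_clusters):
        for l in range(k, n_clusters):
            phi[k][l] = phi[l][k] = rng.betavariate(1.0, 1.0)

    # 4. Observations: attributes depend on Z_i, links on the pair (Z_i, Z_j).
    function = [rng.choices(range(n_functions), weights=theta[zi])[0] for zi in z]
    interact = {(i, j): rng.random() < phi[z[i]][z[j]]
                for i in range(n_proteins) for j in range(i + 1, n_proteins)}
    return z, function, interact

z, function, interact = sample_ihrm_toy()
print("clusters:", z)
print("functions:", function)
```

Because a link depends on both latent states, observing interactions lets information about one protein's cluster influence the inferred cluster of its partners, which a plain mixture model over attributes cannot do.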
Experimental Results: KDD Cup 2001
• Yeast genome data
• 1243 genes/proteins: 862 (training) / 381 (test)
• Attributes
  • Chromosome
  • Motif (351) [1-6]: a gene might contain one or more characteristic motifs (information about the amino acid sequence of the protein)
  • Essential
  • Structural class (24) [1-2]: the protein coded by the gene might belong to one or more structural categories
  • Phenotype (11) [1-6]: observed phenotypes in the organism
  • Interaction
  • Complex (56) [1-3]: the protein expressed by the gene can form a complex with others, yielding a larger protein
  • Function (14) [1-4]: cell growth, cell organization, transport, …
• genes were anonymous
Results: Comparison with Supervised Models
[Figure: ROC curve] [Table: accuracy by model]
IHRM Result
[Figure: interaction network. Node: gene. Link: interaction. Color: cluster.]
Integration of ontologies Deductive closure
Integration of ontologies
[Figure: latent variable Zi with attributes function, motif and complex; the complex concepts (translocon, cytoskeleton, actin filaments, microtubules, signal peptidase) are divided into independent and dependent concepts]
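Deductive closure means that an annotation with a specific concept also implies every more general concept above it in the ontology. A minimal sketch follows, assuming the ontology is stored as a child-to-parents map; the helper name `deductive_closure` and the parent links shown (actin filaments and microtubules under cytoskeleton) are assumptions for illustration, since the flattened slide text does not preserve the actual edges.

```python
def deductive_closure(annotations, parents):
    """Return the annotations together with all of their ancestor concepts."""
    closed = set()
    stack = list(annotations)
    while stack:
        concept = stack.pop()
        if concept not in closed:
            closed.add(concept)
            # Follow every parent link; a DAG may give a concept several parents.
            stack.extend(parents.get(concept, ()))
    return closed

# Hypothetical fragment of a "complex" hierarchy (edges are assumed).
parents = {
    "actin filaments": ["cytoskeleton"],
    "microtubules": ["cytoskeleton"],
}

closed = deductive_closure({"actin filaments"}, parents)
print(sorted(closed))  # ['actin filaments', 'cytoskeleton']
```

Applying this to every protein before learning turns the logical constraints of the ontology into additional observed annotations, so the learner never sees a child concept without its parents.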
Experiments: Including “Complex” Ontology
• Data collected from CYGD of MIPS
• 1000 genes/proteins: 800 (training) / 200 (test)
• Attributes
  • chromosome, motif, essential, structural class, phenotype, interaction, complex, function
  • interactions from DIP
• usage of ontological knowledge on complex
  • five hierarchy levels
  • in our model: 258 nodes (concepts), using 66 top-level categories
  • every protein has at least one complex annotation
  • after including ontological constraints: about three annotations per protein on average
Results (AUC)
Training / test    w/o ontology    with ontology
800 / 200          0.895           0.928
200 / 200          0.832           0.894
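For readers less familiar with the metric, AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it directly from that definition. The slides do not say how the AUC values were computed or averaged over concepts, so the function `auc` and the toy labels and scores are purely illustrative and not data from these experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up toy example: 1 = protein has the annotation, 0 = it does not.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(auc(toy_labels, toy_scores))  # 0.75
```

The pairwise form is quadratic in the number of examples, which is fine at the scale of a 200-protein test set; rank-based formulas give the same value more efficiently on larger data.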
Results explicit modeling of dependencies
Results
• Grey: in test set
• proteins concerned with secretion and transportation
  • The "Golgi apparatus" works together with the "endoplasmic reticulum (ER)" as the transport and delivery system of the cell
  • "SNARE" proteins help to direct material to the correct destination
  • Test proteins also have the function "cellular transport"
• proteins acting in cell division
  • control proteins
  • "Septins": septins have several roles throughout the cell cycle and carry out essential functions in cytokinesis
  • The three highlighted proteins fit into this cluster ("cell fate" and "cell type differentiation")
Results sampling convergence
Results Distribution of proteins in the clusters
Results
• Grey: former singletons
• Cellular transport cluster
  • The former singleton "Clathrin light chain", a major constituent of coated vesicles (a component for transport), fits into this cluster quite well
• Tasks occurring during DNA replication
  • The former singleton "DNA polymerase", a main actor in replication, is assigned to the correct cluster here
Conclusion
• application of the IHRM to function prediction
  • competitive with supervised learning methods
  • insights into the solution
• advantages of integrating ontological knowledge
  • improvement of the clustering structure
  • robustness: stable results with varying parameterization
  • deductive closure prior to learning is a general and powerful principle
• future challenges
  • usage of several or more complex ontologies
  • further analysis of dependent vs. independent concepts
• Acknowledgements: Karsten Borgwardt (MPIs Tübingen); Hans-Peter Kriegel (LMU)