This project presents a comprehensive ontology designed to unify diverse protein-protein interaction (PPI) data from various sources, enabling effective data mining and improving the reliability of information. The approach utilizes ontology as a descriptive data model to define entities and relationships within the PPI domain. It integrates databases such as DIP, BIND, MINT, and IntAct while facilitating seamless querying, data transformation, and additional data source incorporation. The project aims to enhance understanding of PPI dynamics and offers a platform for future enhancements in biological research.
An Ontology for Protein-Protein Interaction Data Karen Jantz CIS Honors Project December 7, 2006
Overview • Problem Statement • Objectives • Approach • Background • Methodology • Evaluation • Demonstration • Conclusion
Problem Statement • Several sources for protein-protein interaction data • Different schemata • Different purposes • Different strengths/weaknesses
Objectives • Unify the data • Enable data mining • Evaluate reliability of data across data sources • Gain new information about the entire data set • Enable others to easily add other data sources to the set
Approach: ontology • ontology – n. • that which exists (philosophy) • that which is represented (artificial intelligence) • A descriptive data model • Defines the entities and relationships within a domain • Based upon data • Human-readable
Approach: ontology • Data integration • Enables simultaneous querying across multiple databases • Data transformation • Enables interchange between database formats • Data mining • Enables reasoning and learning over the entire data set
Background: Data Sources • DIP (Jing Xia) • Database of Interacting Proteins • Most reliable data set • BIND (Abhijit Erande, Aaron Schoenhofer) • Biomolecular Interactions Network Databank • Very large data set • Contains interactions, molecular complexes, and pathways
Background: Data Sources • MINT • Molecular INTeraction database • Experimentally verified protein interactions • Evaluates confidence level • IntAct • Not limited to binary interactions • Allows user submissions • MIPS CYGD • Munich Information Center for Protein Sequences: Comprehensive Yeast Genome Database • Limited to yeast • Focuses on sequencing
Background: Tools • Protégé • Open-Source Project • Graphical ontology editor • Interacts with OWL Reasoner • Detailed API for modifying ontologies programmatically
Background: Tools • Prompt • A Protégé Plugin • Enables ontology mapping • Enables ontology comparison
Background: Related Work • PSI-MI • Controlled vocabulary for PPI data • Not a proposed database structure • Decreases the strength of information • Helpful in defining relationships and keys
Methodology: Overview • Web Interface • Q: What interactions have been observed with protein A? • Q: What experiments give evidence for a given interaction? • Unified Ontology • Unified Data Set • Transformation • DIP • BIND • MIPS • MINT • IntAct
Methodology: Design • Review the singular database schemata and determine strengths/weaknesses • View data files • Native formats • PSI-MI formats • Create a unified schema of the data sources • Create the unified ontology in Protégé • Create each singular database as a subset of the unified ontology
Methodology: Data Import • DOMParser • Load data from XML • Protégé-OWL API • Insert entities into singular databases
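The import step above (parse XML with a DOM parser, then create one entity per interaction) can be sketched as follows. This is a minimal illustration in Python using the standard-library `xml.dom.minidom`, not the project's own DOMParser and Protégé-OWL API code. The XML layout, the element and attribute names, and the `load_interactions` helper are assumptions made for the example; the fragment is deliberately simplified and is not the real PSI-MI schema.

```python
from xml.dom import minidom

# Hypothetical, simplified input. NOT the real PSI-MI schema.
SAMPLE_XML = """<interactions>
  <interaction id="I1" source="DIP">
    <participant>P12345</participant>
    <participant>Q67890</participant>
  </interaction>
  <interaction id="I2" source="DIP">
    <participant>P12345</participant>
    <participant>P99999</participant>
  </interaction>
</interactions>"""


def load_interactions(xml_text):
    """Parse the XML into a DOM tree and return one dict per interaction."""
    doc = minidom.parseString(xml_text)
    records = []
    for node in doc.getElementsByTagName("interaction"):
        # Each <participant> element holds a single text node with a protein ID.
        participants = [
            p.firstChild.data.strip()
            for p in node.getElementsByTagName("participant")
        ]
        records.append({
            "id": node.getAttribute("id"),
            "source": node.getAttribute("source"),
            "participants": participants,
        })
    return records


records = load_interactions(SAMPLE_XML)
for record in records:
    print(record["id"], record["source"], record["participants"])
```

In the project itself, each parsed record would presumably become an individual in the ontology for that single database rather than a plain dictionary.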
Methodology: Transformation • Use Prompt to create a mapping for each specific data source to the unified ontology • Use Prompt mappings to insert individuals from each singular ontology into the unified model
Methodology: Transformation • Duplicate Data • Need to fill in attributes on existing records • Write ‘Algorithm Plugin’ for Prompt to determine when individuals are the same
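One way to picture the duplicate-handling step is the sketch below. It assumes, purely for illustration, that two individuals are "the same" when they involve the same unordered set of proteins; the slides do not state the actual rule used by the Prompt algorithm plugin. The record fields (`participants`, `source`, `method`) and both function names are hypothetical.

```python
def interaction_key(record):
    """Assumed identity rule: same unordered set of participating proteins."""
    return frozenset(record["participants"])


def merge_records(records):
    """Merge duplicates, filling in attributes missing on the existing record."""
    merged = {}
    for rec in records:
        key = interaction_key(rec)
        if key not in merged:
            merged[key] = {
                "participants": sorted(rec["participants"]),
                "sources": [rec["source"]],
                "method": rec.get("method"),
            }
            continue
        existing = merged[key]
        if rec["source"] not in existing["sources"]:
            existing["sources"].append(rec["source"])
        # Fill the gap only; never overwrite a value that is already present.
        if existing["method"] is None:
            existing["method"] = rec.get("method")
    return list(merged.values())


sample = [
    {"participants": ["A", "B"], "source": "DIP"},
    {"participants": ["B", "A"], "source": "MINT", "method": "two hybrid"},
    {"participants": ["A", "C"], "source": "IntAct", "method": "coimmunoprecipitation"},
]
unified = merge_records(sample)
for entry in unified:
    print(entry)
```

Keeping the list of contributing sources on each merged record also supports the stated objective of evaluating reliability across data sources, since an interaction reported by several databases can be told apart from one reported by a single database.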
Methodology: Query Interface • Export Protégé data into MySQL • Web interface for collecting data • Working with domain experts to determine useful views, queries
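The two example questions from the overview (interactions observed with a given protein, and experiments giving evidence for an interaction) could look roughly like this once the data is in a relational store. The sketch uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The table names, column names, and sample rows are all assumptions for illustration; the slides do not describe the exported schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database

# Hypothetical schema: one row per interaction, one row per piece of evidence.
conn.executescript("""
CREATE TABLE interaction (
    id INTEGER PRIMARY KEY,
    protein_a TEXT,
    protein_b TEXT
);
CREATE TABLE evidence (
    interaction_id INTEGER REFERENCES interaction(id),
    source TEXT,
    method TEXT
);
""")
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?)",
    [(1, "A", "B"), (2, "A", "C"), (3, "B", "D")],
)
conn.executemany(
    "INSERT INTO evidence VALUES (?, ?, ?)",
    [(1, "DIP", "two hybrid"),
     (1, "MINT", "coimmunoprecipitation"),
     (2, "IntAct", "two hybrid")],
)


def partners_of(conn, protein):
    """Q: What interactions have been observed with a given protein?"""
    rows = conn.execute(
        "SELECT protein_b FROM interaction WHERE protein_a = ? "
        "UNION "
        "SELECT protein_a FROM interaction WHERE protein_b = ? "
        "ORDER BY 1",
        (protein, protein),
    ).fetchall()
    return [row[0] for row in rows]


def evidence_for(conn, interaction_id):
    """Q: What experiments give evidence for a given interaction?"""
    return conn.execute(
        "SELECT source, method FROM evidence "
        "WHERE interaction_id = ? ORDER BY source",
        (interaction_id,),
    ).fetchall()


print(partners_of(conn, "A"))
print(evidence_for(conn, 1))
```

The `UNION` covers the protein appearing on either side of a stored pair, which matters because an interaction between two proteins has no inherent direction.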
Evaluation • Performance • Transformation Time in Protégé • Query Time for Web Interface • Size • Minimize redundancy in data model • Minimize duplicate data
Evaluation • Correctness • Domain Experts • Dr. Brown, Dr. Wang • Maintain proper data relationships • Utility • Enrich data
Future Work • Complete transformations • Import data • Evaluate ontology • Add other databases to model
Conclusions • Adequate start • Needs improvement, evolution, more data sources • As the project matures, the ontology will be ready for use in the biological domain • Will be able to more easily gain information about protein-protein interactions
References • AAAI.org - AITopics: “Ontology” • http://www.aaai.org/AITopics/html/ontol.html • Protégé • http://protege.stanford.edu/overview/protege-owl.html • Prompt • http://protege.cim3.net/cgi-bin/wiki.pl?Prompt • PSI-MI • http://psidev.sourceforge.net/mi/xml/doc/user
References • BIND • http://www.bind.ca • DIP • http://www.dip.doe-mbi.ucla.edu • IntAct • http://www.ebi.ac.uk/intact/site/ • MINT • http://mint.bio.uniroma2.it/mint/Welcome.do • MIPS • http://mips.gsf.de/genre/proj/yeast