Proteome Data Integration: Characteristics and Challenges

Proteome data integration: characteristics and challenges
K. Belhajjame1, R. Cote4, S.M. Embury1, H. Fan2, C. Goble1, H. Hermjakob, S.J. Hubbard1, D. Jones3, P. Jones4, N. Martin2, S. Oliver1, C. Orengo3, N.W. Paton1, M. Pentony3, A. Poulovassilis2, J. Siepen, R.D. Stevens1, C. Taylor4, L. Zamboulis2, and W. Zhu4
1University of Manchester, 2Birkbeck College, 3University College London, 4European Bioinformatics Institute

Outline
• Experimental proteomics
• ISPIDER architecture
• Example use cases
• Conclusion
All Hands Meetings, 2005

Experimental proteomics
• An essential component for elucidation of the biological functions of proteins
• The study of the set of proteins produced by an organism, with the aim of understanding their behaviour under varying conditions
[Figure: protein identification pipeline. Separation (2D gel electrophoresis), protein digestion (enzymatic digestion), mass spectrometry (MALDI-TOF), identification against a protein DB, protein ID.]
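The identification step in the pipeline above can be illustrated with a minimal peptide mass fingerprinting sketch. It is not taken from the slides: the function names, the toy database and the match-counting score are assumptions made for illustration, and real identification tools use considerably more sophisticated scoring. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """In-silico trypsin digestion: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        last = i == len(sequence) - 1
        if last or (residue in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def rank_candidates(observed_masses, database, tolerance=0.5):
    """Rank proteins by the number of observed masses they can explain.

    A deliberately naive score (hypothetical, for illustration only).
    """
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(obs - theo) <= tolerance for theo in theoretical)
            for obs in observed_masses
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database and simulated spectrum peaks.
toy_db = {"protA": "AAKGGRPLK", "protB": "MSTWK"}
observed = [peptide_mass("AAK"), peptide_mass("GGRPLK")]
ranking = rank_candidates(observed, toy_db)
```

In this sketch `ranking` places `protA` first, since both simulated peaks match its tryptic peptides.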

Experimental proteomics
• Development of new technologies for:
  • protein separation (2D-SDS-PAGE, HPLC, capillary electrophoresis)
  • mass spectrometry (multi-dimensional protein identification)
• Availability of publicly accessible protein sequence databases
• Proteomics databases (PedroDB, gpmDB, PepSeeker, PRIDE, …)
Building experiments involves the orchestration of analysis services, data processing and integration.

Objectives of ISPIDER
A Grid dedicated to the creation of bioinformatics experiments for proteomics
• Develop proteome databases and services that are Grid-enabled, or make existing ones Grid-enabled
• Develop middleware support for developing and executing new proteome analyses, based on distributed query processing and workflow technologies
• Undertake proteomic studies that demonstrate the effectiveness of the resulting infrastructure

Outline
• Experimental proteomics
• ISPIDER architecture
• Example use cases
• Conclusion and future directions

ISPIDER architecture
[Figure: layered architecture diagram. Labels recoverable from the slide:
• ISPIDER Proteomics Clients: Vanilla Query Client, 2D Gel Visualisation Client, PPI Validation + Analysis Client, Protein ID Client, + Phosph. Extensions, + Aspergil. Extensions
• ISPIDER Proteomics Grid Infrastructure (web services): Source Selection Services, Data Cleaning Services, Proteome Request Handler, Proteomic Ontologies/Vocabularies, Instance Ident/Mapping Services
• Existing E-Science Infrastructure: myGrid Ontology Services, myGrid DQP, myGrid Workflows, AutoMed
• Public Proteomics Resources, each behind a WS: PRIDE, PF, GS, TR, PS, FA, PPI, Phos, PID, PEDRo
• Legend: ISPIDER Resources, Existing Resources]
KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package

Value-added protein datasets
• Motivation
  Protein identification experiments are usually used as input into further analysis processes:
  • Gathering evidence for a biological hypothesis
  • Suggesting new hypotheses
• Objective
  Augment the identification results with additional information on the identified protein
• Implementation
  Taverna workflow system

Value-added protein datasets
[Figure: workflow diagram. Labelled components: PepMapper Web Service, Auxiliary Services, GO Services.]
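The objective of this use case, augmenting identification results with additional information, can be sketched as a simple join between identification hits and an annotation lookup. This is an illustrative assumption, not the Taverna workflow from the slides: the `augment` function, the record fields and the accessions and terms are all hypothetical placeholders standing in for service calls.

```python
def augment(identifications, annotations):
    """Attach annotation terms to each identified protein.

    Proteins without annotations get an empty list, so no hit is dropped.
    """
    return [
        {**hit, "annotations": annotations.get(hit["accession"], [])}
        for hit in identifications
    ]


# Hypothetical identification results and annotation lookup.
hits = [
    {"accession": "DEMO_1", "score": 87.0},
    {"accession": "DEMO_2", "score": 41.5},
]
demo_annotations = {"DEMO_1": ["demo-term-a", "demo-term-b"]}

enriched = augment(hits, demo_annotations)
```

Keeping unannotated hits in the output, rather than filtering them, preserves the full identification result for later analysis steps.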

Genome-focused protein identification
• Motivation
  Currently, protein identification searches are performed over large data sets. This means fewer false negatives, but false positives are also more likely.
• Objective
  More focused, and thus more efficient, protein identification
• Implementation
  • Taverna workflow system
  • DQP, a service-based query processor

Genome-focused protein identification
Query shown on the slide:
  select p.Name, p.Seq from p in db_proteinSequences where p.OS='HomoSapiens';
[Figure: workflow diagram. Labelled components: PepMapper web service, DQP Web Service, GOA Web Service, IPI.]
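The query on this slide restricts the protein sequence collection to one organism before identification is run. A rough in-memory Python equivalent is sketched below; the `ProteinEntry` type, the helper name and the toy entries are assumptions for illustration, and only the field names `Name`, `Seq` and `OS` come from the query itself.

```python
from typing import NamedTuple


class ProteinEntry(NamedTuple):
    Name: str  # protein name
    Seq: str   # amino acid sequence
    OS: str    # organism of origin


def select_by_organism(entries, organism):
    """Mirror of: select p.Name, p.Seq from p in entries where p.OS = organism."""
    return [(p.Name, p.Seq) for p in entries if p.OS == organism]


# Hypothetical toy collection standing in for db_proteinSequences.
db_protein_sequences = [
    ProteinEntry("demoProt1", "AAKGGRPLK", "HomoSapiens"),
    ProteinEntry("demoProt2", "MSTWK", "OtherOrganism"),
    ProteinEntry("demoProt3", "GGK", "HomoSapiens"),
]

human_only = select_by_organism(db_protein_sequences, "HomoSapiens")
```

Searching only `human_only` instead of the full collection shrinks the candidate set, which is the sense in which the slides describe the identification as more focused.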

Integrated access to proteome databases
• Motivation
  The ability to analyse existing proteomics results en masse is limited because of the heterogeneities between the schemas of the different databases
• Objective
  Provide integrated access to proteome databases through a common schema
• Implementation
  • AutoMed, a framework for mapping heterogeneous schemata
  • DQP, a service-based query processor

Integrated access to proteome databases
[Figure: architecture diagram. Labelled components: User query and Result; AutoMed Query Processor; AutoMed Repository; AutoMed Wrappers; AutoMed DQP Wrapper; OGSA Distributed Query Processor (OQL query, OQL result); three OGSA-DAI Activities; data sources gpmDB, PedroDB and PRIDE.]
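The idea of querying several databases through a common schema can be sketched with per-source field mappings. This does not reproduce AutoMed's mapping mechanism, which the slides do not describe; the source names, field names and records below are hypothetical and chosen only to show the principle.

```python
# Hypothetical mappings from each source's local field names to a common schema.
FIELD_MAPPINGS = {
    "source_a": {"protein_acc": "accession", "pep_seq": "peptide"},
    "source_b": {"id": "accession", "sequence": "peptide"},
}


def to_common(source, record):
    """Rename a source-specific record into the common schema."""
    mapping = FIELD_MAPPINGS[source]
    common = {target: record[local] for local, target in mapping.items()}
    return {**common, "source": source}


def integrated_query(sources, predicate):
    """Evaluate one predicate over all sources, expressed on the common schema."""
    results = []
    for source, records in sources.items():
        for record in records:
            row = to_common(source, record)
            if predicate(row):
                results.append(row)
    return results


# Hypothetical source contents with heterogeneous schemas.
sources = {
    "source_a": [{"protein_acc": "DEMO_1", "pep_seq": "AAK"}],
    "source_b": [
        {"id": "DEMO_1", "sequence": "GGRPLK"},
        {"id": "DEMO_2", "sequence": "MSTWK"},
    ],
}

rows = integrated_query(sources, lambda row: row["accession"] == "DEMO_1")
```

The caller writes the predicate once against `accession` and `peptide`, without knowing how each source names those fields.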

Conclusions
• Available e-science technologies provide rapid prototyping facilities for bioinformatics analyses
• Combining such technologies is possible and opens up more possibilities
  • Taverna + DQP
  • AutoMed + DQP
• Writing custom code is usually required
  • Processing service output to extract inputs for following services
  • Transforming results between data formats
  • Dealing with mismatches between identifiers
• Developing a user-guided environment for the detection and resolution of mismatches
• Development of proteomics client applications (PepMapper, PepSeeker and PRIDE)
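The custom code mentioned above for dealing with identifier mismatches might look like the following sketch. The normalisation rules (database prefix, version suffix, letter case) and the example identifiers are assumptions for illustration; the slides do not say which mismatches the project actually encountered.

```python
import re


def normalise_id(raw):
    """Strip whitespace, an optional 'db:' or 'db|' prefix and a '.N' version suffix."""
    identifier = raw.strip()
    identifier = re.sub(r"^[A-Za-z]+[:|]", "", identifier)  # drop database prefix
    identifier = re.sub(r"\.\d+$", "", identifier)          # drop version suffix
    return identifier.upper()


def match_records(left_ids, right_ids):
    """Pair up raw identifiers from two services that normalise to the same key."""
    right_by_key = {normalise_id(r): r for r in right_ids}
    return {
        normalise_id(l): (l, right_by_key[normalise_id(l)])
        for l in left_ids
        if normalise_id(l) in right_by_key
    }


# Hypothetical outputs of two services that label the same proteins differently.
service_one = ["demo:abc123.2", "demo:xyz999.1"]
service_two = ["ABC123", "QRS555"]

matched = match_records(service_one, service_two)
```

Here `matched` pairs `demo:abc123.2` with `ABC123`, while identifiers with no counterpart are left out for manual inspection.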
