BioGrid: Protein Interaction & Gene Expression Modeling

Overview IST-2001-38344

Cells are a collection of protein nanomachines

A biological challenge
• To build models of protein complexes & understand the function of each component, based upon available evidence.
• However, to build evidence for each protein interaction, a biologist must find, integrate, compare & then validate the results from a number of separate resources.

[Figure: Genomics & Proteomics, from DNA to protein structures. Labels: DNA ‘chips’, modelling, HTP sequencing, SNP, gene prediction, proteomics, domain analysis, synchrotron, expression, folding.]

[Figure: Genomics & Proteomics divided into Interaction Space, Expression Space, and Literature Space.]

The need for computerised information systems
• New HTP methods produce orders of magnitude more data than before:
  • More than is interpretable manually.
  • Data are stored in a (semi-)structured format.
• Much knowledge is in literature & patents:
  • 13,000,000 abstracts in MEDLINE.
  • Knowledge is stored in an unstructured format.
• Solution: computerised information systems:
  • Enable data mining & visualisation of integrated resources, with text analysis.

Components of bioGrid
• Gene expression:
  • ExpressionSpace:
    • Clusters microarray data.
    • May require large memory.
• Protein interaction:
  • PSIMAP:
    • Predicts interactions between protein domains.
    • Can be pre-computed, as the data are relatively unchanging.
• Literature:
  • GoPubMed-D:
    • Organises a corpus of documents according to the Gene Ontology (GO).
    • Lexical analysis requires lengthy computation.

bioGrid: An integrated platform for gene expression data, protein interaction data, and literature
[Figure: Expression Space (Space Explorer), Interaction Space (PSIMAP), and Literature Space (Classification Server).]

Workflow for use case - Part I
• Search literature for papers about the experimental system studied:
  • Microarray & mitochondria.
• Upload the gene expression data set.
• Cluster the gene expression data set.
• Identify a cluster that contains genes of interest, e.g. energy production.
• Examine the expression profiles of the genes in the cluster.
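The clustering and cluster-selection steps above can be sketched as follows. This is a minimal illustration only: the slides do not say which clustering method ExpressionSpace uses, so plain k-means is an assumption here, and the gene names, expression values, and function names (`kmeans`, `cluster_containing`) are all hypothetical.

```python
import math

def kmeans(profiles, centroids, iterations=10):
    """Group expression profiles (gene -> list of floats) around starting centroids.

    k-means is an assumed stand-in; the real ExpressionSpace method is not stated.
    """
    centroids = [list(c) for c in centroids]
    assignment = {}
    for _ in range(iterations):
        # Assign each gene to its nearest centroid (Euclidean distance).
        assignment = {
            gene: min(range(len(centroids)),
                      key=lambda k: math.dist(values, centroids[k]))
            for gene, values in profiles.items()
        }
        # Recompute each centroid as the mean profile of its members.
        for k in range(len(centroids)):
            members = [profiles[g] for g, c in assignment.items() if c == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cluster_containing(assignment, gene):
    """Return all genes that share a cluster with the gene of interest."""
    label = assignment[gene]
    return sorted(g for g, c in assignment.items() if c == label)

# Hypothetical data: four genes measured under three conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 2.0, 1.0],
    "geneD": [2.9, 1.9, 1.2],
}
assignment = kmeans(profiles, centroids=[profiles["geneA"], profiles["geneC"]])
cluster_of_interest = cluster_containing(assignment, "geneA")
```

The resulting gene list is what would be carried into Part II of the workflow.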

Workflow for use case - Part II
• Calculate an induced PSIMAP graph for the genes in the expression cluster.
• Explore PSIMAP graph & nodes.
• For pairs of genes predicted to interact:
  • Search literature for papers citing both genes.
  • Classify literature to assess possible function or metabolic processes of genes.
• Assimilate evidence for components of a protein complex.
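As a rough sketch of the first and third steps above, an induced graph keeps only those predicted interactions whose two endpoints both lie in the expression cluster, and each surviving edge becomes one literature query. The sketch assumes the interaction map has already been projected from protein domains onto gene identifiers; the gene names, the `AND` query syntax, and both function names are hypothetical and are not taken from PSIMAP or GoPubMed-D.

```python
def induced_subgraph(edges, nodes):
    """Keep only edges whose two endpoints are both in `nodes`.

    Self-interactions are skipped, because a pair query needs two distinct genes.
    """
    nodes = set(nodes)
    return {
        frozenset((a, b))
        for a, b in edges
        if a != b and a in nodes and b in nodes
    }

def literature_queries(sub_edges):
    """Build one 'both genes cited' query string per predicted interaction."""
    return sorted(" AND ".join(sorted(edge)) for edge in sub_edges)

# Hypothetical predicted interactions and a hypothetical expression cluster.
predicted = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]
cluster = ["g1", "g2", "g3"]

sub_edges = induced_subgraph(predicted, cluster)
queries = literature_queries(sub_edges)
```

Edges touching `g4` are dropped because `g4` is outside the cluster, so only two queries remain to be sent to the literature search.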

Distributed technology implementation
• Globus, Unicore, Legion, …
  • Are geared towards computational complexity, not semantic complexity.
• BioGrid’s approach:
  • Agent-based approach.
  • Integration of rules, reasoning, and messaging in a Java environment.
  • Uses a meta-model.
• Advantage:
  • Easy to maintain, easy to use, includes code distribution, architecture independent, geared towards farms of local and remote machines.

Prova-AA
• Extensions to Prova for rule-based agent scripting.
• Prova-AA introduces:
  • Messaging (local, JMS, and JADE).
  • Reaction rules.
  • Context-dependent inline reactions for asynchronous messaging.
  • Embedding of Prova agents in Java and Web apps.
• Advantages:
  • Cooperating agents vs. GRID RPC.
  • Ease of development and maintenance.
  • Platform independence and portability.
  • High-level specification of communication protocols.
  • Native syntax integration with Java.
  • Low-cost creation of distributed workflows and ad-hoc networks of computation nodes.

Proposed Architecture of integrated platform

First results & infrastructure needs IST-2001-38344

Technology challenges in building bioGrid
• Computational complexity:
  • Generating a protein interaction map takes ca. 1 day.
  • Analysing large sets of gene expression data can take up to an hour.
  • Analysis of large text bodies is complex.
• Semantic complexity:
  • The computer does not “understand” data.
  • DBs and systems cannot inter-operate.

Distributed GoPubMed-D (2/3)
• The BioGrid prototype integrates with GoPubMed-D via an embedded Prova-AA JADE agent.

Distributed computation with Prova-AA agents
A flexible solution for a self-managing self-balancing distributed computation:
• Manager and Workers architecture based on Prova-AA agents with Java computation modules.
• Loosely synchronous interaction.
• Minimal compact coding (30 lines for Manager and 20 lines for Worker).
• Manager does not need to keep a registry of the Workers that can join in at any time.
• Computation is divided in small atomic subtasks (4 or 5 proteins).
• Manager dispatches a new subtask asynchronously upon receiving a ready message from a Worker.
• Worker computes a subtask and responds with the results in a reply message and a new ready message.
• Workers compute subtasks at their own pace so load balancing is automatic.
• Workers extended with routing capabilities are available.
• Can be easily extended with failover capabilities.
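The ready/dispatch/reply protocol described above can be illustrated with a small sketch. It is not Prova-AA code: Python threads and in-process queues stand in for the agents and their messaging, and the message tuples, the `analyse` placeholder, and all function names are assumptions made for illustration. Each ready message carries the Worker's own reply queue, so the Manager needs no registry of Workers.

```python
import queue
import threading

def split_into_subtasks(proteins, size=4):
    """Divide the protein list into small atomic subtasks of `size` proteins."""
    return [proteins[i:i + size] for i in range(0, len(proteins), size)]

def worker(manager_inbox, compute):
    inbox = queue.Queue()  # private reply channel, sent with every ready message
    manager_inbox.put(("ready", inbox))
    while True:
        message = inbox.get()
        if message is None:  # Manager has no work left
            return
        index, subtask = message
        # Reply with the result, then announce readiness for the next subtask.
        manager_inbox.put(("result", (index, compute(subtask))))
        manager_inbox.put(("ready", inbox))

def manager(subtasks, manager_inbox):
    pending = list(enumerate(subtasks))
    results = {}
    while len(results) < len(subtasks):
        kind, payload = manager_inbox.get()
        if kind == "result":
            index, value = payload
            results[index] = value
        elif pending:
            payload.put(pending.pop(0))  # dispatch only to a Worker that said "ready"
        else:
            payload.put(None)            # nothing left: tell this Worker to stop
    return [results[i] for i in range(len(subtasks))]

def analyse(batch):
    """Hypothetical stand-in for the real per-subtask computation."""
    return len(batch)

proteins = [f"P{i:02d}" for i in range(10)]  # hypothetical identifiers
subtasks = split_into_subtasks(proteins, size=4)
manager_inbox = queue.Queue()
for _ in range(3):
    # Daemon threads: Workers still idle when the Manager finishes end with the process.
    threading.Thread(target=worker, args=(manager_inbox, analyse), daemon=True).start()
results = manager(subtasks, manager_inbox)
```

Because a Worker only receives work after it reports ready, faster Workers naturally take more subtasks, which is the automatic load balancing the slide refers to. Failover, for example re-queuing a subtask whose result never arrives, is not shown.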

Building an information system for biology is non-trivial
• Molecular biology resources:
  • Are heterogeneous in content:
    • Genomics, proteomics, literature.
  • Exist in large numbers:
    • Public, commercial, organisational, personal.
  • Are of variable quality: curated vs. automatic.
  • Have different interfaces: Web, SQL, SOAP, etc.
  • Are geographically distributed w/o yellow pages.
  • Store data in different formats - few standards.
  • Change rapidly.
  • Are subject to confidentiality & IPR protection.
  • Are too large to transport conveniently.

Social challenges in building Grid
• Technology stability & reliability.
• Security.
• Usability.
• Peer-reviewed results in major biomedical journals:
  • Science, Nature, Cell, BMJ, Lancet, etc.
