Computations using pathways and networks

Nigam Shah, nigam@stanford.edu

The GOAL = Making sense of high throughput data

High throughput data • “High throughput” is a fuzzy term that is rarely defined precisely • Genomics data is considered high throughput if: • You cannot “look” at your data to interpret it • Generally speaking, this means ~1000 or more genes and 20 or more samples • There are about 40 different high throughput genomics data generation technologies • DNA, mRNA, proteins, metabolites … all can be measured

How does ontology help? • An ontology provides an organizing framework for creating “abstractions” of the high throughput data • The simplest ontologies (i.e. terminologies, controlled vocabularies) provide the most bang-for-the-buck • Gene Ontology (GO) is the prime example • More structured ontologies, such as those that represent pathways and other higher-order biological concepts, still have to demonstrate real utility.

Gene Ontology to analyze microarray data

Using GO annotations

Descriptions built by connecting/linking ontology terms Biologists interpret a list of genes and form a result statement such as: The photosynthesis genes located in the chloroplast are repressed in response to ozone stress and have the ABRE binding site enriched in their promoters.
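
A result statement such as "the ABRE binding site is enriched" usually rests on an over-representation test. The sketch below assumes a one-sided hypergeometric test, which the slides do not name. The function name and all counts are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment_p(population: int, annotated: int, selected: int, hits: int) -> float:
    """P(X >= hits) when `selected` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1000 genes on the array, 50 annotated to the term,
# 40 genes in the repressed list, 10 of which carry the annotation.
p_value = hypergeom_enrichment_p(population=1000, annotated=50, selected=40, hits=10)
print(f"enrichment p-value: {p_value:.3g}")
```

With these made-up counts about 2 annotated genes would be expected by chance, so 10 hits gives a small p-value. A real analysis would also correct for testing many terms.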

…more structure [Figure labels: OBOL; Relations Ontology; ?<link>?; <Some MF> in <Some BP>]

Between-ontology structure

… more structure [beyond GO]: PATO • The building blocks of phenotype descriptions: EQ • Entity (bearer), such as spermatocyte, wing • Quality (property, attribute): a kind of dependent continuant • Formally, an EQ description defines a Quality which inheres_in a bearer entity • The building blocks are combined according to the Pheno-syntax: www.fruitfly.org/~cjm/formats
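
The EQ building blocks above can be sketched as a small data structure. This is a minimal illustration, not the Pheno-syntax itself. The class names `Term` and `EQ`, the ontology labels, and the quality label are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    ontology: str  # hypothetical label for the source ontology
    label: str     # human-readable term name

@dataclass(frozen=True)
class EQ:
    entity: Term   # the bearer, e.g. wing or spermatocyte
    quality: Term  # the property that inheres in the bearer

    def describe(self) -> str:
        # Spell out the relation named on the slide: Quality inheres_in Entity.
        return f"{self.quality.label} inheres_in {self.entity.label}"

wing = Term(ontology="anatomy", label="wing")
small = Term(ontology="PATO", label="decreased size")  # illustrative quality label
phenotype = EQ(entity=wing, quality=small)
print(phenotype.describe())
```

Keeping entity and quality as separate terms is what lets one quality be reused across many bearers.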

Semantically structured annotations HOW WHY

Open Questions/Challenges • Creation/acceptance of a systematic formalism for creating expressive annotations (e.g. associated_with, involves) • A generic tool that uses ontologies and allows the user to compose terms and cross-ontology annotations • Easy term/annotation composition • Control over the number of alternative [compositional] statements allowed

Pathways to analyze array data

“Pathways” to analyze array data • The notion of a cancer signaling pathway can serve as an organizing framework for interpreting microarray expression data. • On examining a relatively small set of genes based on prior biological knowledge about a given pathway, the analysis becomes more specific.
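
Restricting the analysis to a pathway's genes can be sketched as a simple filter. The gene names, log ratios, and the helper `pathway_view` are all hypothetical. Real pathway membership would come from a curated resource.

```python
from statistics import mean

# Hypothetical log2 expression ratios (treated vs. control) for genes on an array.
log_ratios = {"geneA": 1.8, "geneB": 2.1, "geneC": -0.2, "geneD": 0.1, "geneE": 1.5}

# Hypothetical membership list for one pathway, taken from prior knowledge.
pathway_genes = {"geneA", "geneB", "geneE", "geneZ"}  # geneZ is not on the array

def pathway_view(ratios, members):
    """Keep only the measured genes that belong to the pathway."""
    return {gene: ratio for gene, ratio in ratios.items() if gene in members}

subset = pathway_view(log_ratios, pathway_genes)
print(sorted(subset))
print(round(mean(subset.values()), 2))
```

The subset is much smaller than the full array, which is why conclusions drawn from it are more specific to the pathway.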

Reactome’s sky painter

Operations on pathway resources [1] A case study in pathway knowledgebase verification, BMC Bioinformatics 2006, 7:196 [2] Pathway Knowledge Base: An Integrated pathway resource using BioPAX, Submitted to Applied Ontology

Merge and compare pathway resources • Given a set of ‘nodes’ and some ‘links’ among them, query multiple pathway sources and fill in the most plausible interactions between the nodes. • Plausible = not contradicted by existing data and knowledge • Current pathway resources [in BioPAX] cannot support this because the manner in which ‘nodes’ and ‘links’ are identified is arbitrary. • Reactome has started to connect the pathway steps with GO biological processes. • BioPAX lets pathway sources “export” their nodes and links. • …but p53 in resource A is still different from P53 in resource B • … and Activate in resource A is still different from activates in resource B
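
The identifier problem in the last two bullets can be shown in a few lines. This sketch assumes each resource exports (node, link, node) triples and that a shared lookup table exists. The tables `NODE_IDS` and `LINK_IDS` and their identifiers are hypothetical. Lower-casing names is a crude stand-in for real curation.

```python
# Hypothetical exports from two pathway resources: (source node, link, target node).
resource_a = [("p53", "Activate", "geneX")]
resource_b = [("P53", "activates", "GeneX")]

# Hypothetical lookup tables that map local names to shared identifiers.
NODE_IDS = {"p53": "EX:0001", "genex": "EX:0002"}
LINK_IDS = {"activate": "EX:activates", "activates": "EX:activates"}

def normalize(edge):
    """Rewrite one exported triple in terms of the shared identifiers."""
    src, link, dst = edge
    return (NODE_IDS[src.lower()], LINK_IDS[link.lower()], NODE_IDS[dst.lower()])

# Compared as raw strings, the two resources share nothing.
raw_overlap = set(resource_a) & set(resource_b)

# After mapping to shared identifiers, the same interaction is recognized.
mapped_overlap = {normalize(e) for e in resource_a} & {normalize(e) for e in resource_b}

print(len(raw_overlap), len(mapped_overlap))
```

A common export format alone gives the first result. Merging only works once both nodes and links resolve to the same identifiers.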

Problem • I have no clue what a pathway is! • A set or series of interactions, often forming a network, which biologists have found useful to group together for organizational, historic, biophysical or other reasons. • The complexity and abstraction represented in a pathway is decided by its author attempting to represent the interactions between a set of genes, proteins, and small molecules.

“Networks” to analyze high throughput genomic data

Building networks • Take a high throughput dataset • Define a notion of ‘relatedness’ depending on the dataset • Co-expression for microarray data • Co-occurrence for literature networks • … • List [node]--<link>--[node] pairs • Find a good graph drawing program!
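
The recipe above can be sketched for the co-expression case. This example assumes Pearson correlation as the notion of relatedness and an arbitrary cutoff of 0.9. The slides prescribe neither. The expression profiles are made up, and constant profiles (zero variance) are not handled.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical expression profiles: gene -> values across four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 1.0],
}

def coexpression_edges(data, threshold=0.9):
    """List (node, link, node) triples for gene pairs correlated above the cutoff."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        if pearson(data[a], data[b]) >= threshold:
            edges.append((a, "co-expressed", b))
    return edges

edges = coexpression_edges(profiles)
for a, link, b in edges:
    print(f"[{a}]--<{link}>--[{b}]")
```

The edge list is the input a graph drawing program needs. Lowering the cutoff adds edges quickly, which is how hairballs arise.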

Nice hairball but … From Long et al, in Trends in Biochemical Sciences, vol 32, no 7. From Srinivasan et al, in Briefings in Bioinformatics August 2007. Srinivasan B, Snow R, Shah N and Batzoglou S in Interactome Networks conference @ CSHL

Hypotheses/Models to analyze high throughput genomic data

Events and Implicit claims • A hypothesis is a statement about relationships (among objects) within a biological system, e.g. “Protein P induces transcription of gene X” • An ‘event’ is a relationship between two biological entities [diagram labels: P, promoter, gene X] • Implicit claims that can be tested: • P is a transcription factor. • P is a transcriptional activator. • P is localized to the nucleus. • P can bind to the promoter of gene X
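
One way to make the implicit claims explicit is a rule table keyed by the event's relationship. This is a sketch only. The table `IMPLICIT_CLAIMS` and the function `implicit_claims` are hypothetical names, and the single rule simply restates the four claims listed on the slide.

```python
# Hypothetical rule table: event relationship -> templates for the implied claims.
IMPLICIT_CLAIMS = {
    "induces transcription of": [
        "{a} is a transcription factor",
        "{a} is a transcriptional activator",
        "{a} is localized to the nucleus",
        "{a} can bind to the promoter of {b}",
    ],
}

def implicit_claims(agent_a, operator, agent_b):
    """Expand one event into the testable claims it implies (empty if no rule exists)."""
    templates = IMPLICIT_CLAIMS.get(operator, [])
    return [template.format(a=agent_a, b=agent_b) for template in templates]

claims = implicit_claims("P", "induces transcription of", "gene X")
for claim in claims:
    print(claim)
```

Each expanded claim can then be checked separately against existing data.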

Representing Events Explicitly A hypothesis consists of at least one event stream. An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them. An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes ‘where’ the event happened, the genetic context of the organism, and the associated experimental perturbations when the event happened. A logical joint is the conjunction between two event streams.
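
The definitions above map naturally onto a few record types. The sketch below is one possible encoding, not the representation used by any particular tool. The class names, the default values, and the joint label "and" are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Event:
    agent_a: str                      # exactly one agent_a
    operator: str                     # exactly one relationship between the agents
    agent_b: str                      # exactly one agent_b
    location: str = "unspecified"     # where the event happened (assumed default)
    genetic_context: str = "unspecified"
    perturbations: List[str] = field(default_factory=list)

@dataclass
class EventStream:
    parts: List[Union[Event, "EventStream"]]          # events or nested streams
    joints: List[str] = field(default_factory=list)   # one joint between neighbors

    def __post_init__(self):
        if not self.parts:
            raise ValueError("an event stream needs at least one event")
        if len(self.joints) != len(self.parts) - 1:
            raise ValueError("need one logical joint between consecutive parts")

@dataclass
class Hypothesis:
    streams: List[EventStream]

    def __post_init__(self):
        if not self.streams:
            raise ValueError("a hypothesis needs at least one event stream")

ev1 = Event("P", "induces transcription of", "gene X", location="nucleus")
ev2 = Event("X", "binds", "Y")
hypothesis = Hypothesis([EventStream(parts=[ev1, ev2], joints=["and"])])
print(len(hypothesis.streams[0].parts))
```

The constructor checks enforce the cardinalities stated in the text, so a malformed hypothesis fails early instead of during evaluation.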

User interfaces Hypothesis described in Natural Language Biological process described in a formal language

Evaluating a hypothesis

A. Representation of a hypothesis in terms of events (ev = event) B. Holding the mouse on a neighboring hypothesis (b1) shows what event was replaced to create it C. Plot of the support versus conflicts for submitted and neighboring hypotheses (n1, b1). Clicking on n1 submits that hypothesis as the ‘seed’
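
The idea of neighboring hypotheses and a support-versus-conflict comparison can be sketched as follows. This is a toy illustration under strong assumptions. Events are plain tuples, a neighbor differs from the seed in one event's operator, and support and conflict are simple membership counts against hypothetical evidence sets. None of this is taken from the tool's actual scoring.

```python
# Events as (agent_a, operator, agent_b) tuples; a hypothesis is a tuple of events.
seed = (("P", "induces", "X"), ("X", "binds", "Y"))

# Hypothetical evidence: events that existing data supports or contradicts.
SUPPORTED = {("P", "induces", "X"), ("X", "represses", "Y")}
CONFLICTING = {("X", "binds", "Y")}

# Hypothetical set of operators to try when replacing an event.
OPERATORS = ["induces", "represses", "binds"]

def score(hypothesis):
    """Return (support, conflicts) as counts of events found in each evidence set."""
    support = sum(event in SUPPORTED for event in hypothesis)
    conflicts = sum(event in CONFLICTING for event in hypothesis)
    return support, conflicts

def neighbors(hypothesis):
    """Yield hypotheses that differ from the input in exactly one event."""
    for i, (a, op, b) in enumerate(hypothesis):
        for alt in OPERATORS:
            if alt != op:
                yield hypothesis[:i] + ((a, alt, b),) + hypothesis[i + 1:]

# Rank neighbors: fewest conflicts first, then most support.
ranked = sorted(neighbors(seed), key=lambda h: (score(h)[1], -score(h)[0]))
best = ranked[0]
print(score(seed), score(best))
```

Resubmitting `best` as the new seed mirrors clicking a neighboring hypothesis in the plot.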

HyBrow: lessons learnt • The minimum requirement for a formal representation: • Ability to represent data → information → knowledge • A language to unambiguously express your “thought experiment” (your model, hypothesis, theory, theorem etc.) • A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment • Project home page: www.hybrow.org

Pathways as “models”? • Pathways are assumed to be models representing biological processes, without actually knowing the modeling formalism in which the model is valid. • The ‘language’ of writing out a pathway doesn’t really have a grammar and/or a logic • Most pathways end up being lists of heterogeneous sets of “steps” (in terms of the time of execution, the place of execution, the abstraction level, the kind of ‘thing’ passed along etc…) • Lots of discussion on requirements of data providers, where are the users/consumers and their use cases?

Claims • Pathways are useful only if they can serve as “models” [accurate representations] of a process • Hence whatever needs to be done to ensure that a pathway is a valid model in at least one formalism should be required of the pathway author. • A pathway representation that doesn’t solve the problem of uniquely identifying entities doesn’t solve the problem of integrating pathways. • We just end up with marked-up, structured information from multiple providers, without actually integrating anything.

Success of projects in the Biomedical domain [Figure: two axes, KR complexity (minimal to high) and computational complexity (minimal to high)]
