Chitta Baral Arizona State University

A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks Chitta Baral Arizona State University

Three parts to the talk • Prediction, Explanation and Planning with respect to biochemical networks • Hypothesis Generation with respect to biochemical networks • Collaborative BioCuration: CBioC

Motivation: purpose of interaction databases? • Suppose: We have an almost exhaustive database of the intracellular interactions (protein-protein, metabolic, etc.) of particular cells. • What next? • How will we use this database? • What if our knowledge is incomplete?

Motivation: Uses of networks & pathways • Visualize the pathways • Analyze the graphs of the networks • Compare graphs of the networks • Use pathway data in conjunction with micro-array data analysis • Do system level simulation • Is that all?

Motivation: ultimate uses! • Prediction/System Simulation (Systems Biology?) • Impact of particular perturbations (say caused by a drug that introduces certain proteins to the cell membrane or into the cell) • Do the perturbations have the desired impact? • Do they mess up something else? (side effects!) • But that’s not all!

Motivation: Explaining observations • A phenotypical observation (leading to) OR • an observation that a particular protein or chemical has abnormally high concentration • What is wrong? What is out of the ordinary? • The cause/explanation will give us approaches to fix the problem. • How deep should the explanations go? • How do we compare explanations?

Motivation: Designing drugs & therapies • What perturbations (when and where) need to be made so as to make the cell behave in a particular way? • In case of cancer: prevent proliferation, induce apoptosis, prevent migration, etc.

What if knowledge is incomplete? • What kind of useful reasoning can we do with incomplete knowledge? • Drug makers don’t wait till full knowledge is available. • Answer: hypothesis formation

Motivation: Use summary • The ultimate uses of signaling (metabolic, etc.) interaction databases are: • Prediction: therapy verification; determining side effects. • Explanation: diagnosing what is wrong. • Planning: therapy and drug design. • Intermediate or immediate use: • Generating hypotheses

Initial goal of our research • Use knowledge representation and reasoning techniques to: • Represent interactions • Reason about these interactions: prediction, explanation, planning and hypothesis formation.

Some questions • Isn’t it a little premature? • We know very little about the networks • New knowledge is being constantly added • Why knowledge representation and reasoning? • Why not simulation? • Why not use Petri nets or π-calculus? • Why a knowledge-based approach? Why not a database approach? What’s the difference?

Our approach: present and future • Yes, prediction is much the same as simulation. • Incompleteness of information is an issue, though! • But explanation generation and therapy design (planning) are hard to do using simulation, although guesses can be verified using simulation. • The core database query languages cannot express explanation or planning queries. • Dealing with incompleteness!

Dealing with incompleteness: ongoing and future work • Handling incompleteness is one of the key criteria for a “good” knowledge representation language when building AI systems. • The language needs to be non-monotonic. • The language needs to be elaboration tolerant. • Proper analysis leads to hypothesizing. • If certain observations cannot be satisfactorily explained by the existing knowledge about the network, then use general biological knowledge to hypothesize.

Motivation: summary • Goal: To emulate the abstract reasoning done by biologists, medical researchers, and pharmacology researchers. • Types of reasoning: prediction, explanation, planning and hypothesis formation. • Current systems biology approaches: mostly prediction. • Ongoing issues: Dealing with incomplete knowledge and elaboration tolerance.

Related Works • Quantitative approaches (hybrid systems, use of differential equations) • Graphical representations • Other qualitative approaches: • Petri Nets • π-calculus • Pathway Logic • Model Checking

Overview of our approach • Represent signal network as a knowledge base that describes • actions/events (biological interactions, processes). • effect of these actions/events. • triggering conditions of the actions/events. • To query using the knowledge base: • Prediction; explanation; planning; Hypothesis generation • BioSigNet-RR (Biological Signal Network - Representation and Reasoning) and BioSigNet-RRH systems.

Foundation behind our approach • Research on representing and reasoning about dynamic systems (space shuttles, mobile robots, software agents) • causal relations between properties of the world • effects of actions (when can they be executed) • goal specification • action-plans • Research on knowledge representation, reasoning and declarative problem solving – the AnsProlog language.

An NFκB signaling pathway

Syntax by example • bind(TNF-a,TNFR1) causes trimerized(TNFR1) • trimerized(TNFR1) triggers bind(TNFR1,TRADD)

General syntax to represent networks • e causes f if f1; …; fk • g1; …; gk causes g • h1; …; hm n_triggers e • k1; …; kl triggers e • r1; …; rl inhibits e • e is an event (also referred to as an action) and the rest are fluents (properties of the cell) • For metabolic interactions: e converts g1; …; gk to f1; …; fk if h1; …; hm
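One way to picture these statements is as plain data records. The sketch below is only an illustration: the class names `Causes`, `Triggers`, `Inhibits` and the helper `triggered_events` are assumptions made here, not the encoding used by BioSigNet-RR, and only the two statements from "Syntax by example" are encoded.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Causes:
    """`event causes effect if conditions` (hypothetical record type)."""
    event: str
    effect: str
    conditions: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Triggers:
    """`conditions triggers event` (hypothetical record type)."""
    conditions: FrozenSet[str]
    event: str


@dataclass(frozen=True)
class Inhibits:
    """`conditions inhibits event` (hypothetical record type)."""
    conditions: FrozenSet[str]
    event: str


# The two statements from the "Syntax by example" slide.
network = [
    Causes("bind(TNF-a,TNFR1)", "trimerized(TNFR1)"),
    Triggers(frozenset({"trimerized(TNFR1)"}), "bind(TNFR1,TRADD)"),
]


def triggered_events(statements, state):
    """Events whose trigger conditions hold in `state` and that no inhibitor blocks."""
    blocked = {s.event for s in statements
               if isinstance(s, Inhibits) and s.conditions <= state}
    fired = {s.event for s in statements
             if isinstance(s, Triggers) and s.conditions <= state}
    return fired - blocked
```

Calling `triggered_events(network, {"trimerized(TNFR1)"})` returns the single event `bind(TNFR1,TRADD)`; with an empty state nothing is triggered.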

Semantics: queries and entailment • Observation part of queries • f at t • a occurs_at t • Given the Network N and observation O • Predict if a temporal expression holds. • Explain a set of observations. • Plan to achieve a goal.

Importance of a formal semantics • Besides defining prediction, explanation and planning, it is also useful in identifying: • Under what restrictions the answer given by a given (graph based) algorithm will be correct. (soundness!) • Under what restrictions a given (graph based) algorithm will find a correct answer if one exists. (completeness!)

Utility of declarative programming languages (such as AnsProlog) • Allows for quick implementation of the semantics • The specification or the definition of what is an explanation, or what is a plan becomes a program that finds explanations and plans respectively.

Prediction • Given some initial conditions and observations, predict how the world would evolve, or predict the outcome of (hypothetical) interventions.

Back to the example • Binding of TNF-a with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP. • TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way. • TRADD binding with RIP inhibits phosphorylation of NIK. • TRADD binding with FADD in the absence of FLIP leads to cell death.

Prediction 1 • Pathway: as described in “Back to the example”. • Initial condition: bind(TNF-α,TNF-R1) occurs at t0 • Query: predict eventually apoptosis • Answer: Unknown! Knowledge about TRADD’s bindings is incomplete; the answer depends on whether bind(TRADD,RIP) happened.

Prediction 2 • Pathway: as described in “Back to the example”. • Initial condition: bind(TNF-α,TNF-R1) occurs at t0 • Observation: TRADD binds with TRAF2, FADD and RIP • Query: predict eventually apoptosis • Answer: Yes!
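The two prediction answers can be reproduced with a small possible-worlds sketch. This is not the AnsProlog semantics of the talk; it is a simplified stand-in that enumerates which partners TRADD might bind and answers "yes" only if apoptosis follows in every case. The names `apoptosis`, `possible_bindings` and `predict_eventually_apoptosis` are hypothetical, and the rule that NIK is phosphorylated whenever RIP is not bound is an assumption added to make the four informal rules executable.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """Apply the four informal pathway rules to a set of TRADD binding partners."""
    # Assumption: NIK is phosphorylated unless TRADD-RIP binding inhibits it.
    nik_phosphorylated = "RIP" not in bound
    # TRADD-TRAF2 binding over-expresses FLIP, provided NIK is phosphorylated.
    flip = "TRAF2" in bound and nik_phosphorylated
    # TRADD-FADD binding in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip


def possible_bindings(observed=frozenset()):
    """All non-empty partner sets that include every observed binding."""
    for size in range(1, len(PARTNERS) + 1):
        for combo in combinations(PARTNERS, size):
            if set(observed) <= set(combo):
                yield frozenset(combo)


def predict_eventually_apoptosis(observed=frozenset()):
    """'yes' / 'no' if all possible worlds agree, otherwise 'unknown'."""
    outcomes = {apoptosis(b) for b in possible_bindings(observed)}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"
```

With no observations the function returns `"unknown"` (Prediction 1); once all three bindings are observed it returns `"yes"` (Prediction 2).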

Explanation • Given initial conditions and observations, explain why the final outcome does not match expectation.

Explanation 1 • Pathway: as described in “Back to the example”. • Initial condition: bound(TNF-a,TNFR1) at t0 • Observation: bound(TRADD,TRAF2) at t1 • Query: explain apoptosis • One explanation: • Binding of TRADD with RIP • Binding of TRADD with FADD
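Explanation can be sketched as a search over unobserved events: which additional TRADD bindings, if they had occurred, would account for apoptosis? The code below is an illustrative brute-force version under the same simplifying assumption as before (NIK is phosphorylated unless RIP is bound); `apoptosis` and `explain_apoptosis` are hypothetical names, and the real system computes explanations from the AnsProlog semantics instead.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """Apply the four informal pathway rules to a set of TRADD binding partners."""
    nik_phosphorylated = "RIP" not in bound            # assumption, see lead-in
    flip = "TRAF2" in bound and nik_phosphorylated     # FLIP over-expression
    return "FADD" in bound and not flip                # death needs FADD, no FLIP


def explain_apoptosis(observed):
    """Return every set of unobserved bindings that, added to `observed`, yields apoptosis."""
    unobserved = [p for p in PARTNERS if p not in observed]
    explanations = []
    for size in range(len(unobserved) + 1):
        for extra in combinations(unobserved, size):
            if apoptosis(set(observed) | set(extra)):
                explanations.append(set(extra))
    return explanations
```

Given the observation that TRADD is bound to TRAF2, `explain_apoptosis({"TRAF2"})` yields exactly one candidate: the bindings with FADD and RIP, which matches the explanation on the slide.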

Planning • Given initial conditions, to plan interventions to achieve a goal. • Application in drug and therapy design.

Planning requirements • In addition to the knowledge about the pathway we need additional information about possible interventions such as: • What proteins can be introduced • What mutations can be forced.

Planning example • Defining possible interventions: • intervention intro(DN-TRAF2) • intro(DN-TRAF2) causes present(DN-TRAF2) • present(DN-TRAF2) inhibits bind(TRAF2,TRADD) • present(DN-TRAF2) inhibits interact(TRAF2,NIK) • Initial condition: • bound(NFκB,IκB) at 0 • bind(TNF-α,TNF-R1) at 0 • Goal: keep NFκB inactive. • Query: • plan always bound(NFκB,IκB) from 0
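A planning query of this kind can be sketched as a search over intervention sets, each checked by forward simulation. The intervention statements below come from the slide; the chain from TNF binding to the release of NFκB is not spelled out in the text, so the three `CAUSES`/`TRIGGERS` steps used here are a placeholder assumption, as are the function names `trajectory` and `plan_always` and the ASCII spellings of the fluents.

```python
from itertools import combinations

# event -> (fluents made true, fluents made false)
CAUSES = {
    "bind(TNF-a,TNF-R1)": ({"bound(TNF-a,TNF-R1)"}, set()),   # assumed step
    "bind(TRAF2,TRADD)": ({"bound(TRAF2,TRADD)"}, set()),     # assumed step
    "interact(TRAF2,NIK)": (set(), {"bound(NFkB,IkB)"}),      # assumed shortcut
    "intro(DN-TRAF2)": ({"present(DN-TRAF2)"}, set()),        # from the slide
}
TRIGGERS = [  # assumed chain
    ({"bound(TNF-a,TNF-R1)"}, "bind(TRAF2,TRADD)"),
    ({"bound(TRAF2,TRADD)"}, "interact(TRAF2,NIK)"),
]
INHIBITS = [  # from the slide
    ({"present(DN-TRAF2)"}, "bind(TRAF2,TRADD)"),
    ({"present(DN-TRAF2)"}, "interact(TRAF2,NIK)"),
]
INTERVENTIONS = ["intro(DN-TRAF2)"]


def trajectory(state, events_at_0, steps=5):
    """Simulate `steps` time points; return the list of visited states."""
    states = [set(state)]
    events = set(events_at_0)
    for _ in range(steps):
        nxt = set(states[-1])
        for event in events:
            add, remove = CAUSES[event]
            nxt |= add
            nxt -= remove
        states.append(nxt)
        # Events for the next step: triggered and not inhibited.
        blocked = {e for cond, e in INHIBITS if cond <= nxt}
        events = {e for cond, e in TRIGGERS if cond <= nxt} - blocked
    return states


def plan_always(goal, state, events_at_0):
    """Smallest set of time-0 interventions that keeps `goal` true throughout."""
    for size in range(len(INTERVENTIONS) + 1):
        for combo in combinations(INTERVENTIONS, size):
            states = trajectory(state, set(events_at_0) | set(combo))
            if all(goal in s for s in states):
                return list(combo)
    return None
```

Under these assumptions, `plan_always("bound(NFkB,IkB)", {"bound(NFkB,IkB)"}, {"bind(TNF-a,TNF-R1)"})` finds that introducing DN-TRAF2 at time 0 is enough, while the empty plan fails because the assumed chain eventually releases NFκB.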

Conclusion of part 1 • From paper in ISMB 2004: • Our goal in this paper was to make progress towards developing a system (and the necessary representation language and reasoning algorithms) that can be used to represent signal networks and pathways associated with cells and reason with them. • A start was made. • Defined a simple language (syntax and semantics) • Defined prediction, planning and explanation • A prototype implementation using AnsProlog • Illustration of its applicability with respect to an NFκB pathway.

Issues with incomplete knowledge • Often one may not be able to do much prediction, explanation or planning. • What then? • Can reasoning help in obtaining new knowledge? • Yes, through hypothesis generation! • In fact, hypothesis generation needs reasoning!

Part II: Hypothesis Generation

Hypothesis generation • Our observations cannot be explained by our existing knowledge, OR the explanations given by our existing knowledge are invalidated by experiments. • Conclusion: Our knowledge needs to be augmented or revised. • How? • Can we use a reasoning system to propose hypotheses that one can verify through experimentation? • Automate the reasoning that goes on in a biologist’s mind; this is especially helpful when the background knowledge is huge.

[Figure: knowledge base (UV leads_to cancer), initial condition (high UV), hypothesis space (p53), and outcomes (cancer / no cancer), related by (K,I) |= O]

Issues in this tiny example • Hypothesis formation: Theory: UV leads to cancer. Observation: wild-type p53 resists the UV effect. Hypothesis: p53 is a tumor suppressor. • Elaboration tolerance: How do we update/revise “UV leads to cancer”? • Default and non-monotonic reasoning: Normally UV leads to cancer. UV does not lead to cancer if p53 is present.

Related Works: some prior mention of hypothesis formation • HYPGENE (Karp, 1991) • TRANSGENE (Darden, 1997) • GenePath (Zupan et al., 2003) • Robot Scientist (King et al., 2004) • Database (Doherty et al., 2004) • BIOCHAM (Calzone et al., 2005) • PathLogic (Karp et al., 2002) • Cytoscape (Shannon et al., 2003) • Integrative Scheme (Su et al., 2003) • Pathway Analysis (Ingenuity) • These do not use the latest advances in knowledge representation and reasoning (e.g., they lack ways to express defaults, non-monotonicity, elaboration tolerance, problem-solving rules, etc.).

Hypothesis formation • Knowledge base: K • Set of initial conditions: I • Set of (experimental) observations: O • (K,I) does not entail O • To expand (K,I) to (K’, I’): (K’, I’) entails O • How to expand (hypothesis space) • Explanation: expand only I • Diagnosis: normality assumptions about I, minimally abandon the normality assumptions • Hypothesis formation: expand K

Construction of hypothesis space • Present: manual construction, using research literature • Future: integration of multiple data sources • Protein interactions • Pathway databases • Biological ontologies • … • These provide cues and hunches such as: • A may interact with B: action interact(A,B) • The A-B interaction may have effect C: interact(A,B) causes C

Generation of hypotheses • Enumeration of hypotheses • Search: computing with Smodels (an implementation of AnsProlog) • Heuristics • A trigger statement is selected only if it is the only cause of some action occurrence that is needed to explain the novel observations. • An inhibition statement is selected only if it is the only blocker of some triggered action at some time. • Maximizing preferences of selected statements

Generation … (cont.): heuristics • Knowledge base K • a causes g • b causes g • Initial condition I = { initially f } • Observation O = { eventually g } • (K,I) does not entail O • Hypothesis space: to expand K with rules among • f triggers a • f triggers b • Hypotheses: { f triggers a }, or { f triggers b }
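The small example above can be worked through by brute force: try subsets of the hypothesis space, smallest first, and keep those that make (K ∪ H, I) entail O. This is only a sketch of the idea; the talk's system searches with Smodels and uses the heuristics listed earlier, whereas the functions `eventually` and `minimal_hypotheses` here are hypothetical and use a simple forward simulation in place of the AnsProlog semantics.

```python
from itertools import combinations

K_CAUSES = [("a", "g"), ("b", "g")]            # "a causes g", "b causes g"
HYPOTHESIS_SPACE = [("f", "a"), ("f", "b")]    # "f triggers a", "f triggers b"


def eventually(goal, causes, triggers, initial, steps=5):
    """True if `goal` becomes true within `steps` rounds of triggered events."""
    state = set(initial)
    for _ in range(steps):
        events = {event for fluent, event in triggers if fluent in state}
        state |= {effect for event, effect in causes if event in events}
    return goal in state


def minimal_hypotheses(goal, initial):
    """Subset-minimal sets of trigger statements that let K explain `goal`."""
    found = []
    for size in range(len(HYPOTHESIS_SPACE) + 1):
        for hyp in combinations(HYPOTHESIS_SPACE, size):
            # Skip supersets of a hypothesis that already works.
            if any(set(prev) <= set(hyp) for prev in found):
                continue
            if eventually(goal, K_CAUSES, list(hyp), initial):
                found.append(hyp)
    return found
```

For `minimal_hypotheses("g", {"f"})` the empty hypothesis fails, and the two single-statement hypotheses each succeed, which gives the two alternatives listed on the slide.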

Case study: p53 network

Tumor suppression by p53 • p53 has 3 main functional domains • N terminal transactivator domain • Central DNA-binding domain • C terminal domain that recognizes DNA damage • Appropriate binding of N terminal activates pathways that lead to protection of cell from cancer. • Inappropriate binding (say to Mdm2) inhibits p53 induced tumor suppression.

p53 knowledge base • Stress • high(UV) triggers upregulate(mRNA(p53)) • Upregulation of p53 • upregulate(mRNA(p53)) causes high(mRNA(p53)) • high(mRNA(p53)) triggers translate(p53) • translate(p53) causes high(p53)

p53 knowledge base (cont.) • Tumor suppression by p53 • high(p53) inhibits growth(tumor)

p53 knowledge base (cont.) • Interaction between Mdm2 and p53 • high(p53), high(mdm2) triggers bind(p53,mdm2) • bind(p53,mdm2) causes bound(dom(p53,N)) • bind(p53,mdm2) causes high([p53 : mdm2]) • bind(p53,mdm2) causes ¬high(p53), ¬high(mdm2)

Hypothesis formation • Experimental observation: • I = { initially high(UV), high(mdm2), high(ARF) } • O = { eventually ¬tumorous } • (K,I) does not entail O • Need to hypothesize the role of ARF.
