Graph-based analysis of biochemical networks

aMAZE - Protein Function and Biochemical Processes. Graph-based analysis of biochemical networks. Jacques van Helden, Jacques.van.Helden@ulb.ac.be

Contents • Mapping metabolic networks onto a graph • Traversal rules for metabolic graphs • Path finding • Path finding in weighted graphs • Pathway reconstruction by reaction clustering • From gene expression data to pathways • Recurrent modules

Mapping metabolic networks onto graphs

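Metabolic network [Figure: methionine biosynthesis from L-homoserine, with an E. coli branch (2.3.1.46, via alpha-succinyl-L-homoserine) and an S. cerevisiae branch (2.3.1.31, via O-acetyl-homoserine). Compounds shown: L-homoserine, succinyl-SCoA, acetyl-CoA, HSCoA, CoA, alpha-succinyl-L-homoserine, O-acetyl-homoserine, L-cysteine, succinate, cystathionine, H2O, sulfide, NH4+, pyruvate, homocysteine, 5-methyl-THF, THF, L-methionine. Reactions shown: 2.3.1.46, 2.3.1.31, 4.2.99.9, 4.4.1.8, 4.2.99.10, 2.1.1.14.]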

One node per compound • vertices = compounds • arcs = reactions • problem: no representation of cross-point reactions [Figure: the metabolic network redrawn with compounds as the only vertices; each reaction label (e.g. 2.3.1.46, 4.2.99.9) is repeated on several arcs, one per substrate-product pair.]

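One node per reaction • vertices = reactions • arcs = intermediate compounds • problem: no representation of cross-point compounds [Figure: the metabolic network redrawn with reactions 2.3.1.46, 2.3.1.31, 4.2.99.9, 4.4.1.8, 4.2.99.10 and 2.1.1.14 as the only vertices; arcs are labelled with the intermediates alpha-succinyl-L-homoserine, O-acetyl-homoserine, cystathionine and homocysteine (homocysteine appears on two arcs).]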

One node per compound and per reaction • 2 types of vertices: compounds and reactions • arcs: from substrate to reaction, and from reaction to product • arc labels can be used to represent stoichiometry [Figure: the metabolic network drawn with both compound vertices and reaction vertices.]

Reactions and compounds: directed bipartite graph • A bipartite graph is a graph G whose vertex set V can be partitioned into two subsets U and W, such that each edge of G has one endpoint in U and one endpoint in W. • Arcs never go from compound to compound. • Arcs never go from reaction to reaction. • Size of the graph: 5,871 compounds, 5,223 reactions, 21,194 arcs.
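The bipartite representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two reactions are a simplified excerpt of the methionine example, and the names `REACTIONS`, `build_bipartite_graph` and `is_bipartite` are hypothetical, not part of any aMAZE software.

```python
from collections import defaultdict

# Toy data (assumption): reaction -> (substrates, products),
# simplified from the methionine example.
REACTIONS = {
    "2.3.1.46": (["L-Homoserine", "Succinyl-CoA"],
                 ["Alpha-succinyl-L-homoserine", "CoA"]),
    "4.2.99.9": (["Alpha-succinyl-L-homoserine", "L-Cysteine"],
                 ["Cystathionine", "Succinate"]),
}


def build_bipartite_graph(reactions):
    """Return (arcs, node_type); arcs maps each node to its set of successors."""
    arcs = defaultdict(set)
    node_type = {}
    for rxn, (substrates, products) in reactions.items():
        node_type[rxn] = "reaction"
        for compound in substrates:
            node_type[compound] = "compound"
            arcs[compound].add(rxn)      # arc from substrate to reaction
        for compound in products:
            node_type[compound] = "compound"
            arcs[rxn].add(compound)      # arc from reaction to product
    return dict(arcs), node_type


def is_bipartite(arcs, node_type):
    """True if every arc joins a compound and a reaction."""
    return all(node_type[u] != node_type[v]
               for u, successors in arcs.items() for v in successors)


arcs, node_type = build_bipartite_graph(REACTIONS)
n_arcs = sum(len(successors) for successors in arcs.values())
print(is_bipartite(arcs, node_type), n_arcs)
```

Stoichiometry could be carried as an arc label by storing a dictionary of successors instead of a set; that extension is left out here to keep the sketch short.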

Extending the graph to full biochemical networks • The concept can be extended to include additional types of vertices: • biochemical entities: compounds, genes, proteins, … • biochemical interactions: reaction, catalysis, transcription, regulation, translocation, transport, … • This makes it possible to represent metabolism, regulation, transport, signal transduction, compartments, … • Warning: with this extension, the graph is no longer bipartite, because some interactions have other interactions as output (e.g. a catalysis acts on a reaction) • van Helden et al. (2000) Biol Chem, 381(9-10), 921-35. • van Helden et al. (2001) Briefings in Bioinformatics, 2(1), 81-93. • van Helden et al. (2002) In Bioinformatics and Genome Analysis. Springer-Verlag, Berlin Heidelberg, Vol. 38.

Traversal rules for metabolic graphs

Ubiquitous compounds • Reactions • 4.2.1.52: L-aspartic semialdehyde + pyruvate → dihydrodipicolinic acid + H2O • 3.5.1.18: succinyl diaminopimelate + H2O → succinate + LL-diaminopimelic acid • Invalid pathway • L-aspartic semialdehyde → 4.2.1.52 → H2O → 3.5.1.18 → LL-diaminopimelic acid

Compound connectivity

Reaction connectivity

Reaction connectivity - without ubiquitous compounds

Invalid intermediates • Where to set the limit? • Seems obvious for H2O (1615), NADH (569), ... • What about ATP (435)? • And pyruvate? • And NH3? • Depends on the reaction/pathway considered • e.g. ATP is a valid intermediate in nucleotide biosynthesis • Depends on the atoms being transferred during the reaction • e.g. NADH donates a hydride ion (a proton plus two electrons) • Depends on the focus of the question • e.g. for an analysis of energy metabolism, ATP and NAD will matter
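One simple way to experiment with the limit is to count, for each compound, the number of reactions it takes part in, and to drop compounds above a chosen threshold. The sketch below assumes a connectivity cutoff is an acceptable first approximation; the threshold, the `keep` exception list and the function names are hypothetical, and the two reactions are the ones from the H2O example.

```python
from collections import Counter

# Two reactions from the lysine example (assumption: side compounds as listed).
REACTIONS = {
    "4.2.1.52": (["L-Aspartic semialdehyde", "Pyruvate"],
                 ["Dihydrodipicolinic acid", "H2O"]),
    "3.5.1.18": (["Succinyl diaminopimelate", "H2O"],
                 ["Succinate", "LL-Diaminopimelic acid"]),
}


def compound_connectivity(reactions):
    """Count the number of reactions in which each compound takes part."""
    counts = Counter()
    for substrates, products in reactions.values():
        for compound in set(substrates) | set(products):
            counts[compound] += 1
    return counts


def drop_ubiquitous(reactions, max_connectivity, keep=()):
    """Remove compounds used by more than max_connectivity reactions.

    Compounds listed in keep are never removed, so that a compound such as
    ATP can be retained when it is a genuine intermediate.
    """
    counts = compound_connectivity(reactions)

    def is_kept(compound):
        return compound in keep or counts[compound] <= max_connectivity

    return {
        rxn: ([c for c in substrates if is_kept(c)],
              [c for c in products if is_kept(c)])
        for rxn, (substrates, products) in reactions.items()
    }


connectivity = compound_connectivity(REACTIONS)
filtered = drop_ubiquitous(REACTIONS, max_connectivity=1)
print(connectivity["H2O"], filtered["4.2.1.52"])
```

A single global threshold cannot capture the case-by-case considerations listed above, which is why the sketch exposes `keep` as an explicit override.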

Ubiquitous compounds • Jeong et al. (Nature 2000; 407: 651-654) • Calculate the network diameter, i.e. the average length of the shortest path between two compounds. • Show that when ubiquitous compounds ("hubs" in their terminology) are removed, the diameter increases. • Compare the metabolic network diameter between different organisms. • "Surprising" result: the network diameter does not depend on the number of enzymes found in the organism. • But: for this comparison, all compounds were considered, including H2O.

Direct traversal of reversible reactions • Reaction • 4.2.1.52: L-aspartic semialdehyde + pyruvate → dihydrodipicolinic acid + H2O • Valid pathways • L-aspartic semialdehyde → 4.2.1.52 → dihydrodipicolinic acid • dihydrodipicolinic acid → 4.2.1.52 → L-aspartic semialdehyde • Invalid pathway • L-aspartic semialdehyde → 4.2.1.52 → pyruvate (substrate to substrate)

Mutual exclusion of reverse reactions • Reactions • 4.2.1.52: L-aspartic semialdehyde + pyruvate → dihydrodipicolinic acid + H2O • 4.2.1.52 reverse: dihydrodipicolinic acid + H2O → L-aspartic semialdehyde + pyruvate • Invalid pathway • L-aspartic semialdehyde → 4.2.1.52 → dihydrodipicolinic acid → 4.2.1.52 reverse → pyruvate
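The two traversal rules (enter a reaction through a substrate and leave through a product; never use a reaction and its reverse in the same path) can be checked with a small validator. This is a minimal sketch: the representation of a reversible reaction as two directed entries, the `" reverse"` suffix and the function names are assumptions made for the example.

```python
REVERSE_SUFFIX = " reverse"

# Assumption: a reversible reaction is stored as two directed reactions.
REACTIONS = {
    "4.2.1.52": ({"L-Aspartic semialdehyde", "Pyruvate"},
                 {"Dihydrodipicolinic acid", "H2O"}),
    "4.2.1.52 reverse": ({"Dihydrodipicolinic acid", "H2O"},
                         {"L-Aspartic semialdehyde", "Pyruvate"}),
}


def base_name(rxn):
    """Strip the reverse suffix so both directions share one identifier."""
    if rxn.endswith(REVERSE_SUFFIX):
        return rxn[:-len(REVERSE_SUFFIX)]
    return rxn


def is_valid_path(path, reactions):
    """Check a path given as [compound, reaction, compound, ..., compound]."""
    if len(path) < 3 or len(path) % 2 == 0:
        return False
    used = set()
    for i in range(1, len(path), 2):
        entry, rxn, exit_ = path[i - 1], path[i], path[i + 1]
        if rxn not in reactions:
            return False
        substrates, products = reactions[rxn]
        # Rule 1: traverse from a substrate to a product only.
        if entry not in substrates or exit_ not in products:
            return False
        # Rule 2: a reaction and its reverse are mutually exclusive.
        if base_name(rxn) in used:
            return False
        used.add(base_name(rxn))
    return True


forward = ["L-Aspartic semialdehyde", "4.2.1.52", "Dihydrodipicolinic acid"]
side = ["L-Aspartic semialdehyde", "4.2.1.52", "Pyruvate"]
back_and_forth = forward + ["4.2.1.52 reverse", "Pyruvate"]
print(is_valid_path(forward, REACTIONS),
      is_valid_path(side, REACTIONS),
      is_valid_path(back_and_forth, REACTIONS))
```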

Traversal of reversible reactions • Fell & Wagner (Nat Biotechnol 2000; 18: 1121-2) • Select a sub-network (energy metabolism and small molecule biosynthesis in E. coli). • Discard ubiquitous compounds. • Identify the "center" of the network: glutamate, followed by pyruvate. • But: reactions can be traversed from substrate to substrate or from product to product. • Jeong et al. (Nature 2000; 407: 651-654) • Calculate the network diameter. • But: reactions can be traversed from substrate to substrate or from product to product.

Path finding

Applications of path finding to biochemical networks • metabolic pathways from compound A to compound B (2-ends path finding) • genes regulated by a membrane receptor via a signal transduction pathway (1-end path-finding) • proteins and compounds regulating directly or indirectly the expression of a given gene (1-end path finding, reverse) • feed-back loops (cycle finding) • functional distance between two enzymes, in terms of the minimal number of steps between the reactions they catalyze

A graph of compounds and reactions • Reactions from KEGG • Compound nodes • 10,166 compounds (only 4,302 used by one reaction) • Reaction nodes • 5,283 reactions • Arcs • 10,685 substrate → reaction (7,297 non-trivial) • 10,621 reaction → product (6,828 non-trivial)

Metabolic pathways as subgraphs • Escherichia coli • 4,219 genes (Blattner) • 967 enzymes (Swissprot) • 159 pathways (EcoCyc)

Functional distance between enzymes • The length of the shortest path between two reactions can be considered as a measure of their functional distance. • By extension, one can estimate the functional distance between two enzymes as the length of the shortest path between the reactions they catalyze. • Example of application: interpretation of pairs of fused genes • Two enzymatic functions can be carried by a single gene in one genome, and by two separate genes in another genome, as the result of a gene fusion event • Are such fusion events preferentially observed between functionally related enzymes?
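A breadth-first search is enough to illustrate this distance. The sketch below makes several assumptions: ubiquitous compounds have already been removed, the distance is counted in reaction steps (two reactions sharing an intermediate are at distance 1), and the smaller of the two directed distances is reported. The toy reactions are a simplified version of the methionine example, and all function names are hypothetical.

```python
from collections import deque

# Toy data (assumption): main compounds only, side compounds removed.
REACTIONS = {
    "2.3.1.46": (["L-Homoserine"], ["Alpha-succinyl-L-homoserine"]),
    "4.2.99.9": (["Alpha-succinyl-L-homoserine"], ["Cystathionine"]),
    "4.4.1.8": (["Cystathionine"], ["Homocysteine"]),
    "2.3.1.31": (["L-Homoserine"], ["O-acetyl-homoserine"]),
    "4.2.99.10": (["O-acetyl-homoserine"], ["Homocysteine"]),
    "2.1.1.14": (["Homocysteine"], ["L-Methionine"]),
}


def reaction_successors(reactions):
    """Map each reaction to the reactions consuming one of its products."""
    consumers = {}
    for rxn, (substrates, _) in reactions.items():
        for compound in substrates:
            consumers.setdefault(compound, set()).add(rxn)
    return {
        rxn: {nxt for c in products for nxt in consumers.get(c, ())} - {rxn}
        for rxn, (_, products) in reactions.items()
    }


def directed_distance(successors, source, target):
    """Breadth-first search; number of steps from source to target, or None."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        rxn = queue.popleft()
        if rxn == target:
            return distance[rxn]
        for nxt in successors[rxn]:
            if nxt not in distance:
                distance[nxt] = distance[rxn] + 1
                queue.append(nxt)
    return None


def functional_distance(reactions, r1, r2):
    """Smaller of the two directed distances, or None if unconnected."""
    successors = reaction_successors(reactions)
    found = [d for d in (directed_distance(successors, r1, r2),
                         directed_distance(successors, r2, r1))
             if d is not None]
    return min(found) if found else None


print(functional_distance(REACTIONS, "2.3.1.46", "2.1.1.14"))
```

To go from reactions to enzymes, one would take the minimum over all pairs of reactions catalyzed by the two enzymes; that mapping is not included here.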

Shortest path finding with gene fusion pairs • Fusion pairs • Tsoka and Ouzounis (Nat Genet 2000; 26: 141-2) • Shortest path analysis • van Helden et al. (2002) In Bioinformatics and Genome Analysis. Springer-Verlag, Berlin Heidelberg, Vol. 38. [Figure: workflow from reactions and compounds, through shortest path finding, to the functional distance between enzyme A and enzyme B; fusion pairs are compared with random pairs.]

Pathway enumeration • Kuffner et al. (Bioinformatics 2000; 16: 825-836). • All possible paths from glucose to pyruvate, with maximal length 9 → 500,000 possible paths. • Adding constraints • Selecting "complete" pathways, i.e. where all side reactants are ubiquitous • Constraint on pathway width • Width 2 → 541 pathways • Width 1 → 170 pathways [Figure: workflow from reactions, compounds, a source compound and a target compound, through path finding, to potential metabolic pathways.]
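Bounded path enumeration can be sketched as a depth-first search that stops once a maximal number of reactions is reached. This is not the method of Kuffner et al.; it is a plain enumeration under stated assumptions (side compounds already removed, no compound or reaction visited twice, no "completeness" or width constraint), with hypothetical names and the simplified methionine reactions as toy data.

```python
# Toy data (assumption): main compounds only, side compounds removed.
REACTIONS = {
    "2.3.1.46": (["L-Homoserine"], ["Alpha-succinyl-L-homoserine"]),
    "4.2.99.9": (["Alpha-succinyl-L-homoserine"], ["Cystathionine"]),
    "4.4.1.8": (["Cystathionine"], ["Homocysteine"]),
    "2.3.1.31": (["L-Homoserine"], ["O-acetyl-homoserine"]),
    "4.2.99.10": (["O-acetyl-homoserine"], ["Homocysteine"]),
    "2.1.1.14": (["Homocysteine"], ["L-Methionine"]),
}


def enumerate_paths(reactions, source, target, max_reactions):
    """Yield every path from source to target with at most max_reactions steps.

    A path is a list alternating compounds and reactions.
    """
    consumers = {}
    for rxn, (substrates, _) in reactions.items():
        for compound in substrates:
            consumers.setdefault(compound, []).append(rxn)

    def extend(path, seen):
        compound = path[-1]
        if compound == target:
            yield list(path)
            return
        if (len(path) - 1) // 2 >= max_reactions:
            return  # length limit reached without hitting the target
        for rxn in consumers.get(compound, []):
            if rxn in seen:
                continue
            for product in reactions[rxn][1]:
                if product in seen:
                    continue
                path.extend([rxn, product])
                seen.update([rxn, product])
                yield from extend(path, seen)
                del path[-2:]                      # backtrack
                seen.difference_update([rxn, product])

    yield from extend([source], {source})


paths = list(enumerate_paths(REACTIONS, "L-Homoserine", "L-Methionine", 4))
for path in paths:
    print(" -> ".join(path))
```

Even on real networks of a few thousand reactions, the number of paths grows very quickly with the length limit, which is why the constraints listed above are needed in practice.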

Scoring pathways with gene expression data [Figure: workflow. Reactions and compounds, plus a source compound and a target compound → path finding → potential metabolic pathways → select reactions (for each pathway separately) → set of reactions → identification of enzymes → enzyme-coding genes → scoring of the gene cluster with gene expression data (covariance of the response) → most probably relevant pathways.]
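The scoring step can be illustrated with a deliberately simple stand-in: the mean pairwise Pearson correlation between the expression profiles of the genes assigned to a pathway. This is an assumption made for the example, not the score used by Zien et al.; gene names, pathway names and expression values are invented.

```python
from itertools import combinations
from math import sqrt

# Invented expression profiles (one value per condition) for illustration.
EXPRESSION = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}

# Invented candidate pathways, each reduced to its enzyme-coding genes.
CANDIDATES = {
    "pathway A": ["g1", "g2"],
    "pathway B": ["g1", "g3"],
}


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pathway_score(genes, expression):
    """Mean pairwise correlation among the genes with expression data."""
    profiles = [expression[g] for g in genes if g in expression]
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


scores = {name: pathway_score(genes, EXPRESSION)
          for name, genes in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A score of this kind only becomes interpretable against a background distribution, for instance the scores of randomly chosen gene sets of the same size.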

Scoring pathways with gene expression data • Zien, A., Kuffner, et al. (2000). ISMB 8, 407-17. [Figure: pathway score distribution; labels: random control, found, glycolysis.]

Path finding - summary • Metabolic pathways are organism-dependent. • The shortest path is generally not the most relevant. • Simple path enumeration returns innumerable false positives. • Adding consistency rules (complete pathways) reduces the number of returned pathways. • Pathway scoring makes it possible to select the most relevant pathways for a given organism. • Requirements • Gene expression data • Specification of the source and target compounds

Pathway building by reaction clustering

Reconstructing a pathway from a subset of reactions • Input: • a set of reactions (the seed reactions) • Output: • a metabolic pathway including • the seed reactions, together with their substrates and products • optionally, some additional reactions, intercalated to improve the pathway connectivity • the pathway can either be connected, or contain several unconnected components

Seed nodes [Figure legend: compound, reaction, seed reaction.]

Linking seed nodes [Figure legend: compound, reaction, seed reaction, direct link.]

Enhance linking by intercalating reactions [Figure legend: compound, reaction, seed reaction, direct link, intercalated reaction.]

Subgraph extraction
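The linking and intercalation steps can be sketched as follows. This is a simplified illustration, not the actual pathway builder: it assumes side compounds have been removed, works on directed reactions, links each ordered pair of seeds by one shortest chain of at most `max_gap` intercalated reactions, and returns the union of seeds and intercalated reactions. The toy data and all names are hypothetical.

```python
from collections import deque
from itertools import permutations

# Toy data (assumption): main compounds only, side compounds removed.
REACTIONS = {
    "2.3.1.46": (["L-Homoserine"], ["Alpha-succinyl-L-homoserine"]),
    "4.2.99.9": (["Alpha-succinyl-L-homoserine"], ["Cystathionine"]),
    "4.4.1.8": (["Cystathionine"], ["Homocysteine"]),
    "2.1.1.14": (["Homocysteine"], ["L-Methionine"]),
}


def reaction_successors(reactions):
    """Map each reaction to the reactions consuming one of its products."""
    consumers = {}
    for rxn, (substrates, _) in reactions.items():
        for compound in substrates:
            consumers.setdefault(compound, set()).add(rxn)
    return {
        rxn: {nxt for c in products for nxt in consumers.get(c, ())} - {rxn}
        for rxn, (_, products) in reactions.items()
    }


def shortest_link(successors, source, target, max_gap):
    """Reactions intercalated on a shortest chain from source to target.

    Returns a list (empty for a direct link), or None if the target cannot
    be reached with at most max_gap intercalated reactions.
    """
    parent = {source: None}
    queue = deque([(source, 0)])
    while queue:
        rxn, steps = queue.popleft()
        if rxn == target:
            link, node = [], parent[rxn]
            while node is not None and node != source:
                link.append(node)
                node = parent[node]
            return link[::-1]
        if steps > max_gap:
            continue  # going further would exceed the allowed gap
        for nxt in successors[rxn]:
            if nxt not in parent:
                parent[nxt] = rxn
                queue.append((nxt, steps + 1))
    return None


def build_pathway(reactions, seeds, max_gap):
    """Seeds plus the reactions intercalated to connect them."""
    successors = reaction_successors(reactions)
    selected = set(seeds)
    for a, b in permutations(seeds, 2):
        link = shortest_link(successors, a, b, max_gap)
        if link:
            selected.update(link)
    return selected


print(sorted(build_pathway(REACTIONS, ["2.3.1.46", "4.4.1.8"], max_gap=1)))
```

Because each pair of seeds is joined by a shortest chain, a large gap size lets the search take shortcuts through the network instead of following the expected pathway.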

Validation of the method • Take a known pathway (e.g. Lysine biosynthesis in Escherichia coli: 9 reactions). • Provide the program with a subset of reactions. • See if the program is able to reconstruct the whole pathway on the basis of this subset.

Validation of the method • Take a set of experimentally characterized pathways, and for each one • Select a subset of enzymes • Use the reactions catalysed by these enzymes as seed nodes • Extract the subgraph • Compare with known pathway

Lysine biosynthesis in E. coli [Figure: pathway diagram; the steps, in order, are:]
• L-aspartate (from aspartate biosynthesis) → L-aspartyl-4-P: aspartate kinase III (lysC), 2.7.2.4; ATP → ADP
• L-aspartyl-4-P → L-aspartic semialdehyde: aspartate semialdehyde dehydrogenase (asd), 1.2.1.11; NADPH, H+ → NADP+, Pi; L-aspartic semialdehyde also feeds methionine biosynthesis and threonine biosynthesis
• L-aspartic semialdehyde + pyruvate → dihydrodipicolinic acid + 2 H2O: dihydrodipicolinate synthase (dapA), 4.2.1.52
• dihydrodipicolinic acid → tetrahydrodipicolinate: dihydrodipicolinate reductase (dapB), 1.3.1.26; NADPH or NADH, H+ → NADP+ or NAD+
• tetrahydrodipicolinate + succinyl CoA → N-succinyl-epsilon-keto-L-alpha-aminopimelic acid + CoA: tetrahydrodipicolinate N-succinyltransferase (dapD), 2.3.1.117
• N-succinyl-epsilon-keto-L-alpha-aminopimelic acid + glutamate → succinyl diaminopimelate + alpha-ketoglutarate: succinyl diaminopimelate aminotransferase (dapC), 2.6.1.17
• succinyl diaminopimelate + H2O → LL-diaminopimelic acid + succinate: N-succinyldiaminopimelate desuccinylase (dapE), 3.5.1.18
• LL-diaminopimelic acid → meso-diaminopimelic acid: diaminopimelate epimerase (dapF), 5.1.1.7
• meso-diaminopimelic acid → L-lysine + CO2: diaminopimelate decarboxylase (lysA), 4.1.1.20; the lysR protein (lysR) is drawn next to this step

Example: reconstitution of lysine pathway • Gap size: 0 • all ECs from the original pathway are provided as seeds • Seeds • 1.2.1.11 • 1.3.1.26 • 2.3.1.117 • 2.6.1.17 • 2.7.2.4 • 3.5.1.18 • 4.1.1.20 • 4.2.1.52 • 5.1.1.7 • Result: • Inferring reaction orientation (reverse or forward) • Ordering

Example: reconstitution of lysine pathway • Gap size: 1 • 5 seed reactions • Result • Inferring missing steps • Inferring reaction orientation • Ordering

Example: reconstitution of lysine pathway • Gap size: 2 • 4 seed reactions • Result • E.coli pathway found • Alternative pathways also returned

Example: reconstitution of lysine pathway • Gap size: 3 • 3 seed reactions • Result • E.coli pathway is not found, because the program finds shortcuts between the seed reactions

Building pathways from operons • Pathways obtained with the pathway builder, using the genes from His operon as seeds

Applications of pathway reconstruction • We have the complete genome for more than 100 bacteria • For these genomes, • there is almost no experimental characterization of metabolism • enzymes have been predicted by sequence similarity. • gene expression data will in some cases be available, in most cases not. • In some cases, one expects to find the same pathways as in model organisms, but in other cases, variants or completely distinct pathways

Strategy 1: starting from annotated pathways • For each known pathway from model organisms • Select the subset of reactions for which an enzyme exists in the target organism • If a reasonable number of reactions are present • Using these as seeds, reconstruct a pathway • This strategy is likely to detect some variants of the annotated pathways, but is not able to predict novel pathways.

Strategy 2 - starting from predicted functional groups • Comparative genomics provides us with clues about functional modularity • operons can be predicted with different methods, and reveal some level of modular organisation. • groups of synteny can also reveal functional modules. • phylogenetic profiles reveal groups of co-evolving genes, which are generally involved in the same process or pathway. • Strategy • predict operons, groups of synteny, and groups of co-evolving genes • with each of these groups • select enzyme-coding genes • identify the reactions catalyzed by their products • use these reactions as seeds for the pathway builder

Path finding in weighted graphs

Path finding in a weighted graph • Assign a higher weight to highly connected compounds. This makes it possible to work with the whole graph, while reducing the probability of using a pool metabolite as an intermediate between two successive reactions. • Assign a smaller weight to reactions for which an enzyme has been identified in the genome. This favours organism-specific pathways, without excluding spontaneous reactions or reactions catalysed by an enzyme not yet identified in this organism. • When gene expression data are available, assign a weight to each reaction according to the expression level of the corresponding enzymes. This favours context-specific pathways.
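The first weighting rule can be illustrated with Dijkstra's algorithm on node weights. In this sketch the cost of a path is the sum of the weights of all its nodes, every node defaults to weight 1, and only H2O receives a high weight; the value 1615 echoes the figure quoted for H2O in the slide on invalid intermediates, and using it as a weight is an assumption. The reactions are a simplified excerpt of the lysine example (cofactors omitted) and the function name is hypothetical.

```python
import heapq

# Simplified excerpt of lysine biosynthesis (assumption: cofactors omitted).
REACTIONS = {
    "4.2.1.52": (["L-Aspartic semialdehyde", "Pyruvate"],
                 ["Dihydrodipicolinic acid", "H2O"]),
    "1.3.1.26": (["Dihydrodipicolinic acid"], ["Tetrahydrodipicolinate"]),
    "2.3.1.117": (["Tetrahydrodipicolinate"],
                  ["N-succinyl-keto-aminopimelate"]),
    "2.6.1.17": (["N-succinyl-keto-aminopimelate"],
                 ["Succinyl diaminopimelate"]),
    "3.5.1.18": (["Succinyl diaminopimelate", "H2O"],
                 ["LL-Diaminopimelic acid", "Succinate"]),
}


def lightest_path(reactions, node_weight, source, target):
    """Dijkstra on the bipartite graph; returns (cost, path) or None."""
    consumers = {}
    for rxn, (substrates, _) in reactions.items():
        for compound in substrates:
            consumers.setdefault(compound, []).append(rxn)

    def successors(node):
        if node in reactions:
            return reactions[node][1]      # reaction -> products
        return consumers.get(node, [])     # compound -> consuming reactions

    best = {source: node_weight.get(source, 1)}
    heap = [(best[source], source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in successors(node):
            new_cost = cost + node_weight.get(nxt, 1)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


source, target = "L-Aspartic semialdehyde", "LL-Diaminopimelic acid"
unweighted = lightest_path(REACTIONS, {}, source, target)
weighted = lightest_path(REACTIONS, {"H2O": 1615}, source, target)
print(unweighted[1])
print(weighted[1])
```

With unit weights the search takes the shortcut through H2O; with the high weight on H2O it follows the chain of pathway intermediates instead. Reaction weights (enzyme identified in the genome, expression level) could be added to the same `node_weight` dictionary.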

Test case: methionine biosynthesis [Figure: L-aspartate → 2.7.2.4 → L-aspartyl-4-P → 1.2.1.11 → L-aspartic semialdehyde → 1.1.1.3 → L-homoserine, followed by two branches to homocysteine: in E. coli, 2.3.1.46 → alpha-succinyl-L-homoserine → 4.2.99.9 → cystathionine → 4.4.1.8; in S. cerevisiae, 2.3.1.31 → O-acetyl-homoserine → 4.2.99.10. Then homocysteine → 2.1.1.14 → L-methionine → 2.5.1.6 → S-adenosyl-L-methionine.]
