A Factor Graph Model for Minimal Gene Set Enrichment Analysis

Diana Uskat, Computational Biology - Gene Center Munich

Motivation
[Figure: cutout of the Gene Ontology. Graph from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]
Problem outline:
• Single-gene analysis of microarray experiments entails a large multiple testing problem
• Even after appropriate multiple testing correction, the result is usually a long list of differentially expressed genes
• Such a list is difficult to interpret by hand
Possible improvement: gene set enrichment analysis
• Group genes into biologically meaningful categories (Gene Ontology, KEGG pathways, transcription factor targets)
• Use a statistical method to find the categories that are enriched for differentially expressed genes

Motivation
Established methods:
• GSEA (Subramanian, Tamayo)
• TopGO (Alexa)
• Globaltest (Goeman, Mansmann)
• GOStats (Falcon, Gentleman)
Drawbacks:
• There are often thousands of overlapping categories, and genes can belong to multiple categories. This creates a difficult new multiple testing problem.
• Group testing often returns a large number of significant categories. This makes it difficult to identify the biologically relevant categories.

Minimal Gene Set Enrichment
Idea (Bauer, Gagneur et al., Nucleic Acids Research 2010):
• Search for a sparse explanation, i.e. a minimal number of categories that explain the data (sufficiently well)
• Use a simplistic probabilistic graphical model relating categories and genes, and do Bayesian inference on the marginal posterior for each category
[Figure: two bipartite graphs of categories T1, T2, T3 over genes E1, E2, E3, labelled "Correct explanation" and "Correct minimal explanation". An edge means e.g. "gene E3 is element of category T3"; coloured nodes are "on".]

Minimal Gene Set Enrichment
The model:
[Figure: three-layer graph with categories T1, T2, T3, genes E1, E2, E3, and observations (data) D1, D2, D3]
• A Bayesian network factorization of the full posterior: Posterior ∝ Likelihood × Prior
• Main trick: use a prior favoring sparse solutions
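The factorization formula itself did not survive in the slide text; only its labels (posterior, likelihood, prior) remain. One plausible reading, assuming the three layers of the model (categories T, gene states E, observations D) and assuming that each observation D_i depends only on its own gene state E_i, is sketched below. The grouping of terms under "likelihood" and "prior" is an assumption, not taken from the slide.

```latex
\[
\underbrace{\Pr(T, E \mid D)}_{\text{posterior}}
\;\propto\;
\underbrace{\Pr(D \mid E)}_{\text{likelihood}}
\;\cdot\;
\underbrace{\Pr(E \mid T)\,\Pr(T)}_{\text{prior}},
\qquad
\Pr(D \mid E) = \prod_{i} \Pr(D_i \mid E_i)
\]
```

Under this reading, the sparsity-favoring prior would enter through Pr(T), which puts more mass on configurations with few active categories.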

Factor Graphs
Our method: factor graphs
• Graphical model (Kschischang IEEE, 2001)
• Bipartite graph with factor nodes and variable nodes
• Each factor node encodes a function of its neighbouring variables
• Efficient computation of marginal distributions with the sum-product algorithm (if the factor graph is a tree...)
[Figure: variable nodes for categories T1, T2, T3, genes E1, E2, E3, and data D1, D2, D3]

Factor Graphs (continued)
[Figure: as before, with factor nodes f1, f2, f3 added between genes E1, E2, E3 and data D1, D2, D3. Annotation: Pr(D|E) is given by the dataset.]

Factor Graphs (continued)
[Figure: as before, with factor nodes g1 to g6 added between categories T1, T2, T3 and genes E1, E2, E3. Annotation: a gene E is only active if at least one parent category is active.]

Factor Graphs (continued)
[Figure: as before, with a factor node fT added above the categories T1, T2, T3. The slide introduces the definition of fT with "with"; the formula itself is not preserved.]
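To make the three factor types concrete, the following sketch computes category marginals for a toy graph by brute-force enumeration instead of message passing. Everything specific in it is an assumption for illustration and does not come from the slides: the membership table `MEMBERS`, the likelihood values in `LIK`, the independent prior controlled by `PRIOR_ON` (standing in for fT), and the reading of the g factors as a deterministic OR (a gene is on exactly when at least one parent category is on).

```python
from itertools import product

# Hypothetical category -> gene membership (not taken from the slides).
MEMBERS = {"T1": ["E1", "E2"], "T2": ["E2", "E3"], "T3": ["E1", "E2", "E3"]}
CATS = list(MEMBERS)
GENES = ["E1", "E2", "E3"]

# Assumed likelihoods Pr(D_i | E_i) as (value if E_i = 0, value if E_i = 1).
# Here every gene looks differentially expressed.
LIK = {"E1": (0.1, 0.9), "E2": (0.1, 0.9), "E3": (0.1, 0.9)}

# Assumed prior probability that a single category is active.
# A small value favors sparse explanations.
PRIOR_ON = 0.1


def prior(t):
    """Stand-in for fT: independent prior over the category states."""
    p = 1.0
    for on in t:
        p *= PRIOR_ON if on else 1.0 - PRIOR_ON
    return p


def gene_states(t):
    """Stand-in for the g factors: deterministic OR over parent categories."""
    active = [c for c, on in zip(CATS, t) if on]
    return {g: int(any(g in MEMBERS[c] for c in active)) for g in GENES}


def category_marginals():
    """Posterior probability that each category is active, by enumeration."""
    weights = {}
    for t in product((0, 1), repeat=len(CATS)):
        e = gene_states(t)
        w = prior(t)
        for g in GENES:
            w *= LIK[g][e[g]]  # stand-in for the f factors
        weights[t] = w
    z = sum(weights.values())
    return {
        c: sum(w for t, w in weights.items() if t[i]) / z
        for i, c in enumerate(CATS)
    }


if __name__ == "__main__":
    for cat, p in category_marginals().items():
        print(cat, round(p, 3))
```

With these assumed numbers, the single category that covers all three genes receives a higher marginal than either of the two smaller categories, which is the kind of sparse explanation the prior is meant to favor. Enumeration scales exponentially in the number of categories, so it only serves as a reference for small examples.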

Estimation Methods for Factor Graphs
• Computation of the posterior for T, E
• Message-passing algorithm: sum-product algorithm
  • Stops at the correct result after one round if the graph has a tree structure
  • No guarantees if the graph has cycles (e.g., oscillation may occur); however, it works well in practice
• Principle:
  • Start at the leaf nodes
  • Message propagation:
    • variable to factor node: product of the messages arriving from the variable's other factors
    • factor to variable node: sum over the factor's other variables of the factor value times the incoming messages
  • Termination: compute the marginal distribution of the variable nodes
[Figure: full factor graph with prior factor fT, categories T1, T2, T3, factors g1 to g6, genes E1, E2, E3, factors f1, f2, f3, and data D1, D2, D3]
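The message-passing principle can be sketched on a small tree-shaped factor graph. The example below is a minimal illustration, not the implementation used for the results: the graph (one category T with two genes E1, E2), all table values, and the function names are hypothetical. It uses the usual formulation of the sum-product algorithm, in which a variable sends a factor the product of the messages from its other factors, and a factor sends a variable the sum over its other variables of the factor value times the incoming messages. A brute-force marginal is included only as a cross-check.

```python
from itertools import product
from math import prod

# With a single parent, "on iff the parent is on" reduces to E = T.
COPY = {(t, e): float(t == e) for t in (0, 1) for e in (0, 1)}

# Each factor is (variables, table); tables map 0/1 state tuples to weights.
# All numbers are made up for illustration.
FACTORS = {
    "fT": (("T",), {(0,): 0.9, (1,): 0.1}),    # sparse prior on the category
    "g1": (("T", "E1"), COPY),                 # category -> gene 1
    "g2": (("T", "E2"), COPY),                 # category -> gene 2
    "f1": (("E1",), {(0,): 0.1, (1,): 0.9}),   # assumed Pr(D1 | E1)
    "f2": (("E2",), {(0,): 0.4, (1,): 0.6}),   # assumed Pr(D2 | E2)
}


def var_to_factor(var, factor):
    """Product of the messages from all other factors adjacent to var."""
    others = [f for f, (vs, _) in FACTORS.items() if var in vs and f != factor]
    return [prod(factor_to_var(f, var)[s] for f in others) for s in (0, 1)]


def factor_to_var(factor, var):
    """Sum over the factor's other variables of table value times messages."""
    vs, table = FACTORS[factor]
    incoming = {v: var_to_factor(v, factor) for v in vs if v != var}
    msg = [0.0, 0.0]
    for states, weight in table.items():
        assign = dict(zip(vs, states))
        msg[assign[var]] += weight * prod(
            incoming[v][assign[v]] for v in incoming
        )
    return msg


def marginal(var):
    """Normalized product of all messages arriving at var (tree graphs only)."""
    belief = var_to_factor(var, None)
    z = sum(belief)
    return [b / z for b in belief]


def brute_force_marginal(var):
    """Reference marginal computed by enumerating every joint state."""
    names = sorted({v for vs, _ in FACTORS.values() for v in vs})
    belief = [0.0, 0.0]
    for states in product((0, 1), repeat=len(names)):
        assign = dict(zip(names, states))
        weight = prod(
            table[tuple(assign[v] for v in vs)] for vs, table in FACTORS.values()
        )
        belief[assign[var]] += weight
    z = sum(belief)
    return [b / z for b in belief]


if __name__ == "__main__":
    print(marginal("T"), brute_force_marginal("T"))
```

The recursion terminates because the graph is a tree: every message chain ends in a leaf. On a graph with cycles, as in the full category-gene network, this recursive form would not terminate, and an iterative schedule with a stopping criterion would be needed instead.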

Application: Yeast Salt Stress
• Categories: transcription factors (with their targets) instead of GO categories
• Given:
  • List of transcription factors with their corresponding genes
  • List of genes (with their p-values) from a yeast salt stress experiment
• Question: Which transcription factors are active during salt stress?
• Task: Find a set of transcription factors that are most likely to be active
[Figure: bipartite graph of transcription factors TF1, TF2 and genes g1 to g5; an edge means e.g. "g2 is target of TF2"]

Results
• ~2,000 genes
• 118 transcription factors
[Figure: graph obtained from the re-analysis of Harbison TF binding data (Nat, 2004) by MacIsaac et al. (BMC Bioinformatics, 2006)]
24.03.2010

Results (continued)
[Figure: graph with the labelled transcription factors YML081W, DAL81, STB4, HSF1, UME6, SNT2, RGT1, MET28, MSN2, GAL4, SKO1]
Legend:
• Previously known transcription factors involved in salt stress (Capaldi et al., Nat. Gen. 2008; Wu and Chen, Bioinform Biol Insights 2009)
• Differentially phosphorylated transcription factors (Soufi et al., Mol. Biosyst. 2009)

Summary and Outlook
• To do: scalability and speed
• Lists of (meaningful) gene sets are better than lists of genes
• The search for biologically meaningful explanations requires a new minimal model (MGSE) for gene set enrichment analysis
• We use factor graphs for parameter estimation
• Wide application to GO analysis, TF-target analysis, pathway enrichment

Acknowledgments
Gene Center Munich: Achim Tresch, Theresa Niederberger, Björn Schwalb, Sebastian Dümcke
Collaborating partners:
• Gene Center Munich: Patrick Cramer, Christian Miller, Daniel Schulz, Dietmar Martin, Andreas Mayer
• EMBL Heidelberg: Julien Gagneur (talk Nov. 2009, working group conference of the GMDS "AG Statistische Methoden in der Bioinformatik", Munich)
