Graph mining in bioinformatics

Laur Tooming

Graphs in biology
• Graphs are often used in bioinformatics to describe processes in the cell
• Vertices are genes or proteins
• The meaning of an edge depends on the type of graph
  • Protein-protein interaction
  • Gene regulation

What we’re looking for
• We want to find sets of genes that have a biological meaning.
• Idea: find sets of vertices that are relevant in graph-theoretic terms, then check whether they are also biologically meaningful.
• Simple example: connected components.
• A more advanced idea: graph clustering, that is, finding subgraphs that have a high edge density.
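The simple example, connected components, can be sketched in a few lines. This is a minimal illustration that assumes an undirected graph given as a list of edges; the gene names are made up for the example and do not come from the slides.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components


# Hypothetical interaction edges between made-up gene names.
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")]
components = connected_components(edges)
```

Each returned set is a candidate gene set that could then be checked for biological meaning.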

Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://micans.org/mcl/

Markov Cluster Algorithm (MCL)
• If a graph has cluster structure, random walks tend to remain inside a cluster for a long time
• The graph is modelled as a stochastic matrix: the entries in each column sum to 1
• a_ij: the probability that a random walk leaving vertex j goes to vertex i on the next step
• A bigger edge weight means a greater probability of choosing that edge
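As a rough illustration of the stochastic matrix described above, the sketch below rescales every column of a weight matrix so that it sums to 1. The function name `column_stochastic` and the example weights are assumptions made for this example, not part of MCL itself.

```python
def column_stochastic(weights):
    """Scale each column of a square weight matrix so that it sums to 1.

    Entry [i][j] of the result is the probability that a random walk
    leaving vertex j goes to vertex i on the next step.
    """
    n = len(weights)
    column_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    return [
        # A column with no outgoing weight is left as all zeros.
        [weights[i][j] / column_sums[j] if column_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric edge weights for vertices 0, 1 and 2.
weights = [
    [0, 3, 1],
    [3, 0, 0],
    [1, 0, 0],
]
M = column_stochastic(weights)
```

Here a walk leaving vertex 0 goes to vertex 1 with probability 3/4 and to vertex 2 with probability 1/4, because the edge to vertex 1 carries three times the weight.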

Markov Cluster Algorithm (MCL)
• Two procedures, expansion and inflation, are applied alternately
• Expansion: matrix squaring
  • Considers longer random walks
• Inflation: raising each entry to some power, then rescaling columns so the matrix remains stochastic
  • Weakens weak edges and strengthens strong ones
• The process converges to a steady state
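The alternation of expansion and inflation can be sketched in plain Python as follows. This is a minimal sketch, not the reference implementation from micans.org: the inflation power of 2, the iteration cap, the tolerances, the added self-loops, and the way clusters are read off the rows of the final matrix are all assumptions chosen for the example.

```python
def normalise_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, which accounts for two-step walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise each entry to `power`, then rescale the columns."""
    return normalise_columns([[x ** power for x in row] for row in m])


def mcl(adjacency, power=2.0, max_iterations=50, tol=1e-9):
    n = len(adjacency)
    # Assumption: add a self-loop to every vertex before normalising.
    m = normalise_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iterations):
        new = inflate(expand(m), power)
        change = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if change < tol:  # steady state reached
            break
    # Assumption: each non-zero row of the limit matrix marks one cluster,
    # whose members are the columns with a non-negligible entry in that row.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
```

On this small graph the flow across the bridge edge dies out and the two triangles come out as separate clusters. A larger inflation power would generally give finer clusters.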

Markov Cluster Algorithm (MCL)
Images from http://micans.org/mcl/ani/mcl-animation.html

Betweenness centrality clustering
• An edge between different clusters lies on many shortest paths from one cluster to the other.
• An edge inside a cluster lies on fewer shortest paths, because there are more alternative paths inside a cluster.
• Betweenness centrality of an edge: the number of shortest paths in the graph that contain that edge.
• Remove the edges with the highest centrality from the graph to obtain a clustering.
• Optimisations:
  • Instead of all shortest paths, pick a sample of vertices and calculate shortest paths from them
  • Remove several edges at once
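A minimal sketch of this procedure for a small unweighted, undirected graph is given below. It makes several assumptions that are not on the slide: betweenness is computed with a Brandes-style accumulation, in which a pair of vertices with several shortest paths splits its contribution equally among them (a fractional variant of the plain count); one edge is removed per round; and removal stops once a requested number of clusters, the hypothetical parameter `n_clusters`, is reached. Neither optimisation from the slide is implemented.

```python
from collections import deque


def edge_betweenness(adj):
    """Fractional edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        sigma = {v: 0 for v in adj}      # number of shortest paths from source
        sigma[source] = 1
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, pushing path counts onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted once from each endpoint.
    return {edge: s / 2.0 for edge, s in score.items()}


def components(adj):
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        result.append(comp)
    return result


def betweenness_clusters(edges, n_clusters):
    """Remove the highest-betweenness edge until n_clusters components exist."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while len(components(adj)) < n_clusters:
        score = edge_betweenness(adj)
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the bridge C-D; the vertex names are made up.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
clusters = betweenness_clusters(edges, n_clusters=2)
```

In this example every shortest path between the two triangles uses the bridge C-D, so it has the highest betweenness and is removed first, leaving the two triangles as clusters.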

GraphWeb
• Web interface for analysing biological graphs
• Simple syntax for entering graphs
  • Multiple datasets
  • Directed edges
  • Edge weights
• Visualising graphs with GraphViz
• Finding biological meaning with g:Profiler

Example input:
  ds1: A > B 10
  ds2: A > B 4
  ds1: B C 5
  ds2: C > D 12
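To make the input syntax concrete, here is a hypothetical parser for lines shaped like the example on the slide. It is not GraphWeb's own code, and it assumes that every line carries a dataset label, two vertex names, an optional `>` marking a directed edge, and a numeric weight; the real format may allow variations that this sketch does not handle.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    dataset: str
    source: str
    target: str
    directed: bool
    weight: float


def parse_edge(line):
    """Parse one line such as 'ds1: A > B 10' (assumed format, see above)."""
    label, _, rest = line.partition(":")
    tokens = rest.split()
    directed = ">" in tokens
    if directed:
        tokens.remove(">")
    source, target, weight = tokens
    return Edge(label.strip(), source, target, directed, float(weight))


lines = [
    "ds1: A > B 10",
    "ds2: A > B 4",
    "ds1: B C 5",
    "ds2: C > D 12",
]
parsed = [parse_edge(line) for line in lines]
```

The third line has no `>`, so it is read as an undirected edge between B and C with weight 5.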

Combining several datasets
• Whether or not there is an edge between two vertices is determined by biological experiments, which may sometimes give false results.
• For a given graph, different sources may give different information. Some sources may be more trustworthy than others.
• We would like to combine different sources and assess the trustworthiness of each edge in the resulting graph.
• Edge weight in the summary graph: a weighted sum over the datasets
  • w(e, G) = Σ_i w(e, G_i) · w(G_i)
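The summary weight can be computed directly from the formula. The sketch below assumes that each dataset is a mapping from an edge to its weight in that dataset, that an edge missing from a dataset contributes 0, and that the dataset weights w(G_i) are supplied by the user; the weights 1.0 and 0.5 are invented for the example.

```python
from collections import defaultdict


def combine_datasets(datasets, dataset_weights):
    """Summary weight of each edge: sum over datasets of w(e, G_i) * w(G_i)."""
    combined = defaultdict(float)
    for name, edges in datasets.items():
        for edge, weight in edges.items():
            combined[edge] += weight * dataset_weights[name]
    return dict(combined)


# Edge weights per dataset, mirroring the small GraphWeb example.
datasets = {
    "ds1": {("A", "B"): 10, ("B", "C"): 5},
    "ds2": {("A", "B"): 4, ("C", "D"): 12},
}
# Hypothetical trust assigned to each dataset.
dataset_weights = {"ds1": 1.0, "ds2": 0.5}

summary = combine_datasets(datasets, dataset_weights)
```

The edge A-B appears in both datasets, so its summary weight is 10 · 1.0 + 4 · 0.5 = 12, while C-D, seen only in the less trusted dataset, ends up with 12 · 0.5 = 6.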


The end
