Mining Coherent Dense Subgraphs across Multiple Biological Networks

Vahid Mirjalili, CSE 891

Motivation: • Find patterns that recur across multiple networks in order to identify biological modules and predict gene function • Existing algorithms are too costly • A novel algorithm, CODENSE, was developed • Scalable in the number and size of the networks • Adjustable for exact or approximate pattern mining

Clustering can detect meaningful biological modules • e.g. a dense protein interaction sub-network may correspond to a protein complex • A dense co-expression sub-network may represent a co-expression cluster • Biological modules are expected to be active across multiple conditions • One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network • Risk: false-positive detection

Aggregated graph: false positives in the aggregated graph • Adding six graphs together and deleting the edges that occur fewer than 3 times gives the resulting summary graph

Solution to the false-positive problem of the summary graph • Frequent sub-graphs • Mine the dense sub-graphs directly in each original network • A sub-graph is frequent if it occurs multiple times in a set of graphs • In biological networks, each gene occurs only once in a graph, so there is no sub-graph isomorphism problem

Frequent dense sub-graph • A frequent dense sub-graph does not always show accurate information • Some edges in the frequent sub-graph shown above do not occur in the original set • It is more meaningful to divide it into two sub-graphs

Coherent Dense Sub-graphs • All edges in a coherent sub-graph should have correlated occurrences in the original graph set • CODENSE reduces the networks to two meta-graphs and performs clustering on these two graphs only (instead of on the individual networks) • CODENSE can distinguish the two modules • Good scalability • Discovery of overlapping clusters

Overlapping Sub-graphs • Partition-based clustering algorithms fail to identify overlapping sub-graphs • Mining Overlapping Dense Sub-graphs (MODES)

Application • Identify frequent co-expression clusters across multiple microarray datasets • Each microarray dataset is modeled as an un-weighted, undirected graph • Each node represents a gene • Two genes are connected by an edge if they show high expression correlation • A densely connected sub-graph corresponds to a tight co-expression cluster • Clusters from a single microarray dataset include spurious links and may not be homogeneous in function and regulation

Problem Formulation • A relation graph dataset D contains n simple graphs, $D = \{G_1, \ldots, G_n\}$ with $G_i = (V, E_i)$ • A common vertex set V is shared by the graphs • Support(G): the number of graphs in the relation graph dataset D in which G occurs • A graph is frequent if support(G) > threshold • Summary graph: an un-weighted graph extracted from D, in which an edge exists only if it occurs in more than k graphs in D
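The support and summary-graph definitions above can be illustrated with a short sketch. It is not taken from the slides: the function names, the edge-list representation and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def edge_key(u, v):
    """Canonical key for an undirected edge, so (a, b) equals (b, a)."""
    return (u, v) if u <= v else (v, u)

def summary_graph(graphs, k):
    """Return the edges that occur in more than k of the input graphs.

    graphs: list of edge lists, one per network, over a shared vertex set.
    """
    counts = Counter()
    for edges in graphs:
        # A set ensures an edge listed twice in one graph is counted once.
        counts.update({edge_key(u, v) for u, v in edges})
    return {e for e, c in counts.items() if c > k}

def support(subgraph_edges, graphs):
    """Number of graphs that contain every edge of the given sub-graph."""
    wanted = {edge_key(u, v) for u, v in subgraph_edges}
    return sum(wanted <= {edge_key(u, v) for u, v in edges} for edges in graphs)

# Toy relation graph dataset with three networks (hypothetical data).
graphs = [
    [("a", "b"), ("b", "c")],
    [("b", "a"), ("c", "d")],
    [("a", "b"), ("b", "c"), ("c", "d")],
]
frequent_edges = summary_graph(graphs, k=2)
path_support = support([("a", "b"), ("b", "c")], graphs)
```

Because every gene appears once per network, matching a sub-graph reduces to a set-inclusion test on edges, which is why no isomorphism search is needed here.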

Problem Formulation • Edge Support Vector: $w(e) = [w_{G_1}(e), \ldots, w_{G_n}(e)]$, where $w_{G_i}(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1)
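For un-weighted graphs the edge support vector is just a 0/1 occurrence pattern, one entry per network. A minimal sketch, assuming each network is given as an edge list (the function name and toy data are hypothetical):

```python
def edge_support_vectors(graphs):
    """Map each undirected edge to its 0/1 occurrence vector across graphs."""
    # Normalize every edge so that (a, b) and (b, a) are the same key.
    edge_sets = [{tuple(sorted(e)) for e in edges} for edges in graphs]
    all_edges = set().union(*edge_sets)
    return {e: [int(e in es) for es in edge_sets] for e in all_edges}

# Three toy networks over a shared vertex set (hypothetical data).
graphs = [
    [("a", "b"), ("b", "c")],
    [("b", "a"), ("c", "d")],
    [("a", "b"), ("b", "c"), ("c", "d")],
]
vectors = edge_support_vectors(graphs)
```

The sum of an edge's vector is its support, so the same structure can feed both the summary graph and the correlation step of the second-order graph.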

Second-Order Graph S: a graph in which each node represents an edge from the relation graph dataset D, and an edge between nodes u and v exists if their edge support vectors w(u) and w(v) are highly correlated • For efficiency, construct S only for a sub-graph of the summary graph
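One way to picture the second-order graph is the sketch below, which links two first-order edges when their support vectors are close in Euclidean distance (the measure the slides mention for un-weighted graphs). The cutoff `max_distance`, the function name and the toy vectors are assumptions, not values from the presentation.

```python
import math
from itertools import combinations

def second_order_graph(vectors, max_distance):
    """Connect two first-order edges whose support vectors are similar.

    vectors: dict mapping edge -> list of 0/1 occurrences, one per graph.
    max_distance: hypothetical cutoff on the Euclidean distance.
    """
    links = set()
    # Each node of the second-order graph is an edge of the original graphs.
    for (e1, w1), (e2, w2) in combinations(sorted(vectors.items()), 2):
        if math.dist(w1, w2) <= max_distance:
            links.add((e1, e2))
    return links

# Two edges that co-occur in graphs 1-2, one that occurs only in graphs 3-4.
vectors = {
    ("a", "b"): [1, 1, 0, 0],
    ("b", "c"): [1, 1, 0, 0],
    ("c", "d"): [0, 0, 1, 1],
}
links = second_order_graph(vectors, max_distance=1.0)
```

Edges that appear equally often but in different networks end up unconnected in S, which is exactly the information a summary graph discards.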

Coherent Graph: a sub-graph extracted from the summary graph is coherent if • All its edges have support > k • Its second-order graph is dense • Graph Density: $2m / (n(n-1))$, where m is the number of edges and n is the number of nodes
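As a small worked example of the density measure, the sketch below assumes the usual definition for an undirected simple graph, 2m / (n(n-1)), i.e. the fraction of possible edges that are present. The function name is hypothetical.

```python
def density(num_nodes, num_edges):
    """Fraction of possible undirected edges present: 2m / (n(n-1))."""
    if num_nodes < 2:
        return 0.0  # no edge is possible with fewer than two nodes
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

clique_density = density(4, 6)   # a 4-clique has all 6 possible edges
sparse_density = density(5, 5)   # 5 of the 10 possible edges
```

A density cutoff applied to this value is what decides whether a candidate sub-graph, first-order or second-order, counts as dense.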

Two facts: • If a frequent sub-graph is dense, then it must also be dense in the summary graph, but the converse does not always hold • If a sub-graph is coherent (its edges have high correlation across the dataset), then its second-order sub-graph is dense

Aggregate the graphs into a summary graph • Eliminate infrequent edges

MODES: Mining Overlapping DEnse Subgraphs • Developed based on HCS: Highly Connected Sub-graphs • Can efficiently identify dense sub-graphs • Can mine overlapping sub-graphs • Two approaches: • Minimum cut • Normalized cut (Shi, Malik 2000) • Apply the normalized cut in the initial steps of the HCS algorithm, then, once the partitions are small, proceed with the minimum cut

CODENSE analysis • Reduces the identification of coherent dense sub-graphs across n graphs to mining two special graphs: the summary graph and the second-order graph • Can mine network modules • Can mine both exact and approximate patterns (by modifying the similarity threshold) • Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)

Experimental Study: co-expression network • 39 yeast microarray datasets • 6661 genes • Calculate the Pearson correlation (r) between the expression levels of each pair of genes • Construct the relation graph for each dataset, with the connectivity of two genes determined by their Pearson correlation (n: number of measurements)
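The construction of one co-expression network can be sketched as follows. The slides do not spell out the exact rule that turns r and n into an edge, so the fixed cutoff `min_r` is purely an assumption, as are the function names and the toy expression profiles.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expression, min_r):
    """Connect two genes when their correlation reaches the cutoff min_r.

    min_r is a hypothetical fixed cutoff, not the criterion from the study.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if pearson(expression[g1], expression[g2]) >= min_r:
            edges.add((g1, g2))
    return edges

# One toy dataset: gene -> expression level per measurement (made-up values).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression, min_r=0.9)
```

Running this once per microarray dataset would yield the list of edge sets that makes up the relation graph dataset.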

Create the summary graph Ĝ, removing edges that occur fewer than 6 times across the 39 graphs • Apply MODES to Ĝ to identify dense sub-graphs, sub(Ĝ), with cutoff density d1 • For each sub(Ĝ), construct the second-order graph S • Apply MODES to S to identify sub-graphs with density > d2 • Transform the edges of these second-order sub-graphs back into vertices of the original graphs, and apply MODES again to identify the dense sub-graphs with density > d3

Functional Module Discovery: MODES vs CODENSE • A cluster is considered functionally homogeneous if: • Its functional homogeneity, modeled by the hypergeometric distribution, is significant at α=0.01 • At least 40% of its member genes belong to a specific GO (Gene Ontology) functional category • MODES identified 366 clusters, but only 151 were functionally homogeneous (42%) • CODENSE identified 770 clusters, 76% of which were homogeneous • The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks
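The two homogeneity criteria can be combined in a short sketch. It assumes the significance test is the upper tail of the hypergeometric distribution (the chance of seeing at least the observed number of category members in a random cluster of the same size); the function names and the category size of 200 genes are hypothetical, while 6661, α=0.01 and 40% come from the slides.

```python
from math import comb

def hypergeom_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) when drawing cluster_size genes without replacement."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total

def is_homogeneous(population, annotated, cluster_size, hits,
                   alpha=0.01, min_fraction=0.4):
    """Apply both criteria: enough members in the category, and significance."""
    enough_members = hits / cluster_size >= min_fraction
    significant = hypergeom_pvalue(population, annotated, cluster_size, hits) < alpha
    return enough_members and significant

# 6661 genes in total; a category of 200 genes is an assumed example size.
dense_hit = is_homogeneous(6661, 200, cluster_size=10, hits=5)
weak_hit = is_homogeneous(6661, 200, cluster_size=10, hits=1)
```

Requiring both conditions keeps large clusters from passing on significance alone when only a small share of their genes carries the annotation.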

Example of MODES false positive: MODES identified 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) which are not functionally homogeneous • Functional categories shown: protein biosynthesis, replicative cell aging, mitochondrial electron transfer

Functional prediction: • CODENSE identified this 6-node sub-graph • 5 genes belong to the “protein biosynthesis” category • Prediction: ASC1 is likely involved in protein biosynthesis as well • Test with 448 known genes: 50% accuracy
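The prediction step can be read as a simple majority vote inside a cluster, sketched below. This is an illustrative guess at the rule rather than the procedure used in the study: the function name, the reuse of a 40% cutoff and the placeholder names gene1 to gene5 are all assumptions (the slide names only ASC1).

```python
from collections import Counter

def predict_function(cluster, annotations, min_fraction=0.4):
    """Assign the cluster's dominant category to its unannotated genes.

    cluster: list of gene names; annotations: dict gene -> known category.
    min_fraction: hypothetical minimum share of the cluster in that category.
    """
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return {}
    category, count = Counter(known).most_common(1)[0]
    if count / len(cluster) < min_fraction:
        return {}
    return {g: category for g in cluster if g not in annotations}

# Five placeholder genes annotated with one category, plus unannotated ASC1.
cluster = ["gene1", "gene2", "gene3", "gene4", "gene5", "ASC1"]
annotations = {g: "protein biosynthesis" for g in cluster[:5]}
prediction = predict_function(cluster, annotations)
```

Accuracy can then be estimated by hiding the annotation of a known gene, predicting it from its cluster, and comparing the result with the hidden label.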
