Frequent Subgraph Pattern Mining on Uncertain Graph Data

Frequent Subgraph Pattern Mining on Uncertain Graph Data • Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang • Harbin Institute of Technology, China • CIKM’09, Hong Kong • Nov 4, 2009

Outline • Background • Problem Definition • Algorithm • Experimental Results • Conclusions

Background • Graph mining has played an important role in a range of real world applications. • medicines: structures of molecules • bioinformatics: biological networks • technologies: WWW • social science: social networks • many others

Directions of Graph Mining • Models of graphs, e.g., [Leskovec et al. KDD’05] • Patterns of graphs, e.g., [Yan et al. ICDM’02] • Uncertainties of graphs • Privacy of graphs, e.g., [Zou et al. VLDB’09] • Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]

Uncertainties of Graphs: Example I • Protein-Protein Interaction (PPI) Networks • Vertices: proteins • Edges: interactions between proteins • Uncertainties: probabilities of interactions really existing • [Figure: PPI network over the proteins TIF34, FET3, NTG1, SMT3, RAD59, and RPC40, with edge probabilities 0.375, 0.639, 0.867, 0.651, 0.651, 0.147, 0.639, and 0.698] • The data are taken from the STRING Database (http://string-db.org).

Uncertainties of Graphs: Example II • Topologies of wireless sensor networks (WSNs) • Vertices: sensor nodes • Edges: wireless links between sensor nodes • Uncertainties: probabilities of wireless links functioning at any given time • [Figure: sensor network with link probabilities 0.75, 0.95, 0.88, 0.92, and 0.69]

Preliminaries • Graph Database: [Figure: graphs G1 and G2 with vertex labels A, B and edge labels x, y, z] • Subgraph Pattern: [Figure: two subgraph patterns, one with support = 1.0 and one with support = 0.5] • The support of S = (the number of graphs containing S) / (the total number of graphs)

Frequent Subgraph Pattern Mining Problem • Input: a graph database D, and a support threshold minsup • Output: all subgraph patterns with support no less than minsup • FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08]. • PPI networks are subject to uncertainties. • How do we define support?

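Model of Uncertain Graphs • Uncertain Graph: [Figure: uncertain graph with vertex labels A, B, edge labels x, y, and edge existence probabilities 0.5, 0.6, 0.7, 0.8] • Implicated Graph: the uncertain graph may exist in the form of any of its implicated graphs. • [Figure: two implicated graphs, each missing one edge] The first exists with probability (1 – 0.5) * 0.6 * 0.7 * 0.8 = 0.168, the second with probability 0.5 * (1 – 0.6) * 0.7 * 0.8 = 0.112.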
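The two products on this slide follow one rule: multiply p for every edge that is present in the implicated graph and (1 – p) for every edge that is absent. A minimal sketch of that rule follows; the edge names e1 to e4 are hypothetical, since the figure does not survive in the transcript, and only the four probabilities come from the slide.

```python
def implicated_graph_probability(edge_probs, present_edges):
    """Probability that an uncertain graph exists as one implicated graph.

    edge_probs maps each edge to its existence probability;
    present_edges is the set of edges kept in the implicated graph.
    Edges are assumed to exist independently of each other.
    """
    probability = 1.0
    for edge, p in edge_probs.items():
        probability *= p if edge in present_edges else 1.0 - p
    return probability


# Hypothetical edge names; the probabilities are the ones on the slide.
edge_probs = {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}

# Implicated graph without the 0.5 edge, then without the 0.6 edge.
first = implicated_graph_probability(edge_probs, {"e2", "e3", "e4"})
second = implicated_graph_probability(edge_probs, {"e1", "e3", "e4"})
```

Up to floating-point rounding, `first` and `second` reproduce the slide's values 0.168 and 0.112.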

Model of Uncertain Graphs (Cont’d) Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
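The theorem can be checked numerically on a small case: enumerating every subset of edges gives all implicated graphs, and their probabilities should add up to 1. The sketch below does this for a four-edge uncertain graph with the probabilities used earlier in the deck; the edge names are hypothetical and independence of edges is assumed.

```python
from itertools import combinations


def implicated_graph_distribution(edge_probs):
    """Map every implicated graph (as a frozenset of edges) to its probability."""
    edges = list(edge_probs)
    distribution = {}
    for size in range(len(edges) + 1):
        for present in combinations(edges, size):
            probability = 1.0
            for edge in edges:
                p = edge_probs[edge]
                probability *= p if edge in present else 1.0 - p
            distribution[frozenset(present)] = probability
    return distribution


# Hypothetical edge names; four independent edges give 2**4 implicated graphs.
distribution = implicated_graph_distribution(
    {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}
)
number_of_implicated_graphs = len(distribution)
total_probability = sum(distribution.values())
```

Here `number_of_implicated_graphs` is 16 and `total_probability` equals 1 up to rounding, which is what the theorem states for this instance.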

Uncertain Graph Databases • [Figure: uncertain graph G1 with edge probabilities 0.5, 0.6, 0.7, 0.8 and uncertain graph G2 with edge probabilities 0.8, 0.1, 0.7, each shown with one of its implicated graphs] • Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs. • In total, there are 2^4 * 2^3 = 128 implicated graph databases. • Implicated Graph Database: the one shown exists with probability ((1 – 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 – 0.7)) = 4.032 * 10^-3
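Both numbers on this slide can be reproduced in a few lines. The sketch assumes that edges are independent within and across graphs, and it uses hypothetical edge names (a1 to a4, b1 to b3) because the figure is not preserved; only the probabilities are taken from the slide.

```python
def implicated_graph_probability(edge_probs, present_edges):
    """Product of p for present edges and (1 - p) for absent edges."""
    probability = 1.0
    for edge, p in edge_probs.items():
        probability *= p if edge in present_edges else 1.0 - p
    return probability


# Hypothetical edge names; probabilities as on the slide.
g1 = {"a1": 0.5, "a2": 0.6, "a3": 0.7, "a4": 0.8}
g2 = {"b1": 0.8, "b2": 0.1, "b3": 0.7}

# Each graph with m edges has 2**m implicated graphs, and a database picks
# one implicated graph per uncertain graph.
number_of_implicated_dbs = (2 ** len(g1)) * (2 ** len(g2))

# The implicated database on the slide drops the 0.5 edge of G1 and the
# 0.7 edge of G2; its probability is the product over both graphs.
db_probability = (
    implicated_graph_probability(g1, {"a2", "a3", "a4"})
    * implicated_graph_probability(g2, {"b1", "b2"})
)
```

`number_of_implicated_dbs` is 128 and `db_probability` is 0.004032 up to rounding, matching 4.032 * 10^-3.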

Expected Support • [Figure: uncertain graph DB D implicating the graph DBs d1, d2, …, dn] • pi = Pr(D implicates di), for i = 1, …, n • si = support of S in di, for i = 1, …, n • The expected support of S is esup(S) = p1 * s1 + p2 * s2 + … + pn * sn.
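For a tiny database the expected support can be computed directly from this definition by enumerating every implicated database. The sketch below is a brute-force illustration, not the mining algorithm of the talk. Its inputs are assumptions: each uncertain graph is given as a dictionary of edge probabilities plus a list of embeddings (edge sets that each realise the pattern S), and the embeddings of the second graph are invented for the example.

```python
from itertools import product


def graph_worlds(edge_probs, embeddings):
    """Yield (probability, contains_S) for every implicated graph."""
    edges = list(edge_probs)
    for flags in product([False, True], repeat=len(edges)):
        present = {edge for edge, kept in zip(edges, flags) if kept}
        probability = 1.0
        for edge, kept in zip(edges, flags):
            probability *= edge_probs[edge] if kept else 1.0 - edge_probs[edge]
        contains = any(embedding <= present for embedding in embeddings)
        yield probability, contains


def expected_support(database):
    """Sum of p_i * s_i over all implicated databases d_i."""
    per_graph = [list(graph_worlds(probs, embs)) for probs, embs in database]
    esup = 0.0
    for combination in product(*per_graph):
        p_i = 1.0
        graphs_containing_s = 0
        for probability, contains in combination:
            p_i *= probability
            graphs_containing_s += contains
        s_i = graphs_containing_s / len(database)
        esup += p_i * s_i
    return esup


# Assumed inputs: edge names and embeddings are illustrative only.
g1 = ({"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8},
      [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}])
g2 = ({"y1": 0.8, "y2": 0.1, "y3": 0.7},
      [{"y1", "y3"}])

esup = expected_support([g1, g2])
```

With these inputs S occurs in the first graph with probability 0.782 and in the second with probability 0.56, so `esup` is 0.671 up to rounding. The enumeration visits all 128 implicated databases, which is why this approach does not scale beyond toy inputs.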

FSP Mining Problem on Uncertain Graphs • Input: an uncertain graph database D, and an expected support threshold minsup • Output: all subgraph patterns with expected support no less than minsup • It is #P-hard to count the number of frequent subgraph patterns. • Reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula. • The FSP mining problem on uncertain graphs is NP-hard.

Approximation Method • It is #P-hard to compute the expected support of a subgraph pattern. • We develop an approximation method to find an approximate set of frequent subgraph patterns. • Let e (0 < e < 1) be a relative error tolerance. • [Figure: expected-support axis from 0 to 1, split at (1-e)*minsup and minsup into the regions Discard, Arbitrary, and Output]

Objective I • Difficulty I: The number of frequent subgraph patterns is exponentially large. • Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

Method for Objective I • [Figure: uncertain graphs G1 and G2, and the expected-support axis with its Discard, Arbitrary, and Output regions] • Step 1: Build a search tree T of subgraph patterns. • Step 2: Examine the subgraph patterns in T in depth-first order. • If S is infrequent, then all its descendants can be pruned.
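The pruning rule can be illustrated with a generic depth-first traversal. The sketch below is a toy stand-in and not the search tree of the talk: patterns are tuples of edge labels instead of subgraphs, the `children` function and the label scores are hypothetical, and the "expected support" is a made-up product of scores chosen only because it never increases when a pattern grows.

```python
def mine_frequent_patterns(roots, children, is_frequent):
    """Depth-first search that skips the subtree of every infrequent pattern."""
    frequent = []
    stack = list(reversed(roots))
    while stack:
        pattern = stack.pop()
        if not is_frequent(pattern):
            continue  # prune: no descendant of this pattern is examined
        frequent.append(pattern)
        stack.extend(reversed(children(pattern)))
    return frequent


# Toy setting (all values hypothetical): labels with scores in (0, 1].
label_score = {"x": 0.9, "y": 0.6, "z": 0.3}
labels = sorted(label_score)
minsup = 0.5


def toy_expected_support(pattern):
    score = 1.0
    for label in pattern:
        score *= label_score[label]
    return score


def children(pattern):
    # Extend only with labels after the last one, so each pattern has one parent.
    return [pattern + (label,) for label in labels if label > pattern[-1]]


frequent = mine_frequent_patterns(
    [(label,) for label in labels],
    children,
    lambda pattern: toy_expected_support(pattern) >= minsup,
)
```

In this toy run `frequent` is `[("x",), ("x", "y"), ("y",)]`. The pattern `("z",)` is rejected at the root and `("x", "y", "z")` is rejected when it is reached, so nothing below either of them is explored.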

Objective II • [Figure: expected-support axis from 0 to 1, split at (1-e)*minsup and minsup into the regions Discard, Arbitrary, and Output] • Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S. • Objective II: Make the following judgments without computing esup(S) exactly. • If esup(S) is surely not in the green region, then discard. • If esup(S) is probable to be in the green region and surely not in the red region, then output.

Method for Objective II • [Figure: expected-support axis from 0 to 1, split at (1-e)*minsup and minsup] • Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u]. • Step 2: Decide whether S can be output or not by testing conditions on [l, u]. The three possible outcomes are Output, Discard, and Shrink.
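The exact test conditions are not preserved in the transcript. One rule that is consistent with the Discard, Arbitrary, and Output regions is sketched below; it is an assumption, not necessarily the conditions shown on the slide. If the whole interval lies below minsup, S cannot be in the region that must be output, so discarding is safe. If the whole interval lies at or above (1-e)*minsup, S cannot be in the region that must be discarded, so outputting is safe. Otherwise the interval is too wide to decide.

```python
def decide(l, u, minsup, e):
    """Decide on a pattern from an interval [l, u] known to contain esup(S).

    Assumed rule (not taken verbatim from the slides):
      - u < minsup            -> esup(S) < minsup, so discarding is allowed
      - l >= (1 - e) * minsup -> esup(S) is outside the discard region
      - otherwise             -> the interval must be shrunk and re-tested
    """
    if u < minsup:
        return "discard"
    if l >= (1.0 - e) * minsup:
        return "output"
    return "shrink"


# Hypothetical thresholds for illustration.
minsup, e = 0.5, 0.1
low_interval = decide(0.10, 0.20, minsup, e)
high_interval = decide(0.60, 0.70, minsup, e)
wide_interval = decide(0.30, 0.60, minsup, e)
```

Under this rule an interval of width at most e*minsup never needs shrinking: if u >= minsup, then l >= u - e*minsup >= (1-e)*minsup, so the pattern is output.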

Approximating esup(S) by [l, u] • A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G. • Algorithm: Approximate esup(S) by [l, u] • Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup. • Step 2: [formula not preserved]
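The formula for Step 2 is missing from the transcript. A natural way to combine the per-graph intervals, offered here as an assumption, uses linearity of expectation: the support is the fraction of graphs containing S, so the expected support is the average of the probabilities Pr(S occurs in Gi). Averaging the lower bounds and the upper bounds then gives an interval that contains esup(S) and is no wider than the widest per-graph interval.

```python
def combine_intervals(intervals):
    """Average per-graph intervals [l_i, u_i] into one interval [l, u].

    Assumes esup(S) is the mean of Pr(S occurs in G_i) over the database,
    which follows from linearity of expectation.
    """
    n = len(intervals)
    l = sum(lower for lower, _ in intervals) / n
    u = sum(upper for _, upper in intervals) / n
    # Clamp to [0, 1], since the bounds describe a probability-like quantity.
    return max(0.0, l), min(1.0, u)


# Hypothetical per-graph intervals, each of width 0.05.
l, u = combine_intervals([(0.75, 0.80), (0.55, 0.60)])
```

For the two hypothetical intervals the result is approximately [0.65, 0.70], which keeps the width of 0.05.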

Approximate Pr(S occurs in Gi) by [li, ui] • [Figure: pattern S and uncertain graph Gi, whose edges are labelled with the variables x1, x2, x3, x4 and the probabilities 0.5, 0.6, 0.7, 0.8] • Step 1: Find all embeddings of S in Gi (4 embeddings in the example). • Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8. • Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4). • Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4. • Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0, 1]). • Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
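Steps 2 to 6 can be tried on the example formula. The sketch below computes Pr(F = TRUE) exactly by enumeration, which is feasible only because there are four variables, and then estimates it with a coverage-style sampler in the spirit of Karp and Luby. The sampler is a simplified illustration: the sample count, the random seed, and the values of e and minsup are arbitrary assumptions, and the code does not derive the number of samples needed for a stated error and confidence.

```python
import itertools
import random


def exact_dnf_probability(clauses, prob):
    """Sum the probabilities of all assignments that satisfy some clause."""
    variables = sorted(prob)
    total = 0.0
    for values in itertools.product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if any(all(assignment[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if assignment[v] else 1.0 - prob[v]
            total += weight
    return total


def coverage_estimate(clauses, prob, samples, rng):
    """Coverage-style Monte Carlo estimate of Pr(F = TRUE) for a DNF F."""
    clause_weights = []
    for clause in clauses:
        weight = 1.0
        for v in clause:
            weight *= prob[v]
        clause_weights.append(weight)
    total_weight = sum(clause_weights)
    accumulated = 0.0
    for _ in range(samples):
        # Pick a clause proportionally to its probability, then sample an
        # assignment conditioned on that clause being true.
        chosen = rng.choices(clauses, weights=clause_weights, k=1)[0]
        assignment = {v: (v in chosen) or (rng.random() < prob[v]) for v in prob}
        satisfied = sum(all(assignment[v] for v in c) for c in clauses)
        accumulated += 1.0 / satisfied  # the chosen clause holds, so >= 1
    return total_weight * accumulated / samples


prob = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [("x1", "x2"), ("x1", "x4"), ("x2", "x3"), ("x3", "x4")]

exact = exact_dnf_probability(clauses, prob)
p = coverage_estimate(clauses, prob, samples=20000, rng=random.Random(0))

# Step 6 with assumed values e = 0.1 and minsup = 0.5.
e, minsup = 0.1, 0.5
interval = (p - e * minsup / 2, p + e * minsup / 2)
```

For this formula F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4), so the exact value is (1 - 0.5 * 0.3) * (1 - 0.4 * 0.2) = 0.782, which `exact` reproduces. Weighting each sample by 1 over the number of satisfied clauses corrects for assignments that are reachable through several clauses and keeps the estimate unbiased.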

Experimental Results • Data • The STRING Database (http://string-db.org)

Time Efficiency

Approximation Quality

Scalability

Conclusions • A new model of uncertain graph data has been proposed. • The frequent subgraph pattern mining problem on uncertain graph data has been formalized. • The computational complexity of the problem has been formally proved to be NP-hard. • An approximate mining algorithm has been proposed. • The proposed algorithm has high efficiency, high approximation quality, and high scalability.

Thank you
