
Frequent Subgraph Pattern Mining on Uncertain Graph Data

Frequent Subgraph Pattern Mining on Uncertain Graph Data. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang Harbin Institute of Technology, China CIKM’09, Hong Kong Nov 4, 2009. Outline. Background Problem Definition Algorithm Experimental Results Conclusions. Background.


Presentation Transcript


  1. Frequent Subgraph Pattern Mining on Uncertain Graph Data Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang Harbin Institute of Technology, China CIKM’09, Hong Kong Nov 4, 2009

  2. Outline • Background • Problem Definition • Algorithm • Experimental Results • Conclusions

  3. Background • Graph mining has played an important role in a range of real world applications. • medicines: structures of molecules • bioinformatics: biological networks • technologies: WWW • social science: social networks • many others

  4. Directions of Graph Mining • Models of graphs, e.g., [Leskovec et al. KDD’05] • Patterns of graphs, e.g., [Yan et al. ICDM’02] • Uncertainties of graphs • Privacy of graphs, e.g., [Zou et al. VLDB’09] • Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]

  5. Uncertainties of Graphs: Example I • Protein-Protein Interaction (PPI) Networks • Vertices: proteins • Edges: interactions between proteins • Uncertainties: probabilities that the interactions really exist • [Figure: a PPI network over the proteins TIF34, FET3, NTG1, SMT3, RAD59, and RPC40, with edge probabilities 0.375, 0.639, 0.867, 0.651, 0.651, 0.147, 0.639, 0.698] • The data are taken from the STRING Database (http://string-db.org).

  6. Uncertainties of Graphs: Example II • Topologies of wireless sensor networks (WSNs) • Vertices: sensor nodes • Edges: wireless links between sensor nodes • Uncertainties: probabilities that the wireless links are functioning at any given time • [Figure: a sensor network topology with link probabilities 0.75, 0.95, 0.88, 0.92, 0.69]

  7. Outline • Background • Problem Definition • Algorithm • Experimental Results • Conclusions

  8. Preliminaries • [Figure: a graph database containing graphs G1 and G2, with vertex labels A and B and edge labels x, y, z, and two subgraph patterns, one with support = 1.0 and one with support = 0.5] • The support of S = (the number of graphs containing S) / (the total number of graphs)

  9. Frequent Subgraph Pattern Mining Problem • Input: a graph database D, and a support threshold minsup • Output: all subgraph patterns with support no less than minsup • FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08]. • PPI networks are subject to uncertainties. • How do we define support?

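  10. Model of Uncertain Graphs • [Figure: an uncertain graph with vertex labels A and B, edge labels x and y, and edge probabilities 0.5, 0.6, 0.7, 0.8, shown with two of its implicated graphs] • Probability that the uncertain graph exists in the form of the first implicated graph: (1 – 0.5) * 0.6 * 0.7 * 0.8 = 0.168 • Probability that it exists in the form of the second implicated graph: 0.5 * (1 – 0.6) * 0.7 * 0.8 = 0.112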

  11. Model of Uncertain Graphs (Cont’d) Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
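The two products on slide 10 and the theorem on slide 11 can be checked numerically. The following is a minimal sketch, assuming edges exist independently of each other (which the products on slide 10 imply); the edge names `e1` to `e4` and the function names are hypothetical labels, not taken from the slides.

```python
from itertools import combinations
from math import prod


def implicated_graph_probability(edge_probs, present):
    """Probability that exactly the edges in `present` exist (independent edges assumed)."""
    return prod(p if e in present else 1.0 - p for e, p in edge_probs.items())


def all_edge_subsets(edges):
    """Yield every subset of the edge set; each subset is one implicated graph."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            yield set(subset)


# Edge probabilities from slide 10; the names e1..e4 are hypothetical.
edge_probs = {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}

# First implicated graph: the 0.5 edge is absent, the others are present.
p_first = implicated_graph_probability(edge_probs, {"e2", "e3", "e4"})
# Second implicated graph: the 0.6 edge is absent, the others are present.
p_second = implicated_graph_probability(edge_probs, {"e1", "e3", "e4"})

# Summing over all 2^4 implicated graphs should give 1 (a probability distribution).
total = sum(implicated_graph_probability(edge_probs, s) for s in all_edge_subsets(edge_probs))
print(p_first, p_second, total)
```

Up to floating-point rounding, `p_first` and `p_second` match the values 0.168 and 0.112 on slide 10, and `total` equals 1.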

  12. Uncertain Graph Databases • [Figure: uncertain graph G1 (edge probabilities 0.5, 0.6, 0.7, 0.8) and uncertain graph G2 (edge probabilities 0.8, 0.1, 0.7), each shown with one of its implicated graphs; the two implicated graphs together form an implicated graph database] • Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs. • In total, there are 2^4 * 2^3 = 128 implicated graph databases. • Probability of the implicated graph database shown: ((1 – 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 – 0.7)) = 4.032 * 10^-3

  13. Expected Support • D: an uncertain graph DB implicating the graph databases d1, d2, ……, dn • p1 = Pr(D implicates d1), p2 = Pr(D implicates d2), ……, pn = Pr(D implicates dn) • s1 = support of S in d1, s2 = support of S in d2, ……, sn = support of S in dn • The expected support of S is esup(S) = p1 * s1 + p2 * s2 + …… + pn * sn
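The definition on slide 13 can be evaluated by brute force on a tiny database. The sketch below assumes independent edges and represents "S occurs in an implicated graph" by a list of embeddings, each a set of edge names; the first graph reuses the probabilities and the four embeddings from slide 22, while the second graph's edge names and its single embedding are hypothetical.

```python
from itertools import product


def implicated_graphs(edge_probs):
    """Yield (present_edges, probability) for every implicated graph of one uncertain graph."""
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        present = {e for e, keep in zip(edges, mask) if keep}
        prob = 1.0
        for e, keep in zip(edges, mask):
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        yield present, prob


def contains(present, embeddings):
    """The pattern occurs if all edges of at least one embedding are present."""
    return any(emb <= present for emb in embeddings)


def expected_support(database):
    """database: list of (edge_probs, embeddings). Sums p_i * s_i over all implicated DBs."""
    per_graph = [list(implicated_graphs(edge_probs)) for edge_probs, _ in database]
    esup = 0.0
    for combo in product(*per_graph):  # one implicated graph per uncertain graph
        p = 1.0
        hits = 0
        for (present, prob), (_, embeddings) in zip(combo, database):
            p *= prob
            hits += contains(present, embeddings)
        esup += p * hits / len(database)  # s_i = fraction of graphs containing S
    return esup


g1 = ({"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8},
      [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}])
g2 = ({"y1": 0.8, "y2": 0.1, "y3": 0.7}, [{"y1", "y3"}])  # hypothetical embedding
esup = expected_support([g1, g2])
print(esup)
```

By linearity of expectation, the same value is the average over the uncertain graphs of Pr(S occurs in Gi), which is why the later slides work with one graph at a time. The enumeration itself is exponential in the number of edges and is only usable on toy inputs.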

  14. FSP Mining Problem on Uncertain Graphs • Input: an uncertain graph database D, and an expected support threshold minsup • Output: all subgraph patterns with expected support no less than minsup • It is #P-hard to count the number of frequent subgraph patterns. • Reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula. • The FSP mining problem on uncertain graphs is NP-hard.

  15. Outline • Background • Problem Definition • Algorithm • Experimental Results • Conclusions

  16. Approximation Method • It is #P-hard to compute the expected support of a subgraph pattern. • We develop an approximation method to find an approximate set of frequent subgraph patterns. • Let e (0 < e < 1) be a relative error tolerance. • [Figure: the expected support axis from 0 to 1, divided into three regions: Discard below (1-e) minsup, Arbitrary between (1-e) minsup and minsup, and Output from minsup up to 1]
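The three regions of the approximation guarantee can be written as a small function. This is a sketch only: the function name `region` is hypothetical, and the treatment of the exact boundary at (1-e) minsup is an assumption, since the slide does not say which region owns that point.

```python
def region(esup, minsup, e):
    """Classify an expected support value into the regions of slide 16.

    Boundary handling at (1 - e) * minsup is an assumption of this sketch.
    """
    if not 0.0 < e < 1.0:
        raise ValueError("e must satisfy 0 < e < 1")
    if esup >= minsup:
        return "output"      # truly frequent patterns must be reported
    if esup < (1.0 - e) * minsup:
        return "discard"     # clearly infrequent patterns must be dropped
    return "arbitrary"       # near misses may be reported or dropped


# With minsup = 0.4 and e = 0.1 the cut points are 0.36 and 0.4.
print(region(0.50, 0.4, 0.1), region(0.38, 0.4, 0.1), region(0.30, 0.4, 0.1))
```

A smaller e narrows the arbitrary band and makes the result closer to the exact answer, at the price of tighter probability estimates later on.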

  17. Objective I • Difficulty I: The number of frequent subgraph patterns is exponentially large. • Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

  18. Method for Objective I • [Figure: uncertain graphs G1 and G2 from slide 12, and the expected support axis with the Discard, Arbitrary, and Output regions] • Step 1: Build a search tree T of subgraph patterns. • Step 2: Examine subgraph patterns in T in depth-first order. • If S is infrequent, then all its descendants can be pruned.
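The depth-first search with pruning from slide 18 can be sketched on a toy search tree. Everything concrete here is hypothetical: the pattern names, the child map, and the expected support values, which are supplied as an exact lookup for simplicity where the actual method uses an approximate test. The pruning is sound as long as each child is a super-pattern of its parent, because any implicated graph that contains a pattern also contains all of its sub-patterns.

```python
def mine(root, children, esup, minsup):
    """Depth-first search over a pattern tree, pruning below infrequent patterns."""
    frequent = []
    examined = []
    stack = [root]
    while stack:
        s = stack.pop()
        examined.append(s)
        if esup[s] < minsup:
            continue  # infrequent: none of its descendants is pushed
        frequent.append(s)
        # Reverse so that the first child is examined first.
        stack.extend(reversed(children.get(s, [])))
    return frequent, examined


# Hypothetical search tree: each child extends its parent by one edge.
children = {"A": ["A-B", "A-C"], "A-B": ["A-B-B"], "A-C": ["A-C-C"]}
# Hypothetical expected supports, non-increasing along every root-to-leaf path.
esup = {"A": 0.9, "A-B": 0.7, "A-B-B": 0.3, "A-C": 0.2, "A-C-C": 0.1}

frequent, examined = mine("A", children, esup, minsup=0.5)
print(frequent, examined)
```

In this toy run the pattern "A-C-C" is never examined, because its parent "A-C" is already infrequent.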

  19. Objective II • Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S. • Objective II: Make the following judgments without computing esup(S) exactly. • If esup(S) is surely not in the green region, then discard. • If esup(S) is probably in the green region and surely not in the red region, then output. • [Figure: the expected support axis from 0 to 1 with the regions Discard, Arbitrary, and Output, split at (1-e) minsup and minsup]

  20. Method for Objective II • Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u]. • Step 2: Decide whether S can be output or not by testing the following conditions. • [Figure: the expected support axis from 0 to 1 with marks at (1-e) minsup and minsup, illustrating the three outcomes Output, Discard, and Shrink]

  21. Approximating esup(S) by [l,u] A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G. Algorithm Approximate esup(S) by [l,u] Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup. Step 2:

  22. Approximate Pr(S occurs in Gi) by [li, ui] • [Figure: pattern S and uncertain graph Gi, whose edges x1, x2, x3, x4 have probabilities 0.5, 0.6, 0.7, 0.8] • Step 1: Find all embeddings of S in Gi (4 embeddings). • Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8. • Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4). • Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4. • Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0,1]). • Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
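Steps 2 to 6 of slide 22 can be tried out on the slide's own formula. The sketch below computes Pr(F = TRUE) exactly by enumeration (feasible only for a handful of variables) and estimates it with a Karp-Luby style coverage sampler. The function names, the fixed sample count, the random seed, and the values chosen for e and minsup are assumptions for illustration; the slides do not state how the number of samples is derived from e and d.

```python
import random
from itertools import product


def exact_dnf_probability(var_probs, clauses):
    """Sum the probabilities of all assignments that satisfy at least one clause."""
    names = list(var_probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(all(assignment[v] for v in c) for c in clauses):
            weight = 1.0
            for v in names:
                weight *= var_probs[v] if assignment[v] else 1.0 - var_probs[v]
            total += weight
    return total


def karp_luby_estimate(var_probs, clauses, samples, rng):
    """Coverage estimator: sample a clause, then an assignment that satisfies it."""
    clause_probs = []
    for c in clauses:
        p = 1.0
        for v in c:
            p *= var_probs[v]
        clause_probs.append(p)
    total = sum(clause_probs)
    hits = 0
    for _ in range(samples):
        # Pick clause i with probability proportional to Pr(Ci).
        i = rng.choices(range(len(clauses)), weights=clause_probs)[0]
        # Force the variables of Ci to TRUE, draw the rest independently.
        assignment = {v: (v in clauses[i]) or (rng.random() < p)
                      for v, p in var_probs.items()}
        # Count the sample only if Ci is the first satisfied clause.
        first = next(j for j, c in enumerate(clauses)
                     if all(assignment[v] for v in c))
        hits += first == i
    return total * hits / samples


var_probs = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}]

exact = exact_dnf_probability(var_probs, clauses)
p = karp_luby_estimate(var_probs, clauses, samples=20000, rng=random.Random(0))

e, minsup = 0.1, 0.4             # hypothetical parameter values
half_width = e * minsup / 2      # Step 6: absolute error bound
interval = (p - half_width, p + half_width)
print(exact, p, interval)
```

For this particular formula the exact value is easy to confirm by hand, since F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4), giving (1 - 0.5 * 0.3) * (1 - 0.4 * 0.2) = 0.782.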

  23. Outline • Background • Problem Definition • Algorithm • Experimental Results • Conclusions

  24. Experimental Results • Data • The STRING Database (http://string-db.org)

  25. Time Efficiency

  26. Approximation Quality

  27. Scalability

  28. Conclusions • A new model of uncertain graph data has been proposed. • The frequent subgraph pattern mining problem on uncertain graph data has been formalized. • The computational complexity of the problem has been formally proved to be NP-hard. • An approximate mining algorithm has been proposed. • The proposed algorithm has high efficiency, high approximation quality, and high scalability.

  29. Thank you
