
Mining Frequent Subgraphs

Mining Frequent Subgraphs. COMP 790-90 Seminar Spring 2007. 1L06. Overview: Introduction; finding recurring subgraphs from graph databases; gSpan; FFSM.

barto

Presentation Transcript


  1. Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2007

  2. 1L06 Overview • Introduction • Finding recurring subgraphs from graph databases. • gSpan • FFSM

  3. Labeled Graph • We define a labeled graph G as a five-element tuple G = {V, E, Σ_V, Σ_E, λ}, where • V is the set of vertices of G, • E ⊆ V × V is a set of undirected edges of G, • Σ_V (Σ_E) is the set of vertex (edge) labels, • λ is the labeling function, λ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels. (Figure: example labeled graphs (S), (P), and (Q), with vertices s1 to s3, p1 to p5, and q1 to q3.)

  4. Frequent Subgraph Mining • Input: a set GD of labeled undirected graphs and a support threshold σ (in the example, σ = 2/3). • Output: all frequent subgraphs (w.r.t. σ) from GD. (Figure: the example graphs (S), (P), and (Q) together with their frequent subgraphs.)

  5. Finding Frequent Subgraphs • Given a graph database GD = {G0,G1,…,Gn}, find all subgraphs that appear in at least a fraction σ of the graphs, where σ is the support threshold. • Isomorphic subgraphs are considered the same subgraph. • Apriori approaches: • Generation of subgraph candidates is complicated and expensive. • Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
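The support definition above can be illustrated with a brute-force sketch in Python. Everything in it is an assumption made for illustration, not something taken from the slides: a graph is stored as a pair (vertex-label dict, edge-label dict keyed by a frozenset of two vertices), the helper names contains and support are hypothetical, and the three-graph database is made up. The containment test simply tries every injective vertex mapping, which is exponential and mainly shows why subgraph isomorphism dominates the cost of the Apriori approaches.

```python
from fractions import Fraction
from itertools import permutations


def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (non-induced) labeled subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[m[u]] for u in p_nodes):
            continue  # a vertex label does not match
        # Every pattern edge must exist in the graph with the same label.
        if all(g_edges.get(frozenset(m[u] for u in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False


def support(pattern, database):
    """Fraction of graphs in `database` that contain `pattern`."""
    hits = sum(contains(g, pattern) for g in database)
    return Fraction(hits, len(database))


# Made-up database of three small labeled graphs.
GD = [
    ({0: "a", 1: "b", 2: "b"},
     {frozenset((0, 1)): "y", frozenset((0, 2)): "y", frozenset((1, 2)): "x"}),
    ({0: "a", 1: "b"}, {frozenset((0, 1)): "y"}),
    ({0: "b", 1: "b"}, {frozenset((0, 1)): "x"}),
]
# Pattern: a single edge a -y- b.
pattern = ({0: "a", 1: "b"}, {frozenset((0, 1)): "y"})

sigma = Fraction(2, 3)
print(support(pattern, GD), support(pattern, GD) >= sigma)
```

With σ = 2/3 the one-edge pattern is frequent in this toy database because two of the three graphs contain it.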

  6. gSpan • DFS without candidate generation • Relabels graph representation to support DFS. • Discovers all frequent subgraphs without candidate generation or pruning. • DFS Representation • Map each graph to a DFS code (sequence). • Lexicographically order the codes. • Construct a search tree based on the lexicographic order.

  7. Depth-First Search Tree (Figure: four panels, (a) to (d).)

  8. DFS Codes • Given e1 = (i1,j1), e2 = (i2,j2): e1 < e2 if: • i1 = i2 and j1 < j2, or • i1 < j1 and j1 = i2. • code(G,T) is the sequence of edges of G ordered so that ei < ei+1. (Figure: four panels, (a) to (d).)

  9. DFS Lexicographic Order • ∂ = code(G∂,T∂) = (a0,a1,…,am) • ß = code(Gß,Tß) = (b0,b1,…,bn) • ∂ ≤ ß iff condition (1) or (2) holds (the two conditions were images on the slide and are not preserved in this transcript). • Minimum DFS code • The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. • Graphs A and B are isomorphic if and only if min(A) = min(B).

  10. DFS Codes: Parents and Children • If ∂ = (a0,a1,…,am) and ß = (a0,a1,…,am,b): • ß is the child of ∂. • ∂ is the parent of ß. • A valid DFS code requires that b grows from a vertex on the rightmost path.
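The rightmost-path rule can be sketched in Python. The representation is an assumption: a DFS code is a list of tuples (i, j, label_i, edge_label, label_j) with vertices numbered in discovery order from 0, and an edge is a forward edge when i < j. The function names are hypothetical, and the extension test follows the usual gSpan convention (forward edges grow from any vertex on the rightmost path to a new vertex, backward edges grow from the rightmost vertex to an earlier vertex on that path), which is more specific than the slide's wording. The check for an already existing backward edge is left out.

```python
def rightmost_path(code):
    """Vertex ids from the root (0) down to the rightmost vertex."""
    # Each non-root vertex is discovered by exactly one forward edge (i < j).
    parent = {j: i for (i, j, *_labels) in code if i < j}
    if not parent:
        return [0]
    v = max(parent)          # the rightmost vertex is the last one discovered
    path = [v]
    while v in parent:       # walk forward edges back to the root
        v = parent[v]
        path.append(v)
    return path[::-1]


def is_rightmost_extension(code, new_edge):
    """True if `new_edge` grows from the rightmost path of `code`."""
    i, j = new_edge[:2]
    path = rightmost_path(code)
    if i < j:
        # Forward edge: starts on the rightmost path, ends at a new vertex.
        return i in path and j == path[-1] + 1
    # Backward edge: starts at the rightmost vertex, ends earlier on the path.
    return i == path[-1] and j in path[:-1]


# Hypothetical DFS code with made-up labels.
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),   # backward edge
    (2, 3, "X", "c", "Z"),
    (1, 4, "Y", "d", "Z"),
]
print(rightmost_path(code))
```

In this example vertex 4 is the rightmost vertex, so a child code may add an edge starting at vertex 0, 1, or 4, but not at vertex 2 or 3.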

  11. DFS Code Trees • Organize DFS code nodes as parent-child. • Pre-order traversal follows DFS lexicographic order. • If s and s’ are the same graph with different DFS codes, s’ is not the minimum and can be pruned.

  12. gSpan • D is the set of all graphs. • S is the result set.
      Algorithm 1: GraphSet_Projection(D,S)
        sort labels in D by frequency
        remove infrequent vertices and edges
        relabel remaining vertices and edges
        S’ = all frequent 1-edge graphs in D
        sort S’ in DFS lexicographic order
        S = S’
        foreach edge e in S’ do
          s = graph defined by e
          s.D = subgraphs in D containing e
          Subgraph_Mining(D,S,s)
          D = D - e
          if |D| < minSup
            break
      Subprocedure 1: Subgraph_Mining(D,S,s)
        if s != min(s)
          return
        S = S U {s}
        s’ = +1-edge children of s in s.D
        foreach child c of s’ do
          if support(c) ≥ minSup
            Subgraph_Mining(Ds,S,c)

  13. Runtime: Synthetic (Figure: runtime in seconds.)

  14. Runtime: Chemical (Figure: runtime in seconds, on an axis from 1 to 1000, versus support threshold from 0 to 30%, for Apriori (FSG) and gSpan.)

  15. gSpan Advantages • Lower memory requirements. • Faster than naïve FSG by an order of magnitude. • No candidate generation. • Lexicographic ordering minimizes search tree. • False positives pruning. • Any disadvantage?

  16. FFSM: Fast Frequent Subgraph Mining -- An Overview: • How to solve graph isomorphism problem? • A Novel Graph Canonical Form: CAM • How to tackle subgraph isomorphism problem (NP-complete)? • Incrementally maintained embeddings • How to enumerate subgraphs: • An Efficient Data Structure: CAM Tree • Two Operations: CAM-join, CAM-extension.

  17. Adjacency Matrix • Every diagonal entry of adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex. • Every off-diagonal entry in the lower triangle part of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or with zero if there is no edge. (For an undirected graph, the upper triangle is always a mirror of the lower triangle.) (Figure: graph (P) and three of its adjacency matrices, M1, M2, and M3.)

  18. Code • The code of an n × n adjacency matrix M is defined as the sequence of its lower-triangular entries (including the diagonal entries) in the order: M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n • Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a • The Canonical Adjacency Matrix (CAM) of a graph is the adjacency matrix that produces the maximal code, using lexicographic order. (Figure: adjacency matrices M1, M2, and M3 of graph (P).)
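The code and the canonical form can be checked with a short brute-force sketch in Python. The graph below is reconstructed from Code(M1): vertices labeled a, b, b, c, d with edges a-b (y), a-b (y), b-b (x), b-c (y), and b-d (y). The label order a > b > c > d > x > y > 0 is an assumption chosen to agree with the comparison Code(M1) > Code(M2) > Code(M3) shown on the slide; the vertex ids 1 to 5 and the function names are hypothetical. Trying all n! vertex orders is only feasible for tiny graphs.

```python
from itertools import permutations

# Assumed label order, largest first: a > b > c > d > x > y > 0.
LABEL_ORDER = "abcdxy0"
RANK = {ch: len(LABEL_ORDER) - pos for pos, ch in enumerate(LABEL_ORDER)}


def matrix_code(vertex_labels, edges, order):
    """Lower-triangular entries, row by row, for the given vertex order."""
    out = []
    for row, u in enumerate(order):
        for col in range(row):
            v = order[col]
            out.append(edges.get(frozenset((u, v)), "0"))  # "0" means no edge
        out.append(vertex_labels[u])                       # diagonal entry
    return "".join(out)


def canonical_code(vertex_labels, edges):
    """Maximal code over all vertex orders (brute force)."""
    codes = (matrix_code(vertex_labels, edges, p)
             for p in permutations(vertex_labels))
    return max(codes, key=lambda c: [RANK[ch] for ch in c])


# Graph reconstructed from Code(M1); vertex ids are illustrative.
labels = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
edges = {
    frozenset((1, 2)): "y", frozenset((1, 3)): "y", frozenset((2, 3)): "x",
    frozenset((2, 4)): "y", frozenset((3, 5)): "y",
}

print(matrix_code(labels, edges, [1, 2, 3, 4, 5]))  # vertex order of M1
print(matrix_code(labels, edges, [1, 2, 3, 5, 4]))  # vertex order of M2
print(matrix_code(labels, edges, [3, 2, 5, 4, 1]))  # vertex order of M3
print(canonical_code(labels, edges))
```

Under the assumed label order, the maximal code found by the search is the code of M1, so M1 is the canonical adjacency matrix of this graph.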

  19. MP Submatrix • For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero entry from A. • We define a CAM to be connected iff the corresponding graph is connected. • Theorem I: A CAM’s MP submatrix is a CAM. • Theorem II: A connected CAM’s MP submatrix is connected. (Figure: example matrices M1 to M6.)
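A minimal Python sketch of the MP submatrix operation follows, under two interpretive assumptions that the slide does not spell out: "the last non-zero entry" is read as the last non-zero off-diagonal (edge) entry in code order, and when that was the only edge entry in the last row, the last row and column are dropped so that no isolated vertex remains. Matrices are stored as lists of lower-triangular rows, and the function name is hypothetical.

```python
def mp_submatrix(rows):
    """Remove the last edge entry of a lower-triangular adjacency matrix.

    `rows[r]` holds the entries M[r][0..r]; the final item of each row is the
    vertex label on the diagonal, and "0" marks a missing edge.
    """
    rows = [list(r) for r in rows]            # work on a copy
    if len(rows) < 2:
        raise ValueError("a single-vertex matrix has no MP submatrix")
    last = rows[-1]
    edge_cols = [c for c in range(len(last) - 1) if last[c] != "0"]
    if not edge_cols:
        raise ValueError("last vertex has no edge entry")
    last[edge_cols[-1]] = "0"                 # remove the last edge entry
    if len(edge_cols) == 1:
        rows.pop()                            # vertex lost its only edge
    return rows


# M1 from the Code slide, written row by row.
M1 = [["a"],
      ["y", "b"],
      ["y", "x", "b"],
      ["0", "y", "0", "c"],
      ["0", "0", "y", "0", "d"]]

m = M1
while len(m) > 1:
    m = mp_submatrix(m)
    print("".join(entry for row in m for entry in row))
```

Applying the operation repeatedly walks from M1 down to the single-vertex matrix, which mirrors a path from a node to the root in a CAM tree.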

  20. CAM Tree: Subgraphs (Figure: CAM tree of the subgraphs of graph (P).)

  21. CAM Tree: Frequent Subgraphs (σ = 2/3) (Figure: CAM tree of the frequent subgraphs of the example graphs (S), (P), and (Q).)

  22. How to Enumerate Nodes in a CAM Tree? • Two operations to explore the CAM tree: • CAM-Join • CAM-Extension • Augmenting the CAM tree with suboptimal CAMs • Objectives: • no false dismissals • no redundancy • Plus: we want to do this efficiently!

  23. Suboptimal Tree • We define a suboptimal CAM as a matrix whose MP submatrix is a CAM. (Figure: suboptimal CAM tree of graph (P), with tree edges marked j or e.)

  24. Summary • Theorem: For a graph G, let C_{K-1} (C_K) be the set of the suboptimal CAMs of all size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_{K-1} or by extending a member of C_{K-1}.

  25. Experimental Study • Predictive Toxicology Evaluation Competition (PTE) • Contains 337 compounds. • Each graph contains 27 nodes and 27 edges on average. • NIH DTP Anti-Viral Screen Test (DTP CA/CM) • Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI). • We formed a dataset containing the CA (423) and CM (1083) compounds. • Each graph contains 25 nodes and 27 edges on average.

  26. Performance (PTE) (Figure: two plots against support threshold (%).)

  27. Performance (DTP CACM) (Figure: two plots against support threshold (%).)
