Mining Frequent Subgraphs

Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2007

Overview • Introduction • Finding recurring subgraphs from graph databases • gSpan • FFSM

Labeled Graph [Figure: example labeled graphs (S), (P), (Q)] • We define a labeled graph G as a five-element tuple G = {V, E, Σ_V, Σ_E, δ} where • V is the set of vertices of G, • E ⊆ V × V is the set of undirected edges of G, • Σ_V (Σ_E) is the set of vertex (edge) labels, • δ is the labeling function, δ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.
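The five-element tuple can be held in a small data structure. The following is a minimal sketch, not something from the slides: the class name `LabeledGraph`, the integer vertex ids, and the choice to derive V, E, Σ_V and Σ_E from two dictionaries that play the role of δ are all assumptions made for illustration. The example graph is (P), read off the adjacency matrix code aybyxb0y0c00y0d quoted on the Code slide.

```python
from dataclasses import dataclass


@dataclass
class LabeledGraph:
    """Labeled undirected graph; the two dicts together act as delta."""
    vertex_label: dict  # vertex -> label (delta restricted to V)
    edge_label: dict    # frozenset({u, v}) -> label (delta restricted to E)

    def __post_init__(self):
        # Every edge must join two distinct, known vertices.
        for e in self.edge_label:
            if len(e) != 2 or not all(v in self.vertex_label for v in e):
                raise ValueError(f"bad edge: {sorted(e)}")

    @property
    def V(self):
        return set(self.vertex_label)

    @property
    def E(self):
        return set(self.edge_label)

    @property
    def sigma_V(self):
        return set(self.vertex_label.values())

    @property
    def sigma_E(self):
        return set(self.edge_label.values())


# Graph (P): five vertices, five edges (vertex ids 0..4 are illustrative).
P = LabeledGraph(
    vertex_label={0: "a", 1: "b", 2: "b", 3: "c", 4: "d"},
    edge_label={
        frozenset({0, 1}): "y",
        frozenset({0, 2}): "y",
        frozenset({1, 2}): "x",
        frozenset({1, 3}): "y",
        frozenset({2, 4}): "y",
    },
)
```

Storing edges as `frozenset` pairs keeps them undirected, so {u, v} and {v, u} are the same key.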

Frequent Subgraph Mining [Figure: graphs (S), (P), (Q) with example subgraphs] • Input: a set GD of labeled undirected graphs and a support threshold σ (here σ = 2/3) • Output: all frequent subgraphs (w.r.t. σ) from GD.

Finding Frequent Subgraphs • Given a graph database GD = {G0,G1,…,Gn}, find all subgraphs appearing in at least a fraction σ of the graphs. • Isomorphic subgraphs are considered the same subgraph. • Apriori approaches • Generation of subgraph candidates is complicated and expensive. • Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
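For the smallest patterns, single labeled edges, support can be counted without any isomorphism test. The sketch below illustrates only that base case. The three-graph database `DB` is made up for illustration (it is not the (S), (P), (Q) example), the graph encoding as a pair of dictionaries is an assumption, and the threshold is read as a fraction of the database, as in the 2/3 example.

```python
from collections import defaultdict
from fractions import Fraction


def one_edge_patterns(graph):
    """1-edge patterns (label, edge label, label) present in one graph.

    Endpoint labels are sorted so that edge orientation does not matter.
    """
    vertex_label, edge_label = graph
    patterns = set()
    for (u, v), el in edge_label.items():
        a, b = sorted((vertex_label[u], vertex_label[v]))
        patterns.add((a, el, b))
    return patterns


def frequent_one_edge(db, sigma):
    """Patterns contained in at least a fraction sigma of the graphs."""
    support = defaultdict(int)
    for graph in db:
        # A set per graph: each graph contributes at most 1 to a pattern.
        for pattern in one_edge_patterns(graph):
            support[pattern] += 1
    needed = sigma * len(db)
    return {p: s for p, s in support.items() if s >= needed}


# Hypothetical toy database: (vertex -> label, (u, v) -> edge label).
DB = [
    ({0: "a", 1: "b", 2: "b"}, {(0, 1): "y", (1, 2): "x"}),
    ({0: "a", 1: "b", 2: "c"}, {(0, 1): "y", (1, 2): "y"}),
    ({0: "b", 1: "b"}, {(0, 1): "x"}),
]

FREQUENT = frequent_one_edge(DB, Fraction(2, 3))
```

Using `Fraction` avoids floating-point rounding when the threshold is multiplied by the database size. Larger patterns need subgraph isomorphism checks, which is where the cost noted above comes from.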

gSpan • DFS without candidate generation • Relabels graph representation to support DFS. • Discovers all frequent subgraphs without candidate generation or pruning. • DFS Representation • Map each graph to a DFS code (sequence). • Lexicographically order the codes. • Construct a search tree based on the lexicographic order.

Depth-First Search Tree [Figure: panels (a) to (d)]

DFS Codes • Given e1 = (i1,j1), e2 = (i2,j2): e1 < e2 if: • i1 = i2 && j1 < j2, or • i1 < j1 && j1 = i2 • code(G,T) = the edge sequence (e0, e1, …) with ei < ei+1 [Figure: panels (a) to (d)]

DFS Lexicographic Order • α = code(Gα,Tα) = (a0,a1,…,am) • β = code(Gβ,Tβ) = (b0,b1,…,bn) • α ≤ β iff (1) or (2): • (1) there is a t, 0 ≤ t ≤ min(m,n), such that ak = bk for all k < t and at < bt • (2) ak = bk for all 0 ≤ k ≤ m, and n ≥ m • Minimum DFS code • The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. • Graphs A and B are isomorphic if and only if min(A) = min(B).

DFS Codes: Parents and Children • If α = (a0,a1,…,am) and β = (a0,a1,…,am,b): • β is the child of α. • α is the parent of β. • A valid DFS code requires that b grows from a vertex on the rightmost path.
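The rightmost-path condition can be checked from the vertex indices of a DFS code alone. The sketch below is an illustration under stated assumptions: a code is a list of (i, j) pairs of DFS discovery indices with labels omitted, forward edges have i < j, backward edges have i > j, and the function names are hypothetical. It checks only where the new edge b attaches; it does not enforce the remaining edge-ordering rules of a full DFS code.

```python
def rightmost_path(code):
    """Vertices from the root (0) down to the rightmost vertex."""
    parent = {}
    for i, j in code:
        if i < j:  # forward edge: vertex j is discovered from vertex i
            parent[j] = i
    v = max(parent, default=0)  # rightmost vertex = last one discovered
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]


def can_extend(code, edge):
    """True if appending `edge` grows from the rightmost path."""
    path = rightmost_path(code)
    rightmost = path[-1]
    i, j = edge
    if i < j:
        # Forward edge: from any vertex on the path to a brand-new vertex.
        return i in path and j == rightmost + 1
    # Backward edge: from the rightmost vertex to an earlier path vertex,
    # and not a duplicate of an edge already in the code.
    already_present = (i, j) in code or (j, i) in code
    return i == rightmost and j in path[:-1] and not already_present


# Illustrative code: path 0-1-2-3 with back edges, then a branch 1-4.
EXAMPLE = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
```

For `EXAMPLE` the rightmost path is 0, 1, 4, so a new vertex 5 may hang off 0, 1 or 4, and a backward edge may only start at vertex 4.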

DFS Code Trees • Organize DFS code nodes as parent-child. • Pre-order traversal follows DFS lexicographic order. • If s and s’ encode the same graph with different DFS codes and s’ is not the minimum code, s’ can be pruned.

gSpan • D is the set of all graphs. • S is the result set.

Algorithm 1: GraphSet_Projection(D,S)
  sort labels in D by frequency
  remove infrequent vertices and edges
  relabel remaining vertices and edges
  S’ = all frequent 1-edge graphs in D
  sort S’ in DFS lexicographic order
  S = S’
  foreach edge e in S’ do
    s = graph defined by e
    s.D = subgraphs in D containing e
    Subgraph_Mining(D,S,s)
    D = D - e
    if |D| < minSup
      break

Subprocedure 1: Subgraph_Mining(D,S,s)
  if s != min(s)
    return
  S = S ∪ {s}
  s’ = +1-edge children of s in s.D
  foreach child c of s’ do
    if support(c) ≥ minSup
      Subgraph_Mining(Ds,S,c)
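The first three steps of Algorithm 1 (sort labels by frequency, remove infrequent vertices and edges, relabel) can be sketched on their own. The following is a simplified illustration, not the published implementation. It assumes that a label's frequency is the number of graphs containing it, that an edge is dropped when its label is infrequent or an endpoint was removed, and that new labels are integer ranks with 0 for the most frequent label. The graph encoding and the function names are hypothetical.

```python
from collections import Counter


def label_ranks(label_sets, min_sup):
    """Map each label with support >= min_sup to a rank (0 = most frequent)."""
    support = Counter(label for labels in label_sets for label in labels)
    frequent = [label for label, s in support.items() if s >= min_sup]
    # Descending support; ties broken by the label itself for determinism.
    frequent.sort(key=lambda label: (-support[label], label))
    return {label: rank for rank, label in enumerate(frequent)}


def preprocess(db, min_sup):
    """Drop infrequent vertices and edges, then relabel by frequency rank.

    Each graph is a pair (vertex -> label, (u, v) -> edge label).
    """
    vrank = label_ranks([set(vl.values()) for vl, _ in db], min_sup)
    erank = label_ranks([set(el.values()) for _, el in db], min_sup)
    result = []
    for vl, el in db:
        new_vl = {v: vrank[l] for v, l in vl.items() if l in vrank}
        new_el = {
            (u, v): erank[l]
            for (u, v), l in el.items()
            if l in erank and u in new_vl and v in new_vl
        }
        result.append((new_vl, new_el))
    return result


# Hypothetical toy database, minSup = 2 graphs.
DB = [
    ({0: "a", 1: "b", 2: "b"}, {(0, 1): "y", (1, 2): "x"}),
    ({0: "a", 1: "b", 2: "c"}, {(0, 1): "y", (1, 2): "y"}),
    ({0: "b", 1: "b"}, {(0, 1): "x"}),
]

CLEANED = preprocess(DB, 2)
```

In this toy database the vertex label c occurs in only one graph, so its vertex and the edge attached to it disappear before any pattern growth starts.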

Runtime: Synthetic [Figure: runtime (sec) plot]

Runtime: Chemical [Figure: runtime (sec), axis ticks 1 to 1000, versus support threshold (%), 0 to 30, for Apriori (FSG) and gSpan]

gSpan Advantages • Lower memory requirements. • Faster than naïve FSG by an order of magnitude. • No candidate generation. • Lexicographic ordering minimizes the search tree. • Prunes false positives. • Any disadvantages?

FFSM: Fast Frequent Subgraph Mining -- An Overview: • How to solve graph isomorphism problem? • A Novel Graph Canonical Form: CAM • How to tackle subgraph isomorphism problem (NP-complete)? • Incrementally maintained embeddings • How to enumerate subgraphs: • An Efficient Data Structure: CAM Tree • Two Operations: CAM-join, CAM-extension.

Adjacency Matrix [Figure: graph (P) with vertices p1 to p5 and three of its adjacency matrices M1, M2, M3] • Every diagonal entry of an adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of that vertex. • Every off-diagonal entry in the lower triangular part of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge. (For an undirected graph, the upper triangle is always a mirror of the lower triangle.)

Code • The code of an n × n adjacency matrix M is defined as the sequence of its lower triangular entries (including the diagonal entries) in the order: M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n • Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a [Figure: matrices M1, M2, M3] • The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code, using lexicographic order.
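The three codes above can be reproduced by reading graph (P) in different vertex orders, and the maximal code can be found by brute force over all orders. The sketch below is for illustration only: trying every permutation is not how FFSM works, and the label order a > b > c > d > x > y > 0 is an assumption chosen because it is consistent with the comparison Code(M1) > Code(M2) > Code(M3) shown on the slide. Vertex ids and function names are hypothetical.

```python
from itertools import permutations

# Assumed label order, smallest to largest: 0 < y < x < d < c < b < a.
RANK = {ch: r for r, ch in enumerate("0yxdcba")}

# Graph (P), read off Code(M1): vertex labels and labeled undirected edges.
VERTEX_LABELS = ["a", "b", "b", "c", "d"]
EDGES = {(0, 1): "y", (0, 2): "y", (1, 2): "x", (1, 3): "y", (2, 4): "y"}


def edge_label(u, v):
    """Label of edge {u, v}, or "0" if the vertices are not adjacent."""
    return EDGES.get((min(u, v), max(u, v)), "0")


def code(order):
    """Code of the adjacency matrix whose i-th row is vertex order[i]."""
    out = []
    for i, u in enumerate(order):
        # Row i: entries M[i][0..i-1], then the diagonal entry M[i][i].
        out.extend(edge_label(u, order[j]) for j in range(i))
        out.append(VERTEX_LABELS[u])
    return "".join(out)


def code_key(c):
    """Sort key that compares codes under the assumed label order."""
    return [RANK[ch] for ch in c]


def canonical_code():
    """Maximal code over all vertex orders (brute force, 5! = 120 orders)."""
    orders = permutations(range(len(VERTEX_LABELS)))
    return max((code(p) for p in orders), key=code_key)


M1 = code((0, 1, 2, 3, 4))
M2 = code((0, 1, 2, 4, 3))
M3 = code((2, 1, 4, 3, 0))
```

Under the assumed order, the brute-force maximum coincides with Code(M1), so M1 is the canonical adjacency matrix of (P) in this sketch.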

MP Submatrix [Figure: matrices M1 to M6] • For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last nonzero entry from A. • We say a CAM is connected iff the corresponding graph is connected. • Theorem I: A CAM’s MP submatrix is a CAM. • Theorem II: A connected CAM’s MP submatrix is connected.

CAM Tree: Subgraphs [Figure: CAM tree of the subgraphs of graph (P)]

CAM Tree: Frequent Subgraphs (σ = 2/3) [Figure: graphs (S), (P), (Q) and the CAM tree of their frequent subgraphs]

How to Enumerate Nodes in a CAM Tree? • Two operations to explore the CAM tree: • CAM-Join • CAM-Extension • Augmenting the CAM tree with suboptimal CAMs • Objectives: • no false dismissals • no redundancy • Plus: we want to do this efficiently!

Suboptimal Tree [Figure: suboptimal CAM tree for graph (P), with tree edges marked j (join) or e (extension)] • We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.

Summary • Theorem: For a graph G, let C_{K-1} (C_K) be the set of the suboptimal CAMs of all size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_{K-1} or by extending a member of C_{K-1}.

Experimental Study • Predictive Toxicology Evaluation Competition (PTE) • Contains 337 compounds. • Each graph contains 27 nodes and 27 edges on average. • NIH DTP Anti-Viral Screen Test (DTP CA/CM) • Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI). • We formed a dataset containing the CA (423) and CM (1083) compounds. • Each graph contains 25 nodes and 27 edges on average.

Performance (PTE) [Figure: two plots against support threshold (%)]

Performance (DTP CACM) [Figure: two plots against support threshold (%)]
