Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism

Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism • Jun Huan, Wei Wang, Jan Prins • ICDM 2003

Outline • Introduction • Canonical Adjacency Matrix • Join, Extension and Suboptimal CAMs • SCAM Tree • Conclusion

Introduction • Mining patterns from graph databases is challenging because graph-related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees. • The frequent subgraph mining problem is to find all frequent subgraphs in a graph database. • Two challenging problems: • Subgraph isomorphism testing • An efficient scheme for enumerating all frequent subgraphs

Introduction • In this paper, we developed FFSM (Fast Frequent Subgraph Mining), targeting efficient subgraph testing and a better candidate subgraph enumeration scheme. • Key features: (1) a novel graph canonical form and two efficient candidate-proposing operations, FFSM-Join and FFSM-Extension; (2) a graph framework (the suboptimal CAM tree) that guarantees all frequent subgraphs are enumerated unambiguously; (3) avoiding subgraph isomorphism testing by maintaining an embedding set for each frequent subgraph.
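Feature (3) can be illustrated with a minimal sketch. It is not the FFSM implementation: the graph encoding, the names `extend_embeddings` and `support`, and the rule that a new edge attaches to the last mapped node are all assumptions made for illustration. The point is only that once each pattern carries its embeddings, support can be counted and larger patterns can be matched by growing those embeddings, without a fresh subgraph isomorphism test.

```python
# Hypothetical toy database: graph id -> node labels and labeled adjacency.
graphs = {
    0: {"labels": ["a", "b", "b"],
        "adj": {0: {1: "x"}, 1: {0: "x", 2: "y"}, 2: {1: "y"}}},
    1: {"labels": ["a", "b"],
        "adj": {0: {1: "x"}, 1: {0: "x"}}},
}


def support(embeddings):
    """Number of distinct database graphs that contain the pattern."""
    return len({gid for gid, _ in embeddings})


def extend_embeddings(embeddings, graphs, edge_label, node_label):
    """Grow each embedding by one new node attached to its last mapped node.

    An embedding is (graph_id, tuple_of_graph_nodes). Attaching only to the
    last node is a simplifying assumption of this sketch.
    """
    out = []
    for gid, nodes in embeddings:
        g = graphs[gid]
        last = nodes[-1]
        for nbr, elabel in g["adj"][last].items():
            if (nbr not in nodes
                    and elabel == edge_label
                    and g["labels"][nbr] == node_label):
                out.append((gid, nodes + (nbr,)))
    return out


# Embeddings of the single-node pattern "a" in both graphs.
emb_a = [(0, (0,)), (1, (0,))]
emb_ab = extend_embeddings(emb_a, graphs, "x", "b")    # pattern a -x- b
emb_abb = extend_embeddings(emb_ab, graphs, "y", "b")  # pattern a -x- b -y- b
```

Here `support(emb_ab)` counts both graphs, while `support(emb_abb)` counts only graph 0, so a minimum support of 2 would keep the first pattern and prune the second.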

Canonical Adjacency Matrix (CAM) • In FFSM, every graph is represented by an adjacency matrix M: (1) each diagonal entry of M holds the label of the corresponding node; (2) each off-diagonal entry holds the label of the corresponding edge, or zero if there is no edge.
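As a small sketch of this representation, the function below fills a matrix from a list of node labels and a list of labeled undirected edges. The function name, the input format, and the example graph are assumptions chosen for illustration; single-character string labels and "0" for a missing edge are used only to keep the output readable.

```python
def adjacency_matrix(node_labels, edges):
    """Build the labeled adjacency matrix described above.

    node_labels: list of node labels, one per node.
    edges: list of (u, v, edge_label) for an undirected graph.
    """
    n = len(node_labels)
    m = [["0"] * n for _ in range(n)]  # "0" marks "no edge"
    for i, label in enumerate(node_labels):
        m[i][i] = label                # diagonal: node labels
    for u, v, label in edges:
        m[u][v] = label                # off-diagonal: edge labels
        m[v][u] = label                # undirected, so symmetric
    return m


# Hypothetical graph: one "a" node and three "b" nodes.
M = adjacency_matrix(
    ["a", "b", "b", "b"],
    [(0, 1, "x"), (0, 2, "x"), (1, 2, "y"), (1, 3, "y"), (2, 3, "y")],
)
for row in M:
    print(" ".join(row))
```

Because the matrix is symmetric, the lower triangle (diagonal included) already carries all the information about the graph.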

CAM • Given an n × n adjacency matrix M of a graph G with n nodes, define the code of M, denoted code(M) • code(M): the sequence of lower-triangular entries of M (including the diagonal entries), in the order m_{1,1} m_{2,1} m_{2,2} … m_{n,1} m_{n,2} … m_{n,n-1} m_{n,n} • Standard lexicographic order on sequences defines a total order on any two codes • The canonical form of G is the adjacency matrix whose code is maximal among all possible codes of G

Figure (top): code(M1) = “axbxyb0yyb” ≥ code(M2) = “axb0ybxyyb”. Figure (bottom): [image not preserved]
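The code and canonical-form definitions can be checked with a brute-force sketch. Several things here are assumptions rather than statements from the slides: the matrix M1 is read back from its ten-symbol code, the fourth symbol of code(M2) is read as 0 (no edge), the symbol order a > b > x > y > 0 is chosen so that the comparison in the figure holds, and the helper names are hypothetical. Trying every node permutation is exponential and only meant to make the definition concrete; it is not how FFSM finds the CAM.

```python
from itertools import permutations

# Assumed symbol order (not given on the slides): a > b > x > y > 0.
RANK = {"0": 0, "y": 1, "x": 2, "b": 3, "a": 4}


def code(m):
    """Lower-triangular entries, row by row, diagonal included."""
    return "".join(m[i][j] for i in range(len(m)) for j in range(i + 1))


def code_key(c):
    """Map a code to a list of ranks so that comparison follows RANK."""
    return [RANK[s] for s in c]


def permute(m, order):
    """Adjacency matrix of the same graph with nodes listed in `order`."""
    return [[m[p][q] for q in order] for p in order]


def canonical_code(m):
    """Maximal code over all node orderings (brute force)."""
    n = len(m)
    return max((code(permute(m, p)) for p in permutations(range(n))),
               key=code_key)


M1 = [
    ["a", "x", "x", "0"],
    ["x", "b", "y", "y"],
    ["x", "y", "b", "y"],
    ["0", "y", "y", "b"],
]
M2 = permute(M1, (0, 1, 3, 2))  # same graph, last two nodes swapped

print(code(M1), code(M2), canonical_code(M2))
```

Under the assumed order, both matrices describe the same graph, code(M1) is at least code(M2), and the brute-force search over all orderings of M2 arrives back at code(M1).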

Join, Extension and Suboptimal CAMs • Current methods for enumerating all subgraphs fall into two categories: join and extension • Join: a single join may produce multiple candidates, and a candidate may be proposed redundantly by many join operations • Extension: the nodes to which a newly introduced edge may attach must be restricted • To achieve efficient subgraph enumeration: (1) Can we design a join operation such that every distinct CAM is generated only once? (2) Can we improve the join operation so that only a few (at most two) CAMs are generated by a single join? (3) Can we design an extension operation such that every new edge is attached to only one node of a graph represented by its CAM?

Join, Extension and Suboptimal CAMs • To tackle these challenges, we augment the CAM tree with a set of suboptimal CAMs and introduce two new operations: FFSM-Join and FFSM-Extension

Join, Extension and Suboptimal CAMs • The bottom of Fig. 2 shows a case in which a graph might be redundantly proposed by FSG C(6,2) = 15 times. As shown in the figure, FFSM-Join completely removes the redundancy after “sorting” the subgraphs by their CAMs. • Suboptimal CAM (SCAM): given a graph G, a SCAM of G is an adjacency matrix M of G whose maximal proper submatrix is the CAM of the subgraph that submatrix represents. • Every CAM is a SCAM; a proper SCAM is a SCAM that is not itself a CAM.
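The figure of 15 quoted above is consistent with the binomial coefficient C(6,2). The sketch below reproduces that arithmetic under a stated assumption: a graph with six edges can be proposed by joining any pair of its distinct five-edge subgraphs. Edges are modeled as bare ids, and connectivity and isomorphism between subgraphs are ignored, so this counts pairs only and is not an implementation of the FSG join.

```python
from itertools import combinations
from math import comb

k = 6                              # assumed number of edges in the graph
all_edges = frozenset(range(k))    # hypothetical edge ids 0..5

# Each (k-1)-edge subgraph is the full edge set with one edge dropped.
subgraphs = [all_edges - {e} for e in all_edges]

# A pair of distinct (k-1)-edge subgraphs restores the full graph when
# the union of their edge sets is the whole edge set.
proposing_pairs = [
    (s, t) for s, t in combinations(subgraphs, 2) if s | t == all_edges
]

print(len(proposing_pairs), comb(k, 2))
```

Every pair of distinct five-edge subgraphs qualifies, which is why the count equals C(6,2).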

SCAM Tree • All SCAMs of a graph G can be organized as a tree

SCAM Tree • The SCAM tree is “complete”: every node can be enumerated by either a join or an extension operation.

Conclusion • The paper presents a new algorithm, FFSM, which introduces two operations and a graph framework to reduce the number of redundant candidates • Experiments demonstrate that FFSM achieves a performance gain over gSpan • gSpan: builds a new lexicographic order among graphs and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts depth-first search to mine frequent subgraphs efficiently.
