MapReduce-based Closed Frequent Itemset Mining with Efficient Redundancy Filtering

Su-Qi Wang∗, Yu-Bin Yang∗, Guang-Peng Chen∗, Yang Gao∗ and Yao Zhang†
∗State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
†JinLing College, Nanjing University, Nanjing, China
ICDMW 2012
11 July 2014, SNU IDB, Hyesung Oh

Introduction
• Closed frequent itemset
  • Proposed in 1999 by Pasquier et al.*
  • Alternative to frequent itemset mining (FIM)
  • Has the same expressive power as FIM while reducing redundancy
• Existing CFI mining algorithms
  • Candidate generate-and-test approach
  • Pattern growth approach
• Limitations of data size
  • Memory use and communication costs
• Some algorithms using PC clusters
  • Workload balancing, …
* N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets for association rules,” Database Theory–ICDT’99, pp. 398–416, 1999.

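Closed frequent itemset
• Frequent itemset: support greater than or equal to minsup
• Closed frequent itemset: closed (no proper superset has the same support) and support greater than or equal to minsup
• Example (figure): minsup = 2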
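The definition above can be checked by brute force on a tiny dataset. The following is a minimal sketch only: the four transactions and the function names are hypothetical and are not the example from the slide, and enumerating every item combination is practical only for very small inputs.

```python
from itertools import combinations

def support(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def closed_frequent_itemsets(transactions, minsup):
    # Enumerate every non-empty item combination (exponential; toy data only).
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= minsup:
                frequent[candidate] = s
    # An itemset is closed if no proper superset has the same support.
    # Checking frequent supersets is enough: a superset with equal
    # support would itself reach minsup.
    closed = {
        x: s for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }
    return frequent, closed

# Hypothetical transactions over the items a, b, c.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
frequent, closed = closed_frequent_itemsets(transactions, minsup=2)
```

With minsup = 2 this data has five frequent itemsets but only three closed ones: {b} and {c} are dropped because {a, b} and {a, c} have the same support.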

Parallelized AFOPT-close algorithm
• 4 steps
  • Step 1: Parallel counting (MR pass)
    • Count the support of each item
  • Step 2: Constructing the global F-list
    • Sort the items in descending order of frequency
    • Exclude items whose support is lower than minsup
  • Step 3: Parallel mining of closed frequent itemsets (MR pass)
    • Mine locally closed frequent itemsets
  • Step 4: Parallel filtering of redundant itemsets (MR pass)
    • Filter out itemsets that are locally closed but not globally closed
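Steps 1 and 2 can be sketched in plain Python that imitates one map/reduce pass in a single process. This is an illustration under assumptions, not the authors' Hadoop code: the function names, the two input splits, and the alphabetical tie-break in the F-list are all hypothetical.

```python
from collections import Counter

def map_count(split):
    # Mapper (Step 1): emit (item, 1) once per transaction containing the item.
    for transaction in split:
        for item in set(transaction):
            yield item, 1

def reduce_count(pairs):
    # Reducer (Step 1): sum the emitted ones to get each item's support.
    counts = Counter()
    for item, one in pairs:
        counts[item] += one
    return counts

def build_f_list(counts, minsup):
    # Step 2: drop infrequent items, then sort by descending support.
    # Breaking ties alphabetically is an assumption of this sketch.
    frequent = [(item, c) for item, c in counts.items() if c >= minsup]
    return sorted(frequent, key=lambda kv: (-kv[1], kv[0]))

# Hypothetical input: two splits, as if stored on two different nodes.
splits = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c", "d"], ["a"]],
]
pairs = [pair for split in splits for pair in map_count(split)]
f_list = build_f_list(reduce_count(pairs), minsup=2)
```

Here item d appears only once, so it is excluded from the F-list, and the remaining items are ordered a, b, c.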

Example
• minsup = 3
• Word count, then sort in descending order: F-list
• {fm: 4}, {fpc: 3}, {fp: 3} are locally closed but not globally closed

Detail of Step 3

Efficient Redundant Itemset Filtering
• Figure panels: Mapper Output, Reducer, Reducer Output
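One way to picture the filtering pass is sketched below in single-process Python. It assumes that the mapper keys each locally closed itemset by its support, so that one reducer call sees all itemsets with equal support and can drop any that has a proper superset in the same group. That key choice, the function names, and the two node outputs are assumptions made for illustration; the slide does not confirm the paper's exact key design.

```python
from collections import defaultdict

def map_by_support(local_results):
    # Mapper: re-key each locally closed itemset by its support (assumed key).
    for itemset, support in local_results:
        yield support, frozenset(itemset)

def reduce_filter(support, itemsets):
    # Reducer: among itemsets with equal support, keep only those that
    # have no proper superset in the group. Duplicates collapse in the set.
    unique = set(itemsets)
    return [(x, support) for x in unique if not any(x < y for y in unique)]

def filter_redundant(all_local_results):
    # Shuffle phase imitation: group mapper output by key.
    groups = defaultdict(list)
    for local_results in all_local_results:
        for support, itemset in map_by_support(local_results):
            groups[support].append(itemset)
    closed = {}
    for support, itemsets in groups.items():
        for itemset, s in reduce_filter(support, itemsets):
            closed[itemset] = s
    return closed

# Hypothetical locally closed itemsets reported by two nodes.
node_1 = [("ab", 3), ("a", 4)]
node_2 = [("abc", 3), ("c", 5)]
result = filter_redundant([node_1, node_2])
```

In this toy input {a, b} is closed on its own node but is filtered out globally, because {a, b, c} from the other node has the same support.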

Experimental Results - 1
• Two real datasets
  • “connect”
    • Contains game state information
    • 8.8 megabytes
  • “webdocs”
    • 1,692,082 transactions with 5,267,656 distinct items
    • Maximum transaction length is 71,472
    • 1.4 gigabytes
• 6 nodes with Hadoop 0.21.0
  • Each node
    • 4 Intel Core processors
    • 4 GB RAM
    • 500 GB HDD
    • Ubuntu 10.10
    • Java openjdk-6-jdk

Experimental Results - 2
[12] G. Chen, Y. Yang, Y. Gao, and L. Shang, “Mining closed frequent itemset based on mapreduce,” in Proceedings of the 4th China Conference on Data Mining (CCDM), 2011.

Conclusion
• Good scalability on large-scale datasets
• When the number of locally closed frequent itemsets is large, communication cost becomes an important factor
