Algorithms for Mining Maximal Frequent Itemsets -- A Survey Chaojun Lu
Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks
Introduction • Terminology and Notations • Problem • Solution
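Terminology and Notations
• Set of items: I = {i_1, i_2, …, i_n}
• Set of transactions: DB = {T_1, T_2, …, T_m}, T_i ⊆ I
• (k-)itemset: N ⊆ I (|N| = k)
• Support of itemset N: supp(N), the number of transactions in DB that contain N
• Frequent itemset (fi): an itemset N with supp(N) ≥ minsup
• Maximal frequent itemset (mfi): a frequent itemset with no frequent proper superset
• Set of all frequent (k-)itemsets: FI, FI_k
• Set of all mfi: MFI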
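The notation above can be made concrete with a minimal sketch. The toy database `DB`, the threshold `MINSUP` and the function names are assumptions chosen for illustration, not part of the slides. The brute-force enumeration is only practical for a handful of items.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in db if s <= t)


def brute_force_mfi(db=DB, minsup=MINSUP):
    """Enumerate all non-empty subsets of I, keep the frequent ones (FI),
    then keep those with no frequent proper superset (MFI)."""
    items = sorted(set().union(*db))
    fi = [frozenset(c)
          for k in range(1, len(items) + 1)
          for c in combinations(items, k)
          if supp(c, db) >= minsup]
    return sorted(sorted(n) for n in fi if not any(n < m for m in fi))


print(supp({1, 2}))       # 2
print(brute_force_mfi())  # [[1, 2, 4], [2, 3]]
```

Here {1,2} is frequent but not maximal, because its superset {1,2,4} is also frequent.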
Problem
Discover all maximal frequent itemsets in a given transaction database.
Solution
Traverse the search space (the subset lattice of I) and count the support of itemsets in DB.
Solution (cont.)
• Traversing the search space by:
  • Brute force: 2^|I| itemsets
  • Clever use of the Basic Property of itemsets:
    • A ⊆ B ⇒ supp(A) ≥ supp(B)
    • BP1: All subsets of a known frequent itemset are also frequent.
    • BP2: All supersets of a known infrequent itemset are also infrequent.
Frequent Itemset eXtension Tree • Purpose • Idea • Description • Problem Re-formulated
Purpose
To provide a general framework for analyzing and comparing existing MFI-mining algorithms.
Idea
Larger frequent itemsets are generated by extending known smaller frequent itemsets with suitable items. FIXTree captures and illustrates this extension process.
Description of FIXTree
• Root: ∅
• Nodes: frequent itemsets
• Each node N is associated with its candidate extensions CX(N) and frequent extensions FX(N), defined as:
  • CX(N) = {x | x ∈ I and N ∪ {x} may be frequent}
  • FX(N) = {x | x ∈ CX(N) and N ∪ {x} is frequent}
• Parent-child P → C: C is a frequent extension of P, i.e. C = P ∪ {x} for some x ∈ FX(P).
Example
Each node is written N (CX(N)/FX(N)).
∅ ({1,2,3,4,5}/{1,2,3,4})
  1 ({2,3,4}/{2,4})
    12 ({4}/{4})
      124 (∅/∅)
    14 (∅/∅)
  2 ({3,4}/{3,4})
    23 ({4}/∅)
    24 (∅/∅)
  3 …
  4 …
Problem Re-formulated
Generate as small a FIXTree containing MFI as possible while searching the subset lattice of I.
Common Techniques • Search Strategies • Pruning Strategies • Dynamic Reordering • Data Representation for Fast Support Counting • Frequency Determination
Search Strategies
• We can generate the FIXTree via:
  • Breadth-first
  • Depth-first
  • Hybrid
• For MFI-mining, it is unnecessary to generate and count all nodes. Instead, we try to generate as few nodes of the FIXTree as possible, as long as MFI can still be identified.
Pruning Strategies
BasicPS1: Prune node N's infrequent extension subtrees.
1 ({2,3,4}/{2,4})
  12 ({4}/{4})
  13 (infrequent, pruned)
  14 (∅/∅)
Note: This strategy greatly improves a PURE DFS algorithm for mining long patterns.
Pruning Strategies (cont.)
BasicPS2: Node N's CX(N) comes from its parent node P's FX(P). Let N = P ∪ {x} with x ∈ FX(P); then CX(N) = {y | y ∈ FX(P) and y > x}.
1 ({2,3,4}/{2,4})
  12 ({4}/…)
  14 (∅/…)
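As an illustration of BasicPS1 and BasicPS2 together, the sketch below builds a FIXTree depth-first. The toy database, `MINSUP` and the function `fix_tree` are assumptions made for this example. The database was chosen so that the resulting (CX/FX) annotations agree with the example tree shown earlier.

```python
# Hypothetical toy database and threshold (not from the slides).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
MINSUP = 2


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


def fix_tree(node=frozenset(), cx=None, out=None):
    """Depth-first FIXTree construction; returns (N, CX(N), FX(N)) triples."""
    if cx is None:
        cx = sorted(set().union(*DB))  # CX(root) = I
    if out is None:
        out = []
    # BasicPS1: only frequent extensions become children,
    # so infrequent extension subtrees are never generated.
    fx = [x for x in cx if supp(node | {x}) >= MINSUP]
    out.append((sorted(node), list(cx), fx))
    for x in fx:
        # BasicPS2: a child's candidates come from the parent's FX, after x.
        child_cx = [y for y in fx if y > x]
        fix_tree(node | {x}, child_cx, out)
    return out


for n, cx, fx in fix_tree():
    print(n, cx, fx)
```

On this database the tree has 10 nodes, for example `[1] [2, 3, 4] [2, 4]` and `[2, 3] [4] []`, whereas the full subset lattice over five items has 32 itemsets.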
Pruning Strategies (cont.)
MaxPS1: At node N, if N ∪ CX(N) ⊆ M (a known fi/mfi), then the N-subtree may be pruned.
MaxPS2: At node N, if N ∪ CX(N) is found to be frequent by support counting, then all of N's children may be pruned (and a possible new mfi is produced).
Look-ahead example (count 1234 = 1 ∪ CX(1) directly):
1 ({2,3,4}/…)
  12
    123
      1234
    124
  13
  14
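A minimal depth-first sketch of MaxPS1 and MaxPS2 follows. It assumes a fixed lexicographic item order, a hypothetical toy database and an illustrative function name `mine`. It is not the procedure of any particular published algorithm.

```python
# Hypothetical toy database and threshold (not from the slides).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
MINSUP = 2


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


def mine(node, cx, mfi):
    """DFS over the FIXTree with MaxPS1 and MaxPS2; `mfi` collects results."""
    head_union_tail = node | set(cx)  # N ∪ CX(N)
    # MaxPS1: the whole subtree lies inside a known mfi.
    if any(head_union_tail <= m for m in mfi):
        return
    # MaxPS2 (look-ahead): N ∪ CX(N) is itself frequent, skip all children.
    if cx and supp(head_union_tail) >= MINSUP:
        mfi.append(head_union_tail)
        return
    fx = [x for x in cx if supp(node | {x}) >= MINSUP]
    if not fx:
        # Leaf: N cannot be extended; keep it unless a known mfi covers it.
        if node and not any(node <= m for m in mfi):
            mfi.append(node)
        return
    for x in fx:
        mine(node | {x}, [y for y in fx if y > x], mfi)


result = []
mine(frozenset(), sorted(set().union(*DB)), result)
print(sorted(sorted(m) for m in result))  # [[1, 2, 4], [2, 3]]
```

In this run the look-ahead at node 12 finds {1,2,4} directly, and nodes 14, 24 and 4 are then cut by MaxPS1 without any support counting.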
Pruning Strategies (cont.)
MaxPS3: At node N, if N ∪ CX(N) is frequent, then all of N's right-hand-side siblings may be pruned. (Those branches won't produce a new mfi.)
∅ ({1,2,3,4,5}/{1,2,3,4})
  1 …
  2 ({3,4}/…)
  3 …
  4 …
Pruning Strategies (cont.)
DFMaxPS: In DFS, AFTER the recursive call DFS(Ni), check whether the leftmost path N ∪ {i,…,n} is frequent. If yes, then Ni's right-hand-side siblings may be pruned. (These won't produce a new mfi.)
N (…/{1,2,…,n})
  N1
  …
  Ni ({i+1,…,n})
  N(i+1)
  …
  Nn
Pruning Strategies (cont.)
EquivPS: At node N, if for some x ∈ CX(N), supp(N ∪ {x}) = supp(N), then N can be replaced by N ∪ {x}, with CX(N ∪ {x}) = CX(N) − {x}.
Before: N ({x,y,z}/…) with children Nx…, Ny…, Nz… (and Nxy…, Nxz… below Nx)
After: Nx ({y,z}/…) with children Nxy…, Nxz…
Itemsets containing N but not x cannot be mfi.
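EquivPS can be sketched as a step that absorbs every equal-support candidate into the node before any children are generated. The toy database and the helper name `absorb_equivalent` are assumptions for illustration.

```python
# Hypothetical toy database (not from the slides).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


def absorb_equivalent(node, cx):
    """Move each x with supp(N ∪ {x}) == supp(N) from CX(N) into N.

    Equal support means every transaction containing N also contains x,
    so no itemset that contains N but not x can be maximal.
    """
    node = frozenset(node)
    remaining = []
    for x in cx:
        if supp(node | {x}) == supp(node):
            node = node | {x}
        else:
            remaining.append(x)
    return node, remaining


new_node, new_cx = absorb_equivalent({1}, [2, 3, 4])
print(sorted(new_node), new_cx)  # [1, 2, 4] [3]
```

Here items 2 and 4 occur in every transaction that contains item 1, so node 1 is replaced by 124 and only item 3 remains as a candidate.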
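Dynamic Reordering
• The order in which items are used to extend itemsets greatly affects the performance of MFI-mining algorithms.
• Two heuristics:
  • DR1: At node N, reorder all x ∈ FX(N) in increasing order of supp(N ∪ {x}).
[Figure: example of reordering the extensions of node 1 ({2,3,4}); nodes shown: 12 {4,3}, 13 {4}, 14, 123, 124 {3}, 134, 1243]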
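DR1 amounts to a sort by extension support. The sketch below assumes a hypothetical toy database and an illustrative helper name `dr1`.

```python
# Hypothetical toy database (not from the slides).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


def dr1(node, fx):
    """Reorder FX(N) by increasing supp(N ∪ {x}); ties keep their order."""
    node = frozenset(node)
    return sorted(fx, key=lambda x: supp(node | {x}))


print(dr1(set(), [1, 2, 3, 4]))  # [1, 3, 4, 2]
```

With this order the most frequent item (2, with support 4) comes last, so it appears in the candidate sets of all earlier siblings, which tends to make look-ahead checks such as MaxPS2 succeed more often.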
Dynamic Reordering (cont.)
• DR2: Reorder the items of FX(∅) (i.e. FI_1) in decreasing order of |IF(x)| for x ∈ FI_1, where
  • IF(x) = {y | y ∈ FI_1 and xy is infrequent}.
• Notes:
  • |M(x)| ≤ |FI_1| − |IF(x)|, where M(x) is the longest mfi containing x.
  • DR2 + DR1 for FI_1.
  • Compute FI_1 and FI_2 before using DR2.
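A small sketch of DR2 with DR1 as the tie-breaker, together with the length bound from the notes. The toy database, `MINSUP` and the helper names are assumptions for illustration.

```python
# Hypothetical toy database and threshold (not from the slides).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
MINSUP = 2


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


FI1 = [x for x in sorted(set().union(*DB)) if supp({x}) >= MINSUP]


def infrequent_partners(x):
    """IF(x): frequent items y such that the pair xy is infrequent."""
    return {y for y in FI1 if y != x and supp({x, y}) < MINSUP}


# DR2: decreasing |IF(x)|; ties are broken by DR1 (increasing support).
order = sorted(FI1, key=lambda x: (-len(infrequent_partners(x)), supp({x})))
# Upper bound on the length of any mfi containing x.
bound = {x: len(FI1) - len(infrequent_partners(x)) for x in FI1}

print(order)  # [3, 1, 4, 2]
print(bound)  # {1: 3, 2: 4, 3: 2, 4: 3}
```

Item 3 has two infrequent partners (1 and 4), so any mfi containing it has at most 2 items. On this database that matches the mfi {2,3}.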
Data Representation • Data representation • transaction • set of items • bitstring • tid-list for each item(set) • FP-tree • vertical bitmap for each item(set) • diffset • Count support on the entire DB or sub-DB? • Counting techniques
Frequency Determination
• We can determine that an itemset N is frequent via:
  • Directly counting supp(N) in DB
  • A known frequent superset of N
  • A lower bound on supp(N) that reaches minsup
Lower Bound Technique
• Obtain a lower bound on supp(N) based on support information of N's subsets.
• supp(N ∪ {x}) = supp(N) − drop(N, x) ≥ supp(N) − drop(M, x), where M ⊆ N.
• supp(N ∪ X) ≥ supp(N) − Σ_{x ∈ X} drop(M, x), where M ⊆ N.
Lower Bound Technique (cont.)
• LB-PS
  • We already have supp(N), supp(N1), supp(N2), supp(N3), so we can compute a lower bound on supp(N123), namely supp(N) − drop(N,1) − drop(N,2) − drop(N,3), and check whether it is ≥ minsup.
  • If yes, then prune the N2 and N3 branches. (cf. MaxPS3)
N (…/{1,2,3})
  N1 ({2,3}/…)
  N2 ({3}/…)
  N3
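The bound can be checked on a small worked example. The database below, the threshold `MINSUP` and the helper names `drop` and `lower_bound` are assumptions for illustration; here N = {0} and its extensions are the items 1, 2 and 3.

```python
# Hypothetical database: 8 transactions {0,1,2,3}, plus {0,1,2} and {0,3}.
DB = [{0, 1, 2, 3}] * 8 + [{0, 1, 2}, {0, 3}]
MINSUP = 6


def supp(itemset):
    s = frozenset(itemset)
    return sum(1 for t in DB if s <= t)


def drop(n, x):
    """drop(N, x) = supp(N) - supp(N ∪ {x})."""
    return supp(n) - supp(set(n) | {x})


def lower_bound(n, xs):
    """supp(N) - sum of drop(N, x) over x in X, a lower bound on supp(N ∪ X)."""
    return supp(n) - sum(drop(n, x) for x in xs)


N = {0}
lb = lower_bound(N, [1, 2, 3])
print(lb, supp({0, 1, 2, 3}))  # 7 8
print(lb >= MINSUP)            # True
```

The bound 10 − 1 − 1 − 1 = 7 already reaches the threshold, so {0,1,2,3} is known to be frequent without counting it, and the sibling branches can be pruned as in LB-PS.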
Some MFI-Mining Algorithms • Apriori • Pincer-Search • FP-growth • Max-Miner • DepthProject • MAFIA • GenMax
Apriori
Breadth-first
Key steps: given FI_k
• Generate C_{k+1}
  • Join (extending FI_k using BasicPS2)
  • Prune (BP2)
• Count the support of C_{k+1} to obtain FI_{k+1}
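The join and prune steps can be sketched as follows. The function name `apriori_gen` and the input FI_2 (taken from a hypothetical toy database) are assumptions for illustration. Itemsets are represented as sorted tuples.

```python
from itertools import combinations


def apriori_gen(fi_k):
    """Generate C_{k+1} from FI_k (a collection of k-itemsets)."""
    fi_k = {tuple(sorted(s)) for s in fi_k}
    k = len(next(iter(fi_k)))
    candidates = set()
    for a in fi_k:
        for b in fi_k:
            # Join: same (k-1)-prefix, b's last item comes after a's.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune (BP2): drop c if any k-subset is not frequent.
                if all(s in fi_k for s in combinations(c, k)):
                    candidates.add(c)
    return sorted(candidates)


FI2 = [(1, 2), (1, 4), (2, 3), (2, 4)]
print(apriori_gen(FI2))  # [(1, 2, 4)]
```

The join also produces (2, 3, 4), but it is pruned before any counting because its subset (3, 4) is not in FI_2.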
Apriori (cont.)
Symmetry of the FI-mining problem
[Figure: FI_k, extension, count C_{k+1}, FI_{k+1}; IF_{k+1}, reduction, count C_k, IF_k; top element {1,2,…,n}]
• Extension-based vs reduction-based
• Frequent vs infrequent
Pincer-Search
Hybrid search (top-down + bottom-up)
Key steps: initially CMFI = {I}; given FI_{k-1}, C_k, CMFI and MFI
• Count C_k ∪ CMFI to obtain FI_k, IFI_k and new MFI
• Use MFI to prune FI_k (BP1, MaxPS)
• Use IFI_k to update CMFI
• Generate C_{k+1}
  • Join (extending FI_k using BasicPS2)
  • Recover missing candidates
  • Prune (BP2)
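Pincer-Search (cont.)
[Figure: Pincer-Search on the subset lattice over items 1 to 5, with levels 1 2 3 4 5, 12 13 14 23 24 34, 1234 and 12345; the bottom-up and top-down directions and two pruned regions are marked.]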
FP-Growth
FP-tree: a compact form of DB/sub-DB
Key steps:
FP-growth(N, N-tree)
  if N-tree is a single path (Nx, Ny, Nz)
    then a possible mfi N ∪ {x,y,z} is found
  else {
    extend N with x ∈ FX(N)
    construct Nx-tree
    FP-growth(N ∪ {x}, Nx-tree) }
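FP-Growth (cont.)
[Figure: example FP-tree over the items f, c, a, b, m, p with node counts (f:4, c:3, a:3, m:2, p:2, b:1, …), alongside the extension nodes p (mbacf/c) and m (bacf/acf), with one branch marked as pruned.]
• p's subDB: fcam, fcam, cb; p's FP-tree: c
• m's subDB: fca, fca, fcab; m's FP-tree: fca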
FP-Growth (cont.)
• Depth-first
• MaxPS (if used for MFI-mining)
• Dynamic reordering
• Projected subDB
• Without candidate generation?
  • Constructing the subDB for N → CX(N)
  • Single path → MaxPS
  • Mining frequent 1-itemsets in the subDB → FX(N)
Max-Miner
Breadth-first + pruning
Key steps: at node N with CX(N)
• Count N ∪ CX(N) and N ∪ {x} for x ∈ CX(N) to get FX(N)
• If N ∪ CX(N) is frequent, prune using MaxPS2
• Reorder FX(N) using DR1
• Generate N's children N ∪ {x} for x ∈ FX(N), with CX(N ∪ {x}) = {y | y ∈ FX(N) and y > x}
• MaxPS3 + LB-PS
DepthProject
Depth-first + pruning
Key steps: at node N with CX(N), call DP(N, DB)
• Count N ∪ {x} in DB to obtain FX(N)
• Prune using DFMaxPS, MaxPS1
• Project DB to obtain subDB (if necessary)
• Reorder FX(N) using DR1
• For each x ∈ FX(N): DP(N ∪ {x}, subDB)
Output: a superset of MFI
DepthProject (cont.)
Projected DB
[Figure: an example database DB and its projection for node a ({b,c}): transactions containing a are reduced to the items of CX(a), e.g. abc to bc, and encoded as bitstrings.]
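Database projection can be sketched in a few lines. The toy database and the function name `project` are assumptions for illustration, and plain Python sets stand in for the bitstring encoding mentioned on the slide.

```python
# Hypothetical toy database (not the one in the figure).
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]


def project(db, node, cx):
    """Keep the transactions that contain N, restricted to the items of CX(N)."""
    node, cx = frozenset(node), frozenset(cx)
    return [t & cx for t in db if node <= t]


sub_db = project(DB, {1}, {2, 3, 4})
print(sub_db)  # [{2, 4}, {2, 4}]

# Counting x in the projected database gives supp(N ∪ {x}) in the full one.
print(sum(1 for t in sub_db if 2 in t))  # 2, i.e. supp({1, 2})
```

Every descendant of node N only needs the projected database, which is smaller both in the number of transactions and in their length.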
DepthProject (cont.)
• Project DB for some nodes on a path
• Bitstring representation
• Byte counting
• Bucket counting
MAFIA
Depth-first + pruning
Key steps: at node N, call MAFIA(N, MFI)
• If N ∪ CX(N) ⊆ M for some M ∈ MFI, prune using MaxPS1
• Count N ∪ {x} to obtain FX(N), using EquivPS
• Reorder FX(N) using DR1
• For each x ∈ FX(N): MAFIA(N ∪ {x}, MFI)
• If on the leftmost path, prune using DFMaxPS
MAFIA (cont.)
Data Representation
• Vertical bitmap and byte counting
• Bitmap of item(set) N: bmp(N), where bit j is 1 if transaction j contains N and 0 otherwise
• t(N ∪ {x}) = t(N) ∩ t(x), computed as bmp(N) AND bmp(x)
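The vertical bitmap idea can be illustrated with Python integers as bit vectors. The toy database and the names `bitmap`, `BMP`, `POPCOUNT` and `count_bits` are assumptions for illustration, not MAFIA's actual implementation. Byte counting is sketched as a 256-entry lookup table.

```python
# Hypothetical toy database; transaction j corresponds to bit j.
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
ITEMS = sorted(set().union(*DB))


def bitmap(item):
    """bmp(x): bit j is set iff transaction j contains item x."""
    return sum(1 << j for j, t in enumerate(DB) if item in t)


BMP = {x: bitmap(x) for x in ITEMS}

# Byte counting: precomputed number of 1-bits for every byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]


def count_bits(bmp):
    """Support = number of 1-bits, summed one byte at a time."""
    n_bytes = max(1, (bmp.bit_length() + 7) // 8)
    return sum(POPCOUNT[b] for b in bmp.to_bytes(n_bytes, "little"))


bmp_12 = BMP[1] & BMP[2]   # bmp({1,2}) = bmp(1) AND bmp(2)
bmp_124 = bmp_12 & BMP[4]  # extend by one more item
print(count_bits(bmp_12), count_bits(bmp_124), count_bits(BMP[1] & BMP[3]))  # 2 2 0
```

Each extension costs one bitwise AND plus a table-driven count, regardless of how many items the itemset already has.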
GenMax
Depth-first + pruning
Key steps:
• Compute FI_1 and FI_2
• Reorder FI_1 using DR2 + DR1
• MFI = ∅ (used for MaxPS1)
• LMFI(∅, FI_1, MFI) // use diffsets
• Return MFI
GenMax (cont.)
MFI-subset check: progressive focusing
LMFI(N, FX(N), LMFI)
  For each x ∈ FX(N)
    Generate N ∪ {x} with CX(Nx)
    If Nx ∪ CX(Nx) ⊆ M for some M ∈ LMFI then return // MaxPS1
    Count CX(Nx) to obtain FX(Nx)
    Update LMFI to obtain newLMFI
    LMFI(Nx, FX(Nx), newLMFI)
GenMax (cont.)
• MFI-subset check optimization: check for local MFI
• DR2
• Data representation: diffsets
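Diffsets store, for each itemset, the transaction ids lost relative to its prefix rather than the ids kept. A minimal sketch follows, assuming a hypothetical toy database and illustrative helper names; it uses the standard recurrences d(Pxy) = d(Py) − d(Px) and supp(Pxy) = supp(Px) − |d(Pxy)|.

```python
# Hypothetical toy database; transaction ids are list positions.
DB = [{1, 2, 4, 5}, {1, 2, 4}, {2, 3}, {2, 3}]
ALL_TIDS = set(range(len(DB)))


def tidset(item):
    return {j for j, t in enumerate(DB) if item in t}


# Diffsets of single items relative to the empty prefix.
d = {x: ALL_TIDS - tidset(x) for x in (1, 2, 3, 4)}
s = {x: len(DB) - len(d[x]) for x in d}


def extend(d_px, s_px, d_py):
    """From the diffset and support of Px and the diffset of Py
    (same prefix P), derive the diffset and support of Pxy."""
    d_pxy = d_py - d_px
    return d_pxy, s_px - len(d_pxy)


d12, s12 = extend(d[1], s[1], d[2])
d14, s14 = extend(d[1], s[1], d[4])
d124, s124 = extend(d12, s12, d14)
d13, s13 = extend(d[1], s[1], d[3])
print(s12, s14, s124, s13)  # 2 2 2 0
```

Diffsets shrink as itemsets grow, so for long patterns they are often much smaller than the corresponding tid-lists.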
Concluding Remarks • Independent components can fit together nicely • Search strategy: hybrid • Pruning strategy and dynamic reordering • Data projection, bitmap representation, fast counting, compression • Different algorithms perform well under different MFI distributions • MAFIA and GenMax: current state-of-the-art
References
• R. C. Agarwal, et al. Depth first generation of long patterns.
• R. J. Bayardo. Efficiently mining long patterns from databases.
• D. Burdick, et al. MAFIA: a maximal frequent itemset algorithm for transactional databases.
• K. Gouda, et al. Efficiently mining maximal frequent itemsets.
• J. Han, et al. Mining frequent patterns without candidate generation.
• D-I Lin, et al. Pincer-search: an efficient algorithm for discovering the maximum frequent set.