
Algorithms for Mining Maximal Frequent Itemsets -- A Survey



Presentation Transcript


  1. Algorithms for Mining Maximal Frequent Itemsets -- A Survey Chaojun Lu

  2. Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

  3. Introduction • Terminology and Notations • Problem • Solution

  4. Terminology and Notations • set of items: I = {i1, i2, …, in} • set of transactions: DB = {T1, T2, …, Tm}, Ti ⊆ I • (k-)itemset: N ⊆ I (|N| = k) • support of itemset N: supp(N), the number of transactions in DB that contain N • frequent itemset (fi): supp(N) ≥ minsup • maximal frequent itemset (mfi): a frequent itemset with no frequent proper superset • set of all frequent (k-)itemsets: FI, FIk • set of all mfi: MFI

  5. Problem: Discover all maximal frequent itemsets in a given transaction database. Solution: Traverse the search space (the subset lattice of I) and count the support of itemsets in DB.

  6. Solution (cont.) • Traversing the search space by: • Brute force: 2^|I| itemsets • Clever use of the Basic Property of itemsets: • A ⊆ B ⇒ supp(A) ≥ supp(B) • BP1: All subsets of a known frequent itemset are also frequent. • BP2: All supersets of a known infrequent itemset are also infrequent.
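
As a rough illustration of the brute-force traversal and of the basic property, the sketch below enumerates every non-empty subset of I, counts its support, and keeps the maximal frequent ones. The database `DB`, the absolute threshold `MINSUP`, and the function names are assumptions made for this example; none of them come from the slides.

```python
from itertools import combinations

# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
MINSUP = 2


def supp(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)


def brute_force_mfi(db=DB, minsup=MINSUP):
    """Test every non-empty subset of I, then keep only the maximal frequent ones."""
    items = sorted(set().union(*db))
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if supp(c, db) >= minsup]
    # An itemset is maximal if no other frequent itemset strictly contains it.
    return sorted(sorted(f) for f in frequent
                  if not any(f < g for g in frequent))


print(brute_force_mfi())                 # [[1, 2, 4], [2, 3]]
print(supp({1, 2}) >= supp({1, 2, 4}))   # basic property: True
```

The enumeration touches 2^|I| − 1 itemsets, which is why the remaining slides focus on pruning.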

  7. Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

  8. Frequent Itemset eXtension Tree • Purpose • Idea • Description • Problem Re-formulated

  9. Purpose: To provide a general framework for analyzing and comparing existing MFI-mining algorithms. Idea: Larger frequent itemsets are generated by extending known smaller frequent itemsets with suitable items. The FIXTree captures and illustrates this extension process.

  10. Description of FIXTree • Root: ∅ • Nodes: frequent itemsets • Each node N is associated with its candidate extensions CX(N) and frequent extensions FX(N), defined as: • CX(N) = {x | x ∈ I and N ∪ {x} may be frequent} • FX(N) = {x | x ∈ CX(N) and N ∪ {x} is frequent} • Parent-child (P → C): C is a frequent extension of P, i.e. C = P ∪ {x} for some x ∈ FX(P).

  11. Example (each node is shown as itemset (CX/FX))
      ∅ ({1,2,3,4,5}/{1,2,3,4})
        1 ({2,3,4}/{2,4})
          12 ({4}/{4})
            124 (∅/∅)
          14 (∅/∅)
        2 ({3,4}/{3,4})
          23 ({4}/∅)
          24 (∅/∅)
        3…
        4…
      Problem Re-formulated: Generate as small a FIXTree containing MFI as possible while searching the subset lattice of I.

  12. Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

  13. Common Techniques • Search Strategies • Pruning Strategies • Dynamic Reordering • Data Representation for Fast Support Counting • Frequency Determination

  14. Search Strategies • We can generate the FIXTree via: • Breadth-first search • Depth-first search • Hybrid search • For MFI mining, it is unnecessary to generate and count all nodes. Instead, we try to generate as few nodes of the FIXTree as possible, so long as MFI can still be identified.

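  15. Pruning Strategies BasicPS1: Prune the subtrees of node N’s infrequent extensions. Example: at node 1 ({2,3,4}/{2,4}), the children 12 ({4}/{4}) and 14 (∅/∅) are kept, while 13 is pruned because 3 is in CX(1) but not in FX(1). Note: This strategy greatly improves a pure DFS algorithm for mining long patterns.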

  16. Pruning Strategies (cont.) BasicPS2: Node N’s CX(N) comes from its parent node P’s FX(P). Let N = P ∪ {x}, x ∈ FX(P); then CX(N) = {y | y ∈ FX(P) and y > x}. Example: node 1 ({2,3,4}/{2,4}) has children 12 ({4}/…) and 14 (∅/…).
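
The two basic strategies can be sketched as a depth-first FIXTree builder: only frequent extensions become children (BasicPS1), and each child inherits its candidate extensions from the parent's FX (BasicPS2). The database below is a hypothetical one, chosen so that the visible nodes of the slide-11 example come out the same; `fixtree` and the other names are illustrative assumptions.

```python
# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
MINSUP = 2


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def fixtree(node=(), cx=None, out=None):
    """Build the FIXTree depth-first; returns {node: (CX(node), FX(node))}."""
    if out is None:
        out = {}
    if cx is None:
        cx = sorted(set().union(*DB))          # CX(root) = I
    fx = [x for x in cx if supp(node + (x,)) >= MINSUP]
    out[node] = (list(cx), fx)
    for x in fx:                               # BasicPS1: infrequent extensions get no subtree
        child_cx = [y for y in fx if y > x]    # BasicPS2: CX(child) = {y in FX(parent), y > x}
        fixtree(node + (x,), child_cx, out)
    return out


tree = fixtree()
for node in sorted(tree):
    cx, fx = tree[node]
    print(node, cx, fx)
```

With this data, node `(1,)` gets CX `[2, 3, 4]` and FX `[2, 4]`, and node `(2, 3)` gets CX `[4]` and an empty FX, in line with the example tree.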

  17. Pruning Strategies (cont.) MaxPS1: At node N, if N ∪ CX(N) ⊆ M (a known fi/mfi), then the N-subtree may be pruned. MaxPS2: At node N, if N ∪ CX(N) is found to be frequent by support counting, then all of N’s children may be pruned (and a possible new mfi is produced). Look-ahead example: at node 1 ({2,3,4}/…), the descendants 12, 13, 14, 123, 124 need not be generated if 1234 is frequent.
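
A minimal sketch of how MaxPS1 and MaxPS2 might be combined in a depth-first miner with a fixed item order. This is not any of the surveyed algorithms; the database, `MINSUP`, and `mine_mfi` are assumptions for illustration only.

```python
# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
MINSUP = 2


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def mine_mfi(node=frozenset(), cx=None, mfi=None):
    """Depth-first search over the FIXTree with MaxPS1 and MaxPS2 pruning."""
    if mfi is None:
        mfi = []
    if cx is None:
        cx = sorted(set().union(*DB))
    full = node | set(cx)                       # N ∪ CX(N)
    if any(full <= m for m in mfi):             # MaxPS1: covered by a known mfi
        return mfi
    if cx and supp(full) >= MINSUP:             # MaxPS2: look-ahead succeeded
        mfi.append(full)
        return mfi
    fx = [x for x in cx if supp(node | {x}) >= MINSUP]
    if not fx:                                  # leaf: keep it unless already covered
        if node and not any(node <= m for m in mfi):
            mfi.append(node)
        return mfi
    for x in fx:
        mine_mfi(node | {x}, [y for y in fx if y > x], mfi)
    return mfi


print(sorted(sorted(m) for m in mine_mfi()))    # [[1, 2, 4], [2, 3]]
```

On this data the look-ahead at node 1 finds {1,2,4} at once, and the nodes 24 and 4 are later cut by MaxPS1 without any further counting.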

  18. Pruning Strategies (cont.) MaxPS3: At node N, if N ∪ CX(N) is frequent, then all of N’s right-hand-side siblings may be pruned. (Those branches won’t produce new mfi.) Example: ∅ ({1,2,3,4,5}/{1,2,3,4}) with children 1…, 2 ({3,4}/…), 3…, 4…

  19. Pruning Strategies (cont.) DFMaxPS: In DFS, after the recursive call DFS(Ni), check whether the leftmost path N ∪ {i, …, n} is frequent. If so, Ni’s right-hand-side siblings may be pruned. (These won’t produce new mfi.) Example: N (…/{1,2,…,n}) with children N1, …, Ni ({i+1,…,n}), N(i+1), …, Nn.

  20. Pruning Strategies (cont.) EquivPS: At node N, if supp(N ∪ {x}) = supp(N) for some x ∈ CX(N), then N can be replaced by N ∪ {x}, with CX(N ∪ {x}) = CX(N) − {x}. Example: N ({x,y,z}/…) ⇒ Nx ({y,z}/…); the subtree with Nx…, Ny…, Nz…, Nxy…, Nxz… reduces to Nxy…, Nxz…, because itemsets containing N but not x cannot be mfi.
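
EquivPS can be sketched as a single pass over CX(N) that moves every item with unchanged support into the node itself. The helper name `equiv_absorb` and the toy data are assumptions for this example.

```python
# Hypothetical toy database (not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def equiv_absorb(node, cx):
    """Return (N', CX') after moving every x with supp(N ∪ {x}) = supp(N) into N."""
    node = set(node)
    base = supp(node)            # stays the same while equivalent items are absorbed
    remaining = []
    for x in cx:
        if supp(node | {x}) == base:
            node.add(x)          # every transaction containing N also contains x
        else:
            remaining.append(x)
    return node, remaining


print(equiv_absorb({1}, [2, 3, 4]))   # ({1, 2, 4}, [3])
```

Here every transaction that contains item 1 also contains 2 and 4, so node 1 is replaced by 124 directly and only item 3 is left as a candidate extension.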

  21. Dynamic Reordering • The item order in which itemsets are extended greatly affects MFI-mining algorithms • Two heuristics: • DR1: At node N, reorder all x ∈ FX(N) in increasing order of supp(Nx). [Tree figure; node labels as extracted: 1 {2,3,4}, 13 {4}, 14, 12 {4,3}, 124 {3}, 134, 123, 1243]
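
DR1 amounts to a sort of FX(N) keyed on the support of each one-item extension. A small sketch, with the helper name `dr1` and the toy database as assumptions:

```python
# Hypothetical toy database (not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def dr1(node, fx):
    """Order the frequent extensions of `node` by increasing supp(node ∪ {x})."""
    return sorted(fx, key=lambda x: supp(set(node) | {x}))


# Supports of the single items 1, 2, 3, 4 are 2, 4, 3, 3.
print(dr1(set(), [1, 2, 3, 4]))   # [1, 3, 4, 2]
```

Python's sort is stable, so items with equal support (3 and 4 here) keep their original relative order.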

  22. Dynamic Reordering (cont.) • DR2: Reorder the items of FX(∅) (i.e. FI1) in decreasing order of |IF(x)| for x ∈ FI1, where • IF(x) = {y | y ∈ FI1 and xy is infrequent}. • Notes: • |M(x)| ≤ |FI1| − |IF(x)|, where M(x) is the longest mfi containing x • DR2 + DR1 for FI1. • Compute FI1 and FI2 before using DR2.
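
A sketch of DR2 on a toy database: IF(x) is computed from the pair supports (that is, from FI2), and ties are broken by increasing support, which is one way to read "DR2 + DR1". The names `dr2_order`, `DB`, and `MINSUP` are assumptions for this example.

```python
# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
MINSUP = 2


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def dr2_order(fi1):
    """Sort FI1 by decreasing |IF(x)|, breaking ties by increasing supp(x)."""
    infreq = {x: {y for y in fi1 if y != x and supp({x, y}) < MINSUP}
              for x in fi1}
    order = sorted(fi1, key=lambda x: (-len(infreq[x]), supp({x})))
    return order, infreq


fi1 = [x for x in sorted(set().union(*DB)) if supp({x}) >= MINSUP]
order, infreq = dr2_order(fi1)
print(order)                                    # [3, 1, 4, 2]
# Upper bound on the length of any mfi containing x: |FI1| - |IF(x)|
print({x: len(fi1) - len(infreq[x]) for x in fi1})
```

Item 3 is infrequent together with both 1 and 4, so it comes first and can appear only in an mfi of length at most 2, which matches the mfi {2,3} of this database.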

  23. Data Representation • Data representation • transaction • set of items • bitstring • tid-list for each item(set) • FP-tree • vertical bitmap for each item(set) • diffset • Count support on the entire DB or sub-DB? • Counting techniques
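
To make the tid-list and diffset representations concrete, the sketch below computes supports both ways on a toy database. It uses the usual diffset identities d(Px) = t(P) − t(Px) and d(Pxy) = d(Py) − d(Px), stated here as background knowledge and not taken from the slide; all names are illustrative.

```python
# Hypothetical toy database (not from the slides); tids are list positions.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
ALL_TIDS = set(range(len(DB)))


def tidset(itemset):
    """tid-list: ids of the transactions that contain `itemset`."""
    s = set(itemset)
    return {tid for tid, t in enumerate(DB) if s <= t}


# Diffsets of single items relative to the empty prefix: d(x) = t(∅) - t(x).
d = {x: ALL_TIDS - tidset({x}) for x in (2, 3)}

supp_2 = len(ALL_TIDS) - len(d[2])   # supp(2)  = supp(∅) - |d(2)|
d_23 = d[3] - d[2]                   # d(23)    = d(3) - d(2)
supp_23 = supp_2 - len(d_23)         # supp(23) = supp(2) - |d(23)|

print(supp_2, supp_23)               # 4 2
print(len(tidset({2, 3})))           # same support via tid-list intersection: 2
```

The diffset route never materializes t(23); it stores only the tids that are lost when the itemset is extended.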

  24. Frequency Determination • We can determine that an itemset N is frequent via: • Directly counting supp(N) in DB • A known frequent superset of N • A lower bound on supp(N) that is at least minsup

  25. Lower Bound Technique • Obtain a lower bound on supp(N) based on support information of N’s subsets. • supp(N ∪ {x}) = supp(N) − drop(N,x) ≥ supp(N) − drop(M,x), where M ⊆ N and drop(N,x) = supp(N) − supp(N ∪ {x}). • supp(N ∪ X) ≥ supp(N) − Σ_{x ∈ X} drop(M,x), where M ⊆ N.

  26. Lower Bound Technique (cont.) • LB-PS • We already have supp(N), supp(N1), supp(N2), supp(N3), so we can compute the lower bound • supp(N) − drop(N,1) − drop(N,2) − drop(N,3) on supp(N123) and check whether it is ≥ minsup. • If yes, then prune the N2 and N3 branches (cf. MaxPS3). Example: N (…/{1,2,3}) with children N1 ({2,3}/…), N2 ({3}/…), N3.
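
The bound can be sketched directly from the definition drop(M, x) = supp(M) − supp(M ∪ {x}). In the code below, `lower_bound` and the toy database are assumptions for illustration. A real miner would reuse supports it has already counted; this sketch simply recounts them.

```python
# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
MINSUP = 2


def supp(itemset):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in DB if s <= t)


def drop(m, x):
    """Support lost when itemset m is extended by item x."""
    return supp(m) - supp(set(m) | {x})


def lower_bound(n, xs, m=None):
    """supp(n) minus the summed drops, with drops taken at a subset m of n (default m = n)."""
    m = n if m is None else m
    return supp(n) - sum(drop(m, x) for x in xs)


lb = lower_bound({1}, [2, 4])
print(lb, lb >= MINSUP, supp({1, 2, 4}))   # 2 True 2
print(lower_bound({2}, [1, 4]))            # 0: the bound holds but decides nothing
```

In the first call the bound already reaches `MINSUP`, so {1,2,4} is known to be frequent without counting it. In the second call the bound is too loose and the itemset would still have to be counted.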

  27. Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

  28. Some MFI-Mining Algorithms • Apriori • Pincer-Search • FP-growth • Max-Miner • DepthProject • MAFIA • GenMax

  29. Apriori • Breadth-first • Key steps: Given FIk • Generate Ck+1: • Join (extending FIk using BasicPS2) • Prune (BP2) • Count the support of Ck+1 to obtain FIk+1
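
The join and prune steps can be sketched as follows. The function name `apriori_gen` and the input FI2 (taken from a hypothetical toy database, not from the slides) are assumptions for this example.

```python
from itertools import combinations


def apriori_gen(fk):
    """Generate C(k+1) from FIk: join on a shared (k-1)-prefix, then prune with BP2."""
    fk = sorted(tuple(sorted(f)) for f in fk)
    known = set(fk)
    k = len(fk[0]) if fk else 0
    candidates = []
    for i, a in enumerate(fk):
        for b in fk[i + 1:]:
            if a[:-1] != b[:-1]:
                break                      # fk is sorted, so no later b shares the prefix
            c = a + (b[-1],)               # join step
            # Prune step (BP2): every k-subset of c must itself be frequent.
            if all(s in known for s in combinations(c, k)):
                candidates.append(c)
    return candidates


FI2 = [(1, 2), (1, 4), (2, 3), (2, 4)]
print(apriori_gen(FI2))                    # [(1, 2, 4)]
```

The join alone would also produce (2, 3, 4), but it is pruned because its subset (3, 4) is not in FI2.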

  30. Apriori (cont.) Symmetry of the FI-mining problem. [Figure: lattice between ∅ and {1,2,…,n}; labels as extracted: extension (FIk, count Ck+1, FIk+1) and reduction (IFk+1, count Ck, IFk).] Extension-based vs. reduction-based; frequent vs. infrequent.

  31. Pincer-Search • Hybrid search (top-down + bottom-up) • Key steps: initially CMFI = {I} • Given FIk−1, Ck, CMFI and MFI: • Count Ck ∪ CMFI to obtain FIk, IFIk and new MFI • Use MFI to prune FIk (BP1, MaxPS) • Use IFIk to update CMFI • Generate Ck+1: • Join (extending FIk using BasicPS2) • Recover missing candidates • Prune (BP2)

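  32. Pincer-Search (cont.) [Figure: subset lattice from ∅ to 12345 with nodes 1, 2, 3, 4, 5, 12, 13, 14, 23, 24, 34 and 1234; the bottom-up and top-down search directions are marked, and two regions are labelled "pruned".]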

  33. FP-Growth • FP-tree: a compact form of DB/sub-DB • Key steps:
      FP-growth(N, N-tree)
        if N-tree is a single path
        then a possible mfi is found: N ∪ {x,y,z} (the path through Nx, Ny, Nz)
        else { extend N with x ∈ FX(N); construct Nx-tree; FP-growth(N ∪ {x}, Nx-tree) }

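  34. FP-Growth (cont.) [Figure: FP-tree with header items f, c, a, b, m, p and node counts f:4, c:3, a:3, m:2, p:2, m:1, c:1, p:1 and three nodes b:1; further labels as extracted: cp, "pruned".] Nodes: p (mbacf/c), m (bacf/acf). p’s subDB: fcam, fcam, cb; p’s FP-tree: c. m’s subDB: fca, fca, fcab; m’s FP-tree: fca.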

  35. FP-Growth (cont.) • Depth-first • MaxPS (if used for MFI mining) • Dynamic reordering • Projected subDB • Without candidate generation? • Constructing the subDB for N ↔ CX(N) • Single path ↔ MaxPS • Mining frequent 1-itemsets in the subDB ↔ FX(N)

  36. Max-Miner • Breadth-first + pruning • Key steps: At node N with CX(N): • Count N ∪ CX(N) and N ∪ {x} for x ∈ CX(N) to get FX(N) • If N ∪ CX(N) is frequent, prune using MaxPS2 • Reorder FX(N) using DR1 • Generate N’s children N ∪ {x} for x ∈ FX(N), with CX(N ∪ {x}) = {y | y ∈ FX(N) and y > x} • MaxPS3 + LB-PS

  37. DepthProject • Depth-first + pruning • Key steps: At node N with CX(N), call DP(N, DB): • Count N ∪ {x} in DB to obtain FX(N) • Prune using DFMaxPS, MaxPS1 • Project DB to obtain subDB (if necessary) • Reorder FX(N) using DR1 • For each x ∈ FX(N): DP(N ∪ {x}, subDB) • Output: a superset of MFI

  38. DepthProject (cont.) Projected DB. [Figure: a database DB and its projection for node a ({b,c}) onto FX(a); labels as extracted: abc, bc, [101], ab, ac, acd, c, abc, abe, b, [1010], bd.]

  39. DepthProject (cont.) • Project DB for some nodes on a path • Bitstring representation • Byte counting • Bucket counting

  40. MAFIA • Depth-first + pruning • Key steps: At node N, call MAFIA(N, MFI): • If N ∪ CX(N) is contained in a member of MFI, prune using MaxPS1 • Count N ∪ {x} to obtain FX(N), using EquivPS • Reorder FX(N) using DR1 • For each x ∈ FX(N): MAFIA(N ∪ {x}, MFI) • If on the leftmost path, prune using DFMaxPS

  41. MAFIA (cont.) • Data representation: vertical bitmap and byte counting • Bitmap of item(set) N: bmp(N), where bit j is 1 if transaction j contains N and 0 otherwise • t(N ∪ {x}) = t(N) ∩ t(x), computed as bmp(N) AND bmp(x)
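
A small sketch of the vertical-bitmap idea, using Python integers as bitmaps and a 256-entry lookup table as one possible reading of "byte counting". The names `bmp`, `byte_count`, and `POP8` and the toy database are assumptions; the actual MAFIA implementation details are not given on the slide.

```python
# Hypothetical toy database (not from the slides); bit j stands for transaction j.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4, 5}]
NBYTES = (len(DB) + 7) // 8

# Lookup table: number of 1-bits in every possible byte value.
POP8 = [bin(b).count("1") for b in range(256)]


def bmp(item):
    """Vertical bitmap of a single item."""
    bits = 0
    for tid, t in enumerate(DB):
        if item in t:
            bits |= 1 << tid
    return bits


def byte_count(bits):
    """Support of a bitmap, counted one byte at a time via the lookup table."""
    return sum(POP8[b] for b in bits.to_bytes(NBYTES, "little"))


b12 = bmp(1) & bmp(2)          # bmp({1,2})   = bmp(1) AND bmp(2)
b124 = b12 & bmp(4)            # bmp({1,2,4}) = bmp({1,2}) AND bmp(4)
print(byte_count(b12), byte_count(b124))   # 2 2
```

Extending an itemset by one item costs a single AND over the bitmap, and the support is read off without rescanning the transactions.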

  42. GenMax • Depth-first + pruning • Key steps: • Compute FI1 and FI2 • Reorder FI1 using DR2 + DR1 • MFI = ∅ (used for MaxPS1) • LMFI(∅, FI1, MFI) // use diffsets • Return MFI

  43. GenMax (cont.) • MFI-subset check: progressive focusing • LMFI(N, FX(N), LMFI): for each x ∈ FX(N): • Generate N ∪ {x} with CX(Nx) • If Nx ∪ CX(Nx) is contained in a member of LMFI (MaxPS1), then return • Count CX(Nx) to obtain FX(Nx) • Update LMFI to obtain newLMFI • LMFI(Nx, FX(Nx), newLMFI)

  44. GenMax (cont.) • MFI-subset check optimization: check for local MFI • DR2 • Data representation: diffsets

  45. Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

  46. Concluding Remarks • Independent components can fit together nicely • Search strategy: hybrid • Pruning strategy and dynamic reordering • Data projection, bitmap representation, fast counting, compression • Different algorithms perform well under different MFI distributions • MAFIA and GenMax: current state-of-the-art

  47. References • R. C. Agarwal, et al. Depth first generation of long patterns. • R. J. Bayardo. Efficiently mining long patterns from databases. • D. Burdick, et al. MAFIA: a maximal frequent itemset algorithm for transactional databases. • K. Gouda, et al. Efficiently mining maximal frequent itemsets. • J. Han, et al. Mining frequent patterns without candidate generation. • D-I Lin, et al. Pincer-search: an efficient algorithm for discovering the maximum frequent set.

  48. Thank You!
