
The FP-Growth/Apriori Debate



  1. The FP-Growth/Apriori Debate
     Jeffrey R. Ellis, CSE 300-01, April 11, 2002

  2. Presentation Overview
     • FP-Growth Algorithm: refresher & example; motivation
     • FP-Growth Complexity: vs. Apriori complexity; saving calculation or hiding work?
     • Real World Application: datasets are not created equal; results of real-world implementation

  3. FP-Growth Algorithm
     • Association Rule Mining
       • Generate Frequent Itemsets: Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets)
       • Find Association Rules: outside the scope of both FP-Growth & Apriori
     • Therefore, FP-Growth is a competitor to Apriori

  4. FP-Growth Example (min_support = 3)

     Transactions:
     TID | Items bought             | (Ordered) frequent items
     100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
     200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
     300 | {b, f, h, j, o}          | {f, b}
     400 | {b, c, k, s, p}          | {c, b, p}
     500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

     Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

     [Figure: FP-tree. Root {} with branch f:4 -> c:3 -> a:3 -> (m:2 -> p:2; b:1 -> m:1), branch f:4 -> b:1, and branch c:1 -> b:1 -> p:1.]

     Conditional pattern bases:
     item | cond. pattern base
     c    | f:3
     a    | fc:3
     b    | fca:1, f:1, c:1
     m    | fca:2, fcab:1
     p    | fcam:2, cb:1

  5. FP-Growth Example: conditional pattern bases and conditional FP-trees

     Item | Conditional pattern-base | Conditional FP-tree
     p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
     m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
     b    | {(fca:1), (f:1), (c:1)}  | Empty
     a    | {(fc:3)}                 | {(f:3, c:3)}|a
     c    | {(f:3)}                  | {(f:3)}|c
     f    | Empty                    | Empty
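The conditional pattern bases in this table come from the prefix paths above each occurrence of an item. Here is a minimal Python sketch (names illustrative, not from Han & Pei's code) that derives the same bases directly from the slide-4 ordered transactions, which is equivalent in aggregate to walking the tree's header links:

    from collections import Counter

    # Ordered frequent-item lists from slide 4 (TIDs 100-500).
    ordered = [list("fcamp"), list("fcabm"), list("fb"),
               list("cbp"), list("fcamp")]

    def cond_pattern_base(item, transactions):
        """Collect the prefix of each transaction up to (excluding) item."""
        base = Counter()
        for t in transactions:
            if item in t:
                prefix = tuple(t[:t.index(item)])
                if prefix:
                    base[prefix] += 1
        return base

    print(cond_pattern_base("m", ordered))
    # {('f','c','a'): 2, ('f','c','a','b'): 1} -- the table's row for m
    print(cond_pattern_base("p", ordered))
    # {('f','c','a','m'): 2, ('c','b'): 1}     -- the table's row for p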

  6. FP-Growth Algorithm: FP-CreateTree
     Input: DB, min_support
     Output: FP-Tree

     Scan DB & count all frequent items.
     Create null root & set as current node.
     For each transaction T:
         Sort T's frequent items by descending frequency.
         For each sorted item I:
             Insert I into the tree as a child of the current node (reusing an existing child if present).
             Connect each new tree node to its header list.

     • Two passes through DB.
     • Tree-creation work is proportional to the total number of items in DB.
     • Complexity of CreateTree is O(|DB|). (A minimal sketch follows.)
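A minimal Python sketch of the two-pass construction above; FPNode and build_fptree are illustrative names, not the authors' implementation:

    from collections import defaultdict

    class FPNode:
        """One FP-tree node: an item, a count, and child links."""
        def __init__(self, item, parent):
            self.item = item
            self.count = 0
            self.parent = parent
            self.children = {}

    def build_fptree(transactions, min_support):
        # Pass 1: count item frequencies over the whole DB.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_support}

        root = FPNode(None, None)      # the null root
        header = defaultdict(list)     # item -> all tree nodes for that item

        # Pass 2: insert each transaction with items in descending frequency
        # (ties broken alphabetically here, so the exact tree shape can
        # differ from the slides, which order f before c).
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)   # connect to header list
                node = node.children[item]
                node.count += 1
        return root, header, counts

Both passes touch each transaction item a constant number of times, which is the O(|DB|) bound claimed above.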

  7. FP-Growth Example (header table, FP-tree, and transaction table repeated from slide 4)

     Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
     [Figure: FP-tree. Root {} with branch f:4 -> c:3 -> a:3 -> (m:2 -> p:2; b:1 -> m:1), branch f:4 -> b:1, and branch c:1 -> b:1 -> p:1.]

  8. FP-Growth Algorithm: FP-Growth
     Input: FP-Tree, f_is (frequent itemset)
     Output: all freq_patterns

     if FP-Tree contains a single path P then
         for each combination of nodes in P
             generate pattern f_is ∪ combination
     else
         for each item i in header
             generate pattern f_is ∪ i
             construct the pattern's conditional cFP-Tree
             if (cFP-Tree ≠ null) then
                 call FP-Growth(cFP-Tree, pattern)

     • Recursive algorithm: creates FP-Tree structures and calls FP-Growth on them.
     • Claims no candidate generation. (A sketch of the recursion follows.)
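A sketch of the recursive mining step, reusing FPNode and build_fptree from the construction sketch above; representing conditional pattern bases as plain transaction lists is an illustrative simplification of the paper's tree-based version:

    from itertools import combinations

    def single_path(node):
        """Return [(item, count), ...] if the tree is one path, else None."""
        path = []
        while node.children:
            if len(node.children) > 1:
                return None
            node = next(iter(node.children.values()))
            path.append((node.item, node.count))
        return path

    def fp_growth(root, header, min_support, suffix=()):
        path = single_path(root)
        if path is not None:
            # Single path: every combination of its nodes joins the suffix.
            for n in range(1, len(path) + 1):
                for combo in combinations(path, n):
                    yield (tuple(i for i, _ in combo) + suffix,
                           min(c for _, c in combo))
            return
        for item, nodes in header.items():
            support = sum(n.count for n in nodes)
            if support < min_support:
                continue
            pattern = (item,) + suffix
            yield pattern, support
            # Conditional pattern base: the prefix path of each node,
            # repeated once per count.
            base = []
            for n in nodes:
                prefix, p = [], n.parent
                while p is not None and p.item is not None:
                    prefix.append(p.item)
                    p = p.parent
                base.extend([list(reversed(prefix))] * n.count)
            ctree, cheader, _ = build_fptree(base, min_support)
            if cheader:
                yield from fp_growth(ctree, cheader, min_support, pattern)

    # Usage: root, header, _ = build_fptree(transactions, 3)
    #        for pattern, sup in fp_growth(root, header, 3): print(pattern, sup)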

  9. Two-Part Algorithm: single-path case
     [Figure: single-path tree A -> B -> C -> D]

     if FP-Tree contains a single path P then
         for each combination of nodes in P
             generate pattern f_is ∪ combination

     • e.g., { A B C D }, p = |pattern| = 4
     • Patterns: AD, BD, CD, ABD, ACD, BCD, ABCD
     • Count: Σ(n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7
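The listed patterns and the count can be checked mechanically. A small sketch, assuming the path nodes are A, B, C and the current suffix f_is is D:

    from itertools import combinations

    path_nodes = ["A", "B", "C"]   # nodes on the single path above the suffix
    f_is = "D"                     # the current frequent-itemset suffix

    patterns = ["".join(c) + f_is
                for n in range(1, len(path_nodes) + 1)
                for c in combinations(path_nodes, n)]
    print(patterns)       # ['AD', 'BD', 'CD', 'ABD', 'ACD', 'BCD', 'ABCD']
    print(len(patterns))  # 7 = 2**(p-1) - 1 with p = 4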

  10. Two-Part Algorithm: multi-path case
     [Figure: tree with root -> A -> (B -> C -> D; C -> D -> E; D -> E) and root -> E]

     else
         for each item i in header
             generate pattern f_is ∪ i
             construct the pattern's conditional cFP-Tree
             if (cFP-Tree ≠ null) then
                 call FP-Growth(cFP-Tree, pattern)

     • e.g., { A B C D E } for f_is = D:
       • i = A: p_base = (ABCD), (ACD), (AD)
       • i = B: p_base = (ABCD)
       • i = C: p_base = (ABCD), (ACD)
       • i = E: p_base = null
     • Pattern bases are generated, for each header item, by following the header link to each tree node holding that item and walking its path up to the tree root.

  11. FP-Growth Complexity
     • Each path in the tree is at least partially traversed once for every item in that path (the depth of the path) times the number of items in the header.
     • Searching through all paths is therefore bounded by O(header_count² × tree depth).
     • Creation of a new cFP-Tree also occurs at each recursive step.

  12. Sample Data FP-Tree
     [Figure: a sample FP-tree. Root null with branches A, B, C, D, ... fanning out through C, D, E, ... down to J, K, L, M leaves, with K, L, and M nodes repeated across branches.]

  13. Algorithm Results (in English)
     • Candidate-generation sets are exchanged for FP-Trees.
     • You MUST take into account all paths that contain an itemset with a test item.
     • You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
     • Trivial examples obscure these costs, leading to a belief that FP-Growth operates more efficiently.

  14. FP-Growth Mining Example

     Transactions: A ×49, B ×44, F ×39, G ×34, H ×30, I ×20, plus ABGM, BGIM, FGIM, GHIM
     Header: A B F G H I M
     [Figure: FP-tree. Root Null with branches A:50 (-> B:1 -> G:1 -> M:1), B:45 (-> G:1 -> I:1 -> M:1), F:40 (-> G:1 -> I:1 -> M:1), G:35 (-> H:1 -> I:1 -> M:1), H:30, and I:20.]

  15. FP-Growth Mining Example (freq_itemset = { M })
     i = A: pattern = { A M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  16. FP-Growth Mining Example (freq_itemset = { M })
     i = B: pattern = { B M }, support = 2, pattern base yields a cFP-Tree (next slide)
     (Tree and header as in slide 14.)

  17. FP-Growth Mining Example (freq_itemset = { B M })
     [Figure: conditional FP-tree with header G B M and nodes G:2, B:2, M:2.]
     Patterns mined: BM, GM, BGM
     All patterns so far: BM, GM, BGM

  18. FP-Growth Mining Example (freq_itemset = { M })
     i = F: pattern = { F M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  19. FP-Growth Mining Example (freq_itemset = { M })
     i = G: pattern = { G M }, support = 4, pattern base yields a cFP-Tree (next slide)
     (Tree and header as in slide 14.)

  20. FP-Growth Mining Example (freq_itemset = { G M })
     [Figure: conditional FP-tree for { G M } with header I B G M and nodes I:3, B:1, B:1, G:2, G:1, G:1, M:2, M:1, M:1.]
     i = I: pattern = { I G M }, support = 3, pattern base yields a cFP-Tree (next slide)

  21. FP-Growth Mining Example (freq_itemset = { I G M })
     [Figure: conditional FP-tree with header I G M and nodes I:3, G:3, M:3.]
     Patterns mined: IM, GM, IGM
     All patterns so far: BM, GM, BGM, IM, GM, IGM

  22. FP-Growth Mining Example (freq_itemset = { G M })
     [Figure: conditional FP-tree as in slide 20.]
     i = B: pattern = { B G M }, support = 2, pattern base yields a cFP-Tree (next slide)

  23. FP-Growth Mining Example (freq_itemset = { B G M })
     [Figure: conditional FP-tree with header B G M and nodes B:2, G:2, M:2.]
     Patterns mined: BM, GM, BGM
     All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

  24. FP-Growth Mining Example (freq_itemset = { M })
     i = H: pattern = { H M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  25. FP-Growth Mining Example
     • Complete pass for { M }
     • Move on to { I }, { H }, etc.
     • Final frequent sets (verified by the sketch below):
       • L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
       • L3: { BGM, GIM, BGM, GIM }
       • L4: none
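These sets can be sanity-checked by brute force over the toy dataset of slide 14 (a sketch; the dataset is reconstructed from that slide's counts):

    from itertools import combinations
    from collections import Counter

    # Slide-14 dataset: many single-item transactions plus four 4-item ones.
    transactions = ([["A"]] * 49 + [["B"]] * 44 + [["F"]] * 39 +
                    [["G"]] * 34 + [["H"]] * 30 + [["I"]] * 20 +
                    [list("ABGM"), list("BGIM"), list("FGIM"), list("GHIM")])
    min_support = 2

    counts = Counter()
    for t in transactions:
        for n in range(2, len(t) + 1):
            for s in combinations(sorted(t), n):
                counts[s] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_support}
    print(frequent)
    # Five distinct 2-sets (BM:2, GM:4, IM:3, BG:2, GI:3) and two distinct
    # 3-sets (BGM:2, GIM:3); no 4-set reaches support 2.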

  26. FP-Growth Redundancy vs. Apriori Candidate Sets
     • FP-Growth (support = 2) generates:
       • L2: 10 sets (5 distinct)
       • L3: 4 sets (2 distinct)
       • Total: 14 sets
     • Apriori (support = 2) generates:
       • C2: 21 sets
       • C3: 2 sets
       • Total: 23 sets (see the sketch after this list)
     • What about support = 1?
       • Apriori: 23 sets
       • FP-Growth: 28 sets in { M } with i = A, B, F alone!
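The Apriori counts follow from the join-and-prune step: the 7 frequent items give C(7, 2) = 21 pairs in C2, and joining the 5 distinct frequent pairs yields 2 triples. A small illustrative sketch:

    from itertools import combinations

    # Frequent 1-itemsets at support = 2: A, B, F, G, H, I, M (7 items).
    L1 = ["A", "B", "F", "G", "H", "I", "M"]
    C2 = list(combinations(L1, 2))
    print(len(C2))  # 21

    # The 5 distinct frequent 2-itemsets found in the data.
    L2 = [("B", "G"), ("B", "M"), ("G", "I"), ("G", "M"), ("I", "M")]
    L2_set = set(L2)
    C3 = set()
    for a, b in combinations(L2, 2):
        if a[:-1] == b[:-1]:                    # join: shared (k-1)-prefix
            cand = a + (b[-1],)
            if all(s in L2_set for s in combinations(cand, 2)):  # prune
                C3.add(cand)
    print(sorted(C3))  # [('B', 'G', 'M'), ('G', 'I', 'M')] -> 2 candidates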

  27. FP-Growth vs. Apriori
     • Apriori visits each transaction when generating a new candidate set; FP-Growth does not
       • Data structures can be used to reduce the transaction list
     • FP-Growth traces the set of concurrent items; Apriori generates candidate sets
     • FP-Growth uses more complicated data structures & mining techniques

  28. Algorithm Analysis Results
     • FP-Growth IS NOT inherently faster than Apriori
       • Intuitively, it appears to condense data
       • The mining scheme requires new work to replace candidate set generation
       • Recursion obscures the additional effort
     • FP-Growth may run faster than Apriori in some circumstances
     • Complexity analysis alone cannot guarantee which algorithm is more efficient

  29. Improvements to FP-Growth
     • None currently reported, apart from MLFPT (Multiple Local Frequent Pattern Tree)
       • A new algorithm based on FP-Growth
       • Distributes FP-Trees among processors
       • No reported complexity analysis or accuracy comparison against FP-Growth

  30. Real World Applications
     • Zheng, Kohavi, Mason: "Real World Performance of Association Rule Algorithms"
     • Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus
     • Tested the implementations against 1 artificial and 3 real datasets
     • Generated time-based comparisons

  31. Apriori & FP-Growth
     • Apriori
       • Implementation by Christian Borgelt (GNU General Public License)
       • Written in C
       • Entire dataset loaded into memory
     • FP-Growth
       • Implementation from creators Han & Pei
       • Version of February 5, 2001

  32. Other Algorithms
     • CHARM
       • Based on the concept of closed itemsets (an itemset is closed if no proper superset has the same support)
       • e.g., the closed set { A, B, C } stands in for ABC, AC, BC, ACB, AB, CB, etc.
     • CLOSET
       • Han & Pei's implementation of closed-itemset mining
     • MagnumOpus
       • Generates rules directly through a search-and-prune technique
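For the closed-itemset idea behind CHARM and CLOSET, a brute-force sketch of the closure test (illustrative only; CHARM's actual algorithm works on vertical tid-lists):

    def support(itemset, transactions):
        """Count the transactions containing every item in itemset."""
        return sum(1 for t in transactions if set(itemset) <= set(t))

    def is_closed(itemset, items, transactions):
        """Closed: no one-item extension has the same support
        (checking one-item extensions suffices)."""
        s = support(itemset, transactions)
        return all(support(set(itemset) | {i}, transactions) < s
                   for i in items if i not in itemset)

    transactions = [list("ABC"), list("ABC"), list("AB"), list("C")]
    print(is_closed({"A", "B"}, "ABC", transactions))  # True  (support 3)
    print(is_closed({"A"}, "ABC", transactions))       # False (AB also has support 3)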

  33. Datasets
     • IBM-Artificial
       • Generated at IBM Almaden (T10I4D100K)
       • Often used in association rule mining studies
     • BMS-POS
       • Years of point-of-sale data from a retailer
     • BMS-WebView-1 & BMS-WebView-2
       • Months of clickstream traffic from e-commerce web sites

  34. Dataset Characteristics
     [Table of dataset characteristics not preserved in the transcript.]

  35. Experimental Considerations
     • Hardware specifications
       • Dual 550 MHz Pentium III Xeon processors
       • 1 GB memory
     • Support thresholds: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
     • Confidence = 0%
     • No other applications running (the second processor handles system processes)

  36. [Charts: runtime vs. support for IBM-Artificial and BMS-POS]

  37. [Charts: runtime vs. support for BMS-WebView-1 and BMS-WebView-2]

  38. Study Results: Real Data
     • At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
     • At support < 0.20%, Apriori completes whenever FP-Growth completes
       • Exception: BMS-WebView-2 @ 0.01%
     • Even when 2 million rules are generated, Apriori finishes in 10 minutes or less
     • Proposed: the bottleneck is NOT the rule algorithm, but rule analysis

  39. Real Data Results

  40. Study Results: Artificial Data
     • At support < 0.40%, FP-Growth performs MUCH faster than Apriori
     • At support ≥ 0.40%, FP-Growth and Apriori are comparable

  41. Real-World Study Conclusions
     • FP-Growth (and other non-Apriori algorithms) perform better on artificial data
     • On all datasets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets
     • FP-Growth may be suitable when low support, a large result count, and fast generation are needed
     • Future research may best be directed toward analyzing association rules

  42. Research Conclusions
     • FP-Growth does not have better complexity than Apriori
       • Common sense suggests it should run faster, but the analysis does not bear this out
     • FP-Growth does not always have a better running time than Apriori
       • Support level and dataset appear more influential
     • FP-Trees are very complex structures (Apriori's are simple)
     • Location of data (memory vs. disk) is a non-factor in comparing the algorithms

  43. To Use or Not To Use?
     • Question: should FP-Growth be used in favor of Apriori?
       • Difficulty to code
       • High performance in extreme cases
       • Personal preference
     • More relevant questions:
       • What kind of data is it?
       • What kind of results do I want?
       • How will I analyze the resulting rules?

  44. References
     • Han, Jiawei, Pei, Jian, and Yin, Yiwen, "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
     • Orlando, Salvatore, "High Performance Mining of Short and Long Patterns". 2001.
     • Pei, Jian, Han, Jiawei, and Mao, Runying, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
     • Webb, Geoffrey I., "Efficient search for association rules". In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
     • Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001.
     • Zaki, Mohammed J., "Generating Non-Redundant Association Rules". In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
     • Zheng, Zijian, Kohavi, Ron, and Mason, Llew, "Real World Performance of Association Rule Algorithms". In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
