
The FP-Growth/Apriori Debate



  1. The FP-Growth/Apriori Debate
     Jeffrey R. Ellis, CSE 300-01, April 11, 2002

  2. Presentation Overview
     • FP-Growth Algorithm: refresher & example; motivation
     • FP-Growth Complexity: vs. Apriori complexity; saving calculation or hiding work?
     • Real World Application: datasets are not created equal; results of real-world implementation

  3. FP-Growth Algorithm
     • Association Rule Mining
       • Generate Frequent Itemsets: Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets)
       • Find Association Rules: outside the scope of both FP-Growth & Apriori
     • Therefore, FP-Growth is a competitor to Apriori

  4. FP-Growth Example (min_support = 3)

     Transactions:
     TID | Items bought             | (Ordered) frequent items
     100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
     200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
     300 | {b, f, h, j, o}          | {f, b}
     400 | {b, c, k, s, p}          | {c, b, p}
     500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

     Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

     [Figure: FP-tree. Root {} with branch f:4 -> c:3 -> a:3 -> (m:2 -> p:2; b:1 -> m:1), branch f:4 -> b:1, and branch c:1 -> b:1 -> p:1.]

     Conditional pattern bases:
     item | cond. pattern base
     c    | f:3
     a    | fc:3
     b    | fca:1, f:1, c:1
     m    | fca:2, fcab:1
     p    | fcam:2, cb:1

  5. FP-Growth Example: conditional pattern bases and conditional FP-trees

     Item | Conditional pattern-base | Conditional FP-tree
     p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
     m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
     b    | {(fca:1), (f:1), (c:1)}  | Empty
     a    | {(fc:3)}                 | {(f:3, c:3)}|a
     c    | {(f:3)}                  | {(f:3)}|c
     f    | Empty                    | Empty
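The conditional pattern bases in this table come from the prefix paths above each occurrence of an item. Here is a minimal Python sketch (names illustrative, not from Han & Pei's code) that derives the same bases directly from the slide-4 ordered transactions, which is equivalent in aggregate to walking the tree's header links:

    from collections import Counter

    # Ordered frequent-item lists from slide 4 (TIDs 100-500).
    ordered = [list("fcamp"), list("fcabm"), list("fb"),
               list("cbp"), list("fcamp")]

    def cond_pattern_base(item, transactions):
        """Collect the prefix of each transaction up to (excluding) item."""
        base = Counter()
        for t in transactions:
            if item in t:
                prefix = tuple(t[:t.index(item)])
                if prefix:
                    base[prefix] += 1
        return base

    print(cond_pattern_base("m", ordered))
    # {('f','c','a'): 2, ('f','c','a','b'): 1} -- the table's row for m
    print(cond_pattern_base("p", ordered))
    # {('f','c','a','m'): 2, ('c','b'): 1}     -- the table's row for p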

  6. FP-Growth Algorithm: FP-CreateTree
     Input: DB, min_support
     Output: FP-Tree

     Scan DB & count all frequent items.
     Create null root & set as current node.
     For each transaction T:
         Sort T's frequent items by descending frequency.
         For each sorted item I:
             Insert I into the tree as a child of the current node (reusing an existing child if present).
             Connect each new tree node to its header list.

     • Two passes through DB.
     • Tree-creation work is proportional to the total number of items in DB.
     • Complexity of CreateTree is O(|DB|). (A minimal sketch follows.)
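A minimal Python sketch of the two-pass construction above; FPNode and build_fptree are illustrative names, not the authors' implementation:

    from collections import defaultdict

    class FPNode:
        """One FP-tree node: an item, a count, and child links."""
        def __init__(self, item, parent):
            self.item = item
            self.count = 0
            self.parent = parent
            self.children = {}

    def build_fptree(transactions, min_support):
        # Pass 1: count item frequencies over the whole DB.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_support}

        root = FPNode(None, None)      # the null root
        header = defaultdict(list)     # item -> all tree nodes for that item

        # Pass 2: insert each transaction with items in descending frequency
        # (ties broken alphabetically here, so the exact tree shape can
        # differ from the slides, which order f before c).
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)   # connect to header list
                node = node.children[item]
                node.count += 1
        return root, header, counts

Both passes touch each transaction item a constant number of times, which is the O(|DB|) bound claimed above.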

  7. FP-Growth Example (header table, FP-tree, and transaction table repeated from slide 4)

     Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
     [Figure: FP-tree. Root {} with branch f:4 -> c:3 -> a:3 -> (m:2 -> p:2; b:1 -> m:1), branch f:4 -> b:1, and branch c:1 -> b:1 -> p:1.]

  8. FP-Growth Algorithm: FP-Growth
     Input: FP-Tree, f_is (frequent itemset)
     Output: all freq_patterns

     if FP-Tree contains a single path P then
         for each combination of nodes in P
             generate pattern f_is ∪ combination
     else
         for each item i in header
             generate pattern f_is ∪ i
             construct the pattern's conditional cFP-Tree
             if (cFP-Tree ≠ null) then
                 call FP-Growth(cFP-Tree, pattern)

     • Recursive algorithm: creates FP-Tree structures and calls FP-Growth on them.
     • Claims no candidate generation. (A sketch of the recursion follows.)
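A sketch of the recursive mining step, reusing FPNode and build_fptree from the construction sketch above; representing conditional pattern bases as plain transaction lists is an illustrative simplification of the paper's tree-based version:

    from itertools import combinations

    def single_path(node):
        """Return [(item, count), ...] if the tree is one path, else None."""
        path = []
        while node.children:
            if len(node.children) > 1:
                return None
            node = next(iter(node.children.values()))
            path.append((node.item, node.count))
        return path

    def fp_growth(root, header, min_support, suffix=()):
        path = single_path(root)
        if path is not None:
            # Single path: every combination of its nodes joins the suffix.
            for n in range(1, len(path) + 1):
                for combo in combinations(path, n):
                    yield (tuple(i for i, _ in combo) + suffix,
                           min(c for _, c in combo))
            return
        for item, nodes in header.items():
            support = sum(n.count for n in nodes)
            if support < min_support:
                continue
            pattern = (item,) + suffix
            yield pattern, support
            # Conditional pattern base: the prefix path of each node,
            # repeated once per count.
            base = []
            for n in nodes:
                prefix, p = [], n.parent
                while p is not None and p.item is not None:
                    prefix.append(p.item)
                    p = p.parent
                base.extend([list(reversed(prefix))] * n.count)
            ctree, cheader, _ = build_fptree(base, min_support)
            if cheader:
                yield from fp_growth(ctree, cheader, min_support, pattern)

    # Usage: root, header, _ = build_fptree(transactions, 3)
    #        for pattern, sup in fp_growth(root, header, 3): print(pattern, sup)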

  9. Two-Part Algorithm: single-path case
     [Figure: single-path tree A -> B -> C -> D]

     if FP-Tree contains a single path P then
         for each combination of nodes in P
             generate pattern f_is ∪ combination

     • e.g., { A B C D }, p = |pattern| = 4
     • Patterns: AD, BD, CD, ABD, ACD, BCD, ABCD
     • Count: Σ(n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7
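The listed patterns and the count can be checked mechanically. A small sketch, assuming the path nodes are A, B, C and the current suffix f_is is D:

    from itertools import combinations

    path_nodes = ["A", "B", "C"]   # nodes on the single path above the suffix
    f_is = "D"                     # the current frequent-itemset suffix

    patterns = ["".join(c) + f_is
                for n in range(1, len(path_nodes) + 1)
                for c in combinations(path_nodes, n)]
    print(patterns)       # ['AD', 'BD', 'CD', 'ABD', 'ACD', 'BCD', 'ABCD']
    print(len(patterns))  # 7 = 2**(p-1) - 1 with p = 4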

  10. Two-Part Algorithm: multi-path case
     [Figure: tree with root -> A -> (B -> C -> D; C -> D -> E; D -> E) and root -> E]

     else
         for each item i in header
             generate pattern f_is ∪ i
             construct the pattern's conditional cFP-Tree
             if (cFP-Tree ≠ null) then
                 call FP-Growth(cFP-Tree, pattern)

     • e.g., { A B C D E } for f_is = D:
       • i = A: p_base = (ABCD), (ACD), (AD)
       • i = B: p_base = (ABCD)
       • i = C: p_base = (ABCD), (ACD)
       • i = E: p_base = null
     • Pattern bases are generated, for each header item, by following the header link to each tree node holding that item and walking its path up to the tree root.

  11. FP-Growth Complexity
     • Each path in the tree is at least partially traversed once for every item in that path (the depth of the path) times the number of items in the header.
     • Searching through all paths is therefore bounded by O(header_count² × tree depth).
     • Creation of a new cFP-Tree also occurs at each recursive step.

  12. Sample Data FP-Tree
     [Figure: a sample FP-tree. Root null with branches A, B, C, D, ... fanning out through C, D, E, ... down to J, K, L, M leaves, with K, L, and M nodes repeated across branches.]

  13. Algorithm Results (in English)
     • Candidate-generation sets are exchanged for FP-Trees.
     • You MUST take into account all paths that contain an itemset with a test item.
     • You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
     • Trivial examples obscure these costs, leading to a belief that FP-Growth operates more efficiently.

  14. FP-Growth Mining Example

     Transactions: A ×49, B ×44, F ×39, G ×34, H ×30, I ×20, plus ABGM, BGIM, FGIM, GHIM
     Header: A B F G H I M
     [Figure: FP-tree. Root Null with branches A:50 (-> B:1 -> G:1 -> M:1), B:45 (-> G:1 -> I:1 -> M:1), F:40 (-> G:1 -> I:1 -> M:1), G:35 (-> H:1 -> I:1 -> M:1), H:30, and I:20.]

  15. FP-Growth Mining Example (freq_itemset = { M })
     i = A: pattern = { A M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  16. FP-Growth Mining Example (freq_itemset = { M })
     i = B: pattern = { B M }, support = 2, pattern base yields a cFP-Tree (next slide)
     (Tree and header as in slide 14.)

  17. FP-Growth Mining Example (freq_itemset = { B M })
     [Figure: conditional FP-tree with header G B M and nodes G:2, B:2, M:2.]
     Patterns mined: BM, GM, BGM
     All patterns so far: BM, GM, BGM

  18. FP-Growth Mining Example (freq_itemset = { M })
     i = F: pattern = { F M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  19. FP-Growth Mining Example (freq_itemset = { M })
     i = G: pattern = { G M }, support = 4, pattern base yields a cFP-Tree (next slide)
     (Tree and header as in slide 14.)

  20. FP-Growth Mining Example (freq_itemset = { G M })
     [Figure: conditional FP-tree for { G M } with header I B G M and nodes I:3, B:1, B:1, G:2, G:1, G:1, M:2, M:1, M:1.]
     i = I: pattern = { I G M }, support = 3, pattern base yields a cFP-Tree (next slide)

  21. FP-Growth Mining Example (freq_itemset = { I G M })
     [Figure: conditional FP-tree with header I G M and nodes I:3, G:3, M:3.]
     Patterns mined: IM, GM, IGM
     All patterns so far: BM, GM, BGM, IM, GM, IGM

  22. FP-Growth Mining Example (freq_itemset = { G M })
     [Figure: conditional FP-tree as in slide 20.]
     i = B: pattern = { B G M }, support = 2, pattern base yields a cFP-Tree (next slide)

  23. FP-Growth Mining Example (freq_itemset = { B G M })
     [Figure: conditional FP-tree with header B G M and nodes B:2, G:2, M:2.]
     Patterns mined: BM, GM, BGM
     All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

  24. FP-Growth Mining Example (freq_itemset = { M })
     i = H: pattern = { H M }, support = 1, pattern-base cFP-Tree = null
     (Tree and header as in slide 14.)

  25. FP-Growth Mining Example
     • Complete pass for { M }
     • Move on to { I }, { H }, etc.
     • Final frequent sets (verified by the sketch below):
       • L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
       • L3: { BGM, GIM, BGM, GIM }
       • L4: none
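These sets can be sanity-checked by brute force over the toy dataset of slide 14 (a sketch; the dataset is reconstructed from that slide's counts):

    from itertools import combinations
    from collections import Counter

    # Slide-14 dataset: many single-item transactions plus four 4-item ones.
    transactions = ([["A"]] * 49 + [["B"]] * 44 + [["F"]] * 39 +
                    [["G"]] * 34 + [["H"]] * 30 + [["I"]] * 20 +
                    [list("ABGM"), list("BGIM"), list("FGIM"), list("GHIM")])
    min_support = 2

    counts = Counter()
    for t in transactions:
        for n in range(2, len(t) + 1):
            for s in combinations(sorted(t), n):
                counts[s] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_support}
    print(frequent)
    # Five distinct 2-sets (BM:2, GM:4, IM:3, BG:2, GI:3) and two distinct
    # 3-sets (BGM:2, GIM:3); no 4-set reaches support 2.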

  26. FP-Growth Redundancy vs. Apriori Candidate Sets
     • FP-Growth (support = 2) generates:
       • L2: 10 sets (5 distinct)
       • L3: 4 sets (2 distinct)
       • Total: 14 sets
     • Apriori (support = 2) generates:
       • C2: 21 sets
       • C3: 2 sets
       • Total: 23 sets (see the sketch after this list)
     • What about support = 1?
       • Apriori: 23 sets
       • FP-Growth: 28 sets in { M } with i = A, B, F alone!
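The Apriori counts follow from the join-and-prune step: the 7 frequent items give C(7, 2) = 21 pairs in C2, and joining the 5 distinct frequent pairs yields 2 triples. A small illustrative sketch:

    from itertools import combinations

    # Frequent 1-itemsets at support = 2: A, B, F, G, H, I, M (7 items).
    L1 = ["A", "B", "F", "G", "H", "I", "M"]
    C2 = list(combinations(L1, 2))
    print(len(C2))  # 21

    # The 5 distinct frequent 2-itemsets found in the data.
    L2 = [("B", "G"), ("B", "M"), ("G", "I"), ("G", "M"), ("I", "M")]
    L2_set = set(L2)
    C3 = set()
    for a, b in combinations(L2, 2):
        if a[:-1] == b[:-1]:                    # join: shared (k-1)-prefix
            cand = a + (b[-1],)
            if all(s in L2_set for s in combinations(cand, 2)):  # prune
                C3.add(cand)
    print(sorted(C3))  # [('B', 'G', 'M'), ('G', 'I', 'M')] -> 2 candidates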

  27. FP-Growth vs. Apriori
     • Apriori visits each transaction when generating a new candidate set; FP-Growth does not
       • Data structures can be used to reduce the transaction list
     • FP-Growth traces the set of concurrent items; Apriori generates candidate sets
     • FP-Growth uses more complicated data structures & mining techniques

  28. Algorithm Analysis Results
     • FP-Growth IS NOT inherently faster than Apriori
       • Intuitively, it appears to condense data
       • The mining scheme requires new work to replace candidate set generation
       • Recursion obscures the additional effort
     • FP-Growth may run faster than Apriori in some circumstances
     • Complexity analysis alone cannot guarantee which algorithm is more efficient

  29. Improvements to FP-Growth
     • None currently reported, apart from MLFPT (Multiple Local Frequent Pattern Tree)
       • A new algorithm based on FP-Growth
       • Distributes FP-Trees among processors
       • No reported complexity analysis or accuracy comparison against FP-Growth

  30. Real World Applications
     • Zheng, Kohavi, Mason: "Real World Performance of Association Rule Algorithms"
     • Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus
     • Tested the implementations against 1 artificial and 3 real datasets
     • Generated time-based comparisons

  31. Apriori & FP-Growth
     • Apriori
       • Implementation by Christian Borgelt (GNU General Public License)
       • Written in C
       • Entire dataset loaded into memory
     • FP-Growth
       • Implementation from creators Han & Pei
       • Version of February 5, 2001

  32. Other Algorithms
     • CHARM
       • Based on the concept of closed itemsets (an itemset is closed if no proper superset has the same support)
       • e.g., the closed set { A, B, C } stands in for ABC, AC, BC, ACB, AB, CB, etc.
     • CLOSET
       • Han & Pei's implementation of closed-itemset mining
     • MagnumOpus
       • Generates rules directly through a search-and-prune technique
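For the closed-itemset idea behind CHARM and CLOSET, a brute-force sketch of the closure test (illustrative only; CHARM's actual algorithm works on vertical tid-lists):

    def support(itemset, transactions):
        """Count the transactions containing every item in itemset."""
        return sum(1 for t in transactions if set(itemset) <= set(t))

    def is_closed(itemset, items, transactions):
        """Closed: no one-item extension has the same support
        (checking one-item extensions suffices)."""
        s = support(itemset, transactions)
        return all(support(set(itemset) | {i}, transactions) < s
                   for i in items if i not in itemset)

    transactions = [list("ABC"), list("ABC"), list("AB"), list("C")]
    print(is_closed({"A", "B"}, "ABC", transactions))  # True  (support 3)
    print(is_closed({"A"}, "ABC", transactions))       # False (AB also has support 3)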

  33. Datasets
     • IBM-Artificial
       • Generated at IBM Almaden (T10I4D100K)
       • Often used in association rule mining studies
     • BMS-POS
       • Years of point-of-sale data from a retailer
     • BMS-WebView-1 & BMS-WebView-2
       • Months of clickstream traffic from e-commerce web sites

  34. Dataset Characteristics
     [Table of dataset characteristics not preserved in the transcript.]

  35. Experimental Considerations
     • Hardware specifications
       • Dual 550 MHz Pentium III Xeon processors
       • 1 GB memory
     • Support thresholds: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
     • Confidence = 0%
     • No other applications running (the second processor handles system processes)

  36. [Charts: runtime vs. support for IBM-Artificial and BMS-POS]

  37. [Charts: runtime vs. support for BMS-WebView-1 and BMS-WebView-2]

  38. Study Results: Real Data
     • At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
     • At support < 0.20%, Apriori completes whenever FP-Growth completes
       • Exception: BMS-WebView-2 @ 0.01%
     • Even when 2 million rules are generated, Apriori finishes in 10 minutes or less
     • Proposed: the bottleneck is NOT the rule algorithm, but rule analysis

  39. Real Data Results

  40. Study Results: Artificial Data
     • At support < 0.40%, FP-Growth performs MUCH faster than Apriori
     • At support ≥ 0.40%, FP-Growth and Apriori are comparable

  41. Real-World Study Conclusions
     • FP-Growth (and other non-Apriori algorithms) perform better on artificial data
     • On all datasets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets
     • FP-Growth may be suitable when low support, a large result count, and fast generation are needed
     • Future research may best be directed toward analyzing association rules

  42. Research Conclusions
     • FP-Growth does not have better complexity than Apriori
       • Common sense suggests it should run faster, but the analysis does not bear this out
     • FP-Growth does not always have a better running time than Apriori
       • Support level and dataset appear more influential
     • FP-Trees are very complex structures (Apriori's are simple)
     • Location of data (memory vs. disk) is a non-factor in comparing the algorithms

  43. To Use or Not To Use?
     • Question: should FP-Growth be used in favor of Apriori?
       • Difficulty to code
       • High performance in extreme cases
       • Personal preference
     • More relevant questions:
       • What kind of data is it?
       • What kind of results do I want?
       • How will I analyze the resulting rules?

  44. References
     • Han, Jiawei, Pei, Jian, and Yin, Yiwen, "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
     • Orlando, Salvatore, "High Performance Mining of Short and Long Patterns". 2001.
     • Pei, Jian, Han, Jiawei, and Mao, Runying, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
     • Webb, Geoffrey I., "Efficient search for association rules". In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
     • Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001.
     • Zaki, Mohammed J., "Generating Non-Redundant Association Rules". In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
     • Zheng, Zijian, Kohavi, Ron, and Mason, Llew, "Real World Performance of Association Rule Algorithms". In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
