

1. CSCI6405 class project Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen Chen

2. Outline • Motivation • Dataset • Apriori based hash tree algorithm • FP-tree algorithm • Conclusion • References

3. Motivation • Make the time of generating rules as short as possible! • Understand the three algorithms • Apriori algorithm • Apriori with hash tree algorithm • FP-tree algorithm • Learn how to improve an algorithm

4. Dataset • IBM dataset generator • Can set the number of items • Can set the minimal support • Can set the dataset size • Example transactions (a generator sketch follows below):
Tid  Items
1    2 5 8 9
2    3 4 6 7 12
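To make those knobs concrete, here is a minimal Python stand-in for such a generator; the function name generate_dataset, its parameters, and the Gaussian transaction-length model are illustrative assumptions, not the IBM tool's actual interface:

```python
# Hypothetical stand-in for the IBM dataset generator: it only mimics
# the knobs above (item count, dataset size, average transaction
# length); the real generator models much richer structure.
import random

def generate_dataset(num_transactions=1000, num_items=20, avg_tx_len=5, seed=42):
    rng = random.Random(seed)
    dataset = []
    for tid in range(1, num_transactions + 1):
        # Transaction length drawn around the requested average.
        size = max(1, min(num_items, int(rng.gauss(avg_tx_len, 1.5))))
        items = rng.sample(range(1, num_items + 1), size)
        dataset.append((tid, sorted(items)))
    return dataset

for tid, items in generate_dataset(num_transactions=3):
    print(tid, items)
```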

5. Apriori principle • A candidate generation-and-test approach [4] • Every subset of a frequent itemset must itself be frequent • If a set is infrequent, its supersets need not be generated and tested • But there are still places that can be improved: • counting the support • the number of I/O scans • (a generation-and-prune sketch follows below)
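A minimal sketch of the generate-and-prune step implied by the Apriori principle, assuming itemsets are stored as frozensets; apriori_gen is our own helper name, not code from the project:

```python
# Sketch of Apriori candidate generation: join frequent (k-1)-itemsets,
# then prune any candidate that has an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(frequent_km1):
    """frequent_km1: set of frozensets, all of size k-1."""
    k = len(next(iter(frequent_km1))) + 1
    candidates = set()
    for a in frequent_km1:
        for b in frequent_km1:
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in frequent_km1
                    for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

freq_2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 6)]}
print(apriori_gen(freq_2))   # only {1, 2, 3} survives the pruning
```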

6. Apriori Hash Tree Alg • There are l candidate k-itemsets • There are n transactions • The average transaction size is m • Cost of calculating the support counts: • Original Apriori: every candidate is checked against every transaction, roughly O(n · l · (m·k)) • With a hash tree: O(n · log(l) · (m·k)), i.e., the factor l drops to log(l) • (a naive counting loop for comparison follows below)
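For contrast, the naive counting loop whose cost the hash tree attacks: every one of the l candidates is tested against every one of the n transactions (a small illustrative sketch, not the project's actual code):

```python
# Naive support counting: n * l subset tests, each costing up to
# O(m) set operations on a transaction of average size m.
def count_supports(transactions, candidates):
    support = {c: 0 for c in candidates}
    for tx in transactions:              # n transactions
        tx = set(tx)
        for c in candidates:             # l candidates
            if c <= tx:                  # subset test
                support[c] += 1
    return support

txs = [{1, 2, 5, 8, 9}, {2, 3, 4, 6, 7, 12}]
cands = [frozenset({1, 2}), frozenset({2, 3})]
print(count_supports(txs, cands))        # each candidate has support 1
```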

7. Apriori Hash Tree Alg • Candidates are stored in a hash tree structure • [Figure: a 1-itemset candidate hash tree, with per-node counts in parentheses] • (a sketch of the structure follows below)
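One possible shape for that structure, assuming a simple modulo hash at each tree level; the bucket count, dict-based nodes, and function names are illustrative choices, not the project's actual design:

```python
# Minimal candidate hash tree: level d hashes the d-th item of a sorted
# candidate, so a transaction only reaches the leaves its items hash to.
NUM_BUCKETS = 3

def insert(node, cand):
    for item in cand:                        # one hash step per level
        node = node.setdefault(item % NUM_BUCKETS, {})
    node.setdefault('leaf', []).append(cand)

def matching_candidates(node, tx, start=0, out=None):
    # Collect every stored candidate contained in tx, visiting only
    # the subtrees reachable by hashing items of tx.
    if out is None:
        out = set()
    for cand in node.get('leaf', []):
        if set(cand) <= set(tx):
            out.add(cand)
    for i in range(start, len(tx)):
        child = node.get(tx[i] % NUM_BUCKETS)
        if child is not None:
            matching_candidates(child, tx, i + 1, out)
    return out

root = {}
for cand in [(1, 2), (1, 3), (2, 3), (3, 6)]:
    insert(root, cand)
print(matching_candidates(root, (1, 2, 3)))  # {(1,2), (1,3), (2,3)}
```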

8. Apriori Hash Tree Alg • 1-itemsets, min support = 2: 1(3), 2(4), 3(3), 4(1), 5(1), 6(3)

9. Apriori Hash Tree Alg • 2-itemsets, min support = 2: {1 2}(2), {1 3}(2), {1 6}(1), {2 3}(2), {2 6}(1), {3 6}(2) • 3-itemsets, min support = 2: {1 2 3}(1) • (a full level-wise loop follows below)
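Slides 5 through 9 combined into one compact level-wise loop, run on a hypothetical toy dataset (chosen for illustration; not necessarily the exact transactions behind the counts above):

```python
# Level-wise Apriori: count supports, keep frequent itemsets, join and
# prune to build the next level, and stop when no candidates remain.
from itertools import combinations

def apriori(transactions, min_support):
    txs = [set(t) for t in transactions]
    level = {frozenset({i}) for t in txs for i in t}
    k, frequent = 1, {}
    while level:
        counts = {c: sum(c <= t for t in txs) for c in level}
        freq_k = {c for c, n in counts.items() if n >= min_support}
        frequent.update((c, counts[c]) for c in freq_k)
        k += 1
        level = {a | b for a in freq_k for b in freq_k
                 if len(a | b) == k and all(
                     frozenset(s) in freq_k
                     for s in combinations(a | b, k - 1))}
    return frequent

print(apriori([(1, 2, 3), (2, 3, 6), (1, 2), (1, 3, 6), (2, 5)], 2))
```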

10. FP-tree • Since the mining dataset is usually very large, it is impossible to read all transactions into memory at once. • But I/O scans are very time consuming. • The FP-tree algorithm tries to fit all the information from the dataset into memory, so it only needs to scan the dataset twice.

11. FP-tree • FP-tree algorithm and implementation • By Xiaobo Chen

12. FP-tree (Frequent Pattern Tree) • Mining frequent patterns without candidate generation • Divide-and-conquer methodology: decompose mining tasks into smaller ones

13. FP-tree (Merits of the FP-tree algorithm) • Makes the most of common shared prefixes • Complete and compact: • all information of a transaction is stored in a path • the size is constrained by the dataset; consequently, the longest path corresponds to the longest pattern • the compression ratio can be over 100

14. FP-tree (Construction of the FP-tree), min_support = 3
TID   freq. items bought
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}
Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3
After inserting TID 100, the tree is the single path root → f:1 → c:1 → a:1 → m:1 → p:1

15. FP-tree (construction, cont'd)
After inserting TID 200 ({f, c, a, b, m}), the shared prefix f, c, a is reused and the tree branches:
root → f:2 → c:2 → a:2, with a:2 → m:1 → p:1 (from TID 100) and a:2 → b:1 → m:1 (from TID 200)

16. FP-tree construction (cont'd), min_support = 3
Header table (item, frequency, head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
Final tree after all five transactions (a construction sketch follows below):
root
  f:4
    c:3
      a:3
        m:2 → p:2
        b:1 → m:1
    b:1
  c:1
    b:1
      p:1
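A compact construction sketch matching the example above: scan 1 counts items, scan 2 inserts each transaction with its items in descending frequency order so that prefixes are shared. FPNode, build_fptree, and the list-based header table are our own names; ties in the frequency order are broken by first appearance, so the exact tree shape may differ slightly from the figure:

```python
# Two-scan FP-tree construction for the five-transaction example.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(transactions, min_support):
    counts = Counter(i for t in transactions for i in t)       # scan 1
    order = {i: r for r, (i, _) in enumerate(counts.most_common())}
    root, header = FPNode(None, None), {}
    for t in transactions:                                     # scan 2
        items = sorted((i for i in t if counts[i] >= min_support),
                       key=order.get)
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

txs = [('f','c','a','m','p'), ('f','c','a','b','m'), ('f','b'),
       ('c','b','p'), ('f','c','a','m','p')]
root, header = build_fptree(txs, 3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# totals per item: f:4, c:4, a:3, b:3, m:3, p:3
```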

17. FP-tree (Mining Frequent Patterns Using the FP-tree) • General idea (divide-and-conquer) • Recursively grow frequent pattern paths using the FP-tree • Method: • for each item, construct its conditional pattern base, and then its conditional FP-tree • repeat the process on each newly created conditional FP-tree • until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

18. FP-tree (Mining Frequent Patterns Using the FP-tree) • Start with the last item in the order (i.e., p) • Follow the node-links and traverse only the paths containing p • Accumulate all transformed prefix paths of p to form its conditional pattern base: fcam:2, cb:1 • Constructing a new FP-tree from this pattern base leads to only one branch, c:3 • Thus we derive only one frequent pattern containing p: cp (see the sketch below)
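The same step in code, with p's conditional pattern base taken straight from the slide; only items that remain frequent (support ≥ 3) survive into p's conditional FP-tree:

```python
# Count items in p's conditional pattern base, weighting each prefix
# path by p's count on that path, then filter by min_support.
from collections import Counter

pattern_base_p = [(('f', 'c', 'a', 'm'), 2), (('c', 'b'), 1)]
min_support = 3

counts = Counter()
for path, weight in pattern_base_p:
    for item in path:
        counts[item] += weight

survivors = {i: n for i, n in counts.items() if n >= min_support}
print(survivors)   # {'c': 3} -> a single branch, giving the pattern cp
```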

19. FP-tree (Mining Frequent Patterns Using the FP-tree) • Move to the next least frequent item in the order, i.e., m • Follow the node-links and traverse only the paths containing m • Accumulate all transformed prefix paths of m to form its conditional pattern base: fca:2, fcab:1 • Constructing a new FP-tree from this pattern base leads to the single path f:3 → c:3 → a:3 • From this we derive the frequent patterns fm, cm, am, fcm, fam, cam, fcam (enumerated in the sketch below)
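The single-path shortcut for m, enumerated directly: every non-empty subset of the path {f, c, a} combined with the suffix m is frequent, with no further recursion needed:

```python
# All combinations of the single path f-c-a with the suffix m.
from itertools import combinations

path, suffix = ('f', 'c', 'a'), 'm'
patterns = [''.join(combo) + suffix
            for r in range(1, len(path) + 1)
            for combo in combinations(path, r)]
print(patterns)   # ['fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```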

20. FP-tree (Conditional pattern bases for the example)
Item   Conditional pattern base      Conditional FP-tree
p      {(fcam:2), (cb:1)}            {(c:3)} | p
m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}       empty
a      {(fc:3)}                      {(f:3, c:3)} | a
c      {(f:3)}                       {(f:3)} | c
f      empty                         empty

21. FP-tree (Why is frequent-pattern growth fast?) • Performance studies show that FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection • Reasoning: • no candidate generation, no candidate tests • compact data structure • no repeated database scans • the basic operations are counting and FP-tree building

22. FP-tree • [Figure: expected result: scalability of FP-growth vs. Apriori as the support threshold decreases]

23. Conclusion • FP-tree is faster than the other two algorithms. • The Apriori and hash tree algorithms are easier to implement, and we can easily combine them with other methods or tools (e.g., distributed parallel computing). • The dataset parameters are very important too: • density, size, minimum support, ...

24. References • [1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001 • [2] Jiawei Han, Jian Pei, Yiwen Yin: "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD, 2000 • [3] N. Mamoulis: "Advanced Database Technologies" (slides) • [4] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001
