
Parallel Association Rule Mining


Presentation Transcript


  1. Parallel Association Rule Mining Presented by: Ramoza Ahsan and Xiao Qin November 5th, 2013

  2. Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP tree Mining and growth • Fast Parallel Association Rule mining without candidate generation • More Readings

  3. Association Rule Mining • Association rule mining finds interesting patterns in data; analysis of past transaction data can provide valuable information on customer buying behavior. • A record usually contains the transaction date and the items bought. • The literature has focused mostly on serial mining. • Support and confidence are the parameters of association rule mining.

  4. Association Rule Mining Parameters • The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain X. • The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X U Y)/supp(X). • Example (5 transactions): supp({milk, bread, egg}) = 1/5, and the rule {milk, bread} -> {egg} has confidence = 0.5 (see the sketch below).
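The slide's numbers come from a five-transaction example. Here is a minimal Python sketch of the two definitions; the transaction contents below are hypothetical (the slide only gives supp({milk, bread, egg}) = 1/5 and the 0.5 confidence), chosen so the printed values match.

```python
# Hypothetical 5-transaction dataset consistent with the slide's numbers.
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"egg", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(X U Y) / supp(X) for the rule X -> Y."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread", "egg"}, transactions))       # 0.2  (= 1/5)
print(confidence({"milk", "bread"}, {"egg"}, transactions))  # 0.5
```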

  5. Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP tree Mining and growth • Fast Parallel Association Rule mining without candidate generation • FP tree over Hadoop

  6. Apriori Algorithm Apriori runs in two steps: • Generation of candidate itemsets • Pruning of itemsets that are infrequent Frequent itemsets are generated level-wise. Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent; equivalently, any superset of an infrequent itemset must be infrequent, which is what justifies pruning.

  7. Apriori Algorithm for generating frequent itemsets • Minimum support=2
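Slide 7's worked example (minimum support = 2, taken as an absolute count) is a figure in the original deck. As a stand-in, here is a minimal sketch of the level-wise loop just described; the function name `apriori` and the frozenset representation are my own choices, not from the paper.

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Level-wise Apriori: generate candidates, count, prune."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = {s: counts[s] for s in frequent}
    k = 2
    while frequent:
        # Join step: unite pairs of (k-1)-itemsets whose union has size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent.update((c, counts[c]) for c in frequent)
        k += 1
    return all_frequent
```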

  8. Parallel Association Rule Mining • The paper presents parallel algorithms for generating frequent itemsets. • Each of the N processors has its own private memory and disk. • Data is distributed evenly across the processors' disks. • The Count Distribution algorithm focuses on minimizing communication. • The Data Distribution algorithm exploits the aggregate memory of the system. • The Candidate Distribution algorithm reduces synchronization between processors.

  9. Algorithm 1: Count Distribution • Each processor generates the complete candidate set Ck from the complete frequent itemset Lk-1. • Each processor traverses its local data partition and develops local support counts. • Processors exchange counts to develop the global counts; synchronization is needed here. • Each processor computes Lk from Ck. • Each processor independently makes the (identical) decision to continue or stop.
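A minimal in-process sketch of Count Distribution, with a plain sum standing in for the count-exchange step; `partitions` is a list of N local transaction lists, `candidates` a set of frozensets, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce

def local_counts(candidates, partition):
    """One processor's pass: count every candidate in Ck over its local data only."""
    return Counter({c: sum(c <= t for t in partition) for c in candidates})

def count_distribution(candidates, partitions, min_support):
    # Every processor holds the COMPLETE Ck but only 1/N of the data.
    per_processor = [local_counts(candidates, p) for p in partitions]
    # Stand-in for the synchronization step: exchange and sum local counts.
    global_counts = reduce(lambda a, b: a + b, per_processor)
    # Each processor can now compute the same Lk independently.
    return {c for c, n in global_counts.items() if n >= min_support}
```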

  10. Algorithm 2: Data Distribution • Partition the dataset into N small chunks. • Partition the set of candidate k-itemsets into N exclusive subsets. • Each of the N nodes takes one subset and counts the frequency of its itemsets against one chunk at a time, until it has counted through all the chunks. • Aggregate the counts. (A sketch follows the diagrams below.)

  11. Algorithm 2: Data Distribution [Diagram: each of the N nodes holds 1/N of the data and an exclusive 1/N slice of Ck.]

  12. Algorithm 2: Data Distribution [Diagram: the 1/N data chunks circulate among the nodes, with a synchronize step, so every 1/N slice of Ck is eventually counted against all the data.]
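Following slides 10-12, a minimal in-process sketch of Data Distribution: each node owns an exclusive 1/N slice of Ck and counts it against every data chunk. In the real algorithm the remote chunks arrive over the network; here they are simply iterated. Names are hypothetical.

```python
def data_distribution(candidates, partitions, min_support):
    """Each node takes an exclusive 1/N slice of Ck and streams through ALL chunks."""
    n = len(partitions)
    candidate_slices = [list(candidates)[i::n] for i in range(n)]  # 1/N of Ck per node
    frequent = set()
    for my_candidates in candidate_slices:
        # The node counts its candidate slice against every chunk in turn.
        counts = {c: sum(c <= t for chunk in partitions for t in chunk)
                  for c in my_candidates}
        frequent |= {c for c, cnt in counts.items() if cnt >= min_support}
    return frequent
```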

  13. Algorithm 3: Candidate Distribution • If the workload is not balanced, all processors end up waiting for whichever processor finishes last in every pass. • The Candidate Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.

  14. Algorithm 3: Candidate Distribution [Diagram: the data is split into Data_1 … Data_5 and Lk-1 into Lk-1_1 … Lk-1_5, each yielding its own candidate set Ck_1 … Ck_5.]

  15. Algorithm 3: Candidate Distribution [Diagram: each node pairs one data partition Data_i with one candidate partition (Lk-1_i, Ck_i) and proceeds independently.]

  16. Data Partition and L Partition • Data: each pass, every node grabs the necessary tuples from the dataset. • L: let L3 = {ABC, ABD, ABE, ACD, ACE}, with the items in each itemset lexicographically ordered. Partition the itemsets on their common (k-1)-long prefixes: ABC, ABD, ABE share the prefix AB, while ACD, ACE share AC (see the sketch below).
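A small sketch of the L-partitioning step using the slide's own L3 example. Grouping k-itemsets by their common (k-1)-item prefix keeps together exactly the itemsets that can join with each other, so each group and its descendants can be assigned to one processor.

```python
from collections import defaultdict

def partition_by_prefix(itemsets, k):
    """Group lexicographically ordered k-itemsets by their (k-1)-item prefix."""
    groups = defaultdict(list)
    for s in itemsets:
        ordered = tuple(sorted(s))          # enforce lexicographic item order
        groups[ordered[:k - 1]].append(ordered)
    return dict(groups)

L3 = ["ABC", "ABD", "ABE", "ACD", "ACE"]
print(partition_by_prefix(L3, 3))
# {('A','B'): [('A','B','C'), ('A','B','D'), ('A','B','E')],
#  ('A','C'): [('A','C','D'), ('A','C','E')]}
```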

  17. Rule Generation • Example: from the frequent itemsets ABCDE and AB, the rule AB => CDE can be generated, with support = sup(ABCDE) and confidence = sup(ABCDE)/sup(AB).
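A hedged sketch of rule generation from an already-mined frequent itemset. `support_of` is assumed to map every frequent itemset (as a frozenset) to its support; by the Apriori principle, every subset of a frequent itemset is frequent, so these lookups are all available.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_of, min_conf):
    """Emit rules X -> (itemset \\ X) with confidence supp(itemset)/supp(X) >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support_of[itemset] / support_of[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent,
                              support_of[itemset], conf))
    return rules
```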

  18. Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP tree Mining and growth • Fast Parallel Association Rule mining without candidate generation • FP tree over Hadoop

  19. FP Tree Algorithm Allows frequent itemset discovery without candidate itemset generation: • Step 1: Build a compact data structure called the FP-tree, using two passes over the data set (see the sketch below). • Step 2: Extract frequent itemsets directly from the FP-tree.
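A minimal two-pass FP-tree construction sketch, as just described: pass 1 finds the frequent items and their global frequency order, pass 2 inserts each transaction's frequent items in that order so common prefixes share nodes. The class and function names are mine, not from the paper.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item frequencies; keep frequent items, most frequent first.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {i: c for i, c in sorted(counts.items(), key=lambda x: -x[1])
             if c >= min_support}
    # Pass 2: insert each transaction's frequent items in global frequency
    # order, so transactions with common prefixes share tree nodes.
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, order  # `order` doubles as a simple header table of counts
```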

  20. FP-Tree & FP-Growth example (min supp = 3)

  21. Fast Parallel Association Rule Mining Without Candidate Generation • Phase 1: • Each processor is given an equal number of transactions. • Each processor locally counts the items. • The local counts are summed to get global counts. • Infrequent items are pruned; frequent items are stored in a header table in descending order of frequency. • A parallel frequent pattern tree is then constructed on each processor. • Phase 2: the FP-trees are mined as in the FP-growth algorithm, using the global counts in the header table.

  22. Example with min supp = 4 [Figure: Step 1 shows the raw transactions; Step 4 shows them after pruning infrequent items.]

  23. FP tree for P0 [Figure: processor P0's local FP-tree under construction; node counts (B up to 3, A up to 2, D up to 2, F:1, G:1) accumulate as transactions are inserted.]

  24. Construction of local FP trees

  25. Conditional Pattern Bases

  26. Frequent pattern strings • All frequent pattern trees are shared by all processors. • Each processor generates conditional pattern bases for its assigned items in the header table. • Merging all conditional pattern bases of the same item yields the frequent pattern string. • If an item's support is less than the threshold, it is not added to the final frequent string.

  27. More Readings

  28. FP-Growth on Hadoop • Three MapReduce jobs (a sketch of the first follows).
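The three-job layout appears only as a figure in the deck. As a rough illustration of the first job (parallel item counting), here is an in-process mimic of the map/shuffle/reduce flow; on actual Hadoop these would be Mapper/Reducer classes and the framework would perform the shuffle, so every name below is a stand-in.

```python
from collections import defaultdict

def mapper(transaction):
    for item in transaction:
        yield item, 1            # emit (item, 1) per occurrence

def reducer(item, ones):
    return item, sum(ones)       # global count per item

def run_counting_job(transaction_splits):
    shuffled = defaultdict(list)
    for split in transaction_splits:        # each split -> one map task
        for t in split:
            for key, value in mapper(t):
                shuffled[key].append(value) # shuffle: group values by key
    return dict(reducer(k, v) for k, v in shuffled.items())
```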

  29. FP-Growth on Hadoop Core

  30. Thank You!
