
PARALLEL MINING OF ASSOCIATION RULES

This article presents three parallel algorithms for mining association rules, exploring the trade-offs among computation, communication, memory usage, synchronization, and the use of problem-specific information. The algorithms are implemented on the IBM POWER parallel System SP2, a shared-nothing machine.



Presentation Transcript


  1. PARALLEL MINING OF ASSOCIATION RULES. PRESENTED BY VARUN V KAUSHIK AND PRIYANKA GUNASEKARAN

  2. INTRODUCTION • Mining of association rules on a shared-nothing multiprocessor. • Three parallel algorithms for association rule mining that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and problem-specific information. • All three algorithms are based on the serial Apriori algorithm and are implemented on the IBM POWER parallel System SP2, a shared-nothing machine.

  3. OVERVIEW OF THE SERIAL APRIORI ALGORITHM • PROBLEM DECOMPOSITION: the problem is decomposed into two subproblems: find the frequent itemsets, then use these frequent itemsets to generate the desired rules. • The Apriori algorithm offers superior performance over other serial algorithms and is easy to parallelize. • In the first pass it simply counts item occurrences to determine the frequent 1-itemsets. • In each later pass k it generates the candidate itemsets Ck from Lk-1, scans the data to find the support of the candidates in Ck, and keeps the frequent ones as Lk. (A minimal sketch follows this slide.)
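A minimal, single-machine sketch of the Apriori loop just described (not the paper's implementation; transactions are assumed to be Python sets of item labels, and all helper names are illustrative):

from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: return all frequent itemsets (as frozensets) with their supports."""
    # Pass 1: count single items to obtain the frequent 1-itemsets L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Candidate generation: join L(k-1) with itself, then prune any
        # candidate that has an infrequent (k-1)-subset.
        prev = list(L)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                cand = prev[i] | prev[j]
                if len(cand) == k and all(
                        frozenset(s) in L for s in combinations(cand, k - 1)):
                    Ck.add(cand)
        # Counting pass: scan the data and find the support of each candidate.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

For example, apriori([set('ACD'), set('BCE'), set('ABCE'), set('BE'), set('ABCE')], min_sup=2) runs the small example used on the later Count Distribution slides.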

  4. PARALLEL ALGORITHMS • The three algorithms discussed are: • COUNT DISTRIBUTION ALGORITHM: focuses on minimizing communication, at the expense of carrying out redundant, duplicate computation in parallel. • DATA DISTRIBUTION ALGORITHM: focuses on using the aggregate main memory of the system more effectively, at the cost of heavy communication. • CANDIDATE DISTRIBUTION ALGORITHM: exploits the semantics of the particular problem at hand, both to reduce synchronization between processors and to repartition the database according to the patterns each processor is responsible for.

  5. COUNT DISTRIBUTION ALGORITHM 1. Each processor Pi generates the complete candidate set Ck, using the complete frequent itemset Lk-1 created at the end of pass k-1. 2. Processor Pi makes a pass over its data partition Di and develops local support counts for the candidates in Ck. 3. Each processor Pi exchanges its local Ck counts with all other processors to develop the global Ck counts; this is the only step in which the processors synchronize. 4. Each processor Pi now computes Lk from Ck. 5. Each processor Pi independently decides whether to terminate or continue to the next pass; the decision is identical on all processors because they all hold the identical Lk. (A sketch of one pass follows this slide.)
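As a rough illustration, here is one pass of Count Distribution simulated sequentially in Python; the list of partitions stands in for the processors, and the summation stands in for the count exchange that the SP2 implementation would perform with message passing (a sketch, not the paper's code):

def count_distribution_pass(partitions, Ck, min_sup):
    """One Count Distribution pass, simulated with a list of data partitions
    standing in for the processors.  Real processors would exchange the
    local counts via message passing instead of summing them here."""
    # Step 2: every "processor" counts the identical candidate set Ck
    # over its own partition Di only.
    local_counts = []
    for Di in partitions:
        counts = {c: 0 for c in Ck}
        for t in Di:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        local_counts.append(counts)
    # Step 3: exchange local counts to obtain global counts (here: a sum).
    global_counts = {c: sum(lc[c] for lc in local_counts) for c in Ck}
    # Steps 4-5: every processor derives the identical Lk and can decide
    # independently whether to continue to the next pass.
    Lk = {c: n for c, n in global_counts.items() if n >= min_sup}
    return Lk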

  6. Count Distribution Algorithm • ADVANTAGES • No data tuples are exchanged between processors; only counts need to be exchanged. • Processors can therefore work independently and asynchronously while reading the data. • DISADVANTAGES • Does not exploit the aggregate memory of the system effectively.

  7. ALGORITHM 1: COUNT DISTRIBUTION [Figure: five processors P1-P5, each holding its own data partition Di and its local candidate counts Cik, and all ending the pass with the identical frequent set Lk.]

  8. ALGORITHM 1: COUNT DISTRIBUTION [Example: a database D of five transactions, 1: ACD, 2: BCE, 3: ABCE, 4: BE, 5: ABCE, partitioned into D1 at Processor 1 and D2 at Processor 2; each processor builds its local 1-itemset counts C11 and C21 over its own partition.]

  9. ALGORITHM 1: COUNT DISTRIBUTION [Figure: with min_sup = 40%, the processors exchange their local C1 counts, form the global counts, and each derives the identical frequent 1-itemset L1, from which the candidate 2-itemsets C12 and C22 are generated.]

  10. ALGORITHM 1: COUNT DISTRIBUTION [Pass 2: Processor 1's local C2 counts (C12): AB 0, AC 1, AE 0, BC 1, BE 1, CE 1. Processor 2's local C2 counts (C22): AB 2, AC 2, AE 2, BC 2, BE 3, CE 2. Exchanging the counts gives the global C2 support, from which each processor computes the identical L2.]
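To make the exchange on this slide concrete, the same arithmetic can be run through the count_distribution_pass sketch above; the split of the five transactions between the two partitions follows the slide layout and is illustrative only:

D1 = [set("ACD"), set("BCE")]                 # processor 1's partition
D2 = [set("ABCE"), set("BE"), set("ABCE")]    # processor 2's partition
C2 = [frozenset(p) for p in ("AB", "AC", "AE", "BC", "BE", "CE")]
L2 = count_distribution_pass([D1, D2], C2, min_sup=2)  # 40% of 5 transactions
# BE, for instance, counts 1 locally on processor 1 and 3 on processor 2,
# so its global count is 4 and it is frequent.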

  11. Data Distribution • Pass 1: same as in the Count Distribution algorithm. • Pass k > 1: 1. Processor Pi generates Ck from Lk-1 and retains only its own disjoint portion as its local candidate set Cik. 2. Processor Pi develops support counts for the itemsets in Cik using both its local data pages and the data pages broadcast by the other processors. 3. Processor Pi computes Lik from its local Cik. 4. The processors exchange the Lik so that every processor has the complete Lk for generating Ck+1. (A sketch follows this slide.)
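A sequential sketch of one Data Distribution pass, under the assumption that candidates are dealt out round-robin; in the real algorithm the "remote" transactions arrive as broadcast pages rather than being read from a shared list:

def data_distribution_pass(partitions, Ck, min_sup):
    """One Data Distribution pass (k > 1), simulated sequentially.
    Ck is split round-robin among the processors; each processor must then
    see every transaction (its own pages plus pages broadcast by the
    others) to count its slice of candidates."""
    N = len(partitions)
    # Step 1: disjoint candidate subsets Cik, one per processor.
    Ck_parts = [list(Ck)[i::N] for i in range(N)]
    Lk = {}
    for i in range(N):
        counts = {c: 0 for c in Ck_parts[i]}
        # Step 2: count over the local partition AND all remote partitions.
        for Dj in partitions:
            for t in Dj:
                for c in Ck_parts[i]:
                    if c <= t:
                        counts[c] += 1
        # Step 3: local Lik; step 4 would exchange these so that every
        # processor holds the complete Lk before generating Ck+1.
        Lk.update({c: n for c, n in counts.items() if n >= min_sup})
    return Lk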

  12. Data Distribution Algorithm • ADVANTAGES • Designed to exploit the total system memory better as the number of processors is increased. • DISADVANTAGES • Viable only on machines with very fast communication, because every processor must broadcast its local data to all other processors in every pass.

  13. Disadvantages of both algorithms • Each transaction must be compared against the entire candidate set, because any database transaction could support any candidate itemset. • This is what forces Count to replicate the candidate set on every processor and Data to broadcast every database transaction. • Processors must be synchronized at the end of each pass to develop the global counts.

  14. CANDIDATE DISTRIBUTION ALGORITHM • Pass k < l: use either the Count or the Data Distribution algorithm. • Pass k = l: 1. Partition Lk-1 among the processors. 2. Processor Pi generates Cik logically, using only the Lk-1 partition assigned to it. 3. Pi develops global counts for the candidates in Cik, and the database is repartitioned into DRi at the same time. 4. After Pi has processed all of its local data and any data received from the other processors, it posts N-1 asynchronous receive buffers to collect the Ljk from all other processors. These Ljk are needed for pruning Cik+1 in the prune step of candidate generation. 5. Processor Pi computes Lik from Cik and asynchronously broadcasts it to the other N-1 processors using N-1 asynchronous sends.

  15. CANDIDATE DISTRIBUTION ALGORITHM • Pass k > l: 1. Pi collects all frequent itemsets that have been sent to it by the other processors, for use in the pruning step. 2. Pi generates Cik using its local Lik-1. 3. Pi makes a pass over DRi and counts Cik. It then computes Lik from Cik and asynchronously broadcasts Lik to every other processor using N-1 asynchronous sends. (A sketch of this pass follows the slide.)
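A simplified sequential sketch of a pass with k > l: each simulated processor works only from its own slice of L(k-1) and its repartitioned data DRi, and the late-arriving itemsets from the other processors are modeled as a single set used purely for pruning. The function and variable names are illustrative, not from the paper:

from itertools import combinations

def candidate_distribution_pass(DR, Lk_prev_parts, all_Lk_prev, k, min_sup):
    """Sketch of one Candidate Distribution pass with k > l, simulated
    sequentially.  DR[i] is processor i's repartitioned data,
    Lk_prev_parts[i] its own slice of L(k-1), and all_Lk_prev the itemsets
    received (possibly late) from other processors, used only for pruning."""
    Lk_parts = []
    for i, own in enumerate(Lk_prev_parts):
        # Candidate generation from the processor's OWN L(k-1) slice;
        # the prune step may consult itemsets received from other processors.
        prev = list(own)
        Cik = set()
        for a in range(len(prev)):
            for b in range(a + 1, len(prev)):
                cand = prev[a] | prev[b]
                if len(cand) == k and all(
                        frozenset(s) in all_Lk_prev
                        for s in combinations(cand, k - 1)):
                    Cik.add(cand)
        # Counting needs only the repartitioned local data DR[i]:
        # no per-pass synchronization with the other processors.
        counts = {c: 0 for c in Cik}
        for t in DR[i]:
            for c in Cik:
                if c <= t:
                    counts[c] += 1
        # Lik would now be broadcast asynchronously with N-1 sends.
        Lk_parts.append({c: n for c, n in counts.items() if n >= min_sup})
    return Lk_parts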

  16. PARALLEL RULE GENERATION • Generating the rules from the frequent itemsets is much less expensive than discovering the frequent itemsets, as it does not require examination of the database. • Generating rules in parallel simply involves partitioning the set of all frequent itemsets among the processors; each processor then generates rules only for its own partition. • To compute the confidence of a rule, a processor may need the support of an itemset it is not responsible for, so it needs access to all frequent itemsets before rule generation can begin. This is not a problem for the Count and Data Distribution algorithms, since at the end of the last pass all processors hold all frequent itemsets. In the Candidate Distribution algorithm, however, faster processors may have to wait until slower processors have discovered and transmitted all of their frequent itemsets. (A per-processor sketch follows this slide.)
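A sketch of the per-processor rule-generation step described above, assuming every processor already holds the global support of all frequent itemsets (as the slide notes is required); the names are illustrative:

from itertools import combinations

def gen_rules(my_itemsets, support, min_conf):
    """Rule generation for one processor's share of the frequent itemsets.
    `support` must contain the supports of ALL frequent itemsets, because
    the confidence of X -> Y-X needs sup(Y) as well as sup(X)."""
    rules = []
    for itemset in my_itemsets:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent (Apriori
                # property), so its support is guaranteed to be available.
                conf = support[itemset] / support[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules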

  17. PERFORMANCE EVALUATION: RELATIVE PERFORMANCE AND TRADE-OFFS • For Candidate Distribution, repartitioning in the fourth pass gave the best performance. • The performance of Candidate Distribution and Count Distribution was similar to that of the serial algorithm, but Data Distribution did not fare well because of the extra communication. • The Data and Candidate Distribution algorithms spent a large amount of time communicating and redistributing data. • Count had the smallest overhead and was therefore the best algorithm.

  18. Sensitivity Analysis • Sizeup • The results showed sublinear sizeup for the Count algorithm; the algorithm becomes relatively more efficient as the database size increases. • Scaleup • The Count algorithm scales very well, keeping the response time almost constant as the database and multiprocessor sizes increase together.

  19. Effects of Hash Filtering

  20. Effect of Hash Filtering • The basic idea is to build the hash filter as the tuples are read in the first pass. • For every 2-itemset present in a transaction, the count in a corresponding hash bucket is incremented. At the end of the pass we therefore have an upper bound on the support count of every 2-itemset present in the database. • When generating C2 from L1, candidate itemsets are hashed, and any candidate whose count in the hash table is less than the minimum support is deleted. • The Count algorithm still beats the hash filter, because Count never explicitly forms C2 (nothing in C2 can be pruned by the Apriori candidate generation step anyway); instead it uses a specialized version of the hash tree that acts as a 2-D count array, drastically reducing memory requirements and function-call overhead. (A sketch of the filter follows this slide.)
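A small sketch of the 2-itemset hash filter described above; the bucket count and hashing scheme are arbitrary choices made for illustration:

def build_hash_filter(transactions, n_buckets=1 << 16):
    """Build the 2-itemset hash filter during the first pass: every pair in
    every transaction increments one bucket, so each bucket holds an UPPER
    BOUND on the support of any 2-itemset hashing into it."""
    buckets = [0] * n_buckets
    for t in transactions:
        items = sorted(t)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                buckets[hash((items[i], items[j])) % n_buckets] += 1
    return buckets

def filter_C2(C2, buckets, min_sup):
    """Drop any candidate 2-itemset whose bucket count is already below
    min_sup; its true support cannot be higher than the bucket count."""
    kept = []
    for cand in C2:
        a, b = sorted(cand)
        if buckets[hash((a, b)) % len(buckets)] >= min_sup:
            kept.append(cand)
    return kept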

  21. CONCLUSION • The Count Distribution algorithm is well suited to a workstation-cluster environment. • The Data Distribution algorithm maximizes the use of aggregate memory but requires heavy communication to broadcast all the data. • The Candidate Distribution algorithm lets processors proceed independently, without synchronizing at the end of every pass. • The Count Distribution algorithm exhibits linear scaleup and excellent speedup and sizeup behavior thanks to its low overhead, and is therefore the best algorithm of the three. • The Data Distribution algorithm performs worst because of the cost of broadcasting local data from each processor to every other processor. • The Candidate Distribution algorithm loses because of the cost and complexity of data redistribution.
