Sampling Large Databases for Association Rules

Sampling Large Databases for Association Rules Jingting Zeng CIS 664 Presentation March 13, 2007

Association Rules Outline • Association Rules Problem Overview • Association Rules Definitions • Previous Work on Association Rules • Toivonen’s Algorithm • Experiments Result • Conclusion

Overview • Purpose If people tend to buy A and B together, then a buyer of A is a good target for an advertisement for B.

The Market-Basket Example • Items frequently purchased together: Bread PeanutButter • Uses: • Placement • Advertising • Sales • Coupons • Objective: increase sales and reduce costs

Other Example • The same technology has other uses University course enrollment data has been analyzed to find combinations of courses taken by the same students

Scale of Problem • WalMart sells 100,000 items and can store hundreds of millions of baskets. • The Web has 100,000,000 words and several billion pages.

Association Rule Definitions • Set of items: I={I1,I2,…,Im} • Transactions: D={t1,t2, …, tn}, tj I • Support of an itemset: Percentage of transactions which contain that itemset. • Frequent itemset: Itemset whose number of occurrences is above a threshold.

Association Rule Definitions • Association Rule (AR): implication X  Y where X,Y  I and X  Y = ; • Support of AR (s) X Y: Percentage of transactions that contain X Y • Confidence of AR (a) X  Y: Ratio of number of transactions that contain X  Y to the number that contain X

Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b}B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} • Association Rule • {m, b}  c • Support = 2/8 = 25% • Confidence = 2/4 = 50%

Association Rule Problem • Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2, …, tn} where ti={Ii1,Ii2, …, Iik} and Iij I, the Association Rule Problem is to identify all association rules X  Y with a minimum support and confidence threshold.

Association Rule Techniques • Find all frequent itemsets • Generate strong association rules from the frequent itemsets

APriori Algorithm • A two-pass approach called a-priori limits the need for main memory. • Key idea: monotonicity : if a set of items appears at least s times, so does every subset. • Converse for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

APriori Algorithm (contd.) • Pass 1: Read baskets and count in main memory the occurrences of each item. • Requires only memory proportional to #items. • Pass 2: Read baskets again and count in main memory only those pairs both of which were found in Pass 1 to have occurred at least s times. • Requires memory proportional to square of frequent items only.

Partitioning • Divide database into partitions D1,D2,…,Dp • Apply Apriori to each partition • Any large itemset must be large in at least one partition.

Partitioning Algorithm • Divide D into partitions D1,D2,…,Dp; • For I = 1 to p do • Li = Apriori(Di); • C = L1 …  Lp; • Count C on D to generate L;

Sampling • Large databases • Sample the database and apply Apriori to the sample. • Potentially Frequent Itemsets (PL): Large itemsets from sample • Negative Border (BD - ): • Generalization of Apriori-Gen applied to itemsets of varying sizes. • Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

Negative Border Example Let Items = {A,…,F} and there are itemsets: {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F} The whole negative border is: {{B,C}, {B,F}, {D}, {E}}

Toivonen’s Algorithm • Start as in the simple algorithm, but lower the threshold slightly for the sample. • Example: if the sample is 1% of the baskets, use 0.008 as the support threshold rather than 0.01 . • Goal is to avoid missing any itemset that is frequent in the full set of baskets.

Toivonen’s Algorithm (contd.) • Add to the itemsets that are frequent in the sample the negative border of these itemsets. • An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are. • Example: ABCD is in the negative border if and only if it is not frequent, but all of ABC , BCD , ACD , and ABD are.

Toivonen’s Algorithm (contd.) • In a second pass, count all candidate frequent itemsets from the first pass, and also count the negative border. • If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.

Toivonen’s Algorithm (contd.) • What if we find something in the negative border is actually frequent? • We must start over again! • But by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass low enough for main-memory.

Experiment Synthetic data set characteristics (T = rowsize on average, I = size of maximalfrequent sets onaverage)

Experiment (contd.) Lowered frequency thresholds (%) for probability of missing any given frequent set is less than δ = 0.001

Number of trials with misses

Conclusions • Advantages: Reduced failure probability, while keeping candidate-count low enough for memory • Disadvantages: Potentially large number of candidates insecond pass

Thank you!

Sampling Large Databases for Association Rules