DATA MINING LECTURE 3

DATA MININGLECTURE 3 Frequent Itemsets Association Rules

Frequent Itemsets • Given a set of transactions, find combinations of items (itemsets) that occur frequently : support, number of transactions that contain I minsup Market-Basket transactions Examples of frequent itemsets {Diaper, Beer} : 3{Milk, Bread} : 3{Milk, Bread, Diaper}: 2

Mining Frequent Itemsets task • Input: A set of transactions T, over a set of items I • Output: All itemsets with items in I having • support ≥ minsupthreshold • Problem parameters: • N = |T|: number of transactions • d = |I|: number of (distinct) items • w: max width of a transaction • Number of possible itemsets= 2d

The itemsetlattice Given d items, there are 2dpossible itemsets Too expensive to test all!

Reduce the number of candidates • Apriori principle (Main observation): • If an itemset is frequent, then all of its subsets must also be frequent • If an itemset is not frequent, then all of its supersets cannot be frequent • The support of an itemsetnever exceeds the support of its subsets • This is known as the anti-monotone property of support

Illustration of the Apriori principle Frequent subsets Found to be frequent

Found to be Infrequent Pruned Illustration of the Apriori principle Infrequent supersets

The Apriori algorithm Level-wise approach • Find frequent 1-items and put them toLk(k=1)‏ • Use Lkto generate a collection of candidate itemsetsCk+1 with size (k+1)‏ • Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1 • If Lk+1 is not empty • k=k+1 • Goto step 2 R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.

The Apriori algorithm Ck: Candidate itemsetsof size k Lk: frequent itemsetsof size k L1= {frequent 1-itemsets}; for(k = 2; Lk !=; k++) Ck+1= GenerateCandidates(Lk)‏ for eachtransaction t in database do increment count of candidates in Ck+1that are contained in t endfor Lk+1= candidates in Ck+1 with support ≥min_sup endfor returnkLk;

Generate Candidates Ck+1 • Assume the items in Lkare listed in an order (e.g., lexicographic)‏ • Step 1: self-joiningLk(IN SQL)‏ insert intoCk+1 select p.item1, p.item2, …, p.itemk, q.itemk from Lk p, Lkq where p.item1=q.item1, …, p.itemk-1=q.itemk-1, p.itemk< q.itemk Create an itemset of size k+1, by joining two itemsets of size k, that share the first k-1 items

Example I • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcdfrom abc and abd • acde from acd and ace • p.item1=q.item1,p.item2=q.item2, p.item3< q.item3

{a,b,c} {a,b,d} {a,b,c,d} Example I • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcdfrom abc and abd • acde from acd and ace • p.item1=q.item1,p.item2=q.item2, p.item3< q.item3

{a,c,d} {a,c,e} {a,c,d,e} Example I • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcdfrom abc and abd • acde from acd and ace • p.item1=q.item1,p.item2=q.item2, p.item3< q.item3

Generate Candidates Ck+1 • Assume the items in Lk are listed in an order (e.g., alphabetical)‏ • Step 1: self-joiningLk(IN SQL)‏ insert intoCk+1 select p.item1, p.item2, …, p.itemk, q.itemk from Lk p, Lkq where p.item1=q.item1, …, p.itemk-1=q.itemk-1, p.itemk< q.itemk • Step 2:pruning forallitemsets c in Ck+1do forallk-subsets s of cdo if (s is not in Lk) then delete c from Ck+1 All itemsets of size k that are subsets of a new (k+1)-itemset should be frequent

{a,c,d} {a,b,c} {a,c,e} {a,b,d} {a,c,d,e} {a,b,c,d} bcd abc abd acd X   Example I • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcdfrom abcand abd • acde from acd and ace • Pruning: • abcdis kept since all subset itemsets are in L3 • acdeis removed because ade is not in L3 • C4={abcd}     cde acd ace ade

The Apriori algorithm Ck: Candidate itemsetsof size k Lk: frequent itemsetsof size k L1= {frequent 1-itemsets}; for(k = 2; Lk !=; k++) Ck+1= GenerateCandidates(Lk)‏ for eachtransaction t in database do increment count of candidates in Ck+1that are contained in t endfor Lk+1= candidates in Ck+1 with support ≥min_sup endfor returnkLk;

Reducing Number of Comparisons • Candidate counting: • Scan the database of transactions to determine the support of each candidate itemset • To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

ASSOCIATION RULES

Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper}  {Beer},{Milk, Bread}  {Eggs,Coke},{Beer, Bread}  {Milk}, Implication means co-occurrence, not causality!

Example: Definition: Association Rule • Association Rule • An implication expression of the form X  Y, where X and Y are itemsets • Example: {Milk, Diaper}  {Beer} • Rule Evaluation Metrics • Support (s) • Fraction of transactions that contain both X and Y • the probability P(X,Y) that X and Y occur together • Confidence(c) • Measures how often items in Y appear in transactions thatcontain X • the conditional probability P(X|Y) that X occurs given that Y has occurred.

Association Rule Mining Task • Input: A set of transactions T, over a set of items I • Output: All rules with items in I having • support ≥ minsupthreshold • confidence ≥ minconfthreshold

Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support  minsup • Rule Generation • Generate high confidence rules from each frequent itemset, where each rule is a partitioning of a frequent itemset into Left-Hand-Side (LHS) and Right-Hand-Side (RHS) Frequent itemset: {A,B,C,D} Rule:ABCD

Rule Generation • We have all frequent itemsets, how do we get the rules? • For every frequent itemsetS, we find rules of the form L S – L , where L  S, thatsatisfy the minimum confidence requirement • Example: L = {A,B,C,D} • Candidate rules: A BCD, B ACD, C ABD, D ABCAB CD, AC  BD, AD  BC, BD AC, CD AB, ABC D, BCD A, BC AD, • If |L| = k, then there are 2k – 2 candidate association rules (ignoring L   and   L)

Rule Generation • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property c(ABC D) can be larger or smaller than c(AB D) • But confidence of rules generated from the same itemset has an anti-monotone property • e.g., L = {A,B,C,D}: c(ABC  D)  c(AB  CD)  c(A  BCD) • Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

Pruned Rules Rule Generation for Apriori Algorithm Low Confidence Rule Lattice of rules created by the RHS

Rule Generation for Apriori Algorithm • Candidate rule is generated by merging two rules that share the same prefixin the RHS • join(CDAB,BDAC)would produce the candidaterule D ABC • Prune rule D  ABCif itssubset ADBCdoes not havehigh confidence • Essentially we are doing Apriori on the RHS

RESULT POST-PROCESSING

Compact Representation of Frequent Itemsets • Some itemsets are redundant because they have identical support as their supersets • Number of frequent itemsets • Need a compact representation

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent Maximal Itemsets Maximalitemsets = positive border Infrequent Itemsets Border

Negative Border Itemsets that are not frequent, but all their immediate subsets are frequent. Infrequent Itemsets Border Minimal: Smallest itemsets with this property

Border • Border = Positive Border + Negative Border • Itemsets such that all their immediate subsets are frequent and all their immediate supersets are infrequent. • Either the positive, or the negative border is sufficient to summarize all frequent itemsets.

Closed Itemset • An itemset is closed if none of its immediate supersets has the same support as the itemset

Maximal vs Closed Itemsets Transaction Ids Not supported by any transactions

Maximal vs Closed Frequent Itemsets Closed but not maximal Minimum support = 2 Closed and maximal # Closed = 9 # Maximal = 4

Maximal vs Closed Itemsets

Pattern Evaluation • Association rule algorithms tend to produce too many rules • many of them are uninteresting or redundant • Redundant if {A,B,C}  {D} and {A,B}  {D} have same support & confidence • Interestingness measures can be used to prune/rank the derived patterns • In the original formulation of association rules, support & confidence are the only measures used

f11: support of X and Yf10: support of X and Yf01: support of X and Yf00: support of X and Y Computing Interestingness Measure • Given a rule X  Y, information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X  Y Used to define various measures • support, confidence, lift, Gini, J-measure, etc.

Association Rule: Tea  Coffee • Confidence= P(Coffee|Tea) = 0.75 • but P(Coffee) = 0.9 • Although confidence is high, rule is misleading • P(Coffee|Tea) = 0.9375 Drawback of Confidence

Statistical Independence • Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 420 students know how to swim and bike (S,B) • P(SB) = 420/1000 = 0.42 • P(S)  P(B) = 0.6  0.7 = 0.42 • P(SB) = P(S)  P(B) => Statistical independence

Statistical Independence • Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 500students know how to swim and bike (S,B) • P(SB) = 500/1000 = 0.5 • P(S)  P(B) = 0.6  0.7 = 0.42 • P(SB) > P(S)  P(B) => Positively correlated

Statistical Independence • Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 300students know how to swim and bike (S,B) • P(SB) = 300/1000 = 0.3 • P(S)  P(B) = 0.6  0.7 = 0.42 • P(SB) < P(S)  P(B) => Negatively correlated

Statistical-based Measures • Measures that take into account statistical dependence Text mining: Pointwise Mutual Information

Example: Lift/Interest • Association Rule: Tea  Coffee • Confidence= P(Coffee|Tea) = 0.75 • but P(Coffee) = 0.9 • Lift = 0.75/0.9= 0.8333 (< 1, therefore is negatively associated)

Drawback of Lift & Interest Statistical independence: If P(X,Y)=P(X)P(Y) => Lift = 1 Rare co-occurrences are deemed more interesting

ALTERNATIVE FREQUENT ITEMSET COMPUTATION Slides taken from Mining Massive Datasets course by AnandRajaraman and Jeff Ullman.

Efficient computation of pairs • The operation that takes most resources (mostly in terms of memory, but also time) is the computation of frequent pairs. • How can we make this faster?

PCY Algorithm • During Pass 1 (computing frequent items) of A-priori, most memory is idle. • Use that memory to keep counts of buckets into which pairs of items are hashed. • Just the count, not the pairs themselves.

Needed Extensions • Pairs of items need to be generated from the input file; they are not present in the file. • We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times.

PCY Algorithm – (2) • A bucket is frequent if its count is at least the support threshold. • If a bucket is not frequent, no pair that hashes to that bucket could possibly be a frequent pair. • On Pass 2 (frequent pairs), we only count pairs that hash to frequent buckets.

DATA MINING LECTURE 3

DATA MINING LECTURE 3

Presentation Transcript

DATA MINING LECTURE 9

DATA MINING LECTURE 2

DATA MINING LECTURE 10

DATA MINING LECTURE 5

DATA MINING LECTURE 1

DATA MINING LECTURE 3

DATA MINING LECTURE 10b

DATA MINING LECTURE 5

DATA MINING LECTURE 8b

DATA MINING LECTURE 13

DATA MINING LECTURE 12

DATA MINING LECTURE 9

DATA MINING LECTURE 13

DATA MINING LECTURE 11

DATA MINING LECTURE 12

DATA MINING LECTURE 10

DATA MINING LECTURE 7

DATA MINING LECTURE 8

Data Mining Lecture 1

DATA MINING LECTURE 4