DATA MINING LECTURE 3

Presentation Transcript

  1. DATA MINING LECTURE 3: Frequent Itemsets, Association Rules

  2. Frequent Itemsets
     • Given a set of transactions, find combinations of items (itemsets) that occur frequently
     • Support of an itemset I: the number of transactions that contain I
     • An itemset is frequent if its support is at least a threshold minsup
     [Table: Market-Basket transactions]
     • Examples of frequent itemsets: {Diaper, Beer}: 3, {Milk, Bread}: 3, {Milk, Bread, Diaper}: 2
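
As a small illustration of support counting, the sketch below counts how many transactions contain a given itemset. The transaction table did not survive in this transcript, so the five baskets used here are an assumption, chosen only because they are consistent with the three counts quoted on the slide. The name `support_count` is hypothetical.

```python
# Assumed market-basket data (not from the transcript); it reproduces the
# counts quoted on the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


for itemset in [{"Diaper", "Beer"}, {"Milk", "Bread"}, {"Milk", "Bread", "Diaper"}]:
    print(sorted(itemset), support_count(itemset, transactions))
```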

  3. Mining Frequent Itemsets task
     • Input: A set of transactions T, over a set of items I
     • Output: All itemsets with items in I having support ≥ minsup threshold
     • Problem parameters:
       • N = |T|: number of transactions
       • d = |I|: number of (distinct) items
       • w: max width of a transaction
       • Number of possible itemsets = 2^d

  4. The itemset lattice
     Given d items, there are 2^d possible itemsets
     Too expensive to test all!

  5. Reduce the number of candidates
     • Apriori principle (main observation):
       • If an itemset is frequent, then all of its subsets must also be frequent
       • If an itemset is not frequent, then none of its supersets can be frequent
     • The support of an itemset never exceeds the support of its subsets
     • This is known as the anti-monotone property of support
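
The anti-monotone property can be checked by brute force on a tiny example. The sketch below enumerates every non-empty itemset over the items of an assumed five-basket dataset (the data is not from the transcript) and looks for any itemset whose support exceeds that of one of its subsets. The names `support_count` and `all_itemsets` are hypothetical.

```python
from itertools import combinations

# Assumed toy data, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))


def support_count(itemset):
    """Number of transactions containing all items of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def all_itemsets(items):
    """Yield every non-empty itemset: 2^d - 1 of them for d items."""
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)


# Collect (subset, itemset) pairs that would violate anti-monotonicity.
violations = [
    (sub, c)
    for c in all_itemsets(items)
    for k in range(1, len(c))
    for sub in combinations(c, k)
    if support_count(sub) < support_count(c)
]
n_itemsets = sum(1 for _ in all_itemsets(items))
print(n_itemsets, len(violations))
```

Enumerating the full lattice like this is only feasible for very small d, which is the point of the pruning that follows.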

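  6. Illustration of the Apriori principle
     [Figure: itemset lattice with one itemset found to be frequent and its frequent subsets highlighted]

  7. Illustration of the Apriori principle
     [Figure: itemset lattice with one itemset found to be infrequent and its infrequent supersets pruned]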

  8. The Apriori algorithm
     Level-wise approach
     • Step 1: Find the frequent 1-itemsets and put them into L_k (k = 1)
     • Step 2: Use L_k to generate a collection of candidate itemsets C_{k+1} of size k+1
     • Step 3: Scan the database to find which itemsets in C_{k+1} are frequent and put them into L_{k+1}
     • Step 4: If L_{k+1} is not empty, set k = k+1 and go to Step 2
     R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.

  9. The Apriori algorithm
     C_k: candidate itemsets of size k
     L_k: frequent itemsets of size k

     L_1 = {frequent 1-itemsets};
     for (k = 1; L_k ≠ ∅; k++)
         C_{k+1} = GenerateCandidates(L_k)
         for each transaction t in database do
             increment count of candidates in C_{k+1} that are contained in t
         endfor
         L_{k+1} = candidates in C_{k+1} with support ≥ min_sup
     endfor
     return ∪_k L_k;
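
The level-wise loop can be sketched in Python as follows. This is a minimal illustration, not the implementation from the cited paper: candidates are formed as unions of two frequent k-itemsets and kept only if all their k-subsets are frequent, which is simpler than the prefix-based join but yields the same candidate set. The function name `apriori` and the five baskets are assumptions.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(level)

    k = 1
    while level:
        # C_{k+1}: unions of two frequent k-itemsets whose k-subsets are all frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level for s in combinations(u, k)):
                candidates.add(u)
        # One scan of the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(level)
        k += 1
    return frequent


# Assumed toy data, for illustration only.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(baskets, minsup=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```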

  10. Generate Candidates C_{k+1}
      • Assume the items in each itemset of L_k are listed in an order (e.g., lexicographic)
      • Step 1: self-joining L_k (in SQL)
          insert into C_{k+1}
          select p.item_1, p.item_2, …, p.item_k, q.item_k
          from L_k p, L_k q
          where p.item_1 = q.item_1, …, p.item_{k-1} = q.item_{k-1}, p.item_k < q.item_k
      • Create an itemset of size k+1 by joining two itemsets of size k that share the first k-1 items

  11. Example I
      • L_3 = {abc, abd, acd, ace, bcd}
      • Self-joining: L_3 * L_3
        • abcd from abc and abd
        • acde from acd and ace
      • Join condition: p.item_1 = q.item_1, p.item_2 = q.item_2, p.item_3 < q.item_3

  15. Generate Candidates C_{k+1}
      • Assume the items in each itemset of L_k are listed in an order (e.g., alphabetical)
      • Step 1: self-joining L_k (in SQL)
          insert into C_{k+1}
          select p.item_1, p.item_2, …, p.item_k, q.item_k
          from L_k p, L_k q
          where p.item_1 = q.item_1, …, p.item_{k-1} = q.item_{k-1}, p.item_k < q.item_k
      • Step 2: pruning
          forall itemsets c in C_{k+1} do
              forall k-subsets s of c do
                  if (s is not in L_k) then delete c from C_{k+1}
      • All itemsets of size k that are subsets of a new (k+1)-itemset should be frequent

  16. Example I
      • L_3 = {abc, abd, acd, ace, bcd}
      • Self-joining: L_3 * L_3
        • abcd from abc and abd
        • acde from acd and ace
      • Pruning:
        • abcd is kept since all of its 3-item subsets (abc, abd, acd, bcd) are in L_3
        • acde is removed because ade is not in L_3 (its 3-item subsets are acd, ace, ade, cde)
      • C_4 = {abcd}
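
The join and prune steps of Example I can be reproduced with a short Python sketch. Itemsets are represented as sorted tuples so that "share the first k-1 items" becomes a prefix comparison. The function name `generate_candidates` mirrors the pseudocode but its signature and return value are assumptions made for this illustration.

```python
from itertools import combinations


def generate_candidates(Lk):
    """Given frequent k-itemsets as sorted tuples, return (joined, pruned) lists.

    joined: result of the self-join (Step 1)
    pruned: candidates that survive the subset check (Step 2), i.e. C_{k+1}
    """
    Lk = sorted(set(Lk))
    if not Lk:
        return [], []
    k = len(Lk[0])
    frequent = set(Lk)

    # Step 1: join p and q when they agree on the first k-1 items.
    # Lk is sorted, so p < q and therefore p[-1] < q[-1] whenever the prefixes match.
    joined = [p + (q[-1],) for p, q in combinations(Lk, 2) if p[:-1] == q[:-1]]

    # Step 2: keep a candidate only if every k-subset is frequent.
    pruned = [c for c in joined if all(s in frequent for s in combinations(c, k))]
    return joined, pruned


L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
joined, C4 = generate_candidates(L3)
print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in C4])      # ['abcd']
```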

  18. Reducing Number of Comparisons
      • Candidate counting:
        • Scan the database of transactions to determine the support of each candidate itemset
        • To reduce the number of comparisons, store the candidates in a hash structure
        • Instead of matching each transaction against every candidate, match it only against the candidates in the buckets it hashes to
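
One simple way to get this effect in Python is to store the candidates as keys of a dict (a hash table) and look up each k-subset of a transaction, instead of comparing the transaction with every candidate. Note that this is a flat hash lookup, not the hash tree used in the original Apriori paper, and enumerating k-subsets is only practical for short transactions. The name `count_support` and the sample data are assumptions.

```python
from itertools import combinations


def count_support(transactions, candidates, k):
    """Count candidate k-itemsets (sorted tuples) via hash lookups."""
    counts = {c: 0 for c in candidates}  # dict acts as the hash structure
    for t in transactions:
        # Each k-subset of the transaction costs one hash lookup.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts


# Assumed toy data, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
candidates = [("Beer", "Diaper"), ("Bread", "Milk"), ("Beer", "Coke")]
counts = count_support(transactions, candidates, k=2)
print(counts)
```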

  19. ASSOCIATION RULES

  20. Association Rule Mining
      • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
      [Table: Market-Basket transactions]
      • Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
      • Implication means co-occurrence, not causality!

  21. Definition: Association Rule
      • Association Rule
        • An implication expression of the form X → Y, where X and Y are itemsets
        • Example: {Milk, Diaper} → {Beer}
      • Rule Evaluation Metrics
        • Support (s)
          • Fraction of transactions that contain both X and Y
          • The probability P(X,Y) that X and Y occur together
        • Confidence (c)
          • Measures how often items in Y appear in transactions that contain X
          • The conditional probability P(Y|X) that Y occurs given that X has occurred
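
Both metrics follow directly from two counts: the transactions containing X ∪ Y and the transactions containing X. The sketch below computes them for the rule {Milk, Diaper} → {Beer}. The five baskets are an assumption, since the transaction table is not part of this transcript, and `rule_metrics` is a hypothetical helper name.

```python
# Assumed toy data, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    support = n_both / len(transactions)             # estimate of P(X, Y)
    confidence = n_both / n_lhs if n_lhs else 0.0    # estimate of P(Y | X)
    return support, confidence


s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

On this assumed data the rule holds in 2 of the 5 transactions, and in 2 of the 3 transactions that contain both Milk and Diaper.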

  22. Association Rule Mining Task
      • Input: A set of transactions T, over a set of items I
      • Output: All rules with items in I having
        • support ≥ minsup threshold
        • confidence ≥ minconf threshold

  23. Mining Association Rules
      • Two-step approach:
        • Frequent Itemset Generation
          • Generate all itemsets whose support ≥ minsup
        • Rule Generation
          • Generate high-confidence rules from each frequent itemset, where each rule is a partitioning of a frequent itemset into Left-Hand-Side (LHS) and Right-Hand-Side (RHS)
      • Example: frequent itemset {A,B,C,D}, rule AB → CD

  24. Rule Generation
      • We have all frequent itemsets; how do we get the rules?
      • For every frequent itemset S, we find rules of the form L → S − L, where L ⊂ S, that satisfy the minimum confidence requirement
      • Example: S = {A,B,C,D}
      • Candidate rules: A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB, ABC → D, ABD → C, ACD → B, BCD → A
      • If |S| = k, then there are 2^k − 2 candidate association rules (ignoring S → ∅ and ∅ → S)
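
The count of 2^k − 2 candidate rules can be confirmed by enumerating every split of an itemset into a non-empty left-hand side and a non-empty right-hand side. The helper name `candidate_rules` is hypothetical.

```python
from itertools import combinations


def candidate_rules(itemset):
    """All (lhs, rhs) pairs with lhs and rhs non-empty, disjoint, and covering itemset."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):  # size of the left-hand side
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules


rules = candidate_rules("ABCD")
for lhs, rhs in rules:
    print("".join(lhs), "->", "".join(rhs))
print(len(rules))  # 14
```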

  25. Rule Generation
      • How to efficiently generate rules from frequent itemsets?
      • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
      • But confidence of rules generated from the same itemset has an anti-monotone property
        • e.g., S = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
        • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

  26. Rule Generation for Apriori Algorithm
      [Figure: lattice of rules created by the RHS; a low-confidence rule and the rules pruned below it]

  27. Rule Generation for Apriori Algorithm
      • A candidate rule is generated by merging two rules that share the same prefix in the RHS
      • join(CD → AB, BD → AC) would produce the candidate rule D → ABC
      • Prune rule D → ABC if its subset AD → BC does not have high confidence
      • Essentially we are doing Apriori on the RHS
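
A rough sketch of "Apriori on the RHS" for a single frequent itemset: start with right-hand sides of size 1, keep those whose rule meets the confidence threshold, and only grow right-hand sides whose smaller parts all survived. For brevity it merges surviving right-hand sides by set union rather than by shared prefix. The names `support_table` and `generate_rules`, the threshold, and the five baskets are all assumptions made for this illustration.

```python
from itertools import combinations


def support_table(itemset, transactions):
    """Support count of every non-empty subset of `itemset`."""
    items = sorted(itemset)
    return {
        frozenset(s): sum(1 for t in transactions if set(s) <= t)
        for r in range(1, len(items) + 1)
        for s in combinations(items, r)
    }


def generate_rules(itemset, support, minconf):
    """Return [(lhs, rhs, confidence)] for rules from `itemset` meeting minconf."""
    itemset = frozenset(itemset)
    rules = []
    rhs_level = [frozenset([i]) for i in itemset]  # right-hand sides of size 1
    m = 1
    while rhs_level and m < len(itemset):
        kept = []
        for rhs in rhs_level:
            lhs = itemset - rhs
            conf = support[itemset] / support[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        # Grow RHS by one item; require every size-m part to have survived.
        kept_set = set(kept)
        rhs_level = list({
            a | b
            for a, b in combinations(kept, 2)
            if len(a | b) == m + 1
            and all(frozenset(s) in kept_set for s in combinations(a | b, m))
        })
        m += 1
    return rules


# Assumed toy data, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
itemset = {"Bread", "Milk", "Diaper"}
support = support_table(itemset, transactions)
rules = generate_rules(itemset, support, minconf=0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```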

  28. RESULT POST-PROCESSING

  29. Compact Representation of Frequent Itemsets
      • Some itemsets are redundant because they have the same support as their supersets
      • The number of frequent itemsets can be very large
      • Need a compact representation

  30. Maximal Frequent Itemset
      • An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
      • Maximal itemsets = positive border
      [Figure: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal itemsets marked]

  31. Negative Border
      • Itemsets that are not frequent, but all their immediate subsets are frequent
      • Minimal: smallest itemsets with this property
      [Figure: itemset lattice showing the border and the infrequent itemsets]

  32. Border
      • Border = Positive Border + Negative Border
      • Itemsets such that all their immediate subsets are frequent and all their immediate supersets are infrequent
      • Either the positive or the negative border is sufficient to summarize all frequent itemsets

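  33. Closed Itemset
      • An itemset is closed if none of its immediate supersets has the same support as the itemset

  34. Maximal vs Closed Itemsets
      [Figure: itemset lattice annotated with the transaction ids that support each itemset; some itemsets are not supported by any transaction]

  35. Maximal vs Closed Frequent Itemsets
      [Figure: itemset lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are closed and maximal]
      • # Closed = 9
      • # Maximal = 4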

  36. Maximal vs Closed Itemsets
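
The two definitions can be compared on a small example. The lattice figure and its transactions are not part of this transcript, so the five transactions below are an assumption; they were chosen because, with minimum support 2, they give 9 closed and 4 maximal frequent itemsets, matching the counts quoted above. For a frequent itemset it is enough to compare against frequent supersets, since an infrequent superset necessarily has lower support. The name `closed_and_maximal` is hypothetical.

```python
from itertools import combinations

# Assumed transactions over items A..E (not from the transcript).
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
minsup = 2
items = sorted(set().union(*transactions))

# Brute-force all frequent itemsets; fine for 5 items.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        count = sum(1 for t in transactions if set(combo) <= t)
        if count >= minsup:
            frequent[frozenset(combo)] = count


def closed_and_maximal(frequent):
    """Split frequent itemsets into closed ones and maximal ones."""
    closed, maximal = [], []
    for s, sup in frequent.items():
        # Immediate supersets that are themselves frequent.
        supersets = [t for t in frequent if len(t) == len(s) + 1 and s < t]
        if not supersets:
            maximal.append(s)
        if all(frequent[t] != sup for t in supersets):
            closed.append(s)
    return closed, maximal


closed, maximal = closed_and_maximal(frequent)
print(len(frequent), len(closed), len(maximal))  # 14 9 4
print(sorted("".join(sorted(s)) for s in maximal))
```

Every maximal itemset is also closed, which the output makes easy to check.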

  37. Pattern Evaluation
      • Association rule algorithms tend to produce too many rules
        • Many of them are uninteresting or redundant
        • Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
      • Interestingness measures can be used to prune/rank the derived patterns
      • In the original formulation of association rules, support & confidence are the only measures used

  38. Computing Interestingness Measure
      • Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table
      • Contingency table for X → Y:
        • f11: support of X and Y
        • f10: support of X and ¬Y
        • f01: support of ¬X and Y
        • f00: support of ¬X and ¬Y
      • Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

  39. Drawback of Confidence
      • Association Rule: Tea → Coffee
      • Confidence = P(Coffee|Tea) = 0.75
      • but P(Coffee) = 0.9
      • Although confidence is high, the rule is misleading
      • P(Coffee|¬Tea) = 0.9375

  40. Statistical Independence
      • Population of 1000 students
        • 600 students know how to swim (S)
        • 700 students know how to bike (B)
        • 420 students know how to swim and bike (S,B)
      • P(S∧B) = 420/1000 = 0.42
      • P(S) × P(B) = 0.6 × 0.7 = 0.42
      • P(S∧B) = P(S) × P(B) ⇒ Statistical independence

  41. Statistical Independence
      • Population of 1000 students
        • 600 students know how to swim (S)
        • 700 students know how to bike (B)
        • 500 students know how to swim and bike (S,B)
      • P(S∧B) = 500/1000 = 0.5
      • P(S) × P(B) = 0.6 × 0.7 = 0.42
      • P(S∧B) > P(S) × P(B) ⇒ Positively correlated

  42. Statistical Independence
      • Population of 1000 students
        • 600 students know how to swim (S)
        • 700 students know how to bike (B)
        • 300 students know how to swim and bike (S,B)
      • P(S∧B) = 300/1000 = 0.3
      • P(S) × P(B) = 0.6 × 0.7 = 0.42
      • P(S∧B) < P(S) × P(B) ⇒ Negatively correlated
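
The three swim/bike cases can be checked with a few lines of Python. Comparing P(S∧B) with P(S) × P(B) is equivalent to comparing n_xy × n with n_x × n_y, and the integer form avoids floating-point rounding (0.6 × 0.7 is not exactly 0.42 in binary floating point). The function name `relation` is hypothetical.

```python
def relation(n, n_x, n_y, n_xy):
    """Classify X and Y given counts out of a population of size n."""
    # P(X,Y) ? P(X)P(Y)  <=>  n_xy * n ? n_x * n_y   (multiply both sides by n^2)
    joint, product = n_xy * n, n_x * n_y
    if joint == product:
        return "independent"
    return "positively correlated" if joint > product else "negatively correlated"


for both in (420, 500, 300):
    print(both, relation(1000, 600, 700, both))
```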

  43. Statistical-based Measures
      • Measures that take into account statistical dependence
      • Text mining: Pointwise Mutual Information

  44. Example: Lift/Interest
      • Association Rule: Tea → Coffee
      • Confidence = P(Coffee|Tea) = 0.75
      • but P(Coffee) = 0.9
      • Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
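
The Tea → Coffee numbers can be reproduced from a 2×2 contingency table. The table itself is not in this transcript; the counts below (out of an assumed 100 transactions) are derived from the three probabilities quoted on the slides, which together imply P(Tea) = 0.2. Lift is computed as confidence divided by P(Y), as in the slide's own arithmetic. The function name `measures` is hypothetical.

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X -> Y from contingency counts.

    f11: X and Y, f10: X and not Y, f01: not X and Y, f00: neither.
    """
    n = f11 + f10 + f01 + f00
    confidence = f11 / (f11 + f10)      # P(Y | X)
    p_y = (f11 + f01) / n               # P(Y)
    return {
        "support": f11 / n,             # P(X, Y)
        "confidence": confidence,
        "p_y": p_y,
        "p_y_given_not_x": f01 / (f01 + f00),
        "lift": confidence / p_y,       # equals P(X, Y) / (P(X) P(Y))
    }


# Assumed counts for X = Tea, Y = Coffee, consistent with the quoted probabilities.
m = measures(f11=15, f10=5, f01=75, f00=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```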

  45. Drawback of Lift & Interest
      • Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1
      • Rare co-occurrences are deemed more interesting

  46. ALTERNATIVE FREQUENT ITEMSET COMPUTATION
      Slides taken from Mining Massive Datasets course by Anand Rajaraman and Jeff Ullman.

  47. Efficient computation of pairs
      • The operation that takes most resources (mostly in terms of memory, but also time) is the computation of frequent pairs.
      • How can we make this faster?

  48. PCY Algorithm
      • During Pass 1 (computing frequent items) of A-priori, most memory is idle.
      • Use that memory to keep counts of buckets into which pairs of items are hashed.
      • Just the count, not the pairs themselves.

  49. Needed Extensions
      • Pairs of items need to be generated from the input file; they are not present in the file.
      • We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times.

  50. PCY Algorithm – (2)
      • A bucket is frequent if its count is at least the support threshold.
      • If a bucket is not frequent, no pair that hashes to that bucket could possibly be a frequent pair.
      • On Pass 2 (frequent pairs), we only count pairs that hash to frequent buckets.
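
The two passes described above can be sketched in Python as follows. This is an illustrative in-memory version under several assumptions: the baskets fit in a list, the bucket hash is Python's built-in `hash` modulo a small, arbitrarily chosen `n_buckets`, and the name `pcy_frequent_pairs` is hypothetical. The result does not depend on the hash function, since a frequent pair always lands in a frequent bucket; the hash only affects how many infrequent pairs get filtered out before counting.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(baskets, s, n_buckets=11):
    """Return {pair (sorted tuple): count} for pairs occurring in at least s baskets."""
    # Pass 1: count items, and hash every pair to a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % n_buckets] += 1

    # Between passes: keep frequent items and a bitmap of frequent buckets.
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]

    # Pass 2: count only pairs of frequent items that hash to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}


# Assumed toy data, for illustration only.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = pcy_frequent_pairs(baskets, s=3)
for pair, count in sorted(result.items()):
    print(pair, count)
```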