
Association Rule Mining



Presentation Transcript


  1. Association Rule Mining • Instructor: Qiang Yang • Thanks: Jiawei Han and Jian Pei

  2. What Is Frequent Pattern Mining? • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] • Frequent pattern mining: finding regularities in data • What products were often purchased together? — Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? Frequent-pattern mining methods

  3. Why Is Frequent Pattern Mining an Essential Task in Data Mining? • Foundation for many essential data mining tasks • Association, correlation, causality • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression) • Broad applications • Basket data analysis, cross-marketing, catalog design, sales campaign analysis • Web log (click stream) analysis, DNA sequence analysis, etc.

  4. Basic Concepts: Frequent Patterns and Association Rules • [Figure: Venn diagram with regions "Customer buys beer", "Customer buys diaper", and "Customer buys both"] • Itemset X = {x1, …, xk} • Find all the rules X → Y with minimum confidence and support • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction having X also contains Y • Let min_support = 50%, min_conf = 50%: • A → C (50%, 66.7%) • C → A (50%, 100%)
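
The two measures on this slide can be checked with a minimal Python sketch. The four-transaction database below is an assumption (the slide's own table did not survive in the transcript), chosen so that the results agree with the percentages quoted above; the helper names `support` and `confidence` are illustrative only.

```python
# Hypothetical transaction database (an assumption, not taken from the slide).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Estimate Pr(rhs | lhs) as support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support({lhs, rhs}, transactions)
    c = confidence({lhs}, {rhs}, transactions)
    print(f"{lhs} -> {rhs}: support={s:.1%}, confidence={c:.1%}")
```

With this assumed data, both rules clear the 50% thresholds, and the asymmetry in confidence comes only from the different denominators (three transactions contain A, two contain C).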

  5. Concept: Frequent Itemsets • Minimum support = 2 • {sunny, hot, no} • {sunny, hot, high, no} • {rainy, normal} • Minimum support = 3 • ? • How strong is {sunny, no}? • Count = • Percentage =

  6. Concept: Itemset → Rules • {sunny, hot, no} = {Outlook=Sunny, Temp=hot, Play=no} • Generate a rule: • Outlook=sunny and Temp=hot → Play=no • How strong is this rule? • Support of the rule • = support of the itemset {sunny, hot, no} = 2 = Pr({sunny, hot, no}) • Either expressed in count form or percentage form • Confidence = Pr(Play=no | {Outlook=sunny, Temp=hot}) • In general, for LHS → RHS, Confidence = Pr(RHS | LHS) • Confidence • = Pr(RHS | LHS) • = count(LHS and RHS) / count(LHS) • What is the confidence of Outlook=sunny → Play=no?

  7. 6.1.3 Types of Association Rules • Quantitative • Age(X, “30…39”) and income(X, “42K…48K”) → buys(X, TV) • Single vs. multiple dimensions: • Single: Buys(X, computer) → buys(X, “financial soft”) • Multi: the quantitative example above • Levels of abstraction • Age(X, ..) → buys(X, “laptop computer”) • Age(X, ..) → buys(X, “computer”) • Extensions • Max Pattern • Closed Itemset

  8. Frequent Patterns • Patterns = Item Sets • {i1, i2, …, in}, where each item is a pair: (Attribute=value) • Frequent Patterns • Itemsets whose support >= minimum support • Support • count(itemset)/count(database)

  9. Max-patterns • Max-pattern: a frequent pattern that has no proper frequent super-pattern • With min_sup = 2: • BCDE, ACD are max-patterns • BCD is not a max-pattern

  10. Frequent Max Patterns • Succinct expression of frequent patterns • Let {a, b, c} be frequent • Then {a, b}, {b, c}, {a, c} must also be frequent • Then {a}, {b}, {c} must also be frequent • By writing down {a, b, c} once, we save lots of computation • Max Pattern • If {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern, for any other item x.

  11. Find frequent Max Patterns • Minimum support=2 • {sunny, hot, no} ??

  12. Closed Patterns • A closed itemset X has no proper superset X’ such that every transaction containing X also contains X’ • With min_sup = 2: • {a, b}, {a, b, d}, {a, b, c} are frequent closed patterns • But {a, b} is not a max pattern • Concise representation of frequent patterns • Reduces the number of patterns and rules • N. Pasquier et al. In ICDT’99
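
The difference between closed and max patterns can be illustrated with a brute-force Python sketch. The database below is an assumption (the slide's table is not in the transcript), picked so that the outcome matches the patterns named above; the function names are hypothetical, and the exhaustive enumeration is only suitable for tiny examples.

```python
from itertools import combinations

# Hypothetical database (an assumption; the slide's table is not in the transcript).
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "d"}]
MIN_SUP = 2  # absolute support count

def frequent_itemsets(db, min_sup):
    """Brute force: count every itemset over the item universe (small inputs only)."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(set(combo) <= t for t in db)
            if count >= min_sup:
                result[frozenset(combo)] = count
    return result

def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {x for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

def max_itemsets(freq):
    """Frequent itemsets with no proper frequent superset."""
    return {x for x in freq if not any(x < y for y in freq)}

freq = frequent_itemsets(transactions, MIN_SUP)
closed = closed_itemsets(freq)
maximal = max_itemsets(freq)
print(sorted("".join(sorted(x)) for x in closed))
print(sorted("".join(sorted(x)) for x in maximal))
```

On this assumed data, {a, b} is closed because its support (4) is higher than that of any superset, yet it is not maximal because {a, b, c} and {a, b, d} are still frequent.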

  13. Mining Association Rules: an Example • Min. support 50%, min. confidence 50% • For rule A → C: • support = support({A} ∪ {C}) = 50% • confidence = support({A} ∪ {C}) / support({A}) = 66.7%

  14. Method 1: Apriori: A Candidate Generation-and-Test Approach • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • Every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! • Method: • generate length (k+1) candidate itemsets from length k frequent itemsets, and • test the candidates against DB • The performance studies show its efficiency and scalability • Agrawal & Srikant 1994, Mannila, et al. 1994

  15. The Apriori Algorithm: An Example • [Figure: database TDB; the 1st scan yields C1 and L1, the 2nd scan yields C2 and L2, the 3rd scan yields C3 and L3]

  16. The Apriori Algorithm • Pseudo-code:
       Ck: candidate itemsets of size k
       Lk: frequent itemsets of size k
       L1 = {frequent items};
       for (k = 1; Lk ≠ ∅; k++) do begin
           Ck+1 = candidates generated from Lk;
           for each transaction t in database do
               increment the count of all candidates in Ck+1 that are contained in t
           Lk+1 = candidates in Ck+1 with min_support
       end
       return ∪k Lk;
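
The level-wise loop in the pseudo-code can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `apriori`, the example database `tdb`, and the simple union-based candidate generation are all assumptions made for the sketch.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support count} for all frequent itemsets (level-wise search)."""
    db = [frozenset(t) for t in db]
    # L1: count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items and
        # whose k-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in level for s in combinations(c, k)):
                candidates.add(c)
        # Count each candidate's support against the database.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical database (an assumption), with an absolute minimum support of 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, 2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops as soon as a level is empty, which mirrors the `Lk ≠ ∅` condition; the returned dictionary plays the role of the union of all Lk.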

  17. Important Details of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • How to count supports of candidates?

  18. Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}

  19. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order
       • Step 1: self-joining Lk-1
           insert into Ck
           select p.item1, p.item2, …, p.itemk-1, q.itemk-1
           from Lk-1 p, Lk-1 q
           where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
       • Step 2: pruning
           forall itemsets c in Ck do
               forall (k-1)-subsets s of c do
                   if (s is not in Lk-1) then delete c from Ck
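
The join and prune steps can be sketched in Python, using the L3 from the candidate-generation example as input. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for this illustration.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Generate size-k candidates from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    k = len(prev[0]) + 1 if prev else 0
    joined = []
    # Step 1: self-join. p and q share their first k-2 items and p's last item < q's.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Step 2: prune candidates that have an infrequent (k-1)-subset.
    return [c for c in joined
            if all(s in prev_set for s in combinations(c, k - 1))]

L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; acde is then dropped because its subset ade is not in L3, leaving abcd as the only candidate.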

  20. How to Count Supports of Candidates? • Why is counting supports of candidates a problem? • The total number of candidates can be huge • One transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash-tree • A leaf node of the hash-tree contains a list of itemsets and counts • An interior node contains a hash table • Subset function: finds all the candidates contained in a transaction
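
The role of the subset function can be illustrated with a deliberately simplified Python sketch. Note the substitution: a flat dictionary stands in for the hash-tree described above, so this shows what is counted, not how the hash-tree organizes the lookup. The database, the candidate list, and the name `count_supports` are assumptions.

```python
from itertools import combinations

def count_supports(db, candidates, k):
    """Count how many transactions contain each candidate k-itemset.

    Simplification: a flat dict replaces the hash-tree. For each transaction,
    the "subset function" enumerates its k-subsets and looks each one up.
    """
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in db:
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# Hypothetical database and candidate 2-itemsets (assumptions).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
counts = count_supports(tdb, C2, 2)
print(counts)
```

Enumerating every k-subset of a long transaction grows combinatorially, which is the cost a hash-tree is meant to reduce by visiting only the branches that can contain matching candidates.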

  21. Speeding up Association Rules: Direct Hashing and Pruning Technique • Thanks to Cheng Hong & Hu Haibo

  22. DHP: Reduce the Number of Candidates • A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Candidates: a, b, c, d, e • Hash entries: {ab, ad, ae} {bd, be, de} … • Frequent 1-itemsets: a, b, d, e • ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95

  23. Still challenging, the niche for DHP • DHP (Park ’95): Direct Hashing and Pruning • The set of candidate large 2-itemsets is huge • DHP: trim it using hashing • The transaction database is so large that one scan per iteration is costly • DHP: prune both the number of transactions and the number of items in each transaction after each iteration

  24. What does it look like? • [Figure: two flowcharts side by side] • Apriori: generate candidate set, count support • DHP: generate candidate set, count support, make new hash table

  25. Hash Table Construction • Consider 2-itemsets; all items are numbered i1, i2, …, in. Any pair (x, y) is hashed according to • Hash function: bucket # = h({x y}) = ((order of x)*10 + (order of y)) % 7 • Example: • Items = A, B, C, D, E; Order = 1, 2, 3, 4, 5 • h({C, E}) = (3*10 + 5) % 7 = 0 • Thus, {C, E} belongs to bucket 0.
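
The hash function on this slide is small enough to check directly. The Python sketch below uses the slide's item ordering and bucket count; the names `ORDER` and `bucket` are illustrative, and sorting the pair so that the lower-ordered item comes first is an assumption about how {x y} is read.

```python
ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # item numbering from the slide

def bucket(x, y, n_buckets=7):
    """h({x, y}) = ((order of x) * 10 + (order of y)) % n_buckets, with x before y."""
    x, y = sorted((x, y), key=ORDER.get)
    return (ORDER[x] * 10 + ORDER[y]) % n_buckets

print(bucket("C", "E"))  # (3 * 10 + 5) % 7
```

Because several pairs can share a bucket (for example, {A, D} also maps to bucket 0 since 14 % 7 = 0), a bucket count is only an upper bound on the support of any single pair in it.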

  26. How to trim candidate itemsets • In iteration k, hash all “appearing” (k+1)-itemsets into a hash table, counting every occurrence of an itemset in its corresponding bucket. • In iteration k+1, examine each candidate itemset to see whether its corresponding bucket count reaches the support threshold (a necessary condition)

  27. Example • Figure 1. An example transaction database

  28. Generation of C1 & L1 (1st iteration) • [Tables: C1, L1]

  29. Hash Table Construction • Find all 2-itemsets of each transaction

  30. Hash Table Construction (2) • Hash function: h({x y}) = ((order of x)*10 + (order of y)) % 7 • Hash table: • bucket 0: {C E}, {C E}, {A D} • bucket 1: {A E} • bucket 2: {B C}, {B C} • bucket 3: (empty) • bucket 4: {B E}, {B E}, {B E} • bucket 5: {A B} • bucket 6: {A C}, {C D}, {A C}
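
The bucket counting and the resulting trimming of C2 can be sketched in Python. The transaction database is an assumption: the example figure is missing from the transcript, and the four transactions below were chosen because they produce exactly the 13 hashed pairs listed in the hash table above. The minimum support of 2 and all names are likewise assumptions made for the sketch.

```python
from collections import Counter
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def bucket(pair):
    """h({x, y}) = ((order of x) * 10 + (order of y)) % 7, lower-ordered item first."""
    x, y = sorted(pair, key=ORDER.get)
    return (ORDER[x] * 10 + ORDER[y]) % 7

# Assumed database and threshold (not shown in the transcript).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
MIN_SUP = 2

# Iteration 1: count single items and, in the same scan, hash every 2-itemset.
item_counts = Counter()
bucket_counts = Counter()
for t in tdb:
    item_counts.update(t)
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

L1 = sorted(i for i, n in item_counts.items() if n >= MIN_SUP)

# Iteration 2: Apriori would test every pair of frequent items; DHP keeps a
# pair only if its bucket count also reaches the threshold.
C2_apriori = list(combinations(L1, 2))
C2_dhp = [p for p in C2_apriori if bucket_counts[bucket(p)] >= MIN_SUP]

print("bucket counts:", [bucket_counts[b] for b in range(7)])
print("C2 (Apriori):", C2_apriori)
print("C2 (DHP):    ", C2_dhp)
```

Under these assumptions the hash filter removes {A, B} and {A, E} before any support counting, because their buckets hold a single pair each. The filter is only a necessary condition, so the surviving candidates still need their true supports counted.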

  31. C2 Generation (2nd iteration)

  32. Effective Database Pruning • Apriori: does not prune the database; prunes Ck by support counting on the original database. • DHP: more efficient support counting can be achieved on the pruned database.

  33. Performance Comparison

  34. Performance Comparison (2)

  35. Conclusion • Effective hash-based algorithm for candidate itemset generation • Two-phase transaction database pruning • Much more efficient (time & space) than the Apriori algorithm
