Chapter 5 Frequent Patterns and Association Rule Mining

Presentation Transcript


  1. Chapter 5: Frequent Patterns and Association Rule Mining CISC6930

  2. Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications

  3. Transactional Data • Definitions: • An item: an article in a basket, or an attribute-value pair • A transaction: the set of items purchased in a basket; it may have a TID (transaction ID) • A transactional dataset: a set of transactions

  4. Itemsets & Frequent Itemsets • Itemset: a set of one or more items • k-itemset: X = {x1, …, xk} • (Absolute) support, or support count, of X: the number of transactions that contain X • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X's support is no less than a minsup threshold

  5. Association Rule • Find all the rules X → Y with minimum support and confidence • support, s: probability that a transaction contains X ∪ Y • s = P(X ∪ Y) = support count (X ∪ Y) / number of all transactions • confidence, c: conditional probability that a transaction having X also contains Y • c = P(Y|X) = support count (X ∪ Y) / support count (X)

  6. Example • [Figure: Venn diagram with regions "Customer buys beer", "Customer buys diaper", and "Customer buys both"] • Let minsup = 50%, minconf = 50% • Number of all transactions = 5, so min. support count = 5 × 50% = 2.5, rounded up to 3 • Items: Beer, Nuts, Diaper, Coffee, Eggs, Milk • Freq. Pat.: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3 • Association rules (support, confidence): • Beer → Diaper (60%, 100%) • Diaper → Beer (60%, 75%)
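The support and confidence figures on this slide can be reproduced with a few lines of Python. The transaction table itself is not part of the transcript, so the five baskets below are an assumption, chosen only because they are consistent with the counts on the slide; the helper names are illustrative.

```python
# Assumed transaction table (not in the transcript); chosen to match the
# slide's counts: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)


def support(itemset, db):
    """Relative support: fraction of transactions containing `itemset`."""
    return support_count(itemset, db) / len(db)


def confidence(lhs, rhs, db):
    """Confidence of lhs -> rhs: count(lhs ∪ rhs) / count(lhs)."""
    return support_count(set(lhs) | set(rhs), db) / support_count(lhs, db)


print(support({"Beer", "Diaper"}, transactions))       # 0.6
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0
print(confidence({"Diaper"}, {"Beer"}, transactions))  # 0.75
```

The two rules share the same support but differ in confidence because the denominator changes from the count of Beer (3) to the count of Diaper (4).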

  7. Use of Association Rules • Association rules do not necessarily represent causality or correlation between the two itemsets. • X → Y does not mean X causes Y: no causality • X → Y can be different from Y → X, unlike correlation • Association rules assist in basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. • What products were often purchased together? Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents?

  8. Computational Complexity of Frequent Itemset Mining • How many itemsets may be generated in the worst case? • The number of frequent itemsets to be generated is sensitive to the minsup threshold • When minsup is low, there is potentially an exponential number of frequent itemsets • The worst case is close to M^N, where M is the number of distinct items and N is the max length of transactions, when M is large. • The worst-case complexity vs. the expected probability • Ex. Suppose Walmart has 10^4 kinds of products • The chance to pick up one product: 10^-4 • The chance to pick up a particular set of 10 products: ~10^-40 • What is the chance that this particular set of 10 products is frequent, i.e., appears 10^3 times in 10^9 transactions?
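The closing question can be made concrete with exact arithmetic. The sketch below assumes, as the slide's numbers imply, that each product is picked independently with probability 10^-4; under that simplifying assumption it computes the expected number of transactions containing one particular set of 10 products. The variable names are illustrative.

```python
from fractions import Fraction

num_products = 10**4                 # M: kinds of products, from the slide
p_one = Fraction(1, num_products)    # chance of picking one given product
p_set = p_one ** 10                  # one particular set of 10 products,
                                     # assuming independent picks
num_transactions = 10**9

# Expected number of transactions that contain this particular 10-set.
expected_occurrences = p_set * num_transactions

print(expected_occurrences == Fraction(1, 10**31))  # True
print(expected_occurrences < 10**3)                 # True
```

An expected count of 10^-31 is far below the 10^3 occurrences needed, which is why the expected number of frequent itemsets is usually much smaller than the worst-case bound.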

  9. Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications

  10. Association Rule Mining • Major steps in association rule mining • Frequent itemsets computation • Rule derivation • Use of support and confidence to measure strength

  11. The Downward Closure Property • The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

  12. Apriori: A Candidate Generation & Test Approach • [Figure: itemset lattice over items A, B, C, D; level 1: A, B, C, D; level 2: AB, AC, AD, BC, BD, CD; level 3: ABC, ABD, ACD, BCD] • A frequent (formerly called large) itemset is an itemset whose support (S) is ≥ minSup. • Apriori pruning principle: • If any itemset is infrequent, its supersets should not be generated or tested!

  13. APRIORI • Method: • Initially, scan DB once to get the frequent 1-itemsets • Generate length (k+1) candidate itemsets from length k frequent itemsets • Test the candidates against DB • Terminate when no frequent or candidate set can be generated

  14. The Apriori Algorithm: An Example • Sup_min = 2 • [Figure: Database TDB → 1st scan → C1 → L1 → C2 → 2nd scan → C2 (with counts) → L2 → C3 → 3rd scan → L3]

  15. The Apriori Algorithm (Pseudo-Code)
      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k
      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1 that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;
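The pseudo-code above can be turned into a short runnable sketch. This is one straightforward reading of it, not an optimized implementation: candidates are formed as unions of two frequent k-itemsets and kept only if all of their k-subsets are frequent. The function name `apriori` and the sample baskets are assumptions (the baskets are chosen to match the counts in the beer and diaper example).

```python
from itertools import combinations


def apriori(db, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    db = [frozenset(t) for t in db]

    # L1: one scan to count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets of size k+1, kept only
        # if every k-subset is frequent (the Apriori pruning principle).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)

        # Scan the database once to count every candidate.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Assumed baskets, consistent with the counts in the beer/diaper example.
baskets = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
for itemset, count in apriori(baskets, min_count=3).items():
    print(sorted(itemset), count)
```

With a minimum support count of 3, the result contains the four frequent single items and {Beer, Diaper}, and the loop stops once no candidate 3-itemset can be formed.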

  16. Implementation of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}

  17. Self-joining • Ck is generated by joining Lk-1 with itself • Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. • l1 from Lk-1 = {l1[1], l1[2], …, l1[k-2], l1[k-1]} • l2 from Lk-1 = {l2[1], l2[2], …, l2[k-2], l2[k-1]} • Two itemsets are joinable only when their first k-2 items are the same and l1[k-1] < l2[k-1]
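The join condition and the pruning step can be sketched as follows, using the L3 example from the candidate-generation slide. Itemsets are represented as lexicographically sorted tuples; the function name `generate_candidates` is illustrative.

```python
from itertools import combinations


def generate_candidates(prev_level):
    """Self-join and prune.

    prev_level: set of sorted tuples, all of length k-1 (the frequent
    (k-1)-itemsets). Returns the list of candidate k-itemsets.
    """
    prev = sorted(prev_level)
    joined = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Joinable: same first k-2 items, and l1's last item < l2's.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                joined.append(l1 + (l2[-1],))

    # Prune: drop any candidate with an infrequent (k-1)-subset.
    return [
        c for c in joined
        if all(s in prev_level for s in combinations(c, len(c) - 1))
    ]


L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde; acde is then pruned because its subset ade is not in L3, leaving C4 = {abcd}.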

  18. Further Improvement of the Apriori Method • Major computational challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink number of candidates • Facilitate support counting of candidates • Completeness: any association rule mining algorithm should get the same set of frequent itemsets.

  19. Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Scan 1: partition the database and find local frequent patterns • Since each sub-database is much smaller than the original database, all of its frequent itemsets can be found in one scan. • Scan 2: consolidate global frequent patterns • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
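A minimal sketch of the two-scan idea, assuming each partition is small enough to mine by brute force in memory. This illustrates the principle only and is not the algorithm from the cited paper; the function names and the sample baskets are assumptions. The relative threshold is passed as a `Fraction` to avoid floating-point rounding when local minimum counts are computed.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def local_frequent(part, min_count):
    """Brute-force every itemset in a small partition meeting min_count."""
    counts = {}
    for t in part:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def partition_mine(db, min_sup, num_parts):
    """db: list of item sets; min_sup: relative support as a Fraction."""
    size = ceil(len(db) / num_parts)
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: the union of locally frequent itemsets is the candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, ceil(min_sup * len(part)))

    # Scan 2: count each candidate over the whole database.
    result = {}
    for c in candidates:
        n = sum(1 for t in db if set(c) <= t)
        if n >= min_sup * len(db):
            result[c] = n
    return result


# Assumed baskets, consistent with the counts in the beer/diaper example.
baskets = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(partition_mine(baskets, Fraction(3, 5), num_parts=2))
```

Scan 1 may admit itemsets that are only locally frequent (here, several itemsets over Nuts, Eggs, and Milk from the second partition); scan 2 removes them, so the final result equals what a single global pass would give.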

  20. DHP: Reduce the Number of Candidates • A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Candidates: a, b, c, d, e • Frequent 1-itemset: a, b, d, e • Using some hash function, get hash entries: {ab, ad, ae} with bucket count as 3, {bd, bd, be, de} with bucket count as 4… • ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
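The bucket-count filter can be sketched for 2-itemsets as below. The hash function `bucket_of` is a hypothetical stand-in (any deterministic hash of a pair works), and the bucket count, function names, and sample baskets are assumptions for illustration.

```python
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Hypothetical deterministic hash of an item pair into a bucket."""
    return sum(ord(ch) for item in pair for ch in item) % num_buckets


def dhp_candidate_pairs(db, min_count, num_buckets=7):
    """Candidate 2-itemsets that survive the hash-bucket filter."""
    item_counts = {}
    buckets = [0] * num_buckets
    for t in db:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        # While counting items, also hash every pair in the transaction.
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, num_buckets)] += 1

    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    # A pair whose bucket total is below the threshold cannot be frequent.
    return [
        p for p in combinations(frequent_items, 2)
        if buckets[bucket_of(p, num_buckets)] >= min_count
    ]


# Assumed baskets, consistent with the counts in the beer/diaper example.
baskets = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(dhp_candidate_pairs(baskets, min_count=3))
```

Because several pairs can share a bucket, the filter never discards a truly frequent pair, but it may keep some infrequent ones; those are eliminated by the normal support count in the next scan.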

  21. Transaction Reduction • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. • Such a transaction could be marked or removed from further consideration of (k+1)-itemsets. • Less support counting
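A minimal sketch of this filter, assuming the frequent k-itemsets from the current pass are already known. The function name `reduce_transactions` and the sample baskets are illustrative.

```python
def reduce_transactions(db, frequent_k):
    """Keep only transactions that contain at least one frequent k-itemset.

    Transactions dropped here cannot contain any frequent (k+1)-itemset,
    so later scans can skip them.
    """
    return [t for t in db if any(s <= t for s in frequent_k)]


# Assumed baskets, consistent with the counts in the beer/diaper example.
baskets = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
frequent_2 = [{"Beer", "Diaper"}]  # the only frequent 2-itemset at count 3
remaining = reduce_transactions(baskets, frequent_2)
print(len(remaining))  # 3
```

The last two baskets contain no frequent 2-itemset, so they are skipped when counting 3-itemset candidates.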

  22. Sampling for Frequent Patterns • Select a sample of the original database, and mine frequent patterns within the sample using Apriori with a lower support count threshold than min. support. • Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked • Example: check abcd instead of ab, ac, …, etc. • Scan the database again to find missed frequent patterns • H. Toivonen. Sampling large databases for association rules. In VLDB'96

  23. Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications

  24. Closed Patterns and Max-Patterns • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns! • Solution: Mine closed patterns and max-patterns instead • An itemset X is closed frequent if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99) • Closed pattern is a lossless compression of freq. patterns • Reducing the # of patterns and rules • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)

  25. Closed Patterns and Max-Patterns • Exercise. DB = {<a1, …, a100>, <a1, …, a50>} • Min_sup = 1. • What is the set of closed itemsets? • <a1, …, a100>: 1 • <a1, …, a50>: 2 • What is the set of max-patterns? • <a1, …, a100>: 1 • What is the set of all patterns? • !! • Max-patterns lose the information of actual support counts.
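The exercise can be checked by brute force on a scaled-down database. Enumerating 2^100 - 1 itemsets is infeasible, so the sketch below assumes a smaller analogue, <a1, …, a4> and <a1, a2>, which has the same structure; the function names are illustrative.

```python
from itertools import combinations


def all_frequent(db, min_count):
    """Brute-force {frozenset: count} of all frequent itemsets."""
    counts = {}
    for t in db:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                key = frozenset(s)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}


def closed_patterns(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        x: c for x, c in freq.items()
        if not any(x < y and c == cy for y, cy in freq.items())
    }


def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: c for x, c in freq.items() if not any(x < y for y in freq)}


# Scaled-down analogue of the exercise: a1..a4 and a1..a2.
db = [{"a1", "a2", "a3", "a4"}, {"a1", "a2"}]
freq = all_frequent(db, min_count=1)

print(len(freq))               # 15 frequent itemsets (2^4 - 1)
print(closed_patterns(freq))   # two closed itemsets
print(max_patterns(freq))      # one max-pattern
```

The two closed itemsets, {a1, a2}:2 and {a1, …, a4}:1, are enough to recover the support of all 15 frequent itemsets, while the single max-pattern only tells us which itemsets are frequent.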

  26. MaxMiner: Mining Max-patterns • 1st scan: find frequent items • A, B, C, D, E • 2nd scan: find support for • AB, AC, AD, AE, ABCDE • BC, BD, BE, BCDE • CD, CE, CDE, DE • (ABCDE, BCDE, and CDE are the potential max-patterns) • Max-Miner uses pruning based on subset infrequency, as does Apriori, but it also uses pruning based on superset frequency. • Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan • R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98

  27. Complete set-enumeration tree • [Figure: set-enumeration tree over items 1, 2, 3; root: { }; level 1: 1, 2, 3; level 2: {1,2}, {1,3}, {2,3}; level 3: {1,2,3}]
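A set-enumeration tree can be generated with a short recursive sketch: each node extends its prefix only with items that come later in the item order, so every itemset appears exactly once. The function name is illustrative.

```python
def set_enumeration_tree(items, prefix=()):
    """Yield every node of the set-enumeration tree in depth-first order."""
    yield prefix
    for i, item in enumerate(items):
        # Children may only use items that come after `item` in the order.
        yield from set_enumeration_tree(items[i + 1:], prefix + (item,))


print(list(set_enumeration_tree((1, 2, 3))))
# [(), (1,), (1, 2), (1, 2, 3), (1, 3), (2,), (2, 3), (3,)]
```

For three items the tree has 2^3 = 8 nodes, matching the figure: the root, three 1-itemsets, three 2-itemsets, and {1,2,3}.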

  28. Mining Closed Frequent Itemsets (CFI) • Strategies • Item merging: if every transaction containing X also contains Y, but not any proper superset of Y, then X ∪ Y is a CFI, and there is no need to search for CFIs that contain X but not Y. • Sub-itemset pruning: if X is a proper subset of a CFI Y, and support(X) = support(Y), then X and all of X's descendants can be pruned. • Perform superset-checking and subset-checking

  29. Mining Frequent Closed Patterns: CLOSET • [Table: transaction database] • Min_sup = 2, Min_conf = 50% • Using the Apriori algorithm, 20 frequent itemsets are found; of these, the closed ones are {acdf}(2), {cef}(3), {ae}(2), {cf}(4), {a}(3), {e}(4) • J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.

  30. Mining Frequent Closed Patterns: CLOSET • Min_sup = 2 • Flist: list of all frequent items in support-ascending order • Flist: d(2)-a(3)-f(4)-e(4)-c(4) • Divide search space • Patterns having d(2): {10:cefa}, {40:cfa} • Every transaction having d(2) also has cfa(2) ⇒ {cfad}(2) is a frequent closed pattern • Patterns having a(3) but no d: {10:cef}, {20:e}, {40:cf} • {a}(3) is a frequent closed pattern since no item occurs in all of the above. • Find frequent closed patterns recursively: {ea}(2) • Patterns having f but no a, d: {10,30,50:ce}, {40:c} • {cf}(4) is a frequent closed pattern since all of the above have c. • {cef}(3) is also a frequent closed pattern. • Patterns having e but not a, d, f: {10,30,50:c} • {e}(4) is a frequent closed pattern. • Patterns having only c.
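The counts on these two CLOSET slides can be cross-checked by brute force. The database table is not in the transcript; the five transactions below are reconstructed from the projected databases listed on slide 30, so any infrequent items in the original table are unknown and omitted. The check enumerates all itemsets directly and is not the CLOSET algorithm itself.

```python
from itertools import combinations

# Reconstructed from the projected databases on slide 30 (an assumption);
# items that are infrequent in the original table are not recoverable.
db = {
    10: set("acdef"),
    20: set("ae"),
    30: set("cef"),
    40: set("acdf"),
    50: set("cef"),
}
min_sup = 2

# Count every itemset that occurs in at least one transaction.
counts = {}
for items in db.values():
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            counts[s] = counts.get(s, 0) + 1

frequent = {s: c for s, c in counts.items() if c >= min_sup}

# Closed: no proper superset with the same support.
closed = {
    "".join(s): c
    for s, c in frequent.items()
    if not any(set(s) < set(y) and c == cy for y, cy in frequent.items())
}

print(len(frequent))  # 20
print(closed)
```

The brute-force check finds 20 frequent itemsets and the same six closed ones that CLOSET derives from the projected databases: acdf(2), cef(3), ae(2), cf(4), a(3), e(4).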

  31. Data Format • Horizontal data format • TID-itemset {TID: itemset} • Vertical data format • Item-TID-set {item: TID_set} • Mining could be performed by intersecting the TID_set of every pair of frequent items.
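Converting between the two formats, and counting support by TID_set intersection, can be sketched as follows. The TIDs and baskets are assumptions chosen to match the counts in the beer and diaper example; the function name `to_vertical` is illustrative.

```python
def to_vertical(horizontal):
    """Convert {TID: itemset} into {item: TID_set}."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


# Assumed horizontal database, consistent with the beer/diaper example.
horizontal = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
vertical = to_vertical(horizontal)

# Support of {Beer, Diaper} is the size of the TID_set intersection.
tids = vertical["Beer"] & vertical["Diaper"]
print(sorted(tids), len(tids))  # [10, 20, 30] 3
```

In the vertical format, the support of a longer itemset is obtained by intersecting the TID_sets of its parts, so no further scan of the original transactions is needed.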
