
Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal)


Presentation Transcript


  1. Mining Generalized Association Rules, Ramakrishnan Srikant and Rakesh Agrawal. Data Mining Seminar, spring semester 2003. Prof. Amos Fiat. Student: Idit Haran

  2. Outline • Motivation • Terms & Definitions • Interest Measure • Algorithms for mining generalized association rules • Comparison • Conclusions

  3. Motivation • Find association rules of the form: Diapers ⇒ Beer • Different kinds of diapers: Huggies/Pampers, S/M/L, etc. • Different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc. • The information on the bar-code is of the form: Huggies Diapers, M ⇒ Heineken Beer in a bottle • A rule at this bar-code level is not interesting, and probably will not have minimum support.

  4. Taxonomy • is-a hierarchies • [Taxonomy diagram: Clothes is the parent of Outerwear and Shirts; Outerwear is the parent of Jackets and Ski Pants; Footwear is the parent of Shoes and Hiking Boots]

  5. Taxonomy - Example • Say we found the rule Outerwear ⇒ Hiking Boots with minimum support and confidence. • The rule Jackets ⇒ Hiking Boots may not have minimum support. • The rule Clothes ⇒ Hiking Boots may not have minimum confidence.

  6. Taxonomy • Users are interested in generating rules that span different levels of the taxonomy. • Rules at lower levels may not have minimum support. • The taxonomy can be used to prune uninteresting or redundant rules. • Multiple taxonomies may be present, for example: category, price (cheap, expensive), "items-on-sale", etc. • Multiple taxonomies may be modeled as a forest, or a DAG (see the sketch below).
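A minimal sketch (not from the paper; the item names and the parent mapping are illustrative) of how a forest or DAG of taxonomies can be stored as a child-to-parents mapping, with ancestors computed by walking up the is-a edges:

```python
# Hypothetical edges: each item maps to its parents. A DAG allows several parents,
# e.g. an item can sit under both a category taxonomy and an "items-on-sale" taxonomy.
PARENTS = {
    "Jackets": ["Outerwear"],
    "Ski Pants": ["Outerwear"],
    "Outerwear": ["Clothes"],
    "Shirts": ["Clothes"],
    "Shoes": ["Footwear"],
    "Hiking Boots": ["Footwear"],
}

def ancestors(item, parents=PARENTS):
    """All ancestors of `item`, found by walking up the is-a edges."""
    seen, stack = set(), list(parents.get(item, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

print(ancestors("Jackets"))  # {'Outerwear', 'Clothes'}
```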

  7. Notations • [Diagram of a taxonomy: an edge denotes an is-a relationship; p is a parent with children c1 and c2; the nodes above an item z are its ancestors (marked with ^), and the nodes below it are its descendants]

  8. Notations • I = {i1, i2, …, im} - the set of items. • T - a transaction, a set of items T ⊆ I (we expect the items in T to be leaves of the taxonomy). • D - the set of transactions. • T supports item x if x is in T or x is an ancestor of some item in T. • T supports X ⊆ I if it supports every item in X.

  9. Notations • A generalized association rule: X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X. • The rule X ⇒ Y has confidence c in D if c% of the transactions in D that support X also support Y. • The rule X ⇒ Y has support s in D if s% of the transactions in D support X ∪ Y. (A sketch of these definitions follows below.)
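A small sketch of these definitions (the helper names and the toy data are mine, not the authors'): a transaction supports an itemset when its ancestor-extended form contains it, and rule support/confidence follow directly:

```python
def extend(transaction, ancestors):
    """The transaction plus the ancestors of each of its items (T')."""
    ext = set(transaction)
    for item in transaction:
        ext |= ancestors.get(item, set())
    return ext

def support(itemset, transactions, ancestors):
    """Fraction of transactions whose extended form contains `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= extend(t, ancestors))
    return hits / len(transactions)

def confidence(x, y, transactions, ancestors):
    """Confidence of X => Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions, ancestors) / support(x, transactions, ancestors)

# Toy usage with the Clothes/Footwear taxonomy:
ANC = {"Jackets": {"Outerwear", "Clothes"}, "Ski Pants": {"Outerwear", "Clothes"},
       "Shirts": {"Clothes"}, "Shoes": {"Footwear"}, "Hiking Boots": {"Footwear"}}
D = [{"Jackets", "Hiking Boots"}, {"Shirts"}, {"Ski Pants", "Hiking Boots"}, {"Shoes"}]
print(support({"Outerwear", "Hiking Boots"}, D, ANC))       # 0.5
print(confidence({"Outerwear"}, {"Hiking Boots"}, D, ANC))  # 1.0
```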

  10. Problem Statement • To find all generalized association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively.

  11. Example • Recall the taxonomy: [Clothes is the parent of Outerwear and Shirts; Outerwear is the parent of Jackets and Ski Pants; Footwear is the parent of Shoes and Hiking Boots]

  12. Example • minsup = 30%, minconf = 60%

  13. Observation 1 • If the set {x, y} has minimum support, so do {x^, y}, {x, y^} and {x^, y^}. • For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.

  14. Observation 2 • If the rule x ⇒ y has minimum support and confidence, only x ⇒ y^ is guaranteed to have both minsup and minconf. • Example: the rule Outerwear ⇒ Hiking Boots has minsup and minconf. • Therefore the rule Outerwear ⇒ Footwear also has both minsup and minconf.

  15. Observation 2 - cont. • However, while the rules x^ ⇒ y and x^ ⇒ y^ will have minsup, they may not have minconf. • For example: the rules Clothes ⇒ Hiking Boots and Clothes ⇒ Footwear have minsup, but not minconf.

  16. Interesting Rules - Previous Work • A rule X ⇒ Y is not interesting if: support(X ∪ Y) ≈ support(X) • support(Y) • Previous work does not consider the taxonomy. • This interest measure pruned less than 1% of the rules on a real database. (A tiny sketch of the test follows below.)
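A tiny sketch (illustrative numbers) of this earlier, taxonomy-free interest test, which compares a rule's support to the value expected under independence:

```python
def interesting_under_independence(supp_xy, supp_x, supp_y):
    """X => Y is deemed uninteresting when support(X u Y) is about support(X) * support(Y)."""
    return supp_xy > supp_x * supp_y

print(interesting_under_independence(0.10, 0.3, 0.2))  # True: 0.10 > 0.3 * 0.2
print(interesting_under_independence(0.05, 0.3, 0.2))  # False: at or below the independent estimate
```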

  17. Interesting Rules - Using the Taxonomy • Milk ⇒ Cereal (8% support, 70% confidence) • Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk. • We therefore expect Skim Milk ⇒ Cereal to have 2% support and 70% confidence.

  18. R-Interesting Rules • A rule X ⇒ Y is R-interesting w.r.t. an ancestor rule X^ ⇒ Y^ if: real support(X ⇒ Y) > R × expected support of (X ⇒ Y) based on (X^ ⇒ Y^), or real confidence(X ⇒ Y) > R × expected confidence of (X ⇒ Y) based on (X^ ⇒ Y^). • With R = 1.1, about 40-55% of the rules were pruned. (See the sketch below.)
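A hedged sketch of the R-interest test. The function names are mine, and the share-based expectation is a paraphrase of the Milk / Skim Milk example on the previous slide rather than the paper's exact formula:

```python
def expected_support(anc_support, shares):
    """anc_support: support of the ancestor rule X^ => Y^.
    shares: for each specialized item, its share of its generalized parent
    (e.g. Skim Milk is 25% of Milk sales)."""
    exp = anc_support
    for share in shares:
        exp *= share
    return exp

def is_r_interesting(real_support, anc_support, shares, R=1.1):
    """R-interesting w.r.t. the ancestor rule if the real support is more than
    R times the expected support (the same test can be applied to confidence)."""
    return real_support > R * expected_support(anc_support, shares)

# Milk => Cereal has 8% support; Skim Milk is 25% of Milk sales,
# so Skim Milk => Cereal is expected to have 2% support.
print(expected_support(0.08, [0.25]))          # 0.02
print(is_r_interesting(0.021, 0.08, [0.25]))   # False: 2.1% <= 1.1 * 2%
print(is_r_interesting(0.040, 0.08, [0.25]))   # True:  4.0% >  2.2%
```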

  19. Problem Statement (new) • To find all generalized R-interesting association rules (R is a user-specified minimum interest called min-interest) that have support and confidence greater than minsup and minconf respectively.

  20. Algorithms - 3 Steps 1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets. 2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB ⇒ CD) = support(ABCD)/support(AB). 3. Prune all uninteresting rules from this set. *All the presented algorithms only implement step 1. (A sketch of step 2 follows below.)
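A minimal sketch of step 2 (the data layout and names are assumptions): every frequent itemset is split into an antecedent and a consequent, and rules below minconf are discarded:

```python
from itertools import combinations

def generate_rules(freq_supports, minconf):
    """freq_supports: dict mapping frozenset(itemset) -> support.
    Yields (antecedent, consequent, confidence) for rules that reach minconf."""
    for itemset, supp in freq_supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if antecedent not in freq_supports:
                    continue
                conf = supp / freq_supports[antecedent]
                if conf >= minconf:
                    yield antecedent, itemset - antecedent, conf

# conf(AB => CD) = support(ABCD) / support(AB) = 0.3 / 0.4 = 0.75
supports = {frozenset("AB"): 0.4, frozenset("ABCD"): 0.3}
for ant, cons, conf in generate_rules(supports, minconf=0.6):
    print(sorted(ant), "=>", sorted(cons), round(conf, 2))
```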


  22. Algorithms (step 1) • Input: Database, Taxonomy • Output: All frequent itemsets • 3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge

  23. Algorithm Basic - Main Idea • Is itemset X frequent? • Does transaction T support X? (X contains items from different levels of the taxonomy, T contains only leaves.) • Let T' = T + ancestors(T). • Answer: T supports X ⟺ X ⊆ T'.

  24. Algorithm Basic (annotations from the pseudocode) • Count item occurrences. • Generate new k-itemset candidates. • Add all ancestors of each item in t to t, removing any duplicates. • Find the support of all the candidates. • Keep only those with support over minsup. (A condensed sketch follows below.)
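A condensed sketch of Basic along these steps. All names are illustrative, and candidate generation here is a simplified stand-in for the join/prune step shown on the next slide:

```python
from collections import defaultdict
from itertools import combinations

def basic(transactions, ancestors, minsup_count):
    # Extend every transaction with the ancestors of its items.
    extended = []
    for t in transactions:
        ext = set(t)
        for item in t:
            ext |= ancestors.get(item, set())
        extended.append(ext)
    # Pass 1: count item occurrences and keep the frequent 1-itemsets.
    counts = defaultdict(int)
    for t in extended:
        for item in t:
            counts[frozenset([item])] += 1
    current = {i for i, c in counts.items() if c >= minsup_count}
    frequent = {i: counts[i] for i in current}
    k = 2
    while current:
        # Generate k-itemset candidates (simplified: all k-combinations of the items
        # seen in frequent (k-1)-itemsets whose (k-1)-subsets are all frequent).
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))]
        # Count candidate support against the extended transactions.
        cand_counts = defaultdict(int)
        for t in extended:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        # Keep only candidates with support over minsup.
        current = {c for c, n in cand_counts.items() if n >= minsup_count}
        frequent.update({c: cand_counts[c] for c in current})
        k += 1
    return frequent

ANC = {"Jackets": {"Outerwear", "Clothes"}, "Ski Pants": {"Outerwear", "Clothes"},
       "Shirts": {"Clothes"}, "Shoes": {"Footwear"}, "Hiking Boots": {"Footwear"}}
D = [{"Jackets", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"}, {"Shirts"}, {"Shoes"}]
print(basic(D, ANC, minsup_count=2))
```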

  25. Candidate Generation • Join step: p and q are two frequent (k-1)-itemsets that are identical in their first k-2 items; join them by adding the last item of q to p. • Prune step: check all the (k-1)-subsets of each candidate, and remove any candidate that has a non-frequent subset. (See the sketch below.)
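A sketch of this join-and-prune step (apriori-gen style; the function name and toy data are assumptions):

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """prev_frequent: set of frozensets, all of size k-1."""
    prev = [sorted(s) for s in prev_frequent]
    k = len(prev[0]) + 1
    candidates = set()
    # Join step: p and q share the first k-2 items; append q's last item to p.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(frozenset(p + [q[-1]]))
    # Prune step: drop candidates with any (k-1)-subset that is not frequent.
    return {c for c in candidates
            if all(frozenset(sub) in prev_frequent for sub in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(generate_candidates(L2))  # {frozenset({'A', 'B', 'C'})}; BCD is pruned since CD is not frequent
```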

  26. Optimization 1 - Filtering the ancestors added to transactions • We only need to add to transaction t the ancestors that appear in one of the candidates. • If the original item does not appear in any candidate, it can be dropped from the transaction. • Example: if the only candidate is {Clothes, Shoes}, a transaction t = {Jacket, …} can be replaced with {Clothes, …}.

  27. Optimization 2 - Pre-computing ancestors • Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item. • At the same time, we can drop ancestors that are not contained in any of the candidates.

  28. Optimization 3 - Pruning itemsets containing an item and its ancestor • If we have {Jacket} and {Outerwear}, we will generate the candidate {Jacket, Outerwear}, which is not interesting: support({Jacket}) = support({Jacket, Outerwear}). • Deleting {Jacket, Outerwear} at k=2 ensures it will not reappear at k>2 (because of the prune step of the candidate-generation method). • Therefore, we only need to prune candidates containing an item and its ancestor at k=2; in later passes no candidate will include an item together with its ancestor.

  29. Algorithm Cumulate (annotations from the pseudocode) • Optimization 2: compute the set of all ancestors T* from the taxonomy T. • Optimization 3: delete any candidate in C2 that consists of an item and its ancestor. • Optimization 1: delete any ancestors in T* that are not present in any of the candidates in Ck. • Optimization 2: for each item x ∈ t, add all ancestors of x in T* to t, then remove any duplicates in t. (A sketch of these optimizations follows below.)
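A sketch of the Cumulate-specific optimizations listed above (the helper names are mine): pre-computed ancestor sets filtered to candidate items, and pruning of item-plus-ancestor pairs at k = 2:

```python
def filtered_ancestors(ancestors, candidates):
    """Optimizations 1+2: pre-computed ancestor sets, restricted to ancestors
    that actually appear in the current candidates."""
    needed = {i for c in candidates for i in c}
    return {item: {a for a in ancs if a in needed} for item, ancs in ancestors.items()}

def prune_item_ancestor_pairs(c2, ancestors):
    """Optimization 3 (k = 2 only): {Jacket, Outerwear} has the same support as
    {Jacket}, so such pairs are dropped."""
    return {c for c in c2
            if not any(b in ancestors.get(a, set()) for a in c for b in c)}

ANC = {"Jackets": {"Outerwear", "Clothes"}, "Shoes": {"Footwear"}}
C2 = {frozenset(["Jackets", "Outerwear"]), frozenset(["Jackets", "Shoes"])}
print(prune_item_ancestor_pairs(C2, ANC))                        # keeps only {Jackets, Shoes}
print(filtered_ancestors(ANC, {frozenset(["Clothes", "Shoes"])}))  # {'Jackets': {'Clothes'}, 'Shoes': set()}
```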

  30. Stratification • Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes} • If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outerwear, Shoes} or {Jacket, Shoes}. • We count in steps: step 1: count {Clothes, Shoes}, and if it has minsup - step 2: count {Outerwear, Shoes}, and if it has minsup - step 3: count {Jacket, Shoes}.

  31. Version 1: Stratify • Depth of an itemset: itemsets with no parents have depth 0; otherwise depth(X) = max({depth(X^) | X^ is a parent of X}) + 1. • The algorithm: count all itemsets C0 at depth 0; delete candidates that are descendants of itemsets in C0 that did not have minsup; count the remaining itemsets at depth 1 (C1); delete candidates that are descendants of itemsets in C1 that did not have minsup; count the remaining itemsets at depth 2 (C2), etc. (See the depth sketch below.)
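A minimal sketch of Stratify's depth bookkeeping, under the assumption (as above) that a parent of an itemset is formed by generalizing one of its items, and that only parents which are themselves candidates count toward the depth:

```python
def itemset_parents(itemset, parents):
    """Parent itemsets: replace one item by one of its direct taxonomy parents."""
    out = set()
    for item in itemset:
        for p in parents.get(item, []):
            out.add(frozenset(itemset - {item} | {p}))
    return out

def depth(itemset, parents, candidates, memo=None):
    """Depth 0 if no parent itemset is a candidate, else 1 + max depth of candidate parents."""
    memo = {} if memo is None else memo
    if itemset in memo:
        return memo[itemset]
    ps = {p for p in itemset_parents(itemset, parents) if p in candidates}
    memo[itemset] = 0 if not ps else 1 + max(depth(p, parents, candidates, memo) for p in ps)
    return memo[itemset]

PARENTS = {"Jackets": ["Outerwear"], "Outerwear": ["Clothes"], "Shoes": ["Footwear"]}
CANDS = {frozenset(["Clothes", "Shoes"]), frozenset(["Outerwear", "Shoes"]),
         frozenset(["Jackets", "Shoes"])}
for c in CANDS:
    print(sorted(c), depth(c, PARENTS, CANDS))
# {Clothes, Shoes} -> 0, {Outerwear, Shoes} -> 1, {Jackets, Shoes} -> 2
```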

  32. Tradeoff & Optimizations • The tradeoff is between the number of candidates counted and the number of passes over the DB: counting each depth in a separate pass is one extreme, Cumulate (everything in one pass) is the other. • Optimization 1: count multiple depths together, from a certain level down. • Optimization 2: count more than 20% of the candidates per pass.

  33. Version 2: Estimate • Estimate candidate support using a sample. • 1st pass (C'k): count candidates that are expected to have minsup (a candidate is expected to have minsup if its support in the sample is at least 0.9×minsup), and count candidates whose parents are expected to have minsup. • 2nd pass (C"k): count the children of the candidates in C'k that were not expected to have minsup. (See the sketch below.)
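A sketch of how Estimate could split the candidates after the sample pass. The names are illustrative; the 0.9 × minsup threshold is the one quoted above:

```python
def split_candidates(candidates, sample_support, parent_of, minsup):
    """Returns (c_prime, deferred): c_prime is counted over the full database in
    this pass; deferred candidates are counted later (in EstMerge, together with
    the next pass's candidates)."""
    expected = {c for c in candidates if sample_support.get(c, 0.0) >= 0.9 * minsup}
    promoted = {c for c in candidates
                if any(p in expected for p in parent_of.get(c, []))}
    c_prime = expected | promoted
    return c_prime, set(candidates) - c_prime

CANDS = {"A", "B", "C"}                     # stand-ins for itemsets
SAMPLE = {"A": 0.06, "B": 0.03, "C": 0.02}  # support measured on the sample
PARENT = {"B": ["A"], "C": ["B"]}
print(split_candidates(CANDS, SAMPLE, PARENT, minsup=0.05))
# ({'A', 'B'}, {'C'}): A looks frequent in the sample, B is counted because its
# parent A is expected to be frequent, C is deferred.
```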

  34. Example for Estimate • minsup = 5%

  35. Version 3: EstMerge • Motivation: eliminate the 2nd pass of algorithm Estimate. • Implementation: count the candidates of C"k together with the candidates in C'k+1. • Restriction: to create C'k+1 we assume that all the candidates in C"k have minsup. • The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.

  36. Algorithm EstMerge (annotations from the pseudocode) • Count item occurrences and generate a sample Ds of the database in the first pass. • Generate new k-itemset candidates Ck from Lk-1 ∪ C"k-1. • Estimate the support of the candidates in Ck by making a pass over Ds. • C'k = candidates that are expected to have minsup + candidates whose parents are expected to have minsup. • Find the support of C'k ∪ C"k-1 by making a pass over D. • Delete candidates in Ck whose ancestors in C'k don't have minsup; the remaining candidates in Ck that are not in C'k form C"k. • Lk = all candidates in C'k with minsup, plus all candidates in C"k with minsup (added in a later pass).

  37. Stratify - Variants

  38. Size of Sample • Pr[support in sample < a] (chart)

  39. Size of Sample

  40. Performance Evaluation • Compare the running time of 3 algorithms: Basic, Cumulate and EstMerge • On synthetic data: effect of each parameter on performance • On real data: Supermarket Data, Department Store Data

  41. Synthetic Data Generation

  42. Minimum Support

  43. Number of Transactions

  44. Fanout

  45. Number of Items

  46. Reality Check • Supermarket Data • 548,000 items • Taxonomy: 4 levels, 118 roots • ~1.5 million transactions • Average of 9.6 items per transaction • Department Store Data • 228,000 items • Taxonomy: 7 levels, 89 roots • 570,000 transactions • Average of 4.4 items per transaction

  47. Results

  48. Conclusions • Cumulate and EstMerge were 2 to 5 times faster than Basic on all the synthetic datasets; on the supermarket database they were 100 times faster! • EstMerge was ~25-30% faster than Cumulate. • Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.

  49. Summary • The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy. • The obvious solution (algorithm Basic) is not very fast. • New algorithms that exploit the taxonomy are much faster. • The taxonomy can also be used to prune uninteresting rules.

  50. THE END
