
Presentation Transcript


  1. Association Rules AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  3. The Market-Basket Model • A large set of items • e.g., things sold in a supermarket • A large set of baskets, each of which is a small set of the items • e.g., each basket lists the things one customer buys in a single trip • Goal: Find “interesting” connections between items • Can be used to model any many-to-many relationship, not just in the retail setting

  4. The Frequent Itemsets • Simplest question: Find sets of items that appear together “frequently” in baskets

  5. Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold

  6. Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold Example: Set minsup = 0.5. The frequent 2-itemsets are: {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}

  7. Definition: Association Rule • Association Rule • An if-then rule: an implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions that contain X
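To make the two metrics concrete, here is a minimal Python sketch (an addition for illustration, not part of the original slides). The five-basket toy table is an assumption chosen so that the numbers match those quoted in this deck, e.g. s({Milk, Bread, Diaper}) = 2/5 and c({Milk, Diaper} → {Beer}) ≈ 0.67.

```python
# Minimal sketch: support and confidence of a rule X -> Y over a toy basket list.
# The transaction table below is assumed; it reproduces the counts on the slides.

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = s(X union Y) / s(X): how often Y occurs when X occurs."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4  (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```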

  8. Applications of Association Rules 1. Items = products; Baskets = sets of products a customer bought; if many people buy beer and diapers together → run a sale on diapers and raise the price of beer. 2. Items = documents; Baskets = documents containing a similar sentence; items that appear together too often could represent plagiarism. 3. Items = words; Baskets = Web pages; co-occurrence of relatively rare words may indicate an interesting relationship.

  9. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  10. Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds → Computationally prohibitive!

  11. Mining Association Rules Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67), {Milk, Beer} → {Diaper} (s=0.4, c=1.0), {Diaper, Beer} → {Milk} (s=0.4, c=0.67), {Beer} → {Milk, Diaper} (s=0.4, c=0.67), {Diaper} → {Milk, Beer} (s=0.4, c=0.5), {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements

  12. Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support ≥ minsup • Rule Generation • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  13. Computational Complexity • Given d unique items in all transactions: • Total number of itemsets = 2^d - 1 • Total number of possible association rules: R = 3^d - 2^(d+1) + 1. If d = 6, R = 602 rules. d (#items) can be 100K (Wal-Mart) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
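The rule-count formula is easy to sanity-check with a couple of lines of Python (an illustrative addition):

```python
# R(d) = 3^d - 2^(d+1) + 1: number of possible association rules over d items.
def num_rules(d):
    return 3**d - 2**(d + 1) + 1

print(num_rules(6))    # 602, the value quoted on the slide
print(num_rules(10))   # 57002: already far too many to enumerate by brute force
```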

  14. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  15. Reduce the number of candidates • Complete search over all itemsets requires checking 2^d - 1 candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent: if Y is frequent, every X ⊆ Y is frequent • If an itemset is infrequent, then all of its supersets must also be infrequent: if X is infrequent, every Y ⊇ X is infrequent Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  16. Illustrating Apriori Principle (worked example on database D, minimum support = 3/5): scan D to count the items (1-itemsets) and eliminate the infrequent ones; generate candidate pairs (2-itemsets), scan D and eliminate infrequent pairs; generate candidate triplets (3-itemsets), prune them and scan D. The surviving candidate triplet (which contains Milk) has a support count of 2, so it is not a frequent 3-itemset.

  17. Apriori Algorithm • Let k=1 • Generate frequent itemsets of length 1 • Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
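The loop above maps naturally to code. Below is a possible Python sketch of Apriori (an illustration, not the course's reference implementation); `minsup_count` is an absolute support count, and candidate generation works by combining items of the current frequent itemsets and pruning any candidate with an infrequent k-subset.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Toy Apriori sketch. Returns {frozenset(itemset): support_count}
    for every frequent itemset, with `minsup_count` an absolute count."""
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent, k = dict(frequent), 1
    while frequent:
        # Generate length-(k+1) candidates from the items of the frequent k-itemsets
        items = sorted({i for s in frequent for i in s})
        candidates = {frozenset(c) for c in combinations(items, k + 1)}
        # Apriori pruning: a candidate survives only if all of its k-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support with one scan of the database, then eliminate infrequent candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On the five-basket toy data used earlier with minsup_count = 3, this should return the frequent 1- and 2-itemsets and no frequent 3-itemset, consistent with the worked example of slide 16.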

  18. Factors Affecting Complexity • Choice of minimum support threshold • lowering support threshold results in more frequent itemsets • this may increase number of candidates and max length of frequent itemsets • Dimensionality (number of items) of the data set • more space is needed to store support count of each item • if number of frequent items also increases, both computation and I/O costs may also increase • Size of database • since Apriori makes multiple passes, run time of algorithm may increase with number of transactions • Average transaction width • transaction width increases with denser data sets • This may increase max length of frequent itemsets Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  19. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  20. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoid costly database scans • Develop an efficient, FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database test only! Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  21. FP-tree construction Header Table (item : frequency): a:8, b:7, c:6, d:5, e:3 • Steps: • Scan DB once, find frequent 1-itemsets (single-item patterns) • Scan DB again, construct the FP-tree • (one transaction → one path in the FP-tree)
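A minimal Python sketch of the two-scan construction (an illustrative addition; the class and function names are made up for this example): scan 1 keeps the frequent single items, scan 2 inserts each transaction, with its frequent items ordered by decreasing frequency, as one path, and a header table collects the node-link pointers per item.

```python
from collections import defaultdict

class FPNode:
    """FP-tree node: an item, a count, a parent link, and child nodes."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, minsup_count):
    # Scan 1: single-item frequencies; drop infrequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}
    # Scan 2: insert each transaction as one path (shared prefixes are merged)
    root, header = FPNode(None), defaultdict(list)
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1      # reuse the existing prefix
            else:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)          # node-link pointer for this item
            node = node.children[item]
    return root, header
```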

  22. FP-tree construction: pointers • The pointers (node-links) of the header table are used to assist frequent itemset generation

  23. Benefits of the FP-tree Structure • Completeness: • never breaks a long pattern of any transaction • preserves complete information for frequent pattern mining • Compactness: • one transaction → one path in the FP-tree • paths may overlap: the more the paths overlap, the more compression is achieved • never larger than the original database (not counting node-links and counts) • best case: only a single branch (path) of nodes • The size of an FP-tree depends on how the items are ordered Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  24. Different FP-trees by ordering items Header Table ordered by decreasing frequency (a:8, b:7, c:6, d:5, e:3) versus increasing frequency (e:3, d:5, c:6, b:7, a:8): the two orderings yield different FP-trees.

  25. Generating frequent itemsets FP-Growth Algorithm: bottom-up fashion to derive the frequent itemsets ending with a particular item. Paths can be accessed rapidly using the pointers associated with e

  26. Generating frequent itemsets FP-Growth Algorithm: Decompose the frequent itemset generation problem into multiple sub-problems

  27. Example: find frequent itemsets including e How to solve the sub-problem? • Construct the conditional tree for e (if e is frequent) • Start from paths containing node e • Prefix paths ending in {d,e} → conditional tree for {d,e} • Prefix paths ending in {c,e} → conditional tree for {c,e} • Prefix paths ending in {a,e} → conditional tree for {a,e}

  28. Example: find frequent itemsets including e • How to solve the sub-problem? • Check whether e is frequent • Convert the prefix paths into the conditional FP-tree • Update support counts • Remove node e • Ignore infrequent node b Count(e) = 3 ≥ minsup = 2 Frequent Items: {e}

  29. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {d,e}) • Check whether d is frequent in the prefix paths ending in {d,e} Count(d,e) = 2 ≥ minsup = 2 Frequent Items: {e}, {d,e}

  30. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {d,e}) • Convert the prefix paths (ending in {d,e}) into the conditional FP-tree for {d,e} • Update support counts • Remove node d • Ignore infrequent node c Frequent Items: {e}, {d,e}

  31. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {d,e}) • Check whether a is frequent in the prefix paths ending in {a,d,e} Count(a) = 2 ≥ minsup = 2 Frequent Items: {e}, {d,e}, {a,d,e}

  32. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {c,e}) • Check whether c is frequent in the prefix paths ending in {e} Count(c,e) = 2 ≥ minsup = 2 Frequent Items: {e}, {d,e}, {a,d,e}, {c,e}

  33. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {c,e}) • Convert the prefix paths (ending in {c,e}) into the conditional FP-tree for {c,e} • Update support counts • Remove node c Frequent Items: {e}, {d,e}, {a,d,e}, {c,e}

  34. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {c,e}) • Check whether a is frequent in the prefix paths ending in {a,c,e} Count(a) = 1 < minsup = 2, so {a,c,e} is not frequent Frequent Items: {e}, {d,e}, {a,d,e}, {c,e}

  35. Example: find frequent itemsets including e • How to solve the sub-problem? (prefix paths and conditional tree for {a,e}) • Check whether a is frequent in the prefix paths ending in {a,e} Count(a,e) = 2 ≥ minsup = 2 Frequent Items: {e}, {d,e}, {a,d,e}, {c,e}, {a,e}

  36. Mining Frequent Itemsets using the FP-tree • General idea (divide-and-conquer) • Recursively grow frequent pattern paths using the FP-tree • At each step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items • Properties • Sub-problems are disjoint → no duplicate itemsets • FP-growth is an order of magnitude faster than the Apriori algorithm (depends on the compaction factor of the data) • No candidate generation, no candidate test • Uses a compact data structure • Eliminates repeated database scans • Basic operations are counting and FP-tree building Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
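For a quick practical comparison, the following usage sketch relies on the third-party mlxtend library; the library choice and the toy baskets are assumptions, not something the slides prescribe.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Both miners should return the same frequent itemsets; fpgrowth skips candidate generation.
freq = fpgrowth(onehot, min_support=0.4, use_colnames=True)
assert len(freq) == len(apriori(onehot, min_support=0.4, use_colnames=True))

rules = association_rules(freq, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```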

  37. Compacting the Output of Frequent Itemsets ---Maximal vs Closed Itemsets Maximal frequent itemsets: no immediate superset is frequent. Closed itemsets: no immediate superset has the same support count (> 0); closed itemsets store not only which itemsets are frequent, but also their exact support counts.
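A small sketch makes the two definitions concrete; it assumes a dictionary that maps every itemset with non-zero support count to that count (a hypothetical input format chosen for illustration).

```python
def classify_frequent_itemsets(supports, minsup_count):
    """`supports`: dict {frozenset: support_count}, assumed to contain every
    itemset with non-zero count. Flags each frequent itemset as maximal/closed."""
    frequent = {s for s, c in supports.items() if c >= minsup_count}
    labels = {}
    for s in frequent:
        immediate_supersets = [t for t in supports if len(t) == len(s) + 1 and s < t]
        labels[s] = {
            # maximal: no immediate superset is frequent
            "maximal": not any(t in frequent for t in immediate_supersets),
            # closed: no immediate superset has the same support count
            "closed": not any(supports[t] == supports[s] for t in immediate_supersets),
        }
    return labels
```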

  38. Example: Maximal vs Closed Itemsets Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  39. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  40. Rule Generation • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement • If {A,B,C,D} is a frequent itemset, candidate rules: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB • If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

  41. Rule Generation • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D), since c(ABC → D) = s(ABCD)/s(ABC) and c(AB → D) = s(ABD)/s(AB), and we only know s(ABC) ≤ s(AB) and s(ABCD) ≤ s(ABD) (support is anti-monotone) • But the confidence of rules generated from the same itemset has an anti-monotone property, e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Computing the confidence of a rule does not require additional scans of the data

  42. Rule Generation for Apriori Algorithm • Apriori algorithm: • level-wise approach for generating association rules • each level: same number of items in rule consequent • rules with t consequent items are used to generate rules with t+1 consequent items

  43. Rule Generation for Apriori Algorithm • A candidate rule is generated by merging two rules that share the same prefix in the rule consequent • Join(CD → AB, BD → AC) would produce the candidate rule D → ABC • Prune rule D → ABC if its subset AD → BC does not have high confidence Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
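A sketch of this level-wise rule generator is shown below (an illustration; `supports` is assumed to map the frequent itemset and all of its subsets to their support counts, so no additional data scans are needed).

```python
from itertools import combinations

def generate_rules(itemset, supports, minconf):
    """Level-wise rule generation from one frequent itemset (sketch)."""
    itemset = frozenset(itemset)
    rules = []
    candidates = [frozenset([i]) for i in itemset]          # 1-item consequents
    while candidates and len(candidates[0]) < len(itemset):
        confident = set()
        for Y in candidates:
            X = itemset - Y
            conf = supports[itemset] / supports[X]
            if conf >= minconf:
                rules.append((X, Y, conf))
                confident.add(Y)                             # keep only confident consequents
        # Merge confident t-item consequents into (t+1)-item candidates, pruning any
        # candidate that has a t-item sub-consequent which was not confident.
        merged = {a | b for a, b in combinations(confident, 2) if len(a | b) == len(a) + 1}
        candidates = [Y for Y in merged
                      if all(frozenset(s) in confident for s in combinations(Y, len(Y) - 1))]
    return rules
```

The pruning step is valid because, as noted on slide 41, confidence is anti-monotone in the size of the consequent for rules generated from the same itemset.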

  44. Rule Generation for Apriori Algorithm (rule-lattice figure): when a rule with item A as its consequent is found to have low confidence, the pruned rules are all rules containing item A as consequent.

  45. Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemset mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  46. Interesting rules? • Rules with high support and confidence may be useful, but not “interesting” • e.g., when the “then” part of an if-then rule is frequent regardless of the “if” part • Example: {Milk, Beer} → {Diaper} (s=0.4, c=1.0), but Prob(Diaper) = 4/5 = 0.8 • High interest suggests a cause that might be worth investigating

  47. Computing Interestingness Measure • Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table • Contingency table for X → Y: f11 = # of transactions containing both X and Y; f10 = # of transactions containing X but not Y; f01 = # of transactions containing Y but not X; f00 = # of transactions containing neither X nor Y • Used to define various measures: support (f11/|T|), confidence (f11/f1+), lift, Gini, J-measure, etc.

  48. Drawback of Confidence • Association Rule: Tea → Coffee • Confidence = P(Coffee|Tea) = 15/20 = 0.75 • but P(Coffee) = 90/100 = 0.9 and P(Coffee|no Tea) = 75/80 = 0.9375 • Although the confidence is high, the rule is misleading

  49. Example: Lift/Interest • Association Rule: Tea → Coffee • Confidence = P(Coffee|Tea) = 15/20 = 0.75 • but P(Coffee) = 90/100 = 0.9 • Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
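The same numbers can be recovered from the contingency-table form of slide 47; the four cell counts below are reconstructed from the totals quoted on these slides (f11 = 15, f10 = 5, f01 = 75, f00 = 5).

```python
# Tea -> Coffee via contingency counts (cells reconstructed from the slide totals).
f11, f10, f01, f00 = 15, 5, 75, 5      # Tea&Coffee, Tea only, Coffee only, neither
n = f11 + f10 + f01 + f00              # 100 transactions

support    = f11 / n                   # 0.15
confidence = f11 / (f11 + f10)         # P(Coffee | Tea) = 0.75
lift       = confidence / ((f11 + f01) / n)   # 0.75 / 0.9 = 0.8333...

print(support, confidence, lift)       # lift < 1 -> Tea and Coffee negatively associated
```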

  50. There are many more measures proposed in the literature (see Tan, Steinbach, Kumar, Introduction to Data Mining).
