1 / 42

Data Mining (and machine learning)

Data Mining (and machine learning). DM Lecture 4: The A Priori Algorithm. Overview of My Lectures. All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html 25/9 Overview of DM (and of these 8 lectures)

ilori
Télécharger la présentation

Data Mining (and machine learning)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining(and machine learning) DM Lecture 4: The A Priori Algorithm David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. Overview of My Lectures All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html • 25/9 Overview of DM (and of these 8 lectures) • 02/10:     Data Cleaning - usually a necessary first step for large amounts of data • 09/10  Basic Statistics for Data Miners - essential knowledge, and very useful • 16/10 Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry • NO THURSDAY LECTURE OCTOBER 23rd • 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data • NO THURSDAY LECTURE November 6th • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data • 20/11: Regression - the simplest algorithm for predicting data/class values • 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  3. Reading The main technical material (the Apriori algorithm and its variants) in this lecture is based on: Fast Algorithms for Mining Association Rules, by Rakesh Agrawal and Ramakrishan Sikant, IBM Almaden Research Center The pdf is on my teaching resources page. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  4. `Basket data’ A very common type of data; often also called transaction data. Next slide shows example transaction database, where each record represents a transaction between (usually) a customer and a shop. Each record in a supermarket’s transaction DB, for example, corresponds to a basket of specific items. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  5. ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  6. Numbers Our example transaction DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  7. Discovering RulesA common and useful application of data mining A `rule’ is something like this: If a basket contains apples and cheese, then it also contains beer Any such rule has two associated measures: • confidence – when the `if’ part is true, how often is the `then’ bit true? This is the same as accuracy. • coverageor support – how much of the database contains the `if’ part? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  8. Example: What is the confidence and coverage of: If the basket contains beer and cheese, then it also contains honey 2/20 of the records contain both beer and cheese, so coverage is 10% Of these 2, 1 contains honey, so confidence is 50% David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  9. ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  10. Interesting/Useful rules Statistically, anything that is interesting is something that happens significantly more than you would expect by chance. E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e: if you choose a basket at random: • There is a probability 0.1 that it contains bread. • There is a probability 0.04 that it contains washing-up powder. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  11. Bread and washing up powder What is the probability of a basket containing both bread and washing-up powder? The laws of probability say: If these two things are independent, chance is 0.1 * 0.04 = 0.004 That is, we would expect 0.4% of baskets to contain both bread and washing up powder David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  12. Interesting means surprising We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing up powder. If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that: • Something is going on in shoppers’ minds: bread and washing-up powder are connected in some way. • There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  13. Finding surprising rules Suppose we ask `what is the most surprising rule in this database? This would be, presumably, a rule whose accuracy is more different from its expected accuracy than any others. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable. Looking only at rules of the form: if basket contains X and Y, then it also contains Z … our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records. … c 1016 operations … Yes, it’s easy to use `type 2’ DM, say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  14. Here are some interesting onesin our mini basket DB: • If a basket contains glue, then it also contains either beer or eggs confidence: 100% ; coverage 25% • If a basket contains apples and dates, then it also contains honey confidence 100% ; coverage 20% David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  15. The A Priori Algorithm There is nothing very special or clever about this algorithm; however it is simple, fast, and very good at finding interesting rules of a specific kind in baskets or other transaction data. It is used a lot in the R&D Depts of retailers in industry (or by consultancies who do work for them). But note that we will now talk about itemsets instead of rules. Also, the coverage of a rule is the same as the support of an itemset. Don’t get confused! David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  16. Find rules in two stages Agarwal and colleagues divided the problem of finding good rules into two phases: • Find all itemsets with a specified minimal support (coverage).An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum. • Use these itemsets to help generate interersting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  17. Terminology k-itemset: a set of k items. E.g. {beer, cheese, eggs} is a 3-itemset {cheese} is a 1-itemset {honey, ice-cream} is a 2-itemset support: an itemset has support s%if s% of the records in the DB contain that itemset. minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and will focus on itemsets with this level or above. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  18. Terminology large itemset: doesn’t mean an itemset with many items. It means one whose support is at least minimum support. Lk : the set of all large k-itemsets in the DB. Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it generates this set, which contains all the k-itemsets that might be large, and then eventually generates the set above. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  19. Terminology sets: Let A be a set (A = {cat, dog}) and let B be a set (B = {dog, eel, rat}) and let C = {eel, rat} I use `A + B’ to mean A union B. So A + B = {cat, dog, eel. rat} When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X. E.g. B – C = {dog} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  20. ID a, b, c, d, e, f, g, h, i E.g. 3-itemset {a,b,h} has support 15% 2-itemset {a, i} has support 0% 4-itemset {b, c, d, h} has support 5% If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} Is a small itemset! David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  21. The Apriori algorithm for finding large itemsets efficiently in big DBs 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) 4 For each c in Ck, initialise c.count to zero 5 For all records r in the DB • {Cr = subset(Ck, r); For each c in Cr , c.count++ } 7 Set Lk:= all c in Ckwhose count >= minsup 8 } /* end -- return all of the Lksets. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  22. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets To start off, we simply find all of the large 1-itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support was 60%, then the only large 1-itemsets would be: {a}, {b}, {c}, {d} and {f}. So we get L1 = { {a}, {b}, {c}, {d}, {f}} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  23. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) We already have L1. This next bit just means that the remainder of the algorithm generates L2, L3 , and so on until we get to an Lk that’s empty. How these are generated is like this: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  24. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) Given the large k-1-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set Ckis guaranteed to contain all the large k-itemsets, but also contains some that will turn out not to be `large’. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  25. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) 4 For each c in Ck, initialise c.count to zero We are going to work out the support for each of the candidate k-itemsets in Ck, by working out how many times each of these itemsets appears in a record in the DB.– this step starts us off by initialising these counts to zero. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  26. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) 4 For each c in Ck, initialise c.count to zero 5 For all records r in the DB • {Cr = subset(Ck, r); For each c in Cr , c.count++ } We now take each record r in the DB and do this: get all the candidate k-itemsets from Ck that are contained in r. For each of these, update its count. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  27. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) 4 For each c in Ck, initialise c.count to zero 5For all records r in the DB • {Cr = subset(Ck, r); For each c in Cr , c.count++ } 7 Set Lk:= all c in Ckwhose count >= minsup Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have Lk, We now go back into the for loop of line 2 and start working towards finding Lk+1 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  28. Explaining the Apriori Algorithm … 1: Find all large 1-itemsets 2: For (k = 2 ; while Lk-1 is non-empty; k++) 3 {Ck = apriori-gen(Lk-1) 4 For each c in Ck, initialise c.count to zero 5For all records r in the DB • {Cr = subset(Ck, r); For each c in Cr , c.count++ } • Set Lk:= all c in Ckwhose count >= minsup 8 } /* end -- return all of the Lksets. We finish at the point where we get an empty Lk. The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  29. apriori-gen : notes Suppose we have worked out that the large 2-itemsets are: L2 = { {milk, noodles}, {milk, tights}, {noodles, quorn}} apriori-gen now generates 3-itemsets that all may be large. An obvious way to do this would be to generate all of the possible 3-itemsets that you can make from {milk, noodles, tights, quorn}. But this would include, for example, {milk, tights, quorn}. Now, if this really was a large 3-itemset, that would mean the number of records containing all three is >= minsup; this means it would have to be true that the number of records containing {tights, quorn} is >= minsup. But, it can’t be, because this is not one of the large 2-itemsets. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  30. apriori-gen : the join step apriori-gen is clever in generating not too many candidate large itemsets, but making sure to not lose any that do turn out to be large. To explain it, we need to note that there is always an ordering of the items. We will assume alphabetical order, and that the datastructures used always keep members of a set in alphabetical order. a < b will mean that a comes before b in alphabetical order. Suppose we have Lkand wish to generate Ck+1 First we take every distinct pair of sets in Lk {a1, a2 , … ak} and {b1, b2 , … bk}, and do this: in all cases where {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak < bk, {a1, a2 , … ak, bk} is a candidate k+1-itemset. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  31. An illustration of that Suppose the 2-itemsets are: L2 = { {milk, noodles}, {milk, tights}, {noodles, quorn}, {noodles, peas}, {noodles, tights}} The pairs that satisfy this: {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak < bk, are: {milk, noodles}|{milk, tights} {noodles, peas}|{noodles, quorn} {noodles, peas}|{noodles, tights} {noodles, quorn}|{noodles, tights} So the candidate 3-itemsets are: {milk, noodles, tights}, {noodles, peas, quorn} {noodles, peas, tights}, {noodles, quorn, tights} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  32. apriori-gen : the prune step Now we have some candidate k+1 itemsets, and are guaranteed to have all of the ones that possibly could be large, but we have the chance to maybe prune out some more before we enter the next stage of Apriori that counts their support. In the prune step, we take the candidate k+1 itemsets we have, and remove any for which some 2-subset of it is not a large k-itemset. Such couldn’t possibly be a large k+1-itemset. E.g. in the current example, we have (n = noodles, etc): L2 = { {milk, n}, {milk, tights}, {n, quorn}, {n, peas}, {n, tights}} And candidate k+1-itemsets so far: {m, n, t}, {n, p, q}, {n, p, t}, {n, q, t} Now, {p, q} is not a 2-itemset, so {n,p,q} is pruned. {p,t} is not a 2-itemset, so {n,p,t} is pruned {q,t} is not a 2-itemset, so {n,q,t} is pruned. After this we finally have C3 = {{milk, noodles, tights}} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  33. What can you say about the coverage of {apples, milk}? Understanding rules The Apriori algorithm finds interesting (i.e. frequent) itemsets. E.g. it may find that {apples, bananas, milk} has coverage 30% -- so 30% of transactions contain each of these three things. We can invent several potential rules, e.g.: IF basket contains apples and bananas, it also contains MILK. Suppose support of {a, b} is 40%; what is the confidence of this rule? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  34. Understanding rules II Suppose itemset A = {beer, cheese, eggs} has 30% support in the DB {beer, cheese} has 40%, {beer, eggs} has 30%, {cheese, eggs} has 50%, and each of beer, cheese, and eggs alone has 50% support.. What is the confidence of: IF basket contains Beer and Cheese, THEN basket also contains Eggs ? The confidence of a rule if A then B is simply: support(A + B) / support(A). So it’s 30/40 = 0.75 ; this rule has 75% confidence What is the confidence of: IF basket contains Beer, THEN basket also contains Cheese and Eggs ? 30 / 50 = 0.6 so this rule has 60% confidence David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  35. Understanding rules III If the following rule has confidence c: If A then B and if support(A) = 2 * support(B), what can be said about the confidence of: If B then A confidence c is support(A + B) / support(A) = support(A + B) / 2 * support(B) Let d be the confidence of ``If B then A’’. d is support(A+B / support(B) -- Clearly, d = 2c E.g. A might be milk and B might be newspapers David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  36. What this lecture was about • The Apriori algorithm for efficiently finding frequent large itemsets in large DBs • Associated terminology • Associated notes about rules, and working out the confidence of a rule based on the support of its component itemsets David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  37. Appendix David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  38. ID a, b, c, d, e, f, g A full run through of Apriori We will assume this is our transaction database D and we will assume minsup is 4 (20%) This will not be run through in the lecture; it is here to help with revision David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  39. First we find all the large 1-itemsets. I.e., in this case, all the 1-itemsets that are contained by at least 4 records in the DB. In this example, that’s all of them. So, L1 = {{a}, {b}, {c}, {d}, {e}, {f}, {g}} Now we set k = 2 and run apriori-gen to generate C2 The join step when k=2just gives us the set of all alphabetically ordered pairs from L1, and we cannot prune any away, so we have C2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  40. So we have C2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}} Line 4 of the Apriori algorithm now tells us set a counter for each of these to 0. Line 5 now prepares us to take each record in the DB in turn, and find which of those in C2 are contained in it. The first record r1 is: {a, b, d, g}. Those of C2 it contains are: {a, b}, {a, d}, {a, g}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}. Hence Cr1 = {{a, b}, {a, d}, {a, g}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}} and the rest of line 6 tells us to increment the counters of these itemsets. The second record r2 is:{c, d, e}; Cr2 = {{c, d}, {c, e}, {d, e}}, and we increment the counters for these three itemsets. … After all 20 records, we look at the counters, and in this case we will find that the itemsets with >= minsup (4) counters are: {a, d}, {c, e}. So, L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  41. So we have L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}} We now set k = 3 and run apriori-gen on L2 . The join step finds the following pairs that meet the required pattern: {a, c}:{a, d} {c, d}:{c, e} {c, d}:{c, f} {c, e}:{c, f} This leads to the candidates 3-itemsets: {a, c, d}, {c, d, e}, {c, d, f}, {c, e, f} We prune {c, d, e} since {d, e} is not in L2 We prune {c, d, f} since {d, f} is not in L2 We prune {c, e, f} since {e, f} is not in L2 We are left with C3 = {a, c, d} We now run lines 5—7, to count how many records contain {a, c, d}. The count is 4, so L3 = {a, c, d} David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  42. So we have L3 = {a, c, d} We now set k = 4, but when we run apriori-gen on L3 we get the empty set, and hence eventually we find L4 = {} This means we now finish, and return the set of all of the non-empty Ls – these are all of the large itemsets: Result = {{a}, {b}, {c}, {d}, {e}, {f}, {g}, {a, c}, {a, d}, {c, d}, {c, e}, {c, f}, {a, c, d}} Each large itemset is intrinsically interesting, and may be of business value. Simple rule-generation algorithms can now use the large itemsets as a starting point. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

More Related