
Technologies for Mining Frequent Patterns in Large Databases

  1. Technologies for Mining Frequent Patterns in Large Databases Jiawei Han Intelligent Database Systems Research Lab. Simon Fraser University, Canada http://www.cs.sfu.ca/~han

  2. Tutorial Outline • What is frequent pattern mining? • Frequent pattern mining algorithms • Apriori and its variations • A multi-dimensional view of frequent pattern mining • Constraint-based frequent pattern mining • Recent progress on efficient mining methods • Mining frequent patterns without candidate generation • CLOSET: Efficient mining of frequent closed itemsets • FreeSpan: Towards efficient sequential pattern mining

  3. Part I: What Is Frequent Pattern Mining? • What is a frequent pattern? • Why frequent pattern mining? • Challenges in frequent pattern mining

  4. What Is Frequent Pattern Mining? • What is a frequent pattern? • A pattern (a set of items, a sequence, etc.) whose elements occur together frequently in a database [AIS93] • Frequent pattern: an important form of regularity • What products were often purchased together? Beers and diapers! • What are the consequences of a hurricane? • What is the next target after buying a PC?

  5. Application Examples • Market Basket Analysis • * ⇒ Maintenance Agreement: what should the store do to boost Maintenance Agreement sales? • Home Electronics ⇒ *: what other products should the store stock up on if it has a sale on Home Electronics? • Attached mailing in direct marketing • Detecting “ping-pong”ing of patients: transaction = patient; item = doctor/clinic visited by the patient; support of a rule = number of common patients

  6. Frequent Pattern Mining—A Cornerstone in Data Mining • Association analysis • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis • Correlation or causality analysis • Clustering • Classification • Association-based classification analysis • Sequential pattern analysis • Web log sequence, DNA analysis, etc. • Partial periodicity, cyclic/temporal associations

  7. Association Rule Mining • Given • A database of customer transactions • Each transaction is a list of items (purchased by a customer in a visit) • Find all rules that correlate the presence of one set of items with that of another set of items • Example: 98% of people who purchase tires and auto accessories also get automotive services done • Any number of items may appear in the consequent/antecedent of a rule • Possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances)

  8. Basic Concepts • Rule form: “A ⇒ B [support s, confidence c]” • Support: usefulness of discovered rules • Confidence: certainty of the detected association • Rules that satisfy both min_sup and min_conf are called strong • Examples: • buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%] • age(x, “30-34”) ∧ income(x, “42K-48K”) ⇒ buys(x, “high resolution TV”) [2%, 60%] • major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

  9. Rule Measures: Support and Confidence • Find all the rules X ∧ Y ⇒ Z with minimum confidence and support • support, s: probability that a transaction contains {X, Y, Z} • confidence, c: conditional probability that a transaction having {X, Y} also contains Z • (Venn diagram: customer buys beer, customer buys diaper, customer buys both) • With minimum support 50% and minimum confidence 50%, we have • A ⇒ C (50%, 66.6%) • C ⇒ A (50%, 100%)
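
To make the two measures concrete, here is a minimal Python sketch that computes support and confidence over a small transaction database. The four transactions are an illustrative assumption (the slide's table is not reproduced here); they are chosen so that A ⇒ C comes out at (50%, 66.6%) and C ⇒ A at (50%, 100%), matching the rules above.

```python
transactions = [
    {"A", "B", "C"},   # assumed example data, not taken from the transcript
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability that a transaction with `antecedent` also contains `consequent`."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

print(support({"A", "C"}, transactions))       # 0.5   -> 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> 66.6% confidence for A => C
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> 100% confidence for C => A
```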

  10. Part II: Frequent pattern mining methods: Apriori and its variations • The Apriori algorithm • Improvements of Apriori • Incremental, parallel, and distributed methods • Different measures in association mining

  11. An Influential Mining Methodology — The Apriori Algorithm • The Apriori method: • Proposed by Agrawal & Srikant 1994 • A similar level-wise algorithm by Mannila et al. 1994 • Major idea: • A subset of a frequent itemset must be frequent • E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any subset is infrequent, its superset cannot be frequent! • A powerful, scalable candidate set pruning technique: • It reduces candidate k-itemsets dramatically (for k > 2)

  12. Mining Association Rules — Example • Given min. support 50% and min. confidence 50% • For rule A ⇒ C: support = support({A, C}) = 50%; confidence = support({A, C})/support({A}) = 66.6% • The Apriori principle: any subset of a frequent itemset must be frequent

  13. Procedure of Mining Association Rules • Find the frequent itemsets: the sets of items that have minimum support (Apriori) • A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets) • Use the frequent itemsets to generate association rules

  14. The Apriori Algorithm • Join Step: Ck is generated by joining Lk-1 with itself • Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence should be removed (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)

  15. Apriori—Pseudocode Ck: candidate itemsets of size k; Lk: frequent itemsets of size k. L1 = {frequent items}; for (k = 1; Lk != ∅; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t; Lk+1 = candidates in Ck+1 with min_support; end; return ∪k Lk;
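
The pseudocode above maps directly onto a short Python sketch (an illustration only, not the original implementation). Itemsets are frozensets and min_support is an absolute count; both choices are assumptions of this sketch. The self-join here pairs every two frequent k-itemsets rather than using the ordered join of slide 17, which is simpler to write but does the same pruning.

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    """Return {frequent itemset (frozenset): support count}; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {i for i, c in counts.items() if c >= min_support}
    frequent = {i: counts[i] for i in Lk}

    k = 1
    while Lk:
        # Candidate generation: join Lk with itself, prune by the Apriori property
        Ck1 = set()
        for p in Lk:
            for q in Lk:
                cand = p | q
                if len(cand) == k + 1 and all(frozenset(s) in Lk for s in combinations(cand, k)):
                    Ck1.add(cand)
        # One scan of the database counts all surviving candidates
        counts = defaultdict(int)
        for t in transactions:
            for cand in Ck1:
                if cand <= t:
                    counts[cand] += 1
        Lk = {c for c in Ck1 if counts[c] >= min_support}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent
```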

  16. The Apriori Algorithm — Example • (Figure: database D is scanned to count C1 and obtain L1; L1 is joined to form C2 and D is scanned to obtain L2; L2 is joined to form C3 and D is scanned to obtain L3)

  17. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1 (self-joining Lk-1): insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 • Step 2 (pruning): forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck

  18. How to Count Supports of Candidates? • Why is counting the supports of candidates a problem? • The total number of candidates can be huge • One transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash-tree • A leaf node of the hash-tree contains a list of itemsets and counts • An interior node contains a hash table • Subset function: finds all the candidates contained in a transaction

  19. Example of Generating Candidates • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
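
A small sketch of the ordered self-join plus prune from slide 17, reproducing this example; itemsets are kept as sorted tuples so the "first k-2 items equal, last item smaller" join condition can be checked directly.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """L_prev: set of (k-1)-itemsets as sorted tuples; returns candidate k-itemsets."""
    size = len(next(iter(L_prev)))
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # Self-join: agree on the first k-2 items, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent
                if all(s in L_prev for s in combinations(c, size)):
                    candidates.add(c)
    return candidates

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```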

  20. Example: Counting Supports of Candidates • (Figure: hash-tree of candidate 3-itemsets; interior nodes hash items into the buckets 1,4,7 / 2,5,8 / 3,6,9; the subset function decomposes transaction 1 2 3 5 6 as 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, … so that only the leaves whose candidates could be contained in the transaction are visited)

  21. Generating Strong Association Rules • confidence(A ⇒ B) = Prob(B|A) = support(A ∪ B)/support(A) • Example: L3 = {2, 3, 5} • 2 ∧ 3 ⇒ 5, confidence = 2/2 = 100% • 2 ∧ 5 ⇒ 3, confidence = 2/3 = 67% • 3 ∧ 5 ⇒ 2, confidence = 2/2 = 100% • 2 ⇒ 3 ∧ 5, confidence = 2/3 = 67% • 3 ⇒ 2 ∧ 5, confidence = 2/3 = 67% • 5 ⇒ 2 ∧ 3, confidence = 2/3 = 67%
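
Below is a minimal sketch of rule generation from a single frequent itemset. The support counts are assumed to come from the running example database (transactions {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}, which is an assumption of this sketch); they reproduce the confidence values listed above, and with min_conf = 70% only the two 100%-confidence rules survive.

```python
from itertools import combinations

# Assumed support counts (see lead-in); keys are frozensets of items.
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def rules_from_itemset(itemset, support, min_conf):
    """Yield (antecedent, consequent, confidence) for every strong rule from `itemset`."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield antecedent, itemset - antecedent, conf

for a, c, conf in rules_from_itemset({2, 3, 5}, support, min_conf=0.7):
    print(sorted(a), "=>", sorted(c), f"{conf:.0%}")
# [2, 3] => [5] 100% and [3, 5] => [2] 100% pass; the 67% rules are filtered out
```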

  22. Efficient Implementation of Apriori in SQL • S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and Implications. In SIGMOD’98 • Implementations based on pure SQL-92 • Impossible to get good performance out of pure SQL based approaches alone • Make use of object-relational extensions like UDFs, BLOBs, Table functions etc. • Get orders of magnitude improvement

  23. Improvements of Apriori • General ideas • Scan the transaction database in as few passes as possible • Reduce the number of candidates • Facilitate support counting of candidates

  24. DIC: Reduce Number of Scans • S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97 • Basic idea • Count the itemsets at the boundary in a lattice • Push the boundary dynamically • Use a trie structure to keep track of counters, and reorder items to reduce counting costs

  25. Example of DIC • Once all (k-1)-subsets of a k-itemset are frequent, the counting of the k-itemset can begin • Any superset of an infrequent itemset need not be counted • (Figure: itemset lattice over {A, B, C, D} from {} up to ABCD, with the counting boundary pushed dynamically by DIC versus Apriori's level-by-level boundary over 1-itemsets, 2-itemsets, 3-itemsets, …)

  26. DIC: Pros and Cons • Number of scans • Can be reduced in some cases • But how about non-homogeneous data and high-support situations? • Item reordering • “Item reordering did not work as well as we had hoped” • Performance • 30% gain at the low-support end • 30% loss at the high-support end

  27. DHP: Reduce the Number of Candidates • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95 • Major features • Efficient generation of candidate itemsets • Effective reduction of transaction database size

  28. DHP: Efficient Generation of Candidates • In the k-th pass, count supports of the k-candidates and, in the same pass, count entries of a hash table over (k+1)-item combinations • A (k+1)-itemset in Lk*Lk is qualified as a (k+1)-candidate only if it passes the hash filtering, i.e., it is hashed into a hash entry whose count is no less than the support threshold • Example • Candidates: a, b, c, d, e • Hash entries: {ab, ad, ae}, {bd, be, de}, … • Frequent 1-itemsets: a, b, d, e • ab is not a candidate 2-itemset if the count of its hash bucket, {ab, ad, ae}, is below the support threshold
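
A rough sketch of DHP's hash filtering for 2-itemsets. The bucket-assignment function (Python's built-in hash modulo a bucket count) is an assumption of this sketch; DHP only requires that each pair's bucket count upper-bounds the pair's true support.

```python
from itertools import combinations
from collections import defaultdict

def dhp_first_pass(transactions, min_support, n_buckets=7):
    """Count 1-itemsets and, in the same pass, hash every 2-item combination into buckets."""
    item_count = defaultdict(int)
    bucket_count = defaultdict(int)
    for t in transactions:
        for item in t:
            item_count[item] += 1
        for pair in combinations(sorted(t), 2):
            bucket_count[hash(pair) % n_buckets] += 1

    L1 = {i for i, c in item_count.items() if c >= min_support}
    # Hash filtering: a pair of frequent items is kept as a 2-candidate only if
    # its bucket count reaches min_support (the bucket count over-estimates support).
    C2 = {pair for pair in combinations(sorted(L1), 2)
          if bucket_count[hash(pair) % n_buckets] >= min_support}
    return L1, C2
```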

  29. DHP: Effective Reduction of Database Size • An item in transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets contained in t • Examples • Transaction acd can be discarded if only ac is frequent • Transaction bce must be kept if bc, be, and ce are frequent

  30. Partition: Scan Database Only Twice • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95 • Mine all frequent itemsets by scanning the transaction database only twice

  31. Scan One in Partition • Divide the database into n partitions • A global frequent itemset must be frequent in at least one partition • Process one partition in main memory at a time; for each partition • generate local frequent itemsets using the Apriori algorithm • also form a tidlist for each itemset to facilitate counting in the merge phase • tidlist: contains the transaction IDs of all transactions that contain the itemset within a given partition

  32. Scan Two in Partition • Merge local frequent itemsets to generate a set of all potential large itemsets • Count actual supports • Support can be computed from the tidlists
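
The two scans can be sketched as below. `mine_local` stands in for any in-memory miner such as Apriori, and the tidlist layout (one item-to-transaction-ID map per partition) is an assumption of this sketch; the point is that the raw data is read only twice, with final supports computed from the stored tidlists.

```python
from functools import reduce

def partition_mine(partitions, min_support_ratio, mine_local):
    """partitions: list of lists of item sets; mine_local(partition, abs_min_sup) -> frequent itemsets."""
    candidates = set()
    tidlists = []          # one {item: set of transaction IDs} map per partition
    total = 0

    # Scan one: local frequent itemsets per partition, plus tidlists for the merge phase
    for part in partitions:
        candidates |= set(mine_local(part, min_support_ratio * len(part)))
        tl = {}
        for tid, t in enumerate(part, start=total):
            for item in t:
                tl.setdefault(item, set()).add(tid)
        tidlists.append(tl)
        total += len(part)

    # Scan two: actual support of every merged candidate, via tidlist intersection
    result = {}
    for cand in candidates:
        count = sum(len(reduce(set.intersection, (tl.get(item, set()) for item in cand)))
                    for tl in tidlists)
        if count >= min_support_ratio * total:
            result[cand] = count
    return result
```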

  33. Partition: Pros and Cons • Achieve both CPU and I/O improvements over Apriori • The number of distinct local frequent itemsets may be very large • tidlists to be maintained can be huge

  34. Sampling for Mining Frequent Itemsets • H. Toivonen. Sampling large databases for association rules. In VLDB’96 • Select a sample of the original database, mine frequent itemsets within the sample using Apriori • Scan the database once to verify the frequent itemsets found in the sample; only the border of the closure of the frequent itemsets is checked • Example: check abcd instead of ab, ac, …, etc. • Scan the database again to find missed frequent itemsets

  35. Challenges for the Sampling Method • How to sample a large database? • When the support threshold is quite low, sampling may not generate good enough results

  36. Incremental Association Mining • Given: a transaction database and the set of frequent itemsets already mined from it • A set of update transactions for the transaction database, including insertions and deletions • How to update the frequent itemsets for the updated transaction database? • (Figure: transaction database + update transactions ⇒ what are the updated frequent itemsets?)

  37. FUP: Incremental Update of Discovered Rules • D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE’96 • View the database as: original DB ∪ incremental db • A k-itemset (for any k) is • frequent in DB ∪ db if frequent in both DB and db • infrequent in DB ∪ db if infrequent in both DB and db • For those frequent only in DB, merge in the corresponding counts from db • For those frequent only in db, search DB to update their itemset counts
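
The case analysis above can be sketched for a single itemset X as follows. `scan_DB` is an assumed callback that rescans the original database for X's count; this is the expensive step that FUP only takes when X is frequent in db but was not frequent in DB.

```python
def fup_update(X, DB_count, DB_size, db, min_ratio, scan_DB):
    """Return X's new support count if X is frequent in DB union db, else None.
    DB_count: X's old count if X was frequent in DB, otherwise None.
    db: list of new transactions (sets); scan_DB(X): count of X in the original DB."""
    X = set(X)
    db_count = sum(X <= t for t in db)
    new_size = DB_size + len(db)

    if DB_count is not None:                    # frequent in DB: just merge in db's count
        total = DB_count + db_count
        return total if total >= min_ratio * new_size else None
    if db_count >= min_ratio * len(db):         # frequent only in db: one rescan of DB
        total = scan_DB(X) + db_count
        return total if total >= min_ratio * new_size else None
    return None                                 # infrequent in both DB and db: cannot be frequent overall
```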

  38. Incremental Update of Discovered Rules • A fast updating algorithm, FUP (Cheung et al.’96) • View the database as: original DB ∪ incremental db • A k-itemset (for any k) is • frequent in DB ∪ db if frequent in both DB and db • infrequent in DB ∪ db if infrequent in both DB and db • For those frequent only in DB, merge in the corresponding counts from db • For those frequent only in db, search DB to update their itemset counts • Similar methods can be adopted for data removal and update, or for distributed/parallel mining

  39. Parallel and Distributed Association Mining • D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS 1996 • M. Tamura and M. Kitsuregawa. Dynamic Load Balancing for Parallel Association Rule Mining on Heterogenous PC Cluster Systems. In VLDB 1999 • E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In SIGMOD’97 • M. Zaki, S. Parthasarathy, and M. Ogihara. Parallel algorithms for discovery of association rules. In Data Mining and Knowledge Discovery. Vol.1 No.4, 1997

  40. Interestingness Measures • Objective measures: two popular measures are • support, and • confidence • Subjective measures (Silberschatz & Tuzhilin, KDD’95): a rule (pattern) is interesting if • it is unexpected (surprising to the user), and/or • actionable (the user can do something with it)

  41. Criticism of Support and Confidence • Example 1 (Aggarwal & Yu, PODS’98) • Among 5000 students • 3000 play basketball • 3750 eat cereal • 2000 both play basketball and eat cereal • play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7% • play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

  42. Criticism of Support and Confidence (Cont.) • Example 2: • X and Y: positively correlated • X and Z: negatively correlated • yet the support and confidence of X ⇒ Z dominate

  43. Other Interestingness Measures: Interest • Interest (lift): P(A ∧ B) / (P(A) P(B)) • takes both P(A) and P(B) into consideration • P(A ∧ B) = P(A) P(B) if A and B are independent events • A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
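
A quick numeric check of the lift measure on Example 1 above (5000 students; 3000 play basketball, 3750 eat cereal, 2000 do both):

```python
n, basketball, cereal, both = 5000, 3000, 3750, 2000

def lift(p_ab, p_a, p_b):
    """> 1: positively correlated, < 1: negatively correlated, = 1: independent."""
    return p_ab / (p_a * p_b)

# basketball and cereal: 0.4 / (0.6 * 0.75) = 0.89 < 1, negatively correlated
print(lift(both / n, basketball / n, cereal / n))
# basketball and "not cereal": 0.2 / (0.6 * 0.25) = 1.33 > 1, positively correlated
print(lift((basketball - both) / n, basketball / n, (n - cereal) / n))
```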

  44. Other Interestingness Measures: Conviction • Conviction: P(A) P(¬B) / P(A ∧ ¬B) • from the implication A ⇒ B ≡ ¬(A ∧ ¬B) • factors in both P(A) and P(B) and has value 1 when the relevant items are completely unrelated (confidence does not) • rules which hold 100% of the time have the highest possible value, ∞ (interest does not)
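
Applying conviction, conv(A ⇒ B) = P(A) P(¬B) / P(A ∧ ¬B), to the same student numbers (a small sketch; the 0.0 case illustrates the ∞ value for rules that never fail):

```python
def conviction(p_a, p_b, p_a_not_b):
    """conv(A => B) = P(A) * P(not B) / P(A and not B); value 1 means A and B are unrelated."""
    if p_a_not_b == 0:
        return float("inf")        # a rule that holds 100% of the time
    return p_a * (1 - p_b) / p_a_not_b

# basketball => cereal: P(A) = 0.6, P(B) = 0.75, P(A and not B) = 0.2
print(conviction(0.6, 0.75, 0.2))  # 0.75 < 1: weaker than independence would suggest
print(conviction(0.6, 0.75, 0.0))  # inf: the rule never fails
```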

  45. Collective Strength • Collective strength is a number between 0 and ∞ with 1 as the break-even point: C(I) = [(1 − v(I)) / (1 − E[v(I)])] · [E[v(I)] / v(I)], where v(I) is the violation ratio of itemset I • An itemset I is said to be in violation of a transaction if some of its items are present in the transaction and others are not; v(I) is equal to the fraction of transactions which contain a proper non-null subset of I • Recast, collective strength compares the observed non-violation and violation rates with their expected values under item independence

  46. Collective Strength (2) • Let I be a set of items {i1, i2, …, ik} and let pr denote the frequency of item ir in the database • The probability that the itemset I occurs in a transaction is ∏r pr • The probability that none of the items in I occurs in the transaction is ∏r (1 − pr) • The expected fraction of transactions that contain at least one item of I, and where at least one item is absent, is E[v(I)] = 1 − ∏r pr − ∏r (1 − pr)
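
Putting slides 45 and 46 together, a short sketch: v(I) is the observed violation ratio, E[v(I)] its expectation under the independence model above, and collective strength compares the two. The closed form used here follows Aggarwal & Yu's definition as restated on slide 45, so treat it as an assumption of this sketch rather than a transcription of the slide's formula.

```python
from math import prod

def violation_ratio(I, transactions):
    """Fraction of transactions that contain some, but not all, items of I."""
    I = set(I)
    return sum(0 < len(I & t) < len(I) for t in transactions) / len(transactions)

def expected_violation_ratio(I, transactions):
    """E[v(I)] = 1 - prod(p_r) - prod(1 - p_r), assuming items occur independently."""
    n = len(transactions)
    p = [sum(item in t for t in transactions) / n for item in I]
    return 1 - prod(p) - prod(1 - pi for pi in p)

def collective_strength(I, transactions):
    """C(I) = [(1 - v) / (1 - E[v])] * [E[v] / v]; 1 is the break-even point."""
    v = violation_ratio(I, transactions)
    ev = expected_violation_ratio(I, transactions)
    return ((1 - v) / (1 - ev)) * (ev / v)
```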

  47. Collective Strength (3) • Example: collective strength of I = {X, Y} (the worked computation appears as a figure on the slide)

  48. Summary • Frequent pattern mining is an important data mining task • Apriori is an important frequent pattern mining methodology • A set of Apriori-like mining methods have been developed since 1994 • Interestingness measures are important for discovering interesting rules

  49. Technologies for Mining Frequent Patterns in Large Databases Jiawei Han Intelligent Database Systems Research Lab. Simon Fraser University, Canada http://www.cs.sfu.ca/~han

  50. Tutorial Outline • What is frequent pattern mining? • Frequent pattern mining algorithms • Apriori and its variations • A multi-dimensional view of frequent pattern mining • Constraint-based frequent pattern mining • Recent progress on efficient mining methods • Mining frequent patterns without candidate generation • CLOSET: Efficient mining of frequent closed itemsets • FreeSpan: Towards efficient sequential pattern mining
