
Efficiently Mining Long Patterns from Databases

This presentation covers the Max-Miner algorithm, which efficiently extracts long patterns from databases: the techniques employed, support lower bounding, additional constraints, and the experimental evaluation.



Presentation Transcript


  1. Efficiently Mining Long Patterns from Databases, by Roberto J. Bayardo Jr., IBM Research Center. Presented by Greeshma Neglur and Devangana Tarafdar.

  2. Contents • Introduction • Max-Miner Overview • Techniques Employed • Formalizing Max-Miner • Support Lower Bounding (Max-Miner and Apriori) • Additional Constraints on Max-Miner • Evaluation & Experimental Results • Summary

  3. Introduction • Finding patterns in databases is the fundamental operation behind most data-mining tasks. • Most pattern-mining algorithms (Apriori-like) operate efficiently on databases where the longest patterns are relatively short. • Examples of databases with long patterns: • Biological databases (DNA, proteins, etc.) • Databases with questionnaire results

  4. Disadvantages of Apriori-like Algorithms • They employ bottom-up search to enumerate every single frequent itemset (from smaller sets to bigger sets) • To produce a frequent itemset of length ℓ, all 2^ℓ subsets must be produced, as they must also be frequent • This exponential complexity restricts them to discovering only short patterns

  5. Basic Concepts and Definitions • Data set: a set of transactions over a finite item domain • Itemset: a set of items from the finite item domain • Support of an itemset I, sup(I): the number of transactions that contain I • Frequent itemset: one whose support >= the minimum support (minSup) specified by the user
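These definitions translate directly into code. Below is a minimal Python sketch of support counting and the frequency test; the helper names and the four-transaction toy data set are illustrative assumptions, not material from the paper:

```python
def support(itemset, transactions):
    """sup(I): the number of transactions that contain every item of I."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def is_frequent(itemset, transactions, min_sup):
    """Frequent iff support >= minSup (expressed here as an absolute count)."""
    return support(itemset, transactions) >= min_sup

# Toy data set: four transactions over the item domain {1, 2, 3, 4}.
transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
min_sup = 1                                        # 25% of 4 transactions

print(support({2, 3}, transactions))               # 2
print(is_frequent({1, 3}, transactions, min_sup))  # False: sup({1,3}) = 0
```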

  6. Maximal Frequent itemsets • Maximal frequent itemset is one that has no superset that is frequent • Example: • {1,2,3} – not maximal as its superset {1,2,3,4} is frequent
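Maximality is then just a check against the other frequent itemsets. A small sketch (the function name and example data are mine), reproducing the slide's example where {1,2,3} fails because {1,2,3,4} is frequent:

```python
def is_maximal(itemset, frequent_itemsets):
    """A frequent itemset is maximal iff no proper superset is frequent."""
    s = set(itemset)
    return not any(s < set(f) for f in frequent_itemsets)

frequent = [{1}, {4}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}]
print(is_maximal({1, 2, 3}, frequent))     # False: {1,2,3,4} is frequent
print(is_maximal({1, 2, 3, 4}, frequent))  # True: no frequent superset
```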

  7. Max-Miner Overview • Max-Miner is an improvement over Apriori-like algorithms • How? • It efficiently extracts only the maximal frequent itemsets • It implicitly outputs all frequent itemsets, since any frequent itemset is a subset of some maximal frequent itemset • Performance • Improves on Apriori by two or more orders of magnitude on long-pattern datasets • Scales roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern

  8. Set Enumeration Trees • Illustrate how sets of items are to be completely enumerated in a search problem • Example for the item domain {1,2,3,4}, shown level by level: • Root: {} • Level 1: 1 (25%), 2 (50%), 3 (58%), 4 (67%) • Level 2: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4} • Level 3: {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4} • Level 4: {1,2,3,4}
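One way to see how the tree enumerates every subset exactly once: a short recursive generator (a sketch, not the paper's code) in which each node's children extend it only with items that come later in the ordering:

```python
def enumerate_sets(prefix, remaining):
    """Depth-first walk of the set enumeration tree: each node extends its
    parent with one item, and children may only use items that come later
    in the ordering, so every subset is generated exactly once."""
    for k, item in enumerate(remaining):
        node = prefix + [item]
        yield node
        yield from enumerate_sets(node, remaining[k + 1:])

for itemset in enumerate_sets([], [1, 2, 3, 4]):
    print(itemset)
# [1], [1,2], [1,2,3], [1,2,3,4], [1,2,4], [1,3], [1,3,4], [1,4],
# [2], [2,3], [2,3,4], [2,4], [3], [3,4], [4] -- all 15 non-empty subsets
```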

  9. Set enumeration tree ... • Max-Miner expands sets over an ordered and finite Item domain (D) • Max-Miner uses heuristic techniques to order the items and dynamically reorder them on a per-node basis • The order imposed on the item domain affects the parent/child relationships but not the completeness

  10. Techniques Employed… • Each node in the tree is represented by a candidate group g • A candidate group g has two itemsets: • Head, h(g): the itemset enumerated by the node • Tail, t(g): the ordered itemset containing all items not in h(g) that can potentially appear in any sub-node • Example, along the leftmost path of the tree: h(g)={1}, t(g)={2,3,4} → h(g)={1,2}, t(g)={3,4} → h(g)={1,2,3}, t(g)={4} → h(g)={1,2,3,4}, t(g)={}

  11. Techniques Employed… • The ordering of the tail items reflects how the sub-nodes are to be expanded • Counting the support of g means counting: • the support of itemset h(g) • the support of itemset h(g) ∪ t(g) • the support of h(g) ∪ {i} for all i ∈ t(g) • The supports of itemsets other than h(g) are used for pruning
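Slides 10 and 11 together suggest a simple representation: a candidate group pairs a head with an ordered tail, and a single pass over the data yields all the supports the node needs. A hedged Python sketch; the class and function names are mine, and the toy transactions are chosen so the extension supports match the 25%/0%/25% figures of the running example below:

```python
from dataclasses import dataclass

@dataclass
class CandidateGroup:
    head: frozenset   # h(g): the itemset enumerated by this node
    tail: list        # t(g): ordered items that may appear in sub-nodes

def count_group_supports(g, transactions):
    """One pass over the data, counting sup(h(g)), sup(h(g) U t(g)),
    and sup(h(g) U {i}) for every tail item i."""
    sup_head, sup_head_tail = 0, 0
    sup_ext = {i: 0 for i in g.tail}
    for t in transactions:
        t = set(t)
        if g.head <= t:
            sup_head += 1
            if set(g.tail) <= t:
                sup_head_tail += 1
            for i in g.tail:
                if i in t:
                    sup_ext[i] += 1
    return sup_head, sup_head_tail, sup_ext

g = CandidateGroup(head=frozenset({1}), tail=[2, 3, 4])
transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
print(count_group_supports(g, transactions))
# -> (1, 0, {2: 1, 3: 0, 4: 1}): matches sup(1,2)=25%, sup(1,3)=0%,
#    sup(1,4)=25% of the running example (4 transactions, minSup 25%)
```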

  12. Techniques Employed • Subset pruning (Apriori-like) • If an itemset is infrequent, then every superset is also infrequent – do not expand the node • Max-Miner abandons a strict bottom-up traversal of the search space and attempts to look ahead • How? • Superset pruning (looking ahead) • If an itemset is frequent, then all its subsets are frequent but not maximal – do not expand the node

  13. Techniques Employed… • Subset pruning is done before expanding sub-nodes, by removing any item i ∈ t(g) from g if h(g) ∪ {i} is infrequent. • If h(g) ∪ {i} is infrequent, then the head of any sub-node containing i is also infrequent. • Example, using the set enumeration tree described earlier: at the node with h(g)={1}, t(g)={2,3,4}, the itemset {1,3} is infrequent, hence 3 is removed from t(g) and no sub-node whose head contains 3 is generated.

  14. Techniques Employed… • Superset pruning is done by halting sub-node expansion at any g for which h(g) ∪ t(g) is frequent. Why can this be done? • h(g) ∪ t(g) being frequent implies that the itemset enumerated by any sub-node is frequent but not maximal. • Example, using the set enumeration tree described earlier: at the node with h(g)={2}, t(g)={3,4}, the itemset {2,3,4} is frequent, hence Max-Miner "looks ahead" and halts expansion of the sub-nodes {2,3}, {2,4}, and {2,3,4}.
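Both pruning rules fit naturally into the node-expansion step. The sketch below (names and structure are my reconstruction, not the paper's pseudocode) takes the supports counted as in the previous snippet and applies superset pruning first, then subset pruning:

```python
def expand_node(head, tail, sup_head_tail, sup_ext, min_sup):
    """Apply both pruning rules at a node; returns
    (frequent itemset discovered by looking ahead, pruned tail)."""
    # Superset pruning ("looking ahead"): if h(g) U t(g) is frequent,
    # everything below this node is frequent but not maximal.
    if sup_head_tail >= min_sup:
        return set(head) | set(tail), []   # record it; expand nothing
    # Subset pruning: if h(g) U {i} is infrequent, no sub-node whose head
    # contains i can enumerate a frequent itemset, so drop i from the tail.
    return None, [i for i in tail if sup_ext[i] >= min_sup]

# Node h(g)={1}, t(g)=[2,3,4] of the running example, using the supports
# counted by the previous snippet (minSup = 1 of 4 transactions):
print(expand_node({1}, [2, 3, 4], 0, {2: 1, 3: 0, 4: 1}, 1))
# -> (None, [2, 4]): {1,3} is infrequent, so item 3 is pruned
```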

  15. Running Example • Item domain = {5,8,9,2,1,3,10,7,6,4} • minSup = 25% • [Transaction data set shown on slide]

  16. Formalizing Max-Miner • The Max-Miner algorithm consists of three main functions: • Max-Miner, the main function: accepts a data set and a minimum support. • Gen-Initial-Groups: performs an initial scan over the data set, identifies the item domain, and begins the tree enumeration process. • Gen-Sub-Nodes: generates the actual sub-nodes.

  17. Formalizing Max-Miner • [Pseudocode shown on slide]

  18. Formalizing Max-Miner • [Pseudocode continued on slide]
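The pseudocode on these two slides survives only as images, so here is a simplified Python reconstruction of the three routines from the descriptions above. It is a sketch under several assumptions: supports are recounted with fresh scans rather than batched into one pass per level as in the real algorithm, and the group-subsumption test is simplified. On a toy data set with the running example's structure it does return F = {{2,3,4}, {1,2,4}}:

```python
def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def max_miner(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    frequent = []                                  # F: itemsets found so far
    groups = gen_initial_groups(transactions, min_sup, frequent)
    while groups:                                  # one level per iteration
        new_groups = []
        for head, tail in groups:
            # Superset pruning: if h(g) U t(g) is frequent, look ahead
            # and record it instead of expanding the node.
            if support(head | set(tail), transactions) >= min_sup:
                frequent.append(head | set(tail))
            else:
                gen_sub_nodes(head, tail, transactions, min_sup,
                              new_groups, frequent)
        # Drop groups already covered by a known frequent itemset.
        groups = [(h, t) for h, t in new_groups
                  if not any(h | set(t) <= f for f in frequent)]
    # Keep only the maximal frequent itemsets.
    return [f for f in frequent
            if not any(set(f) < set(g) for g in frequent)]

def gen_initial_groups(transactions, min_sup, frequent):
    items = sorted({i for t in transactions for i in t})
    f1 = [i for i in items if support({i}, transactions) >= min_sup]
    # Item-ordering policy (slide 23): increasing support, so the most
    # frequent items land in the most tails.
    f1.sort(key=lambda i: support({i}, transactions))
    if f1:
        frequent.append({f1[-1]})                  # the last item is frequent
    return [(frozenset({i}), f1[k + 1:]) for k, i in enumerate(f1[:-1])]

def gen_sub_nodes(head, tail, transactions, min_sup, new_groups, frequent):
    # Subset pruning: drop tail items i for which h(g) U {i} is infrequent.
    tail = [i for i in tail if support(head | {i}, transactions) >= min_sup]
    for k, i in enumerate(tail[:-1]):
        new_groups.append((head | {i}, tail[k + 1:]))
    # h(g) U {last tail item} (or h(g) itself, if the tail emptied) is frequent.
    frequent.append(head | {tail[-1]} if tail else set(head))

transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
print(max_miner(transactions, min_sup=1))  # the maximal sets {2,3,4}, {1,2,4}
```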

  19. Running Example: Gen-Initial-Groups(T, C) • I = {5,8,9,2,1,3,10,7,6,4} • F1 = {1 (25%), 2 (50%), 3 (58%), 4 (67%)} (after discarding infrequent items and ordering by increasing support) • C = { h(g)={1}, t(g)={2,3,4} ; h(g)={2}, t(g)={3,4} ; h(g)={3}, t(g)={4} } • Returns {4} (the most frequent item) as a frequent itemset

  20. Running Example… • Back in Max-Miner (after Gen-Initial-Groups): F = {{4}} • sup(h(g) ∪ t(g)) for each group in C: h(g)={1}, t(g)={2,3,4} → sup{1,2,3,4} = 0%; h(g)={2}, t(g)={3,4} → sup{2,3,4} = 25%; h(g)={3}, t(g)={4} → sup{3,4} = 25% • The latter two are frequent, so F = {{4}, {2,3,4}, {3,4}} and only the first group is expanded: Cnew = {} • Gen-Sub-Nodes(h(g)={1}, t(g)={2,3,4}, Cnew): sup(h(g) ∪ {i}) gives sup{1,2} = 25%, sup{1,3} = 0%, sup{1,4} = 25% • After subset pruning, t(g) = {2,4}; the sub-node g' with h(g')={1,2}, t(g')={4} is generated, so C = {g'} • Returns {1,4} as a frequent itemset

  21. Formalizing Max-Miner • [Gen-Sub-Nodes pseudocode shown on slide]

  22. Running Example… • Back in Max-Miner (after Gen-Sub-Nodes): F = {{4}, {2,3,4}, {3,4}, {1,4}}, C = { h(g)={1,2}, t(g)={4} } • After removing itemsets with frequent supersets: F = {{2,3,4}, {1,4}} • Next iteration of the while loop: for the remaining group, sup(h(g) ∪ t(g)) = sup{1,2,4} = 25%, so F = {{2,3,4}, {1,4}, {1,2,4}} • After removing non-maximal itemsets: F = {{2,3,4}, {1,2,4}} • C = {} – termination • [Final tree generated shown on slide]

  23. Item Ordering Policies • Motivation: increase the effectiveness of superset-frequency pruning [which applies when h(g) ∪ t(g) is frequent] • Superset pruning is possible when many candidate groups have a frequent h(g) ∪ t(g) • Force the most frequent items into the tails of the most candidate groups, as they are the most likely to be part of long frequent itemsets • Hence, reorder the items by increasing support, positioning the most frequent items last (e.g., item 4 in the set enumeration tree)

  24. Support Lower Bounding • Advantage: reduces the number of candidate groups and facilitates additional superset pruning • The idea is to compute a lower bound on the support of an itemset by exploiting the available supports of its subsets • drop(Is, e) = sup(Is) − sup(Is ∪ {e}), the number of transactions dropped when item e is added to Is [here e is an item in neither Is nor I, and Is ⊆ I] • Similarly for I: sup(I ∪ {e}) = sup(I) − drop(I, e) • A lower bound on sup(I ∪ {e}) can therefore be computed from sup(I) and an upper bound on drop(I, e)

  25. Support Lower Bounding • Theorem: sup(I) − drop(Is, e) is a lower bound on sup(I ∪ {e}) • Proof: • The set of transactions containing I is entirely contained within the set of transactions containing Is (since Is ⊆ I) • So the set of transactions dropped by extending I with e is a subset of the set of transactions dropped by extending Is with e • Hence drop(Is, e) is an upper bound on drop(I, e) • Extending the concept to an itemset T disjoint from I: sup(I ∪ T) ≥ sup(I) − Σ_{e∈T} drop(Is, e)
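A direct transcription of the bound (helper names and toy data are mine): drop is measured on the cheap, already-counted subset Is, and subtracting the summed drops from sup(I) lower-bounds sup(I ∪ T):

```python
def support(itemset, transactions):
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def drop(i_s, e, transactions):
    """drop(Is, e): transactions lost when item e is added to Is."""
    return support(i_s, transactions) - support(set(i_s) | {e}, transactions)

def lower_bound(i, i_s, t_items, transactions):
    """sup(I U T) >= sup(I) - sum over e in T of drop(Is, e), valid
    because Is is a subset of I (asserted below)."""
    assert set(i_s) <= set(i)
    return support(i, transactions) - sum(drop(i_s, e, transactions)
                                          for e in t_items)

transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
# Bound sup({2,3} U {4}) using drops measured on the smaller subset Is={2}:
print(lower_bound({2, 3}, {2}, [4], transactions))  # 2 (the bound)
print(support({2, 3, 4}, transactions))             # 2 (the actual support)
```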

  26. Support Lower Bounding in Max-Miner • [Pseudocode shown on slide]

  27. Support Lower Bounding in Apriori (Apriori-LB) • An itemset of length k is a candidate iff every (k−1)-item subset was found frequent during the previous database pass • The support of the candidate itemsets Ck of length k is computed during database pass k • Apriori-LB uses the supports of subsets to lower-bound the support of a candidate itemset; when the bound is already at or above minSup, the explicit support count is avoided • The savings from counting the support of fewer candidate itemsets can be substantial
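The counting-avoidance step can be illustrated as follows. This is a hedged sketch, not the paper's pseudocode: for a length-k candidate, it tries each (k−1)-subset I and each smaller Is inside I, keeps the best bound sup(I) − drop(Is, e), and skips the database count when the bound already reaches minSup. The supports dictionary stands in for counts gathered in earlier passes:

```python
def apriori_lb_bound(candidate, sup):
    """Lower-bound sup(candidate) from supports counted in earlier passes.
    For each item e, take I = candidate - {e} and a smaller Is inside I;
    then sup(candidate) >= sup(I) - drop(Is, e)
                         = sup(I) - (sup(Is) - sup(Is U {e}))."""
    best = 0
    for e in candidate:
        i = candidate - {e}
        for x in i:
            i_s = i - {x}
            bound = sup[i] - (sup[i_s] - sup[i_s | {e}])
            best = max(best, bound)
    return best

# Supports counted in earlier passes (toy absolute counts):
sup = {frozenset({2}): 3, frozenset({3}): 3, frozenset({4}): 4,
       frozenset({2, 3}): 2, frozenset({2, 4}): 3, frozenset({3, 4}): 3}
min_sup = 2
cand = frozenset({2, 3, 4})
if apriori_lb_bound(cand, sup) >= min_sup:
    # Apriori-LB: no need to scan the database for this candidate.
    print(cand, "is frequent by its lower bound alone")
```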

  28. Max-Miner-LO • Finds the cardinality ℓ of the longest itemset in the set of frequent itemsets F after each database pass • Performs the following pruning operations: • Prune any frequent itemset in F shorter than ℓ • Prune any candidate group g in C such that |h(g) ∪ t(g)| < ℓ
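Both pruning operations are a few lines each. A sketch (F and C are represented as plain Python lists; the names are mine):

```python
def prune_for_longest_only(frequent, groups):
    """Max-Miner-LO pruning after a database pass: keep only itemsets and
    groups that can still match the longest pattern found so far."""
    if not frequent:
        return frequent, groups
    longest = max(len(f) for f in frequent)
    # Prune frequent itemsets shorter than the current longest.
    frequent = [f for f in frequent if len(f) >= longest]
    # Prune groups that cannot reach that cardinality: |h(g) U t(g)| < longest.
    groups = [(h, t) for h, t in groups if len(h) + len(t) >= longest]
    return frequent, groups

F = [{2, 3, 4}, {1, 4}]
C = [(frozenset({1, 2}), [4]), (frozenset({3}), [])]
print(prune_for_longest_only(F, C))
# -> ([{2, 3, 4}], [(frozenset({1, 2}), [4])]): {1,4} and the 1-item
#    group cannot beat the length-3 itemset already found
```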

  29. Evaluation • Evaluation was performed on datasets with relatively wide record lengths, categorically valued attributes, and a substantial number of records • [The seven datasets used are listed on the slide]

  30. Comparison with Apriori and Apriori-LB • Max-Miner outperforms Apriori and Apriori-LB because fewer candidate itemsets are considered • Apriori and Apriori-LB could not be run at lower supports because of memory constraints

  31. Evaluation • [Performance charts shown on slide]

  32. Performance of Max-Miner-LO • Additional constraints can make mining feasible even when the number of maximal frequent itemsets is too large and the support is very low

  33. Summary • Max-Miner applies several techniques for reducing the space of itemsets considered, chiefly superset- and subset-frequency-based pruning. • It can be used to achieve tractable completeness at low supports on complex datasets. • Some of these techniques can be used to modify and improve existing algorithms such as Apriori.

  34. Thank You !
