
Efficiently Mining Long Patterns from Databases

This presentation covers the Max-Miner algorithm, which efficiently extracts long patterns from databases: the techniques employed, support lower bounding, additional constraints, and the experimental evaluation.



Presentation Transcript


  1. Efficiently Mining Long Patterns from Databases, by Roberto J. Bayardo Jr., IBM Research Center. Presented by Greeshma Neglur and Devangana Tarafdar.

  2. Contents • Introduction • Max-Miner Overview • Techniques Employed • Formalizing Max-Miner • Support Lower Bounding (Max-Miner and Apriori) • Additional Constraints on Max-Miner • Evaluation & Experimental Results • Summary

  3. Introduction • Finding patterns in databases is the fundamental operation behind most data-mining tasks. • Most pattern-mining algorithms (Apriori-like) operate efficiently on databases where the longest patterns are relatively short. • Examples of databases with long patterns: • Biological databases (DNA, proteins, etc.) • Databases with questionnaire results

  4. Disadvantages of Apriori-like Algorithms • They employ bottom-up search to enumerate every single frequent itemset (from smaller sets to bigger sets) • To produce a frequent itemset of length ℓ, all 2^ℓ subsets must be produced, as they must also be frequent • This exponential complexity restricts them to discovering only short patterns

  5. Basic Concepts and Definitions • Data set: a set of transactions over a finite item domain • Itemset: a set of items from the finite item domain • Support of an itemset I, sup(I): the number of transactions that contain I • Frequent itemset: one whose support >= the minimum support (minSup) specified by the user
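These definitions translate directly into code. Below is a minimal Python sketch of support counting and the frequency test; the helper names and the four-transaction toy data set are illustrative assumptions, not material from the paper:

```python
def support(itemset, transactions):
    """sup(I): the number of transactions that contain every item of I."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def is_frequent(itemset, transactions, min_sup):
    """Frequent iff support >= minSup (expressed here as an absolute count)."""
    return support(itemset, transactions) >= min_sup

# Toy data set: four transactions over the item domain {1, 2, 3, 4}.
transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
min_sup = 1                                        # 25% of 4 transactions

print(support({2, 3}, transactions))               # 2
print(is_frequent({1, 3}, transactions, min_sup))  # False: sup({1,3}) = 0
```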

  6. Maximal Frequent itemsets • Maximal frequent itemset is one that has no superset that is frequent • Example: • {1,2,3} – not maximal as its superset {1,2,3,4} is frequent
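Maximality is then just a check against the other frequent itemsets. A small sketch (the function name and example data are mine), reproducing the slide's example where {1,2,3} fails because {1,2,3,4} is frequent:

```python
def is_maximal(itemset, frequent_itemsets):
    """A frequent itemset is maximal iff no proper superset is frequent."""
    s = set(itemset)
    return not any(s < set(f) for f in frequent_itemsets)

frequent = [{1}, {4}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}]
print(is_maximal({1, 2, 3}, frequent))     # False: {1,2,3,4} is frequent
print(is_maximal({1, 2, 3, 4}, frequent))  # True: no frequent superset
```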

  7. Max-Miner Overview • Max-Miner is an improvement over Apriori-like algorithms • How? • It efficiently extracts only the maximal frequent itemsets • It implicitly outputs all frequent itemsets, since any frequent itemset is a subset of some maximal frequent itemset • Performance • Improves on Apriori by two or more orders of magnitude on long-pattern datasets • Scales roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern

  8. Set Enumeration Trees • Illustrate how sets of items are to be completely enumerated in a search problem • Example for the item domain {1,2,3,4}, shown level by level: • Root: {} • Level 1: 1 (25%), 2 (50%), 3 (58%), 4 (67%) • Level 2: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4} • Level 3: {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4} • Level 4: {1,2,3,4}
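One way to see how the tree enumerates every subset exactly once: a short recursive generator (a sketch, not the paper's code) in which each node's children extend it only with items that come later in the ordering:

```python
def enumerate_sets(prefix, remaining):
    """Depth-first walk of the set enumeration tree: each node extends its
    parent with one item, and children may only use items that come later
    in the ordering, so every subset is generated exactly once."""
    for k, item in enumerate(remaining):
        node = prefix + [item]
        yield node
        yield from enumerate_sets(node, remaining[k + 1:])

for itemset in enumerate_sets([], [1, 2, 3, 4]):
    print(itemset)
# [1], [1,2], [1,2,3], [1,2,3,4], [1,2,4], [1,3], [1,3,4], [1,4],
# [2], [2,3], [2,3,4], [2,4], [3], [3,4], [4] -- all 15 non-empty subsets
```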

  9. Set enumeration tree ... • Max-Miner expands sets over an ordered and finite Item domain (D) • Max-Miner uses heuristic techniques to order the items and dynamically reorder them on a per-node basis • The order imposed on the item domain affects the parent/child relationships but not the completeness

  10. Techniques Employed… • Each node in the tree is represented by a candidate group g • A candidate group g has two itemsets: • Head, h(g): the itemset enumerated by the node • Tail, t(g): the ordered itemset containing all items not in h(g) that can potentially appear in any sub-node • Example, along the leftmost path of the tree: h(g)={1}, t(g)={2,3,4} → h(g)={1,2}, t(g)={3,4} → h(g)={1,2,3}, t(g)={4} → h(g)={1,2,3,4}, t(g)={}

  11. Techniques Employed… • The ordering of the tail items reflects how the sub-nodes are to be expanded • Counting the support of g means counting: • the support of itemset h(g) • the support of itemset h(g) ∪ t(g) • the support of h(g) ∪ {i} for all i ∈ t(g) • The supports of itemsets other than h(g) are used for pruning
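Slides 10 and 11 together suggest a simple representation: a candidate group pairs a head with an ordered tail, and a single pass over the data yields all the supports the node needs. A hedged Python sketch; the class and function names are mine, and the toy transactions are chosen so the extension supports match the 25%/0%/25% figures of the running example below:

```python
from dataclasses import dataclass

@dataclass
class CandidateGroup:
    head: frozenset   # h(g): the itemset enumerated by this node
    tail: list        # t(g): ordered items that may appear in sub-nodes

def count_group_supports(g, transactions):
    """One pass over the data, counting sup(h(g)), sup(h(g) U t(g)),
    and sup(h(g) U {i}) for every tail item i."""
    sup_head, sup_head_tail = 0, 0
    sup_ext = {i: 0 for i in g.tail}
    for t in transactions:
        t = set(t)
        if g.head <= t:
            sup_head += 1
            if set(g.tail) <= t:
                sup_head_tail += 1
            for i in g.tail:
                if i in t:
                    sup_ext[i] += 1
    return sup_head, sup_head_tail, sup_ext

g = CandidateGroup(head=frozenset({1}), tail=[2, 3, 4])
transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
print(count_group_supports(g, transactions))
# -> (1, 0, {2: 1, 3: 0, 4: 1}): matches sup(1,2)=25%, sup(1,3)=0%,
#    sup(1,4)=25% of the running example (4 transactions, minSup 25%)
```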

  12. Techniques Employed • Subset pruning (Apriori-like) • If an itemset is infrequent, then every superset is also infrequent – do not expand the node • Max-Miner abandons a strict bottom-up traversal of the search space and attempts to look ahead • How? • Superset pruning (looking ahead) • If an itemset is frequent, then all its subsets are frequent but not maximal – do not expand the node

  13. Techniques Employed… • Subset pruning is done before expanding sub-nodes, by removing any item i ∈ t(g) from g if h(g) ∪ {i} is infrequent. • If h(g) ∪ {i} is infrequent, then the head of any sub-node containing i is also infrequent. • Example, using the set enumeration tree described earlier: at the node with h(g)={1}, t(g)={2,3,4}, the itemset {1,3} is infrequent, hence 3 is removed from t(g) and no sub-node whose head contains 3 is generated.

  14. Techniques Employed… • Superset pruning is done by halting sub-node expansion at any g for which h(g) ∪ t(g) is frequent. Why can this be done? • h(g) ∪ t(g) being frequent implies that the itemset enumerated by any sub-node is frequent but not maximal. • Example, using the set enumeration tree described earlier: at the node with h(g)={2}, t(g)={3,4}, the itemset {2,3,4} is frequent, hence Max-Miner "looks ahead" and halts expansion of the sub-nodes {2,3}, {2,4}, and {2,3,4}.
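Both pruning rules fit naturally into the node-expansion step. The sketch below (names and structure are my reconstruction, not the paper's pseudocode) takes the supports counted as in the previous snippet and applies superset pruning first, then subset pruning:

```python
def expand_node(head, tail, sup_head_tail, sup_ext, min_sup):
    """Apply both pruning rules at a node; returns
    (frequent itemset discovered by looking ahead, pruned tail)."""
    # Superset pruning ("looking ahead"): if h(g) U t(g) is frequent,
    # everything below this node is frequent but not maximal.
    if sup_head_tail >= min_sup:
        return set(head) | set(tail), []   # record it; expand nothing
    # Subset pruning: if h(g) U {i} is infrequent, no sub-node whose head
    # contains i can enumerate a frequent itemset, so drop i from the tail.
    return None, [i for i in tail if sup_ext[i] >= min_sup]

# Node h(g)={1}, t(g)=[2,3,4] of the running example, using the supports
# counted by the previous snippet (minSup = 1 of 4 transactions):
print(expand_node({1}, [2, 3, 4], 0, {2: 1, 3: 0, 4: 1}, 1))
# -> (None, [2, 4]): {1,3} is infrequent, so item 3 is pruned
```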

  15. Running Example • Item domain = {5,8,9,2,1,3,10,7,6,4} • minSup = 25% • [Transaction data set shown on slide]

  16. Formalizing Max-Miner • The Max-Miner algorithm consists of three main functions: • Max-Miner, the main function: accepts a data set and a minimum support. • Gen-Initial-Groups: performs an initial scan over the data set, identifies the item domain, and begins the tree enumeration process. • Gen-Sub-Nodes: generates the actual sub-nodes.

  17. Formalizing Max-Miner • [Pseudocode shown on slide]

  18. Formalizing Max-Miner • [Pseudocode continued on slide]
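The pseudocode on these two slides survives only as images, so here is a simplified Python reconstruction of the three routines from the descriptions above. It is a sketch under several assumptions: supports are recounted with fresh scans rather than batched into one pass per level as in the real algorithm, and the group-subsumption test is simplified. On a toy data set with the running example's structure it does return F = {{2,3,4}, {1,2,4}}:

```python
def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def max_miner(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    frequent = []                                  # F: itemsets found so far
    groups = gen_initial_groups(transactions, min_sup, frequent)
    while groups:                                  # one level per iteration
        new_groups = []
        for head, tail in groups:
            # Superset pruning: if h(g) U t(g) is frequent, look ahead
            # and record it instead of expanding the node.
            if support(head | set(tail), transactions) >= min_sup:
                frequent.append(head | set(tail))
            else:
                gen_sub_nodes(head, tail, transactions, min_sup,
                              new_groups, frequent)
        # Drop groups already covered by a known frequent itemset.
        groups = [(h, t) for h, t in new_groups
                  if not any(h | set(t) <= f for f in frequent)]
    # Keep only the maximal frequent itemsets.
    return [f for f in frequent
            if not any(set(f) < set(g) for g in frequent)]

def gen_initial_groups(transactions, min_sup, frequent):
    items = sorted({i for t in transactions for i in t})
    f1 = [i for i in items if support({i}, transactions) >= min_sup]
    # Item-ordering policy (slide 23): increasing support, so the most
    # frequent items land in the most tails.
    f1.sort(key=lambda i: support({i}, transactions))
    if f1:
        frequent.append({f1[-1]})                  # the last item is frequent
    return [(frozenset({i}), f1[k + 1:]) for k, i in enumerate(f1[:-1])]

def gen_sub_nodes(head, tail, transactions, min_sup, new_groups, frequent):
    # Subset pruning: drop tail items i for which h(g) U {i} is infrequent.
    tail = [i for i in tail if support(head | {i}, transactions) >= min_sup]
    for k, i in enumerate(tail[:-1]):
        new_groups.append((head | {i}, tail[k + 1:]))
    # h(g) U {last tail item} (or h(g) itself, if the tail emptied) is frequent.
    frequent.append(head | {tail[-1]} if tail else set(head))

transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
print(max_miner(transactions, min_sup=1))  # the maximal sets {2,3,4}, {1,2,4}
```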

  19. Running Example: Gen-Initial-Groups(T, C) • I = {5,8,9,2,1,3,10,7,6,4} • F1 = {1 (25%), 2 (50%), 3 (58%), 4 (67%)} (after discarding infrequent items and ordering by increasing support) • C = { h(g)={1}, t(g)={2,3,4} ; h(g)={2}, t(g)={3,4} ; h(g)={3}, t(g)={4} } • Returns {4} (the most frequent item) as a frequent itemset

  20. Running Example… • Back in Max-Miner (after Gen-Initial-Groups): F = {{4}} • sup(h(g) ∪ t(g)) for each group in C: h(g)={1}, t(g)={2,3,4} → sup{1,2,3,4} = 0%; h(g)={2}, t(g)={3,4} → sup{2,3,4} = 25%; h(g)={3}, t(g)={4} → sup{3,4} = 25% • The latter two are frequent, so F = {{4}, {2,3,4}, {3,4}} and only the first group is expanded: Cnew = {} • Gen-Sub-Nodes(h(g)={1}, t(g)={2,3,4}, Cnew): sup(h(g) ∪ {i}) gives sup{1,2} = 25%, sup{1,3} = 0%, sup{1,4} = 25% • After subset pruning, t(g) = {2,4}; the sub-node g' with h(g')={1,2}, t(g')={4} is generated, so C = {g'} • Returns {1,4} as a frequent itemset

  21. Formalizing Max-Miner • [Gen-Sub-Nodes pseudocode shown on slide]

  22. Running Example… • Back in Max-Miner (after Gen-Sub-Nodes): F = {{4}, {2,3,4}, {3,4}, {1,4}}, C = { h(g)={1,2}, t(g)={4} } • After removing itemsets with frequent supersets: F = {{2,3,4}, {1,4}} • Next iteration of the while loop: for the remaining group, sup(h(g) ∪ t(g)) = sup{1,2,4} = 25%, so F = {{2,3,4}, {1,4}, {1,2,4}} • After removing non-maximal itemsets: F = {{2,3,4}, {1,2,4}} • C = {} – termination • [Final tree generated shown on slide]

  23. Item Ordering Policies • Motivation: increase the effectiveness of superset-frequency pruning [which applies when h(g) ∪ t(g) is frequent] • Superset pruning is possible when many candidate groups have a frequent h(g) ∪ t(g) • Force the most frequent items into the tails of the most candidate groups, as they are the most likely to be part of long frequent itemsets • Hence, reorder the items by increasing support, positioning the most frequent items last (e.g., item 4 in the set enumeration tree)

  24. Support Lower Bounding • Advantage: reduces the number of candidate groups and facilitates additional superset pruning • The idea is to compute a lower bound on the support of an itemset by exploiting the available supports of its subsets • drop(Is, e) = sup(Is) − sup(Is ∪ {e}), the number of transactions dropped when item e is added to Is [here e is an item in neither Is nor I, and Is ⊆ I] • Similarly for I: sup(I ∪ {e}) = sup(I) − drop(I, e) • A lower bound on sup(I ∪ {e}) can therefore be computed from sup(I) and an upper bound on drop(I, e)

  25. Support Lower Bounding • Theorem: sup(I) − drop(Is, e) is a lower bound on sup(I ∪ {e}) • Proof: • The set of transactions containing I is entirely contained within the set of transactions containing Is (since Is ⊆ I) • So the set of transactions dropped by extending I with e is a subset of the set of transactions dropped by extending Is with e • Hence drop(Is, e) is an upper bound on drop(I, e) • Extending the concept to an itemset T disjoint from I: sup(I ∪ T) ≥ sup(I) − Σ_{e∈T} drop(Is, e)
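A direct transcription of the bound (helper names and toy data are mine): drop is measured on the cheap, already-counted subset Is, and subtracting the summed drops from sup(I) lower-bounds sup(I ∪ T):

```python
def support(itemset, transactions):
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def drop(i_s, e, transactions):
    """drop(Is, e): transactions lost when item e is added to Is."""
    return support(i_s, transactions) - support(set(i_s) | {e}, transactions)

def lower_bound(i, i_s, t_items, transactions):
    """sup(I U T) >= sup(I) - sum over e in T of drop(Is, e), valid
    because Is is a subset of I (asserted below)."""
    assert set(i_s) <= set(i)
    return support(i, transactions) - sum(drop(i_s, e, transactions)
                                          for e in t_items)

transactions = [{1, 2, 4}, {2, 3, 4}, {3, 4}, {2, 3, 4}]
# Bound sup({2,3} U {4}) using drops measured on the smaller subset Is={2}:
print(lower_bound({2, 3}, {2}, [4], transactions))  # 2 (the bound)
print(support({2, 3, 4}, transactions))             # 2 (the actual support)
```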

  26. Support Lower Bounding in Max-Miner • [Pseudocode shown on slide]

  27. Support Lower Bounding in Apriori (Apriori-LB) • An itemset of length k is a candidate iff every (k−1)-item subset was found frequent during the previous database pass • The support of the candidate itemsets Ck of length k is computed during database pass k • Apriori-LB uses the supports of subsets to lower-bound the support of a candidate itemset; when the bound is already at or above minSup, the explicit support count is avoided • The savings from counting the support of fewer candidate itemsets can be substantial
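The counting-avoidance step can be illustrated as follows. This is a hedged sketch, not the paper's pseudocode: for a length-k candidate, it tries each (k−1)-subset I and each smaller Is inside I, keeps the best bound sup(I) − drop(Is, e), and skips the database count when the bound already reaches minSup. The supports dictionary stands in for counts gathered in earlier passes:

```python
def apriori_lb_bound(candidate, sup):
    """Lower-bound sup(candidate) from supports counted in earlier passes.
    For each item e, take I = candidate - {e} and a smaller Is inside I;
    then sup(candidate) >= sup(I) - drop(Is, e)
                         = sup(I) - (sup(Is) - sup(Is U {e}))."""
    best = 0
    for e in candidate:
        i = candidate - {e}
        for x in i:
            i_s = i - {x}
            bound = sup[i] - (sup[i_s] - sup[i_s | {e}])
            best = max(best, bound)
    return best

# Supports counted in earlier passes (toy absolute counts):
sup = {frozenset({2}): 3, frozenset({3}): 3, frozenset({4}): 4,
       frozenset({2, 3}): 2, frozenset({2, 4}): 3, frozenset({3, 4}): 3}
min_sup = 2
cand = frozenset({2, 3, 4})
if apriori_lb_bound(cand, sup) >= min_sup:
    # Apriori-LB: no need to scan the database for this candidate.
    print(cand, "is frequent by its lower bound alone")
```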

  28. Max-Miner-LO • Finds the cardinality ℓ of the longest itemset in the set of frequent itemsets F after each database pass • Performs the following pruning operations: • Prune any frequent itemset in F shorter than ℓ • Prune any candidate group g in C such that |h(g) ∪ t(g)| < ℓ
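Both pruning operations are a few lines each. A sketch (F and C are represented as plain Python lists; the names are mine):

```python
def prune_for_longest_only(frequent, groups):
    """Max-Miner-LO pruning after a database pass: keep only itemsets and
    groups that can still match the longest pattern found so far."""
    if not frequent:
        return frequent, groups
    longest = max(len(f) for f in frequent)
    # Prune frequent itemsets shorter than the current longest.
    frequent = [f for f in frequent if len(f) >= longest]
    # Prune groups that cannot reach that cardinality: |h(g) U t(g)| < longest.
    groups = [(h, t) for h, t in groups if len(h) + len(t) >= longest]
    return frequent, groups

F = [{2, 3, 4}, {1, 4}]
C = [(frozenset({1, 2}), [4]), (frozenset({3}), [])]
print(prune_for_longest_only(F, C))
# -> ([{2, 3, 4}], [(frozenset({1, 2}), [4])]): {1,4} and the 1-item
#    group cannot beat the length-3 itemset already found
```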

  29. Evaluation • Evaluation was performed on datasets with relatively wide record lengths, categorically valued attributes, and a substantial number of records • [The seven datasets used are listed on the slide]

  30. Comparison with Apriori and Apriori-LB • Max-Miner outperforms Apriori and Apriori-LB because fewer candidate itemsets are considered • Apriori and Apriori-LB could not be run at lower supports because of memory constraints

  31. Evaluation • [Performance charts shown on slide]

  32. Performance of Max-Miner-LO • Additional constraints can make mining feasible even when the number of maximal frequent itemsets is too large and the support is very low

  33. Summary • Max-Miner applies several techniques for reducing the space of itemsets considered, chiefly superset- and subset-frequency-based pruning. • It can be used to achieve tractable completeness at low supports on complex datasets. • Some of these techniques can be used to modify and improve existing algorithms such as Apriori.

  34. Thank You !
