An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases

Takeaki Uno (National Institute of Informatics, JAPAN)
Tatsuya Asai (Kyushu University, JAPAN)
Yuzo Uchida (Kyushu University, JAPAN)
Hiroki Arimura (Hokkaido University, JAPAN)
4/Oct/2004, Discovery Science

Transaction Database
・ Transaction database T: a database composed of transactions defined on an itemset I, i.e., for every t∈T, t⊆I
  - basket data
  - links of web pages
  - words in documents
・ A subset of I is called a pattern
・ Real-world data is often large and sparse
T ＝ { {1,2,5,6,7}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

Occurrences of Pattern
・ For a pattern P,
  - occurrence of P: a transaction in T including P
  - denotation of P: the set of occurrences of P
・ The size of the denotation is called the frequency of P
T ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
denotation of {1,2} ＝ { {1,2,5,6,7,9}, {1,2,7,8,9} }
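The definitions above can be sketched in a few lines of Python. This is only an illustration on the slide's example database; the function names `denotation` and `frequency` are chosen here and are not taken from the LCM implementation.

```python
# Example database from the slide; each transaction is a set of item ids.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def denotation(pattern, transactions):
    """Return the occurrences of `pattern`: every transaction that includes it."""
    p = set(pattern)
    return [t for t in transactions if p <= t]

def frequency(pattern, transactions):
    """The frequency of a pattern is the size of its denotation."""
    return len(denotation(pattern, transactions))

print(denotation({1, 2}, T))  # the two transactions that contain both 1 and 2
print(frequency({1, 2}, T))   # 2
```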

Frequent Pattern
・ Given a minimum support θ, a frequent pattern is a pattern s.t. (frequency) ≧ θ (a subset of items that is included in at least θ transactions)
Ex.) T ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
patterns included in at least 3 transactions: {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
・ Frequent patterns play an important role in discovering interesting knowledge
・ However, the number of frequent patterns is often large…
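As a point of comparison, the definition can be checked by brute force. The sketch below tests every non-empty subset of items, which is exponential in the number of items and is shown only to make the definition concrete; it is not how LCM or any practical miner works. The name `frequent_patterns` is hypothetical.

```python
from itertools import combinations

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def frequent_patterns(transactions, theta):
    """Brute force: keep every non-empty item subset with frequency >= theta."""
    items = sorted(set().union(*transactions))
    result = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = set(candidate)
            support = sum(1 for t in transactions if c <= t)
            if support >= theta:
                result.append(candidate)
    return result

# Non-empty patterns with frequency at least 3 in the example database.
for pattern in frequent_patterns(T, 3):
    print(pattern)
```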

Closed Pattern [Pasquier et al. 1999]
・ Patterns having the same denotation are quite similar
・ Classify patterns into equivalence classes by their denotations
・ Closed pattern: the maximal pattern in an equivalence class (＝ the intersection of the occurrences in the denotation)
・ Closure of a pattern: the closed pattern belonging to its equivalence class
(Figure: patterns grouped into equivalence classes, each containing one closed pattern and its non-closed patterns)
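Since the closed pattern of an equivalence class is the intersection of the occurrences, the closure can be computed directly from the denotation. The sketch below assumes the example database of the earlier slides; returning `None` for a pattern with no occurrence is a convention chosen here, not something the slides define.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def closure(pattern, transactions):
    """Intersect all occurrences of `pattern`; None if it never occurs."""
    p = set(pattern)
    occurrences = [t for t in transactions if p <= t]
    if not occurrences:
        return None  # assumption: closure left undefined for empty denotations
    return set(occurrences[0]).intersection(*occurrences[1:])

def is_closed(pattern, transactions):
    """A pattern is closed exactly when it equals its own closure."""
    return closure(pattern, transactions) == set(pattern)

print(closure({1}, T))        # {1, 7, 9}: every transaction with 1 also has 7 and 9
print(is_closed({7, 9}, T))   # True
print(is_closed({7}, T))      # False, its closure is {7, 9}
```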

Advantages of Closed Pattern
Completeness: [Mannila ’96; Pasquier ’99]
  - The set C of frequent closed patterns has complete information on the set F of all frequent patterns and their frequencies
  - Any maximal association rule can be constructed from the set C
  - The set C is sufficient for building classification rules over itemsets
Compactness: [Mannila ’96]
  - |Maximal Frequent| ≦ |C| ≦ |F|
  - |C| is possibly exponentially smaller than |F|
Th 1 [This paper]: For any n and m, we can construct a database of n items and m transactions such that |C| = O(m^2) while |F| = 2^Ω(n+m).

Problem and Result
・ PROBLEM: given a transaction database, find all frequent closed patterns
・ Many existing studies, in both theory and practice
We propose prefix preserving closure extension and an efficient algorithm, LCM (Linear time Closed pattern Miner)
・ Theoretical advantage: running time linear in the number of frequent closed patterns, with small memory use
・ Practical advantage: faster than the other algorithms on many datasets (almost all datasets of KDD-cup and FIMI’03)

Existing Approach
・ Frequent pattern mining based approaches: enumerate frequent patterns, and output the closed patterns among them
・ Reduce the computation time by avoiding non-closed patterns; during the enumeration,
  - eliminate unnecessary patterns from memory
  - prune unnecessary branches of the recursion (the pruning is not complete)

Our Approach
・ Existing algorithms
  - possibly operate on many non-closed patterns
  - require much memory for storing obtained patterns
We propose
・ closure extension based enumeration → operates on closed patterns only (linear time)
・ prefix preserving closure extension → no memory needed for previously obtained patterns (small memory)
・ some algorithms for fast computation → faster than other algorithms

Closure Extension [Pasquier et al. ’99]
・ Closure extension: a rule for constructing a closed pattern from another closed pattern: add an item, and take the closure
  closed pattern + item → closure
・ Any closed pattern, except the closure of φ, is a closure extension of at least one other closed pattern
・ Any closed pattern has strictly smaller size than any of its closure extensions
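The rule above can be sketched as follows: take a closed pattern, add each item it does not contain, and take the closure. The helper names are hypothetical, and the database is the example used on the earlier slides. The output also shows why plain closure extension needs duplicate detection: the same closed pattern can be reached by adding different items.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]
ITEMS = sorted(set().union(*T))

def closure(pattern):
    """Intersection of all transactions containing `pattern` (None if none do)."""
    occurrences = [t for t in T if pattern <= t]
    if not occurrences:
        return None
    return frozenset(set(occurrences[0]).intersection(*occurrences[1:]))

def closure_extensions(closed):
    """Map each closure extension of `closed` to the items that generate it."""
    generated = {}
    for i in ITEMS:
        if i in closed:
            continue
        h = closure(closed | {i})
        if h is not None:
            generated.setdefault(h, []).append(i)
    return generated

root = closure(frozenset())  # closure of the empty pattern
for h, items in closure_extensions(root).items():
    print(sorted(h), "generated by adding", items)
```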

Acyclic Relation [essentially Pasquier et al. ’99]
・ Closure extension induces an acyclic search graph
・ We can compute all closed patterns by closure extension, in time linear in their number
・ However, we still have to store obtained closed patterns in memory…

Prefix Preserving Closure Extension [new]
・ Prefix preserving closure extension (ppc extension) is a variation of closure extension
Def. closure tail of a closed pattern P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) ＝ P
Def. H ＝ closure(P ∪ {i}) (a closure extension of P) is a ppc extension of P ⇔ i > closure tail of P and H ∩ {1,…,i-1} ＝ P ∩ {1,…,i-1}
・ No duplication occurs in depth-first search
・ “Any” closed pattern H is generated from another “unique” closed pattern by ppc extension (i.e., from closure(H ∩ {1,…,i-1}), where i is the closure tail of H)
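The two definitions translate almost literally into code. The sketch below is written for readability rather than speed, assumes items are positive integers as on the slides, and uses hypothetical names (`closure_tail`, `ppc_extension`). Taking the closure tail of the empty pattern to be 0 is an assumption made here so that every item qualifies at the root.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def closure(pattern):
    """Intersection of all transactions containing `pattern` (None if none do)."""
    occurrences = [t for t in T if pattern <= t]
    if not occurrences:
        return None
    return frozenset(set(occurrences[0]).intersection(*occurrences[1:]))

def prefix(pattern, i):
    """pattern ∩ {1, ..., i-1}"""
    return {x for x in pattern if x < i}

def closure_tail(closed):
    """Minimum j such that closure(closed ∩ {1, ..., j}) == closed."""
    for j in [0] + sorted(closed):  # only 0 and members of the pattern can matter
        if closure(frozenset(x for x in closed if x <= j)) == closed:
            return j
    raise ValueError("pattern is not closed")

def ppc_extension(closed, i):
    """Return closure(closed ∪ {i}) if it is a ppc extension, else None."""
    if i in closed or i <= closure_tail(closed):
        return None
    h = closure(closed | {i})
    if h is None or prefix(h, i) != prefix(closed, i):
        return None
    return h

print(ppc_extension(frozenset({2}), 7))     # frozenset({2, 7, 9})
print(ppc_extension(frozenset({2}), 9))     # None: the prefix gains item 7
print(ppc_extension(frozenset({7, 9}), 2))  # None: 2 is not above the closure tail 7
```

In this example {2,7,9} is a closure extension of both {2} and {7,9}, but only the extension of {2} by item 7 passes the ppc test, which is what rules out duplicates.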

Relation of ppc extension [new]
・ Any closed pattern, except the closure of φ, is a ppc extension of a unique closed pattern → ppc extension forms a tree
・ We can proceed by depth-first search along ppc extensions, without storing closed patterns in memory
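A minimal depth-first enumeration along ppc extensions might look like the sketch below. It recomputes denotations and closure tails from scratch at every step, so it shows only the shape of the search tree, not the speed-up techniques of LCM; the function name `enumerate_closed` and its `theta` parameter are chosen here for illustration. The `results` list only collects output and is never consulted for duplicate checks.

```python
def enumerate_closed(transactions, theta=1):
    """List frequent closed patterns by DFS over ppc extensions."""
    items = sorted(set().union(*transactions))

    def closure(pattern):
        # Returns (closure, frequency); (None, 0) if the pattern never occurs.
        occ = [t for t in transactions if pattern <= t]
        if not occ:
            return None, 0
        return frozenset(set(occ[0]).intersection(*occ[1:])), len(occ)

    def closure_tail(closed):
        for j in [0] + sorted(closed):
            if closure(frozenset(x for x in closed if x <= j))[0] == closed:
                return j
        raise ValueError("pattern is not closed")

    results = []

    def dfs(closed):
        results.append(closed)
        tail = closure_tail(closed)
        for i in items:
            if i <= tail or i in closed:
                continue
            h, freq = closure(closed | {i})
            if h is None or freq < theta:
                continue
            # Prefix-preserving test: nothing below i may have been added.
            if {x for x in h if x < i} == {x for x in closed if x < i}:
                dfs(h)

    root, freq = closure(frozenset())
    if freq >= theta:
        dfs(root)
    return results

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]
for p in enumerate_closed(T, theta=1):
    print(sorted(p))
```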

Example
T ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・ closure extension → acyclic graph
・ ppc extension → tree
Closed patterns of T: φ, {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}

Fast Computation
To generate a ppc extension for a closed pattern P and an item i, we
1. compute the denotation of P ∪ {i}
2. compute the closure of P ∪ {i}
3. compare the prefix
We propose efficient algorithms for these tasks

Occurrence Deliver [new]
・ Compute the denotations of P ∪ {i} for all i’s at once, by transposing the trimmed database
・ The trimmed database is composed of
  - items to be added
  - transactions including P
・ Linear time in the size of the trimmed database
・ Efficient for sparse datasets
(Figure: for pattern {1,2} with denotation {A, B, C}, the trimmed transactions are scanned once and delivered to the denotations of {1,2,3}, {1,2,4} and {1,2,5})
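The transposition idea can be sketched with one pass over the denotation of P, appending each transaction id to the bucket of every other item it contains. This is a simplified illustration on the example database of the earlier slides, not the database of the figure; the name `occurrence_deliver` and the use of list positions as transaction ids are choices made here.

```python
from collections import defaultdict

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occurrence_deliver(pattern, transactions):
    """Return {item: ids of transactions containing pattern ∪ {item}}."""
    p = set(pattern)
    buckets = defaultdict(list)
    for tid, t in enumerate(transactions):
        if p <= t:                 # t is an occurrence of the pattern
            for item in t - p:     # deliver tid to every item that could be added
                buckets[item].append(tid)
    return dict(buckets)

# Denotations of {2} ∪ {i} for every candidate item i, computed in one scan.
for item, tids in sorted(occurrence_deliver({2}, T).items()):
    print(item, tids)
```

The work done is proportional to the total size of the scanned occurrences, which is why the approach suits sparse data.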

Anytime Database Reduction [new]
・ Reduce the database, by [fp-growth, etc.]
  ◆ removing item e, if e is included in fewer than θ transactions or in all transactions
  ◆ merging identical transactions into one
・ Recursively apply trimming and this reduction in the recursion → the database becomes small in the lower levels of the recursion
・ For taking closures, keep the intersection of merged transactions (the closure operation takes the intersection of transactions)
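One reduction step might be sketched as below. The representation is an assumption made for illustration: each reduced transaction maps to a pair of its multiplicity and the intersection of the original transactions merged into it, so that closures can still be taken later. The name `reduce_database` is hypothetical and the recursive application is left out.

```python
from collections import Counter

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def reduce_database(transactions, theta):
    """Drop infrequent and universal items, then merge identical transactions."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    # Keep items that are frequent but not contained in every transaction.
    keep = {e for e, c in counts.items() if theta <= c < n}
    merged = {}
    for t in transactions:
        key = frozenset(t & keep)
        if key in merged:
            weight, common = merged[key]
            merged[key] = (weight + 1, common & t)  # intersection kept for closures
        else:
            merged[key] = (1, set(t))
    return merged

for reduced, (weight, common) in reduce_database(T, 3).items():
    print(sorted(reduced), "x", weight, "intersection of originals:", sorted(common))
```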

Experiments
・ Computational environment
  - CPU, memory: AMD Athlon XP 1600+, 224MB
  - OS, programming language, compiler: Linux, C, gcc
・ Algorithms compared with: FP-growth, afopt, MAFIA, PATRICIAMINE, kDCI (all of these marked high scores at the FIMI03 competition)
・ Datasets: 13 datasets (real-world, machine learning, and artificial) used in FIMI03 and KDD-cup, with specified supports
Result
・ LCM won on 12 datasets for every support (the exception is the Accidents dataset at middle supports)
・ LCM outperforms the others especially for smaller supports

results

Conclusion
・ Closed patterns: representatives of frequent patterns [Pasquier et al. ’00]
  - much fewer than frequent patterns (possibly exponentially)
  - useful in compact representation and rule induction
・ We proposed an algorithm LCM for mining closed patterns in databases
  - prefix preserving closure extension for a tree-shaped search space
  - time complexity is linear in the number of closed patterns, with a small memory footprint
  - practical speed-up: occurrence deliver and anytime database reduction
・ Experiments show that LCM outperforms other algorithms in most instances on the KDD-cup and FIMI datasets, especially with small supports
・ Future work: closed patterns for sequences, trees, and other structures
・ LCM is submitted to the FIMI04 competition; we are looking forward to it!

List of Datasets
Machine learning benchmark
・ Chess
・ Mushroom
・ Pumsb
・ Pumsb*
・ Connect
Artificial datasets
・ T10I4D100K
・ T40I10D100K
Real datasets
・ BMS-WebView-1
・ BMS-WebView-2
・ BMS-POS
・ Retail
・ Kosarak
・ Accidents
