
An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases

Takeaki Uno (National Institute of Informatics, Japan), Tatsuya Asai (Kyushu University, Japan), Yuzo Uchida (Kyushu University, Japan), Hiroki Arimura (Hokkaido University, Japan). Discovery Science, 4 October 2004.


Presentation Transcript


  1. An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases
     Takeaki Uno (National Institute of Informatics, Japan), Tatsuya Asai (Kyushu University, Japan), Yuzo Uchida (Kyushu University, Japan), Hiroki Arimura (Hokkaido University, Japan)
     4 October 2004, Discovery Science

  2. Transaction Database
     ・ Transaction database T: a database composed of transactions defined on an itemset I, i.e., ∀t∈T, t⊆I
        - basket data
        - links of web pages
        - words in documents
     ・ A subset of I is called a pattern
     ・ Real-world data is often large and sparse
     Ex.) T = { {1,2,5,6,7}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

  3. Occurrences of Pattern
     ・ For a pattern P,
        - occurrence of P: a transaction in T including P
        - denotation of P: the set of occurrences of P
     ・ The size of the denotation is called the frequency of P
     Ex.) T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
          denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
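The definitions of denotation and frequency can be written down directly. The following is a minimal sketch on the example database of this slide; the function names `denotation` and `frequency` are chosen here for illustration and do not come from the talk.

```python
# Example database T from the slide; each transaction is a set of items.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def denotation(pattern, database):
    """Return the occurrences of pattern: the transactions that include it."""
    p = set(pattern)
    return [t for t in database if p <= t]


def frequency(pattern, database):
    """Frequency of a pattern is the size of its denotation."""
    return len(denotation(pattern, database))


print(denotation({1, 2}, T))
print(frequency({1, 2}, T))
```

This naive version scans the whole database for every pattern, which is only meant to make the definitions concrete.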

  4. Frequent Pattern
     ・ Given a minimum support θ, a frequent pattern is a pattern s.t. (frequency) ≧ θ (a subset of items that is included in at least θ transactions)
     Ex.) T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
          patterns included in at least 3 transactions: {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
     ・ Frequent patterns play an important role in discovering interesting knowledge
     ・ However, the number of frequent patterns is often large…
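To make the definition concrete, here is a brute-force sketch that tests every non-empty subset of items against the minimum support. It is exponential in the number of items and is only meant for the tiny example database; the name `frequent_patterns` is hypothetical.

```python
from itertools import combinations

# Example database T from the slide.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def frequent_patterns(database, min_support):
    """Return every non-empty pattern included in at least min_support transactions."""
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = set(candidate)
            support = sum(1 for t in database if c <= t)
            if support >= min_support:
                result.append(candidate)
    return result


for pattern in frequent_patterns(T, 3):
    print(pattern)
```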

  5. Closed Pattern [Pasquier et al. 1999]
     ・ Patterns having the same denotation are quite similar
     ・ Classify patterns into equivalence classes by their denotations
     ・ Closed pattern: the maximal pattern in an equivalence class (= the intersection of the occurrences in the denotation)
     ・ Closure of a pattern: the closed pattern belonging to its equivalence class
     [Figure: the patterns, starting from φ, grouped into equivalence classes; each class contains one closed pattern and its non-closed patterns]
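Since a closed pattern is the intersection of the occurrences in a denotation, the closure can be computed in a few lines. This is a minimal sketch on the running example database {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}; the names `closure` and `is_closed` are illustrative, and raising an error for patterns without occurrences is a choice made here, not something stated in the talk.

```python
# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure(pattern, database):
    """Intersection of all transactions that include pattern."""
    p = set(pattern)
    occurrences = [t for t in database if p <= t]
    if not occurrences:
        # Assumption of this sketch: the closure is only asked for
        # patterns that occur at least once.
        raise ValueError("pattern has no occurrence")
    return set.intersection(*occurrences)


def is_closed(pattern, database):
    """A pattern is closed exactly when it equals its own closure."""
    return closure(pattern, database) == set(pattern)


print(closure({1}, T))      # {1} and {1,7,9} share the same denotation
print(is_closed({7}, T))    # {7} is not closed, its closure is {7,9}
```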

  6. Advantages of Closed Pattern
     Completeness: [Mannila ’96; Pasquier ’99]
        - The set C of frequent closed patterns has the complete information on the set F of all frequent patterns and their frequencies
        - Any maximal association rule can be constructed from the set C
        - The set C is sufficient for building classification rules over itemsets
     Compactness: [Mannila ’96]
        - |Maximal Frequent| ≦ |C| ≦ |F|
        - |C| is possibly exponentially smaller than |F|
     Th. 1 [this paper]: For any n and m, we can construct a database of n items and m transactions such that |C| = O(m^2) while |F| = 2^(Ω(n+m)).
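The compactness inequality can be checked by brute force on the running example database {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}. The sketch below counts maximal frequent, frequent closed, and frequent patterns (the empty pattern is counted as a pattern here, which is an assumption of this sketch); `support` and `count_patterns` are hypothetical names.

```python
from itertools import combinations

# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def support(pattern, database):
    """Number of transactions that include pattern."""
    return sum(1 for t in database if pattern <= t)


def count_patterns(database, theta):
    """Return (|maximal frequent|, |frequent closed|, |frequent|) by brute force."""
    items = sorted(set().union(*database))
    frequent = [
        frozenset(c)
        for k in range(len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), database) >= theta
    ]
    # Closed: adding any single item strictly lowers the support.
    closed = [
        p for p in frequent
        if all(support(p | {e}, database) < support(p, database)
               for e in items if e not in p)
    ]
    # Maximal: adding any single item makes the pattern infrequent.
    maximal = [
        p for p in frequent
        if all(support(p | {e}, database) < theta
               for e in items if e not in p)
    ]
    return len(maximal), len(closed), len(frequent)


print(count_patterns(T, 3))
```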

  7. Problem and Result
     ・ PROBLEM: given a transaction database, find all frequent closed patterns
     ・ Many existing studies, in both theory and practice
     We propose prefix preserving closure extension and an efficient algorithm, LCM (Linear time Closed pattern Miner)
     ・ Theoretical advantage: running time linear in the number of frequent closed patterns, with small memory use
     ・ Practical advantage: faster than the other algorithms on many datasets (almost all datasets of KDD Cup and FIMI’03)

  8. Existing Approach
     ・ Frequent pattern mining based approaches: enumerate frequent patterns, and output the closed patterns among them
     ・ Reduce the computation time by avoiding non-closed patterns. During the enumeration,
        - eliminate unnecessary patterns from memory
        - prune unnecessary branches of the recursion (not complete)

  9. Our Approach
     ・ Existing algorithms
        - possibly operate on many non-closed patterns
        - require much memory for storing obtained patterns
     ・ We propose
        - closure extension based enumeration: operates on closed patterns only (linear time)
        - prefix preserving closure extension: needs no memory for previously obtained patterns (small memory)
        - some algorithms for fast computation: faster than other algorithms

  10. Closure Extension [Pasquier et al. ’99]
      ・ Closure extension: a rule for constructing a closed pattern from another closed pattern: add an item, and take the closure
         (closed pattern + item, then closure → closed pattern)
      ・ Any closed pattern is a closure extension of at least one other closed pattern
      ・ Any closed pattern has strictly smaller size than any of its closure extensions
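The rule "add an item, and take the closure" can be sketched as follows on the running example database {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}. The function name `closure_extensions` is hypothetical, and items that do not co-occur with the pattern are simply skipped, which is a choice of this sketch.

```python
# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure_extensions(P, database):
    """Return the set of closed patterns closure(P ∪ {i}) over all items i not in P."""
    items = set().union(*database)
    result = set()
    for i in sorted(items - P):
        occurrences = [t for t in database if (P | {i}) <= t]
        if occurrences:  # skip items that never occur together with P
            result.add(frozenset(set.intersection(*occurrences)))
    return result


for H in sorted(closure_extensions({7, 9}, T), key=sorted):
    print(sorted(H))
```

Different items can lead to the same closed pattern, and different closed patterns can share an extension, which is why plain closure extension yields a graph rather than a tree.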

  11. Acyclic Relation [essentially Pasquier et al. ’99]
      ・ Closure extension induces an acyclic search graph
      ・ We can compute all closed patterns in linear time by closure extension
      ・ However, we still have to store the obtained closed patterns in memory…

  12. Prefix Preserving Closure Extension [new]
      ・ Prefix preserving closure extension (ppc extension) is a variation of closure extension
      Def. closure tail of a closed pattern P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P
      Def. H = closure(P ∪ {i}) (a closure extension of P) is a ppc extension of P ⇔ i > closure tail of P, and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}
      ・ No duplication occurs in depth-first search
      ・ “Any” closed pattern H is generated from another “unique” closed pattern by ppc extension (i.e., from closure(H ∩ {1,…,i-1}))
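The two definitions translate into short checks. The sketch below assumes items are positive integers and, as a convention introduced here, gives the root closure(φ) the closure tail 0; `closure_tail` and `is_ppc_extension` are hypothetical names. It uses the running example database {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}.

```python
# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure(pattern, database):
    """Intersection of the occurrences of pattern, or None if there are none."""
    occurrences = [t for t in database if pattern <= t]
    return set.intersection(*occurrences) if occurrences else None


def closure_tail(P, database):
    """Minimum j with closure(P ∩ {1..j}) == P; 0 for the root (assumed convention)."""
    if closure(set(), database) == P:
        return 0
    # The prefix P ∩ {1..j} only changes when j is an item of P.
    for j in sorted(P):
        if closure({x for x in P if x <= j}, database) == P:
            return j
    raise ValueError("P is not a closed pattern of the database")


def is_ppc_extension(H, P, i, database):
    """Check both conditions of the definition for H = closure(P ∪ {i})."""
    if i in P or closure(P | {i}, database) != H:
        return False
    same_prefix = {x for x in H if x < i} == {x for x in P if x < i}
    return i > closure_tail(P, database) and same_prefix


print(closure_tail({2, 7, 9}, T))
print(is_ppc_extension({2, 7, 9}, {2}, 7, T))
print(is_ppc_extension({2, 7, 9}, {7, 9}, 2, T))
```

In the example, {2,7,9} is a closure extension of both {2} and {7,9}, but only the extension from {2} with item 7 passes the ppc test.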

  13. Relation of ppc extension [new]
      ・ Any closed pattern is a ppc extension of a unique closed pattern, so ppc extension forms a tree
      ・ We can perform depth-first search by ppc extension without storing closed patterns in memory

  14. Example
      T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
      ・ closure extension: acyclic graph
      ・ ppc extension: tree
      [Figure: the closed patterns of T, namely φ, {2}, {7,9}, {1,7,9}, {2,5}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}, connected by closure extension and ppc extension edges]

  15. Fast Computation
      To generate a ppc extension for a closed pattern P and an item i, we
         1. compute the denotation of P ∪ {i}
         2. compute the closure of P ∪ {i}
         3. compare the prefix
      We propose efficient algorithms for these tasks
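The three steps can be combined into a depth-first enumeration. The following is a simplified sketch, not the authors' implementation: it recomputes denotations and closures naively, so it makes no claim to the linear-time bound, and the name `lcm_sketch` is hypothetical. It assumes integer items, `min_support` of at least 1, and a root closure(φ) with closure tail 0.

```python
# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def lcm_sketch(database, min_support=1):
    """Enumerate frequent closed patterns by depth-first ppc extension."""
    items = sorted(set().union(*database))
    found = []

    def expand(P, tail, occ_P):
        found.append(frozenset(P))
        for i in items:
            if i <= tail or i in P:
                continue
            # Step 1: denotation of P ∪ {i}, taken from the occurrences of P.
            occ = [t for t in occ_P if i in t]
            if len(occ) < min_support:
                continue
            # Step 2: closure of P ∪ {i} is the intersection of its occurrences.
            H = set.intersection(*occ)
            # Step 3: keep H only if the prefix below i is unchanged.
            if {x for x in H if x < i} == {x for x in P if x < i}:
                expand(H, i, occ)

    if database and len(database) >= min_support:
        expand(set.intersection(*database), 0, list(database))
    return found


for pattern in lcm_sketch(T):
    print(sorted(pattern))
```

No set of visited patterns is kept: the prefix test alone decides whether a closure extension is followed, which is the point of the ppc extension.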

  16. Occurrence Deliver [new]
      ・ Compute the denotations of P ∪ {i} for all i’s at once, by transposing the trimmed database
      ・ The trimmed database is composed of
         - items to be added
         - transactions including P
      ・ Linear time in the size of the trimmed database
      ・ Efficient for sparse datasets
      [Figure: for pattern {1,2} with denotation {A, B, C}, the trimmed database over items 3, 4, 5 is transposed to give the denotations of {1,2,3}, {1,2,4}, and {1,2,5}]
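The transposition can be sketched as a single scan that "delivers" each transaction identifier to the bucket of every candidate item it contains. The three transactions A, B, C below are hypothetical stand-ins (the contents of the slide's figure are not recoverable from the transcript), and `occurrence_deliver` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical occurrences of the pattern {1,2}, labelled A, B, C.
occurrences = [
    ("A", {1, 2, 3, 5}),
    ("B", {1, 2, 3, 4}),
    ("C", {1, 2, 4, 5}),
]


def occurrence_deliver(occ_P, P, tail):
    """Map each candidate item i to the denotation of P ∪ {i} in one scan.

    occ_P is a list of (transaction id, itemset) pairs for the occurrences
    of P. Only items larger than tail and outside P are considered, which
    corresponds to scanning the trimmed database.
    """
    buckets = defaultdict(list)
    for tid, transaction in occ_P:
        for item in transaction:
            if item > tail and item not in P:
                buckets[item].append(tid)
    return dict(buckets)


for item, tids in sorted(occurrence_deliver(occurrences, {1, 2}, 2).items()):
    print(item, tids)
```

Every (transaction, item) pair of the trimmed database is touched once, so the work is proportional to its size.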

  17. Anytime Database Reduction [new]
      ・ Reduce the database, as in [fp-growth, etc.]
         ◆ Remove item e if e is included in fewer than θ transactions or is included in all transactions
         ◆ Merge identical transactions into one
      ・ Apply trimming and this reduction recursively, so that the database becomes small in the lower levels of the recursion
      ・ For taking the closure, keep the intersection of the merged transactions (the closure operation is to take the intersection of transactions)
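The two reduction rules can be sketched as below, using the running example database {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}. This is a simplified illustration with a hypothetical function name, `reduce_database`: merged transactions are represented by a multiplicity count, and the bookkeeping of intersections needed for the closure and prefix check is left out.

```python
from collections import Counter

# Running example database.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def reduce_database(database, min_support):
    """Drop useless items, then merge identical transactions.

    Returns a dict mapping each reduced transaction to its multiplicity.
    """
    n = len(database)
    counts = Counter(item for t in database for item in t)
    # Keep items that are frequent but not contained in every transaction.
    keep = {e for e, c in counts.items() if min_support <= c < n}
    merged = Counter(frozenset(t & keep) for t in database)
    return dict(merged)


for transaction, weight in reduce_database(T, 3).items():
    print(sorted(transaction), weight)
```

With θ = 3 the six transactions collapse to four distinct ones over the items 1, 2, 7, 9, which illustrates why the database shrinks quickly deeper in the recursion.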

  18. Experiments
      ・ Computational environment
         - CPU, memory: AMD Athlon XP 1600+, 224MB
         - OS, programming language, compiler: Linux, C, gcc
      ・ Algorithms compared with: FP-growth, afopt, MAFIA, PATRICIAMINE, kDCI (all of these marked high scores at the FIMI03 competition)
      ・ Datasets: 13 real-world, machine learning, and artificial datasets used in FIMI03 and KDD Cup, with specified supports
      Result
      ・ LCM won on 12 datasets for every support (the exception is the Accidents dataset at middle supports)
      ・ LCM outperforms the others especially at smaller supports

  19. Results

  20. Conclusion
      ・ Closed patterns: representatives of frequent patterns [Pasquier et al. ’00]
         - much fewer than frequent patterns (possibly exponentially)
         - useful in compact representation and rule induction
      ・ We proposed an algorithm, LCM, for mining closed patterns in databases
         - prefix preserving closure extension for a tree-shaped search space
         - time complexity is linear in the number of closed patterns, with a small memory footprint
         - practical speed-up: occurrence deliver and anytime database reduction
      ・ Experiments show that LCM outperforms other algorithms in most instances of the KDD Cup and FIMI datasets, especially with small supports
      Future work: closed patterns for sequences, trees, and other structures
      LCM has been submitted to the FIMI04 competition; please look forward to it!

  21. List of Datasets
      Machine learning benchmarks
         ・ Chess ・ Mushroom ・ Pumsb ・ Pumsb* ・ Connect
      Artificial datasets
         ・ T10I4D100K ・ T40I10D100K
      Real datasets
         ・ BMS-WebView-1 ・ BMS-WebView-2 ・ BMS-POS ・ Retail ・ Kosarak ・ Accidents
