
Sequential PAttern Mining using A Bitmap Representation

Published in: KDD ’02, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu, Dept. of Computer Science, Cornell University.


Presentation Transcript


  1. Sequential PAttern Mining using A Bitmap Representation • Published in: KDD ’02, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining • Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu, Dept. of Computer Science, Cornell University • Speakers: 江奇偉, 林詮量, 呂謙, 吳佑瑋

  2. Outline • Motivation • Introduction • Terms • The SPAM algorithm • Data representation • Experimental evaluation • Q&A and discussion

  3. Motivation • This paper presents an algorithm that quickly finds all frequent sequences in a list of transactions, performing sequential pattern mining efficiently.

  4. Introduction • A depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. • The implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting.

  5. Terms • Database D is a set of tuples • (customer-id, transaction-id, itemset) • Each tuple in D is referred to as a transaction

  6. Terms • Sequence representation: one sequence per customer

  7. Terms • For a given customer-id, there are no two transactions with the same transaction-id • A customer's data is a sequence of itemsets ordered by increasing tid

  8. Terms • Itemset: a non-empty set of items, e.g. {a, b, c} • Sequence: an ordered list of itemsets • Ex: ({a, b}, {c}) means the customer bought a and b together, and bought c in a later transaction

  9. Terms • Size of a sequence: the number of itemsets in the sequence • Length of a sequence: the total number of items in the sequence • Ex: ({a, b}, {c}) has size 2 and length 3

  10. Terms • Subsequence: sa = (a1, …, an) is a subsequence of sb = (b1, …, bm) if there exist integers 1 ≤ i1 < i2 < … < in ≤ m such that a1 ⊆ bi1, …, an ⊆ bin • Supersequence: sb is then a supersequence of sa • Ex: ({a}, {c}) is a subsequence of ({a, b}, {c}, {d}); ({a, c}) is not

  11. Terms • Support of a sequence sa in database D: the percentage of customer sequences in D that contain sa (i.e., of which sa is a subsequence) • Ex: a sequence contained in 2 of 3 customer sequences has Support = 66.7%

  12. Terms • Given a support threshold minSup • A sequence sa is called a frequent sequential pattern on D if supD(sa) ≥ minSup • The problem of mining sequential patterns is to find all frequent sequential patterns for a database D and a threshold minSup
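
To make these definitions concrete, here is a minimal sketch in C++ (the paper's stated implementation language); the type names, helper functions, and the toy three-customer database are illustrative assumptions, not the paper's code.

```cpp
#include <cstdio>
#include <vector>

// A sequence is an ordered list of itemsets; items are small integer ids.
using Itemset  = std::vector<int>;
using Sequence = std::vector<Itemset>;

// True if every item of a occurs in b (both assumed sorted ascending).
static bool contains(const Itemset& a, const Itemset& b) {
    size_t j = 0;
    for (int x : a) {
        while (j < b.size() && b[j] < x) ++j;
        if (j == b.size() || b[j] != x) return false;
    }
    return true;
}

// True if sa is a subsequence of sb: each itemset of sa is contained in a
// distinct, strictly later itemset of sb, in order.
static bool isSubsequence(const Sequence& sa, const Sequence& sb) {
    size_t j = 0;
    for (const Itemset& a : sa) {
        while (j < sb.size() && !contains(a, sb[j])) ++j;
        if (j == sb.size()) return false;
        ++j;
    }
    return true;
}

// Support of sa = fraction of customer sequences in db that contain sa.
static double support(const Sequence& sa, const std::vector<Sequence>& db) {
    int count = 0;
    for (const Sequence& s : db)
        if (isSubsequence(sa, s)) ++count;
    return db.empty() ? 0.0 : static_cast<double>(count) / db.size();
}

int main() {
    std::vector<Sequence> db = {
        {{1, 2}, {3}},        // customer 0: ({a,b},{c})
        {{1}, {3}, {4}},      // customer 1: ({a},{c},{d})
        {{2, 4}},             // customer 2: ({b,d})
    };
    Sequence sa = {{1}, {3}};                                    // ({a},{c})
    std::printf("support = %.1f%%\n", support(sa, db) * 100.0);  // 66.7%
    return 0;
}
```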

  13. The SPAM Algorithm • Lexicographic Tree for Sequences • Depth-First Tree Traversal • Pruning

  14. The SPAM Algorithm • Lexicographic Tree for Sequences • Assume a lexicographical ordering ≤ of the items in the database • If item i occurs before item j in the ordering, we denote this by i ≤ j • E.g. a ≤ b ≤ c

  15. The SPAM Algorithm • Lexicographic Tree for Sequences • This ordering can be extended to sequences by defining sa ≤ sb if sa is a subsequence of sb • If sa is not a subsequence of sb, then there is no relationship between them in this ordering

  16. The SPAM Algorithm • Lexicographic Tree for Sequences • Root: the null sequence ∅ • If n is a node in the tree, then its children are all nodes n′ such that n ≤ n′ and every sequence in the tree that is a supersequence of n′ is also a supersequence of n

  17. The SPAM Algorithm • Lexicographic Tree for Sequences • Each sequence in the sequence tree is either a sequence-extended sequence or an itemset-extended sequence • Sequence-extended sequence: generated by appending a new transaction consisting of a single item to the end of its parent • Itemset-extended sequence: generated by adding an item to the last itemset of its parent, the item being greater than every item already in that itemset

  18. The SPAM Algorithm • Lexicographic Tree for Sequences • Sequence-extension step (S-step): the process of generating sequence-extended sequences • Itemset-extension step (I-step): the process of generating itemset-extended sequences (a sketch of both steps follows)
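
A small illustrative sketch of the two extension steps; the Sequence type and function names are assumptions for exposition, not the paper's implementation.

```cpp
#include <cstdio>
#include <vector>

using Itemset  = std::vector<int>;
using Sequence = std::vector<Itemset>;

// S-step: append a new one-item transaction after the last itemset.
static Sequence sStep(Sequence s, int item) {
    s.push_back({item});
    return s;
}

// I-step: add an item (greater than all items already present) to the
// last itemset.
static Sequence iStep(Sequence s, int item) {
    s.back().push_back(item);
    return s;
}

static void print(const Sequence& s) {
    for (const Itemset& t : s) {
        std::printf("{");
        for (int x : t) std::printf(" %d", x);
        std::printf(" }");
    }
    std::printf("\n");
}

int main() {
    Sequence s = {{1}, {2}};     // ({a},{b})
    print(sStep(s, 3));          // ({a},{b},{c})  sequence-extended
    print(iStep(s, 3));          // ({a},{b,c})    itemset-extended
    return 0;
}
```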

  19. The SPAM Algorithm

  20. The SPAM Algorithm • Depth-First Tree Traversal • If the support of a generated sequence s is greater than or equal to minSup, we store that sequence and repeat DFS recursively on s • If the support of s is less than minSup, we do not need to repeat DFS on s: by the Apriori principle, any child sequence generated from s cannot be frequent (a sketch of the traversal follows)
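
A simplified sketch of this depth-first traversal with Apriori-based stopping. In SPAM the support queries are answered with bitmap operations, and the candidate lists Sn and In are additionally pruned at each node (next slides); here the support function is a stub and all names are illustrative assumptions.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

using Itemset  = std::vector<int>;
using Sequence = std::vector<Itemset>;

// Support oracle: in SPAM this query is answered with bitmap ANDs; here it
// is a placeholder so the traversal itself stays readable.
using SupportFn = std::function<double(const Sequence&)>;

// Depth-first traversal of the sequence tree. Sn / In are the candidate
// items for sequence- and itemset-extensions of the current node s.
static void dfs(const Sequence& s,
                const std::vector<int>& Sn, const std::vector<int>& In,
                double minSup, const SupportFn& sup,
                std::vector<Sequence>& frequent) {
    for (int i : Sn) {                                 // S-step children
        Sequence child = s;
        child.push_back({i});
        if (sup(child) < minSup) continue;             // Apriori: stop branch
        frequent.push_back(child);
        std::vector<int> childIn;                      // I-step candidates > i
        for (int j : Sn) if (j > i) childIn.push_back(j);
        dfs(child, Sn, childIn, minSup, sup, frequent);
    }
    for (int i : In) {                                 // I-step children
        Sequence child = s;
        child.back().push_back(i);
        if (sup(child) < minSup) continue;
        frequent.push_back(child);
        std::vector<int> childIn;
        for (int j : In) if (j > i) childIn.push_back(j);
        dfs(child, Sn, childIn, minSup, sup, frequent);
    }
}

int main() {
    // Toy oracle: pretend every candidate with at most 2 items is frequent.
    SupportFn sup = [](const Sequence& s) {
        size_t items = 0;
        for (const Itemset& t : s) items += t.size();
        return items <= 2 ? 1.0 : 0.0;
    };
    std::vector<Sequence> frequent;
    dfs({}, /*Sn=*/{1, 2, 3}, /*In=*/{}, /*minSup=*/0.5, sup, frequent);
    std::printf("%zu frequent sequences found\n", frequent.size());
    return 0;
}
```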

  21. The SPAM Algorithm • Pruning • S-step pruning: if S-extending the current node with an item j yields an infrequent sequence, then by the Apriori principle j can be removed from the S-step and I-step candidate lists passed to that node's children

  22. The SPAM Algorithm • Pruning • I-step pruning: analogously, if I-extending the current node with an item j yields an infrequent sequence, then j can be removed from the I-step candidate lists passed to that node's children

  23. Data Structure • Vertical bitmap representation: one bitmap per item, with one bit for each transaction in the database, partitioned into one section per customer
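
A rough sketch of the vertical bitmap idea, assuming a fixed 8-bit section per customer for simplicity (SPAM chooses 4-, 8-, 16-, 32-, or 64-bit sections per customer); the struct and field names are illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Vertical bitmap: for each item, one bit per (customer, transaction) slot.
// Here every customer gets a fixed 8-bit section; bit k = k-th transaction.
struct VerticalDB {
    int numCustomers = 0;
    std::map<int, std::vector<uint8_t>> bitmaps;  // item -> one byte per customer

    void add(int customer, int tid, int item) {   // tid in [0, 8)
        auto& bm = bitmaps[item];
        if ((int)bm.size() <= customer) bm.resize(customer + 1, 0);
        bm[customer] |= uint8_t(1u << tid);
        if (customer + 1 > numCustomers) numCustomers = customer + 1;
    }

    // Support of a single item = fraction of customers with a non-zero section.
    double support(int item) const {
        auto it = bitmaps.find(item);
        if (it == bitmaps.end() || numCustomers == 0) return 0.0;
        int count = 0;
        for (uint8_t section : it->second) if (section) ++count;
        return double(count) / numCustomers;
    }
};

int main() {
    VerticalDB db;
    db.add(0, 0, 1); db.add(0, 0, 2); db.add(0, 1, 3);   // customer 0
    db.add(1, 0, 1); db.add(1, 1, 3); db.add(1, 2, 4);   // customer 1
    db.add(2, 0, 2); db.add(2, 0, 4);                    // customer 2
    std::printf("support(item 1) = %.1f%%\n", db.support(1) * 100);  // 66.7%
    return 0;
}
```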

  24. S-step process • Example: consider the bitmap of a sequence being sequence-extended with an item • Purpose: to find which items a customer buys in a transaction later than the ones containing the prefix sequence (a per-customer sketch follows)
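
A minimal sketch of the S-step on a single customer's 8-bit section, under the same illustrative layout as above (bit k = transaction k): the prefix's section is transformed so that only the bits strictly after its first set bit remain, and the result is ANDed with the appended item's section.

```cpp
#include <cstdint>
#include <cstdio>

// S-step transform: keep only the transactions strictly after the first
// transaction that contains the prefix sequence.
static uint8_t sStepTransform(uint8_t prefixSection) {
    if (prefixSection == 0) return 0;           // prefix never occurs
    unsigned v = prefixSection;
    unsigned lowest = v & (0u - v);             // isolate the first set bit
    return uint8_t(~((lowest << 1) - 1u));      // all bits above it
}

int main() {
    // Prefix ({a}) appears in transactions 0 and 2; item b in transactions 1, 2.
    uint8_t a = 0b00000101;
    uint8_t b = 0b00000110;
    uint8_t result = sStepTransform(a) & b;     // does ({a},{b}) occur?
    std::printf("result = 0x%02X (%s)\n", (unsigned)result,
                result ? "customer supports ({a},{b})" : "not supported");
    return 0;
}
```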

  25. I-step process • Example: consider the bitmap of a sequence being itemset-extended with an item • Purpose: to find which items are bought in the same transaction as the last itemset of the prefix sequence (see the sketch below)
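
Under the same illustrative layout, the I-step is simply a bitwise AND, since the prefix's last itemset and the added item must occur in the same transaction.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // One customer's 8-bit section, bit k = transaction k.
    uint8_t ab = 0b00000101;   // ({a,b}) occurs in transactions 0 and 2
    uint8_t c  = 0b00000100;   // item c occurs in transaction 2
    // I-step: ({a,b}) itemset-extended to ({a,b,c}) is a plain bitwise AND,
    // because all three items must occur in the same transaction.
    uint8_t abc = ab & c;
    std::printf("({a,b,c}) section = 0x%02X -> %s\n", (unsigned)abc,
                abc ? "supported by this customer" : "not supported");
    return 0;
}
```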

  26. Experimental Evaluation • Experimental results on the performance of SPAM in comparison with PrefixSpan [9] and SPADE [12] • [9] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In ICDE 2001, pages 215–226, Heidelberg, Germany, Apr. 2001. • [12] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31–60, 2001.

  27. Environment • All the experiments were performed on a 1.7 GHz Intel Pentium 4 PC with 1 GB of main memory, running Microsoft Windows 2000 • All three algorithms were written in C++

  28. Comparison on Minimum Support Values • On small, medium, and large datasets for various minimum support values

  29. Comparison on Synthetic Data Generation • Numerous synthetic datasets were generated using the IBM AssocGen program, varying several factors: • D: Number of customers in the dataset • C: Average number of transactions per customer • T: Average number of items per transaction • S: Average length of maximal sequences • I: Average length of transactions within the maximal sequences

  30. Small Datasets (1/2) • [Runtime comparison chart: SPAM vs. SPADE vs. PrefixSpan]

  31. Small Datasets (2/2) • The counting process is critical because it is performed many times at each recursive step • SPAM handles it extremely efficiently using its bitmap representation • For small datasets, however, the initial overhead needed to set up and use the bitmap representation can outweigh the benefits of faster counting • PrefixSpan therefore runs slightly faster on small datasets

  32. Medium-Sized Datasets • [Runtime comparison chart: SPAM vs. SPADE vs. PrefixSpan]

  33. Large Datasets • [Runtime comparison chart: SPAM vs. SPADE vs. PrefixSpan]

  34. Comparison on Several Parameters (1/2) • Three of the five parameters increase the size of the dataset • Average number of items per transaction • Average number of transactions per customer • Average length of maximal sequences

  35. Comparison on Several Parameters (2/2) • Two of the five parameters increase the discrepancy between the running times • Average length of transactions within the maximal sequences • Number of customers in the dataset

  36. Consideration of Space Requirements • SPAM uses a depth-first traversal of the search space and keeps all of its bitmaps in memory • This makes it quite space-inefficient in comparison to SPADE

  37. Parameter Definition • SPAM • D: Number of customers in the database • C: Average number of transactions per customer • N: Total number of items across all of the transactions • SPADE • D: Number of customers in the database • C: Average number of transactions per customer • T: Average number of items per transaction

  38. Space Requirement • Total number of transactions: D · C • SPAM stores one bit per item per transaction, so it requires roughly (D · C · N) / 8 bytes for its bitmaps • SPADE stores a (customer-id, transaction-id) pair for every item occurrence, roughly 2 · 4 · D · C · T bytes • Hence SPAM is less space-efficient than SPADE roughly whenever N / 8 > 8 · T, i.e. N > 64 · T
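
A rough worked example of these estimates; the parameter values are assumed for illustration, and the constants follow the reconstruction above rather than exact figures from the paper.

```cpp
#include <cstdio>

int main() {
    // Illustrative sizes only (assumed values, not from the paper).
    double D = 10000, C = 10, N = 1000, T = 5;
    double spamBytes  = D * C * N / 8.0;       // one bit per item per transaction
    double spadeBytes = 2 * 4 * D * C * T;     // two 4-byte ids per item occurrence
    std::printf("SPAM : %.1f MB\n", spamBytes / 1e6);   // 12.5 MB
    std::printf("SPADE: %.1f MB\n", spadeBytes / 1e6);  //  4.0 MB
    return 0;
}
```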

  39. Pros & Cons • Pros: • Efficient when the sequential patterns in the database are very long • The bitmap representation, the depth-first traversal, and the pruning steps all contribute to an excellent runtime • A salient feature of the algorithm is that it incrementally outputs new frequent itemsets in an online fashion • Cons: • SPAM assumes that the entire database (and all data structures used by the algorithm) completely fits into main memory

  40. Discussion • To allow for efficient counting of support, the bitmap section for each customer is sized to the smallest power of two that can hold that customer's transactions (a 4-, 8-, 16-, 32-, or 64-bit section); a sizing sketch follows
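
A tiny sketch of such a sizing rule; the exact rule, and the handling of customers with more than 64 transactions, is an assumption here.

```cpp
#include <cstdio>

// Smallest section size in {4, 8, 16, 32, 64} bits that can hold one
// customer's transactions (assumed sizing rule; customers with more than
// 64 transactions would need additional handling).
static int sectionBits(int numTransactions) {
    for (int bits = 4; bits <= 64; bits *= 2)
        if (numTransactions <= bits) return bits;
    return 64;
}

int main() {
    int tests[] = {3, 7, 9, 33};
    for (int n : tests)
        std::printf("%d transactions -> %d-bit section\n", n, sectionBits(n));
    return 0;
}
```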

  41. Discussion

  42. Question • In Figure 15, why is the running time curve not monotonically increasing when the parameter “average length of transactions within the maximal sequences” is increased?

  43. Thanks for your attention
