
Incremental and Interactive Sequence Mining

Incremental and Interactive Sequence Mining. Group members: 林弘緯、李袁碩、鄭智陽、黃大洋. Outline: Introduction, Problem definition, SPADE, ISM algorithm, Interactive Sequence Mining, Performance, Conclusion, Discussion.

Presentation Transcript


  1. Incremental and Interactive Sequence Mining Group members: 林弘緯、李袁碩、鄭智陽、黃大洋

  2. Outline • Introduction • Problem definition • SPADE • ISM algorithm • Interactive Sequence Mining • Performance • Conclusion • Discussion

  3. Introduction • Sequence mining is an important data mining task, where one attempts to discover frequent sequences over time, of attribute sets in large databases. • This problem was originally motivated by applications in the retail industry, including attached mailing, add-on sales and customer satisfaction.

  4. Introduction • Discovering all frequent sequences in a very large database can be very compute- and I/O-intensive because the size of the search space is essentially exponential in the length of the longest transaction sequence in the database. • This high computational cost may be acceptable when the database is static, since the discovery is done only once, and several approaches to this problem have been presented in the literature.

  5. Introduction • Most current work assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database.

  6. Introduction • However, many domains such as electronic commerce, stock analysis, collaborative surgery, etc., impose soft real-time constraints on the mining process. • Hence, there is a need for algorithms that maintain valid mined information across • database updates • user interactions (modifying/constraining the search space).

  7. Problem definition • Sequence: a sequence is an ordered list (ordered in time) of non-empty itemsets, e.g. A->B, A->B->C. • Our database is a collection of customers, each with a sequence of transactions, each of which is an itemset.
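
The definitions on this slide can be made concrete with a small sketch. It assumes that a pattern is "contained" in a customer's sequence when each of its itemsets is a subset of a distinct transaction, in the same time order, and that support is counted per customer. The names `contains`, `support`, `seq` and the toy `database` are illustrative and do not come from the slides.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]
Sequence = List[Itemset]


def contains(customer_seq: Sequence, pattern: Sequence) -> bool:
    """Return True if `pattern` occurs in `customer_seq` in time order.

    Each pattern itemset must be a subset of a distinct transaction, and the
    matched transactions must appear in the same order as the pattern.
    """
    pos = 0
    for itemset in pattern:
        # Greedily skip transactions that do not cover this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database: Dict[int, Sequence], pattern: Sequence) -> int:
    """Number of customers whose transaction sequence contains `pattern`."""
    return sum(1 for s in database.values() if contains(s, pattern))


def seq(*itemsets: str) -> Sequence:
    """Helper: seq("AB", "C") is the sequence AB->C."""
    return [frozenset(s) for s in itemsets]


# Toy database (hypothetical): customer id -> time-ordered transactions.
database = {
    1: seq("A", "B", "C"),
    2: seq("AB", "C"),
    3: seq("B", "A"),
}
```

With this toy data, `support(database, seq("A", "C"))` counts customers 1 and 2, while `seq("A", "B")` is matched only by customer 1 because customer 2 buys A and B in the same transaction.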

  8. Problem definition Incremental Sequence Discovery Problem: Given an original database D of sequences and a new increment to the database δ, find all frequent sequences in the database D + δ with the minimum possible recomputation and I/O.

  9. The SPADE Algorithm An algorithm for fast discovery of frequent sequences, which forms the basis for our incremental algorithm.

  10. The SPADE Algorithm Support Counting:
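
The figure for this slide did not survive in the transcript. As a rough sketch of how SPADE-style support counting works, the code below assumes the vertical id-list layout, where each item or sequence keeps a list of (customer id, event id) pairs and longer sequences are obtained by joining id-lists. The function names and the example id-lists are illustrative assumptions, not taken from the slides.

```python
from typing import Dict, List, Tuple

IdList = List[Tuple[int, int]]  # (customer id, event id / timestamp)


def temporal_join(first: IdList, second: IdList) -> IdList:
    """Id-list of the sequence first->second.

    Keeps the occurrences of `second` that come strictly after some
    occurrence of `first` for the same customer.
    """
    earliest: Dict[int, int] = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted(
        {(sid, eid) for sid, eid in second
         if sid in earliest and earliest[sid] < eid}
    )


def equality_join(first: IdList, second: IdList) -> IdList:
    """Id-list of the itemset {first, second}: same customer, same event."""
    return sorted(set(first) & set(second))


def support(idlist: IdList) -> int:
    """Support = number of distinct customers in the id-list."""
    return len({sid for sid, _ in idlist})


# Hypothetical id-lists for two items.
id_a: IdList = [(1, 10), (1, 30), (2, 20), (3, 10)]
id_b: IdList = [(1, 20), (2, 20), (3, 5)]

a_then_b = temporal_join(id_a, id_b)   # A->B
b_then_a = temporal_join(id_b, id_a)   # B->A
a_with_b = equality_join(id_a, id_b)   # AB
```

The point of the vertical layout is that the support of a candidate is read off the joined id-list, without rescanning the customer transactions.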

  11. The SPADE Algorithm

  12. ISM Algorithm • ISM: Incremental Sequence Mining • Uses the ISL (Increment Sequence Lattice) created by SPADE • Definitions: • NB: negative border • FS: frequent sequence set • Composed of two phases • Phase 1: updating the supports of elements in NB and FS • Phase 2: adding to NB and FS beyond what was done in Phase 1

  13. ISM Algorithm - Phase 1 • support_{D+δ}(X) = support_D(X) + support_{D'}(X) - support_{D''}(X) • Symbol definitions • D: original data • D': original data + new data • D'': duplicate part • δ: new data

  14. ISM Algorithm - Phase 1 • For each single-item sequence p, we compute support_D(p), support_{D'}(p) and support_{D''}(p) and put p into a queue Q, which is empty at the beginning of the computation • Then we repeat the following until Q is empty: • Dequeue one element p from Q • Update the support of p • If p (of length k) is in the frequent set, all length k + 1 sequences that are already in the ISL and that are generating ascendants of p are queued into Q • If p is in the negative border and its updated support makes it frequent, p is placed in NB-to-FS[k]
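
The queue-driven loop described on this slide can be sketched as follows. This is a minimal illustration under several assumptions: the ISL is modelled as a plain dictionary, each node lists the longer sequences it generates under a hypothetical `children` key, and the change in support caused by the increment is passed in precomputed as `delta_support` rather than derived from the databases. None of these names come from the slides.

```python
from collections import defaultdict, deque
from typing import Dict, List


def phase1(isl: Dict[str, dict], delta_support: Dict[str, int],
           min_support: int) -> Dict[int, List[str]]:
    """Update supports in the lattice and collect NB-to-FS by length.

    isl: name -> {"support", "length", "status" ("FS" or "NB"), "children"}
    delta_support: name -> support change due to the increment (assumed input)
    """
    # Start with every single-item sequence in the queue.
    queue = deque(n for n, node in isl.items() if node["length"] == 1)
    queued = set(queue)
    nb_to_fs: Dict[int, List[str]] = defaultdict(list)

    while queue:
        p = queue.popleft()
        node = isl[p]
        node["support"] += delta_support.get(p, 0)
        if node["status"] == "FS":
            # Frequent: its length k+1 sequences already in the lattice
            # must have their supports refreshed as well.
            for child in node["children"]:
                if child not in queued:
                    queued.add(child)
                    queue.append(child)
        elif node["support"] >= min_support:
            # Was in the negative border, now frequent.
            nb_to_fs[node["length"]].append(p)
    return dict(nb_to_fs)


# Hypothetical lattice loosely modelled on the example slides (min support 3).
isl = {
    "A": {"support": 3, "length": 1, "status": "FS", "children": ["A->B", "AB"]},
    "B": {"support": 3, "length": 1, "status": "FS", "children": ["B->A"]},
    "C": {"support": 1, "length": 1, "status": "NB", "children": []},
    "A->B": {"support": 3, "length": 2, "status": "FS", "children": []},
    "AB": {"support": 2, "length": 2, "status": "NB", "children": []},
    "B->A": {"support": 2, "length": 2, "status": "NB", "children": []},
}
delta = {"A": 1, "B": 1, "C": 3, "A->B": 1, "AB": 1}
moved = phase1(isl, delta, min_support=3)
```

In this toy run C and AB cross the threshold and land in NB-to-FS[1] and NB-to-FS[2] respectively, while B->A stays in the negative border.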

  15. ISM Algorithm - Phase 1 Example: Min_Support: 3

  16. ISM Algorithm - Phase 1 Each single-item sequence (A, B, C) is first put into Q. Q: C B A

  17. ISM Algorithm - Phase 1 Q: C B; dequeued A: new support = 4. Q afterwards: ……. B->A B->B A->B AB C B

  18. ISM Algorithm - Phase 1 Q: ……. A->B AB; dequeued C: new support = 4, original support = 1. NB-to-FS: C

  19. ISM Algorithm - Phase 1

  20. ISM Algorithm - Phase 1

  21. ISM Algorithm - Phase 2 • Each 1-sequence that has moved (from NB to FS) is intersected with all other possible frequent 1-sequences • The resulting frequent 2-sequences are added to NB-to-FS[2] • Those that are not frequent are placed in NB_{D+δ}
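
A possible reading of this step is sketched below. It assumes that "intersecting" two 1-sequences means joining their vertical id-lists (lists of (customer id, event id) pairs) to form the candidate 2-sequences X->Y, Y->X and the itemset XY. The function names, the naming of candidates, and the example id-lists are hypothetical.

```python
from typing import Dict, List, Tuple

IdList = List[Tuple[int, int]]  # (customer id, event id)


def temporal_join(first: IdList, second: IdList) -> IdList:
    """Occurrences of `second` strictly after `first` for the same customer."""
    earliest: Dict[int, int] = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted({(s, e) for s, e in second if s in earliest and earliest[s] < e})


def equality_join(first: IdList, second: IdList) -> IdList:
    """Occurrences where both items appear in the same transaction."""
    return sorted(set(first) & set(second))


def support(idlist: IdList) -> int:
    return len({sid for sid, _ in idlist})


def phase2_level1(moved: List[str], frequent: List[str],
                  idlists: Dict[str, IdList], min_support: int):
    """Join each moved 1-sequence with every frequent 1-sequence.

    Returns (NB-to-FS[2], new negative-border entries) as lists of names.
    """
    nb_to_fs2: List[str] = []
    new_nb: List[str] = []
    seen = set()
    partners = sorted(set(frequent) | set(moved))
    for x in moved:
        for y in partners:
            candidates = {f"{x}->{y}": temporal_join(idlists[x], idlists[y])}
            if x != y:
                candidates[f"{y}->{x}"] = temporal_join(idlists[y], idlists[x])
                candidates["".join(sorted((x, y)))] = equality_join(
                    idlists[x], idlists[y])
            for name, idlist in candidates.items():
                if name in seen:
                    continue
                seen.add(name)
                target = nb_to_fs2 if support(idlist) >= min_support else new_nb
                target.append(name)
    return nb_to_fs2, new_nb


# Hypothetical id-lists; C has just moved from NB to FS.
idlists = {
    "A": [(1, 10), (2, 10), (3, 10)],
    "B": [(1, 20), (2, 20), (3, 5)],
    "C": [(1, 30), (2, 30), (3, 10)],
}
promoted, border = phase2_level1(["C"], ["A", "B"], idlists, min_support=2)
```

Here only A->C and B->C reach the threshold; the remaining candidates involving C go to the new negative border so that later increments can update them cheaply.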

  22. ISM Algorithm - Phase 2

  23. ISM Algorithm - Phase 2

  24. ISM Algorithm - Phase 2

  25. ISM Algorithm - Phase 2 • For length-two sequences, pick an element that has not been processed and create the frequent sets • Pass the frequent sets to the SPADE algorithm • Repeat from step 4 until NB-to-FS is empty

  26. Memory Management • Uses a two-level disk-file indexing scheme with a hashing mechanism • The vertical database is partitioned into a number of blocks [Figure: given a customer and an item, which block (e.g. Block1) holds the data?]

  27. Memory Management [Figure: given a customer, a hash function on the customer selects among blocks (Block5, Block17, ..., Block56)]

  28. Memory Management [Figure: given a customer and an item, a hash function on the customer followed by a hash function on the item selects among blocks (Block5, Block17, ..., Block56)]

  29. Memory Management [Figure: same diagram as slide 28]
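
The slides give only the outline of the indexing scheme, so the following is a loose in-memory sketch of the idea: one hash on the customer and one on the item together pick the block that holds the matching id-list entries. The class `TwoLevelIndex`, the bucket counts, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration; the actual disk-file layout is not described in the transcript.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class TwoLevelIndex:
    """Toy two-level index: (customer bucket, item bucket) -> block."""

    def __init__(self, customer_buckets: int, item_buckets: int) -> None:
        self.customer_buckets = customer_buckets
        self.item_buckets = item_buckets
        # Each block stores (customer, item, event id) records.
        self.blocks: DefaultDict[Tuple[int, int], List[tuple]] = defaultdict(list)

    @staticmethod
    def _bucket(key, n: int) -> int:
        # crc32 is used only because it is deterministic across runs.
        return zlib.crc32(str(key).encode("utf-8")) % n

    def block_id(self, customer, item) -> Tuple[int, int]:
        """First level hashes the customer, second level hashes the item."""
        return (self._bucket(customer, self.customer_buckets),
                self._bucket(item, self.item_buckets))

    def add(self, customer, item, eid: int) -> None:
        self.blocks[self.block_id(customer, item)].append((customer, item, eid))

    def lookup(self, customer, item) -> List[int]:
        """Event ids for one (customer, item) pair, touching a single block."""
        block = self.blocks.get(self.block_id(customer, item), [])
        return sorted(e for c, i, e in block if c == customer and i == item)


index = TwoLevelIndex(customer_buckets=4, item_buckets=8)
index.add(1, "A", 10)
index.add(1, "A", 30)
index.add(2, "B", 20)
```

Different (customer, item) pairs may share a block, which is why `lookup` still filters the records inside the block it lands on.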

  30. Interactive Sequence Mining • Definition: the idea in interactive sequence mining is that an end user be allowed to query the database for association rules at differing values of support and confidence. • The goal is to allow such interaction without excessive I/O or computation.

  31. Six Typical Sets of Queries • Simple Queries • Refined Queries • Quantified Queries • Including Queries • Excluding Queries • Hierarchical Queries

  32. Problem • Given such a lattice, we can answer all but one of the queries described in the previous section (hierarchical queries) at interactive speeds without going back to the original database. • Why: hierarchical queries require the algorithm to treat a set of related items as one super-item.
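
To illustrate why the non-hierarchical queries need no database access, the sketch below treats the stored lattice as a simple mapping from sequence to support and answers each query by filtering it. The query semantics are inferred from the short descriptions in the transcript; the function names, the `seq` helper, and the toy support counts are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Seq = Tuple[FrozenSet[str], ...]
Lattice = Dict[Seq, int]  # sequence -> stored support


def seq(*itemsets: str) -> Seq:
    """Helper: seq("AB", "C") is the sequence AB->C."""
    return tuple(frozenset(s) for s in itemsets)


def items_of(s: Seq) -> FrozenSet[str]:
    return frozenset().union(*s)


def simple_query(lattice: Lattice, min_support: int) -> Lattice:
    """All stored sequences with at least the requested support."""
    return {s: c for s, c in lattice.items() if c >= min_support}


def refined_query(previous: Lattice, higher_support: int) -> Lattice:
    """Tighten the support threshold on an earlier result."""
    return simple_query(previous, higher_support)


def top_k_query(lattice: Lattice, k: int) -> List[Tuple[Seq, int]]:
    """The k sequences with the highest support."""
    return sorted(lattice.items(), key=lambda kv: -kv[1])[:k]


def including_query(lattice: Lattice, item: str) -> Lattice:
    return {s: c for s, c in lattice.items() if item in items_of(s)}


def excluding_query(lattice: Lattice, item: str) -> Lattice:
    return {s: c for s, c in lattice.items() if item not in items_of(s)}


# Hypothetical stored lattice.
lattice: Lattice = {
    seq("A"): 4, seq("B"): 4, seq("C"): 4,
    seq("A", "B"): 3, seq("AB"): 2, seq("A", "C"): 2,
}
first = simple_query(lattice, 3)
second = refined_query(first, 4)
```

A hierarchical query does not fit this pattern: merging related items into one super-item changes which customers support a sequence, so the stored counts cannot simply be filtered or added up.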

  33. Hierarchical queries

  34. Performance: Experimental Evaluation • Database properties: C: the number of customers; T: average transaction size; |D|: total number of transactions

  35. Performance • The first experiment compares the time cost of running the SPADE algorithm on the entire database with that of running just the incremental algorithm, for 4 databases, while fixing the transaction percentage of the increment at 20%.

  36. Result: experiment 1

  37. The second experiment: • The experiment varies the support size (1%) for two databases. • Result:

  38. The third experiment: • The experiment keeps the support, the number of customers, and the transaction percentage constant, and varies the number of transactions per customer (10, 12 and 15). • Result:

  39. Interactive Performance We evaluated simple querying on supports ranging from 0.1%-0.25%, refined querying (support refined to 0.5% for all the datasets), priority querying (querying for the 50 sequences with highest support), including queries (including a random item) and excluding queries (excluding a random item). • Result

  40. Conclusion • In this paper, we propose novel techniques that maintain data structures for mining sequences in the presence of a) database updates, and b) user interaction. • At the cost of maintaining a summary data structure, the interactive approach performs several orders of magnitude faster than any current sequence mining algorithm. • One of the limitations of the incremental approach proposed in this paper is the size of the negative border, and the resulting memory utilization.

  41. Discussion • Incremental update? • Hash collisions? • Memory usage? • …
