Incremental and Interactive Sequence Mining

Incremental and Interactive Sequence Mining 組員：林弘緯、李袁碩、鄭智陽、黃大洋

Outline • Introduction • Problem definition • SPADE • ISM algorithm • Interactive Sequence Mining • Performance • Conclusion • Discuss

Introduction • Sequence mining is an important data mining task, where one attempts to discover frequent sequences over time, of attribute sets in large databases. • This problem was originally motivated by applications in the retail industry, including attached mailing, add-on sales and customer satisfaction.

Introduction • Discovering all frequent sequences in a very large database can be very compute and I/O intensive because the search space size is essentially exponential in the length of the longest transaction sequence in it. • This high computational cost may be acceptable when the database is static since the discovery is done only once, and several approaches to this problem have been presented in the literature.

Introduction • Most current work assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database.

Introduction • However, many domains such as electronic commerce, stock analysis, collaborative surgery, etc., impose soft real-time constraints on the mining process. • Hence, there is a need for algorithms that maintain valid mined information across • database updates • user interactions (modifying/constraining the search space).

Problem definition Sequence: A sequence is an ordered list (ordered in time) of non-empty itemsets. i.e. A->B, A->B->C Our database is a collection of customers, each with a sequence of transactions, each of which is an itemset.

Problem definition Incremental Sequence Discovery Problem: Given an original database D of sequences, and a new increment to the database δ, and all frequent sequences in the database D + δ, with minimum possible recomputation and I/O.

The SPADE Algorithm An algorithm for fast discovery of frequent sequences, which forms the basis for our incremental algorithm.

The SPADE Algorithm Support Counting:

The SPADE Algorithm

ISM Algorithm • ISM • Incremental Sequence Mining • Using the ISL created by SPADE • Definition: • NB: Negative border • FS: Sequence set • Composed by two phase • Phase1 • Updating the supports of elements in NB and FS • Phase2 • Adding to NB and FS beyond what was done in Phase 1

ISM Algorithm - Phase 1 • supportD+l(X) = supportD(X) + support(X)D’ – supportD”(X) • Symbol definition • D : Original data • D’ : Original data + New data • D” : Duplicate part • l : New data

ISM Algorithm - Phase 1 • For each single item sequences p, we compute supportD(X),support(X)D’ and supportD”(X) and put p into a queue Q, which is empty at the beginning of computation • Then we repeat the following until Q is empty • We dequeue one element p from Q • We update the support of p • If the sequence p (of length k) is in the frequent set, all length k + 1 sequences that are already in ISL and that are generating ascendents of p are queued into Q • If the sequence, p, is in the negative border and its support suggests it is frequent, then this element is placed in NB-to-FS[k]

ISM Algorithm - Phase 1 Example: Min_Support: 3

ISM Algorithm - Phase 1 For each single item sequences , A,B,C first put into Q C B A

ISM Algorithm - Phase 1 C B A:New Support =4 ……. B->A B->B A->B AB C B

ISM Algorithm - Phase 1 ……. A->B AB C: New Support = 4 Original Support = 1 NB-to-FS C

ISM Algorithm - Phase 1

ISM Algorithm - Phase 2 • For all 1-sequence that have moved we intersect it with all possible other frequent 1-sequence • Adding the frequent 2-sequence into NB-to-FS[2] • Those are not frequent are placed in NBD+l

ISM Algorithm - Phase 2

ISM Algorithm - Phase 2 • For length two sequence, pick an element that has not been processed and create the frequent sets • Pass the frequent sets to SPADE algorithm • Repeat from step.4 until NB-to-FS are empty

Memory Management • Using a two level disk-file indexing scheme with a hashing mechanism • Vertical database is partitioned into a number of blocks ????? Given Customer Item Block1

Memory Management Block5 Given … Hash Function Customer Block17 ….. Block56

Memory Management Block5 Given … Hash Function Customer Block17 Hash Function Item ….. Block56

Interactive Sequence Mining • Definition: • The idea in interactive sequence mining is • that an end user be allowed to query the • database for association rules at differing values of support and confidence. • The goal is to allow such interaction without excessive I/O or computation.

Six Typical Set Of Queries • Simple Queries • Refined Queries • Quantified Queries • Including Queries • Excluding Queries • Hierarchical Queries

Problem • Given such a lattice, we can produce answers to all but one (Hierarchical queries) of the queries described in the previous section at interactive speeds without going back to the original database Why:Hierarchical queries require the algorithm to treat a set of related items as one super-item.

Hierarchical queries

Performance: Experimental Evaluation • Database properties: [ C:the number of customers T: average transaction size lDl: total number of transaction ]

Performance • The first experiment: • Experiment is to compare the time cost of • running SPADE algorithm on entire • database with that of running just • incremental algorithm for 4 databases • while fixing transaction percentage of • increment to 20%.

Result: experiment 1

The second experiment: • Experiment is to vary the support size(1%), • and for two databases. • Result:

The third experiment: Experiment is to keep the support, the number of customers, and the transaction percentage constant, and vary the number of transactions per customer (10, 12 and15) • Result

Interactive Performance We evaluated simple querying on supports ranging from 0.1%-0.25%, refined querying (support refined to 0.5% for all the datasets), priority querying (querying for the 50 sequences with highest support), including queries (including a random item) and excluding queries (excluding a random item). • Result

Conclusion • In this paper, we propose novel techniques that maintain data structures for mining sequences in the presence of a) database updates, and b) user interaction. • At the cost of maintaining a summary data structure, the interactive approach performs several orders of magnitude faster than any current sequence mining algorithm. • One of the limitations of the incremental approach proposed in this paper is the size of the negative border, and the resulting memory utilization.

Discuss • Incremental update? • Hash collisions? • Memory usage? • …

Incremental and Interactive Sequence Mining

Incremental and Interactive Sequence Mining

Presentation Transcript

Approaching Process Mining with Sequence Clustering

Interactive tools and programming environments for sequence analysis

Mining Sequence Classifiers for Early Prediction

Sequence in Mining

7. Sequence Mining

Data Stream Mining and Incremental Discretization

Incremental Mining Association Rules

Incremental Mining of Association Rules

Data Mining: Concepts and Techniques Mining sequence patterns in transactional databases

Mining Sequence Patterns in Transactional Databases

Incremental Context Mining for Adaptive Document Classification

Introduction on biological sequence indexing, searching and text mining

7. Sequence Mining

Sequence Data Mining: Techniques and Applications

Pattern Directed Mining Of Sequence Data

FS-Miner : Efficient and Incremental Mining of Frequent Sequence Patterns in Web L ogs

Interactive Event Sequence Query and Transformation

Mining Sequence Data

Data Mining the Yeast Genome Expression and Sequence Data

Interactive Sequence CUFMEM01A Use An Authoring Tool To Create An Interactive Sequence

Incremental Mining of Association Rules

Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data