A Novel Gap-Constrained Sequence Mining Algorithm: cSPADE

A new algorithm for gap constrained sequence mining
Salvatore Orlando, Raffaele Perego, Claudio Silvestri
Proceedings of the 2004 ACM Symposium on Applied Computing
Advisor: Jia-Ling Koh
Speaker: Chun-Wei Hsieh
11/19/2004

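Problem
• A sequence $\alpha = \langle \alpha_1 \ldots \alpha_k \rangle$ occurs in a sequence $\beta = \langle \beta_1 \ldots \beta_m \rangle$ under the minimum gap and maximum gap constraints, denoted $\alpha \sqsubseteq_c \beta$, if there exist integers $1 \le i_1 < \ldots < i_k \le m$ such that $\alpha_j \subseteq \beta_{i_j}$ for every $j$, and the gap between each pair of consecutive matched events $\beta_{i_{j-1}}$ and $\beta_{i_j}$ is at least min_gap and at most max_gap.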
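
The occurrence test above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Event` layout, the function name `occurs`, and the choice to measure a gap as the difference between event timestamps are all assumptions made for the example.

```python
from typing import FrozenSet, List, Optional, Tuple

# Assumed layout: an event is (timestamp, itemset).
Event = Tuple[int, FrozenSet[str]]


def occurs(pattern: List[FrozenSet[str]], sequence: List[Event],
           min_gap: int, max_gap: int) -> bool:
    """Return True if `pattern` occurs in `sequence` under the gap constraints.

    Assumption: the gap between two consecutive matched events is the
    difference of their timestamps and must lie in [min_gap, max_gap].
    """
    def search(j: int, prev_time: Optional[int]) -> bool:
        if j == len(pattern):
            return True  # every pattern element has been matched
        for time, items in sequence:
            if prev_time is not None:
                gap = time - prev_time
                # Matched events must move forward in time and respect both bounds.
                if gap <= 0 or gap < min_gap or gap > max_gap:
                    continue
            if pattern[j] <= items and search(j + 1, time):
                return True
        return False

    return search(0, None)


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
seq = [(1, a), (2, b), (5, c)]
print(occurs([a, b], seq, 1, 2))  # True: gap 1 is within [1, 2]
print(occurs([a, c], seq, 1, 2))  # False: gap 4 exceeds max_gap
```

The backtracking search tries every admissible match for each pattern element, which is enough for small examples but is not how a miner would count support at scale.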

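Min_gap constraint:
• Let $\beta$ be an input database sequence.
• If $\alpha \sqsubseteq_c \beta$ ($\alpha$ occurs in $\beta$ under the constraint), then all its subsequences $\alpha'$, $\alpha' \sqsubseteq \alpha$, satisfy $\alpha' \sqsubseteq_c \beta$.
• The min_gap constraint is therefore an anti-monotone constraint.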

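Max_gap constraint:
• Let $\beta$ be an input database sequence.
• If $\alpha \sqsubseteq_c \beta$ ($\alpha$ occurs in $\beta$ under the constraint), do all its subsequences $\alpha'$, $\alpha' \sqsubseteq \alpha$, satisfy $\alpha' \sqsubseteq_c \beta$?
• Not necessarily: dropping an intermediate event from $\alpha$ merges two gaps into one larger gap, which may exceed max_gap.
• The max_gap constraint is therefore not an anti-monotone constraint.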
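
A small counterexample makes the failure of anti-monotonicity concrete. The sketch below assumes single-item events with integer timestamps and measures gaps as timestamp differences; the helper `occurs_max_gap` and the toy data are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def occurs_max_gap(pattern, events, max_gap):
    """events: list of (time, item) sorted by strictly increasing time.

    Returns True if the items of `pattern` can be matched, in order, to
    events whose consecutive timestamps differ by at most `max_gap`.
    """
    for idx in combinations(range(len(events)), len(pattern)):
        times = [events[i][0] for i in idx]
        items = [events[i][1] for i in idx]
        gaps_ok = all(t2 - t1 <= max_gap for t1, t2 in zip(times, times[1:]))
        if items == list(pattern) and gaps_ok:
            return True
    return False


events = [(1, "a"), (3, "b"), (5, "c")]
print(occurs_max_gap(("a", "b", "c"), events, 2))  # True: gaps are 2 and 2
print(occurs_max_gap(("a", "c"), events, 2))       # False: gap is 4
```

The 3-sequence a → b → c satisfies max_gap = 2, while its subsequence a → c does not, so support under max_gap can grow when a pattern is extended.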

SPADE
• A candidate k-sequence is made from a pair of frequent (k-1)-subsequences that share a common (k-2)-prefix.
• Under a max_gap constraint, a frequent k-sequence can have an infrequent (k-1)-subsequence, so SPADE might lose candidates.
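
To see how a prefix-based join can miss a candidate, here is a simplified sketch. It assumes sequences of single-item events represented as tuples, and `prefix_join` is a hypothetical helper rather than SPADE's actual code (which also handles itemset extensions).

```python
def prefix_join(frequent):
    """Join (k-1)-sequences that share their (k-2)-prefix.

    frequent: set of equal-length tuples of items.
    Returns the set of candidate k-sequences.
    """
    candidates = set()
    for s in frequent:
        for t in frequent:
            if s[:-1] == t[:-1]:
                # Keep s and append the last item of t.
                candidates.add(s + (t[-1],))
    return candidates


# Suppose a->b and b->c are frequent under max_gap, but a->c is not.
frequent_2 = {("a", "b"), ("b", "c")}
print(("a", "b", "c") in prefix_join(frequent_2))  # False: a->c is missing
```

Generating a → b → c with this join requires both a → b and a → c, so the candidate is never produced when a → c fails the max_gap constraint.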

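Contiguous sequences
A sequence $\alpha'$ is a contiguous subsequence of $\alpha = \langle \alpha_1 \ldots \alpha_n \rangle$ if any of the following holds:
1. $\alpha'$ is obtained from $\alpha$ by dropping an item from either $\alpha_1$ or $\alpha_n$;
2. $\alpha'$ is obtained from $\alpha$ by dropping an item from $\alpha_i$, where $|\alpha_i| \ge 2$;
3. $\alpha'$ is a contiguous subsequence of $\alpha''$, and $\alpha''$ is a contiguous subsequence of $\alpha$.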
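
The three rules can be turned into a short enumeration. The sketch below is illustrative only: it represents a sequence as a tuple of frozensets, and the names `seq_of`, `one_step`, and `contiguous_subsequences` are assumptions introduced for the example.

```python
def seq_of(*elements):
    """Build a sequence from strings, e.g. seq_of("a", "bc") = <{a} {b, c}>."""
    return tuple(frozenset(e) for e in elements)


def one_step(seq):
    """Apply rule 1 or rule 2 once: drop a single item from an element."""
    out = set()
    last = len(seq) - 1
    for i, element in enumerate(seq):
        on_boundary = i == 0 or i == last        # rule 1
        if on_boundary or len(element) >= 2:     # rule 2
            for item in element:
                rest = element - {item}
                # An element that becomes empty disappears from the sequence.
                new_seq = seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]
                if new_seq:
                    out.add(new_seq)
    return out


def contiguous_subsequences(seq):
    """Close one_step under rule 3 (transitivity)."""
    seen, frontier = set(), [seq]
    while frontier:
        current = frontier.pop()
        for sub in one_step(current):
            if sub not in seen:
                seen.add(sub)
                frontier.append(sub)
    return seen


subs = contiguous_subsequences(seq_of("a", "b", "c"))
print(seq_of("a", "b") in subs)  # True
print(seq_of("a", "c") in subs)  # False: the middle event cannot be dropped
```

Because a single-item interior element can never be removed, no contiguous subsequence merges two gaps into one.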

Prefix and Suffix Subsequence
• The max_gap constraint becomes anti-monotone when only contiguous subsequences are considered.
• A prefix or suffix of a sequence is a particular contiguous subsequence of that sequence.

cSPADE
• cSPADE solves the problem by using the contiguous subsequence concept.
• It generates a candidate k-sequence $\alpha$ by combining its (k-1)-prefix and its 2-suffix, both of which are contiguous subsequences of $\alpha$.
• However, cSPADE destroys the prefix-class structure.

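CCSM
1) Count-based phase: scan the database and mine the frequent 1-sequences and 2-sequences.
2) The horizontal database is transformed into a vertical one.
3) Intersection-based phase: generate each candidate k-sequence by merging two frequent (k-1)-sequences such that the (k-2)-suffix of the first equals the (k-2)-prefix of the second.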
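
Step 2, the horizontal-to-vertical transformation, can be sketched as follows. The data layout (a dict from sequence id to a list of timestamped itemsets) and the function name `to_vertical` are assumptions made for illustration, not the layout used by CCSM itself.

```python
from collections import defaultdict


def to_vertical(horizontal):
    """Turn {sid: [(eid, items), ...]} into {item: [(sid, eid), ...]}.

    Each output list is the idlist of one item: every (sequence id,
    event id) pair at which the item appears, in sorted order.
    """
    vertical = defaultdict(list)
    for sid, events in sorted(horizontal.items()):
        for eid, items in sorted(events, key=lambda event: event[0]):
            for item in sorted(items):
                vertical[item].append((sid, eid))
    return dict(vertical)


db = {
    1: [(1, {"a"}), (2, {"a", "b"})],
    2: [(1, {"b"})],
}
print(to_vertical(db))
# {'a': [(1, 1), (1, 2)], 'b': [(1, 2), (2, 1)]}
```

With the vertical layout, the support of an item is the number of distinct sids in its idlist, and longer patterns can be evaluated by joining idlists instead of rescanning the database.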

Candidate generation
Figure 1: CCSM candidate generation.
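
As the figure is not reproduced here, a suffix/prefix merge can be sketched in code. This assumes the join condition is that the (k-2)-suffix of one frequent (k-1)-sequence equals the (k-2)-prefix of another, and it is restricted to single-item events; `suffix_prefix_join` is a hypothetical name.

```python
def suffix_prefix_join(frequent):
    """Merge s and t when s without its first item equals t without its last.

    frequent: set of equal-length tuples of items.
    Returns the set of candidate sequences one item longer.
    """
    candidates = set()
    for s in frequent:
        for t in frequent:
            if s[1:] == t[:-1]:
                # s and t overlap on k-2 items; append the last item of t.
                candidates.add(s + (t[-1],))
    return candidates


frequent_2 = {("a", "b"), ("b", "c")}
print(suffix_prefix_join(frequent_2))  # {('a', 'b', 'c')}
```

Both inputs are contiguous subsequences of the candidate (its prefix and its suffix), so the candidate is generated even when a non-contiguous subsequence such as a → c violates max_gap.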

Idlist intersection
• To determine the support of a candidate k-sequence p, we first have to produce the associated idlist L(p).
• The idlists of shorter sequences can be joined to produce L(p).

Idlist intersection order
• Left to right: store the eid of the last item/event.
• Right to left: store the eid of the first item/event.
• Idlist entries: (sid, eid) → (sid, first_eid, last_eid)
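
A gap-constrained join over entries of the form (sid, first_eid, last_eid) might look like the sketch below. The function names, the single-item extension, and the use of eid differences as gaps are assumptions for illustration; the actual CCSM k-way intersection is more elaborate.

```python
def extend_right(idlist, item_idlist, min_gap, max_gap):
    """Left-to-right join: append an item after the pattern's last event.

    idlist: (sid, first_eid, last_eid) entries of the current pattern.
    item_idlist: (sid, eid) entries of the item to append.
    """
    out = set()
    for sid, first, last in idlist:
        for item_sid, eid in item_idlist:
            if sid == item_sid and min_gap <= eid - last <= max_gap:
                out.add((sid, first, eid))  # the new last event is eid
    return sorted(out)


def extend_left(idlist, item_idlist, min_gap, max_gap):
    """Right-to-left join: prepend an item before the pattern's first event."""
    out = set()
    for sid, first, last in idlist:
        for item_sid, eid in item_idlist:
            if sid == item_sid and min_gap <= first - eid <= max_gap:
                out.add((sid, eid, last))  # the new first event is eid
    return sorted(out)


def support(idlist):
    """Number of distinct input sequences containing the pattern."""
    return len({sid for sid, _, _ in idlist})


idlist_a = [(1, 1, 1), (2, 1, 1)]      # item a as a 1-sequence
idlist_b = [(1, 3, 3), (2, 6, 6)]      # item b as a 1-sequence
occurrences_a = [(1, 1), (2, 1)]
occurrences_b = [(1, 3), (2, 6)]

ab = extend_right(idlist_a, occurrences_b, 1, 2)
print(ab, support(ab))  # [(1, 1, 3)] 1
```

Keeping both first_eid and last_eid in every entry lets the same idlist be extended in either direction, which is why the two join orders produce the same result for a → b in this example.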

Idlist caching
Figure 2: Example of cache usage.
Figure 3: CCSM idlist reuse.

Experiment 1
Figure 4: Number of intersection operations actually performed using 2-way, pure k-way, and cached k-way intersection methods while mining two synthetic datasets.

Experiment 2
Figure 5: Number of frequent sequences in datasets CS11 (minsup=0.30) and CS21 (minsup=0.40) as a function of the pattern length for different values of the max_gap constraint.

Experiment 3
Figure 6: Execution times of CCSM and cSPADE on datasets CS11 (minsup=0.30) and CS21 (minsup=0.40) as a function of the max_gap value.

Experiment 4
Figure 7: Execution times of CCSM and cSPADE on datasets CS11 and CS21 with a fixed max_gap constraint (max_gap=8) as a function of the minimum support threshold.
