Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences In Biological Datasets

Genxing Yang, Shanghai Software Test Key Lab • Jin Pan, Peng Wang, Wei Wang, Baile Shi, Fudan University, China • CIT’05

Introduction • Preliminaries • Two Maximal Frequent Concatenate Sequences Algorithms • Performance Evaluation • Conclusion • My Thought

Introduction • Efficiently discovering long frequent concatenate sequences poses a great challenge for existing sequential pattern discovery algorithms. • Almost all previously proposed methods for mining sequential patterns are Apriori-like: they rely on the property that any super-pattern of a nonfrequent pattern cannot be frequent.

This paper proposes two novel maximal frequent concatenate sequence mining algorithms: • MacosFSpan: Maximal frequent concatenate sequences using Fixed length Span • MacosVSpan: Maximal frequent concatenate sequences using Variable length Span

Preliminaries • A sequence α is called a frequent concatenate sequence in sequence database S if support_S(α) ≥ ξ, where ξ is the minimum support threshold. • If α is frequent and no concatenate super-sequence of α is frequent, α is a maximal frequent concatenate sequence.
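The two definitions above can be sketched in Python. This is only an illustration under stated assumptions: sequences are plain strings, a concatenate sequence is a contiguous substring, and support is the number of database sequences that contain it. The function names and the toy database are hypothetical and do not come from the paper.

```python
def support(database, alpha):
    """Number of sequences in `database` that contain `alpha` contiguously (assumed definition)."""
    return sum(1 for seq in database if alpha in seq)


def is_frequent(database, alpha, min_sup):
    """True if `alpha` reaches the minimum support threshold."""
    return support(database, alpha) >= min_sup


# Hypothetical toy database, for illustration only.
toy_db = ["ATCGTGACT", "ATCGTGAAG", "CCTT"]
print(support(toy_db, "TCGTGA"))
print(is_frequent(toy_db, "AAG", 2))
```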

Definition (Projected database): The α-projected database is the collection of postfixes of sequences in S w.r.t. prefix α.

<A>-projected database: <TCGTGACT>, <CT>, <TCGTT>, <TCGTGAAG>, <AG>, <G>, <TTG>, <TT>
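A minimal sketch of building a projected database, assuming string sequences and assuming that a postfix is collected after every contiguous occurrence of the prefix (the listed example, where one sequence seems to contribute several postfixes, suggests this). The function name `project` and the two input sequences are hypothetical; the slides do not show the source database.

```python
def project(database, prefix):
    """Collect the non-empty postfix following every occurrence of `prefix`."""
    postfixes = []
    for seq in database:
        start = seq.find(prefix)
        while start != -1:
            postfix = seq[start + len(prefix):]
            if postfix:  # skip occurrences at the very end of a sequence
                postfixes.append(postfix)
            start = seq.find(prefix, start + 1)
    return postfixes


# Hypothetical input sequences chosen so that some postfixes match the example.
print(project(["ATCGTGACT", "ATCGTGAAG"], "A"))
```

With these two assumed sequences the call yields the postfixes TCGTGACT, CT, TCGTGAAG, AG and G, which correspond to five of the entries in the example.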

Two Maximal Frequent Concatenate Sequences Algorithms • The MacosFSpan Algorithm • MacosVSpan: the Advanced Algorithm

The MacosFSpan Algorithm • The Fixed Length Span Method • Construct the initial projected databases: the <A>-, <C>-, <T>-, and <G>-projected databases. • Given a projected database with prefix α, explore the first w items of each sequence in this projected database. • Calculate the support of these w-sequences in the projected database to grow the frequent sequence α and construct the corresponding projected databases. • Fspan Tree
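One step of the fixed length span method might look like the sketch below. It is not the authors' implementation: the projected database is assumed to be a list of (sequence id, postfix) pairs, support is assumed to count distinct source sequences, and a dictionary of window prefixes stands in for the Fspan tree (each key plays the role of one tree node). All names are hypothetical.

```python
from collections import defaultdict


def frequent_extensions(projected, w, min_sup):
    """Count every prefix of the first w items of each postfix and keep the frequent ones.

    projected: list of (seq_id, postfix) pairs for some prefix alpha.
    Returns {extension: support}; alpha + extension is then a frequent sequence.
    """
    seen_in = defaultdict(set)  # extension -> ids of source sequences containing it
    for seq_id, postfix in projected:
        window = postfix[:w]
        for k in range(1, len(window) + 1):
            seen_in[window[:k]].add(seq_id)
    return {ext: len(ids) for ext, ids in seen_in.items() if len(ids) >= min_sup}


# Hypothetical projected database, for illustration only.
toy_projected = [(0, "TCGT"), (1, "TCGA"), (2, "TTG"), (2, "TC")]
print(frequent_extensions(toy_projected, w=3, min_sup=2))
```

In a full miner, each frequent extension of length w would give a new prefix whose projected database is mined in the same way; that recursion is omitted here.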

The Maximal Problem • To solve the maximal problem, they adopt a suffix tree to store all the frequent sequences: • If the inserted sequence is contained in some sequence in the tree, they delete it from the result set and stop inserting. • If the inserted sequence contains some sequence in the tree, denoted as γ, they delete γ from the result set and continue to insert the other suffixes.
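The effect of this filtering can be illustrated with a much simpler stand-in. The sketch below does not use a suffix tree; it checks substring containment directly, which gives the same result set on small inputs but without the efficiency the suffix tree is meant to provide. The function name is hypothetical.

```python
def keep_maximal(frequent):
    """Keep only sequences that are not contained in another frequent sequence."""
    maximal = []
    # Longest first, so any sequence that could contain `cand` is already in `maximal`.
    for cand in sorted(set(frequent), key=len, reverse=True):
        if not any(cand in kept for kept in maximal):
            maximal.append(cand)
    return maximal


print(keep_maximal(["AACT", "CAACT", "TCAACT", "GTCAACT", "CGT"]))
```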

MacosVSpan • The fixed length span method is more efficient than traditional approaches. • However, this method is inefficient in some circumstances, because it still mines recursively.

Consider the frequent sequence <GTCAACT> • We first obtain <AACT>, and then get <CAACT> when mining the <C>-projected database. • <TCAACT> and <GTCAACT> are then obtained in turn.

When mining <C>-projected database, for its sequence <ATCGTT>, we only need to count the frequency of <ATCGT> in this projected database.

The Variable Length Span Method • Vspan Tree • Explore the first w items, denoted as i1, i2, …, iw, in each sequence s of the current projected database. • For each item ij, 1 ≤ j ≤ w, match the subsequence starting from ij against the prefix of each mined projected database.

If there exists a subsequence that fully matches the prefix α of a mined projected database, we continue to find the longest subsequence β following α in s that matches a root-path in the α-subtree, and then insert the potential frequent sequence (<i1, i2, …, ij-1> + α + β) into the Vspan tree to count its frequency. • If not, they just insert <i1, i2, …, iw> into the tree, as in the fixed length span method.
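The candidate-generation step of the variable length span method can be sketched as follows. This is a simplified reading of the slides, not the authors' code: already-mined results are assumed to be a dictionary mapping each mined prefix α to the root-paths of its subtree (the frequent sequences that follow α), the first matching position j is used, and the longest matching α is chosen if several match. All names and the toy data are hypothetical.

```python
def longest_root_path(rest, paths):
    """Longest prefix of `rest` that is also a prefix of some stored root-path."""
    best = ""
    for path in paths:
        k = 0
        while k < min(len(rest), len(path)) and rest[k] == path[k]:
            k += 1
        if k > len(best):
            best = rest[:k]
    return best


def vspan_candidate(s, w, mined):
    """Return the sequence to insert into the Vspan tree for postfix `s`."""
    window = s[:w]
    for j in range(len(window)):  # j is 0-based here; the slides use 1-based i_j
        tail = s[j:]
        matches = [alpha for alpha in mined if tail.startswith(alpha)]
        if matches:
            alpha = max(matches, key=len)
            beta = longest_root_path(tail[len(alpha):], mined[alpha])
            return s[:j] + alpha + beta
    return window  # no mined prefix matched: fall back to the fixed length window


# Hypothetical state: the <A>-projected database was mined and produced these root-paths.
mined = {"A": ["TCGT", "AG"]}
print(vspan_candidate("ATCGTT", 3, mined))
```

For the sequence <ATCGTT> from the earlier example, and under the assumed mined state, the candidate is <ATCGT>, so only that sequence needs to be counted.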

Performance Evaluation • They generate some sequence seeds, and then use them to generate the target sequence set. • Each dataset is named in the form PN50PL300SN50SL1000 • PN: number of sequence seeds generated • PL: average length of the sequence seeds • SN: number of sequences in the dataset • SL: average length of the sequences
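The naming convention can be decoded mechanically. The helper below is a hypothetical convenience, assuming every dataset name has exactly the four fields in the order shown, each followed by an integer.

```python
import re


def parse_dataset_name(name):
    """Split a name such as 'PN50PL300SN50SL1000' into its four integer parameters."""
    match = re.fullmatch(r"PN(\d+)PL(\d+)SN(\d+)SL(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    return dict(zip(("PN", "PL", "SN", "SL"), map(int, match.groups())))


print(parse_dataset_name("PN50PL300SN50SL1000"))
```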

Experiment settings (figures not reproduced) • Dataset PN50PL200SN50SL1000, min_sup: 2 • min_sup: 2, w: 10 • 2500 sequences, with lengths between 300 and 1200

Conclusion • This paper developed two novel and efficient algorithms to mine frequent concatenate sequences, called MacosFSpan and MacosVSpan. • MacosFSpan is much more efficient than the traditional methods. • MacosVSpan is more efficient than MacosFSpan, especially on datasets with long frequent sequences.

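My Thought • Use the eye-movement trajectory recorded while a user reads to determine which keywords interest them, automatically search for related documents on the user's behalf, and evaluate the search results.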

Diagram labels: Find Interest Patterns • Interest Pattern Database • Document Database • Match patterns, derive keywords & retrieve relevant documents

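Finding keywords the user may be interested in • From articles the user finds interesting, use eye-movement characteristics of reading obtained from the psychology department to guess which keywords may interest the user, e.g. those with more fixations or regressions • Have the user review their reading session and indicate which words interested them • Extract the eye-movement trajectories recorded while reading these keywords, or the sentences containing them, and find their recurring eye-movement features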
