
Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences In Biological Datasets

Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences In Biological Datasets. Genxing Yang, Shanghai Software Test Key Lab. Jin Pan, Peng Wang, Wei Wang, Baile Shi, Fudan University, China. CIT '05.


Presentation Transcript


  1. Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences In Biological Datasets. Genxing Yang, Shanghai Software Test Key Lab. Jin Pan, Peng Wang, Wei Wang, Baile Shi, Fudan University, China. CIT'05

  2. Introduction • Preliminaries • Two Maximal Frequent Concatenate Sequences Algorithms • Performance Evaluation • Conclusion • My Thought

  3. Introduction • Efficiently discovering long frequent concatenate sequences poses a great challenge for existing sequential pattern discovery algorithms. • Almost all previously proposed methods for mining sequential patterns are Apriori-like: they rely on the property that any super-pattern of a nonfrequent pattern cannot be frequent.

  4. This paper proposes two novel maximal frequent concatenate sequence mining algorithms: • MacosFSpan: Maximal frequent concatenate sequences using Fixed length Span • MacosVSpan: Maximal frequent concatenate sequences using Variable length Span

  5. Preliminaries • A sequence α is called a frequent concatenate sequence in sequence database S if support_S(α) ≥ ξ, where ξ is the minimum support threshold. • If α is frequent and no concatenate super sequence of α is frequent, α is a maximal frequent concatenate sequence.
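
The two definitions above can be made concrete with a small brute-force sketch. It assumes that support counts the number of sequences containing α as a contiguous run, and it is not one of the paper's algorithms. The helper names and the three-sequence database in the usage note are hypothetical.

```python
def support(db, alpha):
    """Number of sequences in db that contain alpha as a contiguous run.

    Assumption: support is counted per sequence, not per occurrence.
    """
    return sum(1 for s in db if alpha in s)


def is_frequent(db, alpha, min_sup):
    """alpha is frequent when its support reaches the threshold."""
    return support(db, alpha) >= min_sup


def is_maximal(db, alpha, min_sup, alphabet="ACGT"):
    """alpha is maximal if it is frequent and no one-item extension is.

    Checking one-item extensions is enough: every longer concatenate
    super sequence contains x + alpha or alpha + x for some item x, and
    a sub-run of a frequent sequence is itself frequent.
    """
    if not is_frequent(db, alpha, min_sup):
        return False
    return not any(
        is_frequent(db, x + alpha, min_sup) or is_frequent(db, alpha + x, min_sup)
        for x in alphabet
    )
```

With the hypothetical database `["ATCGTGACT", "GATCGTT", "CATCGTGAAG"]` and a threshold of 2, `ATCGT` is frequent but not maximal, because `ATCGTG` is still frequent, while `ATCGTGA` is maximal.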

  6. Definition (Projected database): The α-projected database is the collection of postfixes of sequences in S with respect to prefix α.

  7. <A>-projected database: <TCGTGACT>, <CT>, <TCGTT>, <TCGTGAAG>, <AG>, <G>, <TTG>, <TT>
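
A minimal sketch of building a projected database follows. It assumes that every occurrence of α contributes its own postfix, which is how the listing above appears to be built, and that empty postfixes are dropped. The function name and the input sequences are hypothetical, since the source database is not part of this transcript.

```python
def project(db, alpha):
    """Collect the postfix that follows every occurrence of alpha.

    Returns (sequence id, postfix) pairs so that support can later be
    counted per sequence. Empty postfixes are skipped (an assumption).
    """
    projected = []
    for seq_id, s in enumerate(db):
        start = s.find(alpha)
        while start != -1:
            postfix = s[start + len(alpha):]
            if postfix:
                projected.append((seq_id, postfix))
            start = s.find(alpha, start + 1)
    return projected


# Hypothetical input chosen so the postfixes resemble the listing above.
example_db = ["ATCGTGACT", "GATCGTT"]
a_projected = project(example_db, "A")
```

Here `a_projected` holds `TCGTGACT` and `CT` from the first sequence and `TCGTT` from the second.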

  8. Two Maximal Frequent Concatenate Sequences Algorithms • The MacosFSpan Algorithm • MacosVSpan: the Advanced Algorithm

  9. The MacosFSpan Algorithm • The Fixed Length Span Method • Construct the projected databases: the <A>-, <C>-, <T>-, and <G>-projected databases. • Given a projected database with prefix α, explore the first w items of each sequence in it. • Calculate the support of these w-sequences in the projected database to grow the frequent sequence α and construct the corresponding projected databases. • (Figure: Fspan Tree)
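
One growth step of the fixed length span method might be sketched as below. This is a simplification under stated assumptions, not the authors' implementation: a dictionary keyed by window prefixes stands in for the Fspan tree, support is counted per sequence id, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def fixed_span_step(projected, w, min_sup):
    """Count every prefix of the first w items of each postfix.

    projected: list of (sequence id, postfix) pairs for some prefix alpha.
    Returns {extension: support} for the extensions that stay frequent;
    alpha + extension is then a frequent sequence to grow further.
    """
    supporters = defaultdict(set)  # extension -> ids of supporting sequences
    for seq_id, postfix in projected:
        window = postfix[:w]
        for k in range(1, len(window) + 1):
            supporters[window[:k]].add(seq_id)
    return {ext: len(ids) for ext, ids in supporters.items() if len(ids) >= min_sup}


# Hypothetical <A>-projected database with sequence ids attached.
sample = [(0, "TCGTGACT"), (0, "CT"), (1, "TCGTT"),
          (2, "TCGTGAAG"), (2, "AG"), (2, "G")]
extensions = fixed_span_step(sample, w=4, min_sup=2)
```

In this example the only surviving extensions are `T`, `TC`, `TCG` and `TCGT`, so `<A>` would grow to `<ATCGT>` before the next projected database is built.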

  10. The Maximal Problem • To solve the maximal problem, they adopt a suffix tree to store all the frequent sequences: • If the inserted sequence is contained in some sequence in the tree, they delete it from the result set and stop inserting. • If the inserted sequence contains some sequence in the tree, denoted γ, they delete γ from the result set and continue to insert the other suffixes.
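
The two containment rules above can be illustrated without a suffix tree. The sketch below swaps in plain substring checks over a list, which is simpler but slower than the suffix tree the slide describes. The function name is hypothetical.

```python
def insert_maximal(results, candidate):
    """Return the result set after offering one frequent sequence to it.

    Rule 1: a candidate contained in a kept sequence is discarded.
    Rule 2: kept sequences contained in the candidate are removed.
    """
    if any(candidate in kept for kept in results):
        return results  # covered by (or equal to) an existing entry
    return [kept for kept in results if kept not in candidate] + [candidate]


results = []
for seq in ["AACT", "CAACT", "TCG", "GTCAACT", "ACT"]:
    results = insert_maximal(results, seq)
```

After the loop only `TCG` and `GTCAACT` remain, since every other sequence offered is a contiguous part of `GTCAACT`.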

  11. MacosVSpan • The fixed length span method is more efficient than traditional approaches. • However, it is still inefficient in some circumstances, because it still mines recursively.

  12. Consider the frequent sequence <GTCAACT> • We first obtain <AACT>, and then get <CAACT> when mining the <C>-projected database. • <TCAACT> and <GTCAACT> are then obtained in turn.

  13. When mining the <C>-projected database, for its sequence <ATCGTT>, we only need to count the frequency of <ATCGT> in this projected database.

  14. The Variable Length Span Method • (Figure: Vspan Tree) • Explore the first w items, denoted i1, i2, …, iw, in each sequence s of the current projected database. • For each item ij, 1 ≤ j ≤ w, match the subsequence starting from ij against the prefix of each mined projected database.

  15. If a subsequence fully matches a prefix α of a projected database, continue to find the longest subsequence β following α in s that matches a root-path in the α-subtree, and then insert the potential frequent sequence (<i1, i2, …, ij-1> + α + β) into the Vspan tree to count its frequency. • If not, just insert <i1, i2, …, iw> into the tree, as in the fixed length span method.
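
The candidate selection described on the last two slides can be sketched as follows. This is a rough reading under stated assumptions. The `mined` mapping, which maps an already-mined prefix α to the frequent continuations below it, is a hypothetical stand-in for the α-subtree. The first matching prefix is taken, and the function name is invented for illustration.

```python
def vspan_candidate(s, w, mined):
    """Choose the sequence to insert into the Vspan tree for postfix s.

    mined: {alpha: set of frequent continuations below alpha}, i.e. the
    root-paths of the alpha-subtree (assumed representation).
    """
    for j in range(min(w, len(s))):
        rest = s[j:]
        for alpha, paths in mined.items():
            if not rest.startswith(alpha):
                continue
            tail = rest[len(alpha):]
            # Longest beta that follows alpha in s and is a known root-path.
            beta = max((p for p in paths if tail.startswith(p)), key=len, default="")
            return s[:j] + alpha + beta
    return s[:w]  # no match: fall back to the fixed-length window


# Assumed earlier result: <ATCGT> was found while mining the <A>-projected database.
mined = {"A": {"T", "TC", "TCG", "TCGT"}}
candidate = vspan_candidate("ATCGTT", 3, mined)
```

For the postfix `<ATCGTT>` from slide 13 this yields `<ATCGT>`, the one sequence whose frequency then needs counting. A postfix with no match, such as `GGGG`, falls back to its first w items.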

  16. Performance Evaluation • They generate a set of sequence seeds, and then use them to generate the target sequence set. • Each dataset is given a name of the form PN50PL300SN50SL1000 • PN: number of sequence seeds generated • PL: average length of the sequence seeds • SN: number of sequences in the dataset • SL: average length of the sequences
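
As a small illustration of the naming scheme, the helper below splits a dataset name into its four parameters. The function name and the dictionary keys are hypothetical. Only the PN/PL/SN/SL layout comes from the slide.

```python
import re


def parse_dataset_name(name):
    """Split a name such as PN50PL300SN50SL1000 into its four parameters."""
    fields = re.fullmatch(r"PN(\d+)PL(\d+)SN(\d+)SL(\d+)", name)
    if fields is None:
        raise ValueError(f"unrecognised dataset name: {name}")
    keys = ("seed_count", "seed_avg_len", "sequence_count", "sequence_avg_len")
    return dict(zip(keys, map(int, fields.groups())))


params = parse_dataset_name("PN50PL300SN50SL1000")
```

For the name used on this slide, `params` describes 50 seeds of average length 300 and 50 sequences of average length 1000.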

  17. PN50PL200SN50SL1000 min_sup: 2

  18. min_sup: 2 w: 10

  19. 2500 sequences, with lengths between 300 and 1200

  20. Conclusion • This paper developed two novel and efficient algorithms to mine frequent concatenate sequences, called MacosFSpan and MacosVSpan. • MacosFSpan is much more efficient than the traditional methods. • MacosVSpan is more efficient than MacosFSpan, especially on datasets with long frequent sequences.

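  21. My Thought • Use the user's eye-movement trajectory during reading to identify the keywords they are interested in, automatically search for related documents on the user's behalf, and evaluate the search results.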

  22. (Diagram labels) Find Interest Patterns • Interest Pattern Database • Document Database • Match patterns, derive keywords & retrieve relevant documents

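  23. Finding keywords the user may be interested in • From articles the user is interested in, use eye-movement characteristics of reading obtained from the psychology department to guess which keywords the user may be interested in, e.g. words with a higher number of fixations or regressions. • Have the user look back on the reading session and say which words interested them. • Extract the eye-movement trajectories recorded while reading these keywords, or the sentences containing them, and find their recurring eye-movement features.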
