1 / 116

Suffix trees

This text provides an overview of different string representations including Suffix Trees, Trie, Compressed Trie, and Suffix Tree Construction Algorithms like Ukkonen's Linear Time Construction. It also covers applications such as exact string matching, longest common substring, lowest common ancestors, and finding maximal palindromes.

ralphs
Télécharger la présentation

Suffix trees

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Suffix trees

  2. Trie • A tree representing a set of strings. c a { aeef ad bbfe bbfg c } b e b d e f c f e g

  3. Trie (Cont) • Assume no string is a prefix of another c Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. a b e b d e f c f e g

  4. Compressed Trie • Compress unary nodes, label edges by strings c  c a a b e b bbf d d eef e f c c f e g e g

  5. Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s

  6. Suffix tree (Example) Let s=abab,a suffix tree of s is a compressed trie of all suffixes of s=abab$ $ { $ b$ ab$ bab$ abab$ } a b b $ a a $ b b $ $

  7. Trivial algorithm to build a Suffix tree a b Put the largest suffix in a b $ a b b a Put the suffix bab$ in a b b $ $

  8. a b b a a b b $ $ Put the suffix ab$ in a b b a b $ a $ b $

  9. a b b a b $ a $ b $ Put the suffix b$ in a b b $ a a $ b b $ $

  10. a b b $ a a $ b b $ $ $ Put the suffix $ in a b b $ a a $ b b $ $

  11. $ a b b $ a a $ b b $ $ We will also label each leaf with the starting point of the corres. suffix. $ a b 5 b $ a a $ b 4 b $ $ 3 2 1

  12. Analysis Takes O(n2) time to build. You can do it in O(n) time But, how come? does it take O(n) space ?

  13. Linear space ? • Consider the string aaaaaabbbbbb $ a b $ bbbbbb$ a b $ bbbbbb$ b a $ b bbbbbb$ a $ b bbbbbb$ c $ a b$ bbbbbb$ abbbbbb$

  14. To use only O(n) space encode the edge-labels • Consider the string aaaaaabbbbbb $ a b $ (7,13) a b $ bbbbbb$ b a $ b bbbbbb$ a $ b bbbbbb$ c $ a b$ bbbbbb$ abbbbbb$

  15. To use only O(n) space encode the edge-labels • Consider the string aaaaaabbbbbb (13,13) (1,1) (7,7) (7,13) (1,1) (7,7) (13,13) (7,13) (7,7) (1,1) (13,13) (7,7) (7,13) (1,1) (13,13) (7,7) (7,13) c (13,13) (1,1) (12,13) (7,13) (6,13)

  16. What can we do with it ? Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T. We may also want to find all occurrences of P in T

  17. Exact string matching In preprocessing we just build a suffix tree in O(n) time $ a b 5 b $ a a $ b 4 b $ $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.

  18. $ a b 5 b $ a a $ b 4 b $ $ 3 2 1 If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time

  19. Generalized suffix tree Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s  S To associate each suffix with a unique string in S add a different special char to each s

  20. Generalized suffix tree (Example) Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2 # $ { $ # b$ b# ab$ ab# bab$ aab# abab$ } a b 4 5 # $ a b a 3 b b 4 $ # # a $ 2 1 b $ 3 2 1

  21. So what can we do with it ? Matching a pattern against a database of strings

  22. Longest common substring (of two strings) Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. # $ a b 4 5 # $ a b a 3 b b 4 $ # # a $ 2 1 b $ Find such node with largest “string depth” 3 2 1

  23. Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

  24. Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes # $ a b 4 5 # $ a b a 3 b b 4 $ # # a $ 2 1 b $ 3 2 1

  25. Finding maximal palindromes • A palindrome: caabaac, cbaabc • Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr

  26. Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc# For every i find the LCA of suffix i of s and suffix m-i+1 of sr

  27. Let s = cbaaba$ then sr = abaabc# a # b $ c 7 7 a b $ b a c # baaba$ c # 6 c # a $ a b 6 a $ 4 abc # 5 5 $ 3 3 c # a $ 4 1 2 2 1

  28. Analysis O(n) time to identify all palindromes

  29. Can we construct a suffix tree in linear time ?

  30. Ukkonen’s linear time construction ACTAATC A A 1

  31. ACTAATC AC A 1

  32. ACTAATC AC AC 1

  33. ACTAATC AC AC C 1 2

  34. ACTAATC ACT AC C 1 2

  35. ACTAATC ACT ACT C 1 2

  36. ACTAATC ACT ACT CT 1 2

  37. ACTAATC ACT T ACT CT 1 3 2

  38. ACTAATC ACTA T ACT CT 1 3 2

  39. ACTAATC ACTA T ACTA CT 1 3 2

  40. ACTAATC ACTA T ACTA CTA 1 3 2

  41. ACTAATC ACTA TA ACTA CTA 1 3 2

  42. ACTAATC ACTAA TA ACTA CTA 1 3 2

  43. ACTAATC ACTAA TA ACTAA CTA 1 3 2

  44. ACTAATC ACTAA TA ACTAA CTAA 1 3 2

  45. ACTAATC ACTAA TAA ACTAA CTAA 1 3 2

  46. ACTAATC ACTAA TAA A CTAA 3 2 A CTAA 4 1

  47. Phases & extensions • Phase i is when we add character i i • In phase i we have i extensions of suffixes

  48. ACTAATC ACTAAT TAA A CTAA 3 2 A CTAA 4 1

  49. ACTAATC ACTAAT TAA A CTAA 3 2 A CTAAT 4 1

  50. ACTAATC ACTAAT TAA A CTAAT 3 2 A CTAAT 4 1

More Related