
Suffix arrays


rhett

Presentation Transcript


  1. Suffix arrays

  2. Suffix array • We lose some of the functionality of the suffix tree but we save space • Let s = abab • Sort the suffixes lexicographically: ab, abab, b, bab • The suffix array gives the starting indices of the suffixes in sorted order: 2 0 3 1
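A minimal sketch of this definition, assuming 0-based positions as on the slide. The name `suffix_array_naive` is only illustrative, and sorting whole suffixes is far slower than the constructions discussed later.

```python
def suffix_array_naive(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Sorting by the suffix text itself; fine for small examples only.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array_naive("abab"))  # [2, 0, 3, 1]
```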

  3. How do we build it? • Build a suffix tree • Traverse the tree in DFS, picking the edges out of each node in lexicographic order, and fill the suffix array • O(n) time
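The traversal can be sketched on an uncompressed suffix trie, which is an assumption made to keep the example short: a trie takes quadratic space, so this does not achieve the O(n) bound of a real suffix tree. The function name and the `end` marker are hypothetical; the marker is assumed to be absent from `s` and smaller than all of its characters.

```python
def suffix_array_via_trie(s, end="$"):
    """Build a suffix trie of s + end, then read the leaves in DFS order."""
    text = s + end
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i  # leaf marker: the suffix starting at i ends here

    order = []

    def dfs(node):
        if None in node:
            order.append(node[None])
        # Visit outgoing edges in lexicographic order.
        for ch in sorted(k for k in node if k is not None):
            dfs(node[ch])

    dfs(root)
    # Drop the suffix that consists of the end marker alone.
    return [i for i in order if i < len(s)]


print(suffix_array_via_trie("abab"))  # [2, 0, 3, 1]
```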

  4. How do we search for a pattern? • If P occurs in T then all its occurrences are consecutive in the suffix array • Do a binary search on the suffix array • Takes O(m log n) time
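A sketch of the plain binary search, assuming the suffix array is given as a list of 0-based positions. `find_occurrences` is an illustrative name. Each probe compares at most m characters, which gives the O(m log n) bound.

```python
def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # The occurrences are exactly the consecutive block sa[start:lo].
    return sorted(sa[start:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_occurrences(text, sa, "ssi"))   # [2, 5]
print(find_occurrences(text, sa, "issa"))  # []
```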

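  5. Example • Let S = mississippi and P = issa • Suffix array of S with the sorted suffixes; L, M and R mark the left end, middle and right end of the search interval:
      10  i             L
       7  ippi
       4  issippi
       1  ississippi
       0  mississippi
       9  pi            M
       8  ppi
       6  sippi
       3  sissippi
       5  ssippi
       2  ssissippi     R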

  6. How do we accelerate the search? • Maintain l = LCP(P, L) • Maintain r = LCP(P, R) • Assume l ≥ r

  7. If l = r then start comparing M to P at l + 1

  8. l > r

  9. Someone whispers LCP(L, M) • Case LCP(L, M) > l

  10. LCP(L, M) > l • Continue in the right half

  11. Case LCP(L, M) < l

  12. LCP(L, M) < l • Continue in the left half

  13. LCP(L, M) = l • Start comparing M to P at l + 1
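The case analysis above can be sketched as follows. Several things here are assumptions rather than part of the slides: the names `lcp` and `find` are illustrative, the branch for l < r is written as the mirror image using LCP(M, R), and the "whispered" values are recomputed naively instead of being looked up. Because of that recomputation and the string slicing, this sketch shows the control flow only and does not reach O(m + log n).

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def find(text, sa, pattern):
    """Return one start position of pattern in text, or -1 if there is none."""
    n, m = len(sa), len(pattern)
    if n == 0:
        return -1
    suf = lambda k: text[sa[k]:]  # slicing is for clarity only
    lo, hi = 0, n - 1
    l, r = lcp(pattern, suf(lo)), lcp(pattern, suf(hi))
    if l == m:
        return sa[lo]
    if r == m:
        return sa[hi]
    if pattern < suf(lo) or pattern > suf(hi):
        return -1
    # Invariant: suf(lo) < pattern < suf(hi), l = LCP(P, L), r = LCP(P, R).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            k = lcp(suf(lo), suf(mid))  # the "whispered" LCP(L, M)
            if k > l:      # M agrees with L beyond l, so P > M: go right
                lo = mid
                continue
            if k < l:      # M leaves L before P does, so P < M: go left
                hi, r = mid, k
                continue
            start = l
        else:
            k = lcp(suf(mid), suf(hi))  # mirror case, using LCP(M, R)
            if k > r:
                hi = mid
                continue
            if k < r:
                lo, l = mid, k
                continue
            start = r
        # LCP equals max(l, r): compare P to M from that position on.
        s = suf(mid)
        j = start + lcp(pattern[start:], s[start:])
        if j == m:
            return sa[mid]
        if j == len(s) or s[j] < pattern[j]:
            lo, l = mid, j
        else:
            hi, r = mid, j
    return -1


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find(text, sa, "issa"))  # -1
```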

  14. Analysis • If we do more than a single comparison in an iteration then max(l, r) grows by 1 for each comparison ⇒ O(m + log n) time

  15. Construct the suffix array without the suffix tree

  16. Linear time construction • Recursively? Say we want to sort only the suffixes that start at even positions?

  17. Change the alphabet • Every pair of characters is now a character • You in fact sort the suffixes of a string shorter by a factor of 2!

  18. Change the alphabet a a b a a b $ 2 1 2

  19. But we do not gain anything…

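  20. Divide into triples • S = y a b b a d a b b a d o $ (positions 0 to 12) • Triples starting at positions 1 (mod 3): abb ada bba do$

  21. Divide into triples • S = y a b b a d a b b a d o $ (positions 0 to 12) • Triples starting at positions 1 (mod 3): abb ada bba do$ • Triples starting at positions 2 (mod 3): bba dab bad o$$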

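  22. Sort recursively 2/3 of the suffixes • S = y a b b a d a b b a d o $ (positions 0 to 12) • Name each triple by its rank: abb = 1, ada = 2, bad = 3, bba = 4, dab = 5, do$ = 6, o$$ = 7 • The triples abb ada bba do$ bba dab bad o$$ become the string 1 2 4 6 4 5 3 7 • Suffix array of that string: 0 1 6 4 2 5 3 7 • Mapped back to positions in S, the sorted suffixes are 1 4 8 2 7 5 10 11 • Rank of each position: 1 → 1, 2 → 4, 4 → 2, 5 → 6, 7 → 5, 8 → 3, 10 → 7, 11 → 8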
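A sketch of this step on the slide's example, under stated assumptions: `sort_sample_suffixes` is an illustrative name, the text is passed without its final `$`, `$` is assumed to be smaller than every character, and the recursive call is replaced by a naive sort of the reduced string. General inputs need more careful padding than this example does.

```python
def sort_sample_suffixes(s):
    """Sort the suffixes of s that start at positions 1 and 2 (mod 3).

    Returns (reduced, sorted_positions, rank), where rank is 1-based.
    """
    n = len(s)
    padded = s + "$$"  # lets the last triples run past the end of s
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    triples = [padded[i:i + 3] for i in positions]
    # Name each distinct triple by its rank among the distinct triples.
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    reduced = [names[t] for t in triples]
    # Stand-in for the recursion: sort suffixes of the reduced string naively.
    order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    sorted_positions = [positions[k] for k in order]
    rank = {pos: r for r, pos in enumerate(sorted_positions, start=1)}
    return reduced, sorted_positions, rank


reduced, sorted_positions, rank = sort_sample_suffixes("yabbadabbado")
print(reduced)           # [1, 2, 4, 6, 4, 5, 3, 7]
print(sorted_positions)  # [1, 4, 8, 2, 7, 5, 10, 11]
```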

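  23. Sort the remaining third • Represent the suffix at each position i ≡ 0 (mod 3) by the pair (S[i], rank of suffix i + 1): position 0 → (y, 1), position 3 → (b, 2), position 6 → (a, 5), position 9 → (a, 7) • Sorting the pairs gives (a, 5) (a, 7) (b, 2) (y, 1), that is, positions 6 9 3 0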
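A sketch of this step, assuming the ranks of the positions 1 and 2 (mod 3) from the previous slide are available as a dictionary. `sort_mod0` is an illustrative name, and Python's `sorted` stands in for the radix sort that a linear-time version would use.

```python
def sort_mod0(s, rank):
    """Sort the suffixes at positions 0 (mod 3) by (char, rank of next suffix)."""
    # A missing rank (suffix past the end of s) counts as 0, the smallest.
    pairs = [((s[i], rank.get(i + 1, 0)), i) for i in range(0, len(s), 3)]
    return [i for _, i in sorted(pairs)]


s = "yabbadabbado"
# Ranks of the sampled suffixes, copied from the previous slide.
rank = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8}
print(sort_mod0(s, rank))  # [6, 9, 3, 0]
```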

  24. Merge • Sorted suffixes at positions 1, 2 (mod 3): 1 4 8 2 7 5 10 11 • Sorted suffixes at positions 0 (mod 3): 6 9 3 0 • Output so far: 1

  25. Merge • Output so far: 1 6

  26. Merge • Output so far: 1 6 4

  27. Merge • Output so far: 1 6 4 9

  28. Merge • Output so far: 1 6 4 9 3

  29. Merge • Output so far: 1 6 4 9 3 8

  30. Merge • Output so far: 1 6 4 9 3 8 2

  31. Merge • Output so far: 1 6 4 9 3 8 2 7

  32. Merge • Output so far: 1 6 4 9 3 8 2 7 5

  33. Merge • Output so far: 1 6 4 9 3 8 2 7 5 • Still to merge: 10 11 and 0

  34. Merge • Output so far: 1 6 4 9 3 8 2 7 5 10 11 • Still to merge: 0

  35. Summary • Final suffix array: 1 6 4 9 3 8 2 7 5 10 11 0 • When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes • When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the suffixes that follow those two chars
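The two comparison rules can be sketched as a merge routine. The names `merge` and `sample_before` are illustrative, the text is passed without its final `$`, and a rank of 0 for positions past the end is an assumption that works for this example.

```python
def merge(s, sample, rest, rank):
    """Merge sorted sample suffixes (1, 2 mod 3) with sorted 0 (mod 3) suffixes."""
    t = s + "$$"  # padding so that t[i + 1] is always defined
    rk = lambda i: rank.get(i, 0)

    def sample_before(i, j):
        """Is suffix i (sampled) smaller than suffix j (position 0 mod 3)?"""
        if i % 3 == 1:
            # One char, then ranks of suffixes i + 1 and j + 1 (both sampled).
            return (t[i], rk(i + 1)) < (t[j], rk(j + 1))
        # Two chars, then ranks of suffixes i + 2 and j + 2 (both sampled).
        return (t[i], t[i + 1], rk(i + 2)) < (t[j], t[j + 1], rk(j + 2))

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_before(sample[a], rest[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(rest[b])
            b += 1
    return out + sample[a:] + rest[b:]


s = "yabbadabbado"
rank = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8}
print(merge(s, [1, 4, 8, 2, 7, 5, 10, 11], [6, 9, 3, 0], rank))
# [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```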

  36. Compute LCPs • S = y a b b a d a b b a d o $ (positions 0 to 12) • Rank of positions 0 to 11 in the suffix array: 12 1 7 5 3 9 2 8 6 4 10 11 • Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0 • Sorted suffixes with their starting positions:
      abbadabbado$    1
      abbado$         6
      adabbado$       4
      ado$            9
      badabbado$      3
      bado$           8
      bbadabbado$     2
      bbado$          7
      dabbado$        5
      do$            10
      o$             11
      yabbadabbado$   0

  37. Crucial observation • LCP(i, j) = min {LCP(i, i+1), LCP(i+1, i+2), …, LCP(j-1, j)}, where i < j are positions in the suffix array
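A small brute-force check of the observation, as a sketch. `common_prefix_len` and `check_observation` are illustrative names; i and j range over positions in the suffix array, which is built naively here.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def check_observation(s):
    """Verify LCP(i, j) == min of the adjacent LCPs between ranks i and j."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    adjacent = [common_prefix_len(s[sa[k]:], s[sa[k + 1]:])
                for k in range(len(sa) - 1)]
    for i in range(len(sa)):
        for j in range(i + 1, len(sa)):
            direct = common_prefix_len(s[sa[i]:], s[sa[j]:])
            if direct != min(adjacent[i:j]):
                return False
    return True


print(check_observation("yabbadabbado$"))  # True
```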

  38. Find LCPs of consecutive suffixes • Go over the positions of S in text order; for each position i compute LCP(j, i), where j is the suffix that precedes i in the suffix array • LCP(11, 0) = 0

  39. LCP(8, 2) = 1

  40. LCP(9, 3) = 0

  41. LCP(6, 4) = 1

  42. LCP(7, 5) = 0

  43. LCP(1, 6) = 5

  44. LCP(2, 7) = 4

  45. LCP(3, 8) = 3

  46. LCP(4, 9) = 2

  47. LCP(5, 10) = 1

  48. LCP(10, 11) = 0

  49. Analysis • The position at which the comparison starts decreases by at most 1 in every iteration, so it cannot increase more than O(n) times in total
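The procedure walked through on these slides matches what is commonly known as Kasai's algorithm; naming it that way is an assumption, since the slides do not name it. A sketch, with `kasai` as an illustrative name and the suffix array built naively for the example:

```python
def kasai(s, sa):
    """Return lcp, where lcp[r - 1] = LCP(suffix sa[r - 1], suffix sa[r])."""
    n = len(sa)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * max(n - 1, 0)
    h = 0  # number of characters already known to match
    for i in range(n):  # positions in text order
        if rank[i] == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        j = sa[rank[i] - 1]  # suffix preceding i in the suffix array
        while i + h < len(s) and j + h < len(s) and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i] - 1] = h
        if h > 0:
            h -= 1  # the next position keeps all but one matched character
    return lcp


s = "yabbadabbado$"
sa = sorted(range(len(s) - 1), key=lambda i: s[i:])  # skip the lone "$"
print(kasai(s, sa))  # [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```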

  50. We need more LCPs for search • LCPs of consecutive suffixes, in suffix array order: 5 1 2 0 3 1 4 0 1 0 0 • The search also needs LCP(L, M) and LCP(M, R) for every interval the binary search can reach • Linearly many; calculate them all bottom up
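A sketch of the bottom-up computation, assuming the binary search always splits an interval (L, R) at M = (L + R) // 2. `interval_lcps` is an illustrative name. Each interval's value is the minimum of its two halves, so every interval costs constant time once its children are done.

```python
def interval_lcps(adjacent):
    """Map every reachable search interval (L, R) to LCP(L, R).

    adjacent[k] is the LCP of the suffixes at ranks k and k + 1.
    """
    table = {}

    def build(lo, hi):
        if hi - lo == 1:
            table[(lo, hi)] = adjacent[lo]
        else:
            mid = (lo + hi) // 2
            # Children first, then combine: LCP(L, R) = min(LCP(L, M), LCP(M, R)).
            table[(lo, hi)] = min(build(lo, mid), build(mid, hi))
        return table[(lo, hi)]

    if adjacent:
        build(0, len(adjacent))
    return table


adjacent = [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
table = interval_lcps(adjacent)
print(len(table))     # 21
print(table[(0, 2)])  # 1
```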
