1 / 19

Boyer-Moore Algorithm

Boyer-Moore Algorithm. 3 main ideas right to left scan bad character rule good suffix rule. Right to left scan. 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c t z x t b p t x c t b p q. t b c b b. t b c b b. t b c b b. t b c b b. Bad character rule. Definition

bunny
Télécharger la présentation

Boyer-Moore Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Boyer-Moore Algorithm • 3 main ideas • right to left scan • bad character rule • good suffix rule

  2. Right to left scan 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c t z x t b p t x c t b p q t b c b b t b c b b t b cb b t b cb b

  3. Bad character rule • Definition • For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P. • R(x) is defined to be 0 if x is not in P • Usage • Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k]. • Shift P to the right by max(1, i - R(T[k])) places • Hopefully more than 1

  4. Illustration of bad character rule 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c t z x t b p t x c t b p q t b cb b t b cb b i = 5, R(z) = 0, so max(1, 5-0) = 5 t b c bb i = 5, R(t) = 1, so max(1, 5-1) = 4 i = 4, R(t) = 1, so max(1, 4-1) = 3 t b cb b

  5. Extended bad character rule • Definition • For each character x in the alphabet, let R(x,i) denote the position of the right-most occurrence of character x P[1..i-1]. • R(x,i) is defined to be 0 if x is not in P[1..i-1]. • Usage • Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k]. • Shift P to the right by max(1, i - R(T[k],i)) places • Hopefully more than 1

  6. Illustration of extended bad character rule 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c b b x t b p t x c t b p q b t c tb i = 4, R(b) = 5, so max(1, 4-5) = 1 b t ct b i = 4, R(b,4) = 1, so max(1, 4-1) = 3 b t ct b

  7. Implementation Issues • Bad character rule • Space required: O(|S|) for the number of characters in the alphabet • Calculate R[] matrix in O(n) time (exercise) • Extended bad character rule • Space required: full table is O(n|S|) • Smaller implementation: O(n) • Preprocess time: O(n) • Search time impact: increases search time by at worst twice the number of mismatches • See book for details (pg 18)

  8. Observations • Bad character rules • work well in practice with large alphabets like the English alphabet • work less well with small alphabets like DNA • Do not guarantee linear worst-case run-time • Give an example of such a case

  9. Strong good suffix rule part 1 • Situation • P[i..n] matches text T[j..j+n-i] but T[(j-1) does not match P(i-1) • The rightmost non-suffix substring t’ of P that matches the suffix P[i..n] AND the character to the left of t’ in P is different than P(i-1) • Shift P so that t’ matches up with T[j..j+n-i]

  10. Illustration of suffix rule part 1 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 p r s t a b s t u b a b v q x r s t q c a b d a b d a b q c ab d a b d a b q c ab d a b d a b

  11. Preprocessing for suffix rule part 1 • Definitions • For each i, L’(i) is the largest position less than n such that P[i..n] matches a suffix of P[1..L’(i)] and that the character preceding that suffix is not equal to P(i-1). • For string P, Nj(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of P • Observations • Nj(P) = Zn-j+1(Pr) • L’(i) is the largest j < n such that Nj(P) = |P[i..n]| which equals n-i+1 • If L’(i) > 0, shift P by n-L’(i) places to the right

  12. Z-based computation of L’(i) for (i=1;i<=n;i++) L’(i) = 0; for (j=1; j<=n-1; j++) { k = n-Nj(P)+1; L’(k) = j; }

  13. Strong good suffix rule part 2 • If L’(i) = 0 then … • Let t’’ = the largest suffix of P[i..n] that is also a prefix of P, if one exists. • If t’’ exists, shift P so that prefix of P matches up with t’’ at end of T[j..j+n-i]. • Otherwise, shift P past T[j+n-i].

  14. Illustration of suffix rule part 2 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 p r s t a b s t a b a b v q x r s t a b a b s t ab a b a ba b s t a b a b a ba b s t a b a b

  15. Preprocessing for suffix rule part 2 • Definitions • For each i, let l’(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. Otherwise, let l’(i)=0. • Observations • l’(i) = the largest j <= |P[i..n]| such that Nj(P) = j • Question • How does l’(i) relate to l’(i+1)? • The same unless Nn-i+1(P) = n-i+1

  16. Z-based computation of l’(i) l’[n+1] = 0; for (i=n;i>=2;i--) if (N[n-i+1] = = (n-i+1)) l’[i] = n-i+1; else l’[i] = l’[i+1]; }

  17. Addendum to suffix rule • Shift by 1 if there is an immediate mismatch • That is, if P(n) mismatches with the corresponding character in T

  18. Boyer-Moore Overview • Precompute L’(i), l’(i) for each position in P • Precompute R(x) or R(x,i) for each character x in S • Align P to T • Compare right to left • On mismatch, shift by the max possible from (extended) bad character rule and good suffix rule and return to compare

  19. Observations I • Original Boyer-Moore algorithm uses “weak good suffix rule” without using the mismatch information • This is not sufficient to prove that the search part of Boyer-Moore runs in linear time in the worst case • Using only strong good suffix rule, can prove a worst-case time of O(n) provided P is not in T • If P is in T, original Boyer-Moore runs in Q(nm) time in the worst case, but this can be corrected with simple modifications • Using only the bad character shift rule leads to O(nm) time in the worst-case, but works in sublinear time on random strings

More Related