This lecture discusses the Boyer-Moore algorithm, the method of choice for exact string search with a single pattern, in the context of biosequence analysis. The algorithm typically runs in sublinear time by combining three ideas: right-to-left scanning, the bad character rule, and the strong good suffix rule. Together these rules produce large pattern shifts and give a provable O(n+m) running time when the pattern does not occur in the text. The lecture also notes that the algorithm performs better empirically on English text than on DNA sequence, and covers Galil's extension, which gives linear running time when the pattern does occur in the text.
Exact String Search Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005
Boyer-Moore • Method of choice for exact string search for a single pattern • Typically examines fewer than m characters of the text (sublinear time) • Linear worst-case running time • Conceptually very similar to K-M-P, but with a more complicated running time proof • Empirically, better for English text than for DNA sequence
Boyer-Moore • Three key ideas • Right to left scan • Bad character rule • (Strong) good suffix rule • The combination of these ideas can produce large pattern shifts. • Provable O(n+m) running time when pattern is not in the text • need extension for case when pattern is in the text to achieve linear running time.
Right to left scan / bad character rule
  0        1
  12345678901234567
T:xpbctbxabpqxctbpq
P:  tpabxab
      *^^^^
Right to left scan / bad character rule
  0        1
  12345678901234567
T:xpbctbxabpqxctbpq
P:  tpabxab
      *^^^^
P:    tpabxab
            *
Right to left scan / bad character rule
  0        1
  123456789012345678
T:xpbctbxabpqxctbpqz
P:  tpabxab
      *^^^^
P:    tpabxab
            *
P:           tpabxab
Bad character rule • Comparing r-to-l, mismatch at i of P, k of T: • If T(k) is absent from P, shift left end of P to k+1 of T • If right-most T(k) in P is to left of i, shift pattern to align T(k) characters • Otherwise shift pattern 1 position
Right to left scan / bad character rule
  0        1
  12345678901234567
T:xpbctbaabpqxctbpq
P:  tpabxab
        *^^
Extended bad character rule • Comparing r-to-l, mismatch at i of P, k of T: • If T(k) is absent from P[1…i-1], shift left end of P to k+1 of T • Otherwise, shift pattern to align the right-most T(k) in P to the left of i with T(k) • Either way the shift is at least 1 position
Right to left scan / extended bad character rule
  0        1
  12345678901234567
T:xpbctbaabpqxctbpq
P:  tpabxab
        *^^
Right to left scan / extended bad character rule
  0        1
  12345678901234567
T:xpbctbaabpqxctbpq
P:    tpabxab
(Extended) bad character rule • For all x in Σ, R(x) is the position of the right-most occurrence of x in P. R(x) is zero if x is absent from P. • Comparing r-to-l, mismatch at i of P, k of T: shift P right max[1, i-R(T(k))] positions • For the extended bad character rule, need to look up R(x,i), the right-most occurrence of x in P[1…i-1]
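The table R(x) and the two shift rules can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function names are made up for the example, positions are 1-based to match the slides, and the extended rule is implemented with per-character position lists, which is only one possible way to look up R(x,i).

```python
def rightmost_positions(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern.

    Characters absent from the pattern are simply not stored (read as 0).
    """
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos  # a later occurrence overwrites an earlier one
    return R


def bad_character_shift(R, i, text_char):
    """Shift after a mismatch at pattern position i against text_char."""
    return max(1, i - R.get(text_char, 0))


def occurrence_lists(pattern):
    """For each character, the increasing list of its 1-based positions."""
    occ = {}
    for pos, ch in enumerate(pattern, start=1):
        occ.setdefault(ch, []).append(pos)
    return occ


def extended_bad_character_shift(occ, i, text_char):
    """Shift that aligns the right-most text_char left of position i."""
    for pos in reversed(occ.get(text_char, [])):
        if pos < i:
            return i - pos
    return i  # text_char absent from P[1..i-1]: move P past the mismatch


P = "tpabxab"
R = rightmost_positions(P)
occ = occurrence_lists(P)
print(bad_character_shift(R, 3, "t"))             # 2: align the 't' at position 1
print(bad_character_shift(R, 5, "a"))             # 1: right-most 'a' is right of i
print(extended_bad_character_shift(occ, 5, "a"))  # 2: uses the 'a' at position 3
print(bad_character_shift(R, 7, "q"))             # 7: 'q' is absent from P
```

The second and third calls correspond to the slide example with T:xpbctbaabpqxctbpq, where the plain rule can only shift by 1 and the extended rule shifts by 2.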
(Strong) good suffix rule
  0        1
  123456789012345678
T:prstabstubabvqxrst
P:  qcabdabdab
           *
(Strong) good suffix rule
  0        1
  123456789012345678
T:prstabstubabvqxrst
P:  qcabdabdab
           *^^
P:        qcabdabdab
(Strong) good suffix rule
  0        1
  123456789012345678
T:prstabstudabvqxrst
P:    abdubdab
          *^^^
(Strong) good suffix rule
  0        1
  123456789012345678
T:prstabstudabvqxrst
P:    abdubdab
          *^^^
P:          abdubdab
(Strong) good suffix rule Substring t of T matches a suffix of P: • Find the right-most copy t’ of t in P s.t. t’ is not a suffix of P and the char to the left of t’ in P ≠ the char to the left of the matched suffix in P; shift P to align t’ in P with t in T • If no such t’ exists, shift P by the least amount so that a prefix of P aligns with a matching suffix of t in T; if there is no such prefix, shift P by n positions
(Strong) good suffix rule Definitions: • L(i) – max j < n such that P[i…n] matches a suffix of P[1…j], 0 if no such j. • L’(i) – max j < n such that P[i…n] matches a suffix of P[1…j] and the char. before that suffix ≠ P(i-1), 0 if no such j. • L(i) gives the weak shift and L’(i) the strong shift for the first part of the good suffix rule.
Computing L’(i) Definition: N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. Compare with: Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S. Compute N_j(P) as Z_{n-j+1}(reverse(P)).
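The reduction from N values to Z values can be sketched as below. The Z computation is the standard linear-time window method, written here from the definition on the slide. The function names and the convention of leaving index 0 of the N list unused (so that N[j] reads like N_j(P)) are assumptions made for this example.

```python
def z_values(s):
    """z[i] = length of the longest prefix of s[i:] that is also a prefix of s.

    Indices are 0-based here, so z[i] corresponds to Z_{i+1}(S) on the slide.
    """
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # s[left:right] is the right-most known prefix match
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """N[j] for j = 1..n, with index 0 unused (kept as 0)."""
    n = len(pattern)
    z = z_values(pattern[::-1])
    # N_j(P) = Z_{n-j+1}(reverse(P)); 1-based index n-j+1 is 0-based n-j.
    return [0] + [z[n - j] for j in range(1, n + 1)]


N = n_values("qcabdabdab")
print(N[4], N[7], N[10])  # 2 5 10
```

For P = qcabdabdab, N_7(P) = 5 because the suffix abdab of P[1…7] is also a suffix of P.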
Computing L’(i) • L’(i) – max j < n s.t. N_j(P) = |P[i…n]| = n – i + 1
(Strong) good suffix rule Definition: l’(i) – length of the longest prefix of P that is also a suffix of P[i…n], 0 if no such prefix exists. l’(i) – max j ≤ |P[i…n]| = n – i + 1 s.t. N_j(P) = j
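Given the N values, both tables follow from one pass each over j. The sketch below is an illustration under stated assumptions: the names are invented for the example, and N is computed by direct comparison (quadratic in the worst case) only to keep the block self-contained. The Z-based computation is the linear-time route.

```python
def n_values_naive(pattern):
    """N[j] for j = 1..n by direct comparison (index 0 unused)."""
    n = len(pattern)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        # Extend the common suffix of P[1..j] and P one character at a time.
        while N[j] < j and pattern[j - 1 - N[j]] == pattern[n - 1 - N[j]]:
            N[j] += 1
    return N


def good_suffix_tables(pattern):
    """Return (L', l') as lists indexed 1..n (indices 0 and n+1 are padding)."""
    n = len(pattern)
    N = n_values_naive(pattern)
    L_strong = [0] * (n + 2)
    l_prefix = [0] * (n + 2)
    # L'(i) = max j < n with N_j(P) = n - i + 1.
    # Visiting j in increasing order leaves the largest j in each slot.
    for j in range(1, n):
        if N[j] > 0:
            L_strong[n - N[j] + 1] = j
    # l'(i) = max j <= n - i + 1 with N_j(P) = j.
    longest = 0
    for j in range(1, n + 1):
        if N[j] == j:
            longest = j
        l_prefix[n - j + 1] = longest
    return L_strong, l_prefix


L_strong, l_prefix = good_suffix_tables("qcabdabdab")
print(L_strong[9])  # 4: the copy of "ab" ending at position 4 follows a 'c'
```

For P = qcabdabdab and a mismatch at position 8, the shift is n – L’(9) = 10 – 4 = 6, as in the slide example. For P = abdubdab and a mismatch at position 5, L’(6) = 0 and l’(6) = 2, so the shift is 8 – 2 = 6.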
Boyer-Moore pseudocode (n = |P|, m = |T|)
Compute L’(i), l’(i), and R(x) for x in Σ.
k = n
while k ≤ m
    i = n, h = k
    while i > 0 and P(i) = T(h)
        i--; h--
    if i = 0
        report occurrence of P in T ending at k
        k = k + n – l’(2)
    else
        if i = n, λ = n – 1 (no suffix matched: good suffix shift is 1)
        else if L’(i+1) > 0, λ = L’(i+1)
        else λ = l’(i+1)
        k = k + max{ 1, i – R(T(h)), n – λ }
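A runnable Python version of the search loop might look like the sketch below. It follows the slide's 1-based indexing by padding both strings with a leading blank. Several things are assumptions of this example and not part of the lecture: the function name, the choice to return 1-based start positions, treating l’(2) as 0 when n = 1, and computing N by direct comparison for brevity (so the preprocessing here is not linear).

```python
def boyer_moore(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    P = " " + pattern  # padding so that P[1..n] and T[1..m] are 1-based
    T = " " + text

    # R(x): right-most position of x in P; absent characters read as 0.
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}

    # N[j]: longest suffix of P[1..j] that is also a suffix of P.
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - N[j]] == P[n - N[j]]:
            N[j] += 1

    L_strong = [0] * (n + 2)  # L'(i)
    l_prefix = [0] * (n + 2)  # l'(i)
    for j in range(1, n):
        if N[j] > 0:
            L_strong[n - N[j] + 1] = j
    longest = 0
    for j in range(1, n + 1):
        if N[j] == j:
            longest = j
        l_prefix[n - j + 1] = longest

    matches = []
    k = n  # text position aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and P[i] == T[h]:
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += n - l_prefix[2]
        else:
            bad = i - R.get(T[h], 0)
            if i == n:
                good = 1  # nothing matched, so no good suffix to reuse
            elif L_strong[i + 1] > 0:
                good = n - L_strong[i + 1]
            else:
                good = n - l_prefix[i + 1]
            k += max(1, bad, good)
    return matches


print(boyer_moore("abra", "abracadabra"))  # [1, 8]
```

Padding with a blank costs one extra character per string but lets the loop read line for line like the pseudocode.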
Running time analysis • Notice that unlike K-M-P, we might re-compare text characters that matched in a previous iteration. • Worst instance does Θ(nm) total comparisons, but only if P is in T • If P is not in T, O(n+m) running time • complicated proof! • What goes wrong when P is in T?
Worst case instance, P in T
  0        1
  12345678901234567
T:aaaaaaaaaaaaaaaaa
P:aaaaaaa
  ^^^^^^^
P: aaaaaaa
   ^^^^^^^
Galil’s Extension • Comparing r-to-l, with n of P aligned to k of T and compared down to character s of T: if the shift moves pos 1 of P past s, then a prefix of P is known to match T up to pos k. • In the next phase, skip these comparisons: stop the r-to-l scan once it reaches pos k. • Sufficient for a linear time bound, whether or not P is in T.
Worst case instance, P in T
  0        1
  12345678901234567
T:aaaaaaaaaaaaaaaaa
P:aaaaaaa
  ^^^^^^^
P: aaaaaaa
         ^
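The effect on this instance can be checked by counting comparisons. The sketch below is a deliberately simplified matcher, not full Boyer-Moore: it scans right to left, shifts by n – l’(2) after an occurrence, and shifts by only 1 after a mismatch. The skip is applied only after a full occurrence, which is a restricted form of the rule above. The function name and the `galil` flag are assumptions of this example, and the pattern must be non-empty.

```python
def scan_with_counts(pattern, text, galil):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(pattern), len(text)
    # l'(2): longest proper prefix of the pattern that is also a suffix.
    border = next(b for b in range(n - 1, -1, -1)
                  if pattern[:b] == pattern[n - b:])
    matches, comparisons = [], 0
    k = n      # text position aligned with the right end of P
    known = 0  # text positions <= known already match a prefix of P
    while k <= m:
        i, h = n, k
        mismatch = False
        while i > 0 and h > known:
            comparisons += 1
            if pattern[i - 1] != text[h - 1]:
                mismatch = True
                break
            i -= 1
            h -= 1
        if mismatch:
            known = 0
            k += 1  # simplification: no bad character or good suffix shift
        else:
            matches.append(k - n + 1)
            # After an occurrence ending at k, the shifted pattern's prefix
            # of length l'(2) is already known to match the text up to k.
            known = k if galil else 0
            k += n - border
    return matches, comparisons


text, pattern = "a" * 17, "a" * 7
print(scan_with_counts(pattern, text, galil=False)[1])  # 77
print(scan_with_counts(pattern, text, galil=True)[1])   # 17
```

Without the skip, each of the 11 alignments costs 7 comparisons. With it, only the first alignment costs 7 and each later one costs a single comparison, as in the diagram.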
Galil’s Extension
  0        1
  123456789012345678
T:prstabstudabvqxrst
P:    abdubdab
          *^^^
P:          abdubdab
Lessons From B-M • Sub-linear time is possible • But we still need to read T from disk! • Bad cases require periodicity in P or T • Matching random P with T is easy! • Large alphabets mean large shifts • Small alphabets make complicated shift data structures possible • B-M better for English text and amino acids than for DNA.