Exact String Search

Lecture 7: September 22, 2005. Algorithms in Biosequence Analysis. Nathan Edwards, Fall 2005.

Boyer-Moore
• Method of choice for exact string search with a single pattern
• Typically examines fewer than m characters of the text, where m is the text length (sublinear time)
• Linear worst-case running time (using Galil’s extension when the pattern occurs in the text)
• Conceptually very similar to K-M-P, but the running time proof is more complicated
• Empirically, better for English text than for DNA sequence

Boyer-Moore
• Three key ideas
   • Right to left scan
   • Bad character rule
   • (Strong) good suffix rule
• The combination of these ideas can produce large pattern shifts.
• Provable O(n+m) running time when the pattern is not in the text (n is the pattern length, m is the text length)
• An extension is needed to achieve linear running time when the pattern is in the text.

Right to left scan / bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqxctbpq
P:   tpabxab
       *^^^^

Right to left scan / bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqxctbpq
P:   tpabxab
       *^^^^
P:     tpabxab
             *

Right to left scan / bad character rule
   0        1
   123456789012345678
T: xpbctbxabpqxctbpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

Bad character rule
Comparing right to left, mismatch at position i of P, aligned with position k of T:
• If T(k) is absent from P, shift the left end of P to position k+1 of T
• If the right-most occurrence of T(k) in P is to the left of i, shift P to align that occurrence with T(k)
• Otherwise, shift P by 1 position

Right to left scan / bad character rule
   0        1
   12345678901234567
T: xpbctbaabpqxctbpq
P:   tpabxab
         *^^

Extended bad character rule
Comparing right to left, mismatch at position i of P, aligned with position k of T:
• If T(k) is absent from P[1…i-1], shift the left end of P to position k+1 of T
• Otherwise, take the right-most occurrence of T(k) in P that is to the left of i and shift P to align it with T(k)
• Either way, the shift is at least 1 position

Right to left scan / extended bad character rule
   0        1
   12345678901234567
T: xpbctbaabpqxctbpq
P:   tpabxab
         *^^

Right to left scan / extended bad character rule
   0        1
   12345678901234567
T: xpbctbaabpqxctbpq
P:     tpabxab

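(Extended) bad character rule
• For all x in Σ, R(x) is the position of the right-most occurrence of x in P; R(x) is zero if x is absent from P.
• Comparing right to left, mismatch at position i of P and position k of T: shift P right by max[1, i - R(T(k))] positions
• For the extended bad character rule, look up R(x,i) instead: the position of the right-most occurrence of x in P[1…i-1], zero if there is none. The shift is then i - R(T(k),i), which is always at least 1.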
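The two lookup rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, positions are 1-based as on the slides, and the extended rule is implemented with per-character position lists (one possible way to answer R(x,i) queries, not necessarily the one intended in the lecture).

```python
def rightmost_table(P):
    """R(x): 1-based position of the right-most occurrence of x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos          # later occurrences overwrite earlier ones
    return R

def occurrence_lists(P):
    """For the extended rule: positions of each character, right-most first."""
    occ = {}
    for pos in range(len(P), 0, -1):
        occ.setdefault(P[pos - 1], []).append(pos)
    return occ

def bad_char_shift(R, i, c):
    """Simple rule: mismatch at position i of P against text character c."""
    return max(1, i - R.get(c, 0))

def extended_bad_char_shift(occ, i, c):
    """Extended rule: use the right-most occurrence of c strictly left of i."""
    for pos in occ.get(c, []):
        if pos < i:
            return i - pos
    return i                 # c absent from P[1..i-1]: move P past the mismatch

P = "tpabxab"
R = rightmost_table(P)
occ = occurrence_lists(P)
print(bad_char_shift(R, 3, "t"))             # 2, as in the first slide example
print(bad_char_shift(R, 5, "a"))             # 1: right-most 'a' is right of i
print(extended_bad_char_shift(occ, 5, "a"))  # 2: uses the 'a' at position 3
```

The list scan in extended_bad_char_shift costs time proportional to the number of occurrences skipped; a full R(x,i) table would make the lookup constant time at the cost of more space.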

(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *

(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab

(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstudabvqxrst
P:     abdubdab
           *^^^

(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstudabvqxrst
P:     abdubdab
           *^^^
P:           abdubdab

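(Strong) good suffix rule
Substring t of T matches a suffix of P, and the next character to the left mismatches:
• Find the right-most copy t’ of t in P such that t’ is not a suffix of P and the character to the left of t’ in P differs from the character to the left of t in P; shift P to align t’ in P with t in T
• If no such t’ exists, shift P by the least amount such that a prefix of the shifted P matches a suffix of t in T (a shift of n positions if no prefix qualifies)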

(Strong) good suffix rule
Definitions:
• L(i): the largest j < n such that P[i…n] matches a suffix of P[1…j]; 0 if no such j.
• L’(i): the largest j < n such that P[i…n] matches a suffix of P[1…j] and the character preceding that suffix is not P(i-1); 0 if no such j.
L(i) gives the weak shift and L’(i) the strong shift for the first part of the good suffix rule: when L’(i) > 0, P shifts right by n - L’(i) positions.

Computing L’(i)
Definition: N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P.
Compare with: Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.

Computing L’(i)
Definition: N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. (!)
Compare with: Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.
Compute N_j(P) as Z_{n-j+1}(reverse(P)).
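As a rough illustration of the reversal trick, the sketch below computes Z values with the standard linear-time Z algorithm and reads N_j(P) off the reversed pattern. The names z_array and n_array are hypothetical; Z is stored 0-based (Z[0] is set to |S| for convenience), while N is returned 1-based with an unused slot at index 0.

```python
def z_array(S):
    """Z[i] = length of the longest prefix of S[i:] that is also a prefix of S."""
    n = len(S)
    Z = [0] * n
    if n == 0:
        return Z
    Z[0] = n
    l = r = 0                       # current Z-box is S[l:r]
    for i in range(1, n):
        if i < r:                   # reuse information from the Z-box
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1               # extend by explicit comparison
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def n_array(P):
    """N[j] for j = 1..n: longest suffix of P[1..j] that is also a suffix of P."""
    n = len(P)
    Zr = z_array(P[::-1])
    # 1-based N_j(P) = Z_{n-j+1}(reverse(P)); with 0-based Z this is Zr[n - j].
    return [0] + [Zr[n - j] for j in range(1, n + 1)]

print(n_array("qcabdabdab"))  # [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 10]
```

For the pattern of the good suffix example, N_4 = 2 and N_7 = 5 correspond to the two inner copies of "ab" and "abdab" that end at positions 4 and 7.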

Computing L’(i)
• L’(i) is the largest j < n such that N_j(P) = |P[i…n]| = n - i + 1

(Strong) good suffix rule
Definition: l’(i) is the length of the longest prefix of P that is also a suffix of P[i…n]; 0 if no such prefix exists.
Equivalently, l’(i) is the largest j ≤ n - i + 1 such that N_j(P) = j.
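Both tables follow directly from the N values. The sketch below is a minimal illustration under a few assumptions: the function names are hypothetical, N is computed naively in O(n²) time to keep the block self-contained (the Z-based computation would make it linear), and the arrays are 1-based with extra padding so that index n+1 is valid.

```python
def n_array_naive(P):
    """N[j] (1-based): longest suffix of P[1..j] that is also a suffix of P."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N

def good_suffix_tables(P):
    """Return (Lp, lp) holding L'(i) and l'(i) for i = 1..n (index 0 unused)."""
    n = len(P)
    N = n_array_naive(P)
    Lp = [0] * (n + 2)
    for j in range(1, n):            # increasing j, so the largest j wins
        if N[j] > 0:
            Lp[n - N[j] + 1] = j     # N_j = n - i + 1  <=>  i = n - N_j + 1
    lp = [0] * (n + 2)
    for i in range(n, 0, -1):
        j = n - i + 1                # longest candidate prefix for this i
        lp[i] = j if N[j] == j else lp[i + 1]
    return Lp, lp

Lp, lp = good_suffix_tables("qcabdabdab")
print(Lp[9])   # 4: mismatch at position 8 gives a shift of n - L'(9) = 6
Lp2, lp2 = good_suffix_tables("abdubdab")
print(Lp2[6], lp2[6])   # 0 2: no usable copy of "dab", but prefix "ab" fits
```

These two lookups reproduce the shifts in the good suffix diagrams: 6 positions for qcabdabdab (via L’), and 8 - 2 = 6 positions for abdubdab (via l’).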

Boyer-Moore pseudocode
Compute L’(i), l’(i), and R(x) for x in Σ.
k = n
while k ≤ m
    i = n, h = k
    while i > 0 and P(i) = T(h)
        i--; h--
    if i = 0
        report occurrence of P in T ending at position k
        k = k + n - l’(2)
    else
        if i = n, λ = n - 1   (nothing matched: good suffix shift is 1)
        else if L’(i+1) > 0, λ = L’(i+1)
        else λ = l’(i+1)
        k = k + max{ 1, i - R(T(h)), n - λ }
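The pseudocode translates almost line for line into Python. The following is a sketch rather than a tuned implementation: the name boyer_moore is hypothetical, the simple (not extended) bad character rule is used, the N values are computed naively in O(n²) for brevity, and a mismatch at position n is given a good suffix shift of 1, which is an assumption about how the slide's L’(n+1) case is meant to be handled.

```python
def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # R(x): right-most position of x in P (absent characters map to 0).
    R = {ch: pos for pos, ch in enumerate(P, start=1)}
    # N_j(P), computed naively here; a Z-based version would be linear.
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1
    # L'(i) and l'(i), 1-based, padded so that index n + 1 exists.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    for i in range(n, 0, -1):
        j = n - i + 1
        lp[i] = j if N[j] == j else lp[i + 1]

    hits = []
    k = n                                # right end of P in T
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - lp[2]               # shift after a full match
        else:
            if i == n:
                good = 1                 # nothing matched
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            bad = i - R.get(T[h - 1], 0)
            k += max(1, bad, good)
    return hits

print(boyer_moore("aba", "ababababa"))                   # [1, 3, 5, 7]
print(boyer_moore("tpabxab", "xpbctbxabpqxctbpqz"))      # []
```

Reported positions are 1-based to match the slides; subtract 1 to index into a Python string.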

Running time analysis
• Unlike K-M-P, Boyer-Moore may re-compare text characters that matched in a previous iteration.
• The worst instance does Θ(nm) total comparisons, but only if P is in T
• If P is not in T, O(n+m) running time
   • complicated proof!
• What goes wrong when P is in T?

Worst case instance, P in T
   0        1
   12345678901234567
T: aaaaaaaaaaaaaaaaa
P: aaaaaaa
   ^^^^^^^
P:  aaaaaaa
    ^^^^^^^

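Galil’s Extension
• Comparing right to left, with position n of P aligned to position k of T, comparisons reach down to character s of T: if the shift moves position 1 of P past s, then a prefix of P is already known to match T up to position k.
• Skip these comparisons: the next phase only needs to compare down to position k+1 of T
• Sufficient for a linear time bound, whether or not P is in T.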

Worst case instance, P in T
   0        1
   12345678901234567
T: aaaaaaaaaaaaaaaaa
P: aaaaaaa
   ^^^^^^^
P:  aaaaaaa
          ^

Galil’s Extension
   0        1
   123456789012345678
T: prstabstudabvqxrst
P:     abdubdab
           *^^^
P:           abdubdab
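To see the effect of Galil’s rule on the worst-case instance, the sketch below counts character comparisons with and without it. Several simplifications are assumed: only the strong good suffix rule is used for shifting (so every shift that moves the left end of P past s is known to leave a matching prefix), N is computed naively, and the names search_good_suffix and known are hypothetical.

```python
def good_suffix_tables(P):
    """L'(i) and l'(i), 1-based, from naively computed N_j(P)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1
    Lp, lp = [0] * (n + 2), [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    for i in range(n, 0, -1):
        j = n - i + 1
        lp[i] = j if N[j] == j else lp[i + 1]
    return Lp, lp

def search_good_suffix(P, T, galil=True):
    """Return (1-based occurrences, number of character comparisons)."""
    n, m = len(P), len(T)
    if n == 0:
        return [], 0
    Lp, lp = good_suffix_tables(P)
    hits, comparisons = [], 0
    k = n
    known = 0   # text positions up to `known` already match a prefix of P
    while k <= m:
        i, h = n, k
        mismatch = False
        while i > 0 and h > known:       # Galil: stop at the known prefix
            comparisons += 1
            if P[i - 1] != T[h - 1]:
                mismatch = True
                break
            i -= 1
            h -= 1
        if not mismatch:
            hits.append(k - n + 1)
            shift = n - lp[2]
            passes = True                # left end moves past position s
        else:
            if i == n:
                shift = 1
            elif Lp[i + 1] > 0:
                shift = n - Lp[i + 1]
            else:
                shift = n - lp[i + 1]
            passes = shift >= i          # left end moves past the mismatch
        known = k if (galil and passes) else 0
        k += shift
    return hits, comparisons

print(search_good_suffix("a" * 7, "a" * 17, galil=False)[1])  # 77
print(search_good_suffix("a" * 7, "a" * 17, galil=True)[1])   # 17
```

If the bad character rule is added back, the known-prefix shortcut should only be taken when the shift actually used is the good suffix shift; a larger bad character shift gives no guarantee about the overlapping prefix.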

Lessons From B-M
• Sub-linear time is possible
   • But we still need to read T from disk!
• Bad cases require periodicity in P or T
   • Matching random P with T is easy!
• Large alphabets mean large shifts
• Small alphabets make complicated shift data-structures possible
• B-M is better for “English” and amino acids than for DNA.
