A fast algorithm for approximate string matching on gene sequences

Zheng Liu, Xin Chen, James Borneman and Tao Jiang • University of California, Riverside

Outline • Background and motivation • Idea and analysis for FAAST • Experimental results • Conclusion

Background • Approximate string matching: pattern P = p_1 p_2 … p_m, text T = t_1 t_2 … t_n • k-mismatch • k-difference • Applications: text processing and gene sequence analysis.
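As a baseline for the k-mismatch problem named above, here is a minimal brute-force sketch. It assumes the usual definition (report every text position where P aligns with at most k mismatching characters); the function name `k_mismatch_occurrences` is illustrative and does not come from the slides.

```python
def k_mismatch_occurrences(text, pattern, k):
    """Return all 0-based start positions where pattern matches text
    with at most k mismatches (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Count positions where the aligned characters differ.
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Text and pattern reused from the example slides of this talk.
    print(k_mismatch_occurrences("AACTGTTAACTTGCGACTAG", "AAATCGTAAC", 3))
```

This costs O(mn) character comparisons; the algorithms on the following slides aim to skip most alignments.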

Motivation of FAAST • Motivation: gene sequence acquisition • Modeled as the k-mismatch problem
Primers: AAGTC, CCGTA
AAGTC………CCGTA
TACTT………CCGTT
…
ACGTC………GCGTA
…
AAGTC………CCGTA
…
ACGTC………GCGTA

Algorithms for the k-mismatch problem • 1992, Shift-Add by Baeza-Yates and Gonnet. • 1996, BM with Shift-Add by El-Mabrouk and Crochemore. • 1993, BM extension (bad-character rule) by Tarhio and Ukkonen. • 1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.

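FAAST • A further generalization of the Tarhio-Ukkonen algorithm.
Tarhio-Ukkonen (check the last k+1 aligned characters):
T: t_{j-m+1} t_{j-m+2} … t_{j-k} … t_{j-2} t_{j-1} t_j
P: p_1 p_2 … p_{m-k} … p_{m-2} p_{m-1} p_m
FAAST (check the last k+x aligned characters):
T: t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-k} … t_{j-2} t_{j-1} t_j
P: p_1 p_2 … p_{m-k-x+1} … p_{m-k} … p_{m-2} p_{m-1} p_m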

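Algorithm outline • T: AACTGTTAACTTGCGACTAG, P: AAATCGTAAC (k=2, x=3) • Candidate shifts of P are examined in increasing order; a shift is rejected (Χ) when the last k+x aligned text characters have too few matches with the shifted P • The first shift that passes (☺) is shift 6.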

An example • k=2, x=3, m=10, n=20
Initial alignment:
T: AACTGTTAACTTGCGACTAG
P: AAATCGTAAC
Shift 1 by Tarhio-Ukkonen:
T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC
Shift 6 by FAAST:
T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC
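The shifts in this example can be reproduced with a short sketch. It assumes the shift rule stated on the construction slides (the smallest shift l for which the last k+x aligned text characters have at least min(x, m-k-l) matches with the shifted P) and computes the shift on the fly instead of reading it from a precomputed table. The names `shift_distance` and `faast_search` are illustrative, and x=1 is taken to correspond to the Tarhio-Ukkonen check of the last k+1 characters.

```python
def shift_distance(pattern, tail, k, x):
    """Smallest shift l >= 1 such that the text characters in `tail`
    (the last k+x characters of the current window) still have at least
    min(x, m-k-l) matches with the pattern shifted right by l."""
    m, w = len(pattern), k + x
    assert 0 <= k < m and x >= 1 and w <= m and len(tail) == w
    for l in range(1, m - k + 1):
        matches = 0
        for q, c in enumerate(tail):
            i = m - w + q - l  # pattern index now aligned with tail[q]
            if i >= 0 and pattern[i] == c:
                matches += 1
        if matches >= min(x, m - k - l):
            return l
    return m - k  # not reached: l = m-k always passes (requirement is 0)


def faast_search(text, pattern, k, x):
    """Report start positions with at most k mismatches, skipping
    alignments with the shift rule above (a sketch, not the authors' code)."""
    n, m, w = len(text), len(pattern), k + x
    hits = []
    end = m  # exclusive end of the current window in text
    while end <= n:
        start = end - m
        mismatches = sum(a != b for a, b in zip(pattern, text[start:end]))
        if mismatches <= k:
            hits.append(start)
        end += shift_distance(pattern, text[end - w:end], k, x)
    return hits


if __name__ == "__main__":
    T, P = "AACTGTTAACTTGCGACTAG", "AAATCGTAAC"
    print(shift_distance(P, T[7:10], 2, 1))  # last k+1 = 3 chars of the first window
    print(shift_distance(P, T[5:10], 2, 3))  # last k+x = 5 chars of the first window
    print(faast_search(T, P, 2, 3))
```

For the first window the two calls give shifts 1 and 6, matching the slide. A real implementation would replace `shift_distance` with a table lookup.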

Construction of shift table • Heuristic: after a shift, the last k+x aligned text characters must have at least x matches with P; if only y ≤ k+x of them still overlap P, at least y-k must match. T: AACTGTTAACTTGCGACTA [k=2, x=3] P: AAGTCGTAAC …. AAGTCGTAAC

Construction details • $V_{kx}[t_{j-k-x+1} \dots t_j,\ l]$: marks the characters of $t_{j-k-x+1} \dots t_j$ that match P after P is shifted by l. • $d_{kx}[t_{j-k-x+1} \dots t_j]$: stores the minimum shift distance l such that $V_{kx}[t_{j-k-x+1} \dots t_j,\ l]$ contains at least $\min(x,\ m-k-l)$ marked items.
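The two tables can be sketched directly from these definitions. The code below is an illustration under stated assumptions, not the authors' construction: it enumerates every (k+x)-gram over an assumed DNA alphabet "ACGT", stores V as a list of booleans per (gram, l) pair, and derives d from V. The name `build_shift_tables` and the dictionary layout are hypothetical.

```python
from itertools import product


def build_shift_tables(pattern, k, x, alphabet="ACGT"):
    """Build V[(gram, l)] and d[gram] for every (k+x)-gram over alphabet.

    V[(gram, l)][q] is True when gram[q], aligned with the q-th of the
    last k+x window positions, matches the pattern shifted right by l.
    d[gram] is the smallest l whose V entry has >= min(x, m-k-l) marks.
    """
    m, w = len(pattern), k + x
    assert 0 <= k < m and x >= 1 and w <= m
    V, d = {}, {}
    for gram in map("".join, product(alphabet, repeat=w)):
        for l in range(1, m - k + 1):
            offset = m - w - l  # pattern index aligned with gram[0]
            V[gram, l] = [
                offset + q >= 0 and pattern[offset + q] == c
                for q, c in enumerate(gram)
            ]
        # l = m-k always qualifies (requirement is 0), so next() succeeds.
        d[gram] = next(
            l for l in range(1, m - k + 1)
            if sum(V[gram, l]) >= min(x, m - k - l)
        )
    return V, d


if __name__ == "__main__":
    V, d = build_shift_tables("AAGTCGTAAC", k=2, x=3)
    print(len(d), d["TTAAC"], V["TTAAC", d["TTAAC"]])
```

The table d has one entry per (k+x)-gram, so its size grows as the alphabet size raised to the power k+x.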

Construction details – cont’d • P: AAGTCGTAAC (k=2, x=3, l = 1..8): values of $V_{kx}[t_{j-k-x+1} \dots t_j,\ l]$ and $d_{kx}[t_{j-k-x+1} \dots t_j]$

Theoretical support • Correctness of FAAST • We use the random string assumption • Average shift distance • Total number of character comparisons

Correctness of FAAST • Theorem 1. When the end of P is aligned with $t_{j-k-x+1} \dots t_j$, we can always shift P to the right by $d_{kx}[t_{j-k-x+1} \dots t_j]$ without missing any approximate occurrence of P.
T: t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-2} t_{j-1} t_j
P: p_1 p_2 … p_{i-k-x+1} … p_{i-2} p_{i-1} p_i … p_m (current)
P: p_1 p_2 … p_{i-k-x+1} … p_{i'-k-x+1} … p_{i'-2} p_{i'-1} p_{i'} … p_m (i < i')

Average shift distance • Lemma 1. The probability $P_{kx}$ that the last k+x characters of T have at least x matches is: $P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$ • Theorem 1. The average shift distance of FAAST is: $E^{d}_{kx} = \sum_{s=0}^{\infty} s\,(1-P_{kx})^{s-1} P_{kx} = 1/P_{kx}$
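To get a feel for these formulas, the following sketch evaluates Lemma 1 and the average shift distance exactly with rational arithmetic. It assumes p is the probability that two random characters match, and uses p = 1/4 (a uniform DNA alphabet) purely as an illustrative value; the slides do not state the value of p, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def match_probability(k, x, p):
    """P_kx from Lemma 1: probability that at least x of the last
    k+x characters match, each matching independently with probability p."""
    w = k + x
    fewer_than_x = sum(
        comb(w, i) * (1 - p) ** (w - i) * p ** i for i in range(x)
    )
    return 1 - fewer_than_x


def average_shift(k, x, p):
    """Average shift distance 1 / P_kx (mean of a geometric distribution)."""
    return 1 / match_probability(k, x, p)


if __name__ == "__main__":
    p = Fraction(1, 4)  # assumed match probability, not from the slides
    for x in (1, 2, 3):
        print(x, match_probability(2, x, p), float(average_shift(2, x, p)))
```

For k=2, x=3 and p=1/4 this gives P_kx = 53/512, so the formula predicts an average shift of 512/53 ≈ 9.66. The formula itself does not account for the cap of m-k on any single shift.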

Average shift distance under different values of x.

Total character comparisons • Lemma 2. The expected number of comparisons between two shifts is: $E^{c}_{kx} = (k+x)/(1-p)$ • Theorem 2. The expected total number of comparisons for a text of length n is: $TE^{c}_{kx} = n\,P_{kx}\,(k+x)/(1-p)$
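One way to read Theorem 2 is as the expected number of shifts multiplied by the expected cost per shift. The slides do not spell out this step, so the decomposition below is an interpretation, and the numeric instance assumes k=2, x=3 and p=1/4 (uniform DNA), values chosen only for illustration.

```latex
\begin{align*}
\text{expected number of shifts} &\approx \frac{n}{E^{d}_{kx}} = n\,P_{kx},\\
TE^{c}_{kx} &\approx n\,P_{kx}\cdot E^{c}_{kx} = n\,P_{kx}\,\frac{k+x}{1-p}.
\intertext{For $k=2$, $x=3$, $p=\tfrac{1}{4}$:}
P_{kx} &= \tfrac{53}{512},\qquad
E^{c}_{kx} = \frac{5}{3/4} = \tfrac{20}{3},\\
TE^{c}_{kx} &\approx n\cdot\tfrac{53}{512}\cdot\tfrac{20}{3}
 = \tfrac{1060}{1536}\,n \approx 0.69\,n.
\end{align*}
```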

Total character comparisons

Difference of total character comparisons under different x

Experimental results • A PC with a 2.8 GHz CPU and 1 GB of memory • Simulated random string testing • Real DNA gene sequence data

Results on simulated sequences • Text: a 2M-base sequence; pattern: 39 bases; k=3.

Result on real sequences • Text: 150 bacteria DNA sequences, k=3 • Text: 150 fungi DNA sequences, k=3

Conclusion • FAAST is a competitive algorithm for the k-mismatch problem on gene sequences. • Time and memory increase with larger x and alphabet size.
