A fast algorithm for approximate string matching on gene sequences Zheng Liu, Xin Chen, James Borneman and Tao Jiang University of California, Riverside
Outline • Background and motivation • Idea and analysis for FAAST • Experimental results • Conclusion
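Background • Approximate string matching: pattern P = p1p2…pm, text T = t1t2…tn • k-mismatch: occurrences of P in T with at most k substituted characters (Hamming distance ≤ k) • k-difference: occurrences with at most k insertions, deletions, or substitutions (edit distance ≤ k) • Applications: text processing and gene sequence analysis.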
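As a baseline for the k-mismatch problem defined above, the following minimal sketch checks every alignment of P against T and counts mismatches. The function names are illustrative and do not come from the slides; the example strings reuse the primer AAGTC and its variant ACGTC from the motivation slide.

```python
def hamming_mismatches(pattern, window, limit):
    """Count mismatching positions, stopping early once the count exceeds limit."""
    mismatches = 0
    for a, b in zip(pattern, window):
        if a != b:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def naive_k_mismatch(text, pattern, k):
    """Return every 0-based start position where pattern occurs with at most k mismatches."""
    m = len(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if hamming_mismatches(pattern, text[start:start + m], k) <= k
    ]


# Exact copy of the primer at position 0, one-mismatch variant at position 6.
print(naive_k_mismatch("AAGTCAACGTC", "AAGTC", 1))  # → [0, 6]
```

This baseline costs O(mn) comparisons in the worst case; the Boyer-Moore style algorithms on the following slides aim to skip most alignments.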
Motivation of FAAST • Motivation: gene sequence acquisition • Modeled as the k-mismatch problem
Primers: AAGTC, CCGTA
AAGTC………CCGTA
TACTT………CCGTT
…
ACGTC………GCGTA
…
AAGTC………CCGTA
…
ACGTC………GCGTA
Algorithms for the k-mismatch problem • 1992, Shift-Add by Baeza-Yates and Gonnet. • 1996, BM with Shift-Add by El-Mabrouk and Crochemore. • 1993, BM extension (bad-character rule) by Tarhio and Ukkonen. • 1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.
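FAAST • A further generalization of the Tarhio-Ukkonen algorithm.
Tarhio-Ukkonen: check the last k+1 text characters
t_{j-m+1} t_{j-m+2} … t_{j-k} … t_{j-2} t_{j-1} t_j
p_1 p_2 … p_{m-k} … p_{m-2} p_{m-1} p_m
FAAST: check the last k+x text characters
t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-k} … t_{j-2} t_{j-1} t_j
p_1 p_2 … p_{m-k-x+1} … p_{m-k} … p_{m-2} p_{m-1} p_m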
Algorithm outline (k=2, x=3)
T: AACTGTTAACTTGCGACTAG
P: AAATCGTAAC
    AAATCGTAAC Χ
     AAATCGTAAC Χ
    ……… Χ
         AAATCGTAAC ☺ (first shift: 6)
An example: k=2, x=3, m=10, n=20
Current alignment:
T: AACTGTTAACTTGCGACTAG
P: AAATCGTAAC
Shift 1 by Tarhio-Ukkonen:
T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC
Shift 6 by FAAST:
T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC
Construction of shift table • Heuristic: after the shift, the last k+x aligned text characters must have at least x matches with P. If only y ≤ k+x of these characters still overlap P after the shift, they must have at least y-k matches.
Example (k=2, x=3):
T: AACTGTTAACTTGCGACTA
P: AAGTCGTAAC
…
   AAGTCGTAAC
Construction details • V_kx[t_{j-k-x+1}…t_j, l]: marks the characters of t_{j-k-x+1}…t_j that match P after P is shifted right by l. • d_kx[t_{j-k-x+1}…t_j]: stores the minimum shift distance l such that V_kx[t_{j-k-x+1}…t_j, l] contains at least min[x, m-k-l] items.
Construction details (cont'd) • P: AAGTCGTAAC (k=2, x=3, l = 1..8), V_kx[t_{j-k-x+1}…t_j, l] and d_kx[t_{j-k-x+1}…t_j]
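The shift-table definition can be made concrete with a short sketch. It assumes 0-based strings, a DNA alphabet "ACGT", and k+x ≤ m. The names `count_matches`, `faast_shift`, and `build_shift_table` are hypothetical, and the exhaustive table over all (k+x)-grams is one straightforward reading of d_kx rather than the authors' actual construction.

```python
from itertools import product


def count_matches(pattern, gram, shift):
    """Size of V_kx[gram, shift]: gram characters matching P after shifting P right by shift."""
    m, g = len(pattern), len(gram)
    matches = 0
    for i, ch in enumerate(gram):
        # gram[i] sits at window offset m - g + i; after the shift it faces
        # pattern index m - g + i - shift (no pattern character if negative).
        pos = m - g + i - shift
        if pos >= 0 and pattern[pos] == ch:
            matches += 1
    return matches


def faast_shift(pattern, gram, k, x):
    """d_kx[gram]: smallest shift l with at least min(x, m-k-l) matching characters."""
    m = len(pattern)
    assert len(gram) == k + x <= m
    for shift in range(1, m - k + 1):
        if count_matches(pattern, gram, shift) >= min(x, m - k - shift):
            return shift
    return m - k  # not reached: shift = m - k requires zero matches


def build_shift_table(pattern, k, x, alphabet="ACGT"):
    """Precompute d_kx for every (k+x)-gram; the table has len(alphabet)**(k+x) entries."""
    return {
        "".join(chars): faast_shift(pattern, "".join(chars), k, x)
        for chars in product(alphabet, repeat=k + x)
    }


# Values from the worked example slide (P = AAATCGTAAC, window AACTGTTAAC).
print(faast_shift("AAATCGTAAC", "TTAAC", 2, 3))  # → 6 (FAAST, x = 3)
print(faast_shift("AAATCGTAAC", "AAC", 2, 1))    # → 1 (x = 1, the Tarhio-Ukkonen case)
```

The table size grows as |alphabet|^(k+x), which is consistent with the concluding remark that time and memory increase with larger x and alphabet size.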
Theoretical support • Correctness of FAAST • Analysis under the random-string assumption • Average shift distance • Total number of character comparisons
Correctness of FAAST • Theorem 1. When P is aligned with t_{j-k-x+1}…t_j, we can always shift P to the right by d_kx[t_{j-k-x+1}…t_j] without missing any approximate occurrence of P.
t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-2} t_{j-1} t_j
p_1 p_2 … p_{i-k-x+1} … p_{i-2} p_{i-1} p_i … p_m (current)
p_1 p_2 … p_{i-k-x+1} … p_{i'-k-x+1} … p_{i'-2} p_{i'-1} p_{i'} … p_m (i < i')
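The correctness claim can be checked empirically with a small sketch: a search loop that verifies each window and then shifts by d_kx, compared against brute force on random DNA strings. This is an illustration under assumptions, not the authors' implementation. All function names are hypothetical, shifts are computed lazily and cached instead of being read from a precomputed table, and k+x ≤ m is assumed.

```python
import random


def mismatches(pattern, window):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(pattern, window))


def shift_for(pattern, gram, k, x):
    """Smallest shift l whose aligned gram has at least min(x, m-k-l) matches."""
    m, g = len(pattern), len(gram)
    for shift in range(1, m - k + 1):
        matches = sum(
            1
            for i, ch in enumerate(gram)
            if m - g + i - shift >= 0 and pattern[m - g + i - shift] == ch
        )
        if matches >= min(x, m - k - shift):
            return shift
    return m - k  # not reached when k + x <= m


def faast_search(text, pattern, k, x):
    """Report 0-based starts of k-mismatch occurrences, shifting by d_kx each step."""
    m, n = len(pattern), len(text)
    cache, hits, start = {}, [], 0
    while start + m <= n:
        window = text[start:start + m]
        if mismatches(pattern, window) <= k:
            hits.append(start)
        gram = window[m - (k + x):]  # last k+x characters of the window
        if gram not in cache:
            cache[gram] = shift_for(pattern, gram, k, x)
        start += cache[gram]
    return hits


def brute_force(text, pattern, k):
    """Reference answer: test every alignment."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1)
            if mismatches(pattern, text[s:s + m]) <= k]


def agree_on_random_inputs(trials=200, seed=0):
    """True if both searches return identical hits on random DNA inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        m, k, x = rng.randint(6, 12), rng.randint(0, 2), rng.randint(1, 3)
        pattern = "".join(rng.choice("ACGT") for _ in range(m))
        text = "".join(rng.choice("ACGT") for _ in range(200))
        if faast_search(text, pattern, k, x) != brute_force(text, pattern, k):
            return False
    return True
```

Agreement on random inputs is only evidence, not a proof. The argument behind it is that any skipped shift l' < d_kx leaves fewer than min(x, m-k-l') matches among the aligned characters, hence more than k mismatches.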
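Average shift distance • Lemma 1. The probability $P_{kx}$ that the last k+x characters of T have at least x matches is: $P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$, where p is the probability that two characters match. • Theorem 1. The average shift distance of FAAST is: $E_{kx}^{d} = \sum_{s=0}^{\infty} s (1-P_{kx})^{s-1} P_{kx} = 1/P_{kx}$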
Total character comparisons • Lemma 2. The expected number of comparisons between two shifts is: $E_{kx}^{c} = (k+x)/(1-p)$ • Theorem 2. The expected total number of comparisons for a text of length n is: $TE_{kx}^{c} = n P_{kx} (k+x)/(1-p)$
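The formulas in the lemmas and theorems above are easy to evaluate numerically. The sketch below transcribes them directly. The function names are illustrative, and p = 0.25 (a uniform four-letter DNA alphabet) is an assumption for the example, not a value taken from the slides.

```python
from math import comb


def p_kx(k, x, p):
    """Lemma 1: probability that at least x of the last k+x characters match."""
    fewer_than_x = sum(
        comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i for i in range(x)
    )
    return 1.0 - fewer_than_x


def expected_shift(k, x, p):
    """Average shift distance, 1 / P_kx."""
    return 1.0 / p_kx(k, x, p)


def expected_comparisons_per_shift(k, x, p):
    """Lemma 2: (k + x) / (1 - p)."""
    return (k + x) / (1 - p)


def expected_total_comparisons(n, k, x, p):
    """Theorem 2: n * P_kx * (k + x) / (1 - p)."""
    return n * p_kx(k, x, p) * (k + x) / (1 - p)


# Assumed match probability for uniformly random DNA.
print(p_kx(2, 3, 0.25))  # → 0.103515625
```

Theorem 2 is the number of shifts, n divided by the average shift distance, multiplied by the per-shift cost from Lemma 2.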
Experimental results • A PC with a 2.8 GHz CPU and 1 GB of memory • Simulated random string testing • Real DNA gene sequence data
Results on simulated sequences • Text: a sequence of 2M bases, Pattern: 39 bases, k=3.
Results on real sequences • Text: 150 bacteria DNA sequences, k=3 • Text: 150 fungi DNA sequences, k=3
Conclusion • FAAST is a competitive algorithm for the k-mismatch problem on gene sequences. • Time and memory usage increase with larger x and larger alphabet size.