
A fast algorithm for approximate string matching on gene sequences

Zheng Liu, Xin Chen, James Borneman and Tao Jiang, University of California, Riverside.


Presentation Transcript


  1. A fast algorithm for approximate string matching on gene sequences Zheng Liu, Xin Chen, James Borneman and Tao Jiang University of California, Riverside

  2. Outline • Background and motivation • Idea and analysis for FAAST • Experimental results • Conclusion

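  3. Background • Approximate string matching: pattern P = p1p2…pm, text T = t1t2…tn • k-mismatch: find the positions in T where P occurs with at most k mismatched characters • k-difference: find the positions in T where P occurs with at most k differences (insertions, deletions or substitutions) • Applications: text processing and gene sequence analysis.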
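The k-mismatch definition above can be sketched as a brute-force baseline. This is only an illustration of the problem statement, not the FAAST method; the function names `hamming` and `k_mismatch_naive` are made up for this sketch.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def k_mismatch_naive(text: str, pattern: str, k: int) -> List[int]:
    """Return every 0-based start i where text[i:i+m] differs from
    pattern in at most k positions (m = len(pattern))."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= k]


# The pattern used later in the slides, aligned with the first 10 text bases:
print(hamming("AACTGTTAAC", "AAATCGTAAC"))   # 3 mismatches
print(k_mismatch_naive("abcabd", "abd", 1))  # [0, 3]
```

This baseline costs O(mn) character comparisons; the algorithms listed on the following slides aim to skip most alignments instead of testing each one.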

  4. Motivation of FAAST • Motivation: gene sequence acquisition • Modeled as the k-mismatch problem
     Primers: AAGTC CCGTA
       AAGTC………CCGTA
       TACTT………CCGTT
       …
       ACGTC………GCGTA
       …
       AAGTC………CCGTA
       …
       ACGTC………GCGTA

  5. Algorithms for the k-mismatch problem • 1992, Shift-Add by Baeza-Yates and Gonnet. • 1996, BM with Shift-Add by El-Mabrouk and Crochemore. • 1993, BM extension (bad-character rule) by Tarhio and Ukkonen. • 1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.

  6. FAAST • Further generalization of the Tarhio-Ukkonen algorithm.
     Check the last k+1 characters:
       T: t_{j-m+1} t_{j-m+2} …… t_{j-k} … t_{j-2} t_{j-1} t_j
       P: p_1 p_2 …… p_{m-k} … p_{m-2} p_{m-1} p_m
     Check the last k+x characters:
       T: t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-k} … t_{j-2} t_{j-1} t_j
       P: p_1 p_2 … p_{m-k-x+1} … p_{m-k} … p_{m-2} p_{m-1} p_m

  7. Algorithm outline (k=2, x=2)
     T: AACTGTTAACTTGCGACTAG
     P: AAATCGTAAC
        AAATCGTAAC Χ
        AAATCGTAAC Χ
        ……… Χ
        AAATCGTAAC ☺ (after first shift of 6)

  8. An example (k=2, x=3, m=10, n=20)
     Initial alignment:
       T: AACTGTTAACTTGCGACTAG
       P: AAATCGTAAC
     Shift 1 by Tarhio-Ukkonen:
       T: AACTGTTAACTTGCGACTAG
       P:  AAATCGTAAC
     Shift 6 by FAAST:
       T: AACTGTTAACTTGCGACTAG
       P:       AAATCGTAAC

  9. Construction of shift table • Heuristic: after the shift, the last k+x aligned text characters must have at least x matches with P. If only y ≤ k+x of those characters still lie under P after the shift, they must have at least y-k matches.
     T: AACTGTTAACTTGCGACTA (k=2, x=3)
     P: AAGTCGTAAC
        ….
        AAGTCGTAAC

  10. Construction details • V_{kx}[t_{j-k-x+1}…t_j, l]: marks the characters of t_{j-k-x+1}…t_j that match P after shifting P by l. • d_{kx}[t_{j-k-x+1}…t_j]: stores the minimum distance l such that V_{kx}[t_{j-k-x+1}…t_j, l] contains at least min(x, m-k-l) items.
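The construction of d_{kx} can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `min_safe_shift` and `build_shift_table` are hypothetical, the table is a plain dictionary keyed by every (k+x)-character window over the alphabet, and m ≥ k+x is assumed.

```python
from itertools import product
from typing import Dict


def min_safe_shift(pattern: str, window: str, k: int, x: int) -> int:
    """Smallest shift l in 1..m-k for which the window (the last k+x
    aligned text characters) keeps at least min(x, m-k-l) matches with
    the pattern once the pattern has moved l positions to the right."""
    m, w = len(pattern), len(window)
    for shift in range(1, m - k + 1):
        hits = 0
        for r, ch in enumerate(window):
            p_idx = m - w + r - shift      # pattern index now facing window[r]
            if p_idx >= 0 and pattern[p_idx] == ch:
                hits += 1
        if hits >= min(x, m - k - shift):
            return shift
    return m - k  # not reached: the threshold is 0 when shift == m-k


def build_shift_table(pattern: str, k: int, x: int,
                      alphabet: str = "ACGT") -> Dict[str, int]:
    """Precompute the shift for every possible (k+x)-character window."""
    if not (0 <= k < len(pattern) and x >= 1 and k + x <= len(pattern)):
        raise ValueError("need 0 <= k < m, x >= 1 and k + x <= m")
    return {"".join(g): min_safe_shift(pattern, "".join(g), k, x)
            for g in product(alphabet, repeat=k + x)}


table = build_shift_table("AAATCGTAAC", k=2, x=3)
print(len(table))      # 1024 entries = 4 ** (k + x)
print(table["TTAAC"])  # 6, the FAAST shift in the slide-8 example
print(build_shift_table("AAATCGTAAC", k=2, x=1)["AAC"])  # 1, the x=1 case
```

With x=1 the window shrinks to the last k+1 characters, which reproduces the shift of 1 attributed to Tarhio-Ukkonen in the example. The table has |alphabet|^(k+x) entries, so it grows quickly with x and with the alphabet size.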

  11. Construction details – cont’d • P: AAGTCGTAAC (k=2, x=3, l=[1..8]): V_{kx}[t_{j-k-x+1}…t_j, l] and d_{kx}[t_{j-k-x+1}…t_j]

  12. Theoretical support • Correctness of FAAST • We use the random-string assumption • Average shift distance • Total number of character comparisons

  13. Correctness of FAAST • Theorem 1. When the last k+x characters of P are aligned with t_{j-k-x+1}…t_j, we can always shift P by d_{kx}[t_{j-k-x+1}…t_j] to the right without missing any approximate occurrence of P.
     T: t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-2} t_{j-1} t_j
     P: p_1 p_2 … p_{i-k-x+1} … p_{i-2} p_{i-1} p_i …… p_m (current)
     P: p_1 p_2 … p_{i-k-x+1} … p_{i'-k-x+1} … p_{i'-2} p_{i'-1} p_{i'} … p_m (i < i')
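The correctness claim can be checked empirically with a small search loop. The sketch below is an assumption-laden illustration rather than the authors' code: `faast_search` is a hypothetical name, shifts are computed on demand and memoized instead of being read from a prebuilt table, and m ≥ k+x is assumed.

```python
from functools import lru_cache
from typing import List


def faast_search(text: str, pattern: str, k: int, x: int) -> List[int]:
    """Report every 0-based start where pattern matches text with at
    most k mismatches, skipping alignments with the k+x window shift."""
    m, n, w = len(pattern), len(text), k + x
    if not (0 <= k < m and x >= 1 and w <= m):
        raise ValueError("need 0 <= k < m, x >= 1 and k + x <= m")

    @lru_cache(maxsize=None)
    def shift_for(window: str) -> int:
        # Smallest l in 1..m-k keeping at least min(x, m-k-l) matches.
        for l in range(1, m - k + 1):
            hits = sum(1 for r, ch in enumerate(window)
                       if m - w + r - l >= 0 and pattern[m - w + r - l] == ch)
            if hits >= min(x, m - k - l):
                return l
        return m - k  # not reached: threshold is 0 at l == m-k

    found = []
    j = m - 1                      # text index under the last pattern char
    while j < n:
        start, mismatches = j - m + 1, 0
        for i in range(m - 1, -1, -1):          # compare right to left
            if text[start + i] != pattern[i]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            found.append(start)
        j += shift_for(text[j - w + 1:j + 1])   # window = last k+x chars
    return found


print(faast_search("AACTGTTAACTTGCGACTAG", "AAATCGTAAC", k=3, x=3))
```

A shift smaller than the returned l would leave the window with more than k mismatches against P, so no occurrence is skipped. Comparing the output against a brute-force scan over every alignment is a simple way to test this.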

  14. Average shift distance • Lemma 1. The probability $P_{kx}$ that the last k+x characters of T have at least x matches is: $P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$ • Theorem 1. The average shift distance of FAAST is: $E^{d}_{kx} = \sum_{s=0}^{\infty} s (1-P_{kx})^{s-1} P_{kx} = 1/P_{kx}$
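The closed form 1/P_{kx} follows from the mean of a geometric distribution. The derivation below is a standard step filled in for the reader; the shorthand q = 1 - P_{kx} is introduced here only for brevity, and the s = 0 term of the sum is dropped because it contributes nothing.

```latex
\begin{aligned}
E^{d}_{kx}
  &= \sum_{s=1}^{\infty} s\, q^{\,s-1} P_{kx},
     \qquad q = 1 - P_{kx},\; 0 \le q < 1 \\
  &= P_{kx}\,\frac{d}{dq} \sum_{s=0}^{\infty} q^{\,s}
   = P_{kx}\,\frac{d}{dq}\,\frac{1}{1-q}
   = \frac{P_{kx}}{(1-q)^{2}}
   = \frac{1}{P_{kx}} .
\end{aligned}
```

As a numeric illustration, reading p as the per-character match probability (an assumption; p = 1/4 for uniformly random DNA), k=2 and x=3 give P_{kx} = 0.103515625 and an average shift of about 9.66, while k=2 and x=1 give P_{kx} = 0.578125 and an average shift of about 1.73.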

  15. Average shift distance under different x.

  16. Total character comparisons • Lemma 2. The expected number of comparisons between two shifts is: $E^{c}_{kx} = (k+x)/(1-p)$ • Theorem 2. The expected total number of comparisons for a text of length n is: $TE^{c}_{kx} = n P_{kx} (k+x)/(1-p)$
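The formulas on slides 14 and 16 can be evaluated directly to see how x trades shift length against comparisons. The sketch below is only a calculator for the stated expressions; the function names are hypothetical, and p is read as the per-character match probability (an assumption, 1/4 for uniformly random DNA).

```python
from math import comb


def window_match_probability(k: int, x: int, p: float) -> float:
    """P_kx: probability that k+x random characters contain at least
    x matches, each character matching independently with probability p."""
    w = k + x
    fewer_than_x = sum(comb(w, i) * (1 - p) ** (w - i) * p ** i
                       for i in range(x))
    return 1.0 - fewer_than_x


def expected_total_comparisons(n: int, k: int, x: int, p: float) -> float:
    """TE_kx = n * P_kx * (k + x) / (1 - p), as stated in Theorem 2."""
    per_shift = (k + x) / (1 - p)        # Lemma 2
    return n * window_match_probability(k, x, p) * per_shift


for x in (1, 2, 3, 4):
    p_kx = window_match_probability(2, x, 0.25)
    total = expected_total_comparisons(2_000_000, 2, x, 0.25)
    print(x, round(1 / p_kx, 2), round(total))
```

The loop prints, for each x, the average shift 1/P_{kx} and the predicted number of comparisons on a 2M-base text with k=2; these are model values under the random-string assumption, not measurements.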

  17. Total character comparisons

  18. Difference of total character comparisons under different x

  19. Experimental results • A PC with a 2.8 GHz CPU and 1 GB of memory • Simulated random string testing • Real DNA gene sequence data

  20. Results on simulated sequences • Text: a sequence of 2M bases; pattern: 39 bases; k=3.

  21. Result on real sequences • Text: 150 bacteria DNA sequences, k=3 • Text: 150 fungi DNA sequences, k=3

  22. Conclusion • FAAST is a competitive algorithm for the k-mismatch problem on gene sequences. • Time and memory increase with larger x and alphabet size.
