Approximate Matching of Run-Length Compressed Strings Algorithm

An algorithm for finding approximate occurrences of short patterns in long texts, based on edit distance computation. The presentation distinguishes between three kinds of edit distance, optimizes the computation for run-length compressed strings, and extends the algorithm to the global edit distance.

Presentation Transcript


  1. Approximate Matching of Run-Length Compressed Strings, by Mäkinen, Navarro and Ukkonen

  2. Abstract: Let A and B be two run-length encoded strings with encoded lengths m’ and n’ and uncompressed lengths m and n, respectively. We show an O(m’n + n’m) time algorithm that computes their edit distance. Then, given a short pattern A, a long text B and a threshold parameter K, we show an algorithm that reports all approximate occurrences of A in B, i.e. those at edit distance K or less from the pattern.

  3. Example of simple edit distance: A= aaabbc B= abbdb

  8. We distinguish between three kinds of edit distance: • Levenshtein distance, DL(A,B): insertion = 1, deletion = 1, substitution = 1. • Indel distance, DID(A,B): insertion = 1, deletion = 1, substitution = ∞ (no substitutions). • Global distance, DG(A,B): arbitrary costs for insertion, deletion and substitution.
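The three distances differ only in their operation costs, so one dynamic-programming routine can cover all of them. The following is a minimal sketch of the classical uncompressed computation, not the run-length algorithm of the presentation; the function name `edit_distance` and its parameters `ci`, `cd`, `cs` are hypothetical names chosen for illustration. It is applied to the example strings A = aaabbc and B = abbdb.

```python
import math

def edit_distance(a, b, ci=1, cd=1, cs=1):
    """Cost of turning a into b; ci/cd/cs are insertion/deletion/substitution costs."""
    m, n = len(a), len(b)
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * cd          # delete all of a[:i]
    for j in range(1, n + 1):
        d[0][j] = j * ci          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A diagonal move is free on a match, otherwise a substitution.
            diag = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else cs)
            d[i][j] = min(diag, d[i - 1][j] + cd, d[i][j - 1] + ci)
    return d[m][n]

# Levenshtein distance: all three operations cost 1.
levenshtein = edit_distance("aaabbc", "abbdb")
# Indel distance: substitutions are forbidden (infinite cost).
indel = edit_distance("aaabbc", "abbdb", cs=math.inf)
print(levenshtein, indel)
```

Passing other values for `ci`, `cd` and `cs` gives a global distance DG in the same way.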

  9. Keywords • Run-length compression: aaaabbcccaab = (a,4),(b,2),(c,3),(a,2),(b,1). • White box • Black box
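As a small illustration of the run-length encoding used on this slide, here is a sketch built on the standard `itertools.groupby`; the helper names `run_length_encode` and `run_length_decode` are assumptions made for this example.

```python
from itertools import groupby

def run_length_encode(s):
    """Return the list of (letter, run length) pairs of s."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return "".join(ch * count for ch, count in runs)

runs = run_length_encode("aaaabbcccaab")
print(runs)
```

Here the encoded length (the m’ or n’ of the abstract) is `len(runs)`, while the uncompressed length is the sum of the run lengths.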

  10. Dividing the Edit Distance matrix into boxes

  12. An O(mn’ + m’n) Algorithm for DL • Equal-Letter Box (White): • “Copy” the values from the upper and left borders to the bottom and right borders, using as many diagonal moves as possible (border values shown in the figure: 8 8 8 7 7 8). • In an equal-letter box, DID = DL because no substitutions are needed.
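The copying step for an equal-letter box can be sketched as follows. This is an illustration under assumptions, not the authors' code: `top` and `left` are hypothetical lists holding the input borders of one box, both starting at the shared corner cell, and the function returns the bottom row and right column of the box. It relies on the fact that a matching diagonal move is free, so each cell equals the border cell reached by walking diagonally up and to the left.

```python
def white_box_borders(top, left):
    """Output borders of an equal-letter box from its input borders.

    top[j]  = value j columns right of the corner on the upper border,
    left[i] = value i rows below the corner on the left border.
    """
    assert top[0] == left[0], "both borders share the corner cell"
    h, w = len(left) - 1, len(top) - 1

    def cell(i, j):
        d = min(i, j)                 # number of free diagonal moves
        # The diagonal hits the upper border if i <= j, else the left border.
        return top[j - d] if i == d else left[i - d]

    bottom = [cell(h, j) for j in range(1, w + 1)]
    right = [cell(i, w) for i in range(1, h + 1)]
    return bottom, right

# Box for the runs "aaa" (rows) and "aaaa" (columns) at the matrix origin.
bottom, right = white_box_borders([0, 1, 2, 3, 4], [0, 1, 2, 3])
print(bottom, right)
```

Each output cell is a single lookup, so a box with h rows and w columns costs O(h + w) instead of O(hw).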

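  13. An O(mn’ + m’n) Algorithm for DL • Different-Letter Box (Black): • Fill only the borders. For a border cell that lies t rows below the upper border and s columns to the right of the left border: cell(s,t) = 1 + min( t-1 + min(relevant upper-border cells), s-1 + min(relevant left-border cells) ). A border cell is relevant if the target cell can be reached from it using only diagonal moves followed by moves straight down (upper border) or straight right (left border).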

  14. Three cases for a border cell (s,t): (3,5) with t > s; (5,3) with t < s; (5,5) with t = s.

  15. Example: the six border cells evaluate to 9, 9, 10, 11, 10, 9:
     • 1 + min( 3-1 + min(7,8), 1-1 + min(8,9) ) = 9
     • 1 + min( 3-1 + min(7,8,8), 2-1 + min(7,8,9) ) = 9
     • 1 + min( 3-1 + min(7,8,8,8), 3-1 + min(7,7,8,9) ) = 10
     • 1 + min( 3-1 + min(8,8,8,9), 4-1 + min(7,7,8,9) ) = 11
     • 1 + min( 2-1 + min(8,8,9), 4-1 + min(7,7,8) ) = 10
     • 1 + min( 1-1 + min(8,9), 4-1 + min(7,7) ) = 9
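The border formula for a different-letter box can be sketched as below. The border values `top = [7, 8, 8, 8, 9]` and `left = [7, 7, 8, 9]` are read off the arithmetic of the worked example and are therefore an assumption about the figure, as are all names in the code. The sketch takes each minimum by scanning the relevant slice directly, which is simpler than, and not as fast as, the constant-time evaluation the presentation aims for.

```python
def black_box_borders(top, left):
    """Output borders of a different-letter box under Levenshtein costs.

    top[j]  = value j columns right of the corner on the upper border,
    left[i] = value i rows below the corner on the left border.
    """
    h, w = len(left) - 1, len(top) - 1

    def cell(i, j):
        # Relevant upper-border cells: at most i columns to the left of column j.
        upper = min(top[max(0, j - i): j + 1])
        # Relevant left-border cells: at most j rows above row i.
        lefts = min(left[max(0, i - j): i + 1])
        # Every move costs 1, so a path from the upper border takes i steps
        # and a path from the left border takes j steps.
        return 1 + min(i - 1 + upper, j - 1 + lefts)

    bottom = [cell(h, j) for j in range(1, w + 1)]
    right = [cell(i, w) for i in range(1, h + 1)]
    return bottom, right

bottom, right = black_box_borders([7, 8, 8, 8, 9], [7, 7, 8, 9])
print(bottom, right)
```

The bottom row followed by the right column read upwards reproduces the six results of the example.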

  16. Three different points along our computation: t > s, t = s and t < s (the figure's axes are s and t).

  17. Extending the Algorithm to Global Edit Distance. Inside a given box only three different costs are involved: CI (insertion cost), CD (deletion cost) and CS (substitution cost). Since the triangle inequality CS < CI + CD holds, we do not differentiate between white boxes and black boxes. We assume without loss of generality that the costs CI and CD are the same in all boxes.

  18. An O(mn’ + m’n) Algorithm for DG. Filling the borders: for each cell (s,t) on the border, cell(s,t) = min(upper-triangle value, left-triangle value), where each triangle value = min over the relevant border cells of (border cell value + cost of that cell’s path to (s,t)), and path cost = CS * number of diagonal moves + CI * number of insertions + CD * number of deletions.

  19. Example: CI = 7, CD = 4, CS = 10. The six border cells evaluate to 16, 20, 20, 21, 17, 13:
     • min( min(9+1*7, 8+1*10), min(7+1*10+2*4, 8+3*4) ) = 16
     • min( min(9+2*7, 8+1*10+1*7, 7+2*10), min(7+2*10+1*4, 8+2*4+1*10, 8+3*4) ) = 20
     • min( min(9+3*7, 8+1*10+2*7, 7+2*10+1*7, 7+3*10), min(7+3*10, 8+2*10+1*4, 8+1*10+2*4, 8+3*4) ) = 20
     • min( min(9+4*7, 8+1*10+3*7, 7+2*10+2*7, 7+3*10+1*7), min(8+3*10, 8+2*10+1*4, 8+1*10+2*4, 9+3*4) ) = 21
     • min( min(8+3*7, 7+1*10+2*7, 7+2*10+1*7), min(8+2*10, 8+1*10+1*4, 9+2*4) ) = 17
     • min( min(7+3*7, 7+1*10+2*7), min(8+1*10, 9+1*4) ) = 13
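The weighted border computation can be sketched by enumerating the two triangles directly. As before, the border values are read off the example's arithmetic and all names are hypothetical. The sketch assumes that a move to the right is an insertion (cost CI) and a move down is a deletion (cost CD), which is how the example's terms are built, and it scans every relevant border cell rather than using the constant-time refinement.

```python
def weighted_border_cell(top, left, i, j, ci, cd, cs):
    """Value of the cell i rows down and j columns right of the box corner."""
    best = float("inf")
    # Upper triangle: diagonal moves first, then straight down (deletions).
    for q in range(max(0, j - i), j + 1):
        diag = j - q
        best = min(best, top[q] + diag * cs + (i - diag) * cd)
    # Left triangle: diagonal moves first, then straight right (insertions).
    for p in range(max(0, i - j), i + 1):
        diag = i - p
        best = min(best, left[p] + diag * cs + (j - diag) * ci)
    return best

top, left = [7, 8, 8, 8, 9], [7, 7, 8, 9]
h, w = len(left) - 1, len(top) - 1
bottom = [weighted_border_cell(top, left, h, j, 7, 4, 10) for j in range(1, w + 1)]
right = [weighted_border_cell(top, left, i, w, 7, 4, 10) for i in range(1, h + 1)]
print(bottom, right)
```

The bottom row followed by the right column read upwards gives the six values of the example.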

  20. Complexity: remains O(m’n + n’m), the same as for the Levenshtein algorithm, because we only add constant-time calculations (multiplications).

  21. Example:

  27. Notice: there are two cases when computing the borders: 1. cell(s,t) gets its minimum value from the left border. 2. cell(s,t) gets its minimum value from the top border. In case 1, the minimum path cost to cell(s,t) can be written as CellValue = BorderCellValue + PathCost(BorderCell, Cell) = BorderCellValue + Diagonal * CS + (Position – Diagonal) * CI = BorderCellValue + Diagonal * (CS – CI) + Position * CI. The term BorderCellValue + Diagonal * (CS – CI) does not depend on Position! Hence it can be calculated in advance and kept in an array for all border cells, so that each CellValue calculation takes only constant time. The same applies to case 2, with CI replaced by CD.
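The precomputation idea can be sketched for one situation: the bottom row of a box receiving its values from the left border (case 1). This is an illustration under assumptions rather than the presentation's implementation. The names are hypothetical, the left-border values are those assumed for the example with CI = 7 and CS = 10, and a running minimum stands in for the array lookup, which works here because the set of relevant left-border cells only grows as Position increases along the bottom row.

```python
def bottom_row_from_left(left, width, ci, cs):
    """Left-triangle minimum for each bottom-row cell, amortized O(1) per cell."""
    h = len(left) - 1
    # Position-independent part: BorderCellValue + Diagonal * (CS - CI),
    # where a path from left[p] to the bottom row uses h - p diagonal moves.
    key = [left[p] + (h - p) * (cs - ci) for p in range(h + 1)]
    out = []
    running = key[h]      # minimum of key over the rows folded in so far
    lowest = h            # smallest row index already folded into `running`
    for j in range(1, width + 1):
        lo = max(0, h - j)            # the relevant rows are lo..h
        while lowest > lo:
            lowest -= 1
            running = min(running, key[lowest])
        out.append(running + j * ci)  # add the Position-dependent part
    return out

values = bottom_row_from_left([7, 7, 8, 9], 4, ci=7, cs=10)
print(values)
```

For the right column the relevant range slides instead of growing, so a sliding-window minimum would be needed in place of the running minimum.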

  28. Approximate Searching: Given a string A (a short pattern), a string B (a long text) and a threshold parameter K, we are interested in reporting all the “approximate occurrences” of A in B with K errors or fewer, i.e. the positions of the substrings of B that are at distance K or less from the pattern A. Example: A = aabca, B = aaeeeeeebbbbbbbaaaaaabbbbbbcccaaaccccabcdcdcdeeeaaab, K = 3

  29. Classical algorithm to find the “matches”: compute a matrix exactly as in the previous algorithm, with the only difference that the first row of the matrix is initialized with zeros. The last row of the matrix is then examined, and every text position whose value is at most K is reported as a match.
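A minimal sketch of this classical search for the Levenshtein distance, keeping a single matrix column in memory. The function name `approx_search` and the choice to report 1-based end positions are assumptions made for this example.

```python
def approx_search(pattern, text, k):
    """Return the 1-based text positions where a match with <= k errors ends."""
    m = len(pattern)
    # Column for the empty text prefix: D[i][0] = i.
    col = list(range(m + 1))
    matches = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]    # D[0][j-1], always 0
        col[0] = 0            # first row is all zeros: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]      # D[i][j-1], needed as the next diagonal value
            col[i] = min(col[i] + 1,                          # skip a text letter
                         col[i - 1] + 1,                      # skip a pattern letter
                         prev_diag + (pattern[i - 1] != ch))  # match or substitute
            prev_diag = old
        if col[m] <= k:       # last row holds the best distance ending at j
            matches.append(j)
    return matches

exact = approx_search("abc", "xxabcxx", 0)
one_error = approx_search("abc", "xxabcxx", 1)
print(exact, one_error)
```

This runs in O(mn) time on the uncompressed strings, which is the baseline the run-length variants improve on.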

  32. More efficient algorithm when pattern and text are run-length compressed: fill the matrix only at the beginnings of text runs, and complete only the first m columns of each run. Complexity = O(m²n’ + R): each run costs m*m, there are n’ runs, and R is the size of the output.

  33. Improving the trivial algorithm. Problem: we would like to apply the DL algorithm, but if the text is very long, m’n may be much larger than m²n’. Solution: combine the two. Evaluating only the borders of the runs, and only the first m cells of each run, takes O(m’m + m + m) time per text run; multiplying by n’ gives the final O(m’mn’ + R) complexity. Example

  34. Thank you
