350 likes | 366 Vues
An algorithm for finding approximate occurrences of short patterns in long texts based on edit distance computation. Differentiate between three kinds of edit distances and optimize for efficiency. Extend algorithm to global edit distance.
E N D
Approximate Matching Of Run-Length Comressed Strings By Makinen, Navarro and Ukkonen
Abstract Let A and B be two run-length encoded strings of encoded lengths m’ and n’, respectively. we will show an O(m’n+n’m) time algorithm that compute their edit distance. Let A be a short pattern, a B be a long text and a threshold parameter K. We will show an algorithm that will report all the approximate occurrences of A in B Which are at an edit distance K or less from the pattern.
Example of simple edit distance: A= aaabbc B= abbdb
Example of simple edit distance: A= aaabbc B= abbdb
Example of simple edit distance: A= aaabbc B= abbdb
Example of simple edit distance: A= aaabbc B= abbdb
Example of simple edit distance: A= aaabbc B= abbdb
We distinct between three kinds of edit distance: Levenshtein distance - DL (A,B) : Insertion =1 ,Deletion =1,Substitution=1. distance - DID (A,B) : Insertion =1 ,Deletion =1,Substitution=∞ (no Substitution). Global distance - DG (A,B) : Arbitrary coast for Insertion ,Deletion ,Substitution.
KeyWords • Run-Length compressed aaaabbcccaab = (a,4),(b,2),(c,3),(a,2),(b,1). • White Box • Black Box
An O(mn’ + m’n) Algorithm for DL • Equal Letter Box (White): • “Copying” the values from the upper left borders to the bottom right borders, using as much diagonal moves as possible. 8 8 8 7 7 8 • In an Equal letter box, DID = DL because no substitutions are needed.
An O(mn’ + m’n) Algorithm for DL • Different Letter Box (Black): • Filling only the borders: • 1 + min (t-1 + min relevant upper border , s-1 + min relevant left border ).
(3,5) t > s (5,3) t < s (5,5) t = s
9 10 9 9 11 10 1+min( 3-1 + min(7,8) , 1-1 + min(8,9)) = 9 1+min( 3-1 + min(7,8,8) , 2-1 + min(7,8,9)) = 9 1+min( 3-1 + min(7,8,8,8) , 3-1 + min(7,7,8,9)) = 10 1+min( 3-1 + min(8,8,8,9) , 4-1 + min(7,7,8,9)) = 11 1+min( 2-1 + min(8,8,9) , 4-1 + min(7,7,8)) = 10 1+min( 1-1 + min(8,9) , 4-1 + min(7,7)) = 9
Three different points along our computation: s t > s t = s t < s t
Extending the Algorithm to Global Edit Distance. Inside a given box there are only three different costs involved : CI - insertion cost. CD - deletion cost. CS substitution cost. Since the triangle inequity holds : CS < CI + CD we will not differentiate between white box and black box. We assume without loss of generality that the cost CI and CD are the same in all boxes
An O(mn’ + m’n) Algorithm for DG Filling the borders: For each cell (s,t) in the border : cell(s,t) = min(upper triangle values, leftmost triangle values) triangle values = min( relevant border cells + each cell’s path to (s,t) ) path = CS * number of diagonal moves + CI * number of insertions + CD * number of deletions. example
Example CI = 7 CD = 4 CS = 10 13 17 16 20 20 21 min ( min(9+1*7 , 8+1*10) , min(7+1*10 + 2*4 , 8+3*4)) =16 min (min(9+2*7 , 8+1*10+1*7 , 7+2*10) , min(7+2*10 + 1*4 , 8+2*4+1*10 , 8+3*4)) =20 min (min(9+3*7 , 8+1*10+2*7 , 7+2*10+1*7 , 7+3*10) , min(7+3*10 , 8+2*10+1*4 , 8+1*10+2*4 , 8+3*4)) =20 min (min(9+4*7 , 8+1*10+3*7 , 7+2*10+2*7 , 7+3*10+1*7) , min(8+3*10 , 8+2*10+1*4 , 8+1*10+2*4 , 9+3*4)) =21 min(min(8+3*7 , 7+1*10+2*7 , 7+2*10+1*7) , min(8+2*10 , 8+1*10+1*4,9+2*4)) =17 min (min(7+3*7 , 7+1*10+2*7) , min( 8+1*10 ,9+1*4)) =13
Complexity: Stays O(m’n + n’m) same as the Levenshtein algorithm because we only add constant time calculations (multiplications)
Notice There are two cases computing the borders: 1. cell(s,t) gets its minimum value from the left border. 2. cell(s,t) gets its minimum value from the top border. In case 1, the minimum path cost to cell(s,t) can be written as CellValue = BorderCellValue + PathCost(BorderCell, Cell) = BorderCellValue + Diagonal * CS + (Position – Diagonal) * CI = BorderCellValue + Diagonal * (CS – CI) + Position * CI. BORDERCELLVALUE + DIAGONAL * (CS – CI) is notdependant on Position! Hence it can be calculated in advance, and kept in an array for all border cells, allowing each CellValue calculation to spend only constant time. Same applies for case 2, by changing CI to CD.
Approximate Searching Given a string A (short pattern ) ,a string B ( long text ) and a threshold parameter K ,we are interested reporting all the “approximate occurrences “ of A in B using K errors or less. (the position of substrings in B that are at distance K or less from the pattern A.) A = aabca B = aaeeeeeebbbbbbbaaaaaabbbbbbcccaaaccccabcdcdcdeeeaaab K = 3
Classical algorithm to find the “matches” Computes a matrix exactly like previous algorithm with the only difference that the first row of the matrix I initialized with zeros. The last row of the matrix is examined ,and every text position which is smaller then K is reported as a match .
Classical algorithm to find the “matches” Computes a matrix exactly like previous algorithm with the only difference that the first row of the matrix I initialized with zeros. The last row of the matrix is examined ,and every text position which is smaller then K is reported as a match .
Classical algorithm to find the “matches” Computes a matrix exactly like previous algorithm with the only difference that the first row of the matrix I initialized with zeros. The last row of the matrix is examined ,and every text position which is smaller then K is reported as a match .
More efficient algorithm – pattern and text are run-length compressed: Fill the matrix only at beginning of text runs. Complete the first m columns only. Complexity = O(m2n’+R) For each run – m*m, there are n’ runs, R is the size of the output.
Improving the trivial algorithm Problem: We would have wanted to apply DL, but if the text is very large, m’n may be a lot bigger than m2n’. Solution: Combine the two. Evaluating only the borders of the runs, and only the first m cells of each run, yields an O(m’m + m +m) per run of the text, multiplying by n’ to the final O(m’mn’ + R) complexity. דוגמה