Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents

Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
Seminarium IPIPAN, 24 April 2006

String-to-string correction

Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
• CONTEXT:
  • Finite set of symbols (alphabet)
  • Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
  • Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
  • Measure of similarity between words A and B (edit distance or error distance): minimum cost of all edit sequences transforming A into B
• INPUT:
  • Two words A and B
• OUTPUT:
  • Distance between A and B

Examples of elementary edit operations
• Insertion of a letter: monter → montaer, monter → montrer
• Deletion of a letter: monter → montr, monter → monte
• Replacement of a letter by another: monter → ponter, monter → conter
• Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.

Edit sequence
• Edit sequence = sequence of elementary edit operations
• For each pair of words X and Y, many edit sequences exist that transform X into Y.
• Example 1: transforming sorting into string:
  • sorting → srting → sting → string (3 operations)
  • sorting → sotring → string (2 operations)
  • sorting → srting → string (2 operations)
  • sorting → strting → string (2 operations)
  • sorting → srting → sting → sing → sring → string (5 operations)
  • .................
• Example 2: transforming abc into ca:
  • abc → ac → ca (2 operations)
  • abc → cabc → cac → ca (3 operations)
• From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.
[Slide annotation: four of the sequences listed above are marked "Linear sequence".]

Edit (error) distance
• Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
  • sorting → srting → sting → string (3 operations): cost = 3
  • sorting → sotring → string (2 operations): cost = 2
  • sorting → srting → sting → sing → sring → string (5 operations): cost = 5
• Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost of all edit sequences transforming X into Y:
  • ed(sorting, string) = 2
  • ed(abc, ca) = 2, if all edit sequences are taken into account
  • ed(abc, ca) = 3, if only the linear edit sequences are taken into account
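The elementary operations and the cost of an edit sequence can be illustrated with a minimal Python sketch. The helper names (`insert`, `delete`, `replace`, `transpose`, `apply_sequence`) and the 0-based positions are assumptions made for this illustration only; they do not come from the slides. Each operation has cost 1, as assumed above.

```python
def insert(word, pos, letter):
    """Insert `letter` before 0-based position `pos`."""
    return word[:pos] + letter + word[pos:]

def delete(word, pos):
    """Delete the letter at 0-based position `pos`."""
    return word[:pos] + word[pos + 1:]

def replace(word, pos, letter):
    """Replace the letter at 0-based position `pos` by `letter`."""
    return word[:pos] + letter + word[pos + 1:]

def transpose(word, pos):
    """Swap the adjacent letters at positions `pos` and `pos + 1`."""
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]

def apply_sequence(word, operations):
    """Apply an edit sequence; each operation acts on the result of the
    previous ones. Returns the final word and the cost (1 per operation)."""
    cost = 0
    for op, *args in operations:
        word = op(word, *args)
        cost += 1
    return word, cost

# sorting -> sotring -> string (transposition, then deletion)
print(apply_sequence("sorting", [(transpose, 2), (delete, 1)]))
# abc -> cabc -> cac -> ca (insertion, then two deletions)
print(apply_sequence("abc", [(insert, 0, "c"), (delete, 2), (delete, 2)]))
```

The edit distance is then the smallest cost over all sequences that reach the target word; enumerating sequences this way is only meant to make the definitions concrete, not to compute the distance efficiently.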

Calculating the edit distance (1/4)
Notation: word $X = x_1 x_2 \ldots x_i \ldots x_n$; the prefix of length $i$ of $X$: $X[i] = x_1 x_2 \ldots x_i$.
The distance between two prefixes $X[i+1]$ and $Y[j+1]$ can be calculated on the basis of the distances between shorter prefixes. There are 3 cases.
Case 1: if $x_{i+1} = y_{j+1}$ then $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])$.

Calculating the edit distance (2/4)
Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed), then 4 sub-cases are possible:
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains a transposition of $x_i$ and $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1$ (the transposition's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$ (the replacement's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$ (the insertion's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$ (the deletion's cost)

Calculating the edit distance (3/4)
Case 3: OTHERWISE (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)), 3 sub-cases are possible:
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1$ (the replacement's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1$ (the insertion's cost)
• The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1$ (the deletion's cost)

Calculating the edit distance (4/4)
Edit distance between $X[i]$ and $Y[j]$, recursive definition. For $i = 0,\ldots,n$ and $j = 0,\ldots,m$, where $n$ and $m$ are the lengths of $X$ and $Y$:

1° $ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = \max(m,n)$

2° $ed(X[0],Y[j]) = j$ and $ed(X[i],Y[0]) = i$

3°
$$
ed(X[i+1],Y[j+1]) =
\begin{cases}
ed(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]),\ ed(X[i-1],Y[j-1]) \,\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]) \,\} & \text{otherwise}
\end{cases}
$$
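The recursive definition above can be turned into a table-filling routine. The following Python sketch is one possible reading of it, with unit costs; the function name `edit_distance` and the table layout (`d[i][j]` holds ed(X[i],Y[j])) are assumptions of this example. Instead of the sentinel values of rule 1°, the sketch simply checks that both prefixes have at least two letters before considering a transposition.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with insertion, deletion, replacement and
    transposition of adjacent letters, each at cost 1."""
    n, m = len(x), len(y)
    # d[i][j] = ed(X[i], Y[j]) for the prefixes of length i and j
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # rule 2: i deletions
    for j in range(m + 1):
        d[0][j] = j              # rule 2: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # case 1: same last letter
                continue
            best = min(d[i - 1][j - 1],              # replacement
                       d[i][j - 1],                  # insertion
                       d[i - 1][j])                  # deletion
            if (i > 1 and j > 1
                    and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                best = min(best, d[i - 2][j - 2])    # case 2: transposition
            d[i][j] = 1 + best
    return d[n][m]

print(edit_distance("sorting", "string"))
print(edit_distance("abc", "ca"))
```

Because a transposed pair is never edited again, this recurrence only accounts for sequences in which later operations do not alter the result of earlier ones, which is why it yields 3 rather than 2 for abc and ca.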

Calculating the edit distance: dynamic programming
• Cell [i,j] contains the edit distance between the prefix [1,…,i] of one word and the prefix [1,…,j] of the other word
• Cell [n,m] contains the edit distance between the 2 words
[Figure: edit distance matrix indexed by i (up to n) and j (up to m).]

Dynamic programming: case 1, $x_{i+1} = y_{j+1}$
[Figure: matrix cell [i+1, j+1].]

Dynamic programming: case 2, $x_i = y_{j+1}$ and $x_{i+1} = y_j$
[Figure: matrix cell [i+1, j+1].]

Dynamic programming: case 3, $x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)
[Figure: matrix cell [i+1, j+1].]

String-to-language correction

String-to-language correction: problem definition
• CONTEXT:
  • Finite set of symbols (alphabet)
  • Elementary edit operations on symbols (as before) with their costs (1 per operation)
  • Edit sequences (as before)
  • Edit distance (error distance) between words: as before
• INPUT:
  • Regular grammar describing words (a finite set of words in particular)
  • Incorrect word A (not recognized by the grammar)
  • Threshold t
• OUTPUT:
  • A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)

String-to-language correction: simplistic approach
• METHOD:
  • For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
  • Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
• FEASIBILITY:
  • Impossible in the case of infinite languages
• COMPLEXITY: O(n * m * |D|)

String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

String correction with respect to a deterministic FSA (1/4 to 4/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA of the lexicon, shown next to the edit distance matrix for *aply.]
• The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.
• Each time a transition is followed, a new column is calculated in the edit distance matrix.
• If we reach a final state and the edit distance remains within the threshold, a new candidate has been found.
• Backtracking results in deleting the current column.
Steps shown on the four slides:
• (1/4) Transition e: new column 5 4 3 2 2; candidate found: apple
• (2/4) Transition s: new column 6 5 4 3 3 (3 exceeds the threshold, so no new candidate)
• (3/4) Backtracking: the current column is deleted
• (4/4) Transition y: new column 5 4 3 2 1; candidates found: apple, apply

Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
[Figure: FSA fragment with transitions labelled a, b, c, d.]
• If the current column exceeds the threshold, the whole path is cut off.
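The threshold-controlled depth-first exploration can be sketched in Python as follows. This is a simplified illustration, not Oflazer's original algorithm: the lexicon is stored as a trie (a special case of a deterministic FSA), a path is cut off when every value in its current column exceeds t, and all names (`END`, `build_trie`, `next_column`, `correct`) as well as the sample lexicon are assumptions of this example.

```python
END = None  # hypothetical marker key: this trie node is a final state

def build_trie(words):
    """Build a trie of nested dicts from a finite list of words."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def next_column(word, ch, prev_ch, prev, prev2):
    """Column of distances between (path + ch) and every prefix of `word`,
    from the columns of the path (`prev`) and of the path minus its last
    letter (`prev2`, needed for transpositions)."""
    col = [prev[0] + 1]
    for k in range(1, len(word) + 1):
        if word[k - 1] == ch:
            col.append(prev[k - 1])
            continue
        best = min(prev[k - 1], prev[k], col[k - 1])
        if (prev2 is not None and k > 1
                and word[k - 2] == ch and word[k - 1] == prev_ch):
            best = min(best, prev2[k - 2])   # transposition
        col.append(1 + best)
    return col

def correct(word, trie, t):
    """Return {candidate: distance} for lexicon words within distance t."""
    found = {}

    def explore(node, path, prev, prev2):
        if END in node and prev[-1] <= t:
            found[path] = prev[-1]           # final state within threshold
        for ch, child in node.items():
            if ch is END:
                continue
            col = next_column(word, ch, path[-1:], prev, prev2)
            if min(col) <= t:                # otherwise cut the path off
                explore(child, path + ch, col, prev)
            # returning from the call discards `col` (backtracking)

    explore(trie, "", list(range(len(word) + 1)), None)
    return found

lexicon = build_trie(["apple", "apples", "apply", "banana"])
print(correct("aply", lexicon, 2))
```

Columns are shared along common prefixes because each recursive call receives the column of its parent path, so the work done for appl is reused for apple, apples and apply.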

Tree-to-tree correction

Tree-to-tree correction (Selkow 1977, …)
• CONTEXT:
  • Finite set of node symbols (alphabet)
  • Elementary edit operations on trees:
    • Insertion of a leaf
    • Deletion of a leaf
    • Renaming of a node (leaf or internal node)
  • Non-negative cost for each elementary operation
  • Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
  • Edit distance between two trees A and B: minimum cost of all edit sequences transforming A into B
• INPUT:
  • Two trees A and B
• OUTPUT:
  • Distance between A and B

Comparing two trees (Selkow 1977, …)
• A partial tree A0:i consists of the root of A and its subtrees A0, …, Ai
• The comparison is based on comparing the roots, and then recursively comparing the roots’ subtrees
[Figure: two example trees, A (root(A) with subtrees A0, A1, A2) and B (root(B) with subtrees B0, B1, B2, B3), with the partial trees A0:1 and B0:2 marked.]

Edit distance matrix between two trees (Selkow 1977, …)
• Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j
• Cell [-1,-1] contains the cost of renaming root(A) into root(B)
• Cell [n,m] contains the edit distance between the 2 trees
[Figure: matrix indexed by i (up to n) and j (up to m).]

Calculation of the tree matrix (Selkow 1977, …)
Cell [i,j] is obtained from its three neighbors:
• Cell [i-1,j-1], adding the edit distance between Ai and Bj (here +0)
• Cell [i,j-1], adding the cost of inserting Bj (here +1)
• Cell [i-1,j], adding the cost of deleting Ai (here +1)
• Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
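The matrix calculation for trees can be sketched in Python under a few stated assumptions: a tree is represented as a pair `(label, list_of_subtrees)`, every elementary operation costs 1, and deleting or inserting a whole subtree costs its number of nodes (it is removed or built leaf by leaf). The names `size` and `tree_distance` are hypothetical, and the table is shifted by one so that `d[0][0]` plays the role of cell [-1,-1].

```python
def size(tree):
    """Number of nodes of a tree given as (label, [subtrees])."""
    label, children = tree
    return 1 + sum(size(c) for c in children)

def tree_distance(a, b):
    """Edit distance between trees a and b with leaf insertion,
    leaf deletion and node renaming, each at cost 1."""
    (label_a, sub_a), (label_b, sub_b) = a, b
    n, m = len(sub_a), len(sub_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # d[0][0]: cost of renaming root(A) into root(B)
    d[0][0] = 0 if label_a == label_b else 1
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(sub_a[i - 1])    # delete subtree of A
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(sub_b[j - 1])    # insert subtree of B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(sub_a[i - 1], sub_b[j - 1]),
                d[i][j - 1] + size(sub_b[j - 1]),     # insert Bj
                d[i - 1][j] + size(sub_a[i - 1]),     # delete Ai
            )
    return d[n][m]

a = ("a", [("b", []), ("c", [("d", [])])])
b = ("a", [("b", []), ("c", [])])
print(tree_distance(a, b))
```

The recursion recomputes distances between subtrees on demand; memoizing them would be a natural refinement for larger trees.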

Extension to the correction of XML documents
• The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
• The "horizontal" correction at the level of siblings is similar to the string-to-language correction (Oflazer 1996)
• The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
[Figure: XML document tree with elements <root>, <x>, <y>, <z>, <a>, <c>.]
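As a small illustration of the validity constraint only (not of the correction itself), the example expression E = ab*c + db* can be checked with Python's standard `re` module. The sketch assumes that each child element is abbreviated to a single letter, so that the sequence of a node's children forms a word; the names `CONTENT_MODEL` and `is_valid` are hypothetical.

```python
import re

# E = ab*c + db*, written in Python regex syntax ("+" becomes "|")
CONTENT_MODEL = re.compile(r"ab*c|db*")

def is_valid(children: str) -> bool:
    """True if the word formed by a node's children labels belongs to
    the language of the content model."""
    return CONTENT_MODEL.fullmatch(children) is not None

for word in ["abbc", "d", "ab", "abd"]:
    print(word, is_valid(word))
```

A word such as ab, which is rejected here, is the kind of input the horizontal correction would map to nearby valid words such as abc.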

Main idea
• String-to-string (Wagner & Fischer 1974)
• String-to-(regular) language (Oflazer 1996)
• Tree-to-tree (Selkow 1977)
• Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)

Edit distance matrix with edit sequences
• Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j

Bibliography
• Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario.
• Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
• Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402
• Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183
• Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
• Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
• Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186
• Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268
• Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173

Some details of the state of the art
• Wagner & Fischer (1974):
  • Elegant and solid theoretical definition of the string-to-string correction problem
  • 3 elementary operations on single letters admitted (insertion, deletion, replacement)
  • Model of a trace describing the edit distance between two strings
  • Dynamic programming method
• Lowrance & Wagner (1975):
  • Additional elementary operation: transposition of two adjacent letters
  • Restriction of the cost function
• Du & Chang (1992):
  • Cost 1 for each elementary operation
  • Restriction to linear editing sequences
  • Application to the nearest-neighbor search in a dictionary, with a threshold
• Oflazer (1996):
  • Nearest-neighbor search in finite-state automata
  • Application to large natural-language dictionaries
• Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
  • Tree-to-tree correction problem
• Mihov & Schulz (2004):
  • Levenshtein automaton
  • Backward dictionary
• Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
  • Incremental validation of XML documents resulting from updates: human-computer interaction
