Analysis of Tree Edit Distance Algorithms Serge Dulucq and Hélène

Analysis of Tree Edit Distance Algorithms
Serge Dulucq and Hélène
B89902009 黃鼎翔, B89902011 田知本, B89902045 巨彥霖

Outline • Introduction • Edit Distance for Trees and Forests • Cover Strategies


Motivation • One way of comparing two ordered trees is by measuring their edit distance • Application areas • Comparison of hierarchically structured data • Alignment of RNA secondary structures in computational biology • Two algorithms using dynamic programming • Zhang-Shasha • Klein

Purpose • A general analysis of dynamic programming for edit distance algorithm • Study the complexity of those decompositions by counting the exact number of distinct recursive calls • Define a new edit distance algorithm for trees which improves original algorithms with respect to the number of recursive calls

Trees and forests • A tree is a node (called the root) connected to an ordered sequence of disjoint trees • Such a sequence is called a forest • We write l(A1◦…◦An) for the tree composed of the node l connected to the sequence of trees A1, …, An • [Figure: two ordered trees on the nodes 1 to 5 shown as unequal (≠); schematic of l(A1◦…◦An) with root l above A1, A2, …, An]

• |F| denotes the number of nodes of the forest F • SF(F) is the set of all subforests of F • F(i), where i is a node of F, denotes the subtree of F rooted at i • deg(i) is the degree of i, that is, the number of children of i • Example from the figure (a forest F on the nodes 1 to 10): |F| = 10; the forest drawn with the nodes 4, 5, 6, 9 belongs to SF(F); F(2) is the subtree drawn with the nodes 2 and 3; deg(4) = 2

Edit distance • Let F and G be two forests. The edit distance between F and G, denoted d(F, G), is the minimal cost of the edit operations needed to transform F into G • Operations • Substitution • Insertion • Deletion • Let Cs, Ci, Cd denote the costs of substitution, insertion, and deletion

Recursive relationship (1/3) • Strings • u, v are strings; x, y are alphabet symbols • d(xu, yv) = min{ Cd(x) + d(u, yv), Ci(y) + d(xu, v), Cs(x, y) + d(u, v) } • d(ux, vy) = min{ Cd(x) + d(u, vy), Ci(y) + d(ux, v), Cs(x, y) + d(u, v) }
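The first string recurrence can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `string_distance` and the unit-cost defaults for Cd, Ci, Cs are assumptions made for the example.

```python
from functools import lru_cache

def string_distance(s, t,
                    cd=lambda x: 1,                       # assumed unit deletion cost
                    ci=lambda y: 1,                       # assumed unit insertion cost
                    cs=lambda x, y: 0 if x == y else 1):  # assumed substitution cost
    """Edit distance d(s, t) using the recurrence on the first symbols."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between the suffixes s[i:] and t[j:]
        if i == len(s):
            return sum(ci(y) for y in t[j:])   # only insertions remain
        if j == len(t):
            return sum(cd(x) for x in s[i:])   # only deletions remain
        return min(cd(s[i]) + d(i + 1, j),               # delete x
                   ci(t[j]) + d(i, j + 1),               # insert y
                   cs(s[i], t[j]) + d(i + 1, j + 1))     # substitute x by y
    return d(0, 0)
```

Memoizing on the index pair (i, j) is what keeps the number of distinct recursive calls at (|s|+1)(|t|+1); the tree case below asks the same counting question for forests.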

Recursive relationship (2/3) • Trees • l, l’ are roots; F, F’ are forests • d(l(F), l’(F’)) = min{ Cd(l) + d(F, l’(F’)), Ci(l’) + d(l(F), F’), Cs(l, l’) + d(F, F’) }

Recursive relationship (3/3) • Forests • T, T’ are forests • Left decomposition d(l(F)◦T, l’(F’)◦T’) = min{ Cd(l) + d(F◦T, l’(F’)◦T’), Ci(l’) + d(l(F)◦T, F’◦T’), d(l(F), l’(F’)) + d(T, T’) } • Right decomposition d(T◦l(F), T’◦l’(F’)) = min{ Cd(l) + d(T◦F, T’◦l’(F’)), Ci(l’) + d(T◦l(F), T’◦F’), d(l(F), l’(F’)) + d(T, T’) } • At each step a direction (left or right) indicates which of the two decompositions is applied
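The tree and forest recurrences can be combined into one memoized function. The sketch below is an illustration under stated assumptions: a tree is encoded as a tuple `(label, children)`, a forest as a tuple of trees, costs are unit costs, and `strategy` is a hypothetical callback that returns "left" or "right" for a pair of forests.

```python
from functools import lru_cache

# Assumed unit costs; any cost functions Cd, Ci, Cs could be substituted.
def cost_del(label): return 1
def cost_ins(label): return 1
def cost_sub(a, b): return 0 if a == b else 1

def forest_distance(F, G, strategy=lambda F, G: "left"):
    """d(F, G) for forests given as tuples of (label, children) trees."""
    @lru_cache(maxsize=None)
    def d(F, G):
        if not F and not G:
            return 0
        if not G:                       # only deletions remain
            (l, kids), rest = F[0], F[1:]
            return cost_del(l) + d(kids + rest, G)
        if not F:                       # only insertions remain
            (m, kids), rest = G[0], G[1:]
            return cost_ins(m) + d(F, kids + rest)
        if len(F) == 1 and len(G) == 1:  # tree versus tree
            (l, Fk), (m, Gk) = F[0], G[0]
            return min(cost_del(l) + d(Fk, G),
                       cost_ins(m) + d(F, Gk),
                       cost_sub(l, m) + d(Fk, Gk))
        if strategy(F, G) == "left":    # left decomposition
            (l, Fk), T = F[0], F[1:]
            (m, Gk), U = G[0], G[1:]
            return min(cost_del(l) + d(Fk + T, G),
                       cost_ins(m) + d(F, Gk + U),
                       d((F[0],), (G[0],)) + d(T, U))
        T, (l, Fk) = F[:-1], F[-1]      # right decomposition
        U, (m, Gk) = G[:-1], G[-1]
        return min(cost_del(l) + d(T + Fk, G),
                   cost_ins(m) + d(F, U + Gk),
                   d((F[-1],), (G[-1],)) + d(T, U))
    return d(F, G)

# Example: a(b, c) versus a(b)
F = (("a", (("b", ()), ("c", ()))),)
G = (("a", (("b", ()),)),)
print(forest_distance(F, G), forest_distance(F, G, lambda F, G: "right"))
```

The value of d(F, G) does not depend on the strategy; only the set of pairs that the cache ends up holding does, which is what the rest of the analysis counts.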

Example • [Figure: worked example of a left decomposition and of a right decomposition on small trees with the nodes 1 to 5]

Strategy & Relevant forests • Let F and G be two forests. A strategy is a mapping from SF(F)×SF(G) to {left, right} • Let (F, F’) be a pair of forests provided with a strategy φ. The set RFφ(F, F’) of relevant forests is defined as the least subset of SF(F)×SF(F’) such that if the decomposition of (F, F’) meets the pair (G, G’), then (G, G’) belongs to RFφ(F, F’) • RFφ(F) and RFφ(F’) denote the projections of RFφ(F, F’) on SF(F) and SF(F’) • #relevant denotes the number of relevant forests

Proposition (1/2)
• F = F’ = Ø → RFφ(F, F’) = Ø
• φ(F, F’) = left, F = l(G)◦T, F’ = Ø → RFφ(F, F’) = {(F, F’)} ∪ RFφ(G◦T, F’)
• φ(F, F’) = right, F = T◦l(G), F’ = Ø → RFφ(F, F’) = {(F, F’)} ∪ RFφ(T◦G, F’)
• φ(F, F’) = left, F = Ø, F’ = l’(G’)◦T’ → RFφ(F, F’) = {(F, F’)} ∪ RFφ(F, G’◦T’)
Recall the two decompositions:
d(l(G)◦T, l’(G’)◦T’) = min{ Cd(l) + d(G◦T, l’(G’)◦T’), Ci(l’) + d(l(G)◦T, G’◦T’), d(l(G), l’(G’)) + d(T, T’) }
d(T◦l(G), T’◦l’(G’)) = min{ Cd(l) + d(T◦G, T’◦l’(G’)), Ci(l’) + d(T◦l(G), T’◦G’), d(l(G), l’(G’)) + d(T, T’) }

Proposition (2/2)
• φ(F, F’) = right, F = Ø, F’ = T’◦l’(G’) → RFφ(F, F’) = {(F, F’)} ∪ RFφ(F, T’◦G’)
• φ(F, F’) = left, F = l(G)◦T, F’ = l’(G’)◦T’ → RFφ(F, F’) = {(F, F’)} ∪ RFφ(G◦T, F’) ∪ RFφ(F, G’◦T’) ∪ RFφ(l(G), l’(G’)) ∪ RFφ(T, T’)
• φ(F, F’) = right, F = T◦l(G), F’ = T’◦l’(G’) → RFφ(F, F’) = {(F, F’)} ∪ RFφ(T◦G, F’) ∪ RFφ(F, T’◦G’) ∪ RFφ(l(G), l’(G’)) ∪ RFφ(T, T’)
Recall the two decompositions:
d(l(G)◦T, l’(G’)◦T’) = min{ Cd(l) + d(G◦T, l’(G’)◦T’), Ci(l’) + d(l(G)◦T, G’◦T’), d(l(G), l’(G’)) + d(T, T’) }
d(T◦l(G), T’◦l’(G’)) = min{ Cd(l) + d(T◦G, T’◦l’(G’)), Ci(l’) + d(T◦l(G), T’◦G’), d(l(G), l’(G’)) + d(T, T’) }
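The cases of the proposition translate into a direct enumeration of RFφ(F, F’). The sketch below is illustrative only: trees are assumed to be `(label, children)` tuples with pairwise distinct labels (so that distinct subforests stay distinct), `strategy` is a hypothetical callback, and a pair of single trees is handled with the tree recurrence, which adds the pair of child forests.

```python
def split(X, left):
    """Return (X with the end root removed, the end tree, the remaining forest)."""
    if left:
        return X[0][1] + X[1:], (X[0],), X[1:]
    return X[:-1] + X[-1][1], (X[-1],), X[:-1]

def relevant_forests(F, G, strategy):
    """Set of pairs reached from (F, G) when decomposing according to strategy."""
    seen, stack = set(), [(F, G)]
    while stack:
        F, G = stack.pop()
        if (not F and not G) or (F, G) in seen:
            continue
        seen.add((F, G))
        left = strategy(F, G) == "left"
        if not G:                          # F' empty: only deletions
            stack.append((split(F, left)[0], G))
        elif not F:                        # F empty: only insertions
            stack.append((F, split(G, left)[0]))
        else:
            Fd, Ft, Fr = split(F, left)
            Gd, Gt, Gr = split(G, left)
            stack += [(Fd, G), (F, Gd)]
            if len(F) == 1 and len(G) == 1:
                stack.append((Fd, Gd))     # tree case: substitution of the roots
            else:
                stack += [(Ft, Gt), (Fr, Gr)]
    return seen

# Example with distinct labels: a(b, c) versus x(y)
A = (("a", (("b", ()), ("c", ()))),)
X = (("x", (("y", ()),)),)
rf = relevant_forests(A, X, lambda F, G: "left")
print(len(rf))
```

The size of the returned set is the number of distinct recursive calls for that strategy; on inputs this small every strategy reaches essentially every pair, so differences between strategies only show up on larger trees.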

Lemma 1 • Given a tree A = l(A1◦…◦An), for any strategy we have #relevant(A) ≥ |A| - |Ai| + #relevant(A1) +…+ #relevant(An), where i∈[1…n] is such that the size of Ai is maximal

Proof (1/2)
Let F = A1◦…◦An ⇒ RF(A) = {A} ∪ RF(F) ⇒ #relevant(A) = 1 + #relevant(F)
When n = 1: F = A1, A = l(A1) ⇒ #relevant(A) = 1 + #relevant(A1) ≥ |A| - |A1| + #relevant(A1)
When n > 1: suppose the direction is left, and let A1 = l(F1), T = A2◦…◦An
RF(F) = {F} ∪ RF(A1) ∪ RF(T) ∪ RF(F1◦T)
| RF(F1◦T) – (RF(F1) ∪ RF(T)) | ≥ min{|F1|, |T|}
⇒ #relevant(F) ≥ 1 + #relevant(A1) + #relevant(T) + min{|F1|, |T|}
Let j∈[2…n] be such that |Aj| is maximal among |A2|, …, |An|
⇒ #relevant(F) ≥ 1 + #relevant(A1) +…+ #relevant(An) + |T| - |Aj| + min{|F1|, |T|}

Take a look • Goal: #relevant(A) ≥ |A| - |Ai| + #relevant(A1) +…+ #relevant(An). Since #relevant(A) = 1 + #relevant(F) and |A| = |F| + 1, this is equivalent to #relevant(F) ≥ |F| - |Ai| + #relevant(A1) +…+ #relevant(An) • Shown so far: #relevant(F) ≥ 1 + |T| - |Aj| + min{|F1|, |T|} + #relevant(A1) +…+ #relevant(An)

Proof (2/2)
It remains to show that 1 + |T| - |Aj| + min{|F1|, |T|} ≥ |F| - |Ai|
1) If |F1| ≤ |T|: 1 + |T| + min{|F1|, |T|} = 1 + |T| + |F1| = |F|. Since |Aj| ≤ |Ai|, 1 + |T| - |Aj| + min{|F1|, |T|} = |F| - |Aj| ≥ |F| - |Ai|
2) If |F1| > |T|: then i = 1, so |F| - |Ai| = |T|, and 1 + |T| - |Aj| + min{|F1|, |T|} = 1 + |T| + |T| - |Aj| ≥ 1 + |T| > |F| - |Ai|
∴ #relevant(F) ≥ |F| - |Ai| + #relevant(A1) +…+ #relevant(An)
⇒ #relevant(A) ≥ |A| - |Ai| + #relevant(A1) +…+ #relevant(An)

Lemma 2 • For every natural number n, there exists a tree A of size n such that, for any strategy, #relevant(A) is in Ω(n log n) • For the complete balanced binary tree Tn of size n, prove by induction on n that #relevant(Tn) ≥ (n+1)log2(n+1)/2
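The induction step can be written out as follows. This is a sketch under the assumption that n = 2m + 1, so that T_n is a root with two copies of T_m; the symbol m is introduced here only for illustration. The base case is n = 1, where #relevant(T_1) = 1 = (1+1)log2(1+1)/2.

```latex
\begin{align*}
\#\mathrm{relevant}(T_n)
  &\ge n - m + 2\,\#\mathrm{relevant}(T_m)
     && \text{(Lemma 1 with } |A_i| = m\text{)} \\
  &\ge (m+1) + 2\cdot\frac{(m+1)\log_2(m+1)}{2}
     && \text{(induction hypothesis, } n - m = m + 1\text{)} \\
  &= (m+1)\bigl(1 + \log_2(m+1)\bigr) \\
  &= (m+1)\log_2(2m+2) \\
  &= \frac{(n+1)\log_2(n+1)}{2}.
\end{align*}
```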

Idea • Suppose the direction is left: RF(l(F)◦T) = {l(F)◦T} ∪ RF(l(F)) ∪ RF(F◦T) ∪ RF(T) • Since T ⊆ F◦T, we want to eliminate the nodes of F first in F◦T, so that RF(F◦T) and RF(T) share as many relevant forests as possible

Cover • Let F be a forest. A cover r of F is a mapping from F to F∪{left, right} satisfying, for each node i in F • if deg(i) = 0 or 1, then r(i)∈{left, right} • if deg(i) > 1, then r(i) is a child of i • [Figure: a tree on the nodes 1 to 4 drawn with a cover; nodes of degree 0 or 1 are marked left or right]

Cover strategy • Given a pair of trees (A, B) and a cover r for A, we associate a unique strategy φ as follows • If deg(i) = 0 or 1, then φ(A(i), G) = r(i), for each forest G of B • If A(i) is of the form l(A1◦…◦An) with n > 1, let p∈{1, …, n} be such that the favorite child r(i) is the root of Ap. For each forest G of B, we define • φ(A(i), G) = right if p = 1, left otherwise • φ(T◦Ap◦…◦An, G) = left, for each forest T of A1◦…◦Ap-1 • φ(Ap◦T, G) = right, for each forest T of Ap+1◦…◦An • The tree A is called the cover tree. A strategy is a cover strategy if there exists a cover tree associated to it

[Figure: the three rules of the cover strategy illustrated on a node i whose subtree A(i) has the children A1, A2, A3, A4, compared with a forest G]

Some Tasks • The order of our tasks • Study tree A … • Study tree B … • Combine the results for tree A and tree B • Obtain the number of distinct pairs (recursively)

Studying Tree A …

Tree A • Focus on relevant(A) (in detail) • Cover strategies in A • A drives B

Lemma 3 • For every node i of F and every node j of G, (F(i), G(j)) ∈ RF(F, G) • This is trivial

Lemma 4
RF(l(F)◦T) = {l(F)◦T, F1◦T, …, Fk◦T} ∪ RF(l(F)) ∪ RF(T)
• What is this for?
• Terms: k = |F|, the number of nodes of F. Fi+1 is the forest obtained from Fi by a left decomposition, so F1, F2, …, Fk are the forests produced by a sequence of left decompositions
• Goal: with a cover strategy such that φ(l(F)◦T) = left, see whether the number of recursive calls can be reduced

RF(l(F)◦T) • Because of the cover strategy, the direction is left • RF(l(F)◦T) = {l(F)◦T} ∪ RF(l(F)) ∪ RF(T) ∪ RF(F◦T) • [Figure: l(F)◦T split into l(F), T, and F◦T]

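RF(F◦T), continued • Because of the cover strategy, the direction is again left • The decomposition of F◦T produces F1◦T, a part that already belongs to RF(l(F)), and RF(T) • [Figure: F◦T split into F1◦T and parts covered by RF(l(F)) and RF(T)]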

So … • Repeating the left decomposition produces the forests {F1◦T, …, Fk◦T}

Conclusion RF(l(F)◦T) = {l(F) ◦T, F1 ◦T, ….. ,Fk◦T}∪RF(l(F))∪RF(T)

Lemma 5 • #relevant(A) = |A| - |Aj| + #relevant(A1) + #relevant(A2) + … + #relevant(An) • Terms: A = l(A1◦A2◦…◦An), and Aj is the favorite child of A • Goal: compute the number of relevant forests of a cover tree

[Figure: the tree A = l(A1◦…◦An) with the subtree Aj highlighted] Aj is the favorite child of A, j∈[1…n]

Part 1: |A| - |Aj| • Note: φ(A(i), G) = right whenever p = 1, left otherwise; φ(T◦Ap◦…◦An, G) = left, for each forest T of A1◦…◦Ap-1; φ(Ap◦T, G) = right, for each forest T of Ap+1◦…◦An • Explanation: since Aj is the favorite child of A, |A| - |Aj| is the number of elements of {A} ∪ {all forests that contain Aj}

Part 2: #relevant(A1) + #relevant(A2) + … + #relevant(An) • Note (with A1 = l(F1)): RF(A1◦A2◦A3◦A4◦…◦An) = {A1◦A2◦A3◦A4◦…◦An} ∪ RF(F1◦A2◦A3◦A4◦…◦An) ∪ RF(A1) ∪ RF(A2◦A3◦A4◦…◦An) • [Figure: the forest A1, A2, A3, A4, …, An]

Conclusion • #relevant(A) = |A| - |Aj| + #relevant(A1) + #relevant(A2) + … + #relevant(An)
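The formula of Lemma 5 gives a direct recursive count. The sketch below is an illustration, not the authors' code: trees are assumed to be `(label, children)` tuples, `favorite` is a hypothetical callback returning the index j of the favorite child, a leaf is counted as one relevant forest, and an only child is treated as Aj. The default choice of the heaviest child is an assumption suggested by Lemma 1, where the subtracted term is the size of a maximal Ai.

```python
def size(tree):
    """|A|: number of nodes of a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])

def heaviest(children):
    """Index of a child with maximal subtree size (first one on ties)."""
    return max(range(len(children)), key=lambda k: size(children[k]))

def relevant_count(tree, favorite=heaviest):
    """#relevant(A) = |A| - |Aj| + sum of #relevant(Ai), Aj the favorite child."""
    _, children = tree
    if not children:
        return 1                      # a single node: only the tree itself
    j = favorite(children)
    return (size(tree) - size(children[j])
            + sum(relevant_count(c, favorite) for c in children))

# Complete balanced binary trees of sizes 1, 3, 7
leaf = ("x", ())
t3 = ("x", (leaf, leaf))
t7 = ("x", (t3, t3))
print(relevant_count(leaf), relevant_count(t3), relevant_count(t7))
```

On these three trees the count equals (n+1)log2(n+1)/2, the lower bound stated in Lemma 2 for complete balanced binary trees.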

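Free node • What is a free node? • Not an only child • Not its parent's favorite child • Definition: a node is free if it is • the root of A, or • a node whose parent has degree greater than 1 and which is not the favorite child • [Figure: a tree with the favorite children and the free nodes marked]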
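The definition of a free node can be checked mechanically. The sketch below is illustrative: trees are assumed to be `(label, children)` tuples with distinct labels, and the favorite child is chosen by a hypothetical `favorite` callback (here the heaviest child, which is an assumption made for the example, not something the cover definition requires).

```python
def size(tree):
    return 1 + sum(size(c) for c in tree[1])

def heaviest(children):
    """Index of a child with maximal subtree size (first one on ties)."""
    return max(range(len(children)), key=lambda k: size(children[k]))

def free_nodes(root, favorite=heaviest):
    """Labels of the free nodes of the tree for the cover induced by favorite."""
    free = [root[0]]                  # the root of A is free
    stack = [root]
    while stack:
        _, children = stack.pop()
        if len(children) > 1:         # parent of degree greater than 1
            j = favorite(children)
            free += [c[0] for k, c in enumerate(children) if k != j]
        stack.extend(children)
    return free

# Example: a(b(c, d), e); b is the favorite child of a, c the favorite child of b
A = ("a", (("b", (("c", ()), ("d", ()))), ("e", ())))
print(sorted(free_nodes(A)))
```

Only children are never free, because their parent has degree 1, which matches the informal description on the slide.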

Studying Tree B …

Tree B • B is driven by A • So there is no cover strategy on B • Focus on the following three things: • Rightmost forests • Leftmost forests • Special forests

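Three Things (1) • Rightmost ∪ leftmost = special? No! • Definition • Rightmost forests: all subforests produced by starting from B and applying a sequence of left decompositions until the end • Leftmost forests: all subforests produced by starting from B and applying a sequence of right decompositions until the end • Special forests: all subforests produced by starting from B and applying a sequence of left or right decompositions until the end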

Example • [Figure: a tree B on the nodes 1 to 7 and all rightmost forests of B, obtained by successive left decompositions]

Three Things (2) • Three categories • The relevant forests of A fall within three categories • (α) those that are compared with all rightmost forests of B • (β) those that are compared with all leftmost forests of B • (γ) those that are compared with all special forests of B • Why?

Three Things (3) • The number of rightmost, leftmost, and special forests (#right, #left, #special) • #right(B) = ∑(|B(i)|, i∈B) - ∑(|B(i)|, i is a rightmost child) • #left(B) = ∑(|B(i)|, i∈B) - ∑(|B(i)|, i is a leftmost child) • #special(B) = |B|(|B|+3) / 2 - ∑(|B(i)|, i∈B)

Explanation of #right(B) and #left(B)
• Rightmost forests: every step is a left decomposition, which corresponds to the cover strategy in which the favorite child is always the rightmost child
• Leftmost forests: every step is a right decomposition, which corresponds to the cover strategy in which the favorite child is always the leftmost child
#right(B) = ∑(|B(i)|, i∈B) - ∑(|B(i)|, i is a rightmost child)
#right(B) = |B| - |Bright| + #right(B1) + … + #right(Bn) (recursively), where Bright is the rightmost child subtree of B
#left(B) = ∑(|B(i)|, i∈B) - ∑(|B(i)|, i is a leftmost child)
#left(B) = |B| - |Bleft| + #left(B1) + … + #left(Bn) (recursively), where Bleft is the leftmost child subtree of B
Review: #relevant(B) = |B| - |Bj| + #relevant(B1) + … + #relevant(Bn)
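The three closed formulas can be compared with a brute-force enumeration on a small tree. The sketch below is illustrative: trees are assumed to be `(label, children)` tuples with distinct labels, the function names are hypothetical, and `reachable` simply collects every non-empty forest met when applying the allowed decompositions starting from B.

```python
def size(t):
    return 1 + sum(size(c) for c in t[1])

def all_subtrees(t):
    yield t
    for c in t[1]:
        yield from all_subtrees(c)

def counts(B):
    """#right, #left, #special of the tree B from the closed formulas."""
    total = sum(size(t) for t in all_subtrees(B))
    rightmost = sum(size(t[1][-1]) for t in all_subtrees(B) if t[1])
    leftmost = sum(size(t[1][0]) for t in all_subtrees(B) if t[1])
    n = size(B)
    return {"right": total - rightmost,
            "left": total - leftmost,
            "special": n * (n + 3) // 2 - total}

def reachable(B, directions):
    """Non-empty forests met from B using only the given decomposition directions."""
    seen, stack = set(), [(B,)]
    while stack:
        F = stack.pop()
        if not F or F in seen:
            continue
        seen.add(F)
        if "left" in directions:     # l(G)◦T gives G◦T, l(G), T
            stack += [F[0][1] + F[1:], (F[0],), F[1:]]
        if "right" in directions:    # T◦l(G) gives T◦G, l(G), T
            stack += [F[:-1] + F[-1][1], (F[-1],), F[:-1]]
    return seen

# Example: B = a(b(c, d), e)
B = ("a", (("b", (("c", ()), ("d", ()))), ("e", ())))
brute = {"right": len(reachable(B, {"left"})),      # rightmost forests
         "left": len(reachable(B, {"right"})),      # leftmost forests
         "special": len(reachable(B, {"left", "right"}))}
print(counts(B), brute)
```

Note the naming: left decompositions keep the right end of the forest intact, which is why they produce the rightmost forests, and symmetrically for right decompositions.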

Combination

Comparison • Two types (with respect to A) • Tree comparisons • free node • favorite child • Forest comparisons
