
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU




  1. Linear Time Construction of Suffix Tree. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

  2. High-Level View of Ukkonen’s Algorithm
     • Ukkonen’s algorithm is divided into m phases. In phase i+1, tree i+1 is constructed from tree i.
     • Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
     • Example with S = aba: phase 1 handles S[1…1] with suffixes {a}; phase 2 handles S[1…2] with suffixes {ab, b}; phase 3 handles S[1…3] with suffixes {aba, ba, a}.
     [Figure: the trees built in phases 1 to 3, with phases and extensions marked.]
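
The phase/extension structure can be made concrete with a small sketch. It only enumerates which suffix each extension handles and builds no tree; the function name `phases` is a hypothetical helper, not part of the slides.

```python
def phases(s):
    """Yield (phase number, suffixes handled by its extensions), 1-based.

    Phase i works on the prefix S[1..i]; extension j of that phase
    handles the suffix S[j..i] of the prefix.
    """
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


for number, suffixes in phases("aba"):
    print(number, suffixes)
```

Done naively, this is m phases with up to m extensions each, which is why the later slides look for ways to make most extensions free.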

  3. How do suffix links help?
     [Figure: suffix tree for the example string MISSISSIPI (positions 1234567890), built over the prefixes 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS, 8: MISSISSI, 9: MISSISSIP, 10: MISSISSIPI, with leaves numbered 1 to 9.]
     Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
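
As a hedged illustration of what a suffix link connects, the sketch below computes the internal nodes of a finished suffix tree by brute force and maps each node with path label xα to the node labeled α. It assumes the standard definition of a suffix link and a unique terminator `$` appended to the string; both function names are hypothetical, and this is not how Ukkonen’s algorithm finds the links.

```python
from collections import defaultdict


def internal_node_labels(s):
    """Path labels of the internal (branching) nodes, root excluded.

    A substring labels an internal node exactly when it is followed by
    at least two distinct characters in s. Brute force, for small inputs.
    """
    followers = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            followers[s[i:j]].add(s[j])  # s[i:j] is followed by s[j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal node label x+alpha to alpha ('' means the root)."""
    return {w: w[1:] for w in internal_node_labels(s)}


links = suffix_links("MISSISSIPI$")
for label in sorted(links, key=len, reverse=True):
    print(label, "->", links[label] or "root")
```

Every link target is itself an internal node or the root, which is what makes following a link a constant-time jump instead of a walk from the root.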

  4. What is achieved so far? Not so much: even with suffix links, the worst-case running time is still O(m^2) for a phase.

  5. Trick 1: Skip/Count Trick. After walking up an edge labeled γ to node v and following the suffix link, there must be a path labeled γ leading down from s(v).

  6. Trick 1: Skip/Count Trick
     • There must be a path labeled γ from s(v).
     • Walking down along γ character by character takes time proportional to |γ|.
     • The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
     [Figure: a path labeled zabcdefghy through several nodes, with edge lengths 2, 2, 3, 3.]
     • But what does it buy in terms of worst-case bounds?
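
A minimal sketch of the skip/count walk, assuming each node stores its outgoing edges keyed by first character and each edge stores a half-open, 0-based index range into the text. The class `Node` and the helpers `skip_count` and `build_chain` are hypothetical names chosen for this example.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (start, end, child), label = text[start:end]
        self.children = {}


def skip_count(text, node, lo, hi):
    """Walk down from `node` along text[lo:hi], which is known to be in the tree.

    Only the first character of each edge is inspected; the rest of the edge
    is skipped by comparing lengths. Returns (node, first_char, offset, hops):
    the walk ends `offset` characters into the edge starting with `first_char`
    below `node`, or exactly at `node` when first_char is None.
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:          # the path ends inside this edge
            return node, text[lo], hi - lo, hops
        lo += edge_len                  # skip the whole edge in one step
        node = child
        hops += 1
    return node, None, 0, hops


def build_chain(text, lengths):
    """Build a single root-to-leaf path whose edges have the given lengths."""
    nodes = [Node() for _ in range(len(lengths) + 1)]
    pos = 0
    for k, length in enumerate(lengths):
        nodes[k].children[text[pos]] = (pos, pos + length, nodes[k + 1])
        pos += length
    return nodes


text = "zabcdefghy"
nodes = build_chain(text, [2, 2, 3, 3])
print(skip_count(text, nodes[0], 0, len(text))[3])  # edges hopped, not characters compared
```

On the path from the slide, the walk over ten characters costs four hops, one per edge.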

  7. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
     [Figure: example node-depths, labeled s(v)=1 with v=2, s(v)=3 with v=3, and s(v)=5 with v=4.]

  8. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
     Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
     In a single extension, the algorithm:
     • walks up at most one edge
     • finds a suffix link and traverses it
     • walks down some number of nodes
     • applies the suffix extension rules
     • and may add a suffix link
     All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

  9. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
     Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
     In a single extension, the algorithm:
     • walks up at most one edge, which decreases the current node-depth by at most one
     • finds a suffix link and traverses it, which decreases the node-depth by at most another one
     • walks down some number of nodes, and each step of the down-walk moves to a greater node-depth
     • applies the suffix extension rules
     • and may add a suffix link
     All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
     • Over the entire phase, the current node-depth is decremented at most 2m times.
     • Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase.
     • So the total number of edge traversals is bounded by 3m.
     • Since each edge traversal takes constant time, all the down-walking in a phase is O(m).
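
The counting argument behind Theorem 6.1.1 can be written out as follows. This is a sketch of the bookkeeping only, using the facts stated on the slide: a phase has at most m extensions, each extension lowers the current node-depth by at most 2, and no node is deeper than m.

```latex
\begin{align*}
\text{total decrements in a phase}
  &\le 2 \cdot (\text{number of extensions in the phase}) \le 2m, \\
\text{total increments in a phase}
  &\le \underbrace{m}_{\text{maximum node-depth}}
     + \underbrace{2m}_{\text{total decrements}} = 3m .
\end{align*}
```

Each edge traversed in a down-walk is one increment, so a phase traverses at most 3m edges.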

  10. Complexity
     • There are m phases.
     • Each phase takes O(m).
     • So the running time is O(m^2).
     Two more tricks and we are done.

  11. Simple Implementation Detail
     • A suffix tree may require O(m^2) space if edge labels are written out explicitly.
     • Consider the string abcdefghijklmnopqrstuvwxyz.
     • Every suffix begins with a distinct character, so there are 26 edges out of the root, each labeled with a complete suffix.
     • This requires 26 x 27 / 2 = 351 characters in all.
     • So O(m) time is impossible to achieve with this representation.
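
The arithmetic on this slide can be checked with a short sketch. The function name `explicit_label_chars` is hypothetical, and it covers only the special case from the slide, where all characters are distinct and every suffix is its own edge out of the root.

```python
import string


def explicit_label_chars(s):
    """Characters needed to write out every edge label in full.

    Assumes all characters of s are distinct, so the tree is a root with
    one edge per suffix and each edge is labeled with the whole suffix.
    """
    assert len(set(s)) == len(s), "sketch only covers strings with distinct characters"
    return sum(len(s) - j for j in range(len(s)))


alphabet = string.ascii_lowercase
print(explicit_label_chars(alphabet))  # m(m+1)/2 for m = 26
```

The total grows as m(m+1)/2, so merely writing the labels already costs quadratic time and space.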

  12. Alternative Representation of Suffix Tree: Edge-Label Compression
     [Figure: a fragment of the suffix tree, shown once with full edge labels and once with the edge labels compressed to pairs of indices; one of the labels could also be written as 8,9.]
     The number of edges is at most 2m - 1, and two numbers are written on each edge, so the space is O(m).
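
A minimal sketch of edge-label compression, assuming 1-based inclusive index pairs as in the slides. The class name `Edge` and its `label` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int  # 1-based position of the first label character
    end: int    # 1-based position of the last label character (inclusive)

    def label(self, text):
        """Recover the label from the text; nothing but two integers is stored."""
        return text[self.start - 1:self.end]


text = "MISSISSIPI"
for edge in (Edge(2, 5), Edge(6, 9), Edge(9, 9)):
    print(edge, edge.label(text))
```

Each edge costs two integers regardless of how long its label is, which is where the O(m) space bound comes from.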

  13. [Figure: suffix tree for MISSISSIPI (positions 1234567890) in phase 7 (MISSISS, extensions 1234567) and phase 8 (MISSISSI, extensions 12345678), with leaves 1 to 4 and the explicit and implicit extensions marked.]
     Observation 1: Rule 3 is a show stopper. Once an extension finds that its suffix is already in the tree, the same holds for all remaining extensions of the phase, so we stop further extension.
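
The show-stopper observation can be checked by brute force. The sketch assumes Gusfield's rule numbering, where Rule 3 is the case "the suffix is already in the tree", and it tests that case with a substring check instead of a tree; `rule3_applies` and `first_rule3` are hypothetical helper names.

```python
def rule3_applies(s, j, phase):
    """True if extension j of the given phase finds S[j..phase] already present.

    The tree before the phase contains exactly the substrings of
    S[1..phase-1], so a substring test stands in for the tree lookup.
    Indices j and phase are 1-based, as in the slides.
    """
    return s[j - 1:phase] in s[:phase - 1]


def first_rule3(s, phase):
    """Smallest extension index at which Rule 3 applies, or None."""
    for j in range(1, phase + 1):
        if rule3_applies(s, j, phase):
            return j
    return None


s = "MISSISSIPI"
for phase in (7, 8, 9):
    print(phase, first_rule3(s, phase))
```

Once the check succeeds for some j it succeeds for every later extension of that phase, because a suffix of a string that already occurs also occurs.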

  14. [Figure: suffix tree for MISSISSIPI (positions 1234567890) going from phase 7 (MISSISS) to phase 8 (MISSISSI). The leaf edges of leaves 1 to 4 carry the index pairs 1,7, 2,7, 3,7 and 4,7; setting the shared end index e = 8 extends all of them. Explicit extensions are the major cost.]
     Observation 2: Once a leaf, always a leaf.

  15. [Figure: suffix tree for MISSISSIPI (positions 1234567890) going from phase 7 (MISSISS) to phase 8 (MISSISSI). The leaf edges of leaves 1 to 4 carry the index pairs 1,7, 2,7, 3,7 and 4,7; setting the shared end index e = 8 extends all of them. Explicit extensions are the major cost.]
     • Once a leaf, always a leaf.
     • At any phase, the cost is only for the explicit extensions.
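
A minimal sketch of why the leaf extensions cost nothing: if every leaf edge points at one shared end index e, a single assignment extends all leaves at once. The classes `GlobalEnd` and `LeafEdge` are hypothetical names, with 1-based inclusive indices as in the slides.

```python
class GlobalEnd:
    """Shared, mutable end index e used by every leaf edge."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    def __init__(self, start, end):
        self.start = start  # 1-based start position, fixed when the leaf is created
        self.end = end      # reference to the shared GlobalEnd, not a copy

    def label(self, text):
        return text[self.start - 1:self.end.value]


text = "MISSISSIPI"
e = GlobalEnd(7)
leaves = [LeafEdge(start, e) for start in (1, 2, 3, 4)]
print([leaf.label(text) for leaf in leaves])  # labels at the end of phase 7

e.value = 8  # one assignment performs all leaf extensions of phase 8
print([leaf.label(text) for leaf in leaves])
```

Only the extensions that are not leaf extensions need real work, which is the sense in which the cost of a phase lies in its explicit extensions.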

  16. [Figure: suffix tree for MISSISSIPI (positions 1234567890) after phase 9 (MISSISSIP, extensions 123456789), with e = 9. Edge labels shown include 1,9, 2,9, 3,9, 4,9, 6,9, 2,5 and several 9,9; the leaves are numbered 1 to 9.]
     • Once a leaf, always a leaf.
     • At any phase, the cost is only for the explicit extensions.

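  17. [Figure: MISSISSIPI (positions 1234567890) with the extension indices 8 : 12345 and 9 : 123456789.]
     • Since there are only m phases, and consecutive phases share at most one explicit extension, the total number of explicit extensions is bounded by 2m.
     • So the total amount of down-walking is bounded by O(m).
     • In other words, the time to construct the suffix tree is bounded by O(m).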
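
The 2m bound can be sanity-checked with a brute-force count. The sketch below simulates only the bookkeeping: a phase resumes at the first extension that is not yet a leaf and stops at the first extension whose suffix is already present. It uses a substring test in place of the tree, and `count_explicit_extensions` is a hypothetical name.

```python
def count_explicit_extensions(s):
    """Count the explicit extensions performed over all phases.

    Extensions 1..last_leaf are leaf extensions and are done implicitly.
    Explicit work starts at last_leaf + 1 and ends with the first extension
    whose suffix already occurs in the previous prefix (the show stopper).
    """
    last_leaf = 0
    total = 0
    for phase in range(1, len(s) + 1):
        j = last_leaf + 1
        while j <= phase:
            total += 1
            if s[j - 1:phase] in s[:phase - 1]:  # already present: stop the phase
                break
            last_leaf = j                        # a new leaf j was created
            j += 1
    return total


s = "MISSISSIPI"
print(count_explicit_extensions(s), "explicit extensions, bound", 2 * len(s))
```

Every explicit extension either creates a new leaf (at most m times in total) or ends a phase (at most once per phase), which gives the 2m bound.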

  18. Reference
     • Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6
