Faster Suffix Tree Construction With Missing Suffix Links

Faster Suffix Tree Construction With Missing Suffix Links, by Richard Cole and Ramesh Hariharan. Presented by B89502027 黃輔中 (Information Management, 3rd year) and B89705013 蔡欣穆 (Information Management, 4th year)

What is a “Missing Suffix Link”? • The definition of a suffix link implies that str( link(x) ) is str(x) with its 1st symbol removed • link(x) must be a “NODE” • When the position for link(x) is not a node, the suffix link is missing! • This happens in 2 situations • Parameterized strings • Suffix trees for 2-dimensional arrays

The problem is… • Parameterized strings and 2D arrays • The node degree may not be bounded by a constant; it may be polynomial in n • Farach [5]: handles the polynomial case, but not missing suffix links • Baker [1], Kosaraju [11]: handle parameterized strings, but not the polynomial case (O(n log n)) • Giancarlo [7]: handles 2D arrays, but still in O(n log n) • We can solve both in O(n)!!!

Our contribution & tools • Adding extra nodes and suffix links to the suffix tree while staying within O(n) space and O(n) time • A hashing scheme whose failure probability is inverse exponential

General Settings • Quasi-suffix collection • An ordered collection of strings s_1, s_2, …, s_n is a quasi-suffix collection iff the following hold • 1. |s_1| = n and |s_i| = |s_(i-1)| - 1, therefore |s_n| = 1 • 2. No s_i is a prefix of another s_j • 3. If s_i and s_j have a common prefix of length L > 0, then s_(i+1) and s_(j+1) have a common prefix of length at least L - 1 • Example: aabb$ abb$ bb$ b$
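The three conditions can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `lcp_len` and `is_quasi_suffix_collection` are made up for illustration, and the check is a brute-force one over all pairs, meant only to make the definition concrete.

```python
from itertools import combinations


def lcp_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_quasi_suffix_collection(strings):
    """Brute-force check of conditions 1-3 on an ordered list s_1, ..., s_n."""
    n = len(strings)
    # Condition 1: |s_1| = n and each string is one symbol shorter than the last.
    if any(len(s) != n - k for k, s in enumerate(strings)):
        return False
    # Condition 2: no string is a prefix of another.
    for a, b in combinations(strings, 2):
        if lcp_len(a, b) == min(len(a), len(b)):
            return False
    # Condition 3: a common prefix of length L > 0 for (s_i, s_j) forces a
    # common prefix of length >= L - 1 for (s_(i+1), s_(j+1)).
    for i, j in combinations(range(n - 1), 2):
        L = lcp_len(strings[i], strings[j])
        if L > 0 and lcp_len(strings[i + 1], strings[j + 1]) < L - 1:
            return False
    return True
```

The suffixes of an ordinary string ending in a unique `$` (including the suffix `$` itself) pass the check, while a hand-made list such as `["abcd", "abx", "zz", "q"]` fails condition 3.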

General Settings (cont’d) • Multiple quasi-suffix collection • Several quasi-suffix collections containing L strings in all • Every pair of strings s_i, s_j satisfies conditions 2 & 3 of a quasi-suffix collection • Character oracle • Supplies the i-th character of the j-th string of the collection on demand in O(1) time

Suffix trees for parameterized strings • Each suffix s of the string s’ is transformed to num(s), e.g., ?b?b?$ => 0b2b2$ • How does condition 1 hold? • How does condition 2 hold? • How does condition 3 hold? • Example: 0bb3b2$ bb0b2$ b0b2$ 0b2$ b0$ 0$
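A minimal sketch of one reading of num(s) that reproduces both examples on this slide: each parameter symbol is replaced by the distance to its previous occurrence within the same string, or 0 if it has none, and all other symbols are kept. The parameter symbol `x`, the text `xbbxbx$`, and the helper names `encoded_suffixes` and `show` are assumptions introduced for illustration.

```python
def num(s, params):
    """Encode s: a parameter symbol becomes the distance to its previous
    occurrence in s (0 if there is none); other symbols are unchanged."""
    last_seen = {}
    out = []
    for pos, ch in enumerate(s):
        if ch in params:
            out.append(pos - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = pos
        else:
            out.append(ch)
    return tuple(out)


def encoded_suffixes(text, params):
    """Encode every suffix on its own: num(suffix) is generally not a
    suffix of num(text), because distances reaching outside become 0."""
    return [num(text[k:], params) for k in range(len(text))]


def show(encoded):
    """Render an encoded string compactly (fine while distances are < 10)."""
    return "".join(str(tok) for tok in encoded)
```

With `params = {"x"}`, `show(num("xbxbx$", params))` gives `0b2b2$`, and the first six encoded suffixes of `xbbxbx$` are `0bb3b2$`, `bb0b2$`, `b0b2$`, `0b2$`, `b0$`, `0$`, matching the list on the slide.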

Suffix trees for 2D arrays • There are m+n-1 diagonals in an m x n array • For each diagonal, form a square array • Decompose each square array into “┘” shapes • Each “┘” is mapped to a number (Giancarlo [7]), so a square becomes a string num(s); these strings form a quasi-suffix collection (each with a different ending symbol) • Since there are m+n-1 diagonals, the m+n-1 squares give a multiple quasi-suffix collection
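The “┘” decomposition of a single square can be sketched as follows. This is an illustration under stated assumptions, not Giancarlo's construction: the function name `l_shapes`, the order of the cells inside a shape, and the use of a plain tuple in place of the number assigned to each shape are all choices made here.

```python
def l_shapes(square):
    """Split a k x k array into k "┘" shapes.

    Shape t holds the cells added when the top-left t x t block grows to
    (t+1) x (t+1): row t up to column t-1, then column t down to row t.
    """
    k = len(square)
    shapes = []
    for t in range(k):
        bottom_row = [square[t][c] for c in range(t)]
        right_col = [square[r][t] for r in range(t + 1)]
        shapes.append(tuple(bottom_row + right_col))
    return shapes
```

Shape t has 2t+1 cells, so the k shapes together cover all k*k cells exactly once, and the square reads as a string of k "characters".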

First! McCreight’s Algorithm • Definition of suffix link • Since condition 3 is always satisfied with equality, a suffix link is defined for each node x, and link(x) is a node • Two stages: rescanning and, possibly, scanning • Rescan down from link( par(x) ) until the position for link(x) is found • If no node is present there, insert one and an edge for the leaf (no scan) • Otherwise, just scan down (as we did in Ukkonen) • In either case, link(x) is well defined!

Two problems • link( par(x) ) may not be defined • There may be no node at link(x)! • Both arise because condition 3 need not be satisfied with equality, e.g., in our parameterized string case!
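A small brute-force experiment makes the second problem concrete. The sketch below is not the paper's algorithm: it lists the internal nodes of the compacted trie as the pairwise longest common prefixes of the encoded suffixes, then reports every node x for which str(x) with its first position dropped is not itself a node. The text `xaxzay$`, the parameter set, the encoding used for num, and all function names are assumptions made for illustration.

```python
from itertools import combinations


def num(s, params):
    """Parameter symbols become the distance to their previous occurrence
    in s (0 if none); other symbols are unchanged."""
    last_seen, out = {}, []
    for pos, ch in enumerate(s):
        if ch in params:
            out.append(pos - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = pos
        else:
            out.append(ch)
    return tuple(out)


def lcp(a, b):
    """Longest common prefix of two tuples."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def missing_links(text, params):
    """Return str(x) for each internal node x whose link target is no node."""
    suffixes = [num(text[k:], params) for k in range(len(text))]
    # Internal nodes are exactly the pairwise LCPs; remember one suffix
    # index i that passes through each node. The root is always a node.
    witness = {(): 0}
    for i, j in combinations(range(len(suffixes)), 2):
        witness.setdefault(lcp(suffixes[i], suffixes[j]), i)
    missing = []
    for label, i in witness.items():
        if label:  # the root needs no link
            target = suffixes[i + 1][: len(label) - 1]
            if target not in witness:
                missing.append(label)
    return missing
```

For `xaxzay$` with parameters `x`, `y`, `z`, the node with label `(0, 'a')` is reported: both suffixes that start with `a` continue with `0`, so the position for its link lies inside an edge. For an ordinary string such as `banana$` with no parameters, nothing is reported.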

Our Algorithm • Two modifications to McCreight’s • Traverse up to find an ancestor with a suffix link • Copy nodes backwards from the destination found above • Re-definition of suffix link • link(x) is the node y such that if str(x) is the longest common prefix of s_i and s_j, then str(y) is the longest common prefix of s_(i+1) and s_(j+1), where |str(y)| = |str(x)| - 1 • link(x) need not be defined for every node x!

Some definitions • nanc(x): the nearest ancestor of x with a suffix link • Real/imaginary nodes • If a new scanning stage begins within an edge (condition 3 holds with strict inequality), we use an imaginary node • An imaginary node has only 1 child, whereas a real node has at least 2! • At most O(n) real and imaginary nodes (since there are at most n leaves)
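The definition of nanc(x) amounts to a walk up the parent pointers. A minimal sketch, assuming a hypothetical `Node` record with `parent` and `link` fields, and assuming the root carries a suffix link to itself so that the walk always stops:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    name: str
    parent: Optional["Node"] = None
    link: Optional["Node"] = None  # suffix link; None when it is missing


def nanc(x):
    """Nearest proper ancestor of x that has a suffix link.

    Assumes x is not the root and that the root's link is set.
    """
    y = x.parent
    while y.link is None:
        y = y.parent
    return y
```

In McCreight’s original setting every internal node has a link, so the walk stops at par(x) at once; with missing links it may climb several levels.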

Some facts • The number of real and imaginary nodes is O(n) • The total number of children of real and imaginary nodes is O(n) • The total length of the scanned portions is O(n)

More features • Back-propagated nodes • Must have a suffix link • Have only one child • When scanning down from link(nanc(x)) to link(x), every second node (not including the first and the last) is back-propagated.

Invariant 1 • If a node x is back-propagated in direction u, then its parent is not back-propagated in direction u’ where u’ is a prefix of u.

Time Complexity • Two things to analyze • (1) Finding nanc(x) • (2) Rescanning down • Creating a new back-propagated node • Upgrading an imaginary node to a back-propagated node by adding a suffix link to it! • Adding a real/imaginary node for link(x) • Time = O(1) + (1) + (2)

Bounding back-propagated nodes • Defining a BP tree • All nodes except the root are back-propagated nodes • BP forest • Trees rooted at the various real/imaginary nodes that are back-propagated (imagine the suffix tree as a BP forest!) • Decomposing a BP tree into paths • Each path runs from the root down to a node y such that either • 1. there is no valid direction for y, or • 2. there exists a direction u in which y has not been back-propagated! • Decompose recursively

Bounding back-propagated nodes (cont’d) • Extend the paths backward on the suffix tree (in a direction not implied by a back-propagated node) until either • 1. a node is reached, or • 2. no valid direction is available • Lemma 1: two distinct extended paths cannot intersect • Lemma 2: if an extended path terminates at a node y (not by running out of valid directions), y cannot be a back-propagated node • Lemma 3: the total number of paths is O(n), and hence the total number of back-propagated nodes is O(n)

Time Complexity (cont’d) • Finding nanc(x) is analyzed the same way as discussed for Ukkonen, and is bounded by O(n) • Combining this with Lemma 3, we have the theorem

The Hashing Scheme • Goal • Hash O(n) pairs [node#, following symbol] • Space complexity O(n), time complexity O(1) per query • Failure probability inverse exponential

FKS Perfect Hashing • Fredman, Komlos, Szemeredi • Refer to the textbook for the algorithm • Hashes n items from the range [0…poly(n)] into [0…Θ(n)] • Ensures the probability of no collision is >= ½
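For readers without the textbook at hand, here is a minimal sketch of the usual two-level FKS construction for a static set of integer keys. It is a generic textbook-style version, not the scheme analyzed in the paper: the class name `FKSTable`, the prime `P`, the `4 * n` acceptance threshold, and the encoding of a [node#, symbol] pair as `node * 256 + symbol` are assumptions made for illustration.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


class FKSTable:
    """Static perfect hash set: O(n) space, O(1) worst-case membership."""

    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        keys = list(set(keys))
        n = max(1, len(keys))
        # Level 1: retry until the squared bucket sizes sum to O(n).
        while True:
            self.h = random_hash(n, rng)
            buckets = [[] for _ in range(n)]
            for k in keys:
                buckets[self.h(k)].append(k)
            if sum(len(b) ** 2 for b in buckets) <= 4 * n:
                break
        # Level 2: a bucket of size s gets s*s slots; each trial is
        # collision-free with probability at least 1/2, so retry on failure.
        self.tables = []
        for bucket in buckets:
            size = len(bucket) ** 2
            while True:
                g = random_hash(size, rng) if size else None
                slots = [None] * size
                ok = True
                for k in bucket:
                    pos = g(k)
                    if slots[pos] is not None:
                        ok = False
                        break
                    slots[pos] = k
                if ok:
                    break
            self.tables.append((g, slots))

    def __contains__(self, key):
        g, slots = self.tables[self.h(key)]
        return bool(slots) and slots[g(key)] == key
```

A query evaluates two hash functions and one comparison, which is the O(1) lookup the goal slide asks for; the retries are the "several trials" that the static scheme has to account for.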

The Static Hashing Scheme • Choose a positive constant ε • As ε→0, the failure probability decreases • Total time & space of the data structure are linear, with a factor of 1/ε • We hash n items from the range [1 … n^c] into an imaginary array A of size n^c

The Static Hashing Scheme (cont’d) • Step 1 (build the partition tree) • # of nodes is O(n) • Has n^ε children • Each child is associated with a distinct subarray of A of size n^(c-ε) • Each leaf (subarray) with more than n^ε items is recursively partitioned • Total size O(n)
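Step 1 can be sketched as a recursive range split. This is an illustration under assumptions, not the paper's data structure: the function name `partition_leaves` is made up, the range is 0-based instead of [1 … n^c], `fanout` and `cap` both stand in for n^ε, and empty subarrays are simply never materialized.

```python
import math


def partition_leaves(items, lo, size, fanout, cap):
    """Recursively split the range [lo, lo + size) into `fanout` equal
    subranges until every leaf holds at most `cap` items.

    Returns a list of (lo, size, items) leaves; empty subranges are skipped.
    Assumes distinct integer items, fanout >= 2 and cap >= 1.
    """
    if len(items) <= cap or size <= 1:
        return [(lo, size, sorted(items))]
    child_size = math.ceil(size / fanout)
    groups = {}
    for x in items:
        groups.setdefault((x - lo) // child_size, []).append(x)
    leaves = []
    for idx in sorted(groups):
        leaves += partition_leaves(groups[idx], lo + idx * child_size,
                                   child_size, fanout, cap)
    return leaves
```

For example, with n = 16, c = 2 and ε = 1/2 the range has size 256, `fanout = cap = 4`, and the children of the root cover subarrays of size 64 = n^(c-ε).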

The Static Hashing Scheme (cont’d) • Step 2 • Use FKS perfect hashing • Several trials will be required, since each succeeds with probability only ½ • Computing the total time complexity • Total size of the subproblems is n • Each subproblem has size at most n^ε

The Static Hashing Scheme (cont’d) • Size categories • Divide the leaves into O(log n) categories • For category i, the leaf sizes are in the range n^ε/4^(i+1) … n^ε/4^i, for i >= 0 • We will show that • the time for this category is proportional to the sum of the sizes of the leaves in this category + O(n/2^i) • With failure probability • It follows that the total time is O(n) with failure probability

The Static Hashing Scheme (cont’d) • Succeed • The items in a leaf are perfectly hashed • Round • One trial for each of the relevant leaves • Group • An organization of rounds

The Static Hashing Scheme (cont’d) • How do we group the rounds? • Group 0: all rounds before the number of unsuccessful leaves in a category drops below n^(1-ε) * 2^i / (log n) • Group j: all rounds during which the number of unsuccessful leaves in the category is between n^(1-ε) * 2^i / (2^j * log n) and n^(1-ε) * 2^i / (2^(j-1) * log n) (j >= 1)

The Static Hashing Scheme (cont’d) • We will show the number of rounds in each group and its failure probability • Group 0: # of rounds O( i + log log n ) with failure probability • Group j: # of rounds O( 2^j ) with failure probability • Failure probability: (over all groups) • First of all, we show the total time taken in the j groups:

The Static Hashing Scheme (cont’d) • Second: we bound the rounds in group 0 • Leaves in the i-th category number at most n / (n^ε/4^(i+1)) • n / (n^ε/4^(i+1)) * (1/2)^x = n^(1-ε) * 2^i / log n ==> x = 2 + i + log log n (the rounds in group 0) • By the Chernoff bound [2]: if there are #u unsuccessful leaves at some instant, then half of these leaves succeed in the next 2k rounds, with failure probability 1/2^Θ(#u*k) • The failure probability at the end of group 0 is thus (k=1)
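The round count for group 0 can be checked numerically. The sketch below solves the equation on this slide for x and compares it with 2 + i + log log n; it assumes all logarithms are base 2, which is what the halving factor (1/2)^x suggests, and the function name `rounds_in_group_0` is made up for illustration.

```python
import math


def rounds_in_group_0(n, eps, i):
    """Solve leaves * (1/2)**x = threshold for x, where
    leaves    = n / (n**eps / 4**(i+1))          (bound on category-i leaves)
    threshold = n**(1-eps) * 2**i / log2(n)      (end of group 0)."""
    leaves = n / (n ** eps / 4 ** (i + 1))
    threshold = n ** (1 - eps) * 2 ** i / math.log2(n)
    return math.log2(leaves / threshold)


def closed_form(n, i):
    """The value stated on the slide: x = 2 + i + log log n."""
    return 2 + i + math.log2(math.log2(n))
```

The n^(1-ε) factors cancel, leaving 2^x = 4^(i+1) * log n / 2^i = 2^(i+2) * log n, so the result does not depend on ε.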

The Static Hashing Scheme (cont’d) • We bound the rounds in group j • k = 2^j • Has 2*2^j rounds • #u = n^(1-ε) * 2^i / (2^j * log n) • With failure probability • In total O(log n) groups, thus total failure probability
