This article introduces the concept of suffix arrays and presents a search algorithm for efficient on-line string searches. It explains how to construct a suffix array in O(N log N) time and O(N) space, how the longest common prefix (lcp) information is computed, and how lcp values are used to improve search efficiency.
Suffix Arrays: A new method for on-line string searches • Udi Manber • Gene Myers, May 1989. Presented by: Oren Weimann
Introduction - Problem definition
“Is W a substring of A?”
• $|A| = N$ and $|W| = P$
• $A = a_0 a_1 \dots a_{N-1}$
• $A_i$ = suffix beginning at index $i$ = $a_i a_{i+1} \dots a_{N-1}$
Example: W = badgfbb, A = abccbbadgfbbcahgjf (W occurs in A starting at index 5)
Introduction – what is a suffix array? Example: A = assassin, Pos[2] = 6 ($A_6$ = in)
Introduction – what is a suffix array? A lexicographically sorted array, Pos[N], of all the suffixes of A: Pos[k] = i $\iff$ $A_i$ is the kth smallest suffix in the set $\{A_0, A_1, A_2, \dots, A_{N-1}\}$
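The definition above can be checked with a minimal sketch that builds Pos by sorting the suffix start positions directly. The function name `suffix_array` is an assumption made for illustration, and this naive sort is not the O(N log N) construction the slides describe later.

```python
def suffix_array(text):
    """Return Pos: the start indices of all suffixes of text in lexicographic order."""
    # Naive construction: sort the start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = suffix_array("assassin")
print(pos)     # [0, 3, 6, 7, 2, 5, 1, 4]
print(pos[2])  # 6, because A_6 = "in" is the suffix of rank 2
```

Slicing every suffix makes this quadratic in the worst case, so it is only suitable for small examples.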
Introduction – what is a suffix tree? • A trie that contains all suffixes of A. Example: A = assassin [Figure: the suffix trie of "assassin", with each leaf labeled by the start position of its suffix]
The Article Overview
• A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).
• How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known).
• An algorithm for computing the lcp information in O(N log N).
• Algorithms for expected-time improvement.
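The Search algorithm - Definitions
• For any string u, $u^p = u_1 u_2 u_3 \dots u_p$ is its prefix of length p (or u itself if $|u| < p$).
• Let $\le$ denote lexicographical order. We say $u \le_p v \iff u^p \le v^p$ (and similarly for $<_p$, $=_p$, $\ge_p$, $>_p$).
• Note that for any choice of p: $u \le v \Rightarrow u \le_p v$, so Pos is also ordered according to $\le_p$.
• Note that W is a substring of A $\iff$ there is an i such that $W =_P A_i$.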
The Search algorithm – how does the array help us know if W is a substring of A?
• We define a search interval:
$L_W = \min\{k \mid W \le_P A_{Pos[k]} \text{ or } k = N\}$
$R_W = \max\{k \mid W \ge_P A_{Pos[k]} \text{ or } k = -1\}$
• W matches $a_i a_{i+1} \dots a_{i+P-1}$ $\iff$ i = Pos[k] for some $k \in [L_W, R_W]$
Example: A = assassin [Figure: the Pos array with three search intervals, labeled Option 1, Option 2 and Option 3]
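Why finding $L_W$, $R_W$ == finding the matches:
• If $L_W > R_W$, then W is not a substring of A.
• Else: there are $(R_W - L_W + 1)$ matches: $A_{Pos[L_W]}, \dots, A_{Pos[R_W]}$
[Figure: in Pos, entries before $L_W$ satisfy $W >_P A_{Pos[k]}$ and entries after $R_W$ satisfy $W <_P A_{Pos[k]}$]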
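The Search algorithm – The easy way - O(P log N)
A binary search over Pos takes log(N) iterations. Each iteration sets new L, R bounds (initially L = 0, R = N-1) according to a comparison of W with $A_{Pos[M]}$, where M = (L+R)/2. Each comparison costs at most P symbol comparisons, which gives O(P log N) in total. In the end, $L_W = R$.
[Figure: Pos with the pointers L, M and R, for W = “abcx”]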
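The easy O(P log N) search can be sketched as two plain binary searches over Pos, one for $L_W$ and one for $R_W$. This is a minimal illustration rather than the slides' exact loop: the names `suffix_array` and `search_interval` are assumptions, Pos is built naively, and the half-open `lo`/`hi` bounds differ slightly from the L, R bounds on the slide.

```python
def suffix_array(text):
    """Naive Pos: suffix start indices in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def search_interval(text, pos, w):
    """Return (L_W, R_W); w occurs in text exactly when L_W <= R_W."""
    n, p = len(text), len(w)

    def prefix(k):
        # First p symbols of the k-th smallest suffix (fewer if it ends early).
        return text[pos[k]:pos[k] + p]

    # L_W: smallest k whose suffix prefix is >= w (N if there is none).
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    l_w = lo

    # R_W: largest k whose suffix prefix is <= w (-1 if there is none).
    lo, hi = l_w, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return l_w, lo - 1


text = "assassin"
pos = suffix_array(text)
l_w, r_w = search_interval(text, pos, "ss")
print((l_w, r_w))                                    # (6, 7)
print(sorted(pos[k] for k in range(l_w, r_w + 1)))   # [1, 4]
print(search_interval(text, pos, "x"))               # (8, 7): no match
```

Each probe compares up to P symbols, which is where the P log N factor comes from.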
The Search algorithm using lcp values in O(P + log N) – Definitions:
Speedup using precomputed lcp values; for now we assume the lcp information is known. In each iteration we define:
• l = lcp($A_{Pos[L]}$, W)
• r = lcp(W, $A_{Pos[R]}$)
• Llcp[M] = lcp($A_{Pos[L]}$, $A_{Pos[M]}$)
• Rlcp[M] = lcp($A_{Pos[M]}$, $A_{Pos[R]}$)
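The Search algorithm using lcp values in O(P + log N)
Example: W = “abcx”, l = 3, r = 2, Llcp[M] = 4, Rlcp[M] = 2 [Figure: Pos with the pointers L, M and R]
Note that Llcp[M] is well defined because every midpoint M has exactly one $L_M$ and one $R_M$: M is the midpoint of exactly one search interval.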
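So how do we use l, r, Llcp[M]? Example: W = abcx, Llcp[M] = 4, l = 3, r = 2
(The cases here assume $l \ge r$; when $l < r$ the same reasoning applies with Rlcp[M] and r.)
Case 1: Llcp[M] > l (Llcp[M] = 4 and l = 3)
• $W > A_{Pos[L]}$ $\Rightarrow$ $W > A_{Pos[M]}$
• Go right
• l is unchanged = 3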
Example: W = abcx (cont.), Llcp[M] = 2, l = 3, r = 2
Case 2: Llcp[M] < l (Llcp[M] = 2 and l = 3)
• $A_{Pos[L]} < A_{Pos[M]}$ $\Rightarrow$ $W < A_{Pos[M]}$
• Go left
• r = Llcp[M] = 2
Example: W = abcx (cont.), Llcp[M] = 3, l = 3, r = 2
Case 3: Llcp[M] = l (Llcp[M] = 3 and l = 3)
• Compare W and $A_{Pos[M]}$ symbol by symbol starting at position l, until the first j with $w_{l+j} \ne a_{Pos[M]+l+j}$
• Go right or left according to $w_{l+j}$ and $a_{Pos[M]+l+j}$
• The new l or r = (l + j)
• Number of comparisons = j + 1
The Search algorithm using lcp values – complexity
In each iteration there are at most j + 1 comparisons, where in total $\sum j \le P$ (max(l, r) never decreases and never exceeds P).
• Total comparisons $\le$ P + #iterations
• O(P + log N) running time
• Requires only 3 N-sized arrays (Pos, Llcp, Rlcp)
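The three cases can be folded into one loop that finds $L_W$, sketched below under stated assumptions. The helper names (`build_tables`, `extend`, `w_le`, `find_lw`) are hypothetical, Pos, Llcp and Rlcp are built naively here purely so the example is self-contained, and the text is assumed to be non-empty. The branch for $l < r$ mirrors the $l \ge r$ cases using Rlcp[M].

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def build_tables(text):
    """Naively build Pos, Llcp and Rlcp for every binary-search midpoint."""
    n = len(text)
    pos = sorted(range(n), key=lambda i: text[i:])
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        left, right = stack.pop()
        if right - left > 1:
            mid = (left + right) // 2
            llcp[mid] = lcp(text[pos[left]:], text[pos[mid]:])
            rlcp[mid] = lcp(text[pos[mid]:], text[pos[right]:])
            stack += [(left, mid), (mid, right)]
    return pos, llcp, rlcp


def extend(text, start, w, m):
    """Given that w[:m] matches text[start:start+m], extend the match."""
    while m < len(w) and start + m < len(text) and text[start + m] == w[m]:
        m += 1
    return m


def w_le(text, start, w, m):
    """Is w <=_P text[start:], given m = lcp(w, text[start:])?"""
    if m == len(w):
        return True                 # w is a prefix of the suffix
    if start + m >= len(text):
        return False                # the suffix is a proper prefix of w
    return w[m] < text[start + m]


def find_lw(text, pos, llcp, rlcp, w):
    """Return L_W using the lcp tables; text must be non-empty."""
    n = len(text)
    l = extend(text, pos[0], w, 0)
    if w_le(text, pos[0], w, l):
        return 0
    r = extend(text, pos[n - 1], w, 0)
    if not w_le(text, pos[n - 1], w, r):
        return n
    left, right = 0, n - 1
    while right - left > 1:
        mid = (left + right) // 2
        if l >= r:
            # Cases 1 and 3 compare from position l; case 2 needs no comparison.
            m = extend(text, pos[mid], w, l) if llcp[mid] >= l else llcp[mid]
        else:
            m = extend(text, pos[mid], w, r) if rlcp[mid] >= r else rlcp[mid]
        if w_le(text, pos[mid], w, m):
            right, r = mid, m       # go left
        else:
            left, l = mid, m        # go right
    return right


text = "assassin"
pos, llcp, rlcp = build_tables(text)
print(find_lw(text, pos, llcp, rlcp, "sin"))  # 5, and pos[5] == 5 starts "sin"
```

Symbols of W are only re-examined once per iteration at most, because every comparison run starts at max(l, r).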
Construction of suffix array in O(N log N)
Sorting the suffixes with a variant of radix sort:
• We will have O(log N) stages (numbered 1, 2, 4, 8, 16, …).
• In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters.
• The next stage is 2H. Thus, in stage H the suffixes are sorted by $\le_H$.
Construction of suffix array – The general idea
If $A_i$ and $A_j$ are in the same H-bucket, we sort them by their next H symbols. But their next H symbols are the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in stage H.
[Figure, H=2: four buckets, with $A_i$, $A_j$, $A_{j+H}$ and $A_{i+H}$ marked]
Construction of suffix array – The general idea (cont.)
• Let $A_i$ be in the first H-bucket after stage H
• $\Rightarrow$ $A_i$ starts with the smallest H-symbol string
• $\Rightarrow$ $A_{i-H}$ should be first in its H-bucket
[Figure, H=2: $A_i$ and $A_{i-H}$]
Construction of suffix array – The algorithm
• Go over the suffix array:
• For each $A_i$: move $A_{i-H}$ to the next available place in its H-bucket
• The suffixes are now sorted according to $\le_{2H}$ order
• Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later)
Construction of suffix array – The algorithm. Example: A = assassin
H=1 ($A_i$ sets $A_{i-1}$), going over the array in order:
• $A_3$ sets $A_2$
• $A_0$ sets nothing (there is no $A_{-1}$)
• $A_6$ sets $A_5$
• $A_7$ sets $A_6$
• $A_2$ sets $A_1$
• $A_5$ sets $A_4$
• $A_1$ sets $A_0$
• $A_4$ sets $A_3$
Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.
H=2 ($A_i$ sets $A_{i-2}$), going over the array in order:
• $A_0$ sets nothing
• $A_3$ sets $A_1$
• $A_6$ sets $A_4$
• $A_7$ sets $A_5$
• $A_2$ sets $A_0$
• $A_5$ sets $A_3$
• $A_1$ sets nothing
• $A_4$ sets $A_2$
Go over the array to get the new 4-buckets.
H=4: That’s it, we are sorted!
Construction of suffix array – Complexity Summary
• Sorting by first char: O(N)
• O(log N) stages of O(N) operations = O(N log N)
• Total: time O(N log N), space 2 integer arrays of size N
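The doubling idea behind the stages (the 2H-order of $A_i$ is determined by the H-bucket of $A_i$ and the H-bucket of $A_{i+H}$) can be sketched as follows. Note that this is a simplified stand-in, not the bucket-moving procedure from the slides: it re-sorts with a comparison sort in every stage, so it runs in O(N log² N) rather than O(N log N). The function name and the `rank` array are assumptions made for illustration.

```python
def suffix_array_doubling(text):
    """Build Pos by sorting on the first H symbols for H = 1, 2, 4, ..."""
    n = len(text)
    if n == 0:
        return []
    # rank[i] identifies the H-bucket of A_i; initially H = 1 (first symbol).
    rank = [ord(c) for c in text]
    pos = list(range(n))
    h = 1
    while True:
        def key(i):
            # 2H-order key: (H-bucket of A_i, H-bucket of A_{i+H}).
            # -1 means A_i has no symbols past its first H, so it sorts first.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        # Go over the array and decide which suffixes open a new 2H-bucket.
        new_rank = [0] * n
        for k in range(1, n):
            opens_bucket = key(pos[k]) != key(pos[k - 1])
            new_rank[pos[k]] = new_rank[pos[k - 1]] + opens_bucket
        rank = new_rank
        if rank[pos[-1]] == n - 1:
            return pos              # every suffix is alone in its bucket
        h *= 2


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```

Replacing the comparison sort with the linear bucket pass described in the slides is what brings each stage down to O(N).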
How to find Longest Common Prefixes – the general idea
• We don’t care what the lcp is between suffixes in the same H-bucket.
• For $A_p$, $A_q$ in the same H-bucket but different 2H-buckets:
• $H \le$ lcp($A_p$, $A_q$) $< 2H$
• lcp($A_p$, $A_q$) = H + lcp($A_{p+H}$, $A_{q+H}$)
• lcp($A_{p+H}$, $A_{q+H}$) $< H$, which is why $A_{p+H}$ and $A_{q+H}$ are in different H-buckets. But which ones?
How to find Longest Common Prefixes – the general idea
• If $A_{p+H}$ and $A_{q+H}$ are in adjacent H-buckets, then the lcp is known. How?
• If not, then for $i < j$: lcp($A_{Pos[i]}$, $A_{Pos[j]}$) = $\min_{i \le k < j}$ lcp($A_{Pos[k]}$, $A_{Pos[k+1]}$)
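How to find Longest Common Prefixes – the general idea
lcp($A_{p+H}$, $A_{q+H}$) = min{1, 1, 2} = 1 [Figure, H=2: neighboring lcp values 1, 1, 2 between $A_{p+H}$ and $A_{q+H}$]
Notice that if two neighbors are in the same H-bucket we can consider their lcp to be H: since lcp($A_{p+H}$, $A_{q+H}$) $< H$, such entries never determine the minimum.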
How to find lcp – algorithm and data structures – Hgt[]
During the construction stage we build an array called Hgt[N]: Hgt[i] = lcp($A_{Pos[i-1]}$, $A_{Pos[i]}$), initialized so that Hgt[i] = N+1 for every i.
• In stage H=1: Hgt[i] = 0 for every $A_{Pos[i]}$ that is first in its bucket.
• In stage 2H: we update Hgt[i] for every $A_{Pos[i]}$ that is first in a newly created 2H-bucket.
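To make the role of Hgt[] concrete, here is a small sketch that fills Hgt by direct comparison and then answers an arbitrary lcp query as a minimum over adjacent entries. It does not follow the stage-by-stage update described above; `build_hgt` and `lcp_by_rank` are hypothetical helper names, and `hgt[0]` is an unused placeholder.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def build_hgt(text, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1; hgt[0] is unused."""
    return [0] + [lcp(text[pos[i - 1]:], text[pos[i]:])
                  for i in range(1, len(pos))]


def lcp_by_rank(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j: the minimum over adjacent pairs."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
hgt = build_hgt(text, pos)
print(hgt[1:])                  # [3, 0, 0, 0, 1, 1, 2]
print(lcp_by_rank(hgt, 4, 7))   # 1 = lcp("sassin", "ssin")
```

Scanning the range costs O(j - i) per query, which is what a tree over Hgt[] avoids.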
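How to find lcp – Hgt[] example (9 = N+1 marks a value that is not yet set):
H=1
suffixes in Pos order: assin, assassin | in | n | sin, ssin, sassin, ssassin
Hgt between neighbors: 9, 0, 0, 0, 9, 9, 9
H=2
suffixes in Pos order: assin, assassin | in | n | sassin | sin | ssin, ssassin
Hgt between neighbors: 9, 0, 0, 0, 1, 1, 9
lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(sin, n)} = 1 + 0 = 1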
How to find lcp – Hgt[] example (cont.)
H=4
suffixes in Pos order: assassin | assin | in | n | sassin | sin | ssassin | ssin
Hgt between neighbors: 3, 0, 0, 0, 1, 1, 2
lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3
lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2
How to find lcp – data structures
We need a data structure that gives lcp($A_{Pos[i]}$, $A_{Pos[j]}$) for any i and j (not just for i and i+1, which Hgt[] supplies). Hgt[] will become the leaves of a balanced binary tree called the interval tree.
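One way to picture the interval tree is as the recursion tree of the binary search itself: each internal node is a search interval (L, R) with midpoint M, its children are (L, M) and (M, R), and its leaves are the adjacent pairs whose lcp is Hgt[]. The sketch below, with the assumed name `interval_lcps`, fills Llcp and Rlcp bottom-up by taking the minimum of the two children. It is an illustration of the idea and not necessarily the layout used in the article.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def interval_lcps(hgt):
    """Derive Llcp[M] and Rlcp[M] for every search midpoint M from Hgt[]."""
    n = len(hgt)
    llcp, rlcp = [0] * n, [0] * n

    def visit(left, right):
        # Returns lcp(A_Pos[left], A_Pos[right]) for the interval (left, right).
        if right - left == 1:
            return hgt[right]                 # leaf: adjacent suffixes
        mid = (left + right) // 2
        llcp[mid] = visit(left, mid)          # lcp over (L_M, M)
        rlcp[mid] = visit(mid, right)         # lcp over (M, R_M)
        return min(llcp[mid], rlcp[mid])

    if n > 1:
        visit(0, n - 1)
    return llcp, rlcp


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
# hgt[i] = lcp of neighbors in Pos order; hgt[0] is an unused placeholder.
hgt = [0] + [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, len(pos))]
llcp, rlcp = interval_lcps(hgt)
print(llcp[1], rlcp[5])  # 3 1: lcp(assassin, assin) and lcp(sin, ssin)
```

The recursion depth is about log N and every midpoint is visited once, so the tables are filled in O(N) once Hgt[] is available.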