Combinatorial aspects of the Burrows-Wheeler transform. Sabrina Mantaci Antonio Restivo Marinella Sciortino. University of Palermo. Burrows-Wheeler Transform.
Burrows-Wheeler Transform
• In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on a preprocessing of the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
  • the transformed string is easier to compress than the original one;
  • the original string can be recovered.
• This preprocessing made it possible to define a class of lossless data compression algorithms that:
  • achieve a speed comparable to that of algorithms based on the Lempel-Ziv techniques;
  • obtain a compression ratio close to that of the best statistical modelling techniques.
How does BWT work?
• INPUT: w = abraca
• Lexicographically sort the cyclic rotations of w:

      F         L
  0   a a b r a c
  1   a b r a c a   ← I
  2   a c a a b r
  3   b r a c a a
  4   c a a b r a
  5   r a c a a b

• OUTPUT: BWT(w) = L = caraab and the index I = 1, denoting the position of the original word w after the lexicographic ordering.
• The following properties hold:
  • the character L[i] is followed in w (read cyclically) by F[i];
  • for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
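The construction above can be sketched in a few lines of Python. This is a naive illustration that sorts all rotations explicitly, not an efficient implementation; the function name `bwt` is chosen here for illustration only.

```python
def bwt(w):
    """Return (L, I): the last column of the sorted rotations of w
    and the row index of w itself (w is assumed non-empty)."""
    n = len(w)
    # All cyclic rotations of w, sorted lexicographically.
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(r[-1] for r in rotations)
    # Position of the original word among the sorted rotations.
    index = rotations.index(w)
    return last_column, index


print(bwt("abraca"))  # ('caraab', 1)
```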
Reversibility
The Burrows-Wheeler transform is reversible, in the sense that given BWT(w) and the index I, it is possible to recover w.
• Given L = BWT(w) = caraab and I = 1:
• Construct F by alphabetically sorting the letters of L.
• Define a permutation $\tau$ on {0,1,…,n-1} that maps the position of each letter in F to the position of the same letter in L:

  i          0 1 2 3 4 5
  F[i]       a a a b c r
  L[i]       c a r a a b
  $\tau(i)$  1 3 4 5 0 2

• Starting from position I, we can recover $w = w_0 \cdots w_{n-1}$ as follows: $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.
• Here the orbit of I = 1 under $\tau$ is 1, 3, 5, 2, 4, 0, which spells w = a b r a c a.
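The recovery procedure can be sketched as follows. The permutation is obtained with a stable sort of the positions of L, which pairs the i-th occurrence of each letter in F with its i-th occurrence in L. The names `inverse_bwt` and `tau` are illustrative choices.

```python
def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I."""
    n = len(L)
    F = sorted(L)
    # Stable sort of the positions of L by letter: tau[j] is the position
    # in L of the letter occurrence that sits at position j in F.
    tau = sorted(range(n), key=lambda j: L[j])
    letters = []
    x = I
    for _ in range(n):
        letters.append(F[x])  # w_i = F[tau^i(I)]
        x = tau[x]
    return "".join(letters)


print(inverse_bwt("caraab", 1))  # abraca
```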
• REMARK: Two words x and y are conjugate $\iff$ BWT(x) = BWT(y).
• PROPOSITION:
  • If $u = v^d$ and $\mathrm{BWT}(v) = a_0 a_1 \cdots a_{n-1}$, then $\mathrm{BWT}(u) = a_0^d a_1^d \cdots a_{n-1}^d$;
  • If $\mathrm{BWT}(v) = a_0 a_1 \cdots a_{n-1}$ and $\mathrm{BWT}(u) = a_0^d a_1^d \cdots a_{n-1}^d$, then there exists a conjugate u' of u such that $u' = v^d$.
• Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.
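Both statements are easy to check on a small example. The sketch below uses a naive rotation-sorting helper (`bwt_word`, a name chosen here) and the word abraca as an arbitrary test case.

```python
def bwt_word(w):
    """Last column of the sorted cyclic rotations of w (index omitted)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


v = "abraca"

# All conjugates (cyclic rotations) of v share the same transform.
conjugate_transforms = {bwt_word(v[i:] + v[:i]) for i in range(len(v))}
print(conjugate_transforms)  # {'caraab'}

# BWT(v^d) repeats each letter of BWT(v) d times.
d = 3
print(bwt_word(v * d))  # cccaaarrraaaaaabbb
```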
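Standard Words
• Let $d_1, d_2, \ldots, d_n, \ldots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i = 2, \ldots, n, \ldots$
• Consider the sequence $\{s_n\}_{n \ge 0}$ defined as: $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$.
• $s = \lim_{n \to \infty} s_n$ is a characteristic Sturmian word.
• $\{s_n\}_{n \ge 0}$ is called the approximating sequence of s.
• $(d_1, d_2, \ldots, d_n, \ldots)$ is the directive sequence of s.
• Each finite word $s_n$ is a standard word.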
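A minimal sketch of the approximating sequence, assuming the usual recurrence $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ (the defining formula is not legible on the slide, so this recurrence is an assumption). The function name `standard_words` is illustrative.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a finite directive
    sequence (d_1, ..., d_k), using s_{n+1} = s_n^{d_n} s_{n-1}."""
    s_prev, s_cur = "b", "a"
    words = [s_prev, s_cur]
    for d in directive:
        s_prev, s_cur = s_cur, s_cur * d + s_prev
        words.append(s_cur)
    return words


# The directive sequence (1, 1, 1, 1) yields the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```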
Characterization of standard words
• A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba), where v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (the extremal case of the Fine and Wilf theorem).
• A word w is standard if and only if it is a letter or there exist palindrome words P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.
• Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
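The first characterization can be tested directly. The sketch below assumes that an integer p counts as a period of v whenever v[i] = v[i+p] for every valid i, so that values of p up to |v|+1 hold vacuously; this convention is needed for short words such as ab and is an assumption made here. The function names are illustrative.

```python
from math import gcd


def periods(v):
    """Periods of v among 1..|v|+1 (p is a period if v[i] == v[i+p]
    wherever both positions exist; this holds vacuously for large p)."""
    n = len(v)
    return [p for p in range(1, n + 2)
            if all(v[i] == v[i + p] for i in range(n - p))]


def is_standard_by_periods(w):
    """Test w = v.ab or v.ba with coprime periods p, q and |v| = p+q-2."""
    if len(w) == 1:
        return True
    if w[-2:] not in ("ab", "ba"):
        return False
    v = w[:-2]
    ps = periods(v)
    return any(gcd(p, q) == 1 and p + q - 2 == len(v)
               for p in ps for q in ps)


print(is_standard_by_periods("abaababa"))  # True  (v = abaaba, p = 3, q = 5)
print(is_standard_by_periods("abab"))      # False
```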
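Rotations
• Standard words can also be generated by rotations.
• Let $p, q \ge 2$ be such that gcd(p,q) = 1 and n = p+q.
• $\rho_p : \{0,1,\ldots,n-1\} \to \{0,1,\ldots,n-1\}$ is defined as $\rho_p(z) = z + p \pmod{n}$.
• $I_a = \{0,1,\ldots,q-1\}$, $I_b = \{q,q+1,\ldots,n-1\}$.
• $\gamma : \{0,1,\ldots,n-1\} \to \{a,b\}$ is defined as $\gamma(x) = a$ if $x \in I_a$, b otherwise.
• If n = 8, p = 3, q = 5, … w = abaababa
[Figure: the positions 0,…,7 arranged on a circle; positions 0 to 4 are labelled a and positions 5 to 7 are labelled b.]
• THEOREM: Let $w = x_0 x_1 \cdots x_{n-1}$ in $\{a,b\}^*$, with $|w|_a = q$ and $|w|_b = p$.
  • w is a standard word with suffix ba $\iff$ $x_i = \gamma(\rho_p^{i}(p))$ for all i;
  • w is a standard word with suffix ab $\iff$ $x_i = \gamma(\rho_p^{i}(p-1))$ for all i.
• REMARK: Let $u = u_0 u_1 \cdots u_{n-1}$ and $v = v_0 v_1 \cdots v_{n-1}$. If $u_i = \gamma(\rho_p^{i}(h))$ and $v_i = \gamma(\rho_p^{i}(k))$ for some positions h and k, then u and v are conjugate.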
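The rotation construction can be sketched as follows, reading one letter per step of the rotation z ↦ z + p (mod n). The starting positions p and p-1 are taken from the reconstruction rule stated with the BWT characterization of standard words; the names `word_by_rotation` and `label` are illustrative.

```python
def word_by_rotation(p, q, start):
    """Read n = p + q letters along the orbit of `start` under
    z -> z + p (mod n); positions 0..q-1 are labelled a, the rest b."""
    n = p + q

    def label(x):
        return "a" if x < q else "b"

    letters = []
    z = start
    for _ in range(n):
        letters.append(label(z))
        z = (z + p) % n
    return "".join(letters)


p, q = 3, 5
print(word_by_rotation(p, q, p))      # abaababa  (suffix ba)
print(word_by_rotation(p, q, p - 1))  # abaabaab  (suffix ab)
```

The two outputs are conjugate: abaabaab occurs as a factor of abaababa·abaababa.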
A new characterization of standard words
• THEOREM: Let u be a word over the alphabet {a,b}. $\mathrm{BWT}(u) = b^p a^q$ with gcd(p,q) = 1 if and only if u is a conjugate of a standard word. In particular, when reconstructing u from BWT(u) and the index I:
  • if I = p then u is a standard word with suffix ba;
  • if I = p-1 then u is a standard word with suffix ab.
• COROLLARY: $\mathrm{BWT}(u) = b^k a^h$ with gcd(k,h) = d if and only if $u = v^d$, where v is a conjugate of a standard word.
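As a sanity check of the theorem and its corollary on one example, the sketch below uses the standard word abaababa (p = 3 occurrences of b, q = 5 occurrences of a) and a naive rotation-sorting `bwt` helper defined here for illustration.

```python
def bwt(w):
    """Return (last column, index of w) for the sorted rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations), rotations.index(w)


w = "abaababa"

# Every conjugate of w has transform b^3 a^5.
transforms = {bwt(w[i:] + w[:i])[0] for i in range(len(w))}
print(transforms)        # {'bbbaaaaa'}

# The index distinguishes the two standard words in the class.
print(bwt("abaababa"))   # ('bbbaaaaa', 3)  -> I = p,   suffix ba
print(bwt("abaabaab"))   # ('bbbaaaaa', 2)  -> I = p-1, suffix ab

# Corollary: the square of w gives b^6 a^10, with gcd(6, 10) = 2.
print(bwt(w * 2)[0])     # bbbbbbaaaaaaaaaa
```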
Idea of the proof:

  i      0 1 2 3 4 5 6 7
  F[i]   a a a a a b b b
  L[i]   b b b a a a a a

• The permutation $\tau$ giving the correspondence between the positions of the characters in F and in L is $\tau(z) = z + p \pmod{n}$.
• Starting, for example, from the position I = p we can recover the word u: $u_i = F[\tau^i(p)]$.
Further Research
• Study the extremal case of the BWT for k-letter alphabets with k > 2.
  • For instance, for k = 3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*.
  • This property holds neither for 3-standard words nor for balanced words.
• Does a relation exist between the complexity function of a word w and the structure of BWT(w)?
• Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether the BWT preserves some properties of a language L, such as belonging to a certain family of languages in the Chomsky hierarchy.
  • We found negative results:
    • $L_1 = (ab)^*$, $\mathrm{BWT}(L_1) = \{b^n a^n \mid n \ge 0\}$, a context-free language;
    • $L_2 = (abc)^*$, $\mathrm{BWT}(L_2) = \{c^n a^n b^n \mid n \ge 0\}$, a context-sensitive language.
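The two language examples can be checked for small n with a naive transform; `bwt_word` is a helper name chosen here, and the range of n is arbitrary.

```python
def bwt_word(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


for n in range(0, 6):
    # (ab)^n is mapped to b^n a^n.
    assert bwt_word("ab" * n) == "b" * n + "a" * n
    # (abc)^n is mapped to c^n a^n b^n.
    assert bwt_word("abc" * n) == "c" * n + "a" * n + "b" * n

print(bwt_word("ababab"), bwt_word("abcabc"))  # bbbaaa ccaabb
```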
Further Research
• Is it possible to characterize interesting families of words in terms of their BWT?
• Consider for instance the words generated by finite iterations of the Thue-Morse morphism $\mu(a) = ab$, $\mu(b) = ba$.
• Denote by $v^R$ the reversal of v and by $\bar{v}$ the word obtained from v by interchanging a and b. Then $\mathrm{BWT}(\mu^n(a)) = v\,\bar{v}^R$, where
  • $v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even;
  • $v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd.
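The Thue-Morse formula can be compared against a naive transform for small n. In this sketch the second half of the prediction is the reversal of the letter-exchanged copy of v, which is the reading that matches the computed transforms for n = 2, 3, 4. All function names are illustrative.

```python
def bwt_word(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def thue_morse(n):
    """Apply the morphism a -> ab, b -> ba to the letter a, n times."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if c == "a" else "ba" for c in w)
    return w


def predicted_bwt(n):
    """Predicted transform for n >= 2: v followed by the reversed
    letter-exchanged copy of v."""
    blocks = []
    letter = "b"
    for k in range(n - 2, -1, -1):
        blocks.append(letter * 2 ** k)  # blocks b^(2^(n-2)), a^(2^(n-3)), ...
        letter = "a" if letter == "b" else "b"
    v = "".join(blocks) + ("a" if n % 2 == 0 else "b")
    exchanged = v.translate(str.maketrans("ab", "ba"))
    return v + exchanged[::-1]


for n in (2, 3, 4):
    print(n, bwt_word(thue_morse(n)), predicted_bwt(n))
# 2 baba baba
# 3 bbababaa bbababaa
# 4 bbbbaabababbaaaa bbbbaabababbaaaa
```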