
Combinatorial aspects of the Burrows-Wheeler transform

Sabrina Mantaci, Antonio Restivo, Marinella Sciortino. University of Palermo.


Presentation Transcript


  1. Combinatorial aspects of the Burrows-Wheeler transform. Sabrina Mantaci, Antonio Restivo, Marinella Sciortino. University of Palermo.

  2. Burrows-Wheeler Transform
     • In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on preprocessing the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
        • the transformed string is easier to compress than the original one;
        • the original string can be recovered.
     • This preprocessing made it possible to define a class of lossless data compression algorithms that:
        • achieve a speed comparable to that of the algorithms based on the Lempel-Ziv techniques;
        • obtain a compression ratio close to that of the best statistical modelling techniques.

  3. How does the BWT work?
     • INPUT: w = abraca
     • Lexicographically sort the cyclic rotations of w. F is the first column and L the last column of the sorted matrix:

            F         L
        0   a a b r a c
        1   a b r a c a   <- I
        2   a c a a b r
        3   b r a c a a
        4   c a a b r a
        5   r a c a a b

     • OUTPUT: BWT(w) = L = caraab and the index I = 1, the position of the original word w after the lexicographic ordering.
     • The following properties hold:
        • the character L[i] is followed in w (read cyclically) by F[i];
        • for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
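The construction on this slide can be sketched directly in Python. This is a naive illustration that sorts all rotations explicitly (quadratic memory), not an efficient implementation; the function name `bwt` is a choice made here, not taken from the slides.

```python
def bwt(w):
    """Return (L, I): the last column of the sorted rotation matrix of w
    and the row index of w itself in that matrix."""
    n = len(w)
    # All cyclic rotations of w, sorted lexicographically.
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(rot[-1] for rot in rotations)
    # For a non-primitive w this picks the first row equal to w.
    index = rotations.index(w)
    return last_column, index


print(bwt("abraca"))  # ('caraab', 1)
```

The first column F is not returned because it is simply the sorted letters of L.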

  4. Reversibility
     The Burrows-Wheeler transform is reversible: given BWT(w) and the index I, it is possible to recover w.
     • Given L = BWT(w) = caraab and I = 1:
        • Construct F by alphabetically sorting the letters in L:

            i      0 1 2 3 4 5
            F[i]   a a a b c r
            L[i]   c a r a a b

        • Define a permutation τ on {0,1,…,n-1} establishing a correspondence between the positions of the same letters in F and in L:

            x      0 1 2 3 4 5
            τ(x)   1 3 4 5 0 2

        • Starting from position I, recover w = w_0 … w_{n-1} as follows: w_i = F[τ^i(I)], where τ^0(x) = x and τ^{i+1}(x) = τ(τ^i(x)). Here this gives w = a b r a c a.
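The inversion steps can be sketched as follows. The permutation between F and L is written `tau` here (the symbol is a naming choice for this sketch), and it is obtained by stably sorting the positions of L by letter, which pairs the i-th occurrence of each letter in F with its i-th occurrence in L.

```python
def tau_permutation(L):
    """tau[x] is the position in L of the letter occurrence found at F[x]."""
    # Python's sort is stable, so equal letters keep their order in L.
    return sorted(range(len(L)), key=lambda k: L[k])


def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I via w_i = F[tau^i(I)]."""
    F = sorted(L)
    tau = tau_permutation(L)
    letters = []
    pos = I
    for _ in range(len(L)):
        letters.append(F[pos])
        pos = tau[pos]
    return "".join(letters)


print(tau_permutation("caraab"))  # [1, 3, 4, 5, 0, 2]
print(inverse_bwt("caraab", 1))   # abraca
```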

  5. REMARK: Two words x and y are conjugate ⇔ BWT(x) = BWT(y).
     PROPOSITION:
     • If u = v^d and BWT(v) = a_0 a_1 … a_{n-1}, then BWT(u) = a_0^d a_1^d … a_{n-1}^d;
     • If BWT(v) = a_0 a_1 … a_{n-1} and BWT(u) = a_0^d a_1^d … a_{n-1}^d, then there exists a conjugate u' of u such that u' = v^d.
     We can deduce that: […]
     Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.
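A quick numerical check of the remark and of the first part of the proposition, read here as: if u = v^d, then each letter of BWT(v) is repeated d times in BWT(u). That reading, the helper name `bwt`, and the sample values are assumptions of this sketch.

```python
def bwt(w):
    """Last column of the lexicographically sorted rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


v = "abraca"
d = 3

# Conjugate words (rotations of each other) share the same BWT.
conjugate = v[2:] + v[:2]
assert bwt(conjugate) == bwt(v)

# BWT(v^d) repeats every letter of BWT(v) exactly d times.
expected = "".join(letter * d for letter in bwt(v))
assert bwt(v * d) == expected
print(bwt(v), bwt(v * d))
```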

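  6. Standard Words
     Let d_1, d_2, …, d_n, … be a sequence of natural numbers with d_1 ≥ 0 and d_i > 0 for i ≥ 2. Consider the sequence {s_n}_{n≥0} defined as:
         s_0 = b,  s_1 = a,  s_{n+1} = s_n^{d_n} s_{n-1}  for n ≥ 1,
     and let s be the limit of this sequence.
     • s is a characteristic Sturmian word
     • {s_n}_{n≥0} is called the approximating sequence of s
     • (d_1, d_2, …, d_n, …) is the directive sequence of s
     • Each finite word s_n is a standard word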
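A minimal sketch of the approximating sequence, assuming the usual recurrence from the literature on Sturmian words: s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1}. The function name `standard_words` is hypothetical.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a finite directive
    sequence (d_1, ..., d_k), with s_0 = b and s_1 = a."""
    previous, current = "b", "a"
    words = [previous, current]
    for d in directive:
        # s_{n+1} = s_n^{d_n} s_{n-1}
        previous, current = current, current * d + previous
        words.append(current)
    return words


# The directive sequence (1, 1, 1, 1) yields the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```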

  7. Characterization of standard words
     • A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba), where v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (the extremal case of the Fine and Wilf theorem).
     • A word w is standard if and only if it is a letter or there exist palindromes P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.
     • Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.

  8. Rotations
     Standard words can also be generated by rotations. Let p, q ≥ 2 be such that gcd(p,q) = 1, and let n = p+q.
     • ρ_p: {0,1,…,n-1} → {0,1,…,n-1} is defined as ρ_p(z) = z+p (mod n).
     • I_a = {0,1,…,q-1}, I_b = {q,q+1,…,n-1}.
     • θ: {0,1,…,n-1} → {a,b} is defined as θ(x) = a if x ∈ I_a, b otherwise.
     Example: if n = 8, p = 3, q = 5, then w = abaababa. (Figure: the positions 0,…,7 arranged on a circle, each labelled a or b.)
     THEOREM: Let w = x_0 x_1 … x_{n-1} in {a,b}*, with |w|_a = q and |w|_b = p.
     • w is a standard word with suffix ba ⇔ x_i = θ(ρ_p^i(p)) for all i;
     • w is a standard word with suffix ab ⇔ x_i = θ(ρ_p^i(p-1)) for all i.
     REMARK: Let u = u_0 u_1 … u_{n-1} and v = v_0 v_1 … v_{n-1}. If u_i = θ(ρ_p^i(j)) and v_i = θ(ρ_p^i(k)) for some j, k in {0,1,…,n-1}, then u and v are conjugate.
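The rotation construction can be sketched as below: walk the orbit of z -> z+p (mod n) from a chosen start and write a whenever the current position lies in I_a = {0,…,q-1}, b otherwise. The function name `word_from_rotation` is hypothetical, and the starting points p and p-1 are an assumption taken from the indices I = p and I = p-1 named on the next slide.

```python
from math import gcd


def word_from_rotation(p, q, start):
    """Read n = p + q labels along the orbit of z -> z + p (mod n)."""
    if gcd(p, q) != 1:
        raise ValueError("p and q must be coprime")
    n = p + q
    z = start
    letters = []
    for _ in range(n):
        letters.append("a" if z < q else "b")  # I_a = {0, ..., q-1}
        z = (z + p) % n
    return "".join(letters)


p, q = 3, 5
print(word_from_rotation(p, q, p))      # abaababa (suffix ba)
print(word_from_rotation(p, q, p - 1))  # abaabaab (suffix ab)
```

Because gcd(p, n) = 1 the orbit visits every position exactly once, so different starting points give rotations of the same word.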

  9. A new characterization of standard words
     THEOREM: Let u be a word over the alphabet {a,b}. BWT(u) = b^p a^q with gcd(p,q) = 1 if and only if u is a conjugate of a standard word. In particular, when reconstructing u from BWT(u) and the index I:
     • if I = p, then u is a standard word with suffix ba;
     • if I = p-1, then u is a standard word with suffix ab.
     COROLLARY: BWT(u) = b^k a^h with gcd(k,h) = d if and only if u = v^d, where v is a conjugate of a standard word.

  10. Idea of the proof: for BWT(u) = b^p a^q the two columns are F = a^q b^p and L = b^p a^q. For p = 3, q = 5:

            i      0 1 2 3 4 5 6 7
            F[i]   a a a a a b b b
            L[i]   b b b a a a a a

      The permutation τ giving the correspondence between the positions of characters in F and L is τ(z) = z+p (mod n). Starting, for example, from the position I = p, we can recover the word u as u_i = F[τ^i(p)].
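The theorem and the proof idea can be checked numerically on the example word abaababa (p = 3, q = 5). This is only a sanity check on one word, not a proof; `bwt` and `tau` are names chosen for this sketch.

```python
def bwt(w):
    """Return (L, I) for the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations), rotations.index(w)


w = "abaababa"                   # a standard word with suffix ba
p, q = w.count("b"), w.count("a")
n = p + q

# Every conjugate of w has the same transform b^p a^q.
for i in range(n):
    conjugate = w[i:] + w[:i]
    assert bwt(conjugate)[0] == "b" * p + "a" * q

# The standard words sit at rows p (suffix ba) and p-1 (suffix ab).
assert bwt("abaababa")[1] == p
assert bwt("abaabaab")[1] == p - 1

# The F-to-L correspondence is the rotation z -> z + p (mod n).
L = "b" * p + "a" * q
tau = sorted(range(n), key=lambda k: L[k])  # stable sort by letter
assert tau == [(z + p) % n for z in range(n)]
print(tau)  # [3, 4, 5, 6, 7, 0, 1, 2]
```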

  11. Further Research
      • Study the extremal case of the BWT for k-letter alphabets with k > 2.
         • For instance, for k = 3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*.
         • This property holds neither for 3-standard words nor for balanced words.
      • Does a relation exist between the complexity function of a word w and the structure of BWT(w)?
      • Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether the BWT preserves some properties of L, such as belonging to a certain family of languages in the Chomsky hierarchy.
         • We found negative results. Both languages below are regular, but their images are not:
            L_1 = (ab)*, BWT(L_1) = {b^n a^n | n ≥ 0}, a context-free language;
            L_2 = (abc)*, BWT(L_2) = {c^n a^n b^n | n ≥ 0}, a context-sensitive language.
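The two negative examples are easy to confirm for small n with a naive transform. The helper `bwt` is a sketch written for this check.

```python
def bwt(w):
    """Last column of the lexicographically sorted rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


for n in range(6):
    # (ab)^n is mapped to b^n a^n.
    assert bwt("ab" * n) == "b" * n + "a" * n
    # (abc)^n is mapped to c^n a^n b^n.
    assert bwt("abc" * n) == "c" * n + "a" * n + "b" * n

print(bwt("ababab"), bwt("abcabc"))  # bbbaaa ccaabb
```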

  12. Further Research
      • Is it possible to characterize interesting families of words in terms of their BWT?
      Consider for instance the words generated by finite iterations of the Thue-Morse morphism m(a) = ab, m(b) = ba. Denote by v^R the reversal of v and by v̄ the word obtained from v by interchanging a and b. Then:
          BWT(m^n(a)) = v (v̄)^R,
      where
          v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} … b^{2^0} a   if n is even,
          v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} … a^{2^0} b   if n is odd.
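A small check of the Thue-Morse statement, reading the formula as BWT(m^n(a)) = v followed by the reversal of the a/b-complement of v, where v consists of alternating runs b^{2^{n-2}} a^{2^{n-3}} … down to exponent 2^0, then one final letter. All helper names are hypothetical, and only small n are tested.

```python
def bwt(w):
    """Last column of the lexicographically sorted rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


def thue_morse(n):
    """m^n(a) for the morphism m(a) = ab, m(b) = ba."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if c == "a" else "ba" for c in w)
    return w


def predicted_v(n):
    """Alternating runs starting with b^(2^(n-2)), then one closing letter."""
    parts = []
    letter = "b"
    for exponent in range(n - 2, -1, -1):
        parts.append(letter * 2 ** exponent)
        letter = "a" if letter == "b" else "b"
    parts.append(letter)  # final single letter: a for even n, b for odd n
    return "".join(parts)


def complement(v):
    """Interchange a and b."""
    return v.translate(str.maketrans("ab", "ba"))


for n in range(1, 5):
    v = predicted_v(n)
    assert bwt(thue_morse(n)) == v + complement(v)[::-1]

print(bwt(thue_morse(3)))  # bbababaa
```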
