
Wavelet Trees


Presentation Transcript


  1. Wavelet Trees. Ankur Gupta, Butler University

  2. Text Dictionary Problem
  • The input is a text T drawn from an alphabet Σ. We want to support the following queries:
  • char(i) – returns the symbol at position i
  • rank_c(i) – the number of c's in T up to position i
  • select_c(i) – the position of the ith occurrence of symbol c in T
  • Text T can be compressed to nH0 space, answering queries in:
  • O(log |Σ|) time using the wavelet tree [GGV03]
  • O(log log |Σ|) time using [GMR06], at the cost of somewhat more space
  • When |Σ| = polylog(n), queries can be answered in O(1) time [FMMN04]

  3. Example: compute rank_r(10) on T = preparedpeppers (answer is 2)
  • Root: preparedpeppers, bitvector 110101001011011 (0 = {a,d,e}, 1 = {p,r,s}); r maps to 1, so compute rank_1(10) = 5 and descend right
  • Node prprppprs, bitvector 010100011 (0 = {p}, 1 = {r,s}); compute rank_1(5) = 2 and descend right
  • Node rrrs, bitvector 0001 (0 = {r}, 1 = {s}); compute rank_0(2) = 2 and descend left
  • Leaf rrr (all 1s: 111): rank_1(2) = 2, the final answer
  • (Other nodes of the tree: eaedee 101111, a 1, eedee 11011, d 1, eeee 1111, ppppp 11111, s 1)

  4. Example: compute select_r(2) on T = preparedpeppers (answer is position 6)
  • Work bottom-up from the leaf for r
  • Leaf rrr (111): select_1(2) = 2
  • Node rrrs, bitvector 0001: r maps to 0, and the 2nd 0 is at position 2, so select_0(2) = 2
  • Node prprppprs, bitvector 010100011: r maps to 1, and the 2nd 1 is at position 4, so select_1(2) = 4
  • Root, bitvector 110101001011011: the 4th 1 is at position 6, so select_1(4) = 6, the final answer

  5. Example: compute char(7) on T = preparedpeppers (answer is e)
  • Root, bitvector 110101001011011: bit 7 is 0, so descend left with rank_0(7) = 3
  • Node eaedee, bitvector 101111: bit 3 is 1, so descend right with rank_1(3) = 2
  • Node eedee, bitvector 11011: bit 2 is 1, so descend right with rank_1(2) = 2
  • Leaf eeee (1111): the symbol is e
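The three query walkthroughs above (rank, select, char) can be sketched with a toy pointer-based wavelet tree. This is an illustration only (0-based indices, naive O(n) bitvector scans); real implementations store the bitvectors with o(n)-overhead rank/select support.

```python
# Illustrative pointer-based wavelet tree (0-based indices).
# Each bit query below is a naive O(n) scan; the talk assumes
# succinct bitvectors with constant-time rank/select instead.
class WaveletTree:
    def __init__(self, text, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(text))
        self.alphabet = alphabet
        if len(alphabet) == 1:          # leaf: every symbol is the same
            self.bits = None
            return
        mid = len(alphabet) // 2
        self.left_syms = set(alphabet[:mid])   # symbols mapped to bit 0
        self.bits = [0 if c in self.left_syms else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in self.left_syms],
                                alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in self.left_syms],
                                 alphabet[mid:])

    def _rank_bit(self, b, i):          # occurrences of bit b in bits[0:i]
        return sum(1 for x in self.bits[:i] if x == b)

    def rank(self, c, i):               # occurrences of c in text[0:i]
        if self.bits is None:
            return i
        b = 0 if c in self.left_syms else 1
        child = self.left if b == 0 else self.right
        return child.rank(c, self._rank_bit(b, i))

    def char(self, i):                  # symbol at position i
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i]
        child = self.left if b == 0 else self.right
        return child.char(self._rank_bit(b, i))

    def select(self, c, k):             # position of the k-th (1-based) c
        if self.bits is None:
            return k - 1
        b = 0 if c in self.left_syms else 1
        child = self.left if b == 0 else self.right
        pos = child.select(c, k)        # position inside the child
        count = 0                       # map back: find the (pos+1)-th b-bit
        for j, x in enumerate(self.bits):
            if x == b:
                if count == pos:
                    return j
                count += 1

wt = WaveletTree("preparedpeppers")
# rank_r(10) = 2, select_r(2) = 5 (0-based; the slide's position 6 is 1-based),
# char(6) = 'e' (the slide's position 7 is 1-based)
```

With `sorted(set(...))` and a midpoint split, this reproduces the slides' example tree exactly (left half {a,d,e}, right half {p,r,s}).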

  6. Some comments
  • We don't have to store any of the "all 1s" nodes; they appear in the examples only to aid explanation
  • What does the wavelet tree imply? It converts the representation of a finite string over an alphabet into the representation of several bitvectors
  • This is ultimately useful for achieving high-order compression
  • Easy to implement: a very simple structure and query pattern

  7. Shapin' Up To Something Special
  • What about the shape of a wavelet tree?
  • Does it affect space? No. (You will see why in a bit.) Time? Yes.
  • Good news: we can reorganize the tree to optimize query time
  • Use a Huffman shape based on query access frequency
  • If you weight by symbol frequency, queries take O(H0) expected time instead of O(log |Σ|)
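The Huffman-shaping claim can be checked on the running example (a sketch, assuming we weight by symbol frequency in the text itself): Huffman code lengths give leaf depths, and the frequency-weighted average depth, which governs expected query time, lands within one bit of H0.

```python
# Sketch: a Huffman shape makes the expected root-to-leaf depth
# (hence expected rank/select/char time) O(H0 + 1) instead of log|Sigma|.
import heapq
from collections import Counter
from math import log2

def huffman_depths(text):
    freq = Counter(text)
    # heap entries: (frequency, unique id, {symbol: depth-so-far})
    heap = [(f, i, {c: 0}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {c: d + 1 for c, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]

text = "preparedpeppers"
n = len(text)
freq = Counter(text)
depths = huffman_depths(text)
avg_depth = sum(freq[c] * depths[c] for c in freq) / n
h0 = -sum((f / n) * log2(f / n) for f in freq.values())
# Prefix-code lower bound and Huffman optimality give: H0 <= avg_depth < H0 + 1
```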

  8. Wavelet Tree Space/Time
  • Simple bitvectors: n bits per level and log |Σ| levels, so n log |Σ| bits overall
  • Plus O(n log log n / log n) extra bits for rank/select [J89]
  • Same space as the original text, but now supports rank_c/select_c/char in O(log |Σ|) time (RAM model)
  • Fancy bitvectors: [RRR02] gets nH0 + O(n log log n / log n) bits of space with O(log |Σ|) query time
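The extra-bits idea behind [J89]-style rank can be sketched at its first level: store precomputed 1-counts once per block, then scan only within a block at query time. (The fixed block size here is a toy value; the theory uses Θ(log n)-bit blocks plus lookup tables to reach O(1) time.)

```python
# Sketch of a block-based rank directory: o(n) extra counters over the raw bits.
class RankBitvector:
    def __init__(self, bits, block=8):
        self.bits = bits
        self.block = block
        # prefix[j] = number of 1s strictly before block j
        self.prefix = [0]
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def rank1(self, i):  # number of 1s in bits[0:i]
        b, r = divmod(i, self.block)
        return self.prefix[b] + sum(self.bits[b * self.block : b * self.block + r])

    def rank0(self, i):
        return i - self.rank1(i)

bits = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # root bitvector from the example
bv = RankBitvector(bits)
# bv.rank1(10) -> 5 and bv.rank0(7) -> 3, matching the walkthroughs above
```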

  9. Even Skewed Is a Shape
  • Consider a totally skewed wavelet "tree"
  • i.e., set symbol a to 0 and all other symbols to 1, then recurse on the 1 side
  • The tree will look like a line, and with the [RRR02] encoding each node costs lg of a binomial coefficient
  • The product of these binomial coefficients telescopes into the multinomial coefficient, regardless of the order of the splits
  • Simple exercise to check this fact
  • Thus, shape doesn't affect the space
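The telescoping the slide alludes to can be written out (a sketch; here n_1, ..., n_σ denote the symbol counts and each node bitvector is encoded with [RRR02]):

```latex
% Node bitvectors along the skewed path, each costing lg of a binomial [RRR02]:
\lg\binom{n}{n_1} + \lg\binom{n-n_1}{n_2} + \cdots
  + \lg\binom{n_{\sigma-1}+n_{\sigma}}{n_{\sigma-1}}
  \;=\; \lg\frac{n!}{n_1!\,n_2!\cdots n_{\sigma}!} \;\approx\; n H_0(T).
% Any tree shape yields the same product of binomials, hence the same space.
```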

  10. Empirical Entropy
  • Text T of n symbols drawn from alphabet Σ (n lg |Σ| bits uncompressed)
  • Entropy: a measure for assessing compressed size based on the text itself
  • Higher-order entropy Hh (of order h) considers the context x of the h neighboring symbols
  • Each Prob[y|x] term is thus conditioned on a context x of h symbols
  • Note that Hh(T) ≤ lg |Σ|
  • The text then takes nHh ≤ n lg |Σ| bits of space to encode
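The definitions above can be computed directly (a sketch; logs are base 2, contexts are the preceding h symbols, and as a simplification the first h symbols of T are skipped):

```python
# Sketch: zero-order and order-h empirical entropy of a text.
from collections import Counter
from math import log2

def h0(text):
    n = len(text)
    return -sum((f / n) * log2(f / n) for f in Counter(text).values())

def hh(text, h):
    # Bucket each symbol by its preceding h-symbol context, then take the
    # frequency-weighted average of the per-context zero-order entropies.
    n = len(text)
    by_context = {}
    for i in range(h, n):
        by_context.setdefault(text[i - h:i], []).append(text[i])
    return sum(len(s) * h0(s) for s in by_context.values()) / n

text = "preparedpeppers"
# Conditioning on context never hurts: hh(text, 1) <= h0(text) <= lg|Sigma|
```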

  11. One Text Indexing Result (Because, Frankly, There Are Lots)
  • Main results (using the CSA of [GGV03])
  • Space usage: nHh + o(n log |Σ|) bits
  • Search time: O(m log |Σ| + polylog(n)) time
  • Can improve to o(m) time with a constant factor more space
  • When the text T is highly compressible (i.e. nHh = o(n)), we achieve the first index sublinear in both space and search time
  • Second-order terms represent the space needed for fast indexing and for storing count statistics of the text
  • We also obtain nearly tight bounds on the encoding length of the Burrows-Wheeler Transform (BWT)

  12. Tell Me More! How Do You Do It?
  • The neighbor function Φ0 gives the position in the suffix array of the next suffix in text order: SA0[Φ0(i)] = SA0[i] + 1
  • Example (text positions 1..8): SA0 = 4 7 5 1 8 6 2 3, Φ0 = 3 5 6 7 4 2 8 1, Rank_red = 1 1 1 1 2 3 4 4, SA1 = 2 4 3 1
  • For this example, suppose we know SA1, the suffix array of the even text positions (Rank_red counts the even SA0 values, shown red on the slide, up to position i)
  • For an even value, use SA1. Example: SA0[5] = 2·SA1[Rank_red(5)] = 2·4 = 8
  • For an odd value, use the neighbor function Φ0. Example: SA0[2] = SA0[Φ0(2)] - 1 = SA0[5] - 1 = 7
  • Encode increasing subsequences of Φ0 together to get zero-order entropy; subdivide the subsequences and encode to get high-order entropy
  • It turns out that the neighbor function Φ is the primary bottleneck for space
  • Perform these steps recursively to obtain a compressed suffix array
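The neighbor function can be sketched for any text. (The slide's 8-symbol example text is not given, so "abracadabra$" is a stand-in here; the quadratic suffix-array construction is for illustration only.)

```python
# Sketch: Phi(i) is the suffix-array position of the suffix starting
# one text position after SA[i], i.e. SA[Phi(i)] = SA[i] + 1 (mod n).
def suffix_array(text):
    # naive construction, illustration only
    return sorted(range(len(text)), key=lambda i: text[i:])

def phi(sa):
    n = len(sa)
    pos = {suffix: idx for idx, suffix in enumerate(sa)}  # inverse SA
    return [pos[(sa[i] + 1) % n] for i in range(n)]

text = "abracadabra$"   # stand-in text with a unique, smallest sentinel
sa = suffix_array(text)
p = phi(sa)
# Starting anywhere and repeatedly applying Phi walks the suffix array
# in text order, which is what makes SA decoding as on the slide possible.
```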

  13. Burrows-Wheeler Transform (BWT) and the Neighbor Function Φ
  • The Φ function has a strong relationship to the Burrows-Wheeler Transform (BWT)
  • The BWT has had a profound impact on a myriad of fields
  • Simply put, it preprocesses an input text T with a reversible transform whose result is easily compressible using simple methods
  • The BWT (and the Φ function) are at the heart of many text compression and indexing techniques, such as bzip2
  • We also call the Φ function the FL mapping of the BWT

  14. Burrows-Wheeler Transform (BWT)

  15. A Shifty Little BWT (figure: the BWT output arranged as per-context lists, e.g. the list for context i and the list for context s)
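The reversibility claim from the BWT slides can be made concrete with a minimal sketch via sorted rotations (assuming the text ends with a unique, lexicographically smallest sentinel "$"), together with the classic naive inversion:

```python
# Sketch: forward BWT via sorted rotations, and its inversion.
def bwt(text):
    # text must end with a unique sentinel so rotations sort unambiguously
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last):
    # naive inversion: repeatedly prepend the BWT column and re-sort
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))

s = "mississippi$"
# bwt(s) groups equal contexts together, e.g. "ipssm$pissii",
# which is exactly what makes the output easy to compress
```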

  16. Where Oh Where Is My Wavelet Tree?
  • For each list from the previous slide, we store a wavelet tree to achieve zero-order entropy
  • The collection of zero-order compressors gives high-order entropy based on the context (not shown in this talk)
  • Technical point: the number of alphabet symbols cannot exceed the text length
  • We "rerank" symbols to meet this requirement (negligible extra space, O(1) time)
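The "rerank" step can be sketched as a dense renaming of whichever symbols actually occur in a list (the helper name here is illustrative, not from the talk):

```python
# Sketch: map the symbols occurring in a list onto a dense alphabet
# 0..sigma-1, so sigma never exceeds the list's length.
def rerank(text):
    dense = {c: i for i, c in enumerate(sorted(set(text)))}
    return [dense[c] for c in text], dense

codes, dense = rerank("zzqzq")
# q -> 0, z -> 1: the effective alphabet shrinks to the symbols present
```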

  17. Any questions?
