
The Wavelet Trie : Maintaining an Indexed Sequence of Strings in Compressed Space

Roberto Grossi, Giuseppe Ottaviano*, Università di Pisa. * Part of the work done while at Microsoft Research Cambridge.


Presentation Transcript


  1. The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space. Roberto Grossi, Giuseppe Ottaviano*, Università di Pisa. * Part of the work done while at Microsoft Research Cambridge

  2. Disclaimer THIS HAS NOTHING TO DO WITH WAVELETS!

  3. Indexed String Sequences • Example: (foo, bar, foobar, foo, bar, bar, foo) at positions 0 1 2 3 4 5 6 • Queries (positions and occurrence indices are 0-based, as the examples show) • Access(i): access the i-th element • Access(2) = foobar • Rank(s, pos): count occurrences of s before pos • Rank(bar, 5) = 2 • Select(s, i): find the i-th occurrence of s • Select(foo, 2) = 6

  4. Prefix operations (foo, bar, foobar, foo, bar, bar, foo) 0 1 2 3 4 5 6 • Queries • RankPrefix(p, pos): count strings prefixed by p before pos • RankPrefix(foo, 5) = 3 • SelectPrefix(p, i): find the i-th string prefixed by p • SelectPrefix(foo, 2) = 3
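The five queries on slides 3 and 4 can be pinned down with a naive linear-scan reference. This is only a sketch of the query semantics (0-based positions and occurrence indices, inferred from the slide examples), not the compressed structure the talk describes; the function names are illustrative.

```python
from typing import List

def access(seq: List[str], i: int) -> str:
    """Return the element at position i."""
    return seq[i]

def rank_prefix(seq: List[str], p: str, pos: int) -> int:
    """Count strings prefixed by p among positions 0..pos-1."""
    return sum(1 for x in seq[:pos] if x.startswith(p))

def select_prefix(seq: List[str], p: str, i: int) -> int:
    """Position of the i-th (0-based) string prefixed by p."""
    seen = 0
    for idx, x in enumerate(seq):
        if x.startswith(p):
            if seen == i:
                return idx
            seen += 1
    raise ValueError("fewer than i+1 matching strings")

def rank(seq: List[str], s: str, pos: int) -> int:
    """Count exact occurrences of s among positions 0..pos-1."""
    return sum(1 for x in seq[:pos] if x == s)

def select(seq: List[str], s: str, i: int) -> int:
    """Position of the i-th (0-based) exact occurrence of s."""
    seen = 0
    for idx, x in enumerate(seq):
        if x == s:
            if seen == i:
                return idx
            seen += 1
    raise ValueError("fewer than i+1 occurrences")

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]
```

Each call scans the sequence, so the cost is linear in n; the point of the Wavelet Trie is to answer the same queries without scanning.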

  5. Example: storing relations • Write the columns as string sequences • Store them separately • Reduce relational operations to sequence queries What does Sheldon like? Who likes pages from domain wikipedia.org? Other operations: range counting, …

  6. Dynamic sequences We want to support the following operations: • Insert(s, pos): insert the string s immediately before position pos • Append(s): append the string s at the end of the sequence (special case of Insert) • Delete(pos): delete the string at position pos If the data structure supports only Append, we call it append-only; otherwise dynamic (or fully dynamic)

  7. Requirements • Store the sequence in as little space as possible • Close to the information-theoretic lower bound • But still be able to support all the described operations (query and update) efficiently • Aim for worst-case polylog operations

  8. Some notation • Example: (foo, bar, foobar, foo, bar, bar, foo) at positions 0 1 2 3 4 5 6 • Sequence S, |S| = n • In the example n = 7 • String set Sset is the unordered set of distinct strings appearing in S • In the example, {foo, bar, foobar}, |Sset| = 3 • Also called the alphabet • Sequence symbols can also be integers, characters, … • As long as they are binarized to strings

  9. Wavelet Trees • Introduced in 2003 to represent Compressed Suffix Arrays • Support Access/Rank/Select on sequences on a finite alphabet (of integers) • Reduces to operations on bitvectors by recursively partitioning the alphabet • String sequences can be reduced to integer sequences

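  10. Wavelet Trees • S = (a, b, r, a, c, a, d, a, b, r, a), Sset = {a, b, c, d, r} • Root: abracadabra, bitvector 00101010010, splitting {a, b} (bit 0) from {c, d, r} (bit 1) • Node {a, b}: abaaaba, bitvector 0100010 • Node {c, d, r}: rcdr, bitvector 1011 • Node {d, r}: rdr, bitvector 101 • Leaves: a, b, c, d, r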
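The abracadabra example can be reproduced with a short sketch. It assumes the alphabet is split by halving the sorted symbol list, which happens to give the partition shown on the slide; bitvectors are plain Python strings rather than succinct rank/select structures, and the dictionary layout is purely illustrative.

```python
def build(seq, alphabet):
    """Build a wavelet tree over seq; alphabet is a sorted list of symbols."""
    if len(alphabet) == 1:
        return {"symbol": alphabet[0]}
    mid = len(alphabet) // 2
    left_syms = set(alphabet[:mid])
    # One bit per element: 0 if the symbol goes left, 1 if it goes right.
    bits = "".join("0" if c in left_syms else "1" for c in seq)
    return {
        "bits": bits,
        "left": build([c for c in seq if c in left_syms], alphabet[:mid]),
        "right": build([c for c in seq if c not in left_syms], alphabet[mid:]),
    }

def access(node, i):
    """Recover the i-th symbol by following bits down to a leaf."""
    while "symbol" not in node:
        b = node["bits"][i]
        i = node["bits"][:i].count(b)  # rank of bit b before position i
        node = node["left"] if b == "0" else node["right"]
    return node["symbol"]

S = list("abracadabra")
root = build(S, sorted(set(S)))
```

Here `root["bits"]` is the root bitvector and `access(root, i)` walks one node per level, which is where the O(log |Sset|) bound comes from once each rank is answered in constant time.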

  11. Wavelet Trees • Space equal to entropy of the sequence • Plus negligible terms • Supports Access/Rank/Select in O(log |Sset|) time • Later extended to support Insert/Delete… • … but tree structure is fixed a priori • String set Sset cannot be changed! • Unrealistic restriction in many database applications

  12. The Wavelet Trie • The Wavelet Trie is a Wavelet Tree on sequences of binary strings (Sset ⊂ {0, 1}*) • Supports Access/Rank(Prefix)/Select(Prefix) • Fully dynamic… • … or append-only (with better bounds) • The string set need not be known in advance

  13. Wavelet Trie: Construction • Sequence of binary strings: (010111, 0100110, 0100001, 0101010, 0100110, 010111, 010110) • Common prefix: α = 010 • Branching bit: β = 1001011 (the bit that follows α in each string) • Remaining suffixes: 110, 001, 110 go to child 0; 11, 010, 11, 10 go to child 1

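  14. Wavelet Trie: Construction • Sequence: (010111, 0100110, 0100001, 0101010, 0100110, 010111, 010110) • Root: α: 010, β: 1001011 • Child 0 of the root: α: ε, β: 101, with leaves α: 01 (child 0) and α: 10 (child 1) • Child 1 of the root: α: ε, β: 1011, with leaf α: 10 (child 0) and internal node α: ε, β: 110 (child 1), whose two children are leaves with α: ε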
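The recursive construction on slides 13 and 14 can be sketched as follows. The sketch assumes the string set is prefix-free (no string is a proper prefix of another) and stores α and β as plain Python strings; the `Node` class and function names are illustrative, not taken from the talk.

```python
class Node:
    def __init__(self, alpha, beta="", zero=None, one=None):
        self.alpha = alpha  # common prefix of all strings in this subtree
        self.beta = beta    # branching bits, one per string (empty at a leaf)
        self.zero = zero    # child for strings whose next bit is 0
        self.one = one      # child for strings whose next bit is 1

def common_prefix(strings):
    """Longest common prefix of a non-empty list of strings."""
    first = strings[0]
    shortest = min(len(s) for s in strings)
    k = 0
    while k < shortest and all(s[k] == first[k] for s in strings):
        k += 1
    return first[:k]

def build(strings):
    alpha = common_prefix(strings)
    rest = [s[len(alpha):] for s in strings]
    if all(r == "" for r in rest):
        return Node(alpha)  # every string here is identical: a leaf
    # With a prefix-free set, no remainder is empty at this point.
    beta = "".join(r[0] for r in rest)
    zero = build([r[1:] for r in rest if r[0] == "0"])
    one = build([r[1:] for r in rest if r[0] == "1"])
    return Node(alpha, beta, zero, one)

S = ["010111", "0100110", "0100001", "0101010",
     "0100110", "010111", "010110"]
root = build(S)
```

On the slide sequence this yields the root with α = 010 and β = 1001011, and the same child labels as the diagram.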

  15. Wavelet Trie: Access • Same trie as in the construction, positions 0 1 2 3 4 5 6 • Access(5) = 010 1 1 1 = 010111: read α = 010 at the root, then follow branching bits 1, 1, 1 through nodes whose α is ε, mapping the position with Rank on each β • Rank is similar
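A minimal sketch of Access and Rank on the example trie, hard-coded here as nested tuples `(alpha, beta, zero, one)` so the block stands alone. Rank on a bitvector is done by counting in a Python string, which is linear time; a real implementation would use a succinct bitvector. The tuple layout and names are assumptions for illustration.

```python
def leaf(alpha):
    return (alpha, "", None, None)

# Example trie for (010111, 0100110, 0100001, 0101010, 0100110, 010111, 010110)
TRIE = ("010", "1001011",
        ("", "101", leaf("01"), leaf("10")),
        ("", "1011", leaf("10"), ("", "110", leaf(""), leaf(""))))

def access(node, pos):
    """Rebuild the string at position pos by concatenating alphas and bits."""
    out = []
    while True:
        alpha, beta, zero, one = node
        out.append(alpha)
        if zero is None:               # leaf reached
            return "".join(out)
        b = beta[pos]
        out.append(b)
        pos = beta[:pos].count(b)      # position inside the chosen child
        node = one if b == "1" else zero

def rank(node, s, pos):
    """Count occurrences of s among positions 0..pos-1."""
    while True:
        alpha, beta, zero, one = node
        if not s.startswith(alpha):
            return 0                   # s is not in the string set
        s = s[len(alpha):]
        if zero is None:
            return pos if s == "" else 0
        if s == "":
            return 0                   # s ends at an internal node
        b, s = s[0], s[1:]
        pos = beta[:pos].count(b)
        node = one if b == "1" else zero
```

For example, `access(TRIE, 5)` follows bits 1, 1, 1 below the root and returns `010111`, matching the slide.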

  16. Wavelet Trie: Select • Same trie as in the construction, positions 0 1 2 3 4 5 6 • Select(0100110, 1) = 4: descend along 0100110 to its leaf (α: 010, bit 0, α: ε, bit 1, leaf α: 10), then map the index back up to the root with Select on each β along the path
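Select can be sketched as a descent that records the (β, bit) pairs along the path of s, followed by a bottom-up pass that applies a bitvector select at each level. The trie is again hard-coded as nested tuples `(alpha, beta, zero, one)`; `select_bit` is a naive linear scan standing in for a succinct bitvector, and all names are illustrative.

```python
def leaf(alpha):
    return (alpha, "", None, None)

TRIE = ("010", "1001011",
        ("", "101", leaf("01"), leaf("10")),
        ("", "1011", leaf("10"), ("", "110", leaf(""), leaf(""))))

def select_bit(beta, b, i):
    """Position of the i-th (0-based) occurrence of bit b in beta."""
    seen = -1
    for p, c in enumerate(beta):
        if c == b:
            seen += 1
            if seen == i:
                return p
    raise IndexError("not enough occurrences")

def select(node, s, i):
    """Position of the i-th (0-based) occurrence of s in the sequence."""
    path = []
    while True:
        alpha, beta, zero, one = node
        if not s.startswith(alpha):
            raise KeyError("string not in the sequence")
        s = s[len(alpha):]
        if zero is None:
            if s:
                raise KeyError("string not in the sequence")
            break
        if not s:
            raise KeyError("string not in the sequence")
        b, s = s[0], s[1:]
        path.append((beta, b))
        node = one if b == "1" else zero
    # Walk back up: an index in a child becomes a position in its parent.
    for beta, b in reversed(path):
        i = select_bit(beta, b, i)
    return i
```

With this sketch, `select(TRIE, "0100110", 1)` maps index 1 to position 2 in the bitvector 101 and then to position 4 in the root bitvector, as on the slide.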

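  17. Wavelet Trie: Append • Append 010010 to the same trie • Root (α: 010, β: 1001011): append bit 0 to β • Child 0 (α: ε, β: 101): append bit 1, giving β: 1011 • The leaf α: 10 below it is split: it becomes a node α: ε with β: 11 plus the appended bit 0, a new leaf α: ε as child 0, and the old strings in a leaf α: 0 as child 1 • Insert/Delete are similar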
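The Append walk-through can be sketched with mutable nodes. The sketch assumes the new string keeps the set prefix-free (it raises otherwise, possibly after already extending some β on the path), and it needs the number of strings stored under a node in order to initialize the new bitvector as a run of equal bits when a node is split. Class and function names are illustrative.

```python
class Node:
    def __init__(self, alpha, beta="", zero=None, one=None):
        self.alpha, self.beta, self.zero, self.one = alpha, beta, zero, one

def lcp_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def append(node, s, count):
    """Append s; count is the number of strings currently under node."""
    while True:
        k = lcp_len(node.alpha, s)
        if k < len(node.alpha):
            if k == len(s):
                raise ValueError("s is a prefix of a stored string")
            # Split: s diverges from alpha at offset k.
            old_bit = node.alpha[k]
            old = Node(node.alpha[k + 1:], node.beta, node.zero, node.one)
            new = Node(s[k + 1:])
            node.alpha = node.alpha[:k]
            node.beta = old_bit * count + s[k]  # run of equal bits, then new bit
            node.zero, node.one = (old, new) if old_bit == "0" else (new, old)
            return
        s = s[k:]
        if node.zero is None:
            if s:
                raise ValueError("a stored string is a prefix of s")
            return                              # repeated string: leaf unchanged
        if not s:
            raise ValueError("s is a prefix of a stored string")
        b, s = s[0], s[1:]
        count = node.beta.count(b)              # strings in the chosen child
        node.beta += b
        node = node.one if b == "1" else node.zero

# Example trie from the slides, holding 7 strings.
root = Node("010", "1001011",
            Node("", "101", Node("01"), Node("10")),
            Node("", "1011", Node("10"),
                 Node("", "110", Node(""), Node(""))))
append(root, "010010", 7)
```

After the call, the root β gains a 0, its child 0 gains a 1, and the former leaf α: 10 becomes an internal node with β = 110, in line with the slide.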

  18. Space analysis • Information-theoretic lower bound • LB(S) = LT(Sset) + nH0(S) • LT is the information-theoretic lower bound for storing a set of strings • Static WT: LB(S) + o(ĥn) • Append-only WT: LB(S) + PT(Sset) + o(ĥn) • PT(Sset): space taken by the Patricia Trie • Fully dynamic WT: LB(S) + PT(Sset) + O(nH0(S))
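To make the nH0(S) term concrete, here is a small sketch that computes it for the foo/bar example from the earlier slides, reading H0 as the zero-order empirical entropy of the sequence (the standard meaning of the symbol; the slide does not spell it out). The LT and PT terms are not computed here.

```python
from collections import Counter
from math import log2

def n_h0(seq):
    """n * H0(seq) in bits, with H0 = sum over symbols of (c/n) * log2(n/c)."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]
bits = n_h0(S)
```

For this sequence, foo and bar occur three times each and foobar once, so the term works out to 6·log2(7/3) + log2(7), a little over 10 bits.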

  19. Operations time complexity • Need new dynamic bitvectors to support initialization (create a bitvector 0^n or 1^n) • Static and Append-only Wavelet Trie • All supported operations in O(|s| + h_s) • h_s is the number of nodes traversed by string s • Fully dynamic Wavelet Trie • All supported operations in O(|s| + h_s log n) • Deletion may take O(|ŝ| + h_s log n), where ŝ is the longest string in the trie

  20. Thanks for your attention! Questions?
