
Fast Compressed Tries through Path Decompositions

Roberto Grossi, Giuseppe Ottaviano* (Università di Pisa). * Part of the work done while at Microsoft Research Cambridge.


Presentation Transcript


  1. Fast Compressed Tries through Path Decompositions
     Roberto Grossi, Giuseppe Ottaviano* (Università di Pisa)
     * Part of the work done while at Microsoft Research Cambridge

  2. Compacted tries
     [Figure: compacted trie storing the strings three, trial, triangle, trie, triple, triply, with a node label and a branching character called out.]

  3. Applications
     • String dictionaries
         • With prefix lookup, predecessor, …
         • Exploit prefix compression
     • Monotone perfect hash functions
         • “Hollow” or “Blind” tries [ALENEX 09]
         • Binary tree (no need to store branching chars)
         • No need to store node labels, just lengths (skips)

  4. Height vs. performance
     • Tries can be deep: no guarantee on height
     • Bad with pointer-based trees
         • ~1 cache miss per child operation
     • Worse with succinct tree encodings
         • Need to access several directories
         • Many cache misses per child operation
         • Large constants hidden in the O(1)

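  5. Path decomposition
     [Figure: the compacted trie with the root-to-leaf path spelling "triangle" collapsed into a single node; the branching characters h, e, p, l leave that path. Query: triple. After matching "tri" on the path, the search follows branching character p and recurses there with the suffix le.]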

  6. Centroid path decomposition
     • Decompose along the heavy paths
         • choose the edge that has most descendants
     • Height of the decomposed tree: O(log n)
     • Usually lower
     • Average height
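The decomposition and lookup described on slides 5 and 6 can be sketched in a few lines. This is a pointer-based illustration only, not the succinct structure from the talk; the names `TERMINATOR`, `heavy_string`, `build`, `contains` and `height` are hypothetical, and a terminator character is assumed so that no key is a prefix of another.

```python
from collections import defaultdict

TERMINATOR = "\0"  # assumption: appended to every key so the set is prefix-free


def heavy_string(strings):
    """Follow the heavy path: at each branching point keep the largest group."""
    cur, depth = list(strings), 0
    while len(cur) > 1:
        groups = defaultdict(list)
        for s in cur:
            groups[s[depth]].append(s)
        cur = max(groups.values(), key=len)  # ties are broken arbitrarily
        depth += 1
    return cur[0]


def build(strings):
    """Return (path, children); children maps (offset, branching char) to a subtree."""
    path = heavy_string(strings)
    groups = defaultdict(list)
    for s in strings:
        if s == path:
            continue
        i = 0
        while s[i] == path[i]:  # prefix-free keys always diverge before either ends
            i += 1
        groups[(i, s[i])].append(s[i + 1:])  # keep only the suffix after the branch
    return path, {key: build(suffixes) for key, suffixes in groups.items()}


def contains(node, query):
    """Match along the node's path; on mismatch, recurse with the remaining suffix."""
    path, children = node
    i = 0
    while i < len(path) and i < len(query) and path[i] == query[i]:
        i += 1
    if i == len(query):
        return i == len(path)
    child = children.get((i, query[i]))
    return child is not None and contains(child, query[i + 1:])


def height(node):
    """Number of nodes on the longest root-to-leaf chain of the decomposed tree."""
    return 1 + max((height(child) for child in node[1].values()), default=0)


words = ["three", "trial", "triangle", "trie", "triple", "triply"]
root = build([w + TERMINATOR for w in words])
```

A lookup appends the same terminator to the query, for example `contains(root, "triple" + TERMINATOR)`. Because ties between equally heavy children are broken arbitrarily, the root path chosen here need not be the "triangle" path drawn on the slide.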

  7. Succinct encoding
     • [PODS 08] presents a succinct data structure for centroid path-decomposed tries
     • Not practical: needs complex operations on succinct trees
     • We introduce a simpler and practical encoding
     • This encoding also enables simple compression of the labels

  8. Succinct encoding
     • Node label written literally, interleaved with number of other branching characters at that point, in array L
     • Corresponding branching characters in array B
     • Tree encoded with DFUDS in bitvector BP
     • Variant of Range Min-Max tree [ALENEX 10] to support Find{Close,Open}, more space-efficient (Range Min tree)
     Example for the node with path "triangle" and branching characters h, e, p, l (spaces added for clarity):
     L : t1ri2a1ngle
     BP: ( ((( )
     B : h epl
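As a rough illustration of how the L and B strings of the "triangle" example on slide 8 fit together, the sketch below interleaves a path with branching-character counts. The function name `encode_node` and its `branches` argument are hypothetical, and decimal digits stand in for the special count symbols only to mirror the slide notation (a real encoding would need symbols outside the key alphabet). The BP bitvector is not modelled here.

```python
def encode_node(path, branches):
    """Build the L and B strings for one node of the path-decomposed trie.

    path     -- the label spelled along the node's path, e.g. "triangle"
    branches -- hypothetical input: maps an offset (number of path characters
                already matched) to the branching characters leaving the path
                at that point, e.g. {1: "h", 3: "ep", 4: "l"}
    """
    label, branching = [], []
    for offset in range(len(path) + 1):
        chars = branches.get(offset, "")
        if chars:
            label.append(str(len(chars)))  # count of branches at this point
            branching.append(chars)        # the branching characters themselves
        if offset < len(path):
            label.append(path[offset])     # next literal character of the path
    return "".join(label), "".join(branching)


L, B = encode_node("triangle", {1: "h", 3: "ep", 4: "l"})
```

With the inputs above, `L` reproduces the slide's `t1ri2a1ngle` and `B` its `hepl`; the child reached through `p` would be encoded the same way from its own path `le` and branch `{1: "y"}`.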

  9. Compression of L
     Labels:     ...$...index.html$....html$....html$...index.html$
     Dictionary: … 3 index … 5 .html …
     Encoded:    ...$...35$...5$...5$...35$
     • Dictionary codewords shared among labels
     • Codewords do not cross label boundaries ($)
     • Use vbyte to compress the codeword ids
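For readers unfamiliar with vbyte, here is a minimal sketch of one common variable-byte convention applied to codeword ids such as the 3 and 5 in the slide. The byte layout is an assumption (7 payload bits per byte, low-order bits first, high bit set when more bytes follow); the talk does not say which variant is used, and the function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: another byte follows
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_ids(ids):
    """Concatenate the vbyte encodings of a sequence of codeword ids."""
    return b"".join(vbyte_encode(i) for i in ids)


def decode_ids(data):
    """Inverse of encode_ids."""
    ids, pos = [], 0
    while pos < len(data):
        value, pos = vbyte_decode(data, pos)
        ids.append(value)
    return ids
```

Small ids, which a frequency-ordered dictionary would assign to the most common codewords, take a single byte each; ids of 128 and above take two or more.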

  10. Compression of L
     • Node labels (t1ri2a1ngle, l1e, …):
         • each label is a suffix of a string in the set
         • interleaved with a few “special characters” 1, 2, 3, …
     • Compressible if strings are compressible
     • Dictionary and parsing computed with modified Re-Pair
     • Domain-specific compression can be used instead
     • Decompression overhead negligible
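To give a feel for grammar-based label compression, the following is a sketch of plain textbook Re-Pair run over a list of labels, so that no rule can span a label boundary. It is not the modified Re-Pair used in the talk, whose changes are not described in these slides; the names `repair`, `replace_pair` and `expand` are hypothetical.

```python
from collections import Counter


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair in seq, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(labels, min_count=2):
    """Repeatedly replace the most frequent adjacent pair with a new symbol.

    Each label is processed as its own sequence, so pairs never cross
    label boundaries. Returns (parsed sequences, rules).
    """
    seqs = [list(label) for label in labels]
    rules = {}
    while True:
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))  # adjacent pairs within one label
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        symbol = ("rule", len(rules))        # tuples cannot clash with characters
        rules[symbol] = pair
        seqs = [replace_pair(seq, pair, symbol) for seq in seqs]
    return seqs, rules


def expand(symbol, rules):
    """Expand a symbol back to the text it stands for."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol


labels = ["index.html", ".html", ".html", "index.html"]
parsed, rules = repair(labels)
decoded = ["".join(expand(s, rules) for s in seq) for seq in parsed]
```

Here `decoded` reproduces `labels`, while `parsed` holds far fewer symbols than the original characters; in a dictionary-based scheme like the one on slide 9, the expansions of the rules would play the role of dictionary entries.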

  11. Experimental results (time)
     • Experiments show gains in time comparable to the gains in height
     • Confirm that the bottleneck is traversal operations
     Code available at https://github.com/ot/path_decomposed_tries

  12. Experimental results (space)
     • For strings with many common prefixes, even the non-compressed trie is space-efficient
     • Label compression considerably increases space-efficiency
     • Decompression time overhead: ~10%

  13. Thanks for your attention! Questions?

  14. References
     • [ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. Succinct trees in practice. In ALENEX, pages 84–97, 2010.
     • [ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a sorted table with o(1) accesses. In SODA, pages 785–794, 2009.
     • [PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In PODS, pages 181–190, 2008.
