
Dictionaries and Data-Aware Measures




Presentation Transcript


  1. Dictionaries and Data-Aware Measures Ankur Gupta, Butler University

  2. The Attack of Massive Data • Lots of massive data sets are being generated • Web publishing, bioinformatics, XML, e-mail, geographical data • IP address information, UPCs, credit card numbers, ISBNs, large inverted files, XML • Real-life data sets have low information content • Data sets need to be compressed, and they are compressible • The Goal: design data structures to manage massive data sets • Near-optimal query times for powerful queries • Near-minimum amount of space • Measure space in a data-aware way, i.e., in terms of each individual data set

  3. Indexing Equals Compression • Suffix trees/arrays are used for pattern matching and take O(n log n) bits with O(polylog(n)) query time • GV00 improves the space to O(n log |Σ|) bits • FM00, FM01, GGV03, GGV04 further improve the space to O(nHk) bits with the same time bounds • Key point: these indexes simultaneously support searching and compression of the text

  4. Indexing Is So Needy • These indexes depend on many data-structural building blocks that need to be space-efficient while retaining reasonable query times • Two particular building blocks are rank/select dictionaries and wavelet trees • In this talk, we focus on compressed representations of dictionaries with near-optimal search times

  5. Dictionary Problem • Canonical problem: the input is an ordered set S of n items out of a universe U = {0, 1, …, u–1} • Support the operations: • member(i) – true if i is in S, false otherwise • pred(i) – the item of S immediately before i • rank(i) – the number of items from S up to and including i • select(i) – the ith item from S • rank(i) and select(i) together can answer member(i) and pred(i)
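As a toy model of these operations (a plain sorted Python list standing in for the compressed structure; all names here are illustrative), the reduction in the last bullet looks like this:

```python
import bisect

S = [3, 7, 12, 20, 31]          # example set from a universe U = {0, ..., u-1}

def rank(i):
    """Number of items of S that are <= i."""
    return bisect.bisect_right(S, i)

def select(k):
    """The k-th smallest item of S (1-indexed)."""
    return S[k - 1]

# member and pred reduce to rank and select, as the last bullet notes:
def member(i):
    r = rank(i)
    return r > 0 and select(r) == i

def pred(i):
    """Largest item of S strictly less than i (None if none exists)."""
    r = rank(i - 1)
    return select(r) if r > 0 else None
```

For example, rank(12) = 3 and select(2) = 7 on the set above, so member and pred never need separate machinery.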

  6. The Information-Theoretic Minimum • There is a limit on the best compression you can get for any arbitrary subset S, called the information-theoretic minimum: ⌈log (u choose n)⌉ bits, or roughly n log (u/n) bits • This is better than n log u: the universe term inside the logarithm shrinks by a factor of n
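These quantities are easy to check numerically (a sketch; `info_theoretic_min` is an illustrative name, and exact constants depend on rounding conventions):

```python
import math

def info_theoretic_min(u, n):
    """ceil(log2 C(u, n)): bits needed to distinguish all n-subsets of the universe."""
    return math.ceil(math.log2(math.comb(u, n)))

u, n = 2**20, 1000
print(info_theoretic_min(u, n))     # between n log2(u/n) and n log2(u)
print(round(n * math.log2(u / n)))  # the rough n log (u/n) bound
print(n * 20)                       # n log u: a plain 20-bit pointer per item
```

Running this shows the minimum sitting well below the naive n log u pointer cost, as the slide claims.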

  7. Data-Aware vs. Data-Blind

  8. Gap Encoding • Example: encode n = 7 values, S = {1, 4, 8, 9, 12, 13, 15}, out of a universe of size u = 16 • A natural approach is to encode the differences between successive items in the set (gaps): 1, 3, 4, 1, 3, 1, 2, taking the first item’s gap from 0 • The gap measure is the total number of bits needed to write down these gaps • Better than the worst-case information-theoretic measure for real-life data
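The gap measure can be sketched in a few lines, charging ⌈log2(g + 1)⌉ bits per gap g (one common convention; the first item's gap is taken from 0):

```python
import math

def gaps(S):
    """Differences between successive sorted items (first gap taken from 0)."""
    S = sorted(S)
    return [S[0]] + [b - a for a, b in zip(S, S[1:])]

def gap_measure(S):
    """Total bits to write the gaps, at ceil(log2(g + 1)) bits per gap g."""
    return sum(math.ceil(math.log2(g + 1)) for g in gaps(S))

S = [1, 4, 8, 9, 12, 13, 15]
print(gaps(S))         # [1, 3, 4, 1, 3, 1, 2]
print(gap_measure(S))  # 12 bits under this convention
```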

  9. Prefix Omission Method (POM Encoding) • Example: encode n = 7 bitstrings drawn from a universe of size u = 16: 0001, 0100, 1000, 1001, 1100, 1101, 1111 • Each string is written by omitting the prefix it shares with its predecessor: 0001, then 100, 1000, 1, 100, 1, 11 • POM encoding can also be seen as a subtrie (of n leaves) of the complete binary trie on u leaves • The trie measure is the number of edges in this subtrie
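The trie measure has an equally short sketch: it equals the number of distinct non-empty prefixes of the items' fixed-width binary strings, which is the same as the total length of the POM fragments above (4+3+4+1+3+1+2 = 18):

```python
def trie_measure(S, w):
    """Edges in the subtrie (of the complete binary trie of depth w) induced by S:
    one edge per distinct non-empty prefix of the items' w-bit strings."""
    prefixes = set()
    for x in S:
        bits = format(x, f'0{w}b')
        for k in range(1, w + 1):
            prefixes.add(bits[:k])
    return len(prefixes)

S = [0b0001, 0b0100, 0b1000, 0b1001, 0b1100, 0b1101, 0b1111]
print(trie_measure(S, 4))  # 18 edges for this example
```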

  10. Relationship Between GAP and TRIE • gap(S) is never more than trie(S) • For a single item, the trie cost can be much more than the gap cost • Such a scenario cannot occur too often; an amortized argument shows that trie(S) ≤ 2 gap(S) • To improve the analysis, a randomized shift of the universe shows that randomized_trie(S) ≤ gap(S) + 2n – 2 • In the figure’s example: gap(S) = 15, trie(S) = 30, trie(S+3) = 26
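The first claim can be checked numerically under the conventions of the earlier sketches (⌈log2(g + 1)⌉ bits per gap, edge count for the trie); this is a sanity check on random sets, not a proof:

```python
import math
import random

def gap_measure(S):
    """Bits to write the gaps of sorted S, first gap taken from 0."""
    S = sorted(S)
    gs = [S[0]] + [b - a for a, b in zip(S, S[1:])]
    return sum(math.ceil(math.log2(g + 1)) for g in gs)

def trie_measure(S, w):
    """Distinct non-empty prefixes of the items' w-bit strings = subtrie edges."""
    return len({format(x, f'0{w}b')[:k] for x in S for k in range(1, w + 1)})

random.seed(1)
w = 16
for _ in range(100):
    S = random.sample(range(2**w), 50)
    assert gap_measure(S) <= trie_measure(S, w)
print('gap(S) <= trie(S) held on all trials')
```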

  11. Searchable Encoding Methods (AKA gap sucks?) • Previous methods are compressed, but not easily searchable • In particular, they require a linear scan to answer any rank or select query • The block-and-walk method [BB04] is a good baseline scheme • Our key building block is the BSGAP data structure • BSGAP stores a (modified) POM-encoded sequence of strings, but queries it using a binary search rather than a linear scan • It takes nearly the same space as a linearly-encoded POM sequence • In particular, we take gap + O(n log log (u/n)) bits

  12. Binary Searchable GAP Encoding (binary-searchable POM shown on the right) • The gap encoding is related to the number of edges in the best trie (figure on the right) • Each node is encoded as the relative gap from its best ancestor • Pointers point to the right subtree stored on disk and use O(n log log (u/n)) extra bits • We never use more bits than the number of edges in the trie! • Supports rank and select queries on n items in O(log n) time
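The binary-searchable idea can be illustrated with a toy tree (illustrative Python, not the actual bit layout: the real BSGAP packs these differences with prefix codes and encoded pointers): store each node only as its difference from the nearest smaller ancestor, so a rank query decodes exactly one gap per level.

```python
def build(S, anchor=0):
    """Implicit balanced BST over sorted S; each node stores only its
    difference from the nearest ancestor whose value lies below it."""
    S = sorted(S)
    def rec(lo, hi, anc):
        if lo > hi:
            return None
        mid = (lo + hi) // 2
        v = S[mid]
        return {'delta': v - anc,        # small positive gap, prefix-codable
                'nleft': mid - lo,       # size of the left subtree
                'left': rec(lo, mid - 1, anc),
                'right': rec(mid + 1, hi, v)}
    return rec(0, len(S) - 1, anchor)

def rank(node, i, anc=0):
    """rank(i): decodes one gap per tree level -> O(log n) decodes total."""
    if node is None:
        return 0
    v = anc + node['delta']              # reconstruct this node's value
    if i < v:
        return rank(node['left'], i, anc)
    return node['nleft'] + 1 + rank(node['right'], i, v)
```

On S = {1, 4, 8, 9, 12, 13, 15}, a query such as rank(12) touches only the O(log n) nodes on one root-to-leaf path.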

  13. A Lower Bound on Query Time (AKA your best isn’t good enough) • The predecessor lower bound [BF99] applies, since pred(i) can be solved with rank(i) and select(i) • It requires non-constant time for predecessor queries with any data structure of O(n polylog(u)) bits • Many data structures take O(1) time for rank and select, but require o(u) additional bits • Two lines explored in previous work: • Spend o(u) extra bits of space and get O(1) time • Spend only O(n polylog(u)) bits total and take non-constant (predecessor-bound) time

  14. Our Dictionary Structure • Top level: Andersson/Thorup’s predecessor data structure on n/log³ n items, taking O(n log u / log³ n) bits and answering predecessor queries in AT(u, n/log³ n) time • Bottom level: BSGAP structures on log³ n items each, answering queries in O(log log n) time and taking gap(S) + O(n log log (u/n)) bits in total • Altogether, this dictionary structure takes gap + O(n log log (u/n)) + O((n log (u/n))/log n) bits and performs rank and select in AT(u, n) time
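A toy version of this two-level layout (with `bisect` on a sorted list standing in for both the Andersson/Thorup structure and the BSGAPs; the block size is illustrative):

```python
import bisect

class TwoLevelDict:
    """Top level routes a query to a block; the block answers it locally."""
    def __init__(self, S, block=4):
        S = sorted(S)
        self.blocks = [S[i:i + block] for i in range(0, len(S), block)]
        self.heads = [blk[0] for blk in self.blocks]   # block minima
        self.block = block

    def rank(self, i):
        # top level: which block can contain i?
        j = bisect.bisect_right(self.heads, i) - 1
        if j < 0:
            return 0
        # bottom level: rank within block j (a BSGAP in the real structure)
        return j * self.block + bisect.bisect_right(self.blocks[j], i)

    def select(self, k):                 # 1-indexed
        j, r = divmod(k - 1, self.block)
        return self.blocks[j][r]
```

Only the block minima need a fast predecessor structure; everything else stays in compressed per-block encodings, which is where the gap-sized space term comes from.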

  15. Theoretical Results: Dictionary Problem • We just described a novel dictionary structure that requires gap + O(n log log (u/n)) + O((n log (u/n))/log n) bits and performs rank and select in AT(u, n) time • The “second-order” terms come from encoding pointers in the BSGAP and the space required for the top-level predecessor data structure • A further improvement reduces the time for select to just O(log log n) • When n ≪ u, we have the first O(n log (u/n))-bit dictionary with AT(u, n) query time

  16. Cheating Predecessor Time Bounds • If we relax our definition of rank so that it only supports queries on items in S, we can avoid the predecessor lower bound • This partial rank function is prank(i) = rank(i) for i in S, and –1 otherwise • We achieve a dictionary supporting prank and select queries in O(log log n) ≤ AT(u, n) time with the same space as before • The prank dictionary replaces AT’s top-level predecessor structure with a van Emde Boas-like distributor structure • Some complications arise with this scheme; details are in the paper
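Why prank escapes the lower bound is easy to see in miniature: for items that are in S, an ordinary hash table already answers in expected O(1) time with no predecessor search involved (the paper's contribution is achieving this in compressed space; a Python dict just shows the idea):

```python
S = [3, 7, 12, 20, 31]
rank_of = {x: r for r, x in enumerate(sorted(S), start=1)}   # item -> rank

def prank(i):
    """rank(i) when i is in S, and -1 otherwise (partial rank)."""
    return rank_of.get(i, -1)

def select(k):
    """The k-th smallest item (1-indexed)."""
    return sorted(S)[k - 1]
```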

  17. “Comparisons” With Prior Work • For the above, n = 100,000 and u = 2^32.

  18. Practical Results • Engineering of prefix codes • Space/time tradeoffs of various prefix codes • Tuning the BSGAP data structure • Best choice of code for pointers • A parameter h to avoid some pointers • Space/time tradeoffs between a simple block-and-walk scheme [BB04] and our practical dictionary

  19. Prefix Codes Used • We encode the gap g as follows: • γ code: ⌈log (g+1)⌉ in unary; g in binary • δ code: ⌈log (g+1)⌉ using the γ code; g in binary • δ2 code: ⌈log (g+1)⌉ using the δ code; g in binary • nibble4: ⌈¼ log (g+1)⌉ in unary; g in binary, padded to a multiple of 4 bits • nibble4gamma: ⌈¼ log (g+1)⌉ using the γ code; g in binary, padded to a multiple of 4 bits • fixed5: ⌈log (g+1)⌉ in five bits; g in binary • We use lookup tables to compute prefix codes quickly
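The unary-then-binary descriptions of the first two entries correspond to the standard Elias γ and δ codes; a minimal sketch for g ≥ 1 (encoders only; the table-driven decoders mentioned above are omitted):

```python
def gamma(g):
    """Elias gamma: (bit length - 1) zeros as a unary length, then g in binary."""
    b = bin(g)[2:]
    return '0' * (len(b) - 1) + b

def delta(g):
    """Elias delta: gamma of g's bit length, then g's binary without its leading 1."""
    b = bin(g)[2:]
    return gamma(len(b)) + b[1:]

print(gamma(5), delta(5))  # 00101 01101
```

γ spends about 2 log g + 1 bits per gap, while δ drops to log g + O(log log g), which is why the choice of pointer code matters in the experiments that follow.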

  20. Prefix Codes: Space (32-bit u)

  21. Prefix Codes: Time (32-bit u)

  22. Prefix Codes: Space (64-bit u)

  23. Prefix Codes: Time (64-bit u)

  24. Pointer Cost: Space (32-bit u)

  25. Pointer Cost: Space (64-bit u)

  26. BSGAP Tuning • Pointer cost is a non-trivial part of BSGAP space • Low-level pointers jump over only a few bits but take a lot of space • Decoding pointers can also take a lot of time • Fix: use a simple sequential encoding scheme when there are at most h items left to search; call h the hybrid value • Each leaf node of the BSGAP then encodes up to h items sequentially

  27. Practical Dictionary Structure • Top level: a binary search tree (BST) on n/b items, which finds the correct bottom-level structure • Bottom level: BSGAP structures on b items each, using a hybrid value h • This dictionary structure D(b, h) supports searching in O(log (n/b)) time and overhead space according to the value of h • The [BB04] scheme is the special case h = b
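A toy D(b, h) (illustrative Python; plain lists stand in for the gap-encoded blocks): binary search down to a run of at most h items, then walk that run sequentially, with h = b degenerating to the pure block-and-walk scheme of [BB04].

```python
import bisect

class HybridDict:
    """Blocked dictionary with a hybrid value h: binary search above, scan below."""
    def __init__(self, S, b=8, h=4):
        S = sorted(S)
        self.b, self.h = b, h
        self.blocks = [S[i:i + b] for i in range(0, len(S), b)]
        self.heads = [blk[0] for blk in self.blocks]   # block minima

    def rank(self, i):
        j = bisect.bisect_right(self.heads, i) - 1
        if j < 0:
            return 0
        blk = self.blocks[j]
        # binary search down to a run of at most h items...
        lo, hi = 0, len(blk)
        while hi - lo > self.h:
            mid = (lo + hi) // 2
            if blk[mid] <= i:
                lo = mid + 1
            else:
                hi = mid
        # ...then walk it sequentially (gap decoding in the real structure)
        while lo < hi and blk[lo] <= i:
            lo += 1
        return j * self.b + lo
```

Larger h trades query time (more scanning) for space (fewer pointers), which is exactly the tradeoff the experiments sweep.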

  28. Description of Experiments • We allow the blocksize b to range over [2, 256] (32-bit files) and [2, 2048] (64-bit files) • We range the hybrid value h over [2, b] • We collect the space and time required for all D(b, h) over 10,000 (32-bit) or 1,000 (64-bit) randomly generated rank queries • The [BB04] curve is generated from the 256 (resp. 2048) points corresponding to D(b, b) • The BSGAP curve is generated from 300 samples of the above data

  29. Any questions?

