
Generalized Hashing with Variable-Length Bit Strings

Michael Klipper, with Dan Blandford and Guy Blelloch





Presentation Transcript


  1. Generalized Hashing with Variable-Length Bit Strings Michael Klipper, with Dan Blandford and Guy Blelloch. Original source: D. Blandford and G. E. Blelloch. Storing Variable-Length Keys in Arrays, Sets, and Dictionaries, with Applications. In Symposium on Discrete Algorithms (SODA), 2005 (hopefully)

  2. Hashing techniques currently available • Many hashing algorithms are out there: • Separate chaining • Cuckoo hashing • FKS perfect hashing • Many hash functions have also been designed, including several universal families • O(1) expected amortized time for updates, and many schemes have O(1) worst-case time for searches • They use Ω(n lg n) bits for n entries, since at least lg n bits are needed per entry to distinguish between keys.

  3. What kind of bounds do we achieve? Let’s say we store n entries in our hashtable of the form (si, ti) for i = 0, 1, 2, …, (n-1). Each si and ti is a bit string of variable length. For our purposes, many of the ti’s might only be a few bits long. Time for all operations (later slide): O(1) expected amortized Total space used: O(Σi (max(|si| - lg n, 1) + |ti|)) bits

  4. The Improvement We Attain Let’s say we store n entries taking up m total bits. In terms of the si and ti values on the previous slide, m = Σi (|si| + |ti|) Note that m = Ω(n lg n). Thus, our space usage is O(m – n lg n) bits, as opposed to the Ω(m) bits that standard hashtable structures use. In particular, our structure is much more efficient than standard structures when m is close to n lg n (for example, when most entries are only a few bits long).

  5. Goal: Generalized Dynamic Hashtables We want to support the following operations: • query(key, keyLength) • Looks up the key in the hashtable and returns the associated data and its length • insert(key, keyLength, data, dataLength) • Adds (key, data) as an entry in the hashtable • remove(key, keyLength) • Removes the key and its associated data NOTE: Each key will only have one entry associated with it. Another name for this kind of structure is a variable-length dictionary structure.
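Below is a minimal Python sketch of this interface, with a plain dict standing in for the compact structure developed in the rest of the talk; the class and method names here are mine, not the paper’s. Keys and data are passed as (value, length) pairs so that leading zeros are preserved.

    class VarLenDict:
        # Variable-length dictionary sketch: keys and data are bit
        # strings passed as (value, length) pairs so leading zeros count.

        def __init__(self):
            self._table = {}  # stand-in for the compact structure

        def query(self, key, key_len):
            # Returns (data, data_len), or None if the key is absent.
            return self._table.get((key, key_len))

        def insert(self, key, key_len, data, data_len):
            # Each key has at most one entry associated with it.
            self._table[(key, key_len)] = (data, data_len)

        def remove(self, key, key_len):
            self._table.pop((key, key_len), None)

    d = VarLenDict()
    d.insert(0b0110, 4, 0b101, 3)   # key "0110" -> data "101"
    assert d.query(0b0110, 4) == (0b101, 3)
    d.remove(0b0110, 4)
    assert d.query(0b0110, 4) is None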

  6. Other Structures • Variable-Length Sets • Also supports query, insert, and remove, though there is no extra data associated with keys • Can be easily implemented as a generalized hashtable that stores no extra data • O(1) expected amortized time for all operations • If the n keys are s0, s1, …, s(n-1), then the total space used in bits is O(Σi max(|si| - lg n, 1))

  7. Other Structures (cont.) • Variable-Length Arrays • For n entries, the keys are 0, 1, …, n-1. • These arrays will not be able to resize their number of entries. • Operations: • get(i) returns the data stored at index i and its length • set(i, val, len) updates the data at index i to val of length len • Once again, O(1) expected amortized time for operations. Total space usage is O(Σi |ti|).

  8. Implementation Note Assume for now that we have the variable-length array structure described on the previous slide. We will use this to make generalized dynamic hashtables, which are more interesting than the arrays. At the end of this presentation, I can talk about the implementation of variable-length arrays if time permits.

  9. The Main Idea Behind How Hashtables Work Our generalized hashtable structure contains a variable-length array with 2^q entries (which serve as the buckets for the hashtable). We keep 2^q approximately equal to n by occasional rehashing of the bucket contents. The item (si, ti), where si is the key and ti is the data, is placed in a bucket as follows: we first hash si to some index (more on this later), and we write (si, ti) into the bucket specified by that index. Note that when we hash si, we implicitly treat it as an integer.

  10. Hashtables (cont.) If several entries collide in a bucket, we throw them all into the bucket together as one giant concatenated bit string. Thus, we essentially use a separate-chaining algorithm. To tell where one entry ends and the next begins, we encode the entries with a prefix-free code (such as Huffman codes or gamma codes). Sample bucket (where si′ denotes the encoding of si, etc.): s1′ t1′ s2′ t2′ s3′ t3′
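As an illustration, here is a Python sketch of one such prefix-free scheme, Elias gamma codes, used to delimit the entries of a bucket. The helper names are mine; the paper only requires some prefix-free code with constant-factor overhead and O(1) encode/decode, which a table-driven implementation of gamma codes provides.

    def gamma_encode(n):
        # Elias gamma code of a positive integer n: its binary
        # representation (b bits) preceded by b - 1 zeros.
        assert n >= 1
        b = n.bit_length()
        return "0" * (b - 1) + format(n, "b")

    def gamma_decode(bits, pos=0):
        # Decode one gamma codeword starting at bits[pos]; return
        # (value, position after the codeword).  This loop is
        # O(codeword length); table lookup makes it O(1) per word.
        zeros = 0
        while bits[pos + zeros] == "0":
            zeros += 1
        end = pos + 2 * zeros + 1
        return int(bits[pos + zeros:end], 2), end

    def encode_entry(s, t):
        # Make a (key, data) pair of bit strings self-delimiting:
        # gamma-code each length (+1, since gamma needs n >= 1),
        # then append the raw bits.
        return gamma_encode(len(s) + 1) + s + gamma_encode(len(t) + 1) + t

    # A bucket is just the concatenation of its encoded entries:
    bucket = encode_entry("0110", "101") + encode_entry("11", "0")
    length1, pos = gamma_decode(bucket)          # first key's length + 1
    assert bucket[pos:pos + length1 - 1] == "0110"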

  11. Time and Space Bounds Note that we use prefix-free codes that add only a constant factor of space (i.e., they encode m bits in O(m) bits) and can be encoded/decoded in O(1) time. Time: If we use a universal hash function to determine the bucket index, then each bucket receives only a constant expected number of elements, so it takes O(1) expected amortized time to find an element in a bucket. The prefix-free codes we use allow O(1) decoding of any element. Space: The prefix-free codes increase the number of bits stored by at most a constant factor. If we have m total bits to store, our space bound for variable-length arrays says that the buckets take up O(m) bits.

  12. There’s a bit more than that… Recall the space bound for the hash table is O(Σi (max(|si| - lg n, 1) + |ti|)). Where does the lg n savings per entry come from? We perform a technique called quotienting. We actually use two hash functions h’ and h’’. h’(si) is the bucket index, and h’’(si) has length max(|si| - q, 1). (Recall that 2^q is approximately n.) Instead of writing (si, ti) in the bucket, we actually write (h’’(si), ti). This way, each entry needs |h’’(si)| + |ti| bits to write, which fulfills our space bound above.

  13. A Quotienting Scheme Let h0 be a hash function from a universal family whose range is q bits. We describe a way to make a family of hash functions from the family from which h0 is drawn. Let si^t be the q most significant bits of si, and let si^b be the remaining bits. We define our hash functions as follows: h’’(si) = si^b and h’(si) = h0(si^b) xor si^t. Example (q = 6): si = 101101 001010100100101, so si^t = 101101 and si^b = 001010100100101 = h’’(si); if h0(si^b) = 010011, then h’(si) = 010011 xor 101101 = 111110.

  14. Undoing the Quotienting In the previous example, we saw that h’(si) evaluated to 111110, or 62. This means we store h’’(si) in bucket number 62! Note that given h’(si) and h’’(si) we can retrieve si, because si^b = h’’(si) and si^t = h0(h’’(si)) xor h’(si). The family of h’ functions we make is another universal family, so our time bound explained earlier still holds.
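A small Python sketch of this quotienting scheme, under the assumption |si| > q (the general case pads h’’(si) to length at least 1, as on slide 12). The multiplicative h0 below is only a stand-in for a member of a universal family, so its output will not match the numbers in the example above; the round trip works regardless of which h0 is used.

    q = 6                                   # 2^q buckets

    def h0(x):
        # Stand-in hash with a q-bit range; the construction needs
        # h0 drawn from a universal family.
        return ((x * 2654435761) % (1 << 32)) >> (32 - q)

    def quotient(s, s_len):
        # Split key s (s_len bits, s_len > q assumed) into the q most
        # significant bits s_t and the low bits s_b = h''(s).
        low = s_len - q
        s_t = s >> low
        s_b = s & ((1 << low) - 1)
        bucket = h0(s_b) ^ s_t              # h'(s): the bucket index
        return bucket, s_b, low             # store (s_b, data) in bucket

    def unquotient(bucket, s_b, low):
        # Recover the key: s_t = h0(h''(s)) xor h'(s).
        s_t = h0(s_b) ^ bucket
        return (s_t << low) | s_b

    s, s_len = 0b101101001010100100101, 21  # the key from slide 13
    b, rem, low = quotient(s, s_len)
    assert unquotient(b, rem, low) == s     # quotienting is invertible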

  15. An Application of Hashtables: Graph Structures One area where we can use the hashtable structure is in storing graphs. Here, we describe a semidynamic directed-graph implementation. This means that the number of vertices is fixed, but edges can be added or deleted at runtime. Let u and v be vertices of a graph. We want to support the following operations compactly and in O(1) expected amortized time: • deg(v) - get the degree of vertex v • adjacent(u, v) - returns true iff u and v are adjacent • firstEdge(v) - returns the first neighbor of v in G • nextEdge(u, v) - returns the next neighbor of u after v (assumes u and v are adjacent) • addEdge(u, v) - adds an edge from u to v in G • deleteEdge(u, v) - deletes the edge (u, v) from G

  16. Hashing Integers Up to now, we have used bit strings as the main objects in the hashtable. It will also be useful to hash on integer values. Hence, we have created some utilities to convert between bit strings and integers using as few bits as possible, so an integer x takes basically lg |x| bits to write as a bit string.
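A sketch of such a conversion in Python (names mine): minimal binary uses about lg x bits, but it is not self-delimiting on its own, so in the hashtable the length is carried separately, e.g. by the prefix-free codes of slide 10.

    def int_to_bits(x):
        # Minimal binary: about lg x bits, no leading zeros.
        return format(x, "b") if x > 0 else "0"

    def bits_to_int(bits):
        return int(bits, 2)

    assert int_to_bits(62) == "111110" and bits_to_int("111110") == 62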

  17. A Graph Layout Where We Store Edges in a Hashtable Let’s say u is a vertex of degree d and v1, …, vd are its neighbors, with v0 = v(d+1) = u by convention. Then the entry representing the edge (u, vi) has key (u, vi) and data (v(i-1), v(i+1)). One extra entry, with key (u, u), “starts” the list: it stores the last neighbor, the first neighbor, and the degree of u. [Figure: the hash table for a vertex u of degree 4 holds keys (u, u), (u, v1), …, (u, v4); the (u, u) entry records the degree and points to v1 and v4.]

  18. Implementations of a Couple of Operations For simplicity, I’m leaving off the length arguments in query() and insert(). • adjacent(u, v): return (query((u, v)) != -1) • firstEdge(u): let (vp, vn, d) = query((u, u)); return vn • addEdge(u, v): let (vp, vn, d) = query((u, u)); remove((u, u)); insert((u, u), (vp, v, d + 1)); insert((u, v), (u, vn)) • (To be fully correct, addEdge must also update the old first neighbor’s entry (u, vn), changing its predecessor from u to v when vn differs from u; the sketch below includes this step.)
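Here is a runnable Python sketch of this scheme, with a plain dict T standing in for the compact hashtable. The sentinel entry (u, u) stores (last neighbor, first neighbor, degree); entry (u, v) stores (predecessor, successor). Unlike the slide’s simplified pseudocode, the sketch also updates the old first neighbor’s predecessor field, and the sentinel’s last-neighbor field when the list was empty.

    T = {}                                  # stand-in hashtable

    def init_vertex(u):
        T[(u, u)] = (u, u, 0)               # empty circular list

    def deg(u):
        return T[(u, u)][2]

    def adjacent(u, v):
        return (u, v) in T

    def first_edge(u):
        _, vn, _ = T[(u, u)]
        return vn if vn != u else None      # u itself means "no neighbors"

    def next_edge(u, v):
        _, vn = T[(u, v)]
        return vn if vn != u else None

    def add_edge(u, v):
        # Assumes (u, v) is not already present.
        vp, vn, d = T[(u, u)]
        T[(u, u)] = (vp if d > 0 else v, v, d + 1)  # v is the new first neighbor
        T[(u, v)] = (u, vn)                 # predecessor is the sentinel
        if vn != u:                         # fix old first neighbor's back link
            _, nxt = T[(u, vn)]
            T[(u, vn)] = (v, nxt)

    init_vertex(1)
    add_edge(1, 5); add_edge(1, 9)
    assert adjacent(1, 5) and first_edge(1) == 9 and deg(1) == 2
    assert next_edge(1, 9) == 5 and next_edge(1, 5) is None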

  19. Compression and Space Usage • Instead of ((u, vi), (v(i-1), v(i+1))) in the table, we will store ((u, vi – u), (v(i-1) – u, v(i+1) – u)) • With this representation, we need O(Σ(u,v)∈E lg |u – v|) space. • A good labeling of the vertices will make many of these differences small. For instance, for many classes of graphs, such as planar graphs, the total space used is O(n) bits! The following paper has details: D. Blandford, G. E. Blelloch, and I. Kash. Compact Representations of Separable Graphs. In SODA, 2003, pages 342-351.
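A Python sketch of the difference trick; the zigzag fold for handling negative differences is my own choice here, since the paper just needs some code that writes vi – u in O(lg |u – vi|) bits.

    def zigzag(d):
        # Fold signed differences into nonnegative integers so a
        # short binary code applies: 0,-1,1,-2,2 -> 0,1,2,3,4.
        return 2 * d if d >= 0 else -2 * d - 1

    def edge_key(u, v):
        # Store (u, v - u) instead of (u, v); a good vertex labeling
        # makes most differences, and hence most keys, short.
        return (u, zigzag(v - u))

    assert edge_key(100, 103) == (100, 6)   # nearby neighbor -> small key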

  20. More Details about Implementing Arrays We’ll use the following data for our example in these slides: t0 = 10110, t1 = 0110, t2 = 11111, t3 = 0101, t4 = 1100, t5 = 010, t6 = 11011, t7 = 00001111. We’ll assume that the word size is 2 bytes.

  21. Key Idea: BLOCKS • Multiple data items can be crammed into a word, so let’s take advantage of that. • There are many possible ways to store data in blocks. The way that I’ll discuss here is to use two words per block: one stores data and one marks where entries begin. Example (the block b0 containing strings t0 through t2 from our example): 1st word (data): 1011001101111100; 2nd word (markers): 1000010001000010 — a 1 at each position where an entry starts, plus one more marking the end of the used region.

  22. Blocks: continued We’ll name a block bi if i is the first entry number to be stored in that block. The size of a block is the sum of the sizes of the entries inside it. We’ll maintain a size invariant: for any adjacent blocks bi and bj, |bi| + |bj| is at least a full word. Note: splitting and merging blocks is easy. We assume these things for now: • Entries fit into a word… we can handle longer entries by storing a pointer to separate memory in its place • Entries are nonempty
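A Python sketch of this two-word block layout, checked against the picture on slide 21 (pack_block is my name for it; the sketch assumes the block is not completely full, so the end marker fits in the word).

    W = 16                                  # word size: 2 bytes

    def pack_block(entries):
        # Pack bit strings into a data word and a marker word.  Marker
        # bit j is 1 iff some entry starts at data position j; one more
        # 1 marks the end of the used region (assumes pos < W).
        data, marks, pos = 0, 0, 0
        for e in entries:                   # e is a string of '0'/'1'
            marks |= 1 << (W - 1 - pos)     # an entry starts here
            for bit in e:
                data |= int(bit) << (W - 1 - pos)
                pos += 1
        marks |= 1 << (W - 1 - pos)         # end-of-data marker
        return data, marks

    # The block b0 holding t0 = 10110, t1 = 0110, t2 = 11111:
    data, marks = pack_block(["10110", "0110", "11111"])
    assert format(data, "016b") == "1011001101111100"
    assert format(marks, "016b") == "1000010001000010"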

  23. Organization of Blocks • We have a bit array A of length n (this is a regular old C array). A[i] = 1 if and only if string #i starts a block. This is our indexing structure. • We also have a standard hashtable H. If string #i starts a block, H(i) = the address of bi. We assume H is computed in O(1) expected amortized time. • Blocks are large enough that storing them in H only increases the space usage by a constant factor. Example: A = 10010001 (1s at entries 0, 3, and 7), so H(0) points to block b0 holding t0, t1, t2; H(3) points to b3 holding t3, t4, t5, t6; and H(7) points to b7 holding t7. Here b0 and b3 are adjacent blocks, as are b3 and b7.

  24. A Note about Space Usage Any two 1’s in the indexing structure A are separated by at most w positions (one word’s worth). This is because entries are nonempty and a block holds only one word of entries.

  25. The get() operation • Since the bits that are turned on in A are close together, we can find the block to which an entry belongs in O(1) time. One way to do this is table lookup. • If the ith entry is in block bk, then the ith entry of the array is the (i – k + 1)st entry in that block. • By using table lookup, we can find where the correct 1’s in the second word are, which tell us where the entry starts and ends.

  26. A picture of the get() operation, illustrated with get(2) To find entry #2, we scan A backward from position 2 until we hit a 1, at position 0: entry 2 therefore lives in block b0 and is its 3rd entry. The 3rd and 4th 1s of b0’s marker word 1000010001000010 sit at positions 9 and 14, so the entry occupies bits 9 through 13 of the data word 1011001101111100. Conclusion: Entry 2 is 5 bits long. It is 11111.
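A Python sketch of get(), reproducing the get(2) example above. The backward scan over A and the scan of the marker word are written as O(w) loops here, whereas the real structure replaces both with O(1) table lookups.

    W = 16

    def get(i, A, H, blocks):
        # A marks block starts, H maps a start index to a block
        # address, blocks maps addresses to (data, marker) word pairs.
        k = i
        while not A[k]:                     # find the block's first entry
            k -= 1
        data, marks = blocks[H[k]]
        # Entry i is the (i - k + 1)-st in block bk: its boundaries are
        # consecutive 1 bits of the marker word.
        ones = [j for j in range(W) if (marks >> (W - 1 - j)) & 1]
        start, end = ones[i - k], ones[i - k + 1]
        bits = format(data, "016b")[start:end]
        return bits, end - start            # the entry and its length

    A = [1, 0, 0]                           # only entry 0 starts a block
    H = {0: 0}                              # block b0 lives at "address" 0
    blocks = {0: (0b1011001101111100, 0b1000010001000010)}
    assert get(2, A, H, blocks) == ("11111", 5)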

  27. How set() works in a nutshell • Find the block with the entry. • Rewrite it. • If the block is too large, split it into two. • Merge adjacent blocks together to preserve the size invariant.
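At the level of a single block’s entry list, set() can be sketched like this (unpack_block inverts the packing from slide 21; updating A and H after a split, and the merge step that restores the size invariant, are bookkeeping omitted here).

    W = 16

    def unpack_block(data, marks):
        # Recover the entries of a block from its (data, marker) words.
        ones = [j for j in range(W) if (marks >> (W - 1 - j)) & 1]
        word = format(data, "016b")
        return [word[a:b] for a, b in zip(ones, ones[1:])]

    def set_entry(entries, j, val):
        # Rewrite entry j, then split if the block overflows its word
        # (one bit is reserved for the end marker).  Merging with a
        # neighbor to restore the size invariant is symmetric.
        entries = entries[:j] + [val] + entries[j + 1:]
        if sum(map(len, entries)) <= W - 1:
            return [entries]                # still one block
        mid = len(entries) // 2
        return [entries[:mid], entries[mid:]]

    entries = unpack_block(0b1011001101111100, 0b1000010001000010)
    assert entries == ["10110", "0110", "11111"]
    assert set_entry(entries, 1, "01100110") == [["10110"],
                                                 ["01100110", "11111"]]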

  28. Now, to prove the theorem about space usage for arrays • Let m = Σi |ti| and w = the machine word size. I claim the total number of bits used is O(m). • Our size invariant for blocks guarantees that, on average, blocks are at least half full. Thus, there are O(m / w) blocks used, since there are m bits of data in total and each block stores Ω(w) bits on average. • Our indexing structure A and hashtable H use O(w) bits per block (O(1) words). Total bits: O(m / w) blocks * O(w) per block = O(m) bits.

  29. A note about entries longer than w bits What is really done in our code with entries longer than w bits is not just allocating separate memory and putting a pointer in the array, though it’s close. We do essentially what standard structures do, and we chain the words making up our entry into a linked list. We have a clever way to do this which doesn’t need w-bit pointers; instead we only need 7 or 8 bits per pointer.
