Efficient Hashing Techniques for Data Management

Hashing 8 April 2003

Example • Consider a situation where we want to keep a list of records for students currently doing the BSU CS degree, with each student uniquely identified by a student number. • Student numbers currently range from about 1,000,000 up to about 9,999,999, so an array of 10 million elements would be enough to hold every possible student number. • Given that each student record is at least 100 bytes long, such an array would require at least 1,000 megabytes.

Example - 2 • There are fewer than 400 students enrolled in CS at present, so almost all of a 10-million-element array would sit empty. • There must be a better way. • We could keep a sorted array of 400 elements and retrieve students using a binary search, which takes O(log n) comparisons. • We want our access to be as fast as possible. In this situation we would use a hash table.

Example - 3 • Find some way to transform a student number from a range of several million possible values to a range closer to 400, while avoiding (as much as possible) the case where two numbers transform (or hash) to the same value. • We place the records according to their transformed key into a new array (the hash table) containing at least 400 elements.

Example - 4 • Make the hash table 479 elements long. • A popular method for transforming keys is the mod operator: take the remainder upon integer division of the original key by the size of the hash table.

Example - 5 • For example, consider student number 949,786,456: 949786456 % 479 = 348 • Therefore we should place this student in array element 348 of the hash table. (Note: the mod operator works here because its result always lies in the range 0 to 478, so it is always a valid index into a 479-element table.)
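The arithmetic in this example can be checked with a minimal sketch. The function name hash_student_number and the constant TABLE_SIZE are illustrative names, not part of the original slides.

```python
TABLE_SIZE = 479  # hash table size used in the example


def hash_student_number(student_number: int, table_size: int = TABLE_SIZE) -> int:
    """Map a student number to a slot index in the range 0..table_size-1."""
    return student_number % table_size


print(hash_student_number(949786456))  # 348
```

Any non-negative student number yields an index between 0 and 478, so the result can be used directly as an array subscript.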

Direct Access Table If we have a collection of n elements whose keys are unique integers in (1, m), where m >= n, then we can store the items in a direct address table, T[m], where T[i] is either empty or contains one of the n elements. Searching a direct address table is an O(1) operation: • for a key, k, we access T[k], • if it contains an element, return it, • if it doesn't, return NULL. There are two constraints: • the keys must be unique, and • the range of the keys must be severely bounded.
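A direct address table along these lines might be sketched as follows. The class name DirectAddressTable and its method names are assumptions made for illustration; Python's None stands in for NULL.

```python
class DirectAddressTable:
    """Table indexed directly by integer keys in the range 1..m."""

    def __init__(self, m: int) -> None:
        self.m = m
        # Slot 0 is left unused so that key k lives at index k.
        self.slots = [None] * (m + 1)

    def insert(self, key: int, element) -> None:
        if not 1 <= key <= self.m:
            raise KeyError(f"key {key} is outside 1..{self.m}")
        self.slots[key] = element

    def search(self, key: int):
        """Return the element stored under key, or None if the slot is empty."""
        if not 1 <= key <= self.m:
            return None
        return self.slots[key]


table = DirectAddressTable(5)
table.insert(3, "record for key 3")
print(table.search(3))  # record for key 3
print(table.search(4))  # None
```

Both operations are a single index into the array, which is where the O(1) bound comes from.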


Using Linked Lists • If the keys are not unique, then we can construct a set of m lists and store the heads of these lists in the direct address table. • The time to find the list for a given key is still O(1). • If the maximum number of duplicates is n_dupmax, then searching that list for a specific element is O(n_dupmax).

Using Linked Lists • If duplicates are the exception rather than the rule, then n_dupmax is much smaller than n and a direct address table will provide good performance. • But if n_dupmax approaches n, then the time to find a specific element approaches O(n), and some other structure such as a tree will be more efficient.
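The two costs described above (O(1) to reach the list, then a scan over the duplicates) can be seen in a small sketch. The class DuplicateKeyTable is a hypothetical name, and Python lists are used here as a stand-in for linked lists.

```python
class DuplicateKeyTable:
    """Direct address table whose slots hold lists of elements sharing a key."""

    def __init__(self, m: int) -> None:
        self.m = m
        # Slot 0 is unused; keys are assumed to lie in 1..m.
        self.slots = [[] for _ in range(m + 1)]

    def insert(self, key: int, element) -> None:
        self.slots[key].append(element)

    def find_all(self, key: int) -> list:
        # One index operation reaches the list for this key: O(1).
        return self.slots[key]

    def find(self, key: int, matches):
        # Scanning the duplicates is O(n_dupmax) in the worst case.
        for element in self.slots[key]:
            if matches(element):
                return element
        return None


table = DuplicateKeyTable(10)
table.insert(4, "first")
table.insert(4, "second")
print(table.find_all(4))                        # ['first', 'second']
print(table.find(4, lambda e: e == "second"))   # second
```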


Analysis • The range of the keys determines the size of the direct address table and may be too large to be practical. • For instance, it’s not likely that you’ll be able to use a direct address table to store elements which have arbitrary 32-bit integers as their keys for a few years yet! • Direct addressing is easily generalized to the case where there is a function, h(k) => (1, m), which maps each value of the key, k, to the range (1, m). In this case, we place the element in T[h(k)] rather than T[k], and we can search in O(1) time as before.

Mapping Functions • The direct address approach requires that the function, h(k), is a one-to-one mapping from each k to integers in (1, m). Such a function is known as a perfect hashing function: it maps each key to a distinct integer within some manageable range and lets us build an O(1) search time table. • Finding a perfect hashing function is not always possible. • Sometimes we can find a hash function which maps most of the keys onto unique integers, but maps a small number of keys onto the same integer. • If the number of collisions is sufficiently small, then hash tables work well and give O(1) search times.

Handling Collisions • When multiple keys map to the same integer, elements with different keys compete for the same “slot” of the hash table, so more than one element may need to be stored in a single slot. • Techniques used to manage this problem are: • chaining • overflow areas • re-hashing • using neighboring slots (linear probing) • quadratic probing • random probing

Chaining • One simple scheme is to chain all collisions in lists attached to the appropriate slot. • This allows an unlimited number of collisions to be handled and doesn't require a priori knowledge of how many elements the collection contains. • The tradeoff is the same as with linked lists versus array implementations of sets: linked lists incur overhead in space and, to a lesser extent, in time.


How Chaining Works • To insert a new item in the table, we hash the key to determine which list the item goes on, then insert the item at the beginning of that list. (For example, with a table of 8 slots, to insert 11 we divide 11 by 8, giving a remainder of 3. Thus, 11 goes on the list starting at HashTable[3].) • To find an item, we hash the number and then follow links in the chain down the list to see if it is present.

How Chaining Works-2 • To delete a number, we find the number and remove the node from the appropriate linked list. • Entries in the hash table are dynamically allocated and entered on a linked list associated with each hash table entry. • Alternative methods, where all entries are stored in the hash table itself, are known as direct or open addressing.
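The insert, find and delete steps for chaining can be sketched as below. This is a minimal illustration assuming integer keys and the 8-slot table from the example; the names Node and ChainedHashTable are not from the original slides.

```python
class Node:
    """One entry in a singly linked chain."""

    def __init__(self, key: int, next_node=None) -> None:
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.heads = [None] * size  # heads[i] is the first node of chain i

    def _slot(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int) -> None:
        # New items go at the beginning of the chain.
        i = self._slot(key)
        self.heads[i] = Node(key, self.heads[i])

    def find(self, key: int) -> bool:
        node = self.heads[self._slot(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key: int) -> bool:
        """Unlink the first node holding key; return False if it is absent."""
        i = self._slot(key)
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False


table = ChainedHashTable(8)
table.insert(11)   # 11 % 8 == 3, so 11 goes on the chain at slot 3
table.insert(19)   # 19 % 8 == 3 as well, so it is placed in front of 11
print(table.heads[3].key)  # 19
print(table.find(11))      # True
```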

Re-hashing • Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty “slot” in the table is found. • The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same order, then a sought key can always be found.


Linear probing • One of the simplest re-hashing functions is +1 (or -1), i.e., on a collision, look in the neighboring slot in the table. • It calculates the new address extremely quickly.

Open Addressing 1. Linear Probing In linear probing, when a collision occurs, the new element is put in the next available spot (essentially doing a sequential search). Example: insert 49, 18, 89, 48 into a hash table of size 10, so 49 % 10 = 9, 18 % 10 = 8, 89 % 10 = 9, 48 % 10 = 8. Both 89 and 48 collide with earlier keys, so each is placed in the next free slot after its home position.
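The example insertions can be traced with a short sketch. It assumes the search for a free slot wraps around from the last slot to slot 0, which the slide text does not state explicitly; the function name linear_probe_insert is illustrative.

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key using linear probing and return the slot it ended up in."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m  # assumption: probing wraps around the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 10
placements = [linear_probe_insert(table, k) for k in (49, 18, 89, 48)]
print(placements)  # [9, 8, 0, 1]
print(table)       # [89, 48, None, None, None, None, None, None, 18, 49]
```

Under that assumption, 89 finds slot 9 taken and lands in slot 0, while 48 has to step past slots 8, 9 and 0 before reaching slot 1.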


Problems • In linear probing, records tend to cluster around each other. (Once an element is placed in the hash table, the chances of its adjacent element being filled are doubled: it can be filled either directly or by a collision.) • If two adjacent elements are filled, then the chances of the next element being filled are three times those for an element with no neighbor.

Animation from the Web The animation gives you a practical demonstration of the effect of linear probing: it also implements a quadratic re-hash function so that you can see differences. http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/hash_tables.html

Clustering • Linear probing is subject to a clustering phenomenon. • Re-hashes from one location occupy a block of slots in the table which “grows” towards slots and blocks to which other keys hash. • This exacerbates the collision problem and the number of re-hashes can become large.

Quadratic Probing • Better behavior is usually obtained with quadratic probing, where the secondary hash function depends on the re-hash index: address = h(key) + c * i^2 on the i-th re-hash. (A more complex function of i can be used.) • Quadratic probing is susceptible to secondary clustering, since keys which have the same hash value also have the same probe sequence. • Secondary clustering is not nearly as severe as the clustering caused by linear probing.
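The probe sequence can be listed with a small sketch. It assumes h(key) = key mod table size and that addresses wrap around with a mod; the function name and the default c = 1 are illustrative choices.

```python
def quadratic_probe_sequence(key: int, table_size: int, c: int = 1,
                             max_probes: int = 5) -> list:
    """Return the first max_probes addresses tried for key."""
    home = key % table_size
    return [(home + c * i * i) % table_size for i in range(max_probes)]


print(quadratic_probe_sequence(89, 10))  # [9, 0, 3, 8, 5]
print(quadratic_probe_sequence(49, 10))  # [9, 0, 3, 8, 5]
```

Keys 89 and 49 share the home slot 9 and therefore follow an identical sequence, which is the secondary clustering mentioned above. A key with a different home slot, such as 48, follows a different sequence even where the slots are adjacent.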

Overflow area • When a collision occurs, a slot in an overflow area is used for the new element and a link from the primary slot established as in a chained system. • This is essentially the same as chaining, except that the overflow area is pre-allocated and thus may be faster to access. • As with re-hashing, the maximum number of elements must be known in advance, but in this case, two parameters must be estimated: the optimum size of the primary and overflow areas.
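One way to picture an overflow area is the sketch below. It assumes integer keys, a mod hash function, and overflow slots handed out in order; the class OverflowHashTable and its cell layout [key, next_index] are hypothetical details chosen for illustration.

```python
class OverflowHashTable:
    """Primary table plus a pre-allocated overflow area linked by indices."""

    def __init__(self, primary_size: int, overflow_size: int) -> None:
        self.primary_size = primary_size
        # Each used cell is [key, next_index]; next_index points into overflow.
        self.primary = [None] * primary_size
        self.overflow = [None] * overflow_size
        self.next_free = 0  # next unused overflow slot

    def insert(self, key: int) -> None:
        slot = key % self.primary_size
        if self.primary[slot] is None:
            self.primary[slot] = [key, None]
            return
        if self.next_free >= len(self.overflow):
            raise RuntimeError("overflow area is full")
        # Walk to the end of the chain that starts in the primary slot.
        cell = self.primary[slot]
        while cell[1] is not None:
            cell = self.overflow[cell[1]]
        self.overflow[self.next_free] = [key, None]
        cell[1] = self.next_free
        self.next_free += 1

    def find(self, key: int) -> bool:
        cell = self.primary[key % self.primary_size]
        while cell is not None:
            if cell[0] == key:
                return True
            cell = self.overflow[cell[1]] if cell[1] is not None else None
        return False


table = OverflowHashTable(primary_size=10, overflow_size=4)
for k in (49, 18, 89, 48):
    table.insert(k)
print(table.primary[9])   # [49, 0]  -> linked to overflow slot 0
print(table.overflow[0])  # [89, None]
print(table.find(48))     # True
```

Both sizes are fixed when the table is created, which reflects the point that the primary and overflow areas must be estimated in advance.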



Hash Functions • If the hash function is uniform (equally distributes the data keys among the hash table indices), then hashing effectively subdivides the list to be searched. • Worst-case behavior occurs when all keys hash to the same index. Why? • It is important to choose a good hash function.
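The difference between a uniform spread and the worst case can be seen by counting how many keys land on each index. The helper chain_lengths and the sample keys are made up for this illustration.

```python
from collections import Counter


def chain_lengths(keys, table_size: int) -> list:
    """Count how many keys hash (via mod) to each index of the table."""
    counts = Counter(key % table_size for key in keys)
    return [counts.get(i, 0) for i in range(table_size)]


keys = [10, 20, 30, 40, 50, 60, 70]
print(chain_lengths(keys, 10))  # [7, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(chain_lengths(keys, 7))   # [1, 1, 1, 1, 1, 1, 1]
```

With a table size of 10, every key ends up at index 0, so a lookup has to search all seven entries and behaves like a linear search. With a table size of 7, the same keys are spread one per index.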

Choosing Hash Functions • Choice of h: h[x] • must be simple • must distribute (spread) the data evenly • Choice of m: m approximates n (about 1 item/linked list) where n = input size

Mod Function • x is an integer value and h[x] = x mod m. • Example: choosing a three-digit hash for phone numbers, e.g. 398-3738. • Choosing the last three digits (738) is more appropriate than the first three digits (398), as it distributes the data more evenly. To do this, use the mod function: • h[x] = x mod 10^k gives the last k digits • h[x] = x mod 2^k gives the last k bits
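Both variants of the mod function can be written in a line each. The function names last_k_digits and last_k_bits are illustrative.

```python
def last_k_digits(x: int, k: int) -> int:
    """h[x] = x mod 10^k keeps the last k decimal digits of x."""
    return x % 10 ** k


def last_k_bits(x: int, k: int) -> int:
    """h[x] = x mod 2^k keeps the last k binary digits of x."""
    return x % 2 ** k


print(last_k_digits(3983738, 3))  # 738
print(last_k_bits(0b101101, 3))   # 5, i.e. binary 101
```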

Middle Digits of an Integer • This often yields unpredictable (and thus good) distributions of the data. • Assume that you wish to take the two digits three positions from the right of x. • If x = 539872178, then h[x] = 72. • This is obtained by h[x] = (x/1000) mod 100, where the integer division (x/1000) drops three digits and (x/1000) mod 100 keeps two digits.
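The same calculation, generalized to any number of dropped and kept digits, might look like this. The function name middle_digits and its parameter names are assumptions for illustration.

```python
def middle_digits(x: int, skip: int, keep: int) -> int:
    """Drop the last `skip` decimal digits of x, then keep the next `keep`."""
    return (x // 10 ** skip) % 10 ** keep


print(middle_digits(539872178, skip=3, keep=2))  # 72
```

Integer division by 1000 turns 539872178 into 539872, and taking that value mod 100 leaves 72.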

Order Preserving Hash Function • x < y implies h[x]<= h[y] • Application: Sorting
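As a sketch of how an order-preserving hash can support sorting, the example below scales keys into buckets and then sorts within each bucket. The scaling formula and the bucket-sort arrangement are one possible choice, not something the slide specifies; keys are assumed to be integers in 0..max_key.

```python
def order_preserving_hash(x: int, max_key: int, m: int) -> int:
    """Scale x from 0..max_key onto 0..m-1; x < y implies h[x] <= h[y]."""
    return x * m // (max_key + 1)


def bucket_sort(keys, max_key: int, m: int) -> list:
    """Distribute keys into m ordered buckets, then sort each bucket."""
    buckets = [[] for _ in range(m)]
    for x in keys:
        buckets[order_preserving_hash(x, max_key, m)].append(x)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result


print(bucket_sort([42, 7, 99, 15, 63, 0], max_key=99, m=10))
# [0, 7, 15, 42, 63, 99]
```

Because the hash never reorders keys, concatenating the buckets in index order gives a fully sorted result.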

Perfect Hashing Function • A perfect hashing function is one that causes no collisions. • Perfect hashing functions can be found only under certain conditions. • One application of the perfect hash function is a static dictionary. • h[x] is designed after having peeked at the data.
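For a small static key set, "peeking at the data" can be as simple as trying table sizes until the mod function produces no collisions. This brute-force search is only an illustration, not a general construction; the function name find_perfect_modulus and the limit max_m are hypothetical.

```python
def find_perfect_modulus(keys, max_m: int = 1000):
    """Return the smallest m for which k mod m is distinct for every key."""
    keys = list(keys)
    for m in range(max(1, len(keys)), max_m + 1):
        if len({k % m for k in keys}) == len(keys):
            return m
    return None  # no collision-free modulus found up to max_m


keys = [49, 18, 89, 48]
m = find_perfect_modulus(keys)
print(m)                      # 7
print([k % m for k in keys])  # [0, 4, 5, 6]
```

The result holds only for this fixed set of keys: adding a new key later may introduce a collision, which is why the technique suits static dictionaries.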

Retrieval • Retrieving a record follows the same steps as insertion. • Take the key value, perform the same transformation as for insertion, then look up the value in the hash table.

Issues • There are two basic issues when designing a hash algorithm: • Choosing the best hash function • Deciding what to do with collisions

Hash Function Strategies • If the key is an integer and there is no reason to expect a non-random key distribution, then the modulus operator is a simple, efficient and effective method. • If the key is a string value (e.g. someone’s name or C++ reserved words), then it first needs to be transformed to an integer.
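One common way to turn a string into an integer index is a polynomial hash over its character codes, sketched below. The slides do not name a particular method; the base 31 and the function name string_hash are assumptions for illustration.

```python
def string_hash(text: str, table_size: int, base: int = 31) -> int:
    """Fold the character codes of text into an index 0..table_size-1."""
    value = 0
    for ch in text:
        # Reduce at every step so the running value stays small.
        value = (value * base + ord(ch)) % table_size
    return value


print(string_hash("ab", 1000))  # 105, since (97 * 31 + 98) % 1000 == 105
print(0 <= string_hash("while", 479) < 479)  # True
```

Because each character is multiplied by a different power of the base, strings containing the same letters in a different order generally hash to different values.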
