
Hashing

Hashing. 15-211 Fundamental Data Structures and Algorithms. Margaret Reid-Miller, 18 January 2005.


Presentation Transcript


  1. Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005

  2. Plan • Today • Seat assignments • Hash functions • Reading: • For today and next time: Sedgewick Chapter 14 • Reminder: HW0 due on Thursday

  3. Hash Tables An Alternative Representation for Dictionaries

  4. Dictionary Interface An Abstract Data Type that maintains a dynamic set is a Dictionary. Crucial operations: • Insert • Find • Remove Standard operations: create, destroy, copy,…

  5. Dictionary Interface • insert: may or may not allow multiple occurrences • find: membership query, often also retrieves associated information • remove: may use deferred actions to speed up the amortized running time

  6. Small Universe • Suppose we have a small universe U = {0,1,2,…,M-1} of items. • We want to maintain a subset A of U. • Easy: use an array of bits (booleans) of size M. • Insert: A[k] = 1 • Find: return A[k] != 0 • Remove: A[k] = 0 Operations are constant time.
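
The three operations on this slide can be sketched in Java as follows. This is only a minimal illustration: the class name `SmallUniverseSet` is made up here, and a `boolean[]` stands in for the array of bits.

```java
// Sketch of a small-universe set: one flag per possible item in U = {0, ..., M-1}.
// The class name SmallUniverseSet is illustrative and does not come from the slides.
final class SmallUniverseSet {
    private final boolean[] present; // present[k] is true iff k is in the subset A

    SmallUniverseSet(int universeSize) {
        present = new boolean[universeSize]; // all false, so A starts empty
    }

    void insert(int k)  { present[k] = true; }   // A[k] = 1
    boolean find(int k) { return present[k]; }   // return A[k] != 0
    void remove(int k)  { present[k] = false; }  // A[k] = 0
}
```

Each method is a single array access, which is where the constant-time claim comes from. The cost is M flags of storage no matter how small A is.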

  7. Direct Access Tables • In most applications we do not store simple items but pairs (key, object). • Use an array of pointers (references to objects). • Insert: A[key] = object • Find: return A[key] • Remove: A[key] = null Again operations are constant time.
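
A direct access table might look like the sketch below. The names `DirectAccessTable` and `V` are assumptions made for this example, and keys are assumed to be integers in the range 0 .. M-1.

```java
// Sketch of a direct access table: the key itself is the array index.
// DirectAccessTable and the type parameter V are illustrative names.
final class DirectAccessTable<V> {
    private final Object[] slots; // slots[key] holds the stored object, or null

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V object) { slots[key] = object; }  // A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }               // null means "absent"

    void remove(int key) { slots[key] = null; }              // A[key] = null
}
```

Because `null` doubles as the "absent" marker, this sketch cannot store `null` as a value.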

  8. Large Universe • But what if the universe U of keys is large (and the subset is small)? E.g., names, or the symbol table of a compiler. • Even when the identifiers are at most 16 characters long, there are some 10^28 possibilities.

  9. Hashing – the Idea [Figure: keys a–z mapped to table slots 0–10] • Map keys into integers in the range 0 .. m-1, where m << M and m is the table size. • Pick a “good” mapping from keys to integers: • Easy to compute • Even distribution into the table

  10. Hashing – Terminology • The array in which we store the objects is the hash table. • To enter an object into the table, we compute an index from the key. • The map from the key to the index is a hash function h: h(key) = index

  11. Space-Time Tradeoff • A direct table has O(1) operations in the worst case. But space may be prohibitive. • Minimize space by using a sequential search. • Hashing balances space and time (on average) by changing the size of the hash table.

  12. Problem - Collisions • Fundamental problem: Some keys map to the same location, a collision: h(x) = h(y). • Can we prevent collisions?

  13. Pigeonhole Principle • There is no way to avoid collisions. • Since m << M there must be at least two keys that map to the same index. • The famous Pigeonhole Principle: If you put more than k items into k bins, then at least one bin contains more than one item.
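
The pigeonhole argument can be checked mechanically. The sketch below uses an illustrative helper, `findCollision`, which is not from the lecture. It assumes the keys are distinct and that `h` returns indices in 0 .. m-1. Given more than m such keys, it must return a colliding pair.

```java
import java.util.function.IntUnaryOperator;

// Sketch: search a list of distinct keys for two that hash to the same index.
// PigeonholeDemo and findCollision are illustrative names.
final class PigeonholeDemo {
    // Returns {x, y} with h(x) == h(y), or null if these keys do not collide.
    // Assumes the keys are distinct and h maps every key into 0 .. m-1.
    static int[] findCollision(int[] keys, int m, IntUnaryOperator h) {
        Integer[] owner = new Integer[m];      // owner[i] = first key seen at index i
        for (int key : keys) {
            int index = h.applyAsInt(key);
            if (owner[index] != null) {
                return new int[] { owner[index], key };
            }
            owner[index] = key;
        }
        return null;                           // only possible with at most m keys
    }
}
```

With m = 5 and the six keys 0 through 5, any choice of `h` yields a non-null result. The loop can fill at most five distinct slots before a key lands on an occupied one.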

  14. Problem - Hash Function • Second problem: How do we find a suitable hash function? • Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions. • That is, we want h to appear random, as though “hashing” the keys.

  15. Hash Functions

  16. Hashing-Efficiency • We also need to make sure h(k) is easy to compute. • Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree? • Goal: All operations should be constant time. • But things can go badly wrong on rare occasions.
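
One common answer to the array question is to fold the elements together one at a time. The sketch below is an assumption about how this could be done, not the lecture's own method: the class name `ArrayKeyHash` and the multiplier 31 are choices made for this example (31 is also the multiplier used by `java.util.Arrays.hashCode`).

```java
// Sketch: turn an array of integers into a single integer, then into a table index.
// ArrayKeyHash, combine and index are illustrative names.
final class ArrayKeyHash {
    // Folds the elements left to right; int overflow simply wraps around.
    static int combine(int[] key) {
        int h = 1;
        for (int element : key) {
            h = 31 * h + element;
        }
        return h;
    }

    // Reduces the combined value to an index in 0 .. m-1, even when it is negative.
    static int index(int[] key, int m) {
        return Math.floorMod(combine(key), m);
    }
}
```

The fold is order-sensitive, so {1, 2} and {2, 1} get different values. It takes time proportional to the length of the key, so "constant time" here assumes keys of bounded size.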

  17. Division method • Assume without loss of generality that the keys are integers. • A simple hash function is h(k) = k mod m, where m is the table size. • The choice of m is crucial. • Good choice: m prime.
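
A minimal Java sketch of the division method follows. The class `DivisionHash` and the helper `distinctSlots` are illustrative names. `Math.floorMod` is used instead of `%` so that negative keys still land in 0 .. m-1.

```java
// Sketch of the division method h(k) = k mod m.
// DivisionHash and distinctSlots are illustrative names.
final class DivisionHash {
    static int hash(int k, int m) {
        return Math.floorMod(k, m);   // always in 0 .. m-1, even for negative k
    }

    // Counts how many different table slots a set of keys occupies.
    static int distinctSlots(int[] keys, int m) {
        boolean[] used = new boolean[m];
        int count = 0;
        for (int k : keys) {
            int i = hash(k, m);
            if (!used[i]) {
                used[i] = true;
                count++;
            }
        }
        return count;
    }
}
```

To see why the choice of m matters, take keys that are all multiples of 4. With m = 16 they can only reach slots 0, 4, 8 and 12. With the prime m = 17 they spread over every slot.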

  18. Division method • Primes are fairly dense, so this is no great restriction on the table size. • In fact, we can nearly double the table size at each step and still use a prime: 31, 61, 127, 251, 509, 1021, 2039, … • Store these values in a table; don’t try to compute them on the fly.
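
The stored table of primes could be used as in the sketch below. Only the seven primes listed on the slide are included. The class name `PrimeSizes` and the method `nextSize` are made up for this example.

```java
// Sketch: pick the next table size from a precomputed list of primes.
// PrimeSizes and nextSize are illustrative names; the values are those on the slide.
final class PrimeSizes {
    private static final int[] PRIMES = {31, 61, 127, 251, 509, 1021, 2039};

    // Returns the smallest tabulated prime strictly greater than currentSize.
    static int nextSize(int currentSize) {
        for (int p : PRIMES) {
            if (p > currentSize) {
                return p;
            }
        }
        throw new IllegalStateException("prime table exhausted");
    }
}
```

A real table would continue the sequence well past 2039; this sketch throws once the listed values run out.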

  19. Multiplication Method • Another hash function is h(k) = floor( m ( k r mod 1) ) where 0 < r < 1 is cleverly chosen. • Advantage: the choice of m is not critical. • Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M are very evenly distributed over [0,1]. • Of course, there is a little problem here.
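
A floating-point sketch of the multiplication method is given below. The slide does not say which r to use; the value (√5 − 1)/2 is an assumption made here because it is a common choice. The class name `MultiplicationHash` is illustrative, and keys are assumed to be non-negative.

```java
// Sketch of the multiplication method h(k) = floor(m * (k*r mod 1)).
// MultiplicationHash is an illustrative name; R is an assumed choice of r.
final class MultiplicationHash {
    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;   // about 0.618

    // Assumes k >= 0 and m >= 1.
    static int hash(int k, int m) {
        double product = k * R;
        double fraction = product - Math.floor(product);    // (k*r) mod 1, in [0, 1)
        return (int) (m * fraction);                         // floor, since the value is >= 0
    }
}
```

A `double` can only approximate an irrational r, which may be the "little problem" the slide alludes to. In practice the approximation is usually good enough, or the computation is done in fixed-point integer arithmetic instead.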

  20. Random Input • Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can take simply a few bits from the input (say, the first or last 16 bits). • However, such a method would fail miserably if the input shows some regularity. No good for general use.
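
Taking a few bits from the key is a one-line operation, as the sketch below shows for 32-bit integer keys. The class and method names are illustrative.

```java
// Sketch: use 16 bits of a 32-bit key directly as the hash value.
// BitSliceHash, low16 and high16 are illustrative names.
final class BitSliceHash {
    static int low16(int key)  { return key & 0xFFFF; }  // last 16 bits
    static int high16(int key) { return key >>> 16; }    // first 16 bits
}
```

The weakness with regular input is easy to reproduce. If every key is a multiple of 65536 (for instance, addresses aligned to 64 KiB), `low16` returns 0 for all of them and every key collides.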

  21. Integer keys? • The assumption that objects in U are integers has to be taken with a grain of salt. • Often we have to massage things a bit to extract numbers. • Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.

  22. Example: Strings

      public int hashCode(String key, int m) {
          int h = 0;
          for (int i = 0; i < key.length(); i++)
              h = 37 * h + key.charAt(i);   // 37 is magic number
          h %= m;
          if (h < 0)                        // overflow?
              h += m;
          return h;
      }

      This is really an interpretation of the string as a number in base 37 (not ordinary radix notation, though).

  23. Hash functions • Desired properties • Approximates a random distribution • Over the range of table index values • Efficient calculation • Approaches • Modular arithmetic • Many • Perfect hashing • When full set of input keys known in advance

  24. Next time: Collisions
