1 / 19

Hash Tables and Associative Containers

Hash Tables and Associative Containers . CS-212 Dick Steflik. Hash Tables. a hash table is an array of size Tsize has index positions 0 .. Tsize-1 two types of hash tables open hash table array element type is a <key, value> pair all items stored in the array chained hash table

howell
Télécharger la présentation

Hash Tables and Associative Containers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hash Tables and Associative Containers CS-212 Dick Steflik

  2. Hash Tables • a hash table is an array of size Tsize • has index positions 0 .. Tsize-1 • two types of hash tables • open hash table • array element type is a <key, value> pair • all items stored in the array • chained hash table • element type is a pointer to a linked list of nodes containing <key, value> pairs • items are stored in the linked list nodes • keys are used to generate an array index • home address (0 .. Tsize-1)

  3. Faster Searching • "balanced" search trees guarantee O(log2 n) search path by controlling height of the search tree • AVL tree • 2-3-4 tree • red-black tree (used by STL associative container classes) • hash table allows for O(1) search performance • search time does not increase as n increases

  4. Hash Table • a hash table is an array/vector (fixed size) • has index positions 0 .. Tsize-1 • if we could use the keys as an index we would have O(1) retrieval • hashTable[key] • keys are used to generate an array index • home address (0 .. Tsize-1) • function to do this is called a hash function • hash(key) returns an int value • hash(key) % Tsize => 0 . . Tsize - 1

  5. Collisions • Collisions occur whenever two keys produce the same index (hash to the same location • Design goal: pick a hash function that produces no collisions • Away of life with hash tables • What do you do? • linear probing: check the next location, if its empty use it • quadratic probing: check next, then 2 away, then 4 away......

  6. a Hash Table of size 7 some insertions: hash(K1) % 7 => 3 hash(K2) % 7 => 5 hash(K3) % 7 => 2 hash(K4) % 7 => 3 hash(K5) % 7 => 2 hash(K6) % 7 => 4 0 1 2 3 4 5 6 T T T T T T T linear probe open addressing collision resolution strategy key value empty

  7. 0 1 2 3 4 5 6 F K6 K6info T F K3 K3info F K1 K1info F K4 K4info F K2 K2info F K5 K5info Search Performance average number of probes needed to retrieve the value with key K? K home address #probes K1 3 K2 5 K3 2 K4 3 K5 2 K6 4 1 1 1 2 5 4 14/6 = 2.33 (successful) unsuccessful search?

  8. 0 1 2 3 4 5 6 K3 K3info K5 K5info K1 K1info K4 K4info K6 K6info K2 K2info Chaining with Separate Lists hash(K1) % 7 => 3 hash(K2) % 7 => 5 hash(K3) % 7 => 2 hash(K4) % 7 => 3 hash(K5) % 7 => 2 hash(K6) % 7 => 4 linked lists of synonyms

  9. 0 1 2 3 4 5 6 K3 K3info K5 K5info K1 K1info K4 K4info K6 K6info K2 K2info Search Performance average number of probes needed to retrieve the value with key K? K home address #probes K1 3 K2 5 K3 2 K4 3 K5 2 K6 4 1 1 1 2 2 1 8/6 = 1.33 (successful) unsuccessful search?

  10. Where are Hash Tables used? • Databases • Spelling checkers • Java uses them all over the place (built into the language) • most scripting languages (ASP, PERL, PHP) have associative arrays • Caching Schemes • software – browsers, http proxy servers, DNS servers • hardware – memory caching, instruction caching

  11. Deletions? • search for item to be deleted • chained hash table • delete a node from a linked list • open hash table • just mark spot as "empty"? • must mark vacated spot as “deleted” • is different than “empty”

  12. Hash Functions • a hash function is used to map a key to an array index (home address) • search starts from here • insert, retrieve, delete all start by applying the hash function to the key • goals for a hash function • fast to compute • even distribution over the entire collection of keys • all hash functions produce collisions • multiple keys hash to same home address

  13. Some Hash Functions... • Division • works good in most cases as long as keys are relatively random • H(key) = key mod m • if key is an integer • identity function ( return key) • good if keys are random • not good if keys have similar characteristics • ex m = 25 • all keys divisible by 5 would map into positions 0, 5,10,15… causing clustering around those values

  14. more Hash functions... • Mid-Squared • produces a nearly random distribution of indices • mid-square technique takes longer to compute but gives better distribution when keys may have some digits in common • convert key to an octal string • A-Z = 018 - 328 and 0-9 = 338 - 448 • ex key = A1 = 1348 • 1348 * 1348 = 204208 • using a table of 1024 elements • 0010001000100002 use middle 10 bits as the index • index = 10001000102 = 54610 • note - most collisions will occur for short identifiers

  15. more Hash functions... • Digit Folding • assume a 5 digit decimal string (digits 0-9 only) • H(key) = d1 + d2 + d3 + d4 + d5 (sum of digits) • this would yield 0 <= h <= 45 for all possible keys • if we were to fold the digits in pairs • H(key) = d1d2 + d3d4 + d5 • 0 <= h <= 207 (99 + 99 + 9) • Double hashing • use two (or more) hash functions serially • helps overcome effects of a function that produces a poor distribution of keys

  16. Clustering • Undesireable function of the hash function selected and the collision resolution strategy • too many keys has to the same location causing long string of keys that need to be searched • especially bad using a divide based function and using linear probing • insertion/deletion/search can approach O(n) • Pick a different hash function • Pick a different collision resolution strategy

  17. Factors Affecting Search Performance • quality of hash function • how uniform? • depends on actual data • collision resolution strategy used • load factor of the HashTable • N/Tsize • the lower the load factor the better the search performance

  18. Successful Search Performance open addressing open addressing chaining (linear probing) (double hashing) load factor 0.5 1.50 1.39 1.25 0.7 2.17 1.72 1.35 0.9 5.50 2.56 1.45 1.0 ---- ---- 1.50 2.0 ---- ---- 2.00

  19. Summary of Hash tables • search speed depends on load factor and quality of hash function • should be less than .75 for open addressing • can be more than 1 for chaining • items not kept sorted by key • very good for fast access to unordered data with known upper bound • to pick a good TSize

More Related