Efficient Data Storage and Retrieval Using Maps

Chapter 9 Maps and Dictionaries

A basic problem • We have to store some records and perform the following: • add new record • delete record • search a record by key • Find a way to do these efficiently!

Map (Dictionary) • A Map is a collection of pairs of the form (k,e) where • K is a key • E is the element associated with the key • Operations allowed on a map • Get the element associated with a particular key • Insert an element with a specific key • Delete an element with a specific key

Array as table studid name score 519394413 andy 81.5 838203980 betty 90 643930281 david 56.8 ... 93429382 peter 20 43920398 mary 100 ... 234903893 tom 73 129384343 bill 49 Consider this problem. We want to store 1000 student records and search them by their social security number.

Array as table studid name score 0 One approach would be to store the records in an array (index 0..9999999). The index is used as the student id, i.e. the record of the student with id 0012345 is stored at A[12345] : : : 129384343 bill 49 : : : 519394413 andy 81.5 : : : 643930281 david 56.8 : : : : : : 838203980 betty 90 : : : 1000000000

Array as table • Store the records in an array where the index corresponds to the key • add - very fast O(1) • delete - very fast O(1) • search - very fast O(1) • Problems?

Typical Solution • Normally it is not possible to make a table large enough to hold the values associated with all possible keys • A table used to lookup student names might only contain a few thousand entries • This means • The table length will be less than the key range

Hashing • Two components • Table (can be accessed randomly by position) • Hash Function  h(key) • Maps keys into positions in the table • Generally speaking, the domain of h << the range of h • In other words, h maps lots of possible keys to just a (relatively) few possible outputs. • Basic Idea • Element with the key k is stored in position h(k) of the table

Hash function int Hash(key) Imagine that we have a magic function that we’ll call “Hash”. It maps the key (student ssnum) of the 1000 records into the integers 0..999, one to one. No two different keys maps to the same number. H(‘001233445’) = 134 H(‘003334333’) = 67 H(‘230056789’) = 764 … H(‘769908080’) = 3

Hash table studid name score 0 To store a record, we compute Hash(ssnum) for the record and store it at the location Hash(ssnum) of the array. To search for a student, we only need to peek at the location Hash(ssnum). : : : 3 990803480 bill 49 : : : 67 003334333 betty 90 : : : 134 230012345 andy 81.5 : : : 764 330056789 david 56.8 : : : 999 : : :

Hash table with Perfect Hash • Such magic function is called perfect hash • add - very fast O(1) • delete - very fast O(1) • search - very fast O(1) • But it is generally difficult to design perfect hash. (for example, when the potential key space is large) A hash function should try to mix the information in the key and convert it to an index within the range of the hash table (the hash address).

Hash function • A hash function maps a key to an index within a particular range • A good hash function: • Provides a one-to-one mapping between table locations and keys. • Is easy and quick to compute. • Achieves an even distribution of the keys that actually occur over the locations in the table. • It is often difficult to come up with a good hash function. • May require a mathematician or statistical analysis of the expected keys.

Phone Numbers • Consider using a hash function to lookup the name of a faculty member given their BSU phone number • How would you do it?

Common Hash Functions • Division (very common): • H(key) = key % hashTable.length • Using a prime number for a modulus usually has the effect of spreading the keys quite uniformly • Truncation: • Ignore part of the key • A good example is using the last 4 digits of a SSN • Folding: • Partition the key into several parts and combine the parts in some way

Collisions • Generally speaking, we cannot avoid collisions • Collision resolution – what do we do when two different keys map to the same index? H(‘0012345’) = 134 H(‘0033333’) = 67 H(‘0056789’) = 764 … H(‘9903030’) = 3 H(‘9908080’) = 3

Linear Probing • When a collision occurs just look for the next available slot • To find an element in a table • Hash to its position • If it is not there look at successive slots until • You find it • You hit an empty slot • You return to the original slot • Removing an element is tricky

Linear Probing 0 1 2 3 4 5 6 7 8 9

Linear Probing 0 1 2 3 4 5 6 7 8 9 add( 10 )

Linear Probing 0 1 2 3 4 5 6 7 8 9 add( 3 ) COLLISION!!

Linear Probing 0 1 2 3 4 5 6 7 8 9 COLLISION!! add( 100 )

Analysis • The major drawback of linear probing is clustering • Values tend to clump up in the table • Thus the sequential searches required to find an element become longer and longer • Possible solutions • Pseudo-random probing • Quadratic probing ( h+1, h+4, h+9, …) • Probe at ( h + i2 ) mod table.Length • Key dependent increments • For example use the first digit as an increment

Chained Hash Table One way to handle collision is to store the collided records in a linked list. The array now stores pointers to such lists. If no key maps to a certain hash value, that array entry points to nil. 0 1 nil 2 nil 3 4 nil 5 : Key: 9903030 name: tom score: 73 HASHMAX nil

Chaining 0 1 2 3 4 5 6 7 8 9

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 10 )

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 23 ) 23

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 3 ) 3 23

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 56 ) 3 23 56

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 44 ) 3 23 44 56

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 14 ) 3 23 14 44 56

Chaining 0 1 2 3 4 5 6 7 8 9 10 add( 93 ) 93 3 23 14 44 56

Chaining 0 1 2 3 4 5 6 7 8 9 10 81 add( 81 ) 93 3 23 14 44 56

Chaining 0 1 2 3 4 5 6 7 8 9 10 81 72 add( 72 ) 93 3 23 14 44 56

Chaining 0 1 2 3 4 5 6 7 8 9 100 10 81 72 add( 100 ) 93 3 23 14 44 56

Analysis • Advantages of chaining • Space savings if items are large • Simple and efficient collision handling • Deleting items is very easy • Disadvantages of chaining • Links take up space • As chains increases in length search time takes longer

Chained Hash table • Hash table, where collided records are stored in linked list • good hash function, appropriate hash size • Few collisions. Add, delete, search very fast O(1) • otherwise… • some hash value has a long list of collided records.. • add - just insert at the head fast O(1) • delete a target - delete from unsorted linked list slow O(n) • search - sequential search slowO(n)

hashCode() • All Java objects have a method named hashCode() (defined in class Object) • By default hashCode() returns a value based on the address in memory where the object is stored • General rules for implementing hashCode() • When invoked more than once on the same object it must return the same value each time. • If o1.equals(o2)then o1.hashCode() must be equal to o2.hashCode(). • Note that it is not required that if o1.equals(o2) is false that o1.hashCode() != o2.hashCode()

Efficient Data Storage and Retrieval Using Maps

Efficient Data Storage and Retrieval Using Maps

Presentation Transcript

Chapter 9

CHAPTER 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

CHAPTER 9

Chapter 9

Chapter 9

Chapter 9