Hashing

Idea • Used to perform insertions, deletions, and searches in constant average time. • Ideal hash table data structure: a fixed-size array containing items. • The hash table is filled with ‘items’, and a search is typically performed on one part of an item, called the key. • TableSize is the number of cells in the hash table. The indices of the table typically run from 0 to TableSize-1. • Each key is mapped to some number in the range 0 to TableSize-1 using a hash function, and the item is then placed in the corresponding cell.

Hash Function Basics • Should ideally be simple to compute. • Should ideally ensure any two distinct keys are mapped to different cells. • Because there is a finite number of cells but a practically unlimited number of possible keys, collisions cannot be ruled out, so it is important to choose a hash function that distributes the keys evenly.

Limitations and Common Problems Associated with Hashing • Operations that require ordering information among elements are not supported efficiently. • Ex. findMax, findMin • Ex. Printing an entire table in sorted order in linear time • Choosing a hashing function • Dealing with Collision • Choosing a table size

Hash Functions: Examples. Case 1: Keys are integers. hash(x) = x mod TableSize. This is a reasonable choice unless the keys have properties that cause problems. Ex. TableSize = 10 and all keys end in zero: every key maps to cell 0. As a general rule, it is beneficial to choose a PRIME TableSize.
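A minimal Python sketch of Case 1, assuming a hypothetical helper `hash_int` and made-up keys that all end in zero. It compares TableSize = 10 with a prime TableSize of 11 to show why the prime is the safer choice here.

```python
def hash_int(key: int, table_size: int) -> int:
    """Case 1 hash: key mod TableSize."""
    return key % table_size


# Illustrative keys (not from the slides), all ending in zero.
keys = [10, 20, 30, 40, 50, 60]

# Distinct cells actually used for each table size.
cells_size_10 = sorted({hash_int(k, 10) for k in keys})
cells_size_11 = sorted({hash_int(k, 11) for k in keys})

print(cells_size_10)  # every key lands in the same cell
print(cells_size_11)  # the keys spread over six different cells
```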

Hash Functions: Examples Cont. Case 2: Keys are Strings (most common). Solution 1: hash(x) = the sum of the ASCII (or Unicode) values of the characters in the String, mod TableSize. A problem occurs if the table size is large: the function does not distribute the keys well.

Hash Functions: Examples Cont. (Solution 1 cont.) Consider TableSize = 10,007 (prime) and keys of 8 or fewer characters. Since an ASCII character has a value of at most 127, adding all the character values of a key (the hash function) can only map to a cell between 0 and 1,016 (127 * 8), so most of the table is never used.
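The range limit of Solution 1 can be checked with a short Python sketch. The function name `hash_ascii_sum` and the sample keys are assumptions made for illustration.

```python
def hash_ascii_sum(key: str, table_size: int) -> int:
    """Solution 1: add up the character codes, then reduce mod TableSize."""
    return sum(ord(ch) for ch in key) % table_size


TABLE_SIZE = 10_007

# Largest possible sum for an 8-character ASCII key.
largest_sum = 127 * 8
print(largest_sum)

# Sample keys (illustrative): every result stays far below TABLE_SIZE.
for word in ["hash", "table", "probing", "cluster"]:
    print(word, hash_ascii_sum(word, TABLE_SIZE))

# Keys with the same letters in a different order always collide.
print(hash_ascii_sum("abc", TABLE_SIZE) == hash_ascii_sum("cba", TABLE_SIZE))
```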

Hash Functions: Examples Cont. Solution 2: Assume that the key has at least 3 characters, then apply a hash function that examines only the first three characters. If those characters were random and the table size is 10,007, a reasonably even distribution would be expected, since there are 26^3 = 17,576 possible combinations. A problem occurs because the English language is NOT random: only about 2,851 of those combinations actually occur, so at most about 28% of the table can be hashed to.

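Hash Functions: Examples Cont. Solution 3: Uses all characters in the key, computing hash(x) as a polynomial in the character values, evaluated with Horner’s Rule (the formula itself is missing from this transcript). Must include a method for bringing the cell number back into range after the calculation, as negative numbers may result from integer overflow.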

Hash Functions: Examples Cont. (Solution 3 cont.) Typically expected to distribute well, but sometimes has problems. Extremely simple. Reasonably fast. Code for this function can be found on page 172.
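A hedged Python sketch of a Horner's-rule string hash in the spirit of Solution 3. It is not the code from page 172: the name `hash_horner` and the multiplier 37 are assumptions, and because Python integers do not overflow, the result is reduced mod TableSize at every step instead of being corrected for negative values afterwards.

```python
def hash_horner(key: str, table_size: int, multiplier: int = 37) -> int:
    """Polynomial hash over all characters, evaluated with Horner's rule.

    multiplier=37 is an assumed constant, not taken from the slides.
    """
    hash_val = 0
    for ch in key:
        # Horner step: shift the running polynomial, then add the next term.
        hash_val = (multiplier * hash_val + ord(ch)) % table_size
    return hash_val


TABLE_SIZE = 10_007
print(hash_horner("ab", TABLE_SIZE))
print(hash_horner("ba", TABLE_SIZE))  # character order now matters
```

Unlike the plain character sum, this function gives different values to keys that contain the same letters in a different order.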

Hash Functions: Examples Cont. Solution 4: Use only a select number of characters, chosen according to the length and properties of the key. If the keys are long, the time taken to compute the hash function is reduced significantly, at the cost of only a slightly less even distribution. Ex. Use only a few characters from the street name, city, and zip code of a key that is a complete address. Ex. Use only the characters in the odd positions.

Collision Resolution • Collision occurs when more than one key is mapped to the same cell in a hash table. • Multiple strategies to resolve collision problems exist: • Separate Chaining • Linear Probing • Random Collision Resolution • Quadratic Probing • Double Hashing

Collision Resolution: Separate Chaining. Uses linked lists to store multiple keys that hash to the same value. Slows down the algorithm because of the time taken to allocate new cells, as well as the need for a second data structure.
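A minimal sketch of separate chaining in Python, under two assumptions: the class name `ChainedHashTable` is hypothetical, and ordinary Python lists stand in for the linked lists mentioned on the slide.

```python
class ChainedHashTable:
    """Hash table of integer keys that resolves collisions by chaining."""

    def __init__(self, table_size: int = 11) -> None:
        self.table_size = table_size
        # One chain per cell; a Python list stands in for a linked list.
        self.chains = [[] for _ in range(table_size)]

    def _index(self, key: int) -> int:
        return key % self.table_size

    def insert(self, key: int) -> None:
        chain = self.chains[self._index(key)]
        if key not in chain:  # ignore duplicates
            chain.append(key)

    def contains(self, key: int) -> bool:
        return key in self.chains[self._index(key)]

    def remove(self, key: int) -> None:
        chain = self.chains[self._index(key)]
        if key in chain:
            chain.remove(key)


table = ChainedHashTable(table_size=10)
for key in [89, 18, 49, 58, 69]:
    table.insert(key)

print(table.chains[8])  # keys that hash to cell 8
print(table.chains[9])  # keys that hash to cell 9
```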

Load Factor • The load factor of a hash table, λ, is the ratio of the number of elements in the hash table to the table size. • For a hash table that does not use linked lists, the load factor should ideally be kept at or below about λ = 0.5. • Those tables are called probing hash tables.

Collision Resolution: Linear Probing. Idea: If a collision occurs, try alternative cells in the hash table until an empty cell is found. Implementation: Try h0(x), h1(x), h2(x), … in succession, where hi(x) = (hash(x) + f(i)) mod TableSize. f is known as the “collision resolution strategy”. In linear probing, f is a linear function of i, typically f(i) = i.

Linear Probing Example Let hash(x) = x mod 10 f(i) = i TableSize = 10 hi(x) = (hash(x) + f(i)) mod TableSize Consider the group of keys {89, 18, 49, 58, 69} Map the keys to the hash table with the above hashing function, using linear probing to resolve any collisions. Map 89: h0(89) = (89 mod 10 + 0) mod 10 = (9 + 0) mod 10 = 9 mod 10 = 9 -> 89 maps to cell 9 in the hash table

Notice the wraparound quality of the key placement. • For the purposes of this and the following examples, the table size is not prime, though that is typically ideal.

(Linear Probing Example Cont.) Map 18: h0(18) = (18 mod 10 + 0) mod 10 = 8 -> 18 maps to cell 8. A collision occurs when mapping 49, and linear probing must be utilized. Map 49: h0(49) = (49 mod 10 + 0) mod 10 = (9 + 0) mod 10 = 9 -> already taken by 89. h1(49) = (49 mod 10 + 1) mod 10 = (9 + 1) mod 10 = 10 mod 10 = 0 -> 49 maps to cell 0
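The linear probing example can be carried through for all five keys with a short Python sketch. The helper name `insert_all` is hypothetical; the hash function, table size, and keys are the ones given in the example.

```python
def insert_all(keys, table_size, f):
    """Insert keys with open addressing: hi(x) = (hash(x) + f(i)) mod TableSize."""
    table = [None] * table_size
    for x in keys:
        for i in range(table_size):
            cell = (x % table_size + f(i)) % table_size
            if table[cell] is None:  # empty cell found
                table[cell] = x
                break
        else:
            raise RuntimeError(f"no empty cell found for key {x}")
    return table


# Linear probing: f(i) = i
linear_table = insert_all([89, 18, 49, 58, 69], 10, lambda i: i)
print(linear_table)
# [49, 58, 69, None, None, None, None, None, 18, 89]
```

Keys 58 and 69 both wrap around past cell 9 and end up in cells 1 and 2, next to 49, which is the start of a cluster.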

Collision Resolution: Linear Probing Cont. Prone to primary clustering: blocks of occupied cells start forming even if the table is relatively empty. Any key that experiences a collision by being hashed into a cluster takes several attempts to find an empty cell. Expected number of probes: • Insertions and unsuccessful searches: ½(1 + 1/(1-λ)^2) • Successful searches: ½(1 + 1/(1-λ))

Collision Resolution: Random Collision Resolution. If clustering is not a problem: - Assume a very large table size - Assume each probe is independent of previous probes. These assumptions are satisfied by a random collision resolution strategy and are reasonable unless λ is very close to 1. Expected number of probes for an unsuccessful search: 1/(1-λ), where (1-λ) is the fraction of cells that are empty.

Collision Resolution: Random Collision Resolution Cont. An element is inserted as the result of an unsuccessful search. Therefore, we can use the ‘cost’ of an unsuccessful search to compute the ‘cost’ of a successful one. Since λ grows from 0 to its current value as elements are inserted, earlier insertions are cheaper. The estimated cost of an average insertion is the unsuccessful-search cost averaged over load factors from 0 to λ. Clearly better results than linear probing.
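The averaging step can be written out as follows. This is a reconstruction from the unsuccessful-search cost 1/(1-λ) rather than a formula preserved in the transcript, and the symbol I(λ) for the average insertion cost is an assumed name.

```latex
I(\lambda) \;=\; \frac{1}{\lambda}\int_{0}^{\lambda}\frac{1}{1-x}\,dx
         \;=\; \frac{1}{\lambda}\,\ln\frac{1}{1-\lambda}
```

For example, at λ = 0.5 this gives 2 ln 2 ≈ 1.39 probes, compared with ½(1 + 1/(1-0.5)) = 1.5 for a successful search under linear probing.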

Collision Resolution: Random Collision Resolution Cont. Figure: performance differences between Linear Probing and Random Collision Resolution. The x-axis represents λ, and the y-axis represents the number of probes. Linear probing is represented by the dashed lines.
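Since the plot itself is not part of this transcript, the comparison can be approximated numerically. The sketch below tabulates the expected-probe formulas for a few load factors; the function names are hypothetical, and the successful-search formula for the random strategy, (1/λ) ln(1/(1-λ)), is the standard result and is assumed here rather than taken from the slides.

```python
import math


def linear_unsuccessful(lam: float) -> float:
    """Linear probing: insertions and unsuccessful searches."""
    return 0.5 * (1 + 1 / (1 - lam) ** 2)


def linear_successful(lam: float) -> float:
    """Linear probing: successful searches."""
    return 0.5 * (1 + 1 / (1 - lam))


def random_unsuccessful(lam: float) -> float:
    """Random collision resolution: insertions and unsuccessful searches."""
    return 1 / (1 - lam)


def random_successful(lam: float) -> float:
    """Random collision resolution: successful searches (assumed formula)."""
    return (1 / lam) * math.log(1 / (1 - lam))


print("lambda  lin-unsucc  rand-unsucc  lin-succ  rand-succ")
for lam in (0.25, 0.5, 0.75, 0.9):
    print(
        f"{lam:6.2f}  {linear_unsuccessful(lam):10.2f}  "
        f"{random_unsuccessful(lam):11.2f}  {linear_successful(lam):8.2f}  "
        f"{random_successful(lam):9.2f}"
    )
```

The gap between the two strategies is small at low load factors and widens quickly as λ approaches 1.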

Collision Resolution: Quadratic Probing. Idea: Same as Linear Probing, only a quadratic collision resolution strategy is used instead. The popular choice is f(i) = i^2. Eliminates the primary clustering problem of linear probing. A problem occurs once the table is more than half full: there is no guarantee of finding an empty cell. If the table size is not prime, the table does not even have to be half full before problems arise.

Quadratic Probing Example Let hash(x) = x mod 10, f(i) = i^2, TableSize = 10, hi(x) = (hash(x) + f(i)) mod TableSize. Consider the group of keys {89, 18, 49, 58, 69}. Map the keys to the hash table with the above hashing function, using quadratic probing to resolve any collisions. Map 89: h0(89) = (89 mod 10 + 0^2) mod 10 = 9 -> 89 maps to cell 9 in the hash table. Map 18: h0(18) = (18 mod 10 + 0^2) mod 10 = 8 -> 18 maps to cell 8

(Quadratic Probing Example Cont.) A collision occurs when mapping 49, and quadratic probing must be utilized. 49 maps to cell 0. Another collision occurs when mapping 58. Map 58: h0(58) = (58 mod 10 + 0^2) mod 10 = 8 -> already taken by 18. h1(58) = (58 mod 10 + 1^2) mod 10 = 9 -> already taken by 89. h2(58) = (58 mod 10 + 2^2) mod 10 = (8 + 4) mod 10 = 2 -> 58 maps to cell 2
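A short Python sketch that completes the quadratic probing example for all five keys. The helper name `insert_quadratic` is hypothetical; everything else follows the definitions in the example.

```python
def insert_quadratic(keys, table_size):
    """Open addressing with f(i) = i^2: hi(x) = (hash(x) + i^2) mod TableSize."""
    table = [None] * table_size
    for x in keys:
        for i in range(table_size):
            cell = (x % table_size + i * i) % table_size
            if table[cell] is None:  # empty cell found
                table[cell] = x
                break
        else:
            raise RuntimeError(f"no empty cell found for key {x}")
    return table


quadratic_table = insert_quadratic([89, 18, 49, 58, 69], 10)
print(quadratic_table)
# [49, None, 58, 69, None, None, None, None, 18, 89]
```

Key 69 collides at cells 9 and 0, then lands in cell 3 on the probe with i = 2.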

Collision Resolution: Quadratic Probing Cont. In quadratic probing, at most half of the table can be used as alternative locations to resolve collisions. Theorem: If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty. Theorem and Proof on page 180. It is crucial that the hash table have a prime table size. If not, the number of alternate locations is reduced significantly. Elements that hash to the same position will probe to the same alternative cells. This is called secondary clustering. Secondary clustering generally causes less than an extra half probe per search.
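The effect of a prime table size on the number of alternative locations can be checked directly. The sketch below counts the distinct cells that quadratic probing can reach from one home cell; the function name and the two table sizes (11 and 16) are illustrative assumptions.

```python
def quadratic_probe_cells(home: int, table_size: int) -> set:
    """All distinct cells visited by (home + i^2) mod table_size."""
    return {(home + i * i) % table_size for i in range(table_size)}


prime_cells = quadratic_probe_cells(0, 11)      # prime table size
non_prime_cells = quadratic_probe_cells(0, 16)  # non-prime table size

print(sorted(prime_cells), len(prime_cells))
print(sorted(non_prime_cells), len(non_prime_cells))

# With the prime size, the first six probes are all different cells.
first_probes = [(i * i) % 11 for i in range(6)]
print(first_probes)
```

With TableSize = 11 the probe sequence reaches 6 of the 11 cells, roughly half the table, while with TableSize = 16 it reaches only 4 of 16.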

Collision Resolution: Double Hashing. Idea: Same basic concept as Linear and Quadratic Probing, only the collision resolution strategy takes the form f(i) = i * hash2(x). When a collision occurs, a second hash function is applied to x, and the table is probed at a distance of hash2(x), then 2 * hash2(x), etc. A good choice of hash2(x) is hash2(x) = R – (x mod R), where R is a prime less than TableSize.

Double Hashing Example Let hash(x) = x mod 10, f(i) = i * hash2(x), hash2(x) = 7 – (x mod 7), TableSize = 10, hi(x) = (hash(x) + f(i)) mod TableSize. Consider the group of keys {89, 18, 49, 58, 69}. Map the keys to the hash table with the above hashing function, using double hashing to resolve any collisions. Map 89: h0(89) = (89 mod 10 + 0 * (7 – 89 mod 7)) mod 10 = 9 -> 89 maps to cell 9 in the hash table. Map 18: h0(18) = (18 mod 10 + 0 * (7 – 18 mod 7)) mod 10 = 8 -> 18 maps to cell 8

(Double Hashing Example Cont.) A collision occurs when mapping 49, and double hashing must be utilized. Map 49: h0(49) = (49 mod 10 + 0 * (7 – 49 mod 7)) mod 10 = 9 -> already taken by 89. h1(49) = (49 mod 10 + 1 * (7 – 49 mod 7)) mod 10 = (9 + 1 * (7 – 0)) mod 10 = (9 + 7) mod 10 = 6 -> 49 maps to cell 6
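The double hashing example can be completed for all five keys with the following Python sketch. The helper names are hypothetical; hash2 uses R = 7 as in the example.

```python
def hash2(x: int, r: int = 7) -> int:
    """Second hash function from the example: R - (x mod R), with R = 7."""
    return r - (x % r)


def insert_double(keys, table_size):
    """Open addressing with f(i) = i * hash2(x)."""
    table = [None] * table_size
    for x in keys:
        step = hash2(x)
        for i in range(table_size):
            cell = (x % table_size + i * step) % table_size
            if table[cell] is None:  # empty cell found
                table[cell] = x
                break
        else:
            raise RuntimeError(f"no empty cell found for key {x}")
    return table


double_table = insert_double([89, 18, 49, 58, 69], 10)
print(double_table)
# [69, None, None, 58, None, None, 49, None, 18, 89]
```

Key 58 has hash2(58) = 5 and moves from cell 8 to cell 3; key 69 has hash2(69) = 1 and moves from cell 9 to cell 0.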

Collision Resolution: Double Hashing Cont. A poor choice of hash2(x) could be disastrous. It is important to be sure all cells can be probed. This is not guaranteed when the table size is not prime. If correctly implemented, double hashing can have an expected number of probes comparable to that of the Random Collision Resolution Strategy. Quadratic probing, despite secondary clustering, is still likely to be simpler and faster in practice because it does not require time to process the second hash function.
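To illustrate why the table size matters for double hashing, the sketch below lists every cell a key can reach for a given step size. The function name `reachable_cells` is hypothetical; the step of 5 is the value of hash2(58) from the example above, and TableSize = 11 is an assumed prime for comparison.

```python
def reachable_cells(home: int, step: int, table_size: int) -> list:
    """Cells visited by (home + i * step) mod table_size, for all i."""
    return sorted({(home + i * step) % table_size for i in range(table_size)})


# Key 58 in the example table: home cell 8, step hash2(58) = 5, TableSize = 10.
cells_size_10 = reachable_cells(8, 5, 10)
print(cells_size_10)  # only two cells can ever be probed

# Same home cell and step with a prime table size.
cells_size_11 = reachable_cells(8, 5, 11)
print(cells_size_11)  # every cell can be probed
```

Because 5 divides 10, the probe sequence keeps alternating between two cells; if both were occupied, the insertion would fail even though the rest of the table is empty.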
