
Data Structures and Algorithms






  1. Data Structures and Algorithms Course slides: Hashing www.mif.vu.lt/~algis

  2. Data Structures for Sets • Many applications deal with sets. • Compilers have symbol tables (set of vars, classes) • Dictionary is a set of words. • Routers have sets of forwarding rules. • Web servers have set of clients, etc. • A set is a collection of members • No repetition of members • Members themselves can be sets

  3. Data Structures for Sets • Examples • Set of first 5 natural numbers: {1,2,3,4,5} • {x | x is a positive integer and x < 100} • {x | x is a CA driver with > 10 years of driving experience and 0 accidents in the last 3 years}

  4. Set Operations • Unary operation: min, max, sort, makenull, …

  5. Observations • Set + Operations define an ADT. • A set + insert, delete, find • A set + ordering • Multiple sets + union, insert, delete • Multiple sets + merge • Etc. • Depending on type of members and choice of operations, different implementations can have different asymptotic complexity.

  6. Dictionary ADTs • Maintain a set of items with distinct keys with: • find (k): find item with key k • insert (x): insert item x into the dictionary • remove (k): delete item with key k • Where do we use them: • Symbol tables for compiler • Customer records (access by name) • Games (positions, configurations) • Spell checkers • Peer to Peer systems (access songs by name), etc.

  7. Naïve Implementations • The simplest possible scheme to implement a dictionary is a “log file” or “audit trail”. • Maintain the elements in a linked list, with insertions occurring at the head. • The search and delete operations require searching the entire list in the worst case. • Insertion is O(1), but find and delete are O(n). • A sorted array does not help, even with ordered keys. The search becomes fast, but insert/delete take O(n).

  8. Hash Tables: Intuition • Hashing is a function that maps each key to a location in memory. • A key’s location does not depend on other elements, and does not change after insertion. • unlike a sorted list • A good hash function should be easy to compute. • With such a hash function, the dictionary operations can be implemented in O(1) time.

  9. Hash Tables: Intuition • Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements are assigned keys from the set of small natural numbers. That is, U ⊂ Z+ and |U| is relatively small. • If no two elements have the same key, then this dictionary can be implemented by storing its elements in the array T[0, ..., |U| - 1]. This implementation is referred to as a direct-access table since each of the requisite DICTIONARY ADT operations - Search, Insert, and Delete - can always be performed in Θ(1) time by using a given key value to index directly into T, as shown:

  10. Hash Tables: Intuition • The obvious shortcoming associated with direct-access tables is that the set U rarely has such "nice" properties. In practice, |U| can be quite large. This will lead to wasted memory if the number of elements actually stored in the table is small relative to |U|. • Furthermore, it may be difficult to ensure that all keys are unique. Finally, a specific application may require that the key values be real numbers, or some symbols which cannot be used directly to index into the table. • An effective alternative to direct-access tables are hash tables. A hash table is a sequentially mapped data structure that is similar to a direct-access table in that both attempt to make use of the random-access capability afforded by sequential mapping.

  11. Hash Tables: Intuition

  12. Hash Tables • All search structures so far • Relied on a comparison operation • Performance O(n) or O(log n) • Assume I have a function • f(key) → integer • i.e. one that maps a key to an integer • What performance might I expect now?

  13. Hash Tables - Structure • Simplest case: • Assume items have integer keys in the range 1 .. m • Use the value of the key itself to select a slot in a direct access table in which to store the item • To search for an item with key, k, just look in slot k • If there’s an item there, you’ve found it • If the tag is 0, it’s missing. • Constant time, O(1)

  14. Hashing: the basic idea • Map key values to hash table addresses: keys → hash table address. This applies to find, insert, and remove • Usually: integers → {0, 1, 2, …, Hsize-1}. Typical example: f(n) = n mod Hsize • Non-numeric keys converted to numbers • For example, strings converted to numbers as • Sum of ASCII values • First three characters

  15. Hashing: the basic idea • [Figure: student records hashed into a table by Perm # (mod 9)]

  16. Hash Tables - Choosing the Hash Function • Uniform hashing • Ideal hash function • P(k) = probability that a key, k, occurs • If there are m slots in our hash table, • a uniform hashing function, h(k), would ensure: • Σ_{k | h(k)=0} P(k) = Σ_{k | h(k)=1} P(k) = .... = Σ_{k | h(k)=m-1} P(k) = 1/m • (read each term as the sum over all k such that h(k) = i) • or, in plain English, • the number of keys that map to each slot is equal

  17. Hash Tables - A Uniform Hash Function • If the keys are integers randomly distributed in [0, r), i.e. 0 ≤ k < r, then • h(k) = ⌊mk / r⌋ • is a uniform hash function • Most hashing functions can be made to map the keys to [0, r) for some r, e.g. adding the ASCII codes for characters mod 256 will give values in [0, 256), i.e. [0, 255] • Replace + by xor • same range without the mod operation

  18. Hash Tables - Reducing the range to [0, m) • We’ve mapped the keys to a range of integers 0 ≤ k < r • Now we must reduce this range to [0, m) • where m is a reasonable size for the hash table • Strategies • Division - use a mod function • Multiplication • Universal hashing

  19. Hash Tables - Reducing the range to [0, m) • Division • Use a mod function • h(k) = k mod m • Choice of m? • Powers of 2 are generally not good! h(k) = k mod 2^n selects just the last n bits of k (e.g. k mod 2^8 selects the low 8 bits of 0110010111000011010) • All combinations are not generally equally likely • Prime numbers close to 2^n seem to be good choices • e.g. want a ~4000 entry table, choose m = 4093

  20. Hash Tables - Reducing the range to [0, m) • Multiplication method • Multiply the key by a constant, A, 0 < A < 1 • Extract the fractional part of the product (kA - ⌊kA⌋) • Multiply this by m • h(k) = ⌊m (kA - ⌊kA⌋)⌋ • Now m is not critical and a power of 2 can be chosen • So this procedure is fast on a typical digital computer • Set m = 2^p • Multiply k (w bits) by ⌊A·2^w⌋ → 2w-bit product • Extract the p most significant bits of the lower half • A = ½(√5 - 1) seems to be a good choice (see Knuth)

  21. Hash Tables - Reducing the range to [0, m) • Universal Hashing • A determined “adversary” can always find a set of data that will defeat any fixed hash function • Hash all keys to the same slot → O(n) search • Select the hash function randomly (at run time) from a set of hash functions • Reduced probability of poor performance • Set of functions, H, which map keys to [0, m) • H is universal if, for each pair of keys, x and y, the number of functions h ∈ H for which h(x) = h(y) is |H|/m

  22. Hash Tables - Reducing the range to [0, m) • Universal Hashing • A determined “adversary” can always find a set of data that will defeat any fixed hash function • Hash all keys to the same slot → O(n) search • Select the hash function randomly (at run time) from a set of hash functions • --------- • Functions are selected at run time • Each run can give different results • Even with the same data • Good average performance obtainable

  23. Hash Tables - Reducing the range to [0, m) • Universal Hashing • Can we design a set of universal hash functions? • Quite easily • Key, x = <x0, x1, x2, ...., xr> • Choose a = <a0, a1, a2, ...., ar>, a sequence of elements chosen randomly from {0, ..., m-1} • h_a(x) = Σ a_i x_i mod m • There are m^(r+1) sequences a, so there are m^(r+1) functions h_a(x) • Theorem • The h_a form a set of universal hash functions • Proof: see Cormen

  24. Hash Tables - Constraints • Constraints • Keys must be unique • Keys must lie in a small range • For storage efficiency, keys must be dense in the range • If they’re sparse (lots of gaps between values), a lot of space is used to obtain speed • Space for speed trade-off

  25. Hash Tables - Relaxing the constraints • Keys must be unique • Construct a linked list of duplicates “attached” to each slot • If a search can be satisfied by any item with key, k, performance is still O(1) • but • If the item has some other distinguishing feature which must be matched, we get O(nmax) • where nmax is the largest number of duplicates - or length of the longest chain

  26. Hash Tables - Relaxing the constraints • Keys are integers • Need a hash function h(key) → integer • i.e. one that maps a key to an integer • Applying this function to the key produces an address • If h maps each key to a unique integer in the range 0 .. m-1 then search is O(1)

  27. Hash Tables - Hash functions • Form of the hash function • Example - using an n-character key:

    int hash( char *s, int n ) {
        int sum = 0;
        while( n-- )
            sum = sum + *s++;
        return sum % 256;
    }

  • returns a value in 0 .. 255 • The xor function is also commonly used: sum = sum ^ *s++; • But any function that generates integers in 0..m-1 for some suitable (not too large) m will do • As long as the hash function itself is O(1)!

  28. Hash Tables - Collisions • Hash function • With this hash function:

    int hash( char *s, int n ) {
        int sum = 0;
        while( n-- )
            sum = sum + *s++;
        return sum % 256;
    }

  • hash( “AB”, 2 ) and hash( “BA”, 2 ) return the same value! • This is called a collision • A variety of techniques are used for resolving collisions

  29. Hash Tables - Collision handling • Collisions • Occur when the hash function maps two different keys to the same address • The table must be able to recognise and resolve this • Recognise • Store the actual key with the item in the hash table • Compute the address k = h( key ) • Check for a hit: if ( table[k].key == key ) then hit, else try next entry • Resolution • Variety of techniques

  30. Hash Tables - Linked lists • Collisions - Resolution • Linked list attached to each primary table slot • h(i) == h(i1) • h(k) == h(k1) == h(k2) • Searching for i1 • Calculate h(i1) • Item in table, i, doesn’t match • Follow linked list to i1 • If NULL found, key isn’t in table

  31. Hash Tables - Overflow area • Overflow area • Linked list constructed in a special area of the table called the overflow area • h(k) == h(j) • k stored first • Adding j • Calculate h(j) • Find k • Get first slot in overflow area • Put j in it • k’s pointer points to this slot • Searching - same as linked list

  32. Hash Tables - Re-hashing • Use a second hash function • Many variations • General term: re-hashing • h(k) == h(j) • k stored first • Adding j • Calculate h(j) • Find k • Calculate h’(j), where h’(x) is the second hash function • Repeat until we find an empty slot • Put j in it • Searching - use h(x), then h’(x)

  33. Hash Tables - Re-hash functions • The re-hash function • Many variations • Linear probing • h’(x) is +1 • Go to the next slot until you find one empty • Can lead to bad clustering • Re-hashed keys fill in gaps between other keys and exacerbate the collision problem

  34. Hash Tables - Re-hash functions • The re-hash function • Many variations • Quadratic probing • h’(x) is c·i² on the i-th probe • Avoids primary clustering • Secondary clustering occurs • All keys which collide on h(x) follow the same sequence • First a = h(j) = h(k) • Then a + c, a + 4c, a + 9c, .... • Secondary clustering is generally less of a problem

  35. Hash Tables - Collision Resolution Summary • Chaining • Unlimited number of elements • Unlimited number of collisions • Overhead of multiple linked lists • Re-hashing • Fast re-hashing • Fast access through use of main table space • Maximum number of elements must be known • Multiple collisions become probable • Overflow area • Fast access • Collisions don't use primary table space • Two parameters which govern performance need to be estimated


  37. Hash Tables - Summary so far ... • Potential O(1) search time • If a suitable function h(key) → integer can be found • Space for speed trade-off • “Full” hash tables don’t work (more later!) • Collisions • Inevitable • Hash function reduces amount of information in key • Various resolution strategies • Linked lists • Overflow areas • Re-hash functions • Linear probing: h’ is +1 • Quadratic probing: h’ is +ci² • Any other hash function! • or even a sequence of functions!

  38. Hashing: • Choose a hash function h; it also determines the hash table size. • Given an item x with key k, put x at location h(k). • To find if x is in the set, check location h(k). • What to do if more than one key hashes to the same value? This is called a collision. • We will discuss two methods to handle collisions: • Separate chaining • Open addressing

  39. Separate chaining • Maintain a list of all elements that hash to the same value • Search - use the hash function to determine which list to traverse • Insert/deletion - once the “bucket” is found through the hash, insert and delete are list operations • [Figure: table of buckets 0-10 with chains, e.g. 23, 1, 56 hanging off one bucket]

    class HashTable {
        ......
    private:
        unsigned int Hsize;
        List<E,K> *TheList;
        ......
    };

    find(k,e)
        HashVal = Hash(k, Hsize);
        if (TheList[HashVal].Search(k,e))
        then return true;
        else return false;

  40. 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 23 1 56 23 1 56 24 24 36 14 36 14 16 16 17 17 7 29 7 29 53 20 42 31 20 42 31 Insertion: insert 53 53 = 4 x 11 + 9 53 mod 11 = 9

  41. Analysis of Hashing with Chaining • Worst case • All keys hash into the same bucket - a single linked list • insert, delete, find take O(n) time. • Average case • Keys are uniformly distributed into buckets • O(1 + N/B): N is the number of elements in the hash table, B is the number of buckets. • If N = O(B), then O(1) time per operation. • N/B is called the load factor of the hash table.

  42. Open addressing • If a collision happens, alternative cells are tried until an empty cell is found. • Linear probing: try the next available position • [Figure: table of 11 slots holding 42, 1, 24, 14, 16, 28, 7, 31, 9]

  43. 0 1 2 3 4 5 6 7 8 9 10 42 1 24 14 16 28 7 0 1 2 3 4 5 6 7 8 9 10 42 1 31 24 9 14 12 16 28 7 31 9 Linear Probing (insert 12) 12 = 1 x 11 + 1 12 mod 11 = 1

  44. 0 1 2 3 4 5 6 7 8 9 10 42 1 24 14 12 16 28 7 31 9 Search with linear probing (Search 15) 15 = 1 x 11 + 4 15 mod 11 = 4 NOT FOUND !

  45. Deletion in Hashing with Linear Probing • Since empty buckets are used to terminate search, standard deletion does not work. • One simple idea is to not delete, but mark. • Insert: put item in first empty or marked bucket. • Search: Continue past marked buckets. • Delete: just mark the bucket as deleted. • Advantage: Easy and correct. • Disadvantage: table can become full with dead items.

  46. 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 42 42 1 1 24 24 14 14 12 12 16 16 28 28 7 7 31 31 D 9 Deletion with linear probing: LAZY (Delete 9) 9 = 0 x 11 + 9 9 mod 11 = 9 FOUND !

  47. Eager Deletion: fill holes • Remove and find replacement: • Fill in the hole for later searches

    remove(j) {
        i = j;
        empty[i] = true;
        i = (i + 1) % D;    // candidate for swapping
        while ((not empty[i]) and i != j) {
            r = Hash(ht[i]);    // where should it go without collision?
            // can we still find it based on the rehashing strategy?
            if not ((j < r <= i) or (i < j < r) or (r <= i < j)) then
                break;          // no: moving it to j keeps it findable, so swap
            i = (i + 1) % D;    // yes: it must stay, try the next slot
        }
        if (i != j and not empty[i]) then {
            ht[j] = ht[i];      // fill the hole
            remove(i);          // recursively fill the new hole
        }
    }

  48. Eager Deletion Analysis (cont.) • If not full • After deletion, there will be at least two holes • Elements that are affected by the new hole: • Initial hashed location is cyclically before the new hole • Location after linear probing is in between the new hole and the next hole in the search order • Such elements are movable to fill the hole • [Diagram: initial hashed location, new hole, location after linear probing, next hole in the search order]

  49. Eager Deletion Analysis (cont.) • The important thing is to make sure that if a replacement (i) is swapped into deleted (j), we can still find that element. How can we not find it? • If the original hashed position (r) is circularly in between deleted (j) and the replacement (i) • [Diagram: cases of r, i, j orderings - when r lies circularly between j and i, a later search for i stops at the empty slot and will not find it]

  50. Quadratic Probing • Avoids the primary clustering problem of linear probing • Check H(x) • If collision occurs check H(x) + 1 • If collision occurs check H(x) + 4 • If collision occurs check H(x) + 9 • If collision occurs check H(x) + 16 • ... • i-th probe: H(x) + i²
