
Hashing

Hashing. Notes from Weiss, Ch 20 and Notes by Greg McCarra from Napier University: http://www.nada.kth.se/kurser/kth/2D1345/inda03/hashingReading.pdf. Introduction. What is hashing? Why is it useful to us?


Presentation Transcript


  1. Hashing Notes from Weiss, Ch 20 and Notes by Greg McCarra from Napier University: http://www.nada.kth.se/kurser/kth/2D1345/inda03/hashingReading.pdf

  2. Introduction • What is hashing? Why is it useful to us? • Well, there are lots of applications out there that need to support ONLY the operations INSERT, SEARCH, and DELETE. These are known as “dictionary” operations. • Hashing can support these in O(n) time in the worst case but O(1) on average, and is quite fast in practice. Let’s learn more…

  3. What is it? • a hash table or hash map is a data structure that uses a hash function to efficiently translate certain keys (e.g., person names) into associated values (e.g., their telephone numbers). The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought. • Ideally the hash function should map each possible key to a different slot index; but this goal is rarely achievable in practice. Most hash table designs assume that hash collisions — pairs of different keys with the same hash values — are normal occurrences, and accommodate them in some way. • In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation. • In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in all kinds of computer software. ----Wikipedia

  4. Example We have a small group of people who wish to join a club (say about 40 folks). Then, if each of these people have an ID# associated with them (from 1 to 40) we could store their information in an array and access it using the ID# as the array index.

  5. Example Now, we have 7 of these clubs, with consecutive ID#s going up to 280. Now what? • We COULD create a 280-element array for each club and use only 40 elements of it. (wasteful?) • We COULD create a 40-element array per club and calculate each person’s index using a mapping (e.g., for the seventh club, whose IDs run from 241 to 280: index = ID# - 240).

  6. Example Now, imagine that we are hosting a club on campus open to all students. We could use the PC ID# (8 digits long). How big should our array be? THINGS TO CONSIDER: • How many students do we expect to join? • How can we create a key based on this number?

  7. Hash Functions • If we expect no more than 100 club members, we can use the last two digits of the PC ID# as our array index (the hash). Do we see any problems with this? • How do we get this number? • Take the remainder • (PC ID# % 100)

  8. Hash Functions • Taking the remainder is called the Division-remainder technique and is an example of a uniform hash function • A uniform hash function is designed to distribute the keys roughly evenly into the available positions within the array (or hash table).
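The division-remainder technique above can be sketched in a few lines of Java. This is a minimal illustration, not code from the slides; the class name DivisionHash and the TABLE_SIZE constant are our own choices.

```java
// A minimal sketch of the division-remainder technique: the last two
// digits of an 8-digit PC ID# become the table index.
public class DivisionHash {
    static final int TABLE_SIZE = 100; // room for about 100 club members

    // Maps a PC ID# to an index in [0, TABLE_SIZE).
    static int hash(int pcId) {
        return pcId % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash(20061234)); // prints 34
        System.out.println(hash(20071234)); // prints 34 as well: a collision
    }
}
```

Note that the two sample IDs collide, which previews the problem the next slides address.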

  9. Collisions • So what about students 20061234 and 20071234? They will hash to the same position in the table! What do we do?

  10. Collisions If no two values are able to map into the same position in the hash table, we have what is known as an “ideal hashing”. For the hash function f, each key k maps into position f(k). Then, to search for an element, we simply compute its hash function and look it up in the table.

  11. Collisions • Usually, ideal hashing is not possible (or at least not guaranteed). Some data is bound to hash to the same table element, in which case, we have a collision. • How do we solve this problem?

  12. Collisions • We can think of each table location as a “bucket” that contains several slots. Each slot is filled with one piece of data. • This approach involves “chaining” the data. This is a common approach when the hash table is used as disk storage. For each element of the table, a linked list (of sorts) is maintained to hold data that map to the same location. This list can grow as items are entered (unordered) or enter items into the list in a sorted fashion (for easier retrieval).
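The bucket-and-chain scheme described above can be sketched as follows. This is a minimal illustration with names of our own choosing (ChainedTable, insert, search, delete); it stores Integer keys and appends new entries unordered, as the slide mentions.

```java
import java.util.LinkedList;

// A sketch of separate chaining: each table slot is a "bucket" holding a
// linked list of all keys that hash to that slot.
public class ChainedTable {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    void insert(int key) {
        buckets[key % buckets.length].add(key); // unordered insert, per the slide
    }

    boolean search(int key) {
        return buckets[key % buckets.length].contains(key);
    }

    void delete(int key) {
        buckets[key % buckets.length].remove(Integer.valueOf(key));
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(10);
        table.insert(24);
        table.insert(34); // collides with 24 (both hash to 4) but chains in the same bucket
        System.out.println(table.search(34)); // prints true
    }
}
```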

  13. Collisions • Other solutions? • Linear Probing • Quadratic Probing • Designing a Good Hash Function

  14. Linear Probing • Have you ever been to a theatre or sports event where the tickets were numbered? • Has someone ever sat in your seat? • How did you resolve this problem?

  15. Linear Probing Linear Probing involves finding the hashed location already occupied and then moving forward by 1 through the array (wrapping around to the beginning if necessary) until an open location is found.

  16. Linear Probing • Let’s say that we have 1000 numbered tickets to an event, but only sell 400. If we move the event to a smaller venue, we must also renumber the tickets. The hash function would work like this: • (ticket number) % 400. • How many folks can get the same hashed number? (3 - for example, tickets 42, 442, and 842)

  17. Linear Probing • The idea is that even though these numbers hash to the same location, each still needs a slot of its own. Using linear probing, the entries are placed into the next available position.

  18. Linear Probing • Consider inserting the data with keys 24, 42, 34, 62, 73 into a table of size 10. These entries can be placed into the table at the following locations:

  19. Linear Probing • 24 % 10 = 4. Position is free. 24 placed into element 4 • 42 % 10 = 2. Position is free. 42 placed into element 2 • 34 % 10 = 4. Position is occupied. Try next place in the table (5). 34 placed into position 5. • 62 % 10 = 2. Position is occupied. Try next place in the table (3). 62 placed into position 3. • 73 % 10 = 3. Position is occupied. Try next place in the table (4). Same problem. Try (5). Then (6). 73 is placed into position 6.
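The steps above can be sketched in Java. The class and method names are ours, and the insert method assumes the table never fills completely.

```java
// A sketch of linear probing that reproduces the worked example:
// keys 24, 42, 34, 62, 73 into a table of size 10.
public class LinearProbing {
    private final Integer[] table;

    LinearProbing(int size) { table = new Integer[size]; }

    // Probes forward one slot at a time, wrapping around, until a free
    // slot is found; returns the position where the key was placed.
    int insert(int key) {
        int i = key % table.length;
        while (table[i] != null) i = (i + 1) % table.length;
        table[i] = key;
        return i;
    }

    public static void main(String[] args) {
        LinearProbing t = new LinearProbing(10);
        for (int key : new int[] {24, 42, 34, 62, 73})
            System.out.println(key + " -> " + t.insert(key));
        // prints 24 -> 4, 42 -> 2, 34 -> 5, 62 -> 3, 73 -> 6
    }
}
```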

  20. Linear Probing • How would it look if the numbers were: • 28, 19, 59, 68, 89??

  21. Finding and Deleting • Finding? • Deleting? • When deleting, we must be more careful. Having found the element, we can’t just remove it. Why? Emptying the slot would break the probe chain, so later searches for keys placed beyond it would stop too early. • Use lazy deletion
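Lazy deletion can be sketched like this: a deleted slot keeps a sentinel marker so that later searches continue probing past it instead of stopping at an apparently empty slot. The class name, the sentinel choice, and the method names are ours.

```java
// A sketch of lazy deletion in a linearly probed table: deleted slots
// are marked with a sentinel rather than emptied.
public class LazyDeletionTable {
    private static final Integer DELETED = Integer.MIN_VALUE; // sentinel (our choice)
    private final Integer[] table;

    LazyDeletionTable(int size) { table = new Integer[size]; }

    void insert(int key) {
        int i = key % table.length;
        // A DELETED slot counts as free for insertion.
        while (table[i] != null && !table[i].equals(DELETED)) i = (i + 1) % table.length;
        table[i] = key;
    }

    boolean search(int key) {
        int i = key % table.length;
        while (table[i] != null) {            // stop only at a never-used slot
            if (table[i].equals(key)) return true;
            i = (i + 1) % table.length;       // probe past DELETED markers
        }
        return false;
    }

    void delete(int key) {
        int i = key % table.length;
        while (table[i] != null) {
            if (table[i].equals(key)) { table[i] = DELETED; return; }
            i = (i + 1) % table.length;
        }
    }

    public static void main(String[] args) {
        LazyDeletionTable t = new LazyDeletionTable(10);
        t.insert(24);
        t.insert(34);                     // probes past 24 into slot 5
        t.delete(24);                     // slot 4 is marked, not emptied
        System.out.println(t.search(34)); // prints true: search probes past the marker
    }
}
```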

  22. Clustering • Sometimes, data will cluster – this is caused when many elements hash to the same (or similar) location and linear probing has been used often. We can help with this problem by choosing our divisor carefully in our hash function and by carefully choosing our table size.

  23. Designing a Good Hash Function • If the divisor is even and there are more even than odd key values, the hash function will produce an excess of even results. The same is true, in reverse, if there is an excess of odd key values. • However, if the divisor is odd, then either kind of excess of key values would still give a balanced distribution of odd/even results. • Thus, the divisor should be odd. But, this is not enough.

  24. Designing a Good Hash Function • Thus, the divisor should be odd. But, this is not enough. • If the divisor itself is divisible by a small odd number (like 3, 5, or 7) the results are unbalanced again. Ideally, it should be a prime number. If no such prime number works for our table size (the divisor, remember?), we should use an odd number with no small factors.
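Picking such a divisor can be automated with a small helper that finds the first prime at or above a desired capacity. This is an illustrative sketch; the names TableSizing and nextPrime are ours.

```java
// A sketch of choosing a table size: use the first prime at or above the
// requested capacity, per the advice above.
public class TableSizing {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Returns the smallest prime >= n.
    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(nextPrime(100)); // prints 101
    }
}
```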

  25. Problems of Linear Probing • The majority of the problems are caused by clustering. These problems can be helped by using Quadratic probing instead.

  26. Quadratic Probing • Works like linear probing but instead of looking to the next available position, the next location is chosen by looking at the positions that are 1², 2², 3², etc. positions ahead.

  27. Quadratic Probing • Consider inserting the data with keys 24, 42, 34, 62, 73 into a table of size 10. These entries can be placed into the table at the following locations:

  28. Quadratic Probing • 24 % 10 = 4. Position is free. 24 placed into element 4 • 42 % 10 = 2. Position is free. 42 placed into element 2 • 34 % 10 = 4. Position is occupied. Try the place 1² away in the table (5). 34 placed into position 5. • 62 % 10 = 2. Position is occupied. Try the place 1² away in the table (3). 62 placed into position 3. • 73 % 10 = 3. Position is occupied. Try the place 1² away in the table (4). Same problem. Try the place 2² away in the table (3 + 4 = 7). 73 is placed into position 7. • Thus, we jumped over the existing cluster. • This doesn’t completely solve our problem, but it helps.
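A sketch of quadratic probing with the same keys: the i-th retry looks i² slots past the home position, wrapping around. Note that 73's second retry lands at 3 + 2² = 7. Names are ours, and the loop assumes a free slot is eventually reached.

```java
// A sketch of quadratic probing: retry i probes i*i slots past the home
// position (mod table size).
public class QuadraticProbing {
    private final Integer[] table;

    QuadraticProbing(int size) { table = new Integer[size]; }

    // Returns the position where the key was placed.
    int insert(int key) {
        int home = key % table.length;
        for (int i = 0; ; i++) {
            int pos = (home + i * i) % table.length;
            if (table[pos] == null) { table[pos] = key; return pos; }
        }
    }

    public static void main(String[] args) {
        QuadraticProbing t = new QuadraticProbing(10);
        for (int key : new int[] {24, 42, 34, 62, 73})
            System.out.println(key + " -> " + t.insert(key));
        // prints 24 -> 4, 42 -> 2, 34 -> 5, 62 -> 3, 73 -> 7
    }
}
```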

  29. Quadratic Probing • How would it look if the numbers were: • 28, 19, 59, 68, 89??

  30. Advantages • Fast – average constant time (O(1)) for finding information – especially apparent when the table is large. • If the key/value pairs are known before programming (disallowing insertions/deletions of new data into the table), the programmer can reduce average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. (Sometimes this allows for “perfect hashing”) ---- Wikipedia

  31. Perfect Hashing • If all of the keys that will be used are known ahead of time, and there are no more keys than can fit the hash table, a perfect hash function can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well. • Perfect hashing allows for constant time lookups in the worst case. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be arbitrarily large. ---- Wikipedia
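As a toy illustration of the idea (not a real perfect-hashing construction), one can brute-force a modulus that maps a fixed, known key set with no collisions. The class and method names here are ours.

```java
import java.util.HashSet;

// A toy "perfect hash" search: for a fixed key set, find the smallest
// modulus m such that key % m is distinct for every key. A table of size
// m then has no collisions for these keys.
public class PerfectHashSearch {
    static int findModulus(int[] keys) {
        for (int m = keys.length; ; m++) {
            HashSet<Integer> seen = new HashSet<>();
            boolean distinct = true;
            for (int k : keys)
                if (!seen.add(k % m)) { distinct = false; break; }
            if (distinct) return m;
        }
    }

    public static void main(String[] args) {
        int[] keys = {24, 42, 34, 62, 73};
        System.out.println(findModulus(keys)); // prints 12
    }
}
```

With m = 12 the five keys above land in five different slots; because m > 5, some slots go unused, so this is perfect but not minimal perfect hashing.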

  32. Drawbacks • More difficult to implement than search trees • Though operations take O(1) on average, the cost of computing the hash function itself can be high, so on small amounts of data hash tables are not as effective as a good tree structure. • Can be very inefficient if there are many collisions. • Though unlikely in normal practice, a crafty (malicious) user can force the function into its worst-case behavior by supplying keys that all collide, causing poor performance (denial-of-service attacks) ---- Wikipedia

  33. Implementation • In Java, hash tables underlie the built-in “set” and “map” collections • The Set interface and the HashSet class are in the java.util package • Sets in Java have four fundamental operations: • Adding an element (add method) • Removing an element (remove method) • Containment testing (is element in set?) (contains method) • Listing all elements (in arbitrary order) (loop using the set’s iterator with the hasNext and next methods)

  34. ch16/set/SetDemo.java

import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

/**
   This program demonstrates a set of strings. The user
   can add and remove strings.
*/
public class SetDemo
{
   public static void main(String[] args)
   {
      Set<String> names = new HashSet<String>();
      Scanner in = new Scanner(System.in);

      boolean done = false;
      while (!done)
      {
         System.out.print("Add name, Q when done: ");
         String input = in.next();

Continued

  35. ch16/set/SetDemo.java (cont.)

         if (input.equalsIgnoreCase("Q"))
            done = true;
         else
         {
            names.add(input);
            print(names);
         }
      }

      done = false;
      while (!done)
      {
         System.out.print("Remove name, Q when done: ");
         String input = in.next();
         if (input.equalsIgnoreCase("Q"))
            done = true;
         else
         {
            names.remove(input);
            print(names);
         }
      }
   }

Continued

  36. ch16/set/SetDemo.java (cont.)

   /**
      Prints the contents of a set of strings.
      @param s a set of strings
   */
   private static void print(Set<String> s)
   {
      System.out.print("{ ");
      for (String element : s)
      {
         System.out.print(element);
         System.out.print(" ");
      }
      System.out.println("}");
   }
}

  37. Maps • A set stores elements. A map stores associations between keys and values. • Hash tables are also used to implement maps in Java. • All of these types are found in the java.util package • java.util.Map • java.util.Set • java.util.HashMap • java.util.HashSet

  38. Maps • A map keeps associations between key and value objects • Mathematically speaking, a map is a function from one set, the key set, to another set, the value set • Every key in a map has a unique value • A value may be associated with several keys • Classes that implement the Map interface • HashMap • TreeMap

  39. An Example of a Map

  40. ch16/map/MapDemo.java

import java.awt.Color;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
   This program demonstrates a map that maps names to colors.
*/
public class MapDemo
{
   public static void main(String[] args)
   {
      Map<String, Color> favoriteColors
            = new HashMap<String, Color>();
      favoriteColors.put("Juliet", Color.PINK);
      favoriteColors.put("Romeo", Color.GREEN);
      favoriteColors.put("Adam", Color.BLUE);
      favoriteColors.put("Eve", Color.PINK);

Continued

  41. ch16/map/MapDemo.java (cont.)

      Set<String> keySet = favoriteColors.keySet();
      for (String key : keySet)
      {
         Color value = favoriteColors.get(key);
         System.out.println(key + "->" + value);
      }
   }
}
