Chapter 11 Hash. Anshuman Razdan, Div of Computing Studies, razdan@asu.edu, http://dcst2.east.asu.edu/~razdan/cst230/
Searching • Searching for a specific value among a collection of values is a common operation. • Complexity of search/find using: • array • linked list • ordered list • binary tree • BST CST 230 - Razdan et al.
Linear Search • search an array A of n elements for a specified element target
i = 0; found = false;
while( (i < n) && !found )
    if( A[ i ] == (or equals) target )
        found = true;
    else
        i++;
if( found ) target is at position i
else target is not in array
Complexity of Linear Search • count # of comparisons that must be done. • Worst Case: n comparisons (target is in the last position or not in the array). • Average Case: about n/2 comparisons when the target is present.
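The comparison count can be checked directly. The following is a minimal sketch, assuming an int array; the class and method names are hypothetical and not part of the course code.

```java
class LinearSearchCount {
    // Returns how many elements are compared against target before the
    // search stops (either on a match or at the end of the array).
    static int comparisons(int[] a, int target) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count++;
            if (a[i] == target) {
                break;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] a = {432, 321, 17, 65, 9388, 200, 83, 564};
        System.out.println(comparisons(a, 432)); // first element
        System.out.println(comparisons(a, 564)); // last element
        System.out.println(comparisons(a, 5));   // not in the array
    }
}
```

A target in the last position and a missing target both cost n comparisons, which is the worst case.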
Binary Search • search a sorted array A of n elements for a specified element target
public static int BinarySearch( int[] A, int first, int n, int target ){
    int middle;
    int found;
    if( n <= 0 )
        found = -1;   // empty segment: target is not present
    else{
        middle = first + n/2;
        if( target == A[middle] )
            found = middle;
        else if( target < A[middle] )
            found = BinarySearch( A, first, n/2, target );          // the n/2 elements before middle
        else
            found = BinarySearch( A, middle+1, (n-1)/2, target );   // the (n-1)/2 elements after middle
    }
    return found;
}
Complexity of BinarySearch • The BinarySearch body takes constant time, so we need to count the number of calls made to BinarySearch. • Find the depth of recursive calls: the length of the longest chain of recursive calls in the execution of an algorithm. • Each call passes on at most half of the remaining elements, so the depth grows like log2 n.
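The depth of recursion can be measured by counting calls. This is a sketch under assumptions: it follows the same halving scheme as the BinarySearch slide, but the call counter and the helper countCalls are hypothetical additions for illustration.

```java
class BinarySearchDepth {
    static int calls = 0; // number of calls made during one search

    static int search(int[] a, int first, int n, int target) {
        calls++;
        if (n <= 0) {
            return -1; // empty segment: not found
        }
        int middle = first + n / 2;
        if (target == a[middle]) {
            return middle;
        } else if (target < a[middle]) {
            return search(a, first, n / 2, target);
        } else {
            return search(a, middle + 1, (n - 1) / 2, target);
        }
    }

    // Builds the sorted array 0..n-1, searches it, and reports the call count.
    static int countCalls(int n, int target) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        calls = 0;
        search(a, 0, n, target);
        return calls;
    }

    public static void main(String[] args) {
        // Searching 1024 elements for a value smaller than all of them.
        System.out.println(countCalls(1024, -1));
    }
}
```

For a missing target below every element, the segment sizes are 1024, 512, ..., 1, 0, so 12 calls are made for 1024 elements.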
Motivation: Direct Access is Fast • Suppose we have a large number of products to store and that each product has a unique product ID. • If n products have IDs in the range 0..n-1, we can store each product in an array at index prodID. • time to find product? • If the number of IDs is much smaller than the range of IDs, storing each product at index prodID is VERY space inefficient.
Hashing • Each element has a unique key that identifies the element. • We have: large range of keys • We want: index of elements to be 0..numElem-1
[Figure: keys key1, key2, ..., keyn mapped by a hash function to array indices 0, 1, 2, ..., n-1]
Common hashing function: Mod • The mod function is a natural choice for hashing because x mod n always results in a number in the range 0 .. n-1. • E.g., Insert the following numbers into a hash table of size 10: 432, 321, 17, 65, 9388, 200, 83, 564
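The indices for the numbers above can be computed with a few lines of Java. This is a minimal sketch that assumes non-negative integer keys (Java's % can return a negative result for negative operands); the class and method names are hypothetical.

```java
import java.util.Arrays;

class ModHashDemo {
    // Mod hashing for non-negative keys: the result is in 0 .. tableSize-1.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Index that each key hashes to, in the same order as the keys.
    static int[] indices(int[] keys, int tableSize) {
        int[] result = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            result[i] = hash(keys[i], tableSize);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] keys = {432, 321, 17, 65, 9388, 200, 83, 564};
        System.out.println(Arrays.toString(indices(keys, 10)));
    }
}
```

With a table of size 10 the index is simply the last decimal digit, so these eight keys land on eight different indices.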
Collisions • A perfect hashing function will produce a different index for every key. • Unfortunately, mod is NOT perfect. • 20 mod 10 = 0 • 520 mod 10 = 0 • 1030 mod 10 = 0 • etc. • When two (or more) distinct keys hash to the same index, we have a collision. • There are various methods used to deal with collisions.
Open-address Hashing • One method to deal with collisions is open-addressing: • compute hash(key) • if data[hash(key)] is not occupied, insert key. else • search forward starting at index hash(key) + 1 until a vacant position is found and insert key. (Note: array is circular, so that after the last index of the array is tried, index 0 is tried next.) • This method is also called “linear probing”
Example • Insert keys 89, 18, 49, 58, and 9 into a hash table of size 10.
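The insertions can be traced with a small program. The sketch below assumes non-negative keys and fewer keys than slots (otherwise the probe loop would never end); the class and method names are hypothetical.

```java
import java.util.Arrays;

class LinearProbingDemo {
    // Inserts each key at key % capacity, probing forward (circularly)
    // until a vacant slot is found. A null entry means the slot is vacant.
    static Integer[] insertAll(int[] keys, int capacity) {
        Integer[] table = new Integer[capacity];
        for (int key : keys) {
            int i = key % capacity;
            while (table[i] != null) {
                i = (i + 1) % capacity; // wrap around to index 0 after the last slot
            }
            table[i] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {89, 18, 49, 58, 9};
        System.out.println(Arrays.toString(insertAll(keys, 10)));
    }
}
```

89 and 18 go straight to indices 9 and 8. 49 collides at 9 and wraps to 0, 58 probes 8, 9, 0 and lands at 1, and 9 probes 9, 0, 1 and lands at 2.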
Hashing non-integer keys • Many applications require collections of objects with non-integer keys (often Strings). • An encoding function converts the key to an integer, and the hash function is performed on the encoding. • All Java classes (objects) include a method called hashCode. • Note: keys must be unique, so encoding of keys must be unique as well. This is very important when designing an encoding scheme.
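As an illustration, String keys can be encoded with hashCode and then reduced with mod. This is a sketch under assumptions: the class and method names are hypothetical, and the table size of 10 is chosen only for the demonstration.

```java
class StringHashDemo {
    // Encode the key with hashCode, then map the encoding into 0 .. tableSize-1.
    static int index(String key, int tableSize) {
        return Math.abs(key.hashCode()) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println("abc".hashCode());  // encoding of "abc"
        System.out.println(index("abc", 10));  // index in a table of size 10
        // Two different strings may share an encoding:
        System.out.println("Aa".hashCode() == "BB".hashCode());
    }
}
```

Java's built-in String.hashCode does not give every string its own encoding ("Aa" and "BB" both encode to 2112), which is why a table still has to compare keys with equals when it searches.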
Hashtable methods • Common Hashtable methods are: • put: put a new object into the table • containsKey: search for an object with a specified key (returns boolean) • get: retrieve an object for a specified key • remove: remove an object with a specified key
Example Implementation
public class Hashtable{
    private int manyItems;          // number of key/element pairs currently stored
    private Object[] keys;
    private Object[] data;
    private boolean[] hasBeenUsed;  // true once a slot has ever held a key
    private int hash(Object key){
        return Math.abs(key.hashCode())%data.length;
    }
    private int nextIndex(int i){
        return (i+1) % data.length;   // wrap around: the array is circular
    }
    ...
Constructor
public Hashtable( int capacity ){
    if( capacity <= 0 )
        throw new IllegalArgumentException("Capacity must be positive.");
    keys = new Object[capacity];
    data = new Object[capacity];
    hasBeenUsed = new boolean[capacity];
}
findIndex
private int findIndex( Object key ){
    int count = 0;
    int i = hash(key);
    int retVal = -1;
    // stop after a full pass, at a never-used slot, or when the key is found
    while( (count<data.length) && (hasBeenUsed[i]) && (retVal == -1) ){
        if( key.equals(keys[i]) )
            retVal = i;
        count++;
        i = nextIndex(i);
    }
    return retVal;
}
put
public Object put(Object key, Object element){
    int index = findIndex(key);
    Object answer = null;
    if( index != -1 ){
        // key already present: replace its element and return the old one
        answer = data[index];
        data[index] = element;
    }
    else if( manyItems < data.length ){
        // new key: probe forward from hash(key) to the first vacant slot
        index = hash(key);
        while( keys[index] != null )
            index = nextIndex(index);
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        manyItems++;
    }
    else
        throw new IllegalStateException("Table is full");
    return answer;
}
remove
public Object remove( Object key ){
    int index = findIndex( key );
    Object answer = null;
    if( index != -1 ){
        answer = data[index];
        keys[index] = null;
        data[index] = null;
        // hasBeenUsed[index] stays true so later searches continue past this slot
        manyItems--;
    }
    return answer;
}
get
public Object get( Object key ){
    int index = findIndex( key );
    Object answer = null;
    if( index != -1 ){
        answer = data[index];
    }
    return answer;
}
containsKey
public boolean containsKey( Object key ){
    return findIndex( key ) != -1;   // the key is present exactly when findIndex locates it
}
Example • Show state of Hashtable after the following are performed (assume hashCode of an integer is the integer itself): • construct Hashtable with capacity 10 • put( new Integer(29), "Barb" ) • put( new Integer(19), "Mateo" ) • put( new Integer(9), "Eddie" ) • remove( new Integer(19) ) • containsKey( new Integer(9) ) • put( new Integer(30), "Jerry" )
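One way to check the exercise is to run it. The sketch below condenses the slides' open-address table into a single class. The name ProbingTable, the one-line containsKey body, and the indexOf and demo helpers are assumptions added for illustration, not the course's code.

```java
class ProbingTable {
    private int manyItems;
    private final Object[] keys;
    private final Object[] data;
    private final boolean[] hasBeenUsed;

    ProbingTable(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("Capacity must be positive.");
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

    private int hash(Object key) { return Math.abs(key.hashCode()) % data.length; }
    private int nextIndex(int i) { return (i + 1) % data.length; }

    // Probe from hash(key) until the key is found, a never-used slot is
    // reached, or every slot has been examined.
    private int findIndex(Object key) {
        int i = hash(key);
        for (int count = 0; count < data.length && hasBeenUsed[i]; count++) {
            if (key.equals(keys[i])) return i;
            i = nextIndex(i);
        }
        return -1;
    }

    Object put(Object key, Object element) {
        int index = findIndex(key);
        if (index != -1) {                 // existing key: replace the element
            Object old = data[index];
            data[index] = element;
            return old;
        }
        if (manyItems == data.length) throw new IllegalStateException("Table is full");
        index = hash(key);
        while (keys[index] != null) index = nextIndex(index);
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        manyItems++;
        return null;
    }

    Object remove(Object key) {
        int index = findIndex(key);
        if (index == -1) return null;
        Object old = data[index];
        keys[index] = null;                // slot is vacant again, but hasBeenUsed stays true
        data[index] = null;
        manyItems--;
        return old;
    }

    Object get(Object key) {
        int index = findIndex(key);
        return (index == -1) ? null : data[index];
    }

    boolean containsKey(Object key) { return findIndex(key) != -1; }

    // Hypothetical inspection helper: slot currently holding key, or -1.
    int indexOf(Object key) { return findIndex(key); }

    // Replays the operations from the example slide.
    static ProbingTable demo() {
        ProbingTable t = new ProbingTable(10);
        t.put(29, "Barb");
        t.put(19, "Mateo");
        t.put(9, "Eddie");
        t.remove(19);
        t.containsKey(9);
        t.put(30, "Jerry");
        return t;
    }
}
```

In this sketch 29 lands at index 9, 19 wraps to index 0, and 9 probes on to index 1. Removing 19 empties index 0 but leaves its hasBeenUsed flag set, so containsKey(9) still probes past it and finds 9, and 30 is then placed in the freed index 0.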
Linear probing and clustering • In linear probing, when several keys hash to the same index, a “cluster” of values forms around the index. • Elements take longer to find/add because we must move linearly through the entire cluster. • Elements are put farther and farther away from the desired index. • We need other methods that avoid clustering.
Double Hashing • The most common technique to avoid clustering is double hashing: • use hash function hash1 to determine desired index of element. • if collision occurs, use hash function hash2 to determine next index to search for open spot. • In particular, if index i is occupied, the next index to examine is: (i + hash2(key) ) % data.length
Choosing hash2 • As we step through the array, we must ensure that every array position is examined. • We must choose hash2 to prevent returning to the original hash index before visiting the entire array. • Array capacity & hash2 value should be relatively prime. One way to accomplish this: • choose data.length as a prime number and have hash2 return values from range 1 .. data.length - 1 • Donald Knuth’s suggestion: • both data.length and data.length - 2 should be prime numbers (called twin primes) e.g. 1231 and 1229 • hash1(key) = Math.abs(key.hashCode()) % data.length • hash2(key) = 1 + (Math.abs(key.hashCode()) % (data.length - 2))
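The twin-prime scheme can be checked with a short experiment. This is a sketch under assumptions: the capacity 1231 comes from the example above, while the class name and the distinctSlotsProbed helper are hypothetical.

```java
class DoubleHashingDemo {
    static final int CAPACITY = 1231; // prime; CAPACITY - 2 = 1229 is also prime

    static int hash1(Object key) {
        return Math.abs(key.hashCode()) % CAPACITY;
    }

    // Step size in the range 1 .. CAPACITY-2, so it is never 0.
    static int hash2(Object key) {
        return 1 + (Math.abs(key.hashCode()) % (CAPACITY - 2));
    }

    // Follows the probe sequence for CAPACITY steps and counts how many
    // different array positions it touches.
    static int distinctSlotsProbed(Object key) {
        boolean[] seen = new boolean[CAPACITY];
        int i = hash1(key);
        int count = 0;
        for (int step = 0; step < CAPACITY; step++) {
            if (!seen[i]) {
                seen[i] = true;
                count++;
            }
            i = (i + hash2(key)) % CAPACITY;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hash1(5000) + " " + hash2(5000));
        System.out.println(distinctSlotsProbed("apple"));
    }
}
```

Because the capacity is prime, every step size from 1 to 1229 is relatively prime to it, so the probe sequence reaches all 1231 positions before it repeats.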
Chained Hashing • In chaining, we essentially allow collisions to occur, and store more than one element at a given array index. • How can we store more than one element? • list • ordered list • BST • If the hash function equally distributes keys over the array, the chains at each index should be relatively short.
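A minimal sketch of chaining, assuming non-negative integer keys and an unordered list as the chain at each index. The class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTableDemo {
    // One chain (list of keys) per array index.
    private final List<LinkedList<Integer>> chains = new ArrayList<>();

    ChainedTableDemo(int capacity) {
        for (int i = 0; i < capacity; i++) {
            chains.add(new LinkedList<>());
        }
    }

    private LinkedList<Integer> chainFor(int key) {
        return chains.get(key % chains.size());
    }

    // Colliding keys simply join the chain at their index.
    void add(int key) {
        LinkedList<Integer> chain = chainFor(key);
        if (!chain.contains(key)) {
            chain.add(key);
        }
    }

    boolean contains(int key) {
        return chainFor(key).contains(key);
    }

    int chainLength(int index) {
        return chains.get(index).size();
    }

    public static void main(String[] args) {
        ChainedTableDemo table = new ChainedTableDemo(10);
        table.add(20);
        table.add(520);
        table.add(1030);
        System.out.println(table.chainLength(0));
    }
}
```

The keys 20, 520, and 1030 all hash to index 0 in a table of size 10, so they end up in one chain of length 3. A search for any of them walks only that chain.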
Time Analysis • Worst case for hashing is when all keys hash to same index (linear) • Best case for hashing is when all keys hash to different indices (constant) • Average case analysis gives a better picture of what happens in reality.
Load Factor • The load factor for a hash table is defined as: $\alpha = \frac{\text{number of elements in the table}}{\text{size of the table's array}}$ • For open-address hashing, $\alpha \le 1$. • For chaining, $\alpha$ could be larger than 1.
Average Time (Linear Probing) • In open-address hashing with linear probing, a nonfull hash table, and no removals, the average number of table elements examined (in a successful search) is about $\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$, where $\alpha$ is the load factor. • For example, suppose we have 800 items in a table of capacity 1000. How many entries will we examine on average?
Average Time (Double Hashing) • In open-address hashing with double hashing, a nonfull hash table, and no removals, the average number of elements examined (in a successful search) is about $\frac{-\ln(1-\alpha)}{\alpha}$, where $\alpha$ is the load factor. • How many comparisons for previous example?
Average Time (Chaining) • In chained hashing, the average number of table elements examined (in a successful search) is about $1 + \frac{\alpha}{2}$, where $\alpha$ is the load factor. • How many for previous example?
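The three averages for the 800-item, capacity-1000 example can be computed directly. This sketch assumes the standard successful-search approximations: (1/2)(1 + 1/(1 - α)) for linear probing, -ln(1 - α)/α for double hashing, and 1 + α/2 for chaining, with α the load factor. The class and method names are hypothetical.

```java
class AverageProbes {
    static double linearProbing(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    static double doubleHashing(double alpha) {
        return -Math.log(1 - alpha) / alpha;
    }

    static double chaining(double alpha) {
        return 1 + alpha / 2;
    }

    public static void main(String[] args) {
        double alpha = 800.0 / 1000.0; // load factor of the example
        System.out.printf("linear probing: %.2f%n", linearProbing(alpha));
        System.out.printf("double hashing: %.2f%n", doubleHashing(alpha));
        System.out.printf("chaining:       %.2f%n", chaining(alpha));
    }
}
```

At a load factor of 0.8 these approximations give about 3.0 elements examined for linear probing, about 2.0 for double hashing, and 1.4 for chaining.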
Java Data Structures • The java.util package includes the following classes (see http://java.sun.com/j2se/1.4.2/docs/api/ ) • HashMap • Hashtable • LinkedList • as well as interfaces: • Iterator • ListIterator
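As a brief illustration of the library classes, java.util.HashMap offers the same put, get, containsKey, and remove operations. The sketch below replays the earlier Hashtable example on a HashMap; the class name and the build helper are hypothetical.

```java
import java.util.HashMap;

class HashMapDemo {
    // Performs the example's operations on a library HashMap.
    static HashMap<Integer, String> build() {
        HashMap<Integer, String> map = new HashMap<>();
        map.put(29, "Barb");
        map.put(19, "Mateo");
        map.put(9, "Eddie");
        map.remove(19);
        map.put(30, "Jerry");
        return map;
    }

    public static void main(String[] args) {
        HashMap<Integer, String> map = build();
        System.out.println(map.containsKey(9));
        System.out.println(map.get(30));
        System.out.println(map.get(19)); // null: the key was removed
    }
}
```

The library class manages its own capacity and collision handling, so callers work only with keys and values.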