Download Presentation
## Tables

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Chapter 10**• Tables • Overview • Direct access to data via a key, through the Table ADT, implemented as a hash table.**Chapter Objectives**• 1. The Table ADT provides direct access to data indexed by a key. • 2. Hashing techniques support very fast retrieval via keys. • 3. Two approaches to hash tables: linear probing and chained hashing.**Retrieval by key**• A common programming task is to look up an item in a list • We have already seen some simple ways of doing this • Linear search • Binary search • They differ in terms of data requirements and search efficiency**The Table ADT**• Tables are data structures that allow data to be retrieved directly. • They may be multidimensional (unlike lists) • They may be implemented in an array form or using arrays of linked lists (or even linked lists of linked lists).**The Table ADT**• Implementation independent • Will be used later with array and linked implementations**Table ADT: characteristics**• Characteristics • •A table ADT T stores data of some type (tableElementType) with an associated key (tableKeyType).**Table ADT: operations**• Operations • bool T.lookup(tableKeyType lookupKey, • tableElementType & data) • Precondition: None • Postcondition: If lookupKey equals a key in the table, the value of data is set to the data associated with that key; otherwise, the value of data is undefined. • Returns: true if and only if lookupKey equals a key in the table**Table ADT: insert()**• void T.insert(tableKeyType insertKey, tableElementType insertData) • Precondition: None • Postcondition: insertData and associated insertKey are stored in T, i.e., T.lookup(insertKey, data) == true, and the value of data will be set to insertData upon return. • Note: The Postcondition implies that if insertKey duplicates an existing key within the table, the data associated with that key is replaced by insertData.**Table ADT, deleteKey()**• void T.deleteKey(tableKeyType deleteKey) • Precondition: None • Postcondition: T.lookup(deleteKey, data) will return false**Table ADT**• Linear implementation • // for an array implementation, // need a max table size • const int MAX_TABLE = 100;**Table Class: public section**• template < class tableKeyType, class tableDataType > • class Table • { • public: • Table(); // Table constructor • bool lookup(tableKeyType lookupKey, tableDataType & data); • void insert(tableKeyType insertKey, tableDataType insertData); • void deleteKey(tableKeyType deleteKey);**Table ADT, private section**• private: • // implementation via an unordered array of structs • struct item { • tableKeyType key; • tableDataType data; • }; • item T[MAX_TABLE]; // stores the items in the table • int entries; // keep track of number of entries in table • int search(tableKeyType key); // an internal routine for searching table • };**The classic ‘lookup table’**15796 foobars 0 Product: Widget Product code: 11234 The data key is the product code and the data itself is the item name 17556 dohinky 1 11234 widget 2 14322 Etc. 3 . . . We also keep track of the number of entries in the table. . . . entries 9998 4 9999**Hash Table, linear version**• template < class tableKeyType, class tableDataType > • Int Table < tableKeyType, tableDataType > • ::search(tableKeyType key) • { // internal routine for implementation -- searches in table for the key -- if found, returns its position; • // else it returns the current value of "entries" -- which is the index 1 past the last item in the table • int pos; • for (pos = 0; pos < entries && T[pos].key != key; pos++) • ; • return pos; • }**Search()**15796 foobars 0 Product: Widget Product code: 11234 Returns the location of where The search item was found 17556 dohinky 1 11234 widget 2 14322 Etc. 3 . . . If the item is not found search returns a value equal to entries. . . . entries 9998 4 9999**Table constructor**• template < class tableKeyType, class tableDataType > • Table < tableKeyType, tableDataType >::Table() • { • entries = 0; • }**Insert()**• template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::insert(tableKeyType key, tableDataType data) • { • assert(entries < MAX_TABLE); • int pos(search(key)); // set pos to search results • if (pos == entries) // new key • entries++; • T[pos].key = key; • T[pos].data = data; • }**Insert()**15796 foobars 0 Product: dealybob Product code: 18452 Search returns 4 17556 dohinky 1 11234 widget 2 14322 etc. 3 18452 dealybob 4 . . Insert item in array[4] And add 1 to entries . . . entries 9998 5 9999**lookup()**• template < class tableKeyType, class tableDataType > • bool Table < tableKeyType, tableDataType > • ::lookup(tableKeyType key, tableDataType &data) • { int pos(search(key)); // set pos to search results • if (pos == entries) // not found • return false; • else { • data = T[pos].data; • return true; • } • }**deleteKey()**• template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::deleteKey(tableKeyType key) • { • int pos(search(key)); // set pos to search results • if (pos < entries) { // otherwise, not found, so do nothing • // copy last entry into this position • --entries; • T[pos] = T[entries]; • } • }**deleteKey()**15796 foobars 0 Product: widget Product code: 11234 Search returns 2 17556 dohinky 1 11234 widget 2 14322 Etc. 3 18452 dealybob 4 . . Item in array[2] is written over by last item in array. Subtract 1 from entries . . . entries 9998 4 9999**Problems**• This lookup table has one big problem. • Every time we wish to find something in it we must perform a linear search. • This is a O(n) • For many problems it would be too slow.**Search Methods**• Recall what we know about searching: • Method Time Drawbacks • Sequential O(n) Slow for large n • Binary O(log2n) Data must be contiguous Inserts and deletes are slow Data must be sorted first • Tree O(log2n) Tree must be balanced • Hashing O(1.1) Needs much unused memory • Direct access O(1) Keys must match array indices**The fastest search solutions**• The only way we can get O(1) efficiency in searching for Key in an array A is if either • 1. Key is the position (index) of the data in A A[key] • 2. There is a key-to-address transformation of ‘key’ (hashing function) into the index: A[hash(key)]**Direct access**• Alternative 1 (direct access) is often impractical because it requires keys to be in the range 0..array_size-1More often, key data is to be accessed by names, social security number, etc.Thus, a hashing function (a key-to-address transformation) is required for most data sets.**Perils of direct lookup**• Image your company stores data for their employees by the key field social security number (SSN) • These numbers are 9 digits long. • You would need an array of size 1,000,000,000 to hold all the possibilities from 000-00-0000 to 999-99-9999**Storing SSNs**000000000 To look up the information for an employee with the SSN 467-89-1234 you go directly to a[467891234]! 000000001 . . . 467891234 . . . . 999999998 999999999**Pros and cons of direct lookup**000000000 Advantages: Direct lookup Every employee has their own unique spot in the array 000000001 . . . 467891234 Disadvantages: You only ever use a small portion of the array. The rest of the space is wasted (and there is a lot of it)! . . . . 999999998 999999999**A better solution**• A better solution would map all the employees into a data structure without wasting a lot of space. • From the domain of all possible SSNs we want a structure that will store just the range of ones applying to our employees. • To do this we may have to convert the SSNs to something else.**An example of hash conversion**• You wish to store product information by product number. The product numbers have 5 digits with the lowest one being 10000. • const int max_size = 10000; • Then we could come up with a simple hash function • hashKey = productCode-max_Size; • This gives us a number between 0 and 9999. • We can use this unique number to directly access the array element containing data for that product.**Product code hash example**0 Product: Widget Product code: 11234 ProductArray[11234 - 10000] 1 . . . widget 1234 . . . . 9998 9999**Another implementation**0 • Or, we could use mod (%) • like this: 1 . . . Product: Widget Product code: 11234 ProductArray[11234 % 10000] widget 1234 . . . . 9998 9999**An internet example**• Internet Protocol (IP) uses 32-bit addresses to look up host names. • Example: 63.100.1.17 (4 bytes) • When you want to access a host machine that is not on your network the router takes the host name you give it and looks up the IP address. • It then forwards your request to that host computer.**Problem**• Routers need to look up addresses as quickly as possible. • There are millions of IP addresses, so how will it do this? • Linear search? O(n) • Binary search? O(logn) base 2 • Neither of these are fast enough.**Answer**• Convert the host name “joe@schmo.net” to it’s IP address using a hash function based on the characters. • Perhaps we could use ASCII codes to generate unique numbers within a certain range.**Issues in hashing**• Each hash should generate a unique number. If two different items produce the same hash code we have a collision in the data structure. Then what? • Two issues must be addressed • 1. Hash functions must minimize collisions (there are strategies to do this) • 2. If (when) collisions do occur, we must know how to handle them.**What if...**0 • We used mod (%) like • this: 1 . . . Product: Widget Product code: 11234 Product: Whatzit Product code: 12234 ProductArray[11234 % 1000] widget 234 . whatzit . . . 998 999**Two good rules to follow**• A good hashing function must • 1. Always produce the same address for a given key, and • 2. For two different keys, the probability of producing the same address must be low.**Picking good hash functions**• We want one that spreads values out evenly. (random distribution) • If all the values cluster together • We have more collisions • We waste space in the rest of the data structure**A bad hash function. Why?**A SSN of 123-45-6789 is turned into the x value of 123456789 and then converted by this function to the value 1234. Thus, all SSN’s in the company could be mapped into an array of 10,000 elements (indexes 0000 - 9999). This is good, but could lead to what problems?**The problem is...**• SSNs are similar for people of approximately the same age, living in the same area of the country. • Employees are likely to have the first four digits commonly the same. • This would cause lots of collisions • At best, it causes clustering**A rule for hash functions**• “a good hash function depends on the entire key, rather than just a part.” • It is best if we use all the digits rather than throwing some of them away with integer division. • A better approach would divide by a prime number and take the remainder, to thoroughly mix the digits.**Prime number division**• Divide SSN by 10,007 (the smallest prime number > 10,000). • The remainder is between 0 and 10,006 • So we dimension the array as a[10007] • Then compute the table index from the SSN key in our hash function as • Index = key % tableSize;**An example using primes**Data keys: 321, 330, 415, 498, 791 Hash function = key % 11**Hash table after first five entries inserted**0 3 3 0 1 2 3 2 1 3 4 9 8 4 5 6 7 8 4 1 5 9 1 0 7 9 1**Linear probing**• No matter how good a hash function is, collisions will occur • One method for handling them is ‘linear probing’ • When you hash into the table and find the spot already occupied, you go forward, in linear search fashion until you find a slot that has not been taken. Then insert the item there.**2. so probe forward, adding it**at the next free slot Hash table after 365 added 0 3 3 0 1 1. can't add 365 at its home address 2 3 2 1 3 4 9 8 4 3 6 5 5 6 7 8 4 1 5 9 1 0 7 9 1