
Tables

Tables. Chapter 10 Tables Overview Direct access to data via a key, through the Table ADT, implemented as a hash table. Chapter Objectives. 1. The Table ADT provides direct access to data indexed by a key. 2. Hashing techniques support very fast retrieval via keys.

brenna

Presentation Transcript


  1. Tables

  2. Chapter 10 • Tables • Overview • Direct access to data via a key, through the Table ADT, implemented as a hash table.

  3. Chapter Objectives • 1. The Table ADT provides direct access to data indexed by a key. • 2. Hashing techniques support very fast retrieval via keys. • 3. Two approaches to hash tables: linear probing and chained hashing.

  4. Retrieval by key • A common programming task is to look up an item in a list • We have already seen some simple ways of doing this • Linear search • Binary search • They differ in terms of data requirements and search efficiency
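
The two search methods named on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names `linearSearch` and `binarySearch` and the use of `std::vector<int>` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear search: works on unsorted data and examines up to n items.
// Returns the index of target, or -1 if it is absent.
int linearSearch(const std::vector<int>& items, int target) {
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (items[i] == target) return static_cast<int>(i);
    }
    return -1;
}

// Binary search: requires the data to be sorted in ascending order
// and halves the remaining range at each step.
// Returns the index of target, or -1 if it is absent.
int binarySearch(const std::vector<int>& sorted, int target) {
    // Debug-only check of the data requirement (itself an O(n) pass).
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    int low = 0;
    int high = static_cast<int>(sorted.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (sorted[mid] == target) return mid;
        if (sorted[mid] < target) low = mid + 1;
        else high = mid - 1;
    }
    return -1;
}
```

The difference in data requirements shows up in the code: only `binarySearch` needs sorted input, and in exchange it inspects far fewer items on large lists.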

  5. The Table ADT • Tables are data structures that allow data to be retrieved directly. • They may be multidimensional (unlike lists) • They may be implemented in an array form or using arrays of linked lists (or even linked lists of linked lists).

  6. The Table ADT • Implementation independent • Will be used later with array and linked implementations

  7. Table ADT: characteristics • A table ADT T stores data of some type (tableElementType) with an associated key (tableKeyType).

  8. Table ADT: operations • Operations • bool T.lookup(tableKeyType lookupKey, • tableElementType & data) • Precondition: None • Postcondition: If lookupKey equals a key in the table, the value of data is set to the data associated with that key; otherwise, the value of data is undefined. • Returns: true if and only if lookupKey equals a key in the table

  9. Table ADT: insert() • void T.insert(tableKeyType insertKey, tableElementType insertData) • Precondition: None • Postcondition: insertData and associated insertKey are stored in T, i.e., T.lookup(insertKey, data) == true, and the value of data will be set to insertData upon return. • Note: The Postcondition implies that if insertKey duplicates an existing key within the table, the data associated with that key is replaced by insertData.

  10. Table ADT, deleteKey() • void T.deleteKey(tableKeyType deleteKey) • Precondition: None • Postcondition: T.lookup(deleteKey, data) will return false

  11. Table ADT • Linear implementation • // for an array implementation, // need a max table size • const int MAX_TABLE = 100;

  12. Table Class: public section • template < class tableKeyType, class tableDataType > • class Table • { • public: • Table(); // Table constructor • bool lookup(tableKeyType lookupKey, tableDataType & data); • void insert(tableKeyType insertKey, tableDataType insertData); • void deleteKey(tableKeyType deleteKey);

  13. Table ADT, private section • private: • // implementation via an unordered array of structs • struct item { • tableKeyType key; • tableDataType data; • }; • item T[MAX_TABLE]; // stores the items in the table • int entries; // keep track of number of entries in table • int search(tableKeyType key); // an internal routine for searching table • };

  14. The classic ‘lookup table’ • The data key is the product code and the data itself is the item name. • Example item: Product: Widget, Product code: 11234 • Array contents (index: key, data): 0: 15796 foobars; 1: 17556 dohinky; 2: 11234 widget; 3: 14322 etc.; the indices continue to 9999 • We also keep track of the number of entries in the table (entries = 4).

  15. Hash Table, linear version • template < class tableKeyType, class tableDataType > • int Table < tableKeyType, tableDataType > • ::search(tableKeyType key) • { // internal routine for implementation -- searches in table for the key -- if found, returns its position; • // else it returns the current value of "entries" -- which is the index 1 past the last item in the table • int pos; • for (pos = 0; pos < entries && T[pos].key != key; pos++) • ; • return pos; • }

  16. Search() • Searching for Product: Widget, Product code: 11234 • Array contents (index: key, data): 0: 15796 foobars; 1: 17556 dohinky; 2: 11234 widget; 3: 14322 etc.; the indices continue to 9999 • Search returns the location where the item was found (here, 2). • If the item is not found, search returns a value equal to entries (here, 4).

  17. Table constructor • template < class tableKeyType, class tableDataType > • Table < tableKeyType, tableDataType >::Table() • { • entries = 0; • }

  18. Insert() • template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::insert(tableKeyType key, tableDataType data) • { • int pos(search(key)); // set pos to search results • if (pos == entries) { // new key • assert(entries < MAX_TABLE); // only a new key needs a free slot • entries++; • } • T[pos].key = key; • T[pos].data = data; • }

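  19. Insert() • Inserting Product: dealybob, Product code: 18452 • Search returns 4 (the key is not in the table, so the result equals entries) • Insert the item in array[4] and add 1 to entries • Array contents afterwards (index: key, data): 0: 15796 foobars; 1: 17556 dohinky; 2: 11234 widget; 3: 14322 etc.; 4: 18452 dealybob • entries = 5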

  20. lookup() • template < class tableKeyType, class tableDataType > • bool Table < tableKeyType, tableDataType > • ::lookup(tableKeyType key, tableDataType &data) • { int pos(search(key)); // set pos to search results • if (pos == entries) // not found • return false; • else { • data = T[pos].data; • return true; • } • }

  21. deleteKey() • template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::deleteKey(tableKeyType key) • { • int pos(search(key)); // set pos to search results • if (pos < entries) { // otherwise, not found, so do nothing • // copy last entry into this position • --entries; • T[pos] = T[entries]; • } • }

  22. deleteKey() • Deleting Product: widget, Product code: 11234 • Search returns 2 • The item in array[2] is overwritten by the last item in the array, and 1 is subtracted from entries • Array contents before the delete (index: key, data): 0: 15796 foobars; 1: 17556 dohinky; 2: 11234 widget; 3: 14322 etc.; 4: 18452 dealybob • entries = 4 after the delete
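
For reference, the pieces of the linear Table class shown on the preceding slides can be gathered into one compilable sketch. It follows the slide code closely, with two choices that are assumptions of this sketch rather than the presentation's: the member functions are defined inside the class for brevity, and the capacity check in `insert()` runs only when a new key is being added.

```cpp
#include <cassert>
#include <string>

const int MAX_TABLE = 100;  // maximum table size, as on the slides

template <class tableKeyType, class tableDataType>
class Table {
public:
    Table() : entries(0) {}

    // Returns true and sets data if key is in the table.
    bool lookup(tableKeyType key, tableDataType& data) {
        int pos = search(key);
        if (pos == entries) return false;   // not found
        data = T[pos].data;
        return true;
    }

    // Stores data under key, replacing any existing data for that key.
    void insert(tableKeyType key, tableDataType data) {
        int pos = search(key);
        if (pos == entries) {               // new key: needs a free slot
            assert(entries < MAX_TABLE);
            entries++;
        }
        T[pos].key = key;
        T[pos].data = data;
    }

    // Removes key if present by copying the last entry over it.
    void deleteKey(tableKeyType key) {
        int pos = search(key);
        if (pos < entries) {
            --entries;
            T[pos] = T[entries];
        }
    }

private:
    struct item {
        tableKeyType key;
        tableDataType data;
    };
    item T[MAX_TABLE];  // unordered array of items
    int entries;        // number of entries currently in the table

    // Linear search: returns the position of key, or entries if absent.
    int search(tableKeyType key) {
        int pos = 0;
        while (pos < entries && T[pos].key != key) pos++;
        return pos;
    }
};
```

A typical use would declare `Table<int, std::string> products;`, call `products.insert(11234, "widget");`, and then retrieve the name with `products.lookup(11234, name)`. Every operation goes through `search()`, which is why each one costs a linear scan.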

  23. Problems • This lookup table has one big problem. • Every time we wish to find something in it we must perform a linear search. • This is an O(n) operation. • For many problems it would be too slow.

  24. Search Methods • Recall what we know about searching: • Method | Time | Drawbacks • Sequential | O(n) | Slow for large n • Binary | O(log2 n) | Data must be contiguous; inserts and deletes are slow; data must be sorted first • Tree | O(log2 n) | Tree must be balanced • Hashing | O(1.1) | Needs much unused memory • Direct access | O(1) | Keys must match array indices

  25. The fastest search solutions • The only way we can get O(1) efficiency in searching for a key in an array A is if either • 1. The key is the position (index) of the data in A: A[key] • 2. There is a key-to-address transformation (a hashing function) of the key into the index: A[hash(key)]

  26. Direct access • Alternative 1 (direct access) is often impractical because it requires keys to be in the range 0..array_size-1. • More often, data is to be accessed by name, social security number, etc. • Thus, a hashing function (a key-to-address transformation) is required for most data sets.

  27. Perils of direct lookup • Imagine your company stores data for its employees keyed by social security number (SSN). • These numbers are 9 digits long. • You would need an array of size 1,000,000,000 to hold all the possibilities from 000-00-0000 to 999-99-9999.

  28. Storing SSNs • The array a is indexed from 000000000 to 999999999. • To look up the information for an employee with the SSN 467-89-1234 you go directly to a[467891234]!

  29. Pros and cons of direct lookup • Advantages: • Direct lookup • Every employee has their own unique spot in the array (indices 000000000 to 999999999) • Disadvantages: • You only ever use a small portion of the array. • The rest of the space is wasted (and there is a lot of it)!

  30. A better solution • A better solution would map all the employees into a data structure without wasting a lot of space. • From the domain of all possible SSNs we want a structure that will store just the range of ones applying to our employees. • To do this we may have to convert the SSNs to something else.

  31. Hash function map

  32. An example of hash conversion • You wish to store product information by product number. The product numbers have 5 digits with the lowest one being 10000. • const int max_size = 10000; • Then we could come up with a simple hash function • hashKey = productCode - max_size; • For product codes from 10000 to 19999 this gives us a number between 0 and 9999. • We can use this unique number to directly access the array element containing data for that product.

  33. Product code hash example • Product: Widget, Product code: 11234 • ProductArray[11234 - 10000] = ProductArray[1234] • widget is stored at index 1234 of an array indexed 0 to 9999

  34. Another implementation • Or, we could use mod (%) like this: • Product: Widget, Product code: 11234 • ProductArray[11234 % 10000] = ProductArray[1234] • widget is again stored at index 1234 of an array indexed 0 to 9999
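
The two product-code conversions from these slides can be compared side by side. In this sketch the function names `hashBySubtraction` and `hashByMod` and the constant `lowest_code` are hypothetical names chosen for illustration; the table size of 10000 comes from the slides.

```cpp
#include <cassert>

const int max_size = 10000;     // table size from the slides
const int lowest_code = 10000;  // smallest product code (assumed name)

// Subtraction: only valid while the result stays inside the table,
// i.e. for codes from lowest_code to lowest_code + max_size - 1.
int hashBySubtraction(int productCode) {
    assert(productCode >= lowest_code &&
           productCode < lowest_code + max_size);
    return productCode - lowest_code;
}

// Modulus: always lands in 0..max_size-1 for non-negative codes,
// but different codes can then share an index.
int hashByMod(int productCode) {
    assert(productCode >= 0);
    return productCode % max_size;
}
```

For product code 11234 both functions give index 1234. The modulus version accepts any non-negative code, at the price that codes such as 11234 and 21234 would map to the same slot.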

  35. An internet example • Internet Protocol (IP) uses 32-bit addresses to look up host names. • Example: 63.100.1.17 (4 bytes) • When you want to access a host machine that is not on your network the router takes the host name you give it and looks up the IP address. • It then forwards your request to that host computer.

  36. Problem • Routers need to look up addresses as quickly as possible. • There are millions of IP addresses, so how will it do this? • Linear search? O(n) • Binary search? O(log2 n) • Neither of these is fast enough.

  37. Answer • Convert the host name “joe@schmo.net” to its IP address using a hash function based on the characters. • Perhaps we could use ASCII codes to generate unique numbers within a certain range.

  38. Issues in hashing • Each hash should generate a unique number. If two different items produce the same hash code we have a collision in the data structure. Then what? • Two issues must be addressed • 1. Hash functions must minimize collisions (there are strategies to do this) • 2. If (when) collisions do occur, we must know how to handle them.

  39. A Collision

  40. What if... • We used mod (%) with a table of 1000 slots (indices 0 to 999), like this: ProductArray[11234 % 1000] • Product: Widget, Product code: 11234 hashes to index 234 • Product: Whatzit, Product code: 12234 also hashes to index 234 • widget and whatzit collide at the same slot
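
The collision on this slide can be checked directly. The sketch below assumes a table of 1000 slots as in the slide; the names `hashMod1000` and `collides` are hypothetical helpers introduced for the example.

```cpp
#include <cassert>

const int small_size = 1000;  // table size from the "What if..." slide

// Maps a product code to an index in 0..999.
int hashMod1000(int productCode) {
    assert(productCode >= 0);
    return productCode % small_size;
}

// True when two different keys hash to the same index.
bool collides(int keyA, int keyB) {
    return keyA != keyB && hashMod1000(keyA) == hashMod1000(keyB);
}
```

Here `collides(11234, 12234)` is true because both codes end in 234, so the modulus discards exactly the digit in which they differ.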

  41. Two good rules to follow • A good hashing function must • 1. Always produce the same address for a given key, and • 2. For two different keys, the probability of producing the same address must be low.

  42. Picking good hash functions • We want one that spreads values out evenly. (random distribution) • If all the values cluster together • We have more collisions • We waste space in the rest of the data structure

  43. A bad hash function. Why? An SSN of 123-45-6789 is turned into the x value of 123456789 and then converted by this function to the value 1234. Thus, all SSNs in the company could be mapped into an array of 10,000 elements (indexes 0000 - 9999). This is good, but could lead to what problems?

  44. The problem is... • SSNs are similar for people of approximately the same age, living in the same area of the country. • Employees are likely to have the first four digits commonly the same. • This would cause lots of collisions • At best, it causes clustering

  45. A rule for hash functions • “a good hash function depends on the entire key, rather than just a part.” • It is best if we use all the digits rather than throwing some of them away with integer division. • A better approach would divide by a prime number and take the remainder, to thoroughly mix the digits.

  46. Prime number division • Divide SSN by 10,007 (the smallest prime number > 10,000). • The remainder is between 0 and 10,006 • So we dimension the array as a[10007] • Then compute the table index from the SSN key in our hash function as • Index = key % tableSize;
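
The contrast between the two SSN hash functions can be sketched in code. The slide image with the "bad" function is not in the transcript, so `leadingDigitsHash` below is an assumed reconstruction (integer division by 100000, which does turn 123456789 into 1234); `primeHash` follows the prime division rule from the slide above. Both function names are hypothetical.

```cpp
#include <cassert>

const long tableSize = 10007;  // smallest prime greater than 10,000

// Assumed form of the "bad" function: keeps only the leading four
// digits of the SSN and throws the other five away.
long leadingDigitsHash(long ssn) {
    assert(ssn >= 0 && ssn <= 999999999L);
    return ssn / 100000;
}

// Prime division: the remainder depends on every digit of the key.
long primeHash(long ssn) {
    assert(ssn >= 0 && ssn <= 999999999L);
    return ssn % tableSize;
}
```

Two SSNs that share their first four digits, such as 123-45-6789 and 123-49-9999, always collide under `leadingDigitsHash`, while `primeHash` sends these two to different indices.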

  47. An example using primes Data keys: 321, 330, 415, 498, 791 Hash function = key % 11

  48. Hash table after first five entries inserted • 0: 330 • 1: (empty) • 2: 321 • 3: 498 • 4 to 7: (empty) • 8: 415 • 9: (empty) • 10: 791
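
The home addresses in this table follow from the hash function key % 11. A small sketch, with `hashKey` and `homeAddresses` as hypothetical names, computes them for the five keys:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int tableSize = 11;  // prime table size from the example

// Hash function from the slide: key % 11.
int hashKey(int key) {
    assert(key >= 0);
    return key % tableSize;
}

// Returns the home address of each key, in the same order as keys.
std::vector<int> homeAddresses(const std::vector<int>& keys) {
    std::vector<int> addresses;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        addresses.push_back(hashKey(keys[i]));
    }
    return addresses;
}
```

For the keys 321, 330, 415, 498 and 791 this yields the addresses 2, 0, 8, 3 and 10, so none of the first five insertions collide.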

  49. Linear probing • No matter how good a hash function is, collisions will occur • One method for handling them is ‘linear probing’ • When you hash into the table and find the spot already occupied, you go forward, in linear search fashion until you find a slot that has not been taken. Then insert the item there.

  50. Hash table after 365 added • 1. 365 can't be added at its home address (2), which is already occupied by 321 • 2. So probe forward, adding it at the next free slot (4) • 0: 330 • 1: (empty) • 2: 321 • 3: 498 • 4: 365 • 5 to 7: (empty) • 8: 415 • 9: (empty) • 10: 791
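
Linear probing for this example might be sketched as below. This is an illustration under stated assumptions, not the presentation's implementation: the class name `ProbingTable`, the `EMPTY` sentinel of -1 (so keys must be non-negative), and wrapping around from the last slot to slot 0 are all choices made for the sketch, and deletion is left out.

```cpp
#include <cassert>
#include <vector>

const int TABLE_SIZE = 11;  // prime table size from the example
const int EMPTY = -1;       // assumed sentinel: keys must be >= 0

class ProbingTable {
public:
    ProbingTable() : slots(TABLE_SIZE, EMPTY) {}

    // Stores key and returns the slot index used, or -1 if the table is full.
    int insert(int key) {
        assert(key >= 0);
        int pos = key % TABLE_SIZE;               // home address
        for (int probes = 0; probes < TABLE_SIZE; ++probes) {
            if (slots[pos] == EMPTY || slots[pos] == key) {
                slots[pos] = key;
                return pos;
            }
            pos = (pos + 1) % TABLE_SIZE;         // probe forward, wrapping
        }
        return -1;                                // every slot is taken
    }

    // Returns the slot holding key, or -1 if key is not in the table.
    int find(int key) const {
        assert(key >= 0);
        int pos = key % TABLE_SIZE;
        for (int probes = 0; probes < TABLE_SIZE; ++probes) {
            if (slots[pos] == EMPTY) return -1;   // a gap ends the search
            if (slots[pos] == key) return pos;
            pos = (pos + 1) % TABLE_SIZE;
        }
        return -1;
    }

private:
    std::vector<int> slots;
};
```

Inserting 321, 330, 415, 498 and 791 and then 365 reproduces the slide: 365 hashes to 2, finds slots 2 and 3 occupied, and settles in slot 4. A lookup repeats the same probe sequence, which is why `find` can stop at the first empty slot.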
