
CSCI 2342 Data Structures and Algorithms II Dr. Tami Meredith



  1. CSCI 2342 Data Structures and Algorithms II, Dr. Tami Meredith. Lecture 4: Dictionaries (Chapter 18)

  2. Abstraction in Program Design • Problems can be viewed at many levels of abstraction • In particular, the data structures we have seen are not necessarily unrelated • Some "data structures" we have seen are actually just concepts for data management • E.g., a priority queue is a data management concept implemented using a heap • The textbook uses the term ADT for both data storage structures and data management abstractions

  3. Abstraction Layers • Examples: • Priority queue with an array-based heap in C++ • Dictionary using a linked BST with C pointers • Table using an array-based BST Index in a file • Set using a linked-list with Java objects and references

  4. Key Concepts • High-level data management techniques such as queues, tables, or dictionaries are not data structures but abstraction concepts • Data structures are a program design concept for storing and retrieving data • Implementation is a programming team concern • The squeeze ...

  5. Data Organisation • Data is almost always organised into records; e.g., all the data for a person, order, meeting, ... • Records are broken into fields; i.e., their parts • One (or more) field(s) are used as a key (see Chapter 11) • Keys are used to sort and retrieve records • Often we must search a collection of data for specific information • Example: Assignment 1 • Record: Athlete • Fields: name, income, salary, endorsements, sport • Key: name (this is how we sorted and looked it up)

  6. Dictionaries • A data management concept where data manipulation is performed based on the key • Sometimes called a map (i.e., it maps a key to its record) • May also be called a table, though some would dispute that usage • Dictionary behavior changes significantly if keys are not unique (i.e., duplicates can exist) • Many systems require unique keys • Access or search by anything other than the key is inefficient

  7. FIGURE 18-3 A dictionary entry Dictionary Entries Note: The data item can be broken up into fields – it is not necessarily a single field and is only shown as one since the ADT never manipulates the data in any way and treats it as a single entity

  8. FIGURE 18-1 A collection of data about certain cities Example Dictionary (Key = City) Consider the need for searches through this data based on fields other than the name of the city

  9. Operations: Dictionary ADT • Insert a new item into the dictionary • Remove the item with a given search key from the dictionary • Find the item with a given search key in the dictionary • Traverse the items in the dictionary in sorted search-key order • Test whether the dictionary is empty • Get the number of items in the dictionary • Remove all items from the dictionary • Test whether the dictionary contains an item with a given search key (essentially the same as find)
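A minimal sketch of these operations as a C++ class template, to make the ADT concrete; the names and signatures are illustrative, not the textbook's exact interface:

  template <typename KeyType, typename ItemType>
  class DictionaryInterface {
  public:
      virtual bool isEmpty() const = 0;                        // test whether dictionary is empty
      virtual int getNumberOfItems() const = 0;                // get number of items
      virtual bool add(const KeyType& key, const ItemType& item) = 0;  // insert new item
      virtual bool remove(const KeyType& key) = 0;             // remove item with given search key
      virtual void clear() = 0;                                // remove all items
      virtual ItemType getItem(const KeyType& key) const = 0;  // find item with given search key
      virtual bool contains(const KeyType& key) const = 0;     // test for a given search key
      virtual void traverse(void visit(ItemType&)) = 0;        // visit items in sorted key order
      virtual ~DictionaryInterface() {}
  };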

  10. Dictionary ADT (UML)

  11. High-Level Implementation • Have choices in our selection of data structures to implement a dictionary • Some choices are high level: sorted vs. unsorted • Some choices are low level: array vs. linked • Sorted (by search key), array-based • Sorted (by search key), link-based • Unsorted, array-based • Unsorted, link-based

  12. Options for Implementation • Sorted Linked List • Unsorted Linked List • Sorted Array (could use STL Vector for array) • Unsorted Array • Unsorted Binary Tree • Binary Search Tree • ... others we haven't learned yet ...

  13. FIGURE 18-4 The data members for two sorted linear implementations of the ADT dictionary for the data in Figure 18-1 (a) array based (b) link based Possible Implementations

  14. FIGURE 18-5 The data members for a binary search tree implementation of the ADT dictionary for the data in Figure 18-1 Possible Implementations

  15. Selecting an Implementation • Reasons for considering linear implementations • Perspective • Efficiency • Motivation • Questions to ask • What operations are needed? • How often is each operation required?

  16. Selecting an Implementation: Efficiency, Memory Use, and Implementation Factors • Amount of data • Frequency of change in data (insert/delete) – static data can be better optimized • Frequency of search • Availability of existing data structures (reuse) • Memory usage and availability • Cost of implementation and testing • Time available for implementation and testing

  17. Selecting an Implementation • Three scenarios: • Insertion and traversal in no particular order • Not a dictionary, just a list of stuff • Retrieval only (static data) – consider: • Binary search of a sorted array is equivalent to retrieval from a BST: O(log₂ n) • Is there enough data to justify sorting? • Is hashing possible? • Insertion, removal, retrieval, and traversal in sorted order (dynamic data) • How much data? • How critical is this data to the application?
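For reference, a minimal binary-search sketch over a sorted array of integer keys (illustrative code, not from the textbook):

  // Binary search of a sorted array: O(log2 n) comparisons,
  // the same order as retrieval from a balanced BST.
  int binarySearch(const int sortedKeys[], int n, int target) {
      int lo = 0, hi = n - 1;
      while (lo <= hi) {
          int mid = lo + (hi - lo) / 2;  // midpoint, written to avoid overflow
          if (sortedKeys[mid] == target) return mid;
          if (sortedKeys[mid] < target) lo = mid + 1;  // search upper half
          else hi = mid - 1;                           // search lower half
      }
      return -1;  // not found
  }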

  18. FIGURE 18-6 Insertion for unsorted linear implementations (a) array based (b) link based Selecting an Implementation

  19. FIGURE 18-7 Insertion for sorted linear implementations (a) array based (b) pointer based Selecting an Implementation

  20. FIGURE 18-8 The average-case order of dictionary operations for various implementations Selecting an Implementation Note: Ignore traversal (it's always linear)

  21. Change of Direction -- Hashing

  22. Can we do better? • Idea: If we had 100 students at SMU, we could: • give them all a student number from 00 to 99 • store their data in an array • use the student number as an array offset • If we knew the student number: search = insertion = deletion = O(1) • Would work at SMU, but using the full 8-digit A-numbers as array indices would need an array of 100,000,000 entries – fast but inefficient, with lots of wasted space
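A minimal sketch of this direct-addressing idea for the 100-student case; StudentRecord and the function names are illustrative:

  #include <string>

  // Direct addressing: the key itself is the array index, so every
  // operation is O(1), but the table must span the whole key space.
  struct StudentRecord {
      std::string name;
      bool inUse = false;  // marks whether this slot holds a real record
  };

  StudentRecord table[100];  // keys are student numbers 00..99

  void insert(int studentNumber, const std::string& name) {
      table[studentNumber].name = name;
      table[studentNumber].inUse = true;
  }

  bool lookup(int studentNumber, std::string& nameOut) {
      if (!table[studentNumber].inUse) return false;  // no such student
      nameOut = table[studentNumber].name;
      return true;
  }

Scaling the same table to 8-digit A-numbers is what forces the 100,000,000-entry array mentioned above.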

  23. Hashing • Need a different strategy to locate an item • Consider a "magic box" that acts as an address calculator • Place/retrieve the item at that address in an array • Ideally, the calculator maps each key to a unique number • FIGURE 18-9 Address calculator

  24. Hashing • The "Address Calculator" turns the key (e.g., an A-number) into a smaller number so we can use a smaller array • We call an address calculator a hash function: index = hash(key)

  25. Hashing • Pseudocode for getItem

  26. Hashing • Pseudocode for remove
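The pseudocode figures for getItem and remove are not reproduced above. Below is a minimal C++ sketch of both operations, assuming an open-addressing table with linear probing and the occupied/empty/removed slot states discussed on slide 38; all names are illustrative:

  #include <string>

  const int TABLE_SIZE = 101;
  enum State { EMPTY, OCCUPIED, REMOVED };

  struct Slot {
      int key;
      std::string item;
      State state = EMPTY;
  };

  Slot table[TABLE_SIZE];

  int hash(int key) { return key % TABLE_SIZE; }

  // getItem: probe from hash(key) until the key is found or an
  // empty (never-used) slot proves the key is absent.
  bool getItem(int key, std::string& itemOut) {
      int i = hash(key);
      for (int probes = 0; probes < TABLE_SIZE; probes++) {
          if (table[i].state == EMPTY) return false;  // key was never stored
          if (table[i].state == OCCUPIED && table[i].key == key) {
              itemOut = table[i].item;
              return true;
          }
          i = (i + 1) % TABLE_SIZE;  // linear probe with wrap-around
      }
      return false;  // examined every slot
  }

  // remove: find the slot as getItem does, then mark it REMOVED
  // (not EMPTY) so later probe sequences keep searching past it.
  bool remove(int key) {
      int i = hash(key);
      for (int probes = 0; probes < TABLE_SIZE; probes++) {
          if (table[i].state == EMPTY) return false;
          if (table[i].state == OCCUPIED && table[i].key == key) {
              table[i].state = REMOVED;
              return true;
          }
          i = (i + 1) % TABLE_SIZE;
      }
      return false;
  }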

  27. Properties • Would like unique array indices – having two students hash to the same array slot is a problem • Would like to use as small an array as possible • Situation: 9000 students at SMU with an 8-digit A-number • Problem: Need a function to turn these 9000 A-numbers into unique numbers in the range [0..n], where n is at least 9000 but as small as possible

  28. Hash Functions • Possible algorithms • Selecting digits • Folding • Modulo arithmetic • Converting a character string to an integer • Use ASCII values • Factor the results, Horner’s rule

  29. Selecting Digits • Very simple and easy • Leads to a poor distribution across the hash table • Not unique • Example: 3rd and 7th digit • 12345678 → 37 • 98454562 → 46 • 15477869 → 46

  30. Folding • Simply add the digits • Note that for 9 digits 0 ≤ hash(key) ≤ 81 • Limited range of results but can be improved by adding in different ways • Example: • 12345678 → 1+2+3+4+5+6+7+8 = 36 • 98454562 → 9+8+4+5+4+5+6+2 = 43 • 15477869 → 1+5+4+7+7+8+9+6 = 47

  31. Modulo Arithmetic • Simple and effective • Yields generally good results when the table size is prime: index = key % p • where p is the table size and is prime • Example: where p = 101 • 12345678 → 12345678 % 101 = 44 • 98454562 → 98454562 % 101 = 65 • 15477869 → 15477869 % 101 = 23

  32. Finding Primes (Fast and Easy)
  // Sieve of Eratosthenes
  const int N = 10000000;
  static bool isPrime[N];  // static: an array this large would overflow the stack
  for (int i = 0; i < N; i++) isPrime[i] = true;  // initialize isPrime[]
  isPrime[0] = isPrime[1] = false;  // 0 and 1 are not prime
  for (int i = 2; i * i < N; i++) {  // only need divisors up to sqrt(N)
      if (isPrime[i]) {
          for (int j = 2 * i; j < N; j += i) {
              isPrime[j] = false;  // cross out every multiple of i
          }
      }
  }
  // isPrime[k] is now true exactly when k is prime, for all k < N

  33. Working with Characters • Could use numbers: a=1, b=2, ... • Similarly, could use the ASCII values (or parts thereof) • Yields very large numbers quickly • Adding, folding, or modulo methods can reduce these

  34. Horner's Rule • Let a=1, b=2, and so on • Thus "note" = 14 15 20 5 • In binary, we have: 01110 01111 10100 00101 • We can concatenate the binary to make a single number: 01110011111010000101 • This is 14*32³ + 15*32² + 20*32¹ + 5*32⁰ • Horner noticed that this is more efficiently calculated as: (((14 * 32) + 15) * 32 + 20) * 32 + 5
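A minimal sketch of this computation for lowercase strings, reducing modulo the (prime) table size at each step so the running value never overflows; the function name and constants are illustrative:

  #include <string>

  // Horner's rule: evaluate the base-32 number one "digit" at a time.
  int hornerHash(const std::string& key, int tableSize) {
      long long h = 0;
      for (char c : key) {
          int digit = c - 'a' + 1;           // a=1, b=2, ..., as on the slide
          h = (h * 32 + digit) % tableSize;  // taking % at each step preserves the result
      }
      return (int)h;
  }

For example, hornerHash("note", 101) computes ((((14*32) + 15)*32 + 20)*32 + 5) mod 101 = 57 without ever forming the full 20-bit number.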

  35. Perfect Hashing • If SMU has 9876 students, the ideal would be to find a hash function that produces an integer n ∈ [0..9875] with all n being unique (no duplicates) • Such a function is called a perfect hash function • Perfect hash functions are possible if the data are known in advance • See gperf – http://www.gnu.org/software/gperf/ • generates C or C++ code for perfect hash functions and tables for a set of strings

  36. Collisions • When the keys are not known in advance, perfect hash functions are not possible • A collision is when hash(k1) = hash(k2) and k1 ≠ k2 • We have two choices: • Use a table so big that collisions can't occur, wasting a lot of space • Do a bit of extra work to find an empty space in a smaller table – "collision resolution"

  37. FIGURE 18-10 A collision Collisions

  38. Resolving Collisions • Approach 1: Open addressing • Probe (search) for another available location • Can be done linearly or quadratically • Use wrap-around at the end of the table • Removal requires recording the state of each slot: • Occupied, empty, or removed • This is because a lookup stops when probing reaches a never-used empty slot • Clustering is a problem, and clusters can merge (YUCK!) • Hard to tell when to stop with quadratic probing (when have all slots been examined?) • Double hashing can reduce clustering

  39. Probing Techniques: if hash(key) = n, check: • Linear probing: n, n+1, n+2, n+3, ... • Quadratic probing: n, n+1², n+2², n+3², ... • Double hashing: n+hash2(key), n+2*hash2(key), n+3*hash2(key), ...
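A minimal sketch of the three probe sequences; hash2 is an illustrative second hash function (its exact form is an assumption, not from the slides), and every index wraps modulo the table size:

  const int TABLE_SIZE = 101;

  int hash1(int key) { return key % TABLE_SIZE; }

  // Second hash for double hashing; it must never return 0, or the
  // probe sequence would stand still. R - (key % R) with a prime
  // R < TABLE_SIZE is one common choice (an assumption here).
  int hash2(int key) { return 97 - (key % 97); }

  // The i-th slot examined (i = 0, 1, 2, ...) under each scheme:
  int linearProbe(int key, int i)     { return (hash1(key) + i) % TABLE_SIZE; }
  int quadraticProbe(int key, int i)  { return (hash1(key) + i * i) % TABLE_SIZE; }
  int doubleHashProbe(int key, int i) { return (hash1(key) + i * hash2(key)) % TABLE_SIZE; }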

  40. FIGURE 18-11 Linear probing with h(x) = x mod 101 Linear Probing

  41. FIGURE 18-12 Quadratic probing with h(x) = x mod 101 Quadratic Probing

  42. FIGURE 18-13 Double hashing during the insertion of 58, 14, and 91 Double Hashing

  43. Resolving Collisions • Approach 2: Restructuring the hash table • Each hash location can accommodate more than one item • Each location is a “bucket” or an array itself • Table size is known and fixed, lots of wasted space, no dynamic memory management • Alternatively, design the hash table as an array of linked chains – called “separate chaining”

  44. FIGURE 18-14 Separate chaining Resolving Collisions
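A minimal sketch of a separate-chaining table like the one in Figure 18-14, using the standard library's list and vector; the class and names are illustrative and assume non-negative integer keys:

  #include <list>
  #include <string>
  #include <vector>

  // Separate chaining: each table slot holds a linked chain of all
  // entries that hash to that slot, so the table never "fills up".
  struct Entry {
      int key;
      std::string item;
  };

  class ChainedTable {
  public:
      explicit ChainedTable(int size) : chains(size) {}

      void insert(int key, const std::string& item) {  // assumes unique keys
          chains[key % chains.size()].push_back({key, item});
      }

      bool find(int key, std::string& itemOut) const {
          for (const Entry& e : chains[key % chains.size()]) {
              if (e.key == key) { itemOut = e.item; return true; }
          }
          return false;  // chain exhausted: key not present
      }

  private:
      std::vector<std::list<Entry>> chains;
  };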

  45. The Efficiency of Hashing • Hashing is O(1) when the hash function is perfect; collisions reduce this efficiency • Efficiency is measured in terms of the load factor α = (number of items) / (table size) • Can change efficiency by changing the table size • Best values for α are less than ⅔

  46. The Efficiency of Hashing • Linear probing – average number of comparisons for a given load factor α: • Successful search: ½ [1 + 1/(1 − α)] • Unsuccessful search: ½ [1 + 1/(1 − α)²]

  47. The Efficiency of Hashing • Quadratic probing and double hashing – average number of comparisons for a given α: • Successful search: −ln(1 − α)/α • Unsuccessful search: 1/(1 − α)

  48. The Efficiency of Hashing • Separate chaining – average number of comparisons for a given α: • Successful search: 1 + α/2 • Unsuccessful search: α

  49. The Efficiency of Hashing (Fig 18-15)

  50. Maintaining Hashing Performance • Insertions (collisions and their resolution) cause the load factor α to increase • To maintain efficiency, restrict the size of α: • α < 0.5 for open addressing • α < 1.0 for separate chaining • If the load factor exceeds these limits: • Increase the size of the hash table • Rehash every item with the new hash function (sketched below)
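A minimal sketch of that grow-and-rehash step for an open-addressing table with linear probing; the Slot layout and helper names are illustrative, and isPrime/nextPrime are small helpers included only to keep the sketch self-contained:

  #include <vector>

  struct Slot { int key = 0; bool occupied = false; };

  bool isPrime(int n) {
      if (n < 2) return false;
      for (int d = 2; d * d <= n; d++)
          if (n % d == 0) return false;
      return true;
  }
  int nextPrime(int n) { while (!isPrime(n)) n++; return n; }

  // Grow the table to a prime size at least twice as large, then
  // re-insert every occupied entry; old indices are meaningless
  // once the table size (and hence the hash function) changes.
  void rehash(std::vector<Slot>& table) {
      std::vector<Slot> old = table;
      int newSize = nextPrime(2 * (int)old.size());
      table.assign(newSize, Slot{});
      for (const Slot& s : old) {
          if (!s.occupied) continue;
          int i = s.key % newSize;   // hash again with the new size
          while (table[i].occupied)  // linear probe for a free slot
              i = (i + 1) % newSize;
          table[i] = s;
      }
  }

After rehashing, the same number of items sits in a table at least twice as large, so α is at least halved and probe sequences shorten again.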
