
Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications



Presentation Transcript


  1. Carnegie Mellon Univ., Dept. of Computer Science, 15-415/615 – DB Applications. Faloutsos & Pavlo. Lecture #11 (R&G ch. 11): Hashing

  2. Outline • (static) hashing • extendible hashing • linear hashing • Hashing vs B-trees CMU SCS 15-415/615

  3. (Static) Hashing. Problem: “find EMP record with ssn=123”. What if disk space were free, and time were at a premium?

  4. Hashing. A: Brilliant idea: key-to-address transformation, i.e., use the key itself as the page number. [Figure: pages #0 through #999,999,999; the record “123; Smith; Main str” is stored on page #123.]

  5. Hashing. Since space is NOT free: • use M, instead of 999,999,999 slots • hash function: h(key) = slot-id [Figure: pages #0 through #999,999,999, with the record “123; Smith; Main str” on page #123.]

  6. Hashing. Typically, each hash bucket is a page, holding many records. [Figure: a table of M pages starting at #0; the record “123; Smith; Main str” is on page #h(123).]

  7. Hashing. Notice: could have clustering, or non-clustering versions. [Figure: a table of M pages starting at #0; page #h(123) holds the full record “123; Smith; Main str.”]

  8. Hashing. Notice: could have clustering, or non-clustering versions. [Figure: page #h(123) holds only the key 123, which refers to a separate EMP file containing the records “234; Johnson; Forbes ave”, “123; Smith; Main str.” and “345; Tompson; Fifth ave”.]

  9. Indexing-overview • hashing • hashing functions • size of hash table • collision resolution • extendible hashing • Hashing vs B-trees

  10. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

  11. Design decisions - functions • Goal: uniform spread of keys over hash buckets • Popular choices: • Division hashing • Multiplication hashing

  12. Division hashing h(x) = (a*x+b) mod M • e.g., h(ssn) = (ssn) mod 1,000 • gives the last three digits of ssn • M: size of hash table - choose a prime number, defensively (why?)
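A minimal sketch of the division hash on slide 12, assuming non-negative integer keys. The function name `division_hash` and the defaults a=1, b=0 are illustrative choices, not part of the lecture.

```python
def division_hash(x: int, M: int, a: int = 1, b: int = 0) -> int:
    """Return the bucket number (a*x + b) mod M for an integer key x."""
    return (a * x + b) % M


# With a=1, b=0 and M=1,000 the hash is just the last three digits of the key.
print(division_hash(123456789, 1000))  # 789
```

With M = 1,000 the bucket depends only on the low-order digits of the key, which is one reason the slide suggests a prime M instead.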

  13. Division hashing. E.g., M=2; hash on driver-license number (dln), where the last digit is ‘gender’ (0/1 = M/F), in an army unit with predominantly male soldiers: almost all keys land in the same bucket. Thus: avoid cases where M and keys have common divisors - prime M guards against that!
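The common-divisor problem can be seen on a small made-up data set. The sketch below assumes hypothetical keys that are all multiples of 10 and compares a table size that shares the factor 10 with the keys (M = 1,000) against a nearby prime (M = 1,009); the helper name `buckets_used` is illustrative.

```python
def buckets_used(keys, M):
    """Count how many distinct buckets the keys occupy under h(x) = x mod M."""
    return len({k % M for k in keys})


# Hypothetical keys: 1,000 values, every one a multiple of 10.
keys = [10 * i for i in range(1000)]

print(buckets_used(keys, 1000))  # 100: only every tenth bucket is ever hit
print(buckets_used(keys, 1009))  # 1000: every key gets its own bucket
```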

  14. Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M • φ: golden ratio ( 0.618... = (sqrt(5)-1)/2 ) • in general, we need an irrational number • advantage: M need not be a prime number • but φ must be irrational
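A minimal sketch of the multiplication hash on slide 14, assuming non-negative integer keys. The name `multiplication_hash` is illustrative, and the floating-point constant is only an approximation of the irrational φ, so this shows the idea rather than a production-quality hash.

```python
import math

# Golden-ratio constant from the slide: (sqrt(5) - 1) / 2 = 0.618...
PHI = (math.sqrt(5) - 1) / 2


def multiplication_hash(x: int, M: int) -> int:
    """Return floor(frac(x * PHI) * M), a bucket number in 0..M-1."""
    frac = (x * PHI) % 1.0          # fractional part of x * PHI
    return min(int(frac * M), M - 1)  # guard against rounding up to M


# M does not have to be prime here; 1,000 is used only as an example.
for key in (1, 2, 3, 4, 5):
    print(key, multiplication_hash(key, 1000))
```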

  15. Other hashing functions • quadratic hashing (bad) • ...

  16. Other hashing functions • quadratic hashing (bad) • ... • conclusion: use division hashing

  17. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

  18. Size of hash table. E.g., 50,000 employees, 10 employee-records / page. Q: M=?? pages/buckets/slots

  19. Size of hash table. E.g., 50,000 employees, 10 employees/page. Q: M=?? pages/buckets/slots A: utilization ~ 90% and M: prime number. E.g., in our case: M = closest prime to 50,000/10 / 0.9 ≈ 5,555
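The sizing rule on slide 19 can be worked through in a few lines. This is a sketch under the slide's numbers (50,000 records, 10 per page, ~90% utilization); the helper names `is_prime`, `closest_prime` and `table_size` are illustrative, and "closest" is taken relative to the rounded target.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for table sizes of this magnitude."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def closest_prime(target: float) -> int:
    """Return the prime nearest to round(target); ties go to the smaller one."""
    center = round(target)
    for d in range(center + 3):
        for cand in (center - d, center + d):
            if is_prime(cand):
                return cand
    raise ValueError("no prime found")


def table_size(n_records: int, records_per_page: int, utilization: float) -> int:
    """Pages needed at the given utilization, nudged to the closest prime."""
    return closest_prime(n_records / records_per_page / utilization)


print(table_size(50_000, 10, 0.9))  # 5557
```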

  20. Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

  21. Collision resolution • Q: what is a ‘collision’? • A: ??

  22. Collision resolution [Figure: the record “123; Smith; Main str.” hashes to page #h(123), which is already FULL.]

  23. Collision resolution • Q: what is a ‘collision’? • A: ?? • Q: why worry about collisions/overflows? (recall that buckets are ~90% full) • A: ‘birthday paradox’
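The ‘birthday paradox’ answer can be made concrete with a short calculation. The sketch below simplifies to one record per slot and uniformly random hashing (both are assumptions, since the lecture's buckets hold many records) and computes the probability that at least two of n keys share a slot.

```python
def collision_probability(n_keys: int, n_slots: int) -> float:
    """P(at least two of n_keys fall in the same slot), uniform random hashing."""
    p_no_collision = 1.0
    for i in range(n_keys):
        # The (i+1)-th key must avoid the i slots already taken.
        p_no_collision *= (n_slots - i) / n_slots
    return 1.0 - p_no_collision


# Classic birthday setting: 23 keys in 365 slots already collide
# with probability above one half, although the table is only ~6% full.
print(collision_probability(23, 365) > 0.5)  # True
```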

  24. Collision resolution • open addressing • linear probing (i.e., put to next slot/bucket) • re-hashing • separate chaining (i.e., put links to overflow pages)

  25. Collision resolution: linear probing. [Figure: page #h(123) is FULL, so “123; Smith; Main str.” goes to the next page.]
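Linear probing with page-sized buckets might look like the sketch below. It assumes non-negative integer keys, h(x) = x mod M, and no deletions (which is what lets a search stop at the first non-full page); the class name `LinearProbingTable` is illustrative.

```python
class LinearProbingTable:
    """Open addressing: a full page pushes the record to the next page."""

    def __init__(self, M: int, capacity: int):
        self.M = M
        self.capacity = capacity            # records per page
        self.pages = [[] for _ in range(M)]

    def insert(self, key: int, record) -> int:
        """Store the record and return the page number it ended up on."""
        start = key % self.M
        for step in range(self.M):
            page_no = (start + step) % self.M
            if len(self.pages[page_no]) < self.capacity:
                self.pages[page_no].append((key, record))
                return page_no
        raise OverflowError("hash table is full")

    def search(self, key: int):
        start = key % self.M
        for step in range(self.M):
            page = self.pages[(start + step) % self.M]
            for k, record in page:
                if k == key:
                    return record
            if len(page) < self.capacity:
                return None   # a non-full page ends the probe sequence
        return None


table = LinearProbingTable(M=5, capacity=1)
print(table.insert(3, "a"), table.insert(8, "b"), table.insert(13, "c"))  # 3 4 0
```

The `OverflowError` path is the "danger of becoming full" that separate chaining avoids.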

  26. Collision resolution: re-hashing. [Figure: h1() sends “123; Smith; Main str.” to page #h(123), which is FULL; a second hash function h2() picks another page.]

  27. Collision resolution: separate chaining. [Figure: the FULL page links to an overflow page, which stores “123; Smith; Main str.”]
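Separate chaining with overflow pages could be sketched as follows, assuming non-negative integer keys and h(x) = x mod M. The class names `Page` and `ChainedHashFile` and the helper `chain_length` are illustrative.

```python
class Page:
    """A fixed-capacity page with a link to an optional overflow page."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.records = []       # list of (key, record) pairs
        self.overflow = None    # next page in the chain, if any


class ChainedHashFile:
    def __init__(self, M: int, capacity: int):
        self.M = M
        self.capacity = capacity
        self.buckets = [Page(capacity) for _ in range(M)]

    def insert(self, key: int, record) -> None:
        page = self.buckets[key % self.M]
        # Walk the chain until a page with free space is found,
        # linking a fresh overflow page when the chain runs out.
        while len(page.records) >= page.capacity:
            if page.overflow is None:
                page.overflow = Page(self.capacity)
            page = page.overflow
        page.records.append((key, record))

    def search(self, key: int):
        page = self.buckets[key % self.M]
        while page is not None:
            for k, record in page.records:
                if k == key:
                    return record
            page = page.overflow
        return None

    def chain_length(self, bucket_no: int) -> int:
        """Number of pages (primary + overflow) in one bucket's chain."""
        n, page = 0, self.buckets[bucket_no]
        while page is not None:
            n, page = n + 1, page.overflow
        return n


f = ChainedHashFile(M=3, capacity=2)
for key in (0, 3, 6, 9, 12):      # all hash to bucket 0
    f.insert(key, f"rec{key}")
print(f.chain_length(0))  # 3
```

The table never refuses an insert, but long chains turn a lookup into several page reads, which is the overflow problem that the dynamic schemes later in the lecture address.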

  28. Design decisions - conclusions • function: division hashing • h(x) = ( a*x+b ) mod M • size M: ~90% util.; prime number. • collision resolution: separate chaining • easier to implement (deletions!); • no danger of becoming full

  29. Outline • (static) hashing • extendible hashing • linear hashing • Hashing vs B-trees

  30. Problem with static hashing • problem: overflow? • problem: underflow? (underutilization)

  31. Solution: Dynamic/extendible hashing • idea: shrink / expand hash table on demand, i.e., dynamic hashing. Details: how to grow gracefully, on overflow? Many solutions - one of them: ‘extendible hashing’ [Fagin et al]

  32. Extendible hashing [Figure: the record “123; Smith; Main str.” hashes to page #h(123), which is FULL.]

  33. Extendible hashing. Solution: don’t overflow – instead: SPLIT the bucket in two. [Figure: the record “123; Smith; Main str.” hashes to page #h(123), which is FULL.]

  34. Extendible hashing, in detail: • keep a directory, with ptrs to hash-buckets • Q: how to divide contents of bucket in two? • A: hash each key into a very long bit string; keep only as many bits as needed. Eventually:

  35. Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11...; buckets {0001..., 0111...}, {10101..., 10011..., 10110...} and {1101...}; the key 101001... arrives at the 10... bucket.]

  36. Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11...; buckets {0001..., 0111...}, {10101..., 10011..., 10110...} and {1101...}; the key 101001... arrives at the 10... bucket.]

  37. Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11...; the 10... bucket {10101..., 10011..., 10110...} has to take 101001... as well: split on 3rd bit. Other buckets: {0001..., 0111...} and {1101...}.]

  38. Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11...; the 10... bucket is split into {10011...} and a new page / bucket {10101..., 101001..., 10110...}; the buckets {0001..., 0111...} and {1101...} are unchanged.]

  39. Extendible hashing [Figure: directory (doubled) with entries 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111...; buckets {0001..., 0111...}, {10011...}, {10101..., 101001..., 10110...} (new page / bucket) and {1101...}.]

  40. Extendible hashing. BEFORE: directory 00..., 01..., 10..., 11...; buckets {0001..., 0111...}, {10101..., 10011..., 10110...} and {1101...}. AFTER: directory 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111...; buckets {0001..., 0111...}, {10011...}, {10101..., 101001..., 10110...} and {1101...}.

  41. Extendible hashing [Figure: the BEFORE (directory 00... to 11...) and AFTER (directory 000... to 111...) pictures, now annotated with a ‘global’ depth on each directory and a ‘local’ depth on each bucket.]

  42. Extendible hashing [Figure: BEFORE: global depth 2; local depths 1 for {0001..., 0111...}, 2 for {10101..., 10011..., 10110...}, 2 for {1101...}. AFTER: global depth 3; local depths 1 for {0001..., 0111...}, 3 for {10011...}, 3 for {10101..., 101001..., 10110...}, 2 for {1101...}.] Do all splits lead to directory doubling? A:

  43. Extendible hashing [Figure: the BEFORE/AFTER directories and buckets, with global and local depths.] Do all splits lead to directory doubling? A: NO! (most don’t)

  44. Extendible hashing [Figure: the BEFORE/AFTER directories and buckets, with global and local depths.] Give insertions that cause a split, but no directory doubling?

  45. Extendible hashing [Figure: the BEFORE/AFTER directories and buckets, with global and local depths.] Give insertions that cause a split, but no directory doubling? A: 000… and 001…

  46. Extendible hashing • Summary: directory doubles on demand • or halves, on shrinking files • needs ‘local’ and ‘global’ depth
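The directory, split and depth bookkeeping summarized above can be sketched compactly. This is an illustration under stated assumptions, not the lecture's code: keys are non-negative integers below 2^32, the 32-bit multiplicative `_hash` is a hypothetical stand-in for "hash each key into a very long bit string", the directory is indexed by the leading `global_depth` bits as in the slides' figures, and shrinking (directory halving) is omitted.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.items = {}                      # key -> record


class ExtendibleHashTable:
    BITS = 32                                # length of the hashed bit string

    def __init__(self, capacity: int = 2):
        self.capacity = capacity             # records per bucket
        self.global_depth = 0
        self.directory = [Bucket(0)]         # 2**global_depth pointers

    def _hash(self, key: int) -> int:
        # Hypothetical scrambling hash (odd multiplier, 32-bit result).
        return (key * 2654435761) & 0xFFFFFFFF

    def _index(self, key: int) -> int:
        # Keep only the leading global_depth bits of the hash.
        return self._hash(key) >> (self.BITS - self.global_depth)

    def search(self, key: int):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key: int, record) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = record
                return
            self._split(bucket)              # full: split, then retry

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.BITS:
            raise RuntimeError("out of hash bits")
        if bucket.local_depth == self.global_depth:
            # Directory doubling: every pointer is duplicated side by side.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Directory slots whose newly significant bit is 1 move to new_bucket.
        shift = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = new_bucket
        # Redistribute the old bucket's records between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v


table = ExtendibleHashTable(capacity=2)
for key in range(20):
    table.insert(key, f"rec{key}")
print(table.global_depth, len(table.directory))
```

A split doubles the directory only when the overflowing bucket's local depth equals the global depth; otherwise just the pointers are redistributed, which matches the "most don't" answer on slide 43.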

  47. Outline • (static) hashing • extendible hashing • linear hashing • Hashing vs B-trees

  48. Linear hashing - overview • Motivation • main idea • search algo • insertion/split algo • deletion

  49. Linear hashing. Motivation: ext. hashing needs directory etc etc; which doubles (ouch!) Q: can we do something simpler, with smoother growth?

  50. Linear hashing. Motivation: ext. hashing needs directory etc etc; which doubles (ouch!) Q: can we do something simpler, with smoother growth? A: split buckets from left to right, regardless of which one overflowed (‘crazy’, but it works well!) - E.g.:
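The left-to-right splitting idea can be sketched in code. The details below follow the usual textbook formulation of linear hashing and are assumptions rather than something taken from these slides: non-negative integer keys, hash functions h_i(k) = k mod (n0 * 2^i), a `next` pointer marking the bucket to split, and a split triggered whenever any bucket exceeds its capacity. A Python list that grows past `capacity` stands in for an overflow chain.

```python
class LinearHashFile:
    def __init__(self, n0: int = 4, capacity: int = 2):
        self.n0 = n0                  # initial number of buckets
        self.capacity = capacity      # records per primary page
        self.level = 0                # current round of splitting
        self.next = 0                 # next bucket to split, left to right
        self.buckets = [[] for _ in range(n0)]

    def _address(self, key: int) -> int:
        b = key % (self.n0 << self.level)
        if b < self.next:
            # This bucket was already split in the current round,
            # so one more bit of the key is needed.
            b = key % (self.n0 << (self.level + 1))
        return b

    def search(self, key: int):
        for k, record in self.buckets[self._address(key)]:
            if k == key:
                return record
        return None

    def insert(self, key: int, record) -> None:
        bucket = self.buckets[self._address(key)]
        bucket.append((key, record))
        if len(bucket) > self.capacity:
            self._split_next()        # split bucket `next`, not this one

    def _split_next(self) -> None:
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])       # image bucket at next + n0 * 2**level
        mod = self.n0 << (self.level + 1)
        for k, record in old:
            self.buckets[k % mod].append((k, record))
        self.next += 1
        if self.next == self.n0 << self.level:
            self.level += 1           # round finished: table has doubled
            self.next = 0


f = LinearHashFile(n0=4, capacity=2)
for key in range(20):
    f.insert(key, f"rec{key}")
print(len(f.buckets), f.level, f.next)
```

The file grows by one bucket per split, with no directory, which is the smoother growth the slide asks for.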
