1 / 36

Principles of Database Management Systems 4.2: Hashing Techniques

Principles of Database Management Systems 4.2: Hashing Techniques. Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom). Hashing?. Locating the storage block of a record by the hash value h(k) of its key k Normally really fast

nami
Télécharger la présentation

Principles of Database Management Systems 4.2: Hashing Techniques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Principles of Database Management Systems4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) Notes 4.2: Hashing

  2. Hashing? • Locating the storage block of a record by the hash valueh(k) of its key k • Normally really fast • records (often) located by a single disk access Notes 4.2: Hashing

  3. Hashing key  h(key) <key> Buckets (typically 1 disk block) . . . Notes 4.2: Hashing

  4. (1) Hash value determines the storage block directly Two alternatives . . . records key  h(key) . . . • to implement a primary index Notes 4.2: Hashing

  5. Two alternatives (2) Records located indirectly via index buckets record key  h(key) key 1 Index • for a secondary index Notes 4.2: Hashing

  6. Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h = (x1 + x2 + … + xn) mod b  {0, 1, …, b-1} Notes 4.2: Hashing

  7. Good hash Expected number of function: keys/bucket is the same for all buckets  This may not be best function …  Read Knuth Vol. 3 if you really need to select a good function. Notes 4.2: Hashing

  8. Next: example to illustrate inserts, overflows, deletes h(K) Notes 4.2: Hashing

  9. d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 Notes 4.2: Hashing

  10. d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g Notes 4.2: Hashing

  11. If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit Notes 4.2: Hashing

  12. Extensible • Linear • also others ... How do we cope with growth? • Overflows and reorganizations • Dynamic hashing: # of buckets may vary Notes 4.2: Hashing

  13. Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K)  use i grows over time…. For example, b=32 00110101 Notes 4.2: Hashing

  14. (b) Use directory h(K)[i ] to bucket . . . . . . Directory contains 2i pointers to buckets, and stores i. Each bucket stores j, indicating #bits used for placing the records in this block (j  i) Notes 4.2: Hashing

  15. Extensible Hashing: Insertion • If there's room in bucket h(k)[i], place record there; Otherwise … • If j=i, set i=i+1 and double the directory • If j<i, split the block in two, distribute records among them now using j+1 bits of h(k); (Repeat until some records end up in the new bucket); Update pointers of bucket array • See the next example Notes 4.2: Hashing

  16. i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/block (j) 1 0001 i = 1 1001 1100 Insert 1010 Notes 4.2: Hashing

  17. 2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 Notes 4.2: Hashing

  18. i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 Notes 4.2: Hashing

  19. Extensible hashing: deletion • Reverse insert procedure … • Example: Walk thru insert example in reverse! Notes 4.2: Hashing

  20. Indirection (Not bad if directory in memory) Directory doubles in size (First it fits in memory, then it does not -> sudden performance degradation) - - Summary Extensible hashing + Can handle growing files - without full reorganizations Only one data block examined + Notes 4.2: Hashing

  21. Two ideas: b (a) Use ilow order bits of hash 01110101 grows i (b) File grows linearly Linear hashing: grow # of buckets by one -> No bucket directory needed Notes 4.2: Hashing

  22. Linear Hashing: Parameters • n: number of buckets in use • buckets numbered 0…n-1 • i: number of bits of h(k) used to address buckets • r: number of records in hash table • ratio r/n limited to fit an avg bucket in a block • next example: r  1.7n, and block holds 2 records=> AVG bucket occupancy is  1.7/2 = 0.85 of a block Notes 4.2: Hashing

  23. If h(k)[i ] = (a1 … ai)2< n, then look at bucket h(k)[i ]; else look at bucket h(k)[i ]- 2i -1 = (0a2 … ai)2 Rule Example:2 keys/block, b=4 bits, n=2, i =1 • insert 0101 00 01 • now r=4 >1.7n -> get new bucket 10and distribute keys btw buckets 00 and 10 0000 0101 1010 1111 Notes 4.2: Hashing

  24. 1010 -> n=3, i =2; distribute keys btw buckets 00 and 10: 000110 0000 0101 1010 1111 Notes 4.2: Hashing

  25. 0001 n=3, i =2; insert 0001: • can have overflow chains! 000110 0000 0101 1010 1111 Notes 4.2: Hashing

  26. n=3, i =2 0001 • insert 0111 0111 000110 • bucket 11 not in use-> redirect to 01 0000 0101 1010 1111 • now r=6 > 1.7n-> get new bucket 11 Notes 4.2: Hashing

  27. 1111 0111 0001 n=4, i =2; distribute keys btw 01 and 11 0001 0111 00011011 0000 0101 1010 1111 Notes 4.2: Hashing

  28. 3 0101 0101 101 100 0 0 0 0 101 110 111 100 101 Example Continued:How to grow beyond this? i = 2 00 01 10 11 0000 0101 1010 1111 0101 . . . m = 11 (max used block) Notes 4.2: Hashing

  29. Summary Linear Hashing + Can handle growing files - without full reorganizations No indirection directory of extensible hashing Can have overflow chains - but probability of long chains can be kept low by controlling the r/n fill ratio (?) + - Notes 4.2: Hashing

  30. Summary Hashing - How it works - Dynamic hashing - Extensible - Linear Notes 4.2: Hashing

  31. Next: • Indexing vs Hashing • Index definition in SQL Notes 4.2: Hashing

  32. Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Notes 4.2: Hashing

  33. Indexing vs Hashing • INDEXING (Including B-Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Notes 4.2: Hashing

  34. Index definition in SQL • Createindex name on rel (attr) • Createuniqueindex name on rel (attr) defines candidate key • Drop INDEX name Notes 4.2: Hashing

  35. Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL … Oracle and IBM DB2 UDB provide a PCTFREE clause to inditate the proportion of B-tree blocks initially left unfilled Oracle: “Hash clusters” with built-in or DBA-specified hash function Notes 4.2: Hashing

  36. The BIG picture…. • Chapters 2 & 3: Storage, records, blocks... • Chapter 4: Access Mechanisms - Indexes - B trees - Hashing • Chapters 6 & 7: Query Processing NEXT Notes 4.2: Hashing

More Related