1 / 24

Hash-Based Indexes

Hash-Based Indexes. Chapter 11. Modified by Donghui Zhang Jan 30, 2006. Introduction. Hash-based indexes are best for equality selections . Cannot support range searches. Basic idea: static hashing. Two dynamic hashing techniques: Extendible Hashing Linear Hashing. 0. 1. M-1.

livi
Télécharger la présentation

Hash-Based Indexes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006

  2. Introduction • Hash-based indexes are best for equalityselections. Cannot support range searches. • Basic idea: static hashing. • Two dynamic hashing techniques: • Extendible Hashing • Linear Hashing

  3. 0 1 M-1 Primary bucket pages Motivation • B+-tree needs logBN I/Os to get one key. Can we do faster? • If we use key as bucket number, then one I/O is good enough! • Problem: too many buckets! • What if (key % M)? • Got the idea, but should use a Hashing function on key.

  4. Static Hashing • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. • h(k) mod M = bucket to which data entry withkey k belongs. (M = # of buckets) 0 h(key) mod M 1 key h M-1 Primary bucket pages Overflow pages

  5. Static Hashing (Contd.) • Buckets contain data entries. • Hash fn works on search key field of record r. Must distribute values over range 0 ... M-1. • h(key) = (a * key + b) usually works well. • Long overflow chains can develop and degrade performance. • Extendible and LinearHashing: Dynamic techniques to fix this problem.

  6. Extendible Hashing • Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? • Problem 1: Reading and writing all pages is expensive! • Problem 2: May have many under-utilized pages! • Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed! • Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. Nooverflowpage! • Trick lies in how hash function is adjusted!

  7. LOCAL DEPTH 2 Example Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 • Directory is array of size 4. • To find bucket for r, take last `global depth’ # bits of h(r); we denote r by h(r). • If h(r) = 5 = binary 101, it is in bucket pointed to by 01. 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES • Insert: If bucket is full, splitit (allocate new page, re-distribute). • If necessary, double the directory. (As we will see, splitting a • bucket does not always require doubling; we can tell by • comparing global depth with local depth for the split bucket.)

  8. LOCAL DEPTH 2 Example Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 • Insert 14. • 14%4 = 2. • Does not overflow. 2 10 Bucket C 14* 10* 11 2 DIRECTORY • Insert 20. • 20%4 = 0. • Overflow! Bucket D 15* 7* 19* DATA PAGES • 32%8 = 0, 16%8 = 0 -- stay in bucket 000 • 4 % 8 = 4, 12%8 = 4, 20%8 = 4 -- move to bucket 100

  9. Insert h(r)=20 (Causes Doubling) 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 16* 32* 32* 16* GLOBAL DEPTH Bucket A GLOBAL DEPTH 2 2 2 3 Bucket B 5* 21* 13* 1* 00 1* 5* 21* 13* 000 Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* Bucket C 011 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY 12* 20* Bucket A2 4* (`split image' of Bucket A) (`split image' of Bucket A)

  10. Points to Note • When does bucket split cause directory doubling? • Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; • How to implement the doubling of directory? • Copying it over! (Use of least significant bits enables this!) • `fixing’ pointer to split image page.

  11. Comments on Extendible Hashing • Good: no overflow page. If directory fits in memory, equality search answered with one disk access guaranteed; (else two). • Since directory contains only key + pointer, it should generally be small enough to fit in memory. • Bad: if the distribution of hash values is skewed, directory can grow large. • Multiple entries with same hash value cause problems!

  12. Sample quiz question • Is it possible for an insertion to cause the directory to double more than once? • If yes, show an example.

  13. Deletion on Extendible Hashing • Find the record and delete. • Rule 1: If removal of data entry makes bucket empty, can be merged with `split image’. • Rule 2: If each directory element points to the same bucket as its split image, can halve directory.

  14. Linear Hashing • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • LH handles the problem of long overflow chains without using a directory, and handles duplicates. • Idea: • Keep a pointer which alternate through each bucket. • Whenever a new overflow page is generated, split the bucket the pointer points to, and add a new bucket at the end of the file. • Observation: the number of buckets is doubled after splitting all bucket!

  15. Example of Linear Hashing • Let there be eight records: • 32, 44, 14, 18 • 9, 25, 31, 35 • Let each disk page hold up to four records. • Consider inserting 10, 30, 36, 7, 5, 11…

  16. If insert 22 …… Example of Linear Hashing Let hLevel=(a*key+b) % 4 hLevel+1=(a*key+b) % 8 • On split, hLevel+1 is used to re-distribute entries. Level=0, N=4 Level=0 PRIMARY h h OVERFLOW h h PRIMARY PAGES 0 0 1 1 PAGES PAGES Next=0 32* 32* 44* 36* 000 00 000 00 Next=1 Data entry r 9* 5* 9* 5* 25* 25* with h(r)=5 001 001 01 01 30* 30* 10* 10* 14* 18* 14* 18* Primary 10 10 010 010 bucket page 31* 35* 7* 31* 35* 7* 11* 11* 43* 011 011 11 11 (This info is for illustration only!) (The actual contents of the linear hashed file) 100 44* 36* 00

  17. Example of Linear Hashing Let hLevel=(a*key+b) % 4 hLevel+1=(a*key+b) % 8 Level=0 h OVERFLOW h PRIMARY 0 1 PAGES PAGES Important notice: after buckets 010 and 011 are split, Next moves back To 000! 32* 000 00 9* 25* 001 01 Next=2 30* 10* 14* 18* 22* 10 010 31* 35* 7* 11* 43* 011 11 100 44* 36* 00 101 5* 01

  18. Example: End of a Round Level=1 Level=0 PRIMARY OVERFLOW h h PAGES PRIMARY OVERFLOW 0 1 PAGES Next=0 PAGES h PAGES h 1 0 00 000 32* 32* 000 00 001 01 9* 25* 9* 25* 001 01 10 010 50* 10* 18* 66* 34* 66* 10 18* 10* 34* 010 Next=3 011 11 35* 11* 43* 43* 11* 7* 31* 35* 011 11 44* 36* 100 100 00 44* 00 36* 5* 37* 29* 101 01 101 11 5* 29* 37* 14* 30* 22* 110 10 14* 22* 30* 110 10 31* 7* 11 111

  19. LH File • In the middle of a round. Buckets split in this round: Bucket to be split If ( h search key value ) Level Next is in this range, must use h ( search key value ) Level+1 Buckets that existed at the to decide if entry is in beginning of this round: `split image' bucket. this is the range of h Level `split image' buckets: created (through splitting of other buckets) in this round

  20. Linear Hashing (Contd.) • Insert: Find bucket by applying hLevel / hLevel+1: • If bucket to insert into is full: • Add overflow page and insert data entry. • (Maybe) Split Next bucket and increment Next. • Can choose any criterion to `trigger’ split. • Since buckets are split round-robin, long overflow chains don’t develop! • Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.

  21. Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) • Directory to keep track of buckets, doubles periodically. • Can get large with skewed data; additional I/O if this does not fit in main memory.

  22. Summary (Contd.) • Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages. • Overflow pages not likely to be long. • Duplicates handled easily. • Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas. • Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization. • For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!

  23. LOCAL DEPTH 2 Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 14* 10* 11 2 Bucket D 15* 7* 19* Sample Quiz Question • Min/Max # insertions to increase global depth by 1? • What about by 2?

  24. Sample Quiz Question • Min/Max # insertions to cause Next to point to bucket 011? Level=0 h OVERFLOW h PRIMARY 0 1 PAGES PAGES 32* 000 00 • What about for Next to point to 000? 9* 25* 001 01 Next=2 30* 10* 14* 18* 22* 10 010 31* 35* 7* 11* 43* 011 11 100 44* 36* 00 101 5* 01

More Related