Hash-Based Indexes: Introduction, Static Hashing, and Extendible Hashing

This chapter introduces hash-based indexes, including static hashing, which uses a fixed number of buckets, and extendible hashing, which dynamically adjusts the number of buckets.

Presentation Transcript


  1. Hash-Based Indexes Chapter 11

  2. Introduction: Hash-based Indexes • Best for equality selections. • Cannot support range searches. • Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.

  3. Static Hashing • h(k) mod N = bucket to which the data entry with key k belongs (N = # of buckets). • [Figure: the key is fed to h; h(key) mod N selects one of the primary bucket pages 0 .. N-1, each of which may have a chain of overflow pages.]

  4. Static Hashing: h(k) mod N • # of primary pages is fixed (N = # of buckets) • allocated sequentially • never de-allocated • overflow pages if needed. • [Figure: same diagram as slide 3 — primary bucket pages 0 .. N-1 with overflow pages.]

  5. Static Hashing • h(k) mod N = bucket to which the data entry with key k belongs, where N = # of buckets. • The hash function works on the search key of record r. • h() must distribute values over the range [0 ... N-1]. • For example, h(key) = (a * key + b) • a and b are constants • much is known about how to tune h (see the sketch below).
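
A minimal sketch of the bucket computation this slide describes, assuming integer keys; the constants a, b, the bucket count N, and the sample keys are illustrative, not values from the slides.

```python
# Static hashing sketch: h(key) = (a * key + b), bucket = h(key) mod N.
N = 4                      # fixed number of primary bucket pages
a, b = 31, 7               # tunable constants of the hash function

def bucket_of(key):
    return (a * key + b) % N

buckets = [[] for _ in range(N)]       # primary pages; overflow chains omitted
for key in (12, 25, 33, 47):
    buckets[bucket_of(key)].append(key)
```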

  6. Static Hashing Cons • Primary pages occupy a fixed space → a static structure. • The fixed # of buckets is the problem: • Rehashing can be done → but it is expensive and searches suffer while it runs. • In practice, overflow chains are used instead. • Long overflow chains degrade performance. • Solution: employ dynamic techniques: • Extendible Hashing, or • Linear Hashing

  7. Extendible Hashing • Problem: a bucket (primary page) becomes full. • Solution: re-organize the file by doubling the # of buckets? • But reading and writing all pages is expensive!

  8. Extendible Hashing • Idea: • Use a directory of pointers to buckets, rather than addressing buckets directly. • Details: • Double the # of buckets by doubling the directory. • Split just the bucket that overflowed! • Trick: • How the hash function is adjusted!

  9. Example • Directory: array[4]. • To find the bucket for r, take the last `global depth' # of bits of h(r). • [Figure: global depth = 2; directory entries 00, 01, 10, 11 point to data pages Bucket A = {32*, 16*, 4*, 12*}, Bucket B = {1*, 5*, 21*, 13*}, Bucket C = {10*}, Bucket D = {15*, 7*, 19*}, each with local depth 2.]

  10. Example • If h(r) = 5 = binary 101, r is in the bucket pointed to by 01. • If h(r) = 4 = binary 100, r is in the bucket pointed to by 00. • [Figure: same directory and buckets as slide 9.]
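
A tiny check of the lookup rule on this slide; it assumes h is the identity on integer keys, as in the example, and dir_index is a hypothetical helper name.

```python
def dir_index(hash_value, global_depth):
    # Keep only the last `global_depth` bits of h(r).
    return hash_value & ((1 << global_depth) - 1)

print(format(dir_index(5, 2), "02b"))   # '01' -> bucket pointed to by 01 (Bucket B)
print(format(dir_index(4, 2), "02b"))   # '00' -> bucket pointed to by 00 (Bucket A)
```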

  11. Insertion • Insert: if the bucket is full, split it (allocate a new page, re-distribute its content). • Splitting may: • double the directory, or • simply link in a new page. • To tell which: • compare the global depth with the local depth of the split bucket (a sketch of this logic follows below). • [Figure: same directory and buckets as slide 9.]
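
A hedged sketch of the insertion logic described on this slide, assuming integer keys with h(k) = k (so the traces match the slides), a bucket capacity of 4 data entries, and illustrative names (ExtendibleHashTable, Bucket); it is not the textbook's code.

```python
BUCKET_CAPACITY = 4  # data entries per bucket page, as drawn on the slides


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHashTable:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # Take the last `global_depth` bits of h(key); h is the identity here.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.entries) < BUCKET_CAPACITY:
            bucket.entries.append(key)
            return
        self._split(bucket)
        self.insert(key)                 # retry after the split

    def _split(self, bucket):
        # Local depth == global depth: the over-full bucket already uses every
        # directory bit, so double the directory first.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2
            self.global_depth += 1

        # Allocate the split image and redistribute entries on the new bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        old_entries, bucket.entries = bucket.entries, []
        for k in old_entries:
            (image if k & new_bit else bucket).entries.append(k)

        # Directory slots whose new bit is set now point to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
```

The global-vs-local depth comparison the slide mentions is the test at the top of _split: the directory doubles only when the over-full bucket's local depth already equals the global depth; otherwise the split just links in a new page and re-aims half of the existing directory slots.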

  12. Insert h(r) = 6 = binary 110 • [Figure: directory and buckets as on slide 9, before the insert.]

  13. Insert h(r) = 6 = binary 110 • The last 2 bits of 110 are 10, so 6* goes into Bucket C, which has room: no split is needed. • [Figure: before and after — Bucket C changes from {10*} to {6*, 10*}; everything else is unchanged.]

  14. Insert h(r) = 20 = binary 10100 • [Figure: directory and buckets as on slide 9, before the insert.]

  15. Insert h(r) = 20 • The last 2 bits of 10100 are 00, so 20* belongs in Bucket A, which is already full. • Split Bucket A into two buckets A1 and A2, each with local depth 3: A1 = {32*, 16*} and A2 = {4*, 12*, 20*}. • [Figure: original directory and buckets as on slide 9, plus the two new buckets A1 and A2.]

  16. Insert h(r) = 20 • [Figure: same split as slide 15 — A1 = {32*, 16*} and A2 = {4*, 12*, 20*} with local depth 3, shown next to the original directory, which still has global depth 2.]

  17. Insert h(r) = 20 • Since the local depth of the split bucket (3) now exceeds the global depth (2), the directory is doubled from 4 to 8 entries and the global depth becomes 3. • [Figure: after the doubling, directory entries 000-111 point to Bucket A = {32*, 16*} (local depth 3), Bucket B = {1*, 5*, 21*, 13*} (2), Bucket C = {10*} (2), Bucket D = {15*, 7*, 19*} (2), and Bucket A2 = {4*, 12*, 20*} (local depth 3), the `split image' of Bucket A.]
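
As a usage sketch, the insert-6 and insert-20 traces on slides 12-17 can be reproduced with the illustrative class after slide 11; the insertion order used to reach the slide-9 state is an assumption.

```python
t = ExtendibleHashTable()
for k in (32, 16, 4, 12, 1, 5, 21, 13, 10, 15, 7, 19):
    t.insert(k)          # reaches the state drawn on slide 9 (global depth 2)

t.insert(6)              # last 2 bits 10 -> Bucket C has room, no split (slide 13)
t.insert(20)             # Bucket A is full -> split; its local depth would exceed
                         # the global depth, so the directory doubles (slides 15-17)
assert t.global_depth == 3
assert len(t.directory) == 8
```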

  18. Points to Note • 20 = binary 10100. • The last 2 bits (00) tell us that r belongs in Bucket A. • The last 3 bits are needed to tell whether r belongs in A1 or in A2.

  19. More Points to Note Bits: • Global depth of the directory: max # of bits needed to tell which bucket an entry belongs to. • Local depth of a bucket: actual # of bits used to determine whether an entry belongs to this bucket. • When does a bucket split cause directory doubling? • Before the insert, local depth of the bucket = global depth. • The insert causes the local depth to become > the global depth; • The directory is doubled by copying it over and `fixing' the pointer to the split image of the over-full page.

  20. Extendible Hashing: Delete • Delete: • If removal of a data entry makes a bucket empty, it can be merged with its `split image'. • If each directory element points to the same bucket as its split image, the directory can be halved (see the sketch below).
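
A hedged sketch of the directory-halving check on this slide, building on the illustrative class after slide 11; the bucket-merge step itself is omitted, and the function names are hypothetical.

```python
def can_halve_directory(table):
    # Every directory element must point to the same bucket as its split
    # image, i.e. the element whose index differs only in the top bit.
    half = 1 << (table.global_depth - 1)
    return all(table.directory[i] is table.directory[i | half]
               for i in range(half))

def maybe_halve(table):
    if table.global_depth > 1 and can_halve_directory(table):
        table.directory = table.directory[: 1 << (table.global_depth - 1)]
        table.global_depth -= 1
```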

  21. Comments on Extendible Hashing • If the directory fits in memory, an equality search is answered with one disk access; otherwise with two. • A 100 MB file with 100-byte records and 4 KB pages contains 1,000,000 records (as data entries) and about 25,000 directory elements; chances are high that the directory will fit in memory. • The directory grows in spurts. • If the distribution of hash values is skewed, the directory can grow large.
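
A back-of-the-envelope check of the numbers on this slide, using the round decimal figures the slide assumes (100 MB ≈ 10^8 bytes, 4 KB ≈ 4,000 bytes).

```python
records          = 100_000_000 // 100   # 100 MB file / 100-byte records -> 1,000,000 data entries
entries_per_page = 4_000 // 100         # 4 KB page / 100-byte entries  -> 40 entries per bucket page
directory_elems  = records // entries_per_page   # -> 25,000 directory elements
```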

  22. Summary • Hash-based indexes: best for equality searches; cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when new data is added. • A directory keeps track of the buckets; it doubles periodically.
