CS 405G: Introduction to Database Systems

CS 405G: Introduction to Database Systems 23 Hashing Chen Qian University of Kentucky

B+-tree Organization Internal node Leaf node Luke Huan Univ. of Kansas

Review • B+-tree Insert • Find correct leaf L. • Put data entry onto L. • If L has enough space, done! • Else, must splitL (into L and a new node L2) • Distribute entries evenly, copy upmiddle key. • Insert index entry pointing to L2 into parent of L. • This can happen recursively • Tree growth: gets wider and (sometimes) one level taller at top. Luke Huan Univ. of Kansas

Review • B+-tree Delete • Start at root, find leaf L where entry belongs. • Remove the entry. • If L is at least half-full, done! • If L has only d-1 entries, • Try to redistribute, borrowing from sibling (adjacent node with same parent as L). • If re-distribution fails, mergeL and sibling. • If merge occurred, must delete entry (pointing to L or sibling) from parent of L. • Tree shrink: gets narrower and (sometimes) one level lower at top. Luke Huan Univ. of Kansas

Root 17 24 30 5 13 Root 39* 2* 3* 19* 20* 22* 24* 27* 38* 5* 7* 8* 29* 33* 34* 14* 16* 30 24 13 17 39* 22* 24* 27* 38* 3* 5* 19* 20* 29* 33* 34* 2* 7* 14* 16* Example B+ Tree - Inserting 8* Notice that root was split, leading to increase in height. In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice.

Root Root 17 17 24 30 5 13 27 30 5 13 39* 2* 3* 19* 20* 22* 24* 27* 38* 5* 7* 8* 29* 33* 34* 14* 16* 39* 2* 3* 22* 24* 27* 29* 38* 5* 7* 8* 33* 34* 14* 16* Example Tree (including 8*) Delete 19* and 20* ... • Deleting 19* is easy. • Deleting 20* is done with re-distribution. Notice how middle key is copied up.

... And Then Deleting 24* • Must merge. • Observe `toss’ of index entry (key 27 on right), and `pull down’ of index entry (below). 30 39* 22* 27* 38* 29* 33* 34* Root 5 13 17 30 39* 3* 22* 38* 2* 5* 7* 8* 27* 33* 34* 29* 14* 16*

Introduction • Hash-based indexes • are best for equality selections • cannot support range searches • A good hash function is: • Random • Uniform • Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees. Chen Qian @ University of Kentucky

Static Hashing • Each record (entry) has a key k • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. • h(k) mod N = bucket to which data entry withkey k belongs. (N = # of buckets) 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages Chen Qian @ University of Kentucky

Static Hashing (Contd.) • Buckets contain data records. • Hashing works on search key field of record r. Must distribute values over range 0 ... N-1. • h(key) = a binary number with 32/64/128/256 bits. • h(k) mod N is a way to distribute values • Long overflow chains can develop and degrade performance. • Extendible and LinearHashing: Dynamic techniques to fix this problem. Chen Qian @ University of Kentucky

Extendible Hashing • Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? • Reading and writing all pages is expensive! • Idea: Use directory of pointers to buckets • double # of buckets by doubling the directory, • splitting just the bucket that overflowed! • Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. Nooverflowpage! Chen Qian @ University of Kentucky

Example LOCAL DEPTH 2 Bucket A • Directory is array of size 4. • To find bucket for r, take last `global depth’ # bits of h(r). • If h(5) = binary(5) = 101, it is in bucket pointed to by 01. 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES

LOCAL DEPTH 2 Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES • Insert: If bucket is full, splitit (allocate new page, re-distribute). • If necessary, double the directory. (As we will see, splitting a • bucket does not always require doubling; we can tell by • comparing global depth with local depth for the split bucket.)

LOCAL DEPTH 2 Bucket A • Suppose now insert 20, whose hash is 10100. • Bucket A is full. • Split A to A1 and A2 • A1: values ending with 000 • A2: values ending with 100 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES

Insert h(r)=20 (Causes Doubling) 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 16* 32* 32* 16* GLOBAL DEPTH Bucket A GLOBAL DEPTH 2 2 2 3 Bucket B 5* 21* 13* 1* 00 1* 5* 21* 13* 000 Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* Bucket C 011 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY 12* 20* Bucket A2 4* (`split image' of Bucket A) (`split image' of Bucket A) Chen Qian @ University of Kentucky

Points to Note • 20 = binary 10100. • Last 2 bits (00) tell us r belongs in A or others. • Last 3 bits needed to tell us r belongs in A1 or A2. • Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to. • Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket. • When does bucket split cause directory doubling? • Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and `fixing’ pointer to split image page. Chen Qian @ University of Kentucky

Comments on Extendible Hashing • Static hash: one disk access for equality search • If directory fits in memory, equality search answered with one disk access; else two. • 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; • chances are high that directory will fit in memory. • Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large. • Multiple entries with same hash value cause problems! Chen Qian @ University of Kentucky

Linear Hashing • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • Linear hashing handles the problem of long overflow chains without using a directory, and handles duplicates. • Idea: Use a family of hash functions h0, h1, h2, ... • hi(key) = h(key) mod(2iN); N = initial # buckets • h is some hash function (range is not 0 to N-1) • h0 ranges in 0 to N-1. (If N = 2d, looking at the last d bits) • h1 ranges in 0 to 2N-1. (looking at the last d+1 bits) • hi+1 doubles the range of hi (similar to directory doubling) Chen Qian @ University of Kentucky

Linear Hashing (Contd.) • Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin. • Splitting proceeds in `rounds’. Round ends when all NRinitial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to NR yet to be split. • Current round number is Level. • Search:To find bucket for data entry r, findhLevel(r): • If hLevel(r) in range `Next to NR’, r belongs here. • Else, r could belong to bucket hLevel(r) or bucket hLevel(r) + NR; must apply hLevel+1(r) to find out. Chen Qian @ University of Kentucky

Overview of LH File • In the middle of a round. Buckets split in this round: Bucket to be split If ( h search key value ) Level Next is in this range, must use h ( search key value ) Level+1 Buckets that existed at the to decide if entry is in beginning of this round: `split image' bucket. this is the range of h Level `split image' buckets: created (through splitting of other buckets) in this round Chen Qian @ University of Kentucky

Linear Hashing (Contd.) • Insert: Find bucket by applying hLevel and hLevel+1: • If bucket to insert into is full: • Add overflow page and insert data entry. • Split Next bucket and increment Next. • Since buckets are split round-robin, long overflow chains don’t develop! Chen Qian @ University of Kentucky

Example of Linear Hashing • Now adding 43 Level=0, N=4 Level=0 PRIMARY h h OVERFLOW h h PRIMARY PAGES 0 0 1 1 PAGES PAGES Next=0 32* 32* 44* 36* 000 00 000 00 Next=1 Data entry r 9* 5* 9* 5* 25* 25* with h(r)=5 001 001 01 01 30* 30* 10* 10* 14* 18* 14* 18* Primary 10 10 010 010 bucket page 31* 35* 7* 31* 35* 7* 11* 11* 43* 011 011 11 11 100 44* 36* 00

Example: End of a Round Level=1 PRIMARY OVERFLOW h h PAGES 0 1 PAGES Next=0 Level=0 00 000 32* PRIMARY OVERFLOW PAGES h PAGES h 1 0 001 01 9* 25* 32* 000 00 10 010 50* 10* 18* 66* 34* 9* 25* 001 01 011 11 35* 11* 43* 66* 10 18* 10* 34* 010 Next=3 100 00 44* 36* 43* 11* 7* 31* 35* 011 11 101 11 5* 29* 37* 44* 36* 100 00 14* 22* 30* 110 10 5* 37* 29* 101 01 14* 30* 22* 31* 7* 11 111 110 10 Chen Qian @ University of Kentucky

LH Described as a Variant of EH • The two schemes are actually quite similar: • Begin with an EH index where directory has N elements. • Use overflow pages, split buckets round-robin. • So, directory can double gradually. Also, primary bucket pages are created in order. • If they are allocated in sequence too (so that finding i’th is easy), we actually don’t need a directory! Voila, LH. Chen Qian @ University of Kentucky

Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) • Directory to keep track of buckets, doubles periodically. • Can get large with skewed data; additional I/O if this does not fit in main memory. Chen Qian @ University of Kentucky

Summary (Contd.) • Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages. • Overflow pages not likely to be long. • Duplicates handled easily. • Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas. • For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed! Chen Qian @ University of Kentucky

CS 405G: Introduction to Database Systems