CS 277: Database System Implementation

CS 277: Database System Implementation Notes 5: Hashing and More Arthur Keller Notes 5

Hashing key  h(key) <key> Buckets (typically 1 disk block) . . . Notes 5

Two alternatives . . . records (1) key  h(key) . . . Notes 5

Two alternatives record (2) key  h(key) key 1 Index • Alt (2) for “secondary” search key Notes 5

Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h: add x1 + x2 + ….. xn • compute sum modulo b Notes 5

 This may not be best function …  Read Knuth Vol. 3 if you really need to select a good function. Good hash  Expected number of function: keys/bucket is the same for all buckets Notes 5

Within a bucket: • Do we keep keys sorted? • Yes, if CPU time critical & Inserts/Deletes not too frequent Notes 5

Next: example to illustrate inserts, overflows, deletes h(K) Notes 5

d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 Notes 5

d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g Notes 5

If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit Notes 5

Extensible • Linear How do we cope with growth? • Overflows and reorganizations • Dynamic hashing Notes 5

Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K)  use i grows over time…. 00110101 Notes 5

(b) Use directory h(K)[i ] to bucket . . . . . . Notes 5

i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/bucket 1 0001 i = 1 1001 1100 Insert 1010 Notes 5

2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 Notes 5

i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 Notes 5

Extensible hashing: deletion • No merging of blocks • Merge blocks and cut directory if possible (Reverse insert procedure) Notes 5

Deletion example: • Run thru insert example in reverse! Notes 5

Indirection (Not bad if directory in memory) Directory doubles in size (Now it fits, now it does not) - - Summary Extensible hashing + Can handle growing files - with less wasted space - with no full reorganizations Notes 5

Two ideas: b (a) Use ilow order bits of hash 01110101 grows i (b) File grows linearly Linear hashing • Another dynamic hashing scheme Notes 5

0101 • can have overflow chains! If h(k)[i ]  m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ]- 2i -1 Rule Exampleb=4 bits, i =2, 2 keys/bucket • insert 0101 00 01 10 11 Future growth buckets 0000 0101 1010 1111 m = 01 (max used block) Notes 5

0101 • insert 0101 1010 1111 0101 10 11 Exampleb=4 bits, i =2, 2 keys/bucket 00 01 10 11 Future growth buckets 0000 0101 1010 1111 m = 01 (max used block) Notes 5

3 0101 0101 101 100 0 0 0 0 100 101 110 111 100 101 Example Continued:How to grow beyond this? i = 2 00 01 10 11 0000 0101 1010 1111 0101 . . . m = 11 (max used block) Notes 5

 When do we expand file? • If U > threshold then increase m (and maybe i ) • Keep track of: # used slots total # of slots = U Notes 5

Can still have overflow chains - Summary Linear Hashing + Can handle growing files - with less wasted space - with no full reorganizations No indirection like extensible hashing + Notes 5

Example: BAD CASE Very full Very empty Need to move m here… Would waste space... Notes 5

Summary Hashing - How it works - Dynamic hashing - Extensible - Linear Notes 5

Next: • Indexing vs Hashing • Index definition in SQL • Multiple key access Notes 5

Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Notes 5

Indexing vs Hashing • INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Notes 5

Index definition in SQL • Createindex name on rel (attr) • Createuniqueindex name on rel (attr) defines candidate key • Drop INDEX name Notes 5

Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL... Notes 5

Note ATTRIBUTE LIST MULTIKEY INDEX (next) e.g., CREATEINDEX foo ON R(A,B,C) Notes 5

Multi-key Index Motivation: Find records where DEPT = “Toy” AND SAL > 50k Notes 5

Strategy I: • Use one index, say Dept. • Get all Dept = “Toy” records and check their salary I1 Notes 5

Strategy II: • Use 2 Indexes; Manipulate Pointers Toy Sal > 50k Notes 5

Strategy III: • Multiple Key Index One idea: I2 I3 I1 Notes 5

Art Sales Toy Example 10k 15k Example Record Dept Index Salary Index 17k 21k Name=Joe DEPT=Sales SAL=15k 12k 15k 15k 19k Notes 5

For which queries is this index good? Find RECs Dept = “Sales” SAL=20k Find RECs Dept = “Sales” SAL > 20k Find RECs Dept = “Sales” Find RECs SAL = 20k Notes 5

Interesting application: • Geographic Data DATA: <X1,Y1, Attributes> <X2,Y2, Attributes> y x . . . Notes 5

Queries: • What city is at <Xi,Yi>? • What is within 5 miles from <Xi,Yi>? • Which is closest point to <Xi,Yi>? Notes 5

40 30 10 20 20 m g b n h o d e a k c f j l i 10 25 15 35 20 10 20 5 d e h i g f c a b 15 15 • Search points near f • Search points near b n o j k m l Example Notes 5

Queries • Find points with Yi > 20 • Find points with Xi < 5 • Find points “close” to i = <12,38> • Find points “close” to b = <7,24> Notes 5

Many types of geographic index structures have been suggested • Quad Trees • R Trees Notes 5

Two more types of multi key indexes • Grid • Partitioned hash Notes 5

Grid Index Key 2 X1 X2 …… Xn V1 V2 Key 1 Vn To records with key1=V3, key2=X2 Notes 5

CLAIM • Can quickly find records with • key 1 = Vi Key 2 = Xj • key 1 = Vi • key 2 = Xj • And also ranges…. • E.g., key 1  Vi key 2 < Xj Notes 5

V1 V2 V3 X1 X1 X1 Like Array... X2 X2 X2 X3 X3 X3 X4 X4 X4 • How is Grid Index stored on disk?  But there is a catch with Grid Indexes! Problem: • Need regularity so we can compute position of <Vi,Xj> entry Notes 5

Solution: Use Indirection X1 X2 X3 Buckets V1 V2 V3 *Grid only V4 contains pointers to buckets Buckets -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Notes 5

CS 277: Database System Implementation