Principles of Database Management Systems 4.2: Hashing Techniques

Principles of Database Management Systems4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) Notes 4.2: Hashing

Hashing? • Locating the storage block of a record by the hash valueh(k) of its key k • Normally really fast • records (often) located by a single disk access Notes 4.2: Hashing

Hashing key  h(key) <key> Buckets (typically 1 disk block) . . . Notes 4.2: Hashing

(1) Hash value determines the storage block directly Two alternatives . . . records key  h(key) . . . • to implement a primary index Notes 4.2: Hashing

Two alternatives (2) Records located indirectly via index buckets record key  h(key) key 1 Index • for a secondary index Notes 4.2: Hashing

Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h = (x1 + x2 + … + xn) mod b  {0, 1, …, b-1} Notes 4.2: Hashing

Good hash Expected number of function: keys/bucket is the same for all buckets  This may not be best function …  Read Knuth Vol. 3 if you really need to select a good function. Notes 4.2: Hashing

Next: example to illustrate inserts, overflows, deletes h(K) Notes 4.2: Hashing

d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 Notes 4.2: Hashing

d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g Notes 4.2: Hashing

If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit Notes 4.2: Hashing

Extensible • Linear • also others ... How do we cope with growth? • Overflows and reorganizations • Dynamic hashing: # of buckets may vary Notes 4.2: Hashing

Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K)  use i grows over time…. For example, b=32 00110101 Notes 4.2: Hashing

(b) Use directory h(K)[i ] to bucket . . . . . . Directory contains 2i pointers to buckets, and stores i. Each bucket stores j, indicating #bits used for placing the records in this block (j  i) Notes 4.2: Hashing

Extensible Hashing: Insertion • If there's room in bucket h(k)[i], place record there; Otherwise … • If j=i, set i=i+1 and double the directory • If j<i, split the block in two, distribute records among them now using j+1 bits of h(k); (Repeat until some records end up in the new bucket); Update pointers of bucket array • See the next example Notes 4.2: Hashing

i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/block (j) 1 0001 i = 1 1001 1100 Insert 1010 Notes 4.2: Hashing

2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 Notes 4.2: Hashing

i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 Notes 4.2: Hashing

Extensible hashing: deletion • Reverse insert procedure … • Example: Walk thru insert example in reverse! Notes 4.2: Hashing

Indirection (Not bad if directory in memory) Directory doubles in size (First it fits in memory, then it does not -> sudden performance degradation) - - Summary Extensible hashing + Can handle growing files - without full reorganizations Only one data block examined + Notes 4.2: Hashing

Two ideas: b (a) Use ilow order bits of hash 01110101 grows i (b) File grows linearly Linear hashing: grow # of buckets by one -> No bucket directory needed Notes 4.2: Hashing

Linear Hashing: Parameters • n: number of buckets in use • buckets numbered 0…n-1 • i: number of bits of h(k) used to address buckets • r: number of records in hash table • ratio r/n limited to fit an avg bucket in a block • next example: r  1.7n, and block holds 2 records=> AVG bucket occupancy is  1.7/2 = 0.85 of a block Notes 4.2: Hashing

If h(k)[i ] = (a1 … ai)2< n, then look at bucket h(k)[i ]; else look at bucket h(k)[i ]- 2i -1 = (0a2 … ai)2 Rule Example:2 keys/block, b=4 bits, n=2, i =1 • insert 0101 00 01 • now r=4 >1.7n -> get new bucket 10and distribute keys btw buckets 00 and 10 0000 0101 1010 1111 Notes 4.2: Hashing

1010 -> n=3, i =2; distribute keys btw buckets 00 and 10: 000110 0000 0101 1010 1111 Notes 4.2: Hashing

0001 n=3, i =2; insert 0001: • can have overflow chains! 000110 0000 0101 1010 1111 Notes 4.2: Hashing

n=3, i =2 0001 • insert 0111 0111 000110 • bucket 11 not in use-> redirect to 01 0000 0101 1010 1111 • now r=6 > 1.7n-> get new bucket 11 Notes 4.2: Hashing

1111 0111 0001 n=4, i =2; distribute keys btw 01 and 11 0001 0111 00011011 0000 0101 1010 1111 Notes 4.2: Hashing

3 0101 0101 101 100 0 0 0 0 101 110 111 100 101 Example Continued:How to grow beyond this? i = 2 00 01 10 11 0000 0101 1010 1111 0101 . . . m = 11 (max used block) Notes 4.2: Hashing

Summary Linear Hashing + Can handle growing files - without full reorganizations No indirection directory of extensible hashing Can have overflow chains - but probability of long chains can be kept low by controlling the r/n fill ratio (?) + - Notes 4.2: Hashing

Summary Hashing - How it works - Dynamic hashing - Extensible - Linear Notes 4.2: Hashing

Next: • Indexing vs Hashing • Index definition in SQL Notes 4.2: Hashing

Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Notes 4.2: Hashing

Indexing vs Hashing • INDEXING (Including B-Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Notes 4.2: Hashing

Index definition in SQL • Createindex name on rel (attr) • Createuniqueindex name on rel (attr) defines candidate key • Drop INDEX name Notes 4.2: Hashing

Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL … Oracle and IBM DB2 UDB provide a PCTFREE clause to inditate the proportion of B-tree blocks initially left unfilled Oracle: “Hash clusters” with built-in or DBA-specified hash function Notes 4.2: Hashing

The BIG picture…. • Chapters 2 & 3: Storage, records, blocks... • Chapter 4: Access Mechanisms - Indexes - B trees - Hashing • Chapters 6 & 7: Query Processing NEXT Notes 4.2: Hashing

Principles of Database Management Systems 4.2: Hashing Techniques

Principles of Database Management Systems 4.2: Hashing Techniques

Presentation Transcript

Principles of Database Management Systems

LIS 384K.11 Database-Management Principles and Applications

Principles of Database Management Systems 4.2: Hashing Techniques

CS 4 40 : Database Management Systems

Chapter 9 Database Management Systems

Database Management Systems

Perfect Hashing in Theory and Practice

CS 257 Database Systems Principles Assignment 2

CS 540 Database Management Systems

CS 540 Database Management Systems

Future of Database Systems

Database Principles

Database Management Systems

Welcome to CSE3180 ‘Principles of Database Systems’ First Semester 2005

DATABASE MANAGEMENT SYSTEMS

Database Management Systems

Database Management

Principles of Database Management Systems

CS 245: Database System Principles Notes 4: Indexing

Databases and Database Management Systems

CS511 Design of Database Management Systems