Hashing on the Disk
Learn about hashing on disk, efficient storage, access strategies, address space utilization, categories of methods, static and dynamic hashing schemes, and various hash functions for optimal data organization.
Hashing on the Disk
• Keys are stored in “disk pages” (“buckets”); several records fit within one page
• Retrieval:
  • find the address of the page
  • bring the page into main memory
  • searching within the page comes for free
[Diagram: a hash function maps the key space Σ onto data pages 0 … m−1, each of size b]
• Page size b: maximum number of records in a page
• Space utilization u: measure of how full the allocated space is
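The mapping above can be sketched in a few lines. This is a minimal illustration (the names `page_address`, the sample keys, and the values of m and b are ours, not from the slides): keys hash to one of m pages of capacity b, and u measures how much of the allocated capacity holds records.

```python
# Illustrative sketch: m pages ("buckets"), each holding up to b records.
m, b = 4, 3
pages = [[] for _ in range(m)]

def page_address(key):
    # the hash function maps a key to one of the m page addresses
    return key % m

for key in [10, 23, 7, 15, 8]:
    pages[page_address(key)].append(key)

# space utilization u: stored records / allocated capacity
u = sum(len(p) for p in pages) / (m * b)
```

With these five keys, 5 of the 12 record slots are used, so u ≈ 0.42.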
Collisions
[Diagram: a full page with a chain of overflow records]
• Keys that hash to the same address are stored within the same page
• If the page is full, either:
  • page splits: allocate a new page and divide the contents between the old and the new page, or
  • overflows: keep a list of overflow pages
Access Time
• Goal: find a key in one disk access
• Access time ~ number of disk accesses
• Large u: good space utilization, but many overflows or splits => more disk accesses
• Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses
Categories of Methods
• Static: require file reorganization
  • open addressing, separate chaining
• Dynamic: the file grows dynamically, adapting to the data size
  • dynamic hashing, extendible hashing, linear hashing, spiral storage…
Static Hashing Schemes
• Separate chaining for disk:
  • one index file: an array of pointers to lists of disk pages
  • a data file: the disk pages holding the keys mapped to the same address
• Open addressing: one file with pages holding the keys mapped to the same address
• Performance drops as u increases
Dynamic Hashing Schemes
• File size adapts to data size without total reorganization
• Typically 1–3 disk accesses to access a key
• Access time and u are a typical trade-off
• u between 50–100% (typically ~69%)
• More complicated implementation
Schemes With Index
[Diagram: an index pointing to the data pages]
• Two disk accesses: one to access the index, one to access the data
  • with the index in main memory => one disk access
• Problem: the index may become too large
• Dynamic hashing (Larson 1978)
• Extendible hashing (Fagin et al. 1979)
Schemes Without Index
[Diagram: the address space maps directly to the data space]
• Ideally, less space and fewer disk accesses (at least one)
• Linear hashing (Litwin 1980)
• Linear hashing with partial expansions (Larson 1980)
• Spiral storage (Martin 1979)
Hash Functions
• Support for a shrinking or growing file: as the address space shrinks or grows, the hash function adapts to these changes
• Hash functions using the first (or last) bits of key = b_{n-1} b_{n-2} … b_i b_{i-1} … b_2 b_1 b_0
• h_i(key) = b_{i-1} … b_2 b_1 b_0 supports 2^i addresses
• h_i: one more bit than h_{i-1}, to address larger files
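The bit-suffix family h_i can be sketched with a bitmask (the function name `h` is ours, for illustration): h_i keeps the last i bits of the key, so it addresses 2^i pages, and it agrees with h_{i-1} on the low-order i−1 bits.

```python
def h(i, key):
    # h_i(key) = b_{i-1} ... b_1 b_0: the last i bits of the key
    return key & ((1 << i) - 1)
```

For key = 13 = 0b1101: h(2, 13) = 0b01 = 1 and h(3, 13) = 0b101 = 5; the extra bit of h_3 lets the same family address a file twice as large.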
Dynamic Hashing (Larson 1978)
[Diagram: 1st level h1(k) selects a hash-table entry; 2nd level h2(k) walks a binary-tree index down to data pages of size b]
• Two-level index
• Primary h1(key): accesses a hash table
• Secondary h2(key): accesses a binary tree
Index
• Fixed (static): h1(key) = key mod m
• Dynamic behavior on the secondary index
  • h2(key) uses i bits of the key
  • the bit sequence h2 = b_{i-1} … b_2 b_1 b_0 denotes which path of the binary-tree index to follow in order to access the data page
  • scan h2 from right to left (bit 1: follow the right path, bit 0: follow the left path)
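The right-to-left scan can be sketched as follows. The data layout is an assumption for illustration (internal node = (left, right) pair, leaf = page name); the slides do not prescribe a representation.

```python
def find_page(node, h2, depth):
    # Scan the bits of h2 from right (least significant) to left:
    # bit 0 -> follow the left child, bit 1 -> follow the right child.
    for j in range(depth):
        bit = (h2 >> j) & 1
        node = node[bit]
    return node

# depth-2 binary tree over 4 data pages (names are illustrative)
tree = (("A", "B"), ("C", "D"))
```

For h2 = “01”, the lookup first follows the right branch (b0 = 1), then the left branch (b1 = 0), reaching page "C".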
[Example: h1(key) = key mod 6 selects one of the index entries 0–5; depth of the binary tree = 2. Entry 1 leads to three pages: h2 = “…0”, h2 = “01”, and h2 = “11”; entry 5 leads to a single page (any h2). A key with h2 = “01” follows the right branch (b0 = 1), then the left branch (b1 = 0).]
Insertions
[Diagram: index entry 1 first points to one page (h1 = 1, h2 = any); after the page fills, it splits into pages h1 = 1, h2 = “0” and h1 = 1, h2 = “1”]
• Initially a fixed-size primary index and no data
• Insert a record in a new page under its h1 address
• If the page is full, allocate one extra page
  • split the keys between the old and the new page
  • use one extra bit of h2 for addressing
[Example: successive splits. Start: pages (h1 = 0, h2 = any) and (h1 = 3, h2 = any). After the page of h1 = 0 overflows: (h1 = 0, h2 = “0”), (h1 = 0, h2 = “1”), (h1 = 3, h2 = any). After further overflows: (h1 = 0, h2 = “0”), (h1 = 0, h2 = “01”), (h1 = 0, h2 = “11”), (h1 = 3, h2 = “0”), (h1 = 3, h2 = “1”).]
Deletions
• Find the record to be deleted using h1, h2
• Delete the record
• Check the “sibling” page:
  • fewer than b records in the two pages combined?
  • if yes, merge the two pages
  • delete the emptied page
  • shrink the binary-tree index by one level and reduce h2 by one bit
[Diagram: a deletion leaves two sibling pages whose contents fit in one page; the pages are merged and the tree shrinks by one level]
Extendible Hashing (Fagin et al. 1979)
• Dynamic hashing without the binary-tree index
• Primary hashing is omitted
• Only secondary hashing, with all binary trees at the same level
• The index shrinks and grows according to the file size
• Data pages are attached to the index
[Diagram: flattening dynamic hashing. When all binary trees are kept at the same level, the leaves form a directory addressed by the number of address bits: entries 00, 01, 10, 11.]
Insertions
[Diagram: initially one index entry pointing to one data page of size b]
• Initially 1 index entry and 1 data page, 0 address bits
• Insert records into the data page
• Local depth l: number of address bits of a page
• Global depth d: the index has size 2^d
Page “0” Overflows
[Diagram: after the overflow, global depth d = 1 and local depth l = 1; index entries “0” and “1” each point to their own page]
Page “0” Overflows (cont.)
• 1 more key bit for addressing and 1 extra page => the index doubles!!
• Split the contents of the previous page between 2 pages according to the next bit of the key
• Global depth d: number of index bits => index size 2^d
• Local depth l: number of bits used for record addressing
Page “0” Overflows (again)
[Diagram: d = 2, index entries 00, 01, 10, 11. Split pages with l = 2 contain records sharing the same 2 key bits; a page with l = 1 contains records sharing only the 1st bit and is pointed to by two index entries.]
Page “01” Overflows
[Diagram: d = 3, index entries 000–111; 1 more key bit for addressing. 2^(d−l) index entries point to each page.]
Page “100” Overflows
[Diagram: d = 3; page “100” with l = 2 splits into two pages with l = 3]
• No need to double the index
• Page 100 splits into two (1 new page)
• Local depth l is increased by 1
Insertion Algorithm
• If l < d, split the overflowed page (1 extra page)
• If l = d => the index is doubled, the page is split
  • d is increased by 1 => 1 more bit for addressing
• Update the pointers (either way):
  • if d prefix bits are used for addressing:
    d = d+1; for (i = 2^d − 1; i >= 0; i--) index[i] = index[i/2];
  • if d suffix bits are used:
    for (i = 0; i < 2^d; i++) index[i + 2^d] = index[i]; d = d+1;
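The two pointer-update rules can be sketched functionally (returning a new directory rather than updating in place, which is an implementation choice of ours):

```python
def double_prefix(index):
    # prefix addressing: new entry i points where old entry i // 2 pointed,
    # so each old entry is duplicated into two consecutive slots
    return [index[i // 2] for i in range(2 * len(index))]

def double_suffix(index):
    # suffix addressing: new entry i + 2**d aliases old entry i,
    # so the new directory is the old one repeated
    return index + index
```

Either way, the two directory slots that now share a page differ only in the newly added address bit, so a later split of that page only needs to redirect one of them.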
Deletion Algorithm
• Find and delete the record
• Check the sibling page
• If fewer than b records in the two pages combined:
  • merge the pages and free the empty page
  • decrease the local depth l by 1 (records in the merged page have 1 less common bit)
  • if l < d everywhere => reduce the index to half its size
  • update the pointers
[Diagram: delete with merging. Two sibling pages with l = 3 merge into one page with l = 2; once l < d holds everywhere, the index halves from entries 000–111 (d = 3) to 00–11 (d = 2).]
Observations
• If a page splits and more than b keys share the same next bit:
  • take one more bit for addressing (increase l)
  • if d = l, the index doubles again!!
• Hashing might fail for non-uniform key distributions (e.g., multiple keys with the same value)
  • if the distribution is known, transform it to uniform
• Dynamic hashing performs better for non-uniform distributions (it is affected only locally)
Performance
• For n records and page size b: expected size of the index (Flajolet)
• 1 disk access per retrieval when the index is in main memory
• 2 disk accesses when the index is on disk
• Overflows increase the number of disk accesses
Storage Utilization with Page Splitting
[Diagram: a full page of b records before splitting; two half-full pages after splitting]
• In general 50% < u < 100%
• On average u ≈ ln 2 ≈ 69% (no overflows)
Storage Utilization with Overflows
• Achieves higher u and postpones doubling the index (when d = l)
• Higher u is achieved with small overflow pages
  • u = 2b/3b ≈ 66% after splitting
  • small overflow pages (e.g., b/2) => u = (b + b/2)/2b = 75%
• Double the index only if the overflow page itself overflows!!
Linear Hashing (Litwin 1980)
• Dynamic scheme without an index
• Hash values refer directly to page addresses
• Overflows are allowed
• The file grows one page at a time
• The page that splits is not always the one that overflowed
• The pages split in a predetermined order
Linear Hashing (cont.)
[Diagram: a file of n pages of size b; pointer p marks the next page to split]
• Initially n empty pages
• p points to the page that splits next
• Overflows are allowed
File Growing
• A page splits whenever the “splitting criterion” is satisfied
  • a page is added at the end of the file
  • the pointer p advances to the next page
  • the contents of the old page are split between the old and the new page based on the key values
[Example: b = b_page = 4, b_overflow = 1; initially n = 5 pages (0–4) holding the keys 125, 320, 90, 435, 16, 711, 402, 27, 737, 712, 613, 303, 4, 319; p points to page 0]
• Hash function h0 = k mod 5
• Splitting criterion: u > A%
• Alternatively, split when an overflow page overflows, etc.
• New elements arrive: 438, 215, 522
[Example (cont.): after the split, page 0 holds 320, 90 and the new page 5 holds 125, 435, 215]
• Page 5 is added at the end of the file
• The contents of page 0 are split between pages 0 and 5 based on the hash function h1 = key mod 10
• p points to the next page
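The split of page 0 from the example can be reproduced directly (a sketch, assuming b = 4 so that page 0's fifth key triggers the split):

```python
# Keys from the example, h0 = k mod 5, then page 0 splits via h1 = k mod 10.
keys = [125, 320, 90, 435, 16, 711, 402, 27, 737, 712,
        613, 303, 4, 319, 438, 215, 522]
pages = {i: [k for k in keys if k % 5 == i] for i in range(5)}

old = pages[0]                              # [125, 320, 90, 435, 215]
pages[0] = [k for k in old if k % 10 == 0]  # keys staying in page 0
pages[5] = [k for k in old if k % 10 == 5]  # keys moving to new page 5
p = 1                                       # the split pointer advances
```

Every key of the old page 0 has h0 = 0, so its last decimal digit is 0 or 5; h1 = k mod 10 separates the two cases cleanly.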
Hash Functions
• Initially h0 = key mod n
• As new pages are added at the end of the file, h0 alone becomes insufficient
• The file will eventually double its size
  • in that case use h1 = key mod 2n
• In the meantime:
  • use h0 for pages not yet split
  • use h1 for pages that have already split
• Split the contents of the page pointed to by p based on h1
Hash Functions (cont.)
• When the file has doubled its size, h0 is no longer needed
  • set h0 = h1 and continue (e.g., h0 = k mod 10)
• The file will eventually double its size again
• Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)
Hash Functions (cont.)
• Initially n pages and 0 <= h0(k) <= n−1
• Series of hash functions: h_i(k) = k mod (2^i · n)
• Selection of hash function: if h_i(k) >= p then use h_i(k), else use h_{i+1}(k)
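The selection rule can be sketched as a single address function (the name `address` and the parameter order are ours): an address below the split pointer p belongs to a page that has already split, so the finer hash h_{i+1} must be used.

```python
def address(k, n, i, p):
    # level-i hash over 2**i * n pages: h_i(k) = k mod (2**i * n)
    a = k % (2**i * n)
    if a >= p:
        return a          # page not yet split: h_i is correct
    return k % (2**(i + 1) * n)   # page already split: rehash with h_{i+1}
```

With n = 5, i = 0, p = 1 (the state after the example's first split): key 320 falls below p under h0 and is rehashed to 320 mod 10 = 0, while key 16 keeps its h0 address 1.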
Linear Hashing with Partial Expansions (Larson 1980)
• Problem with linear hashing: pages to the right of p are split late
  • long chains of overflow pages on the rightmost pages
• Solution: do not wait that long to split a page
• k partial expansions: take the pages in groups of k
  • all k pages of a group split together
  • the file grows at a lower rate
Two Partial Expansions
[Diagram: 2 pointers address the pages of the same group, at positions j and j + n; new pages are added from position 2n]
• Initially 2n pages, n groups, 2 pages per group
• Groups: (0, n), (1, 1+n), …, (i, i+n), …, (n−1, 2n−1)
• Pages in the same group split together => some records go to a new page at the end of the file (position: 2n)
1st Expansion
• After n splits, all pages have been split
  • the file has 3n pages (1.5 times larger)
  • the file grows at a lower rate
• After the 1st expansion, take the pages in groups of 3: (j, j+n, j+2n), 0 <= j < n
2nd Expansion
[Diagram: pointers to the pages of the same group; pages 0 … 2n, new pages added up to 4n]
• After n more splits the file has size 4n
• Repeat the same process, starting with 4n pages in 2n groups
[Plot: disk accesses per retrieval vs. relative file size (1 to 2). Plain linear hashing climbs toward ~1.6 accesses; linear hashing with 2 partial expansions stays lower, around 1.1–1.3.]

Measured costs (b = 5, b’ = 5, u = 0.85):

            Linear Hashing   2 part. exp.   3 part. exp.
retrieval   1.17             1.12           1.09
insertion   3.57             3.21           3.31
deletion    4.04             3.53           3.56
Dynamic Hashing Schemes
• Very good performance on membership, insert, and delete operations
• Suitable for both main memory and disk
  • b = 1–3 records for main memory
  • b = 1–4 KB pages for disk
• Critical parameter: space utilization u
  • large u => more overflows, worse performance
  • small u => fewer overflows, better performance
• Suitable for direct-access queries (random accesses) but not for range queries