More Hashing

Hashing Part Two: Better Collision Resolution. Small parts of this material stolen from "File Organization and Access" by Austing and Cassel.

Recap of Last Class • Hash function converts key to file address • Collision is when two or more keys hash to the same address • Collision Avoidance • Good Hash Function spreads out the keys evenly along the whole address space • Non-Dense File decreases chance of collisions and decreases probes after a collision

Recap: Linear Probing • Very simple collision resolution • if H(key) = A, and A is already used, try A+1, then A+2, etc • Advantages • easy to implement • guaranteed to use all addresses • Disadvantages • clustering / clumping

Clumping • Given the following hashes and linear probing: • adams = 20 • bates = 22 • cole = 20 • dean = 21 • evans = 23 • Result of either • poor hash function • dense file
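
The clumping example can be traced with a minimal linear-probing sketch. The home addresses come from the slide; the table size of 30 and the helper name `insert_linear` are assumptions made only for illustration.

```python
# Home addresses are taken from the slide; TABLE_SIZE is an assumed value.
TABLE_SIZE = 30
HOME = {"adams": 20, "bates": 22, "cole": 20, "dean": 21, "evans": 23}

def insert_linear(table, key, home):
    """Store key at its home address, or at the next free address after it.

    Returns (address used, number of probes); the home address is probe 1.
    """
    addr = home
    probes = 1
    while table[addr] is not None:
        addr = (addr + 1) % len(table)  # try A+1, then A+2, ...
        probes += 1
        if probes > len(table):
            raise RuntimeError("table is full")
    table[addr] = key
    return addr, probes

table = [None] * TABLE_SIZE
placed = {key: insert_linear(table, key, home) for key, home in HOME.items()}
for key, (addr, probes) in placed.items():
    print(f"{key:6s} home={HOME[key]} stored={addr} probes={probes}")
```

Only cole shares a home address with another key, yet dean and evans are also pushed off their home addresses because cole's overflow occupies 21. That knock-on effect is the clump.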

Random Probing • Instead of adding 1, spread out by a pseudo-random amount • True random would not work: a later search must retrace the same probe sequence to find the record. Instead use pseudo-random.
While A is in use: A = (A + R) mod T
where A = address, R = prime, T = table size

Example with R = 5 • adams = 20 • bates = 22 • cole = 20 • dean = 21 • evans = 23 • Cole collides with Adams at 20 and probes 25 next. But what if 25 and 30 already had keys directly hashed to those locations? Cole would be at 35, 4 probes away (20, 25, 30, 35).
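
A minimal sketch of the probe rule A = (A + R) mod T, using R = 5 as implied by the 25, 30, 35 sequence. The table size T = 41 and the placeholder keys at 25 and 30 are assumptions chosen for illustration; any T that shares no factor with R lets the probe sequence reach every address.

```python
TABLE_SIZE = 41  # assumed T; shares no factor with STEP
STEP = 5         # R, the fixed step size

def insert_stepped(table, key, home, step=STEP):
    """Probe home, home+R, home+2R, ... (mod T) until a free address is found.

    Returns (address used, number of probes); the home address is probe 1.
    """
    addr = home
    probes = 1
    while table[addr] is not None:
        addr = (addr + step) % len(table)
        probes += 1
        if probes > len(table):
            raise RuntimeError("no free address reachable")
    table[addr] = key
    return addr, probes

table = [None] * TABLE_SIZE
table[25] = "other-1"  # placeholder keys whose home addresses are 25 and 30
table[30] = "other-2"

results = {}
for key, home in [("adams", 20), ("bates", 22), ("cole", 20),
                  ("dean", 21), ("evans", 23)]:
    results[key] = insert_stepped(table, key, home)

print(results["cole"])  # (35, 4): probes at 20, 25, 30, 35
```

Unlike linear probing, dean and evans stay at their home addresses here, but cole pays for the two occupied addresses along its own probe path.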

Chaining • Assuming a better hash function and less dense file are not options... • And assuming linear and random probing lead to coalesced lists... • Chaining: maintain a linked list of collisions, one head per address • Example, after addition of Adams and Cole, and R=5:
19 : null
20 : 35 -> null
21 : null
• Advantage: Faster at resolving collisions • Disadvantage: Space
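
One way to sketch the chain table from the example: each home address keeps a list of the addresses where its overflow records were stored. Python lists stand in for linked lists here, and the table size, the placeholder keys at 25 and 30, and the function names are all assumptions for illustration.

```python
TABLE_SIZE = 41  # assumed table size
STEP = 5         # R from the example

records = [None] * TABLE_SIZE             # the data file, one record per address
chains = [[] for _ in range(TABLE_SIZE)]  # chains[h]: overflow addresses for home h

def insert(key, home):
    """Store key at home if free; otherwise probe by STEP and link the slot to home."""
    if records[home] is None:
        records[home] = key
        return home
    addr = home
    for _ in range(TABLE_SIZE):
        addr = (addr + STEP) % TABLE_SIZE
        if records[addr] is None:
            records[addr] = key
            chains[home].append(addr)  # remember where the overflow went
            return addr
    raise RuntimeError("table is full")

def find(key, home):
    """Return (address, reads) for key, or (None, reads) if it is absent."""
    reads = 1
    if records[home] == key:
        return home, reads
    for addr in chains[home]:  # jump straight to each overflow address
        reads += 1
        if records[addr] == key:
            return addr, reads
    return None, reads

records[25] = "other-1"  # placeholder keys already stored at 25 and 30
records[30] = "other-2"
insert("adams", 20)
insert("cole", 20)

print(chains[19], chains[20], chains[21])  # [] [35] []
print(find("cole", 20))                    # (35, 2)
```

Finding cole now takes two reads (20, then 35) instead of retracing four probes, at the cost of storing the chain.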

Re-Cap from weeks ago • File Read Time = seek time + latency + data read time • Smallest Readable Portion = 1 cluster = 4KB (usually) • To access portion of a file, most of the time is in seek time and latency, not read time • so, number of file reads is more important than size of reads, until size gets really big • SO... reading a few records from a file takes no more time than reading just one record
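
A rough calculation makes the point concrete. Every drive figure below is an assumed, illustrative number, not a measurement; only the 4 KB cluster size comes from the recap.

```python
SEEK_MS = 9.0              # assumed average seek time
LATENCY_MS = 4.2           # assumed average rotational latency
TRANSFER_MB_PER_S = 100    # assumed sustained transfer rate
CLUSTER_BYTES = 4 * 1024   # 4 KB cluster, as in the recap

def read_time_ms(clusters):
    """File read time = seek time + latency + data read time, in milliseconds."""
    transfer_ms = clusters * CLUSTER_BYTES / (TRANSFER_MB_PER_S * 1_000_000) * 1000
    return SEEK_MS + LATENCY_MS + transfer_ms

one_cluster = read_time_ms(1)          # one read of 4 KB
four_clusters = read_time_ms(4)        # one read of 16 KB
four_separate = 4 * read_time_ms(1)    # four separate 4 KB reads

print(f"1 cluster, 1 read:   {one_cluster:.2f} ms")
print(f"4 clusters, 1 read:  {four_clusters:.2f} ms")
print(f"4 clusters, 4 reads: {four_separate:.2f} ms")
```

Under these assumptions, reading four clusters in one operation costs almost the same as reading one, while four separate reads cost about four times as much.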

Buckets • Given, collisions will occur... • Why not just read 2, or 3, or 4 records instead of just 1 on each read operation? • "Bucket" - a group of records at the same address • "Hash File of Buckets" - hashed keys collide to small arrays of records in the data file

Bucket Size? • use avg collisions and stddev? • if 1000 records and 200 addresses • then avg is 5.0 • but stddev might be 1.0 • start by determining how many records can fit in one or more disk clusters • then design a good hash function to match that address space
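
The sizing steps can be sketched as a small calculation. The function name, the 400-byte record size, and the 0.7 target packing density are assumptions for illustration; the 4 KB cluster is the figure used earlier in these slides.

```python
import math

def plan_buckets(num_records, record_bytes, cluster_bytes=4096,
                 clusters_per_bucket=1, packing_density=0.7):
    """Return (records per bucket, number of bucket addresses).

    packing_density is the assumed target fraction of record slots in use.
    """
    bucket_capacity = (cluster_bytes * clusters_per_bucket) // record_bytes
    if bucket_capacity == 0:
        raise ValueError("a record does not fit in one bucket")
    slots_needed = math.ceil(num_records / packing_density)
    num_addresses = math.ceil(slots_needed / bucket_capacity)
    return bucket_capacity, num_addresses

capacity, addresses = plan_buckets(num_records=1000, record_bytes=400)
print(capacity, addresses)  # 10 143
```

The hash function would then be designed to spread keys over that number of addresses.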

Advantages and Disadvantages of Buckets • Advantages: • Can achieve relatively fast access • Remember, the hash function tells us where the record is located, so only 1 read operation. And even with collisions, the list of possible records is read into memory, which searches fast. • Search Time = time to read bucket + time to search the array • Disadvantages: • What do we do when the bucket is full? • solutions are similar to collision resolution • we end up reading multiple sets of records
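
A minimal in-memory sketch of a hash file of buckets. It assumes a toy hash function, tiny sizes, no deletions, and overflow into the next bucket (one of several possible full-bucket policies); none of these choices come from the slides.

```python
BUCKET_CAPACITY = 4  # assumed records per bucket
NUM_BUCKETS = 7      # assumed number of bucket addresses

def home_bucket(key):
    # Toy hash for illustration only: sum of character codes mod table size.
    return sum(ord(ch) for ch in key) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    """Store the record in its home bucket, or the next bucket with room.

    Returns the number of buckets read.
    """
    start = home_bucket(key)
    for offset in range(NUM_BUCKETS):
        bucket = buckets[(start + offset) % NUM_BUCKETS]
        if len(bucket) < BUCKET_CAPACITY:
            bucket.append((key, record))
            return offset + 1
    raise RuntimeError("file is full")

def find(key):
    """Return (record, buckets read), or (None, buckets read) if absent."""
    start = home_bucket(key)
    for offset in range(NUM_BUCKETS):
        bucket = buckets[(start + offset) % NUM_BUCKETS]  # one read operation
        for stored_key, record in bucket:                 # fast in-memory search
            if stored_key == key:
                return record, offset + 1
        if len(bucket) < BUCKET_CAPACITY:
            # With no deletions, a bucket with room left ends the search.
            return None, offset + 1
    return None, NUM_BUCKETS

# Anagrams have the same character sum, so all five keys share a home bucket.
for number, key in enumerate(["abc", "acb", "bac", "bca", "cab"]):
    insert(key, number)

print(find("abc"))  # (0, 1): found in the home bucket with one read
print(find("cab"))  # (4, 2): the fifth record overflowed to the next bucket
```

The first four colliding keys cost a single read each; only the fifth, which arrives after the bucket is full, needs a second read.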

Predicting Collision Rates • Collisions will happen! • Poisson Function: p(x) gives the probability that a given address will have had x records assigned to it.

$$p(x) = \frac{(r/N)^{x} \, e^{-(r/N)}}{x!}$$

N = number of available addresses
r = number of records to be stored
x = number of records assigned to a given address
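
The Poisson function translates directly into code. This is a small sketch; the function name `poisson` is an assumption, and the parameters follow the symbols defined above.

```python
import math

def poisson(x, r, N):
    """Probability that a given address has exactly x of the r records hashed to it."""
    ratio = r / N
    return ratio ** x * math.exp(-ratio) / math.factorial(x)

# With as many addresses as records (r = N = 1000):
for x in range(4):
    print(x, round(poisson(x, r=1000, N=1000), 3))
```

For r = N the ratio r/N is 1, so p(0) and p(1) are both 1/e, about 0.368, and each further value divides by the next integer.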

Analysis Continued • Given • N = 1000 • r = 1000 • Probability that a given address will have exactly one, two, or three keys hashed to it:
p(1) = 0.368
p(2) = 0.184
p(3) = 0.061

Analysis Continued • Given • N = 10,000 • r = 10,000 • How many addresses should have one, two, or three keys hashed to them?
10,000 x p(1) = 10,000 x 0.3679 = 3679
10,000 x p(2) = 10,000 x 0.1839 = 1839
10,000 x p(3) = 10,000 x 0.0613 = 613
• So about 1839 addresses will see one collision and about 613 will see two. • Many of those collisions will disrupt probing.
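
The prediction can be checked against a quick simulation. This sketch assumes an ideal hash function, modeled as a uniform random address per key; the function name and the seed are arbitrary choices, and the counts will vary with the seed.

```python
import random
from collections import Counter

def simulate(num_records, num_addresses, seed=0):
    """Hash num_records keys uniformly at random and tally address loads.

    Returns a Counter where histogram[x] is the number of addresses
    that received exactly x keys.
    """
    rng = random.Random(seed)
    per_address = Counter(rng.randrange(num_addresses) for _ in range(num_records))
    histogram = Counter(per_address.values())
    histogram[0] = num_addresses - len(per_address)  # addresses never hit
    return histogram

hist = simulate(10_000, 10_000)
for x in (1, 2, 3):
    print(f"addresses with {x} key(s): {hist[x]}")
```

The simulated counts should land near the Poisson estimates of 3679, 1839, and 613, though not exactly on them.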

Impact of Packing Density • Given • r = 500 • N = 1000 • one record per address
• Addresses with exactly one record?
N x p(1) = 1000 x 0.303 = 303
• How many overflow records?
1 x N x p(2) + 2 x N x p(3) + 3 x N x p(4) + ...
= N x [1 x p(2) + 2 x p(3) + 3 x p(4) + ...]
= 1000 x [1 x 0.076 + 2 x 0.013 + 3 x 0.002 + ...]
≈ 107 (the p values shown are rounded)
• Percentage of records NOT stored at home address: 107 / 500 = 21.4%
• Summary:
Records that never collide = 303
Records that cannot go at their home = 107
Records at their home, but cause collisions = 90
Total = 500
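
The three record counts can be reproduced from the Poisson function. This is a sketch: the function names and the cutoff of 50 terms for the overflow series are assumptions (terms beyond that are negligible for this r/N).

```python
import math

def poisson(x, r, N):
    """Probability that a given address has exactly x of the r records hashed to it."""
    ratio = r / N
    return ratio ** x * math.exp(-ratio) / math.factorial(x)

def overflow_records(r, N, max_x=50):
    """Expected records that cannot be stored at home, one record per address.

    An address that receives x records keeps one and overflows x - 1.
    """
    return sum((x - 1) * N * poisson(x, r, N) for x in range(2, max_x + 1))

r, N = 500, 1000
never_collide = N * poisson(1, r, N)
overflow = overflow_records(r, N)
# Each address with two or more records holds one record at home.
home_with_collisions = N * (1 - poisson(0, r, N) - poisson(1, r, N))

print(round(never_collide), round(overflow), round(home_with_collisions))
print(round(never_collide + overflow + home_with_collisions))
```

The three categories account for every record, which is why they sum to r = 500.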


Real Life • We must balance many factors: • file size • e.g., wasted space in hashed files • e.g., extra space for index files • disk access times • available memory • frequency of additions and deletions compared to searches • Best Solution of All? • probably a combination of indexed files, hashing, and buckets

Next Classes… • Thursday April 14 • No Class • Tuesday April 19 • B-Trees • Thursday April 21 • Review
