In this lecture, we look at why hashing matters for searching and data retrieval. We cover the purpose of searching, the factors our users care about, and a commonly used hash function that turns keys into array indices. Attendees will learn about the Entry ADT, performance considerations, and why a reliable, fast hash function is needed to make O(1) efficiency possible. We also discuss common pitfalls of bad hash functions and what to consider when implementing hash tables for large datasets.
CSC 213 – Large Scale Programming Lecture 11: Why I Like Hash
Today’s Goal • Consider what will be important when searching • Why search in first place? What is its purpose? • What should we expect & handle when searching? • What factors matter to our users (and ourselves)? • (Besides source of bad jokes) What is hashing? • Why important for searching? How can it help? • What are critical factors of good hash function? • Commonly-used hash function example examined
Keys To Map & Dictionary • Used to convert the key into value • values cannot share a key and be in same Map • In searching failure is normal, not exceptional
Entry ADT • Needs 2 pieces: what we have & what we want • First part is the key: data used in search • Item we want is value; the second part of an Entry • Implementations must define 2 methods • key() & value() return appropriate item • Usually includes setValue() but NOT setKey()
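The Entry ADT above can be sketched in Java as follows. This is a minimal illustration, not the course's own code: the names `Entry` and `BasicEntry` and the generic parameters are assumptions, and only the three methods named on the slide are included.

```java
// Hypothetical Entry ADT; method names follow the slide.
interface Entry<K, V> {
    K key();                 // data used in the search
    V value();               // item we want back
    V setValue(V newValue);  // replaces the value, returns the old one
    // Deliberately no setKey(): the slide says it is usually left out.
}

// Hypothetical implementation holding the two pieces.
final class BasicEntry<K, V> implements Entry<K, V> {
    private final K key;     // final, so the key can never change
    private V value;

    BasicEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override public K key() { return key; }

    @Override public V value() { return value; }

    @Override public V setValue(V newValue) {
        V old = value;
        value = newValue;
        return old;
    }
}
```

Making the key field `final` is one way to enforce the "no setKey()" rule at compile time.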
Sequence-Based Map • Sequence’s perspective of Map that it holds (figure labels: Positions, elements)
Sequence-Based Map • Outside view of Map and how it is stored (figure labels: Positions, Entrys)
Sequence-Based Map • Map implementation’s view of data and storage (figure labels: Positions, Elements/Entrys)
Please hold while the machine searches 1,000,000 records for your location
Map Performance • In all seriousness, can be matter of life-or-death • 911 Operators immediately need addresses • Google’s search performance in TB/s • O(log n) time too slow for these uses • Would love to use arrays • Get O(1) time to add, remove, or lookup data • This HUGE array needs massive RAM purchase
Monster Amounts of RAM • Java requires using int as array index • Limit to int and RAM available in a machine • Integer.MAX_VALUE = 2,147,483,647 • 8,200,000,000 pages in Google’s index (2005) • In US, possible phone numbers = 10,000,000,000 • Must do more for O(1) array usage time • As with all life’s problems we turn to hash
Hashing To The Rescue • Hash function turns key into int from 0 to N-1 • Result is usable as index for an array • Specific for key’s type; cannot be reused • Store the Entrys in array (“hash table”) • (Great name for shop in Amsterdam, too) • Begin by computing key’s hash value • Result is array index for that Entry • Now is possible to use array for O(1) time!
Hash Table Example • Example shows table of Entry<Long,String> • Simple hash function is h(x) = x mod 10,000 • x is/from Entry’s key • h(x) computes index to use • Always is mod array length • Not all locations used • Holes will appear in array • Empties: set to null -or- use sentinel value
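The hash function from the example slide, h(x) = x mod 10,000, can be tried out with a short Java sketch. The class name `PhoneTable`, the sample key, and the choice to store plain Strings instead of full Entry objects are assumptions made to keep the example small.

```java
final class PhoneTable {
    static final int LENGTH = 10_000;          // table length used on the slide

    // h(x) = x mod 10,000; assumes x is non-negative.
    static int h(long x) {
        return (int) (x % LENGTH);
    }

    public static void main(String[] args) {
        String[] table = new String[LENGTH];   // unused slots stay null ("holes")
        long key = 7_165_551_234L;             // hypothetical key
        table[h(key)] = "example value";       // store at the computed index
        System.out.println(h(key));            // prints 1234
    }
}
```

With a table length of 10,000 the index is simply the last four decimal digits of the key.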
When We Use Hash • Hash key to find index • First step for most calls • get() - need index to check • Add at that index - put() • remove() - index to set null • Then check key at index • At index many keys possible • Still a Map, so results known • If you find keys not same, cannot treat as the same!
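The steps above (hash the key, go to that index, then compare keys) can be sketched as a tiny Java class. This is an illustration under stated assumptions: the class name `TinyHashMap` is hypothetical, each slot holds at most one Entry, and a second key landing on an occupied slot simply raises an exception, because handling that case is not covered by these slides. It uses the standard `java.util.AbstractMap.SimpleEntry` class to hold each key/value pair.

```java
import java.util.AbstractMap.SimpleEntry;

// Sketch only: one Entry per slot, no collision handling (an assumption).
final class TinyHashMap {
    private final SimpleEntry<Long, String>[] table;

    @SuppressWarnings("unchecked")
    TinyHashMap(int length) {
        table = (SimpleEntry<Long, String>[]) new SimpleEntry[length];
    }

    // First step of every call: hash the key to find its index.
    private int index(long key) {
        return (int) Math.floorMod(key, (long) table.length);
    }

    String get(long key) {
        SimpleEntry<Long, String> e = table[index(key)];
        // Same index is not enough: the stored key must also match.
        return (e != null && e.getKey() == key) ? e.getValue() : null;
    }

    String put(long key, String value) {
        int i = index(key);
        SimpleEntry<Long, String> e = table[i];
        if (e == null) {                        // hole: add at that index
            table[i] = new SimpleEntry<>(Long.valueOf(key), value);
            return null;
        }
        if (e.getKey() == key) {                // same key: replace the value
            return e.setValue(value);
        }
        throw new IllegalStateException("slot " + i + " holds a different key");
    }

    String remove(long key) {
        int i = index(key);
        SimpleEntry<Long, String> e = table[i];
        if (e == null || e.getKey() != key) {
            return null;                        // failed search is normal
        }
        table[i] = null;                        // set the slot back to null
        return e.getValue();
    }
}
```

Returning `null` for a missing key reflects the earlier point that failure is normal in searching, not exceptional.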
Properties of Good Hash • To really be useful, hash must have properties • Reliable • Fast • Use entire table • Make good brownies
Reliability of Hash Function • Implement Map with a hash table • To use Entry, get key to easily look up its index • Always computes same index for that key
Speed of Hash Function • Hash must be computed on each access • Goal: O(1) efficiency by using an array • Efficiency of array wasted if hash is slow • If O(1) computation performed by hash function • It is possible to perform get in O(1) time • O(1) time for put & remove could also occur • None of this is guaranteed; many problems can occur
Use Entire Table Important • Hashing takes lots of space because array is used • When creating, make array big enough to hold all data • Can copy to larger array, but this is not an O(1) operation • Use prime number lengths but these quickly get large • Spreads out Entrys equally across entire table • Further apart it's spread, easier to find opening
Hash Function Analogy (figure labels: Hash function, Hash table)
Examples of Bad Hash • h(x) = 0 • Reliable, fast, little use of table • h(x) = random.nextInt() • Unreliable, fast, uses entire table • h(x) = current index -or- free index • Reliable, slow, uses entire table • h(x) = x^34 + 2x^33 + 24x^32 + 10x^31 … • Reliable, moderate, too large
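The first two bad examples can be written out in Java to see which property each one breaks. The class name `BadHashes`, the method names, and the table length of 10 are assumptions for illustration; `nextInt(LENGTH)` is used instead of plain `nextInt()` so the result is already a legal index.

```java
import java.util.Random;

final class BadHashes {
    static final int LENGTH = 10;               // hypothetical table length
    static final Random RANDOM = new Random();

    // Reliable and fast, but every key lands in slot 0.
    static int constantHash(long x) {
        return 0;
    }

    // Fast and spread across the table, but unreliable:
    // the same key can get a different index on every call.
    static int randomHash(long x) {
        return RANDOM.nextInt(LENGTH);
    }

    public static void main(String[] args) {
        long key = 42L;
        System.out.println(constantHash(key) + " " + constantHash(key));
        System.out.println(randomHash(key) + " " + randomHash(key));
    }
}
```

With `randomHash`, an Entry stored by put() would usually not be found again by get(), since the two calls are unlikely to compute the same index.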
Incredibly Bad Hash • Using only part of key & not whole thing • No matter what, inevitably, you will guess wrong (figure labels: Part that matters, Part used for hash)
Censored Good Hash • Hash must first turn key into int • Easy for numbers, but rarely that simple in real life • For a String, could add value of each character • But “spot”, “pots”, “stop” would all hash to same index • Instead we usually use polynomial code:
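The difference between summing characters and a polynomial code can be checked with a short Java sketch. The class and method names are hypothetical, and the polynomial is written in one common form, s[0]·a^(n-1) + s[1]·a^(n-2) + … + s[n-1], which is an assumption: the lecture may order the terms differently, and that changes the numbers produced.

```java
final class StringHashes {
    // Adds up the character values; the order of the characters is ignored,
    // so anagrams always get the same result.
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Polynomial code computed term by term (assumed term order, see above).
    // Each character is weighted by a power of a, so position matters.
    static long polyHash(String s, int a) {
        int n = s.length();
        long total = 0;
        for (int i = 0; i < n; i++) {
            long power = 1;
            for (int j = 0; j < n - 1 - i; j++) {
                power *= a;                     // builds a^(n-1-i) from scratch
            }
            total += s.charAt(i) * power;
        }
        return total;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"spot", "pots", "stop"}) {
            System.out.println(s + " " + sumHash(s) + " " + polyHash(s, 33));
        }
    }
}
```

Running it shows one shared sum for the three anagrams but three different polynomial values.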
Good, Fast Hash • Polynomial codes good, but very slow • Major bummer since we use hash for its speed • Cause of slowdown: computing a^n takes n operations • Horner’s method better by piggybacking work
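Horner's method can be sketched in Java in a few lines. The class name `HornerHash` is hypothetical, and the term order (first character gets the highest power of a) is an assumption; it is the same ordering that Java's own `String.hashCode()` documents, with a = 31.

```java
final class HornerHash {
    // Evaluates s[0]*a^(n-1) + s[1]*a^(n-2) + ... + s[n-1] without ever
    // computing a power: each step reuses the work done so far.
    static int hash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);   // one multiply and one add per character
        }
        return h;                      // int arithmetic wraps, so h may be negative
    }

    public static void main(String[] args) {
        System.out.println(hash("ab", 33));   // 97 * 33 + 98, prints 3299
    }
}
```

The loop does n multiplications in total, where computing each power separately costs on the order of n² for a String of length n.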
Compression • Hash’s only use is computing array indices • Useless if larger than table’s length: no index exists! • When a=33, “spot” hashed to 4,293,383 • Some hashes are incalculable (like “triskaidekaphobia”) • To compress result, work like array-based queue: hash = (result + length) % length • % returns the modulus (the remainder from division) • Serves exact same purpose: keeps index within limits
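The compression step can be sketched in Java as below. This is an assumed variant of the slide's formula, not a transcription of it: it takes the remainder first and then shifts it, because in Java `%` keeps the sign of its left operand and a hash that has wrapped around can be negative. The class and method names are hypothetical, and the sketch assumes `length` is positive and below 2^30 so that `r + length` cannot overflow.

```java
final class Compress {
    // Maps any int hash value to a legal index in 0 .. length-1.
    static int compress(int hash, int length) {
        int r = hash % length;           // r has the same sign as hash
        return (r + length) % length;    // shift negatives back into range
    }

    public static void main(String[] args) {
        System.out.println(compress(4_293_383, 10_000));  // prints 3383
        System.out.println(compress(-7, 5));              // prints 3
    }
}
```

For a positive `length`, the standard library call `Math.floorMod(hash, length)` gives the same result in one step.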
Before Next Lecture… • Continue working on week #4 assignment • Due at usual time Tues. so may want to get cracking • Start thinking of designs & CRC cards for project • Due in 10 days as projects completed in stages • Read sections 9.2.1 & 9.2.5 – 9.2.7 of the book • Consider better ways of handling this situation: