Introduction to Hashing: Optimizing Data Access with Hash Functions in Data Structures

Hashing – Part I CS 367 – Introduction to Data Structures

Searching • Up to now the only way to find a key is to search through all or part of the data • linked list: O(n) • AVL tree: O(log n) • binary search of array: O(log n) • If there is a lot of data, or the data is searched very often, these times can be long • given the key, we would like to get the data directly

Hashing • The solution to this problem is to put the key through a function that says exactly where the data is (or where it should be placed) • this function is called a hash function • h(key) = integer • the integer obtained from a hash function can be used as an index into an array • if the hash function is perfect – always generates a unique integer for different keys – the time to place and access data is O(1)

Hashing • [Figure: the keys A, M, and X are passed through the hashing function, which maps each key to a slot in an array indexed 0 through 11]

Hashing Functions • So what is the hashing function? • the simplest hashing function is to use the division remainder • assume the array is 1000 elements in size • translate the data into a number, n • h(n) = n % 1000

Hashing Functions • simple example • consider a small school • each student is tracked by a 4-digit ID number • each student’s ID# begins with a digit for the year they started • 2000 -> 0, 2001 -> 1, 2002 -> 2, etc. • all student records are stored in an array • maximum of 1000 students per year • let’s look at records for all sophomores • assume they were freshmen in 2001

Hashing Functions • To find John’s record in the array: 1009 % 1000 = 9, so go to index 9.
Student IDs and the array slots they hash to:
Mary’s ID#: 1000 -> index 0 (Mary’s records)
Pete’s ID#: 1004 -> index 4 (Pete’s records)
John’s ID#: 1009 -> index 9 (John’s records)
Amy’s ID#: 1011 -> index 11 (Amy’s records)
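
The lookup above can be sketched in Java. This is only an illustration: the class name StudentTable and the choice to store each record as a plain String are assumptions, not something the slides specify.

```java
// Hypothetical sketch: student records stored at index id % 1000.
class StudentTable {
    static final int TABLE_SIZE = 1000;   // at most 1000 students per year
    private final String[] records = new String[TABLE_SIZE];

    // h(n) = n % 1000, as on the slide
    static int hash(int id) {
        return id % TABLE_SIZE;
    }

    // Place a record directly at its hashed index.
    void put(int id, String record) {
        records[hash(id)] = record;
    }

    // Fetch a record with no searching at all.
    String get(int id) {
        return records[hash(id)];
    }
}
```

Both put and get take constant time because the ID alone determines the slot.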

Generating n • The previous example is rather simplistic in that it is hashing already unique integers • seems kind of pointless • maybe not if the integers are large • consider the UW’s 10 digit ID numbers • Often it is desirable to hash some other kind of data • a person’s name for example

Generating n • How is a string converted into an integer? • the simplest method is to add all of the ASCII values for each character together • example • convert amy into an integer • a = 97; m = 109; y = 121 • a + m + y = 327 • there are lots of other ways to convert strings to integers • what are a few of them?
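
A minimal sketch of the character-sum conversion described above. The class and method names are made up for illustration.

```java
// Hypothetical sketch: convert a string to an integer by summing character codes.
class AsciiSum {
    // For plain ASCII text, charAt returns the ASCII value of each character.
    static int toNumber(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(toNumber("amy")); // 97 + 109 + 121 = 327
        System.out.println(toNumber("may")); // same letters in another order, also 327
    }
}
```

Because addition ignores order, any two strings made of the same letters produce the same number, which is one reason to look at other conversions.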

Hashing Functions • There are millions of possible hashing functions • we will not be considering them all • basically, anything you can think of to generate an integer could be used as a hashing function • Mathematicians have spent lots of time and effort to come up with some basic methods that work pretty well

Division • We have already seen the division method • it involves taking the remainder of division • h(key) = key % tableSize • A few notes about making this work better • table size should be a prime number • usually a good method if nothing or very little is known about the keys • the remaining methods will all use division as the final step in their calculation
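
The following sketch shows the division method and one way to see why a prime table size helps. The keys and table sizes are chosen purely for illustration, and the use of Math.floorMod to handle negative keys is an added assumption the slides do not discuss.

```java
// Hypothetical sketch of the division method.
class DivisionHash {
    // h(key) = key % tableSize; floorMod keeps the result in [0, tableSize) even for negative keys
    static int hash(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30, 40};
        for (int key : keys) {
            // table size 10 (not prime): every key lands in slot 0
            // table size 11 (prime): the keys land in slots 10, 9, 8, 7
            System.out.println(key + " -> " + hash(key, 10) + " vs " + hash(key, 11));
        }
    }
}
```

When the keys share a factor with the table size they pile into a few slots; a prime size avoids that for most key patterns.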

Folding • Separate the key into various equally sized parts and then recombine them • usually with addition • Two kinds of folding • shift folding • just add the various parts together as they are • boundary folding • reverse the order of every other part and add them together

Folding • Consider an SSN as a key • break it into 3 parts • first 3 digits, second 3 digits, last 3 digits • Shift folding example • SSN = 123-45-6789 • first = 123; second = 456; third = 789 • h(key) = (first + second + third) % tableSize • h(SSN) = 1368 % tableSize • Boundary folding example • h(key) = (first + R(second) + third) % tableSize, where R reverses the digits of a part • h(SSN) = (123 + 654 + 789) % tableSize = 1566 % tableSize
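
Both folding variants can be sketched as follows, assuming the SSN is passed as a nine-digit int with the dashes removed. The class and method names, and the table size 1009 used in the usage note, are assumptions for illustration.

```java
// Hypothetical sketch of shift folding and boundary folding on a 9-digit key.
class Folding {
    // Splits the key into three 3-digit parts: first, second, third.
    static int[] parts(int key) {
        return new int[] { key / 1_000_000, (key / 1_000) % 1_000, key % 1_000 };
    }

    // Reverses the decimal digits of a 3-digit part, e.g. 456 -> 654.
    static int reverse3(int part) {
        return (part % 10) * 100 + ((part / 10) % 10) * 10 + part / 100;
    }

    // Shift folding: add the parts as they are.
    static int shiftFoldSum(int key, int tableSize) {
        int[] p = parts(key);
        return (p[0] + p[1] + p[2]) % tableSize;
    }

    // Boundary folding: reverse every other part (here, the second) before adding.
    static int boundaryFoldSum(int key, int tableSize) {
        int[] p = parts(key);
        return (p[0] + reverse3(p[1]) + p[2]) % tableSize;
    }
}
```

For the key 123456789 and a table of 1009 slots, shift folding gives 1368 % 1009 = 359 and boundary folding gives 1566 % 1009 = 557.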

Increasing Performance • Consider using shifting and exclusive OR’ing to combine the parts of the key • exclusive OR the parts together to generate the index • Example • consider the string abcdefgh • if each part is a letter, just exclusive OR them • ‘a’ ^ ‘b’ ^ ‘c’ ^ ‘d’ ^ ‘e’ ^ ‘f’ ^ ‘g’ ^ ‘h’ • often, a character is represented by 8 bits • what’s the problem with this? • might be better to exclusive OR chunks of the string • “abcd” ^ “efgh” • why were four characters chosen in this case?
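
One way to see the problem the slide hints at is to run the character-by-character version. This is a sketch with made-up names, and it assumes 8-bit (ASCII) characters.

```java
// Hypothetical sketch: XOR the characters of a key one at a time.
class XorChars {
    static int xorChars(String key) {
        int result = 0;
        for (int i = 0; i < key.length(); i++) {
            result ^= key.charAt(i);   // each ASCII character contributes only its low 8 bits
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(xorChars("abcdefgh")); // 8
        System.out.println(xorChars("hgfedcba")); // 8 again: XOR ignores order
    }
}
```

With 8-bit characters the result can never exceed 255, so no more than 256 slots of the table are ever used, however large the table is. Packing four characters into a 32-bit int before XOR’ing uses the whole width of the int.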

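Increasing Performance
int shiftFold(String key, int tableSize) {
    int result = 0;
    byte[] st = key.getBytes();
    for (int i = 0; i < st.length; i += 4) {
        // pack up to four bytes into one 32-bit chunk
        int chunk = 0;
        for (int j = 0; (j < 4) && (j + i < st.length); j++) {
            // shift first, then OR in the next byte, so no byte is pushed out of the int;
            // & 0xFF keeps a negative byte from sign-extending over the other bytes
            chunk = (chunk << 8) | (st[j + i] & 0xFF);
        }
        result = result ^ chunk;
    }
    // floorMod keeps the index non-negative even when result is negative
    return Math.floorMod(result, tableSize);
}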

Increasing Performance • The performance could be increased even more if the table size was a power of 2 • can get rid of the modulo operation at the end • modulo is an expensive calculation • could just do a subtraction and an AND operation instead
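
A minimal sketch of the subtraction-and-AND replacement for modulo. It assumes the table size is a power of two; the class and method names are hypothetical.

```java
// Hypothetical sketch: index computation without the modulo operator.
class PowerOfTwoIndex {
    // tableSize must be a power of two, so tableSize - 1 is a run of 1-bits
    // that keeps exactly the low-order bits of the hash.
    static int index(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(index(9740641, 1024)); // 353
        System.out.println(9740641 % 1024);       // 353, the same slot
    }
}
```

For non-negative hash values the result equals hash % tableSize, and the index also stays in range when the hash is negative.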

Mid-Square Function • Square the number and take the middle part as the index • a string must first be converted to get the number to square • The entire key gets used to generate the address • less chance for conflicts • more on this later • This method works best if the table size is a power of two

Mid-Square Function • Table size equals 1024 (2^10) • The key is 3121 • 3121^2 = 9740641 = (100101001010000101100001)2 • the middle 10 bits of this value are 0101000010 (7 bits are dropped on each side: 1001010 0101000010 1100001) • Index in array is • (0101000010)2 = 322 • This is all very quick and easy to calculate using mask and shift operations

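Mid-Square Function
// table size must be a power of two
int tableSize = 1024;
int mask = tableSize - 1;                // ten 1-bits for a table of 1024
int maskBits = logBase2(tableSize);      // 10; logBase2 is a helper not shown on the slide
int shiftBits = 7;                       // low-order bits to discard for a 24-bit square

int midSquare(String key) {
    int n = stringToNum(key);            // helper not shown on the slide
    long square = (long) n * n;          // long avoids int overflow
    // shift the middle bits down to the bottom, then mask off everything above them
    return (int) ((square >>> shiftBits) & mask);
}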
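
The shift of 7 in the example comes from centering a 10-bit window in a 24-bit square: (24 - 10) / 2 = 7. The sketch below derives the shift from the square itself rather than hard-coding it. That per-key derivation is an assumption of this sketch, not something the slides prescribe, and the names are hypothetical.

```java
// Hypothetical sketch: mid-square with the shift computed from the square's bit length.
class MidSquare {
    // tableSize is assumed to be a power of two; key is assumed non-negative.
    static int midSquare(int key, int tableSize) {
        long square = (long) key * key;                             // long avoids int overflow
        int tableBits = Integer.numberOfTrailingZeros(tableSize);   // log2 of a power of two
        int squareBits = 64 - Long.numberOfLeadingZeros(square);    // bits actually used by the square
        int shiftBits = Math.max(0, (squareBits - tableBits) / 2);  // low-order bits to discard
        return (int) ((square >>> shiftBits) & (tableSize - 1));
    }
}
```

For key 3121 and table size 1024 this reproduces the index 322 from the worked example.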

Extraction • Simply pull out a certain part of the key and use it as the index • example • SSN = 123-45-6789 • index = middle three digits of key = 456 • alternative index = first, middle, and last digits = 159 • Should try to choose a part of the key that is most likely to be unique • consider foreign student SSNs • start with 999 • probably not a great idea to extract the first three digits
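
Both extraction choices from the example can be sketched as below, assuming the SSN arrives as a string with dashes. The class and method names are made up for illustration.

```java
// Hypothetical sketch of extraction on an SSN such as "123-45-6789".
class Extraction {
    // Keeps only the nine digits of the SSN.
    static String digits(String ssn) {
        return ssn.replace("-", "");
    }

    // Digits 4 through 6 of the nine-digit key.
    static int middleThree(String ssn) {
        return Integer.parseInt(digits(ssn).substring(3, 6));
    }

    // First, middle (fifth), and last digit joined into one number.
    static int firstMiddleLast(String ssn) {
        String d = digits(ssn);
        String picked = "" + d.charAt(0) + d.charAt(d.length() / 2) + d.charAt(d.length() - 1);
        return Integer.parseInt(picked);
    }
}
```

For "123-45-6789", middleThree returns 456 and firstMiddleLast returns 159, matching the slide.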
