CH8. HASHING

8.1 Symbol table abstract data type
• Symbol table
  • a set of pairs (name + attribute), ex) dictionary, library list
• Operations on a symbol table
  (1) determine if a particular name is in the table
  (2) retrieve the attributes of that name
  (3) modify the attributes of that name
  (4) insert a new name and its attributes
  (5) delete a name and its attributes
• Reducing search time
  • linear search : O(n)
  • binary search : O(log2 n)
  • hash table : hash function time + 

8.1 Symbol table abstract data type(Cont’)
template <class Name, class Attribute>
class SymbolTable
// objects: a set of name-attribute pairs, where the names are unique
{
public:
    SymbolTable(int size = defaultsize);
    // Create an empty symbol table with capacity size
    Boolean IsIn(Name name);
    // if name is in the symbol table, return TRUE(1); else return FALSE(0)
    Attribute *Find(Name name);
    // if name is in the symbol table, return a pointer to the corresponding attribute;
    // else return 0
    void Insert(Name name, Attribute attr);
    // if name is in the symbol table, then replace its existing attribute with attr;
    // else, insert the pair (name, attr) into the symbol table
    void Delete(Name name);
    // if name is in the symbol table, delete (name, attr) from the symbol table
};
ADT 8.1 Abstract data type SymbolTable

8.2 Static Hashing
8.2.1 Hash Tables
• static hashing
  • the table length is fixed
• dynamic hashing
  • the table length is not fixed
• hash function (h) : identifier (x) → hash value (h(x))
  • x : identifier
  • h(x) : hash, or home address, of x
• the table is kept in sequential memory : hash table (ht)
  • the hash table has b buckets : ht[0] … ht[b-1]
  • each bucket consists of s slots
• the identifiers may be non-comparable

8.2 Static Hashing(Cont’)
• Def : identifier density = n / T
  • n : the number of identifiers in the table
  • T : the total number of possible identifiers
• Def : loading density (loading factor) α = n / (s·b)
• Synonyms : two identifiers I1 and I2 are synonyms if h(I1) = h(I2)
• Overflow : a new identifier is hashed into a full bucket
• Collision : two nonidentical identifiers are hashed into the same bucket
• When s = 1, collisions and overflows occur simultaneously

8.2 Static Hashing(Cont’)
• Example 8.1
  • the first character of the identifier determines the bucket address
  • b = 26, s = 2, n = 10, loading factor α = 10/52 = 0.19
  Bucket   Slot 1   Slot 2
  0        A        A2
  1
  2
  3        D
  4
  5
  6        GA       G
  …
  25
Figure 8.1 Hash table with 26 buckets and two slots per bucket

8.2 Static Hashing(Cont’)
8.2.2 Hash functions
• a hash function should
  • be easy to compute
  • minimize collisions
• uniform hash function : for a randomly chosen x, the probability that h(x) = i is 1/b for every bucket i

8.2 Static Hashing(Cont’)
• Mid-square
  • used in many cases
  • square the identifier and use an appropriate number of bits from the middle
  • r bits used → table size = 2^r
  ex) 10100 × 10100 = 110010000 (binary), with the hash address taken from the middle bits
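
The mid-square rule above can be sketched as follows. This is a minimal illustration, not the textbook's code: the function name midSquare and the parameters w (key width in bits) and r (number of middle bits kept) are assumptions made for the example, and the key is assumed to fit in w bits.

```cpp
#include <cassert>
#include <cstdint>

// Mid-square hash: square a w-bit key and keep r bits from the middle of
// the 2w-bit product. The names midSquare, w and r are illustrative only.
std::uint32_t midSquare(std::uint32_t x, unsigned w, unsigned r) {
    assert(w >= 1 && w <= 32 && r >= 1 && r <= 2 * w && r <= 32);
    std::uint64_t sq = static_cast<std::uint64_t>(x) * x;  // product of up to 2w bits
    unsigned shift = (2 * w - r) / 2;                      // low-order bits to discard
    std::uint64_t mask = (std::uint64_t{1} << r) - 1;      // keeps exactly r bits
    return static_cast<std::uint32_t>((sq >> shift) & mask);
}
```

With the 5-bit key 10100 (decimal 20), the square is 0110010000 when written in 10 bits, so midSquare(20, 5, 4) keeps the middle four bits 0010 and a table of 2^4 = 16 buckets would suffice.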

8.2 Static Hashing(Cont’)
• Division
  • uses the modulo (%) operator : hD(x) = x % M
  • hash addresses lie in [0, M-1], table size = M
  • the choice of M is critical
    • if M is a power of 2, hD(x) depends only on the least significant bits of x
    • caution : this is a poor choice when identifiers are left-justified and zero-filled
  • in practice, choose M such that it has no prime divisors less than 20
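
A small sketch of the division method makes the warning about powers of 2 concrete. The function name divisionHash and the sample keys in the note below are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Division hash: the home bucket is the remainder of x divided by the
// table size M. The name divisionHash is illustrative only.
std::uint32_t divisionHash(std::uint32_t x, std::uint32_t M) {
    assert(M > 0);
    return x % M;
}
```

Take two keys whose low-order 16 bits are all zero, as happens with short left-justified, zero-filled identifiers: 0x410000 and 0x420000. With M = 256 both hash to bucket 0, because only the low eight bits survive the modulo. With M = 251, a prime and therefore free of prime divisors below 20, the same two keys land in different buckets.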

8.2 Static Hashing(Cont’)
Figure 8.2: internal representations of x and x² in octal notation (x is input right-justified, zero-filled, six bits or two octal digits per character)

8.2 Static Hashing(Cont’)
  right-justified, zero-filled (48 bits) : 0 0 0 0 0 0 A 1
  left-justified, zero-filled (48 bits)  : A 1 0 0 0 0 0 0
Figure 8.3 Identifier A1 right- and left-justified and zero-filled (six bits per character)

8.2 Static Hashing(Cont’)
• Folding
  • partition the identifier into several parts
    1) shift folding : add all the parts
    2) folding at the boundaries : reverse every other part, then add
  ex) x = 12320324111220
      P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20
    1) shift folding
       h(x) = 123 + 203 + 241 + 112 + 20 = 699
    2) folding at the boundaries
       reverse P2 and P4 to get 302 and 211
       h(x) = 123 + 302 + 241 + 211 + 20 = 897
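
Both folding variants in the example can be reproduced with a short routine. This is a sketch under stated assumptions: the key is given as a string of decimal digits, and the name foldHash and its parameters are invented for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Fold a string of decimal digits in parts of `width` digits and add the parts.
// With atBoundaries set, every second part (P2, P4, ...) is digit-reversed first.
// The name foldHash and its parameters are illustrative only.
unsigned long foldHash(const std::string& digits, std::size_t width, bool atBoundaries) {
    assert(width > 0);
    unsigned long sum = 0;
    std::size_t part = 0;  // zero-based index of the current part
    for (std::size_t pos = 0; pos < digits.size(); pos += width, ++part) {
        std::string piece = digits.substr(pos, width);  // last part may be shorter
        if (atBoundaries && part % 2 == 1)
            std::reverse(piece.begin(), piece.end());
        sum += std::stoul(piece);
    }
    return sum;
}
```

For x = 12320324111220 and three-digit parts, foldHash("12320324111220", 3, false) gives 699 and foldHash("12320324111220", 3, true) gives 897, matching the two results in the example.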

8.2 Static Hashing(Cont’)
• Digit Analysis
  • useful when all the identifiers are known in advance
  • examine the digits of each identifier and eliminate the digits with skewed distributions
  • keep the digits that are most uniformly distributed

8.2 Static Hashing(Cont’)
8.2.3 Overflow Handling
• Two methods of overflow handling
  • open addressing (linear probing, linear open addressing)
  • chaining

8.2 Static Hashing(Cont’)
struct identifier {
    char *id;
    int n;
};
// assume that operators == and != are defined for identifier
int operator==(const identifier&, const identifier&);
int operator!=(const identifier&, const identifier&);
class SymbolTable {
public:
    SymbolTable(int size = defaultsize) {
        buckets = size;
        ht = new identifier[buckets]();   // value-initialized: every id starts out null (empty bucket)
    }
private:
    int buckets;
    identifier *ht;
};
Program 8.1 : Symbol table class definition

8.2 Static Hashing(Cont’)
• Open addressing
  • example 8.3
    • 26-bucket table, one slot per bucket
    • hash function h(x) = first character of x
    • identifiers : GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E
  Bucket   Identifier
  0        A
  1        A2
  2        A1
  3        D
  4        A3
  5        A4
  6        GA
  7        G
  8        ZA
  9        E
  10
  11       L
  12
  13
  …
  24
  25       Z
Figure 8.4 Hash table with linear probing (26 buckets, one slot per bucket)

8.2 Static Hashing(Cont’)
• Open addressing
  • hash table search with open addressing
    (1) compute h(x)
    (2) examine the identifiers at positions ht[h(x)], ht[h(x)+1], …, ht[h(x)+j], in this order, treating the table as circular, until one of the following happens:
      (a) ht[h(x)+j] = x; in this case x is found
      (b) ht[h(x)+j] is null; x is not in the table
      (c) we return to the starting position h(x); the table is full and x is not in the table
  • disadvantage of open addressing
    • identifiers tend to form clusters

8.2 Static Hashing(Cont’)
int SymbolTable::LinearSearch(const identifier& x, int (*hashfunc)(identifier))
// Search the hash table ht (each bucket has exactly one slot) for x
// using linear probing.
// Return j such that ht[j] == x if x is already in the table.
// If x is not in the table, return -1. The hash function hashfunc is passed as an
// argument to LinearSearch.
{
    int i = hashfunc(x);
    int j = i;
    while (ht[j].id && ht[j] != x) {
        j = (j + 1) % buckets;      // treat the table as circular
        if (j == i) return -1;      // back to the starting point
    }
    if (ht[j].id && ht[j] == x) return j;
    return -1;                      // stopped at an empty bucket
}
Program 8.2 Linear search

8.2 Static Hashing(Cont’) Figure 8.5: Some primes of the form 4j+3

8.2 Static Hashing(Cont’)
• Analysis of example 8.3
  • the number of buckets examined when searching for each identifier
    A: 1, A2: 2, A1: 3, D: 1, A3: 5, A4: 6, GA: 1, G: 2, ZA: 10, E: 6, L: 1, Z: 1
  • total : 39 buckets examined
  • average : 39/12 = 3.25
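
The total of 39 can be checked by simulating example 8.3. The sketch below is an illustration rather than the textbook's class: ProbeTable, home, probes and totalProbesExample83 are hypothetical names, identifiers are held as std::string, and an empty string stands for an empty bucket.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One-slot-per-bucket table with linear probing (illustrative names throughout).
struct ProbeTable {
    std::vector<std::string> ht;  // an empty string marks an empty bucket
    explicit ProbeTable(int buckets) : ht(buckets) {}

    // Hash function of example 8.3: first character, 'A' -> 0 ... 'Z' -> 25.
    static int home(const std::string& x) {
        assert(!x.empty() && x[0] >= 'A' && x[0] <= 'Z');
        return x[0] - 'A';
    }

    // Store x in the first free bucket at or after its home; false if the table is full.
    bool insert(const std::string& x) {
        int b = static_cast<int>(ht.size());
        for (int k = 0; k < b; ++k) {
            int j = (home(x) + k) % b;  // circular table
            if (ht[j].empty() || ht[j] == x) { ht[j] = x; return true; }
        }
        return false;
    }

    // Number of buckets examined to find x, or -1 if x is not in the table.
    int probes(const std::string& x) const {
        int b = static_cast<int>(ht.size());
        for (int k = 0; k < b; ++k) {
            int j = (home(x) + k) % b;
            if (ht[j] == x) return k + 1;
            if (ht[j].empty()) return -1;
        }
        return -1;
    }
};

// Insert the identifiers of example 8.3 in order, then total the buckets
// examined when each one is searched for.
int totalProbesExample83() {
    const std::vector<std::string> ids =
        {"GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"};
    ProbeTable t(26);
    for (const std::string& id : ids) t.insert(id);
    int total = 0;
    for (const std::string& id : ids) total += t.probes(id);
    return total;
}
```

Calling totalProbesExample83() returns 39, which is the total quoted in the analysis and gives the average of 39/12 = 3.25.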

8.2 Static Hashing(Cont’)
• Approximation of the average number of identifier comparisons : P
  • P = (2 - α) / (2 - 2α), where α = loading density
  • in example 8.3, α = 12/26 ≈ 0.46, so P ≈ 1.43
  • but the actual average in example 8.3 was 3.25
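
The approximation is easy to evaluate for other loading densities. The helper below is only a convenience sketch; the name approxComparisons is an assumption.

```cpp
#include <cassert>

// Approximate average number of identifier comparisons per search under
// linear probing, P = (2 - alpha) / (2 - 2*alpha), for 0 <= alpha < 1.
// The name approxComparisons is illustrative only.
double approxComparisons(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return (2.0 - alpha) / (2.0 - 2.0 * alpha);
}
```

For α = 12/26 the formula reduces to 20/14, about 1.43, and for α = 0.5 it gives exactly 1.5. Either way the estimate is well below the 3.25 measured in example 8.3, where the hash function is far from uniform.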

8.2 Static Hashing(Cont’)
• Chaining
  • each bucket has one list of synonyms
  • head node + linked list
  • example : the data of example 8.3

8.2 Static Hashing(Cont’)
ht      chain (ident, link)
[0]  → A4 → A3 → A1 → A2 → A → 0
[1]    0
[2]    0
[3]  → D → 0
[4]  → E → 0
[5]    0
[6]  → G → GA → 0
[7]    0
[8]    0
[9]    0
[10]   0
[11] → L → 0
…
[25] → ZA → Z → 0
Hash table with 26 buckets; each bucket can hold a link
Figure 8.6: Hash chains corresponding to Figure 8.4

8.2 Static Hashing(Cont’)
• Use the identifier structure from Program 8.1
class ListNode {
    friend class SymbolTable;
private:
    identifier ident;
    ListNode *link;
};
typedef ListNode* ListPtr;
class SymbolTable {
public:
    SymbolTable(int size = defaultsize) {
        buckets = size;
        ht = new ListPtr[buckets]();   // every chain starts out empty (null)
    }
private:
    int buckets;
    ListPtr *ht;
};
Program 8.3: Class definitions for chain search

8.2 Static Hashing(Cont’)
identifier* SymbolTable::ChainSearch(const identifier& x, int (*hashfunc)(identifier))
// Search the chained hash table ht for x. On termination, return a pointer
// to the identifier in the hash table. If the identifier does not exist, return 0
{
    int j = hashfunc(x);    // compute head node address
    // search the chain starting at ht[j]
    for (ListPtr l = ht[j]; l; l = l->link)
        if (l->ident == x) return &l->ident;
    return 0;
}
Program 8.4: Chain search
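
Chain search is only half of the picture, so here is a hedged sketch that also inserts. It is not the textbook's class: ChainTable, ChainNode and the helper chain are hypothetical names, identifiers are std::string, the hash is the first-character function of example 8.3, and new identifiers are assumed to go at the front of their chain.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ChainNode {
    std::string ident;
    ChainNode* link;
};

// Chained hash table (illustrative names; not the textbook's SymbolTable).
class ChainTable {
public:
    explicit ChainTable(int size) : ht(size, nullptr) {}
    ~ChainTable() {
        for (ChainNode* head : ht)
            while (head) { ChainNode* next = head->link; delete head; head = next; }
    }
    ChainTable(const ChainTable&) = delete;
    ChainTable& operator=(const ChainTable&) = delete;

    // Return the node holding x, or nullptr if x is absent.
    const ChainNode* search(const std::string& x) const {
        for (const ChainNode* l = ht[home(x)]; l; l = l->link)
            if (l->ident == x) return l;
        return nullptr;
    }

    // Insert x at the front of its chain unless it is already present.
    void insert(const std::string& x) {
        if (search(x)) return;
        int j = home(x);
        ht[j] = new ChainNode{x, ht[j]};
    }

    // Identifiers of bucket j in chain order, separated by spaces.
    std::string chain(int j) const {
        std::string out;
        for (const ChainNode* l = ht[j]; l; l = l->link) {
            if (!out.empty()) out += ' ';
            out += l->ident;
        }
        return out;
    }

private:
    // First-character hash of example 8.3 (assumes an uppercase first letter).
    int home(const std::string& x) const {
        assert(!x.empty() && x[0] >= 'A' && x[0] <= 'Z');
        return (x[0] - 'A') % static_cast<int>(ht.size());
    }
    std::vector<ChainNode*> ht;
};
```

Inserting GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E in that order into a 26-bucket ChainTable leaves chain(0) as "A4 A3 A1 A2 A", chain(6) as "G GA" and chain(25) as "ZA Z", the same order as the chains of Figure 8.6. A search never examines more nodes than there are synonyms in one chain, which is why chaining avoids the clustering seen with linear probing.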

8.2 Static Hashing(Cont’) Figure 8.7: Average number of bucket accesses per identifier retrieved

8.3 Dynamic Hashing
8.3.2 Dynamic Hashing using Directories
• Disadvantage of static hashing : waste of memory space
• Dynamic hashing : variable number of pages
  ex) 8 identifiers, each consisting of 2 characters (using a trie : binary radix tree)
  Identifier   Binary representation
  A0           100 000
  A1           100 001
  B0           101 000
  B1           101 001
  C0           110 000
  C1           110 001
  C2           110 010
  C3           110 011
Figure 8.8: Some identifiers that require three bits per character

8.3 Dynamic Hashing(Cont’)
a) place 6 identifiers (A0, B0, C2, A1, B1, C3) on 4 pages
  • pages are selected by the two low-order bits : 00, 01, 10, 11
b) insert C5 (110 101)
  • overflow on page 01
  • add the next least significant bit
  • split the page
c) insert C1 (110 001)
  • overflow on page 001
  • split the page

8.3 Dynamic Hashing(Cont’)
(a) two-level trie on four pages
  00 : A0, B0
  10 : C2
  01 : A1, B1
  11 : C3
(b) inserting C5 with overflow
  00 : A0, B0
  10 : C2
  001 : A1, B1
  101 : C5
  11 : C3

8.3 Dynamic Hashing(Cont’)
(c) inserting C1 with overflow
  00 : A0, B0
  10 : C2
  0001 : A1, C1
  1001 : B1
  101 : C5
  11 : C3
Figure 8.9 : A trie to hold identifiers
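
The page labels in the trie are just the low-order bits of each identifier's six-bit code. The sketch below assumes the encoding of Figure 8.8 (letters A, B, C as 100, 101, 110 and digits 0 to 7 as their three-bit value); the function names encode and lowBits are made up for the illustration.

```cpp
#include <cassert>
#include <string>

// Six-bit code of a two-character identifier under the encoding of
// Figure 8.8: A, B, C -> 100, 101, 110 and digits 0-7 -> three-bit value.
// The names encode and lowBits are illustrative only.
unsigned encode(const std::string& id) {
    assert(id.size() == 2 && id[0] >= 'A' && id[0] <= 'C' &&
           id[1] >= '0' && id[1] <= '7');
    unsigned letter = 4u + static_cast<unsigned>(id[0] - 'A');
    unsigned digit = static_cast<unsigned>(id[1] - '0');
    return (letter << 3) | digit;
}

// The k low-order bits of the code, i.e. the trie path read from the LSB side.
unsigned lowBits(const std::string& id, unsigned k) {
    assert(k <= 6);
    return encode(id) & ((1u << k) - 1u);
}
```

A1, B1 and C5 all end in the two bits 01, which is why inserting C5 overflows that page. With three bits C5 becomes 101 while A1 and B1 stay together at 001, and with four bits A1 and C1 share 0001 while B1 moves to 1001.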

8.3 Dynamic Hashing(Cont’)
• Disadvantages of a trie
  • the access time depends on the number of bits used
  • skewed distribution of identifiers
• Extendible hashing
  • directory : a table of page pointers
    • if k bits are used, the directory has 2^k entries
  • saves the search time of the trie
    • to find an identifier, use the binary integer given by its last k bits as an index into the directory
  • Fig 8.10 shows the directories corresponding to the tries in Fig 8.9
  • every search takes the same time, as in a binary radix tree of uniform depth : balanced directory

8.3 Dynamic Hashing(Cont’)
(a) 2 bits (b) 3 bits (c) 4 bits
Figure 8.10 : Tries collapsed into directories
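
A directory lookup is a single array index. The sketch below builds a 3-bit directory for trie (b) of Figure 8.9; the struct Directory, the function figure89b and the page numbers are hypothetical, chosen only to show that several directory entries can point at the same page.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Directory of page numbers indexed by the low-order `depth` bits of a key.
// Directory and figure89b are illustrative names, not the textbook's code.
struct Directory {
    unsigned depth;           // k, the number of low-order bits used
    std::vector<int> pageOf;  // 2^k entries, each a page number

    int lookup(unsigned key) const {
        assert(pageOf.size() == (std::size_t{1} << depth));
        return pageOf[key & ((1u << depth) - 1u)];
    }
};

// Directory for trie (b) of Figure 8.9, with assumed page numbers:
// 0 = {A0, B0}, 1 = {C2}, 2 = {A1, B1}, 3 = {C5}, 4 = {C3}.
Directory figure89b() {
    //                    000 001 010 011 100 101 110 111
    return Directory{3, {  0,  2,  1,  4,  0,  3,  1,  4 }};
}
```

Keys can be written as two octal digits, one per character: A0 is 040, B1 is 051 and C5 is 065. Then figure89b().lookup(065) yields page 3 and lookup(051) yields page 2. Entries 000 and 100 both map to page 0 because that page is still distinguished by only two bits.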

8.3 Dynamic Hashing(Cont’)
const int WordSize = 5;     // maximum number of directory bits
const int PageSize = 10;    // maximum size of a page
const int MaxDir = 32;      // maximum size of a directory
struct TwoChars { char str[2]; };
struct page {
    int LocalDepth;             // number of bits to distinguish ids
    TwoChars names[PageSize];   // the actual identifiers
    int NumIdents;              // number of identifiers in this page
};
typedef page* paddr;
struct record {     // a sample record
    TwoChars KeyField;
    int IntData;
    char CharData;
};
paddr rdirectory[MaxDir];   // will contain pointers to pages
int gdepth;                 // not to exceed WordSize

8.3 Dynamic Hashing(Cont’)
paddr hash(const TwoChars& key, const int precision);
// key is hashed using a uniform hash function, and the low-order precision bits
// are returned as the page address.
paddr buddy(const paddr index);
// Take the address of a page and return the page's buddy; i.e., the leading bit is
// complemented.
int size(const paddr ptr);
// Return the number of identifiers in the page
paddr coalesce(const paddr ptr, const paddr buddy);
// Combine page ptr and its buddy, buddy, into a single page.
Boolean PageSearch(const TwoChars& key, const paddr index);
// Search page index for key key. If found, return TRUE; otherwise return FALSE
int convert(const paddr p);
// Convert a pointer to a page to an equivalent integer.
void enter(const record r, const paddr p);
// Insert the new record r into the page pointed at by p

8.3 Dynamic Hashing(Cont’)
void PageDelete(const TwoChars& key, const paddr p);
// Remove the record with key key from the page pointed at by p
paddr find(const TwoChars& key)
// Search for a record with key key in the file. If found, return the address of the
// page in which it was found. If not found, return 0.
{
    paddr index = hash(key, gdepth);
    int IntIndex = convert(index);
    paddr ptr = rdirectory[IntIndex];
    if (PageSearch(key, ptr)) return ptr;
    else return 0;
}
void insert(const record& r, const TwoChars& key)
// Insert a new record into the file pointed at by the directory
{
    if (find(key)) return;                              // key already in
    paddr p = rdirectory[convert(hash(key, gdepth))];   // page that should hold key
    if (p->NumIdents != PageSize) {                     // page not full
        enter(r, p);
        p->NumIdents++;
    }

8.3 Dynamic Hashing(Cont’)
    else {
        // Split the page into two, insert the new key, and update gdepth if necessary;
        // if this causes gdepth to exceed WordSize, print an error and terminate.
    }
}
void Delete(const TwoChars& key)
// Find and delete the record with key key
{
    paddr p = find(key);
    if (p) {
        PageDelete(key, p);
        if (size(p) + size(buddy(p)) <= PageSize)
            coalesce(p, buddy(p));
    }
}
int main() {}   // main program
Program 8.5: Extendible hashing

8.3 Dynamic Hashing(Cont’)
• Hash function
  • converts an identifier into a random bit sequence
  • family of hash functions : hash functions that produce bit strings of different lengths
    hash_i : key → {0, …, 2^i - 1}, 1 ≤ i ≤ d
  • bits are taken from LSB to MSB
  • hash(key, i) : function that generates a random i-bit number for the key
• Terminology
  • directory depth : number of bits used in the directory
  • buddies : pages having their low-order i bits in common

8.3 Dynamic Hashing(Cont’)
• Overflow handling
  • occurs when a record is added to a page that already holds its maximum of p records
  • allocate a new page
  • use one more bit and divide the page
  • if the number of bits used becomes greater than the depth of the directory, the whole directory doubles; that is, the depth increases by 1
• Merging of pages
  • two buddy pages are merged into one
  • reduces the number of bits used
  • the directory depth could be reduced by 1
  • not easy
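
Because the directory is indexed by low-order bits, doubling it amounts to writing the old directory twice. The sketch below shows only that step, under the assumption that pages are identified by integers; the function name doubleDirectory is invented for the example, and splitting the overflowing page is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Double a directory that is indexed by the low-order bits of the hash.
// The added bit is the most significant index bit, so new entry i starts out
// pointing wherever old entry (i mod old size) pointed.
// The name doubleDirectory is illustrative only.
std::vector<int> doubleDirectory(const std::vector<int>& dir) {
    // the old size must be a power of 2
    assert(!dir.empty() && (dir.size() & (dir.size() - 1)) == 0);
    std::vector<int> bigger(dir.size() * 2);
    for (std::size_t i = 0; i < bigger.size(); ++i)
        bigger[i] = dir[i % dir.size()];
    return bigger;
}
```

Starting from a 2-bit directory {0, 2, 1, 4}, doubling gives {0, 2, 1, 4, 0, 2, 1, 4}. After the overflowing page is split, only the entry for the new page has to change, for example entry 101 when the new page holds the keys ending in 101. Every other pair of entries keeps sharing a page, which is the source of the storage waste that skewed data can cause.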

8.3 Dynamic Hashing(Cont’)
8.3.3 Analysis of Directory Dynamic Hashing
• Time and space
  • retrieval requires only two disk accesses
  • with a non-uniform distribution, many pointers point to the same page → waste of storage
• Space utilization
  • (number of records stored) / (amount of space)
  • without a special strategy for handling overflows, space utilization is approximately 69%
• Directory size
  • the directory could be large (skewed data)
  • the directory could be stored in auxiliary memory

8.3 Dynamic Hashing(Cont’) Figure 8.11: Directory size given n records and p page size
