Efficient Cuckoo Hashing for Exact String Search in Large Dictionaries
This paper explores an efficient method for storing and searching strings using Cuckoo Hashing in large dictionaries. Given a dictionary D of K strings with a total length N, we propose a structure that enables quick searches for a pattern P. Key concepts include the fundamentals of hashing, the importance of a good hash function, and approaches that improve memory utilization. Additionally, we discuss how Cuckoo Hashing can efficiently handle insertions and searches with multiple hash choices, enhancing performance and reducing collisions.
Presentation Transcript
Dictionary search: Exact string search (Paper on Cuckoo Hashing)
Exact String Search: given a dictionary D of K strings, of total length N, store them so that searches for a pattern P over them are supported efficiently.
Hashing
Key issue: a good hash function
Basic assumption: uniform hashing
• Avg #keys per slot = n * (1/m) = n/m = α (load factor)
Search cost: with m = Θ(n), the load factor α = n/m is a constant
In practice: a trivial hash function reduces the key modulo a prime.
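A “provably good” hash is $h_a(k) = \left( \sum_{i=0}^{r} a_i k_i \right) \bmod m$
• L = max string len, m = table size (prime)
• The key k is split into chunks $k_0, k_1, \ldots, k_r$ of ≈ log2 m bits each, so r ≈ L / log2 m
• Each $a_i$ is selected at random in [0,m)
• m is not necessarily prime: use (... mod p) mod m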
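The construction above can be sketched in Python as follows. This is a minimal illustration, not the slide author's code: the name `make_universal_hash`, the `max_len_bytes` parameter, and the choice to read the string as one big-endian integer before chunking are all assumptions made for the example. It also assumes m is prime and takes chunks of floor(log2 m) bits so that every chunk value is below m.

```python
import random

def make_universal_hash(m, max_len_bytes, seed=None):
    """Return h(k) = (sum_i a_i * k_i) mod m, with random a_i in [0, m).

    Hypothetical helper: m is assumed prime; keys are byte strings of at
    most max_len_bytes bytes.
    """
    rng = random.Random(seed)
    chunk_bits = max(1, m.bit_length() - 1)      # about log2(m) bits per chunk
    r = -(-8 * max_len_bytes // chunk_bits)      # ceil(L / chunk_bits) chunks
    coeffs = [rng.randrange(m) for _ in range(r)]
    mask = (1 << chunk_bits) - 1

    def h(key):
        if len(key) > max_len_bytes:
            raise ValueError("key longer than the declared maximum")
        x = int.from_bytes(key, "big")           # whole key as one integer
        total, i = 0, 0
        while x:
            total += coeffs[i] * (x & mask)      # a_i * k_i for the i-th chunk
            x >>= chunk_bits
            i += 1
        return total % m

    return h

h = make_universal_hash(m=101, max_len_bytes=16, seed=1)
print(h(b"syzygy"))                              # some slot in [0, 101)
```

Drawing a fresh seed gives a different member of the hash family, which is what cuckoo hashing relies on when it needs a second function or a rehash.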
Cuckoo Hashing: 2 hash tables, and 2 random choices where an item can be stored. [Figure: items A, B, C, D, E stored in the two tables]
A running example [Figure sequence: F is added to the tables holding A, B, C, D, E; then G is added, and the stored items are rearranged to make room]
Cuckoo Hashing Examples: random (bipartite) graph, node = cell, edge = key. [Figure: graph over the cells with keys A to G as edges]
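A minimal Python sketch of the scheme on these slides: two tables, two candidate cells per key, and evictions on insert. The class name `CuckooHash`, the `max_kicks` bound, the use of salted BLAKE2b as a stand-in for two random hash functions, and the "double the tables and rehash" fallback are assumptions of this example rather than details taken from the slides.

```python
import hashlib
import random

class CuckooHash:
    """Two tables; each key has one candidate cell in each of them."""

    def __init__(self, size=8, max_kicks=16, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self._reset()

    def _reset(self):
        # Empty tables plus two fresh salts, i.e. two fresh hash functions.
        self.tables = [[None] * self.size, [None] * self.size]
        self.salts = [self.rng.getrandbits(64).to_bytes(8, "big") for _ in range(2)]

    def _cell(self, t, key):
        d = hashlib.blake2b(key.encode(), digest_size=8, salt=self.salts[t]).digest()
        return int.from_bytes(d, "big") % self.size

    def contains(self, key):
        # A lookup probes at most two cells.
        return any(self.tables[t][self._cell(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        t = 0
        for _ in range(self.max_kicks):
            i = self._cell(t, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[t][i] = self.tables[t][i], key
            if key is None:
                return
            t = 1 - t            # the evicted key tries its cell in the other table
        self._rehash(key)        # eviction chain too long: rebuild everything

    def _rehash(self, pending):
        keys = [k for table in self.tables for k in table if k is not None]
        keys.append(pending)
        self.size *= 2
        self._reset()
        for k in keys:
            self.insert(k)

table = CuckooHash()
for word in ["A", "B", "C", "D", "E", "F", "G"]:
    table.insert(word)
print(table.contains("G"), table.contains("Z"))   # True False
```

In the graph view of the slide, each `insert` follows a path of edges (keys) between cells; the `max_kicks` bound is a simple way to stop when that path runs into a cycle.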
Natural Extensions
• More than 2 hashes (choices) per key.
  • Very different: hypergraphs instead of graphs.
  • Higher memory utilization
    • 3 choices: 90+% in experiments
    • 4 choices: about 97%
  • ...but more insert time (and random access)
• 2 hashes + bins of B-size.
  • Balanced allocation and tightly O(1)-size bins
  • Insertion sees a tree of possible evict+ins paths
  • More memory ...but more local
Dictionary search: Making one-sided errors (Paper on Bloom Filter)
Crawling: how to keep track of the URLs visited by a crawler?
• URLs are long
• Check should be very fast
• Small errors are tolerable (≈ page not crawled)
Solution: Bloom Filter over crawled URLs
For m/n = 8, Opt k = (m/n) ln 2 = 5.545... We do have an explicit formula for the optimal k
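To make the numbers concrete, here is a small Python sketch under the standard Bloom filter approximation: with m bits, n keys and k hash functions, the false-positive rate is about (1 - e^(-kn/m))^k, minimized at k = (m/n) ln 2. The names `optimal_k`, `false_positive_rate` and `BloomFilter`, and the use of salted BLAKE2b to simulate k independent hash functions, are assumptions of this example.

```python
import hashlib
import math

def optimal_k(m, n):
    """Real-valued k minimizing the approximate false-positive rate."""
    return (m / n) * math.log(2)

def false_positive_rate(m, n, k):
    """Standard approximation (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # k bit positions, one per salted hash (stand-in for k hash functions).
        for i in range(self.k):
            d = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(d, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May answer True for an item never added, never False for an added one.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

print(round(optimal_k(8, 1), 3))          # 5.545
crawled = BloomFilter(m=8 * 1000, k=6)    # m/n = 8 for up to 1000 URLs
crawled.add("http://checkmate.com/All_Natural/")
print("http://checkmate.com/All_Natural/" in crawled)   # True
```

Since k must be an integer, one would pick 5 or 6 here; under the approximation both give a false-positive rate of roughly 2%.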
Dictionary search: Prefix-string search (Reading 3.1 and 5.2)
Prefix-string Search: given a dictionary D of K strings, of total length N, store them so that prefix searches for a pattern P over them are supported efficiently.
Trie: speeding-up searches [Figure: trie with numbered nodes and edge labels s, y, z, omo, aibelyite, stile, zyg, czecin, etic, ygy, ial]
• Pro: O(p) search time
• Cons: edge + node labels and tree structure
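A minimal Python sketch of prefix search on a trie. For brevity this is a plain (uncompacted) trie with one character per edge, not the compacted variant with multi-character edge labels drawn on the slide; the class name `Trie`, the method `with_prefix`, and the sample word list are assumptions of this example.

```python
class Trie:
    """One node per prefix; children are indexed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def with_prefix(self, prefix):
        # Walk down p = len(prefix) edges: this is the O(p) part of the search.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Every word in the subtree below that node starts with the prefix.
        out, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_word:
                out.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(out)

trie = Trie()
for w in ["systile", "syzygetic", "syzygial", "syzygy",
          "szaibelyite", "szczecin", "szomo"]:
    trie.insert(w)
print(trie.with_prefix("syzyg"))   # ['syzygetic', 'syzygial', 'syzygy']
```

The dictionary per node is where the "Cons" of the slide shows up: labels and pointers for every node cost far more space than the strings themselves.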
Front-coding: squeezing strings
Example: ….systile syzygetic syzygial syzygy…. is stored as systile, 2 zygetic, 5 ial, 5 y (each number is the length of the prefix shared with the previous string).

Front-coded list:
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html...

Original list:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Gzip may be much better...
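Front-coding of a sorted list can be sketched in a few lines of Python. The function names `front_encode` and `front_decode` are assumptions of this example; each string is stored as a pair (length of the prefix shared with the previous string, remaining suffix).

```python
def front_encode(sorted_strings):
    """Replace each string by (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the strings; each one needs the previous string decoded first."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
print(coded)   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
assert front_decode(coded) == words
```

Decoding is inherently sequential, which is why long front-coded runs are usually cut into buckets that each restart with a prefix length of 0.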
2-level indexing
• Internal memory: CT on a sample (systile, szaibelyite)
• Disk: front-coded buckets, each entry being (string length, shared-prefix length, suffix):
  …. 7 0 systile | 9 2 zygetic | 8 5 ial | 6 5 y | 11 0 szaibelyite | 8 2 czecin | 9 2 omo….
• 2 advantages:
  • Search ≈ typically 1 I/O
  • Space ≈ Front-coding over buckets
• A disadvantage:
  • Trade-off ≈ speed vs space (because of bucket size)
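The two-level layout can be sketched in Python as follows. Several things here are assumptions of the example, not of the slide: a sorted list with binary search stands in for the in-memory structure over the sample, the "disk" is just a list of front-coded buckets, the strings are assumed distinct and sorted, and the names `build_index`, `decode_bucket` and `prefix_search` are hypothetical.

```python
from bisect import bisect_right

def lcp_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build_index(sorted_strings, bucket_size):
    """Return (samples, buckets): first string of each bucket + front-coded buckets."""
    samples, buckets = [], []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        samples.append(chunk[0])                 # kept in internal memory
        encoded, prev = [], ""
        for s in chunk:                          # front-coding restarts per bucket
            k = lcp_len(prev, s)
            encoded.append((k, s[k:]))
            prev = s
        buckets.append(encoded)                  # would live on disk
    return samples, buckets

def decode_bucket(encoded):
    out, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

def prefix_search(samples, buckets, prefix):
    # Last bucket whose first string is <= prefix; matches cannot start earlier.
    b = max(bisect_right(samples, prefix) - 1, 0)
    hits = []
    while b < len(buckets):
        for s in decode_bucket(buckets[b]):      # one bucket read ~ one I/O
            if s.startswith(prefix):
                hits.append(s)
            elif s > prefix:
                return hits                      # past the range of matches
        b += 1
    return hits

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
samples, buckets = build_index(words, bucket_size=4)
print(samples)                                   # ['systile', 'szaibelyite']
print(prefix_search(samples, buckets, "syzyg"))  # ['syzygetic', 'syzygial', 'syzygy']
```

The `bucket_size` parameter is the trade-off named on the slide: larger buckets mean fewer sampled strings in memory and better front-coding, but more decoding work per search.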