Near-Optimal Space Perfect Hashing Algorithms
Fabiano C. Botelho
Supervisor: Nivio Ziviani
Department of Computer Science, Federal University of Minas Gerais, Brazil
Thesis Presentation, Brazil, September 29, 2008
LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)
Summary
• Problem and Objectives
• Applications
• Basic Concepts and Lower Bounds
• Related Work
• The Algorithms
• MPHFs and Random Graphs with Cycles
• Indexing Internal Memory with MPHFs
• Contributions
• Conclusions and Publications
What is the Problem to Solve?
• To generate perfect hash functions (PHFs) and minimal perfect hash functions (MPHFs) for very large static key sets
• The resulting functions must be practical
Objectives
• Space:
  • Generate compact functions (close to the optimal)
  • Use little memory to generate the functions (scalability)
• Time:
  • Efficient computation at retrieval time
  • Efficient construction of MPHFs
Where to Use a PHF or an MPHF?
• Accessing items based on the value of a key is ubiquitous in Computer Science
• Working with huge static item sets:
  • In data warehousing applications
  • In Web search engines:
    • Large vocabularies
    • Mapping long URLs to smaller integers that are used as IDs
Indexing: Representing the Vocabulary
[Figure: indexing a collection of documents (Doc 1 ... Doc n) produces a vocabulary (Term 1 ... Term t) in which each term points to its inverted list of documents.]
Mapping URLs to Web Graph Vertices
[Figure: an MPHF maps the URLs (URL 1 ... URL n) to the Web graph vertices 0 ... n-1.]
Perfect Hash Function
[Figure: a perfect hash function maps a static key set S of size n (keys 0 ... n-1) to distinct slots of a hash table with slots 0 ... m-1.]
Minimal Perfect Hash Function
[Figure: a minimal perfect hash function maps a static key set S of size n (keys 0 ... n-1) to distinct slots of a hash table with slots 0 ... n-1.]
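The two definitions can be checked mechanically. The following is a minimal Python sketch, assuming the key set has no duplicates; the function names and the example tables are hypothetical and not taken from the slides.

```python
def is_perfect(h, keys, m):
    """True if h sends every key to a distinct slot in [0, m)."""
    slots = [h(k) for k in keys]
    in_range = all(0 <= s < m for s in slots)
    return in_range and len(set(slots)) == len(slots)


def is_minimal_perfect(h, keys):
    """True if h is perfect and the table has exactly n = |keys| slots."""
    return is_perfect(h, keys, len(keys))


keys = ["jan", "feb", "mar", "apr"]

# Illustrative lookup tables standing in for real hash functions.
sparse = {"jan": 2, "feb": 6, "mar": 0, "apr": 7}   # range 0..7, m = 8
dense = {"jan": 1, "feb": 3, "mar": 0, "apr": 2}    # range 0..3, m = n = 4

print(is_perfect(sparse.get, keys, 8))       # True
print(is_minimal_perfect(sparse.get, keys))  # False: slots exceed n - 1
print(is_minimal_perfect(dense.get, keys))   # True
```

A PHF may leave table slots empty (m > n), while an MPHF uses every slot exactly once.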
Uniform Hashing Versus Universal Hashing
[Figure: a hash function maps a key universe U of size u to a range M of size m.]
• Family H of uniform hash functions
  • # of bits to encode each function: [formula lost in extraction]
  • Probability of collisions: [formula lost in extraction]
• Family H of universal hash functions
  • # of bits to encode each function: [formula lost in extraction]
  • Probability of collisions: [formula lost in extraction]
Intuition Behind Universal Hashing
• We often lose relatively little compared to using a completely random map (uniform hashing)
• If S of size n is hashed to n^2 buckets, with probability more than ½, no collisions occur
• Even with complete randomness, we do not expect o(n^2) buckets to suffice (the birthday paradox)
• So nothing is lost by using a universal family instead!
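The claim about n^2 buckets follows from a short counting argument. As a sketch, assume the standard definition of a universal family (for any two distinct keys x and y, Pr[h(x) = h(y)] ≤ 1/m; this definition is an assumption here, since the slide's own formula is not preserved), and let C be the number of colliding pairs when S is hashed into m = n^2 buckets:

```latex
\[
\mathbb{E}[C] \;\le\; \binom{n}{2}\cdot\frac{1}{m}
  \;=\; \frac{n(n-1)}{2n^{2}} \;<\; \frac{1}{2},
\qquad
\Pr[C \ge 1] \;\le\; \mathbb{E}[C] \;<\; \frac{1}{2}
\quad \text{(Markov's inequality)}.
\]
```

Hence a function drawn at random from the family is collision-free on S with probability more than ½.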
Lower Bounds for Storage Space
• PHFs (m > n): [formula lost in extraction]
  • For m = 1.23n: [value lost in extraction]
• MPHFs (m = n): [formula lost in extraction]
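The formulas on this slide are not preserved in the transcript. The Python sketch below evaluates the asymptotic lower bounds commonly quoted in the perfect hashing literature for a universe much larger than n: roughly n log2(e) bits for an MPHF, and (1 + (m/n - 1) ln(1 - n/m)) n log2(e) bits for a PHF. Treat these expressions, and the function names, as assumptions about what the slide showed rather than as a reconstruction of it.

```python
import math


def mphf_bits_per_key():
    """Asymptotic lower bound for MPHFs (m = n), in bits per key."""
    return math.log2(math.e)


def phf_bits_per_key(c):
    """Asymptotic lower bound for PHFs with m = c * n and c > 1, in bits per key."""
    if c <= 1:
        raise ValueError("c = m/n must be greater than 1")
    return (1 + (c - 1) * math.log(1 - 1 / c)) * math.log2(math.e)


print(f"MPHF (m = n):     {mphf_bits_per_key():.3f} bits/key")
print(f"PHF  (m = 1.23n): {phf_bits_per_key(1.23):.3f} bits/key")
```

Under these assumed formulas, allowing a sparser table (m = 1.23n) lowers the bound from about 1.44 to about 0.89 bits per key.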
Related Work
• Theoretical Results (use uniform hashing)
• Practical Results (use universal hashing - assume uniform hashing)
• Heuristics
Theoretical Results: Use Complete Randomness (Uniform Hash Functions)
Practical Results: Assume Uniform Hashing for Free (Use Universal Hashing)
Heuristics
The Algorithm
A perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets:
• Key set fits in the internal memory: Internal Random Access memory algorithm (IRA)
• Key set larger than the internal memory: External Cache-Aware memory algorithm (ECA)
Algorithm Used for the Buckets: Internal Random Access Memory Algorithm...
Internal Random Access Memory Algorithm
• Near-optimal space
• Evaluation in constant time
• Function generation in linear time
• Simple to describe and implement
• Known algorithms with near-optimal space either:
  • Require exponential time for construction and evaluation, or
  • Use near-optimal space only asymptotically, for large n
• Acyclic random hypergraphs
  • Used before by Majewski et al. (1996): O(n log n) bits
  • We proceed differently: O(n) bits (the space complexity drops to close to the theoretical lower bound)
Random Hypergraphs (r-graphs)
• 3-graph for the key set S = {jan, feb, mar} on the vertices 0 ... 5:
  • h0(jan) = 1, h1(jan) = 3, h2(jan) = 5
  • h0(feb) = 1, h1(feb) = 2, h2(feb) = 5
  • h0(mar) = 0, h1(mar) = 3, h2(mar) = 4
• 3-graph is induced by three uniform hash functions
• Our best result uses 3-graphs
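A minimal Python sketch of how r hash functions induce an r-graph: each key becomes one edge whose i-th endpoint is h_i(key). Two things here are assumptions rather than the author's construction: salted BLAKE2b stands in for the uniform hash functions, and each h_i is given its own block of m/r vertices, which is what the example values on the slide suggest. The helper names `make_hash` and `build_edges` are hypothetical.

```python
import hashlib


def make_hash(seed, lo, size):
    """Return a function mapping a string key into [lo, lo + size)."""
    salt = seed.to_bytes(8, "little")

    def h(key):
        digest = hashlib.blake2b(key.encode(), digest_size=8, salt=salt).digest()
        return lo + int.from_bytes(digest, "little") % size

    return h


def build_edges(keys, r, m):
    """Map each key to an edge (h_0(key), ..., h_{r-1}(key)) over m vertices."""
    part = m // r  # vertices reserved for each hash function
    hashes = [make_hash(i, i * part, part) for i in range(r)]
    return {key: tuple(h(key) for h in hashes) for key in keys}


edges = build_edges(["jan", "feb", "mar"], r=3, m=6)
for key, edge in edges.items():
    print(key, edge)
```

The concrete vertices printed depend on the stand-in hash and will generally differ from the values on the slide; only the shape of the construction is illustrated.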
Internal Random Access Memory Algorithm (r=2)
• Key set S = {jan, feb, mar, apr}
• Mapping step: S is mapped to a 2-graph Gr with one edge per key; h0 selects a vertex in 0 ... 3 and h1 a vertex in 4 ... 7
Acyclic 2-graph
Edges of Gr are removed one at a time and appended to the list L:
• Start: L = Ø; remaining edges: jan, mar, apr, feb
• L[0] = {0,5}: mar is removed; remaining edges: jan, apr, feb
• L[1] = {2,6}: feb is removed; remaining edges: jan, apr
• L[2] = {2,7}: apr is removed; remaining edge: jan
• L[3] = {2,5}: jan is removed; no edges remain, so Gr is acyclic
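The acyclicity test can be sketched in Python as repeated removal of an edge that touches a vertex of degree 1. This is a sketch under two assumptions: the edge endpoints are the ones read off the slides (mar = {0,5}, feb = {2,6}, apr = {2,7}, jan = {2,5}), and h0 and h1 map to disjoint vertex ranges, so no edge is a self-loop. The function name `peel` is hypothetical.

```python
from collections import defaultdict, deque


def peel(edges):
    """Return the keys in removal order if the 2-graph is acyclic, else None."""
    incident = defaultdict(set)  # vertex -> keys of edges still touching it
    for key, (a, b) in edges.items():
        incident[a].add(key)
        incident[b].add(key)

    queue = deque(sorted(v for v in incident if len(incident[v]) == 1))
    order = []
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue  # its last edge was already removed via the other endpoint
        (key,) = incident[v]
        order.append(key)
        for w in edges[key]:
            incident[w].discard(key)
            if len(incident[w]) == 1:
                queue.append(w)  # w just became a degree-1 vertex
    return order if len(order) == len(edges) else None


edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
print(peel(edges))  # ['mar', 'feb', 'apr', 'jan']

cycle = {"a": (0, 2), "b": (0, 3), "c": (1, 2), "d": (1, 3)}
print(peel(cycle))  # None
```

On the example, the removal order matches the list L = {0,5} {2,6} {2,7} {2,5}. The second graph contains a cycle, so no vertex ever has degree 1 and nothing can be removed.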
Internal Random Access Memory Algorithm (r=2)
Assigning step: the array g has one entry per vertex (0 ... 7), each initialized to r, and is filled from the list L = {0,5} {2,6} {2,7} {2,5}:
• Initially: g = [r, r, r, r, r, r, r, r]
• First assignment: g[2] = 0
• Second assignment: g[7] = 1
• Final state: g = [0, r, 0, r, r, r, 1, 1], with entries 0, 2, 6 and 7 marked as assigned
• Evaluation example: i = (g[h0(feb)] + g[h1(feb)]) mod r = (g[2] + g[6]) mod 2 = 1
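A Python sketch of the assigning step and of the evaluation formula, using the edges and the list L from the slides. The reverse-order traversal of L and the treatment of visited vertices are my reading of the slide sequence (g[2] is set first, then g[7]), not a confirmed transcription of the author's pseudocode; the names `assign` and `phf` are hypothetical.

```python
R = 2  # number of hash functions; also the marker for "unassigned" in g

# Edges (h0(key), h1(key)) and removal order L, as read off the slides.
edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
removal_order = ["mar", "feb", "apr", "jan"]


def assign(order, edges, num_vertices, r=R):
    """Fill g so that (g[h0(k)] + g[h1(k)]) mod r picks a distinct vertex per key."""
    g = [r] * num_vertices
    visited = [False] * num_vertices
    for key in reversed(order):
        total = 0
        chosen_vertex = chosen_index = None
        for idx in reversed(range(r)):
            v = edges[key][idx]
            if not visited[v]:
                visited[v] = True
                chosen_vertex, chosen_index = v, idx
            else:
                total += g[v]
        # Make the evaluation formula return chosen_index for this key.
        g[chosen_vertex] = (chosen_index - total) % r
    return g


def phf(key, g, r=R):
    """Evaluate the perfect hash function: select one endpoint of the key's edge."""
    i = sum(g[v] for v in edges[key]) % r
    return edges[key][i]


g = assign(removal_order, edges, num_vertices=8)
print(g)                               # [0, 2, 0, 2, 2, 2, 1, 1]
print({k: phf(k, g) for k in edges})   # {'jan': 2, 'feb': 6, 'mar': 0, 'apr': 7}
```

With r = 2 the result reproduces the final g shown on the slide, and for feb the formula gives i = (g[2] + g[6]) mod 2 = 1, selecting vertex h1(feb) = 6. Each key ends up on a different vertex, so the function is perfect over the range 0 ... 7.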