1 / 161

Near-Optimal Space Perfect Hashing Algorithms

Near-Optimal Space Perfect Hashing Algorithms. Fabiano C. Botelho Supervisor: Nivio Ziviani Department of Computer Science Federal University of Minas Gerais, Brazil Thesis Presentation, Brazil, September 29, 2008. Summary. Problem and Objectives Applications

fiona
Télécharger la présentation

Near-Optimal Space Perfect Hashing Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Near-Optimal Space Perfect Hashing Algorithms Fabiano C. Botelho Supervisor: Nivio Ziviani Department of Computer Science Federal University of Minas Gerais, Brazil Thesis Presentation, Brazil, September 29, 2008 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)1

  2. Summary • Problem and Objectives • Applications • Basic Concepts and Lower Bounds • Related Work • The Agorithms • MPHFs and Random Graphs with Cycles • Indexing Internal Memory with MPHFs • Contributions • Conclusions and Publications LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)2

  3. What is the Problem to Solve? • To generate PHFs and MPHFs for very large static key sets • The resulting functions must be practical LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)3

  4. Objectives • Space: • Generate compact functions (close to the optimal) • Use little memory to generate the functions (Scalability) • Time: • Efficient computation at retrieval time • Efficient Construction of MPHFs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)4

  5. Summary • Problem and Objectives • Applications • Basic Concepts and Lower Bounds • Related Work • The Agorithms • MPHFs and Random Graphs with Cycles • Indexing Internal Memory with MPHFs • Contributions • Conclusions and Publications LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)5

  6. Where to Use a PHF or an MPHF? Access items based on the value of a key is ubiquitous in Computer Science Work with huge static item sets: In data warehousing applications In Web search engines: Large vocabularies Map long URLs in smaller integer numbers that are used as IDs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)6

  7. Indexing: Representing the Vocabulary Vocabulary Inverted List Collection of documents Term 1 Doc 1 Doc 5 ... Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 ... Doc n Term 2 Doc 1 Doc 2 ... Term 3 Doc 3 Doc 4 ... Term 4 Doc 7 Doc 9 ... Term 5 Doc 6 Doc 10 ... Term 6 Doc 1 Doc 5 ... Term 7 Term 8 ... Term t Doc 9 Doc 11 ... Indexing LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)7

  8. Mapping URLs to Web Graph Vertices 0 Web Graph Vertices 1 URL 1 URLS 2 URL 2 URL3 3 URL4 URL5 4 URL6 URL7 5 ... 6 URLn ... n-1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)8

  9. Mapping URLs to Web Graph Vertices 0 Web Graph Vertices M P H F 1 URL 1 URLS 2 URL 2 URL3 3 URL4 URL5 4 URL6 URL7 5 ... 6 URLn ... n-1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)9

  10. Summary • Problem and Objectives • Applications • Basic Concepts and Lower Bounds • Related Work • The Agorithms • MPHFs and Random Graphs with Cycles • Indexing Internal Memory with MPHFs • Contributions • Conclusions and Publications LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)10

  11. Perfect Hash Function Static key set S of size n ... 0 1 n -1 Hash Table Perfect Hash Function ... m -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)11

  12. Minimal Perfect Hash Function Static key set S of size n ... 0 1 n -1 Minimal Perfect Hash Function Hash Table ... n -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)12

  13. Uniform Hashing Versus Universal Hashing Key universe U of size u Hash function Range M of size m • Family H of uniform hash function • # of bits to encode each function • Probability of collisions: LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)13

  14. Uniform Hashing Versus Universal Hashing Key universe U of size u Hash function Range M of size m • Family H of uniform hash function • # of bits to encode each function • Probability of collisions: • Family H of universal hash function • # of bits to encode each function • Probability of collisions : X LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)14

  15. Intuition Behind Universal Hashing • We often lose relatively little compared to using a completely random map (uniform hashing) • If S of size n is hashed to n2 buckets, with probability more than ½, no collisions occur • Even with complete randomness, we do not expect little o(n2) buckets to suffice (the birthday paradox) • So nothing is lost by using a universal family instead! LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)15

  16. Lower Bounds for Storage Space • PHFs (m > n): For m = 1.23n: • MPHFs (m = n): LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)16

  17. Summary • Problem and Objectives • Applications • Basic Concepts and Lower Bounds • Related Work • The Agorithms • MPHFs and Random Graphs with Cycles • Indexing Internal Memory with MPHFs • Contributions • Conclusions and Publications LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)17

  18. Related Work Theoretical Results (use uniform hashing) Practical Results (use universal hashing - assume uniform hashing) Heuristics LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)18

  19. Theoretical ResultsUse Complete Randomness (Uniform Hash Functions) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)19

  20. Theoretical ResultsUse Complete Randomness (Uniform Hash Functions) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)20

  21. Practical ResultsAssume Uniform Hashing for Free (Use Universal Hashing) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)21

  22. Practical ResultsAssume Uniform Hashing for Free (Use Universal Hashing) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)22

  23. Practical ResultsAssume Uniform Hashing for Free (Use Universal Hashing) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)23

  24. Heuristics LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)24

  25. Summary • Problem and Objectives • Applications • Basic Concepts and Lower Bounds • Related Work • The Agorithms • MPHFs and Random Graphs with Cycles • Indexing Internal Memory with MPHFs • Contributions • Conclusions and Publications LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)25

  26. The Algorithm A perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets: Key set fits in the internal memory Internal Random Access memory algorithm (IRA) Key set larger than the internal memory External Cache-Aware memory algorithm (ECA) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)26

  27. Algorithm Used for the Buckets: Internal Random Access Memory Algorithm... LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)27

  28. Internal Random Access Memory Algorithm • Near-optimal space LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)28

  29. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)29

  30. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time • Function generation in linear time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)30

  31. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time • Function generation in linear time • Simple to describe and implement LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)31

  32. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time • Function generation in linear time • Simple to describe and implement • Known algorithms with near-optimal space either: • Require exponential time for construction and evaluation, or • Use near-optimal space only asymptotically, for large n LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)32

  33. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time • Function generation in linear time • Simple to describe and implement • Known algorithms with near-optimal space either: • Require exponential time for construction and evaluation, or • Use near-optimal space only asymptotically, for large n • Acyclic random hypergraphs • Used before by Majewski et all (1996): O(n log n) bits LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)33

  34. Internal Random Access Memory Algorithm • Near-optimal space • Evaluation in constant time • Function generation in linear time • Simple to describe and implement • Known algorithms with near-optimal space either: • Require exponential time for construction and evaluation, or • Use near-optimal space only asymptotically, for large n • Acyclic random hypergraphs • Used before by Majewski et all (1996): O(n log n) bits • We proceed differently: O(n) bits (we changed space complexity, close to theoretical lower bound) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)34

  35. Random Hypergraphs (r-graphs) • 3-graph: jan, feb,mar Key set S 1 0 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)35

  36. Random Hypergraphs (r-graphs) • 3-graph: jan, feb,mar Key set S h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)36

  37. Random Hypergraphs (r-graphs) • 3-graph: jan, feb,mar Key set S h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 h0(feb) = 1 h1(feb) = 2 h2(feb) = 5 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)37

  38. Random Hypergraphs (r-graphs) • 3-graph: jan, feb,mar Key set S h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 h0(feb) = 1 h1(feb) = 2 h2(feb) = 5 2 3 h0(mar) = 0 h1(mar) = 3 h2(mar) = 4 4 5 • 3-graph is induced by three uniform hash functions • Our best result uses 3-graphs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)38

  39. Internal Random Access Memory Algorithm (r=2) S jan feb mar apr LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)39

  40. Internal Random Access Memory Algorithm (r=2) S Gr: jan feb mar apr h0 1 2 0 3 jan Mapping mar apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)40

  41. Acyclic 2-graph Gr: L:Ø h0 1 2 0 3 jan mar apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)41

  42. Acyclic 2-graph Gr: L: {0,5} h0 1 2 0 3 jan apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)42

  43. Acyclic 2-graph 0 1 Gr: L: {0,5} {2,6} h0 1 2 0 3 jan apr h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)43

  44. Acyclic 2-graph 0 1 2 Gr: L: {0,5} {2,6} {2,7} h0 1 2 0 3 jan h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)44

  45. Acyclic 2-graph 0 1 2 3 Gr: L: {0,5} {2,6} {2,7} {2,5} h0 1 2 0 3 Gr is acyclic h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)45

  46. Internal Random Access Memory Algorithm (r=2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 r 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 r 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)46

  47. Internal Random Access Memory Algorithm (r=2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 r 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)47

  48. Internal Random Access Memory Algorithm (r=2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 1 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)48

  49. Internal Random Access Memory Algorithm (r=2) g S assigned Gr: assigned 0 0 jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 1 7 1 assigned assigned LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)49

  50. Internal Random Access Memory Algorithm (r=2) g S assigned Gr: assigned 0 0 jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 1 7 1 assigned assigned i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)50

More Related