
Indexing Internal Memory with Minimal Perfect Hash Functions

Fabiano C. Botelho, Hendrickson R. Langbehn, Guilherme V. Menezes, Nivio Ziviani. Department of Computer Science, Federal University of Minas Gerais, Brazil. Brazilian Symposium on Database, Campinas, Brazil, October 13, 2008.





Presentation Transcript


  1. Indexing Internal Memory with Minimal Perfect Hash Functions • Fabiano C. Botelho, Hendrickson R. Langbehn, Guilherme V. Menezes, Nivio Ziviani • Department of Computer Science, Federal University of Minas Gerais, Brazil • Brazilian Symposium on Database, Campinas, Brazil, October 13, 2008 • LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  2. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  3. The Problem • Pairs: <Key, Data> • Static Key Set

  4. The Problem • Pairs: <Key, Data> • Static Key Set • Insert into a Data Structure

  5. The Problem • Pairs: <Key, Data> • Static Key Set • Insert into a Data Structure • Lookups on the Data Structure

  6. The Problem • Pairs: <Key, Data> • Static Key Set • Insert into a Data Structure • Lookups on the Data Structure • Which data structure gives the best trade-off between lookup time and internal memory usage?

  7. Where Does This Problem Appear? • In data warehousing applications • Large vocabularies in web search engines

  8. Indexing: Representing the Vocabulary • [Figure: indexing a collection of documents (Doc 1 ... Doc n) produces a vocabulary (Term 1 ... Term t); each term points to an inverted list of documents, e.g. Term 1 → Doc 1, Doc 5, ...]

  9. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  10. Motivation • Minimal perfect hash functions were not considered a good option in the past: • They had the same space overhead as other hashing schemes • Other hashing schemes had faster lookups • New results on minimal perfect hash functions have changed this scenario.

  11. Objective • Compare the use of minimal perfect hash functions with traditional hashing techniques, namely: • Linear Hashing • Quadratic Hashing • Double Hashing • Cuckoo Hashing • Sparse Hashing

  12. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  13. Hash Function • Set of n keys S: jan, feb, mar, apr, may, ...

  14. Hash Function • Set of n keys S: jan, feb, mar, apr, may, ... • A hash function maps each key to a slot of a hash table with positions 0, 1, 2, 3, ..., m-1

  15. Hash Function • Set of n keys S: jan, feb, mar, apr, may, ... • A hash function maps each key to a slot of a hash table with positions 0, 1, 2, 3, ..., m-1 • Collision: two keys are mapped to the same slot

  16. Perfect Hash Function • Static key set S of size n (keys 0, 1, ..., n-1) • A perfect hash function maps the keys, without collisions, to a hash table with positions 0, 1, ..., m-1

  17. Minimal Perfect Hash Function • Static key set S of size n (keys 0, 1, ..., n-1) • A minimal perfect hash function maps the keys, without collisions, to a hash table with exactly n positions 0, 1, ..., n-1
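The difference between an ordinary, a perfect, and a minimal perfect hash function can be checked mechanically on a static key set. The sketch below is only an illustration: `classify` is a hypothetical helper, and the toy mappings are hand-written dictionaries rather than real hash functions.

```python
def classify(h, keys, m):
    """Classify h on a static key set as 'not perfect', 'perfect' or 'minimal perfect'."""
    slots = [h(k) for k in keys]
    if any(not 0 <= s < m for s in slots):
        raise ValueError("h maps a key outside the table [0, m-1]")
    if len(set(slots)) != len(keys):
        return "not perfect"          # at least one collision
    # No collisions: minimal only if the table has exactly n slots.
    return "minimal perfect" if m == len(keys) else "perfect"


keys = ["jan", "feb", "mar", "apr"]
# Toy mappings written by hand (assumptions, not real hash functions).
sparse = {"jan": 2, "feb": 6, "mar": 0, "apr": 7}
dense = {"mar": 0, "jan": 1, "feb": 2, "apr": 3}

print(classify(len, keys, 8))         # every key has length 3, so all collide
print(classify(sparse.get, keys, 8))  # injective into 8 slots
print(classify(dense.get, keys, 4))   # injective into exactly n = 4 slots
```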

  18. Theoretical Lower Bounds for Storage Space • PHFs (m > n): for m = 1.23n, about 0.89n bits • MPHFs (m = n): about 1.44n bits
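The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the sketch below evaluates the asymptotic bounds usually quoted in the perfect hashing literature: n log2(e) bits for an MPHF, and (1 + (m/n - 1) ln(1 - n/m)) n log2(e) bits for a PHF with m > n. That these are the exact expressions shown on the slide is an assumption; they do reproduce the "Optimal" figures of 0.89n and 1.44n bits quoted later in the talk. The function names are hypothetical.

```python
import math


def mphf_lower_bound_bits_per_key():
    """Asymptotic lower bound for MPHFs (m = n): log2(e) bits per key."""
    return math.log2(math.e)


def phf_lower_bound_bits_per_key(c):
    """Asymptotic lower bound for PHFs with table size m = c * n, c > 1."""
    if c <= 1:
        raise ValueError("c = m/n must be greater than 1")
    return (1 + (c - 1) * math.log(1 - 1 / c)) * math.log2(math.e)


print(round(phf_lower_bound_bits_per_key(1.23), 2))  # 0.89
print(round(mphf_lower_bound_bits_per_key(), 2))     # 1.44
```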

  19. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  20. Minimal Perfect Hashing Method Used • Functions are simple to describe and implement • Functions are generated from random acyclic hypergraphs • The same idea was used before by Majewski et al. (1996): O(n log n) bits

  21. Random Hypergraphs (r-graphs) • 3-graph for key set S = {jan, feb, mar} on vertices 0, 1, 2, 3, 4, 5 • The 3-graph is induced by three uniform hash functions

  22. Random Hypergraphs (r-graphs) • 3-graph for key set S = {jan, feb, mar} on vertices 0, 1, 2, 3, 4, 5 • Edge jan: h0(jan) = 1, h1(jan) = 3, h2(jan) = 5 • The 3-graph is induced by three uniform hash functions

  23. Random Hypergraphs (r-graphs) • 3-graph for key set S = {jan, feb, mar} on vertices 0, 1, 2, 3, 4, 5 • Edge jan: h0(jan) = 1, h1(jan) = 3, h2(jan) = 5 • Edge feb: h0(feb) = 1, h1(feb) = 2, h2(feb) = 5 • The 3-graph is induced by three uniform hash functions

  24. Random Hypergraphs (r-graphs) • 3-graph for key set S = {jan, feb, mar} on vertices 0, 1, 2, 3, 4, 5 • Edge jan: h0(jan) = 1, h1(jan) = 3, h2(jan) = 5 • Edge feb: h0(feb) = 1, h1(feb) = 2, h2(feb) = 5 • Edge mar: h0(mar) = 0, h1(mar) = 3, h2(mar) = 4 • The 3-graph is induced by three uniform hash functions • Our best result uses 3-graphs
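A minimal sketch of how three hash functions induce a 3-graph: each key becomes one edge (h0(key), h1(key), h2(key)). The example values on the slides are consistent with each function owning its own third of the vertex range, so the sketch assumes that layout. The salted SHA-256 hash is a stand-in chosen for illustration, not the hash family used by the authors, and `make_hash` and `build_3graph` are hypothetical names.

```python
import hashlib


def make_hash(seed, lo, hi):
    """Stand-in for a uniform hash function with range [lo, hi) (an assumption)."""
    def h(key):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        return lo + int.from_bytes(digest[:8], "big") % (hi - lo)
    return h


def build_3graph(keys, m, seed=0):
    """Return {key: (v0, v1, v2)}; h_i picks a vertex in the i-th third of [0, m)."""
    if m % 3 != 0:
        raise ValueError("this sketch assumes m is a multiple of 3")
    third = m // 3
    hs = [make_hash(f"{seed}-{i}", i * third, (i + 1) * third) for i in range(3)]
    return {key: tuple(h(key) for h in hs) for key in keys}


edges = build_3graph(["jan", "feb", "mar"], m=6)
for key, edge in edges.items():
    print(key, edge)  # one 3-vertex edge per key
```

The concrete vertex numbers depend on the stand-in hash and will generally differ from the values drawn on the slides.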

  25. Minimal Perfect Hashing Method Used (r = 2) • S = {jan, feb, mar, apr}

  26. Minimal Perfect Hashing Method Used (r = 2) • Mapping: S = {jan, feb, mar, apr} is mapped to a 2-graph Gr • h0 selects a vertex in {0, 1, 2, 3} and h1 a vertex in {4, 5, 6, 7} • Edges: jan = {2,5}, feb = {2,6}, mar = {0,5}, apr = {2,7}

  27. Acyclic 2-graph • Gr with edges jan, feb, mar, apr • L: Ø

  28. Acyclic 2-graph • Edge mar = {0,5} is removed; remaining edges: jan, feb, apr • L: {0,5}

  29. Acyclic 2-graph • Edge feb = {2,6} is removed; remaining edges: jan, apr • L: {0,5} {2,6}

  30. Acyclic 2-graph • Edge apr = {2,7} is removed; remaining edge: jan • L: {0,5} {2,6} {2,7}

  31. Acyclic 2-graph • Edge jan = {2,5} is removed; no edges remain • L: {0,5} {2,6} {2,7} {2,5} • Gr is acyclic
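The acyclicity test walked through on these slides can be sketched as a peeling procedure: repeatedly remove an edge that has a vertex of degree 1 and append it to L; the graph is acyclic exactly when every edge gets removed. The edge list below is the one read off the slides (mar = {0,5}, feb = {2,6}, apr = {2,7}, jan = {2,5}). The function name `peel` and the queue-based scan order are assumptions of this sketch, and it presumes that the vertices of an edge are distinct, which holds when h0 and h1 have disjoint ranges.

```python
from collections import deque


def peel(edges, num_vertices):
    """Return the edges in removal order if the graph is acyclic, else None."""
    incident = [set() for _ in range(num_vertices)]
    for name, verts in edges.items():
        for v in verts:
            incident[v].add(name)

    # Start from every vertex that touches exactly one edge.
    queue = deque(v for v in range(num_vertices) if len(incident[v]) == 1)
    order = []
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue  # its edge was already removed through another vertex
        (name,) = incident[v]
        order.append(name)
        for u in edges[name]:
            incident[u].discard(name)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if len(order) == len(edges) else None


edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
order = peel(edges, 8)
print([edges[k] for k in order])  # [(0, 5), (2, 6), (2, 7), (2, 5)]
```

With this scan order the removal sequence matches the list L built up on the slides; a different but equally valid order is possible with other scan strategies.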

  32. Minimal Perfect Hashing Method Used (r = 2) • Mapping: S = {jan, feb, mar, apr} → Gr • Assigning: Gr and L: {0,5} {2,6} {2,7} {2,5} → array g • g[0..7] = r, r, r, r, r, r, r, r (r marks an unassigned entry)

  33. Minimal Perfect Hashing Method Used (r = 2) • Mapping: S = {jan, feb, mar, apr} → Gr • Assigning: Gr and L: {0,5} {2,6} {2,7} {2,5} → array g • g[0..7] = r, r, 0, r, r, r, r, r

  34. Minimal Perfect Hashing Method Used (r = 2) • Mapping: S = {jan, feb, mar, apr} → Gr • Assigning: Gr and L: {0,5} {2,6} {2,7} {2,5} → array g • g[0..7] = r, r, 0, r, r, r, r, 1

  35. Minimal Perfect Hashing Method Used (r = 2) • Mapping: S = {jan, feb, mar, apr} → Gr • Assigning: Gr and L → array g • g[0..7] = 0, r, 0, r, r, r, 1, 1 • Entries 0, 2, 6 and 7 are assigned
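The assigning step can be sketched as follows: process the edges of L in reverse removal order, pick a vertex of the edge that has not been visited yet, and set its g value so that the sum of g over the edge, taken mod r, equals the index of the hash function that points at that vertex. Unassigned entries keep the marker value r. This is a reading of the slides rather than the authors' code; `assign` and the "first unvisited vertex" rule are assumptions of the sketch, although on the example they reproduce the array g shown on slide 35.

```python
R = 2  # number of hash functions; the value R also marks "unassigned" in g


def assign(edges, order, num_vertices, r=R):
    """Fill g so that sum(g[v] for v in edge) % r selects a unique vertex per key."""
    g = [r] * num_vertices
    visited = [False] * num_vertices
    for name in reversed(order):          # reverse of the removal order L
        verts = edges[name]
        # Index of the first vertex of this edge not touched by earlier edges.
        i = next(j for j, v in enumerate(verts) if not visited[v])
        others = sum(g[v] for j, v in enumerate(verts) if j != i)
        g[verts[i]] = (i - others) % r
        for v in verts:
            visited[v] = True
    return g


edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
order = ["mar", "feb", "apr", "jan"]      # L from the acyclicity test
g = assign(edges, order, 8)
print(g)  # [0, 2, 0, 2, 2, 2, 1, 1], i.e. 0, r, 0, r, r, r, 1, 1
```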

  36. Minimal Perfect Hashing Method Used (r = 2) • g[0..7] = 0, r, 0, r, r, r, 1, 1 • i = (g[h0(feb)] + g[h1(feb)]) mod r = (g[2] + g[6]) mod 2 = 1

  37. Minimal Perfect Hashing Method Used (r = 2) • g[0..7] = 0, r, 0, r, r, r, 1, 1 • i = (g[h0(feb)] + g[h1(feb)]) mod r = (g[2] + g[6]) mod 2 = 1 • phf(feb) = h1(feb) = 6 • Hash table: 0: mar, 1: -, 2: jan, 3: -, 4: -, 5: -, 6: feb, 7: apr

  38. Minimal Perfect Hashing Method Used (r = 2) • Mapping, Assigning, Ranking • g[0..7] = 0, r, 0, r, r, r, 1, 1 • i = (g[h0(feb)] + g[h1(feb)]) mod r = (g[2] + g[6]) mod 2 = 1 • phf(feb) = h1(feb) = 6 • mphf(feb) = rank(phf(feb)) = rank(6) = 2 • Hash table: 0: mar, 1: jan, 2: feb, 3: apr
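The lookup shown on these slides can be replayed in a few lines. In this sketch the dictionary `EDGES` stands in for the hash functions h0 and h1 (a real implementation would compute them from the key), `G` is the array g from the assigning step with r = 2 as the unassigned marker, and `rank` is a plain linear scan. The talk mentions extra ranking information (the εm bits) precisely so that the rank does not have to be computed this way; the scan is a simplification made here.

```python
R = 2
# Stand-in for (h0(key), h1(key)); values read off the slides.
EDGES = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
G = [0, R, 0, R, R, R, 1, 1]


def phf(key):
    """Perfect hash: pick the vertex h_i(key) with i = (sum of g over the edge) mod r."""
    verts = EDGES[key]
    i = sum(G[v] for v in verts) % R
    return verts[i]


def rank(pos):
    """Number of assigned entries of g strictly before position pos."""
    return sum(1 for value in G[:pos] if value != R)


def mphf(key):
    """Minimal perfect hash: compress the PHF value into [0, n-1]."""
    return rank(phf(key))


for key in ["mar", "jan", "feb", "apr"]:
    print(key, phf(key), mphf(key))
# mar 0 0
# jan 2 1
# feb 6 2
# apr 7 3
```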

  39. Space to Represent the Functions • g[0..7] = 0, r, 0, r, r, r, 1, 1 • 2 bits for each entry of g

  40. Space to Represent the Functions (r = 3) • PHF g: [0,m-1] → {0,1,2} • 2 bits per entry, m = cn: 2cn bits, c = 1.23 → 2.46n bits • (log 3) cn bits, c = 1.23 → 1.95n bits (arith. coding) • Optimal: 0.89n bits

  41. Space to Represent the Functions (r = 3) • PHF g: [0,m-1] → {0,1,2} • 2 bits per entry, m = cn: 2cn bits, c = 1.23 → 2.46n bits • (log 3) cn bits, c = 1.23 → 1.95n bits (arith. coding) • Optimal: 0.89n bits • MPHF g: [0,m-1] → {0,1,2,3} (ranking info required) • 2m + εm = (2 + ε)cn bits • For c = 1.23 and ε = 0.125 → 2.62n bits • Optimal: 1.44n bits
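The per-key space figures on these slides follow from simple arithmetic, sketched below with the slide's values c = 1.23 and ε = 0.125. The function name `bits_per_key` is hypothetical. One caveat: with exactly c = 1.23 the MPHF product comes out at about 2.61 rather than the 2.62 quoted on the slide, possibly because c is itself a rounded constant; that explanation is a guess.

```python
import math


def bits_per_key(c=1.23, eps=0.125):
    """Space of the function description in bits per key, for table size m = c * n."""
    return {
        "phf_2bit": 2 * c,                 # one 2-bit entry per vertex
        "phf_packed": math.log2(3) * c,    # entries in {0,1,2} packed at log2(3) bits
        "mphf": (2 + eps) * c,             # 2-bit entries plus eps * m bits of rank info
    }


space = bits_per_key()
for name, bits in space.items():
    print(f"{name}: {bits:.4f} n bits")
```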

  42. Implications of Generating Compact Functions • Practice • Compact functions (cache effects) • Efficiency at retrieval time

  43. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  44. Methods to Compare MPHFs with • Linear Hashing (LH) • Quadratic Hashing (QH) • Double Hashing (DH) • Cuckoo Hashing (CH) • Dense Hashing (DeH) • Sparse Hashing (SH)

  45. Traditional Hashing Data Structure • Linear Hashing, Quadratic Hashing, Double Hashing, Cuckoo Hashing, Dense Hashing

  46. Sparse Hashing Data Structure

  47. Minimal Perfect Hashing Data Structure • [Figure: hash table with positions 0, 1, ..., n-1; each position stores one key together with its data, (Ki, Di)]

  48. Summary • The Problem • Motivation and Objective • Basic Concepts • Minimal Perfect Hashing Method Used • Methods to Compare MPHFs with • Experimental Results • Conclusions

  49. Experimental Setup • Commodity PC with a cache of 4 Mbytes • 1.86 GHz, 4 GB, Linux 2.6, 64-bit architecture • Vocabularies • Successful and Unsuccessful Searches

  50. Comparing Minimal Perfect Hashing with Linear Hashing, Quadratic Hashing and Double Hashing • Successful Searches
