
Bloom Filters

Bloom Filters. Kira Radinsky. Slides based on material from: Michael Mitzenmacher and Hanoch Levy Some modifications by Naama Kraus . Bloom Filter. Problem : membership testing Does item X belong to a set S ?




  1. Bloom Filters Kira Radinsky Slides based on material from: Michael Mitzenmacher and Hanoch Levy Some modifications by Naama Kraus

  2. Bloom Filter • Problem: membership testing • Does item X belong to a set S ? • Assumption: the great majority of items tested will not belong to the given set • Data structure should be: • Fast (faster than searching through S). • Small (smaller than explicit representation). • The “price”: allow some probability of error • Allow false positive errors • Don’t allow false negative errors

  3. Sample Application: Distributed Web Caches [Figure: Web Cache 1 through Web Cache 6] • Summary Cache [Fan, Cao, Almeida, & Broder]: if local caches know each other’s content, try a local cache before going out to the Web • The idea: each cache keeps a summary of the content of each participating cache • Store each summary in a Bloom Filter
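The summary-cache idea can be sketched in a few lines of Python. This is an illustration only, not the protocol from the cited paper: the `Cache` class, the `lookup` function, the 256-bit summary size, the 4 hash functions and the SHA-256 salting trick are all assumptions made for the example. For brevity the summary is read directly from the peer object, whereas a real cache would hold its own copy of each peer's summary.

```python
import hashlib


def positions(url: str, m: int = 256, k: int = 4) -> list[int]:
    # Assumption: k hash functions are simulated by salting SHA-256 with the index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class Cache:
    """Hypothetical web cache that publishes a Bloom filter summary of its content."""

    def __init__(self, name: str, urls: list[str]) -> None:
        self.name = name
        self.store = set(urls)      # the actual cached content
        self.summary = [0] * 256    # Bloom filter summarising self.store
        for url in urls:
            for a in positions(url):
                self.summary[a] = 1


def might_hold(summary: list[int], url: str) -> bool:
    # True if all probed bits are 1 (possibly a false positive).
    return all(summary[a] for a in positions(url))


def lookup(local: Cache, peers: list[Cache], url: str) -> str:
    if url in local.store:
        return f"hit in {local.name}"
    for peer in peers:
        # Only contact peers whose summary says the URL may be there.
        if might_hold(peer.summary, url) and url in peer.store:
            return f"hit in {peer.name}"
        # A summary match without a real hit is a false positive: one wasted query.
    return "fetch from the Web"


c1 = Cache("cache1", ["a.example/index.html"])
c2 = Cache("cache2", ["b.example/logo.png"])
print(lookup(c1, [c2], "b.example/logo.png"))  # hit in cache2
print(lookup(c1, [c2], "c.example/page"))      # fetch from the Web
```

Because a Bloom filter has no false negatives, a URL held by a peer is always found; a false positive only costs one unnecessary query to that peer.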

  4. Why Bloom Filters? • Size is very economical • Efficient query time • Percentage of false positives is 1%-2% for 8 bits per entry • False positives are possible • Penalty is a wasted cache query. Small cost. • No false negatives • Never miss a cache hit. Big potential gain.

  5. Bloom Filters [Figure: a 16-bit array B, first all 0s, then 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 after the items are hashed in] Start with an m-bit array, filled with 0s. Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1. To check if y is in S, check B at Hi(y). All k values must be 1. Possible to have a false positive: all k values are 1, but y is not in S.
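The insert and lookup steps on this slide can be sketched as follows. This is a minimal sketch, not a tuned implementation: the class name `BloomFilter` and the use of SHA-256 salted with the index i to stand in for the k hash functions Hi are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions H_1..H_k."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # start with an m-bit array, filled with 0s

    def _positions(self, item: str):
        # Assumption: H_i(x) is simulated by hashing the index i together with x.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1  # if H_i(x_j) = a, set B[a] = 1

    def __contains__(self, item: str) -> bool:
        # y is reported as present only if all k probed bits are 1.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print("banana" in bf)  # True: inserted items are never missed
print("durian" in bf)  # usually False, but a false positive is possible
```

A list of Python integers is used for readability; a real filter would pack the bits into a byte array.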

  6. Bloom Filter [Figure: item x is hashed by h1(x), h2(x), h3(x), ..., hk(x) to positions in the bit vector V0 ... Vm-1]

  7. Bloom Errors [Figure: items a, b, c, d have been hashed into the bit vector V0 ... Vm-1; the positions h1(x), h2(x), h3(x), ..., hk(x) are all 1] x didn’t appear, yet its bits are already set

  8. Computational Factors • Size m/n: bits per item. • |S| = n: number of elements to encode. • hi: U → [1..m]: maintain a bit vector V of size m. • Time k: number of hash functions. • Use k hash functions (h1..hk). • Error f: false positive probability.

  9. Error Estimation • Assumption: hash functions are perfectly random • Probability of a bit being 0 after hashing all elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$ • Let $p = e^{-kn/m}$; the probability of a false positive is: $f = (1 - p)^k = (1 - e^{-kn/m})^k$ • Assuming we are given m and n, the optimal k is: $k = (m/n) \ln 2$
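The optimal k follows from a short calculation, sketched here under the same assumption of perfectly random hash functions. The symbols m, n, k, p and f are the ones used on the slides; the derivation is the standard textbook argument rather than something shown in the deck.

```latex
\begin{align*}
  p &= e^{-kn/m} \quad\Longrightarrow\quad k = -\frac{m}{n}\,\ln p, \\
  \ln f &= k \,\ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p).
\end{align*}
% The product \ln p \cdot \ln(1-p) is symmetric in p and 1-p
% and is maximal at p = 1/2, so f is minimal there:
\begin{align*}
  p = \tfrac{1}{2}
  \quad\Longrightarrow\quad
  k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2,
  \qquad
  f_{\min} = \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}} \approx (0.6185)^{m/n}.
\end{align*}
```

The condition p = 1/2 says that at the optimum about half of the bits in the array are still 0.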

  10. Example m/n = 8. Opt k = 8 ln 2 = 5.545...
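The example can be checked numerically. The sketch below evaluates the false positive formula f = (1 - e^(-kn/m))^k for m/n = 8 over a range of integer k; the function name `false_positive_rate` is made up for the illustration.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """f = (1 - e^(-k n / m))^k, with m/n passed as bits_per_item."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8
k_opt = bits_per_item * math.log(2)
print(f"optimal k = {k_opt:.3f}")  # optimal k = 5.545

# k must be an integer in practice, so compare the neighbouring choices.
for k in range(1, 11):
    print(k, f"{false_positive_rate(bits_per_item, k):.4f}")
```

Since k must be a whole number, the real choice is between k = 5 and k = 6; both give a false positive rate of roughly 2% at 8 bits per item.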

  11. Bloom Filter Tradeoffs • Three factors: m, k and n. • Normally, n and m are given, and we select k. • More hash functions yield more chances to find a 0 bit for elements not in S. • Fewer hash functions increase the fraction of the bits that are 0. • Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is 0.5.
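The tradeoff can be observed with a small simulation. The sketch below assumes perfectly random hashing, modelled by drawing bit positions from a seeded pseudo-random generator, and compares the measured fraction of 1 bits with the prediction 1 - e^(-kn/m). The function name and the sizes m and n are arbitrary choices for the example.

```python
import math
import random


def fraction_of_ones(m: int, n: int, k: int, seed: int = 0) -> float:
    """Insert n items with k uniformly random positions each; return the share of 1 bits."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(n * k):  # each of the n items sets k independently chosen bits
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, n = 80_000, 10_000  # m/n = 8 bits per item
for k in (2, 6, 12):
    predicted = 1 - math.exp(-k * n / m)
    print(k, round(fraction_of_ones(m, n, k), 3), round(predicted, 3))
```

With k = 6, the integer nearest to the optimum 8 ln 2, slightly more than half of the bits end up set; smaller k leaves the array mostly 0 and larger k fills it up.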

  12. Bloom Filters and Deletions • Cache contents change • Items both inserted and deleted. • Insertions are easy – add bits to BF • Can Bloom filters handle deletions? • Use Counting Bloom Filters to track insertions/deletions

  13. Handling Deletions • Bloom filters can handle insertions, but not deletions. • If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj. [Figure: items xi and xj hashed into the bit array B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0]
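The problem can be reproduced with a toy example. The bit positions below are hypothetical and chosen by hand so that xi and xj share one position; with real hash functions such overlaps happen by chance.

```python
# Hypothetical bit positions for two items in a 16-bit filter (k = 3).
# They are chosen by hand so that x_i and x_j share position 9.
positions = {"x_i": [1, 4, 9], "x_j": [9, 11, 14]}

bits = [0] * 16
for item in positions:
    for a in positions[item]:
        bits[a] = 1


def contains(item: str) -> bool:
    return all(bits[a] for a in positions[item])


print(contains("x_j"))  # True

# Naive deletion of x_i: reset its bits to 0.
for a in positions["x_i"]:
    bits[a] = 0

print(contains("x_j"))  # False: x_j was never deleted, yet it is now missed
```

Resetting the shared bit introduces a false negative, which is exactly the kind of error a Bloom filter is supposed to rule out.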

  14. Counting Bloom Filters [Figure: array B of 16 counters, initially all 0s, then holding small counts (0 to 3) as items are inserted and deleted] Start with an array of m counters, filled with 0s. Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a]. To delete xj, decrement the corresponding counters. Can obtain a corresponding Bloom filter by reducing to 0/1.
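A counting Bloom filter along these lines might look as follows. This is a sketch under assumptions: the class name `CountingBloomFilter`, the method names, and the salted SHA-256 stand-in for the hash functions Hi are illustrative, and counter overflow is ignored because Python integers are unbounded.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter: m small counters instead of m bits."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: H_i(x) is simulated by hashing the index i together with x.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.counters[a] += 1  # if H_i(x_j) = a, add 1 to B[a]

    def remove(self, item: str) -> None:
        # Only valid for items that were actually inserted.
        for a in self._positions(item):
            self.counters[a] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bits(self) -> list[int]:
        # Reduce to 0/1 to obtain the corresponding plain Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x_i")
cbf.add("x_j")
cbf.remove("x_i")

print("x_j" in cbf)       # True: deleting x_i does not disturb x_j
print(sum(cbf.counters))  # 3: only x_j's k increments remain
```

Removing an item that was never inserted would corrupt the counters, so callers are expected to delete only items they know are present.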

  15. Variations and Extensions • Bloomier Filter • Distance-Sensitive Bloom Filters

  16. Extension: Bloomier Filter • Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]: • Map: associate a value with each element (key) • Elements not in the map have a null value • Always return correct value for elements in the map • No false negatives: • If null is returned, element is not in the map • False positives: • Returns a value for an element that is not in the map

  17. Extension: Distance-Sensitive Bloom Filters • Instead of answering questions of the form “Is y in S?” we would like to answer questions of the form “Is y close to some x in S?” • That is, is the query close to some element of the set, under some metric and some notion of close. • Applications: • DNA matching • Virus/worm matching • Databases • Some initial results [Kirsch, Mitzenmacher]. Hard.
