Bloom Filters Kira Radinsky Slides based on material from: Michael Mitzenmacher and Hanoch Levy Some modifications by Naama Kraus
Bloom Filter • Problem: membership testing • Does item X belong to a set S ? • Assumption: the great majority of items tested will not belong to the given set • Data structure should be: • Fast (faster than searching through S). • Small (smaller than explicit representation). • The “price”: allow some probability of error • Allow false positive errors • Don’t allow false negative errors
Sample Application: Distributed Web Caches [Figure: six cooperating caches, Web Cache 1 through Web Cache 6] • Summary Cache [Fan, Cao, Almeida, & Broder]: if local caches know each other’s content, try a local cache before going out to the Web • The idea: each cache keeps a summary of the content of each participating cache • Store each summary in a Bloom filter
Why Bloom Filters? • Size is very economical • Efficient query time • False positive rate is roughly 2% at 8 bits per entry • False positives are possible • Penalty is a wasted cache query. Small cost. • No false negatives • Never miss a cache hit. Big potential gain.
Bloom Filters • Start with an m-bit array B, filled with 0s: B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 • Hash each item x_j in S k times. If H_i(x_j) = a, set B[a] = 1: B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 • To check if y is in S, check B at H_i(y) for i = 1..k. All k values must be 1. • A false positive is possible: all k values are 1, but y is not in S.
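The insert and lookup steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the construction used in the slides: the class name `BloomFilter` is hypothetical, and deriving the k hash values from one SHA-256 digest by double hashing is an assumption made here to keep the example short.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, filled with 0s

    def _positions(self, item):
        # Assumption: the k values H_1(x)..H_k(x) are derived from a single
        # SHA-256 digest by double hashing, H_i(x) = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never 0
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # y is reported as present only if all k bits are 1.
        return all(self.bits[a] == 1 for a in self._positions(item))


# Usage: 1000 items at 8 bits per item, with k = 6 hash functions.
bf = BloomFilter(m=8 * 1000, k=6)
for i in range(1000):
    bf.add(f"url-{i}")

members_found = sum(f"url-{i}" in bf for i in range(1000))
false_positives = sum(f"other-{i}" in bf for i in range(10000))
print(members_found, false_positives)
```

Every inserted item is found again, since its k bits were set on insertion. The second count shows how many of 10000 items that were never inserted are still reported as present.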
Bloom Filter [Figure: element x is hashed by h1(x), h2(x), h3(x), ..., hk(x) into positions of the bit vector V0 ... Vm-1 = 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0]
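Bloom Errors [Figure: elements a, b, c, d have been inserted into the bit vector V0 ... Vm-1 = 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0; h1(x), h2(x), h3(x), ..., hk(x) all point at bits that are 1] • x didn’t appear, yet its bits are already set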
Computational Factors • Size m/n: bits per item • |U| = n: number of elements to encode • h_i: U → [1..m]: maintain a bit vector V of size m • Time k: number of hash functions • Use k hash functions (h_1..h_k) • Error f: false positive probability
Error Estimation • Assumption: hash functions are perfectly random • Probability of a bit being 0 after hashing all n elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$ • Let $p = e^{-kn/m}$; the probability of a false positive is: $f = (1 - p)^k = (1 - e^{-kn/m})^k$ • Assuming we are given m and n, the optimal k is: $k = (m/n) \ln 2$
Example • m/n = 8 • Optimal k = 8 ln 2 = 5.545...
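Since k must be an integer in practice, it can help to tabulate the false positive probability for a few values around the optimum. The sketch below assumes the standard estimate f = (1 - e^{-kn/m})^k under perfectly random hash functions; the function names are illustrative only.

```python
import math


def optimal_k(bits_per_item):
    """Real-valued optimum k = (m/n) ln 2."""
    return bits_per_item * math.log(2)


def false_positive_rate(bits_per_item, k):
    """f = (1 - e^{-kn/m})^k, assuming perfectly random hash functions."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n from the example
print(f"optimal k = {optimal_k(bits_per_item):.3f}")
for k in range(1, 11):
    print(f"k = {k:2d}  f = {false_positive_rate(bits_per_item, k):.4f}")
```

For m/n = 8 the two nearest integers, k = 5 and k = 6, both give f of roughly 2.2%, with k = 6 marginally lower.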
Bloom Filter Tradeoffs • Three factors: m, k and n. • Normally, n and m are given, and we select k. • More hash functions yield more chances to find a 0 bit for elements not in S. • Fewer hash functions increase the fraction of bits that are 0. • Not surprisingly, when k is optimal, the “hit ratio” (the fraction of bits in the array set to 1) is 0.5.
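The value 0.5 follows from a short calculation. The sketch below uses the usual approximation that a given bit is still 0 with probability p = e^{-kn/m} after all n items are hashed, and assumes perfectly random hash functions.

```latex
\begin{align*}
f &= (1 - p)^k, \qquad p = e^{-kn/m} \;\Longrightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p) \\
\intertext{The product $\ln p \,\ln(1 - p)$ is symmetric in $p \leftrightarrow 1 - p$ and is largest at $p = \tfrac12$, so $f$ is smallest there:}
p &= \tfrac12 \;\Longrightarrow\; k = \frac{m}{n}\,\ln 2, \qquad
f = 2^{-k} = \left(2^{-\ln 2}\right)^{m/n} \approx 0.6185^{\,m/n}
\end{align*}
```

At the optimum half of the bits are 0 and half are 1, which is the balance between the two opposing effects listed on the slide.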
Bloom Filters and Deletions • Cache contents change • Items both inserted and deleted. • Insertions are easy – add bits to BF • Can Bloom filters handle deletions? • Use Counting Bloom Filters to track insertions/deletions
Handling Deletions • Bloom filters can handle insertions, but not deletions. • If deleting x_i means resetting 1s to 0s, then deleting x_i will also “delete” x_j whenever the two share a bit. [Figure: x_i and x_j hash into overlapping positions of B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0]
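A toy example makes the problem concrete. The hash positions below are hypothetical, picked by hand so that x_i and x_j share bit 4; no real hash function is involved.

```python
# Hypothetical hash positions, chosen by hand so that x_i and x_j share bit 4.
POSITIONS = {"x_i": [1, 4, 9], "x_j": [4, 11, 14]}

bits = [0] * 16  # B, filled with 0s


def insert(item):
    for a in POSITIONS[item]:
        bits[a] = 1


def query(item):
    return all(bits[a] == 1 for a in POSITIONS[item])


def naive_delete(item):
    # Resetting 1s to 0s: this is the operation that breaks the filter.
    for a in POSITIONS[item]:
        bits[a] = 0


insert("x_i")
insert("x_j")
before = query("x_j")
naive_delete("x_i")
after = query("x_j")
print(before, after)  # True False
```

After the naive deletion, x_j is reported as absent even though it was never removed. That is a false negative, which a Bloom filter must never produce.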
Counting Bloom Filters • Start with an array B of m counters, all 0: B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 • Hash each item x_j in S k times. If H_i(x_j) = a, add 1 to B[a]: B = 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0 • To delete x_j, decrement the corresponding counters: B = 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0 • Can obtain a corresponding Bloom filter by reducing to 0/1: B = 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
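A counting Bloom filter can be sketched the same way, with counters in place of bits. The class name `CountingBloomFilter` and the SHA-256 double-hashing scheme are assumptions made for illustration, and the counters here are unbounded Python integers rather than the small fixed-width counters a real deployment would likely use.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter: m counters and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m  # B, all counters start at 0

    def _positions(self, item):
        # Assumption: k positions from one SHA-256 digest by double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # If H_i(x) = a, add 1 to B[a].
        for a in self._positions(item):
            self.counters[a] += 1

    def remove(self, item):
        # Only meaningful for items that were actually inserted earlier.
        for a in self._positions(item):
            if self.counters[a] > 0:
                self.counters[a] -= 1

    def __contains__(self, item):
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bits(self):
        # Reduce to 0/1 to obtain the corresponding plain Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x_i")
cbf.add("x_j")
cbf.remove("x_i")
print("x_j" in cbf)  # True: x_j's counters are still positive
```

Deleting x_i only decrements the counters that x_i incremented, so any counter shared with x_j stays positive and x_j is still found.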
Variations and Extensions • Bloomier Filter • Distance-Sensitive Bloom Filters
Extension: Bloomier Filter • Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]: • Map: associate a value with each element (key) • Elements not in the map have a null value • Always returns the correct value for elements in the map • No false negatives: • If null is returned, the element is not in the map • False positives: • May return a value for an element that is not in the map
Extension: Distance-Sensitive Bloom Filters • Instead of answering questions of the form “Is $y \in S$?” we would like to answer questions of the form “Is $y \approx x$ for some $x \in S$?” • That is, is the query close to some element of the set, under some metric and some notion of close. • Applications: • DNA matching • Virus/worm matching • Databases • Some initial results [Kirsch, Mitzenmacher]. Hard.