
Optimizing Data Popularity Conscious Bloom Filters

Optimizing Data Popularity Conscious Bloom Filters. Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas (University of Rochester). Problem overview: Bloom filters are a compact set representation in which each object is hashed into several bits in the filter.


Presentation Transcript


  1. Optimizing Data Popularity Conscious Bloom Filters. Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas. University of Rochester.

  2. Problem Overview
     • Bloom filters:
       • compact set representation in which each object is hashed into several bits in the filter;
       • allow false positives in membership queries;
       • useful in distributed applications that communicate sets.
     • Highly skewed data popularity distributions.
     • Data popularity conscious Bloom filters:
       • use a larger number of hashes for likely false-positive candidates: popular objects in queries and unpopular objects in sets.
     • Goal: customize the hash number for each object to minimize the false positive probability.
     PODC 2008
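The idea of a per-object hash number can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `VariableHashBloomFilter`, the `hash_count` callback, and the SHA-256-based hashing are assumptions made for illustration only.

```python
import hashlib


class VariableHashBloomFilter:
    """Bloom filter in which each object may use a different number of hashes.

    Hypothetical sketch: `hash_count` is any callable mapping an object to
    its (positive integer) hash number.
    """

    def __init__(self, m, hash_count):
        self.m = m                      # filter size in bits
        self.bits = bytearray(m)        # one byte per bit, for simplicity
        self.hash_count = hash_count    # object -> number of hashes k_i

    def _positions(self, obj):
        # Derive k_i bit positions by salting the object with the hash index.
        for j in range(self.hash_count(obj)):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits[pos] = 1

    def __contains__(self, obj):
        # True for every added object; may also be True for a non-member
        # (a false positive) if all of its positions happen to be set.
        return all(self.bits[pos] for pos in self._positions(obj))


# Usage: objects assumed to be popular in queries get more hashes.
bf = VariableHashBloomFilter(1024, lambda o: 8 if o.startswith("hot") else 2)
bf.add("hot-item")
bf.add("cold-item")
```

A non-member with many hashes must find all of its bits set to produce a false positive, which is why frequently queried non-members benefit from a larger hash number.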

  3. Object Popularity Stability
     • Stable object popularity is important for learning the object popularity and for low adjustment overhead.
     • [Figure: illustration of popularity stability across month-long trace segments]

  4. Problem Formulation and Result
     • Problem formulation:
       • in a universe of $N$ objects, an $n$-object set is represented by an $m$-bit filter;
       • object $i$'s membership popularity is $p_i$, and its non-member query popularity is $q'_i$;
       • find object hash numbers $k_1, k_2, \ldots, k_N$ that minimize the false positive probability $\sum_{1 \le i \le N} q'_i \cdot B^{k_i}$;
       • $B$ is the probability for an arbitrary filter bit to be 1, therefore $\sum_{1 \le i \le N} p_i \cdot k_i = K = \ln(1-B) / (n \cdot \ln(1-1/m))$.
     • Result (assuming the $k_i$'s are unrestricted real numbers):
       • Lagrangian function: $\sum_{1 \le i \le N} q'_i \cdot B^{k_i} + \lambda \cdot \left(\sum_{1 \le i \le N} p_i \cdot k_i - K\right)$;
       • the optimum is reached when the function's partial derivatives with respect to the $k_i$'s and $\lambda$ are all zero;
       • we find $k_i = C + \log_{1/B}(q'_i / p_i)$, where $C$ is a constant;
       • also $B = 0.5$.
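The closed-form result can be evaluated numerically. The sketch below assumes $B = 0.5$, strictly positive popularities, and a hypothetical helper name `optimal_real_hash_numbers`; the list `q` stands for the non-member query popularities $q'_i$. The constant $C$ is obtained by substituting $k_i = C + \log_2(q'_i/p_i)$ into the constraint $\sum p_i k_i = K$.

```python
import math


def optimal_real_hash_numbers(p, q, m, n):
    """Unrestricted real-number solution k_i = C + log2(q_i / p_i).

    Assumes B = 0.5 and p_i, q_i > 0. Returns (k, K).
    """
    # Hash budget K from the constraint, with B = 0.5.
    K = math.log(0.5) / (n * math.log(1 - 1 / m))
    logs = [math.log2(qi / pi) for pi, qi in zip(p, q)]
    # Solve sum_i p_i * (C + logs_i) = K for C.
    C = (K - sum(pi * li for pi, li in zip(p, logs))) / sum(p)
    return [C + li for li in logs], K


def false_positive_rate(q, k, B=0.5):
    """Objective from the formulation: sum_i q_i * B**k_i."""
    return sum(qi * B ** ki for qi, ki in zip(q, k))


# Usage with made-up popularities: uniform membership, skewed queries.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.7, 0.1, 0.1, 0.1]
k, K = optimal_real_hash_numbers(p, q, m=1000, n=100)
customized = false_positive_rate(q, k)
uniform = false_positive_rate(q, [K] * len(q))  # same budget, equal hashes
```

With skewed `q`, the customized hash numbers give a lower objective value than spending the same budget uniformly. Nothing in this sketch keeps the $k_i$ positive or integral.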

  5. Ranged Integer Problem
     • Practical constraint:
       • object $i$'s hash number $k_i$ must be a positive integer, and is often upper-bounded by $k_{\max}$.
     • Rounding real-number solutions to integers:
       • may increase the false positive rate;
       • it is not understood how large the increase may be.
     • Overview of our approach:
       • introduce an importance score for each object (intuitively, more important objects should receive more hashes);
       • the importance ranking helps produce fast approximation solutions.

  6. Object Importance Score
     • Intuition:
       • revisit the optimal real-number solution: $k_i = C + \log_2(q'_i / p_i)$;
       • hint: $q'_i / p_i$ provides a ranking on object hash numbers in a "good" solution.
     • Results:
       • for the ranged real-number problem, an optimal solution $k_1, k_2, \ldots, k_N$ must follow the importance ranking;
       • $\lfloor k_1 \rfloor, \lfloor k_2 \rfloor, \ldots, \lfloor k_N \rfloor$ is a 2-approximation solution to the ranged integer problem; it also follows the importance ranking.
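A small Python sketch of the importance ranking and the floor step follows. The helper names are hypothetical, `q` stands for $q'$, and $B = 0.5$ is assumed. The comparison at the end reflects a simple observation rather than the authors' proof: since $\lfloor k \rfloor > k - 1$, flooring at most doubles each term $q'_i \cdot 0.5^{k_i}$.

```python
import math


def follows_importance_ranking(k, p, q):
    """True if a higher score q_i / p_i never receives fewer hashes."""
    # Ties in score are ordered by k so that they cannot cause a violation.
    order = sorted(range(len(k)), key=lambda i: (q[i] / p[i], k[i]),
                   reverse=True)
    return all(k[a] >= k[b] for a, b in zip(order, order[1:]))


def floor_solution(k_real):
    """Round every real-valued hash number down to an integer."""
    return [math.floor(ki) for ki in k_real]


def false_positive_rate(q, k, B=0.5):
    return sum(qi * B ** ki for qi, ki in zip(q, k))


# Usage with made-up numbers.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.7, 0.1, 0.1, 0.1]
k_real = [9.03, 6.23, 6.23, 6.23]
k_int = floor_solution(k_real)
still_ranked = follows_importance_ranking(k_int, p, q)
within_factor_two = false_positive_rate(q, k_int) < 2 * false_positive_rate(q, k_real)
```

Flooring is monotone, so a solution that follows the ranking before flooring still follows it afterwards.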

  7. Polynomial-Time 2-Approximation
     • Our result indicates that at least one solution that follows the importance score ranking is provably a 2-approximation.
       ⇒ If we enumerate all importance-ranked solutions, the best one is a 2-approximation.
     • $O(N^{k_{\max}})$-time 2-approximation:
       • there are no more than $(N+1)^{k_{\max}-1}$ importance-ranked solutions in total;
       • it takes $O(N)$ time to check the constraint and calculate the false positive rate for each solution.
     • Practically expensive:
       • $N$ can be huge;
       • the constant $k_{\max}$ may not be very small (e.g., 20).
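The enumeration can be sketched in Python as a brute-force search. Once objects are sorted by importance, a ranked solution is determined by $k_{\max} - 1$ cut points in the ranking. The sketch makes assumptions the slides do not spell out: the constraint is treated as a budget $\sum p_i k_i \le K$, $B$ is fixed at 0.5, `q` stands for $q'$, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def best_importance_ranked(p, q, K, k_max, B=0.5):
    """Enumerate importance-ranked integer solutions; return (k, fp).

    Returns (None, None) if no enumerated solution fits the budget K.
    """
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i], reverse=True)
    best_k, best_fp = None, None
    # Each nondecreasing tuple of k_max - 1 cut points in 0..N is one solution.
    for cuts in combinations_with_replacement(range(N + 1), k_max - 1):
        k = [0] * N
        for rank, i in enumerate(order):
            # Every cut at or before this rank lowers the hash number by one.
            drops = sum(1 for c in cuts if c <= rank)
            k[i] = k_max - drops
        # O(N) constraint check (assumed form: a budget on sum p_i * k_i).
        if sum(pi * ki for pi, ki in zip(p, k)) > K + 1e-12:
            continue
        # O(N) false positive rate.
        fp = sum(qi * B ** ki for qi, ki in zip(q, k))
        if best_fp is None or fp < best_fp:
            best_k, best_fp = k, fp
    return best_k, best_fp


# Usage with made-up inputs.
k, fp = best_importance_ranked(
    p=[0.25, 0.25, 0.25, 0.25], q=[0.7, 0.1, 0.1, 0.1], K=3.0, k_max=4)
```

The number of cut-point tuples grows quickly with both `N` and `k_max`, so this direct search is only practical for small inputs.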

  8. Faster Solutions
     • $(2+\varepsilon)$-approximation:
       • the problem of identifying the best importance-ranked solution can be transformed into a knapsack problem;
       • dynamic programming produces a $(2+\varepsilon)$-approximation solution in $O(N^2/\varepsilon)$ time.
     • Coarse-grained optimization:
       • partition the large number of objects into a small number of groups (objects in each group have similar importance scores);
       • optimize at the group granularity, then assign an equal hash number to all objects within one group ⇒ much smaller $N$.
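The coarse-grained step can be sketched as follows. The slides only say that objects in a group have similar importance scores; splitting the importance-sorted list into equal-sized contiguous chunks is an assumption made here, as are the function names. Summing $p_i$ and $q'_i$ (written `q`) within a group keeps both the objective and the constraint exact when all members share one hash number.

```python
def group_by_importance(p, q, num_groups):
    """Split objects into contiguous importance-ranked groups.

    Returns (groups, gp, gq): member index lists plus each group's summed
    membership popularity and summed non-member query popularity.
    """
    order = sorted(range(len(p)), key=lambda i: q[i] / p[i], reverse=True)
    size = -(-len(order) // num_groups)  # ceiling division
    groups = [order[s:s + size] for s in range(0, len(order), size)]
    gp = [sum(p[i] for i in g) for g in groups]
    gq = [sum(q[i] for i in g) for g in groups]
    return groups, gp, gq


def expand(groups, group_k, N):
    """Assign each group's hash number to all of its member objects."""
    k = [0] * N
    for members, kg in zip(groups, group_k):
        for i in members:
            k[i] = kg
    return k


# Usage with made-up popularities: 10 objects reduced to 3 groups.
p = [0.1] * 10
q = [0.3, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
groups, gp, gq = group_by_importance(p, q, num_groups=3)
k = expand(groups, group_k=[5, 3, 1], N=len(p))
```

Any solver for the per-object problem can then be run on `gp` and `gq`, with the number of groups in place of the number of objects.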

  9. Evaluation on Synthetic Data
     • Non-member query popularity $q'_i$ follows a Zipf-like distribution.
     • Membership popularity $p_i$ follows a uniform distribution.
     • Our integer approximation solution significantly outperforms the real-rounding solution, particularly at high popularity skewness.
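Synthetic inputs of this shape can be generated with a few lines of Python. The exact parameters used in the evaluation are not given in the slides, so the exponent `alpha` and both function names are assumptions; `alpha` controls the skewness, with `alpha = 0` giving a uniform distribution.

```python
def zipf_like(N, alpha):
    """Popularity of the rank-r object is proportional to 1 / r**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform_popularity(N):
    """Every object is equally likely."""
    return [1.0 / N] * N


# Usage: skewed non-member query popularity, uniform membership popularity.
q = zipf_like(1000, alpha=1.0)
p = uniform_popularity(1000)
```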

  10. Trace-driven Evaluation on Distributed Caching
     • Distributed caches exchange their content (sets of cached web objects) to cooperate.
     • Evaluation driven by web access traces from IRCache.net.

  11. Trace-driven Evaluation on Distributed Keyword Searching
     • Distributed search engines pass keyword indexes to support distributed joins. False positives are resolved by additional communication.
     • Evaluation driven by the web page listing at dmoz.com and keyword query traces at Ask.com.

  12. Related Work
     • Compressed Bloom filters [Mitzenmacher 2002].
     • Bloom filters with additional functionalities:
       • deletion [Fan et al. 2000];
       • frequency queries [Cohen and Matias 2003];
       • associating objects with values [Chazelle et al. 2004].
     • Alternative data structure [Pagh et al. 2005].
     • Weighted Bloom filters [Bruck et al. 2006]:
       • optimal real-number solution with integer rounding;
       • analytically, the rounding-induced error increase is unbounded;
       • practically, the error increase can be substantial.

  13. Conclusions
     • Popularity conscious Bloom filters:
       • motivated by skewed, stable data popularity distributions;
       • customize each object's hash number according to its popularity in sets and queries.
     • Unrestricted real-number problem:
       • the solution is optimal when the object hash number is linear in log(query-pop'/set-pop).
     • Ranged integer problem:
       • query-pop'/set-pop serves as an object importance indicator;
       • $O(N^{k_{\max}})$-time 2-approximation;
       • $O(N^2/\varepsilon)$-time $(2+\varepsilon)$-approximation.
     • Quantitative evaluations driven by real distributed application traces.
