Flexible Approximate Counting

Flexible Approximate Counting. Scott A. Mitchell, and David M. Day Sandia National Laboratories Scott – presenter IDEAS’11 15th International Database Engineering & Applications Symposium.


Presentation Transcript


  1. Flexible Approximate Counting • Scott A. Mitchell and David M. Day, Sandia National Laboratories • Scott – presenter • IDEAS’11, 15th International Database Engineering & Applications Symposium • Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Outline • What is approximate counting? • What’s new? • Functional form • Increment decision strategies • Speed it up! • Random number and bit generators • Inverse problem • Find function given how high you want to count (Focus on red since that’s what’s significant)

  3. What is approximate counting? • Approximate counter C • Trade decreased memory for decreased accuracy • Standard (unsigned) integer or bit field, but C represents some bigger number N • Normal integers use log2(N) bits to represent 0..N • Counter C can use log2(log2(N)) bits to represent 1..log2(N) • Accurate to within a factor of 2 • “Count” to 2^(2^8) using 8 bits • N = φ(C) function • Count using only the exponent • (Figure: unary → binary → floating point, with example bit patterns 100, 110, 100110)

  4. What is approximate counting? • Count occurrences of datastream objects, e.g. pairs of IP addresses • Problem • Object arrives, decide whether to increment • N+1 = ? if you only stored C? • C=4, N=16. Choose 16+1 = 16 or 16+1 = 32 • Solution • Coin flipping: 16+1 = 32 with probability p = 1/(32-16) • Flajolet papers prove expected value and error are reasonable, 1985-2004+ • Two sources of error • Unavoidable: intermediate numbers not representable. Constant-factor approximation. • Datastream: can’t view all the data at once, random decisions. Expected error bounds.
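The coin-flipping rule on this slide can be sketched as follows. This is a minimal illustration, not the presenters' code: the names `phi` and `maybe_increment` are hypothetical, and φ(C) = 2^C is assumed because it matches the slide's example (C=4, N=16). The increment probability is 1/(φ(C+1) − φ(C)), so each arrival raises the represented value by exactly 1 in expectation.

```python
import random

def phi(c):
    """Represented value N for counter value C (assumed base-2 form, phi(C) = 2**C)."""
    return 2 ** c

def maybe_increment(c, rng):
    """Return c + 1 with probability 1 / (phi(c + 1) - phi(c)), otherwise c."""
    delta = phi(c + 1) - phi(c)          # size of the jump in represented value
    if rng.random() < 1.0 / delta:       # rng.random() lies in [0, 1)
        return c + 1
    return c

def run_stream(n_arrivals, rng):
    """Feed n_arrivals objects to one counter that starts at C = 0 (N = 1)."""
    c = 0
    for _ in range(n_arrivals):
        c = maybe_increment(c, rng)
    return c
```

For C=4 the jump is 32 − 16 = 16, so the counter moves to 5 with probability 1/16, as on the slide.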

  5. Motivation • Old idea (memory-accuracy) with some new uses • Morris 1978, one small register on a CPU • Today big data, lots of counters • Data-summarization • Approximate Counting useful by itself, for counting all objects • Database merge • Choose most efficient algorithm, pre-allocate memory • May be combined with other techniques • Bloom filters • Replace 1-bit with a small counter, Van Durme & Lall 2009 • Spread counter into multiple bits of a Bloom filter, Talbot 2009 • Vary the number of bits for skewed data

  6. Generalize Function: q-ary counting and Floating Point AC • ΔN = 2^C. Why base 2? • p = 2^-C. Use fast random-bit source for increment decisions • Csűrös 2010 • Treat counter as binary-exponent floating point number • Exponent gives powers-of-two increment probabilities • Significand gives better accuracy than base 2 • Stair-step approximation to “q-ary” counting • I.e. restricted to 9 choices for 8-bit counters • First contribution: get these advantages without these restrictions • (Figure: 8-bit counter 0100 0110 split into an (8-d)-bit exponent and a d-bit significand)

  7. Our Flexible AC • Flexible AC • Perfect counting below a threshold T, then • ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value • a small (<2) since 255 = log2(5.7e76) • Round ΔN to integer • Still get prior speedups • Round all ΔN to powers-of-two • If speed(RandomBit) < ½ speed(RandomNumber)
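The functional form on this slide can be tabulated once per choice of a and T. The sketch below assumes that "perfect counting below T" means ΔN = 1 for C < T, and that rounding uses Python's built-in `round`; the names `delta_n` and `phi_table` are hypothetical and not taken from the talk.

```python
def delta_n(c, a, t):
    """Increment in represented value when the counter moves from c to c + 1."""
    if c < t:
        return 1                          # perfect counting below the threshold T
    return max(1, round(a ** (c - t)))    # rounded to an integer, never below 1

def phi_table(a, t, c_max=255):
    """Return [phi(0), ..., phi(c_max)], the running sum of the increments."""
    table = [0]
    for c in range(c_max):
        table.append(table[-1] + delta_n(c, a, t))
    return table

def increment_probability(c, a, t):
    """Probability of incrementing the counter at value c, p = 1 / delta_n."""
    return 1.0 / delta_n(c, a, t)
```

Calling `phi_table(a, t)[255]` gives the largest value an 8-bit counter can represent for that a and T, which is the quantity the inverse problem later in the talk starts from.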

  8. Random Bit Generator • Many well-tested random number generators • Fewer random bit generators • Knuth vol. 2 eq 10: very simple (fast!) • A = 0x0102010081010101 // 64-bit constant • X = X << 1 // shift left • If overflow, X = X xor A • RandomBit = X & 1 // lowest bit of X • A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003 • Every length-64 bit-sequence occurs once before repetition • Consider accuracy in terms of intended use. What matters for our application: • k one-bits in a row occurs 1 in 2^k times • Generated 2^47 bits; 42 one-bits in a row occurs 1 in 2^42 times, verified experimentally
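The shift-and-xor generator on this slide can be written out as below. This is a sketch under stated assumptions: the constant is copied from the slide with a `0x` prefix and its primitivity is not verified here, Python integers are masked to 64 bits to imitate register overflow, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
A = 0x0102010081010101   # constant as printed on the slide; primitivity not checked here

def next_state(x):
    """One generator step: shift left, and xor with A if a one-bit was shifted out."""
    overflow = x >> 63               # top bit before the shift
    x = (x << 1) & MASK64            # keep only the low 64 bits
    if overflow:
        x ^= A
    return x

def random_bits(seed, n):
    """Return a list of n bits, each the lowest bit of the state after a step."""
    x = seed & MASK64
    if x == 0:
        raise ValueError("seed must be nonzero: the zero state never changes")
    bits = []
    for _ in range(n):
        x = next_state(x)
        bits.append(x & 1)
    return bits
```

A seed with few one-bits yields a long initial run of zeros (for example, seed 1 gives 63 zeros before the first one), so a practical use would presumably start from a well-mixed seed or discard the first outputs.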

  9. Speed Comparison • If this is embedded in a datastream application, speed may be important. • Random number generator is the bottleneck (goal is incrementing a counter!) • Option 1: if RandNumber < p, increment // p = 2^-k • Option 2: if k RandomBits in a row are one, increment
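The two increment tests compared on this slide can be sketched side by side. Python's `random.Random` stands in for both generators here, which is an assumption made for illustration only; the talk's speed comparison concerns a dedicated random-bit generator, and the function names are hypothetical.

```python
import random

def increment_with_number(k, rng):
    """Return True with probability 2**-k using one random number."""
    return rng.random() < 2.0 ** -k

def increment_with_bits(k, rng):
    """Return True with probability 2**-k: all k random bits must be one."""
    for _ in range(k):
        if rng.getrandbits(1) == 0:
            return False             # stop early at the first zero bit
    return True
```

The bit version stops at the first zero bit, so on average it draws fewer than two bits per decision regardless of k.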

  10. Random Countdown Speedup • Why generate a random number every time? • Set countdown counter P • P = number of times in a row RandNumber > p [no increment] • This is the definition of a geometric distribution • Need one countdown counter per counter value (1..255), not per counter (billions) • Calculating P is (relatively) very expensive • Fast on average if P is large, i.e. p is small • Hybrid algorithm • RandNumber < p, or RandomBit, for small P • Random Countdown for large P • “small” means <10 or <22
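A random countdown can be drawn with one logarithm per increment instead of one random number per arrival. The sketch below uses the standard inverse-transform formula for a geometric distribution; that choice, and the names `draw_countdown` and `count_increments`, are assumptions for illustration rather than the presenters' implementation.

```python
import math
import random

def draw_countdown(p, rng):
    """Number of consecutive non-increment arrivals before the next increment."""
    if p >= 1.0:
        return 0                          # always increment
    u = 1.0 - rng.random()                # uniform in (0, 1], so log(u) is finite
    return int(math.log(u) / math.log1p(-p))

def count_increments(n_arrivals, p, rng):
    """Process n_arrivals, drawing a fresh countdown only after each increment."""
    increments = 0
    countdown = draw_countdown(p, rng)
    for _ in range(n_arrivals):
        if countdown == 0:
            increments += 1
            countdown = draw_countdown(p, rng)
        else:
            countdown -= 1
    return increments
```

Each arrival costs only a decrement and a comparison; the expensive draw happens about once per 1/p arrivals.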

  11. Fixed Countdown Speedup • Why generate a random number at all? • Increment “1 in Δφ” times deterministically • Slightly different value to get correct expected value • Best possible accuracy if only one item • Fastest • Relies on randomness of stream • E.g. alternating items give bad counts
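A deterministic countdown shared by all counters holding the same value might look like the sketch below. The class name, the dictionary layout, and the `delta_of` callback are hypothetical, and the slide's "slightly different value" correction to the expected value is not modeled.

```python
from collections import defaultdict

class FixedCountdownCounters:
    """Approximate counters with one deterministic countdown per counter value."""

    def __init__(self, delta_of):
        self.delta_of = delta_of            # delta_of(c): arrivals per increment at value c
        self.counters = defaultdict(int)    # item key -> small counter value C
        self.remaining = {}                 # counter value C -> arrivals left until increment

    def arrive(self, key):
        c = self.counters[key]
        left = self.remaining.get(c, self.delta_of(c)) - 1
        if left == 0:
            self.counters[key] = c + 1      # this arrival receives the increment
            left = self.delta_of(c)         # restart the countdown for value c
        self.remaining[c] = left
```

With a single item and `delta_of = lambda c: 2 ** c`, the counter reaches 3 after exactly 1 + 2 + 4 = 7 arrivals. With two strictly alternating items and a constant Δφ of 2, one item initially receives the shared increment while the other is left behind, which illustrates why the method depends on the stream being sufficiently random.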

  12. Punchline • Speed: RandomCount, FixedCount • RandomCount = 1.5x FixedCount for Δφ=255 • RandomCount = ¼x RandomBit for Δφ=172

  13. How High Do You Want to Count? Inverse problem (David M. Day) • Find a, never discussed in approximate counting literature • For some applications, determine by hand ahead of time • Our run-time solution • Inverse geometric sum (const) • Find root > 1 for r(a) • Initial guess depends on s compared to K, i.e. a^(K+1) vs. s·a vs. (s-1) • Tricky case
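The slide's equations did not survive extraction, so the following is a reconstruction under assumptions. If the target is a geometric sum 1 + a + ... + a^K = s, then multiplying by (a − 1) gives r(a) = a^(K+1) − s·a + (s − 1) = 0, whose three terms match the ones named on the slide. The sketch solves for the root a > 1 by bisection on the sum itself, which is a swapped-in method and not the presenters' initial-guess scheme; `geometric_sum` and `solve_base` are hypothetical names.

```python
def geometric_sum(a, k):
    """Return 1 + a + a**2 + ... + a**k."""
    return sum(a ** j for j in range(k + 1))

def solve_base(s, k, tol=1e-12):
    """Find a > 1 with geometric_sum(a, k) == s by bisection (requires s > k + 1)."""
    if s <= k + 1:
        raise ValueError("no root above 1: the sum at a = 1 is already k + 1")
    lo, hi = 1.0, 2.0
    while geometric_sum(hi, k) < s:       # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, k) < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The sum is increasing in a for a > 1, so the bracket always contains exactly one root. Very large s combined with large K could overflow floating point while the bracket is grown; that case is not handled here.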

  14. Inverse Problem Alternatives • We’re only approximately counting, • So accuracy may not be important • We only calculate function once, • So efficiency may not be important (application dependent) • Use the initial guesses • Use binary search or lookup table • Use N=φ(C) function with easier inverse • E.g. exponential + linear function, but increments are too small for small C

  15. Conclusion • Flexible Approximate Counting provides • Customization of functional form • At run-time, for maximum value to count to • Fast decisions of whether to increment • If datastream is sufficiently random • Use fixed countdown • Else • Switch to random countdown for large increments • If speed is more important than accuracy for small increments • Use random bits and power-of-two increments • Random generator accuracy limits • Consider the intended use • RandNumber: min r such that probability(u<r) ≈ r • RandomBit: max k such that probability(k one-bits in a row) ≈ 2^-k • Thank you • Have a safe trip home
