
Range-Efficient Counting of Distinct Elements

Range-Efficient Counting of Distinct Elements. Srikanta Tirthapura, Iowa State University (joint work with Phillip Gibbons and Aduri Pavan). Example: for the stream [100,200], [0,10], [60,120], [5,25], F0 = |[0,25] ∪ [60,200]| = 167.



Presentation Transcript


  1. Range-Efficient Counting of Distinct Elements. Srikanta Tirthapura, Iowa State University (joint work with Phillip Gibbons and Aduri Pavan)

  2. Range-Efficient F0 • Stream: [100,200], [0,10], [60,120], [5,25] • F0 = |[0,25] ∪ [60,200]| = 167 [Figure: the four ranges drawn on a number line with endpoints 0, 5, 10, 25, 60, 100, 120, 200] IIT Kanpur Streams Workshop

  3. Range-Efficient F0 • Input Stream: Sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri <= n and li, ri are integers • Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0) • Constraints: single pass through the data; small workspace; fast processing time
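
The problem statement above can be pinned down with an exact, offline baseline. The sketch below is not the streaming algorithm from the talk: it stores every range and sorts them, so it ignores the single-pass and small-workspace constraints. The function name `exact_range_f0` is illustrative only.

```python
from typing import Iterable, List, Tuple


def exact_range_f0(ranges: Iterable[Tuple[int, int]]) -> int:
    """Count the distinct integers covered by a collection of closed ranges."""
    merged: List[List[int]] = []
    for lo, hi in sorted(ranges):
        # Ranges that overlap or touch the last merged range extend it.
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return sum(hi - lo + 1 for lo, hi in merged)


stream = [(100, 200), (0, 10), (60, 120), (5, 25)]
print(exact_range_f0(stream))  # 167, matching the example on slide 2
```

A baseline like this is mainly useful for checking the output of an approximate streaming implementation on small inputs.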

  4. Reductions to Range-Efficient F0 • Duplicate-Insensitive Sum, Max-Dominance Norm, and Counting Triangles in Graphs each reduce to Range-Efficient F0

  5. Duplicate-Insensitive Sum • Problem: Sum of all distinct elements in a stream of integers • Input Stream: Sequence of integers S = a1, a2, …, an • Output: Σ ai, taken over the distinct ai in S • Example: S = 4, 5, 15, 4, 100, 4, 16, 15; Distinct Elements = 4, 5, 15, 100, 16; Sum = 140

  6. Reduction from Dup-Insensitive Sum to F0 • Stream from U = [0, m-1] (Duplicate-Insensitive Sum) → Alternate Stream from U' = [0, m^2-1] (Number of Distinct Elements)
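
The slide gives only the two universes, not the mapping between them. One mapping consistent with U' = [0, m^2-1] is sketched below as an assumption: each integer a in [0, m-1] becomes the closed range [a·m, a·m + a - 1], which contains exactly a integers and is disjoint from the range of any other value. The names `element_to_range` and `dup_insensitive_sum_via_ranges` are hypothetical, and a plain Python set stands in for a range-efficient F0 algorithm.

```python
def element_to_range(a: int, m: int) -> tuple:
    """Map a in [0, m-1] to a closed range of exactly `a` integers in [0, m*m - 1]."""
    if not 0 <= a < m:
        raise ValueError("element outside the universe [0, m-1]")
    return (a * m, a * m + a - 1)  # empty range (hi < lo) when a == 0


def dup_insensitive_sum_via_ranges(stream, m: int) -> int:
    """F0 of the range stream; equals the sum of the distinct stream elements."""
    covered = set()  # stand-in for a range-efficient F0 sketch
    for a in stream:
        lo, hi = element_to_range(a, m)
        covered.update(range(lo, hi + 1))
    return len(covered)


S = [4, 5, 15, 4, 100, 4, 16, 15]
print(dup_insensitive_sum_via_ranges(S, m=101))  # 140
print(sum(set(S)))                               # 140, direct computation
```

Repeated elements generate the same range again, so they add nothing to the union, which is what makes the sum duplicate-insensitive.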

  7. Max Dominance Norm • Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n: a1,1 a1,2 … a1,m; a2,1 a2,2 … a2,m; …; ak,1 ak,2 … ak,m • Return Σ (j = 1 to m) max (1 ≤ i ≤ k) ai,j

  8. Reduction From Max Dominance Norm • Input stream I, output stream O: F0 of Output Stream = Dominance Norm of Input Stream • Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn] • When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j] • Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
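
A small sketch of this reduction follows. It assumes that position j owns the block [(j-1)n+1, jn] and that a value v at position j is mapped to the first v integers of that block, so the union within a block has size equal to the maximum value seen at that position. The function names are illustrative, and a Python set again stands in for a range-efficient F0 algorithm.

```python
def dominance_to_range(j: int, value: int, n: int) -> tuple:
    """Position j (1-based) and value in [1, n] -> closed range of `value` integers in block j."""
    base = (j - 1) * n
    return (base + 1, base + value)


def max_dominance_via_ranges(items, n: int) -> int:
    """items: iterable of (i, j, value) triples in arbitrary arrival order."""
    covered = set()  # stand-in for a range-efficient F0 sketch
    for _i, j, value in items:
        lo, hi = dominance_to_range(j, value, n)
        covered.update(range(lo, hi + 1))
    return len(covered)


# k = 2 streams, m = 3 positions, values in [1, 10]
a = [[3, 9, 1],
     [7, 2, 1]]
items = [(i, j, a[i - 1][j - 1]) for i in (1, 2) for j in (1, 2, 3)]
print(max_dominance_via_ranges(items, n=10))  # 17
print(sum(max(col) for col in zip(*a)))       # 17 = 7 + 9 + 1
```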

  9. Talk Outline • Range Efficient F0 • Reductions Among Data Stream Problems • Algorithm for Range Efficient F0 (building on distinct sampling) • Update Streams • Open Questions

  10. Counting Distinct Elements (F0) • Example: How many different users accessed my website today? Stream = 1,1,2,3,4,1,2, so F0 = 4 • Numerous applications in databases and networking • Prior Work: Flajolet-Martin (1985); Alon, Matias and Szegedy (1996); Gibbons and Tirthapura (2001); Bar-Yossef et al. (2002) (currently most space-efficient); Indyk-Woodruff (2003) (lower bounds)

  11. Range-Efficient F0 (Pavan and Tirthapura) = Distinct Sampling Algorithm for F0 + Range Sampling for 2-way Independent Hash Functions

  12. Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001) • D = Distinct Elements in Stream • S0 = U = {1,2,3,…,n} • S1 = {2,4,7,…}: each element of S0 kept with p = 1/2; the sample at this level is D ∩ S1 • S2 = {4,7,11,…}: each element of S1 kept with p = 1/2; the sample at this level is D ∩ S2 • S0, S1, S2, … are stored implicitly using hash functions

  13. Distinct Sampling • Sample = {}, p = 1 • Target Workspace = 4 numbers

  14. Distinct Sampling • Sample = {5}, p = 1 • Target Workspace = 4 numbers

  15. Distinct Sampling • Sample = {5,3}, p = 1 • Target Workspace = 4 numbers

  16. Distinct Sampling • Sample = {5,3,7}, p = 1 • Target Workspace = 4 numbers

  17. Distinct Sampling • Sample = {5,3,7}, p = 1 • Target Workspace = 4 numbers

  18. Distinct Sampling • Sample = {5,3,7,6}, p = 1 • Target Workspace = 4 numbers

  19. Distinct Sampling • Sample = {5,3,7,6,8}, p = 1 • Overflow: Sample = Sample ∩ S1 • Sample = {3,6,8}, p = ½ • Target Workspace = 4 numbers

  20. Distinct Sampling • Sample = {3,6,8,9}, p = ½ • Target Workspace = 4 numbers

  21. Distinct Sampling • Same decision for both • Sample = {3,6,8,9}, p = ½ • Target Workspace = 4 numbers

  22. Distinct Sampling • Sample = {3,6,8,9,2}, p = ½ • Overflow: Sample = Sample ∩ S2 • Sample = {6,9}, p = ¼ • Target Workspace = 4 numbers

  23. Distinct Sampling • Finally, Sample = {6,9}, p = ¼ • Estimate of F0 = (Sample Size)(4) = 8
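
The walkthrough above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DistinctSampler`, the default prime 2^31 - 1, and the thresholds v_i = floor(p / 2^i) - 1 are all choices made here, and the multiplier a is drawn from [1, p-1] to avoid the degenerate case a = 0. Elements are assumed to be integers in [0, p-1].

```python
import random


class DistinctSampler:
    """Keep D ∩ S_level, where x is in S_i iff h(x) <= floor(p / 2**i) - 1."""

    def __init__(self, capacity: int, prime: int = 2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.capacity = capacity
        self.p = prime
        self.a = rng.randrange(1, prime)  # pairwise independent hash h(x) = (a*x + b) mod p
        self.b = rng.randrange(0, prime)
        self.level = 0                    # sampling probability is 2**(-level)
        self.sample = set()

    def _in_level(self, x: int, level: int) -> bool:
        return (self.a * x + self.b) % self.p <= (self.p >> level) - 1

    def add(self, x: int) -> None:
        if not self._in_level(x, self.level):
            return
        self.sample.add(x)  # a duplicate gets the same decision and changes nothing
        while len(self.sample) > self.capacity:
            # Overflow: move to the next level and keep Sample ∩ S_level.
            self.level += 1
            self.sample = {y for y in self.sample if self._in_level(y, self.level)}

    def estimate(self) -> int:
        return len(self.sample) * 2 ** self.level


sampler = DistinctSampler(capacity=4, seed=0)
for x in [1, 1, 2, 3, 4, 1, 2]:
    sampler.add(x)
print(sampler.estimate())  # 4: the sample never overflows, so the count is exact
```

Because membership in each level depends only on the hash value, every occurrence of an element is treated identically, which is what separates this from sampling with independent coin tosses.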

  24. Counting Distinct Elements • Finally, return a sample of distinct elements of the stream of a “large enough” size • If target workspace = O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation • Hash functions need only be pairwise independent and can be stored in small space

  25. Sampling Using Independent Coin Tosses • Distinct Sampling Using Hash Functions [Figure: a hash function producing the outcomes 0 1 0 0 0 1]

  26. Adaptive Sampling for Range-Efficient F0 • Naïve Approach: Given range [x,y], successively insert {x, x+1, …, y} into the F0 sampling algorithm • Problem: Time per range is very large • Range-Sampling: Given stream element [p,q], how to sample all elements in [p,q] quickly? • At sampling level i, quickly compute |[p,q] ∩ Si|

  27. Hash Functions, and S0, S1, S2, … • h(x) = (ax+b) mod p, with p prime and a, b random in [0, p-1] • If h(x) ∈ [0, vi], then x ∈ Si [Figure: h maps the domain [1, n] into [0, p-1], with thresholds v1, v2, v3 marked]

  28. Range Sampling • f(x) = (ax+b) mod p • Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}| [Figure: the range [x1, x2] within [1, n] mapped by f into [0, p-1], with threshold v marked]

  29. Arithmetic Progression • f(x) = (ax+b) mod p • f(x1), f(x1+1), … form an arithmetic progression modulo p with Common Difference = a [Figure: f(x1) and f(x1+1) on the modulo p circle [0, p-1], with threshold v marked]

  30. Low and High Revolutions • Each revolution, the number of hits on [0,v] is floor(v/a) (low rev) or floor(v/a)+1 (high rev) • Task: Count the number of low and high revolutions

  31. Starting Points of Revolutions • Can find r = v mod a such that: if the starting point of a revolution is in [0,r], then it is a high revolution; else it is a low revolution • Task: Count the number of revolutions with starting point in [0,r]

  32. Recursive Algorithm • Observation: Starting Points form an Arithmetic Progression with difference -(p mod a) on the modulo a circle [Figure: the modulo p circle [0, p-1] and the modulo a circle [0, a-1], with r marked on each]
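
The claims on slides 30 to 32 can be checked by brute force on small parameters. The sketch below assumes 0 < a < p and 0 ≤ v < p, treats a revolution as a maximal run of the progression between wrap-arounds, and takes the high/low threshold to be r = v mod a, which is what the counting argument gives for starting points in [0, a-1]. The function names are illustrative.

```python
def complete_revolutions(start: int, a: int, p: int, steps: int) -> list:
    """Split start, start+a, start+2a, ... (mod p) into runs between wrap-arounds."""
    runs, current = [], [start % p]
    for _ in range(steps - 1):
        nxt = current[-1] + a
        if nxt >= p:
            runs.append(current)      # this revolution is finished
            current = [nxt - p]       # the next one starts in [0, a-1]
        else:
            current.append(nxt)
    return runs[1:]  # drop the first run, which may start mid-revolution


def check_revolution_claims(a: int, p: int, v: int, start: int = 0, steps: int = 1000) -> bool:
    runs = complete_revolutions(start, a, p, steps)
    q, r = divmod(v, a)
    for run, following in zip(runs, runs[1:] + [None]):
        s = run[0]
        hits = sum(1 for y in run if y <= v)
        # High revolution (q + 1 hits) iff the starting point lies in [0, r].
        if hits != q + (1 if s <= r else 0):
            return False
        # Starting points move by -(p mod a) on the modulo-a circle.
        if following is not None and following[0] != (s - p) % a:
            return False
    return True


print(check_revolution_claims(a=7, p=101, v=30, start=50))  # True
```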

  33. Recursive Algorithm • Focus on the common difference • Two Reductions Possible: from Common Difference a to Common Difference (p mod a), or to Common Difference a - (p mod a) • At least one of the two new common differences is at most a/2

  34. Range Sampling • Theorem: There is an algorithm for sampling range [x,y] using 2-way independent hash functions with Time Complexity O(log (y-x)) and Space Complexity O(log (y-x) + log m) • Plug back into distinct sampling to get the range-efficient F0 algorithm
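
For experimentation, the quantity |{x ∈ [x1,x2] : (ax+b) mod p ∈ [0,v]}| can also be computed in logarithmically many steps with a different, well-known technique: a Euclidean-style "floor sum" recursion. The sketch below is a stand-in and not the revolution-counting recursion described in the talk. It relies on the identity [(y mod p) ≤ v] = floor(y/p) - floor((y + p - v - 1)/p) + 1 for y ≥ 0 and 0 ≤ v < p. The function names are illustrative.

```python
def floor_sum(n: int, m: int, a: int, b: int) -> int:
    """Return sum over i in [0, n-1] of floor((a*i + b) / m), for n >= 0, m >= 1, a, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def range_sample_count(x1: int, x2: int, a: int, b: int, p: int, v: int) -> int:
    """|{x in [x1, x2] : (a*x + b) mod p <= v}|, assuming 0 <= x1 <= x2, a, b >= 0, 0 <= v < p."""
    n = x2 - x1 + 1
    b0 = a * x1 + b  # value of a*x + b at x = x1, before reduction mod p
    return floor_sum(n, p, a, b0) - floor_sum(n, p, a, b0 + p - v - 1) + n


# Compare against direct enumeration on a small instance.
a, b, p, v = 37, 11, 101, 24
brute = sum(1 for x in range(10, 501) if (a * x + b) % p <= v)
print(range_sample_count(10, 500, a, b, p, v) == brute)  # True
```

Plugged into a distinct sampler, a routine like this answers "how many elements of the incoming range fall in the current sampling level" without enumerating the range.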

  35. Results • Input Stream: Sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri < n and li, ri are integers • Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]| • Randomized (ε,δ)-Approximation Algorithm for Range-efficient F0 of a data stream • Processing Time (n is the size of the universe): amortized processing time per interval is O(log(1/δ) log(n/ε)); time to answer a query for F0 is a constant • Workspace: O((1/ε²) log(1/δ) log n) • Pavan, Tirthapura, SICOMP (to appear)

  36. Prior Work • Bar-Yossef, Kumar, Sivakumar 2002: first studied range-efficient F0; algorithms with higher space complexity • Cormode, Muthukrishnan 2003: Max-dominance Norm • Nath, Gibbons, Seshan, Anderson 2004: Duplicate-insensitive Sum assuming ideal hash functions

  37. Comparison

  38. Other Applications of Distinct Sampling • Sample of distinct elements of the stream of any desired target size • Approximate median of all distinct elements in stream (duplicate insensitive median) • Distinct Frequent elements (“heavy hitters” in network monitoring)

  39. Update Streams • Insertions and deletions of elements into the stream: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), … • Distinct Elements Problem: How many elements have a positive cumulative weight? • Assume a “sanity constraint”: no element has weight less than 0 • The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger
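
To make the update-stream question concrete, here is an exact baseline that keeps one counter per element. It is only a reference point: it uses space proportional to the number of elements seen, which a small-space streaming algorithm is meant to avoid. The function name `distinct_positive` is illustrative.

```python
from collections import Counter


def distinct_positive(updates) -> int:
    """Count elements whose cumulative weight is positive after all updates."""
    weights = Counter()
    for x, delta in updates:
        weights[x] += delta
        if weights[x] < 0:
            # The "sanity constraint": no element may drop below weight 0.
            raise ValueError(f"element {x} has negative cumulative weight")
    return sum(1 for w in weights.values() if w > 0)


updates = [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]
print(distinct_positive(updates))  # 2: elements 7 and 4 remain; 11 was fully deleted
```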

  40. Distinct Sampling on Update Streams (three independent approaches) • Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003, followed up by Ganguly, 2005 and Ganguly, Majumder 2006 • Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005 • Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005

  41. Distinct Elements on Update Streams • Use of K-Set Structure in storing samples • Ganguly, Garofalakis, Rastogi 2003; Ganguly 2005; Ganguly, Majumder 2006

  42. K-Set Structure • Small space data structure for multi-set S (size Õ(K)) • Operations: Insert (x,v) into S; Delete (x,v’) from S; Membership Query (is x in S?); what is the number of distinct elements in S? • If |S| ≤ K, then queries are answered correctly [Figure labels: K, Active, Silent, Active]

  43. Counting Distinct Elements on Update Streams • Sample the stream at different probabilities: 1, ½, ¼, … • Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k • When queried, use the highest probability sample that hasn’t overflowed yet

  44. Distributed Streams • Alice (Workspace = $$) observes Stream A = 11 54 21 11 2 45 21 1 … and sends Sketch(A) to the Referee • Bob (Workspace = $$) observes Stream B = 1 5 21 2 54 21 35 … and sends Sketch(B) to the Referee • Referee: Compute Dup-Ins-Sum(A,B)

  45. Summary • Distinct Sampling, extended to: Range-Efficiency (range-sampling); Update Streams (k-set structure); Sliding Windows (multiple samples)

  46. Open Questions • Can we efficiently handle higher-dimensional ranges? • Klee’s measure problem in streaming model

  47. Open Questions • Range-Efficient F0 under update streams • Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk
