Counting Distinct Objects over Sliding Windows

Presented by: Muhammad Aamir Cheema
Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin
University of New South Wales, Australia

Introduction
Counting distinct objects:
• Given a dataset D, return the number of distinct objects in D.
Counting distinct objects over sliding windows:
• Given a data stream, return the number of distinct objects that arrive at or after timestamp t.
Applications:
• Traffic management, call centers, wireless communication, stock markets, etc.

Introduction
Approximate counting: let n be the actual number of distinct objects and n' the reported answer. Build a sketch such that every query is answered with the guarantee $|n - n'|/n \le \epsilon$ with confidence $1 - \delta$.
Contribution:
• FM-based algorithms
  • SE-FM: accuracy guarantee and space usage guarantee
  • PCSA-based algorithm: no accuracy guarantee (although accurate in practice), but more efficient
• k-Skyband: accuracy guarantee and efficient, but no space usage guarantee

FM Algorithm
FM sketch:
• Let h(x) be a uniform hash function.
• Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x).
• Let FM be an array of size k, initialized to zero.
• For each record x in the dataset:
  • FM[p(x)] = 1
• Let B = FMmin be the position of the leftmost 0-bit of FM.
• Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.
• Each bit i of h(x) is one with probability 1/2.
[Figure: example with k = 4 showing h(r1), h(r2), h(r3) and the resulting FM array, for which FMmin = 1]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
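The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 4-bit hash values are made up, positions are counted from 0 at the leftmost bit, and skipping an all-zero hash value is an assumption of this sketch.

```python
ALPHA = 1.2897385  # correction constant quoted on the slide


def pivot(h: int, k: int):
    """Position (0 = leftmost) of the first 1-bit in the k-bit value h.

    Assumption: an all-zero hash value has no 1-bit and yields None.
    """
    for pos in range(k):
        if (h >> (k - 1 - pos)) & 1:
            return pos
    return None


def build_fm(hash_values, k: int):
    """Set FM[pivot] = 1 for every hash value; duplicates have no extra effect."""
    fm = [0] * k
    for h in hash_values:
        p = pivot(h, k)
        if p is not None:
            fm[p] = 1
    return fm


def fm_min(fm) -> int:
    """Position of the leftmost 0 entry (len(fm) if every entry is 1)."""
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)


def estimate(fm) -> float:
    """alpha * 2^B, with B = FMmin."""
    return ALPHA * 2 ** fm_min(fm)


# Hypothetical 4-bit hash values for three records r1, r2, r3.
hashes = [0b1010, 0b0010, 0b1100]
fm = build_fm(hashes, k=4)
print(fm, fm_min(fm), estimate(fm))
```

Because a record only ever sets the bit selected by its own hash value, feeding the same records again leaves the sketch unchanged, which is what makes it count distinct objects.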

FM Algorithm
• Each bit i of h(x) is one with probability 1/2.
• A hash value h(x) whose first i bits are zero and whose (i+1)-th bit is one occurs with probability $1/2^{i+1}$.
Let n be the number of distinct elements:
• FM[0] is accessed approximately $n/2$ times.
• FM[1] is accessed approximately $n/4$ times.
• ...
• FM[i] is accessed approximately $n/2^{i+1}$ times.
• If $i \gg \log_2 n$, FM[i] will almost certainly be zero.
• If $i \ll \log_2 n$, FM[i] will almost certainly be one.
• If $i \approx \log_2 n$, FM[i] may be zero or one.
• Hence, the first i for which FM[i] is zero can be used to approximate the number of distinct elements n.
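The probability $1/2^{i+1}$ can be checked by brute force. The sketch below enumerates every 4-bit value once, which stands in for a perfectly uniform hash function, and counts how often each pivot position occurs. The choice k = 4 and the 0-based, leftmost-first numbering are assumptions made for illustration.

```python
from collections import Counter


def pivot(h: int, k: int):
    """Position (0 = leftmost) of the first 1-bit in the k-bit value h, or None if h == 0."""
    for pos in range(k):
        if (h >> (k - 1 - pos)) & 1:
            return pos
    return None


k = 4
# Enumerate every k-bit hash value once, i.e. a perfectly uniform hash.
counts = Counter(pivot(h, k) for h in range(2 ** k))

for i in range(k):
    # Fraction of hash values whose pivot is i, next to the predicted 1 / 2^(i+1).
    print(i, counts[i] / 2 ** k, 1 / 2 ** (i + 1))
```

Each position is hit half as often as the one before it, so with n distinct elements the positions well below $\log_2 n$ are almost surely set and those well above are almost surely not.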

FM Algorithm
Use r hash functions to create r FM sketches:
• Initialize each FM sketch to zero.
• For each record x in the dataset:
  • For each hash function $h_i(x)$:
    • $FM_i[\text{pivot}] = 1$
• Let $B_i$ be the position of the leftmost 0-bit of $FM_i$.
• $B = (B_1 + B_2 + \dots + B_r)/r$
• Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.
Example: with three sketches where $B_1 = 1$, $B_2 = 2$ and $B_3 = 2$, $B = (1 + 2 + 2)/3 \approx 1.67$.
Performance guarantee: let n be the actual number of distinct objects, n' the reported answer and m the size of the element domain. Then
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $n > 1/\epsilon$, $k = O(\log m + \log 1/\epsilon + \log 1/\delta)$ and $r = O(1/\epsilon^2 \log 1/\delta)$.
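A minimal sketch of the averaging step follows. The r hash functions are simulated with salted SHA-256, which is an assumption of this example and not something the slides specify; the defaults r = 64 and k = 32 are likewise arbitrary, and all-zero hash values are skipped.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def hash_bits(x: str, i: int, k: int) -> int:
    """Hypothetical i-th hash function: top k bits of SHA-256 over a salted key (k <= 64)."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - k)


def leftmost_one(h: int, k: int):
    """Pivot position (0 = leftmost), or None for an all-zero value (skipped)."""
    for pos in range(k):
        if (h >> (k - 1 - pos)) & 1:
            return pos
    return None


def leftmost_zero(fm) -> int:
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)


def combine(b_values) -> float:
    """Average the B_i and apply alpha * 2^B."""
    b = sum(b_values) / len(b_values)
    return ALPHA * 2 ** b


def fm_estimate(items, r: int = 64, k: int = 32) -> float:
    """Build r FM sketches over items and return the combined estimate."""
    sketches = [[0] * k for _ in range(r)]
    for x in items:
        for i in range(r):
            p = leftmost_one(hash_bits(x, i, k), k)
            if p is not None:
                sketches[i][p] = 1
    return combine([leftmost_zero(fm) for fm in sketches])


print(combine([1, 2, 2]))  # the slide's example, B = 5/3
print(fm_estimate([f"user{i}" for i in range(200)]))
```

Averaging the exponents $B_i$ rather than the individual estimates keeps a single unlucky sketch from dominating the result.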

FM-based Algorithm
Maintaining one FM sketch:
• For each record (x, t) in the dataset:
  • FM[pivot] = t
Answering a query:
• For any t, let B = FMmin(t) be the position of the leftmost entry of FM with value less than t.
• Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.
[Figure: example showing h(r1), h(r2), h(r3) and the resulting FM array of timestamps, for which FMmin(4) = 0]
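Storing timestamps instead of bits can be illustrated as follows. This is a sketch under stated assumptions: the (hash value, timestamp) pairs are made up, timestamps start at 1 and arrive in non-decreasing order, and an entry of 0 means "never hit".

```python
ALPHA = 1.2897385  # constant quoted on the slide


def pivot(h: int, k: int):
    """Position (0 = leftmost) of the first 1-bit in the k-bit value h, or None if h == 0."""
    for pos in range(k):
        if (h >> (k - 1 - pos)) & 1:
            return pos
    return None


class WindowFM:
    """One FM sketch whose entries hold the latest timestamp that hit them."""

    def __init__(self, k: int):
        self.k = k
        self.fm = [0] * k  # 0 = never hit; assumes real timestamps are >= 1

    def add(self, h: int, t: int) -> None:
        p = pivot(h, self.k)
        if p is not None:
            self.fm[p] = t  # assumes timestamps arrive in non-decreasing order

    def fm_min(self, t: int) -> int:
        """Leftmost position whose stored timestamp is older than t."""
        for pos, last_seen in enumerate(self.fm):
            if last_seen < t:
                return pos
        return self.k

    def estimate(self, t: int) -> float:
        return ALPHA * 2 ** self.fm_min(t)


sketch = WindowFM(k=4)
# Hypothetical (hash value, timestamp) pairs.
for h, t in [(0b1000, 1), (0b1010, 3), (0b0100, 5), (0b0010, 6)]:
    sketch.add(h, t)
print(sketch.fm, sketch.fm_min(4), sketch.fm_min(2))
```

An entry counts as "set" for a query at t only if some record hit it at or after t, so one array answers queries for every window start.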

FM-based Algorithm
Maintain r FM sketches:
• Initialize each FM sketch to zero.
• For each record (x, t) in the dataset:
  • For each hash function $h_i(x)$:
    • $FM_i[\text{pivot}] = t$
Answering a query:
• For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
• Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
• Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.
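Combining r timestamped sketches might look like the following. As before, the hash functions are simulated with salted SHA-256 and the defaults r = 32, k = 32 are arbitrary; both are assumptions of this sketch, as are the object names in the example stream.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def pivot_of(x: str, i: int, k: int):
    """Pivot of a hypothetical i-th hash function built from salted SHA-256 (k <= 64)."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    h = int.from_bytes(digest[:8], "big") >> (64 - k)
    for pos in range(k):
        if (h >> (k - 1 - pos)) & 1:
            return pos
    return None  # all-zero hash value: skipped (assumption)


class SlidingFM:
    """r FM sketches whose entries store the latest timestamp that hit them."""

    def __init__(self, r: int = 32, k: int = 32):
        self.r, self.k = r, k
        self.sketches = [[0] * k for _ in range(r)]  # 0 = never hit; timestamps >= 1

    def add(self, x: str, t: int) -> None:
        for i, fm in enumerate(self.sketches):
            p = pivot_of(x, i, self.k)
            if p is not None:
                fm[p] = t

    def estimate(self, t: int) -> float:
        """alpha * 2^B, with B the average of the B_i(t)."""
        total = 0
        for fm in self.sketches:
            total += next((pos for pos, seen in enumerate(fm) if seen < t), self.k)
        return ALPHA * 2 ** (total / self.r)


s = SlidingFM()
for t in range(1, 501):
    s.add(f"obj{t % 100}", t)  # 100 distinct objects, repeated over time
print(s.estimate(401), s.estimate(491))
```

A later window start can only turn "set" entries into "unset" ones, so the estimate never grows as t increases.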

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' the reported answer and m the size of the element domain. Then
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $n > 1/\epsilon$, $k = O(\log m + \log 1/\epsilon + \log 1/\delta)$ and $r = O(1/\epsilon^2 \log 1/\delta)$.
• Total space: $O(1/\epsilon^2 \log 1/\delta \log m)$
• Total maintenance cost for one record: $O(1/\epsilon^2 \log 1/\delta \log\log m)$
• Total query cost: $O(1/\epsilon^2 \log 1/\delta \log\log m)$

PCSA-based Algorithm
Maintain r FM sketches, but update only j < r sketches per record:
• Generate j hash functions H(x) that map x to [1, r].
• Initialize each FM sketch to zero.
• For each record (x, t) in the dataset:
  • For each of the j hash functions H():
    • i = H(x)
    • Update the i-th FM sketch.
Answering a query:
• For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
• Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
• Number of distinct elements that arrived at or after t = $(\alpha \cdot 2^{B})/j$, where $\alpha = 1.2897385$.
• Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
NOTE: No accuracy guarantee, but performs well in practice.
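The routing step, where each record updates only the sketches chosen by the j functions H, can be sketched as below. Several choices here are assumptions and not taken from the slides: salted SHA-256 stands in for all hash functions, sketch indices run from 0 to r - 1, and one shared pivot is used for all sketches a record touches. The final scaling is copied from the slide as written; the slide does not say whether a further factor is needed to account for each sketch seeing only part of the stream, so the returned number is illustrative only.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def _h(tag: str, x: str) -> int:
    """64-bit value from salted SHA-256; a hypothetical stand-in for the hash functions."""
    return int.from_bytes(hashlib.sha256(f"{tag}:{x}".encode()).digest()[:8], "big")


class PcsaWindow:
    def __init__(self, r: int = 16, j: int = 3, k: int = 32):
        self.r, self.j, self.k = r, j, k
        self.sketches = [[0] * k for _ in range(r)]  # 0 = never hit; timestamps >= 1

    def targets(self, x: str):
        """Indices (0-based) of the at most j sketches that record x updates."""
        return {_h(f"H{l}", x) % self.r for l in range(self.j)}

    def add(self, x: str, t: int) -> None:
        h = _h("pivot", x) >> (64 - self.k)
        p = next((pos for pos in range(self.k) if (h >> (self.k - 1 - pos)) & 1), None)
        if p is None:
            return  # all-zero hash value: skipped (assumption)
        for i in self.targets(x):
            self.sketches[i][p] = t

    def average_b(self, t: int) -> float:
        """Average over all r sketches of the leftmost position older than t."""
        total = sum(
            next((pos for pos, seen in enumerate(fm) if seen < t), self.k)
            for fm in self.sketches
        )
        return total / self.r

    def estimate_as_on_slide(self, t: int) -> float:
        """(alpha * 2^B) / j, exactly as the slide states it."""
        return ALPHA * 2 ** self.average_b(t) / self.j


p = PcsaWindow()
for t in range(1, 201):
    p.add(f"obj{t % 50}", t)  # 50 distinct objects, repeated over time
print(sorted(p.targets("obj7")), p.average_b(151))
```

The saving over the previous algorithm is in `add`: a record touches at most j sketches instead of all r.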

BJKST Algorithm
Main idea:
• Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
• For each record x, generate its hash value h(x).
• Maintain the k-th smallest distinct hash value, k_min.
• Estimated number of distinct elements: $n = k \cdot m^3 / k_{min}$
Improved algorithm:
• Use r hash functions.
• Compute $n_i$ for each hash function $h_i()$ as above.
• Report the median of the $n_i$ values as the final answer.
Performance guarantee:
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log 1/\delta)$.
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 2002.
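The estimator and the median step can be written compactly. In this sketch the hash values are supplied directly and `hash_range` plays the role of $m^3$; the concrete numbers are invented for illustration, and the whole input is processed in one batch where a streaming version would maintain only the k smallest values.

```python
import heapq
import statistics


def kth_smallest_distinct(hash_values, k: int):
    """k-th smallest distinct value, or None if fewer than k distinct values exist."""
    smallest = heapq.nsmallest(k, set(hash_values))
    return smallest[-1] if len(smallest) == k else None


def bjkst_estimate(hash_values, k: int, hash_range: int):
    """Estimate k * hash_range / k_min, where hash_range plays the role of m^3."""
    k_min = kth_smallest_distinct(hash_values, k)
    return None if k_min is None else k * hash_range / k_min


def median_estimate(per_function_hashes, k: int, hash_range: int) -> float:
    """Median of the estimates from r hash functions (each must see at least k distinct values)."""
    return statistics.median(
        bjkst_estimate(hs, k, hash_range) for hs in per_function_hashes
    )


# Hypothetical hash values in [1, 1000] produced by three hash functions.
runs = [
    [100, 250, 250, 400, 900],
    [200, 500],
    [125, 100, 640],
]
print([bjkst_estimate(hs, 2, 1000) for hs in runs], median_estimate(runs, 2, 1000))
```

Taking the median rather than the mean is what turns a constant success probability per hash function into the $1 - \delta$ guarantee with $r = O(\log 1/\delta)$ repetitions.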

K-Skyband Technique
Main idea:
• Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
• For each record (x, t'), generate h(x) and store the record (x, h(x), t').
Answering a query q(t):
• Retrieve all records (x, h(x), t') with timestamp t' ≥ t.
• Get the k-th smallest distinct hash value and apply the BJKST algorithm.
Limitation: requires storing all records.
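This store-everything baseline is easy to write down and is useful as a reference when checking a more compact structure. The stream below is hypothetical, and `hash_range` again stands in for $m^3$.

```python
import heapq


def window_estimate(records, t: int, k: int, hash_range: int):
    """records: iterable of (x, h, t_prime). Returns k * hash_range / k_min, or None."""
    in_window = {h for _, h, t_prime in records if t_prime >= t}
    smallest = heapq.nsmallest(k, in_window)
    if len(smallest) < k:
        return None  # fewer than k distinct hash values in the window
    return k * hash_range / smallest[-1]


# Hypothetical stream of (object, hash value, arrival time) with hash range 1000.
stream = [
    ("a", 200, 1), ("b", 900, 2), ("c", 400, 3),
    ("d", 700, 4), ("e", 500, 5), ("c", 400, 6),
]
print(window_estimate(stream, t=4, k=2, hash_range=1000))
```

Every query scans the full history, which is the cost the k-skyband is meant to avoid.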

K-Skyband Technique
• For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t.
• A record x dominates another record y if x arrives after y and has a smaller hash value.
• The k-skyband keeps only the records that are dominated by at most (k - 1) records.
Maintaining the k-skyband:
• Keep a counter for each record.
• When a new element (x, t) arrives, increment the counter of all records dominated by it.
• Remove the records whose counter is at least k.
• To improve efficiency, counters are incremented for groups of records at once (domination aggregation search tree).
[Figure: records a, b, c, d, e plotted by arrival time t and hash value h(x), with k = 2]
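The definition of the k-skyband can be illustrated with a deliberately simple structure. Note the swap: instead of per-record counters and the domination aggregation search tree, this sketch recomputes the domination counts on every arrival, which is quadratic in the skyband size but easy to verify. Keeping only the latest arrival time for a repeated hash value and requiring strictly increasing timestamps are assumptions of the sketch, and the stream is made up.

```python
class KSkyband:
    """Keeps the records dominated by fewer than k other records.

    Simplification: domination counts are recomputed on each arrival rather
    than maintained as counters, and a repeated hash value keeps only its
    latest arrival time. Timestamps are assumed to be strictly increasing.
    """

    def __init__(self, k: int):
        self.k = k
        self.latest = {}  # hash value -> latest arrival time

    def insert(self, h: int, t: int) -> None:
        self.latest[h] = t
        kept = {}
        for hy, ty in self.latest.items():
            # Records that arrived after y with a smaller hash value dominate y.
            dominators = sum(
                1 for hx, tx in self.latest.items() if tx > ty and hx < hy
            )
            if dominators < self.k:
                kept[hy] = ty
        self.latest = kept

    def records(self):
        """(hash value, arrival time) pairs ordered by arrival time."""
        return sorted(self.latest.items(), key=lambda rec: rec[1])


sb = KSkyband(k=2)
# Hypothetical (hash value, arrival time) stream.
for h, t in [(2, 1), (9, 2), (4, 3), (7, 4), (5, 5)]:
    sb.insert(h, t)
print(sb.records())
```

In this stream the record with hash value 9 is dropped once two later records with smaller hash values have arrived, since it can no longer be the 2nd smallest value of any window that contains it.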

K-Skyband Technique
Answering a query:
• Find k_min, the k-th smallest hash value among elements arriving no earlier than t.
• Let z be the number of stored elements that arrived before t.
• k_min is the (z + k)-th smallest hash value overall.
Algorithm:
• Maintain a binary search tree eT that stores the elements ordered by t.
• Maintain a binary search tree eH that stores the elements ordered by h(x).
• When a query q(t) arrives:
  • Compute z using eT.
  • Find the (z + k)-th smallest hash value overall in eH.
[Figure: records a to f plotted by arrival time t and hash value h(x); with k = 2 and z = 3, k_min is the 5th smallest h(x)]
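The (z + k) rule can be tried out on a small example. Here sorted Python lists stand in for the search trees eT and eH and are rebuilt on every call for brevity; a real implementation would keep them maintained. The stored records are a hypothetical 2-skyband, and `hash_range` stands in for $m^3$.

```python
import bisect


def k_min_in_window(skyband, k: int, t: int):
    """skyband: (hash value, arrival time) pairs currently stored.

    Returns the k-th smallest hash value among records with arrival time >= t,
    or None if the window holds fewer than k records.
    """
    e_t = sorted(time for _, time in skyband)  # stand-in for eT
    e_h = sorted(h for h, _ in skyband)        # stand-in for eH
    z = bisect.bisect_left(e_t, t)             # stored records that arrived before t
    rank = z + k                               # 1-based rank of k_min in e_h
    return e_h[rank - 1] if rank <= len(e_h) else None


def estimate(skyband, k: int, t: int, hash_range: int):
    """BJKST estimate k * hash_range / k_min for the window starting at t."""
    k_min = k_min_in_window(skyband, k, t)
    return None if k_min is None else k * hash_range / k_min


# Hypothetical 2-skyband: (hash value, arrival time).
skyband = [(2, 1), (4, 3), (7, 4), (5, 5)]
print(k_min_in_window(skyband, k=2, t=3), k_min_in_window(skyband, k=2, t=4))
```

The rule works because a stored record that arrived before t has fewer than k later records with smaller hash values, so its own hash value lies below k_min and shifts the rank of k_min by exactly one.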

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' the reported answer and m the size of the element domain. Then
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log 1/\delta)$.
• Expected total space: $O(1/\epsilon^2 \log 1/\delta \log n)$
• Expected time complexity: $O(\log 1/\delta \, (\log 1/\epsilon + \log n))$

Experiments
• Synthetic datasets following Uniform and Zipf distributions
• Real dataset: WorldCup 98 HTTP requests (20 M records)

Space Efficiency

Time Efficiency: Maintenance cost

Time Efficiency: Query response time

Accuracy

Thanks

• P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001. Space usage: $1/\epsilon^2 \log 1/\delta \cdot m^{1/2}$
• Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE, 2004. Space usage: $O(N/\epsilon^2 \log 1/\delta \log m)$

Space Requirement (SE-FM)
To guarantee the performance we require:
• $k = O(\log m + \log 1/\epsilon + \log 1/\delta)$
• $r = O(1/\epsilon^2 \log 1/\delta)$
Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$.
• Size of one sketch: $k = O(\log m)$
• Size of r sketches: $O(r \log m) = O(1/\epsilon^2 \log 1/\delta \log m)$
Total space: $O(1/\epsilon^2 \log 1/\delta \log m)$

Time Complexity (SE-FM)
To guarantee the performance we require:
• $k = O(\log m + \log 1/\epsilon + \log 1/\delta)$
• $r = O(1/\epsilon^2 \log 1/\delta)$
The elements in a sketch are stored in a min-heap to support logarithmic search and update:
• Hence, the cost of one search or update operation is $O(\log k) = O(\log\log m)$.
• To maintain the sketches, we update r sketches for each record x.
• Total maintenance cost for one record: $O(r \log\log m) = O(1/\epsilon^2 \log 1/\delta \log\log m)$
• To answer a query, we search in r sketches.
• Total query cost: $O(r \log\log m) = O(1/\epsilon^2 \log 1/\delta \log\log m)$

Space Usage (K-Skyband)
Performance guarantee:
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log 1/\delta)$.
• Expected size of one k-skyband: $O(k \ln(n/k))$
• Expected size of r k-skybands: $O(r k \log(n/k)) = O(1/\epsilon^2 \log 1/\delta \log n)$
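One way to see where the $k \ln(n/k)$ term comes from is the following back-of-the-envelope argument. It is a sketch, not the authors' proof, and it assumes that each of the n distinct objects arrives once and that hash values are independent and uniform, so every relative order of hash values is equally likely. Under these assumptions the i-th most recent record stays in the k-skyband exactly when its hash value is among the k smallest of the i most recent records.

```latex
\Pr[\text{$i$-th most recent record is kept}] = \min\!\left(1, \frac{k}{i}\right)

\mathbb{E}[\text{size}]
  = \sum_{i=1}^{n} \min\!\left(1, \frac{k}{i}\right)
  = k + k \sum_{i=k+1}^{n} \frac{1}{i}
  \le k + k \int_{k}^{n} \frac{dx}{x}
  = k \left(1 + \ln\frac{n}{k}\right)
```

This is $O(k \ln(n/k))$ once n is at least a constant factor larger than k.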

Time Complexity (K-Skyband)
Performance guarantee:
$P(|n' - n|/n \le \epsilon) \ge 1 - \delta$
if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log 1/\delta)$.
Answering a query q(t):
• Search eT to compute z: $O(\log(k \log n)) = O(\log k + \log n)$
• Search eH to find the (z + k)-th element: $O(\log k + \log n)$
• We require this for all r sketches: $O(r (\log k + \log n)) = O(\log 1/\delta \, (\log 1/\epsilon + \log n))$
