Maintaining Stream Statistics Over Sliding Windows
Maintaining Stream Statistics Over Sliding Windows Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani Presentation by Adam Morrison.
Sliding Window Intro • Infinite stream of elements. • Only the last N elements are relevant. • Motivating example: packet streams. • N is huge, so storing the window explicitly is infeasible. • A stronger model follows…
Model • Memory is counted in bits. • The algorithm is online: each element is seen once, in order. (Figure: elements arrive one per time step; each arrival is tagged with a timestamp.)
Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else
Basic Counting • Exact solution? (A single counter?) A counter fails because we cannot tell which 1s expire. • An exact solution requires Ω(N) bits.
Approximate Basic Counting • Solution: approximate the answer and bound the relative error. (Figure: with ε = 0.05 and a true count of 100, any answer in [95, 105] is acceptable.)
The idea • Dynamic histogram of active 1s. • New 1s go into the rightmost bucket. • For each bucket, keep the timestamp of its most recent 1 and the bucket's size. • When the timestamp expires, free the bucket. • Open questions: Bucket sizes? Policy for creating new buckets? What is it good for?
Example (N = 4) (Figure: the histogram after each arrival, showing per-bucket timestamps and sizes.)
Timestamps are easy • A cyclic counter mod N suffices, since expiry can be checked at every arrival. (Figure: cyclic timestamp counter with N = 15.)
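On the slide the counter is checked at every arrival, so a modulus of N is enough. A sketch of a slightly laxer variant (the modulus 2N here is an assumption of this sketch, not the slide's choice) that makes the age test explicit:

```python
def is_expired(ts, now, N):
    """Timestamps stored mod 2N; a bucket is expired once its age reaches N.

    Correct as long as no bucket is kept past age 2N - 1, which holds if
    expiry is checked at least once per N arrivals.
    """
    age = (now - ts) % (2 * N)
    return age >= N
```

With N = 15 and current time 5 (mod 30), a 1 seen 3 steps ago (stored timestamp 2) is still live, while one seen 20 steps ago (stored timestamp 15) is expired.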
What does the histogram buy us? • A bucket is active if it contains at least one active (non-expired) 1. • Only the last (oldest) bucket might also contain expired 1s.
Estimating the number of 1s • T – sum of all bucket sizes but the last. • So there are at least T active 1s, plus at least one in the last bucket. • C – size of the last bucket. • The actual number of active 1s in the last bucket can be anything from 1 to C. • Conclusion: estimate the count as T + C/2.
Bucket sizes: last bucket C, the rest summing to T. • True count: between T + 1 and T + C. • Absolute error of the estimate T + C/2: at most C/2. • Relative error: at most (C/2) / (T + 1).
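A worked instance of this estimate (the bucket sizes below are hypothetical, chosen only for illustration):

```python
# Hypothetical histogram, oldest bucket first: sizes 4, 2, 2, 1, 1.
sizes = [4, 2, 2, 1, 1]
C = sizes[0]                 # last (oldest) bucket, possibly part-expired
T = sum(sizes[1:])           # = 6; these 1s are certainly all active
estimate = T + C // 2        # = 8
# True count lies in [T + 1, T + C] = [7, 10]:
# absolute error <= C/2 = 2, relative error <= (C/2)/(T+1) = 2/7.
```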
Bounding the error • Goal: relative error at most ε = 1/k. • Sufficient condition: if at all times, for every bucket j, Cj/2 ≤ (1 + C1 + … + C(j−1)) / k, then the relative error is at most 1/k.
Exponential Histogram How can we do that? (With as few buckets as possible?) • Non-decreasing bucket sizes. • Bucket sizes constrained to powers of 2. • At most k/2 + 1 buckets of each size. • For all sizes but that of the last bucket, at least k/2 buckets of each size.
Insertion • New 1 – create a bucket of size 1. • Check if the invariant is violated. • Too many buckets of some size – merge the two oldest of that size into one of twice the size (this may cascade upward). (Figure: a sequence of histograms with buckets of sizes 1, 2, and 4 being created and merged.)
Why it works (correctness) • Let C be the size of the last bucket. If there are at least k/2 buckets of each of the smaller sizes 1, 2, …, C/2, then T ≥ (k/2)(1 + 2 + … + C/2) = (k/2)(C − 1). • Hence the relative error (C/2)/(1 + T) ≤ (C/2)/(1 + (k/2)(C − 1)) ≈ 1/k.
Why it works (space) • With at least k/2 buckets of every smaller size, a bucket of size 2^j is created only after Ω(k · 2^j) 1s have arrived, so bucket sizes stay O(N/k). • Can account for all 1s with just O(k log(N/k)) buckets.
Space usage • # of buckets: O(k log(N/k)). • Bucket size: O(log N) bits per bucket (timestamp plus size). • T counter for estimation: O(log N) bits. • Total: O(k log² N) bits.
Operations • Estimation: O(1) (maintain T and the last bucket's size incrementally). • Insertion: cascading merges make it expensive in the worst case – but only O(1) amortized! • Amortized accounting: a bucket of size B pays for all operations related to it over its lifetime: B inserts, B − 1 merges (and maybe a delete). Summed over all buckets ever created, including deleted ones, this is proportional to the total number of insertions.
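The whole Basic Counting algorithm described on these slides can be sketched end to end. This is a minimal Python sketch, not the paper's reference implementation: the class name, the list representation, the merge threshold of k/2 + 1 buckets per size, and the lazy (non-incremental) estimation are all assumptions of this sketch.

```python
class BasicCounter:
    """Approximate count of 1s in the last N stream bits (EH sketch)."""

    def __init__(self, N, k):
        self.N = N                  # window size
        self.cap = k // 2 + 1       # max buckets per size (assumed threshold)
        self.buckets = []           # oldest first: [timestamp, size]
        self.t = 0                  # current time

    def update(self, bit):
        self.t += 1
        self._expire()
        if bit:
            self.buckets.append([self.t, 1])
            self._cascade()

    def _expire(self):
        # Only the oldest bucket can hold expired 1s; drop it once its
        # most recent 1 falls out of the window.
        while self.buckets and self.buckets[0][0] <= self.t - self.N:
            self.buckets.pop(0)

    def _cascade(self):
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(idx) <= self.cap:
                return
            i, j = idx[0], idx[1]           # two oldest buckets of this size
            self.buckets[j][1] = 2 * size   # merged bucket keeps newer timestamp
            del self.buckets[i]
            size *= 2                       # the merge may overflow the next size

    def estimate(self):
        self._expire()
        if not self.buckets:
            return 0
        total = sum(b[1] for b in self.buckets)
        last = self.buckets[0][1]               # size C of the oldest bucket
        return total - last + (last + 1) // 2   # T + ceil(C/2)
```

Note that insertion does Θ(#buckets) list scans per cascade step; the O(1) amortized bound on the slide counts merges, not this sketch's bookkeeping.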
Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else
Extending to Sum • Integers in range [0, R]. • On value V, insert V 1s into the histogram. • Timestamps: O(log N) bits each. • Bucket counter: O(log N + log R) bits. • # of buckets: O(k log(NR)). • Total space: O(k log(NR)(log N + log R)) bits. • But insertion takes Θ(R) time!
Reducing insertion time • If we had a way to rebuild the entire histogram from its total size… • …we could buffer incoming values… • …and rebuild the histogram when the buffer reaches size B. • If a rebuild takes time T, the amortized cost per insertion is O(T/B). • Picking B to balance the rebuild cost against the buffer size gives small amortized time.
k/2 canonical representation • Would buffering really help? Is the histogram's representation unique? • The k/2 canonical representation of S: the multiset of power-of-2 bucket sizes summing to S that satisfies the EH invariant – it is unique for every S. • If S is the total size of the buckets, computing its k/2 canonical representation lets us rebuild the histogram.
Computing it • Find the largest j for which a bucket of size 2^j can appear in the representation. • The remainder then determines how many size-2^j buckets there are and which smaller sizes get one extra bucket. • Total time required is O(log S).
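A sketch of one way to compute the ℓ-canonical representation (with ℓ = k/2). The decomposition S = ℓ(2^j − 1) + m·2^j + r, with the bits of r marking which smaller sizes get an extra bucket, is my reconstruction of the slide's procedure, not code from the paper:

```python
def canonical(S, l):
    """Return [m_0, ..., m_j], the number of buckets of each size 2^i.

    Guarantees m_i in {l, l+1} for i < j, 1 <= m_j <= l + 1, and
    sum(m_i * 2**i) == S.  Runs in O(log S) time.
    """
    assert S >= 1 and l >= 1
    # Largest j such that a bucket of size 2^j fits: l buckets of every
    # smaller size plus one bucket of size 2^j must not exceed S.
    j = 0
    while l * (2 ** (j + 1) - 1) + 2 ** (j + 1) <= S:
        j += 1
    rem = S - l * (2 ** j - 1)        # = m_j * 2^j + r, with 0 <= r < 2^j
    m_j, r = divmod(rem, 2 ** j)
    # Bit i of r decides whether size 2^i gets l or l+1 buckets.
    counts = [l + ((r >> i) & 1) for i in range(j)]
    counts.append(m_j)
    return counts
```

The same routine supports combining two histograms: compute the canonical representation of S1 + S2 and rebuild.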
Combining histograms • To merge histograms of total sizes S1 and S2, calculate the canonical representation of S1 + S2. • If a value gets “unindexed” during such a rebuild, it will never be indexed again in the future. (Figure: worked example computing the S1 + S2 representation.)
Plan • Basic Counting ✓ • Sum ✓ • Everything else: • Lower bounds • More about timestamps • Applications • More problems
Lower bounds • The Basic Counting and Sum algorithms are space-optimal. • Similar techniques show that many other problems are intractable in this model. (Later.)
(Figure: the lower-bound encoding argument – a big block of size d subdivided into subblocks, with the leftmost such subblock highlighted; the algorithm must remember enough to distinguish these configurations.) • The same idea works for Sum.
Randomized bound • The lower bound applies to randomized algorithms as well. • Yao's minimax principle: the expected space complexity of the optimal deterministic algorithm on some input distribution is a lower bound on the expected space complexity of any randomized algorithm.
Timestamps • If far fewer than N items can arrive during the window, memory usage can be reduced. • Define the window based on real time – equate timestamps with clock readings. • No work needs to be done when items don't arrive, so deletions can be deferred.
Applications • Adapting existing algorithms to the sliding-window model by using an EH to replace each counter. • A counter requires O(log N) bits; an EH takes O(k log² N) bits. • There is also a 1 + ε factor loss in accuracy.
More problems • Min/Max • Storing a subsequence of (say) minima is optimal. • Distinct values • Basic Counting reduces to it.
Other Problems • Distinct values with deletions. • Factor-2 estimation requires Ω(N) space. • Reduction: map the 1s in a bit string to distinct values. Pad with zeros to infer the value of the last bit, then use a deletion to cancel that bit. • Repeat to recover the whole string.
Other Problems • Sum with negative integers. • Factor-2 estimation requires Ω(N) space. • Reduction: map 1s in the bit string to (−1, 1) and 0s to (1, −1). • Pad with 0s and query at odd time instants.