Optimal Approximations of Frequency Moments in Data Streams

Optimal Approximations of the Frequency Moments of Data Streams
Piotr Indyk, David Woodruff

The Streaming Model
• Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$ (example stream: … 4 3 7 3 1 1 7 …)
• Want to compute statistics on the stream
• Elements arranged in adversarial order
• Algorithms are given one pass over the stream
• Goal: minimum-space algorithm

Frequency Moments [AMS96]
• n = stream size, m = universe size
• $f_i$ = # occurrences of item i
• k-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
• $F_0$ = # of distinct elements
• $F_1$ = n = stream size
• $F_2$ = self-join size
Why are frequency moments important?

Applications
• Estimating distinct elements with low space
• Estimate query selectivity to a huge DB without sorting
• Routers gather # distinct destinations
• $F_2$ estimates the size of self-joins, e.g. $f_B^2 + f_A^2 = 4 + 1 = 5$
• $F_k$ measures data skewness

The Best Deterministic Algorithm
• Trivial algorithm for $F_k$
• Store/update $f_i$ for each item i, sum $f_i^k$ at the end
• Space = $O(m \log n)$: m items i, $\log n$ bits to count $f_i$
• Negative results [AMS96]:
• Computing $F_k$ exactly requires $\Omega(m)$ space
• Any deterministic algorithm that outputs X with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space
What about randomized algorithms?
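
The trivial algorithm above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `frequency_moment` is hypothetical, and the input is the example stream from the Streaming Model slide.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum_i f_i^k, keeping one counter per distinct item.

    This is the O(m log n)-space baseline: up to m counters of log n bits each.
    """
    counts = Counter()
    for item in stream:          # single pass over the stream
        counts[item] += 1
    # Every stored count is >= 1, so f ** 0 == 1 and k = 0 counts distinct items.
    return sum(f ** k for f in counts.values())

stream = [4, 3, 7, 3, 1, 1, 7]
print(frequency_moment(stream, 0))  # 4 distinct elements
print(frequency_moment(stream, 1))  # 7 = stream length
print(frequency_moment(stream, 2))  # 13 = 1 + 4 + 4 + 4
```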

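Randomized Approx Algs for $F_k$
• A randomized algorithm $\epsilon$-approximates $F_k$ if it outputs X s.t. $\Pr[|F_k - X| < \epsilon F_k] > 2/3$
• Previous work (the slide's table, which suppresses polylog(mn) factors, is not preserved in this transcript)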

Matching Upper Bound
Our Contribution: For every k there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$
• Additional features:
• 1. Works even if we allow deletions, that is, a stream of elements (i, +), (i, -)
• 2. Constant update time

Techniques
• Previous algorithms [AMS96, CK04, G04]
• 1. Cleverly construct a small-space estimator X s.t.
• $E[X] = F_k$
• $\mathrm{Var}[X]$ is small
• 2. Apply Chebyshev's inequality
• Our "algorithm"
• 1. Divide frequencies into "buckets"
• 0, [1, 2), [2, 4), [4, 8), …, $[2^{i-1}, 2^i)$, …
• 2. Estimate the size $s_i$ of each bucket
• 3. Output $X = \sum_i s_i 2^{ik}$
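
To make the bucketing step concrete, here is a small sketch that computes the bucket sizes exactly (the real algorithm only estimates them) and evaluates $X = \sum_i s_i 2^{ik}$. The function names are hypothetical. With the base-2 buckets written on the slide, each item with frequency f in $[2^{i-1}, 2^i)$ is counted as $2^{ik}$, which lies in $(f^k, 2^k f^k]$, so X is within a factor $2^k$ of $F_k$.

```python
from collections import Counter

def bucket_sizes(stream):
    """Return {i: s_i}, where s_i = # items with frequency in [2^(i-1), 2^i)."""
    sizes = Counter()
    for f in Counter(stream).values():
        # For f >= 1, f.bit_length() == i exactly when 2^(i-1) <= f < 2^i.
        sizes[f.bit_length()] += 1
    return dict(sizes)

def bucketed_moment(sizes, k):
    """X = sum_i s_i * 2^(i*k), the rounded-up stand-in for F_k."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [4, 3, 7, 3, 1, 1, 7]
sizes = bucket_sizes(stream)       # one item in [1, 2), three items in [2, 4)
print(sizes)                       # {1: 1, 2: 3}
print(bucketed_moment(sizes, 2))   # 52, versus the exact F_2 = 13
```

In this example every frequency is a power of two, so X hits the upper end of the range exactly: 52 = 2^2 · 13.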

What's Left?
• Remaining problem: estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$
• Is this always easy? No.
• Suppose it were always easy: then we could approximate the maximum frequency
• This is HARD: $\Omega(m)$ space [AMS96]
• However, $\Omega(m)$ only applies to "worst-case" streams; otherwise we can do better: CountSketch [CCF-C02]

For the moment, let's assume:
• 1. There exists a 1-pass oracle Max returning the maximum frequency using O(B) space
• (we remove this using CountSketch)
• 2. We have a very long RAM of random bits
• (we remove this using Nisan's generator)

General Idea: Max + Sampling
• Restrict the input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability p.
• Example: stream … 4 3 7 3 1 1 7 …, random subset = {1, 3}, restricted stream … 3 3 1 1 …
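
The restriction step can be sketched as follows. The helper names `sample_items` and `restrict` are hypothetical, and a seeded `random.Random` stands in for the random bits the slides assume are available.

```python
import random

def sample_items(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {i for i in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements whose item was sampled."""
    return [a for a in stream if a in subset]

stream = [4, 3, 7, 3, 1, 1, 7]
print(restrict(stream, {1, 3}))    # [3, 3, 1, 1], matching the slide's example

rng = random.Random(0)             # fixed seed only to make the run repeatable
subset = sample_items(7, 0.5, rng)
print(restrict(stream, subset))
```

Sampling items rather than stream positions matters here: every occurrence of a sampled item survives, so the frequencies of the surviving items are unchanged.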

General Idea: Max + Sampling
• Restrict the input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability p.
• What are the chances that the maximum lies in $S_i$ = elements r such that $f_r \in [2^{i-1}, 2^i)$?
• $q = (1-p)^{\sum_{j>i} s_j} \cdot \left(1 - (1-p)^{s_i}\right)$
• Idea:
• 1. Estimate q as q' by taking independent trials and computing the fraction of trials in which the max is in $S_i$
• 2. If we have already estimated $s_j$ for j > i, solve this expression for $s_i$.
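
Step 1 of the idea can be simulated directly. In this sketch the exact frequency table plays the role of the Max oracle, which is an assumption made only for illustration (the point of the talk is to avoid storing it); the function name and parameters are hypothetical.

```python
import random
from collections import Counter

def fraction_of_max_in_class(stream, m, p, i, trials, rng):
    """Estimate q' = fraction of trials whose sampled maximum frequency
    falls in class S_i, i.e. in [2^(i-1), 2^i)."""
    freq = Counter(stream)          # stand-in for the Max oracle
    hits = 0
    for _ in range(trials):
        # Sample each item that occurs in the stream independently with prob. p.
        kept = [x for x in range(1, m + 1) if freq[x] > 0 and rng.random() < p]
        if kept:
            top = max(freq[x] for x in kept)
            if top.bit_length() == i:   # top lies in [2^(i-1), 2^i)
                hits += 1
    return hits / trials

stream = [4, 3, 7, 3, 1, 1, 7]
rng = random.Random(1)
print(fraction_of_max_in_class(stream, 7, 0.5, 2, 1000, rng))
```

A trial counts for $S_i$ exactly when no item of a higher class is sampled and at least one item of $S_i$ is, which is the event whose probability the expression for q describes.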

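When is this estimate any good?
• Recall $q = (1-p)^{\sum_{j>i} s_j} \left(1 - (1-p)^{s_i}\right)$, so solving for $s_i$ with the estimates $q'$ and $s'_j$ gives $s'_i = \log\left(1 - q' / (1-p)^{\sum_{j>i} s'_j}\right) / \log(1-p)$
• Need:
• 1. $s'_j$ close to $s_j$ for all j > i (holds inductively), and $q'$ close to $q$ (tight concentration of $q'$)
• 2. There must exist a p for which q > 1/R, where R = # trials used to estimate q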
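
Solving the expression for $s_i$ is a one-line inversion. The sketch below implements both directions so the round trip can be checked; the function names are hypothetical, and `larger_total` denotes $\sum_{j>i} s_j$ (or its estimate).

```python
import math

def max_in_class_probability(p, s_i, larger_total):
    """q = (1-p)^(sum_{j>i} s_j) * (1 - (1-p)^(s_i)), for 0 < p < 1."""
    return (1 - p) ** larger_total * (1 - (1 - p) ** s_i)

def estimate_class_size(q_est, p, larger_total):
    """Invert the expression for q: s_i = log(1 - q / (1-p)^T) / log(1-p)."""
    ratio = q_est / (1 - p) ** larger_total
    if not 0 <= ratio < 1:
        raise ValueError("q_est is inconsistent with p and larger_total")
    # log1p(-x) computes log(1 - x) accurately for small x.
    return math.log1p(-ratio) / math.log1p(-p)

q = max_in_class_probability(0.1, 5, 3)
print(q)                                  # about 0.2985
print(estimate_class_size(q, 0.1, 3))     # about 5.0
```

The division by $(1-p)^{\sum_{j>i} s'_j}$ shows why errors in the larger classes' estimates propagate into $s'_i$, which is why they must hold inductively.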

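When is this estimate any good?
• $q = (1-p)^{\sum_{j>i} s_j} \left(1 - (1-p)^{s_i}\right)$
• p too large? ⇒ q too small (some item of a higher class is almost always sampled)
• p too small? ⇒ q too small (an item of $S_i$ is almost never sampled)
• Motivates the following: say a class $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$
• If $R = \Theta(\log n)$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$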
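
The definition of a contributing class translates directly into a scan from the largest class downward. This is an illustrative sketch with a hypothetical function name; `sizes` maps each class index i to $s_i$.

```python
def contributing_classes(sizes, R):
    """Return the sorted class indices i with s_i > (sum_{j>i} s_j) / R."""
    result = []
    larger_total = 0                       # running sum of s_j over j > i
    for i in sorted(sizes, reverse=True):  # visit the largest class first
        if sizes[i] > larger_total / R:
            result.append(i)
        larger_total += sizes[i]
    return sorted(result)

# Class 1 holds a single item, but 52 items sit in higher classes:
# 1 > 52 / 10 fails, so class 1 does not contribute.
print(contributing_classes({1: 1, 2: 50, 3: 2}, R=10))   # [2, 3]
```

Note that the nonempty class with the largest index always contributes, since nothing lies above it.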

The Idealized Algorithm
• Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
• Restrict the stream Str to $Str_j^r$, those items i with $h_j^r(i) = 1$ (each item survives with probability $p = 2^{-j}$)
• For each $Str_j^r$, compute $Max(Str_j^r)$
• To estimate $s_i$ given $s'_t$ for t > i, find some j for which "enough" of the $Max(Str_j^r)$ come from $S_i$, and then set $s'_i$ by solving the expression for q with $p = 2^{-j}$
• Output $F'_k = \sum_i s'_i 2^{ik}$

Removing the assumptions
• 1. Assumption: there exists a 1-pass oracle Max returning the maximum frequency using O(B) space
• [CCF-C02]: there exists a 1-pass O(B)-space algorithm CountSketch which, given a stream Str, outputs all x for which $f_x^2 \geq F_2/B$
• Recall: $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$
• Lemma: If $S_i = [2^{i-1}, 2^i)$ contributes, then … (the conclusion is not preserved in this transcript)
• Proof: Hölder's inequality.
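
For readers unfamiliar with CountSketch, the following is a minimal sketch of its update and point-query steps. It is not the construction from the talk: it stores fully random hash and sign tables for the whole universe (which defeats the space bound and is an assumption made purely for readability, where a real implementation would use limited-independence hash families), and it omits the heavy-hitter reporting step. The class and parameter names are hypothetical.

```python
import random
from statistics import median

class CountSketch:
    def __init__(self, rows, cols, universe, seed=0):
        rng = random.Random(seed)
        # ASSUMPTION: explicit random tables indexed by item 0..universe.
        self.bucket = [[rng.randrange(cols) for _ in range(universe + 1)]
                       for _ in range(rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(universe + 1)]
                     for _ in range(rows)]
        self.table = [[0] * cols for _ in range(rows)]

    def update(self, item, delta=1):
        """Process (item, +) with delta=1 or (item, -) with delta=-1."""
        for r, row in enumerate(self.table):
            row[self.bucket[r][item]] += self.sign[r][item] * delta

    def estimate(self, item):
        """Median over rows of the signed counter that item hashes to."""
        return median(self.sign[r][item] * row[self.bucket[r][item]]
                      for r, row in enumerate(self.table))

cs = CountSketch(rows=5, cols=16, universe=100)
for _ in range(7):
    cs.update(42)
print(cs.estimate(42))   # 7: with a single item there are no collisions
cs.update(42, -7)        # deletions are just negative updates
print(cs.estimate(42))   # 0
```

Because updates are linear in the counters, insertions and deletions are handled the same way, which is consistent with the deletion support claimed for the main result.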

Removing the assumptions
• 2. Assumption: we have an infinite string of random bits
• [Indyk00]: Consider a space-S algorithm A and a function f, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable C and updates it as follows: $C = C + f(i, R_i)$
• Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and:
• The new algorithm A' has space $\tilde{O}(S)$
• The outputs of A' and A are indistinguishable
• Our algorithm follows this framework

Conclusions
• Result: tight $\tilde{O}(m^{1-2/k})$ upper bound
• Handles deletions (j, -)
• $\tilde{O}(1)$ update time
• Open problem: reduce the $\tilde{O}$ factors
