This presentation explores statistical properties of data streams, focusing on challenges such as estimating the number of distinct elements and the frequency moments. It covers foundational results of Alon, Matias, and Szegedy, introducing efficient estimators and the role of limited randomness in these calculations. It also presents the Count Sketch algorithm for finding frequent items, techniques for controlling variance, and the use of stable distributions for norm estimation.
Compact Representations in Streaming Algorithms
Moses Charikar, Princeton University
Talk Outline • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items
Frequency Moments [Alon, Matias, Szegedy ’99] • Stream consists of elements from {1, 2, …, n} • mi = number of times i occurs • k-th frequency moment: Fk = Σi mi^k • F0 = number of distinct elements • F1 = size of stream • F2 = Σi mi²
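To make the definitions concrete, here is a brute-force computation of the moments in Python (names are illustrative, not from the talk; this linear-space baseline is exactly what the streaming algorithms below are designed to beat):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact Fk = sum over i of mi^k, computed by brute force (linear space)."""
    counts = Counter(stream)                    # mi for every distinct element i
    if k == 0:
        return len(counts)                      # F0 = number of distinct elements
    return sum(m ** k for m in counts.values())

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(frequency_moment(stream, 0))              # F0 = 7 (distinct elements)
print(frequency_moment(stream, 1))              # F1 = 11 (stream length)
print(frequency_moment(stream, 2))              # F2 = 21 (= 4 + 4 + 9 + 1 + 1 + 1 + 1)
```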
Overall Scheme • Design estimator (i.e. random variable) with the right expectation • If estimator is tightly concentrated, maintain number of independent copies of estimator E1, E2, …, Er • Obtain estimate E from E1, E2, …, Er • Within (1+ε) with probability 1−δ
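The standard way to obtain E from the copies is the "median of means" trick. A minimal sketch of that combining step (function and parameter names are mine, not from the talk):

```python
import statistics

def combine_estimates(estimates, group_size):
    """Median of means: averaging within a group shrinks the variance;
    taking the median across groups boosts the success probability."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    means = [sum(g) / len(g) for g in groups]
    return statistics.median(means)
```

With group_size = O(1/ε²), each group mean lands within (1+ε) with constant probability; the median over O(log(1/δ)) groups drives the failure probability down to δ.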
Randomness • Design estimator assuming perfect hash functions, as much randomness as needed • Too much space required to explicitly store such a hash function • Fix later by showing that limited randomness suffices
Distinct Elements • Estimate the number of distinct elements in a data stream • “Brute force solution”: Maintain list of distinct elements seen so far • Ω(n) storage • Can we do better?
Distinct Elements [Flajolet, Martin ’83] • Pick a random hash function h: [n] → [0,1] • Say there are k distinct elements • Then minimum value of h over the k distinct elements is around 1/k • Apply h() to every element of the data stream; maintain the minimum value • Estimator = 1/minimum
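A sketch of this estimator in Python. The perfectly random h is simulated by memoizing a fresh uniform value per element; the memo table itself takes linear space, so this is purely for illustration (class name and structure are mine):

```python
import random

class DistinctElements:
    """Idealized min-based distinct-elements estimator."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.memo = {}        # simulates a perfectly random h: [n] -> [0,1]
        self.minimum = 1.0

    def update(self, x):
        if x not in self.memo:
            self.memo[x] = self.rng.random()
        self.minimum = min(self.minimum, self.memo[x])

    def estimate(self):
        return 1.0 / self.minimum   # E[min] = 1/(k+1) for k distinct elements
```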
(Idealized) Analysis • Assume perfectly random hash function h: [n] → [0,1] • S: set of k elements of [n] • X = min a∈S { h(a) } • E[X] = 1/(k+1) • Var[X] = O(1/k²) • Mean of O(1/ε²) independent estimators is within (1+ε) of 1/k with constant probability
Analysis • [Alon, Matias, Szegedy] Analysis goes through with a pairwise independent hash function h(x) = ax+b • 2-approximation • O(log n) space • Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]
Estimating F2 • F2 = Σi mi² • “Brute force solution”: Maintain counters for all distinct elements • Sampling? • n^(1/2) space
Estimating F2 [Alon, Matias, Szegedy] • Pick a random hash function h: [n] → {+1, −1} • hi = h(i) • Z = Σi mi hi • Z initially 0; add hi every time you see i • Estimator X = Z²
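A minimal Python sketch of one copy of this estimator. The ±1 values come from a keyed deterministic hash, a stand-in for the 4-wise independent family discussed two slides below; sign and F2Sketch are illustrative names:

```python
import hashlib

def sign(i, seed=0):
    """Deterministic ±1 value per element: a stand-in for hi = h(i)."""
    d = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return +1 if d[0] & 1 else -1

class F2Sketch:
    """One AMS estimator: maintain Z = sum_i mi*hi, estimate F2 by Z^2."""
    def __init__(self, seed=0):
        self.seed = seed
        self.z = 0

    def update(self, i):
        self.z += sign(i, self.seed)   # add hi on every occurrence of i

    def estimate(self):
        return self.z ** 2             # E[Z^2] = sum_i mi^2 = F2
```

As in the overall scheme, one keeps many independent copies (different seeds) and combines them by median of means.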
Analyzing the F2 estimator • E[X] = E[Z²] = Σi mi² = F2, since the cross-terms E[hi hj] vanish • Var[X] ≤ 2F2², using 4-wise independence • Median of means gives good estimator
What about the randomness? • Analysis only requires 4-wise independence of hash function h • Pick h from a 4-wise independent family • O(log n) space representation, efficient computation of h(i)
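A standard 4-wise independent family: random degree-3 polynomials over a prime field. A sketch of this construction (taking the low bit of a value mod an odd prime introduces only a negligible bias; the function name is mine):

```python
import random

P = (1 << 61) - 1   # Mersenne prime, chosen larger than the universe size n

def make_fourwise_sign_hash(rng=None):
    """±1 hash from a 4-wise independent family: a random degree-3
    polynomial over GF(P), reduced to a single output bit. Storing it
    takes only the 4 coefficients, i.e. O(log n) bits."""
    rng = rng or random.Random(0)
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    def h(x):
        v = (((a * x + b) * x + c) * x + d) % P
        return +1 if v & 1 else -1
    return h
```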
Properties of F2 estimator • “Sketch” of data stream that allows computation of F2 • Linear function of the mi • Sketches can be added, subtracted • Given two streams, frequencies mi, ni • E[(Z1−Z2)²] = Σi (mi−ni)² • Estimate L2 norm of the difference • How about L1 norm? Lp norm?
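Continuing the illustrative F2Sketch above: because Z is linear in the frequencies, two sketches built with the same seed (hence the same h) can be subtracted directly:

```python
# Two streams, sketched with the same hash function (same seed).
s1, s2 = F2Sketch(seed=7), F2Sketch(seed=7)
for x in [1, 2, 2, 3]:
    s1.update(x)
for x in [1, 3, 3, 4]:
    s2.update(x)

# Z1 - Z2 = sum_i (mi - ni) hi, so (Z1 - Z2)^2 is an estimator
# for the squared L2 distance sum_i (mi - ni)^2 between the streams.
print((s1.z - s2.z) ** 2)
```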
Stable Distributions • p-stable distribution D: if X1, X2, …, Xn are i.i.d. samples from D, then m1X1 + m2X2 + … + mnXn is distributed as ||(m1, m2, …, mn)||p · X, where X is itself a sample from D • Defining property, up to a scale factor • Gaussian distribution is 2-stable • Cauchy distribution is 1-stable • p-stable distributions exist for all 0 < p ≤ 2
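This gives L1 estimation by the same recipe: fill the sketch counters with Cauchy variables instead of ±1 values and read off a median. A minimal sketch under idealized randomness (class name mine; the per-entry Cauchy values are regenerated pseudo-randomly per (i, j) rather than stored or derandomized):

```python
import math
import random
import statistics

class L1Sketch:
    """Each counter holds Zj = sum_i mi*Xij with Xij standard Cauchy
    (1-stable), so Zj is distributed as ||m||_1 times a standard Cauchy.
    The median of |standard Cauchy| is 1, so median_j |Zj| estimates ||m||_1."""
    def __init__(self, counters=100, seed=0):
        self.counters = counters
        self.seed = seed
        self.z = [0.0] * counters

    def _cauchy(self, i, j):
        u = random.Random(f"{self.seed}:{i}:{j}").random()
        return math.tan(math.pi * (u - 0.5))    # inverse-CDF Cauchy sample

    def update(self, i, delta=1):
        for j in range(self.counters):
            self.z[j] += delta * self._cauchy(i, j)

    def estimate(self):
        return statistics.median(abs(v) for v in self.z)
```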
Talk Outline • Similarity preserving hash functions • Similarity estimation • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items
Variants of F2 estimator [Alon, Gibbons, Matias, Szegedy] • Estimate join size of two relations: the inner product (m1, m2, …) · (n1, n2, …) = Σi mi ni • Variance may be too high
Finding Frequent Items [C, Chen, Farach-Colton ’02] Goal: Given a data stream, return an approximate list of the k most frequent items in one pass, using sub-linear space Applications: Analyzing search engine queries, network traffic.
Finding Frequent Items • ai: ith most frequent element • mi: its frequency • If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass • Solution: a data structure called a Count Sketch that gives good estimates of the frequencies of the high-frequency elements at every point in the stream
Intuition • Consider a single counter X with a single hash function h: a → {+1, −1} • On seeing each element ai, update the counter with X += h(ai) • X = Σi mi • h(ai) • Claim: E[X • h(ai)] = mi • Proof idea: Cross-terms cancel because of pairwise independence
Finding the max element • Problem with the single counter scheme: variance is too high • Replace with an array of t counters, using independent hash functions h1, …, ht: a → {+1, −1}
Analysis of “array of counters” data structure • Expectation still correct • Claim: Variance of final estimate < (Σi mi²)/t • Variance of each estimate < Σi mi² • Proof idea: cross-terms cancel • Set t = O(log n • Σi mi² / m1²) to get answer with high prob. • Proof idea: “median of averages”
Problem with “array of counters” data structure • Variance of estimator dominated by contribution of large elements • Estimates for important elements such as ak are corrupted by larger elements (variance much more than mk²) • To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements
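Putting the pieces together gives the Count Sketch. A minimal Python version (structure follows the slides; the pairwise independent hash functions are simulated with a keyed cryptographic hash purely for brevity):

```python
import hashlib
import statistics

class CountSketch:
    """t hash tables of b counters. Table j adds the sign sj(x) in {+1, -1}
    to bucket hj(x) on every occurrence of x; the median over tables of
    sj(x) * counts[j][hj(x)] is the frequency estimate for x."""
    def __init__(self, tables=5, buckets=256, seed=0):
        self.t, self.b, self.seed = tables, buckets, seed
        self.counts = [[0] * buckets for _ in range(tables)]

    def _hash(self, x, j):
        d = hashlib.blake2b(f"{self.seed}:{j}:{x}".encode(),
                            digest_size=8).digest()
        bucket = int.from_bytes(d[:4], "little") % self.b
        sgn = +1 if d[4] & 1 else -1
        return bucket, sgn

    def update(self, x):
        for j in range(self.t):
            bucket, sgn = self._hash(x, j)
            self.counts[j][bucket] += sgn

    def estimate(self, x):
        ests = []
        for j in range(self.t):
            bucket, sgn = self._hash(x, j)
            ests.append(sgn * self.counts[j][bucket])
        return statistics.median(ests)
```

To output the k most frequent items in one pass, one maintains a small heap of the items with the largest current estimates alongside the sketch.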
In Conclusion • Simple powerful ideas at the heart of several algorithmic techniques for large data sets • “Sketches” of data tailored to applications • Many interesting research questions