This presentation explores statistical properties of data streams, focusing on challenges such as estimating the number of distinct elements and the frequency moments. It covers foundational results of Alon, Matias, and Szegedy, introducing efficient estimators and the role of limited randomness in these calculations. It also presents the Count Sketch algorithm for finding frequent items, techniques for controlling variance, and the use of stable distributions for norm estimation.
Compact Representations in Streaming Algorithms
Moses Charikar, Princeton University
Talk Outline • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items
Frequency Moments [Alon, Matias, Szegedy ’99] • Stream consists of elements from {1, 2, …, n} • mi = number of times i occurs • k-th frequency moment: Fk = Σi mi^k • F0 = number of distinct elements • F1 = size of stream • F2 = Σi mi²
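To make the definitions concrete, here is a brute-force computation of the moments in Python (names are illustrative, not from the talk; this linear-space baseline is exactly what the streaming algorithms below are designed to beat):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact Fk = sum over i of mi^k, computed by brute force (linear space)."""
    counts = Counter(stream)                    # mi for every distinct element i
    if k == 0:
        return len(counts)                      # F0 = number of distinct elements
    return sum(m ** k for m in counts.values())

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(frequency_moment(stream, 0))              # F0 = 7 (distinct elements)
print(frequency_moment(stream, 1))              # F1 = 11 (stream length)
print(frequency_moment(stream, 2))              # F2 = 21 (= 4 + 4 + 9 + 1 + 1 + 1 + 1)
```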
Overall Scheme • Design estimator (i.e. random variable) with the right expectation • If estimator is tightly concentrated, maintain number of independent copies of estimator E1, E2, …, Er • Obtain estimate E from E1, E2, …, Er • Within (1+ε) with probability 1−δ
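The standard way to obtain E from the copies is the "median of means" trick. A minimal sketch of that combining step (function and parameter names are mine, not from the talk):

```python
import statistics

def combine_estimates(estimates, group_size):
    """Median of means: averaging within a group shrinks the variance;
    taking the median across groups boosts the success probability."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    means = [sum(g) / len(g) for g in groups]
    return statistics.median(means)
```

With group_size = O(1/ε²), each group mean lands within (1+ε) with constant probability; the median over O(log(1/δ)) groups drives the failure probability down to δ.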
Randomness • Design estimator assuming perfect hash functions, as much randomness as needed • Too much space required to explicitly store such a hash function • Fix later by showing that limited randomness suffices
Distinct Elements • Estimate the number of distinct elements in a data stream • “Brute force solution”: Maintain list of distinct elements seen so far • Ω(n) storage • Can we do better?
Distinct Elements [Flajolet, Martin ’83] • Pick a random hash function h: [n] → [0,1] • Say there are k distinct elements • Then minimum value of h over the k distinct elements is around 1/k • Apply h() to every element of the data stream; maintain the minimum value • Estimator = 1/minimum
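A sketch of this estimator in Python. The perfectly random h is simulated by memoizing a fresh uniform value per element; the memo table itself takes linear space, so this is purely for illustration (class name and structure are mine):

```python
import random

class DistinctElements:
    """Idealized min-based distinct-elements estimator."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.memo = {}        # simulates a perfectly random h: [n] -> [0,1]
        self.minimum = 1.0

    def update(self, x):
        if x not in self.memo:
            self.memo[x] = self.rng.random()
        self.minimum = min(self.minimum, self.memo[x])

    def estimate(self):
        return 1.0 / self.minimum   # E[min] = 1/(k+1) for k distinct elements
```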
(Idealized) Analysis • Assume perfectly random hash function h: [n] → [0,1] • S: set of k elements of [n] • X = min a∈S { h(a) } • E[X] = 1/(k+1) • Var[X] = O(1/k²) • Mean of O(1/ε²) independent estimators is within (1+ε) of 1/k with constant probability
Analysis • [Alon, Matias, Szegedy] Analysis goes through with a pairwise independent hash function h(x) = ax+b • 2-approximation • O(log n) space • Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]
Estimating F2 • F2 = Σi mi² • “Brute force solution”: Maintain counters for all distinct elements • Sampling? • n^(1/2) space
Estimating F2 [Alon, Matias, Szegedy] • Pick a random hash function h: [n] → {+1, −1} • hi = h(i) • Z = Σi mi hi • Z initially 0; add hi every time you see i • Estimator X = Z²
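A minimal Python sketch of one copy of this estimator. The ±1 values come from a keyed deterministic hash, a stand-in for the 4-wise independent family discussed two slides below; sign and F2Sketch are illustrative names:

```python
import hashlib

def sign(i, seed=0):
    """Deterministic ±1 value per element: a stand-in for hi = h(i)."""
    d = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return +1 if d[0] & 1 else -1

class F2Sketch:
    """One AMS estimator: maintain Z = sum_i mi*hi, estimate F2 by Z^2."""
    def __init__(self, seed=0):
        self.seed = seed
        self.z = 0

    def update(self, i):
        self.z += sign(i, self.seed)   # add hi on every occurrence of i

    def estimate(self):
        return self.z ** 2             # E[Z^2] = sum_i mi^2 = F2
```

As in the overall scheme, one keeps many independent copies (different seeds) and combines them by median of means.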
Analyzing the F2 estimator • E[X] = E[Z²] = Σi mi² = F2, since the cross-terms E[hi hj] vanish • Var[X] ≤ 2F2², using 4-wise independence • Median of means gives good estimator
What about the randomness? • Analysis only requires 4-wise independence of hash function h • Pick h from a 4-wise independent family • O(log n) space representation, efficient computation of h(i)
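A standard 4-wise independent family: random degree-3 polynomials over a prime field. A sketch of this construction (taking the low bit of a value mod an odd prime introduces only a negligible bias; the function name is mine):

```python
import random

P = (1 << 61) - 1   # Mersenne prime, chosen larger than the universe size n

def make_fourwise_sign_hash(rng=None):
    """±1 hash from a 4-wise independent family: a random degree-3
    polynomial over GF(P), reduced to a single output bit. Storing it
    takes only the 4 coefficients, i.e. O(log n) bits."""
    rng = rng or random.Random(0)
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    def h(x):
        v = (((a * x + b) * x + c) * x + d) % P
        return +1 if v & 1 else -1
    return h
```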
Properties of F2 estimator • “Sketch” of data stream that allows computation of F2 • Linear function of the mi • Sketches can be added, subtracted • Given two streams, frequencies mi, ni • E[(Z1−Z2)²] = Σi (mi−ni)² • Estimate L2 norm of the difference • How about L1 norm? Lp norm?
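Continuing the illustrative F2Sketch above: because Z is linear in the frequencies, two sketches built with the same seed (hence the same h) can be subtracted directly:

```python
# Two streams, sketched with the same hash function (same seed).
s1, s2 = F2Sketch(seed=7), F2Sketch(seed=7)
for x in [1, 2, 2, 3]:
    s1.update(x)
for x in [1, 3, 3, 4]:
    s2.update(x)

# Z1 - Z2 = sum_i (mi - ni) hi, so (Z1 - Z2)^2 is an estimator
# for the squared L2 distance sum_i (mi - ni)^2 between the streams.
print((s1.z - s2.z) ** 2)
```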
Stable Distributions • p-stable distribution D: if X1, X2, …, Xn are i.i.d. samples from D, then m1X1 + m2X2 + … + mnXn is distributed as ||(m1, m2, …, mn)||p · X, where X is itself a sample from D • Defining property, up to a scale factor • Gaussian distribution is 2-stable • Cauchy distribution is 1-stable • p-stable distributions exist for all 0 < p ≤ 2
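This gives L1 estimation by the same recipe: fill the sketch counters with Cauchy variables instead of ±1 values and read off a median. A minimal sketch under idealized randomness (class name mine; the per-entry Cauchy values are regenerated pseudo-randomly per (i, j) rather than stored or derandomized):

```python
import math
import random
import statistics

class L1Sketch:
    """Each counter holds Zj = sum_i mi*Xij with Xij standard Cauchy
    (1-stable), so Zj is distributed as ||m||_1 times a standard Cauchy.
    The median of |standard Cauchy| is 1, so median_j |Zj| estimates ||m||_1."""
    def __init__(self, counters=100, seed=0):
        self.counters = counters
        self.seed = seed
        self.z = [0.0] * counters

    def _cauchy(self, i, j):
        u = random.Random(f"{self.seed}:{i}:{j}").random()
        return math.tan(math.pi * (u - 0.5))    # inverse-CDF Cauchy sample

    def update(self, i, delta=1):
        for j in range(self.counters):
            self.z[j] += delta * self._cauchy(i, j)

    def estimate(self):
        return statistics.median(abs(v) for v in self.z)
```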
Talk Outline • Similarity preserving hash functions • Similarity estimation • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items
Variants of F2 estimator [Alon, Gibbons, Matias, Szegedy] • Estimate join size of two relations: the inner product (m1, m2, …) · (n1, n2, …) = Σi mi ni • Variance may be too high
Finding Frequent Items [C, Chen, Farach-Colton ’02] Goal: Given a data stream, return an approximate list of the k most frequent items in one pass, using sub-linear space Applications: Analyzing search engine queries, network traffic.
Finding Frequent Items • ai: ith most frequent element • mi: its frequency • If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass • Solution: a data structure called a Count Sketch that gives good estimates of the frequencies of the high-frequency elements at every point in the stream
Intuition • Consider a single counter X with a single hash function h: a → {+1, −1} • On seeing each element ai, update the counter with X += h(ai) • X = Σi mi • h(ai) • Claim: E[X • h(ai)] = mi • Proof idea: Cross-terms cancel because of pairwise independence
Finding the max element • Problem with the single counter scheme: variance is too high • Replace with an array of t counters, using independent hash functions h1, …, ht: a → {+1, −1}
Analysis of “array of counters” data structure • Expectation still correct • Claim: Variance of final estimate < (Σi mi²)/t • Variance of each estimate < Σi mi² • Proof idea: cross-terms cancel • Set t = O(log n • Σi mi² / m1²) to get answer with high prob. • Proof idea: “median of averages”
Problem with “array of counters” data structure • Variance of estimator dominated by contribution of large elements • Estimates for important elements such as ak are corrupted by larger elements (variance much more than mk²) • To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements
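Putting the pieces together gives the Count Sketch. A minimal Python version (structure follows the slides; the pairwise independent hash functions are simulated with a keyed cryptographic hash purely for brevity):

```python
import hashlib
import statistics

class CountSketch:
    """t hash tables of b counters. Table j adds the sign sj(x) in {+1, -1}
    to bucket hj(x) on every occurrence of x; the median over tables of
    sj(x) * counts[j][hj(x)] is the frequency estimate for x."""
    def __init__(self, tables=5, buckets=256, seed=0):
        self.t, self.b, self.seed = tables, buckets, seed
        self.counts = [[0] * buckets for _ in range(tables)]

    def _hash(self, x, j):
        d = hashlib.blake2b(f"{self.seed}:{j}:{x}".encode(),
                            digest_size=8).digest()
        bucket = int.from_bytes(d[:4], "little") % self.b
        sgn = +1 if d[4] & 1 else -1
        return bucket, sgn

    def update(self, x):
        for j in range(self.t):
            bucket, sgn = self._hash(x, j)
            self.counts[j][bucket] += sgn

    def estimate(self, x):
        ests = []
        for j in range(self.t):
            bucket, sgn = self._hash(x, j)
            ests.append(sgn * self.counts[j][bucket])
        return statistics.median(ests)
```

To output the k most frequent items in one pass, one maintains a small heap of the items with the largest current estimates alongside the sketch.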
In Conclusion • Simple powerful ideas at the heart of several algorithmic techniques for large data sets • “Sketches” of data tailored to applications • Many interesting research questions