Lecture 4: CountSketch High Frequencies

  1. Lecture 4: CountSketch High Frequencies • COMS E6998-9 F15

  2. Plan
     • Scribe?
     • Plan:
       • CountMin/CountSketch (continuing from last time)
       • High frequency moments via Precision Sampling

  3. Part 1: CountMin/CountSketch
     • Let $f_i$ be the frequency of $i$
     • Last lecture:
       • 2nd moment: $\sum_i f_i^2$
         • Tug of War
       • Max: heavy hitter
         • CountMin

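  4. CountMin: overall
     Algorithm CountMin:
       Initialize(L, w):
         array S[L][w]
         L hash functions h_j (one per row j), into {1,…,w}
       Process(int i):
         for(j=0; j<L; j++)
           S[j][h_j(i)] += 1;
       Estimator:
         foreach i in PossibleIP {
           fhat_i = min_j (S[j][h_j(i)]);
         }
     • Heavy hitters (with $m = \sum_j f_j$):
       • If $f_i < (1-\epsilon)\phi m$, not reported
       • If $f_i \ge \phi m$, reported as heavy hitter
     • Space: $O(\frac{\log n}{\epsilon\phi})$ cells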
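
The CountMin pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the class name `CountMin`, the `seed` parameter, and the hash family `((a*i + b) mod PRIME) mod w` are assumptions made for the example.

```python
import random


class CountMin:
    """Toy CountMin sketch: L rows of w counters, one hash function per row."""

    PRIME = (1 << 61) - 1  # modulus of the illustrative hash family (an assumption)

    def __init__(self, L, w, seed=0):
        rng = random.Random(seed)
        self.L, self.w = L, w
        self.S = [[0] * w for _ in range(L)]
        # h_j(i) = ((a_j * i + b_j) mod PRIME) mod w
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(L)]

    def _h(self, j, i):
        a, b = self.coeffs[j]
        return ((a * i + b) % self.PRIME) % self.w

    def process(self, i):
        """Add one occurrence of item i to its cell in every row."""
        for j in range(self.L):
            self.S[j][self._h(j, i)] += 1

    def estimate(self, i):
        """Minimum over rows; never below the true count for insert-only streams."""
        return min(self.S[j][self._h(j, i)] for j in range(self.L))


stream = [7] * 50 + [3] * 20 + list(range(100, 130))
cm = CountMin(L=5, w=64, seed=1)
for item in stream:
    cm.process(item)
print(cm.estimate(7), cm.estimate(3))  # each is at least the true count (50 and 20)
```

Because every update is non-negative, collisions can only inflate a cell, which is why taking the minimum over rows is safe here.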

  5. Time
     • Can improve time; space degrades to $O(\frac{\log^2 n}{\epsilon\phi})$ cells
     • Idea: dyadic intervals
       • Each level: one CountMin sketch on the virtual stream
       • Find heavy hitters by following the heavy hitters down the tree
     [Figure: tree of dyadic intervals]
       (virtual) stream 1, with 1 element: [1,n]
       (virtual) stream 2, with 2 elements: [1,n/2], [n/2+1,n]
       …
       (virtual) stream with $2^j$ elements: [1,n/2^j], …
       …
       (real) stream, with $n$ elements: 1,2,…,n
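
The tree descent over dyadic intervals can be illustrated as below. To keep the example short, exact `Counter` objects stand in for the per-level CountMin sketches, and the universe is taken to be `{0, …, 2^log_n - 1}` (0-indexed); both choices, as well as the function names, are assumptions of this sketch.

```python
from collections import Counter


def build_levels(stream, log_n):
    """levels[d][b] = number of stream items whose top d bits equal b.

    Level d splits the universe into 2^d dyadic intervals; an exact Counter
    is used where the lecture would keep one CountMin sketch per level.
    """
    levels = [Counter() for _ in range(log_n + 1)]
    for x in stream:
        for d in range(log_n + 1):
            levels[d][x >> (log_n - d)] += 1
    return levels


def heavy_hitters(levels, log_n, threshold):
    """Walk down the tree, expanding only intervals with count >= threshold."""
    candidates = [0]  # the root interval covers the whole universe
    for d in range(1, log_n + 1):
        nxt = []
        for b in candidates:
            for child in (2 * b, 2 * b + 1):
                if levels[d][child] >= threshold:
                    nxt.append(child)
        candidates = nxt
    return candidates


log_n = 4
stream = [5] * 40 + [12] * 30 + list(range(16)) * 2
levels = build_levels(stream, log_n)
found = heavy_hitters(levels, log_n, threshold=20)
print(found)
```

Only intervals that are themselves heavy get expanded, so the number of queries per level is bounded by the number of heavy intervals rather than by the universe size.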

  6. CountMin: linearity
     • Is CountMin linear?
       • CountMin($f' + f''$) from CountMin($f'$) and CountMin($f''$)?
       • Just sum the two!
         • sum the 2 arrays, assuming we use the same hash functions
     • Used a lot in practice: https://sites.google.com/site/countminsketch/
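
A quick way to see the linearity claim is to sketch two streams separately with shared hash functions, add the arrays cell by cell, and compare with the sketch of the concatenated stream. The helper names and the hash family below are illustrative assumptions.

```python
import random

PRIME = (1 << 61) - 1  # modulus of the illustrative hash family (an assumption)


def make_hashes(L, w, seed):
    """L hash functions into {0, ..., w-1}, shared by all sketches to be merged."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(L)]
    return [lambda i, a=a, b=b: ((a * i + b) % PRIME) % w for a, b in coeffs]


def countmin(stream, hashes, w):
    """Build the L x w CountMin array for an insert-only stream."""
    S = [[0] * w for _ in hashes]
    for i in stream:
        for j, h in enumerate(hashes):
            S[j][h(i)] += 1
    return S


def add_sketches(S1, S2):
    """Cell-wise sum of two sketches built with the same hash functions."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(S1, S2)]


hashes = make_hashes(L=4, w=16, seed=42)
stream_a = [1, 2, 2, 9, 9, 9]
stream_b = [2, 9, 30, 30]
merged = add_sketches(countmin(stream_a, hashes, 16), countmin(stream_b, hashes, 16))
direct = countmin(stream_a + stream_b, hashes, 16)
print(merged == direct)
```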

  7. CountSketch
     Algorithm CountSketch:
       Initialize(L, w):
         array S[L][w]
         L hash func's h_j, into [w]
         L hash func's σ_j, into {±1}
       Process(int i, real δ_i):
         for(j=0; j<L; j++)
           S[j][h_j(i)] += σ_j(i) · δ_i;
       Estimator:
         foreach i in PossibleIP {
           fhat_i = median_j (σ_j(i) · S[j][h_j(i)]);
         }
     • How about $f = f' - f''$?
       • Or general streaming
     • "Heavy hitter":
       • if $|f_i| \ge \phi \sum_j |f_j|$
     • "min" is an issue
       • But median is still ok
     • Ideas to improve it further?
       • Use Tug of War in each bucket => CountSketch
       • Better in a certain sense (cancellations in a cell)
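
The CountSketch idea (a random sign per item in each bucket, then a median over rows) might look as follows in Python. The class name, the sign and bucket hash families, and the `seed` parameter are assumptions for illustration only.

```python
import random
from statistics import median


class CountSketch:
    """Toy CountSketch: L rows of w cells, a bucket hash and a sign hash per row."""

    PRIME = (1 << 61) - 1  # modulus of the illustrative hash families (an assumption)

    def __init__(self, L, w, seed=0):
        rng = random.Random(seed)
        self.L, self.w = L, w
        self.S = [[0.0] * w for _ in range(L)]
        draw = lambda: (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
        self.bucket_coeffs = [draw() for _ in range(L)]
        self.sign_coeffs = [draw() for _ in range(L)]

    def _h(self, j, i):
        a, b = self.bucket_coeffs[j]
        return ((a * i + b) % self.PRIME) % self.w

    def _sigma(self, j, i):
        a, b = self.sign_coeffs[j]
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def process(self, i, delta):
        """General update: coordinate i changes by delta (may be negative)."""
        for j in range(self.L):
            self.S[j][self._h(j, i)] += self._sigma(j, i) * delta

    def estimate(self, i):
        """Median over rows of the sign-corrected cell containing i."""
        return median(self._sigma(j, i) * self.S[j][self._h(j, i)]
                      for j in range(self.L))


cs = CountSketch(L=5, w=32, seed=3)
for item, delta in [(5, 3.0), (8, 4.0), (5, -1.0), (8, -4.0)]:
    cs.process(item, delta)
print(cs.estimate(5))  # item 8 cancels out entirely, leaving f_5 = 2.0
```

With negative updates a colliding item can push a cell in either direction, so the minimum is no longer a safe choice while the median remains robust.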

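  8. CountSketch → Compressed Sensing
     • Sparse approximations:
       • $k$-sparse approximation $x^*$: minimizes $\|x^* - x\|$ over $k$-sparse vectors
       • Solution: $x^*$ = the $k$ heaviest elements of $x$
     • Compressed Sensing: [Candes-Romberg-Tao’04, Donoho’04]
       • Want to acquire signal $x$
       • Acquisition: linear measurements (sketch) $Sx$
       • Goal: recover a $k$-sparse approximation $x'$ from $Sx$
       • Error guarantee: $\|x' - x\|$ within a small factor of $\min_{x^*} \|x^* - x\|$
     • Theorem: need only an $O(k \log(n/k))$–size sketch!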
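
The statement that the best sparse approximation keeps the heaviest elements can be made concrete with a few lines of Python. The function name and the example vector are hypothetical, and the L1 residual is just one choice of error measure.

```python
def k_sparse_approximation(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    approx = [0.0] * len(x)
    for i in keep:
        approx[i] = x[i]
    return approx


x = [0.1, -5.0, 0.3, 4.0, -0.2, 0.05]
x_star = k_sparse_approximation(x, 2)
residual_l1 = sum(abs(a - b) for a, b in zip(x, x_star))
print(x_star, residual_l1)
```

Computing this directly needs all of `x`; the point of compressed sensing is to get a comparable approximation from a short linear sketch instead.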

  9. Signal Acquisition for CS
     • Single pixel camera [Takhar-Laska-Wakin-Duarte-Baron-Sarvotham-Kelly-Baraniuk’06]
       • One linear measurement = one row of $S$
     • CountSketch: a version of Compressed Sensing
       • Set the heavy-hitter threshold $\phi$ proportional to $1/k$
       • $x'$: take all the heavy hitters (or the $k$ largest)
       • Space: $O(k \log n)$ cells, up to the dependence on the accuracy parameter
     source: http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/cscam-SPIEJan06.pdf

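  10. Back to Moments
     • General moments:
       • $p$-th moment: $F_p = \sum_i |f_i|^p$
       • normalized: $\|f\|_p = (\sum_i |f_i|^p)^{1/p}$
     • $p = 2$:
       • via Tug of War (Lec. 3)
     • $p = 0$: count # distinct!
       • [Flajolet-Martin] from Lec. 2
     • $p = 1$:
       • $\sum_i |f_i|$: will see later (for all $p \in (0,2]$)
     • $p = \infty$ (normalized): $\max_i |f_i|$
       • Impossible to approximate, but can find heavy hitters (Lec. 3)
     • Remains: $p > 2$?
       • Space: $O(n^{1-2/p} \log^{O(1)} n)$
       • Precision Sampling (next)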
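
For reference, the moments listed above can be computed exactly on a small frequency vector. This is only a sanity check of the definitions (with the usual convention that the 0th moment counts nonzero entries), not a streaming algorithm; the function names and the vector `f` are made up for the example.

```python
def moment(f, p):
    """p-th frequency moment; p = 0 counts the nonzero coordinates."""
    if p == 0:
        return sum(1 for v in f if v != 0)
    return sum(abs(v) ** p for v in f)


def norm(f, p):
    """Normalized moment: the p-th root of the p-th moment (p > 0)."""
    return moment(f, p) ** (1.0 / p)


f = [3, 0, 1, 4, 0]
print(moment(f, 0))  # number of distinct items present
print(moment(f, 1))  # total count
print(moment(f, 2))  # second moment
print(norm(f, 50))   # approaches max |f_i| as p grows
```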

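  11. A task: estimate sum
     [Figure: quantities $a_1, a_2, a_3, a_4$; compute an estimate $\tilde S$ from the sampled $a_1, a_3$]
     • Given: $n$ quantities $a_1, \dots, a_n$ in the range $[0,1]$
     • Goal: estimate $S = \sum_i a_i$ "cheaply"
     • Standard sampling: pick random set $J = \{j_1, \dots, j_m\}$ of size $m$
       • Estimator: $\tilde S = \frac{n}{m} (a_{j_1} + a_{j_2} + \dots + a_{j_m})$
     • Chebyshev bound: with 90% success probability, $\frac{1}{2} S - O(n/m) < \tilde S < 2S + O(n/m)$
     • For constant additive error, need $m = \Omega(n)$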
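
The standard sampling estimator can be tried out directly. In this small sketch the input (three ones among 1000 values) and the function name are hypothetical, chosen to show that a small sample can only return multiples of n/m.

```python
import random


def sampling_estimate(a, m, rng):
    """Scale the sum over a uniformly random index set of size m by n/m."""
    n = len(a)
    J = rng.sample(range(n), m)  # m distinct indices, chosen without replacement
    return (n / m) * sum(a[j] for j in J)


rng = random.Random(0)
a = [1.0] * 3 + [0.0] * 997
full = sampling_estimate(a, len(a), rng)  # m = n recovers the sum exactly
small = sampling_estimate(a, 10, rng)     # m = 10: a multiple of n/m = 100
print(full, small)
```

With m = 10 the estimate jumps in steps of 100, so a sum of 3 cannot be told apart from a sum of 0 with constant additive error, in line with the need for m on the order of n.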

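  12. Precision Sampling Framework
     [Figure: quantities $a_1, \dots, a_4$ with precisions $u_1, \dots, u_4$; compute an estimate $\tilde S$ from the rough estimates $\tilde a_1, \dots, \tilde a_4$]
     • Alternative "access" to $a_i$'s:
       • For each term $a_i$, we get a (rough) estimate $\tilde a_i$
       • up to some precision $u_i$, chosen in advance: $|a_i - \tilde a_i| < u_i$
     • Challenge: achieve good trade-off between
       • quality of approximation to $S$
       • use only weak precisions $u_i$ (minimize "cost" of estimating $\tilde a_i$)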

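  13. Formalization
     • Sum Estimator: 1. fix precisions $u_i$
     • Adversary: 1. fix $a_1, \dots, a_n$
     • Adversary: 2. fix $\tilde a_1, \dots, \tilde a_n$ s.t. $|a_i - \tilde a_i| < u_i$
     • Sum Estimator: 3. given $\tilde a_1, \dots, \tilde a_n$, output $\tilde S$ s.t. $\tilde S \approx \sum_i a_i$ (up to the allowed additive and multiplicative error)
     • What is cost?
       • Here, average cost = $\frac{1}{n} \sum_i \frac{1}{u_i}$
       • to achieve precision $u_i$, use $1/u_i$ "resources": e.g., if $a_i$ is itself a sum computed by subsampling, then one needs on the order of $1/u_i$ samples
     • For example, can choose all $u_i = 1/n$
       • Average cost ≈ $n$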

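  14. Precision Sampling Lemma
     • Goal: estimate $\sum_i a_i$ from $\{\tilde a_i\}$ satisfying $|a_i - \tilde a_i| < u_i$.
     • Precision Sampling Lemma: can get, with 90% success:
       • O(1) additive error and 1.5 multiplicative error: $S/1.5 - O(1) < \tilde S < 1.5 \cdot S + O(1)$
       • with average cost equal to $O(\log n)$
     • Example: distinguish $\sum_i a_i = 3$ vs $\sum_i a_i = 0$
       • Consider two extreme cases:
         • if three $a_i = 1$: enough to have a crude approximation for all ($u_i = 0.1$)
         • if all $a_i = 3/n$: only a few with good approximation ($u_i = 1/n$), and the rest with $u_i = 1$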

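  15. Precision Sampling: Algorithm
     • Precision Sampling Lemma: can get, with 90% success:
       • O(1) additive error and 1.5 multiplicative error: $S/1.5 - O(1) < \tilde S < 1.5 \cdot S + O(1)$
       • with average cost equal to $O(\log n)$
     • Algorithm:
       • Choose each $u_i \sim \mathrm{Exp}(1)$ i.i.d.
       • Estimator: $\tilde S = \max_i \tilde a_i / u_i$.
     • Proof of correctness:
       • Claim 1: $\max_i a_i / u_i \sim \sum_i a_i / \mathrm{Exp}(1)$
         • Hence, $\max_i \tilde a_i / u_i = \frac{\sum_i a_i}{\mathrm{Exp}(1)} \pm 1$
       • Claim 2: Avg cost = $O(\log n)$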
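
The following sketch illustrates precision sampling under the assumption that the precisions are i.i.d. Exp(1) variables and that the estimator is built from the maximum of ã_i / u_i (the recap slide mentions exponential variables). The repetition with a median and the ln 2 rescaling are additions of this example, as are the function names and the random perturbation standing in for the adversary.

```python
import math
import random
from statistics import median


def one_trial(a, rng):
    """Draw u_i ~ Exp(1), form rough estimates within precision u_i, return both maxima."""
    u = [rng.expovariate(1.0) for _ in a]
    # Rough estimates with |a_i - a~_i| <= 0.9 * u_i < u_i (random offsets, not a true adversary).
    a_tilde = [ai + 0.9 * ui * rng.uniform(-1.0, 1.0) for ai, ui in zip(a, u)]
    exact_max = max(ai / ui for ai, ui in zip(a, u))
    rough_max = max(at / ui for at, ui in zip(a_tilde, u))
    return rough_max, exact_max


def estimate_sum(a, trials, rng):
    """Median of the rough maxima over independent trials, rescaled by ln 2.

    max_i a_i/u_i is distributed as S/Exp(1), whose median is S/ln 2.
    """
    maxima = [one_trial(a, rng)[0] for _ in range(trials)]
    return math.log(2) * median(maxima)


rng = random.Random(0)
a = [0.5] * 100  # true sum S = 50
rough, exact = one_trial(a, rng)
estimate = estimate_sum(a, trials=2001, rng=rng)
print(abs(rough - exact) <= 1.0, estimate)
```

Dividing by u_i turns a precision of u_i on each term into an additive error of at most 1 on the maximum, which is the step that lets weak precisions suffice.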

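  16. $p$-moments via Prec. Sampling
     • Theorem: linear sketch for the $p$-moment with $O(1)$ approximation, and $O(n^{1-2/p} \log^{O(1)} n)$ space (with 90% success probability).
     • Sketch:
       • Pick random $\sigma_i \in \{\pm 1\}$, and $u_i \sim \mathrm{Exp}(1)$
       • let $y_i = \sigma_i \cdot x_i / u_i^{1/p}$
       • Hash $y$ into a hash table $S$, $w = O(n^{1-2/p} \log^{O(1)} n)$ cells
     • Estimator: $\max_c |S[c]|^p$
     • Linear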

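  17. Correctness of estimation
     Algorithm PrecisionSamplingFp:
       Initialize(w):
         array S[w]
         hash func h, into [w]
         hash func σ, into {±1}
         reals u_i, from Exp distribution
       Process(vector x):
         for(i=0; i<n; i++)
           S[h(i)] += σ(i) · x_i / u_i^{1/p};
       Estimator:
         max_c |S[c]|^p
     • Theorem: $\max_c |S[c]|^p$ is an $O(1)$ approximation with 90% probability, with $w = O(n^{1-2/p} \log^{O(1)} n)$ cells
     • Proof:
       • Use Precision Sampling Lem.
       • Need to show $\left| |S[h(i)]|^p - |y_i|^p \right|$ small, where $y_i = \sigma(i) \cdot x_i / u_i^{1/p}$
         • more precisely: small compared to $F_p = \sum_i |x_i|^p$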
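
A toy version of the PrecisionSamplingFp sketch is given below, assuming Exp(1) scalings, one random sign per coordinate, and the estimator max over cells of |S[c]|^p. Storing fully random tables of length n for h, σ and u is a simplification of this example (a small-space version would use hash functions), and all names are illustrative.

```python
import random


class PrecisionSamplingFp:
    """Toy linear sketch for the p-th moment (p > 2) with a single hash table."""

    def __init__(self, n, w, p, seed=0):
        rng = random.Random(seed)
        self.n, self.w, self.p = n, w, p
        self.S = [0.0] * w
        self.h = [rng.randrange(w) for _ in range(n)]          # cell of coordinate i
        self.sigma = [rng.choice((-1, 1)) for _ in range(n)]   # random sign
        self.u = [rng.expovariate(1.0) for _ in range(n)]      # u_i ~ Exp(1)

    def process(self, x):
        """Add the scaled, signed vector x into the table (linear in x)."""
        for i, xi in enumerate(x):
            self.S[self.h[i]] += self.sigma[i] * xi / self.u[i] ** (1.0 / self.p)

    def estimate(self):
        """Largest cell in absolute value, raised to the p-th power."""
        return max(abs(c) for c in self.S) ** self.p


sk = PrecisionSamplingFp(n=8, w=4, p=3, seed=7)
x = [0.0] * 8
x[2] = 2.0
sk.process(x)
single = sk.estimate()  # with one nonzero coordinate this is |x_2|^p / u_2
print(single, 8.0 / sk.u[2])
```

With a single nonzero coordinate there is no chaff in its cell, so the output is exactly the precision-sampling quantity a_i / u_i with a_i = |x_i|^p; the correctness argument is about bounding what collisions add to that cell.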

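  18. Correctness 2
     • Claim: $|S[h(i)]|^p$ is close to $|y_i|^p$
     • Consider cell $c = h(i)$
       • How much chaff is there?
         • chaff $C = S[c] - y_i$: the contribution of the other coordinates hashed to $c$
       • What is $E[C^2]$?
         • $E[C^2] \le \|y\|_2^2 / w$
       • By Markov's: $C^2 < 10 \|y\|_2^2 / w$ with probability ≥ 90%
     • Set $w$ large enough, then the chaff is small; here $\|y\|_2^2 = \sum_i x_i^2 / u_i^{2/p}$, where each $u_i$ is an exponential r.v.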

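  19. Correctness (final)
     • Claim: $S[h(i)] = y_i + C$ with $|C|$ small
       • where $C$ is the chaff in cell $h(i)$
     • Proved: a bound on $E[C^2]$
       • this implies the claim with 90% probability for fixed $i$
     • But need it for all $i$!
     • Want: the claim with high probability for some smallish $w$
     • Can indeed prove it for $w = O(n^{1-2/p} \log^{O(1)} n)$ with a strong concentration inequality (Bernstein).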

  20. Recap
     • CountSketch:
       • Linear sketch for general streaming
     • $p$-moment for $p > 2$
       • Via Precision Sampling
         • Estimate of sum from poor estimates
       • Sketch: Exp variables + CountSketch
