
Offline, Stream and Approximation Algorithms for Synopsis Construction

Learn about offline, streaming, and approximation algorithms for synopsis construction. Discover the basics, issues, and connections to signal processing and approximation theory.


Presentation Transcript


  1. Offline, Stream and Approximation Algorithms for Synopsis Construction • Sudipto Guha, University of Pennsylvania • Kyuseok Shim, Seoul National University

  2. About this Tutorial • Information is incomplete and could be inaccurate • Our presentation reflects our understanding which may be erroneous A tutorial on synopsis construction algorithms

  3. Synopses Construction • "Where is the life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" (T. S. Eliot, from The Rock) • Routers • Sensors • Web • Astronomy and sciences • Too much data, too little time.

  4. The idea • To see the world in a grain of sand… • Broad characteristics of the data • Compression • Dimensionality reduction • Approximate query answering • Denoising, outlier detection and a broad array of signal processing

  5. What is a synopsis? • Hmm. • Any “shorthand” representation • Clustering! • SVD! • In this tutorial we will focus on signal/time series processing

  6. The basic problem • Formally, given a signal $X$ and a dictionary $\{\psi_i\}$, find a representation $F=\sum_i z_i\psi_i$ with at most $B$ non-zero $z_i$, minimizing some error that is a function of $X-F$ • Note: the above extends to any dimension.

  7. Many issues • What is the dictionary? • Which B terms? • What is the error? • What are the constraints?

  8. Many issues • What is the dictionary? • Set of vectors • Maybe a basis (example: Top K) • Which B terms? • What is the error? • What are the constraints?

  9. Many issues • What is the dictionary? • Set of vectors • Maybe a basis (examples: Haar wavelets; also Fourier, polynomials, …) • Which B terms? • What is the error? • What are the constraints?

  10. Many issues • What is the dictionary? • Set of vectors • May not be a basis • Histograms: • There are n choose 2 vectors • But since we impose a non-overlapping restriction we get a unique representation. • Which B terms? • What is the error? • What are the constraints?

  11. Many issues • What is the dictionary? • Which B terms? • First B? • Best B? • Why should we choose the first B? • B vs 2B numbers • Also … • What is the error? • What are the constraints?

  12. Approximation theory • The discipline of mathematics associated with the approximation of functions. • Same as our problem • Linear theory (Parseval, 1800; over two centuries) • Non-linear theory (Schmidt 1909, Haar 1910) • Is it relevant? Yes. However, the mathematical treatment has been “extremal”, i.e., how does the error change as a function of B, and is that bound tight? • Note: a yes answer does not say anything about “given this signal, is that the best we can do?”

  13. Many issues • What is the dictionary? • Which B terms? • What is the error? • This controls which B. • $\|X-F\|_2$ is most common, used all over mathematics • $\|X-F\|_1$ and $\|X-F\|_\infty$ are also useful • Weights; relative error of approximation • Approximating 1000 by 1010 is not so bad. • Approximating 1 by 11 is not a good idea. • What are the constraints?
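The error measures named on this slide can be compared side by side with a few lines of Python. This is only an illustrative sketch: the function name `error_measures` and the dictionary keys are hypothetical, and relative error is taken here as |x_i - f_i| / |x_i|, which assumes every x_i is non-zero.

```python
def error_measures(x, f):
    """Return several distances between a signal x and its approximation f."""
    diffs = [abs(a - b) for a, b in zip(x, f)]
    return {
        "l1": sum(diffs),                         # sum of absolute errors
        "l2": sum(d * d for d in diffs) ** 0.5,   # Euclidean error
        "linf": max(diffs),                       # worst single-point error
        # Relative error weights each difference by 1/|x_i| (assumes x_i != 0).
        "max_rel": max(d / abs(a) for d, a in zip(diffs, x)),
    }

# The slide's two cases have the same absolute error but very different
# relative error.
big = error_measures([1000], [1010])
small = error_measures([1], [11])
print(big["linf"], big["max_rel"])
print(small["linf"], small["max_rel"])
```

Both cases have an absolute error of 10, which is why a weighted or relative measure is needed to tell them apart.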

  14. Many issues • What is the dictionary? • Which B terms? • What is the error? • What are the constraints? • Input? Stream, stream of updates … • Space, time, precision and range of values (for $z_i$ in the expression $F=\sum_i z_i\psi_i$)

  15. In this tutorial • Histograms & Wavelets • Will focus on optimal, approximation and streaming algorithms • How to get one from the other! • Connections to top K and Fourier.

  16. I. Histograms.

  17. VOpt Histograms • Let's start simple • Given a signal X, find a piecewise constant representation H with at most B pieces minimizing $\|X-H\|_2$ • Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, 1998 • Consider one bucket. • The mean is the best value. • A natural dynamic programming formulation

  18. An Example Histogram
      Data distribution:
        Location (i):   1    2    3    4    5    6    7
        Value (X_i):    12   10   2    8    14   28   16
      V-Optimal histogram:
        Range:            [1,4]   [5,5]   [6,6]   [7,7]
        Representative:   8       14      28      16

  19. Idea: VOpt Algorithm • Within a “step/bucket”, the mean is the best value. • Assume that the last bucket is [j+1,n]. What can we say about the remaining k-1 buckets? • They must also be optimal for the range [1,j] with (k-1) buckets, so the total cost is OPT[j,k-1] + SQERR[j+1,n]. • Dynamic programming!


  22. Idea: VOpt Algorithm • A dynamic programming algorithm was given to construct the V-optimal histogram. • OPT[n,k] = min over 1≤j<n of {OPT[j,k-1] + SQERR[(j+1)..n]} • OPT[j,k]: the minimum cost of representing the set of values indexed by [1..j] by a histogram with k buckets. • SQERR[(j+1)..n]: the sum of squared errors over (j+1)..n when that bucket is represented by its mean.

  23. The DP-based VOpt Algorithm
      for i = 1 to n do
        for k = 1 to B do
          for j = 1 to i-1 do   (split point between the (k-1)-bucket histogram and the last bucket)
            OPT[i,k] = min{ OPT[i,k], OPT[j,k-1] + SQERR[j+1,i] }
      • Base case: for k = 1, OPT[i,1] = SQERR[1,i]
      • We need O(Bn) entries for the table OPT
      • Each entry OPT[i,k] takes O(n) time if SQERR[j+1,i] can be computed in O(1) time
      • O(Bn) space and O(Bn²) time
      (figure: the OPT table, of size B by n)
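The triple loop on this slide can be written out as a short Python sketch. The function name `vopt_histogram`, the `split` back-pointer table and the return format are hypothetical choices made for illustration. One deliberate substitution: instead of a prefix-sum lookup, the squared error of the last bucket is kept as a running sum while j moves leftwards, which also costs O(1) per step.

```python
def vopt_histogram(x, B):
    """Exact V-optimal histogram by dynamic programming: O(B n^2) time, O(B n) space.

    Returns (total squared error, buckets), where each bucket is
    (start, end, mean) with 1-based inclusive positions.
    """
    n = len(x)
    B = min(B, n)                      # more buckets than points never helps
    INF = float("inf")
    # opt[i][k]: least squared error for x[1..i] using exactly k buckets.
    opt = [[INF] * (B + 1) for _ in range(n + 1)]
    split = [[0] * (B + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, min(B, i) + 1):
            s = sq = 0.0               # sum and sum of squares of x[j+1..i]
            for j in range(i - 1, k - 2, -1):   # last bucket is [j+1, i]
                v = x[j]               # 0-based x[j] is the 1-based element j+1
                s += v
                sq += v * v
                sqerr = max(0.0, sq - s * s / (i - j))
                cost = opt[j][k - 1] + sqerr
                if cost < opt[i][k]:
                    opt[i][k] = cost
                    split[i][k] = j
    # Walk the stored split points backwards to recover the buckets.
    buckets, i, k = [], n, B
    while k > 0:
        j = split[i][k]
        seg = x[j:i]
        buckets.append((j + 1, i, sum(seg) / len(seg)))
        i, k = j, k - 1
    buckets.reverse()
    return opt[n][B], buckets

error, buckets = vopt_histogram([12, 10, 2, 8, 14, 28, 16], 4)
print(error, buckets)
```

Run on the seven-value example from the earlier slide with B = 4, the sketch recovers the ranges [1,4], [5,5], [6,6], [7,7] with representatives 8, 14, 28, 16.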

  24. Computation of Sum of Squared Absolute Error in O(1) time • A range sum is the difference of two prefix sums: sum(2,3) = x[2] + x[3] = sum[3] - sum[1] = 12 - 2 = 10

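  25. Computation of Sum of Squared Absolute Error in O(1) time • Let $\mathrm{SUM}[i]=\sum_{l=1}^{i}x_l$ and $\mathrm{SQSUM}[i]=\sum_{l=1}^{i}x_l^2$. • Then the mean of $x_i,\dots,x_j$ is $\mu=(\mathrm{SUM}[j]-\mathrm{SUM}[i-1])/(j-i+1)$. • Thus $\mathrm{SQERR}(i,j)=\sum_{l=i}^{j}(x_l-\mu)^2=\mathrm{SQSUM}[j]-\mathrm{SQSUM}[i-1]-(\mathrm{SUM}[j]-\mathrm{SUM}[i-1])^2/(j-i+1)$.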
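The O(1) squared-error lookup can be sketched in Python as follows. The array names SUM and SQSUM follow the names used later in the deck; the helper names `build_prefix` and `sqerr` are hypothetical.

```python
from itertools import accumulate

def build_prefix(x):
    """Prefix sums of x and of x squared, with a leading 0 so indices are 1-based."""
    SUM = [0.0] + list(accumulate(x))
    SQSUM = [0.0] + list(accumulate(v * v for v in x))
    return SUM, SQSUM

def sqerr(SUM, SQSUM, i, j):
    """Squared error of bucket [i, j] (1-based, inclusive) around its mean, in O(1)."""
    s = SUM[j] - SUM[i - 1]
    sq = SQSUM[j] - SQSUM[i - 1]
    # Clamp at 0 to absorb tiny negative values caused by floating-point rounding.
    return max(0.0, sq - s * s / (j - i + 1))

x = [12, 10, 2, 8, 14, 28, 16]
SUM, SQSUM = build_prefix(x)
print(sqerr(SUM, SQSUM, 1, 4))   # error of the first bucket in the example histogram
```

After an O(n) pass to build the two arrays, every bucket error is a constant number of arithmetic operations, which is what the DP running-time bound relies on.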

  26. Analysis of VOpt Algorithm • O(n²B) time, O(nB) space • The space can be reduced (Wednesday) • Main question: the end use of a histogram is to approximate something. • Why not find an “approximately optimal” (e.g., (1+ε)) histogram?

  27. If you had to improve something? (diagram of results) • Via wavelets, ssq: O(n) time, O(B²/ε²) space • O(n²B) time, O(nB) space • (1+ε) streaming: O(nB²/ε) time, O(B²/ε) space • (1+ε) streaming, ssq: O(n) time, O(B/ε²) space • O(n²B) time, O(n) space • (1+ε) streaming: O(n) time, O(B²/ε) space • Offline: O(n) time, O(B²/ε) space • Offline: O(n) time, O(n+B/ε) space

  28. Take 1:
      for i = 1 to n do
        for k = 1 to B do
          for j = 1 to i-1 do   (split point for the last bucket)
            OPT[1…i,k] = min[ OPT[1…i,k], OPT[1…j,k-1] + SQERR(j+1,i) ]
      • As j increases: • OPT[1..j,k] is increasing • SQERR(j+1,i) is decreasing
      • Question: can we use the monotonicity to search for the minimum?

  29. No • Consider a sequence of positive $y_1,y_2,\dots,y_n$ • $F(i)=\sum_{j\le i}y_j$ and $G(i)=F(n)-F(i-1)$ • F(i): monotonically increasing … OPT[1..j,k-1] • G(i): monotonically decreasing … SQERR(j+1,i) • Ω(n) time is necessary to find $\min_i\{F(i)+G(i)\}$ • Open question: does it extend to Ω(n²) over the entire algorithm?

  30. What gives? • Consider a sequence of positive $y_1,y_2,\dots,y_n$ • $F(i)=\sum_{j\le i}y_j$ and $G(i)=F(n)-F(i-1)$ • Thus, $F(i)+G(i)=F(n)+y_i$ • Any i gives a 2-approximation to $\min_i\{F(i)+G(i)\}$: • $F(i)+G(i)=F(n)+y_i\le 2F(n)$ • $\min_i\{F(i)+G(i)\}$ is at least $F(n)$

  31. Round 1 • Use a histogram to approximate the function • Bootstrap! • Approximate the increasing function in powers of (1+δ) • The right end point is a (1+δ) approximation of the left end point (figure: within one interval the function rises from h to at most (1+δ)h)

  32. What does that do? • Consider evaluating the function at the two endpoints, with values h (left) and h' (right) • Proof by picture: (1+δ)h ≥ h' ≥ h • Why? The first inequality holds by construction. • Why? The second holds by monotonicity!

  33. Therefore… • The right-hand point is a (1+δ) approximation! • Holds for any point x in between, a ≤ x ≤ b. • OPT[x] + SQERR[x+1] ≥ OPT[a] + SQERR[b+1] • ≥ OPT[b]/(1+δ) + SQERR[b+1] • ≥ {OPT[b] + SQERR[b+1]}/(1+δ) • Are we done? • Not quite yet. • What happens for B>2? We do not compute OPT[i,b] exactly! (figure: the OPT and SQERR curves over an interval [a,b], with OPT value h' at b)

  34. Zen and the art of histograms • Approximate the increasing function in powers of (1+δ) • The right end point is a (1+δ) approximation • Prove by induction that the error is (1+δ)^B • This tells us what δ should be (small); in fact, if we set δ = ε/(2B) then (1+δ)^B ≤ 1+ε
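The final inequality on this slide can be checked in two steps. The derivation below is a sketch that assumes 0 ≤ ε ≤ 1, a range the slide does not state explicitly; it uses $1+x\le e^x$ for all $x$ and $e^x\le 1+2x$ for $0\le x\le 1/2$.

```latex
\[
(1+\delta)^B
  = \left(1+\frac{\varepsilon}{2B}\right)^{B}
  \le \exp\!\left(\frac{\varepsilon}{2B}\cdot B\right)
  = e^{\varepsilon/2}
  \le 1+\varepsilon
  \qquad (0\le\varepsilon\le 1).
\]
```

The accumulated factor of (1+δ) per bucket level therefore stays within the target (1+ε) once δ is chosen as ε/(2B).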

  35. Complexity analysis • # of intervals p ~ (B/ε) log n • Why? • c(1+δ)^(p-1) ≤ nR² and δ = ε/(2B) • R is the largest number in the data • Assume R is polynomially bounded by n • Running time ~ nB · (B/ε) log n • Why are we approximating the increasing function? Why not the decreasing one?
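The interval count follows from the displayed inequality by taking logarithms. In this sketch, $c$ is read as the smallest non-zero error value and is assumed to be at least inverse-polynomial in $n$; that reading is an assumption, since the slide does not define $c$. The bound $nR^2$ is the largest possible squared error of $n$ values of magnitude at most $R$.

```latex
\[
c\,(1+\delta)^{p-1}\le nR^{2}
\;\Longrightarrow\;
p \le 1+\frac{\ln\!\left(nR^{2}/c\right)}{\ln(1+\delta)}
  \le 1+\frac{2}{\delta}\,\ln\!\left(\frac{nR^{2}}{c}\right)
  = O\!\left(\frac{B}{\varepsilon}\log n\right),
\]
\[
\text{using } \ln(1+\delta)\ge \delta/2 \text{ for } 0<\delta\le 1,\quad
\delta=\frac{\varepsilon}{2B},\quad R=n^{O(1)}.
\]
```

Each of the nB table entries is then a minimum over p candidates instead of n, which gives the stated running time of about nB · (B/ε) log n.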

  36. The first streaming model • The signal X is specified by x_i arriving in increasing order of i • Not the most general model • But extremely useful for modeling time series data

  37. Streaming • For each interval [a,b] we need to store $\sum_{i=1}^{a}x_i$, $\sum_{i=1}^{a}x_i^2$, $\sum_{i=1}^{b}x_i$ and $\sum_{i=1}^{b}x_i^2$ • Required space is O((B²/ε) log n)

  38. VOpt Construction: O(Bn²) • [Jagadish et al.: VLDB 1998] • OPT(i,k) = min over 1≤j<i of {OPT(j,k-1) + SQERR(j+1,i)} (figure: each entry of the OPT[j,k] column, of length n, is computed from all entries of the OPT[j,k-1] column)

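  39. AHIST-S: (1+ε) Approximation • δ = ε/(2B) • Number of intervals P = O(Bε⁻¹ log n) • AOPT[n,k] = min over interval endpoints b_jp of {AOPT[b_jp,k-1] + SQERR[b_jp+1,n]} • O(B²ε⁻¹ n log n) time and O(B²ε⁻¹ log n) space (figure: the AOPT[j,k-1] column is covered by P intervals; an interval starting at a extends to b while (1+δ)·AOPT[a] ≥ AOPT[b], and the next position c has (1+δ)·AOPT[a] < AOPT[c])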
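A one-pass version of this idea can be sketched in Python. This is an illustrative reading of the slide, not the authors' code: the function name `ahist_s`, the record layout and the "at most k buckets" convention are assumptions, and the sketch returns only the approximate error, not the bucket boundaries. For each k it keeps a list of intervals over which the approximate error grows by at most a factor (1+δ), and stores the running sum and sum of squares at each interval's right end so that SQERR of the last bucket is available in O(1).

```python
def ahist_s(stream, B, eps):
    """One-pass (1+eps)-approximate V-optimal histogram error (sketch)."""
    delta = eps / (2.0 * B)
    # queues[k] covers positions 1..i-1 with intervals for AOPT[., k].
    # Record layout: [AOPT at left end, right end b, AOPT at b, SUM[b], SQSUM[b]].
    queues = [[] for _ in range(B)]          # indices 1..B-1 are used
    s = sq = 0.0                             # running SUM[i] and SQSUM[i]
    i = 0
    aopt = [0.0] * (B + 1)
    for x in stream:
        i += 1
        s += x
        sq += x * x
        new = [0.0] * (B + 1)
        new[1] = max(0.0, sq - s * s / i)    # one bucket covering [1, i]
        for k in range(2, B + 1):
            best = new[k - 1]                # using fewer buckets is allowed
            for _left, b, val_b, s_b, sq_b in queues[k - 1]:
                seg = (sq - sq_b) - (s - s_b) ** 2 / (i - b)   # SQERR(b+1, i)
                best = min(best, val_b + max(0.0, seg))
            new[k] = best
        for k in range(1, B):
            q = queues[k]
            if q and new[k] <= (1 + delta) * q[-1][0]:
                q[-1][1:] = [i, new[k], s, sq]        # extend the last interval
            else:
                q.append([new[k], i, new[k], s, sq])  # start a new interval
        aopt = new
    return aopt[B]

print(ahist_s([12, 10, 2, 8, 14, 28, 16], 4, 0.1))
```

Only right endpoints are tried as split points, so each new value costs time proportional to the total number of stored intervals rather than to i.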

  40. The overall idea (figure: the natural DP table versus the approximate table)

  41. Do s talk to us ? • DJIA data from 1901-1993 (plot: execution time versus B)

  42. Take 2: GK02 • Sliding window streams • Potentially infinite data; interested in the last n only • Q: Suppose we constructed a histogram for [1..n] and now want it for [2..(n+1)] • The previous idea is dead on arrival. • Consider 100,1,2,3,4,5,7,8,…

  43. Formal problem • Maintain a data structure • Given an interval [a,b], construct a B-bucket histogram for [a,b] • Compute on the fly • Generalizes the window! • Generalizes VOpt when a=1, b=n

  44. Reconsider Take 1 • We are evaluating • Left to right, i.e., But we are still evaluating this guy !

  45. A brave new world • Assume an O(n)-size buffer holds the x_i values • The previous algorithm was: • Several issues • Which values are necessary and sufficient? • We are not evaluating all values; what induction?

  46. GK02: Enhanced (1+ε) Approximation • Lazy evaluation using binary search • P = O(Bε⁻¹ log n) intervals • O(B³ε⁻² log³ n) time and O(n) space • Pre-processing takes O(n) time: SUM and SQSUM (figure: in the AOPT[j,k-1] column, an interval starting at a ends at the largest z with (1+δ)·AOPT[a] ≥ AOPT[z], so that (1+δ)·AOPT[a] < AOPT[z+1])

  47. GK02: Enhanced (1+ε) Approximation • Creates all B interval lists at once • The necessary values of AOPT[j,k] are computed recursively to find the intervals [a_jp, b_jp], where b_jp is the largest z such that • (1+δ) AOPT[a_jp,k] ≥ AOPT[z,k] • (1+δ) AOPT[a_jp,k] < AOPT[z+1,k] • Note that AOPT increases as z increases • Thus, we can use binary search to find z • The O(n)-space SUM and SQSUM arrays must be maintained so that SQERR(j+1,i) can be computed in O(1) time • O(n+B³ε⁻² log³ n) time and O(n) space
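The binary-search step can be illustrated on its own. The sketch below is not the full GK02 algorithm: it takes any non-decreasing, non-negative function `f` on positions 1..n (standing in for AOPT[·,k]) and builds the interval list with O(log n) evaluations of `f` per interval. The name `interval_endpoints` and its signature are hypothetical.

```python
def interval_endpoints(f, n, delta):
    """Partition 1..n into maximal intervals [a, b] with f(b) <= (1+delta) * f(a).

    f must be non-decreasing and non-negative on 1..n. Each interval costs
    O(log n) evaluations of f instead of a left-to-right scan.
    """
    intervals = []
    a = 1
    while a <= n:
        bound = (1 + delta) * f(a)
        lo, hi = a, n                  # invariant: f(lo) <= bound
        while lo < hi:
            mid = (lo + hi + 1) // 2   # round up so the range always shrinks
            if f(mid) <= bound:
                lo = mid               # mid still belongs to this interval
            else:
                hi = mid - 1           # the interval must end before mid
        intervals.append((a, lo))
        a = lo + 1
    return intervals

# Toy monotone function; in the histogram setting f(z) would be AOPT[z, k].
print(interval_endpoints(lambda z: z * z, 10, 1.0))
```

Because only the interval endpoints are ever evaluated, values of the function at other positions are never computed, which is the "lazy evaluation" the previous slide refers to.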

  48. Take 2 summary • O(n) space and O(n+B³ε⁻² log³ n) time • Is that the best? Obviously not.

  49. Take 3: AHIST-L-Δ • Suppose we knew Δ ≤ OPT ≤ 2Δ; then… • Instead of powers of (1+ε/B), use additive terms of εΔ/(2B); then… • Time is O(B³ε⁻² log n) • To get Δ? • 2-approximation: ε = O(1) • A binary search: O(log n) • Thus, O(B³ log n · log n) • Overall O(n+B³(ε⁻²+log n) log n) time and O(n+B²/ε) space: O(B/ε)

  50. Take 4: AHIST-B • Consider the take 4 algorithm. • How to stream it? (figure labels: "On the new part", "Overall", M)
