

Big Data. Lecture 5: Estimating the second moment, dimension reduction, applications.

## Big Data



**Big Data**
Lecture 5: Estimating the second moment, dimension reduction, applications

**The second moment**
Stream: A, B, A, C, D, D, A, A, E, B, E, E, F, …
The second moment: $F_2 = \sum_x f_x^2$, where $f_x$ is the number of occurrences of item $x$ in the stream.

**Alon, Matias, Szegedy 96**
Gödel Prize 2005.
Maintain: $Z = \sum_x f_x \, h(x)$ for a random hash function $h : [d] \to \{-1, 1\}$, that is, add $h(x)$ to $Z$ whenever item $x$ arrives.

**2-wise independent hash family**
Suppose $h : [d] \to [T]$. Fix two values $t_1$ and $t_2$ in the range of $h$, and two values $x_1 \ne x_2$ in the domain of $h$. What is the probability that $h(x_1) = t_1$ and $h(x_2) = t_2$?

$H$, a family of hash functions $h$, is 2-wise independent iff for all $x_1 \ne x_2$ and all $t_1, t_2$:
$\Pr_{h \in H}[h(x_1) = t_1 \text{ and } h(x_2) = t_2] = 1/|T|^2$.

$H = \{(ax + b) \bmod T \mid 0 \le a, b < T\}$ is 2-wise independent if $T$ is a prime $> d$.
$H = \{2(((ax + b) \bmod T) \bmod 2) - 1 \mid 0 \le a, b < T\}$ is approximately 2-wise independent from $[d]$ to $\{-1, 1\}$.
We can get exact 2-wise independence by more complicated constructions.

**Draw h from a 2-wise independent family**
$E[Z^2] = \sum_x f_x^2 + \sum_{x \ne y} f_x f_y \, E[h(x) h(y)] = F_2$, because $E[h(x) h(y)] = 0$ for $x \ne y$.
$Z^2$ is an unbiased estimator for $F_2$!

**What is the variance of $Z^2$?**
Here we will assume that $h$ is drawn from a 4-wise independent family $H$.

**Chebyshev's Inequality**
$\Pr[|X - E[X]| \ge t] \le \mathrm{Var}(X) / t^2$.
If $\varepsilon$ is small this is meaningless… We need to reduce the variance. How?

**Averaging**
Draw $k$ independent hash functions $h_1, h_2, \dots, h_k$, with counters $Z_1, \dots, Z_k$.
Use $A = \frac{1}{k} \sum_{i=1}^{k} Z_i^2$.

**Boosting the confidence – Chernoff bounds**
Pick $k$ large enough ($k = O(1/\varepsilon^2)$ suffices) that $A$ is more than a $(1 \pm \varepsilon)$ factor away from $F_2$ with probability at most 1/4.

Now repeat the experiment $s = O(\log(1/\delta))$ times. We get $A_1, \dots, A_s$ (assume they are sorted). Return their median. Why is this good?

Each of $A_1, \dots, A_s$ is bad (more than a $(1 \pm \varepsilon)$ factor from $F_2$) with probability ≤ ¼. For the median to be bad we need more than ½ of $A_1, \dots, A_s$ to be bad. (Remove the pair consisting of the largest and smallest and repeat… If both components of some pair are good then the median is good…)

What is the probability that more than ½ are bad?
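For small $s$ this probability can simply be computed. The following is a minimal sketch, assuming the $s$ repetitions are independent and each is bad with probability exactly 1/4 (the worst case allowed above). The function names are illustrative, and the comparison bound is a generic Hoeffding-style bound, not necessarily the exact form used on the slide.

```python
from math import comb, exp


def prob_majority_bad(s, p=0.25):
    """Exact probability that more than s/2 of s independent runs are bad,
    when each run is bad independently with probability p."""
    return sum(comb(s, j) * p**j * (1 - p) ** (s - j)
               for j in range(s // 2 + 1, s + 1))


def hoeffding_bound(s, p=0.25):
    """Hoeffding-style upper bound exp(-2 s (1/2 - p)^2) on the same
    probability (valid for p < 1/2); equals exp(-s/8) when p = 1/4."""
    return exp(-2 * s * (0.5 - p) ** 2)


if __name__ == "__main__":
    # The failure probability of the median decays exponentially in s.
    for s in (1, 5, 15, 45):
        print(s, prob_majority_bad(s), hoeffding_bound(s))
```

Because the tail decays exponentially in $s$, choosing $s$ proportional to $\log(1/\delta)$ is enough to push the failure probability of the median below $\delta$.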
Chernoff: Let $X = X_1 + \dots + X_s$ where each $X_i$ is Bernoulli with $p = ¼$. Then $\Pr[X > s/2]$ is exponentially small in $s$, so it is at most $\delta$ for $s = O(\log(1/\delta))$ with a large enough constant.

**Recap**
[figure: the sketch written as a matrix applied to the frequency vector]

**This is a random projection..**
It preserves distances in the sense: [formula missing in source]

**Make it look more familiar..**
It preserves distances in the sense: [formula missing in source]

**Dimension reduction**
[figure: a random orthonormal $k \times d$ matrix applied to a vector]
We project into a random $k$-dimensional subspace.
JL: $\varepsilon \in [0, 1]$ [formula missing in source]

**Johnson-Lindenstrauss**
JL: Project the vectors $x_1, \dots, x_n$ into a random $k$-dimensional subspace for $k = O(\log(n)/\varepsilon^2)$. Then with probability $1 - 1/n^c$ all pairwise distances are preserved up to a factor of $1 \pm \varepsilon$, after rescaling.

**The proof**
Obs1: It is enough to prove the claim for vectors such that $\|x\|_2 = 1$.
Obs2: Instead of projecting into a random $k$-dimensional subspace, look at the first $k$ coordinates of a random unit vector.

**The case k=1**
Look at the first coordinate of a random unit vector, for $\varepsilon \in [0, 1]$. [calculation missing in source]

**An application: approximate period**
A series of length $m$: 10, 3, 20, 1, 10, 3, 18, 1, 11, 5, 20, 2, 12, 1, 19, 1, …
Find $r$ such that [objective missing in source] is minimized.
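The point of the recap slides, that averaging $k$ copies of $Z^2$ is the squared norm of a random projection of the frequency vector, can be checked numerically. This is a minimal sketch under stated assumptions: it uses fully random ±1 signs from Python's `random` module rather than a 4-wise independent hash family, it uses a sign matrix rather than the random orthonormal matrix of the Johnson-Lindenstrauss slides, and it uses only the 13 stream items shown at the start of the lecture. All function names are illustrative.

```python
import math
import random
from collections import Counter


def make_sign_matrix(k, d, seed=0):
    """k rows of d independent uniform +/-1 signs.
    Assumption: fully random signs stand in for a 4-wise independent family."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]


def project(signs, x):
    """Map x in R^d to (1/sqrt(k)) * S x in R^k, where S is the sign matrix."""
    scale = 1.0 / math.sqrt(len(signs))
    return [scale * sum(s * v for s, v in zip(row, x)) for row in signs]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(c * c for c in v)


# Frequency vector of the example stream (only the items shown on the slide).
stream = "A,B,A,C,D,D,A,A,E,B,E,E,F".split(",")
counts = Counter(stream)
f = [counts[x] for x in sorted(counts)]  # [4, 2, 1, 2, 3, 1]
exact_f2 = sq_norm(f)                    # 16 + 4 + 1 + 4 + 9 + 1 = 35

signs = make_sign_matrix(k=2000, d=len(f), seed=1)
# Squared norm of the sketch = average of the k squared counters Z_i^2.
estimate = sq_norm(project(signs, f))

if __name__ == "__main__":
    print("exact F2:", exact_f2, "estimate:", estimate)
```

Since the map is linear, the same sketch also estimates squared distances: the sketch of `x - y` is the difference of the sketches of `x` and `y`.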
**An exact algorithm**
Find $r$ such that [objective missing in source] is minimized.
Each value of $r$ takes linear time, so $O(m^2)$ in total.
We can sketch/project all windows of length $r$ and compare the sketches … but that is $O(m^2 k)$ just for sketching…

**Obs1: We can sketch faster..**
[figure: a vector $h$ sliding along the series]
A running inner product with a unit vector. This is similar to a convolution of two vectors.

**Convolution**
[figure: example values 4 5 0 2 1 3 3 1 2 0]
We can compute the convolution in $O(m \log r)$ time using the FFT.

**Obs1: We can sketch faster**
We can compute the first coordinate of all sketches in $O(m \log r)$ time.
We can sketch all positions in $O(m \log(r) \, k)$ time.
But we still have many possible values for $r$…

**Obs2: Sketch only in powers of 2**
We compute all sketches in $O(\log(m) \, m \log(r) \, k)$ time.

**When r is not a power of 2?**
[figure: a window $z$ covered by two power-of-2 windows $x$ and $y$ with sketches $S(x)$ and $S(y)$]
Use $S(x) + S(y)$ as $S(z)$.

**The algorithm**
Compute sketches in powers of 2 in $O(\log(m) \, m \log(r) \, k)$ time.
For a fixed $r$ we can approximate the objective in $O((m/r) \cdot k)$ time.
Summing over $r$ we get $O(m \log(m) \cdot k)$.
Total running time is $O(m \log^3 m)$.

**Bibliography**
- Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999)
- W. B. Johnson, J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26: 189-206 (1984)
- Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2): 142-156 (2008)
- Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000: 363-372
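The observation from "Obs1: We can sketch faster", that one sketch coordinate for every window is a running inner product and hence a convolution, can be illustrated as follows. This is a minimal sketch with a hand-written radix-2 FFT so that it needs only the standard library. It transforms the whole series at once, which costs $O(m \log m)$; reaching the $O(m \log r)$ bound stated in the slides would require processing the series in blocks, which is omitted here. The function names and the example values are illustrative assumptions.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    With invert=True the result still has to be divided by len(a)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for j in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + w
        out[j + n // 2] = even[j] - w
    return out


def sliding_dot_products(series, h):
    """Inner product of h with every length-len(h) window of series,
    computed as a convolution of series with the reversed h."""
    m, r = len(series), len(h)
    size = 1
    while size < m + r:  # pad so the cyclic convolution does not wrap around
        size *= 2
    fa = fft([complex(x) for x in series] + [0j] * (size - m))
    fb = fft([complex(x) for x in reversed(h)] + [0j] * (size - r))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    conv = [v.real / size for v in prod]
    # The window starting at position i corresponds to conv[i + r - 1].
    return [conv[i + r - 1] for i in range(m - r + 1)]


def naive_sliding_dot_products(series, h):
    """Direct O(m * r) reference implementation."""
    r = len(h)
    return [sum(series[i + j] * h[j] for j in range(r))
            for i in range(len(series) - r + 1)]


if __name__ == "__main__":
    series, h = [4, 5, 0, 2, 1, 3], [3, 1, 2]
    print(sliding_dot_products(series, h))
    print(naive_sliding_dot_products(series, h))
```

Running this once per sketch coordinate, with `h` set to that coordinate's projection vector, gives the sketches of all windows of length `r`.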
