
# Big Data




### Presentation Transcript

1. Big Data Lecture 5: Estimating the second moment, dimension reduction, applications

2. The second moment A stream of items: A, B, A, C, D, D, A, A, E, B, E, E, F, … The second moment: F2 = Σi fi², where fi is the number of occurrences of item i in the stream
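For reference, F2 can be computed exactly with a dictionary of counters (the function name is mine, not the lecture's); the point of the streaming algorithm is to avoid this O(#distinct items) space:

```python
from collections import Counter

def second_moment(stream):
    """Exact F2 = sum over items of (frequency)^2.
    Needs space proportional to the number of distinct items,
    which is exactly what the streaming algorithm avoids."""
    return sum(f * f for f in Counter(stream).values())

# The stream from the slide: frequencies A:4, B:2, C:1, D:2, E:3, F:1
print(second_moment("ABACDDAAEBEEF"))  # 16+4+1+4+9+1 = 35
```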

3. Alon, Matias, Szegedy 96 Gödel Prize 2005 Maintain: Z = Σi fi·h(i) for a random sign hash h : [d] → {-1, +1}; on each arriving item x, update Z ← Z + h(x)

5. AMS Analysis

6. 2-wise independent hash family Suppose h : [d] → [T]. Fix 2 values t1 and t2 in the range of h, and 2 values x1 ≠ x2 in the domain of h. What is the probability that h(x1) = t1 and h(x2) = t2?

7. 2-wise independent hash family A family H of hash functions h is 2-wise independent iff ∀ x1 ≠ x2 and ∀ t1, t2: Pr h∈H (h(x1) = t1 and h(x2) = t2) = 1/T²

8. 2-wise independent hash family H = {(ax+b) mod T | 0 ≤ a, b < T} is 2-wise independent if T is a prime > d H = {2(((ax+b) mod T) mod 2) - 1 | 0 ≤ a, b < T} is approximately 2-wise independent from [d] to {-1, 1} We can get exact 2-wise independence by more complicated constructions
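As a sketch of the construction above (function names are mine, not the slides'), assuming T is a prime larger than the domain size d:

```python
import random

def make_hash(T, d):
    """Draw h(x) = (a*x + b) mod T from the 2-wise independent
    family; T is assumed to be a prime larger than d."""
    assert T > d
    a, b = random.randrange(T), random.randrange(T)
    return lambda x: (a * x + b) % T

def make_sign_hash(T, d):
    """The slide's approximately 2-wise independent family
    from [d] to {-1, +1}: 2*(((a*x + b) mod T) mod 2) - 1."""
    h = make_hash(T, d)
    return lambda x: 2 * (h(x) % 2) - 1
```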

9. Draw h from a 2-wise independent family Then E[Z²] = Σi fi² + Σi≠j fi·fj·E[h(i)h(j)] = F2, since the cross terms vanish by 2-wise independence Z² is an unbiased estimator for F2!
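A minimal one-pass sketch of the estimator (my code, using the approximately 2-wise family from slide 8 as a stand-in for an exact one; the variance analysis on the next slides additionally assumes 4-wise independence):

```python
import random

def ams_estimate(stream, T):
    """One AMS counter: maintain Z = sum of h(x) over stream items,
    where h maps items (integers < T) to {-1, +1}; return Z**2,
    an unbiased estimate of F2. T is a prime > the universe size."""
    a, b = random.randrange(T), random.randrange(T)
    h = lambda x: 2 * (((a * x + b) % T) % 2) - 1
    Z = 0
    for x in stream:          # one pass, O(1) space
        Z += h(x)
    return Z * Z
```

A single counter has high variance; averaging many independent counters, as the next slides do, brings the estimate close to F2.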

10. What is the variance of Z²? Here we will assume that h is drawn from a 4-wise independent family H

11. What is the variance of Z²? Expanding E[Z⁴] with a 4-wise independent h, the odd cross terms vanish and Var(Z²) = E[Z⁴] - (E[Z²])² ≤ 2·F2²

12. Chebyshev's Inequality Pr(|Z² - F2| ≥ ε·F2) ≤ Var(Z²)/(ε²·F2²) ≤ 2/ε² If ε is small this is meaningless… We need to reduce the variance How?

13. Averaging Draw k independent hash functions h1, h2, …, hk Use Y = (Z1² + … + Zk²)/k; averaging divides the variance by k

14. Boosting the confidence – Chernoff bounds Pick k = O(1/ε²) large enough that Pr(|Y - F2| ≥ ε·F2) ≤ 1/4

15. Boosting the confidence – Chernoff bounds Now repeat the experiment s = O(log(1/δ)) times We get A1, …, As (assume they are sorted) Return their median Why is this good?

16. Boosting the confidence – Chernoff bounds Each of A1, …, As is bad (outside (1 ± ε)·F2) with probability ≤ 1/4 For the median to be bad we need more than 1/2 of A1, …, As to be bad (remove the pair consisting of the largest and smallest and repeat… if both components of some pair are good then the median is good…)

17. Boosting the confidence – Chernoff bounds What is the probability that more than 1/2 are bad? Chernoff: Let X = X1 + … + Xs where each Xi is Bernoulli with p = 1/4; then Pr(X > s/2) ≤ e^(-cs) for some constant c > 0, so Pr ≤ δ for s = O(log(1/δ)) with a large enough constant
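Putting slides 13–17 together (function names and constants here are illustrative, not the lecture's): average k = O(1/ε²) counters to shrink the variance, then take the median of s = O(log(1/δ)) averages.

```python
import math
import random
from statistics import median

def ams_counter(stream, T):
    """One Z**2 counter, as on slide 9 (T a prime > universe size)."""
    a, b = random.randrange(T), random.randrange(T)
    z = sum(2 * (((a * x + b) % T) % 2) - 1 for x in stream)
    return z * z

def estimate_f2(stream, T, eps, delta):
    """Median of s averages of k counters each. For clarity this
    re-reads the stream per counter; a real streaming implementation
    would update all s*k counters in a single pass."""
    k = max(1, round(8 / eps ** 2))              # Chebyshev step
    s = 2 * round(2 * math.log(1 / delta)) + 1   # Chernoff step, odd
    averages = [sum(ams_counter(stream, T) for _ in range(k)) / k
                for _ in range(s)]
    return median(averages)
```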

18. This is a random projection… y = Ax for a random k × d matrix A Preserve distances in the sense: ||Axi - Axj|| ≈ ||xi - xj||

19. Make it look more familiar… Written as a matrix-vector product: y = Ax Preserve distances in the sense: ||Axi - Axj|| ≈ ||xi - xj||

20. Dimension reduction y = Ax (A random orthonormal k × d) We project into a random k-dim. subspace

21. Dimension reduction y = Ax (A random orthonormal k × d) We project into a random k-dim. subspace JL: ε ∈ [0,1]

23. Johnson-Lindenstrauss JL: Project the vectors x1, …, xn into a random k-dimensional subspace. For k = O(log(n)/ε²), with probability 1 - 1/n^c, for all pairs i, j: (1 - ε)·||xi - xj|| ≤ √(d/k)·||Axi - Axj|| ≤ (1 + ε)·||xi - xj||
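In practice the lemma is often realized with a scaled Gaussian matrix rather than an exactly orthonormal projection (the same guarantee holds up to constants); this sketch, with my own function name, uses that variant:

```python
import numpy as np

def jl_project(X, k, seed=None):
    """Map the rows of X (n x d) to k dimensions using a random
    Gaussian matrix scaled by 1/sqrt(k). For k = O(log(n)/eps^2)
    all pairwise distances are preserved up to (1 +- eps) w.h.p."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R
```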

24. The proof (A random orthonormal k × d) Obs1: It's enough to prove it for vectors such that ||x||2 = 1 JL:

26. The proof (A random orthonormal k × d) Obs2: Instead of projecting into a random k-dim subspace, look at the first k coordinates of a random unit vector JL:

27. The proof Obs2: Instead of projecting into a random k-dim subspace, look at the first k coordinates of a random unit vector JL:

28. The case k=1 Obs2 with k=1: look at the first coordinate of a random unit vector JL:

29. The case k=1 For a random unit vector x, E[x1²] = 1/d, so d·x1² estimates ||x||² = 1 JL:

30. The case k=1 ε ∈ [0,1]

31. An application: approximate period A sequence of m numbers: 10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,… Find r such that the distance between the sequence and its shift by r, Σi (ai - ai+r)², is minimized

34. An exact algorithm Find r such that Σi (ai - ai+r)² is minimized Each value of r takes linear time ⇒ O(m²) total

35. An exact algorithm Find r such that Σi (ai - ai+r)² is minimized We can sketch/project all windows of length r and compare the sketches… but O(m²·k) just for sketching…

36. Obs1: We can sketch faster… Each coordinate of a sketch is a running inner product of the sequence A with a unit vector h; this is similar to a convolution of two vectors

37. Convolution 4 5 0 2 1 3 3 1 2 0

41. Convolution 4 5 0 2 1 3 3 1 2 0 We can compute the convolution in O(m log(r)) time using the FFT
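The running inner product of slide 36 can be sketched with one FFT-based convolution (helper name is mine; this simple version pads to length 2m, giving O(m log m) rather than the blocked O(m log r)):

```python
import numpy as np

def sliding_inner_products(A, h):
    """Inner product of h (length r) with every length-r window of
    A (length m), computed as a convolution with the reversed h."""
    A, h = np.asarray(A, float), np.asarray(h, float)
    m, r = len(A), len(h)
    n = 2 * m                        # >= m + r - 1, so no wrap-around
    full = np.fft.irfft(np.fft.rfft(A, n) * np.fft.rfft(h[::-1], n), n)
    # the window starting at position i lands at index i + r - 1
    return full[r - 1 : m]
```

Running it on the slide's numbers and comparing against the direct O(mr) computation confirms the index bookkeeping.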

42. Obs1: We can sketch faster We can compute the first coordinate of all sketches in O(m log(r)) time ⇒ we can sketch all positions in O(m log(r)·k) But we still have many possible values for r…

43. Obs2: Sketch only in powers of 2 We compute all sketches in O(log(m)·m·log(r)·k)

44. When r is not a power of 2? Split the window z into x and y whose lengths are powers of 2 Use S(x) + S(y) as S(z)

45. The algorithm Compute sketches at powers of 2 in O(log(m)·m·log(r)·k) time For a fixed r we can approximate the objective in O((m/r)·k) time Summing over r we get O(m·log(m)·k)

46. The algorithm Total running time is O(m log³ m)

47. Bibliography • Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1) (1999), 137-147 • W. B. Johnson and J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26 (1984), 189-206 • Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2) (2008), 142-156 • Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000, 363-372
