Big Data Lecture 5: Estimating the second moment, dimension reduction, applications
The second moment A,B,A,C,D,D,A,A,E,B,E,E,F,… The second moment: $F_2 = \sum_i f_i^2$, where $f_i$ is the number of occurrences of item $i$ in the stream. For example, in the stream above (so far) A appears 4 times, B twice, C once, D twice, E three times and F once, giving $F_2 = 16+4+1+4+9+1 = 35$.
Alon, Matias, Szegedy 96 Gödel Prize 2005 Maintain: $Z = \sum_i h(i) f_i$, where $h : [d] \to \{-1,+1\}$ is a random hash function. On each arrival of item $i$ in the stream, update $Z \leftarrow Z + h(i)$.
2-wise independent hash family Suppose $h : [d] \to [T]$. Fix two values $t_1, t_2$ in the range of $h$ and two distinct values $x_1 \neq x_2$ in the domain of $h$. What is the probability that $h(x_1) = t_1$ and $h(x_2) = t_2$?
2-wise independent hash family $H$, a family of hash functions $h$, is 2-wise independent iff for all $x_1 \neq x_2$ and all $t_1, t_2$: $\Pr_{h \in H}\left[h(x_1) = t_1 \text{ and } h(x_2) = t_2\right] = 1/T^2$.
2-wise independent hash family $H = \{x \mapsto (ax+b) \bmod T \mid 0 \le a, b < T\}$ is 2-wise independent if $T$ is a prime $> d$. $H = \{x \mapsto 2\big(((ax+b) \bmod T) \bmod 2\big) - 1 \mid 0 \le a, b < T\}$ is approximately 2-wise independent from $[d]$ to $\{-1,+1\}$. We can get exact 2-wise independence by more complicated constructions.
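To make the construction concrete, here is a minimal Python sketch of the second family above, assuming $T$ is a prime larger than $d$; the class name `SignHash` is ours, not from any library.

```python
import random

class SignHash:
    """Approximately 2-wise independent hash from [d] to {-1, +1},
    via h(x) = 2*(((a*x + b) mod T) mod 2) - 1 with random a, b."""
    def __init__(self, T):
        self.T = T                    # prime modulus, T > d
        self.a = random.randrange(T)  # 0 <= a < T
        self.b = random.randrange(T)  # 0 <= b < T

    def __call__(self, x):
        return 2 * (((self.a * x + self.b) % self.T) % 2) - 1
```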
Draw $h$ from a 2-wise independent family and maintain $Z = \sum_i h(i) f_i$. Then $\mathbb{E}[Z^2] = \sum_{i,j} f_i f_j \,\mathbb{E}[h(i)h(j)] = \sum_i f_i^2 = F_2$, since $\mathbb{E}[h(i)h(j)] = 0$ for $i \neq j$ by 2-wise independence: $Z^2$ is an unbiased estimator for $F_2$!
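As a sketch of how this runs over a stream (reusing the hypothetical `SignHash` above, and assuming items are encoded as integers in $[d]$):

```python
def ams_estimate(stream, T):
    """One AMS estimator: maintain Z = sum of h(i) over arrivals;
    Z**2 is an unbiased estimate of F2."""
    h = SignHash(T)
    Z = 0
    for item in stream:
        Z += h(item)   # update Z <- Z + h(i) on each arrival of item i
    return Z * Z

# Example: A..F encoded as 0..5; true F2 = 4^2+2^2+1+2^2+3^2+1 = 35.
# stream = [0, 1, 0, 2, 3, 3, 0, 0, 4, 1, 4, 4, 5]; ams_estimate(stream, 17)
```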
What is the variance of $Z^2$? Here we will assume that $h$ is drawn from a 4-wise independent family $H$. Then $\mathbb{E}[Z^4] = \sum_i f_i^4 + 6\sum_{i<j} f_i^2 f_j^2$, so $\mathrm{Var}(Z^2) = \mathbb{E}[Z^4] - \mathbb{E}[Z^2]^2 = 4\sum_{i<j} f_i^2 f_j^2 \le 2F_2^2$.
Chebyshev’s Inequality: $\Pr\left[|Z^2 - F_2| \ge \varepsilon F_2\right] \le \mathrm{Var}(Z^2)/(\varepsilon^2 F_2^2) \le 2/\varepsilon^2$. If $\varepsilon$ is small this bound is meaningless (it exceeds 1)… We need to reduce the variance. How?
Averaging: draw $k$ independent hash functions $h_1, h_2, \dots, h_k$, maintain $Z_1, \dots, Z_k$, and use $Y = \frac{1}{k}\sum_{j=1}^k Z_j^2$. Averaging keeps the estimator unbiased and shrinks the variance: $\mathrm{Var}(Y) = \mathrm{Var}(Z^2)/k \le 2F_2^2/k$.
Boosting the confidence – Chernoff bounds. Pick $k = 8/\varepsilon^2$; then by Chebyshev $\Pr\left[|Y - F_2| \ge \varepsilon F_2\right] \le 2/(k\varepsilon^2) = 1/4$.
Boosting the confidence – Chernoff bounds. Now repeat the experiment $s = O(\log(1/\delta))$ times. We get $A_1, \dots, A_s$ (assume they are sorted). Return their median. Why is this good?
Boosting the confidence – Chernoff bounds. Each of $A_1, \dots, A_s$ is bad (more than a factor $(1 \pm \varepsilon)$ away from $F_2$) with probability $\le 1/4$. For the median to be bad, more than half of $A_1, \dots, A_s$ must be bad (remove the pair consisting of the largest and smallest values and repeat: if both components of some pair are good, then the median is good…).
Boosting the confidence – Chernoff bounds. What is the probability that more than half are bad? Chernoff: let $X = X_1 + \dots + X_s$, where each $X_i$ is Bernoulli with $p = 1/4$; then $\Pr[X > s/2] \le e^{-cs}$ for a constant $c > 0$. Taking $s = O(\log(1/\delta))$ with a large enough constant makes this probability at most $\delta$.
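Putting the averaging and median steps together gives the usual median-of-means estimator. The following is an illustrative sketch built on the hypothetical `ams_estimate` above; the constants follow the 1/4-failure analysis and are not tuned.

```python
import math
import statistics

def ams_median_of_means(stream, T, eps, delta):
    """(1 +/- eps)-estimate of F2 with probability >= 1 - delta.
    Assumes `stream` is re-iterable (e.g., a list)."""
    k = max(1, math.ceil(8 / eps**2))              # averaging: Chebyshev gives <= 1/4
    s = max(1, math.ceil(16 * math.log(1/delta)))  # median: Chernoff gives <= delta
    averages = []
    for _ in range(s):
        averages.append(sum(ams_estimate(stream, T) for _ in range(k)) / k)
    return statistics.median(averages)
```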
Recap: stacking the $k$ estimators, $(Z_1, \dots, Z_k)^\top = M f$, where $f \in \mathbb{R}^d$ is the frequency vector and $M$ is a $k \times d$ matrix with entries $M_{ji} = h_j(i) \in \{-1,+1\}$.
This is a random projection: it preserves distances in the sense that $\frac{1}{k}\|Mf\|^2$ is a good estimate of $\|f\|^2 = F_2$.
Make it look more familiar: rescale and set $A = \frac{1}{\sqrt{k}} M$, so that $\|Af\|^2 = \frac{1}{k}\|Mf\|^2$ and $\mathbb{E}\left[\|Af\|^2\right] = \|f\|^2$.
Dimension reduction: let $A$ be a random orthonormal $k \times d$ matrix and set $x' = Ax$; we project $x$ into a random $k$-dimensional subspace.
Dimension reduction, JL: for every $x$ and every $\varepsilon \in [0,1]$, $\Pr\left[(1-\varepsilon)\|x\|^2 \le \left\|\sqrt{d/k}\,Ax\right\|^2 \le (1+\varepsilon)\|x\|^2\right] \ge 1 - 2e^{-\Omega(\varepsilon^2 k)}$.
Johnson-Lindenstrauss. JL: project the vectors $x_1, \dots, x_n$ into a random $k$-dimensional subspace with $k = O(\log(n)/\varepsilon^2)$; then with probability $1 - 1/n^c$, for all pairs $i, j$: $(1-\varepsilon)\|x_i - x_j\|^2 \le \|x'_i - x'_j\|^2 \le (1+\varepsilon)\|x_i - x_j\|^2$ (after rescaling by $\sqrt{d/k}$).
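A quick numerical illustration in Python; a Gaussian random projection is used as the standard stand-in for a random orthonormal matrix, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 1000, 50, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))  # k = O(log(n) / eps^2)

X = rng.normal(size=(n, d))               # n points in R^d
A = rng.normal(size=(k, d)) / np.sqrt(k)  # entries N(0, 1/k): E||Ax||^2 = ||x||^2
Y = X @ A.T                               # projected points in R^k

orig = np.linalg.norm(X[0] - X[1])        # a pairwise distance before...
proj = np.linalg.norm(Y[0] - Y[1])        # ...and after projection
print(proj / orig)                        # should be within 1 +/- eps of 1
```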
The proof ($A$ a random orthonormal $k \times d$ matrix, $x' = Ax$). Obs1: it's enough to prove the claim for unit vectors $\|x\|_2 = 1$, since both sides of the JL guarantee scale with $\|x\|^2$.
The proof. Obs2: instead of projecting a fixed unit vector into a random $k$-dimensional subspace, we may look at the first $k$ coordinates of a random unit vector; by rotational symmetry the two distributions of the squared length are identical.
The case $k=1$: let $z = (z_1, \dots, z_d)$ be a random unit vector and consider its first coordinate. Since $z_1^2 + \dots + z_d^2 = 1$ and the coordinates are exchangeable, $\mathbb{E}[z_1^2] = 1/d$. The heart of the proof is a tail bound showing that, for $\varepsilon \in [0,1]$, the sum $z_1^2 + \dots + z_k^2$ is within a $(1 \pm \varepsilon)$ factor of its expectation $k/d$, except with probability exponentially small in $\varepsilon^2 k$.
An application: approximate period. Given a sequence of $m$ numbers 10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,…, find the period $r$ that minimizes $D(r) = \sum_j \|w_j - w_{j+1}\|^2$, where $w_j$ denotes the $j$-th window of length $r$ of the sequence.
An exact algorithm: for each value of $r$, computing $D(r)$ takes linear time, so trying all values of $r$ takes $O(m^2)$ time. Alternatively, we can sketch/project all windows of length $r$ and compare the sketches instead of the windows themselves… but that is $O(m^2 k)$ just for sketching…
Obs1: We can sketch faster. The first coordinate of the sketch of every window of length $r$ is a running inner product of the sequence with a fixed random vector $h$, and computing all of these at once is a convolution of two vectors.
Convolution: slide the vector 3 1 2 0 along the sequence 4 5 0 2 1 3; the inner product at each offset gives one running inner product. We can compute the whole convolution in $O(m \log r)$ time using the FFT.
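Here is a minimal numpy sketch of this step: all running inner products of the sequence with $h$ via one FFT-based convolution (a single full-length FFT is $O(m \log m)$; the $O(m \log r)$ bound comes from doing the same blockwise).

```python
import numpy as np

def running_inner_products(a, h):
    """Inner product of h with every length-r window of a, via convolution."""
    m, r = len(a), len(h)
    # convolve a with the reversed h; entry i+r-1 of the convolution equals
    # <a[i:i+r], h>, the inner product at window position i
    spec = np.fft.rfft(a, m + r) * np.fft.rfft(h[::-1], m + r)
    conv = np.fft.irfft(spec, m + r)
    return conv[r - 1 : m]

a = np.array([4, 5, 0, 2, 1, 3], dtype=float)
h = np.array([3, 1, 2, 0], dtype=float)
print(running_inner_products(a, h))   # -> [17. 19.  4.]
```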
Obs1: We can sketch faster. We can compute the first coordinate of all sketches in $O(m \log r)$ time, and hence all $k$ coordinates of all sketches in $O(m \log(r) \cdot k)$ time. But we still have many possible values of $r$…
Obs2: Sketch only window lengths that are powers of 2. There are $\log m$ such lengths, so we compute all sketches in $O(\log(m) \cdot m \log(r) \cdot k)$ time.
When $r$ is not a power of 2? Cover the window $z$ of length $r$ with two (possibly overlapping) windows $x$ and $y$ whose lengths are powers of 2, and use $S(x) + S(y)$ as $S(z)$.
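A hypothetical fragment showing this covering trick, assuming `sketch[j]` already stores (say, as numpy arrays) the sketch of every window of length `2**j`, indexed by start position:

```python
def sketch_of(sketch, pos, r):
    """Approximate sketch of the window of length r starting at pos,
    built from two power-of-2 windows that cover it."""
    j = r.bit_length() - 1           # largest j with 2**j <= r
    if 2**j == r:                    # r is itself a power of two
        return sketch[j][pos]
    x = sketch[j][pos]               # covers the first 2**j entries
    y = sketch[j][pos + r - 2**j]    # covers the last 2**j entries
    return x + y                     # use S(x) + S(y) as S(z)
```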
The algorithm: compute sketches for all power-of-2 window lengths in $O(\log(m) \cdot m \log(r) \cdot k)$ time. For a fixed $r$ we can then approximate $D(r)$ in $O((m/r) \cdot k)$ time; summing over all values of $r$ gives $O(m \log(m) \cdot k)$ for the comparison phase (a harmonic sum).
The algorithm: with $k = O(\log m)$, the total running time is $O(m \log^3 m)$.
Bibliography • Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1) (1999), 137-147 • W. B. Johnson and J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26 (1984), 189-206 • Jirí Matousek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2): 142-156 (2008) • Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000: 363-372