
Estimating L2 Norm



  1. Estimating L2 Norm MIT Piotr Indyk

  2. Basic Data Stream Model
  • Single pass over the data i1, i2, …, in from 0…m
  • Typically, we assume n, m are known
  • Bounded storage (often log^O(1) n)
  • Units of storage: bits, words or "elements" (e.g., points, nodes/edges)
  • Randomness and approximation OK (in fact, almost always necessary)
  • Last lecture: estimating the number of distinct elements, up to 1±ε, with probability 2/3, using space O(log n + 1/ε^2)
  [Figure: example stream 8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 ..., with about 8 distinct elements]

  3. Generalization
  [Figure: vector x with coordinates indexed 0, 1, …, m]
  • A stream can be viewed as a sequence of updates (i, a) performing xi = xi + a (initially x = 0); a minimal illustration of this update model is sketched below
  • Basic streaming model corresponds to updates (i, 1)
  • In general, a could be negative
  • Number of distinct elements = number of non-zero coordinates in x, denoted by ||x||_0
  • Similar algorithms as in the previous lecture work for ||x||_0 as well
  • Today: two methods for estimating ||x||_2
    • Alon-Matias-Szegedy (AMS): really cute and simple
    • Johnson-Lindenstrauss (JL): really powerful, needed in future lectures
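A minimal, non-streaming illustration of the update model (function and variable names are mine, not from the slides); it stores x explicitly, which is exactly what the sketches below avoid:

    # Minimal illustration of the update model (stores x explicitly;
    # a real streaming algorithm would keep only a small sketch of x).
    def apply_updates(m, updates):
        """updates: iterable of pairs (i, a) with 0 <= i <= m, meaning x_i += a."""
        x = [0] * (m + 1)
        for i, a in updates:
            x[i] += a
        return x

    x = apply_updates(9, [(8, 1), (2, 1), (1, 1), (9, 1), (2, -1)])
    l0 = sum(1 for v in x if v != 0)   # ||x||_0: number of non-zero coordinates
    l2_sq = sum(v * v for v in x)      # ||x||_2^2: the quantity AMS and JL estimate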

  4. Why estimate L2 norm?
  [Figure: two relations Rel1 and Rel2, joined on attribute A]
  • Database join (on A): all triples (Rel1.A, Rel1.B, Rel2.B) s.t. Rel1.A = Rel2.A
  • Self-join: if Rel1 = Rel2
  • Size of self-join: ∑_{val of A} Rows(val)^2
  • Updates to the relation increment/decrement Rows(val)
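For concreteness, a small hypothetical example of the self-join size as a sum of squared frequencies, i.e. ||x||_2^2 of the frequency vector (the column values below are made up):

    from collections import Counter

    # Hypothetical join column Rel1.A; Rows(val) = number of rows with A = val.
    rel1_A = ["x", "y", "x", "z", "x", "y"]
    rows = Counter(rel1_A)                               # {'x': 3, 'y': 2, 'z': 1}
    self_join_size = sum(c * c for c in rows.values())   # 3^2 + 2^2 + 1^2 = 14
    # This is ||x||_2^2 for the vector x of value frequencies.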

  5. Algorithm I: AMS

  6. Alon-Matias-Szegedy'96
  • Choose r1, …, rm to be i.i.d. r.v., with Pr[ri = 1] = Pr[ri = -1] = 1/2
  • Maintain Z = ∑i ri xi under increments/decrements to xi (see the sketch below)
  • Return Y = Z^2 as an estimate for ||x||_2^2
  • Analysis:
    • Compute the expectation of Y
    • Bound the variance of Y (or the second moment)
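A minimal sketch of this estimator in Python (my function names; for simplicity the ri here are fully independent, whereas slide 10 notes that 4-wise independence suffices):

    import random

    def ams_estimate(updates, m, seed=0):
        """Single AMS estimator for ||x||_2^2 over a stream of updates (i, a)."""
        rng = random.Random(seed)
        r = [rng.choice((-1, 1)) for _ in range(m + 1)]   # random signs r_0..r_m
        Z = 0
        for i, a in updates:       # each update means x_i += a
            Z += r[i] * a          # maintains Z = sum_i r_i * x_i
        return Z * Z               # Y = Z^2, an unbiased estimator of ||x||_2^2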

  7. Expectation
  • The expectation of Z^2 = (∑i ri xi)^2 is equal to E[Z^2] = E[∑i,j ri xi rj xj] = ∑i,j xi xj E[ri rj]
  • We have:
    • For i ≠ j, E[ri rj] = E[ri] E[rj] = 0 – the term disappears
    • For i = j, E[ri rj] = E[ri^2] = E[1] = 1
  • Therefore E[Z^2] = ∑i xi^2 = ||x||_2^2 (unbiased estimator)
  • Now we just need to bound the variance

  8. Bounding the second moment
  • The second moment of Z^2 = (∑i ri xi)^2 is equal to the expectation of Z^4 = (∑i ri xi)(∑i ri xi)(∑i ri xi)(∑i ri xi)
  • This can be decomposed into a sum of:
    • ∑i (ri xi)^4 → expectation = ∑i xi^4
    • 6 ∑i<j (ri rj xi xj)^2 → expectation = 6 ∑i<j xi^2 xj^2
    • Terms in which some multiplier ri xi appears an odd number of times (e.g., r1x1 r2x2 r3x3 r4x4) → expectation = 0
  • Total: ∑i xi^4 + 6 ∑i<j xi^2 xj^2 ≤ 3 (∑i xi^2)^2 (see the step spelled out below)
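The final inequality is just a comparison with the expansion of (∑i xi^2)^2; spelled out:

    \[
      \Bigl(\sum_i x_i^2\Bigr)^2 \;=\; \sum_i x_i^4 + 2\sum_{i<j} x_i^2 x_j^2
      \quad\Longrightarrow\quad
      \sum_i x_i^4 + 6\sum_{i<j} x_i^2 x_j^2
      \;\le\; 3\sum_i x_i^4 + 6\sum_{i<j} x_i^2 x_j^2
      \;=\; 3\Bigl(\sum_i x_i^2\Bigr)^2 .
    \]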

  9. Analysis, ctd.
  • We have an estimator Y = Z^2
    • E[Y] = ∑i xi^2
    • σ^2 = Var[Y] ≤ 3 (∑i xi^2)^2
  • Chebyshev inequality: Pr[ |E[Y] - Y| ≥ cσ ] ≤ 1/c^2
  • Algorithm AMS+ (see the sketch below):
    • Maintain k independent copies Z1, …, Zk (and thus Y1, …, Yk), define Y' = ∑i Yi / k
    • E[Y'] = k ∑i xi^2 / k = ∑i xi^2
    • σ'^2 = Var[Y'] ≤ 3k (∑i xi^2)^2 / k^2 = 3 (∑i xi^2)^2 / k
  • Guarantee: Pr[ |Y' - ∑i xi^2| ≥ c (3/k)^{1/2} ∑i xi^2 ] ≤ 1/c^2
  • Setting c to a constant and k = O(1/ε^2) gives a (1±ε)-approximation with constant probability
  • Total space: O( log(nm) / ε^2 ) bits (not counting the ri's)
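A sketch of AMS+ over the same update stream as above (again my own names; averaging k independent copies divides the variance by k):

    import random

    def ams_plus_estimate(updates, m, k, seed=0):
        """Average of k independent AMS estimators; k = O(1/eps^2) gives (1 +- eps)."""
        rng = random.Random(seed)
        R = [[rng.choice((-1, 1)) for _ in range(m + 1)] for _ in range(k)]
        Z = [0] * k
        for i, a in updates:                   # each update means x_i += a
            for j in range(k):
                Z[j] += R[j][i] * a            # k independent linear sketches
        return sum(z * z for z in Z) / k       # Y' = (Y_1 + ... + Y_k) / k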

  10. Comments
  • Only needed 4-wise independence of r0, …, rm
    • Can generate such vars from O(log m) random bits (previous lecture; an illustrative construction is sketched below)
  • What we did:
    • Maintain a "linear sketch" vector Z = [Z1, ..., Zk] = Rx
    • Estimator for ||x||_2^2: (Z1^2 + ... + Zk^2)/k = ||Rx||_2^2 / k
  • "Dimensionality reduction": x → Rx … but the tail is somewhat "heavy"
    • Reason: only used the second moment of the estimator
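One standard construction (not spelled out on these slides, so treat it as an assumption): evaluate a random degree-3 polynomial over Z_p, p prime, at i; the values are 4-wise independent, and mapping the low bit to ±1 is unbiased only up to an O(1/p) error, negligible for p = 2^31 - 1.

    import random

    P = 2**31 - 1   # Mersenne prime, assumed > m

    def make_fourwise_signs(seed=0):
        """Return r(i) in {-1, +1}, 4-wise independent, from O(log m) random bits."""
        rng = random.Random(seed)
        a0, a1, a2, a3 = (rng.randrange(P) for _ in range(4))   # random coefficients
        def r(i):
            h = (((a3 * i + a2) * i + a1) * i + a0) % P         # degree-3 polynomial mod P
            return 1 if h & 1 else -1                           # low bit -> sign (O(1/P) bias)
        return r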

  11. Algorithm II: Dim. Reduction (JL)

  12. Interlude: Normal Distribution
  • Normal distribution N(0,1):
    • Range: (-∞, ∞)
    • Density: f(x) = e^{-x^2/2} / (2π)^{1/2}
    • Mean = 0, Variance = 1
  • Basic facts:
    • If X and Y are independent r.v. with normal distribution, then X+Y has normal distribution
    • Var(cX) = c^2 Var(X)
    • If X, Y are independent, then Var(X+Y) = Var(X) + Var(Y)

  13. A different linear sketch
  • Instead of ±1, let ri be i.i.d. random variables from N(0,1)
  • Consider Z = ∑i ri xi
  • We still have that E[Z^2] = ∑i xi^2 = ||x||_2^2, since:
    • For i ≠ j, E[ri] E[rj] = 0
    • E[ri^2] = variance of ri, i.e., 1
  • As before we maintain Z = [Z1, …, Zk] and define Y = ||Z||_2^2 = ∑j Zj^2 (so that E[Y] = k ||x||_2^2)
  • We show that there exists C > 0 s.t. for small enough ε > 0: Pr[ |Y - k||x||_2^2| > ε k ||x||_2^2 ] ≤ exp(-C ε^2 k)
  • Set k = O(1/ε^2 log(1/δ)) to get a 1±ε approximation with prob. 1-δ (see the sketch below)
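A NumPy sketch of this Gaussian estimator, assuming for simplicity that x is available as an array (in the streaming setting Z would be updated entrywise per update (i, a), exactly as in AMS; names and the constant in k are mine):

    import math
    import numpy as np

    def jl_estimate(x, eps, delta, seed=0):
        """Estimate ||x||_2^2 with a k x d Gaussian sketch, k = O(eps^-2 log(1/delta))."""
        rng = np.random.default_rng(seed)
        k = math.ceil(math.log(1.0 / delta) / eps**2)
        R = rng.standard_normal((k, len(x)))    # i.i.d. N(0,1) entries
        Z = R @ x                               # linear sketch Z = Rx
        return float(Z @ Z) / k                 # Y/k, expectation ||x||_2^2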

  14. Proof • See the attached notes, by Ben Rossman and Michel Goemans

  15. JL - comments
  • Can use k-wise independence to generate the ri's, but this is much messier than for AMS
  • Time to compute the sketch vector Z from x is O(dk)
    • Good if k is small, not so great if k is large
    • Fast JL, Sparse JL to the rescue (a few lectures from now)
