Algorithms for massive data sets

Lecture 2 (Mar 14, 2004)
Yossi Matias & Ely Porat
(partially based on various presentations & notes)

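Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
Theorem: Let E be an estimator for D(X), the number of distinct values in X, that examines r < n values in X, possibly in an adaptive and randomized order. Then, for any γ > e^{-r}, E must have ratio error (the larger of estimate/D(X) and D(X)/estimate) at least sqrt(((n-r)/(2r)) · ln(1/γ)) with probability at least γ.
• Example
  • Say, r = n/5
  • Ratio error at least sqrt(2 ln 2) ≈ 1.18, i.e. roughly 20%, with probability 1/2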
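
To get a feel for the numbers, the bound can be evaluated directly. The sketch below assumes the bound has the form sqrt(((n-r)/(2r)) · ln(1/γ)) for γ > e^{-r}, as in the cited paper; the function name and the sample values of n and r are illustrative only.

```python
import math

def ratio_error_lower_bound(n, r, gamma):
    """Lower bound on the ratio error of any estimator that examines r of n values.

    Assumes the form sqrt(((n - r) / (2r)) * ln(1 / gamma)), valid for gamma > e^{-r}.
    """
    if not 0 < r < n:
        raise ValueError("need 0 < r < n")
    if not math.exp(-r) < gamma < 1:
        raise ValueError("need e^{-r} < gamma < 1")
    return math.sqrt((n - r) / (2 * r) * math.log(1 / gamma))

# Slide example: a 20% sample (r = n/5) and probability 1/2.
bound = ratio_error_lower_bound(n=1000, r=200, gamma=0.5)
print(round(bound, 3))
```

For r = n/5 the value does not depend on n, since (n-r)/(2r) is always 2.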

Scenario Analysis
Scenario A:
• all values in X are identical (say V)
• D(X) = 1
Scenario B:
• distinct values in X are {V, W1, …, Wk}
• V appears n-k times
• each Wi appears once
• Wi’s are randomly distributed
• D(X) = k+1

Proof
• Little Birdie – tells E that X is one of Scenarios A or B only
• Suppose
  • E examines elements X(1), X(2), …, X(r) in that order
  • choice of X(i) could be randomized and depend arbitrarily on values of X(1), …, X(i-1)
• Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i-1)=V ] ≥ (n-i-k+1)/(n-i+1)
• Why?
  • No information on whether Scenario A or B
  • Wi values are randomly distributed: in Scenario B, n-i+1 positions are still unexamined and k of them hold a Wi

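Proof (continued)
• Define EV – event {X(1)=X(2)=…=X(r)=V}
• P[EV] = Π_{i=1..r} P[ X(i)=V | X(1)=…=X(i-1)=V ] ≥ Π_{i=1..r} (n-i-k+1)/(n-i+1) ≥ (1 - k/(n-r))^r ≥ exp(-2kr/(n-r))
• Last inequality because 1 - z ≥ exp(-2z) for 0 ≤ z ≤ 1/2 (so k ≤ (n-r)/2 is needed)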

Proof (conclusion)
• Choose k = ((n-r)/(2r)) · ln(1/γ) to obtain P[EV] ≥ γ
• Thus:
  • Scenario A ⇒ P[EV] = 1
  • Scenario B ⇒ P[EV] ≥ γ
• Suppose
  • E returns estimate Z when EV happens
  • Scenario A ⇒ D(X)=1
  • Scenario B ⇒ D(X)=k+1
  • Z must have worst-case ratio error > sqrt(k)

Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])
Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d’ using O(log u) memory bits, such that the probability that max(d’/d, d/d’) > c is at most 2/c, where d is the number of distinct members of A.
• A bit vector BV will represent the set
• Let b be the smallest integer s.t. 2^b > u. Let F = GF(2^b). Let r, s be random from F
• For a in A, let h(a) = r·a + s, written in binary as 101****10…0 with k trailing zeros. Set the k’th bit of BV
• Estimate is 2^{max bit set}
[Figure: example bit vector 0000101010001001111 of length b with bit k marked; plot of Pr(h(a)=k) for k from 0 to u-1]
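
The estimator above can be sketched in a few lines. This is a minimal sketch, not the construction from the theorem: instead of arithmetic in GF(2^b) it uses the linear hash (r·a + s) mod P for a fixed prime P, reduced to b bits, which is a swapped-in simplification. The names `fm_estimate`, `trailing_zeros` and `P` are illustrative.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; this sketch assumes u < P

def trailing_zeros(v, width):
    """Number of trailing zero bits of v; the all-zero word counts as width."""
    return width if v == 0 else (v & -v).bit_length() - 1

def fm_estimate(stream, u, rng):
    """Estimate the number of distinct values in stream as 2^(max bit set)."""
    b = u.bit_length()            # smallest b with 2^b > u
    r = rng.randrange(1, P)       # random hash coefficients
    s = rng.randrange(P)
    bv = 0                        # the bit vector BV, kept as an integer
    for a in stream:
        h = ((r * a + s) % P) % (1 << b)
        bv |= 1 << trailing_zeros(h, b)   # set the k'th bit
    return 0 if bv == 0 else 1 << (bv.bit_length() - 1)

A = [3, 7, 7, 42, 3, 99, 42]
print(fm_estimate(A, u=100, rng=random.Random(7)))
```

Repeated values hash to the same bit, so duplicates never change the estimate; only the set of distinct values matters.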

Randomized Approximation (2) (based on [Indyk-Motwani 1998])
• Algorithm SM – For fixed t, is D(X) >> t?
  • Choose hash function h: U → [1..t]
  • Initialize answer to NO
  • For each x_i in X, if h(x_i) = t, set answer to YES
• Theorem:
  • If D(X) < t, P[SM outputs NO] > 0.25
  • If D(X) > 2t, P[SM outputs NO] < 1/e^2 ≈ 0.135

Analysis
• Let Y be the set of distinct elements of X
• SM(X) = NO ⇔ no element of Y hashes to t
• P[element hashes to t] = 1/t
• Thus – P[SM(X) = NO] = (1 - 1/t)^{|Y|}
• Since |Y| = D(X),
  • If D(X) < t, P[SM(X) = NO] > (1 - 1/t)^t ≥ 0.25 (for t ≥ 2)
  • If D(X) > 2t, P[SM(X) = NO] < (1 - 1/t)^{2t} < 1/e^2
• Observe – need 1 bit memory only!
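
Algorithm SM and the probability from the analysis can be checked with a short sketch. The hash function is passed in as a parameter; `random_hash` below is an idealized random function kept in a table purely for simulation (it is not space efficient), and all names here are illustrative.

```python
import math
import random

def sm(stream, t, h):
    """Algorithm SM: answer YES iff some stream element hashes to t."""
    answer = "NO"
    for x in stream:
        if h(x) == t:
            answer = "YES"   # one bit of state is all that is kept
    return answer

def p_no(d, t):
    """P[SM outputs NO] when the stream has d distinct values: (1 - 1/t)^d."""
    return (1 - 1 / t) ** d

def random_hash(t, rng):
    """Idealized random function U -> [1..t], memoized for simulation only."""
    table = {}
    def h(x):
        if x not in table:
            table[x] = rng.randint(1, t)
        return table[x]
    return h

t = 16
print(p_no(t - 1, t) > 0.25, p_no(2 * t + 1, t) < math.exp(-2))
print(sm(range(100), t, random_hash(t, random.Random(0))))
```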

Boosting Accuracy
• With 1 bit ⇒ can probabilistically distinguish D(X) < t from D(X) > 2t
• Running O(log 1/δ) instances in parallel ⇒ reduces error probability to any δ>0
• Running O(log n) in parallel for t = 1, 2, 4, 8, …, n ⇒ can estimate D(X) within factor 2
• Choice of factor 2 is arbitrary ⇒ can use factor (1+ε) to reduce error to ε
• EXERCISE – Verify that we can estimate D(X) within factor (1±ε) with probability (1-δ), and bound the space this requires

Sampling: Basics
• Idea: A small random sample S of the data often represents all the data well
• For a fast approximate answer, apply the query to S & “scale” the result
  • E.g., R.a is {0,1}, S is a 20% sample
    Exact query: select count(*) from R where R.a = 0
    Approximate query: select 5 * count(*) from S where S.a = 0
  • R.a: 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 (on the slide, the values in S are marked in red)
  • Est. count = 5*2 = 10, Exact count = 10
• Leverage extensive literature on confidence intervals for sampling
  • Actual answer is within the interval [a,b] with a given probability
  • E.g., 54,000 ± 600 with prob ≥ 90%
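
The scaling step can be reproduced on the slide's column R.a. Which six values were marked as the sample is not recoverable from the text, so `sample_idx` below is a hypothetical 20% sample chosen to contain two zeros, matching the slide's estimate of 5*2 = 10.

```python
# Column R.a from the slide (30 values).
R_a = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Hypothetical 20% sample: six positions (0-based), not the slide's actual ones.
sample_idx = [1, 4, 8, 13, 21, 25]
S_a = [R_a[i] for i in sample_idx]

# select count(*) from R where R.a = 0
exact_count = sum(1 for a in R_a if a == 0)

# select 5 * count(*) from S where S.a = 0
scale = len(R_a) // len(S_a)               # 30 / 6 = 5
est_count = scale * sum(1 for a in S_a if a == 0)

print(exact_count, est_count)
```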

Sampling versus Counting
• Observe
  • Count merely abstraction – need subsequent analytics
  • Data tuples – X merely one of many attributes
  • Databases – selection predicate, join results, …
  • Networking – need to combine distributed streams
• Single-pass Approaches
  • Good accuracy
  • But gives only a count – cannot handle extensions
• Sampling-based Approaches
  • Keeps actual data – can address extensions
  • Strong negative result

Distinct Sampling for Streams [Gibbons 2001]
• Best of both worlds
  • Good accuracy
  • Maintains “distinct sample” over stream
  • Handles distributed setting
• Basic idea
  • Hash – random “priority” for domain values
  • Tracks highest priority values seen
  • Random sample of tuples for each such value
  • Relative error ε with probability 1-δ

Hash Function
• Domain U = [0..m-1]
• Hashing
  • Random A, B from U, with A > 0
  • g(x) = Ax + B (mod m)
  • h(x) – # leading 0s in binary representation of g(x)
• Clearly – 0 ≤ h(x) ≤ log m
• Fact – P[h(x) = l] = 2^{-(l+1)}

Overall Idea
• Hash ⇒ random “level” for each domain value
• Compute level for stream elements
• Invariant
  • Current level – cur_lev
  • Sample S – all distinct values scanned so far of level at least cur_lev
• Observe
  • Random hash ⇒ random sample of distinct values
  • For each value ⇒ can keep sample of their tuples

Algorithm DS (Distinct Sample)
• Parameters – memory size M
• Initialize – cur_lev ← 0; S ← empty
• For each input x
  • L ← h(x)
  • If L ≥ cur_lev then add x to S
  • If |S| > M
    • delete from S all values of level cur_lev
    • cur_lev ← cur_lev + 1
• Return 2^cur_lev · |S|

Analysis
• Invariant – S contains all values x such that h(x) ≥ cur_lev
• By construction – P[h(x) ≥ cur_lev] = 2^{-cur_lev}
• Thus – E[|S|] = D(X) / 2^cur_lev
• EXERCISE – verify deviation bound
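
Algorithm DS and its level hash can be sketched as follows. The sketch assumes m = 2^w and draws A odd so that g is a permutation of [0..m-1]; both are extra assumptions beyond the slides. It returns 2^cur_lev · |S| as the estimate and uses a `while` where the slide has a single "if", so that a deletion step that frees nothing is simply repeated at the next level. Function names are illustrative.

```python
import random

def make_level_hash(w, rng):
    """h(x) = number of leading zeros of g(x) = (A*x + B) mod 2^w, as a w-bit word."""
    m = 1 << w
    A = rng.randrange(1, m, 2)    # odd A (assumption): g is a permutation
    B = rng.randrange(m)
    def h(x):
        g = (A * x + B) % m
        return w - g.bit_length() # between 0 and w = log m
    return h

def distinct_sample(stream, M, h):
    """Keep all distinct values of level >= cur_lev, using at most M slots."""
    cur_lev = 0
    S = {}                        # value -> level
    for x in stream:
        L = h(x)
        if L >= cur_lev:
            S[x] = L
        while len(S) > M:         # drop the lowest level and move up
            S = {v: lev for v, lev in S.items() if lev > cur_lev}
            cur_lev += 1
    return (2 ** cur_lev) * len(S), cur_lev, S

h = make_level_hash(16, random.Random(0))
estimate, level, sample = distinct_sample(list(range(1000)) * 2, M=64, h=h)
print(estimate, level, len(sample))
```

Because S holds actual values, a sample of tuples can be attached to each retained value, which is what a bare counter cannot offer.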

Hot list queries
• Why is it interesting:
  • Top ten – best seller list
  • Load balancing
  • Caching policies

Hot list queries
• Let’s use sampling
[Figure: stream edoejddkaklsadkjdkdkpryekfvcuszldfoasd; sample djkkdkvza; sample stored with counts as k3 d2 j v z a]

Hot list queries
• The question is:
  • How to sample if we don’t know our sample size?

Gibbons & Matias’ algorithm
• Produced values: a b c d a b a c a a b d b a d d
• Hotlist: four slots holding a, b, c, d, each with a count; p = 1.0
[Figure: bar chart of the hotlist counts]

Gibbons & Matias’ algorithm
• Produced values: a b c d a b a c a a b d b a d d, then e
• Need to replace one value
• Hotlist: four slots holding a, b, c, d; p = 1.0
[Figure: bar chart of the hotlist counts]

Gibbons & Matias’ algorithm
• Multiply p by some amount f (here f = 0.75)
• Throw biased coins with probability f
• Replace counts by number of seen heads
• Produced values: a b c d a b a c a a b d b a d d, then e
• Hotlist: four slots holding a, b, c, d; p = 0.75
[Figure: hotlist counts before and after the coin tosses]

Gibbons & Matias’ algorithm
• Replace a value which has zero count
• Hotlist: four slots, now holding a, b, e, d; p = 0.75
• Count/p is an estimate of the number of times a value has been seen
  • E.g., the value ‘a’ has count 4, so it has been seen about 4/p = 5.33 times
[Figure: hotlist counts after e replaces the zero-count value]
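
The steps on these slides can be put together in a minimal sketch. It follows the slide's description (sample each occurrence with probability p; when the list is full, multiply p by f and keep each counted occurrence with probability f; drop zero counts), which may differ in detail from the published algorithm. The class name, its parameters and the handling of the incoming value are assumptions.

```python
import random

class HotList:
    """Sampled hot list following the slide's steps (names are illustrative)."""

    def __init__(self, capacity, f=0.75, seed=None):
        self.capacity = capacity   # number of hot-list slots
        self.f = f                 # factor applied to p when the list is full
        self.p = 1.0               # current sampling probability
        self.counts = {}           # value -> sampled count
        self.rng = random.Random(seed)

    def _thin(self):
        # Multiply p by f; keep each counted occurrence with probability f.
        self.p *= self.f
        for value, count in list(self.counts.items()):
            heads = sum(self.rng.random() < self.f for _ in range(count))
            if heads:
                self.counts[value] = heads
            else:
                del self.counts[value]      # zero count: the slot is freed

    def add(self, value):
        if self.rng.random() >= self.p:
            return                          # occurrence not sampled
        if value not in self.counts:
            while len(self.counts) >= self.capacity:
                self._thin()
                if self.rng.random() >= self.f:
                    return                  # new occurrence lost its coin toss
        self.counts[value] = self.counts.get(value, 0) + 1

    def estimate(self, value):
        # Count/p estimates how often the value has been seen.
        return self.counts.get(value, 0) / self.p

hot = HotList(capacity=4, seed=1)
for v in "abcdabacaabdbadd" + "e":
    hot.add(v)
print(hot.p, hot.counts)
```

The sample size never has to be fixed in advance: p simply shrinks whenever the four slots overflow.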

Counters
• How many bits are needed to count?
• Prefix code
• Approximate counters

Rarity
• Paul goes fishing.
• There are many different fish species U={1,..,u}
• Paul catches one fish at a time, a_t ∈ U
• C_t[j] = |{a_i | a_i = j, i ≤ t}| – the number of times Paul has caught species j up to time t
• Species j is rare at time t if it appears only once
• ρ[t] = |{j | C_t[j] = 1}| / u

Rarity
• Why is it interesting?

Again, let’s use sampling
• U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}
• U’ = {4,9,13,18,24} – a sample of k species
• X_t[i] = |{j | a_j = U’[i], j ≤ t}|

Again, let’s use sampling
• X_t[i] = |{j | a_j = U’[i], j ≤ t}|
• Reminder: ρ[t] = |{i | C_t[i] = 1}| / u
• Estimate: ρ’[t] = |{i | X_t[i] = 1}| / k
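
The exact rarity and its sampled estimate can be compared on a toy stream. This sketch assumes the definitions ρ[t] = |{j : C_t[j] = 1}|/u and ρ’[t] = |{i : X_t[i] = 1}|/k, with U’ fixed before the stream starts; the function names and the example stream are illustrative.

```python
from collections import Counter

def rarity(stream, u):
    """Exact rarity: fraction of the u species caught exactly once."""
    counts = Counter(stream)
    return sum(1 for j in range(1, u + 1) if counts[j] == 1) / u

def sampled_rarity(stream, sample):
    """Estimate from k counters, one per sampled species in U'."""
    tracked = set(sample)
    x = Counter(a for a in stream if a in tracked)   # only k counters are kept
    return sum(1 for s in sample if x[s] == 1) / len(sample)

catches = [1, 2, 2, 3, 4, 4, 4]       # species caught so far, u = 5
print(rarity(catches, u=5))           # species 1 and 3 are rare
print(sampled_rarity(catches, sample=[1, 2, 3]))
```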

Rarity
• But ρ[t] needs to be at least 1/k to get a good estimator.

Min-wise independent hash functions
• A family H of hash functions [n] → [n] is called min-wise independent
• If for any X ⊆ [n] and x ∈ X, for h chosen at random from H: Pr[h(x) = min h(X)] = 1/|X|
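
The definition can be verified exhaustively for a tiny case. The sketch below takes H to be the family of all permutations of [0..n-1], which is min-wise independent by symmetry, and computes for a given X the exact probability that each x attains the minimum; the function name is illustrative and the enumeration is only practical for very small n.

```python
from fractions import Fraction
from itertools import permutations

def min_probabilities(n, X):
    """For h drawn uniformly from all permutations of range(n),
    return Pr[h(x) = min h(X)] for each x in X, as exact fractions."""
    perms = list(permutations(range(n)))
    hits = {x: 0 for x in X}
    for pi in perms:                      # pi[x] plays the role of h(x)
        winner = min(X, key=lambda x: pi[x])
        hits[winner] += 1
    return {x: Fraction(c, len(perms)) for x, c in hits.items()}

print(min_probabilities(4, [0, 2, 3]))    # each probability is 1/|X|
```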
