
Randomization for Massive and Streaming Data Sets


Presentation Transcript


  1. Randomization for Massive and Streaming Data Sets Rajeev Motwani CS Forum Annual Meeting

  2. Data Stream Management Systems • Traditional DBMS – data stored in finite, persistent data sets • Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, … • Emerging DSMSs – a variety of modern applications • Network monitoring and traffic engineering • Telecom call records • Network security • Financial applications • Sensor networks • Manufacturing processes • Web logs and clickstreams • Massive data sets

  3. DSMS – Big Picture: input streams enter the DSMS, which maintains a Scratch Store, an Archive, and Stored Relations; registered queries produce Streamed Results and Stored Results.

  4. Algorithmic Issues • Computational Model • Streaming data (or, secondary memory) • Bounded main memory • Techniques • New paradigms • Negative Results and Approximation • Randomization • Complexity Measures • Memory • Time per item (online, real-time) • # Passes (linear scan in secondary memory)

  5. Stream Model of Computation: a data stream (e.g., bits 1 1 0 0 1 0 1 1 1 0 1 arriving over increasing time) is processed against Main Memory holding Synopsis Data Structures. Memory: poly(1/ε, log N). Query/Update Time: poly(1/ε, log N). N: # items so far, or window size. ε: error parameter.

  6. “Toy” Example – Network Monitoring: network measurements and packet traces feed the DSMS (with a Scratch Store, Archive, and Lookup Tables); registered monitoring queries produce intrusion warnings and online performance metrics.

  7. Frequency Related Problems – Analytics on Packet Headers (IP Addresses) • Top-k most frequent elements • Find all elements with frequency > 0.1% • Find elements that occupy 0.1% of the tail • What is the frequency of element 3? • What is the total frequency of elements between 8 and 14? • How many elements have non-zero frequency? • Mean + Variance? Median?

  8. Example 1 – Distinct Values • Input Sequence X = x1, x2, …, xn, … • Domain U = {0, 1, 2, …, u-1} • Compute D(X), the number of distinct values • Remarks • Assume stream size n is finite/known (generally, n is the window size) • Domain could be arbitrary (e.g., text, tuples)

  9. Naïve Approach • Counter C(i) for each domain value i • Initialize counters C(i) ← 0 • Scan X, incrementing the appropriate counters • Problem • Memory size M << n • Space O(u) – possibly u >> n (e.g., when counting distinct words in a web crawl)
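
A minimal Python sketch of this naïve approach (mine, not from the slides): one counter per domain value, which is exactly the O(u) array that breaks down when u >> n.

    # Naive distinct counting: one counter per domain value.
    # Fine when the domain U is small; infeasible when u >> n.
    def count_distinct_naive(stream, u):
        counters = [0] * u                 # C(i) for every domain value i
        for x in stream:
            counters[x] += 1               # one scan, incrementing counters
        return sum(1 for c in counters if c > 0)

    print(count_distinct_naive([3, 1, 4, 1, 5, 3], u=10))   # -> 4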

  10. Negative Result Theorem: Deterministic algorithms need M = Ω(n log u) bits Proof: information-theoretic arguments Note: leaves open randomization/approximation

  11. Randomized Algorithm Hash each item of the input stream into a hash table via h: U → [1..t] Analysis • Random h ⇒ few collisions & average list size O(n/t) • Thus • Space: O(n) – since we need t = Ω(n) • Time: O(1) per item [expected]
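
A small Python sketch of this hash-table algorithm (illustrative only): the built-in set plays the role of the hash table with h, giving O(1) expected time per item but O(n) space.

    # Exact distinct-value counting with a hash table.
    def count_distinct_exact(stream):
        seen = set()              # hash table of values observed so far
        for x in stream:
            seen.add(x)           # expected O(1) per item
        return len(seen)

    # Example: 6 items, 4 distinct values
    print(count_distinct_exact([3, 1, 4, 1, 5, 3]))   # -> 4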

  12. Improvement via Sampling? • Sample-based Estimation • Random Sample R (of size r) of the n values in X • Compute D(R) • Estimator E = D(R) × n/r • Benefit – sublinear space • Cost – estimation error is high • Why? – low-frequency values are underrepresented
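
A sketch of the sample-based estimator in Python (function and variable names are my own); as the slide notes, the estimate can be far off because low-frequency values are underrepresented in the sample.

    import random

    # Naive sample-based estimator: scale up the number of distinct
    # values seen in a random sample of r out of n stream items.
    def estimate_distinct_by_sampling(items, r):
        n = len(items)
        sample = random.sample(items, r)      # random sample R of size r
        return len(set(sample)) * n / r       # E = D(R) * n / r

    # Example: 1,000 items, 500 distinct values, sample 10% of them
    data = list(range(500)) + [0] * 500
    random.shuffle(data)
    print(estimate_distinct_by_sampling(data, 100))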

  13. Negative Result for Sampling • Consider an estimator E of D(X) that examines only r items of X • Possibly in an adaptive/randomized fashion. Theorem: Any such estimator E must incur large relative error with at least constant probability. • Remarks • r = n/10 ⇒ error 75% with probability ½ • Leaves open randomization/approximation on full scans

  14. Randomized Approximation • Simplified Problem – For fixed t, is D(X) >> t? • Choose hash function h: U → [1..t] • Initialize the answer (a single Boolean flag) to NO • For each xi in the input stream, if h(xi) = t, set the answer to YES • Observe – need only 1 bit of memory! • Theorem: • If D(X) < t, P[output NO] > 0.25 • If D(X) > 2t, P[output NO] < 0.14
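
A minimal sketch of this one-bit test, assuming a salted built-in hash stands in for a truly random hash function h: U → [1..t]:

    import random

    # One-bit test for "is D(X) much larger than t?"
    def distinct_exceeds_t(stream, t, seed=0):
        flag = False                              # the single bit of state
        for x in stream:
            h = hash((seed, x)) % t + 1           # bucket in 1..t
            if h == t:
                flag = True                       # some item hit bucket t
        return flag                               # True ~ "D(X) >> t"

    # If D(X) < t,  P[flag stays False] > (1 - 1/t)^t  > 0.25
    # If D(X) > 2t, P[flag stays False] < (1 - 1/t)^2t < 1/e^2 ~ 0.14
    print(distinct_exceeds_t(range(100), t=1000))      # few distinct values
    print(distinct_exceeds_t(range(100000), t=1000))   # many distinct values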

  15. Analysis • Let Y be the set of distinct elements of X • Output NO ⇔ no element of Y hashes to t • P[an element hashes to t] = 1/t • Thus – P[output NO] = (1 - 1/t)^|Y| • Since |Y| = D(X): • D(X) < t ⇒ P[output NO] > (1 - 1/t)^t > 0.25 • D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e² ≈ 0.14

  16. Boosting Accuracy • With 1 bit we can distinguish D(X) < t from D(X) > 2t • Running O(log 1/δ) instances in parallel ⇒ reduces the error probability to any δ > 0 • Running O(log n) copies in parallel for t = 1, 2, 4, 8, …, n ⇒ can estimate D(X) within factor 2 • The choice of multiplier 2 is arbitrary ⇒ a factor of (1+ε) reduces the error to ε • Theorem: Can estimate D(X) within factor (1±ε) with probability (1-δ) using space poly(1/ε, log 1/δ, log n)
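
A sketch of the boosting idea under the same salted-hash assumption; the vote threshold and constants below are my own choices, not from the slides. It runs the one-bit test for t = 1, 2, 4, … with several independent copies per t and reports the first t at which the copies mostly answer NO; since P[NO] > 0.25 when D(X) < t and P[NO] < 0.14 when D(X) > 2t, a threshold between those values separates the two cases with high probability.

    import random, math

    def estimate_distinct(items, n_max, delta=0.05):
        items = list(items)
        copies = 16 * max(1, math.ceil(math.log(1.0 / delta)))
        t = 1
        while t <= n_max:
            no_votes = 0
            for _ in range(copies):
                seed = random.randrange(1 << 30)
                hit = any(hash((seed, x)) % t + 1 == t for x in items)
                if not hit:
                    no_votes += 1
            if no_votes / copies > 0.20:   # threshold between 0.14 and 0.25
                return t                   # D(X) is (roughly) below 2t
            t *= 2
        return n_max

    # D(X) = 5000; the estimate should land within a small constant factor
    print(estimate_distinct(range(5000), n_max=1 << 20))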

  17. Example 2 – Elephants-and-Ants Stream • Identify items whose current frequency exceeds support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

  18. Algorithm 1: Lossy Counting Step 1: Divide the stream into ‘windows’ (Window 1, Window 2, Window 3, …) Window size W is a function of the support s – specified later…

  19. Lossy Counting in Action … Frequency counts start empty; items in the first window increment their counters. At the window boundary, decrement all counters by 1.

  20. Lossy Counting continued … The surviving frequency counts carry over into the next window and keep being incremented. At each window boundary, decrement all counters by 1.

  21. Error Analysis How much do we undercount? If the current stream length is N and the window size is W = 1/ε, then # windows = εN, so the frequency error is at most εN (one decrement per window boundary). Rule of thumb: Set ε = 10% of the support s Example: Given support frequency s = 1%, set error frequency ε = 0.1%

  22. Putting it all together… Output: Elements with counter values exceeding (s-ε)N Approximation guarantees • Frequencies underestimated by at most εN • No false negatives • False positives have true frequency at least (s-ε)N How many counters do we need? • Worst-case bound: (1/ε) log(εN) counters • Implementation details…
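
A minimal Python sketch of Lossy Counting as the slides describe it (the published algorithm also stores a per-entry error term Δ, which this simplified version omits):

    import random

    def lossy_counting(stream, support, epsilon):
        window = int(1 / epsilon)         # window size W = 1/epsilon
        counts = {}                       # element -> (under)count
        n = 0
        for x in stream:
            n += 1
            counts[x] = counts.get(x, 0) + 1
            if n % window == 0:           # window boundary
                for key in list(counts):  # decrement all counters by 1
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        # Output elements whose counter exceeds (s - epsilon) * N
        threshold = (support - epsilon) * n
        return {x: c for x, c in counts.items() if c > threshold}

    # Example: one 2%-frequency element hidden in a uniform background
    data = [0] * 2000 + list(range(1, 98001))
    random.shuffle(data)
    print(lossy_counting(data, support=0.01, epsilon=0.001))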

  23. Algorithm 2: Sticky Sampling Stream: 28 31 41 34 15 30 23 35 19 … • Create counters by sampling • Maintain exact counts thereafter • What is the sampling rate?

  24. Sticky Sampling contd… For a finite stream of length N: sampling rate = (2/(εN)) log(1/(sδ)), where δ = probability of failure Output: Elements with counter values exceeding (s-ε)N Approximation guarantees (probabilistic) • Frequencies underestimated by at most εN • No false negatives • False positives have true frequency at least (s-ε)N Same error guarantees as Lossy Counting, but probabilistic Same rule of thumb: Set ε = 10% of the support s Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%

  25. Number of counters? • Finite stream of length N: sampling rate = (2/(εN)) log(1/(sδ)) • Infinite stream with unknown N: gradually adjust the sampling rate • In either case, expected number of counters = (2/ε) log(1/(sδ)) – independent of N
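
A sketch of Sticky Sampling for a finite stream of known length N (names and the exact bookkeeping are my own simplification; the infinite-stream version adjusts the rate as N grows): an element with no counter is caught with probability p, and once caught every later occurrence is counted exactly.

    import random, math

    def sticky_sampling(stream, N, support, epsilon, delta):
        p = (2.0 / (epsilon * N)) * math.log(1.0 / (support * delta))
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1                 # exact counting once sampled
            elif random.random() < p:
                counts[x] = 1                  # create a counter by sampling
        threshold = (support - epsilon) * N
        return {x: c for x, c in counts.items() if c > threshold}

    # Same example stream as before: one 2%-frequency element
    data = [0] * 2000 + list(range(1, 98001))
    random.shuffle(data)
    print(sticky_sampling(data, N=len(data), support=0.01,
                          epsilon=0.001, delta=0.0001))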

  26. Example 3 – Correlated Attributes • Input Stream – items with boolean attributes • Matrix – M(r,c) = 1 ⇔ Row r has Attribute c • Identify – highly-correlated column-pairs

         C1 C2 C3 C4 C5
     R1   1  1  1  1  0
     R2   1  1  0  1  0
     R3   1  0  0  1  0
     R4   0  0  1  0  1
     R5   1  1  1  0  1
     R6   1  1  1  1  1
     R7   0  1  1  1  1
     R8   0  1  1  1  0
     …

  27. Correlation ⇒ Similarity • View a column as the set of row-indexes where it has 1’s • Set Similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj| • Example:

         Ci Cj
          0  1
          1  0
          1  1     sim(Ci, Cj) = 2/5 = 0.4
          0  0
          1  1
          0  1
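
A tiny Python sketch of the Jaccard measure on the example pair of columns above (the column encodings follow my reading of the slide's table):

    # Jaccard similarity of two boolean columns, viewed as the sets of
    # row indexes where each column has a 1.
    def jaccard(col_i, col_j):
        rows_i = {r for r, bit in enumerate(col_i) if bit}
        rows_j = {r for r, bit in enumerate(col_j) if bit}
        return len(rows_i & rows_j) / len(rows_i | rows_j)

    # 2 rows in common, 5 rows in the union
    ci = [0, 1, 1, 0, 1, 0]
    cj = [1, 0, 1, 0, 1, 1]
    print(jaccard(ci, cj))   # -> 0.4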

  28. Identifying Similar Columns? • Goal – find candidate pairs in small memory • Signature Idea • Hash columns Ci to small signatures sig(Ci) • Set of signatures fits in memory • sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj)) • Naïve Approach • Sample P rows uniformly at random • Define sig(Ci) as the P bits of Ci in the sample • Problem • Sparsity: the sample would miss the interesting part of the columns and get only 0’s in most of them

  29. Key Observation • For columns Ci, Cj, there are four types of rows:

         Ci Cj
     A    1  1
     B    1  0
     C    0  1
     D    0  0

• Overload notation: A = # rows of type A (similarly B, C, D) • Observation: sim(Ci, Cj) = A / (A + B + C)

  30. Min Hashing • Randomly permute the rows • Hash h(Ci) = index of the first row (in permuted order) with a 1 in column Ci • Surprising Property: P[h(Ci) = h(Cj)] = sim(Ci, Cj) • Why? • Both equal A/(A+B+C) • Look down columns Ci, Cj until the first non-Type-D row • h(Ci) = h(Cj) ⇔ that row is of Type A
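
A quick empirical check of this property (a simulation I added, not part of the slides): over many random row permutations, the fraction with h(Ci) = h(Cj) should approach sim(Ci, Cj) = 0.4 for the example pair above.

    import random

    def min_hash(column, perm):
        # index (under the permutation) of the first row with a 1
        return min(perm[r] for r, bit in enumerate(column) if bit)

    ci = [0, 1, 1, 0, 1, 0]
    cj = [1, 0, 1, 0, 1, 1]
    rows = list(range(len(ci)))
    trials, agree = 20000, 0
    for _ in range(trials):
        perm = rows[:]
        random.shuffle(perm)                 # random row permutation
        if min_hash(ci, perm) == min_hash(cj, perm):
            agree += 1
    print(agree / trials)                    # approx 0.4 = sim(ci, cj)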

  31. Min-Hash Signatures • Pick k random row permutations • Min-Hash Signature sig(C) = the k indexes of the first rows with 1 in column C • Similarity of signatures • Define: sim(sig(Ci), sig(Cj)) = fraction of permutations where the Min-Hash values agree • Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

  32. Example

     Input matrix:
          C1 C2 C3
     R1    1  0  1
     R2    0  1  1
     R3    1  0  0
     R4    1  0  1
     R5    0  1  0

     Signatures:
                        S1 S2 S3
     Perm 1 = (12345)    1  2  1
     Perm 2 = (54321)    4  5  4
     Perm 3 = (34512)    3  5  4

     Similarities:
              1-2   1-3   2-3
     Col-Col  0.00  0.50  0.25
     Sig-Sig  0.00  0.67  0.00

  33. Implementation Trick • Permuting the rows even once is prohibitive • Row Hashing • Pick k hash functions hk: {1,…,n} → {1,…,O(n)} • The ordering under hk gives a random row permutation • One-pass implementation
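
A sketch of the one-pass, row-hashing implementation (a salted built-in hash is assumed in place of the hash family hk; function names are mine). It builds a k × #columns signature matrix in a single pass over the rows and estimates column similarity from it.

    import random

    def minhash_signatures(rows, num_columns, k, seeds=None):
        seeds = seeds or [random.randrange(1 << 30) for _ in range(k)]
        INF = float("inf")
        sig = [[INF] * num_columns for _ in range(k)]   # k x #columns
        for r, row in enumerate(rows):                  # single pass over rows
            hashes = [hash((s, r)) for s in seeds]      # h_1(r), ..., h_k(r)
            for c, bit in enumerate(row):
                if bit:
                    for i in range(k):
                        sig[i][c] = min(sig[i][c], hashes[i])
        return sig

    def signature_similarity(sig, c1, c2):
        # fraction of hash functions on which the min-hash values agree
        return sum(row[c1] == row[c2] for row in sig) / len(sig)

    # The slide's 5x3 example matrix (rows R1..R5, columns C1..C3)
    M = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0]]
    sig = minhash_signatures(M, num_columns=3, k=100)
    print(signature_similarity(sig, 0, 2))   # approx 0.5 = sim(C1, C3)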

  34. Comparing Signatures • Signature Matrix S • Rows = hash functions • Columns = columns • Entries = signatures • Need – pair-wise similarity of signature columns • Problem • MinHash fits the column signatures in memory • But comparing all signature-pairs takes too much time • Limiting candidate pairs – Locality-Sensitive Hashing

  35. Summary • New algorithmic paradigms needed for streams and massive data sets • Negative results abound • Need to approximate • Power of randomization

  36. Thank You!

  37. References Rajeev Motwani (http://theory.stanford.edu/~rajeev) STREAM Project (http://www-db.stanford.edu/stream) • STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering 2003. • Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003. • Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002. • Manku-Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002. • Babcock-Datar-Motwani-O’Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003. • Guha-Meyerson-Mishra-Motwani-O’Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

  38. References (contd) • Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002. • Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002. • O’Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003. • Guha-Mishra-Motwani-O’Callaghan. Clustering Data Streams. FOCS 2000. • Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000. • Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000. • Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999. • Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
