
Algorithms for Data Streams




Presentation Transcript


  1. Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2nd Dec, 2002

  2. Motivation • Traditional DBMS – data stored in finite, persistent data sets • New applications – data input as continuous, ordered data streams • Network monitoring and traffic engineering • Telecom call records • Financial applications • Sensor networks • Web logs and clickstreams

  3. Data Stream Model • Data elements in the stream arrive online • The system has no control over arrival order, either within a data stream or across streams • Data streams are potentially unbounded in size • Once an element from a data stream has been processed, it is discarded unless explicitly archived.
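
The model above can be illustrated with a minimal sketch: elements arrive once, in order, and must be processed with state whose size is independent of the stream length (the function name and example values are illustrative):

```python
def running_mean(stream):
    """One-pass mean over a stream using O(1) state; each element is
    seen exactly once and discarded after it is processed."""
    count, total = 0, 0.0
    for x in stream:          # online arrival: no control over order
        count += 1
        total += x
    return total / count if count else 0.0

print(running_mean(iter([3, 1, 4, 1, 5, 9])))  # 23/6 ≈ 3.8333
```

Passing an iterator (rather than a list) makes the single-pass constraint explicit: once consumed, the elements cannot be revisited.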

  4. Goals • To identify the needs of data stream applications • To study algorithms for data stream applications

  5. Sample Applications • Network security (e.g., iPolicy, NetForensics/Cisco, Niksun) • Network packet streams, user session information • Queries: URL filtering, detecting intrusions, DoS attacks and viruses • Financial applications (e.g., Traderbot) • Streams of trading data, stock tickers, news feeds • Queries: arbitrage opportunities, analytics, patterns

  6. Distributed Streams Evaluation • Logical stream = many physical streams • maintain top 100 Yahoo pages • Correlate streams at distributed servers • network monitoring • Many streams controlled by few servers • sensor networks • Issues • Move processing to streams, not streams to processors

  7. Synopses • Queries may access or aggregate past data • Need a bounded-memory approximation of stream history • What is a synopsis? • A succinct summary of old stream tuples • Like an index, but the base data is unavailable • Examples • Sliding windows • Samples • Sketches • Histograms • Wavelet representations
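
One of the synopsis types listed above, samples, has a classic bounded-memory construction: reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of k elements in a single pass. This sketch uses a fixed seed only to make it reproducible:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: maintain a uniform random sample of k elements
    from a stream of unknown length, using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)           # fill the reservoir first
        else:
            j = rng.randrange(n)       # uniform in [0, n)
            if j < k:                  # keep x with probability k/n
                sample[j] = x
    return sample
```

Each element ends up in the sample with probability exactly k/n, where n is the total number of elements seen, without knowing n in advance.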

  8. 1 1 0 0 1 0 1 1 1 0 1 Model of Computation Synopses/Data Structures Increasing time Memory: poly(1/ε, log N) Query/Update Time: poly(1/ε, log N) N: # tuples so far, or window size ε:error parameter Data Stream

  9. Algorithmic Issues • Sketching techniques • S = {x1, …, xN}, xi ∈ {1, …, d}; mi = |{ j : xj = i }| • The k-th frequency moment of S is Fk = Σi mi^k • Wavelets • Coefficients are projections of the given signal onto an orthogonal set of basis vectors • Higher-valued coefficients retain most of the information • Sliding windows • Prevent stale data from influencing analysis and statistics • Statistics, including sketches, can be maintained over sliding windows
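
The frequency-moment definition Fk = Σi mi^k is easy to compute exactly when memory is not constrained; this baseline (which uses linear space, precisely what the sketching techniques above avoid) makes the special cases concrete:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i m_i^k, where m_i is the
    number of occurrences of item i. Uses O(d) memory — the streaming
    algorithms approximate this in polylogarithmic space."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

s = [1, 2, 2, 3, 3, 3]
frequency_moment(s, 0)  # 3  — number of distinct items
frequency_moment(s, 1)  # 6  — the stream length
frequency_moment(s, 2)  # 14 — 1 + 4 + 9, used for self-join size estimates
```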

  10. Streaming Algorithms [Bar-Yossef, Kumar, Sivakumar] • Input: a string σ(x), an error parameter ε, a confidence parameter 0 ≤ δ < 1, one-pass access to σ(x) • Output: a streaming algorithm gives an ε-approximation of f(x) with probability ≥ 1 − δ, for any input x and for any permutation σ • Frequency moments can be computed to find the number of distinct elements in a stream • F0 can be computed using O((1/ε³) log(1/δ) log m) space and processing time per data item • Count the number of triangles in a graph presented as a stream • Each edge is a data item (adjacency stream), or • Each node together with its neighbors is a data item (incidence stream)
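
To give a flavor of small-space F0 estimation, here is one simple hash-based approach, the k-minimum-values sketch — not the algorithm cited on the slide, but the same idea of trading exactness for bounded space. The hash function and parameter choices are illustrative:

```python
import hashlib

def _h(x):
    """Hash an item to a pseudo-uniform value in [0, 1) via SHA-256."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv_distinct(stream, k=128):
    """k-minimum-values sketch: keep the k smallest distinct hash values;
    if the k-th smallest is v, estimate the distinct count as (k - 1) / v.
    Memory is O(k), independent of the number of distinct items."""
    smallest = set()
    for x in stream:
        smallest.add(_h(x))                    # duplicates hash identically
        if len(smallest) > k:
            smallest.discard(max(smallest))    # retain only the k smallest
    if len(smallest) < k:
        return len(smallest)                   # fewer than k distinct: exact
    return int((k - 1) / max(smallest))
```

The `max` scan makes each update O(k); a heap would make it O(log k), kept simple here for clarity. The relative error shrinks roughly as 1/√k.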

  11. Streaming Algorithms [Ajtai et al.] • Measures sortedness • Estimates the number of inversions in a permutation to within a (1 + ε) factor • Motivation • Smart engineering of sorting algorithms • Evaluating ranking functions that define permutations • Complexity • Requires O(log N · log log N) space and O(log N) time per data element.
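
For contrast with the streaming estimate, the exact inversion count has a standard offline solution via merge sort, which needs the whole array in memory:

```python
def count_inversions(a):
    """Exact inversion count via merge sort: O(n log n) time, O(n) space.
    Returns (sorted list, number of pairs i < j with a[i] > a[j])."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, x = count_inversions(a[:mid])
    right, y = count_inversions(a[mid:])
    merged, z, i, j = [], 0, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            z += len(left) - i     # right[j] is inverted with all remaining left items
    merged += left[i:] + right[j:]
    return merged, x + y + z

count_inversions([3, 1, 2])[1]  # 2 — the inverted pairs (3,1) and (3,2)
```

The point of the streaming result above is precisely that an approximate answer is possible without the O(n) space this exact method requires.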

  12. Clustering Data Streams • The k-median problem for data streams [Guha, Mishra, Motwani and O'Callaghan] • In the k-median problem, the objective is to minimize the average distance from data points to their closest cluster centers • The k-median problem is closely related to the facility-location problem • In the k-center problem, the objective is to minimize the maximum radius of a cluster

  13. Algorithm • Based on divide-and-conquer • Runs in O(n^(1+ε)) time and uses O(n^ε) memory • Makes a single pass over the data • Randomization reduces the running time to O(nk) in one pass • No deterministic algorithm can achieve a bounded approximation in o(nk) time

  14. Divide-and-Conquer • Algorithm Small-space(S): 1. Divide S into L disjoint pieces X1, …, XL 2. For each i, find O(k) centers in Xi; assign each point in Xi to its closest center 3. Let X' be the O(Lk) centers obtained in (2), where each center c is weighted by the number of points assigned to it 4. Cluster X' to find k centers [using a c-approximation algorithm].
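
Steps 1–4 can be sketched as follows. `k_medians` here is a deliberately trivial stand-in for the c-approximation subroutine the slide assumes (a real implementation would use local search or primal-dual), and points are 1-D numbers to keep the distance function simple:

```python
def k_medians(points, weights, k):
    """Placeholder for a c-approximation k-median subroutine:
    greedily keeps the k heaviest-weighted points as centers."""
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    return [points[i] for i in order[:k]]

def assign(points, centers):
    """Map each point to its nearest center (1-D distance for simplicity)."""
    return [min(centers, key=lambda c: abs(p - c)) for p in points]

def small_space(S, L, k):
    """Small-space(S): partition into L pieces, cluster each piece to O(k)
    weighted centers, then cluster the O(Lk) weighted centers to k centers."""
    pieces = [S[i::L] for i in range(L)]            # step 1: L disjoint pieces
    centers, weights = [], []
    for X in pieces:                                # step 2: cluster each piece
        cs = k_medians(X, [1] * len(X), 2 * k)      # O(k) centers per piece
        counts = {c: 0 for c in cs}
        for c in assign(X, cs):                     # weight = # points assigned
            counts[c] += 1
        for c, w in counts.items():
            centers.append(c); weights.append(w)
    return k_medians(centers, weights, k)           # steps 3–4: cluster X'
```

Only the structure is faithful here; the approximation guarantee of the theorems below depends on the quality of the inner clustering subroutine, not on this placeholder.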

  15. Theorems… • Theorem 1: Given an instance of the k-median problem with a solution of cost C, where the medians need not belong to the set of input points, there exists a solution of cost 2C in which all medians belong to the set of input points. • Proof: Let j1, …, jq be the points assigned to median i in the solution of cost C. Take jl, the point among them closest to i, as the new median in place of i. By the triangle inequality cxy ≤ cxi + ciy, the assignment distance of every point jr at most doubles: c(jr, jl) ≤ c(jr, i) + c(i, jl) ≤ 2·c(jr, i). Summed over all n points in the original set, the total cost is at most 2C.

  16. Theorems… • Theorem 2: If the sum of the costs of the L optimum k-median solutions for X1, …, XL is C, and C* is the cost of the optimum k-median solution for the entire set S, then there exists a solution of cost ≤ 2(C + C*) to the new weighted instance X'. • Proof sketch: Assign each weighted center i' ∈ X' to the optimum median of S serving one of the points that i' represents. By the triangle inequality, routing the weight of each original point through its center i' to that optimum median costs at most its distance to i' plus its distance to the optimum median, so the total cost is at most C + C*. Restricting the medians to points of X' at most doubles this cost by Theorem 1, giving 2(C + C*).

  17. Data Stream Algorithm • Input the first m points and reduce them to O(k) points; the weight of each intermediate median is the number of points assigned to it • Repeat the above until m²/(2k) of the original data points have been seen; there are now m intermediate medians • Cluster the m first-level medians into 2k second-level medians • In general, maintain m level-i medians; on seeing m of them, generate 2k level-(i+1) medians with weights as defined earlier • On seeing all the original data points, cluster all intermediate medians into k final medians • # levels = O(log(n/m) / log(m/k)) • If k << m and m = O(n^ε) for constant ε, this gives an O(1)-approximation with running time O(n^(1+ε)).
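
The multi-level buffering described above can be sketched generically. `reduce_fn` stands in for the clustering subroutine (it maps a buffer of points to at most 2k weighted representatives); everything else is the level-promotion bookkeeping from the slide:

```python
def stream_k_medians(stream, m, k, reduce_fn):
    """One-pass hierarchical clustering skeleton: buffer m items per level;
    when a level fills, reduce it to 2k representatives one level up.
    reduce_fn(points) -> list of representative points (hypothetical hook)."""
    levels = [[]]
    for x in stream:
        levels[0].append(x)
        i = 0
        while len(levels[i]) >= m:              # level full: promote upward
            reduced = reduce_fn(levels[i])[: 2 * k]
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(reduced)
            i += 1
    leftovers = [p for lev in levels for p in lev]
    return reduce_fn(leftovers)[:k]             # final pass: k medians
```

At any moment at most O(m) points are buffered per level, and the number of levels is O(log(n/m)/log(m/k)), matching the bound on the slide. (This skeleton drops the per-representative weights for brevity; a faithful version would carry them through `reduce_fn`.)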

  18. Randomized Clustering • Input O(M/k) points and sample to cluster them to 2k intermediate medians (M = memory size) • Use a local-search algorithm to cluster O(M) intermediate medians of level i into 2k medians of level i+1 • Use a primal-dual algorithm to cluster the final O(k) medians into k medians • Running time is O(nk log n) in one pass, using n^ε memory for small k

  19. Open Problems • Are there any "killer apps" for data stream systems? • Techniques that maintain correlated aggregates with provable bounds • How to cluster and maintain summaries using sliding windows? • How to deal with distributed streams and perform clustering on them?
