Clustering Data Streams A presentation by George Toderici
Talk Outline • Goals of the paper • Notation reminder • Clustering With Little Memory • Data Stream Model • Clustering with Data Streams • Lower Bounds and Deterministic Algorithms • Conclusion
Goals of the paper • Since the k-Median problem is NP-hard, this paper develops approximation algorithms under the following constraints: • Minimize memory usage • Minimize CPU usage • Work both on general metric spaces and on the special case of Euclidean space
Notation Reminder • O(g(n)) – running time is upper bounded by g(n) (up to a constant factor) • Ω(g(n)) – running time is lower bounded by g(n) • o(g(n)) – running time grows strictly slower than g(n) • Θ(g(n)) – running time is bounded both above and below by g(n) (a tight bound) • Soft-Oh: Õ(g(n)) = O(g(n) · polylog(n)), i.e., Big-Oh with polylogarithmic factors suppressed
Paper-specific Notation • c_ij is the distance between points i and j • d_i is the number of points assigned to median i • NOTE: Do not confuse c and d. Presumably the distance is written c_ij because a distance can be treated as a "cost"; calling it "d", from the word "distance", would have been more intuitive.
Clustering with little memory Algorithm: SmallSpace(S) 1. Divide S into l disjoint pieces X_1, …, X_l 2. For each X_i, find O(k) centers in it; assign each point to its closest center 3. Let X' be the set of O(lk) centers obtained, where each center is weighted by the number of points assigned to it 4. Cluster X' to find k centers
[Figure: the input S is split into chunks; each chunk is clustered in main memory to O(k) centers, and the resulting weighted centers X' are clustered again to k.]
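To make the four steps concrete, here is a minimal Python sketch of SmallSpace. The helper k_median(points, weights, n_centers) is a hypothetical stand-in for any constant-factor k-median subroutine (the paper uses bicriteria and primal-dual algorithms in these roles), and plain k centers per piece stand in for the O(k) of step 2:

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for each point (Euclidean distances)."""
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def small_space(S, k, l, k_median):
    """One level of SmallSpace. k_median(points, weights, n_centers) is a
    hypothetical stand-in for the paper's clustering subroutines."""
    chunks = np.array_split(S, l)                 # step 1: l disjoint pieces
    centers, weights = [], []
    for X in chunks:                              # step 2: cluster each piece
        c = k_median(X, np.ones(len(X)), k)
        a = nearest_center(X, c)
        for j in range(len(c)):                   # step 3: weight each center
            w = int((a == j).sum())
            if w > 0:
                centers.append(c[j])
                weights.append(w)
    X_prime = np.array(centers)                   # the weighted instance X'
    w_prime = np.array(weights, dtype=float)
    return k_median(X_prime, w_prime, k)          # step 4: k final centers
```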
SmallSpace analysis • Since we are interested in using as little memory as possible, l has to be chosen so that each partition of S and the set X' both fit in main memory. However, no such l may exist if S is very large. • We will use this algorithm as a starting point and improve it so that it satisfies all requirements.
Theorem 1 Given an instance of the k-median problem with a solution of cost C, where the medians may not belong to the set of input points, there exists a solution of cost at most 2C where all the medians belong to the set of input points (metric space requirement).
Theorem 1 Proof [figure omitted] Let m be the unconstrained median and let point 4 be the input point closest to m: • The distance from point 4 to any other point i in the data is bounded by c_i4 ≤ c_im + c_m4 [triangle inequality] • Since point 4 is the input point closest to m, c_m4 ≤ c_im for every i, so c_i4 ≤ 2c_im • Summing over all points, the cost of using point 4 as the median is at most two times the cost of the median clustering with no constraints (worst case)
Theorem 2 Consider a set of n points partitioned into disjoint sets X_1, …, X_l. The sum of the optimum solution values for the k-median problem on the l sets is at most twice the cost of the optimum k-median solution for all n points.
Theorem 2 Proof • This is essentially Theorem 1 applied to each of the l sets. • Restrict the optimum solution for all n points to each X_i: its medians may lie outside X_i, and the restricted costs sum to the overall optimum cost. By Theorem 1, each X_i then has a solution using only its own points of cost at most twice the restricted cost; summing over the l sets gives a total cost at most twice that of the case where medians need not be part of the data.
Theorem 3 (SmallSpace Step 3) If the sum of the costs of the l optimum k-median solutions for X_1, …, X_l is C, and C* is the cost of the optimum k-median solution on the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X'.
Theorem 3 Proof (1) • Let i' be a point in X' (a median obtained by SmallSpace) • Let ψ(i') be the point to which i' is assigned in the optimum continuous solution on X', and let d_{i'} be the number of points assigned to i' • Then the cost of X' is $\sum_{i' \in X'} d_{i'} \, c_{i'\,\psi(i')}$
Theorem 3 Proof (2) • Let i be a point in the set S, and let i'(i) be the median in X' to which it was assigned by SmallSpace • Then the cost of X' can be written as $\sum_{i \in S} c_{i'(i)\,\psi(i'(i))}$ (each point contributes the cost of its intermediate median exactly once) • Let σ(i) be the median assigned to i in the optimal continuous solution on S
Theorem 3 Proof (3) • Because ψ is optimal for X', the cost is no more than that of assigning each i'(i) to σ(i): $\sum_{i \in S} c_{i'(i)\,\sigma(i)} \le \sum_{i \in S} \big( c_{i\,i'(i)} + c_{i\,\sigma(i)} \big)$ [triangle inequality] • The last sum evaluates to C + C* in the continuous case, and hence 2(C + C*) once medians are restricted to points of the metric space (Theorem 1) • [Reminder: C is the sum of the costs of the l optimum k-median solutions for X_1, …, X_l, and C* is the cost of the optimum k-median solution on the entire set S]
Theorem 4 (SmallSpace Steps 2 and 4) • Modify Step 2 to use an (a, b)-bicriteria approximation algorithm, which outputs at most ak medians with cost at most b times the optimal k-median solution • Modify Step 4 to run a c-approximation algorithm • Theorem 4: The algorithm SmallSpace then has an approximation factor of 2c(1 + 2b) + 2b [not proven here]
SmallerSpace Algorithm SmallerSpace(S, i) 1. Divide S into l disjoint pieces X_1, …, X_l 2. For each X_i, find O(k) centers in it; assign each point to its closest center 3. Let X' be the set of O(lk) centers obtained in step 2, where each center is weighted by the number of points assigned to it 4. Call SmallerSpace(X', i - 1)
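The recursion is a small change to the sketch above; this version also carries the weights through each level (again assuming the hypothetical k_median helper, with nearest_center and numpy reused from the earlier sketch):

```python
def smaller_space(S, w, i, k, l, k_median):
    """SmallerSpace(S, i): i levels of divide and conquer. w holds point
    weights (all ones at the top level); helpers as in the sketch above."""
    if i == 0 or len(S) <= l * k:                 # base case: solve directly
        return k_median(S, w, k)
    idx = np.array_split(np.arange(len(S)), l)    # step 1: l disjoint pieces
    centers, weights = [], []
    for ix in idx:                                # step 2: cluster each piece
        c = k_median(S[ix], w[ix], k)
        a = nearest_center(S[ix], c)
        for j in range(len(c)):                   # step 3: accumulate weights
            wj = w[ix][a == j].sum()
            if wj > 0:
                centers.append(c[j])
                weights.append(wj)
    return smaller_space(np.array(centers), np.array(weights),
                         i - 1, k, l, k_median)   # step 4: recurse on X'
```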
SmallerSpace (2) [figure omitted: a tree of k-median instances, one per level of the recursion] • A small factor is lost in the approximation with each level of divide and conquer • In general, if |Memory| = n^ε, we need 1/ε levels, giving an approximation factor of 2^{O(1/ε)} • If n = 10^12 and M = 10^6, the regular 2-level algorithm suffices • If n = 10^12 and M = 10^3, we need 4 levels, giving an approximation factor of about 2^4
SmallerSpace Analysis Theorem 5: For a constant i, SmallerSpace(S, i) gives a constant-factor approximation to the k-median problem. Proof: The approximation factor at level j satisfies A_j = 2A_{j-1}(2b + 1) + 2b (Theorems 2 and 4), which has the solution A_j = O((2(2b + 1))^j), and this is O(1) if j is constant.
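Unrolling the recurrence shows where the constant comes from; writing r = 2(2b + 1):

$$A_j = r\,A_{j-1} + 2b \;=\; r^j A_0 + 2b\,\frac{r^j - 1}{r - 1} \;=\; O\big((2(2b+1))^j\big),$$

which is O(1) whenever b and the number of levels j are constants.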
SmallerSpace Analysis (2) • Since all intermediate medians X' must be stored in memory, the number of subsets l that we partition S into is limited • In fact, we need lk ≤ M (where M is the memory size), and such an l may not exist
Data Stream Model • A data stream is an ordered sequence of points: x_1, …, x_i, …, x_n • Algorithm performance is measured by the number of passes over the data, given the constraints on available memory • The number of points is usually so large that it is impossible to fit all of them in memory • Once a point has been read, it is usually very expensive to read it again; most algorithms assume the data will not be available for a second pass
Data Stream Algorithm 1. Input the first m points; use a bicriteria algorithm to reduce them to O(k) (e.g., 2k) points, weighting each intermediate median by the number of points assigned to it (depending on the algorithm used, this takes O(m²) or O(mk) time) 2. Repeat step 1 until we have seen m²/(2k) of the original data points 3. Cluster these m first-level medians into 2k second-level medians
Data Stream Algorithm (2) 4. In general, maintain at most m level-i medians; upon seeing m of them, generate 2k level-(i+1) medians, with the weight of each new median being the sum of the weights of the intermediate medians assigned to it 5. When we have seen all the data points, or when we decide to stop, cluster all intermediate medians into k final medians
[Figure: batches of m points are reduced to 2k level-1 medians; m level-i medians are reduced to 2k level-(i+1) medians; at the end, all intermediate medians are clustered into the final k.]
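A sketch of this hierarchy in Python, reusing the hypothetical k_median and nearest_center helpers from the earlier sketches (the bicriteria subroutine of step 1 is again replaced by k_median, and buffer bookkeeping is simplified):

```python
class StreamKMedian:
    """One-pass hierarchy sketch: buffers of m weighted points are reduced
    to 2k weighted medians, which are pushed to the next level up."""

    def __init__(self, k, m, k_median):
        self.k, self.m, self.k_median = k, m, k_median
        self.levels = []                          # levels[i] = (points, weights)

    def _reduce(self, pts, wts):
        """Cluster a full buffer down to 2k weighted medians."""
        c = self.k_median(pts, wts, 2 * self.k)
        a = nearest_center(pts, c)
        w = np.array([wts[a == j].sum() for j in range(len(c))])
        keep = w > 0
        return c[keep], w[keep]

    def add(self, x):
        x = np.asarray(x, dtype=float)
        pts, wts = x[None, :], np.ones(1)
        level = 0
        while True:                               # push medians up the levels
            if level == len(self.levels):
                self.levels.append((np.empty((0, len(x))), np.empty(0)))
            P, W = self.levels[level]
            P, W = np.vstack([P, pts]), np.concatenate([W, wts])
            if len(P) < self.m:                   # buffer not full yet: store
                self.levels[level] = (P, W)
                return
            pts, wts = self._reduce(P, W)         # buffer full: reduce, carry up
            self.levels[level] = (np.empty((0, len(x))), np.empty(0))
            level += 1

    def result(self):
        """Cluster all intermediate medians into the final k medians."""
        P = np.vstack([p for p, _ in self.levels if len(p)])
        W = np.concatenate([w for _, w in self.levels if len(w)])
        return self.k_median(P, W, self.k)
```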
Data Stream Algorithm Analysis • The algorithm requires O(log(n/m) / log(m/k)) levels • If k is much smaller than m, and m = O(n^ε) for some ε < 1, this gives: • O(n^ε) space • O(n^{1+ε}) running time • an approximation factor of up to 2^{O(1/ε)} (a constant-factor approximation)
Randomized Algorithm 1. Draw a sample of size s = (nk)^{1/2} 2. Find k medians from these s points using a primal-dual algorithm 3. Assign each of the original points to its closest median 4. Collect the n/s points with the largest assignment distances 5. Find k medians from among these n/s points 6. At this point we have 2k medians
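A hedged sketch of these steps, with k_median once more standing in for the paper's primal-dual algorithm and nearest_center reused from above:

```python
def randomized_2k(S, k, k_median, rng=None):
    """Sketch of the randomized sampling step; returns 2k medians."""
    rng = rng or np.random.default_rng(0)
    n = len(S)
    s = int(np.sqrt(n * k))                       # step 1: sample size sqrt(nk)
    sample = S[rng.choice(n, size=s, replace=False)]
    first = k_median(sample, np.ones(s), k)       # step 2: k medians of sample
    a = nearest_center(S, first)                  # step 3: assign all points
    dist = np.sqrt(((S - first[a]) ** 2).sum(axis=1))
    far = S[np.argsort(dist)[-(n // s):]]         # step 4: n/s farthest points
    second = k_median(far, np.ones(len(far)), k)  # step 5: k medians of those
    return np.vstack([first, second])             # step 6: 2k medians in total
```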
Randomized Algorithm Analysis • The algorithm gives an O(1) approximation with 2k medians, with constant probability • O(log n) passes give high-probability results • Õ(nk) time and space • The space can be improved to O((nk)^{1/2})
Full Algorithm 1. Input the first O(M/k) points, then use the randomized algorithm to reduce them to 2k intermediate median points 2. Use a local-search algorithm to cluster O(M) intermediate medians of level i into 2k medians of level i + 1 3. Use the primal-dual algorithm to cluster the final O(k) medians into k medians
Full Algorithm (2) • The full algorithm is still one pass (we call the randomized algorithm only once per input batch) • Step 1 is Õ(nk) in total over the stream • Step 2 is O(nk) • Therefore, the final cost is Õ(nk)
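For illustration only, the pieces compose into a one-pass driver like the following, reusing the hypothetical StreamKMedian and randomized_2k sketches above (note the algorithm in the paper keeps the weights of the batch medians and uses distinct local-search and primal-dual subroutines, where this sketch reuses k_median throughout):

```python
def full_algorithm(stream, k, M, k_median):
    """One-pass pipeline sketch: randomized_2k on batches of ~M/k points,
    a StreamKMedian hierarchy over the intermediate medians, then a final
    clustering into k. Batch-median weights are dropped here for brevity."""
    hierarchy = StreamKMedian(k=k, m=M, k_median=k_median)
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) >= max(M // k, 2 * k):      # batch full: reduce it
            for med in randomized_2k(np.array(batch), k, k_median):
                hierarchy.add(med)                # feed medians to hierarchy
            batch = []
    for x in batch:                               # flush any leftover points
        hierarchy.add(x)
    return hierarchy.result()                     # the final k medians
```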
Lower Bounds • Consider a clustering instance where the distance between two points is 0 if they belong to the same cluster and 1 otherwise • The optimum clustering has cost 0, so an algorithm cannot be constant-factor unless it discovers a clustering of cost 0 • Finding such a clustering is equivalent to the following: in a complete k-partite graph G, for some k, find the k-partition of the vertices of G into independent sets • The best algorithm for that problem requires Ω(nk) distance queries, which therefore lower-bounds any constant-factor clustering algorithm
Deterministic Algorithms: A1 1. Partition the n original points into p_1 subsets 2. Apply the primal-dual algorithm to each subset (O((n/p_1)²) per subset) 3. Apply it again to the p_1·k weighted points obtained in step 2, to get the final k medians
A1: Details • If we choose the number of subsets p_1 = (n/k)^{2/3}, we have: • O(n^{4/3} k^{2/3}) running time and space • a 4c² + 4c approximation factor by Theorem 4, where c is the approximation factor of the primal-dual algorithm
Deterministic Algorithms: A2 1. Split the dataset into p_2 partitions 2. Apply A1 to each of them 3. Apply A1 to all the intermediate medians obtained in step 2
A2: Details • If we choose the number of subsets p_2 = (n/k)^{4/5} in order to minimize the running time, we have: • O(n^{16/15} k^{14/15}) running time and space • We can see a trend!
Deterministic Algorithm • Create algorithm A_i that calls A_{i-1} on p_i partitions • Then the complexity in both time and space of this algorithm will be O(n^{1 + 1/(2^{2^i} - 1)} · k^{1 - 1/(2^{2^i} - 1)})
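The exponent in this bound is reconstructed to match the A1 and A2 costs above; as a quick check:

$$i = 1:\ n^{1 + 1/(2^{2} - 1)}\,k^{1 - 1/(2^{2} - 1)} = n^{4/3} k^{2/3}, \qquad i = 2:\ n^{1 + 1/(2^{4} - 1)}\,k^{1 - 1/(2^{4} - 1)} = n^{16/15} k^{14/15}.$$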
Deterministic Algorithm (2) • The approximation factor grows with i; however: • We can set i = Θ(log log log n) in order to bring the exponent of n in the running time down to 1 (then 2^{2^i} ≈ log n, so n^{1/(2^{2^i} - 1)} = O(1))
Deterministic Algorithm (3) • This gives an algorithm running in O(nk) space and time.
Conclusion • We have presented a variety of algorithms designed to address the problem of clustering in systems where the amount of data is huge • All the algorithms presented are approximation algorithms for the k-median problem
References • Eric W. Weisstein. "Complete k-Partite Graph." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Completek-PartiteGraph.html • http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt