
Stability Yields a PTAS for k -Median and k -Means Clustering

Pranjal Awasthi, Avrim Blum, Or Sheffet. Carnegie Mellon University, November 3rd, 2010.




Presentation Transcript


  1. Stability Yields a PTAS for k-Median and k-Means Clustering. Pranjal Awasthi, Avrim Blum, Or Sheffet. Carnegie Mellon University. November 3rd, 2010

  2. Stability Yields a PTAS for k-Median and k-Means Clustering • Introduce k-Median / k-Means problems • Define stability • Previous notion [ORSS06] • Weak Deletion Stability • β-distributed instances • The algorithm for k-Median • Conclusion + open problems

  3. Clustering In Real Life Clustering: come up with desired partition

  4. Clustering in a Metric Space Clustering: come up with desired partition Input: • n points • A distance function d: n×n → R≥0 satisfying: • Reflexive: ∀p, d(p,p) = 0 • Symmetry: ∀p,q, d(p,q) = d(q,p) • Triangle Inequality: ∀p,q,r, d(p,q) ≤ d(p,r) + d(r,q) • k-partition

  5. Clustering in a Metric Space Clustering: come up with desired partition Input: • n points • A distance function d: n×n → R≥0 satisfying: • Reflexive: ∀p, d(p,p) = 0 • Symmetry: ∀p,q, d(p,q) = d(q,p) • Triangle Inequality: ∀p,q,r, d(p,q) ≤ d(p,r) + d(r,q) • k-partition • k is large, e.g. k = polylog(n)
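The three axioms above can be checked directly on any finite point set; a minimal sketch (the function name and the 1-D integer examples below are illustrative, not from the talk):

```python
from itertools import product

def is_metric(points, d):
    """Check the three axioms from the slide on a finite point set."""
    for p in points:
        if d(p, p) != 0:                      # reflexivity
            return False
    for p, q in product(points, repeat=2):
        if d(p, q) != d(q, p):                # symmetry
            return False
    for p, q, r in product(points, repeat=3):
        if d(p, q) > d(p, r) + d(r, q):       # triangle inequality
            return False
    return True
```

For example, absolute difference on integers passes, while squared difference fails the triangle inequality (d(0,2) = 4 > d(0,1) + d(1,2) = 2), which is why k-means distances need separate treatment.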

  6. k-Median • Input: 1. n points in a finite metric space 2. k • Goal: • Partition into k disjoint subsets: C*1, C*2, …, C*k • Choose a center per subset • Cost: cost(C*i) = Σ_{x∈C*i} d(x, c*i) • Cost of partition: Σ_i cost(C*i) • Given centers ⇒ easy to get best partition • Given partition ⇒ easy to get best centers
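The two "easy" directions on this slide can be sketched in a few lines (a toy 1-D illustration; the function names are mine, not the talk's):

```python
def kmedian_cost(points, centers, d):
    """k-median cost induced by `centers`: each point pays its
    distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) for x in points)

def best_partition(points, centers, d):
    """Given centers, the optimal partition assigns each point to
    its nearest center."""
    clusters = {c: [] for c in centers}
    for x in points:
        clusters[min(centers, key=lambda c: d(x, c))].append(x)
    return clusters

def best_center(cluster, candidates, d):
    """Given a fixed cluster, the optimal (discrete) center minimizes
    the sum of distances to the cluster's points."""
    return min(candidates, key=lambda c: sum(d(x, c) for x in cluster))
```

The hardness lies in doing both at once: neither direction alone finds the optimal k-partition.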

  7. k-Means • Input: 1. n points in Euclidean space 2. k • Goal: • Partition into k disjoint subsets: C*1, C*2, …, C*k • Choose a center per subset • Cost: cost(C*i) = Σ_{x∈C*i} d²(x, c*i) • Cost of partition: Σ_i cost(C*i) • Given centers ⇒ easy to get best partition • Given partition ⇒ easy to get best centers
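For k-means the "given partition ⇒ best centers" step is especially simple: with squared Euclidean distances the optimal center of a fixed cluster is its centroid. A small 1-D sketch (illustrative names):

```python
def kmeans_cost(points, centers):
    """k-means cost: each point pays its squared distance to the
    nearest center (1-D points for simplicity)."""
    return sum(min((x - c) ** 2 for c in centers) for x in points)

def best_kmeans_center(cluster):
    """The mean (centroid) minimizes the sum of squared distances
    to a fixed cluster."""
    return sum(cluster) / len(cluster)
```

For the cluster [0, 2, 4] the centroid 2 gives cost 8, and any perturbed center does strictly worse.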

  8. We Would Like To… • Solve k-Median / k-Means problems • NP-hard to get OPT (= cost of optimal partition) • Find a c-approximation algorithm • A poly-time algorithm guaranteed to output a clustering whose cost ≤ c·OPT • Ideally, find a PTAS (Polynomial Time Approximation Scheme) • Get a c-approximation algorithm where c = (1+ε), for any ε > 0 • Runtime can be exponential in 1/ε

  9. Related Work • Small k: easy (try all centers) in time n^k; PTAS with runtime exponential in (k/ε) [KSS04] • General k, k-Median: (3+ε)-apx [GK98, CGTS99, AGKMMP01, JMS02, dlVKKR03]; (1.367…)-apx hardness [GK98, JMS02] • General k, k-Means: 9-apx [OR00, BHPI02, dlVKKR03, ES04, HPM04, KMNPSW02]; no PTAS! • Special case: Euclidean k-Median [ARR98], PTAS if dimension is small (loglog(n)^c); [ORSS06] • We focus on large k (e.g. k = polylog(n)) • Runtime goal: poly(n, k)

  10. World (figure): all possible instances

  11–15. ORSS Result (k-Means) (figure slides; accompanying text: “Why use 5 sites?”)

  16. ORSS Result (k-Means) • Instance is stable if OPT(k-1) > (1/α)² OPT(k) (require 1/α > 10) • Give a (1+O(α))-approximation. Our Result (k-Means) • Instance is stable if OPT(k-1) > (1+α) OPT(k) (require α > 0) • Give a PTAS ((1+ε)-approximation) • Runtime: poly(n, k) · exp(1/α, 1/ε)
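The two conditions can be compared numerically; the sketch below (function names are mine) shows the new condition is strictly weaker: an instance with OPT(k-1) = 2·OPT(k) satisfies it for α = 0.5, but fails the ORSS condition, which would demand a factor of (1/0.5)² = 4.

```python
def is_orss_stable(opt_k_minus_1, opt_k, alpha):
    """ORSS [ORSS06] stability: OPT(k-1) > (1/alpha)^2 * OPT(k)."""
    return opt_k_minus_1 > (1.0 / alpha) ** 2 * opt_k

def is_stable(opt_k_minus_1, opt_k, alpha):
    """This talk's condition: OPT(k-1) > (1+alpha) * OPT(k)."""
    return opt_k_minus_1 > (1 + alpha) * opt_k
```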

  17. Philosophical Note • Stable instances: ∃α > 0 s.t. OPT(k-1) > (1+α) OPT(k) • Non-stable instances: ∀α > 0, OPT(k-1) ≤ (1+α) OPT(k) • So a (1+α)-approximation can return a (k-1)-clustering • Any PTAS can return a (k-1)-clustering • It is not a k-clustering problem, it is a (k-1)-clustering problem! • If we believe our instance inherently has k clusters, stability is a “necessary condition” to guarantee that a PTAS returns a “meaningful” clustering • Our result: it is also a sufficient condition to get a PTAS

  18. World (figure): among all possible instances, the ORSS-stable ones are those where any (k-1)-clustering is significantly costlier than OPT(k)

  19–21. A Weaker Guarantee (figure slides; accompanying text: “Why use 5 sites?”)

  22. (1+α)-Weak Deletion Stability • Consider OPT(k). • Take any cluster C*i, associate its points with some other center c*j. • This increases the cost to at least (1+α)·OPT(k). • An obvious relaxation of ORSS-stability. • Our result: it suffices to get a PTAS.
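The condition as read off this slide can be checked exhaustively over all cluster pairs; a sketch (names and the 1-D test data are mine, not the talk's):

```python
def is_weak_deletion_stable(clusters, centers, d, opt, alpha):
    """(1+alpha)-weak deletion stability: for every i != j, moving all
    of C*_i's points to center c*_j must raise the total cost to at
    least (1+alpha)*OPT."""
    for i, cluster in enumerate(clusters):
        old = sum(d(x, centers[i]) for x in cluster)
        for j, cj in enumerate(centers):
            if i == j:
                continue
            new = sum(d(x, cj) for x in cluster)
            if opt - old + new < (1 + alpha) * opt:
                return False
    return True
```

With two well-separated 1-D clusters, merging either into the other's center multiplies the cost a hundredfold, so the instance is stable for moderate α but not for absurdly large α.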

  23. World (figure): ORSS Stable ⊆ Weak-Deletion Stable. Merging any two clusters in OPT(k) increases the cost significantly.

  24. β-Distributed Instances • For every cluster C*i and every point p not in C*i, we have: d(p, c*i) ≥ β · OPT/|C*i| • We show that: • k-Median: (1+α)-weak deletion stability ⇒ (α/2)-distributed • k-Means: (1+α)-weak deletion stability ⇒ (α/4)-distributed
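Reading the (figure) inequality as d(p, c*i) ≥ β·OPT/|C*i|, which matches the bound derived on slide 25, the property can be verified directly on a clustered instance; a sketch with illustrative names:

```python
def is_beta_distributed(clusters, centers, d, opt, beta):
    """Check d(p, c*_i) >= beta * OPT / |C*_i| for every cluster C*_i
    and every point p lying outside C*_i."""
    for i, (cluster, center) in enumerate(zip(clusters, centers)):
        threshold = beta * opt / len(cluster)
        for j, other in enumerate(clusters):
            if i != j and any(d(p, center) < threshold for p in other):
                return False
    return True
```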

  25. Claim: (1+α)-Weak Deletion Stability ⇒ (α/2)-Distributed • Let p ∉ C*i, and let c*j be the center of p's cluster. • By stability, reassigning C*i to c*j costs at least α·OPT extra: α·OPT ≤ Σ_{x∈C*i} d(x, c*j) − Σ_{x∈C*i} d(x, c*i) ≤ Σ_{x∈C*i} [d(x, c*i) + d(c*i, c*j)] − Σ_{x∈C*i} d(x, c*i) = Σ_{x∈C*i} d(c*i, c*j) = |C*i|·d(c*i, c*j) • Hence α·(OPT/|C*i|) ≤ d(c*i, c*j) ≤ d(c*i, p) + d(p, c*j) ≤ 2·d(c*i, p), using d(p, c*j) ≤ d(p, c*i) since p is assigned to c*j.

  26. World (figure): ORSS Stable ⊆ Weak-Deletion Stable ⊆ β-Distributed. In the optimal solution there is a large distance between a center and any “outside” point.

  27. Main Result • We give a PTAS for β-distributed k-Median and k-Means instances. • Running time: [formula in figure] • There are NP-hard β-distributed instances. (Superpolynomial dependence on 1/ε is unavoidable!)

  28. Stability Yields a PTAS for k-Median and k-Means Clustering • Introduce k-Median / k-Means problems • Define stability • PTAS for k-Median • High-level description • Intuition (“had only we known more…”) • Description • Conclusion + open problems

  29. k-Median Algorithm’s Overview • Input: metric, k, β, OPT • 0. Handle “extreme” clusters (brute-force guessing of some clusters’ centers) • 1. Populate L with components • 2. Pick the best center in each component • 3. Try all possible k-centers • L := list of “suspected” clusters’ “cores”

  30. k-Median Algorithm’s Overview • Input: metric, k, β, OPT • 0. Handle “extreme” clusters (brute-force guessing of some clusters’ centers) • 1. Populate L with components • 2. Pick the best center in each component • 3. Try all possible k-centers • Key points: the right definition of “core”; get the core of each cluster; L can’t get too big

  31. Intuition: “Mind the Gap” • We know: every point p outside C*i has d(p, c*i) ≥ β·OPT/|C*i| • In contrast, an “average” cluster contributes OPT/k to the cost • So for an “average” point p in an “average” cluster C*i, d(p, c*i) ≈ OPT/(k·|C*i|), a factor βk below the bound for outside points

  32. Intuition: “Mind the Gap” • We know: d(p, c*i) ≥ β·OPT/|C*i| for every p outside C*i • Denote the core of a cluster C*i: the points of C*i close to c*i (the later ball figures suggest radius r/8, with r = β·OPT/|C*i|)

  33. Intuition: “Mind the Gap” • We know: d(p, c*i) ≥ β·OPT/|C*i| for every p outside C*i • Denote the core of a cluster C*i (as above) • Formally, call a cluster C*i cheap if its cost is below a threshold (shown in the figure) • Assume all clusters are cheap. • In general: we brute-force guess the O(1/(βε)) centers of expensive clusters in Stage 0.

  34. Intuition: “Mind the Gap” • We know: d(p, c*i) ≥ β·OPT/|C*i| for every p outside C*i • Denote the core of a cluster C*i • Formally, call a cluster C*i cheap if its cost is below a threshold • Markov: at most an (ε/4) fraction of the points of a cheap cluster lie outside the core.

  35. Intuition: “Mind the Gap” • We know: d(p, c*i) ≥ β·OPT/|C*i| for every p outside C*i • Denote the core of a cluster C*i • Formally, call a cluster C*i cheap if its cost is below a threshold • Markov: at least half of the points of a cheap cluster lie inside the core.

  36. Magic (r/4) Ball • Denote r = β·(OPT/|C*i|). • If p belongs to the core ⇒ B(p, r/4) contains ≥ |C*i|/2 pts. • Call a ball “heavy” if its mass is ≥ |C*i|/2. • (Figure: core of radius ≤ r/8 around c*i; points outside C*i at distance > r.)

  37. Magic (r/4) Ball • Denote r = β·(OPT/|C*i|). • Draw a ball of radius r/4 around all points. • Unite “heavy” balls whose centers overlap. • All points in the core are merged into one set!

  38. Magic (r/4) Ball • Denote r = β·(OPT/|C*i|). • Draw a ball of radius r/4 around all points. • Unite “heavy” balls whose centers overlap. • Could we merge core pts with pts from other clusters?

  39. Magic (r/4) Ball • Draw a ball of radius r/4 around all points; unite “heavy” balls whose centers overlap; denote r = β·(OPT/|C*i|). • Suppose a point x is merged in via some p with r/2 ≤ d(p, c*i) ≤ 3r/4. • Then by the triangle inequality, r/4 = r/2 - r/4 ≤ d(x, c*i) ≤ 3r/4 + r/4 = r.

  40. Magic (r/4) Ball • Draw a ball of radius r/4 around all points; unite “heavy” balls whose centers overlap; denote r = β·(OPT/|C*i|). • So r/4 ≤ d(x, c*i) ≤ r. • Even if x falls outside the core, d(x, c*i) ≤ r means x still belongs to C*i (points outside C*i are at distance > r).

  41. Magic (r/4) Ball • Draw a ball of radius r/4 around all points; unite “heavy” balls whose centers overlap; denote r = β·(OPT/|C*i|). • All such points satisfy r/4 ≤ d(x, c*i) ≤ r ⇒ more than |C*i|/2 pts would fall outside the core; contradiction!
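The ball-drawing and merging step from these slides can be sketched as follows (1-D toy data; the chaining rule for “centers overlap” is my reading of the slides, not the paper’s exact procedure):

```python
def heavy_balls(points, d, r, threshold):
    """Map each point p whose ball B(p, r/4) holds >= threshold points
    (the slides use |C*_i|/2) to the contents of that ball."""
    balls = {}
    for p in points:
        inside = [q for q in points if d(p, q) <= r / 4.0]
        if len(inside) >= threshold:
            balls[p] = inside
    return balls

def merge_overlapping(balls, d, r):
    """Unite heavy balls whose centers overlap (one center lies inside
    the other's ball, d(c1, c2) <= r/4), chaining into components."""
    centers = list(balls)
    merged, used = [], set()
    for c in centers:
        if c in used:
            continue
        used.add(c)
        component, frontier = set(), [c]
        while frontier:
            cur = frontier.pop()
            component |= set(balls[cur])
            for c2 in centers:
                if c2 not in used and d(cur, c2) <= r / 4.0:
                    used.add(c2)
                    frontier.append(c2)
        merged.append(component)
    return merged
```

On two well-separated groups, every point's ball is heavy and the chaining recovers exactly the two groups.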

  42. Finding the Right Radius • Draw a ball of radius r/4 around all points; unite “heavy” balls whose centers overlap, where r = β·(OPT/|C*i|). • Problem: we don’t know |C*i|. • Solution: try all sizes, in order! • Set s = n, n-1, n-2, …, 1 • Set r_s = β·(OPT/s) • Complication: when s gets small (s = 4, 3, 2, 1) we collect many “leftovers” of one cluster. • Solution: once we add a subset to L, we remove close-by points.

  43. Population Stage • Set s = n, n-1, n-2, …, 1 • Set r_s = β·(OPT/s) • Draw a ball of radius r_s/4 around each point • Unite balls containing ≥ s/2 pts whose centers overlap • Once a set of ≥ s/2 points is found: • Put this set in L • Remove all points in an (r_s/2) “buffer zone” around it.
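The stage can be sketched end to end; this is a loose reading of the slide on 1-D toy data (simple fixpoint merging, illustrative names), not the paper’s exact algorithm:

```python
def population_stage(points, d, opt, beta):
    """Try each candidate size s = n..1 with radius r_s = beta*OPT/s;
    whenever a union of overlapping heavy balls (>= s/2 points each) is
    found, record it in L and clear an (r_s/2) buffer zone around it."""
    L = []
    remaining = set(points)
    for s in range(len(points), 0, -1):
        r = beta * opt / s
        for p in sorted(remaining):
            if p not in remaining:            # cleared by a buffer zone
                continue
            ball = {q for q in remaining if d(p, q) <= r / 4.0}
            if len(ball) < s / 2.0:           # not heavy
                continue
            core, changed = set(ball), True
            while changed:                    # unite overlapping heavy balls
                changed = False
                for p2 in list(remaining - core):
                    if min(d(p2, q) for q in core) <= r / 4.0:
                        ball2 = {q for q in remaining if d(p2, q) <= r / 4.0}
                        if len(ball2) >= s / 2.0 and not ball2 <= core:
                            core |= ball2
                            changed = True
            L.append(core)
            remaining -= {q for q in remaining
                          if min(d(q, c) for c in core) <= r / 2.0}
    return L
```

On two well-separated 1-D clusters the loop finds nothing until s shrinks enough for the balls to become heavy, at which point each cluster is collected as one suspected core.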

