
Correlation Clustering



  1. Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla

  2. Introduction

  3. Previous Approaches Documents are mapped to points (e.g., Doc1 -> (1,0,1)), and similarity becomes distance among the points.

  4. Previous Approaches k-min clustering, k-min sum, k-median, … attacked with approximation algorithms, matrix methods, AI techniques, … k-min clustering: minimize the maximum diameter. k-min sum: minimize the sum of distances within clusters. [Figure: example with k = 3]
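
To make the two objectives concrete, here is a minimal sketch (not from the talk) that evaluates them for points in the plane; the function names and the example clustering are illustrative only:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def max_diameter(clusters):
    """k-min clustering objective: the largest intra-cluster distance."""
    return max((dist(p, q) for c in clusters for p, q in combinations(c, 2)),
               default=0.0)

def min_sum(clusters):
    """k-min sum objective: the total pairwise distance within clusters."""
    return sum(dist(p, q) for c in clusters for p, q in combinations(c, 2))

# Two clusters (k = 2) of points in the plane:
clusters = [[(0, 0), (1, 0)], [(5, 5), (6, 5), (5, 6)]]
print(max_diameter(clusters))  # sqrt(2): the widest pair, (6,5)-(5,6)
print(min_sum(clusters))       # 1 + (1 + 1 + sqrt(2)) = 3 + sqrt(2)
```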

  5. Some Limitations 1) Have to specify "k". If k is not restricted, it is best to just put each vertex in its own individual cluster.

  6. Some Limitations 2) Restrictions on edge weights: the edge weights must form a metric.

  7. Some Limitations 3) No clean notion of quality of clustering. E.g., minimize the distance sum within clusters: what really is my cluster quality?

  8. Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion

  9. Our Approach Classifier: takes 2 documents and returns a weight W in [-1,+1] indicating their similarity (+1: similar, -1: dissimilar). In this talk, W = -1 or +1.

  10. Our Approach (cont.) [Figure: example with three documents; the classifier labels the edges -1, -1, +1]

  11. Our Approach (cont.) Our goal: find a clustering which agrees with this labeling. [Figure: the same example with edges labeled -1, -1, +1]

  12. A Disagreement (+1: similar, -1: dissimilar) A disagreement is a -1 edge within a cluster or a +1 edge crossing a cluster. In the example, 2 edges have disagreements! Our goal: minimize the number of disagreements. [Figure: example clustering with two disagreements]
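
As a concrete rendering of this objective, here is a minimal sketch (my own, not from the talk) that counts the disagreements of a clustering against ±1 labels, assuming a complete graph as in the talk:

```python
from itertools import combinations

def disagreements(vertices, label, cluster_of):
    """Count -1 edges inside a cluster plus +1 edges crossing clusters.
    label maps an unordered pair to +1/-1; cluster_of maps each vertex
    to a cluster id. Every pair is assumed to be labeled."""
    bad = 0
    for u, v in combinations(vertices, 2):
        w = label.get((u, v), label.get((v, u)))
        same = cluster_of[u] == cluster_of[v]
        if (w == -1 and same) or (w == +1 and not same):
            bad += 1
    return bad
```

Since every pair is either an agreement or a disagreement, agreements = n(n-1)/2 - disagreements, which is exactly the observation slide 20 will use.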

  13. Comparison: 1) A clean notion of quality of clustering: # disagreements -> quality.

  14. Comparison: 2) Do not have to specify "k": k is determined by the edge labels. [Figure: graph with +1 and -1 edges]

  15. Comparison: 3) Arbitrary edge weights: no metric, no dependence required.

  16. A Closer Look Goal: given a graph with +1/-1 edges, cluster to minimize disagreements. Question: can we always avoid disagreements?

  17. A Closer Look Goal: given a graph with +1/-1 edges, cluster to minimize disagreements. Question: can we always avoid disagreements? Answer: No. Consider a triangle with two +1 edges and one -1 edge: any clustering has at least 1 disagreement.
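
This claim is small enough to verify exhaustively. The following sketch (reusing `disagreements` from the sketch above) enumerates all five partitions of the three vertices and confirms that the best achievable is exactly one disagreement:

```python
# The +1, +1, -1 triangle from this slide, checked exhaustively.
V = ['a', 'b', 'c']
label = {('a', 'b'): +1, ('b', 'c'): +1, ('a', 'c'): -1}
partitions = [                 # all 5 ways to cluster 3 vertices
    {'a': 0, 'b': 0, 'c': 0},  # one cluster
    {'a': 0, 'b': 0, 'c': 1},  # {a,b} {c}
    {'a': 0, 'b': 1, 'c': 0},  # {a,c} {b}
    {'a': 0, 'b': 1, 'c': 1},  # {a} {b,c}
    {'a': 0, 'b': 1, 'c': 2},  # all singletons
]
assert min(disagreements(V, label, p) for p in partitions) == 1
```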

  18. Minimizing Disagreements [Figure: example graph with +1 and -1 edges; the best clustering has 1 disagreement]

  19. Minimizing Disagreements (cont.) Minimizing disagreements is NP-hard, so we will look for approximation algorithms. [Figure: the same example, 1 disagreement]

  20. Agreements vs. Disagreements Observation: Agreements + Disagreements = total # of pairs = n(n-1)/2, so minimizing disagreements ≡ maximizing agreements (they have the same optimal clustering). The two are very different in terms of approximation, though: if Opt has 1 disagreement and we make n, the ratio for disagreements is n while the ratio for agreements is ≈ 1.

  21. Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion

  22. Maximizing Agreements A 2-approximation is easy. Algorithm: if #(+1 edges) > #(-1 edges), put all vertices in a single cluster; else, put each point in its own individual cluster. Proof: Opt agrees on at most all n(n-1)/2 edges, while we agree on at least max(#+1, #-1) ≥ n(n-1)/4 of them.
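
A sketch of this algorithm, under the same complete-graph assumption as before (the function name is mine):

```python
def two_approx_agreements(vertices, label):
    """The trivial 2-approximation from this slide: one big cluster if +1
    edges are the majority, otherwise all singletons."""
    pos = sum(1 for w in label.values() if w == +1)
    neg = len(label) - pos
    if pos > neg:
        return {v: 0 for v in vertices}                # one big cluster
    return {v: i for i, v in enumerate(vertices)}      # all singletons
```

It agrees on max(pos, neg) ≥ (pos + neg)/2 edges, while Opt can agree on at most every edge, which gives the factor of 2.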

  23. Our Result • A PTAS for maximizing agreements: a (1+ε)-approximation in time n^O(poly(1/ε))

  24. Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion

  25. Our Result • An O(1) approximation for minimizing disagreements

  26. Approximation for Disagreements To prove: Dalg ≤ c · Dopt (Dalg: our disagreements; Dopt: Opt's disagreements). Roadmap: 1) Notation 2) Show existence of Opt(d) 3) Describe the algorithm 4) Show our clustering is close to Opt(d)

  27. Notation (+1: similar, -1: dissimilar) Given a clustering, a vertex v in cluster C is d-good if it has few disagreements: fewer than d|C| within C and fewer than d|C| outside C. Otherwise v is d-bad; a d-bad vertex has ≥ d|C| disagreements. A cluster C is d-clean if all v ∈ C are d-good.
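
These definitions translate directly into code. A sketch, using the slide's strict "< d|C|" thresholds (the paper's exact inequalities may differ slightly):

```python
def is_d_good(v, C, universe, label, d):
    """v is d-good w.r.t. cluster C: fewer than d|C| negative edges into C
    and fewer than d|C| positive edges to vertices outside C."""
    w = lambda u: label.get((v, u), label.get((u, v)))
    neg_inside = sum(1 for u in C if u != v and w(u) == -1)
    pos_outside = sum(1 for u in universe
                      if u not in C and u != v and w(u) == +1)
    return neg_inside < d * len(C) and pos_outside < d * len(C)

def is_d_clean(C, universe, label, d):
    """C is d-clean if every vertex in it is d-good."""
    return all(is_d_good(v, C, universe, label, d) for v in C)
```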

  28. Approximation for Disagreements To prove: Dalg ≤ c · Dopt. Roadmap: 1) Notation 2) Show existence of Opt(d) 3) Describe the algorithm 4) Show our clustering is close to Opt(d)

  29. Existence of Opt(d) Main idea: transform Opt into Opt(d), where Opt(d): 1) has all "non-singleton" clusters d-clean, and 2) is only a constant factor worse than Opt: Dopt(d) = O(1/d²) · Dopt

  30. Transforming OPT to OPT(d) An imaginary procedure applied to Opt. [Figure: optimum clustering with clusters C1, C2]

  31. Transforming OPT to OPT(d) Identify the d/3-bad vertices. [Figure: optimum clustering C1, C2 with the d/3-bad vertices highlighted]

  32. Transforming OPT to OPT(d) • Move d/3-bad vertices out. • If "many" (a ≥ d/3 fraction) are d/3-bad, "split" the cluster. [Figure: in OPT(d), a bad vertex moves out of its cluster]

  33. Transforming OPT to OPT(d) • Move d/3-bad vertices out. • If "many" (a ≥ d/3 fraction) are d/3-bad, "split" the cluster. [Figure: in OPT(d), a cluster with many bad vertices is split into singletons]

  34. Transforming OPT to OPT(d) Disagreements of OPT(d). Split case: the cluster already had ≥ (d/3)²|C1|² disagreements, and splitting adds ≤ |C1|²/2. No-split case: each moved vertex already had ≥ (d/3)|C2| disagreements, and moving it out adds ≤ |C2|. So total disagreements increase by at most an O(1/d²) factor.

  35. Transforming OPT to OPT(d) The "non-singleton" clusters are d-clean: a vertex that was d/3-good earlier is still d-good after the transformation. [Figure: clusters C1, C2 of OPT(d)]

  36. Approximation for Disagreements To prove: Dalg ≤ c · Dopt. Roadmap: 1) Notation 2) Show existence of Opt(d) 3) Describe the algorithm 4) Show our clustering is close to Opt(d)

  37. Main Result Opt(d)'s clusters are d-clean; the clustering produced by the algorithm is 11d-clean.

  38. The Algorithm Input: graph G. Output: a clustering of G. 1) Pick an arbitrary v ∈ G; let C = the +1 neighbors of v. 2) Vertex Removal Phase: remove bad vertices from C. 3) Vertex Addition Phase: add good vertices to C. 4) Repeat on G - C.
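
Here is a sketch of these four steps (mine, following the slide's outline with the 3d/7d thresholds from the next slides; it reuses `is_d_good` from the notation sketch, assumes a complete ±1-labeled graph, and includes v itself in C so the cluster is never empty):

```python
def cluster(vertices, label, d):
    """Repeatedly carve off one cluster: grow it from a +1 neighborhood,
    prune 3d-bad vertices, then absorb 7d-good ones."""
    w = lambda u, v: label.get((u, v), label.get((v, u)))
    remaining, clusters = set(vertices), []
    while remaining:
        v = next(iter(remaining))                       # 1) arbitrary vertex
        C = {v} | {u for u in remaining if u != v and w(v, u) == +1}
        bad = {x for x in C                             # 2) removal phase
               if not is_d_good(x, C, remaining, label, 3 * d)}
        C -= bad
        if not C:
            C = {v}  # fall back: v becomes a singleton cluster
        C |= {x for x in remaining - C                  # 3) addition phase
              if is_d_good(x, C | {x}, remaining, label, 7 * d)}
        clusters.append(C)
        remaining -= C                                  # 4) repeat on G - C
    return clusters
```

Both phases are done "simultaneously" against a frozen C (the bad set and the addition set are computed before C changes), which matches the slide's phrasing of removing, then adding, in two clean passes.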

  39. Step 1 Choose v, C = +1 neighbors of v. [Figure: v and C overlaid on Opt's clusters C1, C2]

  40. Step 2 Vertex Removal Phase: if x is 3d-bad, C = C - {x}. [Figure: v and C vs. C1, C2]

  41. Step 2 Vertex Removal Phase: if x is 3d-bad, C = C - {x}. • No vertex in C1 is removed. • All vertices in C2 are removed. [Figure: v and C vs. C1, C2]

  42. Step 3 Vertex Addition Phase: add 7d-good vertices to C. [Figure: v and C vs. C1, C2]

  43. Step 3 Vertex Addition Phase: add 7d-good vertices to C. • All remaining vertices of C1 will be added. • None in C2 is added. • The resulting cluster C is 11d-clean. [Figure: v and C vs. C1, C2]

  44. Case 2: v is a singleton in Opt(d) Choose v, C = +1 neighbors of v; the same idea works. [Figure: v and its cluster C]

  45. Main Result Opt(d)'s clusters are d-clean; the algorithm's clusters are 11d-clean.

  46. Approximation for Disagreements To prove: Dalg ≤ c · Dopt. Roadmap: 1) Notation 2) Show existence of Opt(d) 3) Describe the algorithm 4) Show our clustering is close to Opt(d)

  47. Our Disagreements Our disagreements come in two types: • involving singletons (Type 1) • within non-singleton clusters (Type 2). Type 1 ≤ Dopt(d). [Figure: Opt(d)'s clusters C1, C2 vs. the algorithm's 11d-clean clusters, with crossing +1 edges]

  48. Disagreements in Non-Singletons Lemma: if d < 1/4, disagreements in d-clean clusterings are ≤ 8 Dopt. Erroneous triangle: a triangle with two +1 edges and one -1 edge. Disagreements of OPT ≥ # of edge-disjoint erroneous triangles.
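
The lower bound in the last line can be made concrete with a greedy search for edge-disjoint erroneous triangles. This sketch is my own (greedy, not anything from the talk), but the number it returns is always a valid lower bound on Dopt: every clustering makes at least one mistake inside each erroneous triangle, and edge-disjointness keeps those mistakes distinct.

```python
from itertools import combinations

def erroneous_triangle_bound(vertices, label):
    """Greedily collect edge-disjoint triangles with labels {+1, +1, -1};
    their count lower-bounds Opt's disagreements."""
    w = lambda u, v: label.get((u, v), label.get((v, u)))
    used, count = set(), 0
    for a, b, c in combinations(vertices, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if sorted([w(a, b), w(b, c), w(a, c)]) == [-1, 1, 1] \
                and not any(e in used for e in edges):
            used.update(edges)
            count += 1
    return count
```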

  49. Disagreements in Non-Singletons Lemma: if d < 1/4, errors in d-clean clusterings are ≤ 8 Dopt. Proof idea: for each disagreement we find an edge-disjoint erroneous triangle. Inside a d-clean cluster C there are lots of candidate third vertices (≥ ½|C|), so they cannot all be used up. [Figure: a -1 edge inside a d-clean cluster C, completed by a third vertex into a +1, +1, -1 triangle]

  50. Disagreements in Non-Singletons Lemma: if d < 1/4, disagreements in d-clean clusterings are ≤ 8 Dopt. An identical argument works for +1 edges crossing between d-clean clusters. [Figure: a +1 edge between two d-clean clusters, completed into an erroneous triangle]
