
Correlation Clustering






Presentation Transcript


  1. Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla

  2. Clustering • Say we want to cluster n objects of some kind (documents, images, text strings). • But we don’t have a meaningful way to project them into Euclidean space. • Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different. • Then run f on all pairs and try to find the most consistent clustering.

  3. The problem [Figure: four name records as vertices: Harry B., Harry Bovik, H. Bovik, Tom X.]

  4. The problem [Figure: the four records with +/- edge labels; +: same, -: different.] Train up f(x,y) = same/different. Run f on all pairs.

  5. The problem [Figure: the four records with +/- edge labels; +: same, -: different.] A totally consistent clustering has: • + edges inside clusters, • - edges outside clusters.

  6. The problem [Figure: the same records with a different labeling of the pairs.] Train up f(x,y) = same/different. Run f on all pairs.

  7. The problem [Figure: the labeled records with a disagreement highlighted; +: same, -: different.] Train up f(x,y) = same/different. Run f on all pairs. Find the most consistent clustering.


  9. The problem Given a complete graph on n vertices, each edge labeled + or -. The goal is to find a partition of the vertices as consistent as possible with the edge labels: Max #(agreements) or Min #(disagreements). There is no k: the number of clusters could be anything. [Figure: the four name records with +/- labels and a disagreement; +: same, -: different.]
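The objective just stated can be made concrete in a few lines. Below is a minimal sketch (the function and argument names are my own, not from the talk) that counts the disagreements of a given partition with the +/- edge labels of a complete graph:

```python
def disagreements(labels, clustering):
    """Count edges whose label disagrees with the clustering.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    clustering: dict mapping vertex -> cluster id
    """
    bad = 0
    for edge, sign in labels.items():
        u, v = tuple(edge)
        same = clustering[u] == clustering[v]
        # a '+' edge should lie inside a cluster, a '-' edge between clusters
        if (sign == '+') != same:
            bad += 1
    return bad
```

On a bad triangle (two + edges and one - edge), every possible partition scores at least one disagreement, which is exactly the lower-bounding idea used later in the talk.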

  10. The Problem Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well. Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited representational power. E.g., research communities via a collaboration graph.

  11. Our results • A constant-factor approximation for minimizing disagreements. • A PTAS for maximizing agreements. • Results for the random-noise case.

  12. PTAS for maximizing agreements It is easy to get ½ of the edges right. Goal: an additive approximation of εn². Standard approach: • draw a small sample, • guess a partition of the sample, • compute the partition of the remainder. Can do this directly, or plug into the General Property Tester of [GGR]. • Running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.
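The "easy to get ½ of the edges" remark deserves a one-line justification: one big cluster agrees with every + edge, and all singletons agree with every - edge, so the better of the two trivial clusterings agrees with at least half the edges. A sketch (names hypothetical):

```python
def trivial_agreements(labels):
    """Agreements of the better of two trivial clusterings:
    one big cluster (right on every '+') vs. all singletons (right on every '-').
    labels: dict mapping frozenset({u, v}) -> '+' or '-'."""
    plus = sum(1 for sign in labels.values() if sign == '+')
    # plus + minus = total, so max(plus, minus) >= total / 2
    return max(plus, len(labels) - plus)
```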

  13. Minimizing Disagreements Goal: get a constant-factor approximation. Problem: even if we can find a cluster that is as good as the best cluster of OPT, we are headed toward O(log n) [a set-cover-like analysis]. We need a way of lower-bounding Dopt.

  14. Lower bounding idea: bad triangles Consider a triangle with two + edges and one - edge. Any clustering has to disagree with at least one of these edges.

  15. Lower bounding idea: bad triangles If several such triangles are edge-disjoint, then any clustering makes a mistake on each one. [Figure: vertices 1–5 with edge-disjoint bad triangles (1,2,3) and (3,4,5).] Dopt ≥ #{edge-disjoint bad triangles}. (Not tight.)

  16. Lower bounding idea: bad triangles [Figure: the same example; edge-disjoint bad triangles (1,2,3) and (3,4,5).] Dopt ≥ #{edge-disjoint bad triangles}. How can we use this?
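A simple way to turn this bound into a certificate is to collect edge-disjoint bad triangles greedily; the greedy scan below is my own illustration (not a procedure from the talk), and its count lower-bounds Dopt since any clustering must err on each collected triangle:

```python
from itertools import combinations

def greedy_bad_triangles(vertices, labels):
    """Greedily collect edge-disjoint bad triangles (two '+' edges, one '-').
    labels: dict mapping frozenset({u, v}) -> '+' or '-' for every pair.
    Returns a lower bound on the optimal number of disagreements, Dopt."""
    used, count = set(), 0
    for u, v, w in combinations(sorted(vertices), 3):
        edges = [frozenset({u, v}), frozenset({v, w}), frozenset({u, w})]
        if any(e in used for e in edges):
            continue  # keep the collection edge-disjoint
        if sorted(labels[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```

On the slide's five-vertex example the greedy scan finds the two triangles (1,2,3) and (3,4,5).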

  17. δ-clean Clusters Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements: |N-(v) within C| < δ|C| and |N+(v) outside C| < δ|C| (+: similar, -: dissimilar). Essentially, N+(v) ≈ C(v). A cluster C is δ-clean if all v ∈ C are δ-good.
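The δ-good/δ-clean definitions translate directly into a checker. This is a sketch under one representational assumption of mine: the graph is given as a map from each vertex to its set of + neighbors, so any in-cluster pair without a + edge is a - edge (the graph is complete).

```python
def is_delta_clean(cluster, plus_neighbors, delta):
    """True iff every vertex of `cluster` is delta-good.
    plus_neighbors: dict mapping vertex -> set of its '+' neighbors."""
    C = set(cluster)
    for v in C:
        plus_in = len(plus_neighbors[v] & (C - {v}))
        plus_out = len(plus_neighbors[v] - C)
        minus_in = len(C) - 1 - plus_in  # remaining in-cluster edges are '-'
        if minus_in >= delta * len(C) or plus_out >= delta * len(C):
            return False  # v is not delta-good
    return True
```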

  18. Observation Any δ-clean clustering is an 8-approximation for δ < 1/4. Idea: charge mistakes to bad triangles. [Figure: a wrong edge (u,v) and a witness vertex w.] Intuitively, there are enough choices of w for each wrong edge (u,v).

  19. Observation Any δ-clean clustering is an 8-approximation for δ < 1/4. Idea: charge mistakes to bad triangles. Intuitively, there are enough choices of w for each wrong edge (u,v), so we can find edge-disjoint bad triangles. [Figure: witnesses w and u' for the wrong edge (u,v).] A similar argument handles + edges between two clusters.

  20. General Structure of Argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4.

  21. General Structure of Argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: this may not be possible!

  22. General Structure of Argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: this may not be possible! Approach: 1) Show a clustering of a special form exists, OPT(δ): its clusters are either δ-clean or singletons, and it makes not many more mistakes than OPT. 2) Produce something "close" to OPT(δ).

  23. Existence of OPT(δ) [Figure: the optimal clustering OPT, with clusters C1 and C2.]

  24. Existence of OPT(δ) Identify the δ/3-bad vertices. [Figure: OPT, with clusters C1 and C2; the δ/3-bad vertices are marked.]

  25. Existence of OPT(δ) • Move the δ/3-bad vertices out. • If "many" (a ≥ δ/3 fraction) are δ/3-bad, "split" the cluster into singletons. OPT(δ): singletons and δ-clean clusters, with Dopt(δ) = O(1) · Dopt. [Figure: C1 survives; C2 is split.]

  26. Main Result [Figure: OPT(δ), which is δ-clean.]

  27. Main Result Guarantee: • non-singleton clusters are 11δ-clean; • singletons are a subset of the singletons of OPT(δ). [Figure: the algorithm's 11δ-clean clustering alongside the δ-clean OPT(δ).]

  28. Main Result Guarantee: • non-singleton clusters are 11δ-clean; • singletons are a subset of the singletons of OPT(δ). Choose δ = 1/44, so non-singletons are ¼-clean. Bound the mistakes among non-singletons by bad triangles, and those involving singletons by the mistakes of OPT(δ).

  29. Main Result Guarantee: • non-singleton clusters are 11δ-clean; • singletons are a subset of the singletons of OPT(δ). Choose δ = 1/44, so non-singletons are ¼-clean. Bound the mistakes among non-singletons by bad triangles, and those involving singletons by the mistakes of OPT(δ). Approximation ratio = 9/δ² + 8.

  30. Open Problems • How about a small constant? Is Dopt ≤ 2 · #{edge-disjoint bad triangles}? Is Dopt ≤ 2 · #{fractional edge-disjoint bad triangles}? • Extend to {-1, 0, +1} weights: approximable to a constant or even a log factor? • Extend to the weighted case? Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].

  31. The problem [Figure: the four records Harry B., Harry Bovik, H. Bovik, Tom X. with +/- labels and a disagreement; +: same, -: different.] Train up f(x,y) = same/different. Run f on all pairs. Find the most consistent clustering.

  32. Nice features of formulation • There is no k (OPT can have anywhere from 1 to n clusters). • If a perfect solution exists, then it is easy to find: C(v) = N+(v). • It is easy to get agreement on ½ of the edges.
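The second feature is worth seeing in code: when a fully consistent clustering exists, each cluster is exactly a vertex together with its + neighbors, so sweeping over the vertices recovers it. A minimal sketch (names are my own):

```python
def recover_perfect_clustering(vertices, plus_neighbors):
    """If a totally consistent clustering exists, each cluster is
    {v} | N+(v); collecting these sets over unseen vertices recovers it.
    plus_neighbors: dict mapping vertex -> set of its '+' neighbors."""
    seen, clusters = set(), []
    for v in sorted(vertices):
        if v in seen:
            continue  # v's cluster was already emitted
        C = {v} | plus_neighbors[v]
        clusters.append(C)
        seen |= C
    return clusters
```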

  33. Algorithm • Pick a vertex v. Let C(v) = N+(v). • Modify C(v): remove 3δ-bad vertices from C(v); add 7δ-good vertices into C(v). • Delete C(v). Repeat until done, or until the above always makes empty clusters. • Output the nodes left as singletons.
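The loop above can be sketched in code. This is a simplified illustration rather than the paper's exact procedure: candidate additions are processed one at a time rather than simultaneously, an emptied cluster falls back to a singleton, and all names are my own.

```python
def cautious_cluster(vertices, plus_neighbors, delta=1 / 44):
    """Sketch of the remove/add clustering loop (simplified).
    plus_neighbors: dict mapping vertex -> set of its '+' neighbors."""
    remaining = set(vertices)
    clusters = []

    def bad(x, C, thresh):
        # x disagrees with C on >= thresh*|C| edges, counting '-' edges
        # inside C and '+' edges leaving C (within the remaining graph).
        inside = C - {x}
        plus_in = len(plus_neighbors[x] & inside)
        minus_in = len(inside) - plus_in
        plus_out = len((plus_neighbors[x] & remaining) - C)
        return minus_in >= thresh * len(C) or plus_out >= thresh * len(C)

    while remaining:
        v = min(remaining)
        C = ({v} | plus_neighbors[v]) & remaining
        # Vertex removal phase: repeatedly drop any 3*delta-bad vertex.
        removing = True
        while removing:
            removing = False
            for x in sorted(C):
                if bad(x, C, 3 * delta):
                    C.remove(x)
                    removing = True
        # Vertex addition phase: pull in 7*delta-good vertices.
        for y in sorted(remaining - C):
            if C and not bad(y, C | {y}, 7 * delta):
                C.add(y)
        if not C:
            C = {v}  # nothing survived: output v as a singleton
        clusters.append(C)
        remaining -= C
    return clusters
```

On an already-consistent instance the loop simply peels off the true clusters.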

  34. Step 1 Choose v; let C = the + neighbors of v. [Figure: C overlapping the OPT clusters C1 and C2.]

  35. Step 2 Vertex removal phase: if x is 3δ-bad, set C = C - {x}. [Figure: clusters C1 and C2.] • No vertex in C1 is removed. • All vertices in C2 are removed.

  36. Step 3 Vertex addition phase: add 7δ-good vertices to C. [Figure: clusters C1 and C2.] • All remaining vertices in C1 will be added. • None in C2 are added. • The resulting cluster C is 11δ-clean.

  37. Case 2: v is a singleton in OPT(δ) Choose v; C = the + neighbors of v. The same idea works.

  38. The problem [Figure: four vertices, 1–4.]
