My relationship with correlation clustering started in 2016

Presentation Transcript


  1. My relationship with correlation clustering started in 2016
  • From June to July 2016 I visited Melbourne as part of the East Asia and Pacific Summer Institute Fellowship.
  • Tony and I studied weighted correlation clustering with low-rank advice.
  • The project was based on an observation Tony and my advisor David Gleich made in 2015 about rank-1 correlation clustering.
  [Figure: a small signed graph with edge weights -6, +6, -4, +2, -2]

  2. Many algorithms focus on complete, unweighted correlation clustering
  • Given a signed graph G
  • Each edge indicates similarity (+) or dissimilarity (−)
  [Figure: a complete signed graph with edges labeled + and −]

  3. In general, edges can be weighted. The weights can be stored in an adjacency matrix.
  [Figure: a signed graph with edge weights -6, -4, -3, +2, +6, -2, alongside its adjacency matrix]
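For reference, the objective these slides are building toward can be stated in the standard disagreement-minimization form of weighted correlation clustering (this formula is my addition; it does not appear on the slides): over all clusterings, minimize

    \[
    \min_{\mathcal{C}} \;
    \sum_{\substack{(i,j):\, A_{ij} > 0 \\ i,j \text{ separated}}} |A_{ij}|
    \;+\;
    \sum_{\substack{(i,j):\, A_{ij} < 0 \\ i,j \text{ together}}} |A_{ij}| .
    \]

That is, we pay for every positive edge we cut and every negative edge we keep inside a cluster.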

  4-9. The rank-1 positive semidefinite case is very simple
  • When the weight matrix is A = vv^T, each node i carries a value v_i (here +3, -2, +2, -1), and the weight of edge (i,j) is the product v_i v_j (giving the weights -6, -4, -3, +2, +6, -2 from the previous slide).
  • Ordering v gives a perfect clustering: grouping the nodes with positive v_i and the nodes with negative v_i makes every within-cluster edge positive and every between-cluster edge negative.
  [Figure, built up over slides 4-9: the weighted graph with node values +3, -2, +2, -1, reordered as -1, -2, +3, +2 so that same-sign nodes sit together]
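Concretely, the rank-1 recipe above fits in a few lines. This is my sketch, not code from the talk, and it assumes the factor v is given:

    import numpy as np

    def rank1_cc(v, tol=1e-12):
        # For A = np.outer(v, v), nodes with the same sign of v_i have
        # positive mutual weight v_i * v_j, so splitting by sign leaves
        # zero disagreements. Zero entries get their own cluster here.
        return np.where(v > tol, 0, np.where(v < -tol, 1, 2))

    v = np.array([3.0, -2.0, 2.0, -1.0])  # node values from the slides
    print(rank1_cc(v))                    # [0 1 0 1]: {+3, +2} vs {-2, -1}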

  10. A simple solution for rank-1 positive semidefinite correlation clustering always exists. What happens for other low-rank matrices?
  Our contributions:
  • A polynomial-time solution for rank-d PSD matrices
  • An NP-hardness result for matrices with negative eigenvalues
  • A heuristic algorithm for PSD matrices

  11. Rank-d PSD correlation clustering is equivalent to clustering vectors in R^d

  12-17. Main observations
  • Every cluster C_i of vectors has a "sum point" S_i, the sum of the vectors assigned to it.
  • For a fixed clustering, the objective can be written in terms of the sum points S_i (a reconstruction of the formula follows below).
  • Also, we can show that the number of clusters can be bounded above by d+1.
  [Figure, built up over slides 12-17: a cluster C_i of vectors in R^d and its sum point S_i]
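The formula on the slide lives in an image; the following is my reconstruction, assuming the usual setup in which each node u gets the row vector v_u of a factorization A = VV^T, so within-cluster agreement telescopes into sum-point norms:

    \[
    S_i = \sum_{u \in C_i} v_u,
    \qquad
    \sum_{i} \sum_{u, w \in C_i} v_u^\top v_w \;=\; \sum_i \| S_i \|_2^2 ,
    \]

so for a fixed clustering, maximizing within-cluster agreement amounts to maximizing the sum of squared sum-point norms.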

  18. Why is the number of clusters bounded?
  If the clustering is optimal, the sum points will have pairwise negative dot products, i.e. S_i · S_j < 0 for all i ≠ j. If not, this would indicate that clusters i and j on the whole are "similar", and merging them would improve the objective.
  Fact: The maximum number of vectors in R^d with pairwise negative dot products is d+1. [Rankin 1947]
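The merging claim on this slide is a one-line calculation, spelled out here (my gloss, not from the slides):

    \[
    \| S_i + S_j \|_2^2
    = \| S_i \|_2^2 + \| S_j \|_2^2 + 2\, S_i^\top S_j
    \;\ge\; \| S_i \|_2^2 + \| S_j \|_2^2
    \quad \text{whenever } S_i^\top S_j \ge 0 ,
    \]

so if any two sum points fail to have a negative dot product, merging their clusters does not decrease the objective, and the clustering was not strictly optimal.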

  19-20. Our problem can be seen as a special case of the vector partition problem
  • The vector partition problem can be solved in polynomial time by visiting the vertices of the d^2-dimensional signing zonotope. [Onn & Schulman 2001]
  • This leads to a polynomial-time algorithm for rank-d positive semidefinite CC.
  • In practice, we developed a faster heuristic that samples vertices of the zonotope.

  21. Observation: assuming low-rank edge weights leads to new complexity results, new algorithms, and connections to other problems.
  • General question: what other special weighted versions of correlation clustering lead to specialized algorithms and new connections?

  22. A new idea: simple but unequal weights for positive and negative edges
  • Assign weights with respect to a resolution parameter λ ∈ (0,1): each positive edge gets weight 1-λ and each negative edge gets weight λ (the weights that appear in the two-cluster derivation below).
  • There is no particular connection to low-rank correlation clustering. However, it similarly leads to:
  • New complexity results and algorithms
  • Connections to other known partitioning problems

  23-24. This is motivated by applications to graph clustering
  • Given G = (V,E), construct a signed graph G' = (V, E+, E−), an instance of correlation clustering: node pairs joined by an edge in G become positive edges, and pairs with no edge become negative edges.
  • With unit weights, this correlation clustering instance is the same as a problem called cluster editing.
  • The parameter λ controls your interpretation of the existence or absence of an edge in a network (see the construction sketched below).
  [Figure: a graph and the signed graph built from it, with + edges where G has edges and − edges elsewhere]
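A minimal sketch of this construction, assuming the 1-λ / λ weighting from slide 22 (the function and variable names are mine):

    import itertools

    def lambda_cc_instance(n, edges, lam):
        # Pairs that are edges in G become positive edges with weight
        # 1 - lam; non-adjacent pairs become negative edges with weight
        # lam. Returns {(i, j): (sign, weight)} with i < j.
        edge_set = {tuple(sorted(e)) for e in edges}
        signed = {}
        for i, j in itertools.combinations(range(n), 2):
            if (i, j) in edge_set:
                signed[(i, j)] = (+1, 1.0 - lam)
            else:
                signed[(i, j)] = (-1, lam)
        return signed

    # A 4-node path 0-1-2-3 with lambda = 0.3
    print(lambda_cc_instance(4, [(0, 1), (1, 2), (2, 3)], 0.3))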

  25-26. LambdaCC generalizes several graph clustering objectives
  [Diagram: objectives recovered at particular values of λ; standard (unit-weight) LambdaCC captures sparsest cut, cluster deletion, and correlation clustering (cluster editing), while the degree-weighted variant captures modularity and normalized cut; m = |E|]
  Let's take a quick look at two of these: sparsest cut and cluster deletion!

  27-28. Sparse and dense clustering objectives
  • Sparsest cut: partition the graph to minimize the ratio cut(S) / (|S||S̄|) (the scaled sparsest cut; more on this below).
  • Cluster deletion: minimize the number of edges removed to partition the graph into cliques.

  29. Consider a restriction to two clusters, S and its complement S̄
  • Positive mistakes: (1 − λ) cut(S)
  • Negative mistakes: λ |E−| − λ [ |S||S̄| − cut(S) ]
  • Total weight of mistakes: cut(S) − λ |S||S̄| + λ |E−|

  30-33. Two-cluster LambdaCC can be written as
  minimize over S: cut(S) − λ |S||S̄| + λ |E−|,
  where the λ |E−| term is constant.
  Note: the remaining part, cut(S) − λ |S||S̄|, is negative exactly when cut(S) / (|S||S̄|) < λ.
  This is a scaled version of sparsest cut!

  34. The relationship with sparsest cut holds in general
  The general LambdaCC objective can be written in the form reconstructed below.
  Theorem. Minimizing this objective produces clusters with scaled sparsest cut at most λ (if such clusters exist). Moreover, there exists some λ' such that minimizing LambdaCC returns the minimum sparsest cut partition.
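The general objective on the slide is an image; here is a reconstruction under the standard unit-weight 1-λ / λ scheme, with x_uv = 1 when u and v are placed in different clusters (an assumption on my part, matching the two-cluster case above):

    \[
    \min_{x} \;
    \sum_{(u,v) \in E} (1-\lambda)\, x_{uv}
    \;+\;
    \sum_{(u,v) \notin E} \lambda\, (1 - x_{uv}),
    \]

subject to x encoding a valid clustering (binary x_uv satisfying the triangle inequalities x_uv ≤ x_uw + x_wv).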

  35. For large λ, LambdaCC generalizes cluster deletion
  Cluster deletion is correlation clustering with infinite penalties on negative edges. We show this is equivalent to LambdaCC for the right choice of λ, where the negative-edge weight far outweighs the positive-edge weight (λ ≫ 1−λ).

  36. Algorithms for LambdaCC
  • Adapting the approach of van Zuylen and Williamson, we obtain new algorithms based on LP relaxations:
  • ThreeLP: a 3-approximation for LambdaCC when λ > 1/2
  • TwoLP: a 2-approximation for cluster deletion (the best known approximation for cluster deletion!)
  • We also provide scalable heuristic algorithms:
  • Lambda-Louvain: based on the Louvain method for modularity
  • GrowCluster: a greedy agglomeration technique (a sketch follows below)
  [A. van Zuylen and D. P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. Mathematics of Operations Research, 34(3):594–620, 2009.]
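The talk does not spell out GrowCluster's details. Here is a minimal sketch of a greedy agglomeration heuristic in that spirit, not the authors' actual algorithm: it grows one cluster at a time from a seed, adding a node whenever that lowers the unit-weight LambdaCC mistakes (all names are mine).

    def grow_clusters(n, edges, lam):
        # Adding node u to cluster C trades (1 - lam) per neighbor of u
        # inside C (a positive edge no longer cut) against lam per
        # non-neighbor inside C (a negative edge newly inside).
        adj = {i: set() for i in range(n)}
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        unassigned = set(range(n))
        clusters = []
        while unassigned:
            cluster = {unassigned.pop()}
            improved = True
            while improved:
                improved = False
                for u in list(unassigned):
                    inside = len(adj[u] & cluster)
                    delta = lam * (len(cluster) - inside) - (1 - lam) * inside
                    if delta < 0:  # adding u lowers total mistakes
                        cluster.add(u)
                        unassigned.remove(u)
                        improved = True
            clusters.append(cluster)
        return clusters

    print(grow_clusters(4, [(0, 1), (1, 2), (2, 3)], 0.3))

Larger λ makes non-neighbors more costly, so this sketch naturally produces smaller, denser clusters as λ grows, mirroring the role of λ in LambdaCC.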

  37-40. We cluster social networks with various λ to understand the correlation between communities and metadata attributes
  • For each attribute (student/faculty status, graduation year, dorm), we plot the probability that two people who share a cluster also share that attribute, together with the probability that they share a related fake attribute (computed as sketched below).
  • The gap between the two curves shows that there is a noticeable correlation between each attribute and the clustering.
  [Plots: Cornell University (Facebook100)]
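The plotted quantity is straightforward to compute. A minimal sketch (my code, assuming cluster and attribute labels per node):

    from itertools import combinations

    def shared_attribute_probability(cluster_of, attr_of):
        # P(two nodes share an attribute | they share a cluster), given
        # dicts mapping each node to its cluster id and attribute value.
        same_cluster = same_both = 0
        for u, v in combinations(cluster_of, 2):
            if cluster_of[u] == cluster_of[v]:
                same_cluster += 1
                if attr_of[u] == attr_of[v]:
                    same_both += 1
        return same_both / same_cluster if same_cluster else 0.0

    clusters = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
    dorm     = {'a': 'X', 'b': 'X', 'c': 'X', 'd': 'Y'}
    print(shared_attribute_probability(clusters, dorm))  # 0.5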

  41-43. The same experiment on Swarthmore and Yale
  • Student/faculty status and graduation year peak early.
  • The dorm attribute is more correlated with small, dense communities.
  [Plots: Swarthmore and Yale (Facebook100)]

  44. Conclusions and other work
  • We've considered several special cases of correlation clustering that come with novel approximation guarantees and are motivated by different applications in data science.
  Other work:
  • Solving the LP relaxation of CC (with James Saunderson)
  • Choosing λ for LambdaCC
  • Higher-order correlation clustering
  Future work:
  • Correlation clustering for record linkage
  • Practical algorithms for higher-order correlation clustering
  • New questions about other low-rank objectives

  45. Thanks!
  Papers. arXiv:1611.07305 (WWW 2017), arXiv:1712.05825 (WWW 2018), arXiv:1809.09493 (ISAAC, to appear), arXiv:1809.01678 (submitted)
  Software. GitHub: nveldt/LamCC, nveldt/MetricOptimization
  With David Gleich (Purdue), Tony Wirth (Melbourne), and James Saunderson (Monash)
