Clustering and Modularity

rhonda

Presentation Transcript


  1. Clustering and Modularity

  2. Global View of Clustering: Clustering is a data mining technique for analyzing structure in a data set. Many different criteria are available: k-center, k-median, k-means, inter-intra, etc.

  3. k-Center: Minimize the maximum distance

  4. k-median: Minimize the average distance. k-means: Minimize the squared distance
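The three center-based criteria from slides 3 and 4 can be compared on one toy data set. This is a minimal sketch, assuming Euclidean distance and that each point is charged to its nearest center; the function names and the sample points are illustrative and do not come from the slides. The k-means cost is averaged here to parallel the k-median slide, although it is often written as a plain sum.

```python
import math

def nearest_center_distances(points, centers):
    """Distance from each point to its closest center (Euclidean, an assumption)."""
    return [min(math.dist(p, c) for c in centers) for p in points]

def k_center_cost(points, centers):
    """k-center: the maximum distance from any point to its center."""
    return max(nearest_center_distances(points, centers))

def k_median_cost(points, centers):
    """k-median: the average distance from a point to its center."""
    d = nearest_center_distances(points, centers)
    return sum(d) / len(d)

def k_means_cost(points, centers):
    """k-means: the average squared distance from a point to its center."""
    d = nearest_center_distances(points, centers)
    return sum(x * x for x in d) / len(d)

# Hypothetical data: two well-separated groups, one center per group.
points = [(0, 0), (0, 2), (10, 0), (10, 4)]
centers = [(0, 1), (10, 2)]
print(k_center_cost(points, centers))  # 2.0
print(k_median_cost(points, centers))  # 1.5
print(k_means_cost(points, centers))   # 2.5
```

The same centers get different scores under each criterion, which is why the choice of objective matters before any algorithm is picked.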

  5. Inter-Intra: Maximize $D(C) - T(C)$, where $D(C)$ is the distance between clusters and $T(C)$ is the tightness of the clusters

  6. Axioms of Clustering • Clustering function: $f$ operates on a set $S$ of more than 2 points and the distances among them, $f(d) = \Gamma$, where $\Gamma$ is a partition of $S$ • Distance function: the distance $d(i,j)$ is 0 only for $i = j$ • Does not require the triangle inequality.

  7. Axiom 1 – Scale-Invariance • For any distance function $d$ and any $\alpha > 0$ we have that $f(d) = f(\alpha \cdot d)$

  8. Axiom 2 – Richness • Range($f$) is equal to the set of all partitions of $S$ • All possible clusterings can be generated given the right distances

  9. Axiom 3 – Consistency • Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in a cluster is no greater than in $d$ ($d'(i,j) \le d(i,j)$) and the distance between inter-cluster points is no smaller than in $d$ ($d'(i,j) \ge d(i,j)$), then $f(d') = \Gamma$

  10. Main Result: For each $n \ge 2$, there is no clustering function $f$ that satisfies Scale-Invariance, Richness and Consistency. Implied by the proof that if $f$ satisfies Scale-Invariance and Consistency, then Range($f$) is an anti-chain

  11. Graph Clustering

  12. What is wrong with just using cuts?

  13. Sparsest Cut: Given a graph $G = (V, E)$, find a cut $(S, \bar{S})$ that minimizes $\frac{|E(S, \bar{S})|}{|S| \cdot |\bar{S}|}$. Favors min cuts that are approximately balanced. ARV gives an $O(\sqrt{\log n})$ approximation to this problem

  14. Modules/communities are statistically significant

  15. The Null Hypothesis The expected degree model

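  16. Modularity Definition: Deviation for a subset $S \subseteq V$: $\mathrm{dev}(S) = \sum_{i,j \in S} \left(A_{ij} - \frac{d_i d_j}{2m}\right)$. Modularity of a cut $(S, \bar{S})$: $Q = \frac{1}{2m}\left(\mathrm{dev}(S) + \mathrm{dev}(\bar{S})\right)$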
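As a rough illustration of the definition on slide 16, the sketch below computes modularity for a labelled graph. It assumes the usual Newman form under the expected degree model, where a cluster with $e_c$ internal edges and total degree $\mathrm{vol}_c$ contributes $e_c/m - (\mathrm{vol}_c/2m)^2$; the transcript lost the slide's own formula, so this normalization is an assumption, as are the function name and the example graph.

```python
def modularity(edges, labels):
    """Modularity of a partition under the expected degree null model.

    edges  : list of undirected edges (u, v), no self-loops
    labels : mapping vertex -> cluster id
    Assumed form: Q = sum over clusters c of e_c/m - (vol_c / 2m)^2.
    """
    m = len(edges)
    internal = {}  # cluster -> number of edges with both ends inside
    volume = {}    # cluster -> sum of degrees of its vertices
    for u, v in edges:
        for w in (u, v):
            volume[labels[w]] = volume.get(labels[w], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(internal.get(c, 0) / m - (vol / (2 * m)) ** 2
               for c, vol in volume.items())

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {v: "all" for v in range(6)}
print(modularity(edges, split))   # the natural cut scores above zero
print(modularity(edges, lumped))  # a single cluster scores zero
```

Putting everything in one cluster always gives zero under this form, since the observed and expected edge counts coincide.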

  17. Computational Questions: How can we find a cut $(S, \bar{S})$ that maximizes the modularity? How can we find multiple components?

  18. Greedy Modularity • Start with every vertex in its own cluster • For every pair of clusters, check the modularity value if joined • Join the two with the largest increase in modularity • Stop when only one cluster remains
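The greedy procedure on slide 18 can be sketched directly. This is a deliberately naive version that recomputes modularity for every candidate merge, assuming the Newman form of modularity; returning the best partition seen along the way is an added convenience, not something the slide states. All names are illustrative.

```python
from itertools import combinations

def modularity(edges, labels):
    """Q = sum over clusters of e_c/m - (vol_c / 2m)^2 (assumed Newman form)."""
    m = len(edges)
    internal, volume = {}, {}
    for u, v in edges:
        for w in (u, v):
            volume[labels[w]] = volume.get(labels[w], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(internal.get(c, 0) / m - (vol / (2 * m)) ** 2
               for c, vol in volume.items())

def greedy_modularity(nodes, edges):
    """Merge the best pair of clusters until one cluster remains."""
    labels = {v: v for v in nodes}  # every vertex in its own cluster
    best_q, best_labels = modularity(edges, labels), dict(labels)
    merges = []                     # (kept id, absorbed id, Q after merge)
    while len(set(labels.values())) > 1:
        candidates = []
        for a, b in combinations(sorted(set(labels.values())), 2):
            trial = {v: (a if c == b else c) for v, c in labels.items()}
            candidates.append((modularity(edges, trial), a, b, trial))
        q, a, b, labels = max(candidates, key=lambda t: t[0])
        merges.append((a, b, q))
        if q > best_q:
            best_q, best_labels = q, dict(labels)
    return best_labels, best_q, merges

# Hypothetical example: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels, q, merges = greedy_modularity(range(6), edges)
print(labels, q)
```

The `merges` list is the dendrogram in order of construction; cutting it at the step with the highest modularity gives the returned partition. Practical implementations track the change in modularity per merge instead of recomputing it from scratch.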

  19. The Dendrogram (image from Newman)

  20. An Aside on HAC (hierarchical agglomerative clustering) • A very general and popular clustering algorithm • Usually used for point clustering, not graphs • It is an algorithm, not an objective

  21. Agglomerative vs. Divisive Clustering • Agglomerative (bottom-up) • each object in its own cluster • repeatedly merge clusters • Divisive (top-down) • all objects in one cluster • repeatedly split clusters

  22. HAC: 3 ways to use the distance metric • Single Link: min distance between points in different clusters • Complete Link: max distance between points in different clusters • Group Average: average distance between points in different clusters • This usually approximates some clustering objective
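The three linkage rules differ only in how they reduce the set of cross-cluster distances to one number. A minimal sketch, assuming Euclidean points and a target number of clusters `k` as the stopping rule (the slide does not fix one); names and data are illustrative.

```python
import math
from itertools import combinations

# Each linkage reduces the list of cross-cluster distances to a single value.
LINKAGES = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # group average
}

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    return LINKAGES[linkage]([math.dist(p, q) for p in a for q in b])

def hac(points, k, linkage="single"):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

# Hypothetical 1-D data with an obvious gap.
points = [(0,), (1,), (2,), (10,), (11,)]
for name in LINKAGES:
    print(name, hac(points, 2, name))
```

On data this cleanly separated all three rules agree; they diverge on elongated or noisy clusters, where single link tends to chain points together.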

  23. Inter-Intra • New Objective Function: G(C) • Maximize (Distance between clusters – Tightness), i.e. $D(C) - T(C)$ • Single Linkage HAC exactly optimizes this objective. • Most clustering problems are NP-hard, so this is a rarity. • Note: No k.

  24. Reminder of Axioms • Scale-Invariance: For any distance function $d$ and any $\alpha > 0$ we have that $f(d) = f(\alpha \cdot d)$ • Richness: Range($f$) is equal to the set of all partitions of $S$ • Consistency: Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in a cluster is no greater than in $d$ and the distance between inter-cluster points is no smaller than in $d$, then $f(d') = \Gamma$

  25. Any two axioms • For every pair of axioms, there is a stopping condition for single linkage that satisfies both • Consistency + Richness: only link if the distance is less than $r$ • Consistency + Scale-Invariance: stop when you have $k$ connected components • Richness + Scale-Invariance: if $x$ is the diameter of the data points, only add edges with weight at most $\beta x$
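The three stopping conditions can be tried on single linkage to see which ones survive a rescaling of the distances. This is a sketch under assumptions: points are Euclidean, single linkage is implemented Kruskal-style with union-find, and the rule names `"distance-r"`, `"k-cluster"` and `"scale-beta"` are labels chosen here, not taken from the slides.

```python
import math
from itertools import combinations

def single_linkage(points, stop, param):
    """Single linkage (needs at least 2 points) with one of three stopping rules.

    stop = "distance-r": only link pairs at distance < param
    stop = "k-cluster" : stop once param components remain
    stop = "scale-beta": only link pairs at distance <= param * diameter
    Returns clusters as sorted lists of point indices.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def dist(i, j):
        return math.dist(points[i], points[j])

    pairs = sorted(combinations(range(n), 2), key=lambda ij: dist(*ij))
    diameter = dist(*pairs[-1])  # largest pairwise distance
    components = n
    for i, j in pairs:  # shortest edges first
        d = dist(i, j)
        if stop == "distance-r" and d >= param:
            break
        if stop == "k-cluster" and components <= param:
            break
        if stop == "scale-beta" and d > param * diameter:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            components -= 1
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical data, and the same data with every distance multiplied by 10.
points = [(0,), (1,), (2,), (10,), (11,)]
scaled = [(10 * p[0],) for p in points]
for stop, param in [("distance-r", 1.5), ("k-cluster", 2), ("scale-beta", 0.2)]:
    print(stop, single_linkage(points, stop, param), single_linkage(scaled, stop, param))
```

Only the fixed-radius rule changes its answer after scaling, which matches the slide: it is the rule paired with Consistency + Richness, not with Scale-Invariance.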

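  26. Spectral Modularity: Formulate modularity as a matrix calculation: $Q = \frac{1}{4m} s^T B s$, where $s_i \in \{-1, +1\}$ marks the side of the cut containing vertex $i$ and $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$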

  27. Spectral Modularity: Find $s$ that maximizes $s^T B s$ s.t. $s_i \in \{-1, +1\}$ for all $i$. Solution: Relax! Find the top eigenvector of $B$ and round the entries based on the sign.
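The relax-and-round step can be sketched without any linear algebra library. The code assumes Newman's modularity matrix $B_{ij} = A_{ij} - d_i d_j / 2m$ and finds its top eigenvector by power iteration on a shifted copy of $B$; the shift, the starting vector, and the iteration count are choices made here for illustration, and an eigenvalue routine from a numerical library would normally be used instead.

```python
def spectral_bisection(n, edges, iters=500):
    """Split vertices 0..n-1 in two by the sign of B's top eigenvector."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1.0
        A[v][u] += 1.0
    deg = [sum(row) for row in A]
    two_m = sum(deg)
    # Modularity matrix: observed minus expected edges (assumed Newman form).
    B = [[A[i][j] - deg[i] * deg[j] / two_m for j in range(n)] for i in range(n)]
    # Adding shift * I makes every eigenvalue non-negative, so power iteration
    # converges to the algebraically largest eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    x = [1.0 + 0.01 * i for i in range(n)]  # arbitrary non-constant start
    for _ in range(iters):
        y = [sum(B[i][j] * x[j] for j in range(n)) + shift * x[i] for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    s = [1 if v >= 0 else -1 for v in x]  # round by sign
    q = sum(B[i][j] * s[i] * s[j] for i in range(n) for j in range(n)) / (2 * two_m)
    return s, q

# Hypothetical example: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
s, q = spectral_bisection(6, edges)
print(s, q)
```

The overall sign of an eigenvector is arbitrary, so only the grouping of equal signs is meaningful. Rounding gives a valid cut but not necessarily the optimal one.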

  28. Spectral Modularity vs Partitioning: Laplacian of a graph: $L = D - A$. Modularity Matrix: $B = A - \frac{d d^T}{2m}$

  29. Newman 2006

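  30. Agarwal, Kempe Paper: Modularity as a Linear Program. Attempt 1 at bounds: Relax the integrality constraint and solve the fractional LP. Gives an upper bound on the modularity value. Maximize $\frac{1}{2m} \sum_{u,v} \left(A_{uv} - \frac{d_u d_v}{2m}\right)(1 - x_{uv})$ subject to $x_{uw} \le x_{uv} + x_{vw}$ for all $u, v, w$ and $0 \le x_{uv} \le 1$ for all $u, v$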

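  31. Rounding the LP • While $V$ is not empty • Select a vertex $u \in V$ • Take $T$ to be all vertices within distance ½ of $u$ • If the average distance in $T$ is less than ¼, make $T \cup \{u\}$ a cluster • Else make $\{u\}$ a cluster • Remove the clustered vertices from $V$.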
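The rounding loop can be sketched as follows. The transcript lost the slide's symbols, so several details are assumptions: the LP solution is taken to be a symmetric table `x[u][v]` of fractional distances in [0, 1], "select" is implemented as "take the first remaining vertex", and the average is taken over distances from the selected vertex to the candidate set. No LP solver is involved; the distances below are made up.

```python
def round_lp(vertices, x):
    """Turn fractional distances x[u][v] into clusters by ball-growing.

    Assumed reading of the slide: pick a vertex u, collect the remaining
    vertices within distance 1/2 of u, and keep them as one cluster only if
    their average distance from u is below 1/4; otherwise u stays alone.
    """
    remaining = list(vertices)
    clusters = []
    while remaining:
        u = remaining[0]  # arbitrary selection rule (an assumption)
        ball = [v for v in remaining if v != u and x[u][v] <= 0.5]
        if ball and sum(x[u][v] for v in ball) / len(ball) < 0.25:
            cluster = [u] + ball
        else:
            cluster = [u]
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters

def symmetric(pairs, n):
    """Helper: build a full symmetric distance table from listed pairs."""
    x = [[0.0] * n for _ in range(n)]
    for (u, v), d in pairs.items():
        x[u][v] = x[v][u] = d
    return x

# Hypothetical fractional solution: 0, 1, 2 are close together, 3 is far away.
x = symmetric({(0, 1): 0.1, (0, 2): 0.2, (0, 3): 0.9,
               (1, 2): 0.3, (1, 3): 0.8, (2, 3): 0.9}, 4)
print(round_lp(range(4), x))
```

The ¼ test is what keeps a loosely connected ball from being accepted as a cluster: vertices that are individually within ½ but far on average leave the selected vertex as a singleton.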

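  32. Quadratic Program Formulation: Maximize $\frac{1}{4m} \sum_{u,v} \left(A_{uv} - \frac{d_u d_v}{2m}\right)(1 + s_u s_v)$ subject to $s_v^2 = 1$ for all $v$. MAX-CUT QP: Maximize $\frac{1}{2} \sum_{(u,v) \in E} (1 - s_u s_v)$ subject to $s_v^2 = 1$ for all $v$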

  33. Extending the Definition to $k$ Modules: How do we extend the definition of modularity to $k$ modules?

  34. Critiques • What value of modularity actually indicates something interesting? • Clauset et al: 0.3 • Guimera et al: G(n,p) graphs can have modularity 0.3

  35. Resolution Limit What is wrong with the null hypothesis? We see a lot of locality in real networks, so assuming you could connect to anyone in the network isn’t right.

  36. Resolution Limit What happens with big graphs? Degree 3 expander graph

  37. Resolution Limit • If G is large enough, small cliques will be merged.
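A small experiment makes the merging effect concrete. The sketch builds a ring of cliques, a standard test case for the resolution limit that is not necessarily the example on the slides, and compares two partitions: one cluster per clique versus adjacent cliques merged in pairs. Modularity is assumed to have the Newman form; names and sizes are illustrative, and the number of cliques should be even so that the pairs line up.

```python
from itertools import combinations

def modularity(edges, labels):
    """Q = sum over clusters of e_c/m - (vol_c / 2m)^2 (assumed Newman form)."""
    m = len(edges)
    internal, volume = {}, {}
    for u, v in edges:
        for w in (u, v):
            volume[labels[w]] = volume.get(labels[w], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(internal.get(c, 0) / m - (vol / (2 * m)) ** 2
               for c, vol in volume.items())

def ring_of_cliques(num_cliques, size):
    """Cliques of `size` vertices, each joined to the next by one edge."""
    edges = []
    for c in range(num_cliques):
        first = c * size
        edges.extend(combinations(range(first, first + size), 2))
        next_first = ((c + 1) % num_cliques) * size
        edges.append((first + size - 1, next_first))  # bridge to next clique
    return edges

def compare(num_cliques, size=5):
    """Return (Q with one cluster per clique, Q with adjacent cliques paired)."""
    edges = ring_of_cliques(num_cliques, size)
    n = num_cliques * size
    one_per_clique = {v: v // size for v in range(n)}
    merged_pairs = {v: v // (2 * size) for v in range(n)}
    return modularity(edges, one_per_clique), modularity(edges, merged_pairs)

for k in (6, 30):
    separate, paired = compare(k)
    print(k, round(separate, 4), round(paired, 4))
```

With 6 cliques the natural partition scores higher, but with 30 cliques the paired partition wins even though each clique is as dense as a community can be. The threshold depends on the total number of edges, which is why the effect only appears once the graph is large enough.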
