Clustering and Modularity

Global View of Clustering
Clustering is a data mining technique for analyzing structure in a data set. Many different criteria are available:
• k-center
• k-median
• k-means
• Inter-Intra
• etc.

k-center
Minimize the maximum distance from any point to its nearest center.

k-median: minimize the average distance from each point to its nearest center.
k-means: minimize the average squared distance from each point to its nearest center.
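To make the difference between the three center-based criteria concrete, the sketch below scores one fixed choice of centers under each of them. The function name `clustering_costs`, the sample points, and the Euclidean metric are assumptions made for illustration; the slides do not fix a metric.

```python
import math

def clustering_costs(points, centers):
    """Score a fixed set of centers under the three objectives.

    Each point is charged the Euclidean distance to its nearest center
    (the metric is an assumption of this sketch).
    """
    nearest = [min(math.dist(p, c) for c in centers) for p in points]
    n = len(points)
    return {
        "k-center": max(nearest),                    # worst-case distance
        "k-median": sum(nearest) / n,                # average distance
        "k-means": sum(d * d for d in nearest) / n,  # average squared distance
    }

# Hypothetical data: three points on a line, two candidate centers.
points = [(0, 0), (2, 0), (10, 0)]
centers = [(0, 0), (10, 0)]
costs = clustering_costs(points, centers)
```

Averaging instead of summing only rescales the k-median and k-means costs by 1/n, so the best choice of centers is unchanged.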

Inter-Intra
Maximize $D(C) - T(C)$, where $D(C)$ is the distance between clusters and $T(C)$ is the tightness within clusters.

Axioms of Clustering
• Clustering function: operates on a set $S$ of $n \ge 2$ points and the distances among them: $f(d) = \Gamma$, where $\Gamma$ is a partition of $S$
• Distance function: the distance $d(i,j)$ is 0 only for $i = j$
• Does not require the triangle inequality.

Axiom 1: Scale-Invariance
• For any distance function $d$ and any $\alpha > 0$, we have that $f(d) = f(\alpha \cdot d)$

Axiom 2: Richness
Range($f$) is equal to the set of all partitions of $S$.
All possible clusterings can be generated given the right distances.

Axiom 3: Consistency
Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in the same cluster is no larger than in $d$ ($d'(i,j) \le d(i,j)$) and the distance between inter-cluster points is no smaller than in $d$ ($d'(i,j) \ge d(i,j)$), then $f(d') = \Gamma$.

Main Result
For each $n \ge 2$, there is no clustering function $f$ that satisfies Scale-Invariance, Richness and Consistency.
Implied by the proof that if $f$ satisfies Scale-Invariance and Consistency, then Range($f$) is an anti-chain (no partition in the range refines another), which rules out Richness.

Graph Clustering

What is wrong with just using cuts?

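Sparsest Cut
Given a graph $G = (V, E)$, find a cut $(S, \bar{S})$ that minimizes $\frac{|E(S, \bar{S})|}{|S| \cdot |\bar{S}|}$.
Favors min cuts that are approximately balanced.
ARV (Arora, Rao, Vazirani) gives an $O(\sqrt{\log n})$ approximation to this problem.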

Modules/communities are statistically significant

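The Null Hypothesis
The expected degree model: edge $(i,j)$ appears with probability $\frac{d_i d_j}{2m}$, so every vertex keeps its degree $d_i$ in expectation ($m$ is the number of edges).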

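Modularity Definition
Let $A$ be the adjacency matrix, $d_i$ the degree of vertex $i$, and $m$ the number of edges.
Deviation for a subset $S \subseteq V$: $\mathrm{Dev}(S) = \sum_{i,j \in S} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$
Modularity of a cut $(S, \bar{S})$: $Q = \frac{1}{2m} \left( \mathrm{Dev}(S) + \mathrm{Dev}(\bar{S}) \right)$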
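A small sketch of how modularity can be computed for a given partition, using the standard per-module form (fraction of edges inside the module minus the squared fraction of total degree it holds). The function name `modularity` and the two-triangle example graph are illustrative assumptions, and the code assumes a simple undirected graph with no self-loops.

```python
def modularity(edges, communities):
    """Modularity of a partition of a simple undirected graph.

    For each module: (edges inside) / m  -  (total degree / 2m)^2,
    which equals (1/2m) * sum over same-module pairs (A_ij - d_i d_j / 2m).
    """
    m = len(edges)
    label = {v: c for c, group in enumerate(communities) for v in group}
    intra = [0] * len(communities)    # edges with both endpoints in the module
    volume = [0] * len(communities)   # sum of degrees in the module
    for u, v in edges:
        volume[label[u]] += 1
        volume[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(e / m - (vol / (2 * m)) ** 2 for e, vol in zip(intra, volume))

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(edges, [[0, 1, 2], [3, 4, 5]])
q_whole = modularity(edges, [[0, 1, 2, 3, 4, 5]])
```

Putting every vertex in one module always gives modularity 0, which is why a positive value signals more internal edges than the null model expects.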

Computational Questions
How can we find $S$ to maximize the modularity?
How can we find multiple components?

Greedy Modularity
• Start with every vertex in its own cluster.
• For every pair of clusters, compute the modularity value if they were joined.
• Join the two with the largest increase in modularity.
• Stop when only one cluster remains.
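The greedy procedure can be sketched as below. This is a deliberately naive version that recomputes modularity from scratch for every candidate merge; the names `greedy_modularity` and `modularity` and the example graph are assumptions for illustration. Returning the best partition seen along the way, rather than the final single cluster, is also an assumption of this sketch.

```python
from itertools import combinations

def modularity(edges, communities):
    """(edges inside)/m - (degree sum / 2m)^2, summed over modules."""
    m = len(edges)
    label = {v: c for c, group in enumerate(communities) for v in group}
    intra = [0] * len(communities)
    volume = [0] * len(communities)
    for u, v in edges:
        volume[label[u]] += 1
        volume[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(e / m - (vol / (2 * m)) ** 2 for e, vol in zip(intra, volume))

def greedy_modularity(vertices, edges):
    """Merge clusters greedily; return the best (modularity, partition) seen."""
    clusters = [frozenset([v]) for v in vertices]
    best_q, best = modularity(edges, clusters), list(clusters)
    while len(clusters) > 1:
        top_q, top = None, None
        # Try every pair of clusters and keep the merge with the highest modularity.
        for a, b in combinations(range(len(clusters)), 2):
            rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
            candidate = rest + [clusters[a] | clusters[b]]
            q = modularity(edges, candidate)
            if top_q is None or q > top_q:
                top_q, top = q, candidate
        clusters = top
        if top_q > best_q:
            best_q, best = top_q, list(clusters)
    return best_q, best

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_q, best = greedy_modularity(range(6), edges)
```

The sequence of merges is exactly what a dendrogram records; practical implementations update the modularity gain incrementally instead of recomputing it.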

The Dendrogram
Image from Newman

An Aside on HAC
Hierarchical agglomerative clustering (HAC) is a very general and popular clustering algorithm.
Usually used for point clustering, not graphs.
It is an algorithm, not an objective.

Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up)
  • each object starts in its own cluster
  • repeatedly merge clusters
• Divisive (top-down)
  • all objects start in one cluster
  • repeatedly split clusters

HAC
3 ways to use the distance metric:
• Single Link: min distance between points in different clusters
• Complete Link: max distance between points in different clusters
• Group Average: average distance between points in different clusters
This usually approximates some clustering objective.
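The three linkage rules differ only in how they summarize the pairwise distances between two clusters, as the following sketch shows. The function name `linkage_distance`, the sample clusters, and the Euclidean metric are assumptions for illustration.

```python
import math
from itertools import product

def linkage_distance(a, b, method):
    """Distance between clusters a and b under a given linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    if method == "single":      # closest pair across the two clusters
        return min(pairwise)
    if method == "complete":    # farthest pair across the two clusters
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

# Hypothetical clusters on a line; cross distances are 3, 5, 2 and 4.
a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
single = linkage_distance(a, b, "single")
complete = linkage_distance(a, b, "complete")
average = linkage_distance(a, b, "average")
```

HAC then repeatedly merges the pair of clusters with the smallest linkage distance.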

Inter-Intra
• New objective function: $G(C) = D(C) - T(C)$
• Maximize (distance between clusters - tightness)
• Single Linkage HAC exactly optimizes this objective.
• Most clustering problems are NP-hard, so this is a rarity.
• Note: No k.

Reminder of Axioms
• Scale-Invariance: For any distance function $d$ and any $\alpha > 0$, we have that $f(d) = f(\alpha \cdot d)$
• Richness: Range($f$) is equal to the set of all partitions of $S$
• Consistency: Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in the same cluster is no larger than in $d$ and the distance between inter-cluster points is no smaller than in $d$, then $f(d') = \Gamma$

Any two axioms
• For every pair of axioms, there is a stopping condition for single linkage that satisfies both
• Consistency + Richness: only link if the distance is less than $r$
• Consistency + Scale-Invariance: stop when you have $k$ connected components
• Richness + Scale-Invariance: if $x$ is the diameter of the data points, only add edges with weight at most $\beta x$
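The three stopping conditions can be sketched on top of one single-linkage routine that adds edges in order of increasing distance. The names `single_linkage`, `stop_at_k`, `stop_at_r`, and `stop_at_scale`, and the sample points, are assumptions for illustration.

```python
from itertools import combinations

def single_linkage(n, d, stop):
    """Single linkage on points 0..n-1; d maps (i, j) with i < j to a distance.

    stop(weight, components) returns True when linking should halt.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for (i, j), w in sorted(d.items(), key=lambda kv: kv[1]):
        if stop(w, components):
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

def stop_at_k(k):             # Consistency + Scale-Invariance
    return lambda w, components: components <= k

def stop_at_r(r):             # Consistency + Richness
    return lambda w, components: w >= r

def stop_at_scale(beta, d):   # Richness + Scale-Invariance
    diameter = max(d.values())
    return lambda w, components: w > beta * diameter

# Hypothetical points on a line, with absolute difference as the distance.
xs = [0, 1, 2, 10, 11, 30]
d = {(i, j): abs(xs[i] - xs[j]) for i, j in combinations(range(len(xs)), 2)}
by_k = single_linkage(len(xs), d, stop_at_k(3))
by_r = single_linkage(len(xs), d, stop_at_r(5))
by_scale = single_linkage(len(xs), d, stop_at_scale(0.3, d))

# Rescaling every distance leaves the scale-based rule unchanged.
d7 = {pair: 7 * w for pair, w in d.items()}
by_scale_7 = single_linkage(len(xs), d7, stop_at_scale(0.3, d7))
```

Rescaling the distances changes the result of the fixed-threshold rule `stop_at_r`, which is why that rule gives up Scale-Invariance.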

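Spectral Modularity
Formulate modularity as a matrix calculation: $Q = \frac{1}{4m} s^T B s$, where $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$ ($A$ is the adjacency matrix, $d_i$ the degree of vertex $i$, $m$ the number of edges) and $s_i \in \{-1, +1\}$ records which side of the cut vertex $i$ is on.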

Spectral Modularity
Find $s$ that maximizes $s^T B s$ s.t. $s_i \in \{-1, +1\}$ for all $i$, where $B$ is the modularity matrix.
Solution: Relax! Find the top eigenvector of $B$ and round the entries based on the sign.
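A minimal sketch of the relax-and-round step, written in plain Python with a shifted power iteration standing in for a library eigensolver. The function names, the iteration count, the starting vector, and the example graph are assumptions for illustration; a numerical library would normally be used to find the eigenvector.

```python
def modularity_matrix(n, edges):
    """B[i][j] = A[i][j] - d_i * d_j / (2m) for a simple undirected graph."""
    deg = [0] * n
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1.0
        A[v][u] += 1.0
        deg[u] += 1
        deg[v] += 1
    two_m = 2 * len(edges)
    return [[A[i][j] - deg[i] * deg[j] / two_m for j in range(n)] for i in range(n)]

def leading_eigenvector(B, iterations=500):
    """Power iteration on B + shift*I.

    The shift (a Gershgorin bound) makes every eigenvalue non-negative, so the
    iteration converges to the largest eigenvalue of B, not the largest magnitude.
    """
    n = len(B)
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def spectral_bisection(n, edges):
    """Round the top eigenvector by sign; return (s, modularity of the cut)."""
    B = modularity_matrix(n, edges)
    v = leading_eigenvector(B)
    s = [1 if x >= 0 else -1 for x in v]
    q = sum(B[i][j] * s[i] * s[j] for i in range(n) for j in range(n)) / (4 * len(edges))
    return s, q

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
s, q = spectral_bisection(6, edges)
```

If the largest eigenvalue of $B$ is 0 (its eigenvector is the all-ones vector), the rounding puts every vertex on one side, which corresponds to leaving the graph unsplit.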

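Spectral Modularity vs Partitioning
Laplacian of a graph: $L = D - A$, where $D$ is the diagonal matrix of degrees
Modularity matrix: $B = A - \frac{d d^T}{2m}$, where $d$ is the vector of degrees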

Newman 2006

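Agarwal, Kempe Paper
Modularity as a Linear Program
Let $x_{uv} = 0$ if $u$ and $v$ are in the same cluster and $x_{uv} = 1$ otherwise, and let $B_{uv} = A_{uv} - \frac{d_u d_v}{2m}$.
Attempt 1 at bounds: Relax the constraint $x_{uv} \in \{0, 1\}$ to $0 \le x_{uv} \le 1$ and solve the fractional LP. Gives an upper bound on the modularity value.
Maximize $\frac{1}{2m} \sum_{u,v} B_{uv} (1 - x_{uv})$
Subject to $x_{uw} \le x_{uv} + x_{vw}$ for all $u, v, w$, and $0 \le x_{uv} \le 1$ for all $u, v$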

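Rounding the LP
Let $S$ be the set of vertices not yet clustered, with distances given by the LP solution $x$.
• While $S$ is not empty
  • Select $u \in S$
  • Take $T_u$ to be all vertices of $S$ within distance ½ of $u$
  • If the average distance from $u$ to the other vertices in $T_u$ is less than ¼, make $T_u$ a cluster
  • Else make $\{u\}$ a cluster.
  • Remove the new cluster from $S$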
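The rounding loop can be sketched as follows, taking a fractional solution as a dictionary of pairwise distances in [0, 1]. The name `round_lp`, the rule of always selecting the first remaining vertex, treating "within distance ½" as "at most ½", and the sample distances are all assumptions for illustration; the slide does not specify them.

```python
from itertools import combinations

def round_lp(vertices, x):
    """Round fractional LP distances x[(u, v)] (u < v) into clusters."""
    def dist(u, v):
        return 0.0 if u == v else x[(min(u, v), max(u, v))]

    remaining = list(vertices)
    clusters = []
    while remaining:
        u = remaining[0]  # assumption: pick the first unclustered vertex
        ball = [v for v in remaining if dist(u, v) <= 0.5]
        others = [v for v in ball if v != u]
        if others and sum(dist(u, v) for v in others) / len(others) < 0.25:
            cluster = ball   # the ball around u is tight enough
        else:
            cluster = [u]    # otherwise u becomes a singleton
        clusters.append(sorted(cluster))
        remaining = [v for v in remaining if v not in cluster]
    return clusters

# Hypothetical fractional solution: unlisted pairs are fully separated (1.0).
vertices = [0, 1, 2, 3, 4]
x = {pair: 1.0 for pair in combinations(vertices, 2)}
x.update({(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.1, (3, 4): 0.45})
clusters = round_lp(vertices, x)
```

Here vertices 3 and 4 are within ½ of each other but not within ¼ on average, so each ends up as a singleton.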

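Quadratic Program Formulation
Maximize $\frac{1}{4m} \sum_{u,v} B_{uv} x_u x_v$, where $B_{uv} = A_{uv} - \frac{d_u d_v}{2m}$
Subject to $x_v^2 = 1$ for all $v$
MAX-CUT QP
Maximize $\frac{1}{2} \sum_{(u,v) \in E} (1 - x_u x_v)$
Subject to $x_v^2 = 1$ for all $v$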

Extending the Definition to $k$ Modules
How do we extend the definition of modularity to $k$ modules?

Critiques
• What value of modularity actually indicates something interesting?
• Clauset et al.: 0.3
• Guimerà et al.: G(n,p) graphs can have modularity 0.3

Resolution Limit
What is wrong with the null hypothesis?
We see a lot of locality in real networks, so assuming you could connect to anyone in the network isn’t right.

Resolution Limit
What happens with big graphs?
Example: a degree-3 expander graph

Resolution Limit
• If G is large enough, small cliques will be merged.
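One common way to see this effect is a ring of small cliques, each joined to the next by a single edge. The sketch below compares the modularity of "one module per clique" against "adjacent cliques merged in pairs". The ring construction, the function names, and the clique size of 3 are assumptions chosen for illustration, not taken from the slides.

```python
def ring_of_cliques(n_cliques, size):
    """Cliques of the given size arranged in a ring, one bridge edge between neighbors.

    Assumes n_cliques >= 3 so that the bridge edges are all distinct.
    """
    edges, groups = [], []
    for k in range(n_cliques):
        members = list(range(k * size, (k + 1) * size))
        groups.append(members)
        edges += [(u, v) for i, u in enumerate(members) for v in members[i + 1:]]
        # Bridge from the last vertex of this clique to the first of the next.
        edges.append((members[-1], ((k + 1) % n_cliques) * size))
    return edges, groups

def modularity(edges, communities):
    """(edges inside)/m - (degree sum / 2m)^2, summed over modules."""
    m = len(edges)
    label = {v: c for c, group in enumerate(communities) for v in group}
    intra = [0] * len(communities)
    volume = [0] * len(communities)
    for u, v in edges:
        volume[label[u]] += 1
        volume[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(e / m - (vol / (2 * m)) ** 2 for e, vol in zip(intra, volume))

def compare(n_cliques, size=3):
    """Return (modularity with one module per clique, modularity with paired cliques).

    Assumes n_cliques is even so the cliques pair up exactly.
    """
    edges, groups = ring_of_cliques(n_cliques, size)
    pairs = [groups[k] + groups[k + 1] for k in range(0, n_cliques, 2)]
    return modularity(edges, groups), modularity(edges, pairs)

small = compare(4)    # short ring of triangles
large = compare(30)   # long ring of triangles
```

On the short ring the separate cliques score higher, while on the long ring the merged pairs do, even though the cliques themselves have not changed.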
