The Stability of a Good Clustering
Marina Meila, University of Washington, mmp@stat.washington.edu
• Data (similarities) • Objective • Algorithm (K-means, spectral clustering) • Optimizing these criteria is NP-hard, but that is the worst case • ...but "spectral clustering and K-means work well when a good clustering exists": the interesting case • This talk: if a "good" clustering exists, it is "unique"; if a "good" clustering is found, it is provably good
Results summary • Given • objective = NCut or K-means distortion • data • clustering Y with K clusters • Spectral lower bound on the distortion • If distortion(Y) - lower bound is small • then d(Y, Y_opt) is small, where Y_opt = the best clustering with K clusters
A graphical view [figure: the distortion over the space of clusterings, with the spectral lower bound marked]
Overview • Introduction • Matrix representations for clusterings • Quadratic representation for clustering cost • The misclassification error distance • Results for NCut (easier) • Results for K-means distortion (harder) • Discussion
Clusterings as matrices • Clustering of {1, 2, ..., n} with K clusters (C1, C2, ..., CK) • Represented by an n x K matrix X • unnormalized: X_ik = 1 if i is in C_k, 0 otherwise • normalized: X_ik = 1/sqrt(|C_k|) if i is in C_k, 0 otherwise • All these matrices have orthogonal columns
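The matrix definitions can be made concrete with a short sketch. This is an illustrative reconstruction, not code from the talk; the function name `indicator_matrices` and the example labels are mine, and the normalization shown is the standard scaling of each column by 1/sqrt(|C_k|).

```python
import numpy as np

def indicator_matrices(labels, K):
    """Build the unnormalized and normalized cluster-indicator matrices.

    labels: length-n array with values in {0, ..., K-1}.
    """
    n = len(labels)
    X = np.zeros((n, K))
    X[np.arange(n), labels] = 1.0     # unnormalized: X[i, k] = 1 iff point i is in C_k
    sizes = X.sum(axis=0)             # cluster sizes |C_k|
    X_norm = X / np.sqrt(sizes)       # normalized: columns scaled to unit length
    return X, X_norm

# The normalized matrix has orthonormal columns, as the slide notes.
X, X_norm = indicator_matrices(np.array([0, 0, 1, 2, 2, 2]), K=3)
print(np.allclose(X_norm.T @ X_norm, np.eye(3)))   # True
```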
Distortion is quadratic in X • NCut: NCut(X) = K - trace(X^T A X), with A the normalized similarity matrix • K-means: distortion(X) = trace(G) - trace(X^T G X), with G the Gram matrix of the data
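As a sanity check on the quadratic form, here is a hedged sketch for the K-means case only, assuming the form distortion = trace(G) - trace(X^T G X) with X the normalized indicator; the helper names and the toy data are mine, and the NCut case is analogous with the normalized similarity matrix in place of G.

```python
import numpy as np

def kmeans_distortion_quadratic(Z, labels, K):
    """K-means distortion written as trace(G) - trace(X^T G X), G = Z Z^T."""
    n = len(labels)
    X = np.zeros((n, K))
    X[np.arange(n), labels] = 1.0
    X /= np.sqrt(X.sum(axis=0))     # normalized indicator
    G = Z @ Z.T                     # Gram matrix of the data
    return np.trace(G) - np.trace(X.T @ G @ X)

def kmeans_distortion_direct(Z, labels, K):
    """Usual sum of squared distances to the cluster means, for comparison."""
    return sum(np.sum((Z[labels == k] - Z[labels == k].mean(axis=0)) ** 2)
               for k in range(K))

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 5))
labels = np.repeat(np.arange(3), [7, 7, 6])
print(np.isclose(kmeans_distortion_quadratic(Z, labels, 3),
                 kmeans_distortion_direct(Z, labels, 3)))    # True
```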
The Confusion Matrix • Two clusterings of the same n points • (C1, C2, ..., CK) with cluster sizes n1, ..., nK • (C'1, C'2, ..., C'K') with cluster sizes n'1, ..., n'K' • Confusion matrix M (K x K') with entries m_kk' = |C_k ∩ C'_k'|
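A minimal sketch of the confusion matrix as described: m_kk' counts the points that fall in C_k under the first clustering and in C'_k' under the second. The function name and arguments are mine.

```python
import numpy as np

def confusion_matrix(labels1, labels2, K, K2):
    """M[k, k'] = |C_k ∩ C'_k'| for two labelings of the same n points."""
    M = np.zeros((K, K2), dtype=int)
    for a, b in zip(labels1, labels2):
        M[a, b] += 1
    return M
```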
The Misclassification Error distance • d(C, C') = 1 - (1/n) max over matchings π of the clusters of sum_k m_k,π(k) • computed from the confusion matrix by the maximal bipartite matching algorithm between the clusters of C and C'
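A sketch of this distance computed via maximum-weight bipartite matching on the confusion matrix, here with scipy's assignment solver as one possible implementation of that matching; it assumes K = K', and the function name is mine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_error(labels1, labels2, K):
    """d(C, C') = 1 - (1/n) * (mass matched by the best cluster-to-cluster matching)."""
    n = len(labels1)
    M = np.zeros((K, K), dtype=int)
    for a, b in zip(labels1, labels2):
        M[a, b] += 1                          # confusion matrix
    rows, cols = linear_sum_assignment(-M)    # maximize the matched mass
    return 1.0 - M[rows, cols].sum() / n

# Same partition with permuted labels -> distance 0.
print(misclassification_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], K=3))   # 0.0
```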
Results for NCut • Given • data: similarity matrix A (n x n) • clustering X (n x K) • Lower bound for NCut (M02, YS03, BJ03): NCut(X) >= K - (λ1 + ... + λK), where λ1, ..., λK are the K largest eigenvalues of A • Upper bound for d(X, X_opt) (MSX'05), whenever NCut(X) - (K - λ1 - ... - λK) is small enough
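The lower bound can be evaluated directly; a hedged sketch, assuming the bound takes the form K minus the sum of the K largest eigenvalues of A, as cited from M02 / YS03 / BJ03 above:

```python
import numpy as np

def ncut_lower_bound(A, K):
    """Spectral lower bound: K minus the sum of the K largest eigenvalues of A."""
    eigvals = np.linalg.eigvalsh(A)     # eigenvalues of the symmetric matrix A, ascending
    return K - eigvals[-K:].sum()
```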
Relaxed minimization: minimize the distortion, i.e. maximize trace(X^T A X), s.t. X = n x K orthogonal matrix • Solution: X* = the K principal eigenvectors of A • NCut(X) - lower bound small w.r.t. the eigengap λK - λK+1 ⇒ X close to X* (convexity proof) • Two clusterings X, X' close to X* ⇒ trace(X^T X') large • trace(X^T X') large ⇒ d(X, X') small
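A small sketch of the relaxed problem's solution and of the eigengap referred to above; the function name is mine, and A is assumed symmetric.

```python
import numpy as np

def relaxed_solution_and_gap(A, K):
    """Relaxation: keep only orthonormality of X. The optimum is spanned by the
    K principal eigenvectors of A, and the eigengap lambda_K - lambda_{K+1}
    measures how well separated (hence how stable) that subspace is."""
    eigvals, eigvecs = np.linalg.eigh(A)        # ascending eigenvalues
    X_star = eigvecs[:, -K:]                    # K principal eigenvectors
    eigengap = eigvals[-K] - eigvals[-(K + 1)]
    return X_star, eigengap
```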
Why the eigengap matters • Example: A has 3 diagonal blocks, K = 2 • gap(C) = gap(C') = 0, but C and C' are not close (numerical check in the sketch below)
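The three-block example can be checked numerically; a sketch under my reading of the slide, with block size 5 chosen arbitrarily: the second and third eigenvalues coincide, so the eigengap relevant to K = 2 vanishes, even though each of the two merge-two-blocks clusterings matches the lower bound exactly.

```python
import numpy as np
from scipy.linalg import block_diag

B = np.ones((5, 5)) / 5           # one "perfect" block
A = block_diag(B, B, B)           # similarity matrix with 3 identical diagonal blocks
eigvals = np.linalg.eigvalsh(A)
print(eigvals[-2] - eigvals[-3])  # eigengap lambda_2 - lambda_3 ≈ 0 for K = 2
```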
Remarks on stability results • No explicit conditions on S • Different flavor from other stability results, e.g. Kannan et al. '00, Ng et al. '01, which assume S is "almost" block diagonal • But... the results apply only if a good clustering is found • There are S matrices for which no clustering satisfies the theorem • The bound depends on aggregate quantities like • K • cluster sizes (= probabilities) • Points are weighted by their volumes (degrees) • good in some applications • bounds for unweighted distances can be obtained
Is the bound ever informative? • An experiment: S = a "perfect" similarity matrix + additive noise (a sketch of this setup follows below)
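A hedged sketch of how such an experiment could be set up; the block sizes, the noise model, and the function name are my placeholders, not the talk's exact protocol.

```python
import numpy as np
from scipy.linalg import block_diag

def noisy_similarity(block_sizes, noise_level, rng):
    """'Perfect' block-diagonal similarity matrix plus symmetric additive noise."""
    S = block_diag(*[np.ones((m, m)) for m in block_sizes])
    N = rng.uniform(0, noise_level, size=S.shape)
    return S + (N + N.T) / 2          # keep the matrix symmetric

S = noisy_similarity([30, 30, 30, 30], noise_level=0.2, rng=np.random.default_rng(0))
```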
[Experiment plot: K = 4, dim = 30]
K-means distortion • We can do the same... • ...but the K-th principal subspace is typically not stable
New approach: use K-1 vectors • Non-redundant representation Y of the clustering • The distortion gets a new expression in Y • ...and a new (relaxed) optimization problem
Solution of the new problem • Relaxed optimization problem (as set up above) • Solution • U = the K-1 principal eigenvectors of A • W = a K x K orthogonal matrix with a prescribed first row
Solve the relaxed minimization • distortion(Y) - lower bound small ⇒ Y close to Y* • Clusterings Y, Y' close to Y* ⇒ ||Y^T Y'||_F large • ||Y^T Y'||_F large ⇒ d(Y, Y') small
Theorem: for any two clusterings Y, Y' whose distortions are close enough to the relaxed lower bound, d(Y, Y') is bounded above, whenever the resulting bound is positive • Corollary: a bound for d(Y, Y_opt)
Experiments: K = 4, dim = 30, 20 replicates [plot: p_min, bound, true error]
Conclusions • First (?) distribution-independent bounds on the clustering error • data dependent • hold when the data is well clustered (this is the case of interest) • Tight? – not yet... • In addition • Improved variational bound for the K-means cost • Showed local equivalence between the "misclassification error" distance and the "Frobenius norm" distance (also known as the χ² distance) • Related work • Bounds for mixtures of Gaussians (Dasgupta, Vempala) • Nearest K-flat to n points (Tseng) • Variational bounds for sparse PCA (Moghaddam)