This talk examines clustering objectives such as K-means and spectral clustering, defined on data similarities. It shows that when a good clustering exists it is essentially unique, and that a clustering found to be good is provably good, via spectral lower bounds on the distortion, matrix representations of clusterings, stability results, and the role of the eigengap.
The Stability of a Good Clustering
Marina Meila
University of Washington
mmp@stat.washington.edu
Motivation
• Data (similarities) → Objective → Algorithm (K-means, spectral clustering)
• Optimizing these criteria is NP-hard in the worst case
• ...but "spectral clustering and K-means work well when a good clustering exists", which is the interesting case
This talk:
• If a "good" clustering exists, it is "unique"
• If a "good" clustering is found, it is provably good
Results summary
• Given:
  • objective = NCut or K-means distortion
  • data A
  • clustering Y with K clusters
• Spectral lower bound on the distortion
• If δ = distortion(Y) − lower bound is small...
• ...then d(Y, Y_opt) is small, where Y_opt = the best clustering with K clusters
A graphical view
[Figure: distortion plotted over the space of clusterings, with the spectral lower bound marked]
Overview
• Introduction
• Matrix representations for clusterings
• Quadratic representation for clustering cost
• The misclassification error distance
• Results for NCut (easier)
• Results for K-means distortion (harder)
• Discussion
Clusterings as matrices
• A clustering of {1, 2, ..., n} with K clusters (C_1, C_2, ..., C_K)
• Represented by an n × K matrix X
  • unnormalized: X_ik = 1 if i ∈ C_k, 0 otherwise
  • normalized: each column scaled by 1/√|C_k|, so the columns have unit norm
• All such matrices have orthogonal columns
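To make the representation concrete, here is a minimal sketch in NumPy (mine, not from the talk), building both matrices from an integer label vector:

```python
import numpy as np

def indicator_matrices(labels, K):
    """Return the unnormalized and normalized n x K cluster indicator matrices."""
    n = len(labels)
    X_unnorm = np.zeros((n, K))
    X_unnorm[np.arange(n), labels] = 1.0       # X[i, k] = 1 iff point i is in C_k
    sizes = X_unnorm.sum(axis=0)               # cluster sizes |C_k|
    X_norm = X_unnorm / np.sqrt(sizes)         # columns scaled to unit norm
    return X_unnorm, X_norm

labels = np.array([0, 0, 1, 1, 1, 2])
X_u, X_n = indicator_matrices(labels, K=3)
print(np.allclose(X_n.T @ X_n, np.eye(3)))     # orthonormal columns: True
```

The orthonormal columns of the normalized version are what make the trace expressions on the next slide work.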
Distortion is quadratic in X
• NCut, with A the normalized similarity matrix: NCut(X) = K − tr(Xᵀ A X)
• K-means, with G the Gram matrix of the data: distortion(X) = tr(G) − tr(Xᵀ G X)
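As a sanity check on the K-means form (my reconstruction of the formula, not slide code), the sketch below verifies that the sum of squared distances to cluster means equals tr(G) − tr(Xᵀ G X):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 60, 5, 3
D = rng.normal(size=(n, d))                    # data matrix, one point per row
labels = rng.permutation(np.repeat(np.arange(K), n // K))

X = np.zeros((n, K))
X[np.arange(n), labels] = 1.0
X /= np.sqrt(X.sum(axis=0))                    # normalized indicator matrix

# Direct K-means distortion: squared distances to the cluster means.
direct = sum(((D[labels == k] - D[labels == k].mean(axis=0)) ** 2).sum()
             for k in range(K))

G = D @ D.T                                    # Gram matrix of the data
quadratic = np.trace(G) - np.trace(X.T @ G @ X)
print(np.isclose(direct, quadratic))           # True
```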
The Confusion Matrix
Two clusterings of {1, ..., n}:
• (C_1, C_2, ..., C_K)
• (C'_1, C'_2, ..., C'_K')
• Confusion matrix M (K × K') with entries m_kk' = |C_k ∩ C'_k'|
The Misclassification Error distance
• d(C, C') = 1 − (1/n) max_π Σ_k m_k,π(k), maximizing over matchings π between clusters k and clusters k'
• computed by the maximal bipartite matching algorithm on the confusion matrix
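A minimal sketch of this computation (assumed implementation, not from the talk), building the confusion matrix and solving the matching with SciPy's Hungarian-algorithm solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_error(labels1, labels2, K):
    """d(C, C') = 1 - (1/n) * (mass matched by the best cluster matching)."""
    n = len(labels1)
    M = np.zeros((K, K))
    for a, b in zip(labels1, labels2):
        M[a, b] += 1                           # confusion matrix: m_kk' = |C_k ∩ C'_k'|
    rows, cols = linear_sum_assignment(-M)     # maximal bipartite matching (negate to maximize)
    return 1.0 - M[rows, cols].sum() / n

c1 = np.array([0, 0, 1, 1, 2, 2])
c2 = np.array([1, 1, 0, 0, 2, 2])              # same partition, clusters relabeled
print(misclassification_error(c1, c2, K=3))    # 0.0
```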
Results for NCut
• Given:
  • data A (n × n)
  • clustering X (n × K)
• Lower bound on NCut (M02, YS03, BJ03): NCut(X) ≥ K − (λ_1 + ... + λ_K), with λ_1 ≥ λ_2 ≥ ... the largest eigenvalues of A
• Upper bound on d(X, X_opt) (MSX'05), valid whenever δ = NCut(X) − (K − λ_1 − ... − λ_K) is small relative to the eigengap λ_K − λ_K+1
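A small sketch of both quantities (assumed numerics, not slide code), with NCut computed from its cut/volume definition and the lower bound from the eigenvalues of the normalized similarity D^(-1/2) S D^(-1/2); on a perfectly block-diagonal S both sides are 0:

```python
import numpy as np

def ncut(S, labels, K):
    """NCut(C) = sum_k cut(C_k, rest) / vol(C_k), from the similarity matrix S."""
    deg = S.sum(axis=1)
    return sum(S[np.ix_(labels == k, labels != k)].sum() / deg[labels == k].sum()
               for k in range(K))

def spectral_lower_bound(S, K):
    """K minus the sum of the K largest eigenvalues of D^(-1/2) S D^(-1/2)."""
    deg = S.sum(axis=1)
    A = S / np.sqrt(np.outer(deg, deg))
    return K - np.sort(np.linalg.eigvalsh(A))[-K:].sum()

S = np.kron(np.eye(2), np.ones((4, 4)))        # two perfect diagonal blocks
labels = np.repeat([0, 1], 4)
print(ncut(S, labels, 2), spectral_lower_bound(S, 2))   # both ~0.0
```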
Proof outline
• Relaxed minimization: maximize tr(Xᵀ A X) s.t. X an n × K orthogonal matrix
• Solution: X* = the K principal eigenvectors of A
• δ small w.r.t. the eigengap λ_K − λ_K+1 ⇒ X close to X* (convexity proof)
• Two clusterings X, X' close to X* ⇒ tr(Xᵀ X') large
• tr(Xᵀ X') large ⇒ d(X, X') small
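The first step is the classical trace-maximization relaxation; the sketch below (illustration only, with hypothetical data) checks that no clustering's normalized indicator can beat the K principal eigenvectors on tr(Xᵀ A X):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 12, 3
B = rng.normal(size=(n, n))
A = (B + B.T) / 2                              # an arbitrary symmetric matrix

w, V = np.linalg.eigh(A)
X_star = V[:, -K:]                             # K principal eigenvectors: the relaxed optimum

labels = rng.permutation(np.arange(n) % K)     # an arbitrary clustering, no empty clusters
X = np.zeros((n, K))
X[np.arange(n), labels] = 1.0
X /= np.sqrt(X.sum(axis=0))                    # normalized indicator: an orthogonal n x K matrix

# The relaxation optimizes over ALL orthogonal X, so X_star dominates any clustering:
print(np.trace(X_star.T @ A @ X_star) >= np.trace(X.T @ A @ X))   # True
```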
Why the eigengap matters
• Example:
  • A has 3 diagonal blocks
  • K = 2
  • gap(C) = gap(C') = 0, but C and C' are not close
[Figure: the two clusterings C and C']
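A minimal sketch of this example (my reconstruction of the slide's setup): with three perfect blocks, the 2nd and 3rd eigenvalues coincide, so the eigengap for K = 2 vanishes:

```python
import numpy as np

S = np.kron(np.eye(3), np.ones((4, 4)))        # three perfect diagonal blocks
deg = S.sum(axis=1)
A = S / np.sqrt(np.outer(deg, deg))            # normalized similarity
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
print(lam[:4])                                 # [1. 1. 1. 0.]: eigengap lam_2 - lam_3 = 0

# With K = 2, merging blocks {1,2} or blocks {2,3} both give NCut = 0
# (no weight is cut), yet the two clusterings disagree on a third of the
# points, so no bound without an eigengap condition could relate them.
```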
Remarks on stability results
• No explicit conditions on S
• Different flavor from other stability results, e.g. Kannan et al. '00, Ng et al. '01, which assume S is "almost" block diagonal
• But... the results apply only if a good clustering is found
  • There are S matrices for which no clustering satisfies the theorem
• The bound depends on aggregate quantities like
  • K
  • cluster sizes (= probabilities)
• Points are weighted by their volumes (degrees)
  • good in some applications
  • bounds for unweighted distances can be obtained
Is the bound ever informative?
• An experiment: S = perfect + additive noise
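A sketch of such an experiment (assumed setup, not the talk's actual code): perfect block similarities plus symmetric uniform noise, comparing the NCut of the true clustering against the spectral lower bound as the noise level grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 4
labels = np.repeat(np.arange(K), n // K)
S_perfect = (labels[:, None] == labels[None, :]).astype(float)

for sigma in (0.0, 0.05, 0.2):
    N = rng.uniform(0, sigma, size=(n, n))
    S = S_perfect + (N + N.T) / 2              # additive noise, kept symmetric
    deg = S.sum(axis=1)
    A = S / np.sqrt(np.outer(deg, deg))
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    bound = K - lam[:K].sum()                  # spectral lower bound on NCut
    cut = sum(S[np.ix_(labels == k, labels != k)].sum() / deg[labels == k].sum()
              for k in range(K))
    print(f"sigma={sigma:.2f}  NCut={cut:.3f}  bound={bound:.3f}")
```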
[Figure: experiment results for K = 4, dim = 30]
K-means distortion
• We can do the same...
• ...but the K-th principal subspace is typically not stable
New approach: use K − 1 vectors
• Non-redundant representation Y
• Distortion: a new expression
• ...and a new (relaxed) optimization problem
Solution of the new problem
• Relaxed optimization problem, with A given
• Solution: Y* = U W, where
  • U = the K − 1 principal eigenvectors of A
  • W = a K × K orthogonal matrix with (√p_1, ..., √p_K) on its first row
Proof outline (K-means)
• Solve the relaxed minimization
• δ small ⇒ Y close to Y*
• Clusterings Y, Y' close to Y* ⇒ ||Yᵀ Y'||_F large
• ||Yᵀ Y'||_F large ⇒ d(Y, Y') small
Theorem: For any two clusterings Y, Y' with δ(Y), δ(Y') small, d(Y, Y') is bounded in terms of δ(Y), δ(Y'), the eigengap, and the cluster probabilities, whenever the δ's fall below a threshold set by the eigengap.
Corollary: Bound for d(Y, Y_opt)
Experiments
[Figure: K = 4, dim = 30, 20 replicates; the bound and the true misclassification error plotted against p_min]
Conclusions
• First (?) distribution-independent bounds on the clustering error
  • data dependent
  • hold when the data is well clustered (the case of interest)
  • Tight? Not yet...
• In addition:
  • Improved variational bound for the K-means cost
  • Showed local equivalence between the "misclassification error" distance and the "Frobenius norm distance" (also known as the χ² distance)
• Related work:
  • Bounds for mixtures of Gaussians (Dasgupta, Vempala)
  • Nearest K-flat to n points (Tseng)
  • Variational bounds for sparse PCA (Moghaddam)