
A Discriminative Framework for Clustering via Similarity Functions

This talk presents a discriminative, PAC-style framework for clustering: it asks what natural properties of a pairwise similarity function suffice to cluster unlabeled data well.





Presentation Transcript


  1. A Discriminative Framework for Clustering via Similarity Functions. Maria-Florina Balcan, Carnegie Mellon University. Joint work with Avrim Blum and Santosh Vempala.

  2. Brief Overview of the Talk
  • Supervised learning: learning from labeled data. Good theoretical models: PAC, SLT, kernels & similarity functions.
  • Clustering: learning from unlabeled data. Lack of good unified models; vague, difficult to reason about at a general technical level.
  • Our work: fix this problem with a PAC-style framework for clustering.

  3. Clustering: Learning from Unlabeled Data
  • S: a set of n objects (e.g., documents).
  • ∃ a ground-truth clustering: each x has a label l(x) in {1,…,t} (e.g., its topic: sports, fashion, …).
  • Goal: output a clustering h of low error, where err(h) = min_σ Pr_{x∼S}[σ(h(x)) ≠ l(x)], the minimum taken over matchings σ of cluster names to labels.
  • Problem: unlabeled data only! But we have a similarity function K.
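To make the error measure concrete, here is a minimal sketch (mine, not from the slides) that computes err(h) by brute force over matchings of cluster names to labels; it is exponential in t and only meant to illustrate the definition.

```python
# Minimal illustration of err(h) = min over matchings sigma of
# Pr_{x ~ S}[ sigma(h(x)) != l(x) ].  Brute force over permutations of t names.
from itertools import permutations

def clustering_error(h_labels, true_labels, t):
    n = len(true_labels)
    best = 1.0
    for sigma in permutations(range(t)):  # sigma renames the predicted clusters
        mistakes = sum(1 for h, l in zip(h_labels, true_labels) if sigma[h] != l)
        best = min(best, mistakes / n)
    return best

# Two of four points end up "wrong" under the best matching -> error 0.5.
print(clustering_error([0, 0, 1, 1], [0, 1, 0, 1], t=2))
```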

  4. Clustering: Learning from Unlabeled Data
  Protocol:
  • ∃ a ground-truth clustering for S, i.e., each x in S has a label l(x) in {1,…,t}.
  • Input: S and a similarity function K. The similarity function K has to be related to the ground truth.
  • Output: a clustering of small error.

  5. Clustering: Learning from Unlabeled Data
  Fundamental question: what natural properties of a similarity function would be sufficient to allow one to cluster well?

  6. Contrast with Standard Approaches (Theoretical Frameworks for Clustering)
  • Approximation algorithms. Input: graph or embedding into R^d; algorithms scored by their approximation ratios; algorithms analyzed for how well they optimize various criteria over edges.
  • Mixture models. Input: embedding into R^d; algorithms scored by error rate; strong probabilistic assumptions.
  • Our approach. Input: graph or similarity information; algorithms scored by error rate; no strong probabilistic assumptions. Discriminative, not generative. Much better suited when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.

  7. A Condition that Trivially Works
  What natural properties of a similarity function would be sufficient to allow one to cluster well?
  • K(x,y) > 0 for all x, y with l(x) = l(y); K(x,y) < 0 for all x, y with l(x) ≠ l(y).
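Under this trivial condition, clustering is immediate: connect every pair with positive similarity and output the connected components. A minimal sketch of that idea (my code, assuming K is given as a Python function and points are hashable):

```python
def cluster_by_sign(points, K):
    """Recover clusters when K(x,y) > 0 exactly for same-cluster pairs:
    take connected components of the positive-similarity graph (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if K(points[i], points[j]) > 0:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

# Toy similarity: same-parity integers are "similar" -> clusters {1,3}, {2,4}.
print(cluster_by_sign([1, 2, 3, 4], lambda x, y: 1 if x % 2 == y % 2 else -1))
```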

  8. What natural properties of a similarity function would be sufficient to allow one to cluster well?
  • Candidate: all x are more similar to all y in their own cluster than to any z in any other cluster.
  • Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data!
  [Figure: two clusterings of the documents soccer, tennis, Lacoste, Gucci (sports vs. fashion), both consistent with a K taking the values K(x,x') = 1, 0.5, 0.]

  9. Relax Our Goals 1. Produce a hierarchical clustering s.t. correct answer is approximately some pruning of it.

  10. Relax Our Goals
  1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. [Figure: tree with root "All topics", children "sports" (soccer, tennis) and "fashion" (Lacoste, Gucci).]
  2. Produce a list of clusterings s.t. at least one has low error. Trade off the strength of the assumption against the size of the list.
  Either way we obtain a rich, general model.

  11. Strict Separation Property
  • All x are more similar to all y in their own cluster than to any z in any other cluster.
  • Sufficient for hierarchical clustering (if K is symmetric).
  • Algorithm: single linkage. Merge the two current "parts" whose maximum cross-similarity is highest.
  [Figure: resulting hierarchy with root "All topics", children "sports" (soccer, tennis) and "fashion" (Lacoste, Gucci), with similarity levels 1, 0.5, 0.]
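A minimal single-linkage sketch in the spirit of this slide (my code, not the authors'): maintain the current parts, repeatedly merge the two whose maximum cross-pair similarity is largest, and record the merges as nested tuples encoding the tree.

```python
def single_linkage_tree(points, K):
    """Repeatedly merge the two current parts with the largest *maximum*
    cross-pair similarity; returns a nested-tuple tree over the points."""
    parts = [({p}, p) for p in points]          # (set of points, subtree)
    while len(parts) > 1:
        best = None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                sim = max(K(x, y) for x in parts[i][0] for y in parts[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (parts[i][0] | parts[j][0], (parts[i][1], parts[j][1]))
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    return parts[0][1]
```

The idea behind the guarantee: under strict separation with symmetric K, as long as a current part is a strict subset of a target cluster its best partner lies in the same cluster, so cross-cluster merges happen only after each target cluster is fully assembled, making the target a pruning of the tree.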

  12. Strict Separation Property
  • All x are more similar to all y in their own cluster than to any z in any other cluster.
  • Theorem: using single linkage we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
  Incorporating approximation assumptions into our model:
  • If one uses a c-approximation algorithm for an objective f (e.g., k-median or k-means) to minimize error rate, the implicit assumption is: all clusterings within a factor c of optimal are ε-close to the target.
  • Under that assumption, most points (a 1 - O(ε) fraction) satisfy strict separation, and we can still cluster well in the tree model.

  13. Stability Property
  • For all clusters C, C' and all A ⊂ C, A' ⊆ C': K(A, C - A) > K(A, A'), where K(A, A') denotes the average attraction (average pairwise similarity) between A and A'.
  • In words: neither A nor A' is more attracted to the other than to the rest of its own cluster.
  • Sufficient for hierarchical clustering: single linkage fails, but average linkage works. Merge the two current "parts" whose average cross-similarity is highest.
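Relative to the single-linkage sketch above, the only change is the linkage score: replace the maximum over cross pairs with the average attraction K(A, A'). A small helper with assumed names:

```python
def avg_attraction(A, B, K):
    """K(A, B): the average pairwise similarity between parts A and B."""
    return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

# Average linkage = the single_linkage_tree loop above with
#   max(K(x, y) ...)  replaced by  avg_attraction(parts[i][0], parts[j][0], K).
```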

  14. Stability Property
  • For all C, C', all A ⊂ C, A' ⊆ C': K(A, C - A) > K(A, A'), where K(A, A') is the average attraction between A and A'.
  • Theorem: using average linkage we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
  • Analysis: all "parts" produced remain laminar with respect to the target clustering. A failure occurs iff we merge P1, P2 with P1 ⊂ C and P2 ∩ C = ∅. But then there must exist P3 ⊂ C with K(P1, P3) ≥ K(P1, C - P1), while stability gives K(P1, C - P1) > K(P1, P2), so average linkage would have preferred P3 over P2. Contradiction.

  15. Stability Property
  • For all C, C', all A ⊂ C, A' ⊆ C': K(A, C - A) > K(A, A'), where K(A, A') is the average attraction between A and A'.
  • Average linkage breaks down if K is not symmetric. Instead, run a "Boruvka-inspired" algorithm:
  • Each current cluster Ci points to argmax_{Cj} K(Ci, Cj).
  • Merge the resulting directed cycles.
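One hedged sketch of a single Boruvka-style round (details such as tie-breaking are my own choices, not specified on the slide): every current part points to the part it is most attracted to on average, and any parts lying on a common directed cycle of these pointers are merged.

```python
def boruvka_round(parts, K):
    """One merge round for possibly asymmetric K: each part points to its
    most-attractive other part; parts on a directed cycle are merged."""
    def avg(A, B):
        return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

    points_to = {i: max((j for j in range(len(parts)) if j != i),
                        key=lambda j, A=parts[i]: avg(A, parts[j]))
                 for i in range(len(parts))}

    new_parts, on_cycle = [], {}
    for i in range(len(parts)):
        if i in on_cycle:
            continue
        path, cur = [], i
        while cur not in path and cur not in on_cycle:
            path.append(cur)
            cur = points_to[cur]
        if cur in path:                       # the tail of this walk is a cycle
            cycle = path[path.index(cur):]
            new_parts.append(set().union(*(parts[k] for k in cycle)))
            for k in cycle:
                on_cycle[k] = True
    # Parts not on any cycle survive unchanged this round.
    new_parts += [parts[i] for i in range(len(parts)) if i not in on_cycle]
    return new_parts
```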

  16. Unified Model for Clustering
  Question 1: given a property of the similarity function w.r.t. the ground-truth clustering, what is a good algorithm?
  [Diagram: properties P1, …, Pi, …, Pn of the similarity function w.r.t. the ground-truth clustering, mapped to algorithms A1, A2, …, Am.]

  17. Unified Model for Clustering
  Question 2: given the algorithm, what property of the similarity function w.r.t. the ground-truth clustering should the expert aim for?
  [Diagram: same property-to-algorithm mapping as on the previous slide.]

  18. Other Examples of Properties and Algorithms
  Average attraction property:
  • E_{x' ∈ C(x)}[K(x, x')] > E_{x' ∈ C'}[K(x, x')] + γ for all clusters C' ≠ C(x).
  • Not sufficient for hierarchical clustering, but one can produce a small list of clusterings with a sampling-based algorithm.
  • Upper bound on the list size: t^{O((t/γ²) log(t/ε))}. Lower bound: t^{Ω(1/γ)}.
  Stability of large subsets property:
  • For all clusters C, C', for all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn, neither A nor A' is more attracted to the other than to the rest of its own cluster.
  • Sufficient for hierarchical clustering: find the hierarchy using a multi-stage, learning-based algorithm.
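For the average attraction property, the flavor of a sampling-based list algorithm can be sketched roughly as follows (this is my reconstruction, not the authors' exact procedure; the sample size, parameter names, and guarantees are assumptions to be checked against the paper):

```python
import random
from itertools import product

def list_clusterings(points, K, t, sample_size, rng=random):
    """Draw a small sample, try every labelling of it with t cluster names,
    and for each labelling send every point to the label whose sampled
    representatives it is most attracted to on average.  Returns the list."""
    sample = rng.sample(points, min(sample_size, len(points)))
    out = []
    for labelling in product(range(t), repeat=len(sample)):  # t^{|sample|} guesses
        reps = {c: [s for s, lab in zip(sample, labelling) if lab == c]
                for c in range(t)}

        def attraction(x, c):
            return (sum(K(x, s) for s in reps[c]) / len(reps[c])
                    if reps[c] else float("-inf"))

        out.append([max(range(t), key=lambda c, x=x: attraction(x, c))
                    for x in points])
    return out
```

Under the property, at least one clustering in the returned list is close to the ground truth; the size of the list is what the upper and lower bounds above refer to.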

  19. Clustering under the Stability of Large Subsets Property
  • Property: for all C, C', all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn, K(A, C - A) > K(A, A').
  Algorithm:
  1) Generate a list L of candidate clusters (using the average-attraction algorithm). Ensure that every ground-truth cluster is f-close to some cluster in L.
  2) For every pair (C, C') in L s.t. all three parts C ∩ C', C \ C', C' \ C are large: if K(C ∩ C', C \ C') ≥ K(C ∩ C', C' \ C), then throw out C'; else throw out C.
  3) Clean and hook up the surviving clusters into a tree.
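A sketch of the pruning step 2 above, under assumptions of my own (L is a list of candidate clusters given as sets, and `large` is a caller-supplied size test such as lambda A: len(A) >= s * n):

```python
def prune_candidates(L, K, large):
    """Whenever two surviving candidates overlap and all three regions are
    large, keep the cluster the overlap is more attracted to; drop the other."""
    def avg(A, B):
        return (sum(K(x, y) for x in A for y in B) / (len(A) * len(B))
                if A and B else float("-inf"))

    survivors = [set(C) for C in L]
    changed = True
    while changed:
        changed = False
        for i in range(len(survivors)):
            for j in range(i + 1, len(survivors)):
                C1, C2 = survivors[i], survivors[j]
                inter, only1, only2 = C1 & C2, C1 - C2, C2 - C1
                if large(inter) and large(only1) and large(only2):
                    drop = j if avg(inter, only1) >= avg(inter, only2) else i
                    survivors.pop(drop)
                    changed = True
                    break
            if changed:
                break
    return survivors
```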

  20. Clustering under the Stability of Large Subsets Property
  • Property: for all C, C', all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn, K(A, C - A) > K(A, A') + γ.
  • Theorem: if s = O(ε²/k²) and f = O(ε²/k²), then the algorithm produces a tree s.t. the ground truth is ε-close to a pruning of it.

  21. The Inductive Setting
  • Draw a sample S from the instance space X, cluster S (in the list or tree model), and insert new points as they arrive.
  • Many of our algorithms extend naturally to this setting.
  • To get polynomial time for stability of all subsets, one needs to argue that sampling preserves stability. [AFKK]
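One natural insertion rule for the list model (my choice; the slide does not fix the rule) is to attach each newly arriving point to the existing cluster it is most attracted to on average:

```python
def insert_point(x, clusters, K):
    """Place a new point x into the most-attractive existing cluster."""
    def avg_attraction(C):
        return sum(K(x, y) for y in C) / len(C)
    best = max(range(len(clusters)), key=lambda i: avg_attraction(clusters[i]))
    clusters[best].append(x)
    return best
```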

  22. Similarity Functions for Clustering: Summary
  Main conceptual contributions:
  • Natural conditions on K for it to be useful for clustering.
  • For a robust theory, relax the objective: hierarchy, list.
  • A general model that parallels PAC, SLT, and learning with kernels and similarity functions in supervised classification.
  Technically most difficult aspects:
  • Algorithms for stability of large subsets and for ν-strict separation.
  • Algorithms and analysis for the inductive setting.
