Towards Theoretical Foundations of Clustering Margareta Ackerman University of Waterloo

The Theory-Practice Gap Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.

The Theory-Practice Gap “While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood” (Wright, 1973). “There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model” (Kleinberg, 2002). Both statements still apply today.

Inherent Obstacles: Clustering is ill-defined Clustering aims to assign data into groups of similar items. Beyond that, there is very little consensus on the definition of clustering.

Inherent Obstacles • Clustering is inherently ambiguous • There may be multiple reasonable clusterings • There is usually no ground truth • There are many clustering algorithms with different (often implicit) objective functions

Outline • Previous work • Clustering algorithm selection • Characterization of Linkage-Based clustering • Sketch of proof • Hierarchical algorithms that are not linkage-based • Conclusions and future work

Previous Work Towards a General Theory: Axiomatizing clustering • Clustering in the weighted setting (Wright, ‘73) • Axioms of clustering distance functions (Meila, ACM ‘05) • Impossibility result (Kleinberg, NIPS ‘02) • Rebuttal to impossibility result (Ackerman & Ben-David, NIPS ‘08)

Previous Work Towards a General Theory: Clusterability • Conditions for efficiently uncovering the target clustering (Balcan, Blum, and Vempala, STOC ‘08; Balcan, Blum, and Gupta, SODA ‘09) • Theoretical study of clusterability (Ackerman & Ben-David, AISTATS ‘09) • Notions of clusterability are pairwise distinct • Data sets that are more clusterable are computationally easier to cluster well

Outline • Previous work • Clustering algorithm selection • Characterization of Linkage-Based clustering • Sketch of proof • Hierarchical algorithms that are not linkage-based • Conclusions and future work

Clustering Algorithm Selection There is a wide variety of clustering algorithms, and they often produce very different clusterings. How should a user decide which algorithm to use for a given application?

Clustering Algorithm Selection Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc. There is inadequate emphasis on input-output behaviour.

Radical Differences in Input/Output Behavior of Clustering Algorithms

Our Framework for Clustering Algorithm Selection We propose a framework that lets a user utilize prior knowledge to select an algorithm • Identify properties that distinguish between different input-output behaviour of clustering paradigms • The properties should be: 1) Intuitive and “user-friendly” 2) Useful for distinguishing clustering algorithms

Our Framework for Clustering Algorithm Selection • The long-term goal is to construct a large property-based classification for many useful clustering algorithms • This would facilitate the application of prior knowledge • It enables users to identify a suitable algorithm without the overhead of executing many algorithms • This framework helps understand the behaviour of existing and new algorithms

Taxonomy of Partitional Algorithms (Ackerman, Ben-David, Loker, NIPS 2010)

Axioms vs. Properties (diagram labels: Properties, Axioms)

Characterization of Linkage-Based Clustering (Ackerman, Ben-David, Loker, COLT 2010)

Characterization of Linkage-Based Clustering (Ackerman, Ben-David, Loker, COLT 2010) The 2010 characterization applies in the partitional setting, by using the k-stopping criterion. This characterization distinguishes linkage-based algorithms from other partitional algorithms.

Characterizing Linkage-Based Clustering in the Hierarchical Setting (Ackerman and Ben-David, IJCAI 2011) • Propose two intuitive properties that uniquely identify hierarchical linkage-based clustering algorithms. • Show that common hierarchical algorithms, including bisecting k-means, cannot be simulated by any linkage-based algorithm

Formal Setup: Dendrograms and clusterings C_i is a cluster in a dendrogram D if there exists a node in the dendrogram such that C_i is the set of its leaf descendants.

Formal Setup: Dendrograms and clusterings C = {C_1, … , C_k} is a clustering in a dendrogram D if • C_i is a cluster in D for all 1 ≤ i ≤ k, and • the clusters are disjoint
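The two definitions above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding in which a dendrogram is a nested tuple and a leaf is a plain label; the function names are illustrative and do not come from the talk.

```python
from itertools import combinations

# Hypothetical encoding: a leaf is a label (e.g. a string),
# an internal node is a tuple of child subtrees.

def leaves(node):
    """Return the set of leaf labels below a node."""
    if not isinstance(node, tuple):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clusters(node):
    """Return every cluster of the dendrogram: one leaf set per node."""
    result = {leaves(node)}
    if isinstance(node, tuple):
        for child in node:
            result |= clusters(child)
    return result

def is_clustering(candidate, dendrogram):
    """True if every C_i is a cluster in the dendrogram and the C_i are disjoint."""
    all_clusters = clusters(dendrogram)
    sets = [frozenset(c) for c in candidate]
    if any(c not in all_clusters for c in sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

# Example dendrogram over {a, b, c, d, e}.
D = ((("a", "b"), "c"), ("d", "e"))
print(is_clustering([{"a", "b"}, {"d", "e"}], D))   # both are clusters, disjoint
print(is_clustering([{"b", "c"}], D))               # {b, c} is not a cluster in D
```

Note that, as in the definition on the slide, a clustering in this sense need not cover every element of X.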

Formal Setup: Hierarchical clustering algorithm A Hierarchical Clustering Algorithm A maps Input: a data set X with a dissimilarity function d, denoted (X,d), to Output: a dendrogram of X

Linkage-Based Algorithm • Create a leaf node for every element of X • Repeat the following until a single tree remains: • Consider the clusters represented by the remaining root nodes • Merge the closest pair of clusters by assigning them a common parent node

Examples of Linkage-Based Algorithms • The choice of linkage function distinguishes between different linkage-based algorithms. • Examples of common linkage functions: • Single-linkage: shortest between-cluster distance • Average-linkage: average between-cluster distance • Complete-linkage: maximum between-cluster distance
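The generic linkage-based procedure and the three linkage functions listed above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the representation of clusters as frozensets, and the tie-breaking rule (first minimum found) are assumptions.

```python
from itertools import combinations

def single(a, b, d):
    """Shortest between-cluster distance."""
    return min(d(x, y) for x in a for y in b)

def average(a, b, d):
    """Average between-cluster distance."""
    return sum(d(x, y) for x in a for y in b) / (len(a) * len(b))

def complete(a, b, d):
    """Maximum between-cluster distance."""
    return max(d(x, y) for x in a for y in b)

def linkage_based(points, d, linkage):
    """Build a dendrogram bottom-up and return the merged clusters in order."""
    roots = [frozenset([p]) for p in points]      # one leaf node per element
    merged = []
    while len(roots) > 1:                         # until a single tree remains
        # Closest pair of root clusters under the chosen linkage function.
        a, b = min(combinations(roots, 2),
                   key=lambda pair: linkage(pair[0], pair[1], d))
        roots.remove(a)
        roots.remove(b)
        roots.append(a | b)                       # common parent node
        merged.append(a | b)
    return merged

def dist(x, y):
    return abs(x - y)

points = [0, 2, 5, 9.5]   # hypothetical one-dimensional data
for name, fn in [("single", single), ("average", average), ("complete", complete)]:
    print(name, [sorted(c) for c in linkage_based(points, dist, fn)])
```

On this small data set single and average linkage attach 5 to {0, 2}, while complete linkage pairs 5 with 9.5 first, which shows how the choice of linkage function alone changes the dendrogram.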

Locality (informal definition) If we select a set of disjoint clusters from a dendrogram D = A(X,d) and run the algorithm on the union X’ of these clusters, we obtain a result D’ = A(X’,d) that is consistent with the original dendrogram. (Figure example: X’ = {x1, …, x6}.)

Outer Consistency Let C be a clustering in A(X,d), and let d’ be obtained from d by an outer-consistent change with respect to C. If A is outer-consistent, then A(X,d’) will also include the clustering C. (Figure: C on dataset (X,d) and C on dataset (X,d’).)
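The slide names an outer-consistent change without spelling it out. The sketch below assumes the definition commonly used in this line of work: d’ agrees with d on pairs inside the same cluster of C and is at least as large as d on pairs in different clusters. The function name and the example data are hypothetical.

```python
from itertools import combinations

def is_outer_consistent_change(points, clustering, d, d_new):
    """Assumed definition: within-cluster distances unchanged,
    between-cluster distances not decreased."""
    label = {x: i for i, cluster in enumerate(clustering) for x in cluster}
    for x, y in combinations(points, 2):
        if label[x] == label[y]:
            if d_new(x, y) != d(x, y):
                return False
        elif d_new(x, y) < d(x, y):
            return False
    return True

points = ["p", "q", "r", "s"]
C = [{"p", "q"}, {"r", "s"}]

# Hypothetical one-dimensional embeddings: the second cluster is moved
# away from (or toward) the first without changing its internal distances.
original = {"p": 0, "q": 1, "r": 5, "s": 6}
pushed_apart = {"p": 0, "q": 1, "r": 8, "s": 9}
pulled_closer = {"p": 0, "q": 1, "r": 3, "s": 4}

def metric(pos):
    return lambda x, y: abs(pos[x] - pos[y])

print(is_outer_consistent_change(points, C, metric(original), metric(pushed_apart)))
print(is_outer_consistent_change(points, C, metric(original), metric(pulled_closer)))
```

Pushing the clusters apart passes the check, while pulling them closer fails it because between-cluster distances shrink.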

Characterization of Linkage-Based Clustering Theorem (Ackerman & Ben-David, IJCAI 2011): A hierarchical clustering algorithm is Linkage-Based if and only if it is Local and Outer-Consistent.

Outline • Previous work • Clustering algorithm selection • Characterization of Linkage-Based clustering • Sketch of proof • Hierarchical algorithms that are not linkage-based • Conclusions and future work

Easy Direction of Proof Every Linkage-Based hierarchical clustering algorithm is Local and Outer-Consistent. The proof is quite straightforward.

Interesting Direction of Proof If A is Local and Outer-Consistent, then A is Linkage-Based. To prove this direction we first need to formalize Linkage-Based clustering, by formally defining what a Linkage Function is.

What Do We Expect From Linkage Functions? A Linkage Function is a function l: {(X1, X2, d) : d is a distance function over X1 ∪ X2} → R+ that satisfies the following: • Representation independence: l doesn’t change if we re-label the data • Monotonicity: if we increase the edges that go between X1 and X2, then l(X1, X2, d) doesn’t decrease
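As an illustration, the three common linkage functions named earlier in the talk fit this form. The formulas below are the standard textbook definitions, written in the slide's notation rather than quoted from it.

```latex
\begin{align*}
  l_{\mathrm{single}}(X_1, X_2, d)   &= \min_{x \in X_1,\, y \in X_2} d(x, y), \\
  l_{\mathrm{average}}(X_1, X_2, d)  &= \frac{1}{|X_1|\,|X_2|} \sum_{x \in X_1} \sum_{y \in X_2} d(x, y), \\
  l_{\mathrm{complete}}(X_1, X_2, d) &= \max_{x \in X_1,\, y \in X_2} d(x, y).
\end{align*}
```

Each depends only on the distances between X1 and X2, so re-labelling the data leaves it unchanged, and increasing any between-cluster distance cannot decrease a minimum, a mean, or a maximum, so all three are monotone.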

Sketch of proof Recall the direction: If A satisfies Outer-Consistency and Locality, then A is Linkage-Based. Goal: Define a linkage function l so that the linkage-based clustering based on l outputs A(X,d) (for every X and d).

Sketch of proof • Define an operator <_A: (X,Y,d1) <_A (Z,W,d2) if, when we run A on (X∪Y∪Z∪W, d), where d extends d1 and d2, X and Y are merged before Z and W • Prove that <_A can be extended to a partial ordering • Use the ordering to define l

Sketch of proof (continued): Show that <_A is a partial ordering We show that <_A is cycle-free. Lemma: Given a hierarchical algorithm A that is Local and Outer-Consistent, there exists no finite sequence such that (X1,Y1,d1) <_A … <_A (Xn,Yn,dn) <_A (X1,Y1,d1).

Sketch of proof (continued…) • By the above Lemma, the transitive closure of <_A is a partial ordering. • This implies that there exists an order-preserving function l that maps pairs of data sets to R+. • It can be shown that l satisfies the properties of a Linkage Function.
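The step from a cycle-free relation to an order-preserving function can be illustrated on a finite relation. This is only a sketch under that finiteness assumption: the actual proof works over all triples (X, Y, d), and it must also establish representation independence and monotonicity, which this code does not address. The function name and the example items are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def order_preserving_ranks(relation):
    """Given pairs (a, b) meaning a < b, return a map into the positive
    integers with rank[a] < rank[b] for every pair.
    Raises graphlib.CycleError if the relation contains a cycle."""
    predecessors = {}
    for a, b in relation:
        predecessors.setdefault(a, set())
        predecessors.setdefault(b, set()).add(a)
    rank = {}
    # static_order() yields each item after all of its predecessors.
    for item in TopologicalSorter(predecessors).static_order():
        rank[item] = 1 + max((rank[p] for p in predecessors[item]), default=0)
    return rank

# Hypothetical "merged before" facts between four pairs of clusters.
merged_before = [("pair1", "pair2"), ("pair2", "pair3"), ("pair1", "pair4")]
ranks = order_preserving_ranks(merged_before)
print(ranks)

try:
    order_preserving_ranks([("u", "v"), ("v", "u")])
except CycleError:
    print("a cycle rules out any order-preserving function")
```

The cycle case shows why the Lemma is needed: without cycle-freeness, no function into R+ could respect the relation.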

Hierarchical but Not Linkage-Based P-divisive algorithms construct dendrograms top-down, using a partitional 2-clustering algorithm P to split nodes. Ex. k-means for k=2

Hierarchical but Not Linkage-Based Theorem [Ackerman & Ben-David, IJCAI ’11]: If P is context-sensitive, then the P-divisive algorithm fails the locality property. • A partitional 2-clustering algorithm P is context-sensitive if there exist d ⊂ d’ such that P({x,y,z}, d) = {{x}, {y,z}} and P({x,y,z,w}, d’) = {{x,y}, {z,w}}. Ex. k-means, min-sum, min-diameter.
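A small numeric instance makes context-sensitivity concrete for k-means with k=2. The sketch below assumes that P returns the exact minimizer of the k-means (sum of squared deviations) cost, found here by brute force rather than by Lloyd's heuristic, and it uses hypothetical points on a line; the input points are assumed to be distinct.

```python
from itertools import combinations

def sse(cluster):
    """Sum of squared deviations from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((p - mean) ** 2 for p in cluster)

def optimal_2_means(points):
    """Brute force over all 2-partitions; assumes distinct points."""
    points = list(points)
    first, rest = points[0], points[1:]
    best = None
    # Keep `first` on side a so each partition is visited once;
    # r < len(rest) guarantees that side b is non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            a = [first, *extra]
            b = [p for p in rest if p not in extra]
            cost = sse(a) + sse(b)
            if best is None or cost < best[0]:
                best = (cost, {frozenset(a), frozenset(b)})
    return best[1]

x, y, z, w = 0.0, 2.1, 4.0, 6.1   # hypothetical data on a line
print(optimal_2_means([x, y, z]))      # y is grouped with z
print(optimal_2_means([x, y, z, w]))   # adding w moves y over to x
```

With only x, y, z present, y joins z; once w is added, y joins x instead, even though the distances among x, y, z are unchanged. This is the pattern P({x,y,z}, d) = {{x}, {y,z}} and P({x,y,z,w}, d’) = {{x,y}, {z,w}} from the definition.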

Hierarchical but Not Linkage-Based • The input-output behaviour of some natural divisive algorithms is distinct from that of all linkage-based algorithms. • The bisecting k-means algorithm, and other natural divisive algorithms, cannot be simulated by any linkage-based algorithm.

Conclusions • We present a new framework for clustering algorithm selection • Provide a property-based classification of common clustering algorithms • Characterize linkage-based clustering in terms of two natural properties • Show that no linkage-based algorithm can simulate some natural divisive algorithms

What’s Next? • Our approach to selecting clustering algorithms can be applied to any clustering application (ex. phylogeny). • Classify applications in terms of their clustering needs • Target research on common clustering needs or specific applications • Identify when results are relevant to specific applications • Bridging the gap in other clustering settings (ex. clustering with a “noise cluster”) • Axioms of clustering algorithms
