## Clustering


### Clustering

- "Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)" [ACM CS'99]
- Instances within a cluster are very similar
- Instances in different clusters are very different

[Figure: example documents plotted in a two-term (term1, term2) space]

### Applications

- Faster retrieval
- Faster and better browsing
- Structuring of search results
- Revealing classes and other data regularities
- Directory construction
- Better data organization in general

### Cluster Searching

- Similar instances tend to be relevant to the same requests
- A query is mapped to the closest cluster by comparing it with the cluster centroids

### Notation

- N: number of elements
- Class: real-world grouping (the ground truth)
- Cluster: grouping produced by the algorithm
- The ideal clustering algorithm produces clusters equivalent to the real-world classes, with exactly the same members

### Problems

- How many clusters?
- Complexity: N is usually large
- Quality of clustering: when is one method better than another?
- Overlapping clusters
- Sensitivity to outliers

[Figure: example point set to be clustered]
### Clustering Approaches

- Divisive: build clusters "top down," starting from the entire data set
  - K-means, Bisecting K-means
  - hierarchical or flat clustering
- Agglomerative: build clusters "bottom up," starting with individual instances and iteratively combining them to form larger clusters at higher levels
  - hierarchical clustering
- Combinations of the above
  - Buckshot algorithm

### Hierarchical vs. Flat Clustering

- Flat: all clusters at the same level
  - K-means, Buckshot
- Hierarchical: a nested sequence of clusters
  - a single cluster with all the data at the top, singleton clusters at the bottom
  - the intermediate levels are the most useful ones
  - every intermediate level combines two clusters from the next lower level
  - Agglomerative, Bisecting K-means

[Figure: flat clustering example]

[Figure: hierarchical clustering example with nested clusters labeled 1–7]

### Text Clustering

- Finds overall similarities among documents or groups of documents
- Enables faster searching, browsing, etc.
- Requires a way to compute the similarity (or, equivalently, the distance) between documents

### Query–Document Similarity

- Similarity is defined as the cosine of the angle θ between the document vector and the query vector

[Figure: document and query vectors d1, d2 at angle θ]

### Document Distance

- Consider documents d1, d2 with vectors u1, u2
- Their distance is defined as the length AB, the segment between the endpoints of the two vectors in the figure

### Normalization by Document Length

- The longer a document is, the more likely a given term is to appear in it
- Normalize the term weights by document length, so that terms in long documents are not given more weight

### Evaluation of Cluster Quality

- Clusters can be evaluated using internal or external knowledge
- Internal measures: intra-cluster cohesion and cluster separability
  - intra-cluster similarity
  - inter-cluster similarity
- External measures: quality of the clusters compared to real classes
  - Entropy (E), Harmonic Mean (F)

### Intra-Cluster Similarity

- A measure of cluster cohesion
- Defined as the average pairwise similarity of the documents in a cluster
- For unit-length documents this equals c · c, where c is the cluster centroid
- Documents (not centroids) have unit length

### Inter-Cluster Similarity

- Single link: the similarity of the two most similar members
- Complete link: the similarity of the two least similar members
- Group average: the average similarity between members

[Figure: single link, complete link, and group average similarities between clusters S and S′ with centroids c and c′]
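The cosine similarity and the centroid form of intra-cluster similarity can be sketched in plain Python. This is an illustrative toy, not code from the slides; the function names are mine, and it relies on the slides' assumption that documents are unit-length vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def normalize(v):
    """Scale v to unit length, as the slides assume for documents."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def intra_similarity(cluster):
    """Average pairwise cosine similarity of unit-length documents.

    For unit vectors the average over all (ordered) document pairs,
    self-pairs included, equals c . c for the cluster centroid c,
    which is why the slides define the measure via the centroid.
    """
    docs = [normalize(d) for d in cluster]
    c = [sum(col) / len(docs) for col in zip(*docs)]
    return sum(a * a for a in c)
```

The centroid shortcut matters in practice: it turns an O(n²) pairwise computation into a single O(n) pass over the cluster.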
### Entropy

- Measures the quality of flat clusters using external knowledge
  - a pre-existing classification
  - assessment by experts
- Pij: the probability that a member of cluster j belongs to class i
- The entropy of cluster j is defined as Ej = −Σi Pij log Pij

### Entropy (cont'd)

- The total entropy over all clusters is E = Σj (nj / N) Ej
  - nj is the size of cluster j, m is the number of clusters, N is the number of instances
- The smaller the value of E, the better the quality of the clustering
- The best entropy is obtained when each cluster contains exactly one instance

### Harmonic Mean (F)

- Treats each cluster as a query result
- F combines precision (P) and recall (R)
- For class i and cluster j: P = nij / nj, R = nij / ni, and Fij = 2PR / (P + R)
  - nij: number of instances of class i in cluster j
  - ni: number of instances of class i
  - nj: number of instances in cluster j

### Harmonic Mean (cont'd)

- The F value of a class i is the maximum value it achieves over all clusters j: Fi = maxj Fij
- The F value of a clustering solution is the weighted average over all classes: F = Σi (ni / N) Fi
  - N is the number of data instances

### Quality of Clustering

- A good clustering method
  - maximizes intra-cluster similarity
  - minimizes inter-cluster similarity
  - minimizes entropy
  - maximizes the harmonic mean
- It is difficult to achieve all of these simultaneously
  - maximize some objective function of the above instead
- An algorithm is better than another if it has better values on most of these measures

### K-Means Algorithm

- Select K centroids
- Repeat I times, or until the centroids no longer change:
  - assign each instance to the cluster represented by its nearest centroid
  - compute new centroids
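The entropy and F definitions above can be computed directly from two parallel label lists, one giving each instance's cluster and one its true class. A sketch (the function and variable names are mine):

```python
import math
from collections import Counter

def entropy_of_clustering(clusters, classes):
    """Total entropy E = sum_j (n_j / N) * E_j, where
    E_j = -sum_i P_ij log P_ij and P_ij = n_ij / n_j."""
    N = len(classes)
    E = 0.0
    for j in set(clusters):
        members = [c for cl, c in zip(clusters, classes) if cl == j]
        nj = len(members)
        Ej = 0.0
        for nij in Counter(members).values():   # class counts inside cluster j
            p = nij / nj
            Ej -= p * math.log(p)
        E += (nj / N) * Ej
    return E

def f_measure(clusters, classes):
    """F = sum_i (n_i / N) * max_j F_ij, with F_ij = 2PR / (P + R)."""
    N = len(classes)
    F = 0.0
    for i, ni in Counter(classes).items():
        best = 0.0
        for j in set(clusters):
            nj = sum(1 for cl in clusters if cl == j)
            nij = sum(1 for cl, c in zip(clusters, classes)
                      if cl == j and c == i)
            if nij:
                P, R = nij / nj, nij / ni       # precision and recall
                best = max(best, 2 * P * R / (P + R))
        F += (ni / N) * best
    return F
```

A perfect clustering gives E = 0 and F = 1; putting everything in one cluster raises E and lowers F, which illustrates why the slides say the two measures should be minimized and maximized, respectively.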
### K-Means Demo

- http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html (Nikos Hourdakis, MSc Thesis)

### Comments on K-Means (1)

- Generates a flat partition of K clusters
- K, the desired number of clusters, must be known in advance
- Starts with K random cluster centroids
- A centroid is the mean or the median of a group of instances
- The mean rarely corresponds to a real instance

### Comments on K-Means (2)

- Runs for up to I = 10 iterations
- Keep the clustering with the best inter/intra-cluster similarity, or the final clusters after I iterations
- Complexity O(IKN)
- Repeated application of K-means for K = 2, 4, … can produce a hierarchical clustering

### Choosing Centroids for K-Means

- The quality of the clustering depends on the selection of the initial centroids
- Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
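The K-means loop described above (select K centroids, assign each instance to its nearest centroid, recompute the centroids, repeat up to I times or until the centroids stop changing) can be sketched as follows. This is an illustrative toy using Euclidean distance, not the thesis implementation:

```python
import random

def kmeans(points, K, I=10, seed=0):
    """Flat partition of `points` (tuples of floats) into K clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, K)          # K random initial centroids
    for _ in range(I):
        # Assign each instance to the cluster of its nearest centroid.
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                   # centroids unchanged: converged
            break
        centroids = new
    return clusters, centroids
```

Note that the recomputed centroids are means, so (as the slides point out) they rarely coincide with a real instance.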
- Select good initial centroids using a heuristic or the results of another method
  - Buckshot algorithm

### Incremental K-Means

- Update each centroid as each point is assigned to a cluster, rather than at the end of each iteration
- Reassign instances to clusters at the end of each iteration
- Converges faster than simple K-means
  - usually 2–5 iterations

### Bisecting K-Means

- Starts with a single cluster containing all instances
- Select a cluster to split: the larger cluster, or the cluster with the lowest intra-cluster similarity
- The selected cluster is split into 2 partitions using K-means with K = 2
- Repeat up to the desired depth h
- Produces a hierarchical clustering
- Complexity O(2hN)

### Agglomerative Clustering

- Compute the similarity matrix between all pairs of instances
- Start from singleton clusters
- Repeat until a single cluster remains:
  - merge the two most similar clusters and replace them with a single cluster
  - update the similarity matrix to reflect the merge
- Complexity O(N²)

[Figures: similarity matrix before a merge, with the merged rows and columns marked, and the updated matrix after the merge]

### Single Link

- Merge the most similar clusters as determined by single link
- Can result in long, thin clusters due to the "chaining effect"
- Appropriate in some domains, such as clustering islands

### Complete Link

- Merge the most similar clusters as determined by complete link
- Results in compact, spherical clusters, which are usually preferable

### Group Average

- Merge the most similar clusters as determined by group average
- A fast compromise between single and complete link

[Figure: single link, complete link, and group average similarities between clusters A and B with centroids c1 and c2]
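The agglomerative loop and the three linkage criteria can be sketched together. This is a naive O(N³) illustration using cosine similarity (the function names and the `target_k` stopping parameter are mine; the slides' O(N²) bound assumes an incrementally updated similarity matrix):

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def linkage_sim(A, B, linkage="group"):
    """Inter-cluster similarity between clusters A and B of vectors."""
    sims = [cos(a, b) for a in A for b in B]
    if linkage == "single":        # two most similar members
        return max(sims)
    if linkage == "complete":      # two least similar members
        return min(sims)
    return sum(sims) / len(sims)   # group average

def agglomerative(points, target_k=1, linkage="group"):
    """Repeatedly merge the two most similar clusters."""
    clusters = [[p] for p in points]          # start from singletons
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage_sim(clusters[i], clusters[j], linkage)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters
```

Running to `target_k=1` and recording each merge yields the full hierarchy described in the slides; stopping earlier yields a flat cut of it.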
### Inter-Cluster Similarity via Centroids

- A new cluster is represented by its centroid
- Document-to-cluster similarity is computed as the similarity between the document and the cluster centroid
- Cluster-to-cluster similarity can be computed as single link, complete link, or group average similarity

### Buckshot K-Means

- Combines agglomerative clustering and K-means
- Agglomerative clustering produces a good clustering solution but has O(N²) complexity
- Randomly select a sample of √N instances
- Apply agglomerative clustering to the sample, which takes O(N) time
- Use the centroids of the resulting clusters as the initial centroids for K-means
- Overall complexity is O(N)

[Figure: sample points 1–15 clustered agglomeratively; the resulting centroids serve as the initial centroids for K-means]

### More on Clustering

- Sound methods based on the document-to-document similarity matrix
  - graph-theoretic methods
  - O(N²) time
- Iterative methods operating directly on the document vectors
  - O(N log N), O(N²/log N), O(mN) time

### Soft Clustering

- Hard clustering: each instance belongs to exactly one cluster
  - does not allow for uncertainty; an instance may actually belong to two or more clusters
- Soft clustering assigns each instance a probability of belonging to each cluster
  - the probabilities over all clusters must sum to 1
- Expectation Maximization (EM) is the most popular approach
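The Buckshot idea — run agglomerative clustering on a small random sample, then hand the resulting centroids to K-means as its initial centroids — can be sketched end to end. This is an illustrative toy assuming a sample of about √N points and Euclidean distance; the helper names and the centroid-linkage merge rule in `hac_centroids` are my choices, not the slides':

```python
import math
import random

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Centroid (component-wise mean) of a list of points."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def hac_centroids(points, K):
    """Tiny agglomerative pass: merge the two clusters with the closest
    centroids until K clusters remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > K:
        _, i, j = min((dist2(mean(a), mean(b)), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        clusters[i] += clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]

def buckshot(points, K, I=10, seed=0):
    """Buckshot: HAC on a ~sqrt(N) sample seeds K-means on the full data."""
    rng = random.Random(seed)
    sample = rng.sample(points, max(K, math.isqrt(len(points))))
    centroids = hac_centroids(sample, K)
    for _ in range(I):
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K), key=lambda j: dist2(p, centroids[j]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters
```

The design point is the one the slides make: the expensive agglomerative step runs only on the sample, so it supplies good initial centroids without paying the O(N²) cost on the full data set.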