Clustering


1. Clustering
• Basic concepts with simple examples
• Categories of clustering methods
• Challenges
CSE572, CBS572: Data Mining by H. Liu

2. What is clustering?
• The process of grouping a set of physical or abstract objects into classes of similar objects.
• It is also called unsupervised learning.
• It is a common and important task with many applications.
• What are some examples where we need clustering?

3. Clusters and representations
• Examples of clusters
• Different ways of representing clusters:
  • Division with boundaries
  • Venn diagrams or spheres
  • Probabilistic
  • Dendrograms
  • Trees
  • Rules
[Figure: a probabilistic representation — a table of instances I1 … In with cluster-membership probabilities such as 0.5, 0.2, 0.3 for clusters 1, 2, 3]

4. Differences from classification
• How are they different?
• Which one is more difficult as a learning problem?
• Do we perform clustering in daily activities? How do we cluster?
• How do we measure the results of clustering?
• With/without class labels
• Between classification and clustering
  • Semi-supervised clustering

5. Major clustering methods
• Partitioning methods
  • k-means (and EM), k-medoids
• Hierarchical methods
  • agglomerative, divisive, BIRCH
• Similarity and dissimilarity of points within the same cluster and across different clusters
• Distance measures between clusters
  • minimum, maximum
  • means of clusters
  • average between clusters

6. How to evaluate
• Without labeled data, how can one know whether a clustering result is good?
• Basic intuition of clustering for clustered data points:
  • Within a cluster – points should be similar (compactness)
  • Between clusters – points should be dissimilar (separation)
  • What is the relationship between the two?
• Evaluation methods
  • Labeled data – under an extra assumption: instances in the same cluster are of the same class
    • Is it reasonable to use class labels in evaluation?
  • Unlabeled data – we will see below
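The within/between intuition can be made concrete with sums of squares. A minimal sketch on a made-up 1-D clustering (the data and cluster assignment are illustrative, not from the slides):

```python
# Intrinsic evaluation: within-cluster cohesion (WSS) vs. between-cluster
# separation (BSS) for a toy 1-D clustering.
clusters = [[1, 2], [5, 6, 7]]
all_points = [p for c in clusters for p in c]
overall_mean = sum(all_points) / len(all_points)

# WSS: squared distance of each point to its own cluster mean (smaller = tighter).
wss = sum((p - sum(c) / len(c)) ** 2 for c in clusters for p in c)
# BSS: size-weighted squared distance of each cluster mean to the overall mean
# (larger = better separated).
bss = sum(len(c) * (sum(c) / len(c) - overall_mean) ** 2 for c in clusters)
# The two are linked: WSS + BSS equals the total sum of squares, so for a fixed
# dataset, minimizing WSS is the same as maximizing BSS.
tss = sum((p - overall_mean) ** 2 for p in all_points)
```

This is the relationship the slide asks about: cohesion and separation trade off against a fixed total.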

7. Clustering -- Example 1
• For simplicity, use 1-dimensional objects and k = 2.
• Objects: 1, 2, 5, 6, 7
• k-means:
  • Randomly select 5 and 6 as centroids;
  • => two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
  • => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
  • => no change.
• Aggregate dissimilarity = 0.5² + 0.5² + 1² + 0² + 1² = 2.5
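The trace above can be reproduced with a minimal sketch of Lloyd's k-means iteration, seeded with the slide's initial centroids 5 and 6 (variable names are illustrative):

```python
# Minimal 1-D k-means (Lloyd's algorithm) on the slide's example.
points = [1, 2, 5, 6, 7]
centroids = [5.0, 6.0]  # the "randomly selected" initial centroids from the slide

while True:
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[idx].append(p)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # no change => converged
        break
    centroids = new_centroids

# Aggregate dissimilarity: squared distance of each point to its centroid.
sse = sum((p - centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
```

Running this reproduces the slide's final clusters {1,2} and {5,6,7} with aggregate dissimilarity 2.5.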

8. Issues with k-means
• A heuristic method
• Sensitive to outliers
  • How to prove it?
• Determining k
  • Trial and error
  • X-means, PCA-based
• Crisp clustering
  • Soft alternatives: EM, fuzzy c-means
• Should not be confused with k-NN

9. k-Medoids
• Medoid – the most centrally located point in a cluster, used as a representative point of the cluster.
• In contrast, a centroid is not necessarily an actual point of the cluster.
[Figure: an example showing initial medoids]
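The medoid/centroid contrast is easy to see on made-up 1-D data with an outlier (the data below is illustrative, not from the slides):

```python
# Contrast centroid and medoid on toy 1-D data containing an outlier.
points = [1, 2, 6, 7, 100]            # 100 is an outlier

# The centroid is the mean: it need not be (and here is not) a data point,
# and the outlier drags it far from the bulk of the data.
centroid = sum(points) / len(points)

# The medoid is the actual data point with the smallest total distance
# to all other points, so it stays inside the dense region.
medoid = min(points, key=lambda m: sum(abs(m - p) for p in points))
```

Here the centroid lands at 23.2 while the medoid stays at 6 — the robustness to outliers that the later slides discuss.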

10. Partition Around Medoids
• PAM, given k:
  1. Randomly pick k instances as initial medoids
  2. Assign each instance to its nearest medoid x
  3. Calculate the objective function: the sum of dissimilarities of all instances to their nearest medoids
  4. Randomly select a non-medoid instance y
  5. For each medoid x, swap x with y if the swap reduces the objective function
  6. Repeat steps 2–5 until no change
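The steps above can be sketched as follows. For brevity this variant tries the best of all swaps each pass rather than a randomly selected y; the function name, data, and seed are illustrative:

```python
import random

# A PAM-style sketch (best-of-all-swaps variant of steps 2-6).
def pam(points, k, dist=lambda a, b: abs(a - b), seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)       # step 1: random initial medoids

    def cost(meds):
        # Objective: sum of dissimilarities to the nearest medoid.
        return sum(min(dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:                       # repeat until no change
        improved = False
        best_cost, best_meds = cost(medoids), medoids
        for i in range(k):                # try swapping each medoid x ...
            for y in points:              # ... with each non-medoid y
                if y in medoids:
                    continue
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                c = cost(candidate)
                if c < best_cost:         # keep the swap only if it helps
                    best_cost, best_meds = c, candidate
                    improved = True
        medoids = best_meds
    return sorted(medoids), cost(medoids)

medoids, total = pam([1, 2, 5, 6, 7], k=2)
```

On the running example the objective converges to 3, with 6 as one of the medoids.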

11. k-Means and k-Medoids
• The key difference lies in how they update the means or medoids
  • k-medoids compares each of the k medoids pairwise with each of the (N−k) non-medoid instances
• Both require distance calculation and reassignment of instances
• Time complexity: which one is more costly?
• Dealing with outliers
[Figure: an outlier 100 units away]

12. Agglomerative
• Each object starts as its own cluster (bottom-up).
• Repeat until the number of clusters is small enough:
  • Choose the closest pair of clusters
  • Merge the two into one
• Defining “closest”: centroid (cluster mean) distance, (average) sum of pairwise distances, …
  • Refer to the evaluation part
• A dendrogram is a tree that shows the clustering process.

13. Clustering -- Example 2
• For simplicity, we still use 1-dimensional objects.
• Objects: 1, 2, 5, 6, 7
• Agglomerative clustering – a very frequently used algorithm
• How to cluster: repeatedly find the two closest clusters and merge them;
  • => {1,2}, so we now have {1.5, 5, 6, 7} (merged clusters represented by their means);
  • => {1,2}, {5,6}, so {1.5, 5.5, 7};
  • => {1,2}, {{5,6},7}.
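The merging trace above can be sketched with centroid-distance agglomeration (a simplification: ties are broken by first occurrence, and we stop at two clusters to match the slide):

```python
# Centroid-linkage agglomerative clustering on the slide's 1-D example.
clusters = [[1], [2], [5], [6], [7]]   # bottom-up: each object is a cluster

def centroid(c):
    return sum(c) / len(c)

while len(clusters) > 2:               # stop at two clusters
    # Find the closest pair of clusters by centroid distance.
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = abs(centroid(clusters[i]) - centroid(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
    # Merge the closest pair into one cluster.
    _, i, j = best
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

The merges happen in the slide's order — {1,2}, then {5,6}, then {{5,6},7} — ending with {1,2} and {5,6,7}.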

14. Issues with dendrograms
• How to find proper clusters
• An alternative: divisive algorithms
  • Top-down
  • Compared with bottom-up, which is more efficient? What’s the time complexity?
  • How to efficiently divide the data
  • A heuristic – the Minimum Spanning Tree http://en.wikipedia.org/wiki/Minimum_spanning_tree
    • Time complexity – the fastest MST algorithms run in about O(e), where e is the number of edges

15. Distance measures
• Single link
  • Measured by the shortest edge between the two clusters
• Complete link
  • Measured by the longest edge
• Average link
  • Measured by the average edge length
• An example is shown next.

16. An example to show the different links
• Single link
  • Merge the nearest clusters as measured by the shortest edge between the two
  • (((A B) (C D)) E)
• Complete link
  • Merge the nearest clusters as measured by the longest edge between the two
  • (((A B) E) (C D))
• Average link
  • Merge the nearest clusters as measured by the average edge length between the two
  • (((A B) (C D)) E)
[Figure: five points B, A, E, C, D]
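The three inter-cluster distances can be sketched directly. The clusters and 1-D coordinates below are made up (the slide's A–E layout is a figure not reproduced here):

```python
# The three linkage distances between two clusters.
def single_link(c1, c2, dist):
    return min(dist(a, b) for a in c1 for b in c2)   # shortest edge

def complete_link(c1, c2, dist):
    return max(dist(a, b) for a in c1 for b in c2)   # longest edge

def average_link(c1, c2, dist):
    # Mean over all cross-cluster pairs.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

d = lambda a, b: abs(a - b)
c1, c2 = [1, 2], [5, 6, 7]
```

On these clusters the single-link distance is 3 (edge 2–5), complete-link is 6 (edge 1–7), and average-link is 4.5 — which is why the three criteria can merge clusters in different orders.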

17. Other Methods
• Density-based methods
  • DBSCAN: a cluster is a maximal set of density-connected points
  • Core points are defined using an ε-neighborhood and MinPts
  • Directly density-reachable points (e.g., P and Q, Q and M), density-reachable points (P and M; assuming so are P and N), and density-connected points (any density-reachable points: P, Q, M, N) form clusters
• Grid-based methods
  • STING: the lowest level is the original data
  • Statistical parameters of higher-level cells are computed from the parameters of the lower-level cells (count, mean, standard deviation, min, max, distribution)
• Model-based methods
  • Conceptual clustering: COBWEB
  • Category utility
    • Intraclass similarity
    • Interclass dissimilarity

18. Density-based
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• It grows regions of sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise.
• Many existing clustering algorithms find only spherical clusters.
• DBSCAN defines a cluster as a maximal set of density-connected points.
• Density is defined by an area (an ε-neighborhood) and a number of points.

19. Defining density and connection
[Figure: points Q, M, R, S, P, O]
• ε-neighborhood of an object x (core objects: M, P, O)
• MinPts: the minimum number of objects within the ε-neighborhood (say, 3)
• Directly density-reachable (Q from M, M from P)
  • Only core objects are mutually density-reachable
• Density-reachable (Q from P, but P not from Q) [asymmetric]
• Density-connected (O, R, S) [symmetric] — for border points
• What is the relationship between density-reachability and density-connectedness?

20. Clustering with DBSCAN
• Search for clusters by checking the ε-neighborhood of each instance x
• If the ε-neighborhood of x contains at least MinPts instances, create a new cluster with x as a core object
• Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters
• Terminate when no new point can be added to any cluster
• DBSCAN is sensitive to the density thresholds (ε and MinPts), but it is fast
  • Time complexity: O(N log N) if a spatial index is used, O(N²) otherwise
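The steps above can be sketched on 1-D data; the dataset, eps, and min_pts below are illustrative choices, not values from the slides:

```python
# A minimal DBSCAN sketch (cluster ids start at 1; -1 marks noise).
def dbscan(points, eps, min_pts, dist=lambda a, b: abs(a - b)):
    labels = {}          # point index -> cluster id
    cluster_id = 0

    def neighbors(i):
        # eps-neighborhood of point i (includes i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        if len(neighbors(i)) < min_pts:
            labels[i] = -1            # provisionally noise (may become a border point)
            continue
        cluster_id += 1               # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = neighbors(i)
        while seeds:                  # collect density-reachable points
            j = seeds.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id    # previously-noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            if len(neighbors(j)) >= min_pts:
                seeds.extend(neighbors(j))  # j is also core: expand through it
    return labels

labels = dbscan([1, 2, 5, 6, 7, 100], eps=1.5, min_pts=2)
```

With these parameters the sketch finds clusters {1,2} and {5,6,7} and flags 100 as noise; this brute-force neighborhood search is the O(N²) case, and a spatial index is what brings it down to O(N log N).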

21. Grid: STING (STatistical INformation Grid)
• Statistical parameters of higher-level cells can easily be computed from those of lower-level cells
  • Attribute-independent: count
  • Attribute-dependent: mean, standard deviation, min, max
  • Type of distribution: normal, uniform, exponential, or unknown
• Irrelevant cells can be removed

22. BIRCH: using Clustering Features (CF) and a CF tree
• A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS)
  • N – the number of instances, LS – their linear sum, SS – their square sum
• Two parameters: the branching factor (the max number of children per non-leaf node) and a threshold on sub-cluster size
• Two phases:
  • Build an initial in-memory CF tree
  • Apply a clustering algorithm to cluster the leaf nodes of the CF tree
• CURE (Clustering Using REpresentatives) is another example, allowing multiple representative points per cluster
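The CF triplet is additive, which is what lets a parent node in the CF tree summarize its children without revisiting the raw instances. A minimal 1-D sketch (the data is illustrative):

```python
# BIRCH clustering features for 1-D data: CF = (N, LS, SS).
def cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge(cf1, cf2):
    # CFs are additive: merging sub-clusters just adds the triplets.
    return tuple(a + b for a, b in zip(cf1, cf2))

a, b = [1, 2], [5, 6, 7]
merged = merge(cf(a), cf(b))
n, ls, ss = merged
mean = ls / n                  # centroid recovered from the CF alone
variance = ss / n - mean ** 2  # likewise for the (biased) variance
```

Statistics like the centroid and variance of any sub-cluster can thus be computed from its CF alone, in constant space per node.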

23. Taking advantage of a property of density
• If a region is dense in a higher-dimensional subspace, its projections into lower-dimensional subspaces are dense as well
• How to use this property?
• CLIQUE (CLustering In QUEst)
  • With high-dimensional data, there are many empty subspaces
  • Using the property above, we can start from dense lower-dimensional subspaces
  • CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality in which high-density clusters exist

24. Chameleon
• A hierarchical clustering algorithm using dynamic modeling
• Motivated by observations on the weaknesses of CURE and ROCK
  • CURE: clustering using representatives
  • ROCK: clustering categorical attributes
• Based on the k-NN graph and dynamic modeling

25. Graph-based clustering
• Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
• The nearest neighbors of a point tend to belong to the same class as the point itself.
• This reduces the impact of noise and outliers and sharpens the distinction between clusters.
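Sparsification by nearest neighbors can be sketched as building a k-nearest-neighbor graph; the 1-D data and k below are illustrative:

```python
# Sparsify a similarity graph by keeping each point's k nearest neighbors.
def knn_graph(points, k, dist=lambda a, b: abs(a - b)):
    edges = set()
    for i in range(len(points)):
        # Rank all other points by distance; keep edges only to the k nearest,
        # breaking the connections to less similar points.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))   # undirected edge
    return edges

edges = knn_graph([1, 2, 5, 6, 7], k=1)
```

Even with k = 1, the connected components of the sparsified graph ({1,2} and {5,6,7} here) already separate the two natural clusters.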

26. Neural networks
• Self-organizing feature maps (SOMs)
• Subspace clustering
  • CLIQUE: if a k-dimensional unit is dense, then so are its (k−1)-dimensional projections
  • More will be discussed later
• Semi-supervised clustering
  • http://www.cs.utexas.edu/~ml/publication/unsupervised.html
  • http://www.cs.utexas.edu/users/ml/risc/

27. Challenges
• Scalability
• Dealing with different types of attributes
• Clusters with arbitrary shapes
• Automatically determining input parameters
• Dealing with noise (outliers)
• Insensitivity to the order in which instances are presented
• High dimensionality
• Interpretability and usability