##### Presentation Transcript

2. Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8 of Introduction to Data Mining by Tan, Steinbach, Kumar

3. What Is Cluster Analysis? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized

4. Applications of Cluster Analysis • Description • Group related stocks with similar price fluctuations • Summarization • Reduce the size of large data sets (e.g., clustering precipitation in Australia)

5. What Is Not Cluster Analysis? • Supervised classification • Have class label information • Simple segmentation • Dividing students into different registration groups alphabetically, by last name • Results of a query • Groupings are a result of an external specification • Graph partitioning • Some mutual relevance and synergy, but the areas are not identical

6. Notion of a Cluster Can Be Ambiguous • How many clusters? The same set of points can reasonably be seen as two, four, or six clusters.

7. Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional Clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree Note: Hierarchical clustering is out of the scope of this discipline

8. A Partitional Clustering • Original points and a partitional clustering of them

9. Types of Clusters • Well-separated clusters • Center-based clusters • Contiguous clusters • Density-based clusters • Property or Conceptual • Described by an Objective Function

10. Types of Clusters: Well-Separated • Well-Separated Clusters • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster 3 well-separated clusters

11. Types of Clusters: Center-Based • Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters
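As a sketch of the centroid/medoid distinction, the snippet below computes both for a small 2-D cluster (pure Python, toy data; the points are the same ones used in the centroid example of a later slide):

```python
# Sketch: centroid (mean point, need not be a data point) vs. medoid (the
# most "representative" actual data point) of a small 2-D cluster.

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def medoid(points):
    """The data point with the smallest total squared distance to the others."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(points, key=lambda p: sum(dist2(p, q) for q in points))

cluster = [(1, 1), (2, 3), (6, 2)]
print(centroid(cluster))  # (3.0, 2.0)
print(medoid(cluster))    # (2, 3)
```

Note that the centroid (3.0, 2.0) is not one of the three points, while the medoid is always a member of the cluster.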

12. Types of Clusters: Contiguity-Based • Contiguous Cluster (Nearest neighbor or Transitive) • A cluster is a set of points such that a point in a cluster is transitively closer (or more similar) to one or more other points in the cluster than to any point not in the cluster 8 contiguous clusters

13. Types of Clusters: Density-Based • Density-based • A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density • Used when the clusters are irregular or intertwined, and when noise and outliers are present 6 density-based clusters

14. Types of Clusters: Conceptual Clusters • Shared Property or Conceptual Clusters • Finds clusters that share some common property or represent a particular concept. 2 overlapping circles

15. Points as Representation of Instances • How to represent an instance as a geometric point? • Points P1 and P2 should end up closer to each other than to P3

| Name | Dept | Course |
| --- | --- | --- |
| Marcus (P1) | DSC | Computer Science |
| Cláudio (P2) | DSC | Computer Science |
| Péricles (P3) | DEE | Electric Engineering |
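One common way to obtain such points, sketched below with the table's values, is one-hot encoding of the categorical attributes: instances that share attribute values then coincide or lie close together. The encoding choice is illustrative, not necessarily the one WEKA uses internally.

```python
# Sketch: turning categorical attributes into geometric points via one-hot
# encoding. Attribute values are taken from the slide's example table.

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

depts = ["DSC", "DEE"]
courses = ["Computer Science", "Electric Engineering"]

def encode(dept, course):
    return one_hot(dept, depts) + one_hot(course, courses)

p1 = encode("DSC", "Computer Science")      # Marcus
p2 = encode("DSC", "Computer Science")      # Cláudio
p3 = encode("DEE", "Electric Engineering")  # Péricles

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(dist2(p1, p2), dist2(p1, p3))  # 0.0 4.0: P1 and P2 coincide, P3 is far
```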

16. WEKA • Points by instance and by attribute • The most frequent attribute values in the cluster

17. Types of Clusters: Objective Function • Clusters Defined by an Objective Function • Finds clusters that minimize or maximize an objective function • Enumerate all possible ways of dividing the points into clusters and evaluate the "goodness" of each potential set of clusters by using the given objective function (NP-hard) • Can have global or local objectives • Hierarchical clustering algorithms typically have local objectives • Partitional algorithms typically have global objectives

18. Map the clustering problem to a different domain and solve a related problem in that domain • Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points • Clustering is equivalent to breaking the graph into connected components, one for each cluster. • Want to minimize the edge weight between clusters and maximize the edge weight within clusters
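This graph view can be sketched in a few lines: connect points whose distance is below a threshold and read clusters off as connected components (the threshold and data below are illustrative):

```python
# Sketch: clustering as graph partitioning. Points closer than `threshold`
# are joined by an edge; each connected component of the resulting graph
# is one cluster.
from math import dist

def connected_components(points, threshold):
    unvisited, clusters = set(range(len(points))), []
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            # Neighbors of i that have not yet been placed in a component.
            near = {j for j in unvisited if dist(points[i], points[j]) <= threshold}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.append(sorted(component))
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(sorted(connected_components(pts, 2.0)))  # [[0, 1, 2], [3, 4]]
```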

19. Characteristics of the Input Data Are Important • Type of proximity or density measure • This is a derived measure, but central to clustering • Sparseness • Dictates type of similarity • Adds to efficiency • Attribute type • Dictates type of similarity • Type of Data • Dictates type of similarity • Other characteristics, e.g., autocorrelation • Dimensionality • Noise and Outliers • Type of Distribution

20. Measuring the Clustering Performance • Notice that clustering is a descriptive model, not a predictive model (see Introduction) • Performance metrics are therefore different from those of supervised classification • We will see the metric SSE later • Other metrics • Precision • Recall • Entropy • Purity

21. Clustering Algorithms • K-means and one variant

22. K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest centroid, i.e., a centroid-based objective function • Number of clusters, K, must be specified • The basic algorithm is very simple
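A minimal sketch of this basic loop (random initial centroids, assign, recompute, repeat until assignments stabilize; toy data, not an optimized implementation):

```python
# Minimal sketch of basic K-means: choose K initial centroids at random,
# assign each point to its nearest centroid, recompute each centroid as the
# mean of its cluster, and repeat until nothing changes.
import random
from math import dist

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    while True:
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid in this sketch).
        new = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # no centroid moved: converged
            return centroids, clusters
        centroids = new

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(pts, 2)
print(sorted(len(g) for g in groups))  # [3, 3]: the two blobs are separated
```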

23. K-means Clustering – Details • Proximity function • Squared Euclidean distance (L2²) • Type of centroid: mean • Example of a centroid or mean • A cluster containing the three points (1,1), (2,3) and (6,2) • Centroid = ((1+2+6)/3, (1+3+2)/3) = (3,2) • A problem of minimization
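The "minimization" claim can be checked numerically: the centroid (3, 2) from the example above yields a smaller sum of squared errors than nearby candidate representatives (the candidates are arbitrary, for illustration only):

```python
# Numerical check that the centroid minimizes the SSE for its cluster:
# compare the SSE of the centroid (3, 2) against a few nearby candidates.
def sse(rep, points):
    return sum((x - rep[0]) ** 2 + (y - rep[1]) ** 2 for x, y in points)

pts = [(1, 1), (2, 3), (6, 2)]
center = (3, 2)  # the centroid computed on the slide
for cand in [(2, 2), (3, 1), (4, 2)]:
    assert sse(center, pts) <= sse(cand, pts)
print(sse(center, pts))  # 16
```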

24. K-means always converges to a solution • It reaches a state in which no points are shifting from one cluster to another, and hence the centroids don't change • Initial centroids are often chosen randomly • Clusters produced vary from one run to another • Complexity is O(n * K * I * d) • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

25. The actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the sum of the squared error (SSE) • Based on optimizing the SSE for specific choices of the centroids and clusters, rather than for all possible choices

26. Two Different K-means Clusterings • Original points, an optimal clustering, and a sub-optimal clustering

27. Evaluating K-means Clusters • Most common measure is the Sum of Squared Error (SSE) • For each point, the error is the distance to the nearest centroid • To get the SSE, we square these errors and sum them: SSE = Σᵢ₌₁ᴷ Σ_{x ∈ Cᵢ} dist(mᵢ, x)² • x is a data point in cluster Cᵢ and mᵢ is the representative point for cluster Cᵢ • One can show that mᵢ corresponds to the center (mean) of the cluster
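A short sketch of computing the total SSE just defined, summing over clusters the squared distances from each point to its cluster's mean mᵢ (illustrative data):

```python
# Sketch: total SSE of a clustering = sum over clusters of squared
# distances from each point to its cluster's mean m_i.
from math import dist

def total_sse(clusters):
    total = 0.0
    for pts in clusters:
        m = tuple(sum(x) / len(pts) for x in zip(*pts))  # centroid m_i
        total += sum(dist(p, m) ** 2 for p in pts)
    return total

clusters = [[(1, 1), (2, 3), (6, 2)], [(10, 10), (10, 12)]]
print(total_sse(clusters))  # approximately 16 + 2 = 18
```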

28. Step 3: forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids • Step 4: recomputes the centroids so as to further minimize the SSE • Given two different sets of clusters, we can choose the one with the smallest error

29. Step 1: Choosing Initial Centroids • When random initialization of centroids is used, different runs of K-means typically produce different total SSEs • The resulting clusters are often poor • In the next slides, we provide another example of initial centroids, using the same data as in the former example • Now the solution is suboptimal: the minimum-SSE clustering is not found • That is, the solution is only a local optimum

30. Step 1: Choosing Initial Centroids

31. How to overcome the problem of choosing good initial centroids? Three approaches • Multiple runs • A variant of K-means that is less susceptible to initialization problems: Bisecting K-means • Using postprocessing to "fix up" the set of clusters produced

32. Multiple Runs of K-means • Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest SSE • The centroids of this clustering are a better representation of the points in their clusters • The technique • Perform multiple runs • Each with a different set of randomly chosen initial centroids • Select the clusters with the minimum SSE • May not work very well
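The multiple-runs technique can be sketched as a compact K-means plus a wrapper that keeps the run with the smallest SSE (toy implementation; the seed is fixed only for reproducibility):

```python
# Sketch of the multiple-runs technique: run K-means several times with
# different random initial centroids and keep the clustering with the
# smallest SSE.
import random
from math import dist

def kmeans(points, k, rng):
    cents = rng.sample(points, k)
    while True:
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, cents[i]))].append(p)
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else cents[i]
               for i, g in enumerate(groups)]
        if new == cents:
            return cents, groups
        cents = new

def sse(cents, groups):
    return sum(dist(p, c) ** 2 for c, g in zip(cents, groups) for p in g)

def best_of_runs(points, k, runs=10, seed=0):
    rng = random.Random(seed)
    # Each call to kmeans draws fresh initial centroids from the shared rng.
    return min((kmeans(points, k, rng) for _ in range(runs)),
               key=lambda run: sse(*run))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = best_of_runs(pts, 2)
```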

33. Reducing the SSE with Postprocessing • An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K • The local SSEs become smaller, and then the global SSE also becomes smaller • However, in many cases we would like to improve the SSE but don't want to increase the number of clusters

34. One strategy that decreases the total SSE by increasing the number of clusters • Split a cluster: the cluster with the largest SSE is usually chosen • One strategy that decreases the number of clusters, while trying to minimize the increase in total SSE • Merge two clusters: the clusters with the closest centroids are typically chosen

35. Bisecting K-means • Bisecting K-means algorithm • Variant of K-means that can produce a partitional or a hierarchical clustering

Algorithm: Bisecting K-means
1: Initialize the list of clusters to contain the cluster consisting of all points
2: repeat
3:   Remove a cluster from the list of clusters
4:   {Perform several "trial" bisections of the chosen cluster}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic 2-means
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE
9:   Add these two clusters to the list of clusters
10: until the list of clusters contains K clusters
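A sketch of the algorithm above. For brevity, this toy 2-means seeds with the two farthest-apart points instead of performing random trial bisections, and the cluster chosen for splitting is the one with the largest SSE (one of the criteria listed on the next slide):

```python
# Sketch of bisecting K-means: repeatedly split one cluster with 2-means
# until K clusters remain. Deterministic toy version, not the full
# trial-bisection algorithm.
from math import dist
from itertools import combinations

def two_means(points):
    # Seed with the two farthest-apart points (deterministic toy choice).
    cents = max(combinations(points, 2), key=lambda ab: dist(*ab))
    while True:
        halves = ([], [])
        for p in points:
            halves[dist(p, cents[1]) < dist(p, cents[0])].append(p)
        new = tuple(tuple(sum(x) / len(h) for x in zip(*h)) for h in halves)
        if new == cents:
            return list(halves)
        cents = new

def cluster_sse(points):
    m = tuple(sum(x) / len(points) for x in zip(*points))
    return sum(dist(p, m) ** 2 for p in points)

def bisecting_kmeans(points, k):
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # split largest-SSE cluster
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(len(c) for c in bisecting_kmeans(pts, 2)))  # [3, 3]
```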

36. There are a number of different ways to choose which cluster to split at each step • The largest cluster • The cluster with the largest SSE • A criterion based on both size and SSE • One can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm

37. Bisecting K-means Example

38. Limitations of K-means • K-means has problems when clusters are of differing • Sizes • Densities • Non-globular shapes • K-means has problems when the data contains outliers

39. Limitations of K-means: Differing Sizes K-means (3 Clusters) Original Points

40. Limitations of K-means: Differing Density K-means (3 Clusters) Original Points

41. Limitations of K-means: Non-globular Shapes Original Points K-means (2 Clusters)

42. Overcoming K-means Limitations Original Points K-means Clusters • One solution is to use many clusters • This finds parts of the natural clusters, which then need to be put back together

43. Original Points K-means Clusters

44. Original Points K-means Clusters

45. Running WEKA SimpleKMeans

```
=== Run information ===
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data
=== Model and evaluation on training set ===
```