Presentation Transcript


  1. Get into pairs, please! • Person A: explain to Person B what datasets are and what clustering is about • Person B: explain to Person A how the k-means algorithm works

  2. CS26110 AI Toolbox: Clustering 2

  3. Clustering lectures overview • Datasets, data points, dimensionality, distance • What is clustering? • Partitional clustering • k-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link

  4. Today • Investigate the parameters of k-means clustering • Think about the limitations of k-means • Think about how to decide if a clustering is good or not

  5. Exercise • Given the following 1D data: {6, 8, 18, 28, 12, 32, 24}, choose your own initial centroids (stay within the range 6 to 32) and perform k-means • Iterate until converged: • Compute the distance from all data points to all k centroids • Assign each data point to the cluster whose current centroid is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
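
A minimal Python sketch of this exercise (an illustration, not code from the lecture), assuming k = 2 and the initial centroids 18 and 20 that appear on the next slide; the loop follows the four steps listed above.

# 1-D k-means for the exercise data, assuming k = 2 and initial centroids 18 and 20
data = [6, 8, 18, 28, 12, 32, 24]
centroids = [18.0, 20.0]

while True:
    # Steps 1 and 2: assign each point to the cluster with the nearest current centroid
    clusters = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)

    # Step 3: recompute each centroid as the mean of its assigned points
    new_centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]

    # Step 4 / convergence check: stop once the centroids no longer move
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)   # [11.0, 28.0]
print(clusters)    # [[6, 8, 18, 12], [28, 32, 24]]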

  6. Final clustering • Initial centroids: 18 and 20 • (Figure: number line showing the data points 6, 8, 12, 18, 24, 28, 32 and the final clusters)

  7. Previous clustering • Initial centroids: 11 and 20 • (Figure: number line showing the data points 6, 8, 13, 18, 24, 26, 32 and the resulting clusters)

  8. Initial seed choice • Results can vary based on random seed selection • Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings • Remedies: select good seeds using a heuristic, try out multiple starting points, or initialize with the results of another method • In the slide's figure (points A to F, not reproduced here): starting with B and E as centroids converges to {A,B,C} and {D,E,F}, while starting with D and F converges to {A,B,D,E} and {C,F}
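
A sketch of the "try out multiple starting points" heuristic on the same 1D exercise data: run k-means from several random seeds and keep the run with the lowest within-cluster sum of squared distances. The helper names kmeans and sse are my own, not from the lecture.

import random

def kmeans(data, k, seed):
    # One k-means run from a random start: k distinct data points as initial centroids
    centroids = random.Random(seed).sample(data, k)
    while True:
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            return centroids, clusters
        centroids = new

def sse(centroids, clusters):
    # Within-cluster sum of squared distances: lower means a tighter clustering
    return sum((x - c) ** 2 for c, pts in zip(centroids, clusters) for x in pts)

data = [6, 8, 18, 28, 12, 32, 24]
# Run from ten different seeds and keep the clustering with the lowest SSE
best_centroids, best_clusters = min((kmeans(data, 2, s) for s in range(10)),
                                    key=lambda run: sse(*run))
print(best_centroids, best_clusters)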

  9. Exercise • Given the following 1D data: {6, 8, 18, 28, 7, 32, 22}, choose your own centroids and perform k-means • Choose a value of k: 2, 3, or 4 • Iterate until converged: • Compute the distance from all data points to all k centroids • Assign each data point to the cluster whose current centroid is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages

  10. What this looks like... • (Figure: number line showing the data points 6, 7, 8, 18, 22, 28, 32)

  11. How many clusters? • Number of clusters k is required at the start • Finding the “right” number of clusters is part of the problem • Given data, partition into an “appropriate” number of subsets • Trade-off between having more clusters (better focus within each cluster) and having too many clusters
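
One way to see this trade-off on the exercise data is to run k-means for several values of k and look at how the within-cluster sum of squares shrinks as k grows (it reaches 0 when every point is its own cluster). A short sketch, reusing the hypothetical kmeans and sse helpers from the seed-choice example above:

data = [6, 8, 18, 28, 7, 32, 22]
for k in range(1, 6):
    # Best of a few random restarts for each k (kmeans/sse as defined earlier)
    score = min(sse(*kmeans(data, k, seed)) for seed in range(10))
    print(k, round(score, 2))
# The score always drops as k increases, so "lowest SSE" alone cannot choose k;
# the aim is a small k after which adding clusters stops helping much.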

  12. Time complexity • Computing the distance between n data points and one centroid is O(nm), where m is the dimensionality of the data points • Reassigning clusters: doing the above for each of the k centroids is O(knm) in total • Computing centroids: each point contributes to exactly one centroid, so O(nm) • Assume these steps are each performed once per iteration, for I iterations: O(Iknm)
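
A quick, purely illustrative back-of-the-envelope check of where O(Iknm) comes from; the numbers below are made up, not from the slides.

# Illustrative operation count for one hypothetical run
n, m, k, I = 100_000, 50, 10, 20       # points, dimensions, clusters, iterations
per_iteration = k * n * m              # compare every point with every centroid: O(knm)
total = I * per_iteration              # repeated for I iterations: O(Iknm)
print(f"{total:,} basic distance operations")   # 1,000,000,000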

  13. Limitations • Must choose parameter k in advance, or try many values • This is a particular problem for k-means as often the optimal number of clusters is not known • Data must be numerical and must be compared via a suitable distance measure

  14. Limitations • The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found • The algorithm is sensitive to outliers/points which do not belong in any cluster • These can distort the centroid positions and ruin the clustering

  15. Cluster validity • (Figure: two alternative clusterings of the data points 6, 7, 8, 18, 22, 28, 32, each shown on a number line for comparison)

  16. Cluster validity • (Figure: two further clusterings of the same data points 6, 7, 8, 18, 22, 28, 32, shown on number lines)

  17. Cluster validity: what we want! • High inter-cluster distances • Large distance between clusters • Otherwise known as good separability • Low intra-cluster distances • Distances between data points within a cluster should be relatively low • Otherwise known as good compactness • Many cluster validity measures have been developed
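
A small sketch of measuring compactness and separability on the 1D data from the validity slides. The two candidate groupings and the two ad-hoc measures below are my own illustrative choices, not the specific clusterings or validity indices from the lecture.

def compactness(clusters):
    # Mean distance from each point to its own cluster centroid (lower is better)
    total = count = 0
    for pts in clusters:
        centroid = sum(pts) / len(pts)
        total += sum(abs(x - centroid) for x in pts)
        count += len(pts)
    return total / count

def separability(clusters):
    # Smallest distance between any pair of cluster centroids (higher is better)
    centroids = [sum(pts) / len(pts) for pts in clusters]
    return min(abs(a - b) for i, a in enumerate(centroids) for b in centroids[i + 1:])

tight  = [[6, 7, 8], [18, 22], [28, 32]]   # respects the visible gaps in the data
sloppy = [[6, 7], [8, 18, 22], [28, 32]]   # splits one natural group across clusters
for name, clustering in [("tight", tight), ("sloppy", sloppy)]:
    print(name, round(compactness(clustering), 2), round(separability(clustering), 2))
# tight: lower intra-cluster distance and larger inter-centroid gap -> the better clustering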

  18. To think about... • Can GAs be used for partitional clustering? • What does a ‘solution’ to the clustering problem look like? • How would you encode this? • What fitness function would you use?

  19. What to take away • Be able to apply k-means clustering • Understand the issues involved in k-means clustering • Parameters, limitations • Analyse simple clusters for validity • Inter-cluster distance vs intra-cluster distance
