Presentation Transcript


  1. Get into pairs, please! • Person A: explain to Person B what datasets are and what clustering is about • Person B: explain to Person A how the k-means algorithm works

  2. CS26110 AI Toolbox: Clustering 2

  3. Clustering lectures overview • Datasets, data points, dimensionality, distance • What is clustering? • Partitional clustering • k-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link

  4. Today • Investigate the parameters of k-means clustering • Think about the limitations of k-means • Think about how to decide if a clustering is good or not

  5. Exercise • Given the following 1D data: {6, 8, 18, 28, 12, 32, 24}, choose your own initial centroids (stay within the range 6 to 32) and perform k-means • Iterate until converged: • Compute the distance from all data points to all k centroids • Assign each data point to the cluster whose current centroid is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
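
A minimal Python sketch of this exercise (an illustration, not code from the lecture), assuming k = 2 and the initial centroids 18 and 20 that appear on the next slide; the loop follows the four steps listed above.

# 1-D k-means for the exercise data, assuming k = 2 and initial centroids 18 and 20
data = [6, 8, 18, 28, 12, 32, 24]
centroids = [18.0, 20.0]

while True:
    # Steps 1 and 2: assign each point to the cluster with the nearest current centroid
    clusters = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)

    # Step 3: recompute each centroid as the mean of its assigned points
    new_centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]

    # Step 4 / convergence check: stop once the centroids no longer move
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)   # [11.0, 28.0]
print(clusters)    # [[6, 8, 18, 12], [28, 32, 24]]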

  6. Final clustering • Initial centroids: 18 and 20 • (Figure: number line showing the data points 6, 8, 12, 18, 24, 28, 32 and the final clusters)

  7. Previous clustering • Initial centroids: 11 and 20 • (Figure: number line showing the data points 6, 8, 13, 18, 24, 26, 32 and the resulting clusters)

  8. Initial seed choice • Results can vary based on random seed selection • Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings • Remedies: select good seeds using a heuristic, try out multiple starting points, or initialize with the results of another method • In the slide's figure (points A to F, not reproduced here): starting with B and E as centroids converges to {A,B,C} and {D,E,F}, while starting with D and F converges to {A,B,D,E} and {C,F}
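
A sketch of the "try out multiple starting points" heuristic on the same 1D exercise data: run k-means from several random seeds and keep the run with the lowest within-cluster sum of squared distances. The helper names kmeans and sse are my own, not from the lecture.

import random

def kmeans(data, k, seed):
    # One k-means run from a random start: k distinct data points as initial centroids
    centroids = random.Random(seed).sample(data, k)
    while True:
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            return centroids, clusters
        centroids = new

def sse(centroids, clusters):
    # Within-cluster sum of squared distances: lower means a tighter clustering
    return sum((x - c) ** 2 for c, pts in zip(centroids, clusters) for x in pts)

data = [6, 8, 18, 28, 12, 32, 24]
# Run from ten different seeds and keep the clustering with the lowest SSE
best_centroids, best_clusters = min((kmeans(data, 2, s) for s in range(10)),
                                    key=lambda run: sse(*run))
print(best_centroids, best_clusters)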

  9. Exercise • Given the following 1D data: {6, 8, 18, 28, 7, 32, 22}, choose your own centroids and perform k-means • Choose a value of k: 2, 3, or 4 • Iterate until converged: • Compute the distance from all data points to all k centroids • Assign each data point to the cluster whose current centroid is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages

  10. What this looks like... • (Figure: number line showing the data points 6, 7, 8, 18, 22, 28, 32)

  11. How many clusters? • Number of clusters k is required at the start • Finding the “right” number of clusters is part of the problem • Given data, partition into an “appropriate” number of subsets • Trade-off between having more clusters (better focus within each cluster) and having too many clusters
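
One way to see this trade-off on the exercise data is to run k-means for several values of k and look at how the within-cluster sum of squares shrinks as k grows (it reaches 0 when every point is its own cluster). A short sketch, reusing the hypothetical kmeans and sse helpers from the seed-choice example above:

data = [6, 8, 18, 28, 7, 32, 22]
for k in range(1, 6):
    # Best of a few random restarts for each k (kmeans/sse as defined earlier)
    score = min(sse(*kmeans(data, k, seed)) for seed in range(10))
    print(k, round(score, 2))
# The score always drops as k increases, so "lowest SSE" alone cannot choose k;
# the aim is a small k after which adding clusters stops helping much.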

  12. Time complexity • Computing the distance between n data points and one centroid is O(nm), where m is the dimensionality of the data points • Reassigning clusters: doing the above for each of the k centroids is O(knm) in total • Computing centroids: each point contributes to exactly one centroid, so O(nm) • Assume these steps are each performed once per iteration, for I iterations: O(Iknm)
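
A quick, purely illustrative back-of-the-envelope check of where O(Iknm) comes from; the numbers below are made up, not from the slides.

# Illustrative operation count for one hypothetical run
n, m, k, I = 100_000, 50, 10, 20       # points, dimensions, clusters, iterations
per_iteration = k * n * m              # compare every point with every centroid: O(knm)
total = I * per_iteration              # repeated for I iterations: O(Iknm)
print(f"{total:,} basic distance operations")   # 1,000,000,000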

  13. Limitations • Must choose parameter k in advance, or try many values • This is a particular problem for k-means as often the optimal number of clusters is not known • Data must be numerical and must be compared via a suitable distance measure

  14. Limitations • The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found • The algorithm is sensitive to outliers/points which do not belong in any cluster • These can distort the centroid positions and ruin the clustering

  15. Cluster validity • (Figure: two alternative clusterings of the data points 6, 7, 8, 18, 22, 28, 32, each shown on a number line for comparison)

  16. Cluster validity • (Figure: two further clusterings of the same data points 6, 7, 8, 18, 22, 28, 32, shown on number lines)

  17. Cluster validity: what we want! • High inter-cluster distances • Large distance between clusters • Otherwise known as good separability • Low intra-cluster distances • Distances between data points within a cluster should be relatively low • Otherwise known as good compactness • Many cluster validity measures have been developed
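
A small sketch of measuring compactness and separability on the 1D data from the validity slides. The two candidate groupings and the two ad-hoc measures below are my own illustrative choices, not the specific clusterings or validity indices from the lecture.

def compactness(clusters):
    # Mean distance from each point to its own cluster centroid (lower is better)
    total = count = 0
    for pts in clusters:
        centroid = sum(pts) / len(pts)
        total += sum(abs(x - centroid) for x in pts)
        count += len(pts)
    return total / count

def separability(clusters):
    # Smallest distance between any pair of cluster centroids (higher is better)
    centroids = [sum(pts) / len(pts) for pts in clusters]
    return min(abs(a - b) for i, a in enumerate(centroids) for b in centroids[i + 1:])

tight  = [[6, 7, 8], [18, 22], [28, 32]]   # respects the visible gaps in the data
sloppy = [[6, 7], [8, 18, 22], [28, 32]]   # splits one natural group across clusters
for name, clustering in [("tight", tight), ("sloppy", sloppy)]:
    print(name, round(compactness(clustering), 2), round(separability(clustering), 2))
# tight: lower intra-cluster distance and larger inter-centroid gap -> the better clustering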

  18. To think about... • Can GAs be used for partitional clustering? • What does a ‘solution’ to the clustering problem look like? • How would you encode this? • What fitness function would you use?

  19. What to take away • Be able to apply k-means clustering • Understand the issues involved in k-means clustering • Parameters, limitations • Analyse simple clusters for validity • Inter-cluster distance vs intra-cluster distance
