260 likes | 641 Vues
2. Outline. IntroductionK-means AlgorithmExampleK-means DemoRelevant IssuesConclusion. 3. Introduction. Partitioning Clustering Approacha typical clustering analysis approach via partitioning data set iterativelyconstruct a partition of a data set to produce several non-empty clusters (u
E N D
1. K-means Clustering Ke Chen
2. 2 Outline Introduction
K-means Algorithm
Example
K-means Demo
Relevant Issues
Conclusion
3. 3 Introduction Partitioning Clustering Approach
a typical clustering analysis approach via partitioning data set iteratively
construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters given in advance)
in principle, partitions achieved via minimising the sum of squared distance in each cluster
Given a K, find a partition of K clusters to optimise the chosen partitioning criterion
global optimal: exhaustively enumerate all partitions
heuristic method: K-means algorithm
K-means algorithm (MacQueen’67): each cluster is represented by the centre of the cluster and the algorithm converges to stable centres of clusters.
4. 4 K-mean Algorithm Given the cluster number K, the K-means algorithm is carried out in three steps:
5. 5 Problem
6. 6 Example Step 1: Use initial seed points for partitioning
7. 7 Example Step 2: Compute new centroids of the current partition
8. 8 Example Step 2: Renew membership based on new centroids
9. 9 Example Step 3: Repeat the first two steps until its convergence
10. 10 Example Step 3: Repeat the first two steps until its convergence
11. 11 K-means Demo
12. 12 K-means Demo
13. 13 K-means Demo
14. 14 K-means Demo
15. 15 K-means Demo
16. 16 K-means Demo
17. 17 K-means Demo
18. 18 Relevant Issues Efficient in computation
O(tKn), where n is number of objects, K is number of clusters, and t is number of iterations. Normally, K, t << n.
Local optimum
sensitive to initial seed points
converge to a local optimum that may be unwanted solution
Other problems
Need to specify K, the number of clusters, in advance
Unable to handle noisy data and outliers (K-Medoids algorithm)
Not suitable for discovering clusters with non-convex shapes
Applicable only when mean is defined, then what about categorical data? (K-mode algorithm)
19. 19 Relevant Issues Cluster Validity
With different initial conditions, the K-means algorithm may result in different partitions for a given data set.
Which partition is the “best” one for the given data set?
In theory, no answer to this question as there is no ground-truth available in unsupervised learning
Nevertheless, there are several cluster validity criteria to assess the quality of clustering analysis from different perspectives
A common cluster validity criterion is the ratio of the total between-cluster to the total within-cluster distances
Between-cluster distance (BCD): the distance between means of two clusters
Within-cluster distance (WCD): sum of all distance between data points and the mean in a specific cluster
A large ratio of BCD:WCD suggests good compactness inside clusters and good separability among different clusters!
20. 20 Conclusion K-means algorithm is a simple yet popular method for clustering analysis
Its performance is determined by initialisation and appropriate distance measure
There are several variants of K-means to overcome its weaknesses
K-Medoids: resistance to noise and/or outliers
K-Modes: extension to categorical data clustering analysis
CLARA: dealing with large data sets
Mixture models (EM algorithm): handling uncertainty of clusters