Chapter 4 Clustering

What is Clustering? • The process of organizing objects into groups whose members are similar in some way • Statistics, machine learning, and database researchers have studied data clustering • Recent emphasis on large datasets

Approaches to Clustering • Two main approaches to clustering: • Partitional clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree

Problem Statement • N objects to be grouped into k clusters • Many different groupings are possible • If we have 5 objects to be classified into 2 clusters, how many possibilities are there? 2^5 / 2! = 32/2 = 16 (this count includes the grouping that leaves one cluster empty; requiring both clusters to be non-empty gives 15) • The objective is to find a grouping such that the distances between objects within each group are minimal
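The count on this slide can be checked by brute force. The sketch below is only an illustration: the function name `count_groupings` and its `allow_empty` flag are assumptions introduced here, not part of the original material. It enumerates every labeling of n objects with k cluster names and treats two labelings as the same grouping when they differ only in how the clusters are named.

```python
from itertools import product

def count_groupings(n, k, allow_empty=True):
    """Count distinct groupings of n objects into at most k unnamed clusters.

    Hypothetical helper for illustration; brute force, so only usable for small n.
    """
    seen = set()
    for labels in product(range(k), repeat=n):
        # Collect the objects that received each cluster label.
        groups = [frozenset(i for i, lab in enumerate(labels) if lab == c)
                  for c in range(k)]
        # Drop empty groups and ignore cluster names by storing a set of sets.
        nonempty = frozenset(g for g in groups if g)
        if not allow_empty and len(nonempty) < k:
            continue
        seen.add(nonempty)
    return len(seen)

print(count_groupings(5, 2))                     # one cluster may be empty
print(count_groupings(5, 2, allow_empty=False))  # both clusters must be used
```

The number of groupings grows very quickly with N, which is why exhaustive search is impractical and iterative methods such as k-means are used instead.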

Types • Statistical methods • K-means algorithm • Probabilistic clustering • The agglomerative algorithm • Neural network based approaches • Kohonen's self-organizing maps (SOM) • Evolutionary computing (GA) • Text clustering

K-means Algorithm • Step 1: Randomly select k points to be the starting centroids of the k clusters. • Step 2: Assign each object to its closest centroid, forming k mutually exclusive clusters. • Step 3: Calculate the new centroid of each cluster by averaging the attribute values of all objects belonging to that cluster. • Step 4: Check whether the cluster centroids have changed their coordinates. If yes, repeat from Step 2. • Step 5: If no, cluster detection is finished, and all objects have their cluster memberships defined.
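The steps above can be sketched for one-dimensional data as follows. This is a minimal illustration, not a reference implementation: the name `kmeans_1d`, the optional `init` argument (used to fix the starting centroids instead of choosing them randomly), and the sample data are all assumptions made for this example.

```python
import random

def kmeans_1d(points, k, init=None, max_iter=100, seed=0):
    """Minimal 1-D k-means sketch (hypothetical helper, not from the slides)."""
    rng = random.Random(seed)
    # Select k starting centroids (randomly unless given explicitly).
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each object to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data, chosen only to show the mechanics.
centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2, init=[1, 2])
print(centroids, clusters)
```

Passing `init` makes the run reproducible; with random starting points, different seeds can lead to different final clusters.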

K-Means Flowchart

Numerical Example • One-dimensional database with N = 9 • Objects labeled z1…z9 • Let k = 2 • Let us start with z1 and z2 as the initial centroids: z1 = 2, z2 = 4 • Compute the distance from each object to both centroids.
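The distance computation for the first pass can be sketched as below. The starting centroids 2 and 4 are taken from the slide; the remaining object values are not given in the text, so the values after the first two are placeholders, and `assign_to_nearest` is a hypothetical helper name.

```python
def assign_to_nearest(values, centroids):
    """Return (value, distances to each centroid, index of nearest centroid) per value."""
    rows = []
    for v in values:
        dists = [abs(v - c) for c in centroids]
        rows.append((v, dists, dists.index(min(dists))))
    return rows

# Only 2 and 4 come from the slide; 5 and 9 are placeholder values.
values = [2, 4, 5, 9]
for v, dists, nearest in assign_to_nearest(values, centroids=[2, 4]):
    print(v, dists, "-> cluster", nearest + 1)
```

In one dimension the distance is simply the absolute difference; ties go to the lower-numbered cluster in this sketch.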

Example - Clustering

Example - Recompute the Means

Example • Reassign each object to one of the two clusters based on the new centroids: Centroid-1 = 2.5, Centroid-2 = 16

Clustering - Iteration 2

Example - Recompute the Means

Clustering - Iteration 3 • Reassign each object to one of the two clusters based on the new centroids: Centroid-1 = 3, Centroid-2 = 18

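Example • No change in clusters, so the algorithm stops. • The means have converged. In general, k-means converges to a local optimum that depends on the initial centroids, which is not necessarily the global optimum.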
