
Chapter 5 Data Mining: Clustering



Presentation Transcript


  1. Chapter 5 Data Mining: Clustering. L. Malak Bagais. [Textbook]: Chapter 5

  2. What is Clustering? The process of organizing objects into groups whose members are similar in some way. Data clustering has been studied by researchers in statistics, machine learning, and databases, with recent emphasis on large datasets.

  3. Approaches to Clustering • Two main approaches to clustering: • Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

  4. Problem Statement • N objects are to be grouped into k clusters • Many different groupings are possible • If we have 5 objects to be classified into 2 clusters, how many possibilities are there? 2^5 / 2! = 32 / 2 = 16 • The objective is to find a grouping such that the distances between objects within a group are minimized
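  As a quick check of that count, the sketch below computes the slide's 2^5 / 2! estimate alongside the exact number of ways to split N objects into k non-empty clusters (the Stirling number of the second kind; the estimate is one higher here because it also counts the grouping that leaves one cluster empty):

      from math import comb, factorial

      def stirling2(n, k):
          # Exact number of ways to partition n objects into k non-empty clusters
          return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

      n, k = 5, 2
      print(k ** n // factorial(k))  # slide's estimate: 16
      print(stirling2(n, k))         # exact count: 15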

  5. Types • Statistical methods • K-means algorithm • Probabilistic clustering • The agglomerative algorithm • Neural network based approaches • Kohonen's self-organizing maps (SOM) • Evolutionary computing (GA) • Text clustering

  6. K-means Algorithm • Randomly select k points to be the starting points for the centroids of the k clusters. • Assign each object to the centroid closest to the object, forming k exclusive clusters of examples. • Calculate new centroids of the clusters. Take the average of all the attribute values of the objects belonging to the same cluster. • Check if the cluster centroids have changed their coordinates. If yes, repeat from Step 2. • If no, cluster detection is finished, and all objects have their cluster memberships defined.
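  A minimal sketch of these steps in Python with NumPy (the function name and the choice of Euclidean distance are illustrative assumptions; the slides do not fix either):

      import numpy as np

      def k_means(X, k, rng=np.random.default_rng(0)):
          # Step 1: randomly select k objects as the starting centroids
          centroids = X[rng.choice(len(X), size=k, replace=False)]
          while True:
              # Step 2: assign each object to its closest centroid
              distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
              labels = distances.argmin(axis=1)
              # Step 3: recompute each centroid as the mean of its cluster
              # (assumes no cluster ends up empty)
              new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
              # Steps 4-5: if no centroid moved, the clusters are final
              if np.allclose(new_centroids, centroids):
                  return labels, centroids
              centroids = new_centroids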

  7. K-Means Flowchart

  8. Numerical Example • One-dimensional database with N = 9 • Objects labeled z1…z9 • Let k = 2 • Start with z1 and z2 as the initial centroids: z1 = 2, z2 = 4 • Compute the distance of each object to the centroids

  9. Example - Clustering

  10. Example: Re-compute the Means

  11. Example • Reassign each object to the two clusters based on the new centroids: Centroid-1 = 2.5, Centroid-2 = 16

  12. Clustering: Iteration 2

  13. Example: Re-compute the Means

  14. Clustering: Iteration 3 • Reassign each object to the two clusters based on the new centroids: Centroid-1 = 3, Centroid-2 = 18

  15. Example • No change in the clusters, so the algorithm stops; the means have converged to their final values.
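  The transcript does not include the nine data values themselves, so the sketch below uses a hypothetical dataset chosen to be consistent with the centroids the slides report (2.5 and 16 after the first update, then 3 and 18). It replays the iterations with the same update rule as the k_means sketch above:

      import numpy as np

      # Hypothetical values for z1..z9 (only z1 = 2 and z2 = 4 appear in the transcript)
      X = np.array([2.0, 4.0, 3.0, 11.0, 12.0, 16.0, 20.0, 24.0, 25.0]).reshape(-1, 1)

      centroids = np.array([[2.0], [4.0]])  # slide 8: start from z1 and z2
      for iteration in range(1, 10):
          labels = np.abs(X - centroids.T).argmin(axis=1)
          new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
          print(f"Iteration {iteration}: centroids = {new_centroids.ravel()}")
          if np.allclose(new_centroids, centroids):
              break  # no change: iteration 1 gives 2.5 and 16, iteration 2 gives 3 and 18
          centroids = new_centroids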

  16. Neural Network Based Approaches • Figure 5.12: Single artificial neuron with three inputs • Table 5.4: All possible input patterns from Figure 5.12

  17. Finding the Output Value

  18. Finding the Output Value • w1 = 2, w2 = -4, w3 = 1 • Table 5.5: Input patterns for the neural network
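  A sketch of how such an input-pattern table can be filled in. The transcript does not show the neuron's activation function, so a hard threshold at zero (output 1 when the weighted sum is positive, otherwise 0) is assumed here:

      from itertools import product

      w = (2, -4, 1)  # weights from slide 18

      # Enumerate all 2^3 binary input patterns, as in Tables 5.4 and 5.5
      for x in product((0, 1), repeat=3):
          weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
          output = 1 if weighted_sum > 0 else 0  # assumed threshold activation
          print(f"inputs = {x}  weighted sum = {weighted_sum:3d}  output = {output}")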

  19. Kohonen Self-Organizing Map • Invented by Teuvo Kohonen, a professor at the Academy of Finland, in 1979. • Provides a data visualization technique that helps in understanding high-dimensional data by reducing the dimensions of the data to a map. • The SOM also embodies the clustering concept by grouping similar data together.

  20. Kohonen Self-Organizing Map SOM reduces data dimensions and displays similarities among data.

  21. Kohonen SOM The self-organizing map describes a mapping from a higher-dimensional input space to a lower-dimensional map space. To place a vector from data space onto the map, first find the node with the weight vector closest to the data vector. Once the closest node is located, it is assigned the values of that data vector.

  22. How the algorithm works • Initialize the weights • Get best matching unit • Scale neighbors • Determining neighbors • Learning

  23. Training a SOM Training occurs in several steps and over many iterations: • Each node's weights are initialized. • A vector is chosen at random from the set of training data and presented to the lattice. • Every node is examined to calculate which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU). • The radius of the neighbourhood of the BMU is now calculated. This value starts large, typically set to the 'radius' of the lattice, but diminishes each time-step. Any nodes found within this radius are deemed to be inside the BMU's neighbourhood. • Each neighbouring node's weights (the nodes found in step 4) are adjusted to make them more like the input vector. The closer a node is to the BMU, the more its weights are altered. • Repeat from step 2 for N iterations.
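  A sketch of steps 3 and 4, locating the BMU and shrinking the neighbourhood radius (the exponential decay schedule and the function names are assumptions for illustration; the slides only say the radius starts large and diminishes):

      import numpy as np

      def best_matching_unit(weights, x):
          # weights: (rows, cols, dim) lattice of weight vectors; x: (dim,) input vector
          distances = np.linalg.norm(weights - x, axis=2)
          return np.unravel_index(distances.argmin(), distances.shape)

      def neighbourhood_radius(t, initial_radius, time_constant):
          # Starts at the lattice 'radius' and diminishes each time-step
          return initial_radius * np.exp(-t / time_constant)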

  24. Components of a SOM • 1. Data • Colors are represented in three dimensions (red, blue, and green). The idea of the self-organizing map is to project the n-dimensional data (here the data are colors, so n = 3) into something that can be better understood visually (in this case, a 2-dimensional image map).

  25. Components of a SOM • 2. Weight Vectors • Each weight vector has two components. • The first part of a weight vector is its data (R, G, B). • The second part of a weight vector is its natural location (x, y).

  26. The Problem • With these two components (the data and the weight vectors), how can one order the weight vectors in such a way that they represent the similarities of the sample vectors?

  27. Algorithm
      Initialize map
      For t from 0 to 1:
          Randomly select a sample
          Get best matching unit
          Scale neighbours
          Increase t a small amount
      End for

  28. Scaling & Neighborhood Function • Equation (5.1): w_i(t+1) = w_i(t) + h_ck(t) · [x(t) − w_i(t)]
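  A sketch of one full training step applying Equation (5.1) across a lattice, using the RGB color data of slides 24-25. The Gaussian form of the neighbourhood function h_ck and the decay schedules for the learning rate and radius are common choices assumed here, not given in the transcript:

      import numpy as np

      rng = np.random.default_rng(0)
      rows, cols, dim = 20, 20, 3  # 2-D lattice of RGB weight vectors
      weights = rng.random((rows, cols, dim))
      grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=2)

      def train_step(x, t, n_steps, lr0=0.5, radius0=10.0):
          lr = lr0 * np.exp(-t / n_steps)          # assumed learning-rate decay
          radius = radius0 * np.exp(-t / n_steps)  # shrinking neighbourhood radius
          bmu = np.unravel_index(np.linalg.norm(weights - x, axis=2).argmin(), (rows, cols))
          d2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)  # squared lattice distance to the BMU
          h = lr * np.exp(-d2 / (2 * radius ** 2))        # Gaussian neighbourhood h_ck(t)
          # Equation (5.1): w_i(t+1) = w_i(t) + h_ck(t) * [x(t) - w_i(t)]
          weights[:] = weights + h[..., None] * (x - weights)

      for t in range(1000):
          train_step(rng.random(3), t, 1000)  # present one random RGB sample per step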

  29. Numerical Demonstration • Figure 5.14: SOM • Table 5.6: Values of the weights for the SOM

  30. Numerical Demonstration

  31. Numerical Demonstration

  32. Numerical Demonstration

  33. Text Clustering • An increasingly popular technique for grouping similar documents • A search engine may retrieve documents and present them as groups • For example, the keyword “cricket” may retrieve two clusters: • Documents related to the sport of cricket • Documents related to the cricket insect
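  A sketch of this idea with scikit-learn (an illustrative library choice; the slides name no tooling), clustering a few toy snippets with TF-IDF features and k-means:

      from sklearn.cluster import KMeans
      from sklearn.feature_extraction.text import TfidfVectorizer

      # Toy documents standing in for search results on the keyword "cricket"
      docs = [
          "The batsman scored a century in the cricket match.",
          "The bowler took five wickets in the cricket match.",
          "The cricket insect chirps by rubbing its wings together.",
          "The cricket insect is related to the grasshopper.",
      ]

      X = TfidfVectorizer(stop_words="english").fit_transform(docs)
      labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
      for label, doc in zip(labels, docs):
          print(label, doc)  # one cluster for the sport, one for the insect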

  34. References • http://davis.wpi.edu/~matt/courses/soms/ • http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
