The Clustering Problem


Presentation Transcript


  1. The Clustering Problem Yongsub Lim Applied Algorithm Laboratory KAIST

  2. Contents • The Clustering Problem • Basic Algorithms • K-Means • K-Clustering of Max. Spacing • Two-Phase Algorithms • Other Algorithms

  3. The Clustering Problem • Given data, the goal is to discover “meaningful” groups • Data in the same group are similar • Data in different groups are dissimilar

  4. Example of clustering

  5. Example of clustering

  6. Example of clustering

  7. Applications of Clustering • The image segmentation problem can be viewed as clustering the pixels of an image • In unsupervised learning, we group unlabeled training data by clustering before building a decision rule

  8. Applications of Clustering • In a network or a graph, we can group vertices so that the vertices within each group are highly connected • Clustering is also useful in biology, for example to classify genes

  9. Basic Algorithms • Two algorithms will be introduced • K-Means iteratively computes the centers of K clusters • K-Clustering of Max. Spacing uses a minimum spanning tree • The objective functions of these two algorithms are different

  10. K-Means • Choose the means of the K clusters randomly • At each iteration: • Assign every data point to the cluster whose mean is nearest among the K means • Re-compute the means of all clusters
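As a concrete reference for the steps on slide 10, here is a minimal K-Means sketch in Python/NumPy (an illustration, not the presenter's code); `data` is assumed to be an (n, d) array and `k` the number of clusters.

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial means at random from the data.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign every point to the nearest of the K means.
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each cluster's mean (keep the old mean if a cluster is empty).
        new_means = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return labels, means
```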

  11. K-Means • The objective is to minimize the sum of distances between each cluster center and its members • It produces clusters that are dense around their centers
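For reference, the objective described on slide 11 can be written as follows; the slide says "sum of distances", while the standard K-Means objective uses squared Euclidean distances, so the squared form below is an assumption.

```latex
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```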

  12. K-Means Algorithm • Worst case: with the two initial centers chosen randomly, the result may not be what we want

  13. K-Clustering of Max. Spacing • Given data, find the K clusters that maximize the minimum distance between all pairs of clusters • Spacing: the minimum distance between any pair of data points in different clusters

  14. K-Clustering of Max. Spacing

  15. K-Clustering of Max. Spacing • Treat the given data as a complete graph whose edge weights are Euclidean distances • Compute an MST • Delete the K-1 most expensive edges of the MST
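A minimal sketch of this procedure (not the presenter's code): running Kruskal's algorithm on the complete Euclidean graph and stopping when K components remain is equivalent to building the MST and deleting its K-1 most expensive edges; `points` is assumed to be a list of coordinate tuples.

```python
import math
from itertools import combinations

def max_spacing_clusters(points, k):
    parent = list(range(len(points)))

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean edges of the complete graph, cheapest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    clusters = len(points)
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            if clusters == k:          # next merge would drop below K clusters:
                # d is the spacing: the cheapest edge between two different clusters.
                return [find(x) for x in range(len(points))], d
            parent[ri] = rj            # Kruskal merge
            clusters -= 1
    return [find(x) for x in range(len(points))], None
```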

  16. K-Clustering of Max. Spacing • Claim: spacing of Copt ≤ spacing of Calg, so the clustering Calg produced by this algorithm achieves the maximum spacing

  17. K-Clustering of Max. Spacing • It involves no randomness • Its objective seems better, or at least more reasonable, than that of K-Means

  18. K-Means vs. Max. Spacing • A good clustering has • High density within each cluster (K-Means) • Large distance between clusters (Max. Spacing)

  19. K-Means vs. Max. Spacing

  20. Two-Phase Algorithms • Two algorithms will be introduced • In the first phase, both cluster the data without any restriction on K • In the second phase, if the number of clusters is larger than K, clusters are merged using Max. Spacing

  21. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  22. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms) • HEMST removes all MST edges with weight greater than a threshold (mean + standard deviation of the edge weights) • If the number of resulting clusters is less than the given K, it proceeds as Max. Spacing does • If not, it runs Max. Spacing on a reduced data set consisting of, for each cluster, the point nearest to the cluster's center
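A minimal sketch of the edge-removal step described above, under the assumption that the MST is given as a list of (weight, i, j) tuples; whether the population or sample standard deviation is meant is not stated on the slide, so `pstdev` here is an assumption.

```python
import statistics

def hemst_first_phase(mst_edges):
    # Threshold = mean + standard deviation of all MST edge weights.
    weights = [w for w, _, _ in mst_edges]
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    # Drop every edge heavier than the threshold; the connected components
    # of the surviving edges are the first-phase clusters.
    return [(w, i, j) for w, i, j in mst_edges if w <= threshold]
```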

  23. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  24. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  25. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  26. Hierarchical EMST (Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, Minimum Spanning Tree Based Clustering Algorithms)

  27. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection) • The first phase is similar to K-Means • The difference is that if a data point is far enough from all existing clusters, it becomes the center of a new cluster • While running, if the number of clusters exceeds a threshold, the two nearest clusters are merged • In the second phase, apply Max. Spacing
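A simplified illustration of the first-phase idea above (not the paper's procedure): the parameters `far_dist` and `max_clusters` are hypothetical, and merging the two nearest centers by averaging is an assumption.

```python
import numpy as np

def modified_kmeans_phase1(data, far_dist, max_clusters):
    centers = [data[0]]
    for x in data[1:]:
        dists = [np.linalg.norm(x - c) for c in centers]
        if min(dists) > far_dist:
            centers.append(x)                    # far from every cluster: start a new one
            if len(centers) > max_clusters:      # too many clusters: merge the nearest two
                i, j = min(
                    ((a, b) for a in range(len(centers)) for b in range(a + 1, len(centers))),
                    key=lambda p: np.linalg.norm(centers[p[0]] - centers[p[1]]),
                )
                merged = (centers[i] + centers[j]) / 2
                centers = [c for k, c in enumerate(centers) if k not in (i, j)]
                centers.append(merged)
    return np.array(centers)
```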

  28. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection) • This scheme can identify outliers by using Max. Spacing

  29. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection)

  30. Modified K-Means Process (M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering process for outliers detection)

  31. Two-Phase Algorithms • In the first phase, both algorithms give more weight to members of small sets • A small set is most likely data that truly belongs together in one cluster, so it is reasonable to decrease the distances between its members

  32. Other Algorithms (Erez Hartuv, Ron Shamir, A Clustering Algorithm Based on Graph Connectivity) • HCS uses the min-cut of a graph • It recursively separates the data into two disjoint subsets along a minimum cut until all clusters are highly connected • A graph is highly connected if the minimum number of edges whose removal disconnects the graph is greater than |V|/2
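A sketch of the recursive HCS structure, using networkx for edge connectivity and minimum cuts (the paper does not prescribe any library); returning single vertices as singleton clusters glosses over the paper's handling of small components.

```python
import networkx as nx

def hcs(G):
    parts = list(nx.connected_components(G))
    if len(parts) > 1:                           # disconnected: handle each piece separately
        return [c for p in parts for c in hcs(G.subgraph(p).copy())]
    n = G.number_of_nodes()
    if n <= 1 or nx.edge_connectivity(G) > n / 2:
        return [set(G.nodes)]                    # highly connected: keep as one cluster
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))  # split along a minimum edge cut
    return hcs(H)
```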

  33. Other Algorithms (Ana L.N. Fred, Anil K. Jain, Data Clustering Using Evidence Accumulation) • Voting • Apply K-Means N times • If a pair of data points is assigned to the same cluster more than a threshold t times, the two points are grouped together
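A sketch of the voting idea under stated assumptions: `kmeans` is any routine with the signature of the earlier K-Means sketch, the N runs fill a co-association (vote) matrix, and pairs that co-occur more than t times are linked with a union-find.

```python
import numpy as np

def evidence_accumulation(data, k, n_runs, t, kmeans):
    n = len(data)
    votes = np.zeros((n, n), dtype=int)
    for run in range(n_runs):
        labels, _ = kmeans(data, k, seed=run)       # e.g. the K-Means sketch from slide 10
        votes += labels[:, None] == labels[None, :]  # count pairs in the same cluster

    parent = list(range(n))                          # union-find over points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair that voted together more than t times.
    for i in range(n):
        for j in range(i + 1, n):
            if votes[i, j] > t:
                parent[find(i)] = find(j)
    return np.array([find(i) for i in range(n)])
```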

  34. Thanks
