
A genetic approach to the automatic clustering problem

Authors: Lin Yu Tseng, Shiueng Bien Yang. Graduate: Chien-Ming Hsiao.


Presentation Transcript


  1. A genetic approach to the automatic clustering problem • Authors: Lin Yu Tseng, Shiueng Bien Yang • Graduate: Chien-Ming Hsiao

  2. Outline • Motivation • Objective • Introduction • The basic concept of the genetic strategy • The genetic clustering algorithm • The heuristic to find a good clustering • Conclusion • Personal Opinion

  3. Motivation • Some clustering algorithms require the user to provide the number of clusters as input • It is not easy for the user to guess how many clusters there should be • In general, the user has no idea about the number of clusters • With a poor guess, the clustering result may be poor • This is especially true when the number of clusters is large and hard to guess

  4. Objective • Propose a genetic clustering algorithm that • Automatically searches for a proper number of clusters • Classifies the objects into these clusters

  5. Introduction • Clustering methods • Hierarchical: the agglomerative methods and the divisive methods • Non-hierarchical: the K-means algorithm • K-means is an iterative hill-climbing algorithm • The solution obtained depends on the initial clustering

  6. The basic concept of the genetic strategy

  7. The genetic clustering algorithm • The algorithm CLUSTERING consists of two stages • Stage 1: the nearest-neighbor algorithm • Groups those data that are close to one another • Reduces the size of the data to a moderate one that is suitable for the genetic clustering algorithm • Stage 2: the genetic clustering algorithm • Groups the small clusters into larger clusters • A heuristic strategy is then used to find a good clustering

  8. The nearest-neighbor algorithm • The distance • Based on the average of the nearest-neighbor distances • Steps • For each object Oi, find the distance between Oi and its nearest neighbor

  9. The nearest-neighbor algorithm • Steps • Compute d_av, the average of the nearest-neighbor distances found in step 1 • View the n objects as nodes of a graph and compute the n × n adjacency matrix A

  10. The nearest-neighbor algorithm • Steps • Find the connected components of this graph • Let the data sets represented by these connected components be denoted by B1, B2, …, Bm • Let the center of each set Bi be denoted by Vi, 1 ≤ i ≤ m
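The first stage described on slides 8 to 10 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The transcript does not give the rule for filling the adjacency matrix, so the sketch assumes two objects are adjacent when their distance is at most `u * d_av`, where `u` is a hypothetical multiplier introduced here. The function name and the use of the component mean as the center Vi are also assumptions.

```python
import math

def nearest_neighbor_stage(points, u=1.0):
    """Group points into connected components B_1..B_m.

    Returns (components, centers): components is a list of sorted index
    lists, centers[i] is the mean of the points in components[i].
    Requires at least two points.
    """
    n = len(points)
    # For each object, the distance to its nearest neighbor.
    nn = [min(math.dist(points[i], points[j]) for j in range(n) if j != i)
          for i in range(n)]
    # d_av: the average of the nearest-neighbor distances.
    d_av = sum(nn) / n
    # ASSUMPTION: objects i and j are adjacent when dist <= u * d_av.
    threshold = u * d_av
    adj = [[j for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= threshold]
           for i in range(n)]
    # Connected components via an iterative depth-first search.
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in adj[i]:
                if not seen[j]:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(comp))
    # ASSUMPTION: the center V_i is the coordinate-wise mean of B_i.
    dim = len(points[0])
    centers = [tuple(sum(points[i][k] for i in comp) / len(comp)
                     for k in range(dim))
               for comp in components]
    return components, centers
```

With two well-separated groups of three points each, the sketch returns two components, so the second stage would work on m = 2 sets instead of n = 6 objects.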

  11. The genetic algorithm • Initialization step • Iterative generations • Reproduction phase • Crossover phase • Mutation phase

  12. The genetic algorithm • Initialization step • A population of N strings is randomly generated • The length of each string is m, the number of sets obtained in the first stage • Each string represents a subset of {B1, B2, …, Bm} • If Bi is in this subset, the ith position of the string is 1; otherwise, it is 0 • Each Bi in the subset is used as a seed to generate a cluster
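The string encoding from the initialization step might look like the sketch below. The function names are hypothetical, and drawing each bit uniformly at random is an assumption, since the slide only says the strings are "randomly generated". Indices are 0-based here, whereas the slides count from 1.

```python
import random

def random_population(pop_size, m, rng=random):
    """Generate pop_size random binary strings, each of length m."""
    return [[rng.randint(0, 1) for _ in range(m)] for _ in range(pop_size)]

def decode(string):
    """Return the 0-based indices i whose bit is 1, i.e. the seeds B_i."""
    return [i for i, bit in enumerate(string) if bit == 1]
```

For example, `decode([1, 0, 0, 1, 0])` selects the first and fourth sets as seeds. An all-zero string selects no seeds at all, a case this sketch leaves to the caller.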

  13. The genetic algorithm

  14. The genetic algorithm • How to generate a set of clusters from the seeds • Let T = {T1, T2, …, Ts} be the subset corresponding to a string • The initial clusters Ci are the sets Ti, and the initial cluster centers Si are the centers of the Ti, for i = 1, 2, …, s • The size of cluster Ci is |Ci| = |Ti| for i = 1, 2, …, s, where |Ti| denotes the number of objects belonging to Ti

  15. The genetic algorithm • The Bi's in {B1, B2, …, Bm} − T are taken one by one, and the distance between the center Vi of the taken Bi and the center Sj of each cluster Cj is calculated • If Bi is classified into cluster Cj, the center Sj and the size of cluster Cj are recomputed
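A rough sketch of how clusters could be grown from the seeds on slides 14 and 15 is given below. Two points are assumptions, since the transcript does not state them: each remaining Bi joins the cluster whose center Sj is nearest, and the recomputed center is the size-weighted mean of the old center and Vi. The function name and return layout are hypothetical.

```python
import math

def grow_clusters(centers, sizes, string):
    """Grow clusters from the seeds selected by a binary string.

    centers[i] is V_i, sizes[i] is |B_i|, string[i] is 1 if B_i is a seed.
    Returns (assignment, cluster_centers, cluster_sizes), where
    assignment[i] is the index of the cluster that B_i ends up in.
    """
    seeds = [i for i, bit in enumerate(string) if bit == 1]
    if not seeds:
        raise ValueError("the string must select at least one seed")
    # Initial clusters are the seeds; initial centers and sizes follow.
    S = [list(centers[i]) for i in seeds]
    size = [sizes[i] for i in seeds]
    assignment = {i: j for j, i in enumerate(seeds)}
    # Take the remaining B_i one by one.
    for i, bit in enumerate(string):
        if bit == 1:
            continue
        # ASSUMPTION: B_i is classified into the cluster with nearest S_j.
        j = min(range(len(S)), key=lambda c: math.dist(centers[i], S[c]))
        # ASSUMPTION: the new center is the size-weighted mean.
        total = size[j] + sizes[i]
        S[j] = [(s * size[j] + v * sizes[i]) / total
                for s, v in zip(S[j], centers[i])]
        size[j] = total
        assignment[i] = j
    return assignment, S, size
```

Because the centers are updated after every assignment, the result can depend on the order in which the Bi are taken. The sketch uses index order.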

  16. The genetic algorithm • Reproduction phase • The intra-distance within the cluster Ci • The inter-distance between the cluster Ci and the set of all other clusters • The fitness function of a string R

  17. The genetic algorithm • Crossover phase • Two random numbers p and q in [1, m] are generated to decide which pieces of the strings are to be interchanged • The crossover operator is applied with probability pc • Mutation phase • Each chosen bit is changed from 0 to 1 or from 1 to 0
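The crossover and mutation operators can be sketched as follows. The slide does not say exactly which pieces are interchanged, so this sketch assumes the segment between positions p and q (inclusive, 1-based) is swapped between the two parents. It also assumes each bit is chosen for mutation independently with probability `pm`, a name introduced here. The function names are hypothetical.

```python
import random

def crossover(a, b, pc, rng=random):
    """Two-point crossover of bit lists a and b, applied with probability pc."""
    if rng.random() >= pc:
        return a[:], b[:]          # no crossover: return copies of the parents
    m = len(a)
    # Two random positions p and q in [1, m], ordered so that p <= q.
    p, q = sorted((rng.randint(1, m), rng.randint(1, m)))
    lo, hi = p - 1, q              # 0-based slice covering positions p..q
    # ASSUMPTION: the segment p..q is the piece that gets interchanged.
    child_a = a[:lo] + b[lo:hi] + a[hi:]
    child_b = b[:lo] + a[lo:hi] + b[hi:]
    return child_a, child_b

def mutate(s, pm, rng=random):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in s]
```

With the settings reported in the experiments slide, these would be called with `pc = 0.8` and `pm = 0.05`.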

  18. The heuristic strategy to find a good clustering • D1(w) estimates the closeness of the clusters in the clustering • D2(w) estimates the compactness of the clusters in the clustering

  19. The heuristic strategy to find a good clustering • The values of w are chosen from [w1, w2] by a kind of binary search • The heuristic finds the greatest jump in the values of D1(w) and the greatest jump in the values of D2(w) • Based on these jumps, it then decides which clustering is a good one

  20. Experiments • The population size is 50 • The crossover rate is 80 % • The mutation rate is 5 % • [w1, w2] = [1, 3] • w1 is the smallest value, w2 is the largest value • Three sets of data were used

  21. Fig. (a) • The first set of data • Consists of three groups of points on the plane • The densities of the three groups are not the same • Fig. (b), (c) • K-means algorithm • Fig. (d) • Complete-link method • Fig. (e) • Single-link method

  22. Fig. (a) • The original data set with five groups of points • Fig. (b), (c) and (d) • K-means algorithm • Fig. (e) • By CLUSTERING, complete-link, single-link and K-means

  23. Conclusion and Personal Opinion • The experimental results show that CLUSTERING is effective • It can automatically search for a proper number of clusters
