CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

Learn about the revolutionary CHAMELEON hierarchical clustering algorithm that overcomes limitations of existing methods. Discover its dynamic modeling approach to capture diverse shapes, densities, and sizes in data clusters.

nharrington

Presentation Transcript


  1. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

  2. Outline • Motivation • Objective • Research restriction • Literature review • An overview of related clustering algorithms • The limitations of clustering algorithms • CHAMELEON • Concluding remarks • Personal opinion

  3. Motivation • Existing clustering algorithms can break down when: • The choice of parameters is incorrect • The model is not adequate to capture the characteristics of clusters (diverse shapes, densities, and sizes)

  4. Objective • Presenting a novel hierarchical clustering algorithm, CHAMELEON • Facilitating discovery of natural and homogeneous clusters • Being applicable to all types of data

  5. Research Restriction • In this paper, the authors do not address scaling to large data sets that cannot fit in main memory

  6. Literature Review • Clustering • An overview of related clustering algorithms • The limitations of the recently proposed state of the art clustering algorithms

  7. Clustering • The intracluster similarity is maximized and the intercluster similarity is minimized [Jain and Dubes, 1988] • Serving as the foundation for data mining and analysis techniques

  8. Clustering (cont’d) • Applications • Purchasing patterns • Categorization of documents on the WWW [Boley, et al., 1999] • Grouping of genes and proteins that have similar functionality [Harris, et al., 1992] • Grouping of spatial locations prone to earthquakes [Byers and Adrian, 1998]

  9. An Overview of Related Clustering Algorithms • Partitional techniques • Hierarchical techniques

  10. Partitional Techniques • K-means [Jain and Dubes, 1988]

  11. Hierarchical Techniques • CURE [Guha, Rastogi and Shim, 1998] • ROCK [Guha, Rastogi and Shim, 1999]

  12. Limitations of Existing Hierarchical Schemes • CURE • Fails to take into account the special characteristics of individual clusters

  13. Limitations of Existing Hierarchical Schemes (cont’d) • ROCK • Irrespective of cluster densities and shapes

  14. CHAMELEON • Overview • Modeling the data • Modeling the cluster similarity • A two-phase clustering algorithm • Performance analysis • Experimental Results

  15. Overall Framework of CHAMELEON

  16. Modeling the Data • K-nearest neighbor graphs built from an original data set in 2D
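
A k-nearest neighbor graph of the kind this slide refers to can be sketched in a few lines. This is a minimal brute-force illustration, not the implementation used in the paper: the function name `knn_graph`, the sample points, and the edge weight `1 / (1 + distance)` are all assumptions chosen for the example.

```python
import math

def knn_graph(points, k):
    """Return a symmetric k-nearest neighbor graph as {i: {j: weight}}.

    An edge (i, j) exists if j is among the k nearest neighbors of i,
    or i is among the k nearest neighbors of j. The weight
    1 / (1 + distance) is an illustrative similarity (an assumption).
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Brute force: distance from point i to every other point.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = 1.0 / (1.0 + d)
            graph[i][j] = w
            graph[j][i] = w  # keep the graph undirected
    return graph

# Two well-separated groups of three 2D points each (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
```

With `k=2`, each point links only to the other two points of its own group, so the graph splits into two components. That is the property the graph model relies on: edges are dense inside natural clusters and sparse between them.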

  17. Modeling the Cluster Similarity • Relative inter-connectivity
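
For reference, relative inter-connectivity between two clusters is usually written as follows in descriptions of CHAMELEON. The notation is an assumption about what the slide's figure showed: EC_{C_i, C_j} is the set of k-nearest neighbor graph edges connecting C_i to C_j, EC_{C_i} is the set of edges cut by a minimum bisection of C_i alone, and |·| denotes the sum of edge weights.

```latex
RI(C_i, C_j) = \frac{\left| EC_{\{C_i, C_j\}} \right|}
                    {\frac{1}{2}\left( \left| EC_{C_i} \right| + \left| EC_{C_j} \right| \right)}
```

The numerator measures how strongly the two clusters are connected to each other. The denominator normalizes that by how strongly each cluster is connected internally, so large, dense clusters are not favored automatically.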

  18. Modeling the Cluster Similarity (cont’d) • Relative closeness
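
Relative closeness is commonly written as below. Again the notation is an assumption about the slide's missing figure: \bar{S}_{EC_{\{C_i, C_j\}}} is the average weight of the edges connecting C_i to C_j, \bar{S}_{EC_{C_i}} is the average weight of the edges cut by a minimum bisection of C_i, and |C_i| is the number of points in C_i.

```latex
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}
                    {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}}
                     + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
```

The denominator is a size-weighted average of the two clusters' internal closeness, so the larger cluster contributes more to the normalization.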

  19. A Two-phase Clustering Algorithm • Phase I: Finding initial sub-clusters

  20. A Two-phase Clustering Algorithm (cont’d) • Phase I: Finding initial sub-clusters • Multilevel paradigm [Karypis & Kumar, 1999] • hMETIS [Karypis & Kumar, 1999]

  21. A Two-phase Clustering Algorithm (cont’d) • Phase II: Merging sub-clusters using a dynamic framework • T_RI, T_RC: user-specified thresholds
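
The threshold-based merge decision can be sketched as below. This is a hedged illustration only: the function name `pick_merge_partner`, the input layout, and the sample values are hypothetical, and breaking ties by the highest relative inter-connectivity is an assumption made for this sketch rather than a rule taken from the slides.

```python
def pick_merge_partner(candidates, t_ri, t_rc):
    """Pick a neighboring cluster to merge with, or None if none qualifies.

    `candidates` maps a neighboring cluster id to its (RI, RC) pair,
    measured against the cluster under consideration. A neighbor
    qualifies only if it meets BOTH user-specified thresholds.
    Tie-breaking by highest RI is an assumption of this sketch.
    """
    qualifying = {
        cid: (ri, rc)
        for cid, (ri, rc) in candidates.items()
        if ri >= t_ri and rc >= t_rc
    }
    if not qualifying:
        return None
    return max(qualifying, key=lambda cid: qualifying[cid][0])

# Hypothetical (RI, RC) values for three neighbors of one cluster.
candidates = {"B": (0.9, 0.4), "C": (0.7, 0.8), "D": (0.8, 0.9)}
partner = pick_merge_partner(candidates, t_ri=0.6, t_rc=0.5)
```

Here neighbor `B` is rejected despite its high inter-connectivity because its closeness falls below the threshold, which is the point of requiring both conditions at once.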

  22. A Two-phase Clustering Algorithm (cont’d) • Phase II: Merging sub-clusters using a dynamic framework
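
The CHAMELEON paper also describes a merging scheme that avoids thresholds by combining relative inter-connectivity (RI) and relative closeness (RC) into a single score. Whether this slide showed that scheme is an assumption, since the slide's figure is missing. Here α is a user-specified parameter.

```latex
\text{merge the pair } (C_i, C_j) \text{ that maximizes }
RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}
```

Choosing α > 1 gives more weight to relative closeness, and α < 1 gives more weight to relative inter-connectivity.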

  23. Performance Analysis • The amount of time required to compute • K-nearest neighbor graph • Two-phase clustering

  24. Performance Analysis (cont’d) • The amount of time required to compute • K-nearest neighbor graph • Low-dimensional data sets = O(n log n) • High-dimensional data sets = O(n^2)

  25. Performance Analysis (cont’d) • The amount of time required to compute • Two-phase clustering • Computing internal inter-connectivity and closeness for each cluster: O(nm) • Selecting the most similar pair of clusters: O(n log n + m^2 log m) • Total time = O(nm + n log n + m^2 log m)
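
To get a feel for which term dominates the total, the three terms can be evaluated for sample sizes. This sketch assumes n is the number of data items and m is the number of initial sub-clusters, which the slide does not state. It ignores constant factors, and the values of n and m are hypothetical.

```python
import math

def phase_two_cost_terms(n, m):
    """Return the three terms of O(nm + n log n + m^2 log m), constants ignored."""
    return {
        "nm": n * m,
        "n log n": n * math.log2(n),
        "m^2 log m": m * m * math.log2(m),
    }

# Hypothetical sizes: 10,000 data items and 100 initial sub-clusters.
terms = phase_two_cost_terms(n=10_000, m=100)
dominant = max(terms, key=terms.get)
```

For these sizes the `nm` term is the largest, so the cost of computing per-cluster inter-connectivity and closeness outweighs the cost of selecting pairs to merge.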

  26. Experimental Results • Programs • DBSCAN: a publicly available version • CURE: a locally implemented version • Data sets • Qualitative comparison

  27. Data Sets • Five clusters: different size, shape, and density; noise points • Two clusters: close to each other; different regions with different densities • Six clusters: different size, shape, and orientation; random noise points; special artifacts • Eight clusters: different size, shape, density, and orientation; random noise points • Eight clusters: different size, shape, and orientation; random noise and special artifacts

  28. Concluding Remarks • CHAMELEON can discover natural clusters of different shapes and sizes • It is possible to use other algorithms instead of the k-nearest neighbor graph • Different domains may require different models for capturing closeness and inter-connectivity

  29. Personal Opinion • Without further work
