
Clustering Categorical Data

Clustering Categorical Data. Pasi Fränti. 18.2.2016. K-means clustering. Definitions and data. Set of N data points: X = {x1, x2, …, xN}. Partition of the data: P = {p1, p2, …, pM}. Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}.


Presentation Transcript


  1. Clustering Categorical Data Pasi Fränti 18.2.2016

  2. K-means clustering

  3. Definitions and data. Set of N data points: X = {x1, x2, …, xN}. Partition of the data: P = {p1, p2, …, pM}. Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}.

  4. Distance and cost function. Euclidean distance of data vectors; mean square error. (Formulas not preserved in this transcript.)
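Assuming the standard definitions and the notation of slide 3, the two quantities named on this slide are usually written as below. The symbol K (number of attributes) and the reading of each p_j as the set of points assigned to centroid c_j are assumptions, not taken from the slide.

```latex
d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^2}
\qquad
\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{j=1}^{M} \sum_{x_i \in p_j} \lVert x_i - c_j \rVert^2
```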

  5. Clustering result as partition. Panels: partition of data; cluster prototypes. Captions: illustrated by Voronoi diagram; illustrated by convex hulls.

  6. Duality of partition and centroids. Partition of data: partition by nearest-prototype mapping. Cluster prototypes: centroids as prototypes.
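The duality on this slide can be sketched as two functions, one deriving the partition from the prototypes and one deriving the prototypes from the partition. This is a minimal illustration in plain Python, not the presenter's code; the function names, the sample points, and the choice to keep an old centroid when its cluster is empty are all assumptions.

```python
def nearest_prototype_partition(points, centroids):
    """Map each point to the index of its nearest centroid (squared Euclidean distance)."""
    labels = []
    for x in points:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def centroids_from_partition(points, labels, old_centroids):
    """Recompute each centroid as the mean of its cluster.

    Assumption: an empty cluster keeps its previous centroid.
    """
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [x for x, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)
            continue
        dim = len(members[0])
        new_centroids.append(
            tuple(sum(m[k] for m in members) / len(members) for k in range(dim))
        )
    return new_centroids


# Hypothetical sample data: two well-separated groups in 2-d.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(1.0, 1.0), (9.0, 1.0)]
labels = nearest_prototype_partition(points, centroids)
centroids = centroids_from_partition(points, labels, centroids)
```

Alternating the two functions until the labels stop changing gives the usual k-means iteration.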

  7. Categorical data

  8. Categorical clustering. Three attributes.

  9. Categorical clustering. Sample 2-d data: color and shape. Model A, Model B, Model C.

  10. Hamming distance (binary and categorical data). • Number of different attribute values. • Distance of (1011101) and (1001001) is 2. • Distance of (2143896) and (2233796) is 3. • Distance between (toned) and (roses) is 3. Figure (3-bit binary cube): 100->011 has distance 3 (red path); 010->111 has distance 2 (blue path).
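The examples on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming` is an assumption, and the inputs are treated as equal-length sequences of attribute values.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("1011101", "1001001"))  # 2
print(hamming("toned", "roses"))      # 3
print(hamming("100", "011"))          # 3
print(hamming("010", "111"))          # 2
```

The same function works for categorical vectors given as tuples, since it only compares attribute values for equality.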

  11. K-means variants. Histogram-based methods. Methods: • k-modes • k-medoids • k-distributions • k-histograms • k-populations • k-representatives

  12. Entropy-based cost functions. Category utility; entropy of data set; entropies of the clusters relative to the data. (Formulas not preserved in this transcript.)
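One common way to compute the entropies named on this slide is sketched below. It assumes that the entropy of a categorical data set is taken as the sum of the Shannon entropies of its attributes (an attribute-independence assumption), and that cluster entropies are weighted by cluster size. The slide's own formulas are not available, so the definitions and all function names here are assumptions.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def dataset_entropy(rows):
    """Sum of per-attribute entropies (assumes attributes are treated independently)."""
    return sum(attribute_entropy(col) for col in zip(*rows))


def expected_cluster_entropy(clusters):
    """Size-weighted average of the cluster entropies, relative to the whole data set."""
    total = sum(len(rows) for rows in clusters)
    return sum(len(rows) / total * dataset_entropy(rows) for rows in clusters if rows)


# Hypothetical data: two attributes (color, shape).
rows = [("red", "circle"), ("red", "square"), ("blue", "circle"), ("blue", "square")]
by_color = [rows[:2], rows[2:]]
print(dataset_entropy(rows))               # 2.0
print(expected_cluster_entropy(by_color))  # 1.0
```

Splitting by color removes the uncertainty in the color attribute, so the expected entropy drops from 2 bits to 1 bit in this example.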

  13. Iterative algorithms

  14. K-modes clustering: distance function

  15. K-modes clustering: prototype of cluster
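The two k-modes ingredients named on slides 14 and 15 can be sketched as follows, assuming the simple matching (Hamming) dissimilarity and a prototype built from the most frequent value of each attribute. The function names and the sample cluster are hypothetical, and ties between equally frequent values are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def matching_distance(x, mode):
    """Number of attributes on which a data vector and a mode disagree."""
    return sum(a != b for a, b in zip(x, mode))


def cluster_mode(rows):
    """Prototype of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# Hypothetical cluster with three attributes (color, shape, size).
cluster = [
    ("red", "circle", "small"),
    ("red", "square", "small"),
    ("blue", "circle", "small"),
]
mode = cluster_mode(cluster)
print(mode)                                                # ('red', 'circle', 'small')
print(matching_distance(("blue", "square", "small"), mode))  # 2
```

Note that the mode need not be one of the data vectors in the cluster, although it happens to be here.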

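  16. K-medoids clustering: prototype of cluster. Vector with minimal total distance to every other vector. Example with vectors ACE, BCF, BDG: pairwise distances are ACE-BCF 2, BCF-BDG 2, ACE-BDG 3. Total distances: ACE 2+3=5, BCF 2+2=4, BDG 2+3=5. Medoid: BCF.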
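The medoid selection on this slide can be reproduced with a short sketch, assuming Hamming distance between the vectors. The function names are assumptions; the three vectors are the ones shown on the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(candidate, rows):
    """Sum of distances from one vector to every vector in the cluster."""
    return sum(hamming(candidate, other) for other in rows)


def medoid(rows):
    """Cluster member with the minimal total distance to all other members."""
    return min(rows, key=lambda r: total_distance(r, rows))


rows = ["ACE", "BCF", "BDG"]
print([total_distance(r, rows) for r in rows])  # [5, 4, 5]
print(medoid(rows))                             # BCF
```

Unlike the mode, the medoid is always an actual member of the cluster, at the cost of a quadratic number of distance computations per cluster.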

  17. K-medoids: example

  18. K-medoids: calculation

  19. K-histograms. Example histogram: D 2/3, F 1/3.
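A histogram prototype like the one on this slide can be sketched as one relative-frequency table per attribute. The distance used below (one minus the frequency of the point's value, summed over attributes) is an assumption about how a point is compared with a histogram, and the single-attribute cluster with values D, D, F is hypothetical data chosen to reproduce the fractions on the slide.

```python
from collections import Counter
from fractions import Fraction


def cluster_histograms(rows):
    """One histogram per attribute: value -> relative frequency in the cluster."""
    n = len(rows)
    return [
        {value: Fraction(count, n) for value, count in Counter(col).items()}
        for col in zip(*rows)
    ]


def histogram_distance(x, histograms):
    """Assumed dissimilarity: sum over attributes of (1 - frequency of x's value)."""
    return sum(1 - h.get(value, Fraction(0)) for value, h in zip(x, histograms))


# Hypothetical cluster with a single attribute taking values D, D, F.
rows = [("D",), ("D",), ("F",)]
histograms = cluster_histograms(rows)
print(histograms[0])                          # D -> 2/3, F -> 1/3
print(histogram_distance(("D",), histograms))  # 1/3
```

With this distance, a value that never occurs in the cluster contributes the maximum of 1, and a value shared by every member contributes 0.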

  20. K-distributions: cost function with ε addition

  21. Example of cluster allocation: change of entropy

  22. Problem of non-convergence

  23. Results with Census dataset

  24. Literature
      • Modified k-modes + k-histograms: M. Ng, M. J. Li, J. Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
      • ACE: K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
      • ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 200x.
      • K-medoids: L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
      • K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
      • K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A New Algorithm for Clustering Categorical Data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
      • K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.
