Non-parametric Methods for Clustering Continuous and Categorical Data

Presentation Transcript


  1. Non-parametric Methods for Clustering Continuous and Categorical Data Steven X. Wang Dept. of Math. and Stat. York University May 13, 2010.

  2. What is Clustering • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes

  3. Good Clustering • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
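As a hedged aside (not part of the original slides), intra- and inter-cluster similarity can be summarized numerically with the silhouette coefficient. The sketch below assumes NumPy and scikit-learn are available and uses a synthetic two-cluster data set.

```python
# Hedged sketch, not from the talk: quantifying "high intra-class similarity,
# low inter-class similarity" with the silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs: compact clusters that are far apart,
# so the silhouette score should be close to 1.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette score:", silhouette_score(X, labels))
```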

  4. Part I: Clustering Continuous Data • There are many algorithms: • K-means • PAM • Hierarchical clustering • Model-based clustering • …
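For concreteness, a minimal sketch (assuming scikit-learn) of running three of the listed methods on the same toy data; PAM (k-medoids) is omitted because it is not part of core scikit-learn, though add-on packages provide it.

```python
# Hedged sketch: K-means, hierarchical (Ward) clustering, and model-based
# clustering via a Gaussian mixture, applied to the same synthetic data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (60, 2)), rng.normal(4.0, 0.5, (60, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
mb_labels = GaussianMixture(n_components=2, random_state=1).fit_predict(X)

print(km_labels[:5], hc_labels[:5], mb_labels[:5])
```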

  5. Why non-parametric clustering? Our goal is to develop algorithms that • do not assume a functional form, • do not depend on tuning parameters, • depend less on the choice of distance function, • achieve better accuracy.

  6. Particle Movement

  7. Gravitational Field

  8. Shrinking Method 1 • Convergence can be guaranteed if only one point is moved at a time while the remaining data points are held fixed. • This can be proved by using the previous theorem together with the Fixed Point Theorem. • We also derived an optimization algorithm based on this idea.

  9. Shrinking Method 2 • Updating one point at a time with the K-nearest-neighbor method is slow. • To achieve faster convergence, we propose to update all data points simultaneously at each iteration. • This is less accurate but faster. • It corresponds to the stochastic process shown on the original slide.

  10. CLUES Algorithm • Given K = 2j + 1, j = 0, 1, 2, … • For each data point, perform the shrinking procedure: replace it with the median of its K nearest neighbors. • Repeat the procedure until convergence. • Detect cluster centers and partition the data. Since the result depends on K, we need to search for the K giving the optimal result.
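A minimal sketch of the shrinking step as read from this slide: each point is replaced by the coordinate-wise median of its K nearest neighbors, and the update is iterated until the configuration stops moving. This is an illustration only, not the authors' reference implementation; the cluster-detection step and the search over K are omitted.

```python
import numpy as np

def shrink(X, K, max_iter=50, tol=1e-6):
    """Illustrative CLUES-style shrinking: replace every point by the
    coordinate-wise median of its K nearest neighbors (the point itself
    counts as one of its neighbors here) until the points stop moving."""
    Y = X.astype(float).copy()
    for _ in range(max_iter):
        D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances
        idx = np.argsort(D, axis=1)[:, :K]          # K nearest neighbors of each point
        Y_new = np.median(Y[idx], axis=1)           # coordinate-wise medians
        if np.max(np.abs(Y_new - Y)) < tol:         # converged: configuration is fixed
            return Y_new
        Y = Y_new
    return Y
```

After shrinking, points from the same cluster collapse toward a common location, so the centers and the partition can be read off the converged configuration; as the slide notes, the outcome depends on K.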

  11. Convergence If one point is moved at a time while the other points are kept fixed, convergence can be proved by the Fixed Point Theorem. If all points move at the same time, the procedure corresponds to a stochastic process, an anti-diffusion process.

  12. Partition by PAM

  13. Partition by CLUES

  14. Relationship with Similar Algorithms • Mean Shift algorithm • Data Sharpening by Heckman and Hall • Gravitational clustering What is our advantage? • The K-nearest-neighbor approach is highly adaptive. • Our algorithm is parameter-free.

  15. References • Wang, X., Qiu, W. and Zamar, R. (2006). An Iterative Non-parametric Clustering Algorithm Based on Local Shrinking. Computational Statistics and Data Analysis. • Wang, X., Liang, D., Feng, X. and Ye, L. (2007). A Derivative Free Optimization Algorithm Based on Conditional Moments. Journal of Mathematical Analysis and Applications, Vol. 331, No. 2, 1337-1360.

  16. Clustering Categorical Data • Review of current literature • Hamming Distance and CD vector • Modified Chi-square test • Description of our Algorithm • Numerical Results • Conclusion and Discussions

  17. Review of existing algorithms: K-modes • This algorithm builds on the idea of the K-means algorithm. • It requires the number of clusters to be specified. • The partition is sensitive to the input order. • Computational complexity: O(n).
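For reference, a bare-bones sketch of the K-modes idea (Hamming-distance assignment to modes, column-wise most frequent category as the cluster center). It is an illustration of the general scheme with a fixed, user-supplied number of clusters, not a faithful reproduction of any published implementation.

```python
import numpy as np

def k_modes(X, k, max_iter=20, seed=0):
    """Toy K-modes sketch. X is an n x p array of categorical codes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    modes = X[rng.choice(n, size=k, replace=False)].copy()       # initial modes
    labels = np.full(n, -1)
    for _ in range(max_iter):
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)  # Hamming distances
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # no reassignments: stop
            break
        labels = new_labels
        for j in range(k):                        # recompute each mode
            members = X[labels == j]
            for c in range(p):
                if len(members):
                    vals, counts = np.unique(members[:, c], return_counts=True)
                    modes[j, c] = vals[counts.argmax()]
    return labels, modes
```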

  18. AutoClass Algorithm This algorithm can cluster both categorical and numeric data types. 1. It utilizes the EM algorithm. 2. It searches for the optimal number of clusters. 3. The EM algorithm is known to have slow convergence. 4. The computational complexity is O(n).

  19. Categorical Sample Space • Assume that the data set is stored in an n × p matrix, where n is the number of observations and p the number of categorical variables. • The sample space consists of all possible combinations generated by the p variables. • The sample space is discrete and has no natural origin.

  20. Hamming Distance and CD vector • Hamming distance counts the number of attributes in which two categorical observations differ. • Hamming distance has been used for clustering categorical data in algorithms similar to K-modes. • We construct the Categorical Distance (CD) vector to project the sample space into a 1-dimensional space.

  21. Example of a CD vector

  22. More on CD vector • The dense region of the CD vector is NOT necessarily a cluster! • The length of the CD vector is p. • We can construct many CD vectors for one data set by choosing different origins.
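A minimal sketch of how such a vector could be tabulated under this description: fix an origin, compute the Hamming distance from every observation to it, and count how many observations fall at each distance 1, …, p. This is one reading of the slides; the published definition (Zhang, Wang and Song, 2007) is authoritative.

```python
import numpy as np

def cd_vector(X, origin):
    """Illustrative CD vector: counts of observations at Hamming distances
    1..p from the chosen origin (ties at distance 0 are not counted)."""
    p = X.shape[1]
    d = (X != origin).sum(axis=1)                  # Hamming distance of each row
    return np.bincount(d, minlength=p + 1)[1:]     # counts for d = 1, ..., p
```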

  23. How to detect a cluster? • The CD vector shows some clustering patterns, but are they statistically significant? • Statistical hypothesis testing: Null hypothesis: uniformly distributed. Alternative: not uniformly distributed. • We call the expected CD vector under the null the Uniform CD vector (UCD).

  24. UCD: Expected CD vector under Null.
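A hedged sketch of how the expected counts under the uniform null could be computed: if all cells of the product sample space are equally likely, the event that an observation differs from the origin in variable j is Bernoulli with probability (m_j - 1)/m_j, where m_j is the number of levels of variable j, and the Hamming distance is the sum of these independent indicators. The UCD vector is then n times that distance distribution. This illustrates the idea only and need not match the paper's exact construction.

```python
import numpy as np

def ucd_vector(n, levels):
    """Expected counts at Hamming distances 1..p from a fixed origin when
    the categorical sample space is uniform. `levels` gives the number of
    categories m_j for each of the p variables."""
    pmf = np.array([1.0])                      # distance distribution over 0 variables
    for m in levels:
        q = (m - 1) / m                        # P(differ from the origin in this variable)
        pmf = np.convolve(pmf, [1.0 - q, q])   # add one more independent indicator
    return n * pmf[1:]                         # expected counts for d = 1, ..., p
```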

  25. CD vector vs. UCD vector

  26. How to compare these 2 vectors? • One is the observed CD vector. • The other is the expected CD vector under the null hypothesis. • Chi-square is the most natural tool for testing the null hypothesis based on these two vectors. • However, clustering patterns are local features, so we are not interested in a comparison at the global level.

  27. Modified Chi-square Test The modified Chi-square is defined as:
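The statistic itself appeared as a formula image on the slide and is not captured in this transcript. As a hedged sketch of the idea only (the exact definition is given in Zhang, Wang and Song, 2007), a chi-square comparison restricted to the local cells d = 1, …, C, with O_d the observed CD count and E_d the UCD count at distance d, would take the form:

```latex
% Hedged sketch only; the published modified statistic may differ.
\chi^2_{\mathrm{local}} \;=\; \sum_{d=1}^{C} \frac{(O_d - E_d)^2}{E_d}
```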

  28. Choice of C and Radius of a Cluster

  29. CD Algorithm • Find a cluster center; • Construct the CD vector given the current center; • Perform the modified Chi-square test; • If we reject the null, determine the radius of the current cluster; • Extract the cluster; • Repeat until we do not reject the null.
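Putting the pieces together, a schematic of this loop (illustrative only; `find_center`, `modified_chisq_test`, and `cluster_radius` are hypothetical placeholders for the center search, the modified chi-square test, and the radius rule defined in the paper):

```python
import numpy as np

def cd_clustering(X, find_center, modified_chisq_test, cluster_radius):
    """Schematic of the CD algorithm loop described on the slide; the three
    callables are placeholders (assumptions of this sketch), not the paper's
    actual definitions."""
    labels = np.full(len(X), -1)            # -1 = not yet assigned to a cluster
    remaining = np.arange(len(X))
    cluster_id = 0
    while len(remaining):
        center = find_center(X[remaining])               # 1. candidate cluster center
        d = (X[remaining] != center).sum(axis=1)         # 2. Hamming distances -> CD vector
        if not modified_chisq_test(d):                   # 3./6. stop when the null is not rejected
            break
        r = cluster_radius(d)                            # 4. radius of the current cluster
        in_cluster = d <= r
        labels[remaining[in_cluster]] = cluster_id       # 5. extract the cluster
        remaining = remaining[~in_cluster]               # the sample shrinks
        cluster_id += 1
    return labels
```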

  30. Numerical Comparison with K-modes and AutoClass

                        CD      AutoClass   K-modes
  No. of Clusters       4       4           [3]     [4]     [5]
  Classif. Rates        100%    100%        75%     84%     82%
    "Variations"        0%      0%          6%      15%     10%
  Inform. Gain          100%    100%        67%     84%     93%
    "Variations"        0%      0%          10%     15%     11%

  Soybean Data: n = 47 and p = 35. Number of clusters = 4.

  31. Numerical Comparison with K-modes and AutoClass

                        CD      AutoClass   K-modes
  No. of Clusters       7       3           [6]     [7]     [8]
  Classif. Rates        95%     73%         74%     72%     71%
    "Variations"        0%      0%          6%      15%     10%
  Inform. Gain          92%     60%         75%     79%     81%
    "Variations"        0%      0%          7%      6%      6%

  Zoo Data: n = 101 and p = 16. Number of clusters = 7.

  32. Run Times Comparison

                          K-modes   CD
  Soybean     Average     0.0653    0.0496
              S.D.        0.0029    0.0010
  Zoo Data    Average     0.0139    0.0022
              S.D.        0.0018    0.0001

  Note that AutoClass requires human intervention.

  33. Computational Complexity • The upper bound of the computational complexity of our algorithm is O(kpn). • Note that the sample size shrinks whenever the CD algorithm detects a cluster. • It is less computationally intensive than K-modes and AutoClass, since both have complexity O(akpn) with a > 1.

  34. Conclusion • Our algorithm requires no convergence criterion. • It automatically estimates the number of clusters; it does not demand or search for the true number of clusters. • The sample size is reduced after each detected cluster is extracted. • The computational complexity of our algorithm is bounded by O(n).

  35. Future Work • Clustering functional data. • Clustering mixed data, continuous and categorical data at the same time. • Clustering data with spatial and temporal structure. • Develop parallel computing algorithms for our methods.

  36. Reference: Zhang, P., Wang, X. and Song, P. (2007). Clustering Categorical Data Based on Distance Vectors. Journal of the American Statistical Association (JASA).

  37. Software Download • http://math.yorku.ca/~stevenw • Please go to the Software section to download the program. • We are also in the process of developing a C program for both methods.
