
The K-Means Method in Cluster Analysis and Its Intelligent Extension

B.G. Mirkin, Professor, Department of Data Analysis and Artificial Intelligence, NRU HSE, Moscow, Russia; Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK



  1. The K-Means Method in Cluster Analysis and Its Intelligent Extension. B.G. Mirkin, Professor, Department of Data Analysis and Artificial Intelligence, NRU HSE, Moscow, Russia; Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK

  2. Outline: • Clustering as empirical classification • K-Means and its issues: • (1) Determining K and initialization • (2) Weighting variables • Addressing (1): • Data recovery clustering and K-Means (Mirkin 1987, 1990) • One-by-one clustering: Anomalous Patterns and iK-Means • Other approaches • Computational experiment • Addressing (2): • Three-stage K-Means • Minkowski K-Means • Computational experiment • Conclusion

  3. WHAT IS CLUSTERING; WHAT IS DATA • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

  4. Referred recent work: • B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: An experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41 • B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260 • B.G. Mirkin, R. de Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075

  5. What is clustering? • Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis

  6. Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996). Pluto does not fit in the two clusters of planets: it originated another cluster (September 2006)

  7. Example: A Few Clusters. A clustering interface to web search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

  8. Clustering algorithms: • Nearest neighbour • Agglomerative clustering • Divisive clustering • Conceptual clustering • K-Means • Kohonen SOM • Spectral clustering • …

  9. Batch K-Means: a generic clustering method. Entities are represented as multidimensional points (*). 0. Put K hypothetical centroids (seeds). 1. Assign points to the centroids according to the minimum-distance rule. 2. Put centroids in the gravity centres of the clusters thus obtained. 3. Iterate 1 and 2 until convergence. [scatter diagram: points (*) and K = 3 hypothetical centroids (@)]

  10.-12. K-Means iterations (animation frames of the same steps): assign points to the centroids by the minimum-distance rule, move the centroids to the gravity centres of the clusters thus obtained, and iterate until convergence; on convergence, output the final centroids and clusters. [scatter diagrams of points (*) and moving centroids (@) omitted]
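The batch procedure of slides 9-12 can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming (the function `batch_kmeans` and the toy data are not from the slides):

```python
import numpy as np

def batch_kmeans(points, seeds, max_iter=100):
    """Batch K-Means: assign by the minimum-distance rule,
    then move centroids to the gravity centres; repeat until convergence."""
    centroids = np.asarray(seeds, dtype=float).copy()
    for _ in range(max_iter):
        # Step 1: minimum-distance rule (Euclidean)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: gravity centres of the obtained clusters
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # Step 3: convergence
            break
        centroids = new_centroids
    return centroids, labels
```

With two well-separated point clouds and one seed in each, the loop converges in two iterations.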

  13. K-Means criterion: the summary distance from entities to their cluster centroids, W(S, c) = Σk Σi∈Sk d(yi, ck), to be minimized over partitions S = {Sk} and centroids c = {ck}; here d is the squared Euclidean distance. [scatter diagram omitted]

  14. Advantages of K-Means: - Models typology building - Simple “data recovery” criterion - Computationally effective - Can be utilised incrementally, ‘on-line’. Shortcomings of K-Means: - Initialisation: no advice on K or on the initial centroids - No guarantee of a deep minimum: the alternating iterations stop at a local minimum - No defence against irrelevant features

  15. Initial Centroids: Correct (two-cluster case) [figure omitted]

  16. Initial Centroids: Correct (initial and final positions) [figure omitted]

  17. Different Initial Centroids [figure omitted]

  18. Different Initial Centroids: Wrong (initial and final positions) [figure omitted]

  19. (1) To address: • Number of clusters • Issue: the criterion decreases monotonically in K (WK < WK-1), so it cannot choose K by itself • Initial setting • Deeper minimum • The last two are interrelated: a good initial setting leads to a deeper minimum

  20. Number K: conventional approach • Take a range RK of K, say K = 3, 4, …, 15 • For each K ∈ RK, run K-Means 100-200 times from randomly chosen initial centroids and take the best result W(S, c) = WK • Compare WK for all K ∈ RK in a special way and choose the best, such as: • Gap statistic (2001) • Jump statistic (2003) • Hartigan (1975): in the ascending order of K, pick the first K at which HK = (WK / WK+1 - 1)(N - K - 1) ≤ 10
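Hartigan's rule from this slide can be sketched as follows, assuming the best WK over the random restarts has already been collected into a dict keyed by K (the helper name `choose_K_hartigan` is ours):

```python
def choose_K_hartigan(W, N, threshold=10):
    """Pick the first K (ascending) at which
    H_K = (W_K / W_{K+1} - 1) * (N - K - 1) <= threshold."""
    for K in sorted(W)[:-1]:
        H = (W[K] / W[K + 1] - 1) * (N - K - 1)
        if H <= threshold:
            return K
    return max(W)  # no K qualified: fall back to the largest tested K
```

For instance, if WK stops dropping sharply after K = 3, the rule returns 3.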

  21. Addressing (1): • Number of clusters • Initial setting • with a PCA-like method in the data recovery approach

  22. Representing a partition. Cluster k: centroid ckv (v: feature); binary 1/0 membership zik (i: entity)

  23. Basic equations (same as for PCA, but the score vectors zk are constrained to be binary): yiv = Σk ckv zik + eiv, where y is a data entry, z a 1/0 membership (not a score), c a cluster centroid, N the cardinality; i indexes entities, v features/categories, k clusters

  24. Quadratic data scatter decomposition (Pythagorean): Σi Σv yiv² = Σk Σv Nk ckv² + Σk Σi∈Sk Σv (yiv - ckv)². K-Means is alternating least-squares minimisation of the residual term. Here y is a data entry, z a 1/0 membership, c a cluster centroid, Nk the cluster cardinality; i indexes entities, v features/categories, k clusters

  25. Equivalent criteria (1). A. Bilinear residuals squared: Σi Σv (yiv - Σk ckv zik)² → MIN (minimizing the difference between data and cluster structure). B. Distance-to-centre squared: Σk Σi∈Sk d(yi, ck) → MIN (minimizing the difference between data and cluster structure)

  26. Equivalent criteria (2). C. Within-group error squared: Σk Σi∈Sk Σv (yiv - ckv)² → MIN (minimizing the difference between data and cluster structure). D. Within-group variance weighted by cluster size: Σk Nk σk² → MIN (minimizing within-cluster variance)

  27. Equivalent criteria (3). E. Semi-averaged within-cluster distance squared: Σk Σi,j∈Sk d(yi, yj)/(2Nk) → MIN (minimizing dissimilarities within clusters). F. Semi-averaged within-cluster similarity: Σk Σi,j∈Sk ⟨yi, yj⟩/Nk → MAX (maximizing similarities within clusters)

  28. Equivalent criteria (4). G. Distant centroids: Σk Nk ⟨ck, ck⟩ → MAX (finding anomalous types). H. Consensus partition → MAX (maximizing the summary correlation between the sought partition and the given variables)

  29. Equivalent criteria (5). I. Spectral clusters: Σk zkᵀYYᵀzk/(zkᵀzk) → MAX over binary vectors zk (maximizing the summary Rayleigh quotient)
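One of these equivalences is easy to verify numerically: for a single cluster, the summary squared distance to the centroid (criteria B/C) equals the sum of within-cluster pairwise squared distances divided by twice the cluster size (the semi-averaged criterion E). A quick NumPy check on random data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(7, 3))        # one cluster: 7 points in 3-D
c = Y.mean(axis=0)                 # its centroid (gravity centre)

# B: summary squared distance of the points to the centroid
B = ((Y - c) ** 2).sum()

# E: semi-averaged sum of within-cluster pairwise squared distances
pairwise = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
E = pairwise.sum() / (2 * len(Y))

assert np.isclose(B, E)            # the two criteria coincide
```

The identity holds for every cluster separately, hence for the whole partition.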

  30. PCA-inspired Anomalous Pattern clustering: yiv = cv zi + eiv, where zi = 1 if i ∈ S and zi = 0 if i ∉ S. With the squared Euclidean distance, the centroid cS must be anomalous, that is, interesting

  31. Tom Sawyer: initial setting with an Anomalous Pattern cluster [illustration omitted]

  32. Tom Sawyer: Anomalous Pattern clusters 1, 2, 3 (iterate 0) [illustration omitted]

  33. iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?) [figure: final clustering]

  34. iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering: • Find all Anomalous Pattern clusters • Remove smaller (e.g., singleton) clusters • Put the number of remaining clusters as K and initialise K-Means with their centres

  35. Study of eight number-of-clusters methods (joint work with Mark Chiang): • Variance based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS) • Structure based: Silhouette Width (SW) • Consensus based: Consensus Distribution area (CD), Consensus Distribution mean (DD) • Sequential extraction of APs (iK-Means): Least Squares (LS), Least Moduli (LM)

  36. Experimental results at 9 Gaussian clusters (3 spread patterns), 1000 x 15 data size. The two winners are counted for each setting. [table: methods marked as 1-, 2- or 3-time winners omitted]

  37. Addressing (2): weighting features according to relevance • w: feature weights = scale factors • Three-step K-Means: • Given s, c, find w (weights) • Given w, c, find s (clusters) • Given s, w, find c (centroids) • iterate till convergence
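The "given s, c, find w" step can be illustrated for the squared-Euclidean case (Minkowski power β = 2), where a feature's weight comes out inversely proportional to its within-cluster dispersion. This is our sketch, not the author's code; weights are normalised to sum to 1:

```python
import numpy as np

def update_weights(Y, labels, centroids):
    """Feature-weight step for beta = 2: weight ~ 1 / within-cluster dispersion."""
    K, V = centroids.shape
    D = np.zeros(V)                       # per-feature within-cluster dispersion
    for k in range(K):
        D += ((Y[labels == k] - centroids[k]) ** 2).sum(axis=0)
    D = np.maximum(D, 1e-12)              # guard against zero dispersion
    return (1.0 / D) / (1.0 / D).sum()    # normalise to sum to 1
```

A feature that is tight within every cluster gets nearly all the weight; a noisy one is almost switched off.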

  38. Minkowski centres • For each feature v, minimize d(cv) = Σi∈S |yiv - cv|^β over cv • At β > 1, d(cv) is convex • Gradient method
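Since d is convex for β > 1, any 1-D convex minimiser will find the Minkowski centre of a feature; the slide suggests a gradient method, but as a self-contained sketch we use ternary search instead (the function name is ours):

```python
def minkowski_center(values, beta, tol=1e-9):
    """1-D Minkowski centre: argmin over c of sum(|y - c| ** beta).
    Ternary search is valid because the objective is convex for beta > 1."""
    lo, hi = min(values), max(values)
    f = lambda c: sum(abs(y - c) ** beta for y in values)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

At β = 2 this recovers the ordinary mean, as it should.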

  39. Minkowski metric effects • The more uniform the distribution of the entities over a feature, the smaller its weight • Uniform distribution ⇒ w = 0 • The best Minkowski power β is data dependent • The best β can be learnt from data in a semi-supervised manner (with clustering of all objects) • Example: on Fisher’s Iris data, iMWK-Means gives only 5 errors (a record)

  40. Conclusion: the data recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects. Further work: extending the approach to other data types (text, sequence, image, web page); upgrading K-Means to address the interpretation of the results. [diagram: Data → Coder → Clusters (model) → Decoder → Recovered data]

  41. HEFCE survey of students’ satisfaction • HEFCE method: ALL 93 units receive the highest mark • STRATA: 43 best, ranging from 71.8 to 84.6
