Download
clustering tackling challenges with data recovery approach n.
Skip this Video
Loading SlideShow in 5 Seconds..
Clustering: Tackling Challenges with Data Recovery Approach PowerPoint Presentation
Download Presentation
Clustering: Tackling Challenges with Data Recovery Approach

Clustering: Tackling Challenges with Data Recovery Approach

87 Vues Download Presentation
Télécharger la présentation

Clustering: Tackling Challenges with Data Recovery Approach

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Clustering: Tackling Challenges with Data Recovery Approach B. Mirkin School of Computer Science Birkbeck University of London Advert of a Special Issue: The ComputerJournal, Profiling Expertise and Behaviour: Deadline 15 Nov. 2006. To submit, http:// www.dcs.bbk.ac.uk/~mark/cfp_cj_profiling.txt

  2. WHAT IS CLUSTERING; WHAT IS DATA • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward;Extensions to Other Data Types; One-by-One Clustering • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

  3. What is clustering? • Finding homogeneous fragments, mostly sets of entities, in data for further analysis

  4. Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996) Pluto doesn’t fit in the two clusters of planets

  5. Example: A Few Clusters Clustering interface to WEB search engines (Grouper): Query: Israel (after O. Zamir and O. Etzioni 2001)

  6. Clustering algorithms • Nearest neighbour • Ward • Conceptual clustering • K-means • Kohonen SOM • Etc………………….

  7. K-Means: a generic clustering method Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence K= 3 hypothetical centroids (@) • * * • * * * * * • * * * • @ @ • @ • ** • * * *

  8. K-Means: a generic clustering method Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence • * * • * * * * * • * * * • @ @ • @ • ** • * * *

  9. K-Means: a generic clustering method Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence • * * • * * * * * • * * * • @ @ • @ • ** • * * *

  10. K-Means: a generic clustering method Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence 4. Output final centroids and clusters * * @ * * * @ * * * * ** * * * @

  11. Advantages of K-Means Models typology building Computationally effective Can be utilised incrementally, `on-line’ Shortcomings of K-Means Instability of results Convex cluster shape

  12. Initial Centroids: Correct Two cluster case

  13. Initial Centroids: Correct Final Initial

  14. Different Initial Centroids

  15. Different Initial Centroids: Wrong Initial Final

  16. Clustering issues: • K-Means gives no advice on: • *Number of clusters • * Initial setting • * Data normalisation • * Mixed variable scales • * Multiple data sets • K-Means gives limited advice on: • *Interpretation of results

  17. Type of Data Similarity Temporal Entity-to-feature Co-occurrence Type of Model Regression Principal components Clusters Data recovery for data mining (=“discovery of patterns in data”) • Model: • Data = Model_Derived_Data + Residual • Pythagoras: • Data2 = Model_Derived_Data2 + Residual2 • The better fit, the better the model

  18. Pythagorean decomposition in Data recovery approach, provides for: • Data scatter – a unique data characteristic (A perspective at data normalisation) • Additive contributions of entities or features to clusters (A perspective for interpretation) • Feature contributions are correlation/association measures affected by scaling (Mixed scale data treatable) • Clusters can be extracted one-by-one (Data mining perspective, incomplete clustering, number of clusters) • Multiple data can be approximated as well as single sourced ones (not talked of today)

  19. Example: Mixed scale data table

  20. Conventional quantitative coding + … data standardisation

  21. Standardisation of features • Yik = (Xik –Ak)/Bk • X - original data • Y – standardised data • i – entities • k – features • Ak – shift of the origin, typically, the average • Bk – rescaling factor, traditionally the standard deviation, but range may be better in clustering

  22. No standardisation Tom Sawyer

  23. Z-scoring (scaling by std) Tom Sawyer

  24. Standardising by range & weight Tom Sawyer

  25. K-Means as a data recovery method

  26. Representing a partition Clusterk: Centroid ckv (v - feature) Binary 1/0 membership zik (i - entity)

  27. Basic equations (analogous to PCA, with score vectors zk constrained to be binary) y – data entry, z – membership, not score c - cluster centroid, N – cardinality i - entity, v - feature /category, k - cluster

  28. Meaning of Data scatter • The sum of contributions of features – the basis for feature pre-processing (dividing by range rather than std) • Proportional to the summary variance

  29. Contribution of a feature Fto a partition • Proportional to • correlation ratio 2 if F is quantitative • a contingency coefficient between cluster partition and F, if F is nominal: • Pearson chi-square (Poisson normalised) • Goodman-Kruskal tau-b (Range normalised) Contrib(F) =

  30. Contribution of a quantitative feature to a partition • Proportional to • correlation ratio 2 if F is quantitative

  31. Contribution of a nominal feature to a partition • Proportional to a contingency coefficient • Pearson chi-square (Poisson normalised) • Goodman-Kruskal tau-b (Range normalised) • Bj=1

  32. Pythagorean Decomposition of data scatter forinterpretation

  33. Contribution based description of clusters • C. Dickens: FCon = 0 • M. Twain: LenD < 28 • L. Tolstoy: NumCh > 3 or Direct = 1

  34. PCA based Anomalous Pattern Clustering yiv =cv zi + eiv, where zi = 1 ifiS, zi = 0 ifiS With Euclidean distance squared cS must be anomalous, that is, interesting

  35. Tom Sawyer Initial setting with Anomalous Pattern Cluster

  36. 1 2 Tom Sawyer 3 Anomalous Pattern Clusters: Iterate 0

  37. iK-Means:Anomalous clusters + K-means After extracting 2 clusters (how one can know that 2 is right?) Final

  38. Features: Corrupt office (1) Client (1) Rendered service (6) Mechanism of corruption (2) Environment (1) Example of iK-Means: Media Mirrored Russian Corruption (55 cases) with M. Levin and E. Bakaleinik

  39. A schema for Bribery Environment Interaction Office Client Service

  40. Categories as one/zero variables Subtracting the average All features: Normalising by range Categories, sometimes by the number of them Data standardisation

  41. 13 clusters found with AC, of which 8 do not fit (4 singletons, 4 doublets) 5 clusters remain, to get initial seeds from Cluster elements are taken as seeds iK-Means:Initial Setting with Iterative Anomalous Pattern Clustering

  42. Patterns in centroid values of salient features Salience of feature v at cluster k : ~ (grand mean - within-cluster mean)2 Interpretation II: Patterning(Interpretation I: Representatives Interpretation III: Conceptual description)

  43. Cluster 1 (7 cases): Other branch (877%) Improper categorisation (439%) Level of client (242%) Cluster 2 (19 cases): Obstruction of justice (467%) Law enforcement (379%) Occasional (251%) InterpretationII III Branch = Other Branch = Law Enforc. & Service: No Cover-Up & Client Level  Organisation

  44. Cluster 3 (10 cases): Extortion (474%) Organisation(289%) Government (275%) InterpretationII (pattern) III (appcod) 0 <= Extort - Obstruct <= 1 & 2 <= Extort + Bribe <=3 & No Inspection & No Protection NO ERRORS

  45. Government Extortion for free services (Cluster 3) Protection (Cluster 4) Law enforcement Obstruction of justice (Cluster 2) Cover-up (Cluster 5) Other Category change (Cluster 1) Is this knowledge enhancement? Overall Description: It is Branch that matters

  46. Data recovery clustering of similarities • Example: Similarities between algebraic functions in an experimental method for knowledge evaluation lnx x² x³ x½ x¼ lnx - 1 1 2.5 2.5 x² 1 - 6 2.5 2.5 X³ 1 6 - 3 3 x½ 2.5 2.5 3 - 4 x¼ 2.5 2.5 3 4 - Scoring similarities between algebraic functions by a 6th grade student in scale 1 to 7

  47. Additive clustering Similarities are the sum of intensities of clusters Cl. 0: “All are funcrtions”,{lnx, x², x³, x½, x¼} Intensity 1 (upper sub-matrix) lnx x² x³ x½ x¼ lnx - 1 1 1 1 x² 1 - 1 1 1 X³ 1 6 - 1 1 x½ 2.5 2.5 3 - 1 x¼ 2.5 2.5 3 4 - Scoring similarities between algebraic functions by a 6th grade student in scale 1 to 7 (lower sub-matrix)

  48. Additive clustering Similarities are the sum of intensities of clusters Cl. 1: “Power functions”,{x², x³, x½, x¼} Intensity 2 (upper sub-matrix) lnx x² x³ x½ x¼ lnx - 0 0 0 0 x² 1 - 2 2 2 X³ 1 6 - 2 2 x½ 2.5 2.5 3 - 2 x¼ 2.5 2.5 3 4 - Scoring similarities between algebraic functions by a 6th grade student in scale 1 to 7 (lower sub-matrix)

  49. Additive clustering Similarities are the sum of intensities of clusters Cl. 2: “Sub-linear functions”,{lnx, x½, x¼} Intensity 1 (upper sub-matrix) lnx x² x³ x½ x¼ lnx - 0 0 1 1 x² 1 - 0 0 0 X³ 1 6 - 0 0 x½ 2.5 2.5 3 - 1 x¼ 2.5 2.5 3 4 - Scoring similarities between algebraic functions by a 6th grade student in scale 1 to 7 (lower sub-matrix)

  50. Additive clustering Similarities are the sum of intensities of clusters Cl. 3: “Fast growing functions”,{x², x³} Intensity 3 (upper sub-matrix) lnx x² x³ x½ x¼ lnx - 0 0 0 0 x² 1 - 3 0 0 X³ 1 6 - 0 0 x½ 2.5 2.5 3 - 0 x¼ 2.5 2.5 3 4 - Scoring similarities between algebraic functions by a 6th grade student in scale 1 to 7 (lower sub-matrix)