Data Mining Techniques and Applications The University of Nottingham

Clustering Alvaro Garcia-Piquer Research Group in Intelligent Systems (GRSI) La Salle – Ramon Llull University alvarog@salle.url.edu

Outline 1. Introduction 2. Clustering Taxonomy 3. Some Algorithms 4. Validation of Clustering Solutions 5. Summary

Grouping Data • Data Mining [Han, 2006] • We have rich data, but poor information • Data mining: searching for knowledge (interesting patterns) in your data • Clustering [Kaufman, 2005] • To group data according to a set of criteria, providing experts with a possible classification or categorization of the elements

Clustering example • The same elements can be grouped by family or by age

Applications • Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records • Biology: classification of living organisms according to their DNA • Image segmentation: identifying objects in images according to the features of each pixel (position, color…)

Clustering steps • 1. Choose the number of clusters • 2. Choose the type of clustering • Search typology • Relationship of the clusters • Cases distribution into the clusters • 3. Run the clustering algorithm (clustering process) • 4. Convergence? (optimization of clusters) • No: repeat the clustering process • Yes: continue • 5. Validation of the clustering solution

Number of clusters • The number of clusters to find can be determined: • Manually • The search space of the algorithm is reduced • Example: Can you group these data in two clusters according to the colour? (Cluster 1: blue data; Cluster 2: green data) • Automatically • The search space is not delimited, so it is more difficult for the algorithm to converge • Example: Can you group these data according to the colour? How many groups can you identify?

Relationships of the clusters • Partitional • There are no relationships between the clusters • Hierarchical • All the clusters are related to each other • Two types • Agglomerative • Divisive [Gan, 2007; Duda, 2000]

Partitional

Hierarchical agglomerative

Hierarchical divisive

Cases distribution into the clusters • Hard • Each data element belongs to exactly one cluster • Fuzzy (soft) • Data elements can belong to more than one cluster • Each object has membership grades that indicate the degree to which it belongs to the different clusters • The sum of the membership grades of each object has to be the same (normally 1) [Gan, 2007; Duda, 2000]
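
The membership grades described above can be illustrated with a short Python sketch. It assumes the membership formula used by fuzzy C-means (grades inversely related to the distance to each cluster centre); the function names and the fuzzifier parameter `m` are chosen here for illustration and do not come from the slides.

```python
import math

def fuzzy_memberships(point, centres, m=2.0):
    """Membership grade of `point` in each cluster (fuzzy C-means style).

    `m` > 1 is the fuzzifier (assumed value 2.0); the grades always sum to 1.
    """
    dists = [math.dist(point, c) for c in centres]
    # A point lying exactly on a centre belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

def harden(memberships):
    """Turn fuzzy grades into a hard assignment: index of the largest grade."""
    return max(range(len(memberships)), key=memberships.__getitem__)

grades = fuzzy_memberships((0.5, 0.0), [(0.0, 0.0), (2.0, 0.0)])
hard_label = harden(grades)
```

Taking the largest grade, as `harden` does, is one simple way to recover a hard clustering from a fuzzy one.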

Hard clustering

Fuzzy clustering • Example membership grades of four elements (red cluster, green cluster): (1, 0), (0.7, 0.3), (0.6, 0.4), (0.2, 0.8)

Search typology (1) • Centre-based algorithms [Gan, 2007] • Each cluster is defined by a prototype (its centre), and each instance is assigned to the closest prototype • The clusters have convex shapes • They cannot find clusters of arbitrary shape • They are sensitive to the initialization and may get stuck in a locally optimal solution

Search typology (2) • Graph-based algorithms [Gan, 2007] • They construct a graph or hypergraph and then apply some heuristic to partition it • They can find arbitrarily shaped clusters • They are sensitive to the initialization and may get stuck in a locally optimal solution • Example • Graph construction: each instance is linked to its nearest neighbour not yet visited • Eliminating edges: the edges that are longer than a threshold are eliminated
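
The graph construction and edge elimination shown on this slide can be sketched in Python roughly as follows. This is a simplified reading of the slide, not a specific published algorithm: the starting instance (index 0), the function name and the `threshold` parameter are assumptions made for illustration.

```python
import math

def chain_clusters(points, threshold):
    """Link each instance to its nearest unvisited neighbour, then cut long edges.

    Returns a list of clusters, each a list of point indices.
    """
    if not points:
        return []
    unvisited = set(range(1, len(points)))
    current = 0                      # assumed starting instance
    clusters = [[0]]
    while unvisited:
        # Graph construction: edge from the current instance to its
        # nearest neighbour that has not been visited yet.
        nearest = min(
            unvisited,
            key=lambda j: (math.dist(points[current], points[j]), j),
        )
        unvisited.remove(nearest)
        if math.dist(points[current], points[nearest]) > threshold:
            clusters.append([nearest])    # edge eliminated: new cluster starts
        else:
            clusters[-1].append(nearest)  # edge kept: same cluster
        current = nearest
    return clusters

groups = chain_clusters([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)], threshold=3)
```

Because the walk starts from one particular instance, a different starting point can give a different partition, which is one way the sensitivity to initialization shows up.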

Search typology (3) • Model-based algorithms [Gan, 2007] • It is assumed that the data are generated by a mixture of probability distributions, each of which represents a different cluster • The distributions are estimated from the data, and each data instance is assigned to the distribution most likely to have generated it • They are sensitive to the initialization and may get stuck in a locally optimal solution • Example: two Gaussian distributions with parameters (µ1, σ1) and (µ2, σ2)
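
As a rough illustration of the assignment step, the following Python sketch computes how likely each one-dimensional Gaussian component is to have generated an instance. It assumes the mixture parameters (weight, µ, σ) are already known; estimating them from the data, for example with the EM algorithm of Dempster et al. listed in the references, is not shown. All names and numeric values are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, components):
    """Posterior probability of each component for instance x.

    `components` is a list of (weight, mu, sigma) tuples (assumed known).
    """
    weighted = [w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components]
    total = sum(weighted)
    return [v / total for v in weighted]

def assign(x, components):
    """Index of the component most likely to have generated x."""
    resp = responsibilities(x, components)
    return max(range(len(resp)), key=resp.__getitem__)

# Hypothetical mixture: two equally weighted Gaussians.
mixture = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
label = assign(0.5, mixture)
```

The responsibilities are also a natural set of fuzzy membership grades, since they sum to 1 for every instance.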

Search typology (4) • Search-based algorithms [Gan, 2007] • They complement the previous strategies • The previous strategies may not be able to find the globally optimal clustering that fits the data set • This strategy searches the whole solution space to find a globally optimal clustering that fits the data set • Genetic algorithms • Ant colony optimization • Simulated annealing • They are computationally very expensive

Search typology (5) • Density-based algorithms [Gan, 2007] • Clusters are defined as dense regions separated by low-density regions • They need only one scan of the original data set and can handle noise • The number of clusters does not have to be specified • They can find arbitrarily shaped clusters
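
One well-known density-based algorithm is DBSCAN, which the slides do not name; the Python sketch below follows its general idea under that assumption. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense region) are chosen for illustration, and the naive neighbour search compares every pair of points, whereas practical implementations would use a spatial index.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    n = len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:    # j is also a core point: keep expanding
                queue.extend(reach)
    return labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

Note that the number of clusters is an output of the procedure rather than an input, and the isolated point is reported as noise.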

Search typology (6) • Subspace-based algorithms [Gan, 2007] • They are applied to high-dimensional data sets • They find clusters in each dimension by identifying dense units • The final clusters are found by overlapping the clusters of each dimension

Optimization of the clusters • Several clustering algorithms are iterative and consist of optimizing the evaluation of the clusters according to one or several objectives • Single objective • The clustering process consists of optimizing a single objective • Several objectives [Law, 2004] • The clustering process consists of optimizing several objectives, obtaining a trade-off between them

Single objective (1) • The clusters are obtained taking into account the attributes ‘x’ and ‘y’ • Example with one criterion to optimize: 1) Each cluster has to contain elements of the same shape • Example with two criteria to optimize: 1) Each cluster has to contain elements of the same shape 2) The number of clusters has to be minimized • These two criteria are considered a single objective because optimizing one criterion does not affect the other

Single objective (2) • The clusters are obtained taking into account the attributes ‘x’ and ‘y’ • Criteria to optimize: 1) Minimize intra-cluster variance 2) Maximize inter-cluster variance • Two example solutions: one with the intra-cluster variance optimized, one with the inter-cluster variance optimized • Here it is impossible to optimize both criteria at the same time

Single objective (3) • Validation indexes [Halkidi, 2002] • They evaluate a clustering solution according to the quality (shape) of the clusters, using the inter-cluster and intra-cluster variance simultaneously • Some indexes • Davies-Bouldin index • Dunn’s index [Dunn, 1974] • Silhouette index • ... • Example: Davies-Bouldin index
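
The formula from this slide did not survive in the transcript, so the following Python sketch uses the usual textbook definition of the Davies-Bouldin index: for each cluster, take the worst ratio of summed within-cluster scatter to the distance between centroids, then average over clusters. Lower values indicate more compact, better separated clusters. The function names are illustrative, and the sketch assumes at least two non-empty clusters with distinct centroids.

```python
import math

def centroid(cluster):
    """Mean point of a non-empty cluster of equal-length tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def davies_bouldin(clusters):
    """Davies-Bouldin index of a hard clustering (lower is better)."""
    centres = [centroid(c) for c in clusters]
    # Scatter: average distance of the members to their own centroid
    # (intra-cluster term).
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centres)
    ]
    k = len(clusters)
    # For each cluster, the worst (largest) similarity to any other cluster;
    # the centroid distance is the inter-cluster term.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

score = davies_bouldin([[(0.0,), (2.0,)], [(10.0,), (12.0,)]])
```

In the example each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.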

Several objectives (1) • Ensemble clustering [Law, 2004] • Criteria to optimize: 1) Minimize intra-cluster variance 2) Maximize inter-cluster variance • Each criterion is optimized separately: one solution optimizes the intra-cluster variance, another optimizes the inter-cluster variance • The results are then combined; how to combine them is the open question

Several objectives (2) • Multi-objective clustering • Criteria to optimize: 1) Minimize intra-cluster variance 2) Maximize inter-cluster variance • Figure: candidate solutions plotted by intra-cluster variance against 1 - inter-cluster variance, with a dominated solution marked
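
The notion of a dominated solution can be made concrete with a small Python sketch. It assumes each candidate clustering has already been scored as a pair (intra-cluster variance, 1 - inter-cluster variance), with both values to be minimized as on the slide's axes; the scores below are made up for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    All objectives are assumed to be minimized.
    """
    return (
        all(x <= y for x, y in zip(a, b))
        and any(x < y for x, y in zip(a, b))
    )

def pareto_front(solutions):
    """Solutions that no other solution dominates (the trade-off set)."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]

# Hypothetical scores: (intra-cluster variance, 1 - inter-cluster variance).
candidates = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2), (0.6, 0.6)]
front = pareto_front(candidates)
```

Here (0.6, 0.6) is dominated by (0.5, 0.5), so it is dropped; the remaining three solutions represent different trade-offs between the two criteria.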

Taxonomy Summary • Search typology • Centre-based • Search-based • Graph-based • Density-based • Model-based • Subspace-based • ... • Number of clusters • Manual • Automatic • Relationships of the clusters • Partitional • Hierarchical • Cases distribution into the clusters • Fuzzy (soft) • Hard • Optimization of the clusters • Single objective • Several objectives

k-means • Proposed by MacQueen [MacQueen, 1967] • Partitional • Centre-based • Hard clustering • Number of clusters manual • Single objective • It groups the instances into k circular clusters according to the distance between each instance and the centre of each cluster, and then updates the centres with the new assignments. This process is repeated until convergence is reached • Similar algorithms: x-means (number of clusters automatic), fuzzy C-means (fuzzy clustering)
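
The assign-and-update loop described above can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation: the initial centres are passed in explicitly (so the result is reproducible), Euclidean distance is assumed, and the names `k_means`, `initial_centres` and `max_iter` are chosen for this example.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def k_means(points, initial_centres, max_iter=100):
    """Return (assignment, centres) after iterating until the assignment is stable."""
    centres = [tuple(c) for c in initial_centres]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to its closest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                         # convergence: nothing changed
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                   # an empty cluster keeps its old centre
                centres[c] = mean(members)
    return assignment, centres

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centres = k_means(data, initial_centres=[(0, 0), (10, 0)])
```

Since the outcome depends on `initial_centres`, running the sketch from several different starting centres and keeping the best result is a common way to reduce the risk of a poor local optimum.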

Single-link • Proposed by Johnson [Johnson, 1967] • Hierarchical agglomerative • Centre-based • Hard clustering • Number of clusters automatic • Single objective • In each step, the two clusters whose two closest members have the smallest distance are merged • Similar algorithms: Complete-link, Average-link • They follow other heuristics to decide which clusters to merge
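
The merge rule can be sketched with a naive Python implementation. As a simplification, the loop below stops once a requested number of clusters remains (the `num_clusters` parameter is an assumption for this example, and must be at least 1); the full algorithm would keep merging and record the whole hierarchy.

```python
import math

def single_link(points, num_clusters):
    """Agglomerative single-link clustering; returns clusters as sorted index lists."""
    clusters = [[i] for i in range(len(points))]   # start: one cluster per instance

    def gap(a, b):
        # Single-link distance: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        a, b = min(
            (
                (x, y)
                for x in range(len(clusters))
                for y in range(x + 1, len(clusters))
            ),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters[b])            # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

groups = single_link([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)], num_clusters=2)
```

Replacing `min` with `max` inside `gap` gives the complete-link heuristic, and using the average of the pairwise distances gives average-link.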

Clustering validation (1) • How to validate a clustering solution? • The data is not labelled • External criteria [Halkidi, 2002] • An expert in the problem domain acts as judge • Comparison with an intuitive solution • F-Measure, Rand Index, Adjusted Rand Index... • Explanations of each cluster to justify it • Main features (attributes) of the elements of each cluster
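
As an example of comparing a clustering with a reference (intuitive) solution, the Rand Index can be computed as the fraction of pairs of elements on which the two partitions agree, that is, pairs placed together in both or apart in both. The Python sketch below assumes both partitions are given as label lists of equal length with at least two elements; the function name is illustrative.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs treated the same way by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# The cluster names differ but the grouping is identical, so the index is 1.0.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The index only looks at which elements are grouped together, so it does not matter how the clusters are named in each solution.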

Clustering validation (2) • Relative criteria [Halkidi, 2002] • Compare the clustering results according to a validation index (or a combination of several) • Validation indexes use only the information in the data set • They are normally used to select the best solution from several clustering results obtained with different clustering algorithms • This does not mean that the selected solution is a good solution to the problem • The selected solution depends on the validation index used

Summary • How to solve a clustering problem? • Data analysis • Pre-process the data if necessary (noise, unknown values...) • Selection of the clustering algorithm • It is important to know the domain of the problem • Is there a known number of clusters? • Can clusters overlap? • Is a hierarchical relationship between clusters necessary? • Is it important to detect arbitrary shapes? • What are the clustering criteria? • ...

References • A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977. • G. Corral, A. Garcia-Piquer, A. Orriols-Puig, A. Fornells, and E. Golobardes. Analysis of Vulnerability Assessment Results based on CAOS. Applied Soft Computing Journal, in press, 2010. • R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2000. • J. C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104, 1974. • G. Gan, C. Ma, and J. Wu. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM, 2007. • M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. ACM SIGMOD Record, 31(2):40-45, 2002. • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. • S. C. Johnson. Hierarchical Clustering Schemes. Psychometrika, 32(3):241-254, 1967. • L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 2005. • M. Law, A. Topchy, and A. Jain. Multiobjective data clustering. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:424-430, 2004. • M. Matteucci. A Tutorial on Clustering Algorithms. Politecnico di Milano. <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html> • J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, 1967. • I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005.
