
Clustering algorithms and methods



Presentation Transcript


  1. Clustering algorithms and methods Andreas Held - Review and usage - 28.June.2007

  2. Content • What is a cluster and the clustering process • Proximity measures • Hierarchical clustering • Agglomerative • Divisive • Partitioning clustering • K-means • Density-based Clustering • DBSCAN

  3. The Cluster • A cluster is a group or accumulation of objects with similar attributes • Conditions on clusters: (i) homogeneity within a cluster, (ii) heterogeneity to other clusters • Possible objects in biology: genes (transcriptomics), individuals (plant systematics), sequences (sequence analysis) • Ruspini dataset: an artificially generated example dataset

  4. Objectives of Clustering • Generation of clusters that are as homogeneous internally and as heterogeneous to each other as possible • Identification of categories, classes or groups in the data • Recognition of relations within the data • Concise structuring of the data (e.g. as a dendrogram)

  5. The clustering process • Experimental data: the expression levels of genes under different conditions • Preprocessing: take only the expression levels under the conditions of interest => attribute vectors xi = (y1, …, ym) • Raw-data matrix: create the raw-data matrix by stacking the attribute vectors row by row • Proximity measures: define the distance or similarity functions and build the distance matrix, in which the objects are confronted with each other on its rows and columns • Clustering algorithm: choose a clustering algorithm and apply it to the data
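A minimal Python sketch of these preprocessing steps, using a small made-up expression table (the gene values and the choice of Euclidean distance are illustrative, not from the slide):

```python
import numpy as np

# Raw-data matrix: one attribute vector x_i = (y_1, ..., y_m) per row (gene),
# one column per condition of interest.
raw_data = np.array([
    [2.1, 0.4, 1.7],   # gene 1
    [1.9, 0.5, 1.6],   # gene 2
    [0.2, 3.1, 0.1],   # gene 3
])

# Proximity measure: build the distance matrix, in which the objects are
# confronted with each other on the rows and columns.
n = raw_data.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_matrix[i, j] = np.linalg.norm(raw_data[i] - raw_data[j])  # Euclidean

print(dist_matrix)
```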

  6. Distance functions for objects • d(x, y) calculates the distance between the two objects x and y • Distance measures: Euclidean distance d(x, y) = sqrt( Σi (xi - yi)² ), Manhattan distance d(x, y) = Σi | xi - yi |, Maximum distance d(x, y) = maxi ( | xi - yi | ) • Example (figure): the three distances between two points in a two-dimensional space spanned by condition 1 and condition 2
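The three distance functions as a small Python sketch (the example vectors are made up):

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # d(x, y) = sum_i | x_i - y_i |
    return np.sum(np.abs(x - y))

def maximum(x, y):
    # d(x, y) = max_i | x_i - y_i |
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0])
y = np.array([4.0, 3.0])
print(euclidean(x, y), manhattan(x, y), maximum(x, y))
```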

  7. Distance measures for clusters • Calculating the distance between two clusters is important for some algorithms (e.g. hierarchical algorithms) • Single linkage: min { d(a, b) : a ∈ A, b ∈ B } • Complete linkage: max { d(a, b) : a ∈ A, b ∈ B } • Average linkage: ( 1 / (|A| · |B|) ) Σa∈A Σb∈B d(a, b) • Example (figure): clusters X and Y in the space of condition 1 and condition 2, with the three linkage distances drawn between them
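A small Python sketch of the three linkage measures, assuming the clusters are given as lists of vectors and using the Euclidean distance between objects (both assumptions are illustrative):

```python
import numpy as np

def d(x, y):
    return np.linalg.norm(x - y)   # object distance (Euclidean)

def single_linkage(A, B):
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B):
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B):
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

cluster_x = [np.array([1.0, 1.0]), np.array([2.0, 1.5])]
cluster_y = [np.array([4.0, 4.0]), np.array([5.0, 3.5])]
print(single_linkage(cluster_x, cluster_y),
      complete_linkage(cluster_x, cluster_y),
      average_linkage(cluster_x, cluster_y))
```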

  8. Differentiation of clustering algorithms

  9. Hierarchical Clustering • Two methods of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down) • Agglomerative vs. divisive: divisive and agglomerative methods produce the same results, but divisive algorithms need much more computing power, so in practice only agglomerative methods are used • Agglomerative algorithm example: UPGMA, used in phylogenetics • Conditions: a given distance or similarity measure for objects and a given distance measure for clusters • The result is a dendrogram

  10. Agglomerative hierarchical clustering • Start with the finest partition (every object is its own cluster), a given distance measure between clusters, and the distance matrix D • Algorithm: find the two clusters with the closest distance and merge them into one cluster, then compute the new distance matrix; repeat until all clusters are agglomerated • Distance measures in the example: Manhattan distance (objects), single linkage (clusters) • Example (figure): objects A-E in the condition space, the intermediate clusters a-d, the distance matrix, and the resulting dendrogram
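A minimal Python sketch of this loop with Manhattan distance and single linkage; the five example points are made up, and the merge order is printed instead of drawing a dendrogram:

```python
import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def single_linkage(A, B):
    return min(manhattan(a, b) for a in A for b in B)

# finest partition: every object is its own cluster
objects = [np.array(p) for p in [(1, 1), (2, 3), (4, 2), (5, 8), (6, 9)]]
clusters = {name: [obj] for name, obj in zip("ABCDE", objects)}

while len(clusters) > 1:
    # find the two clusters with the closest distance ...
    i, j = min(((a, b) for a in clusters for b in clusters if a < b),
               key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]))
    print("merge", i, "and", j, "at distance", single_linkage(clusters[i], clusters[j]))
    # ... put them into one cluster and continue with the coarser partition
    clusters[i + j] = clusters.pop(i) + clusters.pop(j)
```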

  11. Hierarchical clustering - conclusions - • Advantages: the dendrogram allows interpretation; depending on the level of the dendrogram, different clustering granularities can be explored; usable on any data space for which a distance measure can be defined • Disadvantages: the user has to identify the clusters himself; repeated recalculation of the large distance matrix makes the algorithm resource-intensive; higher runtimes than non-hierarchical methods

  12. Partitioning clustering - k-means algorithm - • Partition n objects into k clusters • Calculate centroids from a given clustering: ci = ( 1 / |Ci| ) Σx∈Ci x, where ci is the centroid of cluster Ci • Calculate a clustering from given centroids: assign each object to the cluster whose centroid is closest to it

  13. k-means algorithm principle • In general, neither the centroids nor the clustering is known in advance • Start with a guess and alternate between the two steps: estimate the cluster centers (centroids) from the clustering, and the clustering from the centroids

  14. k-means algorithm - Euclidean distance, k = 3 - 0) Init: place 3 cluster centroids randomly 1) Assign every object to the cluster with the nearest cluster centroid 2) Compute the new cluster centroids from the given clustering 3) Repeat 1) and 2) until all centroids stop moving • In each step, the centroids and the clustering improve
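A minimal Python sketch of these steps with k = 3 as in the slide's example; the data itself is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 2))          # 30 objects in [0, 1] x [0, 1]
k = 3

# 0) Init: place k centroids randomly (here: k randomly chosen objects)
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # 1) assign every object to the cluster with the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 2) compute the new centroids from the given clustering
    new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else centroids[i] for i in range(k)])
    # 3) repeat 1) and 2) until the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
```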

  15. k-means algorithm - problems - • Not every run achieves the same result, because the result depends on the random initialization of the clusters => run the algorithm a couple of times and take the best result • The number of clusters k has to be fixed before starting the algorithm => try different values for k and take the best result • Computing the optimal number of clusters is not a trivial problem; one approach is the elbow criterion.
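Both workarounds in one short sketch, assuming scikit-learn is available (n_init reruns the algorithm several times and keeps the best result; plotting inertia_ against k gives the elbow curve; the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))         # illustrative data

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = within-cluster sum of squares; look for the "elbow" in this curve
    print(k, km.inertia_)
```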

  16. k-means algorithm - advantages - • Easy to implement • Its linear runtime allows execution on large databases • Example: clustering of microarray data, with up to 20,000-dimensional vectors depending on the experiment

  17. Partitioning clustering - density-based method - • Condition on the data space: regions where objects lie close together are separated from regions where the objects are less dense => clusters of arbitrary shape can be found

  18. Density-based clustering - parameters - • ε: the radius of the neighborhood around an object; Nε(o) denotes all objects in the ε-neighborhood of object o • MinPts: the minimum number of objects that have to be in an object's ε-neighborhood for that object to be a core object

  19. Density-based clustering - definitions - • An object o ∈ O is a core object if | Nε(o) | ≥ MinPts • An object p ∈ O is directly density-reachable from a core object q ∈ O if p ∈ Nε(q) • An object p is density-reachable from an object q if there is a chain of directly density-reachable objects between q and p

  20. Density-based clustering - example DBSCAN - • Parameters: ε as drawn in the figure, MinPts = 4 • Algorithm: 1) Visit all objects one after the other 2) When a core object is found ( | Nε(o) | ≥ MinPts = 4 ), 3) start a new cluster and assign the object to it 4) Search for all objects that are density-reachable from it and assign them to the cluster as well
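A minimal Python sketch of DBSCAN with MinPts = 4 as on the slide; eps and the two-blob test data are illustrative, and objects that are never reached keep the noise label -1:

```python
import numpy as np

def region_query(X, i, eps):
    # N_eps(o): all objects within distance eps of object i
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), -1)      # -1 = not assigned yet / noise
    cluster = 0
    for i in range(len(X)):           # 1) visit all objects one after the other
        if labels[i] != -1:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:  # 2) only core objects start a cluster
            continue
        cluster += 1                  # 3) start a new cluster with object i
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                  # 4) collect all density-reachable objects
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:
                    seeds.extend(j_neighbors)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
print(dbscan(X, eps=0.5, min_pts=4))
```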

  21. Density-based clustering - conclusions - • Advantages: minimal requirements on domain knowledge to determine the input parameters; discovery of clusters with arbitrary shape; good efficiency on large databases • Disadvantages: problems on data spaces whose density differs strongly between regions; poor efficiency on high-dimensional databases

  22. More clustering methods • Hierarchical methods (agglomerative, divisive) • Partitioning methods (e.g. k-means) • Density-based methods (e.g. DBSCAN) • Fuzzy clustering • Grid-based methods • Constraint-based methods • High-dimensional clustering

  23. Clustering algorithms - conclusions - • Choosing a clustering algorithm for a particular problem is not trivial • Individual algorithms cover only part of the given requirements (runtime behavior, precision, influence of outliers, ...) • => No algorithm has been found (yet) that is optimally usable for every purpose; a universally applicable algorithm remains to be developed.

  24. End
