
Clustering




  1. Clustering Petter Mostad

  2. Clustering vs. class prediction • Class prediction: • A learning set of objects with known classes • Goal: put new objects into existing classes • Also called: Supervised learning, or classification • Clustering: • No learning set, no given classes • Goal: discover the ”best” classes or groupings • Also called: Unsupervised learning, or class discovery

  3. Overview • General clustering theory • Steps, methods, algorithms, issues... • Clustering microarray data • Recommendations for this kind of data • Programs for clustering • Some other visualization techniques

  4. Issues in clustering • Used to explore and visualize data, with few preconceptions • Many subjective choices must be made, so a clustering output tends to be subjective • It is difficult to get truly statistically ”significant” conclusions • Algorithms will always produce clusters, whether any exist in the data or not

  5. Steps in clustering • Feature selection and extraction • Defining and computing similarities • Clustering or grouping objects • Assessing, presenting, and using the result

  6. 1. Feature selection and extraction • Deciding which measurements matter for similarity • Data reduction • Filtering away objects • Normalization of measurements

  7. The data matrix • Every row contains the measurements for one object • Similarities are computed between all pairs of rows • If the measurements are all of the same type, one can instead cluster the measurements (the columns)

  8. 2. Defining and computing similarities • Similarity measures for continuous data vectors: • Euclidean distance: $d(x,y)=\sqrt{\sum_i (x_i-y_i)^2}$ • Minkowski distance (including the Manhattan metric, $p=1$): $d(x,y)=\left(\sum_i |x_i-y_i|^p\right)^{1/p}$ • Mahalanobis distance: $d(x,y)=\sqrt{(x-y)^T S^{-1}(x-y)}$, where S is a covariance matrix
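
As a rough illustration, these distances can be computed in base R. This is a minimal sketch; the matrix X is just a hypothetical placeholder for an objects-by-measurements data matrix.

```r
## Minimal sketch, assuming a hypothetical objects-by-measurements matrix X
X <- matrix(rnorm(20 * 5), nrow = 20)

d.euclid    <- dist(X, method = "euclidean")         # Euclidean distance
d.manhattan <- dist(X, method = "manhattan")         # Manhattan metric
d.minkowski <- dist(X, method = "minkowski", p = 3)  # general Minkowski

## Mahalanobis distance from each object to the overall mean, with the
## covariance matrix S estimated from the data; mahalanobis() returns
## squared distances, hence the sqrt().
S <- cov(X)
d.mahal <- sqrt(mahalanobis(X, center = colMeans(X), cov = S))
```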

  9. Centered and non-centered (absolute) Pearson correlation • Centered: $r(x,y)=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the two vectors • Non-centered: $r(x,y)=\frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$ • Spearman rank correlation: • Compute the ranking of the numbers in each vector • Find the correlation between the rank vectors • ....
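
A common convention is to turn these correlations into distances as 1 minus the correlation. A minimal sketch in base R, reusing the hypothetical matrix X from the sketch above:

```r
## Correlation-based distances between the rows of X (1 - correlation).
pearson.dist  <- as.dist(1 - cor(t(X)))                       # centered Pearson
spearman.dist <- as.dist(1 - cor(t(X), method = "spearman"))  # Spearman rank

## Non-centered ("absolute") correlation: like Pearson, but without
## subtracting the means first.
uncentered.cor <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
```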

  10. Geometrical view of clustering • If the measurements are coordinates, objects become points in some space • If the similarity measure is Euclidean distance, the goal is to group nearby points • Note: when we have only 2 or 3 measurements per object, visual inspection can often do better than most algorithms

  11. Similarity measures for discrete data • Comparing two binary vectors, count the numbers a, b, c, d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s, respectively • Construct different similarity measures from these counts, for example the simple matching coefficient $(a+d)/(a+b+c+d)$ or the Jaccard coefficient $a/(a+b+c)$, which ignores 0-0 matches • Similarity of, for example, trees or other structured objects can also be defined in reasonable ways
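
A small sketch of such coefficients for two binary vectors; the two coefficients shown are standard, but which one is appropriate depends on the application.

```r
## Count the four kinds of matches/mismatches and form two common coefficients.
binary.similarity <- function(x, y) {
  n11 <- sum(x == 1 & y == 1)   # 1-1 matches (a)
  n10 <- sum(x == 1 & y == 0)   # 1-0 mismatches (b)
  n01 <- sum(x == 0 & y == 1)   # 0-1 mismatches (c)
  n00 <- sum(x == 0 & y == 0)   # 0-0 matches (d)
  c(simple.matching = (n11 + n00) / (n11 + n10 + n01 + n00),
    jaccard         = n11 / (n11 + n10 + n01))   # ignores 0-0 matches
}

binary.similarity(c(1, 1, 0, 0, 1), c(1, 0, 0, 1, 1))
```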

  12. Similarities using contexts • Mutual Neighbour Distance: $MND(x,y) = NN(x,y) + NN(y,x)$, where $NN(y,x)$ is the neighbour number of x with respect to y (i.e., x is the $NN(y,x)$-th nearest neighbour of y) • This is not a metric, but similarities do not need to be based on metrics.
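
A minimal sketch of the Mutual Neighbour Distance for all pairs of rows of the hypothetical matrix X, assuming the definition MND(x, y) = NN(x, y) + NN(y, x) given above:

```r
## Neighbour numbers are obtained by ranking the ordinary Euclidean distances;
## an object is its own 0th neighbour, the closest other object is neighbour 1.
mnd <- function(X) {
  D  <- as.matrix(dist(X))
  nn <- t(apply(D, 1, rank)) - 1   # nn[i, j] = neighbour number of j w.r.t. i
  nn + t(nn)                       # MND(i, j) = NN(i, j) + NN(j, i)
}
mnd(X)[1:4, 1:4]
```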

  13. 3. Clustering or grouping • Hierarchical clusterings • Divisive: starts with one big cluster and subdivides one cluster in each step • Agglomerative: starts with each object in a separate cluster; in each step, joins the two closest clusters • Partitional clusterings • Probabilistic or fuzzy clusterings

  14. Hierarchical clustering • Agglomerative clustering depends on the type of linkage, i.e., how the distance between the merged cluster (UV) and an old cluster W is computed: • d(UV, W) = min(d(U, W), d(V, W)) (single linkage) • d(UV, W) = max(d(U, W), d(V, W)) (complete linkage) • d(UV, W) = average over all distances between objects in (UV) and objects in W (average linkage, or UPGMA: Unweighted Pair Group Method with Arithmetic mean) • The output is a dendrogram • A simplification of average linkage (“average group linkage”) is often implemented: it may lead to inverted dendrograms!
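
A sketch of agglomerative clustering with the different linkages using base R's hclust(), again on the hypothetical matrix X from above:

```r
d <- dist(X)                                   # Euclidean distances

hc.single   <- hclust(d, method = "single")    # single linkage
hc.complete <- hclust(d, method = "complete")  # complete linkage
hc.average  <- hclust(d, method = "average")   # average linkage (UPGMA)

plot(hc.average)                       # draw the dendrogram
groups <- cutree(hc.average, k = 3)    # cut the tree into 3 clusters
```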

  15. Dendrograms, visualizations • The data matrix is often visualized using three colors, representing positive, negative, and zero values • Hierarchical clustering results are often represented with a dendrogram; the similarity at which clusters merge should correspond to the height of the corresponding horizontal line in the dendrogram • To display the dendrogram, the objects (rows or columns) need to be ordered; each time two clusters are merged, their order can be chosen in two ways

  16. Ward’s hierarchical clustering • Agglomerative. • Goal: minimize ”Error Sum of Squares” (ESS) at every step. • ESS = The sum over all clusters, of the sum of the squares of the distances from the objects to the cluster centroid. • When joining two clusters, find the pair that results in the smallest increase in ESS.
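
In recent versions of base R, Ward's method is also available through hclust; with method = "ward.D2" it expects ordinary (unsquared) Euclidean distances. A minimal sketch:

```r
hc.ward <- hclust(dist(X), method = "ward.D2")   # Ward's minimum-variance method
plot(hc.ward)
```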

  17. Partitional clusterings • The number of desired clusters is fixed at the start • K-means clustering: • Partition into k initial clusters • Iteratively reassign each point to the cluster with the closest centroid, then recompute the centroids • Repeat until the assignment is stable • The result may depend on the initial clusters • May include a procedure that joins or splits clusters according to size • The choice of the number of clusters may not be obvious
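
A minimal k-means sketch in base R; the nstart argument restarts the algorithm from several random initial partitions, since the result can depend on them:

```r
km <- kmeans(X, centers = 3, nstart = 25)
km$cluster        # cluster assignment for each object
km$tot.withinss   # total within-cluster sum of squares
```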

  18. Probabilistic or fuzzy clustering • The output is, for each object and each cluster, a probability or weight that the object belongs to the cluster • Example: the observations are modelled as draws from a number of probability densities (often multivariate normal); the parameters are then estimated by maximum likelihood (for example using the EM algorithm) • Example: a ”fuzzy” version of k-means, where the weights for the objects are changed iteratively
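
A sketch of both ideas, assuming the CRAN packages mclust (Gaussian mixtures fitted by EM) and cluster (which provides fanny(), a fuzzy relative of k-means) are installed:

```r
library(mclust)    # model-based clustering via Gaussian mixtures and EM
library(cluster)   # fanny(): fuzzy clustering

gmm <- Mclust(X, G = 2:5)   # number of mixture components chosen by BIC
head(gmm$z)                 # membership probabilities per object and cluster

fz <- fanny(X, k = 3)
head(fz$membership)         # fuzzy membership weights
```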

  19. Neural networks for clustering • Neural networks are mathematical models inspired by biological neural networks • They consist of layers of nodes that send out ”signals” based probabilistically on their input signals • Their best-known uses are classification tasks, i.e., with learning sets

  20. Self-Organising Maps (SOM)

  21. Clustering as optimization • Given similarity definition and definition of what is an ”optimal” clustering, it can often be a huge algorithmic challenge to find the optimum. • Example: Subdivide many thousand objects into 50 clusters, minimizing e.g. the sum of the squared distances to centroids. • Then, algorithms for optimization are central.

  22. Genetic algorithms • Try to use ”evolution” to obtain good solutions to a problem • A number of solutions are kept at every step; they may mate or mutate to produce new solutions, and the ”fittest” solutions are kept • Can be seen as an optimization algorithm • It is a great challenge to design ways of mating and mutating that produce an efficient algorithm

  23. Simulated annealing • A general optimization technique • Iterative: at every step, nearby solutions are chosen with probabilities depending on how good they are (so even less optimal solutions may be chosen) • As the algorithm proceeds and the ”temperature” decreases, the probability of choosing less optimal solutions also decreases • It is a good general way to avoid getting stuck in local optima (see the sketch below)
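
A toy sketch of simulated annealing applied to clustering: reassign one object at a time, minimizing the total within-cluster sum of squares, and accept worse solutions with a probability that shrinks as the temperature falls. The starting temperature and cooling schedule below are arbitrary illustrative choices.

```r
wss <- function(X, labels) {             # total within-cluster sum of squares
  sum(sapply(unique(labels), function(g) {
    Xg <- X[labels == g, , drop = FALSE]
    sum(scale(Xg, center = TRUE, scale = FALSE)^2)
  }))
}

anneal <- function(X, k = 3, iters = 5000, temp = 1, cooling = 0.999) {
  labels <- sample(k, nrow(X), replace = TRUE)   # random initial partition
  cost <- wss(X, labels)
  for (i in seq_len(iters)) {
    cand <- labels
    cand[sample(nrow(X), 1)] <- sample(k, 1)     # move one object
    new.cost <- wss(X, cand)
    ## accept improvements always, worse solutions with shrinking probability
    if (new.cost < cost || runif(1) < exp((cost - new.cost) / temp)) {
      labels <- cand
      cost   <- new.cost
    }
    temp <- temp * cooling
  }
  labels
}
```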

  24. 4. Assessing and using the result • Visualization and summarization of the clusters • Note: You should always investigate the dependence of your results on the choices you have made for the clustering!

  25. Examples of applications of clustering • Image analysis • Speech recognition • Data mining • ....

  26. Clustering microarray data • Samples are columns, genes are rows, in the data matrix • What values to cluster? • What is a biologically relevant measure of similarity? • One can cluster genes and/or samples

  27. Clustering microarray data • Usually, use logged data • Data should be on the same scale (this is usually the case if you use data that is already normalized) • You may have to filter away genes that show too little variation across samples • Use a distance measure appropriate for the question you want to focus on (Pearson correlation often works OK) • Use an appropriate clustering algorithm (hierarchical average linkage usually works OK) • If you draw some conclusion from the clustering results, vary your clustering choices to see how stable the results are • Clustering works best as a tool to generate hypotheses and ideas, which may then be tested in other ways
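
A workflow sketch following these recommendations, for a hypothetical matrix expr of already-normalized, logged intensities with genes as rows and samples as columns; the variation cutoff of 0.5 is an arbitrary illustration.

```r
keep   <- apply(expr, 1, sd) > 0.5        # filter genes with little variation
expr.f <- expr[keep, ]

## 1 - Pearson correlation as the distance, average linkage for the clustering
gene.dist   <- as.dist(1 - cor(t(expr.f)))   # between genes (rows)
sample.dist <- as.dist(1 - cor(expr.f))      # between samples (columns)

gene.hc   <- hclust(gene.dist,   method = "average")
sample.hc <- hclust(sample.dist, method = "average")
plot(sample.hc)                              # dendrogram of the samples
```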

  28. Clustering tumor samples

  29. Clustering to confirm or reject hypotheses? • A clustering may appear to validate, or be validated by, a grouping derived from other data • Caution: the many different ways to do a clustering may make it possible to tweak it to produce the clusters you want • There is a huge and complex multiple-testing problem • Note that small changes in the data can change the result dramatically • If you insist on trying to get ”significance”: • Use permutations of the data • Use resampling of the data (bootstrapping)
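
One crude resampling sketch, reusing the hypothetical filtered matrix expr.f from above: resample genes with replacement and record how often each pair of samples ends up in the same cluster. This only probes the stability of the clusters; it is not a formal significance test.

```r
B <- 200                        # number of bootstrap resamples
k <- 2                          # number of clusters to cut the tree into
n <- ncol(expr.f)
co.cluster <- matrix(0, n, n)

for (b in seq_len(B)) {
  idx <- sample(nrow(expr.f), replace = TRUE)          # resample genes
  d   <- as.dist(1 - cor(expr.f[idx, ]))
  cl  <- cutree(hclust(d, method = "average"), k = k)
  co.cluster <- co.cluster + outer(cl, cl, "==")       # same-cluster indicator
}
co.cluster / B   # fraction of resamples in which each pair of samples co-clusters
```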

  30. How to do clustering: Programs • A good program for clustering and visualization: HCE • Great visualization options • Adapted to microarray data • http://www.cs.umd.edu/hcil/hce/ • Can import similarity matrices • Classic for microarray data: Cluster & TreeView (Eisen) • R/BioConductor: package cluster, hclust function, heatmap function, ... • Many other programs/packages

  31. Other visualization techniques: Principal Components • The principal components can be viewed as the axes of a “better” coordinate system for the data. • “Better” in the sense that the data is maximally spread out along the first principal components. • The principal components correspond to eigenvectors of the covariance matrix of the data. • The eigenvalues represent the part of the total variance explained by each of the principal components.
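
A minimal PCA sketch with base R's prcomp(), again on the hypothetical expression matrix; the samples are transposed into rows so that they become the points being plotted:

```r
pca <- prcomp(t(expr.f))        # principal components of the samples
summary(pca)                    # proportion of variance explained per component
plot(pca$x[, 1], pca$x[, 2],
     xlab = "PC1", ylab = "PC2")   # samples in the first two components
```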

  32. Principal component analysis of expression data

  33. Principal component analysis of expression data

  34. Other visualization techniques: Multidimensional scaling • Start with points in a high-dimensional space • Goal: display these points in a lower dimension, so that the distances between them are similar to the distances in the original space • One may also try to preserve only the ranking of the pairwise distances • Makes it possible to use powerful visual inspection in 2 or 3 dimensions • Can sometimes give very convincing pictures separating samples in a predicted way
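
A sketch of classical (metric) multidimensional scaling with base R's cmdscale(), starting from any distance matrix, here 1 minus the correlation between the hypothetical samples; isoMDS() in the MASS package does the non-metric version that only tries to preserve the ranking of the distances.

```r
mds <- cmdscale(as.dist(1 - cor(expr.f)), k = 2)   # 2-dimensional configuration
plot(mds[, 1], mds[, 2], xlab = "MDS 1", ylab = "MDS 2")

## Non-metric (rank-based) alternative, if the MASS package is available:
## library(MASS); iso <- isoMDS(as.dist(1 - cor(expr.f)), k = 2)
```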
