Clustering

Clustering preliminaries: log2 transformation, row centering and normalization, and filtering. Log2 transformation makes sure that the noise is independent of the mean and that similar differences have the same meaning along the dynamic range of the values.

shubha

Presentation Transcript


  1. Clustering

  2. Clustering Preliminaries: log2 transformation; row centering and normalization; filtering.

  3. Log2 Transformation. Advantages of log2 transformation: it makes sure that the noise is independent of the mean, and that similar differences have the same meaning along the dynamic range of the values. We would like dist(100, 200) = dist(1000, 2000).
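
The dist(100, 200) = dist(1000, 2000) requirement can be checked with a minimal sketch. The helper name `log2_distance` is hypothetical and not part of the presentation; it only illustrates that a two-fold change has the same size on the log2 scale wherever it occurs in the dynamic range.

```python
import math


def log2_distance(a, b):
    """Absolute difference between two positive values on the log2 scale."""
    return abs(math.log2(b) - math.log2(a))


# On the raw scale the two differences are not comparable.
print(200 - 100, 2000 - 1000)

# On the log2 scale both are a two-fold change, so both distances are 1
# (up to floating-point rounding).
print(log2_distance(100, 200), log2_distance(1000, 2000))
```

The values must be strictly positive; how zeros or negative values are handled before the transformation is not stated on the slide.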

  4. Row Centering & Normalization: for each row x, y = x - mean(x), then z = y / stdev(y).
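
A minimal sketch of the two steps for a single row, using only the Python standard library. The function name `center_and_normalize` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption: the slide does not say which estimator is meant.

```python
from statistics import mean, stdev


def center_and_normalize(row):
    """Return z = (x - mean(x)) / stdev(x - mean(x)) for one row of values."""
    m = mean(row)
    y = [v - m for v in row]      # row centering
    s = stdev(y)                  # sample standard deviation (assumption)
    if s == 0:
        raise ValueError("constant row: cannot normalize by a zero stdev")
    return [v / s for v in y]     # normalization


z = center_and_normalize([2.0, 4.0, 6.0, 8.0])
print(z)                  # centered values scaled to unit standard deviation
print(mean(z), stdev(z))  # approximately 0 and 1
```

Constant rows have zero standard deviation and cannot be normalized this way; the sketch raises an error for them rather than guessing a policy.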

  5. Filtering Genes. Filtering is very important for unsupervised analysis, since many noisy genes may totally mask the structure in the data. After finding a hypothesis, one can identify marker genes in a larger dataset via supervised analysis. Diagram labels: All genes; Filtering; Clustering; Supervised Analysis / Marker Selection.
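
One common way to filter is by variation across samples. The sketch below is an illustration only: the function name `variation_filter`, the two criteria (fold change and absolute difference between a gene's maximum and minimum) and the threshold values are all assumptions, not parameters given in the presentation.

```python
def variation_filter(expression, min_fold=3.0, min_delta=100.0):
    """Keep genes whose values vary enough across samples.

    expression: dict mapping gene name -> list of positive expression values.
    A gene is kept if max/min >= min_fold and max - min >= min_delta.
    Both thresholds are hypothetical defaults chosen for this example.
    """
    kept = {}
    for gene, values in expression.items():
        lo, hi = min(values), max(values)
        if lo > 0 and hi / lo >= min_fold and hi - lo >= min_delta:
            kept[gene] = values
    return kept


data = {
    "geneA": [100, 110, 95, 105],   # nearly flat: low fold change
    "geneB": [50, 400, 60, 380],    # varies strongly
    "geneC": [20, 25, 70, 22],      # fold change ok, absolute change small
}
print(sorted(variation_filter(data)))
```

Only `geneB` passes both criteria here. Stricter thresholds leave fewer, more variable genes for the clustering step.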

  6. Clustering/Class Discovery. Aim: partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are “more similar” to each other than to points of other clusters. Challenge: not well defined; there is no single objective function or evaluation criterion. Example: how many clusters? 2 + noise, 3 + noise, 20, hierarchical: 23 + noise. One has to choose: a similarity/distance measure; a clustering method; how to evaluate the clusters.

  7. Clustering in GenePattern
     • Representative based: find representatives/centroids. K-means: KMeansClustering. Self Organizing Maps (SOM): SOMClustering.
     • Bottom-up (agglomerative): HierarchicalClustering. Hierarchically unite clusters using single linkage analysis, complete linkage analysis, or average linkage analysis.
     • Clustering-like: NMFConsensus; PCA (Principal Components Analysis).
     • No BEST method! For easy problems, most of them work. Each algorithm has its assumptions, strengths and weaknesses.

  8. K-means Clustering. Aim: partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.

  9. K-means: Algorithm (K=3)
     • Initialize centroids at random positions.
     • Iterate: assign each data point to its closest centroid; move centroids to the center of their assigned points.
     • Stop when converged.
     • Guaranteed to reach a local minimum.
     • Figure panels: Iteration = 0, Iteration = 1, Iteration = 2.
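
The iteration above can be sketched in plain Python. This is a minimal illustration, not the GenePattern KMeansClustering module. The function names are hypothetical, and two details are assumptions: centroids are initialized at K randomly chosen data points (the slide only says "random positions"), and an empty cluster keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # Step 2: move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # empty cluster: leave centroid in place
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, assignment = kmeans(pts, k=3)
print(centroids)
print(assignment)
```

Because the result is only a local minimum, different seeds can give different partitions. A common practice is to run several seeds and keep the run with the smallest sum of squared distances.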

  10. K-means: Summary
      • The result depends on the initial positions of the centroids.
      • Fast algorithm: it only needs to compute distances from data points to centroids.
      • The number of clusters must be preset.
      • Fails for non-spherical distributions.

  11. Hierarchical Clustering. Figure: five data points (1 to 5) and the corresponding dendrogram; the dendrogram's vertical axis is the distance between joined clusters.

  12. Hierarchical Clustering. Need to define the distance between the new cluster and the other clusters. Single linkage: distance between the closest pair. Complete linkage: distance between the farthest pair. Average linkage: average distance between all pairs, or distance between cluster centers. The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split). Figure: data points 1 to 5 and their dendrogram; the vertical axis is the distance between joined clusters.
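
The three linkage rules can be compared with a small agglomerative sketch in plain Python. The function names are hypothetical, the implementation is the naive one (it recomputes all pairwise cluster distances at every step), and average linkage is taken as the average over all pairs, not the distance between cluster centers.

```python
import math
from itertools import combinations


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (tuples of point indices)."""
    d = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair
        return min(d)
    if linkage == "complete":    # farthest pair
        return max(d)
    if linkage == "average":     # average over all pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Greedily merge the two closest clusters until one remains.

    Returns a list of (cluster_a, cluster_b, height) merges; the heights
    are the dendrogram's "distance between joined clusters".
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        height = cluster_distance(a, b, points, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


pts = [(0.0,), (1.0,), (3.0,), (7.0,)]
for method in ("single", "complete", "average"):
    print(method, [round(h, 3) for _, _, h in agglomerate(pts, method)])
```

On these four points the merge order is the same for all three rules, but the heights differ: single linkage reports the smallest joining distances and complete linkage the largest.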

  13. Average Linkage: leukemia samples and genes.

  14. Single and Complete Linkage: leukemia samples and genes. Figure panels: Complete-linkage; Single-linkage.

  15. Similarity/Distance Measures. Decide which samples/genes should be clustered together:
      • Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula.
      • Pearson correlation: a parametric measure of the strength of linear dependence between two variables.
      • Absolute Pearson correlation: the absolute value of the Pearson correlation.
      • Spearman rank correlation: a non-parametric measure of monotonic dependence between two variables.
      • Uncentered correlation: same as Pearson, but assumes the mean is 0.
      • Absolute uncentered correlation: the absolute value of the uncentered correlation.
      • Kendall’s tau: a non-parametric similarity measure of the degree of correspondence between two rankings.
      • City-block/Manhattan: the distance that would be traveled to get from one point to the other if a grid-like path is followed.
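
Most of these measures are short enough to write out. The sketch below uses plain Python with hypothetical function names; Kendall's tau is omitted for brevity. Turning a correlation r into a distance as 1 - r (or 1 - |r| for the absolute variants) is a common convention, not something stated on the slide.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def uncentered_correlation(x, y):
    """Like Pearson, but assumes both means are 0 (no centering)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mx for a in x], [b - my for b in y])


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4]
y = [1, 4, 9, 16]                    # monotonic in x, but not linear
print(pearson(x, y), spearman(x, y))
print(1 - pearson(x, y))             # correlation turned into a distance
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
```

For the monotonic but non-linear pair above, the Spearman correlation is 1 while the Pearson correlation is slightly below 1, which is the practical difference between the rank-based and the linear measure.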

  16. Reasonable Distance Measure: Euclidean distance on samples and genes, computed on row-centered and normalized data. Genes: close -> correlated. Samples: a similar profile gives Gene 1 and Gene 2 a similar contribution to the distance between Sample 1 and Sample 5. Figure labels: Gene 1 to Gene 4; Sample 1; Sample 5.
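
The "close -> correlated" statement for genes can be made precise. For two rows normalized with the sample standard deviation (an assumption; the slides do not name the estimator), the squared Euclidean distance equals 2(n - 1)(1 - r), where r is the Pearson correlation and n the number of samples. The sketch below checks this numerically on made-up values; the gene profiles are illustrative, not data from the presentation.

```python
import math
from statistics import mean, stdev


def normalize(row):
    """Row centering followed by division by the sample standard deviation."""
    m, s = mean(row), stdev(row)
    return [(v - m) / s for v in row]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


# Illustrative profiles of two genes across five samples (made-up numbers).
gene1 = [1.0, 3.0, 2.0, 5.0, 4.0]
gene2 = [2.1, 6.2, 3.9, 9.8, 8.3]

n = len(gene1)
d2 = sum((a - b) ** 2 for a, b in zip(normalize(gene1), normalize(gene2)))
r = pearson(gene1, gene2)

# The two printed numbers agree up to floating-point rounding.
print(d2, 2 * (n - 1) * (1 - r))
```

So on normalized rows a small Euclidean distance between two genes means a Pearson correlation close to 1, and ranking gene pairs by either quantity gives the same order.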

  17. Pitfalls in Clustering: elongated clusters; filament; clusters of different sizes.

  18. Compact Separated Clusters: all methods work. Adapted from E. Domany.

  19. Elongated Clusters: single linkage succeeds in partitioning the data; average linkage fails.

  20. Filament: single linkage is not robust. Adapted from E. Domany.

  21. Filament with Point Removed: single linkage is not robust. Adapted from E. Domany.

  22. Two-way Clustering: two independent cluster analyses, one on genes and one on samples, are used to reorder the data.

  23. Hierarchical Clustering Summary
      • Results depend on the distance update method. Single linkage: elongated clusters. Complete linkage: sphere-like clusters.
      • Greedy iterative process.
      • NOT robust against noise.
      • No inherent measure to choose the clusters; we return to this point in cluster validation.

  24. Clustering Protocol

  25. Validating Number of Clusters: how do we know how many real clusters exist in the dataset?

  26. Consensus Clustering. Original dataset -> generate “perturbed” datasets D1, D2, ..., Dn -> apply the clustering algorithm to each Di (Clustering1, Clustering2, ..., Clusteringn) -> compute the consensus matrix -> dendrogram based on the matrix. Consensus matrix (samples s1, s2, …, sn on both axes): counts the proportion of times two samples are clustered together. • (1) two samples always cluster together (RED) • (0) two samples never cluster together (WHITE). The Broad Institute of MIT and Harvard.

  27. Consensus Clustering (continued). Same workflow as the previous slide, with the consensus matrix ordered according to the dendrogram (samples s1, s3, …, si on both axes), which shows the clusters C1, C2, C3 as blocks. Consensus matrix: counts the proportion of times two samples are clustered together. • (1) two samples always cluster together (RED) • (0) two samples never cluster together (WHITE).

  28. Validation: Consistency / Robustness Analysis. Aim: measure agreement between clustering results on “perturbed” versions of the data. Method:
      • Iterate N times: generate a “perturbed” version of the original dataset by subsampling, resampling with repeats, or adding noise; cluster the perturbed dataset.
      • Calculate the fraction of iterations where different samples belong to the same cluster.
      • Optimize the number of clusters K by choosing the value of K which yields the most consistent results.
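
The method can be sketched in plain Python. This is an illustration under stated assumptions, not the GenePattern ConsensusClustering module: the perturbation is subsampling without replacement, the subsample fraction (0.8) and number of iterations (50) are arbitrary, the base clusterer is a small k-means written for the example, and all function names are hypothetical.

```python
import random
from itertools import combinations


def simple_kmeans(points, k, rng, max_iter=50):
    """Tiny k-means used as the base clustering algorithm; returns labels."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_iter=50, fraction=0.8, seed=0):
    """Entry (i, j): fraction of runs containing both i and j in which they co-cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(fraction * n))
    for _ in range(n_iter):
        idx = rng.sample(range(n), size)                  # perturbed dataset
        labels = simple_kmeans([points[i] for i in idx], k, rng)
        for (la, i), (lb, j) in combinations(list(zip(labels, idx)), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if la == lb:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [1.0 if i == j else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (50, 50), (50, 51), (51, 50), (51, 51)]
for row in consensus_matrix(pts, k=2):
    print([round(v, 2) for v in row])
```

To choose K, the matrix can be computed for several values of K. A value of K whose entries are mostly close to 0 or 1 gives consistent clusterings, while many intermediate entries indicate that samples switch clusters between perturbed runs.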

  29. Consensus Clustering in GenePattern

  30. Clustering Cookbook
      • Reduce the number of genes by variation filtering. Use stricter parameters than for comparative marker selection.
      • Choose a method for cluster discovery (e.g. hierarchical clustering).
      • Select a number of clusters.
      • Check the sensitivity of the clusters to filtering and clustering parameters.
      • Validate on independent data sets.
      • Internally test the robustness of the clusters with consensus clustering.