
Cluster Analysis III


Presentation Transcript


  1. Cluster Analysis III 10/5/2012

  2. Outline • Estimate the number of clusters. • Evaluation of clustering results.

  3. Estimate the number of clusters Milligan & Cooper (1985) compared over 30 published rules. None is uniformly better than all the others; the best method is data dependent. 1. Calinski & Harabasz (1974): CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)], where B(k) and W(k) are the between- and within-cluster sums of squares with k clusters; choose the k that maximizes CH(k). 2. Hartigan (1975): H(k) = [W(k)/W(k+1) - 1](n-k-1); stop when H(k) < 10.
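Both indices can be computed directly from the data matrix and a set of cluster labels. A minimal NumPy sketch (function names are illustrative, not from the original papers):

```python
import numpy as np

def within_ss(X, labels):
    """Pooled within-cluster sum of squares W(k)."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def ch_index(X, labels):
    """Calinski & Harabasz (1974): [B(k)/(k-1)] / [W(k)/(n-k)]."""
    n = len(X)
    k = len(np.unique(labels))
    grand = X.mean(axis=0)
    B = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - grand) ** 2).sum()
            for c in np.unique(labels))
    return (B / (k - 1)) / (within_ss(X, labels) / (n - k))

def hartigan(X, labels_k, labels_k1):
    """Hartigan (1975): H(k) = [W(k)/W(k+1) - 1](n-k-1).

    labels_k has k clusters, labels_k1 has k+1 clusters."""
    n = len(X)
    k = len(np.unique(labels_k))
    return (within_ss(X, labels_k) / within_ss(X, labels_k1) - 1) * (n - k - 1)
```

For two well-separated groups, CH is much larger under the correct labeling than under an arbitrary one, and H(1) far exceeds the stopping threshold of 10, so both rules favor splitting.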

  4. Estimate the number of clusters 3. Gap statistic: Tibshirani, Walther & Hastie (2000) • The within-cluster sum of squares W(k) is a decreasing function of k. • One normally looks for the turning point of the elbow-shaped W(k) curve to identify the number of clusters k.

  5. Estimate the number of clusters • Instead of the above arbitrary criterion, the paper proposes to maximize the Gap statistic Gap(k) = E*[log W(k)] - log W(k). The background expectation E*[log W(k)] is calculated from random sampling from a uniform reference distribution.

  6. The background expectation is calculated from random sampling from a uniform distribution over a bounding box, aligned either with the feature axes or with the principal axes of the data. (Figure: observations and the corresponding Monte Carlo simulations under the two bounding boxes.)

  7. Computation of the Gap statistic: For b = 1 to B, compute a Monte Carlo sample X1b, X2b, ..., Xnb (n is # obs.). For k = 1 to K: cluster the observations into k groups and compute log Wk; for b = 1 to B, cluster the b-th Monte Carlo sample into k groups and compute log Wkb. Compute Gap(k) = (1/B) Σb log Wkb - log Wk. Compute sd(k), the s.d. of {log Wkb}b=1,...,B, and set the total s.e. s(k) = sd(k)·sqrt(1 + 1/B). Find the smallest k such that Gap(k) ≥ Gap(k+1) - s(k+1).
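A rough Python/NumPy sketch of this procedure, using a minimal Lloyd's k-means in place of whichever clustering routine one prefers, and a uniform reference over the feature-axis bounding box (all function names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return labels

def log_wk(X, labels):
    """log of the pooled within-cluster sum of squares."""
    return np.log(sum(((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
                      for c in np.unique(labels)))

def gap_statistic(X, k_max, B=20, seed=0):
    """Gap(k) = E*[log W(k)] - log W(k); returns (k_hat, list of Gap values)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)            # feature-axis bounding box
    log_w = [log_wk(X, kmeans(X, k)) for k in range(1, k_max + 1)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref = []
        for _ in range(B):                 # Monte Carlo reference samples
            Xb = rng.uniform(lo, hi, X.shape)
            ref.append(log_wk(Xb, kmeans(Xb, k)))
        gaps.append(np.mean(ref) - log_w[k - 1])
        s.append(np.std(ref) * np.sqrt(1 + 1 / B))
    k_hat = k_max
    for k in range(1, k_max):              # smallest k with Gap(k) >= Gap(k+1) - s(k+1)
        if gaps[k - 1] >= gaps[k] - s[k]:
            k_hat = k
            break
    return k_hat, gaps
```

On data with two tight, well-separated clusters, the Gap curve jumps sharply between k = 1 and k = 2, which is what the selection rule exploits.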

  8. Estimate the number of clusters 4. Resampling methods: Tibshirani et al. (2001); Dudoit and Fridlyand (2002). Split the data into training and test sets, treat the clustering of the test data as the underlying truth, and compare it with the cluster memberships predicted from the training data.

  9. Prediction strength. Xtr: training data; Xte: testing data; C(Xtr, k): K-means clustering centroids from the training data; D[C(Xtr, k), Xte]: comembership matrix, whose (i, i') element is 1 if test observations i and i' fall into the same cluster when assigned to the nearest training centroids, and 0 otherwise. Let Ak1, Ak2, ..., Akk be the K-means clusters from the test data Xte, and nkj = #(Akj). Then ps(k) = min over 1 ≤ j ≤ k of [1/(nkj(nkj - 1))] Σ over i ≠ i' in Akj of D[C(Xtr, k), Xte]ii'. Find the k that maximizes ps(k).
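A minimal NumPy sketch of prediction strength, again using a simple Lloyd's k-means as the clustering routine (function names are illustrative):

```python
import numpy as np

def kmeans_fit(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return centers, labels

def prediction_strength(X_tr, X_te, k):
    """ps(k): worst-cluster proportion of test-point pairs that remain
    comembers when the test points are assigned to the training centroids."""
    centers_tr, _ = kmeans_fit(X_tr, k)
    _, te_labels = kmeans_fit(X_te, k)              # A_k1, ..., A_kk from the test data
    tr_assign = ((X_te[:, None] - centers_tr[None]) ** 2).sum(-1).argmin(1)
    D = tr_assign[:, None] == tr_assign[None, :]    # comembership matrix
    ps = []
    for j in range(k):
        idx = np.where(te_labels == j)[0]
        n_kj = len(idx)
        if n_kj < 2:
            ps.append(0.0)
            continue
        same = D[np.ix_(idx, idx)].sum() - n_kj     # exclude the diagonal
        ps.append(same / (n_kj * (n_kj - 1)))
    return min(ps)
```

For two well-separated clusters, the training centroids reproduce the test clustering almost perfectly, so ps(2) is close to 1.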

  10. Estimate the number of clusters • Conclusions: • There is no dominating "good" method for estimating the number of clusters; some are good only in specific simulations or examples. • In a high-dimensional complex data set, there might not be a clear "true" number of clusters. • The problem is also one of "resolution": at a coarser resolution a few loose clusters may be identified, while at a finer resolution many small tight clusters stand out.

  11. Cluster Evaluation • Evaluation and comparison of clustering methods is always difficult. • In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation. • In unsupervised learning (clustering), external validation is usually not available. • Ideal data for cluster evaluation: • Data with class/tumor labels (for clustering samples) • Cell cycle data (for clustering genes) • Simulated data

  12. Rand Index Y = {(a,b,c), (d,e,f)}; Y' = {(a,b), (c,d,e), (f)} • Rand index: c(Y, Y') = (2+7)/15 = 0.6, the proportion of concordant pairs: 2 pairs grouped together in both clusterings plus 7 pairs separated in both, out of 15 pairs in total. • 0 ≤ c(Y, Y') ≤ 1. • Clustering methods can be evaluated by c(Y, Ytruth) if Ytruth is available.
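The Rand index reduces to counting, over all pairs of objects, how often the two clusterings agree on "together" vs. "apart". A short Python sketch:

```python
from itertools import combinations

def rand_index(y1, y2):
    """Proportion of object pairs on which two label vectors agree
    (together in both clusterings, or apart in both)."""
    pairs = list(combinations(range(len(y1)), 2))
    agree = sum((y1[i] == y1[j]) == (y2[i] == y2[j]) for i, j in pairs)
    return agree / len(pairs)
```

Encoding the slide's example as label vectors, Y = {(a,b,c), (d,e,f)} becomes [0,0,0,1,1,1] and Y' = {(a,b), (c,d,e), (f)} becomes [0,0,1,1,1,2], reproducing c(Y, Y') = 0.6.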

  13. Adjusted Rand index (Hubert and Arabie 1985): Adjusted Rand index = (Index - Expected Index) / (Maximum Index - Expected Index). The adjusted Rand index takes maximum value 1 and has constant expected value 0 when the two clusterings are totally independent.
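Using the contingency-table form of the Hubert & Arabie formula (n_ij counts the objects shared by cluster i of one clustering and cluster j of the other), a small Python sketch:

```python
import numpy as np
from math import comb

def adjusted_rand_index(y1, y2):
    """Hubert & Arabie (1985): (Index - E[Index]) / (Max Index - E[Index])."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    # contingency table n_ij between the two clusterings
    n_ij = np.array([[np.sum((y1 == a) & (y2 == b)) for b in np.unique(y2)]
                     for a in np.unique(y1)])
    sum_ij = sum(comb(int(n), 2) for n in n_ij.ravel())   # Index
    sum_a = sum(comb(int(n), 2) for n in n_ij.sum(1))     # row marginals
    sum_b = sum(comb(int(n), 2) for n in n_ij.sum(0))     # column marginals
    expected = sum_a * sum_b / comb(len(y1), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical clusterings score exactly 1 regardless of how the labels are named; for the Rand-index example on the previous slide the adjusted value is 0.4/3.4 ≈ 0.118, noticeably lower than the raw 0.6.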

  14. Comparison

  15. Simulation: • 20 time-course samples for each gene. • In each cluster, four groups of samples with similar intensity. • Individual sample and gene variation are added. • # of genes in each cluster ~ Poisson(10). • Scattered (noise) genes are added. • The simulated data closely resembles real data by visualization. (Figure: 20 samples, 15 clusters; Thalamuthu et al. 2006)

  16. Different types of perturbations • Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j. • Type II: a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added to each element of the log-transformed expression matrix, to evaluate the robustness of the clustering against potential random errors. • Type III: combination of Types I and II.
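The two perturbation types can be sketched in a few lines of NumPy (function names and the generator argument are illustrative, not from the paper's code):

```python
import numpy as np

def add_scattered_genes(expr, frac, rng):
    """Type I: append scattered genes; each sample's value is drawn from the
    empirical distribution of clustered-gene expressions in that sample."""
    n_genes, n_samples = expr.shape
    n_new = int(round(frac * n_genes))
    scattered = np.column_stack([rng.choice(expr[:, j], n_new)
                                 for j in range(n_samples)])
    return np.vstack([expr, scattered])

def add_noise(expr, sd, rng):
    """Type II: add N(0, sd^2) errors to the log-expression matrix."""
    return expr + rng.normal(0, sd, expr.shape)
```

Type III perturbation would simply compose the two functions.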

  17. Different degree of perturbation in the simulated microarray data

  18. Simulation schemes performed in the paper. In total, 25 (simulation settings) × 100 (data sets) = 2500 data sets are evaluated.

  19. Adjusted Rand index: a measure of similarity between two clusterings. • Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better). T: tight clustering M: model-based P: K-medoids K: K-means H: hierarchical S: SOM

  20. Consensus Clustering (Monti et al., 2003) Simpson et al. BMC Bioinformatics 2010, 11:590. doi:10.1186/1471-2105-11-590

  21. Consensus Clustering • The consensus matrix can be used as a distance matrix for clustering. • Alternatively, one can cluster the original data and attach a measure of robustness to each cluster.
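A minimal sketch of building a consensus matrix by repeated subsampling and re-clustering, in the spirit of Monti et al. (a toy Lloyd's k-means stands in for the clustering routine; names and parameter defaults are illustrative):

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=30):
    """Minimal Lloyd's k-means; returns cluster labels."""
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return labels

def consensus_matrix(X, k, n_runs=50, subsample=0.8, seed=0):
    """M[i, j] = fraction of subsampled runs containing both i and j
    in which they were clustered together."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_runs):
        idx = rng.choice(n, int(subsample * n), replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    return together / np.maximum(sampled, 1)
```

For stable clusters the consensus entries are near 1 within a cluster and near 0 across clusters, so 1 minus the consensus matrix behaves like a distance matrix.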

  22. Cluster and membership robustness • Cluster robustness: average connectivity within a cluster. • Membership robustness: average connectivity between one element and all other elements of its cluster.
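Given a consensus matrix M and final cluster labels, and reading "connectivity" as the consensus value for a pair of elements, the two robustness measures can be sketched as (names are illustrative):

```python
import numpy as np

def cluster_robustness(M, labels, c):
    """Average pairwise connectivity within cluster c."""
    idx = np.where(labels == c)[0]
    sub = M[np.ix_(idx, idx)]
    n = len(idx)
    return (sub.sum() - np.trace(sub)) / (n * (n - 1))  # off-diagonal mean

def membership_robustness(M, labels, i):
    """Average connectivity between element i and the other
    elements of its own cluster."""
    idx = np.where(labels == labels[i])[0]
    idx = idx[idx != i]
    return M[i, idx].mean()
```

A perfectly stable cluster (all within-cluster consensus values equal to 1) yields robustness 1 for the cluster and for each of its members.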

  23. Consensus clustering with PAM (blue) • Consensus clustering with hierarchical clustering (red) • HOPACH (black) • Fuzzy c-means (green)

  24. Evaluate the clustering results

  25. Comparison in real data sets: (see paper for detailed comparison criteria)

  26. Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in high-dimensional data analysis. • We should use these methods for visualization, investigation and hypothesis generation. • We should not use these methods inferentially. • In general, methods with resampling evaluation, allowing scattered genes, and related to model-based approaches perform better. • Hierarchical clustering specifically: it provides a picture from which one can draw many (indeed almost any) conclusions.

  27. Common mistakes or warnings: • Running K-means with a large k and getting excited to see patterns without further investigation. • K-means can show patterns even in randomly generated data, and human eyes tend to see "patterns" anyway. • Identifying genes that are predictive of survival (e.g. applying t-statistics to long and short survivors), clustering samples based on the selected genes, and finding that the samples cluster according to survival status. • The gene selection procedure is already biased towards the desired result.

  28. Common mistakes (cont'd): • Clustering samples into k groups, then performing an F-test to identify genes differentially expressed among the subgroups. • The data have been re-used both for clustering and for identifying differentially expressed genes: you will always obtain a set of "differentially expressed" genes, but you cannot tell whether the differences are real or due to chance.
