
Microarray Data Analysis: Clustering and Validation Measures






Presentation Transcript


  1. Microarray Data Analysis: Clustering and Validation Measures Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo, Italy

  2. What we want (typically) Genes Expression Levels Genes Expression Matrix • Group functionally related genes together • Basic Axiom of Computational Biology: Guilt by Association. A high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… not always • Clustering

  3. What we want (typically) Clustering Solution

  4. Limitations in the Analysis Process

  5. Limitations: Microarray Technology • MIAME, we have a problem - Robert Shields, Trends in Genetics, 2006 • …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself • A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak

  6. Limitations: Visualization Tools • One of those two clusters is random noise… which one?

  7. Limitations: Statistics • Towards sound epistemological foundations of statistical methods for high-dimensional biology - T. Mehta et al., Nature Genetics, 2004 • Many papers in omics research describe the development or application of statistical methods; many of those are questionable

  8. Overview of the Remaining Part • Clustering as a three-step process • Internal validation techniques • External validation techniques • Experiments • One-stop-shop software systems • Some issues I really had to talk about

  9. Cluster Analysis as a Three Step Process

  10. What is clustering? • Group similar objects together: clustering experiments, clustering genes

  11. What is Clustering? • Goal: partition the observations {xi} so that • C(i) = C(j) if xi and xj are “similar” • C(i) ≠ C(j) if xi and xj are “dissimilar” • Natural questions: • What is a cluster? • How do I choose a good similarity function? • How do I choose a good algorithm? • APPLICATION and DATA DEPENDENT • How many clusters are REALLY present in the data?

  12. What’s a Cluster? • No rigorous definition • Subjective • Scale/resolution dependent (e.g., hierarchy)

  13. Step One • Choose a good similarity function • Euclidean distance: captures magnitude and pattern of expression, i.e., direction • Correlation functions: capture pattern of expression, i.e., direction • Etc…
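The trade-off between the two families can be sketched in a few lines of plain Python (an illustration, not code from the talk): two genes with the same expression pattern but different magnitudes are far apart under Euclidean distance, yet essentially identical under correlation distance.

```python
import math

def euclidean(x, y):
    # Captures both magnitude and pattern (direction) of expression
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    # 1 - Pearson correlation: captures the pattern only, not the magnitude
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two genes with the same pattern of expression, different magnitudes:
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(g1, g2))         # large: the magnitudes differ
print(pearson_distance(g1, g2))  # ~0: the patterns coincide
```

Which behavior is right is application and data dependent, as the slides stress.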

  14. Step Two • Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize • Compactness: intra-cluster variation is small • They like well separated or spherical clusters but fail on more complex cluster shapes • K-means, Average Link Hierarchical Clustering • Connectedness: neighboring items should share the same cluster • Robust with respect to cluster shapes, but fails when separation in the data is poor • Single Link Hierarchical Clustering, CAST, CLICK • Spatial Separation: a poor performer by itself, usually coupled with other criteria • Simulated Annealing, Tabu Search
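As an illustration of the compactness family, here is a minimal Lloyd-style k-means in plain Python. This is a sketch: the initialization is deliberately simplified to the first k points, whereas real implementations use random restarts or k-means++ seeding.

```python
def kmeans(data, k, iters=100):
    # Lloyd's algorithm: alternately assign points to the nearest centroid
    # and move each centroid to the mean of its cluster, which locally
    # minimizes the within-cluster variation (compactness).
    dim = len(data[0])
    centroids = [list(p) for p in data[:k]]  # simplified initialization
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                        for d in range(dim)))
                  for p in data]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            new.append([sum(m[d] for m in members) / len(members)
                        for d in range(dim)] if members else centroids[c])
        if new == centroids:  # converged
            break
        centroids = new
    return labels

# Two well-separated blobs: k-means recovers them
data = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0],
        [0.0, 0.1], [10.1, 10.0], [10.0, 10.1]]
print(kmeans(data, 2))
```

A connectedness-based method such as single-link would instead repeatedly merge the two closest clusters, which handles elongated shapes but breaks down when the groups are not well separated.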

  15. Step Three • An index that tells us how many clusters are really present in the data: Consistency/Uniformity. More likely to be 2 than 3; more likely to be 2 than 36? (Depends: what if each circle represents 1000 objects?)

  16. Step Three • An index that tells us: Separability (increasing confidence that the number of clusters is 2)


  19. Step Three • An index that is • independent of cluster “volume”? • independent of cluster size? • independent of cluster shape? • sensitive to outliers? • etc… • Theoretically sound: the Gap Statistic • Data driven and validated: many

  20. Internal Validation Measures • How many clusters are really present in the data? • Assess cluster quality • Internal: no external knowledge about the dataset is given

  21. The Basic Scheme • Given an index F, a function of a clustering solution • A black box producing clustering solutions with k = 2, …, m clusters • Compute F on each solution to decide which k is best

  22. Internal Validation Measures • Within-Cluster Sum of Squares [Folklore] • Gap Statistic [Tibshirani, Walther, Hastie 2001] • FOM [Yeung, Haynor, Ruzzo 2001] • Consensus Clustering [Monti et al., 2003] • Etc…

  23. Within-Cluster Sum of Squares [figure: points xi, xj within a cluster]

  24. Within-Cluster Sum of Squares • A measure of the compactness of clusters: Wk = Σr=1..k Σxi∈Cr ‖xi − μr‖², where μr is the centroid of cluster Cr
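Computing Wk takes only a few lines. The sketch below (plain Python, illustrative variable names) sums the squared Euclidean distances of each point to its cluster centroid:

```python
def within_cluster_ss(data, labels):
    # Wk: for each cluster, sum the squared Euclidean distances of its
    # points to the cluster centroid; small Wk means compact clusters.
    clusters = {}
    for point, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(point)
    wk = 0.0
    for points in clusters.values():
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points)
                    for d in range(dim)]
        wk += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                  for p in points)
    return wk

# Two tight clusters: every point is 0.5 from its centroid in one coordinate
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(within_cluster_ss(data, [0, 0, 1, 1]))  # 4 * 0.25 = 1.0
```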

  25. Using Wk to determine # clusters • Idea of the L-Curve Method: use the k corresponding to the “elbow”, i.e., the last k that yields a significant improvement in goodness-of-fit before the curve flattens
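One crude way to automate the elbow choice (an illustrative heuristic, not a method from the talk) is to pick the k where the Wk curve bends most sharply, i.e., where a large drop is followed by a small one:

```python
def elbow_k(wk):
    # wk: dict mapping each k (consecutive values) to its W_k.
    # Bend score at an interior k: the drop arriving at k minus the
    # drop leaving k; the elbow is the k with the sharpest bend.
    ks = sorted(wk)
    return max(ks[1:-1],
               key=lambda k: (wk[k - 1] - wk[k]) - (wk[k] - wk[k + 1]))

wk = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 17.0}
print(elbow_k(wk))  # 3: Wk flattens after k = 3
```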

  26. Example • Yeast Cell Cycle dataset, 698 genes and 72 conditions • Five functional classes (the gold solution) • Algorithm: K-means with Average Link input and Euclidean distance • We want to know how many clusters are predicted by Wk, with K-means as an “oracle”

  27. Example

  28. Problems with the Use of Wk • No reference clustering solution to compare against, i.e., no model • The values of Wk are not normalized and therefore cannot be compared • In a nutshell: we get values of Wk but we do not quite know how far we are from randomness • The Gap Statistic takes care of those problems

  29. The Gap Statistic • Based on solid statistical work for the 1-D case, i.e., the objects to be clustered are scalars; takes care of the problems outlined for Wk • Extended to work in higher dimensions, with no supporting theory • Validated experimentally

  30. Sample Uniformly and at Random • Align with feature axes (data-geometry independent) • Bounding box (aligned with feature axes) • Monte Carlo simulations of the observations

  31. Computation of the Gap Statistic
  for b = 1 to B: compute a Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
  for k = 1 to K:
    cluster the observations into k groups and compute log Wk
    for b = 1 to B: cluster the b-th M.C. sample into k groups and compute log Wkb
    compute Gap(k) = (1/B) Σb log Wkb − log Wk
    compute sd(k), the s.d. of {log Wkb} b = 1, …, B
    set the total s.e. s(k) = sd(k) · √(1 + 1/B)
  Find the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1)
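The procedure can be sketched directly in Python. In this illustrative sketch, `cluster(data, k)` stands for any black-box clustering algorithm returning labels, `wk(data, labels)` computes the within-cluster sum of squares, and the reference data are sampled uniformly in the axis-aligned bounding box of the observations:

```python
import math
import random

def gap_statistic(data, cluster, wk, k_max, B=20, seed=0):
    # cluster(data, k) -> labels; wk(data, labels) -> within-cluster SS
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    gap, s = {}, {}
    for k in range(1, k_max + 1):
        log_wk = math.log(wk(data, cluster(data, k)))
        log_wkb = []
        for _ in range(B):
            # Monte Carlo reference: uniform in the bounding box
            ref = [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
                   for _ in range(len(data))]
            log_wkb.append(math.log(wk(ref, cluster(ref, k))))
        mean = sum(log_wkb) / B
        sd = math.sqrt(sum((v - mean) ** 2 for v in log_wkb) / B)
        gap[k] = mean - log_wk
        s[k] = sd * math.sqrt(1 + 1 / B)
    # Smallest k such that Gap(k) >= Gap(k+1) - s(k+1)
    for k in range(1, k_max):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k
    return k_max
```

The Monte Carlo step is exactly what normalizes Wk against randomness: a gap near zero means the clustering is no tighter than what uniform noise in the same bounding box would produce.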

  32. Example • The same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether the Gap Statistic predicts 5 clusters, with K-means as an “oracle”

  33. Example

  34. Figure of Merit • A purely experimental approach, designed and validated specifically for microarray data

  35. FOM Experiments [figure: an n × m expression matrix R(g, e), genes 1…n by conditions 1…e…m, with the rows grouped into clusters C1, …, Ci, …, Ck and condition e left out]

  36. FOM • For a left-out condition e and k clusters: FOM(e, k) = √( (1/n) Σi Σg∈Ci (R(g, e) − μCi(e))² ), where μCi(e) is the mean expression of cluster Ci in condition e • Aggregate FOM(k) = Σe FOM(e, k); lower is better
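A sketch of the 2-norm FOM of Yeung et al. in plain Python (illustrative: `cluster(reduced, k)` stands for any clustering algorithm, and the adjustment factor √((n−k)/n) from the original paper is omitted). Each condition e is left out in turn, the genes are clustered on the remaining conditions, and the clusters are scored by how tightly they predict the left-out condition:

```python
import math

def fom(matrix, cluster, k):
    # matrix: genes x conditions expression values R(g, e)
    n_genes, n_cond = len(matrix), len(matrix[0])
    total = 0.0
    for e in range(n_cond):
        # Cluster the genes with condition e hidden
        reduced = [row[:e] + row[e + 1:] for row in matrix]
        labels = cluster(reduced, k)
        groups = {}
        for g, lab in enumerate(labels):
            groups.setdefault(lab, []).append(g)
        # Score: within-cluster scatter of the hidden condition
        ss = 0.0
        for genes in groups.values():
            mu = sum(matrix[g][e] for g in genes) / len(genes)
            ss += sum((matrix[g][e] - mu) ** 2 for g in genes)
        total += math.sqrt(ss / n_genes)
    return total  # aggregate FOM over all conditions; lower is better
```

A clustering that groups genuinely co-expressed genes together predicts a left-out condition well and therefore earns a low FOM.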

  37. Example • Same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether FOM indicates 5 clusters in the dataset, with K-means as an “oracle” • Hint: look for the elbow in the FOM plot, exactly as for the Wk curve

  38. Example

  39. External Validation Measures • Given two partitions of the same dataset, how close are they? • Assess the quality of a partition against a given gold standard • External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of biology, the elements in a cluster must be biologically correlated, i.e., in the same functional group of genes

  40. Some External Validation Measures • The two partitions must have the same number of classes • Jaccard Index • Minkowski score • Rand Index [Rand 71] • The two partitions can have a different number of classes • The Adjusted Rand Index [Hubert and Arabie 85] • The F measure [van Rijsbergen 79]
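As an illustration of the pair-counting idea behind these indexes, here is a minimal Jaccard index in plain Python (a sketch): of all object pairs placed together by at least one partition, count the fraction placed together by both.

```python
def jaccard_index(labels_a, labels_b):
    # Pair-counting Jaccard over two partitions given as label vectors
    n = len(labels_a)
    both = either = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            both += same_a and same_b
            either += same_a or same_b
    return both / either

print(jaccard_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: identical partitions
print(jaccard_index([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.25
```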

  41. Some External Validation Measures • Problem with the mentioned indexes: what is their expected value? • In very intuitive terms, if one blindly picks two partitions among the possible partitions of the data, what value of the index should we expect? The same problem we had with the Gap Statistic.

  42. The Adjusted Rand Index • It takes as input two partitions, not necessarily having the same number of classes • Value 1, its maximum, means perfect agreement • The expected value of the index, i.e., its value on two random partitions, is zero • Note 1: the index may take negative values • Note 2: the same property is not shared by the other mentioned indexes, including its relative, the Rand Index • The index must be maximized • We will see some of its uses later


  44. Adjusted Rand Index • Compare clusters to classes • Consider the number of pairs of objects

  45. Example (Adjusted Rand) Closed form in the paper by Handl et al. (supplementary material)
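The closed form is short enough to implement directly. A plain-Python sketch following the Hubert–Arabie formula (illustrative variable names):

```python
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    # Build the contingency table between the two partitions
    n = len(labels_a)
    table, row, col = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        table[(a, b)] = table.get((a, b), 0) + 1
        row[a] = row.get(a, 0) + 1
        col[b] = col.get(b, 0) + 1
    # Pair counts within cells, rows (clusters) and columns (classes)
    index = sum(comb(c, 2) for c in table.values())
    sum_row = sum(comb(c, 2) for c in row.values())
    sum_col = sum(comb(c, 2) for c in col.values())
    expected = sum_row * sum_col / comb(n, 2)  # value under random partitions
    max_index = (sum_row + sum_col) / 2
    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # ~ -0.5: worse than chance
```

Note how the subtraction of `expected` delivers the chance-corrected behavior the slides describe: zero in expectation for random partitions, with negative values possible.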

  46. Some Experiments, or on the Need for Benchmark Datasets

  47. How Do I Pick: • distance and similarity functions, given algorithm and dataset • algorithm, given dataset • internal validation measures, given dataset

  48. Different Distances, Same Algorithm and Implementation (k-means)

  49. Same Distance, Two Different Implementations of the Same Algorithm: not all k-means are equal

  50. Performance of Different Algorithms: Precision
