1 / 108

BINF636 Clustering and Classification

BINF636 Clustering and Classification. Jeff Solka Ph.D. Fall 2008. Gene Expression Data. samples. Genes. x gi = expression for gene g in sample i. The Pervasive Notion of Distance.

tablita
Télécharger la présentation

BINF636 Clustering and Classification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BINF636Clustering and Classification Jeff Solka Ph.D. Fall 2008 BINF636 CLUSTERING AND CLASSIFICATION

  2. Gene Expression Data samples Genes xgi = expression for gene g in sample i BINF636 CLUSTERING AND CLASSIFICATION

  3. The Pervasive Notion of Distance • We have to be able to measure similarity or dissimilarity in order to perform clustering, dimensionality reduction, visualization, and discriminant analysis. • How we measure distance can have a profound effect on the performance of these algorithms. BINF636 CLUSTERING AND CLASSIFICATION

  4. Distance Measures and Clustering • Most of the common clustering methods such as k-means, partitioning around medoid (PAM) and hierarchical clustering are dependent on the calculation of distance or an interpoint distance matrix. • Some clustering methods such as those based on spectral decomposition have a less clear dependence on the distance measure. BINF636 CLUSTERING AND CLASSIFICATION

  5. Distance Measures and Discriminant Analysis • Many supervised learning procedures (a.k.a. discriminant analysis procedures) also depend on the concept of a distance. • nearest neighbors • k-nearest neighbors • Mixture-models BINF636 CLUSTERING AND CLASSIFICATION

  6. BINF636 CLUSTERING AND CLASSIFICATION

  7. BINF636 CLUSTERING AND CLASSIFICATION

  8. BINF636 CLUSTERING AND CLASSIFICATION

  9. Two Main Classes of Distances • Consider two gene expression profiles as expressed across I samples. Each of these can be considered as points in RI space. We can calculate the distance between these two points. • Alternatively we can view the gene expression profiles as being manifestations of samples from two different probability distributions. BINF636 CLUSTERING AND CLASSIFICATION

  10. BINF636 CLUSTERING AND CLASSIFICATION

  11. A General Framework for Distances Between Points • Consider two m-vectors x = (x1, …, xm) and y = (y1, …, ym). Define a generalized distance of the form • We call this a pairwise distance function as the pairing of features within observations is preserved. BINF636 CLUSTERING AND CLASSIFICATION

  12. Minkowski Metric • Special case of our generalized metric BINF636 CLUSTERING AND CLASSIFICATION

  13. Euclidean and Manhattan Metric BINF636 CLUSTERING AND CLASSIFICATION

  14. Correlation-based Distance Measures • Championed for use within the microarray literature by Eisen. • Types • Pearson’s sample correlation distance. • Eisen’s cosine correlation distance. • Spearman sample correlation distance. • Kendall’s t sample correlation. BINF636 CLUSTERING AND CLASSIFICATION

  15. Pearson Sample Correlation Distance (COR) BINF636 CLUSTERING AND CLASSIFICATION

  16. Eisen Cosine Correlation Distance (EISEN) BINF636 CLUSTERING AND CLASSIFICATION

  17. Spearman Sample Correlation Distance (SPEAR) BINF636 CLUSTERING AND CLASSIFICATION

  18. Tau Kendall’s t Sample Correlation (TAU) BINF636 CLUSTERING AND CLASSIFICATION

  19. Some Observations - I • Since we are subtracting the correlation measures from 1, things that are perfectly positively correlated (correlation measure of 1) will have a distance close to 0 and things that are perfectly negatively correlated (correlation measure of -1) will have a distance close to 2. • Correlation measures in general are invariant to location and scale transformations and tend to group together genes whose expression values are linearly related. BINF636 CLUSTERING AND CLASSIFICATION

  20. Some Observations - II • The parametric methods (COR and EISEN) tend to be more negatively effected by the presence of outliers than the non-parametric methods (SPEAR and TAU) • Under the assumption that we have standardized the data so that x and y are m-vectors with zero mean and unit length then there is a simple relationship between the Pearson correlation coefficient r(x,y) and the Euclidean distance BINF636 CLUSTERING AND CLASSIFICATION

  21. Mahalanobis Distance • This allows data directional variability to come into play when calculating distances. • How do we estimate S? BINF636 CLUSTERING AND CLASSIFICATION

  22. Distances and Transformations • Assume that g is an invertible possible non-linear transformation g: x  x’ • This transformation induces a new metric d’ via BINF636 CLUSTERING AND CLASSIFICATION

  23. Distances and Scales • Original scanned fluorescence intensities • Logarithmically transformed data • Data transformed by the general logarithm BINF636 CLUSTERING AND CLASSIFICATION

  24. Experiment-specific Distances Between Genes • One might like to use additional experimental design information in deterring how one calculates distances between the genes. • One might wish to used smoothed estimates or other sorts of statistical fits and measure distances between these. • In time course data distances that honor the time order of the data are appropriate. BINF636 CLUSTERING AND CLASSIFICATION

  25. Standardizing Genes BINF636 CLUSTERING AND CLASSIFICATION

  26. Standardizing Arrays (Samples) BINF636 CLUSTERING AND CLASSIFICATION

  27. Scaling and Its Implication to Data Analysis - I • Types of gene expression data • Relative (cDNA) • Absolute (Affymetrix) • xgi is the expression of gene g on sample I as measured on a log scale • Let ygi = xgi – xgA; patient A is our reference • The distance between patient samples BINF636 CLUSTERING AND CLASSIFICATION

  28. Scaling and Its Implication to Data Analysis - II BINF636 CLUSTERING AND CLASSIFICATION

  29. Scaling and Its Implication to Data Analysis - III The distance between two genes are given by BINF636 CLUSTERING AND CLASSIFICATION

  30. Summary of Effects of Scaling on Distance Measures • Minkowski distances • Distance between samples is the same for relative and absolute measures • Distance between genes is not the same for relative and absolute measures • Pearson correlation-based distance • Distances between genes is the same for relative and absolute measures • Distances between samples is not the same for relative and absolute measures BINF636 CLUSTERING AND CLASSIFICATION

  31. What is Cluster Analysis? • Given a collection of n objects each of which is described by a set of p characteristics or variables derive a useful division into a number of classes. • Both the number of classes and the properties of the classes are to be determined. (Everitt 1993) BINF636 CLUSTERING AND CLASSIFICATION

  32. Why Do This? • Organize • Prediction • Etiology (Causes) BINF636 CLUSTERING AND CLASSIFICATION

  33. How Do We Measure Quality? • Multiple Clusters • Male, Female • Low, Middle, Upper Income • Neither True Nor False • Measured by Utility BINF636 CLUSTERING AND CLASSIFICATION

  34. Difficulties In Clustering • Cluster structure may be manifest in a multitude of ways • Large data sets and high dimensionality complicate matters BINF636 CLUSTERING AND CLASSIFICATION

  35. Clustering Prerequisites • Method to measure the distance between observations and clusters • Similarity • Dissimilarity • This was discussed previously • Method of normalizing the data • We discussed this previously • Method of reducing the dimensionality of the data • We discussed this previously BINF636 CLUSTERING AND CLASSIFICATION

  36. The Number of Groups Problem • How Do We Decide on the Appropriate Number of Clusters? • Duda, Hart and Stork (2001) • Form Je(2)/Je(1) where Je(M) is the sum of squares error criterion for the m cluster model. The distribution of this ratio is usually not known. BINF636 CLUSTERING AND CLASSIFICATION

  37. Optimization Methods • Minimizing or Maximizing Some Criteria • Does Not Necessarily Form Hierarchical Clusters BINF636 CLUSTERING AND CLASSIFICATION

  38. Clustering Criteria The Sum of Squared Error Criteria BINF636 CLUSTERING AND CLASSIFICATION

  39. Spoofing of the Sum of Squares Error Criterion BINF636 CLUSTERING AND CLASSIFICATION

  40. Related Criteria • With a little manipulation we obtain • Instead of using average squared distances betweens points in a cluster as indicated above we could use perhaps the median or maximum distance • Each of these will produce its own variant BINF636 CLUSTERING AND CLASSIFICATION

  41. Scatter Criteria BINF636 CLUSTERING AND CLASSIFICATION

  42. Relationship of the Scattering Criteria BINF636 CLUSTERING AND CLASSIFICATION

  43. Measuring the Size of Matrices • So we wish to minimize SW while maximizing SB • We will measure the size of a matrix by using its trace of determinant • These are equivalent in the case of univariate data BINF636 CLUSTERING AND CLASSIFICATION

  44. Interpreting the Trace Criteria BINF636 CLUSTERING AND CLASSIFICATION

  45. The Determinant Criteria • SB will be singular if the number of clusters is less than or equal to the dimensionality • Partitions based on Je may change under linear transformations of the data • This is not the case with Jd BINF636 CLUSTERING AND CLASSIFICATION

  46. Other Invariant Criteria • It can be shown that the eigenvalues of SW-1SB are invariant under nonsingular linear transformation • We might choose to maximize BINF636 CLUSTERING AND CLASSIFICATION

  47. k-means Clustering • Begin initialize n, k, m1, m2, …, mk • Do classify n samples according to nearest mi Recompute mi • Until no no change in mi • Return m1, m2, .., mk • End • Complexity of the algorithm is O(ndkT) • T is the number of iterations • T is typically << n BINF636 CLUSTERING AND CLASSIFICATION

  48. Example Mean Trajectories BINF636 CLUSTERING AND CLASSIFICATION

  49. Optimizing the Clustering Criterion • N(n,g) = The number of partitions of n individuals into g groups N(15,3)=2,375,101 N(20,4)=45,232,115,901 N(25,8)=690,223,721,118,368,580 N(100,5)=1068 Note that the 3.15 x 10 17 is the estimated age of the universe in seconds BINF636 CLUSTERING AND CLASSIFICATION

  50. Hill Climbing Algorithms 1 - Form initial partition into required number of groups 2 - Calculate change in clustering criteria produced by moving each individual from its own to another cluster. 3 - Make the change which leads to the greatest improvement in the value of the clustering criterion. 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve. • Guarantees local not global optimum BINF636 CLUSTERING AND CLASSIFICATION

More Related