
  1. Computational Biology: Clustering. Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar. Lecture Slides, Week 9

  2. Clustering • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
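
To make the distinction concrete, here is a minimal sketch contrasting the two on the same points; it assumes NumPy and scikit-learn, which the slides do not prescribe:

```python
# Minimal sketch (assumes scikit-learn; not part of the original deck):
# a partitional and a hierarchical clustering of the same 2-D points.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Three loose blobs of two-dimensional points.
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                    for c in ((0, 0), (4, 0), (2, 3))])

# Partitional: each point lands in exactly one of k flat clusters.
flat = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Hierarchical: nested clusters; cutting the tree at 3 gives a flat view.
nested = AgglomerativeClustering(n_clusters=3).fit_predict(points)
print(flat[:10], nested[:10])
```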

  3. Partitional Clustering [Figure: the original points and a partitional clustering of them]

  4. Hierarchical Clustering [Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]

  5. Notion of a Cluster can be Ambiguous How many clusters? [Figure: the same points grouped as two clusters, four clusters, or six clusters]

  6. What is Cluster Analysis? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups [Figure: intra-cluster distances are minimized, inter-cluster distances are maximized]
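
This objective can be checked numerically: for a given labelling, the mean distance within clusters should be small relative to the mean distance between clusters. A NumPy-only sketch (the helper name is illustrative, not from the slides):

```python
# Sketch: mean intra-cluster vs. inter-cluster distance for a labelling.
import numpy as np

def intra_inter(points, labels):
    # All pairwise Euclidean distances.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(points), dtype=bool)
    # Good clusterings: first value small, second value large.
    return dist[same & off_diag].mean(), dist[~same].mean()
```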

  7. 10 Clusters Example Starting with two initial centroids in one cluster of each pair of clusters

  8. 10 Clusters Example Starting with two initial centroids in one cluster of each pair of clusters

  9. 10 Clusters Example Starting with some pairs of clusters having three initial centroids, while others have only one.

  10. 10 Clusters Example Starting with some pairs of clusters having three initial centroids, while others have only one.

  11. Limitations of K-means: Differing Sizes [Figure: original points vs. K-means with 3 clusters]

  12. Limitations of K-means: Differing Density [Figure: original points vs. K-means with 3 clusters]

  13. Limitations of K-means: Non-globular Shapes [Figure: original points vs. K-means with 2 clusters]

  14. Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequences of merges or splits
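
A sketch of both views, assuming SciPy and Matplotlib are available: linkage() records the merge sequence and dendrogram() draws it.

```python
# Sketch, assuming SciPy/Matplotlib: build the merge tree and draw it.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.random.default_rng(1).normal(size=(12, 2))
# Each row of Z records one merge: (cluster_i, cluster_j, distance, new size).
Z = linkage(points, method="average")
dendrogram(Z)
plt.show()
```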

  15. Starting Situation • Start with clusters of individual points and a proximity matrix [Figure: proximity matrix with one row and one column per point p1, p2, p3, p4, p5, ...]
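
A sketch of that starting proximity matrix, assuming SciPy:

```python
# Sketch, assuming SciPy: the starting proximity matrix for points p1..p5.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.random.default_rng(2).normal(size=(5, 2))
proximity = squareform(pdist(points, metric="euclidean"))
print(proximity.round(2))  # symmetric, with zeros on the diagonal
```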

  16. Intermediate Situation • After some merging steps, we have some clusters [Figure: clusters C1 through C5 and the corresponding proximity matrix]

  17. Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1 through C5, with C2 and C5 marked for merging, and the proximity matrix]

  18. After Merging • The question is "How do we update the proximity matrix?" [Figure: proximity matrix after merging, with the entries between C2 U C5 and C1, C3, C4 marked "?"]

  19. How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward's Method uses squared error [Figure: proximity matrix for points p1, p2, p3, p4, p5, ...; which entries define the similarity?]
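
These definitions map directly onto SciPy's linkage methods; a sketch (SciPy assumed, not prescribed by the slides):

```python
# Sketch, assuming SciPy: the inter-cluster similarity definitions above
# correspond to the 'method' argument of scipy's linkage().
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.random.default_rng(3).normal(size=(20, 2))
for method in ("single",    # MIN: closest cross-cluster pair
               "complete",  # MAX: farthest cross-cluster pair
               "average",   # group average over all cross-cluster pairs
               "centroid",  # distance between cluster centroids
               "ward"):     # Ward's method: minimize squared error
    Z = linkage(points, method=method)
```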


  24. Cluster Similarity: MIN or Single Link • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph. [Figure: points 1 through 5 with the single link drawn between the closest cross-cluster pair]

  25. Hierarchical Clustering: MIN [Figure: nested clusters of points 1 through 6 and the corresponding dendrogram]

  26. Measuring Cluster Validity Via Correlation • Two matrices: the proximity matrix and an "incidence" matrix • The incidence matrix has one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster and 0 if it belongs to different clusters • Compute the correlation between the two matrices • Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated • High correlation indicates that points that belong to the same cluster are close to each other • Not a good measure for some density- or contiguity-based clusters
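
A sketch of the computation, NumPy and SciPy assumed; it works on the n(n-1)/2 condensed entries directly:

```python
# Sketch: correlation between the condensed proximity matrix and the
# condensed 0/1 incidence matrix for a given labelling.
import numpy as np
from scipy.spatial.distance import pdist

def validity_correlation(points, labels):
    prox = pdist(points)  # the n(n-1)/2 pairwise distances
    n = len(labels)
    # 1 where a pair shares a cluster, 0 otherwise (same pair ordering).
    same = np.array([labels[i] == labels[j]
                     for i in range(n) for j in range(i + 1, n)], dtype=float)
    # Distance and co-membership should be anti-correlated, which is why
    # good clusterings show strongly negative values on the next slide.
    return np.corrcoef(prox, same)[0, 1]
```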

  27. Measuring Cluster Validity Via Correlation • Correlation of incidence and proximity matrices for the K-means clusterings of two data sets [Figure: the two data sets, with Corr = -0.9235 and Corr = -0.5810]

  28. Using Similarity Matrix for Cluster Validation • Order the similarity matrix with respect to cluster labels and inspect visually.
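
A sketch of that visual check, assuming Matplotlib: reorder rows and columns by cluster label so that good clusters appear as bright blocks on the diagonal.

```python
# Sketch, assuming SciPy/Matplotlib: similarity matrix sorted by labels.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(points, labels):
    order = np.argsort(labels)
    dist = squareform(pdist(points))
    similarity = 1.0 - dist / dist.max()  # crude distance-to-similarity map
    plt.imshow(similarity[np.ix_(order, order)])
    plt.colorbar()
    plt.show()
```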

  29. End Theory I • 5 min mind mapping • 10 min break

  30. Practice I

  31. Clustering Dataset • We will use the same datasets as last week • Have fun clustering with Orange • Try K-means clustering, hierarchical clustering, MDS • Analyze differences in the results

  32. K? • What is the k in k-means again? • How many clusters are in my dataset? • Solution: iterate over a reasonable range of ks • Do that and try to find out how many clusters there are in your data
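
A sketch of that iteration, assuming scikit-learn: run K-means for each candidate k and look for the "elbow" where the within-cluster squared error stops dropping sharply.

```python
# Sketch, assuming scikit-learn: scan k and print the squared error.
import numpy as np
from sklearn.cluster import KMeans

points = np.random.default_rng(4).normal(size=(100, 2))
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # inertia_ = sum of squared distances of points to their centroid.
    print(k, round(km.inertia_, 1))
```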

  33. End Practice I • 15 min break

  34. Theory II Microarrays

  35. Microarrays Gene Expression: We see differences between cells because of differential gene expression. A gene is expressed by transcribing DNA into single-stranded mRNA; the mRNA is later translated into a protein. Microarrays measure the level of mRNA expression.

  36. Microarrays Gene Expression: mRNA expression represents dynamic aspects of the cell. The mRNA is isolated and labeled using a fluorescent material, then hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

  37. Microarrays

  38. Microarrays

  39. Microarrays

  40. Processing Microarray Data Differentiating gene expression: R = G → not differentiated, R > G → up-regulated, R < G → down-regulated
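
A sketch of the rule with hypothetical channel intensities; in practice the log2 ratio is usually compared against a threshold rather than testing exact equality:

```python
# Sketch (hypothetical values): classify a spot from its red/green channels.
import numpy as np

def regulation(r, g, threshold=1.0):
    ratio = np.log2(r / g)          # 0 when R == G
    if ratio > threshold:
        return "up-regulated"       # R > G
    if ratio < -threshold:
        return "down-regulated"     # R < G
    return "not differentiated"     # R close to G

print(regulation(5200.0, 1300.0))   # log2(4) = 2 -> "up-regulated"
```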

  41. Processing Microarray Data Problems: extract the data from the microarrays, and analyze the meaning of the multiple arrays.

  42. Processing Microarray Data

  43. Processing Microarray Data Problems: extract the data from the microarrays, and analyze the meaning of the multiple arrays.

  44. Processing Microarray Data Microarray data: [Figure]

  45. Processing Microarray Data Clustering: find classes in the data, identify new classes, identify gene correlations. Methods: K-means clustering, hierarchical clustering, Self-Organizing Maps (SOM)

  46. Processing Microarray Data Distance Measures: Euclidean distance $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$; Manhattan distance $d(x, y) = \sum_i |x_i - y_i|$
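
Written out with NumPy on two hypothetical expression profiles:

```python
# Sketch: the two distance measures applied to hypothetical profiles.
import numpy as np

x = np.array([2.1, 0.4, 1.7, 3.0])
y = np.array([1.9, 1.0, 0.2, 2.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # root of summed squared diffs
manhattan = np.sum(np.abs(x - y))          # sum of absolute diffs
print(euclidean, manhattan)
```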

  47. Processing Microarray Data K-means Clustering: break the data into K clusters, start with a random partitioning, improve it by iterating.
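
A minimal from-scratch sketch of exactly those three steps (NumPy only; it assumes no cluster empties out during iteration):

```python
# Sketch: K-means as on the slide; pick K, random start, iterate to improve.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random start: K distinct points serve as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = ((points[:, None] - centroids) ** 2).sum(-1).argmin(1)
        # Move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new
    return labels, centroids
```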

  48. Processing Microarray Data Agglomerative Hierarchical Clustering:
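
A sketch with SciPy and hypothetical gene-by-array profiles: build the merge tree, then cut it into a fixed number of flat clusters.

```python
# Sketch, assuming SciPy: agglomerative clustering of expression profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

profiles = np.random.default_rng(5).normal(size=(30, 8))  # hypothetical data
Z = linkage(profiles, method="average", metric="euclidean")
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree: 4 clusters
print(labels)
```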

  49. Processing Microarray Data Self-Organizing Feature Maps: by Teuvo Kohonen; a data visualization technique which helps to understand high-dimensional data by reducing the dimensions of the data to a map.

  50. Processing Microarray Data Self-Organizing Feature Maps: humans simply cannot visualize high-dimensional data as is; SOMs help us understand this high-dimensional data.
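
A minimal NumPy-only SOM sketch (grid size, learning rate, and schedules are illustrative choices, not from the slides): each high-dimensional row is mapped to its best-matching unit on a small 2-D grid.

```python
# Sketch: a tiny self-organizing map reducing rows to a 2-D grid.
import numpy as np

def train_som(data, grid=(6, 6), epochs=100, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h, w, data.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    for t in range(epochs):
        sigma = max(1.0, (h / 2) * (1 - t / epochs))  # shrinking neighborhood
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: grid node with the closest weight vector.
            d = ((weights - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Pull the BMU and its grid neighbours toward the sample.
            g = np.exp(-((coords - bmu) ** 2).sum(-1) / (2 * sigma ** 2))
            weights += lr * (1 - t / epochs) * g[..., None] * (x - weights)
    return weights
```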
