190 likes | 348 Vues
Bioinformatics: Spectral Clustering. Mentee: Joonoh Lim Mentor: Sanketh Shetty. Background. Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets.
E N D
Bioinformatics:Spectral Clustering Mentee: Joonoh Lim Mentor: SankethShetty
Background • Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets. • In general, cluster analysis is a common technique for statistical data analysis used in many fields: machine learning, data mining, pattern recognition, image analysis and bioinformatics. • In bioinformatics, it is used to study gene and gene expression.
Types of Clustering Algorithms • Partitional Methods • K-means Clustering • Affinity Propagation • Spectral Clustering • Mean-shift Clustering • Normalized-cuts • Gaussian Mixture Models • Hierarchical Methods • Single linkage • Complete linkage • Average Linkage
Advantage of Spectral Clustering • Very simple to implement • Can be solved efficiently by standard linear algebra • Invariant to cluster shapes and densities
Spectral Clustering Vs. K-means Spectral Clustering Result apo.enseeiht.fr/pub/Jdoc09/sandrine.pdf
Overview of clustering processfrom algorithm created by Ng, Jorda, and Weiss • 1. Given n data points, construct n-by-n distance matrix. • 2. Form similarity matrix (W) from the distance matrix. • 3. Form Laplacian matrix (Lsym) from the similarity matrix • 4. Compute the smallest k eigenvectors u1, u2,…,uk of Lsym. • 5. Form a n-by-k matrix U containing the vectors u1, u2,…,uk as columns. • 6. Form a matrix Y by normalizing each of the U’s rows • 7. Treat each row of Y as a point on the data set and cluster them in k clusters via k-means clustering method.
How Spectral Clustering works:Graph cut point of view • Graph: abstract representation of a set of objects where some pairs of the objects are connected by links. • In spectral clustering, we want to find a partition of the graph such that the edges between different groups have a very low weight and the edges within a group have high weight wikipedia.com
Objective of Spectral Clustering Algorithm • In order to get distinct clusters, we want to minimize :or, equivalently, fTLsymfwhere f is eigenvector. • So that we have very low weight for the edges between different groups and high weight for the edges within a group
Details of Clustering Algorithm • 1. Given n data points, construct n-by-n distance matrix • 2. Form similarity matrix W from the distance matrix by applying Gaussian similarity function element-wise: • 3. Form Laplacian matrix from the similarity matrix by calculating: Lsym = I – D-1/2WD-1/2, where D is a degree matrix which is a diagonal matrix with • 4. Compute the first (smallest) k eigenvectors u1, u2,…,uk of L.
Details of Clustering Algorithm • 5. Form a n-by-k matrix U containing the vectors u1, u2,…,uk as columns. • 6. Normalize the rows of U to norm 1, by setting for each element uij • 7. For i=1,2,…,n, let yi be the vector corresponding to the i-th row of U • 8. Using k-means, cluster the points yi for i=1,...,n into clusters * Note that the inputs of this algorithm are k and similarity matrix (and thus σ) * In addition, outputs are clusters C1,…,Ck and data points coupled with the index.
Role of parameter σ (sigma) σ = 1 σ = 5 σ = 10 • Parameter σ assigns high weight to data points located within a circle of radius σ centered at each data point. • The greater σ is, the more points are assigned high weight, resulting in much less distinct clusters.
Clustering results with different k Different color means different cluster *note that k = the number of clusters
Applications • Cluster analysis is being used in many fields: • In biology: sequence analysis, gene analysis • In medicine: PET scans • And in market research, social network analysis, image segmentation, data mining, crime analysis, and so on.
Acknowledgment • Mentor: • SankethV ShettyGraduate Research Assistant Computer Vision and Robotics Laboratory, Beckman Institute of Advanced Science and Technology
Questions? • Thank you!