
Clustering Applications

Clustering Applications: reminder, applications, spectral clustering, assignment clustering. Non-hierarchical clustering methods divide a dataset of N objects into M clusters, with or without overlap.



Presentation Transcript


  1. Clustering Applications Reminder Applications Spectral Clustering Assignment Clustering

  2. Clustering methods • Non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap. • Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.

  3. Non-hierarchical methods • Partitioning methods: classes are mutually exclusive. • Clumping methods: overlap is allowed.

  4. Hierarchical methods • Agglomerative methods: the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects or clusters, beginning with the un-clustered dataset. • Divisive methods begin with all objects in a single cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each object resides in its own cluster.

  5. Partitioning methods • Partitioning methods are divided according to the number of passes over the data. • Single pass: basic partitioning methods. • Multiple passes: K-means (very widely used).
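As a sketch of the multiple-pass idea, here is a minimal pure-Python K-means (Lloyd's algorithm) on 2-D points. The data, the fixed iteration count, and the seed are illustrative assumptions, not part of the slides:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign points to nearest center, then move centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):                    # multiple passes over the data
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment pass
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):     # update pass: centers -> means
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

# Two tight groups of 2-D points; K-means should recover them as clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```

On gene-expression data each "point" would be a gene's expression profile rather than a 2-D coordinate; the algorithm is unchanged.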

  6. K-means: Sample application • Gene clustering. • Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line. • Normalization allows comparisons across microarrays. • Produce clusters of genes which vary in similar ways over time. • Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway. [Figure: sample array; rows are genes and columns are time points, with a cluster of co-regulated genes highlighted.]

  7. Clustering gene expression data. [Figure: expression matrix with genes as rows and samples as columns; each row is the expression profile of a gene.]

  8. Clustering gene expression data. Cluster genes with similar expression profiles. [Figure: the gene-by-sample expression matrix with similar rows grouped together.]

  9. Clustering genes on expression profiles • The expression profile of each gene is a point in 'sample space'. • All genes together form a scatter in this space. [Figure: genes plotted as points e_g1, e_g2, e_g3 along axes Sample 1, Sample 2, Sample 3.]

  10. Representation of expression data. Normalized expression data from microarrays. [Figure: genes 1 to N plotted over time-points 1-3 (T1, T2, T3); d_ij denotes the distance between the profiles of genes i and j.]

  11. Identifying prevalent expression patterns. [Figure: three plots of normalized expression versus time-point (1, 2, 3), one per prevalent pattern.]

  12. Evaluate cluster contents. [Figure: clustered genes annotated by MIPS functional category: glycolysis, nuclear organization, ribosome, translation, unknown.]

  13. Hierarchical agglomerative methods • The hierarchical agglomerative clustering methods are the most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm. • 1) Find the two closest objects and merge them into a cluster. • 2) Find and merge the next two closest points, where a point is either an individual object or a cluster of objects. • 3) If more than one cluster remains, return to step 2.
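The three-step procedure above can be sketched in pure Python. Stopping at k clusters instead of one, the single-linkage cluster distance, and the toy points are illustrative assumptions:

```python
def single_linkage(points, k):
    """Agglomerate by repeatedly merging the closest pair of clusters,
    where cluster distance is the closest pair of points (single linkage)."""
    clusters = [[p] for p in points]                # start: each object alone
    def dist(A, B):
        return min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 for a in A for b in B)
    while len(clusters) > k:                        # steps 1-3: merge closest pair
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

cl = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6)], 2)
```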

  14. Clustering genes on expression profiles • Define a distance/similarity measure between points, e.g. Euclidean or Manhattan distance. • Define a distance between clusters of points: 1) distance between the closest pair of points from the two clusters (single linkage); 2) distance between the furthest pair of points (complete linkage); 3) average distance between points from both clusters; 4) distance between the clusters' centroids.
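The two point distances and the four cluster distances can be written out directly. The helper names and the toy clusters are illustrative assumptions:

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def single_link(C1, C2, d=euclidean):    # 1) closest pair across clusters
    return min(d(p, q) for p in C1 for q in C2)

def complete_link(C1, C2, d=euclidean):  # 2) furthest pair across clusters
    return max(d(p, q) for p in C1 for q in C2)

def average_link(C1, C2, d=euclidean):   # 3) mean over all cross pairs
    return sum(d(p, q) for p in C1 for q in C2) / (len(C1) * len(C2))

def centroid_link(C1, C2, d=euclidean):  # 4) distance between centroids
    c1 = [sum(x) / len(C1) for x in zip(*C1)]
    c2 = [sum(x) / len(C2) for x in zip(*C2)]
    return d(c1, c2)

C1, C2 = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
```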

  15. Clustering genes on expression profiles • Hierarchical clustering: start with each point in its own cluster. • At each iteration, merge the two clusters with the smallest distance.

  16. Clustering genes on expression profiles • Hierarchical clustering: start with each point in its own cluster. • At each iteration, merge the two clusters with the smallest distance. • Eventually all points are linked into a single cluster.

  17. Clustering genes on expression profiles. The sequence of mergers can be represented in a hierarchical tree. [Figure: dendrogram over points a-g.]

  18. Clustering genes on expression profiles. Eisen et al. PNAS 1998. Green = expression level low with respect to the reference sample. Red = expression level high with respect to the reference sample. Black = expression level comparable to the reference sample. The columns are ordered such that similar expression profiles neighbor each other.

  19. Clustering gene expression data. Instead of genes, one may cluster samples with similar expression profiles. [Figure: the gene-by-sample matrix; each column is the expression profile of a sample.]

  20. Clustering samples on expression profiles. Alizadeh et al. Nature 2000. Identifying different tumor types through sample clustering.

  21. Alizadeh et al., Nature 403:503-11, 2000

  22. Combinations of samples/genes • Cluster genes with a similar expression profile across samples. • Cluster samples with a similar expression profile across genes. • Combination model: each color corresponds to some "cause"; the cause affects a subset of the genes in a subset of the samples. E.g. Ihmels et al. Nature Genetics 2002.

  23. Combinations of samples/genes. Ihmels et al. Nature Genetics 2002.

  24. Clustering genes: clusters of homologous genes • A set of protein or DNA sequences. • Use an alignment algorithm (e.g. BLAST) to score the similarity of each pair of sequences. • Task: detect clusters in the resulting similarity graph. [Figure: graph of similarities of proteins in Methanococcus jannaschii; the length of a link reflects similarity (short link = high similarity). Enright and Ouzounis, Bioinformatics 2001.]

  25. Clustering genes: clusters of homologous genes. Example solution: put "random walkers" on the graph and let them follow links at random. Look at the density of walkers; strengthen "high-flow" links and weaken "low-flow" links. Stijn van Dongen, Graph Clustering by Flow Simulation (PhD thesis, University of Amsterdam).
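A minimal sketch in the spirit of van Dongen's flow simulation (MCL): alternate random-walk expansion (matrix squaring) with "inflation" (elementwise powering plus renormalization), which strengthens high-flow links and starves low-flow ones. The toy graph, inflation value, and attractor read-out are illustrative assumptions, not his exact implementation:

```python
def mcl_sketch(adj, inflation=2.0, iters=30):
    """Flow-based clustering: expansion (M @ M) alternated with inflation."""
    n = len(adj)
    # Column-stochastic walk matrix, with self-loops added.
    M = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    def colnorm(M):
        for j in range(n):
            s = sum(M[i][j] for i in range(n))
            for i in range(n):
                M[i][j] /= s
        return M
    M = colnorm(M)
    for _ in range(iters):
        # Expansion: two random-walk steps at once (matrix squaring).
        M = [[sum(M[i][q] * M[q][j] for q in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen high-flow links, weaken low-flow links.
        M = colnorm([[M[i][j] ** inflation for j in range(n)] for i in range(n)])
    clusters = {}
    for j in range(n):                 # each column drains to an "attractor"
        attractor = max(range(n), key=lambda i: M[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(sorted(c) for c in clusters.values())

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
adj = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
parts = mcl_sketch(adj)
```

The weak bridge carries little flow, so inflation cuts it and the two triangles emerge as separate clusters.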

  26. Clustering DNA sequences: Transcription factor binding sites • Transcription factors recognize "fuzzy" motifs. • Alignment of known fruR binding sites:
AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******
• Task: given thousands of such binding sites for hundreds of different TFs, infer which binding sites bind the same TF.

  27. Clustering DNA sequences: Transcription factor binding sites. Each line contains a binding site for a transcription factor:
AAGCACTATATTGGTGCAACATTCACATCGTG
GTGATGAACTGTTTTTTTATCCAGTATAATTT
ACTCATCTGGTACGACCAGATCACCTTGCGGA
AAGCACCATGTTGGTGCAATGACCTTTGGATA
AAGCTGAATCGATTTTATGATTTGGTTCAATT
AGGCTGAATCGTTTCAATTCAGCAAGAGAGGA
CATTAACTCATCGGATCAGTTCAGTAACTATT
CCTCTTTACTGTATATAAAACCAGTTTATACT
TCCGAACTGATCGGACTTGTTCAGCGTACACG
ACTCACAACTGTATATAAATACAGTTACAGAT
GTGCTGAAACCATTCAAGAGTCAATTGGCGCG
ATCAAGCTGGTATGATGAGTTAATATTATGTT
TTCCAATACTGTATATTCATTCAGGTCAATTT
GTGGTGAATCGATACTTTACCGGTTGAATTTG
CAGCATAACTGTATATACACCCAGGGGGCGGA
GCCTTTTGCTGTATATACTCACAGCATAACTG
CAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC
ACGGTGAATCGTTCAAGCAAATATATTTTTTT
AGTAATGACTGTATAAAACCACAGCCAATCAA
ATCGTTAAGCGATTCAGCACCTTACCTCAGGC
TGGATGTACTGTACATCCATACAGTAACTCAC
ATGCACTAAAATGGTGCAACCTGTTCAGGAGA
TATTTTACCTGTATAAATAACCAGTATATTCA
CAGCAAATCTGTATATATACCCAGCTTTTTGG
GCGCACCAGATTGGTGCCCCAGAATGGTGCAT
ACAGACTACTGTATATAAAAACAGTATAACTT
TCGCCACTGGTCTGATTTCTAAGATGTACCTC
AGTTTATACTGTACACAATAACAGTAATGGTT
CTGCTGAATTGATTCAGGTCAGGCCAAATGGC
ACTTGATACTGTATGAGCATACAGTATAATTG
TTCCAGCTGGTCCGACCTATACTCTCGCCACT
TCGTTTTCCTGTATGAAAAACCATTACTGTTA
TTACACTCCTGTTAATCCATACAGCAACAGTA
CGACTGAAACGCTTCAGCTAGGATAAGCGAAA
TGACTGAAACGTTTTTGCCCTATGAGCTCCGG
CATATTTACTGATGATATATACAGGTATTTAG
TTCTTGAAACGTTTCAGCGCGATCTTGTCTTT
CTGTTACACTGGATAGATAACCAGCATTCGGA
ATCCTTCGCTGGATATCTATCCAGCATTTTTT
GCACTGAATCGGTTAACTGTCCAGTCGACGGC
CCACAATATTGGCTGTTTATACAGTATTTCAG

  28. Clustering DNA sequences: Transcription factor binding sites. The same set of binding sites as on the previous slide. van Nimwegen et al. PNAS 2002.

  29. Clustering DNA sequences: Transcription factor binding sites. The same sites, grouped into clusters; each cluster ends with its consensus motif:
AAGCACTATATTGGTGCAACATTCACATCGTG
AAGCACCATGTTGGTGCAATGACCTTTGGATA
ATGCACTAAAATGGTGCAACCTGTTCAGGAGA
GCGCACCAGATTGGTGCCCCAGAATGGTGCAT
a*GCAC*A*atTGGTGCaac****t***g**

ACTCATCTGGTACGACCAGATCACCTTGCGGA
CATTAACTCATCGGATCAGTTCAGTAACTATT
TCCGAACTGATCGGACTTGTTCAGCGTACACG
ATCAAGCTGGTATGATGAGTTAATATTATGTT
TTCCAGCTGGTCCGACCTATACTCTCGCCACT
TCGCCACTGGTCTGATTTCTAAGATGTACCTC
CAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC
****a*CTGgTc*Gat**GT******t*****

AAGCTGAATCGATTTTATGATTTGGTTCAATT
AGGCTGAATCGTTTCAATTCAGCAAGAGAGGA
CTGCTGAATTGATTCAGGTCAGGCCAAATGGC
GTGCTGAAACCATTCAAGAGTCAATTGGCGCG
GTGGTGAATCGATACTTTACCGGTTGAATTTG
CGACTGAAACGCTTCAGCTAGGATAAGCGAAA
TGACTGAAACGTTTTTGCCCTATGAGCTCCGG
TTCTTGAAACGTTTCAGCGCGATCTTGTCTTT
ACGGTGAATCGTTCAAGCAAATATATTTTTTT
GCACTGAATCGGTTAACTGTCCAGTCGACGGC
ATCGTTAAGCGATTCAGCACCTTACCTCAGGC
**gcTGAAtCG*TTcAg**c************

GTGATGAACTGTTTTTTTATCCAGTATAATTT
TGGATGTACTGTACATCCATACAGTAACTCAC
ACAGACTACTGTATATAAAAACAGTATAACTT
CCTCTTTACTGTATATAAAACCAGTTTATACT
ACTCACAACTGTATATAAATACAGTTACAGAT
AGTTTATACTGTACACAATAACAGTAATGGTT
ACTTGATACTGTATGAGCATACAGTATAATTG
TTCCAATACTGTATATTCATTCAGGTCAATTT
CAGCATAACTGTATATACACCCAGGGGGCGGA
GCCTTTTGCTGTATATACTCACAGCATAACTG
TATTTTACCTGTATAAATAACCAGTATATTCA
CAGCAAATCTGTATATATACCCAGCTTTTTGG
TCGTTTTCCTGTATGAAAAACCATTACTGTTA
TTACACTCCTGTTAATCCATACAGCAACAGTA
CATATTTACTGATGATATATACAGGTATTTAG
CTGTTACACTGGATAGATAACCAGCATTCGGA
ATCCTTCGCTGGATATCTATCCAGCATTTTTT
CCACAATATTGGCTGTTTATACAGTATTTCAG
AGTAATGACTGTATAAAACCACAGCCAATCAA
****t*tACTGTATATa*A*ACAG********

  30. Similarity/distance matrices. Useful if one wants to investigate a specific factor (advantage: no loss of information): sort the experiments according to that factor. [Figure: distance matrix with experiments sorted into array batch 1 and array batch 2.]

  31. Clustering DNA sequences: Transcription factor binding sites • Alignment of known fruR binding sites (as on slide 26).

  32. Probability evaluation. Probability that a sequence s of length L is a binding site for the factor represented by the weight matrix w: P(s | w) = ∏ᵢ₌₁..L wᵢ(sᵢ), where wᵢ(b) is the frequency of base b at position i of the weight matrix.
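This evaluation can be sketched with a position weight matrix built from aligned sites. The aligned sites below are hypothetical (fruR-like cores, for illustration), and the pseudocount is an assumption:

```python
def pwm_from_sites(sites, pseudo=1.0):
    """Build a position weight matrix w: w[i][b] is the (pseudocounted)
    frequency of base b at position i of the aligned sites."""
    bases = "ACGT"
    w = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in bases}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        w.append({b: counts[b] / total for b in bases})
    return w

def site_probability(s, w):
    """P(s | w): product over positions of the frequency of the observed base."""
    p = 1.0
    for i, base in enumerate(s):
        p *= w[i][base]
    return p

# Hypothetical aligned binding sites, one motif instance per string.
sites = ["TGAATCGATT", "TGAATCGTTT", "TGAATTGATT", "TGAAACCATT"]
w = pwm_from_sites(sites)
```

A sequence matching the consensus scores a much higher P(s | w) than an unrelated sequence, which is what lets such scores separate sites of different factors.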

  33. Kohonen self-organizing maps • K = r·s clusters are arranged as the nodes of a two-dimensional grid; the nodes represent cluster centers (prototype vectors). • This makes it possible to represent similarity between clusters. • Algorithm: initialize the nodes at random positions. Iterate: randomly pick one data point (gene) x; move the nodes towards x, the closest node most, remote nodes (in terms of the grid) less; decrease the amount of movement with the number of iterations. From Tamayo et al. 1999.
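The update loop can be sketched for a one-dimensional map on scalar data. The decay schedules, grid size, data, and seed are illustrative assumptions:

```python
import math
import random

def som_1d(data, n_nodes=3, iters=2000, seed=0):
    """One-dimensional self-organizing map on scalar data."""
    rng = random.Random(seed)
    nodes = [rng.uniform(min(data), max(data)) for _ in range(n_nodes)]
    for t in range(iters):
        x = rng.choice(data)                  # randomly pick one data point
        win = min(range(n_nodes), key=lambda i: abs(nodes[i] - x))
        frac = t / iters
        rate = 0.5 * (1.0 - frac)             # amount of movement decays
        sigma = max(0.1, 2.0 * (1.0 - 2.0 * frac))  # neighbourhood shrinks
        for i in range(n_nodes):
            # Nodes close to the winner on the grid move most.
            h = math.exp(-((i - win) ** 2) / (2.0 * sigma ** 2))
            nodes[i] += rate * h * (x - nodes[i])
    return nodes

# Three well-separated groups; the trained nodes spread out to cover them.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
nodes = som_1d(data)
```

The early wide neighbourhood orders the map; the late narrow one lets each node settle near one group, much like K-means with a grid constraint.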

  34. Self-organizing maps from Tamayo et al. 1999 (yeast cell cycle data)

  35. MST method: graph representation of the data • A set of k n-dimensional points is represented as a graph. • Each data point is represented as a node V (a vertex). • The edge between the i-th and j-th points is weighted by the "distance" between V(i) and V(j). • The distances d(i,j) form a k x k matrix.

  36. Graph representation. d(i,j) is the distance between V(i) and V(j). [Figure: eight vertices V(1)-V(8) connected by edges, with the corresponding 8 x 8 distance matrix.]

  37. Intuitive Requirement for a Cluster

  38. Intuitive requirement for a cluster (IR). For any partition of C into two non-empty parts C1 and C2: the closest point to C1 among all points not in C1 belongs to C2, and the closest point to C2 among all points not in C2 belongs to C1.

  39. Cluster versus MST. If a subset C satisfies the IR, the points of C form a subtree of the MST. In other words, by deleting a few edges one obtains a tree consisting only of the points of C. [Figure: a set of points that does not satisfy the IR.]

  40. Prim's algorithm for cluster identification. [Figure: data points with indices 0-10; the MST is grown from the root, and the step index gives the order (sequential presentation) in which the points are attached.]
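Prim's algorithm itself is easy to sketch in pure Python. The toy points and the "cut the longest MST edge" cluster read-out are illustrative assumptions:

```python
def prim_mst(d):
    """Prim's algorithm on a full distance matrix d; returns the MST edges
    in the order they are added (the sequential presentation)."""
    n = len(d)
    in_tree = {0}                       # start from an arbitrary root
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from the tree to a point not yet in the tree.
        i, j = min(((i, j) for i in in_tree for j in range(n)
                    if j not in in_tree), key=lambda e: d[e[0]][e[1]])
        edges.append((i, j, d[i][j]))
        in_tree.add(j)
    return edges

# Two well-separated groups: {0, 1, 2} and {3, 4}.
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9)]
d = [[((a - c) ** 2 + (b - e) ** 2) ** 0.5 for (c, e) in pts] for (a, b) in pts]
edges = prim_mst(d)
# Deleting the longest MST edge splits the data into the two clusters.
longest = max(edges, key=lambda e: e[2])
```

The long edge bridging the two groups is exactly the "valley boundary" in the sequential representation of the following slide.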

  41. Intuitive requirement for a cluster. [Figure: sequential representation of the MST edge lengths; clusters show up as "valleys" of short edges separated by long ones.]

  42. Cluster analysis & graph theory • Graph formulation: view the data set as a set of vertices V = {1, 2, …, n}. • The similarity between objects i and j is viewed as the weight Aij of the edge connecting these vertices; A is called the affinity matrix. • We get a weighted undirected graph G = (V, A). • Clustering (segmentation) is equivalent to a partition of G into disjoint subsets; the latter can be achieved by simply removing connecting edges.

  43. Nature of the affinity matrix. "Closer" vertices get larger weight; the weight is a function of the scaling parameter σ (e.g. Aij = exp(-d(i,j)² / 2σ²)).
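A small sketch of the Gaussian weight (the exact form used on the slide is an assumption; this is the common choice):

```python
import math

def affinity(d, sigma):
    """Gaussian edge weight: A_ij = exp(-d^2 / (2 sigma^2)).
    Close vertices get weight near 1, distant vertices near 0;
    sigma sets the distance scale at which weights fall off."""
    return math.exp(-d ** 2 / (2.0 * sigma ** 2))
```

A larger sigma keeps more distant vertices strongly connected, so sigma effectively controls the granularity of the resulting clusters.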

  44. Spectral clustering • Algorithms that cluster points using eigenvectors of matrices derived from the data. • Obtain a data representation in a low-dimensional space that can be easily clustered. • A variety of methods use the eigenvectors differently.

  45. Spectral Clustering Algorithm (Ng, Jordan, and Weiss) • Given a set of points S = {s1, …, sn}. • Form the affinity matrix A. • Define the diagonal matrix D with Dii = Σk Aik. • Form the matrix L = D^(-1/2) A D^(-1/2). • Stack the k largest eigenvectors of L as the columns of the new matrix X. • Renormalize each of X's rows to have unit length, giving Y; cluster the rows of Y as points in R^k.

  46. Spectral Clustering Algorithm (Ng, Jordan, and Weiss) • Motivation: given a set of points S = {s1, s2, …, sn} ⊂ R^l, we would like to cluster them into k subsets. • Form the affinity matrix A ∈ R^(n×n): Aij = exp(-||si - sj||² / 2σ²) for i ≠ j, and Aii = 0. • The scaling parameter σ is chosen by the user. • Define D as the diagonal matrix whose (i, i)-element is the sum of A's i-th row.

  47. Algorithm • Form the matrix L = D^(-1/2) A D^(-1/2). • Find x1, x2, …, xk, the k largest eigenvectors of L. • These form the columns of the new matrix X ∈ R^(n×k). • We have reduced the dimension from n×n to n×k.

  48. Algorithm • Form the matrix Y by renormalizing each of X's rows to have unit length: Y ∈ R^(n×k). • Treat each row of Y as a point in R^k. • Cluster them into k clusters via K-means. • Final cluster assignment: assign point si to cluster j if row i of Y was assigned to cluster j.
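The whole pipeline can be sketched in pure Python on four points on a line. The power-iteration eigensolver, keeping Aii = 1 instead of 0 (a simplification; the original algorithm zeroes the diagonal), and the toy data are assumptions of this sketch, not part of the Ng-Jordan-Weiss paper:

```python
import math

def spectral_embed(points, sigma=1.0, k=2, iters=200):
    """Affinity -> L = D^(-1/2) A D^(-1/2) -> top-k eigenvectors -> Y."""
    n = len(points)
    # Gaussian affinities; diagonal kept at 1 for simplicity in this sketch.
    A = [[math.exp(-(points[i] - points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    Dinv = [1.0 / math.sqrt(sum(row)) for row in A]
    L = [[Dinv[i] * A[i][j] * Dinv[j] for j in range(n)] for i in range(n)]

    def matvec(v):
        return [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]

    vecs = []
    for t in range(k):   # power iteration with deflation for top-k eigenvectors
        v = [1.0] * n if t == 0 else [float(i == t) for i in range(n)]
        for _ in range(iters):
            w = matvec(v)
            for u in vecs:   # deflate: project out eigenvectors already found
                c = sum(wi * ui for wi, ui in zip(w, u))
                w = [wi - c * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        vecs.append(v)

    # Stack eigenvectors as columns of X, then renormalize each row -> Y.
    Y = []
    for i in range(n):
        row = [vec[i] for vec in vecs]
        r = math.sqrt(sum(x * x for x in row))
        Y.append([x / r for x in row])
    return Y

# Two well-separated groups on the line; the rows of Y separate them.
Y = spectral_embed([0.0, 0.1, 5.0, 5.1])
```

In the embedding, points from the same group get (nearly) identical rows of Y, so the final K-means step becomes trivial; here the sign of the second coordinate already splits the two groups.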
