Download Presentation
## Clustering

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Inter-cluster distances are maximized**Intra-cluster distances are minimized The Problem of Clustering • Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.**Clustering precipitation in Australia**Applications of Cluster Analysis • Clustering for Understanding • Group related documents for browsing • Group genes and proteins that have similar functionality • Group stocks with similar price fluctuations • Segment customers into a small number of groups for additional analysis and marketing activities. • Clustering for Summarization • Reduce the size of large data sets**Examples**Documents • Represent a document by a vector (x1, x2,…, xk), where xi = 1 iff the i th word of a vocabulary appears in the document. • Documents with similar sets of words may be about the same topic. DNAs • Objects are sequences of {C,A,T,G}. • Distance between sequences is edit distance, the minimum number of inserts and deletes needed to turn one into the other. • Note there is a “distance,” but no convenient space in which points “live.”**Distance Measures**• Each clustering problem is based on some kind of “distance” between points. • Two major classes of distance measure: • Euclidean • Non-Euclidean • A Euclidean space has some number of real-valued dimensions. • There is a notion of “average” of two points. • A Euclidean distance is based on the locations of points in such a space. • A Non-Euclidean distance is based on properties of points, but not their “location” in a space.**Axioms of a Distance Measure**• d is a distance measure if it is a function from pairs of points to real numbers such that: • d(x,y) > 0. • d(x,y) = 0 iff x = y. • d(x,y) = d(y,x). • d(x,y) < d(x,z) + d(z,y) (triangle inequality).**y = (9,8)**L2-norm: dist(x,y) = (42+32) = 5 5 3 L1-norm: dist(x,y) = 4+3 = 7 4 x = (5,5) Some Euclidean Distances • L2 norm : d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension. • The most common notion of “distance.” • L1 norm : sum of the differences in each dimension. • Manhattan distance = distance if you had to travel along coordinates only.**Non-Euclidean Distances**• Jaccard distance for sets = 1 minus ratio of sizes of intersection and union. • Cosine distance = angle between vectors from the origin to the points in question. • Edit distance = number of inserts and deletes to change one string into another.**Jaccard Distance for Sets (Bit-Vectors)**• Example: p1 = 10111; p2 = 10011. • Size of intersection = 3; size of union = 4, Jaccard similarity (not distance) = 3/4. • d(x,y) = 1 – (Jaccard similarity) = 1/4.**Why J.D. Is a Distance Measure**• d(x,x) = 0 because xx = xx. • d(x,y) = d(y,x) because union and intersection are symmetric. • d(x,y) > 0 because |xy| < |xy|. • d(x,y) < d(x,z) + d(z,y) trickier... (1 - |x z|/|x z|) + (1 - |y z|/|y z|) 1 - |x y|/|x y| • Remember: • |a b|/|a b| = probability that minhash(a) = minhash(b) • Thus, 1 - |a b|/|a b| = probability that minhash(a) minhash(b). • Claim: prob[mh(x)mh(y)] prob[mh(x)mh(z)] + prob[mh(z)mh(y)] • Proof: whenever mh(x)mh(y), at least one of mh(x)mh(z) and mh(z)mh(y) must be true.**{a,b,d,e}**{d,e,f} {a,b,c} {b,c,e,f} Similarity threshold = 1/3; distance < 2/3 Clustering with JD**p1** p2 p1.p2 |p2| d (p1, p2) = = arccos(p1.p2/|p2||p1|) Cosine Distance • Think of a point as a vector from the origin (0,0,…,0) to its location. • Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1.p2/|p2||p1|. • Example: p1 = 00111; p2 = 10011. • p1.p2 = 2; |p1| = |p2| = 3. • cos() = 2/3; is about 48 degrees.**Why C.D. Is a Distance Measure**• d(x,x) = 0 because arccos(1) = 0. • d(x,y) = d(y,x) by symmetry. • d(x,y) > 0 because angles are chosen to be in the range 0 to 180 degrees. • Triangle inequality: physical reasoning. If I rotate an angle from x to z and then from z to y, I can’t rotate less than from x to y.**Edit Distance**• The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other. Equivalently: • d(x,y) = |x| + |y| - 2|LCS(x,y)| • LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y. Example • x = abcde; y = bcduve. • Turn x into y by deleting a, then inserting u and v after d. • Edit distance = 3. • Or, LCS(x,y) = bcde. • Note: |x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4 = 3 = edit dist.**Why Edit Distance Is a Distance Measure**• d(x,x) = 0 because 0 edits suffice. • d(x,y) = d(y,x) because insert/delete are inverses of each other. • d(x,y) > 0: no notion of negative edits. • Triangle inequality: changing x to z and then to y is one way to change x to y.**Hierarchical Clustering**• Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree like diagram that records the sequences of merges or splits**Agglomerative Clustering Algorithm**Compute the proximity matrix Let each data point be a cluster Repeat Merge the two closest clusters Update the proximity matrix Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms**p1**p2 p3 p4 p5 . . . p1 p2 p3 p4 p5 . . . Starting Situation • Start with clusters of individual points and a proximity matrix Proximity Matrix**C1**C2 C3 C4 C5 C1 C2 C3 C4 C5 Intermediate Situation • After some merging steps, we have some clusters C3 C4 Proximity Matrix C1 C5 C2**C1**C2 C3 C4 C5 C1 C2 C3 C4 C5 Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. C3 C4 Proximity Matrix C1 C5 C2**After Merging**• The question is “How do we update the proximity matrix?” C2 U C5 C1 C3 C4 C1 ? ? ? ? ? C2 U C5 C3 C3 ? C4 ? C4 Proximity Matrix C1 C2 U C5**p1**p2 p3 p4 p5 . . . p1 p2 p3 p4 p5 . . . How to Define Inter-Cluster Similarity Similarity? • MIN • MAX • Group Average Proximity Matrix**p1**p2 p3 p4 p5 . . . p1 p2 p3 p4 p5 . . . How to Define Inter-Cluster Similarity • MIN • MAX • Group Average Proximity Matrix**p1**p2 p3 p4 p5 . . . p1 p2 p3 p4 p5 . . . How to Define Inter-Cluster Similarity • MIN • MAX • Group Average Proximity Matrix**p1**p2 p3 p4 p5 . . . p1 p2 p3 p4 p5 . . . How to Define Inter-Cluster Similarity • MIN • MAX • Centroid distance Proximity Matrix**Cluster Similarity: MIN**• Similarity of two clusters is based on the two most similar (closest) points in the different clusters**5**1 3 5 2 1 2 3 6 4 4 Hierarchical Clustering: MIN Nested Clusters Dendrogram**Two Clusters**Strength of MIN Original Points Can handle non-globular shapes**Original Points**Four clusters Three clusters: The yellow points got wrongly merged with the red ones, as opposed to the green one. Limitations of MIN Sensitive to noise and outliers**Cluster Similarity: MAX**• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters**4**1 2 5 5 2 3 6 3 1 4 Hierarchical Clustering: MAX Nested Clusters Dendrogram**Original Points**Four clusters Three clusters: The yellow points get now merged with the green one. Strengths of MAX Less susceptible respect to noise and outliers**Two Clusters**Limitations of MAX Original Points Tends to break large clusters**Cluster Similarity: Centroid distance**• Consider the distance between cluster centroids.**And in the Non-Euclidean Case?**• The only “locations” we can talk about are the points themselves. • i.e., there is no “average” of two points. • Approach 1: clustroid = point “closest” to other points. • Treat clustroid as if it were centroid, when computing intercluster distances. • “Closest” Point? • Possible meanings: • Smallest maximum distance to the other points. • Smallest average distance to other points.**Example**clustroid 1 2 6 4 3 clustroid 5 intercluster distance**k – Means Algorithm(s)**• Start by picking k, the number of clusters. • Initialize clusters by picking one point per cluster. • Pick one point at random, • Then k -1 other points, each as far away as possible from the previous points • each with the maximum possible minimum distance to the previously selected points. • Example • Suppose we pick A first. • E is the furthest from A, so we pick it next. • For the third point, the minimum distances to A or E are as follows B: 3.00, C: 2.83, D: 3.16, F: 2.00 The winner is D. Thus, D becomes the third seed.**Populating Clusters**• For each point, place it in the cluster whose current centroid it is nearest. • After all points are assigned, fix the centroids of the k clusters. • Optional: reassign all points to their closest centroid. • Sometimes moves points between clusters. Repeat for a couple of rounds.**Reassigned**points Clusters after first round Example: Assigning Clusters 2 4 x 6 3 1 8 7 5 x**Best value**of k Average distance to centroid k Getting k Right • Try different k, looking at the change in the average distance to centroid, as k increases. • Average falls rapidly until right k, then changes little.**Too few;**many long distances to centroid. Example: Picking k x xx x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x**Just right;**distances rather short. Example: Picking k x xx x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x**Too many;**little improvement in average distance. Example: Picking k x xx x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x**BFR Algorithm**• BFR (Bradley-Fayyad-Reina) is a variant of k -means designed to handle very large (disk-resident) data sets. • The goal is not to assign every point to a cluster, but to determine where the centroids of the clusters are. • We can assign the points to clusters by another pass through the data, assigning each point to its nearest centroid and writing out the cluster number with the point. • Points are read one main-memory-full at a time. • Most points from previous memory loads are summarized by simple statistics. • To begin, from the initial load we select the initial k centroids by some sensible approach.**Initialization: k -Means**Possibilities include: • Take a small random sample and cluster optimally. • Take a sample; pick a random point, and then k – 1 more points, each as far from the previously selected points as possible.**Three Classes of Points**• The discard set : points close enough to a centroid to be summarized. • The compression set : groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster. • The retained set : isolated points.**Summarizing Sets of Points**For each cluster, the discard set is summarized by: • The number of points, N. • The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension. • The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension. • 2d + 1 values represent any number of points. (d = number of dimensions). • Averages in each dimension (centroid coordinates) can be calculated easily as SUMi/N. • SUMi = ith component of SUM. • Variance of a cluster’s discard set in dimension i can be computed by: • (SUMSQi /N ) – (SUMi /N )2 • And the standard deviation is the square root of that. • The same statistics can represent any compression set.**Points in**the RS Compressed sets. Their points are in the CS. A cluster. Its points are in the DS. The centroid “Galaxies” Picture**Processing a “Memory-Load” of Points**• Find those points that are “sufficiently close” to a cluster centroid; add those points to that cluster and the DS. • Adjust statistics of the clusters to account for the new points. • Use any main-memory clustering algorithm to cluster the remaining points and the old RS. • Clusters go to the CS; outlying points to the RS. • Consider merging compressed sets in the CS. • If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.