Clustering

Clustering Techniques
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods

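Types of Data

Data Matrix (one row per object, one column per variable):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{j1} & \cdots & x_{jf} & \cdots & x_{jp}
\end{bmatrix}
$$

Dissimilarity Matrix:

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

where $d(i, j)$ is the difference or dissimilarity between objects $i$ and $j$.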
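A minimal sketch of how a data matrix becomes a lower-triangular dissimilarity matrix. The slides do not say which d(i, j) to use, so Euclidean distance is an assumption here, and the name `dissimilarity_matrix` is illustrative.

```python
import math

def dissimilarity_matrix(data):
    """Lower-triangular matrix whose row i holds d(i, 0), ..., d(i, i).

    Assumption: d(i, j) is Euclidean distance (not specified in the slides).
    """
    n = len(data)
    return [[math.dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(n)]

# Three objects described by two interval-scaled variables.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.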

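Binary Variables
• There are only two states: 0 (absent) or 1 (present). Ex. smoker: yes or no.
• Computing dissimilarity between binary variables: build a contingency table for objects i and j, with all attributes having the same weight.
  q = number of attributes equal to 1 for both objects
  r = number of attributes equal to 1 for i and 0 for j
  s = number of attributes equal to 0 for i and 1 for j
  t = number of attributes equal to 0 for both objects
• Symmetric attributes: $d(i, j) = \dfrac{r + s}{q + r + s + t}$
• Asymmetric attributes: $d(i, j) = \dfrac{r + s}{q + r + s}$ (matches on 0 are ignored)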
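A small sketch of the two binary dissimilarity formulas, assuming the usual contingency counts: q (both 1), r (1 for i, 0 for j), s (0 for i, 1 for j), t (both 0). The function name and the `symmetric` flag are illustrative.

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    # Asymmetric attributes ignore the 0/0 matches counted by t.
    denominator = q + r + s + (t if symmetric else 0)
    return (r + s) / denominator if denominator else 0.0

x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 0]
print(binary_dissimilarity(x, y, symmetric=True))   # (1 + 1) / 6
print(binary_dissimilarity(x, y, symmetric=False))  # (1 + 1) / 3
```

The asymmetric value is larger because the three shared absences no longer count as agreement.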

Nominal Variables
• Generalization of a binary variable: it can take more than two states. Ex. color: red, green, blue.
• $d(i, j) = \dfrac{p - m}{p}$, where m is the number of matches and p is the total number of attributes.
• Weights can be used: assign greater weight to the matches in variables having a larger number of states.

Ordinal Variables
• Resemble nominal variables, except the states are ordered in a meaningful sequence. Ex. medal: gold, silver, bronze.
• The value of f for the i-th object is $x_{if}$, and f has $M_f$ ordered states, representing the ranking $1, \ldots, M_f$.
• Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$.
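A brief sketch of the nominal mismatch ratio (p - m) / p and of replacing ordinal values by ranks. The function names and the sample attribute values are hypothetical; weighting of matches is not covered.

```python
def nominal_dissimilarity(x, y):
    """(p - m) / p, where m counts matching attributes among p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

def ordinal_ranks(values, ordered_states):
    """Replace each ordinal value by its rank 1..M_f in the given ordering."""
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    return [rank[v] for v in values]

print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))  # one mismatch out of three
print(ordinal_ranks(["gold", "bronze", "silver"],
                    ["gold", "silver", "bronze"]))       # [1, 3, 2]
```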

Variables of Mixed Types

$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

• The indicator $\delta_{ij}^{(f)} = 0$ if either $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.

The contribution of variable f to the dissimilarity depends on its type:
• If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.
• If f is interval-based: $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all non-missing objects for variable f.
• If f is ordinal or ratio-scaled: compute the ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.
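A sketch of the mixed-type dissimilarity, under a few labeled assumptions: missing values are encoded as `None`, ordinal values are already ranks 1..M_f, the interval range (max minus min) is supplied by the caller, and the normalized ordinal values z are assumed to span the full range 0..1. All names (`mixed_dissimilarity`, `kinds`, `ranges`, `levels`) are illustrative.

```python
def mixed_dissimilarity(x, y, kinds, ranges=None, levels=None):
    """Weighted average of per-variable dissimilarities.

    kinds[f] is one of "binary", "asymmetric_binary", "nominal",
    "interval", "ordinal". ranges[f] is max - min for interval variable f;
    levels[f] is M_f for ordinal variable f.
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # indicator is 0: a value is missing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # indicator is 0: shared absence carries no information
        if kind in ("binary", "asymmetric_binary", "nominal"):
            d = 0.0 if a == b else 1.0
        elif kind == "interval":
            d = abs(a - b) / ranges[f]
        elif kind == "ordinal":
            m = levels[f]
            d = abs((a - 1) / (m - 1) - (b - 1) / (m - 1))
        else:
            raise ValueError(f"unknown variable type: {kind}")
        numerator += d
        denominator += 1
    return numerator / denominator if denominator else 0.0

kinds = ["nominal", "interval", "ordinal", "asymmetric_binary"]
x = ("red", 10.0, 1, 0)
y = ("blue", 30.0, 3, 0)
# Contributions: 1 (nominal), 20/40 (interval), 1 (ordinal); last variable skipped.
print(mixed_dissimilarity(x, y, kinds, ranges={1: 40.0}, levels={2: 3}))
```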

Clustering Methods
1. Partitioning (k = number of clusters)
2. Hierarchical (hierarchical decomposition of objects)

TV-trees of order k
• Given: a set of N vectors.
• Goal: divide these points into at most k disjoint clusters so that the points in each cluster are similar with respect to a maximal number of coordinates (called active dimensions).
• TV-tree of order 2 (two clusters per node), procedure:
  • Divide the set of N points into 2 clusters, maximizing the total number of active dimensions.
  • For each cluster, repeat the same procedure.

Density-based methods
• Can find clusters of arbitrary shape.
• A given cluster keeps growing as long as the density in its neighborhood exceeds some threshold: for each point, the neighborhood of a given radius must contain at least some minimum number of points.
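The density condition can be sketched as a simple neighborhood count. This is only the threshold test, not a full density-based clustering algorithm. Euclidean distance and counting the point itself as part of its own neighborhood are assumptions, and the names `is_dense`, `radius`, and `min_points` are illustrative.

```python
import math

def is_dense(point, points, radius, min_points):
    """True if the radius-neighborhood of `point` holds at least min_points.

    Assumptions: Euclidean distance; the point counts toward its own
    neighborhood when it is a member of `points`.
    """
    neighbors = sum(1 for other in points if math.dist(point, other) <= radius)
    return neighbors >= min_points

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(is_dense((0.0, 0.0), points, radius=1.5, min_points=3))    # True
print(is_dense((10.0, 10.0), points, radius=1.5, min_points=3))  # False
```

A cluster would be allowed to grow through the first point but not through the isolated one.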

Partitioning methods

1. K-means method (n objects to k clusters)
• Cluster similarity is measured with regard to the mean value of the objects in a cluster (the cluster's center of gravity).
• Select k points at random (call them means).
• Assign each object to the nearest mean.
• Compute the new mean for each cluster.
• Repeat until the criterion function converges.
• Squared error criterion, which we try to minimize:

$$
E = \sum_{i=1}^{K} \sum_{p \in C_i} |p - m_i|^2
$$

• This method is sensitive to outliers.

2. K-medoids method
• Instead of the mean, take a medoid (the most centrally located object in a cluster).
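A short illustration of why the mean is sensitive to outliers while a medoid is less so. The one-dimensional sample values are hypothetical, and the medoid is taken here as the object minimizing the total absolute distance to the others, which is one common definition.

```python
def mean(values):
    """Center of gravity of a one-dimensional cluster."""
    return sum(values) / len(values)

def medoid(values):
    """Cluster member with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 plays the outlier
print(mean(cluster))    # 22.0, pulled far toward the outlier
print(medoid(cluster))  # 3.0, stays among the bulk of the objects
```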

Hierarchical Methods
• Agglomerative hierarchical clustering (bottom-up strategy): each object is placed in a separate cluster, and these clusters are then merged until certain termination conditions are satisfied.
• Divisive hierarchical clustering (top-down strategy): all objects start in one cluster, which is split repeatedly.

Distance between clusters:
• Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
• Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
• Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$
• Average distance: $d_{avg}(C_i, C_j) = \dfrac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
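The four inter-cluster distances can be sketched directly from their definitions. Euclidean distance for |p - p'| is an assumption, and the function names are illustrative.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def d_min(ci, cj):
    return min(math.dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):
    return max(math.dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):
    return math.dist(centroid(ci), centroid(cj))

def d_avg(ci, cj):
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(3.0, 0.0), (5.0, 0.0)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))
# 2.0 5.0 3.5 3.5
```

In this example the mean and average distances coincide because all points lie on one line; in general they differ.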

Cluster: $K_m = \{t_{m1}, \ldots, t_{mN}\}$

• Centroid: $C_m = \dfrac{\sum_{i=1}^{N} t_{mi}}{N}$
• Radius: $R_m = \sqrt{\dfrac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$
• Diameter: $D_m = \sqrt{\dfrac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N (N - 1)}}$

Distance Between Clusters
• Single Link (smallest distance): $Dis(K_i, K_j) = \min\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
• Complete Link (largest distance): $Dis(K_i, K_j) = \max\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
• Average: $Dis(K_i, K_j) = \mathrm{mean}\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
• Centroid Distance: $Dis(K_i, K_j) = Dis(C_i, C_j)$
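A sketch of the centroid, radius, and diameter of a single cluster. It assumes the square-root (root-mean-square) forms of radius and diameter and Euclidean distance between points; the function names are illustrative.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the N points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def radius(cluster):
    """Root of the mean squared distance from the points to the centroid."""
    c = centroid(cluster)
    return math.sqrt(sum(math.dist(t, c) ** 2 for t in cluster) / len(cluster))

def diameter(cluster):
    """Root of the mean squared distance over all ordered pairs of points."""
    n = len(cluster)
    total = sum(math.dist(a, b) ** 2 for a in cluster for b in cluster)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0.0, 0.0), (2.0, 0.0)]
print(centroid(cluster))  # (1.0, 0.0)
print(radius(cluster))    # 1.0
print(diameter(cluster))  # 2.0
```

The diameter needs at least two points, since the denominator N(N - 1) is zero for a singleton.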

Hierarchical Algorithms
• Single Link Technique (find maximal connected components in a graph)

[Figure: weighted graph on the points A, B, C, D, E drawn at successive distance thresholds, with the corresponding dendrogram over A, B, C, D, E; axis: threshold level]
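The single link idea can be sketched as finding connected components of the graph that keeps only edges no longer than a threshold. The edge weights below are hypothetical (the figure's weights are not recoverable from the transcript), and `single_link_clusters` is an illustrative name.

```python
def single_link_clusters(points, distances, threshold):
    """Connected components of the graph with edges of weight <= threshold."""
    parent = {p: p for p in points}

    def find(p):
        # Follow parent links to the component representative.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())

points = ["A", "B", "C", "D", "E"]
# Hypothetical distances; pairs not listed are treated as farther than any threshold used.
distances = {("A", "B"): 1, ("C", "D"): 1, ("B", "C"): 2,
             ("A", "E"): 3, ("D", "E"): 3}
for level in range(4):
    print(level, single_link_clusters(points, distances, level))
```

Each printed line corresponds to one horizontal cut of a dendrogram at that threshold level.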

• Complete Link Technique (looks for cliques: maximal graphs in which there is an edge between any two vertices)

[Figure: weighted graph on the points A, B, C, D, E at distance thresholds 1, 2, 3, 4, …, with the corresponding dendrogram]

Dendrogram levels (threshold, clusters):
(0, {E}, {A}, {B}, {C}, {D})
(1, {AB}, {DC}, {E})
(3, {EAB}, {DC})
(5, {EABCD})
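A naive agglomerative sketch of the complete link technique: two clusters are merged only while the largest pairwise distance between them stays within the threshold, so every cluster remains a clique at that threshold. The one-dimensional positions are hypothetical and do not reproduce the figure; `complete_link_clusters` is an illustrative name.

```python
from itertools import combinations

def complete_link_clusters(positions, threshold):
    """Merge clusters while the complete-link distance is <= threshold."""
    clusters = [frozenset([name]) for name in positions]

    def link(c1, c2):
        # Complete link: the largest distance between members of the two clusters.
        return max(abs(positions[a] - positions[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(c1, c2) > threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)

positions = {"A": 0, "B": 1, "C": 5, "D": 6}  # hypothetical points on a line
print(complete_link_clusters(positions, 1))  # [['A', 'B'], ['C', 'D']]
print(complete_link_clusters(positions, 5))  # [['A', 'B'], ['C', 'D']]
print(complete_link_clusters(positions, 6))  # [['A', 'B', 'C', 'D']]
```

At threshold 5 a single link method would already join everything (B and C are 4 apart), while complete link waits until the farthest pair, A and D, is within the threshold.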

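Partitioning Algorithms

Minimum Spanning Tree (MST)
• Given: n points and k, the number of clusters.
• Algorithm:
  • Start with the complete graph on the n points and build its minimum spanning tree.
  • Remove the largest inconsistent edge (an edge whose weight is much larger than the average weight of all adjacent edges).
  • Repeat; removing k − 1 edges leaves k clusters.

[Figure: example edge weights 5, 6, 10, 12, 80]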
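A simplified sketch of MST-based partitioning. It builds the minimum spanning tree with Prim's algorithm and then removes the k − 1 heaviest tree edges; using "heaviest" in place of the "inconsistent edge" test is a simplifying assumption, as are the one-dimensional sample points and the function names.

```python
def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        weight, i, j = min(
            (abs(points[i] - points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((weight, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    """Drop the k - 1 heaviest MST edges and return the remaining components."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    label = list(range(len(points)))
    for _, i, j in kept:
        old, new = label[j], label[i]
        label = [new if l == old else l for l in label]  # merge components
    groups = {}
    for index, l in enumerate(label):
        groups.setdefault(l, []).append(points[index])
    return sorted(groups.values())

points = [0, 5, 11, 21, 33, 113]  # consecutive gaps: 5, 6, 10, 12, 80
print(mst_clusters(points, 2))  # [[0, 5, 11, 21, 33], [113]]
print(mst_clusters(points, 3))  # [[0, 5, 11, 21], [33], [113]]
```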

Squared Error
• Cluster: $K_i = \{t_{i1}, \ldots, t_{iN}\}$
• Center of cluster: $C_i$
• Squared error of a cluster: $SE_{K_i} = \sum_{j=1}^{N} \| t_{ij} - C_i \|^2$
• Collection of clusters: $K = \{K_1, \ldots, K_k\}$
• Squared error for K: $SE_K = \sum_{i=1}^{k} SE_{K_i}$

Squared error clustering algorithm
• Given: k, the number of clusters, and a threshold.
• Algorithm:
  • Choose k points randomly (called centers).
  • Repeat:
    • Assign each item to the cluster which has the closest center.
    • Calculate the new center for each cluster.
    • Calculate the squared error.
  • Until the difference between the old error and the new one is below the specified threshold.
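A minimal sketch of the squared error algorithm, with the random choice of centers done once before the loop. Euclidean distance, the seeded random generator, and keeping the old center when a cluster becomes empty are assumptions; the names `squared_error` and `k_means` are illustrative.

```python
import math
import random

def squared_error(clusters, centers):
    """Sum over clusters of squared distances from items to their center."""
    return sum(math.dist(t, c) ** 2
               for c, items in zip(centers, clusters) for t in items)

def k_means(items, k, threshold=1e-9, seed=0):
    rng = random.Random(seed)          # seeded only to make runs repeatable
    centers = rng.sample(items, k)     # choose k items as initial centers
    old_error = math.inf
    while True:
        # Assign each item to the cluster with the closest center.
        clusters = [[] for _ in range(k)]
        for t in items:
            nearest = min(range(k), key=lambda i: math.dist(t, centers[i]))
            clusters[nearest].append(t)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        error = squared_error(clusters, centers)
        if old_error - error < threshold:
            return clusters, centers, error
        old_error = error

items = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0),
         (101.0, 1.0), (102.0, 1.0), (103.0, 1.0)]
clusters, centers, error = k_means(items, 2)
print(sorted(centers), error)
```

On this well-separated sample the loop settles on the centers (2, 1) and (102, 1) with a squared error of 4, whichever two items are drawn initially.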

CURE (Clustering Using Representatives)
• Idea: handle clusters of different shapes.
• Algorithm:
  • A constant number of points is chosen from each cluster.
  • These points are shrunk toward the cluster's centroid.
  • The clusters with the closest pair of representative points are merged.
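A sketch of the representative-point step in CURE. The slides do not say how the points are chosen or how far they are shrunk, so two assumptions are made here: representatives are picked by a farthest-point heuristic, and each is moved a fraction `alpha` of the way toward the centroid. All names are illustrative.

```python
import math

def centroid(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def representatives(cluster, count, alpha):
    """Pick up to `count` well-scattered points and shrink them by `alpha`."""
    center = centroid(cluster)
    # Assumption: start with the point farthest from the centroid, then keep
    # adding the point farthest from those already chosen.
    chosen = [max(cluster, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(count, len(cluster)):
        chosen.append(max((p for p in cluster if p not in chosen),
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    # Shrink each chosen point toward the centroid.
    return [tuple(x + alpha * (c - x) for x, c in zip(p, center)) for p in chosen]

def cure_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)

a = representatives([(0.0, 0.0), (4.0, 0.0)], count=2, alpha=0.5)
b = representatives([(10.0, 0.0), (14.0, 0.0)], count=2, alpha=0.5)
print(a)                    # [(1.0, 0.0), (3.0, 0.0)]
print(b)                    # [(11.0, 0.0), (13.0, 0.0)]
print(cure_distance(a, b))  # 8.0
```

A merge step would pick the pair of clusters with the smallest `cure_distance` and recompute representatives for the merged cluster.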
