Clustering Methods


  1. Clustering Methods Moses Charikar Computer Science

  2. Clustering • Partition data items into groups (clusters) • Similar data items in the same group • Dissimilar items in different groups • Unsupervised learning: groups not known a priori Clustering Methods, Moses Charikar

  4. Issues • Data representation • Definition of similarity/distance measure • Clustering procedure • Data abstraction • Cluster validation

  5. Data Representation • Quantitative features • Continuous values • Discrete values • Qualitative features • Nominal (unordered) • Ordinal

  6. Data Representation • d-dimensional points

  7. Distance Measures • Measure of dissimilarity • $\ell_1$, $\ell_2$, $\ell_p$ norms
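
As a rough illustration of the norms listed on this slide, here is a minimal Python sketch of the $\ell_p$ (Minkowski) distance between two d-dimensional points. The function name `lp_distance` and its parameters are hypothetical and chosen only for this example; `p = 1` gives the $\ell_1$ distance, `p = 2` the Euclidean distance, and `p = math.inf` the maximum coordinate difference.

```python
import math


def lp_distance(x, y, p=2.0):
    """Return the l_p (Minkowski) distance between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    if p < 1:
        raise ValueError("p must be >= 1 for l_p to be a norm")
    if math.isinf(p):
        # l_infinity: largest absolute coordinate difference
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

For the points (0, 0) and (3, 4), this gives 7 under $\ell_1$, 5 under $\ell_2$, and 4 under $\ell_\infty$.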

  8. An Impossibility Result • [Kleinberg] • There is no clustering function that satisfies all of: • Scale-invariance: scaling distances does not change result • Richness: all partitions possible • (Refinement) consistency: shrinking distances inside clusters, expanding distances across clusters

  9. Clustering techniques • Agglomerative vs. Divisive • Hard vs. Fuzzy • Incremental vs. Non-incremental

  11. Hierarchical Agglomerative Clustering • Initially, all points in distinct clusters • Maintain distance matrix on clusters • Merge most similar pair of clusters, update distance matrix • Repeat until all points in one cluster • Produces hierarchy of clusters (dendrogram)

  12. Classical hierarchical methods • Single linkage clustering: distance between clusters is the minimum distance between points in the clusters • Complete linkage clustering: distance is the maximum distance between points in the clusters; produces more compact clusters
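
The agglomerative procedure with single or complete linkage can be sketched in a few lines of Python. This is a simplified illustration, not the method from the lecture: the name `agglomerative` and the stopping parameter `k` are assumptions, and the sketch recomputes cluster distances on every round instead of maintaining a distance matrix, which keeps the code short at the cost of speed.

```python
import math


def agglomerative(points, k, linkage="single"):
    """Merge the closest pair of clusters until k clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # single linkage uses the minimum pairwise distance, complete the maximum
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return combine(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Running the loop down to `k = 1` and recording each merge would give the full hierarchy; stopping early at `k` clusters corresponds to cutting that hierarchy at one level.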

  15. Other classical variants • Group Average linkage • Median linkage • Centroid linkage

  16. Modern variants • CURE, ROCK, Chameleon, BIRCH • CURE • finds clusters of arbitrary shapes • maintains multiple cluster representatives • representatives originally selected as scattered points • then shrunk toward the cluster centroid by a parameter (suppresses effect of outliers)

  17. Divisive methods • Divide points into k clusters

  18. Graph-theoretic • Build minimum spanning tree on points • Remove the k-1 longest edges to produce k disjoint connected components • Clusters identical to those produced by single link clustering
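
A minimal sketch of the spanning-tree procedure on this slide, assuming Euclidean distances and Kruskal's algorithm with a small union-find structure. The function name `mst_clusters` is hypothetical, and the sketch builds all pairwise edges, so it is only suitable for small inputs.

```python
import math


def mst_clusters(points, k):
    """Cluster by building an MST and dropping its k-1 longest edges."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    # Kruskal: scan edges by increasing length, keep those joining two components
    mst = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((d, i, j))
    # mst is sorted by length, so dropping the k-1 longest keeps the first n-k
    kept = mst[: max(n - k, 0)]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

On the five points (0,0), (0,1), (5,5), (5,6), (10,0) with `k = 2`, the longest tree edge connects (10,0) to the rest, so that point ends up in its own cluster.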

  20. k-means • k: number of clusters • $n_j$: number of points in jth cluster • $x_{ij}$: ith point in jth cluster • $c_j$: center of jth cluster • Objective: $\min \sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_{ij} - c_j \rVert^2$ • Note: If clusters are fixed, best choice of center is centroid of cluster

  21. k-means • Pick k cluster centers at random • Assign each point to closest center • Recompute cluster centers • Repeat until convergence • Finds local minimum • Initial choice of cluster centers is important
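
The iteration listed on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: initial centers are drawn uniformly from the data points, ties go to the lower-numbered center, and an empty cluster keeps its previous center. The names `kmeans`, `centroid`, `kmeans_cost`, `max_iter`, and `seed` are hypothetical.

```python
import math
import random


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and center updates until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k data points as initial centers
    assignment = None
    for _ in range(max_iter):
        # assign each point to its closest center
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # recompute each center as the centroid of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = centroid(members)
    return centers, assignment


def kmeans_cost(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))
```

Because the result depends on the starting centers, a common practice is to run the procedure with several seeds and keep the run with the lowest `kmeans_cost`.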

  22. k-means tutorial slides by Andrew Moore

  23. Mixture models • Hypothesize that points are generated from a mixture of k Gaussians • Attempt to learn the best mixture of k Gaussians that explains the data • Apply EM (Expectation Maximization) • Informally, similar to k-means with fuzzy assignment of points to clusters
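
To make the "fuzzy assignment" comparison concrete, here is a minimal EM sketch for a mixture of Gaussians, restricted to one-dimensional data to keep it short. The function names, the caller-supplied initial means, the unit initial variances, and the variance floor `min_var` are all assumptions made for this illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D mixture of len(init_means) Gaussians with EM."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k        # assumption: unit starting variances
    weights = [1.0 / k] * k      # assumption: equal starting weights
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (soft assignment)
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # guard against numerical underflow
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances, resp
```

Replacing each row of `resp` with a one-hot vector for its largest entry would turn the E-step into the hard assignment step of k-means.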

  24. Mixture models • Gaussian mixture models tutorial slides by Andrew Moore

  25. Density Based Partitioning • DBSCAN • identify dense connected regions of the space • eps-neighborhood: points within distance eps • core object: point with number of points in its eps-neighborhood > threshold • y is density-reachable from x if there exists a path from x to y through core objects, each point within distance eps of the previous one
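
A minimal sketch of the DBSCAN idea as stated on this slide. It assumes Euclidean distance, counts a point as part of its own eps-neighborhood, and uses the strict `> threshold` test from the slide; library implementations may define the core-point test slightly differently. The names `dbscan`, `eps`, and `threshold` are hypothetical, and the neighbor search is a brute-force scan.

```python
import math


def dbscan(points, eps, threshold):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(points)
    # eps-neighborhood of each point (includes the point itself)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) > threshold for nb in neighbors]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # grow a new cluster from this unvisited core object
        labels[i] = cluster_id
        stack = [i]
        while stack:
            cur = stack.pop()
            for nb in neighbors[cur]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id
                    if is_core[nb]:
                        stack.append(nb)  # only core objects extend the path
        cluster_id += 1
    return labels
```

Points that are neither core objects nor within eps of a core object keep the label -1 and are treated as noise.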

  26. Optimization approaches • Formulate objective function for clustering • View clustering as optimization problem • Find clustering so as to minimize/maximize objective function • Finding optimum solutions is hard! • Design algorithm with approximation guarantee • $\alpha$-approximation: solution returned is within factor $\alpha$ of optimum solution
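
As one concrete example of an algorithm with an approximation guarantee, the sketch below implements the greedy farthest-point heuristic for the k-center objective (minimize the maximum distance from a point to its nearest center). This example is not taken from the slides; it is a well-known method that returns a solution within a factor 2 of the optimum when distances satisfy the triangle inequality. The name `greedy_k_center` and the choice of the first point as the starting center are assumptions.

```python
import math


def greedy_k_center(points, k):
    """Pick k centers greedily; return (centers, max distance to nearest center)."""
    centers = [points[0]]  # assumption: start from the first point
    # distance from every point to its nearest chosen center so far
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # the next center is the point farthest from all current centers
        far = max(range(len(points)), key=lambda i: dist[i])
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers, max(dist)
```

For the points (0,0), (1,0), (10,0), (11,0) and `k = 2`, the sketch picks (0,0) and (11,0) and reports a radius of 1, compared with an optimum of 0.5 when centers may be placed anywhere on the line.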

  27. Factors affecting complexity • number of clusters • distance function on points • Euclidean distances (dimension matters) • arbitrary metric • no triangle inequality • objective function

  28. Clustering objective functions • k-center • max distance to cluster center • k-median • sum of distances to cluster centers • compare to k-means • minsum k-clustering • sum of distances within clusters
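
The objectives on this slide differ only in how distances are aggregated, which a short sketch makes visible. The function names are hypothetical, distances are assumed Euclidean, each point is charged to its nearest center, and the min-sum cost here counts each unordered pair within a cluster once (some formulations count ordered pairs, which doubles the value).

```python
import math


def nearest_distance(p, centers):
    """Distance from point p to its closest center."""
    return min(math.dist(p, c) for c in centers)


def k_center_cost(points, centers):
    """Maximum distance from any point to its nearest center."""
    return max(nearest_distance(p, centers) for p in points)


def k_median_cost(points, centers):
    """Sum of distances from points to their nearest centers."""
    return sum(nearest_distance(p, centers) for p in points)


def k_means_cost(points, centers):
    """Sum of squared distances from points to their nearest centers."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)


def min_sum_cost(clusters):
    """Sum of pairwise distances inside each cluster (no centers involved)."""
    return sum(
        math.dist(c[i], c[j])
        for c in clusters
        for i in range(len(c))
        for j in range(i + 1, len(c))
    )
```

With points (0,0), (0,2), (10,0), (10,4) and centers (0,1), (10,2), the nearest-center distances are 1, 1, 2, 2, giving a k-center cost of 2, a k-median cost of 6, and a k-means cost of 10.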

  29. Graph based clustering • Given graph on items • weights on edges represent similarity (dissimilarity) • Graph partitioning • Divide graph into k pieces, minimize (maximize) weight of edges cut

  30. Correlation clustering • Given judgements of similarity/dissimilarity between pairs of items, i.e. graph with edges labeled + and - • Find partitioning into clusters so that + edges are inside clusters and - edges go across clusters • If labeling is perfect, problem is easy • Maximize agreements with labeling (suitable when optimal solution disagrees with a large number of labels) • Minimize disagreements with labeling (suitable when optimal solution agrees with almost all labels)
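
The two objectives can be illustrated by scoring a candidate clustering against the edge labels. This is only a scoring sketch, not an algorithm for finding a good clustering; the function name `correlation_scores` and the dictionary-based input format are assumptions made for the example.

```python
def correlation_scores(edge_labels, clustering):
    """Count agreements and disagreements between a clustering and +/- labels.

    edge_labels: dict mapping a pair (u, v) to "+" or "-"
    clustering:  dict mapping each item to its cluster id
    """
    agreements = 0
    disagreements = 0
    for (u, v), sign in edge_labels.items():
        same_cluster = clustering[u] == clustering[v]
        # a "+" edge agrees when its endpoints share a cluster,
        # a "-" edge agrees when they are separated
        if (sign == "+") == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements
```

With labels a+b, b+c, a-c, no clustering agrees with all three edges: putting all items together and splitting c off both score 2 agreements and 1 disagreement.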

  31. Other issues • Cluster abstraction • assigning meaning to clusters • Outliers • High dimensional data • Large data set size

  32. Handling Large Data Sets • Random Sampling • Sample points and cluster sample • Streaming Algorithms • Cluster in one pass over data • Compact data summaries • maintain sketches of clusters • sketches of data points • Dimension reduction • singular value decomposition
