
Flat Clustering



Presentation Transcript


  1. Flat Clustering Adapted from Slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti

  2. Today’s Topic: Clustering • Document clustering • Motivations • Document representations • Success criteria • Clustering algorithms • Partitional (Flat) • Hierarchical (Tree)

  3. What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects • Documents within a cluster should be similar. • Documents from different clusters should be dissimilar. • The commonest form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given • A common and important task that finds many applications in IR and other places

  4. A data set with clear cluster structure • How would you design an algorithm for finding the three clusters in this case?

  5. Classification vs. Clustering • Classification: supervised learning • Clustering: unsupervised learning • Classification: Classes are human-defined and part of the input to the learning algorithm. • Clustering: Clusters are inferred from the data without human input. • However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

  6. Applications of clustering in IR

  7. Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering • [Figure: excerpt of the Yahoo! directory tree rooted at www.yahoo.com/Science (30), with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry; botany, cell, evolution; magnetism, relativity; AI, HCI, courses; craft, missions]

  8. Global navigation: Yahoo

  9. Global navigation: MeSH (upper level)

  10. Global navigation: MeSH (lower level)

  11. Navigational hierarchies: Manual vs. automatic creation • Note: Yahoo/MeSH are not examples of clustering. • But they are well-known examples of using a global hierarchy for navigation. • Some examples of global navigation/exploration based on clustering: • Cartia • Themescapes • Google News

  12. Search result clustering for better navigation

  13. [Figure-only slide]

  14. [Figure-only slide]

  15. Google News: automatic clustering gives an effective news presentation metaphor

  16. Selection Metrics • Google News taps into its own unique ranking signals, which include • user clicks, • the estimated authority of a publication in a particular topic (possibly taking location into account), • freshness/recency, • geography, and more.

  17. Scatter/Gather: Cutting, Karger, and Pedersen

  18. Scatter/Gather (“Star”)

  19. S/G Example: query on “star” • Encyclopedia text clusters (sizes shown): 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous • Clustering and re-clustering is entirely automated

  20. Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95 • How it works • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”

  21. For visualizing a document collection and its themes • Wise et al., “Visualizing the non-visual”, PNNL • ThemeScapes, Cartia • [Mountain height = cluster size]

  22. For improving search recall • Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs • Therefore, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • Example: The query “car” will also return docs containing automobile • Because clustering grouped together docs containing car with those containing automobile. Why might this happen?
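As a concrete illustration of using clusters this way, here is a minimal sketch, assuming document-to-cluster assignments were computed a priori; the function names and toy data are illustrative, not from the slides:

```python
def expand_with_cluster(matches, doc_to_cluster, cluster_to_docs):
    """Return the matched docs plus every other doc in their clusters."""
    expanded = set(matches)
    for doc in matches:
        expanded.update(cluster_to_docs[doc_to_cluster[doc]])
    return expanded

# Toy data: the "car" query matches d1, and clustering grouped d1 with
# the "automobile" doc d2, so d2 is returned as well.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}
cluster_to_docs = {0: {"d1", "d2"}, 1: {"d3"}}
print(expand_with_cluster({"d1"}, doc_to_cluster, cluster_to_docs))  # {'d1', 'd2'}
```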

  23. Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small
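To make the representation questions concrete, a small sketch assuming a bag-of-words vector space: raw term counts are weighted by tf-idf and then L2-normalized, so cosine similarity reduces to a dot product (the toy counts are made up):

```python
import numpy as np

counts = np.array([[3, 0, 1],   # doc 0: raw term frequencies over a 3-term vocabulary
                   [0, 2, 2],   # doc 1
                   [1, 1, 0]],  # doc 2
                  dtype=float)

df = np.count_nonzero(counts, axis=0)          # document frequency of each term
idf = np.log(len(counts) / df)                 # inverse document frequency
tfidf = counts * idf                           # tf-idf weighting
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
docs = tfidf / np.where(norms == 0, 1, norms)  # L2-normalized doc vectors
```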

  24. What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • Docs as vectors. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. • We can use cosine similarity (alternatively, Euclidean distance).
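A short sketch of both measures (the vectors are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between doc vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])
print(cosine_similarity(d1, d2))   # 1.0: identical direction
print(euclidean_distance(d1, d2))  # ~2.24: yet not the same point
```

For length-normalized vectors the two are monotonically related (||a - b||² = 2 - 2 cos(a, b)), so ranking by either gives the same order; this is why the two phrasings are interchangeable for many algorithms.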

  25. More Applications of clustering … • Image Processing • Cluster images based on their visual content • Web • Cluster groups of users based on their access patterns on webpages • Cluster webpages based on their content • Bioinformatics • Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality etc.) • …

  26. Outliers • Outliers are objects that do not belong to any cluster or form clusters of very small cardinality • In some applications we are interested in discovering outliers, not clusters (outlier analysis) • [Figure: a point set with one labeled cluster and a few labeled outliers]
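One simple way to operationalize this, sketched under the assumption that cluster centroids are already available (the threshold and names are illustrative): flag any point whose distance to its nearest centroid is too large.

```python
import numpy as np

def find_outliers(points, centroids, threshold):
    """Indices of points farther than `threshold` from every centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.where(dists.min(axis=1) > threshold)[0]

points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.0]])
centroids = np.array([[0.1, 0.05]])
print(find_outliers(points, centroids, threshold=1.0))  # [2]: the far-away point
```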

  27. Clustering Algorithms • Partitional (Flat) algorithms • Usually start with a random (partial) partition • Refine it iteratively • K means clustering • (Model based clustering) • Hierarchical (Tree) algorithms • Bottom-up, agglomerative • (Top-down, divisive)

  28. Hard vs. soft clustering • Hard clustering: Each document belongs to exactly one cluster • More common and easier to do • Soft clustering: A document can belong to more than one cluster. • Makes more sense for applications like creating browsable hierarchies • You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

  29. Partitioning Algorithms • Partitioning method: Construct a partition of n documents into a set of K clusters • Given: a set of documents and the number K • Find: a partition of K clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Effective heuristic methods: K-means and K-medoids algorithms

  30. K-Means • Assumes documents are real-valued vectors. • Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c. • Reassignment of instances to clusters is based on distance to the current cluster centroids. • (Or one can equivalently phrase it in terms of similarities)
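The centroid is simply the coordinate-wise mean of the points assigned to a cluster; a one-function sketch:

```python
import numpy as np

def centroid(cluster_points):
    """Center of gravity (mean vector) of the points in one cluster."""
    return np.mean(cluster_points, axis=0)

c = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]])
print(centroid(c))  # [2. 2.]
```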

  31. K-Means Algorithm
  Select K random docs {s_1, s_2, …, s_K} as seeds.
  Until clustering converges or another stopping criterion is met:
    For each doc d_i:
      Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster c_j:
      s_j = μ(c_j)
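A minimal runnable sketch of the algorithm above, assuming Euclidean distance on real-valued vectors; keeping an emptied cluster’s old centroid is one common convention, not something the slides prescribe:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=None):
    """Flat K-means: random seeds, then alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K random docs as seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assignment step: each doc goes to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned docs;
        # an emptied cluster keeps its old centroid.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```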

  32. K-means algorithm

  33. K-Means Example (K=2) • [Figure: animation of the iterations: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

  34. Worked Example: Set to be clustered

  35. Worked Example: Random selection of initial centroids • Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters

  36. Worked Example: Assign points to closest centroid

  37. Worked Example: Assignment

  38. Worked Example: Recompute cluster centroids

  39. Worked Example: Assign points to closest centroid

  40. Worked Example: Assignment

  41. Worked Example: Recompute cluster centroids

  42. Worked Example: Assign points to closest centroid

  43. Worked Example: Assignment

  44. Worked Example: Recompute cluster centroids

  45. Worked Example: Assign points to closest centroid

  46. Worked Example: Assignment

  47. Worked Example: Recompute cluster centroids

  48. Worked Example: Assign points to closest centroid

  49. Worked Example: Assignment

  50. Worked Example: Recompute cluster centroids
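To tie the worked example together, here is a self-contained trace of the same assign/recompute loop with K = 2; the points and seeds below are made up for illustration:

```python
import numpy as np

# Toy 2-D point set and two seeds picked from the data.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = X[[0, 3]].copy()

for _ in range(10):
    # Assignment: each point goes to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute: each centroid becomes the mean of its assigned points.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):  # no movement: converged
        break
    centroids = new_centroids

print(labels)     # final cluster index of each point
print(centroids)  # final centroids
```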
