Clustering

Presentation Transcript

  1. Clustering: Shallow Processing Techniques for NLP, Ling570, November 30, 2011

  2. Roadmap • Clustering • Motivation & Applications • Clustering Approaches • Evaluation

  3. Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications:

  4. Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications: • Exploratory data analysis • Document clustering • Language modeling • Generalization for class-based LMs • Unsupervised Word Sense Disambiguation • Automatic thesaurus creation • Unsupervised Part-of-Speech Tagging • Speaker clustering, …

  5. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering:

  6. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment

  7. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire

  8. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering

  9. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters

  10. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters • Topic clustering: documents on the same topic • OWS, debt supercommittee, Seattle Marathon, Black Friday..

  11. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters

  12. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters:

  13. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters: (from NYT) • ballot, polls, Gov, seats • profit, finance, payments • NFL, Reds, Sox, inning, quarterback, scored, score • researchers, science • Scott, Mary, Barbara, Edward

  14. Questions • What should a cluster represent? Due to F. Xia

  15. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? Due to F. Xia

  16. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? Due to F. Xia

  17. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? • How can we improve NLP with clustering? Due to F. Xia

  18. Similarity • Between two instances

  19. Similarity • Between two instances • Between an instance and a cluster

  20. Similarity • Between two instances • Between an instance and a cluster • Between clusters

  21. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn)

  22. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance:

  23. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: • Manhattan distance:

  24. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: d(x,y) = sqrt(Σi (xi − yi)²) • Manhattan distance: d(x,y) = Σi |xi − yi| • Cosine similarity: cos(x,y) = (x·y) / (||x|| ||y||)
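
The three measures on this slide can be computed directly; below is a minimal Python sketch (the function names and the tiny example vectors are illustrative, not from the deck):

import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

x, y = (1.0, 2.0, 0.0), (2.0, 0.0, 1.0)
print(euclidean(x, y), manhattan(x, y), cosine(x, y))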

  25. Clustering Algorithms

  26. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters

  27. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy

  28. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster

  29. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster • Soft: Allows degrees of membership and membership in more than one cluster • Often probability distribution over cluster membership

  30. Hierarchical Clustering

  31. Hierarchical Vs. Flat • Hierarchical clustering:

  32. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive

  33. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering:

  34. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering: • Fairly efficient • Simple baseline algorithm: K-means • Probabilistic models use EM algorithm

  35. Clustering Algorithms • Flat clustering: • K-means clustering • K-medoids clustering • Hierarchical clustering: • Greedy, bottom-up clustering

  36. K-Means Clustering • Initialize: • Randomly select k initial centroids

  37. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing

  38. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest

  39. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest • Recompute cluster centroids • Mean of instances in the cluster
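
A minimal Python sketch of the K-means loop described on this slide; the helper names, convergence test, and toy data are illustrative choices, not part of the deck:

import random

def euclidean_sq(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kmeans(instances, k, max_iters=100):
    # Initialize: randomly select k instances as the initial centroids.
    centroids = random.sample(instances, k)
    assignments = None
    for _ in range(max_iters):
        # Assign each instance to the cluster whose centroid is nearest.
        new_assignments = [min(range(k), key=lambda c: euclidean_sq(x, centroids[c]))
                           for x in instances]
        if new_assignments == assignments:  # clusters stopped changing
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of the instances in its cluster.
        for c in range(k):
            members = [x for x, a in zip(instances, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
print(kmeans(data, k=2))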

  40. K-Means: 1 step

  41. K-Means • Running time:

  42. K-Means • Running time: • O(kn) per iteration, where n is the number of instances and k the number of clusters • Converges in a finite number of steps • Issues:

  43. K-Means • Running time: • O(kn) per iteration, where n is the number of instances and k the number of clusters • Converges in a finite number of steps • Issues: • Need to pick the number of clusters k • Can find only a local optimum • Sensitive to outliers • Requires Euclidean distance: • What about enumerable classes (e.g. colors)?
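
A common way to soften the local-optimum and initialization issues, not covered in this excerpt, is to run K-means from several random starts and keep the best run; for example, scikit-learn's KMeans does this through its n_init parameter. A short illustrative usage:

from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]]
# n_init=10 runs 10 random initializations and keeps the solution
# with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.cluster_centers_)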

  44. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster

  45. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element compute:

  46. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element p compute: f(p) = average similarity of p to the other elements in the cluster • Select the element with highest f(p)
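
A minimal Python sketch of medoid selection as defined above. Using negative Euclidean distance as the similarity, and the toy points, are illustrative assumptions:

def neg_euclidean(x, y):
    # Negative distance as a similarity: closer points are "more similar".
    return -sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def find_medoid(cluster, similarity=neg_euclidean):
    # f(i) = average similarity of element i to the other elements of the cluster
    def f(i):
        p = cluster[i]
        sims = [similarity(p, q) for j, q in enumerate(cluster) if j != i]
        return sum(sims) / len(sims) if sims else 0.0
    best = max(range(len(cluster)), key=f)  # index of the element with highest f
    return cluster[best]

print(find_medoid([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]))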

  47. K-Medoids • Initialize: • Select k instances at random as medoids

  48. K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid

  49. K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid • Recompute the medoid for each cluster
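
A minimal Python sketch of the K-medoids loop above, again with Euclidean distance; function names and toy data are illustrative:

import random

def distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def medoid(cluster):
    # Element with the lowest average distance (equivalently, highest average similarity).
    return min(cluster, key=lambda p: sum(distance(p, q) for q in cluster))

def kmedoids(instances, k, max_iters=100):
    # Initialize: select k instances at random as medoids.
    medoids = random.sample(instances, k)
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest medoid.
        clusters = [[] for _ in range(k)]
        for x in instances:
            c = min(range(k), key=lambda c: distance(x, medoids[c]))
            clusters[c].append(x)
        # Recompute the medoid of each cluster; stop when nothing changes.
        new_medoids = [medoid(cl) if cl else medoids[c] for c, cl in enumerate(clusters)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
print(kmedoids(data, k=2))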

  50. Greedy, Bottom-Up Hierarchical Clustering • Initialize: • Make an individual cluster for each instance
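
The transcript ends with this initialization step. Below is a minimal Python sketch of the greedy, bottom-up procedure the slide title names: start with one singleton cluster per instance, then repeatedly merge the two most similar clusters. The single-link similarity, the stop-at-k criterion, and the toy data are illustrative assumptions, not from the deck:

def agglomerative(instances, target_k, cluster_similarity):
    # Initialize: make an individual (singleton) cluster for each instance.
    clusters = [[x] for x in instances]
    # Greedy bottom-up merging: repeatedly merge the two most similar
    # clusters until only target_k clusters remain.
    while len(clusters) > target_k:
        best_pair, best_sim = None, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_similarity(clusters[i], clusters[j])
                if s > best_sim:
                    best_pair, best_sim = (i, j), s
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def single_link(a, b):
    # Single-link similarity: similarity of the closest pair across the two clusters.
    return max(-abs(x - y) for x in a for y in b)

print(agglomerative([1.0, 1.1, 5.0, 5.2, 5.4], target_k=2, cluster_similarity=single_link))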