Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011
Roadmap • Clustering • Motivation & Applications • Clustering Approaches • Evaluation
Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications: • Exploratory data analysis • Document clustering • Language modeling • Generalization for class-based LMs • Unsupervised Word Sense Disambiguation • Automatic thesaurus creation • Unsupervised Part-of-Speech Tagging • Speaker clustering, …
Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters • Topic clustering: documents on the same topic • OWS, debt supercommittee, Seattle Marathon, Black Friday, …
Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters: (from NYT) • ballot, polls, Gov, seats • profit, finance, payments • NFL, Reds, Sox, inning, quarterback, scored, score • researchers, science • Scott, Mary, Barbara, Edward
Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? • How can we improve NLP with clustering? Due to F. Xia
Similarity • Between two instances • Between an instance and a cluster • Between clusters
Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: d(x,y) = √( Σi (xi − yi)² ) • Manhattan distance: d(x,y) = Σi |xi − yi| • Cosine similarity: cos(x,y) = Σi xi·yi / (‖x‖ ‖y‖)
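The three measures above can be sketched directly from their definitions; this is a minimal illustration, not code from the course:

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # dot product normalized by the two vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)
```

Note that the first two are distances (smaller = more similar) while cosine is a similarity (larger = more similar), which matters when plugging them into a clustering algorithm.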
Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster • Soft: Allows degrees of membership and membership in more than one cluster • Often probability distribution over cluster membership
Hierarchical vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering: • Fairly efficient • Simple baseline algorithm: K-means • Probabilistic models use the EM algorithm
Clustering Algorithms • Flat clustering: • K-means clustering • K-medoids clustering • Hierarchical clustering: • Greedy, bottom-up clustering
K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest • Recompute cluster centroids • Mean of instances in the cluster
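The initialize/assign/recompute loop above can be sketched as follows; a minimal illustration assuming numeric tuples and squared Euclidean distance, not the course's reference implementation:

```python
import random

def k_means(instances, k, max_iters=100):
    # initialize: randomly select k instances as the initial centroids
    centroids = random.sample(instances, k)
    clusters = []
    for _ in range(max_iters):
        # assignment step: each instance goes to the nearest centroid
        clusters = [[] for _ in range(k)]
        for x in instances:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(x, centroids[j])))
            clusters[nearest].append(x)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])  # keep old centroid if empty
        if new_centroids == centroids:  # centroids stable: clusters stopped changing
            break
        centroids = new_centroids
    return clusters, centroids
```

Because the initial centroids are random, different runs can converge to different local optima, which is one of the issues noted on the next slide.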
K-Means • Running time: • O(kn) per iteration, where n is the number of instances and k the number of clusters • Converges in a finite number of steps • Issues: • Need to pick the number of clusters k • Can find only a local optimum • Sensitive to outliers • Requires Euclidean distance (a computable mean): • What about enumerable classes (e.g. colors)?
Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element p of cluster c, compute: f(p) = (1 / (|c| − 1)) Σ q∈c, q≠p sim(p, q) • Select the element with the highest f(p)
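Computing f(p) and taking the arg max is a few lines; a minimal sketch assuming a pairwise similarity function `sim` passed in by the caller:

```python
def find_medoid(cluster, sim):
    # the medoid of a single-element cluster is that element
    if len(cluster) == 1:
        return cluster[0]

    def f(p):
        # f(p): average similarity of p to the other elements of the cluster
        # (q != p compares values; assumes elements are distinct)
        others = [q for q in cluster if q != p]
        return sum(sim(p, q) for q in others) / len(others)

    # select the element with the highest f(p)
    return max(cluster, key=f)
```

Unlike a centroid, the medoid is always an actual member of the cluster, so it only needs a similarity function, not a vector space with a mean.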
K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid • Recompute the medoid for each cluster
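The K-medoids loop mirrors K-means, swapping the mean for the medoid; a minimal self-contained sketch assuming a caller-supplied similarity function, not the course's reference implementation:

```python
import random

def k_medoids(instances, k, sim, max_iters=100):
    # initialize: select k instances at random as medoids
    medoids = random.sample(instances, k)
    clusters = []
    for _ in range(max_iters):
        # assign each instance to the cluster with the most similar medoid
        clusters = [[] for _ in range(k)]
        for x in instances:
            best = max(range(k), key=lambda j: sim(x, medoids[j]))
            clusters[best].append(x)
        # recompute each medoid: the element with the highest total
        # similarity to the rest of its cluster
        new_medoids = []
        for j, c in enumerate(clusters):
            if not c:
                new_medoids.append(medoids[j])  # keep old medoid if cluster empty
            else:
                new_medoids.append(
                    max(c, key=lambda p: sum(sim(p, q) for q in c if q != p)))
        if new_medoids == medoids:  # no change: converged
            break
        medoids = new_medoids
    return clusters, medoids
```

Because only similarities are needed, this handles the enumerable-class case (e.g. colors) that plain K-means cannot.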
Greedy, Bottom-Up Hierarchical Clustering • Initialize: • Make an individual cluster for each instance
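Starting from one singleton cluster per instance, the greedy step repeatedly merges the most similar pair of clusters; a minimal sketch, assuming average-link cluster similarity (single- or complete-link are equally common choices not specified by the slide):

```python
def agglomerative(instances, sim, target_k=1):
    # initialize: make an individual cluster for each instance
    clusters = [[x] for x in instances]

    def cluster_sim(a, b):
        # average pairwise similarity between the two clusters (average-link)
        return sum(sim(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > target_k:
        # greedily merge the most similar pair of clusters
        i, j = max(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording each merge (rather than just the final partition) yields the hierarchy (dendrogram); the naive all-pairs search here is what makes hierarchical clustering computationally expensive compared to flat methods.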