Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011
Roadmap • Clustering • Motivation & Applications • Clustering Approaches • Evaluation
Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications: • Exploratory data analysis • Document clustering • Language modeling • Generalization for class-based LMs • Unsupervised Word Sense Disambiguation • Automatic thesaurus creation • Unsupervised Part-of-Speech Tagging • Speaker clustering, …
Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters • Topic clustering: documents on the same topic • OWS, debt supercommittee, Seattle Marathon, Black Friday, …
Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters: (from NYT) • ballot, polls, Gov, seats • profit, finance, payments • NFL, Reds, Sox, inning, quarterback, scored, score • researchers, science • Scott, Mary, Barbara, Edward
Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? • How can we improve NLP with clustering? Due to F. Xia
Similarity • Between two instances • Between an instance and a cluster • Between clusters
Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: d(x,y) = √( Σi (xi − yi)² ) • Manhattan distance: d(x,y) = Σi |xi − yi| • Cosine similarity: cos(x,y) = Σi xi·yi / (‖x‖ ‖y‖)
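The three measures above can be sketched directly from their definitions; this is a minimal illustration, not code from the course:

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # dot product normalized by the two vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)
```

Note that the first two are distances (smaller = more similar) while cosine is a similarity (larger = more similar), which matters when plugging them into a clustering algorithm.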
Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster • Soft: Allows degrees of membership and membership in more than one cluster • Often probability distribution over cluster membership
Hierarchical vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering: • Fairly efficient • Simple baseline algorithm: K-means • Probabilistic models use the EM algorithm
Clustering Algorithms • Flat clustering: • K-means clustering • K-medoids clustering • Hierarchical clustering: • Greedy, bottom-up clustering
K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest • Recompute cluster centroids • Mean of instances in the cluster
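The initialize/assign/recompute loop above can be sketched as follows; a minimal illustration assuming numeric tuples and squared Euclidean distance, not the course's reference implementation:

```python
import random

def k_means(instances, k, max_iters=100):
    # initialize: randomly select k instances as the initial centroids
    centroids = random.sample(instances, k)
    clusters = []
    for _ in range(max_iters):
        # assignment step: each instance goes to the nearest centroid
        clusters = [[] for _ in range(k)]
        for x in instances:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(x, centroids[j])))
            clusters[nearest].append(x)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])  # keep old centroid if empty
        if new_centroids == centroids:  # centroids stable: clusters stopped changing
            break
        centroids = new_centroids
    return clusters, centroids
```

Because the initial centroids are random, different runs can converge to different local optima, which is one of the issues noted on the next slide.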
K-Means • Running time: • O(kn) per iteration, where n is the number of instances and k the number of clusters • Converges in a finite number of steps • Issues: • Need to pick the number of clusters k • Can find only a local optimum • Sensitive to outliers • Requires Euclidean distance (a computable mean): • What about enumerable classes (e.g. colors)?
Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element p of cluster c, compute: f(p) = (1 / (|c| − 1)) Σ q∈c, q≠p sim(p, q) • Select the element with the highest f(p)
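Computing f(p) and taking the arg max is a few lines; a minimal sketch assuming a pairwise similarity function `sim` passed in by the caller:

```python
def find_medoid(cluster, sim):
    # the medoid of a single-element cluster is that element
    if len(cluster) == 1:
        return cluster[0]

    def f(p):
        # f(p): average similarity of p to the other elements of the cluster
        # (q != p compares values; assumes elements are distinct)
        others = [q for q in cluster if q != p]
        return sum(sim(p, q) for q in others) / len(others)

    # select the element with the highest f(p)
    return max(cluster, key=f)
```

Unlike a centroid, the medoid is always an actual member of the cluster, so it only needs a similarity function, not a vector space with a mean.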
K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid • Recompute the medoid for each cluster
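The K-medoids loop mirrors K-means, swapping the mean for the medoid; a minimal self-contained sketch assuming a caller-supplied similarity function, not the course's reference implementation:

```python
import random

def k_medoids(instances, k, sim, max_iters=100):
    # initialize: select k instances at random as medoids
    medoids = random.sample(instances, k)
    clusters = []
    for _ in range(max_iters):
        # assign each instance to the cluster with the most similar medoid
        clusters = [[] for _ in range(k)]
        for x in instances:
            best = max(range(k), key=lambda j: sim(x, medoids[j]))
            clusters[best].append(x)
        # recompute each medoid: the element with the highest total
        # similarity to the rest of its cluster
        new_medoids = []
        for j, c in enumerate(clusters):
            if not c:
                new_medoids.append(medoids[j])  # keep old medoid if cluster empty
            else:
                new_medoids.append(
                    max(c, key=lambda p: sum(sim(p, q) for q in c if q != p)))
        if new_medoids == medoids:  # no change: converged
            break
        medoids = new_medoids
    return clusters, medoids
```

Because only similarities are needed, this handles the enumerable-class case (e.g. colors) that plain K-means cannot.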
Greedy, Bottom-Up Hierarchical Clustering • Initialize: • Make an individual cluster for each instance
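Starting from one singleton cluster per instance, the greedy step repeatedly merges the most similar pair of clusters; a minimal sketch, assuming average-link cluster similarity (single- or complete-link are equally common choices not specified by the slide):

```python
def agglomerative(instances, sim, target_k=1):
    # initialize: make an individual cluster for each instance
    clusters = [[x] for x in instances]

    def cluster_sim(a, b):
        # average pairwise similarity between the two clusters (average-link)
        return sum(sim(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > target_k:
        # greedily merge the most similar pair of clusters
        i, j = max(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording each merge (rather than just the final partition) yields the hierarchy (dendrogram); the naive all-pairs search here is what makes hierarchical clustering computationally expensive compared to flat methods.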