## Clustering


*Shallow Processing Techniques for NLP, Ling570, November 30, 2011*

## Roadmap

- Clustering
  - Motivation & applications
  - Clustering approaches
  - Evaluation

## Clustering

- Task: given a set of objects, create a set of clusters over those objects
- Applications:
  - Exploratory data analysis
  - Document clustering
  - Language modeling: generalization for class-based LMs
  - Unsupervised word sense disambiguation
  - Automatic thesaurus creation
  - Unsupervised part-of-speech tagging
  - Speaker clustering, …

## Example: Document Clustering

- Input: set of individual documents
- Output: sets of document clusters
- Many different types of clustering:
  - Category: news, sports, weather, entertainment
  - Genre clustering: similar styles, e.g. blogs, tweets, newswire
  - Author clustering
  - Language ID: language clusters
  - Topic clustering: documents on the same topic (OWS, debt supercommittee, Seattle Marathon, Black Friday, …)

## Example: Word Clustering

- Input: words
  - Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats
- Output: word clusters
- Example clusters (from NYT):
  - ballot, polls, Gov, seats
  - profit, finance, payments
  - NFL, Reds, Sox, inning, quarterback, scored, score
  - researchers, science
  - Scott, Mary, Barbara, Edward

## Questions

- What should a cluster represent?
  - Similarity among objects
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?

*Due to F. Xia*
## Similarity

- Between two instances
- Between an instance and a cluster
- Between clusters

## Similarity Measures

Given x = (x₁, x₂, …, xₙ) and y = (y₁, y₂, …, yₙ):

- Euclidean distance: d(x, y) = sqrt( Σᵢ (xᵢ - yᵢ)² )
- Manhattan distance: d(x, y) = Σᵢ |xᵢ - yᵢ|
- Cosine similarity: cos(x, y) = Σᵢ xᵢyᵢ / ( ||x|| · ||y|| )

## Types of Clustering

- Flat vs. hierarchical clustering:
  - Flat: partition the data into k clusters
  - Hierarchical: nodes form a hierarchy
- Hard vs. soft clustering:
  - Hard: each object is assigned to exactly one cluster
  - Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership
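The three measures above can be written directly from their definitions. A minimal sketch in plain Python (not from the slides; real systems would use NumPy):

```python
import math

def euclidean(x, y):
    # sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # dot product divided by the product of the vector norms
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
print(cosine((1, 0), (0, 1)))     # 0.0 (orthogonal vectors)
```

Note that the first two are distances (smaller = more similar) while cosine is a similarity (larger = more similar), which matters when plugging them into a clustering algorithm.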
## Hierarchical vs. Flat

- Hierarchical clustering:
  - More informative
  - Good for data exploration
  - Many algorithms, none good for all data
  - Computationally expensive
- Flat clustering:
  - Fairly efficient
  - Simple baseline algorithm: K-means
  - Probabilistic models use the EM algorithm

## Clustering Algorithms

- Flat clustering:
  - K-means clustering
  - K-medoids clustering
- Hierarchical clustering:
  - Greedy, bottom-up clustering

## K-Means Clustering

- Initialize:
  - Randomly select k initial centroids
    - A centroid is the center (mean) of a cluster
- Iterate until clusters stop changing:
  - Assign each instance to the nearest cluster
    - A cluster is nearest if its centroid is nearest
  - Recompute each cluster centroid
    - Mean of the instances in the cluster

## K-Means

- Running time: O(kn) per iteration, where n is the number of instances and k the number of clusters
- Converges in a finite number of steps
- Issues:
  - Need to pick the number of clusters k
  - Can find only a local optimum
  - Sensitive to outliers
  - Requires Euclidean distance (a mean must be computable): what about enumerable classes (e.g. colors)?
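The initialize/assign/recompute loop above can be sketched in pure Python for 2-D points. This is a toy illustration under assumed tuple inputs, not a production implementation (scikit-learn's `KMeans` would be used in practice):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy K-means for 2-D points: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # randomly select k initial centroids
    assign = None
    for _ in range(iters):
        # Assignment step: each instance goes to the nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:               # clusters stopped changing
            break
        assign = new_assign
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:                        # keep the old centroid if a cluster empties
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assign

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, assign = kmeans(points, 2)
```

On this data the two tight groups end up in separate clusters; with a different seed the algorithm can still converge to a poorer local optimum, which is the sensitivity noted above.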
## Medoid

- Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster
- Finding the medoid:
  - For each element p, compute f(p) = the average similarity of p to the other elements in the cluster
  - Select the element with the highest f(p)

## K-Medoids

- Initialize:
  - Select k instances at random as medoids
- Iterate until no changes:
  - Assign instances to the cluster with the nearest medoid
  - Recompute the medoid for each cluster

## Greedy, Bottom-Up Hierarchical Clustering

- Initialize:
  - Make an individual cluster for each instance
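The K-medoids loop above can be sketched much like K-means, with the medoid step in place of the mean. A minimal sketch (assumed 2-D tuple inputs; the slide's f(p) uses similarity, here we equivalently minimize average distance):

```python
import math
import random

def medoid(cluster):
    # The medoid minimizes total distance to the cluster's elements
    # (equivalently, maximizes average similarity f(p)).
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

def kmedoids(points, k, iters=100, seed=0):
    """Toy K-medoids for 2-D points: returns (medoids, assignments)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # k random instances as initial medoids
    assign = None
    for _ in range(iters):
        # Assign each instance to the cluster with the nearest medoid.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, medoids[c]))
            for p in points
        ]
        if new_assign == assign:               # no changes: stop
            break
        assign = new_assign
        # Recompute the medoid for each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                medoids[c] = medoid(members)
    return medoids, assign

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
meds, assign = kmedoids(points, 2)
```

Unlike K-means centroids, medoids are always actual data points and need only pairwise distances or similarities, which is why K-medoids also works for enumerable classes where a mean is undefined.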