Learn about clustering, exploratory data analysis, and hierarchical and non-hierarchical clustering algorithms. Understand the differences between clustering and classification and explore key methods such as K-means and the EM algorithm.
Natural Language Processing: Clustering (July 2002)
Clustering • Partition a set of objects into groups or clusters. • Similar objects are placed in the same group and dissimilar objects in different groups. • Objects are described and clustered using a set of features and values.
Exploratory Data Analysis (EDA) • Develop a probabilistic model for a problem. • Understand the characteristics of the data.
Generalization • Induce bins from the data, e.g. days of the week (Monday, Tuesday, …, Sunday), even when some values such as Friday have no entries. • Learn natural relationships in the data. • Group objects into clusters and generalize from what we know about some members of the cluster.
Clustering (vs. Classification) • Clustering does not require training data and is hence called unsupervised. Classification is supervised and requires a set of labeled training instances for each group. • The result of clustering depends only on natural divisions in the data, not on any pre-existing categorization scheme as in classification.
Types of Clustering • Hierarchical • Bottom Up: • Start with objects and group most similar ones. • Top down: • Start with all objects and divide into groups so as to maximize within-group similarity. • Single-link, complete-link, group-average • Non-hierarchical • K-means • EM-algorithm • Hard (1:1) vs soft (1:n – degree of membership)
Hierarchical Clustering • Bottom-up: • Start with a separate cluster for each object. • Determine the two most similar clusters and merge them into a new cluster. Repeat on the new clusters that have been formed. • Terminate when one large cluster containing all objects has been formed. An example similarity measure (cosine similarity) is used in the sketch below.
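A minimal sketch of the bottom-up procedure in Python; cosine similarity as the example similarity measure and single-link merging are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def cosine_sim(a, b):
    """Example similarity measure: cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bottom_up_cluster(vectors, sim=cosine_sim):
    """Agglomerative clustering: start with one cluster per object and
    repeatedly merge the two most similar clusters (single-link here)."""
    clusters = [[i] for i in range(len(vectors))]   # each object starts alone
    history = []                                    # record of merges
    while len(clusters) > 1:
        # find the pair of clusters with the highest single-link similarity
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]     # merge cluster b into a
        del clusters[b]
    return history                                  # terminates with one cluster

# Usage: cluster four 2-d points; the two similar pairs merge first.
points = [np.array(p, float) for p in [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]]
for left, right, s in bottom_up_cluster(points):
    print(left, "+", right, f"(sim={s:.2f})")
```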
Hierarchical Clustering (Cont.) • Top-down • Start from a cluster of all objects • Iteratively determine the cluster that is least coherent and split it. • Repeat until all clusters have one object.
Similarity Measures for Hierarchical Clustering • Single-link • Similarity of two most similar members • Complete-link • Similarity of two least similar members • Group-average • Average similarity between members
Single-Link • Similarity function focuses on local coherence
Complete-Link • Similarity function focuses on global cluster quality
Group-Average • The merge criterion is the average similarity between members, rather than the greatest similarity (single-link) or the least similarity (complete-link) between elements of the clusters. • A compromise between single-link and complete-link clustering; see the sketch below.
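A hypothetical helper contrasting the three merge criteria; the precomputed pairwise similarity matrix `sim`, the function name, and the toy values are assumptions for illustration:

```python
import numpy as np

def cluster_similarity(sim, cluster_a, cluster_b, linkage="group-average"):
    """Similarity between two clusters (index lists) under three linkage criteria.
    `sim` is an n x n matrix of precomputed pairwise object similarities."""
    pairwise = [sim[i, j] for i in cluster_a for j in cluster_b]
    if linkage == "single":          # two most similar members (local coherence)
        return max(pairwise)
    if linkage == "complete":        # two least similar members (global quality)
        return min(pairwise)
    if linkage == "group-average":   # compromise: average over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

# Usage with a toy 4-object similarity matrix.
sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
for linkage in ("single", "complete", "group-average"):
    print(linkage, cluster_similarity(sim, [0, 1], [2, 3], linkage))
```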
An Application: Language Model • Many rare events do not have enough training data for accurate probabilistic modeling. • Clustering is used to improve the language model by way of generalization. • Predictions for rare events become more accurate.
Non-Hierarchical Clustering • Start out with a partition based on randomly selected seeds and then refine this initial partition. • Several passes of reallocating objects are needed (hierarchical algorithms need only one pass). • Hierarchical clusterings too can be improved using reallocations. • Stop based on some measure of goodness or cluster quality. • Heuristic: number of clusters, size of clusters, stopping criteria… • No optimal solution.
K-Means • Hard clustering algorithm • Defines clusters by the center of mass of their members • Define initial center of clusters randomly • Assign each object to the cluster whose center is closest • Recompute the center for each cluster • Stop when centers do not change
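A minimal K-means sketch following the four steps above; the random seed selection, Euclidean distance, and the exact convergence test are assumed details, not specified by the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Hard clustering: each object belongs to exactly one cluster,
    defined by the center of mass (mean) of its members."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(max_iter):
        # assign each object to the cluster whose center is closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the center of each cluster as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when centers do not change
            break
        centers = new_centers
    return labels, centers

# Usage: two well-separated 2-d blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, centers = k_means(X, k=2)
print(centers)
```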
The EM Algorithm • “Soft” version of K-means clustering.
EM (Cont.) • Determine the most likely estimates for the parameters of the distribution. • The idea is that the data are generated by several underlying causes. • Z: the unobserved data set, Z = {z_1, …, z_n}, with z_i = (z_{i1}, z_{i2}, …, z_{ik}), where z_{ij} = 1 if object i is a member of cluster j and 0 otherwise. • X: the observed data set (the data to be clustered), X = {x_1, …, x_n}, with x_i = (x_{i1}, x_{i2}, …, x_{im}). • Estimate the model that generated this data.
EM (Cont.) • We assume that the data are generated by k Gaussians (k clusters). • Each Gaussian, with mean \mu_j and covariance \Sigma_j, has the density:
$$ n_j(x; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m \, |\Sigma_j|}} \exp\!\left[ -\tfrac{1}{2}\,(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right] $$
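A direct, illustrative transcription of this density in Python (the function name is ours, not from the slides):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density n_j(x; mu_j, Sigma_j) as given above."""
    m = len(x)
    diff = x - mu
    norm_const = 1.0 / np.sqrt(((2 * np.pi) ** m) * np.linalg.det(sigma))
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Usage: density of a 2-d point under a unit-covariance Gaussian at the origin.
print(gaussian_density(np.array([0.5, -0.5]), np.zeros(2), np.eye(2)))
```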
EM (Cont.) • We find the maximum likelihood model of the form:
$$ P(x_i) = \sum_{j=1}^{k} \pi_j \, n_j(x_i; \mu_j, \Sigma_j) $$
where \pi_j is the weight of each Gaussian.
EM (Cont.) • Parameters are found by maximizing the log likelihood:
$$ l(X \mid \Theta) = \log \prod_{i=1}^{n} P(x_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \, n_j(x_i; \mu_j, \Sigma_j) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, n_j(x_i; \mu_j, \Sigma_j) $$
• \Theta = (\theta_1, \ldots, \theta_k)^T, where the individual parameters of the Gaussian mixture are \theta_j = (\mu_j, \Sigma_j, \pi_j).
EM (Cont.) • The EM algorithm is an iterative solution to the following circular statements: • Estimate: if we knew the value of \Theta, we could compute the expected values of the hidden structure of the model. • Maximize: if we knew the expected values of the hidden structure of the model, we could compute the maximum likelihood value of \Theta.
EM (Cont.) • Expectation step (E-step):
$$ h_{ij} = E(z_{ij} \mid x_i; \bar{\Theta}) = \frac{P(x_i \mid n_j; \bar{\Theta})}{\sum_{l=1}^{k} P(x_i \mid n_l; \bar{\Theta})} $$
• Maximization step (M-step):
$$ \tilde{\mu}_j = \frac{\sum_{i=1}^{n} h_{ij} \, x_i}{\sum_{i=1}^{n} h_{ij}} \qquad \tilde{\Sigma}_j = \frac{\sum_{i=1}^{n} h_{ij} \, (x_i - \tilde{\mu}_j)(x_i - \tilde{\mu}_j)^T}{\sum_{i=1}^{n} h_{ij}} \qquad \tilde{\pi}_j = \frac{\sum_{i=1}^{n} h_{ij}}{n} $$
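Putting the E-step and M-step together, a sketch of EM for a mixture of k Gaussians; the initialization, fixed iteration count, the small regularization term, and the use of scipy.stats.multivariate_normal are assumptions made for the sketch, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iter=50, seed=0):
    """Soft clustering with EM: h[i, j] is the expected value of the hidden
    indicator z_ij, i.e. the degree to which object i belongs to cluster j."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)].copy()       # initial means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(m) for _ in range(k)])
    pi = np.full(k, 1.0 / k)                                   # initial weights
    for _ in range(n_iter):
        # E-step: h_ij proportional to pi_j * n_j(x_i; mu_j, Sigma_j)
        dens = np.column_stack(
            [pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
             for j in range(k)])
        h = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_j, Sigma_j and pi_j from the soft assignments
        for j in range(k):
            w = h[:, j]
            mu[j] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mu[j]
            sigma[j] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(m)
            pi[j] = w.sum() / n
    return h, mu, sigma, pi

# Usage: soft memberships for two overlapping 2-d blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
h, mu, sigma, pi = em_gaussian_mixture(X, k=2)
print(np.round(h[:3], 3))
print(mu)
```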
Properties of Hierarchical and Non-Hierarchical Clustering • Hierarchical: preferable for detailed data analysis; provides more information than flat clustering; no single best algorithm (dependent on the application); less efficient than flat clustering (for n objects, an n x n similarity matrix is required). • Non-hierarchical (flat): preferable if efficiency is a consideration or the data sets are very large; K-means is the conceptually simplest method; K-means assumes a simple Euclidean representation space and so cannot be used for many data sets; in such cases the EM algorithm is chosen.