Clustering and NLP

  1. Clustering and NLP Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille, Andrew Moore, and others

  2. Outline • Clustering Overview • Sample Clustering Techniques for NLP • K-means • Agglomerative • Model-based (EM)

  3. Clustering Overview

  4. What is clustering? • Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns objects to groups.

  5. Another example

  6. Why should we care about clustering? • Clustering is a basic step in most data mining procedures. Examples: • Clustering movie viewers for movie ranking. • Clustering proteins by their functionality. • Clustering text documents for content similarity.

  7. Clustering as Data Exploration Clustering is one of the most widely used tools for exploratory data analysis. Social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.

  8. There are Many Clustering Tasks “Clustering” is an ill-defined problem • There are many different clustering tasks, leading to different clustering paradigms:

  10. Some more examples

  11. Issues The clustering problem: given a set of objects, find groups of similar objects. 1. What is similar? Define appropriate metrics. 2. What makes a good group? Groups that contain the highest average similarity between all pairs? Groups that are most separated from neighboring groups? 3. How can you evaluate a clustering algorithm?

  12. Formal Definition Given a data set S and a clustering “objective” function f, find a partition P of S that maximizes (or minimizes) f(P). A partition is a set of subsets of S such that the subsets are pairwise disjoint and their union equals S.

  13. Sample Objective Functions • Objective 1: Minimize the average distance between points in the same cluster • Objective 2: Maximize the margin (smallest distance) between neighboring clusters • Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering and the number of bits needed to describe the points in each cluster.

  14. More Issues • Having an objective function f gives a way of evaluating a clustering. But the real f is usually not known! • Efficiency: Comparing N points to each other means making O(N^2) comparisons. • Curse of Dimensionality: The more features in your data, the more likely the clustering algorithm is to get it wrong.

  15. Clustering as “Unsupervised” Learning Input Output H = space of Boolean functions f = X1 ∧ ¬X3 ∧ ¬X4

  16. Clustering as “Unsupervised” Learning Clustering is just like ML, except…: Input Output H = space of Boolean functions f = X1 ∧ ¬X3 ∧ ¬X4

  17. Clustering as “Unsupervised” Learning • Supervised learning has: • Labeled training examples • A space Y of possible labels • Unsupervised learning has: • Unlabeled training examples • No information (or limited information) about the space of possible labels

  18. Some Notes on Complexity • The ML example used a space of Boolean functions of N Boolean variables • 2^(2^N) possible functions • But many possibilities are eliminated by training data and assumptions • How many possible clusterings? • ~2^(N·K) / K! for K clusters (K > 1) • No possibilities eliminated by training data • Need to search for a good one efficiently!

  19. Clustering Problem Formulation • General Assumptions • Each data item is a tuple (vector) • Values of tuples are nominal, ordinal, or numerical • Similarity (or Distance) function is provided • For pure numerical tuples, for example: • Sim(di, dj) = Σk di,k · dj,k • sim(di, dj) = cos(di, dj) • …and many more (slide after next)
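
A minimal Python sketch of the two similarity functions named above (the dot product and cosine similarity); the example vectors are invented for illustration:

```python
import math

def dot_similarity(di, dj):
    """Sim(di, dj) = sum_k di[k] * dj[k]"""
    return sum(a * b for a, b in zip(di, dj))

def cosine_similarity(di, dj):
    """sim(di, dj) = cos(di, dj) = (di . dj) / (||di|| * ||dj||)"""
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    return dot_similarity(di, dj) / (norm_i * norm_j)

# Example: two small term-frequency vectors
print(dot_similarity([1, 0, 2], [2, 1, 1]))     # 4
print(cosine_similarity([1, 0, 2], [2, 1, 1]))  # ~0.73
```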

  20. Similarity Measures in Data Analysis • For Ordinal Values • E.g. "small," "medium," "large," "X-large" • Convert to numerical values, assuming constant spacing, on a normalized [0,1] scale, where max(v) = 1, min(v) = 0, and the others interpolate • E.g. "small" = 0, "medium" = 0.33, etc. • Then, use numerical similarity measures • Or, use a similarity matrix (see next slide)
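
A minimal sketch, assuming the constant-spacing convention described above, of mapping an ordinal scale onto [0, 1]:

```python
def ordinal_scale(values):
    """Return {value: position}, with min(v)=0, max(v)=1, others interpolated."""
    step = 1.0 / (len(values) - 1)
    return {v: round(i * step, 2) for i, v in enumerate(values)}

print(ordinal_scale(["small", "medium", "large", "X-large"]))
# {'small': 0.0, 'medium': 0.33, 'large': 0.67, 'X-large': 1.0}
```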

  21. Similarity Measures (cont.) • For Nominal Values • E.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel" • Binary rule: If di,k = dj,k, then sim = 1, else 0 • Use an underlying semantic property: E.g. Sim(Boston, LA) = 1 / dist(Boston, LA), or Sim(Boston, LA) = |size(Boston) - size(LA)| / Max(size(cities)) • Or, use a similarity matrix

  22. Similarity Matrix
          tiny  little  small  medium  large  huge
  tiny     1.0     0.8    0.7     0.5    0.2   0.0
  little           1.0    0.9     0.7    0.3   0.1
  small                   1.0     0.7    0.3   0.2
  medium                          1.0    0.5   0.3
  large                                  1.0   0.8
  huge                                         1.0
  • Diagonal must be 1.0 • Monotonicity property must hold • No linearity (value interpolation) assumed • Qualitative transitive property must hold
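
A minimal sketch of using the similarity matrix above as a lookup table; storing only the upper triangle and exploiting symmetry is an implementation choice, not something the slides prescribe:

```python
# Upper triangle of the ordinal similarity matrix; the matrix is symmetric
# with 1.0 on the diagonal.
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    """Symmetric lookup: sim(a, b) == sim(b, a)."""
    return SIM.get((a, b), SIM.get((b, a)))

print(sim("small", "large"))  # 0.3
print(sim("huge", "tiny"))    # 0.0
```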

  23. Document Clustering Techniques • Similarity or Distance Measure: Alternative Choices • Cosine similarity • Euclidean distance • Kernel functions, e.g., • Language modeling: P(y | model_x), where x and y are documents

  24. Document Clustering Techniques • Kullback Leibler distance ("relative entropy")
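
The KL formula itself did not survive the transcript; the standard definition, with p and q the word distributions of two documents, is:

```latex
D_{\mathrm{KL}}(p \parallel q) \;=\; \sum_{w} p(w)\,\log\frac{p(w)}{q(w)}
```

Note that KL is asymmetric and undefined where q(w) = 0, which is why document models are usually smoothed before comparison.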

  25. Some Clustering Methods • K-Means and K-medoids algorithms: • CLARANS, [Ng and Han, VLDB 1994] • Hierarchical algorithms • CURE, [Guha et al, SIGMOD 1998] • BIRCH, [Zhang et al, SIGMOD 1996] • CHAMELEON, [Karypis et al, IEEE Computer, 32] • Density based algorithms • DENCLUE, [Hinneburg and Keim, KDD 1998] • DBSCAN, [Ester et al, KDD 1996] • Clustering with obstacles, [Tung et al, ICDE 2001]

  26. K-Means

  27. K-means and K-medoids algorithms • Objective function: Minimize the sum of square distances of points to a cluster representative (centroid) • Efficient iterative algorithms (O(n))
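
Stated as a formula (a reconstruction of the slide's objective, not copied from it): with clusters C_1, …, C_K and centroids μ_j, K-means minimizes

```latex
J \;=\; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
```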

  28. K-Means Clustering 1. Select K seed centroids s.t. d(ci, cj) > dmin 2. Assign points to clusters by minimum distance to centroid 3. Compute new cluster centroids (the mean of each cluster's points) 4. Iterate steps 2 & 3 until no points change clusters
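
A minimal Python sketch of the four steps above, assuming points are lists of floats, Euclidean distance, and random seeding (the d_min check on seeds is omitted):

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iters=100):
    # Step 1: pick K seed centroids (here: random points)
    centroids = random.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        new_assignments = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                           for p in points]
        # Step 4: stop when no point changes cluster
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids
```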

  29. K-Means Clustering: Initial Data Points Step 1: Select k random seeds s.t. d(ci,cj) > dmin Initial Seeds (k=3)

  30. K-Means Clustering: First-Pass Clusters Step 2: Assign points to clusters by min dist. Initial Seeds

  31. K-Means Clustering: Seeds → Centroids Step 3: Compute new cluster centroids: New Centroids

  32. K-Means Clustering: Second Pass Clusters Step 4: Recompute Centroids

  33. K-Means Clustering: Iterate Until Stability New Centroids And so on.

  34. Question If the space of possible clusterings is exponential, why is it that K-Means can find one in O(n) time?

  35. Problems with K-means type algorithms • Clusters are approximately spherical • High dimensionality is a problem • The value of K is an input parameter

  36. Agglomerative Clustering

  37. Hierarchical Clustering • Quadratic algorithms • Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]

  38. Hierarchical Agglomerative Clustering • Create N single-document clusters • For i in 1..N-1 • Merge the two clusters with greatest similarity (see the sketch below)
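
A minimal Python sketch of this procedure; single-link similarity (the best pair of members) is assumed here, one of the linkage choices discussed on slide 45, and the naive pairwise search makes it O(n^3) as written:

```python
def hac(items, similarity, target_k=1):
    # Start with N single-item clusters
    clusters = [[x] for x in items]
    while len(clusters) > target_k:
        # Find the pair of clusters with greatest similarity
        # (single link: similarity of the closest pair of members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Merge the two clusters and replace them with the new one
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```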

  41. Hierarchical Agglomerative Clustering Hierarchical agglomerative clustering gives a hierarchy of clusters • This makes it easier to explore the set of possible k values and choose the best number of clusters

  42. High density variations • Intuitively “correct” clustering

  43. High density variations • Intuitively “correct” clustering • HAC-generated clusters

  44. Document Clustering Techniques • Example: group documents based on similarity. Given a pairwise similarity matrix, thresholding at a similarity value of 0.9 yields: the complete graph C1 = {1,4,5} (complete linkage) and the connected graph C2 = {1,4,5,6} (single linkage). For clustering we need three things: • A similarity measure for pairwise comparison between documents • A clustering criterion (complete link, single link, …) • A clustering algorithm

  45. Document Clustering Techniques • Clustering Criterion: Alternative Linkages • Single-link (“nearest neighbor”) • Complete-link • Average-link (“group average clustering”, or GAC)
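
The linkage formulas on this slide did not survive the transcript; the standard definitions, written in terms of a pairwise similarity sim(x, y) between documents, are:

```latex
\mathrm{sim}_{\text{single}}(C_a, C_b)   \;=\; \max_{x \in C_a,\, y \in C_b} \mathrm{sim}(x, y)
\qquad
\mathrm{sim}_{\text{complete}}(C_a, C_b) \;=\; \min_{x \in C_a,\, y \in C_b} \mathrm{sim}(x, y)
\qquad
\mathrm{sim}_{\text{GAC}}(C_a, C_b)      \;=\; \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} \mathrm{sim}(x, y)
```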

  46. Hierarchical Agglomerative Clustering Methods • Generic Agglomerative Procedure (Salton '89), which produces nested clusters through iteration: • Compute all pairwise document-document similarity coefficients • Place each of n documents into a class of its own • Merge the two most similar clusters into one; replace the two clusters by the new cluster and recompute intercluster similarity scores w.r.t. the new cluster • Repeat the previous step until only k clusters remain (note k could be 1).

  47. Group Agglomerative Clustering (figure: nine numbered example points)

  48. Expectation-Maximization

  49. Clustering as Model Selection Let’s look at clustering as a probabilistic modeling problem: I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points: P(xi | C1), P(xi | C2), P(xi | C3)

  50. Clustering as Model Selection How can I determine which points belong to which cluster? Cluster for xi = argmax_j P(xi | Cj) So, all I need is to figure out what P(xi | Cj) is, for each i and j. But without training data! How can I do that?
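
One standard way to answer this, sketched below under assumptions the slides do not spell out (one-dimensional Gaussian cluster models, a fixed K, random initialization), is the EM algorithm: alternate between soft-assigning points to clusters and re-estimating each cluster's parameters.

```python
import math, random

def gaussian(x, mu, var):
    """P(x | C_j) under a Gaussian cluster model with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, k, iters=50):
    mus = random.sample(xs, k)      # initial cluster means
    vars_ = [1.0] * k               # initial variances
    priors = [1.0 / k] * k          # P(C_j)
    for _ in range(iters):
        # E-step: responsibilities r[i][j] = P(C_j | x_i)
        r = []
        for x in xs:
            joint = [priors[j] * gaussian(x, mus[j], vars_[j]) for j in range(k)]
            z = sum(joint)
            r.append([p / z for p in joint])
        # M-step: re-estimate priors, means, and variances from soft assignments
        for j in range(k):
            nj = sum(r[i][j] for i in range(len(xs)))
            priors[j] = nj / len(xs)
            mus[j] = sum(r[i][j] * xs[i] for i in range(len(xs))) / nj
            vars_[j] = sum(r[i][j] * (xs[i] - mus[j]) ** 2 for i in range(len(xs))) / nj + 1e-6
    return priors, mus, vars_
```

A hard clustering can then be read off for each point as argmax_j P(C_j) · P(xi | Cj) using the estimated parameters.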
