
A Probabilistic Framework for Semi-Supervised Clustering


Presentation Transcript


  1. A Probabilistic Framework for Semi-Supervised Clustering Sugato Basu Mikhail Bilenko Raymond J. Mooney Department of Computer Sciences University of Texas at Austin Presented by Jingting Zeng

  2. Outline • Introduction • Background • Algorithm • Experiments • Conclusion

  3. What is Semi-Supervised Clustering? • Use human input to provide labels for some of the data • Improve existing naive clustering methods • Use labeled data to guide clustering of unlabeled data • End result is a better clustering of data

  4. Motivation • Large amounts of unlabeled data exist • More is being produced all the time • Generating labels for data is expensive • It usually requires human intervention • Want to use limited amounts of labeled data to guide the clustering of the entire dataset, producing a better clustering

  5. Semi-Supervised Clustering • Constraint-based: • Modify the objective function so that it includes a term for satisfying constraints, or enforce constraints during the clustering or initialization process • Distance-based: • A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset

  6. Method • Use both constraint-based and distance-based approaches in a unified method • Use a Hidden Markov Random Field (HMRF) to model the constraints • Use constraints for initialization and for the assignment of points to clusters • Use an adaptive distance function to learn the distance measure • Cluster the data to minimize an objective function based on that distance

  7. Main Points of Method • Improved initialization: • Initial clusters are formed based on the constraints • Constraint-sensitive assignment of points to clusters: • Points are assigned to clusters to minimize a distortion function while also minimizing the number of constraints violated • Iterative distance learning: • The distortion measure is re-estimated after each iteration

  8. Constraints • Pairwise must-link and cannot-link constraints • Set M of must-link constraints • Set C of cannot-link constraints • A list of associated costs for violating must-link or cannot-link requirements • Class labels do not have to be known; a user can still specify relationships between points (a minimal representation is sketched below)
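
A minimal illustration (not from the slides; the variable names and the constant per-pair costs are assumptions, simpler than the distance-scaled penalties introduced on slides 11 and 12) of how this supervision could be represented and scored:

    # Pairwise supervision: each constrained pair carries a violation cost.
    must_link = {(0, 3): 1.0, (2, 7): 1.0}      # M: pairs that should share a cluster
    cannot_link = {(1, 4): 1.0, (3, 9): 1.0}    # C: pairs that should be separated

    def violation_cost(labels, must_link, cannot_link):
        """Total cost of the constraints violated by a cluster assignment `labels`."""
        cost = 0.0
        for (i, j), w in must_link.items():
            if labels[i] != labels[j]:          # must-link pair split across clusters
                cost += w
        for (i, j), w in cannot_link.items():
            if labels[i] == labels[j]:          # cannot-link pair placed together
                cost += w
        return cost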

  9. HMRF

  10. Posterior Probability • This problem is an “incomplete-data problem” • Cluster representatives as well as cluster labels are unknown • A popular method for solving this type of problem is Expectation Maximization (EM) • K-Means is equivalent to an EM algorithm with hard clustering assignments

  11. Must-Link Violation Cost Function • Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close • Punishes distance functions under which must-link points are far apart

  12. Cannot-Link Violation Cost Function • Ensures that the penalty for violating a cannot-link constraint between points that are nearby according to the current distance function is higher than between distant points • Punishes distance functions that place two cannot-link points close together
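
A rough sketch of both penalty functions, assuming a squared Euclidean distortion; the function names, the weights w and w_bar, and the upper bound d_max are illustrative, and the exact functional form in the paper depends on the chosen distortion measure:

    import numpy as np

    def must_link_penalty(x_i, x_j, w=1.0):
        # Grows with distance: distance measures that keep a must-link
        # pair far apart are punished more heavily.
        return w * np.sum((x_i - x_j) ** 2)

    def cannot_link_penalty(x_i, x_j, d_max, w_bar=1.0):
        # Grows as the pair gets closer: d_max is an upper bound on the
        # distortion, which keeps the penalty non-negative.
        return w_bar * (d_max - np.sum((x_i - x_j) ** 2))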

  13. Objective Function • Goal: minimize the objective function (its general form is reconstructed below) • Supervised data is used in initialization • Constraints are used in cluster assignments • Distance learning happens in the M-step
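
The objective has roughly the following form (a reconstruction; the HMRF normalizing constant is omitted), where D is the distortion between a point and its cluster representative, f_M grows with the distance between a violated must-link pair, and f_C shrinks with the distance between a violated cannot-link pair:

    J_{obj} = \sum_{x_i \in X} D(x_i, \mu_{l_i})
            + \sum_{(x_i, x_j) \in M,\ l_i \neq l_j} w_{ij} \, f_M(x_i, x_j)
            + \sum_{(x_i, x_j) \in C,\ l_i = l_j} \bar{w}_{ij} \, f_C(x_i, x_j)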

  14. Algorithm • EM framework • Initialization step: • Use constraints to guide initial cluster formation • E-step: • Minimize the objective function over cluster assignments • M-step: • Minimize the objective function over cluster representatives • Minimize the objective function over the parameters of the distortion measure

  15. Initialization • Form the transitive closure of the must-link constraints • This yields a set of connected components (neighborhoods), each consisting of points connected by must-link constraints • Let y be the number of connected components • If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters • The remaining clusters are initialized by random perturbations of the global centroid of the data
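
A sketch of this initialization; using networkx for the connected components is an assumption, and the perturbation scale and random seed are arbitrary:

    import networkx as nx
    import numpy as np

    def must_link_neighborhoods(n_points, must_link):
        """Connected components of the must-link graph (the y neighborhoods)."""
        g = nx.Graph()
        g.add_nodes_from(range(n_points))
        g.add_edges_from(must_link)          # must_link: iterable of (i, j) pairs
        return [sorted(c) for c in nx.connected_components(g)]

    def initial_centroids_when_few(X, K, neighborhoods):
        """Case y < K: one centroid per neighborhood, the rest from perturbing the global mean."""
        centroids = [X[idx].mean(axis=0) for idx in neighborhoods]
        global_mean = X.mean(axis=0)
        rng = np.random.default_rng(0)
        while len(centroids) < K:
            centroids.append(global_mean + 0.01 * rng.standard_normal(X.shape[1]))
        return np.array(centroids)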

  16. What If More Neighborhoods Than Clusters? • If y > K (the number of clusters), K initial clusters are selected using the distance measure • Farthest-first traversal is a good heuristic • A weighted variant of farthest-first traversal is used: • The distance between two centroids is multiplied by their corresponding weights • The weight of each centroid is proportional to the size of the corresponding neighborhood • Biased to select centroids that are relatively far apart and of substantial size
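
An illustrative weighted farthest-first traversal over the neighborhood centroids; the choice of starting point (the largest neighborhood) and the exact weighting are assumptions rather than the paper's precise variant:

    import numpy as np

    def weighted_farthest_first(centroids, weights, K):
        """Case y > K: pick K centroids, preferring large, mutually distant neighborhoods."""
        chosen = [int(np.argmax(weights))]           # start from the heaviest neighborhood
        while len(chosen) < K:
            best, best_score = None, -np.inf
            for i in range(len(centroids)):
                if i in chosen:
                    continue
                # Weighted distance to the nearest already-chosen centroid.
                score = min(
                    weights[i] * weights[j] * np.sum((centroids[i] - centroids[j]) ** 2)
                    for j in chosen
                )
                if score > best_score:
                    best, best_score = i, score
            chosen.append(best)
        return chosen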

  17. Initialization Continued • Assuming consistency of the data: • Augment set M with must-link constraints inferred from the transitive closure • For each pair of neighborhoods Np, Np’ that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np’ • Learn as much about the data through the constraints as possible (a sketch follows the two illustrations below)

  18. Augment set M • If a–b and b–c are must-link pairs, then a–c is an inferred must-link pair

  19. Augment Set C • If a–b is a must-link pair and b–c is a cannot-link pair, then a–c is an inferred cannot-link pair
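
A sketch of the augmentation step illustrated on the two slides above; pair ordering is normalized so that (i, j) and (j, i) count as the same constraint:

    import itertools

    def augment_constraints(neighborhoods, cannot_link):
        """Infer extra constraints from the must-link neighborhoods."""
        # Every pair inside a neighborhood becomes a must-link pair (transitive closure).
        inferred_must = set()
        for nb in neighborhoods:
            inferred_must.update(itertools.combinations(sorted(nb), 2))
        # If any cannot-link edge joins two neighborhoods, every cross pair becomes cannot-link.
        cl = {tuple(sorted(p)) for p in cannot_link}
        inferred_cannot = set(cl)
        for nb_a, nb_b in itertools.combinations(neighborhoods, 2):
            cross = {tuple(sorted((i, j))) for i in nb_a for j in nb_b}
            if cross & cl:
                inferred_cannot |= cross
        return inferred_must, inferred_cannot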

  20. E-step • Assigns data points to clusters • Since the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable • Approximation methods include iterated conditional modes (ICM), belief propagation, and linear programming relaxation • ICM uses a greedy strategy to sequentially update the cluster assignment of each point while keeping the other points fixed
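
An illustrative ICM-style assignment sweep; for brevity it uses constant violation costs w and w_bar rather than the distance-scaled penalties of slides 11 and 12:

    import numpy as np

    def icm_assign(X, centroids, labels, must_link, cannot_link, w=1.0, w_bar=1.0):
        """Greedily update one point's cluster at a time, holding the rest fixed,
        and repeat the sweep until no assignment changes."""
        K = len(centroids)
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                costs = np.sum((centroids - X[i]) ** 2, axis=1)    # distortion to each centroid
                for k in range(K):
                    for (a, b) in must_link:                        # must-link partner not in k
                        j = b if a == i else a if b == i else None
                        if j is not None and labels[j] != k:
                            costs[k] += w
                    for (a, b) in cannot_link:                      # cannot-link partner already in k
                        j = b if a == i else a if b == i else None
                        if j is not None and labels[j] == k:
                            costs[k] += w_bar
                new_label = int(np.argmin(costs))
                if new_label != labels[i]:
                    labels[i] = new_label
                    changed = True
        return labels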

  21. M-step • First, cluster representatives are re-estimated to decrease the objective function • Constraints do not factor into this step, so it is equivalent to the K-Means update • If a parameterized variant of the distance measure is used, it is updated here • The parameter updates use partial derivatives of the distance function • The learning step modifies the distortion measure so that similar points are brought closer together, while dissimilar points are pulled apart
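
A sketch of the two M-step updates; the diagonal-metric gradient step is only illustrative (the learning rate, the restriction to violated must-link pairs, and the positivity clamp are assumptions), whereas the paper derives its update from partial derivatives of the full objective:

    import numpy as np

    def update_centroids(X, labels, K):
        """Mean of each cluster's points: identical to the K-Means update,
        since the constraint terms do not involve the representatives."""
        labels = np.asarray(labels)
        return np.array([X[labels == k].mean(axis=0) for k in range(K)])

    def update_diagonal_metric(A, X, labels, must_link, lr=0.01):
        """One gradient step on a diagonal metric d_A(x, y) = sum_m A_m (x_m - y_m)^2:
        shrink the weights of dimensions where violated must-link pairs differ,
        pulling such pairs closer under the learned distance."""
        grad = np.zeros_like(A, dtype=float)
        for (i, j) in must_link:
            if labels[i] != labels[j]:
                grad += (X[i] - X[j]) ** 2
        return np.maximum(A - lr * grad, 1e-6)   # keep the metric weights positive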

  22. Results • KMeans-I-C-D: • Complete HMRF-KMeans algorithm, with supervised data in initialization (I), constraints in cluster assignment (C), and distance learning (D) • KMeans-I-C: • HMRF-KMeans algorithm without distance learning • KMeans-I: • HMRF-KMeans algorithm without distance learning and without supervised cluster assignment

  23. Results

  24. Results

  25. Results

  26. Conclusion • HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints • The goal of the algorithm is to provide a better clustering using a limited number of constraints • HMRF-KMeans learns quickly from a limited number of constraints • Should be applicable to datasets where we want to limit the amount of human labeling, and where supervision can be specified as pairwise constraints

  27. Questions • Can all types of constraints be captured by pairwise associations? • What about hierarchical structure? • Could other types of labels be included in this model? • e.g., use class labels as well as pairwise constraints • How does this model handle noise in the data or labels? • e.g., Point A has a must-link constraint to Point B, Point B has a must-link constraint to Point C, and Point A has a cannot-link constraint to Point C

  28. More Questions • How does this apply to other types of data? • The authors mention wanting to apply the method to other types of data in the future, such as gene representations • Who provides the costs for constraint violations, and how are they determined? • The method is only compared with the naïve K-Means baseline • How does it compare with other semi-supervised clustering methods?

  29. Reference • S. Basu, M. Bilenko, and R.J. Mooney, “A Probabilistic Framework for Semi-Supervised Clustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 2004.

  30. Thank you!
