
Semi-Supervised Clustering I



Presentation Transcript


    1. Semi-Supervised Clustering I

    2. Outline of lecture
      - Overview of clustering and classification
      - What is semi-supervised learning?
        - Semi-supervised clustering
        - Semi-supervised classification
      - Semi-supervised clustering
        - What is semi-supervised clustering?
        - Why semi-supervised clustering?
        - Semi-supervised clustering algorithms

    3. Supervised classification versus unsupervised clustering
      - Unsupervised clustering: group similar objects together to find clusters; minimize intra-cluster distance and maximize inter-cluster distance.
      - Supervised classification: a class label is given for each training sample; build a model from the training data and predict class labels for unseen future data points.

    4. What is clustering? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

    5. What is Classification?

    6. Clustering algorithms
      - K-Means
      - Hierarchical clustering
      - Graph-based clustering (e.g., spectral clustering)

    7. Classification algorithms
      - K-Nearest-Neighbor classifiers
      - Naïve Bayes classifier
      - Linear Discriminant Analysis (LDA)
      - Support Vector Machines (SVM)
      - Logistic Regression
      - Neural Networks

    8. Supervised Classification Example

    9. Supervised Classification Example

    10. Supervised Classification Example

    11. Unsupervised Clustering Example

    12. Unsupervised Clustering Example

    13. Semi-Supervised Learning combines labeled and unlabeled data during training to improve performance:
      - Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
      - Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

    14. Semi-Supervised Classification Example

    15. Semi-Supervised Classification Example

    16. Semi-Supervised Classification
    Algorithms:
      - Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
      - Co-training [Blum:COLT98]
      - Transductive SVMs [Vapnik:98, Joachims:ICML99]
      - Graph-based algorithms
    Assumptions:
      - A known, fixed set of categories is given in the labeled data.
      - The goal is to improve classification of examples into these known categories.
    More discussion next week.

    17. Semi-Supervised Clustering Example

    18. Semi-Supervised Clustering Example

    19. Second Semi-Supervised Clustering Example

    20. Second Semi-Supervised Clustering Example

    21. Semi-supervised clustering: problem definition
    Input:
      - A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
      - A small amount of domain knowledge
    Output:
      - A partitioning of the objects into k clusters (possibly with some discarded as outliers)
    Objective:
      - Maximum intra-cluster similarity
      - Minimum inter-cluster similarity
      - High consistency between the partitioning and the domain knowledge

    22. Why semi-supervised clustering?
      - Why not plain clustering? The clusters produced may not be the ones required, and sometimes there are multiple possible groupings.
      - Why not classification? Sometimes there is insufficient labeled data.
    Potential applications:
      - Bioinformatics (gene and protein clustering)
      - Document hierarchy construction
      - News/email categorization
      - Image categorization

    23. Semi-Supervised Clustering
    Domain knowledge:
      - Partial label information is given, or
      - Constraints are applied (must-links and cannot-links)
    Approaches:
      - Search-based semi-supervised clustering: alter the clustering algorithm using the constraints
      - Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints
      - A combination of both

    24. Search-Based Semi-Supervised Clustering
    Alter the clustering algorithm that searches for a good partitioning by:
      - Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]
      - Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
      - Using the labeled data to initialize clusters in an iterative refinement algorithm (e.g., K-Means) [Basu:ICML02]

    25. Overview of K-Means Clustering
    K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
    Algorithm: initialize K cluster centers randomly, then repeat until convergence:
      - Cluster assignment step: assign each data point x to the cluster X_l whose center is nearest to x in L2 distance.
      - Center re-estimation step: re-estimate each cluster center as the mean of the points assigned to that cluster.
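The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the function name and defaults are my own.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-Means: random initialization, then alternate the
    cluster-assignment and center re-estimation steps until convergence."""
    rng = np.random.default_rng(rng)
    # Initialization: pick k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: nearest center by L2 distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        # (an empty cluster keeps its previous center)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers
```

On well-separated data the loop typically converges in a handful of iterations, regardless of which points are drawn as the initial centers.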

    26. K-Means Objective Function
    K-Means locally minimizes the sum of squared distances between the data points and their corresponding cluster centers.
    Initialization of the K cluster centers:
      - Totally random
      - Random perturbation from the global mean
      - A heuristic to ensure well-separated centers
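The objective described on this slide (the formula itself was an image in the original deck) is the standard K-Means objective; writing mu_l for the center of cluster X_l, it reads:

```latex
J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^2
```

Each of the two iterated steps (assignment and re-estimation) can only decrease J, which is why the algorithm converges, but only to a local minimum that depends on the initialization.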

    27. K-Means Example

    28. K-Means Example: Randomly Initialize Means

    29. K-Means Example: Assign Points to Clusters

    30. K-Means Example: Re-estimate Means

    31. K-Means Example: Re-assign Points to Clusters

    32. K-Means Example: Re-estimate Means

    33. K-Means Example: Re-assign Points to Clusters

    34. K-Means Example: Re-estimate Means and Converge

    35. Semi-Supervised K-Means
      - Partial label information is given: Seeded K-Means, Constrained K-Means
      - Constraints (must-link, cannot-link) are given: COP K-Means

    36. Semi-Supervised K-Means for partially labeled data
      - Seeded K-Means: labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i. Seed points are used only for initialization, not in subsequent steps.
      - Constrained K-Means: labeled data provided by the user are used to initialize the K-Means algorithm. Cluster labels of the seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
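The difference between the two variants above can be sketched as follows: both share the seeded initialization, and Constrained K-Means additionally clamps the seed labels during assignment. A minimal NumPy sketch; the function and parameter names are illustrative, not from the papers.

```python
import numpy as np

def seed_centers(X, seed_idx, seed_labels, k):
    """Seeded initialization: the initial center for cluster i is the
    mean of the seed points carrying label i."""
    seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
    return np.array([X[seed_idx[seed_labels == i]].mean(axis=0)
                     for i in range(k)])

def assign(X, centers, seed_idx=None, seed_labels=None):
    """Cluster assignment step. Passing the seeds clamps their labels
    (Constrained K-Means); omitting them gives the plain step used by
    Seeded K-Means after initialization."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if seed_idx is not None:
        labels[np.asarray(seed_idx)] = seed_labels  # seed labels never change
    return labels
```

For example, with seeds (point 0 -> cluster 0, point 2 -> cluster 1), `seed_centers` places the initial centers on those seed points, and `assign` then labels the remaining points by proximity.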

    37. Seeded K-Means

    38. Seeded K-Means Example

    39. Seeded K-Means Example: Initialize Means Using Labeled Data

    40. Seeded K-Means Example: Assign Points to Clusters

    41. Seeded K-Means Example: Re-estimate Means

    42. Seeded K-Means Example: Assign Points to Clusters and Converge

    43. Constrained K-Means

    44. Constrained K-Means Example

    45. Constrained K-Means Example: Initialize Means Using Labeled Data

    46. Constrained K-Means Example: Assign Points to Clusters

    47. Constrained K-Means Example: Re-estimate Means and Converge

    48. COP K-Means
    COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
    Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
    Algorithm: during the cluster assignment step, each point is assigned to its nearest cluster that violates none of its constraints. If no such assignment exists, abort.
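The constrained assignment step described above can be sketched as follows: for each point, try the centers from nearest to farthest and take the first legal one, aborting if none exists. A minimal illustration under the slide's description; names are my own, and constraints are given as index pairs.

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """Would assigning point i to cluster c break a constraint with an
    already-assigned point? (labels[j] == -1 means j is unassigned.)"""
    for a, b in must:
        if i in (a, b):
            j = b if a == i else a
            if labels[j] != -1 and labels[j] != c:
                return True   # must-link partner sits in another cluster
    for a, b in cannot:
        if i in (a, b):
            j = b if a == i else a
            if labels[j] == c:
                return True   # cannot-link partner already sits in cluster c
    return False

def cop_assign(X, centers, must, cannot):
    """COP K-Means assignment step: each point takes the nearest center
    that violates none of its constraints; abort if no legal cluster exists."""
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):  # nearest first
            if not violates(i, int(c), labels, must, cannot):
                labels[i] = int(c)
                break
        else:
            raise ValueError("constraints cannot be satisfied; abort")
    return labels
```

For example, a cannot-link between two nearby points pushes the second one into the next-nearest cluster, while a must-link drags a distant point into its partner's cluster.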

    50. Illustration

    51. Illustration

    52. Illustration

    53. Summary
      - Seeded and Constrained K-Means use partially labeled data; COP K-Means uses constraints (must-link and cannot-link).
      - Constrained K-Means and COP K-Means require all the constraints to be satisfied, so they may not be effective if the seeds contain noise.
      - Seeded K-Means uses the seeds only in the first step, to determine the initial centroids, and is therefore less sensitive to noise in the seeds.
      - Experiments show that semi-supervised K-Means outperforms traditional K-Means.

    54. References
      - Semi-supervised Clustering by Seeding [Basu:ICML02]: http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
      - Constrained K-means Clustering with Background Knowledge [Wagstaff:ICML01]: http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf

    55. Next class
    Topic: similarity-based semi-supervised clustering
    Readings:
      - Distance Metric Learning, with Application to Clustering with Side-Information: http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
      - From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering: http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf
