## Incorporating User Provided Constraints into Document Clustering


Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi
Department of Computer Science, Wayne State University, Detroit, MI 48202
{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu

## Outline

- Introduction
- Overview of related work
- Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion

## What is clustering?

- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Intra-cluster distances are minimized; inter-cluster distances are maximized.

## Document Clustering

- Grouping of text documents into meaningful clusters (e.g., Government, Science, Arts) in an unsupervised manner.

Figure: an example of unsupervised clustering.

## Semi-supervised clustering: problem definition

- Input:
  - A set of unlabeled objects
  - A small amount of domain knowledge (labels or pairwise constraints)
- Output:
  - A partitioning of the objects into k clusters
- Objective:
  - Maximum intra-cluster similarity
  - Minimum inter-cluster similarity
  - High consistency between the partitioning and the domain knowledge

## Semi-supervised Clustering

- Two kinds of domain knowledge may be given:
  - Users provide class labels (seeded points) a priori for some of the documents
  - Users know which few documents are related (must-link) or unrelated (cannot-link)

## Why semi-supervised clustering?

- Large amounts of unlabeled data exist, and more is being produced all the time
- Generating labels for data is expensive and usually requires human intervention
- Use human input to provide labels for some of the data
- Improve existing naive clustering methods: use labeled data to guide clustering of unlabeled data
- End result is a better clustering
of data.

- Potential applications: document/word categorization, image categorization, bioinformatics (gene/protein clustering)

## Clustering Algorithms

- Document hierarchical clustering: bottom-up (agglomerative) or top-down (divisive)
- Document partitioning (flat clustering): K-means; probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.
- Document clustering based on graph models

## Semi-supervised Clustering Algorithms

- Semi-supervised clustering with labels (partial label information is given):
  - SS-Seeded-KMeans (Sugato Basu et al., ICML 2002)
  - SS-Constraint-KMeans (Sugato Basu et al., ICML 2002)
- Semi-supervised clustering with pairwise constraints (must-link, cannot-link):
  - SS-COP-KMeans (Wagstaff et al., ICML 2001)
  - SS-HMRF-KMeans (Sugato Basu et al., ACM SIGKDD 2004)
  - SS-Kernel-KMeans (Brian Kulis et al., ICML 2005)
  - SS-Spectral-Normalized-Cuts (X. Ji et al., ACM SIGIR 2006)

## Overview of K-means Clustering

- K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.
- Objective function: locally minimize the sum of squared distances between the data points and their corresponding cluster centers.
- Algorithm: initialize k cluster centers randomly, then repeat until convergence:
  - Cluster assignment step: assign each data point x_i to the cluster f_h whose center is nearest to x_i.
  - Center re-estimation step: re-estimate each cluster center as the mean of the points assigned to that cluster.
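The two alternating K-means steps just described can be sketched in a few lines. This is a generic illustration rather than the authors' implementation; the function name `kmeans` and its arguments are chosen here for exposition:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize k cluster centers by sampling k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Re-estimation step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points, the alternation converges in a handful of iterations to the natural partition; the result is only a local minimum of the objective, which is what motivates the constrained variants discussed next.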
## Semi-supervised Kernel K-means (SS-KK) [Brian Kulis et al., ICML 2005]

- The SS-KK objective combines three terms, where φ(·) is the kernel-induced mapping, m_c is the centroid of cluster c, and w_ij is the cost of violating the constraint between points x_i and x_j:
  - First term: the kernel k-means objective function
  - Second term: a reward function for satisfying must-link constraints
  - Third term: a penalty function for violating cannot-link constraints

## Overview of Spectral Clustering

- Spectral clustering is a graph-theoretic clustering algorithm on a weighted graph G = (V, E, A): minimize the between-cluster similarities (edge weights A_ij).

## Spectral Normalized Cuts

- Minimize the similarity between clusters, balanced by cluster weights and expressed through a cluster-indicator vector; the graph partition then becomes a trace optimization whose solution is an eigenvector of the normalized affinity matrix.

## Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji et al., ACM SIGIR 2006]

- The semi-supervised spectral learning objective also combines three terms:
  - First term: the spectral normalized-cut objective function
  - Second term: a reward function for satisfying must-link constraints
  - Third term: a penalty function for violating cannot-link constraints

## Non-negative Matrix Factorization (NMF)

- NMF decomposes a matrix into two non-negative factors (D. Lee et al., Nature 1999): X ≈ FGᵀ, obtained by solving min ‖X − FGᵀ‖².
- Symmetric NMF for clustering (C. Ding et al., SIAM ICDM 2005): min ‖A − GSGᵀ‖².

## SS-NMF

- Incorporates prior knowledge into an NMF-based framework for document clustering.
- Users provide pairwise constraints:
  - Must-link constraints C_ML: documents d_i and d_j must belong to the same cluster.
  - Cannot-link constraints C_CL: documents d_i and d_j must belong to different clusters.
- Constraints are defined by an associated violation-cost matrix W:
  - W_reward: the cost of violating the must-link constraint between documents d_i and d_j, if one exists.
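These violation-cost matrices are easy to picture concretely. The sketch below is one plausible encoding for exposition, not the paper's exact formulation: the function names, the `w_reward`/`w_penalty` scalars, and the simple additive way of folding constraints into a document-similarity matrix A are all assumptions made for illustration.

```python
import numpy as np

def constraint_matrices(n, must_link, cannot_link, w_reward=1.0, w_penalty=1.0):
    """Encode pairwise constraints as symmetric n-by-n violation-cost matrices.

    must_link / cannot_link are lists of (i, j) document-index pairs.
    """
    W_ml = np.zeros((n, n))  # cost of separating a must-link pair
    W_cl = np.zeros((n, n))  # cost of merging a cannot-link pair
    for i, j in must_link:
        W_ml[i, j] = W_ml[j, i] = w_reward
    for i, j in cannot_link:
        W_cl[i, j] = W_cl[j, i] = w_penalty
    return W_ml, W_cl

def constrained_similarity(A, W_ml, W_cl):
    """One common way to fold constraints into a document-similarity matrix
    before factorization: boost must-linked pairs, suppress cannot-linked ones."""
    return A + W_ml - W_cl
```

With this encoding, a symmetric factorization of the adjusted similarity matrix is steered toward partitions consistent with the user-provided constraints.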
  - W_penalty: the cost of violating the cannot-link constraint between documents d_i and d_j, if one exists.

## SS-NMF Algorithm

- The objective function of SS-NMF augments the symmetric NMF objective with the must-link rewards and cannot-link penalties, where the factor G gives the cluster label of each document.

## Algorithm Correctness and Convergence

- Based on constrained optimization theory and an auxiliary function, we can prove for SS-NMF:
  1. Correctness: the solution converges to a local minimum.
  2. Convergence: the iterative algorithm converges.
- (Details in the ICDM 2007 and KAIS papers listed under References.)

## SS-NMF: A General Framework for Semi-supervised Clustering

- Orthogonal symmetric semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC). The proof proceeds in three steps (1)–(3) given in the paper.

## Experiments on Toy Data

1. Artificial toy data consisting of two natural clusters.

## Results on Toy Data (SS-KK vs. SS-NMF)

- Hard clustering: each object belongs to a single cluster.
- Soft clustering: each object is probabilistically assigned to clusters.
- Table: difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) on the toy data.

## Results on Toy Data (SS-SNC vs. SS-NMF)

(b) Data distribution in the SS-NMF subspace spanned by the two column vectors of G.
The data points from the two clusters lie along the two axes. (a) Data distribution in the SS-SNC subspace of the first two singular vectors; there is no relationship between the axes and the clusters.

## Time Complexity Analysis

Figure: computational-speed comparison of SS-KK, SS-SNC, and SS-NMF.

## Experiments on Text Data

2. Summary of the data sets [1] used in the experiments.

[1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz

- Evaluation metric: clustering accuracy AC = (1/n) Σ_i δ(l_i, map(r_i)), where n is the total number of documents, δ is the delta function that equals one when its two arguments are equal, r_i is the estimated label, and l_i is the ground-truth label.

## Results on Text Data (Comparison with Unsupervised Clustering)

- (1) Comparison with unsupervised clustering approaches. Note: SS-NMF uses only 3% constraints.

## Results on Text Data (Before and After Clustering)

(a) Typical document-document similarity matrix before clustering; (b) after clustering with SS-NMF (k = 2); (c) after clustering with SS-NMF (k = 5).

## Results on Text Data (Clustering with Different Constraints)

Table: comparison of the confusion matrix C and the normalized cluster-centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents.

## Results on Text Data (Comparison with Semi-supervised Clustering)

- (2) Comparison with SS-KK and SS-SNC on (a) Graft-Phos, (b) England-Heart, and (c) Interest-Trade, and on Fbis2, Fbis3, Fbis4, and Fbis5.

## Experiments on Image Data

3. Image data sets [2] used in the experiments.

Figure: sample images for image categorization
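The AC metric above requires mapping estimated cluster ids onto ground-truth class labels before counting matches. A small self-contained sketch (an illustration, not the authors' code) does this by brute force over all label permutations, which is fine for the small k used in these experiments; production implementations typically use the Hungarian algorithm instead:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, k):
    """AC = (1/n) * sum_i delta(l_i, map(r_i)), maximized over all
    one-to-one mappings from cluster ids to class labels."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    best = 0
    for perm in permutations(range(k)):
        # Relabel cluster c as perm[c], then count agreements.
        mapped = np.array([perm[p] for p in pred_labels])
        best = max(best, int((mapped == true_labels).sum()))
    return best / n
```

Maximizing over mappings matters because cluster ids are arbitrary: a clustering that exactly reproduces the classes but with swapped ids should still score AC = 1.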
(From top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses.)

[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html

## Results on Image Data (Comparison with Unsupervised Clustering)

- (1) Comparison with unsupervised clustering approaches. Table: image clustering accuracy of KK, SNC, NMF, and SS-NMF with only 3% pairwise constraints on the images; SS-NMF consistently outperforms the other well-established unsupervised image clustering methods.

## Results on Image Data (Comparison with Semi-supervised Clustering)

- (2) Comparison with SS-KK and SS-SNC. Figures: image clustering accuracy of SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs on (a) O-R, (b) L-H, (c) R-L, (d) O-R-L, (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, and (h) O-R-L-E-H.

## Conclusion

- Semi-supervised clustering has many real-world applications and outperforms traditional clustering algorithms.
- The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
- Many existing semi-supervised clustering algorithms can be extended to multi-type-object co-clustering tasks.

## References

[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, "Deriving Semantics for Image Clustering from Accumulated User Feedbacks", Proc. of ACM Multimedia, Germany, 2007.

[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%.)

[3] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Knowledge and Information Systems, to appear, 2008. (Invited as a best paper of ICDM 07.)