Document Clustering via Dirichlet Process Mixture Model with Feature Selection


Presentation Transcript


  1. Document Clustering via Dirichlet Process Mixture Model with Feature Selection Guan Yu The Hong Kong Polytechnic University Ruizhang Huang The Hong Kong Polytechnic University Zhaojun Wang Nankai University

  2. Agenda Introduction Background Method Experiments Conclusion and Future Work

  3. Introduction A common challenge in document clustering is determining the number of document clusters K, which is a difficult problem. We attempt to group documents into an optimal number of document clusters based on the Dirichlet process mixture (DPM) model.

  4. Introduction (Cont'd) The Dirichlet process mixture (DPM) model is an infinite mixture model that has been studied in nonparametric Bayesian statistics for a long time and determines the number of clusters automatically. However, no prior work has investigated the DPM model for document clustering. Moreover, the involvement of irrelevant words confuses the estimation of the optimal number of clusters K, which in turn leads to poor clustering solutions.

  5. Introduction (Cont'd) We propose an approach, namely the Dirichlet process mixture model with feature selection (DPMFS), which: groups documents into a set of document clusters while K is determined automatically; identifies discriminative words and separates them from irrelevant noise words.

  6. Agenda Introduction Background Method Experiments Conclusion and Future Work

  7. DPM Model Finite Mixture Model: each data point is drawn from one of K fixed distributions. General Mixture Model: data point x_n follows a general mixture model in which its parameter θ_n is generated from a distribution G. • D -- the number of data points • F(x_n | θ_n) -- the distribution of x_n given the parameter θ_n
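
As an illustration of the generative view above, here is a minimal sketch in Python that draws data points from a finite multinomial mixture; the vocabulary size, document length, and priors are hypothetical, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3          # number of mixture components (fixed, finite case)
    W = 5          # vocabulary size (hypothetical)
    doc_len = 20   # words per document (hypothetical)

    mix = rng.dirichlet(np.ones(K))         # mixing proportions
    thetas = rng.dirichlet(np.ones(W), K)   # one multinomial parameter per component

    def draw_point():
        z = rng.choice(K, p=mix)                  # pick a component
        x = rng.multinomial(doc_len, thetas[z])   # draw word counts from F(x | theta_z)
        return x, z

In the general mixture model, the per-point parameter θ_n is instead drawn from G itself; placing a Dirichlet process prior on G yields the DPM model of the next slide.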

  8. DPM Model (Cont'd) Assigning a Dirichlet process prior to G in the general mixture model leads to the DPM model. The DPM model is a mixture model with an infinite number of mixture components; each component corresponds to a different cluster.

  9. Dirichlet Multinomial Allocation (DMA) • z_n -- the latent cluster allocation of the n-th data point • N -- the number of mixture components • P -- the mixing proportions of the clusters. One practical problem with the DPM model is that its parameters cannot be estimated quickly. The Dirichlet multinomial allocation (DMA) is one of the best-known finite approximations to the DPM model.
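
A small sketch of the DMA generative step, assuming a symmetric Dirichlet prior that spreads a concentration parameter α over N potential components (all values here are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)

    D = 300       # number of data points
    N = 50        # number of potential mixture components
    alpha = 1.0   # concentration parameter

    # DMA replaces the Dirichlet process with a finite symmetric Dirichlet:
    P = rng.dirichlet(np.full(N, alpha / N))   # mixing proportions
    z = rng.choice(N, size=D, p=P)             # latent cluster allocations z_n

    # With alpha/N small, the mass concentrates on a few components, so the
    # number of occupied clusters stays well below N, mimicking the DPM.
    print("occupied clusters:", len(np.unique(z)))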

  10. Agenda Introduction Background Method Experiments Conclusion and Future Work 10

  11. DPMFS We introduce a latent binary vector γ = (γ_1, …, γ_W) to identify words that discriminate between the different clusters. • η_i -- the multinomial parameter for the discriminative words in x_i • η_0 -- the multinomial parameter for the irrelevant noise words
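
A toy sketch of how γ splits a document's word counts between the cluster-specific multinomial η_i and the shared noise multinomial η_0; the convention that η_i is normalized over the discriminative words and η_0 over the noise words is our assumption for illustration.

    import numpy as np

    gamma = np.array([1, 1, 0, 0, 0, 1])   # 1 = discriminative word, 0 = noise word
    x     = np.array([4, 0, 2, 3, 1, 5])   # word counts of one document x_i

    disc = gamma.astype(bool)
    eta_i = np.array([0.5, 0.2, 0.3])      # cluster-specific parameter, discriminative words
    eta_0 = np.array([0.4, 0.4, 0.2])      # shared parameter, noise words

    # The log-likelihood factorizes into a discriminative part and a noise part
    # (multinomial coefficients omitted; they are constant in eta).
    loglik = x[disc] @ np.log(eta_i) + x[~disc] @ np.log(eta_0)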

  12. DMAFS Approximation Since the DPM model can be approximated by the DMA model, the DMAFS model is likewise a good approximation to the DPMFS model.

  13. Agenda Introduction Background Method Experiments Conclusion and Future Work

  14. Experiments Two sets of experiments were used to evaluate the performance of the DPMFS approach. Synthetic Dataset Real Datasets

  15. Synthetic Dataset The synthetic dataset consisted of 300 data points with 1000 features. Data points were generated by two different processes using four multinomial distributions: • Generating the discriminative features: the first 50 features were regarded as discriminative features, generated from a multinomial mixture distribution with three components. • Generating the irrelevant noise features: the remaining 950 features were regarded as irrelevant noise features, generated from a single multinomial distribution.
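
A sketch reproducing this setup; the per-point word totals and the Dirichlet priors on the multinomial parameters are our own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)

    D, W_disc, W_noise, K = 300, 50, 950, 3
    n_disc, n_noise = 50, 200                    # counts per point (hypothetical)

    mix = rng.dirichlet(np.ones(K))              # mixing proportions
    theta = rng.dirichlet(np.ones(W_disc), K)    # 3 discriminative multinomials
    theta0 = rng.dirichlet(np.ones(W_noise))     # single noise multinomial

    z = rng.choice(K, size=D, p=mix)
    X_disc = np.array([rng.multinomial(n_disc, theta[k]) for k in z])
    X_noise = rng.multinomial(n_noise, theta0, size=D)
    X = np.hstack([X_disc, X_noise])             # 300 x 1000 count matrix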

  16. Experiment Results Trace plots for the number of clusters and for the number of discriminative features.

  17. Real Document Datasets Four standard text datasets were used in our experiments: News-Different-3, News-Similar-3, News-Moderated-6 and Classic400.

  18. Experiment Results Clustering results on News-Similar-3 and News-Moderated-6

  19. Experiment Results Clustering results on News-Different-3 (the third row) and Classic400

  20. Experiment Results Estimated labels of data points in News-Different-3.

  21. Conclusion and Future Work Our proposed DPMFS approach handles document clustering and feature selection simultaneously. We constrain the DPM model to define the cluster structure of the data using only the discriminative features, which are identified by a latent binary vector. Our experiments show that the DPMFS approach groups a document dataset into meaningful clusters without requiring the number of clusters to be known in advance. An interesting direction for future research is to apply the DPMFS approach to semi-supervised document clustering, since labeled documents and constraints are increasingly available in real-life applications.

  22. Thank You

  23. ALGORITHM We use Gibbs sampling to infer both the latent cluster structure and the discriminative words under the DMAFS model: Initialize the latent variables γ and z, and set the parameters α, ω, λ, and N. Run a "burn-in" period, sampling γ, η, z, and λ over iterations. Use the last H samples of z and γ to infer the latent data labels and the discriminative words.
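
For concreteness, here is a simplified, runnable sketch of such a sampling loop: it collapses η out and samples only the allocations z for a DMA multinomial mixture, omitting the feature-selection indicator γ and the λ update; the hyperparameter defaults are illustrative, not the paper's.

    import numpy as np
    from scipy.special import gammaln

    def log_dm_predictive(x, counts, beta):
        # Log Dirichlet-multinomial predictive p(x | cluster word counts)
        # under a symmetric Dirichlet(beta) prior; the multinomial
        # coefficient is dropped since it is constant across clusters.
        a = counts + beta
        return (gammaln(a.sum()) - gammaln(a.sum() + x.sum())
                + np.sum(gammaln(a + x) - gammaln(a)))

    def gibbs_dma(X, N=20, alpha=1.0, beta=0.5, n_iters=200, H=50, seed=0):
        rng = np.random.default_rng(seed)
        D, W = X.shape
        z = rng.integers(0, N, size=D)                   # initialize allocations
        sizes = np.bincount(z, minlength=N).astype(float)
        counts = np.zeros((N, W))
        for i in range(D):
            counts[z[i]] += X[i]
        samples = []
        for t in range(n_iters):
            for i in range(D):
                sizes[z[i]] -= 1; counts[z[i]] -= X[i]   # remove point i
                logp = np.array([np.log(sizes[k] + alpha / N)
                                 + log_dm_predictive(X[i], counts[k], beta)
                                 for k in range(N)])
                p = np.exp(logp - logp.max()); p /= p.sum()
                z[i] = rng.choice(N, p=p)                # resample z_i
                sizes[z[i]] += 1; counts[z[i]] += X[i]   # re-add point i
            if t >= n_iters - H:
                samples.append(z.copy())                 # keep the last H samples
        return samples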

  24. Evaluation Metric We used normalized mutual information (NMI) to evaluate the quality of a clustering solution. NMI measures the amount of statistical information shared by the random variables representing the cluster assignments and the user-labeled class assignments of the data points.
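
NMI is available off the shelf; a minimal usage example with toy labels:

    from sklearn.metrics import normalized_mutual_info_score

    true_labels = [0, 0, 1, 1, 2, 2]   # user-labeled classes (toy example)
    est_labels  = [1, 1, 0, 0, 2, 2]   # cluster assignments from the model

    nmi = normalized_mutual_info_score(true_labels, est_labels)
    print(nmi)   # 1.0 here: the partitions agree up to relabeling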

  25. Parameter Tuning We investigated the sensitivity of our algorithm to the choice of parameters through extensive experiments.

  26. Choice of N In principle, we can choose N to be the number of data points.

  27. Choice of α and ω

  28. Choice of λ A small value of λ performs well, though it requires a relatively long time for the sampling process to stabilize. For the document datasets used in our experiments, we found that a good choice of λ_j is 1.0/σ_j, where σ_j is the sample standard deviation of {x_1j, x_2j, …, x_Dj}.
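
A short numpy rendering of this heuristic, assuming X is the D x W document-word count matrix and no feature is constant across documents (the function name is ours):

    import numpy as np

    def choose_lambda(X):
        sigma = X.std(axis=0, ddof=1)   # per-feature sample standard deviation
        return 1.0 / sigma              # lambda_j = 1 / sigma_j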
