Document Clustering via Dirichlet Process Mixture Model with Feature Selection


Presentation Transcript


  1. Document Clustering via Dirichlet Process Mixture Model with Feature Selection Guan Yu The Hong Kong Polytechnic University Ruizhang Huang The Hong Kong Polytechnic University Zhaojun Wang Nankai University

  2. Agenda Introduction Background Method Experiments Conclusion and Future Work

  3. Introduction A common challenge in document clustering is determining the number of document clusters K, which is a difficult problem. We attempt to group documents into an optimal number of document clusters based on the Dirichlet process mixture (DPM) model.

  4. Introduction (Cont'd) The Dirichlet process mixture (DPM) model is an infinite mixture model that has been studied in nonparametric Bayesian statistics for a long time and determines the number of clusters automatically. However, no prior work has investigated the DPM model for document clustering. Moreover, the involvement of irrelevant words confuses the estimation of the optimal number of clusters K, which in turn leads to poor clustering solutions.

  5. Introduction (Cont'd) We propose an approach, namely the Dirichlet process mixture model with feature selection (DPMFS), which: groups documents into a set of document clusters while K is determined automatically; identifies discriminative words and separates them from irrelevant noise words.

  6. Agenda Introduction Background Method Experiments Conclusion and Future Work

  7. DPM Model Finite Mixture Model: each data point is drawn from one of K fixed distributions. General Mixture Model: data point x_n follows a general mixture model in which its parameter θ_n is generated from a distribution G. • D -- the number of data points • F(x_n | θ_n) -- the distribution of x_n given the parameter θ_n
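
As an illustration of the generative view above, here is a minimal sketch in Python that draws data points from a finite multinomial mixture; the vocabulary size, document length, and priors are hypothetical, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3          # number of mixture components (fixed, finite case)
    W = 5          # vocabulary size (hypothetical)
    doc_len = 20   # words per document (hypothetical)

    mix = rng.dirichlet(np.ones(K))         # mixing proportions
    thetas = rng.dirichlet(np.ones(W), K)   # one multinomial parameter per component

    def draw_point():
        z = rng.choice(K, p=mix)                  # pick a component
        x = rng.multinomial(doc_len, thetas[z])   # draw word counts from F(x | theta_z)
        return x, z

In the general mixture model, the per-point parameter θ_n is instead drawn from G itself; placing a Dirichlet process prior on G yields the DPM model of the next slide.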

  8. DPM Model (Cont'd) Assigning a Dirichlet process prior to G in the general mixture model leads to the DPM model. The DPM model is a mixture model with an infinite number of mixture components; each component corresponds to a different cluster.

  9. Dirichlet Multinomial Allocation (DMA) • z_n -- the latent cluster allocation of the n-th data point • N -- the number of mixture components • P -- the mixing proportions of the clusters. One practical problem with the DPM model is that its parameters cannot be estimated quickly. The Dirichlet multinomial allocation (DMA) is one of the best-known finite approximations to the DPM model.
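
A small sketch of the DMA generative step, assuming a symmetric Dirichlet prior that spreads a concentration parameter α over N potential components (all values here are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)

    D = 300       # number of data points
    N = 50        # number of potential mixture components
    alpha = 1.0   # concentration parameter

    # DMA replaces the Dirichlet process with a finite symmetric Dirichlet:
    P = rng.dirichlet(np.full(N, alpha / N))   # mixing proportions
    z = rng.choice(N, size=D, p=P)             # latent cluster allocations z_n

    # With alpha/N small, the mass concentrates on a few components, so the
    # number of occupied clusters stays well below N, mimicking the DPM.
    print("occupied clusters:", len(np.unique(z)))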

  10. Agenda Introduction Background Method Experiments Conclusion and Future Work 10

  11. DPMFS We introduce a latent binary vector γ = (γ_1, …, γ_W) to identify words that discriminate between the different clusters. • η_i -- the multinomial parameter for the discriminative words in x_i • η_0 -- the multinomial parameter for the irrelevant noise words
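
A toy sketch of how γ splits a document's word counts between the cluster-specific multinomial η_i and the shared noise multinomial η_0; the convention that η_i is normalized over the discriminative words and η_0 over the noise words is our assumption for illustration.

    import numpy as np

    gamma = np.array([1, 1, 0, 0, 0, 1])   # 1 = discriminative word, 0 = noise word
    x     = np.array([4, 0, 2, 3, 1, 5])   # word counts of one document x_i

    disc = gamma.astype(bool)
    eta_i = np.array([0.5, 0.2, 0.3])      # cluster-specific parameter, discriminative words
    eta_0 = np.array([0.4, 0.4, 0.2])      # shared parameter, noise words

    # The log-likelihood factorizes into a discriminative part and a noise part
    # (multinomial coefficients omitted; they are constant in eta).
    loglik = x[disc] @ np.log(eta_i) + x[~disc] @ np.log(eta_0)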

  12. DMAFS Approximation Since the DPM model can be approximated by the DMA model, the DMAFS model is likewise a good approximation to the DPMFS model.

  13. Agenda Introduction Background Method Experiments Conclusion and Future Work

  14. Experiments Two sets of experiments were used to evaluate the performance of the DPMFS approach. Synthetic Dataset Real Datasets

  15. Synthetic Dataset The synthetic dataset consisted of 300 data points with 1000 features. Data points were generated by two different processes using four multinomial distributions: • Generating the discriminative features: the first 50 features were regarded as discriminative features, generated from a multinomial mixture distribution with three components. • Generating the irrelevant noise features: the remaining 950 features were regarded as irrelevant noise features, generated from a single multinomial distribution.
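
A sketch reproducing this setup; the per-point word totals and the Dirichlet priors on the multinomial parameters are our own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)

    D, W_disc, W_noise, K = 300, 50, 950, 3
    n_disc, n_noise = 50, 200                    # counts per point (hypothetical)

    mix = rng.dirichlet(np.ones(K))              # mixing proportions
    theta = rng.dirichlet(np.ones(W_disc), K)    # 3 discriminative multinomials
    theta0 = rng.dirichlet(np.ones(W_noise))     # single noise multinomial

    z = rng.choice(K, size=D, p=mix)
    X_disc = np.array([rng.multinomial(n_disc, theta[k]) for k in z])
    X_noise = rng.multinomial(n_noise, theta0, size=D)
    X = np.hstack([X_disc, X_noise])             # 300 x 1000 count matrix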

  16. Experiment Results Trace plots for the number of clusters and for the number of discriminative features.

  17. Real Document Datasets Four standard text datasets were used in our experiments: News-Different-3, News-Similar-3, News-Moderated-6 and Classic400.

  18. Experiment Results Clustering results on News-Similar-3 and News-Moderated-6

  19. Experiment Results Clustering results on News-Different-3 (the third row) and Classic400

  20. Experiment Results Estimated labels of data points in News-Different-3.

  21. Conclusion and Future Work Our proposed DPMFS approach handles document clustering and feature selection simultaneously. We constrain the DPM model to define the cluster structure of the data using only the discriminative features, which are identified by a latent binary vector. Our experiments show that the DPMFS approach groups a document dataset into meaningful clusters without requiring the number of clusters to be known in advance. An interesting direction for future research is to apply the DPMFS approach to semi-supervised document clustering, since labeled documents and constraints are increasingly available in real-life applications.

  22. Thank You

  23. ALGORITHM We use Gibbs sampling to infer both the latent cluster structure and the discriminative words under the DMAFS model: Initialize the latent variables γ and z, and set the parameters α, ω, λ, and N. Run a "burn-in" period, sampling γ, η, z, and λ over iterations. Use the last H samples of z and γ to infer the latent data labels and the discriminative words.
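
For concreteness, here is a simplified, runnable sketch of such a sampling loop: it collapses η out and samples only the allocations z for a DMA multinomial mixture, omitting the feature-selection indicator γ and the λ update; the hyperparameter defaults are illustrative, not the paper's.

    import numpy as np
    from scipy.special import gammaln

    def log_dm_predictive(x, counts, beta):
        # Log Dirichlet-multinomial predictive p(x | cluster word counts)
        # under a symmetric Dirichlet(beta) prior; the multinomial
        # coefficient is dropped since it is constant across clusters.
        a = counts + beta
        return (gammaln(a.sum()) - gammaln(a.sum() + x.sum())
                + np.sum(gammaln(a + x) - gammaln(a)))

    def gibbs_dma(X, N=20, alpha=1.0, beta=0.5, n_iters=200, H=50, seed=0):
        rng = np.random.default_rng(seed)
        D, W = X.shape
        z = rng.integers(0, N, size=D)                   # initialize allocations
        sizes = np.bincount(z, minlength=N).astype(float)
        counts = np.zeros((N, W))
        for i in range(D):
            counts[z[i]] += X[i]
        samples = []
        for t in range(n_iters):
            for i in range(D):
                sizes[z[i]] -= 1; counts[z[i]] -= X[i]   # remove point i
                logp = np.array([np.log(sizes[k] + alpha / N)
                                 + log_dm_predictive(X[i], counts[k], beta)
                                 for k in range(N)])
                p = np.exp(logp - logp.max()); p /= p.sum()
                z[i] = rng.choice(N, p=p)                # resample z_i
                sizes[z[i]] += 1; counts[z[i]] += X[i]   # re-add point i
            if t >= n_iters - H:
                samples.append(z.copy())                 # keep the last H samples
        return samples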

  24. Evaluation Metric We used normalized mutual information (NMI) to evaluate the quality of a clustering solution. NMI measures the amount of statistical information shared by the random variables representing the cluster assignments and the user-labeled class assignments of the data points.
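
NMI is available off the shelf; a minimal usage example with toy labels:

    from sklearn.metrics import normalized_mutual_info_score

    true_labels = [0, 0, 1, 1, 2, 2]   # user-labeled classes (toy example)
    est_labels  = [1, 1, 0, 0, 2, 2]   # cluster assignments from the model

    nmi = normalized_mutual_info_score(true_labels, est_labels)
    print(nmi)   # 1.0 here: the partitions agree up to relabeling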

  25. Parameter Tuning We investigated the sensitivity of our algorithm to the choice of parameters through extensive experiments.

  26. Choice of N In principle, we can choose N to be the number of data points.

  27. Choice of α and ω

  28. Choice of λ A small value of λ performs well, though it requires a relatively long time for the sampling process to stabilize. For the document datasets used in our experiments, we found that a good choice of λ_j is 1.0/σ_j, where σ_j is the sample standard deviation of {x_1j, x_2j, …, x_Dj}.
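
A short numpy rendering of this heuristic, assuming X is the D x W document-word count matrix and no feature is constant across documents (the function name is ours):

    import numpy as np

    def choose_lambda(X):
        sigma = X.std(axis=0, ddof=1)   # per-feature sample standard deviation
        return 1.0 / sigma              # lambda_j = 1 / sigma_j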
