Pseudo-supervised Clustering for Text Documents
Marco Maggini, Leonardo Rigutini, Marco Turchi
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena, Siena - Italy
Outline
• Document representation
• Pseudo-Supervised Clustering
• Evaluation of cluster quality
• Experimental results
• Conclusions
Vector Space Model
• Each document is represented by a term-weight vector in the vocabulary space: d_i = [w_{i,1}, w_{i,2}, w_{i,3}, …, w_{i,v}]'
• A commonly used weighting scheme is TF-IDF
• Documents are compared using the cosine correlation
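As an illustration, the TF-IDF weighting and cosine comparison above can be sketched in a few lines of plain Python (the toy documents, tokenization, and variable names are ours, not from the paper):

```python
import math
from collections import Counter

docs = [
    "text clustering groups similar documents",
    "clustering documents by topic",
    "singular value decomposition projects vectors",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency of each term.
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}

def tfidf(doc):
    """TF-IDF vector of a tokenized document in the vocabulary space."""
    tf = Counter(doc)
    return [tf[w] * math.log(n_docs / df[w]) for w in vocab]

vectors = [tfidf(doc) for doc in tokenized]

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(a * a for a in y))
    return dot / (nx * ny) if nx and ny else 0.0

# The two clustering-themed documents end up closer to each other
# than either is to the SVD-themed one.
```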
Vector Space Model: Limitations
• High dimensionality
• Each term is an independent component in the document representation
  • the semantic relationships between words are not considered
• Many irrelevant features
  • feature selection may be difficult, especially for unsupervised tasks
• Vectors are very sparse
Vector Space Model: Projection
• Projection to a lower-dimensional space
• Definition of a basis for the projection
• Use of statistical properties of the word-by-document matrix on a given corpus
  • SVD decomposition (Latent Semantic Analysis)
  • Concept Matrix Decomposition [Dhillon & Modha, Machine Learning, 2001]
• Data partition + SVD/CMD for each partition
  • (Partially) supervised partitioning
Singular Value Decomposition
• SVD of the |V|×|D| word-by-document matrix X (|D| > |V|): X = U Σ V'
• The orthonormal matrix U represents a basis for document representation
• The k columns of U corresponding to the largest singular values in Σ form the basis for the projected space
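A minimal NumPy sketch of this truncated-SVD projection (the toy matrix and the choice k = 3 are our assumptions, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 10))  # toy |V| x |D| word-by-document matrix

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
U_k = U[:, :k]   # basis: left singular vectors of the k largest singular values
Z = U_k.T @ X    # k x |D| projected document representations
```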
Concept Matrix Decomposition-1
• Use a basis which describes a set of k concepts, represented by k reference term distributions (the concept vectors c_1, …, c_k, columns of the concept matrix C)
• The projection into the concept space is obtained by solving a least-squares problem: Z* = argmin_Z ‖X − C Z‖
Concept Matrix Decomposition-2
• The k concept vectors c_i can be obtained as the normalized centroids of a partition of the document collection D = {D_1, D_2, …, D_k}
• CMD exploits the prototypes of homogeneous sets of documents
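The CMD construction above — normalized centroids as concept vectors, then a least-squares projection — can be sketched as follows (toy data; partition and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 12))  # toy |V| x |D| word-by-document matrix
partition = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]

# Concept vectors: normalized centroids of each document subset.
concepts = []
for idx in partition:
    c = X[:, idx].mean(axis=1)              # centroid of the subset
    concepts.append(c / np.linalg.norm(c))  # normalize to unit length
C = np.column_stack(concepts)               # |V| x k concept matrix

# Project the documents onto the concept space by least squares:
# Z* = argmin_Z ||X - C Z||.
Z, *_ = np.linalg.lstsq(C, X, rcond=None)
```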
Pseudo-Supervised Clustering
• Selection of the projection basis using a supervised partition of the document set:
  • Determine a partition of a reference subset T of the document corpus
  • Select a basis B_i for each set T_i in the partition using SVD/CMD
  • Project the documents using the basis B = ∪_i B_i
  • Apply a clustering algorithm to the document corpus represented in the basis B
  • Optionally iterate, refining the choice of the reference subset
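The steps above can be sketched end to end, here with per-subset SVD bases and a tiny k-means written inline (all data, parameter values, and helper names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def kmeans(Z, k, iters=20, seed=0):
    """Tiny k-means on the columns of Z (illustrative, not production code)."""
    rng = np.random.default_rng(seed)
    centers = Z[:, rng.choice(Z.shape[1], k, replace=False)]
    for _ in range(iters):
        # Distance of every column of Z to every center, then reassign.
        d = np.linalg.norm(Z[:, :, None] - centers[:, None, :], axis=0)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[:, j] = Z[:, labels == j].mean(axis=1)
    return labels

def pseudo_supervised_clustering(X, reference_partition, v, k):
    """One pass of the slide's pipeline: per-subset SVD bases,
    stacked projection basis, least-squares projection, clustering."""
    bases = []
    for idx in reference_partition:
        U, _, _ = np.linalg.svd(X[:, idx], full_matrices=False)
        bases.append(U[:, :v])                 # v left singular vectors per subset
    B = np.hstack(bases)                       # B = union of the bases B_i
    Z, *_ = np.linalg.lstsq(B, X, rcond=None)  # project all documents onto B
    return kmeans(Z, k)
```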
Pseudo-SVD-1
• The SVD is computed for the documents in each subset T_i of the partition
• The basis B_i is composed of the first v_i left singular vectors in U_i
• The new basis B is represented by the matrix B = [B_1 | B_2 | … | B_k]
Pseudo-SVD-2
• The Pseudo-SVD representation of the word-by-document matrix X of the corpus is the matrix Z* computed as Z* = argmin_Z ‖X − B Z‖
• The projection requires the solution of a least mean squares problem
Pseudo-CMD-1
• An orthogonal basis is computed as follows:
  • Compute the centroid (concept vector) c_i of each subset T_i in the partition
  • Compute the word cluster for each concept vector
    • A word belongs to the word cluster W_i of subset T_i if its weight in the concept vector c_i is greater than its weights in the other concept vectors
    • Each word is assigned to only one subset
  • Represent the documents in T_i using only the features in the corresponding word cluster W_i
  • Compute the partition of T_i into v_i clusters and compute the word vectors of each centroid
Pseudo-CMD-2
• Each partition T_i is represented by a set of v_i directions obtained from the concept vectors c'_{i,j}
• These vectors are orthogonal, since each word belongs to only one word cluster
• Document projection: Z* = argmin_Z ‖X − B Z‖
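A sketch of the Pseudo-CMD construction with one direction per subset: because the word clusters are disjoint, the resulting directions are exactly orthogonal. The block-structured toy matrix is our own example:

```python
import numpy as np

# Toy |V| x |D| matrix: words 0-4 dominate documents 0-5,
# words 5-8 dominate documents 6-11, plus a small background weight.
X = np.zeros((9, 12))
X[:5, :6] = 1.0
X[5:, 6:] = 1.0
X += 0.1
partition = [list(range(0, 6)), list(range(6, 12))]

# Concept vectors: normalized centroid of each subset.
C = np.column_stack([X[:, idx].mean(axis=1) for idx in partition])
C /= np.linalg.norm(C, axis=0)

# Word clusters: each word goes to the subset whose concept weight is largest.
word_cluster = C.argmax(axis=1)

directions = []
for i, idx in enumerate(partition):
    mask = word_cluster == i
    Xi = np.where(mask[:, None], X[:, idx], 0.0)  # keep only words in W_i
    c = Xi.mean(axis=1)                           # restricted centroid
    directions.append(c / np.linalg.norm(c))
B = np.column_stack(directions)

# Directions from different subsets share no words, hence B'B = I.
```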
Evaluation of cluster quality
• Experiments on pre-classified documents
• Measure of the dispersion of the classes among the clusters
• Contingency table: the matrix H whose element h(A_i, C_j) is the number of items with label A_i assigned to cluster C_j
• Accuracy: “classification using majority voting”
• Conditional Entropy: “confusion” in each cluster
• Human evaluation
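These measures can be sketched on toy labels (the data are ours; the computations follow the majority-voting accuracy and conditional-entropy definitions above):

```python
import math
from collections import Counter

true_labels = ["A", "A", "A", "B", "B", "C", "C", "C"]
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 2]

# Contingency table h(A_i, C_j): items with label A_i in cluster C_j.
H = Counter(zip(true_labels, cluster_ids))
clusters = sorted(set(cluster_ids))
labels = sorted(set(true_labels))
n = len(true_labels)

# Accuracy by majority voting: each cluster is credited with its dominant label.
accuracy = sum(max(H[(a, c)] for a in labels) for c in clusters) / n

# Conditional entropy H(A|C): average label "confusion" inside each cluster.
cond_entropy = 0.0
for c in clusters:
    size = sum(H[(a, c)] for a in labels)
    for a in labels:
        p = H[(a, c)] / size
        if p > 0:
            cond_entropy -= (size / n) * p * math.log2(p)
```

Here cluster 1 mixes labels A and B, so accuracy is below 1 and the conditional entropy is strictly positive.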
Experimental results-1
• Data preparation:
  • Parsing of the PDF files
  • Term filtering using the Aspell-0.50.4.1 library
  • Removal of stop words
  • Application of Luhn reduction to remove common words
Experimental results-2
• Data set (conference papers)
Experimental results-3
• We applied k-means using three different document representations:
  • the original vocabulary basis
  • Pseudo-SVD (PSVD)
  • Pseudo-CMD (PCMD)
• Each algorithm was applied setting the number of clusters to 10
• For PSVD and PCMD, we varied the number of principal components
Experimental results-4
Experimental results-5
• Topic distribution for the Pseudo-SVD algorithm with v = 7
Experimental results-6
• Analyzing the results: low accuracy, high entropy
• Due to: the data set has many transversal topics (e.g. class 5 → Wavelets)
• We have therefore also evaluated the accuracy using a human expert’s judgments.
Experimental results-7
• Human expert’s evaluation of cluster accuracy
Conclusions
• We have presented two clustering algorithms for text documents which use a clustering step also in the definition of the basis for the document representation
• We can exploit the prior knowledge of a human expert about the data set and bias the feature-reduction step towards a more significant representation
• The results show that the PSVD algorithm performs better than both the vocabulary TF-IDF representation and PCMD
Thanks for your attention!!!
Appendix: Vector Space Model-2
• Cosine correlation: s(x_i, x_j) = (x_i' x_j) / (‖x_i‖ ‖x_j‖)
• Two vectors x_i and x_j are similar if s(x_i, x_j) is close to 1
Appendix: Contingency and Confusion Matrix
• If each cluster C_j is associated with the topic A_{m(j)} for which C_j contains the maximum number of documents, and the columns of H are rearranged so that j' = m(j), the confusion matrix F_m is obtained
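The column rearrangement can be sketched as follows (toy contingency matrix; this simple argsort assumes m is a permutation, i.e. each topic is the majority in exactly one cluster):

```python
import numpy as np

# Toy contingency matrix H: rows are topics A_1..A_3, columns clusters C_1..C_3.
H = np.array([
    [0, 5, 1],
    [4, 0, 2],
    [1, 0, 6],
])

m = H.argmax(axis=0)     # m(j): majority topic of cluster C_j
F = H[:, np.argsort(m)]  # columns rearranged so that j' = m(j)

# After rearrangement the majority counts sit on the diagonal of F.
```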
Appendix: Pseudo-CMD-2
• For each word-by-document matrix of cluster T_i, we keep only the components related to the words in the word cluster W_i
• We sub-partition each new matrix to obtain more than one direction for each original partition
Appendix: Evaluation of cluster quality-2
• Accuracy: A = (1/N) Σ_j max_i h(A_i, C_j), where N is the total number of documents
• Classification error: E = 1 − A
Appendix: Evaluation of cluster quality-3
• Conditional entropy: H(A|C) = Σ_j P(C_j) H(A|C_j), where H(A|C_j) = −Σ_i P(A_i|C_j) log P(A_i|C_j)