
Pseudo-supervised Clustering for Text Documents

Marco Maggini, Leonardo Rigutini, Marco Turchi. Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena, Siena - Italy.


Presentation Transcript


  1. Pseudo-supervised Clustering for Text Documents • Marco Maggini, Leonardo Rigutini, Marco Turchi • Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena, Siena - Italy

  2. Outline • Document representation • Pseudo-Supervised Clustering • Evaluation of cluster quality • Experimental results • Conclusions Pseudo-Supervised Clustering for Text Documents

  3. Vector Space Model • Representation with a term-weight vector in the vocabulary space • di = [wi,1, wi,2, wi,3, …, wi,v]’ • A commonly used weighting scheme is TF-IDF • Documents are compared using the cosine correlation
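As a rough illustration of the representation on this slide, the sketch below builds TF-IDF vectors and compares them with the cosine correlation. The weighting `tf * log(N / df)` is one common TF-IDF variant and is an assumption here, since the slide does not say which variant was used; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Assumed variant: raw term frequency times log inverse document frequency.
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(x, y):
    """Cosine correlation between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Hypothetical toy corpus: two related documents and one unrelated.
docs = [
    "wavelet transform image".split(),
    "wavelet image compression".split(),
    "neural network training".split(),
]
vocab, vecs = tfidf_vectors(docs)
```

Documents that share weighted terms get a positive cosine, while documents with disjoint vocabularies get exactly zero, which is one way to see why sparse term vectors miss semantic relationships between different words.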

  4. Vector Space Model: Limitations • High dimensionality • Each term is an independent component in the document representation • The semantic relationships between words are not considered • Many irrelevant features • Feature selection may be difficult, especially for unsupervised tasks • Vectors are very sparse

  5. Vector Space Model: Projection • Projection to a lower-dimensional space • Definition of a basis for the projection • Use of statistical properties of the word-by-document matrix on a given corpus • SVD (Latent Semantic Analysis) • Concept Matrix Decomposition [Dhillon & Modha, Machine Learning, 2001] • Data partition + SVD/CMD for each partition • (Partially) supervised partitioning

  6. Singular Value Decomposition • SVD of the |V|×|D| word-by-document matrix (|D| > |V|) • The orthonormal matrix U represents a basis for document representation • The k columns of U corresponding to the largest singular values in Σ form the basis for the projected space
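A minimal sketch of this projection with NumPy, assuming the word-by-document matrix is called `X` (a name not used on the slide) and using random toy data in place of a real corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 10))          # toy |V| x |D| matrix: 6 terms, 10 documents

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k = U[:, :k]                   # basis: left singular vectors of the k largest singular values
Z = U_k.T @ X                    # k x |D| coordinates of the documents in the projected space
X_k = U_k @ Z                    # rank-k approximation of X in the original term space
```

Because the columns of `U_k` are orthonormal, the projection is a plain matrix product and needs no least-squares solve.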

  7. Concept Matrix Decomposition-1 • Use a basis which describes a set of k concepts represented by k reference term distributions • The projection into the concept space is obtained by solving a least-squares problem

  8. Concept Matrix Decomposition-2 • The k concept vectors ci can be obtained as the normalized centroids of a partition of the document collection D: D = {D1, D2, …, Dk} • CMD exploits the prototypes of certain homogeneous sets of documents
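The two CMD slides can be sketched as follows, assuming the projection is the least-squares fit of the documents onto the concept vectors, as in the concept decompositions of Dhillon & Modha. The names `X`, `C`, `Z`, the helper functions and the toy partition are hypothetical.

```python
import numpy as np

def concept_vectors(X, labels, k):
    """Normalized centroids of the k document groups (documents are columns of X)."""
    C = np.zeros((X.shape[0], k))
    for j in range(k):
        centroid = X[:, labels == j].mean(axis=1)
        C[:, j] = centroid / np.linalg.norm(centroid)
    return C

def project(C, X):
    """Coordinates Z minimizing the Frobenius norm of X - C @ Z."""
    Z, *_ = np.linalg.lstsq(C, X, rcond=None)
    return Z

rng = np.random.default_rng(0)
X = rng.random((8, 12))                  # toy matrix: 8 terms, 12 documents
X /= np.linalg.norm(X, axis=0)           # unit-length document vectors
labels = np.arange(12) % 3               # hypothetical partition D1, D2, D3
C = concept_vectors(X, labels, 3)
Z = project(C, X)
```

Unlike singular vectors, concept vectors are generally not orthogonal, which is why the coordinates come from a least-squares solve rather than from `C.T @ X`.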

  9. Pseudo-Supervised Clustering • Selection of the projection basis using a supervised partition of the document set • Determine a partition Π of a reference subset T of the document corpus • Select a basis Bi for each set Πi in the partition using SVD/CMD • Project the documents using the basis B = ∪i Bi • Apply a clustering algorithm to the document corpus represented using the basis B • Optionally iterate, refining the choice of the reference subset

  10. Pseudo SVD-1 • The SVD is computed for the documents in each subset Πi in Π • The basis Bi is composed of the vi left singular vectors Ui • The new basis B is represented by the matrix whose columns are the vectors of all the bases Bi

  11. Pseudo SVD-2 • The Pseudo-SVD representation of the word-by-document matrix of the corpus is the matrix Z* of its coordinates in the basis B • The projection requires the solution of a least mean square problem
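A sketch of the Pseudo-SVD idea under stated assumptions: the basis is taken to be the side-by-side concatenation of the leading left singular vectors of each reference subset, and the projection is computed with an ordinary least-squares solve. The helper names, the number of vectors per subset `v` and the toy data are hypothetical.

```python
import numpy as np

def pseudo_svd_basis(X_ref, labels, v):
    """Concatenate the top-v left singular vectors of each labelled reference subset."""
    blocks = []
    for j in np.unique(labels):
        U, _, _ = np.linalg.svd(X_ref[:, labels == j], full_matrices=False)
        blocks.append(U[:, :v])
    return np.hstack(blocks)

def project(B, X):
    """Least-squares coordinates of the columns of X in the basis B."""
    Z, *_ = np.linalg.lstsq(B, X, rcond=None)
    return Z

rng = np.random.default_rng(1)
X = rng.random((20, 30))             # toy corpus: 20 terms, 30 documents
ref_idx = np.arange(12)              # reference subset T: the first 12 documents
labels = ref_idx % 3                 # hypothetical supervised partition of T into 3 sets
B = pseudo_svd_basis(X[:, ref_idx], labels, v=2)
Z = project(B, X)                    # representation of the whole corpus
```

Vectors taken from different subsets are not orthogonal to each other in general, so `B.T @ X` would not give the best fit; the least-squares solve handles that.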

  12. Pseudo CMD-1 • An orthogonal basis is computed as follows • Compute the centroid (concept vector) of each subset Πi in Π • Compute the word cluster for each concept vector • A word belongs to the word cluster Wi of subset Πi if its weight in the concept vector ci is greater than its weights in the other concept vectors • Each word is assigned to only one subset Πi • Represent the documents in Πi using only the features in the corresponding word cluster Wi • Compute a partition of Πi into vi clusters and compute the word vectors of each centroid

  13. Pseudo CMD-2 • Each subset Πi is represented by a set of vi directions obtained from the concept vectors c’ij • These vectors are orthogonal since each word belongs to only one c’ij • Document projection
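The word-cluster construction can be sketched in a simplified form. The code below keeps a single direction per subset (that is, it assumes vi = 1 and skips the sub-partitioning step), assigns each word to the subset whose centroid gives it the largest weight, and zeroes it elsewhere. The function name and the block-structured toy data are hypothetical.

```python
import numpy as np

def pseudo_cmd_basis(X_ref, labels):
    """One unit-length direction per subset; each word is non-zero in exactly one direction."""
    groups = np.unique(labels)
    # Concept vectors: centroid of each labelled subset, one column per subset.
    centroids = np.stack([X_ref[:, labels == g].mean(axis=1) for g in groups], axis=1)
    owner = centroids.argmax(axis=1)                   # word cluster of each word
    mask = owner[:, None] == np.arange(len(groups))[None, :]
    B = np.where(mask, centroids, 0.0)                 # keep a word only in its own cluster
    return B / np.linalg.norm(B, axis=0)

rng = np.random.default_rng(2)
X = 0.1 * rng.random((15, 9))                          # 15 words, 9 documents of background noise
labels = np.repeat([0, 1, 2], 3)                       # hypothetical partition into 3 subsets
for g in range(3):
    X[5 * g:5 * g + 5, 3 * g:3 * g + 3] += 1.0         # each subset favours its own 5 words
B = pseudo_cmd_basis(X, labels)
Z = B.T @ X                                            # document projection
```

Since the directions have disjoint supports they are orthonormal, so the projection reduces to `B.T @ X`. Obtaining several directions per subset would mean repeating the same construction inside each subset.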

  14. Evaluation of cluster quality • Experiments on pre-classified documents • Measure of the dispersion of the classes among the clusters • Contingency table: the matrix H, whose element h(Ai,Cj) is the number of items with label Ai assigned to the cluster Cj • Accuracy • “classification using majority voting” • Conditional Entropy • “confusion” in each cluster • Human Evaluation
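A small sketch of these measures computed from the contingency table. It assumes accuracy is the fraction of items that carry the majority label of their cluster, and that the conditional entropy is the entropy of the labels given the cluster, in bits; the slide names the measures but the exact formulas used by the authors are not shown, so both definitions are assumptions.

```python
import math
from collections import Counter

def contingency(labels, clusters):
    """H[(a, c)] = number of items with label a assigned to cluster c."""
    return Counter(zip(labels, clusters))

def accuracy(H):
    """Fraction of items that match the majority label of their cluster."""
    n = sum(H.values())
    best = Counter()
    for (a, c), h in H.items():
        best[c] = max(best[c], h)
    return sum(best.values()) / n

def conditional_entropy(H):
    """Entropy of the label given the cluster, in bits (0 for pure clusters)."""
    n = sum(H.values())
    size = Counter()
    for (a, c), h in H.items():
        size[c] += h
    return -sum(h / n * math.log2(h / size[c]) for (a, c), h in H.items())

# Hypothetical example: six items, two labels, two clusters.
labels = ["x", "x", "x", "y", "y", "y"]
clusters = [0, 0, 1, 1, 1, 1]
H = contingency(labels, clusters)
acc = accuracy(H)
ent = conditional_entropy(H)
```

Here cluster 0 is pure and cluster 1 mixes one "x" with three "y", so five of the six items agree with the majority label of their cluster.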

  15. Experimental results-1 • Data Preparation • Parsing of the PDF files • Term filtering using the Aspell-0.50.4.1 library • Removal of the stop words • Application of the Luhn Reduction to remove common words
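The last two preparation steps might look like the sketch below. It assumes that the Luhn reduction amounts to an upper cut-off on document frequency; the stop-word list, the `max_df_ratio` threshold and the toy documents are hypothetical, and PDF parsing and Aspell filtering are left out.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "is"}    # tiny illustrative list, not the one used in the experiments

def filter_terms(docs, max_df_ratio=0.8):
    """Drop stop words, then drop terms found in more than max_df_ratio of the documents."""
    docs = [[t for t in d if t not in STOP_WORDS] for d in docs]
    # Document frequency after stop-word removal.
    df = Counter(t for d in docs for t in set(d))
    limit = max_df_ratio * len(docs)
    return [[t for t in d if df[t] <= limit] for d in docs]

docs = [
    "the paper presents a wavelet method".split(),
    "the paper studies clustering of documents".split(),
    "a paper on neural networks".split(),
]
filtered = filter_terms(docs)
```

In this toy corpus "paper" occurs in every document, so it is removed as a common word even though it is not in the stop-word list.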

  16. Experimental results-2 • Data Set (conference papers)

  17. Experimental results-3 • We applied k-means using three different document representations: • original vocabulary basis • Pseudo-SVD (PSVD) • Pseudo-CMD (PCMD) • Each algorithm was applied setting the number of clusters to 10 • For PSVD and PCMD, we varied the number of principal components
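For reference, a bare-bones version of k-means (Lloyd's algorithm with Euclidean distance) that could be run on any of the three representations. The slide does not say which k-means variant, distance or initialization was used, so all of these are assumptions; the explicit `init_idx` argument and the two-cluster toy data are hypothetical and chosen only to keep the example deterministic.

```python
import numpy as np

def kmeans(Z, init_idx, n_iter=20):
    """Lloyd's k-means on the columns of Z, starting from the columns listed in init_idx."""
    P = Z.T                                         # one row per document
    centers = P[list(init_idx)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every center.
        d = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(assign == j):                 # keep the old center if a cluster empties
                centers[j] = P[assign == j].mean(axis=0)
    return assign, centers

rng = np.random.default_rng(3)
# Toy projected corpus: 2 dimensions, two well-separated groups of 5 documents.
Z = np.hstack([rng.normal(0.0, 0.1, (2, 5)), rng.normal(5.0, 0.1, (2, 5))])
assign, centers = kmeans(Z, init_idx=[0, 9])
```

Running the same routine on the vocabulary-space matrix and on a projected matrix only changes the input `Z`, which is what makes the three representations directly comparable.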

  18. Experimental results-4

  19. Experimental results-5 • Topic distribution for the Pseudo-SVD algorithm with v=7

  20. Experimental results-6 • Analyzing the results: • Low Accuracy • High Entropy • Due to: • The data set has many transversal topics (e.g. Class 5 → Wavelets) • We also evaluated the accuracy using the expert’s evaluations

  21. Experimental results-7 • Human expert’s evaluation of cluster accuracy

  22. Conclusions • We have presented two clustering algorithms for text documents which also use a clustering step in the definition of the basis for the document representation • We can exploit the prior knowledge of a human expert about the data set and bias the feature reduction step towards a more significant representation • The results show that the PSVD algorithm performs better than both the vocabulary TF-IDF representation and PCMD

  23. Thanks for your attention!!!

  24. Appendix: Vector Space Model-2 • Cosine correlation • Two vectors xi and xj are similar if:
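The formula from this slide is not preserved in the transcript. For reference, the cosine correlation in its standard form is given below, using the prime for transposition as in the rest of the deck; two vectors are usually regarded as similar when the value is close to 1. Any specific threshold the authors may have used is unknown.

```latex
\cos(x_i, x_j) = \frac{x_i' \, x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}
```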

  25. Appendix: Contingency and Confusion Matrix • If you associate each cluster Cj with the topic Am(j) for which Cj has the maximum number of documents, and rearrange the columns of H such that j’=m(j), you obtain the confusion matrix Fm
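The rearrangement can be sketched as follows, with H stored as a list of rows (topics) by columns (clusters). The slide does not say what happens when two clusters map to the same topic; the sketch assumes their columns are summed. The function name and the example counts are hypothetical.

```python
def confusion_matrix(H):
    """Map each cluster j to its majority topic m(j) and move column j to position m(j)."""
    n_topics, n_clusters = len(H), len(H[0])
    F = [[0] * n_topics for _ in range(n_topics)]
    for j in range(n_clusters):
        # m(j): the topic with the most documents in cluster j.
        m = max(range(n_topics), key=lambda i: H[i][j])
        for i in range(n_topics):
            F[i][m] += H[i][j]          # assumption: clusters sharing a topic are merged
    return F

# Hypothetical contingency table: 3 topics (rows) by 3 clusters (columns).
H = [[1, 8, 0],
     [7, 1, 2],
     [0, 2, 9]]
F = confusion_matrix(H)
```

After the rearrangement the correctly grouped documents sit on the diagonal of F, so the majority-vote accuracy is the trace of F divided by the total number of documents.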

  26. Appendix: Pseudo CMD-2 • For each word-by-document matrix for cluster i, we keep only the components related to the words in the word cluster Wi • We sub-partition each new matrix to obtain more than one direction for each original partition

  27. Appendix: Evaluation of cluster quality-2 • Accuracy • Classification Error

  28. Evaluation of cluster quality-3 • Conditional Entropy
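The formulas for these appendix slides are not preserved in the transcript. The block below gives standard definitions that are consistent with the contingency table h(Ai,Cj) described earlier; they are an assumption and may differ in detail (for example in the logarithm base or normalization) from what the authors used. N denotes the total number of documents and |Cj| the size of cluster Cj, both symbols introduced here for illustration.

```latex
\begin{aligned}
\text{Accuracy} &= \frac{1}{N} \sum_{j} \max_{i} \, h(A_i, C_j) \\
\text{Classification Error} &= 1 - \text{Accuracy} \\
H(A \mid C) &= - \sum_{j} \frac{|C_j|}{N} \sum_{i} p_{ij} \log p_{ij},
\qquad \text{where } p_{ij} = \frac{h(A_i, C_j)}{|C_j|}
\end{aligned}
```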
