This paper presents a novel algorithm for hierarchically co-clustering documents and words, motivated by the overwhelming amount of unstructured information available today. It outlines the Rowset Partitioning and Submatrix Agglomeration (RPSA) method, which combines partitioning and agglomerative techniques to improve document organization. The experimental results show that the proposed algorithm performs comparably to or better than traditional clustering methods such as k-means, with acceptable purity in hierarchical structures. The aim is to make browsing and navigating vast document collections easier.
A matrix density based algorithm to hierarchically co-cluster documents and words • Advisor: Dr. Hsu • Graduate: Keng-Wei Chang • Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru
Outline • Motivation • Objective • Introduction • Background • Rowset Partitioning and Submatrix Agglomeration (RPSA) • Experimental results • Conclusions • Personal Opinion
Motivation • With the explosion of unstructured information, it has become increasingly important to organize information in a comprehensible and navigable manner.
Objective • A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows. • This paper proposes an algorithm to hierarchically cluster documents.
Introduction • 1990s: 100 thousand pages • 2002: 2 billion pages • It has become increasingly important to organize the information • Manual organization is accurate, but not always feasible • Tools are needed to automatically arrange documents into labeled hierarchies • Proposal: RPSA, a two-step partitional-agglomerative algorithm
Background • Vector Model for Documents • Evaluation of Clustering Quality • Evaluation of Hierarchical Clustering
Vector Model for Documents • Unitized TF-IDF • We have d documents • Document i is represented by a vector whose entry for word j is derived from the number of occurrences of word j in document i • Term Frequency (TF) • Inverse Document Frequency (IDF)
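The formulas on this slide did not survive, so the following is only a minimal sketch of a unitized TF-IDF representation. It assumes the common form TF × log(d / document frequency) and takes "unitized" to mean scaling each vector to unit Euclidean length; the paper's exact weighting may differ. The function name `unitized_tfidf` is hypothetical.

```python
import math
from collections import Counter

def unitized_tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document.

    Assumption: weight = TF * log(d / DF), then scaled to unit L2 length.
    """
    d = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency per word
        vec = {w: n * math.log(d / df[w]) for w, n in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # Scale to unit length; an all-zero vector is left unchanged.
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return vectors
```

A word that occurs in every document gets weight 0 under this weighting, so it contributes nothing to the document-word matrix.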
Evaluation of Clustering Quality • 1. Purity: • 2. Entropy:
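The definitions of the two measures are missing from this slide. As a hedged illustration, the sketch below uses the standard textbook forms: purity as the fraction of documents that belong to the majority class of their cluster, and entropy as the size-weighted average of each cluster's class entropy. The paper may normalize differently; the names `purity` and `entropy` and the input layout are assumptions.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """clusters: list of lists of doc ids; labels: dict doc id -> true class."""
    n = sum(len(c) for c in clusters)
    # Each non-empty cluster contributes the size of its largest class.
    majority = sum(max(Counter(labels[x] for x in c).values())
                   for c in clusters if c)
    return majority / n

def entropy(clusters, labels):
    """Size-weighted average of per-cluster class entropy (base 2)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[x] for x in c)
        h = -sum((k / len(c)) * math.log2(k / len(c)) for k in counts.values())
        total += (len(c) / n) * h
    return total
```

Higher purity and lower entropy both indicate clusters that agree better with the true classes.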
Rowset Partitioning and Submatrix Agglomeration (RPSA) • Two-step partitional-agglomerative algorithm • 1st step: The Partitioning Step • 2nd step: The Agglomerative Step
The Partitioning Step • Define the density of submatrices • Notation: a row r, a column c, a set R of rows, a set C of columns
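The density formula itself is missing from this slide. The sketch below assumes that the density of a submatrix is the mean of its entries, which is one natural reading of "matrix density" but is not confirmed by the slide text. A single row r or column c is handled by passing a one-element set. The function name `density` is hypothetical.

```python
def density(matrix, rows, cols):
    """Mean entry of `matrix` restricted to the row set `rows` and column set `cols`.

    Assumption: density = (sum of entries) / (number of entries).
    """
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        return 0.0  # convention chosen here for an empty submatrix
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))
```

For example, `density(M, R, C)` gives the density of the co-cluster (R, C), and `density(M, [r], C)` the density of row r over the columns in C.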
The Partitioning Step • Generating a Leaf Cluster
The Partitioning Step • Choice of Leader Documents • The length of a document is the sum of the entries of the TF-IDF vector representing that document • Documents with relatively large lengths were observed to be better leader documents for the algorithm above
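The slide states only that longer documents were observed to make better leaders; it does not give the exact selection rule. As a hedged sketch, the hypothetical helper below ranks documents by length (row sum of the document-word matrix) so that candidates can be tried longest first.

```python
def rank_leader_candidates(matrix):
    """Return document (row) indices ordered by decreasing length.

    Length is taken to be the sum of the row's TF-IDF entries.
    """
    lengths = [sum(row) for row in matrix]
    return sorted(range(len(matrix)), key=lambda i: lengths[i], reverse=True)
```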
The Partitioning Step • The Complete Partitioning Algorithm
The Partitioning Step • Complexity Analysis • The time complexity is O(mz) • The space complexity is O(z)
The Agglomerative Step • Reduce the number of clusters • The similarity measure between two clusters for merging • Flat Clustering • Hierarchical Clustering
The Agglomerative Step • Complexity Analysis • The time complexity is O( ) • The space complexity is O( )
Experimental results-Flat Clustering • Data Sets
Experimental results-Flat Clustering • Results
Experimental results-Hierarchical Clustering • Data Sets
Conclusions • RPSA is comparable with or better than the best k-means run • Its performance does not degrade on small data sets • Its purity in the hierarchy is acceptable