Technique for Text Clustering: A Partition-Based Approach

A Technique for Text Clustering Using a Partition-Based Approach
Doan Nguyen

Presentation Outline
• Knowledge Retrieval Scenarios
• Challenges for Clustering of Documents
• Applicability Assumptions
• Cluster Analysis Steps
• An Example
• Conclusion

Knowledge Retrieval Scenarios
• Brute-Force Search (No Guidance)
• Contextual Search
• On-the-Fly Structuring/Grouping of the Result List

Clustering With Retrieved Documents (Jamie Callan, 2002)

Challenges for Clustering of Documents
• Handling high dimensionality
• Clustering quality
• Supporting multi-subject content
• Presentation of clustering data
• Performance throughput

Applicability Assumptions
• Clustering is applied to documents returned by a search engine
• There is a maximum number of documents that searchers may gain access to
• Clustering is applied only when the result list is broad
• Web searchers want quick access to content whenever possible

Proposed System Architecture
[Diagram. Input: User Query. Components: Query Processor, Search Engine, Cluster Analysis, Result Formulator. Resources: Stop Words, Stemming Rules, Search Indexes. Output: Result List + Clustering Data.]

Document Vectors
• Documents are represented as vectors when used computationally.
• Each vector holds a place for every term in the collection.
• Therefore, most vectors are sparse.

Term-by-document count matrix (Hearst & Larson, SIMS 202, UCB)
Rows: Doc A, Doc B, Doc C, Doc D, Doc E, Doc F, Doc G, Doc H, Doc I
Columns: nova, galaxy, heat, h’wood, film, role, diet, fur
Non-zero counts in row order (cell positions not preserved): 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3
• “Nova” occurs 10 times in Doc A.
• “Galaxy” occurs 5 times in Doc A.
• “Heat” occurs 3 times in Doc A.
• A blank cell means 0 occurrences.
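As a minimal sketch of the sparse representation described on this slide, the Python below stores a document as a dictionary of its non-zero term counts and expands it to a full-length vector only on demand. The Doc A counts are taken from the slide; the helper name `to_dense` and the dictionary layout are illustrative assumptions, not part of the presented system.

```python
# Vocabulary from the slide: one vector position per term in the collection.
VOCABULARY = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

# Sparse form: only non-zero counts are stored (Doc A values are from the slide).
doc_a = {"nova": 10, "galaxy": 5, "heat": 3}


def to_dense(sparse_doc, vocabulary=VOCABULARY):
    """Expand a sparse term-count dict into a full-length vector (hypothetical helper)."""
    return [sparse_doc.get(term, 0) for term in vocabulary]


dense_a = to_dense(doc_a)
print(dense_a)  # [10, 5, 3, 0, 0, 0, 0, 0]
```

Most positions are zero, which is why a dictionary of non-zero entries is usually the cheaper representation for large vocabularies.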

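Measuring Similarity
• Cosine similarity expression (standard form): $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$
• Tanimoto expression (standard form): $T(a, b) = \frac{a \cdot b}{\|a\|^2 + \|b\|^2 - a \cdot b}$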
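The two measures named on this slide can be sketched as follows, assuming the standard textbook definitions (the slide's own expressions did not survive in the transcript). The second vector, `doc_x`, is a hypothetical document made up for illustration; `doc_a` uses the Doc A counts from the Document Vectors slide.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """(a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def tanimoto_similarity(a, b):
    """(a . b) / (|a|^2 + |b|^2 - a . b); returns 0.0 if both vectors are all zeros."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


# Term order: nova, galaxy, heat, h'wood, film, role, diet, fur
doc_a = [10, 5, 3, 0, 0, 0, 0, 0]   # counts from the slide
doc_x = [5, 10, 0, 0, 0, 0, 0, 0]   # hypothetical document

print(cosine_similarity(doc_a, doc_x))
print(tanimoto_similarity(doc_a, doc_x))
```

Both measures equal 1 for identical vectors and 0 for vectors with no terms in common; Tanimoto penalizes differences in magnitude more heavily than cosine does.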

Document Frequency (DF)
• Document frequency: the number of documents in the dataset in which a word occurs. It is used for feature selection.

The slide marks with an "x" each document (Doc A through Doc I) in which a word occurs; the resulting DF per word is:

Word      DF
nova      3
galaxy    4
heat      2
h’wood    4
film      5
role      2
diet      3
fur       3
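A minimal sketch of computing DF and using it for feature selection is shown below. The three toy documents and the `min_df` cutoff are assumptions made for illustration; the slide states only that DF is used for feature selection, not how the threshold is chosen.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # set(): a word counts once per document
    return df


def select_features(df, min_df=2):
    """Keep words occurring in at least min_df documents (assumed selection rule)."""
    return sorted(word for word, count in df.items() if count >= min_df)


# Hypothetical tokenized documents.
docs = {
    "DocA": ["nova", "galaxy", "heat", "nova"],
    "DocB": ["nova", "galaxy"],
    "DocC": ["film", "role"],
}

df = document_frequency(docs)
print(select_features(df))  # ['galaxy', 'nova']
```

Note that "nova" appears twice in DocA but contributes only 1 to its DF, since DF counts documents rather than occurrences.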

Cluster Analysis Steps
• Text Preprocessor
  - Build Word_Index_Table (word ids, doc ids, document frequency (DF))
• Feature Extraction (Data Splitter)
  - Build Word_Processed_Table (document vectors)
  - Combine words when applicable
• Build Clusters
  - Compute seeded (centroid) documents. A centroid document consists of the words that appear in two or more documents under the same Word_Cluster_Id.
  - Assign a document to a seeded cluster if (1) it shares a common Word_Clustered_Id and (2) the distance between the centroid document and the inspected document is within a threshold value.
  - Decompose the main cluster into sub-clusters
• Cluster Output
  - Order all clusters by their DF values
  - Label each cluster with a name: Word_Clustered_Ids and DF
  - Output the result
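The steps above can be sketched roughly as follows. This is not the author's implementation: the Jaccard distance on word sets, the `min_df` and `max_distance` thresholds, the function names, and the toy documents are all assumptions, and word combination and sub-cluster decomposition are left out for brevity.

```python
from collections import defaultdict


def build_word_index(docs):
    """Map each word to the set of doc ids containing it (stand-in for Word_Index_Table)."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in set(words):
            index[word].add(doc_id)
    return index


def jaccard_distance(a, b):
    """1 - |intersection| / |union| on word sets (an assumed distance measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def build_clusters(docs, min_df=2, max_distance=0.8):
    index = build_word_index(docs)
    clusters = []
    for word, doc_ids in index.items():
        if len(doc_ids) < min_df:
            continue  # a word seen in one document cannot seed a cluster
        # Centroid: words present in two or more documents that share this word.
        counts = defaultdict(int)
        for doc_id in doc_ids:
            for w in set(docs[doc_id]):
                counts[w] += 1
        centroid = {w for w, c in counts.items() if c >= 2}
        # Assign documents that share the word and lie within the distance threshold.
        members = sorted(
            d for d in doc_ids
            if jaccard_distance(centroid, set(docs[d])) <= max_distance
        )
        if members:
            label = f"{word} ({len(doc_ids)})"  # cluster name: word and its DF
            clusters.append({"label": label, "df": len(doc_ids), "docs": members})
    # Order clusters by DF, highest first; ties are broken by label for stable output.
    clusters.sort(key=lambda c: (-c["df"], c["label"]))
    return clusters


# Hypothetical result snippets, already tokenized, with the query term removed.
docs = {
    "d1": ["rental", "deal", "airport"],
    "d2": ["rental", "airport", "discount"],
    "d3": ["insurance", "quote"],
    "d4": ["insurance", "rate", "quote"],
    "d5": ["dealer", "used"],
    "d6": ["insurance", "claim"],
}

clusters = build_clusters(docs)
for cluster in clusters:
    print(cluster["label"], cluster["docs"])
```

In this toy run, "airport" and "rental" seed clusters with identical members, which is the kind of redundancy a word-combination step could plausibly merge into a single cluster.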

An Example Search Result for a Query of “Car”

Sample Output of Preprocessor Word_Index_Table

Sample Output of Feature Extraction (Data Splitter)

Sample Output of Word_Processed_Table

Sample Output of Word_Processed_Table (with combined words)

Result of Clustering Formulation

CAR Example of Cluster Output

Conclusion
• Implementation concerns
• Data analysis
• Throughput performance considerations
• Usability of cluster name labels
