
A Technique For Text Clustering Using A Partition Based Approach Doan Nguyen


Presentation Transcript


  1. A Technique For Text Clustering Using A Partition Based Approach Doan Nguyen

  2. Presentation Outline • Knowledge Retrieval Scenarios • Challenges for Clustering of Documents • Applicability Assumptions • Cluster Analysis Steps • An Example • Conclusion

  3. Knowledge Retrieval Scenarios • Brute Force Search (no Guidance) • Contextual Search • On the Fly Structuring/Grouping of Result List

  4. Clustering With Retrieved Documents (Jamie Callan, 2002)

  5. Clustering With Retrieved Documents (Jamie Callan, 2002)

  6. Challenges for Clustering of Documents • Handling high dimensionality • Clustering quality • Support for multi-subject content • Presentation of clustering data • Throughput performance

  7. Applicability Assumptions • Clustering is applied to documents returned by a search engine • There is a maximum number of documents that searchers may gain access to • Clustering is applied only when the result list is broad • Web searchers want quick access to content whenever possible

  8. Proposed System Architecture [Diagram components: User Query, Query processor, Search Engine, Search Indexes, Stop Words, Stemming Rules, Cluster Analysis, Result Formulator; output: Result List + Clustering Data]

  9. Document Vectors • Documents are represented as vectors when used computationally • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse [Table: term counts for Doc A through Doc I over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; the cell positions are not recoverable from this transcript] • “Nova” occurs 10 times in Doc A • “Galaxy” occurs 5 times in Doc A • “Heat” occurs 3 times in Doc A • Blank means 0 occurrences (Hearst & Larson: SIMS 202, UCB)
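The representation on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the helper names `sparse_vector` and `dense_vector` are hypothetical, and Doc A is rebuilt only from the three counts the slide states.

```python
from collections import Counter

# Vocabulary taken from the slide's example table.
VOCAB = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def sparse_vector(tokens):
    """Count only the terms that occur; absent terms are implicitly 0."""
    return Counter(t for t in tokens if t in VOCAB)

def dense_vector(sparse):
    """Expand a sparse vector to one slot per vocabulary term."""
    return [sparse.get(term, 0) for term in VOCAB]

# Doc A, rebuilt from the counts given on the slide
# (nova: 10, galaxy: 5, heat: 3; every other term is blank, i.e. 0).
doc_a_tokens = ["nova"] * 10 + ["galaxy"] * 5 + ["heat"] * 3
doc_a = sparse_vector(doc_a_tokens)
doc_a_dense = dense_vector(doc_a)
```

The sparse form stores three entries for Doc A, while the dense form stores one slot for each of the eight terms, five of them zero. With a realistic vocabulary the gap is far larger, which is why sparse storage is the usual choice.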

  10. Measuring Similarity • Cosine Similarity Expression: • Tanimoto Expression:
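The two expressions on this slide did not survive in the transcript. The sketch below assumes the standard definitions: cosine similarity as the dot product divided by the product of the vector lengths, and the Tanimoto coefficient as the dot product divided by the sum of the squared lengths minus the dot product. The presenter's exact formulation may differ. Vectors are sparse term-to-count dictionaries, and `doc_b` is a made-up document used only for illustration.

```python
import math

def dot(a, b):
    """Dot product of two sparse term -> count vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

def tanimoto(a, b):
    """Tanimoto coefficient: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0

doc_a = {"nova": 10, "galaxy": 5, "heat": 3}  # counts from the Document Vectors slide
doc_b = {"nova": 5, "film": 10}               # hypothetical document

sim_cos = cosine(doc_a, doc_b)
sim_tan = tanimoto(doc_a, doc_b)
```

Both measures return 1 for identical vectors and 0 for vectors with no terms in common. Cosine ignores vector length, whereas Tanimoto also penalizes differences in magnitude.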

  11. Document Frequency (DF) • Document frequency: the number of documents in the dataset in which a word occurs. Used for feature selection. [Table: an x marks each of Doc A through Doc I in which a term occurs; the x positions are not recoverable from this transcript] • DF column, in term order: nova 3, galaxy 4, heat 2, h’wood 4, film 5, role 2, diet 3, fur 3
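A minimal Python sketch of computing DF and using it for feature selection. The slide says only that DF is used for feature selection, so the `min_df` cutoff, the function names, and the three sample documents are all assumptions made for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """docs: mapping of doc id -> list of tokens. Returns term -> DF."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term at most once per document
    return df

def select_features(df, min_df=2):
    """Keep terms that occur in at least min_df documents (assumed rule)."""
    return {term for term, n in df.items() if n >= min_df}

# Hypothetical documents; repeated words inside one document count once.
docs = {
    "DocA": ["nova", "galaxy", "heat", "nova"],
    "DocB": ["galaxy", "film"],
    "DocC": ["film", "role", "film"],
}

df = document_frequency(docs)
features = select_features(df)
```

Note that DF counts documents, not occurrences: "nova" appears twice in DocA but has a DF of 1.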

  12. Cluster Analysis Steps
    • Text Preprocessor
      - Build Word_Index_Table (Word ids, Doc ids, Document Frequency (DF))
    • Features Extraction (Data Splitter)
      - Build Word_Processed_Table (Document Vectors)
      - Combine words when applicable
    • Build Clusters
      - Compute seeded (centroid) documents. A centroid document consists of the words present in two or more documents under the same Word_Cluster_Id.
      - Assign a document to a seeded cluster if (1) it shares a common Word_Clustered_Id with the cluster and (2) the distance between the centroid document and the inspected document is within a threshold value
      - Decompose the main cluster into sub-clusters
    • Cluster Output
      - Order all clusters by their DF values
      - Label each cluster with a name: Word_Clustered_Ids and DF
      - Output the result
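The steps above can be sketched roughly in Python. This is one possible reading of the slide, not the presenter's implementation. It assumes each word with DF of 2 or more acts as a cluster seed, uses Jaccard distance over word sets as the distance measure, picks arbitrary values for `min_df` and `max_distance`, and omits word combining and sub-cluster decomposition. The sample documents are invented.

```python
from collections import Counter, defaultdict

def build_word_index(docs):
    """Map each word to the set of doc ids containing it (DF = set size)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for word in set(tokens):
            index[word].add(doc_id)
    return index

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| over two word sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def build_clusters(docs, min_df=2, max_distance=0.7):
    """Return [(label, doc ids)] ordered by DF, highest first."""
    index = build_word_index(docs)
    found = []
    for word, doc_ids in index.items():
        if len(doc_ids) < min_df:
            continue  # a seed word must be shared by several documents
        # Centroid: words present in two or more documents sharing the seed.
        counts = Counter(w for d in doc_ids for w in set(docs[d]))
        centroid = {w for w, n in counts.items() if n >= 2}
        # Assign documents whose distance to the centroid is within threshold.
        members = sorted(
            d for d in doc_ids
            if jaccard_distance(centroid, set(docs[d])) <= max_distance
        )
        if members:
            found.append((len(doc_ids), word, members))
    found.sort(key=lambda c: (-c[0], c[1]))  # DF descending, then word
    return [(f"{word} (DF={df})", members) for df, word, members in found]

# Hypothetical preprocessed results for a query such as "car".
docs = {
    "d1": ["car", "rental", "airport"],
    "d2": ["car", "rental", "discount"],
    "d3": ["car", "insurance", "quote"],
    "d4": ["car", "insurance", "claim"],
}

clusters = build_clusters(docs)
```

In this sketch a document can land in more than one cluster (d1 appears under both "car" and "rental"), which is one simple way to accommodate multi-subject content.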

  13. An Example Search Result for a Query of “Car”

  14. Sample Output of Preprocessor Word_Index_Table

  15. Sample Output of Feature Extraction (Data Splitter)

  16. Sample Output of Word_Processed_Table

  17. Sample Output of Word_Processed_Table (with combined words)

  18. Result of Clustering Formulation

  19. CAR Example of Cluster Output

  20. Conclusion • Implementation concerns • Data analysis • Throughput performance considerations • Usability of cluster name labels
