This paper explores a novel approach to text categorization based on summarization techniques, specifically leveraging a combination of word frequency and position methods. It emphasizes the importance of accurately and quickly categorizing documents in the growing realm of online information. By focusing on the title field, the proposed methodology demonstrates acceptable performance in labeling new documents across predefined categories. Experimental results from the Reuters Corpus support the premise that a reduced text size, like titles, can still yield effective categorization while minimizing execution time, making it suitable for online document classifiers.
A Text Categorization Based on Summarization Technique
Sue J. Ker, Department of Computer Science, Soochow University
Jen-Nan Chen, Department of Management, Ming Chuan University
ACL 2000
Presenter: 翁鴻加
Abstract
• Text categorization based on summarization
• Combines word-frequency and position methods to acquire knowledge
• Summarization-based categorization can achieve acceptable performance
Introduction
• Internet usage is growing
• Categorization should provide accurate information quickly
• Predefined categories are used to label new documents
• Knowledge is drawn from the title field only
Text Summarization
• Why use the title for categorization?
1. Summarization identifies informative evidence in a document.
2. Summarization techniques include position, cue phrases, word frequency, and discourse segmentation.
3. Word frequency and position are easy to implement.
4. The title fits the position method (Hovy and Lin, 1997).
5. TREC evaluation shows no significant difference between long and short queries.
Preprocessing and Feature Selection
• Tokenize on white space and punctuation
• Convert to lower case
• Remove stop words
• Stem
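The four preprocessing steps above can be sketched in Python as follows. This is only an illustration: the slides name neither a stop list nor a stemmer, so `STOP_WORDS` and `naive_stem` are hypothetical stand-ins, and a real system would more likely use a full stop list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stop list (assumption: the slides do not name one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(title):
    """Turn a raw title string into a list of stemmed feature tokens."""
    # 1. Split on anything that is not a letter or digit
    #    (covers white space and punctuation).
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", title) if t]
    # 2. Lower-case.
    tokens = [t.lower() for t in tokens]
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stem.
    return [naive_stem(t) for t in tokens]
```

Calling `preprocess` on a title string returns the feature tokens that the later weighting and ranking steps would count.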
Term Weight
• W(f,c): weight of term f in category c
[Figure: categories C1, C2, …, Cn and documents D1, D2, …, Dm]
TF_{f,c}: frequency of feature f in category c
T: the number of categories
DF_f: the number of categories that contain feature f
MAX_c: the maximum frequency of any feature in category c
N_c: the number of documents belonging to category c
Term Weight
• Normalize tf
• Probability of category
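The quantities named on the Term Weight slides (per-category feature frequency, category-level document frequency, per-category maximum frequency, documents per category) can be gathered from labeled training data as sketched below. The weighting formula itself did not survive in these slides, so the combination used here, frequency normalized by the category maximum times log(T / DF), is an assumption for illustration and not necessarily the authors' formula. Likewise, P(c) is assumed to be the share of training documents labeled c. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def collect_statistics(training):
    """training: list of (tokens, categories) pairs, one per document."""
    tf = defaultdict(Counter)  # tf[c][f]: frequency of feature f in category c
    n_docs = Counter()         # n_docs[c]: number of documents in category c
    for tokens, categories in training:
        for c in categories:
            tf[c].update(tokens)
            n_docs[c] += 1
    df = Counter()             # df[f]: number of categories containing f
    for c in tf:
        df.update(tf[c].keys())
    # max_tf[c]: highest frequency of any feature in category c
    max_tf = {c: max(counts.values(), default=1) for c, counts in tf.items()}
    return tf, df, max_tf, n_docs


def term_weights(training):
    """Return (weights, priors) under the assumed weighting described above."""
    tf, df, max_tf, n_docs = collect_statistics(training)
    num_categories = len(tf)   # T
    weights = {
        c: {
            # Assumed form: normalized tf times an idf-like factor.
            f: (count / max_tf[c]) * math.log(num_categories / df[f])
            for f, count in counts.items()
        }
        for c, counts in tf.items()
    }
    # Assumed prior: fraction of training documents labeled with c.
    priors = {c: n / len(training) for c, n in n_docs.items()}
    return weights, priors
```

Under this assumed form, a feature that occurs in every category gets weight zero, which matches the usual intuition that such a feature does not discriminate between categories.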
Category Ranking
F_c: the set of features f in category c
tf_{f,d}: the frequency of feature f in document d
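The ranking formula is also missing from the slide. One plausible reading of the two definitions above is that each category is scored by summing, over the features it shares with the document, the in-document frequency times the term weight. The sketch below implements that reading as an assumption; `rank_categories` and the shape of `weights` (a mapping from category to a feature-weight dictionary) are hypothetical.

```python
from collections import Counter


def rank_categories(doc_tokens, weights):
    """Rank categories for one document, best first.

    weights: {category: {feature: weight}}, for example learned from
    training titles. Returns a list of (category, score) pairs.
    """
    tf_d = Counter(doc_tokens)  # frequency of each feature in the document
    scores = {}
    for category, feature_weights in weights.items():
        # Only features that are in the category's feature set and also
        # occur in the document contribute to the score.
        scores[category] = sum(
            tf_d[f] * w for f, w in feature_weights.items() if f in tf_d
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because Reuters documents carry 1.23 categories on average, a classifier built on such a ranking would still need a rule for how many top-ranked categories to assign; the slides do not say which rule was used.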
Experiments
• The Reuters Corpus
• 7789 training documents
• 3309 test documents
• 93 categories
• Average number of categories per document: 1.23
• The number of training documents per category varies widely (2 to 2877), so P(c) also varies widely
Experiment Design
• Only the title field is used as the text scope
1. Test MAX_c and P(c)
2. Locate the minimum term frequency
Experiment Design
• Larger feature sets perform better
• Full text: about 92%
Experiment Design
• Titles contain little noise…
Conclusion
• A small text size (title only) is adequate for categorization
• A short title field reduces execution time
• The system suits online document classifiers
• The position method can draw on specific positions in a document