This paper discusses a novel approach to text classification using Stochastic Keyword Generation (SKG). It addresses the challenge of utilizing additional data to improve classification performance, particularly when summaries of texts are available for training but not during classification. The authors present experimental results from a dataset of help desk emails, demonstrating that classification using SKG significantly outperforms traditional methods. Future work includes theoretical analysis and potential application in diverse settings, paving the way for advancements in text classification techniques.
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003
Outline • Introduction • Text Classification Using Stochastic Keyword Generation • Experimental Results • Conclusion and Future Work
Introduction • Supervised Text Classification • Question: how can additional data available in training be used to improve performance? • New Text Classification Problem • Summaries of the texts, which are more indicative of their contents, are available in training • Note: Summaries are not available in classification • Example: classification at a help desk
Example • Email • When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Categories • Empty Outlook Message • Cannot Open Word File • Summary • receive emails; some emails have no subject and message body
New Text Classification Problem • Spaces • Users’ emails: space X • Categories: space Y • Engineers’ summaries (for training only): space S • Assumption • Summaries are much easier to classify than the emails themselves
Text Classification Using SKG • Conventional text classification: email x ∈ X → classification → category y ∈ Y (Empty Outlook Message) • Text classification using SKG: email x ∈ X → SKG → probability vector of x (email 0.75, receive 0.68, subject 0.45, body 0.45, …) → classification → category y ∈ Y (Empty Outlook Message) • Example email x: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem.
Stochastic Keyword Generation • Generating Keywords from a Given Text • Stochastic Keyword Generation (SKG) • Generate keywords and their conditional probabilities of occurrence given the text • Example • Text: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Keywords after Stochastic Keyword Generation: emails 0.75, receive 0.68, subject 0.45, body 0.45
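The idea of generating keywords together with their conditional probabilities can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one binary naive Bayes model per keyword, trained on whether the keyword occurs in the engineer's summary. The names `tokenize`, `KeywordModel`, and `skg_vector`, the choice of naive Bayes with Laplace smoothing, and the four toy training pairs are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


class KeywordModel:
    """Estimates P(keyword occurs in the summary | text) with naive Bayes."""

    def __init__(self, keyword):
        self.keyword = keyword

    def fit(self, texts, summaries):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        self.vocab = set()
        for text, summary in zip(texts, summaries):
            # Label: does the keyword occur in this text's summary?
            label = self.keyword in tokenize(summary)
            words = tokenize(text)
            self.docs[label] += 1
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def prob(self, text):
        n_docs = self.docs[True] + self.docs[False]
        log_score = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            # Laplace-smoothed class prior and word likelihoods.
            score = math.log((self.docs[label] + 1) / (n_docs + 2))
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log(
                        (self.counts[label][word] + 1) / (total + len(self.vocab))
                    )
            log_score[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_score.values())
        pos = math.exp(log_score[True] - top)
        neg = math.exp(log_score[False] - top)
        return pos / (pos + neg)


def skg_vector(models, text):
    # One probability per keyword: the representation passed on to the classifier.
    return [model.prob(text) for model in models]


# Toy training pairs (hypothetical, loosely modeled on the slide's example).
texts = [
    "I get a notice that an email has been received but the message is blank",
    "Word says the file cannot be opened when I double click the document",
    "new emails arrive with an empty body and no subject line",
    "my document will not open in Word after the update",
]
summaries = [
    "receive emails; some emails have no subject and message body",
    "cannot open word file",
    "emails have empty subject and body",
    "word file cannot open",
]

keywords = ["emails", "word"]
models = [KeywordModel(k).fit(texts, summaries) for k in keywords]
vector = skg_vector(models, "the email I received is blank")
```

Note that summaries are needed only in `fit`; `prob` takes the text alone, which matches the setting in which summaries are unavailable at classification time.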
SKG Model new text x
Model for Each Keyword new text x
Learning Using SKG SKG classification
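Once every text is represented by its keyword probability vector, any standard classifier can be trained on those vectors. The sketch below uses a plain multi-class perceptron as a stand-in; it is not the linear SVM or the perceptron algorithm with margins used in the experiments. The first training vector reuses the probabilities from the slide's example; the other vectors, the keyword order, and the function names are hypothetical.

```python
def predict(weights, vector):
    # Score each category with a linear function (last weight is the bias).
    features = vector + [1.0]
    return max(
        sorted(weights),
        key=lambda c: sum(w * v for w, v in zip(weights[c], features)),
    )


def train_perceptron(vectors, labels, epochs=20):
    # Plain multi-class perceptron: on a mistake, move the true category's
    # weights toward the example and the predicted category's weights away.
    dim = len(vectors[0]) + 1
    weights = {c: [0.0] * dim for c in set(labels)}
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            guess = predict(weights, vector)
            if guess != label:
                for i, value in enumerate(vector + [1.0]):
                    weights[label][i] += value
                    weights[guess][i] -= value
    return weights


# Keyword order (assumed): email, receive, subject, body, open, word, file.
train_vectors = [
    [0.75, 0.68, 0.45, 0.45, 0.05, 0.03, 0.04],  # values from the slide example
    [0.60, 0.55, 0.50, 0.40, 0.10, 0.05, 0.08],  # hypothetical
    [0.05, 0.04, 0.02, 0.03, 0.80, 0.85, 0.70],  # hypothetical
    [0.10, 0.08, 0.05, 0.06, 0.65, 0.75, 0.60],  # hypothetical
]
train_labels = [
    "Empty Outlook Message",
    "Empty Outlook Message",
    "Cannot Open Word File",
    "Cannot Open Word File",
]

weights = train_perceptron(train_vectors, train_labels)
category = predict(weights, [0.70, 0.50, 0.30, 0.35, 0.10, 0.10, 0.05])
```

The classifier never sees raw words, only the keyword probabilities, so its input dimension is the number of keywords rather than the vocabulary size of the emails.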
Data in Experiments • Data from the Microsoft help desk • 2517 texts from 52 categories • About 10,000 unique words in texts • About 1,500 unique words in summaries • Stopwords were removed; no stemming was applied • Training/Test Split • 5-fold cross validation
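The 5-fold split over the 2517 texts can be sketched as follows. The slides do not say how folds were assigned, so the shuffling, the fixed seed, and the helper name `k_fold_indices` are assumptions for illustration.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    # Shuffle once, then deal indices into k folds of near-equal size.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(2517, k=5))
```

In each round, summaries would be read only for the `train` indices; the `test` indices contribute text alone, since summaries are not available at classification time.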
Experimental Settings • Classifiers • Linear SVM (Platt 1998; Dumais et al. 1998) • Perceptron algorithm with margins (PAM) (Li et al. 2002) • Methods • Text classification using SKG • Methods for comparison: • Prior • Texts for training • Summaries for training • Texts plus summaries for training • Deterministic keyword generation (DKG)
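The slides do not define deterministic keyword generation. One plausible reading, assumed here, is that DKG keeps a keyword only when its probability exceeds a threshold, so the classifier sees a 0/1 vector instead of the probabilities. The threshold of 0.5 and the function name `dkg_vector` are assumptions; the probabilities are those from the slide's example.

```python
def dkg_vector(probabilities, threshold=0.5):
    # Assumed reading of DKG: a keyword is either generated (1) or not (0).
    return [1 if p > threshold else 0 for p in probabilities]


# SKG output from the slide example.
skg = {"email": 0.75, "receive": 0.68, "subject": 0.45, "body": 0.45}

# Deterministic counterpart under the assumed threshold.
dkg = dict(zip(skg, dkg_vector(list(skg.values()))))
```

Under this reading, `subject` and `body` drop to 0 even though SKG assigns them a probability of 0.45, which illustrates the kind of information a hard decision discards.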
Discussion • Email x ∈ X: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Summary s ∈ S: receive emails; some emails have no subject and message body • SKG maps x to a probability vector (email 0.75, receive 0.68, subject 0.45, body 0.45, …) → classification → category y ∈ Y: Empty Outlook Message
Conclusion and Future Work • Conclusion • Text classification using SKG significantly outperforms the methods that do not use it • Future Work • Theoretical analysis of the problem and the proposed method • Application in different settings