
A Survey on Text Categorization with Machine Learning



Presentation Transcript


  1. A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito

  2. Introduction: Text Categorization • Many digital texts are available • E-mail, online news, blogs, … • The need for automatic text categorization is increasing • It removes the need for human effort • Saves both time and cost

  3. Introduction: Text Categorization • Applications • Spam filtering • Topic categorization

  4. Introduction: Machine Learning • Builds categorization rules automatically from text features • Types of machine learning (ML) • Supervised learning • Labeling • Unsupervised learning • Clustering

  5. Introduction: Flow of ML • Prepare labeled training texts • Extract text features • Learn • Categorize new texts • [figure: a new text routed to Label1 or Label2]

  6. Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion

  7. Number of labels • Binary-label • True or false (ex. spam or not) • Can be applied to the other types • Multi-label • Many labels, but each text gets exactly one • Overlapping-label • One text may carry several labels • [figure: a binary Yes/No choice, one label out of L1-L4, and several labels out of L1-L4]

  8. Types of labels • Topic categorization • The basic task • Compares individual words • Author categorization • Sentiment categorization • Ex) product reviews • Needs more linguistic information

  9. Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion

  10. Feature of Text • How do we express the features of a text? • "Bag of words" • Ignores word order and structure • Ex) "I like this car." vs. "I don't like this car." • On such pairs, bag of words will not work well • Notation: d = document (text), t = term (word) • (a minimal sketch follows)
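As a minimal illustration of the bag-of-words representation (the tokenizer and function name below are ours, not from the survey), counting tokens while discarding their order:

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts, ignoring word order."""
    return Counter(text.lower().split())

# The slide's example: the two vectors differ only in the token "don't",
# so a bag-of-words model sees these opposite sentences as nearly identical.
print(bag_of_words("I like this car."))
print(bag_of_words("I don't like this car."))
```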

  11. Preprocessing • Remove stop words • "the", "a", "for", … • Stemming • relational -> relate, truly -> true • (a toy sketch follows)
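A toy sketch of both steps, assuming a hand-picked stop-word list and a crude suffix stripper (real systems use a full stop-word list and a proper algorithm such as the Porter stemmer):

```python
STOP_WORDS = {"the", "a", "for", "of", "and", "to"}  # tiny illustrative list

def preprocess(tokens):
    """Drop stop words, then strip a few common suffixes as a crude
    stand-in for real stemming (Porter would map relational -> relat)."""
    stems = []
    for t in tokens:
        if t in STOP_WORDS:
            continue
        for suffix in ("ional", "ing", "ly", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess(["the", "relational", "model", "truly", "works"]))
```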

  12. Term Weighting • Term frequency • The number of occurrences of a term in a document • Terms frequent within a document seem important for categorization • tf·idf • Terms appearing in many documents are not useful for categorization (see the formula below)
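The slide gives no formula; a standard variant of tf·idf (one of several in the literature) weights term t in document d as:

```latex
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the count of t in d, df(t) is the number of documents containing t, and N is the total number of documents; the idf factor shrinks toward zero as a term appears in more documents.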

  13. Sentiment Weighting • For sentiment classification, weight a word as positive or negative • Constructing a sentiment dictionary • WordNet [04 Kamps et al.] • A synonym database • Use the distance from 'good' and from 'bad' • d(good, happy) = 2, d(bad, happy) = 4 • [figure: synonym path placing 'happy' two steps from 'good' and four from 'bad']
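A sketch of the distance-based idea (the scoring function and the distance oracle are our placeholders; Kamps et al. use shortest paths in WordNet's synonym graph):

```python
def polarity(word, dist):
    """Positive when the word is closer to 'good' than to 'bad'
    in the synonym graph; dist is an assumed shortest-path oracle."""
    return dist(word, "bad") - dist(word, "good")

# Using the slide's distances: d(good, happy) = 2, d(bad, happy) = 4
example = {("happy", "good"): 2, ("happy", "bad"): 4}
dist = lambda w, ref: example[(w, ref)]
print(polarity("happy", dist))  # 2 > 0, so 'happy' is weighted positive
```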

  14. Dimension Reduction • The feature matrix has size (#terms) × (#documents) • #terms ≈ the size of the dictionary • High computation cost • Risk of overfitting • Best for the training data ≠ best for real data • Choosing effective features • improves both accuracy and computation cost

  15. Dimension Reduction • df threshold • Terms appearing in very few documents (ex. only one) are not important • A term-category score (the formula was lost in this transcript) • If t and cj are independent, the score equals zero (one common such score is sketched below)
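One widely used term-category score with exactly this zero-under-independence property (our assumption, not necessarily the slide's choice) is the χ² statistic over the term/category document counts:

```latex
\chi^2(t, c_j) = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}
```

Here A (resp. B) counts documents inside c_j (resp. outside c_j) that contain t, C and D count those that do not, and N = A + B + C + D; if t and c_j are independent then AD = BC, making the score zero.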

  16. Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion

  17. Learning Algorithm • Many (almost all?) ML algorithms have been used for text categorization • Simple approaches • Naïve Bayes • k-Nearest Neighbor • High-performance approaches • Boosting • Support Vector Machine • Hierarchical learning

  18. Naïve Bayes • Bayes rule: P(c|d) = P(c) P(d|c) / P(d) • P(d|c) is hard to calculate directly • Assumption: terms occur independently, so P(d|c) ≈ ∏i P(ti|c) • (a textbook sketch follows)
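A textbook multinomial Naïve Bayes sketch under that independence assumption (the function names and Laplace smoothing are our choices, not the survey's):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors and a
    Laplace-smoothed log-likelihood estimator."""
    label_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(docs)) for c, n in label_counts.items()}

    def log_lik(t, c):
        total = sum(term_counts[c].values())
        return math.log((term_counts[c][t] + 1) / (total + len(vocab)))

    return priors, log_lik, vocab

def classify_nb(tokens, priors, log_lik, vocab):
    """argmax_c log P(c) + sum_i log P(t_i|c), per the factorization above."""
    scores = {c: lp + sum(log_lik(t, c) for t in tokens if t in vocab)
              for c, lp in priors.items()}
    return max(scores, key=scores.get)
```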

  19. k-Nearest Neighbor • Define a "distance" between two texts • Ex) Sim(d1, d2) = d1·d2 / (|d1||d2|) = cos θ • Find the k most similar training texts and categorize by majority vote • If the training data is large, memory and search costs grow • [figure: a query document and its k = 3 nearest neighbors; θ is the angle between the vectors d1 and d2]
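A direct sketch using cosine similarity over bag-of-words vectors (names are ours); note that every query scans the whole training set, which is exactly the memory and search cost the slide mentions:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine of the angle between two bag-of-words Counters."""
    dot = sum(c * d2[t] for t, c in d1.items())
    norm = math.sqrt(sum(c * c for c in d1.values())) \
         * math.sqrt(sum(c * c for c in d2.values()))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """training: list of (Counter, label) pairs; majority vote
    over the k most similar training texts."""
    nearest = sorted(training, key=lambda dl: cosine(query, dl[0]),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```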

  20. Boosting • BoosTexter [00 Schapire et al.] • AdaBoost • Builds many "weak learners" with different parameters • The k-th weak learner checks where learners 1..k-1 perform worst and tries to classify those worst-scored training examples correctly • BoosTexter uses decision stumps as its weak learners • (a generic sketch follows)
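A generic binary AdaBoost sketch with decision stumps (BoosTexter itself extends this to multi-label text data; the interfaces below are our simplification):

```python
import math

def best_stump(X, y, w):
    """Exhaustively pick the stump (feature j, threshold, sign) that
    minimizes weighted error. y uses labels in {-1, +1}."""
    best = None
    for j in range(len(X[0])):
        for thr in {x[j] for x in X}:
            for sign in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (sign if x[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, y, w)
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, j, thr, sign))
        # Re-weight: misclassified examples gain weight, so the next
        # weak learner focuses on the currently worst-scored data.
        w = [wi * math.exp(-alpha * yi * (sign if x[j] > thr else -sign))
             for x, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    vote = sum(a * (sign if x[j] > thr else -sign)
               for a, j, thr, sign in ensemble)
    return 1 if vote >= 0 else -1
```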

  21. Simple example of Boosting • [figure: three successive boosting rounds on a set of + and − points; the examples misclassified in one round are re-weighted for the next]

  22. Support Vector Machine • Text categorization with SVM [98 Joachims] • Maximize the margin between the classes (the standard objective is shown below)
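The objective itself is not on the slide; the standard hard-margin primal form (for linearly separable data, which the next slide argues is the typical case) is:

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2
\quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The separating hyperplane w·x + b = 0 then has margin 2/‖w‖, so minimizing ‖w‖ maximizes the margin.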

  23. Text Categorization with SVM • SVM works well for text categorization • Robust to high dimensionality • Robust against overfitting • Most text categorization problems are linearly separable • All of OHSUMED (a MEDLINE collection) • Most of Reuters-21578 (a news collection)

  24. Comparison of these methods • [02 Sebastiani] • Reuters-21578 (2 versions) • The versions differ in the number of categories

  25. Hierarchical Learning • TreeBoost [06 Esuli et al.] • A boosting algorithm for hierarchical labels • Training data: a label hierarchy plus texts with labels • Applies AdaBoost recursively down the hierarchy • A better classifier than 'flat' AdaBoost • Accuracy: 2-3% higher • Time: both training and categorization time decrease • Hierarchical SVM [04 Cai et al.] • (a structural sketch follows)
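A structural sketch of the recursive, per-node idea (the node layout and the fit_local / route interfaces are our placeholders, not the paper's API; TreeBoost specifically fits an AdaBoost classifier at each node):

```python
class LabelNode:
    """One node of the label hierarchy (see the figure on the next slide)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []  # list of LabelNode
        self.route = None               # doc -> index of the chosen child

def train_tree(node, docs_with_paths, fit_local):
    """docs_with_paths: (doc, path) pairs, where path lists the child
    index chosen at each level. fit_local fits one local classifier,
    e.g. AdaBoost over this node's children."""
    if not node.children:
        return
    node.route = fit_local(node, docs_with_paths)
    for i, child in enumerate(node.children):
        subset = [(d, p[1:]) for d, p in docs_with_paths if p and p[0] == i]
        train_tree(child, subset, fit_local)

def classify_tree(node, doc):
    """Walk down the hierarchy; each decision considers only one node's
    children, which is why training and categorization time both drop
    relative to a single flat classifier over all leaf labels."""
    while node.children:
        node = node.children[node.route(doc)]
    return node.label
```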

  26. TreeBoost • [figure: an example label hierarchy; root → L1, L2, L3, L4; L1 → L11, L12; L4 → L41, L42, L43; L42 → L421, L422]

  27. Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion

  28. Conclusion • An overview of text categorization with machine learning • Features of text • Learning algorithms • Future work • Natural language processing with machine learning, especially for Japanese • Computation cost
