Enhancing Chinese Text Categorization through Feature Engineering and Classifier Comparison

This study examines the challenges and methods of automatic Chinese text categorization (TC), comparing machine learning classifiers including Naïve Bayes, SVM, decision trees, k-nearest neighbors, MaxEnt, and language-model methods. It also explores why feature engineering matters for Chinese, covering data preparation, feature selection, and feature vector encoding. The key findings are that n-gram language models outperform the other methods and that feature engineering substantially reduces feature sparsity, though it may introduce ambiguity. Deeper semantic understanding is identified as a direction for future research.

Presentation Transcript


  1. Automatic Chinese Text Categorization: Feature Engineering and Comparison of Classification Approaches. Yi-An Lin and Yu-Te Lin

  2. Motivation • Text categorization (TC) is extensively researched for English but much less so for Chinese. • How much does feature engineering help for Chinese? • Should Chinese content be segmented into words? • Which machine learning approach works best for TC: Naïve Bayes, SVM, decision trees, k-nearest neighbors, MaxEnt, or language-model methods?

  3. Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion

  4. Data Preparation • Tool: Yahoo News Crawler • Categories: Entertainment, Politics, Business, Sports
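A minimal sketch of how the crawled articles could be loaded into labeled examples. The directory layout, file format, and function name are assumptions; the slides only name the crawler and the four categories.

```python
from pathlib import Path

# Assumed layout: one folder of crawled Yahoo News articles per category,
# stored as UTF-8 text files (the slides do not describe the storage format).
CATEGORIES = ["entertainment", "politics", "business", "sports"]

def load_corpus(root="yahoo_news"):
    """Return parallel lists of article texts and category labels."""
    docs, labels = [], []
    for cat in CATEGORIES:
        for path in Path(root, cat).glob("*.txt"):
            docs.append(path.read_text(encoding="utf-8"))
            labels.append(cat)
    return docs, labels
```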

  5. Feature Selection • Selection statistic (formula shown as a slide figure).

  6. Top features ranked by the selection statistic (shown as a slide figure).
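The name of the selection statistic did not survive transcription; a common choice for text categorization is the chi-square (χ²) statistic, used here purely as an illustration. The sketch assumes pre-segmented (space-separated) Chinese text and scikit-learn, neither of which is stated on the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy pre-segmented snippets standing in for the crawled news articles.
docs = [
    "股市 上涨 投资 获利",
    "球队 赢得 比赛 冠军",
    "总统 选举 投票 结果",
    "电影 明星 出席 首映",
]
labels = ["business", "sports", "politics", "entertainment"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Keep the k terms that score highest against the category labels.
selector = SelectKBest(chi2, k=8).fit(X, labels)
top_terms = vec.get_feature_names_out()[selector.get_support()]
print(top_terms)
```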

  7. Feature Vector Encoding • Binary: whether the document contains a word. • Count: number of occurrences. • TF: term frequency, occurrences normalized by document length. • TF-IDF: term frequency weighted by inverse document frequency.
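The four encodings map directly onto standard vectorizers; a sketch assuming scikit-learn and pre-segmented text (the original implementation is not specified):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["股市 上涨 投资 获利", "球队 赢得 比赛 冠军"]

binary_X = CountVectorizer(binary=True).fit_transform(docs)           # Binary: word present or not
count_X = CountVectorizer().fit_transform(docs)                       # Count: raw occurrences
tf_X = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)  # TF: occurrences / document length
tfidf_X = TfidfVectorizer().fit_transform(docs)                       # TF-IDF: TF weighted by inverse document frequency
```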

  8. Comparison of the different encodings

  9. Classifier Comparison Ⅰ

  10. Classifier Comparison Ⅱ
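The comparison results were shown as slide figures and are not reproduced in the transcript. The sketch below shows one way such a comparison could be run over the classifiers listed on slide 2; the scikit-learn pipeline, character-bigram features, and cross-validation setup are assumptions, not the authors' exact protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression      # MaxEnt
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "MaxEnt": LogisticRegression(max_iter=1000),
}

def compare(docs, labels, cv=5):
    """Print mean cross-validated accuracy for each classifier."""
    for name, clf in CLASSIFIERS.items():
        # Character bigrams avoid committing to a particular word segmenter.
        pipe = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 2)), clf)
        scores = cross_val_score(pipe, docs, labels, cv=cv)
        print(f"{name:14s} {scores.mean():.3f}")
```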

  11. Feature Engineering • Stop Terms: analogous to stop words in English. • Group Terms: merge terms that share common substrings. • Key Terms: terms distinctive of a single category.
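A sketch of the three steps; the stop list, the substring-grouping rule, and the key-term score below are illustrative assumptions, since the slides only name the ideas.

```python
from collections import Counter

STOP_TERMS = {"的", "了", "是", "在", "和"}  # assumed list, analogous to English stop words

def remove_stop_terms(tokens):
    """Stop terms: drop high-frequency, low-content terms."""
    return [t for t in tokens if t not in STOP_TERMS]

def group_terms(tokens, min_len=2):
    """Group terms: map each term to a shorter term it contains (illustrative rule)."""
    shorter = sorted(set(tokens), key=len)
    mapping = {}
    for t in tokens:
        for s in shorter:
            if len(s) >= min_len and s != t and s in t:
                mapping[t] = s
                break
    return [mapping.get(t, t) for t in tokens]

def key_terms(tokens_by_category, top_n=50):
    """Key terms: terms whose occurrences concentrate in a single category."""
    totals = Counter(t for toks in tokens_by_category.values() for t in toks)
    keys = {}
    for cat, toks in tokens_by_category.items():
        share = {t: c / totals[t] for t, c in Counter(toks).items()}
        keys[cat] = sorted(share, key=share.get, reverse=True)[:top_n]
    return keys
```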

  12. Comparison of feature-engineering methods (S: stop terms, G: group terms, K: key terms)

  13. Comparison after feature engineering

  14. Conclusion • The n-gram language model outperforms the other methods: • By nature, language models consider all features instead of relying on error-prone feature selection. • They impose no restrictive independence assumption (as Naïve Bayes does). • They offer better smoothing. • Feature engineering also helps reduce sparsity, but may introduce ambiguity. • Deeper semantic understanding is a promising direction for future research.
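For concreteness, a minimal sketch of the kind of character n-gram language-model classifier the conclusion refers to, with add-one smoothing standing in for whichever smoothing the authors actually used (an assumption).

```python
import math
from collections import Counter, defaultdict

class NgramLMClassifier:
    """One character n-gram language model per category; add-one smoothed."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # category -> n-gram counts
        self.totals = Counter()             # category -> total n-grams seen
        self.vocab = set()

    def _ngrams(self, text):
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def fit(self, docs, labels):
        for text, cat in zip(docs, labels):
            grams = self._ngrams(text)
            self.counts[cat].update(grams)
            self.totals[cat] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        V = len(self.vocab)
        best_cat, best_lp = None, float("-inf")
        for cat in self.counts:
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            lp = sum(math.log((self.counts[cat][g] + 1) / (self.totals[cat] + V))
                     for g in self._ngrams(text))
            if lp > best_lp:
                best_cat, best_lp = cat, lp
        return best_cat
```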
