Some Effective Techniques for Naive Bayes Text Classification Advisor: Dr. Hsu Presenter: Ai-Chen Liao Authors: Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng 2006. TKDE. Pages: 1457-1466
Outline • Motivation • Objective • About Naïve Bayes • Method • A per-document length normalization approach • Weight-enhancing method • Experimental Result • Conclusion • Personal Opinions
Motivation • While naïve Bayes is quite effective in various data mining tasks, it shows disappointing results in automatic text classification. • Observing how naïve Bayes behaves on natural language text, we found a serious problem in the parameter estimation process, which causes poor results in the text classification domain.
Objective • We aim to propose methods that alleviate these problems.
About Naïve Bayes • Multivariate Bernoulli naïve Bayes: A document is considered as a binary feature vector representing whether each word is present or absent. It is not equipped to utilize term frequencies in documents. • Multinomial model: Two serious problems: (1) rough parameter estimation (2) handling rare categories
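The difference between the two document representations can be sketched in a few lines of Python. This is only an illustration: the function names, the toy vocabulary, and the toy document are assumptions made for the example and do not come from the paper.

```python
from collections import Counter


def bernoulli_features(tokens, vocabulary):
    """Binary vector: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def term_count_features(tokens, vocabulary):
    """Count vector: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical toy data, for illustration only.
vocabulary = ["oil", "price", "game"]
document = ["oil", "price", "oil", "oil"]

bernoulli_vector = bernoulli_features(document, vocabulary)
count_vector = term_count_features(document, vocabulary)
```

The binary vector records that "oil" occurs but not that it occurs three times, which is the term-frequency information the Bernoulli model cannot use.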
Method: Multivariate Poisson Model for Text Classification • λ denotes the average number of times an event occurs within a given interval.
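For reference, a Poisson distribution with mean λ assigns probability e^(-λ) λ^k / k! to observing exactly k occurrences. The Python sketch below evaluates this in log space. The function name `poisson_log_pmf` is an assumption made for illustration, and the sketch covers only the distribution itself, not the paper's full classifier.

```python
import math


def poisson_log_pmf(k, lam):
    """Log-probability of observing k events under a Poisson with mean lam.

    Computes log(e^(-lam) * lam^k / k!) using lgamma(k + 1) = log(k!).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k < 0:
        raise ValueError("k must be non-negative")
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


# Hypothetical example: a term that occurs 2 times on average
# in documents of some class, observed 3 times in a new document.
log_p = poisson_log_pmf(3, 2.0)
```

Working in log space keeps the computation stable when many per-term probabilities are later summed instead of multiplied.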
Experimental Results • DS1: Reuters-21578 (consists of 21,578 news articles) • DS2: 20 Newsgroups (consists of 19,997 Usenet articles collected from 20 different newsgroups)
Conclusion • We propose a Poisson naïve Bayes text classification model with a weight-enhancing method. • We suggest per-document term frequency normalization to estimate the Poisson parameter, whereas the traditional multinomial classifier estimates its parameters by treating all the training documents as a single huge training document.
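The contrast between pooling all training documents and normalizing per document can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the paper's exact estimator: it omits smoothing, and the function names and toy documents are hypothetical.

```python
def pooled_estimate(documents, word):
    """Relative frequency of `word` when all documents are concatenated
    into one large document (no smoothing)."""
    total_count = sum(doc.count(word) for doc in documents)
    total_length = sum(len(doc) for doc in documents)
    return total_count / total_length


def per_document_estimate(documents, word):
    """Average of each document's length-normalized frequency of `word`,
    so every document contributes equally regardless of its length."""
    normalized = [doc.count(word) / len(doc) for doc in documents]
    return sum(normalized) / len(normalized)


# Hypothetical toy class: one short and one long document.
short_doc = ["oil", "other"]                   # 1 of 2 tokens is "oil"
long_doc = ["oil"] * 10 + ["other"] * 90       # 10 of 100 tokens are "oil"
documents = [short_doc, long_doc]

pooled = pooled_estimate(documents, "oil")
per_document = per_document_estimate(documents, "oil")
```

In the pooled estimate the long document dominates, while the per-document estimate weights both documents equally, so the two values differ noticeably on this toy data.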
Personal Opinions • Advantage • … • Drawback • … • Application • Text classification…