Bayesian Online Classifiers for Text Classification and Filtering

Bayesian Online Classifiers for Text Classification and Filtering • A Paper by Kian Ming Adam Chai, Hwee Tou Ng and Hai Leong Chieu • Presented by Eric Franklin and Changshu Jian

Agenda • Problem Description & Proposed Solution • Algorithm Description • Evaluation & Conclusions • Discussion & Critiques • Summary

Problem Description As the number of documents that exist grows daily, there is a greater need for ways to classify text documents, and handcrafting text classifiers is a tedious process. New methods for classifying documents are relevant to the field of Data Mining as classification is the entry point for all subsequent data mining functions.

Problem Description Why document classification? • Spam filtering • Users want to browse mass search results by topic (IR) • Is a document cluster relevant to the same query? Why Bayesian? • Bayesian classification has proved useful in document classification and has a rigorous theoretical basis • Offline? Needs a large data set to improve accuracy

Proposed Solution • Two related Bayesian algorithms can perform comparably to Support Vector Machines (SVM) • Bayesian Online Perceptron • Bayesian Online Gaussian Process • The online approach allows continuous learning without storing all the previous data • Continuous learning allows the utilization of information obtained from subsequent data after the initial training

Bayesian Online Learning • Given m instances of past data Dm = {(yt, xt), t = 1...m}, the predictive probability of the relevance of a document described by x is p(y|x, Dm) = ∫ da p(y|x, a) p(a|Dm) = ⟨p(y|x, a)⟩m • a is a random variable with probability density p(a|Dm) • Integrate over all the possible values of a to obtain the prediction • Explicit dependence of the posterior p(a|Dt+1) on the past data is removed by approximating it with a distribution p(a|At+1)

Bayesian Online Learning (cont) • Starting from the prior p0(a) = p(a|A0), learning comprises two steps • Update the posterior probability using Bayes' rule • Approximate the updated posterior probability • Approximation is done by minimizing the Kullback-Leibler distance between the approximating and approximated distributions • Kullback-Leibler Distance • Non-symmetric measure of the difference between two probability distributions • Measures the expected number of extra bits required to code samples from one distribution using a code optimized for the other
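The non-symmetry of the Kullback-Leibler distance can be checked on a small example. The sketch below is illustrative only: the function name `kl_divergence` and the two toy distributions are assumptions, not taken from the paper, and it handles discrete distributions, whereas the paper works with continuous densities over a.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log2(pi / qi)
    return total

# Two toy distributions (hypothetical values, chosen only for illustration).
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # extra bits when coding samples of p with a code built for q
backward = kl_divergence(q, p)  # extra bits when coding samples of q with a code built for p
print(forward, backward)        # the two values differ: the measure is not symmetric
```

Swapping the arguments changes the result, which is why the direction of the approximation matters when the updated posterior is projected back onto the approximating family.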

What is a Perceptron? • Simplest feed-forward neural network • A perceptron is a binary classifier that maps real input vector values to binary output • A thresholding function, f(x) is used • If w ∙ x + b > 0, f(x) maps to 1, else f(x) maps to 0 • x is the input vector • w is a vector of real-valued weights • b is a bias, a constant term that does not depend on any input value
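The thresholding rule above can be written in a few lines. This is a minimal sketch; the function name `perceptron_predict` and the example weights are assumptions made for illustration.

```python
def perceptron_predict(w, x, b):
    """Return 1 if w . x + b > 0, else 0 (the thresholding function f(x))."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0

# Hypothetical weights and bias, chosen only to show both outcomes.
w = [1.0, -1.0]
b = -0.5
print(perceptron_predict(w, [1.0, 0.0], b))  # activation is 0.5, so the output is 1
print(perceptron_predict(w, [0.0, 1.0], b))  # activation is -1.5, so the output is 0
```

Note that an activation of exactly 0 maps to 0, because the rule uses a strict inequality.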

Bayesian Online Perceptron • Likelihood is defined as a probit model: p(y|x, a) = Φ(y a·x / σ0) • Where a defines a perceptron • σ0² is a fixed noise variance • Φ is the cumulative Gaussian distribution • x is a vector representing a document • y is the document relevance, where y ∈ {−1,1}

Bayesian Online Perceptron Algorithm • Successive calculation of the means ⟨a⟩t and covariances Ct of the posterior probabilities for m documents • Initialize ⟨a⟩0 to be 0 and C0 to be 1 • For t = 0, 1, ..., m−1 • yt+1 is the relevance indicator for document xt+1 • Calculate st+1 , σt+1 ,⟨h⟩t and ⟨p(yt+1 | h)⟩t • Calculate and • Calculate • Calculate • Calculate ⟨a⟩t+1 and Ct+1 • The prediction for datum (y,x) is ⟨p(y|x,a)⟩m = ⟨p(y|h)⟩m
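The loop above can be sketched in Python. The intermediate formulas did not survive in this transcript, so the update below is an assumption: it uses the standard Gaussian moment-matching result for a probit likelihood, where the mean and covariance are corrected by the first and second derivatives of ln⟨p(y|h)⟩ with respect to ⟨h⟩. The class name, the default `noise_var`, and the toy data are hypothetical, and the code may differ in detail from the paper's algorithm.

```python
import math
import numpy as np

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

class BayesianOnlinePerceptron:
    """Gaussian approximation N(mean, cov) to the posterior over the weights a."""

    def __init__(self, dim, noise_var=0.5):
        self.mean = np.zeros(dim)    # <a>_0 = 0
        self.cov = np.eye(dim)       # C_0 = identity
        self.noise_var = noise_var   # sigma_0^2 (assumed value)

    def _stats(self, x, y):
        s = self.cov @ x                            # s = C x
        sigma = math.sqrt(self.noise_var + x @ s)   # predictive std of the noisy activation
        z = y * (self.mean @ x) / sigma             # y <h> / sigma
        return s, sigma, z

    def predict_proba(self, x, y=1):
        """<p(y|h)> under the current Gaussian approximation."""
        _, _, z = self._stats(x, y)
        return std_normal_cdf(z)

    def update(self, x, y):
        """One online step for a labelled document (y in {-1, +1})."""
        s, sigma, z = self._stats(x, y)
        ratio = std_normal_pdf(z) / max(std_normal_cdf(z), 1e-300)
        d1 = y * ratio / sigma                      # d ln<p> / d<h>
        d2 = -(z * ratio + ratio ** 2) / sigma ** 2 # d^2 ln<p> / d<h>^2
        self.mean = self.mean + s * d1
        self.cov = self.cov + np.outer(s, s) * d2

# Toy, linearly separable data (hypothetical): the first feature decides the label.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([2.0, 1.0], 1), ([-2.0, 1.0], -1)]
model = BayesianOnlinePerceptron(dim=2)
for x, y in data:
    model.update(np.array(x), y)
print(model.predict_proba(np.array([1.0, 0.0])))
```

Each document is seen once and then discarded; only the mean vector and covariance matrix are carried forward, which is what makes the method online.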

Bayesian Online Gaussian Process • Gaussian process (GP) has been historically constrained to problems with small data sets • Uses efficient and effective approximations to the full GP formulation • Similar to Perceptron, but uses a kernel function to estimate weights

Evaluation • The authors use case studies to validate the proposed methodology • Two benchmark data sets • Strengths • Easy to compare with other methods • Weaknesses • Testing is not comprehensive; no test on a real application • No theoretical proof

Evaluation • Two tasks: classification & filtering • Classification • Reuters-21578 corpus • 9,603 training documents and 3,299 test documents • Filtering • OHSUMED • Only the Bayesian Online Perceptron considered

Evaluation: Classification • Feature Selection • For each category, select as features the set of all words for which −2 ln λ > 12.13 • Further prune by using only the top 300 features • Thresholding • Bayes decision rule, p(y = 1|x,Dm) > 0.5 • Additionally, MaxF1: threshold empirically optimized for each category to maximize the F1 measure
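The difference between the fixed Bayes threshold and MaxF1 can be illustrated with a small search over candidate thresholds. This is a sketch under assumptions: the function names and the toy scores are hypothetical, a document is predicted relevant when its score is at least the threshold, and the slides do not say on which data the paper optimizes the threshold.

```python
def f1_at(scores, labels, threshold):
    """F1 when documents with score >= threshold are predicted relevant (label +1)."""
    preds = [1 if s >= threshold else -1 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == -1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == -1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def max_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the best F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical predictive probabilities for one rare category.
scores = [0.9, 0.4, 0.35, 0.1]
labels = [1, 1, 1, -1]
print(f1_at(scores, labels, 0.5))        # fixed threshold at 0.5
print(max_f1_threshold(scores, labels))  # empirically chosen threshold and its F1
```

In this toy case two relevant documents score below 0.5, so lowering the threshold recovers them; this is the kind of effect that makes per-category threshold tuning helpful for rare categories.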

Methods comparison

Classification on Reuters-21578 • Generally, MaxF1 thresholding increases the performance of all the systems, especially for rare categories. • For the Bayesian classifiers, ExpectedF1 thresholding improves the performance of the systems on rare categories. • Perceptron implicitly implements the kernel used by GP-1, hence their similar results. • With MaxF1 thresholding, feature selection impedes the performance of SVM.

Classification on Reuters-21578 • With limited features, Bayesian classifiers outperform SVM for both common and rare categories. • Based on the sign tests, the Bayesian classifiers outperform SVM (using 8,362 words) for common categories, and vice versa for rare categories.

Evaluation: Filtering • Feature selection and adaptation • Training the classifier • Information gain • Results

Filtering on OHSUMED • System comparison • Using Bayesian online perceptron

Parameter Settings • Feature selection and Adaptation • Training the classifier • Information Gain

Results • Considering information gain amounts to a kind of active learning, where the willingness to trade off precision for learning decreases with Nret. • Features are constantly added as relevant documents are seen. When the classifier is retrained on past documents, the new features enable the classifier to gain new information from these documents.

Results • Bayesian online perceptron, together with the consideration for information gain, is a very competitive method.

Conclusions & Future Work • These algorithms performed comparably to SVM • Future work • Hybrid classification using Bayesian classifiers for common categories and maximum margin classifiers for rare categories • Modify Bayesian classifiers to use relevance feedback • Compare incremental SVM with the Bayesian online classifiers

Discussion • Major contributions of the paper • Testing of Existing Capability • Implemented and tested Bayesian online perceptron and Gaussian processes • Demonstrated the effectiveness of online learning with information gain on the TREC-9 batch-adaptive filtering task • New Capability • Offers online capability • Online processing is the most significant contribution of this paper, but we both feel that it needs further testing

Discussion • Assumptions made by the authors • Assume the test results are positive • How does negative feedback affect the system? • Assume that the approximated posterior is close enough to the exact posterior • Based on probability distribution • Computing cost & Scalability • This was tested against a corpus of ~20,000 documents. How would the system perform against 1 million documents? 1 billion documents?

Summary • The authors discuss the problem of classifying large sets of text documents • They propose two variants of Bayesian classification algorithms • Testing was performed against the Reuters-21578 corpus • The authors' algorithms performed similarly to Support Vector Machines
