Externally Enhanced Classifiers and Application in Web Page Classification

Jyh-Jong Tsay, National Chung Cheng University • Joint work with Chi-Feng Chang and Hsuan-Yu Chen • This research is supported in part by National Science Council, Taiwan, under.

Outline • Introduction • Externally Enhanced Classifiers • Enhanced NB • Topic Restriction • Conclusion

Classification: Definition • assignment of objects to a set of predefined categories (classes) • classification of applicants into risk levels • classification of web pages into topics • classification of protein sequences into families • topic-specific retrieval, information filtering, recommendation, …

Classification: Task • Input: a training set of examples, each labeled with one class label • Output: a model (classifier) that assigns a class label to each instance based on the other attributes • The model can be used to predict the class of new instances, for which the class label is missing or unknown

Train and Test • example = instance + class label • Examples are divided into a training set and a test set • The classification model is built in two steps: • training - build the model from the training set • test - check the accuracy of the model using the test set

Train and Test • Kinds of models: • if-then rules • decision trees • joint probabilities • decision surfaces • Accuracy of models: • the known class of each test sample is matched against the class predicted by the model • accuracy rate = % of test set samples correctly classified by the model
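The train/test procedure and the accuracy rate defined above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the talk: the function names, the 30% test fraction, and the fixed random seed are assumptions chosen for the example.

```python
import random


def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle labeled examples and split them into (training set, test set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_rate(classifier, test_set):
    """Percentage of test examples whose predicted class matches the known class."""
    correct = sum(1 for instance, label in test_set if classifier(instance) == label)
    return 100.0 * correct / len(test_set)
```

Here an example is an (instance, class label) pair and a classifier is any callable that maps an instance to a label, so the same helper can score a rule set, a decision tree, or a probabilistic model.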

Training step • training data (with class labels) is given to the classification algorithm, which produces the classifier (model) • example model: if age < 31 or Car Type = Sports then Risk = High
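The if-then rule on this slide is itself a complete (if tiny) classifier. A minimal sketch in Python follows; the dictionary keys `age` and `car_type` and the fallback label `"Low"` are assumptions made for the example, since the slide only states when the risk is high.

```python
def risk_rule(applicant):
    """Classify an applicant with the single if-then rule from the slide."""
    if applicant["age"] < 31 or applicant["car_type"] == "Sports":
        return "High"
    # Assumption: the slide does not say what happens otherwise.
    return "Low"
```

For instance, `risk_rule({"age": 25, "car_type": "Family"})` returns `"High"` because the age condition alone is enough to trigger the rule.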

Test step • the classifier (model) is applied to test data

Classification (prediction) • the classifier (model) is applied to new data

Classification Techniques • Decision Tree Classification • Bayesian Classifiers • Hidden Markov Models (HMM) • Neural Networks • Support Vector Machines (SVM) • k-nearest neighbor classifiers (KNN) • Genetic Algorithms • Rough Set Approach

Web Page Classification • automatically assign a document to a predefined category (topic) • Topic-Specific Retrieval, Filtering, Recommendation, …

External Annotations • S: source hierarchy, T: target hierarchy (www.openfind.com.tw, www.yam.com) • figure: topics (classes) S1 to S6 and T1 to T9, each containing documents • users browse to find information of interest; filtering and categorization follow the user's interests, and information from other related categories is used to help categorization • use external annotations to enhance the classification of documents already categorized in one topic hierarchy (source) into another one (target).

Examples • web directories • Google, Yahoo, ProFusion, … • domain-specific channels • music, sports, … • product catalogs • expert annotations

Learning Approaches • internal learning • produces traditional classifiers from internal information • large amount of internal information • external learning • produces external enhancers or reducers from external information • heterogeneous, sparse, dynamic

External Learning • Probabilistic Enhancement • use a probabilistic enhancer to improve probabilistic classifiers • Naïve Bayes, Hidden Markov Models, … • Topic Restriction • cascade a reducer to reduce the set of candidates • KNN, SVM, Neural Nets, …

Externally Enhanced Classifiers • base classifiers: KNN, SVM, NB, HMM • reducers: Topic Restriction • enhancers: Probabilistic Enhancement • an annotated instance goes into the externally enhanced classifier, which outputs the predicted class

Summary • Traditional Classifiers (source Yam.BusinessAndEconomics, target Openfind.BusinessAndEconomics) • Naïve Bayes: 55% • SVM: 57% • Enhanced Classifiers: • Enhanced Naïve Bayes: 66% • Topic Restricted SVM: 67%

Proposed Approaches • Probabilistic Enhancement that uses class information to enhance probabilistic classifiers such as Naïve Bayes and HMM • Topic Restriction that uses class information to restrict the set of candidate classes, and can be used to extend any classifier such as SVM and kNN

Probabilistic Methods • Probabilistic Classifier • When external information is available, • Probabilistic Enhancement
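One way to read the step from the plain probabilistic classifier to its enhanced form is through Bayes' rule. The derivation below is a sketch under an assumption that is not stated on the slide: the document d and its source class s are conditionally independent given the target class t. The symbols d, s, and t are notation chosen here for illustration.

```latex
\begin{align*}
\hat{t}_{\text{plain}} &= \arg\max_{t} \; P(t)\, P(d \mid t) \\
P(t \mid d, s) &= \frac{P(d, s \mid t)\, P(t)}{P(d, s)}
  = \frac{P(d \mid t)\, P(s \mid t)\, P(t)}{P(d, s)}
  = \frac{P(d \mid t)\, P(t \mid s)\, P(s)}{P(d, s)} \\
\hat{t}_{\text{enhanced}} &= \arg\max_{t} \; P(t \mid s)\, P(d \mid t)
\end{align*}
```

Under this reading, the external annotation only replaces the class prior P(t) with the source-conditioned prior P(t | s), because P(s) / P(d, s) does not depend on t; the document model P(d | t) is unchanged.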

Estimation of P(vt|s) • straightforward estimation • more robust estimation • when • when
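As an illustration of the difference between a straightforward and a more robust estimate, the sketch below estimates the probability of a target class given a source class (written here as P(t | s)) from documents that carry both labels. Relative frequency (`alpha=0`) is one straightforward choice, and additive smoothing (`alpha > 0`) is one common way to make it robust to sparse counts. Both are assumptions for the example and may differ from the estimators on the slide; the function and parameter names are hypothetical.

```python
from collections import Counter


def estimate_target_given_source(pairs, target_classes, alpha=1.0):
    """Return a function prob(t, s) estimating P(t | s) from (source, target) pairs.

    alpha = 0 gives the plain relative frequency (undefined for unseen sources);
    alpha > 0 applies additive smoothing, so unseen (s, t) pairs keep a small
    non-zero probability.
    """
    pair_counts = Counter(pairs)
    source_counts = Counter(s for s, _ in pairs)
    n_targets = len(target_classes)

    def prob(t, s):
        return (pair_counts[(s, t)] + alpha) / (source_counts[s] + alpha * n_targets)

    return prob
```

With three documents labeled (S1, T1), one labeled (S1, T2), and three target classes, the smoothed estimate still assigns T3 a small probability given S1, whereas the relative frequency assigns it zero.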

NB-Based Methods (Agrawal and Srikant, 2001)

Data Sets • Data set I • source hierarchy: Yam • target hierarchy: Openfind • Data set II • source hierarchy: Yam.BusinessAndEconomics • target hierarchy: Openfind.BusinessAndEconomics • Data set III • source hierarchy: Google.Business • target hierarchy: Yahoo.BusinessAndEconomics

Comparison of NB-Based Method

Class-Level Comparison

Topic Restriction (TR) • TR uses class information to reduce the set of candidate classes, and can be used with any traditional classifier such as SVM and kNN • Static Topic Restriction • Most source classes are related to a small number of target classes • Consider only those target classes that intersect the source class • Dynamic Topic Restriction • Simple classifiers achieve a very high top-k measure for small k • Consider only the top k classes ranked by a simple classifier
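Both forms of topic restriction described above amount to computing a candidate set and then letting the stronger classifier choose within it. The sketch below is a minimal illustration under assumptions: classifier outputs are represented as dictionaries from class name to score, "intersect" is taken to mean that the two classes share at least one labeled document, and the fallback to all classes when the candidate set is empty is a choice made for the example.

```python
def static_candidates(source_class, labeled_pairs):
    """Target classes that share at least one document with the source class."""
    return {t for s, t in labeled_pairs if s == source_class}


def dynamic_candidates(simple_scores, k):
    """Top-k target classes ranked by a simple classifier's scores."""
    ranked = sorted(simple_scores, key=simple_scores.get, reverse=True)
    return set(ranked[:k])


def restricted_predict(strong_scores, candidates):
    """Pick the stronger classifier's best class among the candidates only."""
    allowed = {t: score for t, score in strong_scores.items() if t in candidates}
    if not allowed:
        # Assumption: fall back to all classes if the restriction leaves nothing.
        allowed = strong_scores
    return max(allowed, key=allowed.get)
```

In a real cascade the stronger classifier (for example a one-vs-rest SVM) would be evaluated only on the candidate classes, which is where the running-time saving comes from; the dictionaries here stand in for those scores.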

Static Topic Restriction

Dynamic Topic Restriction Data Set II

Conclusion • We propose probabilistic enhancement to enhance Naïve Bayes. • We propose a topic restriction method to extend SVM. • We carry out extensive experiments on text collections from Google and Yahoo, and from Openfind and Yam. • Experiments show that our approaches significantly improve on traditional approaches.

Further Remarks • Topic restriction is a general idea for cascading simpler classifiers, such as NB and linear classifiers, with more complicated classifiers, such as SVM and kNN. • Cascading improves both the running time and the classification accuracy of SVM and kNN, especially when the number of topic classes is large. • Further study on topic restriction is ongoing.

Cascaded SVM • Web Directory Data (Openfind)

Cascaded SVM • CNA news collection
