Explore how externally enhanced classifiers improve web page classification efficiency. Learn about classification models, techniques, and probabilistic methods. Enhance classifiers using external information and topic restrictions. Evaluate performance with examples.
Externally Enhanced Classifiers and Application in Web Page Classification • Jyh-Jong Tsay, National Chung Cheng University • Joint work with Chi-Feng Chang and Hsuan-Yu Chen • This research is supported in part by the National Science Council, Taiwan, under.
Outline • Introduction • Externally Enhanced Classifiers • Enhanced NB • Topic Restriction • Conclusion
Classification: Definition • assignment of objects into a set of predefined categories (classes) • classification of applicants into risk levels • classification of web pages into topics • classification of protein sequences into families • topic-specific retrieval, information filter, recommendation, …
Classification: Task • Input: a training set of examples, each labeled with one class label • Output: a model (classifier) that assigns a class label to each instance based on the other attributes • The model can be used to predict the class of new instances, for which the class label is missing or unknown
Train and Test • example = instance + class label • Examples are divided into training set + test set • Classification model is built in two steps: • training - build the model from the training set • test - check the accuracy of the model using the test set
Train and Test • Kind of models: • if - then rules • decision trees • joint probabilities • decision surfaces • Accuracy of models: • the known class of test samples is matched against the class predicted by the model • accuracy rate = % of test set samples correctly classified by the model
Training step • training data (with class labels) → classification algorithm → classifier (model) • example model: if age < 31 or Car Type = Sports then Risk = High
Test step • test data → classifier (model)
Classification (prediction) • new data → classifier (model)
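The train/test/predict steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the rule is the one shown on the training-step slide, while the `"Low"` fallback label, the field names `age` and `car_type`, and the four test examples are hypothetical.

```python
def rule_classifier(instance):
    """Toy if-then model: the rule shown on the training-step slide."""
    if instance["age"] < 31 or instance["car_type"] == "Sports":
        return "High"
    # Assumption: the slide only gives the rule for High; "Low" is a made-up default.
    return "Low"


def accuracy(model, test_set):
    """Fraction of test examples whose known class matches the predicted class."""
    correct = sum(1 for instance, label in test_set if model(instance) == label)
    return correct / len(test_set)


# Hypothetical test set: (instance, known class label) pairs.
test_set = [
    ({"age": 25, "car_type": "Family"}, "High"),
    ({"age": 45, "car_type": "Sports"}, "High"),
    ({"age": 50, "car_type": "Family"}, "Low"),
    ({"age": 40, "car_type": "Truck"}, "High"),
]

acc = accuracy(rule_classifier, test_set)
print(f"accuracy rate = {acc:.0%}")
```

The last example is misclassified by the rule, so the accuracy rate on this toy test set is three out of four.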
Classification Techniques • Decision Tree Classification • Bayesian Classifiers • Hidden Markov Models (HMM) • Neural Networks • Support Vector Machines (SVM) • k-nearest neighbor classifiers (KNN) • Genetic Algorithms • Rough Set Approach
Web Page Classification • automatically assign a document to a predefined category (topic) • Topic-Specific Retrieval, Filter, Recommendation, …
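External Annotations • S: source hierarchy, T: target hierarchy (example sites: www.openfind.com.tw, www.yam.com) • Figure: documents filed under topics (classes) S1–S6 in the source hierarchy and T1–T9 in the target hierarchy • Users browse to find information of interest; filtering and categorization follow the user's interests • Information from other related categories is used to help categorization • use external annotations to enhance classification of documents categorized in one topic hierarchy (source) into another one (target).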
Examples • web directories • Google, Yahoo, ProFusion, … • domain-specific channels • music, sports, … • product catalogs • expert annotations
Learning Approaches • internal learning • produces traditional classifiers from internal information • large amount of internal information • external learning • produces external enhancers or reducers from external information • heterogeneous, sparse, dynamic
External Learning • Probabilistic Enhancement • use a probabilistic enhancer to improve probabilistic classifiers • Naïve Bayes, Hidden Markov Models, … • Topic Restriction • cascade a reducer to reduce the set of candidates • KNN, SVM, Neural Nets, …
Externally Enhanced Classifiers • Figure: an annotated instance is passed to an externally enhanced classifier, which outputs the predicted class • Reducers (Topic Restriction): KNN, SVM • Enhancers (Probabilistic Enhancement): NB, HMM
Summary • Traditional Classifiers (Yam.BusinessAndEconomics to Openfind.BusinessAndEconomics) • Naïve Bayes: 55% • SVM: 57% • Enhanced Classifiers: • Enhanced Naïve Bayes: 66% • Topic Restricted SVM: 67%
Proposed Approaches • Probabilistic Enhancement that uses class information to enhance probabilistic classifiers such as Naïve Bayes and HMM • Topic Restriction that uses class information to restrict the set of candidate classes, and can be used to extend any classifier such as SVM and kNN
Probabilistic Methods • Probabilistic Classifier • When external information is available, • Probabilistic Enhancement
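One common way to write the two cases, in the style of the NB-based method of Agrawal and Srikant (2001) cited later in the talk, is sketched below. This is a hedged illustration and not necessarily the authors' exact formulation: the symbols $w_1, \dots, w_n$ (the words of document $d$), $v_t$ (a target class) and $s$ (the known source class), and the assumption that $d$ is conditionally independent of $s$ given $v_t$, are introduced here.

```latex
% Probabilistic classifier (plain Naive Bayes):
\hat{v} = \arg\max_{v_t} \; P(v_t) \prod_{i=1}^{n} P(w_i \mid v_t)

% Probabilistic enhancement: the source class s is known, so the class
% prior P(v_t) is replaced by the source-conditioned prior P(v_t \mid s):
\hat{v} = \arg\max_{v_t} \; P(v_t \mid s) \prod_{i=1}^{n} P(w_i \mid v_t)
```

Under this reading, the external information only changes the prior term, so the word likelihoods learned from internal information can be reused unchanged.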
Estimation of P(vt|s) • straightforward estimation • more robust estimation • when • when
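As an illustration of the two kinds of estimate named above, the sketch below computes P(vt|s) from documents that carry both a source and a target label. The straightforward version is a relative frequency. The smoothed version uses add-alpha (Laplace) smoothing, which is a swapped-in standard technique and an assumption here, not necessarily the robust estimator used in the talk. The function names and the toy data are hypothetical.

```python
from collections import Counter


def estimate_straightforward(pairs, source, targets):
    """Relative frequency of each target class among documents in `source`."""
    counts = Counter(t for s, t in pairs if s == source)
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in targets}


def estimate_smoothed(pairs, source, targets, alpha=1.0):
    """Add-alpha smoothed estimate (assumption: not the talk's own formula)."""
    counts = Counter(t for s, t in pairs if s == source)
    total = sum(counts.values())
    denom = total + alpha * len(targets)
    return {t: (counts[t] + alpha) / denom for t in targets}


# Hypothetical documents labeled with (source class, target class).
pairs = [("S1", "T1"), ("S1", "T1"), ("S1", "T2"), ("S2", "T3")]
targets = ["T1", "T2", "T3"]

raw = estimate_straightforward(pairs, "S1", targets)
smooth = estimate_smoothed(pairs, "S1", targets)
print(raw)
print(smooth)
```

The straightforward estimate gives T3 a probability of zero for source class S1, which would veto T3 outright when multiplied into a Naïve Bayes score; smoothing keeps every target class possible when the overlap between hierarchies is sparse.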
NB-Based Methods (Agrawal and Srikant, 2001)
Data Sets • Data set I • source hierarchy: Yam • target hierarchy: Openfind • Data set II • source hierarchy: Yam.BusinessAndEconomics • target hierarchy: Openfind.BusinessAndEconomics • Data set III • source hierarchy: Google.Business • target hierarchy: Yahoo.BusinessAndEconomics
Topic Restriction (TR) • TR uses class information to reduce the set of candidate classes, and can be used with any traditional classifier such as SVM and kNN • Static Topic Restriction • Most source classes are related to a small number of target classes • Consider only those target classes that intersect the source class • Dynamic Topic Restriction • Simple classifiers achieve a very high top-k measure for small k • Consider only the top k classes ranked by a simple classifier
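Both variants of topic restriction can be sketched as a filter placed in front of the final classifier. This is a minimal sketch under stated assumptions: the classifiers are represented only by hypothetical per-class score dictionaries, and the function names, the `cooccurrence` map, and all numbers are made up for illustration.

```python
def static_restriction(source_class, cooccurrence):
    """Target classes that share at least one known document with source_class."""
    return set(cooccurrence.get(source_class, ()))


def dynamic_restriction(simple_scores, k):
    """Top-k target classes as ranked by a simple classifier (e.g. NB)."""
    ranked = sorted(simple_scores, key=simple_scores.get, reverse=True)
    return set(ranked[:k])


def restricted_predict(strong_scores, candidates):
    """Best class of the stronger classifier (e.g. SVM) among the candidates only."""
    return max(candidates, key=lambda c: strong_scores[c])


# Hypothetical data: which target classes intersect each source class.
cooccurrence = {"S1": {"T1", "T2"}, "S2": {"T3"}}
# Hypothetical scores for one document from a simple and a stronger classifier.
simple_scores = {"T1": 0.50, "T2": 0.30, "T3": 0.15, "T4": 0.05}
strong_scores = {"T1": 0.20, "T2": 0.60, "T3": 0.10, "T4": 0.90}

static_candidates = static_restriction("S1", cooccurrence)
dynamic_candidates = dynamic_restriction(simple_scores, k=2)

unrestricted = max(strong_scores, key=strong_scores.get)
restricted = restricted_predict(strong_scores, dynamic_candidates)
print(unrestricted, restricted)
```

In this toy case the stronger classifier alone would pick T4, while restricting it to the top two candidates of the simple classifier yields T2. The stronger classifier also only needs to be evaluated on the candidate classes, which is where a running-time saving could come from when the number of classes is large.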
Dynamic Topic Restriction Data Set II
Conclusion • We propose probabilistic enhancement to enhance Naïve Bayes. • We propose a topic restriction method to extend SVM. • We carry out extensive experiments on text collections from Google and Yahoo, and from Openfind and Yam. • Experiments show that our approaches significantly improve on traditional approaches.
Further Remarks • Topic restriction is a general idea for cascading simpler classifiers, such as NB and linear classifiers, with more complicated classifiers, such as SVM and kNN. • Cascading improves both the running time and the classification accuracy of SVM and kNN, especially when the number of topic classes is large. • Further study on topic restriction is ongoing.
Cascaded SVM • Web Directory Data (Openfind)
Cascaded SVM • CNA news collection