An Efficient Centroid Based Chinese Web Page Classifier

This paper presents an efficient Chinese web page classifier based on centroid methodology. It discusses the background, basic technique, classifier design, and implementation, along with various features and experiments. The classifier demonstrates satisfactory performance, high accuracy, and very fast speed.

Presentation Transcript


  1. An Efficient Centroid Based Chinese Web Page Classifier • LIU Hui • EE Dept of Tsinghua Univ., China • Aug 28, 2003

  2. Outline • Background • Basic Technique • Classifier Design & Implementation (Idea, Architecture, Feature, Experiment) • Summary

  3. Background of Web Page Classification • Explosive growth of information needs organization: Digital Library, Search Engine, Special (Categorized) Sites • Hot research topics: Data Mining, Information Retrieval, Pattern Recognition, Automatic Text Categorization

  4. Background of Our Classifier • Net-compass Search Engine: an emerging large, distributed search engine; the classifier is embedded in its new version • Chinese web page categorization competition: held on March 14th-15th, 2003; ranked first • Workgroup: EE Dept of Tsinghua Univ., 3 master's students & 1 undergraduate student

  5. Basic Text Categorization System

  6. Feature Selection • Term Frequency (TF) • Term Frequency & Inverse Document Frequency (TF.IDF) • Mutual Information (MI) • Statistics
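As an illustration of the TF.IDF weighting listed on this slide, here is a minimal sketch. It assumes raw term counts and idf = log(N / df), which is only one common variant; the slides do not say which formula the classifier uses. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (already word-segmented).
    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus of three segmented documents (made up for this example).
docs = [["足球", "比赛", "足球"], ["股票", "比赛"], ["股票", "市场"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight 0 under this variant, which is the usual motivation for the IDF factor.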

  7. Training - Statistical Machine Learning • Vector Distance: Centroid Based Method; k-Nearest Neighbor (lazy learning); Support Vector Machine (Structural Risk Minimization) • Feedback & Combining Classifiers: Neural Network; Boosting method • Probability: Naïve Bayes, Pr(Term | Class) -> Pr(Text | Class)
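The centroid based method named on this slide can be sketched as follows. This is a generic textbook version, not the authors' implementation: it assumes documents are sparse {term: weight} vectors, that each class centroid is the normalized sum of its unit-length training vectors, and that cosine similarity picks the class. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector to unit length (returned unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)

def train_centroids(samples):
    """samples: iterable of (label, {term: weight}). Returns {label: centroid}."""
    sums = defaultdict(lambda: defaultdict(float))
    for label, vec in samples:
        for term, w in normalize(vec).items():
            sums[label][term] += w
    # The normalized sum has the same direction as the mean vector.
    return {label: normalize(total) for label, total in sums.items()}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def classify(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    vec = normalize(vec)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Toy training set (made up for this example).
samples = [
    ("sports", {"football": 2, "match": 1}),
    ("sports", {"match": 1, "team": 1}),
    ("finance", {"stock": 2, "market": 1}),
    ("finance", {"market": 1, "bank": 1}),
]
centroids = train_centroids(samples)
label = classify(centroids, {"football": 1, "team": 1})
```

Training is one pass over the samples and classification is one similarity per class, which is consistent with the speed emphasis of the deck.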

  8. Idea • Large database (Net-compass Search Engine): fast speed, tolerable precision • Fast-changing Web resources: easily built classifier, fast training • Multi-language support: word segmentation • Easy training set building & updating

  9. Architecture

  10. Features: Preprocessing • Chinese Word Segmentation • Dictionary built on search engine log: adaptability, manageability, accuracy • Maximum Matching Segmenting Method: fast, tolerable accuracy • Noise Filtering • Stop words: common words, abandoned words • Advertising links: filtered by length & content
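A minimal sketch of maximum matching segmentation, assuming the forward variant (the slide does not say whether matching runs forward or backward) and a plain Python set as the dictionary. The function name `max_match` and the sample dictionary are hypothetical.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching word segmentation.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy dictionary (made up for this example).
dictionary = {"清华", "清华大学", "大学", "搜索", "搜索引擎", "引擎"}
tokens = max_match("清华大学搜索引擎", dictionary)
```

The greedy scan does a bounded number of set lookups per character, which fits the "fast, tolerable accuracy" trade-off noted on the slide.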

  11. Features • Combined Feature Selection • Statistics: tends to choose high-frequency words • Mutual Information: tends to choose low-frequency words • Subspace
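The opposite tendencies of the two criteria can be seen in a small numeric sketch. It assumes that "Statistics" refers to the χ² statistic (an assumption; the slide text does not name it) and uses the standard document-count estimates of both scores. The counts are made up, and how the classifier actually combines the two criteria is not stated in the slides.

```python
import math

def mutual_information(a, b, c, n):
    """MI(t, c) = log(P(t, c) / (P(t) * P(c))) from document counts.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    n: total number of docs
    """
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))

def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 term/class table (d: neither)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Hypothetical corpus: 1000 docs, 100 of them in the class.
# Rare term: appears in 2 docs, both in the class.
mi_rare = mutual_information(2, 0, 98, 1000)
chi_rare = chi_square(2, 0, 98, 900)
# Frequent term: appears in 200 docs, 80 of them in the class.
mi_freq = mutual_information(80, 120, 20, 1000)
chi_freq = chi_square(80, 120, 20, 780)
```

With these counts MI ranks the rare term higher while χ² ranks the frequent term higher, which is the complementary behavior a combined selection can exploit.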

  12. Features • Adaptive Factors: adjust the model to compensate for deficiencies of the training set • Class Weight • VIP word factor • Implementation • Berkeley DB • Structured dictionary • Avoid I/O • 3000 medium-sized Chinese Web pages: 50 seconds

  13. Experiment • Corpus • Chinese Web Page training set • Provided by Peking University • 11 classes, 14000 samples, highly unbalanced distribution • Evaluation • Precision, Recall, F-measure
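The evaluation measures on this slide can be computed per class as in the sketch below. It assumes the balanced F-measure (F1), since the slides do not state the weighting; the function name and the toy labels are hypothetical.

```python
def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy labels (made up for this example).
gold = ["a", "a", "a", "b", "b"]
predicted = ["a", "a", "b", "b", "a"]
scores = per_class_scores(gold, predicted)
```

Reporting the scores per class matters for a corpus with an unbalanced class distribution, because an overall average can hide weak results on the small classes.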

  14. Experiment Result

  15. Experiment Discussion • More samples, more accurate • Some classes are more difficult • Corpus coverage not large enough • Open testing: 85% • Figure: relation between precision and number of training samples

  16. Summary: an efficient Chinese Web page classifier • Clear design: centroid based, general steps • Novel features: preprocessing tricks, combined feature selection, subspace & adaptive factors • Satisfactory performance: comparatively high accuracy, very fast speed, high adaptability

  17. Thank you all! Questions are welcome.
