
CIKM 2008, Napa Valley, California, October 26-30, 2008

Semi-supervised Text Categorization by Active Search. Zenglin Xu (1), Rong Jin (2), Kaizhu Huang (1), Michael R. Lyu (1), and Irwin King (1). (1) Department of Computer Science and Engineering, The Chinese University of Hong Kong. (2) Department of Computer Science and Engineering, Michigan State University.


Presentation Transcript


  1. Semi-supervised Text Categorization by Active Search

     Zenglin Xu (1), Rong Jin (2), Kaizhu Huang (1), Michael R. Lyu (1), and Irwin King (1)
     (1) Department of Computer Science and Engineering, The Chinese University of Hong Kong, {zlxu, kzhuang, lyu, king}@cse.cuhk.edu.hk
     (2) Department of Computer Science and Engineering, Michigan State University, rongjin@cse.msu.edu

     1 Motivations
     • Given only a small number of labeled documents, it is very challenging to build a reliable classifier.
     • Unlabeled data are helpful in automated text categorization.
     How to obtain unlabeled documents?
     • We can collect unlabeled documents through search engines.
     • Semi-supervised learning can take advantage of both the labeled and the unlabeled documents.

     2 Contributions
     • A general framework for semi-supervised text categorization that collects unlabeled documents via Web search engines.
     • A novel discriminative query generation method.
     • The categorization framework can significantly improve classification accuracy.

     3 Framework & Model
     The framework has three steps:
     • Query generation, which generates the textual queries for document retrieval.
     • Document retrieval, which retrieves Web documents through the Web search engine.
     • Semi-supervised text categorization, which uses both the labeled documents and the retrieved unlabeled Web documents.
     1. Query generation: generate a query for every labeled document. Notation: (x, y): a document and its label; Vi: vocabulary of the i-th document; w: word weights; ξ: margin error.
     2. Text categorization models. Notation: D: labeled documents; U: retrieved unlabeled documents.
     • Auxiliary SVM (y* is the input).
     • Semi-supervised SVM (y* is an optimization variable).

     4 Experiment results
     • Data repositories: 20-newsgroup, Reuters-21578, Ohsumed.
     • Training data: 5 labeled documents in each category.
     • Each document generates one query.
     • Each query returns 100 unlabeled documents.
     • Search engine: Google.
     • Auxi-SVM: Auxiliary SVM (optimization: QP).
     • Semi-SVM: Semi-supervised SVM (optimization: CCCP).
     • Accuracy improvement over SVM: Auxi-SVM 26%, Semi-SVM 34%.
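The query generation and document retrieval steps can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the poster's margin-based formulation of the word weights (the one involving w and ξ) is not recoverable from this transcript, so the sketch substitutes a simple frequency-difference score. The names `word_weights`, `generate_query`, `collect_unlabeled` and the `search` callable are hypothetical, and attaching the source document's label to each retrieved hit as a tentative y* is an assumption.

```python
from collections import Counter


def word_weights(labeled_docs, target_label):
    """Stand-in scoring (assumption): relative frequency of a word in the
    target category minus its relative frequency in all other categories."""
    pos, neg = Counter(), Counter()
    for text, label in labeled_docs:
        (pos if label == target_label else neg).update(text.lower().split())
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    return {w: pos[w] / n_pos - neg[w] / n_neg for w in set(pos) | set(neg)}


def generate_query(text, weights, k=3):
    """Pick the k highest-weighted words from this document's own vocabulary.
    Ties are broken alphabetically so the result is deterministic."""
    vocab = set(text.lower().split())
    ranked = sorted(vocab, key=lambda w: (-weights.get(w, 0.0), w))
    return " ".join(ranked[:k])


def collect_unlabeled(labeled_docs, search, k=3, per_query=100):
    """Issue one query per labeled document and keep up to per_query hits.
    `search` is a hypothetical callable: query string -> list of documents.
    Each hit is paired with the source document's label as a tentative y*."""
    unlabeled = []
    for text, label in labeled_docs:
        weights = word_weights(labeled_docs, label)
        query = generate_query(text, weights, k)
        for hit in search(query)[:per_query]:
            unlabeled.append((hit, label))
    return unlabeled


# Toy usage with a fake search function standing in for a Web search engine.
labeled = [
    ("hockey game team score", "sport"),
    ("team wins hockey cup", "sport"),
    ("cpu memory disk driver", "comp"),
    ("driver bug memory leak", "comp"),
]
sport_weights = word_weights(labeled, "sport")
print(generate_query("hockey game team score", sport_weights, k=2))  # → hockey team
```

With the experimental setting reported on the poster (5 labeled documents per category, one query per document, 100 results per query), a loop like this would gather at most 5 × 100 = 500 unlabeled documents per category.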
