Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization

Advisor: Hsin-Hsi Chen • Reporter: Chi-Hsin Yu • Date: 2009.09.24 • From NIPS 2008

Outline • Introduction • Related Work • Review of SVM • SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data) • Experiments • Conclusion

Introduction • Semi-supervised learning (SSL) • takes advantage of a large amount of unlabeled data to improve classification accuracy • Cluster assumption • places the decision boundary in low-density areas, without crossing high-density regions • is only meaningful when the labeled and unlabeled data are closely related • If they are only weakly related, the labeled and unlabeled data can be well separated from each other, which makes the cluster assumption hard to exploit

Introduction (cont.) • This paper aims to • identify a new data representation (in feature space) • by constructing a new kernel function • Advantages of the new representation • informative about the target class (category) • consistent with the feature coherence patterns exhibited in the weakly related unlabeled data

Related Work • The two types of semi-supervised learning (SSL) • Transductive SSL • predicts labels only for the available unlabeled data • Inductive SSL • also learns a classifier that can be used to predict labels for new data • SSLW

SVM
• Notation
• $\mathcal{L} = \{(x_1, y_1), \ldots, (x_l, y_l)\}$: labeled documents
• $\mathcal{U} = \{(x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)\}$: unlabeled documents
• Document-word matrix $D = (d_1, d_2, \ldots, d_n)$, $d_i \in \mathbb{N}^V$
  • $V$: the size of the vocabulary
  • $d_i$: word-frequency vector for document $i$
• Word-document matrix $G = (g_1, g_2, \ldots, g_V)$, $g_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})$
• $K = D^\top D$, $K \in \mathbb{R}^{n \times n}$: document pairwise similarity
• $\alpha \circ y = (\alpha_1 y_1, \alpha_2 y_2, \ldots, \alpha_n y_n)$: element-wise product
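The notation on this slide can be made concrete with a small NumPy sketch. The toy counts, the labels `y`, and the multipliers `alpha` below are made up for illustration, and the dual objective is the standard soft-margin SVM dual rather than a formula shown on the slide.

```python
import numpy as np

# Hypothetical toy data: V = 4 words (rows), n = 3 documents (columns).
# Column i of D is d_i, the word-frequency vector of document i.
D = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 3, 1],
])
V, n = D.shape

# Word-document matrix G: column i is g_i, the counts of word i
# across the n documents, so G is simply the transpose of D.
G = D.T

# Linear kernel K = D^T D: entry (i, j) is the inner product of d_i and d_j.
K = D.T @ D

# Illustrative labels and dual variables (not taken from the slides).
y = np.array([1, -1, 1])
alpha = np.array([0.5, 0.25, 0.0])

# Element-wise product (alpha_1 y_1, ..., alpha_n y_n).
alpha_y = alpha * y


def svm_dual_objective(alpha, y, K):
    """Standard SVM dual: sum(alpha) - 1/2 (alpha o y)^T K (alpha o y)."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay
```

Because K depends on the documents only through inner products, swapping in a different kernel later requires no change to `svm_dual_objective`.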

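SSLW
• $K = D^\top D \;\rightarrow\; K = D^\top R D$
• $R \in \mathbb{R}^{V \times V}$: word-correlation matrix
• Two ways to construct the matrix $R$
• $G = UW$, $W = (w_1, w_2, \ldots, w_V)$
  • $w_i$: internal representation of the $i$-th word
• $R = W^\top W$, $T = UU^\top$
• the top $p$ right eigenvectors of $G$
• $\alpha_i \ge 0$, $\xi \ge 0$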
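To illustrate how a word-correlation matrix changes the kernel, here is a minimal NumPy sketch. It assumes a rank-p truncated SVD as a stand-in for the factorization G = UW; the slide does not show how the paper actually obtains U and W or learns R, so the function name `word_correlation_kernel` and the SVD choice are assumptions for illustration only.

```python
import numpy as np


def word_correlation_kernel(D, p):
    """Sketch: build R = W^T W from a rank-p factorization G ~ U W,
    then form the kernel K = D^T R D.

    D is the V x n document-word matrix (documents as columns).
    The truncated SVD used here is an illustrative assumption,
    not the procedure from the paper.
    """
    G = D.T.astype(float)                      # n x V word-document matrix
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    U_p = U[:, :p]                             # n x p
    W = s[:p, None] * Vt[:p, :]                # p x V, column i represents word i
    R = W.T @ W                                # V x V word-correlation matrix
    K = D.T @ R @ D                            # n x n document kernel
    return U_p, W, R, K


# Hypothetical toy data: V = 4 words, n = 3 documents.
D = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 3, 1],
])
U_p, W, R, K = word_correlation_kernel(D, p=2)
```

Setting R to the identity recovers the plain kernel K = D^T D, so R can be read as a reweighting of word-to-word interactions inside the document similarity.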

SSLW (cont.)

SSLW (cont.) • An efficient algorithm for SSLW

Experiments • Corpora • Reuters-21578 (9400 docs) • WebKB (4518 docs) • TREC AP88: an external information source for both datasets (1000 randomly selected documents)

Evaluation Methodology • 4 positive + 4 negative samples from each training set • AUR (area under the ROC curve) • AUR averaged over ten runs of each experiment
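The evaluation protocol above can be sketched in plain Python. The helper names (`area_under_roc`, `sample_training_set`, `mean_aur`) and the `run_once` callback are hypothetical; the AUR is computed with the usual pairwise-ranking formulation, which is an assumption about how the score was obtained rather than a detail given on the slide.

```python
import random


def area_under_roc(scores, labels):
    """AUR as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels equal to 1 are positive;
    anything else is treated as negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sample_training_set(labels, rng, per_class=4):
    """Pick `per_class` positive and `per_class` negative example indices."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y != 1]
    return rng.sample(pos_idx, per_class) + rng.sample(neg_idx, per_class)


def mean_aur(run_once, n_runs=10, seed=0):
    """Average the AUR returned by `run_once(rng)` over `n_runs` repetitions."""
    rng = random.Random(seed)
    return sum(run_once(rng) for _ in range(n_runs)) / n_runs
```

In use, `run_once` would draw a training set with `sample_training_set`, train a classifier on it, score the test documents, and return `area_under_roc` for those scores.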

Conclusion • SSLW • significantly improves both the accuracy and the reliability of text categorization • given a small training pool and additional unlabeled data that are weakly related to the test bed

Thanks!!
