Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling

Shoushan Li†‡, Rong Wang†, Huanhuan Liu†, Chu-Ren Huang‡ († Soochow University; ‡ Hong Kong Polytechnic University)

Outline • Introduction • Inadequacies of the Existing Work • Our Methods • Experimental Results • Conclusion

Introduction • Sentiment classification is the task of predicting the sentiment orientation (e.g., positive or negative) of a given text. • However, labeled resources are highly imbalanced across languages. • For example, because most sentiment classification research has focused on English, labeled data in English is available at large scale, while labeled data in some other languages is much more limited.

Introduction (Cont.) • Cross-lingual sentiment classification aims to predict the sentiment orientation of a text in one language (the target language) with the help of resources from another language (the source language).

Inadequacies of the Existing Work • Classification performance when using only the labeled data in the source language remains far from satisfactory because of large differences in linguistic expression and social culture between languages. • One challenge in active learning-based cross-lingual sentiment classification is the severe imbalance between the labeled data from the source and target languages. • With such an imbalance, the small amount of labeled target data is easily flooded by the abundant labeled source data, which largely reduces the contribution of the labeled data in the target language.

Our Methods • We propose a certainty-based quality measurement (the intra-quality measurement), combined with cross-validation, to select high-quality samples in the source language. • We propose a similarity measurement (the extra-quality measurement) to select the samples in the source language that are similar to those in the target language. • For given data in the target language, these two measurements are integrated to select high-quality samples in the source language. • After obtaining the high-quality samples in the source language, we employ standard uncertainty sampling for active learning-based cross-lingual sentiment classification.

Intra-quality Measurement • It uses only the data in the source language to measure the quality of the samples in the source language. • We first split the labeled data from the source language into two parts: one serves as the training data and the other as the validation data. • Then, we use the training data to train a classifier, which is used to predict the labels of the samples in the validation data. • After prediction, we assume that the samples with high posterior probabilities are capable of representing the classification knowledge in the training data.
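The split, train, predict, and score steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not name the classifier, so a small multinomial Naive Bayes is assumed here, and the certainty score is assumed to be the largest posterior probability the classifier assigns to a held-out sample. The names `train_nb`, `posterior`, and `intra_quality` are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model (assumed classifier)."""
    vocab = {w for d in docs for w in d}
    counts = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    priors = Counter(labels)
    return vocab, counts, priors

def posterior(model, doc):
    """Return {label: P(label | doc)} with add-one smoothing."""
    vocab, counts, priors = model
    n = sum(priors.values())
    log_p = {}
    for c in priors:
        total = sum(counts[c].values()) + len(vocab)
        lp = math.log(priors[c] / n)
        for w in doc:
            if w in vocab:  # ignore words unseen in training
                lp += math.log((counts[c][w] + 1) / total)
        log_p[c] = lp
    # Normalize in a numerically stable way.
    m = max(log_p.values())
    exp = {c: math.exp(v - m) for c, v in log_p.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def intra_quality(docs, labels, folds=2):
    """Score each source sample by the certainty of a model that never saw it."""
    scores = [0.0] * len(docs)
    for k in range(folds):
        val = [i for i in range(len(docs)) if i % folds == k]
        trn = [i for i in range(len(docs)) if i % folds != k]
        model = train_nb([docs[i] for i in trn], [labels[i] for i in trn])
        for i in val:
            # Assumption: certainty = highest posterior over the labels.
            scores[i] = max(posterior(model, docs[i]).values())
    return scores
```

Rotating the training and validation parts, as cross-validation does, gives every source sample a score from a classifier trained without it.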

Intra-quality Measurement

Extra-quality Measurement
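The similarity formula from this slide is not preserved in the transcript. As a rough sketch only, the idea of scoring source samples by how similar they are to the target-language data could look like the following, assuming bag-of-words cosine similarity against the pooled target documents. Both the choice of cosine similarity and the pooling are assumptions, and `cosine` and `extra_quality` are hypothetical names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def extra_quality(source_docs, target_docs):
    """Score each (translated) source sample by similarity to the target data."""
    # Assumption: the target data is pooled into a single bag of words.
    target_bag = Counter(w for d in target_docs for w in d)
    return [cosine(Counter(d), target_bag) for d in source_docs]
```

A source sample that shares no vocabulary with the target data scores 0.0 under this sketch, while one that overlaps heavily scores close to 1.0.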

Integrating Intra- and Extra-Quality Measurements • When integrating the two, we treat the certainty measurement as the main ranking factor and the similarity measurement as a supplementary one. • Input: translated training data from the source language; testing data from the target language • Output: the selected data set
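One possible reading of "certainty as the main ranking factor, similarity as a supplementary one" is a two-key ranking in which similarity only breaks ties between samples of near-equal certainty. The sketch below is an assumption about how the integration could work, not the algorithm from the slides; `select_source`, `k`, and `digits` are hypothetical names, and the two score lists are taken as already computed.

```python
def select_source(certainty, similarity, k, digits=1):
    """Return the indices of the k selected source samples.

    Certainty is rounded to `digits` decimals (an assumed tolerance) so that
    similarity can break ties among samples with near-equal certainty.
    """
    order = sorted(
        range(len(certainty)),
        key=lambda i: (round(certainty[i], digits), similarity[i]),
        reverse=True,
    )
    return order[:k]

# Usage: indices of the 2 best source samples under this ranking.
picked = select_source([0.91, 0.93, 0.62, 0.88], [0.2, 0.7, 0.9, 0.5], k=2)
```

In the usage line, three samples round to the same certainty (0.9), so their similarity scores decide the order; the sample with certainty 0.62 ranks last despite having the highest similarity.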

Integrating Intra- and Extra-Quality Measurements

Active Learning-based Cross-lingual Sentiment Classification
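The content of this slide is not preserved in the transcript. For orientation, standard uncertainty sampling for a binary classifier can be sketched as below: in each round, the unlabeled target samples whose predicted probability of being positive is closest to 0.5 are sent for manual labeling. The callables `train`, `predict_pos`, and `oracle` are hypothetical placeholders for the classifier and the human annotator, and the loop structure is an assumption rather than the authors' exact procedure.

```python
def most_uncertain(pos_probs, batch_size=1):
    """Indices of the samples whose P(positive) is closest to 0.5."""
    order = sorted(range(len(pos_probs)), key=lambda i: abs(pos_probs[i] - 0.5))
    return order[:batch_size]

def active_learning(labeled, unlabeled, train, predict_pos, oracle,
                    rounds, batch_size=1):
    """Grow the labeled set by querying the most uncertain samples.

    labeled: list of (sample, label) pairs, e.g. the selected source data.
    unlabeled: list of target-language samples awaiting annotation.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        probs = [predict_pos(model, x) for x in pool]
        picked = most_uncertain(probs, batch_size)
        # Pop from the highest index down so earlier indices stay valid.
        for i in sorted(picked, reverse=True):
            x = pool.pop(i)
            labeled.append((x, oracle(x)))
    return labeled
```

Under this sketch, the "Selected_source" setting would correspond to seeding `labeled` with only the high-quality source samples instead of the full source data.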

Experimental Settings • Labeled Data in the Source Language: English reviews from four domains: Book (B), DVD (D), Electronics (E) and Kitchen (K). Each domain contains 1000 positive and 1000 negative reviews. All of these labeled samples are translated into Chinese with Google Translate. • Testing Data in the Target Language: Chinese reviews from IT168 and Chinese reviews from 360BUY, together with 2000 unlabeled reviews. • Unlabeled Data in the Target Language: We select 500 positive and 500 negative reviews as the unlabeled samples for active learning.

Experimental Results • Table 1: Classification performance using all 8000 samples in the source domain • Four Approaches: Random + No_source; Uncertainty + No_source; Uncertainty + All_source; Uncertainty + Selected_source

Experimental Results (Cont.)

Conclusion • We propose an active learning approach for cross-lingual sentiment classification and address the challenge of severe data imbalance by controlling data quality in the source language. Experiments verify that active learning is appropriate for cross-lingual sentiment classification. • In future work, we would like to improve the extra-quality measurement to make it more effective at selecting high-quality samples. We will also apply data quality control to other cross-lingual NLP tasks.

Thank You！
