Sentiment analysis of microblogging data using sentiment lexicons as noisy labels
By OLUGBEMI Eric, ODUMUYIWA Victor, OKUNOYE Olusoji
OUTLINE
OBJECTIVES OF THE STUDY
RESEARCH QUESTIONS
SIGNIFICANCE OF THE STUDY
MACHINE LEARNING
SUPERVISED MACHINE LEARNING
WEB MINING AND TEXT CLASSIFICATION
WHY TWITTER?
THE BAG OF WORDS APPROACH
THE NAÏVE BAYES ALGORITHM
SUPPORT VECTOR MACHINES
THE PROCESS OF TEXT CLASSIFICATION
MY DATA SET
RESULTS
PRACTICAL DEMONSTRATION
CONCLUSION
OBJECTIVES OF THE STUDY
Improve the accuracy of the SVM and Naïve Bayes classifiers by using sentiment lexicons rather than emoticons as noisy labels when creating a training corpus.
Compare the accuracy of the Naïve Bayes classifier with that of an SVM when sentiment lexicons are used as noisy labels and when emoticons are used as noisy labels.
RESEARCH QUESTIONS
Is it better to use sentiment lexicons or emoticons as noisy labels when creating a training corpus for sentiment analysis?
What is the accuracy of an SVM on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
What is the accuracy of the Naïve Bayes classifier on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
What is the effect of word n-grams on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?
What is the effect of term frequency-inverse document frequency (tf-idf) on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?
SIGNIFICANCE OF THE STUDY
Mining the opinions of customers, electorates, etc.
Product reviews
MACHINE LEARNING
A machine can learn if you teach it. Approaches to teaching a machine: supervised learning, semi-supervised learning, and unsupervised learning.
SUPERVISED MACHINE LEARNING
Training phase: labelled tweets pass through a feature extractor, and the resulting features are used to train the classifier.
Prediction phase: unlabelled tweets pass through the same feature extractor, and the trained classifier assigns a label to each one.
WEB MINING AND TEXT CLASSIFICATION
Web mining: mining web content for information.
Sentiment analysis of web content involves extracting sentiment from web content. The sentiment in this case can be positive, negative, or neutral.
WHY TWITTER?
Twitter data are messy.
A large data set can be collected from Twitter.
Tweets have a bounded length (at most 140 characters).
Twitter users are heterogeneous.
THE BAG OF WORDS APPROACH
The sentiment of a text is assumed to depend only on the types of words in the text, so each word is assessed independently of the other words in the same text.
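A minimal sketch of this representation using scikit-learn (the library cited in the references); the two example tweets below are hypothetical placeholders, not taken from the study's data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets: each text becomes a vector of word counts,
# so word order is discarded and only word occurrence matters.
tweets = ["nigeria is a good country", "bad leadership in the country"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per tweet
```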
THE NAÏVE BAYES ALGORITHM
The Naïve Bayes classifier is a very simple classifier that relies on the bag-of-words representation of a document.
Assumptions:
1. The position of a word in a document does not matter.
2. The conditional probabilities P(xi|Cj) are independent.
FIRST, WE CALCULATE THE PRIOR PROBABILITIES
For the test data:
V = {nigeria, good, country, people, friendly, youth, productive, word, describe, bad, leadership, cope, erratic, power, supply}, |V| = 15
Count(p) = n(nigeria, good, country, people, nigeria, friendly, youth, nigeria, productive) = 9
Count(n) = n(word, describe, country, bad, leadership, nigeria, cope, erratic, power, supply) = 10
NEXT, WE CALCULATE THE LIKELIHOODS
P(nigeria|p) = (3+1)/(9+15) = 4/24 = 1/6
P(nigeria|n) = (1+1)/(10+15) = 2/25
P(country|p) = (1+1)/(9+15) = 2/24 = 1/12
P(country|n) = (1+1)/(10+15) = 2/25
P(viable|p) = (0+1)/(9+15) = 1/24
P(viable|n) = (1+1)/(10+15) = 2/25
P(youth|p) = (1+1)/(9+15) = 2/24 = 1/12
P(youth|n) = (0+1)/(10+15) = 1/25
To determine the class of text6:
P(p|text6) ∝ 3/5 × 1/6 × 1/12 × 1/24 × 1/12 ≈ 0.00003
P(n|text6) ∝ 2/5 × 2/25 × 2/25 × 2/25 × 1/25 ≈ 0.00001
Since 0.00003 > 0.00001, text6 is classified as a positive text.
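As a cross-check, a small Python sketch that reproduces the add-one-smoothed calculation above; the word counts, vocabulary size, and priors (3/5 and 2/5) are copied directly from the slides rather than derived from a real corpus.

```python
from math import prod

V = 15                             # vocabulary size |V|, as on the slide
total_words = {"p": 9, "n": 10}    # word tokens per class
prior = {"p": 3 / 5, "n": 2 / 5}   # class priors used on the slide
# occurrences of each test word in the training texts of each class
counts = {
    "p": {"nigeria": 3, "country": 1, "viable": 0, "youth": 1},
    "n": {"nigeria": 1, "country": 1, "viable": 1, "youth": 0},
}

def likelihood(word, c):
    # add-one (Laplace) smoothing: (count + 1) / (class word total + |V|)
    return (counts[c][word] + 1) / (total_words[c] + V)

test_words = ["nigeria", "country", "viable", "youth"]  # the words of text6
for c in ("p", "n"):
    score = prior[c] * prod(likelihood(w, c) for w in test_words)
    print(c, round(score, 5))
# prints p 3e-05 and n 1e-05, so text6 is classified as positive
```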
SUPPORT VECTOR MACHINES
An SVM searches for the linear or nonlinear optimal separating hyperplane (i.e., a "decision boundary") that separates the data samples of one class from those of another.
Minimize (in W, b): ||W||
subject to yi(W·Xi − b) ≥ 1 for every i = 1, …, n
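A hedged sketch of how such a linear SVM could be trained on noisily labelled tweets with scikit-learn; the ngram_range and tf-idf weighting in the vectorizer correspond to the n-gram and tf-idf settings examined in the research questions. The tweets and labels below are hypothetical placeholders, not the study's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical noisily labelled training tweets (placeholders only).
train_tweets = [
    "nigeria is a good country with friendly people",
    "productive youth everywhere :)",
    "bad leadership and erratic power supply",
    "word to describe the country: bad",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Unigram + bigram features with tf-idf weighting, feeding a linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_tweets, train_labels)

print(model.predict(["friendly and productive people"]))  # expected: ['pos']
```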
DATA SET PREPARATION
Using emoticons:
Positive emoticons: =], :], :-), :), =) and :D
Negative emoticons: :-(, :(, =( and ;(
Neutral emoticons: =/ and :/
Tweets containing both positive and negative emoticons are ignored.
Using a sentiment lexicon:
Positive: the tweet contains positive lexicon words.
Negative: the tweet contains negative lexicon words.
Neutral: the tweet contains no positive and no negative lexicon words.
Tweets containing question marks are ignored.
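A sketch of the two noisy-labelling rules described above; the emoticon sets follow the slide, while the lexicon sets here are small placeholders standing in for a full sentiment lexicon such as Bing Liu's word lists. How tweets containing both positive and negative lexicon words are handled is not stated on the slide, so they are treated as ambiguous here.

```python
POS_EMO = {"=]", ":]", ":-)", ":)", "=)", ":D"}
NEG_EMO = {":-(", ":(", "=(", ";("}
NEU_EMO = {"=/", ":/"}
POS_LEX = {"good", "friendly", "productive"}  # placeholder positive words
NEG_LEX = {"bad", "erratic"}                  # placeholder negative words


def label_by_emoticon(tweet):
    has_pos = any(e in tweet for e in POS_EMO)
    has_neg = any(e in tweet for e in NEG_EMO)
    if has_pos and has_neg:
        return None            # both polarities present: tweet is ignored
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    if any(e in tweet for e in NEU_EMO):
        return "neu"
    return None                # no emoticon: cannot be noisily labelled


def label_by_lexicon(tweet):
    if "?" in tweet:
        return None            # tweets with question marks are ignored
    words = set(tweet.lower().split())
    has_pos = bool(words & POS_LEX)
    has_neg = bool(words & NEG_LEX)
    if has_pos and has_neg:
        return None            # assumption: mixed-polarity tweets are ambiguous
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    return "neu"               # no positive and no negative lexicon words


print(label_by_emoticon("nigeria is great :)"))        # pos
print(label_by_lexicon("erratic power supply again"))  # neg
```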
MY DATA SET
Using emoticons: pos = 8000, neg = 8000, neu = 8000
Using the sentiment lexicon: pos = 8000, neg = 8000, neu = 8000
Hand-labelled test set: pos = 748, neg = 912, neu = 874, total = 2534
Data Visualisation: lexicon-based data set and emoticon-based data set (charts)
CLASSIFICATION REPORT AND CONFUSION MATRIX (tables)
CONCLUSION
Emoticons are noisier than sentiment lexicons, so it is better to use sentiment lexicons as noisy labels to train a classifier for sentiment analysis.
The SVM performs better than the Naïve Bayes classifier.
Increasing the number of grams did not improve the accuracy of the classifiers trained on the corpus generated using sentiment lexicons as noisy labels; the reverse was the case when emoticons were used as noisy labels.
References
DataGenetics, 2012, "Emoticon Analysis in Twitter". http://www.datagenetics.com/blog/october52012/index.html
Alec Go, Richa Bhayani, and Lei Huang, 2009, Twitter Sentiment Analysis, CS224N Project Report, Stanford.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M. et al., 2011, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12.
Bing Liu, Minqing Hu and Junsheng Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.