Sentiment analysis of microblogging data using sentiment lexicons as noisy labels
By OLUGBEMI Eric, ODUMUYIWA Victor, OKUNOYE Olusoji
OUTLINE
OBJECTIVES OF THE STUDY
RESEARCH QUESTIONS
SIGNIFICANCE OF THE STUDY
MACHINE LEARNING
SUPERVISED MACHINE LEARNING
WEB MINING AND TEXT CLASSIFICATION
WHY TWITTER?
THE BAG OF WORDS APPROACH
THE NAÏVE BAYES ALGORITHM
SUPPORT VECTOR MACHINES
THE PROCESS OF TEXT CLASSIFICATION
MY DATA SET
RESULTS
PRACTICAL DEMONSTRATION
CONCLUSION
OBJECTIVES OF THE STUDY
Improve the accuracy of the SVM and Naïve Bayes classifiers by using sentiment lexicons rather than emoticons as noisy labels when creating a training corpus.
Compare the accuracy of the Naïve Bayes classifier with that of an SVM when sentiment lexicons are used as noisy labels and when emoticons are used as noisy labels.
RESEARCH QUESTIONS
Is it better to use sentiment lexicons or emoticons as noisy labels when creating a training corpus for sentiment analysis?
What is the accuracy of an SVM on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
What is the accuracy of the Naïve Bayes classifier on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
What is the effect of word n-grams on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?
What is the effect of term frequency-inverse document frequency (tf-idf) on the accuracy of the SVM and Naïve Bayes classifiers under the approach in this study?
SIGNIFICANCE OF THE STUDY
Mining the opinions of customers, electorates, etc.
Product reviews
MACHINE LEARNING
A machine can learn if you teach it. Approaches to teaching a machine: supervised learning, semi-supervised learning, and unsupervised learning.
SUPERVISED MACHINE LEARNING
Training phase: labelled tweets pass through a feature extractor, and the resulting features are used to train the classifier.
Prediction phase: unlabelled tweets pass through the same feature extractor, and the trained classifier assigns a label to each one.
WEB MINING AND TEXT CLASSIFICATION
Web mining: mining web content for information.
Sentiment analysis of web content involves extracting sentiment from web content. The sentiment in this case can be positive, negative, or neutral.
WHY TWITTER?
Twitter data are messy.
A large data set can be collected from Twitter.
Tweets have a bounded length (at most 140 characters).
Twitter users are heterogeneous.
THE BAG OF WORDS APPROACH
The sentiment of a text is assumed to depend only on the types of words in the text, so each word is assessed independently of the other words in the same text.
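A minimal sketch of this representation using scikit-learn (the library cited in the references); the two example tweets below are hypothetical placeholders, not taken from the study's data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets: each text becomes a vector of word counts,
# so word order is discarded and only word occurrence matters.
tweets = ["nigeria is a good country", "bad leadership in the country"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per tweet
```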
THE NAÏVE BAYES ALGORITHM
The Naïve Bayes classifier is a very simple classifier that relies on the bag-of-words representation of a document.
Assumptions:
1. The position of a word in a document does not matter.
2. The conditional probabilities P(xi|Cj) are independent.
FIRST, WE CALCULATE THE PRIOR PROBABILITIES
For the test data:
V = {nigeria, good, country, people, friendly, youth, productive, word, describe, bad, leadership, cope, erratic, power, supply}, |V| = 15
Count(p) = n(nigeria, good, country, people, nigeria, friendly, youth, nigeria, productive) = 9
Count(n) = n(word, describe, country, bad, leadership, nigeria, cope, erratic, power, supply) = 10
NEXT, WE CALCULATE THE LIKELIHOODS
P(nigeria|p) = (3+1)/(9+15) = 4/24 = 1/6
P(nigeria|n) = (1+1)/(10+15) = 2/25
P(country|p) = (1+1)/(9+15) = 2/24 = 1/12
P(country|n) = (1+1)/(10+15) = 2/25
P(viable|p) = (0+1)/(9+15) = 1/24
P(viable|n) = (1+1)/(10+15) = 2/25
P(youth|p) = (1+1)/(9+15) = 2/24 = 1/12
P(youth|n) = (0+1)/(10+15) = 1/25
To determine the class of text6:
P(p|text6) ∝ 3/5 × 1/6 × 1/12 × 1/24 × 1/12 ≈ 0.00003
P(n|text6) ∝ 2/5 × 2/25 × 2/25 × 2/25 × 1/25 ≈ 0.00001
Since 0.00003 > 0.00001, text6 is classified as a positive text.
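As a cross-check, a small Python sketch that reproduces the add-one-smoothed calculation above; the word counts, vocabulary size, and priors (3/5 and 2/5) are copied directly from the slides rather than derived from a real corpus.

```python
from math import prod

V = 15                             # vocabulary size |V|, as on the slide
total_words = {"p": 9, "n": 10}    # word tokens per class
prior = {"p": 3 / 5, "n": 2 / 5}   # class priors used on the slide
# occurrences of each test word in the training texts of each class
counts = {
    "p": {"nigeria": 3, "country": 1, "viable": 0, "youth": 1},
    "n": {"nigeria": 1, "country": 1, "viable": 1, "youth": 0},
}

def likelihood(word, c):
    # add-one (Laplace) smoothing: (count + 1) / (class word total + |V|)
    return (counts[c][word] + 1) / (total_words[c] + V)

test_words = ["nigeria", "country", "viable", "youth"]  # the words of text6
for c in ("p", "n"):
    score = prior[c] * prod(likelihood(w, c) for w in test_words)
    print(c, round(score, 5))
# prints p 3e-05 and n 1e-05, so text6 is classified as positive
```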
SUPPORT VECTOR MACHINES
An SVM searches for the linear or nonlinear optimal separating hyperplane (i.e., a "decision boundary") that separates the data samples of one class from those of another.
Minimize (in W, b): ||W||
subject to yi(W·Xi − b) ≥ 1 for every i = 1, …, n
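A hedged sketch of how such a linear SVM could be trained on noisily labelled tweets with scikit-learn; the ngram_range and tf-idf weighting in the vectorizer correspond to the n-gram and tf-idf settings examined in the research questions. The tweets and labels below are hypothetical placeholders, not the study's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical noisily labelled training tweets (placeholders only).
train_tweets = [
    "nigeria is a good country with friendly people",
    "productive youth everywhere :)",
    "bad leadership and erratic power supply",
    "word to describe the country: bad",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Unigram + bigram features with tf-idf weighting, feeding a linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_tweets, train_labels)

print(model.predict(["friendly and productive people"]))  # expected: ['pos']
```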
DATA SET PREPARATION
Using emoticons:
Positive emoticons: =], :], :-), :), =) and :D
Negative emoticons: :-(, :(, =( and ;(
Neutral emoticons: =/ and :/
Tweets containing both positive and negative emoticons are ignored.
Using a sentiment lexicon:
Positive: the tweet contains positive lexicon words.
Negative: the tweet contains negative lexicon words.
Neutral: the tweet contains no positive and no negative lexicon words.
Tweets containing question marks are ignored.
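A sketch of the two noisy-labelling rules described above; the emoticon sets follow the slide, while the lexicon sets here are small placeholders standing in for a full sentiment lexicon such as Bing Liu's word lists. How tweets containing both positive and negative lexicon words are handled is not stated on the slide, so they are treated as ambiguous here.

```python
POS_EMO = {"=]", ":]", ":-)", ":)", "=)", ":D"}
NEG_EMO = {":-(", ":(", "=(", ";("}
NEU_EMO = {"=/", ":/"}
POS_LEX = {"good", "friendly", "productive"}  # placeholder positive words
NEG_LEX = {"bad", "erratic"}                  # placeholder negative words


def label_by_emoticon(tweet):
    has_pos = any(e in tweet for e in POS_EMO)
    has_neg = any(e in tweet for e in NEG_EMO)
    if has_pos and has_neg:
        return None            # both polarities present: tweet is ignored
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    if any(e in tweet for e in NEU_EMO):
        return "neu"
    return None                # no emoticon: cannot be noisily labelled


def label_by_lexicon(tweet):
    if "?" in tweet:
        return None            # tweets with question marks are ignored
    words = set(tweet.lower().split())
    has_pos = bool(words & POS_LEX)
    has_neg = bool(words & NEG_LEX)
    if has_pos and has_neg:
        return None            # assumption: mixed-polarity tweets are ambiguous
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    return "neu"               # no positive and no negative lexicon words


print(label_by_emoticon("nigeria is great :)"))        # pos
print(label_by_lexicon("erratic power supply again"))  # neg
```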
MY DATA SET
Using emoticons: pos = 8000, neg = 8000, neu = 8000
Using the sentiment lexicon: pos = 8000, neg = 8000, neu = 8000
Hand-labelled test set: pos = 748, neg = 912, neu = 874, total = 2534
Data Visualisation: lexicon-based data set and emoticon-based data set (charts)
CLASSIFICATION REPORT AND CONFUSION MATRIX (tables)
CONCLUSION
Emoticons are noisier than sentiment lexicons, so it is better to use sentiment lexicons as noisy labels to train a classifier for sentiment analysis.
The SVM performs better than the Naïve Bayes classifier.
Increasing the number of grams did not improve the accuracy of the classifiers trained on the corpus generated using sentiment lexicons as noisy labels; the reverse was the case when emoticons were used as noisy labels.
References
DataGenetics, 2012, "Emoticon Analysis in Twitter". http://www.datagenetics.com/blog/october52012/index.html
Alec Go, Richa Bhayani, and Lei Huang, 2009, Twitter Sentiment Analysis, CS224N Project Report, Stanford.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M. et al., 2011, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12.
Bing Liu, Minqing Hu and Junsheng Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.