
Sentiment analysis of microblogging data using sentiment lexicons as noisy labels


Presentation Transcript


  1. Sentiment analysis of microblogging data using sentiment lexicons as noisy labels. By OLUGBEMI Eric, ODUMUYIWA Victor, OKUNOYE Olusoji

  2. OUTLINE
  OBJECTIVES OF THE STUDY
  RESEARCH QUESTIONS
  SIGNIFICANCE OF THE STUDY
  MACHINE LEARNING
  SUPERVISED MACHINE LEARNING
  WEB MINING AND TEXT CLASSIFICATION
  WHY TWITTER?
  THE BAG OF WORDS APPROACH
  THE NAÏVE BAYES ALGORITHM
  SUPPORT VECTOR MACHINES
  THE PROCESS OF TEXT CLASSIFICATION
  MY DATA SET
  RESULTS
  PRACTICAL DEMONSTRATION
  CONCLUSION

  3. OBJECTIVES OF THE STUDY
  Improve the accuracy of the SVM and Naïve Bayes classifiers by using sentiment lexicons rather than emoticons as noisy labels when creating a training corpus.
  Compare the accuracy of the Naïve Bayes classifier with that of an SVM, both when sentiment lexicons are used as noisy labels and when emoticons are used as noisy labels.

  4. RESEARCH QUESTIONS
  Is it better to use sentiment lexicons or emoticons as noisy labels when creating a training corpus for sentiment analysis?
  What is the accuracy of the SVM on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
  What is the accuracy of the Naïve Bayes classifier on Twitter data when the training corpus is created using sentiment lexicons as noisy labels?
  What is the effect of word n-grams on the accuracy of the SVM and Naïve Bayes classifiers using the approach in this study?
  What is the effect of term frequency-inverse document frequency (TF-IDF) on the accuracy of the SVM and Naïve Bayes classifiers using the approach in this study?

  5. SIGNIFICANCE OF THE STUDY Mining the opinions of customers, electorates, etc.; analysing product reviews.

  6. MACHINE LEARNING A machine can learn if you teach it. Ways of teaching a machine: supervised learning, semi-supervised learning, and unsupervised learning.

  7. SUPERVISED MACHINE LEARNING (diagram) Training phase: labeled tweets → feature extractor → features → classifier. Prediction phase: unlabeled tweets → feature extractor → features → trained classifier → label.
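
The two-phase pipeline in the diagram can be sketched in a few lines of Python using scikit-learn (cited in the references). The tiny train_tweets, train_labels and new_tweets lists below are made-up placeholders, not the study's data.

# Training phase: labeled tweets -> feature extractor -> features -> classifier.
# Prediction phase: unlabeled tweets -> (same) feature extractor -> features -> label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_tweets = ["nigeria is a good country", "bad leadership in nigeria"]  # placeholder labeled tweets
train_labels = ["positive", "negative"]
new_tweets = ["the youth are productive"]                                  # placeholder unlabeled tweet

vectorizer = CountVectorizer()                            # feature extractor
features = vectorizer.fit_transform(train_tweets)         # features for training
classifier = MultinomialNB().fit(features, train_labels)  # trained classifier

new_features = vectorizer.transform(new_tweets)           # reuse the same feature extractor
print(classifier.predict(new_features))                   # predicted label(s)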

  8. WEB MINING AND TEXT CLASSIFICATION Web mining is the mining of web content for information. Sentiment analysis of web content involves extracting the sentiment expressed in that content; the sentiment can be positive, negative, or neutral.

  9. WHY TWITTER? Twitter data are messy. A large data set can be collected from Twitter. Tweets have a maximum length of 140 characters. Twitter users are heterogeneous.

  10. THE BAG OF WORDS APPROACH The sentiment of a text is assumed to depend only on which words appear in it, so each word is assessed independently of the other words in the same text.
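
A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer; the two texts are illustrative only, and ngram_range is included because the research questions ask about the effect of word n-grams.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["nigeria is a good country", "bad leadership in nigeria"]

unigrams = CountVectorizer(ngram_range=(1, 1))   # single words only
bigrams = CountVectorizer(ngram_range=(1, 2))    # words plus adjacent word pairs

X_uni = unigrams.fit_transform(texts)
X_bi = bigrams.fit_transform(texts)

print(sorted(unigrams.vocabulary_))   # the "bag" of words the counts refer to
print(X_uni.toarray())                # per-text word counts; word order is ignored
print(sorted(bigrams.vocabulary_))    # now also contains entries such as "good country"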

  11. THE NAÏVE BAYES ALGORITHM The Naïve Bayes classifier is a very simple classifier that relies on the "bag of words" representation of a document. Assumptions: 1. The position of a word in a document does not matter. 2. The conditional probabilities P(xi|Cj) are independent given the class.

  12. A PRACTICAL EXAMPLE

  13. After preprocessing

  14. FIRST, WE CALCULATE THE PRIORS AND THE WORD COUNTS
  Vocabulary: V = {nigeria, good, country, people, friendly, youth, productive, word, describe, bad, leadership, cope, erratic, power, supply}, |V| = 15
  Count(p) = n(nigeria, good, country, people, nigeria, friendly, youth, nigeria, productive) = 9
  Count(n) = n(word, describe, country, bad, leadership, nigeria, cope, erratic, power, supply) = 10
  Prior probabilities (used in the next step): P(p) = 3/5, P(n) = 2/5

  15. NEXT, WE CALCULATE THE LIKELIHOODS (with add-one smoothing)
  P(nigeria|p) = (3+1)/(9+15) = 4/24 = 1/6        P(nigeria|n) = (1+1)/(10+15) = 2/25
  P(country|p) = (1+1)/(9+15) = 2/24 = 1/12       P(country|n) = (1+1)/(10+15) = 2/25
  P(viable|p) = (0+1)/(9+15) = 1/24               P(viable|n) = (1+1)/(10+15) = 2/25
  P(youth|p) = (1+1)/(9+15) = 2/24 = 1/12         P(youth|n) = (0+1)/(10+15) = 1/25
  To determine the class of text6:
  P(p|text6) ∝ 3/5 × 1/6 × 1/12 × 1/24 × 1/12 ≈ 0.00003
  P(n|text6) ∝ 2/5 × 2/25 × 2/25 × 2/25 × 1/25 ≈ 0.00001
  Since 0.00003 > 0.00001, text6 is classified as a positive text.
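
The worked example can be reproduced in plain Python. This is only a sketch: the word counts, the priors 3/5 and 2/5, and |V| = 15 come from the slides above, and the words of text6 are inferred from the likelihoods actually used (nigeria, country, viable, youth).

from fractions import Fraction as F

V = 15                                          # vocabulary size |V|
pos_counts = {"nigeria": 3, "country": 1, "viable": 0, "youth": 1}
neg_counts = {"nigeria": 1, "country": 1, "viable": 1, "youth": 0}
pos_total, neg_total = 9, 10                    # Count(p), Count(n)
prior_p, prior_n = F(3, 5), F(2, 5)             # P(p), P(n)

def likelihood(count, total):
    return F(count + 1, total + V)              # add-one (Laplace) smoothing

text6 = ["nigeria", "country", "viable", "youth"]   # inferred test words
score_p, score_n = prior_p, prior_n
for w in text6:
    score_p *= likelihood(pos_counts[w], pos_total)
    score_n *= likelihood(neg_counts[w], neg_total)

print(float(score_p), float(score_n))           # roughly 0.00003 vs 0.00001
print("positive" if score_p > score_n else "negative")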

  16. SUPPORT VECTOR MACHINES An SVM searches for the linear or non-linear optimal separating hyperplane (i.e., a "decision boundary") that separates the data samples of one class from those of another. For the linear case: minimize ||W|| over (W, b), subject to yi(W·Xi − b) ≥ 1 for all i = 1, …, n.
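
A rough sketch of a linear SVM text classifier with TF-IDF weighting, again assuming scikit-learn; it is illustrative and not necessarily the configuration used in the study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_tweets = ["nigeria is a good country", "bad leadership in nigeria"]  # placeholder data
train_labels = ["positive", "negative"]

svm_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # TF-IDF over unigrams and bigrams
    ("svm", LinearSVC()),                            # linear separating hyperplane
])
svm_clf.fit(train_tweets, train_labels)
print(svm_clf.predict(["the people are friendly and good"]))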

  17. DATA SET PREPARATION
  Using emoticons:
  Positive emoticons: =] , :] , :-) , :) , =) , :D
  Negative emoticons: :-( , :( , =( , ;(
  Neutral emoticons: =/ , :/
  Tweets containing both positive and negative emoticons are ignored.
  Using a sentiment lexicon:
  Positive: tweet contains positive lexicon words
  Negative: tweet contains negative lexicon words
  Neutral: tweet contains no positive and no negative lexicon words
  Tweets containing question marks are ignored.
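
The two labelling rules above could be sketched as follows. The emoticon lists come from the slide; pos_lexicon and neg_lexicon are placeholder sets (in practice a published opinion lexicon, such as the one by Hu and Liu cited in the references, would be loaded here).

POS_EMOTICONS = ["=]", ":]", ":-)", ":)", "=)", ":D"]
NEG_EMOTICONS = [":-(", ":(", "=(", ";("]
NEU_EMOTICONS = ["=/", ":/"]   # note: strip URLs first so "http://" does not match ":/"

pos_lexicon = {"good", "friendly", "productive"}   # placeholder entries
neg_lexicon = {"bad", "erratic"}                   # placeholder entries

def label_with_emoticons(tweet):
    has_pos = any(e in tweet for e in POS_EMOTICONS)
    has_neg = any(e in tweet for e in NEG_EMOTICONS)
    if has_pos and has_neg:
        return None                    # ignored: both polarities present
    if has_pos:
        return "positive"
    if has_neg:
        return "negative"
    if any(e in tweet for e in NEU_EMOTICONS):
        return "neutral"
    return None                        # no emoticon: unusable as a noisy label

def label_with_lexicon(tweet):
    if "?" in tweet:
        return None                    # ignored: tweets with question marks
    words = set(tweet.lower().split())
    has_pos = bool(words & pos_lexicon)
    has_neg = bool(words & neg_lexicon)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    if not has_pos and not has_neg:
        return "neutral"               # no positive and no negative lexicon words
    return None                        # mixed polarity: skip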

  18. MY DATA SET
  Using emoticons: positive = 8000, negative = 8000, neutral = 8000
  Using sentiment lexicons: positive = 8000, negative = 8000, neutral = 8000
  Hand-labelled test set: positive = 748, negative = 912, neutral = 874, total = 2534

  19. DATA VISUALISATION (charts): lexicon-based data set; emoticon-based data set.

  20. RESULTS

  21. CLASSIFICATION REPORT AND CONFUSION MATRIX
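
Assuming scikit-learn was used for evaluation (it is cited in the references), the two outputs named on this slide could be produced along these lines; y_true and y_pred are hypothetical stand-ins for the hand-labelled test labels and the classifier's predictions.

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["positive", "negative", "neutral", "positive"]   # hand-labelled test labels (placeholder)
y_pred = ["positive", "negative", "neutral", "neutral"]    # classifier predictions (placeholder)
labels = ["positive", "negative", "neutral"]

print(classification_report(y_true, y_pred, labels=labels))   # precision, recall, F1 per class
print(confusion_matrix(y_true, y_pred, labels=labels))        # rows = true class, columns = predicted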

  22. CONCLUSION Emoticons are noisier than sentiment lexicons; it is therefore better to use sentiment lexicons as noisy labels when training a classifier for sentiment analysis. The SVM performs better than the Naïve Bayes classifier. Increasing the number of grams did not improve the accuracy of the classifiers trained on the corpus generated using sentiment lexicons as noisy labels; the reverse was the case when emoticons were used as noisy labels.

  23. References
  DataGenetics, 2012, "Emoticon Analysis in Twitter". http://www.datagenetics.com/blog/october52012/index.html
  Alec Go, Richa Bhayani, and Lei Huang, 2009, "Twitter Sentiment Analysis", CS224N Project Report, Stanford.
  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al., 2011, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12.
  Bing Liu, Minqing Hu, and Junsheng Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

  24. THANK YOU
