Introducing Movie Review

  1. Introducing Movie Review. Peiti Li (1), Shan Wu (2), Xiaoli Chen (1). (1) Computer Science Dept., (2) Statistics Dept., Columbia University, 116th Street and Broadway, New York, NY 10027, USA

  2. Why Twitter? It is a fast and more direct way for people to share their opinions on a topic.

  3. Twitter Search API + Stream API, Python

  4. Opinion Mining or Sentiment Analysis: the computational study of opinions, sentiments, subjectivity, and attitudes.

  5. Just like a text classification task, but different from topic-based text classification. In topic-based text classification (e.g., computer, sport, science), topic words are important. In sentiment classification, opinion/sentiment words are more important, e.g., awesome, great, excellent, horrible, bad, worst.

  6. Why a HARD task? Structure the unstructured: natural language text is often regarded as unstructured data. Besides data mining, we need NLP technologies. Example: "I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive,…" (Credits: Bing Liu for this example.)

  7. Tell people whether to buy a movie ticket, using tweets. Give a rating of the movie based on tweets. Classify each tweet as either positive or negative.

  8. Accuracies of different machine learning approaches. Table from: Bo Pang et al. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proc. of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86. Association for Computational Linguistics.

  9. Our approach is Naïve Bayes. P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence). Smoothing: P(token | sentiment) = (count(this token in class) + 1) / (count(all tokens in class) + count(distinct tokens in the vocabulary)). We didn’t use any third-party classifier; we coded our classifier ourselves. Reason: we wanted to explore what is under the hood and tune the algorithm structure according to the experiment results.
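A minimal sketch of the classifier described on this slide, not the authors' actual code. It assumes the usual add-one (Laplace) form, in which the second term of the smoothing denominator is the vocabulary size, and it assumes whitespace-tokenized input. The class name `NaiveBayes` and its method names are hypothetical. P(sentence) is the same for every sentiment, so it is dropped when comparing classes.

```python
import math
from collections import Counter


class NaiveBayes:
    """Hypothetical from-scratch Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of token counts in that class
        self.doc_counts = Counter()  # label -> number of training sentences
        self.vocab = set()           # all distinct tokens seen in training

    def train(self, labeled_sentences):
        for tokens, label in labeled_sentences:
            self.doc_counts[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)

    def log_posterior(self, tokens, label):
        # log P(sentiment) + sum of log P(token | sentiment), up to a constant
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.token_counts[label]
        denom = sum(counts.values()) + len(self.vocab)  # assumed vocabulary-size term
        for token in tokens:
            score += math.log((counts[token] + 1) / denom)
        return score

    def classify(self, tokens):
        return max(self.doc_counts, key=lambda label: self.log_posterior(tokens, label))


# Toy training data, invented for illustration only
nb = NaiveBayes()
nb.train([
    ("a great movie".split(), "pos"),
    ("an excellent film".split(), "pos"),
    ("a horrible movie".split(), "neg"),
    ("the worst film".split(), "neg"),
])
print(nb.classify("great film".split()))
print(nb.classify("worst movie".split()))
```

Working in log space avoids underflow when many small token probabilities are multiplied together.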

  10. Getting Started

  11. Dataset
      • Dev set:
        • The movie review dataset provided by Bo Pang and Lillian Lee, Cornell University
        • sentence_polarity_dataset_v1.0
        • 5331 positive, 5331 negative
      • Real set:
        • Tweets about a specific movie
        • Cannot tell the exact number
        • Twitter Search API (REST): last 6-7 days
        • Twitter Stream API: real-time timeline
        • Drawbacks: the REST API has rate limiting; Stream data takes time to collect.

  12. Top 100 words including stopwords
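A ranking like this one can be produced with a plain frequency count. The sketch below assumes lowercased, whitespace-split sentences and keeps stopwords; the function name `top_words` and the sample sentences are illustrative, not from the original project.

```python
from collections import Counter


def top_words(sentences, n=100):
    """Return the n most frequent tokens, stopwords included."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts.most_common(n)


# Invented sample sentences
sample = ["The movie is great", "the plot is thin", "the end"]
print(top_words(sample, n=2))
```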

  13. Better and better, but… Baseline model: Naïve Bayes without any nontrivial text preprocessing; punctuation excluded, stopwords included. Tuned model: still Naïve Bayes, with a better feature extraction technique that eliminates low-information features. Best unigram model; best unigram and bigram model.
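The slide does not say how low-information features were identified. One common choice, shown here purely as an assumption, is to score each word with a chi-square statistic over a 2x2 table (word present or absent versus positive or negative sentence) and keep only the top-scoring words. The function names and toy data are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table: a/b = pos/neg docs with the word, c/d = without."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def informative_words(pos_docs, neg_docs, keep):
    """Return the `keep` words whose presence best separates the two classes."""
    pos_df = Counter(w for doc in pos_docs for w in set(doc))
    neg_df = Counter(w for doc in neg_docs for w in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for word in set(pos_df) | set(neg_df):
        a, b = pos_df[word], neg_df[word]
        scores[word] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score, breaking ties alphabetically for determinism
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:keep])


# Toy tokenized sentences, invented for illustration
pos = [["a", "great", "movie"], ["great", "fun"]]
neg = [["a", "bad", "movie"], ["bad", "plot"]]
print(informative_words(pos, neg, keep=2))
```

Words such as "a" and "movie" appear equally often in both classes, score zero, and are dropped; the classifier is then trained only on the retained words.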

  14. Dev set result: takes 1 hour! The Intel Core i5 laptop died in the middle from running too hot for too long. Observation: we should definitely not consider bigrams, but we still don’t know whether we should remove the stopwords.

  15. Two labeled sets of 150 tweets each, 75 labeled by Xiaoli and 75 labeled by Shan. First set: 5 neg, 87 pos. Second set: 76 neg, 32 pos.

  16. Results on the 2 recent movies (Real set). Regular expression 1: (?:@\S*|#\S*|http(?=.*://)\S*) Regular expression 2 (all punctuation removed): (#[A-Za-z0-9]+) | (@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+) Which regular expression should we choose based on this result? Hard to say… :-(
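To make the difference between the two expressions concrete, the sketch below applies each to one invented tweet. It assumes the spaces around the first `|` in regular expression 2 are a slide-formatting artifact and omits them; the helper `clean` and the sample tweet are illustrative only.

```python
import re

# Regular expression 1: strips @mentions, #hashtags and URLs, leaves punctuation alone
REGEX_1 = re.compile(r"(?:@\S*|#\S*|http(?=.*://)\S*)")

# Regular expression 2: also strips every non-alphanumeric character
# (assumption: the whitespace around the first "|" on the slide is not intended)
REGEX_2 = re.compile(r"(#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean(tweet, pattern):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(pattern.sub(" ", tweet).split())


tweet = "@bob Loved #HappyFeet :-) http://t.co/abc"  # invented example
print(clean(tweet, REGEX_1))
print(clean(tweet, REGEX_2))
```

On this tweet, the first expression leaves the emoticon in place while the second removes it along with all other punctuation.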

  17. We turned our attention to other similar products: LingPipe, Twendz, Twitter Sentiment, tweetfeel.

  18. They are new too.

  19. Our classifier gets exactly the same results as they do, but wait…

  20. Two tweets made us frown :-(

  21. Emoticons play a role!!! :-) >:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) >:D :-D :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 :P FTW :'( ;*( :_( T.T T_T Y.Y Y_Y >:[ :-( :( :-c :c :-< :< :-[ :[ :{ >.> <.< >.< >:\ >:/ :-/ :-. :/ :\ =/ =\ :S

  22. So we chose the regular expression that keeps emoticons (regular expression 1), and we built a dictionary to eliminate all punctuation marks that appear alone: '`','~','!','@','#','$','%','^','&','*','(',')','-','_','+','=','{','}','[',']',';',':','"',"'",'<','>',',','.','?','|','\\','/'
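A small sketch of how such a dictionary could be used, assuming tweets are tokenized on whitespace so that emoticons such as `:-)` survive as whole tokens while a stand-alone `!` or `.` is dropped. The names `LONE_PUNCTUATION` and `drop_lone_punctuation` are hypothetical, and the sample text is invented.

```python
# The punctuation marks listed on the slide, stored as a set for fast lookup
LONE_PUNCTUATION = {
    '`', '~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_',
    '+', '=', '{', '}', '[', ']', ';', ':', '"', "'", '<', '>', ',', '.',
    '?', '|', '\\', '/',
}


def drop_lone_punctuation(tokens):
    """Remove tokens that consist of exactly one listed punctuation mark."""
    return [token for token in tokens if token not in LONE_PUNCTUATION]


tokens = "great movie ! :-) - loved it .".split()
print(drop_lone_punctuation(tokens))
```

Multi-character tokens are untouched, so emoticons remain available as features for the classifier.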

  23. Demo: finally, the Python begins to catch the twittering bird…

  24. We still need to do more semi-supervised learning.
      1. Specific bigrams like “don’t love”
      2. A finer classifier that can exclude objective tweets
      3. Detect and remove annoying movie names like “Happy Feet” (“Happy” Feet? So all tweets are positive?)
      4. Give more weight to dominant words like “excellent”, “worst”
      5. Our final task: give ratings
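The movie-name problem could be handled with a preprocessing step like the one below. This is only a sketch of one possible approach, assuming the title is known in advance and appears as plain text (hashtag forms such as #HappyFeet would need separate handling); `remove_movie_title` is a hypothetical helper.

```python
import re


def remove_movie_title(tweet, title):
    """Delete case-insensitive occurrences of the title so its words are not counted as sentiment."""
    pattern = re.compile(re.escape(title), flags=re.IGNORECASE)
    return " ".join(pattern.sub(" ", tweet).split())


print(remove_movie_title("Happy Feet was so boring", "Happy Feet"))
```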

  25. Thank you Columbia! Thank you all! Thank you STAT 4240!