
TEXT ANALYTICS - LABS

Learn basic and advanced text analytics techniques, including sentiment analysis, text classification, information extraction, and named entity recognition. Explore Python libraries such as scikit-learn and NLTK for natural language processing and machine learning. Gain hands-on experience through practical examples and tutorials.



Presentation Transcript


  1. TEXT ANALYTICS - LABS Maha Althobaiti, Udo Kruschwitz, Massimo Poesio

  2. LABS • Basic text analytics: text classification using bags-of-words • Specifically, sentiment analysis of tweets using Python’s scikit-learn library • More advanced text analytics: information extraction using NLP pipelines • Named Entity Recognition


  4. Sentiment analysis using scikit-learn • Materials for this part of the tutorial: http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab • Based on: chap. 6 of [textbook cover shown on slide]

  5. TEXT ANALYTICS IN PYTHON • Text manipulation is not quite as easy in Python as in Perl, but there are a number of useful packages • SCIKIT-LEARN for machine learning, including basic text classification • NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; also support for ML-based methods, e.g., for text classification


  7. SCIKIT-LEARN • An open-source library supporting machine learning work • Based on numpy, scipy, and matplotlib • Provides implementations of • several supervised ML algorithms, including e.g. regression, Naïve Bayes, SVMs • clustering • dimensionality reduction • It includes several facilities to support text classification, including e.g. ways to create NLP pipelines out of components • Website: http://scikit-learn.org/stable/
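All scikit-learn estimators share the same basic interface: fit to learn a model, predict to classify new instances. A minimal sketch, with toy word-count data invented for illustration:

    # Minimal sketch of scikit-learn's common estimator API
    # (toy word-count data invented for illustration).
    from sklearn.naive_bayes import MultinomialNB

    X_train = [[3, 0, 1], [0, 2, 4]]   # word counts per document
    y_train = ["pos", "neg"]           # one label per document

    clf = MultinomialNB()
    clf.fit(X_train, y_train)          # learn a model from the data
    print(clf.predict([[1, 0, 2]]))    # classify a new instance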

  8. REMINDER : SENTIMENT ANALYSIS • (or opinion mining) • Develop algorithms that can identify the ‘sentiment’ expressed by a text • Product X sucks • I was mesmerized by film Y

  9. SENTIMENT ANALYSIS AS TEXT CATEGORIZATION • Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification • Most successful approaches use SUPERVISED LEARNING: • use corpora annotated for subjectivity and/or sentiment • to train models using supervised machine learning algorithms: • Naïve Bayes • decision trees • SVMs • Good results can already be obtained using only WORDS as features
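Using only words as features amounts to a bag-of-words representation, which scikit-learn can build directly. A minimal sketch (example sentences invented; get_feature_names_out requires scikit-learn 1.0 or later):

    # Turning raw text into bag-of-words count features
    # (example sentences invented for illustration).
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Product X sucks", "I was mesmerized by film Y"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # sparse document-term matrix
    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(X.toarray())                          # one count row per document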

  10. TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH • Attributes are text positions, values are words.
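Under this model the classifier chooses the class that maximizes the standard Naïve Bayes decision rule for the word-based formulation:

    c_{NB} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i \in \text{positions}} P(w_i \mid c_j)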

  11. SENTIMENT ANALYSIS OF TWEETS • A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people’s comments about them on Twitter • Several datasets exist for this task • E.g., SEMEVAL-2014 • In this lab: Nick Sanders’s dataset • 5,000 tweets • Annotated as positive / negative / neutral / irrelevant • A list of ID / sentiment pairs, plus a script to download tweets on the basis of their ID

  12. First Script • Start an IDLE window • Open the file: 01_start.py (but do not run it yet!!)


  14. A word-based, Naïve Bayes sentiment analyzer using scikit-learn • The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers: • GaussianNB (for features that have a Gaussian distribution, e.g., physical traits such as height) • MultinomialNB (when features are frequencies of words) • BernoulliNB (for Boolean features) • For sentiment analysis: MultinomialNB
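A quick sketch of the practical difference (toy data invented for illustration): MultinomialNB consumes raw counts, whereas BernoulliNB expects binary presence/absence features.

    # MultinomialNB models word frequencies; BernoulliNB models
    # binary word presence (toy data invented for illustration).
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    counts = [[2, 0, 3], [0, 1, 0]]       # word counts per tweet
    binary = [[1, 0, 1], [0, 1, 0]]       # same tweets as presence/absence
    labels = ["positive", "negative"]

    MultinomialNB().fit(counts, labels)   # frequency features
    BernoulliNB().fit(binary, labels)     # Boolean features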

  15. Creating the model • The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model • create_ngram_model uses the TfidfVectorizer class from scikit-learn’s feature_extraction package to extract terms from tweets • http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html • create_ngram_model uses MultinomialNB to learn a classifier • http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html • scikit-learn’s Pipeline class is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create (‘fit’) a model, and use the model to classify • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
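A sketch of what create_ngram_model might look like, based on the description above; the exact code is in 01_start.py, so the parameter values and step names here are assumptions:

    # Sketch of create_ngram_model as described above; details
    # (parameter values, step names) are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    def create_ngram_model():
        # Extract terms from the tweets and weight them with TF-IDF
        tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word")
        # Multinomial Naive Bayes over the weighted term features
        clf = MultinomialNB()
        # Combine vectorizer and classifier into a single estimator
        return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])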

  16. Tweet term extraction & classification • [Code for the model shown on slide, with callouts: extracts features and weights them (TfidfVectorizer); Naïve Bayes classifier (MultinomialNB); creates the Pipeline]

  17. Training and evaluation • The function train_model • uses ShuffleSplit, from scikit-learn’s cross_validation library, to calculate the folds to use in cross-validation • At each iteration, the function creates a model using fit, then evaluates the results using score
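A sketch of train_model along these lines; the actual code is in the lab scripts, so details are assumptions. Note that in current scikit-learn ShuffleSplit lives in model_selection (the cross_validation module named on the slide is its older location):

    # Sketch of train_model as described above; details are assumptions.
    # ShuffleSplit moved from sklearn.cross_validation to
    # sklearn.model_selection in later scikit-learn releases.
    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    def train_model(clf_factory, X, Y):
        # X: numpy array of tweet texts, Y: numpy array of labels
        cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
        scores = []
        for train_idx, test_idx in cv.split(X):
            clf = clf_factory()                    # a fresh Pipeline
            clf.fit(X[train_idx], Y[train_idx])    # train on this split
            scores.append(clf.score(X[test_idx], Y[test_idx]))
        return np.mean(scores)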

  18. Creating a model • [Code for train_model shown on slide, with callouts: identifies the indices in each fold; trains the model]

  19. Execution

  20. Optimization • The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB • In text analytics it’s usually easy to build a first prototype, but lots of experimentation is needed to achieve good results • Alternative choices for TfidfVectorizer: • using unigrams, bigrams, trigrams (the ngram_range parameter) • removing stopwords (the stop_words parameter) • using binary counts instead of frequencies (the binary parameter) • Alternative choices for MultinomialNB: • which type of SMOOTHING to use
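For instance (the particular values below are only examples, not the lab's settings):

    # Example non-default settings (values chosen for illustration)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vect = TfidfVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                           stop_words="english",  # remove stopwords
                           binary=True)           # presence/absence counts
    clf = MultinomialNB(alpha=0.5)                # smoothing strength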

  21. Smoothing • Even a very large corpus remains a limited sample of language use, so many words, even common ones, will not occur in it • The problem is particularly acute with tweets, where a lot of ‘creative’ use of words is found • Solution: SMOOTHING – redistribute the probability mass so that every word gets some • Most used: ADD-ONE or LAPLACE smoothing
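With add-one (Laplace) smoothing, the estimated probability of word w in class c becomes (standard formula, with V the vocabulary):

    P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}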

  22. Optimization • Looking for the best values for the parameters is a standard operation in machine learning • Scikit-learn, like Weka and similar packages, provides a utility (GridSearchCV) to explore the results that can be achieved with different parameter configurations

  23. Optimizing with GridSearchCV • [Code shown on slide, with callouts: note the syntax for specifying the values of the parameters; the F metric is used to evaluate; which smoothing function to use]
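A sketch of that kind of grid search; the exact grid is in 02_tuning.py, so the values below are assumptions. Note the step__parameter syntax (vect__, clf__) for addressing parameters of individual pipeline steps:

    # Sketch of tuning the pipeline with GridSearchCV; the exact
    # parameter grid is in 02_tuning.py, so these values are assumptions.
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],  # uni-/bi-/trigrams
        "vect__stop_words": [None, "english"],
        "clf__alpha": [0.01, 0.1, 1.0],                 # smoothing strength
    }
    grid = GridSearchCV(create_ngram_model(), param_grid,
                        scoring="f1_macro",             # an F metric
                        cv=10)
    # After grid.fit(X, Y), grid.best_params_ holds the best configuration.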

  24. Second Script • Start an IDLE window • Open the file: 02_tuning.py (but do not run it yet!!)

  25. Additional improvements: normalization, preprocessing • Further improvements may be possible by doing some form of NORMALIZATION

  26. Example of normalization: emoticons
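For instance, emoticons can be mapped to sentiment-bearing placeholder tokens before vectorization (the mapping below is an invented illustration, not the lab's table):

    # Replacing emoticons with placeholder tokens before vectorization
    # (the mapping is an invented illustration).
    emo_repl = {
        ":)": " good ", ":-)": " good ",
        ":(": " bad ",  ":-(": " bad ",
    }

    def normalize_emoticons(tweet):
        for emoticon, replacement in emo_repl.items():
            tweet = tweet.replace(emoticon, replacement)
        return tweet

    print(normalize_emoticons("loved the new phone :)"))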

  27. Normalization: abbreviations
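Abbreviations can be handled similarly, e.g., with word-boundary regular expressions (the entries below are invented examples):

    # Expanding common Twitter abbreviations with word-boundary regexes
    # (the entries are invented examples).
    import re

    re_repl = {
        r"\br\b": "are",
        r"\bu\b": "you",
        r"\bdont\b": "do not",
    }

    def normalize_abbreviations(tweet):
        for pattern, replacement in re_repl.items():
            tweet = re.sub(pattern, replacement, tweet)
        return tweet

    print(normalize_abbreviations("u r great"))   # -> "you are great"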

  28. Adding a preprocessing step to TfidfVectorizer
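One way to plug these steps in is TfidfVectorizer's preprocessor parameter, which takes a callable applied to each document before tokenization (a sketch building on the two normalizers above; the lab code may instead subclass the vectorizer):

    # Hooking the normalizers into the vectorizer via the preprocessor
    # parameter (a sketch building on the functions defined above).
    from sklearn.feature_extraction.text import TfidfVectorizer

    def preprocess(tweet):
        tweet = tweet.lower()
        tweet = normalize_emoticons(tweet)
        return normalize_abbreviations(tweet)

    vect = TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 3))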

  29. Other possible improvements • Using NLTK’s POS tagger • Using a sentiment lexicon such as SentiWordNet • http://sentiwordnet.isti.cnr.it/download.php • (in the data/ directory)
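A minimal sketch of both ideas with NLTK (this assumes the relevant NLTK data packages, e.g., the tagger models and the sentiwordnet corpus, have been downloaded with nltk.download):

    # POS tagging and SentiWordNet lookup with NLTK (assumes the
    # required NLTK data packages have been downloaded).
    import nltk
    from nltk.corpus import sentiwordnet as swn

    tokens = nltk.word_tokenize("This film was great")
    print(nltk.pos_tag(tokens))    # e.g., ('great', 'JJ') for the adjective

    # Positive/negative scores for the first adjective sense of "great"
    sense = list(swn.senti_synsets("great", "a"))[0]
    print(sense.pos_score(), sense.neg_score())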

  30. Third Script • (Start an IDLE window) • Open and run the file: 03_clean.py

  31. Overall results

  32. TO LEARN MORE

  33. SCIKIT-LEARN

  34. NLTK http://www.nltk.org/book
