A Survey of Opinion Mining
Dongjoo Lee
Intelligent Database Systems Lab., Dept. of Computer Science and Engineering
Seoul National University
Introduction
• The Web contains a wealth of opinions about products, politics, and more, in newsgroup posts, review sites, and other websites
• A few problems
  • What is the general opinion on the proposed tax reform?
  • How is popular opinion on the presidential candidates evolving?
  • Which of our customers are unsatisfied? Why?
• Opinion Mining (OM)
  • a recent discipline at the crossroads of information retrieval and computational linguistics, concerned not with the subject of a document but with the opinion it expresses
• Related Areas
  • Data Mining (DM), Information Retrieval (IR), Text Classification (TC), Text Summarization (TS)
Center for E-Business Technology
Agenda
• Introduction
• Development of Linguistic Resource
  • Conjunction Method
  • PMI Method
  • WordNet Expansion Method
  • Gloss Use Method
• Sentiment Classification
  • PMI Method
  • Machine Learning Method
  • NLP Combined Method
• Extracting and Summarizing Opinion Expression
  • Statistical Approach
  • NLP-Based Approach
• Discussion
Development of Linguistic Resource (1)
• Linguistic resources can be used to extract opinion and to classify the sentiment of text
• Appraisal Theory
  • Sentiment-related properties are well defined
  • A framework of linguistic resources that describes how writers and speakers express inter-subjective and ideological positions
  • The underlying linguistic foundation of OM
• Tasks
  • Determining the subjectivity of a term
  • Determining term orientation
  • Determining the strength of term attitude
• Example
  • Objective: vertical, yellow, liquid
  • Subjective
    • Positive: good < excellent
    • Negative: bad < terrible
Development of Linguistic Resource (2)
• Conjunction Method
• PMI Method
  • Orientation
  • Subjectivity
• WordNet Expansion Method
• Gloss Use Method
  • Orientation
  • Subjectivity
  • SentiWordNet
Conjunction Method - overview
• Hatzivassiloglou and McKeown, 1997
• Hypothesis
  • Adjectives joined by ‘and’ usually have the same orientation, while adjectives joined by ‘but’ usually have opposite orientations.
• Process
  • Randomly selected adjectives with known positive or negative orientation serve as seed terms for predicting orientation.
  • All conjunctions of adjectives are extracted from the corpus.
  • A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation.
  • A clustering algorithm separates the adjectives into two subsets of different orientation, placing as many words of the same orientation as possible into the same subset.
  • The average frequencies in the two groups are compared, and the group with the higher frequency is labeled positive.
• [Figure: process diagram; labels: positive and negative seed terms, corpus, ‘and’, ‘but’]
Conjunction Method - objective function and constraints
• Select the partition pmin that minimizes Φ(p)
  • |Ci|: the cardinality of cluster i
  • d(x, y): the dissimilarity between adjectives x and y
  • Dissimilarity between adjectives in the same cluster is minimized, and dissimilarity between adjectives in different clusters is maximized.
• Experiments
  • HM term set: 1,336 adjectives
    • 657 positive, 679 negative terms
  • Methods to improve the performance of orientation prediction
    • ‘But’ rule: most conjunctions link adjectives of the same orientation, while conjunctions with ‘but’ mostly link adjectives of opposite orientation
    • Log-linear regression model
    • Morphological relationships
      • adequate-inadequate or thoughtful-thoughtless
  • Log-linear model with morphological relationships: 82.5% accuracy
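The formula for Φ did not survive on this slide. A plausible reconstruction, assuming the objective takes the form given by Hatzivassiloglou and McKeown (1997) and using the slide's symbols |Ci| and d(x, y), with the partition p consisting of the two clusters C1 and C2:

```latex
\Phi(p) = \sum_{i=1}^{2} \left( \frac{1}{|C_i|}
          \sum_{\substack{x, y \in C_i \\ x \neq y}} d(x, y) \right),
\qquad
p_{\min} = \arg\min_{p} \Phi(p)
```

Each cluster's total pairwise dissimilarity is scaled by the cluster size, so Φ is small when both clusters are internally homogeneous.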
PMI Method - overview
• Pointwise Mutual Information (PMI)
  • a measure of association used in information theory and statistics
• Orientation
  • Turney and Littman, 2003
  • terms with similar orientation tend to co-occur in documents
• Subjectivity
  • Baroni and Vegnaduzzo, 2004
  • subjective adjectives tend to occur near other subjective adjectives
PMI Method - predicting semantic orientation
• A modified PMI was measured using the number of results returned by the AltaVista search engine with the NEAR operator
• Predicting the semantic orientation of a term, SO(t)
  • t: target term
  • ti: paradigmatic term
• Experiments
  • With the HM term set and three corpora
  • With a small corpus, accuracy is no higher than that of the conjunction method.
  • With a large corpus, accuracy is higher than that of the conjunction method.
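The formula for SO(t) is missing from the slide. A sketch in the slide's notation, assuming the usual definition of PMI and writing S_p and S_n (names introduced here, not on the slide) for the sets of positive and negative paradigmatic terms:

```latex
\mathrm{PMI}(t, t_i) = \log_2 \frac{\Pr(t, t_i)}{\Pr(t)\,\Pr(t_i)},
\qquad
\mathrm{SO}(t) = \sum_{t_i \in S_p} \mathrm{PMI}(t, t_i)
               - \sum_{t_i \in S_n} \mathrm{PMI}(t, t_i)
```

With search-engine counts, the probabilities are estimated from hit counts, for example Pr(t, ti) from the number of results for the query `t NEAR ti`. A term is taken to be positive when SO(t) > 0 and negative when SO(t) < 0.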
WordNet Expansion Method
• Hu et al., 2004
  • used the synonym and antonym relationships between words
• Hypothesis
  • adjectives usually share the orientation of their synonyms and have the opposite orientation of their antonyms
• Starting from a set of seed adjectives, an orientation can be assigned to all adjectives in WordNet by a procedure that explores the cluster graphs.
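The propagation idea can be sketched as a breadth-first walk over synonym and antonym links. This is a minimal illustration, not the procedure of Hu et al.: the `synonyms` and `antonyms` dictionaries are hand-made toy data standing in for WordNet, and `expand_orientation` is a hypothetical helper name.

```python
from collections import deque

def expand_orientation(seeds, synonyms, antonyms):
    """Propagate +1/-1 orientation from seed adjectives through the graph.

    Synonyms inherit the orientation of the current word; antonyms get the
    opposite one. The first label a word receives is kept.
    """
    orientation = dict(seeds)
    queue = deque(seeds)  # iterates over the seed words
    while queue:
        word = queue.popleft()
        sign = orientation[word]
        for other in synonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = sign
                queue.append(other)
        for other in antonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = -sign
                queue.append(other)
    return orientation

# Toy relations (assumed for illustration only).
synonyms = {"good": ["fine", "decent"], "bad": ["poor"], "poor": ["lousy"]}
antonyms = {"good": ["bad"], "decent": ["indecent"]}
seeds = {"good": +1}

orientation = expand_orientation(seeds, synonyms, antonyms)
for word, sign in sorted(orientation.items()):
    print(word, "positive" if sign > 0 else "negative")
```

Words that are not reachable from any seed stay unlabeled, which is one reason the choice of seed set matters.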
Gloss Use Method - overview
• Esuli et al., 2005, 2006
• Hypothesis
  • Orientation
    • terms with similar orientation have similar glosses
  • Subjectivity
    • terms with similar orientation have similar glosses
    • terms without orientation have non-oriented glosses
• SentiWordNet
  • All words in WordNet have three scores
    • positivity, negativity, and objectivity
  • Each term sense is positioned in an inverted triangle
Gloss Use Method - classification process
• Process
  1. A seed set (Lp, Ln) is provided as input.
  2. Lexical relations (e.g. synonymy) from a thesaurus or online dictionary are used to extend the seed set. Added to the original terms, the new terms yield two richer sets, Trp and Trn; together they form the training set for the learning phase of Step 4.
  3. For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti found in a machine-readable dictionary. Each representation is converted into vector form by standard text indexing techniques.
  4. A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set.
• Experiments
  • Classifiers: NB, SVM, PrTFIDF
  • 87.38% accuracy
Development of Linguistic Resource - Summary
Sentiment Classification
• The process of identifying the sentiment, or polarity, of a piece of text or a document
  • Document-level
  • Sentence-level, phrase-level
  • Feature-level
    • Define the target of the opinion and assign the sentiment of the target
• Document-level Sentiment Classification Methods
  • PMI Method
  • Machine Learning Method
    • Default Classifiers
    • Enhanced Classifier
  • NLP Combined Method
    • A Two-Step Classification
    • Combining Appraisal Theory
PMI Method
• Turney, 2002
• Process
  • Only two-word phrases containing adjectives or adverbs are extracted
  • Semantic orientation of a phrase
    • SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”)
  • The semantic orientation of a review is the average semantic orientation of its phrases
• Experiments
  • 410 reviews from Epinions (epinion.com): 240 positive (recommended), 170 negative (not recommended)
  • Calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours
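A minimal sketch of the scoring step, assuming hit counts have already been collected from a search engine. Expanding the two PMI terms, the phrase's own hit count cancels, leaving a ratio of four counts. The function names, the sample counts, and the small smoothing constant `eps` (added to avoid division by zero) are assumptions for illustration.

```python
import math

def phrase_so(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").

    near_excellent / near_poor: hits for the phrase NEAR each reference word.
    hits_excellent / hits_poor: hits for each reference word on its own.
    """
    return math.log2(((near_excellent + eps) * hits_poor)
                     / ((near_poor + eps) * hits_excellent))

def review_so(phrase_counts, hits_excellent, hits_poor):
    """Average the semantic orientation over all extracted phrases."""
    scores = [phrase_so(ne, npoor, hits_excellent, hits_poor)
              for ne, npoor in phrase_counts]
    return sum(scores) / len(scores)

# Hypothetical counts: (hits NEAR "excellent", hits NEAR "poor") per phrase.
HITS_EXCELLENT, HITS_POOR = 1_000_000, 1_000_000
favourable_review = [(120, 10), (40, 35)]
unfavourable_review = [(5, 90), (12, 60)]

for name, counts in [("favourable", favourable_review),
                     ("unfavourable", unfavourable_review)]:
    score = review_so(counts, HITS_EXCELLENT, HITS_POOR)
    print(name, "recommended" if score > 0 else "not recommended")
```

Because every phrase needs several search queries, the cost grows with the number of phrases, which is consistent with the long running time reported in the experiments.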
ML - Default Classifier
• Pang et al., 2002
  • A special case of text categorization with sentiment-based rather than topic-based categories
• Document modeling
  • standard bag-of-features framework
• Experiments
  • Data: movie reviews (Internet Movie Database), with ratings mapped to negative, neutral, or positive
  • Naïve Bayes, Maximum Entropy, Support Vector Machine
  • In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to do the best, although the differences aren’t very large.
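As an illustration of the bag-of-features setup, here is a small multinomial Naïve Bayes classifier with add-one smoothing over unigram counts. It is a sketch under simplifying assumptions (whitespace tokenization, count features, four invented training reviews), not a reproduction of the experiments above.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns per-class counts and vocabulary."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
train = [
    ("a great and moving film".split(), "positive"),
    ("great acting and a great script".split(), "positive"),
    ("a dull and boring film".split(), "negative"),
    ("boring plot and terrible acting".split(), "negative"),
]
model = train_nb(train)
print(predict_nb(model, "great film".split()))
print(predict_nb(model, "terrible boring plot".split()))
```

The same feature vectors could be fed to a Maximum Entropy or SVM learner; only the classifier changes, not the document model.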
ML - Using Only Subjective Sentences
• Pang and Lee, 2004
  • improved polarity classification by removing objective sentences
• A subjectivity detector determines whether each sentence is subjective or not
  • Standard subjectivity classifier
  • Subjectivity classifier using proximity relationships
• Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.
NLP Combined Method - A Two-Step Classification
• Wilson et al., 2005
• A Two-Step Contextual Polarity Classification
  • employs machine learning and 28 linguistic features
  • document polarity: the average polarity of phrases
  • Step 1. A neutral-polar classifier classifies each phrase containing a clue as neutral or polar
  • Step 2. A polarity classifier takes all phrases marked as polar in Step 1 and disambiguates their contextual polarity (positive, negative, both, or neutral)
• 28 features, extracted using NLP techniques with a dependency parser
  • 4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features, 1 Document Feature
• Experiments
  • Data: Multi-perspective Question Answering (MPQA) Opinion Corpus
  • [Tables: neutral-polar classification (%); polarity classification (%)]
NLP Combined Method - Combining Appraisal Theory
• Whitelaw et al., 2005
  • applied appraisal theory to the machine learning methods of Pang and Lee
• Structure of an appraisal
  • An example: “not very happy”
• Experiments
  • a lexicon of 1,329 appraisal entities was produced semi-automatically from 400 seed terms in around twenty man-hours
  • combining attitude type and orientation: 90.2% accuracy
Sentiment Classification - Summary
Extracting and Summarizing Opinion Expression
• Goal
  • Extract opinion expressions from large collections of reviews and present them in an effective way
• Tasks
  • Feature Extraction
    • Sentiment classification at the feature level requires extracting the features that are the targets of opinion words
  • Sentiment Assignment
    • Each feature is usually classified as either favorable or unfavorable.
  • Visualization
    • Extracted opinion expressions are summarized and visualized.
• Methods
  • Statistical Approaches
    • ReviewSeer (2003)
    • Opinion Observer (2004)
    • Red Opal (2007)
  • NLP-Based Approaches
    • Kanayama System (2004)
    • WebFountain (2005)
    • OPINE (2005)
• [Figure: process diagram; labels: product, product reviews, Extract Features, Assign Sentiment, Summarize]
Opinion Observer - Overview
• Hu and Liu, 2005
  • Extracts and summarizes opinion expressions from customer reviews on the Web.
  • Mines only the product features on which customers have expressed opinions, and whether those opinions are positive or negative
• Overall process
  • Review crawling
  • Feature extraction
  • Sentiment assignment
    • Opinion word extraction
    • Opinion orientation identification
  • Summary generation
• [Figure: overall process]
Opinion Observer - Tasks
• Feature Extraction
  • Product features are extracted from nouns and noun phrases by the association miner CBA
  • Compactness pruning, redundancy pruning
• Sentiment Assignment
  • Opinion sentence: a sentence that contains one or more product features and one or more opinion words
  • Adjectives are the only opinion words
  • The prior polarity of adjectives was identified by the WordNet expansion method with seed terms
  • Infrequent features are extracted by using frequent opinion words
  • The polarity of a sentence is its dominant orientation
  • Extracted form: (product feature, # of positive sentences, # of negative sentences)
• Experiments
  • Large collection of reviews of 15 electronic products
  • 86.3% recall, 84.0% precision
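The sentiment assignment and the extracted form can be illustrated with a short sketch. It assumes the product features and the prior polarity of adjectives are already known; the feature set, the polarity lexicon, the sample sentences, and the function names are invented for illustration, and refinements such as negation handling are left out.

```python
from collections import defaultdict

def sentence_orientation(tokens, prior_polarity):
    """Dominant orientation of a sentence: sign of the summed word polarities."""
    score = sum(prior_polarity.get(tok, 0) for tok in tokens)
    return (score > 0) - (score < 0)  # +1, -1, or 0

def summarize(sentences, features, prior_polarity):
    """Return (feature, # positive sentences, # negative sentences) tuples."""
    counts = defaultdict(lambda: [0, 0])
    for sentence in sentences:
        tokens = sentence.lower().split()
        orientation = sentence_orientation(tokens, prior_polarity)
        if orientation == 0:
            continue  # no opinion words, or no dominant orientation
        for feature in features:
            if feature in tokens:
                counts[feature][0 if orientation > 0 else 1] += 1
    return [(f, pos, neg) for f, (pos, neg) in sorted(counts.items())]

# Invented toy inputs.
features = {"battery", "screen"}
prior_polarity = {"great": 1, "good": 1, "poor": -1, "bad": -1, "dim": -1}
sentences = [
    "The battery is great",
    "The screen is dim and the battery is poor",
    "Good screen",
    "It ships with a charger",
]

summary = summarize(sentences, features, prior_polarity)
for feature, pos, neg in summary:
    print(feature, pos, neg)
```

Dividing each count by the feature's total number of opinion sentences gives the normalized positive and negative portions used when features are compared side by side.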
Opinion Observer - Visualization
• Product features are compared in a bar graph
• The numbers of positive and negative sentences for each feature are normalized
• [Figure: bar graph; labels: positive portion, negative portion]
WebFountain - Overview
• Yi et al., 2005
  • Extracts the target features of sentiment from various resources and assigns polarity to the features
• System Architecture
  • Sentiment Miner
    • Analyzes grammatical sentence structures and phrases using NLP techniques
WebFountain - Tasks
• Feature Extraction
  • Candidate features
    • a part-of relationship with the given topic
    • an attribute-of relationship with the given topic
    • an attribute-of relationship with a known feature of the given topic
  • The bBNP (Beginning definite Base Noun Phrase) heuristic is used
  • Select the bnp (base noun phrase) candidates that have a high likelihood ratio
  • Experiments
    • Precision - digital camera: 97%, music reviews: 100%
• Sentiment Assignment
  • Parse and traverse with two linguistic resources
    • Sentiment lexicon: defines the sentiment polarity of terms
    • Sentiment pattern database: contains the sentiment assignment patterns of predicates
  • Experiments
    • Product reviews
    • Recall 56%, Precision 87%
WebFountain - Visualization
• Web interface listing sentiment-bearing sentences about a given product
Extracting and Summarizing Opinion Expression - Summary
Discussion
• OM is a growing research discipline related to various research areas, such as IR, computational linguistics, TC, TS, and DM.
• Surveyed three topics and summarized them.
• For Korean OM?
  • There is no published research on Korean OM.
  • Language differences may impose some limits on the methods used in the OM subtasks.
    • Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text
    • The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior polarity of words for the PMI or conjunction methods.
  • Research into Korean OM must be conducted in conjunction with other related areas.
Discussion - Research Map of OM
Thank you