
A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining



  1. A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining Wei Jin Department of Computer Science, North Dakota State University, USA Hung Hay Ho Department of Computer Science & Engineering, State University of New York at Buffalo, USA

  2. Outline • Motivation • Related Work • A Lexicalized HMM-based Learning Framework • Experimental Results • Conclusion

  3. Introduction • Two main types of information on the Web: • Facts (what Google searches for) • Opinions (Google does not search for opinions) • Opinions are hard to express with keyword queries • The current search ranking strategy is not appropriate for opinion search • E-commerce is more and more popular • More and more products are sold on the web • More and more people buy products online • Customers share their opinions and hands-on experiences with products • Reading through all customer reviews is difficult • For popular items, the number of reviews can reach hundreds or even thousands

  4. Introduction – Applications • Businesses and organizations: product and service benchmarking, market intelligence • Businesses spend a huge amount of money to find consumer sentiments and opinions • Consultants, surveys, focus groups, etc. • Ads placement: placing ads in user-generated content • Place an ad when someone praises a product • Place an ad from a competitor if someone criticizes a product

  5. Design Goal • Design a framework capable of • extracting, learning and classifying product entities and opinion expressions automatically from product reviews. • E.g., Battery Life is quite impressive. • Three main tasks Task 1: Identifying and extracting product entities that have been commented on in each review. Task 2: Identifying and extracting opinion sentences describing the opinions for each recognized product entity. Task 3: Determining whether the opinions on the product entities are positive or negative.

  6. Related Work • Zhuang et al., 2006 • Classified and summarized movie reviews by extracting high-frequency feature keywords and high-frequency opinion keywords. • Feature-opinion pairs were identified by using a dependency grammar graph. • Popescu and Etzioni, 2005 • Proposed a relaxation labeling approach to find the semantic orientation of words. • However, their approach only extracted feature words with frequency greater than an experimentally set threshold value and ignored low-frequency feature words. • Hu and Liu, 2004 • Proposed a statistical approach capturing high-frequency feature words by using association rules. • Infrequent feature words are captured by extracting known opinion words’ adjacent noun phrases. • A summary is generated by using high-frequency feature words and ignoring infrequent features. • However, the capability of recognizing phrase features is limited by the accuracy of recognizing noun-group boundaries. • Lacks an effective way to address infrequent features. • Our approach • Proposes a framework that naturally integrates multiple linguistic features into automatic learning. • Identifies complex product-specific features and infrequently mentioned features (possibly low-frequency phrases in the reviews). • Self-learns and predicts new features based on the patterns it has seen in the training data.

  7. The Proposed Technique • Lexicalized HMMs • Previously used in Part-of-Speech (POS) tagging and Named Entity Recognition (NER) problems. • Correlating the web opinion mining task with POS tagging and NER may well be a significant contribution of this work in itself.

  8. Task 1: Entity Categories and Tag Sets

  9. Task 1: Basic Tag Set and Pattern Tag Set

  10. Task 1: Basic Tag Set and Pattern Tag Set • An entity can be a single word or a phrase. • A word w in an entity may take one of the following four patterns to present itself: • w is an independent entity; • w is the beginning component of an entity; • w is in the middle of an entity; • w is at the end of an entity.
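The four patterns above can be encoded mechanically. A minimal sketch, assuming the -BOE/-MOE/-EOE suffix convention seen on the example slide (the function name and its exact interface are mine, not the authors'):

```python
def pattern_tags(entity_words, tag):
    """Assign pattern tags to each word of an entity phrase.

    A single-word entity keeps the bare (basic) tag; a multi-word
    entity marks its beginning (-BOE), middle (-MOE) and end (-EOE)
    components, matching the four patterns listed on the slide.
    """
    n = len(entity_words)
    if n == 1:
        return [(entity_words[0], tag)]          # independent entity
    tagged = [(entity_words[0], f"{tag}-BOE")]   # beginning component
    tagged += [(w, f"{tag}-MOE") for w in entity_words[1:-1]]  # middle
    tagged.append((entity_words[-1], f"{tag}-EOE"))            # end
    return tagged
```

For example, tagging the phrase "ease of transferring the pictures" as a product feature yields one -BOE tag, three -MOE tags and one -EOE tag.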

  11. An Example of Hybrid Tag and Basic Tag Representation I love the ease of transferring the pictures to my computer. • Hybrid tags: • <BG>I</BG> <OPINION_POS_EXP>love</OPINION_POS_EXP> <BG>the</BG> <PROD_FEAT-BOE>ease</PROD_FEAT-BOE> <PROD_FEAT-MOE>of</PROD_FEAT-MOE> <PROD_FEAT-MOE>transferring</PROD_FEAT-MOE> <PROD_FEAT-MOE>the</PROD_FEAT-MOE> <PROD_FEAT-EOE>pictures</PROD_FEAT-EOE> <BG>to</BG> <BG>my</BG> <BG>computer</BG> • Basic tags: • <BG>I</BG> <OPINION_POS_EXP>love</OPINION_POS_EXP> <BG>the</BG> <PROD_FEAT>ease of transferring the pictures</PROD_FEAT> <BG>to</BG> <BG>my</BG> <BG>computer</BG>
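Collapsing a hybrid-tagged sequence back into the basic-tag form is a simple span merge. A hypothetical sketch (not the authors' code):

```python
def merge_pattern_tags(tagged):
    """Collapse -BOE/-MOE/-EOE hybrid tags back into basic tags.

    Words between a -BOE tag and the matching -EOE tag are joined
    into one phrase carrying the bare entity tag; all other words
    keep their tags unchanged.
    """
    merged, buf, base = [], [], None
    for word, tag in tagged:
        if tag.endswith("-BOE"):
            base, buf = tag[:-4], [word]    # open a new entity span
        elif tag.endswith("-MOE"):
            buf.append(word)                # extend the current span
        elif tag.endswith("-EOE"):
            buf.append(word)                # close the span
            merged.append((" ".join(buf), base))
            buf, base = [], None
        else:
            merged.append((word, tag))      # basic tag, pass through
    return merged
```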

  12. Task 1: Lexicalized HMMs • Integrate linguistic features • Part-of-Speech (POS), lexical patterns • An observable state: a pair (word_i, POS(word_i)) • Given a sequence of words W = w_1 w_2 w_3 … w_n and corresponding parts-of-speech S = s_1 s_2 s_3 … s_n, find an appropriate sequence of hybrid tags T that maximizes the conditional probability P(T|W,S)

  13. Task 1: Approximations to the General Model • First approximation: first-order HMMs based on the independence hypothesis P(t_i | t_{i-K} … t_{i-1}) ≈ P(t_i | t_{i-1}) • Second approximation: combines the POS information with the lexicalization technique: • The assignment of the current tag t_i is assumed to depend not only on its previous tag t_{i-1} but also on the previous J (1 ≤ J ≤ i-1) words w_{i-J} … w_{i-1}. • The appearance of the current POS s_i is assumed to depend both on the current tag t_i and on the previous L (1 ≤ L ≤ i-1) words w_{i-L} … w_{i-1}. • The appearance of the current word w_i is assumed to depend not only on the current tag t_i and current POS s_i, but also on the previous K (1 ≤ K ≤ i-1) words w_{i-K} … w_{i-1}. • Maximum Likelihood Estimation: used to estimate the parameters
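Under the first approximation, maximum likelihood estimation of the transition probabilities reduces to relative tag-bigram frequencies. A simplified, non-lexicalized sketch (the paper's full model also conditions on preceding words and POS):

```python
from collections import Counter

def mle_transitions(tag_sequences):
    """MLE of first-order transition probabilities P(t_i | t_{i-1}):
    count of the tag bigram divided by the count of the preceding
    tag, over all training tag sequences."""
    bigrams, prev_counts = Counter(), Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {bg: c / prev_counts[bg[0]] for bg, c in bigrams.items()}
```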

  14. Task 1: Smoothing Technique • Solves the zero probabilities MLE assigns to events not observed in the training data • Linear interpolation smoothing technique: • smooth higher-order models with their relevant lower-order models, or • smooth the lexicalized parameters using the related non-lexicalized probabilities, where λ, β and α denote the interpolation coefficients • Decoding of the best hybrid tag sequence: the Viterbi algorithm
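Decoding with the Viterbi algorithm over smoothed parameters might look like the following first-order sketch. It is a simplification of the paper's model (which also conditions on POS and preceding words); a small probability floor stands in for the interpolation back-off so unseen events never zero out a path:

```python
def viterbi(words, tags, trans, emit, init):
    """Viterbi decoding of the most likely tag sequence for a
    first-order HMM. trans[(t_prev, t)] and emit[(t, w)] hold
    smoothed probabilities; missing entries fall back to FLOOR."""
    FLOOR = 1e-8
    # initialize the first column of the dynamic-programming table
    v = [{t: init.get(t, FLOOR) * emit.get((t, words[0]), FLOOR)
          for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            # best predecessor for tag t at this position
            best = max(tags, key=lambda p: v[-1][p] * trans.get((p, t), FLOOR))
            col[t] = (v[-1][best] * trans.get((best, t), FLOOR)
                      * emit.get((t, w), FLOOR))
            ptr[t] = best
        v.append(col)
        back.append(ptr)
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```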

  15. Task 2: Identifying and Extracting Opinion Sentences • Opinion sentences: sentences that express an opinion on product entities • Not considered effective opinion sentences: • Sentences that describe product entities without expressing reviewers’ opinions • Sentences that express opinions on another product model’s entities • Model numbers, such as DMC-LS70S, P5100 and A570IS, are identified using the regular expression “[A-Z-]+\d+([A-Za-z]+)?”
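The slide's regular expression for model numbers can be used directly; a small sketch (the helper name is mine):

```python
import re

# pattern quoted on the slide: letters/hyphens, digits, optional letters
MODEL_RE = re.compile(r"[A-Z-]+\d+([A-Za-z]+)?")

def mentions_model(sentence):
    """Return True if the sentence contains a camera model number,
    e.g. DMC-LS70S, P5100 or A570IS."""
    return bool(MODEL_RE.search(sentence))
```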

  16. Task 3: Opinion Orientation Classification • Opinion orientation: not simply equal to the opinion entity (word/phrase)’s orientation • E.g., “I can tell you right now that the auto mode and the program modes are not that good.” (a negative comment on both “auto mode” and “program modes”) • Natural language rules reflecting sentence context: • For each recognized product entity, search for its matching opinion entity (defined as the nearest opinion word/phrase identified by the tagger) • The orientation of this matching opinion entity serves as the initial opinion orientation for the corresponding product entity • Natural language rules then deal with specific language constructs that may change the opinion orientation • E.g., the presence of negation words (such as not)

  17. Task 3: Opinion Orientation Classification • Lines 8 to 23: check for any negation word (e.g., not, didn’t, don’t) within a five-word distance in front of an opinion entity, and change the opinion orientation accordingly, except when: • A negation word appears in front of a coordinating conjunction (e.g., and, or, but) (lines 10-13) • A negation word appears after the appearance of a product entity during the backward search within the five-word window (lines 14-17) • Lines 27 to 32: handle the coordinating conjunction “but” and prepositions such as “except” and “apart from”
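The negation rules above can be sketched as follows, simplified to single-token entities (the NEGATIONS/CONJUNCTIONS sets are illustrative, and this is not the paper's exact pseudocode, whose line numbers refer to a listing not reproduced here):

```python
NEGATIONS = {"not", "didn't", "don't"}
CONJUNCTIONS = {"and", "or", "but"}

def orient(tokens, opinion_idx, product_idx, initial):
    """Search backward up to five words from the opinion entity;
    a negation word flips the initial orientation unless one of the
    two exceptions on the slide applies."""
    start = max(0, opinion_idx - 5)
    for i in range(opinion_idx - 1, start - 1, -1):
        if tokens[i].lower() in NEGATIONS:
            # exception: negation directly before a coordinating conjunction
            if i + 1 < len(tokens) and tokens[i + 1].lower() in CONJUNCTIONS:
                continue
            # exception: the backward search met the product entity first
            # (i.e. the negation lies before the product entity)
            if i < product_idx < opinion_idx:
                continue
            return "negative" if initial == "positive" else "positive"
    return initial
```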

  18. Experiments • Corpus: Amazon’s digital camera reviews • Experimental design: • Reviews for the first 16 unique cameras listed on Amazon.com during November 2007 were crawled • Each individual review’s content, model number and manufacturer name were extracted • Sentence segmentation • POS parsing • Training design: • 1728 review documents • Two sets: • One set (293 documents for 6 cameras): manually tagged by experts • The remaining documents (1435 documents for 10 cameras): used in the bootstrapping process • Two challenges: • Inconsistent terminology used to describe product entities • E.g., battery cover, battery/SD door, battery/SD cover, battery/SD card cover • Recognizing rarely mentioned entities (i.e., infrequent entities) • E.g., only two reviewers mentioned “poor design on battery door” • Bootstrapping: • Reduces the effort of manually labeling a large set of training documents • Enables self-directed learning in situations where collecting a large training set would be expensive and difficult to accomplish

  19. Bootstrapping • The parent process: the Master • Coordinates the bootstrapping process; extracts and distributes high-confidence data to each worker • The rest: two workers (child processes) • Split the training documents into two halves, t1 and t2, used as seeds for each worker’s HMM • Each worker trains its own HMM classifier and tags the bootstrap document set • The Master inspects each sentence tagged by each HMM classifier and extracts opinion sentences that are agreed upon by both classifiers • Each newly discovered sentence is stored in the database • The Master randomly splits the newly discovered data into two halves t1’ and t2’, and adds t1’ and t2’ to the training sets of the two workers respectively
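The master/worker loop above can be sketched as a co-training procedure. Here train_fn and tag_fn are placeholders for the HMM training and tagging routines (assumptions, not the paper's actual API):

```python
import random

def bootstrap(labeled_docs, bootstrap_set, train_fn, tag_fn, rounds=3):
    """Co-training sketch of the bootstrapping scheme: two workers are
    seeded with disjoint halves of the labeled data; each round, both
    tag the bootstrap pool, the master keeps sentences they agree on,
    and the agreed data is split back between the workers."""
    docs = list(labeled_docs)
    random.shuffle(docs)
    half = len(docs) // 2
    t1, t2 = docs[:half], docs[half:]          # seeds t1 and t2
    pool = list(bootstrap_set)
    for _ in range(rounds):
        c1, c2 = train_fn(t1), train_fn(t2)    # each worker trains its own HMM
        # master keeps only sentences both classifiers agree on
        agreed = [s for s in pool if tag_fn(c1, s) == tag_fn(c2, s)]
        pool = [s for s in pool if s not in agreed]
        random.shuffle(agreed)                 # split t1' / t2'
        mid = len(agreed) // 2
        t1 += agreed[:mid]
        t2 += agreed[mid:]
    return t1, t2
```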

  20. Bootstrapping results

  21. Evaluation • The review documents for 6 cameras: manually labeled by experts • The largest four data sets (containing 270 documents): used for 4-fold cross-validation • The remaining review documents for 2 cameras (containing 23 documents): used for training only • The bootstrap document set (containing 1435 documents for 10 cameras): used to extract high-confidence data through self-learning • Evaluation metrics: • Recall, precision and F-score • Computed by comparing the results tagged by the system with the manually tagged truth data • Only an exact match is considered a correct recognition: • Entity recognition: the exact same word/phrase identified and classified correctly; each identified entity occurs in the same sentence, same position and same document as in the truth data • Opinion sentence extraction: the exact same sentence from the same document • Opinion orientation classification: the exact same entity and entity category identified with the correct orientation (positive or negative)
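Under the exact-match criterion, the three metrics reduce to set operations over (document, position, text, category) tuples. A minimal sketch:

```python
def prf(predicted, truth):
    """Precision, recall and F-score under exact match: an item counts
    as correct only if the identical tuple appears in the truth data."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # exact matches
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f
```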

  22. Baseline • A rule-based baseline system • Motivated by the approaches of (Turney 2002) and (Hu and Liu 2004) • Uses a number of rules to identify opinion-bearing words • Matching nouns: product entities • Matching adjectives: opinion words • Adjectives’ semantic orientations: • Determined using the bootstrapping technique proposed in (Hu and Liu 2004) • Twenty-five commonly used positive adjectives and twenty-five commonly used negative adjectives as seeds • Expand these two seed lists by searching synonyms and antonyms for each seed word • Newly discovered words are added into their corresponding seed lists • The orientations of extracted adjectives: determined by checking for these words in the lists
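The seed-list expansion in this baseline can be sketched as a fixed-point loop: synonyms keep a word's orientation, antonyms flip it, and growth continues until the lists stabilize. The synonyms/antonyms dicts stand in for a WordNet-style lookup (hypothetical, not the baseline's actual lexicon interface):

```python
def expand_seeds(pos_seeds, neg_seeds, synonyms, antonyms):
    """Expand positive/negative seed lists until no new words appear.
    Synonyms of a word share its orientation; antonyms take the
    opposite orientation."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    changed = True
    while changed:
        changed = False
        for word in list(pos):
            new_pos = synonyms.get(word, set()) - pos
            new_neg = antonyms.get(word, set()) - neg
            if new_pos or new_neg:
                pos |= new_pos
                neg |= new_neg
                changed = True
        for word in list(neg):
            new_neg = synonyms.get(word, set()) - neg
            new_pos = antonyms.get(word, set()) - pos
            if new_pos or new_neg:
                neg |= new_neg
                pos |= new_pos
                changed = True
    return pos, neg
```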

  23. Evaluation Results

  24. Evaluation Results

  25. Evaluation Results

  26. Evaluation Results

  27. Conclusion • Proposed a new and robust machine learning approach for web opinion mining and extraction • Self-learns new vocabularies based on observed patterns (not supported by rule-based approaches) • Identifies complex product entities, opinion expressions and infrequently mentioned entities • A bootstrapping approach for situations where collecting a large training set would be expensive and difficult to accomplish • Future directions: • Expansion of the datasets • Researching the role of pronoun resolution in improving the mining results
