## Opinion Detection by Transfer Learning

**Opinion Detection by Transfer Learning**
11-742 Information Retrieval Lab
Grace Hui Yang
Advised by Prof. Yiming Yang

**Outline**
• Introduction
• The Problem
• Transfer Learning by Constructing Informative Prior
• Datasets
• Evaluation Method
• Experimental Results
• Conclusion

**Introduction**
• TREC 2006 Blog Track
• Opinion Detection Task

<num> Number: 851
<title> "March of the Penguins"
<desc> Description: Provide opinion of the film documentary "March of the Penguins".
<narr> Narrative: Relevant documents should include opinions concerning the film documentary "March of the Penguins". Articles or comments about penguins outside the context of this film documentary are not relevant.

**Opinion Detection Literature Review**
• Researchers in the Natural Language Processing (NLP) community
  • Turney (2002): groups online words by their pointwise mutual information with "excellent" and "poor"
  • Riloff & Wiebe (2003): use a high-precision classifier to get high-quality opinions and non-opinions, then extract syntactic patterns; repeat this process to bootstrap
  • Pang et al. (2002): treat opinion and sentiment detection as a text classification problem
    • Naive Bayes, Maximum Entropy, SVM + unigram presence (82.9%)
  • Pang & Lee (2005): use minimum cuts to cluster sentences based on their subjectivity and sentiment orientation
• Researchers from the data mining community
  • Morinaga et al. (2002): use word polarity and syntactic pattern matching rules to extract opinions, and PCA to create a correspondence between product names and keywords

**Existing System**
• Query Expansion
• Document Retrieval
• Binary Text Classification by Bayesian Logistic Regression

**No Available Training Data**
• Transfer Learning
  • Transfer knowledge across similar tasks in different domains
  • Generalize knowledge from limited training data
  • Discover underlying general structures across domains

**Transfer Learning Literature Review**
• Baxter (1997) and Thrun (1996): both used hierarchical Bayesian learning
• Lawrence and Platt (2004), Yu et al. (2005): also use hierarchical Bayesian models to learn hyper-parameters of a Gaussian process
• Ando and Zhang (2005): proposed a framework for Gaussian logistic regression for text classification
• Raina et al. (2006): continued this approach and built informative priors for Gaussian logistic regression

**Transfer Learning**
• The approach presented in this project is inspired by the work of Raina, Ng & Koller (2006) on text classification
• Transfer common knowledge (word dependence) across similar tasks by constructing an informative prior in a Bayesian logistic regression framework

**Logistic Regression Framework**
• Logistic regression assumes a sigmoid-like data distribution
• To avoid overfitting, a multivariate Gaussian prior is placed on θ
• Maximum a posteriori (MAP) estimation

**Non-diagonal Covariance**
• Zero-mean, equal-variance prior
  • Cannot capture relationships among words
• Zero-mean, non-diagonal covariance prior
  • Models word dependency in the covariance matrix's off-diagonal entries

**Pairwise Covariance**
• Covariance definition: Cov(θ_i, θ_j) = E[(θ_i − E[θ_i])(θ_j − E[θ_j])]
• Given zero mean, Cov(θ_i, θ_j) = E[θ_i θ_j]

**Get Covariance by MCMC**
• Markov Chain Monte Carlo (MCMC)
• Sample V (V=4) small vocabularies of size S (S=5), each containing the two words w_i and w_j corresponding to θ_i and θ_j
• From each vocabulary, sample T (T=4) training sets of size Z (Z=3) and train an ordinary logistic regression model on the labeled data

**Get Covariance by MCMC**
• Subtract a bootstrap estimate of the covariance that is due to the randomness of training set changes

**Learning a Covariance Matrix**
• Learning a single covariance for pairs of regression coefficients is NOT all we need
• Two challenges:
• (1) Valid covariance matrix
  • A valid covariance matrix needs to be positive semi-definite (PSD)
  • Hermitian matrix (square, self-adjoint) with nonnegative eigenvalues
  • Project the matrix onto the PSD cone

**Learning a Covariance Matrix**
• (2) Pairwise calculations increase the complexity quadratically with vocabulary size
  • Represent the word dependence as a linear combination of underlying features
  • Learn the coefficients by least squared error

**Learning a Covariance Matrix By Joint Minimization**
• λ is the trade-off coefficient between the two objectives
• As λ → 0, only care about the PSD cone
• As λ → 1, only care about the word pair relationship
• Set to 0.6

**Solve the Joint Minimization**
• Convex problem, converges to the global minimum
• Fix Σ, minimize over ψ
  • Use a Quadratic Program (QP) solver
• Fix ψ, minimize over Σ
  • A special case of semi-definite programming (SDP)
  • Eigendecomposition, keeping only the nonnegative eigenvalues

**Feature Design**
• Model word dependency
• WordNet synset
• and?
• People do not always use the same general syntactic patterns to express opinion
  • "blah blah is good"
  • "awesome blah blah!"

**Target-Opinion Word Pair**
• Different opinion targets relate to different customary expressions
  • A person is knowledgeable
  • A computer processor is fast
  • A computer processor is knowledgeable (ill)
  • A person is fast (ill)
  • A computer processor is running like a horse (word polarity test fails)

**Target-Opinion Word Pair**
• From the training corpus, extract from a positive example
  • subject and object (excluding pronouns)
    • "Melvin, pig"
  • subject and BE-predicate
    • "lens, clear", "base, heavy"
  • modifier and subject
    • "good, coffee", "interesting, movie"

**Word Synonym**
• Bridge the vocabulary gap from training to testing
  • "This movie is good" in the training corpus
  • "The film is really good" in the testing corpus

**Feature Vector**
• Log-co-occurrence
• Target-Opinion
• Synonym

**Datasets**
• Training Corpus
  • Movie reviews [Pang & Lee from Cornell]
    • 10,000 sentences (5,000 opinions, 5,000 non-opinions)
  • Product reviews [Hu & Liu from UIC]
    • 4,000+ sentences (2,034 opinions, 2,173 non-opinions)
    • Digital camera, cell phone, DVD player, Jukebox, …

**Datasets**
• Test Corpus: TREC 2006 Blog corpus
  • 3,201,002 articles (TREC reports 3,215,171)
  • December 2005 to February 2006
  • Technorati, Bloglines, Blogpulse …
• For each topic, 5,000 passages are retrieved
  • Using Lemur as the search engine
  • 132,399 passages in total
  • 2,648 passages per topic
  • Each passage has 1-10 sentences (less than 100 words)

**Evaluation Method**
• Precision at 11-point recall levels
• Mean average precision (MAP)
• Answers are provided by TREC qrels
  • Document ids of documents containing an opinion
• Note that our system is developed for opinion detection at the sentence level
  • An averaged score of all the sentences in a retrieved passage
  • Extract unique document ids to compare with TREC qrels

**Experimental Results**
• Effects of Using Non-diagonal Prior Covariance
  • Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
  • Feature Selection: use word features common to movie reviews and product reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
  • Informative Prior: use movie reviews to calculate the prior covariance, then train the Gaussian logistic regression model with the informative prior ~N(0, Σ)

**Experimental Results**
• Effects of Feature Design
  • Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²), bi-gram model
  • Transfer Learning Using Synonyms: informative prior ~N(0, Σ)
  • Transfer Learning Using Target-Opinion Pairs: informative prior ~N(0, Σ)
  • Transfer Learning Using Both: informative prior ~N(0, Σ)

**Experimental Results**
• Effects of External Dataset Selection
• Negative Effect of Transfer Learning

**Why Does the Negative Effect Occur?**
• Movie reviews cover more general topics
• Product reviews share only 23% of topics

**Conclusion**
• Applying Transfer Learning in Opinion Detection
  • Transfer Learning by Informative Prior improves on brutal transfer learning by 32%
• Discovering a good feature for opinion detection
  • Target-Opinion pair
• Need to be careful when choosing external datasets to help
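
The MAP estimation described under Logistic Regression Framework and Non-diagonal Covariance can be illustrated with a minimal sketch. It is not the presented system: the optimizer (plain gradient descent), the function name `fit_map_logreg`, the step size, and the two-feature toy data are all assumptions made for illustration. It shows how a zero-mean prior N(0, Σ) with off-diagonal entries lets a word that never appears in training inherit weight from a correlated word, while a diagonal prior leaves its weight at zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_logreg(X, y, Sigma, lr=0.1, n_iter=2000):
    """MAP estimate of theta for logistic regression with prior N(0, Sigma).

    Minimizes  -sum_k log p(y_k | x_k, theta) + 0.5 * theta^T Sigma^{-1} theta
    by gradient descent (an illustrative choice of optimizer). Labels are 0/1.
    """
    precision = np.linalg.inv(Sigma)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the negative log posterior.
        grad = X.T @ (sigmoid(X @ theta) - y) + precision @ theta
        theta -= lr * grad / len(y)
    return theta

# Hypothetical toy data: feature 0 is observed in training, feature 1 never is.
X = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

# Zero-mean, equal-variance (diagonal) prior: no word dependency.
theta_diag = fit_map_logreg(X, y, np.eye(2))

# Zero-mean, non-diagonal prior: the two words are positively correlated.
theta_corr = fit_map_logreg(X, y, np.array([[1.0, 0.8], [0.8, 1.0]]))

print("diagonal prior:    ", theta_diag)
print("non-diagonal prior:", theta_corr)
```

With the diagonal prior the unseen feature keeps a zero weight; with the non-diagonal prior it is pulled toward the weight of the correlated feature, which is the kind of knowledge transfer the informative prior is meant to provide.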
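
The step "eigendecomposition, keeping only the nonnegative eigenvalues" from Solve the Joint Minimization can be sketched as follows. This is a minimal sketch of the projection onto the PSD cone alone, not of the full alternating minimization over Σ and ψ; the function name `project_to_psd` and the 3×3 example matrix are hypothetical.

```python
import numpy as np

def project_to_psd(M):
    """Project a square matrix onto the cone of PSD matrices (Frobenius norm).

    Symmetrize, eigendecompose, set negative eigenvalues to zero, rebuild.
    """
    sym = (M + M.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T

# Hypothetical matrix of pairwise covariance estimates. Each pair was estimated
# separately, so the assembled matrix is not guaranteed to be PSD.
raw = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
projected = project_to_psd(raw)

print("smallest eigenvalue before:", np.linalg.eigvalsh(raw).min())
print("smallest eigenvalue after: ", np.linalg.eigvalsh(projected).min())
```

The example matrix has one negative eigenvalue, so it is not a valid covariance matrix; after projection all eigenvalues are nonnegative up to floating-point error.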
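
The measures named under Evaluation Method can be sketched for a single topic as below. This is an illustrative implementation under stated assumptions, not the official TREC evaluation tool: the function names and document ids are hypothetical, and collapsing a ranked passage list to unique document ids by keeping the first occurrence is an assumption, since the slides do not say how multiple passages from one document are combined.

```python
def unique_in_order(doc_ids):
    """Keep the first occurrence of each document id, preserving rank order."""
    seen = set()
    ranked = []
    for doc_id in doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            ranked.append(doc_id)
    return ranked

def recall_precision_points(ranked_ids, relevant_ids):
    """(recall, precision) at each rank where a relevant document is found."""
    hits = 0
    points = []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            points.append((hits / len(relevant_ids), hits / rank))
    return points

def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic; unretrieved relevant docs count as 0."""
    if not relevant_ids:
        return 0.0
    points = recall_precision_points(ranked_ids, relevant_ids)
    return sum(p for _, p in points) / len(relevant_ids)

def eleven_point_precision(ranked_ids, relevant_ids):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = recall_precision_points(ranked_ids, relevant_ids)
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level - 1e-9), default=0.0)
        for level in levels
    ]

# Hypothetical passage-level ranking mapped to document ids, and qrels.
passage_doc_ids = ["d1", "d1", "d2", "d3", "d2", "d4", "d5"]
qrels = {"d1", "d3", "d6"}

ranked_docs = unique_in_order(passage_doc_ids)
ap = average_precision(ranked_docs, qrels)
curve = eleven_point_precision(ranked_docs, qrels)
print(ranked_docs, ap, curve)
```

Mean average precision is then the mean of `average_precision` over all topics.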