Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Masumi Shirakawa¹, Kotaro Nakayama², Takahiro Hara¹, Shojiro Nishio¹. ¹Osaka University, Osaka, Japan; ²University of Tokyo, Tokyo, Japan.

Presentation Transcript


  1. Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities • Masumi Shirakawa¹, Kotaro Nakayama², Takahiro Hara¹, Shojiro Nishio¹ • ¹Osaka University, Osaka, Japan • ²University of Tokyo, Tokyo, Japan

  2. Challenge in short text analysis • Statistics are not always enough. • Text 1: "A year and a half after Google pulled its popular search engine out of mainland China" • Text 2: "Baidu and Microsoft did not disclose terms of the agreement"

  3. Challenge in short text analysis • Statistics are not always enough. • Text 1: "A year and a half after Google pulled its popular search engine out of mainland China" • Text 2: "Baidu and Microsoft did not disclose terms of the agreement" • They are talking about... search engines and China.

  4. Challenge in short text analysis • Statistics are not always enough. • Text 1: "A year and a half after Google pulled its popular search engine out of mainland China" • Text 2: "Baidu and Microsoft did not disclose terms of the agreement" • They are talking about... search engines and China. • How do machines know that the two sentences are about a similar topic?

  5. Reasonable solution • Use external knowledge: Wikipedia Thesaurus [Nakayama06]. • Text 1: "A year and a half after Google pulled its popular search engine out of mainland China" • Text 2: "Baidu and Microsoft did not disclose terms of the agreement"

  6. Related work • ESA: Explicit Semantic Analysis [Gabrilovich07] • Adds Wikipedia articles (entities) to a text as its semantic representation. • 1. Get the Wikipedia search ranking for each term (i.e., Wikipedia articles and their scores). • 2. Simply sum the scores to aggregate them. • [Figure: pipeline of key term extraction, related entity finding, and aggregation. Input: text T (example: "Apple sells a new product") with terms t; output: ranked list of related entities c (examples shown: Apple Inc., iPhone, iPad, pear, pricing, business).]
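
The two ESA steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ESA implementation: the function name `esa_aggregate` and all scores in `toy_scores` are hypothetical, and a real system would obtain the per-term scores from a Wikipedia search index.

```python
from collections import defaultdict

def esa_aggregate(term_to_entities):
    """Sum per-term relatedness scores into one score per entity (step 2)."""
    totals = defaultdict(float)
    for entity_scores in term_to_entities.values():
        for entity, score in entity_scores.items():
            totals[entity] += score
    # Rank entities by summed score, highest first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical search-ranking scores for each term (step 1), made up for illustration.
toy_scores = {
    "Apple": {"Apple Inc.": 0.5, "pear": 0.4, "iPhone": 0.3},
    "product": {"pricing": 0.4, "iPhone": 0.2},
    "sells": {"business": 0.3},
}

ranking = esa_aggregate(toy_scores)
for entity, score in ranking:
    print(f"{entity}: {score:.2f}")
```

With plain summation, an entity tied to only one sense of an ambiguous term (here "pear") keeps its full score, which is the weakness the following slides address.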

  7. Problems in real-world noisy short texts • "Noisy" means semantically noisy in this work. (We do not handle informal or casual surface forms, or misspellings.) • Term ambiguity • Apple (fruit) should not be related to Microsoft. • Fluctuation of term dominance • A term is not always important in a text. • We explore a more effective aggregation method.

  8. Probabilistic method • We propose extended naïve Bayes to aggregate related entities. • From text T to related articles c. • [Figure: pipeline of key term extraction, related entity finding, and aggregation. Input: text T (example: "Apple sells a new product") with terms t; output: ranked list of related entities c (examples shown: Apple Inc., iPhone, iPad, pear, pricing). References shown with the pipeline: [Mihalcea07], [Milne08], [Nakayama06], [Song11].]

  9. When input is multiple terms • Apply naïve Bayes [Song11] to multiple terms to obtain a related entity c using each probability P(c | t_k). • Compute the probability for each related entity c. • [Figure: terms "Apple", "product", and "new" linked to related entities such as "iPhone".]

  10. When input is multiple terms • Apply naïve Bayes [Song11] to multiple terms to obtain a related entity c using each probability P(c | t_k). • Compute the probability for each related entity c. • By using naïve Bayes, entities that are related to multiple terms can be boosted. • [Figure: terms "Apple", "product", and "new" linked to related entities such as "iPhone".]
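
The slide's equation is not preserved in this transcript. A minimal sketch of the standard naïve Bayes rearrangement, assuming terms are conditionally independent given the entity, scores each entity as P(c | t_1..t_K) ∝ ∏_k P(c | t_k) / P(c)^(K-1). The names `naive_bayes_score`, `p_c_given_t`, and `p_c`, and all probabilities below, are hypothetical. Treating an entity that is missing for a term as uninformative, so that P(c | t_k) falls back to the prior P(c), is also an assumption of this sketch.

```python
import math

def naive_bayes_score(entity, terms, p_c_given_t, p_c):
    """Unnormalized P(c | t_1..t_K) = prod_k P(c | t_k) / P(c)^(K-1)."""
    prior = p_c[entity]
    # Assumption: if a term has no score for the entity, fall back to the prior.
    numerator = math.prod(p_c_given_t[t].get(entity, prior) for t in terms)
    return numerator / prior ** (len(terms) - 1)

# Hypothetical probabilities, made up for illustration.
p_c = {"Apple Inc.": 0.01, "pear": 0.01, "iPhone": 0.01}
p_c_given_t = {
    "Apple": {"Apple Inc.": 0.4, "pear": 0.3},
    "iPhone": {"Apple Inc.": 0.3, "iPhone": 0.5},
}

terms = ["Apple", "iPhone"]
scores = {c: naive_bayes_score(c, terms, p_c_given_t, p_c) for c in p_c}
for entity, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entity}: {score:.3f}")
```

In this toy data "Apple Inc." is related to both terms, so its score is multiplied up, while "pear", related only to "Apple", is not boosted.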

  11. When input is text • The input is not "multiple terms" but "text," i.e., we do not know which terms are key terms. • We cannot observe which terms are key terms. • We developed extended naïve Bayes to solve this problem. • [Figure: terms "Apple", "product", and "new" linked to related entities such as "iPhone".]

  12. Extended naïve Bayes • Candidates of key terms (examples shown: "Apple", "product", "new"). • Apply naïve Bayes to each state T'. • Probability that the set of key terms T is a state T': [equation not preserved in the transcript]
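
One reading consistent with the bullets on this slide can be written out as follows. It assumes that each candidate term t_k is a key term independently with probability P(t_k ∈ T); that independence assumption and the notation are not taken from the slide, so this is a sketch rather than the authors' exact equation.

```latex
% Probability that the (unobserved) set of key terms T equals a state T'
P(T = T') = \prod_{t_k \in T'} P(t_k \in T) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr)

% Marginalize the naive Bayes estimate over all states T'
P(c \mid T) = \sum_{T'} P(T = T') \, P(c \mid T')
```

Here P(c | T') is the naïve Bayes estimate computed from only the terms in state T'.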

  14. Extended naïve Bayes • Candidates of key terms (examples shown: "Apple", "product", "new"). • Apply naïve Bayes to each state T'. • Probability that the set of key terms T is a state T': [equation not preserved in the transcript] • Term dominance is incorporated into naïve Bayes.
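
A minimal sketch of how term dominance could enter the computation, under stated assumptions: each term t_k is a key term independently with probability P(t_k ∈ T) (`key_prob` below), naïve Bayes takes the form ∏_k P(c | t_k) / P(c)^(K-1), and an entity with no score for a term falls back to its prior. Under those assumptions the sum over all 2^K states collapses to a per-term mixture, ∏_k [P(t_k ∈ T) P(c | t_k) + (1 - P(t_k ∈ T)) P(c)] / P(c)^(K-1). All names and numbers are hypothetical.

```python
import math
from itertools import product

def nb(entity, terms, p_c_given_t, p_c):
    """Naive Bayes over a fixed set of key terms; returns the prior if empty."""
    prior = p_c[entity]
    numerator = math.prod(p_c_given_t[t].get(entity, prior) for t in terms)
    return numerator / prior ** (len(terms) - 1)

def extended_nb_enumerate(entity, terms, key_prob, p_c_given_t, p_c):
    """Sum P(T = T') * P(c | T') over every subset T' of the candidate terms."""
    total = 0.0
    for mask in product([False, True], repeat=len(terms)):
        state = [t for t, keep in zip(terms, mask) if keep]
        p_state = math.prod(
            key_prob[t] if keep else 1.0 - key_prob[t]
            for t, keep in zip(terms, mask)
        )
        total += p_state * nb(entity, state, p_c_given_t, p_c)
    return total

def extended_nb_closed_form(entity, terms, key_prob, p_c_given_t, p_c):
    """Equivalent closed form: mix each P(c | t_k) with the prior P(c)."""
    prior = p_c[entity]
    mixed = math.prod(
        key_prob[t] * p_c_given_t[t].get(entity, prior)
        + (1.0 - key_prob[t]) * prior
        for t in terms
    )
    return mixed / prior ** (len(terms) - 1)

# Hypothetical probabilities, made up for illustration.
p_c = {"Apple Inc.": 0.01, "pear": 0.01, "iPhone": 0.01, "pricing": 0.01}
p_c_given_t = {
    "Apple": {"Apple Inc.": 0.4, "pear": 0.3},
    "product": {"iPhone": 0.2, "pricing": 0.3},
    "new": {},
}
key_prob = {"Apple": 0.9, "product": 0.5, "new": 0.1}
terms = ["Apple", "product", "new"]

for entity in p_c:
    slow = extended_nb_enumerate(entity, terms, key_prob, p_c_given_t, p_c)
    fast = extended_nb_closed_form(entity, terms, key_prob, p_c_given_t, p_c)
    print(f"{entity}: enumerate={slow:.5f} closed_form={fast:.5f}")
```

The closed form avoids enumerating 2^K states, and a term with a low key-term probability pulls its factor toward the prior, so it has little influence on the ranking.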

  15. Experiments on short text similarity datasets • [Datasets] Four datasets derived from word similarity datasets using a dictionary • [Comparative methods] Original ESA [Gabrilovich07], ESA with 16 parameter settings • [Metric] Spearman's rank correlation coefficient • ESA with well-adjusted parameters is superior to our method for "clean" texts.
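
For reference, Spearman's rank correlation is the Pearson correlation of the two rank vectors. The sketch below is a plain-Python illustration with average ranks for ties; the function names and sample values are hypothetical and are not the experimental data.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human ratings vs. system similarity scores.
human = [1, 2, 3, 4, 5]
system = [0.1, 0.4, 0.3, 0.9, 0.8]
print(round(spearman(human, system), 3))
```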

  16. Tweet clustering • K-means clustering using the vector of related entities for measuring distance • [Dataset] 12,385 tweets covering 13 topics • [Comparative methods] Bag-of-words (BOW), ESA with the same parameter, ESA with well-adjusted parameters • [Metric] Average of Normalized Mutual Information (NMI) over 20 runs • [Topics (number of tweets)] #MacBook (1,251), #Silverlight (221), #VMWare (890), #MySQL (1,241), #Ubuntu (988), #Chrome (1,018), #NFL (1,044), #NHL (1,045), #NBA (1,085), #MLB (752), #MLS (981), #UFC (991), #NASCAR (878)
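
NMI compares a clustering against the topic labels and does not depend on how clusters are numbered. The slide does not say which normalization was used; the sketch below assumes the arithmetic-mean variant, NMI = 2 I(U; V) / (H(U) + H(V)). The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption of this sketch)."""
    h = entropy(a) + entropy(b)
    return 1.0 if h == 0 else 2.0 * mutual_information(a, b) / h

# Toy example: cluster ids are permuted relative to the topics but match perfectly.
topics = ["NBA", "NBA", "NFL", "NFL"]
clusters = [1, 1, 0, 0]
print(nmi(topics, clusters))
```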

  17. Results • p-value < 0.01

  18. Results • p-value < 0.01 • Our method outperformed ESA with well-adjusted parameters for noisy short texts.

  19. Conclusion • We proposed extended naïve Bayes to derive related Wikipedia entities given a real-world noisy short text. • [Future work] • Tackle multilingual short texts • Develop applications of the method
