Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

Wikipedia as Sense Inventory to Improve Diversity in Web Search Results. Celina Santamaria, Julio Gonzalo, Javier Artiles. nlp.uned.es, UNED, c/Juan del Rosal, 16, 28040 Madrid, Spain. celina.santamaria@gmail.com, julio@lsi.uned.es, javart@bec.uned.es. ACL 2010.

Presentation Transcript


  1. Wikipedia as Sense Inventory to Improve Diversity in Web Search Results • Celina Santamaria, Julio Gonzalo, Javier Artiles • nlp.uned.es, UNED, c/Juan del Rosal, 16, 28040 Madrid, Spain • celina.santamaria@gmail.com, julio@lsi.uned.es, javart@bec.uned.es • ACL 2010

  2. Introduction • Word Sense Disambiguation (WSD) • Promoting diversity in search results • Presenting the results as a set of clusters • Complementing search results with search suggestions • Two lexical resources: Wikipedia and WordNet 3.0

  3. Introduction

  4. Introduction • Problems addressed: coverage; estimating search result diversity using our senses; sense frequencies; classification

  5. Test Set • Nouns that are likely to form a one-word query • Nouns that denote one or more named entities • 40 nouns in total: 15 nouns from the Senseval-3 lexical sample dataset, and 25 nouns that satisfy two conditions: they are ambiguous, and they are all names of music bands in one of their senses

  6. Test Set • Average of 22 senses per noun in Wikipedia • Average of 4.5 senses per noun in WordNet • Wikipedia has a larger coverage • Retrieve 150 documents for each noun (from Google) • Annotate each document with respect to each of the two dictionaries

  7. Coverage of Web Search Results • If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents • Among the top ten results that are not covered by Wikipedia: a majority of the missing senses are names of companies (45%) and products or services (26%); the other frequent type (12%) of non-annotated document is disambiguation pages

  8. Coverage of Web Search Results • Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information • To extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services

  9. Diversity in Google Search Results • Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses • 63% of the pages in search results belong to the most frequent sense of the query word • Diversity may not play a major role in the current Google ranking algorithm

  10. Sense Frequency Estimators for Wikipedia • Frequency information is crucial in a lexicon • But Wikipedia does not provide the relative importance of senses for a given word • Attempt to use two estimators of the expected sense distribution: the number of incoming links for the sense page, and the number of visits to the sense page (May, June and July 2009; http://stats.grok.se/)
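
One way to read the two estimators above is as raw per-sense counts that are normalized into a relative frequency distribution. The sketch below assumes exactly that; the function name, the uniform fallback and the link counts are illustrative and are not taken from the paper.

```python
def estimate_sense_distribution(counts):
    """Turn raw per-sense counts (incoming links or page visits)
    into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no evidence at all, fall back to a uniform distribution.
        return {sense: 1.0 / len(counts) for sense in counts}
    return {sense: count / total for sense, count in counts.items()}


# Made-up incoming-link counts for three hypothetical senses of "oasis".
inlinks = {"Oasis (band)": 600, "Oasis": 300, "Oasis (software)": 100}
dist = estimate_sense_distribution(inlinks)
most_frequent_sense = max(dist, key=dist.get)
```

The same function works unchanged for visit counts, so the two estimators can be compared on equal footing.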

  11. Association of Wikipedia Senses to Web Pages • Test whether the information can be used to classify search results accurately • Approaches that require manually created training data are not considered • Given a web page p and the set of senses w1, …, wn listed in Wikipedia, assign the best candidate sense to p • Approaches: Vector Space Model (VSM); Word Sense Disambiguation (WSD) system; random assignment; assigning the most frequent sense to all documents

  12. VSM • Represent each page in a vector space model (tf*idf weights) • VSM: compute idf in the collection of retrieved documents • VSM-GT: use the statistics provided by the Google Terabyte collection • VSM-mix: combine statistics from the collection and from the Google Terabyte collection • VSM-GT+freq
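
The VSM idea can be sketched as follows: weight terms by tf*idf, compare the web page with a text representing each candidate sense using cosine similarity, and pick the closest sense. This is a minimal sketch under stated assumptions: the tokenization, the smoothed idf formula, the use of a sense description text, and all names and toy data are illustrative, and idf is computed on the toy collection itself (closest in spirit to the plain VSM variant).

```python
import math
from collections import Counter


def compute_idf(docs):
    """Inverse document frequency over a list of token lists (smoothed with +1, an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / df[term]) + 1.0 for term in df}


def tfidf_vector(tokens, idf):
    """Sparse tf*idf vector as a dict; terms unseen in the idf table get weight 0."""
    tf = Counter(tokens)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(page_tokens, sense_docs, idf):
    """Return (best_sense, similarity) for a page given one token list per sense."""
    page_vec = tfidf_vector(page_tokens, idf)
    scores = {sense: cosine(page_vec, tfidf_vector(tokens, idf))
              for sense, tokens in sense_docs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data: two hypothetical senses and one retrieved page.
sense_docs = {
    "band": "rock band album tour guitar".split(),
    "desert": "desert water palm trees sand".split(),
}
page = "new album and tour dates for the band".split()
idf = compute_idf(list(sense_docs.values()) + [page])
best_sense, score = classify(page, sense_docs, idf)
```

Swapping in idf statistics from a larger external collection, as VSM-GT does, would only change the `idf` table passed to `classify`.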

  13. WSD system • Extract learning examples from Wikipedia automatically • Disambiguate all occurrences of word w in the page p • TiMBL-core: use only the examples found in the Wikipedia page for the sense • TiMBL-inlinks: use the examples found in Wikipedia pages pointing to that page • TiMBL-all: use both sources of examples • TiMBL-core+freq

  14. Classification Results • VSM is a simpler and more efficient approach • The results may indicate that using frequency estimations is only helpful up to a certain precision ceiling

  15. Precision/Coverage Trade-off • So far, all systems assign a sense to every document in the test collection • It is possible to enhance search result diversity without annotating every document • Set a threshold in the range [0.00-0.90]
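
The trade-off can be illustrated by leaving a document unannotated whenever its classifier score falls below the threshold, then measuring precision on the annotated documents and the fraction of documents annotated. The sketch assumes the threshold applies to the classifier's score for its chosen sense; the data and names are made up. Note that "coverage" here means the share of documents that receive a sense, not the sense coverage defined on slide 18.

```python
def precision_and_coverage(predictions, gold, threshold):
    """predictions: {doc_id: (sense, score)}; gold: {doc_id: sense}.
    Returns (precision on annotated docs, fraction of docs annotated)."""
    kept = [(doc, sense) for doc, (sense, score) in predictions.items()
            if score >= threshold]
    if not kept:
        return 0.0, 0.0
    correct = sum(1 for doc, sense in kept if gold[doc] == sense)
    return correct / len(kept), len(kept) / len(predictions)


# Made-up classifier output and gold annotations for four documents.
predictions = {
    "d1": ("band", 0.8),
    "d2": ("desert", 0.3),
    "d3": ("band", 0.1),
    "d4": ("desert", 0.6),
}
gold = {"d1": "band", "d2": "desert", "d3": "desert", "d4": "desert"}

no_threshold = precision_and_coverage(predictions, gold, 0.0)
strict = precision_and_coverage(predictions, gold, 0.5)
```

In this toy case raising the threshold from 0.0 to 0.5 lifts precision from 0.75 to 1.0 while halving the share of annotated documents.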

  16. Using Classification to Promote Diversity • Use our best classifier (VSM-GT+freq) • Make a list of the top ten documents that maximizes the number of senses represented and the similarity scores of the documents to their assigned senses • Algorithm: fill each position in the ranking with the document that has the highest similarity to a sense not yet represented in the ranking; once all senses are represented, start choosing a second representative for each sense
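
The re-ranking step described on this slide can be sketched as a round-robin over senses: each round takes the best remaining document of every sense, ordered by similarity, until the list is full. This is one possible reading of the slide, not the authors' code; the input format, the tie-breaking and the toy data are assumptions.

```python
def diversify(classified, k=10):
    """classified: list of (doc_id, sense, similarity) triples.
    Returns up to k doc_ids, covering as many senses as possible first."""
    by_sense = {}
    for doc, sense, sim in classified:
        by_sense.setdefault(sense, []).append((sim, doc))
    for docs in by_sense.values():
        docs.sort(reverse=True)  # best document of each sense first

    ranking = []
    while len(ranking) < k and any(by_sense.values()):
        # One round: the best remaining document of every sense,
        # ordered by similarity to its assigned sense.
        round_picks = sorted(
            (docs.pop(0) + (sense,) for sense, docs in by_sense.items() if docs),
            reverse=True,
        )
        for sim, doc, sense in round_picks:
            if len(ranking) < k:
                ranking.append(doc)
    return ranking


# Made-up classifier output: three senses, five documents.
classified = [
    ("a", "band", 0.9),
    ("b", "band", 0.8),
    ("c", "desert", 0.5),
    ("d", "software", 0.7),
    ("e", "band", 0.6),
]
top4 = diversify(classified, k=4)
```

With `k=4`, the first three positions go to one document per sense (`a`, `d`, `c`), and only then does the second "band" document (`b`) enter the list.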

  17. Using Classification to Promote Diversity • Other approaches compared: clustering (centroids); clustering (top ranked); random; upper bound

  18. Using Classification to Promote Diversity • coverage = (number of senses in the top ten results) / (number of senses in all search results) • Using Wikipedia to enhance diversity seems to work much better than clustering • Note: our evaluation has a bias towards using Wikipedia, because only Wikipedia senses are considered to estimate diversity
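
The coverage measure on this slide is a simple ratio of distinct senses, which can be computed as below. The function name and the sense labels are illustrative only.

```python
def sense_coverage(top_ten_senses, all_senses):
    """Distinct senses present in the top ten results divided by
    distinct senses present in all retrieved results."""
    all_set = set(all_senses)
    if not all_set:
        return 0.0
    return len(set(top_ten_senses) & all_set) / len(all_set)


# Made-up annotations: the top ten results cover 2 of the 4 senses
# that appear somewhere in the full result list.
top_ten = ["band"] * 7 + ["desert"] * 3
everything = top_ten + ["software"] * 5 + ["film"] * 2
coverage = sense_coverage(top_ten, everything)
```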

  19. Conclusion • Wikipedia has much better coverage than WordNet • The distribution of senses can be estimated • Search result diversity for one-word queries can be improved with a simple and efficient algorithm • Our results do not imply that the Wikipedia-modified ranking is better than the original Google ranking
