Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

Wikipedia as Sense Inventory to Improve Diversity in Web Search Results • Celina Santamaría, Julio Gonzalo and Javier Artiles • UNED (National University of Distance Education), c/ Juan del Rosal, 16, 28040 Madrid, Spain • ACL 2010

Outline • Motivation • Test Set • Set of Words • Set of Documents • Manual Annotation • Coverage of Web Search Results: Wikipedia vs. WordNet • Diversity in Google Search Results • Sense Frequency Estimators for Wikipedia • Association of Wikipedia Senses to Web Pages • VSM Approach • WSD Approach • Classification Results • Precision/Coverage Trade-off • Using Classification to Promote Diversity • Related Work • Conclusions

Motivation • For very short (one-word) queries, disambiguation may not be possible • Focus on two broad-coverage lexical resources • WordNet • Wikipedia


Test Set • Set of Words • Set of Documents • Manual Annotation

Set of Words • Corpus annotation • Two annotators • 40 nouns in total • 15 nouns from the Senseval-3 lexical sample dataset • 25 additional words which satisfy two conditions: • they are all ambiguous, and • they are all names of music bands in one of their senses

Corpus • The Senseval set is: • {argument, arm, atmosphere, bank, degree, difference, disc, image, paper, party, performance, plan, shelter, sort, source} • The Bands set is: • {amazon, apple, camel, cell, columbia, cream, foreigner, fox, genesis, jaguar, oasis, pioneer, police, puma, rainbow, shell, skin, sun, tesla, thunder, total, traffic, trapeze, triumph, yes}

Table 1: Coverage of Search Results: Wikipedia vs. WordNet • For each noun in the set, we looked up all its possible senses in WordNet 3.0 and in Wikipedia disambiguation pages • Wikipedia has an average of 22 senses per noun • 25.2 in the Bands set • 16.1 in the Senseval set • WordNet has a much smaller figure, 4.5 senses per noun • 3.12 for the Bands set • 6.13 for the Senseval set

Set of Documents • Step 1: • retrieve the top 150 documents (per noun) from Google • Step 2: • for each document, store both the snippet and the whole HTML document • assume "one sense per document"

Manual Annotation • Two annotators • For every document, they decided whether there was an appropriate sense in each of the dictionaries • They provided annotations for 100 documents per noun • If a URL in the list was corrupt or not available, it had to be discarded • 150 -> 100 documents per noun

Coverage of Web Search Results: Wikipedia vs. WordNet

Top ten results not covered by Wikipedia • 32% of the top ten documents are not covered by Wikipedia • On manual examination, a majority of the missing senses consists of names of (generally not well-known) • companies (45%) • products or services (26%) • The other frequent type (12%) of non-annotated document is disambiguation pages

Degree of overlap between Wikipedia and WordNet senses • Just 3% of the documents fit a WordNet sense only • Wikipedia seems to extend the coverage of WordNet

Diversity in Google Search Results

Diversity is not a major priority for ranking results • The top ten results cover, on average, only 3 Wikipedia senses • The average number of senses listed in Wikipedia is 22 • In the first 100 documents, this number grows to 6.85 senses per noun • On average, 63% of the pages in search results belong to the most frequent sense of the query word

Sense Frequency Estimators for Wikipedia • Wikipedia disambiguation pages do not indicate the relative importance of the senses of a given word • Internal relevance • number of incoming links for the URL of a given sense in Wikipedia • stable • External relevance • number of visits for the URL of a given sense (as reported in http://stats.grok.se) • not stable

Measured correlation • For each noun w and for each sense wi, we consider three values: • the proportion of documents retrieved for w which are manually assigned to sense wi • inlinks(wi): • relative number of incoming links to each sense wi • visits(wi): • relative number of visits to the URL for each sense wi

Measured correlation • We have measured the correlation between these three values using a linear regression correlation coefficient • correlation value of 0.54 for the number of visits • correlation value of 0.71 for the number of incoming links • Both estimators seem to be positively correlated with the manually annotated sense proportions

Measured correlation • freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi) • k = 0, 0.1, 0.2, ..., 1 • When k is 0.9, the function has its maximal correlation value of 0.73 • freq(wi) = 0.9 * inlinks(wi) + 0.1 * visits(wi) • This weighted estimator provides a slight advantage over the use of incoming links only (0.73 vs 0.71)
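The sweep over k described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `pearson` and `best_mixing_weight` are hypothetical, the Pearson coefficient is assumed to be the "linear regression correlation coefficient" the slides mention, and the numbers in the usage note are made-up toy values rather than data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_mixing_weight(observed, inlinks, visits, steps=10):
    """Try k = 0, 1/steps, ..., 1 and return the (k, correlation) pair
    for which freq = k * inlinks + (1 - k) * visits correlates best
    with the observed sense proportions."""
    best_k, best_r = None, float("-inf")
    for i in range(steps + 1):
        k = i / steps
        freq = [k * a + (1 - k) * b for a, b in zip(inlinks, visits)]
        r = pearson(observed, freq)
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r
```

With toy inputs such as `observed = [0.6, 0.3, 0.1]`, `inlinks = [0.6, 0.3, 0.1]` and `visits = [0.2, 0.5, 0.3]`, the sweep selects k = 1.0, since the inlinks values match the observed proportions exactly. On the real data the slides report the maximum at k = 0.9.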

Association of Wikipedia Senses to Web Pages • Two different techniques • Vector Space Model (VSM) • WSD system • Two baselines • random assignment of senses • most frequent sense

VSM Approach • Each word sense is represented by its Wikipedia page in a (unigram) vector space model • idf weights are computed in two different ways • VSM: • idf computed on the collection of retrieved documents • VSM-GT: • uses the statistics provided by the Google Terabyte collection • VSM-mixed: • VSM + VSM-GT

VSM Approach • Documents and senses are compared with cosine similarity • Assign to the document the sense with the highest similarity • In case of ties, pick the first sense in the Wikipedia disambiguation page • VSM-GT+freq • In case of ties, pick the sense which has the largest frequency according to our estimator
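A minimal sketch of this assignment step, assuming pre-tokenized text and an idf table supplied by the caller. The names `tfidf_vector`, `cosine` and `assign_sense` are hypothetical, and the term weighting (raw term frequency times idf) is an assumption, since the slides do not specify it.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse unigram vector: raw term frequency times idf (assumed weighting)."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_sense(doc_tokens, sense_pages, idf, freq=None):
    """Return the sense whose Wikipedia page is most similar to the document.

    sense_pages: list of (sense_name, tokens), in disambiguation-page order.
    freq: optional {sense_name: estimated frequency}; when given, ties go to
    the most frequent sense (the "+freq" variant), otherwise to the first
    sense listed.
    """
    doc_vec = tfidf_vector(doc_tokens, idf)
    scored = [(name, cosine(doc_vec, tfidf_vector(tokens, idf)))
              for name, tokens in sense_pages]
    best = max(score for _, score in scored)
    tied = [name for name, score in scored if score == best]
    if freq is not None:
        return max(tied, key=lambda name: freq.get(name, 0.0))
    return tied[0]
```

For example, with sense pages `[("jaguar_animal", ["cat", "jungle"]), ("jaguar_car", ["car", "engine"])]` and a uniform idf of 1.0, a document tokenized as `["car", "engine", "engine"]` is assigned `jaguar_car`. A document sharing no terms with either page produces a tie at similarity 0, which is where the two tie-breaking rules differ.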

WSD Approach • TiMBL • a state-of-the-art supervised WSD system • uses Memory-Based Learning • TiMBL-core • training examples: occurrences of the word in the Wikipedia page for the word sense • TiMBL-inlinks • training examples: occurrences of the word in Wikipedia pages pointing to the page for the word sense • TiMBL-all • core + inlinks

TiMBL • First: • disambiguate all occurrences of word w in the page p • Then: • choose the sense which appears most frequently in the page according to the TiMBL results • In case of ties: • pick the first sense listed in the Wikipedia disambiguation page • TiMBL-core+freq • In case of ties, pick the sense with the highest frequency according to our estimator • When no sense reaches 30% of the cases in the page to be disambiguated, we also resort to the most frequent sense heuristic
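The page-level decision rule above can be sketched independently of TiMBL itself. The sketch below takes the per-occurrence sense labels as given (however they were produced) and only implements the voting, tie-breaking and 30% fallback. The function name `page_sense` and its parameters are hypothetical, and the labels are assumed to come from the senses in `sense_order`.

```python
from collections import Counter


def page_sense(occurrence_senses, sense_order, freq=None, min_share=0.30):
    """Pick one sense for a page from per-occurrence sense labels.

    occurrence_senses: sense label for each occurrence of the word in the page
        (assumed to be drawn from sense_order).
    sense_order: senses in the order of the Wikipedia disambiguation page.
    freq: optional {sense: estimated frequency}; enables the "+freq" behaviour.
    min_share: with freq given, if no sense reaches this share of the
        occurrences, fall back to the most frequent sense.
    """
    if not occurrence_senses:
        raise ValueError("need at least one disambiguated occurrence")
    counts = Counter(occurrence_senses)
    top = max(counts.values())
    if freq is not None and top / len(occurrence_senses) < min_share:
        # No sense is dominant enough: most frequent sense heuristic.
        return max(sense_order, key=lambda s: freq.get(s, 0.0))
    tied = [s for s in sense_order if counts.get(s, 0) == top]
    if freq is not None:
        return max(tied, key=lambda s: freq.get(s, 0.0))
    return tied[0]
```

Without `freq`, a tie between two senses resolves to whichever is listed first in `sense_order`. With `freq`, the same tie resolves to the sense the estimator considers more frequent, and a page whose occurrences are spread evenly over four senses (25% each) falls back to the overall most frequent sense.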

Table 4: Classification Results • Precision: • the number of pages correctly classified divided by the total number of predictions

Using Classification to Promote Diversity • We fill each position in the rank (starting at rank 1) with the document which has the highest similarity to some of the senses which are not yet represented in the rank
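A minimal sketch of this greedy re-ranking, assuming the document-to-sense similarities have already been computed. The function name `diversify` and the data layout are hypothetical. The slides do not say what happens once every sense is represented; here the remaining positions are filled in the original rank order, which is an assumption.

```python
def diversify(similarities, original_rank, k=10):
    """Greedy re-ranking that promotes senses not yet represented.

    similarities: {doc_id: {sense: similarity}}.
    original_rank: doc_ids in the search engine's original order.
    Returns the first k doc_ids of the modified rank.
    """
    selected = []
    covered = set()
    remaining = list(original_rank)
    while remaining and len(selected) < k:
        best_doc, best_sense, best_sim = None, None, float("-inf")
        for doc in remaining:
            for sense, sim in similarities.get(doc, {}).items():
                if sense not in covered and sim > best_sim:
                    best_doc, best_sense, best_sim = doc, sense, sim
        if best_doc is None:
            # Assumption: once all senses are covered, keep the original order.
            best_doc = remaining[0]
        else:
            covered.add(best_sense)
        selected.append(best_doc)
        remaining.remove(best_doc)
    return selected
```

For three documents where `d1` and `d2` are both close to an "animal" sense and `d3` is close to a "car" sense, the modified top two is `["d1", "d3"]` rather than the original `["d1", "d2"]`: the second position goes to the best document for the sense that is still missing.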

Alternative rankings for comparison • clustering (centroids): • applies Hierarchical Agglomerative Clustering to the search results • clustering (top ranked): • the top ranked document (in the original Google rank) of each cluster is selected • random: • randomly selects ten documents from the set of retrieved results • upper bound: • coverage is not 100%, because some words have more than ten meanings in Wikipedia and we are only considering the top ten documents

Coverage: • number of senses in the top 10 / number of senses in all results • Coverage of senses goes from 49% to 77% • The coverage of Wikipedia senses in the top ten results is 70% larger than in the original ranking
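The coverage ratio defined above can be computed for a single query word as sketched below. The function name `sense_coverage` is hypothetical, and treating documents without a Wikipedia sense as `None` (and ignoring them) is an assumption about how unannotated pages are handled.

```python
def sense_coverage(ranked_doc_senses, k=10):
    """Share of the senses found in all results that appear in the top k.

    ranked_doc_senses: the sense assigned to each retrieved document, in rank
    order; None marks a document with no Wikipedia sense (ignored here).
    """
    all_senses = {s for s in ranked_doc_senses if s is not None}
    if not all_senses:
        return 0.0
    top_senses = {s for s in ranked_doc_senses[:k] if s is not None}
    return len(top_senses) / len(all_senses)
```

For a ranking whose documents carry the senses `["a", "a", "b", None, "c", "d"]`, the top three cover two of the four senses, so `sense_coverage(..., k=3)` is 0.5.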

Using Wikipedia to enhance diversity seems to work much better than clustering • Bias: • only Wikipedia senses are considered to estimate diversity • Our results do not imply that the Wikipedia modified rank is better than the original Google rank • Wikipedia can be used as a reference to improve search results diversity for one-word queries

Conclusions • We have investigated whether generic lexical resources can be used to promote diversity in Web search results for one-word, ambiguous queries. We have compared WordNet and Wikipedia: • (i) unsurprisingly, Wikipedia has a much better coverage of senses in search results, and is therefore more appropriate for the task; • (ii) the distribution of senses in search results can be estimated using the internal graph structure of Wikipedia and the relative number of visits received by each sense in Wikipedia; • (iii) by associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings.
