Cross-Language Retrieval with Wikipedia Participation Report

Performing Cross-Language Retrieval with Wikipedia
Participation report for Ad Hoc bilingual Hungarian → English
Péter Schönhofen
Joint work with András A. Benczúr, István Bíró, Károly Csalogány
Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences

Our Approach
• Term-by-term query translation by dictionaries
• Bigram language model helps select the most probable English translation
• Using Wikipedia to discard off-topic terms
IR System: Hungarian Academy of Sciences Search Engine (http://search.sztaki.hu)
• TF×IDF-based
• OR query, heavily weighted by the number of matched terms
• Also takes proximity and term location into account
Only the query title is used; the description and narrative contribute to mapping the title to Wikipedia concepts

Outline of the algorithm
• Preparations
  • construct a dictionary
  • generate concept network from Wikipedia
  • pre-process queries and documents
• Raw translation
  • disambiguation with bigram model
• Improve translation quality with Wikipedia
  • map terms to concept space
  • rank concepts
  • map concepts to words

Dictionary Construction
• Two sources of Hungarian-English term pairs:
  • on-line dictionary of the Institute (official + community-edited entries)
  • cross-language links present in Wikipedia
• Conflicting entries are resolved in the above order (official, community, Wikipedia)
• 100,510 dictionary entries in total (however, a large portion are idioms)
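The priority rule for conflicting entries can be illustrated with a minimal sketch. It assumes that a "conflict" means the same Hungarian term appears in more than one source, and that the entry from the highest-priority source is kept whole. The sample entries and the name merge_by_priority are hypothetical, not taken from the actual system.

```python
# Hypothetical sample entries; the real sources are the Institute's on-line
# dictionary (official and community-edited parts) and Wikipedia
# cross-language links.
OFFICIAL = {"rák": ["cancer", "crab"]}
COMMUNITY = {"rák": ["crayfish"], "kutatás": ["research"]}
WIKIPEDIA = {"kutatás": ["inquiry"], "onkológia": ["oncology"]}


def merge_by_priority(*sources):
    """Merge dictionaries; sources are passed from highest to lowest priority.

    Assumption: when a term occurs in several sources, the entry of the
    earliest (highest-priority) source wins and later ones are ignored.
    """
    merged = {}
    for source in sources:
        for term, translations in source.items():
            # setdefault keeps the entry that was inserted first.
            merged.setdefault(term, list(translations))
    return merged


DICTIONARY = merge_by_priority(OFFICIAL, COMMUNITY, WIKIPEDIA)
```

In this sketch the community and Wikipedia entries only fill in terms that the higher-priority sources do not cover.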

Raw translation
• Find Hungarian dictionary terms in queries
  • Hungarian terms may overlap
• Select best translations based on bigram model
  • a translation is better if it joins other translations through bigrams with higher probability
  • the model is built from Wikipedia, but any other large corpus would suffice
[Figure: Hungarian words of the query, each with translation candidates 1 and 2; candidates are scored by the bigram model and the maximum is taken]
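A minimal sketch of bigram-based selection follows. It assumes single-word translation candidates, non-overlapping Hungarian terms, and an exhaustive search over all candidate combinations. The probability table, the UNSEEN smoothing constant and the function names are hypothetical; the slides do not specify how the score is actually computed.

```python
import itertools
import math

# Hypothetical bigram probabilities, as if estimated from a large English corpus.
BIGRAM_PROB = {
    ("cancer", "research"): 0.02,
    ("crab", "research"): 0.0001,
}
UNSEEN = 1e-6  # assumed floor probability for bigrams missing from the table


def sequence_score(words):
    """Sum of log-probabilities of adjacent word pairs."""
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN))
        for pair in zip(words, words[1:])
    )


def best_translation(candidates):
    """Pick one candidate per source word so that the bigram score is maximal.

    `candidates` holds one list of English candidates per Hungarian term.
    """
    return max(itertools.product(*candidates), key=sequence_score)


choice = best_translation([["cancer", "crab"], ["research", "inquiry"]])
```

Exhaustive search is only practical for short queries such as titles; a longer input would call for dynamic programming over adjacent positions.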

Role of Wikipedia

Concept network
• Regular Wikipedia articles represent concepts
  • article title is the concept name
  • links to other articles describe semantic relations
  • redirections are handled as additional concept names (a sort of synonym)
• Category assignments are ignored
• Wikipedia is in fact converted to an ontology
  • less formal than a proper ontology (e.g. WordNet)
  • only one type of relationship exists
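The conversion described above can be sketched as follows, assuming the articles have already been parsed into (title, outgoing links, redirect titles) records. The sample records are invented for illustration and build_concept_network is a hypothetical name; parsing of a real Wikipedia snapshot is not shown.

```python
from collections import defaultdict

# Invented sample records: (article title, linked article titles, redirect titles).
ARTICLES = [
    ("Oncology", ["Cancer", "Chemotherapy"], ["Cancer research"]),
    ("Cancer", ["Oncology"], ["Malignant tumor"]),
    ("Chemotherapy", ["Cancer"], []),
]


def build_concept_network(articles):
    """Return (names, links).

    names: lower-cased concept name -> set of concepts it may denote
           (article titles plus redirects acting as extra names)
    links: concept -> set of linked concepts (the single relation type)
    """
    names = {}
    links = defaultdict(set)
    for title, outlinks, redirects in articles:
        for name in [title, *redirects]:
            names.setdefault(name.lower(), set()).add(title)
        links[title].update(outlinks)
    return names, links


NAMES, LINKS = build_concept_network(ARTICLES)
```

Category assignments are simply never read, which mirrors the decision to ignore them.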

Map terms to concepts
• Match Wikipedia article titles with query terms
• Concepts behind Wikipedia article titles:
  • the same title may represent multiple concepts
  • another layer of disambiguation is introduced
• Concepts are recognized through terms, and are carried by the text locations occupied by the term
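One way to picture "text locations carrying concepts" is the sketch below. It assumes a tokenized query, exact lower-cased matching against article titles and redirects, and a location defined as a (start, end) token span. The name index, the max_len limit and the function name are hypothetical.

```python
# Hypothetical name index: lower-cased title or redirect -> candidate concepts.
NAME_INDEX = {
    "cancer": {"Cancer", "Cancer (constellation)"},
    "cancer treatment": {"Treatment of cancer"},
    "treatment": {"Therapy"},
}


def map_terms_to_concepts(tokens, name_index, max_len=3):
    """Return {(start, end): concepts} for every token span matching a name.

    A span is a text location; one location may carry several concepts
    when the matched title is ambiguous.
    """
    carried = {}
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in name_index:
                carried[(start, end)] = set(name_index[phrase])
    return carried


CARRIED = map_terms_to_concepts(["cancer", "treatment"], NAME_INDEX)
```

Here the location (0, 1) carries two competing concepts, which is the ambiguity that the ranking step then has to resolve.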

Rank concepts
• Select the concepts that are most tightly connected to other candidate concepts
• Score of concept C is computed from three factors:
  • L: number of text locations carrying concepts semantically related to C
  • M: number of concepts carried by the same text locations as C
  • F: number of text locations carrying C
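The three factors can be counted as in the sketch below. The slides do not state how L, M and F are combined, so the formula in score() is purely an illustrative assumption (reward connectivity and frequency, penalize ambiguity), not the method actually used. Treating links as symmetric is also an assumption, and all data and names are hypothetical.

```python
# Hypothetical data: text location -> concepts carried there.
CARRIED = {
    0: {"Cancer", "Cancer (constellation)"},
    1: {"Oncology"},
    2: {"Chemotherapy"},
}
# Hypothetical article links: concept -> linked concepts.
LINKS = {
    "Oncology": {"Cancer", "Chemotherapy"},
    "Cancer": {"Oncology"},
    "Chemotherapy": {"Cancer"},
    "Cancer (constellation)": {"Zodiac"},
}


def related_concepts(concept, links):
    """Concepts linked to or from `concept` (assumption: direction is ignored)."""
    outgoing = links.get(concept, set())
    incoming = {src for src, targets in links.items() if concept in targets}
    return (outgoing | incoming) - {concept}


def concept_factors(concept, carried, links):
    """Return (L, M, F) for one concept, following the definitions above."""
    related = related_concepts(concept, links)
    own = [loc for loc, concepts in carried.items() if concept in concepts]
    L = sum(1 for concepts in carried.values() if concepts & related)
    M = len(set().union(*(carried[loc] for loc in own)) - {concept})
    F = len(own)
    return L, M, F


def score(concept, carried, links):
    # ASSUMPTION: illustrative combination only; the source gives no formula.
    L, M, F = concept_factors(concept, carried, links)
    return F * (1 + L) / (1 + M)


def rank_concepts(carried, links):
    """Candidate concepts sorted by descending score, ties broken by name."""
    candidates = set().union(*carried.values())
    return sorted(candidates, key=lambda c: (-score(c, carried, links), c))


RANKING = rank_concepts(CARRIED, LINKS)
```

With this toy data the medical sense of "Cancer" outranks the constellation because two other locations carry concepts linked to it.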

Map concepts to words
• Concepts → titles (word sequences)
  • pasting titles would yield too long queries
• Titles → set of words
• Words are ranked based on the scores of the concepts behind them
  • the same word may represent many concepts
• Query title words are required
  • if all translations of a title word are discarded, it is forcefully injected into the translated query
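A minimal sketch of this last step is given below. It assumes that a word's score is the sum of the scores of all concepts whose title contains it, that a fixed number of top words is kept, and that "injecting" a title word means appending its first raw translation. None of these choices is stated on the slides; the scores, max_words and the function name are hypothetical.

```python
from collections import defaultdict


def concepts_to_query(concept_scores, title_translations, max_words=3):
    """Turn scored concepts into a bag-of-words query.

    concept_scores: concept title -> score
    title_translations: one list of raw translations per query title word
    """
    word_scores = defaultdict(float)
    for title, concept_score in concept_scores.items():
        # A word occurring in several titles accumulates their scores.
        for word in set(title.lower().split()):
            word_scores[word] += concept_score

    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:max_words]

    # Title words are required: if no translation of a title word survived,
    # inject its first translation (assumed injection policy).
    for translations in title_translations:
        if translations and not any(t in query for t in translations):
            query.append(translations[0])
    return query


QUERY = concepts_to_query(
    {"Oncology": 3.0, "Cancer treatment": 2.0, "Cancer": 1.5, "Zodiac": 0.1},
    [["cancer"], ["research"]],
)
```

Splitting titles into words keeps the query short, at the cost of losing the phrase structure of multi-word titles.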

Why use Wikipedia?
• Advantages
  • freely available (snapshots are downloadable)
  • relatively high-quality
  • wide range of subjects covered
  • rapidly growing, up-to-date
• Disadvantages
  • articles do not always link to other relevant articles
  • category assignments are not always consistent
  • basic verbs and nouns are not covered

Example query
• Original query title: “cancer research”
• Raw translation: “oncology”
• Improved translation: “oncology cancer treatment”

Evaluation

Difficulties
• Hungarian stemmer is not perfect
  • language is complex
  • pronouns are not always recognized as such
• Dictionary is small
• In short: raw translation is of very low quality
• Retrieval is not performed on the concept level
• Context is not large enough to support the reliable selection of relevant Wikipedia concepts

Future work
• Running German queries against English corpora
• Richer dictionary
• Improved mechanism
  • raw translation is used for retrieval
  • Wikipedia concept network is used for determining the relevance of documents in hit lists: query-document matching is carried out in the space of Wikipedia concepts
• Improved matching
  • POS information also taken into account

Thank you for your attention
