Multilingual Pseudo Relevance Feedback: A Way of Query Expansion and Disambiguation

Pushpak Bhattacharyya • Computer Science and Engineering Department, IIT Bombay • www.cse.iitb.ac.in/~pb • Work done with Manoj, Karthik, Arjun and many others

Classical Information Retrieval (Simplified) • query + document representation → Retrieval Model (a.k.a. ranking algorithm) → relevant documents • Late 1960s to 2010: 40+ years of work in designing better models • Vector space models • Binary independence models • Network models • Logistic regression models • Bayesian inference models • Hyperlink retrieval models • (Courtesy: Dr. Sriram Raghvan, IBM India Research Lab)

The elusive user satisfaction • Ranking • Output Presentation • Correctness of Query Processing • Coverage • Indexing • Crawling • Snippet • NER • Stemming • MWE

How to improve ranking with more meaningful query models • Set theoretic, Algebraic and Probabilistic Models • An underlying current: attempts to capture “Query Meaning” • Started with Karen Spärck Jones’ thesis “Synonymy and Semantic Classification” at Cambridge in the 1960s • The effort continues

Another perspective: Multilinguality • English still the most dominant language on the web • Contributes 72% of the content • Number of non-English users steadily rising all over the world • English penetration in India • Estimated to be around 3-4% • Mostly the urban educated class • Need to enable access to the above information through local languages

India’s CLIR project • Enable access to information through local languages • Query Languages: 9 • (Assamese, Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, Telugu, Tamil) • Results in: Source Language + Hindi + English • Domains: Tourism, Health • http://www.clia.iitb.ac.in/sandhan • Public release for select languages planned in Jan 2012

CLIR system flow (figure) • Hindi Query: तिरूपति यात्रा (“Tirupati travel”) → CLIR Engine, supported by Language Resources • Crawled and Indexed Web Pages → Target Language Index in English → Target Information in English • Output: Ranked List of Results with Result Snippets in Hindi • Example snippet (translated from Hindi): “Rail options for reaching Tirupati: Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express.”

Resource Constrained Languages • Unlike English, many non-English languages are constrained on resources like • Large crawl coverage • Large annotated corpora • Language resources like stemmers, morphological analyzers, word de-compounders etc.

Main Message of the presentation

User Information Need • Expressed as a short query (average length 2.5 words) • Needs query expansion • Lexical-resource-based expansion did not deliver (Voorhees 1994) • Paradigmatic association (synonyms, antonyms, hyponyms and hypernyms) • Introduces severe topic drift through unrelated senses of expansion terms • Also through irrelevant senses of query terms

Illustration • Query: “Madrid bomb blast case” • {case, container}: drifted topic due to inapplicable sense! • {case, suit, lawsuit}: the applicable sense • {suit, apparel}: drifted topic due to expanded term!

Query Expansion: Current Dominant Practice • Syntagmatic Expansion • Through Pseudo Relevance Feedback • We show • Multilingual PRF helps • An assisting language from the same family helps still more • Result of insight from linguistics and NLP • Disambiguation by leveraging multilinguality

Road map • Need for Pseudo Relevance Feedback (PRF) • Limitations of PRF • Need for MultiPRF • MultiPRF using English • MultiPRF using other languages • Conclusions and future directions

Language Modeling Approach to IR • Offers a principled approach to IR • Each document modeled as a probability distribution: Document Language Model • User information need modeled as a probability distribution: Query Model

Ranking Function: KL Divergence • Ranking: a matter of resolving divergence between query words (q1, q2, q3, q4, … qn) and document words (d1, d2, d3, d4, … dn) • Problem of Retrieval ↔ Problem of Estimating P(w|ΘR) and P(w|D)
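
A minimal sketch of KL-divergence ranking, to make the slide concrete. The slide does not say how P(w|D) is smoothed, so the Jelinek-Mercer smoothing, the weight `lam`, the function names and the toy documents below are all assumptions for illustration.

```python
import math
from collections import Counter

def document_model(doc_tokens, collection_model, lam=0.5):
    """Return P(w|D) as a function, using Jelinek-Mercer smoothing (an assumption)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)

    def prob(w):
        ml = counts[w] / total if total else 0.0
        # Mix the document estimate with the collection estimate so unseen words get mass.
        return (1 - lam) * ml + lam * collection_model.get(w, 1e-9)

    return prob

def kl_score(query_model, doc_prob):
    """Negative KL divergence of the document model from the query model.
    Scores are <= 0 when doc_prob is a distribution; closer to 0 is a better match."""
    score = 0.0
    for w, p_q in query_model.items():
        if p_q > 0:
            score -= p_q * math.log(p_q / doc_prob(w))
    return score

# Toy collection (hypothetical data).
docs = {
    "d1": "oil spill bird river".split(),
    "d2": "olive oil cook salt".split(),
}
all_tokens = [t for tokens in docs.values() for t in tokens]
collection_model = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

query_model = {"oil": 0.5, "spill": 0.5}  # stands in for P(w|ΘR)
ranking = sorted(
    docs,
    key=lambda d: kl_score(query_model, document_model(docs[d], collection_model)),
    reverse=True,
)
print(ranking)
```

Because the score only sums over words with non-zero query probability, a better estimate of the query model directly changes which documents rank first.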

The Challenge - Estimating Query Model ΘR • Average length of query: 2.5 words • Relevance Feedback to the rescue • User marks some documents from initial ranked list as “relevant” • Usually difficult to obtain

Pseudo-Relevance Feedback (PRF) • Query Q → IR Engine over the Document Collection → Initial Results • Assume top ‘k’ (d1 … dk) as relevant • Learn Feedback Model from Documents → Updated Query Relevance Model • Rerank Corpus with Updated Query Relevance Model → Final Results • Initial Results (Doc., Score): d1 2.4, d2 2.1, d3 1.8, d4 0.7, …, dm 0.01 • Final Results (Doc., Score): d2 2.3, d1 2.2, d3 1.8, d5 0.6, …, dm 0.01
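
The PRF loop above can be sketched roughly as follows. This is not the estimator used in the talk: the feedback model here is a plain relative-frequency estimate over the top-k documents, swapped in for simplicity, and `k`, `n_terms`, `alpha` and the toy documents are hypothetical.

```python
from collections import Counter

def feedback_model(ranked_docs, k=2, n_terms=5):
    """Estimate P(w|feedback) from the top-k documents, which are assumed relevant."""
    counts = Counter()
    for tokens in ranked_docs[:k]:
        counts.update(tokens)
    top = counts.most_common(n_terms)
    total = sum(c for _, c in top)
    return {w: c / total for w, c in top}

def expand_query(query_model, fb_model, alpha=0.5):
    """Interpolate the original query model with the feedback model."""
    vocab = set(query_model) | set(fb_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * fb_model.get(w, 0.0)
        for w in vocab
    }

# Initial ranked list (hypothetical), best document first.
ranked_docs = [
    ["oil", "spill", "bird"],
    ["oil", "river", "spill"],
    ["olive", "oil"],
]
query_model = {"oil": 0.5, "spill": 0.5}

fb = feedback_model(ranked_docs, k=2)
expanded = expand_query(query_model, fb)
print(sorted(expanded.items(), key=lambda kv: -kv[1]))
```

The expanded model would then be used to re-rank the whole collection. Terms from documents below the cutoff ("olive" here) never enter the model, which is also why a bad top-k leads to drift.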

Impact of Coverage on PRF Performance • If coverage is low, precision at the top ranks decreases • Experimented on CLEF collections in French • Same set of queries run on different collection sizes

Lexical and Semantic Non-Inclusion and Attempts to Solve It • Query: “Accession to European Union” • Stemmed query: “access europe union” • Final expanded query (from initial retrieval documents): europe, union, access, nation, russia, presid, getti, year, state • Relevant documents with terms like “Membership”, “Member”, “Country” not ranked higher • Previous Attempts • Voorhees, SIGIR ’94 used WordNet • Negative results • Random walk models • Translation Models - Zhai et al. SIGIR ’01 • Collins-Thompson and Callan, CIKM ’05 • Latent Concept Expansion • Metzler et al. SIGIR ’07

Limitations of PRF: Lack of Robustness • Query: “Olive Oil Production in Mediterranean” • Stemmed query: “oliv oil mediterranean” • Initial retrieved documents: documents about cooking • Final expanded query: oil, oliv, mediterranean, produc, cook, salt, pepper, serv, cup • Causes query drift • Previous Attempts • Refining top document set • Refining initial terms obtained through PRF • Selective query expansion • TREC Robustness Track: improving robustness

Can both Semantic Non-inclusion and Lack of Robustness be solved? • Harness “multilinguality”: take the help of a collection in a different language, called the “assisting language” • Expectation of increased robustness, since we search in two collections • An attractive proposition for languages that have poor monolingual performance due to • Resource constraints like inadequate coverage • Morphological complexity

Related Work • Gao et al. (2009) use English to improve Chinese language ranking • Demonstrate only on a subset of queries • Experimentation on a small dataset • Uses cross-document similarity

Multilingual PRF: System Flow • L1 path: Query in L1 → Initial Retrieval on L1 Index → Top ‘k’ Results → Get Feedback Model in L1 (θL1) • L2 path: Translate Query into L2 → Initial Retrieval on L2 Index → Top ‘k’ Results → Get Feedback Model in L2 (θL2) → Translate Feedback Model into L1 (θL1Trans) • Interpolate Models (θL1 and θL1Trans) → Ranking using Final Model

Recall that one set of words is compared with another set of words • Document words: d1, d2, d3, d4, … dn • Reformulated query words: q1, q2, q3, q4, … qn • The reformulated query combines: Original Query Words • Own PRF Words • PRF Words from Translation

English Lends a Helping Hand! • English used as assisting language • Good monolingual performance • Ease of processing • MultiPRF consistently and significantly outperforms the monolingual PRF baseline • (Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual PRF: English Lends a Helping Hand, SIGIR 2010, Geneva, Switzerland, July 2010.)

Why English as the assisting language? • Has the qualities of a good assisting language: • Resource-rich: more than 70% of the web is in English • Morphological ease of processing: no complex issues like word compounding etc. • Good monolingual performance: IR issues in English are well studied

Feedback Model Translation: Back Translation Step • Feedback model estimated in L2 (ΘFL2) is translated back into L1 (ΘTransL1) • Using a probabilistic bilingual dictionary from L2 to L1 • Learnt from parallel sentence-aligned corpora
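
A minimal sketch of the back translation step, assuming the translated model is obtained by weighting each L2 feedback term by its dictionary translation probabilities and renormalizing. The slide does not give the formula, so that form is an assumption. The function name, the dictionary entries and their probabilities are hypothetical toy values.

```python
def translate_feedback_model(fb_l2, dictionary):
    """Map an L2 feedback model into L1.

    Assumed form: P(f | Trans) is proportional to sum over e of T(f|e) * P(e | fb_l2),
    where T(f|e) comes from a probabilistic bilingual dictionary (L2 -> L1).
    """
    translated = {}
    for e, p_e in fb_l2.items():
        for f, t_fe in dictionary.get(e, {}).items():
            translated[f] = translated.get(f, 0.0) + t_fe * p_e
    total = sum(translated.values())
    # Renormalize, since L2 terms missing from the dictionary lose their mass.
    return {f: p / total for f, p in translated.items()} if total else {}

# Toy English feedback model and English -> German dictionary (hypothetical values).
fb_l2 = {"oil": 0.6, "bird": 0.4}
dictionary = {
    "oil": {"ol": 0.7, "erdol": 0.3},
    "bird": {"vogel": 1.0},
}

theta_trans = translate_feedback_model(fb_l2, dictionary)
print(theta_trans)
```

A single L2 term can fan out into several L1 terms, which is how related words such as "erdol" can enter the L1 model even if the L1 feedback documents never surfaced them.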

Semantically Related Terms through Feedback Model Translation (figure) • Feedback model translation step over English-French and German-English word alignments • “Nation” → Nation, Country, State, UN, United • “Flugzeug” → Aircraft, Plane, Aeroplane, Air, Flight

Semantically Related Terms through Feedback Model Translation • Translation alternatives learnt through word-level alignments • Back translation step acts as a rich source of morphological and semantic relations (Tiedemann 2001)

Linear Model Interpolation • Original feedback model and translated model are interpolated • Final model is also interpolated with query words to retain query focus • ΘMulti is used to finally re-rank documents from the corpus based on the KLD ranking function
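
A minimal sketch of the interpolation, assuming a three-way linear mix of the L1 feedback model, the translated model and the query model. The slide names the ingredients but not the weights, so `beta`, `gamma`, the function name and the toy models are assumptions.

```python
def interpolate(theta_l1, theta_trans, theta_q, beta=0.4, gamma=0.1):
    """Assumed form:
    theta_multi = (1 - beta - gamma) * theta_l1 + beta * theta_trans + gamma * theta_q
    """
    if beta < 0 or gamma < 0 or beta + gamma > 1:
        raise ValueError("beta and gamma must be non-negative and sum to at most 1")
    vocab = set(theta_l1) | set(theta_trans) | set(theta_q)
    return {
        w: (1 - beta - gamma) * theta_l1.get(w, 0.0)
        + beta * theta_trans.get(w, 0.0)
        + gamma * theta_q.get(w, 0.0)
        for w in vocab
    }

# Toy models over German stems (hypothetical probabilities).
theta_l1 = {"ol": 0.5, "rhein": 0.5}       # own PRF terms
theta_trans = {"ol": 0.6, "vogel": 0.4}    # PRF terms from translation
theta_q = {"olunfall": 0.5, "vogel": 0.5}  # original query terms

theta_multi = interpolate(theta_l1, theta_trans, theta_q)
print(sorted(theta_multi.items(), key=lambda kv: -kv[1]))
```

Terms supported by both feedback models ("ol") end up with the most mass, while the query component keeps the original terms from being crowded out.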

Experimental Setup • European languages chosen since Europarl is freely available • English chosen as assisting language • CLEF standard dataset for evaluation • Four widely differing source languages: French (Romance family), German (West Germanic), Finnish (Baltic-Finnic), Hungarian (Uralic-Ugric) • More than 600 topics (only Title field) • Google Translate used for query translation • Standard evaluation metrics (MAP, P@5, P@10, GMAP) used
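
For readers unfamiliar with the metrics named above, here is a small sketch of how P@k, average precision, MAP and GMAP are commonly computed. The `eps` floor in GMAP is an assumption (evaluation tools use a small floor so a zero-AP topic does not zero the whole mean), and the ranked list is toy data.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant document found."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(aps):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(aps) / len(aps)

def geometric_map(aps, eps=1e-5):
    """Geometric mean of per-topic AP (GMAP); rewards gains on poorly performing topics."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))

ranked = ["d2", "d1", "d3", "d5"]
relevant = {"d1", "d3"}

p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p_at_2, ap)
```

GMAP is the metric most sensitive to robustness: improving a topic from near-zero AP moves the geometric mean far more than the same absolute gain on an already good topic.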

Result: MultiPRF Gains over PRF (charts: MAP, P@5, P@10, GMAP) • Significant Gains in All Collections • Increased Robustness

Example (French, assisted by English): MAP improves from 0.1238 to 0.4324! • Query in French: “Oscar honorifique pour des réalisateurs italiens” • Query translated into English: “Honorary Oscar for Italian filmmakers” • θL1 (feedback model in L1): italien, président (president), oscar, gouvern (governor), scalfaro, spadolin • θL2 (feedback model in L2): filmmak, film, movi, tobacco, placement, produc, stallon, studio, italian, oscar, honarari • θL1Multi (after translate & interpolate): italien, oscar, film, realis, wild, cinem, honorif, president, honorair, cineast

Example (German, assisted by English): MAP improves from 0.0128 to 0.1184! • Query in German: “Ölunfälle und Vögel” • Query translated into English: “Birds and Oil Spills” • θL1 (feedback model in L1): rhein, ollunfall, fluss, ol, auen, erdreich, heizol, tank, lit, folg, oberrhein, teil • θL2 (feedback model in L2): oil, spill, bird, pipelin, river, offici, fish, lake, cleanup, state, gallon • θL1Multi (after translate & interpolate): olunfall, vogel, ol, olverschmutz (oil pollution), erdol (petroleum), olp (oil slick), rhein, mcgrath, olivenol, fluss, tier, vergoss, vogelart (bird species), olkatastroph, olpreis

Effect of Varying Query Translation • Query translation is a simpler task than MT • We used 3 different QT systems: Google Translate, a naïve SMT system, and an almost “ideal” translation • Tables (values not recovered): Annotated Performance of Systems (on 3-point scale [0, 0.5, 1]) • Performance of Different Systems on FR-01+02 • Robust to suboptimal translation • Ideal translation: huge gains over MBF

Can we do as well using another collection in same language? Do we need to go to another language?

How about using Thesaurus Based Expansion as well? • Since no thesaurus is publicly available, we learn a probabilistic thesaurus as suggested by Xu et al. (equation not shown; e is an English word) • Using the model obtained by PRF + assisting collection in the same language (symbol not shown), we expand using the thesaurus (equation not shown)

Comparison of Thesaurus-based Expansion with MultiPRF • Simply adding a thesaurus to get synonyms does not help • Thus MultiPRF combines both benefits well

Can languages other than English help? • Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual Relevance Feedback: One Language Can Help Another, Conference of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.

Performance Study of Assisting Languages • Do the results hold for languages other than English? • What are the characteristics of a good assisting language? • Can any language be used to improve the PRF performance of another language? • Can this be extended to multiple assisting languages?

Language Typology

Use and give weightage to terms from the assisting languages • Again use linear model interpolation • Query terms • PRF terms from translation • Own PRF terms

Experimental Setup • European languages chosen • Europarl corpora • CLEF dataset • Six languages from different language families: French, Spanish (Romance), German, English, Dutch (West Germanic), Finnish (Baltic-Finnic) • More than 600 topics • Google Translate used for query translation

MultiPRF with Non-English Assisting Languages

Example (German, assisted by Spanish): MAP improves from 0.062 to 0.636! • Query in German: “Bronchial asthma” • Query translated into Spanish: “El asma bronquial” (bronchial asthma) • θL1 (feedback model in L1): chronisch (chronic), pet, athlet (athlete), erkrank (ill), gesund (healthy), tuberkulos (tuberculosis), patient, reis (rice), person • θL2 (feedback model in L2): asma, bronquial, contamin, ozon, cient, enfermed, alerg, alergi, air • θL1Multi (after translate & interpolate): asthma, allergi, krankheit (disease), allerg (allergenic), chronisch, hauterkrank (illness of skin), arzt (doctor), erkrank (ill)

Example (French, assisted by Dutch): MAP improves from 0.145 to 0.357! • Query in French: “Ingénierie Génétique” (genetic engineering) • Query translated into Dutch: “Genetische Manipulatie” • θL1 (feedback model in L1): développ (developed), évolu (evolved), product, produit (product), moléculair (molecular) • θL2 (feedback model in L2): genetisch, manipulatie, exxon, dier (animal), visser (fisherman), gen • θL1Multi (after translate & interpolate): génet, ingénier, manipul, animal, pêcheur (fisherman), développ (developed), gen

Results

Dependence on Monolingual Performance

Back Translation Performance improves within the same family

More than one assisting language • Tried parallel composition for two assisting languages • Uniform interpolation weights used • Exhaustively tried all 60 combinations • Improvements reported over best performing PRF of L1 or L2
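
One reading of the count of 60 that is consistent with the six-language setup described earlier: each of the 6 source languages is paired with every unordered pair drawn from the remaining 5, giving 6 × C(5, 2) = 60 configurations. The sketch below enumerates them. Treating "parallel composition" as an equal-weight mix of the two translated feedback models is an assumption, and `parallel_compose` is a hypothetical helper.

```python
from itertools import combinations

languages = ["French", "Spanish", "German", "English", "Dutch", "Finnish"]

# Every (source language, pair of assisting languages) configuration.
configs = [
    (source, pair)
    for source in languages
    for pair in combinations([l for l in languages if l != source], 2)
]

def parallel_compose(model_a, model_b):
    """Mix two translated feedback models with uniform weights (0.5 each)."""
    vocab = set(model_a) | set(model_b)
    return {w: 0.5 * model_a.get(w, 0.0) + 0.5 * model_b.get(w, 0.0) for w in vocab}

print(len(configs))  # 6 sources x C(5, 2) assisting pairs = 60
```

The composed model would then take the place of the single translated model in the interpolation with the source language's own feedback model.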
