
Our IR Strategy

Multilingual Search using Query Translation and Collection Selection
Jacques Savoy, Pierre-Yves Berger, University of Neuchatel, Switzerland, www.unine.ch/info/clef/
1. Effective monolingual IR system (combining indexing schemes?) 2. Combining query translation tools


Presentation Transcript


  1. Multilingual Search using Query Translation and Collection Selection
     Jacques Savoy, Pierre-Yves Berger
     University of Neuchatel, Switzerland
     www.unine.ch/info/clef/

  2. Our IR Strategy
     1. Effective monolingual IR system (combining indexing schemes?)
     2. Combining query translation tools
     3. Effective collection selection and merging strategies

  3. Effective Monolingual IR
     • Define a stopword list
     • Have a "good" stemmer
     • Removing only the feminine and plural suffixes (inflections)
     • Focus on nouns and adjectives
     See www.unine.ch/info/clef/

  4. Indexing Units
     • For Indo-European languages: set of significant words
     • CJK: bigrams, with stoplist, without stemming
     • For Finnish, we used 4-grams.
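
A minimal sketch of the character n-gram indexing units mentioned on this slide (bigrams for CJK, 4-grams for Finnish). The function name `char_ngrams` is hypothetical, and two details are assumptions not stated in the talk: n-grams are taken inside each whitespace-separated token, and tokens shorter than n are kept whole. The CJK stoplist is not modeled here.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated token.

    Assumption: n-grams do not span token boundaries, and a token
    shorter than n is indexed as-is.
    """
    grams = []
    for token in text.lower().split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


# Finnish compound from the talk, indexed as 4-grams
finnish_units = char_ngrams("rakkauskirje", 4)
# -> ['rakk', 'akka', 'kkau', 'kaus', 'ausk', 'uski', 'skir', 'kirj', 'irje']

# Unsegmented CJK text indexed as bigrams (illustrative string)
cjk_units = char_ngrams("情報検索", 2)
# -> ['情報', '報検', '検索']
```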

  5. IR Models
     Probabilistic
     • Okapi
     • Prosit, or deviation from randomness
     Vector-space
     • Lnu-ltc
     • tf-idf (ntc-ntc)
     • binary (bnn-bnn)
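
As an illustration of the simplest weighted vector-space scheme listed above, here is a sketch of ntc weighting in SMART notation: natural term frequency (n), idf (t), cosine normalization (c), applied to both document and query. The helper names, the toy document frequencies and the use of the natural logarithm are assumptions for the example, not details from the talk.

```python
import math
from collections import Counter


def ntc_vector(tokens, doc_freq, n_docs):
    """Weight terms as tf * idf, then normalize the vector to unit length."""
    tf = Counter(tokens)
    weights = {
        term: freq * math.log(n_docs / doc_freq[term])
        for term, freq in tf.items()
        if term in doc_freq  # ignore terms unknown to the collection
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights
    return {term: w / norm for term, w in weights.items()}


def inner_product(query_vec, doc_vec):
    """With unit-length vectors, the inner product is the cosine similarity."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


# Toy collection statistics (hypothetical numbers)
doc_freq = {"tour": 50, "france": 200, "vainqueur": 10}
doc_vec = ntc_vector(["tour", "france", "tour"], doc_freq, n_docs=1000)
query_vec = ntc_vector(["tour", "vainqueur"], doc_freq, n_docs=1000)
score = inner_product(query_vec, doc_vec)
```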

  6. Monolingual Evaluation

  7. Monolingual Evaluation

  8. Monolingual Evaluation

  9. Problems with Finnish
     Finnish can be characterized by:
     1. A large number of cases (12+)
     2. Agglutinative (saunastani) and compound construction (rakkauskirje)
     3. Variations in the stem (matto, maton, mattoja, …)
     Do we need a deeper morphological analysis?

  10. Improvement using…
      Relevance feedback (Rocchio)?
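
A minimal sketch of blind (pseudo-relevance) feedback in the Rocchio style: the query vector is moved toward the centroid of the top-ranked documents, which are assumed relevant. The function name and the values of `alpha`, `beta` and `top_terms` are illustrative assumptions, not the settings used in the experiments, and no negative (non-relevant) component is included.

```python
from collections import Counter


def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, top_terms=10):
    """Return alpha * query + beta * centroid(feedback_docs).

    query and each feedback document are dicts mapping term -> weight.
    Only the top_terms strongest centroid terms are added to the query.
    """
    expanded = Counter({term: alpha * w for term, w in query.items()})
    centroid = Counter()
    for doc in feedback_docs:
        for term, w in doc.items():
            centroid[term] += w / len(feedback_docs)
    for term, w in centroid.most_common(top_terms):
        expanded[term] += beta * w
    return dict(expanded)


query = {"tour": 1.0, "france": 1.0}
top_ranked = [
    {"tour": 2.0, "vainqueur": 1.0},
    {"france": 1.0, "vainqueur": 3.0},
]
new_query = rocchio_expand(query, top_ranked, alpha=1.0, beta=0.5)
```

In this toy run the term "vainqueur" enters the query although it was not in the original request, which is the intended effect of the expansion.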

  11. Monolingual Evaluation

  12. Relevance Feedback

  13. Improvement using…
      Data fusion approaches (see the paper)
      • Improves the MAP with the English, Finnish and Portuguese corpora
      • Hurts the MAP with the French and Russian collections

  14. Bilingual IR
      1. Machine-readable dictionaries (MRD), e.g., Babylon
      2. Machine translation services (MT), e.g., FreeTranslation, BabelFish
      3. Parallel and/or comparable corpora (not used in this evaluation campaign)

  15. Bilingual IR
      “Tour de France Winner / Who won the Tour de France in 1995?”
      (Babylon1) “France, vainqueur / Organisation Mondiale de la Santé, le, France,* 1995”
      (Right) “Le vainqueur du Tour de France / Qui a gagné le tour de France en 1995 ?”

  16. Bilingual EN->FR/FI/RU/PT

  17. Bilingual EN->FR/FI/RU/PT

  18. Bilingual IR: Improvement …
      • Query combination (concatenation)
      • Relevance feedback (yes for PT, FR, RU; no for FI)
      • Data fusion (yes for FR; no for RU, FI, PT)

  19. Multilingual EN->EN FI FR RU
      1. Create a common index: document translation (DT)
      2. Search on each language and merge the result lists (QT)
      3. Mix QT and DT
      4. No translation

  20. Merging Problem EN FI FR RU <–– Merging

  21. Merging Problem
      Rank   FR            RU            EN
      1      FR043 0.8     RU050 1.6     EN120 1.2
      2      FR120 0.75    RU005 1.1     EN200 1.0
      3      FR055 0.65    RU120 0.9     EN050 0.7
      4      …             …             EN705 0.6
      …                                  …

  22. Merging Strategies
      • Round-robin (baseline)
      • Raw-score merging
      • Normalize
      • Biased round-robin
      • Z-score
      • Logistic regression

  23. Merging Problem
      Rank   FR            RU            EN
      1      FR043 0.8     RU050 1.6     EN120 1.2
      2      FR120 0.75    RU005 1.1     EN200 1.0
      3      FR055 0.65    RU120 0.9     EN050 0.7
      4      …             …             EN705 0.6
      …                                  …
      Merged list: 1 RU…, 2 EN…, 3 RU…, 4 …
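
The two simplest strategies can be sketched on the slide's example lists. This assumes the scores read as 1.6, 1.1, 0.9 for Russian and 1.2, 1.0, 0.7, 0.6 for English once the rank numbers are separated from the scores; the function names are hypothetical. Round-robin interleaves the lists by rank, while raw-score merging pools all documents and sorts on the original scores, even though scores from different collections are not directly comparable.

```python
from itertools import zip_longest

# Ranked result lists per collection: (document id, retrieval score)
runs = {
    "FR": [("FR043", 0.8), ("FR120", 0.75), ("FR055", 0.65)],
    "RU": [("RU050", 1.6), ("RU005", 1.1), ("RU120", 0.9)],
    "EN": [("EN120", 1.2), ("EN200", 1.0), ("EN050", 0.7), ("EN705", 0.6)],
}


def round_robin(runs):
    """Take the rank-1 item of every list, then the rank-2 items, and so on."""
    merged = []
    for tier in zip_longest(*runs.values()):
        merged.extend(item for item in tier if item is not None)
    return merged


def raw_score(runs):
    """Pool all lists and sort by the unnormalized retrieval score."""
    pooled = [item for run in runs.values() for item in run]
    return sorted(pooled, key=lambda item: item[1], reverse=True)


rr_ids = [doc_id for doc_id, _ in round_robin(runs)]
raw_ids = [doc_id for doc_id, _ in raw_score(runs)]
```

With these numbers the raw-score list starts RU, EN, RU, which agrees with the merged ranking shown on the slide; the French documents are pushed down only because their scores are on a smaller scale.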

  24. Merging Strategies
      • Round-robin (baseline)
      • Raw-score merging
      • Normalize
      • Biased round-robin
      • Z-score
      • Logistic regression

  25. Test-Collection CLEF-2004
                   EN        FI        FR        RU
      size         154 MB    137 MB    244 MB    68 MB
      doc          56,472    55,344    90,261    16,716
      topic        42        45        49        34
      rel./query   8.9       9.2       18.7      3.6

  26. Merging Strategies
      • Round-robin (baseline)
      • Raw-score merging
      • Normalize
      • Biased round-robin
      • Z-score
      • Logistic regression

  27. Z-Score Normalization
      Compute the mean m and standard deviation s of the scores in a result list.
      New score = ((old score - m) / s) + d
      Old scores: 1 EN120 1.2, 2 EN200 1.0, 3 EN050 0.7, 4 EN765 0.6, …
      New scores: 1 EN120 7.0, 2 EN200 5.0, 3 EN050 2.0, 4 EN765 1.0, …
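
A short sketch of the formula on this slide, applied to one result list before merging. The slide does not say how d is chosen or whether s is the population or sample standard deviation; the code below assumes the population form and treats d as a constant offset passed in by the caller. The function name is hypothetical.

```python
from statistics import mean, pstdev


def z_score_normalize(run, d=0.0):
    """Map each score to ((score - m) / s) + d for one ranked list.

    run is a list of (document id, score) pairs. If all scores are equal
    (s == 0), every document simply receives d.
    """
    scores = [score for _, score in run]
    m = mean(scores)
    s = pstdev(scores)  # assumption: population standard deviation
    if s == 0:
        return [(doc_id, d) for doc_id, _ in run]
    return [(doc_id, (score - m) / s + d) for doc_id, score in run]


english = [("EN120", 1.2), ("EN200", 1.0), ("EN050", 0.7), ("EN765", 0.6)]
normalized = z_score_normalize(english, d=0.0)
```

The output will not match the slide's new scores exactly, since the slide computes m and s over the full list, of which only the first four entries are shown. The ranking inside the list is unchanged; only the scale becomes comparable across collections.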

  28. MLIR Evaluation
      • Condition A (meta-search): Prosit/Okapi (+PRF, data fusion for FR, EN)
      • Condition C (same SE): Prosit only (+PRF)

  29. Multilingual Evaluation

  30. Collection Selection
      General idea:
      1. Use logistic regression to compute a probability of relevance for each document;
      2. Sum the document scores of the 15 best-ranked documents;
      3. If this sum ≥ d, consider the whole list; otherwise select only the first m documents.
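
The selection rule above can be sketched as follows, assuming step 1 has already produced a probability of relevance for each document (the logistic regression itself is not shown). The function name and the values chosen for the threshold d and the cut-off m are hypothetical; the talk does not give them.

```python
def select_from_collection(ranked, d, m, k=15):
    """Keep the whole list if the collection looks promising, else a short prefix.

    ranked: list of (document id, probability of relevance), best first.
    d:      threshold on the summed probabilities of the k best documents.
    m:      number of documents kept when the sum falls below d.
    """
    evidence = sum(prob for _, prob in ranked[:k])
    if evidence >= d:
        return ranked
    return ranked[:m]


# Two toy result lists of 20 documents each (illustrative probabilities)
strong = [(f"FR{i:03d}", 0.5) for i in range(20)]
weak = [(f"RU{i:03d}", 0.1) for i in range(20)]

kept_strong = select_from_collection(strong, d=5.0, m=3)
kept_weak = select_from_collection(weak, d=5.0, m=3)
```

Here the first collection passes the threshold and contributes all 20 documents to the merge, while the second contributes only its first 3.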

  31. Multilingual Evaluation

  32. Conclusion (monolingual)
      • Finnish is still a difficult language
      • Blind-query expansion: different patterns with different IR models
      • Data fusion (?)

  33. Conclusion (bilingual)
      • Translation resources are not available for all language pairs
      • Improvement by combining query translations: yes, but can we have a better combination scheme?

  34. Conclusion (multilingual)
      • Selection and merging are still hard problems
      • Overfitting is not always the best choice (see results with Condition C)

  35. Multilingual Search using Query Translation and Collection Selection
      Jacques Savoy, Pierre-Yves Berger
      University of Neuchatel, Switzerland
      www.unine.ch/info/clef/
