
MIRACLE: Multilingual Information RetrievAl for the CLEF campaign


Presentation Transcript


  1. MIRACLE: Multilingual Information RetrievAl for the CLEF campaign
  DAEDALUS – Data, Decisions and Language, S.A. www.daedalus.es
  Universidad Carlos III de Madrid (UC3M) www.uc3m.es
  Universidad Politécnica de Madrid (UPM) www.upm.es
  Partially funded by the IST-2001-32174 (OmniPaper) and CAM 07T/0055/2003 projects

  2. ImageCLEF 2003
  Participation in:
  • Monolingual task:
    • English -> English: 5 different runs
  • Bilingual tasks:
    • Spanish -> English: 6 runs
    • German -> English: 6 runs
    • French -> English: 4 runs
    • Italian -> English: 4 runs
  • TOTAL: 25 runs

  3. System Architecture
  • IR engine: XAPIAN (based on the probabilistic IR model)
  • Filtering components: text and word extraction, topic extraction, word count, statistics calculation
  • Linguistic components: tokenizers, stemmers (based on the Porter algorithm), a German word decompounding module, stopword filters
  • Translation components: API to FreeTranslation.com (full text) and the ERGANE dictionary (word by word)
  • Semantic components: synonym expansion for English (WordNet)
  • Our idea is to couple these components in different ways, in order to evaluate different approaches and compare the influence of each component on the precision/recall (P/R) of the IR process for each language (a coupling sketch follows below)
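
  The slides do not show how these components are actually wired together, so the following is only a minimal, hypothetical Python sketch of such a coupling; the names (tokenize, make_pipeline, and the step functions) are illustrative and do not correspond to the real MIRACLE modules.

```python
from typing import Callable, List

Step = Callable[[List[str]], List[str]]    # a component maps a token list to a token list

def tokenize(text: str) -> List[str]:
    """Trivial tokenizer stand-in for the MIRACLE tokenizer component."""
    return text.lower().split()

def make_pipeline(*steps: Step) -> Callable[[str], List[str]]:
    """Couple filtering/linguistic components into a single processing chain."""
    def run(text: str) -> List[str]:
        tokens = tokenize(text)
        for step in steps:
            tokens = step(tokens)
        return tokens
    return run

# Example couplings mirroring the runs described later (the step functions
# stopword_filter, synonym_expand and stem would be defined elsewhere):
# baseline = make_pipeline(stopword_filter, stem)
# expanded = make_pipeline(stopword_filter, synonym_expand, stem)
```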

  4. IR Process: Index
  • All the images are indexed in the same XAPIAN collection
  • For each image, the HEADLINE and TEXT fields are used (without tags and IDs); a sketch of this step follows below
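
  A minimal sketch, using the Xapian Python bindings, of what indexing the HEADLINE and TEXT fields into a single collection could look like. The database name, the language code and the index_image() helper are assumptions for illustration; the slides do not describe the actual indexing code.

```python
import xapian

db = xapian.WritableDatabase("imageclef_collection", xapian.DB_CREATE_OR_OPEN)
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem("en"))     # Porter-style English stemmer

def index_image(image_id: str, headline: str, text: str) -> None:
    """Index one image caption: HEADLINE and TEXT only, tags and IDs already stripped."""
    doc = xapian.Document()
    termgen.set_document(doc)
    termgen.index_text(headline)
    termgen.index_text(text)
    doc.set_data(image_id)                 # keep the image identifier for the result list
    db.add_document(doc)
```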

  5. IR Process: Retrieval
  • Different runs, basically consisting of:
    • Creating the query from the topic
    • Executing the query in the XAPIAN system
    • Retrieving the 1000 best results (ranked list)
  • For each topic, only the TITLE field and the first translation variant are used (a retrieval sketch follows below)
  • Evaluation: four relevance sets (2 judges)
    • Union (any assessor) / Intersection (both assessors)
    • Strict (relevant only) / Relaxed (also partially relevant)
  • In our evaluation we have considered intersection-strict, which is the most restrictive
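
  A hedged sketch of the retrieval step with the Xapian Python bindings, assuming the collection built in the previous sketch; the retrieve() helper and its parameters are illustrative, not the actual MIRACLE code.

```python
import xapian

def retrieve(db_path: str, topic_title: str, max_hits: int = 1000):
    """Build a query from the topic TITLE and return the ranked top hits."""
    db = xapian.Database(db_path)
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    qp.set_database(db)
    query = qp.parse_query(topic_title)    # terms are combined with OR by default
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    mset = enquire.get_mset(0, max_hits)   # ranked list of (up to) 1000 results
    return [(match.docid, match.weight) for match in mset]
```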

  6. Monolingual Runs (en->en)
  • OR:
    • Word extraction in topic title -> stopword filtering -> stemming -> weighted OR operator with stems
    • Intended as the baseline run
  • ORlem:
    • Word extraction in topic title -> stopword filtering -> stemming -> weighted OR operator with stems and original words
    • Idea: measure the effect of stemming
  • ORlemexp:
    • Word extraction in topic title -> stopword filtering -> synonym expansion -> stemming -> weighted OR operator with stems, original words and synonyms
    • Idea: measure the effect of increasing recall despite the penalty in precision
  • Doc:
    • Index the topic title as a document -> retrieve similar documents
    • Idea: confirm that this approach is similar to the vector space model
  • ORrf:
    • Query with OR operator with stems -> top 25 docs -> 250 most important terms -> new weighted OR query
    • Idea: measure the effect of the simplest blind relevance feedback
  • (A sketch of the term processing behind these runs follows below)
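
  The runs differ mainly in how the topic title is turned into query terms. The sketch below uses NLTK's English stopword list, Porter stemmer and WordNet as assumed stand-ins for the corresponding MIRACLE components (the slides do not say which libraries were actually used), and builds the weighted OR query with Xapian.

```python
import xapian
from nltk.corpus import stopwords, wordnet     # requires the NLTK corpora to be downloaded
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def query_terms(title: str, keep_original: bool = False, expand: bool = False) -> set:
    """Stopword filtering, optional WordNet synonym expansion, Porter stemming."""
    words = [w.lower() for w in title.split() if w.lower() not in stop]
    terms = set(words) if keep_original else set()
    if expand:
        for w in words:
            for syn in wordnet.synsets(w):
                # multi-word lemmas keep their underscores in this sketch
                terms.update(lemma.name().lower() for lemma in syn.lemmas())
    terms.update(stemmer.stem(t) for t in list(terms) + words)
    return terms

def weighted_or_query(terms) -> xapian.Query:
    # OP_OR accumulates the weight of every matching term, which is what lets it
    # play a role similar to term weights in the vector space model.
    return xapian.Query(xapian.Query.OP_OR, [xapian.Query(t) for t in terms])

# OR       ~ weighted_or_query(query_terms(title))
# ORlem    ~ weighted_or_query(query_terms(title, keep_original=True))
# ORlemexp ~ weighted_or_query(query_terms(title, keep_original=True, expand=True))
```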

  7. P-R curve (en->en)
  • The best runs have very high precision values ("the set of relevant documents is not complete")
  • Relevance feedback is the worst ("noise due to inappropriate parameter values: 250 terms, when the mean length of an image description is about 50 words")
  • Any kind of term expansion reduces precision ("low number of documents, existence of ambiguity")

  8. Average Precision (en->en)
  • The best runs are the weighted OR query and Doc ("in the probabilistic IR model, a weighted OR plays a role similar to term weights in the vector space model")
  • Evaluating with the other relevance sets gives a slight increase in overall precision

  9. Bilingual Runs (fr,ge,it,sp->en)
  • TOR1:
    • Topic title -> FreeTranslation -> word extraction -> stopword filtering -> stemming -> weighted OR operator with stems
    • Similar to the monolingual OR run, intended as the baseline run
  • TOR3:
    • Topic title -> FreeTranslation + ERGANE -> word extraction -> stopword filtering -> stemming -> weighted OR operator with stems
    • Idea: improve translation by combining different sources
  • Tdoc:
    • Topic title -> FreeTranslation -> index as document -> retrieve similar documents
  • TOR3exp:
    • Topic title -> FreeTranslation + ERGANE -> word extraction -> stopword filtering -> synonym expansion -> stemming -> weighted OR operator with stems, original words and synonyms
  • TOR3full:
    • The same as TOR3, but also including the topic title in the original language
    • Idea: evaluate the effect of text that cannot be translated or is incorrectly translated
  • TOR3fullexp:
    • Combination of TOR3exp and TOR3full
  • (A sketch of the TOR3-style translation combination follows below)
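
  A minimal sketch of the TOR3-style query construction, which merges a full-text translation with a word-by-word dictionary translation. Both translate_full_text() and ERGANE_SAMPLE are hypothetical placeholders: the real runs called the FreeTranslation.com service and the ERGANE dictionary, whose interfaces are not described in the slides.

```python
ERGANE_SAMPLE = {"nieve": ["snow"], "montaña": ["mountain"]}   # toy excerpt, not the real dictionary

def translate_full_text(title: str) -> str:
    # Placeholder: the actual runs sent the full title to FreeTranslation.com.
    return title

def word_by_word(title: str) -> list:
    """Word-by-word dictionary translation (ERGANE-style lookup)."""
    out = []
    for word in title.lower().split():
        out.extend(ERGANE_SAMPLE.get(word, []))
    return out

def tor3_terms(title: str, keep_original: bool = False) -> list:
    terms = translate_full_text(title).lower().split()
    terms += word_by_word(title)            # TOR3: merge both translation sources
    if keep_original:                       # TOR3full: also keep the untranslated title
        terms += title.lower().split()
    return terms
```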

  10. P-R curve (fr,ge,it,sp->en)

  11. P-R curve (fr,ge,it,sp->en)
  • Although all results are similar, TOR1 and Tdoc are the best ones in all cases
  • Using word-by-word translation with ERGANE has proved to be worse: either the translation is not adequate, or the expansion makes the queries broader and thus reduces precision
  • Again, as in the monolingual task, any kind of term expansion reduces precision if ambiguity is not handled
  • Spanish, German and Italian give similar results, but French is slightly worse: either FreeTranslation is worse for French or the French topics are harder to translate
  • Spanish -> English gives our best individual results
  • Comparing bilingual and monolingual results, a difference of about 10-15% arises (similar to our participation in the CLEF tasks this year)

  12. Average Precision (fr,ge,it,sp->en)

  13. Conclusions and Future Work
  • As newcomers to CLEF, we have worked hard to build the infrastructure needed to execute different runs easily
  • The simplest approaches have proved to be the best when the ambiguity introduced by term expansion is not handled
  • Next time:
    • POS filtering for syntactic disambiguation, to handle ambiguity
    • Evaluate the effect of using stemming (and its quality) or not in highly inflected languages like Spanish, French and Italian
    • More focus on Spanish: better stemmer, better synonym expansion (directly in Spanish)
    • Evaluate the quality of the translation engines with respect to the IR process
