
Answer Validation Exercise - AVE QA subtrack at Cross-Language Evaluation Forum


Presentation Transcript


  1. Answer Validation Exercise - AVE QA subtrack at Cross-Language Evaluation Forum. Thanks to… Bernardo Magnini, Danilo Giampiccolo, Pamela Forner, Petya Osenova, Christelle Ayache, Bogdan Sacaleanu, Diana Santos, Juan Feu, Ido Dagan, … UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo

  2. What? Answer Validation Exercise: validate the correctness of the answers given by real QA systems... ...the answers of participants at CLEF QA 2006. Why? To give feedback on a single QA module, improve QA systems' performance, improve systems' self-scoring, help humans in the assessment of QA systems' output, develop criteria for collaborative QA systems, ...

  3. How? By turning it into an RTE exercise. The QA system receives a question and returns an exact answer plus a supporting snippet & doc ID. The question and the exact answer are turned into affirmative form to build the Hypothesis; the supporting snippet (several sentences, <500 bytes) is the Text. If the text semantically entails the hypothesis, then the answer is expected to be correct.

  4. Example • Question: Who is the President of Mexico? • Answer (obsolete): Vicente Fox • Hypothesis: Vicente Fox is the President of Mexico • Supporting Text: “...President Vicente Fox promises a more democratic Mexico...” • Exercise • Text entails Hypothesis? • Answer: YES | NO
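A minimal sketch of the transformation described in slides 3-4, assuming a toy pattern-based rewrite (this is not AVE code; the class and function names are hypothetical):

from dataclasses import dataclass

@dataclass
class THPair:
    text: str        # supporting snippet returned by the QA system (<500 bytes)
    hypothesis: str  # question + answer rewritten as an affirmative statement

def build_hypothesis(question: str, answer: str) -> str:
    """Toy rewrite covering the pattern in the example: 'Who is X?' + 'Y' -> 'Y is X'."""
    q = question.strip().rstrip("?")
    if q.lower().startswith("who is "):
        return f"{answer} is {q[len('who is '):]}"
    # Fallback when no pattern applies: plain concatenation.
    return f"{answer}: {q}"

# Usage with the example from the slide:
pair = THPair(
    text="...President Vicente Fox promises a more democratic Mexico...",
    hypothesis=build_hypothesis("Who is the President of Mexico?", "Vicente Fox"),
)
print(pair.hypothesis)  # Vicente Fox is the President of Mexico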

  5. Looking for robust systems • Hypotheses are built semi-automatically from the systems' answers • Some answers are correct and exact • Many are too large, too short, too wrong • Many hypotheses have • Wrong syntax but are understandable • Wrong syntax and are not understandable • Wrong semantics

  6. So, the exercise • Return an entailment value (YES|NO) for each given text-hypothesis pair • Results were evaluated against the QA human assessments • Subtasks: English, Spanish, Italian, Dutch, French, German, Portuguese and Bulgarian
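To make the required output concrete, here is an illustrative sketch of a system that emits an entailment value per pair; it is a crude lexical-overlap judge, not any participant's method, and the 0.8 threshold is an arbitrary assumption:

def judge_entailment(text: str, hypothesis: str, threshold: float = 0.8) -> str:
    """Return "YES" if most hypothesis tokens also occur in the text, else "NO"."""
    text_tokens = set(text.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    overlap = len(hyp_tokens & text_tokens) / max(len(hyp_tokens), 1)
    return "YES" if overlap >= threshold else "NO"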

  7. Collections Available for CLEF participants at nlp.uned.es/QA/ave/

  8. Evaluation • Collections are not balanced • Approach: detect whether there is enough evidence to accept an answer • Measures: precision, recall and F over the YES pairs (where the text entails the hypothesis) • Baseline system: accept all answers (always answer YES)
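A sketch of the evaluation just described, assuming gold labels taken from the QA human assessments; precision, recall and F are computed only over the YES pairs, and the accept-all baseline is shown for comparison (variable names are illustrative):

from typing import List, Tuple

def evaluate_yes(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    tp = sum(1 for g, p in zip(gold, predicted) if g == "YES" and p == "YES")
    pred_yes = sum(1 for p in predicted if p == "YES")
    gold_yes = sum(1 for g in gold if g == "YES")
    precision = tp / pred_yes if pred_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Accept-all baseline: always answer YES, so recall is 1.0 and precision equals the
# proportion of YES pairs in the (unbalanced) collection.
gold = ["YES", "NO", "NO", "YES", "NO"]
print(evaluate_yes(gold, ["YES"] * len(gold)))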

  9. Participants and runs • Runs per group: Fernuniversität in Hagen (2), Language Computer Corporation (2), U. Rome "Tor Vergata" (2), U. Alicante (Kozareva) (13), U. Politecnica de Valencia (1), U. Alicante (Ferrández) (2), LIMSI-CNRS (1), U. Twente (10), UNED (Herrera) (2), UNED (Rodrigo) (1), ITC-irst (1), R2D2 project (1) • Runs per language: DE 5, EN 11, ES 9, FR 4, IT 3, NL 4, PT 2 • Total: 38 runs

  10. Results

  11. Conclusions • Developed methodologies • Build collections from QA responses • Evaluate in a chain with the QA Track • New testing collections for the QA and RTE communities • In 7 languages, not only English • Evaluation in a real environment • Real systems' outputs -> AVE input

  12. Conclusions • Reformulating Answer Validation as a Textual Entailment problem is feasible • It introduces 4% error (in the semi-automatic generation of the collection) • Good participation: 11 systems, 38 runs, 7 languages • Systems that reported the use of logic obtained the best results
