Is That Your Final Answer? The Role of Confidence in Question Answering Systems

Presentation Transcript


  1. Is That Your Final Answer? The Role of Confidence in Question Answering Systems. Robert Gaizauskas¹ and Sam Scott². ¹Natural Language Processing Group, Department of Computer Science, University of Sheffield. ²Centre for Interdisciplinary Studies, Carleton University

  2. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  3. Question Answering: A New Challenge in IR • Traditionally information retrieval systems are viewed as systems that return documents in response to a query • Such systems are better termed document retrieval systems • Once a document is returned the user must search it to find the required info • Acceptable if docs returned are short, not too many are returned, and the info need is general • Not acceptable if many docs are returned, docs are very long, or the info need is very specific • Recently (1999, 2000) the TREC Question Answering (QA) track has been designed to address this issue • As construed in TREC, QA systems take natural language questions and a text collection as input and return specific answers (literal text strings) from documents in the text collection

  4. QA: An (Incomplete) Historical Perspective Question answering is not a new topic: • Erotetic logic (Harrah, 1984; Belnap and Steel, 1976) • Deductive question answering work in AI (Green, 1969; Schubert, 1986) • Conceptual Theories of QA (Lehnert, 1977) • Natural language front-ends to databases (Copestake, 1990; DARPA ATIS evaluations)

  5. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  6. The TREC QA Track: Task Definition • Inputs: • 4GB newswire texts (from the TREC text collection) • File of natural language questions (200 TREC-8/700 TREC-9) e.g. Where is the Taj Mahal? How tall is the Eiffel Tower? Who was Johnny Mathis’ high school track coach? • Outputs: • Five ranked answers per question, including pointer to source document • 50 byte category • 250 byte category • Up to two runs per category per site • Limitations: • Each question has an answer in the text collection • Each answer is a single literal string from a text (no implicit or multiple answers)

  7. The TREC QA Track: Metrics and Scoring • The principal metric is Mean Reciprocal Rank (MRR) • Correct answer at rank 1 scores 1 • Correct answer at rank 2 scores 1/2 • Correct answer at rank 3 scores 1/3 • … Sum over all questions and divide by the number of questions • More formally: MRR = (1/N) Σ r_i, summing over i = 1…N, where N = # questions and r_i = the reciprocal of the best (lowest) rank assigned by the system at which a correct answer is found for question i, or 0 if no correct answer was found • Judgements are made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
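As a concrete illustration of the metric just defined, here is a minimal Python sketch of the MRR computation (the data layout is assumed purely for illustration):

```python
def mean_reciprocal_rank(ranked_judgements):
    """ranked_judgements: one list per question, marking whether the answer
    proposed at each rank (1..5) was judged correct."""
    total = 0.0
    for judgements in ranked_judgements:
        reciprocal = 0.0
        for rank, correct in enumerate(judgements, start=1):
            if correct:                 # best (lowest) rank with a correct answer
                reciprocal = 1.0 / rank
                break
        total += reciprocal
    return total / len(ranked_judgements)

# correct at rank 1, correct at rank 3, no correct answer among the five
print(mean_reciprocal_rank([[True], [False, False, True], [False] * 5]))
# -> (1 + 1/3 + 0) / 3, roughly 0.444
```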

  8. The Potential of NLP for Question Answering • NLP has failed to deliver significant improvements in the document retrieval task. Will the same be true of QA? • Must depend on the definition of the task • Current TREC QA task is best construed as micro passage retrieval • There are a number of linguistic phenomena relevant to QA which suggest that NLP ought to be able to help, in principle. • But, it also now seems clear from TREC-9 results that NLP techniques do improve the effectiveness of QA systems in practice.

  9. The Potential of NLP for Question Answering • Coreference: Part of the information required to answer a question may occur in one sentence, while the rest occurs in another, linked via an anaphor. E.g. Question: How much did Mercury spend on advertising in 1993? Text: Mercury… Last year the company spent £12m on advertising. • Deixis: References (possibly relative) to here and now may need to be correctly interpreted. E.g. to answer the preceding question requires interpreting last year as 1993 via the date-line of the text (1994). • Grammatical knowledge: Difference in grammatical role can be of crucial importance. E.g. Question: Which company took over Microsoft? cannot be answered from the Text: Microsoft took over Entropic.

  10. The Potential of NLP for Question Answering (cont) • Semantic knowledge: Entailments based on lexical semantics may need to be computed. E.g. To answer the Question: At what age did Rossini stop writing opera? using the Text: Rossini … did not write another opera after he was 35. requires knowing that stopping X at time t means not doing X after t. • World knowledge: World knowledge may be required to interpret linguistic expressions. E.g. To answer the Question: In which city is the Eiffel Tower? using the Text: The Eiffel Tower is in Paris. but not the Text: The Eiffel Tower is in France. requires the knowledge that Paris is a city, France a country.

  11. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  12. Sheffield QA System Architecture Overall objective is to use: • IR system as fast filter to select small set of documents with high relevance to query from the initial, large text collection • IE system to perform slow, detailed linguistic analysis to extract answer from limited set of docs proposed by IR system
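A schematic, runnable sketch of this two-stage design. The ir_filter and ie_extract functions are toy stand-ins for Okapi and QA-LaSIE, and the word-overlap relevance score is an illustrative assumption, not how either system actually works:

```python
def ir_filter(question, documents, top_n=20):
    # fast filter: rank the whole collection by crude word overlap with the question
    q_words = set(question.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:top_n]

def ie_extract(question, candidates):
    # stand-in for the slow, detailed linguistic analysis over the small candidate
    # set; here it just emits a fixed score and the first 50 bytes of each candidate
    return [(0.5, c[:50]) for c in candidates]

docs = ["The Eiffel Tower is in Paris and is about 300 metres tall.",
        "Microsoft took over Entropic."]
question = "How tall is the Eiffel Tower?"
print(ie_extract(question, ir_filter(question, docs, top_n=1)))
```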

  13. Okapi • Used “off the shelf” – available from http://www.soi.city.ac.uk/research/cisr/okapi/okapi.html • Based on the probabilistic retrieval model (Robertson and Sparck Jones, 1976) • Used the passage retrieval capabilities of Okapi • Passage retrieval parameters: Min. passage: 1 para; Max. passage: 3 paras; Para step unit: 1 (arrived at by experimentation on TREC-8 data) • Examined trade-offs between: • number of documents and “answer loss”: 184/198 questions had an answer in the top 20 full docs; 160/198 in the top 5 • passage length and “answer loss”: only 2 answers lost from the top 5 3-para passages

  14. QA-LaSIE • Derived from LaSIE: Large Scale Information Extraction System • LaSIE developed to participate in the DARPA Message Understanding Conferences (MUC-6/7) • Template filling (elements, relations, scenarios) • Named Entity recognition • Coreference identification • QA-LaSIE is a pipeline of 9 component modules – first 8 are borrowed (with minor modifications) from LaSIE • The question document and each candidate answer document pass through all nine components • Key difference between MUC and QA task: IE template filling tasks are domain-specific; QA is domain-independent

  15. QA-LaSIE Components 1. Tokenizer. Identifies token boundaries and text section boundaries. 2. Gazetteer Lookup. Matches tokens against specialised lexicons (place, person names, etc.). Labels with appropriate name categories. 3. Sentence Splitter. Identifies sentence boundaries in the text body. 4. Brill Tagger. Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text. 5. Tagged Morph. Identifies the root form and inflectional suffix for tokens tagged as nouns or verbs. 6. Parser. Performs two-pass bottom-up chart parsing, first with a special named entity grammar, then with a general phrasal grammar. A “best parse” (possibly partial) is selected and a quasi-logical form (QLF) of each sentence is constructed. For the QA task, a special grammar module identifies the “sought entity” of a question and forms a special QLF representation for it.

  16. QA-LaSIE Components (cont) 7. Name Matcher. Matches variants of named entities across the text. 8. Discourse Interpreter. Adds the QLF representation to a semantic net containing background world and domain knowledge. Additional info inferred from the input is added to the model, and coreference resolution is attempted between instances mentioned in the text. For the QA task, special code was added to find and score a possible answer entity from each sentence in the answer texts. 9. TREC-9 Question Answering Module. Examines the scores for each possible answer entity, and then outputs the top 5 answers formatted for each of the four submitted runs. New module for the QA task.
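To make the ordering of the nine modules explicit, a toy sketch of the pipeline's control flow (the stage names are from the slides; the stage bodies are placeholders that merely record that they ran):

```python
# The nine QA-LaSIE stages run as a fixed sequence over the question document
# and every candidate answer document.

STAGES = ["tokenizer", "gazetteer_lookup", "sentence_splitter", "brill_tagger",
          "tagged_morph", "parser", "name_matcher", "discourse_interpreter",
          "trec9_qa_module"]

def run_pipeline(document_text):
    state = {"text": document_text, "applied": []}
    for stage in STAGES:
        state["applied"].append(stage)   # a real module would enrich `state` here
    return state

print(run_pipeline("Who released the internet worm?")["applied"])
```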

  17. QA in Detail (1): Question Parsing Phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains: • a qvar predicate identifying the sought entity • a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present). Example Q: Who released the internet worm? Question QLF: qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,’Internet’), qual(e3,e4)
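One way to picture the QLF above is as a flat set of predicates over entity and event variables. A minimal sketch that renders the example question QLF as Python tuples (this data structure is an illustration, not QA-LaSIE's internal representation):

```python
# The question QLF for "Who released the internet worm?" written as a flat
# list of (predicate, arguments) tuples; an illustrative rendering only.

question_qlf = [
    ("qvar",   ("e1",)),           # the sought entity
    ("qattr",  ("e1", "name")),    # the property whose value is sought
    ("person", ("e1",)),
    ("release", ("e2",)),
    ("lsubj",  ("e2", "e1")),      # e1 is the logical subject of the release event
    ("lobj",   ("e2", "e3")),      # e3 (the worm) is its logical object
    ("worm",   ("e3",)),
    ("det",    ("e3", "the")),
    ("name",   ("e4", "Internet")),
    ("qual",   ("e3", "e4")),
]

# pull out the sought entity and the attribute requested for it
qvar = next(args[0] for pred, args in question_qlf if pred == "qvar")
qattr = next(args[1] for pred, args in question_qlf if pred == "qattr")
print(qvar, qattr)   # -> e1 name
```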

  18. QA in Detail (2): Sentence/Entity Scoring Two sentence-by-sentence passes through each candidate answer text • Sentence Scoring: • Co-reference system from LaSIE discourse interpreter resolves coreferring entities both within answer texts and between answer and question texts. • Main verb in question matched to similar verbs in answer text • Each non-qvar entity in the question is a “constraint”, and candidate answer sentences get one point for each constraint they contain.
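A minimal sketch of the constraint-counting step: one point per question constraint found in a candidate sentence. The plain substring matching here is a deliberate simplification of the coreference- and verb-aware matching described above:

```python
def sentence_score(question_constraints, candidate_sentence):
    # one point for each non-qvar question entity ("constraint") that the
    # candidate sentence contains; the real system matches via coreference
    # resolution and verb similarity rather than literal substring lookup
    sentence = candidate_sentence.lower()
    return sum(1 for constraint in question_constraints
               if constraint.lower() in sentence)

# constraints from "Who released the internet worm?"
print(sentence_score(["released", "internet worm"],
                     "Morris testified that he released the internet worm."))
# -> 2, matching the sentence score in the worked example later in the talk
```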

  19. QA in Detail (2): Sentence/Entity Scoring (cont) Entity Scoring: Each entity in each candidate answer sentence which was not matched to a term in the question at the sentence scoring stage receives a score based on: • semantic and property similarity to the qvar • whether it shares with the qvar the same relation to a matched verb (the lobj or lsubj relation) • whether it stands in a relation such as apposition, qualification or prepositional attachment to another entity in the answer sentence which was matched to a term in the question at the sentence scoring stage Entity scores are normalised in the range [0-1] so that they never outweigh a better sentence match

  20. QA in Detail (2): Sentence/Entity Scoring (cont) • Total Score: For each sentence a total score is computed by • summing the sentence score and the “best entity score” • dividing by the number of entities in the question + 1 (this has no effect on the answer outcome but normalises scores to [0-1] – useful for comparisons across questions) • Each sentence is annotated with • Total sentence score • “best entity” • “exact answer” = name attribute of best entity, if found
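Putting the two scores together, a sketch of the total-score formula as stated on this slide. Using a question-entity count of 2 is an inference that reproduces the 0.97 total shown in the worked example two slides later:

```python
def total_score(sentence_score, best_entity_score, num_question_entities):
    # best_entity_score is already normalised to [0-1], so it can never outweigh
    # an extra matched constraint; the division normalises the total into [0-1]
    return (sentence_score + best_entity_score) / (num_question_entities + 1)

# figures from the worked example: sentence score 2, best entity score 0.91
print(round(total_score(2, 0.91, 2), 2))   # -> 0.97
```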

  21. Question Answering in Detail: Answer Generation • The 5 highest scoring sentences from all 20 candidate answer texts were used as the basis for the TREC answer output • Results from 4 runs were submitted: • shef50ea – output the name of the best entity if available; otherwise output its longest realization in the text • shef50 – locate the first occurrence of the best answer entity in the text and output the entire sentence (if less than 50 bytes long) or a 50 byte window around the answer, whichever is shorter • shef250 – same as shef50 but with a limit of 250 bytes • shef250p – same as shef250 but with extra padding from the surrounding text allowed, up to a 250 byte maximum
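A sketch of the shef50-style trimming rule (whole sentence or a 50-byte window around the answer, whichever is shorter). Centring the window on the answer, and treating characters as bytes, are assumptions made for illustration:

```python
def trim_answer(sentence, answer, limit=50):
    # return the whole sentence if it already fits within the byte limit,
    # otherwise a `limit`-byte window around the first occurrence of the answer
    if len(sentence) <= limit:
        return sentence
    start = max(0, sentence.find(answer))      # find() gives -1 if missing; fall back to 0
    window_start = max(0, min(start - (limit - len(answer)) // 2,
                              len(sentence) - limit))
    return sentence[window_start:window_start + limit]

sent = "Morris testified that he released the internet worm that crippled thousands of computers."
print(trim_answer(sent, "Morris"))        # 50-byte window containing "Morris"
print(trim_answer(sent, "Morris", 250))   # sentence is under 250 bytes, returned whole
```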

  22. Question Answering in Detail: An Example Q: Who released the internet worm? A: Morris testified that he released the internet worm… Question QLF: qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,’Internet’), qual(e3,e4) Answer QLF: person(e1), name(e1,’Morris’), testify(e2), lsubj(e2,e1), lobj(e2,e6), proposition(e6), main_event(e6,e3), release(e3), pronoun(e4,he), lsubj(e3,e4), worm(e5), lobj(e3,e5) Scores: Sentence Score: 2; Entity Score (e1): 0.91; Total (normalized): 0.97 Answers: shef50ea: “Morris” • shef50: “Morris testified that he released the internet wor” • shef250: “Morris testified that he released the internet worm …” • shef250p: “… Morris testified that he released the internet worm …”

  23. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  24. Evaluation Results • Two sets of results: • Development results on 198 TREC-8 questions • Blind test results on 693 TREC-9 questions • Baseline experiment carried out using Okapi only • Take top 5 passages • Return central 50/250 bytes
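The Okapi-only baseline is simple enough to sketch directly: take the top-ranked passages and return their central 50 or 250 bytes (character/byte encoding details are glossed over here):

```python
def central_bytes(passage, limit):
    # return the central `limit` bytes of a passage (the whole passage if shorter)
    if len(passage) <= limit:
        return passage
    start = (len(passage) - limit) // 2
    return passage[start:start + limit]

def okapi_baseline(ranked_passages, limit=50):
    # baseline run: central 50/250 bytes of each of the top 5 passages
    return [central_bytes(p, limit) for p in ranked_passages[:5]]

passages = ["The Eiffel Tower, completed in 1889, is about 300 metres tall and stands in Paris."]
print(okapi_baseline(passages, limit=50))
```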

  25. Best Development Results on TREC-8 Questions (results table not reproduced in the transcript)

  26. TREC-9 Results (results table not reproduced in the transcript)

  27. TREC-9 50 Byte Runs (results table not reproduced in the transcript)

  28. TREC-9 250 Byte Runs (results table not reproduced in the transcript)

  29. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  30. The Role of Confidence in QA Systems • Little discussion to date concerning the usability of QA systems, as conceptualised in the TREC QA task • Imagine asking How tall is the Eiffel Tower? and getting the answers: • 400 meters (URL …) • 200 meters (URL …) • 300 meters (URL …) • 350 meters (URL …) • 250 meters (URL …) • There are several issues concerning the utility of such output, but two crucial ones are • How confident can we be in the system’s output? • How confident is the system in its own output?

  31. The Role of Confidence in QA Systems (cont) • That these questions are important to users (question askers) is immediately apparent from watching any episode of the ITV quiz show Who Wants to be a Millionaire? • Participants are allowed to “phone a friend” as one of their “lifelines” when confronted with a question they cannot answer. Almost invariably they • Select a friend who they feel is most likely to know the answer – i.e. they attach an a priori confidence rating to their friend’s QA ability (How confident can we be in the system’s output?) • Ask their friend how confident they are in the answer they supply – i.e. they ask their friend to supply a confidence rating on their own performance (How confident is the system in its own output?) • MRR scores give an answer to a); however, to date there has been no exploration of b)

  32. The Role of Confidence in QA Systems (cont) • QA-LaSIE associates a normalised score in the range [0-1] with each answer – the combined sentence/entity (CSE) score • Can the CSE scores be treated as confidence measures? • To determine this, we need to see if CSE scores correlate with answer correctness • Note this is also a test of whether the CSE measure is a good one • We have carried out an analysis of CSE scores for the shef50ea and shef250 runs on the TREC-8 question set: • Rank all proposed answers by CSE score • For 20, 10, and 5 equal subdivisions of the [0-1] CSE score range, determine the % of answers correct in each subdivision …
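A sketch of the binning analysis just described: rank proposed answers by CSE score, divide the [0-1] range into equal subdivisions, and report the percentage correct in each. The sample data is invented purely to show the bookkeeping; it is not from the shef50ea or shef250 runs:

```python
def correctness_by_cse_bin(answers, num_bins=5):
    """answers: list of (cse_score, is_correct) pairs with cse_score in [0-1].
    Returns, per bin, the percentage of answers in that bin judged correct."""
    bins = [[] for _ in range(num_bins)]
    for score, correct in answers:
        index = min(int(score * num_bins), num_bins - 1)   # score 1.0 goes in the top bin
        bins[index].append(correct)
    return [100.0 * sum(b) / len(b) if b else None for b in bins]

# illustrative data only
sample = [(0.15, False), (0.30, False), (0.35, True), (0.55, True),
          (0.62, True), (0.78, True), (0.85, True)]
print(correctness_by_cse_bin(sample, num_bins=5))
# -> [0.0, 50.0, 100.0, 100.0, 100.0]
```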

  33. Shef50ea: CSE vs. Correctness (chart not reproduced in the transcript)

  34. Shef250: CSE vs. Correctness (chart not reproduced in the transcript) Caveat: analysis based on an unequal distribution of data points. Data points per 0.2-wide chunk of the CSE range: 0-.19: 115; .2-.39: 511; .4-.59: 306; .6-.79: 45; .8-1.0: 5

  35. Applications of Confidence Measures • The CSE/Correctness correlation (preliminarily) established above indicates the CSE measure is a useful measure of confidence • How can we use this measure? • Show it to the user – good indicator of how much faith they should have in the answer/whether they should bother following up the URL to the source document • In a more realistic setting, where not every question can be assumed to have an answer in the text collection, CSE score may suggest a threshold below which “no answer” should be returned • proposal for TREC-10
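A sketch of how such a confidence threshold could be applied; the 0.4 cut-off is an arbitrary illustration, not a value proposed in the talk:

```python
def answer_or_abstain(ranked_answers, threshold=0.4):
    # ranked_answers: (cse_score, answer_string) pairs, best first.
    # Below the threshold, return a "no answer" response rather than a guess,
    # as suggested for a setting where some questions have no answer.
    if not ranked_answers or ranked_answers[0][0] < threshold:
        return "no answer"
    return ranked_answers[0][1]

print(answer_or_abstain([(0.72, "300 metres")]))   # -> 300 metres
print(answer_or_abstain([(0.18, "400 metres")]))   # -> no answer
```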

  36. Outline of Talk • Question Answering: A New Challenge in Information Retrieval • The TREC Question Answering Track • The Task • Evaluation Metrics • The Potential of NLP for Question Answering • The Sheffield QA System • Okapi • QA-LaSIE • Evaluation Results • Confidence Measures and their Application • Conclusions and Discussion

  37. Conclusions and Discussion • TREC-9 test results represent a significant drop with respect to the best training results • But, much better than TREC-8, vindicating the “looser” approach to matching answers • QA-LaSIE scores better than the Okapi baseline, suggesting NLP is playing a significant role • But, a more intelligent baseline (e.g. selecting answer passages based on word overlap with the query) might prove otherwise • Computing confidence measures provides some support that our objective scoring function is sensible. They can be used for: • User support • Helping to establish thresholds for a “no answer” response • Tuning parameters in the scoring function (ML techniques?)

  38. Future Work • Failure analysis: • Okapi – for how many questions were no documents containing an answer found? • Question parsing – how many question forms were unanalysable? • Matching procedure – where did it break down? • Moving beyond word root matching – using WordNet? • Building an interactive demo to do QA against the web – Java applet interface to Google + QA-LaSIE running in Sheffield via CGI • Gets the right answer to the million £ question “Who was the husband of Eleanor of Aquitaine?”!

  39. THE END
