Evaluation of Automatically Reformulated Questions in Question Series
R. Shaw, B. Solway, R. Gaizauskas and M. A. Greenwood
Outline of Talk • Motivations • Creating a Gold Standard • Developing a Comparison Metric • System Evaluation • Conclusions and Future Work IR4QA Workshop
Motivations • Many existing QA systems are designed to answer single unambiguous factoid questions • How tall is the Eiffel Tower? • How many calories are there in a Big Mac? • In real life most questions are not asked in isolation • Who is John’s father? • When was he born? • This is mirrored by the current TREC/TAC Task • Questions are asked in series about a given target.
Motivations • One consequence of question series is that individual questions may no longer make sense if taken out of context: • When was he born? (target: Kafka) • When did it sink? (target: Russian submarine Kursk) • How tall is it? (target: Eiffel Tower) • We could simply append the target to every question, but this • Does not handle cases where the anaphor refers to something other than the target • Loses information for approaches which may want to parse grammatical questions and exploit, e.g., syntactic relations between question elements 24th August 2008 IR4QA Workshop
Motivations A pre-processing step must therefore be used to resolve ambiguities in a question by reference to the target and to previous questions. One way to do this: reformulate/rephrase the question as a human would if asked to restate it as a question independent of the question series context
Motivations Since systems need to perform this reformulation (or its equivalent) automatically, a “gold-standard” corpus of reformulations would be useful to support the development of automatic methods to resolve referential ambiguities in questions …
Outline of Talk • Motivations • Creating a Gold Standard • Developing a Comparison Metric • System Evaluation • Conclusions and Future Work
Creating a Gold Standard • A gold standard set of reformulations was created from the TREC 2007 QA test set • Ten question series were randomly selected • These questions were reformulated independently by two people • The results were compared, and discussion led to the creation of a set of guidelines • Another random set of ten questions was then independently reformulated • The reformulations were sufficiently close, and the guidelines sufficiently stable, that given limited resources the remaining questions were reformulated by a single person.
Creating a Gold Standard • The gold standard currently covers the whole TREC 2007 QA test set • Each question series contains between 5 and 7 questions • 406 questions were reformulated • In total there are 448 individual reformulations • The maximum number of reformulations per question is 3 • The mean number of reformulations per question is 1.103
Guidelines (1) • Context Independence and Readability • The reformulation should be understandable outside the question series • The reformulation should be expressed as it would be by a native speaker • Reformulate Questions to Maximise Search Results • Include as much information from the context as necessary to maximise search results • For example, “Who was Shakespeare?” should become “Who was William Shakespeare?” as “William” provides an extra search term
Guidelines (2) • Target Matches a Sub-String of the Question • If a part of the target appears in the question, it should be expanded to the full target • Stopwords should be ignored when searching for the target in the question • For example, given the target “Sony Pictures Entertainment (SPE)”, the question “What U.S. company did Sony purchase to form SPE?” should be reformulated to “What U.S. company did Sony purchase to form Sony Pictures Entertainment (SPE)?” • Rephrasing • Questions should not be unnecessarily rephrased • For example, there is no need to change “What was Nissan Corp. formerly known as?” to “Nissan Corp. was formerly known as what?”
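The sub-string expansion guideline can be approximated in code. The following is a minimal sketch, not the authors' implementation: the stopword list is an illustrative subset, and the try-the-bracketed-abbreviation-first heuristic is an assumption made to handle examples like the SPE one above.

```python
import re

STOPWORDS = {"the", "of", "a", "an", "in", "and"}  # illustrative subset only

def expand_partial_target(question, target):
    """If part of the target appears in the question, expand it to the
    full target. Heuristic sketch: try a bracketed abbreviation first,
    then the longest non-stopword span of the target."""
    if target in question:          # full target already present
        return question
    m = re.search(r"\((.+?)\)", target)
    if m and m.group(1) in question:
        return question.replace(m.group(1), target, 1)
    tokens = re.sub(r"[()]", "", target).split()
    # all contiguous spans of the target, longest first
    spans = sorted((" ".join(tokens[i:j])
                    for i in range(len(tokens))
                    for j in range(i + 1, len(tokens) + 1)),
                   key=len, reverse=True)
    for span in spans:
        if span.lower() not in STOPWORDS and span in question:
            return question.replace(span, target, 1)
    return question

expand_partial_target(
    "What U.S. company did Sony purchase to form SPE?",
    "Sony Pictures Entertainment (SPE)")
# → "What U.S. company did Sony purchase to form Sony Pictures Entertainment (SPE)?"
```

Note that a purely longest-span heuristic would wrongly expand “Sony” in this example; preferring the bracketed abbreviation sidesteps that, but a real system would need entity-aware matching.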
Guidelines (3) • Previous Questions and Answers • Questions which refer to a previous answer should have two reformulations • A reformulation which includes a PREVIOUS_ANSWER string • A reformulation which merges the two questions • For example, given the questions “Who was the first Imam of the Shiite sect of Islam?” and “Where is his tomb?” we get • Where is PREVIOUS_ANSWER’s tomb? • Where is the tomb of the first Imam of the Shiite sect of Islam? • It • The word it should be interpreted as referring either to the answer to the previous question or, if it occurs in the first question of a series, to the target
Guidelines (4) • Targets that Contain Brackets • Three reformulations should be generated for such targets • The full target should be substituted into the question • The target without the bracketed section should be substituted into the question • The bracketed section only should be substituted into the question • For example, given the target “Church of Jesus Christ of Latter-day Saints (Mormons)” and the question “Who founded the Church of Jesus Christ of Latter-day Saints?” we get • Who founded the Church of Jesus Christ of Latter-day Saints? • Who founded the Church of Jesus Christ of Latter-day Saints (Mormons)? • Who founded the Mormons? • Stemming and Synonyms • Reformulations should not use stemmed forms or synonyms unless these appear in the target or in previous questions in the series • For example, given the target “Chunnel”, the question “How long is the Chunnel?” should not be reformulated to “How long is the Channel Tunnel?”
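The bracketed-target guideline amounts to generating three substitutions. A sketch, assuming a hypothetical TARGET placeholder marks where the target is substituted (the original guideline substitutes into the anaphoric position found by the system):

```python
import re

def bracket_reformulations(template, target):
    """Generate the three reformulations for a target containing brackets:
    target without the bracketed section, full target, bracketed section only.
    `template` uses a hypothetical TARGET placeholder for the substitution point."""
    m = re.match(r"(.+?)\s*\((.+?)\)\s*$", target)
    if not m:  # no brackets: a single substitution suffices
        return [template.replace("TARGET", target)]
    outside, inside = m.group(1), m.group(2)
    return [template.replace("TARGET", t) for t in (outside, target, inside)]

bracket_reformulations(
    "Who founded the TARGET?",
    "Church of Jesus Christ of Latter-day Saints (Mormons)")
# → ['Who founded the Church of Jesus Christ of Latter-day Saints?',
#    'Who founded the Church of Jesus Christ of Latter-day Saints (Mormons)?',
#    'Who founded the Mormons?']
```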
Guidelines (5) • Pronouns (1) • If the pronoun is he or she • And the target is a person • The target is substituted for the pronoun in the reformulation • Unless the previous answer is a person • In which case the reformulation should follow the previous questions and answers guideline • Pronouns (2) • If the pronoun is his, her, or their • And the target is a person • The target followed by ’s is substituted in the reformulation • Unless the previous answer is a person • In which case the reformulation should follow the previous questions and answers guideline
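These pronoun rules are essentially a token-level substitution. A simplified sketch, assuming the person checks (`target_is_person`, `prev_answer_is_person`) come from upstream target/answer typing that is not shown here:

```python
def resolve_pronoun(question, target, target_is_person, prev_answer_is_person):
    """Apply the pronoun guidelines: substitute the target for he/she and
    the target + 's for his/her/their; fall back to the PREVIOUS_ANSWER
    string when the previous answer is a person (per the previous
    questions and answers guideline)."""
    subject, possessive = {"he", "she"}, {"his", "her", "hers", "their"}
    out = []
    for tok in question.rstrip("?").split():
        low = tok.lower()
        if not target_is_person or low not in subject | possessive:
            out.append(tok)
            continue
        repl = "PREVIOUS_ANSWER" if prev_answer_is_person else target
        out.append(repl + "'s" if low in possessive else repl)
    return " ".join(out) + "?"

resolve_pronoun("When was he born?", "Kafka", True, False)
# → "When was Kafka born?"
resolve_pronoun("Where is his tomb?", "the first Imam", True, True)
# → "Where is PREVIOUS_ANSWER's tomb?"
```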
Outline of Talk • Motivations • Creating a Gold Standard • Developing a Comparison Metric • System Evaluation • Conclusions and Future Work
Developing a Comparison Metric • The production of the gold standard was motivated by the need for data for the development and evaluation of automatic methods to produce unambiguous questions. • Comparing system output against a gold standard requires an evaluation metric • We require a metric with the following properties: • The closer a reformulation is to the gold standard, the higher the score • A reformulation identical to the gold standard should give the maximum score – following conventional practice we would like a metric that scales between 0 and 1, where 1 is the highest • The ordering of the words in the reformulation is not as important as the content of the reformulation (partly because the questions are used as IR queries, where content matters more than word order)
Developing a Comparison Metric • Many different metrics have been reported in the literature for computing string similarity, including ROUGE and METEOR • These metrics are, however, usually used to compare long sentences or paragraphs, not questions. • We decided to develop our own evaluation metric tuned specifically to determining question similarity. • Many different string similarity metrics were investigated to produce the final comparison metric. The details of these experiments, while interesting, are not central to this work; they are, however, included in the paper.
Developing a Comparison Metric • The comparison metric makes use of the Jaccard similarity • This is a token-based vector space similarity measure • Defined as J(X, Y) = |X ∩ Y| / |X ∪ Y| • Where X and Y are the sets of words in the two questions being compared • Whilst we are more concerned with word content than word ordering, reformulations that are closer in order to the gold standard should be preferred • We use unigrams to check for overlapping content • We use bigrams to check for similar word ordering • We use a weighted combination of unigram and bigram scores in the ratio 2:1
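The metric as described above can be sketched as follows; the exact tokenisation and normalisation (lowercasing, stripping the question mark) are assumptions of this sketch, not details from the talk:

```python
def jaccard(x, y):
    """Jaccard similarity |X ∩ Y| / |X ∪ Y| over two token collections."""
    x, y = set(x), set(y)
    if not (x or y):
        return 1.0
    return len(x & y) / len(x | y)

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def question_similarity(candidate, gold):
    """2:1 weighted combination of unigram and bigram Jaccard scores,
    so word content dominates but word order still contributes."""
    c = candidate.lower().rstrip("?").split()
    g = gold.lower().rstrip("?").split()
    return (2 * jaccard(c, g) + jaccard(ngrams(c, 2), ngrams(g, 2))) / 3
```

An identical reformulation scores 1.0; reordering the same words keeps the unigram score at 1.0 but lowers the bigram score, so the combined score drops below 1, which is exactly the order-sensitivity the slide asks for.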
Outline of Talk • Motivations • Creating a Gold Standard • Developing a Comparison Metric • System Evaluation • Conclusions and Future Work
System Evaluation • Our current question processing system • Substitutes the target for pronouns like he and she • Performs limited nominal co-reference to help with target substitution • Expands questions containing parts of the target to contain the full target • Does not take into account the answers to previous questions • If it cannot make a reformulation, it simply appends the target to the end of the question • We performed an experiment to compare the gold standard reformulations against: • The unaltered questions • The questions with the target appended to the end • The questions produced by our automatic system
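The with-target condition used in this comparison is deliberately trivial, which is what makes the gap to the processed questions meaningful. As a sketch:

```python
def with_target(question, target):
    """Baseline: append the target to the unaltered question, as the
    system does when no reformulation can be made."""
    return f"{question} {target}"

with_target("How tall is it?", "Eiffel Tower")
# → "How tall is it? Eiffel Tower"
```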
System Evaluation • These results show that as more processing takes place the questions become more like the gold standard reformulations • Our processing system is quite effective • The default behaviour is to append the target (i.e. produce the with-target form), yet there is a large difference between the processed and with-target scores. • These results also show that the similarity measure is behaving as expected.
System Evaluation • Average scores are useful but for development purposes score distributions may be more useful • Most of the questions have a high similarity • A few have very low similarity – these are the questions we need to look at in more detail • For example, the question with the lowest similarity was: • Target: Hindenburg disaster • Original: What type of craft was the Hindenburg? • Gold Standard: What type of craft was the Hindenburg? • Processed: What type of craft was the Hindenburg disaster?
Outline of Talk • Motivations • Creating a Gold Standard • Developing a Comparison Metric • System Evaluation • Conclusions and Future Work
Conclusions and Future Work • We have produced a gold standard set of question reformulations which can be used to help develop techniques for removing ambiguities from questions. • We have produced a set of guidelines to ensure that further gold standard reformulations are produced in a consistent fashion. • We provide a similarity metric to allow for rapid evaluation of processed questions against the gold standard, and have shown its usefulness by finding questions our current system performs badly on. • Future work will involve the creation of a larger gold standard and a more robust question processing system.
Questions?