Improving Intrinsic Evaluation for Question Generation in Natural Language Processing
This presentation examines challenges and considerations in the intrinsic evaluation of Question Generation (QG) for info-seeking questions, including the risk of biasing evaluation toward current technology. It discusses the role of "trigger" text, the possibility of deep questions generated without one, and proposes criteria for evaluation. Key areas include what counts as question generation, whether mining for questions is acceptable, and how to evaluate system-provided answers. It calls for better-defined question types, improved annotation guidelines, and closer examination of rating disagreements.
Presentation Transcript
Task 1: Intrinsic Evaluation
Vasile Rus, Wei Chen, Pascal Kuyten, Ron Artstein
Task definition
• Only interested in info-seeking questions
• Evaluation biased towards current technology
• Asking for the “trigger” text is problematic:
  • Future QG systems may not employ a trigger
  • Trigger is less important for deep/holistic questions
• Need to define what counts as QG:
  • Would mining for questions be acceptable?
  • Require a generative component? (defined how?)
  • An internal representation? Structure?
Evaluation criteria
• Evaluate the question alone, or question + answer?
  • System provides the question
  • Evaluator decides if an answer is available
  • Separately, evaluate the system's answer if one is given
• Answer = contiguous text? (see the sketch after this slide)
  • Can this be relaxed?
• Additional criteria: conciseness?
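As a rough illustration of the "answer = contiguous text" criterion, the minimal sketch below checks whether a proposed answer occurs verbatim as one contiguous span of the source passage; the function name and example sentences are invented for illustration, and relaxing the criterion would mean accepting answers that fail this check.

```python
def is_contiguous_span(answer: str, source_text: str) -> bool:
    """Return True if the answer occurs verbatim as one contiguous span of the source."""
    # Collapse whitespace so line breaks in the source do not block a match.
    normalize = lambda s: " ".join(s.split())
    return normalize(answer) in normalize(source_text)

# Invented example for illustration only.
source = "The committee met in Geneva in 2011 to revise the shared task guidelines."
print(is_contiguous_span("met in Geneva in 2011", source))         # True: contiguous span
print(is_contiguous_span("committee revised guidelines", source))  # False: not contiguous
```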
Annotation guidelines
• Question type: needs a more detailed definition
• Yao et al. (submitted):
  • The "What" category includes (what|which) (NP|PP)
  • Question type identified mechanically with ad-hoc rules (see the sketch after this slide)
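The slide only names the (what|which) (NP|PP) pattern; the sketch below shows what mechanical question type identification with ad-hoc rules might look like in Python. Every rule and label beyond the what/which case is an illustrative assumption, not the rule set actually used by Yao et al.

```python
import re

# Ad-hoc, pattern-based question type rules (illustrative; only the what/which
# pattern comes from the slide, the rest are assumed examples).
RULES = [
    ("WHAT",  re.compile(r"^(what|which)\b", re.I)),   # e.g. "what NP", "which NP/PP"
    ("WHO",   re.compile(r"^(who|whom|whose)\b", re.I)),
    ("WHERE", re.compile(r"^where\b", re.I)),
    ("WHEN",  re.compile(r"^when\b", re.I)),
    ("WHY",   re.compile(r"^why\b", re.I)),
    ("HOW",   re.compile(r"^how\b", re.I)),
    ("YESNO", re.compile(r"^(is|are|was|were|do|does|did|can|could|will|would)\b", re.I)),
]

def question_type(question: str) -> str:
    """Return the label of the first matching rule, or OTHER if none fires."""
    q = question.strip()
    for label, pattern in RULES:
        if pattern.search(q):
            return label
    return "OTHER"

print(question_type("Which country hosted the 2012 Olympics?"))  # WHAT
print(question_type("How does the algorithm scale?"))            # HOW
```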
Terminology
• For the QG from sentences task:
  • “Ambiguity” is really specificity or concreteness
  • “Relevance” is really answerability
Rating disagreements
• Many (most?) of the disagreements are between close ratings (e.g. 3 vs. 4)
• Need a measure that considers magnitudes, such as Krippendorff’s α (see the sketch after this slide)
• Perhaps normalize ratings by rater?
• Specific disagreement on in-situ questions
  • The codes are not what?
  • Needs to be addressed in the guidelines
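To make the measurement point concrete, the sketch below computes Krippendorff's α at the interval level, which penalizes a 3-vs-4 disagreement far less than a 1-vs-5 one, and then recomputes it after z-normalizing each rater's scores. It assumes the third-party krippendorff Python package (pip install krippendorff), and the rating values are invented for illustration.

```python
import numpy as np
import krippendorff  # third-party package, assumed installed via `pip install krippendorff`

# Invented example: ratings on a 1-5 scale from three raters over six questions,
# with np.nan marking missing judgements (rows = raters, columns = items).
ratings = np.array([
    [4, 3, 5, 2, np.nan, 4],
    [3, 3, 4, 2, 1,      4],
    [4, 4, 5, 3, 1,      np.nan],
], dtype=float)

# Interval-level alpha takes the magnitude of disagreements into account,
# so close ratings (3 vs. 4) hurt agreement much less than distant ones (1 vs. 5).
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="interval")
print(f"interval alpha: {alpha:.3f}")

# Optional per-rater normalization: z-score each rater's ratings so that
# systematically lenient or harsh raters are put on a comparable footing.
normalized = (ratings - np.nanmean(ratings, axis=1, keepdims=True)) / np.nanstd(ratings, axis=1, keepdims=True)
alpha_norm = krippendorff.alpha(reliability_data=normalized, level_of_measurement="interval")
print(f"interval alpha after per-rater normalization: {alpha_norm:.3f}")
```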
New tasks
• Replace QG from sentences with QG from metadata
  • Evaluates only the generation component
  • Finding things to ask remains a component of the QG from paragraphs task
• Make all system results public for analysis
  • Required? Voluntary?
  • Use the data to learn from others’ problems