Improving Intrinsic Evaluation for Question Generation in Natural Language Processing
This presentation examines challenges and considerations in the intrinsic evaluation of Question Generation (QG) for info-seeking questions, including the risk of biasing evaluation toward current technology. It discusses the role of "trigger" text, the possibility of deep questions generated without one, and proposes criteria for evaluation. Key areas include what counts as question generation, whether mining for questions is acceptable, and how to evaluate system-provided answers. It calls for better-defined question types, improved annotation guidelines, and closer examination of rating disagreements.
Presentation Transcript
Task 1: Intrinsic Evaluation
Vasile Rus, Wei Chen, Pascal Kuyten, Ron Artstein
Task definition
• Only interested in info-seeking questions
• Evaluation biased towards current technology
• Asking for the “trigger” text is problematic:
  • Future QG systems may not employ a trigger
  • Trigger is less important for deep/holistic questions
• Need to define what counts as QG:
  • Would mining for questions be acceptable?
  • Require a generative component? (defined how?)
  • An internal representation? Structure?
Evaluation criteria
• Evaluate the question alone, or question + answer?
  • System provides the question
  • Evaluator decides if an answer is available
  • Separately, evaluate the system's answer if one is given
• Answer = contiguous text? (see the sketch after this slide)
  • Can this be relaxed?
• Additional criteria: conciseness?
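As a rough illustration of the "answer = contiguous text" criterion, the minimal sketch below checks whether a proposed answer occurs verbatim as one contiguous span of the source passage; the function name and example sentences are invented for illustration, and relaxing the criterion would mean accepting answers that fail this check.

```python
def is_contiguous_span(answer: str, source_text: str) -> bool:
    """Return True if the answer occurs verbatim as one contiguous span of the source."""
    # Collapse whitespace so line breaks in the source do not block a match.
    normalize = lambda s: " ".join(s.split())
    return normalize(answer) in normalize(source_text)

# Invented example for illustration only.
source = "The committee met in Geneva in 2011 to revise the shared task guidelines."
print(is_contiguous_span("met in Geneva in 2011", source))         # True: contiguous span
print(is_contiguous_span("committee revised guidelines", source))  # False: not contiguous
```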
Annotation guidelines
• Question type: needs a more detailed definition
• Yao et al. (submitted):
  • The "What" category includes (what|which) (NP|PP)
  • Question type identified mechanically with ad-hoc rules (see the sketch after this slide)
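The slide only names the (what|which) (NP|PP) pattern; the sketch below shows what mechanical question type identification with ad-hoc rules might look like in Python. Every rule and label beyond the what/which case is an illustrative assumption, not the rule set actually used by Yao et al.

```python
import re

# Ad-hoc, pattern-based question type rules (illustrative; only the what/which
# pattern comes from the slide, the rest are assumed examples).
RULES = [
    ("WHAT",  re.compile(r"^(what|which)\b", re.I)),   # e.g. "what NP", "which NP/PP"
    ("WHO",   re.compile(r"^(who|whom|whose)\b", re.I)),
    ("WHERE", re.compile(r"^where\b", re.I)),
    ("WHEN",  re.compile(r"^when\b", re.I)),
    ("WHY",   re.compile(r"^why\b", re.I)),
    ("HOW",   re.compile(r"^how\b", re.I)),
    ("YESNO", re.compile(r"^(is|are|was|were|do|does|did|can|could|will|would)\b", re.I)),
]

def question_type(question: str) -> str:
    """Return the label of the first matching rule, or OTHER if none fires."""
    q = question.strip()
    for label, pattern in RULES:
        if pattern.search(q):
            return label
    return "OTHER"

print(question_type("Which country hosted the 2012 Olympics?"))  # WHAT
print(question_type("How does the algorithm scale?"))            # HOW
```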
Terminology
• For the QG from sentences task:
  • “Ambiguity” is really specificity or concreteness
  • “Relevance” is really answerability
Rating disagreements
• Many (most?) of the disagreements are between close ratings (e.g. 3 vs. 4)
• Need a measure that considers magnitudes, such as Krippendorff’s α (see the sketch after this slide)
• Perhaps normalize ratings by rater?
• Specific disagreement on in-situ questions
  • The codes are not what?
  • Needs to be addressed in the guidelines
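To make the measurement point concrete, the sketch below computes Krippendorff's α at the interval level, which penalizes a 3-vs-4 disagreement far less than a 1-vs-5 one, and then recomputes it after z-normalizing each rater's scores. It assumes the third-party krippendorff Python package (pip install krippendorff), and the rating values are invented for illustration.

```python
import numpy as np
import krippendorff  # third-party package, assumed installed via `pip install krippendorff`

# Invented example: ratings on a 1-5 scale from three raters over six questions,
# with np.nan marking missing judgements (rows = raters, columns = items).
ratings = np.array([
    [4, 3, 5, 2, np.nan, 4],
    [3, 3, 4, 2, 1,      4],
    [4, 4, 5, 3, 1,      np.nan],
], dtype=float)

# Interval-level alpha takes the magnitude of disagreements into account,
# so close ratings (3 vs. 4) hurt agreement much less than distant ones (1 vs. 5).
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="interval")
print(f"interval alpha: {alpha:.3f}")

# Optional per-rater normalization: z-score each rater's ratings so that
# systematically lenient or harsh raters are put on a comparable footing.
normalized = (ratings - np.nanmean(ratings, axis=1, keepdims=True)) / np.nanstd(ratings, axis=1, keepdims=True)
alpha_norm = krippendorff.alpha(reliability_data=normalized, level_of_measurement="interval")
print(f"interval alpha after per-rater normalization: {alpha_norm:.3f}")
```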
New tasks
• Replace QG from sentences with QG from metadata
  • Evaluates only the generation component
  • Finding things to ask remains a component of the QG from paragraphs task
• Make all system results public for analysis
  • Required? Voluntary?
  • Use the data to learn from others’ problems