This work outlines the creation of a dataset intended for use in both open and more restrictive question generation (QG) tasks. The motivation stems from sources such as Wikipedia and community question answering (cQA) platforms like Yahoo! Answers, which contain vast pools of question-answer pairs. We discuss data identification, automatic collection, and rigorous filtering aimed at ensuring high-quality question generation. The findings highlight the challenges of collecting data from various QG sources and suggest future work on better data validation and richer answer representations.
Resources for An Open Task on Question Generation Vasile Rus, Eric Woolley, Mihai Lintean, and Arthur C. Graesser vrus@memphis.edu
Goal • Build a data set that could be used in an open QG task as well as in a more restrictive task.
Outline • Open Tasks in Question Generation • Motivation • Data Collection • Identifying or creating a data source • Automatic Collection • Automatic Filtering • High-Quality Manual Filtering • Conclusions and Future Work
Open Tasks in QG • Special case of Text-to-Question tasks • Text-to-Question tasks • Input: raw or annotated text • Output: questions for which the text contains, implies, or needs answers; every reasonable or good question should be generated
Open Tasks in QG • Open task: any, not necessarily every, good question is generated • the generation of questions is not restricted in any way; it is open • Open tasks may draw many participants
Motivation • Wikipedia, Learning Environments, community Question Answering (cQA) are preferred domains for Question Generation • Based on a survey of participants in the 1st Question Generation workshop • 16 valid responses out of 18 submitted • Respondents were asked to rank the choices from 1 to 5 • 1 - most preferred
Which of the following would you prefer as the primary content domain for a Question Generation shared task?
QG from Wikipedia • Ideally: Open task on texts from Wikipedia • Reality check: it would be costly and difficult to run experiments to generate benchmark questions from Wikipedia • Pooling could be a way to avoid running such experiments
QG from cQA • An approximation of QG from Wikipedia • Wikipedia and cQA are general/open knowledge repositories • cQA repositories have two advantages • Contain questions, with only one question per answer • Are big, i.e., there is a large pool of questions to select from (millions of question-answer pairs and growing)
Open Task Data Collection • Identification or creation of data source • Automatic Collection of Question-Text pairs • Automatic Filtering • High-Quality Manual Filtering
Data Identification or Creation • Source: Yahoo! Answers • In the Yahoo! Answers service, users can freely ask questions, which are then answered by other users • Answers are rated over time by users; the expectation is that the best answer will eventually be ranked highest
Automatic Collection • Six types of questions • What, where, when, who (shallow) • How, why (deep) • Questions in Yahoo! Answers are categorized by topic, e.g., Allergies, Dogs, Garden & Landscape, Software, and Zoology • 244 Yahoo! Answers categories were queried for all six question types
Automatic Collection • At most 150 questions were downloaded per category and question type, for a maximum of 150 × 244 × 6 = 219,600 candidate questions
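The upper bound on the collection follows directly from the query setup; a minimal sketch of the arithmetic, using only the counts reported above (the actual category names are not enumerated in the slides):

```python
# Collection setup from the slides: six question types, 244 Yahoo! Answers
# categories, and a cap of 150 downloaded questions per (category, type) pair.
QUESTION_TYPES = ["what", "where", "when", "who", "how", "why"]
NUM_CATEGORIES = 244
MAX_PER_QUERY = 150

# Upper bound on the number of candidate questions collected.
max_candidates = MAX_PER_QUERY * NUM_CATEGORIES * len(QUESTION_TYPES)
print(max_candidates)  # 219600
```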
Automatic Filtering • Criteria • Length: a question should contain 3 or more words, i.e., at least of the form "What is X?" • Bad content: e.g., sexually explicit words • The filtering reduced the dataset by 55%
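The two automatic criteria can be sketched as a simple predicate. The blocklist below is purely illustrative — the slides do not publish the actual list of disallowed words:

```python
# Hypothetical blocklist; the real bad-content word list is not in the slides.
BLOCKLIST = {"badword1", "badword2"}

def passes_automatic_filter(question: str) -> bool:
    """Keep a question only if it has 3+ words and no blocklisted word."""
    words = question.rstrip("?!. ").split()
    if len(words) < 3:  # length criterion: 3 or more words
        return False
    # bad-content criterion: reject if any word is blocklisted
    return not any(w.lower().strip(".,!?") in BLOCKLIST for w in words)

passes_automatic_filter("What is X?")  # True: exactly 3 words, clean content
passes_automatic_filter("Why?")        # False: too short
```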
Manual Filtering • Goal: high-quality dataset • A tool was developed to help three human raters select good questions • It takes about 10 hours to select 100 good questions
Manual Filtering • Only 10% of the rated questions are retained by the human raters • The retention rate can be as low as 2% for some Yahoo! Answers categories (e.g., Camcorders) and question types (e.g., when) • When I transfer footage from my video camera to my computer why can't I get sound? • 500 question-answer pairs
Manual Filtering • Reasons for rejecting questions: • The question is a compound question: How long has the computer mouse been around, and who is credited with its invention? • The question is not in interrogative form: I want a webcam and headset etc to chat to my friends who moved away? • Poor grammar or spelling: Yo peeps who kno about comps take a look?
Manual Filtering • The question does not solicit a reasonable answer for our purposes: Who knows something about digital cameras? • The question is ill-posed: When did the ancient city of Mesopotamia flourish? (Mesopotamia was a region, not a city.)
Conclusions and Future Work • Collect more valid instances • Better validation of collected questions • Annotating answers with deep representations
Conclusions and Future Work • Collecting data from community Question Answering sources is more challenging than it seems • Collecting Q-A pairs from sources of general interest, such as Wikipedia, through targeted experiments with human subjects may be a comparable rather than a much costlier effort
Questions?