
ADGEN: Advanced Generation for Question Answering


Presentation Transcript


  1. ADGEN: Advanced Generation for Question Answering Kevin Knight and Daniel Marcu USC/Information Sciences Institute

  2. Natural Language Generation for QA • Analysts create documents for other analysts; machines should also create documents for analysts. • Goal is to produce new texts that: • contain useful answers and ancillary material • are brief • are coherent at the text level • are grammatical at the sentence level • These goals conflict, but we have no principled ways of reasoning about these trade-offs.

  3. ADGEN Research Focus • Of the myriad variations of a text that the machine might produce for an analyst, only a fraction are coherent. • What makes a text coherent? • New Approach: • We have millions of examples of coherent texts • We can validate ideas empirically, develop models • We can train models automatically

  4. Word-Level Language Models • Given an unordered bag of words, assign an order that yields a grammatical, sensible sentence. • For example, given: • “any aware company interest isn't it of said takeover the” • Produce: • “the company said it isn't aware of any takeover interest” • No algorithm for this “bag generation” task appears in linguistics texts, nor can one easily assemble an algorithm using published results as subroutines!
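The bag-generation task on this slide can be sketched as a search over word orders scored by an n-gram model. The bigram table and its log-probability values below are invented purely for illustration; a real model would be estimated from a large corpus, and exhaustive permutation search would be replaced by beam search or dynamic programming.

```python
import itertools

# Toy bigram log-probabilities (invented values for illustration;
# a real model would be estimated from a large news corpus).
BIGRAM = {
    ("<s>", "the"): -0.5, ("the", "company"): -1.0,
    ("company", "said"): -1.2, ("said", "it"): -0.8,
    ("it", "isn't"): -1.5, ("isn't", "aware"): -1.0,
    ("aware", "</s>"): -0.9,
}
UNSEEN = -10.0  # penalty for bigrams absent from the table

def score(order):
    """Sum of bigram log-probabilities over one candidate word order."""
    seq = ["<s>", *order, "</s>"]
    return sum(BIGRAM.get(pair, UNSEEN) for pair in zip(seq, seq[1:]))

def best_order(bag):
    """Exhaustive search over all orders -- feasible only for tiny bags;
    real systems prune with beam search or dynamic programming."""
    return max(itertools.permutations(bag), key=score)

bag = ["aware", "company", "isn't", "it", "said", "the"]
print(" ".join(best_order(bag)))  # -> "the company said it isn't aware"
```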

  5. Word-Level Language Models • Even if linguistic syntactic grammars were widely available, they would not distinguish between sensible sentences and nonsense ones, e.g.: • “the takeover said it isn't aware of any interest company” • However, statistical n-gram models (and other lexicalized models) perform surprisingly well by incorporating both syntactic and semantic constraints.
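The slide's claim — that lexicalized statistics separate sensible orders from nonsense — can be illustrated with a crude bigram scorer. The set of "seen" bigrams and the two log-probability values are assumptions for this sketch, standing in for smoothed corpus estimates.

```python
# Bigrams assumed "seen" in training data (a stand-in for corpus counts).
SEEN = {
    ("the", "company"), ("company", "said"), ("said", "it"),
    ("it", "isn't"), ("isn't", "aware"), ("aware", "of"),
    ("of", "any"), ("any", "takeover"), ("takeover", "interest"),
}

def bigram_score(sentence, seen_lp=-1.0, unseen_lp=-8.0):
    """Crude bigram log-score: seen transitions are cheap, unseen costly."""
    words = sentence.split()
    return sum(seen_lp if pair in SEEN else unseen_lp
               for pair in zip(words, words[1:]))

good = "the company said it isn't aware of any takeover interest"
bad = "the takeover said it isn't aware of any interest company"
print(bigram_score(good))  # -9.0: every transition attested
print(bigram_score(bad))   # -37.0: four implausible transitions
```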

  6. Why care about bag generation? • It’s an acid test for any theory of language use. • We can automatically generate problem instances. • We can automatically evaluate proposed algorithms. • Good solutions are directly applicable to answer generation/aggregation problems • Good solutions are also directly applicable to word-ordering problems in statistical machine translation (SMT) and meaning-to-text generation.

  7. Text-Level Language Models • Given an unordered bag of answers, clauses, or sentences, assign an order that yields a coherent text. • Typical discourse study: “if we scramble sentences in an English document, the result is not coherent, so text has structure…” • Let’s do something about it!

  8. Sample Problem • 1. Terms weren't disclosed, but industry sources said the price was about $2.5 million. • 2. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern. • 3. Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp., a unit of London-based Beecham Group PLC. • 4. The sale includes the rights to Germaine Monteil in North and South America and in the Far East, as well as the worldwide rights to the Diane Von Furstenberg cosmetics and fragrance lines and U.S. distribution rights to Lancaster beauty products.

  9. Sample Problem • 1. Terms weren't disclosed, but industry sources said the price was about $2.5 million. • 2. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern. • 3. Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp., a unit of London-based Beecham Group PLC. • 4. The sale includes the rights to Germaine Monteil in North and South America and in the Far East, as well as the worldwide rights to the Diane Von Furstenberg cosmetics and fragrance lines and U.S. distribution rights to Lancaster beauty products. Correct order: 3, 1, 4, 2
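Searching for the most coherent sentence order can be sketched with one very crude coherence proxy — word overlap between adjacent sentences. Both the proxy and the abridged stand-in sentences below are illustrative assumptions, not ADGEN's actual features.

```python
import itertools

def overlap(a, b):
    """Number of shared words -- a very crude cohesion signal."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def most_coherent_order(sentences):
    """Score every ordering by summed adjacent overlap; factorial
    search, so only workable for a handful of sentences."""
    n = len(sentences)
    return max(itertools.permutations(range(n)),
               key=lambda p: sum(overlap(sentences[i], sentences[j])
                                 for i, j in zip(p, p[1:])))

# Abridged stand-ins for the four sentences on the slide.
sents = [
    "terms were not disclosed but sources said the price was 2.5 million",
    "revlon is a cosmetics concern and beecham is a pharmaceutical concern",
    "revlon said it completed the acquisition of a beecham cosmetics unit",
    "the sale includes the rights to the cosmetics and fragrance lines",
]
print(most_coherent_order(sents))
```

A lexical-overlap proxy this simple will not always recover the gold order; the point is only the shape of the search problem.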

  10. Is this problem too hard? • People can do it. • News articles 2-10 sentences long: • 50%: re-ordering matches original • 40%: one sentence out of place • 10%: large mismatches, but judges preferred original • Debriefings are very useful for getting insight.
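One standard way to quantify "one sentence out of place" versus "large mismatches" is Kendall's tau between the proposed and original orders — a metric choice assumed here for illustration, not named on the slide.

```python
from itertools import combinations

def kendall_tau(proposed):
    """Kendall's tau between a proposed ordering of items 0..n-1 and
    the original order: 1.0 = identical, -1.0 = fully reversed."""
    pos = {item: i for i, item in enumerate(proposed)}
    pairs = list(combinations(sorted(proposed), 2))
    concordant = sum(1 for a, b in pairs if pos[a] < pos[b])
    return (2 * concordant - len(pairs)) / len(pairs)

print(kendall_tau([0, 1, 2, 3]))  # 1.0: reordering matches the original
print(kendall_tau([0, 2, 1, 3]))  # ~0.67: one adjacent pair swapped
```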

  11. Models have multiple applications • Word-level ordering: Machine Translation, Meaning-to-Text Generation • Text-level ordering: Essay Grading, Multi-document Summarization, ?

  12. Redundancy • Model of text coherence must deal with redundancy. • This text is not coherent: • Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp. • Terms weren't disclosed, but industry sources said the price was about $2.5 million. • The sale includes the rights to Germaine Monteil in North and South America. • Terms were not disclosed by either party. • Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern, and neither elected to disclose the terms of the acquisition.
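A first-cut redundancy signal is lexical: two sentences that restate the same fact share many words. Jaccard word overlap, sketched below, is an assumed stand-in for the richer features a real coherence model would need.

```python
def jaccard(a, b):
    """Jaccard word overlap between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

s1 = "terms weren't disclosed by the companies"
s2 = "terms were not disclosed by either party"
s3 = "revlon is a cosmetics concern"
print(jaccard(s1, s2))  # nonzero overlap flags a likely restated fact
print(jaccard(s1, s3))  # 0.0: no shared words, not redundant
```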

  13. Contradiction • Model of text coherence must deal with contradiction. • This text is not coherent: • Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques. • Terms weren't disclosed, but industry sources said the price was about $2.5 million. • Revlon said it paid $2.2 million for Germaine Monteil.

  14. Methods • Modeling of data in a one-billion-word corpus of English, as well as in topical multi-document collections. • generative stories of how text gets produced • probability values that combine naturally with each other • strong local constraints expressed as conditional probabilities • automatic training procedures • statistical perplexity as a measure of how well the model fits the data • Features • Word correlations, cue-phrase patterns, syntactic patterns, tense-specific patterns, semantic WordNet-based patterns, coreference patterns
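Perplexity, the fit measure named on the slide, can be computed directly from per-token log-probabilities; the toy numbers below are illustrative.

```python
def perplexity(log2_probs):
    """Perplexity over a test sequence: 2 ** (-mean per-token log2 prob).
    Lower is better; a uniform k-way guesser has perplexity k."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A model assigning every token probability 1/8 (log2 = -3)
# is exactly as uncertain as a uniform 8-way choice.
print(perplexity([-3.0] * 10))  # -> 8.0
```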

  15. ADGEN in AQUAINT • 1. Answer generation • Input: collection of text fragments (including phrases and paragraphs) • Fuse phrases into sentences, order sentences to form millions of possible texts • Rank and select most coherent presentation • 2. Text improvement • Input: existing text • Apply probabilistic rewriting operations • Select rewrite that most improves coherence without sacrificing any of the basic material
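The text-improvement loop on this slide can be sketched as hill climbing: propose a rewrite, keep it only if the coherence score improves. In this sketch the sole "rewrite operation" is swapping adjacent sentences and the coherence score is toy word overlap — both stand-ins for ADGEN's probabilistic operations.

```python
import random

def coherence(sentences):
    """Toy coherence score: summed adjacent-sentence word overlap."""
    return sum(len(set(a.split()) & set(b.split()))
               for a, b in zip(sentences, sentences[1:]))

def improve(text, steps=200, seed=0):
    """Hill climbing: apply a random adjacent swap and keep it only
    when the coherence score strictly improves."""
    rng = random.Random(seed)
    best = list(text)
    for _ in range(steps):
        i = rng.randrange(len(best) - 1)
        cand = best[:i] + [best[i + 1], best[i]] + best[i + 2:]
        if coherence(cand) > coherence(best):
            best = cand
    return best

scrambled = ["a b", "c d", "b c"]
print(improve(scrambled))  # -> ['a b', 'b c', 'c d']
```

Accepting only strict improvements guarantees none of the basic material is lost here, since swaps never delete a sentence.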
