
Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics


Presentation Transcript


  1. Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics By Chin-Yew Lin and Eduard Hovy

  2. The Document Understanding Conference • In 2002 there were two main tasks: • Summarization of single documents • Summarization of multiple documents

  3. DUC Single Document Summarization • Summarization of single documents • Generate a 100-word summary • Training: 30 sets of 10 documents, each with 100-word summaries • Test against 30 unseen documents

  4. DUC Multi-Document Summarization • Summarization of multiple documents about a single subject • Generate 50-, 100-, 200-, and 400-word summaries • Four types: single natural disaster, single event, multiple instances of a type of event, information about an individual • Training: 30 sets of 10 documents with their 50-, 100-, 200-, and 400-word summaries • Test: 30 unseen document sets

  5. DUC Evaluation Material • For each document set, one human summary was created as the ‘ideal’ summary at each length • Two additional human summaries were created at each length • Baseline summaries were created automatically at each length as reference points • The lead baseline took the first n words of the last document in the set for the multi-document task • The coverage baseline used the first sentence of each document until it reached the target length
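A minimal sketch of the two baseline constructions described on this slide, assuming whitespace tokenization and a naive sentence split; the function names are hypothetical, and the real DUC baselines used proper sentence boundaries.

```python
def lead_baseline(documents, target_words):
    """Lead baseline: first n words of the last document in the set."""
    words = documents[-1].split()
    return " ".join(words[:target_words])


def coverage_baseline(documents, target_words):
    """Coverage baseline: first sentence of each document, in order,
    until the target length (in words) is reached."""
    summary = []
    for doc in documents:
        # Naive sentence split; an assumption standing in for a real splitter.
        first_sentence = doc.split(". ")[0]
        summary.extend(first_sentence.split())
        if len(summary) >= target_words:
            break
    return " ".join(summary[:target_words])
```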

  6. SEE – Summary Evaluation Environment • A tool that allows assessors to compare system text (peer) with ideal text (model) • Can rate both quality and content • Assessors mark all system units sharing content with the model as {all, most, some, hardly any} • Assessors rate quality in terms of grammaticality, cohesion, and coherence as {all, most, some, hardly any, none}

  7. SEE interface

  8. Making a Judgement • From Chin-Yew Lin / MT Summit IX, 2003-09-27

  9. Evaluation Metrics • One idea is simple sentence recall, but it cannot differentiate system performance (it pays to be over-productive) • Recall is measured relative to the model text • E is the average of the per-unit coverage scores
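A hedged sketch of computing E as the average of per-model-unit coverage scores; the numeric mapping from the categorical SEE judgments to fractions is an illustrative assumption, not the official DUC weighting.

```python
# Illustrative mapping from SEE completeness judgments to fractions;
# the exact DUC weights are an assumption here.
COVERAGE_WEIGHTS = {
    "all": 1.0,
    "most": 0.75,
    "some": 0.5,
    "hardly any": 0.25,
    "none": 0.0,
}


def coverage_score(model_unit_judgments):
    """E: average of per-model-unit coverage scores for one summary.

    `model_unit_judgments` holds one categorical label per model unit,
    e.g. ["all", "some", "none", "most"].
    """
    scores = [COVERAGE_WEIGHTS[j] for j in model_unit_judgments]
    return sum(scores) / len(scores) if scores else 0.0
```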

  10. Machine Translation and Summarization Evaluation • Machine translation – Inputs: reference translation and candidate translation – Methods: manually compare the two translations on accuracy, fluency, and informativeness, or evaluate automatically using BLEU/NIST scores • Automatic summarization – Inputs: reference summary and candidate summary – Methods: manually compare the two summaries on content overlap and linguistic qualities; automatic evaluation: ?

  11. NIST BLEU • Goal: measure the translation closeness between a candidate translation and a set of reference translations with a numeric metric • Method: use a weighted average of variable-length n-gram matches between the system translation and the set of human reference translations • BLEU correlates highly with human assessments • We would like to make the same assumption for summaries: the closer a summary is to a professional summary, the better it is
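A minimal sketch of the clipped (modified) n-gram precision that BLEU averages over several n; whitespace tokenization and the function names are assumptions, not the NIST implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: candidate n-gram counts are clipped by
    the maximum count observed in any single reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0
```

BLEU then combines these precisions for n = 1..4 (typically a geometric average) and multiplies by the brevity penalty described on the next slide.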

  12. BLEU • A promising automatic scoring metric for summary evaluation • Basically a precision metric • Measures how well a candidate overlaps a model using n-gram co-occurrence statistics • Uses a brevity penalty (BP) to prevent short translations from maximizing their precision score • In the formulas below, c = candidate length and r = reference length
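For reference, the standard BLEU formulas the slide refers to, with c and r as defined above and p_n the modified n-gram precision (the weights w_n are typically uniform, 1/N with N = 4):

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```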

  13. Anatomy of a BLEU Matching Score • From Chin-Yew Lin / MT Summit IX, 2003-09-27

  14. ROUGE: Recall-Oriented Understudy for Gisting Evaluation • From Chin-Yew Lin / MT Summit IX, 2003-09-27

  15. What makes a good metric? • An automatic evaluation should correlate highly, positively, and consistently with human assessments • If a human recognizes a good system, so should the metric • The statistical significance of automatic evaluations should be a reliable predictor of the statistical significance of human assessments • Such a metric can then be used in place of humans to assist system development
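A small sketch of the correlation check this criterion implies, comparing per-system human scores with per-system metric scores using Pearson and Spearman correlation; the numbers are invented purely for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores (illustrative values only).
human_scores = [0.42, 0.35, 0.51, 0.29, 0.47]   # e.g. mean coverage E
metric_scores = [0.40, 0.33, 0.55, 0.30, 0.45]  # e.g. an n-gram score

pearson_r, pearson_p = pearsonr(human_scores, metric_scores)
spearman_rho, spearman_p = spearmanr(human_scores, metric_scores)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```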

  16. ROUGE vs BLEU • ROUGE – recall based • Evaluates 1-, 2-, 3-, and 4-grams separately • No length penalty • Verified for extraction summaries • Focus on content overlap • BLEU – precision based • Mixed n-grams • Uses a brevity penalty to penalize system translations that are shorter than the average reference length • Favors longer n-grams for grammaticality and word order
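A minimal sketch of a recall-oriented n-gram co-occurrence score in the spirit of the ROUGE side of this comparison, assuming whitespace tokenization; this is not the official ROUGE implementation.

```python
from collections import Counter


def rouge_n_recall(candidate, references, n):
    """Recall-oriented n-gram overlap: matched reference n-grams divided
    by the total number of n-grams in the references."""
    def grams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand = grams(candidate)
    matched, total = 0, 0
    for ref in references:
        ref_grams = grams(ref)
        total += sum(ref_grams.values())
        matched += sum(min(count, cand[g]) for g, count in ref_grams.items())
    return matched / total if total else 0.0
```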

  17. By all measures

  18. Findings • Ngram(1,4) is a weighted variable-length n-gram match score similar to BLEU • Simple unigrams, Ngram(1,1), and bigrams, Ngram(2,2), consistently outperformed Ngram(1,4) in the single- and multiple-document tasks when stopwords are ignored • Weighted-average n-gram scores fall between the bigram and trigram scores, suggesting summaries are over-penalized by the weighted average due to a lack of longer n-gram matches • Excluding stopwords when computing n-gram statistics generally achieves better correlation than including them • Ngram(1,1) and Ngram(2,2) are good automatic scoring metrics based on statistical predictive power
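A hedged sketch of the Ngram(i, j) family referred to in the findings, read here as a uniform average of per-n recall scores for n = i..j with optional stopword removal, reusing rouge_n_recall from the sketch above; the uniform weighting and the stopword handling are assumptions rather than the paper's exact definition.

```python
def ngram_score(candidate, references, i, j, stopwords=frozenset()):
    """Average recall-oriented n-gram score over n = i..j, optionally
    ignoring stopwords (uniform weights are an assumption here)."""
    def strip(text):
        return " ".join(w for w in text.split() if w.lower() not in stopwords)

    candidate = strip(candidate)
    references = [strip(ref) for ref in references]
    # Reuses rouge_n_recall from the earlier sketch.
    scores = [rouge_n_recall(candidate, references, n) for n in range(i, j + 1)]
    return sum(scores) / len(scores)

# Ngram(1,1): unigram score; Ngram(2,2): bigram score; Ngram(1,4): averaged 1-4 grams.
```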
