Text Summarization

http://net.pku.edu.cn/~wbia Huang Lian'en (黄连恩) hle@net.pku.edu.cn School of Information Engineering, Peking University 12/17/2013

Overview

What is summarization?

Columbia Newsblaster • The academic version

What is the input? • News, or clusters of news • a single article or several articles on a related topic • Email and email thread • Scientific articles • Health information: patients and doctors • Meeting summarization • Video

What is the output? • Keywords • Highlighted information in the input • Chunks of text or speech taken directly from the input, or a paraphrase that aggregates the input in novel ways • Modality: text, speech, video, graphics

Ideal stages of summarization • Analysis • Input representation and understanding • Transformation • Selecting important content • Realization • Generating novel text corresponding to the gist of the input

Most current systems • Use shallow analysis methods • Rather than full understanding • Work by sentence selection • Identify important sentences and piece them together to form a summary

Types of summaries • Extracts • Sentences from the original document are displayed together to form a summary • Abstracts • Material is transformed: paraphrased, restructured, shortened

Extractive summarization • Each sentence is assigned a score that reflects how important and content-rich it is • Data-driven approaches • Word statistics • Cue phrases • Section headers • Sentence position • Knowledge-based systems • Discourse information • Resolve anaphora, text structure • Use external lexical resources • WordNet, adjective polarity lists, opinion • Using machine learning

What are summaries useful for? • Relevance judgments • Does this document contain information I am interested in? • Is this document worth reading? • Save time • Reduce the need to consult the full document

Recent developments • March 2013: Yahoo bought the news-reading app Summly for $30 million! • April 2013: Google purchased Wavii for more than $30 million!

Multi-document summarization • Very useful for presenting and organizing search results • Many results are very similar, and grouping closely related documents helps cover more event facets • Summarizing similarities and differences between documents

How to deal with redundancy? • “Author JK Rowling has won her legal battle in a New York court to get an unofficial Harry Potter encyclopaedia banned from publication.” • “A U.S. federal judge in Manhattan has sided with author J.K. Rowling and ruled against the publication of a Harry Potter encyclopedia created by a fan of the book series.” • Shallow techniques not likely to work well

Global optimization for content selection • What is the best summary? vs What is the best sentence? • Form all summaries and choose the best • What is the problem with this approach?
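
One likely answer to the question above is combinatorial growth: the number of candidate summaries explodes with document length. The sketch below only counts candidates, assuming a summary is an unordered choice of a fixed number of sentences; the function name `candidate_summaries` is illustrative and not from the slides.

```python
from math import comb

def candidate_summaries(num_sentences, summary_len):
    """Number of distinct summaries made of `summary_len` sentences
    chosen from `num_sentences` input sentences (order ignored)."""
    return comb(num_sentences, summary_len)

# A short input is manageable; a modest document cluster already is not.
print(candidate_summaries(20, 3))    # 1140
print(candidate_summaries(100, 5))   # 75287520
```

Scoring every candidate is therefore impractical for realistic inputs, which is why approximate or greedy selection is commonly used instead.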

Information ordering • In what order to present the selected sentences? • An article with permuted sentences will not be easy to understand • Very important for multi-document summarization • Sentences coming from different documents

Automatic summary edits • Some expressions might not be appropriate in the new context • References: • he • Putin • Russian Prime Minister Vladimir Putin • Discourse connectives • However, moreover, subsequently • Requires more sophisticated NLP techniques

Before: Pinochet was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

After: Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

Before: Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

After: Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

Traditional Approaches

1) Word-frequency-based method • Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts, 1958 • [Image courtesy of IBM]

Luhn’s method: basic idea • Target documents: technical literature • The method is based on the following assumptions: • Frequency of word occurrence in an article is a useful measurement of word significance • Relative position of these significant words within a sentence is a useful measurement of sentence significance • Designed around the limited capabilities of the machines of the time (IBM 704) → no semantic information is used

Why word frequency? • Important words are repeated throughout the text • examples are given in favor of a certain principle • arguments are given for a certain principle • Technical literature → one word: one notion • Simple and straightforward algorithm → cheap to implement (processing time is costly) • Note that different forms of the same word are counted as the same word

When is a word significant? • Words with too low a frequency are not significant • Words with too high a frequency are also not significant (e.g. “the”, “and”) • Removing low-frequency words is easy • set a minimum frequency threshold • Removing common (high-frequency) words: • Setting a maximum frequency threshold (statistically obtained) • Comparing to a common-word list • See Figure 1 in [Luhn, 1958]
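
The two filters above can be sketched as follows. This is a minimal illustration, not Luhn's implementation: the common-word list, the threshold values, and the name `significant_words` are assumptions, and no stemming is done, so different forms of a word are not merged here.

```python
import re
from collections import Counter

# Assumed, illustrative common-word list (not Luhn's actual list).
COMMON_WORDS = {"the", "and", "of", "a", "is", "in", "to"}

def significant_words(text, min_freq=2, max_freq=None):
    """Return words that are neither too rare nor too common."""
    words = re.findall(r"[a-z]+", text.lower())
    # Filter 1: drop words found on the common-word list.
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    # Filter 2: keep words inside the [min_freq, max_freq] band.
    return {
        w for w, c in counts.items()
        if c >= min_freq and (max_freq is None or c <= max_freq)
    }

text = "The model and the data. The model fits the data and the noise."
print(significant_words(text))  # the set containing 'model' and 'data'
```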

Using relative position • Where the greatest number of high-frequency words are found closest together → the probability is very high that representative information is given • Based on the characteristic that an explanation of a certain idea is represented by words that occur close together (e.g. within sentences, paragraphs, chapters)

The significance factor • The “significance factor” of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between • Only consider the portion of the sentence bracketed by significant words with a maximum of 5 non-significant words in between, e.g. “(*) - - - [ * - * * - - * - - * ] - - (*)” • Significance factor formula: (Σ[*])² / |[.]|, i.e. the squared number of significant words in the bracket divided by the total number of words in the bracket (5² / 10 = 2.5 in the above example)
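
A minimal sketch of the significance factor, assuming the sentence is already tokenized and the set of significant words is known. The name `significance_factor` and the choice to return the best-scoring bracket when a sentence contains several are assumptions made for illustration.

```python
def significance_factor(tokens, significant, max_gap=5):
    """Score a sentence as (significant words in bracket)^2 / bracket length.

    A bracket starts and ends on a significant word and may contain at most
    `max_gap` non-significant words between consecutive significant words.
    """
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1                      # still inside the same bracket
        else:
            # Gap too large: close the bracket and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

# Bracket with 5 significant words ("*") spanning 10 words.
bracket = list("*-**--*--*")
print(significance_factor(bracket, {"*"}))  # 2.5
```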

Generating the abstract • For every sentence the significance factor is calculated • The sentences with a significance factor higher than a certain cut-off value are returned (alternatively the N highest-valued sentences can be returned) • For large texts, it can also be applied to subdivisions of the text • The journal paper presents no evaluation of the results!
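
Both selection rules (cut-off value or N highest-valued sentences) can be sketched as below. The function name `select_sentences` and the decision to return sentences in their original document order are assumptions, not details taken from the paper.

```python
def select_sentences(scored, cutoff=None, top_n=None):
    """Pick sentences from `scored`, a list of (sentence, score) pairs
    given in document order, using either a cut-off or a top-N rule."""
    if cutoff is not None:
        chosen = [i for i, (_, score) in enumerate(scored) if score > cutoff]
    else:
        ranked = sorted(range(len(scored)),
                        key=lambda i: scored[i][1], reverse=True)
        chosen = sorted(ranked[:top_n])     # restore document order
    return [scored[i][0] for i in chosen]

scored = [("a", 1.0), ("b", 2.5), ("c", 0.5), ("d", 2.0)]
print(select_sentences(scored, cutoff=1.5))  # ['b', 'd']
print(select_sentences(scored, top_n=3))     # ['a', 'b', 'd']
```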

2) Position-based method • H.P. Edmundson: New Methods in Automatic Extracting, 1969 • [Image: IBM 7090, courtesy of IBM]

Lead method • Claim: Important sentences occur at the beginning (and/or end) of texts. • Lead method: just take first sentence(s)! • Experiments: • In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58). • Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).

Cue-Phrase method • Claim 1: Important sentences contain ‘bonus phrases’, such as ‘significantly’, ‘In this paper we show’, and ‘In conclusion’, while unimportant sentences contain ‘stigma phrases’ such as ‘hardly’ and ‘impossible’. • Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97). • Method: Add to a sentence's score if it contains a bonus phrase; penalize it if it contains a stigma phrase.

Four methods for weighting • Weighting methods: • Cue Method • Key Method • Title Method • Location Method • The weight of a sentence is a linear combination of the weights obtained with the above four methods • The highest-weighted sentences are included in the abstract • Target documents: technical literature

Cue Method • Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”) • Three types of Cue words: • Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”) • Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”) • Null words: irrelevant

Obtaining Cue words • The lists were obtained by statistical analyses of 100 documents: • Dispersion (λ): number of documents in which the word occurred • Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all sentences • Bonus words: η > t_high • Stigma words: η < t_low • Null words: λ > t_λ and t_low < η < t_high
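
The classification rules above can be sketched as a small decision function. The threshold values used in the example are made up for illustration; the slides do not give the thresholds, and the function names are hypothetical.

```python
def selection_ratio(count_in_selected, count_in_all):
    """Occurrences in extractor-selected sentences / occurrences in all sentences."""
    return count_in_selected / count_in_all if count_in_all else 0.0

def classify_cue_word(dispersion, ratio, t_disp, t_low, t_high):
    """Return 'bonus', 'stigma', 'null', or None (word stays off all lists)."""
    if ratio > t_high:
        return "bonus"
    if ratio < t_low:
        return "stigma"
    if dispersion > t_disp:     # widely spread word with a middling ratio
        return "null"
    return None

# Illustrative thresholds only.
print(classify_cue_word(80, selection_ratio(6, 10), 50, 0.1, 0.5))  # bonus
print(classify_cue_word(80, selection_ratio(3, 10), 50, 0.1, 0.5))  # null
```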

Resulting Cue lists • Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc. • Stigma list (73): anaphoric expressions, belittling expressions, etc. • Null list (139): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.

Cue weight of sentence • Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0 • Cue weight of sentence: Σ (Cue weight of each word in sentence)
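
A minimal sketch of the Cue weight, assuming tokenized input and tiny illustrative word lists. The default values b = 1 and s = -1 are assumptions; the slides only require b > 0 and s < 0.

```python
def cue_weight(tokens, bonus, stigma, b=1.0, s=-1.0):
    """Sum per-word Cue weights: b for Bonus words, s for Stigma words,
    0 for Null words and for words on no list."""
    total = 0.0
    for token in tokens:
        if token in bonus:
            total += b
        elif token in stigma:
            total += s
    return total

tokens = "results are significant and hardly impossible".split()
print(cue_weight(tokens, bonus={"significant"},
                 stigma={"hardly", "impossible"}))  # -1.0
```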

Key Method • Principle based on [Luhn], counting the frequency of words. • Algorithm differs: • Create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold • The weight of each key word in the key glossary is set to the frequency with which it occurs in the document • Assign a key weight to each word which can be found in the key glossary • If a word is not in the key glossary, its key weight is 0 • Unlike [Luhn], relative position is not used • Key weight of sentence: Σ (Key weight of each word in sentence)
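
The Key Method steps can be sketched as follows, assuming tokenized input; the function names and the threshold value in the example are illustrative assumptions.

```python
from collections import Counter

def key_glossary(doc_tokens, cue_words, threshold):
    """Map each non-Cue word with frequency above `threshold` to its frequency."""
    counts = Counter(t for t in doc_tokens if t not in cue_words)
    return {word: c for word, c in counts.items() if c > threshold}

def key_weight(sentence_tokens, glossary):
    """Sum of key weights; words outside the glossary contribute 0."""
    return sum(glossary.get(t, 0) for t in sentence_tokens)

doc = "summary summary summary text text greatest".split()
glossary = key_glossary(doc, cue_words={"greatest"}, threshold=1)
print(glossary)                                        # {'summary': 3, 'text': 2}
print(key_weight(["summary", "of", "text"], glossary))  # 5
```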

Title Method • Based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs) • Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document • Words are given a positive title weight if they appear in this glossary • Title words are given a larger weight than heading words • Title weight of sentence: Σ (Title weight of each word in sentence)
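
A minimal sketch of the Title weight, assuming tokenized input. The weights 2.0 for title words and 1.0 for heading words are made-up values that only respect the stated ordering (title larger than heading); the function name is hypothetical.

```python
def title_weight(tokens, title_words, heading_words, null_words,
                 w_title=2.0, w_heading=1.0):
    """Sum of title weights of the words in a sentence."""
    # Build the title glossary from non-Null heading and title words;
    # a word appearing in both gets the larger title weight.
    glossary = {w: w_heading for w in heading_words if w not in null_words}
    glossary.update({w: w_title for w in title_words if w not in null_words})
    return sum(glossary.get(t, 0.0) for t in tokens)

title = {"new", "methods", "in", "automatic", "extracting"}
headings = {"cue", "method", "automatic"}
sentence = ["the", "cue", "method", "for", "automatic", "extracting"]
print(title_weight(sentence, title, headings, null_words={"in", "of"}))  # 6.0
```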

Location Method • Based on the hypothesis that: • Sentences occurring under certain headings are positively relevant • Topic sentences tend to occur very early or very late in a document and its paragraphs • Global idea: • Give each sentence below a heading the same weight as the heading itself (note that this is independent of the Title Method): Heading weight • Give each sentence a certain weight based on its position: Ordinal weight • Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence

Location Method: Heading weight • Compare each word in a heading with the pre-stored Heading dictionary • If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary • Heading weight of a heading: Σ (heading weight of each word in heading) • Heading weight of a sentence = Heading weight of its heading

Creating the Heading dictionary • The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word: • Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all headings • Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.) • Weights were given to the words in the Heading dictionary proportional to the selection ratio • The resulting Heading dictionary contained 90 words

Location Method: Ordinal weight • Sentences of the first paragraph are tagged with weight O1 • Sentences of the last paragraph are tagged with weight O2 • The first sentence of a paragraph is tagged with weight O3 • The last sentence of a paragraph is tagged with weight O4 • Ordinal weight of sentence: O1 + O2 + O3 + O4, where each term counts only if its condition applies to the sentence
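
The four position rules can be sketched as below, reading the sum as "add each O term whose condition holds for the sentence". The indexing scheme (0-based paragraph and sentence indices) and the numeric weights in the example are assumptions for illustration.

```python
def ordinal_weight(par_idx, sent_idx, num_pars, num_sents_in_par,
                   o1=1.0, o2=1.0, o3=1.0, o4=1.0):
    """Add up the ordinal weights whose position condition applies."""
    weight = 0.0
    if par_idx == 0:                          # sentence in first paragraph
        weight += o1
    if par_idx == num_pars - 1:               # sentence in last paragraph
        weight += o2
    if sent_idx == 0:                         # first sentence of its paragraph
        weight += o3
    if sent_idx == num_sents_in_par - 1:      # last sentence of its paragraph
        weight += o4
    return weight

# First sentence of the first paragraph in a 3-paragraph document.
print(ordinal_weight(0, 0, num_pars=3, num_sents_in_par=4,
                     o1=4, o2=3, o3=2, o4=1))  # 6.0
```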

Generating the abstract • Calculate the weight of a sentence: aC + bK + cT + dL, with a, b, c, d constant positive integers, C: Cue weight, K: Key weight, T: Title weight, L: Location weight • The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstract • Return the highest N sentences under their proper headings as the abstract (including title) • N is calculated by taking a percentage of the size of the original document; in this journal paper 25% is used
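
A minimal sketch of the linear combination and the 25% selection, assuming the four per-sentence weights are already computed. The coefficient values, the rounding of N (ceiling, at least one sentence), and the name `edmundson_extract` are assumptions; headings and title handling are left out.

```python
import math

def edmundson_extract(sentences, features, a=1, b=1, c=1, d=1, ratio=0.25):
    """Rank sentences by a*C + b*K + c*T + d*L and keep the top `ratio` share.

    `features` holds one (C, K, T, L) tuple per sentence, in document order.
    """
    scores = [a * C + b * K + c * T + d * L for C, K, T, L in features]
    n = max(1, math.ceil(ratio * len(sentences)))   # assumed rounding rule
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]      # keep document order

sentences = list("abcdefgh")
features = [(0, 0, 0, 1), (3, 0, 0, 0), (0, 0, 0, 0), (0, 1, 0, 0),
            (0, 0, 5, 0), (0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0)]
print(edmundson_extract(sentences, features))  # ['b', 'e']
```

Setting b = 0 in this sketch drops the Key weight from the score while leaving the other three methods in place.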

Which combination is best? • All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract • As Figure 4 of [Edmundson, 1969] shows (only the interesting results are included there), the best abstract is obtained when the Key method is omitted and only C, T and L are used • Surprising result! (Luhn used only keywords to create the abstract)

Evaluation • Evaluation was done on unseen data (40 technical documents) by comparison with handmade abstracts • Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge) • Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts • Another evaluation criterion: ‘extract-worthiness’ • Result: 84% of the selected sentences are extract-worthy • Therefore: one document admits many possible abstracts (differing in length and content)
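
The co-selection measure can be sketched as a set overlap. The exact definition used in the paper is not given on the slide; the sketch assumes it is the share of automatically selected sentences that the human extractor also selected, and the function name is hypothetical.

```python
def co_selection(auto_ids, manual_ids):
    """Fraction of automatically selected sentences also chosen by a human."""
    auto, manual = set(auto_ids), set(manual_ids)
    return len(auto & manual) / len(auto) if auto else 0.0

# Sentence indices picked by the system vs. by a human extractor.
print(co_selection([1, 2, 3, 4], [2, 4, 6, 8]))  # 0.5
```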

3) Machine-learning method • Ask people to select sentences • Use these as training examples for machine learning • Each sentence is represented as a number of features • Based on the features distinguish sentences that are appropriate for a summary and sentences that are not • Run on new inputs

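Scoring sentences • For each sentence s, the probability that it will be included in the summary S given the k features $F_1, \dots, F_k$ is calculated (Bayes’ rule): $P(s \in S \mid F_1, \dots, F_k) = \frac{P(F_1, \dots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \dots, F_k)}$ • Assuming statistical independence of the features: $P(s \in S \mid F_1, \dots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$ • $P(s \in S)$ is constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated directly from the training set by counting occurrences • This function assigns each s a score, which can be used to select sentences for inclusion in the abstract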
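
A minimal sketch of this scoring, assuming the usual naive-Bayes form P(s in S) · Π P(F_j | s in S) / Π P(F_j) with discrete features and probabilities estimated by plain counting (no smoothing). The data layout and the name `naive_bayes_score` are assumptions for illustration.

```python
def naive_bayes_score(features, training):
    """Score a sentence from its feature tuple.

    `training` is a list of (feature_tuple, in_summary) pairs labeled by people;
    `features` is the feature tuple of the sentence to score.
    """
    n = len(training)
    positives = [f for f, in_summary in training if in_summary]
    if n == 0 or not positives:
        return 0.0
    score = len(positives) / n                      # P(s in S)
    for j, value in enumerate(features):
        # P(F_j | s in S) and P(F_j), both estimated by counting.
        p_f_given_s = sum(f[j] == value for f in positives) / len(positives)
        p_f = sum(f[j] == value for f, _ in training) / n
        score *= (p_f_given_s / p_f) if p_f else 0.0
    return score

# Two binary features per sentence, e.g. "contains cue phrase", "is first in paragraph".
training = [((True, True), True), ((True, False), True),
            ((False, False), False), ((False, True), False)]
print(naive_bayes_score((True, True), training))   # 1.0
print(naive_bayes_score((False, True), training))  # 0.0
```

Because counts are unsmoothed, a feature value never seen in a summary sentence drives the score to zero; a real system would likely smooth the estimates.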
