Corpora and Statistical Methods Lecture 12

Presentation Transcript


  1. Corpora and Statistical Methods Lecture 12 Albert Gatt

  2. Part 2 Automatic summarisation

  3. The task • Given a single document or collection of documents, return an abridged version that distils the most important information (possibly for a particular task/user) • Summarisation systems perform: • Content selection: choosing the relevant information in the source document(s), typically in the form of sentences/clauses. • Information ordering • Sentence realisation: cleaning up the sentences to make them fluent. • Note the similarity to NLG architectures. • Main difference: summarisation input is text, whereas NLG input is non-linguistic data.

  4. Types of summaries • Extractive vs. Abstractive • Extractive: select informative sentences/clauses in the source document and reproduce them • most current systems (and our focus today) • Abstractive: summarise the subject matter (usually using new sentences) • much harder, as it involves deeper analysis & generation • Dimensions • Single-document vs. multi-document • Context • Query-specific vs. query-independent

  5. Extracts vs Abstracts: Lincoln’s Gettysburg Address • [Figure: an extract and an abstract of the address, shown side by side. Source: Jurafsky & Martin (2009), p. 823]

  6. A Summarization Machine • [Diagram: a generic summarisation machine. Inputs: a single document or multiple documents, plus an optional query. Parameters: length/compression (headline, very brief, brief, long; 10%, 50%, 100%), extract vs. abstract, indicative vs. informative, generic vs. query-oriented, just the news vs. background. Intermediate representations: case frames, templates, core concepts, core events, relationships, clause fragments, index terms. Outputs: extracts and abstracts.] • Adapted from: Hovy & Marcu (1998). Automated text summarization. COLING-ACL Tutorial. http://www.isi.edu/~marcu/

  7. The Modules of the Summarization Machine • [Diagram: the machine’s internal pipeline. Extraction, interpretation, generation and filtering modules link the input document(s), extracts, intermediate representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms) and abstracts.]

  8. Unsupervised single-document summarisation I “bag of words” approaches

  9. Basic architecture for single-doc • [Diagram: the content selection → information ordering → realisation pipeline.] • Content selection: the central task in single-document summarisation. Can be supervised or unsupervised. • Information ordering: less critical, since we have only one document and can rely on the order in which sentences occur in the source itself.

  10. Unsupervised content selection I: Topic Signatures • Simplest unsupervised algorithm: • Split document into sentences. • Select those sentences which contain the most salient/informative words. • Salient term = a term in the topic signature (words that are crucial to identifying the topic of the document) • Topic signature detection: • Represent sentences (documents) as word vectors • Compute the weight of each word • Weight sentences by the average weight of their (non-stop) words.
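A minimal sketch of this scoring scheme, assuming whitespace-tokenised sentences; term_weights (one weight per non-stop word) and signature (the set of salient terms) would come from the weighting step, e.g. the tf-idf or LLR sketches below. Averaging only over signature terms is one reasonable variant.

```python
# Score a sentence by the average weight of its salient (signature) words,
# then return the top-n sentences in their original document order.
def score_sentence(tokens, term_weights, signature):
    salient = [term_weights[t] for t in tokens if t in signature]
    return sum(salient) / len(salient) if salient else 0.0

def extractive_summary(sentences, term_weights, signature, n=3):
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: score_sentence(sentences[i],
                                                 term_weights, signature))
    return [sentences[i] for i in sorted(ranked[:n])]
```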

  11. Vector space revisited • Document collection represented as a key terms × documents matrix: • Doc 1: To make fried chicken, take the chicken, chop it up and put it in a pan until golden. Remove the fried chicken pieces and serve hot. • Doc 2: To make roast chicken, take the chicken and put in the oven until golden. Remove the chicken and serve hot. • Columns = documents; rows = term frequencies. • NB: use a stop list to remove very high-frequency words!

  12. Term weighting: tf-idf • Common term weighting scheme used in the information retrieval literature. • tf (term frequency) = frequency of term i in the document • idf (inverse document frequency) = log(N/n_i), where N = no. of documents and n_i = no. of docs in which term i occurs • Method: count the frequency of the term in the doc being considered; count the inverse document frequency over the whole document collection; compute the tf-idf score: tf-idf(i, d) = tf(i, d) × idf(i). A small sketch follows.
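A small tf-idf sketch under simple assumptions (pre-tokenised documents and an illustrative stop list; a real system would use a proper tokeniser and stop list):

```python
import math
from collections import Counter

STOP_WORDS = {"to", "the", "it", "and", "in", "a", "until"}  # illustrative only

def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: tf-idf} dict per doc."""
    doc_freq = Counter()                 # n_i: number of docs containing term i
    for doc in documents:
        doc_freq.update(set(doc) - STOP_WORDS)
    n_docs = len(documents)
    weights = []
    for doc in documents:
        tf = Counter(t for t in doc if t not in STOP_WORDS)
        weights.append({t: freq * math.log(n_docs / doc_freq[t])
                        for t, freq in tf.items()})
    return weights
```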

  13. Term weighting: log likelihood ratio • Requirements: • A background corpus • In our case, for a term w, LLR is the ratio between: • Prob. of observing w in the input corpus • Prob. of observing w in the background corpus • Since LLR is asymptotically chi-square distributed, if the LLR value is significant, we treat the term as a key term. • Chi-square values tend to be significant at p = .001 if they are greater than 10.8
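A sketch of the log-likelihood ratio computation in the style of Dunning (1993): the input and background corpora are reduced to a count for the term and a total token count, and the statistic compares “same probability in both corpora” against “different probabilities”:

```python
import math

def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k occurrences in n tokens with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(count_input, total_input, count_bg, total_bg):
    """-2 log lambda for 'the term is equally likely in both corpora'."""
    p1 = count_input / total_input
    p2 = count_bg / total_bg
    p = (count_input + count_bg) / (total_input + total_bg)
    return 2 * (_log_likelihood(count_input, total_input, p1)
                + _log_likelihood(count_bg, total_bg, p2)
                - _log_likelihood(count_input, total_input, p)
                - _log_likelihood(count_bg, total_bg, p))

# Treat the term as a key term if llr(...) > 10.83 (chi-square, 1 df, p = .001).
```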

  14. Sentence centrality • Instead of weighting sentences by averaging individual term weights, we can compute the pairwise distance between sentences and choose those sentences which are closer to each other on average. • Example: represent sentences as tf-idf vectors and compute the cosine for each sentence x in relation to all other sentences y: centrality(x) = (1/K) Σy tf-idf-cosine(x, y), where K = total no. of sentences
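A sketch of the centrality computation over sparse tf-idf vectors (e.g. as produced by the tf-idf sketch above); an illustration rather than a canonical implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centrality_scores(sentence_vectors):
    """Average cosine of each sentence to all other sentences (K = total sentences)."""
    k = len(sentence_vectors)
    return [sum(cosine(x, y) for y in sentence_vectors if y is not x) / k
            for x in sentence_vectors]
```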

  15. Unsupervised single-document summarisation II Using rhetorical structure

  16. Rhetorical Structure Theory • RST (Mann and Thompson 1988) is a theory of text structure • Not about what texts are about, but about how bits of the underlying content of a text are structured so as to hang together in a coherent way. • The main claim of RST: • Parts of a text are related to each other in predetermined ways. • There is a finite set of such relations. • Relations hold between two spans of text: • Nucleus • Satellite

  17. A small example • You should visit the new exhibition. It’s excellent. It got very good reviews. It’s completely free. • [Diagram: an RST tree over these clauses, built from the ENABLEMENT, MOTIVATION and EVIDENCE relations.]

  18. An RST relation definition MOTIVATION • Nucleus represents an action which the hearer is meant to do at some point in future. • You should go to the exhibition • Satellite represents something which is meant to make the hearer want to carry out the nucleus action. • It’s excellent. It got a good review. • Note: the satellite need not be a single clause. In our example, the satellite has 2 clauses, which are themselves related to each other by the EVIDENCE relation. • Effect: to increase the hearer’s desire to perform the nucleus action.

  19. RST relations more generally • An RST relation is defined in terms of the • Nucleus + constraints on the nucleus • (e.g. Nucleus of motivation is some action to be performed by H) • Satellite + constraints on satellite • Desired effect. • Other examples of RST relations: • CAUSE: the nucleus is the result; the satellite is the cause • ELABORATION: the satellite gives more information about the nucleus • Some relations are multi-nuclear • Do not relate a nucleus and satellite, but two or more nuclei (i.e. 2 pieces of information of the same status). • Example: SEQUENCE • John walked into the room. He turned on the light.

  20. Some more on RST • RST relations are neutral with respect to their realisation. • E.g. you can express EVIDENCE in lots of different ways: It’s excellent. It got very good reviews. / You can see that it’s excellent from its great reviews. / Its excellence is evidenced by the good reviews it got. / It must be excellent since it got good reviews. • [Diagram: the EVIDENCE relation holding between “It’s excellent” and “It got very good reviews”.]

  21. RST for unsupervised content selection • Compute coherence relations between units (= clauses) • Can use a discourse parser and/or rely on cue phrases • Corpora annotated with RST relations exist • Use the intuition that the nucleus of a relation is more central to the content than the satellite to identify the set of salient units Sal (see the sketch below): • Base case: if n is a leaf node, then Sal(n) = {n} • Recursive case: if n is a non-leaf node, then Sal(n) is the union of Sal(c) for every child c of n that is a nucleus • Rank the units in Sal: the higher the node of which a unit is a nucleus, the more salient it is
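A toy sketch of the recursion; the Node structure with explicit nucleus flags on children is a hypothetical representation, not the output format of any particular discourse parser:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    clause: Optional[str] = None               # set for leaf nodes only
    children: List[Tuple["Node", bool]] = field(default_factory=list)  # (child, is_nucleus)

def salient_units(node):
    """Base case: a leaf is its own salient unit.
    Recursive case: promote the salient units of nucleus children up the tree."""
    if node.clause is not None:
        return {node.clause}
    units = set()
    for child, is_nucleus in node.children:
        if is_nucleus:
            units |= salient_units(child)
    return units
```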

  22. Rhetorical structure: example • [Diagram: an RST tree over numbered clause units, with nuclei promoted up the tree.] • Resulting ranking of the nodes: 2 > 8 > 3 ...

  23. Supervised content selection

  24. Basic idea • Input: a training set consisting of: • Document + human-produced (extractive) summaries • So sentences in each doc can be marked with a binary feature (1 = included in summary; 0 = not included) • Train a machine learner to classify sentences as 1 (extract-worthy) or 0, based on features.

  25. Features • Position: important sentences tend to occur early in a document (but this is genre-dependent). E.g. in news articles the most important sentence is the title. • Cue phrases: sentences with phrases like to summarise give important summary info. (Again genre-dependent: different genres have different cue phrases.) • Word informativeness: words in the sentence which belong to the doc’s topic signature • Sentence length: we usually want to avoid very short sentences • Cohesion: we can use lexical chains to compute how many of a sentence’s words also occur in the document’s lexical chains • Lexical chain: a series of words that are indicative of the document’s topic

  26. Algorithms • Once we have the feature set F = {f1, ..., fn} for a sentence, we want to compute P(extract-worthy | f1, ..., fn) • Many methods we’ve discussed will do! • Naive Bayes • Maximum Entropy • ...
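A hedged sketch using scikit-learn’s Naive Bayes classifier; extract_features() is a hypothetical helper that turns a sentence into a numeric vector of the features listed above:

```python
from sklearn.naive_bayes import GaussianNB

def train_extractor(sentences, labels):
    """labels[i] is 1 if sentences[i] appeared in the human extract, else 0."""
    X = [extract_features(s) for s in sentences]   # extract_features: assumed helper
    return GaussianNB().fit(X, labels)

def score_sentences(model, sentences):
    """Return P(extract-worthy | features) for each candidate sentence."""
    X = [extract_features(s) for s in sentences]
    return model.predict_proba(X)[:, 1]
```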

  27. Which corpus? • There are some corpora with extractive summaries, but often we come up against the problem of not having the right data. • Many types of text in themselves contain summaries, e.g. scientific articles have abstracts • But these are not purely extractive! • (though people tend to include sentences in abstracts that are very similar to the sentences in their text). • Possible method: align sentences in an abstract with sentences in the document, by computing their overlap (e.g. using n-grams)
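One possible implementation of the n-gram alignment idea (bigrams and the 0.5 threshold are illustrative choices):

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(abstract_sent, doc_sent, n=2):
    """Fraction of the abstract sentence's n-grams found in the document sentence."""
    a_grams, d_grams = ngrams(abstract_sent, n), ngrams(doc_sent, n)
    return len(a_grams & d_grams) / len(a_grams) if a_grams else 0.0

def align(abstract_sents, doc_sents, threshold=0.5):
    """For each abstract sentence, pick the best-matching document sentence."""
    alignment = {}
    for i, a in enumerate(abstract_sents):
        j, score = max(((j, overlap(a, d)) for j, d in enumerate(doc_sents)),
                       key=lambda pair: pair[1])
        if score >= threshold:
            alignment[i] = j
    return alignment
```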

  28. Realisation Sentence simplification

  29. Realisation • With single-doc summarisation, realisation isn’t a big problem (we’re reproducing sentences from the source document). • But we may want to simplify (or compress) the sentences. • Simplest method is to use heuristics that drop material such as: • Appositives: Rajam, 28, an artist who lives in Philadelphia, found inspiration in the back of city magazines. • Sentential adverbs: As a matter of fact, this policy will be ruinous. • A lot of current research on simplification/compression, often using parsers to identify dependencies that can be omitted with little loss of information. • Realisation is much more of an issue in multi-document summarisation.

  30. Multi-document summarisation

  31. Why multi-document • Very useful when: • queries return multiple documents from the web • Several articles talk about the same topic (e.g. a disease) • ... • The steps are the same as for single-doc summarisation, but: • We’re selecting content from more than one source • We can’t rely on the source documents only for ordering • Realisation is required to ensure coherence.

  32. Content selection • Since we have multiple docs, we have a problem with redundancy: repeated info in several documents; overlapping words, sentences, phrases... • We can modify sentence scoring methods to penalise redundancy, by comparing a candidate sentence to sentences already selected. • Methods: • Modify sentence score to penalise redundancy: (sentence is compared to sentences already chosen in the summary) • Use clustering to group related sentences, and then perform selection on clusters. • More on clustering next week.
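One way to realise the redundancy penalty is a greedy, MMR-style selection loop; a sketch only, with cosine() as in the centrality sketch and lambda_ = 0.7 as an illustrative trade-off rather than a recommended setting:

```python
def select_sentences(vectors, relevance, k, lambda_=0.7):
    """vectors: tf-idf dicts; relevance: per-sentence scores; k: summary size."""
    selected = []
    candidates = set(range(len(vectors)))
    while candidates and len(selected) < k:
        def penalised(i):
            # Redundancy = similarity to the most similar already-chosen sentence.
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=penalised)
        selected.append(best)
        candidates.remove(best)
    return selected
```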

  33. Information ordering • If sentences are selected from multiple documents, we risk creating an incoherent document. • Rhetorical structure: • *Therefore, I slept. I was tired. • I was tired. Therefore, I slept. • Lexical cohesion: • *We had chicken for dinner. Paul was late. It was roasted. • We had chicken for dinner. It was roasted. Paul was late. • Referring expressions: • *He said that ... . George W. Bush was speaking at a meeting. • George W. Bush said that ... . He was speaking at a meeting. • These heuristics can be combined. • We can also do information ordering during the content selection process itself.

  34. Information ordering based on reference • Referring expressions (NPs that identify objects) include pronouns, names, definite NPs... • Centering Theory (Grosz et al 1995): every discourse segment has a focus (what the segment is “about”). • Entities are salient in discourse depending on their position in the sentence: SUBJECT >> OBJECT >> OTHER • A coherent discourse is one which, as far as possible, maintains smooth transitions between sentences.

  35. Information ordering based on lexical cohesion • Sentences which are “about” the same things tend to occur together in a document. • Possible method: • use tf-idf cosine to compute pairwise similarity between selected sentences • attempt to order sentences to maximise the similarity between adjacent pairs.
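A greedy sketch of this idea: finding the globally best ordering is expensive, so we approximate it by repeatedly appending the unplaced sentence most similar to the last one placed (cosine() as in the centrality sketch):

```python
def order_by_cohesion(vectors):
    """vectors: tf-idf dicts for the selected sentences; returns an index ordering."""
    remaining = list(range(len(vectors)))
    order = [remaining.pop(0)]        # start arbitrarily from the first sentence
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda i: cosine(vectors[last], vectors[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```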

  36. Realisation • Compare: [Figure: example summaries compared to illustrate realisation problems. Source: Jurafsky & Martin (2009), p. 835]

  37. Uses of realisation • Since sentences come from different documents, we may end up with infelicitous NP orderings (e.g. pronoun before definite). One possible solution: • run a coreference resolver on the extracted summary • Identify reference chains (NPs referring to the same entity) • Replace or reorder NPs if they violate coherence. • E.g. use full name before pronoun • Another interesting problem is sentence aggregation or fusion, where different phrases (from different sources) are combined into a single phrase.

  38. Evaluating summarisation

  39. Evaluation baselines • Random sentences: • If we’re producing summaries of length N, we use as baseline a random extractor that pulls out N sentences. • Not too difficult to beat. • Leading sentences: • Choose the first N sentences. • Much more difficult to beat! • A lot of informative sentences are at the beginning of documents.
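Both baselines are trivial to implement; a sketch (the fixed seed is only for reproducibility):

```python
import random

def random_baseline(sentences, n, seed=0):
    """Pick n sentences at random (document order preserved for readability)."""
    chosen = set(random.Random(seed).sample(range(len(sentences)), n))
    return [s for i, s in enumerate(sentences) if i in chosen]

def leading_baseline(sentences, n):
    """Pick the first n sentences; surprisingly hard to beat on news text."""
    return sentences[:n]
```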

  40. Some terminology (reminder) • Intrinsic evaluation: evaluation of output in its own right, independent of a task (e.g. Compare output to human output). • Extrinsic evaluation: evaluation of output in a particular task (e.g. Humans answer questions after reading a summary) • We’ve seen the uses of BLEU (intrinsic) for realisation in NLG. • A similar metric in Summarisation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  41. BLEU vs ROUGE • BLEU: precision-oriented. Looks at n-gram overlap for different values of n up to some maximum. Measures the average n-gram overlap between an output text and a set of reference texts. • ROUGE: recall-oriented. The n-gram length is fixed: ROUGE-1, ROUGE-2, etc. (for different n-gram lengths). Measures how many n-grams of the reference summary an output summary contains.

  42. ROUGE • Generalises easily to any n-gram length. • Other versions: • ROUGE-L: measures longest common subsequence between reference summary and output • ROUGE-SU: uses skip bigrams
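A minimal sketch of ROUGE-N recall against a single reference summary (the official ROUGE toolkit averages over several references and implements the further variants above):

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """candidate, reference: token lists. Returns n-gram recall."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = counts(candidate), counts(reference)
    matched = sum(min(cand[g], count) for g, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0
```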

  43. Intrinsic vs. Extrinsic again • Problem: ROUGE assumes that reference summaries are “gold standards”, but people often disagree about summaries, including wording. • Same questions arise as for NLG (and MT): • To what extent does this metric actually tell us about the effectiveness of a summary? • Some recent work has shown that the correlation between ROUGE and a measure of relevance given by humans is quite low. • See: Dorr et al. (2005). A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate? Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 1–8, Ann Arbor, June 2005

  44. The Pyramid method (Nenkova et al.) • Also intrinsic, but relies on semantic content units (SCUs) instead of n-grams. • Human annotators label SCUs in sentences from human summaries. • Based on identifying the content of different sentences, and grouping together sentences in different summaries that talk about the same thing. • Goes beyond surface wording! • Find SCUs in the automatic summaries. • Weight SCUs (an SCU’s weight = the number of human summaries in which it occurs). • Compute the ratio of the sum of weights of SCUs in the automatic summary to the weight of an optimal summary of roughly the same length.
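A rough sketch of the final ratio, treating “an optimal summary of roughly the same length” as “a summary with the same number of SCUs” (a simplifying assumption):

```python
def pyramid_score(system_scus, scu_weights):
    """system_scus: SCU ids found in the system summary;
    scu_weights: {scu_id: weight}, weight = no. of human summaries containing the SCU."""
    observed = sum(scu_weights[s] for s in system_scus)
    ideal = sum(sorted(scu_weights.values(), reverse=True)[:len(system_scus)])
    return observed / ideal if ideal else 0.0
```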
