
807 - TEXT ANALYTICS

807 - TEXT ANALYTICS. Massimo Poesio. Lecture 10: Summarization. What is summarization? To take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s application needs.



Presentation Transcript


  1. 807 - TEXT ANALYTICS • Massimo Poesio • Lecture 10: Summarization

  2. What is summarization? To take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s application needs

  3. Single-document summarization • Flu stopper: A new compound is set for human testing (Times) • Running nose. Raging fever. Aching joints. Splitting headache. Are there any poor souls suffering from the flu this winter who haven’t longed for a pill to make it all go away? Relief may be in sight. Researchers at Gilead Sciences, a pharmaceutical company in Foster City, California, reported last week in the Journal of the American Chemical Society that they have discovered a compound that can stop the influenza virus from spreading in animals. Tests on humans are set for later this year. The new compound takes a novel approach to the familiar flu virus. It targets an enzyme, called neuraminidase, that the virus needs in order to scatter copies of itself throughout the body. This enzyme acts like a pair of molecular scissors that slices through the protective mucous linings of the nose and throat. After the virus infects the cells of the respiratory system and begins replicating, neuraminidase cuts the newly formed copies free to invade other cells. By blocking this enzyme, the new compound, dubbed GS 4104, prevents the infection from spreading.

  4. Single-document summarization • Flu stopper: A new compound is set for human testing (Times) • Running nose. Raging fever. Aching joints. Splitting headache. Are there any poor souls suffering from the flu this winter who haven’t longed for a pill to make it all go away? Relief may be in sight. Researchers at Gilead Sciences, a pharmaceutical company in Foster City, California, reported last week in the Journal of the American Chemical Society that they have discovered a compound that can stop the influenza virus from spreading in animals. Tests on humans are set for later this year. The new compound takes a novel approach to the familiar flu virus. It targets an enzyme, called neuraminidase, that the virus needs in order to scatter copies of itself throughout the body. This enzyme acts like a pair of molecular scissors that slices through the protective mucous linings of the nose and throat. After the virus infects the cells of the respiratory system and begins replicating, neuraminidase cuts the newly formed copies free to invade other cells. By blocking this enzyme, the new compound, dubbed GS 4104, prevents the infection from spreading.

  5. Application: Headline news

  6. Application: TV guides

  7. Application: Abstracts of papers

  8. Multi-document summarization • Multi-document summarization (summarizing a large number of news items at once) is a particularly popular application

  9. Human summarization and abstracting • What professional abstractors do • Ashworth: • “To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form”.

  10. Original version: There were significant positive associations between the concentrations of the substance administered and mortality in rats and mice of both sexes. There was no convincing evidence to indicate that endrin ingestion induced any of the different types of tumors which were found in the treated animals. • Edited version: Mortality in rats and mice of both sexes was dose related. No treatment-related tumors were found in any of the animals. • Cremmins 82, 96

  11. Computational Approach: Basics • Bottom-Up: • I’m dead curious: what’s in the text? • User needs: anything that’s important • System needs: generic importance metrics, used to rate content • Top-Down: • I know what I want! Don’t confuse me with drivel! • User needs: only certain types of info • System needs: particular criteria of interest, used to focus search

  12. Query-Driven vs. Text-Driven Focus • Top-down: Query-driven focus • Criteria of interest encoded as search specs. • System uses specs to filter or analyze text portions. • Examples: templates with slots with semantic characteristics; termlists of important terms. • Bottom-up: Text-driven focus • Generic importance metrics encoded as strategies. • System applies strategies over a representation of the whole text. • Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens.

  13. Types of summaries • Extracts • Sentences from the original document are displayed together to form a summary • Abstracts • Material is transformed: paraphrased, restructured, shortened

  14. Ideal stages of summarization • Analysis • Input representation and understanding • Transformation • Selecting important content • Realization • Generating novel text corresponding to the gist of the input

  15. What current systems do • Most work bottom-up • Typically use shallow analysis methods • Rather than full understanding • Work by sentence extraction • Identify important sentences and piece them together to form a summary • More advanced work: move towards more abstractive summarization

  16. Shallow approaches • Relying on features of the input documents that can be easily computed from statistical analysis • Word statistics • Cue phrases • Section headers • Sentence position

  17. What is the input? • News, or clusters of news • a single article or several articles on a related topic • Email and email thread • Scientific articles • Health information: patients and doctors • Meeting summarization • Video

  18. What is the output? • Keywords • Highlight information in the input • Chunks or speech directly from the input, or paraphrase and aggregate the input in novel ways • Modality: text, speech, video, graphics

  19. Supervised methods • Ask people to select sentences • Use these as training examples for machine learning • Each sentence is represented as a number of features • Based on the features distinguish sentences that are appropriate for a summary and sentences that are not • Run on new inputs

  20. Edmundson 69 • Cue method: stigma words (“hardly”, “impossible”); bonus words (“significant”) • Key method: similar to Luhn • Title method: title + headings • Location method: sentences under headings; sentences near beginning or end of document and/or paragraphs (also [Baxendale 58])

  21. Edmundson 69 • Linear combination of four features: $\alpha_1 C + \alpha_2 K + \alpha_3 T + \alpha_4 L$ • Manually labelled training corpus • Key not important! • [Figure: bar chart on a 0 to 100% axis comparing C + T + L, C + K + T + L, LOCATION, CUE, TITLE, KEY, and RANDOM]
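The linear combination on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the per-sentence feature values, and the weights are all hypothetical, and the weights in particular are not Edmundson's tuned values. Setting the Key weight to zero simply mirrors the slide's remark that the Key feature was not important.

```python
def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum a1*C + a2*K + a3*T + a4*L for a single sentence."""
    a1, a2, a3, a4 = weights
    return a1 * cue + a2 * key + a3 * title + a4 * location


def rank_sentences(features, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return sentence indices, best first.

    `features` is a list of (C, K, T, L) tuples, one per sentence.
    Ties are broken in favour of the earlier sentence.
    """
    scores = [edmundson_score(*row, weights=weights) for row in features]
    return sorted(range(len(features)), key=lambda i: (-scores[i], i))


# Hypothetical feature values for three sentences (not from any real corpus).
features = [
    (1.0, 0.2, 0.0, 1.0),   # bonus cue word, first sentence of a section
    (-1.0, 0.8, 0.0, 0.0),  # stigma cue word
    (0.0, 0.1, 1.0, 0.5),   # shares words with the title
]
# Key weight set to 0 as an assumption, echoing "Key not important!".
print(rank_sentences(features, weights=(1.0, 0.0, 1.0, 1.0)))
```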

  22. Kupiec et al. 95 • Extracts of roughly 20% of original text • Feature set: • sentence length: |S| > 5 • fixed phrases: 26 manually chosen • paragraph: sentence position in paragraph • thematic words: binary: whether sentence is included in manual extract • uppercase words: not common acronyms • Corpus: 188 document + summary pairs from scientific journals

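  23. Kupiec et al. 95 • Uses Bayesian classifier: $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$ • Assuming statistical independence: $P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$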
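A minimal sketch of such a classifier, assuming binary features and add-one smoothing (both are choices made for this illustration, not details taken from the slide). Under the independence assumption, a sentence is scored as the prior P(s in S) multiplied by the ratio P(F_j | s in S) / P(F_j) for each feature. The feature names and the four training examples are made up. Because of the independence approximation and the smoothing, the result is best read as a ranking score rather than a calibrated probability.

```python
from collections import Counter


def train(examples):
    """examples: list of (features, in_summary) pairs.

    `features` maps a feature name to True/False; `in_summary` says whether
    a human selected the sentence for the extract.
    """
    pos_counts, all_counts = Counter(), Counter()
    n_pos = 0
    for feats, in_summary in examples:
        n_pos += int(in_summary)
        for item in feats.items():
            all_counts[item] += 1
            if in_summary:
                pos_counts[item] += 1
    return {"n": len(examples), "n_pos": n_pos,
            "pos": pos_counts, "all": all_counts}


def score(model, feats, smoothing=1.0):
    """Prior times the product of P(F_j | s in S) / P(F_j) over all features."""
    result = model["n_pos"] / model["n"]  # prior P(s in S)
    for item in feats.items():
        # The "2 * smoothing" term assumes each feature has two possible values.
        p_f_given_s = (model["pos"][item] + smoothing) / (model["n_pos"] + 2 * smoothing)
        p_f = (model["all"][item] + smoothing) / (model["n"] + 2 * smoothing)
        result *= p_f_given_s / p_f
    return result


# Hypothetical training data: two selected and two unselected sentences.
examples = [
    ({"first_in_paragraph": True, "has_fixed_phrase": True}, True),
    ({"first_in_paragraph": True, "has_fixed_phrase": False}, True),
    ({"first_in_paragraph": False, "has_fixed_phrase": False}, False),
    ({"first_in_paragraph": False, "has_fixed_phrase": False}, False),
]
model = train(examples)
print(score(model, {"first_in_paragraph": True, "has_fixed_phrase": True}))
print(score(model, {"first_in_paragraph": False, "has_fixed_phrase": False}))
```

To build an extract, one would score every sentence of a new document and keep the top-scoring ones up to the desired length.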

  24. Kupiec et al. 95 • Performance: • For 25% summaries, 84% precision • For smaller summaries, 74% improvement over Lead

  25. A typical modern supervised summarization system • Or, what you could do if asked to do one …

  26. Features • Location • Absolute location of the sentence • Section structure: first sentence, last sentence, other • Paragraph structure • What section the sentence appeared in • Introduction, implementation, example, conclusion, result, evaluation, experiment etc

  27. More features • Sentence length • Very long and very short sentences are unusual • Title word overlap • Tf.idf word content • Binary feature • “yes” if the sentence contains one of the 18 most important words • “no” otherwise

  28. More features • Presence and type of citation • Formulaic expressions • “in traditional approaches”, “a novel method for”
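The features listed on the last three slides could be computed roughly as follows. This is a sketch under several assumptions: the function and feature names are invented for the example, the length thresholds are arbitrary, the set of important words is assumed to have been computed elsewhere (for example, the 18 highest tf.idf words), and citation and section-type features are left out because they depend on the document format.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_features(sentence, index, n_sentences, title,
                      top_words, formulaic_phrases,
                      min_len=5, max_len=40):
    """Feature dictionary for one sentence.

    `top_words` and `formulaic_phrases` are supplied by the caller; the
    length bounds `min_len` and `max_len` are hypothetical defaults.
    """
    tokens = tokenize(sentence)
    lowered = sentence.lower()
    return {
        # Location: absolute position, scaled to [0, 1], plus first/last flags.
        "relative_position": index / max(n_sentences - 1, 1),
        "is_first": index == 0,
        "is_last": index == n_sentences - 1,
        # Sentence length: flags very short and very long sentences.
        "length_ok": min_len <= len(tokens) <= max_len,
        # Title word overlap: number of distinct words shared with the title.
        "title_overlap": len(set(tokenize(title)) & set(tokens)),
        # Binary tf.idf feature: contains one of the important words?
        "has_top_word": any(t in top_words for t in tokens),
        # Formulaic expressions such as "a novel method for".
        "has_formulaic": any(p in lowered for p in formulaic_phrases),
    }


feats = sentence_features(
    "We present a novel method for text summarization of news.",
    index=0, n_sentences=3, title="Text summarization",
    top_words={"summarization"},
    formulaic_phrases=["in traditional approaches", "a novel method for"],
)
print(feats)
```

Dictionaries of this shape can be fed to any standard classifier once human-selected sentences are available as training labels.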

  29. Problems with supervised methods for summarization • Annotation is expensive • Here: relevance and rhetorical status judgments • People don’t agree • So more annotators are necessary • And/or more training of the annotators

  30. Unsupervised methods for (extractive) summarization: basic idea • Compute word probability from input • Compute sentence weight as function of word probability • Pick best sentence

  31. Sentence ranking options • Based on word probability • S is a sentence of length n • p_i is the probability of the i-th word in the sentence • Based on word tf.idf
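The word-probability option can be sketched as below. The slide's scoring formula did not survive in the transcript, so this example assumes one common choice: the weight of a sentence is the average of the probabilities p_i of its n words. The function names and the three toy sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_probabilities(sentences):
    """Relative frequency of each word over the whole input."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sentence_weight(sentence, probs):
    """Assumed weighting: mean of the word probabilities p_1 .. p_n."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(probs.get(t, 0.0) for t in tokens) / len(tokens)


def best_sentence(sentences):
    """Pick the sentence with the highest weight."""
    probs = word_probabilities(sentences)
    return max(sentences, key=lambda s: sentence_weight(s, probs))


sentences = [
    "the flu virus spreads",
    "the flu virus needs an enzyme",
    "cats sleep",
]
print(best_sentence(sentences))
```

Averaging keeps long sentences from winning purely on length; a sum or a product over the p_i would behave differently, which is why the choice is flagged as an assumption here.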

  32. Centrality measures • How representative is a sentence of the overall content of a document? • The more similar a sentence is to the document, the more representative it is
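One way to make this concrete, offered as a sketch rather than as the method intended on the slide, is to represent both the sentence and the whole document as bags of words and to use cosine similarity between them as the centrality score. The function names and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centrality_ranking(sentences):
    """Sentence indices ordered from most to least similar to the document."""
    document = bag(" ".join(sentences))
    scored = [(cosine(bag(s), document), i) for i, s in enumerate(sentences)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


sentences = [
    "the virus needs an enzyme",
    "the enzyme cuts the virus free",
    "tests are set for later",
]
print(centrality_ranking(sentences))
```

Raw counts are used here for brevity; weighting the vectors by tf.idf would reduce the influence of function words such as "the".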

  33. Beyond word-based sentence extraction • Discourse information • Resolve anaphora, text structure • Use external lexical resources • WordNet, adjective polarity lists, opinion • Using machine learning

  34. The role of discourse structure • Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance. • Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson, 88). • Use the discourse representation in order to determine the most important textual units. • Attempts: • (Ono et al., 94) for Japanese. • (Marcu, 97) for English.

  35. Rhetorical parsing (Marcu, 97) [With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6] [Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]

  36. Rhetorical parsing (3) • [Figure: rhetorical structure tree over units 1 to 10, with relations Elaboration, Example, Antithesis, Concession, Background, Justification, Contrast, Evidence, and Cause] • Summarization = selection of the most important units • 2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6

  37. Argumentative zoning • What is the purpose of the sentence? To communicate • Background • Aim • Basis (related work) • How can we know which sentence serves each aim?

  38. Argumentative zones

  39. Selecting important sentences (relevance) • How well can it be performed by people? • Rather subjective; depends on prior knowledge and interests • Even the same person would select 50% different sentences if she performed the task at different times • Still, judgments can be solicited from several people to mitigate the problem • For each sentence in an article, say whether it is important and interesting enough to be included in a summary

  40. Multi-document summarization • Very useful for presenting and organizing search results • Many results are very similar, and grouping closely related documents helps cover more event facets • Summarizing similarities and differences between documents

  41. Standard Approaches • Salient information = similarities • Pairwise similarity between all sentences • Cluster sentences using similarity score (Themes) • Generate one sentence for each theme • Sentence extraction (one sentence/cluster) • Sentence fusion: intersect sentences within a theme and choose the repeated phrases. Generate sentence from phrases • Salient information = important words • Important words are simply the most frequent in the document set • SumBasic simply chooses sentences with the most frequent words. Conroy expands on this
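The "salient information = similarities" pipeline (pairwise similarity, clustering into themes, one extracted sentence per theme) can be sketched as follows. Everything specific here is an assumption made for illustration: Jaccard word overlap as the similarity score, a greedy single-pass clustering, a threshold of 0.3, and the first sentence of each theme as its representative. The example sentences are toy inputs, not quotations.

```python
import re


def words(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_themes(sentences, threshold=0.3):
    """Greedy clustering: a sentence joins the first theme whose seed
    sentence is similar enough, otherwise it starts a new theme."""
    themes = []  # each theme is a list of sentence indices
    for i, sentence in enumerate(sentences):
        for theme in themes:
            if jaccard(words(sentence), words(sentences[theme[0]])) >= threshold:
                theme.append(i)
                break
        else:
            themes.append([i])
    return themes


def extract_one_per_theme(sentences, threshold=0.3):
    """One sentence per theme, themes with more members first."""
    themes = cluster_themes(sentences, threshold)
    themes.sort(key=lambda theme: (-len(theme), theme[0]))
    return [sentences[theme[0]] for theme in themes]


sentences = [
    "Eighteen decapitated bodies were found in a mass grave",
    "Police found eighteen decapitated bodies in a grave",
    "A parcel bomb injured 17 people in Algiers",
]
print(extract_one_per_theme(sentences))
```

Ordering themes by size reflects the idea that information repeated across documents is salient. Sentence fusion, the alternative named on the slide, would instead generate a new sentence from the phrases shared within a theme and is not attempted here.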

  42. MEAD (Radev et al. 00) • Centroid-based • Based on sentence utility • Topic detection and tracking initiative [Allan et al. 98, Wayne 98]

  43. ARTICLE 18853: ALGIERS, May 20 (AFP) • 1. Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday, adding that two shepherds were murdered earlier this week. 2. Security forces found the mass grave on Wednesday at Chbika, near Djelfa, 275 kilometers (170 miles) south of the capital. 3. It contained the bodies of people killed last year during a wedding ceremony, according to Le Quotidien Liberte. 4. The victims included women, children and old men. 5. Most of them had been decapitated and their heads thrown on a road, reported the Es Sahafa. 6. Another mass grave containing the bodies of around 10 people was discovered recently near Algiers, in the Eucalyptus district. 7. The two shepherds were killed Monday evening by a group of nine armed Islamists near the Moulay Slissen forest. 8. After being injured in a hail of automatic weapons fire, the pair were finished off with machete blows before being decapitated, Le Quotidien d'Oran reported. 9. Seven people, six of them children, were killed and two injured Wednesday by armed Islamists near Medea, 120 kilometers (75 miles) south of Algiers, security forces said. 10. The same day a parcel bomb explosion injured 17 people in Algiers itself. 11. Since early March, violence linked to armed Islamists has claimed more than 500 lives, according to press tallies. • ARTICLE 18854: ALGIERS, May 20 (UPI) • 1. Algerian newspapers have reported that 18 decapitated bodies have been found by authorities in the south of the country. 2. Police found the “decapitated bodies of women, children and old men, with their heads thrown on a road” near the town of Jelfa, 275 kilometers (170 miles) south of the capital Algiers. 3. In another incident on Wednesday, seven people -- including six children -- were killed by terrorists, Algerian security forces said. 4. Extremist Muslim militants were responsible for the slaughter of the seven people in the province of Medea, 120 kilometers (74 miles) south of Algiers. 5. The killers also kidnapped three girls during the same attack, authorities said, and one of the girls was found wounded on a nearby road. 6. Meanwhile, the Algerian daily Le Matin today quoted Interior Minister Abdul Malik Silal as saying that “terrorism has not been eradicated, but the movement of the terrorists has significantly declined.” 7. Algerian violence has claimed the lives of more than 70,000 people since the army cancelled the 1992 general elections that Islamic parties were likely to win. 8. Mainstream Islamic groups, most of which are banned in the country, insist their members are not responsible for the violence against civilians. 9. Some Muslim groups have blamed the army, while others accuse “foreign elements conspiring against Algeria.”

  44. MEAD • INPUT: Cluster of d documents with n sentences (compression rate = r) • OUTPUT: (n * r) sentences from the cluster with the highest values of SCORE • $\mathrm{SCORE}(s) = \sum_i (w_c C_i + w_p P_i + w_f F_i)$
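The selection step can be sketched as below. The slide does not define the three terms; this example assumes that, for sentence i, C_i is a centroid score, P_i a positional score, and F_i an overlap score with the first sentence, and it treats the weighted sum w_c C_i + w_p P_i + w_f F_i as the score of that sentence. The function names, the default weights, and the feature values are hypothetical, and computing the three features themselves is outside the scope of the sketch.

```python
def mead_scores(centroid, positional, first_overlap, w_c=1.0, w_p=1.0, w_f=1.0):
    """Per-sentence score w_c*C_i + w_p*P_i + w_f*F_i."""
    return [w_c * c + w_p * p + w_f * f
            for c, p, f in zip(centroid, positional, first_overlap)]


def mead_select(sentences, centroid, positional, first_overlap, r,
                w_c=1.0, w_p=1.0, w_f=1.0):
    """Keep roughly n * r sentences with the highest scores.

    The selected sentences are returned in their original order.
    """
    scores = mead_scores(centroid, positional, first_overlap, w_c, w_p, w_f)
    k = max(1, round(len(sentences) * r))  # at least one sentence
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]


# Hypothetical feature values for a four-sentence cluster.
sentences = ["s1", "s2", "s3", "s4"]
summary = mead_select(
    sentences,
    centroid=[0.9, 0.1, 0.5, 0.2],
    positional=[1.0, 0.75, 0.5, 0.25],
    first_overlap=[1.0, 0.0, 0.2, 0.0],
    r=0.5,
)
print(summary)
```

Returning the sentences in document order rather than score order is a design choice that tends to keep the extract readable.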

  45. Scientific article summarization • Not only what the article is about, but also how it relates to the work it cites • Determine which approaches are criticized and which are supported • Automatic genre-specific summaries are more useful than original paper abstracts

  46. Other uses • Document indexing for information retrieval • Automatic essay grading, topic identification module

  47. Evaluating summarization: the problem • Which human summary makes a good gold standard? Many summaries are good • At what granularity is the comparison made? • When can we say that two pieces of text match?
