Document Summarization

Vinayak Gagrani, Neeraj Toshniwal, Abhishek Kabra. Guide: Pushpak Bhattacharya.


Presentation Transcript


  1. Document Summarization • Vinayak Gagrani, Neeraj Toshniwal, Abhishek Kabra • Guide: Pushpak Bhattacharya

  2. Outline • Introduction • Single Document Summarization • Multiple Document Summarization • Application • Evaluation • Conclusion

  3. Introduction • What is a summary? • A text produced from one or more texts • It conveys the important information in the original texts and is no longer than half their length • Three important aspects of a summary: • Summaries should be short • Summaries should preserve important information • Summaries may be produced from single or multiple documents

  4. Common terms in the summarization dialect • Extraction • Identifying important sections of the text and reproducing them verbatim • Abstraction • Aims to express the important material in a new way • Fusion • Combining extracted parts coherently • Compression • Aims at throwing out unimportant sections of the text

  5. Single Document Summarization • Early Works • Machine Learning Methods • Naïve-Bayes Methods • Rich Features and Decision Trees • Deep Natural Language Analysis Methods • Lexical Chaining • Rhetorical Structure Theory (RST)

  6. Early Works • Luhn, 1958 • Summarization based on measuring the significance of words from their frequency • A significance factor is derived for each sentence from the number of significant words it contains • Edmundson, 1969 • Incorporated word frequency and positional importance • Also incorporated the presence of cue words and the skeleton of the document
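
The frequency-based idea on this slide can be sketched in a few lines of Python. This is a simplification rather than Luhn's exact procedure: the stop-word list, the `top_k` cutoff that decides which words count as significant, and the score itself (significant words squared, divided by sentence length) are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "it"}

def luhn_scores(sentences, top_k=5):
    """Score each sentence by (significant words)^2 / (sentence length)."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Corpus-wide frequency of every non-stop-word.
    counts = Counter(w for toks in tokenized for w in toks if w not in STOPWORDS)
    # The top_k most frequent words are treated as "significant".
    significant = {w for w, _ in counts.most_common(top_k)}
    scores = []
    for toks in tokenized:
        hits = sum(1 for w in toks if w in significant)
        scores.append(hits * hits / len(toks) if toks else 0.0)
    return scores
```

Sentences with the highest scores would be extracted verbatim to form the summary.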

  7. Naïve Bayes Method • Classifier based on applying Bayes' theorem with a strong independence assumption • $s$: a particular sentence • $S$: the set of sentences that make up the summary • $F_1, \ldots, F_k$: the features • Assuming independence of features: $P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid s \in S) \, P(s \in S)}{\prod_{i=1}^{k} P(F_i)}$ • Evaluation is done by measuring how well the extract matches the human-extracted document summary
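
A small Python sketch of the naïve Bayes score: the prior P(s ∈ S) times the product of per-feature likelihoods P(F_i | s ∈ S), divided by the product of the feature marginals P(F_i). The function name and the table layout (`likelihood`, `evidence`) are hypothetical; in practice the probabilities would be estimated by counting over a training set of documents with human extracts.

```python
from math import prod

def naive_bayes_score(features, prior, likelihood, evidence):
    """Return P(s in S | F_1..F_k) under the feature-independence assumption.

    features:      observed value of each feature for the sentence
    prior:         P(s in S)
    likelihood[i]: dict mapping a value v to P(F_i = v | s in S)
    evidence[i]:   dict mapping a value v to P(F_i = v)
    """
    numerator = prior * prod(likelihood[i][v] for i, v in enumerate(features))
    denominator = prod(evidence[i][v] for i, v in enumerate(features))
    return numerator / denominator
```

Sentences are then ranked by this score and the top ones are extracted. Because the independence assumption is only an approximation, the value is best used for ranking rather than read as a calibrated probability.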

  8. Naïve Bayes Method • Term frequency-inverse document frequency (TF-IDF) • Increases proportionally to the number of times a word appears in the document • Offset by the frequency of the word in the corpus • Takes into account that certain words are more common than others, e.g. “the”, “is” • $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$ • $|D|$: total number of documents in the corpus • $|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears, i.e. $\mathrm{tf}(t, d) \neq 0$
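
As a worked illustration, here is a minimal Python sketch of idf as the log of the corpus size over the number of documents containing the term, combined with a raw term count as tf. Representing documents as token lists and using the raw count for tf are assumptions made for the example.

```python
import math

def idf(term, corpus):
    """log(|D| / number of documents in which the term appears)."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    """Raw term count in the document, scaled by the corpus-level idf."""
    return doc.count(term) * idf(term, corpus)

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"]]
```

With this toy corpus, "the" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any sentence score, which is the intended effect for very common words.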

  9. Rich Features and Decision Trees • Weighing sentences based on their position • Arises from the idea that texts generally follow a predictable discourse structure • Sentence position yield was calculated for each sentence position against the topic keywords • Sentence positions were then ranked by average yield to produce the Optimal Position Policy (OPP) for topic positions in the genre • Later, the sentence extraction problem was modeled using decision trees • This dropped the assumption that features are independent

  10. Deep Natural Language Analysis Methods • Techniques aimed at modeling the text’s discourse structure • Use of heuristics to create document extracts • Lexical Chaining • independent of the grammatical structure of the text • list of words that captures a portion of the cohesive structure of the text • sequence of related words in the text, spanning short or long distances • technique used to identify the central theme of a document

  11. Forms of Cohesion • Ellipsis • Words are omitted when the phrase needs to be repeated • Example: • A: Where are you going? • B: To town. • Substitution • Word is not omitted but replaced by another • Example: • A: Which ice-cream would you like? • B: I would like the pink one.

  12. Forms of Cohesion • Conjunction • Relationship between two clauses • Examples: “and”, “then”, “however” • Repetition • Mentioning the same word again • Reference • Anaphoric reference • Refers to someone or something that has been previously identified • Cataphoric reference • Forward referencing • Example: Here he comes... It’s Brad Pitt

  13. Lexical Chaining • Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it. • Steps involved in lexical chaining: a) Select a set of candidate words. b) For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chain. c) If one is found, insert the word into the chain and update the chain accordingly.
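
The three steps can be sketched as a greedy loop in Python. The `related` predicate is a stand-in for a real relatedness criterion, and starting a new chain when no existing chain fits is an assumption of this sketch; the hand-written `RELATED_PAIRS` table is purely illustrative.

```python
def build_chains(candidates, related):
    """Greedily assign each candidate word to the first chain it is related to."""
    chains = []
    for word in candidates:
        for chain in chains:
            # Step b: look for a chain with at least one related member.
            if any(related(word, member) for member in chain):
                chain.append(word)  # Step c: insert the word and update the chain.
                break
        else:
            chains.append([word])  # Assumption: no chain fits, so start a new one.
    return chains

# Toy relatedness table for the mud pie example (hypothetical).
RELATED_PAIRS = {frozenset(("mud pie", "dessert")), frozenset(("mud pie", "chocolate"))}

def toy_related(a, b):
    return a == b or frozenset((a, b)) in RELATED_PAIRS

candidates = ["John", "mud pie", "dessert", "mud pie", "chocolate", "John"]
chains = build_chains(candidates, toy_related)
```

On the example sentences this yields one chain about John and a longer chain about the mud pie, which matches the intuition that the dessert is the central theme.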

  14. Lexical Chaining • Relatedness measure: WordNet distance • Weights are assigned to chains based on their length and homogeneity • The strength of a lexical chain is determined by taking into consideration the distribution of its elements throughout the text • Chain strength corresponds to the significance of the textual context it embodies • Provides a basis for identifying the topical units in a document, which are of great importance in document summarization

  15. Rhetorical Structure Theory (RST) • Relates two non-overlapping text spans: the nucleus and the satellite • The nucleus expresses what is more essential to the writer's purpose than the satellite • Example: a claim followed by evidence for the claim; RST posits an "Evidence" relation between the two spans • The claim is more essential to the text than the particular evidence • The claim span is a nucleus and the evidence span is a satellite • The nucleus is independent of the satellite, but not vice versa

  16. Rhetorical Structure Theory(RST)

  17. Multiple Document Summarization • Need and Encouragement • Extraction of a single summary from multiple documents started in the mid 1990s • Most applications are on news articles • Google News (news.google.com) • Columbia NewsBlaster (newsblaster.cs.columbia.edu) • News in Essence (NewsInEssence.com) • Multiple sources of information, which are: • supplementary to each other • overlapping in content • even contradictory at times

  18. Early Work • Extended template-driven message understanding systems • Abstractive systems rely heavily on internal NLP tools • Earlier considered to require knowledge of • Language interpretation • Generation • Extractive techniques have also been applied, using similarity measures between sentences • Identify common themes through clustering, then select one sentence to represent each cluster • Or generate a composite sentence from each cluster • Approaches differ in what the final goal is • MEAD: extraction-based, works on general domains • SUMMONS: builds a briefing highlighting differences and updates across news reports

  19. Abstraction and Information Fusion • SUMMONS is the first example of a multi-document summarization system • Considers events in a narrow domain • News articles about terrorism • Produces a briefing that merges relevant information about each event and its evolution over time • Reads a database built by a template-based message understanding system • Concatenation of two systems: Content Planner and Linguistic Generator

  20. SUMMONS - processing the text (Content Planner) • Content Planner: selects the information to include in the summary by combining the input templates • Uses summary operators, a set of heuristics that perform operations such as: • change of perspective, contradiction, refinement • Linguistic Generator: selects the right words to express the information in grammatical and coherent text • Uses connective phrases to synthesize the summary, adapting language generation tools such as FUF/SURGE

  21. Theme-based approach - McKeown et al., Barzilay et al. • Themes: sets of similar text units (paragraphs), found by treating the task as a clustering problem • Each text unit is mapped to a vector of features, including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs • For each pair of paragraphs, a vector is computed that represents matches on the different features • Decision rules learnt from data classify each pair as similar or dissimilar; an algorithm then places the most related paragraphs in the same theme • Information Fusion: decides which sentences of a theme should be included in the final summary

  22. Information Fusion • The algorithm compares and intersects the predicate-argument structures of the phrases within each theme to find which are repeated often enough to be included in the summary • Sentences are parsed with Collins' statistical parser and converted into dependency trees, which capture predicate-argument structure and identify functional roles • The comparison algorithm traverses the trees recursively, adding identical nodes to the output tree • Once full phrases are found, they are marked for inclusion in the summary • Once the summary content is decided, a grammatical text is generated using the FUF/SURGE language generation system

  23. Dependency Tree • “McVeigh, 27, was charged with the bombing”

  24. Topic-Driven Summarization • MMR: Maximal Marginal Relevance, introduced by Carbonell and Goldstein • Rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures • $Q$: query or user profile, $R$: ranked list of documents, $S$: already selected documents • Documents are selected one at a time and added to $S$ • For each document $D_i$ in $R \setminus S$: $\mathrm{MR}(D_i) = a \cdot \mathrm{Sim}_1(D_i, Q) - (1 - a) \cdot \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j)$, where $a$ lies in $[0, 1]$ • The document with the maximum $\mathrm{MR}(D_i)$ is selected, and this repeats until a maximum number of documents or a score threshold is reached • $a$ controls the relative importance of relevance and redundancy • $\mathrm{Sim}_1$ and $\mathrm{Sim}_2$ are similarity measures (e.g. cosine similarity)
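
The selection loop can be sketched in Python as follows. Each candidate is scored by a times its similarity to the query, minus (1 - a) times its maximum similarity to anything already selected. Representing documents as bag-of-words `Counter` objects, using cosine similarity for both measures, and stopping only on a maximum count (`max_items`) are assumptions of this sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def marginal_relevance(i, docs, query, selected, a):
    """a * Sim1(D_i, Q) - (1 - a) * max over selected D_j of Sim2(D_i, D_j)."""
    redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
    return a * cosine(docs[i], query) - (1 - a) * redundancy

def mmr_select(query, docs, a=0.7, max_items=3):
    """Return indices of documents chosen greedily by marginal relevance."""
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < max_items:
        best = max(remaining, key=lambda i: marginal_relevance(i, docs, query, selected, a))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `a=1.0` the loop reduces to plain relevance ranking; lowering `a` makes a near-duplicate of an already selected document lose out to a less relevant but novel one.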

  25. Graph Spreading Activation • Content is represented as entities and relations, which form the nodes and edges of a graph • Rather than extracting sentences, the method detects salient regions of the graph • Topic-driven: the topic is denoted by entry nodes in the graph • Graph: • Each node is a single occurrence of a word • Different kinds of links: Adjacency links, Same links, Alpha links, Phrase links, Name links and Coref links

  26. Graph Spreading Activation • Topic nodes are identified through stem comparison and marked as entry nodes • Spreading activation: the search for semantically related text is propagated from these entry nodes to other nodes of the graph • The weight of a neighboring node depends on the links traveled and is an exponentially decaying function of the distance • For a pair of document graphs: identify common nodes and difference nodes, then highlight sentences with higher common and difference scores • The user can specify the maximal number of sentences to control the output
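
A minimal Python sketch of spreading activation with exponential decay. It assumes an unweighted adjacency-list graph and ignores link types, so the weight of a node depends only on its hop distance from the nearest entry node; the `decay` and `max_hops` parameters are illustrative, not values from the slides.

```python
from collections import deque

def spread_activation(graph, entry_nodes, decay=0.5, max_hops=3):
    """Breadth-first propagation: activation = decay ** (hops from an entry node)."""
    activation = {node: 1.0 for node in entry_nodes}
    queue = deque((node, 0) for node in entry_nodes)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # Do not propagate beyond the hop limit.
        for neighbor in graph.get(node, ()):
            # Breadth-first order guarantees the first visit is the shortest path.
            if neighbor not in activation:
                activation[neighbor] = decay ** (hops + 1)
                queue.append((neighbor, hops + 1))
    return activation

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
weights = spread_activation(graph, ["a"])
```

Nodes with high activation mark the salient region around the topic; sentences covering many such nodes would be the ones highlighted.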

  27. Centroid-based Summarization • Does not use any language generation module; easily scalable and domain-independent • Topic detection: group together news articles that describe the same event • An agglomerative clustering algorithm operates on TF-IDF vector representations, successively adding documents to clusters and recomputing the centroids according to $c_j = \frac{\sum_{d \in C_j} d}{|C_j|}$ • $c_j$ is the centroid of the $j$-th cluster, $C_j$ the set of documents that belong to that cluster • Centroids can thus be considered pseudo-documents that include those words whose TF-IDF scores are above a threshold in their cluster
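
A short Python sketch of the centroid computation, assuming the centroid is the arithmetic mean of the TF-IDF vectors in the cluster and that each document vector is a sparse dict from term to weight. The `prune` helper and its threshold are illustrative names for the step that keeps only high-scoring words.

```python
from collections import defaultdict

def centroid(cluster):
    """Mean of the sparse TF-IDF vectors (dicts) of the documents in a cluster."""
    if not cluster:
        raise ValueError("cluster must contain at least one document")
    sums = defaultdict(float)
    for vector in cluster:
        for term, weight in vector.items():
            sums[term] += weight
    return {term: total / len(cluster) for term, total in sums.items()}

def prune(centroid_vector, threshold):
    """Keep only the terms whose centroid weight is above the threshold."""
    return {t: w for t, w in centroid_vector.items() if w > threshold}

cluster = [{"bomb": 2.0, "city": 1.0}, {"bomb": 4.0}]
pseudo_document = prune(centroid(cluster), threshold=1.0)
```

Recomputing the centroid after each document joins a cluster keeps it representative of the whole cluster, and pruning gives the pseudo-document used in the sentence-scoring stage.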

  28. Centroid-based Summarization • Second stage: identify sentences that are central to the topic of the entire cluster • Two metrics similar to MMR (but not query-dependent) are defined by Radev et al., 2000 • Cluster-based relative utility (CBRU): how relevant a particular sentence is to the general topic of the cluster • Cross-sentence informational subsumption (CSIS): a measure of redundancy among sentences • Given a cluster segmented into $n$ sentences and a compression rate $R$, $nR$ sentences are selected in order of appearance in the chronologically arranged documents • The final score of each sentence is the sum of three scores minus a redundancy penalty ($R_s$) for sentences that overlap highly ranked sentences • Centroid value ($C_i$): sum of the centroid values of all the words in the sentence • Positional value ($P_i$): makes leading sentences more important • First-sentence overlap ($F_i$): inner product of the word occurrence vector of sentence $i$ and that of the first sentence of the document
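
The sentence score can be sketched in Python as the centroid value plus the positional value plus the first-sentence overlap, minus a redundancy penalty. Several details are assumptions made for illustration: the three scores are summed without weights, the positional value is a linear decay `(n - i) / n`, and the redundancy penalties are supplied by the caller rather than computed.

```python
from collections import Counter

def sentence_scores(sentences, centroid, penalties=None):
    """Score the tokenized sentences of one document: C_i + P_i + F_i - R_s."""
    if not sentences:
        return []
    n = len(sentences)
    penalties = penalties if penalties is not None else [0.0] * n
    first = Counter(sentences[0])
    scores = []
    for i, tokens in enumerate(sentences):
        # C_i: sum of the centroid values of all words in the sentence.
        c_i = sum(centroid.get(w, 0.0) for w in tokens)
        # P_i: assumed linear decay so that leading sentences score higher.
        p_i = (n - i) / n
        # F_i: inner product of word-count vectors with the first sentence.
        f_i = sum(count * first[w] for w, count in Counter(tokens).items())
        scores.append(c_i + p_i + f_i - penalties[i])
    return scores

doc = [["bomb", "city"], ["rain"], ["bomb"]]
scores = sentence_scores(doc, {"bomb": 3.0, "city": 0.5})
```

The highest-scoring sentences, up to the number allowed by the compression rate, would then be emitted in document order.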

  29. Application • Google News: • News aggregator that selects the most up-to-date information (within the past 30 days) from thousands of publications using an automatic aggregation algorithm • Different versions are available for more than 60 regions in 28 languages • Ultimate Research Assistant: • Performs text mining on Internet search results • Makes it easier for the user to perform online research by organizing the output • The user types the name of a topic, and it searches the web for highly relevant resources and organizes the search results

  30. Application • Shablast: • Universal search engine • Produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords • iResearch Reporter: • Commercial text extraction and text summarization system • Produces categorized, easily readable natural language summary reports covering multiple documents retrieved by entering the user's query into the Google search engine

  31. Application

  32. Evaluation • A difficult task • There is no standard human or automatic evaluation metric • This makes it difficult to compare different systems and establish a baseline • Manual evaluation is not feasible at scale • Need for an evaluation metric with high correlation with human scores • Human and automatic evaluation: • Comparison of automatically generated summaries with manually written "ideal" summaries • Decomposition of the text into sentences • Each system unit (SU) that shares content with a model unit (MU) of the ideal summary is given a rating between 1 and 4

  33. Evaluation • ROUGE • based only on content overlap • can determine if the same general concepts are discussed between an automatic summary and a reference summary • cannot determine if the result is coherent or the sentences flow together in a sensible manner • Better in case of single document summarization • Information-theoretic Evaluation of Summaries • Central idea is to use a divergence measure between a pair of probability distributions • First distribution is derived from automatic summary • Second from a set of reference summaries • Suits both the single document and multi document summarization scenarios
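
Both ideas on this slide can be illustrated with a short Python sketch. `rouge_n_recall` is a simplified n-gram recall in the spirit of ROUGE-N, not the official toolkit, and `js_divergence` uses Jensen-Shannon divergence as one possible choice of divergence measure between the unigram distributions of two summaries; tokenization by whitespace is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def unigram_dist(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
shuffled = "mat the on sat cat the".split()
```

Here `rouge_n_recall(shuffled, reference, n=1)` is a perfect 1.0 even though the shuffled text is unreadable, which illustrates why overlap-based scores cannot judge coherence.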

  34. Conclusion • Efficient and accurate summarization systems are needed because of the enormous rate of information growth • A lot of research is still going on in this field, especially on evaluation techniques • Multi-document summarization is more widely used than single-document summarization • Extractive techniques are usually employed rather than abstractive ones, as they are easier to apply and have produced satisfactory results

  35. References • A survey on Automatic Summarization – Dipanjan Das and Andre F.T. Martins (http://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf) • Wikipedia • Relevance of cluster size in MMR Based summarizer (http://www.cs.cmu.edu/~madhavi/publications/Ganapathiraju_11-742Report.pdf)
