
Text Summarization

Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)


Presentation Transcript


  1. Text Summarization. Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)

  2. Introduction
     • Summary: a brief but accurate representation of the contents of a document.
     • Goal: take an information source, extract its most important content, and present it to the user in condensed form and in a manner sensitive to the user's needs.
     • Compression: the ratio of the length of the summary to the length of the source, i.e. how much text is presented.

  3. MSWord AutoSummarize

  4. Presentation Outline • Motivation • Different Genres • Simple Statistical Techniques • Degree Centrality • Lex Rank • Lexical/Co-reference Chains • Rhetorical Structure Theory • WordNet Based Methods • DUC/TAC

  5. Motivation
     • Abstracts for scientific and other articles
     • News summarization (mostly multi-document summarization)
     • Classification of articles and other written data
     • Web pages for search engines
     • Web access from PDAs and cell phones
     • Question answering and data gathering

  6. Genres
     • Indicative vs. informative: used for quick categorization vs. content processing.
     • Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
     • Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.
     • Background vs. just-the-news: assumes the reader's prior knowledge is poor vs. up-to-date.
     • Single-document vs. multi-document source: based on one text vs. fuses together many texts.

  7. Statistical scoring
     Scoring techniques:
     • Word frequencies throughout the text (Luhn58)
     • Position in the text (Edmundson69)
     • Title method (Edmundson69)
     • Cue phrases in sentences (Edmundson69)

  8. Luhn58
     • Important words occur fairly frequently
     • Earliest work in the field
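
The word-frequency idea can be sketched in a few lines. This is a simplified illustration rather than Luhn's exact significance factor: it scores each sentence by the average document frequency of its content words. The stop-word list, the naive sentence splitter, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, k=1):
    """Return the k sentences with the highest average word frequency, in source order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

text = ("Mars is cold. Mars has a thin atmosphere. "
        "Dust storms on Mars are common. I like tea.")
print(frequency_summary(text, k=2))
```

Averaging rather than summing keeps long sentences from winning on length alone; that choice is part of the sketch, not of the original method.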

  9. Statistical Approaches (contd.)
     • Degree Centrality
     • LexRank
     • Continuous LexRank

  10. Degree Centrality
      Problem formulation:
      • Represent each sentence by a vector
      • Denote each sentence as the node of a graph
      • Cosine similarity determines the edges between nodes

  11. Degree Centrality
      Since we are interested only in significant similarities, we can eliminate low values in the cosine similarity matrix by defining a threshold.

  12. Degree Centrality
      • Compute the degree of each sentence
      • Pick the nodes (sentences) with high degrees
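
The three degree-centrality steps (vectors, thresholded cosine edges, node degrees) can be sketched as follows. The sketch assumes raw word counts as the sentence vectors, whereas a fuller system would typically use tf-idf weights; the threshold value and function names are also assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Sentence vector as raw word counts (an assumption; tf-idf is more usual)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def degree_centrality(sentences, threshold=0.3):
    """Count, for each sentence, the other sentences whose similarity reaches the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    degrees = [0] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                degrees[i] += 1
                degrees[j] += 1
    return degrees

sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "the dog sat on the log",
    "stock prices fell sharply",
]
print(degree_centrality(sentences, threshold=0.7))  # the first sentence has the highest degree
```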

  13. Degree Centrality Disadvantage in Degree Centrality approach

  14. LexRank
      The centrality vector p gives the LexRank of each sentence (similar to PageRank), defined by: [equation missing from transcript]

  15. What Should B Satisfy?
      • Stochastic matrix and Markov chain property
      • Irreducible
      • Aperiodic

  16. Perron-Frobenius Theorem
      An irreducible and aperiodic Markov chain is guaranteed to converge to a stationary distribution.

  17. Reducibility

  18. Aperiodicity

  19. LexRank
      • B is a stochastic matrix
      • Is it an irreducible and aperiodic matrix?
      • Damping (Page et al. 1998)

  20. Matrix Form of p with Damping
      Solve for p using the power method.
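
A minimal sketch of damped LexRank solved by the power method, assuming the formulation of Erkan and Radev (2004): each score is d/N plus (1 - d) times the score flowing in from neighbouring sentences, where B is the row-normalised adjacency matrix. The function name, the damping value 0.15, and the toy graph are assumptions for illustration.

```python
def lexrank(adjacency, damping=0.15, tol=1e-9, max_iter=1000):
    """Power iteration on the damped random walk over a 0/1 sentence graph."""
    n = len(adjacency)
    # Row-normalise: B[i][j] is the probability of stepping from sentence i to j.
    B = []
    for row in adjacency:
        total = sum(row)
        B.append([x / total for x in row] if total else [1.0 / n] * n)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new_p = [
            damping / n + (1 - damping) * sum(B[i][j] * p[i] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Toy thresholded similarity graph; every sentence is similar to itself.
adjacency = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
scores = lexrank(adjacency)
print(scores)  # sentence 0, linked to every other sentence, scores highest
```

The uniform jump term makes the chain irreducible and aperiodic, which is what guarantees that the iteration converges to a single stationary vector.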

  21. Continuous LexRank
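
The slide body is not in the transcript. As described by Erkan and Radev (2004), continuous LexRank keeps the cosine values as edge weights instead of thresholding them to 0/1. A rough sketch under that reading follows; the similarity values are made up for illustration, and the diagonal is assumed to be 1 so that no row sums to zero.

```python
def continuous_lexrank(similarity, damping=0.15, iterations=200):
    """Damped power iteration where each step is weighted by cosine similarity."""
    n = len(similarity)
    row_sums = [sum(row) for row in similarity]
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [
            damping / n + (1 - damping) * sum(
                similarity[v][u] / row_sums[v] * p[v]
                for v in range(n) if row_sums[v]
            )
            for u in range(n)
        ]
    return p

# Made-up symmetric cosine similarities between four sentences.
similarity = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.4, 0.0],
    [0.6, 0.4, 1.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
]
weights = continuous_lexrank(similarity)
print(weights)
```

Compared with the thresholded graph, no similarity information is discarded, at the cost of a dense matrix.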

  22. Linguistic/Semantic Methods
      • Co-reference/Lexical Chains
      • Rhetorical Analysis

  23. Co-reference/Lexical Chains
      • Assumption/observation: important parts of a text are more closely related under a semantic interpretation
      • Co-reference/lexical chains link such parts (object-action, part-of relation, semantically related)
      • Important sentences are traversed by a larger number of such chains

  24. Co-reference/Lexical Chains
      Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
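
To make the "traversed by more chains" idea concrete, the sketch below scores each sentence of the example by the number of chains that have a member in it. The chains are hand-built for illustration (an assumption); a real system would derive them from WordNet relations or a co-reference resolver.

```python
import re

sentences = [
    "Mr. Kenny is the person that invented the anesthetic machine which uses "
    "micro-computers to control the rate at which an anesthetic is pumped into the blood.",
    "Such machines are nothing new.",
    "But his device uses two micro-computers to achieve much closer monitoring of "
    "the pump feeding the anesthetic into the patient.",
]

# Hypothetical hand-built chains, for illustration only.
chains = {
    "machine": {"machine", "machines", "device", "pump", "micro-computers"},
    "person": {"mr", "kenny", "person", "his"},
    "anesthetic": {"anesthetic"},
}

def tokens(sentence):
    """Lowercased word set; hyphenated words are kept as single tokens."""
    return set(re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower()))

def chain_scores(sentences, chains):
    """For each sentence, count the chains with at least one member in it."""
    return [
        sum(1 for members in chains.values() if tokens(s) & members)
        for s in sentences
    ]

print(chain_scores(sentences, chains))
```

Under these chains the first and third sentences are each crossed by all three chains, while the second is crossed by only one.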

  25. Rhetorical Structure Theory (Mann & Thompson 88)
      • Rhetorical relation: holds between two non-overlapping text snippets
      • Nucleus: the core idea, the writer's purpose
      • Satellite: referred to in the context of the nucleus, for justifying, evidencing, contradicting, etc.

  26. Rhetorical Structure Theory
      • The nucleus of a rhetorical relation is comprehensible independently of the satellite, but not vice versa
      • Not all rhetorical relations are nucleus-satellite relations; Contrast is a multinuclear relation
      • Example (evidence): [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one's life:] [we know that 3,000 teens start smoking each day.]

  27. Rhetorical Structure Theory
      Rhetorical parsing:
      • Breaks the text into elementary units
      • Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations
      • Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text

  28. [Figure: RS-tree of the example text below. Relation labels appearing in the tree: Elaboration, Example, Background, Justification, Concession, Antithesis, Contrast, Evidence, Cause.]
      (1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
      (2) Mars experiences frigid weather conditions.
      (3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.
      (4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
      (5) but any liquid water formed in this way would evaporate almost instantly
      (6) because of the low atmospheric pressure.
      (7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
      (8) most Martian weather involves blowing dust and carbon dioxide.
      (9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
      (10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.

  29. RST Based Summarization
      • Multiple RS-trees
      • A built RS-tree captures the relations in the text and can be used for high-quality summarization
      • Pick the K nodes nearest to the root
      • Disadvantages
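
One possible reading of "nearest to the root", in the spirit of the promotion sets used in Marcu's work, is sketched below: a unit is promoted upward through every span in which it sits on the nucleus side, and its score is the shallowest depth it reaches. The nested-list tree encoding, the toy tree, and the function names are all assumptions for illustration.

```python
def promotion_depths(tree):
    """Map each elementary unit to the shallowest depth at which it is promoted.

    A leaf is an int (unit id). An internal node is a list of (role, subtree)
    pairs, where role is "N" (nucleus) or "S" (satellite).
    """
    best = {}

    def visit(node, depth):
        if isinstance(node, int):
            promoted = {node}
        else:
            promoted = set()
            for role, child in node:
                child_promoted = visit(child, depth + 1)
                if role == "N":          # only nuclei are promoted upward
                    promoted |= child_promoted
        for unit in promoted:
            best[unit] = min(best.get(unit, depth), depth)
        return promoted

    visit(tree, 0)
    return best

def top_k_units(tree, k):
    """Return the k units promoted closest to the root, in text order."""
    depths = promotion_depths(tree)
    ranked = sorted(depths, key=lambda u: (depths[u], u))
    return sorted(ranked[:k])

# Hypothetical four-unit tree: unit 1 is the nucleus of the nucleus span.
toy_tree = [("N", [("N", 1), ("S", 2)]), ("S", [("S", 3), ("N", 4)])]
print(top_k_units(toy_tree, 2))
```

In the toy tree, unit 1 reaches the root and unit 4 reaches depth 1, so they form the two-unit summary.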

  30. WordNet-Based Approach for Summarization
      • Preprocessing of text
      • Constructing sub-graph from WordNet
      • Synset ranking
      • Sentence selection
      • Principal Component Analysis

  31. Preprocessing
      • Break text into sentences
      • Apply POS tagging
      • Identify collocations in the text
      • Remove the stop words
      The order of these steps is important.

  32. Constructing sub-graph from WordNet
      • Mark all the words and collocations in the WordNet graph that are present in the text
      • Traverse the generalization edges up to a fixed depth, and mark the synsets you visit
      • Construct a graph containing only the marked synsets
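
The marking-and-traversal steps can be sketched on a toy graph. The `HYPERNYMS` dictionary below is hypothetical data standing in for WordNet's generalization edges, and plain words stand in for synsets; the function name and depth value are likewise assumptions.

```python
# Hypothetical generalization edges (child -> parents), standing in for WordNet.
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "mammal": ["animal"],
}

def build_subgraph(words, hypernyms, max_depth):
    """Mark text words found in the graph, climb generalization edges up to
    max_depth levels, and return the marked nodes with the edges among them."""
    nodes = set(hypernyms) | {p for parents in hypernyms.values() for p in parents}
    frontier = {w for w in words if w in nodes}
    marked = set(frontier)
    for _ in range(max_depth):
        frontier = {p for node in frontier for p in hypernyms.get(node, [])} - marked
        marked |= frontier
    edges = {
        (child, parent)
        for child in marked
        for parent in hypernyms.get(child, [])
        if parent in marked
    }
    return marked, edges

marked, edges = build_subgraph(["the", "dog", "chased", "cat"], HYPERNYMS, max_depth=2)
print(sorted(marked))
```

With a depth of 2 the traversal stops at "carnivore", so "mammal" and "animal" stay out of the sub-graph.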

  33. Synset Ranking
      • Rank synsets based on their relevance to the text
      • Construct a rank vector R with one entry per node of the graph, each initialized to 1/√n, where n is the number of nodes in the graph
      • Create an authority matrix A with A(i,j) = 1/num_of_predecessors(j) if j is a child of i
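
The initial rank vector and authority matrix can be built as below. This sketch covers only the setup stated on this slide. It assumes one reading of the terms: "j is a child of i" means i lists j as a child, and num_of_predecessors(j) counts the nodes that do so. The toy graph and variable names are hypothetical.

```python
import math

# Hypothetical toy graph: parent synset -> child synsets.
children = {
    "animal": ["carnivore"],
    "carnivore": ["canine", "feline"],
    "canine": ["dog"],
    "feline": ["cat"],
    "pet": ["dog", "cat"],
}

nodes = sorted(set(children) | {c for cs in children.values() for c in cs})
index = {name: k for k, name in enumerate(nodes)}
n = len(nodes)

# Every entry of R starts at 1/sqrt(n), so R has Euclidean norm 1.
R = [1.0 / math.sqrt(n)] * n

# num_predecessors[j]: number of nodes that list j as a child (assumed reading).
num_predecessors = {name: 0 for name in nodes}
for cs in children.values():
    for c in cs:
        num_predecessors[c] += 1

# A[i][j] = 1 / num_predecessors(j) when j is a child of i, else 0.
A = [[0.0] * n for _ in range(n)]
for parent, cs in children.items():
    for c in cs:
        A[index[parent]][index[c]] = 1.0 / num_predecessors[c]

print(A[index["pet"]][index["dog"]])  # "dog" has two predecessors
```

Each column of A belonging to a node with predecessors sums to 1, so a child's weight is split evenly among the nodes above it.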

  34. Synset Ranking
      • Update the R vector iteratively [update formula missing from transcript]
      • A higher value implies a better rank and higher relevance

  35. Sentence Selection
      • Construct a matrix M with m rows and n columns, where m is the number of sentences and n is the number of nodes
      • For each sentence S_i:
        • Traverse graph G, starting with the words present in S_i and following generalization edges
        • Find the set of reachable synsets, SY_i
        • For each sy_ij ∈ SY_i, set M[S_i][sy_ij] to the rank of sy_ij calculated in the previous step
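
A small sketch of filling the sentence-by-synset matrix M. The generalization edges and rank values below are made up for illustration, words stand in for synsets, and whitespace tokenization replaces the real preprocessing; all names are assumptions.

```python
# Hypothetical generalization edges and made-up synset ranks.
HYPERNYMS = {"dog": ["canine"], "canine": ["carnivore"],
             "cat": ["feline"], "feline": ["carnivore"]}
RANK = {"dog": 0.2, "cat": 0.2, "canine": 0.3, "feline": 0.3, "carnivore": 0.6}

def reachable_synsets(words, hypernyms, rank):
    """Synsets reachable from the sentence's words along generalization edges."""
    seen = set()
    stack = [w for w in words if w in rank]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(h for h in hypernyms.get(node, []) if h in rank)
    return seen

def build_matrix(sentences, hypernyms, rank):
    """Return column labels and M, where M[i][j] is the rank of synset j
    if it is reachable from sentence i, and 0.0 otherwise."""
    columns = sorted(rank)
    M = []
    for sentence in sentences:
        reach = reachable_synsets(sentence.lower().split(), hypernyms, rank)
        M.append([rank[c] if c in reach else 0.0 for c in columns])
    return columns, M

columns, M = build_matrix(["the dog barked", "the cat slept", "it rained"], HYPERNYMS, RANK)
print(columns)
print(M)
```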

  36. Principal Component Analysis
      • Apply PCA to matrix M to obtain a set of principal components (eigenvectors)
      • The eigenvalue of each eigenvector is a measure of the relevance of that eigenvector to the meaning
      • Sort the eigenvectors according to their eigenvalues
      • For each eigenvector, find its projection on each sentence

  37. Principal Component Analysis
      • Select the top $n_{\text{numselect}}$ sentences for each eigenvector
      • $n_{\text{numselect}}$ is proportional to the eigenvalue of the eigenvector
      • $n_{\text{numselect}} = \lambda_i / \sum_j \lambda_j$, where $\lambda_i$ is the eigenvalue corresponding to eigenvector $i$
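
The selection step can be sketched as below, under several assumptions that are not on the slides: the fraction λ_i / Σ_j λ_j is scaled by a total summary budget and rounded, sentences are ranked by the absolute value of their projection (the sign of an eigenvector is arbitrary), and a sentence already chosen is skipped. The eigenvalues and directions are placeholders, not computed from M; in practice they would come from an eigen-solver applied to the covariance of M.

```python
def allocate(eigenvalues, budget):
    """Share a sentence budget among components in proportion to their eigenvalues."""
    total = sum(eigenvalues)
    return [round(budget * lam / total) for lam in eigenvalues]

def select_sentences(M, eigenvectors, counts):
    """For each component, take its allotted number of not-yet-chosen sentences
    with the largest absolute projection onto that component."""
    chosen = []
    for vec, count in zip(eigenvectors, counts):
        projection = [abs(sum(m * v for m, v in zip(row, vec))) for row in M]
        order = sorted(range(len(M)), key=lambda i: projection[i], reverse=True)
        for i in order:
            if count == 0:
                break
            if i not in chosen:
                chosen.append(i)
                count -= 1
    return sorted(chosen)

# Placeholder inputs for illustration only.
M = [[3, 0, 0], [1, 2, 0], [0, 0, 1], [2, 1, 0], [0, 0.5, 0.2]]
eigenvalues = [5.0, 3.0, 2.0]
eigenvectors = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

counts = allocate(eigenvalues, budget=4)
print(counts)
print(select_sentences(M, eigenvectors, counts))
```

Rounding can make the counts drift from the budget by a sentence or so for other inputs; a largest-remainder scheme would avoid that.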

  38. Document Understanding Conference (DUC)
      • Text Analysis Conference (TAC)
      • Interest and activity aimed at building powerful multi-purpose information systems
      • Evaluation results of various summarization techniques
      • www-nlpir.nist.gov/projects/duc/data.html

  39. Human Summary of Our Presentation :)
      • What is text summarization?
      • Why text summarization?
      • Methods of summarization: LexRank, lexical chains, Rhetorical Structure Theory, WordNet-based

  40. Challenges ahead
      • Ensuring text coherency
      • Sentences may have dangling anaphors
      • Summarizing non-textual data
      • Handling multiple sources effectively
      • High reduction rates are needed
      • Achieving human-quality summarization!

  41. References
      • Erkan, G. and D. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.
      • Barzilay, R. and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10–17. Madrid, Spain.
      • Mann, W.C. and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243–281. Also available as USC/Information Sciences Institute Research Report RR-87-190.

  42. References
      • Baldwin, B. and T. Morton. 1998. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds). TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.
      • Marcu, D. 1998. Improving Summarization Through Rhetorical Parsing Tuning. Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.
      • Ramakrishnan and Bhattacharya. 2003. Text representation with WordNet synsets. Eighth International Conference on Applications of Natural Language to Information Systems (NLDB2003).

  43. References
      • Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, Ramakrishnan. 2004. Generic Text Summarization using WordNet.
      • Inderjeet Mani and Mark T. Maybury (eds). 1999. Advances in Automatic Text Summarization. MIT Press. ISBN 0-262-13359-8.
      • www.wikipedia.com

  44. Thank You
