
Document Summarization

Madhavi Ganapathiraju, Graduate Student, Language Technologies Institute, Carnegie Mellon University. RR-NB Seminar Series, October 17, 2003.


Presentation Transcript


  1. Document Summarization • Madhavi Ganapathiraju, Graduate Student, Language Technologies Institute, Carnegie Mellon University • RR-NB Seminar Series, October 17, 2003

  2. In this talk… • What do we mean by Summarization • Expectations • Intuitive guesses on “how to” • Current approaches • One specific method in detail Document Summarization

  3. Objective of Summarization • Reduce the length of the document • But preserve: key information; style of writing • Expected qualities: cohesiveness; coherence; readability

  4. How can we say a summary is good or bad? • It can answer questions about the document • Compression ratio • Chronology is preserved • No redundancy

  5. How is it done manually? • Read the document • Identify important “phrases” • Identify the chronology of events, if any • Synthesize new sentences

  6. How to do it automatically: Edmundson’s method • His work at IBM • In 1969! • Forms a major component even in today’s systems! • Continued…

  7. Edmundson’s method • Collection of documents & manual abstracts • Study manual abstracts • Specify characteristics • “Generate” target abstracts manually

  8. Design scoring schemes • Score and pick sentences • Compare with manually chosen sentences

  9. Scoring schemes derived • Keyword-occurrence • Title-keyword • Location heuristic • Indicative phrases (“this report…”, “in conclusion…”) • Short-length cutoff • Upper-case word feature (acronyms, proper names)
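The scoring schemes on this slide can be combined into a single sentence score. The following is a minimal sketch, assuming a simple weighted sum; the function names, the weights, the cutoff of four words, and the way each feature is measured are illustrative assumptions, not Edmundson's published settings. The upper-case word feature is left out to keep the sketch short.

```python
import re
from collections import Counter

# Cue phrases quoted on the slide; a real list would be longer.
CUE_PHRASES = ("this report", "in conclusion")
MIN_WORDS = 4  # hypothetical short-length cutoff


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_sentences(title, sentences, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return one score per sentence as a weighted sum of four features."""
    w_key, w_title, w_loc, w_cue = weights
    title_words = set(tokenize(title))
    # Keyword-occurrence: word frequencies over the whole document.
    counts = Counter(w for s in sentences for w in tokenize(s))
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if len(words) < MIN_WORDS:
            scores.append(0.0)  # short-length cutoff
            continue
        keyword = sum(counts[w] for w in words) / len(words)
        title_overlap = len(title_words & set(words))
        location = 1.0 if i in (0, last) else 0.0  # first or last sentence
        cue = 1.0 if any(p in sentence.lower() for p in CUE_PHRASES) else 0.0
        scores.append(
            w_key * keyword + w_title * title_overlap
            + w_loc * location + w_cue * cue
        )
    return scores
```

In practice the weights would be tuned by comparing the selected sentences with manually chosen ones, as slide 8 describes.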

  10. Graph theoretic method • [Figure: graph of 11 numbered nodes] • Node: a sentence • Edge: exists if the similarity between a pair of sentences is greater than a threshold
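The graph construction described here can be sketched as follows. The similarity measure (shared words divided by the summed vocabulary sizes) and the threshold of 0.2 are assumptions chosen for illustration; the slide does not fix either.

```python
import re
from itertools import combinations


def words(sentence):
    """Set of lower-cased alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap(a, b):
    """Shared words divided by the summed vocabulary sizes (an assumption)."""
    wa, wb = words(a), words(b)
    total = len(wa) + len(wb)
    return len(wa & wb) / total if total else 0.0


def build_graph(sentences, threshold=0.2):
    """Adjacency sets: node i is sentence i; an edge joins similar pairs."""
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if overlap(sentences[i], sentences[j]) > threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph
```

Sentences with no neighbours share little vocabulary with the rest of the document, while densely connected groups of nodes tend to cover the same topic.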

  11. How to put key information together? • Synthesize new sentences? • Too difficult to synthesize accurately, although systems exist • Undesirable: the original style of writing is lost; subtle information such as the tone of presentation is lost

  12. Summary = collection of sentences • Take the top-scoring sentences • Arrange them by descending score • Preserve chronology if it exists
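The selection step can be sketched as below, assuming per-sentence scores are already available. The function name and the `keep_order` flag are hypothetical; the flag switches between ordering by descending score and restoring the original document order, which is one simple way to preserve chronology.

```python
def select_summary(sentences, scores, k, keep_order=True):
    """Pick the k highest-scoring sentences.

    With keep_order=True the picks are returned in document order;
    otherwise they are returned by descending score.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    if keep_order:
        chosen.sort()  # restore original position in the document
    return [sentences[i] for i in chosen]
```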

  13. Redundancy • Edmundson’s procedure: no explicit mention of avoiding redundancy; this is the only difference from modern methods • Novel methods to avoid redundancy: maximal marginal relevance (MMR)

  14. Similarity between sentences • Sentence 1: “Semester begins tomorrow” • Sentence 2: “New semester is beginning on Monday” • S1 = [Semester(1) begin(1) tomorrow(1)] • S2 = [New(1) semester(1) begin(1) Monday(1)] • $\text{Similarity} = \frac{1 + 1}{3 + 4}$
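The arithmetic on this slide can be reproduced with a few lines of code. This sketch assumes one reading of the formula: the numerator adds the product of the two counts for each shared term, and the denominator adds the total term counts of both sentences. The term lists are entered already stemmed and without stop words, as on the slide; the function name is illustrative.

```python
from collections import Counter


def similarity(terms_a, terms_b):
    """Shared-term count products divided by the summed term counts."""
    a, b = Counter(terms_a), Counter(terms_b)
    shared = sum(a[t] * b[t] for t in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return shared / total if total else 0.0


s1 = ["semester", "begin", "tomorrow"]
s2 = ["new", "semester", "begin", "monday"]
print(similarity(s1, s2))  # (1 + 1) / (3 + 4)
```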

  15. MMR features • Clusters of sentences • Candidacy of a sentence for the summary: similarity to the query; coverage of the passage; content in the passage, e.g., proper nouns, dates, etc.; time sequence (more recent ones) • Undesirable features in sentences: similarity to passages already included in the summary; belonging to a cluster/document that has already contributed a sentence to the summary

  16. [Figure: graph of 11 numbered nodes]

  17. MMR algorithm • $\text{Score}(s) = \lambda \cdot \text{sim}(s, \text{query}) - (1 - \lambda) \cdot \text{sim}(s, \text{summary})$ • Similarity to the summary built so far is a penalty, so it is subtracted • Done iteratively
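The iterative selection can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: Jaccard word overlap stands in for the similarity function, similarity to the summary is taken as the maximum similarity to any sentence already selected, and the default weight of 0.7 is arbitrary. The sketch subtracts the redundancy term so that sentences resembling the summary so far are penalized.

```python
import re


def words(text):
    """Set of lower-cased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sim(a, b):
    """Jaccard word overlap, a stand-in for any sentence similarity."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mmr_select(query, sentences, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def mmr(i):
            # Redundancy: closest match among sentences already chosen.
            redundancy = max(
                (sim(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * sim(sentences[i], query) - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=1.0` the loop ranks purely by relevance to the query; lowering `lam` increasingly favours sentences that add something new.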

  18. Future presentations on Summarization & contact persons for research in this area • Nikesh Garera (ng+@cs.cmu.edu): Learning Methods • Ravindra G. (ravi@mmsl.serc.iisc.ernet.in): Statistical Methods
