Document Summarization
Madhavi Ganapathiraju
Graduate Student, Language Technologies Institute, Carnegie Mellon University
RR-NB Seminar Series, October 17, 2003
In this talk…
• What do we mean by Summarization
• Expectations
• Intuitive guesses on “how to”
• Current approaches
• One specific method in detail
Objective of Summarization
• Reduce length of document
• But preserve:
  • Key information
  • Style of writing
• Expected qualities:
  • Cohesiveness
  • Coherence
  • Readability
How can we say a summary is good or bad?
• Be able to answer questions
• Compression ratio
• Preserve chronology
• No redundancy
How is it done manually?
• Read the document
• Identify important “phrases”
• Identify chronology of events, if any
• Synthesize new sentences
How to do it automatically: Edmundson’s method
• His work at IBM
• In 1969!
• Forms a major component even in today’s systems
Edmundson’s method
[Flow diagram; step order reconstructed from the slide text]
• Start from a collection of documents and their manual abstracts
• Study the manual abstracts
• Specify characteristics
• “Generate” target abstracts manually
• Design scoring schemes
• Score and pick sentences
• Compare with manually chosen sentences
Scoring schemes derived
• Keyword-occurrence
• Title-keyword
• Location heuristic
• Indicative phrases
  • “this report…”, “in conclusion…”
• Short-length cutoff
• Upper-case word feature
  • Acronyms, proper names
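The scoring schemes listed above can be combined into a single sentence score. The following is a minimal sketch, not Edmundson's actual procedure: the function names, the equal default weights, the choice of "first or last sentence" as the location heuristic, and the 4-word length cutoff are all assumptions made for illustration.

```python
import re

# Example indicative phrases taken from the slide; a real list would be longer.
CUE_PHRASES = ("this report", "in conclusion")


def tokenize(text):
    """Lowercase a string and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, keywords, min_words=4,
                    weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Return one score per sentence (weights and cutoff are hypothetical)."""
    w_key, w_title, w_loc, w_cue, w_upper = weights
    title_words = set(tokenize(title))
    keywords = {k.lower() for k in keywords}
    last = len(sentences) - 1
    scores = []
    for i, sent in enumerate(sentences):
        words = tokenize(sent)
        # Short-length cutoff: very short sentences get a zero score.
        if len(words) < min_words:
            scores.append(0.0)
            continue
        key = sum(1 for w in words if w in keywords)        # keyword occurrence
        ttl = sum(1 for w in words if w in title_words)     # title keywords
        loc = 1.0 if i in (0, last) else 0.0                # location heuristic
        cue = 1.0 if any(p in sent.lower() for p in CUE_PHRASES) else 0.0
        # Upper-case feature: capitalized words other than the first word.
        upper = sum(1 for w in sent.split()[1:] if w[:1].isupper())
        scores.append(w_key * key + w_title * ttl + w_loc * loc
                      + w_cue * cue + w_upper * upper)
    return scores
```

In practice the weights would be tuned by comparing the picked sentences against manually chosen ones.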
Graph theoretic method
[Figure: graph with 11 numbered nodes]
• Node: a sentence
• Edge: exists if the similarity between a pair of sentences is greater than a threshold
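The graph construction can be sketched as follows. This is an illustrative sketch only: the overlap-based similarity measure, the function names, and the threshold value in the usage note are assumptions, and the slide does not say how sentences are then picked from the graph.

```python
from collections import Counter
from itertools import combinations


def overlap_similarity(a, b):
    """Shared term count divided by the combined length of both sentences."""
    shared = sum((Counter(a) & Counter(b)).values())
    total = len(a) + len(b)
    return shared / total if total else 0.0


def build_graph(sentences, threshold):
    """Map each sentence index to the set of indices it shares an edge with.

    `sentences` is a list of term lists; an edge is added only when the
    similarity of the pair is strictly greater than `threshold`.
    """
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if overlap_similarity(sentences[i], sentences[j]) > threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph
```

For example, `build_graph(sentences, 0.2)` links two sentences that share two of their seven terms and leaves an unrelated sentence as an isolated node.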
How to put key information together?
• Synthesize new sentences?
  • Too difficult… to synthesize accurately
  • Systems exist
• Undesirable
  • Original style of writing lost
  • Subtle information like tone of presentation lost
Summary = collection of sentences
• Take the top-scoring sentences
• Arrange them by descending scores
• Preserve chronology if it exists
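A minimal sketch of this assembly step, assuming the sentences have already been scored. The function name and the `preserve_order` flag are hypothetical, and document order is used here as a stand-in for chronology.

```python
def assemble_summary(sentences, scores, k, preserve_order=True):
    """Pick the k highest-scoring sentences and arrange them.

    With preserve_order=True the picked sentences are returned in their
    original document order; otherwise they stay in descending score order.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    if preserve_order:
        chosen.sort()
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices is the simplest way to keep the order of events from the source document.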
Redundancy
• Edmundson’s procedure:
  • No explicit mention of avoiding redundancy
  • The only difference from modern methods
• Novel methods to avoid redundancy
  • Maximum “marginal relevance” (MMR)
Similarity between sentences
• Semester begins tomorrow
• New semester is beginning on Monday
S1 = [Semester(1) begin(1) tomorrow(1)]
S2 = [New(1) semester(1) begin(1) Monday(1)]
Similarity = (1 + 1) / (3 + 4) = 2/7
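The worked example can be reproduced in a few lines. This sketch assumes one reading of the slide's fraction: the numerator sums the counts of terms shared by both sentences, and the denominator is the total number of terms in the two sentences. The terms are assumed to be already stemmed and stripped of function words, as in S1 and S2.

```python
from collections import Counter


def overlap_similarity(s1, s2):
    """Shared term count divided by the combined length of both sentences."""
    c1, c2 = Counter(s1), Counter(s2)
    # Counter intersection keeps, for each shared term, the smaller count.
    shared = sum((c1 & c2).values())
    total = sum(c1.values()) + sum(c2.values())
    return shared / total if total else 0.0


s1 = ["semester", "begin", "tomorrow"]
s2 = ["new", "semester", "begin", "monday"]
similarity = overlap_similarity(s1, s2)  # (1 + 1) / (3 + 4)
```

Other measures, such as cosine similarity over term vectors, would fit the same interface.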
MMR features
• Clusters of sentences
• Candidature of a sentence to be in summary:
  • Similarity to query
  • Coverage of the passage
  • Content in the passage, e.g., proper nouns, dates, etc.
  • Time sequence: more recent ones
• Undesirable features in sentences:
  • Similarity to passages already included in the summary
  • Belonging to the cluster/document that has already contributed a sentence to the summary
MMR algorithm
Score of sentence = λ * (similarity of sentence to query) - (1 - λ) * (similarity of sentence to summary)
• Done iteratively
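A minimal sketch of the iterative selection, in which similarity to sentences already in the summary counts against a candidate. The overlap-based similarity, the function names, the default value of `lam`, and the use of the maximum similarity to any already selected sentence as the redundancy term are assumptions for illustration.

```python
from collections import Counter


def overlap_similarity(a, b):
    """Shared term count divided by the combined length of both term lists."""
    shared = sum((Counter(a) & Counter(b)).values())
    total = len(a) + len(b)
    return shared / total if total else 0.0


def mmr_select(sentences, query, k, lam=0.7):
    """Return the indices of up to k sentences, picked one at a time.

    Each round scores every remaining sentence as
    lam * relevance_to_query - (1 - lam) * redundancy_with_summary
    and moves the best one into the summary.
    """
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = overlap_similarity(sentences[i], query)
            redundancy = max(
                (overlap_similarity(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the selection reduces to plain relevance ranking; lower values favor sentences that add something new.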
Future presentations on Summarization & contact persons for research in this area:
• Nikesh Garera (ng+@cs.cmu.edu): Learning Methods
• Ravindra G. (ravi@mmsl.serc.iisc.ernet.in): Statistical Methods