
A Trainable Document Summarizer Julian Kupiec, Jan Pedersen & Francine Chen ACM SIGIR ‘95

The Automatic Creation of Literature Abstracts, H. P. Luhn, IBM Journal of R&D, 1958, and A Trainable Document Summarizer, Julian Kupiec, Jan Pedersen & Francine Chen, ACM SIGIR '95. Presented by Mat Kelly, CS895 – Web-based Information Retrieval, Old Dominion University, November 22, 2011.


Presentation Transcript


  1. The Automatic Creation of Literature Abstracts, H. P. Luhn, IBM Journal of R&D, 1958 / A Trainable Document Summarizer, Julian Kupiec, Jan Pedersen & Francine Chen, ACM SIGIR '95. Presented by Mat Kelly, CS895 – Web-based Information Retrieval, Old Dominion University, November 22, 2011

  2. Luhn’s Objectives • Exploration into automatic methods of obtaining abstracts • Selects sentences that are most representative of pertinent info • Citations of author’s own statements constitute “auto-abstract”

  3. Which sentences are best? • Establish a significance factor • Frequency of word occurrence → word significance • Relative position of significant words within a sentence is a measure for determining the significance of the sentence • Why does this work? • A writer repeats certain words as he or she elaborates on a topic

  4. Over-Simplification • Method does not differentiate words with the same stem • Letter-by-letter analysis determines the probability that two words share a stem • While authors will opt for synonymous word choices, they eventually run out and resort to repetition. polic- = { policing, policy, police }
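A minimal sketch of the common-beginning consolidation described above. The 5-letter prefix threshold and the function names are illustrative assumptions, not Luhn's exact letter-by-letter procedure:

```python
from collections import defaultdict

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading letters of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def consolidate_by_prefix(words, min_prefix=5):
    """Group words that share at least `min_prefix` leading letters,
    treating the shared prefix as a rudimentary stem."""
    groups = defaultdict(set)
    for w in sorted(set(words)):          # alphabetical sort puts related forms together
        for stem in list(groups):
            if common_prefix_len(w, stem) >= min_prefix:
                groups[stem].add(w)
                break
        else:
            groups[w[:min_prefix]].add(w)
    return dict(groups)

print(consolidate_by_prefix(["police", "policing", "policy", "politics"]))
# e.g. {'polic': {'police', 'policing', 'policy'}, 'polit': {'politics'}}
```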

  5. Premise • No consideration is given to the meaning of words • Instead, the closer certain words are associated, the more specifically an aspect of the subject is being treated • Where the greatest number of frequently occurring different words are found close to each other, the probability is high that the information given is most representative of the article

  6. The criterion is the relationship of significant words to each other rather than their distribution over the whole sentence • Consider only the portions of sentences that are bracketed by significant words; words beyond the limit are disregarded for the current bracket • A useful limit was found to be 4–5 non-significant words between significant words

  7. Computing the Significance Factor • Determine the extent of the cluster by bracketing • Count the number of significant words in the cluster • Divide the square of that count by the total number of words in the cluster • Tested on 50 articles of 300–4,500 words each, compared against manual generation by 100 people • [Slide figure: significant words in a sentence are bracketed into a cluster; if significant words are no more than 4 words apart, the whole sentence is cited]
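A minimal sketch of the bracketing and scoring just described, assuming simple whitespace tokens; the gap limit of 4 non-significant words follows the slide, while the function name and example are illustrative:

```python
def luhn_sentence_score(sentence_tokens, significant, max_gap=4):
    """Find clusters of significant words separated by at most `max_gap`
    non-significant words, and return (significant count)^2 / cluster length
    for the best cluster in the sentence."""
    positions = [i for i, tok in enumerate(sentence_tokens) if tok.lower() in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = end = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - end <= max_gap + 1:      # at most max_gap non-significant words in between
            end = pos
            count += 1
        else:                             # close the current cluster, start a new one
            best = max(best, count ** 2 / (end - start + 1))
            start = end = pos
            count = 1
    best = max(best, count ** 2 / (end - start + 1))
    return best

tokens = "the significant cluster of truly significant words ends here".split()
print(luhn_sentence_score(tokens, {"significant", "cluster", "words"}))
```

A sentence's score is the score of its best cluster, and sentences are later ranked by this value.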

  8. Resolving power depends on the total number of words in the article and decreases as that total increases • Overcome by running the method on subdivisions of the article; the highest-ranking sentences from each subdivision are combined to form the abstract • Divisions might already exist in the paper's organization • Otherwise, divide the article arbitrarily into overlapping sections
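A small sketch of the arbitrary, overlapping subdivision mentioned above; the section size and overlap are illustrative values, not ones given by Luhn:

```python
def overlapping_sections(sentences, size=40, overlap=10):
    """Split a document (a list of sentences) into overlapping sections
    so the sentence scoring can be run on each section independently."""
    step = size - overlap
    return [sentences[i:i + size]
            for i in range(0, max(len(sentences) - overlap, 1), step)]
```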

  9. Procedures • Abstracts prepared by first punching words onto cards(!) • Pronouns & prepositions deleted from the lookup routine • The rest of the words sorted alphabetically • Words with common beginnings consolidated (a rudimentary form of stemming) • Produced errors of up to 5% but did not affect results • Words with low frequency removed; the remaining words marked as significant • Sentence significance then computed with the previous formula
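A minimal sketch of the word-selection steps on this slide (delete function words, count the rest, keep the frequent ones as significant); the stop-word list and frequency cutoff are illustrative assumptions:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "he", "she", "it"}  # illustrative subset

def significant_words(tokens, min_freq=3):
    """Delete pronouns/prepositions (here: a small stop list), count the
    remaining words, and mark those above a frequency cutoff as significant."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_freq}
```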

  10. Abstract Creation with Result • Apply a cutoff value of sentence significance • A fixed number of sentences may instead be required irrespective of document length • Sentences could be weighted by assigning a premium value to a predetermined set of words if the article is of special interest • If no sentences meet the threshold, reject the article as too general for the purpose of auto-abstracting
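A brief sketch of selecting abstract sentences either by a significance cutoff or by taking a fixed number of top-scoring sentences; the function name and tuple layout are illustrative:

```python
def select_abstract(scored_sentences, cutoff=None, top_n=None):
    """Pick abstract sentences by a significance cutoff, or by taking a fixed
    number of top-scoring sentences, and return them in document order.
    `scored_sentences` is a list of (index, score, sentence) tuples."""
    if cutoff is not None:
        chosen = [s for s in scored_sentences if s[1] >= cutoff]
        if not chosen:
            return None   # article too general for auto-abstracting
    else:
        chosen = sorted(scored_sentences, key=lambda s: s[1], reverse=True)[:top_n]
    return [s[2] for s in sorted(chosen, key=lambda s: s[0])]
```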

  11. Example Generated Abstract Two major recent developments have called the attention of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people amenable to therapy. (4.0) This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, the California researchers emphasized. (5.4) The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who must care for mental patients. (5.4)

  12. Conclusions • Method proved feasible • Highly reliable, consistent and stable, unlike manual creation • Possibility that the author's style causes inferior sentences to be promoted • Method helps to realize savings in human effort • [Slide figure: Significant Words → Significant Sentences → Inclusion in Abstract]

  13. Kupiec's Objective • Motive: provide an intermediate point between a document's title and its full text (i.e., an abstract) • Documents as short as 20% of the original can be as informative as the full text* • Extracts can be non-unique • A combination of numerous methods (including Luhn's) would have the best performance. * A.H. Morris, G.M. Kasper, and D.A. Adams. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research, pages 17-35, March 1992

  14. A Statistical Classification Problem • Have a training set of documents with manually extracted abstracts • Develop a classification function that estimates the probability that a given sentence is included in the abstract • From this, generate new extracts by ranking sentences according to this probability and selecting a user-specified number of top-scoring sentences • [Slide figure: for each given sentence, features 1…n contribute to the sentence's SCORE via a Bayesian classifier; the score estimates the probability of abstract inclusion, compared against an inclusion threshold]

  15. Evaluation criterion: classification success rate/precision • Requires corpus (expensive) • Acquired from non-profit Engineering Information Co. – used as basis for experiments • All previous methods assume that documents exist in isolation

  16. Features Experimentally Obtained • Sentence Length Cutoff – short sentences (under 5 words) are not usually included in summaries • Fixed-Phrase – a list of indicator words, and sentences after headings such as "Summary" or "Conclusions", are likely to be in summaries • Paragraph – consider the first 10 paragraphs and the last 5 paragraphs • Thematic Word – score sentences with respect to their inclusion of words within the document's theme • Uppercase Word – e.g. proper names, scored similarly to thematic words; sentences that start with such a word score double that of later occurrences
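A minimal sketch of turning the five features above into discrete values for one sentence. The phrase list, the thematic threshold of two hits, and the boolean encodings are illustrative assumptions based on this slide, not the paper's exact definitions:

```python
FIXED_PHRASES = ("in summary", "in conclusion", "the results show")  # illustrative list

def sentence_features(sentence_tokens, para_index, n_paragraphs,
                      thematic_words, min_len=5):
    """Compute discrete (boolean) feature values for one sentence,
    following the five features listed on the slide."""
    lowered = [t.lower() for t in sentence_tokens]
    text = " ".join(lowered)
    return {
        "length_ok":    len(sentence_tokens) > min_len,                       # sentence-length cutoff
        "fixed_phrase": any(p in text for p in FIXED_PHRASES),                # summary/conclusion cues
        "paragraph":    para_index < 10 or para_index >= n_paragraphs - 5,    # first 10 / last 5 paragraphs
        "thematic":     sum(t in thematic_words for t in lowered) >= 2,       # contains thematic words
        "uppercase":    any(t[:1].isupper() for t in sentence_tokens[1:]),    # capitalized, non-sentence-initial word
    }
```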

  17. Classifier • For each sentence s, determine the probability that it will be included in summary S given its k features • Since all features are discrete, the equation can be put in terms of probabilities rather than likelihoods • The result is a simple Bayesian classification function that assigns a score to s, which is used to select sentences for inclusion in the summary
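A sketch of that score, assuming the features are statistically independent given inclusion (the naive-Bayes assumption); the probability tables would be estimated by counting over the training corpus, and the function and argument names are illustrative:

```python
from math import prod

def inclusion_score(features, p_feat_given_in, p_feat, p_in_summary):
    """Naive-Bayes-style score for one sentence:
    P(s in S | F1..Fk) ~ P(s in S) * prod_j P(Fj | s in S) / prod_j P(Fj),
    with all probabilities read from tables estimated on the training set."""
    num = p_in_summary * prod(p_feat_given_in[name][value]
                              for name, value in features.items())
    den = prod(p_feat[name][value] for name, value in features.items())
    return num / den if den else 0.0
```

Sentences are then ranked by this score and the top user-specified number form the extract.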

  18. About the Corpus • Articles originally lacked abstracts; summaries were created manually after the fact • 188 document/summary pairs from 21 publications in scientific/technical domains • Average summary length is 3 sentences

  19. Sentence Matching • Using the manually created abstracts, match summary sentences to sentences in the original document • Direct match – verbatim or with only minor modifications • Direct join – 2 or more document sentences used to make one summary sentence • Unmatchable – suspected fabrication, not built from sentences in the document • Incomplete – some overlap exists but the content is not preserved in the summary, or the summary sentence includes content from the original plus other information not covered by a direct join • [Slide figure: examples of a direct match and a direct join between abstract and document sentences]

  20. Evaluation • Insufficient data for a separate test corpus, so a cross-validation strategy was used for evaluation • Documents from a given journal were selected for testing one at a time; all other document/summary pairs were used for training • Results were summed over journals • Unmatchable and incomplete sentences were excluded from training and testing, leaving 498 unique sentences
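A rough sketch of that evaluation loop; `train_classifier` and `evaluate` are hypothetical placeholders standing in for the training and scoring steps described earlier:

```python
def cross_validate(pairs_by_journal, train_classifier, evaluate):
    """Hold out one document/summary pair at a time, train on all other pairs,
    and sum the per-journal results."""
    all_pairs = [p for pairs in pairs_by_journal.values() for p in pairs]
    totals = {}
    for journal, pairs in pairs_by_journal.items():
        correct = n = 0
        for held_out in pairs:
            model = train_classifier([p for p in all_pairs if p is not held_out])
            c, m = evaluate(model, held_out)
            correct, n = correct + c, n + m
        totals[journal] = (correct, n)
    return totals
```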

  21. Evaluating Performance • Fraction of manual summary sentences that can be reproduced, limited by text excerpting: (451 + 19)/568 = 83% • A produced sentence is correct if: • it has a direct sentence match and is present in the manual summary – or – • it is in the manual summary as part of a direct join and all other components of the join have also been produced • [Slide table: Distribution of Correspondence in Training Corpus]

  22. Results • Of 568 sentences • 195 direct matches + 6 direct joins → 201 correctly identified summary sentences (35% replication) • Manual summary generation has only 25% overlap between people and 55% for the same person over time • 211/498 (42%) of sentences correctly identified by the summarizer

  23. Conclusions • For summaries 25% of the size of the document, 84% of the selected sentences were also selected by professionals • For smaller summaries, an improvement of 74% was observed vs. simply presenting the beginning of the document

  24. Comparing the Processes • Luhn: [Slide figure: Significant Words → Significant Sentences → Inclusion in Abstract] • Kupiec: [Slide figure: given sentences, the features (sentence length < 5 words, after a fixed phrase, priority for first & last paragraphs, thematic words, capitalized non-unit words) each contribute to a sentence's SCORE via a Bayesian classifier, which determines the probability of abstract inclusion against an inclusion threshold]

  25. References • H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-165, 1958. • Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference, pages 68-73, Seattle, WA, 1995.
