
Automatic text summarization

Automatic text summarization. Hercules Dalianis NADA-KTH Royal Institute of Technology 100 44 Stockholm ph: +46-8-790 91 05 mobile: +46 70 568 13 59 email: hercules@kth.se. Overview of talk. Background Other summarizers Technique Future improvements Applications Evaluation.


Presentation Transcript


  1. Automatic text summarization Hercules Dalianis NADA-KTH Royal Institute of Technology 100 44 Stockholm ph: +46-8-790 91 05 mobile: +46 70 568 13 59 email: hercules@kth.se

  2. Overview of talk • Background • Other summarizers • Technique • Future improvements • Applications • Evaluation

  3. Automatic text summarization • Automatic text summarization is the method by which a computer summarizes a text. • A text is given to the computer, which returns a shorter, non-redundant text: an extract from the longer original. • The technique has its roots in the 1960s. • With the Internet and the WWW, interest in summarization techniques has reawakened.

  4. Summarization tools http://www.nada.kth.se/~hercules/HDbookmarks.htm http://www.ics.mq.edu.au/~swan/summarization/projects_full.htm • Microsoft Word 97, 98 and Word 2000 have a summarizer for documents. • Intelligent Miner for Text, summarization tool (IBM) • Inxight (XEROX) • Datahammer (Glucose Development Corporation)

  5. Corporum Summarizer, Cognit AS (Norway) • Pertinence (France) • Copernic Summarizer • MuST Prototype • Automated Text Summarization (SUMMARIST) • Columbia Newsblaster http://www1.cs.columbia.edu/nlp/newsblaster • OracleContext • Autonomy

  6. What is automatic summarization good for? • Newspaper typesetting and printing • Sydsvenska Dagbladet, Bergens Tidende • Summarizing scientific texts • Danmarks Elektroniske Forskningsbibliotek • Telephone systems • Reading summarized news aloud with synthetic speech

  7. Search engines • To summarize documents for the hit list, cf. Google, SiteSeeker • NewsAgent - Business Intelligence • TDT (Topic Detection and Tracking) and Columbia Newsblaster

  8. Summarization approaches • Extraction vs. Abstraction • Generic vs. Query based • Indicative vs. Informative • Restricted vs. Unrestricted domain • Background information vs. New information (TDT) • Single-document vs. Multiple-document • Monolingual vs. Multilingual • Textual vs. Multimedia

  9. Text summarization • Extraction is much easier than abstraction • Abstraction needs understanding and rewriting

  10. Techniques • Find what the text is about • Then decide what to say • Then decide how to say it • Text summarization (extraction) uses statistical, linguistic and heuristic methods

  11. Techniques • A text is divided into sentences • Sentence positions (news/reports) • Title words • Bold text, numerical values, citations • Named entities (frequency based) • Keyword frequency and extraction (nouns, adverbs, adjectives) • Use morphological information (lemma)
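The surface cues listed on this slide can be computed per sentence before any ranking takes place. The following is a minimal sketch, not SweSum's actual code: the function names, the naive sentence splitter and the choice of three features (position, title-word overlap, presence of a number) are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(s):
    # Lowercased word tokens.
    return re.findall(r"\w+", s.lower())

def sentence_features(title, text):
    """Compute a few of the slide's cues for every sentence (illustrative only)."""
    sentences = split_sentences(text)
    title_words = set(words(title))
    n = len(sentences)
    features = []
    for i, s in enumerate(sentences):
        toks = words(s)
        features.append({
            "sentence": s,
            # Earlier sentences get a higher value, as is typical for news text.
            "position": 1.0 - i / n,
            # Number of tokens that also occur in the title.
            "title_overlap": sum(1 for w in toks if w in title_words),
            # True if the sentence contains a numerical value.
            "has_number": any(ch.isdigit() for ch in s),
        })
    return features
```

Each feature is kept separate here so that a later step can weight them differently.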

  12. Key word lexicon • Key words in the news domain • Also called "open class word lexicon" • Key words can be nouns, adjectives or adverbs

  13. Words that are present in all other sentences • User adaptation • Use user keywords to obtain slanted summaries • A combination function of all rankings, with different weights, gives the rank of each sentence. • Generate all high-ranking sentences • Voilà, the summary!
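The weighted combination of rankings described on this slide can be sketched as follows. This is an illustration under stated assumptions rather than SweSum's implementation: the three rankings (position, word frequency, user keywords), the default weights and the function name `summarize` are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(s):
    return re.findall(r"\w+", s.lower())

def summarize(text, ratio=0.3, user_keywords=(), weights=None):
    """Rank sentences by a weighted sum of scores and keep the top ones."""
    # Hypothetical default weights; user keywords count most, to slant the summary.
    weights = weights or {"position": 1.0, "frequency": 1.0, "user": 2.0}
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(w for s in sentences for w in tokenize(s))
    max_freq = max(freq.values(), default=1)
    user = {k.lower() for k in user_keywords}
    n = len(sentences)
    scored = []
    for i, s in enumerate(sentences):
        toks = tokenize(s)
        position = 1.0 - i / n
        # Average word frequency of the sentence, normalized to the range 0..1.
        frequency = sum(freq[w] for w in toks) / (len(toks) or 1) / max_freq
        user_hits = sum(1 for w in toks if w in user)
        score = (weights["position"] * position
                 + weights["frequency"] * frequency
                 + weights["user"] * user_hits)
        scored.append((score, i))
    keep = max(1, round(n * ratio))
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:keep]
    # Emit the selected sentences in their original order.
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]
```

Passing `user_keywords` raises the rank of sentences that mention them, which is one simple way to obtain a slanted summary; the `ratio` parameter sets the target length as a fraction of the original sentence count.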

  14. SweSum • The first text summarizer for Swedish • Summarizes Swedish newspaper text in HTML/text format on the WWW. • Uses a Swedish key word lexicon that contains 40 000 words and their 700 000 possible inflections. • During summarization, 5-10 key words are produced that describe or categorize the text. • Key words: a miniature summary.

  15. The Swedish keyword lexicon

      Inflected version (700 000 words)   Lemma (40 000 words)
      statsminister                       statsminister
      statsministern                      statsminister
      statsministerns                     statsminister
      statsministrarna                    statsminister
      statsministrarnas                   statsminister
      ...                                 ...
      regeringen                          regeringen
      regeringens                         regeringen
      regeringarna                        regeringen
      regeringarnas                       regeringen
      ...                                 ...
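A lexicon of this shape can be used as a plain lookup table from inflected form to lemma, so that keyword counts are pooled across inflections. The sketch below uses only the entries shown on the slide; the dictionary layout and the function name `top_keywords` are assumptions, not SweSum's actual data format.

```python
import re
from collections import Counter

# Tiny stand-in for the 700 000-entry lexicon: inflected form -> lemma.
# Entries are copied from the slide; the real lexicon format is not known here.
LEXICON = {
    "statsminister": "statsminister",
    "statsministern": "statsminister",
    "statsministerns": "statsminister",
    "statsministrarna": "statsminister",
    "statsministrarnas": "statsminister",
    "regeringen": "regeringen",
    "regeringens": "regeringen",
    "regeringarna": "regeringen",
    "regeringarnas": "regeringen",
}

def top_keywords(text, lexicon=LEXICON, k=5):
    """Return up to k lemmas, most frequent first; words not in the lexicon are ignored."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    return [lemma for lemma, _ in counts.most_common(k)]
```

Because only lexicon words are counted, closed-class words never become key words, and the handful of top lemmas can serve as the miniature summary mentioned on the SweSum slide.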

  16. SweSum • SweSum is available to summarize news texts in Swedish, Danish, Norwegian, English, Spanish, French, German and Farsi (Persian).

  17. Text summarization slideshow

  18. Problems • Pronouns and other anaphoric references • Kalle sprang. Han sprang fort. (Kalle ran. He ran fast.) • Pronoun resolution • Clauses can be too long or too short • Clause reduction and clause combination rules • Aggregation

  19. SweSum without PRM Analysera mera! Director: Harold Ramis Cast: Robert De Niro, Billy Crystal, Lisa Kudrow Running time: 1 h 45 min … One of many reasons to enjoy Analysera mera is that Robert De Niro is truly practicing the art of acting here again. He accelerates emotionally from 0 to 100 in no time at all, and then brakes with feline softness and parks, calm and composed. And he is fairly irresistible. Here he has produced yet another intelligent comedy for all of us friends of intelligence and comedy, preferably in combination. SvD 99-10-08

  20. SweSum with PRM Analysera mera! Director: Harold Ramis Cast: Robert De Niro, Billy Crystal, Lisa Kudrow Running time: 1 h 45 min … One of many reasons to enjoy Analysera mera is that Robert De Niro is truly practicing the art of acting here again. Robert accelerates emotionally from 0 to 100 in no time at all, and then brakes with feline softness and parks, calm and composed. And Robert is fairly irresistible. Here Harold has produced yet another intelligent comedy for all of us friends of intelligence and comedy, preferably in combination. SvD 99-10-08
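The effect seen in the two versions above, where a pronoun is replaced by a name so that an extracted sentence still makes sense out of context, can be imitated with a very naive heuristic: substitute the most recently mentioned name. This is only a sketch under that assumption and is not how SweSum resolves pronouns; the function name, the `names` set and the pronoun list are hypothetical, and the heuristic will pick the wrong antecedent whenever the pronoun does not refer to the latest name.

```python
import re

def resolve_pronouns(sentences, names, pronouns=("he", "she")):
    """Replace each listed pronoun with the most recently seen name (naive heuristic)."""
    last_name = None
    resolved = []
    for sentence in sentences:
        out = []
        # Split into alternating word and non-word chunks so that joining
        # the chunks reproduces the original spacing and punctuation.
        for token in re.findall(r"\w+|\W+", sentence):
            if token in names:
                last_name = token
                out.append(token)
            elif token.lower() in pronouns and last_name is not None:
                out.append(last_name)
            else:
                out.append(token)
        resolved.append("".join(out))
    return resolved
```

Applied to the example from the Problems slide, `resolve_pronouns(["Kalle sprang.", "Han sprang fort."], names={"Kalle"}, pronouns=("han", "hon"))` rewrites the second sentence with the name in place of the pronoun, so that it can be extracted on its own.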

  21. Evaluation • We found that if one summarizes a text to 30 percent of its original length, one obtains around 70-80 percent accuracy on 3-4 page news articles. • ... but query-based evaluations are based on subjective opinions • These evaluations require a large human effort • Small overlap of opinions • We need man-made extracts to compare the machine-made extracts against automatically

  22. There are some man-made extracts for English news texts. • We had to create such extracts for Swedish news texts. • We created the KTH Extract Corpus, a corpus created manually, once, by voting • Then one can compare the texts from SweSum and the KTH Extract Corpus manually or, soon, automatically
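One way to picture a voting-built reference extract and its automatic comparison with machine output is shown below. The sketch assumes that each annotator marks sentence indices, that a sentence enters the reference extract when it reaches a vote threshold, and that agreement is measured as precision and recall over sentence indices. None of these details are stated in the slides, and the function names are hypothetical.

```python
from collections import Counter

def gold_extract_by_voting(annotator_choices, min_votes):
    """Keep the sentence indices chosen by at least min_votes annotators."""
    # set(chosen) guards against an annotator listing the same sentence twice.
    votes = Counter(i for chosen in annotator_choices for i in set(chosen))
    return {i for i, v in votes.items() if v >= min_votes}

def overlap_scores(machine, gold):
    """Return (precision, recall) of the machine extract against the reference extract."""
    machine, gold = set(machine), set(gold)
    common = machine & gold
    precision = len(common) / len(machine) if machine else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    return precision, recall
```

For example, if three annotators choose sentences `[0, 1, 4]`, `[0, 2, 4]` and `[0, 4, 5]` and two votes are required, the reference extract contains sentences 0 and 4; a machine extract of `[0, 1, 4, 6]` then covers the whole reference but half of its own sentences are not in it.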

  23. KTH extract corpus • http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php and • http://www.nada.kth.se/iplab/hlt/kthxc/ • Show the cell text

  24. http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php?cutoff=30&fileid=svenska-%3Etest-%3Etext001.htm

  25. Future improvements of SweSum • Tagging instead of static lexicons • Clause-level summarization • Improved Named Entity recognition • Improved Pronominal Resolution • Lexical chains using SIMPLE and/or EuroWordNet • Automatic evaluation method

  26. Demonstrators • SweSum – Standard version http://swesum.nada.kth.se/index-eng.html • SweSum – Experimental NE version http://www.nada.kth.se/~xmartin/swesum_lab/index-eng.html (SweSum runs as a Perl CGI script; there is also a standalone version for plain text/HTML.)
