
Automatic Text Summarization Introduction and Research Problems


Presentation Transcript


  1. Automatic Text Summarization: Introduction and Research Problems

  2. Talk Outline • Why automatic text summarization? • What is automatic text summarization? • Classification of methodologies. • Multiple Document Summarization. • Evaluation. • Research Problems. • Demo.

  3. Why Summarization? • Information overload. • The problem: • 4 billion URLs indexed by Google • 200 TB of data on the Web [Lyman and Varian '03] • Possible approaches: • information retrieval • document clustering • information extraction • visualization • question answering • text summarization

  4.–8. Summarization Examples [five slides of example summaries; the screenshots are not preserved in the transcript]

  9. What’s Summarization for? • Need to access textual information • indexing • search • Decision making process • Should I read the document? • Summary as surrogate • I don’t need to read the document! • Crossing the language barrier • Should I ask for a translation of the document?

  10. What’s Summarization for? • User-oriented summaries (“slanted”) • E-mail me summaries of the news I like • Summaries on hand-held devices • Spoken summaries • Government analysts • Need profiles of persons and organizations • Scientists and academics • Need summaries of the state of the art • Students • Need summaries for tomorrow’s exams

  11. What is Summarization? • Definition: A brief but accurate representation of the contents of a document.

  12. Types of Summaries • Purpose • Indicative, informative, and critical summaries • Form • Extracts (representative paragraphs/sentences/phrases) (majority of approaches) • Abstracts: “a concise summary of the central subject matter of a document” [Paice '90] • Dimensions • Single-document vs. multi-document • Context • Query-specific vs. query-independent • Monolingual vs. multilingual

  13. Types of Summaries Genres: • headlines • outlines • minutes • biographies • abridgments • sound bites • movie summaries • chronologies, etc.

  14. People Involved • Author of the document • Expert in a field • Professional abstractor • An expert in abstract writing

  15. Stages in Summarization • Three stages (typically) • content identification • conceptual organization • realization

  16. Stages in Automatic Summarization [Pipeline diagram: Document → Preprocessing → Document representation → Summary representation → Summary generation → Summary]

  17. Stages in Automatic Summarization Preprocessing: • morphological analysis or stemming • found => find + -ed • morphological features (tense, person, aspect, etc.) • parsing • syntactic tree • total or partial (chunking) • symbolic vs. statistical • rhetorical parsing
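
As a concrete illustration of this preprocessing step, here is a minimal Python sketch using NLTK (one common toolkit; the slides do not prescribe a tool, and the WordNet data must be downloaded separately):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download("wordnet")

stem = PorterStemmer().stem
lemmatize = WordNetLemmatizer().lemmatize

print(stem("running"))              # 'run'  -- rule-based suffix stripping
print(lemmatize("found", pos="v"))  # 'find' -- dictionary lookup handles irregular forms
```

Note the difference: a stemmer only strips suffixes, so recovering "find" from "found" needs dictionary-based morphological analysis.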

  18. Stages in Automatic Summarization Document representation: • Set of linguistic components (paragraph, sentences, word, …). • Boolean representation. • Term vector space. Summary representation: • Subset of sentences. • Transformed sentences. • Extraction Templates.
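
A minimal sketch of the bag-of-words / term-vector document representation named above; the stopword list is illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # illustrative list

def term_vectors(sentences):
    """Represent each sentence as a term-frequency (bag-of-words) vector."""
    return [Counter(w for w in s.lower().split() if w not in STOPWORDS)
            for s in sentences]
```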

  19. Summarization by Extraction • Easy to implement and robust • Select the most relevant sentences… • How do we discover what type of linguistic/semantic information contributes to the notion of relevance? • How should extracts be evaluated?

  20. Methodologies for Automatic Summarization Classification of Methods: • Traditional Methods • Term, word, phrase frequencies. • Corpus-based Approaches • Combination of statistical features. • Learning to extract. • Exploiting Discourse Structures • e.g., WordNet, RST • Knowledge-Rich Approaches • For particular domains

  21. Classical Methods Keyword method (Luhn '58): • Very first work in automated summarization • Computes measures of word significance • Words: • stemming • bag of words • [Figure: word frequency vs. rank; the "resolving power of significant words" peaks in the mid-frequency range, between upper and lower frequency cut-offs]
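
A toy rendition of Luhn's idea in Python: treat mid-frequency words as significant and score sentences by their density of such words. The cut-off parameters are illustrative, and Luhn's original method scores clusters of significant words within a sentence rather than the sentence as a whole:

```python
from collections import Counter

def luhn_scores(sentences, stopwords, min_count=2, skip_top=5):
    """Score sentences by their density of 'significant' (mid-frequency) words.
    min_count / skip_top are illustrative low- and high-frequency cut-offs."""
    freq = Counter(w for s in sentences
                     for w in s.lower().split() if w not in stopwords)
    too_common = {w for w, _ in freq.most_common(skip_top)}
    significant = {w for w, c in freq.items()
                   if c >= min_count and w not in too_common}
    return [sum(w in significant for w in s.lower().split()) / max(1, len(s.split()))
            for s in sentences]
```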

  22.–23. Classical Methods Keyword method (Luhn '58) [two figure slides illustrating the method; images not preserved in the transcript]

  24. Classical Methods Position/Location method (Edmundson '69): • Important sentences occur in specific positions • “lead-based” summary (Brandow '95) • Inverse of position in document works well for news • Important information occurs in specific sections of the document (introduction/conclusion)

  25. Classical Methods Position/Location method (Edmundson '69): • Extra points for sentences in specific sections • Make a list of important sections • LIST = “introduction”, “method”, “conclusion”, “results”, ... • Position evidence (Baxendale '58) • First/last sentences in a paragraph are topical • Give extra points to positions: initial | middle | final

  26. Classical Methods Position/Location method (Edmundson '69): • Position depends on the type of text! • “Optimum Position Policy” (Lin & Hovy '97): a method to learn the “positions” which contain relevant information • OPP = { (p1, s2), (p2, s1), (p1, s1), ... } • pi = paragraph number; si = sentence number • The learning method uses documents + abstracts + keywords provided by authors
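
A minimal sketch combining the position heuristics above: inverse of position (lead bias) plus a bonus for paragraph-initial sentences. The weights are illustrative; the OPP approach would instead learn the rewarding positions from a corpus:

```python
def position_scores(sentences, paragraph_starts=frozenset({0})):
    """Lead-based scoring (inverse of position) plus an illustrative
    bonus for paragraph-initial sentences (Baxendale '58)."""
    return [1.0 / (i + 1) + (0.5 if i in paragraph_starts else 0.0)
            for i in range(len(sentences))]
```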

  27. Classical Methods Title method (Edmundson '69): • Hypothesis: the title of a document indicates its content. Therefore, words in the title help to find relevant content. • Create a list of title words, remove “stop words”
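
A minimal sketch of title-word scoring as just described:

```python
def title_scores(sentences, title, stopwords=frozenset()):
    """Score each sentence by how many (non-stop) title words it contains."""
    title_words = {w for w in title.lower().split() if w not in stopwords}
    return [len(title_words & set(s.lower().split())) for s in sentences]
```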

  28. Classical Methods Cue method (Edmundson '69, Paice '81): • Important sentences contain cue words: ‘This paper presents…’ or ‘Results show…’ • Some words are considered bonus, others stigma • bonus: comparatives, superlatives, conclusive expressions, etc. • stigma: negatives, pronouns, etc. • Paice implemented a dictionary of <cue, weight> pairs • Grammar for indicative expressions: In + skip(0) + this + skip(2) + paper + skip(0) + we + ... • Cue words can be learned (Teufel '98)
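
A minimal sketch of cue scoring with a hypothetical <cue, weight> dictionary in the style of Paice; the words and weights below are illustrative, not taken from the slides:

```python
# Hypothetical <cue, weight> dictionary: positive weights for bonus words,
# negative weights for stigma words.
CUES = {"presents": 2.0, "show": 2.0, "significant": 1.0,
        "hardly": -1.0, "impossible": -1.0, "they": -0.5}

def cue_score(sentence):
    """Sum the cue weights of the words in a sentence."""
    return sum(CUES.get(w, 0.0) for w in sentence.lower().split())
```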

  29. Classical Methods Experimental Combination of Features (Edmundson '69): • Linear combination of four features: score(s) = a1·C + a2·K + a3·T + a4·L (C = cue, K = keyword, T = title, L = location) • First the weights a1…a4 are adjusted using manually labeled training data. • Test all possible combinations of features. • Produce summaries. • Evaluate the resulting summaries.
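
The combination itself is a one-liner; the weight search below is a sketch with a hypothetical evaluate callback standing in for Edmundson's comparison of the produced summaries against the manually labeled data:

```python
from itertools import product

def combined_score(C, K, T, L, w):
    """Edmundson-style linear combination of Cue, Key, Title, Location scores."""
    return w[0]*C + w[1]*K + w[2]*T + w[3]*L

def best_weights(evaluate, grid=(0.0, 0.5, 1.0, 2.0)):
    """Exhaustive search over a small illustrative weight grid; `evaluate`
    is a hypothetical callback scoring the summaries produced with a given
    weight vector against manually labeled training data."""
    return max(product(grid, repeat=4), key=evaluate)
```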

  30. Classical Methods Experimental Combination of Features (Edmundson '69): Results obtained • Best system: • cue + title + position • Individual features: • position is best, then • cue • title • keyword

  31. Methodologies for Automatic Summarization Classification of Methods: • Traditional Methods • Term, word, phrase frequencies. • Corpus-based Approaches • Combination of statistical features. • Learning to extract. • Exploiting Discourse Structures • e.g., WordNet, RST • Knowledge-Rich Approaches • For particular domains

  32. Corpus-based Methods Learning to extract: [figure slide; image not preserved in the transcript]

  33. Corpus-based Methods Learning to extract (Kupiec et al. '95): • Extracts of roughly 20% of the original text • Feature set: • sentence length (|S| > 5) • fixed phrases (26 manually chosen) • paragraph (sentence position in paragraph) • thematic words • uppercase words (not common acronyms) • Target label (binary): whether the sentence is included in the manual extract • Corpus: 188 document + summary pairs from scientific journals

  34. Corpus-based Methods Learning to extract (Kupiec et al. '95): • Uses a Bayesian classifier: P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk) • Assuming statistical independence of the features: P(s ∈ S | F1, …, Fk) = [Πj P(Fj | s ∈ S)] · P(s ∈ S) / Πj P(Fj)
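
In code, the classifier reduces to summing a log-probability ratio per feature; the probability tables would be estimated by counting over the training pairs. A sketch under that assumption, not Kupiec et al.'s implementation:

```python
import math

def extract_worthiness(feature_values, p_f_given_in, p_f, prior_in):
    """Log of P(s in S | F1..Fk) up to the constant denominator, under the
    feature-independence assumption. p_f_given_in and p_f map each observed
    feature value to probabilities counted over the training corpus."""
    score = math.log(prior_in)
    for f in feature_values:
        score += math.log(p_f_given_in[f]) - math.log(p_f[f])
    return score
```

Sentences are then ranked by this score and the top ones kept until the target extract length is reached.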

  35. Problems with Extraction Using statistical, (key)word-based, learned classifiers for sentence extraction has limitations: • Lack of cohesion

  36. Problems with Extraction Using statistical, (key)word-based, learned classifiers for sentence extraction has limitations: • Lack of coherence

  37. Problems with Extraction Some solutions: • Rules for the identification of anaphora • Corpus-based heuristics • Aggregation techniques • IF a sentence contains an anaphor THEN include the preceding sentences • Anaphora resolution is more appropriate, but • programs for anaphora resolution are far from perfect • BLAB project (Johnson & Paice '93) • Selection (indicator), rejection, and aggregation rules • Reported success: abstract > aggregation > extract.
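
A minimal sketch of the aggregation rule; the anaphor list is illustrative, and real systems such as BLAB use much richer selection and rejection rules:

```python
ANAPHORS = {"he", "she", "it", "they", "this", "these", "such"}  # illustrative list

def aggregate(selected, sentences):
    """If an extracted sentence starts with an anaphor, pull in the
    preceding sentence as well, to reduce dangling references."""
    out = set(selected)
    for i in selected:
        tokens = sentences[i].lower().split()
        if i > 0 and tokens and tokens[0] in ANAPHORS:
            out.add(i - 1)
    return sorted(out)
```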

  38. Methodologies for Automatic Summarization Classification of Methods: • Traditional Methods • Term, word, phrase frequencies. • Corpus-based Approaches • Combination of statistical features. • Learning to extract. • Exploiting Discourse Structures • e.g., WordNet, RST • Knowledge-Rich Approaches • For particular domains

  39. Exploiting Discourse Structures • Lexical Chain: • A word sequence in a text where the words are related by one of a set of lexical relations (listed on the next slide). • Use: • ambiguity resolution • identification of discourse structure

  40. Exploiting Discourse Structures • WordNet – lexical database • synonymy • dog, canine • hypernymy • dog, animal • antonymy • dog, cat • meronymy (part/whole) • dog, leg

  41. Exploiting Discourse Structures Extract by Lexical Chains (Barzilay & Elhadad '97; Silber & McCoy '02) • A chain C represents a “concept” in WordNet • Financial institution “bank” • Place to sit down in the park “bank” • Sloping land “bank” • A chain is a list of words; the order of the words is that of their occurrence in the text • A noun N is inserted in C if N is related to C

  42. Exploiting Discourse Structures Extract by Lexical Chains (Barzilay & Elhadad '97; Silber & McCoy '02) Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.

  43. Exploiting Discourse Structures Extract by Lexical Chains (Barzilay & Elhadad '97; Silber & McCoy '02) • Compute the contribution of N to C as follows: • If the last element of C is M, identify the relation of N with M • If C is empty, consider the relation to be “repetition” • Compute the distance between N and M in number of sentences (1 if N is the first word of the chain) • The contribution of N is looked up in a table with entries given by type of relation and distance, e.g., hypernymy & distance = 3 ⇒ contribution = 0.5. • How to determine the table entries?

  44. Exploiting Discourse Structures Extract by Lexical Chains (Barzilay & Elhadad '97; Silber & McCoy '02) • After inserting all nouns in chains there is a second step: • For each noun, identify the chain to which it contributes most; delete it from the other chains and adjust the weights • Select sentences that belong to, or are covered by, “strong chains”

  45. Exploiting Discourse Structures Extract by Lexical Chains (Barzilay & Elhadad '97; Silber & McCoy '02) • Strong chain: • weight(C) > thr • thr = average(weight(Cs)) + 2*std(weight(Cs)) • Sentence selection: • H1: select the first sentence that contains a member of a strong chain • H2: select the first sentence that contains a “representative” member of the chain • H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain)
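
A heavily simplified sketch of chain building and strong-chain selection using NLTK's WordNet interface. It tests only repetition, shared synsets, and direct hypernym links, and weights a chain by its length; the actual method uses the relation-and-distance contribution table and disambiguates word senses:

```python
from statistics import mean, pstdev
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def related(a, b):
    """Crude relatedness test: identical nouns, a shared synset,
    or a direct hypernym link between synsets."""
    sa = set(wn.synsets(a, pos=wn.NOUN))
    sb = set(wn.synsets(b, pos=wn.NOUN))
    if a == b or sa & sb:
        return True
    return bool({h for s in sa for h in s.hypernyms()} & sb or
                {h for s in sb for h in s.hypernyms()} & sa)

def build_chains(nouns):
    """Insert each noun (in text order) into the first chain whose
    last member it is related to, else start a new chain."""
    chains = []
    for n in nouns:
        for c in chains:
            if related(n, c[-1]):
                c.append(n)
                break
        else:
            chains.append([n])
    return chains

def strong_chains(chains):
    """Keep chains whose weight exceeds mean + 2 * std (simplification:
    weight = chain length)."""
    w = [len(c) for c in chains]
    thr = mean(w) + 2 * pstdev(w)
    return [c for c, wi in zip(chains, w) if wi > thr]
```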

  46. Exploiting Discourse Structures IR technique (Salton et al. '97) • Vector space model • Similarity metric • Construct a graph of paragraphs; the strength of a link is the similarity metric. • Use a threshold to decide which paragraphs count as similar.

  47. Exploiting Discourse Structures IR technique (Salton et al. '97) • Identify regions where paragraphs are well connected. • Paragraph selection heuristics: • bushy path • depth-first path • segmented bushy path • Co-selection evaluation: • optimistic, pessimistic, union, intersection
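
A minimal sketch of the bushy-path heuristic: build the paragraph graph with cosine similarity over raw term counts and keep the best-connected ("bushiest") paragraphs in text order. The threshold and k values are illustrative:

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def bushy_path(paragraphs, k=3, threshold=0.2):
    """Select the k paragraphs with the most links above the similarity
    threshold, returned in their original text order."""
    vecs = [Counter(p.lower().split()) for p in paragraphs]
    degree = [sum(cosine(u, v) > threshold for j, v in enumerate(vecs) if j != i)
              for i, u in enumerate(vecs)]
    top = sorted(sorted(range(len(vecs)), key=lambda i: -degree[i])[:k])
    return [paragraphs[i] for i in top]
```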

  48. Exploiting Discourse Structures Rhetorical Analysis: • Rhetorical Structure Theory (RST) • Mann & Thompson '88 • Descriptive theory of text organization • Relations between two text spans • Nucleus & satellite (hypotactic) • Nucleus & nucleus (paratactic)

  49. Exploiting Discourse Structures Rhetorical Analysis: • Relations can be marked in the syntax: • John went to sleep because he was tired. • Mary went to the cinema and Julie went to the theatre. • The RST authors say that markers are not necessary to identify a relation. • However, all RST analyzers rely on markers: • “however”, “therefore”, “and”, “as a consequence”, etc.

  50. Exploiting Discourse Structures Rhetorical Analysis: • (A) Smart cards are becoming more attractive • (B) as the price of micro-computing power and storage continues to drop. • (C) They have two main advantages over magnetic strip cards. • (D) First, they can carry 10 or even 100 times as much information • (E) and hold it much more robustly. • (F) Second, they can execute complex tasks in conjunction with a terminal.
