Text Summarization -- In Search of Effective Ideas and Techniques. Shuhua Liu, Assistant Professor, Department of Information Systems, Åbo Akademi University, Finland & University Berkeley. Modified by Shinta P., 2012. Headline news: informing. TV guides: decision making.
What is text summarization?
• To reduce (long) textual information to its most essential points
• To distill the most important information from a source or sources to produce an abridged version of it (Endres-Niggemeyer, 1998; Mani and Maybury, 1999; Spärck-Jones, 1999)
‘Genres’ of Summary?
• Indicative vs. informative: used for quick categorization vs. content processing.
• Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
• Generic vs. query-oriented: provides author’s view vs. reflects user’s interest.
• Background vs. just-the-news: assumes reader’s prior knowledge is poor vs. up-to-date.
• Single-document vs. multi-document source: based on one text vs. fuses together many texts.
Text summarization
• Key issues:
  • How to identify the most important content out of the rest of the text?
  • How to synthesize the substance and formulate a summary text based on the identified content?
• Major approaches:
  • Selection based: produce "extracts"
  • Text understanding based: produce "abstracts"
Shuhua Liu, IIS/IAMSR, ÅA
Selection based summarization: how does it work?
• The most content-bearing sentences or passages are identified and selected to compose a summary.
• Compute a significance value for each sentence (Luhn, 1958; Edmundson, 1969), based on:
  • word frequency
  • the keywords, title words, and cue words it contains
  • the position of the sentence
• RST (Rhetorical Structure Theory) based discourse analysis (Marcu, 1997)
• Passage and sentence similarity analysis (Goldstein et al., 2000; CMU)
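The significance-value idea above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of Luhn and Edmundson, not their exact formulas: the stop-word list, the cue-word list, the equal feature weights, and the function names `score_sentences` and `summarize` are all assumptions made for this example.

```python
import re
from collections import Counter

# Illustrative word lists (assumptions, not taken from the slides).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
              "it", "that", "this", "for", "on", "with", "as", "by"}
CUE_WORDS = {"significant", "important", "conclusion", "summary", "results"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return one significance score per sentence.

    Features: average keyword frequency, title-word overlap,
    cue-word count, and a bonus for first/last position.
    """
    w_key, w_title, w_cue, w_pos = weights
    tokens_per_sentence = [tokenize(s) for s in sentences]
    freq = Counter(t for toks in tokens_per_sentence
                   for t in toks if t not in STOP_WORDS)
    title_words = set(tokenize(title)) - STOP_WORDS
    n = len(sentences)
    scores = []
    for i, toks in enumerate(tokens_per_sentence):
        content = [t for t in toks if t not in STOP_WORDS]
        if not content:
            scores.append(0.0)
            continue
        keyword = sum(freq[t] for t in content) / len(content)
        title_overlap = len(set(content) & title_words)
        cue = sum(1 for t in content if t in CUE_WORDS)
        position = 1.0 if i == 0 or i == n - 1 else 0.0
        scores.append(w_key * keyword + w_title * title_overlap
                      + w_cue * cue + w_pos * position)
    return scores


def summarize(sentences, title, k=2):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in document order is a small design choice that keeps the extract a little more readable than ordering by score.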
MSWord AutoSummarize
Text understanding system
• A text understanding task often aims to recover all of the information that there is in a text, including what is only implicit in what is actually written.
• “All the richness of natural language becomes fair game, including metaphor, metonymy, discourse structure, and the recognition of the author's underlying intentions, and the full interplay between language and world knowledge becomes central to the task.”
Text understanding based summarization
• Depends on complete sentence analysis and discourse analysis with full knowledge support
• Syntactic parser, semantic interpreter
• Linguistic knowledge, world knowledge, domain knowledge
• Reasoning mechanisms that work effectively over huge knowledge collections
Selection based vs. Understanding based
• Selection based: generally applicable, but the content is incoherent and readability is poor because of unclear relationships between the selected text excerpts, dangling references, and so on.
• Understanding based: high precision, but very slow, with a large amount of wasted computation, and highly domain specific.
• Endres-Niggemeyer (2000) found that people sometimes prefer extractive summaries to glossed-over abstractive summaries!
The reality:
• The dominant approach in practice is still selection based.
• Understanding based systems exist only in theory, and will remain so for quite a while.
• However, certain text understanding tasks on a small scale or in restricted domains can be done.
Topic guided text summarization
• Text summarization as a process of topic analysis, passage extraction, text understanding, information integration/fusion, and text generation.
• Passage extraction guided by topic structure is expected to preserve the logical relationships between the extracted text parts: e.g. sentences are arranged logically according to topic structure.
• Topic representation will also be very helpful in the next phase: text analysis and information integration.
Phase 1: Theme detection, topic labels, sentence/passage selection
• Theme detection through pairwise passage similarity analysis
• Vector space model of terms and documents
• TF-IDF: baseline method
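As a rough illustration of the TF-IDF baseline, the sketch below builds sparse TF-IDF vectors for a handful of passages and compares every pair with cosine similarity. The smoothed IDF variant, the choice of cosine as the similarity measure, and the function names are assumptions for this example; the slides do not specify the exact weighting used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(passages):
    """Return one sparse TF-IDF vector (dict: term -> weight) per passage."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Document frequency: number of passages containing each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append({t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, tf in doc.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(passages):
    """Pairwise passage similarity as a list of lists."""
    vectors = tfidf_vectors(passages)
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

Passages that share no terms get a similarity of exactly zero under this model, which is one motivation for latent-space methods such as LSA.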
Passage similarity analysis with LSA method
• LSA (Latent Semantic Analysis)
• Results similar to those obtained with TF-IDF
• Fuzzy LSI approach (Nikravesh, 2002)
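For orientation, the textbook formulation of LSA is a rank-k truncated singular value decomposition of the term-by-passage matrix. The notation below (A for the m-by-n term-by-passage matrix, k for the number of retained dimensions, e_j for the j-th standard basis vector) is introduced here for illustration and is not taken from the slides.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{p}_j = \Sigma_k V_k^{\top} e_j,
\qquad
\operatorname{sim}(p_i, p_j) =
  \frac{\hat{p}_i \cdot \hat{p}_j}
       {\lVert \hat{p}_i \rVert \, \lVert \hat{p}_j \rVert}
```

Here each passage is represented by a k-dimensional vector in the latent space, and passage similarity is the cosine between those vectors rather than between the raw term vectors.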
Passage adjacency matrix (partial)
Passage Relation Map
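One plausible way to get from pairwise similarities to an adjacency matrix and a relation map is to threshold the similarity values and then read off the connected components as passage clusters. The sketch below assumes exactly that; the threshold value and the function names are hypothetical, and the original figures may have been built differently.

```python
def adjacency_matrix(similarity, threshold=0.2):
    """Binary adjacency: passages i and j are linked when their
    similarity is at least `threshold` (self-links are excluded)."""
    n = len(similarity)
    return [[1 if i != j and similarity[i][j] >= threshold else 0
             for j in range(n)]
            for i in range(n)]


def passage_clusters(adjacency):
    """Connected components of the relation map, found by depth-first search.

    Each component is returned as a sorted list of passage indices;
    an unconnected passage forms a cluster of its own.
    """
    n = len(adjacency)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in range(n):
                if adjacency[node][neighbour] and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters
```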
Passage Extraction Rules
• Passage clusters help us to identify themes and topics; unconnected passages form distinct topics covered in a document.
• The MMR (Maximal Marginal Relevance) algorithm (CMU) (Goldstein et al., 2000)
• The sentence/passage closest to the centroid of the cluster is chosen for inclusion in the summary.
• Sentences that are maximally similar to the document and maximally dissimilar to sentences already in the summary are selected to compose a summary.
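The last rule above is the core of MMR, and it can be sketched as a greedy loop. This is a minimal version under stated assumptions: similarities are precomputed, `lam` is a hypothetical trade-off parameter between relevance and redundancy, and the function name `mmr_select` is made up for this example.

```python
def mmr_select(doc_similarity, pairwise_similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    doc_similarity[i]         similarity of sentence i to the whole document
    pairwise_similarity[i][j] similarity between sentences i and j
    lam                       1.0 = relevance only, 0.0 = diversity only
    Returns the indices of the selected sentences in selection order.
    """
    selected = []
    candidates = list(range(len(doc_similarity)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((pairwise_similarity[i][j] for j in selected),
                             default=0.0)
            return lam * doc_similarity[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam` close to 1 the selection reduces to ranking by relevance alone; lower values penalise near-duplicate sentences more strongly.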
Creating theme labels
• Keywords (TF based)
• Word families (semantically related words in a passage cluster)
• Key phrases
  • Linguistic approach
  • Statistical + simple heuristics (Kelledy and Smeaton, 1997): seems quite effective
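The simplest of these options, TF-based keywords, could look like the sketch below: count term frequencies over all passages in a cluster and use the most frequent content words as the label. The stop-word list and the function name `theme_label` are assumptions for illustration only.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
              "as", "for", "on"}


def theme_label(cluster_passages, n_terms=3):
    """Return the n_terms most frequent content words in a passage cluster."""
    counts = Counter(
        token
        for passage in cluster_passages
        for token in re.findall(r"[a-z]+", passage.lower())
        if token not in STOP_WORDS
    )
    return [term for term, _ in counts.most_common(n_terms)]
```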
Next step
WordNet, since 1985
• Lexical database developed at Princeton University, led by George Miller
• Hand-coded, freely available
• Word knowledge of: nouns, verbs, adjectives, adverbs
• Semantic network representation with only a few semantic relations:
  • Synonym, hypernym
  • Categorization relation: Is-a
• Widely used in query expansion, word similarity determination (based on synsets)
ConceptNet, MIT Media Lab
• Common sense knowledge base with NLP capability
• Extracted automatically from common sense knowledge expressed in semi-structured natural-language sentences from OMCSNet (Open Mind Common Sense), applying about 50 extraction rules
• "The Effect of [falling off a bike] is [you get hurt]."
• "A lime is a very sour fruit" at OMCS is extracted into two assertions: IsA (lime, fruit), PropertyOf (lime, very sour)
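To make the idea of pattern-based extraction concrete, here is a toy sketch with two regular-expression rules modelled on the two example sentences above. These are not ConceptNet's actual rules: the patterns, the relation name `EffectOf`, the argument order, and the function name `extract_assertions` are all assumptions for illustration.

```python
import re

# Hypothetical rule 1: "The effect of X is Y."
EFFECT_RULE = re.compile(
    r"^The effect of (?P<cause>.+?) is (?P<effect>.+?)\.?$", re.IGNORECASE)

# Hypothetical rule 2: "A X is a [property] Y."
ISA_RULE = re.compile(
    r"^An? (?P<concept>\w+) is an? (?:(?P<prop>.+) )?(?P<category>\w+)\.?$",
    re.IGNORECASE)


def extract_assertions(sentence):
    """Return a list of (relation, arg1, arg2) tuples for one sentence."""
    s = sentence.strip()
    m = EFFECT_RULE.match(s)
    if m:
        return [("EffectOf", m.group("cause"), m.group("effect"))]
    m = ISA_RULE.match(s)
    if m:
        concept = m.group("concept").lower()
        assertions = [("IsA", concept, m.group("category").lower())]
        if m.group("prop"):
            # A modifier before the category becomes a property assertion.
            assertions.append(("PropertyOf", concept, m.group("prop").lower()))
        return assertions
    return []  # no rule matched
```

Sentences that match no rule simply yield no assertions, which is how a rule set of this kind stays conservative on free-form input.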
ConceptNet (Liu and Singh, 2004a, 2004b)
• Inference
• Spreading activation: node activation radiating outward from an origin node
  • GetContext (node)
  • GetAnalogousConcept (node)
• Graph traversal:
  • FindPathBetweenNodes (node1, node2)
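The two inference styles can be illustrated on a tiny hand-made graph. This is a simplified sketch, not the ConceptNet implementation: the graph contents, the per-hop decay factor, the hop limit, and the Python function names (chosen to echo GetContext and FindPathBetweenNodes) are assumptions, and real spreading activation would typically also weight edges by relation type.

```python
from collections import deque

# Tiny illustrative graph (made-up edges, not ConceptNet data).
GRAPH = {
    "lime": ["fruit", "sour"],
    "fruit": ["lime", "food", "tree"],
    "sour": ["lime", "taste"],
    "food": ["fruit", "eat"],
    "tree": ["fruit"],
    "taste": ["sour"],
    "eat": ["food"],
}


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Simplified spreading activation: activation starts at 1.0 on the
    origin and is multiplied by `decay` for every hop outward."""
    activation = {origin: 1.0}
    frontier = [origin]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in activation:
                    activation[neighbour] = activation[node] * decay
                    next_frontier.append(neighbour)
        frontier = next_frontier
    del activation[origin]
    # Strongest activation first; ties broken alphabetically.
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                if neighbour == goal:
                    path = [goal]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None
```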
ConceptNet (Liu and Singh, 2004a, 2004b)
• Support
  • Topic sensing
  • Query expansion
  • Semantic similarity of words
  • Lexical generalization
  • Thematic generalization
• Much needs to be examined
• Uncontrolled vocabulary, which can be biased in terms of content, but the knowledge seems quite reliable
Topic-Sensing
Eurovoc: multilingual thesaurus
• Controlled vocabulary, 20 languages, broad fields:
  • politics, international relations, European Communities, law, economics, trade, finance, social questions, education, science, international organizations, employment and working conditions
  • industry, business and competition, production, technology and research
  • transport, environment, energy
  • agriculture, forestry and fisheries, agri-foodstuffs
  • geography