
Comparing Redundancy Removal Techniques for Multi–Document Summarisation


Presentation Transcript


  1. Comparing Redundancy Removal Techniques for Multi–Document Summarisation By Frank

  2. Index • 1 Introduction • 2 Text Summarisation • 3 Multi–Document Summarisation • 4 Redundancy Removal Techniques • 5 Evaluation Methodology • 6 Results and Conclusions • 7 Future Work

  3. 1 Introduction • Document summarisation systems automatically produce summaries of full-length documents. • Summarisation systems can be broadly divided into two categories: the single-document summariser and the more complex multi-document system.

  4. 1 Introduction • Multi–document systems can be viewed as augmented single–document summarisers. • In extending a single–document summariser, the first step we have chosen is to implement Redundancy Removal. • This paper describes our experiment to determine the efficacy of certain similarity measures.

  5. 2 Text Summarisation • Automatic summarisation is the process by which the information in a source text is expressed in a more concise fashion, with a minimal loss of information. • Summaries can be either extractive or abstractive. • An extractive summary is built from sentences, phrases or words (depending on the degree of compression required) extracted from the source text. • Abstractive summaries, on the other hand, involve an analysis of the text at a deeper level to determine the subject matter, and thereafter a reformulation of the text to form a coherent passage, resulting in an accurate summary.

  6. 2 Text Summarisation • Topic Detection • e.g. in the TREC Topic Detection and Tracking tasks, a Topic Detector must be able to find the topic of a text based on previously–seen articles, and to modify its representation of the theme where any changes in focus or fact may have occurred due to the temporal nature of the news domain. • The next step attempts to reduce the size of this set without any reduction in information content. This is achieved by removing the sentences which duplicate information given in other sentences.
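The duplicate-removal step just described can be sketched as a greedy filter. This is a minimal illustration; the Jaccard word-overlap measure used here is only a stand-in for the similarity measures compared later in the talk, and the threshold is an arbitrary choice:

```python
# Greedy redundancy removal: drop any sentence whose similarity to an
# already-kept sentence exceeds a threshold.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def remove_redundant(sentences, threshold=0.6):
    kept = []
    for s in sentences:
        # Keep a sentence only if it is not too similar to anything kept so far.
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

sents = [
    "The volcano erupted on Monday morning",
    "The volcano erupted early on Monday",
    "Thousands of residents were evacuated",
]
print(remove_redundant(sents))  # the near-duplicate second sentence is dropped
```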

  7. 2 Text Summarisation • The text reformulation stage is concerned with reconstituting the extracted sentences into a coherent summary: • a chronology must be established; • anaphors will have to be resolved, to ensure that any pronouns used have a referent in the summary; • contradiction resolution may be required, if two or more source documents contain conflicting “facts”.

  8. 3 Multi–Document Summarisation • Sentence ordering: an ordering is derived from any textual clues that may be available. • Contradiction resolution: articles written by different groups, from different viewpoints, or at different times may differ significantly in the information presented.

  9. 4 Redundancy Removal Techniques • SimFinder is a clustering tool that finds the similarity between text units using a statistical measure which incorporates various linguistic features. • Barzilay describes a very interesting rule-based approach to paraphrase identification in Information Fusion for Multi-Document Summarization.

  10. 4 Redundancy Removal Techniques • WordNet Distance • In this WordNet–based measure, we define the similarity between two sentences as the sum of the maximal similarity scores between component words in WordNet, using the Leacock-Chodorow measure.
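As a toy illustration of the word-level measure, here is the Leacock-Chodorow formula, sim_LC(c1, c2) = −log(len(c1, c2) / 2D), applied to a hand-built hypernym tree. The taxonomy and concept names are invented for the example, not taken from WordNet, and the path length is counted in edges:

```python
import math

# Hand-built hypernym tree: child -> parent (a tiny stand-in for WordNet).
HYPERNYM = {
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
    "dog": "canine", "canine": "carnivore", "animal": "entity",
}

def path_to_root(concept):
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_length(c1, c2):
    """Shortest path between two concepts, in edges, via their lowest common ancestor."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node)
    return None

def lch_similarity(c1, c2, depth):
    length = path_length(c1, c2)
    # Floor at 1 to avoid log(0) for identical concepts.
    return -math.log(max(length, 1) / (2.0 * depth))

# Maximum depth of the toy taxonomy (cat -> ... -> entity is 4 edges).
D = max(len(path_to_root(c)) - 1 for c in HYPERNYM)
print(round(lch_similarity("cat", "dog", D), 3))  # -log(4/8) ≈ 0.693
```

The slide's sentence-level score would then sum, over the words of one sentence, each word's maximal score against the words of the other.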

  11. 4 Redundancy Removal Techniques • Cosine Similarity • a vector-based, zero–knowledge method for measuring sentence similarities • A count of all the words used in the sentences of the corpus is calculated. This count provides us with the information to construct the term space, an n-dimensional vector space, where n is the number of unique terms found in the corpus. With a vector representation of all of the sentences of the corpus, we can take a simple measure of the similarity of any pair of sentences by looking at the size of the angle between their vectors; the smaller the angle, the greater the similarity.
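The procedure above can be sketched directly. This is a minimal illustration using raw term counts with no weighting; the corpus is invented:

```python
import math

# Build the term space from the corpus vocabulary, represent each sentence
# as a term-count vector, and compare sentences by the cosine of the angle
# between their vectors (cosine 1.0 = identical direction).

def build_vocab(sentences):
    return sorted({w for s in sentences for w in s.lower().split()})

def vectorise(sentence, vocab):
    words = sentence.lower().split()
    return [words.count(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

corpus = ["the cat sat on the mat", "the cat sat on the rug", "stocks fell sharply"]
vocab = build_vocab(corpus)
vecs = [vectorise(s, vocab) for s in corpus]
print(round(cosine(vecs[0], vecs[1]), 3))  # near-duplicates score high: 0.875
print(round(cosine(vecs[0], vecs[2]), 3))  # unrelated sentences score 0.0
```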

  12. 4 Redundancy Removal Techniques • Latent Semantic Indexing • Latent Semantic Indexing takes as input a term–document matrix constructed in exactly the same way as for Cosine Similarity. • Before applying a similarity measure between the vectors, an information spreading technique known as Singular Value Decomposition is applied to the matrix. This technique scrutinises the term–document matrix for significant levels of word co-occurrence and modifies magnitudes along appropriate dimensions (i.e. scores for particular words) accordingly. In terms of matrices and vectors, SVD can be crudely expressed as finding the dimensions which lie close together (co-occur) and compressing them onto one composite dimension.
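A minimal sketch of that pipeline, using NumPy's SVD on a toy term-sentence matrix; the matrix and the choice of k = 2 latent dimensions are invented for illustration:

```python
import numpy as np

# LSI: decompose the term-document matrix, keep only the k strongest
# singular values, and compare documents in the reduced space, where
# co-occurring terms have been folded onto shared dimensions.

def lsi_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Return rank-k document vectors (columns of term_doc are documents)."""
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each column of diag(S_k) @ Vt_k is one document in the latent space.
    return (np.diag(S[:k]) @ Vt[:k]).T

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Tiny term-sentence count matrix (rows: terms, columns: sentences).
A = np.array([
    [1, 1, 0],   # "ship"
    [1, 0, 0],   # "boat"
    [0, 1, 1],   # "ocean"
    [0, 0, 1],   # "wood"
], dtype=float)

docs = lsi_vectors(A, k=2)
print(round(cosine(docs[0], docs[1]), 3))
```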

  13. 5 Evaluation Methodology • Precision is a measure of the proportion of retrieved sentences that are relevant. • Recall is a measure of the proportion of relevant sentences that were retrieved. • A high precision score means that most of the retrieved sentences are relevant, and high recall implies that most of the relevant sentences in the corpus were retrieved.
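The two definitions reduce to simple set arithmetic over sentence identifiers; the IDs below are made up for illustration:

```python
# Precision looks at what was retrieved; recall at what should have been.

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved sentences that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant sentences that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4}
relevant = {2, 3, 5}
print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were found
```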

  14. 6 Results and Conclusions • The poor Precision/Recall performance of WordNet can be accounted for by its very high error rate. • There is no difference in performance between Cosine Similarity and Latent Semantic Indexing. • An analysis of the results suggests that the size of the corpus is the cause of the poor performance of LSI.

  15. 7 Future Work • Schiffman describes a system for determining possible candidate sentences for the summary. Another feature considered: verb specificity. • Maximal Marginal Relevance attempts to eliminate redundancy in the set of retrieved texts by giving a low score to any texts which are highly similar to previously–seen items. • Allan describes a system which attempts to make temporal summaries of news streams.
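Maximal Marginal Relevance can be sketched as a greedy loop (after Carbonell and Goldstein). The word-overlap similarity, λ = 0.5, and the example sentences are illustrative choices, not settings from the slides:

```python
# MMR: at each step pick the candidate that balances relevance to the
# query against similarity to items already selected.

def sim(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in for a real similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_select(query, candidates, n, lam=0.5):
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        def score(s):
            # Penalise redundancy against the most similar selected item.
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * sim(s, query) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

query = "volcano eruption evacuation"
cands = [
    "the volcano eruption forced an evacuation",
    "the volcano eruption forced a huge evacuation",
    "rescue teams reached the evacuation shelters",
]
print(mmr_select(query, cands, n=2))  # the near-duplicate second candidate is skipped
```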

  16. Thanks. Any questions?

  17. Leacock-Chodorow measure • Like the HS algorithm, this algorithm also computes similarity from the length of the path between c1 and c2. The difference is that the LC algorithm concentrates on is-a (inheritance) relations, and introduces the overall taxonomy depth D to scale the path length. This idea is expressed by the formula: sim_LC(c1, c2) = −log( len(c1, c2) / 2D ) • Here, sim stands for similarity; LC abbreviates the algorithm's name; c1 and c2 are synsets of the two words; len(c1, c2) is the length of the path connecting c1 and c2; and D is the overall taxonomy depth.

  18. Hirst-St-Onge algorithm • Hirst and St-Onge are the two proposers of this algorithm, after whom it is named. Its basic idea is that if two words are sufficiently close in meaning, the following must hold: the path in WordNet connecting the two words' synsets must not be too long, and the number of changes of direction along that path must not be too large. This idea is expressed by the formula: rel_HS(c1, c2) = C − path_length − k · d • Here, rel stands for relation, i.e. relatedness; HS abbreviates the algorithm's name; c1 and c2 are synsets of the two words; C and k are constants; path_length is the length of the path between c1 and c2; and d is the number of changes of direction along the path. If no path exists, the score is 0, indicating that c1 and c2 are unrelated.
