
A summarization Journey … From Extraction to Abstraction


Presentation Transcript


  1. A summarization Journey…From Extraction to Abstraction Vasudeva Varma www.iiit.ac.in/~vasu

  2. About IIITH->LTRC->SIEL • IIIT Hyderabad is a 12-year-young research university • Research that makes a difference – to society and industry • It was set up as a not-for-profit public-private partnership (NPPP) and is the first IIIT to be set up under this model in India • IIIT-H is organized as research centres and labs – not departments • Providing multi-disciplinary teams to tackle research problems • IIIT-H hosts one of the largest research groups in the country working in NLP, speech and computer vision • Combines pioneering research with top-class education • IIIT-H faculty have won many awards, including in document summarization, RoboCup and eGovernance (Stockholm Challenge) …

  3. Research Centres/Labs • Technology • Communications (CRC) • Data Engineering (CDE) • Language Technologies (LTRC) • Natural Language Processing & Machine Translation (NLP-MT) • Search and Information Extraction (SIEL) • Speech • Anusaaraka • Robotics (RRC) • Security, Theory and Algorithms (C-STAR) • Software Engineering (SERL) • Visual Information Technology (CVIT) • VLSI and Embedded Systems (C-VEST) • Compilers (CL)

  4. Research Centres/Labs (contd.) • Domains • Agriculture and Rural Development (ARD) • Building Science (CBS) • Cognitive Science (CS) • Computational Linguistics (see under LTRC) • Computational Natural Sciences and Bioinformatics (CCNSB) • Earthquake Engineering (EERC) • Education (cITe) • Education Technology and Learning Sciences (CETLS) • Exact Humanities (CEH) • Power Systems (PSRC) • Spatial Informatics (LSI) • Development Centers • Engineering Technology and Innovation Centre (ENTICE) • Innovation and Entrepreneurship (CIE) • Open Software (COS) • Societal and Human Applications of Artificial Intelligence (SAHAAI)

  5. About IIITH->LTRC->SIEL • About the Language Technologies Research Centre (LTRC) • One of the largest groups in South Asia working on NLP (about 175 researchers) • Four labs • Core NLP/Machine Translation • Speech • Search and Information Extraction • Anusaaraka • Synergy within the centre • Closely working with various other centres and groups

  6. About IIITH->LTRC->SIEL • Industry focus • Technology transfers to Amazon.com, Nokia, TCS, ADRIN, Department of Space, Intel, Rediff.com, Zicorp • Government funding: DST, MCIT, Dept of Space • Industry funding: Amazon.com, AOL, TCS, Yahoo, Nokia, Rediff.com, Intel, several start-ups • Major achievements: • #1 in automatic summarization (DUC 2006, 2007) • #1 in Squishy QA task (TAC 2008) • #1 in Knowledge Base Population task (TAC 2009) • #1 in guided summarization (TAC 2010) • India’s first cross-language search engine • Only academic group from India to present in the WWW developer track • First team from India to participate in CLEF, DUC, TAC

  7. About IIITH->LTRC->SIEL • Major Research areas • Summarization • Cross Language Information Access • Indian Language Search • Enterprise Search • Question Answering • Semantic Web • Distributed and Large Scale IR – Cloud Computing • Computational Advertising • Published in: WWW, ACL, SIGIR, CIKM, ECIR, OOPSLA, CICLing, IJCNLP, COLING, NAACL, RANLP, ....

  8. Information Overload • Explosive growth of information on the web • Failure of information retrieval systems to satisfy the user’s information need • Need for sophisticated information access solutions

  9. Summarization • A summary is a condensed version of a source document, having a recognizable genre and a very specific purpose: to give the reader an exact and concise idea of the contents of the source.

  10. Summaries Can Help !

  11. Flavors of Summarization • Single document • Query-independent MDS • Query-focused MDS • Cross-language • Personalized • Progressive • Guided • Comparative • Code • Opinion/Sentiment • Squishy Question Answering

  12. Towards Abstraction • Single Document, Query-Focused Multi-Document Summarization • Personalized, Cross-Lingual Summarization • Progressive Summarization • Blog Summarization • Abstractive: Guided Summarization, Code Summarization, Comparison Summarization

  13. Underlying Technology

  14. Extractive Summarizers

  15. Single Document Summarization • Graph-based approach: logical analysis and graph clustering • Pipeline: Document → Text Analysis (Text Normalization → Sentence Marker → Logical Analysis → Parsing of sentences) → Document graph generation → Summary Generation (Graph clustering into topics → Graph scoring → Sentence transformation rules → Sentence selection) → Summary • The system generates the summary by understanding the logical structure of the sentences • Scoring of nodes and relations is based on how central they are to the whole document • Able to identify important sentences more accurately than purely statistical techniques
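The graph-based idea on this slide can be sketched in miniature: sentences become graph nodes, word overlap stands in for edge weight, and degree centrality stands in for the graph-scoring step. This is a hypothetical simplification – the actual system described here relies on logical analysis and graph clustering, which are not reproduced below.

```python
# Minimal sketch of graph-based extractive summarization (assumptions:
# Jaccard word overlap as edge weight, degree centrality as node score).
def word_overlap(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)  # Jaccard similarity

def summarize(sentences, k=1):
    # Score each sentence by its total similarity to all other sentences
    # (a degree-centrality stand-in for full graph clustering/scoring).
    scores = [
        sum(word_overlap(s, t) for j, t in enumerate(sentences) if i != j)
        for i, s in enumerate(sentences)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    # Preserve original document order in the extract.
    return [sentences[i] for i in sorted(ranked[:k])]
```

A sentence sharing vocabulary with many others is treated as central and selected first.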

  16. Query Focused Summarization • Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system • Query-dependent ranking: Relevance-Based Language Models (RBLM), PHAL language models • Query-independent ranking: sentence prior

  17. RBLM (Relevance-Based Language Model) is an IR approach that computes the conditional probabilities of relevance from the document and the query • Overcomes the sparseness problem of document language models • PHAL is a probabilistic extension to HAL (Hyperspace Analogue to Language) spaces • HAL constructs dependencies of a term w on other terms based on their occurrence in its context in the corpus
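The HAL construction mentioned here can be sketched as a windowed co-occurrence count. The window size and the distance weighting below are illustrative assumptions; the slides do not give the exact parameters, and real HAL spaces are directional.

```python
from collections import defaultdict

# Sketch of a HAL-style co-occurrence space: each term accumulates
# weighted counts of the terms appearing in a sliding window before it.
def hal_space(tokens, window=2):
    cooc = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            d = i - j
            # Closer context words contribute more weight (window - d + 1).
            weight = window - d + 1
            cooc[w][tokens[j]] += weight
            cooc[tokens[j]][w] += weight  # symmetric simplification
    return cooc
```

The resulting vectors capture which terms a word depends on, which PHAL then turns into probabilities.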

  18. Sentence prior captures the importance of a sentence explicitly, using pseudo-relevant documents (Web, Wikipedia) • Based on domain knowledge, background information, centrality • Log-linear relevance • Information measure in a sentence: entropy is a measure of the information contained in a message
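The entropy-style information measure can be sketched as follows. The slide does not give the exact formula, so this is an assumption: a sentence is scored by the average negative log-probability of its words under a corpus unigram model, so sentences with rarer (more informative) words score higher.

```python
import math

# Hedged sketch of an information measure for a sentence, using
# add-one-smoothed corpus unigram probabilities (formula assumed,
# not taken from the slides).
def sentence_information(sentence, corpus_counts, total):
    words = sentence.lower().split()
    if not words:
        return 0.0
    vocab = len(corpus_counts)
    score = 0.0
    for w in words:
        p = (corpus_counts.get(w, 0) + 1) / (total + vocab)
        score += -math.log2(p)  # rarer words carry more bits
    return score / len(words)
```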

  19. DUC 2005 and 2006 Performance • 38 systems participated in 2006 • Significant difference between the first two systems • 5th rank in linguistic quality

  20. Extract vs. Abstract Summarization • We conducted a study (2005) • Generated the best possible extracts • Calculated the scores for these extracts • Evaluated them with respect to the reference summaries

  21. Cross Lingual Summarization • A bridge between CLIR and MT • Extended our mono-lingual summarization framework to a cross-lingual setting in RBLM framework • Designed a cross-lingual experimental setup using DUC 2005 dataset • Experiments were conducted for Telugu-English language pair • Comparison with mono-lingual baseline shows about 90% performance in ROUGE-SU4 and about 85% in ROUGE-2 f-measures
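ROUGE-2, used in the comparison above, counts bigram overlap between a candidate summary and a reference. A recall-only, single-reference sketch (the official ROUGE toolkit also computes precision, F-measure, and multi-reference jackknifing):

```python
from collections import Counter

def bigrams(tokens):
    # Multiset of adjacent word pairs.
    return Counter(zip(tokens, tokens[1:]))

# Minimal ROUGE-2 recall: fraction of reference bigrams that also
# appear in the candidate (clipped by Counter intersection).
def rouge2_recall(candidate, reference):
    cand, ref = bigrams(candidate.split()), bigrams(reference.split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())
    return overlap / total
```

ROUGE-SU4 works analogously but over skip-bigrams with a gap of up to four words plus unigrams.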

  22. Cross Lingual Summarization

  23. Progressive Summarization • An emerging area of research in summarization • Summarization with a sense of prior knowledge • Introduced as “Update Summarization” at DUC 2007, TAC 2008, TAC 2009 • Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles • Applications: keeping track of news stories, reviews of products

  24. Key Challenge • To detect information that is not only relevant but also new, given the prior knowledge of the reader • Distinguishing: relevant and new vs. non-relevant and new vs. relevant and redundant

  25. Novelty Detection • Identifying sentences containing new information (novelty detection) in a cluster of documents is the key to progressive summarization • Shares similarity with the Novelty track at TREC from 2002–2004 • Task 1: Extract relevant sentences from a set of documents for a topic • Task 2: Eliminate redundant sentences from the relevant sentences • Progressive summarization differs in that it produces a summary from the novel sentences (which requires scoring and ranking)

  26. Three-Level Approach to Novelty Detection • Sentence scoring • Developing new features that capture novelty along with the relevance of a sentence • NF, NW • Ranking • Sentences are re-ranked based on the amount of novelty they contain • ITSim, CoSim • Summary generation • A selected pool of sentences that contain novel facts; all remaining sentences are filtered out
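The cosine-similarity side of this (CoSim) can be sketched as a novelty filter: a sentence counts as novel only if its maximum similarity to every previously read sentence stays below a threshold. The threshold value and bag-of-words representation are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity between two sentences.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Keep only sentences sufficiently dissimilar from all prior sentences.
def novel_sentences(candidates, prior, threshold=0.6):
    return [s for s in candidates
            if all(cosine(s, p) < threshold for p in prior)]
```

In the full system these filtered sentences would then be scored and ranked before summary generation.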

  27. Evaluations • TAC 2008 Update Summarization data for training: 48 topics • Each topic divided into clusters A and B with 10 documents each • The summary for cluster A is a normal summary; the summary for cluster B is an update summary • TAC 2009 Update Summarization data for testing: 44 topics • The baseline summarizer generates a summary by picking the first 100 words of the last document • Run1 – DFS + SL1 • Run2 – PHAL + KL

  28. Personalized Summarization • Perception of a text differs with the background of the reader • Need to incorporate the user’s background in the summarization process • Summarization is not only a function of the input text but also of the reader

  29. Estimate a user model P(w|Mu) to incorporate the user in the sentence extraction process • Experiments: 5 users, 25-document clusters • Each user was asked to give a relevance score to the summary on a 5-point scale • Web-based profile creation: personal information available on the web – a conference page, a project page, an online paper, or even a weblog
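One plausible way to use P(w|Mu) in extraction is to interpolate it with a document model when scoring sentences. The interpolation weight, the floor probability, and the scoring form below are all assumptions for illustration; the slides specify only that a user model is estimated.

```python
# Hedged sketch: score a sentence under a mixture of the user model
# P(w|Mu) and a document model, so words the user cares about lift
# the sentence's score. All parameter values are illustrative.
def personalized_score(sentence, doc_model, user_model, lam=0.5):
    score = 1.0
    for w in sentence.lower().split():
        # Small floor (1e-6) stands in for proper smoothing.
        p = lam * user_model.get(w, 1e-6) + (1 - lam) * doc_model.get(w, 1e-6)
        score *= p
    return score
```

With this scheme, two readers with different profiles rank the same candidate sentences differently.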

  30. Evaluation • Average scores for different users • Scores for different topics for a user

  31. Comparative Summarization • Summaries for comparing multiple items belonging to a category • The category “Mobile phones” will have “Nokia”, “BlackBerry” as its items • Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. “Memory”, “Display”, “Battery Life”

  32. Comparative Summaries Generation • Attribute extraction • Find the attributes of the product class • Attribute ranking • Rank the attributes according to their importance in the comparison • Summary generation • Find the occurrences of the attributes in the various products
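The attribute-ranking step can be sketched with a simple document-frequency heuristic: attributes mentioned for more products in the category rank higher. The actual ranking model is not specified on the slide; single-word attribute matching and the example data below are assumptions.

```python
from collections import Counter

# Sketch: rank candidate attributes by how many products' review texts
# mention them (multi-word attributes would need phrase matching).
def rank_attributes(product_reviews, attributes):
    # product_reviews: {product_name: review_text}
    df = Counter()
    for text in product_reviews.values():
        seen = set(text.lower().split())
        for a in attributes:
            if a in seen:
                df[a] += 1
    # Unmentioned attributes are dropped entirely.
    return [a for a, _ in df.most_common()]
```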

  33. Guided Summarization • Query-focused summarization • The user’s information need is expressed as a query along with a narrative • A set of documents related to the topic • The goal is to produce a short, coherent summary focused on answering the query • Guided summarization • Each topic is classified into a set of predefined categories • Each category has a template of important aspects about the topic • The summary is expected to answer all the aspects of the template while containing other relevant information

  34. [Diagram: documents and a query feed the summarizer, which produces a guided summary answering When, What, Where, Who and How]

  35. Guided Summarization • Encourages deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts • Shares similarity with information extraction • Specific information from unstructured text is identified and consequently classified into a set of semantic labels (templates) • Makes information more suitable for other information processing tasks • A guided summarization system has to produce a readable summary encompassing all the information in the templates • Very few investigations have explored the potential of merging summarization with information extraction techniques

  36. Our approach • Building a domain model • Essential background knowledge for information extraction • Sentence Annotations • To identify sentences having answers to aspects of template • Concept Mining • To use semantic concepts instead of words to calculate sentence importance • Summary Extraction • Modification of summary extraction algorithm to adapt to the requirements using sentence annotations
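The sentence-annotation step above can be sketched as matching sentences against trigger words for each template aspect. The aspect names and trigger lists below are hypothetical; a real system would use the learned domain model, not hand-picked word lists.

```python
# Hypothetical template for a topic category: each aspect maps to
# illustrative trigger words (assumed, not from the slides).
TEMPLATE = {
    "WHEN": {"monday", "yesterday", "2009", "morning"},
    "WHERE": {"city", "coast", "region", "province"},
    "CASUALTIES": {"killed", "injured", "dead", "died"},
}

# Annotate a sentence with every aspect whose triggers it mentions;
# annotated sentences can then be favored during summary extraction.
def annotate(sentence, template=TEMPLATE):
    words = set(sentence.lower().split())
    return sorted(aspect for aspect, triggers in template.items()
                  if words & triggers)
```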

  37. Run1 was successful in producing informative summaries for cluster A • Ranked first in all evaluation metrics, including Pyramid and ROUGE • The difficulty of the task depends on the type of category: summarizing Health and Safety or Endangered Resources is relatively hard

  38. Knowledge Base Population

  39. • Inconsistency • Incompleteness • Accuracy of facts • Novel information • Cost of manual effort • Solution: automatically updating the information of the entities in knowledge bases

  40. Knowledge Base Population • Summarization and KBP are complementary tasks • Summaries help in filling the slot values more effectively • Slot values enhance the quality of guided summaries • Knowledge Base Population can be fundamentally broken down into two sub-problems • Entity Linking: linking entity mentions in documents to Knowledge Base nodes • Slot Filling: extracting attribute information for query entities
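The entity-linking sub-problem can be sketched as matching a mention against knowledge-base node names, with a NIL answer when nothing matches. Real entity linkers use context disambiguation; the string-matching back-off and the example names below are assumptions for illustration.

```python
# Sketch of entity linking: exact name match first, then a weak
# substring back-off, else NIL (the convention for "not in the KB").
def link_entity(mention, kb_nodes):
    m = mention.lower()
    for node in kb_nodes:
        if node.lower() == m:
            return node
    for node in kb_nodes:
        if m in node.lower() or node.lower() in m:
            return node
    return "NIL"
```

Ambiguous mentions (several plausible nodes) are exactly where context from a summary could help, which is the complementarity this slide points at.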

  41. [Diagram: an entity and the Web feed the summarizer, which produces a guided summary answering When, What, Where, Who and How]

  42. Thank You – Questions? Vasudeva Varma vv@iiit.ac.in www.iiit.ac.in/~vasu
