
DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES

Tunga Güngör, Boğaziçi University, Computer Engineering Dept., Istanbul, Turkey (Visiting Professor at TALP Research Center, UPC)



Presentation Transcript


  1. DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul, Turkey (Visiting Professor at TALP Research Center, UPC)

  2. OUTLINE • INTRODUCTION • LITERATURE SURVEY • Search Engines and Query Types • Automatic Analysis of Documents • Automatic Summarization • OVERVIEW OF METHODOLOGY • System Architecture • Implementation • Data Collection • STRUCTURAL PROCESSING • Rule-based Approach • Machine Learning Approach • SUMMARY EXTRACTION • DISCUSSION • FUTURE RESEARCH

  3. INTRODUCTION

  4. Introduction • Rapid growth of information sources • World Wide Web • “information overload” • 50% of documents viewed in search engine results • not relevant (Jansen and Spink, 2005) • Users are interested in different types of search • rather than queries with commonplace answers • e.g. capital city of Sweden • specific and complex queries • e.g. best countries for retirement • tasks such as background search • e.g. literature survey on Mexican air pollution

  5. Introduction (cont.) • Available search engines • results in response to a user query • each presented with a short ‘summary’ • 2-3 line extracts • document fragments containing query words • fail to reveal their context within the whole document • The users • scroll down the results • click those that seem relevant to their real information need • inadequate summaries • missing relevant documents • spending time with irrelevant documents • not feasible to open each link

  6. Example Output of Google

  7. Introduction (cont.) • Automatic summarization • as successful as humans • long-term research direction (Sparck Jones, 1999) • improve effectiveness of other tasks • e.g. information retrieval • Traditionally, automatic summarization research: • general-purpose summaries • e.g. the “abstract page” of a report • But, need to bias towards user queries • in an information retrieval paradigm • a document is seen as a flat sequence of sentences • ignoring the inherent structure • But, Web documents • complex organization of content • sections and subsections with different topics and formatting

  8. Research Goals • a novel summarization approach for Web search • combining these two aspects • Document structure • Query-biased techniques • not investigated together in previous studies • Intuition • providing the context of searched terms • preserving the structure of the document • Sectional hierarchy and heading structure • may help the users to determine the relevancy of results better • Two-stage approach • Structural processing • Summary extraction

  9. Research Goals (cont.) • Web documents • no domain restriction • typically heterogeneous • images, text in different formats, forms, menus, etc. • diverse content • with sections on different topics, advertisements, etc. • Structural and semantic analysis of Web documents • Heading-based sectional hierarchy • Use of this structural and semantic information • during summarization process • in the output summaries • query-biased techniques

  10. Part of an Example Web Document

  11. LITERATURE SURVEY

  12. Search Engines • Information retrieval (IR) • storage, retrieval and maintenance of information • differences on the Web • distributed architecture • the heterogeneity of the available information • its size and growth rate, etc. • Search engine • allows the user to enter search terms (queries) • run against a database • retrieves Web pages that match the search terms

  13. Query Types • Boolean search • keywords separated by (implicit or explicit) Boolean operators • Phrase search • a set of contiguous words • Proximity search • Range searching • Field searching • Natural language search • Thesaurus search • Fuzzy search
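The first three query types above can be illustrated with a minimal sketch over a single document string. The function names and the whitespace-and-punctuation tokenizer are assumptions made for illustration; real search engines use inverted indexes rather than scanning text.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def boolean_and(query, doc):
    """Boolean search with an implicit AND: every query term must occur."""
    return set(tokenize(query)) <= set(tokenize(doc))

def phrase_match(phrase, doc):
    """Phrase search: the phrase tokens must occur contiguously, in order."""
    p, d = tokenize(phrase), tokenize(doc)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

def proximity_match(a, b, doc, window):
    """Proximity search: terms a and b occur within `window` tokens of each other."""
    d = tokenize(doc)
    pos_a = [i for i, t in enumerate(d) if t == a.lower()]
    pos_b = [i for i, t in enumerate(d) if t == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)
```

For the document "Air pollution in Mexico City is a long-standing problem", the Boolean query "mexico pollution" matches, the phrase "air pollution" matches, and the phrase "pollution air" does not.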

  14. Information Needs of Users • Categorization (Ingwersen & Järvelin, 2005) • intentionality or goal of the searcher • the kind of knowledge currently known by the searcher • the quality of what is known • well-defined knowledge of the user • specific information sources are searched • in ill-defined (muddled) cases • the search process is exploratory • Types of information need in Web search (White et al., 2003) • search for a fact • search for a number of items • decision search • background search

  15. General Document Analysis • physical components • paragraphs, words, figures, etc. • logical components • titles, authors, sections, etc. • as a syntactic analysis problem • physical and logical components of a document • ordered tree • transformation-based learning • generalized n-gram model • probabilistic grammars • incremental parsing • syntactic parsing (Collins and Roark, 2004) • generating table-of-contents for a long document (Branavan et al., 2007)

  16. Web Document Analysis • Web documents • HTML (Hypertext Markup Language) • presentation of content • semi-structured documents • Motivations • to filter important content • to convert HTML documents into semantically-rich XML documents • obtaining a hierarchical structure for the documents • display content in small-screen devices such as PDAs • more intelligent retrieval of information, summarization, etc • Approaches • HTML tags and DOM tree • rule-based or machine learning-based • certain domain or domain-independent

  17. Web Document Analysis (cont.) • Different from most previous work • section and subsection headings • HTML • Markup tags, attributes and attribute values • e.g. <font size = 3> • Two types of HTML tags • container tags (e.g. <table>, <td>, <tr>, etc.) • contain other HTML tags or text • format tags (e.g. <b>, <font>, <h1>, <h2>, etc.) • usually concerned with the formatting of text • DOM (Document Object Model) • provides an interface as a tree

  18. Automatic Summarization • Process of distilling the most important information • from a source (or sources) to produce a shortened version • for particular users and tasks • Uses • as an aid for browsing • single large documents or sets of documents • in sifting process • to locate useful documents in a large collection • as an aid for report writers • by providing abstracts • related to and influenced by • information retrieval • information extraction • text mining

  19. Automatic Summarization (cont.) • Types of summaries • “Extract” vs “abstract” • “Generic” vs “query-relevant” • “Single-document” vs “multi-document” • “Indicative” vs “informative” • Phases of summarization • Analysis of input text • Transformation into a summary representation • Synthesis of output summary

  20. Automatic Summarization (cont.) • Approaches • Surface-level approaches • use shallow features to identify important information in the text • thematic features, location, background, cue words and phrases, etc. • Entity-level approaches • build an internal representation of the text • by modeling text entities and their relationships • e.g. using graph topology • Discourse-level approaches • global structure of the text and its relation to communicative goals • Hybrid approaches • Evaluation • intrinsic • the summary itself is evaluated • extrinsic • i.e. task-based evaluation

  21. Recent Work on Summarization • Mostly generic summaries • based on sentence weighting • Tombros & Sanderson, 1998 • query-biased summaries in information retrieval • Google, AltaVista • White et al., 2003 • longer query-biased summaries • summary window • Alam et al., 2003 • structured and generic summaries • “table of contents”-like hierarchy of sections and subsections
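A query-biased extract based on sentence weighting can be sketched as follows. This is not the scoring of any cited system: the overlap count, the first-sentence bonus of 0.5 and the regex sentence splitter are all assumptions chosen to keep the example small.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(sentence, query_terms, index):
    """Weight = number of distinct query terms present, plus an assumed location bonus."""
    words = set(re.findall(r"\w+", sentence.lower()))
    location_bonus = 0.5 if index == 0 else 0.0
    return len(words & query_terms) + location_bonus

def query_biased_extract(text, query, k=2):
    """Return the k highest-weighted sentences, presented in document order."""
    sentences = split_sentences(text)
    q = set(re.findall(r"\w+", query.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i], q, i), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

A generic summarizer would drop the query term overlap and rely on features such as location or cue words only.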

  22. Recent Work on Summarization (cont.) • Yang & Wang, 2008 • fractal summarization • hierarchical structure of document • levels, chapters, sections, subsections, paragraphs, sentences and terms • generic summaries • Varadarajan & Hristidis, 2005 • adding structure • document is divided into fragments (paragraphs) • connecting related fragments as a graph (implicit structure) • query-biased • In this research, combining • explicit document structure and query-biased techniques

  23. OVERVIEW OF METHODOLOGY

  24. System Architecture

  25. Structural Processing • Rule-based and machine learning-based approaches • Input • a Web document in HTML format • Output • a tree representing the sectional hierarchy of the document • intermediate nodes: headings and subheadings, • leaves: other text units

  26. Summarization • Using the output of structural processing • document tree • indicative summaries • extractive approach • longer summaries • in a separate frame
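One way to picture an extractive summary that uses the document tree is sketched below. The transcript does not describe the extraction method at this point, so the rule used here (keep text units containing a query term, together with the headings on the path to them) is an assumption, as is the dictionary layout of the tree nodes.

```python
import re

def structured_summary(node, query_terms, depth=0):
    """Return indented lines: matching text units plus their ancestor headings.

    `node` is a dict with keys 'text', 'heading' (bool) and 'children' (list).
    """
    lines = []
    for child in node.get("children", []):
        lines.extend(structured_summary(child, query_terms, depth + 1))
    words = set(re.findall(r"\w+", node["text"].lower()))
    matches = bool(words & query_terms)
    if node.get("heading"):
        # A heading is kept only if it, or something beneath it, matched the query.
        if lines or matches:
            return ["  " * depth + node["text"]] + lines
        return []
    return ["  " * depth + node["text"]] if matches else []
```

Sections without any matching unit are dropped entirely, so the output keeps the sectional context of each hit without reproducing the whole document.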

  27. Implementation • GATE (A General Architecture for Text Engineering) • open source project using component-based technology in Java • commonly used natural language functionalities • Tokeniser, Sentence Splitter, Stemmer, etc. • Cobra Java HTML Renderer and Parser • open source project • supports HTML 4, JavaScript and Cascading Style Sheets (CSS) • Implemented modules • Structural analysis of HTML documents • Summarization engine

  28. Data Collection • Users • mostly Boolean queries with 2-3 words • Current search interests • various domains • English Collection • Turkish Collection • Extended English Collection • (tables on slide: English queries, Turkish queries)

  29. RULE-BASED APPROACH FOR STRUCTURAL PROCESSING

  30. The Method • A heuristic approach based on DOM processing • Heading-based sectional hierarchy identification • nontrivial task • heterogeneity of Web documents • the underlying HTML format • Three steps • DOM tree processing • Heading identification • Hierarchy restructuring

  31. Step 1: DOM Tree Processing • Semantically related parts • same or neighboring container tags • Traverse DOM tree in a breadth-first way • Sentence boundaries • Format tags such as <font> are passed as features • Output: a simplified version of the original tree
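The breadth-first traversal with format tags passed down as features can be sketched with Python's standard `html.parser`. The system described in the slides uses the Cobra parser in Java; the tree builder below is a simplified stand-in that assumes well-formed HTML without void tags such as `<br>`, and it returns a flat list of text units rather than a simplified tree. The set `FORMAT_TAGS` is an assumed subset.

```python
from collections import deque
from html.parser import HTMLParser

FORMAT_TAGS = {"b", "i", "u", "em", "strong", "font"}  # assumed subset

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.order = -1  # document position, set for text leaves only

class TreeBuilder(HTMLParser):
    """Builds a naive DOM-like tree from well-formed HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.count = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", self.current)
            leaf.text = data.strip()
            leaf.order = self.count
            self.count += 1
            self.current.children.append(leaf)

def text_units(html):
    """Traverse breadth-first; attach enclosing format tags to each text unit."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []
    queue = deque([(builder.root, frozenset())])
    while queue:
        node, feats = queue.popleft()
        if node.tag == "#text":
            found.append((node.order, node.text, sorted(feats)))
            continue
        if node.tag in FORMAT_TAGS:
            feats = feats | {node.tag}
        for child in node.children:
            queue.append((child, feats))
    # Breadth-first order is by depth, so restore document order before returning.
    return [(text, feats) for _, text, feats in sorted(found)]
```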

  32. DOM Tree of an Example Document

  33. Example Output of DOM Tree Processing

  34. Step 2: Heading Identification • Heading tags in HTML • <h1> through <h6> • rarely used for this purpose • Headings • formed by formatting them differently from surrounding text • more emphasized than following content • Heuristics • if-then rules
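The idea of if-then heuristics that compare a unit's formatting with the content that follows it can be sketched as below. The actual rules are not given in the transcript; the features (`font_size`, `bold`), the 12-word limit and the rule order are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TextUnit:
    text: str
    font_size: int = 3
    bold: bool = False

def is_heading(unit, following):
    """Hypothetical rules: a heading is short and more emphasized than what follows."""
    if len(unit.text.split()) > 12:
        return False  # long units are treated as body text
    if unit.text.rstrip().endswith("."):
        return False  # full sentences are treated as body text
    if unit.font_size > following.font_size:
        return True   # larger font than the following content
    if unit.bold and not following.bold:
        return True   # bold while the following content is not
    return False
```

For example, a bold unit "Research Goals" with font size 5 followed by a plain size-3 sentence is accepted, while the sentence itself is rejected.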

  35. Features for Identifying Text Format

  36. Step 3: Hierarchy Restructuring • Headings + feature set • to differentiate different levels of heading • Restructure the document tree • bottom-up approach
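Once each heading has been assigned a level, nesting units into a sectional hierarchy is straightforward. The slide names a bottom-up approach without giving details, so the sketch below uses a simpler single-pass, stack-based construction as a stand-in. The input format (a list of `(text, level)` pairs, with `None` for non-heading text) is an assumption.

```python
def build_hierarchy(units):
    """Nest units by heading level.

    `units` is a list of (text, level) in document order; level is an int for
    headings (1 = most prominent) and None for plain text units.
    """
    root = {"text": "ROOT", "level": 0, "children": []}
    stack = [root]  # open headings, from the root down to the current section
    for text, level in units:
        node = {"text": text, "level": level, "children": []}
        if level is None:
            stack[-1]["children"].append(node)  # text goes under the open heading
            continue
        while stack[-1]["level"] >= level:
            stack.pop()  # close sections at the same or a deeper level
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```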

  37. Step 3: Hierarchy Restructuring (cont.)

  38. Performance Measures • Heading Extraction • Hierarchy Extraction • Parent-child relationships in the document tree • Heading-subheading • Heading-underlying text
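The transcript does not preserve the formulas behind these measures. A plausible reading, assumed here, is precision and recall over the set of extracted headings and accuracy over parent-child links; the function names are illustrative.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted heading ids against the gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def parent_accuracy(predicted_parent, gold_parent):
    """Fraction of units whose predicted parent equals the gold parent.

    Both arguments map a unit id to the id of its parent in the document tree.
    """
    if not gold_parent:
        return 0.0
    correct = sum(1 for unit, parent in gold_parent.items()
                  if unit in predicted_parent and predicted_parent[unit] == parent)
    return correct / len(gold_parent)
```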

  39. English Collection • Heading extraction • Baseline • using only heading tags <h1> through <h6> • High value for heading recall • Precision is lower • cluttered organization in Web documents
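The baseline, which treats only `<h1>` through `<h6>` elements as headings, can be sketched with Python's standard `html.parser`. The class and function names are illustrative and not taken from the described system.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingTagBaseline(HTMLParser):
    """Collects (level, text) for every <h1>..<h6> element."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.headings.append((self._level, text))
            self._level = None

def extract_tagged_headings(html):
    parser = HeadingTagBaseline()
    parser.feed(html)
    parser.close()
    return parser.headings
```

A heading that is only formatted visually, for example with `<font>` and `<b>`, is invisible to this baseline.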

  40. English Collection (cont.) • Hierarchy extraction • a significant improvement in accuracy • compared to the baseline

  41. Turkish Collection • Heading extraction • Hierarchy extraction • Baseline method failed • no <h> tags used • Additional analysis • 50 documents on boun.edu.tr domain • 71% accuracy

  42. MACHINE LEARNING APPROACH FOR STRUCTURAL PROCESSING

  43. The Approach • Machine learning • can be more flexible • by combining several features using a training corpus • rather than predefined rules • Extraction of sectional hierarchy of a Web document • A tree-based learning approach needed • as in syntactic parsing • exponential search space • incremental algorithm • making a sequence of locally optimal choices • to approximate a globally optimal solution • Document • as a sequence of text units

  44. Example HTML document

  45. Heading Extraction Model • Binary classification • As a sequence of text units • Headings: positive examples • Non-headings: negative examples

  46. Hierarchy Extraction Model • Learn a mapping from X (a set of documents) to Y (a set of possible sectional hierarchies of documents) • Training examples (x_i, y_i) for i = 1…n • A function GEN(x) enumerating a set of possible outputs for an input x • A representation Φ mapping each (x_i, y_i) to a feature vector Φ(x_i, y_i) • A parameter vector α • Estimate α such that it will give the highest scores to correct outputs
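The components GEN, Φ and α fit a linear model that picks the candidate in GEN(x) with the highest score Φ(x, y) · α. The transcript does not name the estimation procedure, so the sketch below assumes a structured perceptron update, one common choice for this setup. The toy `gen` and `phi` (documents as tuples of font sizes, hierarchies as tuples of parent indices with -1 for the root) are invented for illustration.

```python
from itertools import product

def gen(x):
    """Toy GEN: each unit j attaches to the root (-1) or to any earlier unit."""
    return [tuple(y) for y in product(*[range(-1, j) for j in range(len(x))])]

def phi(x, y):
    """Toy feature vector: counts of three kinds of attachment."""
    feats = {"root": 0, "parent_larger": 0, "parent_not_larger": 0}
    for j, p in enumerate(y):
        if p == -1:
            feats["root"] += 1
        elif x[p] > x[j]:
            feats["parent_larger"] += 1
        else:
            feats["parent_not_larger"] += 1
    return feats

def dot(alpha, feats):
    return sum(alpha.get(k, 0.0) * v for k, v in feats.items())

def predict(x, alpha):
    """Return the candidate in GEN(x) with the highest score phi(x, y) . alpha."""
    return max(gen(x), key=lambda y: dot(alpha, phi(x, y)))

def train_perceptron(examples, epochs=5):
    """Assumed estimator: move alpha towards gold features, away from the wrong guess."""
    alpha = {}
    for _ in range(epochs):
        for x, gold in examples:
            guess = predict(x, alpha)
            if guess != gold:
                for k, v in phi(x, gold).items():
                    alpha[k] = alpha.get(k, 0.0) + v
                for k, v in phi(x, guess).items():
                    alpha[k] = alpha.get(k, 0.0) - v
    return alpha
```

Enumerating GEN(x) exhaustively is only feasible for toy inputs, since the number of candidate hierarchies grows factorially with the number of units.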

  47. Features • Unit features • Formatting features • e.g. font size, boldness, color, etc. • DOM tree features • e.g. DOM address, DOM path, etc. • Content features • e.g. cue words / phrases, number of characters, punctuation mark, etc. • Other features • Visual position in the rendered Web document • Contextual features • composite features of two units in context • distance and difference between features • uij : unit i levels above a unit u, and j units to its left • Global features • e.g. the depth of sectional hierarchy
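Unit features and contextual (difference) features can be pictured with the sketch below. The specific feature names, the cue-word list and the dictionary representation of a unit are assumptions; the slides list the feature families but not their exact definitions.

```python
CUE_WORDS = {"introduction", "conclusion", "references", "abstract"}  # assumed list

def unit_features(unit):
    """Formatting and content features of one text unit (a dict with a 'text' key)."""
    text = unit["text"]
    return {
        "font_size": unit.get("font_size", 3),
        "bold": int(unit.get("bold", False)),
        "num_chars": len(text),
        "ends_with_punct": int(text.rstrip().endswith((".", "!", "?", ":", ";"))),
        "has_cue_word": int(any(w in CUE_WORDS for w in text.lower().split())),
    }

def contextual_features(unit, other):
    """Composite features: differences between the features of two units in context."""
    a, b = unit_features(unit), unit_features(other)
    return {f"diff_{name}": a[name] - b[name] for name in a}
```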

  48. Incremental Learning Approach • Document graph • left to right based on the order of appearance • Positive and negative examples • Parent-child relationships (based on gold standard hierarchy) • Two constraints • Document order • Projectivity rule • “When searching for the parent of a unit u_j, consider only the previous unit (u_{j-1}), the parent of u_{j-1}, that unit’s parent, and so on to the root of the tree.”

  49. Incremental Learning Approach (cont.) • Training set • Web documents and corresponding gold standard hierarchies • Algorithm • works on units sequentially
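The projectivity rule restricts the candidate parents of a unit to the previous unit and its chain of ancestors. A minimal sketch, assuming the partial tree is stored as a list of parent indices with -1 standing for the root:

```python
def candidate_parents(parents, j):
    """Candidate parents of unit j under the projectivity rule.

    `parents[k]` is the parent index of unit k for every k < j (-1 = root).
    Returns the previous unit, its parent, that unit's parent, ..., the root.
    """
    candidates = []
    k = j - 1
    while k != -1:
        candidates.append(k)
        k = parents[k]
    candidates.append(-1)  # the root is always a candidate
    return candidates
```

With parents `[-1, 0, 1, 0]`, unit 4 may attach to unit 3, unit 0 or the root, but not to units 1 or 2, whose sections have already been closed.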

  50. Testing Approach • Beam search • Set of partial trees • Beam width • Two operations • ADV (i.e. Advance) • potential attachments of current unit to partial trees • FILTER • to prevent exponential growth of the set
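The two operations can be sketched as a small beam search over partial trees, each stored as a tuple of parent indices (-1 = root). The attachment scorer passed in as `score_attachment` is a hypothetical stand-in for the learned model score, and the candidate enumeration assumes the projectivity rule described earlier in the slides.

```python
def candidate_parents(parents):
    """Allowed parents for the next unit: previous unit, its ancestors, the root."""
    candidates = []
    k = len(parents) - 1
    while k != -1:
        candidates.append(k)
        k = parents[k]
    candidates.append(-1)
    return candidates

def advance(beam, j, score_attachment):
    """ADV: extend every partial tree with each allowed attachment of unit j."""
    expanded = []
    for score, parents in beam:
        for p in candidate_parents(parents):
            expanded.append((score + score_attachment(j, p), parents + (p,)))
    return expanded

def filter_beam(candidates, width):
    """FILTER: keep only the `width` highest-scoring partial trees."""
    return sorted(candidates, key=lambda c: c[0], reverse=True)[:width]

def beam_search(n_units, score_attachment, width=3):
    """Process units sequentially; return the best complete parent assignment."""
    beam = [(0.0, ())]
    for j in range(n_units):
        beam = filter_beam(advance(beam, j, score_attachment), width)
    return beam[0][1]

def make_level_scorer(levels):
    """Toy scorer: reward parents that are more prominent than the child,
    slightly preferring the most specific such parent."""
    def score_attachment(j, p):
        parent_level = 0 if p == -1 else levels[p]
        if parent_level >= levels[j]:
            return -1.0
        return 1.0 + 0.1 * parent_level
    return score_attachment
```

With heading levels `[1, 2, 9, 1, 9]` (9 marking plain text), the toy scorer yields a first section containing a subsection with text, followed by a second top-level section with text.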
