ChronoSearch

Pete Bohman, Adam Kunk. ChronoSearch: A System for Extracting a Chronological Timeline. Motivation: current search engines do not provide a complete picture, and the latest events dominate the top results.


Presentation Transcript


  1. Pete Bohman, Adam Kunk • ChronoSearch

  2. ChronoSearch • ChronoSearch: A System for Extracting a Chronological Timeline

  3. Motivation • Current search engines do not provide a complete picture • The latest events dominate the top results • The user is forced to read through many pages to assemble a complete list of information • ChronoSearch aims to summarize search results into a concise list of important events related to an entity

  4. Problem Definition • Input: An entity E (most likely a person) • Output: A sorted list of events, L, which are related to E • $L = \{\, l_i \mid l_i \text{ is unique and } l_i \text{ occurred before } l_{i+1} \,\}$
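
The output definition above can be illustrated with a minimal sketch. It assumes each event is stored as an (event, entity, date) triple, borrowing the tuple shape from the Problem Statement slide; the `Event` class, its field names, and `build_timeline` are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One (event, entity, date) triple; the field names are illustrative."""
    description: str
    entity: str
    when: date


def build_timeline(events):
    """Return the unique events sorted from earliest to latest."""
    # Frozen dataclasses hash by value, so exact repeats collapse in the set.
    unique = set(events)
    # Ties on the date are broken by description to keep the order stable.
    return sorted(unique, key=lambda e: (e.when, e.description))
```

This only removes exact repeats; near-duplicate sentences describing the same event need a similarity measure, which the later slides address.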

  5. Problem Statement • Tuple extraction: (Event, Entity, Date) • Difficulties of Extraction • Dates: no standard format, relative dates • Events: hard due to random input, unstructured data • Entity: pronouns (“He” / “She”) • Entity-event association

  6. Our Approach • Baseline Approach – Web Redundancy • Date extraction based on absolute dates • Entity extraction by literal entity • Association based on sentence boundary • Event is implicitly described by the sentence itself • We consider sentences containing the entity being searched as well as an absolute time
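
The baseline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the sentence splitter is a naive regular expression, only dates of the form "March 5, 1990" count as absolute dates, and the names `extract_tuples`, `MONTHS`, and `DATE_RE` are hypothetical.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Assumption: only "Month D, YYYY" is treated as an absolute date here.
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


def extract_tuples(text, entity):
    """Return (sentence, entity, date) for sentences naming the entity and a date."""
    results = []
    # Naive sentence boundary: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if entity.lower() not in sentence.lower():
            continue  # literal entity match only; pronouns are not resolved
        match = DATE_RE.search(sentence)
        if match is None:
            continue  # no absolute date in this sentence
        month, day, year = match.groups()
        try:
            when = date(int(year), MONTHS.index(month) + 1, int(day))
        except ValueError:
            continue  # skip impossible dates such as "February 31, 2000"
        # The sentence itself stands in for the event description.
        results.append((sentence, entity, when))
    return results
```

Because the event is the whole sentence, the same event reported on several pages produces several near-identical tuples, which is why duplicate elimination matters later.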

  7. Our Approach • Baseline Approach • Leverages Web Redundancy

  8. Initial Results • Demo time…

  9. Results Analysis • Information Retrieval (IR) performance characteristics: • Precision: the fraction of retrieved documents that are relevant to the query • Recall: the fraction of documents relevant to the query that are successfully retrieved
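
As a small worked illustration of the two measures defined above, the following sketch computes both from a set of retrieved items and a set of relevant items. The function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions made for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # items that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four items of which two are relevant, out of three relevant items overall, gives a precision of 2/4 and a recall of 2/3.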

  10. Ultimate Approach • Improving precision: • (Part 1) Eliminating duplicates • (Part 2) Eliminating unimportant results

  11. Eliminating Duplicates • Improving precision: • (Part 1) Eliminating duplicates • Cosine similarity duplicate detection • The probability that s and s' are the same event: • $P(s' \text{ reports the same event as } s) = \operatorname{cosine}(s', s)$ • Term frequency vectors: s and s'
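
The cosine measure above can be sketched with plain term counts. The tokenizer, the helper names, and the 0.8 threshold in `drop_duplicates` are assumptions chosen for illustration; the slides do not state a threshold.

```python
import math
import re
from collections import Counter


def term_frequencies(sentence):
    """Lowercase the sentence and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(s, s_prime):
    """Cosine similarity between the term-frequency vectors of two sentences."""
    tf_s, tf_p = term_frequencies(s), term_frequencies(s_prime)
    dot = sum(tf_s[t] * tf_p[t] for t in tf_s.keys() & tf_p.keys())
    norm = (math.sqrt(sum(c * c for c in tf_s.values()))
            * math.sqrt(sum(c * c for c in tf_p.values())))
    return dot / norm if norm else 0.0


def drop_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is below the threshold against every kept one."""
    kept = []
    for s in sentences:
        if all(cosine(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

The greedy pass keeps the first sentence of each group of near-duplicates, so the result depends on input order; sorting by date first would keep the earliest report.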

  12. Eliminating Unimportant Results • Improving precision: • (Part 2) Eliminating unimportant results • Important results occur more frequently • Utilize term frequency to eliminate unimportant events • Option 1: Term frequency calculations based on results returned from the initial search query • Results that do not occur frequently in the returned corpus will be eliminated • Option 2: Leverage Google search
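
One possible reading of Option 1 is sketched below: score each candidate sentence by how often its terms appear across the sentences returned for the initial query. The scoring rule and the names `tokens` and `corpus_support` are assumptions for illustration, since the slides do not specify the calculation.

```python
import re


def tokens(sentence):
    """Return the set of lowercase alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def corpus_support(candidate, corpus):
    """Average fraction of corpus sentences that contain each candidate term."""
    terms = tokens(candidate)
    if not terms or not corpus:
        return 0.0
    corpus_tokens = [tokens(s) for s in corpus]
    per_term = [
        sum(term in doc for doc in corpus_tokens) / len(corpus_tokens)
        for term in terms
    ]
    return sum(per_term) / len(per_term)
```

Common words such as "the" inflate every score in this form, so a practical version would likely drop stop words or weight terms by rarity before comparing candidates.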

  13. Eliminating Unimportant Results Cont. • Eliminate results that fall more than x standard deviations below the mean (“-x”), based on the number of search results returned for each result
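
The standard-deviation cutoff can be sketched as below, assuming the count for each result (for example, a search hit count) has already been obtained by some other means. The function name `drop_unimportant`, the default `x=1.0`, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def drop_unimportant(counts, x=1.0):
    """Keep results whose count is at least mean - x * standard deviation."""
    values = list(counts.values())
    if not values:
        return {}
    cutoff = mean(values) - x * pstdev(values)
    return {result: count for result, count in counts.items() if count >= cutoff}
```

A larger x keeps more results; with very skewed counts the mean and standard deviation are dominated by the largest values, so a cutoff on log counts may behave more sensibly.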
