Pete Bohman Adam Kunk ChronoSearch
ChronoSearch • ChronoSearch: A System for Extracting a Chronological Timeline
Motivation • Current search engines do not provide a complete picture • Latest events dominate top results • The user is forced to parse through lots of pages to find a complete list of information • ChronoSearch aims to summarize search results into a concise list of important events related to an entity
Problem Definition • Input: An entity E (most likely a person) • Output: A sorted list of events, L, which are related to E, where L = { l_i | l_i is unique and l_i occurred before l_{i+1} }
Problem Statement • Tuple extraction: (Event, Entity, Date) • Difficulties of extraction: • Dates: no standard format, relative dates • Events: hard due to random input, unstructured data • Entity: pronouns (“He” / “She”) • Entity-event association
Our Approach • Baseline Approach – Web Redundancy • Date extraction based on absolute dates • Entity extraction by literal entity • Association based on sentence boundary • Event is implicitly described by the sentence itself • We consider sentences containing the entity being searched as well as an absolute time
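The baseline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_events`, the naive sentence splitter, and the single supported date format ("Month D, YYYY") are all assumptions made for the example.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_ALT = "|".join(MONTHS)
# Assumption: only absolute dates written as "Month D, YYYY" are recognized.
DATE_RE = re.compile(rf"\b({MONTH_ALT}) (\d{{1,2}}), (\d{{4}})\b")


def extract_events(text, entity):
    """Return (event, entity, date) tuples sorted chronologically.

    A sentence counts as an event if it contains the literal entity
    string and an absolute date; the sentence itself is the event.
    """
    # Naive sentence boundary detection: split after ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text)
    events, seen = [], set()
    for sentence in sentences:
        sentence = sentence.strip()
        if entity.lower() not in sentence.lower():
            continue  # literal entity match only, no pronoun resolution
        match = DATE_RE.search(sentence)
        if match is None:
            continue  # relative dates are ignored in the baseline
        month = MONTHS.index(match.group(1)) + 1
        try:
            when = date(int(match.group(3)), month, int(match.group(2)))
        except ValueError:
            continue  # e.g. "February 31, 1900"
        if (sentence, when) in seen:
            continue  # keep each event only once
        seen.add((sentence, when))
        events.append((sentence, entity, when))
    return sorted(events, key=lambda e: e[2])
```

Sentences that refer to the entity only by pronoun, or that carry only a relative date, are skipped here, which matches the limitations listed for the baseline.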
Our Approach • Baseline Approach • Leverages Web Redundancy
Initial Results • Demo time…
Results Analysis • Information Retrieval (IR) performance characteristics: • Precision – fraction of retrieved documents that are relevant to the query • Recall – fraction of documents relevant to the query that are successfully retrieved
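As a small worked illustration of the two measures, the sketch below computes them from sets of document identifiers. The function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are relevant, out of three relevant documents overall, gives a precision of 2/4 and a recall of 2/3.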
Ultimate Approach • Improving precision: • (Part 1) Eliminating duplicates • (Part 2) Eliminating unimportant results
Eliminating Duplicates • Improving precision: • (Part 1) Eliminating duplicates • Cosine similarity duplicate detection • The probability that s and s' report the same event: P(s' reports the same event as s) = cosine(s', s) • s and s' are term frequency vectors
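A minimal sketch of cosine-similarity duplicate detection over term frequency vectors follows. The tokenizer, the helper names, and the 0.8 threshold are assumptions for illustration; the slides do not state a threshold.

```python
import math
import re
from collections import Counter


def tf_vector(sentence):
    """Term frequency vector of a sentence (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(s, s_prime):
    """Cosine similarity between the term frequency vectors of two sentences."""
    a, b = tf_vector(s), tf_vector(s_prime)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def drop_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept.

    Assumption: a similarity at or above `threshold` means "same event".
    """
    kept = []
    for s in sentences:
        if all(cosine(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

The comparison is quadratic in the number of sentences, which is acceptable for a single query's result list but would need indexing for larger corpora.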
Eliminating Unimportant Results • Improving precision: • (Part 2) Eliminating unimportant results • Important results occur more frequently • Utilize term frequency to eliminate unimportant events • Option 1: Term frequency calculations based on results returned from initial search query • Results that do not occur frequently in the returned corpus will be eliminated • Option 2: Leverage Google search
Eliminating Unimportant Results Cont. • Eliminate results that fall outside of “-x” standard deviations, i.e. more than x standard deviations below the average, based on the search results returned for the given result
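One possible reading of the “-x” standard deviation cutoff is sketched below. It assumes each candidate event already has a frequency count (from the returned corpus or from a search hit count) and drops events whose count lies more than x standard deviations below the mean. The function name, the use of the population standard deviation, and the default x = 1.0 are assumptions.

```python
from statistics import mean, pstdev


def filter_by_frequency(counts, x=1.0):
    """Drop events whose count is more than x standard deviations below the mean.

    `counts` maps an event (e.g. its sentence) to how often it was observed.
    """
    if not counts:
        return {}
    values = list(counts.values())
    # Cutoff at mean - x * sigma; assumption: population standard deviation.
    cutoff = mean(values) - x * pstdev(values)
    return {event: c for event, c in counts.items() if c >= cutoff}
```

A smaller x makes the filter stricter; with very few events the standard deviation is unstable, so a minimum sample size may be worth enforcing.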