This paper discusses a method for extracting structured values from web pages generated by templates. By inferring the underlying template from a set of pages and applying it to unseen pages, we aim to capture the semantic meaning encoded in the structure. The proposed approach utilizes observations about token co-occurrences and properties of strings derived from templates to construct and evaluate these templates. Examples are drawn from IMDB movie pages, showcasing the extraction of attributes like title, rating, and actors, leading to a categorization of results as correct, partially correct, or incorrect.
Implementing Automatic Value Extraction from Structured Web Pages
Varun Ganapathi, Jonathan Pines, Josh Wiseman
Problem
• Context:
  • Many web pages are generated by applying a template to structured data.
• Goal:
  • Given a set of pages generated from a template, infer the template.
  • Extract values from previously unseen pages generated from the template.
• Why?
  • The template encodes structure that usually has semantic meaning.
  • The structured values that back a page are all the important information in the page.
What is a Template?
• It is a special case of a context-free grammar
• Tuples (fixed-length ordered lists)
• Sets (arbitrary-length lists denoted by separators)
• Example of an instantiated template:
  <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
  <elem>6.8</elem>
  <set>
    <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
    <tuple><elem>Ving Rhames</elem><elem>Luther Strickell</elem></tuple>
  </set>
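The tuple and set structure above can be modelled as a small tree of values. The following is a minimal sketch, not the authors' implementation: the class names `Elem`, `TupleValue`, `SetValue` and the function `render` are hypothetical, chosen only to mirror the `<elem>`, `<tuple>` and `<set>` notation used in the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Elem:
    """A single extracted string value (hypothetical name)."""
    text: str


@dataclass
class TupleValue:
    """A fixed-length ordered list of values (hypothetical name)."""
    fields: List["Value"]


@dataclass
class SetValue:
    """An arbitrary-length list of values (hypothetical name)."""
    items: List["Value"]


Value = Union[Elem, TupleValue, SetValue]


def render(value: Value) -> str:
    """Serialize a value tree into the <elem>/<tuple>/<set> notation."""
    if isinstance(value, Elem):
        return f"<elem>{value.text}</elem>"
    if isinstance(value, TupleValue):
        return "<tuple>" + "".join(render(f) for f in value.fields) + "</tuple>"
    if isinstance(value, SetValue):
        return "<set>" + "".join(render(i) for i in value.items) + "</set>"
    raise TypeError(f"unsupported value: {value!r}")


# The cast list from the example: a set of (actor, character) tuples.
cast = SetValue([
    TupleValue([Elem("Tom Cruise"), Elem("Ethan Hunt")]),
    TupleValue([Elem("Ving Rhames"), Elem("Luther Strickell")]),
])
rendered_cast = render(cast)
```

Here a set may hold any number of tuples, while each tuple always has the same two fields, which is the distinction the slide draws between the two constructs.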
Learning Templates
• Use the following observations:
  • When tokens occur frequently together, it might be because they are derived from the same template
  • The strings derived from templates have certain properties:
    • Ordered
    • Nested
    • Loop
• Find equivalence classes of differentiated tokens
• Extend the partial template
• Differentiate tokens based on the partial template
• Construct the template using patterns
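One way to read the "equivalence classes" step is to group tokens that occur the same number of times on every page, since such tokens plausibly come from the template and not from the data. The sketch below illustrates only that idea and assumes whitespace tokenization; the function name `equivalence_classes` and the `min_size` parameter are hypothetical, and the slides do not specify how the authors tokenized pages or differentiated tokens.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages, min_size=2):
    """Group tokens by their per-page occurrence counts.

    Tokens with identical occurrence vectors across all pages land in
    the same class. Classes smaller than min_size are dropped.
    """
    # One token counter per page (whitespace tokenization is an assumption).
    counts = [Counter(page.split()) for page in pages]
    vocab = set().union(*counts)

    groups = defaultdict(list)
    for token in sorted(vocab):
        # Occurrence vector: how often the token appears on each page.
        vector = tuple(c[token] for c in counts)
        groups[vector].append(token)

    return {v: toks for v, toks in groups.items() if len(toks) >= min_size}


pages = [
    "<b> Title : </b> Alien",
    "<b> Title : </b> Heat",
]
classes = equivalence_classes(pages)
```

In this toy input the markup tokens and the label "Title" share the vector (1, 1) and form one class, while the movie names appear on only one page each and are filtered out as singleton classes.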
Evaluation
• We manually extracted “interesting” data from several IMDB movie pages.
  <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
  <elem>6.8</elem>
  <set>
    <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
    <tuple><elem>Ving Rhames</elem><elem>Luther Strickell</elem></tuple>
  </set>
• Some attributes: title, writers, directors, plot summary, rating, actors, languages, trivia, …
• Each attribute was graded as one of:
  • Correct: our system extracted exactly the expected data.
  • Partially correct: our system extracted the expected data plus a bit too much.
  • Incorrect: our system missed some data.
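The three grades can be expressed as a comparison between the values the system extracted for an attribute and the manually extracted values. This is a hedged sketch of that reading, assuming each attribute's values can be compared as sets of strings; the function name `grade` is hypothetical, and the slides do not say how the authors performed the comparison.

```python
def grade(extracted, expected):
    """Grade one attribute against the manually extracted values.

    correct           -> extracted exactly the expected values
    partially correct -> extracted every expected value plus extras
    incorrect         -> at least one expected value is missing
    """
    extracted, expected = set(extracted), set(expected)
    if extracted == expected:
        return "correct"
    if expected < extracted:  # proper subset: nothing missed, but extras
        return "partially correct"
    return "incorrect"


actors = ["Tom Cruise", "Ving Rhames"]
grade_exact = grade(["Tom Cruise", "Ving Rhames"], actors)
grade_extra = grade(["Tom Cruise", "Ving Rhames", "Cast overview"], actors)
grade_missing = grade(["Tom Cruise"], actors)
```

The string "Cast overview" is an invented example of surplus page text and is not taken from the evaluation data.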
Results
• Attributes (16 in total):
  • 5 correct
  • 5 partially correct
  • 6 incorrect