This paper presents a method for automatically extracting structured data from web pages, focusing on the complexities of varied web page schemas. It discusses underlying terminology, module operations, and motivation for the extraction process. By analyzing examples from prominent sites like Amazon, we illustrate the effectiveness of our approach in handling complex data structures. The study covers algorithmic strategies, experimental results, and potential future developments in the realm of web data extraction, aiming to enable more efficient querying of web information.
Extracting Structured Data from Web Pages • Arsun ARTEL, Özgün ÖZIŞIKYILMAZ • 05.11.2003 • Instructor: Prof. Taflan Gündem
Presentation Outline • Motivation • Example Pages • Model & Problem Formulation (General, Underlying Terminology, Modules and their operations) • Approach in Detail • Experimental Results • Conclusion
Motivation • Many web sites contain large collections of “structured” pages. • Extracting the structured data from these pages is useful, since it enables us to pose complex queries over the data. • This paper focuses on the problem of automatically extracting structured data from a collection of pages.
Example Pages • The real world offers many examples of structured web pages: the Amazon and eBay web sites, etc. • Two examples from www.amazon.com: My System and An Eternal Golden Braid.
Underlying Problems • Complex Schema: the “schema” of the information encoded in the web pages can be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on. • Template vs. Data: syntactically, nothing distinguishes the text that is part of the template from the text that is part of the data.
[Figure: How is a page created with a template? A value x extracted from the database is encoded into the template to produce the page.]
Basic Types, Tuples and Sets • Basic Type: b, the basic unit of text. • Tuple: an ordered list of types, <T1, T2, …, Tn>. • Set: {T1}. • Example: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
Schema and Instance • Schema: S = <b, {<b, b>}, b> (a title, a set of <first-name, last-name> author tuples, and a price). • Instance: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
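A minimal Python sketch of the type model (the class names Basic, TupleType, and SetType are our own illustration, not the paper's code):

from dataclasses import dataclass

@dataclass
class Basic:          # the basic type b: an unstructured text value
    pass

@dataclass
class TupleType:      # an ordered list of types <T1, T2, ..., Tn>
    types: list

@dataclass
class SetType:        # a set of instances of a single type {T}
    of: object

# Schema of the book example: S = <b, {<b, b>}, b>
book_schema = TupleType([Basic(),
                         SetType(TupleType([Basic(), Basic()])),
                         Basic()])

# One instance of that schema:
book_instance = ("C Programming Language",
                 {("Brian", "Kernighan"), ("Dennis", "Ritchie")},
                 "$30.00")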
Template Definition • Own example: • Schema: S = <b, {b}, b> • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
Encoding • [Figure: the encoding λ(T1, x1) — the template T1 applied to the value x1 produces the page.]
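As a rough illustration of the encoding λ(T, x), the sketch below fills the slide's own template TS = <A * B {*}E C * D> with a value of schema <b, {b}, b>; the encode function and its whitespace handling are our own simplification:

# Encode an instance of schema <b, {b}, b> using the slide's template.
# Separators are padded with spaces for readability; D is blank.
def encode(value):
    title, presenters, cost = value
    A, B, C, D, E = "Title: ", " Presented by: ", " Cost: ", "", " and "
    return A + title + B + E.join(presenters) + C + cost + D

print(encode(("Extracting Structured Data", ["Arsun", "Özgün"], "1hr")))
# Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr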
Multiple Pages • [Figure: multiple pages generated from the same template, each encoding a set of reviewers.]
Some Terminology (1) • The occurrence-vector of a token t is defined as the vector <f1, f2, …, fn>, where fi is the number of occurrences of t in the ith page. • An equivalence class is a maximal set of tokens having the same occurrence-vector. • A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
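A small Python sketch of these two definitions (our own code, assuming each page is given as a list of tokens):

from collections import Counter, defaultdict

def occurrence_vectors(pages):
    # pages: list of token lists, one per input page
    counts = [Counter(page) for page in pages]
    tokens = {t for page in pages for t in page}
    return {t: tuple(c[t] for c in counts) for t in tokens}

def equivalence_classes(pages):
    # maximal sets of tokens sharing the same occurrence-vector
    classes = defaultdict(set)
    for token, vec in occurrence_vectors(pages).items():
        classes[vec].add(token)
    return classes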
Some Terminology (2) • [Figure: tokens with occurrence-vectors <1,1,1,1> and <1,2,1,0>; a token whose occurrences are generated by more than one template-token has no unique role.]
Some Terminology (3) • For real pages, an equivalence class of large size and support is usually valid, where the support of a token is defined as the number of pages in which the token occurs. • Example of an invalid equivalence class: {Data, Mining, Jeff, 2, Jane, 6}, with occurrence-vector <0, 1, 0, 0>.
Some Terminology (4) • Equivalence classes with large size and support are called LFEQs (Large and Frequent EQuivalence classes). LFEQs are rarely formed by “chance”. • The thresholds for size and support (SizeThres, SupThres) are set by the user, as in the sketch below.
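Continuing the sketch above, an LFEQ filter might look like this (size_thres and sup_thres stand in for the user-set SizeThres and SupThres; the concrete default values are only illustrative):

def lfeqs(classes, size_thres=3, sup_thres=2):
    # classes: mapping occurrence-vector -> set of tokens,
    # as returned by equivalence_classes() above
    result = {}
    for vec, tokens in classes.items():
        support = sum(1 for f in vec if f > 0)   # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:
            result[vec] = tokens
    return result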
Some Terminology (5) • Valid equivalence classes have two properties: ordering and nesting. • Back to our own example: • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Ordering: A > B > C > D • Nesting: B > E > C
Important Observations • In practice, two page-tokens with different occurrence-paths (their paths in the HTML parse tree, obtained with an HTML parser) have different roles. • Two page-tokens with the same occurrence-path but different neighbours also have different roles.
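A sketch of how occurrence-paths can be obtained with an HTML parser (using only Python's standard html.parser; the tokenizer itself is our own illustration):

from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    # records each word together with its occurrence-path, i.e. the
    # stack of tags that are open where the word appears
    def __init__(self):
        super().__init__()
        self.path, self.tokens = [], []

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            while self.path.pop() != tag:   # tolerate bad nesting
                pass

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append(("/".join(self.path), word))

p = PathTokenizer()
p.feed("<html><body><b>Reviews</b> 5 reviews</body></html>")
print(p.tokens)
# [('html/body/b', 'Reviews'), ('html/body', '5'), ('html/body', 'reviews')]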
Constructing Template (1) • The extraction algorithm determines which positions between consecutive tokens of an equivalence class are non-empty. • A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty otherwise.
Constructing Template (2) • Tokens connected by empty positions belong to the template. • Non-empty positions hold either basic types (strings extracted from the database) or a more complex type. • This unknown type can be determined by inspecting the input pages, as in the sketch below.
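A sketch of the position check, under a simplifying assumption of our own: each equivalence-class token occurs exactly once per page (i.e. has a unique role):

def position_contents(pages, eq_tokens):
    # for each position between consecutive equivalence-class tokens,
    # collect what appears there across all pages
    slots = [set() for _ in range(len(eq_tokens) - 1)]
    for page in pages:
        idx = [page.index(t) for t in eq_tokens]
        for i in range(len(slots)):
            slots[i].add(tuple(page[idx[i] + 1: idx[i + 1]]))
    # a position is empty (pure template) iff nothing ever appears in it
    return ["empty" if s == {()} else "non-empty" for s in slots]

pages = [["Title:", "EXALG", "Cost:", "free"],
         ["Title:", "RoadRunner", "Cost:", "free"]]
print(position_contents(pages, ["Title:", "Cost:"]))   # ['non-empty']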
Experimental Results (1) • The approach is compared mainly with RoadRunner; RoadRunner, however, makes simplifying assumptions. • The first six sets of web pages are obtained from the RoadRunner site. • The last three sets have more complex structure.
Concluding Remarks • EXALG first discovers the unknown template that generated the pages, then uses the discovered template to extract the data from the input pages. • Besides producing very good results, EXALG never completely fails to extract data, even when some of its assumptions are not met by the input collection. • No human intervention is needed: the template and the data are obtained automatically.
Future Work • Automatically locate collections of pages that are structured. • Check whether it is feasible to generate a large database from such pages.