This paper presents a method for automatically extracting structured data from web pages, focusing on the complexities of varied web page schemas. It discusses underlying terminology, module operations, and motivation for the extraction process. By analyzing examples from prominent sites like Amazon, we illustrate the effectiveness of our approach in handling complex data structures. The study covers algorithmic strategies, experimental results, and potential future developments in the realm of web data extraction, aiming to enable more efficient querying of web information.
Extracting Structured Data from Web Pages • Arsun ARTEL, Özgün ÖZIŞIKYILMAZ • 05.11.2003 • Instructor: Prof. Taflan Gündem
Presentation Outline • Motivation • Example Pages • Model & Problem Formulation (General, Underlying Terminology, Modules and their operations) • Approach in Detail • Experimental Results • Conclusion
Motivation • Many web sites contain large collections of “structured” pages. • Extracting the structured data from these pages is useful, since it enables us to pose complex queries over the data. • This paper focuses on the problem of automatically extracting structured data from a collection of pages.
Example Pages • The real world offers many examples of structured web pages: the Amazon and eBay web sites, etc. • Two examples from www.amazon.com: My System and An Eternal Golden Braid.
Underlying Problems • Complex Schema: the “schema” of the information encoded in the web pages can be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on. • Template vs. Data: syntactically, nothing distinguishes the text that is part of the template from the text that is part of the data.
[Figure: How is a page created with a template? A value x extracted from the database is encoded into the template to produce the page.]
Basic Types, Tuples and Sets • Basic Type: b, the basic unit of text. • Tuple: an ordered list of types, <T1, T2, …, Tn>. • Set: {T1}. • Example: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
Schema and Instance • Schema: S = <b, {<b, b>}, b> (a title, a set of <first-name, last-name> author tuples, and a price). • Instance: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
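A minimal Python sketch of the type model (the class names Basic, TupleType, and SetType are our own illustration, not the paper's code):

from dataclasses import dataclass

@dataclass
class Basic:          # the basic type b: an unstructured text value
    pass

@dataclass
class TupleType:      # an ordered list of types <T1, T2, ..., Tn>
    types: list

@dataclass
class SetType:        # a set of instances of a single type {T}
    of: object

# Schema of the book example: S = <b, {<b, b>}, b>
book_schema = TupleType([Basic(),
                         SetType(TupleType([Basic(), Basic()])),
                         Basic()])

# One instance of that schema:
book_instance = ("C Programming Language",
                 {("Brian", "Kernighan"), ("Dennis", "Ritchie")},
                 "$30.00")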
Template Definition • Own example: • Schema: S = <b, {b}, b> • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
Encoding • [Figure: the encoding λ(T1, x1) — the template T1 applied to the value x1 produces the page.]
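As a rough illustration of the encoding λ(T, x), the sketch below fills the slide's own template TS = <A * B {*}E C * D> with a value of schema <b, {b}, b>; the encode function and its whitespace handling are our own simplification:

# Encode an instance of schema <b, {b}, b> using the slide's template.
# Separators are padded with spaces for readability; D is blank.
def encode(value):
    title, presenters, cost = value
    A, B, C, D, E = "Title: ", " Presented by: ", " Cost: ", "", " and "
    return A + title + B + E.join(presenters) + C + cost + D

print(encode(("Extracting Structured Data", ["Arsun", "Özgün"], "1hr")))
# Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr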
Multiple Pages • [Figure: multiple pages generated from the same template, each encoding a set of reviewers.]
Some Terminology (1) • The occurrence-vector of a token t is defined as the vector <f1, f2, …, fn>, where fi is the number of occurrences of t in the ith page. • An equivalence class is a maximal set of tokens having the same occurrence-vector. • A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
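A small Python sketch of these two definitions (our own code, assuming each page is given as a list of tokens):

from collections import Counter, defaultdict

def occurrence_vectors(pages):
    # pages: list of token lists, one per input page
    counts = [Counter(page) for page in pages]
    tokens = {t for page in pages for t in page}
    return {t: tuple(c[t] for c in counts) for t in tokens}

def equivalence_classes(pages):
    # maximal sets of tokens sharing the same occurrence-vector
    classes = defaultdict(set)
    for token, vec in occurrence_vectors(pages).items():
        classes[vec].add(token)
    return classes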
Some Terminology (2) • [Figure: tokens with occurrence-vectors <1,1,1,1> and <1,2,1,0>; a token whose occurrences are generated by more than one template-token has no unique role.]
Some Terminology (3) • For real pages, an equivalence class of large size and support is usually valid, where the support of a token is defined as the number of pages in which the token occurs. • Example of an invalid equivalence class: {Data, Mining, Jeff, 2, Jane, 6}, with occurrence-vector <0, 1, 0, 0>.
Some Terminology (4) • Equivalence classes with large size and support are called LFEQs (Large and Frequent EQuivalence classes). LFEQs are rarely formed by “chance”. • The thresholds for size and support (SizeThres, SupThres) are set by the user, as in the sketch below.
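Continuing the sketch above, an LFEQ filter might look like this (size_thres and sup_thres stand in for the user-set SizeThres and SupThres; the concrete default values are only illustrative):

def lfeqs(classes, size_thres=3, sup_thres=2):
    # classes: mapping occurrence-vector -> set of tokens,
    # as returned by equivalence_classes() above
    result = {}
    for vec, tokens in classes.items():
        support = sum(1 for f in vec if f > 0)   # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:
            result[vec] = tokens
    return result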
Some Terminology (5) • Valid equivalence classes have two properties: ordering and nesting. • Back to our own example: • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Ordering: A > B > C > D • Nesting: B > E > C
Important Observations • In practice, two page-tokens with different occurrence-paths (their paths in the HTML parse tree, obtained with an HTML parser) have different roles. • Two page-tokens with the same occurrence-path but different neighbours also have different roles.
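A sketch of how occurrence-paths can be obtained with an HTML parser (using only Python's standard html.parser; the tokenizer itself is our own illustration):

from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    # records each word together with its occurrence-path, i.e. the
    # stack of tags that are open where the word appears
    def __init__(self):
        super().__init__()
        self.path, self.tokens = [], []

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            while self.path.pop() != tag:   # tolerate bad nesting
                pass

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append(("/".join(self.path), word))

p = PathTokenizer()
p.feed("<html><body><b>Reviews</b> 5 reviews</body></html>")
print(p.tokens)
# [('html/body/b', 'Reviews'), ('html/body', '5'), ('html/body', 'reviews')]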
Constructing Template (1) • The extraction algorithm determines which positions between consecutive tokens of an equivalence class are non-empty. • A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty otherwise.
Constructing Template (2) • Tokens connected by empty positions belong to the template. • Non-empty positions hold either basic types (strings extracted from the database) or a more complex type. • This unknown type can be determined by inspecting the input pages, as in the sketch below.
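A sketch of the position check, under a simplifying assumption of our own: each equivalence-class token occurs exactly once per page (i.e. has a unique role):

def position_contents(pages, eq_tokens):
    # for each position between consecutive equivalence-class tokens,
    # collect what appears there across all pages
    slots = [set() for _ in range(len(eq_tokens) - 1)]
    for page in pages:
        idx = [page.index(t) for t in eq_tokens]
        for i in range(len(slots)):
            slots[i].add(tuple(page[idx[i] + 1: idx[i + 1]]))
    # a position is empty (pure template) iff nothing ever appears in it
    return ["empty" if s == {()} else "non-empty" for s in slots]

pages = [["Title:", "EXALG", "Cost:", "free"],
         ["Title:", "RoadRunner", "Cost:", "free"]]
print(position_contents(pages, ["Title:", "Cost:"]))   # ['non-empty']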
Experimental Results (1) • The approach is compared mainly with RoadRunner; RoadRunner, however, makes simplifying assumptions. • The first six sets of web pages are obtained from the RoadRunner site. • The last three sets have more complex structure.
Concluding Remarks • EXALG first discovers the unknown template that generated the pages, then uses the discovered template to extract the data from the input pages. • Besides producing very good results, EXALG never completely fails to extract data, even when some of its assumptions are not met by the input collection. • No human intervention is needed: the template and the data are obtained automatically.
Future Work • Automatically locate collections of pages that are structured. • Check whether it is feasible to generate a large database from such pages.