
Extracting Structured Data from Web Pages

By Arsun ARTEL and Özgün ÖZIŞIKYILMAZ, 05.11.2003. Instructor: Prof. Taflan Gündem. Presentation outline: Motivation, Example Pages, Model & Problem Formulation, Approach in Detail (General, Underlying Terminology, Modules and their operations).


Presentation Transcript


  1. Extracting Structured Data from Web Pages By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003 Instructor: Prof. Taflan Gündem

  2. Presentation Outline • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  3. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  4. Motivation • There are many web sites that contain a large collection of “structured” pages. • Extracting structured data from the web pages is useful, since it enables us to pose complex queries over the data. • This paper focuses on the problem of automatically extracting structured data from a collection of pages.

  5. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  6. Example Pages • In the real world there are many examples of structured web pages, e.g. the Amazon and eBay web sites. • Two examples from www.amazon.com: • My System • An Eternal Golden Braid

  7. Example Pages (My System: 21st Century Edition)

  8. Example Pages (An Eternal Golden Braid)

  9. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  10. Underlying Problems • Complex Schema: The “schema” of the information encoded in the web pages can be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on. • Template vs. Data: Syntactically, nothing distinguishes the text that is part of the template from the text that is part of the data.

  11. How is a page created with a template? (Figure: a value x extracted from the database)

  12. Basic Type, Tuples and Sets • Basic Type: b, the basic unit of text • Tuple: ordered list of types, <T1, T2, …, Tn> • Set: {T1} • Example: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >

  13. Schema and Instance < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
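As a rough illustration of how the instance above relates to a schema, the sketch below models basic types, tuples and sets in Python and checks an instance against a schema. The representation (`BASIC`, `tuple_of`, `set_of`, `conforms`) is hypothetical, and the schema <b, {<b, b>}, b> for the book example is an assumption read off the instance, not something stated on the slide.

```python
# Hypothetical encoding of the slide's type system (names are illustrative only):
#   "b"                 -> basic type, a unit of text
#   ("tuple", [T1..Tn]) -> ordered list of types
#   ("set", T)          -> set of values of type T (stored as a list)
BASIC = "b"

def tuple_of(*types):
    return ("tuple", list(types))

def set_of(element_type):
    return ("set", element_type)

def conforms(value, schema):
    """Return True if `value` is an instance of `schema`."""
    if schema == BASIC:
        return isinstance(value, str)
    kind, inner = schema
    if kind == "tuple":
        return (isinstance(value, tuple)
                and len(value) == len(inner)
                and all(conforms(v, t) for v, t in zip(value, inner)))
    if kind == "set":
        return isinstance(value, list) and all(conforms(v, inner) for v in value)
    return False

# Assumed schema for the book example: <b, {<b, b>}, b>
book_schema = tuple_of(BASIC, set_of(tuple_of(BASIC, BASIC)), BASIC)
book = ("C Programming Language",
        [("Brian", "Kernighan"), ("Dennis", "Ritchie")],
        "$30.00")

print(conforms(book, book_schema))
```

The set is kept as a Python list so that the order in which authors appear on a page can be preserved.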

  14. Template Definition • Own example: • Schema: S = <b, {b}, b> • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
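The encoding of an instance with this template can be sketched in a few lines of Python. The function name `encode` is hypothetical, and treating E as a separator placed between consecutive set elements is an assumption made to reproduce the instance shown on the slide.

```python
def encode(instance, A="Title:", B="Presented by:", C="Cost:", D=" ", E="and"):
    """Encode an instance of schema <b, {b}, b> with template <A * B {*}E C * D>.

    Assumption: E is written between consecutive elements of the set.
    """
    title, presenters, cost = instance
    names = f" {E} ".join(presenters)
    return f"{A} {title} {B} {names} {C} {cost}{D}"

page = encode(("Extracting Structured Data", ["Arsun", "Özgün"], "1hr"))
print(page)
```

Running the sketch on the slide's values yields the same text as the instance of TS above, followed by the single space that D stands for.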

  15. Encoding λ(T1, x1) (Figure: encoding of instance x1 with template T1)

  16. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  17. General Description of EXALG

  18. Multiple Pages (Figure: input pages, each with a set of reviewers)

  19. Correct Solution for those pages

  20. Some Terminology (1) • The occurrence-vector of a token t is defined as the vector <f1, f2, …, fn>, where fi is the number of occurrences of t in the i-th page. • An equivalence class is a maximal set of tokens having the same occurrence-vector. • A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
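The two definitions above can be illustrated with a minimal Python sketch. The function names and the sample pages are hypothetical, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter, defaultdict

def occurrence_vectors(pages):
    """Map each token to <f1, ..., fn>, its count in each page."""
    counts = [Counter(page.split()) for page in pages]  # whitespace tokens (assumption)
    vocabulary = set().union(*counts)
    return {token: tuple(c[token] for c in counts) for token in vocabulary}

def equivalence_classes(pages):
    """Group tokens that share the same occurrence-vector."""
    groups = defaultdict(set)
    for token, vector in occurrence_vectors(pages).items():
        groups[vector].add(token)
    return dict(groups)

# Hypothetical two-page collection
pages = ["Title: A Cost: 1", "Title: B Cost: 2"]
for vector, tokens in sorted(equivalence_classes(pages).items()):
    print(vector, sorted(tokens))
```

Grouping by the full vector makes each class maximal by construction: every token with that vector ends up in the same set.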

  21. Some Terminology (2) (Figure: tokens with occurrence vectors <1,1,1,1> and <1,2,1,0>; a token with no unique role)

  22. Some Terminology (3) • For real pages, an equivalence class of large size and support is usually valid, where the support of a token is defined as the number of pages in which the token occurs. • Example of an invalid equivalence class: {Data, Mining, Jeff, 2, Jane, 6}, which has occurrence vector <0, 1, 0, 0>

  23. Some Terminology (4) • The equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence class). LFEQs are rarely formed by “chance”. • Threshold for size and support is set by the user (SizeThres, SupThres).
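A minimal sketch of the LFEQ filter, assuming equivalence classes are given as a mapping from occurrence-vector to token set. The names `support` and `find_lfeqs` are hypothetical, and using "greater than or equal" for both thresholds is an assumption. The sample data combines the vectors from slides 21 and 22 with made-up template tokens.

```python
def support(vector):
    """Number of pages in which tokens with this occurrence-vector occur."""
    return sum(1 for f in vector if f > 0)

def find_lfeqs(classes, size_thres, sup_thres):
    """Keep classes whose size and support both reach the user-set thresholds."""
    return {
        vector: tokens
        for vector, tokens in classes.items()
        if len(tokens) >= size_thres and support(vector) >= sup_thres
    }

classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Title:", "Cost:"},       # hypothetical tokens
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},  # slide 22 example
}
lfeqs = find_lfeqs(classes, size_thres=3, sup_thres=3)
print(list(lfeqs))
```

The second class is large, but its tokens occur in only one page, so it fails the support threshold and is discarded.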

  24. Some Terminology (5) • Valid equivalence class properties: Ordering and Nesting • Back to own example: • Template: TS = <A * B {*}E C * D> • A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’ • Ordered: A > B > C > D • Nesting: B > E > C

  25. Important Observations • In practice, two page-tokens with different occurrence-paths (obtained with an HTML parser) have different roles. • Two page-tokens with the same occurrence-path but different neighbours also have different roles.
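The first observation can be illustrated with Python's standard `html.parser` module. In this sketch the occurrence-path of a word is taken to be the chain of open tags from the root down to the word; the class name `PathTokenizer` and the sample HTML are hypothetical, and void tags such as `<br>` are not handled specially.

```python
from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    """Collect (word, occurrence-path) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.tokens = []   # (word, "root/.../parent") pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))

def tokens_with_paths(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

html = "<html><body><h1>Name</h1><p>Name</p></body></html>"
print(tokens_with_paths(html))
```

The word "Name" appears twice with different paths (`html/body/h1` and `html/body/p`), so under the observation above the two occurrences would be treated as having different roles.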

  26. Explanation of observations

  27. Modules and their operations

  28. Constructing Template (1) • The extraction algorithm determines which positions between consecutive tokens of an equivalence class are non-empty. • A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty otherwise.

  29. Constructing Template (2) • The tokens connected by empty positions belong to the template. • The non-empty positions hold either basic types (strings extracted from the database) or a more complex type. • This unknown type can be determined by inspecting the input pages.
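The empty versus non-empty test can be sketched as follows. This is a simplified illustration, not EXALG itself: it assumes each token of the equivalence class occurs exactly once per page and that the tokens are already given in page order. The function name and the sample pages are hypothetical.

```python
def classify_positions(pages, ordered_tokens):
    """Label the position between each pair of consecutive class tokens.

    pages: list of token lists; ordered_tokens: class tokens in page order.
    Assumption: every class token occurs exactly once in every page.
    """
    labels = {}
    for left, right in zip(ordered_tokens, ordered_tokens[1:]):
        always_contiguous = all(
            page.index(right) == page.index(left) + 1 for page in pages
        )
        labels[(left, right)] = "empty" if always_contiguous else "non-empty"
    return labels

pages = [
    "<h1> Title: </h1> Databases <p> Cost: </p> $30".split(),
    "<h1> Title: </h1> Data Mining <p> Cost: </p> $45".split(),
]
eq_class = ["<h1>", "Title:", "</h1>", "<p>", "Cost:", "</p>"]
for pair, label in classify_positions(pages, eq_class).items():
    print(pair, label)
```

Only the position between `</h1>` and `<p>` is non-empty here; it is where the data value (the book title) sits, while the contiguous tokens around it form template text.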

  30. Constructing Template (3)

  31. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  32. Experimental Results (1) • EXALG is compared with RoadRunner; however, RoadRunner makes simplifying assumptions. • The first six web pages are obtained from the RoadRunner site. • The last three web pages have a more complex structure.

  33. Experimental Results (2)

  34. What is next? • Motivation • Example Pages • Model & Problem Formulation • Approach in Detail (General; Underlying Terminology; Modules and their operations) • Experimental Results • Conclusion

  35. Concluding Remarks • EXALG first discovers the unknown template that generated the pages, then uses the discovered template to extract the data from the input pages. • Besides giving very good results, EXALG does not fail completely: it still extracts some data even when some of its assumptions are not met by the input collection. • No human intervention: the template and the data are obtained automatically.

  36. Future Work • Automatically locate collections of structured pages. • Check whether it is feasible to generate a large database from these pages.

  37. Questions & Answers
