Generating Data-Extraction Ontologies By Example

Joe Zhou, Data Extraction Group, Brigham Young University. Background: The World Wide Web contains a huge amount of useful information. Web data extraction is necessary for querying the data of interest.

Presentation Transcript


  1. Generating Data-Extraction Ontologies By Example Joe Zhou Data Extraction Group Brigham Young University

  2. Background • The World Wide Web contains a huge amount of useful information. • Web data extraction is necessary for querying the data of interest. • Most wrappers generate extraction patterns based on delimiters or HTML tags, so they are source-dependent. • The BYU ontology-based technique does not rely on such source-specific patterns, so it is resilient.

  3. Problem and Solution • The BYU ontology-based approach requires ontology experts to generate data-extraction ontologies for the domains that interest ordinary users. • A principal effort of our research is to automate the ontology-generation process as much as possible. • We developed a system, OntoByE (Ontology By Example), to generate data-extraction ontologies semi-automatically.

  4. Extraction Ontology • Object sets, Relationship sets and Constraints • Data frames for Lexical Object Sets

  5. Extraction Ontology • Object sets, Relationship sets and Constraints • Data frame for Digital Zoom
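The slide's data frame for Digital Zoom is not reproduced in the transcript. As a rough illustration of the idea only, the sketch below models a data frame as a set of value patterns plus keyword patterns that must appear near a candidate value. The `DataFrame` class, the regular expressions, the 15-character context window, and the sample sentence are all assumptions made for this example, not the notation or rules used by the BYU system.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFrame:
    """Hypothetical stand-in for a data frame: value patterns plus keyword patterns."""
    name: str
    value_patterns: List[str]
    keyword_patterns: List[str] = field(default_factory=list)

    def extract(self, text: str, window: int = 15) -> List[str]:
        """Return values that match a value pattern and have a keyword nearby."""
        found = []
        for pattern in self.value_patterns:
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                # Inspect a fixed-size window of text around the candidate value.
                start = max(0, m.start() - window)
                context = text[start:m.end() + window]
                if not self.keyword_patterns or any(
                    re.search(k, context, flags=re.IGNORECASE)
                    for k in self.keyword_patterns
                ):
                    found.append(m.group(0))
        return found


# Assumed patterns: a zoom factor such as "4x" or "3.2x" near the phrase "digital zoom".
digital_zoom = DataFrame(
    name="DigitalZoom",
    value_patterns=[r"\b\d+(?:\.\d+)?x\b"],
    keyword_patterns=[r"digital\s+zoom"],
)

sample = "Features a 3x optical zoom and 4x digital zoom with a 2.5 inch LCD."
print(digital_zoom.extract(sample))
```

In this sketch the keyword requirement is what separates the digital zoom factor from the optical one: both values match the value pattern, but only one has the assumed keyword within the window.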

  6. OntoByE System Overview and Architecture [diagram; labels: Sample Pages; User Interface; Forms; Marked Pages; Data Frame Library; Ontology Generator; Extraction Ontology; Target Pages; Extraction Engine; Populated Database]

  7. OntoByE – User Interface

  8. Form Editor – Basic Form Elements

  9. Form Editor – Nesting Forms

  10. Form Editor – Creating Forms for Digital Camera Application

  11. Training Web Document Preparation

  12. Ontology Generator – Workflow [diagram; labels: Users; User-defined Forms; Marked HTML Pages; Data Frame Library; Form Analyzer; Context Phrase Locator; Data Frame Matcher; Keyword and Context Expression Recognizer; Data Frame Editor; Data Frames; Object Sets, Relationship Sets and Constraints; Extraction Ontology]

  13. Ontology Generator – Form Analyzer • Sample Form • Object and Relationship Sets and Constraints: BaseForm [0:1] A [1:*]; BaseForm [0:3] B [1:*]; BaseForm [0:*] C [1:*]; BaseForm [0:3] D1 [1:*] D2 [1:*] D3 [1:*]; BaseForm [0:*] E1 [1:*] E2 [1:*] E3 [1:*]
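The constraints listed on this slide can be reproduced with a small sketch. It assumes, and the transcript does not confirm, that the upper bound on the BaseForm side reflects how many entries a form element accepts (1, 3, or unbounded) and that multi-column elements such as D1..D3 become one relationship set. The function name `relationship_set` and the `sample_form` structure are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def relationship_set(base: str, columns: Sequence[str], max_entries: Optional[int]) -> str:
    """Render one relationship set with participation constraints.

    max_entries=None stands for an unbounded element and is written as '*'.
    """
    upper = "*" if max_entries is None else str(max_entries)
    parts = [f"{base} [0:{upper}]"]
    # Assumed from the slide: every value-side object set participates as [1:*].
    parts += [f"{column} [1:*]" for column in columns]
    return " ".join(parts)


# Hypothetical description of the sample form: (column names, maximum entries).
sample_form: List[Tuple[List[str], Optional[int]]] = [
    (["A"], 1),
    (["B"], 3),
    (["C"], None),
    (["D1", "D2", "D3"], 3),
    (["E1", "E2", "E3"], None),
]

constraints = [relationship_set("BaseForm", cols, n) for cols, n in sample_form]
for line in constraints:
    print(line)
```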

  14. Ontology Generator – Form Analyzer • Digital Camera Application Forms • Object and Relationship Sets and Constraints

  15. Ontology Generator – Context Phrase Locator

  16. Ontology Generator – Data Frame Matcher • Data Frame Matching Heuristics: Number of matched data • Data Frame Ranking Heuristics: Number of matched data; Keywords and/or Contexts Matching; Order of Specialization/Generalization
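A minimal sketch of how two of these ranking heuristics might combine: candidates are scored by the number of user-marked values they match, and ties are broken by specialization. The slide does not say which direction the specialization order favors, so preferring the more specialized frame is an assumption here, as are the patterns, the `specificity` numbers, and the function name. The keyword and context heuristic is left out.

```python
import re

# (name, value pattern, specificity); a higher specificity means a more
# specialized frame. All three entries are assumed for illustration.
LIBRARY = [
    ("Integer", r"-?\d+", 0),
    ("SmallPositiveInteger", r"[1-9]\d?", 1),
    ("SingleDigit", r"\d", 2),
]


def rank_data_frames(marked_values, library=LIBRARY):
    """Rank candidate data frames for a list of user-marked strings."""
    scored = []
    for name, pattern, specificity in library:
        # Heuristic 1: count how many marked values the frame matches in full.
        matched = sum(1 for v in marked_values if re.fullmatch(pattern, v.strip()))
        if matched:
            scored.append((name, matched, specificity))
    # Most matches first; on a tie, the more specialized frame wins (assumption).
    scored.sort(key=lambda s: (-s[1], -s[2]))
    return [(name, matched) for name, matched, _ in scored]


print(rank_data_frames(["3", "4", "12"]))
```

With the marked values "3", "4", and "12", both Integer and SmallPositiveInteger match all three, so the tie-break decides which one is suggested first.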

  17. Ontology Generator – Keyword and Context Recognizer

  18. Ontology Generator – Data Frame Editor

  19. Extraction Ontology

  20. Experimental Preparation • Selected two domains of interest: Digital Camera Application and Apartment Rental Application • Constructed an initial data frame library: Integer (any integer value), SmallPositiveInteger (from 1 to 99), SingleDigit (from 0 to 9), RealNumber (any real value), SmallPositiveReal (from 0.01 to 99.99), Date, Email, PhoneNumber, and Price • Created application-dependent forms for each application • Collected 5 sample pages from different web sites for each domain • Marked desired data on sample pages
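To make the library's value ranges concrete, the sketch below encodes several of the listed data frames as simple acceptance checks. The ranges come from the slide; the regular expressions, the two-decimal limit on SmallPositiveReal, the US-style Price format, and the names `LIBRARY` and `matching_frames` are assumptions. Date, Email, and PhoneNumber are omitted because their formats are not given.

```python
import re


def _in_range(pattern, low, high, cast):
    """Build a check: the text must match the pattern and fall within [low, high]."""
    def check(text):
        return bool(re.fullmatch(pattern, text)) and low <= cast(text) <= high
    return check


# Assumed value recognizers for part of the initial data frame library.
LIBRARY = {
    "Integer": lambda t: bool(re.fullmatch(r"[+-]?\d+", t)),
    "SmallPositiveInteger": _in_range(r"\d{1,2}", 1, 99, int),
    "SingleDigit": lambda t: bool(re.fullmatch(r"\d", t)),
    "RealNumber": lambda t: bool(re.fullmatch(r"[+-]?(?:\d+\.?\d*|\.\d+)", t)),
    "SmallPositiveReal": _in_range(r"\d{1,2}(?:\.\d{1,2})?", 0.01, 99.99, float),
    "Price": lambda t: bool(
        re.fullmatch(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?|\$\d+(?:\.\d{2})?", t)
    ),
}


def matching_frames(text):
    """Return the names of all library frames that accept the given string."""
    return [name for name, accepts in LIBRARY.items() if accepts(text)]


for value in ["7", "100", "0.00", "$1,299.00"]:
    print(value, matching_frames(value))
```

A single value such as "7" is accepted by several frames at once, which is why a matcher needs some way to rank candidates rather than pick the first hit.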

  21. Experimental Results – Digital Camera Application

  22. Experimental Results – Digital Camera Application

  23. Experimental Results – Apartment Rental Application

  24. Experimental Results – Apartment Rental Application

  25. Experimental Results – Apartment Rental Application

  26. Experimental Observations – Strengths of OntoByE • OntoByE provides a friendly and intuitive interface that helps ordinary users describe data of interest without exposing them to abstract ontology concepts. • With a small initial data frame library and a small set of sample pages, OntoByE works well at finding and suggesting appropriate existing data frames for object sets with application-independent values. • OntoByE successfully recognizes possible keywords and contexts for user-marked data on sample pages and helps users create new data frames with those keywords and contexts.

  27. Experimental Observations – Limitations of OntoByE • How well OntoByE finds or constructs data frames is limited by the scope and quality of its prior knowledge. • The accuracy and completeness of keyword and context expression construction are limited by the number and representativeness of user samples. • Constructing value expressions for application-dependent data frames requires that users know how to write regular expressions.

  28. Conclusion • We implemented a user-friendly interface that lets ordinary users take advantage of our ontology-based web data-extraction approach. • We developed a framework for interacting with ordinary users to generate ontologies by example. • Our experiments demonstrate that OntoByE works well to generate ontologies with the assistance of limited prior knowledge. We expect its performance to improve as that prior knowledge expands.

  29. Future Work • Have OntoByE learn to build application-dependent lexicons for users’ applications • Improve the sub-components of the back-end ontology generator, e.g. Context Phrase Locator

  30. The end