A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System

Alan Wessman, Brigham Young University. MS Thesis Defense. Based in part on research funded by the National Science Foundation.


Presentation Transcript


  1. A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System • Alan Wessman • Brigham Young University • MS Thesis Defense • Based in part on research funded by the National Science Foundation.

  2. Presentation Overview • Background of legacy Ontos • Assumptions, challenges, concerns • Framework as solution • Explain framework • Explain reference implementation • Evaluation of system • Future work and conclusion

  3. Data Extraction • Goals of data extraction • Find relevant data in unstructured or semi-structured documents • Map extracted data to a formal structure • Approaches • Wrappers (ROADRUNNER, TSIMMIS) • NLP and machine learning (RAPIER, WHISK) • Ontologies (Ontos)

  4. Ontos • Developed by Data Extraction Group (DEG) at BYU • Based on OSM ontologies and data frames • Focuses on multiple-record extraction • Good precision/recall • Resilient to document changes

  5. How Ontos Works

  6. Ontos Assumptions • OSML ontologies • Single- or multiple-record text documents • Each document/record relevant to domain • Heuristics produce accurate mappings • Output to relational database

  7. Some Current Challenges

  8. Architectural Concerns • Variety of technologies • Different OSM representations • Highly coupled code • Difficult to install elsewhere • Difficult to upgrade or extend

  9. Thesis Statement • A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research. We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.

  10. Frameworks • Abstract architecture • Decouple independent functions • Define interfaces • Use abstract classes, interfaces, declarative configuration files • Allow quick adjustment of system settings without re-coding • Make a system customizable (Image from http://www.mcoe.org)

  11. Creating an Extraction Framework • Analyze systems • Generalize functionality • Define interfaces • Create supporting code • Document framework

  12. Managing the Process • DataExtractionEngine • Main class • Initialize, perform extraction, finalize • ExtractionPlan • Defines order of steps in the extraction process • Can be imperative, declarative, or dynamic (like SQL execution plan)
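The split between engine and plan described on slide 12 can be sketched roughly as follows. Only the names DataExtractionEngine and ExtractionPlan come from the slides; the method names, the `FixedOrderPlan` class, and the string trace are assumptions made for illustration, not the thesis code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: the slides name ExtractionPlan but not its methods.
interface ExtractionPlan {
    // Runs the plan's steps in whatever order the plan defines,
    // recording each step in the shared trace.
    void execute(List<String> trace);
}

// Assumed example of a plan whose step order is supplied as data.
class FixedOrderPlan implements ExtractionPlan {
    private final List<String> steps;

    FixedOrderPlan(List<String> steps) {
        this.steps = new ArrayList<>(steps);
    }

    @Override
    public void execute(List<String> trace) {
        for (String step : steps) {
            trace.add("step:" + step);
        }
    }
}

// Main class: initialize, perform extraction (delegated to the plan), finalize.
class DataExtractionEngine {
    private final ExtractionPlan plan;

    DataExtractionEngine(ExtractionPlan plan) {
        this.plan = plan;
    }

    List<String> run() {
        List<String> trace = new ArrayList<>();
        trace.add("initialize");
        plan.execute(trace);
        trace.add("finalize");
        return trace;
    }
}
```

Because the engine only knows the `ExtractionPlan` interface, a different ordering of steps can be tried by passing in another plan object without changing the engine itself.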

  13. Handling Documents • DocumentRetriever • Responsible for locating relevant documents • Search engine, local filesystem, CMS • DocumentStructureRecognizer • Decides which DocumentStructureParser to use • DocumentStructureParser • Breaks document into individual records or sub-documents • Record separator, table analyzer • ContentFilter • Normalizes document text • Strips out unwanted markup, stopwords, etc.
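A minimal sketch of how the document-handling roles on slide 13 might fit together. The interface name DocumentStructureParser is from the slides; the method signatures, the blank-line record rule, and the classes `SingleRecordParser`, `BlankLineRecordParser`, `SimpleStructureRecognizer`, and `WhitespaceFilter` are hypothetical stand-ins chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Breaks a document into records (signature is an assumption).
interface DocumentStructureParser {
    List<String> parse(String document);
}

// Treats the whole document as one record.
class SingleRecordParser implements DocumentStructureParser {
    @Override
    public List<String> parse(String document) {
        return Collections.singletonList(document.trim());
    }
}

// Assumed rule for this sketch: records are separated by blank lines.
class BlankLineRecordParser implements DocumentStructureParser {
    @Override
    public List<String> parse(String document) {
        List<String> records = new ArrayList<>();
        for (String chunk : document.split("\\n\\s*\\n")) {
            if (!chunk.trim().isEmpty()) {
                records.add(chunk.trim());
            }
        }
        return records;
    }
}

// Plays the DocumentStructureRecognizer role: decides which parser to use.
class SimpleStructureRecognizer {
    private static final Pattern BLANK_LINE = Pattern.compile("\\n\\s*\\n");

    DocumentStructureParser choose(String document) {
        return BLANK_LINE.matcher(document.trim()).find()
                ? new BlankLineRecordParser()
                : new SingleRecordParser();
    }
}

// Plays the ContentFilter role: normalizes runs of whitespace.
class WhitespaceFilter {
    String filter(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }
}
```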

  14. Extracting Values • ValueRecognizer • Uses matching rules defined in ontology • Produces set of candidate matches (like data record table) • ValueMapper • Accepts or rejects candidate matches • Assigns accepted matches to elements of the ontology (e.g., object sets) • OntologyWriter • Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
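The recognize-then-map flow on slide 14 can be illustrated with a small sketch. It assumes matching rules are plain regular expressions keyed by object-set name, and it uses a deliberately simple "earliest match wins" rule in place of the real mapping heuristics; `Candidate`, `RegexValueRecognizer`, and `FirstMatchMapper` are hypothetical names, not classes from the framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One candidate match: which object set it may belong to, the text, and its offset.
class Candidate {
    final String objectSet;
    final String value;
    final int start;

    Candidate(String objectSet, String value, int start) {
        this.objectSet = objectSet;
        this.value = value;
        this.start = start;
    }
}

// ValueRecognizer role: applies each rule and collects every match as a candidate.
class RegexValueRecognizer {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    void addRule(String objectSet, String regex) {
        rules.put(objectSet, Pattern.compile(regex));
    }

    List<Candidate> recognize(String text) {
        List<Candidate> candidates = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                candidates.add(new Candidate(rule.getKey(), m.group(), m.start()));
            }
        }
        return candidates;
    }
}

// ValueMapper role, with an assumed toy policy: keep the earliest candidate
// per object set and reject the rest.
class FirstMatchMapper {
    Map<String, String> map(List<Candidate> candidates) {
        Map<String, Candidate> best = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate current = best.get(c.objectSet);
            if (current == null || c.start < current.start) {
                best.put(c.objectSet, c);
            }
        }
        Map<String, String> accepted = new LinkedHashMap<>();
        for (Map.Entry<String, Candidate> e : best.entrySet()) {
            accepted.put(e.getKey(), e.getValue().value);
        }
        return accepted;
    }
}
```

Keeping recognition and mapping separate means the candidate set can be inspected, or a different acceptance policy swapped in, without touching the matching rules.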

  15. Implementing the Framework

  16. OSMX • Legacy Ontos: OSML • OntologyEditor: OSM.dtd • New standard is OSMX • XML Schema (better constraints; validation) • JAXB generates corresponding Java classes • Common language for DEG tools • Allows data to be stored inline with model

  17. Managing the Process • OntosEngine • Main class for Ontos system • Takes parameters from command line or configuration file • OntosExtractionPlan • Sequentially retrieves, parses, filters, and extracts from individual documents • Imperative (hard-coded) algorithm

  18. Handling Documents • LocalDocumentRetriever • Retrieves documents from local filesystem • Filename filter excludes irrelevant files • FanoutRecordSeparator • Implements DocumentStructureParser • Locates record boundaries and creates sub-documents • HTMLFilter • Removes all HTML markup from documents
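As a rough illustration of the separate-then-filter order used on slide 18, the sketch below splits an HTML page into records and then strips markup from each one. It does not implement the fan-out heuristic of FanoutRecordSeparator: `TagRecordSeparator` simply assumes records are divided by `<hr>` tags, and `SimpleHtmlFilter` is a hypothetical regex-based stand-in for HTMLFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in record separator (NOT the fan-out heuristic): assumes <hr> divides records.
class TagRecordSeparator {
    private static final Pattern HR = Pattern.compile("(?i)<hr\\s*/?>");

    List<String> split(String html) {
        List<String> records = new ArrayList<>();
        for (String part : HR.split(html)) {
            if (!part.trim().isEmpty()) {
                records.add(part);
            }
        }
        return records;
    }
}

// Stand-in markup filter: drops anything that looks like a tag, then tidies whitespace.
class SimpleHtmlFilter {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    String filter(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

Separation runs before filtering here because the record boundaries are carried by the markup that the filter removes.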

  19. Recognizing Values: DataFrameMatcher • Uses data frame enhancements: • Keyword affinity (left and right) • Require context for left, right, or both • Value phrase-specific keywords • Link matches back to specific patterns • Other improvements: • Consistent regular expression handling • Unlimited recursive macro definition

  20. Mapping Values: HeuristicBasedMapper • New algorithm • Fully recursive with respect to ontology structure • ContextualHeuristic generates objects • Connection-based heuristics (singleton, nested-group, etc.) generate relationships • See paper for additional details

  21. Output • Human-readable HTML format • Easier to count correct, partial, incorrect mappings

  22. Using the Framework and Reference Implementation • Adding new features • Create new implementation classes • Extend (subclass) existing implementations • Switching feature set • Change class name in config file • Override class on command line
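Switching an implementation by class name, as slide 22 describes, is commonly done in Java with reflection. The sketch below is one way this could look; the property key `contentFilter.class`, the `ComponentLoader` class, and the two filter classes are assumptions for illustration and are not taken from the Ontos configuration format.

```java
import java.util.Locale;
import java.util.Properties;

// Assumed signature for the ContentFilter role.
interface ContentFilter {
    String filter(String text);
}

// Default implementation: leaves text unchanged.
class PassThroughFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text;
    }
}

// Alternative implementation selected purely by name.
class LowerCaseFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}

class ComponentLoader {
    // Resolution order: command-line override, then config file entry, then default.
    static ContentFilter loadFilter(Properties config, String commandLineOverride) {
        String className = commandLineOverride != null
                ? commandLineOverride
                : config.getProperty("contentFilter.class", PassThroughFilter.class.getName());
        try {
            Class<?> cls = Class.forName(className);
            return (ContentFilter) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load filter class " + className, e);
        }
    }
}
```

With this arrangement, a new feature is added by writing a class that implements the interface and naming it in the configuration; no existing code needs to be recompiled.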

  23. Evaluating the Framework • Input: • Obituaries ontology • 25 obituaries from two newspapers • [Table: four of eighteen object sets shown; data from Salt Lake Tribune and Arizona Daily Star]

  24. Statistics about the System • [Table of system statistics] • * Includes comments and whitespace. • ** JAXB-generated classes add 197 files and 62,888 lines of code.

  25. Future Work • Algorithm improvements • On-the-fly lexicons • Machine learning techniques • Confidence values • Canonicalization • Expected participation cardinality • Negative-indicator keywords • Integration • Online search engines • Semantic Web annotator and query engine • Web interface to extraction engine

  26. Contributions • Design and construction of a data-extraction framework • Reference implementation • Ontos upgrade • Pattern for future use of framework • OSMX • Standardized storage format • http://www.deg.byu.edu/xml/osmx.xsd

  27. Contributions • Uniform codebase and language • OntologyEditor migration • New graphics classes • Extended data frame support • Modular heuristic-based mapper • Concept of extraction plans • Flexible research platform

  28. Conclusion • Framework gives us the flexibility we need for further data-extraction research • Framework is capable of supporting Ontos functionality • OSMX and reference implementation provide solid base for future research applications
