A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System

Alan Wessman
Brigham Young University
MS Thesis Defense
Based in part on research funded by the National Science Foundation.

Presentation Overview
• Background of legacy Ontos
• Assumptions, challenges, concerns
• Framework as solution
• Explain framework
• Explain reference implementation
• Evaluation of system
• Future work and conclusion

Data Extraction
• Goals of data extraction
  • Find relevant data in unstructured or semi-structured documents
  • Map extracted data to a formal structure
• Approaches
  • Wrappers (ROADRUNNER, TSIMMIS)
  • NLP and machine learning (RAPIER, WHISK)
  • Ontologies (Ontos)

Ontos
• Developed by Data Extraction Group (DEG) at BYU
• Based on OSM ontologies and data frames
• Focuses on multiple-record extraction
• Good precision/recall
• Resilient to document changes

How Ontos Works

Ontos Assumptions
• OSML ontologies
• Single- or multiple-record text documents
• Each document/record relevant to domain
• Heuristics produce accurate mappings
• Output to relational database

Some Current Challenges

Architectural Concerns
• Variety of technologies
• Different OSM representations
• Highly coupled code
• Difficult to install elsewhere
• Difficult to upgrade or extend

Thesis Statement
A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research. We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.

Frameworks
• Abstract architecture
• Decouple independent functions
• Define interfaces
• Use abstract classes, interfaces, declarative configuration files
• Allow quick adjustment of system settings without re-coding
• Make a system customizable
(Image from http://www.mcoe.org)

Creating an Extraction Framework
• Analyze systems
• Generalize functionality
• Define interfaces
• Create supporting code
• Document framework

Managing the Process
• DataExtractionEngine
  • Main class
  • Initialize, perform extraction, finalize
• ExtractionPlan
  • Defines order of steps in the extraction process
  • Can be imperative, declarative, or dynamic (like SQL execution plan)
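The relationship between the engine and the plan can be sketched in Java as follows. This is a minimal illustration, not the thesis code: only the names DataExtractionEngine and ExtractionPlan come from the slide, while ExtractionStep, the method signatures, and the string log are assumptions made for the example. The plan shown is the imperative kind (a fixed list of steps).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; only DataExtractionEngine and ExtractionPlan are named on the slide. */
final class EngineSketch {

    /** One unit of work in the extraction process (assumed abstraction). */
    interface ExtractionStep {
        void run(List<String> log);
    }

    /** Defines the order of steps in the extraction process. */
    interface ExtractionPlan {
        List<ExtractionStep> steps();
    }

    /** Main class: initialize, perform extraction, finalize. */
    static final class DataExtractionEngine {
        private final ExtractionPlan plan;

        DataExtractionEngine(ExtractionPlan plan) {
            this.plan = plan;
        }

        List<String> execute() {
            List<String> log = new ArrayList<>();
            log.add("initialize");
            // The engine does not know what the steps are; it only follows the plan's order.
            for (ExtractionStep step : plan.steps()) {
                step.run(log);
            }
            log.add("finalize");
            return log;
        }
    }

    /** Builds a fixed (imperative) plan and runs it. */
    static List<String> demo() {
        List<ExtractionStep> steps = new ArrayList<>();
        steps.add(log -> log.add("retrieve"));
        steps.add(log -> log.add("parse"));
        steps.add(log -> log.add("extract"));
        ExtractionPlan imperativePlan = () -> steps;
        return new DataExtractionEngine(imperativePlan).execute();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the engine depends only on the plan interface, a declarative or dynamic plan could be substituted by returning a differently computed step list, without touching the engine.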

Handling Documents
• DocumentRetriever
  • Responsible for locating relevant documents
  • Search engine, local filesystem, CMS
• DocumentStructureRecognizer
  • Decides which DocumentStructureParser to use
• DocumentStructureParser
  • Breaks document into individual records or sub-documents
  • Record separator, table analyzer
• ContentFilter
  • Normalizes document text
  • Strips out unwanted markup, stopwords, etc.
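A small Java sketch may help show how a recognizer could choose between parsers. The interface name DocumentStructureParser comes from the slide; the two parser classes, the method signatures, and the rule "use a separator parser when the text contains <hr>" are assumptions for illustration and stand in for real structure recognition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the recognizer/parser split; signatures are assumed. */
final class DocumentHandlingSketch {

    /** Breaks a document into individual records or sub-documents. */
    interface DocumentStructureParser {
        List<String> parse(String document);
    }

    /** Treats the whole document as a single record. */
    static final class SingleRecordParser implements DocumentStructureParser {
        public List<String> parse(String document) {
            List<String> records = new ArrayList<>();
            records.add(document.trim());
            return records;
        }
    }

    /** Splits on a literal separator string (a stand-in for real boundary detection). */
    static final class SeparatorParser implements DocumentStructureParser {
        private final String separator;

        SeparatorParser(String separator) {
            this.separator = separator;
        }

        public List<String> parse(String document) {
            List<String> records = new ArrayList<>();
            for (String part : document.split(Pattern.quote(separator))) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    records.add(trimmed);
                }
            }
            return records;
        }
    }

    /** Recognizer role: decides which parser to use (assumed toy rule). */
    static DocumentStructureParser recognize(String document) {
        return document.contains("<hr>")
                ? new SeparatorParser("<hr>")
                : new SingleRecordParser();
    }

    /** Convenience: recognize the structure, then parse with the chosen parser. */
    static List<String> split(String document) {
        return recognize(document).parse(document);
    }

    public static void main(String[] args) {
        System.out.println(split("First record<hr>Second record<hr> Third record "));
        System.out.println(split("A single-record document"));
    }
}
```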

Extracting Values
• ValueRecognizer
  • Uses matching rules defined in ontology
  • Produces set of candidate matches (like data record table)
• ValueMapper
  • Accepts or rejects candidate matches
  • Assigns accepted matches to elements of the ontology (e.g., object sets)
• OntologyWriter
  • Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
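The recognize-then-map split can be illustrated with a short Java sketch. Everything here is assumed for the example: the Candidate class, the object-set names "Age" and "DeathDate", the regular expressions, the made-up sample record, and especially the mapping policy (accept the earliest candidate per object set), which is a placeholder and not one of the Ontos heuristics.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the ValueRecognizer and ValueMapper roles. */
final class ValueExtractionSketch {

    /** A candidate match: which object set it is for, the text, and where it starts. */
    static final class Candidate {
        final String objectSet;
        final String text;
        final int start;

        Candidate(String objectSet, String text, int start) {
            this.objectSet = objectSet;
            this.text = text;
            this.start = start;
        }
    }

    /** Recognizer role: apply every matching rule and collect all candidate matches. */
    static List<Candidate> recognize(Map<String, Pattern> rules, String record) {
        List<Candidate> candidates = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(record);
            while (m.find()) {
                candidates.add(new Candidate(rule.getKey(), m.group(), m.start()));
            }
        }
        return candidates;
    }

    /** Mapper role with a placeholder policy: keep the earliest candidate per object set. */
    static Map<String, String> map(List<Candidate> candidates) {
        Map<String, Candidate> accepted = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate current = accepted.get(c.objectSet);
            if (current == null || c.start < current.start) {
                accepted.put(c.objectSet, c);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Candidate c : accepted.values()) {
            result.put(c.objectSet, c.text);
        }
        return result;
    }

    /** Made-up rules and a made-up obituary-style record, for illustration only. */
    static Map<String, String> demo() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("Age", Pattern.compile("\\b\\d{1,3}(?= years old)"));
        rules.put("DeathDate", Pattern.compile("[A-Z][a-z]+ \\d{1,2}, \\d{4}"));
        String record = "Jane Doe, 87 years old, died March 3, 2004. Born May 1, 1917.";
        return map(recognize(rules, record));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the sample record the date rule yields two candidates, and the placeholder policy rejects the second one. A real mapper would use context to make that decision, which is the job the slide assigns to ValueMapper.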

Implementing the Framework

OSMX
• Legacy Ontos: OSML
• OntologyEditor: OSM.dtd
• New standard is OSMX
  • XML Schema (better constraints; validation)
  • JAXB generates corresponding Java classes
  • Common language for DEG tools
  • Allows data to be stored inline with model

Managing the Process
• OntosEngine
  • Main class for Ontos system
  • Takes parameters from command line or configuration file
• OntosExtractionPlan
  • Sequentially retrieves, parses, filters, and extracts from individual documents
  • Imperative (hard-coded) algorithm

Handling Documents
• LocalDocumentRetriever
  • Retrieves documents from local filesystem
  • Filename filter excludes irrelevant files
• FanoutRecordSeparator
  • Implements DocumentStructureParser
  • Locates record boundaries and creates sub-documents
• HTMLFilter
  • Removes all HTML markup from documents
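The two simplest components on this slide can be approximated in a few lines of Java. This is a sketch under stated assumptions: the rule that only .html and .htm files are relevant is invented for the example, and the regular-expression tag stripper is a crude approximation that does not handle comments, scripts, or entities the way a real HTML filter would.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of local retrieval with a filename filter and markup removal. */
final class LocalDocumentsSketch {

    /** Filename filter: accepts only .html/.htm files (assumed notion of relevance). */
    static final FilenameFilter HTML_ONLY = (dir, name) -> {
        String lower = name.toLowerCase();
        return lower.endsWith(".html") || lower.endsWith(".htm");
    };

    /** Lists matching files in a local directory, sorted by path for repeatable runs. */
    static List<File> retrieve(File directory) {
        File[] found = directory.listFiles(HTML_ONLY);
        List<File> result = new ArrayList<>();
        if (found != null) { // listFiles returns null if the path is not a directory
            Arrays.sort(found);
            result.addAll(Arrays.asList(found));
        }
        return result;
    }

    /** Crude markup removal: replace tags with spaces, then collapse whitespace. */
    static String stripMarkup(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(retrieve(new File(".")));
        System.out.println(stripMarkup("<p>Died <b>March 3</b></p>"));
    }
}
```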

Recognizing Values: DataFrameMatcher
• Uses data frame enhancements:
  • Keyword affinity (left and right)
  • Require context for left, right, or both
  • Value phrase-specific keywords
  • Link matches back to specific patterns
• Other improvements:
  • Consistent regular expression handling
  • Unlimited recursive macro definition
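Recursive macro definition, where one macro may refer to another to any depth, can be sketched in Java as below. The {Name} reference syntax, the macro names, and the cycle check are assumptions made for this example; the slide does not specify how DataFrameMatcher writes or expands macros.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of recursive macro expansion inside regular expressions. */
final class MacroExpansionSketch {

    /** Assumed reference syntax: {Name}. Quantifiers such as {2} or {1,2} do not match. */
    private static final Pattern REFERENCE = Pattern.compile("\\{([A-Za-z_]\\w*)\\}");

    static String expand(String expression, Map<String, String> macros) {
        return expand(expression, macros, new HashSet<String>());
    }

    private static String expand(String expression, Map<String, String> macros, Set<String> active) {
        Matcher m = REFERENCE.matcher(expression);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String body = macros.get(name);
            if (body == null) {
                // Unknown name: leave the text untouched.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
                continue;
            }
            if (!active.add(name)) {
                throw new IllegalArgumentException("Cyclic macro definition: " + name);
            }
            // Expand the macro body first, so nesting depth is unlimited.
            String expanded = expand(body, macros, active);
            active.remove(name);
            m.appendReplacement(out, Matcher.quoteReplacement("(?:" + expanded + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Made-up macros: Date is defined in terms of Month and Day. */
    static String demo() {
        Map<String, String> macros = new HashMap<>();
        macros.put("Month", "January|February|March");
        macros.put("Day", "\\d{1,2}");
        macros.put("Date", "{Month} {Day}");
        return expand("{Date}, \\d{4}", macros);
    }

    public static void main(String[] args) {
        String regex = demo();
        System.out.println(regex);
        System.out.println(Pattern.compile(regex).matcher("March 3, 2004").matches());
    }
}
```

Wrapping each expansion in a non-capturing group keeps alternations such as the month list from leaking into the surrounding expression.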

Mapping Values: HeuristicBasedMapper
• New algorithm
• Fully recursive with respect to ontology structure
• ContextualHeuristic generates objects
• Connection-based heuristics (singleton, nested-group, etc.) generate relationships
• See paper for additional details

Output
• Human-readable HTML format
• Easier to count correct, partial, incorrect mappings

Using the Framework and Reference Implementation
• Adding new features
  • Create new implementation classes
  • Extend (subclass) existing implementations
• Switching feature set
  • Change class name in config file
  • Override class on command line
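Switching an implementation by class name is commonly done with reflection, and a minimal Java sketch of that idea follows. The property key contentFilter, the --contentFilter= command-line flag, and both filter classes are hypothetical names invented for this example; the slide does not give the actual Ontos configuration keys or option syntax.

```java
import java.util.Properties;

/** Hypothetical sketch: choose an implementation class from config, with a command-line override. */
final class PluggableComponentSketch {

    public interface ContentFilter {
        String filter(String text);
    }

    public static final class LowercaseFilter implements ContentFilter {
        public String filter(String text) {
            return text.toLowerCase();
        }
    }

    public static final class TrimFilter implements ContentFilter {
        public String filter(String text) {
            return text.trim();
        }
    }

    /** Reads the class name from the config, lets a command-line argument override it, then instantiates. */
    static ContentFilter load(Properties config, String... args) {
        String className = config.getProperty("contentFilter"); // assumed key
        String flag = "--contentFilter=";                       // assumed flag
        for (String arg : args) {
            if (arg.startsWith(flag)) {
                className = arg.substring(flag.length());
            }
        }
        if (className == null) {
            throw new IllegalStateException("No content filter class configured");
        }
        try {
            return Class.forName(className)
                    .asSubclass(ContentFilter.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    /** Runs the same input through the configured class and through an overriding class. */
    static String demo() {
        Properties config = new Properties();
        config.setProperty("contentFilter", LowercaseFilter.class.getName());
        String fromConfig = load(config).filter(" Ontos ");
        String overridden = load(config, "--contentFilter=" + TrimFilter.class.getName()).filter(" Ontos ");
        return fromConfig + "|" + overridden;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this arrangement, adding a feature means writing a new class that implements the interface, and switching to it requires no recompilation of the code that calls load.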

Evaluating the Framework
• Input:
  • Obituaries ontology
  • 25 obituaries from two newspapers
Four of eighteen object sets shown above. Data from Salt Lake Tribune and Arizona Daily Star.

Statistics about the System
* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.

Future Work
• Algorithm improvements
  • On-the-fly lexicons
  • Machine learning techniques
  • Confidence values
  • Canonicalization
  • Expected participation cardinality
  • Negative-indicator keywords
• Integration
  • Online search engines
  • Semantic Web annotator and query engine
  • Web interface to extraction engine

Contributions
• Design and construction of a data-extraction framework
• Reference implementation
  • Ontos upgrade
  • Pattern for future use of framework
• OSMX
  • Standardized storage format
  • http://www.deg.byu.edu/xml/osmx.xsd

Contributions
• Uniform codebase and language
• OntologyEditor migration
• New graphics classes
• Extended data frame support
• Modular heuristic-based mapper
• Concept of extraction plans
• Flexible research platform

Conclusion
• Framework gives us the flexibility we need for further data-extraction research
• Framework is capable of supporting Ontos functionality
• OSMX and reference implementation provide solid base for future research applications
