
An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System

Alan Wessman, Brigham Young University. Based on research supported by NSF.


Presentation Transcript


  1. An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System
     Alan Wessman, Brigham Young University
     Based on research supported by NSF

  2. Data Extraction
     • Goal: Find useful information in documents without known formal structure
     • Primary tasks:
       • Locate data of interest to application
       • Map identified data to an ontology

  3. Ontos
     • BYU approach to data extraction
     • Domain knowledge encoded as ontology
       • Defines target data structure
       • Contains data recognition rules (“data frames”)
     • Heuristics map extracted values to ontology
       • Populate sets of objects and relationships
       • Infer nonlexical objects
       • Satisfy ontology constraints
     • Ontos algorithm puts it all together

  4. Current Heuristics
     • Object sets processed in order of appearance
     • Accept-or-reject: Early bad choice prevents later better choices

     Ontology excerpt:

         --- OBITUARIES ONTOLOGY ---
         Marriage Date matches [20]
           keyword "\bmarried\b";
         end;
         Funeral Date matches [20]
           keyword "\bfuneral\b";
         end;
         -- Deceased Person
         Deceased Person [-> object];
         Deceased Person [0:1] has Marriage Date [1:*];
         Deceased Person [0:1] has Funeral [1];
         ...
         -- Funeral
         Funeral [0:1] is on Funeral Date [1:*];
         ...
         -- Generalization/Specializations
         Marriage Date, Funeral Date : Date;

     Sample document:

         Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. He was born June 12, 1914 in Salt Lake City, Utah. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by BRING'S MEMORIAL CHAPEL, 236 S. Scott
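The accept-or-reject problem on this slide can be illustrated with a toy Python sketch. It is not the Ontos implementation: the date pattern, the function names `accept_in_order` and `keyword_guided`, and the "first date after the keyword" rule are all assumptions made for illustration. Only the two keyword patterns and the (shortened) obituary text come from the slide.

```python
import re

# Shortened, lightly cleaned text from the obituary on the slide.
TEXT = (
    "Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. "
    "He was born June 12, 1914 in Salt Lake City, Utah. "
    "Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward."
)

# Simplified date recognizer (assumption): "Month D, YYYY".
DATE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

# Keyword patterns copied from the ontology excerpt.
KEYWORDS = {"Marriage Date": r"\bmarried\b", "Funeral Date": r"\bfuneral\b"}


def accept_in_order(text, object_sets):
    """Give each object set the next unclaimed date, in order of appearance."""
    dates = [m.group() for m in DATE.finditer(text)]
    result = {}
    for name in object_sets:
        result[name] = dates.pop(0) if dates else None
    return result


def keyword_guided(text, object_sets):
    """Take the first date that follows the object set's keyword, if any."""
    dates = list(DATE.finditer(text))
    result = {}
    for name in object_sets:
        keyword = re.search(KEYWORDS[name], text, re.IGNORECASE)
        following = [
            d.group() for d in dates if keyword and d.start() >= keyword.end()
        ]
        result[name] = following[0] if following else None
    return result


OBJECT_SETS = ["Marriage Date", "Funeral Date"]
print(accept_in_order(TEXT, OBJECT_SETS))
print(keyword_guided(TEXT, OBJECT_SETS))
```

In this toy setting the in-order strategy hands the death date to Marriage Date, which leaves the birth date for Funeral Date, so one early wrong acceptance pushes later object sets onto wrong values as well. The keyword-guided variant leaves Marriage Date empty and picks the date that follows "Funeral".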

  5. Additional Problems
     • Generalization/specialization
     • Previously extracted data
     • Complex document structure
     • Overlapping value domains
     • Tunable parameters and extraction algorithm

  6. Generalization/Specialization

  7. Previously Extracted Data
     235. Foundations of Computer Science 1. (4:4:1) F, W, Sp, Su Prerequisite: CS 142. Iteration, induction, recursion, lists, trees, sets, relations, functions; mathematical analysis of algorithms and data models; object-oriented implementation of abstract data types.
     236. Foundations of Computer Science 2. (4:4:1) F, W, Sp, Su Prerequisite: CS 235. Continuation of CS 235; relations, graphs, automata, grammars, propositional and predicate logic. Implementation of object-oriented algorithms.

  8. Complex Document Structure
     • Major sections with varying internal structures
     • Nested lists with unstructured text
     • Headings interspersed among records
     • Icons, hyperlinks, etc.

  9. Overlapping Value Domains
     student at Lincoln High School, won the state
     thought Lincoln himself was probably rolling over in his grave at the idea
     drove all the way to Lincoln, where we ate at
     When his history lesson about Abraham Lincoln finally ended, Steve left Lincoln High and drove his Lincoln Continental down to Lincoln, Nebraska.
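A small Python sketch shows why overlapping value domains are hard: one recognizer for "Lincoln" fires on every mention, and only the surrounding text separates person, school, car, and city. The cue table and the function name `label_mentions` are hypothetical and are not taken from Ontos.

```python
import re

# Hypothetical context cues: (label, side of the mention, literal cue text).
CUES = [
    ("Person", "before", "Abraham "),
    ("School", "after", " High"),
    ("Car", "after", " Continental"),
    ("City", "after", ", Nebraska"),
]


def label_mentions(text, value="Lincoln"):
    """Label each mention of `value` using the first cue that matches."""
    labels = []
    for mention in re.finditer(r"\b" + re.escape(value) + r"\b", text):
        label = "Unknown"
        for name, side, cue in CUES:
            if side == "before" and text[: mention.start()].endswith(cue):
                label = name
                break
            if side == "after" and text.startswith(cue, mention.end()):
                label = name
                break
        labels.append(label)
    return labels


SENTENCE = (
    "When his history lesson about Abraham Lincoln finally ended, Steve left "
    "Lincoln High and drove his Lincoln Continental down to Lincoln, Nebraska."
)
print(label_mentions(SENTENCE))
print(label_mentions("drove all the way to Lincoln, where we ate at"))
```

The second call returns an unlabeled mention: nothing adjacent to the word says it is a city, which is the kind of case where fixed cues run out and a richer heuristic is needed.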

  10. Tunable Parameters & Algorithm
      • Confidence values
        • Names: William = 0.9; Rose = 0.6; Spatula = 0.03
      • Weighted heuristics
        • Empirically, heuristic A is 2.3 times better than heuristic B
      • Acceptance thresholds
        • “If ConfidenceValue(Name) > 0.5, accept”
      • Candidate ranking
        • Heuristics vote; combine results; order candidate values and accept top n
      • Algorithm
        • When to retrieve, parse, extract, or populate target
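The parameters on this slide could fit together as in the following minimal Python sketch. The confidence values, the 2.3 weight, and the 0.5 threshold are from the slide; combining votes by weighted average, the `capitalized` heuristic, and all function names are assumptions made for illustration.

```python
# Confidence values from the slide.
NAME_CONFIDENCE = {"William": 0.9, "Rose": 0.6, "Spatula": 0.03}


def lexicon_confidence(candidate):
    """Heuristic A (assumed): look the candidate up in a name lexicon."""
    return NAME_CONFIDENCE.get(candidate, 0.0)


def capitalized(candidate):
    """Heuristic B (assumed): capitalized words look more like names."""
    return 1.0 if candidate[:1].isupper() else 0.0


def rank_candidates(candidates, weighted_heuristics, threshold=0.5, top_n=2):
    """Combine heuristic votes, drop weak candidates, return the top n."""
    total_weight = sum(weight for _, weight in weighted_heuristics)
    scored = []
    for candidate in candidates:
        votes = sum(weight * h(candidate) for h, weight in weighted_heuristics)
        score = votes / total_weight  # weighted average in [0, 1]
        if score > threshold:  # acceptance threshold
            scored.append((candidate, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Heuristic A is weighted 2.3 times as heavily as heuristic B.
HEURISTICS = [(lexicon_confidence, 2.3), (capitalized, 1.0)]

ranked = rank_candidates(["Spatula", "Rose", "William"], HEURISTICS)
print([name for name, _ in ranked])
```

With these numbers "Spatula" scores roughly 0.32 and is rejected, while "William" and "Rose" pass the threshold and are ranked in that order. Changing the weights, the threshold, or `top_n` changes the outcome without touching the heuristics themselves, which is the sense in which the parameters are tunable.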

  11. Our Approach
      We can remedy deficiencies in the Ontos heuristics by defining an abstract framework that allows the ontology designer to:
      • Implement more accurate and powerful heuristics (specific to the ontology’s needs), and
      • Control elements of the extraction plan (order in which documents are retrieved and parsed, heuristics are applied, etc.)
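One way to picture such a framework is the Python sketch below. The class names `Heuristic`, `KeywordHeuristic`, and `ExtractionPlan`, the dictionary-based state, and the sample text are all hypothetical; the presentation does not show the framework's actual interfaces.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State], State]


class Heuristic(ABC):
    """Hypothetical plug-in point: scores how well a span fits an object set."""

    @abstractmethod
    def score(self, object_set: str, span: str) -> float:
        ...


class KeywordHeuristic(Heuristic):
    """Designer-supplied heuristic: 1.0 if the set's keyword occurs in the span."""

    def __init__(self, keywords: Dict[str, str]) -> None:
        self.keywords = keywords

    def score(self, object_set: str, span: str) -> float:
        keyword = self.keywords.get(object_set)
        return 1.0 if keyword and keyword in span.lower() else 0.0


class ExtractionPlan:
    """Runs designer-chosen steps in the designer-chosen order."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def run(self, state: State) -> State:
        for step in self.steps:
            state = step(state)
        return state


def parse(state: State) -> State:
    """Split the text into spans at semicolons (a stand-in for real parsing)."""
    spans = [s.strip() for s in state["text"].split(";") if s.strip()]
    return {**state, "spans": spans}


def make_extract(heuristic: Heuristic, object_sets: List[str]) -> Step:
    """Build an extraction step that applies one heuristic to every span."""

    def extract(state: State) -> State:
        matches = {
            name: [s for s in state["spans"] if heuristic.score(name, s) > 0]
            for name in object_sets
        }
        return {**state, "matches": matches}

    return extract


heuristic = KeywordHeuristic({"Marriage Date": "married", "Funeral Date": "funeral"})
plan = ExtractionPlan([parse, make_extract(heuristic, ["Marriage Date", "Funeral Date"])])
# Made-up input text, for illustration only.
result = plan.run({"text": "married June 1, 1940; funeral service October 5, 1998"})
print(result["matches"])
```

Under this design, swapping in a different `Heuristic` subclass or reordering the step list changes extraction behavior without modifying the plan runner.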

  12. Framework Overview

  13. Progress
      • Researched HMM-based heuristics
      • Constructed XML Schema for ontologies
      • Solidified specialization semantics
      • Provided for directly populating ontology with extracted values
      • Implementation is proceeding…
