
Data Frames Version 3 Proposal

This proposal suggests enhancements for data frames version 3, including improvements to extracting mileage information and handling different types of data. It also introduces the idea of required context and discusses the internal representation, methods, canonicalization, inheritance, general constraints, and other related issues.

abates

Presentation Transcript


  1. Data Frames Version 3 Proposal

  2. Data Frames Version 2
     Year matches [2]
       constant { extract "\d{2}";
                  context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5,
                { extract "\d{2}";
                  context "([^\$\d]|^)\d{2},[^\d]"; } 0.6,
                { extract "\d{2}";
                  context "\b'\d{2}\b"; } 0.8;
     end;
     Mileage matches [8]
       constant { extract "\b[1-9]\d{1,2}k"; } 0.6,
                { extract "[1-9]\d?,\d{3}"; } 0.3;
       keyword "\bmiles\b", "\bmi\.", "\bmi\b";
     end;
     Also: except, substitute, filter phrases; lexicons
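As a rough illustration of how the Version 2 Mileage phrases might be applied, the Python sketch below runs the two extract patterns and the keyword patterns from the slide over a text string. The function names and the way confidences are returned are assumptions made for illustration; the slides do not describe the matching engine itself.

```python
import re

# Value phrases for Mileage, copied from the slide: (extract pattern, confidence).
MILEAGE_VALUE_PHRASES = [
    (r"\b[1-9]\d{1,2}k", 0.6),
    (r"[1-9]\d?,\d{3}", 0.3),
]

# Keyword patterns for Mileage, copied from the slide.
MILEAGE_KEYWORDS = [r"\bmiles\b", r"\bmi\.", r"\bmi\b"]


def find_mileage_candidates(text):
    """Return (matched string, confidence) for every value-phrase hit.

    Hypothetical helper: how the real system combines confidences is not
    stated in the slides, so the values are only reported, not merged.
    """
    candidates = []
    for pattern, confidence in MILEAGE_VALUE_PHRASES:
        for match in re.finditer(pattern, text):
            candidates.append((match.group(0), confidence))
    return candidates


def has_mileage_keyword(text):
    """Return True if any Mileage keyword pattern occurs in the text."""
    return any(re.search(pattern, text) for pattern in MILEAGE_KEYWORDS)


ad = "Red Honda, 87k miles"
candidates = find_mileage_candidates(ad)
keyword_found = has_mileage_keyword(ad)
```

Note that the second pattern also matches price-like strings such as "4,500", which is presumably one reason the phrases carry confidence values and keywords.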

  3. • Still allow negation
     • Introduce idea of “required context”
     • Each phrase may be labeled
     • Strong separation of value and keyword phrases
     • Allow keyword to be specific to a subset of the value phrases for this data frame
     • Expressions are richer than regular expressions. Supports Boolean and proximity operators; also lexicons and macros.
     • Kimball’s Ontology Editor

  4. Internal Representation
     • Replace SQL field length with arbitrary type field
     • This is the “internal representation”
     • Type is either lexical or nonlexical
     • Type could be the name of an object set in the ontology
     • Or it could be the name of a type in whatever language will be used to implement methods (more on this later), together with a units name (e.g. “miles”, “meters”, “grams”, “pounds”)

  5. Methods
     • Add a method phrase to data frames
     • Conceptually they are restricted derived object sets and relationship sets
     • We only declare method signatures in data frames
     • Another language (e.g. Java) is used to define the method body
     • Our tool will generate a template in which the programmer can write method bodies
     • The template will have OO structures that allow read-only access to the seamless model/data instance
     • Keyword phrases may also apply to methods
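The split between declared signatures and programmer-written bodies could look roughly like the sketch below. The slides name Java as an example implementation language; Python is used here only to keep the example short. The class names, the `to_miles` method, and the dictionary-based data instance are all hypothetical, and the read-only access is approximated with `types.MappingProxyType`.

```python
from types import MappingProxyType


class DistanceMethodsTemplate:
    """Hypothetical generated template: signatures only, no bodies."""

    def __init__(self, instance):
        # Expose the data instance through a read-only mapping view so
        # method bodies can read values but cannot modify them.
        self._instance = MappingProxyType(dict(instance))

    def to_miles(self):
        # Signature declared in the data frame; body left to the programmer.
        raise NotImplementedError("method body to be written by the programmer")


class DistanceMethods(DistanceMethodsTemplate):
    """Programmer-supplied method bodies filling in the template."""

    METERS_PER_MILE = 1609.344

    def to_miles(self):
        return self._instance["meters"] / self.METERS_PER_MILE


methods = DistanceMethods({"meters": 1609.344})
miles = methods.to_miles()
```

Attempting to assign into `methods._instance` raises `TypeError`, which is one simple way to enforce the read-only access the slide asks for.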

  6. Canonicalization Methods
     • Each value phrase may have an associated canonicalization method
     • The purpose is to convert the extracted value string into a common form
     • The data frame may have a default canonicalization method that applies if there is no individual method for a value phrase
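A minimal sketch of per-phrase canonicalization with a data-frame default, using the two Mileage value formats from the Version 2 slide. The phrase labels (`k_suffix`, `comma`) and all function names are assumptions for illustration, as is the choice of a whole number of miles as the common form.

```python
def canonicalize_k_suffix(value):
    """Convert a string such as '87k' to an integer number of miles."""
    return int(value[:-1]) * 1000


def canonicalize_comma_number(value):
    """Convert a string such as '87,000' to an integer number of miles."""
    return int(value.replace(",", ""))


def default_canonicalize(value):
    """Data-frame default: used when a value phrase has no method of its own."""
    return value.strip()


# Hypothetical mapping from value-phrase label to its canonicalization method.
CANONICALIZERS = {
    "k_suffix": canonicalize_k_suffix,
    "comma": canonicalize_comma_number,
}


def canonicalize(label, value):
    """Apply the phrase's own method if it has one, else the default."""
    method = CANONICALIZERS.get(label, default_canonicalize)
    return method(value)


same_form = canonicalize("k_suffix", "87k") == canonicalize("comma", "87,000")
```

Both extracted strings end up as the same integer, which is the point of canonicalization: later comparison and record assembly no longer depend on how the value was written in the source text.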

  7. Inheritance
     • Inheritance is defined more cleanly
     • Generalization/specialization will indicate inheritance hierarchy
     • The internal representation cannot be overridden in specializations
     • Multiple parents must have the same internal representation
     • Individual inherited phrases can be deleted or overridden
     • New phrases can be added
     • In the case of name conflict, we require fully qualified names to be used (no automatic disambiguation)
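The inheritance rules above can be sketched as a small class that checks them at construction time. Everything here is a hypothetical model: the `DataFrameSpec` class, its parameters, the `"float miles"` representation strings, and the choice to store every phrase under a fully qualified `Owner.label` name so that name conflicts never need automatic disambiguation.

```python
class DataFrameSpec:
    """Hypothetical model of a data frame with inherited phrases."""

    def __init__(self, name, representation=None, parents=(),
                 phrases=None, overrides=None, deleted=()):
        self.name = name

        # Multiple parents must have the same internal representation.
        parent_reprs = {p.representation for p in parents}
        if len(parent_reprs) > 1:
            raise ValueError("parents must share one internal representation")

        # A specialization cannot override the inherited representation.
        inherited = next(iter(parent_reprs), None)
        if inherited is not None and representation not in (None, inherited):
            raise ValueError("internal representation cannot be overridden")
        self.representation = inherited if inherited is not None else representation

        # Inherit all parent phrases under their fully qualified names.
        self.phrases = {}
        for parent in parents:
            self.phrases.update(parent.phrases)

        # Delete individual inherited phrases by fully qualified name.
        for qualified in deleted:
            del self.phrases[qualified]

        # Override individual inherited phrases by fully qualified name.
        for qualified, pattern in (overrides or {}).items():
            if qualified not in self.phrases:
                raise KeyError(qualified)
            self.phrases[qualified] = pattern

        # Add new phrases, qualified with this data frame's own name.
        for label, pattern in (phrases or {}).items():
            self.phrases[f"{name}.{label}"] = pattern


distance = DataFrameSpec("Distance", representation="float miles",
                         phrases={"k_suffix": r"\b[1-9]\d{1,2}k"})
mileage = DataFrameSpec("Mileage", parents=[distance],
                        phrases={"comma": r"[1-9]\d?,\d{3}"},
                        overrides={"Distance.k_suffix": r"\b[1-9]\d{1,2}[kK]"})
```

Here `Mileage` inherits the representation of `Distance`, overrides one inherited phrase, and adds one of its own.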

  8. General Constraints
     • We may decide to implement a limited form of general constraint in the ontology
     • E.g. “Birth Date <= Death Date”
     • Or “Event Distance.toMiles() <= 26”
     • If so, we may want to implement operator overloading (something like C++)
     • The general constraint issue is not core to the current data frame discussion, but it has interesting ramifications
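To make the operator-overloading idea concrete, here is a minimal Python sketch of a unit-aware `Distance` type whose comparison operators work across units, so a constraint like the one on the slide can be written directly. The class, the unit table, and the method names are assumptions; the slide mentions C++-style overloading, and Python's comparison methods are used here as a stand-in.

```python
from functools import total_ordering

# Conversion factors to meters (standard definitions).
METERS_PER_UNIT = {"meters": 1.0, "kilometers": 1000.0, "miles": 1609.344}


@total_ordering
class Distance:
    """Hypothetical value type: an amount together with a units name."""

    def __init__(self, amount, units):
        self.amount = amount
        self.units = units

    def to_meters(self):
        return self.amount * METERS_PER_UNIT[self.units]

    def to_miles(self):
        return self.to_meters() / METERS_PER_UNIT["miles"]

    # Overloaded comparisons convert both sides to meters first;
    # total_ordering derives <=, >, and >= from __eq__ and __lt__.
    def __eq__(self, other):
        if not isinstance(other, Distance):
            return NotImplemented
        return self.to_meters() == other.to_meters()

    def __lt__(self, other):
        if not isinstance(other, Distance):
            return NotImplemented
        return self.to_meters() < other.to_meters()


def satisfies_event_distance_constraint(distance):
    """The slide's example constraint: Event Distance.toMiles() <= 26."""
    return distance.to_miles() <= 26


ten_k = Distance(10, "kilometers")
within_limit = satisfies_event_distance_constraint(ten_k)
```

With overloading in place, the same constraint could also be written as `ten_k <= Distance(26, "miles")`, leaving the unit conversion to the type rather than to the constraint author.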

  9. Other Issues
     • How to integrate methods and confidence values into record-assembly heuristics
     • Ontos system will have to be rewritten
     • Extract into model instance, not SQL tables
     • We can always generate database tables later if we’d like
     • Ontologies created graphically and stored as XML
