Inference of Concise DTDs from XML Data for Schema Optimization

Inference of Concise DTDs from XML data • Geert Jan Bex¹, Frank Neven¹, Thomas Schwentick², Karl Tuyls³ • ¹ Hasselt University and Transnational University of Limburg • ² Dortmund University • ³ Maastricht University and Transnational University of Limburg

Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions

Aims & requirements • Problem: infer DTD from XML corpus • Requirements: • Concise: humans can interpret/validate • Works on large data sets • Works on small data sets • Robust to noise

Why DTD inference? • Schema inference • ≈ 50 % of XML documents: no schema [Barbosa et al. 2005] • ≈ 66 % of DTDs and XSDs: not valid [Bex et al. 2005] • Improving existing schemas • “Noisy” XML documents • ≈ 90 % of XHTML docs: not valid • Related work • Fails on real-world, large data sets • Results not concise

Why schemas? • Validation: efficiency, security • Optimization: search, processing • Static analysis, type checking (e.g., XQuery) • Software development: modeling, OR-mapping • Integration: (meta-)data sources • Schema matching • Semantics

Learning a regular expression from a set of strings • XML documents • book: title editor year isbn • book: title author author year • Regular expression for book: title (author+ + editor+) year isbn?

Learning automata? • Well studied, but… • Learning automata ≠ learning regular expressions • Example: ((b?(a+c))+ d)+ e

Learning regular languages? • S = { abbb, abbd, acd, ac } • abbb + abbd + acd + ac • most specific regex for S • (a + b + c + d)* • most general regex for S • in between: a (b* + c) d? ??? (most specific ⊂ ? ⊂ most general) • positive examples only! • generalization vs. specificity • Impossible… in general
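The tension on this slide can be checked directly. The sketch below is only an illustration: it rewrites the slide's three expressions in Python `re` syntax (the slide's disjunction `+` becomes `|`) and confirms that all of them accept every string in S, so positive examples alone cannot single one out. The helper name `accepts_all` and the probe string `abd` are assumptions made for the example.

```python
import re

# Sample from the slide (positive examples only).
S = ["abbb", "abbd", "acd", "ac"]

# The slide's expressions, with "+" (disjunction) written as "|" for Python.
most_specific = r"abbb|abbd|acd|ac"
in_between = r"a(b*|c)d?"
most_general = r"[abcd]*"          # same language as (a + b + c + d)*

def accepts_all(pattern, sample):
    """True if the pattern matches every sample string completely."""
    return all(re.fullmatch(pattern, w) is not None for w in sample)

for name, pattern in [("most specific", most_specific),
                      ("in between", in_between),
                      ("most general", most_general)]:
    print(name, accepts_all(pattern, S))

# A string outside S separates the candidates: only the most specific
# expression rejects it.
probe = "abd"
print([re.fullmatch(p, probe) is not None
       for p in (most_specific, in_between, most_general)])
```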

Subclasses • Single Occurrence Regular Expressions (SOREs) • 99 % of regular expressions in DTDs/XSDs • Infer with iDTD • CHAin Regular Expressions (CHAREs) • 90 % of regular expressions in DTDs/XSDs • Infer with CRX

SOREs • What’s a SORE: • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • title . (author . affiliation?)+ . abstract • … and what’s not: • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
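The single-occurrence condition is easy to test mechanically. The following sketch assumes element names are plain identifiers and checks only that no name is repeated; it does not check that the expression is well formed. The function names are hypothetical.

```python
import re
from collections import Counter

def element_names(expr):
    """Extract element names, ignoring operators and parentheses."""
    return re.findall(r"[A-Za-z_][\w\-]*", expr)

def is_single_occurrence(expr):
    """True if every element name occurs at most once in the expression."""
    return all(n == 1 for n in Counter(element_names(expr)).values())

sore = "title . (author . affiliation?)+ . abstract"
not_sore = ("title . ((author . affiliation)+ + "
            "(editor . affiliation)+) . abstract")

print(is_single_occurrence(sore))      # affiliation occurs once
print(is_single_occurrence(not_sore))  # affiliation occurs twice
```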

Sample → SOA • W = {bacacdacde, cbacdbacde, abccaadcde} • 2T-Inf [Garcia & Vidal 1990] • Single Occurrence Automaton (states a, b, c, d, e)

Sample → SOA • SOA size • |Σ| + 2 states • O(|Σ|²) transitions • Complexity of algorithm • O(||W||) • streaming • Algorithm sound • W ⊆ L(SOA) • in general: |S| ≪ |L(SOA)|
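A minimal sketch of the 2-gram construction, under the assumption that the automaton can be represented by its sets of initial symbols, final symbols and edges (the explicit source and sink states that give |Σ| + 2 are left implicit). The names `two_t_inf` and `accepts` are hypothetical, not taken from the slides.

```python
def two_t_inf(sample):
    """Build a single occurrence automaton from the 2-grams of the sample.

    Each string is read once, left to right, so the construction is
    linear in the total sample length and works on a stream.
    """
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for w in sample:
        if not w:
            accepts_empty = True
            continue
        initial.add(w[0])
        final.add(w[-1])
        edges.update(zip(w, w[1:]))   # consecutive symbol pairs
    return initial, final, edges, accepts_empty

def accepts(soa, w):
    """True if w is a path from an initial to a final symbol."""
    initial, final, edges, accepts_empty = soa
    if not w:
        return accepts_empty
    return (w[0] in initial and w[-1] in final
            and all(pair in edges for pair in zip(w, w[1:])))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)

print(all(accepts(soa, w) for w in W))   # soundness: W is accepted
print(accepts(soa, "ade"))               # generalization beyond W
```

The second check illustrates the last bullet: the automaton accepts strings that never occurred in the sample.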

SOA → SORE: REWRITE • optional: b → b? • disjunction: a, c → a+c • concatenation: b?, a+c → b? (a+c) • self-loop: b? (a+c) → (b? (a+c))+ • … • result: ((b? (a+c))+ d)+ e

REWRITE: properties • Theorem • REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete) • Complexity: O(|Σ|⁴) • SORE size • |Σ| symbols • O(|Σ|) operators

REWRITE + repairs = iDTD • W = {bacacdacde, cbacdbacde} • no rules apply !!! • almost disjunction a, c • Fix: enable-disjunction, enable-optional • result: ((b? (a+c))+ d)+ e

iDTD: properties • Theorem • iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE) • iDTD can be parameterized for performance
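The containment L(SOA) ⊆ L(SORE) can be spot-checked on the two-string sample from the repair slide. The sketch below does not implement iDTD; it takes the SORE ((b? (a+c))+ d)+ e as given, builds the automaton from the sample's 2-grams, enumerates every accepted string up to a length bound, and matches each against the SORE in Python `re` syntax. The bound of 8 and all function names are assumptions for the example, and a bounded check is evidence, not a proof.

```python
import re

def two_grams(sample):
    """Initial symbols, final symbols and edges of the sample's SOA."""
    initial = {w[0] for w in sample}
    final = {w[-1] for w in sample}
    edges = {pair for w in sample for pair in zip(w, w[1:])}
    return initial, final, edges

def soa_strings(soa, max_len):
    """Yield every string of length <= max_len accepted by the SOA."""
    initial, final, edges = soa
    frontier = sorted(initial)
    for _ in range(max_len):
        for w in frontier:
            if w[-1] in final:
                yield w
        # Extend each partial path by one outgoing edge.
        frontier = [w + b for w in frontier
                    for a, b in sorted(edges) if a == w[-1]]

W = ["bacacdacde", "cbacdbacde"]
soa = two_grams(W)
sore = re.compile(r"((b?(a|c))+d)+e")   # "+" disjunction written as "|"

strings = set(soa_strings(soa, 8))
print(all(sore.fullmatch(w) for w in strings))

# The SORE is strictly more general: "ade" matches it, but the SOA built
# from this sample has no edge a -> d and no initial symbol a.
print(sore.fullmatch("ade") is not None, "ade" in strings)
```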

CHAREs • Definition: A chain regular expression is a sequence of factors f1,…,fn such that no alphabet symbol occurs more than once and each factor is one of • (a1 + … + ak) • (a1 + … + ak)? • (a1 + … + ak)+ • (a1 + … + ak)* • CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions

CHAREs • What’s a chain: • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • … and what’s not: • title . (author . affiliation?)+ . abstract (not a factor) • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)

CRX run: pre-order relation • Sample W = {abccde, cccad, bfeg, bfhi} • Pre-order relation →_W (graph on the symbols a, b, c, d, e, f, g, h, i)

CRX run: transitive closure • Sample W = {abccde, cccad, bfeg, bfhi} • a →_W b and b →_W c then a →_W c • graph on the symbols a, b, c, d, e, f, g, h, i

CRX run: transitive closure • Sample W = {abccde, cccad, bfeg, bfhi} • a →_W b and b →_W a then a ≈_W b • equivalence class: {a,b,c} • Symbol occurs in exactly one equivalence class • nodes: {a,b,c}, d, e, f, g, h, i
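The first CRX steps can be sketched as follows. Two assumptions are made here: the slide's sample is read as the four strings abccde, cccad, bfeg and bfhi, and a →_W b is taken to mean that b directly follows a somewhere in W, closed reflexively and transitively. The function names are hypothetical. The closure uses Warshall's algorithm, which matches the cubic term in the alphabet size.

```python
from itertools import product

def successor_pairs(sample):
    """All pairs (a, b) such that b directly follows a in some string."""
    return {pair for w in sample for pair in zip(w, w[1:])}

def reachability(symbols, pairs):
    """Reflexive-transitive closure of the successor pairs (Warshall)."""
    reach = {(a, a) for a in symbols} | set(pairs)
    # product varies its first component slowest, so k is the outer loop.
    for k, i, j in product(symbols, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach

def equivalence_classes(symbols, reach):
    """Group symbols that reach each other in both directions."""
    classes = []
    for a in symbols:
        cls = frozenset(b for b in symbols
                        if (a, b) in reach and (b, a) in reach)
        if cls not in classes:
            classes.append(cls)
    return classes

W = ["abccde", "cccad", "bfeg", "bfhi"]
symbols = sorted(set("".join(W)))
reach = reachability(symbols, successor_pairs(W))
classes = equivalence_classes(symbols, reach)

for cls in classes:
    print(sorted(cls))
```

Each symbol ends up in exactly one class; a, b and c share a class because a → b → c → a forms a cycle in the sample.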

CRX run: folding • Sample W = {abccde, cccad, bfeg, bfhi} • partial order →_W • pred(γ) = {γ′ | γ′ →_W γ} • succ(γ) = {γ′ | γ →_W γ′} • nodes: {a,b,c}, d, e, f, g, h, i (predecessor set and successor set marked)

CRX run: folding • Sample W = {abccde, cccad, bfeg, bfhi} • →_W: partial order • pred(γ) = {γ′ | γ′ →_W γ} • succ(γ) = {γ′ | γ →_W γ′} • nodes after folding: {a,b,c}, {d,f}, e, g, h, i

CRX run: multiplicity & RE • Sample W = {abccde, cccad, bfeg, bfhi} • topological sort: {a,b,c}, {d,f}, e, g, h, i • multiplicities: + for {a,b,c}, none for {d,f}, ? for e, g, h, i • Chain Regular Expression: (a + b + c)+ (d + f) e? g? h? i?
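The last step, assigning a multiplicity to each factor, can be sketched as below. The slide does not spell out the rule, so the one used here is an assumption chosen to agree with the slide's result: count how often symbols of a class occur in each sample string, then pick no qualifier (always exactly once), `?` (never more than once), `+` (always at least once) or `*` (otherwise). The ordered classes are taken from the slide as input rather than computed, the sample is read as four strings, and the function names are hypothetical.

```python
def qualifier(cls, sample):
    """Choose a multiplicity from per-string occurrence counts."""
    counts = [sum(w.count(a) for a in cls) for w in sample]
    if all(n == 1 for n in counts):
        return ""
    if all(n <= 1 for n in counts):
        return "?"
    if all(n >= 1 for n in counts):
        return "+"
    return "*"

def factor(cls, sample):
    """Render one factor, parenthesizing classes with several symbols."""
    names = sorted(cls)
    body = names[0] if len(names) == 1 else "(" + " + ".join(names) + ")"
    return body + qualifier(cls, sample)

def chare(ordered_classes, sample):
    """Concatenate the factors in topological order."""
    return " ".join(factor(cls, sample) for cls in ordered_classes)

W = ["abccde", "cccad", "bfeg", "bfhi"]
# Folded classes in topological order, as shown on the slide.
ordered = [{"a", "b", "c"}, {"d", "f"}, {"e"}, {"g"}, {"h"}, {"i"}]

print(chare(ordered, W))   # (a + b + c)+ (d + f) e? g? h? i?
```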

CRX algorithm: properties • Optimality: →_W linearly ordered ⇒ ∀ CHARE r with W ⊆ L(r) and L(r) ⊆ L(r_W): r_W = r • Performance: O(||W|| + |Σ|³) • Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w′ ∈ L(r): |w| ≤ |w′| + 2}

Related work • XTRACT [Garofalakis et al. 2000] • Pioneer • More general than iDTD • Focuses on regular expressions that don’t occur in real DTDs → no concise schemas • Trang: roughly equivalent to CRX • Inconsistent results

Data • Real world regular expressions • SOREs • Non SOREs • Real world data when available • Synthetic data otherwise

real world data

real world regexes

Experiments: generalization (plot; series: CRX, iDTD, no repairs)

Experiments: generalization (plot; series: CRX, iDTD)

Extensions • Incremental computation • new data → update internal representation (SOA or partial order) • Noise • Support for element name too small → ignore element • SOA: support for edges too small → delete edges before repair • Numerical predicates • Bookkeeping: minOccurs, maxOccurs • Generating XSDs • Infer data types (integer, double, date,…)
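One way to read the noise bullet for SOA edges is sketched below. The slides do not define support, so it is assumed here to be the number of sample strings in which an edge occurs; the threshold, the sample and the function names are all hypothetical.

```python
from collections import Counter

def edge_support(sample):
    """Number of sample strings in which each 2-gram edge occurs."""
    support = Counter()
    for w in sample:
        support.update(set(zip(w, w[1:])))   # count once per string
    return support

def prune_edges(sample, min_support):
    """Keep only the edges seen in at least min_support strings."""
    return {edge for edge, n in edge_support(sample).items()
            if n >= min_support}

# Hypothetical sample: "acb" plays the role of a noisy document.
W = ["abc", "abc", "abc", "acb"]

print(sorted(edge_support(W).items()))
print(sorted(prune_edges(W, min_support=2)))
```

Because the counts are simple sums over strings, they can also be updated when new data arrives, which fits the incremental-computation bullet.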

Conclusions • iDTD + CRX • learns robust class of regexes from positive examples • complete in their target class for sufficient data • deals with insufficient data • performs well on real world data • runs efficiently • Future work: inferring XML Schemas
