Explore the process of inferring concise DTDs from XML data for efficient schema optimization, focusing on robustness and effectiveness in noisy environments. Learn about the significance of schema inference, improving existing schemas, and the implications for validation and optimization tasks.
Inference of Concise DTDs from XML data. Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3). (1) Hasselt University and Transnational University of Limburg; (2) Dortmund University; (3) Maastricht University and Transnational University of Limburg
Outline • Goals & motivation • Problem setting • iDTD: Sample → SOA → SORE • CRX: Sample → CHARE • Experiments • Extensions • Conclusions
Aims & requirements • Problem: infer a DTD from an XML corpus (XML → DTD) • Requirements: • Concise: humans can interpret/validate • Works on large data sets • Works on small data sets • Robust to noise
Why DTD inference? • Schema inference • ≈ 50 % of XML documents: no schema [Barbosa et al. 2005] • ≈ 66 % of DTDs and XSDs: not valid [Bex et al. 2005] • Improving existing schemas • “Noisy” XML documents: ≈ 90 % of XHTML docs are not valid • Related work • Fails on real-world, large data sets • Results not concise
Why schemas? • Validation: efficiency, security • Optimization: search, processing • Static analysis, type checking (e.g., XQuery) • Software development: modeling, OR-mapping • Integration: (meta-)data sources • Schema matching • Semantics
Learning a regular expression from a set of strings • XML documents: book elements whose children are, e.g., title editor year isbn and title author author year • Inferred content model for book: title (author+ + editor+) year isbn?
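The step from XML documents to a set of strings can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `child_sequences` and the tiny `bib` document are made up for the example, modeled on the book elements of the slide.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def child_sequences(xml_documents):
    """Map each element name to the list of child-name sequences seen under it."""
    samples = defaultdict(list)
    for text in xml_documents:
        root = ET.fromstring(text)
        # iter() visits the root and every descendant in document order.
        for element in root.iter():
            samples[element.tag].append([child.tag for child in element])
    return dict(samples)

# Hypothetical corpus with two book elements.
docs = [
    "<bib>"
    "<book><title/><editor/><year/><isbn/></book>"
    "<book><title/><author/><author/><year/></book>"
    "</bib>"
]
samples = child_sequences(docs)
print(samples["book"])
```

Each list in `samples["book"]` is one string of the sample from which a regular expression for `book` would be learned.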
Learning automata? • Well studied, but… • Learning automata ≠ learning regular expressions • Example target expression: ((b? (a + c))+ d)+ e
Learning regular languages? • S = { abbb, abbd, acd, ac } • abbb + abbd + acd + ac: most specific regex for S • (a + b + c + d)*: most general regex for S • Somewhere in between: a (b* + c) d? ??? • Positive examples only! Generalization vs. specificity • Impossible… in general
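A quick way to see the generalization vs. specificity tension is to run the three candidate expressions on S and on a few unseen words. The sketch below uses Python's `re` module, writing the slide's union `+` as `|`; the list `unseen` is an arbitrary choice for illustration.

```python
import re

S = ["abbb", "abbd", "acd", "ac"]

# Slide notation uses '+' for union; Python's re uses '|'.
candidates = {
    "most specific": r"abbb|abbd|acd|ac",
    "intermediate": r"a(b*|c)d?",
    "most general": r"(a|b|c|d)*",
}

def accepts(pattern, word):
    return re.fullmatch(pattern, word) is not None

# Every candidate is consistent with the positive examples.
for pattern in candidates.values():
    assert all(accepts(pattern, w) for w in S)

# They differ only on words outside the sample.
unseen = ["ab", "abbbbd", "dcba"]
for name, pattern in candidates.items():
    print(name, [w for w in unseen if accepts(pattern, w)])
# most specific []
# intermediate ['ab', 'abbbbd']
# most general ['ab', 'abbbbd', 'dcba']
```

Since all three expressions agree on S, positive examples alone cannot tell them apart.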
Subclasses • Single Occurrence Regular Expressions (SOREs) • 99 % of regular expressions in DTDs/XSDs • Infer with iDTD • CHAin Regular Expressions (CHAREs) • 90 % of regular expressions in DTDs/XSDs • Infer with CRX
SOREs • What’s a SORE • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • title . (author . affiliation?)+ . abstract • … and what’s not • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
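The single-occurrence condition is purely syntactic, so it can be checked by counting element names. A minimal sketch, assuming element names consist of letters, digits, underscores and hyphens; the helper names are made up for this example.

```python
import re
from collections import Counter

def repeated_names(expression):
    """Return the element names that occur more than once in a content model."""
    names = re.findall(r"[A-Za-z_][\w\-]*", expression)
    return sorted(name for name, count in Counter(names).items() if count > 1)

def is_single_occurrence(expression):
    return not repeated_names(expression)

sore = "title . (author . affiliation?)+ . abstract"
not_sore = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"

print(is_single_occurrence(sore))   # True
print(repeated_names(not_sore))     # ['affiliation']
```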
Sample → SOA • W = {bacacdacde, cbacdbacde, abccaadcde} • 2T-INF [Garcia & Vidal 1990] builds a Single Occurrence Automaton (SOA) with one state per symbol a, b, c, d, e [figure: the resulting SOA]
Sample → SOA • In general: |S| ≪ |L(SOA)| • SOA size • |Σ| + 2 states • O(|Σ|²) transitions • Complexity of algorithm • O(||W||) • streaming • Algorithm sound • W ⊆ L(SOA)
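The construction can be sketched in a few lines, under the assumption that 2T-INF records which symbol directly follows which, plus one extra source and one extra sink state. The names `infer_soa`, `accepts`, `SOURCE` and `SINK` are illustrative only.

```python
from collections import defaultdict

SOURCE, SINK = "<src>", "<snk>"

def infer_soa(words):
    """Build the edge map of a single occurrence automaton from a sample.

    One state per symbol plus SOURCE and SINK; an edge x -> y is added
    whenever y directly follows x in some word. Each word is read once,
    left to right, so the sample can be streamed.
    """
    edges = defaultdict(set)
    for word in words:
        previous = SOURCE
        for symbol in word:
            edges[previous].add(symbol)
            previous = symbol
        edges[previous].add(SINK)
    return edges

def accepts(edges, word):
    """A word is accepted if its path from SOURCE to SINK uses only known edges."""
    path = [SOURCE, *word, SINK]
    return all(b in edges.get(a, ()) for a, b in zip(path, path[1:]))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = infer_soa(W)

print(all(accepts(soa, w) for w in W))  # True: soundness on the sample
print(accepts(soa, "acde"))             # True: an unseen word is accepted
print(accepts(soa, "abde"))             # False: no edge b -> d was observed
```

The second call shows the generalization: the automaton accepts words that were never in the sample, as long as every adjacent pair was seen somewhere.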
SOA → SORE: REWRITE • Rewrite rules illustrated on the example SOA: • optional: b becomes b? • disjunction: a, c become a + c • concatenation: b?, a + c become b? (a + c) • self-loop: b? (a + c) becomes (b? (a + c))+ • Final result: ((b? (a + c))+ d)+ e
REWRITE: properties • Theorem • REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete) • Complexity: O(|Σ|⁴) • SORE size • |Σ| symbols • O(|Σ|) operators
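For the running example, the claimed equivalence between the SOA and the SORE ((b? (a + c))+ d)+ e can be spot-checked by brute force. The sketch below is a bounded check, not a proof: it represents the SOA of the three-word sample as a set of adjacent symbol pairs, with `^` and `$` standing in for the two extra states, and compares it with the expression on all words up to an arbitrarily chosen length `MAX_LEN`.

```python
import re
from itertools import product

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
ALPHABET = "abcde"

def pairs(word):
    """Adjacent symbol pairs of a word, including the start and end markers."""
    padded = "^" + word + "$"
    return set(zip(padded, padded[1:]))

# The SOA accepts exactly the words whose pairs were all observed in W.
soa = set().union(*(pairs(w) for w in W))

def soa_accepts(word):
    return pairs(word) <= soa

# The SORE from the slides, with '+' (union) written as '|' for Python.
sore = re.compile(r"((b?(a|c))+d)+e")

def sore_accepts(word):
    return sore.fullmatch(word) is not None

MAX_LEN = 6  # arbitrary bound for this check
mismatches = [
    "".join(letters)
    for n in range(MAX_LEN + 1)
    for letters in product(ALPHABET, repeat=n)
    if soa_accepts("".join(letters)) != sore_accepts("".join(letters))
]
print(mismatches)  # [] : both accept the same words up to length 6
```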
REWRITE + repairs = iDTD • W = {bacacdacde, cbacdbacde} • SOA of this smaller sample: no rewrite rules apply! • a, c are almost a disjunction • Fix: repair rules enable-disjunction, enable-optional • Result after repairs: ((b? (a + c))+ d)+ e
iDTD: properties • Theorem • iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE) • iDTD can be parameterized for performance
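Why REWRITE gets stuck on the two-word sample can be seen by comparing the neighbourhoods of a and c. The sketch assumes that the disjunction rule requires two states to have identical predecessor and successor sets; that precondition is my reading of the slides, and the function names are illustrative.

```python
def neighbours(words):
    """Predecessor and successor sets of every state in the SOA of `words`."""
    pred, succ = {}, {}
    for word in words:
        padded = ["<src>", *word, "<snk>"]
        for x, y in zip(padded, padded[1:]):
            succ.setdefault(x, set()).add(y)
            pred.setdefault(y, set()).add(x)
    return pred, succ

def interchangeable(words, x, y):
    """Assumed precondition of the disjunction rule: equal neighbourhoods."""
    pred, succ = neighbours(words)
    return pred[x] == pred[y] and succ[x] == succ[y]

full = ["bacacdacde", "cbacdbacde", "abccaadcde"]
small = ["bacacdacde", "cbacdbacde"]

print(interchangeable(full, "a", "c"))   # True: the disjunction rule can fire
print(interchangeable(small, "a", "c"))  # False: a, c are only "almost" a disjunction
```

With the smaller sample the edges contributed by the third word are missing, which is the kind of gap the repair rules are meant to bridge.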
CHAREs • CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions (CHAREs) • Definition: A chain regular expression is a sequence of factors f1, …, fn such that no alphabet symbol occurs more than once and each factor is one of • (a1 + … + ak) • (a1 + … + ak)? • (a1 + … + ak)+ • (a1 + … + ak)*
CHAREs • What’s a chain • header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence • authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs? • … and what’s not • title . (author . affiliation?)+ . abstract ((author . affiliation?)+ is not a factor) • title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
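Because a CHARE is just a list of factors, it maps naturally onto a small data structure. In the sketch below a factor is a pair (list of element names, multiplicity), and the chain is compiled to a Python regex over space-terminated names; this encoding and the helper names are assumptions made for the example.

```python
import re

def chare_to_regex(factors):
    """Compile a chain of (names, multiplicity) factors to a Python regex.

    multiplicity is one of "", "?", "+", "*". Raises ValueError if a name
    is used more than once, which a CHARE does not allow.
    """
    seen = set()
    parts = []
    for names, multiplicity in factors:
        if multiplicity not in ("", "?", "+", "*"):
            raise ValueError(f"bad multiplicity: {multiplicity!r}")
        for name in names:
            if name in seen:
                raise ValueError(f"name occurs twice: {name}")
            seen.add(name)
        # Each name is followed by a space so that names cannot run together.
        alternatives = "|".join(re.escape(name) + " " for name in names)
        parts.append(f"(?:{alternatives}){multiplicity}")
    return re.compile("".join(parts))

def matches(chare, names):
    return chare.fullmatch("".join(name + " " for name in names)) is not None

# authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
citation = chare_to_regex([
    (["authors"], ""), (["citation"], ""), (["volume"], "?"), (["month"], "?"),
    (["year"], ""), (["pages"], "?"), (["title", "descr"], "?"), (["xrefs"], "?"),
])
print(matches(citation, ["authors", "citation", "year", "title"]))  # True
print(matches(citation, ["authors", "year", "citation"]))           # False
```

An expression such as title . (author . affiliation?)+ . abstract cannot be written in this structure at all, since a factor holds only a flat list of names.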
CRX run: pre-order relation • Sample W = {abccde, cccad, bfeg, bfhi} • Pre-order relation →W over the symbols a, b, c, d, e, f, g, h, i [figure: graph of →W]
CRX run: transitive closure • If a →W b and b →W c then a →W c • Sample W = {abccde, cccad, bfeg, bfhi} [figure: graph over a, b, c, d, e, f, g, h, i]
CRX run: transitive closure • If a →W b and b →W a then a ≈W b • Equivalence class {a, b, c}; d, e, f, g, h, i stay on their own • Each symbol occurs in exactly one equivalence class • Sample W = {abccde, cccad, bfeg, bfhi}
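The first two CRX steps can be sketched as follows. Two assumptions are mine, not the slides': that a →W b holds when a directly precedes b in some word, and that the flattened sample splits into the four words abccde, cccad, bfeg and bfhi.

```python
def equivalence_classes(words):
    """Group symbols that can reach each other in the closure of ->W."""
    symbols = sorted(set("".join(words)))
    # reach[x]: symbols reachable from x through one or more ->W steps.
    reach = {x: set() for x in symbols}
    for word in words:
        for x, y in zip(word, word[1:]):
            reach[x].add(y)
    # Warshall-style transitive closure, cubic in the alphabet size.
    for k in symbols:
        for i in symbols:
            if k in reach[i]:
                reach[i] |= reach[k]
    classes = []
    for x in symbols:
        cls = {x} | {y for y in reach[x] if x in reach[y]}
        if cls not in classes:
            classes.append(cls)
    return classes

W = ["abccde", "cccad", "bfeg", "bfhi"]
for cls in equivalence_classes(W):
    print(sorted(cls))
# ['a', 'b', 'c'] followed by the singletons d, e, f, g, h, i
```

Here a, b and c end up together because a precedes b, b precedes c, and c precedes a somewhere in the sample.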
CRX run: folding partial order →W • Predecessor set: pred(γ) = {γ’ | γ’ →W γ} • Successor set: succ(γ) = {γ’ | γ →W γ’} • Nodes: {a, b, c}, d, e, f, g, h, i • Sample W = {abccde, cccad, bfeg, bfhi}
CRX run: folding partial order →W • pred(γ) = {γ’ | γ’ →W γ} • succ(γ) = {γ’ | γ →W γ’} • After folding, d and f share one node: {a, b, c}, {d, f}, e, g, h, i • →W is a partial order on these nodes • Sample W = {abccde, cccad, bfeg, bfhi}
CRX run: multiplicity & RE • Topological sort of the nodes: {a, b, c}, {d, f}, e, g, h, i • Multiplicities: {a, b, c}: +, {d, f}: exactly one, e: ?, g: ?, h: ?, i: ? • Chain Regular Expression: (a + b + c)+ (d + f) e? g? h? i? • Sample W = {abccde, cccad, bfeg, bfhi}
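The last step, assigning a multiplicity to each node, can be sketched by counting how often the symbols of a node occur in each word. The rule used here (always exactly one: no marker; at most one: ?; at least one: +; otherwise *) is my reading of the slides, and the node order is taken as given rather than computed. As before, the four-word split of the sample is an assumption.

```python
def multiplicity(symbols, words):
    """Pick "", "?", "+" or "*" from per-word occurrence counts of a node."""
    counts = [sum(word.count(s) for s in symbols) for word in words]
    if min(counts) >= 1:
        return "" if max(counts) == 1 else "+"
    return "?" if max(counts) <= 1 else "*"

def chare(ordered_nodes, words):
    """Write out the chain for nodes already in topological order."""
    factors = []
    for symbols in ordered_nodes:
        body = " + ".join(symbols)
        if len(symbols) > 1:
            body = f"({body})"
        factors.append(body + multiplicity(symbols, words))
    return " ".join(factors)

W = ["abccde", "cccad", "bfeg", "bfhi"]
order = ["abc", "df", "e", "g", "h", "i"]  # single-character symbols per node
print(chare(order, W))  # (a + b + c)+ (d + f) e? g? h? i?
```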
CRX algorithm: properties • Optimality: if →W is linearly ordered, then for every CHARE r with W ⊆ L(r) and L(r) ⊆ L(rW): rW = r • Performance: O(||W|| + |Σ|³) • Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w’ ∈ L(r): |w| ≤ |w’| + 2}
Related work • XTRACT [Garofalakis et al. 2000] • Pioneer • More general than iDTD • Focuses on regular expressions that don’t occur in real DTDs → no concise schemas • Trang: roughly equivalent to CRX • Inconsistent results
Data • Real world regular expressions • SOREs • Non SOREs • Real world data when available • Synthetic data otherwise
Experiments: generalization [figure: results for CRX and for iDTD with no repairs]
Experiments: generalization [figure: results for CRX and iDTD]
Extensions • Incremental computation • new data → update internal representation (SOA or partial order) • Noise • Support for element name too small → ignore element • SOA: support for edges too small → delete edges before repair • Numerical predicates • Bookkeeping: minOccurs, maxOccurs • Generating XSDs • Infer data types (integer, double, date, …)
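Incremental updates and edge support fit together naturally if the SOA keeps a counter per edge. The sketch below is one possible realization, not the authors' implementation: "support" is taken to mean the number of words in which an edge occurs, and the class name, the threshold parameter and the toy sample are all made up.

```python
from collections import Counter

class SupportSOA:
    """SOA edges with support counts, updatable one word at a time."""

    def __init__(self):
        self.support = Counter()  # (x, y) -> number of words containing the edge

    def add(self, word):
        padded = ["<src>", *word, "<snk>"]
        # Count each edge once per word, so support is a per-word frequency.
        self.support.update(set(zip(padded, padded[1:])))

    def edges(self, min_support=1):
        """Edges that survive the support threshold."""
        return {edge for edge, n in self.support.items() if n >= min_support}

soa = SupportSOA()
for word in ["abd", "acd", "abd", "acd"]:  # clean data
    soa.add(word)
soa.add("adc")                             # one noisy word arrives later

print(("a", "d") in soa.edges(min_support=1))  # True: noise edge is present
print(("a", "d") in soa.edges(min_support=2))  # False: pruned before repair
```

New words only touch the counters, so nothing has to be recomputed from the earlier data.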
Conclusions • iDTD + CRX • learn a robust class of regexes from positive examples • are complete in their target class for sufficient data • deal with insufficient data • perform well on real-world data • run efficiently • Future work: inferring XML Schemas