
Efficient Incremental Validation of XML Documents


Presentation Transcript


  1. Efficient Incremental Validation of XML Documents. Denilson Barbosa, Alberto O. Mendelson, Leonid Libkin, Laurent Mignet, Marcelo Arenas. Presented by Daria Barger. Daria Barger – DB Seminar

  2. Outline • Introduction • Types of constraints • Update operations • Incremental validation • Experiments • Conclusions • Future work

  3. Introduction • The problems of storing and querying XML documents have attracted a great deal of interest. • Other aspects of XML data management, however, have not yet been satisfactorily explored. • Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.

  4. DTD • One popular form of XML document specification is the Document Type Definition (DTD). • A DTD D is a grammar that defines a set of documents L(D). • Each document in L(D) is said to be valid with respect to D.

  5. The Validation Problem The validation problem is: given a DTD D and an XML document X, is it the case that X ∈ L(D)? The incremental validation problem is: let U be some update operation. Given X ∈ L(D), is it the case that U(X) ∈ L(D)?

  6. Validation of structural constraints Elements are declared in a DTD by rules of the form <!ELEMENT e c>, where c is the content model of e. Example: <?xml version="1.0"?> <!ELEMENT db (person*)> <!ELEMENT person (name, dep, email, tel*)> <!ELEMENT name (#PCDATA)> <!ELEMENT dep (#PCDATA)> <!ELEMENT email (#PCDATA)> <!ELEMENT tel (#PCDATA)> Content model given by a regular expression E: an element is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by E. Content model #PCDATA: validation is trivial.
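
A minimal sketch of the content-model check on this slide, in Python. It assumes a hypothetical encoding in which each content model becomes a regular expression over child names, with every name followed by a space. The names `CONTENT_MODELS` and `children_valid` are illustrative only and do not come from the presentation.

```python
import re

# Hypothetical encoding of the slide's DTD: each element maps to a regular
# expression over child element names, each name followed by one space.
CONTENT_MODELS = {
    "db": r"(person )*",
    "person": r"name dep email (tel )*",
}

def children_valid(element, child_names):
    """Return True iff the sequence of child names is in L(E) for the element."""
    word = "".join(name + " " for name in child_names)
    return re.fullmatch(CONTENT_MODELS[element], word) is not None

ok = children_valid("person", ["name", "dep", "email", "tel", "tel"])
missing_dep = children_valid("person", ["name", "email"])
```

This checks one element from scratch; the incremental algorithms on the later slides aim to avoid rereading all children after each update.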

  7. Validation of attributes Attribute validation is trivial, except for the ID and IDREF attribute types. In a valid XML document: • The values of all ID attributes are unique. • The value of each IDREF attribute equals the value of some ID attribute.
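
The two conditions above can be sketched as a full (non-incremental) check. The input format, a list of `(id_or_None, [idref, ...])` pairs with one pair per element, and the function name `check_ids` are assumptions made for this illustration.

```python
def check_ids(elements):
    """elements: list of (id_or_None, [idref, ...]) pairs, one per element."""
    seen = set()
    for elem_id, _ in elements:
        if elem_id is None:
            continue
        if elem_id in seen:
            return False  # two ID attributes share a value
        seen.add(elem_id)
    # every IDREF must equal the value of some ID attribute in the document
    return all(ref in seen for _, refs in elements for ref in refs)

valid_doc = [("p1", []), ("p2", ["p1"]), (None, ["p2"])]
duplicate_id = [("p1", []), ("p1", [])]
dangling_ref = [("p1", ["p9"])]
```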

  8. 1-unambiguous regular expressions The specification of XML DTDs restricts the regular expressions used for defining element content to be 1-unambiguous (deterministic). Marking: E' is E with its symbols subscripted to distinguish occurrences of the same symbol. Position: a subscripted symbol in E'. For a given position x, χ(x) denotes the corresponding (unmarked) symbol in Σ. For example: pos(E') = {a, b1, b2, c}, χ(b1) = b.

  9. 1-unambiguous regular expressions A regular expression E is 1-unambiguous if and only if, for all words u, v, w over the subscripted alphabet pos(E') and all x, y in pos(E'), the conditions uxv, uyw ∈ L(E') and x ≠ y imply χ(x) ≠ χ(y). Which regular expression is deterministic? • (ab)|(ac) • a(b|c) • a(a+b)*ac

  10. The Glushkov automaton for Regular Expressions first(E'): the set of positions that appear as the first symbol of some word in L(E'). follow(E', x): the set of positions that appear immediately after position x in some word in L(E'). last(E'): the set of positions that appear as the last symbol of some word in L(E').
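
The three sets on this slide, and the determinism test from the previous one, can be sketched in Python. The tuple-based syntax tree (`"sym"`, `"cat"`, `"alt"`, `"star"`) and the function names are assumptions of this sketch, and positions are numbered left to right instead of subscripted. The sketch relies on the standard characterisation that E is 1-unambiguous exactly when no two distinct positions in first(E'), or in the same follow set, carry the same symbol.

```python
from itertools import count

def glushkov(expr):
    """Return (first, last, follow, sym) for a regex given as nested tuples."""
    sym, follow, counter = {}, {}, count(1)

    def walk(e):
        kind = e[0]
        if kind == "sym":                      # a single position
            p = next(counter)
            sym[p], follow[p] = e[1], set()
            return {p}, {p}, False
        if kind == "star":                     # last positions loop back to first
            f, l, _ = walk(e[1])
            for p in l:
                follow[p] |= f
            return f, l, True
        f1, l1, n1 = walk(e[1])
        f2, l2, n2 = walk(e[2])
        if kind == "alt":
            return f1 | f2, l1 | l2, n1 or n2
        for p in l1:                           # concatenation
            follow[p] |= f2
        return (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2), n1 and n2

    first, last, _ = walk(expr)
    return first, last, follow, sym

def is_one_unambiguous(expr):
    first, _, follow, sym = glushkov(expr)
    def distinct(positions):
        labels = [sym[p] for p in positions]
        return len(labels) == len(set(labels))
    return distinct(first) and all(distinct(s) for s in follow.values())

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
e1 = ("alt", ("cat", a, b), ("cat", a, c))                       # (ab)|(ac)
e2 = ("cat", a, ("alt", b, c))                                   # a(b|c)
e3 = ("cat", ("cat", ("cat", a, ("star", ("alt", a, b))), a), c) # a(a+b)*ac
results = [is_one_unambiguous(e) for e in (e1, e2, e3)]
```

Under the assumption that `+` in the third expression denotes union, only `a(b|c)` passes the test: in the first expression two `a` positions are both in the first set, and in the third the `a` inside the starred part competes with the `a` that follows it.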

  11. Update operations • Append(p, y): insert element y as the last child of element p.

  12. Update operations (2) • InsertBefore(x, y): insert element y as the immediate left sibling of element x. (This operation is not defined if x is the root of the document.)

  13. Update operations (3) • Delete(x): delete element x from the document. Note that if x is the root of the document, the operation is trivially valid.

  14. Observation Incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p, y) operation only the content of p needs to be revalidated.
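
A minimal sketch of the three update operations and of the observation above. The `Node` class and the convention that each function returns the single element whose content must be revalidated are assumptions of this illustration, not part of the presentation.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

def append(p, y):
    """Append(p, y): y becomes the last child of p."""
    y.parent = p
    p.children.append(y)
    return p  # only the content of p needs revalidation

def insert_before(x, y):
    """InsertBefore(x, y): y becomes the immediate left sibling of x."""
    p = x.parent
    if p is None:
        raise ValueError("InsertBefore is not defined for the root")
    y.parent = p
    p.children.insert(p.children.index(x), y)
    return p

def delete(x):
    """Delete(x): remove x (and its subtree) from the document."""
    p = x.parent
    if p is None:
        return None  # deleting the root is trivially valid
    p.children.remove(x)
    x.parent = None
    return p
```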

  15. The approach [Figure: element p with children w1, w2, w3, …, wk.] • Together with the i-th child of p we store the state that the automaton validating the content model of p reaches after reading w1 … wi. • This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.

  16. Append at the end [Figure: Append(p, y) adds y after the last child wk of p.]
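
A sketch of why appending is cheap when an automaton state is stored with each child: the check needs only the state stored with the last child and a single transition. The transition table below is hand-built for the content model (name, dep, email, tel*) from slide 6; `DELTA`, `FINAL`, `run` and `can_append` are hypothetical names chosen for this illustration.

```python
# Hand-built automaton for (name, dep, email, tel*); states are named after
# the position just read, with "start" as the initial state.
DELTA = {
    ("start", "name"): "name",
    ("name", "dep"): "dep",
    ("dep", "email"): "email",
    ("email", "tel"): "tel",
    ("tel", "tel"): "tel",
}
FINAL = {"email", "tel"}

def run(children):
    """Return the state stored with each child, or None if a step is undefined."""
    state, stored = "start", []
    for name in children:
        state = DELTA.get((state, name))
        if state is None:
            return None
        stored.append(state)
    return stored

def can_append(stored, y):
    """Return the state to store with y if Append keeps the content valid, else None."""
    last = stored[-1] if stored else "start"
    nxt = DELTA.get((last, y))
    # y becomes the last child, so the state reached must be accepting
    return nxt if nxt in FINAL else None

stored = run(["name", "dep", "email"])
tel_state = can_append(stored, "tel")
name_state = can_append(stored, "name")
```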

  17. Arbitrary insertions and deletions [Figure: Delete(x) operation on a child wi of p, whose children are w1, w2, …, wk.] Problem: complexity.

  18. 1,2 Conflict Free Regular Expression A possible solution relies on a condition that does not always hold. For example, consider E = a(b1* | c b2*) and w = acb…b: all b's match state b2. Deleting c from w gives w' = ab…b, in which all b's match state b1, so the entire string has to be revalidated.
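
The example on this slide can be replayed with a small hand-built automaton for E = a(b1* | c b2*). The table `DELTA` and the helper `states` are assumptions of this sketch; state names follow the positions on the slide.

```python
# Transitions of the automaton for E = a(b1* | c b2*).
DELTA = {
    ("start", "a"): "a",
    ("a", "b"): "b1",
    ("b1", "b"): "b1",
    ("a", "c"): "c",
    ("c", "b"): "b2",
    ("b2", "b"): "b2",
}

def states(word):
    """Return the state reached after each symbol of word."""
    q, out = "start", []
    for ch in word:
        q = DELTA[(q, ch)]
        out.append(q)
    return out

before = states("acbbb")   # w  = acbbb
after = states("abbb")     # w' = w with c deleted
# Compare the state stored with each b before and after the deletion.
changed = sum(1 for old, new in zip(before[2:], after[1:]) if old != new)
```

Every b after the deleted c changes state, so the stored states for the whole remainder of the string would have to be rewritten.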

  19. Definition of 1,2 Conflict-free Let E be a regular expression over alphabet Σ. Follow(E, x): the set of positions in E that can follow x in some path through E. Define such that E is a 1,2 conflict-free regular expression if:

  20. Restricted forms of DTD • 1,2 conflict-free DTD: there is no "flipping" between automaton states after the update. The per-update complexity is O(log n + log d) time and O(n log d) auxiliary space. • Conflict-free DTD: no repeated symbols. The per-update complexity is O(log n + log d) time and constant auxiliary space.

  21. Incremental validation of ID and IDREF when adding an element The Append(p, y) and InsertBefore(x, y) operations require checking that no two ID attributes have the same value and that every IDREF attribute in y refers to an existing ID value in the document. The complexity is O(|y| log n) time and linear auxiliary space, where |y| is the size of the added subtree.
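
A sketch of the insertion check, assuming the IDs already in the document are kept in an index. A Python `set` is used here for brevity, which gives expected constant-time lookups instead of the O(log n) per lookup quoted on the slide. The sketch also assumes that an IDREF in y may point to an ID inside y itself. The name `try_insert` is hypothetical.

```python
def try_insert(id_index, new_ids, new_idrefs):
    """Check the IDs and IDREFs of an inserted subtree y against the index.

    id_index: set of ID values already in the document (updated on success).
    new_ids / new_idrefs: ID and IDREF values occurring in y.
    """
    fresh = set(new_ids)
    if len(fresh) != len(new_ids):
        return False                     # duplicate ID inside y
    if any(i in id_index for i in fresh):
        return False                     # ID clashes with the document
    if not all(r in id_index or r in fresh for r in new_idrefs):
        return False                     # dangling IDREF
    id_index.update(fresh)               # commit only after all checks pass
    return True

index = {"p1", "p2"}
accepted = try_insert(index, ["p3"], ["p1", "p3"])
clash = try_insert(index, ["p2"], [])
dangling = try_insert(index, ["p4"], ["p9"])
```

The work is proportional to the number of attributes in y, not to the size of the document.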

  22. Incremental validation of ID and IDREF when deleting an element After a Delete(x) operation we have to check that the subtree rooted at x contains no node with an ID attribute that is referenced by some node that is not a descendant of x. Checking the reference counter on delete requires O(log n) time. Updating the reference counter when inserting or removing an IDREF attribute requires O(h log n) time.
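
A simplified sketch of the deletion check with reference counters. It assumes one document-wide counter per ID value and compares it with the references found inside the deleted subtree. This is not the data structure behind the O(log n) bound on the slide, only an illustration of the condition being checked; `can_delete` is a hypothetical name.

```python
from collections import Counter

def can_delete(refcount, subtree_ids, subtree_idrefs):
    """Return True iff no ID inside the subtree is referenced from outside it.

    refcount: Counter mapping each ID value to the number of IDREFs in the
              whole document that point to it.
    subtree_ids / subtree_idrefs: ID and IDREF values inside the subtree of x.
    """
    inside = Counter(subtree_idrefs)
    # An ID is safe to remove when every reference to it comes from inside x.
    return all(refcount[i] == inside[i] for i in subtree_ids)

# Document-wide counts: "a" is referenced twice, "b" once, "c" never.
refcount = Counter({"a": 2, "b": 1})
only_internal = can_delete(refcount, ["b", "c"], ["b"])
external_ref = can_delete(refcount, ["a"], ["a"])
```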

  23. Valid Insertion [Plot: time in microseconds (log scale, 100 to 1e+08) against document size (64K to 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]

  24. Valid Deletion [Plot: time in microseconds (log scale, 100 to 1e+08) against document size (64K to 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]

  25. Invalid Deletion [Plot: time in microseconds (log scale, 10 to 1000) against document size (64K to 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]

  26. Conclusions • Handled insertion and deletion of subtrees (not leaf nodes only). • Validated ID and IDREF attributes. • Characterized a class of DTDs, which appears to capture most real-life DTDs, that admits a logarithmic-time, constant-space incremental validation algorithm. • Conducted experiments showing that the method is practical for large documents and behaves much better than full revalidation.

  27. Future Work Handling complex updates involving several insertions and deletions as a single transaction.
