Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation

Alexander (“Sasha”) Schwarzman, American Geophysical Union (AGU). JATS-Con, November 2, 2010.


Presentation Transcript


  1. Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
     Alexander (“Sasha”) Schwarzman
     American Geophysical Union (AGU)
     JATS-Con, November 2, 2010

  2. Summary
     We have built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if it were used in conjunction with the appropriate layer validation technology, such as Schematron.
     Superset Me—Not JATS-Con Nov 2, 2010

  3. Contents
     • Why we built a JPTS superset
     • DTD vs. Schematron
     • Attribute values
     • Number of element occurrences
     • Element position & sequence
     • References
     • Lessons learned

  4. Why we built a JPTS superset
     • No generic book model
     • Lack of familiarity with Schematron
     • Lack of mature tool support (running SVRL not a viable option in Production environment)
     • Lack of expertise on integrating Schematron with validation against relational DB
     • JATS v2.3: no Compound Keywords, not all content models parameterized

  5. DTD vs. Schematron: Attribute values
     Requirement: Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt)

     Strict DTD

         <!ATTLIST article
                   article-type (rga | cor | edt) #REQUIRED >

     JPTS

         <!ATTLIST article
                   article-type CDATA #IMPLIED >

  6. DTD vs. Schematron: Attribute values (cont’d)
     XML instance (contains non-allowed article type)

         <article article-type='xxx'/>

     Schematron

         <rule context="article">
           <assert test="@article-type=('rga','cor','edt')">
             @article-type '<value-of select='@article-type'/>' not allowed, must be 'rga', 'cor', or 'edt'</assert>
         </rule>

     Schematron message

         @article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
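
For readers without a Schematron processor at hand, the logic of the assertion above can be sketched in plain Python. This is only an illustration of the rule, not the AGU implementation; the function name `check_article_type` is hypothetical, and the message text is copied from the slide.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the requirement stated on slide 5.
ALLOWED_ARTICLE_TYPES = ("rga", "cor", "edt")


def check_article_type(xml_text):
    """Return one message per <article> whose @article-type is not allowed.

    Hypothetical helper: it mirrors the Schematron assertion, it is not Schematron.
    """
    root = ET.fromstring(xml_text)
    messages = []
    # iter() also visits the root element, so a bare <article/> is covered.
    for article in root.iter("article"):
        value = article.get("article-type")
        if value not in ALLOWED_ARTICLE_TYPES:
            # A missing attribute fails too, as it does in the Schematron test.
            messages.append(
                "@article-type '%s' not allowed, must be 'rga', 'cor', or 'edt'"
                % (value if value is not None else "")
            )
    return messages


for message in check_article_type("<article article-type='xxx'/>"):
    print(message)
```

Unlike the strict DTD, which can only report that the attribute value is invalid, a rule of this kind can say which value was found and which values are expected.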

  7. DTD vs. Schematron: Number of element occurrences
     Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal codes ‘ja’ and ‘rg’) where Acknowledgments must contain two paragraphs

     Strict DTD

         <!ELEMENT ack (p, p?) >

     JPTS

         <!ELEMENT ack (p*) >

  8. DTD vs. Schematron: Number of occurrences (cont’d)
     XML instance (wrong number of paragraphs)

         <article>
           ...
           <journal-id>jb</journal-id>
           ...
           <ack>
             <p>Blah</p>
             <p>Blah-blah</p>
           </ack>
         </article>

  9. DTD vs. Schematron: Number of occurrences (cont’d)
     Schematron

         <rule context="ack[ancestor::*/journal-id=('ja','rg')]">
           <assert test="count(p) eq 2">
             '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain exactly two paragraphs</assert>
         </rule>
         <rule context="ack">
           <assert test="count(p) eq 1">
             '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain only one paragraph</assert>
         </rule>

     Note: both rules must sit in the same pattern, in this order. Within a pattern, Schematron fires only the first rule whose context matches a node, so the generic ack rule applies only to the remaining journals.

  10. DTD vs. Schematron: Number of occurrences (cont’d)
      Schematron message

          'ack' in 'jb' must contain only one paragraph
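
The journal-dependent paragraph count from slides 7 to 10 can be sketched in Python as follows. This is a hedged illustration of the two rules, not the AGU schema: `check_ack` is a hypothetical name, and the sketch assumes, like the sample instance, a single `<journal-id>` per document.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs (from slide 7).
TWO_PARAGRAPH_JOURNALS = ("ja", "rg")


def check_ack(xml_text):
    """Return messages for <ack> elements with the wrong number of paragraphs."""
    root = ET.fromstring(xml_text)
    # Assumption: one <journal-id> per document, as in the sample instance.
    journal_id = root.findtext(".//journal-id", default="")
    messages = []
    for ack in root.iter("ack"):
        count = len(ack.findall("p"))  # direct <p> children only, like count(p)
        if journal_id in TWO_PARAGRAPH_JOURNALS:
            # Counterpart of the first, more specific rule.
            if count != 2:
                messages.append(
                    "'ack' in '%s' must contain exactly two paragraphs" % journal_id
                )
        elif count != 1:
            # Counterpart of the generic rule, reached only for other journals.
            messages.append(
                "'ack' in '%s' must contain only one paragraph" % journal_id
            )
    return messages


SAMPLE = (
    "<article><journal-id>jb</journal-id>"
    "<ack><p>Blah</p><p>Blah-blah</p></ack></article>"
)
for message in check_ack(SAMPLE):
    print(message)
```

The `if`/`elif` structure plays the role of rule ordering: the generic check runs only when the journal-specific one does not apply. A DTD content model cannot express this, because it cannot look at the journal code.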

  11. DTD vs. Schematron: Element position & sequence
      Requirement: If a journal has subject grouping (ToC category, subset) and an article belongs to a special collection (special section, theme), then the subject grouping info must precede the special collection info

      Strict DTD

          <!ELEMENT article-categories (subject-group*, special-collection?) >

      JPTS

          <!ELEMENT article-categories (subj-group*) >

  12. DTD vs. Schematron: Element position & sequence (cont’d)
      XML instance (wrong sequence of subject groups)

          <article-categories>
            <subj-group subj-group-type="special-section">
              <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
            </subj-group>
            <subj-group subj-group-type="toc-category">
              <subject content-type="SDE">Solid Earth</subject>
            </subj-group>
          </article-categories>

  13. DTD vs. Schematron: Element position & sequence (cont’d)
      Schematron

          <rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
            <assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
              <name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear after a ToC Category or a Subset when either is present</assert>
          </rule>

      Schematron message

          subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present
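
The ordering rule above amounts to "no subject-grouping sibling may follow a special-collection sibling". A minimal Python sketch of that logic, under the assumption that all `subj-group` elements are direct children of `article-categories` as in the sample instance, could look like this (`check_subject_group_order` is a hypothetical name):

```python
import xml.etree.ElementTree as ET

COLLECTION_TYPES = ("special-section", "theme")  # special collection info
GROUPING_TYPES = ("toc-category", "subset")      # subject grouping info


def check_subject_group_order(xml_text):
    """Report special-collection groups that precede a subject-grouping group."""
    root = ET.fromstring(xml_text)
    messages = []
    for categories in root.iter("article-categories"):
        groups = categories.findall("subj-group")  # direct children, document order
        for index, group in enumerate(groups):
            group_type = group.get("subj-group-type")
            if group_type not in COLLECTION_TYPES:
                continue
            # Counterpart of following-sibling::subj-group[...] in the rule.
            later_types = [g.get("subj-group-type") for g in groups[index + 1:]]
            if any(t in GROUPING_TYPES for t in later_types):
                messages.append(
                    "subj-group/@subj-group-type='%s' must appear after "
                    "a ToC Category or a Subset when either is present" % group_type
                )
    return messages


WRONG_ORDER = (
    "<article-categories>"
    "<subj-group subj-group-type='special-section'><subject>A</subject></subj-group>"
    "<subj-group subj-group-type='toc-category'><subject>B</subject></subj-group>"
    "</article-categories>"
)
for message in check_subject_group_order(WRONG_ORDER):
    print(message)
```

Because both kinds of information share one element name in JPTS and differ only in an attribute value, the DTD cannot order them; the check has to inspect attribute values on siblings.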

  14. DTD vs. Schematron: References
      Validating references is a challenge:
      • Variety vs. the need to enforce editorial style
      Strict DTD:
      • Fixed element order, no mixed content
      • Punctuation, spacing, face markup added on output
      JPTS:
      • Lots of elements, any order, mixed content
      • Punctuation, spacing, face markup included

  15. DTD vs. Schematron: References (cont’d)
      Strict DTD

          <!ELEMENT book-standalone-citation
                    ((person-group | string-name), year, source, edition?,
                     (person-group | string-name)?, size?, elocation-id?,
                     publisher-name, publisher-loc) >
          <!ATTLIST book-standalone-citation
                    id ID #REQUIRED >

  16. DTD vs. Schematron: References (cont’d)
      JPTS

          <!ELEMENT mixed-citation
                    (#PCDATA | person-group | string-name | year | source |
                     edition | size | elocation-id | publisher-name |
                     publisher-loc | ... | ...)* >
          <!ATTLIST mixed-citation
                    id               ID    #IMPLIED
                    publication-type CDATA #IMPLIED >

  17. DTD vs. Schematron: References (cont’d)
      Example: Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

  18. DTD vs. Schematron: References (cont’d)
      XML instance (strict DTD)

          <book-standalone-citation id="mood63">
            <person-group person-group-type="author">
              <name><surname>Mood</surname> <given-names>A. M.</given-names></name>
              <name><surname>Graybill</surname> <given-names>F. A.</given-names></name>
            </person-group>
            <year>1963</year>
            <source>Introduction to the Theory of Statistics</source>
            <edition>2nd</edition>
            <size units="page">295 pp</size>
            <publisher-name>McGraw-Hill</publisher-name>
            <publisher-loc>New York</publisher-loc>
          </book-standalone-citation>

  19. DTD vs. Schematron: References (cont’d)
      XML instance (JPTS)

          <mixed-citation publication-type="book-standalone">
          <string-name> <surname>Mood</surname>, <given-names>A. M.</given-names> </string-name>, and
          <string-name> <given-names>F. A.</given-names> <surname>Graybill</surname> </string-name>
          (<year>1963</year>),
          <source><italic>Introduction to the Theory of Statistics</italic></source>,
          <edition>2</edition>nd ed.,
          <size units="page">295</size> pp.,
          <publisher-name>McGraw-Hill</publisher-name>,
          <publisher-loc>New York</publisher-loc>.
          </mixed-citation>

  20. DTD vs. Schematron: References (cont’d)
      Schematron can check that all required elements are present and are in the correct sequence (note the required elements and that edition, if present, follows source):

          <!ELEMENT book-standalone-citation
                    ((person-group | string-name), year, source, edition?,
                     (person-group | string-name)?, size?, elocation-id?,
                     publisher-name, publisher-loc) >

  21. DTD vs. Schematron: References (cont’d)
      • Schematron can check that all required elements are present:

          <rule context="mixed-citation[@publication-type='book-standalone']">
            <assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc">
              required element missing</assert>
          </rule>

      • and that the elements are in the correct sequence:

  22. DTD vs. Schematron: References (cont’d)
      XML instance (JPTS) (edition is in the wrong place)

          <mixed-citation publication-type="book-standalone">
          <string-name> <surname>Mood</surname>, <given-names>A. M.</given-names> </string-name>, and
          <string-name> <given-names>F. A.</given-names><surname>Graybill</surname> </string-name>
          (<year>1963</year>),
          <edition>2</edition>nd ed.,
          <source><italic>Introduction to the Theory …</italic></source>,
          <size units="page">295</size> pp.,
          <publisher-name>McGraw-Hill</publisher-name>,
          <publisher-loc>New York</publisher-loc>.
          </mixed-citation>

  23. DTD vs. Schematron: References (cont’d)
      This Schematron uses the positional predicate [1] to check that year is immediately followed by source:

          <rule context="mixed-citation[@publication-type='book-standalone']/year">
            <assert test="following-sibling::*[1]/self::source">
              '<name/>' must be immediately followed by 'source', not by '<value-of select='name(following-sibling::*[1])'/>'</assert>
          </rule>

      Schematron message

          'year' must be immediately followed by 'source', not by 'edition'

  24. DTD vs. Schematron: References (cont’d)
      But how to check the sequence of required elements when there might be optional elements interspersed between them?
      This Schematron checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:

          <rule context="mixed-citation[@publication-type='book-standalone']/publisher-name">
            <assert test="preceding-sibling::source">
              '<name/>' must be preceded by 'source'</assert>
          </rule>

  25. DTD vs. Schematron: References (cont’d)
      • Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
      • Each citation element is rewritten as a string of its child element names
      • The content model is represented as a regular expression
      • Schematron checks the string of names against the regex
      • Schematron generates an error message if the content does not match the model

  26. DTD vs. Schematron: References (cont’d)
      An XML file, e.g., citation-models.xml, specifies structured citation models:

          ...
          <model publication-type="book-standalone">
            ((string-name | person-group), year, source, edition,
             (string-name | person-group)?, size?, elocation-id?,
             publisher-name, publisher-loc)
          </model>
          ...
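
The name-string-versus-regex idea described on slides 25 and 26 can be sketched in Python. This is an illustration under stated assumptions, not Rick Jelliffe's or AGU's actual code: the helper names (`model_to_regex`, `child_names`, `check_citation`) and the two sample citations are made up, and the model text is copied from the slide as shown (so `edition` is required here).

```python
import re
import xml.etree.ElementTree as ET

# Model text copied from the book-standalone entry on slide 26.
BOOK_STANDALONE_MODEL = """
((string-name | person-group), year, source, edition,
 (string-name | person-group)?, size?, elocation-id?,
 publisher-name, publisher-loc)
"""

# An element name, or one of the DTD content-model punctuation characters.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|[(),|?*+]")


def model_to_regex(model):
    """Translate a DTD-style content model into a regex over 'name name ' strings."""
    parts = []
    for token in TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            parts.append(token)
        else:
            # Element name; the trailing space stops 'name' matching inside 'string-name'.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def child_names(citation):
    """Rewrite an element as a string of its child element names (text is ignored)."""
    return "".join(child.tag + " " for child in citation)


def check_citation(citation, model=BOOK_STANDALONE_MODEL):
    """Return None if the children match the model, else an error message."""
    names = child_names(citation)
    if model_to_regex(model).fullmatch(names):
        return None
    return "child elements [%s] do not match the model" % names.strip()


# Made-up sample citations; a single person-group is used because the model
# as shown admits only one name-bearing element before year.
GOOD = ET.fromstring(
    '<mixed-citation publication-type="book-standalone">'
    '<person-group>Author, A.</person-group> (<year>2000</year>), '
    '<source>A Title</source>, <edition>2</edition>nd ed., '
    '<size units="page">100</size> pp., '
    '<publisher-name>A Press</publisher-name>, '
    '<publisher-loc>A City</publisher-loc>.</mixed-citation>'
)
BAD = ET.fromstring(
    '<mixed-citation publication-type="book-standalone">'
    '<person-group>Author, A.</person-group> (<year>2000</year>), '
    '<edition>2</edition>nd ed., <source>A Title</source>, '
    '<size units="page">100</size> pp., '
    '<publisher-name>A Press</publisher-name>, '
    '<publisher-loc>A City</publisher-loc>.</mixed-citation>'
)
print(check_citation(GOOD))
print(check_citation(BAD))
```

Because only child element names are compared, punctuation and other mixed-content text between the elements does not affect the result, which is what lets the citation keep its JPTS mixed content while still being held to a fixed element order.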

  27. DTD vs. Schematron: References (cont’d)
      • Advantages:
      • The XML instance is still DTD-valid
      • Mixed content is permitted
      • Type-sensitive handling of references is possible
      • Caveat: XSLT 2.0!

  28. Lessons learned
      • AGU Tag Set + Schematron (200+ checks)
      • Ensures data quality
      • Ensures markup integrity
      • Provides control over production processes
      • AGU Tag Set is a superset of JPTS
      • Based on JPTS
      • Uses the same modularization principles
      • Can be easily mapped to JPTS
      • Were we to do this again, we would develop a JPTS subset and a Schematron schema

  29. Lessons learned (cont’d)
      • Appropriate layer validation
      • Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
      • Rules-based checking needed anyway
      • May as well use “Californian” JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
      • Paradigm shift: the crux of validation shifts from XML parser to Schematron engine

  30. Lessons learned (cont’d)
      • This shift is not without costs:
      • Content may be valid to JPTS but make no sense
      • Dependency on Schematron for semantic integrity
      • Constraints on business partners: must be Schematron-capable and have tools
      • Schematron does not “fix” problems; people do. Processes and procedures must be well-defined

  31. Lessons learned (cont’d)
      • Writing a simple Schematron is easy; building a complex and efficient one is not:
      • Elicit, document, convey, and clarify the Requirements
      • Ensure Schematron fits into your workflow
      • Modularize Schematron
      • Ensure that individual Schematron rules aren’t in conflict
      • Optimize Schematron performance
      • Employ XSLT 2.0
      • Test, test, test
      • Cultivate Schematron & XSLT 2.0 expertise in-house

  32. Conclusion
      • What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
      • When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say: “Superset Me—Not!”
