Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)
JATS-Con, November 2, 2010

Summary
We built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if it were used in conjunction with appropriate layer validation technology, such as Schematron.
Superset Me—Not JATS-Con Nov 2, 2010

Contents
• Why we built a JPTS superset
• DTD vs. Schematron
  • Attribute values
  • Number of element occurrences
  • Element position & sequence
  • References
• Lessons learned

Why we built a JPTS superset
• No generic book model
• Lack of familiarity with Schematron
• Lack of mature tool support (running SVRL was not a viable option in the Production environment)
• Lack of expertise in integrating Schematron with validation against a relational DB
• JATS v2.3: no Compound Keywords, not all content models parameterized

DTD vs. Schematron: Attribute values
Requirement: Article type is required and must be one of three types: a regular article (rga), a correction (cor), or an editorial (edt).
Strict DTD
    <!ATTLIST article article-type (rga | cor | edt) #REQUIRED>
JPTS
    <!ATTLIST article article-type CDATA #IMPLIED>

DTD vs. Schematron: Attribute values (cont’d)
XML instance (contains a non-allowed article type)
    <article article-type='xxx'/>
Schematron
    <rule context="article">
      <assert test="@article-type=('rga','cor','edt')">
        @article-type '<value-of select='@article-type'/>' not allowed, must be 'rga', 'cor', or 'edt'</assert>
    </rule>
Schematron message
    @article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'

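The test above compares the attribute against a sequence literal, which is XPath 2.0 syntax. For toolchains limited to XPath 1.0, the same check could be sketched with explicit comparisons. This is a hedged illustration only: the schema wrapper and the pattern id are assumptions added to make the fragment self-contained.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
  <!-- Hypothetical pattern id; not taken from the AGU Schematron -->
  <pattern id="article-type-xpath1">
    <rule context="article">
      <!-- XPath 1.0 has no sequence literals, so each allowed value is compared explicitly -->
      <assert test="@article-type='rga' or @article-type='cor' or @article-type='edt'">
        @article-type '<value-of select="@article-type"/>' not allowed, must be 'rga', 'cor', or 'edt'</assert>
    </rule>
  </pattern>
</schema>
```

Because a comparison against a missing attribute is false, the assertion also fails when article-type is absent, which covers the "required" half of the requirement.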
DTD vs. Schematron: Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal codes 'ja' and 'rg'), where Acknowledgments must contain two paragraphs.
Strict DTD
    <!ELEMENT ack (p, p?)>
JPTS
    <!ELEMENT ack (p*)>

DTD vs. Schematron: Number of occurrences (cont’d)
XML instance (wrong number of paragraphs)
    <article>
      ...
      <journal-id>jb</journal-id>
      ...
      <ack>
        <p>Blah</p>
        <p>Blah-blah</p>
      </ack>
    </article>

DTD vs. Schematron: Number of occurrences (cont’d)
Schematron
    <rule context="ack[ancestor::*/journal-id=('ja','rg')]">
      <assert test="count(p) eq 2">
        '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain exactly two paragraphs</assert>
    </rule>
    <rule context="ack">
      <assert test="count(p) eq 1">
        '<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain only one paragraph</assert>
    </rule>

DTD vs. Schematron: Number of occurrences (cont’d)
Schematron message
    'ack' in 'jb' must contain only one paragraph

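The two rules for ack depend on Schematron's rule ordering: within a pattern, a node is tested only by the first rule whose context matches it, so the journal-specific rule has to precede the general one. A sketch of an alternative that puts the journal-dependent count into a single rule follows. The path ancestor::article//journal-id, the variable names, and the pattern id are assumptions for illustration and are not taken from the AGU Schematron.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Hypothetical pattern id -->
  <pattern id="ack-paragraph-count">
    <rule context="ack">
      <!-- Assumption: journal-id occurs somewhere inside the enclosing article -->
      <let name="journal" value="ancestor::article//journal-id"/>
      <!-- Two paragraphs for journals 'ja' and 'rg', one for every other journal -->
      <let name="expected" value="if ($journal = ('ja','rg')) then 2 else 1"/>
      <assert test="count(p) eq $expected">
        '<name/>' in '<value-of select="$journal"/>' must contain exactly <value-of select="$expected"/> paragraph(s), found <value-of select="count(p)"/></assert>
    </rule>
  </pattern>
</schema>
```

With one rule there is no ordering to get wrong, at the cost of a slightly less readable message than the two hand-written ones.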
DTD vs. Schematron: Element position & sequence
Requirement: If a journal has subject grouping (ToC category, subset) and the article belongs to a special collection (special section, theme), then the subject grouping information must precede the special collection information.
Strict DTD
    <!ELEMENT article-categories (subject-group*, special-collection?)>
JPTS
    <!ELEMENT article-categories (subj-group*)>

DTD vs. Schematron: Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups)
    <article-categories>
      <subj-group subj-group-type="special-section">
        <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
      </subj-group>
      <subj-group subj-group-type="toc-category">
        <subject content-type="SDE">Solid Earth</subject>
      </subj-group>
    </article-categories>

DTD vs. Schematron: Element position & sequence (cont’d)
Schematron
    <rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
      <assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
        <name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear after a ToC Category or a Subset when either is present</assert>
    </rule>
Schematron message
    subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present

DTD vs. Schematron: References
Validating references is a challenge:
• Variety vs. the need to enforce editorial style
Strict DTD:
• Fixed element order, no mixed content
• Punctuation, spacing, and face markup are generated on output
JPTS:
• Lots of elements, any order, mixed content
• Punctuation, spacing, and face markup are included in the content

DTD vs. Schematron: References (cont’d)
Strict DTD
    <!ELEMENT book-standalone-citation
      ((person-group | string-name), year, source, edition?,
       (person-group | string-name)?, size?, elocation-id?,
       publisher-name, publisher-loc)>
    <!ATTLIST book-standalone-citation
      id ID #REQUIRED>

DTD vs. Schematron: References (cont’d)
JPTS
    <!ELEMENT mixed-citation
      (#PCDATA | person-group | string-name | year | source | edition |
       size | elocation-id | publisher-name | publisher-loc | ... | ...)*>
    <!ATTLIST mixed-citation
      id               ID    #IMPLIED
      publication-type CDATA #IMPLIED>

DTD vs. Schematron: References (cont’d)
Example:
Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)
XML instance (strict DTD)
    <book-standalone-citation id="mood63">
      <person-group person-group-type="author">
        <name><surname>Mood</surname> <given-names>A. M.</given-names></name>
        <name><surname>Graybill</surname> <given-names>F. A.</given-names></name>
      </person-group>
      <year>1963</year>
      <source>Introduction to the Theory of Statistics</source>
      <edition>2nd</edition>
      <size units="page">295 pp</size>
      <publisher-name>McGraw-Hill</publisher-name>
      <publisher-loc>New York</publisher-loc>
    </book-standalone-citation>

DTD vs. Schematron: References (cont’d)
XML instance (JPTS)
    <mixed-citation publication-type="book-standalone">
      <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>,
      and <string-name><given-names>F. A.</given-names> <surname>Graybill</surname></string-name>
      (<year>1963</year>),
      <source><italic>Introduction to the Theory of Statistics</italic></source>,
      <edition>2</edition>nd ed.,
      <size units="page">295</size> pp.,
      <publisher-name>McGraw-Hill</publisher-name>,
      <publisher-loc>New York</publisher-loc>.
    </mixed-citation>

DTD vs. Schematron: References (cont’d)
Schematron can check that all required elements are present and are in the correct sequence (note the required elements and that edition, if present, follows source):
    <!ELEMENT book-standalone-citation
      ((person-group | string-name), year, source, edition?,
       (person-group | string-name)?, size?, elocation-id?,
       publisher-name, publisher-loc)>

DTD vs. Schematron: References (cont’d)
• Schematron can check that all required elements are present:
    <rule context="mixed-citation[@publication-type='book-standalone']">
      <assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc">
        required element missing</assert>
    </rule>
• and that the elements are in the correct sequence, as the following slides show.

DTD vs. Schematron: References (cont’d)
XML instance (JPTS) (edition is in the wrong place)
    <mixed-citation publication-type="book-standalone">
      <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>,
      and <string-name><given-names>F. A.</given-names><surname>Graybill</surname></string-name>
      (<year>1963</year>),
      <edition>2</edition>nd ed.,
      <source><italic>Introduction to the Theory …</italic></source>,
      <size units="page">295</size> pp.,
      <publisher-name>McGraw-Hill</publisher-name>,
      <publisher-loc>New York</publisher-loc>.
    </mixed-citation>

DTD vs. Schematron: References (cont’d)
This Schematron uses the positional predicate [1] to check that year is immediately followed by source:
    <rule context="mixed-citation[@publication-type='book-standalone']/year">
      <assert test="following-sibling::*[1]/self::source">
        '<name/>' must be immediately followed by 'source', not by '<value-of select='name(following-sibling::*[1])'/>'</assert>
    </rule>
Schematron message
    'year' must be immediately followed by 'source', not by 'edition'

DTD vs. Schematron: References (cont’d)
But how do we check the sequence of required elements when optional elements might be interspersed between them?
This Schematron checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:
    <rule context="mixed-citation[@publication-type='book-standalone']/publisher-name">
      <assert test="preceding-sibling::source">
        '<name/>' must be preceded by 'source'</assert>
    </rule>

DTD vs. Schematron: References (cont’d)
Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
• The content of each element is rewritten as a string of its child element names
• The content model is represented as a regular expression
• Schematron checks the string of names against the regex
• Schematron generates an error message if the content does not match the model

DTD vs. Schematron: References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation models:
    ...
    <model publication-type="book-standalone">
      ((string-name | person-group), year, source, edition?,
       (string-name | person-group)?, size?, elocation-id?,
       publisher-name, publisher-loc)
    </model>
    ...

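A minimal sketch of how the name-string-plus-regex idea could look for one citation type, assuming an XSLT 2.0 query binding. The regular expression is translated by hand from the book-standalone model (with edition treated as optional, as in the strict DTD) instead of being read from citation-models.xml, and the pattern id and variable names are hypothetical. It illustrates the technique and is not Rick Jelliffe's actual implementation.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Hypothetical pattern id -->
  <pattern id="book-standalone-model">
    <rule context="mixed-citation[@publication-type='book-standalone']">
      <!-- Child element names in document order, each followed by one space;
           text nodes (punctuation, spacing) are ignored -->
      <let name="names"
           value="string-join(for $e in * return concat(name($e), ' '), '')"/>
      <!-- Assumption: hand-written regex equivalent of the content model -->
      <let name="model"
           value="'^(person-group |string-name )year source (edition )?(person-group |string-name )?(size )?(elocation-id )?publisher-name publisher-loc $'"/>
      <assert test="matches($names, $model)">
        children of '<name/>' (<value-of select="$names"/>) do not match the book-standalone model</assert>
    </rule>
  </pattern>
</schema>
```

Because only element names are compared, mixed content stays legal while the order of elements is still enforced. The model as written admits a single leading person-group or string-name, so a citation with two string-name children, like the Mood and Graybill instance, would also be reported unless that first group is made repeatable.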
DTD vs. Schematron: References (cont’d)
Advantages:
• Document is still DTD-valid
• Mixed content is permitted
• Type-sensitive handling of references is possible
Caveat: requires XSLT 2.0!

Lessons learned
• AGU Tag Set + Schematron (200+ checks)
  • Ensures data quality
  • Ensures markup integrity
  • Provides control over production processes
• AGU Tag Set is a superset of JPTS
  • Based on JPTS
  • Uses the same modularization principles
  • Can be easily mapped to JPTS
• Were we to do this again, we would develop a JPTS subset and a Schematron

Lessons learned (cont’d)
• Appropriate layer validation
  • Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
  • Rules-based checking is needed anyway
  • May as well use the “Californian” JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
• Paradigm shift: the crux of validation shifts from the XML parser to the Schematron engine

Lessons learned (cont’d)
• This shift is not without costs:
  • Content may be valid to JPTS but make no sense
  • Dependency on Schematron for semantic integrity
  • Constraints on business partners: they must be Schematron-capable and have tools
• Schematron does not “fix” problems; people do. Processes and procedures must be well-defined

Lessons learned (cont’d)
• Writing a simple Schematron is easy; building a complex and efficient one is not:
  • Elicit, document, convey, and clarify the requirements
  • Ensure Schematron fits into your workflow
  • Modularize Schematron
  • Ensure that individual Schematron rules aren’t in conflict
  • Optimize Schematron performance
  • Employ XSLT 2.0
  • Test, test, test
  • Cultivate Schematron & XSLT 2.0 expertise in-house

Conclusion
• What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
• When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say: “Superset Me—Not!”