
Batch-conversion of Non-standard Multiscript Records by XSLT

Lucas Mak, Metadata and Catalog Librarian, Michigan State University.


Presentation Transcript


  1. Batch-conversion of Non-standard Multiscript Records by XSLT Lucas Mak Metadata and Catalog Librarian Michigan State University Catalog Management Interest Group, ALA Midwinter 2011, Jan. 8, 2011, San Diego CA

  2. Agenda • Background • Structure of multiscript records • Model A vs. Model B • Using z39.50 for cataloging • Multiscript records retrieved through z39.50 • Coding issues • Problems caused by non-standard multiscript records • Solutions • Design of XSLT • Processing logic • Factors affecting the design • Limitations & unintended consequences

  3. Structure of Multiscript Records • Multiscript records • For recording data in multiple scripts in MARC records • One script may be considered the primary script of the data content of the record, even though other scripts are also used for data content • Two models • Model A: Vernacular & Transliteration • Model B: Simple Multiscript Records

  4. Structure of Multiscript Records • Model A: Vernacular & Transliteration • The regular fields may contain data in different scripts and in the vernacular or transliteration of the data. Fields 880 are used when data needs to be duplicated to express it in both the original vernacular script and transliterated into one or more scripts • Model A data in the regular fields is linked to the data in 880 fields by a subfield $6 that occurs in both of the associated fields • $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code] * MARC21 Bibliographic Appx. D
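To make the subfield $6 syntax concrete, here is a minimal XSLT 2.0 sketch that splits each $6 value found in a MARCXML file into its components. It is an illustration only, not part of the presentation: the output element names (`linkages`, `linkage`, `linkingTag`, and so on) are hypothetical, and the input is assumed to use the standard MARCXML namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <xsl:output method="xml" indent="yes"/>

  <!-- Report every subfield $6 found in the input MARCXML. -->
  <xsl:template match="/">
    <linkages>
      <xsl:apply-templates select="//marc:datafield/marc:subfield[@code='6']"/>
    </linkages>
  </xsl:template>

  <xsl:template match="marc:subfield[@code='6']">
    <!-- Slash-separated parts: tag-occurrence, script code, orientation code. -->
    <xsl:variable name="parts" select="tokenize(normalize-space(.), '/')"/>
    <!-- The "field" attribute records the tag of the field that holds this $6. -->
    <linkage field="{../@tag}">
      <linkingTag><xsl:value-of select="substring-before($parts[1], '-')"/></linkingTag>
      <occurrenceNumber><xsl:value-of select="substring-after($parts[1], '-')"/></occurrenceNumber>
      <!-- These two stay empty when the $6 value carries no such code. -->
      <scriptIdentificationCode><xsl:value-of select="$parts[2]"/></scriptIdentificationCode>
      <fieldOrientationCode><xsl:value-of select="$parts[3]"/></fieldOrientationCode>
    </linkage>
  </xsl:template>
</xsl:stylesheet>
```

For an 880 field whose $6 is `245-01/$1`, this reports linking tag 245, occurrence number 01, and script identification code $1, with an empty field orientation code.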

  5. Structure of Multiscript Records • Model A: Vernacular & Transliteration [Figure: sample subfield $6 values with callouts for Linking Tag, Occurrence Number, Script Identification Code, and Field Orientation Code]

  6. Structure of Multiscript Records • Model A: Vernacular & Transliteration [Figure: sample subfield $6 value with callouts for Linking Tag and Occurrence Number]

  7. CJK Record according to Model A Specifications

  8. Structure of Multiscript Records • Model B: Simple Multiscript Records • All data is contained in regular fields and script varies depending on the requirements of the data • Repeatability specifications of all fields should be followed • Although the Model B record may contain transliterated data, Model A is preferred if the same data is recorded in both the original vernacular script and transliteration • Field 880 is not used * MARC21 Bibliographic Appx. D

  9. CJK Record according to Model B Specifications Item in Chinese. Cataloging language in English

  10. Structure of Multiscript Records • Field 066 (Character Sets Present) • To indicate the MARC-8 character sets other than the default sets that are invoked in the record • MARC-8 vs. Unicode Environment

  11. z39.50 for Cataloging • SkyRiver • MSU switched to SkyRiver in Oct 2009 • Ways to expand the pool of re-usable bibliographic records • z39.50 function in Innovative Millennium (day-to-day cataloging) • MarcEdit z39.50 client (HathiTrust record load)

  12. z39.50 search in Millennium

  13. z39.50 search in Millennium (Record retrieved for Editing)

  14. HathiTrust Data Availability

  15. MarcEdit z39.50 Client (HathiTrust) Batch search against Univ. of Michigan Catalog using UM record identifier

  16. HathiTrust Record Load Workflow [Diagram labels: U of M Catalog, MSU Catalog, Request, Retrieve, Record Dump]

  17. Non-standard Multiscript Records from z39.50 Sample Non-standard CJK Record Retrieved by MSU Millennium z39.50 Client

  18. Same Record in Source Library Catalog (Staff View)

  19. Non-standard Multiscript Records from z39.50 HathiTrust Record Retrieved by MarcEdit z39.50 Client* * As of Dec. 10, 2010, Univ. of Michigan has rebuilt 880 fields on their z39.50 serving records

  20. Same HathiTrust Record in Univ. of Michigan Catalog (Staff View)

  21. Coding Issues • Standard Model A Coding: • Field-pairing: transliteration in regular field, vernacular data in 880 field • Linking tag: tag number of an associated field • Script identification code*: $1 => CJK script • Non-standard Coding: • Field-pairing: vernacular data in regular field • No linking tag in subfield $6 • No script identification code in subfield $6 (may be due to Unicode environment) * Applicable to MARC-8 encoded records

  22. Coding Issues • Standard Model A Coding: • Field orientation code: /r • Non-standard Coding: • No field orientation code in subfield $6

  23. Coding Issues • Model B Guidelines: • Repeatability specifications of all fields should be followed • Model A is preferred if the same data is recorded in both the original vernacular script and transliteration • Non-standard Coding Practice: • Repeat non-repeatable fields (245, 250) • Duplication of data in both vernacular and transliteration

  24. Problems Caused by Non-standard Multiscript Records • Irregular/Incorrect field orientation in Arabic and Hebrew records in OPAC display • Left-to-right display of subfields in “Title” due to the lack of “Field Orientation code” while scripts within subfields are from right to left [Screenshot caption: “Field Orientation code” added back]

  25. Problems Caused by Non-standard Multiscript Records • Irregularity in result display • Inconsistent sequencing of vernacular and transliteration fields

  26. Problems Caused by Non-standard Multiscript Records • Database maintenance • Data structure inconsistency • Same kind of data resides in two different places • Extra steps needed to accommodate inconsistencies • Heading validation issues • NACO records with headings in vernacular in 4xx since mid 2008 • Vernacular headings (4xx) in regular fields

  27. Problems Caused by Non-standard Multiscript Records • Expectation in retrieval of vernacular data • MSU only indexes CJK and Cyrillic data in 880 fields • Arabic, Hebrew, Greek, and other vernacular data in regular fields of non-standard multiscript records are indexed and searchable • This creates a false impression that patrons can search in scripts other than CJK and Cyrillic

  28. Solutions • MSU uses Model A for multiscript records • Tasks • To change field tag of vernacular data to 880 • Subfield $6 in both regular & 880 fields • To insert linking tag • Subfield $6 in 880 fields • To insert script identification code* • To insert field orientation code for Arabic & Hebrew records • To insert 066 field if it does not already exist* *No longer applicable since MSU has moved to Unicode environment

  29. Solutions • Necessary steps • Determine which fields contain vernacular data • Replace regular field tag with 880 • Determine which script(s) are contained in a record • Insert field 066* • Insert “Script Identification code*” and “Field Orientation code” when appropriate *No longer applicable since MSU has moved to Unicode environment

  30. Solutions • XSLT (Extensible Stylesheet Language Transformations) • Within the family of XML • Current version: 2.0 • Case sensitive • “Transformation” means: • Manipulation of XML documents by creating a new document based on the original document • Common usages in library context • Web display • e.g. converting EAD into HTML for display • Metadata crosswalking • Data selection and manipulation • Conditional processing • Specify matching criteria and corresponding action(s)

  31. Database Maintenance Workflow [Diagram labels: Uncorrected MARC File, Uncorrected MARCXML, Corrected MARCXML, Corrected MARC File]

  32. Alternative HathiTrust Pre-load Data Cleanup Workflow [Diagram labels: MSU Catalog, U of M Catalog, Request, Retrieve, Uncorrected records, XSLT Processor, Corrected records]

  33. Design of XSLT • Processing logic • Regular field to 880 and insert linking tag • Remove all roman data from a field • Determine length of a field • 0 => no vernacular data • ≠0 => contains vernacular data • Field 066, Script identification & Field orientation codes • Match vernacular data field against vernacular characters

  34. Design of XSLT • Remove all roman data • Roman data (ASCII, special characters & diacritics used in transliteration) • replace() and translate() functions • Find “pattern A” and replace it with “pattern B” • Replace roman data with nothing
      <xsl:value-of select="replace(replace(replace(translate(translate(translate(translate(normalize-space(.),$ascii,$spaces),$specialCharacters,' '),$diacritics,' '),$extendedLatin,' '),$apos,' '),'[A-Za-z]',' '),' ','')"/>
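The same idea can also be expressed with a single replace() call that strips Latin-script characters by Unicode block rather than through variables holding character lists. This is a swapped-in technique, shown only as a sketch and not as the stylesheet used in the presentation. The function name `local:vernacular-only`, its namespace URI, and the choice of blocks are assumptions made for illustration, and the block list would need tuning against real records.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs marc local">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: delete characters from Latin-script, combining
       diacritic, and general punctuation blocks, plus whitespace.
       Whatever is left is treated as vernacular data. -->
  <xsl:function name="local:vernacular-only" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:sequence select="replace($text,
      '[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsLatinExtended-B}\p{IsLatinExtendedAdditional}\p{IsSpacingModifierLetters}\p{IsCombiningDiacriticalMarks}\p{IsGeneralPunctuation}\s]',
      '')"/>
  </xsl:function>

  <!-- Report, per data field, how many characters survive the stripping. -->
  <xsl:template match="/">
    <xsl:for-each select="//marc:datafield">
      <xsl:value-of select="@tag"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="string-length(local:vernacular-only(string-join(marc:subfield[@code != '6'], ' ')))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against a MARCXML file with an XSLT 2.0 processor, this prints one line per data field. A count of 0 means the field holds only roman data, and any other count means some vernacular characters remain.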

  35. Design of XSLT • Test the length of the field after removing all non-vernacular data • XSLT elements: <xsl:choose> in combination with <xsl:when> & <xsl:otherwise> • XSLT functions: string-length()
      <xsl:choose>
        <xsl:when test="string-length($subfieldString)=0">
          …… [series of actions when string-length equals 0]
        </xsl:when>
        <xsl:otherwise>
          …… [series of actions when string-length does not equal 0]
        </xsl:otherwise>
      </xsl:choose>

  36. Design of XSLT • Field with no vernacular data
      <xsl:when test="string-length($subfieldString)=0">
        <xsl:element name="marc:datafield">
          <xsl:attribute name="tag"> <xsl:value-of select="$tag"/> </xsl:attribute>
          <xsl:attribute name="ind1"> <xsl:value-of select="$ind1"/> </xsl:attribute>
          <xsl:attribute name="ind2"> <xsl:value-of select="$ind2"/> </xsl:attribute>
          <xsl:element name="marc:subfield">
            <xsl:attribute name="code"> <xsl:text>6</xsl:text> </xsl:attribute>
            <xsl:text>880-</xsl:text>
            <xsl:value-of select="$subfield6"/>
          </xsl:element>
          <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
        </xsl:element>
      </xsl:when>
      Callouts: • Test length of the field • Insert original values • Insert linking tag (880) and original occurrence number • Copy subfields other than $6

  37. Design of XSLT • Field with vernacular data
      <xsl:otherwise>
        <xsl:element name="marc:datafield">
          <xsl:attribute name="tag"> <xsl:text>880</xsl:text> </xsl:attribute>
          <xsl:attribute name="ind1"> <xsl:value-of select="$ind1"/> </xsl:attribute>
          <xsl:attribute name="ind2"> <xsl:value-of select="$ind2"/> </xsl:attribute>
          <xsl:element name="marc:subfield">
            <xsl:attribute name="code"> <xsl:text>6</xsl:text> </xsl:attribute>
            <xsl:value-of select="$tag"/>
            <xsl:text>-</xsl:text>
            <xsl:value-of select="$subfield6"/>
            …… [Insert “Script Identification Code” & “Field Orientation Code”]
          </xsl:element>
          <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
        </xsl:element>
      </xsl:otherwise>
      Callouts: • Insert “880” as tag no. • Insert original values • Insert original tag no. as linking tag • Insert original occurrence number
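The two branches above can be combined into one small, self-contained stylesheet. The sketch below is a simplified illustration under stated assumptions, not the stylesheet from the presentation. It assumes the non-standard subfield $6 holds only the occurrence number, it strips only a few Latin blocks where the original also strips special characters, and it leaves fields in their original position rather than moving 880 fields to the end of the record.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Paired fields: regular (non-880) data fields that carry a subfield $6. -->
  <xsl:template match="marc:datafield[@tag != '880'][marc:subfield[@code='6']]">
    <!-- Assumption: the non-standard $6 holds only the occurrence number. -->
    <xsl:variable name="subfield6"
        select="normalize-space(marc:subfield[@code='6'][1])"/>
    <!-- Field text with Latin characters, diacritics, and whitespace removed
         (a simplified stand-in for the nested translate()/replace() calls). -->
    <xsl:variable name="subfieldString"
        select="replace(string-join(marc:subfield[@code != '6'], ''),
                '[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsCombiningDiacriticalMarks}\s]', '')"/>
    <xsl:variable name="hasVernacular" select="string-length($subfieldString) != 0"/>

    <!-- Vernacular fields become 880; transliterated fields keep their tag. -->
    <marc:datafield tag="{if ($hasVernacular) then '880' else @tag}"
                    ind1="{@ind1}" ind2="{@ind2}">
      <marc:subfield code="6">
        <!-- 880 fields link back to the original tag; regular fields link to 880. -->
        <xsl:value-of select="if ($hasVernacular)
                              then concat(@tag, '-', $subfield6)
                              else concat('880-', $subfield6)"/>
      </marc:subfield>
      <xsl:copy-of select="marc:subfield[@code != '6']"/>
    </marc:datafield>
  </xsl:template>
</xsl:stylesheet>
```

Literal result elements with attribute value templates stand in for the `<xsl:element>` and `<xsl:attribute>` instructions on the slides; the two forms produce the same output.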

  38. Design of XSLT • Insert “Script Identification Code” (MARC-8 environment)
      <xsl:choose>
        <xsl:when test="matches($basicArabic,substring($subfieldString,1,1)) or matches($extendedArabic,substring($subfieldString,1,1))">
          <xsl:text>/(3</xsl:text>
        </xsl:when>
        <xsl:when test="matches($greek,substring($subfieldString,1,1))">
          <xsl:text>/(S</xsl:text>
        </xsl:when>
        <xsl:when test="matches($basicHebrew,substring($subfieldString,1,1))">
          <xsl:text>/(2</xsl:text>
        </xsl:when>
        <xsl:when test="matches($basicCyrillic,substring($subfieldString,1,1)) or matches($extendedCyrillic,substring($subfieldString,1,1))">
          <xsl:text>/(N</xsl:text>
        </xsl:when>
        <xsl:when test="matches($bengali,substring($subfieldString,1,1)) or matches($tamil,substring($subfieldString,1,1)) or matches($thai,substring($subfieldString,1,1)) or matches($devanagar,substring($subfieldString,1,1))"/>
        <xsl:otherwise>
          <xsl:text>/$1</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
      Callouts: • Insert code for Arabic • Insert code for Greek • Insert code for Hebrew • Insert code for Cyrillic • Insert code for CJK

  39. Design of XSLT • Insert “Field Orientation Code”
      <xsl:choose>
        <xsl:when test="contains($subfieldString,'[Arabic script]') or contains($subfieldString,'[Hebrew script]')">
          <xsl:text>//r</xsl:text>
        </xsl:when>
      </xsl:choose>
      Callouts: • Test if the subfield contains Arabic or Hebrew script • Insert Field Orientation Code
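The “[Arabic script]” and “[Hebrew script]” placeholders on this slide stand for the actual characters being tested. One way to write the test without listing characters, offered as a sketch rather than as the presenter's method, is to match on the Unicode block escapes `\p{IsArabic}` and `\p{IsHebrew}`. The stylesheet below assumes a Unicode environment in which the 880 subfield $6 holds only the tag and occurrence number, so it appends `//r` as on the slide. The Arabic presentation-form blocks are not covered.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Subfield $6 of every 880 field. -->
  <xsl:template match="marc:datafield[@tag='880']/marc:subfield[@code='6']">
    <!-- Text of the sibling subfields, used to detect right-to-left script. -->
    <xsl:variable name="fieldText"
        select="string-join(../marc:subfield[@code != '6'], ' ')"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="."/>
      <!-- Add the orientation code only when Arabic or Hebrew characters are
           present and the $6 has no slash-delimited codes yet. -->
      <xsl:if test="not(contains(., '/'))
                    and matches($fieldText, '[\p{IsArabic}\p{IsHebrew}]')">
        <xsl:text>//r</xsl:text>
      </xsl:if>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With this sketch, an 880 field containing Hebrew text and a $6 of `245-01` would come out with `245-01//r`, while fields in other scripts pass through unchanged.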

  40. Design of XSLT • Field 066 (MARC-8 environment) • Insert character set code in subfield $c • A single record may have more than one vernacular script => multiple subfield $c • XSLT element: <xsl:if> • Allows multiple matches • XSLT function: matches() • Processing logic • Turn the whole record into a text string • Remove all Latin data • Match vernacular script against normalized text string

  41. Design of XSLT • After removing all Latin data from the record
      <xsl:value-of select="translate(translate(translate(translate(translate(translate(translate(translate(translate(translate(.,$basicArabic,'3'),$extendedArabic,'4'),$basicCyrillic,'N'),$extendedCyrillic,'Q'),$Greek,'S'),$basicHebrew,'2'),$bengali,'b'),$tamil,'ta'),$thai,'th'),$devanagar,'d')"/>
      …
      <xsl:if test="matches($normalizedWholeRecord,'3')">
        <xsl:element name="marc:subfield">
          <xsl:attribute name="code">c</xsl:attribute>
          <xsl:text>(3</xsl:text>
        </xsl:element>
      </xsl:if>
      ……
      <xsl:if test="matches($normalizedWholeRecord,'[^A-Za-z0-9]')">
        <xsl:element name="marc:subfield">
          <xsl:attribute name="code">c</xsl:attribute>
          <xsl:text>$1</xsl:text>
        </xsl:element>
      </xsl:if>
      Callouts: • Replace Arabic characters with “3” • Test if the normalized data contains “3” • Insert “(3” as the character set code in $c • Test if any non-alpha-numeral characters exist • Insert code for CJK
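For comparison, the 066 logic can be sketched as one self-contained stylesheet that tests the record text against Unicode blocks directly instead of first normalizing it with translate(). This is an illustrative variant, not the presenter's code. It covers only a subset of scripts, and it maps every Arabic and Cyrillic character to the basic set codes, whereas the slide distinguishes basic from extended sets. The character set codes are taken from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="xs">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only records that do not already have a field 066. -->
  <xsl:template match="marc:record[not(marc:datafield[@tag='066'])]">
    <!-- Turn the data fields of the whole record into one text string. -->
    <xsl:variable name="text" select="string-join(marc:datafield/marc:subfield, ' ')"/>
    <!-- One code per script found; a record may yield several. -->
    <xsl:variable name="charsets" as="xs:string*">
      <xsl:if test="matches($text, '\p{IsArabic}')"><xsl:sequence select="'(3'"/></xsl:if>
      <xsl:if test="matches($text, '\p{IsCyrillic}')"><xsl:sequence select="'(N'"/></xsl:if>
      <xsl:if test="matches($text, '\p{IsGreek}')"><xsl:sequence select="'(S'"/></xsl:if>
      <xsl:if test="matches($text, '\p{IsHebrew}')"><xsl:sequence select="'(2'"/></xsl:if>
      <xsl:if test="matches($text, '[\p{IsCJKUnifiedIdeographs}\p{IsHiragana}\p{IsKatakana}\p{IsHangulSyllables}]')">
        <xsl:sequence select="'$1'"/>
      </xsl:if>
    </xsl:variable>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Leader, control fields, and data fields tagged below 066. -->
      <xsl:apply-templates select="marc:leader | marc:controlfield | marc:datafield[number(@tag) lt 66]"/>
      <xsl:if test="exists($charsets)">
        <marc:datafield tag="066" ind1=" " ind2=" ">
          <xsl:for-each select="$charsets">
            <marc:subfield code="c"><xsl:value-of select="."/></marc:subfield>
          </xsl:for-each>
        </marc:datafield>
      </xsl:if>
      <!-- Remaining data fields, in their original order. -->
      <xsl:apply-templates select="marc:datafield[not(number(@tag) lt 66)]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Collecting the codes in a sequence before writing the field means that no empty 066 is created for records that contain only Latin-script data.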

  42. Design of XSLT • Factors affecting the design • Pre-load vs. post-load data clean up (HathiTrust workflow) • Mechanism to filter out non-multiscript records needed for pre-load data clean up • Construction of 949 overlay command* • MARC-8 vs. Unicode • Field 066 and Script identification code not allowed in Unicode environment • 2 separate XSLTs made • OCLC vs. MARC21 Standard • Representation of Bengali, Devanagari, Tamil, and Thai in field 066 * Innovative Millennium specific

  43. Limitations & Unintended Consequences • Processing of data represented by UTF-8 character number • \U+0e33\\U+0e43\\U+0e2b\\U+0e49\ • Vernacular scripts processed (MARC-8 environment) • Handling of unlinked vernacular data • Implications on OPAC display

  44. Questions? Lucas Mak makw@mail.lib.msu.edu Michigan State University Libraries
