data bricolage n.
Presentation Transcript

  1. Data Bricolage: Mixed methods to verify, summarize, clean, and enhance data in and out of the ILS • Kristina Spurgin, E-Resources Cataloger, UNC-Chapel Hill

  2. BRICOLAGE “construction (as of a sculpture or a structure of ideas) achieved by using whatever comes to hand; also : something constructed in this way” – Photo by dannybirchall

  3. A map • A bit of context: my institution and my role in it • My favorite load table: gathering bib records • Extended example of “data bricolage” for cleaning/enhancing bib records • Script to verify full-text access to ebooks • Script/program to summarize data exported from Millennium

  4. “It’s complicated…” • University of NC @ Chapel Hill • Large institution, ARL member • 6,048,337 catalog results (not exactly what’s in our III backend, but gives an idea of scale) • 3 administrative units • +/- 30 branches and specialized collection locations • >1060 item locations • Part of Triangle Research Libraries Network, sharing: • Endeca OPAC • Physical storage space • Some MARC records • Some acquisitions • 1 staff member with load table training Photo of Davis Library @ UNC by benuski

  5. My official job – E-resources cataloger • Managing & loading batches of MARC records for ebooks • Individual cataloging of Web sites, online databases, and some ebooks • Oversee maintenance of URLs in catalog records • (new!) Extraction of our catalog data from Millennium for use in our Endeca OPAC

  6. My official job – Tools of the data bricoleur

  7. My unofficial job – “Fixer” • So, HathiTrust requires very specific info in their metadata for an ingest… • Oops, a lot of titles in that big ebook package we just cancelled were on EReserve. How can we identify them? • This branch library has an old Access database of items they want to put in the catalog… • We need a way to easily work with payment data outside Millennium for a serials review!

  8. My favorite load table: gathering, cleaning, & enhancing records

  9. BACKGROUND: a pre-existing workflow Spreadsheet from Internet Archive Scribe manager: Spreadsheet >> MarcEdit Delimited Text Translator:

  10. BACKGROUND: A pre-existing workflow Compiled to .mrc and loaded with locally-created load table that: • Matches on bnum (907) for overlay • Protects ALL fields in existing record (LDR, Cat Date, etc… everything) • Inserts any fields from the new stub record (will create dupe fields) • Creates new item

  11. Why am I telling you about this old thing? “How can I get these back into a review file?” b29786551 b30718326 b31024907 b31024932 b31351463 b32383137 b32568149 b32594124 b32874492 b32921342 b32935602 b33764037 … “You can’t, really.” (me)

  12. What if I loaded stub records containing nothing but the bnum? • On load, check “Use Review Files” box • It works! • We toggle item creation in the load table as needed (trivial tweak) (me)
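
The bnum-only stub idea on slide 12 could start from a plain list of bnums that is cleaned up before it goes through the MarcEdit Delimited Text Translator. The following is a minimal Python sketch, not the presenter's actual process. The function names and the bnum pattern ("b" plus 8 characters, as in the slide 11 list) are assumptions made for illustration.

```python
import re

# Assumed shape of a bnum as listed on slide 11: "b" followed by 7 digits
# and one final character (a digit, or possibly "x").
BNUM_PATTERN = re.compile(r"^b\d{7}[\dx]$")


def build_stub_lines(raw_bnums):
    """Return unique, well-formed bnums in first-seen order, one per stub record."""
    seen = set()
    stubs = []
    for raw in raw_bnums:
        bnum = raw.strip().lower()
        if not BNUM_PATTERN.match(bnum):
            continue  # skip blanks and anything that does not look like a bnum
        if bnum not in seen:
            seen.add(bnum)
            stubs.append(bnum)
    return stubs


def write_stub_file(raw_bnums, path):
    """Write one bnum per line; each line would later be mapped to field 907."""
    stubs = build_stub_lines(raw_bnums)
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for bnum in stubs:
            handle.write(bnum + "\n")
    return len(stubs)
```

Deduplicating before the load keeps the same record from being matched twice by the load table.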

  13. THE SAVINE SAGA: cleaning & maintaining catalog records

  14. Savine Digital Library home:

  15. Local Millennium Record • OCLC Master Record

  16. +3600 local records +3600 OCLC records became

  17. • Initial list of catalog bnums for Savine records (but for print only… oops) • New URL for each bnum • New URLs manually identified for each bnum • List of bnums not associated with new URLs

  18. Local Record Strategy • New worksheet w/new DB info (table name = contdm) • Create review file of all bib records with 856 matching old db URL • Export data from Millennium/open in Excel… (table name = mill) Hmm… these bnums won’t match…

  19. Local Record Strategy • Copy entire bnum8 column • “Paste special > Values” back in the same place • Add 8-character bnum to mill table
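
The spreadsheet step on slides 18 and 19 can also be expressed in a few lines of Python. This sketch assumes that the exported bnum carries one extra trailing character (and sometimes a leading period) and that the 8-character form is simply its first 8 characters. The slides do not spell this out, so treat the rule and the `bnum8` helper as hypothetical.

```python
def bnum8(bnum):
    """Return the 8-character form of a bnum (assumed: "b" plus 7 digits)."""
    cleaned = bnum.strip().lstrip(".")  # tolerate an optional leading period
    return cleaned[:8]


def add_bnum8(mill_rows):
    """Return copies of the mill rows with a derived "bnum8" key added."""
    return [dict(row, bnum8=bnum8(row["bnum"])) for row in mill_rows]
```

Storing the derived value as its own column mirrors the "Paste special > Values" step: the key becomes plain data that a lookup can match on.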

  20. Local Record Strategy • VLOOKUP formula to grab new URLs from contdm table (shown: mill table with some columns hidden, and contdm table)

  21. Local Record Strategy • Create new table (name = urlmatch) • Identify pattern in missing new URLs

  22. Local Record Strategy • In mill table, clear out NEW URL column

  23. Local Record Strategy • In mill table, repopulate NEW URL with VLOOKUP from urlmatch
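
For readers who would rather script the lookup than use VLOOKUP, an exact-match lookup can be sketched with a Python dictionary. The row keys (`bnum8`, `new_url`) and the example.org URLs are placeholders, and the function is an illustration rather than the workflow actually used.

```python
def fill_new_urls(mill_rows, url_by_bnum8):
    """Attach a new URL to each mill row, like an exact-match VLOOKUP.

    Returns (matched, unmatched). Unmatched rows get new_url=None, which
    plays the role of #N/A in the spreadsheet version.
    """
    matched, unmatched = [], []
    for row in mill_rows:
        new_url = url_by_bnum8.get(row["bnum8"])
        updated = dict(row, new_url=new_url)
        if new_url:
            matched.append(updated)
        else:
            unmatched.append(updated)
    return matched, unmatched


# Placeholder data standing in for the mill and contdm tables.
mill = [{"bnum8": "b2978655"}, {"bnum8": "b3071832"}]
contdm = {"b2978655": "http://example.org/new/1"}
matched, unmatched = fill_new_urls(mill, contdm)
```

The unmatched list corresponds to the rows that needed the second pass with the urlmatch table.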

  24. Local Record Strategy • Use MarcEdit Delimited Text Translator to create “stub records”

  25. Local Record Strategy • Global update on review file of Savine records • Delete all old 856s containing |u • Load stub records with my favorite load table • New URLs added

  26. OCLC Record Strategy • Batch search OCLC#s into local OCLC save file • Validate/correct as necessary • Use MarcEdit/OCLC plugin to open local save file in MarcEdit • Copy all to new MarcEdit file • Delete old URLs, Save • Merge in new URLs from “stub” record file created w/OCLC# and new URLs • Copy merged records back into file created by plugin • Save records from plugin MarcEdit file back to local OCLC save file • Batch replace records in OCLC Connexion

  27. Other bricolage projects using my favorite load table • SpringerLink ebook records • 950s (subject module) were deleted from many records • In SpringerLink title list: DOI url, Subject module • In Millennium: bnum, DOI url • Stub records with bnum (907) and new 950 • Alexander Street Press (ASP) records released without OCLC nums • From ASP: ASP record ID, OCLC num • From Mill: bnum, ASP record ID • Stub records with bnum (907) and new 035
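
Both projects on slide 27 follow the same pattern: two lists share an identifier (DOI URL or ASP record ID), and joining them yields a bnum plus the new field value for each stub record. A minimal Python sketch of that join follows. The function name and the sample identifiers are invented for illustration.

```python
def join_for_stubs(mill_pairs, vendor_pairs):
    """Join (bnum, shared_id) pairs with (shared_id, new_value) pairs.

    Returns (stubs, missing): stubs is a list of (bnum, new_value) tuples,
    missing lists bnums whose shared_id was not found in the vendor data.
    """
    new_value_by_id = dict(vendor_pairs)
    stubs, missing = [], []
    for bnum, shared_id in mill_pairs:
        if shared_id in new_value_by_id:
            stubs.append((bnum, new_value_by_id[shared_id]))
        else:
            missing.append(bnum)
    return stubs, missing


# Invented sample values: Millennium gives bnum + vendor ID,
# the vendor gives vendor ID + OCLC number.
from_mill = [("b11111111", "ASP-001"), ("b22222222", "ASP-002")]
from_vendor = [("ASP-001", "ocm00000001")]
stubs, missing = join_for_stubs(from_mill, from_vendor)
```

Each `(bnum, new_value)` pair would become one stub record, with the bnum in 907 and the new value in 950 or 035.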

  28. Beyond the URL checker: a script to verify full-text access to ebooks

  29. Access checker: The problem addressed • Ideally, vendors would provide us with: • MARC records for ALL items to which we have full access • NO MARC record for items to which we have restricted access • Reality is not ideal. • Example: SpringerLink e-books • 250-560 new MARC records a month





  34. Access checker: Script use: input • Data sources: • Extract from MARC file pre-load using MarcEdit • Export from Millennium Create Lists post-load • URL must be final column – One URL per row • Any number of columns can be included before the URL
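
The input rule on slide 34 (any number of leading columns, URL last, one URL per row) can be illustrated with a short Python reader. This is not the access checker's own code, which is written in JRuby. The tab delimiter and the function name are assumptions.

```python
import csv
import io


def read_url_rows(text, delimiter="\t"):
    """Parse delimited text into (leading_columns, url) pairs.

    The URL is taken from the final column; rows with no URL are skipped.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for fields in reader:
        if not fields or not fields[-1].strip():
            continue  # blank line or empty URL column
        *leading, url = fields
        rows.append((leading, url.strip()))
    return rows


sample = "b1234567\tTitle A\thttp://example.org/a\nhttp://example.org/b\n\n"
parsed = read_url_rows(sample)
```

Taking the URL from the last position, rather than from a named column, is what lets the same input format serve both the pre-load MARC extract and the post-load Create Lists export.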

  35. Access checker: Script use: running the script In Windows PowerShell:

  36. Access checker: Script use: running the script In Windows PowerShell:

  37. Access checker: Script use: running the script In Windows PowerShell:

  38. Access checker: Script use: output

  39. Access checker: Other info • Looks at the “landing page” for each URL – does not download or harvest any full text content • Written in JRuby • Open source – Code available from GitHub • Instructions for use also at GitHub – I tried to write them for people not familiar with using scripts

  40. Dealing with payment data: a script to summarize PAID data from order records

  41. Payment data processor: The problem addressed • Millennium will export payment data from Create Lists of order records • BUT the format of the exported data makes it virtually unusable. • 9 payment field columns, repeated • One row in the output below had data all the way to column ST!

  42. Payment data processor: The solution • Script outputs either: • One payment per line • Payments summarized by fiscal year
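
The reshaping described on slides 41 and 42 can be sketched in Python as follows. This is an illustration of the idea, not the presenter's Ruby script. The number of leading order columns, the position of date and amount within a payment group, and a fiscal year starting July 1 (named for the year in which it ends) are all assumptions.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

GROUP_SIZE = 9  # slide 41: 9 payment field columns, repeated


def split_payments(row, n_leading):
    """Turn one wide export row into one list per non-empty payment group."""
    leading = row[:n_leading]
    rest = row[n_leading:]
    payments = []
    for start in range(0, len(rest), GROUP_SIZE):
        group = rest[start:start + GROUP_SIZE]
        if any(cell.strip() for cell in group):
            payments.append(leading + group)
    return payments


def fiscal_year(paid, start_month=7):
    """Fiscal year label for a paid date (assumed July start)."""
    return paid.year + 1 if paid.month >= start_month else paid.year


def summarize_by_fiscal_year(payments):
    """Sum (paid_date, amount) pairs into {fiscal_year: total}."""
    totals = defaultdict(Decimal)
    for paid, amount in payments:
        totals[fiscal_year(paid)] += amount
    return dict(totals)
```

`split_payments` gives the "one payment per line" output and `summarize_by_fiscal_year` gives the per-year totals. Amounts use `Decimal` so that currency sums do not pick up floating-point error.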

  43. Payment data processor: Script use: input • Exported .txt file from Millennium Create Lists

  44. Payment data processor: Script use: running the script • You can run the Ruby (.rb) script from the command line • BUT • Everyone using this at UNC just double-clicks on the .exe

  45. Payment data processor: Script use: running the script

  46. Payment data processor: Script use: running the script

  47. Payment data processor: Script use: running the script

  48. Payment data processor: Script use: output

  49. Payment data processor: Script use: running the script

  50. Payment data processor: Script use: output