
Metadata Harvesting

EuropeanaLocal Knowledge Sharing Workshop. Metadata Harvesting. Julie Verleyen Scientific Coordinator, Europeana Office. The Hague, 13 & 14 January 2009. Table Of Content. Harvesting in Europeana: workflow and requirements Best-practices Recommendations Common issues Tools / Software


Presentation Transcript


  1. EuropeanaLocal Knowledge Sharing Workshop • Metadata Harvesting • Julie Verleyen, Scientific Coordinator, Europeana Office • The Hague, 13 & 14 January 2009

  2. Table of Contents • Harvesting in Europeana: workflow and requirements • Best practices • Recommendations • Common issues • Tools / Software • Resources • Documentation

  3. Harvesting in Europeana • Determine collections to be contributed • Questionnaire

  4. Harvesting in Europeana • Obtain OAI-PMH repository parameters • Absolute minimum (enough for fully implemented, tested and documented OAI repositories): • Server base URL • Very useful to have: • Mapping between the described collection(s) and the OAI-PMH set(s) • Prefix of the metadata format preferred for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
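
The parameters on this slide map directly onto OAI-PMH request URLs. The following is a minimal sketch of how a harvester might build them; the base URL and the set name "maps" are hypothetical placeholders, and the helper name `oai_request_url` is an illustration, not part of any Europeana tool.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a base URL, a verb and its arguments."""
    params = {"verb": verb}
    # Arguments passed as None are skipped, so optional ones can be forwarded as-is.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base_url = "http://repository.example.org/oai"  # hypothetical server base URL

    # Requests a harvester can issue knowing only the base URL:
    print(oai_request_url(base_url, "Identify"))
    print(oai_request_url(base_url, "ListSets"))
    print(oai_request_url(base_url, "ListMetadataFormats"))

    # With the set mapping and the preferred metadata prefix known in advance:
    print(oai_request_url(base_url, "ListRecords",
                          metadataPrefix="oai_dc", set="maps"))
```

Knowing the set mapping and the metadata prefix up front saves the harvester from guessing them out of the ListSets and ListMetadataFormats responses.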

  5. Harvesting in Europeana • Configuration of harvester • Full harvest with ListRecords request • Records collected in XML files ≤ 10MB • Harvest stored in SVN
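
A full harvest with ListRecords can be sketched as a loop that follows resumptionTokens until the repository returns none. This is not the Europeana harvester itself, only an illustration built on the Python standard library; the function names and the way records are grouped into batches of at most 10 MB are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Download one OAI-PMH response and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest_records(base_url, metadata_prefix, set_spec=None, fetch=fetch):
    """Yield every <record> element of a full ListRecords harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        list_records = root.find(OAI_NS + "ListRecords")
        if list_records is None:
            # No ListRecords element: the repository answered with an <error>.
            error = root.find(OAI_NS + "error")
            code = error.get("code") if error is not None else "unknown"
            raise RuntimeError("ListRecords failed: " + code)
        yield from list_records.findall(OAI_NS + "record")
        token = list_records.find(OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # an absent or empty token marks the end of the list
        # Follow-up requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}


def batch_by_size(records, max_bytes=10 * 1024 * 1024):
    """Group serialized records into batches of at most max_bytes each.

    A single record larger than max_bytes still gets a batch of its own.
    """
    batch, size = [], 0
    for record in records:
        data = ET.tostring(record, encoding="utf-8")
        if batch and size + len(data) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(data)
        size += len(data)
    if batch:
        yield batch
```

Each batch could then be wrapped in a root element and written to its own XML file before being committed to version control.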

  6. Best practices: implementation • Compliance with the OAI-PMH 2.0 protocol specification http://www.openarchives.org/OAI/openarchivesprotocol.html • Follow the OAI-PMH v2 implementation guidelines for repository implementers http://www.openarchives.org/OAI/2.0/guidelines-repository.htm • Full functional tests!

  7. Best practices: OAI validation • OAI validation = your OAI repository correctly implements the OAI-PMH! • Correct response to all OAI-PMH requests: with arguments, under various error conditions, and with every OAI response valid against its XML schema, ...
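
As a rough illustration of "correct response under error conditions", the sketch below extracts the error codes from a response and lists a few deliberately faulty requests with the error code the protocol specifies for each. It is no substitute for the OAI validator; the sample response is hand-written and the identifiers in the probes are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Faulty query strings and the error code OAI-PMH 2.0 defines for each.
# The identifier and format names are hypothetical.
PROBES = {
    "verb=NoSuchVerb": "badVerb",
    "verb=ListRecords": "badArgument",  # metadataPrefix is missing
    "verb=ListRecords&resumptionToken=invalid-token": "badResumptionToken",
    "verb=ListRecords&metadataPrefix=no_such_format": "cannotDisseminateFormat",
    "verb=GetRecord&identifier=oai:example:missing&metadataPrefix=oai_dc":
        "idDoesNotExist",
}


def oai_errors(xml_bytes):
    """Return the (code, message) pairs of all <error> elements in a response."""
    root = ET.fromstring(xml_bytes)
    return [(error.get("code"), (error.text or "").strip())
            for error in root.findall(OAI_NS + "error")]


# Hand-written sample of what a repository should return for an illegal verb.
SAMPLE_BAD_VERB = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2009-01-13T10:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

if __name__ == "__main__":
    print(oai_errors(SAMPLE_BAD_VERB))
```

A functional test could send each probe to the repository and compare the returned code with the expected one.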

  8. Recommended approach to OAI validation • Follow the Open Archives Initiative protocol conformance testing • Validate your server using the validator supplied by the OAI: http://www.openarchives.org/data/registerasprovider.html • To validate without registering, tick the checkbox "only validate and do not register (you may then register later)."

  9. http://www.openarchives.org/data/registerasprovider.html

  10. #Protocol_Conformance_Testing

  11. http://www.openarchives.org/data/registerasprovider.html => bottom of the page

  12. Issues and recommendations: sets • Set = "an optional construct for grouping items for the purpose of selective harvesting."

  13. A number of obstacles related to sets: • Interpreting how a repository has organized sets and determining which sets to harvest • Issue: setName is not human-readable and/or no setDescription is provided. • Issue: Large number of sets to sort through. • Knowing when there are records that belong to no sets • Issue: Items that belong to no sets are included in the OAI repository. • Knowing when there are empty sets • Issue: Data provider exposes sets with no records.

  14. A number of obstacles related to sets: • Understanding relationships between sets • Issue: Relationships between sets are not expressed. • There is a mechanism to express relationships between hierarchical sets • But no mechanism to express relationships between overlapping sets! • The only way to know: harvest the identifiers or records, whose header information lists the sets each record belongs to
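
Working out set membership from harvested headers can be sketched as follows. The example reads the setSpec elements of each header, reports records that belong to no set, and lists pairs of sets that share records without being parent and child (hierarchical setSpecs are colon-separated in OAI-PMH). The function names and the sample identifiers are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(xml_bytes):
    """Map each setSpec to its record identifiers; also list records in no set."""
    root = ET.fromstring(xml_bytes)
    members = defaultdict(set)
    no_set = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        specs = [s.text.strip()
                 for s in header.findall(OAI_NS + "setSpec") if s.text]
        if not specs:
            no_set.append(identifier)
        for spec in specs:
            members[spec].add(identifier)
    return dict(members), no_set


def overlapping_sets(members):
    """Return pairs of sets that share records and are not parent and child."""
    pairs = []
    for a, b in combinations(sorted(members), 2):
        related = b.startswith(a + ":") or a.startswith(b + ":")
        if not related and members[a] & members[b]:
            pairs.append((a, b))
    return pairs


# Hand-written ListIdentifiers response with hypothetical identifiers.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:x:1</identifier>
      <setSpec>maps</setSpec><setSpec>photos</setSpec></header>
    <header><identifier>oai:x:2</identifier><setSpec>maps</setSpec></header>
    <header><identifier>oai:x:3</identifier></header>
    <header><identifier>oai:x:4</identifier><setSpec>maps:old</setSpec></header>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    members, no_set = set_membership(SAMPLE)
    print(overlapping_sets(members))  # pairs of overlapping sets
    print(no_set)                     # records that belong to no set
```

Comparing the keys of `members` with the setSpecs returned by ListSets would additionally reveal empty sets.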

  15. A number of obstacles related to sets: • Knowing how many records there are within a set before harvesting • Issue: The number of records within a set is not stated, although it can be expressed via the completeListSize attribute of a resumptionToken or within the set description. • Knowing when a set structure has been substantially changed • Issue: Changes in a set structure have not been communicated
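
When a repository does supply them, the optional resumptionToken attributes can be read as in the sketch below. The attribute names completeListSize, cursor and expirationDate come from the OAI-PMH 2.0 specification; the function name and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def token_info(xml_bytes):
    """Return the resumptionToken and its optional attributes, or None."""
    root = ET.fromstring(xml_bytes)
    token = next(root.iter(OAI_NS + "resumptionToken"), None)
    if token is None:
        return None
    size = token.get("completeListSize")
    cursor = token.get("cursor")
    return {
        "token": (token.text or "").strip(),
        # All three attributes are optional, so each may be None.
        "completeListSize": int(size) if size is not None else None,
        "cursor": int(cursor) if cursor is not None else None,
        "expirationDate": token.get("expirationDate"),
    }


# Hand-written sample response with made-up values.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:x:1</identifier></header>
    <resumptionToken completeListSize="6532" cursor="0"
        expirationDate="2009-01-14T12:00:00Z">abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(token_info(SAMPLE))
```

The expirationDate value also tells the harvester how long it has to issue the next request before the token becomes invalid.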

  16. Sets: recommendations • No single best practice for the organization of sets. • Realistically: data providers organize sets in a way which best meets the needs of their primary service provider and can be easily done within their own internal workflows. • Useful to organize the metadata items into sets according to the collections of resources they represent. • Concept of collections varies and not completely clear in Europeana. • Useful for harvester to understand notion of collection for data providers

  17. Basic requirements • Repository implementation follows OAI-PMH v2.0 and has been tested • Inform the person responsible for harvesting at Europeana of any repository changes / maintenance • No regular harvesting schema determined yet • "SLA" between data providers and harvesters

  18. Common issues • Unavailability / unreliability of repository server • Implementation of OAI-PMH v2 incomplete • resumptionToken not supported • Only ListIdentifiers • XML syntax errors • Character encoding errors • Short lifetime of resumptionToken
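
Two of these issues, XML syntax errors and character encoding errors, can be caught on the raw response before any further processing. The sketch below assumes UTF-8, which OAI-PMH 2.0 requires for responses; the function name and the wording of the messages are illustrative.

```python
import xml.etree.ElementTree as ET


def check_response(raw):
    """Return a list of problems found in a raw OAI-PMH response (bytes)."""
    problems = []
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append("not valid UTF-8 at byte %d" % exc.start)
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append("not well-formed XML: %s" % exc)
    return problems


if __name__ == "__main__":
    print(check_response(b"<a>caf\xc3\xa9</a>"))  # UTF-8 encoded, well-formed
    print(check_response(b"<a>caf\xe9</a>"))      # Latin-1 byte in a UTF-8 document
    print(check_response(b"<a><b></a>"))          # mismatched tags
```

Running such a check on every harvested page makes it easier to report the exact response that failed back to the data provider.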

  19. Tools / Software TEL/Europeana OAI-PMH Harvester – Offline documentation • Harvester • Java standalone application with GUI • Multiple harvesting jobs • Resuming unfinished jobs • Logging • No scheduling, No configuration interface

  20. Tools / Software REPOX - http://repox.ist.utl.pt/ • Repository + Harvester • Java standalone application with web GUI • Multiple harvesting jobs, Scheduler • Statistics • Management of XML metadata repository • Versioning and identification of records • Different metadata format • User interface to create metadata crosswalks: Schema mapper

  21. Tools / Software OAIcat from OCLC - http://www.oclc.org/research/software/oai/cat.htm • Framework conforming to the OAI-PMH v2.0 • Repository + Harvesting • Java web application • Scheduling, logging • Limited scalability (~2M records)

  22. Tools / Software (TELplus D2.1) Other implementations in different languages to plug into a Library Management System: • PHP: OAIbiblio, a data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database • Perl: Celestial, an OAI aggregator/cache application that imports OAI metadata from OAI-compliant repositories (versions 1.0, 1.1 and 2.0) and re-exposes that metadata through either an aggregated or a per-repository OAI 2.0-compliant interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server • Ruby: ruby-oai, which includes a client library, a server/provider library and an interactive harvesting shell • Python: pyoai, a package that enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH-compliant servers

  23. Tools / Software • ESE XML validation schemas developed by partners

  24. Resources • The Open Archives Initiative Protocol for Metadata Harvesting v2.0 http://www.openarchives.org/OAI/openarchivesprotocol.html • TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages • Protocol overview and description of main concepts • OAI-PMH implementation in libraries • References

  25. Resources • Wiki "Best Practices for OAI Data Provider Implementations and Shareable Metadata": excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc. http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page

  26. Documentation in Europeana context • Requirements: • Europeana OAI-PMH Harvesting • Europeana OAI-PMH Repositories • ESE XML validation schema • Europeana OAI-PMH data providers registry & forum/mailing list • Local systems • OAI-PMH repository solution • Contact

  27. Thank you • Questions? Remarks? • Julie.Verleyen@kb.nl
