
Introduction to the Open Archives Initiative Protocol for Metadata Harvesting

Introduction to the Open Archives Initiative Protocol for Metadata Harvesting. Timothy W. Cole ( t-cole3@uiuc.edu ), Mathematics Librarian William H. Mischo ( w-mischo@uiuc.edu ), Engineering Librarian Thomas G. Habing ( thabing@uiuc.edu ), Research Programmer

Presentation Transcript


  1. Introduction to the Open Archives Initiative Protocol for Metadata Harvesting Timothy W. Cole (t-cole3@uiuc.edu), Mathematics Librarian William H. Mischo (w-mischo@uiuc.edu), Engineering Librarian Thomas G. Habing (thabing@uiuc.edu), Research Programmer Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign Presented 27 May 2003 in conjunction with JCDL 2003, Houston, TX http://dli.grainger.uiuc.edu/Publications/TWCole/JCDL-OAI

  2. Today’s Agenda (Part 1) • Overview of OAI (Mischo) • What it is, where it comes from, what it’s used for • Relation to HTTP, XML, Dublin Core, & Z39.50 • Basic Concepts & Definitions (Cole) • OAI verbs • OAI transactions • Protocol details & architecture options • Illustrations • Implementation Guidelines for Repositories (Cole) • Tools & program layout options • Metadata generation / mapping • Optional protocol elements • Error handling & deleted records JCDL 2003OAI Intro / t-cole3@uiuc.edu

  3. Today’s Agenda (Part 2) • Tools, testing, & problems (Cole) • XML & OAI validation tools • Common problems • Implementation Guidelines for Harvesters (Mischo) • How to harvest • Harvesting policies & strategies • Harvester Technologies • Advanced topics (Cole) • Communities • OAI Static Repository • OAI & SOAP • Where do you go from here?

  4. OAI as a tool • All about moving metadata around • Designed to be a building block, usable by many different communities • Can facilitate (in some cases enable) services & functions • Assumes widely distributed content, but centralized indexing(!) & services • Build once, use for many applications • Focus of OAI is interoperability

  5. Metadata vs. Information Resources • Resource refers to information objects or digital representations of information objects • Metadata item is a collection of properties about a resource (e.g. title, author, etc.) • Metadata record is a metadata item expressed in a specific syntax according to an XSD • OAI focuses on metadata, with the implicit understanding that metadata contains useful links to the source information object(s)

  6. OAI Antecedents • Call to other E-Print archives (July 1999) Paul Ginsparg, Rick Luce, & Herbert Van de Sompel: “…mobilize core group to work towards achieving a universal service for author self-archived scholarly literature.” • Santa Fe Mtgs. (Oct. 1999 & June 2000) • OAI-PMH version history: • First Alpha Release, Sept. 2000 • 1.0 (Beta) Release January 2001 • 1.1 (Beta 2) Release July 2001 • 2.0 (Production) Release June 2002

  7. Original OAI Organization • OAI Executive: • Carl Lagoze & Herbert Van de Sompel • OAI Steering Committee: • Co-Chairs: Dan Greenstein, Cliff Lynch • OAI Technical Committee • Funded by NSF, DLF & CNI • Seeks to be user community driven • Adopters (selective list): • NSDL, NDLTD, Open Archives Forum (EU), JISC/DNER (UK) • E-Prints.Org, DLXS, DSpace, ContentDM, ENCompass

  8. OAI Protocol for Metadata Harvesting • Harvesting approach to interoperability at metadata level • Divides world into Metadata Providers & Service Providers • Builds on HTTP, XML, & Dublin Core http://www.openarchives.org/

  9. Harvesting/Federation vs. Broadcast • Competing approaches to interoperability • Distributed/Broadcast searching: search and discovery over remote services and data • Harvesting is when data/metadata is transferred from the remote source to the destination where the services are located (e.g. Union catalogs) • OAI designed to make it easy for providers • Low barrier design • OAI focuses on harvesting

  10. Data and Service Providers • Data Providers (Repositories) refer to entities who possess resources & metadata and are willing to share metadata with others via well-defined OAI protocols • Service Providers (Harvesters) are entities who harvest metadata from Data Providers in order to supply higher-level services to users (e.g. search & discovery) • OAI uses these denotations for its client/server model (data=server, service=client)

  11. Reliance on HTTP & XML • OAI-PMH is a REpresentational State Transfer (REST) protocol (unlike RPC, SOAP) • OAI requests and responses are sent via the HTTP protocol • OAI Requests are encoded as HTTP GET or POST operations • OAI Responses are valid XML documents
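As a rough illustration of "requests are encoded as HTTP GET or POST operations", the sketch below builds a GET URL from a verb and its arguments using only the Python standard library. The function name `build_oai_request` is hypothetical and the base URL is the sample one used on the verb slides; the same encoded string could serve as the body of a POST.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return a GET URL for an OAI-PMH request (hypothetical helper).

    urlencode() percent-encodes reserved characters in the values,
    so the ':' in an identifier becomes %3A.
    """
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


url = build_oai_request(
    "http://www.anarchive.org/cgi-bin/OAI",  # sample base URL from the slides
    "GetRecord",
    identifier="oai:test:123",
    metadataPrefix="oai_dc",
)
print(url)
```

Decoding the query string with `parse_qs(urlsplit(url).query)` recovers the original argument values, which is a convenient way to check the encoding.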

  12. XML Namespaces and Schema • Consistency and data “quality” are ensured by using XML Schema Definitions (XSD) for all responses • XML Namespaces are used where necessary to clearly define which parts of the responses are actual metadata and which support the Metadata Harvesting Protocol

  13. OAI-PMH Use of Dublin Core • DC is OAI’s lowest common denominator • OAI supports & encourages use of other, community-driven metadata schemas • Typically, metadata provider stores metadata in ‘best’ schema as dictated by material & resources • Crosswalk (semantic mapping) to simpler schemas • Semantic mapping at metadata delivery (rather than at time of search) • As with Z39.50, can’t search for what’s not there

  14. As Compared to Z39.50

  15. What OAI Is Not • Not search • Not database • Not metadata • Not OAIS

  16. What OAI is good for • Where content is widely distributed, in different kinds of non-Z39.50 enabled locations • Metadata provider more lightweight than Z39.50 • Metadata provider scales well • Service provider scales according to search capability • Metadata is sufficient for services desired • Normalization, deduplication, augmentation desired • Not mutually exclusive • Portals can use both Z39.50 & OAI

  17. The NSDL metadata repository • The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL. • [Diagram labels: Collections, Metadata repository, Services, Users] • From “The NSDL Metadata Strategy,” a presentation by William Y. Arms and Diane I. Hillman. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt

  18. NSDL Metadata strategy • Support eight standard formats • Collect all existing metadata in these formats • Provide crosswalks to Dublin Core • Expose records in the metadata repository for service providers to harvest • Concentrate human effort on collection-level metadata • Use automatic generation to augment item-level metadata • From “The NSDL Metadata Strategy,” a presentation by William Y. Arms and Diane I. Hillman. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt

  19. IMLS Digital Collections & Content • Build a registry of all National Leadership Grant collections with digital content. • Assist and guide NLG projects in making item-level metadata sharable using OAI. • Build a repository and search & discovery tools for integrated access to the content of NLG collections (unique metadata schema?). • Research best practices for sharing metadata about diverse digital content and for supporting the interests of diverse user communities.

  20. http://imlsdcc.grainger.uiuc.edu/

  21. Open Language Archives Community (OLAC) • Supports the “OLAC Protocol for Metadata Harvesting” – based on OAI • Includes metadata extensions to DC • Supports Qualified DC refinements and encodings and unique OLAC attribute “code” to hold restricted element values • Also supports “OLAC Static Repository Gateway” – based on OAI Static Repository (still alpha) • Developing an “OLAC Repository Editor” for creating a metadata provider

  22. Basic Concepts & Definitions • OAI verbs • OAI transactions • Protocol Details • Architecture Options • Illustrations

  23. How OAI Works • OAI “verbs”: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord • [Diagram: the Service Provider (harvester) sends an OAI request (an OAI verb) over HTTP to the Metadata Provider (repository), which returns an HTTP response containing valid XML]

  24. Identify • Purpose • Return general information about the archive and its policies (e.g., datestamp granularity) • Parameters • None • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=Identify

  25. ListSets • Purpose • Provide a listing of sets in which records may be organized (may be hierarchical, overlapping, or flat) • Parameters • None • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListSets

  26. ListMetadataFormats • Purpose • List metadata formats supported by the archive as well as their schema locations and namespaces • Parameters • identifier – for a specific record (O) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats

  27. ListIdentifiers • Purpose • List headers for all items corresponding to the specified parameters • Parameters • from – start date (O) • until – end date (O) • set – set to harvest from (O) • metadataPrefix – metadata format to list identifiers for (R) • resumptionToken – flow control mechanism (X) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListIdentifiers&metadataPrefix=oai_dc

  28. GetRecord • Purpose • Returns the metadata for a single item in the form of an OAI record • Parameters • identifier – unique id for item (R) • metadataPrefix – metadata format for the record (R) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc

  29. ListRecords • Purpose • Retrieves metadata records for multiple items • Parameters • from – start date (O) • until – end date (O) • set – set to harvest from (O) • resumptionToken – flow control mechanism (X) • metadataPrefix – metadata format (R) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
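The (R), (O), and (X) markers on the verb slides (required, optional, exclusive) can be turned into a small argument check. The sketch below is one possible way to do that, assuming the parameter lists exactly as shown on the slides; the table `ARGUMENT_RULES` and the function `validate_arguments` are hypothetical names. The error codes `badVerb` and `badArgument` are the ones defined by OAI-PMH 2.0.

```python
from typing import Dict, Optional

# Required and optional arguments per verb, as listed on the slides.
ARGUMENT_RULES = {
    "Identify": {"required": set(), "optional": set()},
    "ListSets": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
}

# Verbs whose slides mark resumptionToken as exclusive (X):
# if present, it must be the only argument besides the verb.
EXCLUSIVE = {"ListIdentifiers": "resumptionToken", "ListRecords": "resumptionToken"}


def validate_arguments(verb: str, args: Dict[str, str]) -> Optional[str]:
    """Return an OAI-PMH error code, or None when the request looks valid."""
    rules = ARGUMENT_RULES.get(verb)
    if rules is None:
        return "badVerb"
    names = set(args)
    exclusive = EXCLUSIVE.get(verb)
    if exclusive in names:
        return None if names == {exclusive} else "badArgument"
    allowed = rules["required"] | rules["optional"]
    if not rules["required"] <= names or not names <= allowed:
        return "badArgument"
    return None


print(validate_arguments("ListRecords", {"metadataPrefix": "oai_dc", "from": "2001-01-01"}))
print(validate_arguments("ListRecord", {"metadataPrefix": "oai_dc"}))
```

Note that argument names are case-sensitive here, so `metadataprefix` would be rejected as an unknown argument.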

  30. Protocol Details • OAI Transaction == An OAI request (HTTP) & corresponding OAI response (XML) • Optional: use resumptionToken & other flow control mechanisms to manage service load • Item Identifiers – Persistence & Uniqueness • Item Datestamps – Date of last metadata change; supports selective harvesting

  31. Examples of OAI Requests http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify http://publications.uu.se/portal/OAI?verb=ListSets http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01 http://www.language-archives.org/cgi-bin/olaca3.pl?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai%3Aacl.sr.language-archives.org%3AA00-1006

  32. An OAI Response <?xml version="1.0" encoding="UTF-8" ?> <OAI-PMH xmlns=… xmlns:xsi=… xsi:schemaLocation=…> <responseDate>2002-05-01T19:20:30Z</responseDate> <request verb="GetRecord" identifier="oai:arXiv:hep-th/9901001" metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request> <GetRecord> <record> ... </record> </GetRecord> </OAI-PMH>

  33. An OAI Record <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2002-02-28</datestamp> <setSpec>cs</setSpec> </header> <metadata> <oai_dc:dc xmlns…> <dc:title>Using Structural Metadata…</dc:title> … </oai_dc:dc> </metadata> <about> <provenance xmlns…> …. </provenance> </about>
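To show how a harvester might read a response like the two slides above, here is a minimal parsing sketch using Python's standard `xml.etree.ElementTree`. The sample document is a made-up, complete version of the slides' abbreviated XML: the namespace URIs are the ones published for OAI-PMH 2.0, `oai_dc`, and Dublin Core, while the title text and the helper `parse_record` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for OAI-PMH 2.0, oai_dc, and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample response, modelled on the slides.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv:cs/0112017"
           metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2002-02-28</datestamp>
        <setSpec>cs</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_record(xml_bytes: bytes) -> dict:
    """Extract header fields and DC titles from a GetRecord response."""
    root = ET.fromstring(xml_bytes)
    record = root.find(".//oai:record", NS)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }


parsed = parse_record(SAMPLE.encode("utf-8"))
print(parsed)
```

The namespace map is what lets the code tell protocol elements (`oai:header`) apart from the harvested metadata (`dc:title`).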

  34. Unique Identifiers • Each item must have a unique identifier • Identifiers must follow rules for valid URIs • Example: • oai:<archiveId>:<recordId> • oai:etd.vt.edu:etd-1234567890 • Each identifier must resolve to a single item and always to the same item • Can’t reuse OAI item identifiers
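The `oai:<archiveId>:<recordId>` pattern from the slide can be checked with a few lines of code. This is only a sketch of that one pattern, with a hypothetical helper name; it does not validate general URI syntax.

```python
from typing import Tuple


def split_oai_identifier(identifier: str) -> Tuple[str, str]:
    """Split 'oai:<archiveId>:<recordId>' into (archiveId, recordId).

    Only the first two colons are treated as separators, so a recordId
    may itself contain ':' or '/'.
    """
    scheme, _, rest = identifier.partition(":")
    archive_id, _, record_id = rest.partition(":")
    if scheme != "oai" or not archive_id or not record_id:
        raise ValueError(f"not an oai:<archiveId>:<recordId> identifier: {identifier!r}")
    return archive_id, record_id


print(split_oai_identifier("oai:etd.vt.edu:etd-1234567890"))
```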

  35. Datestamps • Needed for every OAI record to support incremental harvesting • Must be updated when addition or modification or deletion made in order to ensure changes are correctly propagated to harvesters • Different from dates within the metadata – OAI datestamp is used only for harvesting • Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (must be GMT timezone)
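The two datestamp granularities can be parsed and produced with the standard `datetime` module. The sketch below assumes day-level stamps are interpreted as midnight UTC, which is a choice made here for comparison purposes rather than something the slide states; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DAY = "%Y-%m-%d"                 # YYYY-MM-DD
SECONDS = "%Y-%m-%dT%H:%M:%SZ"   # YYYY-MM-DDThh:mm:ssZ


def parse_datestamp(value: str) -> datetime:
    """Parse either granularity into a timezone-aware UTC datetime."""
    fmt = SECONDS if "T" in value else DAY
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def format_datestamp(moment: datetime, granularity: str = SECONDS) -> str:
    """Convert an aware datetime to UTC and render it as a datestamp."""
    return moment.astimezone(timezone.utc).strftime(granularity)


# A local time two hours ahead of UTC is normalized before formatting.
local = datetime(2002, 5, 1, 21, 20, 30, tzinfo=timezone(timedelta(hours=2)))
print(format_datestamp(local))

# Incremental harvesting: was the record changed after the last harvest?
last_harvest = parse_datestamp("2002-02-28")
print(parse_datestamp("2002-05-01T19:20:30Z") > last_harvest)
```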

  36. OAI Provider Architectures • [Diagram: metadata sources (HTML <meta>, XML, DBMS) hold descriptive metadata and OAI administrative metadata; an OAI application (CGI, ASP, PHP, etc.) behind a web server (HTTP) serves them to OAI harvesters]

  37. Architecture Options • Metadata items in database • If individual metadata items are stored in a database • Usually requires programmatic mapping to DC • Metadata items as XML files • If individual metadata items already in XML, can do without the database component, or can use database to cache and/or hold OAI administrative metadata • May use XSLT stylesheets to extract / map metadata • Metadata elements in HTML files • As with XML file system options • “Static” repository option (more later)

  38. Technology Options • WWW Server (e.g., Apache, MS IIS) • Protocol may be implemented in many forms • CGI Script (Perl, C++, Java) • Java Servlet • PHP • Metadata (e.g. database) access mechanism required • See www.openarchives.org for list of publicly available software templates • See www.SourceForge.Net for UIUC OAI tools

  39. Illustrations • Identify • ListSets • ListMetadataFormats • ListIdentifiers • GetRecord oai_dc • GetRecord olac • ListRecords • Error

  40. *** 15 Minute Break ***

  41. Implementation Guidelines for Repositories • Tools Required • Basic program layout (incl. object-oriented approaches) • Optional container elements • Metadata generation / mapping, data cleaning • Sets • resumptionToken, flow control, load-balancing • Denial-of-service prevention • Error handling • Deleted metadata records

  42. Typical Pre-Requisites • Metadata & Web server • Code templates if available (available for many languages) • Basic Web programming environment • XML parsers (for non-trivial encoding) • Database access libraries/drivers (e.g. ODBC, JDBC)

  43. Basic program layout • parse WWW request to extract parameters • if (verb = 'Identify') validate arguments; ProcessIdentify; • else if (verb = 'ListMetadataFormats') validate arguments; ProcessListMetadataFormats; • else if (verb = 'ListSets') validate arguments; ProcessListSets; • else if (verb = 'GetRecord') validate arguments; ProcessGetRecord; • else if (verb = 'ListIdentifiers') validate arguments; ProcessListIdentifiers; • else if (verb = 'ListRecords') validate arguments; ProcessListRecords; • else ReportError('badVerb'); • Re-usable subroutines extract / clean up / transform metadata, generate standard error messages, etc.
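The if/else chain in the layout above maps naturally onto a dispatch table. The following is a minimal sketch, not the presenters' implementation: the handlers are placeholders that return an empty element, and `handle_request` and `report_error` are hypothetical names. The `<error code="badVerb">` form follows the OAI-PMH 2.0 error element.

```python
from typing import Callable, Dict
from xml.sax.saxutils import escape

VERBS = ["Identify", "ListMetadataFormats", "ListSets",
         "GetRecord", "ListIdentifiers", "ListRecords"]


def report_error(code: str, message: str) -> str:
    """Build an OAI-PMH <error> element with an XML-escaped message."""
    return f'<error code="{code}">{escape(message)}</error>'


# Placeholder handlers: a real repository would validate the arguments
# and generate the full response body for each verb.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    verb: (lambda args, v=verb: f"<{v}/>") for verb in VERBS
}


def handle_request(params: Dict[str, str]) -> str:
    """Dispatch on the 'verb' parameter; unknown verbs yield badVerb."""
    verb = params.get("verb")
    handler = HANDLERS.get(verb)
    if handler is None:
        return report_error("badVerb", f"Illegal or missing verb: {verb}")
    arguments = {k: v for k, v in params.items() if k != "verb"}
    return handler(arguments)


print(handle_request({"verb": "Identify"}))
print(handle_request({"verb": "Bogus"}))
```

A table keeps the verb list in one place, so adding argument validation per verb means wrapping each handler rather than editing a long conditional.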

  44. Object-Oriented Approaches • Cleaner separation of protocol, database access and metadata generation • Example approaches • Each service request is handled by an object • Simpler incremental development • Protocol, Database and Metadata are objects • Greater portability of code • Inheritance from a basic OAI data provider

  45. Provider Performance Issues • Database design impacts performance • Work required to map to DC • Use of resumptionTokens is one way to improve performance • Fetch only records needed to satisfy current request • Queries only retrieve needed records • resumptionTokens should retain state information for best performance and for idempotency

  46. Optional Container Elements • <Identify><description> • Additional information about repository • oai-identifier, eprints, friends, branding, other… • <ListSets><setDescription> • Additional information describing a set • <metadata> • Other metadata besides Dublin Core • rfc1807, marc21, oai_marc, mods, other… • <about> • Meta-metadata, e.g. record-level rights

  47. Metadata Generation / Mapping • Approaches • Map from source to each metadata format • Use multiple crosswalks (may use XSLT) to transform to multiple metadata formats • Example mapping (source, e.g. DB = dc = rfc1807): • name = title = title • author = creator = author
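The "map from source to each metadata format" approach can be sketched as one lookup table per target format. The field names come from the slide's example mapping; the dictionary keys `oai_dc` and `rfc1807` are used here as metadataPrefix values, and the sample record is invented for illustration.

```python
from typing import Dict

# One crosswalk per target format: source field -> target element.
CROSSWALKS: Dict[str, Dict[str, str]] = {
    "oai_dc": {"name": "title", "author": "creator"},
    "rfc1807": {"name": "title", "author": "author"},
}


def crosswalk(source_record: Dict[str, str], metadata_prefix: str) -> Dict[str, str]:
    """Rename source fields to the target format; unmapped fields are dropped."""
    mapping = CROSSWALKS[metadata_prefix]
    return {mapping[field]: value
            for field, value in source_record.items()
            if field in mapping}


# Invented sample row, e.g. as read from a database.
row = {"name": "An example title", "author": "A. Author", "shelf": "B12"}
print(crosswalk(row, "oai_dc"))
print(crosswalk(row, "rfc1807"))
```

Dropping unmapped fields illustrates the point made earlier in the deck: once metadata is crosswalked to a simpler schema, a harvester cannot search for what is not there.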

  48. Data Cleaning • Escape special XML characters (<, >, &, ") • Convert to UTF-8 version of Unicode • Convert entity references (e.g., &copy;) • Remove extraneous whitespace • URLs • /?#=&:;+ must be encoded as escape sequences
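The cleaning steps above can be chained with standard-library helpers. This is a sketch under the assumption that source values may contain HTML entity references and stray whitespace; the function names are hypothetical, and the order of steps (decode entities first, escape last) is a design choice so that nothing is escaped twice.

```python
import html
from urllib.parse import quote
from xml.sax.saxutils import escape


def clean_text(raw: str) -> str:
    """Prepare a metadata value for inclusion in an XML response."""
    text = html.unescape(raw)            # convert entity references such as &copy;
    text = " ".join(text.split())        # remove extraneous whitespace
    return escape(text, {'"': "&quot;"})  # escape <, >, & and the double quote


def encode_url_value(value: str) -> str:
    """Percent-encode reserved characters (/?#=&:;+) in a URL argument value."""
    return quote(value, safe="")


cleaned = clean_text('  Smith &amp; Jones  &copy; 2003 <draft>\n')
print(cleaned)
print(cleaned.encode("utf-8"))           # serialize as UTF-8 bytes
print(encode_url_value("oai:arXiv:hep-th/9901001"))
```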

  49. Sets – another option for selective harvesting • Optional: no well-defined semantics – depends completely on local data providers • Must provide setSpec & setName, may provide setDescription, for each Set in repository • Sets may be hierarchical (use “:”); may overlap • Allows for harvesting of sub-collections • May be pre-defined by arrangement between data providers and service providers • E.g. Subject areas, years, author names (but must be pre-defined – for ListSets) • Not a substitute for searching!
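Hierarchical setSpecs that use ":" as a separator allow a simple membership test on the repository side. The sketch below assumes that a request for a parent set should also match records in its sub-sets, which is how hierarchical sets are commonly treated but is stated here as an assumption; the function name and set names are made up.

```python
from typing import Iterable


def in_set(record_set_specs: Iterable[str], requested: str) -> bool:
    """True if any of the record's setSpecs equals the requested set
    or lies below it in the ':'-separated hierarchy."""
    return any(
        spec == requested or spec.startswith(requested + ":")
        for spec in record_set_specs
    )


# A record may belong to several (overlapping) sets.
record_sets = ["physics:hep", "year:2002"]
print(in_set(record_sets, "physics"))
print(in_set(record_sets, "phys"))
```

Matching on `requested + ":"` rather than a bare prefix avoids treating `phys` as a parent of `physics:hep`.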

  50. resumptionToken, flow control, load-balancing • Incomplete response: resumptionToken can be used to return partial results – the client is issued with a token which may be presented to the server to receive more results • resumptionToken embeds state information, allowing OAI to be stateless even for incomplete response model • HTTP 503 “retry-after” mechanism can be used to support server-side delaying of a client’s request • HTTP 302 / 303 can be used for load balancing • HTTP 4xx can be used to deny a harvester
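One way to let a resumptionToken "embed state information" is to serialize the paging state into the token itself, so the server keeps nothing between requests. The sketch below is an illustration only: the token format (URL-safe base64 of JSON), the page size, and the function names are all assumptions, since the protocol treats the token as opaque.

```python
import base64
import json
from typing import List, Optional, Tuple


def encode_token(state: dict) -> str:
    """Pack paging state into an opaque, URL-safe token."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> dict:
    """Recover the paging state from a token made by encode_token()."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def list_page(identifiers: List[str], metadata_prefix: str,
              token: Optional[str] = None,
              page_size: int = 2) -> Tuple[List[str], Optional[str]]:
    """Return one page of identifiers plus a token for the next page (or None)."""
    state = decode_token(token) if token else {"metadataPrefix": metadata_prefix, "offset": 0}
    start = state["offset"]
    page = identifiers[start:start + page_size]
    next_offset = start + page_size
    next_token = (encode_token({**state, "offset": next_offset})
                  if next_offset < len(identifiers) else None)
    return page, next_token


ids = ["oai:test:1", "oai:test:2", "oai:test:3"]
first_page, token = list_page(ids, "oai_dc")
second_page, last_token = list_page(ids, "oai_dc", token)
print(first_page, second_page, last_token)
```

Because the token carries the offset, presenting the same token twice returns the same page, which is the idempotency property mentioned on the performance slide. A production token would also need to guard against tampering and against changes to the underlying record list.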
