
Metadata Harvesting




  1. Metadata Harvesting Interoperable digital collections

  2. Two basic approaches • One service provider with access to resources stored in multiple locations • Information about the resources located at the service provider. • Services developed to use the information to provide connections to resources at multiple locations • Distributed services • Information kept with the resources • Services interact with multiple collection sites

  3. Two protocols • Z39.50 • Developed before the web • Protocol for communicating with collection holders in order to provide services. • Open Archives Initiative • Recent innovation • Central service provider gathers information from collection holders

  4. Z39.50 - briefly • Information Retrieval Service Definition and Protocol Specifications for Library Applications • Initially developed over the OSI network standards • Protocol for information exchange • Free the information seeker from the need to know the details of the target database configuration • Each site provides services • Each service queries remote sites for needed information • Information requests mapped to database queries at the collection site. • Some inconsistency in the interpretation of queries.

  5. Distributed Resources, Multiple Services • Approach 1: One service provider gathers information about data and uses it to provide services • [Diagram: several data providers feeding one service provider that offers search, browse, compare, etc.]

  6. Distributed data and services • Approach 2: Each system is both a data repository and a service provider. Services query other data providers as needed. • [Diagram: sites that each offer search, browse, and compare services]

  7. Open Archives Initiative (OAI) • Web-based • Uses HTTP to communicate between sites • Centralized server • Services provided from a site that has already gathered the information it needs for those services from a distributed collection of sites.

  8. OAI Compared to Z39.50 Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt

  9. Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH • OAI-PMH defines an interface between the Harvester and any number of Repositories • The Harvester (service provider) sends an HTTP request carrying an OAI verb • The Repository (metadata provider) returns an HTTP response in XML • The repository interface may be implemented as CGI, ASP, PHP, or other

  10. OAI components Service Providers and Data Providers Requests and Responses http://www.oaforum.org/tutorial/english/page3.htm#section3

  11. Records • Metadata of a resource. • Three parts • Header (required) • Identifier (required: 1 only) • Datestamp (required: 1 only) • setSpec elements (optional: 0, 1, or more) • Status attribute for deleted items • Metadata (required) • XML encoded metadata with root tag, namespace • Repositories must support Dublin Core, other formats optional • “About” statement (optional) • Rights statements • Provenance statements
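The header part of a record can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `RecordHeader` class, the `parse_header` function, and the sample record are hypothetical names and data made up for this example, and only the standard library is used.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's "{uri}tag" notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


@dataclass
class RecordHeader:
    """Hypothetical container mirroring the header fields listed above."""
    identifier: Optional[str]
    datestamp: Optional[str]
    set_specs: List[str] = field(default_factory=list)
    deleted: bool = False


def parse_header(record_xml: str) -> RecordHeader:
    """Extract identifier, datestamp, setSpecs and deleted status from a record."""
    record = ET.fromstring(record_xml)
    header = record.find(OAI_NS + "header")
    if header is None:
        raise ValueError("record has no header element")
    return RecordHeader(
        identifier=header.findtext(OAI_NS + "identifier"),
        datestamp=header.findtext(OAI_NS + "datestamp"),
        set_specs=[e.text for e in header.findall(OAI_NS + "setSpec")],
        deleted=header.get("status") == "deleted",
    )


# Made-up sample record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:etd.vt.edu:etd-1234567890</identifier>
    <datestamp>2002-12-14</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata/>
</record>"""

header = parse_header(SAMPLE_RECORD)
print(header)
```

The metadata and "about" parts are left out here because their content depends on the metadata format in use.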

  12. Identifiers • Globally unique identifier • Valid URI • Examples • oai:<archiveId>:<recordId> • oai:etd.vt.edu:etd-1234567890 • Must resolve to one item • No duplicates • No reuse of previously used identifiers
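A small Python sketch of splitting an identifier of the form oai:<archiveId>:<recordId> into its parts. The names `OaiIdentifier` and `parse_oai_identifier` are hypothetical, and the sketch assumes that any further colons belong to the record part.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    archive_id: str
    record_id: str


def parse_oai_identifier(identifier: str) -> OaiIdentifier:
    """Split 'oai:<archiveId>:<recordId>' into its two components."""
    scheme, sep, rest = identifier.partition(":")
    if scheme != "oai" or not sep:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    # Assumption: everything after the second colon is the record id.
    archive_id, sep, record_id = rest.partition(":")
    if not sep or not archive_id or not record_id:
        raise ValueError(f"missing archive or record part: {identifier!r}")
    return OaiIdentifier(archive_id, record_id)


parsed = parse_oai_identifier("oai:etd.vt.edu:etd-1234567890")
print(parsed.archive_id, parsed.record_id)
```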

  13. Datestamps • Date of last modification of a record • Used only for harvesting (meta metadata?) • Mandatory for each item in the repository • Two levels of granularity possible • YYYY-MM-DD • YYYY-MM-DDThh:mm:ssZ • T separates date and time; the trailing Z marks the time as UTC (GMT), which is required • Allows harvesting incrementally -- get only what is new since last visit • Accessed by arguments from and until
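Incremental harvesting with the from argument might look like the following Python sketch. The function names `format_datestamp` and `incremental_query` are hypothetical, and the choice of ListRecords with the oai_dc prefix is an assumption made for the example.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def format_datestamp(moment: datetime, granularity: str = "YYYY-MM-DD") -> str:
    """Render a timezone-aware datetime in one of the two granularities, in UTC."""
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unsupported granularity: {granularity}")


def incremental_query(last_harvest: datetime, granularity: str = "YYYY-MM-DD") -> str:
    """Build a query string that asks only for records changed since last_harvest."""
    return urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",  # assumed format for this example
        "from": format_datestamp(last_harvest, granularity),
    })


last_visit = datetime(2006, 10, 17, 1, 37, 44, tzinfo=timezone.utc)
print(incremental_query(last_visit))
```

A harvester would pick the granularity that the repository reports in its Identify response.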

  14. The OAI-PMH verbs • Each requests a specific response from a data repository

  15. Identify • Function: Description of the archive • Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify • Parameters: none • Errors/exceptions: • badArgument (there should not be any) • Response format:

      Element             Example                             Ordinality ‡
      repositoryName      My Archive                          1
      baseURL             http://archive.org/oai              1
      protocolVersion     2.0                                 1
      earliestDatestamp   1999-01-01                          1
      deletedRecord       no, transient, persistent           1
      granularity         YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ    1
      adminEmail          oai-admin@archive.org               +
      compression         deflate, compress                   *
      description         oai-identifier, eprints, friends, … *

      ‡ Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more

  16. Actual response from http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify

      <OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
        <responseDate>2006-10-17T01:37:44Z</responseDate>
        <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
        <Identify>
          <repositoryName>OLAC Aggregator</repositoryName>
          <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
          <protocolVersion>2.0</protocolVersion>
          <adminEmail>mailto:haejoong@ldc.upenn.edu</adminEmail>
          <earliestDatestamp>2002-12-14</earliestDatestamp>
          <deletedRecord>no</deletedRecord>
          <granularity>YYYY-MM-DD</granularity>
          <!-- maybe later <compression>identity</compression> -->

      Continued

  17.
          <description>
            <oai-identifier xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
              <scheme>oai</scheme>
              <repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
              <delimiter>:</delimiter>
              <sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
            </oai-identifier>
          </description>

      Continued

  18.
          <description>
            <olac-archive type="institutional" xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/olac-archive http://www.language-archives.org/OLAC/1.0/olac-archive.xsd">
              <archiveURL>http://www.language-archives.org:8082/dp9/</archiveURL>
              <curator>Steven Bird &amp; Gary Simons</curator>
              <curatorTitle>Coordinators</curatorTitle>
              <curatorEmail>mailto:olac-admin@language-archives.org</curatorEmail>
              <institution>Open Language Archives Community</institution>
              <institutionURL>http://www.language-archives.org/</institutionURL>
              <shortLocation>Philadelphia, U.S.A.</shortLocation>
              <location/>
              <synopsis>This repository contains all records from OLAC-registered archives. It is intended to be used by services which do not want to harvest individual OLAC archives.</synopsis>
              <access>Metadata may be used only subject to the access permissions given by the individual archives.</access>
            </olac-archive>
          </description>
        </Identify>
      </OAI-PMH>
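An Identify response like the one above could be read with Python's standard library roughly as follows. This is a sketch under assumptions: `parse_identify` is a hypothetical helper, and the embedded sample is an abridged copy of the response with an explicit xmlns declaration added, since the slide rendering does not show the namespace attributes.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abridged sample; the xmlns attribute is an assumption about the raw response.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>mailto:haejoong@ldc.upenn.edu</adminEmail>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text: str) -> dict:
    """Collect the mandatory Identify elements into a plain dictionary."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", NS)
    if identify is None:
        raise ValueError("response contains no Identify element")

    def text(name: str) -> str:
        return identify.findtext(f"oai:{name}", default="", namespaces=NS)

    return {
        "repositoryName": text("repositoryName"),
        "baseURL": text("baseURL"),
        "protocolVersion": text("protocolVersion"),
        "earliestDatestamp": text("earliestDatestamp"),
        "deletedRecord": text("deletedRecord"),
        "granularity": text("granularity"),
        # adminEmail may occur more than once, so it is kept as a list.
        "adminEmail": [e.text for e in identify.findall("oai:adminEmail", NS)],
    }


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["granularity"])
```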

  19. ListMetadataFormats • Function: retrieve available metadata formats from archive • Example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218 • Parameters: identifier (optional) • Errors/exceptions: • badArgument • idDoesNotExist • noMetadataFormats

  20. Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats

      <OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
        <responseDate>2006-10-17T01:58:06Z</responseDate>
        <request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
        <ListMetadataFormats>
          <metadataFormat>
            <metadataPrefix>olac</metadataPrefix>
            <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
            <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
          </metadataFormat>
          <metadataFormat>
            <metadataPrefix>olac_display</metadataPrefix>
            <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
            <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
          </metadataFormat>
          <metadataFormat>
            <metadataPrefix>oai_dc</metadataPrefix>
            <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
            <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
          </metadataFormat>
        </ListMetadataFormats>
      </OAI-PMH>

  21. ListSets • Function: retrieve set structure of a repository • Example: archive.org/oai-script?verb=ListSets • Parameters: resumptionToken (exclusive) • Errors/exceptions: • badArgument • badResumptionToken • noSetHierarchy

  22. ListIdentifiers • Function: abbreviated form of ListRecords, retrieve only headers • Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01 • Parameters: • from (optional) • until (optional) • metadataPrefix (required) • set (optional) • resumptionToken (exclusive) • Errors/exceptions: • badArgument • badResumptionToken • cannotDisseminateFormat • noRecordsMatch • noSetHierarchy

  23. ListRecords • Function: harvest records from a repository • Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology • Parameters: • from (optional) • until (optional) • metadataPrefix (required) • set (optional) • resumptionToken (exclusive) • Errors/exceptions: • badArgument • badResumptionToken • cannotDisseminateFormat • noRecordsMatch • noSetHierarchy
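The exclusive resumptionToken argument lets a harvester page through a long ListRecords result. The Python sketch below shows one way such a loop could be written. The function `harvest_identifiers`, the injected `fetch` callable, the base URL, and the two canned pages are all hypothetical; a real harvester would pass a function that performs the HTTP GET.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(base_url: str,
                        fetch: Callable[[str], str],
                        arguments: Dict[str, str]) -> Iterator[str]:
    """Yield record identifiers from ListRecords, following resumption tokens."""
    params = {"verb": "ListRecords", **arguments}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        error = root.find(OAI + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # nothing to harvest is not a failure
            raise RuntimeError(f"{error.get('code')}: {error.text}")
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = (root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken") or "").strip()
        if not token:
            return  # an absent or empty token marks the last page
        # The token is exclusive: it replaces every other argument.
        params = {"verb": "ListRecords", "resumptionToken": token}


# Canned two-page response standing in for a real repository.
_PAGE = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
         '<record><header><identifier>{ident}</identifier>'
         '<datestamp>2002-12-01</datestamp></header></record>'
         '{token}</ListRecords></OAI-PMH>')


def fake_fetch(url: str) -> str:
    if "resumptionToken=page2" in url:
        return _PAGE.format(ident="oai:example.org:2", token="<resumptionToken/>")
    return _PAGE.format(ident="oai:example.org:1",
                        token="<resumptionToken>page2</resumptionToken>")


found = list(harvest_identifiers("http://archive.example/oai", fake_fetch,
                                 {"metadataPrefix": "oai_dc", "set": "biology"}))
print(found)
```

Injecting `fetch` keeps the paging logic separate from the network layer, which makes the loop easy to exercise without a live repository.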

  24. GetRecord • Function: retrieve an individual metadata record from a repository • Example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc • Parameters: • identifier (required) • metadataPrefix (required) • Errors/exceptions: • badArgument • cannotDisseminateFormat • idDoesNotExist
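The parameter lists for the six verbs can be gathered into one table and checked before a request is sent. The following Python sketch does this on the client side. `VERB_ARGUMENTS`, `RESUMABLE_VERBS`, and `build_request` are hypothetical names, and the http:// scheme on the base URL is an assumption, since the slide examples omit it.

```python
from typing import Dict
from urllib.parse import urlencode

# (required, optional) arguments per verb, as listed on the slides above.
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}
# Verbs that accept the exclusive resumptionToken argument.
RESUMABLE_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def build_request(base_url: str, verb: str, arguments: Dict[str, str]) -> str:
    """Validate the arguments for a verb and return the full request URL."""
    if verb not in VERB_ARGUMENTS:
        raise ValueError(f"badVerb: {verb}")
    if "resumptionToken" in arguments:
        if verb not in RESUMABLE_VERBS or len(arguments) != 1:
            raise ValueError("badArgument: resumptionToken must be the only argument")
    else:
        required, optional = VERB_ARGUMENTS[verb]
        missing = required - set(arguments)
        unknown = set(arguments) - required - optional
        if missing or unknown:
            raise ValueError(
                f"badArgument: missing={sorted(missing)} unknown={sorted(unknown)}")
    return base_url + "?" + urlencode({"verb": verb, **arguments})


url = build_request("http://archive.org/oai-script", "GetRecord",
                    {"identifier": "oai:HUBerlin.de:3000218",
                     "metadataPrefix": "oai_dc"})
print(url)
```

Note that `urlencode` percent-encodes the colons in the identifier, so the printed URL differs textually from the slide example while carrying the same arguments.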

  25. Interoperability • The goal: communication, without human intervention, between information sources • Books that “talk to each other” • Live links for references • Knowledge of how to find relevant resources when needed • Ability to query other information locations

  26. Protocols • Precise rules for interactions between independent processes • Format of the messages • Both structure and content • Specified behavior in response to specific messages • Many ways to accomplish the same result, but both sides must have the same understanding of the rules of engagement.

  27. Protocol Types • RPC model • Point to point • Completely open to definition by developer • Verbs (methods) • Nouns (objects, resources) • Useful to closed community or group who know about the availability of the resource.

  28. SOAP • Originally an acronym for Simple Object Access Protocol; the expansion was dropped as of version 1.2 • Initially developed as part of the Microsoft .NET paradigm • Now in W3C committee • Stateless, one-way message exchange paradigm • XML encoded • Flexibility of RPC, but more constrained in the way communication is formatted.

  29. REST • REpresentational State Transfer • An after-the-fact definition of the architecture of the World Wide Web • The model is • Client/server • Stateless • Cacheable • Layered • Resource interface constrained • Restricted verbs • Restricted content types

  30. REST and RPC • RPC provides flexibility for any type of interaction between any type of resources • REST provides consistency to allow interaction among resources without prior discovery of accepted actions and responses.

  31. SOAP and REST • Debate in the Web community about which is the better paradigm for application development • REST -- restricted, but simple extension of existing Web processes • SOAP -- added flexibility with cost in terms of bandwidth, security, complexity for development

  32. References • Giving SOAP a REST http://www.devx.com/DevX/Article/8155 • SOAP Version 1.2 Part 0: Primer http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153 • OAI For Beginners - The Open Archives Forum online tutorial: http://www.oaforum.org/tutorial/index.php • Z39.50 Resource Page: http://www.niso.org/standards/resources/Z3950_Resources.html • Z39.50 An Overview of Development and the Future (1995) http://www.cqs.washington.edu/~camel/z/z.html
