
Improving Metadata Quality: Strategies and Services




Presentation Transcript


  1. Improving Metadata Quality: Strategies and Services Diane I. Hillmann Naomi Dushay Jon Phipps National Science Digital Library

  2. Introduction • Useful services depend on good metadata, but most metadata is not very good • Human-created metadata is expensive • Automated crawling strategies are limited by: • Accessibility barriers (rights issues, technical issues) • Variable results with crawling technologies for non-text resources • The best metadata does not rely solely on information contained within the resource itself • e.g., controlled vocabularies, descriptions, links

  3. The NSDL Environment • Functions to some extent as a metadata aggregator • Simple, two-level hierarchy (Collections & items) • Based on OAI-PMH harvest model • Each harvested item associated with a collection • Collection records managed via internal system that also drives automated harvest/ingest processes • Harvested records split into elements for storage and reassembled for output
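The harvest model on this slide can be sketched as follows. This is only an illustration, not NSDL's actual harvester: it builds OAI-PMH `ListRecords` request URLs (the `verb`, `metadataPrefix`, `set`, `from`, and `resumptionToken` arguments are defined by the protocol) and tags each harvested item with its collection. The function names, the record dictionaries, and the example base URL are assumptions made for the example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix, set, and from on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def attach_collection(records, collection_id):
    """Associate each harvested item with its collection (two-level hierarchy)."""
    return [dict(record, collection_id=collection_id) for record in records]


# Hypothetical provider URL and item, for illustration only.
url = list_records_url("http://example.org/oai", from_date="2004-05-01")
items = attach_collection([{"identifier": "oai:example.org:1"}], "316878")
```

Passing a `from` date is what makes an incremental harvest possible: only records changed since that date are requested.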

  4. Why Transform Metadata at All? • Four categories of problems limit metadata usefulness: • Missing data: elements not present • Incorrect data: values not conforming to proper usage • Confusing data: embedded HTML tags, improper separation of multiple elements, etc. • Insufficient data: no indication of controlled vocabularies, formats, etc.

  5. Transforming Metadata “Safely” • Enhance original data with no risk of degradation • Provide a low-cost, scalable way to improve the quality and predictability of data • Remove “noise”: empty elements, useless values • Detect and identify controlled vocabularies: DCMIType and IMT values • Normalize presentation: clean up values, remove double XML encodings, extra whitespace, etc.
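A minimal sketch of what such "safe" transforms might look like, assuming each statement arrives as an already-parsed `(element, value)` pair. The noise list, the IMT pattern, and the function names are assumptions for illustration, not the NSDL implementation; the DCMI Type terms are the twelve terms of the published DCMI Type Vocabulary.

```python
import html
import re

# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = ["Collection", "Dataset", "Event", "Image", "InteractiveResource",
              "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
              "StillImage", "Text"]
DCMI_BY_LOWER = {term.lower(): term for term in DCMI_TYPES}

# Assumed list of "useless" values; a real service would tune this.
NOISE_VALUES = {"", "n/a", "none", "unknown"}

# Rough shape of an Internet Media Type such as text/html (assumption).
IMT_PATTERN = re.compile(
    r"(application|audio|image|message|model|multipart|text|video)/[\w.+-]+")


def clean_value(value):
    """Undo one level of leftover entity encoding and collapse whitespace."""
    value = html.unescape(value)  # "&amp;" left in a parsed value becomes "&"
    return re.sub(r"\s+", " ", value).strip()


def safe_transform(statements):
    """Return cleaned statements; the input list is never modified."""
    result = []
    for element, value in statements:
        cleaned = clean_value(value)
        if cleaned.lower() in NOISE_VALUES:
            continue  # drop empty elements and useless values
        scheme = None
        if element == "dc:type" and cleaned.lower() in DCMI_BY_LOWER:
            cleaned, scheme = DCMI_BY_LOWER[cleaned.lower()], "DCMIType"
        elif element == "dc:format" and IMT_PATTERN.fullmatch(cleaned.lower()):
            cleaned, scheme = cleaned.lower(), "IMT"
        result.append({"element": element, "value": cleaned, "scheme": scheme})
    return result


cleaned = safe_transform([
    ("dc:title", "  Surface   Chemistry &amp; Catalysis "),
    ("dc:description", "  "),
    ("dc:type", "text"),
    ("dc:format", "Text/HTML"),
])
```

Because the function returns new statements and leaves its input alone, the original harvested values stay available, which is one way to keep the transform free of degradation risk.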

  6. Beyond “Safe Transforms” • Managing each "record" separately made automated maintenance and enhancement difficult • Many sources of data required more tailored quality improvement • Distinction between improvements to the metadata expression and additional information about the resource itself • Potential to make the knowledge and expertise of NSDL data managers available to downstream consumers of the data

  7. From Records to Elements • Metadata record -- “a series of statements about resources” which can be aggregated to build a more complete profile of a resource • Statements come with source information, and links to details about services and harvests

  8. [Diagram: metadata flow through the NSDL Metadata Repository; connections between components are labeled OAI] • Provider A original metadata: <dc:title>, <dc:identifier>, <dc:creator>, <dc:type> • iVia Enhancement Service enhancements: <dc:subject GEM>, <dc:subject LCSH>, <dc:subject LCC> • ENC Enhancement Service enhancements: <dct:audience>, <dct:educationLevel> • NSDL Safe Transforms enhancements: <dc:identifier URI>, <dc:type DCMIType> • NSDL normalized/augmented output: <dc:title source=A>, <dc:creator source=A>, <dc:identifier URI source=MR>, <dc:type DCMIType source=MR>, <dc:subject GEM source=iVia>, <dc:subject LCSH source=iVia>, <dc:subject LCC source=iVia>, <dct:audience source=ENC>, <dct:educationLevel source=ENC>
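The idea of treating a record as a set of sourced statements, as on slides 7 and 8, can be sketched like this. The `Statement` class, the `build_profile` function, and all values are hypothetical; only the element names and source labels (A, MR, iVia, ENC) come from the diagram.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One metadata statement about a resource, with its source."""
    resource: str
    element: str
    value: str
    source: str


def build_profile(statements, resource):
    """Aggregate statements from every source into one profile per resource."""
    profile = defaultdict(list)
    for statement in statements:
        if statement.resource == resource:
            profile[statement.element].append((statement.value, statement.source))
    return dict(profile)


# Illustrative values only.
url = "http://example.org/resource"
statements = [
    Statement(url, "dc:title", "An Example Resource", "A"),
    Statement(url, "dc:type", "Text", "MR"),
    Statement(url, "dc:subject", "Chemistry", "iVia"),
    Statement(url, "dc:subject", "Surface chemistry", "iVia"),
    Statement(url, "dct:audience", "Educators", "ENC"),
]
profile = build_profile(statements, url)
```

Keeping the source next to every value means a consumer can still tell which provider or service made each statement after aggregation.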

  9. Exposing Quality Information • Metadata statements vary in quality, and may be subjective • Quality of statements can be determined to a great extent by knowledge of the source, and knowledge of the methodology used to create the statement • Detailed provenance itself is a good indicator of quality metadata

  10. Exposing Data to Downstream Users • Two major issues: • Linking statements to particular harvested source records (including the datestamp of the harvest) • Linking records to the services that provided them (including descriptions of those services and the methods used to create the metadata) • Required the creation and exposure of service records and a service vocabulary to categorize them

  11.
      <record>
        <metadata>
          <nsdl_dcard_m>
            …
            <dc:identifier sourceRecordID="332518" sourceServiceID="316878">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
            <dc:identifier sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:URI">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
            <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
            <dc:language sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:RFC3066">en-GB</dc:language>
            …
          </nsdl_dcard_m>
        </metadata>
      </record>

  12.
      <about>
        <sourceRecords>
          <sourceRecord recordID="332518" sourceServiceID="316878">
            <datestamp>2002-11-11</datestamp>
            <identifier>http://nsdl.org/mr/oai:nsdl.org:316878:oai:asdlib.org:asdl001709</identifier>
          </sourceRecord>
          <sourceRecord recordID="993251" sourceServiceID="8957432">
            <datestamp>2004-15-05T05:11:00Z</datestamp>
            <identifier>http://nsdl.org/mr/oai:nsdl.org:nsdl.service:993251</identifier>
          </sourceRecord>
          …
        </sourceRecords>
      </about>

  13.
      <about>
        <sourceServices>
          <sourceService serviceID="316878">
            <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
            <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
            <dc:type xsi:type="nsdl:serviceType">Collection</dc:type>
            <serviceDescription xsi:type="nsdl:xml">http://nsdl.org/mr/xml/316878</serviceDescription>
          </sourceService>
          <sourceService serviceID="8957432">
            <dc:title>NSDL Metadata Normalization Service</dc:title>
            <dc:description>The NSDL Metadata Normalization Service provides the spices that help to create delicious sausage from metadata chicken lips, feathers...</dc:description>
            …
          </sourceService>
        </sourceServices>
      </about>
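A downstream consumer could join the three pieces shown on slides 11 to 13 (statements, source records, source services) roughly as follows. This is a sketch under assumptions: the sample document is a simplified, self-contained variant of the slides' XML, the datestamps in it are illustrative, and `resolve_provenance` is a hypothetical helper, not part of any NSDL toolkit.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Simplified sample modeled on the slides; datestamps are illustrative.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
    <dc:language sourceRecordID="993251" sourceServiceID="8957432">en-GB</dc:language>
  </metadata>
  <about>
    <sourceRecords>
      <sourceRecord recordID="332518" sourceServiceID="316878">
        <datestamp>2002-11-11</datestamp>
      </sourceRecord>
      <sourceRecord recordID="993251" sourceServiceID="8957432">
        <datestamp>2004-05-15T05:11:00Z</datestamp>
      </sourceRecord>
    </sourceRecords>
  </about>
  <about>
    <sourceServices>
      <sourceService serviceID="316878">
        <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      </sourceService>
      <sourceService serviceID="8957432">
        <dc:title>NSDL Metadata Normalization Service</dc:title>
      </sourceService>
    </sourceServices>
  </about>
</record>"""


def resolve_provenance(xml_text):
    """Attach harvest datestamp and service title to every statement."""
    root = ET.fromstring(xml_text)
    datestamps = {rec.get("recordID"): rec.findtext("datestamp")
                  for rec in root.iter("sourceRecord")}
    services = {svc.get("serviceID"): svc.findtext(f"{DC}title")
                for svc in root.iter("sourceService")}
    rows = []
    for statement in root.find("metadata"):
        rows.append({
            "element": statement.tag.replace(DC, "dc:"),
            "value": (statement.text or "").strip(),
            "harvested": datestamps.get(statement.get("sourceRecordID")),
            "service": services.get(statement.get("sourceServiceID")),
        })
    return rows


rows = resolve_provenance(SAMPLE)
```

With the joined rows in hand, a consumer can, for example, prefer the normalized `en-GB` value from the normalization service over the provider's original `eng-GB`.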

  14.
      <about xmlns:dc="http://purl.org/dc/elements/1.1/">
        <collectionMembership>
          <collection collectionID="316878">
            <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
            <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
            <dc:identifier xsi:type="URI">http://www.asdlib.org/</dc:identifier>
            <dc:identifier>oai:nsdl.org:nsdl.nsdl:00229</dc:identifier>
          </collection>
          <collection collectionID="4718">
            <dc:title>ENC Online: The best selection of K-12 mathematics and science curriculum resources on the Internet!</dc:title>
            …
          </collection>
        </collectionMembership>
      </about>

  15. Service Provision Model: iVia • A variety of metadata generation services • Crawling to determine what resources are part of a “collection” • Metadata creation for each resource • Augmenting metadata, adding subjects, classification, format

  16. iVia Service Issues • Human review of results is essential • Error handling and Blacklisting

  17. iVia Service Issues • Human review of results is essential • Log review

  18. iVia Service Challenge • Repeatable Crawls • Storing and reusing the crawl parameters • Repeating the crawl on a schedule • Incremental updates of the iVia data • Editor notification of crawl completion • Initiation of incremental reharvest
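One way to picture "storing and reusing the crawl parameters" and "repeating the crawl on a schedule" is the sketch below. It does not use or describe iVia's real interface: the `CrawlConfig` fields, the JSON storage format, and the helper names are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta


@dataclass
class CrawlConfig:
    """Hypothetical crawl parameters, stored so that a crawl can be repeated."""
    seed_url: str
    max_depth: int = 2
    interval_days: int = 30
    blacklist: list = field(default_factory=list)


def dump_config(config):
    """Serialize the parameters so the same crawl can be re-run later."""
    return json.dumps(asdict(config), sort_keys=True)


def load_config(text):
    """Restore previously stored crawl parameters."""
    return CrawlConfig(**json.loads(text))


def next_crawl(config, last_crawl):
    """Date on which the crawl should be repeated."""
    return last_crawl + timedelta(days=config.interval_days)


def is_due(config, last_crawl, now):
    """True once the scheduled repeat date has been reached."""
    return now >= next_crawl(config, last_crawl)


config = CrawlConfig("http://www.asdlib.org/",
                     blacklist=["http://www.asdlib.org/login"])
restored = load_config(dump_config(config))
```

Storing the blacklist alongside the other parameters means an editor's review decisions carry over to every repeat crawl instead of being redone each time.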

  19. Metadata Quality Services • Metadata generation & augmentation • Metadata transformation (“safe” and “collection specific”) • Equivalence • Crosswalking (schema and vocabulary) • Persistence/archiving • Annotation • Metadata improvement and rating

  20. “Conducting” Service Interactions • Order, timing, and response important • Passive and active interactions; human and automated triggers • Parameters for each interaction stored • Supporting “freshness” and automated updating

  21. Typical Service Orchestration Introducing Lenny… • Editor initiates an iVia Guided Crawl • Editor reviews results, blacklists • Editor notifies Lenny that crawl is complete • Lenny initiates OAI harvest and ingest • Lenny notified of ingest success

  22. Typical Service Orchestration • Lenny initiates Safe Transform Service • Service notifies Lenny that it’s done • Lenny initiates OAI harvest and ingest • Lenny notified of ingest success

  23. Typical Service Orchestration • Lenny initiates Collection-Specific Transform Service • Service notifies Lenny that it’s done • Lenny initiates OAI harvest and ingest • Lenny notified of ingest success • Lenny rests
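The orchestration walked through on slides 21 to 23 follows one repeating pattern: a service finishes, then Lenny initiates an OAI harvest and ingest before moving on. A minimal sketch of that loop, assuming each service and the harvest step are plain callables that return `True` on success (the function name, signatures, and log messages are hypothetical):

```python
def orchestrate(collection_id, services, harvest_and_ingest):
    """Run each service in order, harvesting and ingesting after each one."""
    log = []
    for name, run_service in services:
        log.append(f"start {name}")
        if not run_service(collection_id):
            log.append(f"failed {name}")
            break  # later steps depend on this one, so stop here
        log.append(f"done {name}")
        if not harvest_and_ingest(collection_id, name):
            log.append(f"ingest failed after {name}")
            break
        log.append(f"ingested {name}")
    else:
        # Every step succeeded, so the orchestrator can rest.
        log.append("rest")
    return log


# Stand-ins for the real services; the guided crawl is editor-driven, so its
# callable would return True once the editor reports the crawl as complete.
services = [
    ("guided crawl", lambda collection_id: True),
    ("safe transform", lambda collection_id: True),
    ("collection-specific transform", lambda collection_id: True),
]
log = orchestrate("316878", services, lambda collection_id, name: True)
```

Stopping at the first failure keeps the order guarantee the slides emphasize: a transform never runs on data whose previous step was not ingested.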

  24. The Who and Where of Services • Many of the services we describe are useful to most metadata aggregators • No aggregator can afford to create many single purpose services closely coupled with a single aggregator • Shared, open services can provide a useful basis for improved metadata for all

  25. Conclusions • New role for “metadata aggregators”—providing enhanced metadata for other services to re-use • Integrating fragmentary metadata created by automated services • Improving metadata in standard ways • Exposing all relevant data in ways that allow consumers to evaluate quality and usefulness

  26. Conclusions “This model of service provision holds much potential in an environment where persistent metadata quality issues threaten to overwhelm aggregators hoping to build services on top of harvested metadata. No single aggregator can fill in the quality gaps alone, but if metadata services are built to interoperate with a variety of aggregators using low barrier protocols like OAI-PMH, many can benefit from the work, freeing resources for new service development.”
