
Introduction to Marine Metadata








  1. SeaDataNet Training Course Introduction to Marine Metadata Roy Lowry British Oceanographic Data Centre

  2. Jargon Warning • The nature of the material Geoff and I will be presenting inevitably involves many words that are very familiar to us, but not to you • One approach would be for us to define everything, but that could take all day • Please • Don’t feel you have a problem if you don’t know what we’re talking about – it’s our role to help you understand • Consider our presentations interactive and feel free to ask any question at any time

  3. Overview • Metadata definition • Metadata function • Metadata classification • Metadata interoperability • Vocabulary management • Metadata standards and crosswalks • Ontologies • Metadata horror stories

  4. Metadata Definition • What is metadata? • Information about data • Includes everything except the numbers themselves • “42” is data, but means nothing • “42 is the abundance of Calanus finmarchicus per litre at a location 56N 4E, between depths of 10m and 20m at 00:30 on 01/02/1990” means a lot more
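The slide's "42" example can be sketched in code: a minimal record in which everything except the value itself is metadata. The field names are illustrative, not taken from any particular standard:

```python
# A bare number is data; the metadata around it supplies the meaning.
# All field names here are illustrative, not from any metadata standard.
observation = {
    "value": 42,                                     # the data itself
    "parameter": "Abundance of Calanus finmarchicus",
    "units": "count per litre",
    "latitude": 56.0,                                # degrees north
    "longitude": 4.0,                                # degrees east
    "depth_range_m": (10, 20),
    "time": "1990-02-01T00:30:00Z",
}

# Strip the metadata away and only the meaningless number remains.
data_only = observation["value"]
print(data_only)             # 42 -- means nothing in isolation
print(observation["units"])  # the metadata restores the meaning
```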

  5. Metadata Function • Why is metadata needed? • To provide information to allow data to be discovered • To provide information on whether data should be used for a given purpose • To provide information on how to use data

  6. Metadata Classification • EU INSPIRE draft metadata rules follow this approach, classifying metadata into: • Discovery metadata • Evaluation metadata • Use metadata

  7. Discovery Metadata • Discovery metadata is information posted to allow datasets to be located by search engines • Bare minimum is 5-dimensional co-ordinate coverage • 3 spatial (x,y,z) as digital ranges or keywords • 1 temporal (time) as digital ranges or keywords • 1 other (parameter space) as keywords • May be enriched by keywords covering aspects such as instrument, platform, project, activity
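The five-dimensional coordinate coverage described above can be sketched as a minimal discovery record, with a crude keyword search standing in for a search engine. Field names and values are illustrative assumptions:

```python
# Minimal sketch of a discovery metadata record: the five-dimensional
# coordinate coverage (x, y, z, t, parameter space) plus optional
# enrichment keywords. All field names are illustrative.
discovery_record = {
    "spatial_coverage": {
        "longitude_range": (-10.0, 5.0),   # x, degrees east
        "latitude_range": (48.0, 62.0),    # y, degrees north
        "depth_range_m": (0.0, 200.0),     # z
    },
    "temporal_coverage": ("1990-01-01", "1995-12-31"),          # t
    "parameters": ["nutrient concentrations", "chlorophyll"],   # keywords
    # Optional enrichment keywords:
    "platform": "research vessel",
    "project": "SeaDataNet",
}

def matches(record, parameter_keyword):
    """Crude discovery search: does the dataset cover this parameter?"""
    return parameter_keyword in record["parameters"]

print(matches(discovery_record, "chlorophyll"))  # True
```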

  8. Evaluation Metadata • Evaluation metadata is information that allows a potential user to ascertain whether a discovered dataset is fit for purpose • Covers issues like resolution, precision, accuracy, methodology, provenance, data quality, access restrictions • Often includes a plain-text abstract to provide scientific context

  9. Use Metadata • Use metadata is information required to make use of the data in a tool or application • Access protocols (technical and political) • 5-dimensional co-ordinate coverage (see discovery) plus units of measure • Properties of co-ordinate coverage known as dataset ‘shape’ or feature type (e.g. point time series, profile, spatial grid)

  10. Metadata Classification • Those still awake will have noticed that co-ordinate coverage represents a significant overlap between discovery and use • Controversy rages about whether ‘discovery’ coverage and ‘use’ coverage should be the same • My current view is: • They are different with significantly more detail required for use • Systems should be able to convert from use to discovery, but not the other way round • Search engines should be able to drill down from discovery metadata into use metadata to satisfy the evaluation use case
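The claim that systems should convert from use coverage to discovery coverage, but not the reverse, can be illustrated with a sketch: the conversion is a one-way, lossy collapse. Field names are assumptions for illustration:

```python
# Hedged sketch: deriving coarse discovery coverage from richer use
# metadata. The reverse direction is impossible because detail (units,
# feature type, per-variable structure) has been discarded.
use_metadata = {
    "variables": [
        {"name": "temperature", "units": "degC"},
        {"name": "salinity", "units": "psu"},
    ],
    "feature_type": "point time series",
    "longitude": 4.0,
    "latitude": 56.0,
    "depth_m": 10.0,
    "time_range": ("1990-01-01", "1990-12-31"),
}

def to_discovery(use):
    """Collapse use metadata into discovery metadata (lossy, one-way)."""
    return {
        "spatial": (use["longitude"], use["latitude"], use["depth_m"]),
        "temporal": use["time_range"],
        # Keywords only: the units and feature type are thrown away.
        "parameters": [v["name"] for v in use["variables"]],
    }

discovery = to_discovery(use_metadata)
print(discovery["parameters"])  # ['temperature', 'salinity']
```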

  11. Metadata Interoperability • Interoperability is the ability to share data from multiple sources as a common resource from a single tool • Interoperability has four levels (Bishr, 1998) • System – protocols, hardware and operating systems • Syntactic/Structural – loading data into a common tool (reading each other’s files) • Semantic – understanding of terms used in the data by both humans and machines • Only two levels (syntactic/semantic) need worry us as data managers

  12. Metadata Interoperability • The easiest way to achieve any kind of interoperability is by maintaining uniformity across distributed systems • Nice idea, but this is the real world and many different people have had many different reasons (some valid, others not) why they should do it ‘their way’ • So we have to face the reality of heterogeneous legacy metadata repositories

  13. Metadata Interoperability • Most marine metadata greybeards agree that if they had known 20 years ago what they know now, we wouldn’t have the problem of a heterogeneous legacy • Anyone in this day and age with the blank canvas of a new system who decides to ignore standards and design their own metadata structures deserves to be damned to an eternity of making legacy systems interoperate • I’ve re-invented wheels in the past and am currently on my way to eternity!

  14. Metadata Interoperability • Metadata standards support interoperability by specifying • The fields to be included in a metadata document • The way in which those fields are populated

  15. Vocabulary Management • A controlled vocabulary contains the terms that may be used to populate a metadata field • A good controlled vocabulary comprises: • Terms • Term definitions • Keys (semantically neutral strings that may be used to represent the term to computers) • Term abbreviations
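The four components of a good controlled-vocabulary entry listed above can be sketched as a simple structure. The key value and term text below are invented for illustration:

```python
from dataclasses import dataclass

# Sketch of a controlled-vocabulary entry with the four components the
# slide lists: term, definition, semantically neutral key, abbreviation.
@dataclass(frozen=True)
class VocabEntry:
    key: str           # opaque identifier, safe to store in databases
    term: str
    definition: str
    abbreviation: str

entry = VocabEntry(
    key="VOC0001",     # hypothetical key: carries no meaning of its own
    term="Concentration of chlorophyll-a in the water body",
    definition="Mass of chlorophyll-a pigment per unit volume of seawater.",
    abbreviation="Chl-a conc.",
)

# Software should reference the key, never the human-readable term, so
# the term can be corrected without breaking referential integrity.
print(entry.key)
```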

  16. Vocabulary Management • A good controlled vocabulary possesses: • Content governance • A mechanism for the management of vocabulary entries that: • Makes decisions about new entries • Makes decisions about changes to existing entries • Technical governance • A mechanism to • Control changes dictated by content governance including • Versioning • Audit trails to allow recreation of previous versions • Distribute the most up-to-date vocabulary version
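The technical-governance requirements above (controlled change, versioning, audit trails that allow previous versions to be recreated) can be sketched in a few lines. This is a toy illustration, not how any real SeaDataNet system is implemented:

```python
import datetime

# Toy sketch of technical governance: every change to the vocabulary is
# versioned and recorded with a timestamp, so earlier states can be
# reconstructed from the audit trail.
class VersionedVocabulary:
    def __init__(self):
        self._entries = {}   # key -> current term
        self._audit = []     # (timestamp, key, old_term, new_term)
        self.version = 0

    def set_term(self, key, term):
        old = self._entries.get(key)
        self._entries[key] = term
        self.version += 1
        self._audit.append(
            (datetime.datetime.now(datetime.timezone.utc), key, old, term)
        )

    def history(self, key):
        """Full change history for one entry, oldest first."""
        return [rec for rec in self._audit if rec[1] == key]

vocab = VersionedVocabulary()
vocab.set_term("V001", "chlorophyl")    # initial (misspelt) entry
vocab.set_term("V001", "chlorophyll")   # content governance approves a fix
print(vocab.version)                    # 2
print(len(vocab.history("V001")))       # 2
```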

  17. Vocabulary Management • In SEA-SEARCH and EDIOS we had: • Content governance • Ad-hoc decisions made by individuals, including vacation students (non-specialist undergraduates), on the spur of the moment • No rules of engagement for permitted changes • Technical governance • CSV files on one or more FTP sites updated haphazardly with no formal time stamping and no version labelling

  18. Vocabulary Management • In SeaDataNet we now have: • Content governance • SeaVoX a moderated e-mail list under the joint auspices of SeaDataNet and IOC MarineXML to make decisions concerning vocabulary change • Technical governance • Vocabularies held in an Oracle back-end that automatically documents change, including timestamps, versioning and previous version preservation • A web service API (plus client for those who need it) maintained by BODC on behalf of SeaDataNet and the UK NERC DataGrid providing live access to the latest version of the BODC Oracle database

  19. Metadata Standards • Metadata specifications, primarily targeted at ‘discovery’ metadata, relevant to ‘evaluation’ metadata but of little relevance to ‘use’ metadata • DIF - Set up by NASA’s Global Change Master Directory (GCMD) primarily to document satellite datasets • FGDC - Mandatory US Government dataset description • ISO19115/19139 - A metadata content standard (19115) now developed into an XML schema (19139) targeted at describing GIS datasets, but much more useful • INSPIRE Metadata Rules - A European draft standard for geospatial data, largely based on ISO19115, destined to become the European answer to FGDC

  20. Metadata Standards • Continuing….. • EDMED - Standard developed by BODC for EU MAST programme to describe datasets and subsequently developed by SEA-SEARCH • Cruise Summary Report - IOC standard description for research cruises and associated datasets • EDIOS - Standard developed by EuroGOOS with EU funding to describe datasets of repeated measurements • My view is that this list is far too long…….

  21. Metadata Standards • All these standards do pretty much the same thing • It would seem a very good idea to provide ‘crosswalks’ – the means to translate documents conforming to one standard into another • XML technology provides the means to do this through XSLT scripts • I have yet to find one that actually works at what I consider an acceptable level • Lowest Common Denominator mappings make the conversions too ‘lossy’ • Semantic issues are generally ignored
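The "lossy" complaint about lowest-common-denominator crosswalks can be made concrete with a sketch: mapping a record into a target standard with fewer fields silently discards information. All record content and field names here are invented for illustration (real crosswalks between these standards use XSLT over XML documents):

```python
# Sketch of why lowest-common-denominator crosswalks are 'lossy':
# converting a richer record into a standard with fewer fields discards
# information that cannot be recovered afterwards.
rich_record = {
    "title": "North Sea CTD profiles 1990",       # hypothetical dataset
    "abstract": "CTD casts collected on cruise CH72.",
    "instrument": "Neil Brown MkIII CTD",         # no target equivalent
    "sampling_interval_s": 0.04,                  # no target equivalent
}

# The 'common denominator': the only fields both standards share.
TARGET_FIELDS = {"title", "abstract"}

def crosswalk(record):
    """Keep what the target standard can hold; report what is lost."""
    kept = {k: v for k, v in record.items() if k in TARGET_FIELDS}
    lost = sorted(set(record) - TARGET_FIELDS)
    return kept, lost

converted, dropped = crosswalk(rich_record)
print(dropped)  # ['instrument', 'sampling_interval_s'] -- gone for good
```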

  22. Ontologies • Each standard is associated with a set of controlled vocabularies • Semantic interoperability requires us to either • Harmonise the vocabularies (develop a single overarching vocabulary) • Translate between the vocabularies • Harmonisation usually considered to be too difficult for repositories with significant legacy population • Which brings us to ontologies…..

  23. Ontologies • Vocabulary translation can be based on a simple mapping • Term in vocabulary 1 maps (i.e. has some relation to) term in vocabulary 2 • Now considered to be a gross over-simplification – consider the examples • Pigments map to chlorophyll • Nitrate maps to nutrients • Carbon concentration due to phytoplankton maps to phytoplankton carbon biomass

  24. Ontologies • An ontology (small ‘o’ – this is computer science, not philosophy) may be considered as a set of lists with relationships specified between list members • The previous example becomes • Pigments is broader than chlorophyll • Nitrate is narrower than nutrients • Carbon concentration due to phytoplankton is synonymous with phytoplankton carbon biomass

  25. Ontologies • Computer science provides tools for ontology management • XML specification languages (Web Ontology Language OWL and SKOS thesaurus language) • Tools, such as inference engines, to use these languages as a basis for decision making and to derive additional relationships • Statements • A synonymous with B • B synonymous with C • Inference • A synonymous with C • Welcome to the world of Artificial Intelligence….
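The inference step described above (A synonymous with B, B synonymous with C, therefore A synonymous with C) can be sketched as a transitive closure over relationship triples. The triples reuse the slide's examples plus one extra invented term ("POC (algal)") so that there is something to infer; a real system would use OWL/SKOS and an inference engine rather than this toy loop:

```python
# Toy inference over ontology-style triples: synonymy is symmetric and
# transitive, so stated synonym pairs imply unstated ones.
triples = [
    ("pigments", "broader_than", "chlorophyll"),
    ("nitrate", "narrower_than", "nutrients"),
    ("carbon concentration due to phytoplankton",
     "synonymous_with", "phytoplankton carbon biomass"),
    # Hypothetical extra term, added so an inference is possible:
    ("phytoplankton carbon biomass", "synonymous_with", "POC (algal)"),
]

def synonym_closure(triples):
    """Symmetric, transitive closure of the 'synonymous_with' relation."""
    pairs = {(a, b) for a, rel, b in triples if rel == "synonymous_with"}
    pairs |= {(b, a) for a, b in pairs}          # symmetry
    changed = True
    while changed:                               # transitivity
        extra = {(a, d) for a, b in pairs for c, d in pairs
                 if b == c and a != d}
        changed = not extra <= pairs
        pairs |= extra
    return pairs

closure = synonym_closure(triples)
# Inferred, never stated directly:
print(("carbon concentration due to phytoplankton", "POC (algal)")
      in closure)  # True
```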

  26. Metadata Horror Stories • Vocabulary nightmares • The plaintext monster • The evil shoehorn

  27. Vocabulary Nightmares • Weak content governance • Maintaining a vocabulary properly requires a surprisingly large amount of intellectual input, as consistent and robust decisions need to be made quickly • Many vocabularies have been populated by isolated individuals, who are sometimes inexperienced and working under pressure at the coal-face • The result is vocabularies with useless terms like ‘see website’ (referring to a broken URL) or rubbish like ‘NVT’ (Dutch for ‘not applicable’) in a list of sea level datums • Weak technical governance • Lack of clearly defined, readily obtainable and versioned master copies leads to a proliferation of local lists • Like finches on the Galapagos Islands, these soon evolve into something completely different. Just as the finches lose the ability to interbreed, the lists eventually lose the ability to interoperate

  28. Vocabulary Nightmares • Semantic keys • During the 80s and 90s, great importance was placed on making keys meaningful mnemonics • Not scalable, particularly if there are restrictions on key size (try getting 18,000 meaningful unique labels out of 8 bytes!) • Following this doctrine has caused • New vocabularies to be created requiring months of subsequent mapping work to re-establish interoperability • The disintegration of established standards (e.g. ICES Ship Codes when USA left the fold) • Insanity in other vocabularies (e.g. USA has three different IOC Country Codes due to Ship Code key syntax)

  29. Vocabulary Nightmares • GCMD maintenance • NASA’s GCMD maintain vocabularies (called GCMD keywords) for the DIF metadata standard • They have no compunction about deleting terms which • Invalidates DIFs in legacy repositories • Breaks referential integrity in user databases • Their entries have no keys which • Makes changes or corrections to terms difficult to find • Causes these changes to break referential integrity in user databases • Consequently GCMD keyword updates are only done when there is a dire need, resulting in yet more local list evolution

  30. The Plaintext Monster • Some of the previous generation of data managers saw hard-copy printout as the primary metadata delivery mechanism • This can easily be delivered by metadata repositories based on big chunks of plaintext • Such repositories are virtually useless for machine-based metadata processing • Remember that sticking text together is much easier than picking it apart

  31. The Plaintext Monster • SeaDataNet has EDMED and not DIF because of history and a misguided desire for plaintext over structured fields • EDMED, through its XML schema, is currently evolving towards a structured standard (ISO19115) that will make interoperability much easier, and this evolution will continue during the life of SeaDataNet • When populating EDMED plaintext, always consider how to make it easier to pick apart, by being consistent or by using internal markup (but not XML or XHTML - embedding these inside fields is not XML-friendly)

  32. The Evil Shoehorn • A shoehorn forces something (a foot) to occupy a space that it doesn’t quite fit (a shoe) • The metadata equivalent is using a metadata structure designed to describe a particular thing to describe something else • For example, using a Cruise Summary Report to describe the activities of a scientist on a beach collecting mussels for analysis

  33. The Evil Shoehorn • Why is this evil? • Shoehorning causes data model entity definitions to be changed • Changing entity definitions causes strange things to happen to supporting vocabularies, for example CSR shoehorning has led to the following ‘ship names’ • RRS Challenger (a ship) • Dr Mussel Collector (a person) • Helicopter (a type of platform) • Dover to Boulogne (a ferry route) • These vocabularies are shared between data models and used as constraints or to populate drop-down lists in user interfaces • Would you want ‘Dr Mussel Collector’ appearing in the drop-down list labelled ‘ship name’ in your system?

  34. That’s All Folks! Questions or Coffee?
