Efficient Metadata Capture for Enhanced Data Management in e-Science

Adaptable and Incremental Metadata Capture in e-Science Scott Jensen Data to Insight Center Indiana University University of Chicago – March 2, 2012

What is Metadata? Data About Data • “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage any other resource” National Information Standards Organization • Alternately, answers the who, what, when, and why questions about a dataset. ISO 19115 standard • Where (spatial metadata) • How (configuration)

Why Does Metadata Matter? • Data Reuse • “Metadata is key to being able to share results” U.K. e-Science Core Programme • “A significant need exists in many disciplines for long-term, distributed, and stable data and metadata repositories” NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure • “Preservation of digital data is arguably a ‘grand challenge’ of the information age” Francine Berman • Trusting and Understanding Data • The ability to understand and evaluate the quality of data is key to reuse after discovery. If they have too much uncertainty, they would not use it. Ann Zimmermann • Data that is Costly and Irreplaceable • Can other data be regenerated? • Data Management Plans

Metadata Capture • Historically done at the end of the data lifecycle • Research is completed • Data and results tarred up as a dataset • Metadata at the dataset level • Inserts are full metadata documents • Metadata often captured at the collection level • Generalized and not specific to each data product • Collection level metadata for discovery (e.g., WCS) • Detailed metadata stored as an object • Data search is coarse • Based on keywords or text search • Spatial bounding box and temporal range • Not specific to a data product, details not searchable • Sometimes just browse capabilities

How Much Metadata to Capture? • Cost / Benefit Trade-offs • Spectrum from more structure to less structure: Structured Metadata Schemata (FGDC, EML), Core Metadata, Flat Schemata (unqualified DC), Name / Value Pairs • More Structure: Richer Metadata to Search Over • Less Structure: Lower Barriers to Entry

Research Problem • Early Capture of Ephemeral Metadata • Incremental, not at the end of the lifecycle • Incremental capture must be efficient • Deluge, Tsunami, Bonanza • Requires automation • Detailed metadata for discovery • Scalability • Variable and Dynamic Data • Must accommodate new metadata • Accommodate different domains and schemata

Research Focus • Identified the concept-based character of scientific metadata schemas that differentiates them as a class from other XML schemas. • Capture metadata incrementally and efficiently early in the scientific process • Capture detailed metadata without full update • Reconstruct metadata on-the-fly after incremental capture • Automated metadata extraction from data objects • Incremental capture must be efficient and scalable • Architecture must generalize across schemas and domains • Detailed metadata must be discoverable • Extensible without schema modifications

Metadata Schemas - a Bag of Concepts FGDC Spatial Schema ISO 19115 • Identification • Constraints • Data Quality • Spatial • Reference System • Distribution • Metadata Extension • and more … • Astronomy • Identity • Curation • Content • Coverage • Spatial • Temporal • Data Quality • Ecology (EML) • General • Geographic • Temporal • Taxonomic • Methods • Data table metadata • DDI (version 2.0) • Description • Study description • Physical file description • Logical description (variables) • other

Concepts have Complex Structure • Schemata are often composed of complex concepts (compound elements) • “Compound elements represent higher-level concepts that cannot be represented by an individual data element” • Increased structure → Increased reusability • Flat schema → difficulty harvesting • Harvesting Dublin Core led to incomplete and inconsistent data - California Digital Libraries • Similar issues at the National Science Digital Library made it difficult to build services on harvested Dublin Core. • Performance bottleneck when converting XML to name/value pairs

Concepts & Incremental Metadata Capture • As an experiment runs, adding a concept does not require editing the existing metadata. • Can capture ephemeral metadata such as workflow notifications and add them to a detailed metadata document. • Metadata can be harvested from files and added as queryable metadata at different levels of the hierarchy.

Partitioning a Schema on Concepts • Global ordering of concept elements and higher levels (figure: schema tree with numbered nodes) • Concept Requirements: • Recursion is within a concept • Elements where cardinality can exceed one are concepts or contained in concepts • Beneficial when CRUD operations are at the concept level or higher • Incremental ingest: no need to modify existing concepts • Efficient reconstruction based on concept-sized fragments
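The incremental ingest and reconstruction described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the XMC Cat implementation: the `ConceptStore` class, its method names, and the sample fragments are all hypothetical, and real reconstruction would also have to restore the wrapper elements above the concept level.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptStore:
    """Hypothetical in-memory stand-in for a table of concept CLOBs."""
    # object_id -> list of (global_order, clob_xml)
    clobs: dict = field(default_factory=dict)

    def add_concept(self, object_id, global_order, clob_xml):
        # Incremental ingest: append only, existing concepts are untouched.
        self.clobs.setdefault(object_id, []).append((global_order, clob_xml))

    def reconstruct(self, object_id, root_tag="metadata"):
        # Rebuild the document by emitting concept fragments in global order.
        fragments = sorted(self.clobs.get(object_id, []), key=lambda pair: pair[0])
        body = "".join(xml for _, xml in fragments)
        return f"<{root_tag}>{body}</{root_tag}>"


store = ConceptStore()
# Concepts arrive out of schema order as the experiment runs.
store.add_concept("obj-1", 5, "<descript>hourly assimilation</descript>")
store.add_concept("obj-1", 2, "<citation>ADAS 10km</citation>")
document = store.reconstruct("obj-1")
```

Because each fragment carries its global order, the arrival order during the experiment does not matter; the document is assembled in schema order only when it is requested.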

Shredding XML Concepts • Figure labels: Source Metadata Document, Concept CLOB, Shredded Concept, Concept, Sub-concept *, Concept Element *, ID, Name, Global Order, Source, Typed Value • Metadata documents are “shredded” into concepts and then concepts are shredded into elements using XSLT. • Once CLOBs are stored, metadata cannot be lost. • CLOBs are indexed on Object ID and their global ordering. • Shredded metadata is only a search index, allowing for strong typing, even if types do not match XML. • Callouts: Fast Response, Detailed Search
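As a rough illustration of shredding a concept into elements, the sketch below walks a citation concept and emits one row per leaf element. The slides state that XSLT is used for this step; Python's `xml.etree.ElementTree` is swapped in here only to keep the example self-contained. The `shred` and `local_name` helpers are hypothetical names, and the sample XML is abbreviated from the LEAD citation shown later in the deck.

```python
import xml.etree.ElementTree as ET

FGDC = "http://schemas.leadproject.org/2007/01/lms/fgdc"
LEAD = "http://schemas.leadproject.org/2007/01/lms/lead"

CITATION = f"""
<lead:citation xmlns:lead="{LEAD}" xmlns:fgdc="{FGDC}">
  <fgdc:pubdate>Unknown</fgdc:pubdate>
  <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
  <fgdc:pubinfo>
    <fgdc:pubplace>unknown</fgdc:pubplace>
    <fgdc:publish>IU/GEOG</fgdc:publish>
  </fgdc:pubinfo>
</lead:citation>
""".strip()


def local_name(tag):
    # ElementTree spells qualified names as "{namespace}local".
    return tag.rsplit("}", 1)[-1]


def shred(element, path=()):
    """Yield (path, value) rows for every leaf element under `element`."""
    here = path + (local_name(element.tag),)
    children = list(element)
    if not children:
        yield here, (element.text or "").strip()
    for child in children:
        yield from shred(child, here)


rows = list(shred(ET.fromstring(CITATION)))
for path, value in rows:
    print("/".join(path), "=", value)
```

The rows serve only as a search index; the original concept CLOB is kept unchanged, so nothing is lost if a value cannot be typed.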

Ingest & Search Using Incremental Capture • Diagram components: XMC Cat, Database, Concept CLOBs, Shredded Concepts • Ingest path: Determine Schema, Validate new concept, Shred • Search path: Build Query (search based on concepts), query shredded metadata for matching objects (object IDs), query for CLOBs based on IDs, Build Result
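The two-step search on this slide (find object IDs in the shredded metadata, then fetch CLOBs by ID) can be sketched with SQLite. The table names, column names, and sample rows are assumptions made for the illustration and are not taken from the XMC Cat schema; validation and schema detection are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept_clob (object_id TEXT, global_order INTEGER, clob TEXT);
    CREATE TABLE shredded_element (object_id TEXT, concept TEXT, name TEXT, value TEXT);
""")


def ingest(object_id, global_order, concept, clob, elements):
    # Store the CLOB first, then the shredded rows that act as the search index.
    conn.execute("INSERT INTO concept_clob VALUES (?, ?, ?)",
                 (object_id, global_order, clob))
    conn.executemany(
        "INSERT INTO shredded_element VALUES (?, ?, ?, ?)",
        [(object_id, concept, name, value) for name, value in elements],
    )


def search(concept, name, value):
    # Step 1: query shredded metadata for matching object IDs.
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT object_id FROM shredded_element "
        "WHERE concept = ? AND name = ? AND value = ? ORDER BY object_id",
        (concept, name, value))]
    # Step 2: fetch the CLOBs for those IDs in global order to build the result.
    results = {}
    for object_id in ids:
        clobs = conn.execute(
            "SELECT clob FROM concept_clob WHERE object_id = ? ORDER BY global_order",
            (object_id,))
        results[object_id] = "".join(row[0] for row in clobs)
    return results


# Sample rows are invented for the example.
ingest("obj-1", 2, "citation", "<citation>ADAS 10km</citation>", [("title", "ADAS 10km")])
ingest("obj-1", 5, "descript", "<descript>hourly</descript>", [("purpose", "research")])
ingest("obj-2", 2, "citation", "<citation>NAM 12km</citation>", [("title", "NAM 12km")])
hits = search("citation", "title", "ADAS 10km")
```

The search never parses XML: matching happens against the shredded rows, and the response is assembled from stored CLOBs.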

Exploded Datasets (Describing data in a broader context) • Incremental Capture During a Workflow or Experiment • Not a tarball at the end of a project • Automated capture during an experiment • Data objects are generated throughout a workflow • Experiment data hierarchies vary by domain • Provides scientists access to incremental metadata

Automated Metadata Capture

Domain Schema → Generalized Architecture

Adaptable Metadata Store • Shares characteristics of clinical genomics databases and relational RDF stores such as Jena. • Definition of concepts is based on schema structure. • Dynamic concepts can be defined based on metadata content instead of structure. • Every concept is stored as a CLOB. • Concepts can optionally be parsed into concepts, sub-concepts, and elements.

A Generic Structure for Searching Domain Concepts • Complex Domain-Specific Concepts are mapped to Generalized Concepts, Sub-Concepts and Elements
Metadata Schema : Concept +
Concept : Sub-Concept *, Atomic Element *
Sub-Concept : Sub-Concept *, Atomic Element *
Atomic Element : date | time | timestamp | integer | float | spatial | string
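The generic structure above maps naturally onto a small recursive data type. The sketch below is one possible rendering in Python; the class and field names are hypothetical, only a subset of the atomic types is modeled, and the sample values come from the LEAD citation shown later in the deck.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union

# Subset of the atomic types listed on the slide (time, timestamp, spatial omitted).
AtomicValue = Union[date, int, float, str]


@dataclass
class AtomicElement:
    name: str
    source: str
    value: AtomicValue


@dataclass
class SubConcept:
    name: str
    source: str
    sub_concepts: List["SubConcept"] = field(default_factory=list)
    elements: List[AtomicElement] = field(default_factory=list)


@dataclass
class Concept(SubConcept):
    """A top-level concept has the same shape as a sub-concept."""


def count_elements(node):
    # Walk the recursive structure and count atomic elements.
    return len(node.elements) + sum(count_elements(s) for s in node.sub_concepts)


citation = Concept(
    name="citation", source="LEAD",
    sub_concepts=[SubConcept("pubInfo", "LEAD", elements=[
        AtomicElement("pubPlace", "LEAD", "unknown"),
        AtomicElement("publisher", "LEAD", "IU/GEOG"),
    ])],
    elements=[AtomicElement("title", "LEAD", "LEAD CONUS ADAS Catalog/CONUS ADAS 10km")],
)
total = count_elements(citation)
```

Because every domain concept reduces to these three node kinds, search code written against them does not need to know which domain schema the metadata came from.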

Shredding Domain Metadata • Slide callouts: Citation Concept, Description Concept, 2nd Theme Keyword Concept
<lead:LEADresource xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead" xmlns:le="http://schemas.leadproject.org/2007/01/lms/leadelements" xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc">
  <le:resourceID>urn:uuid:97afbef7-58c8-4143-9b05-f0b9d82d27ef</le:resourceID>
  <lead:data>
    <lead:idinfo>
      <lead:citation>
        <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
        <fgdc:pubdate>Unknown</fgdc:pubdate>
        <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
        <fgdc:pubinfo>
          <fgdc:pubplace>unknown</fgdc:pubplace>
          <fgdc:publish>IU/GEOG</fgdc:publish>
        </fgdc:pubinfo>
      </lead:citation>
      <lead:descript>
        <fgdc:abstract>Real-time meteorological data assimilations with CONUS coverage at 10km resolution produced hourly by CAPS at OU. The List of contents provides the OPeNDAP URLs for the files within the collection. They have a form: http://lead.unidata.ucar.edu/cgi-bin/nph-dods/test-data/ADAS/OU/ad{date}.nc where {date} has the form: YYYYMMDDHH and indicates the hour for which the data assimilation is valid. </fgdc:abstract>
        <fgdc:purpose>Scientific research and education</fgdc:purpose>
      </lead:descript>
      . . .
      <lead:keywords>
        <fgdc:theme>
          <fgdc:themekt>DatasetTypes.lead.org</fgdc:themekt>
          <fgdc:themekey>ADAS</fgdc:themekey>
        </fgdc:theme>
        <fgdc:theme>
          <fgdc:themekt>CF-1.0</fgdc:themekt>
          <fgdc:themekey>projection_x_coordinate</fgdc:themekey>
          <fgdc:themekey>projection_y_coordinate</fgdc:themekey>
          <fgdc:themekey>height</fgdc:themekey>
          <fgdc:themekey>geopotential_height</fgdc:themekey>

Shredded Citation Metadata • All Shredded Metadata Conforms to the Same Schema • Slide callouts: CLOB for Citation Concept, pubInfo Sub-concept
<objectClobProperty myPos="5" (namespaces omitted here) >
  <objectClob>
    <lead:citation xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead" xmlns="http://schemas.leadproject.org/2007/01/lms/lead" xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
      <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
      <fgdc:pubdate>Unknown</fgdc:pubdate>
      <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
      <fgdc:pubinfo>
        <fgdc:pubplace>unknown</fgdc:pubplace>
        <fgdc:publish>IU/GEOG</fgdc:publish>
      </fgdc:pubinfo>
    </lead:citation>
  </objectClob>
  <objectProperty myName="citation" mySource="LEAD">
    <objectProperty myName="pubInfo" mySource="LEAD">
      <objectElement myName="pubPlace" mySource="LEAD" myVal="unknown"/>
      <objectElement myName="publisher" mySource="LEAD" myVal="IU/GEOG"/>
    </objectProperty>
    <objectElement myName="originator" mySource="LEAD" myVal="/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson"/>
    <objectElement myName="pubDate" mySource="LEAD" myVal="Unknown"/>
    <objectElement myName="pubDateTime" mySource="LEAD" myVal="Unknown"/>
    <objectElement myName="title" mySource="LEAD" myVal="LEAD CONUS ADAS Catalog/CONUS ADAS 10km"/>
  </objectProperty>
</objectClobProperty>

Dynamic Concepts Based on Content • CLOB parsed out and saved based on global order (schema structure) • Concept defined based on “entity” label and source • Sub-concept and elements defined based on “attribute” label and source • New domain concepts without schema changes • Concept CLOBs are always saved based on global order, even if the concept is not defined. • To be queryable, new concepts and elements must be defined, but no schema change is required.
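A small sketch of the behavior described on this slide: the CLOB is always saved, while shredding into queryable rows happens only for concepts that have been defined by label and source. The `DynamicCatalog` class, the `gridSpacing` and `ARPS` labels, and the attribute names are all hypothetical and chosen only for illustration.

```python
class DynamicCatalog:
    """Hypothetical store keyed on (entity label, source) definitions."""

    def __init__(self):
        self.definitions = {}  # (label, source) -> set of defined attribute labels
        self.clobs = []        # (global_order, clob), always saved
        self.index = []        # (label, source, attribute, value), queryable rows

    def define(self, label, source, attributes):
        # Defining a concept is a data change, not a schema change.
        self.definitions[(label, source)] = set(attributes)

    def ingest(self, global_order, label, source, clob, attributes):
        self.clobs.append((global_order, clob))
        known = self.definitions.get((label, source))
        if known is None:
            return  # Undefined concept: kept as a CLOB but not searchable yet.
        for attribute, value in attributes.items():
            if attribute in known:
                self.index.append((label, source, attribute, value))


catalog = DynamicCatalog()
# First ingest happens before the concept is defined.
catalog.ingest(12, "gridSpacing", "ARPS", "<detailed>dx=10000</detailed>", {"dx": "10000"})
catalog.define("gridSpacing", "ARPS", ["dx"])
# Second ingest is shredded; the undefined "note" attribute is skipped.
catalog.ingest(13, "gridSpacing", "ARPS", "<detailed>dx=2000</detailed>",
               {"dx": "2000", "note": "test"})
```

In this sketch nothing is lost for the first object: its CLOB is stored, so it could be shredded later once the definition exists.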

XMC Cat Builder: Concepts

Deployed in Diverse Domains • Linked Environments for Atmospheric Discovery (LEAD) • NSF funded science gateway • Metadata describing 500TB of data, intermediate results, and workflow output • Data objects each described by up to 2,202 elements • Individual workspaces of up to 15,000 objects • One Degree Imager (ODI) WIYN Consortium • Component in the data subsystem • Data-driven workflows • SEAD Project • Sustainability science • Provide search capability over archived use metadata

Comparing to a Native XML Database • Figure: Concurrent Insert/Query Execution Time in Milliseconds • Except for queries based on object IDs, XMC Cat at 8X the base workload performs better than Berkeley XML at 1/10th of the base workload. • XMC Cat experiment inserts include validation that is not reflected in the Berkeley results; with validation eliminated, XMC Cat at 8X the workload is 2,477 ms. • Projected insert and query workloads are multiples of the projected LEAD workload, based on the LEAD technical report and the insert/query ratios of the TPC-E benchmark. • Scott Jensen, Devarshi Ghoshal, and Beth Plale, Evaluation of Two XML Storage Approaches for Scientific Metadata, Indiana University CS Technical Report TR698, October 2011.

Performance Compared to Inlining • Scott Jensen and Beth Plale, Using Characteristics of Computational Science Schemas for Workflow Metadata Management, In Proceedings of the 2008 IEEE Congress on Services, IEEE 2008 Second International Workshop on Scientific Workflows (SWF 2008), Hawaii, July 2008.

Eventual Consistency: Browse versus Search Metadata • Scott Jensen and Beth Plale, Trading Consistency for Scalability in Scientific Metadata, In Proceedings of the 2010 IEEE International Conference on e-Science, Brisbane, Australia, December 2010.

Bounds on Eventual Consistency • $EC_t = W_t + T_t + R_t + S_t + I_t$ • Above times are averages for fetching a batch of 100 concepts ($T_t$ and $R_t$) and then processing each concept ($S_t$ and $I_t$). • Total wait time is dominated by $W_t$. • If the distributed shredders keep pace with the ingest rate, the frequency of the shredders fetching determines $W_t$.
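To make the role of the wait term concrete, the sketch below sums the five components of the bound for two polling intervals. All numbers are placeholders invented for illustration, not measurements from the talk, and the half-interval average wait is an added assumption (arrivals spread evenly between shredder fetches) rather than something stated on the slide.

```python
def eventual_consistency_bound(w_t, t_t, r_t, s_t, i_t):
    """Sum the five components of EC_t; all arguments in milliseconds."""
    return w_t + t_t + r_t + s_t + i_t


def average_wait(poll_interval_ms):
    # Assumption: arrivals are spread evenly between fetches, so a concept
    # waits half a polling interval on average before a shredder picks it up.
    return poll_interval_ms / 2


# Placeholder component times, invented for illustration only.
ec_fast = eventual_consistency_bound(average_wait(1_000), 20, 30, 5, 10)
ec_slow = eventual_consistency_bound(average_wait(10_000), 20, 30, 5, 10)
print(ec_fast, ec_slow)
```

With the other terms held fixed, lengthening the polling interval moves the total almost one-for-one with the wait term, which is the sense in which the wait dominates.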

Evaluation of Eventual Consistency • Strict consistency is 42% longer at 6X the base workload • Eventual consistency scales higher • Strict consistency scaled to 8X the projected workload • Mostly due to deferred shredding • Using two eventually consistent shredders on a separate server

Domain-Adaptable Metadata Search • Metadata search criteria are often limited to keywords or text, a spatial bounding box, and temporal bounds. • If rich metadata is captured as a BLOB, it is available as use metadata, but not discovery metadata. • Instead … Use domain concepts and dynamic concepts to define search criteria. • Generic architecture for shredded metadata → search criteria can include any shredded domain metadata.

Dynamic Search Definition

Search Adjusts to Domain Concepts • When the target is selected, all concepts are listed as search options, grouped by their categories. • When a concept is selected, all of its sub-concepts and elements are listed as options.

Strongly Typed Search Criteria

Current Work • Figure labels: Simulation, Forecast, Census Data, Sensor Data, Ecological Data, Satellite Data • Handle hierarchies based on multiple schemas • Experiments bringing together data from multiple sources described by different standards. • Data described by different metadata standards can be combined in a single dataset. • Metadata can be queried based on different schemas. • Faceted search • Added to XMC Cat web service. • Can alternate between facets and details. • Unified criteria for multiple schemas.

Thank You! Scott Jensen scjensen@cs.indiana.edu Thanks also to: - The NSF-funded Linked Environments for Atmospheric Discovery (LEAD) project - Data to Insight Center
