Data Ingestion in ENES and collaboration with RDA

Sandro Fiore, Ph.D.
Euro-Mediterranean Center on Climate Change Foundation
sandro.fiore@cmcc.it
INDIGO Summit
About data ingestion
May 12, 2017 - Catania

Climate Model Intercomparison Data Analysis case study and “data ingestion”
• This case study, proposed in INDIGO by CMCC, relates to the climate change community (ENES) and to multi-model analytics experiments.
• From D2.7: the objective of the data ingestion process is to make the data and metadata FAIR, i.e. "Findable, Accessible, Interoperable and Reusable". Accordingly, our definition of data ingestion is “the process that ends with the data being ready for sharing / (re-)use, following the usual community requirements”.

Earth System Modelling Workflow
Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships, Scientific Computing Lab, MPI-M) and Kerstin Fieg (Applications, Deutsches Klimarechenzentrum DKRZ). June 3-5, 2014, DKRZ, Hamburg.

"6S" Data Life Cycle in INDIGO (I): General workflow perspective
Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc.
Stage 2: Collect: covers both creation and acquisition; it is the process of getting data, in different ways. A storage service is needed as well. (output of simulations)
Stage 3: Curate: also known as “Transform”. Using the raw data collected in the previous stage, manual or automatic actions are performed on the data, which is converted and also filtered. (postprocessing)
Stage 4: Analyse: an optional step, also called “Process”, that performs different actions to give the data added value and produce new derived data. (analysis, e.g. with Ophidia, metadata checking, etc.)
Stage 5: Ingest (& Publish): includes other steps like “Access”, “Use” or “Re-use”. In this stage, data is normally associated with metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, in a format that makes it useful for further re-use. (ingestion & publication on HTTP server)
Stage 6: Preserve: "store" both data and analysis for the long term. Licenses and methods need to be taken into account. (“transfer data to a long-term storage”)

Earth System Modelling Workflow… with server-side processing
Figure labels: Curate, Publish, Preserve
Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships, Scientific Computing Lab, MPI-M) and Kerstin Fieg (Applications, Deutsches Klimarechenzentrum DKRZ). June 3-5, 2014, DKRZ, Hamburg.

"6S" Data Life Cycle in INDIGO (II): Data analysis sub-workflow perspective
Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc.
Stage 2: Collect: covers both creation and acquisition; it is the process of getting data, in different ways. A storage service is needed as well. (analysis, e.g. with Ophidia)
Stage 3: Curate: also known as “Transform”. Using the raw data collected in the previous stage, manual or automatic actions are performed on the data, which is converted and also filtered. (prepare for publication, adding metadata and checking format)
Stage 4: Analyse: an optional step, also called “Process”, that performs different actions to give the data added value and produce new derived data. (it could be related to preparing a map, preview, etc.)
Stage 5: Ingest (& Publish): includes other steps like “Access”, “Use” or “Re-use”. In this stage, data is normally associated with metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, in a format that makes it useful for further re-use. (ingestion & publication)
Stage 6: Preserve: "store" both data and analysis for the long term. Licenses and methods need to be taken into account. (“transfer data to a long-term storage (link with WP4)”)
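The two perspectives map the same six stages onto different concrete activities. A minimal Python sketch of that mapping follows; the names `Stage`, `GENERAL_WORKFLOW`, `ANALYSIS_SUBWORKFLOW`, `describe` and `stages_until_shareable` are illustrative assumptions and are not part of INDIGO or Ophidia.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six life-cycle stages, in the order given on the slides."""
    PLAN = 1
    COLLECT = 2
    CURATE = 3
    ANALYSE = 4
    INGEST_PUBLISH = 5
    PRESERVE = 6


# Concrete activity attached to each stage in perspective (I).
GENERAL_WORKFLOW = {
    Stage.COLLECT: "output of simulations",
    Stage.CURATE: "postprocessing",
    Stage.ANALYSE: "analysis (e.g. with Ophidia), metadata checking",
    Stage.INGEST_PUBLISH: "ingestion and publication on HTTP server",
    Stage.PRESERVE: "transfer data to a long-term storage",
}

# Concrete activity attached to each stage in perspective (II).
ANALYSIS_SUBWORKFLOW = {
    Stage.COLLECT: "analysis (e.g. with Ophidia)",
    Stage.CURATE: "prepare for publication: add metadata, check format",
    Stage.ANALYSE: "prepare a map, preview, etc.",
    Stage.INGEST_PUBLISH: "ingestion and publication",
    Stage.PRESERVE: "transfer data to a long-term storage (link with WP4)",
}


def describe(perspective):
    """Return (stage name, activity) pairs in life-cycle order."""
    return [
        (stage.name, perspective.get(stage, "(no concrete activity listed)"))
        for stage in Stage
    ]


def stages_until_shareable():
    """Stages covered by 'data ingestion' as defined in D2.7: everything
    up to the point where data is ready for sharing and re-use."""
    return [stage for stage in Stage if stage <= Stage.INGEST_PUBLISH]


if __name__ == "__main__":
    for name, activity in describe(ANALYSIS_SUBWORKFLOW):
        print(f"{name:15s} {activity}")
```

Reading the two dictionaries side by side shows the shift between perspectives: the analysis that sits in the Analyse stage of the general workflow becomes the Collect stage of the sub-workflow.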

RDA Europe Collaboration project: towards a provenance-aware analytics eco-system
BARRACUDA: pid-BAsed woRkflows foR climAte Change Using ophiDiA
• This RDA Europe collaboration project aims to take the multi-model climate analytics experiment case study, implemented in the context of the H2020 EU INDIGO-DataCloud project, one step forward by adopting the RDA recommendation on the PID Information Types (PIT) framework
• The provided extensions will:
• (i) make new data products interoperable,
• (ii) enable data provenance at large scale,
• (iii) enable experiment reproducibility,
• (iv) implement a more complete and interoperable workflow lifecycle in close synergy with the ESGF eco-system/services, and
• (v) build a provenance-aware analytics eco-system
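To make the provenance goal concrete, here is a small sketch of how derived data products could point back to their inputs through PIDs, so that the lineage of an analysis result can be walked. It is an in-memory illustration only: `PidRecord`, `Registry`, the attribute keys and the `demo/...` identifiers are all hypothetical and do not reproduce the RDA PIT data model or any Ophidia interface.

```python
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """A PID with typed key-value attributes and links to its inputs."""
    pid: str
    attributes: dict = field(default_factory=dict)
    derived_from: list = field(default_factory=list)  # PIDs of input data


class Registry:
    """Toy in-memory stand-in for a PID service (hypothetical)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.pid] = record

    def resolve(self, pid):
        return self._records[pid]

    def lineage(self, pid):
        """Return all ancestor PIDs of `pid`, nearest first, no repeats."""
        seen = []
        queue = list(self.resolve(pid).derived_from)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self.resolve(current).derived_from)
        return seen


def demo_registry():
    """Two model outputs feeding one multi-model analysis product."""
    reg = Registry()
    reg.register(PidRecord("demo/model-a", {"kind": "model output"}))
    reg.register(PidRecord("demo/model-b", {"kind": "model output"}))
    reg.register(PidRecord(
        "demo/ensemble-mean",
        {"kind": "analysis product"},
        derived_from=["demo/model-a", "demo/model-b"],
    ))
    return reg


if __name__ == "__main__":
    print(demo_registry().lineage("demo/ensemble-mean"))
```

Keeping the input links on the record itself is what lets a later user trace, and potentially re-run, the experiment from the published product alone.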

Workplan, results and outputs
• Project workplan (April 1, 2017 – November 30, 2017). Planned tasks include:
• Design of the Ophidia support for RDA-PIT
• Basic tests on the PID Handle service managed at DKRZ
• Implementation, testing and validation of the Ophidia support for RDA-PIT
• Integration of the PID-resolving interface in the testbed set up in the EU H2020 INDIGO-DataCloud project
• Final results:
• RDA-PIT support integrated into Ophidia
• PID Handle Service client API integrated into the large-scale INDIGO-DataCloud multi-model climate analytics experiment
• Output:
• RDA-PIT support for Ophidia available as open source
• Ophidia service deployed and running at CMCC with RDA-PIT support
• Deliverables:
• Report on the design and implementation of the RDA-PIT extensions for Ophidia
• Short user manual for using the two extensions
The results of the activity will be demonstrated at the ESGF F2F 2017 Conference

Implementation stage
• New computer engineer in the team to work on this project
• Design of the “pid-based analytics use case”
• By extending the INDIGO use case for ENES
• In-depth analysis of the RDA PIT recommendation
• Account setup on a PID Handle Service instance running at DKRZ
• Next steps:
• test CLI of the PID Handle Service
• develop simple test client applications based on the available API

Link with data ingestion in INDIGO
• INDIGO will cover the data ingestion workflow (analysis, curation, publication to HTTP server and copy to a preservation storage)
• Validation step in INDIGO could exploit the PID support provided by BARRACUDA
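As an illustration of what a simple test client from the "Next steps" might start with, the sketch below builds a resolution URL and parses a response. It assumes the public Handle System proxy REST interface (`/api/handles/<handle>` on hdl.handle.net); the slides do not describe the API of the DKRZ instance, so this is not that API. The handle `21.T00000/demo` and the sample response are made up, and no network call is performed.

```python
import json
from urllib.parse import quote

# Assumption: the global Handle proxy, not the DKRZ service from the slides.
PROXY = "https://hdl.handle.net"


def handle_api_url(handle, proxy=PROXY):
    """Build the REST resolution URL for a handle of the form prefix/suffix."""
    return f"{proxy}/api/handles/{quote(handle, safe='/')}"


def extract_values(response_text):
    """Map each value type in a resolution response to its data value.

    Raises LookupError when the response code does not signal success
    (code 1 in the Handle REST interface).
    """
    doc = json.loads(response_text)
    if doc.get("responseCode") != 1:
        raise LookupError(f"handle not resolved: {doc.get('responseCode')}")
    return {v["type"]: v["data"]["value"] for v in doc.get("values", [])}


# Hypothetical response body, shaped like a successful resolution.
SAMPLE_RESPONSE = json.dumps({
    "responseCode": 1,
    "handle": "21.T00000/demo",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://example.org/data.nc"}},
    ],
})

if __name__ == "__main__":
    print(handle_api_url("21.T00000/demo"))
    print(extract_values(SAMPLE_RESPONSE))
```

Separating URL construction and response parsing from the HTTP call keeps both parts testable offline; a real client would fetch `handle_api_url(...)` and pass the body to `extract_values`.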

Thank you
https://www.indigo-datacloud.eu
Better Software for Better Science
