
SCAPE Project

SCAPE Project: an EU project aimed at building a scalable platform for planning and executing computation-intensive processes for the ingestion or migration of large data sets, in order to help automate digital preservation.

doria

Presentation Transcript


  1. SCAPE Project
     • EU project aimed at building a scalable platform for planning and execution of computation-intensive processes for ingestion or migration of large data sets in order to help automate digital preservation
     • Digital preservation: standards + policies + technologies to ensure access to digital objects over time
     • “Preservation workflows”, “Digital objects 4 ever”
     • 42 months, in the period 2011-2014
     • 16 project partners, 22 WPs, 55 deliverables, 88 milestones, zillion mailing lists

  2. The Problem
     • Scale of data sets involved in digital preservation:
         • data sets contain a large number of objects
         • the objects can be large in size
         • or complex in structure
         • the data collections can contain heterogeneous objects (objects of different types)
     • Data formats change over time and become obsolete
     • Migrating digital objects – must ensure success
     • Reproducibility of preservation processes and collection of provenance data over the digital object’s entire lifecycle
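The last two points of the slide above (a migration must be verified, and provenance must be collected) can be illustrated with a minimal Python sketch. It is not SCAPE code: the migration chosen here (re-encoding Latin-1 text as UTF-8), the `ProvenanceRecord` type and the function names are all hypothetical, picked only to show a migrate, check, record cycle.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance entry for one migration step."""
    action: str
    source_sha256: str
    target_sha256: str
    success: bool


def migrate_latin1_to_utf8(data: bytes) -> bytes:
    """Toy migration (an assumption for illustration): re-encode text."""
    return data.decode("latin-1").encode("utf-8")


def migrate_with_check(data: bytes) -> Tuple[bytes, ProvenanceRecord]:
    """Migrate one object, verify the result, and record what happened."""
    migrated = migrate_latin1_to_utf8(data)
    # Quality assurance: the migrated object must decode to the same text.
    success = migrated.decode("utf-8") == data.decode("latin-1")
    record = ProvenanceRecord(
        action="latin-1 -> utf-8",
        source_sha256=hashlib.sha256(data).hexdigest(),
        target_sha256=hashlib.sha256(migrated).hexdigest(),
        success=success,
    )
    return migrated, record
```

Keeping checksums of both the source and the migrated object in the record is one simple way to make a later re-run of the same step comparable with the original run.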

  3. The Solution – From Project Proposal
     • The preservation processes are realised as data pipelines and described formally as Taverna workflows
     • Workflows will invoke various services for planning and execution of institutional preservation and quality assurance strategies
     • Workflows will be deployed on a large scale (using clouds) and executed over large, distributed and heterogeneous collections of complex digital objects
     • The execution of workflows will be controlled by a policy-based system, which will ensure the workflows are in line with the state of the art in digital object representation, file formats, rendering tools, etc. and detect and report any errors in a preservation process

  4. The Solution – In Practice
     • Preservation services are written in various languages
     • Use Taverna’s External Tools or Beanshells to invoke them from inside Taverna workflows
     • Preservation services need to run locally so that they can be deployed to a cluster, avoiding the bottleneck of invoking a Web service
     • Convert Taverna workflows to workflows that are executable and parallelizable on Hadoop MapReduce
     • Compile Taverna workflows to the intermediate language Jaql, which can be optimized and executed on MapReduce
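The map/reduce pattern the slide above relies on can be sketched in plain Python. This is only an illustration of the idea, not Taverna, Jaql or Hadoop code: `identify_format` is a hypothetical stand-in for a locally installed preservation tool, and a thread pool stands in for the cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def identify_format(name: str) -> str:
    """Hypothetical local tool: guess a format from the file extension."""
    _, dot, ext = name.rpartition(".")
    return ext.lower() if dot else "unknown"


def map_phase(
    objects: Sequence[str],
    tool: Callable[[str], str],
    workers: int = 4,
) -> List[Tuple[str, str]]:
    """Apply the local tool to every object independently (parallelizable)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each object with its result.
        return list(zip(objects, pool.map(tool, objects)))


def reduce_phase(pairs: List[Tuple[str, str]]) -> Counter:
    """Aggregate the per-object results into a count per format."""
    return Counter(fmt for _, fmt in pairs)
```

Because each object is processed independently by a tool that runs where the data is, the map step can be spread over many workers without every call going through a single remote service.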

  5. Benefits to Us
     • Strengthened External Tools plugin and improved support for running external services
     • Taverna workflow (potentially containing only local services) -> parallelizable Jaql workflow executable on a MapReduce cloud
     • App4Andy-style applications that process large data, use local scripts and need parallelization/optimization
     • Some extensions to myExperiment (“run wf on a cloud”) / BioCatalogue – not sure how reusable

  6. Other Projects Affecting SCAPE
     • External Tools plugin for Taverna
     • Provenance in Taverna
         • Browsing, exporting
         • We design a Taverna wf, but actually run a Jaql wf – so provenance is not being captured by Taverna?
     • Next Generation Workbench – could do with a more advanced UI
     • SCUFL2 – for conversion to Jaql workflows
         • Easier to manipulate than the current t2flow?

  7. Summary
     • Contributions
         • Taverna Workbench for workflow design
         • myExperiment VRE for sharing workflows
         • BioCatalogue catalogue for curating preservation services
         • Ontology development
     • Expectations
         • Scalability in workflow execution
         • Experiences with new domain – digital libraries
