
Data Placement for Scientific Applications in Distributed Environments


Presentation Transcript


  1. Data Placement for Scientific Applications in Distributed Environments
  Ann Chervenak (1), Ewa Deelman (1), Miron Livny (2), Mei-Hui Su (1), Rob Schuler (1), Shishir Bharathi (1), Gaurang Mehta (1), Karan Vahi (1)
  (1) USC Information Sciences Institute, (2) University of Wisconsin, Madison

  2. Motivation
  • Scientific applications often perform complex computational analyses that consume and produce large data sets
  • The placement of data onto storage systems can have a significant impact on
    • the performance of applications
    • the reliability and availability of data sets
  • We want to identify data placement policies that distribute data sets so that they can be
    • staged into or out of computations efficiently
    • replicated to improve performance and reliability
  • We study the relationship between asynchronous data placement services and workflow management systems
  • Experimental evaluation demonstrates that good placement has the potential to significantly improve workflow execution performance

  3. [Figure] Interaction Between Workflow Planner and Data Placement Service for Staging Data: the Workflow Planner sends a staging request to the Data Placement Service, which sets up data transfers to the Storage Elements, while workflow tasks run as jobs on the Compute Cluster

  4. Outline
  • Example data placement service for high energy physics (PheDEx)
  • Data placement for LIGO
  • Asynchronous data placement for scientific applications
  • Approach: integrate two components
    • Pegasus Workflow Management System
    • Data Replication Service
  • Experimental evaluation: workflow performance using asynchronous data placement
  • Ongoing and future work

  5. Example Data Placement Service: Physics Experiment Data Export (PheDEx)
  • Manages data distribution for the Compact Muon Solenoid (CMS) High Energy Physics Project
  • The high energy physics community has a hierarchical or tiered model for data distribution
    • Tier 0 at CERN: data collected, pre-processed, archived
    • Tier 1 sites: store and archive large subsets of Tier 0 data
    • Tier 2 sites: less storage, store a smaller subset of data
  • Goal of PheDEx: automate data distribution processes as much as possible
  • Data operations that must be performed include:
    • staging data from tape storage (for example, at the Tier 0 site) into disk buffers
    • wide area data transfers (to Tier 1 or Tier 2 sites)
    • validation of these transfers
    • migration from the destination disk buffer into archival storage at those sites

  6. Physics Experiment Data Export (PheDEx) (cont.)
  • The PheDEx system design involves a collection of agents or daemons running at each site
  • Each agent performs a unique task, for example:
    • agents that stage data from tape to disk
    • agents that perform data transfers
  • The agents communicate through a central database running on a multi-server Oracle database platform
  • PheDEx supports:
    • initial “push-based” hierarchical distribution from the Tier 0 site to Tier 1 sites
    • subscription-based transfer of data
      • sites or scientists subscribe to data sets of interest
      • data sets are sent to the requesting sites as they are produced
    • on-demand access to data by individual sites or scientists
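  The agent model above can be illustrated with a small sketch. This is not PheDEx code: the JDBC URL, credentials, table name (transfer_tasks), and column names are invented for the example. It only shows the general pattern of a site-local agent polling a central database for pending work of its own type.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TransferAgentSketch {

        public static void main(String[] args) throws Exception {
            // Hypothetical connection details; a real deployment would read
            // these from configuration and use the site's actual service name.
            try (Connection db = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//central-db.example.org:1521/placement",
                    "agent", "secret")) {
                while (true) {
                    try (PreparedStatement ps = db.prepareStatement(
                            "SELECT id, source_url, dest_url FROM transfer_tasks "
                                    + "WHERE site = ? AND state = 'pending'")) {
                        ps.setString(1, "Tier1_Example");
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                // A real transfer agent would hand this off to a
                                // transfer tool and then update the task's state
                                // in the central database.
                                System.out.println("would transfer "
                                        + rs.getString("source_url") + " -> "
                                        + rs.getString("dest_url"));
                            }
                        }
                    }
                    Thread.sleep(60_000); // poll the central database periodically
                }
            }
        }
    }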

  7. Existing Data Placement Services: LIGO Lightweight Data Replicator (LDR)
  • Distributes data for the Laser Interferometer Gravitational Wave Observatory (LIGO) project
  • Two types of data publishing:
    1. Detectors at Livingston and Hanford produce data sets
      • approximately a terabyte per day during LIGO experimental runs
      • data sets are copied to the main repository at Caltech, which stores them in a tape-based mass storage system
      • LIGO sites can acquire copies from Caltech or one another to provide scientists with local access to data
    2. Scientists also publish new or derived data sets as they perform analysis on existing data sets
      • e.g., data filtering or calibration may create new files
      • these new files may also be replicated at LIGO sites

  8. LIGO Lightweight Data Replicator (LDR) (cont.)
  • LDR provides a rich set of data management functionality, including
    • publication/replication
    • efficient data transfer among LIGO sites
    • distributed metadata service architecture
    • interface to local storage systems
    • validation component that verifies that files on a storage system are correctly registered in a local RLS catalog
  • LDR distributes data at LIGO sites based on metadata queries by scientists at those sites
    • currently more than 140 million files across ten locations
  • LDR is built on top of standard Grid data services:
    • GridFTP data transport, Globus Replica Location Service
  • Each LDR site initiates local data transfers using a pull model
    • runs a scheduling and transfer daemon

  9. Existing Data Placement Services
  • PheDEx and LDR are examples of data management services for science
  • Provide asynchronous data movement of large scientific data sets
    • terabytes of data
    • millions of files
  • Disseminate subsets of data to multiple sites some time after the data sets are produced
    • based on VO policies, metadata queries, subscriptions, explicit data requests
  • Stage data sets onto resources where scientists plan to run analyses

  10. Asynchronous Data Placement and Workflows
  • Goal: separate to the extent possible the activities of data placement and workflow management services
    • placement of data items is largely asynchronous with respect to workflow execution
  • Placement operations are performed
    • as data sets become available
    • according to the policies of the Virtual Organization
  • The workflow system can provide hints to the placement service regarding grouping of files, expected order of access, dependencies, etc. (a sketch of such a hint follows below)
  • In contrast, many existing workflow systems explicitly stage data onto computational nodes before execution
    • some explicit data staging may still be required
  • Data placement has the potential to
    • significantly reduce the need for on-demand data staging
    • improve workflow execution time
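  Below is a minimal sketch, in Java, of the kind of hint a workflow system might hand to a placement service. The record layout (file group, expected access order, dependencies) is an assumption made for illustration; the paper does not prescribe a concrete hint format.

    import java.util.List;

    public class PlacementHintSketch {

        /** Files that should be placed together, with an expected access order. */
        record PlacementHint(String workflowId,
                             List<String> fileGroup,      // files accessed as one collection
                             int expectedAccessOrder,     // earlier groups are needed sooner
                             List<String> dependsOnGroups) {}

        public static void main(String[] args) {
            // Hypothetical hint for one group of Montage input tiles.
            PlacementHint hint = new PlacementHint(
                    "montage-run-42",
                    List.of("2mass-tile-001.fits", "2mass-tile-002.fits"),
                    1,
                    List.of());
            System.out.println(hint);
        }
    }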

  11. Types of Data Placement: Staging Data Into Computational Analysis Efficiently
  • Large workflows composed of thousands of interdependent tasks
  • Each task allocated for execution on a computational node requires that its input files be available before computation begins
  • A placement service could take into account characteristics of data access
    • data items tend to be accessed as related collections rather than individually, so place them together on a storage system
    • data access patterns tend to be bursty, with many data placement operations taking place during the stage-in (or stage-out) phase of execution
  • This paper: focus on moving data sets asynchronously onto storage systems accessible to computational nodes
    • ideally before workflow execution begins

  12. Types of Data Placement: Staging Data Out Efficiently
  • Applications run large analyses on distributed resources (e.g., the Open Science Grid)
  • Individual nodes that run computational jobs may have limited storage capacity
  • When a job completes, the output of the job may need to be staged off the computational node
    • onto another storage system
    • before a new job can run at that node
  • A data placement service that does efficient stage-out can have a large impact on the performance of scientific workflows

  13. Types of Data Placement: Maintenance of Data Redundancy
  • Maintenance of data to provide high availability or durability, i.e., protection from data failures
  • Placement algorithms replicate data to maintain additional copies that protect against temporary or permanent failures of storage systems
    • e.g., create a new replica of a data item whenever the number of accessible replicas falls below a certain threshold
  • Algorithms may be reactive to failures or might proactively create replicas

  14. Approach: Combine Pegasus Workflow Management with the Globus Data Replication Service
  [Figure] Same interaction as slide 3, with the Workflow Planner instantiated as Pegasus and the Data Placement Service as Globus DRS: Pegasus issues staging requests to DRS, which sets up transfers to the Storage Elements serving the Compute Cluster

  15. Pegasus Workflow Management System
  • Maps from high-level, resource-independent workflow descriptions to executable workflows
    • finds appropriate resources
    • finds data described in the workflow
    • finds appropriate executables and stages them in if necessary
  • Creates an executable workflow
    • schedules the computations and creates computational nodes
    • adds data transfer and data registration nodes
  • Performs optimizations
    • node clustering
    • dynamic data cleanup
  • Relies on Condor DAGMan for correct, scalable, and reliable execution
  • Used in astronomy, earthquake science, gravitational-wave science, and other disciplines

  16.–18. Pegasus Mapping
  • Select compute resources
  • Select data sources
  • Add data stage-in and data stage-out nodes
  [Figure, built up over three slides] An abstract workflow of tasks 0–6 consuming files a–h is mapped onto compute resources R1 and R2: stage-in transfers bring inputs from the source storage S1 (e.g., a: S1->R1, b: S1->R1, b: S1->R2), an intermediate file is moved between sites (a1: R1->R2), and stage-out transfers return outputs to S1 (h: R1->S1, g: R2->S1)
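  The node-insertion step in the mapping figure above can be sketched in a few lines. This is not Pegasus code: the Task and TransferNode types and the replicaSite lookup are illustrative stand-ins for the workflow description and replica catalog, showing only how stage-in nodes are added for inputs not already at the compute site and stage-out nodes for outputs returned to archival storage.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class MappingSketch {

        record Task(String id, List<String> inputs, List<String> outputs) {}
        record TransferNode(String file, String from, String to) {}

        /** Build stage-in transfers for every input not already at the compute site. */
        static List<TransferNode> stageIn(Task t, Map<String, String> replicaSite,
                                          String computeSite) {
            List<TransferNode> transfers = new ArrayList<>();
            for (String f : t.inputs()) {
                String src = replicaSite.getOrDefault(f, "S1"); // assumed catalog lookup
                if (!src.equals(computeSite)) {
                    transfers.add(new TransferNode(f, src, computeSite));
                }
            }
            return transfers;
        }

        /** Build stage-out transfers that return outputs to the archival site. */
        static List<TransferNode> stageOut(Task t, String computeSite, String archiveSite) {
            List<TransferNode> transfers = new ArrayList<>();
            for (String f : t.outputs()) {
                transfers.add(new TransferNode(f, computeSite, archiveSite));
            }
            return transfers;
        }

        public static void main(String[] args) {
            // Task 0 from the figure: consumes a and b, produces a1, runs on R1.
            Task job0 = new Task("0", List.of("a", "b"), List.of("a1"));
            Map<String, String> replicas = Map.of("a", "S1", "b", "S1");
            System.out.println("Stage-in:  " + stageIn(job0, replicas, "R1"));
            System.out.println("Stage-out: " + stageOut(job0, "R1", "S1"));
        }
    }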

  19. Data Replication Service
  • A service that replicates a set of files at a site and registers them in a catalog for later discovery
  • Discovers replicas (possible source files) that exist in the Grid
    • uses the Globus Replica Location Service
  • Selects among source files
  • Invokes the Globus Reliable File Transfer Service to copy data
    • uses the GridFTP data transfer service
  • Registers new replicas in the replica catalog
  [Diagram] A client invokes the Data Replication Service, which drives the Reliable File Transfer Service and the Replica Location Service catalogs; data moves between GridFTP servers
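  The DRS control flow above, sketched with placeholder interfaces. The ReplicaLocationService, ReliableFileTransfer, and ReplicaCatalog types below stand in for the corresponding Globus services and are not their real APIs; the sketch only captures the discover, select, transfer, register sequence.

    import java.util.List;

    public class DrsSketch {

        interface ReplicaLocationService {      // stands in for Globus RLS
            List<String> lookupSources(String logicalFileName);
        }
        interface ReliableFileTransfer {        // stands in for Globus RFT over GridFTP
            void transfer(String sourceUrl, String destinationUrl) throws Exception;
        }
        interface ReplicaCatalog {              // registration of the new replica
            void register(String logicalFileName, String physicalUrl);
        }

        /** Replicate one logical file to a destination site and register it. */
        static void replicate(String lfn, String destUrl,
                              ReplicaLocationService rls,
                              ReliableFileTransfer rft,
                              ReplicaCatalog catalog) throws Exception {
            List<String> sources = rls.lookupSources(lfn); // 1. discover existing replicas
            if (sources.isEmpty()) {
                throw new IllegalStateException("no replica of " + lfn + " found");
            }
            String source = sources.get(0);                // 2. select a source (simplest policy)
            rft.transfer(source, destUrl);                 // 3. copy the data
            catalog.register(lfn, destUrl);                // 4. register the new replica
        }
    }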

  20. Montage Project Workflows (Bruce Berriman et al., Caltech)
  • Generates science-grade mosaics of the sky
  [Image] Galactic star formation region RCW 49

  21. Experimental Study
  • Ran Montage workflows with different mosaic sizes (in degrees)
  • Modified input data sizes to simulate larger data sets
  • Pegasus workflows: compared explicit data staging tasks with workflows that check whether data has been staged asynchronously
    • use Globus DRS to stage input data asynchronously before workflow execution begins
  • Workflows ran on a cluster with up to 50 available compute nodes, where each node has:
    • dual Pentium III 1 GHz processors
    • 1 GByte of RAM
    • Debian Sarge 3.1 operating system

  22. Experimental Study (cont.)
  • The data sets are staged onto a storage system associated with the cluster from a GridFTP server on the local area network
  • We compare the time taken to:
    • stage data using the Data Replication Service
    • run the workflow when the data are already prestaged by DRS (requiring no additional data movement by Pegasus)
    • run the workflow when Pegasus manages the data staging explicitly
  • The sum of the DRS data staging time and the Pegasus execution time for prestaged data corresponds to sequential invocation of these two services

  23. Input Sizes for Experiments

  24. Execution Times with Default Input Sizes

  25. Execution Times with Additional 2 MByte Input Files

  26. Execution Times with Additional 20 MByte Input Files

  27. Experiment Summary
  • Largest workflow
    • total input data size of 13.2 GBytes for a mosaic of 4 degrees square
  • Combining prestaging of data with DRS followed by workflow execution using Pegasus
    • improves execution time by approximately 21.4% over Pegasus performing explicit data staging
    • avoids Condor queuing overheads for data staging tasks
  • When data sets are completely prestaged before workflow execution begins
    • execution time is reduced by over 46%
  • Shows the potential advantage of combining asynchronous data placement services with workflow management

  28. Ongoing and Future Work
  • First step towards understanding the interplay between community-wide data placement services and community workflow management systems
  • Continued work on efficient stage-in and stage-out of data sets for workflows
    • initial data placement service, released October 2007
    • will allow the workflow manager or other clients to specify movement of data sets, specify groups of files, and change priorities for staging
  • Simulation studies (based on GridSim) of the workflow manager and placement service
    • allow experimentation with different placement algorithms
  • Future work
    • placement services that replicate data for performance and reliability

  29. Data Placement for Redundancy
  • Data placement services may enforce policies that attempt to maintain a certain level of redundancy
    • to provide highly available or durable access to data
  • Maintain several copies of each data item on different storage systems
  • Monitor the current state of the distributed system
    • if the number of replicas of a data item falls below the threshold specified by VO policy, create additional replicas
  • Useful when data sets are valuable and expensive to regenerate
    • the UK Quantum Chromodynamics Grid (QCDGrid) project maintains several redundant copies of data items for reliability
    • medical applications that must preserve patient records
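  A hedged sketch of the threshold policy described above, assuming generic monitor and replicator interfaces (not any particular project's code): when the count of accessible replicas drops below the VO threshold, additional copies are created on candidate storage systems until the threshold is met again.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RedundancyPolicySketch {

        interface ReplicaMonitor {
            /** Storage systems currently holding an accessible copy of the item. */
            Set<String> accessibleReplicaSites(String dataItem);
        }
        interface Replicator {
            void createReplica(String dataItem, String targetSite);
        }

        static void enforce(String dataItem, int minReplicas,
                            List<String> candidateSites,
                            ReplicaMonitor monitor, Replicator replicator) {
            // Copy the monitor's view so we can track the copies we add below.
            Set<String> current = new HashSet<>(monitor.accessibleReplicaSites(dataItem));
            for (String site : candidateSites) {
                if (current.size() >= minReplicas) break;     // policy satisfied
                if (!current.contains(site)) {
                    replicator.createReplica(dataItem, site); // repair: add a copy
                    current.add(site);
                }
            }
        }
    }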

  30. Example: Reactive Replication
  • OceanStore global distributed storage system
  • Reactive replication algorithm called Carbonite
    • models replica repair and failure rates in a system as the birth and death rates in a continuous time Markov model
    • to provide durability, the replication rate must match or exceed the average rate of failures
  • Carbonite creates a new replica when a failure is detected that decreases the number of replicas below a specified minimum

  31. Example: Proactive Replication
  • Another algorithm from the OceanStore group
  • Proactive replication algorithm called Tempo
    • creates replicas periodically at a fixed low rate
    • Tempo creates redundant copies of data items as quickly as possible using available maintenance bandwidth and disk capacity
  • Provides durability for data sets comparable to that of the reactive Carbonite algorithm
  • Uses a less variable amount of bandwidth, thus helping to keep maintenance costs predictable
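  For contrast with the reactive policy, here is a rough sketch of a Tempo-style proactive scheme: replicas are created on a timer at a fixed low rate, independent of failure detection. This is an illustration under assumed interfaces, not the OceanStore implementation.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ProactiveReplicationSketch {

        interface Replicator {
            /** Create one additional copy of the item on some storage system. */
            void createOneReplica(String dataItem);
        }

        /** Start a fixed-rate timer that adds one replica of the item per period. */
        static ScheduledExecutorService start(String dataItem, Replicator replicator,
                                              long periodSeconds) {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            // One new copy per period keeps maintenance bandwidth roughly constant,
            // independent of when failures actually occur.
            timer.scheduleAtFixedRate(() -> replicator.createOneReplica(dataItem),
                    0, periodSeconds, TimeUnit.SECONDS);
            return timer; // caller shuts the timer down once enough copies exist
        }
    }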

  32. Acknowledgements
  This work is supported by:
  • The National Science Foundation under grants CNS 0615412 and OCI 0534027
  • The Department of Energy’s Scientific Discovery through Advanced Computing II program under grant DE-FC02-06ER25757
  Thanks to the Montage team (Bruce Berriman, John Good, and Daniel Katz) for their helpful discussions and for the use of the Montage codes and 2MASS data

  33. Data Placement Service
  • DPS is a layered system made up of:
    • policy-based services
      • data staging, N-copies, publish/subscribe, push/pull, etc.
    • a reliable distribution layer
      • logging, persistence, “multicast” style transfer, bulk transfer management
  • The Bulk Transfer Service is the first layer implemented in the Reliable Distribution Layer stack

  34. Data Placement Service Architecture Overview
  [Diagram] Layered architecture, top to bottom:
  • Policy-based Placement Services: Data Staging, N-Copies, PubSub, …
  • CEDPS Event Logging Layer
  • Reliable Distribution Layer (optional layers): Persistent Transfer Layer, Multicast Transfer Layer, Bulk Transfer Layer
  • Lower level transfer services and protocols

  35. Bulk Transfer Service (BuTrS)
  A lightweight, simple, efficient utility for bulk data transfer with priorities
  • Lightweight
    • single .jar file
    • minimal 3rd-party dependencies
  • Simple
    • no admin overhead
    • basic Java API
  • Efficient
    • concurrent transfers on all available pair-wise links
    • in-memory, efficient data structures for transfer queues

  36. BuTrS v0.0 Features
  • Create bulk transfers
  • Destroy bulk transfers
  • Change the priority of an individual data transfer
  • Find a data transfer object by its (source, sink) URL pair
  • Find data transfer objects by transfer status
  • Callback notification of transfer status changes
  • Supports GridFTP third-party transfers
  • Supports most GridFTP transfer options
  • Retries failed transfers up to a defined maximum
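  A hypothetical usage sketch of these features follows. The type and method names (BulkTransferFactory, createBulkTransfer, setPriority, addStatusListener) are assumptions loosely based on the design-overview slide, not the released BuTrS API; see the release notes linked on slide 38 for the actual interface.

    import java.util.List;

    public class ButrsUsageSketch {

        // --- Placeholder types standing in for the BuTrS API ---------------
        interface Transfer {
            void setPriority(int priority);
            String status();
        }
        interface TransferCallback {
            void terminated(Transfer t);
        }
        interface BulkTransfer {
            Transfer find(String sourceUrl, String sinkUrl);
            void addStatusListener(TransferCallback cb);
            void destroy();
        }
        interface BulkTransferFactory {
            BulkTransfer createBulkTransfer(List<String[]> sourceSinkPairs, int maxRetries);
        }
        // --------------------------------------------------------------------

        static void example(BulkTransferFactory factory) {
            // Queue a bulk transfer of two files between GridFTP servers.
            BulkTransfer bulk = factory.createBulkTransfer(List.of(
                    new String[]{"gsiftp://siteA/data/f1", "gsiftp://siteB/data/f1"},
                    new String[]{"gsiftp://siteA/data/f2", "gsiftp://siteB/data/f2"}),
                    3 /* retry failed transfers up to a defined maximum */);

            // Raise the priority of one transfer, found by its (source, sink) pair.
            bulk.find("gsiftp://siteA/data/f2", "gsiftp://siteB/data/f2").setPriority(1);

            // Register a callback for transfer status changes.
            bulk.addStatusListener(t -> System.out.println("done: " + t.status()));
        }
    }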

  37. Bulk Transfer Service Design Overview
  [Diagram] A client calls createBulkTransfer(...) on the BulkTransferFactory, which creates a new BulkTransfer and hands it to the TransferScheduler; the scheduler places transfers onto per-link TransferQueues (e.g., transferQueueAB, transferQueueAC, transferQueueBA, transferQueueBC, transferQueueCA for sites A, B, C), a thread pool of transfer threads removes and executes transfers, and a TransferCallback notifies the client when a transfer is terminated
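  A simplified sketch of this design, assuming the per-link queue and thread-pool structure shown in the diagram; it is not the BuTrS source. Each (source, sink) link gets its own in-memory priority queue and a transfer thread, so all pair-wise links proceed concurrently while priorities order the work within a link.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.PriorityBlockingQueue;

    public class TransferSchedulerSketch {

        record Transfer(String sourceUrl, String sinkUrl, int priority) {}

        private final ConcurrentMap<String, PriorityBlockingQueue<Transfer>> queues =
                new ConcurrentHashMap<>();
        // Cached pool so every link can get its own long-running transfer thread.
        private final ExecutorService pool = Executors.newCachedThreadPool();

        /** Add a transfer to the queue for its (source host, sink host) link. */
        public void schedule(Transfer t, String linkKey) {
            queues.computeIfAbsent(linkKey, k -> {
                // Lower numbers drain first, i.e. smaller value = higher priority.
                PriorityBlockingQueue<Transfer> q = new PriorityBlockingQueue<>(
                        16, (a, b) -> Integer.compare(a.priority(), b.priority()));
                pool.submit(() -> drain(q));  // one transfer thread per link
                return q;
            }).add(t);
        }

        /** Transfer thread: repeatedly take the highest-priority transfer and run it. */
        private void drain(PriorityBlockingQueue<Transfer> queue) {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    execute(queue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        private void execute(Transfer t) {
            // A real implementation would invoke a GridFTP third-party transfer here.
            System.out.println("transferring " + t.sourceUrl() + " -> " + t.sinkUrl());
        }
    }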

  38. For more info…
  • See the release notes: http://www.cedps.net/wiki/index.php/Dps10ReleaseNotes
