
Data Handling for LHC: Plans and Reality


Presentation Transcript


  1. Data Handling for LHC: Plans and Reality • Tony Cass, Leader, Database Services Group, Information Technology Department • 11th July 2012

  2. Outline • HEP, CERN, LHC and LHC Experiments • LHC Computing Challenge • The Technique • In outline • In more detail • Towards the Future • Summary


  4. We are looking for rare events! • Number of events = Luminosity × Cross-section • 2010 Luminosity: 45 pb⁻¹ • Total cross-section: ~70 billion pb → ~3 trillion events!* (* N.B. only a very small fraction saved! ~250x more events to date) • Higgs (mH = 120 GeV): 17 pb → ~750 events • e.g. potentially ~1 Higgs in every 300 billion interactions!
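
As a back-of-the-envelope check, the slide's formula and numbers can simply be multiplied out; this sketch uses only the figures quoted above.

```python
# Sketch: multiply out the slide's numbers (N = L x sigma).
luminosity  = 45      # pb^-1, 2010 integrated luminosity
sigma_total = 70e9    # pb, "70 billion pb" total cross-section
sigma_higgs = 17      # pb, Higgs production at mH = 120 GeV

print(f"total events ≈ {luminosity * sigma_total:.2e}")   # 3.15e+12, i.e. ~3 trillion
print(f"Higgs events ≈ {luminosity * sigma_higgs:.0f}")   # 765, i.e. ~750
```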

  5. So the four LHC Experiments…

  6. … generate lots of data … The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors

  7. … generate lots of data … • which are reduced by online computers to a few hundred “good” events per second and recorded on disk and magnetic tape at 100–1,000 MegaBytes/sec • ~15 PetaBytes per year for all four experiments • ATLAS Z→μμ event from 2012 data with 25 reconstructed vertices • Current forecast ~23–25 PB/year, 100–120M files/year • ~20–25K 1 TB tapes/year • Archive will need to store 0.1 EB in 2014, ~1 billion files in 2015
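
The archive figures above can be cross-checked with the same kind of simple arithmetic; this is a sketch using the slide's round numbers, not official WLCG accounting.

```python
# Sketch: cross-check the archive numbers quoted above.
PB, TB = 10**15, 10**12                 # bytes

data_per_year  = 25 * PB                # upper end of the 23-25 PB/year forecast
files_per_year = 110e6                  # middle of the 100-120M files/year range
tape_capacity  = 1 * TB                 # 1 TB cartridges, as on the slide

print(f"tapes/year ≈ {data_per_year / tape_capacity:,.0f}")                  # 25,000
print(f"average file size ≈ {data_per_year / files_per_year / 1e6:.0f} MB")  # ≈ 227 MB
```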

  8. Outline • HEP, CERN, LHC and LHC Experiments • LHC Computing Challenge • The Technique • In outline • In more detail • Towards the Future • Summary

  9. What is the technique? • Break up a Massive Data Set …

  10. What is the technique? • … into lots of small pieces and distribute them around the world …

  11. What is the technique? • … analyse in parallel …

  12. What is the technique? • … gather the results …

  13. What is the technique? • … and discover the Higgs boson: • Nice result, but… • … is it novel?

  14. Is it Novel? • Maybe not novel as such, but the implementation is Terrascale computing that is widely appreciated!

  15. Outline • HEP, CERN, LHC and LHC Experiments • LHC Computing Challenge • The Technique • In outline • In more detail • Towards the Future • Summary

  16. The Grid • Timely Technology! • The WLCG project was deployed to meet LHC computing needs. • The EDG and EGEE projects organised development in Europe. (OSG and others in the US.)

  17. Grid Middleware Basics • Compute Element • Standard interface to local workload management systems (batch scheduler) • Storage Element • Standard interface to local mass storage systems • Resource Broker • Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability. • Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
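
To make the Resource Broker idea concrete, here is a toy sketch of the matching step; the site names, fields and ranking rule are invented for illustration and are not the real middleware.

```python
# Toy Resource Broker: pick sites that hold the input data and have enough CPU.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    datasets: set        # what the site's Storage Element holds
    free_cpu_hours: int  # what the site's Compute Element advertises

def broker(job_dataset, job_cpu_hours, sites):
    """Return candidate sites with the data and enough CPU, best-first."""
    ok = [s for s in sites
          if job_dataset in s.datasets and s.free_cpu_hours >= job_cpu_hours]
    return sorted(ok, key=lambda s: s.free_cpu_hours, reverse=True)

sites = [Site("Tier1-A", {"run2012B_muons"}, 5000),
         Site("Tier2-B", {"run2012B_muons"}, 200),
         Site("Tier2-C", {"run2011_jets"}, 9000)]

print([s.name for s in broker("run2012B_muons", 500, sites)])  # ['Tier1-A']
```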

  18. Job Scheduling in Practice • Issue • Grid sites generally want to maintain a high average CPU utilisation; easiest to do this if there is a local queue of work to select from when another job ends. • Users are generally interested in turnround times as well as job throughput. Turnround is reduced if jobs are held centrally until a processing slot is known to be free at a target site. • Solution: Pilot job frameworks. • Per-experiment code submits a job which chooses a work unit to run from a per-experiment queue when it is allocated an execution slot at a site. • Pilot job frameworks separate out • site responsibility for allocating CPU resources from • experiment responsibility for allocating priority between different research sub-groups. • … But note: Pilot job frameworks talk directly to the CEs and we have moved away from a generic solution to one that has a specific framework per VO (although these can be shared in principle)
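
A minimal sketch of the pilot-job pattern described above; the queue contents and priorities are made up, and real frameworks are far richer.

```python
# Toy pilot: the site batch system starts a generic pilot, which then pulls the
# highest-priority work unit from the experiment's central queue.
import heapq

central_queue = [(1, "reconstruct run 201234"),          # (priority, work unit);
                 (2, "user analysis: H->ZZ selection"),   # lower value = more urgent
                 (3, "Monte Carlo production batch 42")]
heapq.heapify(central_queue)

def pilot(slot):
    """Runs on a worker node once the site allocates an execution slot."""
    if not central_queue:
        return f"{slot}: nothing to do, exiting"
    priority, work = heapq.heappop(central_queue)
    return f"{slot}: running '{work}' (priority {priority})"

print(pilot("slot-1"))   # picks the reconstruction job
print(pilot("slot-2"))   # picks the user analysis
```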

  19. Data Issues • Reception and long-term storage • Delivery for processing and export • Distribution • Metadata distribution • [Data-flow diagram with rates of 700 MB/s, 420 MB/s, 700 MB/s, 2600 MB/s (3600 MB/s), (>4000 MB/s) and 1430 MB/s between the stages] • Scheduled work only – and we need ability to support 2x for recovery!

  20. (Mass) Storage Systems • After evaluation of commercial alternatives in the late 1990s, two tape-capable mass storage systems have been developed for HEP: • CASTOR: an integrated mass storage system • dCache: a disk pool manager that interfaces to multiple tape archives (Enstore @ FNAL, IBM’s TSM) • dCache is also used as a basic disk storage manager at Tier2s, along with the simpler DPM

  21. A Word About Tape • Our data set may be massive, but… • It is made up of many small files (~195MB average, only increasing slowly after LHC startup!) • …which is bad for tape speeds: average write drive speed < 40MB/s (cf. native drive speeds: 120–160MB/s), with only small increases with new drive generations
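
To see why small files hurt, consider a toy model with a fixed per-file start/stop overhead on top of streaming at native speed; the overhead value is an assumption for illustration, not a measured figure.

```python
# Toy model: each file costs a fixed start/stop overhead before streaming at
# native speed; the 3.5 s overhead is an assumed value for illustration.
def effective_speed(file_size_mb, native_mb_per_s, per_file_overhead_s):
    stream_time = file_size_mb / native_mb_per_s
    return file_size_mb / (stream_time + per_file_overhead_s)

native, overhead = 140, 3.5          # MB/s (mid-range of 120-160), seconds per file

for size in (195, 1000, 10000):      # MB
    print(f"{size:>6} MB files -> {effective_speed(size, native, overhead):6.1f} MB/s")
# ~195 MB files stay far below native speed; multi-GB files approach it.
```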

  22. Tape Drive Efficiency • So we have to change tape writing policy…

  23. Storage vs Recall Efficiency • Efficient data acceptance: • Have lots of input streams, spread across a number of storage servers, • wait until the storage servers are ~full, and • write the data from each storage server to tape. • Result: data recorded at the same time is scattered over many tapes. • How is the data read back? • Generally, files grouped by time of creation. • How to optimise for this? Group files on to a small number of tapes. • Ooops…
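
A toy illustration of the conflict: files arriving in time order are spread over several storage servers, each of which flushes to its own tape, so a later recall of a contiguous time range touches almost every tape (numbers invented).

```python
# Toy model: 1000 files arrive in time order, are spread round-robin over 10
# storage servers, and each server's buffer is flushed to its own tape.
n_servers, n_files = 10, 1000
tape_of_file = {i: i % n_servers for i in range(n_files)}

# Recall the first 50 files written (say, one run's worth of data):
tapes_needed = {tape_of_file[i] for i in range(50)}
print(f"50 consecutive files need {len(tapes_needed)} tape mounts")   # 10
```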

  24. Keep users away from tape

  25. CASTOR & EOS

  26. Data Distribution • The LHC experiments need to distribute millions of files between the different sites. • The File Transfer System automates this • handling failures of the underlying distribution technology (gridftp) • ensuring effective use of the bandwidth with multiple streams, and • managing the bandwidth use • ensuring ATLAS, say, is guaranteed 50% of the available bandwidth between two sites if there is data to transfer
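
A toy version of the bandwidth-share idea above, splitting a channel's transfer slots between the VOs that actually have data queued; the shares and queue sizes are illustrative, not the real FTS configuration.

```python
# Toy share allocation: split a channel's transfer slots between the VOs that
# have files queued, in proportion to their configured shares.
def allocate_slots(total_slots, shares, queued):
    """shares: VO -> fraction; queued: VO -> files waiting. Returns VO -> slots."""
    active = {vo: w for vo, w in shares.items() if queued.get(vo, 0) > 0}
    norm = sum(active.values())
    return {vo: round(total_slots * w / norm) for vo, w in active.items()}

shares = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}
print(allocate_slots(20, shares, {"atlas": 120, "cms": 40}))
# {'atlas': 12, 'cms': 8} -- LHCb has nothing queued, so its share is redistributed
```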

  27. Data Distribution • FTS uses the Storage Resource Manager (SRM) as an abstract interface to the different storage systems • A Good Idea™ but this is not (IMHO) a complete storage abstraction layer and anyway cannot hide fundamental differences in approaches to MSS design • Lots of interest in the Amazon S3 interface these days; this doesn’t try to do as much as SRM, but HEP should try to adopt de facto standards. • Once you have distributed the data, a file catalogue is needed to record which files are available where. • LFC, the LCG File Catalogue, was designed for this role as a distributed catalogue to avoid a single point of failure, but other solutions are also used • And as many other services rely on CERN, the need for a distributed catalogue is no longer (seen as…) so important.
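
In essence, a replica catalogue records which sites hold a copy of each logical file name; a minimal sketch follows (a toy dictionary with invented file names, not the LFC schema).

```python
# Toy replica catalogue: map a logical file name (LFN) to the sites holding a copy.
catalogue = {
    "/grid/atlas/data12/AOD.0001.pool.root": ["CERN", "BNL", "RAL"],
    "/grid/atlas/data12/AOD.0002.pool.root": ["CERN", "TRIUMF"],
}

def replicas(lfn):
    """Sites where a copy of this logical file is available."""
    return catalogue.get(lfn, [])

print(replicas("/grid/atlas/data12/AOD.0001.pool.root"))  # ['CERN', 'BNL', 'RAL']
```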

  28. Looking more widely — I • Only a small subset of the data distributed is actually used • Experiments don’t know a priori which datasets will be popular • CMS sees 8 orders of magnitude in access rate between the most and least popular datasets • Dynamic data replication: create copies of popular datasets at multiple sites.
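
A toy sketch of popularity-driven replication: count accesses per dataset and request extra copies of the popular ones (thresholds and dataset names are invented).

```python
# Toy popularity-driven replication: add copies of heavily accessed datasets.
from collections import Counter

accesses = Counter({"data12.muon_stream": 90000, "mc12.ttbar": 4000, "data11.rare_skim": 3})
replicas = {ds: 1 for ds in accesses}          # every dataset starts with one copy

POPULAR, MAX_COPIES = 10_000, 5
for ds, n in accesses.items():
    if n >= POPULAR and replicas[ds] < MAX_COPIES:
        replicas[ds] += 1                      # ask the transfer system for another copy

print(replicas)   # {'data12.muon_stream': 2, 'mc12.ttbar': 1, 'data11.rare_skim': 1}
```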

  29. Looking more widely — II • Network capacity is readily available… • … and it is reliable: • So let’s simply copy data from another site if it is not available locally • rather than recalling from tape or failing the job. • Inter-connectedness is increasing with the design of LHCOne to deliver (multi-) 10Gb links between Tier2s. • [MONARC 2000 model diagram: CERN (n.10⁷ MIPS, m Pbyte robot), FNAL (4.10⁷ MIPS, 110 Tbyte robot) and university sites (n.10⁶ MIPS, m Tbyte robot) with their desktops, linked at 622 Mbit/s, plus “optional air freight”] • Fibre cut during tests in 2009: capacity reduced, but alternative links took over

  30. Metadata Distribution • Conditions data is needed to make sense of the raw data from the experiments • Data on items such as temperatures, detector voltages and gas compositions is needed to turn the ~100M pixel image of the event into a meaningful description in terms of particles, tracks and momenta. • This data is in an RDBMS, Oracle at CERN, and presents interesting distribution challenges • One cannot tightly couple databases across the loosely coupled WLCG sites, for example… • Oracle Streams technology was improved to deliver the necessary performance, and HTTP caching systems were developed to address the need for cross-DBMS distribution.
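
A minimal sketch of the HTTP-caching idea for conditions data: worker nodes ask a local cache, which only contacts the central database on a miss. The query and values are invented; real deployments differ in detail.

```python
# Toy conditions cache: serve repeated queries locally, go to the central DB on a miss.
central_db = {("pixel_hv", 201234): 150.0}     # (quantity, run) -> value, held centrally
cache = {}

def get_condition(quantity, run):
    key = (quantity, run)
    if key not in cache:                       # cache miss: one trip to the central DB
        cache[key] = central_db[key]
    return cache[key]                          # subsequent calls are served locally

print(get_condition("pixel_hv", 201234))       # first call fills the cache
print(get_condition("pixel_hv", 201234))       # second call never leaves the site
```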

  31. Job Execution Environment • Jobs submitted to sites depend on large, rapidly changing libraries of experiment-specific code • Major problems ensue if updated code is not distributed to every server across the grid (remember, there are x0,000 servers…) • Shared filesystems can become a bottleneck if used as a distribution mechanism within a site. • Approaches • Pilot job framework can check to see if the execution host has the correct environment… • A global caching file system: CernVM-FS. • [CernVM-FS usage chart, 2011 — ATLAS today: 22/1.8M files; 921/115GB]
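
A small sketch of the environment check mentioned above: before pulling a payload, the pilot verifies that the software release the job needs is visible on the node, e.g. via a CernVM-FS mount. The mount path and release number here are assumed examples.

```python
# Toy environment check: only run the payload if the required release is visible.
import os

def release_available(release, base="/cvmfs/atlas.cern.ch/repo/sw"):   # assumed path
    """True if the requested release directory is visible on this worker node."""
    return os.path.isdir(os.path.join(base, release))

job = {"name": "user analysis", "release": "17.2.7"}                   # assumed values
if release_available(job["release"]):
    print(f"running {job['name']} with release {job['release']}")
else:
    print("required release not visible here; returning job to the central queue")
```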

  32. Outline • HEP, CERN, LHC and LHC Experiments • LHC Computing Challenge • The Technique • In outline • In more detail • Towards the Future • Summary

  33. Towards the Future • Learning from our mistakes • We have just completed a review of WLCG operations and services, based on 2+ years of operations, with the aim of simplifying and harmonising during the forthcoming long shutdown. • Key areas to improve are data management & access and exploiting many/multi-core architectures, especially with use of virtualisation. • Clouds • Identity Management


  36. Integrating With The Cloud? • [Diagram: a VO service feeds user jobs into a central task queue and sends instance requests to Sites A, B and C and, via cloud bursting, to a commercial cloud; running instances pull their payloads from the central task queue, using virtual machine images from a shared image repository (VMIC) kept up to date by an image maintainer] • Slide courtesy of Ulrich Schwickerath


  39. Grid Middleware Basics • Compute Element • Standard interface to local workload management systems (batch scheduler) • Storage Element • Standard interface to local mass storage systems • Resource Broker • Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability. • Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG • None of this works without…

  40. Trust!

  41. One step beyond?

  42. Outline • HEP, CERN, LHC and LHC Experiments • LHC Computing Challenge • The Technique • In outline • In more detail • Towards the Future • Summary

  43. Summary • WLCG has delivered the capability to manage and distribute the large volumes of data generated by the LHC experiments • and the excellent WLCG performance has enabled physicists to deliver results rapidly. • HEP datasets may not be the most complex or (any longer) massive, but in addressing the LHC computing challenges, the community has delivered • the world’s largest computing Grid, • practical solutions to requirements for large-scale data storage, distribution and access, and • a global trust federation enabling world-wide collaboration.

  44. And thanks to Vlado Bahyl, German Cancio, Ian Bird, Jakob Blomer, Eva Dafonte Perez, Fabiola Gianotti, Frédéric Hemmer, Jan Iven, Alberto Pace and Romain Wartel of CERN, Elisa Lanciotti of PIC and K. De, T. Maeno, and S. Panitkin of ATLAS for various unattributed graphics and slides. Thank You!
