
Experiment Plans for SC4



  1. Experiment Plans for SC4

  2. Introduction
  • Global goals and timelines for SC4
  • Experiment plans for pre-, post- and SC4 production
  • Medium-term outline for WLCG services
  • The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams.
  • Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year: reliable operation of the baseline services.

  3. LCG Service Deadlines
  • SC4 Pilot Services: stable service from 1 June 2006
  • LHC Service in operation: 1 October 2006; over the following six months, ramp up to full operational capacity & performance
  • LHC service commissioned: 1 April 2007
  • 2006: cosmics; 2007: first physics; 2008: full physics run

  4. SC4 – the Pilot LHC Service from June 2006
  • Full demonstration of experiment production
    • DAQ → Tier-0 → Tier-1: data recording, calibration, reconstruction
    • Full offline chain – Tier-1 ↔ Tier-2 data exchange: simulation, batch and end-user analysis
  • Service metrics → MoU service levels
  • Extension to most Tier-2 sites
  • Functionality: modest evolution from current services
  • Focus on reliability, performance

  5. ALICE Data Challenges 2006 • Last chance to show that things are working together (i.e. to test our computing model) • whatever does not work here is likely not to work when real data are there • So we better plan it well and do it well

  6. ALICE Data Challenges 2006
  • Three main objectives
  • Computing Data Challenge
    • Final version of rootifier / recorder
    • Online data monitoring
  • Physics Data Challenge
    • Simulation of signal events: 10^6 Pb-Pb, 10^8 p-p
    • Final version of reconstruction
    • Data analysis
  • PROOF Data Challenge
    • Preparation of the fast reconstruction / analysis framework

  7. Main points • Data flow • Realistic system stress test • Network stress test • SC4 Schedule • Analysis activity

  8. Data Flow
  • Not very fancy… always the same
  • Distributed simulation production
    • Here we stress-test the system with the number of jobs in parallel
  • Data back to CERN
  • First reconstruction at CERN
  • RAW/ESD scheduled “push-out” – here we do the network test
  • Distributed reconstruction
    • Here we stress-test the I/O subsystem
  • Distributed (batch) analysis
    • “And here comes the proof of the pudding” - FCA

  9. SC3 -> SC4 Schedule
  • February 2006
    • Rerun of SC3 disk-disk transfers (max 150 MB/s × 7 days; see the volume sketch after this list)
    • Transfers with FTD, either triggered via AliEn jobs or scheduled
    • T0 -> T1 (CCIN2P3, CNAF, Grid.Ka, RAL)
  • March 2006
    • T0-T1 “loop-back” tests at 2 × nominal rate (CERN)
    • Run bulk production @ T1, T2 (simulation + reconstruction jobs) and send data back to CERN
    • (We get ready with proof@caf)
  • April 2006
    • T0-T1 disk-disk (nominal rates), disk-tape (50-75 MB/s)
    • First push-out (T0 -> T1) of simulated data, reconstruction at T1
    • (First tests with proof@caf)
  • July 2006
    • T0-T1 disk-tape (nominal rates)
    • T1-T1, T1-T2, T2-T1 and other rates TBD according to CTDRs
    • Second chance to push out the data
    • Reconstruction at CERN and remote centres
  • September 2006
    • Scheduled analysis challenge
    • Unscheduled challenge (target T2’s?)
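
  To give a feel for the data volumes behind these rate targets, here is a small sketch converting a sustained rate and duration into total data moved. Applying the same 7-day duration to the April disk-tape figure is an assumption made only for illustration.

```python
# Back-of-the-envelope volumes behind the transfer targets above (illustration only;
# applying the 7-day duration to the April disk-tape figure is an assumption).

def volume_tb(rate_mb_per_s: float, days: float) -> float:
    """Total data moved, in TB, for a sustained rate over a number of days."""
    return rate_mb_per_s * 86400 * days / 1e6

print(f"150 MB/s x 7 days ~ {volume_tb(150, 7):.0f} TB")  # February disk-disk rerun: ~91 TB
print(f" 50 MB/s x 7 days ~ {volume_tb(50, 7):.0f} TB")   # April disk-tape, lower bound: ~30 TB
```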

  10. SC4 Rates - Scheduled Analysis
  • Users: order of 10 at the beginning of SC4
  • Input: 1.2M Pb-Pb events, 100M p-p events, ESD stored at T1s
  • Job rate: can be tuned according to the availability of resources
  • Queries to metadata catalogue: time/query to be evaluated (does not involve LCG services)
  • Job splitting
    • Can be done by AliEn according to the query result (destination set for each job)
    • CPU availability is an issue (sub-jobs should not wait too long for delayed execution)
    • Result merging can be done by a separate job
  • Network: not an issue

  11. SC4 Rates - Scheduled Analysis
  • Some (preliminary) numbers
  • Based on 20-minute jobs (an illustrative derivation follows below)
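
  The slide's table of preliminary numbers is not reproduced in this transcript. The sketch below only illustrates how a job-rate estimate of this kind can be built from the 20-minute job length; the CPU-slot count is a made-up assumption, not a value from the talk.

```python
# Hypothetical illustration of how such preliminary numbers can be derived
# (the CPU-slot count is an assumption, not a slide value).

job_length_min = 20      # from the slide: analysis jobs of ~20 minutes
cpus_available = 100     # assumed number of slots dedicated to the challenge
users = 10               # "order of 10" users at the start of SC4

jobs_per_cpu_per_day = 24 * 60 / job_length_min            # 72 jobs/day per slot
total_jobs_per_day = jobs_per_cpu_per_day * cpus_available
print(f"~{total_jobs_per_day:.0f} analysis jobs/day, "
      f"~{total_jobs_per_day / users:.0f} per user")        # ~7200/day, ~720 per user
```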

  12. SC4 Rates - Unscheduled Analysis • To be defined

  13. ATLAS SC4 Tests
  • Complete Tier-0 test
    • Internal data transfer from “Event Filter” farm to Castor disk pool, Castor tape, CPU farm
    • Calibration loop and handling of conditions data
      • Including distribution of conditions data to Tier-1s (and Tier-2s)
    • Transfer of RAW, ESD, AOD and TAG data to Tier-1s (see the rate sketch below)
    • Transfer of AOD and TAG data to Tier-2s
    • Data and dataset registration in DB (add meta-data information to meta-data DB)
  • Distributed production
    • Full simulation chain run at Tier-2s (and Tier-1s)
    • Data distribution to Tier-1s, other Tier-2s and CAF
    • Reprocessing of raw data at Tier-1s
      • Data distribution to other Tier-1s, Tier-2s and CAF
  • Distributed analysis
    • “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
    • Tests of performance of job submission, distribution and output retrieval
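
  To put the Tier-0 transfer items in context, here is an illustrative rate calculation. The trigger rate and per-event sizes are assumptions (typical ATLAS Computing Model figures of the period), not numbers taken from this talk.

```python
# Illustrative rates behind the Tier-0 -> Tier-1/Tier-2 transfer items above.
# Trigger rate and event sizes are assumptions, not values from this talk.

trigger_rate_hz = 200
event_size_mb   = {"RAW": 1.6, "ESD": 0.5, "AOD": 0.1}

for fmt, size_mb in event_size_mb.items():
    rate_mb_s  = trigger_rate_hz * size_mb
    per_day_tb = rate_mb_s * 86400 / 1e6
    print(f"{fmt}: ~{rate_mb_s:.0f} MB/s, ~{per_day_tb:.1f} TB/day through the Tier-0")
# RAW ~320 MB/s (~27.6 TB/day), ESD ~100 MB/s (~8.6 TB/day), AOD ~20 MB/s (~1.7 TB/day)
```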

  14. ATLAS SC4 Plans (1)
  • Tier-0 data flow tests:
    • Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
      • Explore limitations of current setup
      • Run real algorithmic code
      • Establish infrastructure for calib/align loop and conditions DB access
      • Study models for event streaming and file merging
      • Get input from SFO simulator placed at Point 1 (ATLAS pit)
      • Implement system monitoring infrastructure
    • Phase 1: last 3 weeks of June, with data distribution to Tier-1s
      • Run integrated data flow tests using the SC4 infrastructure for data distribution
      • Send AODs to (at least) a few Tier-2s
      • Automatic operation for O(1 week)
      • First version of shifter’s interface tools
      • Treatment of error conditions
    • Phase 2: 3-4 weeks in September-October
      • Extend data distribution to all (most) Tier-2s
      • Use 3D tools to distribute calibration data
  • The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006…
    • … but is not incompatible with other distributed operations

  15. ATLAS SC4 Plans (2)
  • ATLAS CSC includes continuous distributed simulation productions:
    • We will continue running distributed simulation productions all the time
    • Using all Grid computing resources we have available for ATLAS
    • The aim is to produce ~2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests (see the capacity sketch below)
    • We can currently manage ~1M events/week; ramping up gradually
  • SC4: distributed reprocessing tests:
    • Test of the computing model using the SC4 data management infrastructure
    • Needs file transfer capabilities between Tier-1s and back to CERN CAF
    • Also distribution of conditions data to Tier-1s (3D)
    • Storage management is also an issue
    • Could use 3 weeks in July and 3 weeks in October
  • SC4: distributed simulation intensive tests:
    • Once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions
    • As they would use the same setup both from our ProdSys and the SC4 side
    • First separately, then concurrently
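
  As a rough feel for what the ~2M events/week target implies in CPU terms, the sketch below converts it into a sustained capacity. The per-event simulation cost is a placeholder assumption (not a number from this talk); substituting the real ATLAS figure would give the actual estimate.

```python
# Rough conversion of an events/week target into sustained CPU capacity.
# The per-event cost below is a placeholder assumption, not a figure from this talk.

events_per_week   = 2e6      # target from April onwards (slide value)
ksi2k_s_per_event = 500      # assumed full simulation + reconstruction cost per event
week_s            = 7 * 86400

capacity_ksi2k = events_per_week * ksi2k_s_per_event / week_s
print(f"~{capacity_ksi2k:.0f} kSI2k of continuously used capacity")  # ~1650 kSI2k under these assumptions
```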

  16. ATLAS SC4 Plans (3)
  • Distributed analysis tests:
    • “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
    • Generate groups of jobs and simulate analysis job submission by users at home sites
      • Direct jobs needing only AODs as input to Tier-2s
      • Direct jobs needing ESDs or RAW as input to Tier-1s
      • Make preferential use of ESD and RAW samples available on disk at Tier-2s
    • Tests of performance of job submission, distribution and output retrieval
    • Test job priority and site policy schemes for many user groups and roles
    • Distributed data and dataset discovery and access through metadata, tags, data catalogues
  • Need same SC4 infrastructure as needed by distributed productions
    • Storage of job outputs for private or group-level analysis may be an issue
  • Tests can be run during Q3-4 2006
    • First a couple of weeks in July-August (after distributed production tests)
    • Then another longer period of 3-4 weeks in November

  17. Overview of requirements for SC4
  • SRM (“baseline version”) on all storage elements
  • VO Box per Tier-1 and at Tier-0
  • LFC server per Tier-1 and at Tier-0
  • FTS server per Tier-1 and at Tier-0
  • Disk-only area on all tape systems
    • Preferably separate SRM entry points for “disk” and “tape” SEs; otherwise a directory set as permanent (“durable”?) on disk (non-migratable)
    • Disk space is managed by DQ2
    • Counts as online (“disk”) data in the ATLAS Computing Model
  • Ability to install FTS ATLAS VO agents on Tier-1 and Tier-0 VO Box (see next slides)
  • Single entry point for FTS, with multiple channels/servers
  • Ability to deploy DQ2 services on the VO Box as during SC3
  • No new requirements on the Tier-2s besides an SRM SE

  18. Movement use cases for SC4
  • EF -> Tier-0 migratable area
  • Tier-0 migratable area -> Tier-1 disk
  • Tier-0 migratable area -> Tier-0 tape
  • Tier-1 disk -> same Tier-1 tape
  • Tier-1 disk -> any other Tier-1 disk
  • Tier-1 disk -> related Tier-2 disk (next slides for details)
  • Tier-2 disk -> related Tier-1 disk (next slides for details)
  • Not done:
    • Processing directly from tape (not in ATLAS Computing Model)
    • Automated multi-hop (no ‘complex’ data routing)
    • Built-in support for end-user analysis: goal is to exercise current middleware and understand its limitations (metrics)

  19. ATLAS SC4 Requirement (new!)
  • Small testbed with (part of) CERN, a few Tier-1s and a few Tier-2s to test our distributed systems (ProdSys, DDM, DA) prior to deployment
    • It would allow testing new m/w features without disturbing other operations
    • We could also properly tune the operations on our side
  • The aim is to get to the agreed scheduled time slots with an already tested system and really use the available time for relevant scaling tests
  • This setup would not interfere with concurrent large-scale tests or data transfers run by other experiments
  • A first instance of such a system would be useful already now!
    • April-May looks like a realistic request

  20. Summary of requests
  • March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
  • April-May (pre-SC4): tests of distributed operations on a “small” testbed
  • Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s
  • 3 weeks in July: distributed processing tests (Part 1)
  • 2 weeks in July-August: distributed analysis tests (Part 1)
  • 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
  • 3 weeks in October: distributed processing tests (Part 2)
  • 3-4 weeks in November: distributed analysis tests (Part 2)

  21. LHCb DC06 “Test of LHCb Computing Model using LCG Production Services”
  • Distribution of RAW data
  • Reconstruction + stripping
  • DST redistribution
  • User analysis
  • MC production
  • Use of Conditions DB (alignment + calibration)

  22. SC4 Aim for LHCb
  • Test the data-processing part of the Computing Model
  • Use 200M MC RAW events:
    • Distribute
    • Reconstruct
    • Strip and re-distribute
  • Simultaneous activities:
    • MC production
    • User analysis

  23. Preparation for SC4
  • Event generation, detector simulation & digitization
  • 100M B-physics + 100M min-bias events:
    • 3.7 MSI2k·month required (~2-3 months; see the sketch below)
    • 125 TB on MSS at Tier-0 (keep MC truth)
  • Timing:
    • Start productions mid-March
    • Full capacity end of March
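
  A quick check of what these figures imply, using only the slide's own numbers plus an assumed 30-day month:

```python
# Back-of-the-envelope check of the production request above, derived from the
# slide numbers only (the 30-day month is an assumption).

events   = 200e6           # 100M B-physics + 100M min-bias
cpu_need = 3.7e3           # 3.7 MSI2k.month expressed in kSI2k.month
month_s  = 30 * 86400      # assumed length of a month, in seconds

per_event_ksi2k_s = cpu_need * month_s / events
print(f"~{per_event_ksi2k_s:.0f} kSI2k.s per event on average")        # ~48 kSI2k.s

# Sustained farm capacity needed to finish in ~2.5 months
print(f"~{cpu_need / 2.5:.0f} kSI2k of continuously used capacity")    # ~1480 kSI2k
```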

  24. LHCb SC4 (I)
  • Timing:
    • Start June
    • Duration 2 months
  • Distribution of RAW data
    • Tier-0 MSS SRM → Tier-1 MSS SRM
    • 2 TB/day out of CERN (see the consistency check below)
    • 125 TB on MSS @ Tier1’s
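
  The RAW distribution figures are internally consistent, as this small sketch (slide values only) shows:

```python
# Consistency check of the RAW distribution numbers above (slide values only).

daily_volume_tb = 2.0      # 2 TB/day out of CERN
total_tb        = 125.0    # 125 TB to end up on MSS at the Tier-1s

rate_mb_s = daily_volume_tb * 1e6 / 86400
days      = total_tb / daily_volume_tb
print(f"~{rate_mb_s:.0f} MB/s sustained out of CERN")            # ~23 MB/s
print(f"~{days:.0f} days, roughly the stated 2-month window")    # ~62 days
```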

  25. LHCb SC4 (II)
  • Reconstruction/stripping
    • 270 kSI2k·month
    • 60 TB on MSS @ Tier1’s (full DST)
    • 1k jobs/day (following the data)
    • Job duration: 2 hours
    • 90% of jobs (Rec): input 3.6 GB, output 2 GB
    • 10% of jobs (Strip): input 20 GB, output 0.5 GB
  • DST distribution
    • 2.2 TB on disk per Tier-1 + CAF (selected DST + RAW)
  (the daily load implied by these figures is sketched below)
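
  The same slide figures translate into the following daily I/O and CPU load at the Tier-1s (a sketch using only the numbers above):

```python
# Daily load implied by the reconstruction/stripping figures above (slide values only).

jobs_per_day = 1000
rec   = {"fraction": 0.90, "in_gb": 3.6,  "out_gb": 2.0}
strip = {"fraction": 0.10, "in_gb": 20.0, "out_gb": 0.5}

in_tb  = sum(jobs_per_day * j["fraction"] * j["in_gb"]  for j in (rec, strip)) / 1e3
out_tb = sum(jobs_per_day * j["fraction"] * j["out_gb"] for j in (rec, strip)) / 1e3
slots  = jobs_per_day * 2 / 24        # 2-hour jobs -> concurrently busy CPU slots

print(f"~{in_tb:.2f} TB/day read, ~{out_tb:.2f} TB/day written")     # ~5.24 TB in, ~1.85 TB out
print(f"~{slots:.0f} CPU slots busy on average across the Tier-1s")  # ~83 concurrent jobs
```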

  26. DIRAC Tools & LCG
  • DIRAC Transfer Agent @ Tier-0 + Tier-1’s
    • FTS + SRM
  • DIRAC Production Tools
    • Production Manager console
    • Transformation Agents
  • DIRAC WMS
    • LFC + RB + CE
  • Applications:
    • GFAL: POSIX I/O via LFN

  27. To be Tested after SC4
  • Data management:
    • SRM v2
    • GridFTP 2
    • FPS
  • Workload management:
    • gLite RB?
    • gLite CE?
    • VOMS
      • Integration with MW
  • Applications:
    • Xrootd

  28. Monthly Summary (I)
  • February
    • ALICE: data transfers T0 -> T1 (CCIN2P3, CNAF, Grid.Ka, RAL)
    • ATLAS:
    • CMS:
    • LHCb:
  • March
    • ALICE: bulk production at T1/T2; data back to T0
    • ATLAS: 3-4 weeks Mar/Apr T0 tests
    • CMS: PhEDEx integration with FTS
    • LHCb: start generation of 100M B-physics + 100M min-bias events (2-3 months; 125 TB on MSS at Tier-0)
  • April
    • ALICE: first push-out of simulated data; reconstruction at T1s
    • ATLAS: see above
    • CMS: 10 TB to tape at T1s at 150 MB/s
    • LHCb: see above
    • dTeam: T0-T1 at nominal rates (disk); 50-75 MB/s (tape)
  • Extensive testing on PPS by all VOs

  29. Monthly Summary (II)
  • May
    • ALICE:
    • ATLAS:
    • CMS:
    • LHCb:
  • June
    • ALICE:
    • ATLAS: Tier-0 test (Phase 1) with data distribution to Tier-1s (3 weeks)
    • CMS: 2-week re-run of SC3 goals (beginning of month)
    • LHCb: reconstruction/stripping; 2 TB/day out of CERN; 125 TB on MSS @ Tier1’s
  • July
    • ALICE: reconstruction at CERN and remote centres
    • ATLAS:
    • CMS: bulk simulation (2 months)
    • LHCb: see above
    • dTeam: T0-T1 at full nominal rates (to tape)
  • Deployment of gLite 3.0 at major sites for SC4 production

  30. Monthly Summary (III)
  • August
    • ALICE:
    • ATLAS:
    • CMS: bulk simulation continues
    • LHCb: analysis on data from June/July … until spring ’07 or so…
  • September
    • ALICE: scheduled + unscheduled (T2s?) analysis challenges
    • ATLAS:
    • CMS:
    • LHCb: see above

  31. WLCG - Medium Term Evolution
  (flattened timeline diagram; recoverable labels:)
  • SC4
  • SRM 2: test and deployment; plan being elaborated
  • 3D distributed database services: development, test (October?)
  • Additional planned functionality: to be agreed & completed in the next few months, then tested and deployed
  • New functionality: evaluation & development cycles; possible components for later years (subject to progress & experience ??)

  32. So What Happens at the End of SC4?
  • Well prior to October we need to have all structures and procedures in place…
  • … to run, and evolve, a production service for the long term
  • This includes all aspects: monitoring, automatic problem detection, resolution, reporting, escalation, {site, user} support, accounting, review, planning for new productions, service upgrades …
  • For the precise reason that things will evolve, we should avoid over-specification…

  33. Summary
  • Two grid infrastructures are now in operation, on which we are able to build computing services for LHC
  • Reliability and performance have improved significantly over the past year
  • The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams
  • Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year: reliable operation of the baseline services
