
LCG Deployment

LCG Deployment. GridPP 18, Glasgow, 21st March 2007. Tony Cass, Leader, Fabric Infrastructure & Operations Group, IT Department. Material provided by Ian Bird, Flavia Donno, Jamie Shiers and others.


Presentation Transcript


  1. LCG Deployment GridPP 18, Glasgow, 21st March 2007 Tony Cass Leader, Fabric Infrastructure & Operations Group IT Department Material provided by Ian Bird, Flavia Donno, Jamie Shiers and others

  2. A Usual Deployment Talk… LHC experiments now transferring ~1 PB/month each • Continually increasing workloads • 50k-80k jobs per day • Feb '07: ~12,500 CPU-months/month GridPP18: LCG Deployment - 2
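For scale, the figures quoted above can be turned into sustained rates. A quick back-of-the-envelope sketch (the 30-day month and the 65k jobs/day midpoint are assumptions made for illustration):

```python
# Back-of-the-envelope rates implied by the workload figures above.
# Assumes a 30-day month; all numbers are order-of-magnitude only.

SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 s

# ~1 PB/month per experiment expressed as a sustained transfer rate
pb_per_month = 1e15                                   # bytes
rate_mb_s = pb_per_month / SECONDS_PER_MONTH / 1e6
print(f"~{rate_mb_s:.0f} MB/s sustained per experiment")   # ~386 MB/s

# ~12,500 CPU-months consumed per month means that many CPUs are,
# on average, busy around the clock; dividing by the job rate gives
# a rough average job length.
busy_cpus = 12_500
jobs_per_day = 65_000          # midpoint of the 50k-80k range
avg_job_hours = busy_cpus * 24 / jobs_per_day
print(f"~{avg_job_hours:.1f} CPU-hours per job on average")
```

So each experiment's 1 PB/month is roughly a continuous 3 Gb/s stream, and the job mix works out to jobs of a few CPU-hours each.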

  3. This Deployment Talk • Future Deployment Issues • SL(C)4 • SRM • CE & RB • Deployment Windows • Not “Deployment of Windows”!

  4. LCG Commissioning Schedule (2006-2008) • SC4 becomes the initial service when reliability and performance goals are met • The April 1st target is to allow experiments to prepare for the July 1st FDRs. The timescale is tight based on SC3/4 experience (4 months would have been better…) • Introduce residual services: full FTS services; 3D; gLite 3.x; SRM v2.2; VOMS roles; SL(C)4 • Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation • 1 July 2007: service commissioned at full 2007 capacity and performance; first collisions in the LHC • Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1)

  5. SL(C)4 Migration • The target OS level for initial LHC operation • RHES5 is out, but migration is not feasible • Experiments would like to drop SL(C)3 builds • SL3-built but SL4-compatible middleware rpms are available now • no longer any problem preventing subsequent updates • Natively built WN rpms are under test; expected in PPS next week • still some issues with the native UI build • Plan your migration!

  6. SRM v2.2 Server Status • DPM version 1.6.3 is available in production. SRM 2.2 features are still not officially certified. The implementation is stable; use-case tests are OK. Copy is not available, but interoperability tests are OK. A few general issues remain to be solved. • Volunteers to install and test SRM 2.2 features are welcome! • BeStMan and StoRM: Copy in PULL mode is not available in StoRM. Stable implementations, though some instability has recently been observed with BeStMan. Some use-case tests are still not passing and are under investigation. • dCache: Stable implementation. Copy is available and working with all implementations excluding DRM. Working on some use-case tests. • Requires migration to v1.8.0 (which will support v1.1 & v2.2 transparently); beta version in April. • CASTOR: The implementation has improved remarkably, with a lot of progress during the last 3 weeks. The main causes of instability have been found and fixed. Use-case tests are OK. Copy is not yet implemented, but interoperability tests are OK. • Stress tests are running at CERN now, and at CNAF from next week, but an upgrade to the underlying CASTOR version is required for efficient operation • Deployment at CERN is scheduled for mid-May; CNAF and RAL will follow soon afterwards, to be ready for production use by July 1st.

  7. SRM v2.2 Client Status • FTS • SRM client code has been unit-tested and integrated into FTS • Tested against DPM, dCache and StoRM; tests against CASTOR and DRM have started • Released to the development testbed • GFAL/lcg-utils • New rpms are available on the test UI and are being officially certified. No outstanding issues at the moment; ATLAS has started some tests. • GLUE • V1.3 of the schema is available • http://glueschema.forge.cnaf.infn.it/Spec/V13

  8. gLite WMS & LCG RB • Reliability of the gLite WMS is being addressed with high priority • not yet ready to replace the LCG RB • no plans (or effort?) to migrate the LCG RB to SL4 • Acceptance criteria for the RB have been agreed, based on performance requirements from ATLAS and CMS

  9. gLite WMS criteria • A single WMS machine should demonstrate submission rates of at least 10K jobs/day sustained over 5 days, during which time the WMS services, including the L&B, should not need to be restarted. This performance level should be reachable with both bulk and single job submission. • During this 5-day test the performance must not degrade significantly due to filling of internal queues, memory consumption, etc.; i.e. the submission rate on day 5 should be the same as that on day 1. • Proxy renewal must work at the 98% level: i.e. <2% of jobs should fail due to proxy renewal problems (the real failure rate should be lower because jobs may be retried). • The number of stale jobs after 5 days must be <1%. • The L&B data and job states must be verified: • A reasonable time after submission has ended, there should be no jobs in "transient" or "cancelled" states • If jobs are very short, no jobs should stay in the "running" state for more than a few hours • After the proxy expires, all jobs must be in a final state (Done-Success or Aborted)
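The acceptance criteria above are mechanical enough to script against a 5-day test run. A minimal sketch, assuming hypothetical per-day counters (the function and its inputs are illustrative, not part of any real WMS tooling; only the thresholds come from the slide):

```python
# Hedged sketch: evaluate a 5-day WMS test against the acceptance
# criteria above. Thresholds from the slide: >=10K jobs/day sustained
# over 5 days, no day-5 degradation, <2% proxy-renewal failures,
# <1% stale jobs. Input counters are hypothetical.

def wms_acceptance(daily_submitted, proxy_failures, stale_jobs):
    total = sum(daily_submitted)
    return {
        # sustained rate: every one of the 5 days at or above 10K jobs
        "rate": len(daily_submitted) == 5
                and all(d >= 10_000 for d in daily_submitted),
        # crude "no degradation" check: day 5 at least matches day 1
        "no_degradation": daily_submitted[-1] >= daily_submitted[0],
        "proxy_renewal": proxy_failures / total < 0.02,
        "stale": stale_jobs / total < 0.01,
    }

checks = wms_acceptance(
    daily_submitted=[11_200, 11_050, 10_900, 11_300, 11_250],
    proxy_failures=420,   # ~0.75% of 55,700 jobs
    stale_jobs=180,       # ~0.32%
)
print(checks)
```

All four checks pass for these example numbers; a real evaluation would of course pull the counters from the L&B rather than hard-code them.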

  10. gLite CE • Similarly for the gLite CE • it is not yet reliable enough • reliability criteria have been defined • no port of the LCG CE to SL4 is foreseen • For both the WMS and the CE, development progress against the reliability criteria is reviewed weekly • Deployment of the gLite versions is not recommended at this stage, but • if you do have them installed, please keep them running and track developments to help in testing, and • be ready to deploy when production-ready code becomes available!

  11. gLite CE criteria • Performance: • 2007 dress rehearsals: • 5000 simultaneous jobs per CE node • 50 user/role/submission-node combinations (Condor_C instances) per CE node • End 2007: • 5000 simultaneous jobs per CE node (assuming the same machine as in 2007, but expect this to improve) • 1 CE node should support an unlimited number of user/role/submission-node combinations, from at least 10 VOs, up to the limit on the number of jobs (might be achieved with 1 Condor_C per VO, with user switching done by glexec in blah) • Reliability: • Job failure rate due to the CE in normal operation: <0.5%; job failures due to restart of CE services or CE reboot: <0.5% • 2007 dress rehearsals: • 5 days unattended running, with performance on day 5 equivalent to that on day 1 • End 2007: • 1 month unattended running without performance degradation
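The CE reliability thresholds reduce to two failure-rate comparisons. A sketch (the function and counter names are invented for illustration; only the 0.5% limits come from the slide):

```python
# Hedged sketch of the CE reliability criteria above: job failures
# attributable to the CE in normal operation, and failures caused by
# CE service restarts or reboots, must each stay below 0.5%.
# Counter names are illustrative, not from any real CE monitoring.

def ce_reliability_ok(total_jobs, ce_failures, restart_failures,
                      limit=0.005):
    return (ce_failures / total_jobs < limit
            and restart_failures / total_jobs < limit)

print(ce_reliability_ok(200_000, 800, 600))     # 0.4% and 0.3% -> True
print(ce_reliability_ok(200_000, 1_200, 600))   # 0.6% CE failures -> False
```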


  13. Planning future deployments • Still a number of components to deploy before data taking • and even before the dress rehearsals • Remember, there is no longer a “big bang” model; individual components are released as they are ready • be prepared… • Deployment/Intervention Scheduling • discussed at the January workshop: when is “the least inconvenient time”? • has been discussed since then at the LCG Experiment Coordination Meeting, but with no consensus. Opinion seems to be that this is not an issue for the “engineering run” • last system changes in September/October, then things kept stable for the short run • the situation for 2008 is to be decided before the run • Whatever is decided, clear and early announcement of changes leads to ready acceptance by users…

  14. WLCG Intervention Scheduling • Scheduled service interventions shall normally be performed outside of the announced period of operation of the LHC accelerator. • In the event of mandatory interventions during the operation period of the accelerator – such as a non-critical security patch – an announcement will be made using the Communication Interface for Central (CIC) operations portal and the period of scheduled downtime entered in the Grid Operations Centre (GOC) database (GOCDB). • Such an announcement shall be made at least one working day in advance for interventions of up to 4 hours. • Interventions resulting in significant service interruption or degradation longer than 4 hours and up to 12 hours shall be announced at the Weekly Operations meeting prior to the intervention, with a reminder sent via the CIC portal as above. • Interventions exceeding 12 hours must be announced at least one week in advance, following the procedure above. • A further announcement shall be made once normal service has been resumed. • [deleted] • Intervention planning should also anticipate any interruptions to jobs running in the site batch queues; if appropriate, the queues should be drained and closed for further job submission.
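The announcement rules above amount to a simple mapping from expected downtime to required notice. A sketch (the function name and summary strings are invented for illustration; they paraphrase rather than quote the official procedure):

```python
# Maps a scheduled intervention's expected duration to the notice
# required by the WLCG scheduling rules above. Durations in hours;
# the returned strings are illustrative summaries, not official text.

def required_notice(duration_hours: float) -> str:
    if duration_hours <= 4:
        return ("at least 1 working day in advance, announced via the "
                "CIC portal with a GOCDB downtime entry")
    if duration_hours <= 12:
        return ("announce at the Weekly Operations meeting beforehand, "
                "with a reminder via the CIC portal")
    return "at least 1 week in advance, following the same procedure"

for hours in (2, 8, 48):
    print(f"{hours:>2}h downtime -> {required_notice(hours)}")
```

Note the boundaries follow the slide's wording: "up to 4 hours" and "longer than 4 hours and up to 12 hours", so exactly 4 h falls in the first band and exactly 12 h in the second.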

