GridPP Status “How Ready are We for LHC Data-Taking?”

GridPP Status “How Ready are We for LHC Data-Taking?” Tony Doyle

Outline • Exec2 Summary • 2006 Outturn • The Year Ahead.. • 2007 Status • Some problems to solve.. • “All 6’s and 7’s”? 2007 GridPP18 Collaboration Meeting

Exec2 Summary
• 2006 was the second full year for the UK Production Grid
• More than 5,000 CPUs and more than 1/2 Petabyte of disk storage
• The UK is the largest CPU provider on the EGEE Grid, with total CPU used of 15 GSI2k-hours in 2006
• The GridPP2 project has met 69% of its original targets with 92% of the metrics within specification
• The initial LCG Grid Service is now starting and will run for the first 6 months of 2007
• The aim is to continue to improve reliability and performance ready for startup of the full Grid service on 1st July 2007
• The GridPP2 project has been extended by 7 months to April 2008
• The outcome of the GridPP3 proposal is now known
• We anticipate a challenging period in the year ahead

Grid Overview
• Aim: by 2008 (full year’s data taking)
• CPU ~100 MSI2k (100,000 CPUs)
• Storage ~80 PB
• Involving >100 institutes worldwide
• Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
• Prototype went live in September 2003 in 12 countries
• Extensively tested by the LHC experiments in September 2004
• February 2006: 25,547 CPUs, 4398 TB storage
Status in 2007: 177 sites, 32,412 CPUs, 13,282 TB storage
Monitoring via Grid Operations Centre

Resources: 2006 CPU Usage by Region
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Via APEL accounting

2006 Outturn
Definitions:
• "Promised" is the total that was planned at the Tier-1/A (in the March 2005 planning) and Tier-2s (in the October 2004 Tier-2 MoU) for CPU and storage
• "Delivered" is the total that was physically installed for use by GridPP, including LCG and SAMGrid at Tier-2 and LCG and BaBar at Tier-1/A
• "Available" is available for LCG Grid use, i.e. declared via the EGEE mechanisms with storage via an SRM interface
• "Used" is as accounted for by the Grid Operations Centre

2006 Outturn: Resources Delivered
Tier-1 and Tier-2 total delivery is impressive and usage is improved
Available CPU: 8.5 MSI2k; Storage: 1.7 PB; Disk: 0.54 PB
Delivery of Tier-1 disk
Used CPU: 15 GSI2k-hours; Disk: 0.26 PB
Usage of Tier-2 CPU, disk
Request: PPARC acceptance of the 2006 outturn (next week)

LCG CPU Usage

Efficiency (measured by UK Tier-1 for all VOs)
• ~90% CPU efficiency due to i/o bottlenecks is OK
• Concern that this fell to ~75% [plot annotation: target]
• Each experiment needs to work to improve their system/deployment practice, anticipating e.g. hanging gridftp connections during batch work
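The efficiency figures above can be illustrated with a minimal sketch. It assumes that "CPU efficiency" means CPU time divided by wall-clock time, which the slide does not spell out, and the VO names and job records are hypothetical. A job stuck on a hanging transfer accumulates wall-clock time without CPU time, which is what drags the ratio down.

```python
from typing import Dict, List, Tuple

# Assumption: CPU efficiency = total CPU seconds / total wall-clock seconds.
# The ~90% threshold is the "OK" level quoted on the slide.
OK_THRESHOLD = 0.90


def cpu_efficiency(jobs: List[Tuple[float, float]]) -> float:
    """Aggregate efficiency for a list of (cpu_seconds, wall_seconds) jobs."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    if total_wall <= 0:
        raise ValueError("total wall-clock time must be positive")
    return total_cpu / total_wall


def vos_below_threshold(
    jobs_by_vo: Dict[str, List[Tuple[float, float]]],
    threshold: float = OK_THRESHOLD,
) -> List[str]:
    """Return the VOs whose aggregate efficiency is below the threshold."""
    return sorted(
        vo for vo, jobs in jobs_by_vo.items()
        if cpu_efficiency(jobs) < threshold
    )


# Hypothetical records: "vo-b" has one job that waited on i/o for most
# of its wall-clock time.
example = {
    "vo-a": [(950.0, 1000.0), (1900.0, 2000.0)],
    "vo-b": [(900.0, 1000.0), (100.0, 1000.0)],
}
```

With these made-up records, `vos_below_threshold(example)` would single out `vo-b`, whose aggregate efficiency is 50%.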

CPU by experiment

UK Resources: 2006 CPU Usage by experiment

LCG Disk Usage

File Transfers (individual rates)
http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary
Current goals:
• >250 Mb/s inbound-only
• >300-500 Mb/s outbound-only
• >200 Mb/s inbound and outbound
Aim: to maintain data transfers at a sustainable level as part of experiment service challenges
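To give a feel for what these rate goals mean in volume, here is a small conversion sketch. It assumes "Mb/s" means decimal megabits per second and that the rate is sustained for a full day; the function name is illustrative only.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(rate_mbit_per_s: float) -> float:
    """Convert a sustained rate in Mb/s (assumed decimal megabits per
    second) into terabytes (10**12 bytes) transferred per day."""
    bytes_per_second = rate_mbit_per_s * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


# Goals quoted on the slide, in Mb/s.
goals = {"inbound-only": 250, "inbound and outbound": 200}
daily_volume = {name: tb_per_day(rate) for name, rate in goals.items()}
```

Under these assumptions a sustained 250 Mb/s corresponds to about 2.7 TB per day, and 200 Mb/s to about 2.2 TB per day.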

The Year Ahead..
[Timeline figure, 2001-2008: GridPP1, GridPP2, GridPP2+, GridPP3; EDG, EGEE-I, EGEE-II, EGI (?); LHC first collisions at 900 GeV, then 14 TeV collisions and LHC data taking. Annotation: “Don’t Panic”]

CMS: Magnet & Cosmics Test (August 06); Detector Lowering (January 07)

[Diagram: DIRAC WMS. Components shown: Job Receiver, Data Optimizer, JobDB, Task Queue, Matcher, Agent Director, Agent Monitor, Job Monitor, Job JDL, Job Input Sandbox (DIRAC services); RB, CE, SE, LFC, VO-box (LCG services); Pilot Job, Pilot Agent, Job Wrapper, User Application (workload on WN). Labelled steps: checkData, checkJob, getReplicas, checkPilot, getSandbox, uploadData, putRequest, fork, execute]

ALICE PDC’06
• The longest running Data Challenge in ALICE
• A comprehensive test of the ALICE Computing model
• Running already for 9 months non-stop: approaching data taking regime of operation
• Participating: 55 computing centres on 4 continents: 6 Tier 1s, 49 T2s
• 7 MSI2k·hours (1500 CPUs running continuously)
• 685K Grid jobs total
  • 530K production
  • 53K DAQ
  • 102K user !!!
• 40M evts, 0.5 PB generated, reconstructed and stored
• User analysis ongoing
• FTS tests T0->T1 Sep-Dec
• Design goal 300 MB/s reached but not maintained
• 0.7 PB DAQ data registered

The Year Ahead.. WLCG Commissioning Schedule
2006, 2007: AliRoot & Condition fwks; SEs & Job priorities; Continue DC mode, as per WLCG commissioning; Combined T0 test; DA for calibration ready; Finalisation of CAF & Grid
2008: The real thing
• SC4 – becomes initial service when reliability and performance goals met
• Introduce residual services: Full FTS services; 3D; gLite 3.x; SRM v2.2; VOMS roles; SL(C)4
• Continued testing of computing models, basic services
• Testing DAQ→Tier-0 & integrating into DAQ→Tier-0→Tier-1 data flow
• Building up end-user analysis support
• Exercising the computing systems, ramping up job rates, data management performance, ….
• Initial service commissioning – increase performance, reliability, capacity to target levels, experience in monitoring, 24 x 7 operation, ….
• 01jul07 - service commissioned - full 2007 capacity, performance
• First collisions in the LHC. Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1).

Resources: 2007 CPU Usage by Region
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Via APEL accounting

UK Resources: 2007 CPU Usage by experiment

The Year Ahead.. Hardware Outlook: Planning for 2007..
• A profiled ramp-up of resources is planned throughout 2007 to meet the UK requirements of the LHC and other experiments
• The results are available for the Tier-1 and Tier-2s
• The Tier-1/A Board reviewed UK input to International MoU negotiations for the LHC experiments as well as providing input to the International Finance Committee for BaBar
• For LCG, the 2007 commitment for disk and CPU capacity can be met out of existing hardware already delivered

T2 Resources
e.g. Glasgow: UKI-SCOTGRID-GLASGOW
• 800 kSI2k
• 100 TB DPM
Needed for LHC start-up
• IC-HEP: 440 kSI2k, 52 TB dCache
• Brunel: 260 kSI2k, 5 TB DPM
[Figure labels: August 28, September 1, October 13, October 23]

Efficiency (measured by UK Tier-1 for all VOs)
• ~90% CPU efficiency due to i/o bottlenecks is OK
• Concern that this is falling further
• Current transition from dCache to CASTOR at the Tier-1 contributes to the problem [see Andrew’s talk]
• NB March is a mid-month figure
• Each experiment needs to work to improve their system

ATLAS User Tests
• Many problems identified and fixed at individual sites (GridPP DTeam)
• Other ‘Generic’ system failures that need to be addressed before fit for widespread use by inexperienced users
• Production teams mostly ‘work round’ these
• Users can’t/won’t
[Plot annotations: More sites and tests introduced; System failures]

CMS User Jobs
Status (Sunday)
Production jobs now outnumbered by analysis and unknown jobs
Analysis (CRAB) efficiency OK? e.g. RAL 93.3%
http://lxarda09.cern.ch/dashboard/request.py/jobsummary

Popular(?) Messages
• Data recorded in the experiment dashboards
• Initially only data from CMS (dashboard)
• Now more and more data from ATLAS as well
• CMS: mostly analysis; ATLAS: dominated by production
• We expect to have “all” types of jobs soon

Resource Broker
• Use RBs at RAL (2) and Imperial
• Broke about once a week and all jobs lost or in limbo
• Never clear to user why
• Switch to a different RB
• Users don’t know how to do this
• Barely usable for bulk submission – too much latency
• Can barely submit and query ~20 jobs in 20 mins before next submission
• Users will want to do more than this
• Cancelling jobs doesn’t work properly – often fails and repeated attempts cause RB to fall over
• Users will not cancel jobs
• (We know EDG RB is deprecated but gLite RB isn’t currently deployed)
• Work ongoing to improve RB availability and BDII (failover system) at Tier-1..

Information System
• lcg-info is used to find out what version of ATLAS software is available before submitting a job to a site, but it is too unreliable and the previous answer needs to be kept track of
• An ldap query typically gives a quick, reliable answer but lcg-info doesn’t
• The lcg-info command is very slow (querying *.ac.uk or xxx.ac.uk) and often fails completely
• Different BDIIs seem to give different results and it is not clear to users which one to use (if the default fails)
• Many problems with UK SEs have made the creation of replicas painful - it is not helped by frequent BDII timeouts
• The FDR freedom of choice tool causes some problems because sites fail SAM tests because the job queues are full
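The workaround described above (try another BDII if the default fails, and keep track of the previous answer) can be sketched generically. This is only an illustration: the endpoint names and the `query` callable are hypothetical stand-ins, not a real lcg-info or LDAP client API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: query(endpoint, site) returns a list of software tags
# and raises OSError (which includes TimeoutError) when the endpoint fails.
Query = Callable[[str, str], List[str]]


class FailoverLookup:
    """Try each information-system endpoint in turn; fall back to the
    last good answer if every endpoint fails."""

    def __init__(self, endpoints: Sequence[str], query: Query) -> None:
        self.endpoints = list(endpoints)
        self.query = query
        self.last_good: Dict[str, List[str]] = {}

    def lookup(self, site: str) -> Tuple[List[str], str]:
        """Return (answer, source); source is the endpoint used or 'cache'."""
        for endpoint in self.endpoints:
            try:
                answer = self.query(endpoint, site)
            except OSError:
                continue  # this endpoint timed out or failed; try the next
            self.last_good[site] = answer
            return answer, endpoint
        if site in self.last_good:
            return self.last_good[site], "cache"
        raise LookupError(f"no endpoint answered for {site!r}")


def fake_query(endpoint: str, site: str) -> List[str]:
    """Stand-in for a real query: the first hypothetical endpoint always
    times out, the second answers."""
    if endpoint == "bdii-a.example":
        raise TimeoutError("query timed out")
    return ["sw-tag-1"]


lookup = FailoverLookup(["bdii-a.example", "bdii-b.example"], fake_query)
```

Recording which source answered makes it visible to the user when a stale cached answer is being used rather than a live one.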

UI and Proxies
User Interface
• Users need local UIs (where their files are)
• These can be set up by local system managers but generally these are not Grid experts
• The local UI setup controls what RB, BDII, LFC etc all the users using that UI get and these appear to be pretty random
• There needs to be clear guidance on which of these to use and how to change them if things go wrong
Proxy Certificates
• These cause a lot of grief as the default 12 hours is not long enough
• If the certificate expires it's not always clear from the error messages when running jobs fail
• They can be created with longer lifetimes but this starts to violate security policies
• Users will violate these policies
• Maybe MyProxy solves this but do users know?
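One way to make the proxy problem visible before a job fails is a pre-submission check. The sketch below is an illustration only: it assumes the proxy expiry time and an estimate of the job's queue-plus-run time are already known, and the function name and one-hour safety margin are hypothetical choices. It does not call any real proxy tool.

```python
from datetime import datetime, timedelta

# Default proxy lifetime quoted on the slide.
DEFAULT_PROXY_LIFETIME = timedelta(hours=12)


def proxy_covers_job(
    proxy_expiry: datetime,
    expected_duration: timedelta,
    now: datetime,
    safety_margin: timedelta = timedelta(hours=1),  # hypothetical margin
) -> bool:
    """True if the proxy stays valid for the expected queue-plus-run time
    of the job, with a safety margin on top."""
    remaining = proxy_expiry - now
    return remaining >= expected_duration + safety_margin


# Example: a proxy created at 09:00 with the default lifetime.
created = datetime(2007, 3, 20, 9, 0)
expiry = created + DEFAULT_PROXY_LIFETIME
short_job_ok = proxy_covers_job(expiry, timedelta(hours=4), now=created)
long_job_ok = proxy_covers_job(expiry, timedelta(hours=16), now=created)
```

A submission wrapper could refuse, or warn, when the check fails, which gives the user a clear message up front instead of an obscure failure later.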

GGUS
• GGUS is used to report some of these problems but it is not very satisfactory
• The initial response is usually quite quick saying it has been passed to X but then the response after that is very patchy
• Usually there is some sort of acknowledgement but rarely a solution and often the ticket is never closed even if the problem was transitory and now irrelevant
• There are two particular cases which GGUS does not handle well:
  a) Something breaks and probably just needs to be rebooted: the system is just too slow and it's better to email someone (if you know whom)
  b) Something breaks and is rebooted/repaired etc but the underlying cause is a bug in the middleware: this doesn't seem to be fed back to the developers
• There are also of course some known problems that take ages to be fixed (e.g. the globus port range bug, rfio libraries, ...)
• More generally, the GGUS system is working at the ~tens (up to 100) of tickets/week level but may not scale as new users start using the system

Usability
• The Grid is a great success for Monte Carlo production
• However it is not in a fit state for a basic user analysis
• The tools are not suitable for bulk operations by normal users
• Current users therefore set up ad-hoc scripts that can be mis-configured
• ‘System failures’ are too frequent (largely independent of the VO, probably location-independent)
• The User experience is poor
• Improved overall system stability is needed
• Template UI configuration (being worked on) will help
• Wider adoption of VO-specific user interfaces may help
• Users need more (directed) guidance
• There is not long to go
• Usability task force required?

The Year Ahead..
Milestones:
• 3D tested by ATLAS and LHCb
• 3D used for the conditions DB
• SRM 2.2 implementations
• SRM 2.2 tested by experiments
• SLC4 Migration
• gLite CE
• new RB
• FTS v2
• VOMS scheduling priorities
• 24x7 definition and 24x7 test scenario
• VO boxes SLA
• VO boxes implementation
• Accounting data into APEL repository
• Automated Accounting reports
Responsibility column (as extracted): Tier-1; Tier-1; Tier-1 & Tier-2; Tier-1 & Tier-2; Tier-1 & Tier-2; Tier-1 & Tier-2; Tier-1; Tier-1 & Tier-2; Experiments; Tier-1; Tier-1; Tier-1 & Tier-2; Tier-1 & Tier-2; GOC; GOC

The Year Ahead.. Example: FTS 2.0 schedule
• FTS 2.0 currently deployed on pilot service @ CERN
• In testing since December: running dteam tests to stress-test it
• This is the ‘uncertified’ code
• Next step: open pilot to experiments to verify full backwards compatibility with experiment code
• Arranging this now
• Deployment at CERN T0 scheduled in April 2007
• Goal is April 1, but this is tight
• Subject to successful verification
• Roll-out to T1 sites a month after that
• We expect at least one update will be needed to the April version

The Year Ahead.. Site Reliability
Tier-0, Tier-1: 2007 SAM targets (monthly average)
• Target for each site: 91% by Jun 07; 93% by Dec 07
• Taking 8 best sites: 93% by Jun 07; 95% by Dec 07
Tier-2s: “Begin reporting the monthly averages, but do not set targets yet”
• ~80% by Jun 07; ~90% by Dec 07
SAM tests (critical = subset):
• BDII: Top-level BDII
• sBDII: Site BDII
• FTS: File Transfer Service
• gCE: gLite Computing Element
• LFC: Global LFC
• VOMS: VOMS
• CE: Computing Element
• SRM: SRM
• gRB: gLite Resource Broker
• MyProxy: MyProxy
• RB: Resource Broker
• VOBOX: VO BOX
• SE: Storage Element
• RGMA: RGMA Registry
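A sketch of how the June 2007 Tier-0/Tier-1 targets might be checked against monthly SAM averages. It reads "taking 8 best sites" as the mean of the eight highest monthly averages, which is an interpretation rather than something the slide states, and the site names and figures are made up.

```python
from typing import Dict, List, Tuple

# Targets quoted on the slide for June 2007.
SITE_TARGET = 0.91        # each site
BEST_SITES_TARGET = 0.93  # "taking 8 best sites"
BEST_N = 8


def check_reliability(
    monthly_avg: Dict[str, float],
    site_target: float = SITE_TARGET,
    best_n: int = BEST_N,
    best_target: float = BEST_SITES_TARGET,
) -> Tuple[List[str], float, bool]:
    """Return (sites below the per-site target, mean of the best_n sites,
    whether that mean meets the best-sites target).

    Assumption: the best-sites figure is the mean of the best_n highest
    monthly averages."""
    if not monthly_avg:
        raise ValueError("no sites given")
    below = sorted(s for s, v in monthly_avg.items() if v < site_target)
    best = sorted(monthly_avg.values(), reverse=True)[:best_n]
    best_mean = sum(best) / len(best)
    return below, best_mean, best_mean >= best_target


# Hypothetical monthly averages for ten sites.
sample = {
    "site-01": 0.99, "site-02": 0.98, "site-03": 0.97, "site-04": 0.96,
    "site-05": 0.95, "site-06": 0.94, "site-07": 0.93, "site-08": 0.92,
    "site-09": 0.90, "site-10": 0.80,
}
below, best_mean, best_met = check_reliability(sample)
```

With these made-up figures two sites miss the per-site target while the best-eight mean of 95.5% clears the 93% level.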

Summary
• Exec2 Summary – status OK
• 2006 Outturn – some issues
• The Year Ahead..
• Some problems to solve.. (AKA challenges)
• The weather is fine
• We need to set some targets
2007

You can see ~home if you look up
Devoncove Hotel, 931 Sauchiehall Street, Glasgow, G3 7TQ. Tel: 0141 334 4000
Sandyford Hotel, 904 Sauchiehall Street, Glasgow, G3 7TF. Tel: 0141 334 0000
