
Tier-1 Status


Presentation Transcript


  1. Tier-1 Status Andrew Sansum GRIDPP20 12 March 2008

  2. Tier-1 Capacity delivered to WLCG (2007) [charts with the RAL contribution highlighted]

  3. Tier-1 CPU Share by 2007 MoU

  4. Wall Time

  5. CPU Use by VO (2007) [chart: ATLAS, ALICE, CMS, LHCb]

  6. Experiment Shares (2008)

  7. Grid Only
  • Non-Grid access to the Tier-1 has now ended. Only special cases (contact us if you believe you are one) now have access to UIs and job submission.
  • Until end of May 2008: IDs will be maintained (disabled), home directories will remain online, and mail forwarding will continue.
  • After end of May 2008: IDs will be deleted, the home filesystem and mail spool will be backed up, and mail forwarding will stop.
  • The AFS service continues for BaBar (and just in case).

  8. Reliability
  • February: mainly due to the power failure plus 8 hours of network downtime.
  • December/January: mainly CASTOR problems over the Christmas period (despite multiple callouts).
  • Out-of-hours on-call will help, but some problems take time to diagnose and fix. (A rough availability calculation follows below.)
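
To put the reliability numbers in context, here is a back-of-the-envelope availability calculation. The 8-hour network figure is from the slide; the length of the power-failure outage is not quantified here, so the value below is an illustrative assumption, not a measured figure.

```python
# Rough availability arithmetic for February 2008 (29 days).
hours_in_feb = 29 * 24          # 696 hours
network_outage = 8.0            # hours (from the slide)
power_outage = 30.0             # hours, assumed for illustration only

availability = 1 - (network_outage + power_outage) / hours_in_feb
print(f"February availability ~ {availability:.1%}")   # ~94.5%
```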

  9. Power Failure: Thursday 7 February, 13:00
  • Work on the power supply had been under way since December: down to 1 transformer (from 2) for extended periods (weeks), increasing the risk of disaster, with the single transformer running at maximum operating load.
  • No problems until the work finished and the casing was closed: a control line was crushed and the power supply tripped, causing total loss of power to the whole building. This was the first power interruption in over 3 years.
  • Restart (effort > 200 FTE hours): most global/national/Tier-1 core systems up by Thursday evening; most of CASTOR and part of the batch farm up by Friday; remaining batch on Saturday; still problems to iron out in CASTOR on Monday/Tuesday.
  • Lessons: communication was prompt and sufficient but ad hoc; broadcast was unavailable because RAL runs the GOCDB (now fixed by caching); the careful restart of disk servers was slow and labour-intensive (but worked) and will not scale.
  • See: http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/

  10. Hardware: Disk
  • Production capacity: 138 servers, 2800 drives, 850 TB (usable).
  • 1.6 PB capacity delivered in January by Viglen:
  • 91 Supermicro 3U servers with dual AMD 2220E (2.8 GHz) dual-core CPUs, 8 GB RAM, IPMI; 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 250 GB WD HDD; 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750 GB WD HDD.
  • 91 Supermicro 3U servers with dual Intel E5310 (1.6 GHz) quad-core CPUs, 8 GB RAM, IPMI; 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 400 GB Seagate HDD; 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750 GB Seagate HDD.
  • Acceptance tests running; scheduled to be available end of March.
  • 5400 spinning drives after the planned phase-out in April, so expect a drive failure every 3 days (worked out below).
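
The "drive failure every 3 days" expectation is easy to sanity-check. The annual failure rate used below is an assumption (a few percent is typical for nearline drives of that era), not a figure from the slide.

```python
# Expected drive-failure interval for the post-April drive count.
drives = 5400                 # spinning drives (from the slide)
annual_failure_rate = 0.02    # assumed ~2% AFR, not stated on the slide

failures_per_year = drives * annual_failure_rate      # ~108
days_between_failures = 365 / failures_per_year       # ~3.4 days
print(f"~1 failure every {days_between_failures:.1f} days")
```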

  11. Hardware: CPU
  • Production: about 1500 KSI2K on 600 systems.
  • Recently upgraded about 50% of capacity to 2 GB/core.
  • Recent procurement (approximately 3000 KSI2K, but YMMV) delivered and under test (a rough cross-check follows below).
  • Streamline: 57 x 1U servers (114 systems, 3 racks); each system has dual Intel E5410 (2.33 GHz) quad-core CPUs, 2 GB/core, 1 x 500 GB HDD.
  • Clustervision: 56 x 1U servers (112 systems, 4 racks); each system has dual Intel E5440 (2.83 GHz) quad-core CPUs, 2 GB/core, 1 x 500 GB HDD.
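
As a rough cross-check of the quoted figures, for orientation only; the per-system and per-core ratings below are derived here, not stated on the slide.

```python
# Rough rating implied by the new procurement.
systems = 114 + 112            # Streamline + Clustervision half-1U systems
cores_per_system = 8           # dual quad-core
total_ksi2k = 3000             # approximate, "YMMV" per the slide

print(f"{systems} systems, {systems * cores_per_system} cores")
print(f"~{total_ksi2k / systems:.1f} KSI2K per system, "
      f"~{total_ksi2k / (systems * cores_per_system):.2f} KSI2K per core")
```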

  12. Hardware: Tape
  • Tape drives: 8 9940B drives, used for the legacy ADS/dCache service and to be phased out soon.
  • 18 T10K tape drives and associated servers delivered; 15 in production, the remainder soon.
  • Planned bandwidth 50 MB/s per drive; actual bandwidth 8-80 MB/s, a work in progress (see the aggregate figures below).
  • Media: approximately 2 PB on site.
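
For a sense of the aggregate numbers (the per-drive rates are taken straight from the slide; treating all 15 production drives as concurrently busy is a simplification):

```python
# Aggregate tape bandwidth implied by the planned per-drive rate.
drives_in_production = 15
planned_rate = 50          # MB/s per drive (planned)
observed_range = (8, 80)   # MB/s per drive actually seen so far

print(f"planned aggregate: {drives_in_production * planned_rate} MB/s")
print(f"observed per-drive range: {observed_range[0]}-{observed_range[1]} MB/s")
```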

  13. Hardware: Network [diagram: RAL site network - Tier-1 CPU and disk stacks on Nortel 5510/5530 switch stacks feeding a Force10 C300 8-slot router (64 x 10 Gb); 10 Gb/s links to the OPN router (10 Gb/s to CERN), the site access router (10 Gb/s to SJ5) and a firewall bypass; Router A and the site firewall; N x 1 Gb/s to the RAL Tier 2; ADS caches and Oracle systems; 1 Gb/s test link to Lancaster]

  14. RAL links [diagram: link status legend - implemented / implement soon / never]

  15. Backplane Failures (Supermicro)
  • 3 servers "burnt out" a backplane; 2 of these set off VESDA, and 1 called out the fire brigade.
  • Safety risk assessment: urgent rectification needed.
  • Good response from supplier/manufacturer: PCB fault in a "bad batch"; replacement nearly complete.

  16. Machine Rooms
  • Existing machine room: approximately 100 racks of equipment; getting close to power/cooling capacity.
  • New machine room: work still proceeding close to schedule; 800 m² can accommodate 300 racks + 5 robots; 2.3 MW power/cooling capacity (some UPS); scheduled to be available for September 2008.
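
For scale, the new-room figures imply the average rack power density below; this is a simple average derived from the slide's numbers, not a stated design value.

```python
# Average power density implied by the new machine room figures.
total_power_kw = 2300      # 2.3 MW power/cooling capacity
racks = 300                # rack capacity (robots ignored here)

print(f"~{total_power_kw / racks:.1f} kW per rack on average")   # ~7.7 kW
```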

  17. CASTOR Memory Lane [timeline, 4Q05-1Q08: CASTOR1 tests OK; CASTOR2 core running but hard to install, with many dependencies; 2.1.2 bad; 2.1.3 good but missing functionality; CMS on CASTOR for CSA06, encouraging; production service declared; problems with functionality and performance ("it doesn't work!"); oversight committees note improvement but remain concerned; service stopped for extended upgrade; 2.1.4 upgrade goes well, disk1 support; ATLAS and LHCb on CASTOR; CSA07 encouraging; CCRC08 reasonably successful. Happy days!]

  18. Growth in Use of CASTOR

  19. CASTOR Test Architecture [diagram: three instances - Certification Testbed, Development and Preproduction - each with an Oracle name server + vmgr, shared services, Oracle stager/DLF/repack databases, LSF, a tape server and 1 disk server (variable)]

  20. CASTOR Production Architecture [diagram: two name servers + vmgr on Oracle, shared services and tape servers; separate stager instances for CMS, ATLAS, LHCb and a repack/small-user instance, each with its own Oracle stager and DLF databases, stager/DLF daemons and LSF; dedicated disk server pools per experiment instance and 1 disk server for the repack instance]

  21. ATLAS Data Flow Model [diagram: flows between the T0, other T1s, T2s and the local farm across the D0T1, D1T0, D1T1 and D0T0 storage classes - RAW from the T0, ESD/AODm/TAG exchanged with partner T1s, AODm2/TAG served to T2s, simulated RAW and strip input from the farm]

  22. CMS Dataflow [diagram: all pools are disk0tape1 - FarmRead (50 LSF slots per server; batch farm, tape recall, disk-to-disk copy), WanIn (8 LSF slots per server; transfers from T0, T1 and T2, disk-to-disk copy), WanOut (16 LSF slots per server; transfers to T1 and T2)]

  23. CMS Disk Server Tuning: CSA06/CSA07
  • Problem: network performance too low. Increased the default/maximum TCP window sizes, increased the ring buffers and tx queue length, and changed the ext3 journal to data=writeback.
  • Problem: performance still too low. Reduced the number of gridftp slots per server and the number of streams per file.
  • Problem: PhEDEx transfers now time out. Reduced FTS slots to match the disk pools.
  • Problem: servers sticky or crashing with OOM. Limited total TCP buffer space, protected low memory, and flushed the cache aggressively. (A hedged sketch of this class of tuning follows below.)
  • See: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Disk_Server_Tuning
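
The following is a minimal sketch of the kind of kernel and NIC tuning listed above, assuming a Linux disk server. The actual values used at RAL are documented on the wiki page cited on the slide; every number, the interface name and the fstab line below are placeholders, not the Tier-1's production settings. Run as root.

```python
#!/usr/bin/env python3
"""Illustrative disk-server network/VM tuning (placeholder values only)."""

import subprocess

IFACE = "eth0"   # hypothetical interface name; adjust per server

SYSCTLS = {
    # Larger default/maximum TCP window sizes for WAN transfers.
    "net.core.rmem_max": "8388608",
    "net.core.wmem_max": "8388608",
    "net.ipv4.tcp_rmem": "4096 87380 8388608",
    "net.ipv4.tcp_wmem": "4096 65536 8388608",
    # Cap total TCP buffer space so many gridftp streams cannot exhaust memory.
    "net.ipv4.tcp_mem": "196608 262144 393216",
    # Protect low memory and flush the page cache more aggressively.
    "vm.min_free_kbytes": "65536",
    "vm.dirty_ratio": "20",
    "vm.dirty_background_ratio": "5",
}


def set_sysctl(key: str, value: str) -> None:
    """Write a sysctl value via /proc/sys."""
    path = "/proc/sys/" + key.replace(".", "/")
    with open(path, "w") as f:
        f.write(value + "\n")


def main() -> None:
    for key, value in SYSCTLS.items():
        set_sysctl(key, value)
        print(f"{key} = {value}")

    # Larger NIC ring buffers and transmit queue (supported limits depend on the NIC).
    subprocess.run(["ethtool", "-G", IFACE, "rx", "4096", "tx", "4096"], check=False)
    subprocess.run(["ip", "link", "set", "dev", IFACE, "txqueuelen", "10000"], check=True)

    # The ext3 journal change is a mount option, e.g. in /etc/fstab
    # (hypothetical device and mount point):
    #   /dev/sdb1  /exportstage  ext3  defaults,data=writeback  0 2


if __name__ == "__main__":
    main()
```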

  24. 3Ware Write Throughput

  25. CCRC08 Disk Server Tuning
  • Migration rate to tape very bad (5-10 MB/s) when concurrent with writing data to disk; was OK in CSA06 (50 MB/s per server) on the Areca servers.
  • 3ware 9550 performance is terrible under concurrent read/write (2 MB/s read, 120 MB/s write); 3ware appears to prioritise writes.
  • Tried many tweaks, most with little success, except:
  • Either: change the elevator to anticipatory. Downside: write throughput is reduced. Good under benchmarking; testing in production this week.
  • Or: increase the block-device read-ahead. Read throughput is high but erratic under test, but seems OK in production (30 MB/s per server). (A sketch of both settings follows below.)
  • See: http://www.gridpp.rl.ac.uk/blog/2008/02/29/3ware-raid-controllers-and-tape-migration-rates/
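
Below is a minimal sketch of how the two mitigations above can be applied on a 2.6-era Linux disk server via sysfs. The device names and the read-ahead value are illustrative assumptions; the production choice at RAL is described in the blog post linked above. Run as root.

```python
#!/usr/bin/env python3
"""Apply either the anticipatory elevator or a larger read-ahead per device."""

DEVICES = ["sda", "sdb"]   # hypothetical data-array devices


def set_scheduler(dev: str, scheduler: str) -> None:
    # On 2.6-era kernels the anticipatory elevator is selected by writing
    # "anticipatory" to the queue's scheduler file.
    with open(f"/sys/block/{dev}/queue/scheduler", "w") as f:
        f.write(scheduler + "\n")


def set_readahead_kb(dev: str, kb: int) -> None:
    # Equivalent to `blockdev --setra`, which counts 512-byte sectors.
    with open(f"/sys/block/{dev}/queue/read_ahead_kb", "w") as f:
        f.write(str(kb) + "\n")


if __name__ == "__main__":
    for dev in DEVICES:
        # Option 1: anticipatory elevator (reads recover, writes drop).
        set_scheduler(dev, "anticipatory")
        # Option 2 (alternative): much larger read-ahead than the 128 KB default.
        # set_readahead_kb(dev, 8192)
        print(dev, "tuned")
```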

  26. CCRC08 (CMS WanIn) [charts: network in/out at around 300 MB/s, PhEDEx rate, tape migration queue, Tier-0 rate and CPU load]

  27. CCRC08 (WanOut) [charts: network in/out at around 300 MB/s, PhEDEx rate and CPU load, before and after replication]

  28. CASTOR Plans for May CCRC08
  • Still problems: optimising end-to-end transfer performance remains a balancing act, and the complex configuration is hard to manage.
  • Working on: ALICE/xrootd deployment; preparation for the 2.1.6 upgrade; installation of Oracle RACs (resilient Oracle services for CASTOR); provisioning and configuration management.

  29. dCache Closure
  • Agreed with the UB that we would give 6 months' notice before terminating the dCache service; closure announced to the UB for May 2008.
  • ATLAS and LHCb are working to migrate their data, but migration is slower than hoped.
  • The service is much reduced in size now (10-12 servers remain) and the operational overhead is much lower.
  • Migration of the remaining non-LHC experiments is delayed by the low priority of non-CCRC work; work on the Gen instance of CASTOR will recommence shortly.
  • Pragmatically, closure may be delayed by several months until MINOS and the tiny VOs have migrated.

  30. Termination of GridPP Use of the ADS Service
  • GridPP funding and use of the old legacy Atlas Datastore service is scheduled to end at the end of March 2008.
  • After this there will be no GridPP access via the "tape" command, and no access via the C-callable VTP interface.
  • RAL will continue to operate the ADS service, and experiments are free to purchase capacity directly from the Datastore team.
  • Pragmatically, closure cannot happen until dCache ends (it uses the ADS back end) and CASTOR is available for small VOs; probably 6 months away.

  31. Conclusions
  • Hardware for the 2008 MoU is in the machine room and moving satisfactorily through acceptance. Volume is not yet a problem, but warning signs are starting to appear.
  • The CASTOR situation continues to improve: reliable during CCRC08; hardware performance improving; the tape migration problem is reasonably understood and partly solved, with scope for further improvement; various upgrades progressing.
  • The remaining Tier-1 infrastructure is essentially problem free.
  • Availability is fair but stagnating; we need to progress incident response staffing, on-call, disaster planning and national/global/cluster resilience.
  • Concerned that we have still not seen all experiment use cases.
