RHIC Computing Facility at BNL - Overview

The RHIC Computing Facility at BNL
Ofer Rind, RHIC Computing Facility, Brookhaven National Laboratory
HEPIX-HEPNT, Vancouver, BC, Canada, October 20, 2003

RCF - Overview
• Brookhaven National Lab is a multi-disciplinary DOE research laboratory
• RCF formed in the mid-90’s to provide computing infrastructure for the RHIC experiments. Named US Atlas Tier 1 computing center in late 90’s
• Currently supports both HENP and HEP scientific computing efforts as well as various general services (backup, email, web hosting, off-site data transfer…)
• 25 FTEs (expanding soon)
• RHIC Run-3 completed in Spring. Run-4 slated to begin in Dec/Jan

RCF Structure

Mass Storage
• 4 StorageTek tape silos managed by HPSS (v4.5)
• Upgraded to 37 9940B drives (200GB/cartridge) prior to Run-3 (~2 months to migrate data)
• Total data store of 836TB (~4500TB capacity)
• Aggregate bandwidth up to 700MB/s; expect 300MB/s in next run
• 9 data movers with 9TB of disk (Future: array to be fully replaced after next run with faster disk)
• Access via pftp and HSI, both integrated with K5 authentication (Future: authentication through Globus certificates)
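
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only illustrative: it assumes decimal units (1 TB = 1000 GB), full uncompressed cartridges, and bandwidth spread evenly over all drives, none of which the slide states. The function and key names are made up for the example.

```python
# Figures quoted on the Mass Storage slide.
STORED_TB = 836        # total data store
CAPACITY_TB = 4500     # approximate silo capacity
CARTRIDGE_GB = 200     # 9940B cartridge size
DRIVES = 37            # 9940B drives after the upgrade
PEAK_MB_S = 700        # aggregate bandwidth ceiling
EXPECTED_MB_S = 300    # rate expected in the next run


def storage_summary():
    """Derive rough utilization numbers (assumes 1 TB = 1000 GB, full cartridges)."""
    return {
        # Fraction of the silo capacity already holding data.
        "fill_fraction": STORED_TB / CAPACITY_TB,
        # Lower bound on cartridges in use if every cartridge were full.
        "min_cartridges_used": STORED_TB * 1000 / CARTRIDGE_GB,
        # Cartridges needed to reach the quoted capacity.
        "cartridges_at_capacity": CAPACITY_TB * 1000 / CARTRIDGE_GB,
        # Peak aggregate rate divided evenly over all drives.
        "peak_mb_s_per_drive": PEAK_MB_S / DRIVES,
        # Volume written per day at the expected sustained rate.
        "expected_tb_per_day": EXPECTED_MB_S * 86400 / 1_000_000,
    }


summary = storage_summary()
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under these assumptions the store is a bit under one fifth full, and a sustained 300MB/s corresponds to roughly 26TB written per day.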

Mass Storage

Centralized Disk Storage
• Large SAN served via NFS
  • Processed data store + user home directories and scratch
  • 16 Brocade switches and 150TB of Fibre Channel RAID5 managed by Veritas (MTI & Zzyzx peripherals)
  • 25 Sun servers (E450 & V480) running Solaris 8 (load issues with nfsd and mountd precluded update to Solaris 9)
  • Can deliver data to farm at up to 55MB/sec/server
• RHIC and USAtlas AFS cells
  • Software repository + user home directories
  • Total of 11 AIX servers, 1.2TB (RHIC) & 0.5TB (Atlas)
  • Transarc on server side, OpenAFS on client side
  • RHIC cell recently renamed (standardized)

Centralized Disk Storage
[Figure labels: E450’s, MTI, Zzyzx]

The Linux Farm
• 1097 dual Intel CPU VA and IBM rackmounted servers, total of 918 kSpecInt2000
• Nodes allocated by experiment and further divided for reconstruction & analysis
• 1GB memory typically + 1.5GB swap
• Combination of local SCSI & IDE disk with aggregate storage of >120TB available to users
• Experiments starting to make significant use of local disk through custom job schedulers, data repository managers and rootd

The Linux Farm

The Linux Farm
• Most RHIC nodes recently upgraded to latest RH8 rev. (Atlas still at RH7.3)
• Installation of customized image via Kickstart server
• Support for networked file systems (NFS, AFS) as well as distributed local data storage
• Support for open source and commercial compilers (gcc, PGI, Intel) and debuggers (gdb, totalview, Intel)

Linux Farm - Batch Management
• Central Reconstruction Farm
  • Up to now, data reconstruction was managed by a locally produced Perl-based batch system
  • Over the past year, this has been completely rewritten as a Python-based custom frontend to Condor
  • Leverages DAGMan functionality to manage job dependencies
  • User defines task using JDL identical to former system, then Python DAG-builder creates job and submits to Condor pool
  • Tk GUI provided to users to manage their own jobs
  • Job progress and file transfer status monitored via Python interface to a MySQL backend

Linux Farm - Batch Management
• Central Reconstruction Farm (cont.)
  • New system solves scalability problems of former system
  • Currently deployed for one experiment, with others expected to follow prior to Run-4
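
The slides do not show the DAG-builder itself, so the following is only a rough sketch of the idea: turn a task description into a DAGMan input file whose `JOB` and `PARENT ... CHILD` lines encode the job dependencies. The function name, the three-step stage/reconstruct/store task, and the submit-file names are all hypothetical; the facility's JDL format is not reproduced here.

```python
def build_dag(jobs, dependencies):
    """Return DAGMan input text.

    jobs: dict mapping a job name to its Condor submit-description file.
    dependencies: list of (parent, child) job-name pairs.
    """
    lines = []
    # One JOB line per node in the graph.
    for name, submit_file in jobs.items():
        lines.append(f"JOB {name} {submit_file}")
    # One PARENT/CHILD line per dependency edge.
    for parent, child in dependencies:
        for name in (parent, child):
            if name not in jobs:
                raise ValueError(f"unknown job in dependency: {name}")
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


# Hypothetical reconstruction task: stage input, reconstruct, store output.
dag_text = build_dag(
    jobs={"stage": "stage.sub", "reco": "reco.sub", "store": "store.sub"},
    dependencies=[("stage", "reco"), ("reco", "store")],
)
print(dag_text)
```

Written to a file, text like this would normally be handed to Condor with `condor_submit_dag`; DAGMan then releases each job only after its parents have completed.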

Linux Farm - Batch Management
• Central Analysis Farm
  • LSF 5.1 licensed on virtually all nodes, allowing use of CRS nodes in between data reconstruction runs
  • One master for all RHIC queues, one for Atlas
  • Allows efficient use of limited hardware, including moderation of NFS server loads through (voluntary) shared resources
  • Peak dispatch rates of up to 350K jobs/week and 6K+ jobs/hour
  • Condor is being deployed and tested as a possible complement or replacement; still nascent, awaiting some features expected in upcoming release
  • Both accepting jobs through Globus gatekeepers
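
To put the two peak dispatch figures on the same footing, the weekly peak can be converted to an hourly average and compared with the hourly peak. This is a back-of-the-envelope sketch; the function name is invented, and "6K+" is treated as exactly 6000, so the burst factor is a lower bound.

```python
HOURS_PER_WEEK = 7 * 24


def dispatch_summary(peak_jobs_per_week=350_000, peak_jobs_per_hour=6_000):
    """Compare the peak-week average dispatch rate with the peak hourly rate."""
    avg_per_hour = peak_jobs_per_week / HOURS_PER_WEEK
    return {
        "avg_jobs_per_hour_in_peak_week": avg_per_hour,
        "avg_jobs_per_minute_in_peak_week": avg_per_hour / 60,
        # How far the busiest hour exceeds the peak-week average.
        "burst_factor": peak_jobs_per_hour / avg_per_hour,
    }


for key, value in dispatch_summary().items():
    print(f"{key}: {value:.2f}")
```

On these numbers the busiest week averages a little over 2000 jobs per hour, so the busiest hour runs at nearly three times that rate.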

Security & Authentication
• Two layers of firewall with limited network services and limited interactive access exclusively through secured gateways
• Conversion to Kerberos5-based single sign-on paradigm
  • Simplify life by consolidating password databases (NIS/Unix, SMB, email, AFS, Web). SSH gateway authentication → password-less access inside facility with automatic AFS token acquisition
  • RCF Status: AFS/K5 fully integrated, dual K5/NIS authentication with NIS to be eliminated soon
  • USAtlas Status: “K4”/K5 parallel authentication paths for AFS with full K5 integration on Nov. 1, NIS passwords already gone
• Ongoing work to integrate K5/AFS with LSF, solve credential forwarding issues with multihomed hosts, and implement a Kerberos certificate authority

US Atlas Grid Testbed
[Diagram labels: giis01, Information Server, LSF (Condor) pool, amds, Mover, HPSS, AFS server, Globus RLS Server, aftpexp00, Globus-client, Gatekeeper, Job manager, aafs, 70MB/S, GridFtp, atlas02, Grid Job Requests, 17TB Disks, Internet]
Local Grid development currently focused on monitoring and user management

Monitoring & Control
• Facility monitored by a cornucopia of vendor-provided, open-source and home-grown software. Recently:
  • Ganglia was deployed on the entire farm, as well as the disk servers
  • Python-based “Farm Alert” scripts were changed from SSH push (slow), to multi-threaded SSH pull (still too slow), to TCP/IP push, which finally solved the scalability issues
• Cluster management software is a requirement for Linux farm purchases (VACM, xCAT)
  • Console access, power up/down…really came in useful this summer!
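
The slide gives no details of the “Farm Alert” protocol, so the sketch below only illustrates the general TCP push pattern: each node opens a short-lived connection and sends its own status to a central collector, rather than the collector logging in to every node over SSH. The JSON-line message format, the class names, and `push_report` are assumptions made for this example.

```python
import json
import socket
import socketserver
import threading


class AlertHandler(socketserver.StreamRequestHandler):
    """Collector side: read one JSON line per connection and record it."""

    def handle(self):
        report = json.loads(self.rfile.readline())
        with self.server.lock:
            self.server.reports[report["node"]] = report
        # Acknowledge only after the report has been stored.
        self.wfile.write(b"OK\n")


class AlertCollector(socketserver.ThreadingTCPServer):
    """Threaded TCP server keeping the latest report from each node."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, AlertHandler)
        self.reports = {}
        self.lock = threading.Lock()


def push_report(address, node, status):
    """Node side: push one status report and wait for the acknowledgement."""
    payload = json.dumps({"node": node, "status": status}).encode() + b"\n"
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall(payload)
        return conn.makefile("rb").readline().strip() == b"OK"
```

A collector can be started with `AlertCollector(("127.0.0.1", 0))` and `serve_forever()` in a background thread, after which each node calls `push_report(collector.server_address, ...)`. Because the nodes initiate the connections, the collector's cost per report is a single small read, which is the kind of property that helps a push design scale to a farm of this size.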

The Great Blackout of ‘03

Future Plans & Initiatives
• Linux farm expansion this winter: addition of >100 2U servers packed with local disk
• Plans to move beyond NFS-served SAN with more scalable solutions:
  • Panasas: file system striping at block level over distributed clients
  • dCache: potential for managing distributed disk repository
• Continuing development of grid services with increasing implementation by the two large RHIC experiments
• Very successful RHIC run with a large high-quality dataset!
