
ATLAS Tier 1 Meeting





Presentation Transcript


  1. ATLAS Tier 1 Meeting Linux Farm & Facility Operations May 22, 2007 Lyon, France

  2. Outline • Overview • Linux Farm • Condor • Virtualization • Monitoring • Facility Operations • Experience with RHIC operations • A New Operational Model for the RACF • Summary

  3. Overview • The RACF is a large-scale, heterogeneous, multi-purpose facility at Brookhaven with 24x7, year-round support. • Supports computing activities for the RHIC experiments, ATLAS and LSST (processing, storage, www, email, backup, printing, etc.). • Primary computing facility for the RHIC experiments; Tier 1 Center for ATLAS computing in the U.S.; a component of LSST computing. • Over 7 PB of tape storage capacity, 200+ TB of centralized disk storage, 1.3+ PB of distributed storage and almost 5 million SI2K of computing capacity. • Maintained and operated by 36 staff members (and growing).

  4. Tape Storage System

  5. Disk Storage

  6. Linux Farm

  7. Linux Farm (cont.) • Majority of processing power and distributed storage at the RACF. • Consumes significant infrastructure resources (power, cooling, space and network). • Selected interactive systems provide access to Linux Farm resources. • The challenge is managing large numbers of similar servers, not high-availability systems → innovation and automation keep the staff workload manageable. • Migration to dual-core in 2006 (quad-core evaluation in 2008). • Migration to 32-bit SL 4.x (for now).

  8. Linux Farm (cont.) • Established procedure for evaluation and procurement of Linux Farm computing and Linux Farm-based distributed storage. • Adding 1.5 million SI2K and 800 TB of distributed storage in 2007. • Total of 5,500 CPUs (Intel and AMD) in nearly 1,700 rack-mounted servers by mid-2007. • Productivity gains from: • Condor • Virtualization • Meeting projected resource requirements for ATLAS will be a challenge.

  9. RACF Linux Farm Computing Power

  10. RACF Linux Farm Distributed Storage

  11. Expected Computing Capacity Evolution

  12. Expected Storage Capacity Evolution

  13. Projected Power Requirements

  14. Projected Space Requirements

  15. Condor • Replaced LSF in 2004 (still some legacy LSF support for the STAR experiment). • Steep learning curve for Condor configuration → required significant man-hours in meetings with the developers. • Custom configuration for each experiment. • Common (general) queue for all experiments allows back-fill of unused CPU cycles on a low-priority basis. • Higher utilization, but also more complexity.

  16. Condor Policy for ATLAS

  17. Condor Configuration for ATLAS

  18. RCF/ACF Farm Occupancy (Condor general queue fully enabled in Aug. 2006)

  19. Condor Jobs in the RACF Linux Farm (transition from LSF to Condor began in June 2004; (*) May 1-15, 2007 only)

  20. Virtualization • Condor is built for throughput, not performance → built-in inefficiency. • Addressed with the general queue concept → blurring boundaries between experiment-exclusive resources. • Not entirely successful → different experiments require different software environments. • Utilization can be raised further by supporting multiple software environments on the same physical hardware → virtualization. • Increasing hardware/software support for virtualization (e.g., RedHat and Intel). • Pursuing both Xen and VMware. • Xen testbed created in February 2007; VMware testbed available summer 2007. • Virtualization can also be used to create a software testbed without taking resources from the production environment.
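The slides do not show how the Xen testbed is managed; as a rough illustration only, the Python sketch below lists the guest domains on a Xen host by parsing the output of the standard `xm list` command. The tool choice and the parsing are assumptions, not the RACF's actual tooling.

    # Hypothetical sketch: enumerate guests on a Xen host by parsing the
    # output of the standard `xm list` tool. Only illustrates the kind of
    # check a farm operator might script; not the RACF's actual tooling.
    import subprocess

    def list_xen_guests():
        """Return (name, state) pairs for domains reported by `xm list`."""
        out = subprocess.run(["xm", "list"], capture_output=True, text=True, check=True)
        guests = []
        for line in out.stdout.splitlines()[1:]:    # skip the header row
            fields = line.split()
            if len(fields) >= 5:
                # Name is the first column, State the second-to-last.
                guests.append((fields[0], fields[-2]))
        return guests

    if __name__ == "__main__":
        for name, state in list_xen_guests():
            print(f"{name}: {state}")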

  21. Virtualization

  22. Monitoring • The evolution of the RACF from a local to a globally available resource highlights the importance of a reliable, well-instrumented monitoring system. • The RACF monitors service availability, system performance and facility infrastructure (power and cooling). • Mixture of commercial, open-source and RACF-written components: • RT • Ganglia • Nagios • Infrastructure • Condor • Choices guided by desired features: historical logs, alarm escalation, real-time information.

  23. RT • Trouble ticket system. • Historical records available. • Currently coupled to the monitoring software for alarm escalation and event logging. • Integration into the SLA for ATLAS.
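The slides do not show how the monitoring software is coupled to RT. A minimal sketch, assuming alarms are turned into tickets through RT's standard e-mail gateway; the queue address, sender and SMTP host below are hypothetical:

    # Hypothetical sketch: open an RT ticket from a monitoring alarm by
    # mailing the RT queue's e-mail gateway. Addresses are invented for
    # illustration; the real RACF integration is not shown in the slides.
    import smtplib
    from email.message import EmailMessage

    def open_rt_ticket(subject, body,
                       queue_addr="rt-queue@example.bnl.gov",   # hypothetical RT queue address
                       smtp_host="localhost"):
        msg = EmailMessage()
        msg["From"] = "monitoring@example.bnl.gov"              # hypothetical sender
        msg["To"] = queue_addr
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(msg)

    # Example: escalate an alarm into a ticket.
    open_rt_ticket("ALARM: cooling unit CR-2 offline",
                   "Raised automatically by the monitoring system.")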

  24. Ganglia • Open-source, distributed hierarchical monitoring tool for federations of clusters. • Leverages existing tools (i.e., XML for data representation and RRDtool for data storage and visualization) for ease of management. • Low-overhead and scalable to thousands of systems at the RACF. • Monitors cluster performance (storage capacity, computing throughput, etc.).
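A minimal sketch of how such cluster data can be collected, assuming a stock gmond that publishes its cluster state as XML on the default TCP port 8649; nothing here is specific to the RACF deployment:

    # Minimal sketch: read the XML dump that a default gmond serves on
    # TCP port 8649 and report each host's one-minute load average.
    import socket
    import xml.etree.ElementTree as ET

    def read_gmond_xml(host="localhost", port=8649):
        """Fetch the XML state gmond writes to any client that connects."""
        chunks = []
        with socket.create_connection((host, port)) as sock:
            while True:
                data = sock.recv(8192)
                if not data:
                    break
                chunks.append(data)
        return ET.fromstring(b"".join(chunks))

    def report_load(tree):
        """Print the load_one metric reported for each host."""
        for host in tree.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == "load_one":
                    print(host.get("NAME"), metric.get("VAL"))

    report_load(read_gmond_xml())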

  25. Nagios • Open-source software used to monitor service availability. • Host-based daemons configured to use externally supplied “plugins” to obtain service status. • Host-based alarms configured to take specified actions (e-mail notification, system reboot, etc.). • Native web interface is not scalable. • Connected to the RT ticketing system for alarm escalation and logging.
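A minimal sketch of an externally supplied Nagios plugin, following the standard plugin convention of one status line plus exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN; the disk-space check is only an illustrative choice, not one of the RACF's actual plugins:

    #!/usr/bin/env python3
    # Minimal Nagios-style plugin sketch: print one status line and exit
    # with the conventional code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
    import os
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
        try:
            st = os.statvfs(path)
        except OSError as exc:
            print(f"DISK UNKNOWN - {path}: {exc}")
            return UNKNOWN
        free_pct = 100.0 * st.f_bavail / st.f_blocks
        if free_pct < crit_pct:
            print(f"DISK CRITICAL - {free_pct:.1f}% free on {path}")
            return CRITICAL
        if free_pct < warn_pct:
            print(f"DISK WARNING - {free_pct:.1f}% free on {path}")
            return WARNING
        print(f"DISK OK - {free_pct:.1f}% free on {path}")
        return OK

    if __name__ == "__main__":
        sys.exit(check_disk(sys.argv[1] if len(sys.argv) > 1 else "/"))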

  26. Nagios (cont.)

  27. Infrastructure • The growth of the RACF has put considerable strain on the building's power and cooling infrastructure. • UPS back-up power for RACF equipment. • Custom RACF-written script monitors power and cooling issues. • Alarm escalation through the RT ticketing system. • Automatic shutdown of the Linux Farm during cooling or power failures.
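The RACF-written script itself is not reproduced in the slides; the sketch below only illustrates the general pattern of such a check. The sensor source, thresholds, shutdown command and alarm hook are all invented for illustration.

    # Hypothetical sketch: read a machine-room temperature, raise an alarm
    # when it is high, and shut the farm down when it is critical.
    import subprocess

    WARN_C, CRIT_C = 30.0, 35.0                    # made-up thresholds (degrees C)

    def read_room_temperature():
        """Placeholder for the real sensor readout (SNMP query, BMS feed, etc.)."""
        with open("/var/run/room_temp_c") as f:    # hypothetical sensor file
            return float(f.read())

    def notify(message):
        """Stand-in for the real alarm path (RT ticket, pager escalation)."""
        print(message)

    def check_cooling():
        temp = read_room_temperature()
        if temp >= CRIT_C:
            # Gracefully power off the batch nodes; the command is illustrative.
            subprocess.run(["/usr/local/sbin/farm-shutdown", "--all"], check=False)
            notify(f"CRITICAL: machine room at {temp:.1f} C, farm shutdown issued")
        elif temp >= WARN_C:
            notify(f"WARNING: machine room temperature {temp:.1f} C")

    if __name__ == "__main__":
        check_cooling()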

  28. Condor • The Condor batch system does not provide a monitoring interface. • The RACF created its own web-based monitoring interface. • Interface available to staff for performance tuning and to facility users. • Connected to RT for critical servers. • Monitoring functions: • Throughput • Service Availability • Configuration Optimization
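The RACF web interface is not shown in the slides. As an illustration of the kind of data such an interface could collect, the sketch below counts jobs and execute slots by state using the long-standing -format options of condor_q and condor_status:

    # Illustrative sketch: gather per-state job counts from condor_q and
    # slot states from condor_status, the raw numbers a throughput page
    # could plot. Not the RACF's actual interface.
    import subprocess
    from collections import Counter

    JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}

    def _lines(cmd):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line.strip()]

    def job_counts():
        """Count queued jobs by JobStatus code (1=Idle, 2=Running, 5=Held, ...)."""
        codes = _lines(["condor_q", "-format", "%d\n", "JobStatus"])
        return Counter(JOB_STATUS.get(int(c), c) for c in codes)

    def slot_counts():
        """Count execute slots by State (Claimed, Unclaimed, Owner, ...)."""
        return Counter(_lines(["condor_status", "-format", "%s\n", "State"]))

    if __name__ == "__main__":
        print("jobs:", dict(job_counts()))
        print("slots:", dict(slot_counts()))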

  29. Condor (cont.)

  30. Condor (cont.)

  31. Facility Operations • Facility operations is a manpower-intensive activity at the RACF. • Careful choice of technologies required for scaling of capacity and services. • Operational responsibility divided among the major support groups within the facility (tape storage, disk storage, Linux Farm, general computing): • Software upgrades • Hardware lifecycle management • Integrity of facility services • User account lifecycle management • Cyber-security • Eight years of experience with RHIC operations. • Can be used as a starting point for ATLAS Tier 1 facility operations.

  32. Experience with RHIC Operations • 24x7, year-round operations already in place for the RHIC experiments since 2000. • Facility components classified into 3 categories: non-essential, essential and critical. • Response to component failure is commensurate with the component classification: • Critical components are covered 24x7, year-round. Immediate response is expected from on-call staff. • Essential components have built-in redundancy/duplication and are addressed the next business day. Escalated to “critical” if a large number of essential components fail and compromise service availability. • Non-essential components are addressed the next business day. • Staff provides primary coverage during normal business hours. • Operators are the first point of contact during off-hours and weekends.

  33. Experience with RHIC Operations (cont.) • Operators are responsible for contacting the appropriate on-call person. • Users report problems via the e-mail-based trouble ticketing system, pagers and phone. • Monitoring software is instrumented with an alarm system. • Alarm system connected to selected pagers and cell phones. • Alarm escalation procedure for staff (i.e., contact the back-up if the primary is not available) during off-hours and weekends. • Periodic rotation of the primary and back-up on-call list for each subsystem. • Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of a cooling failure). • Facility operations for RHIC have worked well over the past 8 years.
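As a rough illustration of the on-call rule described above (critical components paged immediately, back-up contacted if the primary does not respond, everything else deferred to the next business day); the names and the paging call are placeholders, not the actual RACF procedure:

    # Hypothetical sketch of the escalation rule the slides describe.
    def escalate(component_rank, primary, backup, page):
        """page(person) should return True when that person acknowledges."""
        if component_rank != "critical":
            return "next business day"          # essential / non-essential
        if page(primary):
            return f"handled by {primary}"
        if page(backup):
            return f"handled by {backup}"
        return "unacknowledged - operators keep working down the on-call list"

    # Example use with a dummy pager that never reaches the primary:
    print(escalate("critical", "on-call primary", "on-call backup",
                   page=lambda person: person == "on-call backup"))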

  34. Table 1: Summary of RCF Services and Servers (Service Level Agreement)
  Service | Server | Rank | Comments
  Network to Ring | | 1 |
  Internal Network | | 1 |
  External Network | | 1 | ITD handles
  RCF firewall | | 1 | ITD handles
  HPSS | rmdsXX | 1 |
  AFS Server | rafsXX | 1 |
  AFS File systems | | 1 |
  NFS Server | | 1 |
  NFS home directories | rmineXX | 1 |
  CRS Management | rcrsfm, rcras | 1 | rcrsfm is 1, rcras is 2
  Web server (internet) | www.rhic.bnl.gov | 1 |
  Web server (intranet) | www.rcf.bnl.gov | 1 |
  NFS data disks | rmineXX | 1 |
  Instrumentation | | 2 |
  SAMBA | rsmb00 | |
  DNS | rnisXX | 2 | Should fail over
  NIS | rnisXX | 2 | Should fail over
  NTP | rnisXX | 2 | Should fail over
  RCF gateways | | 2 | Multiple gateway machines
  ADSM backup | | 2 |
  Wincenter | rnts00 | 2/3 |
  CRS Farm | | 2 |
  LSF | rlsf00 | 2 |
  CAS Farm | | 2 |
  rftp | | 2 |
  Oracle | | 2 |
  Objectivity | | 2 |
  MySQL | | 2 |
  Email | | 2/3 |
  Printers | | 3 |

  35. A New Operational Model for the RACF • RHIC facility operations follow a system-based approach. • ATLAS needs support for (mostly) remote users. • A service-based operational approach is better suited for a distributed computing environment. • Dependencies between services and systems call for an integrated approach. • Service Coordinators responsible for the availability of services. • New SLA for the RACF to incorporate the service-based approach. • Implementation details and timelines not yet finalized.
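As an illustration of the dependency idea ahead of the matrix on the next slide, a small sketch mapping invented services to the systems they depend on and to hypothetical Service Coordinators; the entries are examples, not the RACF's actual matrix:

    # Illustrative service-to-system dependency matrix and the integrated
    # lookup it enables: which services are degraded when a system fails,
    # and which coordinator owns each of them. All entries are invented.
    DEPENDS_ON = {
        "batch processing": {"Linux Farm", "Condor", "network"},
        "data access":      {"disk storage", "tape storage", "network"},
        "user login":       {"interactive nodes", "network"},
    }

    COORDINATOR = {                      # hypothetical Service Coordinators
        "batch processing": "farm group",
        "data access":      "storage group",
        "user login":       "general computing group",
    }

    def affected_services(failed_system):
        """Return (service, coordinator) pairs impacted by a system failure."""
        return [(svc, COORDINATOR[svc])
                for svc, systems in DEPENDS_ON.items()
                if failed_system in systems]

    print(affected_services("network"))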

  36. A Dependency Matrix

  37. A New Response Approach

  38. Summary • Growth in scale and complexity of RACF operations. • Virtualization can help increase cluster productivity. • Monitoring integrated with the alarm system for increased productivity. • Well-established procedures from RHIC operational experience. • New service-based SLA needed for the distributed computing model. • A new operational approach to meet ATLAS requirements is to be implemented.
