This presentation discusses the evolution of the Linux Farm at Brookhaven National Lab (BNL), emphasizing the transition from a localized to a global computing resource. It examines the background of the RHIC and ATLAS Computing Facilities, highlights staffing and operational changes, and explores future challenges including improved software interoperability and evolving security protocols. The presentation also details the importance of automated management, scalable hardware, and the need for a new management philosophy to support increasingly complex computing environments.
Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL
Background • Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government. • BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments. • The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.
Background (cont.) • BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN. • RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).
Background (cont.) • The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF • RCF/ACF is transforming itself from a local resource into a national and global resource • Growing design and operational complexity • Increasing staffing levels to handle additional responsibilities
The Pre-Grid Era • Rack-mounted commodity hardware • Self-contained, localized resources • Resources available only to local users • Little interaction with external resources at remote locations • Considerable freedom to set own usage policies
The (Near-Term) Future • Resources available globally • Distributed computing architecture • Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth • Constraints on freedom to set own policies
How do we get there? • Change in management philosophy • Evolution in hardware requirements • Evolution in software packages • Different security protocol(s) • Change in access policy
Change in Management Philosophy • Automated monitoring & management of servers in large clusters is a must • Remote power management, predictive hardware-failure analysis and preventive maintenance are important • High availability based on a large number of identical servers, not on 24-hour support • Increasingly large clusters are only manageable if servers are identical; avoid specialized servers
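The talk does not name the specific tooling behind this automation; as a minimal sketch, assuming a flat list of identically configured (and here hypothetical) hostnames, a periodic liveness sweep of the kind this philosophy implies could look like the following Python fragment.

#!/usr/bin/env python
# Illustrative sketch only: a liveness sweep over identically configured
# farm nodes. Hostnames are hypothetical, not taken from the RCF/ACF.
import subprocess

NODES = ["node%03d.farm.example" % i for i in range(1, 6)]

def is_alive(host):
    """Return True if the host answers a single ICMP echo request."""
    rc = subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return rc == 0

if __name__ == "__main__":
    down = [h for h in NODES if not is_alive(h)]
    if down:
        # A production farm would feed an alerting or ticketing system here,
        # or trigger remote power management, rather than just printing.
        print("Unreachable nodes:", ", ".join(down))
    else:
        print("All %d nodes responding." % len(NODES))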
Evolution in Hardware Requirements • Early acquisitions emphasized CPU power over local storage capacity • The increasing affordability of local disk storage has changed this philosophy • Hardware is now chosen for the optimal combination of CPU power, storage capacity, server density and price • Buy from high-quality vendors to avoid labor-intensive maintenance issues
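The slides give no concrete formula for this trade-off; a minimal sketch, with invented candidate configurations and an arbitrary weighting, of how such a comparison could be expressed:

# Illustrative only: rank hypothetical server configurations by a weighted
# benefit-per-dollar score over CPU, storage and rack density. The weights
# and the candidate data are made up for the example.
candidates = [
    # (name, relative CPU power, local disk in GB, nodes per rack unit, price in USD)
    ("vendor-A 1U", 1.0, 250, 1.0, 2500),
    ("vendor-B 2U", 1.1, 1000, 0.5, 3200),
]

W_CPU, W_DISK, W_DENSITY = 0.5, 0.3, 0.2  # arbitrary weights

def score(cpu, disk_gb, density, price):
    """Weighted benefit per dollar; higher is better."""
    benefit = W_CPU * cpu + W_DISK * (disk_gb / 1000.0) + W_DENSITY * density
    return benefit / price

for name, cpu, disk, density, price in sorted(
        candidates, key=lambda c: score(*c[1:]), reverse=True):
    print("%-12s benefit per dollar: %.6f" % (name, score(cpu, disk, density, price)))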
Factors Driving the Evolution of Software Packages • Cost • Farm size / scalability • Security • External influences / wide acceptance
Cost • Red Hat Linux → Scientific Linux • LSF → Condor
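The slide names Condor as the LSF replacement but shows no job description; a minimal sketch, assuming a generic vanilla-universe job (the executable and file names are placeholders, not RCF/ACF specifics), of preparing and submitting a Condor job from Python:

# Illustrative only: write a minimal Condor submit description and hand it
# to condor_submit. Requires a working Condor pool; all names are placeholders.
import subprocess

SUBMIT_FILE = "reco_job.sub"

submit_text = """\
universe   = vanilla
executable = run_reconstruction.sh
arguments  = run_001.dat
output     = reco_job.out
error      = reco_job.err
log        = reco_job.log
queue
"""

with open(SUBMIT_FILE, "w") as f:
    f.write(submit_text)

subprocess.check_call(["condor_submit", SUBMIT_FILE])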
Farm Size / Scalability • Home-built batch system for data reconstruction → Condor-based batch system • Home-built monitoring system → Ganglia
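Ganglia's gmond daemon publishes its cluster state as XML on a TCP port (8649 by default); a minimal sketch, with a hypothetical monitoring host, of reading that dump and listing the hosts it reports:

# Illustrative only: read the XML state a Ganglia gmond daemon publishes on
# its default TCP port and list the hosts it knows about. The hostname is
# hypothetical.
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "monitor.farm.example"
GMOND_PORT = 8649  # gmond's default XML port

def fetch_gmond_xml(host, port):
    """gmond sends its full XML dump as soon as a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

if __name__ == "__main__":
    root = ET.fromstring(fetch_gmond_xml(GMOND_HOST, GMOND_PORT))
    for host in root.iter("HOST"):
        print(host.get("NAME"), "last reported:", host.get("REPORTED"))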
Security • Started with NIS/telnet in the 90s • Cyber-security threats prompted the installation of firewalls and gatekeepers and a migration to ssh, with stricter security standards than in the past • Ongoing change to Kerberos 5 and phase-out of NIS passwords • Testing GSI; currently only limited support for GSI
Security Changes (cont.) • Authorization & authentication controlled by the local site (NIS and Kerberos) • Migration to GSI requires a central CA and regional VOs for authentication; the local site performs the final check before granting access • Accept certificates from multiple CAs? • Difficult transition from complete to partial control over security issues
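The final site-local step described above (mapping an authenticated certificate subject to a local account before access is granted) is commonly implemented with a Globus-style grid-mapfile, whose entries have the form "DN" localuser; a minimal sketch, using the conventional Globus file location, of how such a mapping could be read and consulted:

# Illustrative only: parse a Globus-style grid-mapfile that maps certificate
# subject DNs to local accounts, and look a DN up before granting access.
import re

GRIDMAP = "/etc/grid-security/grid-mapfile"  # conventional Globus location

LINE_RE = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<user>\S+)\s*$')

def load_gridmap(path):
    """Return a dict mapping certificate subject DN -> local account name."""
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            match = LINE_RE.match(line)
            if match:
                mapping[match.group("dn")] = match.group("user")
    return mapping

def authorize(dn, mapping):
    """The site keeps the last word: no mapping entry, no access."""
    return mapping.get(dn)  # None means the DN is not mapped locally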
External Influences / Wide Acceptance • Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission • HRM / dCache – used by other labs • Condor – widely used by the ATLAS community
Summary • RCF/ACF is going through a transition from a local facility to a regional (global) facility, which brings many changes • A Linux Farm built with commodity hardware is increasingly affordable and reliable • Distributed storage is also increasingly affordable, but it raises management software issues
Summary (cont.) • Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices • The transition brings security and access issues • Migration will take longer and be more difficult than generally expected; the change in hardware and software needs to be complemented by a change in management philosophy