Presentation Transcript


  1. LHC Computing Grid Project – LCG Ian Bird – LCG Deployment Manager IT Department, CERN Geneva, Switzerland BNL March 2005 ian.bird@cern.ch

  2. Overview • LCG Project Overview • Overview of main project areas • Deployment and Operations • Current LCG-2 Status • Operations and issues • Plans for migration to gLite + … • Service Challenges • Interoperability • Outlook & Summary

  3. LHC Computing Grid Project • Aim of the project: to prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors • Applications development environment, common tools and frameworks • Build and operate the LHC computing service • The Grid is just a tool towards achieving this goal

  4. Project Areas & Management • Project Leader: Les Robertson • Resource Manager: Chris Eck • Planning Officer: Jürgen Knobloch • Administration: Fabienne Baud-Lavigne • Distributed Analysis – ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology; joint with EGEE • Applications Area (Torre Wenaus; Pere Mato from 1 March 05): development environment, joint projects, data management, distributed analysis • Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support • CERN Fabric Area (Bernd Panzer): large cluster management, data recording, cluster technology, networking, computing service at CERN • Grid Deployment Area (Ian Bird): establishing and managing the Grid Service – middleware certification, security, operations, registration, authorisation, accounting

  5. Relation with EGEE • Goal: create a European-wide production-quality multi-science grid infrastructure on top of national & regional grid programmes • Scale: 70 partners in 27 countries; initial funding (€32M) for 2 years • Activities: grid operations and support (joint LCG/EGEE operations team); middleware re-engineering (close attention to LHC data analysis requirements); training, support for applications groups (inc. contribution to the ARDA team) • Builds on: LCG grid deployment; experience gained in HEP; LHC experiments → pilot applications

  6. Applications Area • All Applications Area projects have software deployed in production by the experiments • POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah • 400 TB of POOL data produced in 2004 • Pre-release of Conditions Database (COOL) • The 3D project (3D: Distributed Deployment of Databases) will help POOL and COOL in terms of scalability • Geant4 successfully used in ATLAS, CMS and LHCb Data Challenges with excellent reliability • GENSER MC generator library in production • Progress on integrating ROOT with other Applications Area components • Improved I/O package used by POOL; common dictionary, maths library with SEAL • Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager • Plan for the next phase of the applications area being developed for internal review at the end of March

  7. The ARDA project • ARDA is an LCG project • its main activity is to enable LHC analysis on the grid • ARDA is contributing to EGEE NA4 • uses the entire CERN NA4-HEP resource • Interface with the new EGEE middleware (gLite) • By construction, use the new middleware • Use the grid software as it matures • Verify the components in an analysis environment (users!) • Provide early and continuous feedback • Support the experiments in the evolution of their analysis systems • Forum for activity within LCG/EGEE and with other projects/initiatives

  8. ARDA activity with the experiments • The complexity of the field requires great care in the phase of middleware evolution and delivery: • Complex (evolving) requirements • New use cases to be explored (for HEP: large-scale analysis) • Different communities in the loop – LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc…) • The complexity of the experiment-specific part is comparable to (often larger than) the “general” one • The experiments do require seamless access to a set of sites (computing resources), but the real usage (and therefore the benefit for the LHC scientific programme) will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure • How to progress? • Build end-to-end prototype systems for the experiments to allow end users to perform analysis tasks

  9. LHC prototype overview

  10. LHC experiments prototypes (ARDA) All prototypes have been “demoed” within the corresponding user communities

  11. CERN Fabric

  12. CERN Fabric • Fabric automation has seen very good progress • The new systems for managing large farms are in production at CERN since January: the Extremely Large Fabric management system (configuration, installation and management of nodes), lemon (LHC Era Monitoring – system & service monitoring), and LHC Era Automated Fabric (hardware/state management); includes technology developed by the European DataGrid • New CASTOR Mass Storage System • Being deployed first on the high-throughput cluster for the ongoing ALICE data recording computing challenge • Agreement on collaboration with Fermilab on Linux distribution • Scientific Linux based on Red Hat Enterprise 3 • Improves uniformity between the HEP sites serving LHC and Run 2 experiments • CERN computer centre preparations • Power upgrade to 2.5 MW • Computer centre refurbishment well under way • Acquisition process started

  13. Preparing for 7,000 boxes in 2008

  14. High Throughput Prototype (openlab + LCG prototype) • Experience with likely ingredients in LCG: • 64-bit programming • next-generation I/O (10 Gb Ethernet, Infiniband, etc.) • High-performance cluster used for evaluations, and for data challenges with experiments • Flexible configuration – components moved in and out of the production environment • Co-funded by industry and CERN • (Network diagram: 4 × GE connections to the backbone; 10GE WAN connection; 4 × Enterasys N7 10 GE switches and 2 × Enterasys X-Series; 2 × 50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory); 80 + 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory) and 80 IA32 CPU servers (dual 2.8 GHz P4, 2 GB memory); 24 disk servers (P4, SATA disks, ~2 TB each) and 36 disk servers (dual P4, IDE disks, ~1 TB each); 28 TB IBM StorageTank; 2 × 12 tape servers STK 9940B; 1 GE or 10 GE per node)

  15. Alice Data Recording Challenge • Target – one week sustained at 450 MB/sec • Used the new version of Castor mass storage system • Note smooth degradation and recovery after equipment failure
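A quick back-of-the-envelope check (my arithmetic, not from the slide): one week sustained at the target rate corresponds to roughly

\[
450\ \mathrm{MB/s} \times 7 \times 86\,400\ \mathrm{s} \approx 2.7 \times 10^{8}\ \mathrm{MB} \approx 272\ \mathrm{TB}
\]

of data written through the new Castor system over the week.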

  16. Deployment and Operations

  17. LHC Computing Model (simplified!!) • Tier-0 – the accelerator centre • Filter raw data → reconstruction → event summary data (ESD) • Record the master copy of raw and ESD • Tier-1 • Managed mass storage – permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service • Data-heavy (ESD-based) analysis • Re-processing of raw data • National, regional support • “online” to the data acquisition process; high availability, long-term commitment • Tier-2 • Well-managed, grid-enabled disk storage • End-user analysis – batch and interactive • Simulation

  18. Computing Resources: March 2005 • (Map legend: countries providing resources; countries anticipating joining) • In LCG-2: • 121 sites, 32 countries • >12,000 CPU • ~5 PB storage • Includes non-EGEE sites: • 9 countries • 18 sites

  19. Infrastructure metrics • (Charts: countries, sites, and CPU available in the LCG-2 production service, split into EGEE partner regions and other collaborating sites)

  20. Service Usage • VOs and users on the production service • Active HEP experiments: • 4 LHC, D0, CDF, Zeus, Babar • Other active VOs: • Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics) • 6 disciplines • Registered users in these VOs: 500 • In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE • Scale of work performed: • LHC Data Challenges 2004: • >1 M SI2K-years of CPU time (~1000 CPU-years; see the worked conversion below) • 400 TB of data generated, moved and stored • 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity) • (Chart: number of jobs processed per month)
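To make the quoted conversion explicit (my arithmetic, assuming a typical processor of the time delivered roughly 1 kSI2K):

\[
\frac{1\ \mathrm{M\,SI2K\text{-}years}}{\sim 1\ \mathrm{kSI2K\ per\ CPU}} \approx 1000\ \mathrm{CPU\text{-}years}
\]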

  21. Current production software (LCG-2) • Maintenance agreements with: • VDT team (inc. Globus support) • DESY/FNAL – dCache • EGEE/LCG teams: • WLM, VOMS, R-GMA, Data Management • Evolution through 2003/2004 • Focus has been on making these reliable and robust • rather than on additional functionality • Respond to needs of users, admins, operators • The software stack is the following: • Virtual Data Toolkit • Globus (2.4.x), Condor, etc • EU DataGrid project developed higher-level components • Workload management (RB, L&B, etc) • Replica Location Service (single central catalog), replica management tools • R-GMA as accounting and monitoring framework • VOMS being deployed now • Operations team re-worked components: • Information system: MDS GRIS/GIIS → LCG-BDII • edg-rm tools replaced and augmented as lcg-utils • Developments on: • Disk pool managers (dCache, DPM) • Not addressed by JRA1 • Other tools as required: • e.g. GridIce – EU DataTag project

  22. Software – 2 • Platform support • Was an issue – limited to Red Hat 7.3 • Now ported to: Scientific Linux (RHEL), Fedora, IA64, AIX, SGI • Another problem was the heaviness of installation • Now much improved and simpler – simple installation tools allow integration with existing fabric management tools • Much lighter installation on worker nodes – user level

  23. Overall status • The production grid service is quite stable • The services are quite reliable • Remaining instabilities in the IS are being addressed • Sensitivity to site management • Problems in underlying services must be addressed • Work on stop-gap solutions (e.g. RB maintains state, Globus gridftp → reliable file transfer service) • The biggest problem is the stability of sites • Configuration problems due to the complexity of the middleware • Fabric management at less experienced sites • Job efficiency is not high unless operations/applications select stable sites (the BDII allows an application-specific view; a sketch of such a selection follows below) • In large tests, selecting stable sites, >>90% efficiency is achieved • Operations workshop last November to address this • Fabric management working group – write a fabric management cookbook • Tighten operations control of the grid – escalation procedures, removing bad sites • Complexity is in the number of sites – not the number of CPUs
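A minimal sketch of the site-selection idea, assuming hypothetical site names, a 90% pass-rate threshold, and simplified functional-test records (the real mechanism built application-specific views from the BDII information system):

```python
# Illustrative sketch only: selecting "stable" sites for an
# application-specific view, in the spirit of what the slide describes.
# Site names, records and the threshold are hypothetical.

from dataclasses import dataclass

@dataclass
class SiteRecord:
    name: str
    tests_passed: int   # functional-test runs passed in the window
    tests_total: int    # functional-test runs attempted in the window

def stable_sites(records: list[SiteRecord], min_pass_rate: float = 0.9) -> list[str]:
    """Return sites whose recent functional-test pass rate meets the threshold."""
    selected = []
    for rec in records:
        if rec.tests_total == 0:
            continue  # no data: treat as unstable rather than guessing
        if rec.tests_passed / rec.tests_total >= min_pass_rate:
            selected.append(rec.name)
    return selected

if __name__ == "__main__":
    history = [
        SiteRecord("site-a.example.org", 27, 28),
        SiteRecord("site-b.example.org", 12, 28),  # flaky: excluded
        SiteRecord("site-c.example.org", 28, 28),
    ]
    print(stable_sites(history))  # ['site-a.example.org', 'site-c.example.org']
```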

  24. Operations Structure • Operations Management Centre (OMC): • At CERN – coordination etc • Core Infrastructure Centres (CIC) • Manage daily grid operations – oversight, troubleshooting • Run essential infrastructure services • Provide 2nd-level support to ROCs • UK/I, Fr, It, CERN, + Russia (M12) • Taipei also runs a CIC • Regional Operations Centres (ROC) • Act as front-line support for user and operations issues • Provide local knowledge and adaptations • One in each region – many distributed • User Support Centre (GGUS) • At FZK – manages the problem tracking system – provides a single point of contact (service desk) • Not foreseen as such in the TA, but the need is clear

  25. Grid Operations • (Diagram: OMC at the centre, CICs around it, ROCs fanning out to RCs; RC = Resource Centre) • The grid is flat, but • Hierarchy of responsibility • Essential to scale the operation • CICs act as a single Operations Centre • Operational oversight (grid operator) responsibility rotates weekly between CICs • Report problems to ROC/RC • ROC is responsible for ensuring the problem is resolved • ROC oversees regional RCs • ROCs responsible for organising the operations in a region • Coordinate deployment of middleware, etc • CERN coordinates sites not associated with a ROC

  26. SLAs and 24x7 • Start with service level definitions • What a site supports (apps, software, MPI, compilers, etc) • Levels of support (# admins, hrs/day, on-call, operators…) • Response time to problems • Define metrics to measure compliance • Publish metrics & performance of sites relative to their commitments (see the sketch after this slide) • Remote monitoring/management of services • Can be considered for small sites • Middleware/services should cope with bad sites • Clarify what 24x7 means • The service should be available 24x7 • Does not mean all sites must be available 24x7 • Specific crucial services that justify the cost • Classify services according to the level of support required • Operations tools need to become more and more automated • Having an operating production infrastructure should not mean having staff on shift everywhere – “best-effort” support • The infrastructure (and applications) must adapt to failures
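A minimal sketch of one such compliance metric, assuming a hypothetical 95% availability commitment and simple downtime records (not a prescribed EGEE/LCG metric):

```python
# Illustrative sketch only: measuring site availability against an SLA
# commitment of the kind the slide describes. The target value and the
# record format are hypothetical.

from datetime import timedelta

def availability(window: timedelta, downtimes: list[timedelta]) -> float:
    """Fraction of the reporting window during which the service was up."""
    down = sum(downtimes, timedelta())
    return 1.0 - down / window

if __name__ == "__main__":
    month = timedelta(days=30)
    outages = [timedelta(hours=6), timedelta(hours=18)]  # from downtime logs
    avail = availability(month, outages)
    target = 0.95  # hypothetical commitment for this class of site
    print(f"availability {avail:.3f} vs target {target:.2f}: "
          f"{'OK' if avail >= target else 'BREACH'}")
```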

  27. Operational Security • Operational Security team in place • EGEE security officer, ROC security contacts • Concentrate on 3 activities: • Incident response • Best practice advice for grid admins – creating a dedicated web site • Security service monitoring evaluation • Incident response • JSPG agreement on IR in collaboration with OSG • Update existing policy: “To guide the development of common capability for handling and response to cyber security incidents on Grids” • Basic framework for incident definition and handling • Site registration process in draft • Part of basic SLA • CA operations • EUGridPMA – best practice, minimum standards, etc. • More and more CAs appearing • The security group and its work started in LCG – it was from the start a cross-grid activity • Much was already in place at the start of EGEE: usage policy, registration process and infrastructure, etc. • We regard it as crucial that this activity remains broader than just EGEE

  28. Policy – Joint Security Group • Best practice guides • Incident Response • Certification Authorities • Audit Requirements • Usage Rules • Security & Availability Policy • Application Development & Network Admin Guide • User Registration • http://cern.ch/proj-lcg-security/documents.html

  29. User Support • We have found that user support has 2 distinct aspects: • Operations-oriented support: Operations Centres (CIC/ROC) for operations problems; Resource Centres (RC) for hardware problems; Deployment Support for middleware problems; Global Grid User Support (GGUS) as the single point of contact and coordination of user support • Application-specific support: VO-specific problems for the LHC experiments, non-LHC experiments, and other communities (e.g. Biomed) • Call centre/helpdesk • Coordinated through GGUS • ROCs as front-line • Task force in place to improve the service • VO support • Was an oversight in the project and is not really provisioned • In LCG there is a team (5 FTE): help apps integrate with m/w, direct 1:1 support, understanding of needs, act as advocate for the app • This is really missing for the other apps – adaptation to the grid environment takes expertise

  30. Certification process • The process was decisive in improving the middleware • The process is time consuming (5 releases in 2004) • Many sequential steps • Many different site layouts have to be tested • Formats of internal and external releases differ • Multiple packaging formats (tool based, generic) • All components are treated equally • same level of testing for non-vital and core components • new tools and tools in use by other projects are tested to the same level • Process to include new components is not transparent • Timing of releases is difficult • Users want them now; sites want them scheduled • Upgrades need a long time to cover all sites • Some sites had problems becoming functional again after an upgrade

  31. Additional Input • Data Challenges • client libs need fast and frequent updates • core services need fast patches (functional/fixes) • applications need a transparent release preparation • many problems only become visible during full scale production • Configuration is a major problem on smaller sites • Operations Workshop • smaller sites can handle major upgrades only every 3 months • sites need to give input in the selection of new packages • resolve conflicts with local policies

  32. Changes I • Simple installation/configuration scripts • YAIM (Yet Another Install Method) • semi-automatic simple configuration management • based on scripts (easy to integrate into other frameworks) • all configuration for a site is kept in one file (a sketch of the single-file idea follows below) • APT (Advanced Package Tool) based installation of middleware RPMs • simple dependency management • updates (automatic on demand) • no OS installation • Client libs packaged in addition as a user-space tar-ball • can be installed like application software
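A minimal sketch of the single-configuration-file pattern in Python (the real YAIM is a set of bash scripts; the keys and configure_* functions here are hypothetical stand-ins):

```python
# Illustrative sketch only (not the real YAIM): the idea of keeping all
# site configuration in one key=value file and driving per-node-type
# configuration functions from it.

def load_site_info(path: str) -> dict[str, str]:
    """Parse a flat KEY=VALUE site configuration file, ignoring comments."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip().strip('"')
    return config

def configure_worker_node(cfg: dict[str, str]) -> None:
    print(f"pointing WN at CE {cfg['CE_HOST']} and SE {cfg['SE_HOST']}")

def configure_compute_element(cfg: dict[str, str]) -> None:
    print(f"configuring CE {cfg['CE_HOST']} for batch system {cfg['BATCH_SYS']}")

NODE_TYPES = {
    "WN": configure_worker_node,
    "CE": configure_compute_element,
}

if __name__ == "__main__":
    # site-info.def might contain, e.g.:
    #   CE_HOST=ce01.example.org
    #   SE_HOST=se01.example.org
    #   BATCH_SYS=pbs
    cfg = load_site_info("site-info.def")
    NODE_TYPES["WN"](cfg)  # run the functions matching this node's roles
```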

  33. Changes II • Different release frequencies for the separate release types • client libs (UI, WN) • services (CE, SE) • core services (RB, BDII, ..) • major releases (configuration changes, RPMs, new services) • updates (bug fixes) added at any time to specific releases • non-critical components will be made available with reduced testing • Fixed release dates for major releases (allows planning) • every 3 months; sites have to upgrade within 3 weeks • Minor releases every month • based on ranked components available at a specific date in the month • not mandatory for smaller RCs to follow • client libs will be installed as application-level software • early access to pre-releases of new software for applications • client libs will be made available on selected sites • services with functional changes are installed on the EIS applications testbed • early feedback from applications

  34. Certification Process • (Flow diagram; roughly: developers deliver components ready at the cutoff • C&T builds internal releases and runs full deployment on test clusters with functional/stress tests (~1 week) • bugs/patches/tasks are tracked in Savannah • C&T, the GDB and the Head of Deployment assign and update cost • EIS prioritisation & selection produces the list for the next release (can be empty) • RCs/EIS/GIS do applications integration & first tests • outputs: internal client release, client release (user-level install of client tools), service release, core service release and updates release, handed to the CICs)

  35. Deployment Process • (Flow diagram; roughly: certification is run daily • CICs re-certify and cut releases • major releases deployed by ROCs/RCs every 3 months on fixed dates (mandatory) • service releases deployed every month, at a site's own pace (optional) • client releases deployed in user space • EIS/GIS produce YAIM release(s) and update the release notes, user guides and installation guides)

  36. Operations Procedures • Driven by experience during 2004 Data Challenges • Reflecting the outcome of the November Operations Workshop • Operations Procedures • roles of CICs - ROCs - RCs • weekly rotation of operations centre duties (CIC-on-duty) • daily tasks of the operations shift • monitoring (tools, frequency) • problem reporting • problem tracking system • communication with ROCs & RCs • escalation of unresolved problems • handing over the service to the next CIC

  37. Implementation • Evolutionary development • Procedures • documented (constantly adapted) • available at the CIC portal http://cic.in2p3.fr/ • in use by the shift crews • Portal http://cic.in2p3.fr • access to tools and process documentation • repository for logs and FAQs • provides means of efficient communication • provides condensed monitoring information • Problem tracking system • currently based on Savannah at CERN • moving to GGUS at FZK • exports/imports tickets to local systems used by the ROCs • Weekly phone conferences and quarterly meetings

  38. Grid operator dashboard • All-in-one CIC-on-duty dashboard: https://cic.in2p3.fr/pages/cic/framedashboard.html

  39. Operator procedure – Escalation • (Flow diagram; roughly: (1) incident detected via the monitoring tools (GIIS monitor, GridIce, GOC monitor pages, wiki help) • (2) diagnosis and testing • (3) report filed in Savannah • (4) escalation according to severity: 1st mail to the ROC/RC, 2nd mail CC to the CIC, then OMC phone call and OMC blacklist for unresolved problems • (5) in-depth diagnosis and follow-up • (6) closure)

  40. Selection of monitoring tools • GIIS Monitor and GIIS Monitor graphs • Site Functional Tests and history • GOC Data Base • Scheduled downtimes • Live Job Monitor • GridIce – VO view and fabric view • Certificate Lifetime Monitor

  41. Middleware

  42. Architecture & Design • A design team including representatives from the middleware providers (AliEn, Condor, EDG, Globus, …), including US partners, produced the middleware architecture and design • Takes into account input and experience from applications, operations, and related projects • DJRA1.1 – EGEE Middleware Architecture (June 2004) • https://edms.cern.ch/document/476451/ • DJRA1.2 – EGEE Middleware Design (August 2004) • https://edms.cern.ch/document/487871/ • Much feedback from within the project (operations & applications) and from related projects • Being used and actively discussed by OSG, GridLab, etc.; input to various GGF groups

  43. gLite Services and Responsible Clusters (JRA3, UK, CERN, IT/CZ) • Access Services: Grid Access Service, API • Security Services: Authentication, Authorization, Auditing • Information & Monitoring Services: Information & Monitoring, Application Monitoring • Data Services: Metadata Catalog, File & Replica Catalog, Storage Element, Data Management • Job Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management • Site Proxy

  44. gLite Services for Release 1 • Focus on key services according to the gLite management taskforce • Access Services: Grid Access Service, API • Security Services: Authentication, Authorization, Auditing • Information & Monitoring Services: Information & Monitoring, Application Monitoring • Data Services: Metadata Catalog, File & Replica Catalog, Storage Element, Data Management • Job Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management • Site Proxy

  45. gLite Services for Release 1 – software stack and origin (simplified) • Catalog • File and Replica Catalog (EGEE) • Metadata Catalog (EGEE) • Information and Monitoring • R-GMA (EDG) • Security • VOMS (DataTAG, EDG) • GSI (Globus) • Authentication for C and Java based (web) services (EDG) • Computing Element • Gatekeeper (Globus) • Condor-C (Condor) • CE Monitor (EGEE) • Local batch system (PBS, LSF, Condor) • Workload Management • WMS (EDG) • Logging and bookkeeping (EDG) • Condor-C (Condor) • Storage Element • File Transfer/Placement (EGEE) • glite-I/O (AliEn) • GridFTP (Globus) • SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs

  46. Summary • FTS • FTS is being evolved with LCG • Milestone on March 15, 2005 • Stress tests in service challenges • UI • Available in the prototype • Includes data management • Not yet formally tested • R-GMA • Available in the prototype • Testing has shown deployment problems • VOMS • Available in the prototype • No tests available • WMS • Task Queue, pull mode, data management interface • Available in the prototype • Used in the testing testbed • Now working on the certification testbed • Submission to LCG-2 demonstrated • Catalog • MySQL and Oracle • Available in the prototype • Used in the testing testbed • Delivered to SA1 • But not tested yet • gLite I/O • Available in the prototype • Used in the testing testbed • Basic functionality and stress tests available • Delivered to SA1 • But not tested yet

  47. Schedule • All of the services are available now on the development testbed • User documentation currently being added • On a limited-scale testbed • Most of the services are being deployed on the LCG Pre-production Service • Initially at CERN, more sites once tested/validated • Scheduled in April–May • Schedule for deployment at major sites by the end of May • In time to be included in the LCG service challenge that must demonstrate full capability in July, prior to operating as a stable service in 2H2005

  48. Migration Strategy • (Timeline: LCG-2 (=EGEE-0) as the 2004 product, with prototyping through 2004–2005 leading to LCG-3 (=EGEE-x?) as the 2005 product) • Certify gLite components on the existing LCG-2 service • Deploy components in parallel – replacing with the new service once stability and functionality are demonstrated • WN tools and libs must co-exist on the same cluster nodes • As far as possible there must be a smooth transition

  49. Service Challenges

  50. Problem Statement • A ‘robust file transfer service’ is often seen as the ‘goal’ of the LCG Service Challenges • Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect • Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, 24 x 365 service aspect – including during conferences, vacation, illnesses etc.) • Need to understand networking requirements and plan early • But transferring ‘dummy files’ is not enough… • Still have to show that the basic infrastructure works reliably and efficiently • Need to test the experiments’ use cases • Check for bottlenecks and limits in s/w, disk and other caches etc. • We can presumably write some test scripts to ‘mock up’ the experiments’ Computing Models (a minimal sketch follows below) • But the real test will be to run your s/w… • Which requires strong involvement from production teams
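A rough illustration of what such a ‘dummy file’ test script might look like — a sketch under stated assumptions, not the Service Challenge tooling; the transfer() stub, file size and duration are hypothetical stand-ins (the 450 MB/s target echoes the ALICE challenge earlier in the talk):

```python
# Illustrative sketch only: a throughput test of the kind the slide calls
# necessary but not sufficient. A real challenge moves data to a remote
# storage element and sustains the rate for days, not seconds.

import os
import tempfile
import time

TARGET_MB_S = 450.0
FILE_SIZE_MB = 64
DURATION_S = 60  # a real challenge would sustain this for a week

def transfer(path: str) -> None:
    """Stand-in for a real transfer (e.g. gridftp to a remote SE)."""
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass  # here we only read the file back; a real test moves it

def run_challenge() -> float:
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(FILE_SIZE_MB << 20))  # one dummy file, reused
        path = f.name
    moved_mb, start = 0, time.time()
    while time.time() - start < DURATION_S:
        transfer(path)
        moved_mb += FILE_SIZE_MB
    os.unlink(path)
    return moved_mb / (time.time() - start)

if __name__ == "__main__":
    rate = run_challenge()
    print(f"sustained {rate:.0f} MB/s vs target {TARGET_MB_S:.0f} MB/s")
```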
