EGEE Infrastructure, Services, & Operations

Ian Bird, CERN IT, SA1 Activity Leader
1st EGEE User Forum, 2nd March 2006

Outline
• Introduction: history
• Middleware and Services
• Middleware distributions
• Operations
• User Support
• Access to resources & Introducing new VOs
• What can you get from EGEE?
• And what does it cost?
• From EGEE to EGEE-II
• Outlook

SA1 (Operations & Management): 97%; SA2 (Network Services): 3%

EGEE Infrastructure & Operations

Introduction

History
• EGEE infrastructure (middleware distribution and operations) was built up by the LCG project during the 18 months prior to the start of EGEE
• The LCG work formed the basic infrastructure of EGEE
  • The middleware distribution retained this name (LCG-2.x) as it was expected to be replaced by gLite
  • Now the middleware distribution will evolve with additional or replacement services coming from gLite or elsewhere
• EGEE started in April 2004 with a running grid infrastructure
  • 40 sites, 3000 CPU
  • Basic operations
  • Developed certification and deployment process
• Now expanded to:
  • 200 sites, >20 000 CPU, 40 countries
  • Managed operations: stability of sites
  • >10 000 jobs / day sustained over the last year
Charts: Sites, CPU, Jobs/day

Middleware & Services

Grid middleware
• Middleware is the software and services that sit between the user application and the underlying computing and storage resources, to provide uniform access to those resources.
• The grid middleware services should:
  • Find convenient places for the application to be run
  • Optimise use of resources
  • Organise efficient access to data
  • Deal with authentication to the different sites that are used
  • Run the job & monitor progress
  • Recover from problems
  • Transfer the result back to the scientist

Middleware Distributions and Stacks
• Terminology:
  • EGEE deploys a middleware distribution
    • Drawn from various middleware products, stacks, etc.
    • Do not confuse the distribution with development projects or with software packages
    • Count on 6 months from software developer “release” to production deployment
  • The EGEE distribution:
    • Current production version labelled: LCG-2.7.0
    • Next version labelled: gLite-3.0
    • Name change to hopefully reduce confusion
• EGEE distribution contents: evolution
  • LCG-2.7.0:
    • VDT: packaging Globus 2.4, Condor, MyProxy
    • EDG workload management
    • LCG components: BDII (info sys), catalogue (LFC), DPM, data management libraries and CLI tools, monitoring tools
    • gLite: R-GMA, VOMS, FTS
  • gLite-3.0:
    • Based on LCG-2.7.0, and
    • gLite workload management
  • Other gLite components (not in the distribution but provided as services): AMGA, Hydra, Fireman, gLite-IO

CAs, Authentication, Authorization
Authentication
• Use of GSI, X.509 certificates
  • Generally issued by national certification authorities
• Agreed network of trust: International Grid Trust Federation (IGTF)
  • EUGridPMA: European Grid PMA
  • APGridPMA: Asia-Pacific Grid PMA
  • TAGPMA: The Americas Grid PMA
• All EGEE sites will usually trust all IGTF root CAs
Authorization
• Until LCG-2.7.0: via grid-map files only
• From LCG-2.7.0: using VOMS extended proxies
  • Call-outs to local authorization services
  • Integration with grid services under way: compute elements, storage systems
• For some time the authorization will be a mixture of call-outs and grid-map files, until all services understand extended proxies
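The mixed authorization regime described above (VOMS attributes where a service understands extended proxies, grid-map files otherwise) can be sketched as follows. This is a minimal illustration, not the actual call-out code of any compute or storage element: the function names, the `voms_policy` table, the `service_understands_voms` flag and all account names are hypothetical. The only format taken as given is the grid-map file line layout, a quoted certificate DN followed by a local account.

```python
import shlex


def parse_grid_mapfile(text):
    """Parse grid-map lines of the form: "<certificate DN>" <local account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # handles the quoted DN containing spaces
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def authorize(dn, fqans, gridmap, voms_policy, service_understands_voms):
    """Return (local_account, mechanism), or (None, "denied").

    Hypothetical policy: VOMS attributes win when the service can read them;
    otherwise fall back to the DN entry in the grid-map file.
    """
    if service_understands_voms:
        for fqan in fqans:
            if fqan in voms_policy:
                return voms_policy[fqan], "voms"
    if dn in gridmap:
        return gridmap[dn], "grid-map"
    return None, "denied"


GRIDMAP_TEXT = '''
# illustrative entry only
"/C=CH/O=CERN/CN=Jane Doe" atlas001
'''
gridmap = parse_grid_mapfile(GRIDMAP_TEXT)
policy = {"/atlas/Role=production": "atlasprd"}  # assumed FQAN-to-account table
user = "/C=CH/O=CERN/CN=Jane Doe"

print(authorize(user, ["/atlas/Role=production"], gridmap, policy, True))
print(authorize(user, ["/atlas/Role=production"], gridmap, policy, False))
```

The same user ends up in a different local account depending on whether the service already understands the extended proxy, which is the transitional situation the slide describes.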

Basic Services
Job Management
• Workload Management: Resource Broker
  • DLI/SI interface to catalogues for data-based scheduling
  • Bulk job submission (gLite-3.0)
  • DAGs (gLite-3.0)
  • Push/pull mode (pull untested, gLite-3.0)
• Compute Element (CE): Globus/EDG/LCG → Condor_C (VO-based scheduling) in gLite-3.0
• Logging & Bookkeeping
• Local batch systems: LSF, PBS, Condor, (Sun Grid Engine)
• Additional tools:
  • Ability to “peek” at stdout/stderr of running jobs
  • User job monitoring: look at the status (state, CPU time, etc.) of running jobs
Data Management
• File and replica catalogues (LFC)
  • Central or local (not distributed)
  • Replication via Oracle, or squid caches tested by LCG
• Secure File Transfer Service (FTS)
  • Reliable data transfer
  • Uses gridftp or srmcopy as transport
• Storage Elements based on SRM interface
  • DPM: implements POSIX ACLs, VOMS roles/groups (gLite-3.0)
  • Other available SEs: dCache, Castor
  • Deprecated: “Classic SE”, basically just gridftp
• Metadata catalogue: AMGA (gLite-3.0, partial support)
• Secure keystore: Hydra (gLite-3.0, partial support)
• Utilities and IO libraries:
  • lcg-utils
  • GFAL: this is the SRM client library
  • gLiteIO: expect functionality to be replaced

Other services
Information system
• BDII (implementation of Globus MDS)
• GLUE schema
• Several tools to access information
• FCR site selection tool (see next slide)
Monitoring & Accounting
• R-GMA used as monitoring framework
  • Aggregation for various sources of monitoring data
• Accounting: APEL package
  • After-the-fact accounting
  • Uses GGF User Record as schema
  • Does not provide user-level data, but this is a legal/privacy issue, not a technical one!

Selecting resources
• Tool that uses dynamically updated data about sites
  • Site functional tests
• VO can:
  • Select critical tests
  • White/black list sites
• VO gets a customised set of “good” sites: a view in the information system
• VO can add VO-specific tests
• Can be used by the RB or another workload management system to run on good/stable sites
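A rough sketch of the selection logic described above: a VO picks its critical tests, optionally white-lists or black-lists sites, and obtains a filtered set of "good" sites. This is not the FCR tool's code. The function name, the data layout and the precedence rules (black list beats white list, white list bypasses the tests, a missing test result counts as a failure) are assumptions made for the illustration, and the site and test names are invented.

```python
def select_good_sites(results, critical_tests, whitelist=(), blacklist=()):
    """Return the sorted list of sites a VO would consider usable.

    results: {site: {test_name: True/False}} from the latest test run.
    critical_tests: the tests this VO has marked as critical.
    """
    good = []
    for site, tests in results.items():
        if site in blacklist:
            continue  # assumed: the black list always wins
        passes_all = all(tests.get(t) is True for t in critical_tests)
        if site in whitelist or passes_all:
            good.append(site)
    return sorted(good)


# Invented sample data for three sites and two tests.
sft_results = {
    "site-a": {"job-submit": True, "replica-mgmt": True},
    "site-b": {"job-submit": True, "replica-mgmt": False},
    "site-c": {"job-submit": False},
}

# A VO that only cares about job submission keeps site-a and site-b.
print(select_good_sites(sft_results, ["job-submit"]))
# A VO that also needs replica management, but trusts site-c regardless.
print(select_good_sites(sft_results, ["job-submit", "replica-mgmt"],
                        whitelist=["site-c"]))
```

Two VOs looking at the same test results can therefore see different views of the infrastructure, which is the point of letting each VO choose its own critical tests.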

Middleware distributions → Deployment

Process to deployment
[Flow diagram. Recoverable labels:]
• Stages: Middleware providers → Integration → Testing & Certification → Pre-production service → Production service
• Middleware providers: JRA1, VDT/OSG, OMII-Europe, …
• Activity labels: SA3, Certification activities, SA3+SA1, SA1
• Support, analysis, debugging

Release Process (simplified)
[Flow diagram, steps 1-7. Recoverable labels:]
• Bugs/Patches/Tasks tracked in Savannah
• Prioritization & selection; assign and update cost; list for next release (can be empty)
• Components ready at cutoff
• Internal releases; internal client release
• Applications integration & first tests
• Full deployment on test clusters; functional/stress tests, ~1 week
• Releases: core service release, service release, updates release, client release
• User-level install of client tools
• Actors: Developers, Applications, RCs, CICs, EIS, GIS, GDB, C&T, Head of Deployment

Deployment process
[Flow diagram. Recoverable labels:]
• Release(s); re-certify
• Deploy client releases (user space)
• Deploy service releases (optional)
• Deploy major releases (mandatory)
• Update release notes; update user guides
• Documents and tools: YAIM, user guides, release notes, installation guides
• Timing labels: every month; every 3 months on fixed dates; certification is run daily; at own pace
• Actors: CICs, ROCs, RCs, EIS, GIS

Certification test bed

Time to upgrade
• Time to upgrade ~constant (~2.5 sites/day)
• Takes a long time to upgrade the entire infrastructure
• Better now than it was: site functional tests and operational oversight
• Need to move away from the need to do full upgrades more than 1-2 times / year
• But need to be able to deploy updates, new tools, security patches, etc.
Chart: Time to upgrade LCG-2.6.0
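To see why full upgrades more than once or twice a year are a problem, the quoted rate can be combined with the roughly 200 sites mentioned in the history slide. The sketch below assumes the ~2.5 sites/day rate stays constant and applies to the whole infrastructure, which the talk only reports for LCG-2.6.0; the function names are illustrative.

```python
def days_to_upgrade(n_sites, sites_per_day=2.5):
    """Days needed to upgrade every site at a constant aggregate rate."""
    return n_sites / sites_per_day


def fraction_of_year_upgrading(n_sites, upgrades_per_year, sites_per_day=2.5):
    """Share of the year during which the infrastructure is mid-upgrade."""
    return upgrades_per_year * days_to_upgrade(n_sites, sites_per_day) / 365


print(days_to_upgrade(200))                           # 80.0
print(round(fraction_of_year_upgrading(200, 2), 2))   # 0.44
```

Under these assumptions a single full upgrade takes about 80 days, so two per year already keep the infrastructure in a mixed-version state for close to half the time.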

Desired scenario
• Steady state with:
  • Components delivered (as far as possible) independent of each other
  • Developed according to realistic schedules, not constrained by artificial release deadlines
• Production service running stable, tested (certified) versions of services and tools
  • Major upgrades only 1 or 2 times per year
  • Potential for upgrading individual services
  • Client tools: new versions deployed as needed
  • Emphasis on reliability, stability, performance, backward compatibility, …
• Pre-production service running new, but certified versions of services
  • Anticipated as upgrades to production services (beta releases of next versions or new services)
  • Allowing reasonable scale application testing and integration with new versions
• Certification testbed running full regression, stress, and functional tests
  • Pre-requisite before moving to pre-production and production
• Software can be rejected (not working, not ready, …)
  • During testing/certification
  • During pre-production
• Net result must be that the production service is stable and as reliable as possible, and evolves incrementally and in a controlled way

Checklist for a new service
• First level support procedures
  • How to start/stop/restart service
  • How to check it’s up
  • Which logs are useful to send to CIC/Developers, and where they are
• SFT Tests
  • Client validation
  • Server validation
  • Procedure to analyse these: error messages and likely causes
• Tools for CIC to spot problems
  • GIIS monitor validation rules (e.g. only one “global” component)
  • Definition of normal behaviour
  • Metrics
  • CIC Dashboard
  • Alarms
• Deployment Info
  • RPM list
  • Configuration details
  • Security audit
• User support procedures (GGUS)
  • Troubleshooting guides + FAQs
  • User guides
• Operations Team Training
  • Site admins
  • CIC personnel
  • GGUS personnel
• Monitoring
  • Service status reporting
  • Performance data
• Accounting
  • Usage data
• Service Parameters
  • Scope: Global/Local/Regional
  • SLAs
  • Impact of service outage
  • Security implications
• Contact Info
  • Developers
  • Support Contact
  • Escalation procedure to developers
• Interoperation
  • Documented issues
• This is what it takes to make a reliable production service from a middleware component
• Not much middleware is delivered with all this … yet

Operations

Grid Operations
• Services:
  • Production service
  • Pre-production service
  • Operational security: incident response
• Operation process, includes:
  • Problem detection
  • Reporting
  • Problem solving
  • Escalation procedures

EGEE Operations Structure
• Operations Management Centre (OMC)
• Core Infrastructure Centres (CIC)
  • Manage daily grid operations: oversight, troubleshooting
    • “Operator on Duty”
  • Run infrastructure services
  • UK/I, Fr, It, CERN, Ru, Taipei
• Regional Operations Centres (ROC)
  • Front-line support for user and operations issues
  • Provide local knowledge and adaptations
  • One in each region, many distributed
• User Support Centre (GGUS)
  • At FZK: provides a single point of contact (service desk) + portal

EGEE Operations Process
• Grid operator on duty
  • 6 teams working in weekly rotation
    • CERN, IN2P3, INFN, UK/I, Ru, Taipei
  • Crucial in improving site stability and management
• Operations coordination
  • Weekly operations meetings
  • Regular ROC, CIC managers meetings
  • Series of EGEE Operations Workshops
    • Nov 04, May 05, Sep 05, (June 06?)
• Geographically distributed responsibility for operations:
  • There is no “central” operation
  • Tools are developed/hosted at different sites:
    • GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
• Procedures described in Operations Manual
  • Introducing new sites
  • Site downtime scheduling
  • Suspending a site
  • Escalation procedures
  • etc.

Operations tools: Dashboard
• Dashboard provides top level view of problems:
  • Integrated view of monitoring tools (SFT, GStat) shows only failures and assigned tickets
  • Single tool for ticket creation and notification emails, with detailed problem categorisation and templates
  • Detailed site view with table of open tickets and links to monitoring results
  • Ticket browser highlighting expired tickets
• Developed and operated by CC-IN2P3: http://cic.in2p3.fr/
Screenshot labels: Problem categories; Sites list (reporting new problems); Test summary (SFT, GStat); GGUS ticket status

[Organisation diagram. Recoverable labels:]
• Operations Coordination Centre → Regional Operations Centres → Resource Centres
• Support levels: 1st level support; 2nd level support
• Coordination, middleware deployment
• Operations/deployment support
• Operational security coordination
• JSPG: Joint Security Policy Group
• OSCT: Operational Security Coordination Team

Operations support workflows
• Monitoring shows a problem
• Operator submits a GGUS ticket against the ROC and cc’s the site
• ROC and site work to resolve the problem
• The ticket is followed until it is solved
Diagram labels: Grid Operator on-duty; OSCT; 1st level support; 2nd level support; Regional Operations Centres; Resource Centres
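The workflow above can be read as a small ticket lifecycle. The sketch below is only an illustration of those four steps; it does not reflect the GGUS data model or API, and the `Ticket` fields, function names, and the site and ROC identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    site: str
    roc: str
    problem: str
    assigned_to: str
    cc: List[str]
    status: str = "open"
    log: List[str] = field(default_factory=list)


def open_ticket(site: str, roc: str, problem: str) -> Ticket:
    """Operator on duty opens a ticket against the ROC and cc's the site."""
    ticket = Ticket(site=site, roc=roc, problem=problem,
                    assigned_to=roc, cc=[site])
    ticket.log.append(f"opened against {roc}, cc {site}: {problem}")
    return ticket


def follow_up(ticket: Ticket, problem_still_seen: bool) -> Ticket:
    """One follow-up pass: the ticket stays open until monitoring is clean."""
    if ticket.status == "solved":
        return ticket
    if problem_still_seen:
        ticket.log.append("still failing; ROC and site continue to work on it")
    else:
        ticket.status = "solved"
        ticket.log.append("monitoring is clean; ticket solved")
    return ticket


# Monitoring shows a problem at an invented site.
ticket = open_ticket("SITE-A", "ROC-X", "SFT job submission fails")
follow_up(ticket, problem_still_seen=True)
follow_up(ticket, problem_still_seen=False)
print(ticket.status)  # solved
```

Assigning the ticket to the ROC while copying the site mirrors the slide: the ROC owns the problem, but the site sees it from the start.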

Evolution of SFT metric
Chart (daily, July → November): available CPU; available sites; missing log data

Joint Security Policy Group
• EGEE with strong input from OSG
• Policy revisions:
  • Grid Acceptable Use Policy (AUP): https://edms.cern.ch/document/428036/
    • Common, general and simple AUP for all VO members using many Grid infrastructures
    • EGEE, OSG, SEE-GRID, DEISA, national Grids…
  • VO Security: https://edms.cern.ch/document/573348/
    • Responsibilities for VO managers and members
    • VO AUP to tie members to Grid AUP accepted at registration
  • Incident Handling and Response: https://edms.cern.ch/document/428035/
    • Defines basic communications paths
    • Defines requirements (MUSTs) for IR: reporting, response, protection of data, analysis
    • Not to replace or interfere with local response plans
Policy set (diagram labels): Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security; Security Policy

Operational Security Coordination Team (OSCT)
• What it is not:
  • Not focused on middleware security architecture
  • Not focused on vulnerabilities (see Vulnerabilities Group)
• Focus on Incident Response Coordination
  • Assume it’s broken, how do we respond?
  • Planning and Tracking
• Focus on ‘Best Practice’
  • Advice
  • Monitoring
  • Analysis
• Coordinators for each EGEE ROC, plus OSG, LCG Tier 1 + Taipei
• OSCT membership → ROC security contacts
Diagram labels (3 strategies): Policy; Infrastructure; Deployment; Incident Response (Procedures, Resources, Reference; HANDBOOK; Playbook); Security Service Challenge (SSC1 - Job Trace; SSC2 - Storage Audit); Monitoring Tools; Agents

Vulnerability Group
• Was set up last summer (CCLRC lead)
• Purpose: inform developers, operations and site managers of vulnerabilities as they are identified, and encourage them to produce fixes or to reduce their impact
• Set up (private!) database of vulnerabilities
  • To inform sites and developers
  • Urgent action → OSCT to manage
• After reaction time (45 days)
  • Vulnerability and risk analysis given to OSCT to define action: publication?
  • Will not publish vulnerabilities with no solution
• Intend to report progress and statistics on vulnerabilities by middleware component, and on the response of developers
• Balance between open, responsible public disclosure and creating security issues with precipitous publication
• Following first experience in implementing this process, a review of procedures is under way, including the need for appropriate risk analyses

User Support

Goals
• A single access point for support
• A portal with well-structured information and up-to-date documentation
• Knowledgeable experts
• Correct, complete and responsive support
• Tools to help resolve problems
  • Search engines
  • Monitoring applications
  • Resources status
• Examples, templates, specific distributions for software of interest
• Interface with other Grid support systems
• Connection with developers, deployment, operation teams
• Assistance during production use of the grid infrastructure

The Support Model: “Regional Support with Central Coordination”
• The ROCs, VOs and other project-wide groups such as the Core Infrastructure Center (CIC), middleware groups (JRA), network groups (NA) and service groups (SA) are connected via a central integration platform provided by GGUS.
Diagram labels:
• Interface: web portal
• Central Application (GGUS); TPM
• Regional support units: ROC 1, ROC …, ROC 10
• User support units; technical support units
• Operations Support; Deployment Support; Middleware Support; VO Support; Network Support

The GGUS System

GGUS Portal: user services
• Browseable tickets
• Search through solved tickets
• Useful links (Wiki FAQ)
• Broadcast tools
• Latest news
• GGUS search engine
• Updated documentation (Wiki FAQ)

[Support flow diagram. Recoverable labels:]
• User
• First line support: TPM (Grid experts); VO-TPM (VO experts)
• Second line support: VO Support Units; Middleware Support Units; Deployment Support Units; ROC Support Units; Network Support; Operations Support
• GGUS Supporters

Performance statistics
[Charts with month labels: September, October]
• November 2005: 315 tickets
• A peak of 80 tickets per day has been reached.

New VOs; Access to Resources; Benefits & Costs

How new VOs find resources
Various possibilities:
1. Pilot applications:
  • Expectation that they have access to resources provided by many partners
  • For EGEE-II this is specified in TA
2. Applications reviewed and approved by EGAAP:
  • Negotiation via OAG to understand which ROCs/sites are willing to
    • Run services on behalf of the VO
    • Provide compute and/or storage resources
3. Other (self-supporting) applications
  • Own their own resources
  • Use EGEE infrastructure, operations, support
  • Many successful examples of such VOs
• 1 & 2:
  • Formal agreements (TA or MoU)
  • Should expect support via NA4, but should also build up internal support teams
  • Expected to collaborate on improving the service, not just be “users”
• 1, 2 & 3:
  • Full user and operations support
  • VOs need to provide support teams: some problems are application problems!

Negotiation
Operations Advisory Group (OAG)
• Brings together VOs and resource providers (ROCs)
• Negotiate for services and resources
• Should not always be an expectation of “free” resources
  • In future applications should bring some resources with them
• Computational and storage resources are not funded (!) by the project

EGEE: What can it deliver?
• A managed operation, providing a service:
  • A large number of sites of different sizes and capabilities
  • Developed operational procedures
  • Monitoring of the grid services providing access to resources
  • Operational security support; incident response coordination
  • Support services: user support, training, etc.
  • Building up considerable experience in grid-enabling a variety of different applications
  • Tools for monitoring of resources at a site … if required
• A new VO joining EGEE with a few sites:
  • Benefits from the operations and support: the VO sites can be monitored and supported as part of the infrastructure
  • Potentially access to other resources
• It is a significant effort to set up a grid infrastructure from scratch

… and what does it cost?
• “The application VO buys into the EGEE model”
  • Actually not so restrictive now: supports many Linux flavours, IA64 (other teams have worked on AIX, SGI ports)
  • Simple installation of client software now (can be done on the fly)
  • Basic grid services are quite general, nothing really application-specific
• Some unresolved issues:
  • Commercial licensed software used by an application
  • Levels of privacy/security needed in some life-science applications
  • True interactivity
  • … and of course, this is all new and rapidly evolving, with many problems still to be overcome
• VOs should:
  • Provide application support effort to help other VO users
  • Invest effort into helping improve the infrastructure and services: should not be simple “client-server”, rather a collaboration

Future

From EGEE to EGEE-II
• Simplify operations structure
  • ROCs absorb CIC roles: spread of expertise
• Introduce SA3
  • Integration, certification, distribution preparation
  • Emphasises focus on stability, reliability, performance rather than new features
  • Mechanism for integrating non-EGEE software, according to need
• Increased emphasis on
  • Platform support (OS, 64-bit, etc.)
  • Interoperability with other grids (international, regional, national, local, campus) and other middleware stacks (Unicore, ARC, …)
SA: 54% of total
• SA1 (operations): 86%
• SA2 (network): 3%
• SA3 (certification): 11%
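The breakdown above can be turned into shares of the project total with a line of arithmetic. The sketch assumes that the three SA percentages are fractions of the SA share (they sum to 100%) rather than of the total; the slide does not say whether "total" refers to effort or budget, and the names below are illustrative.

```python
SA_SHARE_OF_TOTAL = 54  # percent of the project total, from the slide
SA_BREAKDOWN = {        # percent of the SA share (assumed), from the slide
    "SA1 (operations)": 86,
    "SA2 (network)": 3,
    "SA3 (certification)": 11,
}


def share_of_total(sa_share, breakdown):
    """Convert each sub-activity's share of SA into a share of the total."""
    return {name: sa_share * pct / 100 for name, pct in breakdown.items()}


shares = share_of_total(SA_SHARE_OF_TOTAL, SA_BREAKDOWN)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of total")
# SA1 (operations): 46.44% of total
# SA2 (network): 1.62% of total
# SA3 (certification): 5.94% of total
```

Under that reading, operations alone would account for a little under half of the whole project.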

Outlook
• LHC VOs must achieve reliable production and analysis in 2006
  • Will be making significant use of resources
• Consolidate and improve existing services. Focus on:
  • Reliability, robustness
  • Manageability
  • Performance, scalability
  • Evolution or replacement of services driven by needs of application (or security/manageability)
• Expand grid operations
  • Spread expertise to ROCs
  • Collaboration with OSG, A-P
  • Start to negotiate SLAs
• New applications
  • Must bring resources: show commitment
  • Resource sharing and negotiation must become streamlined
  • Will need a mechanism for cost/credit for use of resources

Summary
• EGEE Infrastructure: world’s largest multi-science production grid service
  • But does not exist in isolation: interoperability and interoperation are essential
  • Significant improvements in reliability and stability over the last year
  • Is in constant use for significant production work
  • Many VOs now use it as their primary resource
• Middleware distribution is
  • Consolidating existing and new services
  • Basis for evolution according to needs
• Shift from EGEE to EGEE-II
  • No major changes, but adjustments based on experience and anticipated evolution
  • Refine and improve processes
