Status of EGEE Operations

Ian Bird, CERN
SA1 Activity Leader
EGEE 3rd Conference
Athens, 18th April 2005

Overview
• Overall activity status
• Service & Operations
• Planning for remainder of project
  • Main focus of activities
  • gLite migration
• Summary
• Tomorrow’s plenary session for technical details
Athens Conference; 18th April 2005

Operations Status

Computing Resources: April 2005
[Map legend: country providing resources; country anticipating joining]
In LCG-2:
• 131 sites, 30 countries
• >12,000 CPUs
• ~5 PB storage
Includes non-EGEE sites:
• 9 countries
• 20 sites

Infrastructure metrics
[Chart: countries, sites, and CPUs available in the EGEE production service; series: EGEE partner regions, other collaborating sites]

Service Usage
• VOs and users on the production service
  • Active HEP experiments:
    • 4 LHC, D0, CDF, Zeus, Babar
  • Other active VOs:
    • Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics)
  • 6 disciplines
  • Registered users in these VOs: 600
  • In addition, there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
• Scale of work performed:
  • LHC Data Challenges 2004:
    • >1 M SI2K years of CPU time (~1000 CPU years)
    • 400 TB of data generated, moved and stored
    • 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity)
[Chart: number of jobs processed per month]
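As a rough consistency check on the usage figures above, the sketch below reproduces the slide's arithmetic. The rating of about 1000 SI2K per CPU is an assumption inferred from the slide's own equivalence (1 M SI2K years ≈ 1000 CPU years), not a measured value, and all variable names are illustrative.

```python
# Rough consistency check of the 2004 data-challenge figures quoted above.
# ASSUMPTION: an average CPU is rated at about 1000 SI2K; this is inferred
# from the slide's own "1 M SI2K years ~ 1000 cpu years", not measured.
SI2K_PER_CPU = 1_000

si2k_years = 1_000_000      # ">1 M SI2K years" (treated as a lower bound)
cpu_years = si2k_years / SI2K_PER_CPU

peak_jobs = 4_000           # "~4000 simultaneous jobs" for one VO
ratio_to_cern = 4           # "~4 times CERN grid capacity"
implied_cern_slots = peak_jobs / ratio_to_cern

print(f"CPU years: ~{cpu_years:.0f}")
print(f"Implied CERN grid capacity: ~{implied_cern_slots:.0f} job slots")
```

Both derived numbers are only as good as the rounded inputs on the slide; they are order-of-magnitude figures.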

SA1 – Operations Structure
• Operations Management Centre (OMC):
• Core Infrastructure Centres (CIC)
  • Manage daily grid operations – oversight, troubleshooting
  • Run essential infrastructure services
  • Provide 2nd-level support to ROCs
  • UK/I, Fr, It, CERN, + Russia (M12)
  • Weekly rotation in place since October
  • Taipei also runs a CIC
• Regional Operations Centres (ROC)
  • Act as front-line support for user and operations issues
  • Provide local knowledge and adaptations
  • One in each region – many distributed
• User Support Centre (GGUS)
  • At FZK – manages PTS – provides single point of contact (service desk)
  • Not foreseen as such in TA, but need is clear

Operations Procedures
• Driven by experience during the 2004 Data Challenges, and
• Reflecting the outcome of the November Operations Workshop
• Operations Procedures
  • roles of CICs - ROCs - RCs
  • weekly rotation of operations centre duties (CIC-on-duty)
    • Process in place since October
  • daily tasks of the operations shift
    • monitoring (tools, frequency)
    • problem reporting
    • problem tracking system
    • communication with ROCs & RCs
    • escalation of unresolved problems
    • handing over the service to the next CIC
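The weekly CIC-on-duty rotation described above can be pictured with a minimal sketch. The rotation order, the start date, and the function name `cic_on_duty` are all hypothetical; the slides state only that duty rotates weekly among the CICs and that the process has been in place since October.

```python
from datetime import date

# HYPOTHETICAL rotation order and start date, chosen for illustration only.
# The slides give neither the order of the CICs nor the exact start day.
CIC_ROTATION = ["CERN", "UK/I", "Fr", "It"]
ROTATION_START = date(2004, 10, 4)  # a Monday in October 2004


def cic_on_duty(day: date) -> str:
    """Return the CIC holding operations duty in the week containing `day`."""
    if day < ROTATION_START:
        raise ValueError("rotation had not started on this date")
    # Whole weeks elapsed since the start decide whose turn it is.
    weeks_elapsed = (day - ROTATION_START).days // 7
    return CIC_ROTATION[weeks_elapsed % len(CIC_ROTATION)]


if __name__ == "__main__":
    for d in (date(2004, 10, 4), date(2004, 10, 11), date(2004, 11, 1)):
        print(d.isoformat(), cic_on_duty(d))
```

Adding a further CIC (for example Russia at M12) would amount to appending it to the list; the handover itself remains an operational procedure rather than something code decides.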

New Release Process (simplified)
[Diagram: numbered release-process flow, steps 1 to 7. Recoverable labels: Bugs/Patches/Task Savannah; Developers; C&T; EIS; GIS; GDB; CICs; Applications; Head of Deployment; assign and update cost; prioritization & selection; list for next release (can be empty); components ready at cutoff; internal releases; internal client release; applications integration & first tests; full deployment on test clusters; functional/stress tests (~1 week); RC; Updates Release; Core Service Release; Service Release; Client Release; user-level install of client tools]

Deployment process
[Diagram: deployment flow from certification to sites. Recoverable labels: Re-Certify; CIC; Release(s); Client Release; Deploy Client Releases (User Space); Deploy Service Releases (Optional); Deploy Major Releases (Mandatory); ROCs; RCs; CICs; EIS; GIS; YAIM; Update Release Notes; Update User Guides; User Guides; Release Notes; Installation Guides; Every Month; Every 3 months on fixed dates!; Certification is run daily; Every Month at own pace]

Planning for next year

Future work – comments from review
• Testing and software packaging will be critical to success. Reinforce these activities, which are also intellectually very demanding, even further.
  • Yes – this is agreed!
• Work hard on event-based monitoring techniques, triggering preventive maintenance actions, to improve the stability of the Grid infrastructure.
• Implement a strong mechanism to quickly isolate unstable sites in the production Grid.
  • These are both part of the ongoing programme of work
  • Use R-GMA as monitoring framework; build triggers and alarms on top
  • Better mechanism to remove sites – web interface to allow VOs to select
• Improve the middleware deployment process (technical, organisational) even further to increase the stability of the infrastructure and consequently improve the job success rate and reduce the load on the support team.
  • Already updated and streamlined the deployment and release process and improved configuration mechanisms

15 month plan
• No major changes to goals or work
• Areas of work focus:
  • Migration to gLite
    • See next slides
  • Improving operational and grid reliability
    • Follow recommendations of review discussed above
    • Improve monitoring systems – build reactive alarms
    • Site isolation – need simple mechanism (CIC tool) to remove sites
      • Bad sites, security problems, etc.
  • Improving user support
    • In progress – need recognised usable service by mid-year
  • 24x7 service availability
    • Availability of service rather than components
    • Identify critical services
    • Issues: on-call support; hot stand-by machines; etc. (might need work on middleware to support this!)
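One way to picture the site-isolation mechanism called for above is a simple filter over site status records. This is a sketch under stated assumptions, not the CIC tool itself: the `SiteStatus` fields, the failure threshold, the site names, and the per-VO exclusion set are all hypothetical.

```python
from dataclasses import dataclass

# HYPOTHETICAL data model and threshold; the slides only ask for a simple
# mechanism to remove bad sites or sites with security problems.
FAILURE_THRESHOLD = 3  # consecutive failed tests before a site is isolated


@dataclass
class SiteStatus:
    name: str
    consecutive_failures: int
    security_incident: bool = False


def sites_to_isolate(sites):
    """Names of sites that are unstable or have an open security problem."""
    return [
        s.name
        for s in sites
        if s.security_incident or s.consecutive_failures >= FAILURE_THRESHOLD
    ]


def usable_sites(sites, vo_excluded=frozenset()):
    """Sites left after operations-level isolation and a VO's own exclusions."""
    isolated = set(sites_to_isolate(sites))
    return [
        s.name
        for s in sites
        if s.name not in isolated and s.name not in vo_excluded
    ]


if __name__ == "__main__":
    example = [
        SiteStatus("site-a", 0),
        SiteStatus("site-b", 5),
        SiteStatus("site-c", 0, security_incident=True),
        SiteStatus("site-d", 1),
    ]
    print(sites_to_isolate(example))
    print(usable_sites(example, vo_excluded={"site-d"}))
```

The `vo_excluded` argument stands in for the web-interface selection by a VO mentioned in the review response; how that selection would actually be stored or published is not specified in the slides.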

Review recommendations to SA1
• The migration path to gLite needs to be better planned, as it is inherently difficult to support two different grid software stacks indefinitely. More specifically, establishing a fixed time-line for migration as well as deprecation deadlines for LCG-2 services, plus possibly identifying who would be the earliest adopters from the application side and the time-line for their possible early committal, would be essential; otherwise, existing users may not be motivated to migrate.
  • Migration plan is being worked out in detail – but will be driven by experience in the certification and pre-production deployment
  • Must be a migration plan and not a switch from old to new
  • Early adopters include LCG; others should be identified via NA4

Migration to gLite
[Diagram: 2004 to 2005 timeline; labels: LCG-2 (=EGEE-0), LCG-3 (=EGEE-x?), prototyping, product]
• Migration strategy
  • Needs to be incremental rather than big-bang – as has been stated for a year
• 2 activities in parallel:
  • Deploy components into LCG-2 certification test-bed and then to pre-production
  • Deploy pre-production sites in parallel
• PPS and Production
  • Are evolutionary: LCG-2 → gLite components
• Cannot provide LCG-2 end-of-life estimate/deadlines
  • LCG-2 is the fallback solution
• Applications must test services and decide which ones they need

Review recommendations to SA1
• Consider the current gLite as a stepping stone towards a more robust standards-based infrastructure, rather than a final deployment solution. Select additional components for integration and deployment through collaborations with other international middleware R&D initiatives.
  • Work with Globus, VDT, OSG, etc. on common solutions/interfaces – but has to be driven by the applications and experience from operations
  • Should be in a position to deploy the components needed by the applications
  • Integration and certification process is the mechanism for selecting other components

Review recommendations to SA1
• Continue to conduct application-driven investigation that may result in complex usage scenarios and consider how the advanced middleware and infrastructure would support them in a viable manner. As such, keep a keen eye on new generations of production-level Grid middleware from various international groups that go beyond gLite features.
  • For HEP – Data challenges and service challenges bring specific goals and targets (and timescales) – this will continue
  • Other applications might consider similar exercises – define some goals

Milestones for rest of project
• M14: full production grid in production
  • 9 ROCs, 5 CICs (including Russia at M12), 20 sites
  • Should be based on EGEE re-engineered middleware.
    • This is dependent on the quality and robustness of gLite components
    • Experience: it takes 6 months to put new software into production
    • Will not deploy new components unless they improve upon existing components or add new required functionality
• M21: expanded production infrastructure in place
  • As above, but expanded to 50 sites
  • Now decoupled from specific gLite release

Deliverables for rest of project
• Release notes corresponding to milestones
  • Updated relative to first set of release notes; snapshots corresponding to milestones
  • NB. ALL releases are accompanied by a full set of release notes
• EGEE “Cookbook”
  • Foreseen as planning guides to help new participants join or build components of the infrastructure.
    • Resource centres and their administrators
    • ROCs, CICs, and VOs
  • Templates and checklists to assist administrators to: design a facility, determine what resources to acquire, how to configure them, etc.
  • Detailed enough to allow admins to understand what the limitations of the system are and how to address them (e.g. what services can run on 1 machine, how to configure, etc.)
  • Make use of expertise of CICs, ROCs and staff in RCs (“and use technical writers in NA3”)
• M24: Assessment of infrastructure operation throughout the project
  • Remove suggestions on long-term sustainability → put into EGEE-2 planning

Summary
• Production grid is operational and in use
  • Larger scale than foreseen; use in 2004 was probably the first time such a set of large-scale grid productions has been done
  • Modest growth in resources foreseen over next year
• Operational infrastructure in place and working
  • Need to continue to improve reliability of service
  • Need to continue to improve user support
• Support for applications and VOs
  • VO deployment should become still simpler and more routine
  • Application support needs more resources than foreseen
• Deployment and migration to gLite is now a major focus
