LCG Operations
Ian Bird, LCG Deployment Area Manager & EGEE Operations Manager
IT Department, CERN
Presentation to HEPiX, 22nd October 2004
Grid Operations: Scope of Responsibilities
• Certification activities
  • Certification of middleware as a coherent set of services
  • Preparing that package for deployment
• Operational and support activities
  • Coordinating and supporting the deployment to collaborating computer centres
  • Coordinating Grid Operations activities
  • Providing operational support
  • Providing operational security support
  • Providing user support
  • CA management
  • VO registration and management
• Policy
  • CA and user registration policies
  • Operational policy
  • Security policies
  • Resource usage and access policies
22 October 2004 2
LHC Computing Model (simplified!!)
• Tier-0 – the accelerator centre
  • Filter → raw data → reconstruction → event summary data (ESD)
  • Record the master copy of raw and ESD
• Tier-1
  • Managed Mass Storage – permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
  • Data-heavy (ESD-based) analysis
  • Re-processing of raw data
  • National, regional support
  • “Online” to the data acquisition process → high availability, long-term commitment
• Tier-2
  • Well-managed, grid-enabled disk storage
  • End-user analysis – batch and interactive
  • Simulation
[Figure: map of participating sites. Labels: 30 sites, 3200 CPUs; 25 universities, 4 national labs, 2800 CPUs; Grid3; LCG-2. Total: 78 sites, ~9000 CPUs, 6.5 PByte]
Operations services for LCG
• Operational support
  • Hierarchical model
    • CERN acts as 1st level support for the Tier 1 centres
    • Tier 1 centres provide 1st level support for associated Tier 2s
    • Tier 1 “Primary sites”
  • Grid Operations Centres (GOC)
    • Provide operational monitoring, troubleshooting, coordination of incident response, etc.
    • RAL (UK) led sub-project to prototype a GOC
    • 2nd GOC in Taipei now in prototype
• User support
  • Central model
    • FZK provides user support portal
    • Problem tracking system is web-based and available to all LCG participants
  • Experiments provide triage of problems
  • CERN team provides in-depth support and support for integration of experiment software with grid middleware
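The hierarchical first-level support rule above can be sketched as a small lookup. This is only an illustration: the site names are hypothetical placeholders, and chaining the first-level contacts into an escalation path is an assumption, since the slide only states who provides first-level support to whom.

```python
# Hypothetical site names; only the tiering rule comes from the slide:
# CERN is first-level support for Tier 1 centres, and each Tier 1
# is first-level support for its associated Tier 2s.
TIER1_SITES = {"tier1-x", "tier1-y"}
TIER1_OF = {"tier2-a": "tier1-x", "tier2-b": "tier1-x", "tier2-c": "tier1-y"}


def first_level_support(site: str) -> str:
    """Return who provides first-level support for the given site."""
    if site in TIER1_SITES:
        return "CERN"
    if site in TIER1_OF:
        return TIER1_OF[site]
    raise KeyError(f"unknown site: {site}")


def escalation_path(site: str) -> list[str]:
    """Follow first-level contacts upward until CERN is reached.

    Treating the chain as an escalation path is an assumption made
    for illustration, not something the slide specifies.
    """
    path = []
    current = site
    while current != "CERN":
        current = first_level_support(current)
        path.append(current)
    return path
```

For example, `escalation_path("tier2-a")` walks from the Tier 2 site to its Tier 1 and then to CERN.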
Support Teams within LCG
• Grid Operations Center (GOC): Operations Problems
• Resource Centers (RC): Hardware Problems
• CERN Deployment Support (CDS): Middleware Problems
• Global Grid User Support (GGUS): Single Point of Contact, Coordination of User Support
• Experiment Specific User Support (ESUS): Software Problems
• Other Communities (VOs)
• 4 LHC experiments (Alice, Atlas, CMS, LHCb)
• 4 non-LHC experiments (BaBar, CDF, Compass, D0)
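The pairing of problem types with support teams can be read as a routing table. The sketch below is a minimal illustration: the category keys and the function name are hypothetical, and falling back to GGUS for unrecognised categories is an assumption based on its role as single point of contact.

```python
# Problem category -> responsible team, as paired on the slide.
# The lowercase category keys are hypothetical labels.
SUPPORT_TEAM = {
    "operations": "GOC",   # Grid Operations Center
    "hardware": "RC",      # Resource Centers
    "middleware": "CDS",   # CERN Deployment Support
    "software": "ESUS",    # Experiment Specific User Support
}


def route_problem(category: str) -> str:
    """Return the team for a problem category.

    Unknown categories go to GGUS; this fallback is an assumption,
    motivated by GGUS being the single point of contact.
    """
    return SUPPORT_TEAM.get(category.strip().lower(), "GGUS")
```

A call such as `route_problem("Middleware")` returns the team abbreviation for that category.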
Experiences in deployment
• LCG covers many sites (>70) now – both large and small
  • Large sites – existing infrastructures – need to add on grid interfaces etc.
  • Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
  • Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures
• A lot of effort had to be invested in this area
• There are many problems – but in the end we are quite successful
  • System is stable and reliable
  • System is used in production
  • System is reasonably easy to install now – 60 sites
• Now have a basis on which to incrementally build essential functionality
• This infrastructure forms the basis of the initial EGEE production service
LCG Operations
EGEE Operations
What is EGEE? (I)
• EGEE (Enabling Grids for E-science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
  • Integrates current national, regional and thematic Grid efforts
  • Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location
[Figure: layered diagram labelled Applications, Grid infrastructure, Geant network]
What is EGEE? (II)
• 70 leading institutions in 28 countries, federated in regional Grids
• 32 M Euros EU funding (2004-5), O(100 M) total budget
• Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)
• ~300 persons
EGEE Activities
• Emphasis on operating a production grid and supporting the end-users
• 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
• 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
• 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
LCG and EGEE Operations
• EGEE is funded to operate and support a research grid infrastructure in Europe
• The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
  • LCG includes US and Asia-Pacific, EGEE includes other sciences
  • Substantial part of infrastructure common to both
• LCG Deployment Manager is the EGEE Operations Manager
  • CERN team (Operations Management Centre) provides coordination, management, and 2nd level support
• Support activities are expanded with the provision of
  • Core Infrastructure Centres (CIC) (4)
  • Regional Operations Centres (ROC) (9)
  • ROCs are coordinated by Italy, outside of CERN (which has no ROC)
LCG → EGEE in Europe
• User support:
  • Becomes hierarchical
  • Through the Regional Operations Centres (ROC)
    • Act as front-line support for user and operations issues
    • Provide local knowledge and adaptations
• Coordination:
  • At CERN (Operations Management Centre) and CIC for HEP
• Operational support:
  • The LCG GOC is the model for the EGEE CICs
  • CICs replace the European GOC at RAL
  • Also run essential infrastructure services
  • Provide support for other (non-LHC) applications
  • Provide 2nd level support to ROCs
Summary
• Data challenges – demonstrated:
  • Many middleware functional and performance issues (documented)
  • Main problem is service stability
    • Site fabric management, configuration, change control
    • Etc.
  • Grid3 reports similar problems …
• User support process needs improvement
• Now moving into continuous production + service & data challenges
How to move forward – 1
• Build an agreed operations model for the next year
  • Should be able to evolve
• Operations/Fabric workshop Nov 2 – 4
  • HEPiX ½ day – input from some sites and Grid3/OSG on their plans
  • Documenting use-cases (based on experience), proposing support mechanisms for each
  • EGEE SA1 infrastructure
  • 5 working groups:
    • Operations support
    • User support
    • Operational security
    • Fabric management issues
    • Software needs and tools requirements from operations
• Need fabric management training for many sites
Some issues
• Resource Centres:
  • Large sites – have operations staff and/or on-call support
  • Small sites – have no on-call and often little support at all
• Regional Operations Centres:
  • Probably do not provide after-hours or on-call support. If they did, the support model could rely more heavily on the ROCs. However, it is clear that most ROCs will not have this level of support.
• Core Infrastructure Centres:
  • Must have on-call support after-hours
  • To be rotated through the 4 or 5 active CICs
Thus, a basic question to answer is: how much power or control can the CICs have in order to deal with problems when staff at RCs and ROCs are not available?
• Either CICs have rights to manage critical services on sites where there is no support, or
• They have the right to remove “broken” sites and services from the infrastructure.
• Likely that we will have all combinations of these …
Immediate actions
• Weekly operations meeting (Monday afternoon)
  • Weekly reports from ROCs, CICs, other Tier 1s etc.
• Operations Manager
  • Role rotates through the 4 EGEE CICs – manages problem reporting and follow-up
  • Responsibility is handed over in the weekly meeting
• Operational security team
  • Being set up – led by Ian Neilson, with strong collaboration between US and Europe on these issues.
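The weekly rotation of the Operations Manager role could be modelled as a simple round-robin. This is a sketch under stated assumptions: the CIC labels are placeholders, the rotation order and the starting week are arbitrary choices for illustration, and only the facts that the role rotates weekly through 4 CICs and is handed over at the Monday meeting come from the slide.

```python
from datetime import date

# Placeholder labels for the 4 CICs; the order is an assumption.
CICS = ["CIC-A", "CIC-B", "CIC-C", "CIC-D"]

# Arbitrary Monday chosen as the start of the rotation (assumption).
FIRST_WEEK = date(2004, 10, 25)


def operations_manager(week_start: date, first_week: date = FIRST_WEEK) -> str:
    """Return the CIC holding the Operations Manager role in a given week.

    week_start is the Monday on which responsibility is handed over.
    """
    weeks_elapsed = (week_start - first_week).days // 7
    return CICS[weeks_elapsed % len(CICS)]
```

With 4 CICs, each centre holds the role once every four weeks, so `operations_manager(date(2004, 11, 22))` returns the same label as the starting week.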