Grid Operations Centre Proposal (bis)

This proposal outlines key changes for the Grid Operations Centre, focusing on critical grid services, service level agreements, monitoring service quality, and other activities. The proposal addresses previous criticisms and proposes a new approach.

Presentation Transcript


  1. Grid Operations Centre Proposal (bis) • Trevor Daniels, John Gordon • GDB, 8 July 2003

  2. Structure of Proposal • Background • Criticisms of Previous Proposal, GOC Group • Key Features of revised Proposal • Changes, Critical Grid Services, SLAs, Monitoring Critical Services • Other Activities • Summary of other GOC activities, which are largely unchanged from previous proposal • Outstanding Issues • ‘research’ topics – form of SLA, approach to monitoring, Incident Response Team • Accounting, Change Control Authority, Relation to User Support, Federation of Grids Trevor.Daniels@rl.ac.uk

  3. Criticisms of Previous Proposal • General Approach • The approach was too low-level, concentrating unduly on monitoring individual Grid components rather than Grid-related services • Federal Nature of LCG • There was little recognition of the federal nature of the Grid, and the importance of Service Level Agreements • Relation of GOC to the proposed LCG Call Centre • Consideration should be given to the relationship between the GOC and the proposed LCG Call Centre • Issues for Investigation • The issues which require investigation during the evolutionary phase should be more clearly identified

  4. GOC Group • The June GDB agreed that a task force should be created to define the requirements and agree on a prototype for a Grid Operations Service • The members of this GOC Group and their current status are • Trevor Daniels (RAL) Convenor • Markus Shultz (CERN) Accepted • John Gordon (RAL) Accepted • Rolf Rumler (IN2P3) Accepted • Cristina Vistoli (INFN) Invited • US Members Being identified • All identified members were invited to comment on drafts of the revised proposal, and the comments received have been incorporated

  5. The Key Changes • Focus on Critical Grid Services • Concentrate on the Services required to make the Grid function • Identify Service Levels • Schedule • Availability • Reliability • Performance • Service Level Agreements • Specify minimum service levels for each service • Centres publish their design service levels • Monitoring Service Quality • Measure actual levels of service being achieved • GOC publishes them for comparison with design levels

  6. Critical Services • User Interface (UI) • (runs on client so not susceptible to monitoring) • Resource Broker (RB) • Job Submission Service (JSS) • Logging and Bookkeeping (LB) • Information Service (IS) • Computing Element (CE) • Storage Element (SE) • Replica Catalog (RC) (and friends) • Grid Security Infrastructure (GSI) • CRLs, GridMap files, etc

  7. Key Service Parameters • Schedule • Dates on which it is intended to provide a useable service • Availability • Ratio of time the Service is effectively operational to the time it was scheduled to be operational, calculated over some agreed period, eg a month, a quarter or a year • Reliability • The reciprocal of the rate of failure of a Service, where a failure is a break in the service of sufficient magnitude and duration to be noticeable to users (to be defined for each Service), calculated over some agreed period • Performance • The rate of carrying out one or several typical or critical functions • Defined separately and specifically for each Service
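
The Availability and Reliability definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of hours as the unit, the choice of scheduled time as the base for the failure rate, and the sample figures are all assumptions, not part of the proposal.

```python
def availability(scheduled_hours, outage_hours):
    """Ratio of effectively operational time to scheduled time."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    downtime = sum(outage_hours)
    return (scheduled_hours - downtime) / scheduled_hours


def reliability(scheduled_hours, outage_hours):
    """Reciprocal of the failure rate, i.e. scheduled hours per failure.

    Assumption: the rate is taken over scheduled time, and every logged
    outage counts as one user-noticeable failure.
    """
    failures = len(outage_hours)
    if failures == 0:
        return float("inf")  # no failures in the period
    return scheduled_hours / failures


# Hypothetical month: 720 scheduled hours, three logged outages.
outages = [2.0, 0.5, 1.5]
month_availability = availability(720, outages)   # 716 / 720
month_reliability = reliability(720, outages)     # 240 hours per failure
```

Performance is left out here because the slide defines it separately for each Service.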

  8. The SLA in LCG Context • Formal Contract with GOC? – No, because • GOC is not (likely to be) a legal body • GOC will not (be likely to) have any formal powers over Service Providers • GOC will not (be likely to) pay for any Services • So difficult for GOC to enforce a traditional SLA • Instead, prefer a virtual contract between Service Provider and the LCG Grid Community • Any Centre wishing to provide a Service must publish its design levels for the specified service level parameters of that Service • GOC will then monitor the actual levels achieved and publish them so they may be compared with the design levels • Service Providers (Centres) will then compete on quality or possibly quality/cost, either to attract work or enhance reputation
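
One way to picture the comparison of published design levels with achieved levels is the following Python sketch. The `ServiceLevels` record, its field names and units, and the sample numbers are hypothetical; the proposal leaves the form of the SLA as an open issue.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServiceLevels:
    """Hypothetical record of service level parameters for one Service."""
    availability: float       # fraction of scheduled time, 0..1
    reliability_hours: float  # scheduled hours per failure
    performance: float        # e.g. standard jobs processed per hour


def shortfalls(design, achieved):
    """Names of the parameters where the achieved level is below design."""
    return [
        f.name
        for f in fields(design)
        if getattr(achieved, f.name) < getattr(design, f.name)
    ]


# Illustrative figures only: a Centre's published design levels
# and the levels a GOC might have measured for the same period.
design = ServiceLevels(availability=0.99, reliability_hours=200.0, performance=60.0)
achieved = ServiceLevels(availability=0.995, reliability_hours=150.0, performance=60.0)
below_design = shortfalls(design, achieved)
```

The sketch assumes that a higher value is better for every parameter, which would need checking for each real metric.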

  9. Monitoring Service Quality • The Availability, Reliability and the Performance-related parameters will all be constantly monitored during the times the Schedule says the Service should be operational • Performance will be monitored by conducting activities which are as close to the activities carried out by users as possible – usually by submitting minimal jobs designed to test specific aspects of the Service, for example • RB performance could be measured by the time taken to process a standard job from receipt of JDL to submission to a JSS • IS (R-GMA) performance could be measured by the time taken to create a new table or respond to a request • Service failures will usually be detected by suitable heartbeat monitors, and will cause an Alert to be raised. By logging failures, Availability and Reliability can be calculated.
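
The heartbeat and probe ideas above might be sketched as follows. Both functions are hypothetical helpers, not tools named in the proposal: the first treats any gap between consecutive heartbeats longer than a chosen threshold as a logged failure, and the second simply times an arbitrary probe callable standing in for a minimal test job.

```python
import time


def find_outages(heartbeats, max_gap):
    """Return (last_seen, next_seen) pairs for gaps longer than max_gap seconds.

    Assumption: the service is counted as down for the whole gap, which
    overstates downtime by up to one heartbeat interval.
    """
    outages = []
    for prev, cur in zip(heartbeats, heartbeats[1:]):
        if cur - prev > max_gap:
            outages.append((prev, cur))
    return outages


def time_probe(probe):
    """Elapsed wall-clock seconds for one call of a probe function."""
    start = time.perf_counter()
    probe()
    return time.perf_counter() - start


# Illustrative heartbeat timestamps in seconds, nominally one per minute.
beats = [0, 60, 120, 480, 540, 600, 900]
gaps = find_outages(beats, max_gap=90)
downtime_seconds = sum(end - start for start, end in gaps)
```

The resulting outage list is the kind of failure log from which Availability and Reliability could then be calculated.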

  10. Other Activities • GOC Processes and Activities • Coordinating Grid Operations • Defining Service Level Parameters • Monitoring Service Performance Levels • First-Level Fault Analysis • Interacting with Local Support Groups • Change Control • Coordinating Security Activities • Operations Development

  11. Coordinating Grid Operations • The GOC should convene coordinating meetings of • Local Network and Operations Groups • Regional Centres • Grid Deployment Group • GOC staff • to ensure appropriate operational regimes are debated, agreed and followed • and generally provide initiative and impetus to operational developments

  12. First-Level Fault Analysis • Most Service failures will be detected by the staff at the local computing centre (hopefully!) and the GOC will not be involved, other than to check the local staff at the centre are aware • However, some malfunctions may not be immediately obvious to any local centre, and the GOC then has the duty to diagnose the cause of the fault until the responsibility for rectifying it has been localised to a single support group. • NB The GOC has no responsibility for rectifying faults (other than its own!) • But it will track all faults as they are being rectified by others

  13. Interacting with Local Support Groups • Following a Service Failure • Once a service failure is localised, responsibility for all further action rests with the appropriate Local Support Group or Regional Centre • The GOC may alert the Local Support Group through agreed procedures • The GOC will notify Call Centres of service failures • Assisting with local problems • The Grid or the Grid infrastructure may on occasion cause local problems at Centres. Local Support Groups can call on the GOC to assist, since the GOC may have experience of similar problems elsewhere

  14. Change Control • Intended to ensure • major changes to the Grid are made in a coordinated way • all interested parties are informed of changes in prospect • changes have been adequately tested • changes have a means of being backed out • Interested parties • Local Network, Operations and Support Groups • Regional Centres • Call Centres • Grid Deployment Group • GOC

  15. Coordinating Security Activities • General site security is the responsibility of the local centre’s security officer; GOC staff would not be involved in this • However, GOC could • facilitate the resolution of the security issues that are unique to the Grid by prompting discussion, organising meetings etc • develop a Grid Security Policy • develop and promulgate Incident Procedures • provide an Incident Response Team (IRT) with specific expertise in confining and recovering from security incidents on the Grid. This could • maintain knowledge of Grid software relevant to security issues • develop and make available Grid-specific Incident Response Kits • direct confinement and recovery actions in the event of a serious widespread incident • be available to advise local support groups tackling a single-site incident, if needed

  16. Operations Development • No extensive development of tools is anticipated; rather existing tools and those being developed elsewhere will be employed • However, some tools require specific plug-ins or add-ons and these may need to be developed either within the GOC or elsewhere to a GOC specification • Many tools will require configuring, which could involve some modest development effort • A range of applications will be required to carry out GOC duties. These will be procured from elsewhere, but some may need tailoring to suit the requirement • Specific jobs will need to be created to facilitate performance monitoring of Grid Services

  17. Operational Issues • A number of operational issues are far from clear at this stage, and the evolutionary phases of the GOC will enable these issues to be investigated. • The principal operational issues are • What form should the SLAs take for each Grid Service? • What performance metrics should be adopted for each service? • How should high-level monitoring best be carried out? • What procedures should be adopted to guard against the effect of intrusions? • How should a Grid-aware IRT be constituted? • In addition there are a number of open issues more concerned with the nature and responsibilities of a GOC which are covered later.

  18. Organisation • Three GOCs, in • Europe • USA • Asia • Each has primary responsibility for the geographically local part of the Grid for Change Control and Service commitments • Responsibilities for Coordination, Reporting and Development are to be shared • Each GOC is to be on duty for fault detection, fault diagnosis and security incident assistance for the whole Grid for a 10-hour day shift 7 days a week, timed to give 24/7 cover world-wide with a 1 hour overlap at each shift change.
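
A shift pattern like this can be checked with a short Python sketch that counts how many GOCs are on duty in each hour of the day. The start times below (00:00, 08:00 and 16:00 UTC, evenly spaced) are purely an assumption for illustration; the proposal does not state when each shift begins, and the sketch reports whatever overlap results rather than assuming one.

```python
def coverage(shift_starts, shift_length):
    """Number of GOCs on duty in each UTC hour, as a list of 24 counts."""
    on_duty = [0] * 24
    for start in shift_starts:
        for hour in range(start, start + shift_length):
            on_duty[hour % 24] += 1  # wrap shifts that run past midnight
    return on_duty


# Assumed, evenly spaced start hours for the three GOCs; 10-hour shifts.
counts = coverage([0, 8, 16], 10)
always_covered = min(counts) >= 1
overlap_hours_per_day = sum(1 for c in counts if c > 1)
```

With these assumed start times every hour is covered; other start times or shift lengths would change the amount of overlap at each handover.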

  19. Staffing • Prototype GOC Staffing • A total of 6 staff-years spread over 2 years from Jun 03 to Jun 05 • GOC Operations Staffing • Each GOC has 1 Ops Manager plus 2 technical assistants • This provides 1 technical assistant at all times for pro-active monitoring, responding to Alerts and Security Incidents world-wide • Alternatively, with 3 technical assistants, 2 staff could be on duty during the working week and 1 at weekends • Or with 4 technical assistants, 2 staff could be on duty at all times
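
The staffing options can be checked with some simple arithmetic, sketched here in Python. It assumes that "at all times" means the 10-hour, 7-day shift described under Organisation, and it ignores leave, sickness and handover time; the function name and option labels are illustrative.

```python
SHIFT_HOURS = 10  # daily duty period per GOC, taken from the Organisation slide


def weekly_hours_per_assistant(assistants, weekday_cover, weekend_cover):
    """Average duty hours per assistant per week for a given cover level."""
    duty_hours = SHIFT_HOURS * (5 * weekday_cover + 2 * weekend_cover)
    return duty_hours / assistants


options = {
    "2 assistants, 1 on duty at all times": weekly_hours_per_assistant(2, 1, 1),
    "3 assistants, 2 on weekdays, 1 at weekends": weekly_hours_per_assistant(3, 2, 1),
    "4 assistants, 2 on duty at all times": weekly_hours_per_assistant(4, 2, 2),
}

# Prototype phase: 6 staff-years over 2 years is an average of 3 full-time staff.
prototype_average_staff = 6 / 2
```

Under these assumptions each option works out at 35 to 40 duty hours per assistant per week.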

  20. Open Issues • What should be the relation of the GOC to the Call Centre? • What type of Accounting is required, and should this be a responsibility of the GOC? • To what extent should the GOC exert control over changes to the production Grid? • How should the issues raised by a federation of Grids be approached?

  21. Relation to LHC Call Centre • Similarities • Both need to provide 24/7 coverage around the world • Both need comprehensive awareness of the Grid • Both need to know the operational state of the Grid • Both need a means of tracking faults • Differences • The GOC is concerned primarily with Services, Security, and operational Policy and Practice. It deals mainly with Local Support Groups at Centres • The LCC works principally with users – responding to their concerns, providing them with information, investigating faults affecting their work and dealing with their complaints • Suggest • GOC and LCC should be two arms of a single body working closely together and housed in the same premises at three centres around the world

  22. Accounting • Planned • LB will log Grid resources used by various users or VOs • GOC could organise, aggregate, summarise in reports • What else is needed? • Attempt to balance load from VOs or users (fair scheduler)? • Attempt to meet agreed workload schedule? • Attempt to meet deadlines? • What role does GOC play? • What responsibilities will GOC have?
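
The aggregation step mentioned under "Planned" might look like the following Python sketch. The record layout (a dictionary with `user`, `vo` and `cpu_hours` keys) and the sample values are hypothetical; the slides do not describe the format in which LB logs resource usage.

```python
from collections import defaultdict


def usage_by_vo(records):
    """Total CPU hours per VO from a list of usage records."""
    totals = defaultdict(float)
    for record in records:
        totals[record["vo"]] += record["cpu_hours"]
    return dict(totals)


# Hypothetical usage records, standing in for entries derived from LB logs.
sample_records = [
    {"user": "alice", "vo": "atlas", "cpu_hours": 12.0},
    {"user": "bob", "vo": "cms", "cpu_hours": 4.0},
    {"user": "carol", "vo": "atlas", "cpu_hours": 3.5},
]
report = usage_by_vo(sample_records)
```

Grouping by `user` instead of `vo` would give the per-user summary in the same way.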

  23. Change Control • 1) LCG is a loose federation of free agents • who autonomously provide, maintain (and remove) resources • 2) LCG is a resilient and reliable service for LHC processing • Are these compatible? • GOC can help (2) by providing information about Changes, but • Given (1), to what extent should the GOC have authority to control Changes, requiring them to be adequately tested and capable of rapid and easy fall-back? Without this the services are unlikely to be reliable.

  24. Federation of Grids • Questions • What does operational control within a federation of Grids mean? • How might monitoring work? • How might the several GOCs interact? • Are Grid Services related to a single Grid or to the whole federation? • Also wider issues like inter-working of heterogeneous software • Perhaps all the answers will appear in due course as OGSI (Open Grid Services Infrastructure) matures • But should some discussion of these issues be taking place within LCG now?

  25. Conclusion • GDB is invited to • comment on this proposal as a whole • comment on the open issues, and • advise on the next steps to be taken
