LHC Computing Grid Deployment

LHC Computing Grid Deployment. Ian Bird, IT Division, CERN, LCG Deployment Area Manager. US LHC Software and Computing Review, January 14, 2003.

Presentation Transcript


  1. LHC Computing Grid Deployment
     Ian Bird, IT Division, CERN
     LCG Deployment Area Manager
     US LHC Software and Computing Review
     January 14, 2003

  2. Outline
     • LCG Phase I Deployment Goals
     • Deployment Organisation
     • LCG Deployment Plan
     • Middleware
     • Test beds and Services
     • Certification, Testing, Packaging, Distribution
     • Operations and Security
     • Support
     • Collaborative projects
     • Summary
     Ian.Bird@cern.ch

  3. LCG-1 Deployment Goals
     • Production service for Data Challenges in 2H03 & 2004
       • Initially focused on batch production work
     • Gain experience in close collaboration between the Regional Centres
       • Must have wide enough participation to understand the issues
     • Learn how to maintain and operate a global grid
     • Focus on a production-quality service
       • Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized
     • LCG should be integrated into the sites’ physics computing services – should not be something apart
       • This requires coordination between participating sites in:
         • Policies and collaborative agreements
         • Resource planning and scheduling
         • Operations and Support

  4. Elements of an LCG Service
     • Middleware:
       • Testing and certification
       • Packaging, configuration, distribution and site validation
       • Support – problem determination and resolution; feedback to middleware developers
     • Operations:
       • Grid infrastructure services
       • Site fabrics run as production services
       • Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally
     • Support:
       • Experiment integration – ensure optimal use of system
       • User support – call centres/helpdesk – global coverage; documentation; training

  5. Deployment Organisation
     • Grid Deployment Board (GDB) – chair Mirco Mazzucato
       • Representatives from the experiments and from each country with an active Regional Centre taking part in the LCG Grid Service
       • Forges the agreements, takes the decisions, defines the standards and policies that are needed to set up and manage the LCG Global Grid Services
       • Coordinates the planning of resources for physics and computing data challenges
       • First task is the detailed definition of LCG-1
         • Includes defining the set of grid middleware tools to be deployed
     • LCG Deployment Group – coordinated by Deployment Area Manager
       • Teams at CERN and Regional Centres
       • Implement the elements of the LCG service:
         • Certification, testing etc.; Operations; Support; Scheduling
       • Guided by agreements negotiated by GDB
       • Reports to Project Execution Board; SC2

  6. GDB Status
     • 1st meeting in Milano – Oct 4, 2002; 4 since
     • Set up Technical Working groups:
       • WG1: Define LCG-1 functionality and services
       • WG2: Define the schedule for rolling out the infrastructure and resources. Propose process & metrics to be used for allocation, accounting, and reporting.
       • WG3: Define a straightforward security and authentication model to be used in LCG-1, and identify the technical issues. Set up agreements to enable implementation.
       • WG4: Define ops procedures & responsibilities. Make agreements to ensure coordination of these activities. Define the requirements for a Grid Operations Centre to coordinate operational activities.
       • WG5: Propose a support model for LCG-1, including the scope of responsibilities for call centre/helpdesk (delayed until Feb 03).
     • Status:
       • LCG-1 on track to be defined by end Jan 03 – certainly in essentials needed to progress – functionality and services, participating sites, resources and schedules, initial operational and security models
       • Working group final reports end ~Jan 03 – should indicate where LCG should focus effort, process for agreements, etc.

  7. LCG Deployment Plan: Level 1 Milestones

  8. LCG Phase I Timescale in a nutshell
     • LCG-1 must be defined – end Jan 03
       • 2 major areas to be addressed by GDB working groups
         • Define LCG-1 in terms of required functionality and services
           • Deployment schedule
         • Set up distributed organisational structure
           • Resources and scheduling
           • Policies – security, authentication, etc.
           • Operational agreements and responsibilities
           • Support services
     • LCG-1 service must be in place – July 2003
       • 6 months testing, integration, certification, packaging and deployment
     • Need to demonstrate performance – end 2003
       • This should include adding current production services into LCG
     • Provide production service for data challenges in 2004
     • LCG-3 follows LCG-1 by 1 year – provides “50% complexity” service in 2005.

  9. Activities to Achieve: Initial Availability of First Global Service
     • Define LCG-1 in terms of
       • functionality, resources, operations, security, support
     • Deploy a series of evolving pilot services for testing, with increasing resources
       • First pilot service – Feb 1, 2003
       • Incremental deployment to Tier 1 and Tier 2 centres:
         • ~10 in 3 continents by June
     • Testing, certification, packaging and release of software
       • Certification, testing, release process defined – January 2003
       • Packaging/configuration mechanism defined – March 2003
       • Delivery of middleware software packages – March 1, 2003
       • Iterative, incremental release cycle, with major functional releases:
         • V1.0 – June 1, 2003

  10. Activities – cont.
     • Set up infrastructure and operational procedures
       • Certificate Authorities and VO management systems in place – May 2003
         • Based on existing EU and US inter-operating systems
       • Resource accounting and reporting procedures set up – May 2003
       • Security procedures defined and agreed and in place – June 2003
         • Incident response and security management
     • Set up operations centre and help desk (call centre)
       • Identify operations and call centre locations – February 1, 2003
       • In place by June 2003
     • LCG-1 commissioning and acceptance
       • 30 day commissioning period with user productions and stress tests, including 7 day acceptance period
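
The commissioning timing on the slide above can be illustrated with a little date arithmetic. This is only a sketch: the slides give "July 2003" as the target for the service, so the 1 July date used here is an assumption, as is placing the 7 day acceptance period at the end of the 30 day commissioning period. The function name `commissioning_window` is hypothetical.

```python
from datetime import date, timedelta

COMMISSIONING_DAYS = 30  # from the slide
ACCEPTANCE_DAYS = 7      # from the slide, counted inside the 30 days

def commissioning_window(service_date: date):
    """Return (commissioning start, acceptance start) for a target service date.

    Assumes the acceptance period is the last 7 days of commissioning.
    """
    start = service_date - timedelta(days=COMMISSIONING_DAYS)
    acceptance_start = service_date - timedelta(days=ACCEPTANCE_DAYS)
    return start, acceptance_start

# Assumed target date: the slides say only "July 2003".
start, acceptance = commissioning_window(date(2003, 7, 1))
print(start, acceptance)
```

Under these assumptions, commissioning would have to begin around the start of June, which lines up with the June dates listed for the operations centre and security procedures.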

  11. Activities to Achieve: LCG-1 Fully Operational
     • Define LCG-1 performance goals – July 2003
       • In concert with experiments and their data challenge requirements, set performance goals in terms of capacity, throughput, reliability, etc. A GDB working group.
     • 10 Regional Centres participating – October 2003
       • WG2 defines the implementation schedule – may be adjusted in July.
     • LXBATCH service merged into LCG-1 – October 2003
       • All resources of LXBATCH will be grid-enabled and accessible as part of the LCG-1 service. This is a CERN activity but hopefully reflects what happens at other sites too.
     • Milestone release of middleware – October 2003
       • V1.1 release with improved functionality – October 2003
     • Review of service – November 2003
       • The LCG-1 service level should be that required for the 2004 data challenges. The determination and acceptance of achieving the target will be done in a review of the service by representatives from the experiments, the regional centres and LCG.

  12. Deployment Details

  13. Middleware
     • Deployed middleware to be based on US and EU toolkits:
       • VDT
         • Globus, Condor, GLUE schema, EDG CA and VO tools, etc.
       • EDG
         • Resource broker
         • Reptor – WP2 (Data Management) – using RLS
         • WP4 will be used to manage CERN fabric (available for others)
         • VOMS
       • Monitoring tools
         • Initially based on work done by WorldGrid (iVDGL + DataTag)
     • Specifics (functionality, version, delivery) being firmed up now by GDB WG1 – final by beginning February.
     • This will provide the initial basic functionality and will evolve significantly
       • LCG will focus on building a robust service – changes in basic functionality driven by experiments
     • Deployment of these components involves not only obtaining the software but also agreeing the essential support and maintenance.

  14. Testbeds and Services
     • The deployed systems will be in several versions and functions:
       • Certification testbeds – both local and distributed
         • Integration of middleware components, controlled changes, in-depth application testing
         • Prepare for release
       • Production service – deployed at Regional Centres
       • Development service – deployed at Regional Centres
     • Certification testbeds parallel Production and Development services – i.e. need to debug and stabilise production release in parallel with development
     • These are 2 prongs of the “Gordon trident”
       • The 3rd prong is the grid projects’ development systems
     [Diagram labels: iVDGL; Datagrid; LCG; Production Grid; developers’ testbed; development testbed]

  15. Timelines – LCG Phase 1
     [Timeline figure; time axis: Jan 03, July 03, Jan 04, July 04, Jan 05, July 05.
      Tracks: LCG-1 Testbed; LCG Services; LCG Certification & Test; Pilot-1; Pilot-2; LCG-3.
      Milestones: LCG-1 Defined; LCG-1 Initial Service Available; LCG-1 Full Service Available; LCG-1 Fulfils Performance Goals; Computing TDR; LCG-3 Fulfils Performance Goals.
      Data Challenges: experiments ATLAS, CMS, LHCb, ALICE; labels 5% DC04, 10% DC05, DC-2, 5%, 10%.
      Annotations: Incremental middleware releases; Incrementally add regional centres.]

  16. Certification and Testing
     • Will be an ongoing major activity of LCG
       • Part of what will make LCG a production-level service
     • Goals:
       • Certify/validate that middleware behaves as advertised and provides the required functionality (HEPCAL)
       • Stabilise and robustify middleware
       • Provide debugging, problem resolution and feedback to developers
     • Testing activities at all levels
       • Component/unit tests
       • Basic functional tests, including tests of distributed (grid) services
       • Application level tests – based on HEPCAL use-cases
         • Driven/implemented by the experiments – GAG set up by SC2
       • Experiment beta-testing before release
       • Site configuration verification
     • JTB collaborative project – LCG, Trillium, EDG
       • Gather existing tests
       • Write/obtain missing tests
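
The layered testing described on the slide above could be organised roughly as follows. This is a minimal sketch, not the LCG test framework: the level names are shortened from the slide, running the levels in this fixed order and stopping at the first failing level is an assumption, and every check name is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Test levels taken from the slide; the execution order is an assumption.
LEVELS = ["unit", "functional", "application", "site-config"]

@dataclass
class Check:
    level: str               # one of LEVELS
    name: str                # hypothetical test name
    run: Callable[[], bool]  # returns True on success

def certify(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks level by level and stop after the first level with failures."""
    failures: List[str] = []
    for level in LEVELS:
        for check in (c for c in checks if c.level == level):
            if not check.run():
                failures.append(f"{level}:{check.name}")
        if failures:
            # The failure list is the feedback passed back to developers.
            return False, failures
    return True, failures

# Hypothetical checks standing in for real test suites.
checks = [
    Check("unit", "broker-parses-job", lambda: True),
    Check("functional", "submit-to-remote-site", lambda: True),
    Check("application", "hepcal-use-case", lambda: False),
    Check("site-config", "site-validation", lambda: True),
]
ok, failed = certify(checks)
print(ok, failed)
```

Stopping at the first failing level keeps the report short and points at the lowest layer where something broke, which fits the stated goal of feeding problems back to the developers.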

  17. Certification & Testing: Testbeds
     • CERN testbed
       • Several “clusters” forming a local grid
       • Basic tests, basic grid functionality
     • Distributed testbed
       • CERN testbed + testbeds at a few other remote sites
       • Grid functionality
       • Application benchmarks and experiment beta testing
     • Needs several versions
       • Current production version – for reproducing and fixing problems
       • Development version
       • + OS versions …

  18. Certification & Testing: Release Strategy
     • Small release cycles with incremental functionality, rather than major releases where many things change
       • Somewhat depends on technology suppliers and their responsiveness to LCG needs, since LCG is not in control of development
       • There will, however, be milestone functional releases in June and October 2003.
     • Continuous, evolutionary process
       • Each release goes through certification/test cycle
       • Only way to keep control of bugs
     • Goal is stability and robustness …
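
The rule that each release goes through a certification cycle before deployment can be sketched as a simple promotion gate. This is an illustration under assumptions: the function `roll_out` is hypothetical, the version labels other than V1.0 are invented, and the certification results are made up for the example.

```python
from typing import Callable, List, Optional

def roll_out(candidates: List[str], certified: Callable[[str], bool]) -> Optional[str]:
    """Return the version in production after processing candidates in order."""
    production: Optional[str] = None
    for version in candidates:
        if certified(version):
            production = version  # promote only certified releases
        # A failed candidate is skipped; production keeps the last good one.
    return production

# Hypothetical release candidates and certification outcomes.
results = {"0.9": True, "0.10": False, "1.0": True, "1.0.1": False}
print(roll_out(list(results), results.__getitem__))
```

With small incremental releases, a failed candidate costs little: the production service simply stays on the last certified version until the next candidate passes.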

  19. Packaging and distribution
     • Obviously a major issue for a deployment project
     • Joint activity started
       • Discussions LCG, EDG, VDT, EDT, iVDGL, etc.
       • Have produced a draft discussion document
       • Will soon lead to a JTB joint project
     • Want to provide a tool that satisfies needs of the participating sites
       • Interoperate with existing tools where appropriate and necessary
       • Does not force solution on sites with established infrastructure
       • Solution for sites with nothing
     • Configuration is essential component
       • Essential to understand and validate correct site configuration
       • Effort will be devoted to providing configuration tools
       • Verification of correct configuration will be required before sites join LCG
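
Verifying a site's configuration before it joins could look something like the check below. The slides do not say which settings are verified, so every key in `REQUIRED` and both extra rules are assumptions chosen for illustration; `verify_site` is a hypothetical name.

```python
from typing import Dict, List

# Hypothetical required settings; the slides do not list the real ones.
REQUIRED = {
    "site_name": str,
    "middleware_version": str,
    "ca_certificates_installed": bool,
    "worker_nodes": int,
}

def verify_site(config: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the site passes."""
    problems: List[str] = []
    for key, expected in REQUIRED.items():
        if key not in config:
            problems.append(f"missing: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"wrong type: {key}")
    # Assumed policy rules, for illustration only.
    if config.get("ca_certificates_installed") is False:
        problems.append("CA certificates not installed")
    nodes = config.get("worker_nodes")
    if isinstance(nodes, int) and nodes < 1:
        problems.append("no worker nodes")
    return problems

# Hypothetical site description with two problems.
site = {
    "site_name": "example-tier2",
    "middleware_version": "1.0",
    "ca_certificates_installed": False,
}
print(verify_site(site))
```

Returning a list of problems rather than a single pass/fail flag lets a site fix everything in one round before asking to be validated again.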

  20. LCG Operations
     • Responsible for operating and maintaining the grid infrastructure and associated services
       • Gateways, information services, resource broker etc. – i.e. grid specific services
       • Will be a coordination between teams at CERN and at Regional Centres
     • Responsible also for the VO infrastructure, Authentication and Authorisation services
     • Security operations – incident response etc.
     • Build Grid Operations Centre(s)
       • Performance and problem monitoring; troubleshooting and coordination with site operations, user support, network operations etc.
       • Leverage existing experience/ideas (WorldGrid – iVDGL, EDT, etc.)
       • Started discussions with DataTag about future developments
       • Indian group to provide development effort
       • LCG site to lead this (FZK? – not certain yet)
       • Once there is an activity lead, collaborative activities will be expanded
     • Assemble monitoring, reporting, performance, etc. tools
       • Start with what exists, understand what is missing and needed and build from there

  21. Grid Operation
     [Diagram. Flow labels: queries; monitoring & alarms; corrective actions.
      Elements: User; Local user support; Local operation; Local site; Call Centre; Grid Operations Centre; Grid information service; Grid operations; Grid logging & bookkeeping; Virtual Organisation; Network Operations Centre.]

  22. Security
     • GOAL: Do not want to make exceptions for LCG services – they must run integrated into a site infrastructure, and be subject to all usual security and good management procedures and policies
     • BUT: Initially, certain to need exceptions and compromises since until now most grid middleware has sidestepped security issues
     • THUS: We must have a sound security policy and an agreed plan that provides for these exceptions in the short term, but shows a clear path to reach the state that the sites require

  23. User Support
     • Essential for a production service
     • Two aspects
       • Experiment integration/consultancy
         • Work directly with the experiments’ computing projects to ensure efficient use of LCG services, and optimum use of resources
         • Act as liaison to ensure experiment specific issues are resolved
       • User support
         • Helpdesk/call centre operation
           • Globally distributed – 24x7, ensure single point of contact for user
           • Collaborative and distributed operation
         • Documentation
         • Training, tutorials, etc.

  24. Collaborative Projects
     • LCG is not a middleware development project and can only succeed by leveraging the existing and ongoing work of the various grid development projects – and (hopefully) becoming a focus for them.
     • There are many opportunities for common solutions, which are being actively pursued
       • HICB – JTB
         • GLUE
           • Schema definitions & interoperability work
         • New collaborative activities:
           • Validation and Test Suites
           • Distribution, Meta-Packaging, Configuration
       • Grid Operations Centre
         • DataTag, iVDGL, DTF, etc.
       • Storage interfaces; e.g. SRM
       • Authentication, authorisation and security
         • Security managers are beginning to collaborate in the context of LCG
       • HEPiX/LCCWS as collaborative vehicle for RC managers, site coordinators
         • E.g. certification process for operating environments; upgrade procedures; configuration management; helpdesk tools, etc.
       • GGF – production grids area, etc.

  25. Deployment Summary
     • Deploy middleware to support essential functionality, but goal is to evolve and incrementally add functionality
     • Added value is to robustify, support and make into a 24x7 production service
     • How?
       • Certification & test procedure – tight feedback to developers
         • Must develop support agreements with grid projects to ensure this
       • Define missing functionality – require from providers
       • Provide documentation and training
       • Provide missing operational services
       • Provide a 24x7 Operations and Call Centre
         • Guarantee to respond
         • Single point of contact for a user
       • Make software easy to install – facilitate new centres joining
     • Deployment is a major activity of LCG
       • Encompasses all operational and practical aspects of a grid
       • There is a lot of work already done that must be leveraged
       • Many opportunities for synergy and collaboration
