
LCG-1 Deployment Plan

Presentation Transcript


  1. LCG-1 Deployment Plan Ian Bird LCG Project Deployment Area Manager IT Division, CERN GridPP 7th Collaboration Meeting Oxford 1 July 2003

  2. Overview • Milestones and goals for 2003 • LCG-1 Roll-out plan • Where, how, when • Infrastructure Status • Middleware functionality & status • Security & operational issues • Plans for rest of 2003 • Additional resources • Additional functionality • Operational improvements Ian.Bird@cern.ch

  3. LCG - Goals • The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments • Two phases: • Phase 1: 2002 – 2005 • Build a service prototype, based on existing grid middleware • Gain experience in running a production grid service • Produce the TDR for the final system • Phase 2: 2006 – 2008 • Build and commission the initial LHC computing environment • LCG is not a development project – it relies on other grid projects for grid middleware development and support Ian.Bird@cern.ch

  4. LCG - Timescale • Why such a rush – LHC won’t start until 2007 ??? • TDR must be written in mid-2005: • Approval of TDR • Need 1 year to procure, build, test, deploy, commission the computing fabrics and infrastructure – to be in place end 2006 • In order to write the TDR, essential to have at least 1 year experience • In running a production service • At a scale that is representative of the final system (50% of 1 expt) • Running data challenges – including analysis, not just simulations • It can easily take 6 months to prepare such a service • We must start now … goal is to have a service in place in July Ian.Bird@cern.ch

  5. LCG - Milestones • The agreed Level 1 project milestones for Phase 1 are: • deployment milestones are in red Ian.Bird@cern.ch

  6. LCG Regional Centres – centres taking part in the LCG prototype service, 2003 – 2005 • Tier 0: CERN • Tier 1 Centres: Brookhaven National Lab, CNAF Bologna, Fermilab, FZK Karlsruhe, IN2P3 Lyon, Rutherford Appleton Lab (UK), University of Tokyo, CERN • Other Centres: Academia Sinica (Taipei), Barcelona, Caltech, GSI Darmstadt, Italian Tier 2s (Torino, Milano, Legnaro), Manno (Switzerland), Moscow State University, NIKHEF Amsterdam, Ohio Supercomputing Centre, Sweden (NorduGrid), Tata Institute (India), TRIUMF (Canada), UCSD, UK Tier 2s, University of Florida – Gainesville, University of Prague, … • Confirmed resources: http://cern.ch/lcg/peb/rc_resources Ian.Bird@cern.ch

  7. Elements of a Production LCG Service • Middleware: • Testing and certification • Packaging, configuration, distribution and site validation • Support – problem determination and resolution; feedback to middleware developers • Operations: • Grid infrastructure services • Site fabrics run as production services • Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally • Support: • Experiment integration – ensure optimal use of system • User support – call centres/helpdesk – global coverage; documentation; training Ian.Bird@cern.ch

  8. 2003 Milestones Project Level 1 Deployment milestones for 2003: • July: Introduce the initial publicly available LCG-1 global grid service • With 10 Tier 1 centres in 3 continents • November: Expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges • Additional Tier 1 centres, several Tier 2 centres – more countries • Expanded resources at Tier 1s (e.g. at CERN make the LXBatch service grid-accessible) • Agreed performance and reliability targets Ian.Bird@cern.ch

  9. LCG Resource Commitments – 1Q04 Ian.Bird@cern.ch

  10. Deployment Goals for LCG-1 • Production service for Data Challenges in 2H03 & 2004 • Initially focused on batch production work • But ’04 data challenges have (as yet undefined) interactive analysis • Experience in close collaboration between the Regional Centres • Must have wide enough participation to understand the issues • Learn how to maintain and operate a global grid • Focus on a production-quality service • Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized • LCG should be integrated into the sites’ physics computing services – should not be something apart • This requires coordination between participating sites in: • Policies and collaborative agreements • Resource planning and scheduling • Operations and Support Ian.Bird@cern.ch

  11. Middleware Deployment • LCG-0 was deployed and installed at 10 Tier 1 sites • Installation procedure was straightforward and repeatable • Many local integration issues were addressed • LCG-1 will be deployed to these 10 sites to meet the July milestone • Time is short – integrating the middleware components took much longer than anticipated • Planning under way to do the deployment in a short time once the middleware is packaged • LCG team will work directly with these sites during the deployment • Initially testing activities to stabilise service will take priority • Expect experiments to start to test the service by mid-August Ian.Bird@cern.ch

  12. LCG-0 Deployment Status These sites deployed the LCG-0 pilot system and will be the first sites to deploy LCG-1 Ian.Bird@cern.ch

  13. LCG-1 Distribution • Packaging & Configuration • Service machines – fully automated installation • LCFGng – either full or light version • Worker nodes – the aim is to allow sites to use existing tools as required • LCFGng – provides automated installation • Installation scripts provided by us – manual installation • Instructions allowing system managers to use their existing tools • User interface • LCFGng • Installed on a cluster (e.g. Lxplus at CERN) • Pacman? • Distribution • Distribution web site being set up now (updated from LCG-0) • Sets of RPMs etc. organised by service and machine type (see the sketch below) • User guide, installation guides, release notes, etc. being written now Ian.Bird@cern.ch
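
To make the "organised by service and machine type" idea concrete, here is a minimal sketch assuming a hypothetical manifest that maps LCG-1 machine types to RPM sets and a hypothetical distribution URL; the package names are illustrative, not the actual LCG-1 release contents.

```python
# Hypothetical manifest mapping LCG-1 machine types to RPM sets for manual
# installation; package names and the repository URL are illustrative only.
import subprocess

RPM_SETS = {
    "WN": ["lcg-wn-config", "edg-replica-manager-client", "vdt-globus-client"],
    "UI": ["lcg-ui-config", "edg-wl-ui", "vdt-globus-client"],
    "CE": ["lcg-ce-config", "edg-wl-gatekeeper", "vdt-globus-server"],
    "SE": ["lcg-se-config", "edg-se", "vdt-globus-server"],
}

def install(machine_type, repo_url):
    """Install the RPM set for one machine type from the distribution web site."""
    urls = [f"{repo_url}/{name}.rpm" for name in RPM_SETS[machine_type]]
    # rpm -Uvh installs or upgrades the packages; a site could instead feed
    # the same list to its own configuration-management tool.
    subprocess.run(["rpm", "-Uvh"] + urls, check=True)

if __name__ == "__main__":
    install("WN", "http://lcg-dist.example.org/lcg1")
```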

  14. Middleware Status • Integration work on EDG 2.0 has taken longer than hoped • EDG has not quite released version 2.0 – imminent • LCG has a working system – able to run jobs: • Resource Broker: many changes since the previous version; needs significant testing to determine scalability and limitations • RLS: initial deployment will be a single instance (per VO) of LRC/RMC • The distributed service with many LRCs and indexes is not yet debugged • Initially the LRCs for all VOs will run at CERN with an Oracle service backend • Information system: • R-GMA is not yet stable • We will initially use MDS: work to improve stability (bug fixes) and redundancy, based on experience with the EDG testbeds and the NIKHEF and NorduGrid work • Intend to make a direct comparison between MDS and R-GMA on the certification testbed • Waiting for bug fixes to several components • Still to do before release: • Reasonable level of testing • Packaging and preparation for deployment Ian.Bird@cern.ch
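
Since MDS is the initial information system, here is a minimal sketch of how a client might query a GIIS or BDII over LDAP, assuming the python-ldap package; port 2135 and the mds-vo-name=local,o=grid search base are common Globus MDS conventions used here as assumptions, not LCG-1 specifics.

```python
# Minimal LDAP query against an MDS index (GIIS/BDII); requires python-ldap.
import ldap

def list_entries(host, port=2135, base="mds-vo-name=local,o=grid"):
    conn = ldap.initialize(f"ldap://{host}:{port}")
    conn.simple_bind_s()  # MDS normally allows anonymous binds
    # Return every entry below the base; a real client would restrict the
    # filter and attribute list to the CE/SE information it needs.
    return conn.search_s(base, ldap.SCOPE_SUBTREE, "(objectClass=*)")

if __name__ == "__main__":
    for dn, attrs in list_entries("giis.example.org"):
        print(dn)
```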

  15. Certification & Testing • This is the primary tool to stabilise and debug the system • The process and testbed have been set up • They are intended to parallel the production service • Certification testbed: • Set of 4 clusters at CERN – simulates a grid on a LAN • External sites that will be part of the certification testbed: • U. Wisconsin, FNAL – currently • Moscow, Italy – soon • This testbed is being used to test the release candidate • It will be used to reproduce and resolve problems found in the production system, and to do regression testing of updated middleware components before deployment (a smoke-test sketch follows) Ian.Bird@cern.ch
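
A rough sketch of the kind of smoke test such a certification testbed might run: submit one trivial job directly to each target CE and record whether submission succeeds. It assumes the EDG WMS client (edg-job-submit) and a valid grid proxy; the CE identifiers, the JDL and the use of -r for direct submission to a named CE are illustrative assumptions.

```python
# Certification-testbed smoke test: one trivial job per target CE.
import subprocess
import tempfile
import textwrap

TEST_CES = [
    "ce.site-a.example.org:2119/jobmanager-pbs-short",
    "ce.site-b.example.org:2119/jobmanager-lsf-short",
]

JDL = textwrap.dedent("""\
    Executable    = "/bin/hostname";
    StdOutput     = "out.txt";
    StdError      = "err.txt";
    OutputSandbox = {"out.txt", "err.txt"};
""")

def smoke_test():
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(JDL)
        jdl_path = f.name
    results = {}
    for ce in TEST_CES:
        # -r bypasses matchmaking and submits straight to the named CE.
        proc = subprocess.run(["edg-job-submit", "-r", ce, jdl_path],
                              capture_output=True, text=True)
        results[ce] = (proc.returncode == 0)
    return results

if __name__ == "__main__":
    for ce, ok in smoke_test().items():
        print(("PASS" if ok else "FAIL"), ce)
```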

  16. Infrastructure for initial service - 2 • Security issues • Agreement on set of CAs that all LCG sites will accept • EDG list of traditional CAs • FNAL on-line KCA • Agreement on basic registration procedure for users • LCG VO where users sign Acceptable Usage Rules for LCG • 4 experiment VOs – will use existing EDG services run by Nikhef • Agreement on basic set of information to be collected • All initial registrations will expire in 6 months – we know the procedures will change • Experiment VO managers will verify bona fides of users • Acceptable Use Rules – adaptation based on EDG policy for now • Audit trails – basic set of tools and log correlations to provide basic essential functions Ian.Bird@cern.ch
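
To illustrate the registration and authorization chain, here is a minimal, hypothetical sketch of turning VO membership lists into a Globus grid-mapfile – the step that tools such as edg-mkgridmap automate. How the DNs are obtained from the VO service is left out, and the VO-to-account mapping is invented for illustration.

```python
# Hypothetical grid-mapfile generation from VO membership lists.
VO_ACCOUNTS = {"alice": ".alice", "atlas": ".atlas", "cms": ".cms", "lhcb": ".lhcb"}

def write_gridmap(members, path="grid-mapfile"):
    """members maps a VO name to the list of certificate subject DNs it vouches for."""
    with open(path, "w") as f:
        for vo, dns in members.items():
            for dn in dns:
                # grid-mapfile format: quoted certificate subject, local account
                f.write(f'"{dn}" {VO_ACCOUNTS[vo]}\n')

if __name__ == "__main__":
    write_gridmap({"lhcb": ["/O=Grid/O=CERN/OU=cern.ch/CN=Jane Doe"]})
```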

  17. Infrastructure – 3 • Operations Service: • RAL is leading sub-project on developing operations services • Initial prototype for July – • Basic monitoring tools • Mail lists and rapid communications/coordination for problem resolution • Monitoring: • GridICE (development of DataTag Nagios-based tools) being integrated with release candidate • Existing Nagios-based tools • GridPP job submission monitoring • Together these give a reasonable coverage of basic operational issues • User support • FZK leading sub-project to develop user support services • Initial prototype for July – • Web portal for problem reporting • Expectation that initially experiments will triage problems and experts will submit LCG problems to the support service Ian.Bird@cern.ch
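
As an example of the basic monitoring the initial operations prototype could provide, here is a toy availability probe that checks whether a site's gatekeeper (2119) and GRIS (2135) ports accept TCP connections; the host names are placeholders.

```python
# Toy Nagios-style availability probe for gatekeeper and GRIS ports.
import socket

SITES = {"ce.site-a.example.org": [2119, 2135],
         "ce.site-b.example.org": [2119, 2135]}

def port_open(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, ports in SITES.items():
        for port in ports:
            print(host, port, "OK" if port_open(host, port) else "CRITICAL")
```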

  18. Initial Deployment Services [deployment diagram showing services at CERN and services at other sites] Elements shown: Resource Brokers RB-1 and RB-2 with proxies; RLS (RMC & LRC) instances for ALICE, ATLAS, CMS, LHCb and the LCG team; the experiment VOs and the LCG / LCG-Team VOs; the LCG Registration Server and LCG CVS Server; User Interfaces (UI, UI-b on Lxplus, AFS users @ NIKHEF); disk SEs; Computing Elements CE-1 … CE-4 in front of PBS, LSF and other batch systems, each with worker nodes (WN). Ian.Bird@cern.ch

  19. LCG-1 First Launch – Information System Overview [diagram: the RB queries a BDII over LDAP; each BDII (A, B) holds /dataCurrent and /dataNew directories and queries the region GIISes, which aggregate the site GIISes (primary/secondary), which in turn aggregate the GRISes on each site's CEs and SEs; using multiple BDIIs requires RB changes] While serving data from one directory, the BDII queries the regional GIISes to fill a second directory structure. When this has finished, the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 s. To improve availability during this window it was suggested (David) that the TCP port be switched off and the TCP protocol left to take care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the region GIISes. A schematic sketch of the refresh cycle follows. Ian.Bird@cern.ch
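
A schematic sketch of the refresh cycle just described; only the fill / stop / swap / restart sequence comes from the slide, while the ldapsearch invocation, region names, search bases, init script path and directory layout are illustrative assumptions.

```python
# Schematic BDII refresh cycle: fill a fresh directory from the region GIISes,
# stop the BDII, swap directories, restart.
import os
import subprocess

CURRENT = "/opt/bdii/dataCurrent"
NEW = "/opt/bdii/dataNew"
REGION_GIISES = {"west": "giis-west.example.org",
                 "east": "giis-east.example.org"}

def refresh_cycle():
    os.makedirs(NEW, exist_ok=True)
    # 1. Fill the new directory while the BDII keeps serving the current one.
    for region, host in REGION_GIISES.items():
        with open(os.path.join(NEW, region + ".ldif"), "w") as out:
            subprocess.run(
                ["ldapsearch", "-x", "-LLL", "-h", host, "-p", "2135",
                 "-b", f"mds-vo-name={region},o=grid"],
                stdout=out, check=True)
    # 2. Stop, swap the directories, restart; the service is unavailable only
    #    during the restart itself, observed to take under 0.5 s.
    subprocess.run(["/etc/init.d/bdii", "stop"], check=True)
    os.rename(CURRENT, CURRENT + ".old")
    os.rename(NEW, CURRENT)
    os.rename(CURRENT + ".old", NEW)
    subprocess.run(["/etc/init.d/bdii", "start"], check=True)

if __name__ == "__main__":
    refresh_cycle()
```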

  20. LCG-1 First Launch – Information System Sites and Regions [diagram: region GIISes WEST1, WEST2, EAST1, EAST2, EAST3 covering the sites RAL, CERN, FZK, LYON, FNAL, MOSCOW, BNL, CNAF, TOKYO, TAIWAN] A region should not contain too many sites, since we have observed problems with MDS when a large number of sites is involved. To allow for future expansion without making the system too complex, I suggest starting with two regions – West of 0 degrees and East – and splitting into smaller regions later if needed. The idea is to have a large region and a small one and see how they work. Two region GIISes should be set up for the West and three for the East at the beginning. Ian.Bird@cern.ch
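
An illustrative configuration sketch of the proposed two-region layout; the GIIS hosts and site names are placeholders, and only the two-region split with two West and three East GIISes comes from the slide.

```python
# Illustrative two-region information-system configuration.
REGIONS = {
    "west": {"giises": ["west1.example.org", "west2.example.org"],
             "sites": ["site-a", "site-b", "site-c", "site-d", "site-e"]},
    "east": {"giises": ["east1.example.org", "east2.example.org", "east3.example.org"],
             "sites": ["site-f", "site-g"]},
}

# Guard against the MDS scaling problem noted above: keep each region small.
MAX_SITES_PER_REGION = 10

for name, cfg in REGIONS.items():
    assert len(cfg["sites"]) <= MAX_SITES_PER_REGION, f"split region {name!r}"
```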

  21. Plans for remainder of 2003 • Once the service has been deployed, priorities are: • Problem resolution and bug fixing – to address problems of reliability and scalability in existing middleware • Incrementally adding additional functionality • Adding additional sites • Expanding site resources accessible to the grid service • Addressing integration issues • Worker node WAN connectivity, etc. • Developing distributed prototypes of • Operations centres • User support services • To provide reasonable level of global coverage • Improving security model • Developing tools to facilitate operating the service Ian.Bird@cern.ch

  22. Plans for 2003 – 2 Middleware functionality • Top priority is problem resolution and issues of stability/scalability • RLS developments • Distributed service – multiple LRCs, plus the RLI • Later: develop a service to replace the client command-line tools • VOMS service • To permit user- and role-based authorization • Validation of R-GMA • And then deployment of multiple registries – the initial implementation has a single registry • Grid File Access Library • LCG development: POSIX-like I/O layer to provide local file access (conceptual sketch below) • Development of SRM/SE interfaces to other MSS • Work that must happen at each site with an MSS • Basic upgrades • Compiler support • Move to Globus 2.4 (release supported through 2004) • Cut-off for functionality improvements is October – in order to have a stable system for 2004 Ian.Bird@cern.ch
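
To show what a "POSIX-like I/O layer" buys applications, here is a conceptual sketch (not the GFAL interface itself): the same open call whether the replica is local or reachable only through a grid protocol. The scheme-to-handler mapping is a hypothetical placeholder.

```python
# Conceptual sketch: dispatch a POSIX-style open() on the URL scheme.
from urllib.parse import urlparse

def _open_local(url):
    # Direct access to a locally visible replica.
    return open(urlparse(url).path, "rb")

def _open_remote(url):
    # Stand-in for a protocol-specific handler (rfio, gsiftp, ...).
    raise NotImplementedError(f"no {urlparse(url).scheme} handler in this sketch")

_HANDLERS = {"": _open_local, "file": _open_local,
             "rfio": _open_remote, "gsiftp": _open_remote}

def grid_open(url):
    """Return a POSIX-style file object for a local or grid file URL."""
    return _HANDLERS[urlparse(url).scheme](url)

if __name__ == "__main__":
    with grid_open("file:///etc/hostname") as f:
        print(f.read().decode().strip())
```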

  23. Development of LCG middleware [timeline chart] July starting point: as much as feasible – RB, basic RLS, VDT, Globus. Incremental deployment of: VDT upgrade, VOMS, distributed RLS, R-GMA, RH 8.x, gcc 3.2, with continuous bug fixing & re-release. EDG integration ends September. October 1: cut-off defines the functionality for 2004. Ian.Bird@cern.ch

  24. Expansion of LCG resources • Adding new sites • Will be a continuous process as sites are ready to join the service • Expect as a minimum 15 sites (15 countries have committed resources for LCG in 1Q04), reasonable to expect 18-20 sites by end 2003 • LCG team will work directly with Tier 1 (or primary site in a region) • Tier 1s will provide first level support for bringing Tier 2 sites into the service • Once the Tier 1s are stable this can go in parallel in many regions • LCG team will provide 2nd level support for Tier 2s • Increase grid resources available at many sites • Requires LCG to demonstrate utility of service – experiments in agreement with site managers add resources to the LCG service Ian.Bird@cern.ch

  25. Operational plans for 2003 • Security • Develop full security policy • Develop longer term user registration procedures and tools to support it • Develop Acceptable Use policy for longer term – requires legal review • Operations • Develop distributed prototype operations centres/services • Monitoring developments driven by experience • Provide at least 16 hr/day global coverage – problem response • Basic level of resource use accounting – by VO and user • Minimal level of security incident response and coordination • User Support • Development direction depends strongly on experience in the deployed system • Operations and User Support must address the issues of interchanging problem reports – with each other and with sites, network ops, etc. Ian.Bird@cern.ch
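
A toy sketch of the "basic level of resource use accounting – by VO and user" item: aggregate CPU time per VO and per (VO, user) pair from job records. The record format is a stand-in for whatever the batch system or gatekeeper logs actually provide.

```python
# Toy accounting aggregation: total CPU time per VO and per (VO, user).
from collections import defaultdict

def aggregate(records):
    """records: iterable of dicts with 'vo', 'user' and 'cpu_seconds' keys."""
    by_vo = defaultdict(float)
    by_user = defaultdict(float)
    for rec in records:
        by_vo[rec["vo"]] += rec["cpu_seconds"]
        by_user[(rec["vo"], rec["user"])] += rec["cpu_seconds"]
    return by_vo, by_user

if __name__ == "__main__":
    sample = [{"vo": "atlas", "user": "jdoe", "cpu_seconds": 3600.0},
              {"vo": "atlas", "user": "asmith", "cpu_seconds": 1800.0},
              {"vo": "cms", "user": "jdoe", "cpu_seconds": 7200.0}]
    by_vo, by_user = aggregate(sample)
    print(dict(by_vo))
    print(dict(by_user))
```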

  26. Middleware roadmap • Short term (2003) • Use what exists – try and stabilize, debug, fix problems, etc. • Exceptions may be needed – WN connectivity, client tools rather than services, user registration, … • Medium term (2004 - ?) • Same middleware, but develop missing services, remove exceptions • Separate services from WNs – aim for more generic clusters • Initial tests of re-engineered middleware (service based, defined interfaces, protocols) • Longer term (2005? - ) • LCG service based on service definitions, interfaces, protocols, - aim to be able to have interoperating, different implementations of a service Ian.Bird@cern.ch

  27. Inter-operability • Since LCG will be VDT + higher level EDG components: • Sites running same VDT version should be able to be part of LCG, or continue to work as now • LCG (as far as possible) has goal of appearing as a layer of services in front of a cluster, storage system, etc. • State of the art currently implies compromises … Ian.Bird@cern.ch

  28. Integration Issues • LCG will try to be non-intrusive: • Will assume the base OS is already installed • Provide an installation & configuration tool for service nodes • Provide recipes for installation of WNs – assume sites will use existing tools to manage their clusters • No imposition of a particular batch system • As long as your batch system talks to Globus • (OK for LSF, PBS, Condor, BQS, FBSng) • No longer a requirement for a shared filesystem between the gatekeeper and WNs • This was a problem for AFS; NFS does not scale to large clusters • Information publishing • Define what information a site should provide (accounting, status, etc.), rather than imposing tools • But … maybe some compromises in the short term (2003) Ian.Bird@cern.ch

  29. Worker Node connectivity • In general (and eventually) it cannot be assumed that the cluster nodes will have connectivity to remote sites • Many clusters are on non-routed networks (for many reasons) • Security issues • In any case this assumption will not scale • BUT… to operate without WAN connectivity from the worker nodes, several things are necessary: • Some tools (e.g. replica management) must become services (see the sketch below) • Databases (e.g. the conditions DB) must either be replicated to each site (or equivalent), or accessed through a proxy service, or … • Analysis models must take this into account • Again, short-term exceptions (up to a point) are possible • Current additions to LXBatch at CERN have this limitation Ian.Bird@cern.ch
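
A minimal sketch of the "tools must become services" point: worker nodes on a non-routed network call a site-local gateway service, which performs the outbound operation on their behalf. The XML-RPC transport, port and replica-lookup function are illustrative assumptions, not an LCG design.

```python
# Site-local gateway service for worker nodes without WAN connectivity.
from xmlrpc.server import SimpleXMLRPCServer

def list_replicas(lfn):
    # A real service would contact the remote replica catalogue here; the
    # canned answer keeps the sketch self-contained.
    return [f"srm://se.example.org/data/{lfn}"]

def serve(host="0.0.0.0", port=8111):
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    server.register_function(list_replicas)
    server.serve_forever()

# Client side, run on a worker node:
#   import xmlrpc.client
#   gw = xmlrpc.client.ServerProxy("http://gateway.site-a.example.org:8111")
#   replicas = gw.list_replicas("some_dataset_file.root")

if __name__ == "__main__":
    serve()
```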

  30. Timeline for the LCG services [timeline chart, 2003 – 2006] Agree LCG-1 spec → LCG-1 service opens (2003) → stabilize, expand, develop – service for the data challenges, batch analysis and simulation; event simulation productions; CMS DC04 → LCG-2 with upgraded middleware, management etc. → evaluation of 2nd-generation middleware → computing model TDRs and the TDR for Phase 2 → LCG-3 full multi-tier prototype, batch + interactive service – validation of computing models → acquisition, installation and testing of the Phase 2 service → Phase 2 service in production. Ian.Bird@cern.ch

  31. Grid Deployment Organisation [organisation chart] Resource requests from ALICE, ATLAS, CMS and LHCb for resources (compute & storage) and grid infrastructure services. Grid Deployment Board (GDB): policies, strategy, scheduling, standards, recommendations. Grid Deployment Manager and Grid Resource Coordinator. CERN-based teams: LCG security group, LCG operations team, LCG toolkit integration & certification, grid infrastructure team, experiment support team, joint Trillium/EDG/LCG testing team. Anticipated teams at other institutes: regional centre operations, core infrastructure, security, tools, operations call centre, grid monitoring. Ian.Bird@cern.ch

  32. Conclusions • Essential to start operating a service as soon as possible – we need 6 months to be able to develop this to a reasonably stable service • Middleware components are late – but we will still deploy a service of reasonable functionality and scale • Much work will be necessary on testing and improving the basic service • Several functional and operational improvements are expected during 3Q03 • Expansion of sites and resources foreseen during 2003 should provide adequate resources for 2004 data challenges • There are many issues to resolve and a lot of work to do – but this must be done incrementally on the running service Ian.Bird@cern.ch

  33. Conclusions • From the point of view of the LCG plan – we are late in having testable middleware with the functionality that we had hoped for • We will keep to the July deployment schedule • We expect to have the major components – the user view of the middleware (i.e. via the RB) should not change • Expect to be able to do less testing and commissioning than planned • But hopefully, with a suitable process we will incrementally improve & add functionality as it becomes available and tested Ian.Bird@cern.ch
