
GridPP Deployment Status

GridPP Deployment Status. Steve Traylen (s.traylen@rl.ac.uk), 28th October 2004, GOSC Face to Face, NESC, UK. Contents: GridPP 2 – From Prototype to Production; status of the current operational Grid; middleware components of the GridPP Production System.

Presentation Transcript


  1. GridPP Deployment Status Steve Traylen 28th October 2004 GOSC Face to Face, NESC, UK Steve Traylen s.traylen@rl.ac.uk

  2. Contents • GridPP 2 – From Prototype to Production • Status of the current operational Grid • Middleware components of the GridPP Production System • Future plans and challenges • Summary

  3. The physics driver – the LHC • Scale of the data: 1 Megabyte (1 MB) – a digital photo; 1 Gigabyte (1 GB) = 1000 MB – a DVD movie; 1 Terabyte (1 TB) = 1000 GB – world annual book production; 1 Petabyte (1 PB) = 1000 TB – annual production of one LHC experiment; 1 Exabyte (1 EB) = 1000 PB – world annual information production • 40 million collisions per second • After filtering, 100-200 collisions of interest per second • 1-10 Megabytes of data digitised for each collision = recording rate of 0.1-1 Gigabytes/sec • 10^10 collisions recorded each year = ~10 Petabytes/year of data • Experiments: CMS, LHCb, ATLAS, ALICE (slide credit: les.robertson@cern.ch)
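
  The arithmetic behind these figures can be sanity-checked in a few lines of Python. This is only a back-of-the-envelope sketch: the filtered collision rate and event size come from the slide, while the ~10^7 seconds of effective running per year is an assumed, typical figure not stated here.

    # Rough check of the LHC data-volume figures quoted above.
    collisions_per_s = 150      # after filtering: ~100-200 collisions of interest/s
    mb_per_collision = 5        # ~1-10 MB digitised per collision
    seconds_per_year = 1e7      # assumption: effective running time per year

    rate_gb_s = collisions_per_s * mb_per_collision / 1000.0   # MB/s -> GB/s
    annual_pb = rate_gb_s * seconds_per_year / 1e6             # GB -> PB

    print(f"recording rate ~ {rate_gb_s:.2f} GB/s")      # ~0.75 GB/s, inside the 0.1-1 GB/s range
    print(f"annual volume  ~ {annual_pb:.1f} PB/year")   # ~7.5 PB/year, i.e. ~10 PB/year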

  4. The UK response GridPP • GridPP – A UK Computing Grid for Particle Physics • 19 UK Universities, CCLRC (RAL & Daresbury) and CERN • Funded by the Particle Physics and Astronomy Research Council (PPARC) • GridPP1 – Sept. 2001-2004, £17m, "From Web to Grid" • GridPP2 – Sept. 2004-2007, £16(+1)m, "From Prototype to Production"

  5. Current context of GridPP

  6. Our grid is working … NorthGrid **** Daresbury, Lancaster, Liverpool, Manchester, Sheffield SouthGrid * Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick ScotGrid * Durham, Edinburgh, Glasgow LondonGrid *** Brunel, Imperial, QMUL, RHUL, UCL

  7. … and is part of LCG • Resources are being used for data challenges • Within the UK we have some VO/experiment Memoranda of Understanding in place • Tier-2 structure is working well

  8. Scale • GridPP prototype Grid: > 1,000 CPUs • 500 CPUs at the Tier-1 at RAL • > 500 CPUs at 11 sites across the UK, organised in 4 Regional Tier-2s • > 500 TB of storage • > 800 simultaneous jobs (hyperthreading enabled on some sites) • Integrated with the international LHC Computing Grid (LCG): > 5,000 CPUs, > 4,000 TB of storage, > 85 sites around the world, > 4,000 simultaneous jobs • Monitored via the Grid Operations Centre (RAL) http://goc.grid.sinica.edu.tw/gstat/

  9. Operational status (October)

  10. VOs active

  11. Who is directly involved?

  12. Past upgrade experience at RAL Previously utilisation of new resources grew steadily over weeks or months.

  13. Tier-1 update • 27th-28th July 2004 hardware upgrade • With the Grid we see a much more rapid utilisation of newly deployed resources.

  14. The infrastructure developed in EDG/GridPP1 • User Interface (UI): job submission in JDL; Python CLI by default, Java GUI, APIs in C++, Java and Python • AA server (VOMS) • Resource Broker (C++ Condor matchmaking libraries, Condor-G for submission) • Logging & Bookkeeping: MySQL DB stores job state info • Berkeley Database Information Index (BDII) • Replica catalogue per VO (or equivalent) • Computing Element: Gatekeeper (PBS scheduler), batch workers • Storage Element: GridFTP server; NFS, tape, Castor back ends
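
  As an illustration of the submission path above, here is a minimal sketch of driving the EDG/LCG User Interface from Python. The edg-job-submit command and the JDL attributes are the standard ones, but the executable, sandbox contents and requirement are invented for the example, and the UI client tools are assumed to be installed.

    import subprocess, tempfile

    # A toy JDL job description (the executable and the requirement are
    # illustrative only, not taken from the slides).
    jdl = """
    Executable    = "my_analysis.sh";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"my_analysis.sh"};
    OutputSandbox = {"std.out", "std.err"};
    Requirements  = other.GlueCEStateStatus == "Production";
    """

    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl)
        jdl_file = f.name

    # The UI hands the JDL to the Resource Broker, which matches it against
    # the information system and returns a job identifier for later queries.
    result = subprocess.run(["edg-job-submit", jdl_file],
                            capture_output=True, text=True)
    print(result.stdout)   # contains the https://... job ID on success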

  15. Common Grid Components • LCG uses middleware common to other Grid projects. • VDT (v1.1.14) • Globus Gatekeeper. • Globus MDS. • GlueCE Information Provider. • Used by NGS, Grid3 and NorduGrid. • Preserving this core increases the chances of inter-grid interoperability.
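
  Because the information system is plain LDAP publishing the Glue schema, a site's CE data can be inspected directly. The sketch below assumes ldapsearch is available; the host name is a placeholder, 2170 was the usual BDII port (2135 for the Globus MDS GRIS/GIIS), and the attributes shown are standard Glue ones.

    import subprocess

    # Query the Glue CE objects a site publishes into the information system.
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://bdii.example.ac.uk:2170",    # placeholder host
        "-b", "mds-vo-name=local,o=grid",
        "(objectClass=GlueCE)",
        "GlueCEUniqueID", "GlueCEStateFreeCPUs",
    ]
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)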

  16. Extra Grid Components • LCG extends VDT with fixes and the deployment of other grid services. • This is only done when there is a shortfall or performance issue with the existing middleware. • Most are grid wide services for LCG rather than extra components for sites to install. • Minimise conflicts between grids. • Not always true – see later.

  17. LCG PBSJobManager • Motivation • The standard Globus JobManager starts one perl process per job, queued or running. • One user can easily overload a Gatekeeper. • It also assumes a shared /home file system is present. • Not scalable to 1000s of nodes. • NFS is a single point of failure. • The Resource Broker must poll jobs individually.

  18. LCG PBSJobManager • Solution • The LCG jobmanager stages files to the batch worker with scp and GridFTP. • This creates new problems though: • it is even harder to debug and there is more to go wrong • MPI jobs are more difficult, though an rsync workaround exists.

  19. LCG PBSJobManager • Solution • JobManager starts up a “GridMonitor” on the gatekeeper. • One GridMonitor per Resource Broker is started currently. • Resource Broker communicates with the monitor instead of polling jobs individually. • Moving this to one GridMonitor per user is possible. • Currently deployed at almost all GridPP sites.
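
  A much simplified sketch of the idea behind the GridMonitor follows: one long-lived process per Resource Broker asks the local batch system for all job states at once, instead of the broker polling one perl jobmanager per job. The qstat parsing is simplified and the reporting step is only hinted at; in the real implementation the states are written to files that the RB fetches via GridFTP.

    import subprocess, time

    def poll_pbs_states():
        """Ask PBS once for the state of every job (simplified qstat parsing)."""
        out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
        states = {}
        for line in out.splitlines()[2:]:          # skip the qstat header lines
            fields = line.split()
            if len(fields) >= 5:
                states[fields[0]] = fields[4]      # job id -> Q (queued), R (running), ...
        return states

    # One loop per Resource Broker: a single process replaces one perl
    # jobmanager per queued or running job, so gatekeeper load no longer
    # grows with the number of jobs.
    while True:
        for job_id, state in poll_pbs_states().items():
            print(job_id, state)                   # the real GridMonitor reports via files/GridFTP
        time.sleep(60)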

  20. Storage in LCG • Currently there are three active solutions: • GridFTP servers, the so-called ClassicSE • SRM interfaces at CERN, IHEP (Russia), DESY and RAL (this week) • edg-se – only one, as a front end to the Atlas Datastore tape system at RAL. • The edg-rm and lcg-* commands abstract the end user from these interfaces.

  21. Storage - SRM • SRM = Storage Resource Manager. • Motivation • Sites need to move files around and reorganise data dynamically. • The end user wants/requires a consistent name space for their files. • End users want to be able to reserve this space as well. • SRM will in time be the preferred solution supported within LCG.

  22. SRM Deployment • The current storage solution for LCG is dCache with an SRM interface, produced by DESY and FNAL. • This is deployed at RAL in a test state and is being brought into production, initially for the CMS experiment. • The expectation is that dCache with SRM will provide a solution for many sites. • Edinburgh, Manchester and Oxford are all keen to deploy.
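
  For users the storage back end stays hidden behind the replica management tools mentioned on slide 20. A minimal sketch, assuming the LCG-2 lcg-* client tools are installed; the VO, SE host and file names are placeholders.

    import subprocess

    VO = "dteam"                          # placeholder VO
    SE = "dcache.example.ac.uk"           # placeholder dCache/SRM storage element

    # Copy a local file to the SE and register it in the replica catalogue
    # under a human-readable logical file name (LFN).
    subprocess.run([
        "lcg-cr", "--vo", VO, "-d", SE,
        "-l", "lfn:/grid/dteam/demo/histos.root",
        "file:/tmp/histos.root",
    ], check=True)

    # Any site can later fetch the file by LFN without knowing which SE,
    # or which back end (ClassicSE, dCache/SRM, edg-se), holds the replica.
    subprocess.run([
        "lcg-cp", "--vo", VO,
        "lfn:/grid/dteam/demo/histos.root",
        "file:/tmp/histos_copy.root",
    ], check=True)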

  23. SRM/dCache at RAL

  24. Resource Broker • Allows selection of and submission to sites based on what they publish into the information system. • Queues are published with • Queue lengths • Software available. • Authorised VOs or individual DNs. • The RB can query the replica catalogue to run at a site with a particular file. • Three RBs are deployed in the UK.
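
  Before submitting, a user can ask the broker which published queues a JDL would match; edg-job-list-match is the standard UI command for this, while the specific requirement below is just an example (GlueCEPolicyMaxCPUTime is published in minutes).

    import subprocess, tempfile

    # Ask the Resource Broker which published CE queues satisfy a requirement,
    # without actually submitting anything.
    jdl = """
    Executable   = "/bin/hostname";
    Requirements = other.GlueCEStateFreeCPUs > 0
                   && other.GlueCEPolicyMaxCPUTime > 720;
    """
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl)
        jdl_file = f.name

    match = subprocess.run(["edg-job-list-match", jdl_file],
                           capture_output=True, text=True)
    print(match.stdout)   # one CE id (host:port/jobmanager-pbs-queue) per line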

  25. L&B • L&B = Logging and Bookkeeping Service • Jobs publish their Grid state to L&B • either by calling commands installed on the batch worker • or by GridFTP'ing the job wrapper back. • The second requires no software on batch workers, but the first gives better feedback.
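
  Once a job is submitted, the identifier returned by the UI can be used to query L&B from anywhere; edg-job-status and edg-job-get-logging-info are the standard commands, and the job ID below is a placeholder rather than a real job.

    import subprocess

    # The job ID returned by edg-job-submit is an https URL that points at
    # the L&B server; this one is a made-up placeholder.
    job_id = "https://lb.example.ac.uk:9000/AbCdEfGh1234567890"

    # Current state (Submitted, Waiting, Ready, Scheduled, Running, Done, ...).
    print(subprocess.run(["edg-job-status", job_id],
                         capture_output=True, text=True).stdout)

    # Full event history recorded by the Logging & Bookkeeping service.
    print(subprocess.run(["edg-job-get-logging-info", "-v", "2", job_id],
                         capture_output=True, text=True).stdout)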

  26. Application Installation with LCG • Currently a sub-VO of software managers owns an NFS-mounted space. • The software area is managed by jobs. • Software is validated in the process. • They drop a status file into the area, which is published by the site. • With the RB • end users match jobs to tagged sites • SW managers install SW at non-tagged sites (see the sketch below). • This is being extended to allow DTEAM to install grid client SW on WNs.
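
  The tag dropped by a validated installation ends up as a value of the GlueHostApplicationSoftwareRunTimeEnvironment attribute, which both end users and software managers can match on in their JDL. A small sketch; the tag name is invented for illustration.

    # End-user jobs steer themselves to sites already tagged with the software:
    user_requirement = (
        'Requirements = Member("VO-atlas-release-8.0.5", '
        'other.GlueHostApplicationSoftwareRunTimeEnvironment);'
    )

    # Software managers do the opposite and target the sites NOT yet tagged,
    # so the installation job runs where the software is still missing:
    sw_manager_requirement = (
        'Requirements = !Member("VO-atlas-release-8.0.5", '
        'other.GlueHostApplicationSoftwareRunTimeEnvironment);'
    )

    print(user_requirement)
    print(sw_manager_requirement)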

  27. R-GMA • Developed by GridPP within both EDG and now EGEE. • Takes the role of a grid-enabled SQL database. • Example applications include CMS and D0 publishing their job bookkeeping. • Can also be used to transport the Glue values and allows SQL lookups of Glue. • R-GMA is deployed at most UK HEP sites. • RAL currently runs the single instance of the R-GMA registry.
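
  A conceptual sketch of what "grid-enabled SQL database" means in practice: Glue data published into R-GMA can be queried with plain SQL. The helper below is a hypothetical stand-in, not the real R-GMA API (which provided Java, C++ and Python bindings plus command-line tools), and the table and column names are assumed to follow the Glue schema as republished into R-GMA.

    # rgma_query() is a hypothetical placeholder: a real consumer would contact
    # the R-GMA registry (the single instance run at RAL) and stream back the
    # tuples that match the SQL query.
    def rgma_query(sql):
        return []   # placeholder only, no real R-GMA connection is made here

    # Which computing elements currently advertise free CPUs?
    sql = """
        SELECT UniqueID, FreeCPUs
        FROM   GlueCE
        WHERE  FreeCPUs > 0
    """
    for unique_id, free_cpus in rgma_query(sql):
        print(unique_id, free_cpus)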

  28. Next LCG Release • LCG 2_3_0 is due now. • Built entirely on SL3 (RHEL3 clone). • RH7.3 still an option. • Many stability improvements. • Addition of an accounting solution. • Easier addition of VOs. • Addition of dCache/SRM. • and lots more… • This release will last into next year. • Potentially the last release before gLite components appear.

  29. There are still challenges • Middleware validation • Meeting experiment requirements with the Grid • Distributed file (and sub-file) management • Experiment software distribution • Production accounting • Encouraging an open sharing of resources • Smoothing deployment and service upgrades. • Security

  30. Middleware validation • This is starting to be addressed through a Certification and Testing testbed. [Diagram: the certification flow from JRA1 development and integration (unit and functional testing, development tag) through SA1 certification testing (certification matrix, C&T suites, site suites, certified release tag), application integration (HEP experiments, BIO-MED, others TBD; applications software installation), deployment preparation (deployment release tag) and pre-production to production (production tag).] • RAL is involved with both the JRA1 and Pre-Production systems.

  31. Software distribution • ATLAS Data Challenges validate the world-wide computing model: step 1, Monte Carlo data challenges; step 2, real data. [Diagram: the ATLAS data flow from physics models and Monte Carlo truth data through detector simulation, the trigger system and data acquisition, the level-3 trigger, MC and real raw data, calibration data and run conditions, and reconstruction, to (MC) event summary data (ESD) and (MC) event tags.] • Packaging, distribution and installation – scale: one release build takes 10 hours and produces 2.5 GB of files. • Complexity: 500 packages, millions of lines of code, 100s of developers and 1000s of users. • The ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software. • Needs 'push-button' easy installation.

  32. Summary • The Large Hadron Collider data volumes make Grid computing a necessity • GridPP1 with EDG developed a successful Grid prototype • GridPP members have played a critical role in most areas – security, workload management, information systems, monitoring & operations. • GridPP involvement continues with the Enabling Grids for e-SciencE (EGEE) project – driving the federation of Grids • As we move towards a full production service we face many challenges in areas such as deployment, accounting and true open sharing of resources

  33. Useful links GRIDPP and LCG: • GridPP collaboration http://www.gridpp.ac.uk/ • Grid Operations Centre (inc. maps) http://goc.grid-support.ac.uk/ • The LHC Computing Grid http://lcg.web.cern.ch/LCG/ Others • PPARC http://www.pparc.ac.uk/Rs/Fs/Es/intro.asp • The EGEE project http://cern.ch/egee/ • The European Data Grid final review http://eu-datagrid.web.cern.ch/eu-datagrid/
