
LHCC Comprehensive Review 19-20 November 2007




Presentation Transcript


  1. Planning and Communication LHCC Comprehensive Review 19-20 November 2007

  2. Planning and Reporting Tools (until mid-2007)
     • Milestones Plans
       • for Sites, Areas, Projects and Experiments
       • including the Tier-1 regional centers
     • Level 1 Milestones Reports
     • Quarterly Reports
       • prepared by each site/project/experiment every quarter
       • all milestones due or late are commented on in the report
       • projects need to “fill in” the Quarterly Report
         • provide a summary of progress
         • highlight problems (and issues with other projects)
         • add future milestones
     • Meetings and Communication
       • LCG/EGEE/OSG Operations Meeting
       • Experiment Coordination and Service Coordination Meetings
       • WLCG Bulletin

  3. High Level Milestones (until 2007)

  4. Quarterly Reports in 2007:
     • High Level Milestones +
     • LCG Services +
     • GDB +
     • 12 Sites +
     • 6 Projects/Areas +
     • 4 Experiments
     Now we are in a different phase of the project and can focus on:
     • Common Milestones for all Sites
     • Common Metrics
       • Transfers
       • Availability/Reliability
       • Job Success
     • Automation and Monitoring

  5. Planning and Communication (recent changes)
     • Planning
       • Milestones Dashboard
       • Specific plans for Areas and Projects
     • Metrics
       • Site Reliability
       • Job Efficiency
     • Monitoring
       • GridView
       • Monitoring tools
     • Communication
       • Meetings, Bulletin
     • Reporting
       • (Simplified) Quarterly Reports

  6. High Level Milestones Dashboard
     • We are now in a different phase compared to 2005-2007, when each site had different preparations to implement and therefore different milestones
       • e.g. installations, infrastructure, networking, buildings, etc.
     • Each site had its own Milestones Plan and a Quarterly Report focusing on that site's specific milestones and progress.
     • On several occasions the Referees had expressed interest in a higher-level overview of the milestones across all sites.
     • Now the services are installed, so common milestones can be defined that should be met by all sites
       • e.g. DB Services, gLite Services (or equivalent from other middleware), SRM Services, 24x7 Support, VO Box Support, etc.
     • A new High Level Milestones Dashboard has been introduced, with milestones across all sites
       • Green = “Done”, Orange = “Late < 1 Month”, Red = “Late > 1 Month”
     • This new representation is very clear and is reviewed monthly at the MB Meetings.
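The dashboard's color coding can be sketched as a small classifier. This is only an illustration of the legend given on the slide; the helper name, the exact 30-day threshold, and the "pending" state are assumptions, since the slides give only the three colors.

```python
from datetime import date
from typing import Optional

def milestone_status(due: date, done: Optional[date], today: date) -> str:
    """Hypothetical sketch of the dashboard legend:
    green = done, orange = late by less than a month,
    red = late by a month or more."""
    if done is not None:
        return "green"
    days_late = (today - due).days
    if days_late <= 0:
        return "pending"  # not yet due; no color in the slide's legend
    return "orange" if days_late < 30 else "red"
```

For example, a milestone due 1 November 2007 and still open at the time of this review would show orange; one due in August would show red.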

  7. Sites Milestones

  8. Site Availability and Reliability Metrics
     • The SAM system has been developed to provide Site Availability Monitoring
       • Tests the Services at the Tier-0 and Tier-1 Sites
         • e.g. CE, SE, SRM, Data Transfers, Certificates, etc.
       • Is extensible to more tests and also to VO-specific tests
       • Can check different implementations depending on the site and VO (e.g. EGEE, OSG, NDGF services, etc.)
     • Critical and non-critical tests have been developed for the general tests (OPS VO) and for the Experiments (ALICE, ATLAS, CMS, LHCb VOs).
     • Downtimes are commented on weekly in the Operations Meeting reports.
     • Since the beginning of 2007 we have used the SAM data to review the reliability of the sites.
     • Targets have been set:
       • 88% (Jan 07), 91% (Jun 07), 93% (Dec 07)
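The distinction between availability and reliability behind these targets can be sketched as follows. This is a simplified model and the function names are illustrative; the actual WLCG figures are computed per service from the SAM test results, not from aggregate hours.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the period during which the site passed the SAM tests."""
    return up_hours / total_hours

def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Like availability, but scheduled (announced) downtime is excluded
    from the denominator, so planned maintenance is not penalized."""
    return up_hours / (total_hours - scheduled_down_hours)
```

For a 30-day month (720 h) with 648 h up and 24 h of announced maintenance, availability is 90% while reliability is about 93%, i.e. on the December 2007 target.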

  9. Site Reliability, Last 6 Months: CA-TRIUMF, CERN, IT-INFN-CNAF, ES-PIC, US-FNAL-CMS, FR-CCIN2P3 (IN2P3), NL-T1 (SARA-NIKHEF), NDGF, TW-ASGC, DE-KIT (GridKa/FZK), UK-T1-RAL, US-T1-BNL http://cern.ch/LCG/MB/availability/site_reliability.pdf

  10. Site Reliability (Every Month):
      • CERN 99%
      • DE-KIT (GridKa/FZK) 76%
      • FR-CCIN2P3 90%
      • UK-T1-RAL 95%
      • NL-T1 (SARA-NIKHEF) 89%
      • IT-INFN-CNAF 97%
      • CA-TRIUMF 91%
      • US-FNAL-CMS 75%
      • TW-ASGC 51%
      • ES-PIC 96%
      • US-T1-BNL 89%
      • NDGF 89%

  11. Monthly Reliability of Tier-0, Tier-1 Sites, January - October 2007
      • Avg. 8 best sites: Apr 92%, May 94%, Jun 87%, Jul 93%, Aug 94%, Sep 93%, Oct 93%
      • Avg. all sites: Apr 89%, May 89%, Jun 80%, Jul 89%, Aug 88%, Sep 89%, Oct 86%
      • * BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade
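The two averages on this slide (best 8 sites vs. all sites) can be reproduced from per-site monthly figures. A sketch, using the per-site percentages from slide 10 as input (the helper name is illustrative, and treating those figures as an October-style month is an assumption):

```python
def monthly_averages(site_reliability: dict, n_best: int = 8):
    """Return (average of the n_best most reliable sites, average of all sites)."""
    vals = sorted(site_reliability.values(), reverse=True)
    best = vals[:n_best]
    return sum(best) / len(best), sum(vals) / len(vals)

# Per-site reliability (%) as shown on slide 10
month = {
    "CERN": 99, "IT-INFN-CNAF": 97, "ES-PIC": 96, "UK-T1-RAL": 95,
    "CA-TRIUMF": 91, "FR-CCIN2P3": 90, "NL-T1": 89, "US-T1-BNL": 89,
    "NDGF": 89, "DE-KIT": 76, "US-FNAL-CMS": 75, "TW-ASGC": 51,
}
best8, all_sites = monthly_averages(month)
```

These inputs give roughly 93% for the best 8 sites and 86% over all 12, consistent with the October figures quoted above.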

  12. Site Availability and Reliability Reports
      • Every week the Sites report on unavailability at the Operations Meeting
        • explaining the problem, the solution found, and the severity of the downtime
      • The SAM tests are executed automatically and provide an objective (although not perfect) view of which services work at the sites
        • Critical and non-critical tests are added to improve the verification
        • They are executed on all sites but can be adapted to the specific Services a site runs (e.g. ARC at NDGF instead of gLite)
      • VOs can add their own tests to check what interests them or to verify their own systems (e.g. PhEDEx, DIRAC, etc.)
        • The VOs can also choose which sites to check
      Note: The VO-specific SAM results are not yet published; Experiments and Sites are still working out the problems with the tests.

  13. Comparison with VO-Specific SAM Tests September 2007

  14. Monitoring and Reporting Tools
      • GridView
        • GridView is a monitoring and visualization tool being developed to provide a high-level view of various functional aspects of the Worldwide LHC Computing Grid (WLCG).
        • It currently shows statistics on data transfers, running jobs, and service availability for the WLCG.
        • It shows the SAM results by accessing the SAM database, so one can find out exactly which test failed on which host.
        • A GUI allows selecting Tier-1s, Tier-2s, VOs, and many display options.
      • Grid Monitoring Working Group (ongoing)
        • Common definitions for sensors and metrics
        • Interface between a site and the grid monitoring fabric
        • Allow sites within different grid infrastructures to publish and consume the monitoring data
        • Provide views of the system (“dashboards”) adapted to each of the stakeholder communities

  15. GridView

  16. http://gridview.cern.ch

  17. New Quarterly Reports
      • The new Quarterly Reports will be simplified to report only the High Level Milestones and Metrics for each of the Sites.
      • Projects and Areas will still have dedicated milestone plans because there are no commonalities.
      • Experiments' progress is presented at the MB and summarized.
      • Sites will be asked to comment on late milestones or performance below targets
        • i.e. if a site is above its targets and milestones and is all “green”, it will have nothing else to report.
      • Proposed by the MB and accepted by the Overview Board in October 2007.

  18. Next Steps: Job Efficiency
      • The Site Reliability tests show only whether the Services are running
        • They are a necessary condition for the Experiments' applications to run
        • But one also needs to verify the success rates of REAL Experiment jobs at the Sites
      • Experiments monitor and display the execution of their jobs at the sites (e.g. ARDA Dashboard) and have specific job submission and control systems
        • ALICE Agent, ATLAS Ganga, CMS CRAB, LHCb Pilot
        • with specific verification to check exit status and confirm the success/failure of the jobs
      • This data is used to calculate the Site Job Efficiency.
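A minimal sketch of the metric, assuming efficiency is simply the fraction of jobs whose verification reports success; the real systems listed above apply experiment-specific checks beyond the exit status, and the function name is an assumption:

```python
def job_efficiency(job_outcomes) -> float:
    """Fraction of jobs verified as successful.
    job_outcomes: iterable of booleans, True = job succeeded."""
    outcomes = list(job_outcomes)
    if not outcomes:
        return 0.0  # no jobs ran; report zero rather than divide by zero
    return sum(outcomes) / len(outcomes)
```

For example, a site where 3 of 4 submitted jobs pass the verification would report an efficiency of 0.75 for that sample.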

  19. LCG Bulletins https://cern.ch/twiki/bin/view/LCG/LcgBulletins

  20. Summary
      Status
      • Services are in place and equipment is installed, so Monitoring and Metrics are now more appropriate
      • Added Metrics for Reliability, Accounting, and (soon) Job Efficiency
      • Dedicated projects have specific milestones (DB, SRM, CCRC, etc.)
      Reporting
      • Milestones Dashboard and Quarterly Reports (simplified)
      Monitoring
      • Information is displayed in a better way (dashboards, targets, colors, etc.)
      • Site reliability available online, with weekly reporting and MB review
      Communication
      • Communication tools unchanged: Meetings (Operations, Services, Experiments) and Bulletin
      Next Steps
      • Success rates and Job Efficiency for the Experiments' applications
      WEB: http://cern.ch/LCG/planning
      WIKI: https://cern.ch/twiki/bin/view/LCG/Planning

  21. Backup Slides

  22. Job Efficiency Table (data below is preliminary)
