Planning and Communication LHCC Comprehensive Review 19-20 November 2007
Planning and Reporting Tools (until mid-2007)
• Milestones Plans
  • for Sites, Areas, Projects and Experiments, including the Tier-1 regional centers
• Level 1 Milestones Reports
• Quarterly Reports
  • prepared by each site/project/experiment every quarter
  • all milestones due or late are commented on in the report
  • projects fill in the Quarterly Report to provide a summary of progress, highlight problems (and issues with other projects), and add future milestones
• Meetings and Communication
  • LCG/EGEE/OSG Operations Meeting
  • Experiment Coordination and Service Coordination Meetings
  • WLCG Bulletin
Quarterly Reports in 2007
• High Level Milestones + LCG Services + GDB + 12 Sites + 6 Projects/Areas + 4 Experiments
• Now we are in a different phase of the project and can focus on:
  • Common milestones for all sites
  • Common metrics: transfers, availability/reliability, job success
  • Automation and monitoring
Planning and Communication (recent changes)
• Planning: Milestones Dashboard; specific plans for Areas and Projects
• Metrics: Sites Reliability; Job Efficiency
• Monitoring: GridView; monitoring tools
• Communication: Meetings, Bulletin
• Reporting: (simplified) Quarterly Reports
High Level Milestones Dashboard
• We are now in a different phase compared to 2005-2007, when each site had different preparations to implement and therefore different milestones
  • e.g. installations, infrastructure, networking, buildings, etc.
  • Each site had its own Milestones Plan and a Quarterly Report focusing on that site's specific milestones and progress.
• On several occasions the Referees expressed interest in a higher-level overview of the milestones across all sites.
• Now that the services are installed, common milestones can be defined that all sites should meet
  • e.g. DB Services, gLite Services (or the equivalent in other middleware), SRM Services, 24x7 Support, VO Box Support, etc.
• A new High Level Milestones Dashboard has been introduced, with milestones across all sites
  • Green = "Done", Orange = "Late < 1 month", Red = "Late > 1 month" (a minimal sketch of this color coding follows below)
• This new representation is very clear and is reviewed monthly at the MB Meetings.
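A minimal sketch of the dashboard color coding in Python. The milestone fields, the 30-day reading of "one month", and the handling of not-yet-due milestones are assumptions for illustration; the actual dashboard is maintained by the project office.

```python
from datetime import date, timedelta

def milestone_color(due, done=None, today=None):
    """Return the dashboard color for one site milestone (illustrative rules)."""
    today = today or date.today()
    if done is not None:
        return "green"                        # Done
    if today <= due:
        return "none"                         # not yet due, no color
    if today - due <= timedelta(days=30):     # assumption: "1 month" = 30 days
        return "orange"                       # Late < 1 month
    return "red"                              # Late > 1 month

# Example: a milestone due 1 Oct 2007 and still open on 20 Nov 2007 shows red.
print(milestone_color(date(2007, 10, 1), today=date(2007, 11, 20)))  # red
```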
Sites Milestones
Sites Availability and Reliability Metrics
• The SAM system has been developed to provide Site Availability Monitoring
  • It tests the services at the Tier-0 and Tier-1 sites (e.g. CE, SE, SRM, data transfers, certificates, etc.)
  • It is extensible to more tests, including VO-specific tests
  • It can check different implementations depending on the site and VO (e.g. EGEE, OSG, NDGF services, etc.)
• Critical and non-critical tests have been developed for the general tests (OPS VO) and for the Experiments (ALICE, ATLAS, CMS, LHCb VOs).
• Downtimes are commented on weekly in the Operations Meeting reports.
• Since the beginning of 2007 we use the SAM data to review the reliability of the sites; a sketch of the availability/reliability calculation follows below.
  • Targets have been set: 88% (Jan 07), 91% (Jun 07), 93% (Dec 07)
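As an illustration of how these percentages are derived, here is a minimal Python sketch assuming the usual WLCG definitions: availability counts all downtime against the site, while reliability excludes scheduled downtime from the denominator. The variable names and example numbers are ours, not SAM's.

```python
def availability(up_hours, total_hours):
    """Fraction of time the site passed its critical tests."""
    return up_hours / total_hours

def reliability(up_hours, total_hours, scheduled_down_hours):
    """Same, but scheduled downtime is not held against the site."""
    return up_hours / (total_hours - scheduled_down_hours)

# Example month of 720 hours, with 36 hours of scheduled downtime
# and 660 hours during which the SAM tests succeeded:
up, total, sched = 660.0, 720.0, 36.0
print(f"availability = {availability(up, total):.0%}")        # 92%
print(f"reliability  = {reliability(up, total, sched):.0%}")  # 96% (Dec 07 target: 93%)
```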
[Plot: site reliability over the last 6 months for CA-TRIUMF, CERN, IT-INFN-CNAF, ES-PIC, US-FNAL-CMS, FR-CCIN2P3 (IN2P3), NL-T1 (SARA-NIKHEF), NDGF, TW-ASGC, DE-KIT (GridKa/FZK), UK-T1-RAL and US-T1-BNL. See http://cern.ch/LCG/MB/availability/site_reliability.pdf]
Site reliability for the latest month (these figures are produced every month):
• CERN 99%
• IT-INFN-CNAF 97%
• ES-PIC 96%
• UK-T1-RAL 95%
• CA-TRIUMF 91%
• FR-CCIN2P3 90%
• NL-T1 (SARA-NIKHEF) 89%
• NDGF 89%
• US-T1-BNL 89%
• DE-KIT (GridKa/FZK) 76%
• US-FNAL-CMS 75%
• TW-ASGC 51%
Monthly Reliability of Tier-0, Tier-1 Sites, January - October 2007
                    Apr   May   Jun   Jul   Aug   Sept  Oct
Avg. 8 best sites   92%   94%   87%   93%   94%   93%   93%
Avg. all sites      89%   89%   80%   89%   88%   89%   86%
* BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade
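To make the two averages concrete, here is an illustrative Python computation using the per-site figures quoted above as a sample month; the "8 best sites" average is simply the mean of the top 8 values.

```python
# Per-site reliability figures from the monthly table above (sample month).
reliability = {
    "CERN": 0.99, "IT-INFN-CNAF": 0.97, "ES-PIC": 0.96, "UK-T1-RAL": 0.95,
    "CA-TRIUMF": 0.91, "FR-CCIN2P3": 0.90, "NL-T1 (SARA-NIKHEF)": 0.89,
    "NDGF": 0.89, "US-T1-BNL": 0.89, "DE-KIT (GridKa/FZK)": 0.76,
    "US-FNAL-CMS": 0.75, "TW-ASGC": 0.51,
}

values = sorted(reliability.values(), reverse=True)
avg_all = sum(values) / len(values)
avg_best8 = sum(values[:8]) / 8

print(f"avg. all sites:    {avg_all:.0%}")    # 86%
print(f"avg. 8 best sites: {avg_best8:.0%}")  # 93%
```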
Sites Availability and Reliability Reports
• Every week the sites report on unavailability at the Operations Meeting, explaining the problem, the solution found and the severity of the downtime.
• The SAM tests are executed automatically and provide an objective (although not perfect) view of which services work at the sites.
• Critical and non-critical tests are added to improve the verification.
• The tests are executed on all sites, but can be adapted to the specific services a site runs (e.g. ARC at NDGF instead of gLite); a sketch of this follows below.
• VOs can add their own tests to check what interests them, or add verifications of their systems (e.g. PhEDEx, DIRAC, etc.)
• The VOs can also choose which sites to check.
Note: The VO-specific SAM results are not yet published; Experiments and sites are still working through problems with the tests.
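A minimal sketch of how the same logical check can map to different service implementations per site, as described above. The mapping table and test names are assumptions for illustration, not actual SAM test identifiers.

```python
# Default test for the gLite CE; NDGF is mapped to an ARC equivalent.
DEFAULT_CE_TEST = "gLite-CE-job-submit"
SITE_CE_TEST = {
    "NDGF": "ARC-CE-job-submit",   # NDGF runs ARC rather than the gLite CE
}

def ce_test_for(site):
    """Return the CE job-submission test to execute at a given site."""
    return SITE_CE_TEST.get(site, DEFAULT_CE_TEST)

print(ce_test_for("NDGF"))       # ARC-CE-job-submit
print(ce_test_for("UK-T1-RAL"))  # gLite-CE-job-submit
```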
Monitoring and Reporting Tools
• GridView
  • GridView is a monitoring and visualization tool being developed to provide a high-level view of various functional aspects of the Worldwide LHC Computing Grid (WLCG).
  • It currently shows statistics on data transfers, running jobs and service availability for the WLCG.
  • It shows the SAM results by accessing the SAM database, so one can find out exactly which test has failed on which host.
  • A GUI allows selecting Tier-1s, Tier-2s, VOs and many display options.
• Grid Monitoring Working Group (ongoing)
  • Common definitions for sensors and metrics
  • Interface between a site and the grid monitoring fabric
  • Allow sites within different grid infrastructures to publish and consume the monitoring data; a sketch of such a common record follows below
  • Provide views of the system ("dashboards") adapted to each of the stakeholder communities
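To illustrate the "common definitions" idea, here is a sketch of a single monitoring record that any site (EGEE, OSG or NDGF) could publish and any dashboard consume. The field names are our assumptions for illustration, not an agreed working-group schema.

```python
import json
import time

def make_metric_record(site, service, metric, status, detail=""):
    """Build one monitoring record in a single shared format (assumed fields)."""
    return {
        "site": site,          # e.g. "UK-T1-RAL"
        "service": service,    # e.g. "CE", "SE", "SRM"
        "metric": metric,      # name of the individual test or sensor
        "status": status,      # e.g. "OK", "WARNING", "CRITICAL"
        "timestamp": int(time.time()),
        "detail": detail,      # free-text diagnostic shown on the dashboard
    }

record = make_metric_record("UK-T1-RAL", "SRM", "srm-put", "OK")
print(json.dumps(record))  # any consumer can parse this uniformly
```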
New Quarterly Reports
• The new Quarterly Reports will be simplified to report only the High Level Milestones and Metrics for each of the sites.
• Projects and Areas will still have dedicated milestone plans because there are no commonalities between them.
• Experiments' progress is presented at the MB and summarized.
• Sites will be asked to comment on late milestones or on performance below targets
  • i.e. a site that is above targets and milestones and is all "green" will have nothing else to report.
• Proposed by the MB and accepted by the Overview Board in October 2007.
Next Steps: Job Efficiency
• The site reliability tests show only whether the services are running
  • They are a necessary condition for the Experiments' applications to run.
• But one also needs to verify the success rates of real Experiment jobs at the sites.
• The Experiments monitor and display the execution of their jobs at the sites (e.g. the ARDA Dashboard) and have their own job submission and control systems
  • ALICE Agent, ATLAS Ganga, CMS CRAB, LHCb Pilot
  • with specific checks on exit status to verify the success or failure of the jobs.
• This data is used to calculate the Site Job Efficiency, sketched below.
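A minimal sketch of the Site Job Efficiency metric: the fraction of real experiment jobs that finish successfully at each site. The job-record shape and the exit-status convention are illustrative assumptions; in practice this data comes from the experiments' own submission systems via the dashboards.

```python
from collections import defaultdict

def site_job_efficiency(jobs):
    """jobs: iterable of (site, exit_status) pairs; exit status 0 means success."""
    total = defaultdict(int)
    ok = defaultdict(int)
    for site, status in jobs:
        total[site] += 1
        if status == 0:
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}

# Example: 4 jobs at one site, 3 successful -> 75% efficiency.
jobs = [("UK-T1-RAL", 0), ("UK-T1-RAL", 0), ("UK-T1-RAL", 1), ("UK-T1-RAL", 0)]
print(site_job_efficiency(jobs))  # {'UK-T1-RAL': 0.75}
```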
LCG Bulletins: https://cern.ch/twiki/bin/view/LCG/LcgBulletins
Summary
Status
• Services are in place and equipment is installed, so Monitoring and Metrics are now more appropriate than per-site milestones
• Added metrics for Reliability, Accounting and (soon) Job Efficiency
• Dedicated projects keep their specific milestones (DB, SRM, CCRC, etc.)
Reporting
• Milestones Dashboard and (simplified) Quarterly Reports
Monitoring
• Information is displayed in a better way (dashboards, targets, colors, etc.)
• Site reliability is available online, reported weekly and reviewed by the MB
Communication
• Communication tools are unchanged: Meetings (Operations, Services, Experiments) and the Bulletin
Next Steps
• Success rates and Job Efficiency for the Experiments' applications
WEB: http://cern.ch/LCG/planning
WIKI: https://cern.ch/twiki/bin/view/LCG/Planning