Service Challenge 4: Preparation, Planning and Outstanding Issues at INFN

  1. Service Challenge 4: Preparation, Planning and Outstanding Issues at INFN Tiziana.Ferrari@cnaf.infn.it Workshop sul Calcolo e Reti dell'INFN Jun 7, 2006

  2. Outline • Service Challenge 4: overall schedule • SC4 service planning at T1 and T2 sites • Results of pre-SC4 testing at INFN • Accounting results • VO Box • Outstanding issues

  3. SC Milestones 2006 • January: SC3 disk repeat – nominal rate (200 MB/s) capped at 150 MB/s • February: CHEP Workshop; T1-T1 Use Cases; SC3 disk-tape repeat (50 MB/s, 5 drives) • March: detailed plan for SC4 service agreed (M/W + DM service enhancements); gLite 3.0 release beta testing • April: SC4 disk-disk (200 MB/s) and disk-tape (75 MB/s) throughput tests; gLite 3.0 release available for distribution • May: installation, configuration and testing of gLite 3.0 release at sites • June: start of SC4 production tests by experiments of 'T1 Use Cases'; T2 Workshop: identification of key Use Cases and Milestones for T2s • July: tape throughput tests at full nominal rates! • August: T2 milestones – debugging of tape results if needed • September: LHCC review – rerun of tape tests if required • October: WLCG service officially opened; capacity continues to build up • November: 1st WLCG 'conference'; all sites have network / tape h/w in production(?) • December: final service / middleware review leading to early 2007 upgrades for LHC data taking?

  4. INFN sites in SC4 • ALICE • Catania • Torino • (Bari TBD) • (Legnaro TBD) • ATLAS • Frascati (in) • Napoli • Milano (sj) • Roma • CMS • Bari (in) • Legnaro • Pisa (sj) • Roma • LHCb • CNAF

  5. SC4 gLite 3.0 Service Planning

  6. SC4 Preparation • Activities: • Mar: installation and testing of gLite 3.0 in the PPS → passed, with the exception of the FTS service (Oracle backend), which is being finalized • Apr, May: testing of the CERN to CNAF disk-disk throughput target rate (200 MB/s) with Castor2 → ongoing • May: testing of the CERN to CNAF disk-tape throughput target rate (75 MB/s) with Castor2 → passed • May 22: start of deployment of gLite 3.0.0 RC5 (released on May 4) at CNAF (all WNs, the gLite CE; the LCG CE is still waiting for upgrade; some UIs, FTS)

  7. Disk – Tape Throughput (CERN – CNAF) • Wed Apr 19 to Thu Apr 27: estimated overall rate to tape 76.94 MB/s • Target: 75 MB/s • Estimated daily rate to tape (MB/s): Wed 19: 5.32, Thu 20: 103.80, Fri 21: 148.00, Sat 22: 25.46, Sun 23: 11.60, Mon 24: 62.50, Tue 25: 62.50, Wed 26: 91.34, Thu 27: 182.00 • 6 tape drives • Long down times of the WAN transfer sessions from CERN to CNAF due to the LSF/Castor2 issues already experienced during the disk-disk throughput phase in April
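
As a quick cross-check, the quoted overall figure is simply the mean of the nine daily estimates; a minimal sketch of the arithmetic (daily values copied from the slide above):

```python
# Daily estimated rates to tape (MB/s), Wed Apr 19 to Thu Apr 27, as listed above.
daily_rates = [5.32, 103.80, 148.00, 25.46, 11.60, 62.50, 62.50, 91.34, 182.00]

average = sum(daily_rates) / len(daily_rates)
print(f"Estimated overall rate to tape: {average:.1f} MB/s")  # ~76.9 MB/s, above the 75 MB/s target
```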

  8. Disk-disk throughput (CERN – CNAF) 1/3 • Apr: LSF issues caused instability and seriously limited the average throughput • May 2: Castor2 upgrade to version 2.0.4-0 • May 3: local write tests and remote transfers (a few concurrent file transfers); results show good local Castor2-to-LSF interaction; power problem in the Tier-1 premises in the afternoon • May 4: transfers re-activated by gradually increasing the number of concurrent transfers in steps of 10 files; throughput increases linearly (10 files → 100 MB/s, 40 files → 200 MB/s) • Problems with the name server DB (this service is shared by Castor1 and Castor2) • Tests with 50 concurrent files, 2 parallel streams: 1800 Mb/s (200 MB/s) • Stable run at 200 MB/s for approximately one and a half days (until May 6, 8 p.m.)

  9. Disk-disk throughput (CERN – CNAF) 2/3 • Average always around 170 MB/s → a bottleneck exists, still to be understood • Daily statistics (May 26)

  10. Disk-disk throughput (CERN – CNAF) 3/3 • RTT: 11.1 ms, default tx/rx socket buffer: 2 MB (tuning of the TCP kernel parameters net.core.rmem_max, net.core.wmem_max, net.core.rmem_default and net.core.wmem_default on the 4 file servers does not improve the overall performance) • From 4 to 5 file servers: no improvement, so back to 4 • Aggregate local write performance: more than 350 MB/s • LSF: increasing the number of slots per file server (i.e. the number of transfer jobs a given file server can receive), no improvement • Next: direct GridFTP sessions to the Castor file servers
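
One plausible reason the buffer tuning had no effect is that, with tens of concurrent transfers, the per-stream bandwidth-delay product at 11.1 ms RTT is far below the 2 MB default buffer. The sketch below works through that arithmetic and shows how the quoted sysctl values would be applied; it is illustrative only (the /proc/sys writes are skipped unless run as root).

```python
# Worked numbers for the buffer tuning discussed above, plus how the quoted
# sysctl values would be applied (illustrative; the /proc/sys writes need root).
import os

RTT_S = 0.0111         # 11.1 ms CERN-CNAF round-trip time
TARGET_MB_S = 200.0    # aggregate disk-disk target rate
CONCURRENT_FILES = 40  # order of magnitude of concurrent transfers used

aggregate_bdp_mb = TARGET_MB_S * RTT_S                          # ~2.2 MB if one stream carried everything
per_stream_bdp_kb = aggregate_bdp_mb / CONCURRENT_FILES * 1024  # ~57 kB per stream
print(f"aggregate BDP ~ {aggregate_bdp_mb:.1f} MB, per-stream BDP ~ {per_stream_bdp_kb:.0f} kB")

# Applying the 2 MB buffer limits mentioned above, equivalent to `sysctl -w net.core.<name>=...`
BUF_BYTES = 2 * 1024 * 1024
if os.geteuid() == 0:  # writing /proc/sys requires root
    for name in ("rmem_max", "wmem_max", "rmem_default", "wmem_default"):
        with open(f"/proc/sys/net/core/{name}", "w") as f:
            f.write(str(BUF_BYTES))
```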

  11. SC4 Accounting • Accounting metrics for SC4 monthly reports: • CPU usage in kSI2K-days (via DGAS) • kSI2K-days = raw_cpu_hours * GlueHostBenchmarkSI00 / (1000 * 24) • GlueHostBenchmarkSI00 can be replaced by a weighted average of the computing power of the Worker Nodes in the farm • Wall-clock time in kSI2K-days (via DGAS) • Disk space used in TB • Disk space allocated in TB • Tape space used in TB • Validation of the raw data gathered, by cross-checking with different tools
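
As a worked example of the conversion above, a minimal sketch (the function name and sample numbers are illustrative, not part of DGAS):

```python
def ksi2k_days(raw_cpu_hours: float, benchmark_si00: float) -> float:
    """Convert raw CPU hours into kSI2K-days using the GlueHostBenchmarkSI00
    value (or a weighted average over the Worker Nodes in the farm)."""
    return raw_cpu_hours * benchmark_si00 / (1000 * 24)

# Example: 5000 raw CPU hours on WNs rated at 1200 SI2K
print(ksi2k_days(5000, 1200))  # 250.0 kSI2K-days
```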

  12. CPU Grid Accounting Results (T2), Jan-Apr 2006. T2 sites: Bari, Catania, Frascati, Legnaro, Napoli, Pisa, Roma1-CMS, Roma1-Atlas, Torino

  13. Disk Space Accounting Results (T2), Jan-Apr 2006

  14. Planning (1/3) • June 01-15: • Ongoing deployment of gLite 3.0.0 RC5 at the T1 • FTS production server now working; deployment of catch-all channels • Start of gLite 3.0.0 RC5 deployment at the other INFN sites • INFN Grid 3.0 to be released soon • SRM not yet deployed at every SC4 site • June 01: start of the service phase (see the following slides) • July: • Testing of concurrent data transfers between the T1 and all SC4 INFN sites • FTS: testing of VO channel shares (see the sketch below) • Upgrade of the existing 1 GE production link at CNAF to 10 GE
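
For context, individual transfers on these FTS channels would be submitted through the gLite FTS command-line client. The sketch below is illustrative only: the endpoint URL and SURLs are placeholders, and the exact client options should be checked against the deployed gLite 3.0 release.

```python
# Illustrative only: submit one transfer job to an FTS server via the gLite CLI
# and poll its status. The endpoint and SURLs are placeholders, not real services.
import subprocess

FTS_ENDPOINT = "https://fts.example.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder
SOURCE = "srm://se.example.cern.ch/castor/cern.ch/sc4/testfile"           # placeholder source SURL
DEST = "srm://se.example.cnaf.infn.it/castor/cnaf.infn.it/sc4/testfile"   # placeholder destination SURL

# Submit the job; glite-transfer-submit prints the job identifier on stdout.
job_id = subprocess.run(
    ["glite-transfer-submit", "-s", FTS_ENDPOINT, SOURCE, DEST],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Query the job state (Submitted, Active, Done, Failed, ...).
status = subprocess.run(
    ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(job_id, status)
```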

  15. Planning (2/3) • ATLAS transfer tests (goal: stability of the DM infrastructure): • From June 19, for 3 weeks • T0 → CNAF (59 MB/s) → T2s (max 20 MB/s, full set of AOD) • Raw data to tape at the T1 • Network reliability still to be evaluated for some T2s • Read performance of Castor2? • Usage of the LFC catalogues at T1 sites • DQ2 to submit, manage and monitor the T1 export: • DQ2 is based on the concept of dataset subscriptions: a site is subscribed by the T0 management system at CERN to a dataset that has been reprocessed • The DQ2 site service running at the site's VO Box then picks up subscriptions and submits and manages the corresponding FTS requests • T2s subscribe to datasets stored at their associated Tier-1, also using DQ2
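
The subscription model described above amounts to a simple loop at the VO Box: poll for new dataset subscriptions and turn each one into FTS requests. The sketch below is purely conceptual and is not the DQ2 API; all class and function names are invented for illustration.

```python
# Conceptual sketch of the subscription-driven transfer model described above.
# NOT the DQ2 API: all names here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Subscription:
    dataset: str                                        # dataset the site has been subscribed to
    source_surls: list = field(default_factory=list)    # files making up the dataset

def fetch_new_subscriptions():
    """Placeholder for the site service polling the central subscription catalogue."""
    return [Subscription("AOD.run001234", ["srm://t0.example/aod/f1", "srm://t0.example/aod/f2"])]

def submit_fts_request(source_surl, dest_surl):
    """Placeholder: in practice this would go through the FTS client (see the earlier sketch)."""
    print(f"FTS request: {source_surl} -> {dest_surl}")

for sub in fetch_new_subscriptions():
    for surl in sub.source_surls:
        dest = surl.replace("t0.example", "t1.example")  # toy source-to-destination mapping
        submit_fts_request(surl, dest)
```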

  16. Planning (3/3) • CMS: load transfer tests (T1 and T2s) from May 26: • T1 → T2: bursty transfers driven by analysis • 20 MB/s aggregate PhEDEx (FTS) traffic to/from temporary disk at each T1 • Currently still based on srmcp → ftscp • SC3 functionality rerun using the gLite infrastructure, trying to reach 25,000 jobs/day aggregated over all sites by the end of June • Bari, Legnaro and Pisa already running jobs • T2 → T1: continuous Monte Carlo transfers • Aggregating to 1 TB/day to CERN • Total of 25M events/month • Test T2 → T1 transfers at 20 MB/s per Tier 2 • Last 2 weeks of July: 150 MB/s of simulated raw data to tape at the T1s, 25 MB/s to CNAF
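
A quick sanity check on the rates quoted above, converting between sustained MB/s and TB/day (plain arithmetic, nothing CMS-specific):

```python
# Convert between sustained transfer rate (MB/s) and daily volume (TB/day, decimal units).
SECONDS_PER_DAY = 86400

def mb_per_s_to_tb_per_day(rate_mb_s: float) -> float:
    return rate_mb_s * SECONDS_PER_DAY / 1e6

def tb_per_day_to_mb_per_s(volume_tb_day: float) -> float:
    return volume_tb_day * 1e6 / SECONDS_PER_DAY

print(f"20 MB/s per T2   ~ {mb_per_s_to_tb_per_day(20):.1f} TB/day sustained")   # ~1.7 TB/day
print(f"1 TB/day to CERN ~ {tb_per_day_to_mb_per_s(1):.1f} MB/s average")        # ~11.6 MB/s
print(f"25 MB/s to CNAF  ~ {mb_per_s_to_tb_per_day(25):.1f} TB/day to tape")     # ~2.2 TB/day
```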

  17. VO Box • “The term VO Box designates a system provided by the Site for use by one or more designated Virtual Organisations (VOs) for interactive access and running persistent processes.” (from VO-Box Security Recommendations and Questionnaire, CERN-LCG-EDMS-639856) • Installed and maintained by the Site Resource Administrators, as the VO Box is part of the trusted network fabric of the Site • Interactive access to the VO Box, and the capability to run services on the system, MUST be limited to a specific, named list of individuals within the VO; these individuals MUST have user-level access only • Well-defined network connectivity requirements (ports in use) • The VO Box is subject to security service challenges

  18. VO Box Task Force: Some Conclusions (from GDB Meeting slides) • Classification of VO services into two types • Class 1: sites are not worried • Class 2: sites want to see them go away ASAP • VO Boxes with purely class 1 services: • Needed by all VOs, high-level distributed VO services • Not really a problem for sites (given the resolution of a few policy and security aspects) • Class 2 services: • Potential threat to operational stability and security • Try to avoid new ones as much as possible • Class 2 services may need to be put somewhere else (otherwise the entire VO Box is class 2)

  19. VO Box Core Services • gsissh or another authenticated way of logging in • Grid client tools (UI) • Proxy renewal service • Shared file system (for ALICE PackMan; class 2, needs replacement ASAP) • Candidates for moving into the LCG stack: • Xrootd • Needs some integration / cooperation with SRM products • Package manager • Incorporate the AliEn PackMan or develop an alternative • MonALISA • Monitoring • Many other longer-term issues

  20. Outstanding issues • Accounting (monthly reports): • CPU usage in kSI2K-days → DGAS • Wall-clock time in kSI2K-days → DGAS • Disk space used in TB • Disk space allocated in TB • Tape space used in TB • Validation of the raw data gathered, by cross-checking with different tools • Monitoring of data transfers: GridView or other tools? • Routing in the LHC Optical Private Network? A backup connection to FZK is becoming urgent • Implementation of an LHC OPN monitoring infrastructure is still in its infancy • SE reliability when running in unattended mode • Several middleware components still require in-depth testing • SE kernel configuration: TCP send/receive socket buffer autotuning only for kernels ≥ 2.4.27 and ≥ 2.6.7 (see the check sketched below) • Which VO Box?
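
A small sketch of the kernel check implied by the SE configuration point, using the version thresholds quoted above (assumes a Linux host; the example release string is only indicative):

```python
# Check whether the running kernel supports TCP socket buffer autotuning,
# using the thresholds quoted above (>= 2.4.27 or >= 2.6.7).
import platform

def supports_tcp_autotuning(release: str) -> bool:
    major, minor, patch = (int(p) for p in release.split("-")[0].split(".")[:3])
    if (major, minor) == (2, 4):
        return patch >= 27
    if (major, minor) == (2, 6):
        return patch >= 7
    return (major, minor) > (2, 6)  # anything newer than 2.6 autotunes

release = platform.release()  # kernel release string, e.g. "2.6.9-42.EL"
print(release, "->", "autotuning available" if supports_tcp_autotuning(release) else "no autotuning")
```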
