
Large scale data flow in local and GRID environment

Large scale data flow in local and GRID environment. Viktor Kolosov (ITEP Moscow) Ivan Korolko (ITEP Moscow). Research objectives. Plans: Large scale data flow simulation in local and GRID environment. Done: Large scale data flow optimization in realistic DC environment (ALICE and LHCb).


Presentation Transcript


  1. Large scale data flow in local and GRID environment. Viktor Kolosov (ITEP Moscow), Ivan Korolko (ITEP Moscow)

  2. Research objectives
     • Plans: large scale data flow simulation in local and GRID environment
     • Done: large scale data flow optimization in a realistic DC (Data Challenge) environment (ALICE and LHCb)
     • More interesting, more useful (hopefully)

  3. ITEP LHC computer farm (1): main components
     • A. Selivanov (ITEP-ALICE), head of the ITEP-LHC farm
     • 64 Pentium IV PC modules (01.01.2004)

  4. ITEP LHC computer farm (2)
     BATCH nodes: 20 (LCG test) + 44 (DCs)
     • CPU: 64 PIV-2.4GHz (hyperthreading)
     • RAM: 1 GB
     • Disks: 80 GB
     Mass storage
     • Disk servers: 6 x 1.6 TB + 1 x 1.0 TB + 1 x 0.5 TB
     Network
     • 100 Mbit/s
     • 2-3 Mbit/s to CERN

  5. ITEP LHC FARM usage in 2004
     • Monitoring available at http://egee.itep.ru
     • Main ITEP players in 2004 – ALICE and LHCb

  6. ALICE DC
     Goals
     • Determine readiness of the off-line framework for data processing
     • Validate the distributed computing model
     • PDC’2004: 10% test of the final capacity
     • PDC’04 physics: hard probes (jets, heavy flavours) & pp physics
     Strategy
     • Part 1: underlying (background) events (March-July)
       • Distributed simulation
       • Data transfer to CERN
     • Part 2: signal events & test of CERN as data source (July-November)
       • Distributed simulation, reconstruction, generation of ESD
     • Part 3: distributed analysis
     Tools
     • AliEn – Alice Environment for the distributed computing
     • AliEn – LCG Interface

  7. LHCb DC
     Physics Goals (170M events)
     1. HLT studies
     2. S/B studies, consolidate background estimates, background properties
     Gather information for the LHCb computing TDR
     ● Robustness test of the LHCb software and production system
     ● Test of the LHCb distributed computing model
     ● Incorporation of the LCG application software
     ● Use of LCG as a substantial fraction of the production capacity
     Strategy:
     • MC Production (April-September)
     • Stripping (event preselection), still going on
     • Analysis

  8. Details
     ALICE (AliEn)
     • 1 job – 1 event
     • Raw event size: 2 GB
     • ESD size: 0.5-50 MB
     • CPU time: 5-20 hours
     • RAM usage: huge
     • Store local copies
     • Backup sent to CERN
     • Massive data exchange with disk servers
     LHCb (DIRAC)
     • 1 job – 500 events
     • Raw event size: ~1.3 MB
     • DST size: 0.3-0.5 MB
     • CPU time: 28-32 hours
     • RAM usage: moderate
     • Store local copies of DSTs
     • DSTs and LOGs sent to CERN
     • Frequent communication with central services
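The per-job figures on this slide can be turned into rough data volumes and transfer times. The sketch below is only back-of-the-envelope arithmetic, not something from the talk: it assumes the LHCb sizes are per event, that 1 MB = 10^6 bytes and 1 Mbit = 10^6 bits, and that the 2-3 Mbit/s figure on the farm slide is the link to CERN. The function names are hypothetical.

```python
def lhcb_job_volume_mb(events=500, raw_event_mb=1.3, dst_event_mb=(0.3, 0.5)):
    """Return (raw total, (DST low, DST high)) in MB for one LHCb job.

    Defaults are the slide's figures, assumed here to be per-event sizes.
    """
    raw_total = events * raw_event_mb
    dst_low = events * dst_event_mb[0]
    dst_high = events * dst_event_mb[1]
    return raw_total, (dst_low, dst_high)


def transfer_hours(size_mb, link_mbit_s):
    """Hours needed to move size_mb megabytes over a link of link_mbit_s.

    Ignores protocol overhead and link sharing, so it is a lower bound.
    """
    seconds = size_mb * 8.0 / link_mbit_s
    return seconds / 3600.0


if __name__ == "__main__":
    raw, (dst_lo, dst_hi) = lhcb_job_volume_mb()
    print(f"LHCb raw data per job: {raw:.0f} MB")
    print(f"LHCb DST per job: {dst_lo:.0f}-{dst_hi:.0f} MB")
    # Best case: small DST on the faster link; worst case: large DST on the slower link.
    print(f"DST transfer: {transfer_hours(dst_lo, 3.0):.2f}-{transfer_hours(dst_hi, 2.0):.2f} h")
    # One ALICE raw event (2 GB = 2000 MB) over the same assumed link.
    print(f"ALICE raw event: {transfer_hours(2000.0, 3.0):.2f}-{transfer_hours(2000.0, 2.0):.2f} h")
```

Under these assumptions a 250 MB DST set needs about 1000 s on a 2 Mbit/s link, which is small next to the 28-32 hours of CPU time per job, while a single 2 GB ALICE raw event would occupy the same link for a couple of hours.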

  9. Optimization
     April – start of massive LHCb DC
     • 1 job/CPU – everything OK
     • Use hyperthreading (2 jobs/CPU): efficiency increased by 30-40%
     May – start of massive ALICE DC
     • Bad interference with LHCb jobs
     • Frequent NFS crashes
     • Restrict ALICE queue to 10 simultaneous jobs, optimize communication with disk server
     June – September – smooth running
     • Share resources: LHCb in June-July, ALICE in August-September
     • Careful online monitoring of jobs (on top of the usual monitoring from the collaborations)

  10. Monitoring
     Frequent power cuts in summer (4-5 times): -5%
     • All intermediate steps are lost (…)
     • Provide reserve power line and more powerful UPS
     Stalled jobs: -10%
     • Infinite loops in GEANT4 (LHCb)
     • Crashes of central services
     • Write simple check script and kill such jobs (bug report is not sent…)
     Slow data transfer to CERN
     • Poor and restricted link to CERN
     • Problems with CASTOR
     • Automatic retry
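The talk does not show the check script itself. As a minimal sketch of what such a script might look like, the code below treats a job as stalled when its log file has not been modified for longer than a threshold, then sends it SIGTERM. The stall criterion, the `Job` record, and all function names are assumptions made for illustration; the actual ITEP script may have used a different test.

```python
import os
import signal
import time
from dataclasses import dataclass


@dataclass
class Job:
    pid: int        # process ID of the batch job (hypothetical record)
    log_path: str   # file the job is expected to append to while it progresses


def is_stalled(last_update_s, now_s, max_idle_s):
    """A job counts as stalled if its log has been idle longer than max_idle_s."""
    return now_s - last_update_s > max_idle_s


def find_stalled(jobs, max_idle_s, now_s=None, mtime=os.path.getmtime):
    """Return the jobs whose log file was last modified more than max_idle_s ago.

    `mtime` is injectable so the logic can be exercised without real files.
    """
    now_s = time.time() if now_s is None else now_s
    stalled = []
    for job in jobs:
        try:
            last_update_s = mtime(job.log_path)
        except OSError:
            continue  # log missing or unreadable: skip rather than kill blindly
        if is_stalled(last_update_s, now_s, max_idle_s):
            stalled.append(job)
    return stalled


def kill_jobs(jobs, sig=signal.SIGTERM):
    """Send `sig` to every job; jobs that already exited are ignored."""
    for job in jobs:
        try:
            os.kill(job.pid, sig)
        except ProcessLookupError:
            pass
```

A cron-style wrapper could call `kill_jobs(find_stalled(jobs, max_idle_s=6 * 3600))` periodically. The six-hour threshold is an arbitrary placeholder and would have to be tuned against the 5-32 hour CPU times quoted for the two experiments.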

  11. ALICE Statistics

  12. LHCb Statistics

  13. Summary
     • Quite visible participation in ALICE and LHCb DCs
     • ALICE → ~5% contribution (ITEP part ~70%)
     • LHCb → ~5% contribution (ITEP part ~70%)
     • With only 44 CPUs
     • Problems reported to colleagues in collaborations
     • More attention to LCG now
     • Distributed analysis – very different pattern of work load
