BINP/GCF Status Report



  1. BINP/GCF Status Report A.S.Zaytsev@inp.nsk.su Jan 2010

  2. Overview
  • Current status
  • Resource accounting
  • Summary of recent activities and achievements
  • BINP/GCF & NUSC (NSU) integration
  • BINP LCG site related activities
  • Proposed hardware upgrades
  • Future prospects

  3. BINP LCG Farm: Present Status
  • CPU: 40 cores (100 kSI2k)
  • RAM: 200 GB
  • HDD: 25 TB raw (22 TB visible)
  • Input power limit: 15 kVA
  • Heat output: 5 kW
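For reference, the per-core rating implied by these figures ties in with the 32-core / 80 kSI2k target quoted on slide 7. The short sketch below is an illustrative cross-check, not part of the original slides, and it assumes all cores are rated equally:

```python
# Rough capacity check derived from the figures on this slide.
# Assumption (not stated on the slide): all 40 cores are rated equally.
TOTAL_KSI2K = 100.0   # quoted farm capacity
TOTAL_CORES = 40      # quoted core count

ksi2k_per_core = TOTAL_KSI2K / TOTAL_CORES
print(f"per-core rating: {ksi2k_per_core:.1f} kSI2k")        # 2.5 kSI2k
# Cross-check against slide 7, which quotes 32 cores = 80 kSI2k max:
print(f"32 cores       : {32 * ksi2k_per_core:.0f} kSI2k")
```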

  4. Resource Allocation Accounting (up to 80 VM slots are now available within 200 GB of RAM)
  Computing power (90% full, 150% reserved, 200% limit):
  • LCG: 4 host systems now (40%); a 70% share is prospected for production with ATLAS VO (near future)
  • KEDR: 4.0 – 4.5 host systems (40-45%)
  • VEPP-2000, CMD-3, SND, test VMs, etc.: 1.5 – 2.0 host systems (15-20%)
  Centralized storage (35% full, 90% reserved, 100% limit):
  • LCG: 0.5 TB (VM images) + 15 TB (DPM + VO SW)
  • KEDR: 0.5 TB (VM images) + 4 TB (local backups)
  • CMD-3: 1 TB is reserved for the scratch area & local home
  • NUSC / NSU: up to 4 TB reserved for the local NFS/PVFS2 buffer
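To make the accounting concrete, here is a minimal sketch of the bookkeeping implied by these numbers. The uniform per-slot RAM figure is an assumption (the slide does not say how slots are sized), and the tallying helper is illustrative rather than BINP/GCF tooling:

```python
# Illustrative bookkeeping for the allocation figures on this slide.
# Assumption (not stated on the slide): VM slots are sized uniformly.
TOTAL_RAM_GB = 200
MAX_VM_SLOTS = 80
print(f"RAM per VM slot: {TOTAL_RAM_GB / MAX_VM_SLOTS:.1f} GB")   # 2.5 GB

# Centralized storage reservations quoted on the slide, in TB.
storage_tb = {
    "LCG":      0.5 + 15,   # VM images + DPM/VO software
    "KEDR":     0.5 + 4,    # VM images + local backups
    "CMD-3":    1,          # scratch area & local home
    "NUSC/NSU": 4,          # local NFS/PVFS2 buffer (upper bound)
}
for user, tb in storage_tb.items():
    print(f"{user:10s} {tb:5.1f} TB")
print(f"{'total':10s} {sum(storage_tb.values()):5.1f} TB reserved")
```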

  5. BINP/GCF Activities in 2009Q4, sorted by priority (from highest to lowest)
  • [done] Testing and tuning the 10 Gbps NSC/SCN channel to NSU and getting it to production state
  • [done] Deploying a minimalistic LCG site locally at BINP
  • [done] BINP/GCF and NUSC (NSU) cluster network and virtualization systems integration
  • [done] Probing the feasibility of efficient use of resources under VMware with native KEDR VMs deployed in various ways
  • [done] Finding a long term stable configuration of KEDR VMs while running on several host systems in parallel
  • [in progress] Getting to production with ATLAS VO with a 25 kSI2k / 15 TB SLC4 based LCG site configuration
  • [in progress] Preparing LCG VMs for running on the NUSC (NSU) side
  • [in progress] Studying the impact of BINP-MSK & BINP-CERN connectivity issues on GStat & SAM test failures

  6. BINP/GCF & NUSC (NSU) Integration
  • BINP/GCF: XEN images
  • NUSC: VMware images (converted from XEN)
  • Various deployment options were studied:
    • IDE/SCSI virtual disk (VD)
    • VD performance/reliability tuning
    • Locally/centrally deployed
    • 1:1 and 2:1 VCPU/real CPU core modes
    • Allowing swap to be disabled on the host system
  • Up to 2 host systems with 16 VCPUs combined have been tested (1 GB RAM/VCPU)
  • Long term stability (up to 5 days) has been shown so far only for locally deployed VMs; most likely the problems are related to the centralized storage system of the NUSC cluster
  • Work is now suspended due to the hardware failure of the NSC/SCN switch on the BINP side (more news by the end of the week)
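A small sketch of the resource arithmetic behind the 1:1 and 2:1 VCPU modes tested here. The 16-VCPU and 1 GB/VCPU figures are the ones quoted on the slide; the 8-core host size and the helper itself are illustrative assumptions, not a description of the actual BINP/GCF or NUSC nodes:

```python
# Illustrative VCPU/RAM budget for the two tested pinning modes.
# Assumption: an 8-core host (not stated on the slide); 1 GB RAM per VCPU is quoted.
def vm_budget(physical_cores: int, ratio: float, ram_per_vcpu_gb: float = 1.0):
    """Return (vcpus, ram_gb) a host can expose at a given VCPU:core ratio."""
    vcpus = int(physical_cores * ratio)
    return vcpus, vcpus * ram_per_vcpu_gb

for ratio in (1.0, 2.0):   # 1:1 and 2:1 VCPU/real-core modes
    vcpus, ram = vm_budget(physical_cores=8, ratio=ratio)
    print(f"{ratio:.0f}:1 mode on an 8-core host -> {vcpus} VCPUs, {ram:.0f} GB RAM")
```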

  7. BINP LCG Site Related Activities
  • STEP 1: DONE
    • Defining the basic site configuration, deploying the LCG VMs, going through the GOCDB registration, etc.
  • STEP 2: DONE
    • Refining the VM configuration, tuning up the network, getting new RDIG host certs, VO registration, handling the errors reported by SAM tests, etc.
  • STEP 3: IN PROGRESS
    • Get OK for all the SAM tests (currently being dealt with)
    • Confirm the stability of operations for 1-2 weeks
    • Upscale the number of WNs to the production level (from 12 up to 32 CPU cores = 80 kSI2k max)
    • Ask ATLAS VO admins to install the experiment software on the site
    • Test the site for its ability to run ATLAS production jobs
    • Check whether the 110 Mbps SB RAS channel can carry the load of an 80 kSI2k site
    • Get to production with ATLAS VO
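As a rough illustration of the 110 Mbps question above, the sketch below estimates the average WAN load of an 80 kSI2k site. The job-length and per-job data-volume figures are placeholder assumptions for illustration only; the slide poses the question but does not quote such numbers, and real ATLAS workloads would need to be measured:

```python
# Back-of-the-envelope feasibility check for the 110 Mbps SB RAS channel.
# All per-job figures below are placeholder assumptions, not measurements.
CHANNEL_MBPS   = 110    # shared SB RAS uplink
SITE_KSI2K     = 80     # target LCG site capacity
KSI2K_PER_CORE = 2.5    # derived on slide 3 (100 kSI2k / 40 cores)
JOB_HOURS      = 12.0   # assumed average ATLAS job length
STAGE_IO_GB    = 5.0    # assumed input + output transferred per job

cores = SITE_KSI2K / KSI2K_PER_CORE                  # ~32 concurrent jobs
jobs_per_day = cores * 24 / JOB_HOURS
avg_mbps = jobs_per_day * STAGE_IO_GB * 8 * 1024 / 86400
print(f"{cores:.0f} slots, ~{jobs_per_day:.0f} jobs/day, "
      f"average WAN load ~{avg_mbps:.0f} Mbps of {CHANNEL_MBPS} Mbps")
```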

  8. BINP/GCF Activities in 2010Q1-2, sorted by priority (from highest to lowest)
  • Recovering from the 10 Gbps NSC/SCN failure on the BINP side
  • Getting to production with 32-64 VCPUs for KEDR VMs on the NUSC side
  • Recovering BINP LCG site visibility under GStat 2.0
  • Getting to production with ATLAS VO with a 25 kSI2k / 15 TB LCG site configuration
  • Testing LCG VMs on the NUSC (NSU) side
  • Finding a stable configuration of LCG VMs for NUSC
  • Upscaling the LCG site to 80-200 kSI2k by using both BINP/GCF and NUSC resources
  • Migrating the LCG site to SLC5.x86_64 and CREAM CE as suggested by ATLAS VO and RDIG
  • Making a quantitative conclusion on how the existing NSC networking channel limits our LCG site performance/reliability
  • Allowing other local experiments to access NUSC resources via GRID farm interfaces (using the farm as a pre-production environment)

  9. Future Prospects
  • Major upgrade of the BINP/GCF hardware focusing on storage system capacity and performance:
    • Up to 0.5 PB of online storage
    • Switched SAN fabric
  • Further extension of the SC Network and virtualization environment:
    • TSU with 1100+ CPU cores is the most attractive target
  • Solving the problem with NSK-MSK connectivity for the LCG site:
    • A dedicated VPN to MSK-IX seems to be the best solution
  • Start getting the next generation hardware this year:
    • 8x increase of CPU core density
    • Adding a DDR IB (20 Gbps) network to the farm
    • 8 Gbps FC based SAN
    • 2x increase of storage density
  • Establish private 10 Gbps links between the local experiments and the BINP/GCF farm, thus allowing them to use NUSC resources

  10. 680 CPU Cores / 540 TB Configuration, 2012 (prospected)
  • 95 kVA UPS subsystem
  • 1.4 M$ in total
  • 16 CPU cores / 1U, 4 GB RAM / CPU core
  • 8 Gbps FC SAN fabric
  • 20 Gbps (DDR IB) / 10 Gbps (Ethernet) / 4x 1 Gbps (Ethernet) interconnect
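A quick sizing check derived from the per-node figures above; the node count and total RAM are derived here for illustration, they are not quoted on the slide:

```python
# Sizing check for the prospected 2012 configuration.
# Quoted on the slide: 680 cores, 16 cores per 1U node, 4 GB RAM per core.
TOTAL_CORES  = 680
CORES_PER_1U = 16
RAM_PER_CORE = 4    # GB

nodes = TOTAL_CORES / CORES_PER_1U                 # ~43 x 1U compute nodes
total_ram_tb = TOTAL_CORES * RAM_PER_CORE / 1024   # ~2.7 TB of RAM
print(f"~{nodes:.0f} x 1U compute nodes, {total_ram_tb:.2f} TB of RAM in total")
```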

  11. 168 CPU Cores / 300 TB Configuration, 2010 (proposed)
  • 55 kVA UPS subsystem
  • +14 MRub
  • 5x CPU power and 10x storage capacity compared to the present farm
  • DDR IB & 8 Gbps FC are already added at this stage

  12. PDU & Cooling Requirements
  • PDU:
    • 15 kVA are available now (close to the limits; no way to plug in the proposed 20 kVA UPS devices!)
    • 170-200 kVA (0.4 kV) & APC EPO subsystems are needed (a draft of the tech specs was prepared in 2009Q2)
    • Engineering drawings for the BINP/ITF hall have been recovered by CSD
    • The list of requirements is yet to be finalized
  • Cooling:
    • 30-35 kW are available now (7 kW modules, open tech water circuit)
    • 120-150 kW of extra cooling is required (assuming an N+1 redundancy schema)
    • Various cooling schemas were studied; locally installed water cooled air conditioners seem to be the best solution (18 kW modules, closed water loop)
    • No final design yet
  Once the plans for hardware purchasing are settled for 2010, the upgrade must be initiated.
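For clarity, here is a minimal sketch of the N+1 module count implied by the cooling figures on this slide. The 18 kW module rating and the 120-150 kW requirement come from the slide; the calculation itself is illustrative:

```python
import math

# N+1 redundancy sizing for the cooling upgrade quoted on this slide.
MODULE_KW = 18.0   # rating of one locally installed water cooled unit

def modules_n_plus_1(required_kw: float, module_kw: float = MODULE_KW) -> int:
    """Smallest module count that still covers the load with one unit failed."""
    return math.ceil(required_kw / module_kw) + 1

for load in (120, 150):
    print(f"{load} kW load -> {modules_n_plus_1(load)} x {MODULE_KW:.0f} kW modules (N+1)")
```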

  13. Prospected 10 Gbps SC Network Layout (network diagram; labels visible on the original slide: "1000+ CPU cores (2010Q3-4)" and "1100+ CPU cores (since 2007)")

  14. Summary
  • Major success has been achieved in integrating BINP/GCF and NUSC (NSU) computing resources
  • The schema tested with KEDR VMs should be exploited by other experiments as well (e.g. CMD-3)
  • The 10 Gbps channel (once restored) will allow the direct use of NUSC resources from the BINP site (e.g. ANSYS for the needs of VEPP-2000)
  • The LCG site may take advantage of the NUSC resources as well (200 kSI2k will give us a much better standing)
  • An upgrade of the BINP/ITF infrastructure is required for installing the new hardware (at least for the PDU subsystem)
  • If we are able to get the extra networking hardware as proposed, we may start connecting the experiments to the GRID farm and NUSC resources with 10 Gbps Ethernet uplinks this year

  15. Questions & Comments
