T2 storage issues



  1. T2 storage issues M. Biasotto – INFN Legnaro

  2. T2 issues • Storage management is the main issue for a T2 site • CPU and network management are easier and we have reached some stability • years of experience • stable tools (batch systems, installation, etc.) • total number of machines for an average T2 is small: O(100) • Several different issues in storage • hardware: which kind of architecture and technology? • hw configuration and optimization • storage ↔ CPU network • storage resource managers

  3. Hardware • Which kind of hardware for T2 storage? • DAS (Direct Attached Storage) servers: cheap and good performance • SAN based on SATA/FC disk-arrays and controllers: flexibility and reliability • Others (iSCSI, ATA over Ethernet, ...)? • There are already working groups dedicated to this (technology tracking, tests, etc.), but information is a bit dispersed • SAN with SATA/FC disks preferred by most sites • but economic concerns: will funding be enough for this kind of solution? • Important, but not really critical? • once you have bought some kind of hardware, you are stuck with it for years, but it’s usually possible to mix different types

  4. Storage configuration • Optimal storage configuration is not easy, with a lot of factors to take into consideration • how many TB per server? which RAID configuration? • fine tuning of parameters in disk-arrays, controllers and servers (cache, block sizes, buffer sizes, kernel params, ... a long list; see the sketch below) • Storage pool architecture: is one large pool enough, or is it necessary to split? • buffer pools (WAN transfer buffer, local WN buffer)? • different pools for different activities (production pool, analysis pool)? • Network configuration: avoid bottlenecks between servers and CPUs • Optimal configuration depends strongly on the application • 2 main (very different) types of access: remote I/O from the WN or local copy to/from the WN; currently CMS uses remote and Atlas local I/O • production and analysis activities have different access patterns
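To make the tuning point concrete, here is a minimal sketch of how the effect of one such parameter (the read block size) can be measured on a pool server. The file path and block-size list are hypothetical placeholders, and the same loop can be repeated while varying RAID stripe size, readahead or cache settings.

```python
#!/usr/bin/env python3
"""Minimal sketch: measure sequential read throughput for several block sizes.

The file path and block sizes below are hypothetical placeholders; point the
script at a large file on the storage under test.
"""
import os
import time

TEST_FILE = "/storage/pool01/testfile_10GB"                # hypothetical path
BLOCK_SIZES = [64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024]

def read_throughput(path, block_size):
    """Read the whole file sequentially and return the rate in MB/s."""
    total = 0
    start = time.time()
    with open(path, "rb", buffering=0) as f:               # unbuffered, so block_size matters
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total / (1024 * 1024) / (time.time() - start)

if __name__ == "__main__":
    print("file size: %.0f MB" % (os.path.getsize(TEST_FILE) / (1024 * 1024)))
    for bs in BLOCK_SIZES:
        # For honest numbers the page cache should be dropped between runs
        # (an administrator can do this via /proc/sys/vm/drop_caches).
        print("block size %7d B -> %6.1f MB/s" % (bs, read_throughput(TEST_FILE, bs)))
```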

  5. Storage configuration • Optimal configuration varies depending on many factors: there is no single simple solution • every site will have to fine-tune its own storage • and it will vary over time • But having some guidelines would be useful • exploit current experience (mostly at T1s) • Can have huge effects on performance, but it’s not so critical if you have enough flexibility • many parameters can be easily changed and adjusted • a SAN hardware architecture is much more flexible than DAS (rearrange the storage, increase or reduce the number of servers, ...)

  6. Storage Resource Manager • Which Storage Resource Manager for a T2? • DPM and dCache already in use at most LCG sites • Storm is the INFN-developed tool • Xrootd protocol required by Alice: currently being implemented in dCache and DPM • The choice of an SRM is a more critical issue: it’s much more difficult to change • adopting one and learning how to use it is a large investment: know-how in deployment, configuration, optimization, problem finding and solving, ... • obvious practical problems if a site has a lot of data already stored: data migration, catalogue updates (often outside the control of the site) • First half of 2007: last chance for a final decision? • of course nothing is ever ‘final’, but after that a transition would be much more problematic

  7. Current status of INFN sites [table of SRM installations at INFN sites not reproduced in the transcript] • (*) Pisa recently switched from DPM to dCache • (**) Roma has 2 DPM installations (CMS and Atlas)

  8. Requirements • Performance & scalability • how much is needed for a T2? • does the tool architecture address the scalability issue? • Reliability & stability • Advanced features • data management (replication, recovery), monitoring, configuration tuning, ... • Cost (in terms of human and hardware resources)

  9. dCache • dCache is currently the most mature product • used in production for a few years • deployed at several large sites: T1 FNAL, T1 FZK, T1 IN2P3, all US-CMS T2s, T2 DESY, ... • There is no doubt it will satisfy the performance and scalability needs of a T2 • Two key features to guarantee performance and scalability • Services can be split among different nodes • all ‘access doors’ (gridftp, srm, dcap) can be replicated • also ‘central services’ (which usually all run on the admin node) can be distributed

  10. dCache • “Access queues” to manage a high number of concurrent accesses (sketched below) • storage access requests are queued and can be distributed, prioritized and limited based on protocol type or access type (read/write) • buffer for temporary high load, avoiding server overloading • Provides a lot of advanced features • data replication (for 'hot' datasets) • dynamic and highly configurable pool match-making • pool draining for scheduled maintenance operations • grouping and partitioning of pools • internal monitoring and statistics tool
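The “access queue” mechanism can be pictured roughly as follows: each protocol or access type gets a bounded number of mover slots, and excess requests wait in a queue instead of all hitting the disk server at once. The sketch below is a conceptual illustration only, not dCache code; the protocol names and limits are invented.

```python
"""Conceptual sketch of a per-protocol 'access queue' (not actual dCache code).

Each protocol/access type gets a bounded number of concurrent movers; requests
beyond the limit wait in a queue instead of overloading the disk server.
The protocol names and limits below are invented, illustrative values.
"""
import queue
import threading
import time

MOVER_LIMITS = {"dcap-read": 100, "dcap-write": 20, "gridftp": 10}

class AccessQueue:
    def __init__(self, limits):
        self.queues = {proto: queue.Queue() for proto in limits}
        # Start one worker thread per allowed concurrent mover of each protocol.
        for proto, limit in limits.items():
            for _ in range(limit):
                threading.Thread(target=self._worker, args=(proto,), daemon=True).start()

    def submit(self, proto, transfer):
        """Queue a transfer; it runs only when a mover slot for `proto` is free."""
        self.queues[proto].put(transfer)

    def _worker(self, proto):
        while True:
            transfer = self.queues[proto].get()
            transfer()                      # do the actual data movement
            self.queues[proto].task_done()

if __name__ == "__main__":
    aq = AccessQueue(MOVER_LIMITS)
    for _ in range(500):                    # burst of 500 read requests
        aq.submit("dcap-read", lambda: time.sleep(0.01))
    aq.queues["dcap-read"].join()           # at most 100 were active at any time
    print("burst absorbed without exceeding the mover limits")
```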

  11. dCache issues • Services are heavy and not very efficient • written in Java, they require a lot of RAM and CPU • central services can be split, but do they need to be split, even at a T2 site? Having to manage several dCache admin nodes could be a problem • Still missing VOMS support and SRM v2, but both should be available soon • More expensive in terms of human resources • more difficult to configure and maintain • steeper learning curve, documentation needs to be improved • It’s more complex, with more advanced features, and this obviously comes at a cost • does a T2 need the added complexity and features, and can they be afforded?

  12. INFN dCache experience • Performance test at CNAF + Bari (end 2005) • demonstrated the performance needed for a T2 in 2008 • 1 admin node and 4 pool nodes (slide by G. Donvito, Bari)

  13. INFN dCache experience • Used in production at Bari since May 2005, building up a lot of experience and know-how • heavily used in the SC4 CMS LoadTest, with good stability and performance [figure: Bari WAN yearly graph, SC4 activity] • Pisa experience: SC4 started with DPM, during the summer switched to dCache • DPM problems in CMS, data loss after a hw failure, overall impression that a more mature solution was really needed • participated in the CSA06 challenge with dCache, they are pleased with the change [figure: Pisa WAN yearly graph, SC4 activity]

  14. Storm • Developed in collaboration between INFN-CNAF and ICTP-EGRID (Trieste) • Designed for disk-based storage: implements an SRM v2 interface on top of an underlying parallel or cluster file-system (GPFS, Lustre, etc.) • Storm takes advantage of the aggregation functionalities of the underlying file-system to provide performance, scalability, load balancing, fault tolerance, ... • Not bound to a specific file-system • the separation between the SRM interface and the data management functionalities is an interesting feature of Storm: in principle it makes it possible to exploit the very active research and development in the cluster file-system field (see the sketch below) • support of SRM v2 functionalities (space reservation, lifetime, file pinning, pre-allocation, ...) and ACLs • Full VOMS support
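The separation can be pictured as a thin SRM-like layer that only negotiates space and transfer URLs, while all actual I/O, striping and replication are delegated to the cluster file-system mounted underneath. The sketch below is purely conceptual and is not StoRM code; the mount point, gridftp host and TURL format are invented.

```python
"""Conceptual sketch of an SRM-like layer on top of a mounted cluster file-system.

This is NOT StoRM code: the mount point, gridftp host and TURL format are
invented, and a real implementation handles security, lifetimes, pinning, etc.
"""
import os
import shutil

GPFS_MOUNT = "/gpfs/storm"                    # cluster file-system mounted on the server
GRIDFTP_HOST = "gridftp.example.infn.it"      # hypothetical transfer endpoint

def prepare_to_put(sfn, size_bytes):
    """Reserve space for a new file and return a transfer URL (TURL)."""
    if shutil.disk_usage(GPFS_MOUNT).free < size_bytes:
        raise RuntimeError("SRM_NO_FREE_SPACE")
    path = os.path.join(GPFS_MOUNT, sfn.lstrip("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.truncate(size_bytes)                # mark the reservation by extending the file
    return "gsiftp://%s%s" % (GRIDFTP_HOST, path)

def prepare_to_get(sfn):
    """Resolve an existing file to a TURL; the file-system does the real I/O."""
    path = os.path.join(GPFS_MOUNT, sfn.lstrip("/"))
    if not os.path.exists(path):
        raise RuntimeError("SRM_INVALID_PATH")
    return "gsiftp://%s%s" % (GRIDFTP_HOST, path)
```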

  15. Storm • Central services scalability • the Storm service has 3 components: Front-End (web service for the SRM interface), Back-End (core) and DataBase • FE and BE can be replicated and distributed on multiple nodes • centralized database: currently MySQL, others (e.g. Oracle) possible in future releases • Advanced features provided by the underlying file-system • GPFS: data replication, pool vacation

  16. Storm issues • Not used by any LCG site so far (not compatible with SRM v1), and only a few ‘test installations’ at external sites • It’s likely that a first “field test” would result in a lot of small issues and problems (shouldn’t be a concern in the longer term) • Installation and configuration procedures not yet reliable enough • recent integration with yaim and more external deployments should quickly bring improvements in this area • No access queue for concurrent access management • No internal monitoring (or only partial, provided by the fs) • There could be compatibility issues between the underlying cluster file-system and some VO applications • some file-systems have specific requirements on kernel version

  17. INFN Storm experience • Obviously CNAF has all the necessary know-how on Storm • Also GPFS experience within INFN, mostly at CNAF but not only (Catania, Trieste, Genova, ...) • overall good in terms of performance, scalability and reliability • Storm installations • CNAF (interoperability test within the SRM-WG group) • CNAF (pre-production) • Padova (EGRID project) • Legnaro and Trieste (GridCC project) • Performance test at CNAF at the beginning of 2006 • Storm + GPFS v2.3 testbed (see next slide)

  18. Performance test at CNAF • Framework: the disk storage was composed of roughly 40 TB, provided by 20 logical partitions of one dedicated StorageTek FlexLine FLX680 disk array, aggregated by GPFS. • Write test: srmPrepareToPut() with implicit reserveSpace of 1 GB files, then globus-url-copy from a local source to the returned TURL; 80 simultaneous client processes. • Read test: srmPrepareToGet() followed by globus-url-copy from the returned TURL to a local file (1 GB files); 80 simultaneous client processes. • Results: • sustained read and write throughputs measured: 4 Gb/s and 3 Gb/s respectively. • The two tests are meant to validate the functionality and robustness of the srmPrepareToPut() and srmPrepareToGet() functions provided by StoRM, as well as to measure the read and write throughputs of the underlying GPFS file system. (slide by the Storm team)
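A test of this kind can be driven by a small script that launches the clients in parallel and reports the aggregate rate. The sketch below is a hedged illustration: the TURLs, source file and host name are placeholders, and in the real test the TURLs came from the srmPrepareToPut()/srmPrepareToGet() calls.

```python
"""Sketch of a parallel transfer driver in the spirit of the CNAF write test.

Assumptions: globus-url-copy is installed, and the TURLs below stand in for
those returned by srmPrepareToPut(); all names and paths are placeholders.
"""
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

N_CLIENTS = 80                                         # 80 simultaneous client processes
FILE_SIZE_GB = 1.0                                     # 1 GB files, as in the test
SOURCE = "file:///data/local/1GB.dat"                  # hypothetical local source file
TURLS = ["gsiftp://gridftp.example.infn.it/gpfs/storm/test/file%03d" % i
         for i in range(N_CLIENTS)]                    # hypothetical TURLs

def copy(turl):
    """Run one globus-url-copy from the local source to the given TURL."""
    return subprocess.call(["globus-url-copy", SOURCE, turl])

if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
        results = list(pool.map(copy, TURLS))
    elapsed = time.time() - start
    ok = results.count(0)
    print("%d/%d transfers ok, aggregate ~%.1f Gb/s"
          % (ok, len(TURLS), ok * FILE_SIZE_GB * 8 / elapsed))
```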

  19. DPM • DPM is the SRM system suggested by LCG, distributed with the LCG middleware • good Yaim integration: easy to install and configure • possible migration from an old classic SE • it’s the natural choice for an LCG site that needs to quickly set up an SRM • As a result of all this, there are a lot of DPM installations around • VOMS support (including ACLs) • SRM v2 implementation (but still with limited functionality)

  20. DPM issues • Still lacking many functionalities (some of them important) • load balancing is very simple (round robin among the file-systems in a pool) and not configurable (see the sketch below) • data replication still buggy in the current release, no pool draining for server maintenance or decommissioning • no pool selection based on path • no internal monitoring • Scalability limits? • no problem for the rfio and gridftp services: easily distributed on the pool servers • but ‘central services’ on the head node? In principle the ‘dpm’, ‘dpns’ and MySQL services can be split: not tested yet (will it be necessary? will it be enough?) • no ‘access queue’ as in dCache to manage concurrent access (but DPM is faster in serving the requests) and avoid server overloading
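To illustrate the “very simple load balancing” point: plain round robin cycles over the file-systems in a pool regardless of their size or load, while a weighted (e.g. free-space-based) choice would send more data to the emptier servers. The sketch below has nothing to do with DPM’s actual code; the file-system names and free-space figures are invented.

```python
"""Illustration of round-robin vs free-space-weighted file-system selection.

This is NOT DPM code; the file-system names and free-space figures are invented.
"""
import itertools
import random

# (file-system, free space in TB) for one pool -- illustrative numbers only
FILESYSTEMS = [("server1:/data01", 1.0), ("server1:/data02", 1.0),
               ("server2:/data01", 8.0), ("server3:/data01", 4.0)]

_rr = itertools.cycle(FILESYSTEMS)

def choose_round_robin():
    """Plain round robin: every file-system gets the same share of new files."""
    return next(_rr)[0]

def choose_by_free_space():
    """Weighted choice: emptier file-systems receive proportionally more files."""
    names = [fs for fs, _ in FILESYSTEMS]
    weights = [free for _, free in FILESYSTEMS]
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    print("round robin:", [choose_round_robin() for _ in range(4)])
    print("weighted   :", [choose_by_free_space() for _ in range(4)])
```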

  21. DPM issues • DPM-Castor-rfio compatibility • DPM uses the same rfio protocol as Castor, but with a gsi security layer added: 2 different shared libraries for the same protocol => big mess • and 2 different versions of the rfio commands (rfcp, rfmkdir, ...): the old Castor ones in /usr/bin/ and the new DPM ones in /opt/lcg/bin/ • problem in CMS, where applications do remote I/O operations • the CMS sw is distributed with the Castor rfio library (used at CERN): DPM sites need to manually hack the sw installation to make it work • Has been (and still is!) a serious issue for CMS T2s • problem discovered during SC3 at Legnaro (the only DPM site): nobody cared until it spread more widely in SC4, and still not solved after more than 1 year • along with the fact that dCache is the solution adopted by the main CMS sites, many CMS T2s have avoided DPM, and the others are considering the transition • just an example of a more general issue: stay in the same boat as the others, otherwise when a problem arises you are out of the game

  22. INFN DPM experience • Used in production at many INFN sites: Catania, Frascati, Legnaro, Milano, Napoli, Roma • no major issues (some problems migrating data from the old SE) • main complaint: lack of some advanced features • good overall stability, but still very small numbers (in size and throughput rate) • Catania and Legnaro currently the largest installations • Catania has an interesting DPM + GPFS configuration: one large GPFS pool mounted on DPM • Legnaro has adopted a more complex configuration: 8 DPM pools, 2 pools for CMS and 1 for each other VO • allows a sort of VO quota management, keeping data of different VOs on different servers (reduces activity interference, different hw performance) • with proper data management functionalities, it could be done (better?) in a single pool

  23. INFN DPM experience [figure: Legnaro WAN yearly graph, SC4 activity] • stability and reliability: CMS LoadTest (not a throughput test) • ~3 months of continuous transfer, ~200 TB transferred in import (10-50 MB/s), ~60 TB transferred in export (5-20 MB/s) • So far no evidence of problems or limitations, but the performance values reached are still low (even in CSA06 the system was not stressed enough) • Local access example: CMS MC production ‘merge’ jobs (high I/O activity): ~100 concurrent rfio streams on a single partition: ~120 MB/s (read+write)

  24. SRM summary • dCache • mature product, meets all performance and scalability requirements • more expensive in terms of hardware and human resources • DPM • important features still missing, but this is not a concern in the longer term (no reason why they shouldn’t be added) • required performance and scalability not proven yet: are there any intrinsic limits? • Storm • potentially interesting, but not used by any LCG site yet • required performance and scalability not proven yet: are there any intrinsic limits?

  25. Conclusions • The storage system is the most challenging part for a T2 site, with several issues • The choice of a Storage Resource Manager is one of the most critical • current tools are at different levels of maturity, but all in active development • A common solution for all INFN sites? There would be obvious benefits, but: • not all sites will have the same size: multi-VO T2, single-VO T2, T3 sites • different VOs have different requirements, storage access models, compatibility issues • A common solution within each VO? • We are running out of time • and of course all T2 sites have already started making choices

  26. Acknowledgments • Thanks to all the people who have provided info and comments: • T. Boccali (Pisa) • S. Bagnasco (Torino) • G. Donvito (Bari) • A. Doria (Napoli) • S. Fantinel (Legnaro) • L. Magnoni (CNAF, Storm) • G. Platania (Catania) • M. Serra (Roma) • L. Vaccarossa (Milano) • E. Vilucchi (Frascati)
