
Distributed storage, work status

  1. Distributed storage, work status
     Giacinto Donvito, INFN-Bari, on behalf of the Distributed Storage Working Group
     SuperB Collaboration Meeting

  2. Outline
     • Testing storage solutions:
       • Hadoop testing
       • fault-tolerance solutions
       • patch for a distributed computing infrastructure
       • Napoli activities
     • Data access library:
       • Why do we need it?
       • How should it work?
       • Work status
       • Future plans
     • People involved
     • Conclusions
     Only updates are reported in this talk: there are other activities, planned or ongoing, with no updates to report since the last meeting.

  3. Testing storage solutions
     • The goal is to test available storage solutions that:
       • fulfill the SuperB requirements
         • good performance also on pseudo-random access
       • meet the computing centers' needs
         • scalability to the order of PB and to hundreds of client nodes
         • a small TCO
         • good support available over the long term
       • Open source could be preferred over proprietary solutions
         • a wide community would be a guarantee of long-term sustainability
       • provide POSIX-like access APIs by means of ROOT-supported protocols (the most interesting seem to be XRootD and HTTP); a short sketch follows this slide
         • the copy functionality should also be supported
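For reference, this is what POSIX-like access through ROOT-supported protocols looks like from user code. The sketch below uses the standard ROOT TFile interface; the server names, file path and tree name are invented for the example and are not SuperB endpoints.

```cpp
// Minimal ROOT macro: open the same (illustrative) file over XRootD and HTTP.
// TFile::Open dispatches to the proper I/O plugin based on the URL scheme.
#include "TFile.h"
#include "TTree.h"
#include <initializer_list>
#include <iostream>

void open_remote()
{
    TFile *fx = TFile::Open("root://xrootd.example.infn.it//superb/data/run001.root");
    TFile *fh = TFile::Open("http://storage.example.infn.it/superb/data/run001.root");

    for (TFile *f : {fx, fh}) {
        if (!f || f->IsZombie()) { std::cerr << "open failed\n"; continue; }
        TTree *t = static_cast<TTree*>(f->Get("Events")); // hypothetical tree name
        if (t) std::cout << f->GetName() << ": " << t->GetEntries() << " entries\n";
        f->Close();
    }
}
```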

  4. Testing storage solutions
     • Several software solutions could fulfill all these requirements:
       • a few of them are already known and used within the HEP community
       • we also want to test solutions that are not yet well known
       • focusing our attention on those that can provide fault tolerance against hardware or software malfunctions
     • At the moment we are focusing on two solutions:
       • HADOOP (at INFN-Bari)
       • GlusterFS (at INFN-Napoli)

  5. Hadoop testing
     • Use cases:
       • scientific data analysis on clusters with CPU and storage in the same boxes
       • high availability of data access without an expensive hardware infrastructure
       • high availability of the data access service even in the event of a complete site failure
     • Tests executed:
       • metadata failures
       • data failures
       • "rack" awareness

  6. HDFS architecture
     (architecture diagram not included in the transcript)

  7. HDFS tests
     • Metadata failure
       • the NameNode is a single point of failure
       • it is possible to keep a back-up snapshot
     • DataNode failure
       • during read/write operations
     • Testing FUSE access to the HDFS file system
     • Testing HTTP access to the HDFS file system
     • Rack-failure resilience
     • Farm-failure resilience

  8. HDFS test: NameNode failures
     • Metadata failure
       • it is possible to make a back-up of the metadata
       • the time between two dumps is configurable
       • it is easy and fast to recover from a major failure of the primary NameNode

  9. HDFS test: NameNode failures
     • Metadata failure
       • while the NameNode process is active, no problem is observed on the client side
       • if the NameNode process is unavailable, all clients retry a configurable number of times before failing
       • the DataNodes reconnect to the NameNode as soon as it becomes available again
       • it is easy to import the metadata from the BackupNameNode; only the metadata changed after the last checkpoint is lost
       • it looks fairly easy to put an automatic fail-over cluster in production to provide an HA solution

  10. HDFS test: DataNode failures
      • DataNode failure while reading:
        • in case of a major disk/node failure
        • in case of data corruption in a few data chunks
        • clients do not see any failure (a client-side read sketch follows this slide)
        • the system recovers and fixes the problem without human intervention
      • DataNode failure while writing:
        • the client retries a configurable number of times before failing
        • each time the destination node is different
        • it should be enough to have a few DataNodes active in the cluster to avoid client failures
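As an illustration of the client-side view in these tests, here is a minimal read loop written against the libhdfs C API (hdfs.h). The NameNode address, port and file path are placeholders, and this is not the working group's actual test code: DataNode failures and block re-replication are handled transparently inside the HDFS client library, so a loop like this either sees complete data or an error after the configured retries.

```cpp
// Minimal HDFS read client using libhdfs (hdfs.h).
#include <hdfs.h>
#include <fcntl.h>
#include <cstdio>
#include <vector>

int main()
{
    hdfsFS fs = hdfsConnect("namenode.example.infn.it", 8020);   // placeholder host/port
    if (!fs) { std::perror("hdfsConnect"); return 1; }

    hdfsFile in = hdfsOpenFile(fs, "/superb/test/run001.dat", O_RDONLY, 0, 0, 0);
    if (!in) { std::perror("hdfsOpenFile"); hdfsDisconnect(fs); return 1; }

    std::vector<char> buf(4 * 1024 * 1024);   // 4 MB read buffer
    long long total = 0;
    tSize n;
    while ((n = hdfsRead(fs, in, buf.data(), static_cast<tSize>(buf.size()))) > 0)
        total += n;                            // checksum/compare against a reference here

    std::printf("read %lld bytes\n", total);
    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}
```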

  11. HDFS rack awareness
      (diagram omitted from the transcript: hosts host1–host9 grouped into racks)

  12. HDFS test: rack awareness
      • While writing:
        • data placement may or may not take into account the load of each machine, depending on the configuration
        • the tests prove that both modes work fine
      • While reading:
        • data are always read from within the rack, as long as the load is not too high
        • no configuration is possible here

  13. HDFS test: miscellaneous
      • FUSE testing:
        • it works fine for reading (a short POSIX-I/O sketch follows this slide)
        • we use the OSG build in order to avoid some bugs when writing files through FUSE
      • HTTP testing:
        • HDFS provides a native WebDAV interface, which works quite well for reading
        • both reading and writing work well with the standard Apache module on top of HDFS mounted via FUSE
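Once HDFS is mounted via FUSE, applications simply use plain POSIX I/O with no HDFS-specific library. The short sketch below shows that access path; the mount point and file name are hypothetical.

```cpp
// Plain POSIX read of a file on an HDFS volume mounted via FUSE.
// "/mnt/hdfs" is a hypothetical mount point; any path under it works once mounted.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main()
{
    int fd = open("/mnt/hdfs/superb/test/run001.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    std::vector<char> buf(1 << 20);   // 1 MB buffer
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0)
        total += n;

    std::printf("read %lld bytes through the FUSE mount\n", total);
    close(fd);
    return 0;
}
```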

  14. HDFS test: custom data placement policy
      • In HDFS it is possible to change/configure the data placement policy
      • We have already implemented a new policy where the three copies of the data are placed in three different racks
        • this provides the capability to survive even two different racks being down
      • This policy increases the "cost" of writing files
        • in the HEP environment the data are "write once, read many"
      • ...but it reduces the cost of reading files
        • there are more CPUs that are "close to the data"

  15. HDFS test: custom data placement policy
      • HDFS gives the possibility to represent a hierarchical topology of racks
        • but this is not taken into account while writing/reading files
        • each rack is considered just "a different rack"
        • so both when reading and when writing files the racks are treated as a flat topology
      • If a node belonging to a given rack is moved to a different rack, the system detects a violation of the policy
        • it is not able to deal with this problem automatically
        • the administrator needs to format the node and re-insert it

  16. HDFS test: custom data placement policy
      • Goal:
        • achieve a data placement policy that ensures data availability even in the case of a complete farm going offline
        • with replication 3:
          • 2 replicas on the same farm (in different racks)
          • 1 replica on another farm
      • We have developed a new policy that understands rack topologies with an arbitrary number of levels (the selection logic is sketched after this slide)
        • it is under test at the moment...
        • and it is working!
        • we will ask Napoli to join the WAN test soon
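The real policy is an HDFS block placement plugin (written in Java, like HDFS itself). Purely to illustrate the selection logic described above, the C++ sketch below picks three replica targets from a two-level farm/rack topology: two different racks in the local farm and one rack in another farm. The data structure, farm names and host names are all invented for the example.

```cpp
// Illustrative sketch of the farm-aware placement logic (not the real HDFS plugin):
// choose 3 targets so that 2 land in different racks of the local farm and
// 1 lands in a remote farm.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// topology: farm name -> rack name -> hosts
using Topology = std::map<std::string, std::map<std::string, std::vector<std::string>>>;

std::vector<std::string> chooseTargets(const Topology& topo, const std::string& localFarm)
{
    std::vector<std::string> targets;

    // 1) two replicas in the local farm, on two different racks
    for (const auto& [rack, hosts] : topo.at(localFarm))
        if (targets.size() < 2 && !hosts.empty())
            targets.push_back(hosts.front());

    // 2) one replica in any other farm
    for (const auto& [farm, racks] : topo) {
        if (farm == localFarm) continue;
        for (const auto& [rack, hosts] : racks)
            if (!hosts.empty()) { targets.push_back(hosts.front()); return targets; }
    }
    throw std::runtime_error("not enough farms/racks to satisfy the policy");
}

int main()
{
    Topology topo = {
        {"farm-bari",   {{"rack1", {"wn001", "wn002"}}, {"rack2", {"wn101"}}}},
        {"farm-napoli", {{"rack1", {"na001"}}}},
    };
    for (const auto& host : chooseTargets(topo, "farm-bari"))
        std::cout << host << "\n";   // prints wn001, wn101, na001
}
```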

  17. HDFS test: next steps
      • Increase the amount of resources involved in the test:
        • we will soon build a big cluster at INFN-Bari, using production WNs plus dedicated PON machines
        • the final test will comprise ~300 local nodes
          • 5000 CPUs/cores
          • >600 TB of storage
      • This cluster will be used to test:
        • performance
        • horizontal scalability
        • the overall TCO over a long period of usage (6 months)
      • We will soon start WAN data replication/access tests using at least the INFN-Napoli cluster
      • Testing new versions of HDFS:
        • HDFS HA for the NameNode (manual failover)
        • HDFS federation
        • performance

  18. Napoli activities
      • We completed the preliminary tests of the 10 Gbit/s network and of local I/O on a clustered file system spanning 12 nodes, using GlusterFS.
      • Ongoing work:
        • implementation of a Grid farm with a GlusterFS shared storage area and an SRM interface through DPM or StoRM
        • design of scenarios to test both the performance and the reliability of GlusterFS over WAN
      (diagram omitted: sites A and B connected over the Internet; a StoRM/DPM front end and a CE with worker nodes WN1–WN12 sharing the GlusterFS file system, exposed to the Grid through the SRM interface and HTTP access, with data replication over WAN)

  19. Data Access Library: why do we need it?
      • It can be painful to access data from an analysis application if no POSIX access is provided by the computing center
        • there are a number of libraries providing POSIX-like access in the HEP community
      • We need an abstraction layer to be able to use the logical SuperB file name transparently, without any knowledge of its mapping to the physical file name at a given site (a minimal mapping sketch follows this slide)
      • We need a single place/layer where we can implement read optimizations that are transparently used by end users
      • We need an easy way to read files using protocols that are not already supported by ROOT
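To illustrate the kind of logical-to-physical mapping the library is meant to hide, here is a minimal sketch that resolves a logical SuperB file name to a site-specific URL through a per-site prefix rule. The site names, prefixes and rule format are invented for the example and are not the working group's actual scheme.

```cpp
// Hypothetical LFN -> PFN resolution by per-site prefix substitution.
// A real site would ship its own rule, e.g. via a configuration file.
#include <iostream>
#include <map>
#include <string>

std::string resolve(const std::string& lfn, const std::string& site)
{
    // site -> physical prefix (all values invented for illustration)
    static const std::map<std::string, std::string> prefix = {
        {"INFN-Bari",   "root://xrootd.ba.infn.it//hdfs/superb"},
        {"INFN-Napoli", "http://gluster.na.infn.it/superb"},
    };
    return prefix.at(site) + lfn;   // e.g. "/prod/run001.root" -> full URL
}

int main()
{
    std::cout << resolve("/prod/run001.root", "INFN-Bari")   << "\n";
    std::cout << resolve("/prod/run001.root", "INFN-Napoli") << "\n";
}
```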

  20. Data Access Library: how can we use it?
      • This work aims to provide a simple set of APIs:
        • Superb_open, Superb_seek, Superb_read, Superb_close, ... (a sketch of a possible interface follows this slide)
        • a dedicated header file and the related library will have to be added
        • code built with this library will be able to open local or remote files seamlessly
          • exploiting several protocols: http(s), xrootd, etc.
      • Different levels of caching will be provided automatically by the library in order to improve the performance of the analysis code
      • Each site will be able to configure the library as needed to fulfill the computing center's requirements
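Only the four call names above appear in the slides; the exact signatures are not defined yet. The header below is therefore a hypothetical sketch of what such a POSIX-like interface could look like, with trivial stub implementations (plain POSIX calls on a local path) so the example is self-contained; in the real library the logical name would be resolved to a local path or to an xrootd/http(s) URL according to the site configuration.

```cpp
// superb_io.h -- hypothetical sketch of the data access library interface.
// Only the function names come from the slides; signatures and stubs are assumptions.
#ifndef SUPERB_IO_H
#define SUPERB_IO_H

#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

typedef int SuperbFile;   // opaque handle (assumed)

inline SuperbFile Superb_open(const char* logical_name)           // stub: treat as local path
{
    return open(logical_name, O_RDONLY);
}
inline long long Superb_seek(SuperbFile f, long long offset, int whence)
{
    return lseek(f, offset, whence);
}
inline long long Superb_read(SuperbFile f, void* buffer, std::size_t nbytes)
{
    return read(f, buffer, nbytes);
}
inline int Superb_close(SuperbFile f)
{
    return close(f);
}

#endif // SUPERB_IO_H
```

A typical use from analysis code would then look like this (illustrative):

```cpp
#include "superb_io.h"
#include <vector>

int main()
{
    SuperbFile f = Superb_open("/prod/run001.dat");   // logical name in the real library
    std::vector<char> buf(64 * 1024);
    Superb_seek(f, 0, SEEK_SET);
    Superb_read(f, buf.data(), buf.size());
    return Superb_close(f);
}
```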

  21. Data Access Library: work status
      • At the moment we have a test implementation with HTTP-based access
        • exploiting the curl libraries (a minimal libcurl sketch follows this slide)
      • It is possible to write an application linking this library and to open and read a list of files
        • with a list of read operations
      • It is possible to configure the size of the read buffer
        • to optimize the performance when reading files over a remote network
      • The results of the first tests will be available in a few weeks from now...
      • We already presented WAN data access tests using curl at CHEP:
        • https://indico.cern.ch/getFile.py/access?contribId=294&sessionId=4&resId=0&materialId=slides&confId=149557
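To show the kind of HTTP access the test implementation is built on, here is a minimal libcurl sketch that reads one byte range of a remote file into a memory buffer. The URL and the 4 MB range are placeholders, not the library's actual defaults.

```cpp
// Minimal libcurl example: fetch one byte range of a remote file over HTTP.
// The real library wraps this kind of call in Superb_open/Superb_read and adds caching.
#include <curl/curl.h>
#include <cstdio>
#include <vector>

static size_t collect(char* data, size_t size, size_t nmemb, void* userp)
{
    auto* buf = static_cast<std::vector<char>*>(userp);
    buf->insert(buf->end(), data, data + size * nmemb);
    return size * nmemb;
}

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* h = curl_easy_init();
    if (!h) return 1;

    std::vector<char> buffer;
    curl_easy_setopt(h, CURLOPT_URL, "http://storage.example.infn.it/superb/run001.root");
    curl_easy_setopt(h, CURLOPT_RANGE, "0-4194303");          // first 4 MB
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(h, CURLOPT_WRITEDATA, &buffer);

    CURLcode rc = curl_easy_perform(h);
    std::printf("curl rc=%d, got %zu bytes\n", rc, buffer.size());

    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```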

  22. Data Access Library: future work
      • We are already in contact with Philippe Canal about implementing new features of the ROOT I/O
      • We will follow the development of the Distributed Computing Working Group to provide a seamless mapping of the logical file name to the local physical file name
      • We will soon implement a few example configurations for local storage systems
      • We need to know which ROOT version is the most interesting for SuperB analysis users
      • We will start implementing memory-based pre-fetch and caching mechanisms within the SuperB library

  23. People involved
      • Giacinto Donvito (INFN-Bari): data model, HADOOP testing, http & xrootd remote access, distributed Tier1 testing
      • Silvio Pardi, Domenico del Prete, Guido Russo (INFN-Napoli): cluster set-up, distributed Tier1 testing, GlusterFS testing, SRM testing
      • Gianni Marzulli (INFN-Bari): cluster set-up, HADOOP testing
      • Armando Fella (INFN-Pisa): http remote access, NFSv4.1 testing, data model
      • Elisa Manoni (INFN-Perugia): developing application code for testing remote data access
      • Paolo Franchini (INFN-CNAF): http remote access
      • Claudio Grandi (INFN-Bologna): data model
      • Stefano Bagnasco (INFN-Torino): GlusterFS testing
      • Domenico Diacono (INFN-Bari): developing the data access library

  24. Conclusions
      • Collaborations and synergies with other (mainly LHC) experiments are increasing and improving
      • The first results from testing storage solutions are quite promising
        • we will soon need to increase the effort and the amount of (hardware and human) resources involved, both locally and geographically distributed
        • this will be of great importance in realizing a distributed Tier0 center among the Italian computing centers
      • We need to increase the development effort in order to provide a code base for efficient data access
        • this will be one of the key elements in providing good performance to analysis applications
