
Tier-2 storage


Presentation Transcript


  1. Tier-2 storage: a hardware view

  2. HEP Storage
  • dCache
    • needs feeding and care, although setup is now easier
  • DPM
    • easier to deploy
  • xrootd (as a system) is also in the picture, but has no SRM
  • dCache and DPM use a database for metadata (a single point of failure)
  • scalability is not much of an issue for a T2
    • although this depends on the access pattern
    • any analysis experience?
  • Filesystems
    • mostly XFS, but it has its flaws
    • many are looking at ZFS
    • GridKa uses GPFS
    • ext4 supports > 16 TB filesystems and has extents (still in development)

  3. Disk arrangements
  • For CMS in 2008, all T2s together:
    • 19.3 MSI2k (~800 kSI2k per average T2)
    • 4.9 PB (~200 TB per average T2)
  • RAID groups of 8 data disks at 750 GB/disk → 340 disks in 34 RAID6 groups (34 * 8 * 50 = 13600 IOs/s)
  • 800 kSI2k / 2 kSI2k per core → 400 cores
  • available: 13600 / 400 = 34 IOs/core/s
  • writes reduce this by 50% → 17 IOs/core/s
  • 50 MB/s / 17 IOs/core/s → ~3 MB per IO per core
  • 1-3 MB/s per core → 1200 MB/s → ~24 data servers (see the sketch below)
  • given the 34 RAID groups above, use 34 data servers
    • assume 50 MB/s per server, although today dCache tops out at around 30 MB/s per Java virtual machine
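
The arithmetic on this slide can be reproduced in a few lines. The sketch below is illustrative only: the RAID group count, SI2k figures, the 50 IOs/s per disk and the 50 MB/s per server numbers come from the slides, while the variable names and the choice of 3 MB/s per core (the upper end of the 1-3 MB/s range) are mine.

```python
# Back-of-the-envelope Tier-2 I/O sizing, reproducing the slide's numbers.

raid_groups        = 34                     # RAID6 groups of 8 data disks each
data_disks         = raid_groups * 8        # 272 data disks
iops_per_disk      = 50                     # ~15 ms SATA access -> ~50 IOs/s
total_iops         = data_disks * iops_per_disk        # 13600 IOs/s

site_power_ksi2k   = 800                    # average T2
core_ksi2k         = 2
cores              = site_power_ksi2k // core_ksi2k    # 400 cores

iops_per_core      = total_iops / cores                # 34 IOs/core/s
read_iops_per_core = iops_per_core / 2                 # writes cost ~50% -> 17

mb_s_per_core      = 3                      # upper end of the 1-3 MB/s range
aggregate_mb_s     = mb_s_per_core * cores             # 1200 MB/s
mb_s_per_server    = 50                     # assumed per data server
servers_needed     = aggregate_mb_s / mb_s_per_server  # ~24 data servers

print(total_iops, cores, iops_per_core, read_iops_per_core,
      aggregate_mb_s, servers_needed)
```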

  4. Disk thumb rules
  • access from different cores means random access (even though you have large, > 2 GB files!)
  • average access time of SATA is ~15 ms: ~50 IOs/s (see the sketch after this list)
  • average access time of FC/SAS disks is ~5 ms: ~150 IOs/s
  • SATA read/write mix (buffers!): 1 write per 20 read accesses. End of story.
  • SATA reliability is OK. Expect ~800 euro/TB (including the system)
  • RAID6 is suggested, along with proper support (hot swap, alerts, failover)
  • experience != experience: see the summary on http://hepix.caspur.it/storage/ (hepix/hepix)
  • plan for some servers that need to be HA (> 3000 euro)
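
A quick sanity check on the IOs/s figures is to invert the average access time; the helper below does that. Note that 1/t_access is an upper bound (it ignores transfer time and queuing), which is presumably why the slide rounds down to ~50 and ~150. The cost line multiplies the slide's 800 euro/TB by the ~200 TB of an average T2.

```python
# Thumb-rule helpers for the access-time and cost figures on this slide.

def iops_from_access_time(avg_access_ms: float) -> float:
    """Rough upper bound on random IOs/s for a single disk."""
    return 1000.0 / avg_access_ms

print(iops_from_access_time(15))   # ~67  -> the slide uses a safer ~50
print(iops_from_access_time(5))    # ~200 -> the slide uses a safer ~150

avg_t2_tb  = 200                   # from the disk-arrangements slide
eur_per_tb = 800                   # including the system
print(avg_t2_tb * eur_per_tb)      # ~160000 euro of disk for an average T2
```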

  5. Disk configurations
  • Storage in a box (NAS)
    • 16 to 48 disks with server nodes in one case
    • popular example: Sun Thumper, 48 disks, dual Opteron
  • DAS: storage and server separate
    • the required IO rates do not call for big servers
    • but the random access applies to many servers
    • may use some compute nodes to do the work
      • which would need SAS or FC to the storage
  • Resilient dCache
    • probably good for "read-mostly"
  • From the earlier core-to-disk estimate:
    • need 20 big NAS boxes
    • could be done with 4 servers, but not with 4 links (see the sketch below)
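
The "4 servers but not 4 links" remark can be made concrete with a simple link count. The 1200 MB/s aggregate comes from the earlier estimate; the ~120 MB/s of usable payload per 1 GbE link is my assumption and is not stated on the slide.

```python
# How many gigabit links would the earlier 1200 MB/s aggregate need?

aggregate_mb_s = 1200          # from the disk-arrangements estimate
gbe_link_mb_s  = 120           # assumed usable payload of one 1 GbE link

links_needed = -(-aggregate_mb_s // gbe_link_mb_s)   # ceiling division
print(links_needed)            # 10: a few big servers still need ~10 GbE links
                               # (or correspondingly fewer 10 GbE links)
```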

  6. Use Cases (thanks to Thomas Kress for the input)
  • MC and pile-up
    • mostly CPU bound; events are merged into large files before transfer to T1 via an output buffer
    • 12 MB/s
    • how many write and read streams on the buffer?
    • suggestion: 1 write stream per 20 read streams
    • the pile-up sample is 100-200 GB
      • random access by how many cores?
      • suggestion: spread it over many RAID groups
  • Calibration
    • storage area of 400 GB
    • read only? random or streaming access?
    • suggestion: at most 50 cores per disk (group)
  • Analysis
    • 100-200 TB per month, with random access!!
    • average flow of ~80 MB/s from T1 to T2! (3000-4000 files)
    • following the 1:20 rule above, this means a system that sustains 1600 MB/s of reads (see the sketch below)
    • "a large part of the data remains available for a longer time" (TK)
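
The analysis numbers follow from the monthly import volume, as the sketch below shows; the inputs are taken from the slide, and the 30-day month is my simplification.

```python
# Analysis use case: convert the monthly import into sustained data rates.

tb_per_month    = 200                    # upper end of the 100-200 TB range
seconds_month   = 30 * 24 * 3600         # assumed 30-day month

inbound_mb_s    = tb_per_month * 1e6 / seconds_month   # ~77 MB/s T1 -> T2
reads_per_write = 20                                    # the 1:20 thumb rule
read_mb_s       = inbound_mb_s * reads_per_write        # ~1550 MB/s, i.e. the
                                                        # ~1600 MB/s on the slide
print(round(inbound_mb_s), round(read_mb_s))
```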

  7. ToDo
  • Analyze access patterns
  • Simulate data/disk loss
  • Iterate the results
  • Join forces for HW procurement
