
COTS Parallel Archive System: Integration and Performance Studies

This paper presents the integration and performance studies of a new parallel archive storage system concept, using commercial-off-the-shelf (COTS) components and innovative software technology. The system addresses challenges in scalability, speed, and flexibility for archival storage in the HPC community. The authors share their experience in integrating a global parallel file system and a standard backup/archive product with a parallel software code, demonstrating its capability to meet the requirements of future archival storage systems.

Presentation Transcript


  1. The University of California operates Los Alamos National Laboratory for the National Nuclear Security Administration of the United States Department of Energy. LANL Document Number LA-UR-10-06115 Integration Experiences and Performance Studies of a COTS Parallel Archive System – A New Parallel Archive Storage System Concept and Implementation Hsing-bung (HB) Chen, Gary Grider, Cody Scott, Milton Turley, Aaron Torres, Kathy Sanchez, John Bremer Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA September 22nd, 2010 IEEE International Conference on Cluster Computing 2010, Heraklion, Crete, Greece

  2. Abstract Present and future archive storage systems have been challenged to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interfaces, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to meet the same demands, but at one or more orders of magnitude higher performance. Archive system designs continue to converge with file system designs, driven by the need for speed and bandwidth, especially metadata searching speed (more caching and less robust semantics). Currently, the number of extremely scalable parallel archive solutions is very limited, especially for moving a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community, and can be fielded much faster than creating and maintaining a complete end-to-end custom parallel archive software solution. We relay our experience of integrating a global parallel file system and a standard backup/archive product with innovative parallel software code to construct a scalable and parallel archive storage system. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single-parallel-file) archives for backup/archive/content management, and (e) leveraging all the free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS parallel archive system to the world's first petaflop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address the requirements of future archival storage systems. This new parallel archive system is now in use on LANL's Turquoise Network.

  3. Agenda • Background • Issues, Motivation, and Leverage of Using a COTS Parallel Archive System • Proposed COTS Parallel Archive System • Performance Studies on LANL's Roadrunner Open Science Projects • Experience and Observed Issues of Our COTS Parallel Archive System • Summary and Future Work

  4. The DOE Advanced Strategic Computing Initiative program published this Kiviat diagram, which shows parallel file systems scaling in performance an order of magnitude faster than parallel archives

  5. Background • Parallel File Systems & Parallel I/O • HSM – Hierarchical Storage Management • ILM – Information Lifecycle Management • Non-Parallel vs. Parallel Archive Systems • Parallel Archives That Do Not Leverage Parallel File Systems as Their First Tier of Storage • Parallel Archives That Do Leverage Parallel File Systems as Their First Tier of Storage

  6. Archives That Do Not Leverage Parallel File Systems (diagram): Clusters A and B attach through a scalable storage area network to a global parallel file system used as a scratch file system; FTAs and similar tools perform non-parallel data movement into the global archive storage system of disks and tapes.

  7. Parallel Archives That Leverage Parallel File Systems (diagram): Clusters A and B attach through a scalable storage area network to the global parallel (scratch) file system; on the archive path, a file transfer agent reads from the PFS and writes over NFS, while the migration path moves data from the global parallel file system into the parallel tape archive system (disks + tapes) under HSM.

  8. Motivation - 1 • More leverage of parallel file systems to provide a parallel archive is possible and makes sense • Can we combine highly leveragable COTS parallel file system and non-parallel archive solutions, plus the creative and unique code needed to provide the parallel archive service, into a highly leveraged parallel archive? • If so, huge cost savings could be realized in providing this kind of parallel data movement service

  9. Motivation - 2 • Disk is becoming more competitive with tape over time for a larger portion of archival data • A moderate and growing market for global parallel file systems with scalable bandwidth and metadata • Growing use of global parallel file systems for moderate-scale HPC • HSM and ILM features in file systems and archives • High-volume (non-single-parallel-file) archives for backup/archive/content management • Leverage of all the free file movement/management tools in Linux (copy, move, ls, tar, etc.) – a well-known file management environment – scp, sftp, and web/GUI file management for free

  10. Challenges for Parallel Archive Systems (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interfaces, and (f) utilize commercial-off-the-shelf (COTS) hardware.

  11. Proposed COTS Parallel Archive System • Build a parallel tree walker and copy user-space utility, • Add storage pool (stgpool) support (using the file system API), • Create an efficient ordered file retrieval utility (using the DMAPI and back-end tape system queries), • Add support for ILM stgpool features, • Add support for ILM stgpool and co-location features in the archive back-end, and • Use FUSE to break up enormous files into pieces that can be migrated and recalled in parallel to/from the back-end tape system (see the sketch below)
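A minimal sketch of the FUSE chunking idea above, assuming fixed-size chunks; the chunk size, names, and layout here are illustrative assumptions, not the actual ArchiveFUSE format. It only shows how a byte offset in one enormous logical file maps to a (chunk, offset) pair so that different data movers can migrate or recall different chunks in parallel.

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative only: map an offset in one large logical file onto a
   * (chunk index, offset-within-chunk) pair, assuming fixed-size chunks.
   * The real ArchiveFUSE layer may use a different naming/layout scheme. */
  typedef struct {
      uint64_t chunk_index;   /* which chunk file holds this byte     */
      uint64_t chunk_offset;  /* offset of the byte inside that chunk */
  } chunk_loc_t;

  static chunk_loc_t map_offset(uint64_t file_offset, uint64_t chunk_size)
  {
      chunk_loc_t loc;
      loc.chunk_index  = file_offset / chunk_size;
      loc.chunk_offset = file_offset % chunk_size;
      return loc;
  }

  int main(void)
  {
      uint64_t chunk_size = 16ULL * 1024 * 1024 * 1024;  /* assume 16 GiB chunks */
      chunk_loc_t loc = map_offset(40ULL * 1024 * 1024 * 1024, chunk_size);
      /* Each chunk can now be migrated/recalled by a different data mover. */
      printf("chunk %llu, offset %llu\n",
             (unsigned long long)loc.chunk_index,
             (unsigned long long)loc.chunk_offset);
      return 0;
  }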

  12. Proposed Parallel Archive System - PFTOOL (diagram): Clusters A and B attach over a parallel and scalable I/O network to the scratch global parallel file system; a scalable FTA cluster of parallel data movers runs PFTOOL and performs PFS (parallel file system) I/O, connecting over a storage area network to the global parallel file system/ILM and the parallel tape archive system. • Scalable FTA (File Transfer Agent) cluster: • Mounts the site global file system and other site shared file systems • Runs a commercial ILM-enabled parallel file system • Runs one or multiple copies of a commercial backup/archive product • Runs HSM • Submits jobs to the FTA cluster for optimized data movement to/from the archive

  13. Manager – The conductor • Coordinates the parallel tree walk • Balances the file tree walk vs. parallel data moving • Manages the operations of the various queues (DirQ, NameQ, TapeQ, CopyQ, TapeCQ) • Assigns copy jobs to workers • Issues output/display requests • Generates the final statistics report. PFTOOL's software architecture (diagram): the Manager exchanges MPI messages with the Workers (file stat, file copy, tape file restore), the WatchDog, the OutPutProc, the ReadDirProcs, and the TapeProcs through the message queues above; a dispatch sketch follows below.
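A hypothetical sketch of the dispatch pattern implied by the diagram: the Manager hands one queued path to an idle worker over MPI. The tag value, message format, and function name are assumptions for illustration, not PFTOOL's actual protocol; the worker side is the receiving loop shown later in the polling slide.

  #include <mpi.h>
  #include <string.h>

  #define TAG_COPY_JOB 1   /* assumed tag for copy-job messages */

  /* Send one path (including its terminating NUL) from the manager's copy
   * queue to the chosen worker rank; the worker's main receiving loop
   * picks it up and performs the stat/copy work. */
  void dispatch_copy(const char *path, int worker_rank)
  {
      MPI_Send(path, (int)strlen(path) + 1, MPI_CHAR,
               worker_rank, TAG_COPY_JOB, MPI_COMM_WORLD);
  }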

  14. PFTOOL’s MPI processes • Manager process: conductor • OutPutProc process: display process • WatchDog process: system status monitor • ReadDirProc process: explores directories and sub-directories • TapeProc process: tape data mover • Worker process: parallel data mover to and from file systems (one possible rank-to-role layout is sketched below)
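The slides list the process roles but not how MPI ranks are assigned to them, so the split below is an assumption, sketched only to illustrate how the singleton roles (Manager, OutPutProc, WatchDog) and the pools of ReadDirProc, TapeProc, and Worker ranks could be carved out of MPI_COMM_WORLD.

  #include <mpi.h>
  #include <stdio.h>

  /* Hypothetical rank-to-role layout; the pool sizes are run-time tunables
   * in PFTOOL but are fixed here for brevity. */
  enum role { MANAGER, OUTPUT, WATCHDOG, READDIR, TAPEPROC, WORKER };

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      int num_readdir = 2, num_tapeproc = 2;   /* assumed pool sizes */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      enum role r;
      if      (rank == 0) r = MANAGER;    /* conductor                 */
      else if (rank == 1) r = OUTPUT;     /* display/reporting process */
      else if (rank == 2) r = WATCHDOG;   /* system status monitor     */
      else if (rank < 3 + num_readdir)                r = READDIR;  /* tree walkers */
      else if (rank < 3 + num_readdir + num_tapeproc) r = TAPEPROC; /* tape movers  */
      else                                            r = WORKER;   /* data movers  */

      printf("rank %d of %d takes role %d\n", rank, nprocs, (int)r);
      MPI_Finalize();
      return 0;
  }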

  15. Parallel File System Tree Walker

  16. Parallel File System Tree Walker (continued)

  17. PFTOOL’s run-time environment • Run-time tuning parameters – NumProcs, NumTapeProcs, ChunkSize, StoragePool info, FUSE ChunkSize, CopySize • ArchiveFUSE file system – converts a very large file’s N-to-1 copy into N-to-N copies for scaling and performance improvement • File Transfer Agent cluster – GPFS client / FUSE client • PFTOOL run-time environment: 1 Manager MPI process, 1 OutPutProc MPI process, one or more ReadDirProc MPI processes, one or more Worker MPI processes, zero or more TapeProc MPI processes, and one WatchDog MPI process; NumProcs (MPI machine list) = sum of all MPI processes. Note: the number of TapeProcs is set to 0 during archiving, giving more Workers for copying data • PFTOOL utilities – pfls, pfcp, pfcm • LoadManager – generates the run-time MPI machine list periodically • FTA cluster run-time status: On/Off, Upgrade, Testing • GPFS/HSM/ILM/MySQL query service – run-time data migration and restore status

  18. PFTOOL’s runtime activities • LoadManager – selects available processes running on machines based on each machine’s current CPU workload • Tape optimization – reduces tape-thrashing overhead (mounting and unmounting tape drives) and lines up data for tape-optimized sequential archiving • Single large file parallel copy – parallel I/O data movement within a single large file (see the sketch below) • Very large file parallel copies – FUSE-enhanced implementation (conversion of an N-to-1 copy into an N-to-N copy) • Runtime tunable parameters for adjusting the performance of PFTOOL commands – size of the data chunk for copying, number of MPI processes, FUSE file-selection size, number of tape drives used
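A simplified sketch of the per-worker piece of the single-large-file parallel copy: each Worker copies only its assigned byte range of the source into the same offsets of the destination file. The function name, buffer size, and range hand-off are assumptions, and error handling is simplified; the Manager’s actual chunk assignment is not shown.

  #define _XOPEN_SOURCE 700
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Copy the byte range [start, start + len) from src to dst using
   * positioned I/O, so many workers can write disjoint ranges of the
   * same destination file concurrently. Returns 0 on success. */
  int copy_range(const char *src, const char *dst, off_t start, size_t len)
  {
      int in  = open(src, O_RDONLY);
      int out = open(dst, O_WRONLY | O_CREAT, 0644);
      if (in < 0 || out < 0)
          return -1;

      size_t bufsz = 8 * 1024 * 1024;   /* assume an 8 MiB I/O buffer */
      char  *buf   = malloc(bufsz);
      off_t  off   = start;
      size_t left  = len;

      while (left > 0) {
          size_t  want = left < bufsz ? left : bufsz;
          ssize_t got  = pread(in, buf, want, off);
          if (got <= 0)
              break;                    /* EOF or read error           */
          if (pwrite(out, buf, (size_t)got, off) != got)
              break;                    /* short write or write error  */
          off  += got;
          left -= (size_t)got;
      }

      free(buf);
      close(in);
      close(out);
      return left == 0 ? 0 : -1;
  }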

  19. PFTOOL Software System • pftool – 7000+ lines of C/MPI code • GPFS dsm API code + MySQL database • pftool commands – Perl scripts, Python scripts • pftool LoadManager – Perl scripts • Trashcan – open-source Python scripts + modifications • Reusing/modifying GNU Coreutils software code – rm, copy, ...

  20. Less Aggressive MPI Polling implementation in PFTOOL

  /* Figure 8-1: A typical aggressive-polling (AP) based MPI main receiving loop */
  while (1) {                            /* main receiving loop */
      MPI_Recv(buf, count, type, fromProc, tag, comm, &mpistatus);
      /* ... process the message ... */
  }

  /* Figure 8-2: The enhanced less-aggressive-polling (LAP) control, which probes
     with MPI_Iprobe and sleeps between probes instead of blocking in MPI_Recv */
  while (1) {                            /* main receiving loop */
      int msgready = 0;                  /* reset before each probe cycle */
      while (msgready == 0) {            /* message is not ready yet */
          MPI_Iprobe(fromProc, tag, comm, &msgready, &mpistatus);
          usleep(n_microseconds);        /* yield the CPU between probes */
      }
      MPI_Recv(buf, count, type, fromProc, tag, comm, &mpistatus);
      /* ... process the message ... */
  }

  The LAP loop trades a small, bounded amount of message latency (at most the sleep interval) for a large reduction in CPU occupancy on the data movers, as shown in the polling comparison results later in the presentation.

  21. Commands supported in PFTOOL • pfls – uses the parallel file tree walker to list files in parallel • pfcp – uses the parallel file tree walker to copy files in parallel • pfcm – uses the parallel file tree walker to compare source and destination files byte by byte; users run it to verify the data integrity of files after a copy (a comparison sketch follows below)
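A serial sketch of the byte-by-byte comparison pfcm performs per file pair; in the real tool the tree walk and the comparisons are distributed across MPI workers. The function name and buffer size are illustrative assumptions.

  #include <stdio.h>
  #include <string.h>

  /* Compare two files byte by byte, one buffer at a time.
   * Returns 1 if identical, 0 if they differ, -1 on open failure. */
  int compare_files(const char *a, const char *b)
  {
      FILE *fa = fopen(a, "rb");
      FILE *fb = fopen(b, "rb");
      if (!fa || !fb) {
          if (fa) fclose(fa);
          if (fb) fclose(fb);
          return -1;
      }

      char bufa[1 << 16], bufb[1 << 16];   /* 64 KiB buffers */
      int  same = 1;
      for (;;) {
          size_t na = fread(bufa, 1, sizeof bufa, fa);
          size_t nb = fread(bufb, 1, sizeof bufb, fb);
          if (na != nb || memcmp(bufa, bufb, na) != 0) {
              same = 0;                    /* lengths or contents differ */
              break;
          }
          if (na == 0)
              break;                       /* both files ended together  */
      }

      fclose(fa);
      fclose(fb);
      return same;
  }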

  22. Top Level view of PFTOOL’s System

  23. Parallel Archive Setup for Roadrunner’s Open Science Projects (diagram): the one-petaflop/s Roadrunner cluster; the /panfs scratch file system with 4 PB of capacity; ten GPFS nodes (parallel data movers) running PFTOOL and mounting /panfs and /gpfs; five NSD nodes with a slow disk pool (200 TB); six DS4800 fast disk pools (200 TB); multiple 10 GigE switches, one dedicated 10 GigE switch, two 10 GigE links, and an FC-4 Fibre Channel switch; and one TSM server with 24 LTO-4 drives providing a tape archive of over 4 PB.

  24. Number of files per archive copy job

  25. Number of megabytes copied per job

  26. Data bandwidth (MB/s) per copy job

  27. Average file size per copy job

  28. MPI Polling comparison studies – CPU occupancy

  29. MPI Polling comparison studies – data rate

  30. Experience and observed issues of our COTS Parallel Archive System • Small-file tape performance • aggregation of small files: bundling the small files into larger aggregates better suited to getting the tape drive up to full speed, and then writing each aggregate to tape (a bundling sketch follows below) • Tape optimization / smart recall • ensure that all files in a tape-recall request are handled by the same machine (avoids the tape-thrashing problem) • Limitations of the synchronous deleter • the built-in synchronous delete function between GPFS and TSM • Single TSM server • considering fail-over using multiple TSM servers
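A minimal sketch of the small-file aggregation idea: append each small file to a growing aggregate and record (name, offset, length) in an index, so the tape drive streams one large object instead of many tiny ones. The plain-text index format and function name are assumptions made for illustration; they are not the actual GPFS/TSM aggregation mechanism.

  #include <stdio.h>

  /* Append the contents of 'in' to the aggregate file and record a
   * "name offset length" line in the index so the file can be located
   * and recalled from the aggregate later. Returns the bytes appended. */
  long append_to_aggregate(FILE *agg, FILE *index, const char *name, FILE *in)
  {
      long start = ftell(agg);             /* where this file begins    */
      char buf[1 << 16];                   /* 64 KiB copy buffer        */
      long total = 0;
      size_t n;

      while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
          fwrite(buf, 1, n, agg);          /* stream into the aggregate */
          total += (long)n;
      }
      fprintf(index, "%s %ld %ld\n", name, start, total);
      return total;
  }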

  31. Summary & Future works • Doing parallel movement to/from tape for a single large parallel file, • Hierarchical storage management, • ILM features , • High volume (non-single parallel file) archives for backup / archive / content management, and • Leveraging all free file movement & management tools in Linux such as copy, move, compare, ls, etc.

  32. Continued • Currently we are working to generalize the PFTOOL software so that it accommodates most parallel file systems, such as PVFSv2, GFS, Ceph, Lustre, pNFS, etc. • We plan to add further parallel data movement commands to PFTOOL, such as parallel versions of chown, chmod, chgrp, find, touch, and grep.

  33. Q & A Thanks
