Building an Elastic Batch System with Private and Public Clouds

Presentation Transcript


  1. Building an Elastic Batch System with Private and Public Clouds. Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki, Computing Research Center, KEK, Japan. International Symposium on Grids & Clouds 2019

  2. Projects in KEK. Proton Accelerator (Tokai): T2K (neutrino experiment), Hadron experiment, MLF (Material and Life science). Electron Accelerator (Tsukuba): Belle II (e-, e+ collision), Photon Factory. (Image credit: KEK)

  3. KEK Batch System • Used by 14 projects, 1,200 users • 10,000 CPU cores • Scientific Linux 6 • IBM Spectrum LSF. Users log in remotely to work servers for interactive work and job submission; the LSF batch job scheduler dispatches jobs from the job queues to the calculation servers.

  4. Challenges for the Batch System: Piled-up Waiting Jobs • Available job slots: 10,000, limited by the number of CPU cores • At times of congestion, user jobs wait a long time in the job queue (monitoring plot for 2018/9/1 – 2018/9/30).

  5. Challenges for the Batch System: Requests for Custom Environments • Experiment groups require specific systems: developing applications on a different OS, testing newer OS versions or libraries, or keeping an old OS • Take advantage of cloud computing: expand computing resources to clouds to resolve the piled-up jobs problem, and provide heterogeneous clusters to satisfy the various requests for custom environments.

  6. Overview of the Cloud-integrated Batch Job System • Cloud resources are used through the ordinary batch job submission command, e.g. $ bsub -q aws /bin/hostname • Queue-based resource selection chooses between the on-premise resource (the SL6 cluster under LSF) and off-premise resources (OpenStack, AWS, or another cloud) reached through LSF Resource Connector [1]. [1] https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_kc_resource_connector.html
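Only the aws queue appears on this slide; as a minimal sketch of the queue-based selection, an additional queue named openstack for the on-site OpenStack resources is assumed here:

$ bsub -q aws /bin/hostname        # dispatched to an on-demand AWS EC2 instance
$ bsub -q openstack /bin/hostname  # dispatched to an OpenStack VM at KEK (queue name assumed)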

  7. Integration with OpenStack • Normal jobs are still dispatched to the physical machines (SL6); jobs for the OpenStack resources go through LSF Resource Connector. Workflow: 1. A project manager creates a custom image from the base image. 2. The cloud administrator creates a Resource Connector template. 3. The end user submits a job. 4. Resource Connector launches an instance from the custom image. 5. LSF dispatches the job to the new VM (calc. server). Example Resource Connector template:
  {
    "Name": "CentOS7_01",
    "Attributes": {
      "type": ["String", "X86_64"],
      "openstackhost": ["Numeric", "1"],
      "template": ["CentOS7_01"]
    },
    "Image": "generic-cent7-01",
    "Flavor": "c04-m016G"
  }
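A minimal sketch of how an end user might then target that template from the submission side; the queue name openstack, the resource-requirement string, and the job script are assumptions, not taken from the slides:

# request a host built from the CentOS7_01 template (script name illustrative)
$ bsub -q openstack -R "select[template==CentOS7_01]" ./run_analysis.sh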

  8. Integration with the Existing System: LDAP • LDAP authentication is already used for the cluster • The same LDAP is used as an OpenStack (Keystone) authentication backend • The LDAP is also used for the Linux accounts inside the VMs. Keystone domains allow multiple identity backends: the default domain keeps service accounts (Nova, Neutron, Glance) in the SQL database, while a separate ldap domain authenticates users against the site LDAP.
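A minimal sketch of the Keystone domain-specific configuration this setup implies; the domain name, file paths under /etc/keystone/domains, and all LDAP URLs/DNs are illustrative assumptions:

# /etc/keystone/keystone.conf
[identity]
domain_specific_drivers_enabled = true
domain_config_dir = /etc/keystone/domains

# /etc/keystone/domains/keystone.ldap.conf  (domain used for end users; name assumed)
[identity]
driver = ldap
[ldap]
url = ldap://ldap.example.kek.jp
user_tree_dn = ou=People,dc=example,dc=jp
user_objectclass = posixAccount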

  9. Share GPFS between the Local Batch System and OpenStack • Each OpenStack compute node mounts GPFS and re-exports the directories to its VMs via NFS, so the VMs (calculation servers) see the same GPFS filesystem as the batch service.
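A minimal sketch of this re-export, assuming GPFS is mounted at /gpfs on the compute node and the VM network is 192.168.0.0/24 (paths, network, and export options are assumptions):

# On the OpenStack compute node (GPFS already mounted at /gpfs)
$ echo "/gpfs 192.168.0.0/24(rw,sync,no_root_squash)" >> /etc/exports
$ exportfs -ra

# Inside a VM (compute-node is a placeholder address)
$ mount -t nfs compute-node:/gpfs /gpfs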

  10. Integration with AWS • EC2 instances are launched on demand for the AWS queue and reached from the KEK LSF work servers over a VPN connection • The AWS instances do not share the KEK filesystem (NFS/GPFS); S3 object storage is used for sharing input/output data between KEK and AWS • The OpenStack queue and the other queues keep dispatching to OpenStack and to the physical machines (SL6) at KEK.

  11. Use AWS S3 Object Storage for Sharing Data between KEK and AWS • The KEK batch system and OpenStack share the GPFS filesystem inside KEK, but the AWS environment is independent of the KEK system • S3FS [3] or Goofys [4] allows Linux to mount an AWS S3 bucket via FUSE. Workflow: 1. Put input data into the S3 bucket from KEK. 2. Copy the input data from the bucket onto the AWS-side NFS. 3. Submit the job. 4. Copy the output data back to the bucket. 5. Get the output data at KEK. [3] https://github.com/s3fs-fuse/s3fs-fuse [4] https://github.com/kahing/goofys
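A minimal sketch of mounting a bucket with either tool; the bucket name, mount point, and credential locations are assumptions:

# s3fs: credentials in ~/.passwd-s3fs (ACCESS_KEY_ID:SECRET_ACCESS_KEY)
$ s3fs my-kek-bucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs

# goofys: credentials read from ~/.aws/credentials or the environment
$ goofys my-kek-bucket /mnt/s3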

  12. Upload/Download Speed Comparison between S3FS and Goofys • Measured cp command execution time for 1 MB x 1000 files, 10 MB x 100, 100 MB x 10, and 1000 MB x 1: $ cp -r /local/1mb_files_dir/ /s3fs/ and $ cp -r /local/1mb_files_dir/ /goofys/ • Goofys upload performance is better than S3FS's, while S3FS offers more POSIX compatibility than Goofys.

  13. Monitoring Resource Transition on AWS • Monitoring plots show, after jobs are submitted, the transition of the number of instances on AWS and of the total number of cores on AWS.

  14. Scalability Test: Run Geant4-based Particle Therapy Monte Carlo Simulation Jobs on AWS • The simulation models a treatment head with patient data obtained from CT images and produces a simulated dose distribution along the particle beam direction • The Monte Carlo simulation shoots 2,000,000 protons in total on N CPU cores; for example, if N = 10, each of the 10 CPU cores simulates 200,000 events.
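A minimal sketch of how such a split might be expressed as an LSF job array; the queue, script name, and options are assumptions, not the authors' actual commands:

# 10 array elements, each simulating 2,000,000 / 10 = 200,000 protons
$ bsub -q aws -J "g4sim[1-10]" "./run_geant4.sh --events 200000 --seed \$LSB_JOBINDEX"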

  15. Scalability Test: Run Geant4-based Particle Therapy Monte Carlo Simulation Jobs on AWS • Scalability comparison between KEK and AWS • On AWS, NFS degrades the performance, but the AWS result shows the same tendency as the KEK one.

  16. Scalability Test: Image Classification by Deep Learning on AWS • Classify CIFAR-10 images [5] into 10 categories (e.g. automobile) • We built a convolutional neural network (conv1 → pool1 → conv2 → pool2 → FC1 → FC2) and trained it for the classification using TensorFlow [6]. [5] https://www.cs.toronto.edu/~kriz/cifar.html [6] https://www.tensorflow.org/tutorials/deep_cnn

  17. Scalability Test: Image Classification, Multi-node Deep Learning on AWS • TensorFlow jobs were submitted to the AWS queue and scalability was measured by changing the number of workers • In the TensorFlow cluster, a parameter server stores and updates the parameters while the workers calculate the loss • Training time dropped from about 23,000 sec (6.5 hours) with 1 worker (64 cores) to about 1,000 sec with 30 workers (1,920 cores); with 57 workers (3,648 cores) there was little further improvement, possibly because of traffic congestion.

  18. Another Use Case: Automatic Offloading to the Cloud • Submit 3,000 jobs to the mixed-resources (KEK and AWS) queue and follow each job's status (PEND/RUN) over time: 1. Some jobs are dispatched to KEK servers. 2. With no more free resources on KEK, AWS instances are launched and some jobs are dispatched to them. 3. When free resources appear on KEK again, some jobs are dispatched to KEK servers. 4. Other jobs are dispatched to AWS servers.
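A minimal sketch of such a bulk submission; the queue name mixed and the job script are assumptions:

# 3,000 jobs submitted as one array to a queue backed by both KEK and AWS resources
$ bsub -q mixed -J "offload[1-3000]" ./run_job.sh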

  19. Summary • We succeeded in integrating OpenStack and AWS clouds with the LSF batch job system by using Resource Connector • The system expands computing resources to clouds, reducing job turnaround times at times of congestion • It can provide many kinds of job-processing environments by choosing different instance images • The Monte Carlo simulation worked well on AWS, with a slight performance degradation due to NFS • Deep learning training on AWS scaled well up to about 2,000 CPU cores • We also succeeded in automatically offloading some batch workloads to the AWS cloud • The cloud resources used in this work were provided in the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).
