High Performance Computing Workshop (Statistics): HPC 101 Dr. Charles J Antonelli, LSAIT ARS, January 2013
Credits • Contributors: • Brock Palen (CoE-IT CAC) • Jeremy Hallum (MSIS) • Tony Markel (MSIS) • Bennet Fauber (CoE-IT CAC) • LSAIT ARS • UM CoE-IT CAC cja 2013
Roadmap • Flux Mechanics • High Performance Computing • Flux Architecture • Flux Batch Operations • Introduction to Scheduling cja 2013
Flux Mechanics cja 2013
Using Flux • Three basic requirements to use Flux: • A Flux account • A Flux allocation • An MToken (or a Software Token) cja 2013
Using Flux • A Flux account • Allows login to the Flux login nodes • Develop, compile, and test code • Available to members of U-M community, free • Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication cja 2013
Using Flux • A Flux allocation • Allows you to run jobs on the compute nodes • Current rates: • $18 per core-month for Standard Flux • $24.35 per core-month for BigMem Flux • $8 subsidy per core-month for LSA and Engineering • Details at http://www.engin.umich.edu/caen/hpc/planning/costing.html • To inquire about Flux allocations, please email flux-support@umich.edu cja 2013
Using Flux • An MToken (or a Software Token) • Required for access to the login nodes • Improves cluster security by requiring a second means of proving your identity • You can use either an MToken or an application for your mobile device (called a Software Token) for this • Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/loginnodes/twofactor.html cja 2013
Logging in to Flux • ssh flux-login.engin.umich.edu • MToken (or Software Token) required • You will be randomly connected to a Flux login node • Currently flux-login1 or flux-login2 • Firewalls restrict access to flux-login. To connect successfully, either • Physically connect your ssh client platform to the U-M campus wired network, or • Use VPN software on your client platform, or • Use ssh to log in to an ITS login node, and ssh to flux-login from there cja 2013
Modules • The module command allows you to specify what versions of software you want to use • module list -- Show loaded modules • module load name -- Load module name for use • module avail -- Show all available modules • module avail name -- Show versions of module name • module unload name -- Unload module name • module -- List all options • Enter these commands at any time during your session • A configuration file allows default module commands to be executed at login • Put module commands in file ~/privatemodules/default • Don’t put module commands in your .bashrc / .bash_profile cja 2013
Flux environment • The Flux login nodes have the standard GNU/Linux toolkit: • make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, … • Watch out for source code or data files written on non-Linux systems • Use these tools to analyze and convert source files to Linux format • file • dos2unix, mac2unix cja 2013
Lab 1 Task: Invoke R interactively on the login node • module load R • module list • R • q() • Please run only very small computations on the Flux login nodes, e.g., for testing cja 2013
Lab 2 Task: Run R in batch mode • module load R • Copy sample code to your login directory: cd • cp ~cja/stats-sample-code.tar.gz . • tar -zxvf stats-sample-code.tar.gz • cd ./stats-sample-code • Examine lab2.pbs and lab2.R • Edit lab2.pbs with your favorite Linux editor • Change the #PBS -M email address to your own cja 2013
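For orientation before editing the lab files, here is a minimal sketch of what a batch-style R script can look like (illustrative only, not the contents of lab2.R): everything runs top to bottom with no interaction, so results must be printed or saved to files, which is how output ends up in files such as lab2.out.

```r
# Minimal batch-style R script (hypothetical example, not the actual lab2.R).
# Anything printed here lands in the batch output file.

set.seed(42)                    # make the run reproducible
x <- rnorm(1e6)                 # simulate some data
fit <- lm(x ~ 1)                # trivial model: estimate the mean
print(summary(fit))             # goes to the job's output file
saveRDS(fit, file = "fit.rds")  # persist results for a later session
```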
Lab 2 Task: Run R in batch mode • Submit your job to Flux: qsub lab2.pbs • Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname • When complete, look at the job’s output: less lab2.out cja 2013
Lab 3 Task: Use the multicore package in R The multicore package allows you to use multiple cores on a single node • module load R • cd ~/stats-sample-code • Examine lab3.pbs and lab3.R • Edit lab3.pbs with your favorite Linux editor • Change the #PBS -M email address to your own cja 2013
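Before running the lab, it may help to see the core idiom. The sketch below is illustrative (the function and values are made up, not taken from lab3.R): mclapply() is a parallel drop-in for lapply() that forks worker processes on the current node, and mc.cores should match the number of cores your PBS script requests.

```r
library(multicore)              # provides mclapply() for single-node parallelism

slow_square <- function(i) {    # stand-in for a real computation
  Sys.sleep(0.01)
  i^2
}

# Run the function over 1:100 using 4 forked workers on this node.
# Set mc.cores to the number of cores requested in your PBS script.
results <- mclapply(1:100, slow_square, mc.cores = 4)
print(sum(unlist(results)))
```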
Lab 3 Task: Use the multicore package in R • Submit your job to Flux: qsub lab3.pbs • Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname • When complete, look at the job’s output: less lab3.out cja 2013
Lab 4 Task: Another multicore example in R • module load R • cd ~/stats-sample-code • Examine lab4.pbs and lab4.R • Edit lab4.pbs with your favorite Linux editor • Change the #PBS -M email address to your own cja 2013
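Another illustrative use of the same package, this time with a statistics flavor (again hypothetical, not the contents of lab4.R): bootstrap replicates are independent, so they can be farmed out with mclapply() without changing the statistical logic.

```r
library(multicore)

data <- rnorm(1000, mean = 5)   # toy data set for illustration

boot_mean <- function(i) {      # one bootstrap replicate of the sample mean
  mean(sample(data, replace = TRUE))
}

# 1,000 independent replicates spread over 4 cores on one node
boots <- unlist(mclapply(1:1000, boot_mean, mc.cores = 4))
print(quantile(boots, c(0.025, 0.975)))   # 95% bootstrap interval
```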
Lab 4 Task: Another multicore example in R • Submit your job to Flux: qsub lab4.pbs • Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname • When complete, look at the job’s output: less lab4.out cja 2013
Lab 5 Task: Run snow interactively in R The snow package allows you to use cores on multiple nodes • module load R • cd ~/stats-sample-code • Examine lab5.R • Start an interactive PBS session: qsub -I -V -l procs=3 -l walltime=30:00 -A stats_flux -l qos=flux -q flux cja 2013
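For a flavor of the snow idiom, here is a minimal sketch (illustrative only, not the contents of lab5.R; an MPI cluster assumes the Rmpi package is available alongside snow): makeCluster() launches the workers, parLapply() distributes the work across them, and stopCluster() releases them. Coordinate the worker count with the cores you requested from PBS.

```r
library(snow)

# Start 3 workers over MPI so they can land on any node in the PBS allocation.
# (MPI clusters require Rmpi; a SOCK cluster is a single-node alternative.)
cl <- makeCluster(3, type = "MPI")

slow_square <- function(i) { Sys.sleep(0.01); i^2 }

results <- parLapply(cl, 1:100, slow_square)
print(sum(unlist(results)))

stopCluster(cl)                 # always release the workers when done
```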
Lab 5 Task: Run snow interactively in R • cd $PBS_O_WORKDIR • Run snow in the interactive PBS session: R CMD BATCH --vanilla lab5.R lab5.out • … ignore any “Connection to lifeline lost” message cja 2013
Lab 6 Task: Run snowfall in R The snowfall package is similar to snow, and allows you to change the number of cores used without modifying your R code • module load R • cd ~/stats-sample-code • Examine lab6.pbs and lab6.R • Edit lab6.pbs with your favorite Linux editor • Change the #PBS -M email address to your own cja 2013
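An illustrative snowfall sketch (not the contents of lab6.R): the cpus argument to sfInit() is the single knob you change, together with your PBS request, to scale the job up or down.

```r
library(snowfall)

# Initialize 3 parallel workers; change cpus= (and your PBS request) to scale,
# without touching the rest of the script.
sfInit(parallel = TRUE, cpus = 3)

slow_square <- function(i) { Sys.sleep(0.01); i^2 }

results <- sfLapply(1:100, slow_square)
print(sum(unlist(results)))

sfStop()                        # shut the workers down
```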
Lab 6 Task: Run snowfall in R • Submit your job to Flux: qsub lab6.pbs • Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname • When complete, look at the job’s output: less lab6.out cja 2013
Lab 7 Task: Run parallel MATLAB Distribute parfor iterations over multiple cores on multiple nodes • Do this once: mkdir ~/matlab/ • cd ~/matlab • wget http://cac.engin.umich.edu/resources/software/matlabdct/mpiLibConf.m cja 2013
Lab 7 Task: Run parallel MATLAB • Start an interactive PBS session: module load matlab • qsub -I -V -l nodes=2:ppn=3 -l walltime=30:00 -A stats_flux -l qos=flux -q flux • Start MATLAB: matlab -nodisplay cja 2013
Lab 7 Task: Run parallel MATLAB • Set up a matlabpool sched = findResource('scheduler', 'type', 'mpiexec') set(sched, 'MpiexecFileName', '/home/software/rhel6/mpiexec/bin/mpiexec') set(sched, 'EnvironmentSetMethod', 'setenv') %use the 'sched' object when calling matlabpool %the syntax for matlabpool must use the (sched, N) format matlabpool (sched, 6) … ignore “Found pre-existing parallel job(s)” warnings cja 2013
Lab 7 Task: Run parallel MATLAB • Run a simple parfor: tic • x = 0; • parfor i=1:100000000 • x = x + i; • end • toc • Close the matlabpool: matlabpool close cja 2013
Compiling Code • Assuming default module settings • Use mpicc/mpiCC/mpif90 for MPI code • Use icc/icpc/ifort with -openmp for OpenMP code • Serial code, Fortran 90: ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90 • Serial code, C: icc -O3 -ipo -no-prec-div -xHost -o prog prog.c • MPI parallel code: mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c • mpirun -np 2 ./prog cja 2013
Lab Task: compile and execute simple programs on the Flux login node • Copy sample code to your login directory: cd • cp ~brockp/cac-intro-code.tar.gz . • tar -xvzf cac-intro-code.tar.gz • cd ./cac-intro-code • Examine, compile & execute helloworld.f90: ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90 • ./f90hello • Examine, compile & execute helloworld.c: icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c • ./chello • Examine, compile & execute MPI parallel code: mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c • … ignore the “feupdateenv is not implemented and will always fail” warning • mpirun -np 2 ./c_ex01 • … ignore runtime complaints about missing NICs cja 2013
Makefiles • The make command automates your code compilation process • Uses a makefile to specify dependencies between source and object files • The sample directory contains a sample makefile • To compile c_ex01: make c_ex01 • To compile all programs in the directory make • To remove all compiled programs make clean • To make all the programs using 8 compiles in parallel make -j8 cja 2013
High Performance Computing cja 2013
Advantages of HPC • Cheaper than the mainframe • More scalable than your laptop • Buy or rent only what you need • COTS hardware • COTS software • COTS expertise cja 2013
Disadvantages of HPC • Serial applications • Tightly-coupled applications • Truly massive I/O or memory requirements • Difficulty/impossibility of porting software • No COTS expertise cja 2013
Programming Models • Two basic parallel programming models • Message-passing: The application consists of several processes running on different nodes and communicating with each other over the network • Used when the data are too large to fit on a single node, and simple synchronization is adequate • “Coarse parallelism” • Implemented using MPI (Message Passing Interface) libraries • Multi-threaded: The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable • “Fine-grained parallelism” or “shared-memory parallelism” • Implemented using OpenMP (Open Multi-Processing) compilers and libraries • Both models can be combined in a single application cja 2013
Good parallel • Embarrassingly parallel • Folding@home, RSA Challenges, password cracking, … • Regular structures • Divide & conquer, e.g., Quicksort • Pipelined: N-body problems, matrix multiplication • O(n²) -> O(n) cja 2013
Less good parallel • Serial algorithms • Those that don’t parallelize easily • Irregular data & communications structures • E.g., surface/subsurface water hydrology modeling • Tightly-coupled algorithms • Unbalanced algorithms • Master/worker algorithms, where the worker load is uneven cja 2013
Amdahl’s Law If you enhance a fraction f of a computation by a speedup S, the overall speedup is: 1 / ((1 - f) + f / S) cja 2013
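For example (numbers chosen for illustration): if f = 0.95 of a job benefits from S = 12 cores, the overall speedup is 1 / (0.05 + 0.95/12) ≈ 7.7, not 12, and the 5% serial fraction caps the achievable speedup at 1/0.05 = 20 no matter how many cores are added.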
Amdahl’s Law cja 2013
Flux Architecture cja 2013
The Flux cluster Login nodes Compute nodes Data transfer node Storage … cja 2013
Behind the curtain Login nodes Compute nodes nyx Data transfer node flux Storage … shared cja 2013
A Flux node 48 GB RAM 12 Intel cores Local disk Ethernet InfiniBand cja 2013
A Newer Flux node 64 GB RAM 16 Intel cores Local disk Ethernet InfiniBand cja 2013
A Flux BigMem node 1 TB RAM 40 Intel cores Local disk Ethernet InfiniBand cja 2013
Flux hardware (January 2012) • Standard Flux: 8,016 Intel cores, 632 nodes, 48 GB RAM/node, 4 GB RAM/core (average) • BigMem Flux: 200 Intel cores, 5 nodes, 1 TB RAM/node, 25 GB RAM/core • 4X Infiniband network (interconnects all nodes) • 40 Gbps, <2 us latency • Latency an order of magnitude less than Ethernet • Lustre Filesystem • Scalable, high-performance, open • Supports MPI-IO for MPI jobs • Mounted on all login and compute nodes cja 2013
Flux software • Default Software: • Intel Compilers with OpenMPI for Fortran and C • Optional software: • PGI Compilers • Unix/GNU tools • gcc/g++/gfortran • Licensed software: • Abaqus, ANSYS, Mathematica, Matlab, R, STATA SE, … • See http://cac.engin.umich.edu/resources/software/index.html • You can choose software using the module command cja 2013
Flux network • All Flux nodes are interconnected via Infiniband and a campus-wide private Ethernet network • The Flux login nodes are also connected to the campus backbone network • The Flux data transfer node will soon be connected over a 10 Gbps connection to the campus backbone network • This means • The Flux login nodes can access the Internet • The Flux compute nodes cannot • If Infiniband is not available for a compute node, code on that node will fall back to Ethernet communications cja 2013
Flux data • Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes • 342TB of short-term storage for batch jobs • Large, fast, short-term • NFS filesystems mounted on /home and /home2 on all nodes • 40 GB of storage per user for development & testing • Small, slow, long-term cja 2013
Flux data • Flux does not provide large, long-term storage • Alternatives: • ITS Value Storage • Departmental server • CAEN can mount your storage on the login nodes • Issue the df -kh command on a login node to see what other groups have mounted cja 2013
Globus Online • Features • High-speed data transfer, much faster than SCP or SFTP • Reliable & persistent • Minimal client software: Mac OS X, Linux, Windows • GridFTP Endpoints • Gateways through which data flow • Exist for XSEDE, OSG, … • UMich: umich#flux, umich#nyx • Add your own server endpoint: contact flux-support • Add your own client endpoint! • More information • http://cac.engin.umich.edu/resources/loginnodes/globus.html cja 2013
Flux Batch Operations cja 2013