Join Dr. Charles J. Antonelli for a comprehensive workshop on High-Performance Computing (HPC) and the Flux architecture, aimed at understanding computational clusters and effective programming models. Explore key topics such as Flux mechanics, message-passing, multi-threading, and Amdahl's Law. Discover the capabilities of the Flux system, including its extensive software ecosystem, network architecture, and data management strategies. This workshop caters to researchers and students eager to leverage advanced computing resources for interdisciplinary applications.
High Performance Computing Workshop: HPC 101
Dr. Charles J. Antonelli
LSAIT ARS
June 2014
Credits
• Contributors:
• Brock Palen (CAEN HPC)
• Jeremy Hallum (MSIS)
• Tony Markel (MSIS)
• Bennet Fauber (CAEN HPC)
• Mark Montague (LSAIT ARS)
• Nancy Herlocher (LSAIT ARS)
• LSAIT ARS
• CAEN HPC
Roadmap
• High Performance Computing
• Flux Architecture
• Flux Mechanics
• Flux Batch Operations
• Introduction to Scheduling
High Performance Computing
Cluster HPC
• A computing cluster: a number of computing nodes connected together via special hardware and software that together can solve large problems
• A cluster is much less expensive than a single supercomputer (e.g., a mainframe)
• Using clusters effectively requires support in scientific software applications (e.g., Matlab's Parallel Toolbox or R's snow library), or custom code
Programming Models
• Two basic parallel programming models:
• Message-passing: the application consists of several processes running on different nodes and communicating with each other over the network
• Used when the data are too large to fit on a single node, and simple synchronization is adequate
• "Coarse-grained parallelism"
• Implemented using MPI (Message Passing Interface) libraries
• Multi-threaded: the application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
• Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
• "Fine-grained parallelism" or "shared-memory parallelism"
• Implemented using OpenMP (Open Multi-Processing) compilers and libraries
• Both models can be combined in the same application (hybrid programming)
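The practical difference shows up at launch time. A minimal sketch, assuming hypothetical executables mpi_prog (MPI) and omp_prog (OpenMP):

    # Message-passing: mpirun starts several independent processes,
    # here 12 copies of mpi_prog, which communicate via MPI calls
    mpirun -np 12 ./mpi_prog

    # Multi-threading: a single process is started; the OpenMP runtime
    # reads OMP_NUM_THREADS to decide how many threads to spawn inside it
    export OMP_NUM_THREADS=12
    ./omp_prog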
Amdahl's Law
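The chart from this slide did not survive extraction; the law itself is standard. If a fraction p of a program's work parallelizes perfectly across N cores, the overall speedup is:

    S(N) = \frac{1}{(1 - p) + p/N}

As N grows, the speedup approaches 1/(1 - p); with p = 0.95, for example, no number of cores yields more than a 20x speedup.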
Flux Architecture
Flux
• Flux is a university-wide shared computational discovery / high-performance computing service
• Provided by Advanced Research Computing at U-M
• Operated by CAEN HPC
• Procurement, licensing, and billing by U-M ITS
• Interdisciplinary since 2010
• http://arc.research.umich.edu/resources-services/flux/
The Flux cluster (diagram): login nodes, compute nodes, a data transfer node, and storage.
A Flux node (diagram): 48-64 GB RAM, 12-16 Intel cores, local disk, Ethernet and InfiniBand interconnects.
A Large Memory Flux node (diagram): 1 TB RAM, 32-40 Intel cores, local disk, Ethernet and InfiniBand interconnects.
Coming soon: a Flux GPU node (diagram): 64 GB RAM, 16 Intel cores, 8 GPUs (each GPU contains 2,688 GPU cores), local disk.
Flux software
• Licensed and open software: Abaqus, BLAST, BWA, bowtie, ANSYS, Java, Mason, Mathematica, Matlab, R, RSEM, Stata SE, …
• See http://cac.engin.umich.edu/resources
• C, C++, and Fortran compilers: Intel (default), PGI, and GNU toolchains
• You choose software using the module command
Flux network
• All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network
• The Flux login nodes are also connected to the campus backbone network
• The Flux data transfer node is connected over a 10 Gbps connection to the campus backbone network
• This means:
• The Flux login nodes can access the Internet; the Flux compute nodes cannot
• If InfiniBand is not available for a compute node, code on that node will fall back to Ethernet communications
Flux data
• Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes
• 640 TB of short-term storage for batch jobs
• Large, fast, short-term
• NFS filesystems mounted on /home and /home2 on all nodes
• 80 GB of storage per user for development and testing
• Small, slow, long-term
Flux data
• Flux does not provide large, long-term storage
• Alternatives:
• Value Storage (NFS):
$20.84 / TB / month (replicated, no backups)
$10.42 / TB / month (non-replicated, no backups)
• LSA Large Scale Research Storage:
2 TB free to researchers (replicated, no backups); faculty members, lecturers, postdocs, GSI/GSRA
Additional storage $30 / TB / year (replicated, no backups)
• Departmental server: CAEN can mount your storage on the login nodes
Copying data
Several ways to copy data to/from Flux:
• From Linux or Mac OS X, use scp:
scp localfile login@flux-xfer.engin.umich.edu:remotefile
scp login@flux-login.engin.umich.edu:remotefile localfile
scp -r localdir login@flux-xfer.engin.umich.edu:remotedir
• From Windows, use WinSCP
• U-M Blue Disc: http://www.itcs.umich.edu/bluedisc/
• Use Globus Connect
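A concrete round trip with scp, using hypothetical file and directory names (mydata.csv, results.out, mydir) and uniqname in place of login:

    # Copy a local file up to your Flux home directory via the transfer node
    scp mydata.csv uniqname@flux-xfer.engin.umich.edu:mydata.csv

    # Copy a results file back down to the current directory
    scp uniqname@flux-xfer.engin.umich.edu:results.out .

    # Recursively copy a whole directory up
    scp -r mydir uniqname@flux-xfer.engin.umich.edu:mydir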
Globus Connect
• Features:
• High-speed data transfer, much faster than SCP or SFTP
• Reliable and persistent
• Minimal client software for Mac OS X, Linux, and Windows
• GridFTP endpoints:
• Gateways through which data flow
• Exist for XSEDE, OSG, …
• UMich: umich#flux, umich#nyx
• Add your own client endpoint!
• Add your own server endpoint: contact flux-support@umich.edu
• More information:
http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
Flux Mechanics
Using Flux
• Three basic requirements to use Flux:
• A Flux account
• A Flux allocation
• An MToken (or a Software Token)
Using Flux
• A Flux account
• Allows login to the Flux login nodes
• Develop, compile, and test code
• Available to members of the U-M community, free
• Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication
Using Flux
• A Flux allocation
• Allows you to run jobs on the compute nodes
• Some units cost-share Flux rates:
• Regular Flux: $11.72/core/month; LSA, Engineering, Medical School: $6.60/core/month
• Large Memory Flux: $23.82/core/month; LSA, Engineering, Medical School: $13.30/core/month
• GPU Flux: $107.10 per 2 CPU cores and 1 GPU per month; LSA, Engineering, Medical School: $60/month
• Flux Operating Environment: $113.25/node/month; LSA, Engineering, Medical School: $63.50/node/month
• Flux pricing at http://arc.research.umich.edu/flux/hardware-services/
• Rackham grants are available for graduate students
• Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
• To inquire about Flux allocations, please email flux-support@umich.edu
Using Flux
• An MToken (or a Software Token)
• Required for access to the login nodes
• Improves cluster security by requiring a second means of proving your identity
• You can use either an MToken or an application for your mobile device (called a Software Token) for this
• Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa
Logging in to Flux
• ssh flux-login.engin.umich.edu
• MToken (or Software Token) required
• You will be randomly connected to a Flux login node (currently flux-login1 or flux-login2)
• Firewalls restrict access to flux-login. To connect successfully, either:
• Physically connect your ssh client platform to the U-M campus wired or MWireless network, or
• Use VPN software on your client platform, or
• Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there (see the sketch below)
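A sketch of the last option, the two-hop login from off campus, with uniqname standing in for your own uniqname:

    # First hop: the ITS login node, reachable from the open Internet
    ssh uniqname@login.itd.umich.edu

    # Second hop, run from the ITS node: the Flux login pool
    ssh uniqname@flux-login.engin.umich.edu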
Modules
• The module command allows you to specify which versions of software you want to use:
module list -- Show loaded modules
module load name -- Load module name for use
module avail -- Show all available modules
module avail name -- Show versions of module name
module unload name -- Unload module name
module -- List all options
• Enter these commands at any time during your session
• A configuration file allows default module commands to be executed at login: put module commands in the file ~/privatemodules/default (as sketched below)
• Don't put module commands in your .bashrc / .bash_profile
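A minimal sketch of ~/privatemodules/default, assuming (per the slide) that the file simply lists module commands; the modules chosen here are just the ones used later in this workshop:

    # ~/privatemodules/default: module commands executed at login
    module load R
    module load matlab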
Flux environment
• The Flux login nodes have the standard GNU/Linux toolkit: make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, …
• Watch out for source code or data files written on non-Linux systems; use these tools to analyze and convert source files to Linux format:
• file
• dos2unix
Lab 1
Task: Invoke R interactively on the login node
• module load R
module list
• R
q()
• Please run only very small computations on the Flux login nodes, e.g., for testing
Lab 2
Task: Run R in batch mode
• module load R
• Copy the sample code to your login directory:
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code
• Examine Rbatch.pbs and Rbatch.R
• Edit Rbatch.pbs with your favorite Linux editor; change the #PBS -M email address to your own
Lab 2
Task: Run R in batch mode
• Submit your job to Flux:
qsub Rbatch.pbs
• Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
• When complete, look at the job's output:
less Rbatch.out
• Copy your results to your local workstation (change uniqname to your own uniqname):
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rbatch.out Rbatch.out
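For orientation, a minimal sketch of what a script like Rbatch.pbs could contain, modeled on the batch script templates shown later in this workshop; the actual file in the sample tarball may differ, and youralloc_flux and youremailaddress are placeholders:

    #PBS -N Rbatch
    #PBS -V
    #PBS -A youralloc_flux
    #PBS -l qos=flux
    #PBS -q flux
    #PBS -l procs=1,pmem=1gb,walltime=00:10:00
    #PBS -M youremailaddress
    #PBS -m abe
    #PBS -j oe
    # Run the R program in batch mode, writing its output to Rbatch.out
    cd $PBS_O_WORKDIR
    R CMD BATCH --vanilla Rbatch.R Rbatch.out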
Lab 3
Task: Use the multicore package
The multicore package allows you to use multiple cores on the same node
• module load R
cd ~/hpc-sample-code
• Examine Rmulti.pbs and Rmulti.R
• Edit Rmulti.pbs with your favorite Linux editor; change the #PBS -M email address to your own
Lab 3
Task: Use the multicore package
• Submit your job to Flux:
qsub Rmulti.pbs
• Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
• When complete, look at the job's output:
less Rmulti.out
• Copy your results to your local workstation (change uniqname to your own uniqname):
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rmulti.out Rmulti.out
Compiling Code
• Assuming default module settings:
• Use mpicc/mpiCC/mpif90 for MPI code
• Use icc/icpc/ifort with -mp for OpenMP code
• Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
• Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
• MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog
Lab 4
Task: Compile and execute simple programs on the Flux login node
• Copy the sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
• Examine, compile, and execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello
• Examine, compile, and execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello
• Examine, compile, and execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01
Makefiles
• The make command automates your code compilation process
• Uses a makefile to specify dependencies between source and object files
• The sample directory contains a sample makefile
• To compile c_ex01: make c_ex01
• To compile all programs in the directory: make
• To remove all compiled programs: make clean
• To make all the programs using 8 compiles in parallel: make -j8
Flux Batch Operations
Portable Batch System
• All production runs are run on the compute nodes using the Portable Batch System (PBS)
• PBS manages all aspects of cluster job execution except job scheduling
• Flux uses the Torque implementation of PBS
• Flux uses the Moab scheduler for job scheduling
• Torque and Moab work together to control access to the compute nodes
• PBS puts jobs into queues
• Flux has a single queue, named flux
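To inspect the queue configuration yourself, Torque's standard queue listing should work from a login node (its exact output on Flux is an assumption):

    # List batch queues and their limits; expect a single queue named flux
    qstat -q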
Cluster workflow
• You create a batch script and submit it to PBS
• PBS schedules your job, and it enters the flux queue
• When its turn arrives, your job executes the batch script
• Your script has access to any applications or data stored on the Flux cluster
• When your job completes, anything it sent to standard output and standard error is saved and returned to you
• You can check on the status of your job at any time, or delete it if it's not doing what you want
• A short time after your job completes, it disappears
Basic batch commands
• Once you have a script, submit it:
qsub scriptfile
$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu
• You can check on the job status:
qstat jobid
qstat -u user
$ qstat -u cja
nyx.engin.umich.edu:
                                                      Req'd  Req'd   Elap
Job ID           Username Queue    Jobname SessID NDS TSK Memory Time  S Time
---------------- -------- -------- ------- ------ --- --- ------ ----- - -----
6023521.nyx.engi cja      flux     hpc101i     --   1   1     -- 00:05 Q    --
• To delete your job:
qdel jobid
$ qdel 6023521
$
Loosely-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=12,pmem=1gb,walltime=01:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
# Your code goes below:
cd $PBS_O_WORKDIR
mpirun ./c_ex01
Tightly-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=12,mem=47gb,walltime=02:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
# Your code goes below:
cd $PBS_O_WORKDIR
matlab -nodisplay -r script
Lab 5
Task: Run an MPI job on 8 cores
• Compile c_ex05:
cd ~/cac-intro-code
make c_ex05
• Edit the file run with your favorite Linux editor:
• Change the #PBS -M address to your own (I don't want Brock to get your email!)
• Change the #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired
• Change the #PBS -l qos to flux
• Submit your job:
qsub run
PBS attributes
• As always, man qsub is your friend:
-N : sets the job name; can't start with a number
-V : copy shell environment to the compute node
-A youralloc_flux : sets the allocation you are using
-l qos=flux : sets the quality-of-service parameter
-q flux : sets the queue you are submitting to
-l : requests resources, like number of cores or nodes
-M : whom to email; can be multiple addresses
-m : when to email: a=job abort, b=job begin, e=job end
-j oe : join STDOUT and STDERR into a common file
-I : allow interactive use
-X : allow X GUI use
PBS resources (1)
• A resource (-l) can specify:
• Request wallclock (that is, running) time:
-l walltime=HH:MM:SS
• Request C MB of memory per core:
-l pmem=Cmb
• Request T MB of memory for the entire job:
-l mem=Tmb
• Request M cores on arbitrary node(s):
-l procs=M
• Request a token to use licensed software:
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
PBS resources (2)
• A resource (-l) can specify, for multithreaded code:
• Request M nodes with at least N cores per node:
-l nodes=M:ppn=N
• Request M cores with exactly N cores per node (note the difference from ppn in both syntax and semantics!):
-l nodes=M,tpn=N
(you'll only use this for specific algorithms)
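Putting several resource requests together, a sketch of a complete qsub invocation (the script name myscript.pbs and the allocation are placeholders):

    # Ask for 2 nodes with 12 cores each, 2 GB per core, 4 hours of walltime
    qsub -l nodes=2:ppn=12,pmem=2gb,walltime=04:00:00 \
         -A youralloc_flux -l qos=flux -q flux myscript.pbs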
Interactive jobs
• You can submit jobs interactively:
qsub -I -X -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
• This queues a job as usual
• Your terminal session will be blocked until the job runs
• When your job runs, you'll get an interactive shell on one of your nodes
• Invoked commands will have access to all of your nodes
• When you exit the shell, your job is deleted
• Interactive jobs allow you to:
• Develop and test on cluster node(s)
• Execute GUI tools on a cluster node
• Use a parallel debugger interactively
Lab 6
Task: Run an interactive job
• Enter this command (all on one line):
qsub -I -V -l procs=1 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• When your job starts, you'll get an interactive shell
• Copy and paste the batch commands from the "run" file, one at a time, into this shell
• Experiment with other commands
• After thirty minutes, your interactive shell will be killed
Lab 7
Task: Run Matlab interactively
• module load matlab
• Start an interactive PBS session:
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• Run Matlab in the interactive PBS session:
matlab -nodisplay
Introduction to Scheduling
The Scheduler (1/3)
• Flux scheduling policies:
• The job's queue determines the set of nodes you run on
• The job's account and qos determine the allocation to be charged
• If you specify an inactive allocation, your job will never run
• The job's resource requirements help determine when the job becomes eligible to run
• If you ask for unavailable resources, your job will wait until they become free
• There is no pre-emption
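Moab also provides commands for inspecting its scheduling decisions; they are commonly installed on Moab-scheduled clusters, though their availability on Flux is an assumption:

    # Show the queue as the scheduler sees it: running, eligible, and blocked jobs
    showq -u uniqname

    # Explain the scheduler's view of one job, including why it is still waiting
    checkjob jobid

    # Estimate when a queued job should start
    showstart jobid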