
Introduction to Flux: Hands-on Session


Presentation Transcript


  1. Introduction to Flux: Hands-on Session Dr. Charles J Antonelli LSAIT Research Systems Group The University of Michigan November 30, 2011

  2. Roadmap • High Performance Computing • HPC Resources • U-M Flux Architecture • Using Flux • Gaining Insight

  3. High Performance Computing

  4. Advantages of HPC • Cheaper than the mainframe • More scalable than your laptop • Buy or rent only what you need • COTS hardware • COTS software • COTS expertise

  5. Disadvantages of HPC • Serial applications • Tightly-coupled applications • Truly massive I/O or memory requirements • Difficulty/impossibility of porting software • No COTS expertise

  6. Programming Models • Two basic parallel programming models • Message-passing: The application consists of several processes running on different nodes and communicating with each other over the network • Used when the data are too large to fit on a single node, and simple synchronization is adequate • “Coarse parallelism” • Implemented using MPI (Message Passing Interface) libraries • Multi-threaded: The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable • “Fine-grained parallelism” or “shared-memory parallelism” • Implemented using OpenMP (Open Multi-Processing) compilers and libraries • Both models can be combined in the same application
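To make the multi-threaded model concrete, here is a minimal OpenMP sketch in C. It is illustrative only, not one of the course's sample programs; compile it with icc and the OpenMP option described on the Compiling Code slide, then run it on a single node.

    /* Minimal OpenMP sketch (illustrative, not a course sample program). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;
        int i;

        /* Split the loop iterations across threads; reduction(+:sum) gives
           each thread a private partial sum and adds them together at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++)
            sum += 1.0 / i;

        printf("computed with up to %d threads: sum = %f\n",
               omp_get_max_threads(), sum);
        return 0;
    }

Each thread works on a different slice of the loop in shared memory on one node; contrast this with the MPI examples in the labs, where separate processes exchange messages and are launched with mpirun.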

  7. Good parallel • Embarrassingly parallel • Folding@home, RSA Challenges, password cracking, … • Regular structures • Divide & conquer, e.g., Quicksort • Pipelined: N-body problems, matrix multiplication • O(n²) -> O(n)

  8. Less good parallel • Serial algorithms • Those that don’t parallelize easily • Irregular data & communications structures • E.g., surface/subsurface water hydrology modeling • Tightly-coupled algorithms • Unbalanced algorithms • Master/worker algorithms, where the worker load is unbalanced

  9. Amdahl’s Law If you enhance a fraction f of a computation by a speedup S, the overall speedup is: speedup = 1 / ((1 - f) + f / S)
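As a worked example (the numbers here are chosen for illustration, not taken from the slides): if f = 0.95 of a program is sped up by S = 20, the overall speedup is 1 / ((1 - 0.95) + 0.95/20) = 1 / 0.0975 ≈ 10.3. Even as S grows without bound, the speedup can never exceed 1 / (1 - f) = 20, so the serial fraction ultimately limits scaling.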

  10. Amdahl’s Law

  11. HPC Resources

  12. Some Resources • U-M • Cirrus Project http://orci.research.umich.edu/about-orci/cirrus-project/ • Flux: shared leased computing • ORCI • Meta • XSEDE (was TeraGrid) • Amazon EC2 • Google App Engine • Microsoft Azure • …

  13. U-M Flux Architecture

  14. The Flux cluster (diagram: login nodes, compute nodes, a data transfer node, and storage)

  15. The truth (diagram: the same login nodes, compute nodes, data transfer node, and storage, with portions labeled nyx, flux, and shared)

  16. The Flux node • 48 GB RAM • 12 Intel cores • Local disk • InfiniBand and Ethernet interconnects

  17. Flux hardware • 2,000 Intel cores (4,000 in January 2012) • 172 Flux nodes (340 in January 2012) • 48 GB RAM/node, 4 GB RAM/core (average) • 4X InfiniBand network (interconnects all nodes) • 40 Gbps, <2 us latency • Latency an order of magnitude less than Ethernet • Lustre Filesystem • Scalable, high-performance, open • Supports MPI-IO for MPI jobs • Mounted on all login and compute nodes

  18. Flux software • Default Software: • Intel Compilers with OpenMPI for Fortran and C • Optional software: • PGI Compilers • Unix/GNU tools • gcc/g++/gfortran • Licensed software: • Abacus, ANSYS, Mathematica, Matlab, STATA SE, … • See http://cac.engin.umich.edu/resources/software/index.html • You can choose software using the module command

  19. Flux network • All Flux nodes are interconnected via Infiniband and a campus-wide private Ethernet network • The Flux login nodes are also connected to the campus backbone network • The Flux data transfer node will soon be connected over a 10 Gbps connection to the campus backbone network • This means • The Flux login nodes can access the Internet • The Flux compute nodes cannot • If Infiniband is not available for a compute node, code on that node will fall back to Ethernet communications

  20. Flux data • Lustre filesystem mounted on /nobackup on all login, compute, and transfer nodes • 143 TB of short-term storage for batch jobs (375 TB in January 2012) • Large, fast, short-term • NFS filesystems mounted on /home and /home2 on all nodes • 40 GB of storage per user for development & testing • Small, slow, long-term

  21. Flux data • Flux does not provide large, long-term storage • Alternatives: • ITS Value Storage • Departmental server • CAEN can mount your storage on the login nodes • Issue the df -kh command on a login node to see what other groups have mounted

  22. Globus Online GridFTP • Features • High-speed data transfer, much faster than SCP or SFTP • Reliable & persistent • Minimal client software: Mac OS X, Linux, Windows • GridFTP Endpoints • Gateways through which data flow • Exist for XSEDE, OSG, … • UMich: umich#flux, umich#nyx • Add your own server endpoint: contact flux-support • Add your own client endpoint! • More information • http://cac.engin.umich.edu/resources/loginnodes/globus.html

  23. Using Flux

  24. Using Flux • Two requirements: • A Flux account • Allows login to the Flux login nodes • Develop, compile, and test code • Available to members of U-M community, free • Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication • A Flux allocation • Allows you to run jobs on the compute nodes • Current rate is $18 per core-month • Discounted to $11.20 per core-month until July 1, 2012 • To inquire about flux allocations please email flux-support@umich.edu

  25. Flux On-Demand • Alternative to a static allocation • Pay only for the core time you use • Pros • Accommodates “bursty” usage patterns • Cons • Limit of 50 cores total • Limit of 25 cores for any user

  26. Logging in to Flux • ssh flux-login.engin.umich.edu • You will be randomly connected to a Flux login node • Currently flux-login1 or flux-login2 • Firewalls restrict access to flux-login. To connect successfully, either • Physically connect your ssh client platform to the U-M campus wired network, or • Use VPN software on your client platform, or • Use ssh to login to an ITS login node, and ssh to flux-login from there

  27. Modules • The module command allows you to specify what versions of software you want to use • module list -- Show loaded modules • module load name -- Load module name for use • module avail -- Show all available modules • module avail name -- Show versions of module name • module unload name -- Unload module name • module -- List all options • Enter these commands at any time during your session • A configuration file allows default module commands to be executed at login • Put module commands in file ~/privatemodules/default
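For example, a session might look like the following (the module name shown is illustrative; run module avail on Flux to see what is actually installed):

    module list              # show what is loaded at login
    module avail matlab      # list available versions of a module (example name)
    module load matlab       # load the default version
    module unload matlab     # unload it again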

  28. Flux environment • The Flux login nodes have the standard GNU/Linux toolkit: • make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, … • Source code written on non-Linux systems • On Flux, use these tools to convert source files to Linux format • dos2unix • mac2unix

  29. Compiling Code • Assuming default module settings • Use mpicc/mpiCC/mpif90 for MPI code • Use icc/icpc/ifort with -mp for OpenMP code • Serial code, Fortran 90: ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90 • Serial code, C: icc -O3 -ipo -no-prec-div -xHost -o prog prog.c • MPI parallel code: mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c • mpirun -np 2 ./prog

  30. Lab 1 • Task: compile and execute simple programs on the Flux login node • Copy sample code to your login directory:
      cd
      cp ~brockp/cac-intro-code.tar.gz .
      tar -xvzf cac-intro-code.tar.gz
      cd ./cac-intro-code
      • Examine, compile & execute helloworld.f90:
      ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
      ./f90hello
      • Examine, compile & execute helloworld.c:
      icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
      ./chello
      • Examine, compile & execute MPI parallel code:
      mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
      … ignore the “feupdateenv is not implemented and will always fail” warning
      mpirun -np 2 ./c_ex01
      … ignore runtime complaints about missing NICs

  31. Makefiles • The make command automates your code compilation process • Uses a makefile to specify dependencies between source and object files • The sample directory contains a sample makefile • To compile c_ex01: make c_ex01 • To compile all programs in the directory: make • To remove all compiled programs: make clean • To make all the programs using 8 compiles in parallel: make -j8
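A minimal makefile looks something like the sketch below. It is illustrative only, not the actual makefile shipped in cac-intro-code; the flags are simply the serial compile flags used elsewhere in these slides, and recipe lines in a real makefile must begin with a tab.

    CC     = icc
    CFLAGS = -O3 -ipo -no-prec-div -xHost

    # Rebuild c_ex01 only when c_ex01.c is newer than the executable
    c_ex01: c_ex01.c
            $(CC) $(CFLAGS) -o c_ex01 c_ex01.c

    clean:
            rm -f c_ex01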

  32. The Portable Batch System

  33. Portable Batch System • All production runs are run on the compute nodes using the Portable Batch System (PBS) • PBS manages all aspects of cluster job execution • Flux uses the Torque implementation of PBS • Flux uses the Moab scheduler for job scheduling • Torque and Moab work together to control access to the compute nodes • PBS puts jobs into queues • Flux has a single queue, named flux

  34. Using the cluster • You create a batch script and submit it to PBS • PBS schedules your job, and it enters the flux queue • When its turn arrives, your job will execute the batch script • Your script has access to any applications or data stored on the Flux cluster • When your job completes, its standard output and error are saved and returned to you • You can check on the status of your job at any time, or delete it if it’s not doing what you want • A short time after your job completes, it disappears

  35. Sample serial script
      #PBS -N yourjobname
      #PBS -V
      #PBS -q flux
      #PBS -A youralloc_flux
      #PBS -l qos=youralloc_flux
      #PBS -l procs=1,walltime=00:05:00
      #PBS -M youremailaddress
      #PBS -m abe
      #PBS -j oe
      # Your Code Goes Below:
      cd $PBS_O_WORKDIR
      ./f90hello

  36. Sample batch script
      #PBS -N yourjobname
      #PBS -V
      #PBS -q flux
      #PBS -A youralloc_flux
      #PBS -l qos=youralloc_flux
      #PBS -l procs=16,walltime=00:05:00
      #PBS -M youremailaddress
      #PBS -m abe
      #PBS -j oe
      # Your Code Goes Below:
      cat $PBS_NODEFILE       # lists the node(s) your job ran on
      cd $PBS_O_WORKDIR       # change to the submission directory
      mpirun ./c_ex01         # no need to specify -np

  37. Batch job mechanics • Once you have a script, submit it: qsub scriptfile
      $ qsub singlenode.pbs
      6023521.nyx.engin.umich.edu
      • You can check on the job status: qstat -u username
      $ qstat -u cja
      nyx.engin.umich.edu:
                                                                  Req'd  Req'd   Elap
      Job ID           Username Queue    Jobname          SessID NDS TSK Memory Time  S Time
      ---------------- -------- -------- ---------------- ------ --- --- ------ ----- - -----
      6023521.nyx.engi cja      flux     hpc101i          --     1   1   --     00:05 Q --
      • To delete your job: qdel jobid
      $ qdel 6023521
      $

  38. Lab 2 • Task: Run an MPI job on 8 cores • Compile c_ex05: cd ~/cac-intro-code, then make c_ex05 • Edit the file run with your favorite Linux editor • Change #PBS -M address to your own • I don’t want Brock to get your email! • Change both the #PBS -A and #PBS -l qos= allocation names to FluxTraining_flux, or to your own allocation name, if desired • Submit your job: qsub run

  39. PBS attributes • As always, man qsub is your friend • -N : sets the job name, can’t start with a number • -V : copy shell environment to compute node • -q flux : sets the queue you are submitting to • -A youralloc_flux : sets the allocation you are using • -l qos=youralloc_flux : must match the allocation • -l : requests resources, like number of cores or nodes • -M : whom to email, can be multiple addresses • -m : when to email: a=job abort, b=job begin, e=job end • -j oe : join STDOUT and STDERR to a common file • -I : allow interactive use • -X : allow X GUI use

  40. PBS resources • A resource (-l) can specify: • Request wallclock (that is, running) time: -l walltime=HH:MM:SS • Request C MB of memory per core: -l pmem=Cmb • Request T MB of memory for the entire job: -l mem=Tmb • Request M cores on arbitrary node(s): -l procs=M • Request M nodes with N cores per node (only if necessary): -l nodes=M:ppn=N • Request a token to use licensed software: -l gres=stata:1 or -l gres=matlab or -l gres=matlab%Communication_toolbox
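These resources can be combined in a single request. For example (the values here are illustrative, not a recommendation):

    #PBS -l procs=12,pmem=2000mb,walltime=04:00:00

This asks for 12 cores on any node(s), 2000 MB of memory per core, and a wallclock limit of four hours; the same string can also be passed to qsub with -l on the command line.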

  41. Interactive jobs • You can submit jobs interactively: qsub -I -X -V -q flux -l procs=2 -l walltime=15:00 -l qos=youralloc_flux -A youralloc_flux • This queues a job as usual • Your terminal session will be blocked until the job runs • When it runs, you will be connected to one of your nodes • Invoked serial commands will run on that node • Invoked parallel commands (e.g., via mpirun) will run on all of your nodes • When you exit the terminal session your job is deleted • Interactive jobs allow you to • Test your code on cluster node(s) • Execute GUI tools on a cluster node with output on your local platform’s X server • Utilize a parallel debugger interactively

  42. Lab 3 • Task: Run an interactive job • Enter this command (all on one line): qsub -I -X -V -q flux -l procs=2 -l walltime=15:00 -l qos=FluxTraining_flux -A FluxTraining_flux • When your job starts, you’ll get an interactive shell • Copy and paste the batch commands from the “run” file, one at a time, into this shell • Experiment with other commands • After fifteen minutes, your interactive shell will be killed

  43. Gaining Insight

  44. The Scheduler (1/3) • Flux scheduling policies: • The job’s queue determines the set of nodes you run on • The job’s account and qos determine the allocation to be charged • If you specify an inactive allocation, your job will never run • The job’s resource requirements help determine when the job becomes eligible to run • If you ask for unavailable resources, your job will wait until they become free • There is no pre-emption

  45. The Scheduler (2/3) • Flux scheduling policies: • If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run: • How long you have waited for the resource • How much of the resource you have used so far • This is called “fairshare” • The scheduler will reserve nodes for a job with sufficient priority • This is intended to prevent starving jobs with large resource requirements

  46. The Scheduler (3/3) • Flux scheduling policies: • If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps • This is called “backfill” • (diagram: backfill jobs filling gaps in the cores-versus-time schedule)

  47. Gaining insight • There are several commands you can run to get some insight into the scheduler’s actions: • freenodes : shows the number of free nodes and cores currently available • showq : shows the state of the queue (like qstat -a), except shows running jobs in order of finishing • diagnose -p : shows the factors used in computing job priority • checkjob jobid : can show why your job might not be starting • showstart -e all : gives you a coarse estimate of job start time; use the smallest value returned

  48. Flux Resources • http://cac.engin.umich.edu/started • Cluster news, RSS feed and outages listed here • http://cac.engin.umich.edu/ • Getting an account, training, basic tutorials • http://www.engin.umich.edu/caen/hpc • Getting an allocation, Flux On-Demand, Flux info • For assistance: flux-support@umich.edu • Read by a team of people • Cannot help with programming questions, but can help with operational Flux and basic usage questions

  49. Summary • The Flux cluster is just a collection of similar Linux machines connected together to run your code, much faster than your laptop can • Unlike your laptop, there is limited GUI access; use of the command line is encouraged • Some important commands are qsub, qstat -u username, and qdel jobid • Develop and test, then submit your jobs in bulk and let the scheduler do the dirty work

  50. Any Questions? • Charles J. Antonelli • LSAIT Research Systems Group • cja@umich.edu • http://www.umich.edu/~cja • 734 926 8421
