Cluster Computing Basics

R D Bjornson, N J Carriero
CS Dept, Keck HPC Resource, YCGA
http://maguro.cs.yale.edu/mediawiki/index.php/Center_For_HPC_In_Biology_And_Biomedicine

Don’t Panic!
• man: Describes how to use a command. Example: man man
• help: Information about frequently used “shell” commands.
• info: New and improved (?) man; may provide more details.
• locate: Find the location of a file (in common system areas).
• which: Use to determine which version of a program will be used by default.
Note: The user interface is hunt-and-peck, not point-and-click!

Accessing Louise
• Run a program on your computer (“local”) to log in to louise (“remote”) over a network connection.
• The local computer must be on the Yale network:
  • A computer at Yale
  • Via VPN software
  • Via a login to a computer at Yale that allows external access, then log in from there to louise.
• The login program must support the secure shell protocol:
  • Linux: ssh
  • Mac OS X: Use Terminal or X11/xterm to create a command line session (“shell”), then ssh.
  • Windows: PuTTY + ssh, or cygwin (and then proceed as if you were using Linux).
• ssh netid@louise.hpc.yale.edu
• On first login, if prompted for a passphrase for an ssh key, just press “enter”. In general, unless you know what you are doing, leave ssh-related files alone (and do not change the permissions on your home directory!).
• Running GUIs involves understanding and using X11. It is baked in with Linux, distributed but not installed by default with Mac OS X, and a 3rd party add-on for Windows (e.g., cygwin).

Accessing Louise
• Use scp or sftp (part of the ssh program suite) to copy files from local to remote and back.
• rsync can be useful for keeping a local and remote file hierarchy in sync.
• wget will allow you to retrieve a file via a URL from the command line. Useful for fetching reference files from repository sites (ENSEMBL, NCBI, UCSC).

Cluster Organization
Login nodes
• Virtualized
• Light use only
Compute nodes
• Multicore, ~4GB DRAM per core. Parallel or concurrent execution is relatively easy using the cores of one node; using the cores on multiple nodes takes more work. In either case, do not assume this will happen automatically.
• Shared vs dedicated
File systems
• Cluster wide (default), accessible over network
• Local to node (direct connection)

Cluster Organization (Louise)
[Diagram: 300+ users log in via ssh (“Don’t loiter in the lobby!”) and use qsub to request compute nodes. 90 compute nodes (e.g., Compute-22-2) are available for general use, with 4 to 64 processor cores per compute node.]

Resource Management
Need to explicitly allocate resources for computing:
• Interactive: For development; for using interactive programs such as MATLAB®, python or R; and/or for graphics-rich tools (X11 forwarding).
• Batch
Commands
• qsub registers a request for resources (for X11 forwarding, also use ssh -Y for the initial login):
    qsub -X -I -l nodes=1:ppn=8 -q default
    qsub FileWithOptionsAndCommands
• qstat provides information about requests:
    qstat -1 -n -u njc2

Tools
Editor (emacs vs vi and vim): emacs makes it possible to work directly with files 10s to 100s of MB in size, explore binary files, capture shell transcripts and review them, interactively navigate the file hierarchy, review file differences, etc.
Binary vs ASCII files:
• file: Basic command to determine the kind of file.
• od -c: Displays content byte by byte, permitting a detailed examination. Especially useful when dealing with DOS/Unix/Mac OS X end of line conflicts or looking for file corruption. Often used in a “pipe” with head.
Btw, do not use a “wysiwyg” editor such as Word or Wordpad for technical work, especially data preparation or code development.
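To make the end of line issue concrete, here is a minimal Python sketch of the kind of check one might otherwise do by eye with od -c. The function name count_line_endings is made up for this illustration; it simply counts the three common end of line conventions in a chunk of raw bytes.

```python
def count_line_endings(data: bytes) -> dict:
    """Count DOS (\\r\\n), Unix (\\n) and old Mac (\\r) line endings in raw bytes."""
    dos = data.count(b"\r\n")
    unix = data.count(b"\n") - dos   # bare \n not preceded by \r
    mac = data.count(b"\r") - dos    # bare \r not followed by \n
    return {"dos": dos, "unix": unix, "mac": mac}


# A small in-memory sample with mixed conventions.
sample = b"a\r\nb\nc\rd\r\n"
print(count_line_endings(sample))
```

A file that reports more than one nonzero count has probably passed through editors or tools on different platforms.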

Tools
ls, cd, mkdir: List directory contents, change directories, make a new directory.
File hierarchy = tree of directories.
• A “path” is a series of nested directories written this way: /dir0/dir1/dir2/file.
• When you log in, you start work in your “home” directory (aka ~).
• When bash looks up a command for you, it searches in all directories listed in the “PATH environment variable”:
    export PATH=/my/new/program/Directory:$PATH
• Look in “/usr/local/cluster/software/installation” for programs of interest.

Tools
head, less, tail: See a few lines of an ASCII file. head and tail can be used to extract a small sample, e.g. to see the format of data in the file or to create test input (but this kind of sample is generally not representative). Often used with pipes. Use less to browse files (by line number or percentage).
split: One way to cope with large files (but virtual splitting can be more efficient: split will, at least temporarily, double the amount of file space used).
awk: Swiss army knife. Can do head/tail/split and much more:
    awk 'NR%1000 == 13{print $0}' fullDataSet > sampleDataSet
python: An excellent general purpose text processing and analysis environment (increasingly popular, but perl has a large lead).
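For comparison, the awk sampling command above can be sketched in Python. This is an illustrative equivalent, not a replacement; the function name sample_every and the demo records are assumptions made for the example. Record numbers are 1-based, as with awk's NR.

```python
from typing import Iterable, Iterator


def sample_every(lines: Iterable[str], period: int = 1000, offset: int = 13) -> Iterator[str]:
    """Yield lines whose 1-based record number NR satisfies NR % period == offset."""
    for record_number, line in enumerate(lines, start=1):
        if record_number % period == offset:
            yield line


# Made-up demo data: 3000 records named after their record number.
demo = [f"record{n}" for n in range(1, 3001)]
sampled = list(sample_every(demo))
print(sampled)
```

Because the function consumes any iterable of lines, it can be fed an open file handle and will sample the file without loading it all into memory.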

Tools: bash scripting, redirection and pipes
When you log into a computer you are connected to a program. This program accepts the text you type and does “something” with it. If, for example, you type “ls”, the program first determines that “ls” is not something that it directly understands, so it next looks for another program on the computer called “ls” in one of the directories in PATH. If it finds it, it runs that program on your behalf and then reports the output. If it does not find it, it reports an error to that effect. Programs of this class are generally referred to as “command shells”.
It should be clear that the shell plays a critical role in the use of a cluster computer, and yet most users give the shell little or no thought. This generally comes back to haunt them in the form of subtle bugs that they are ill equipped to diagnose and correct, as well as missed opportunities to streamline workflow.
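The lookup step described here can be sketched in a few lines of Python. This is a simplified model of what a shell does, not bash's actual implementation (bash also handles builtins, aliases, functions and a hash of previously found commands); find_in_path is a name chosen for this example. The standard library function shutil.which performs a similar search.

```python
import os
from typing import Optional


def find_in_path(command: str, path: Optional[str] = None) -> Optional[str]:
    """Return the first executable file named `command` in the PATH directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Prints a full path if `ls` is found, otherwise None.
print(find_in_path("ls"))
```

The search stops at the first match, which is why prepending a directory to PATH changes which version of a program runs by default.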

Tools: bash
Consider a sequence of commands given to the bash shell (the default shell):
    gunzip data.gz
    awk '/chr13/{print $0}' data > chr13Records
    gzip data
    myProgram -i chr13Records -o chr13Filtered
    rm chr13Records
    sort -k 2,2n < chr13Filtered > chr13Sorted
    rm chr13Filtered
Note: stdin, stdout, stderr

Tools: bash
An alternative using bash pipes (“|”):
    gunzip -c data.gz | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted
Three advantages:
• Less file system IO (extremely important in a cluster setting)
• Less clean up (an issue when this sort of processing is done 100s or 1000s of times)
• Better use of multicore machines (gunzip, awk, and myProgram can run concurrently).
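The same "no intermediate files" idea carries over to scripting languages. Below is a rough Python sketch of the pipeline, under several assumptions: my_program is a hypothetical pass-through stand-in for myProgram, the second whitespace-separated field of every record is numeric, and the selected records fit in memory for sorting. Unlike the bash pipe, the stages here run in a single process, so this illustrates the reduced file system IO but not the concurrency.

```python
import gzip
from typing import Iterable, Iterator


def matching_records(lines: Iterable[str], pattern: str = "chr13") -> Iterator[str]:
    """Pass through only the lines containing `pattern` (like awk '/chr13/')."""
    return (line for line in lines if pattern in line)


def my_program(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for myProgram: passes records through unchanged."""
    for line in lines:
        yield line


def second_field(line: str) -> float:
    """Sort key: the second whitespace-separated field, assumed to be numeric."""
    return float(line.split()[1])


def process(gz_path: str, out_path: str) -> None:
    """Decompress, filter, transform and sort without intermediate files."""
    with gzip.open(gz_path, "rt") as source, open(out_path, "w") as sink:
        records = my_program(matching_records(source))
        sink.writelines(sorted(records, key=second_field))
```

A call such as process("data.gz", "chr13Sorted") would roughly correspond to the one-line pipe above.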

Tools: bash
Now suppose we have 100 data sets: dataSet00.gz ... dataSet99.gz. A few notes about file naming:
• When working with a large number of files, it is easy to lose track of files or accidentally overwrite some, so choose a clear and informative scheme and stick to it. If >> 1000, use additional levels of directories.
• 0- vs 1-based indexing is a subtle point that you need to get comfortable with (you don’t have to use it yourself, but you will run into it sooner or later).
• Padding with leading 0’s compensates for dumb file sorting.
How can we easily process all of these sets?
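The point about leading zeros is easy to demonstrate. The short Python sketch below builds two sets of made-up file names, one unpadded and one padded to two digits, and sorts them as plain strings, which is what ls and most tools do by default.

```python
# Twelve file names without and with zero padding (0-based numbering).
unpadded = [f"dataSet{i}.gz" for i in range(12)]
padded = [f"dataSet{i:02d}.gz" for i in range(12)]

# Plain string sorting puts dataSet10 before dataSet2 ...
print(sorted(unpadded))
# ... while padded names sort in numeric order.
print(sorted(padded) == padded)
```

With padding, lexical order and numeric order coincide, so loops over a wildcard visit the files in the order you expect.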

Tools: bash
    for f in dataSet*.gz
    do
      gunzip -c "$f" | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > "chr13Sorted_${f/.gz/}"
    done
Note: You can use an editor to create a file that contains a complex command or a command sequence and then have bash execute that file as if you typed it in directly:
    source CommandFile
You can also turn that file (“script”) into something that you can run like any other program.

Parallelism
That may take a while. How can we use multiple processors to do it faster?
Simple queue:
• Produce a list of tasks to be executed (essentially the same loop as before, modified to display the commands to be executed rather than actually execute them). Note the \$0: inside double quotes the shell would otherwise expand $0 before awk sees it.
    for f in dataSet*.gz
    do
      echo "cd $(pwd) && (gunzip -c $f | awk '/chr13/{print \$0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f/.gz/}) > ${f}.out 2> ${f}.err"
    done > Tasks
• Create a batch script that directs the resource manager to allocate compute nodes and then uses the allocated nodes to work through the list of tasks (the output can be piped, “|”, to qsub).
    sqPBS.py default 4.6 njc2 dataExtraction Tasks
• Check output files and status information (Simple Queue collects a great deal).
• /usr/local/cluster/software/installation/SimpleQueue/sqPBS.py

[Figure: a Simple Queue task list with one task per line (cd ... && blast ds 00, cd ... && blast ds 01, ..., cd ... && blast ds 06), with individual tasks shown being handed out for execution.]

Aside: Random Number Generation
If you run a code that depends on random numbers, you must take care to ensure it does what you expect when you run it several times, perhaps concurrently on different nodes. On the one hand, in general you will want each instance to see different random numbers. This may not happen by default. On the other, you would like to be able to reproduce your results. Different but not too different!
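One common way to get "different but not too different" streams is to derive each task's seed from a fixed base seed plus the task's index. The Python sketch below is one possible approach, not a recommendation for any particular code; task_rng and the hashing scheme are assumptions made for this illustration, and statistically demanding work may call for a generator designed for parallel streams.

```python
import hashlib
import random


def task_rng(base_seed: int, task_index: int) -> random.Random:
    """Return a generator whose seed depends on both the base seed and the task."""
    # Hashing spreads nearby (base_seed, task_index) pairs over unrelated seeds.
    digest = hashlib.sha256(f"{base_seed}:{task_index}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# Each task gets its own stream; rerunning with the same base seed reproduces it.
for task_index in range(3):
    rng = task_rng(base_seed=2024, task_index=task_index)
    print(task_index, [round(rng.random(), 3) for _ in range(3)])
```

Recording the base seed alongside the results is what makes a run reproducible later.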

Parallelism: Pre-packaged
• Thread based: Fairly common ("easy"-ish). Thread-based parallelism can only make use of the cores on one node.
• Message passing based (MPI, PVM, …): Less common in bioinformatics. A message passing program can make use of the aggregate resources of many nodes.
• “make” based: Illumina and one or two others. Limited to the cores of one node.

Parallelism: Pre-packaged
If you are using a 3rd party program, it is important to know which kind of parallelism it uses and to invoke the program appropriately.
If threaded:
• Run on a dedicated node!
• Check the docs for a number-of-threads parameter.
If message passing (MP): You typically need to set up a special execution environment in order to run the program using the resources allocated. Unfortunately, this tends to be MPI-implementation specific and so has to be addressed on a case by case basis (ask RDB or NJC).
If “make”, invoke like this:
    make -j N MakeTarget > make.out 2> make.err
where N is the number of cores to use.

Do It Yourself: Owner computes
It is possible to write your own parallel programs. One strategy that RDB and NJC often use:
• Imagine that you run multiple copies of a sequential version.
• At some point, the copies will enter a period of execution in which the work can be split up into independent tasks. Add a check to decide which copy “owns” (and should execute) a given task; all other copies will skip this task.
• Each copy records the tasks it did. When it exits the period of execution that was split up, it exchanges with all other copies the results of the tasks it did. At this point all the copies know all the results and will continue to execute as if they had each done all of the work themselves.
The devil is in the details, especially the mechanisms used to settle ownership and to exchange task results. Ask us for help; just keep in mind that this kind of parallelism is an option and need not be terribly complex.
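The owner computes pattern can be sketched in a few lines of Python. This is a toy simulation under stated assumptions: the "copies" run one after another in a single process, ownership is settled by a simple round-robin rule (task index modulo number of copies), and the exchange is a dictionary merge. In a real program the copies would be separate processes and the exchange would use message passing or shared files. All names here are made up for the illustration.

```python
def owner_of(task_index: int, n_copies: int) -> int:
    """Round-robin ownership rule: every copy can evaluate this independently."""
    return task_index % n_copies


def run_copy(copy_id: int, n_copies: int, tasks, work) -> dict:
    """Execute only the tasks this copy owns; record results by task index."""
    return {
        index: work(task)
        for index, task in enumerate(tasks)
        if owner_of(index, n_copies) == copy_id
    }


def exchange(partial_results) -> dict:
    """Stand-in for the exchange step: merge what every copy computed."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


def square(x: int) -> int:
    """Hypothetical unit of work."""
    return x * x


tasks = list(range(10))
partials = [run_copy(copy_id, 3, tasks, square) for copy_id in range(3)]
all_results = exchange(partials)
print(all_results)
```

After the exchange, every copy would hold the same complete set of results that a purely sequential run would have produced.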

Software as an Experimental System
Start with “small” input sets and/or run parameters and systematically alter these to study how CPU time, memory use and IO activity vary from run to run.
Non-invasive tools:
• top: May need a separate login to the allocated node (use intra-cluster ssh).
• time command:
    /usr/bin/time -v prog a0 a1 a2 > outFile 2> errFile
  Output from time will be appended to “errFile”. Note: use the full path. This is an instance where it is important to understand how the shell works.
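When the code under study is a Python function, the same experiment (vary the input size, record time and memory) can be sketched with the standard library. The workload build_squares and the helper measure are hypothetical names for this example. Note that tracemalloc itself adds overhead, so this is less "non-invasive" than top or /usr/bin/time.

```python
import time
import tracemalloc


def measure(func, n):
    """Return (CPU seconds, peak traced bytes) for one call of func(n)."""
    tracemalloc.start()
    start = time.process_time()
    func(n)
    cpu_seconds = time.process_time() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return cpu_seconds, peak_bytes


def build_squares(n):
    """Hypothetical workload whose time and memory should grow with n."""
    return [i * i for i in range(n)]


# Double the input size each step and watch how the measurements respond.
measurements = {n: measure(build_squares, n) for n in (10_000, 20_000, 40_000)}
for n, (cpu_seconds, peak_bytes) in measurements.items():
    print(f"N={n:>6}  cpu={cpu_seconds:.4f}s  peak={peak_bytes} bytes")
```

Timings at these sizes are noisy; repeating each measurement and using larger inputs gives a clearer picture of the trend.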

Software as an Experimental System
If you are in a position to modify code, you can get much more accurate and detailed information. Ditto with profiling:
• Compile time option plus post processing for C, C++, Fortran, …
• Available as a runtime facility in various scripting systems (python, perl, ruby).
Activating profiling often significantly increases run time, placing a premium on well designed small test cases.
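As an example of the runtime profiling facility in a scripting system, here is a minimal sketch using Python's standard cProfile and pstats modules. The function slow_sum is a made-up workload and profile_call is a helper written for this illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Made-up workload: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

A whole script can also be profiled without modification by running it as python -m cProfile script.py, where script.py stands for your own program.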

Scaling Considerations
Consider the time (in arbitrary “operation” units) to process N records, if doing:
• A record by record transform => Time(N)
• An all to all comparison => Time(N^2)
• An exploration of subsets => Time(2^N)
• An exploration of orderings => Time(N!)
One naturally tends to focus on run time, but memory and IO (amount as well as rate) matter too.

Scaling Considerations
What N corresponds to about 1 CPU second?
    Time(N)   => 1,000,000,000
    Time(N^2) => 30,000
    Time(2^N) => 30
    Time(N!)  => 13
Which model applies clearly matters!

Scaling Considerations
It matters when determining how big a problem is feasible. Suppose we double the input size:
    Time(2*1,000,000,000) => ~2 s
    Time((2*30,000)^2)    => ~4 s
    Time(2^(2*30))        => 1,000,000,000 s (> 30 years)
    Time((2*13)!)         => 10^16 s (roughly a billion years)
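These figures can be checked with a few lines of Python. The sketch assumes, as the slides do, that each baseline N takes about 1 second, so the ratio of the cost at 2N to the cost at N is the new run time in seconds. The dictionary names are chosen for this example. Under this assumption the factorial case comes out at a few times 10^16 s, the same order of magnitude as the rounded figure above.

```python
import math

# Cost models, in arbitrary "operation" units.
models = {
    "N": lambda n: n,
    "N^2": lambda n: n ** 2,
    "2^N": lambda n: 2 ** n,
    "N!": math.factorial,
}

# Problem sizes taken to cost roughly 1 CPU second under each model.
baseline = {"N": 1_000_000_000, "N^2": 30_000, "2^N": 30, "N!": 13}


def slowdown_when_doubled(model: str, n: int) -> float:
    """Ratio of cost at 2n to cost at n; in seconds if cost(n) is about 1 s."""
    cost = models[model]
    return cost(2 * n) / cost(n)


SECONDS_PER_YEAR = 365.25 * 24 * 3600
for model, n in baseline.items():
    seconds = slowdown_when_doubled(model, n)
    print(f"{model:>4}: {seconds:.3g} s  ({seconds / SECONDS_PER_YEAR:.3g} years)")
```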

Scaling Considerations
It matters when verifying code behavior. If you have a code that you believe follows a Time(N) model, but empirically behaves like Time(N^2), then you may have a bug. For example, code that maintains a list of values can easily degenerate to Time(N^2) if one is careless with the operations that maintain the list.
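A typical instance of this, sketched in Python: collecting the distinct values from a stream. Both functions below (hypothetical names) return the same answer, but the first scans the whole list for every new value, which makes it quadratic when most values are distinct, while the second uses a set for the membership test and stays close to linear.

```python
def unique_with_list(values):
    """Distinct values in first-seen order; `v not in seen` scans the list."""
    seen = []
    for v in values:
        if v not in seen:        # linear scan: cost grows with len(seen)
            seen.append(v)
    return seen


def unique_with_set(values):
    """Same result, but membership is tested against a set (about constant time)."""
    seen = set()
    ordered = []
    for v in values:
        if v not in seen:
            seen.add(v)
            ordered.append(v)
    return ordered


data = [3, 1, 3, 2, 1, 5]
print(unique_with_list(data), unique_with_set(data))
```

Timing both on inputs of size N, 2N and 4N is a quick way to see the Time(N^2) versus Time(N) behavior empirically.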

Other Performance Considerations
• Memory hierarchy: Do as much as you can with one record before moving on to the next.
• Physical vs Virtual Memory: When chunking work, size to fit in physical memory.
• Local vs remote IO: If you cannot eliminate temporary IO via bash pipes or named pipes, at least write to a local file system (but clean up!).
• Bulk IO vs character IO: Mostly done for you, but avoid IO operations that read or write one byte or character at a time.
• Data IO vs metadata operations: Metadata operations are much more expensive than normal data IO. Avoid them. E.g., don’t use a series of specially named empty files to indicate progress; write to a log file instead.
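To illustrate the bulk IO point, here is a small Python sketch that counts lines by reading a file in large blocks rather than one byte at a time. The function name and the 1 MiB default chunk size are assumptions for the example; a suitable size depends on the file system.

```python
def count_newlines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newline bytes by reading the file in chunk_size blocks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)   # one bulk read per iteration
            if not chunk:                     # empty bytes object: end of file
                break
            total += chunk.count(b"\n")
    return total
```

Only one chunk is held in memory at a time, so the same loop works for files much larger than physical memory.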
