
Distributed & Parallel Computing Cluster


Presentation Transcript


  1. Distributed & Parallel Computing Cluster
     Patrick McGuigan
     mcguigan@cse.uta.edu
     2/12/04

  2. DPCC Background
     • NSF-funded Major Research Instrumentation (MRI) grant
     • Goals
     • Personnel
       • PI
       • Co-PIs
       • Senior Personnel
       • Systems Administrator

  3. DPCC Goals
     • Establish a regional Distributed and Parallel Computing Cluster at UTA (DPCC@UTA)
     • An inter-departmental and inter-institutional facility
     • Facilitate collaborative research that requires large-scale storage (tera- to petabytes), high-speed access (gigabit or more), and massive processing (hundreds of processors)

  4. DPCC Research Areas
     • Data mining / KDD
       • Association rules, graph mining, stream processing, etc.
     • High Energy Physics
       • Simulation, moving towards a regional DØ center
     • Dermatology / skin cancer
       • Image database, lesion detection and monitoring
     • Distributed computing
       • Grid computing, PICO
     • Networking
       • Non-intrusive network performance evaluation
     • Software Engineering
       • Formal specification and verification
     • Multimedia
       • Video streaming, scene analysis
     • Facilitate collaborative efforts that need high-performance computing

  5. DPCC Personnel
     • PI: Dr. Chakravarthy
     • Co-PIs: Drs. Aslandogan, Das, Holder, Yu
     • Senior Personnel: Paul Bergstresser, Kaushik De, Farhad Kamangar, David Kung, Mohan Kumar, David Levine, Jung-Hwan Oh, Gergely Zaruba
     • Systems Administrator: Patrick McGuigan

  6. DPCC Components
     • Establish a distributed-memory cluster (150+ processors)
     • Establish a symmetric (shared-memory) multiprocessor system
     • Establish large, shareable, high-speed storage (100's of terabytes)

  7. DPCC Cluster as of 2/1/2004
     • Located in 101 GACB
     • Inauguration 2/23 as part of E-Week
     • 5 racks of equipment + UPS

  8.–11. Photos (scaled for presentation)

  12. DPCC Resources
     • 97 machines
       • 81 worker nodes
       • 2 interactive nodes
       • 10 IDE-based RAID servers
       • 4 nodes supporting the Fibre Channel SAN
     • 50+ TB storage
       • 4.5 TB in each IDE RAID
       • 5.2 TB in the FC SAN

  13. DPCC Resources (continued)
     • 1 Gb/s network interconnections
       • Core switch
       • Satellite switches
     • 1 Gb/s SAN network
     • UPS

  14. DPCC Layout

  15. DPCC Resource Details
     • Worker nodes
       • Dual Xeon processors
         • 32 machines @ 2.4 GHz
         • 49 machines @ 2.6 GHz
       • 2 GB RAM
       • IDE storage
         • 32 machines @ 60 GB
         • 49 machines @ 80 GB
       • Red Hat 7.3 Linux (2.4.20 kernel)

  16. DPCC Resource Details (cont.)
     • RAID server
       • Dual Xeon processors (2.4 GHz)
       • 2 GB RAM
       • 4 RAID controllers
         • 2-port controller (qty 1): mirrored OS disks
         • 8-port controller (qty 3): RAID 5 with hot spare
       • 24 × 250 GB disks
       • 2 × 40 GB disks
       • NFS used to serve the storage to the worker nodes

  17. DPCC Resource Details (cont.)
     • FC SAN
       • RAID 5 array
         • 42 × 142 GB FC disks
       • FC switch
     • 3 GFS nodes
       • Dual Xeon (2.4 GHz)
       • 2 GB RAM
       • Global File System (GFS)
       • Served to the cluster via NFS
     • 1 GFS lock server

  18. Using DPCC
     • Two nodes available for interactive use
       • master.dpcc.uta.edu
       • grid.dpcc.uta.edu
     • More nodes are likely to be added to support other services (web, DB access)
     • Access is through SSH (a version 2 client is required)
       • Freeware Windows clients are available (ssh.com)
     • File transfers through SCP/SFTP (see the example session below)
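
     A minimal example session (a sketch; the username jdoe is hypothetical, the hosts are the interactive nodes listed above):

     # Log in to an interactive node with an SSH version 2 client
     $ ssh jdoe@master.dpcc.uta.edu

     # Copy a local file to your home directory on the cluster
     $ scp mydata.tar.gz jdoe@master.dpcc.uta.edu:~/

     # Or start an interactive SFTP session
     $ sftp jdoe@grid.dpcc.uta.edu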

  19. Using DPCC (continued)
     • User quotas are not yet implemented on home directories; be sensible in your usage
     • Large data sets will be stored on the RAIDs (requires coordination with the sys admin)
     • All storage is visible to all nodes

  20. Getting Accounts
     • Have your supervisor request an account
     • The account will be created
     • Bring your ID to 101 GACB to receive a password
       • Keep the password safe
     • Log in to either interactive machine
       • master.dpcc.uta.edu
       • grid.dpcc.uta.edu
     • Use the yppasswd command to change your password
     • If you forget your password
       • See me in my office (with ID) and I will reset it
       • Or call or e-mail me and I will reset it to the original password

  21. User Environment
     • Default shell is bash
       • Change it with ypchsh
     • Customize the user environment using startup files
       • .bash_profile (login sessions)
       • .bashrc (non-login sessions)
     • Customize with statements like the following (see the sketch below)
       • export <variable>=<value>
       • source <shell file>
     • Much more information in the bash man page
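
     A short sketch of typical startup-file additions (the values are only illustrative; the MPICH path is the one installed on the cluster):

     # ~/.bash_profile (read by login shells)
     export PATH=$PATH:/usr/local/mpich-1.2.5/bin
     export EDITOR=vim
     # pick up the non-login settings as well
     source ~/.bashrc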

  22. Program Development Tools
     • GCC 2.96
       • C
       • C++
       • Java (gcj)
       • Objective-C
       • CHILL
     • Java
       • JDK: Sun J2SDK 1.4.2

  23. Development Tools (cont.)
     • Python
       • python = version 1.5.2
       • python2 = version 2.2.2
     • Perl
       • Version 5.6.1
     • Flex, Bison, gdb, ...
     • If your favorite tool is not available, we'll consider adding it!

  24. Batch Queue System
     • OpenPBS
       • Server runs on master
       • pbs_mom runs on the worker nodes
       • Scheduler runs on master
     • Jobs can be submitted from any interactive node
     • User commands
       • qsub: submit a job for execution
       • qstat: determine the status of a job, queue, or server
       • qdel: delete a job from the queue
       • qalter: modify the attributes of a job
     • Single queue (workq)

  25. PBS qsub
     • qsub is used to submit jobs to PBS
     • A job is represented by a shell script
       • The script can alter the environment and then proceed with execution
       • The script may contain embedded PBS directives
       • The script is responsible for starting parallel jobs (not PBS)

  26. Hello World
     [mcguigan@master pbs_examples]$ cat helloworld
     echo Hello World from $HOSTNAME
     [mcguigan@master pbs_examples]$ qsub helloworld
     15795.master.cluster
     [mcguigan@master pbs_examples]$ ls
     helloworld  helloworld.e15795  helloworld.o15795
     [mcguigan@master pbs_examples]$ more helloworld.o15795
     Hello World from node1.cluster

  27. Hello World (continued)
     • The job ID is returned by qsub
     • Default attributes allow the job to run with
       • 1 node
       • 1 CPU
       • 36:00:00 of CPU time
     • The standard output and standard error streams are returned as files

  28. Hello World (continued)
     • Environment of the job
       • Defaults to your login shell (override with #! or the -S switch)
       • Login environment variable list, with PBS additions:
         • PBS_O_HOST
         • PBS_O_QUEUE
         • PBS_O_WORKDIR
         • PBS_ENVIRONMENT
         • PBS_JOBID
         • PBS_JOBNAME
         • PBS_NODEFILE
         • PBS_QUEUE
       • Additional environment variables may be transferred using the "-v" switch (see the sketch below)
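
     A short sketch of a job script that reads these variables; the script name showenv.sh and the variable MYPARAM are hypothetical, passed in with -v at submission time:

     # showenv.sh
     #PBS -N showenv
     echo "Submitted from host:  $PBS_O_HOST"
     echo "Submission directory: $PBS_O_WORKDIR"
     echo "Job ID / job name:    $PBS_JOBID / $PBS_JOBNAME"
     echo "Passed-in MYPARAM:    $MYPARAM"

     # Submit, passing MYPARAM through to the job environment
     $ qsub -v MYPARAM=42 showenv.sh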

  29. PBS Environment Variables
     • PBS_ENVIRONMENT
     • PBS_JOBCOOKIE
     • PBS_JOBID
     • PBS_JOBNAME
     • PBS_MOMPORT
     • PBS_NODENUM
     • PBS_O_HOME
     • PBS_O_HOST
     • PBS_O_LANG
     • PBS_O_LOGNAME
     • PBS_O_MAIL
     • PBS_O_PATH
     • PBS_O_QUEUE
     • PBS_O_SHELL
     • PBS_O_WORKDIR
     • PBS_QUEUE=workq
     • PBS_TASKNUM=1

  30. qsub Options
     • Output streams
       • -e (error output path)
       • -o (standard output path)
       • -j (join error and output into either the output or the error stream)
     • Mail options
       • -m [aben]: when to mail (abort, begin, end, none)
       • -M: who to mail
     • Name of the job
       • -N (15 printable characters max; the first must be alphabetic)
     • Which queue to submit the job to
       • -q [name] (unimportant for now)
     • Environment variables
       • -v: pass specific variables
       • -V: pass the entire environment of qsub to the job
     • Additional attributes
       • -W: specify additional attributes (e.g. dependencies)
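
     A sketch combining several of the options above (the script name myjob.sh and the e-mail address are hypothetical):

     # Name the job, merge stderr into stdout, write output to myjob.log,
     # and send mail only if the job aborts
     $ qsub -N myjob -j oe -o myjob.log -m a -M jdoe@cse.uta.edu -q workq myjob.sh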

  31. qsub Options (continued)
     • The -l switch is used to specify needed resources
       • Number of nodes: nodes=x
       • Number of processors: ncpus=x
       • CPU time: cput=hh:mm:ss
       • Wall time: walltime=hh:mm:ss
     • See the man page for pbs_resources

  32. Hello World
     • On the command line:
       qsub -l nodes=1 -l ncpus=1 -l cput=36:00:00 -N helloworld -m a -q workq helloworld
     • Options can also be included in the script:
       #PBS -l nodes=1
       #PBS -l ncpus=1
       #PBS -m a
       #PBS -N helloworld2
       #PBS -l cput=36:00:00
       echo Hello World from $HOSTNAME

  33. qstat
     • Used to determine the status of jobs, queues, and the server
       • $ qstat
       • $ qstat <job id>
     • Switches
       • -u <user>: list the jobs of a given user
       • -f: full (detailed) output
       • -n: show the nodes given to a job
       • -q: status of the queue
       • -i: show idle jobs
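
     For example, using the job ID returned in the earlier Hello World submission:

     # All jobs belonging to one user
     $ qstat -u mcguigan
     # Full details of a single job
     $ qstat -f 15795.master.cluster
     # Which worker nodes were assigned to the job
     $ qstat -n 15795.master.cluster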

  34. qdel & qalter
     • qdel is used to remove a job from a queue
       • qdel <job ID>
     • qalter is used to alter the attributes of a currently queued job
       • qalter <job id> <attributes> (attributes as in qsub)
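
     A brief sketch using the job ID from the earlier example:

     # Remove the job from the queue
     $ qdel 15795.master.cluster
     # Or raise its CPU-time limit while it is still queued
     $ qalter -l cput=48:00:00 15795.master.cluster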

  35. Processing on a Worker Node
     • All RAID storage is visible to all nodes
       • /dataxy, where x is the RAID ID and y is the volume (1-3)
       • /gfsx, where x is the GFS volume (1-3)
     • Local storage on each worker node
       • /scratch
     • Data-intensive applications should copy input data (when possible) to /scratch for manipulation and copy the results back to RAID storage (a sketch of this pattern follows)
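
     A sketch of that staging pattern as a PBS job script; the /data11 path, the file names, and the crunch program are hypothetical:

     #!/bin/sh
     #PBS -N stage-example
     #PBS -l nodes=1
     # Stage the input from shared RAID storage onto fast local disk
     mkdir -p /scratch/$PBS_JOBID
     cp /data11/myproject/input.dat /scratch/$PBS_JOBID/

     # Run the (hypothetical) analysis against the local copy
     cd /scratch/$PBS_JOBID
     ~/bin/crunch input.dat > output.dat

     # Copy the results back to RAID storage and clean up
     cp output.dat /data11/myproject/
     rm -rf /scratch/$PBS_JOBID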

  36. Parallel Processing
     • MPI is installed on the interactive and worker nodes
       • MPICH 1.2.5
       • Path: /usr/local/mpich-1.2.5
     • Asking for multiple processors
       • -l nodes=x
       • -l ncpus=2x

  37. Parallel Processing (continued)
     • A PBS node file is created when the job executes
       • Available to the job via $PBS_NODEFILE
     • Used to start processes on remote nodes
       • mpirun
       • rsh

  38. Using the node file (example job)
     #!/bin/sh
     #PBS -m n
     #PBS -l nodes=3:ppn=2
     #PBS -l walltime=00:30:00
     #PBS -j oe
     #PBS -o helloworld.out
     #PBS -N helloworld_mpi

     NN=`cat $PBS_NODEFILE | wc -l`
     echo "Processors received = "$NN
     echo "script running on host `hostname`"
     cd $PBS_O_WORKDIR
     echo
     echo "PBS NODE FILE"
     cat $PBS_NODEFILE
     echo
     /usr/local/mpich-1.2.5/bin/mpirun -machinefile $PBS_NODEFILE -np $NN ~/mpi-example/helloworld

  39. MPI
     • Shared memory vs. message passing
     • MPI
       • A C-based library that allows programs to communicate
       • Each cooperating process runs the same program image
       • Different processes can do different computations based on the notion of rank (see the sketch below)
       • MPI primitives allow the construction of more sophisticated synchronization mechanisms (barrier, mutex)
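
     A minimal sketch of rank-based message passing, in the same style as the helloworld.c on the next slide: every process with rank greater than 0 sends its rank number to process 0, which prints what it receives.

     #include <stdio.h>
     #include "mpi.h"

     int main(int argc, char **argv)
     {
         int rank, size;

         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         if (rank == 0) {
             /* Rank 0 collects one integer from every other rank */
             int i, value;
             MPI_Status status;
             for (i = 1; i < size; i++) {
                 MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                          MPI_COMM_WORLD, &status);
                 printf("Rank 0 received %d from rank %d\n",
                        value, status.MPI_SOURCE);
             }
         } else {
             /* Every other rank sends its own rank number to rank 0 */
             MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
         }

         MPI_Finalize();
         return 0;
     }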

  40. helloworld.c
     #include <stdio.h>
     #include <unistd.h>
     #include <string.h>
     #include "mpi.h"

     int main(int argc, char **argv)
     {
         int rank, size;
         char host[256];
         int val;

         val = gethostname(host, 255);
         if (val != 0) {
             strcpy(host, "UNKNOWN");
         }

         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         printf("Hello world from node %s: process %d of %d\n", host, rank, size);
         MPI_Finalize();
         return 0;
     }

  41. Using MPI Programs
     • Compiling
       $ /usr/local/mpich-1.2.5/bin/mpicc -o helloworld helloworld.c
     • Executing
       $ /usr/local/mpich-1.2.5/bin/mpirun <options> helloworld
     • Common options
       • -np: number of processes to create
       • -machinefile: list of nodes to run on

  42. Resources for MPI
     • http://www-hep.uta.edu/~mcguigan/dpcc/mpi
     • MPI documentation
       • http://www-unix.mcs.anl.gov/mpi/indexold.html
     • Links to various tutorials
     • Parallel programming course
