Introduction to Scientific Computing on the IBM SP and Regatta
Doug Sondak
sondak@bu.edu
Outline • Friendly Users (access to Regatta) • hardware • batch queues (LSF) • compilers • libraries • MPI • OpenMP • debuggers • profilers and hardware counters
Friendly Users • Regatta • not presently open to general user community • will be open to a small number of “friendly users” to help us make sure everything’s working ok
Friendly Users (cont’d) • Friendly-user rules 1. We expect the friendly-user period to last 4-6 weeks 2. No charge for CPU time! 3. Must have “mature” code • code must currently run (we don’t want to test how well the Regatta runs emacs!) • serial or parallel
Friendly Users (3) • Friendly-user rules (cont’d) 4. We want feedback! • What did you encounter that prevented porting your code from being a “plug and play” operation? • If it was not obvious to you, it was not obvious to some other users! 5. Timings are required for your code • use time command • report wall-clock time • web-based form for reporting results
Friendly Users (4) • Friendly-user application and report form: • first go to SP/Regatta repository: http://scv.bu.edu/SCV/IBMSP/ • click on Friendly Users link at bottom of menu on left-hand side of page • timings required for the Regatta and either the O2k or SP (both would be great!)
Hal (SP) • Power3 processors • 375 MHz • 4 nodes • 16 processors each • shared memory on each node • 8GB memory per node • presently can use up to 16 procs.
Hal (cont’d) • L1 cache • 64 KB • 128-byte line • 128-way set associative • L2 cache • 4 MB • 128-byte line • direct-mapped (“1-way” set assoc.)
Twister (Regatta) • Power4 processors • 1.3 GHz • 2 CPUs per chip (interesting!) • 3 nodes • 32 processors each • shared memory on each node • 32GB memory per node • presently can use up to 32 procs.
Twister (cont’d) • L1 cache • 32 KB per proc. (64 KB per chip) • 128-byte line • 2-way set associative
Twister (3) • L2 cache • 1.41 MB • shared by both procs. on a chip • 128-byte line • 4-to-8 way set associative • unified • data, instructions, page table entries
Twister (4) • L3 cache • 128 MB • off-chip • shared by 8 procs. • 512-byte “blocks” • coherence maintained at 128-bytes • 8-way set associative
Batch Queues
• LSF batch system
• bqueues for list of queues

  QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
  p4-mp32      10   Open:Active    1     1     -     -      0     0    0     0
  p4-mp16       9   Open:Active    2     1     -     -      0     0    0     0
  p4-short      8   Open:Active    2     1     -     -      0     0    0     0
  p4-long       7   Open:Active   16     5     -     -      0     0    0     0
  sp-mp16       6   Open:Active    2     1     -     1      2     1    1     0
  sp-mp8        5   Open:Active    2     1     -     -      1     0    1     0
  sp-long       4   Open:Active    8     2     -     -     20    12    8     0
  sp-short      3   Open:Active    2     1     -     -      0     0    0     0
  graveyard     2   Open:Inact     -     -     -     -      0     0    0     0
  donotuse      1   Open:Active    -     -     -     -      0     0    0     0
Batch Queues (cont’d) • p4 queues are on the Regatta • sp queues are on the SP (surprise!) • “long” and “short” queues are serial • for details see http://scv.bu.edu/SCV/scf-techsumm.html • will not include Regatta info. until it’s open to all users • bsub to submit job • bjobs to monitor job
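For example, a serial job might be submitted and monitored as follows (mycode and mycode.out are placeholder names; pick an appropriate queue from the bqueues listing above):
  bsub -q sp-short -o mycode.out ./mycode
  bjobs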
Compiler Names • AIX uses different compiler names to perform some tasks which are handled by compiler flags on many other systems • parallel compiler names differ for SMP, message-passing, and combined parallelization methods
Compilers (cont’d)

              Serial   MPI       OpenMP    Mixed
  Fortran 77  xlf      mpxlf     xlf_r     mpxlf_r
  Fortran 90  xlf90    mpxlf90   xlf90_r   mpxlf90_r
  Fortran 95  xlf95    mpxlf95   xlf95_r   mpxlf95_r
  C           cc       mpcc      cc_r      mpcc_r
              xlc      mpxlc     xlc_r     mpxlc_r
  C++         xlC      mpCC      xlC_r     mpCC_r

• gcc and g++ are also available
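For example, serial, MPI, and OpenMP builds of a Fortran 77 code would look like this (mycode.f is a placeholder name):
  xlf -o mycode mycode.f
  mpxlf -o mycode mycode.f
  xlf_r -qsmp=omp -o mycode mycode.f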
Compilers (3) • xlc default flags -qalias=ansi • optimizer assumes that pointers can only point to an object of the same type (potentially better optimization) -qlanglvl=ansi • ANSI C -qro • string literals (e.g., char *p = "mystring";) placed in "read-only" memory (text segment); cannot be modified
Compilers (4) • xlc default flags (cont’d) -qroconst • constants placed in read-only memory
Compilers (5) • cc default flags • -qalias=extended • optimizer assumes that pointers may point to any object whose address is taken, regardless of type (potentially weaker optimization) • -qlanglvl=extended • extended (not ANSI) C • “compatibility with the RT compiler and classic language levels” • -qnoro • string literals (e.g., char *p = "mystring";) can be modified • may use more memory than -qro
Compilers (6) • cc default flags (cont’d) • -qnoroconst • constants not placed in read-only memory
Default Fortran Suffixes

  Compiler   Default suffix
  xlf        .f
  xlf90      .f
  f90        .f90
  xlf95      .f
  f95        .f
  mpxlf      .f
  mpxlf90    .f90
  mpxlf95    .f

• Same except for suffix
Compiler flags • Specify source file suffix -qsuffix=f=f90 (lets you use xlf90 with the .f90 suffix) • 64-bit compilation -q64 • use if you need more than 2GB of memory
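For example, a 64-bit build of a Fortran 90 source file with a .f90 suffix might look like this (mycode.f90 is a placeholder name):
  xlf90 -q64 -qsuffix=f=f90 -o mycode mycode.f90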
flags cont’d • Presently a foible on twister (Regatta) • if compiling with -q64 and using MPI, must compile with mp…_r compiler, even if you’re not using SMP parallelization
flags (3)
• IBM optimization levels
  -O    basic optimization
  -O2   same as -O
  -O3   more aggressive optimization
  -O4   even more aggressive optimization; optimize for current architecture; IPA
  -O5   aggressive IPA
flags (4)
• If using -O3 or below, can optimize for local hardware (done automatically for -O4 and -O5):
  -qarch=auto    optimize for resident architecture
  -qtune=auto    optimize for resident processor
  -qcache=auto   optimize for resident cache
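For example, an aggressive build tuned to the local hardware might look like this (placeholder file names):
  xlf -O3 -qarch=auto -qtune=auto -qcache=auto -o mycode mycode.f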
flags (5) • If you’re using IPA and you get warnings about partition sizes, try -qipa=partition=large • default data segment limit 256MB • data segment contains static, common, and allocatable variables and arrays • can increase limit to a maximum of 2GB with 32-bit compilation -bmaxdata:0x80000000 • can use more than 2GB data with -q64
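For example, a 32-bit build with the data segment limit raised to its 2GB maximum might look like this (placeholder file names):
  xlf -O3 -bmaxdata:0x80000000 -o mycode mycode.f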
flags (6)
• -O5 does not include function inlining
• function inlining flags:
  -Q                  compiler decides what functions to inline
  -Q+func1:func2      only inline specified functions
  -Q -Q-func1:func2   let compiler decide, but do not inline specified functions
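For example, to inline only two (hypothetical) routines named sub1 and sub2:
  xlf -O3 -Q+sub1:sub2 -o mycode mycode.f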
Scientific Libraries • Contain • Linear Algebra Subprograms • Matrix Operations • Linear Algebraic Equations • Eigensystem Analysis • Fourier Transforms, Convolutions and Correlations, and Related Computations • Sorting and Searching • Interpolation • Numerical Quadrature • Random Number Generation
Scientific Libs. Cont’d • Documentation - go to IBM Repository: http://scv.bu.edu/SCV/IBMSP/ • click on Libraries • ESSLSMP • for use with “SMP processors” (that’s us) • some serial, some parallel • parallel versions use multiple threads • thread safe; serial versions may be called within multithreaded regions (or on single thread) • link with -lesslsmp
Scientific Libs. (3) • PESSLSMP • message-passing (MPI, PVM) -lpesslsmp -lesslsmp -lblacssmp
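As a rough sketch, a serial ESSL routine such as DGEMM can be called from Fortran like this (tessl.f is a placeholder name); link with -lesslsmp as noted above:

      program tessl
      implicit none
      integer n, i, j
      parameter (n=4)
      double precision a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0d0
            b(i,j) = 2.0d0
            c(i,j) = 0.0d0
         end do
      end do
c     ESSL matrix-matrix multiply: c = a*b
      call dgemm('n', 'n', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      write(*,*) 'c(1,1) = ', c(1,1)
      end

  xlf -o tessl tessl.f -lesslsmp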
Fast Math • MASS library • Mathematical Acceleration SubSystem • faster versions of some Fortran intrinsic functions • sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y • work with Fortran or C • differ from standard functions in last bit (at most)
Fast Math (cont’d)
• simply link to the MASS library:
  Fortran:  -lmass
  C:        -lmass -lm
• sample approximate speedups
  exp           2.4
  log           1.6
  sin           2.2
  complex atan  4.7
Fast Math (3) • Vector routines offer even more speedup, but require minor code changes • link to -lmassv • subroutine calls • prefix name with vs for 4-byte reals (single precision) and v for 8-byte reals (double precision)
Fast Math (4)
• example: single-precision exponential
  call vsexp(y,x,n)
  • x is the input vector of length n
  • y is the output vector of length n
• sample speedups
                single  double
  exp             9.7     6.7
  log            12.3    10.4
  sin            10.0     9.8
  complex atan   16.7    16.5
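As a rough sketch, the call above might be used like this (tmassv.f is a placeholder name); link with -lmassv:

      program tmassv
      implicit none
      integer n, i
      parameter (n=1000)
      real x(n), y(n)
      do i = 1, n
         x(i) = 0.001*i
      end do
c     vector single-precision exponential: y(i) = exp(x(i))
      call vsexp(y, x, n)
      write(*,*) 'y(1) = ', y(1)
      end

  xlf -o tmassv tmassv.f -lmassv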
Fast Math (5) • For details see the following file on hal: file:/usr/lpp/mass/MASS.readme
MPI • MPI works differently on IBM than on other systems • first compile code using compiler with mp prefix, e.g., mpcc • this automatically links to MPI libraries; do not use -lmpi
POE • Parallel Operating Environment • controls parallel operation, including running MPI code
Running MPI Code
• Do not use mpirun!
  poe mycode -procs 4
• file re-direction:
  poe mycode < myin > myout -procs 4
  • note: no quotes
• a useful flag: -labelio yes
  • labels output with process number (0, 1, 2, …)
  • also setenv MP_LABELIO yes
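A minimal MPI sketch (hello.f is a placeholder name) to check the compile-and-run procedure described above:

      program hello
      implicit none
      include 'mpif.h'
      integer ierr, rank, nprocs
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      write(*,*) 'process ', rank, ' of ', nprocs
      call MPI_FINALIZE(ierr)
      end

  mpxlf -o hello hello.f
  poe hello -procs 4 -labelio yes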
SMP Compilation
• OpenMP
  • append compiler name with _r
  • use the -qsmp=omp flag
    SGI: f77 -mp mycode.f
    IBM: xlf_r -qsmp=omp mycode.f
• Automatic parallelization
    SGI: f77 -apo mycode.f
    IBM: xlf_r -qsmp mycode.f
SMP Compilation cont’d • Listing files for auto-parallelization SGI: f77 -apo list mycode.f IBM: xlf_r -qsmp -qreport=smplist mycode.f
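A minimal OpenMP sketch (omploop.f is a placeholder name) to try the IBM compile line above:

      program omploop
      implicit none
      integer n, i
      parameter (n=1000000)
      double precision s
      s = 0.0d0
c$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + dble(i)
      end do
c$omp end parallel do
      write(*,*) 'sum = ', s
      end

  xlf_r -qsmp=omp -o omploop omploop.f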
SMP Environment • Per-thread stack limit • default 4MB • can be increased by using environment variable setenv XLSMPOPTS $XLSMPOPTS\:stack=size where size is the new size limit in bytes
Running SMP
• Running is the same as on other systems, e.g.,
  #!/bin/tcsh
  setenv OMP_NUM_THREADS 4
  mycode < myin > myout
  exit
OpenMP functions
• On IBM, must declare OpenMP Fortran functions, e.g.,
  integer OMP_GET_NUM_THREADS
  (not necessary on SGI)
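A sketch of the declaration in context (tthreads.f is a placeholder name); compile with xlf_r -qsmp=omp as above:

      program tthreads
      implicit none
      integer nth
c     on IBM, declare the OpenMP function explicitly
      integer OMP_GET_NUM_THREADS
c$omp parallel
c$omp single
      nth = OMP_GET_NUM_THREADS()
      write(*,*) 'running on ', nth, ' threads'
c$omp end single
c$omp end parallel
      end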