
Blue Gene extreme I/O


Presentation Transcript


  1. Blue Gene extreme I/O Giri Chukkapalli San Diego Supercomputer Center July 29, 2005

  2. BlueGene/L design • Almost 4 years of collaboration among LLNL, IBM, SDSC, and others • Applications people and computational scientists were involved in the design and implementation of the machine • Computation/communication characteristics of the machine are designed to mimic those of the physics and the algorithms

  3. BG/L: Broad design concepts • Communication characteristics of scientific computing algorithms • Large amount of nearest-neighbor communication • Small amount of global communication • Second processor on the node to allow truly overlapped communication and computation • Flexible I/O • Designed to exploit fine-grain parallelism • Lean kernel, low-latency interconnects, small memory

  4. Here’s a view of Blue Gene from chips to racks

  5. BG System Overview: Novel, massively parallel system from IBM • Full system scheduled for installation at LLNL in 3Q05 • 65,000+ compute nodes in 64 racks • Each node has two low-power PowerPC processors + memory • Compact footprint with very high processor density • Slow processors & modest memory per processor • Very high peak speed of 360 Tflops • Half built now, with #1 Linpack speed of 137 Tflops • 1024 compute nodes in single rack installed at SDSC • Maximum I/O configuration with 128 I/O nodes for data-intensive computing • Has already achieved more than 3 GB/s read rate • Systems at >10 other sites • Need to select apps carefully • Must scale (at least weakly) to many processors (because they're slow) • Must fit in limited memory

  6. BG System Overview: Processor Chip (1) (= system-on-a-chip) • Two 700-MHz PowerPC 440 processors • Each with two floating-point units • Each with a 32-kB L1 data cache (not coherent between the two processors) • 4 flops/proc-clock peak (= 2.8 Gflops/proc) • Two 8-B loads or stores per proc-clock peak in L1 (= 11.2 GBps/proc) • Shared 2-kB L2 cache (or prefetch buffer) • Shared 4-MB L3 cache • Five network controllers (though not all wired to each node) • 3-D torus (for point-to-point MPI operations: 175 MBps nom x 6 links x 2 ways) • Tree (for most collective MPI operations: 350 MBps nom x 3 links x 2 ways) • Global interrupt (for MPI_Barrier: low latency) • Gigabit Ethernet (for I/O) • JTAG (for machine control) • Memory controller for 512 MB of off-chip, shared memory • No concept of virtual memory or TLB misses
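A quick sanity check on the quoted peaks (an added worked example; it assumes each FPU's fused multiply-add counts as 2 flops and that the full system has 64 racks x 1,024 nodes = 65,536 nodes):

$4~\mathrm{flops/clock} \times 0.7~\mathrm{GHz} = 2.8~\mathrm{Gflops/processor}$

$2 \times 8~\mathrm{B/clock} \times 0.7~\mathrm{GHz} = 11.2~\mathrm{GB/s/processor~(L1)}$

$2.8~\mathrm{Gflops} \times 2~\mathrm{processors/node} \times 65{,}536~\mathrm{nodes} \approx 367~\mathrm{Tflops}$, consistent with the ~360 Tflops peak quoted above.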

  7. BG System Overview: Processor Chip (2)

  8. BG System Overview: Integrated system

  9. BG System Overview: SDSC's single-rack system (1) • 1024 compute nodes & 128 I/O nodes (each with 2p) • Most I/O-rich configuration possible (8:1 compute:I/O node ratio) • Identical hardware in each node type with different networks wired • Compute nodes connected to: torus, tree, global interrupt, & JTAG • I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, & JTAG • Two half racks (also confusingly called midplanes) • Connected via link chips • Front-end nodes (4 B80s, each with 4p) • Service node (p275 with 2p) • Large file system (~400 TB in /idgpfs) serviced by NSD nodes (IA-64s, each with 2p)

  10. BG System Overview: SDSC's single-rack system (2)

  11. BG System Overview: Multiple operating systems & functions • Compute nodes: run Compute Node Kernel (CNK = blrts) • Each runs only one job at a time • Each uses very little memory for CNK • I/O nodes: run Embedded Linux • Run CIOD to manage compute nodes • Perform file I/O • Run GPFS • Front-end nodes: run SuSE SLES9 Linux/PPC64 • Support user logins • Run cross compilers & linker • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs • Service node: runs SuSE SLES8 Linux/PPC64 • Uses DB2 to manage four system databases • Runs control system software, including MMCS • Runs other parts of mpirun & LoadLeveler • (Software comes in drivers: currently running Driver 202)

  12. BG System Overview: Parallel I/O via GPFS

  13. Getting started: Logging on & moving files • Logging on: ssh bglogin.sdsc.edu or ssh -l username bglogin.sdsc.edu • Moving files: scp file username@bglogin.sdsc.edu:~ or scp -r directory username@bglogin.sdsc.edu:~

  14. Getting started: Places to store your files • /users (home directory) • 18-GB file system on front-end node • Will increase soon to ~1 TB • Still won't be able to store much there • Regular backups • /gpfs-wan available for parallel I/O via GPFS • ~400 TB accessed via IA-64 NSD servers • GPFS shared as Global File System across other high-end systems at SDSC • No backups

  15. Using the compilers: Important programming considerations • Front-end nodes have different processors & run a different OS than the compute nodes • Hence codes must be cross compiled • Discovery of system characteristics during compilation (e.g., via configure) may require code changes • Some system calls are not supported by the compute node kernel
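Because configure scripts probe the machine they run on, autoconf-based packages generally have to be told they are cross compiling. A hedged sketch only (the --host triplet and settings below are illustrative assumptions, not taken from these slides; check the SDSC Blue Gene pages for the exact values):
    ./configure --build=powerpc64-linux --host=powerpc-bgl-blrts-gnu CC=blrts_xlc F77=blrts_xlf
Tests that configure normally performs by running small programs may still need to be overridden by hand, since binaries built for the compute nodes cannot execute on the front-end node.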

  16. Using the compilers: Dual FPUs & SIMDization • Good performance depends upon using • both FPUs per processor* • SIMD vectorization • These work only for • data that are 16-B aligned to support quadword (16-B = 128-b) loads & stores • Full bandwidth is obtained only for stride-one accesses *All floating-point computations are done in double precision, even though the rest of the processor is 32 bit
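Below is a minimal C sketch (added here, not from the slides) of the layout SIMDization wants: 16-B-aligned double-precision arrays accessed with stride one. The GCC-style aligned attribute and the __alignx hint are assumptions about what the XL compilers accept; treat them as illustrative.

    /* build (illustrative): mpcc -O3 -qarch=440d -qtune=440 -qhot=simd -o triad triad.c */
    #include <stdio.h>

    #define N 1024

    /* 16-B alignment allows quadword (two-double) loads & stores */
    static double a[N] __attribute__((aligned(16)));
    static double b[N] __attribute__((aligned(16)));
    static double c[N] __attribute__((aligned(16)));

    void triad(double s)
    {
        int i;
        __alignx(16, a);   /* XL alignment assertion (assumed available) */
        __alignx(16, b);
        __alignx(16, c);
        for (i = 0; i < N; i++)      /* stride-one access: candidate for SIMDization */
            a[i] = b[i] + s * c[i];
    }

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }
        triad(2.0);
        printf("a[10] = %g\n", a[10]);   /* expect 12 */
        return 0;
    }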

  17. Using the compilers: Compiler versions, paths, & wrappers
  • Compilers (version numbers the same as on DataStar)
    XL Fortran V9.1: blrts_xlf & blrts_xlf90
    XL C/C++ V7.0: blrts_xlc & blrts_xlC
  • Paths to compilers in default .bashrc
    export PATH=/opt/ibmcmp/xlf/9.1/bin:$PATH
    export PATH=/opt/ibmcmp/vac/7.0/bin:$PATH
    export PATH=/opt/ibmcmp/vacpp/7.0/bin:$PATH
  • Compilers with MPI wrappers (recommended): mpxlf, mpxlf90, mpcc, & mpCC
  • Path to MPI-wrapped compilers in default .bashrc
    export PATH=/usr/local/apps/bin:$PATH

  18. Using the compilers: Options & example
  • Compiler options
    -qarch=440 uses only single FPU per processor (minimum option)
    -qarch=440d allows both FPUs per processor (alternate option)
    -qtune=440 (after -qarch) seems superfluous, but avoids warnings
    -O3 gives minimal optimization with no SIMDization
    -O3 -qhot=simd adds SIMDization (seems to be the same as -O5)
    -O4 adds compile-time interprocedural analysis
    -O5 adds link-time interprocedural analysis (but sometimes has problems)
    -qdebug=diagnostic gives SIMDization info
  • Big problem now! Second FPU is seldom used, i.e., -O5 is seldom better than -O3
  • Current recommendation: -O3 -qarch=440
  • Example using MPI-wrapped compiler: mpxlf90 -O3 -qarch=440 -o hello hello.f

  19. Using libraries: Math libraries • ESSL • ~500 routines implemented • Mostly optimized for -O3 -qarch=440 now • Beta version available; formal release in October 05 • Currently having correctness problems • MASS/MASSV • Initial release of Version 4.2 available • Still being optimized • FFTW • Versions 2.1.5 & 3.0.1 available in both single & double precision • Performance comparisons with ESSL in progress • Example link paths -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lessln -L/usr/local/apps/fftw301s/lib -lfftw3f • Reference: Ramendra Sahoo’s slides (SSW16)
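To make the link line above concrete, here is a minimal double-precision FFTW 3 example (an added sketch; the include/library directory for the double-precision build is an assumption patterned on the fftw301s path above):

    /* build (illustrative): mpcc -O3 -qarch=440 fft_test.c -I/usr/local/apps/fftw301d/include -L/usr/local/apps/fftw301d/lib -lfftw3 -lm -o fft_test */
    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 8;
        int i;
        /* fftw_malloc returns memory aligned for FFTW's SIMD code paths */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

        for (i = 0; i < n; i++) { in[i][0] = i; in[i][1] = 0.0; }  /* real ramp, zero imaginary part */
        fftw_execute(p);
        printf("out[0] = %g + %gi\n", out[0][0], out[0][1]);       /* DC term: sum of inputs = 28 */

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }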

  20. Using libraries: Message passing via MPI • MPI is based on MPICH2 from ANL • All MPICH2 routines (other than MPI-IO) are implemented; some are still being optimized • Most MPI-IO routines are implemented; optimization is underway • Compilation & linking are facilitated by MPI wrappers in /usr/local/apps/bin • References: George Almási's slides (SSW09), Rusty Lusk's slides (SS10), & Hao Yu's slides (SSW11)
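A minimal MPI program to confirm the toolchain end to end (an added sketch; the file name, partition, and install path are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of MPI tasks */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Build with the wrapper and launch with mpirun (described on the following slides):
    mpcc -O3 -qarch=440 -o hello_mpi hello_mpi.c
    mpirun -partition bot64-1 -np 64 -exe /users/username/hello_mpi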

  21. Running jobs: Overview • There are two compute modes • Coprocessor (CO) mode: one compute processor per node • Virtual node (VN) mode: two compute processors per node • Jobs run in partitions or blocks • These are typically powers of two • Blocks must be allocated (or booted) before run & are restricted to a single user at a time • Job submission & block allocation are done by mpirun • Sys admins may also boot blocks with MMCS; this avoids allocation overhead for repeat runs • Interactive & batch jobs are both supported • Batch jobs are managed by LoadLeveler

  22. Running jobs: mpirun (1)
  • Jobs are submitted from front-end nodes via mpirun
  • Here are two examples of interactive runs
    mpirun -partition bot64-1 -np 8 -exe /users/pfeiffer/hello/hello
    mpirun -partition bot256-1 -mode VN -np 512 -exe /users/pfeiffer/NPB2.4/NPB2.4-MPI/binO3fix2/cg.C.512 | tee cg.C.256v.out
  • mpirun occasionally hangs, but control-C usually allows exit

  23. Running jobs: mpirun (2)
  • Key mpirun options are
    -partition   predefined partition name
    -mode        compute mode: CO or VN
    -connect     connectivity: TORUS or MESH
    -np          number of compute processors
    -mapfile     logical mapping of processors
    -cwd         full path of current working directory
    -exe         full path of executable
    -args        arguments of executable (in double quotes)
    -env         environment variables (in double quotes)
  • (These are mostly different from those on TeraGrid)
  • See the mpirun user's manual for syntax
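A hedged example pulling several of these options together (the partition, paths, argument string, and environment variable are illustrative, not taken from the slides):
    mpirun -partition bot128-1 -mode VN -np 256 -cwd /users/username/run1 -exe /users/username/run1/mycode -args "-input in.dat" -env "MY_ENV_VAR=1"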

  24. Running jobs: mpirun (3) • -partition may be specified explicitly (or not) • If specified, partition must be predefined in database (which can be viewed via Web page) • If not specified, partition will be at least a half rack • Recommendation: Always use a predefined partition • -mode may be CO (default) or VN • Generally you must specify partition to run in VN mode • For given number of nodes, VN mode is usually faster than CO mode • Memory per processor in VN mode is half that of CO mode • Recommendation: Use VN mode unless there is not enough memory • -connect may be TORUS or MESH • Option only applies if -partition not specified (with MESH the default) • Performance is generally better with TORUS • Recommendation: Use a predefined partition, which ensures TORUS

  25. Running jobs: mpirun (4) • -np gives number of processors • Must fit in available partition • -mapfile gives logical mapping of processors • Can improve MPI performance in some cases • Can be used to ensure VN mode for 2*np ≤ partition size • Can be used to change ratio of compute nodes to I/O nodes • Recommendation: Contact SDSC if you want to use mapfile, since no documentation is available • -cwd gives current working directory • Needed if there is an input file

  26. Running jobs: LoadLeveler for batch jobs (1) • Batch jobs are managed with LoadLeveler (in a similar manner as on DataStar) • You generate a LoadLeveler run script that includes mpirun • Then you submit the job via llsubmit • You can monitor status with llq -x & llstatus • Additional BG-specific commands are available • Problem: the scheduler is not working now! • Once this is fixed, LoadLeveler will be the recommended way to make production runs • See the LoadLeveler user guide for more information

  27. Running jobs: LoadLeveler for batch jobs (2)
  • Here is an example LoadLeveler run script, say cg.C.512v.run
    #!/usr/bin/ksh
    #@ environment = COPY_ALL;MMCS_SERVER_IP=bgsn-e.sdsc.edu; BACKEND_MPIRUN_PATH=/usr/bin/mpirun_be;
    #@ job_type = parallel
    #@ class = parallel
    #@ input = /dev/null
    #@ output = cg.C.512v.$(jobid).out
    #@ error = cg.C.512v.$(jobid).err
    #@ wall_clock_limit = 00:10:00
    #@ queue
    mpirun -partition bot256-1 -mode VN -np 512 -exe /users/pfeiffer/NPB2.4/NPB2.4-MPI/binO3fix2/cg.C.512
  • Submit as follows: llsubmit cg.C.512v.run

  28. Running jobs: Usage guidelines during weekdays

  29. Running jobs: Predefined partitions
  • Production or test partitions
    rack                   all 1,024 nodes
    top & bot              512 nodes in top & 512 nodes in bottom
    top256-1 & top256-2    256 nodes in each half of top
    bot256-1 & bot256-2    256 nodes in each half of bottom
  • Test partitions
    bot128-1, …, bot128-4  128-node quarters of bottom
    bot64-1, …, bot64-8    64-node eighths of bottom
  • GPFS partitions
    rackGPFS               all 1,024 nodes
    topGPFS & botGPFS      512 nodes in top & 512 nodes in bottom

  30. Monitoring jobs: Block & job status via Web • Web site at bgsn.sdsc.edu (password protected)

  31. Monitoring jobs: Life cycles of blocks & jobs • Successive block states • FREE • ALLOCATED • CONFIGURING • BOOTING (may hang in this state; control-C to exit) • INITIALIZED • Successive job states • QUEUED • STARTING • RUNNING • DYING (if killed by user) • TERMINATED

  32. BG System Overview: References • Special Blue Gene issue of IBM Journal of Research and Development, v. 49 (2/3), March/May 2005 www.research.ibm.com/journal/rd49-23.html • Blue Gene Web site at SDSC www.sdsc.edu/user_services/bluegene • Slides from Blue Gene System Software Workshop www-unix.mcs.anl.gov/~beckman/bluegene/SSW-Utah-2005

  33. BG/L Architecture: Drawbacks • Code has to be memory scalable • Code must be reengineered to overlap computation and communication • The process geometry must be understood • I/O must be parallel • Cross-compilation issues • Codes have to exhibit fine-grain parallelism

  34. BG/L Architecture: Advantages • High intra- and inter-node bytes/flop ratio • Two separate networks handle two distinct types of communication • Global reductions are performed in the network • Extremely repeatable performance • Very low OS overhead • In the I/O-rich configuration, very high I/O bytes per flop • Very low watts per flop and square feet per flop

  35. Advantages (cont'd.) • Truly RISC architecture • Quadword load instructions • Familiar environment • Linux front end • XL compilers • TotalView debugger • HPM/MPI profiling tools

  36. Good matching codes • Spectral element codes, CFD • QCD, QM/MD • Codes involving data streaming • Ab initio protein folding codes • Early production runs • NREL simulation of cellulase linker domain using CPMD • Caltech spectral element simulation of the Sumatra earthquake • DOT, a protein-protein docking code • MPCUGLES, an LES CFD code • MD simulations using NAMD

  37. Very preliminary experience • Compiler doesn't yet generate dual floating-point or quadword load instructions • MPI calls are not yet fully optimized for the tree network • Performance numbers are very preliminary • User environment is still rough • Parallel file system is still not user friendly

  38. NAMD doesn’t scale as well on BG as on p655s;VN mode is only a little worse than CO mode (per p)

  39. ENZO • Example of the performance impact of codes containing O(N^2) or O(N^3) algorithms, where N is the number of processors (e.g., any per-task loop over all N tasks costs O(N^2) in aggregate, so quadrupling the processor count multiplies that term by 16)

  40. Original Enzo performance on DS, BG/L

  41. Improved Enzo performance on BH and DS

  42. I/O performance with GPFS has been measured for two benchmarks & one application; max rates on Blue Gene are comparable to DataStar for benchmarks, but slower for application in VN mode

    Code & quantity      DS p655s     BG CO        BG VN        BG VN (2048p)
                         8p/node      8p/IO node   16p/IO node  16p/IO node
                         (MB/s)       (MB/s)       (MB/s)       (MB/s)
    IOR write            1,793        1,797        1,478        1,585
    IOR read             1,755        2,291        2,165        2,306
    mpi-tile-io write    2,175        2,040        1,720        1,904
    mpi-tile-io read     1,698        3,481        2,929        2,933
    mpcugles write       1,391        905          387          —

    IOR & mpi-tile-io results are on 1024p, except for the last column; mpcugles results are on 512p

  43. IOR weak scaling scans with GPFS show BG has higher max for reads (2.3 vs 1.9 GB/s), while DS has higher max (than BG VN) for writes (1.8 vs 1.6 GB/s)

  44. User related issues • Effectiveness of computation/communication overlap using coprocessor mode still not tested • Performance variability based on • Physical slice given by the job scheduler • Mapping of the MPI tasks onto the physical slice
