
Programming the IBM Power3 SP


Presentation Transcript


  1. Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB

  2. Advanced Computational Research Laboratory • High Performance Computational Problem-Solving and Visualization Environment • Computational Experiments in multiple disciplines: CS, Science and Eng. • 16-Processor IBM SP3 • Member of C3.ca Association, Inc. (http://www.c3.ca)

  3. Advanced Computational Research Laboratory www.cs.unb.ca/acrl • Virendra Bhavsar, Director • Eric Aubanel, Research Associate & Scientific Computing Support • Sean Seeley, System Administrator

  4. Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)

  5. POWER chip: 1990 to 2003 1990 • Performance Optimized with Enhanced RISC • Reduced Instruction Set Computer • Superscalar: combined floating point multiply-add (FMA) unit which allowed peak MFLOPS rate = 2 x MHz • Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

  6. POWER chip: 1990 to 2003 1991: SP1 • IBM’s first SP (scalable power parallel) • Rack of standalone POWER processors (62.5 MHz) connected by internal switch network • Parallel Environment & system software

  7. POWER chip: 1990 to 2003 1993: POWER2 • 2 FMAs • Increased data cache size • 66.5 MHz (254 MFLOPS) • Improved instruction set (incl. Hardware square root) • SP2: POWER2 + higher bandwidth switch for larger systems

  8. POWER chip: 1990 to 2003 • 1993: POWERPC, with SMP support • 1996: P2SC (POWER2 Super Chip), clock speeds up to 160 MHz

  9. POWER chip: 1990 to 2003 Feb. ‘99: POWER3 • Combined P2SC & POWERPC • 64 bit architecture • Initially 2-way SMP, 200 MHz • Cache improvement, including L2 cache of 1-16 MB • Instruction & data prefetch

  10. POWER3+ chip: Feb. 2000
      Winterhawk II - 375 MHz, 4-way SMP • 2 MULT/ADD - 1500 MFLOPS • 64 KB Level 1 - 5 nsec / 3.2 GB/sec • 8 MB Level 2 - 45 nsec / 6.4 GB/sec • 1.6 GB/s memory bandwidth • 6 GFLOPS/node
      Nighthawk II - 375 MHz, 16-way SMP • 2 MULT/ADD - 1500 MFLOPS • 64 KB Level 1 - 5 nsec / 3.2 GB/sec • 8 MB Level 2 - 45 nsec / 6.4 GB/sec • 14 GB/s memory bandwidth • 24 GFLOPS/node
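
      The peak numbers above follow from the clock rate and the two floating-point units, since each fused multiply-add counts as two operations: 375 MHz x 2 FMA units x 2 flops per FMA = 1500 MFLOPS per processor, so a 4-way Winterhawk II node peaks at 6 GFLOPS and a 16-way Nighthawk II node at 24 GFLOPS.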

  11. The Clustered SMP • ACRL’s SP: four 4-way SMPs • Each node has its own copy of the O/S • Processors on the same node are closer than those on different nodes

  12. Power3 Architecture

  13. Power4 - 32 way • Logical UMA • SP High Node • L3 cache shared between all processors on node - 32 MB • Up to 32 GB main memory • Each processor: 1.1 GHz • 140 Gflops total peak

  14. Going to NUMA • Up to 256 processors - 1.1 Teraflops

  15. Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)

  16. Uni-processor Optimization • Compiler options: start with -O3 -qstrict, then -O3, -qarch=pwr3 • Cache re-use • Take advantage of the superscalar architecture: give it enough operations per load/store • Use ESSL: its routines are already fully optimized
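
      As a small illustration of the cache re-use point above (an added sketch, not from the slides; array names and sizes are arbitrary): Fortran stores arrays by column, so the inner loop should run over the first index to get stride-1 access and use every 128-byte cache line fully.

      real*8 a(512,512), b(512,512), c(512,512)
      integer i, j
c     cache-friendly order: consecutive iterations touch consecutive
c     memory locations, so each 128-byte line (16 real*8 values) is
c     fully used before it is evicted
      do j = 1, 512
         do i = 1, 512
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end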

  17. Memory Access Times

  18. Cache • 128-byte cache line • L1 cache: 128-way set-associative, 64 KB • L2 cache: 4-way set-associative, 8 MB total (4 ways of 2 MB)
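
      Working through the numbers above: 8 MB of L2 split 4 ways gives 2 MB per way (the four 2 MB blocks in the original diagram), or 2 MB / 128 bytes = 16384 lines per way; the 64 KB L1 holds 64 KB / 128 bytes = 512 lines.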

  19. How to Monitor Performance? • IBM’s hardware monitor: HPMCOUNT • Uses hardware counters on chip • Cache & TLB misses, fp ops, load-stores, … • Beta version • Available soon on ACRL’s SP

  20. HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

      PM_TLB_MISS (TLB misses)             : 66543
      Average number of loads per TLB miss : 5.916
      Total loads and stores               : 0.525 M
      Instructions per load/store          : 2.749
      Cycles per instruction               : 2.378
      Instructions per cycle               : 0.420
      Total floating point operations      : 0.066 M
      Hardware float point rate            : 2.749 Mflop/sec

  21. HPMCOUNT sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

      PM_TLB_MISS (TLB misses)             : 1634
      Average number of loads per TLB miss : 241.876
      Total loads and stores               : 0.527 M
      Instructions per load/store          : 2.749
      Cycles per instruction               : 1.271
      Instructions per cycle               : 0.787
      Total floating point operations      : 0.066 M
      Hardware float point rate            : 3.525 Mflop/sec

      With leading dimension 256, the matching elements of a, b and c sit exactly 512 KB apart in the common block and land in the same sets of the set-associative TLB, so the three access streams keep evicting each other's page translations; padding the leading dimension to 257 breaks this alignment, and the TLB miss count drops from 66543 to 1634.

  22. ESSL • Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers • Fast! • 560x560 real*8 matrix multiply • Hand coding: 19 Mflops • dgemm: 1.2 GFlops • Parallel (threaded and distributed) versions
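
      For reference, a minimal sketch of the kind of ESSL call behind the dgemm figure above; ESSL's DGEMM follows the standard BLAS calling sequence, and the matrix names and leading dimensions here are illustrative.

      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
c     compute C = 1.0*A*B + 0.0*C with ESSL's DGEMM
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      end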

  23. Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)

  24. ACRL’s IBM SP • 4 Winterhawk II nodes • 16 processors • Each node has: 1 GB RAM, 9 GB (mirrored) disk, switch adapter • High Performance Switch • Gigabit Ethernet (1 node) • Control workstation • Disk: SSA tower with six 18.2 GB disks

  25. IBM Power3 SP Switch • Bidirectional multistage interconnection network (MIN) • 300 MB/sec bi-directional bandwidth • 1.2 µsec latency

  26. Application GPFS Client RVSD/VSD Application GPFS Client RVSD/VSD Application GPFS Client RVSD/VSD Application GPFS Server RVSD/VSD General Parallel File System Node 2 Node 3 Node 4 SP Switch Node 1

  27. ACRL Software • Operating System: AIX 4.3.3 • Compilers • IBM XL Fortran 7.1 (HPF not yet installed) • VisualAge C for AIX, Version 5.0.1.0 • VisualAge C++ Professional for AIX, Version 5.0.0.0 • IBM Visual Age Java - not yet installed • Job Scheduler: Loadleveler 2.2 • Parallel Programming Tools • IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O • Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 ) • Visualization: OpenDX (not yet installed) • E-Commerce software (not yet installed)

  28. Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)

  29. Why Parallel Computing? • Solve large problems in reasonable time • Many algorithms are inherently parallel • image processing, Monte Carlo • simulations (e.g. CFD) • High performance computers have parallel architectures • Commercial off-the-shelf (COTS) components • Beowulf clusters • SMP nodes • Improvements in network technology

  30. NRL Layered Ocean Model run on the IBM Winterhawk II SP at the Naval Research Laboratory

  31. Parallel Computational Models • Data Parallelism • Parallel program looks like serial program • parallelism in the data • Vector processors • HPF
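
      A small sketch of the data-parallel style (added for illustration; HPF is noted on slide 27 as not yet installed, and the array names and distribution chosen here are arbitrary): the program looks serial, whole-array statements express the parallelism, and directives tell an HPF compiler how to spread the data across processors.

      real a(1000,1000), b(1000,1000), c(1000,1000)
!HPF$ DISTRIBUTE a(BLOCK,*)
!HPF$ DISTRIBUTE b(BLOCK,*)
!HPF$ DISTRIBUTE c(BLOCK,*)
!     one whole-array statement; the compiler partitions the work
!     to match the block distribution of the data
      a = b + c
      end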

  32. Parallel Computational Models • Message Passing (MPI) • Processes have only local memory but can communicate with other processes by sending & receiving messages • Data transfer between processes requires operations to be performed by both processes • Communication network not part of computational model (hypercube, torus, …) (diagram: Send / Receive between two processes)

  33. Parallel Computational Models (diagram: multiple threads sharing one address space) • Shared Memory (threads) • P(osix)threads • OpenMP: higher level standard
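
      A bare-bones OpenMP sketch of the shared-memory model (added here; the OpenMP part of the talk covers this properly, and the arrays are made up): all threads see the same data, and the directive splits the loop iterations among them.

      integer i
      real*8 a(1000), b(1000), c(1000)
      b = 1.0d0
      c = 2.0d0
c     the threads on the node share a, b and c and divide the iterations
!$OMP PARALLEL DO PRIVATE(i)
      do i = 1, 1000
         a(i) = b(i) + c(i)
      end do
!$OMP END PARALLEL DO
      end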

  34. Parallel Computational Models • Remote Memory Operations • “One-sided” communication • MPI-2, IBM’s LAPI • One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory (diagram: Get / Put)
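
      A rough sketch of the one-sided style with the MPI-2 calls named above (added for illustration; the array and window names are made up and the argument handling is simplified): each process exposes an array through a "window", and the partner reads it with a get without any matching send or receive.

      program onesided
      implicit none
      include "mpif.h"
      integer ierr, my_id, other, win, n
      parameter ( n = 100 )
      real*8 a(n), b(n)
      integer (kind=MPI_ADDRESS_KIND) winsize, disp
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other = mod ( my_id + 1, 2 )
      a = my_id
      winsize = n * 8
      disp = 0
c     expose array a to the other processes through a window
      call MPI_WIN_CREATE ( a, winsize, 8, MPI_INFO_NULL,
     &                      MPI_COMM_WORLD, win, ierr )
      call MPI_WIN_FENCE ( 0, win, ierr )
c     read a from the partner; the partner only calls the fences
      call MPI_GET ( b, n, MPI_DOUBLE_PRECISION, other, disp, n,
     &               MPI_DOUBLE_PRECISION, win, ierr )
      call MPI_WIN_FENCE ( 0, win, ierr )
      call MPI_WIN_FREE ( win, ierr )
      call MPI_FINALIZE ( ierr )
      end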

  35. Parallel Computational Models (diagram: several SMP nodes, each with its processes/threads in one address space, connected by a network) • Combined: Message Passing & Threads • Driven by clusters of SMPs • Leads to software complexity!
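
      A bare-bones sketch of the combined model (added here; the hybrid MPI/OpenMP section of the talk goes further): one MPI process per node with OpenMP threads inside it, summing numbers with the threads and combining the per-process sums with MPI_REDUCE.

      program hybrid
      implicit none
      include "mpif.h"
      integer ierr, my_id, i
      real*8 partial, total
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      partial = 0.0d0
c     OpenMP threads share the work inside this MPI process
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:partial)
      do i = 1, 1000
         partial = partial + dble ( i + 1000 * my_id )
      end do
!$OMP END PARALLEL DO
c     MPI combines the per-process results across the nodes
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &                  MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', total
      call MPI_FINALIZE ( ierr )
      end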

  36. Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)

  37. Message Passing Interface • MPI 1.0 standard in 1994 • MPI 1.1 in 1995 - IBM support • MPI 2.0 in 1997 • Includes 1.1 but adds new features • MPI-IO • One-sided communication • Dynamic processes

  38. Advantages of MPI • Universality • Expressivity • Well suited to formulating a parallel algorithm • Ease of debugging • Memory is local • Performance • Explicit association of data with process allows good use of cache

  39. MPI Functionality • Several modes of point-to-point message passing • blocking (e.g. MPI_SEND) • non-blocking (e.g. MPI_ISEND) • synchronous (e.g. MPI_SSEND) • buffered (e.g. MPI_BSEND) • Collective communication and synchronization • e.g. MPI_REDUCE, MPI_BARRIER • User-defined datatypes • Logically distinct communicator spaces • Application-level or virtual topologies
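
      As a small added illustration of the non-blocking mode listed above (not from the slides; buffer sizes and names are arbitrary): the transfers are started, other work can proceed, and a wait completes them.

      program nonblock
      implicit none
      include "mpif.h"
      integer req(2), stats(MPI_STATUS_SIZE,2), ierr, my_id, other
      real a(100), b(100)
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other = mod ( my_id + 1, 2 )
      a = real ( my_id )
c     start the transfers; both calls return immediately
      call MPI_ISEND ( a, 100, MPI_REAL, other, 0, MPI_COMM_WORLD,
     &                 req(1), ierr )
      call MPI_IRECV ( b, 100, MPI_REAL, other, 0, MPI_COMM_WORLD,
     &                 req(2), ierr )
c     useful computation could overlap the communication here
      call MPI_WAITALL ( 2, req, stats, ierr )
      call MPI_FINALIZE ( ierr )
      end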

  40. Simple MPI Example (diagram: two processes with My_Id 0 and 1; process 0 prints "This is from MPI process number 0", the other prints "This is from MPI processes other than 0")

  41. Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"   ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
      stop
      end

  42. MPI Example with send/recv (diagram: processes with My_Id 0 and 1 each send to and receive from the other)

  43. MPI Example with send/recv

      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )   ! status argument required by MPI_RECV
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end

  44. What Will Happen?

      /* Processor 0 */
      ...
      MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
      printf("Posting receive now ...\n");
      MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

      /* Processor 1 */
      ...
      MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
      printf("Posting receive now ...\n");
      MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

  45. MPI Message Passing Modes

      Mode          Protocol used
      Ready         Ready
      Standard      Eager (message <= eager limit) or Rendezvous (message > eager limit)
      Synchronous   Rendezvous
      Buffered      Buffered

      Default eager limit on the SP is 4 KB (can be raised up to 64 KB)
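
      An added note tying slides 44 and 45 together: the send-first code on slide 44 completes only while the messages fit under the eager limit; above it, both standard sends fall into the rendezvous protocol, each waits for a receive that is never posted, and the program deadlocks. One portable fix is a combined send/receive, sketched below as a Fortran fragment mirroring the names in the C example (status must be declared as an integer array of size MPI_STATUS_SIZE).

      integer status(MPI_STATUS_SIZE), ierr
c     each process sends to and receives from its partner in one call,
c     so the library can order the traffic and no deadlock is possible
      call MPI_SENDRECV ( sendbuf, bufsize, MPI_CHARACTER, partner, tag,
     &                    recvbuf, bufsize, MPI_CHARACTER, partner, tag,
     &                    MPI_COMM_WORLD, status, ierr )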

  46. MPI Performance Visualization • ParaGraph • Developed by University of Illinois • Graphical display system for visualizing behaviour and performance of MPI programs
