Scalable Architectures and their Software
Introduction • Overview of RISC CPUs, Memory Hierarchy • Parallel Systems General Hardware Layout (SMP, Distributed, Hybrid) • Networks for Parallel Systems • Overview of Parallel Programming • Hardware Specifics of NPACI Parallel Machines • IBM SP Blue Horizon • SUN HPC10000
Scalable Parallel Computer Systems (Scalable) [ ( CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System
Scalable Parallel Computer Systems Scalability: A parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality • Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.) • Application scalability • machine size • problem size
CPU, CACHE, and MEMORY - Basic Layout [Figure: CPU, CACHE Level 1, CACHE Level 2, MAIN MEMORY, in order of increasing distance from the CPU]
Processor Related Terms • RISC : Reduced Instruction Set Computers • PIPELINE : Technique where multiple instructions are overlapped in execution • SUPERSCALAR : Computer design feature in which multiple instructions can be executed per clock period • Speaker note (Laura C. Nett): the instruction set is just how each operation is processed; for example, x = y + 1 becomes: load y and 1, add them, put the result in x
‘Typical’ RISC CPU [Figure: CPU with registers r0, r1, r2, ... r32 and functional units (FP Add, FP Multiply, FP Multiply & Add, FP Divide); loads & stores move data between the registers and Memory/Cache]
Functional Unit Pipeline [Figure: a chair-building function unit staffed by Carpenter 1 through Carpenter 5, one per stage] • Fully Segmented - A(I)=C(I)*D(I) [Figure: multiply pipeline; C(I) and D(I) enter the functional unit and A(I) leaves after the pipeline length]
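Dual Pipes A(I) = C(I)*D(I) [Figure: two multiply pipes; the odd pipe takes the odd-indexed operands C(I), D(I) and the even pipe takes the even-indexed operands C(I+1), D(I+1), producing A(I) & A(I+1) together]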
RISC Memory/Cache Related Terms • ICACHE : Instruction cache • DCACHE (Level 1) : Data cache closest to registers • SCACHE (Level 2) : Secondary data cache • Data from SCACHE has to go through DCACHE to registers • SCACHE is larger than DCACHE • Not all processors have an SCACHE • CACHE LINE : Minimum transfer unit (usually in bytes) for moving data between different levels of the memory hierarchy • TLB : Translation lookaside buffer; keeps the addresses of pages (blocks of memory) in main memory that have been recently accessed • MEMORY BANDWIDTH : Transfer rate (in MBytes/sec) between different levels of memory • MEMORY ACCESS TIME : Time required (often measured in clock cycles) to bring data items from one level in memory to another • CACHE COHERENCY : Mechanism for ensuring data consistency of shared variables across the memory hierarchy
Memory/Cache Related Terms (cont.) • The data cache was designed to allow programmers to take advantage of common data access patterns : • Spatial Locality • When an array element is referenced, its neighbors are likely to be referenced • Cache lines are fetched together • Work on consecutive data elements in the same cache line • Temporal Locality • When an array element is referenced, it is likely to be referenced again soon • Arrange code so that data in cache is reused as often as possible
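A minimal sketch of why spatial locality matters, under loudly stated assumptions: 64-byte cache lines, 8-byte elements, row-major storage (as in C or Python lists of rows; in Fortran, which is column-major, the roles of the two loops swap), and a toy cache that holds only one line at a time. The function name `count_line_fetches` is illustrative, not from the slides.

```python
def count_line_fetches(addresses, line_size_bytes=64):
    """Count cache-line fetches for a sequence of byte addresses.

    Toy model (assumption): the cache holds a single line, so a fetch
    happens whenever an access falls in a different line than the last one.
    """
    fetches = 0
    current_line = None
    for addr in addresses:
        line = addr // line_size_bytes
        if line != current_line:
            fetches += 1
            current_line = line
    return fetches


ELEM = 8             # bytes per element (assumed double precision)
ROWS, COLS = 64, 64  # array shape (assumed)

# Row-major layout: element (i, j) sits at byte offset (i*COLS + j)*ELEM.
# Inner loop over j walks consecutive memory (unit stride).
row_order = [(i * COLS + j) * ELEM for i in range(ROWS) for j in range(COLS)]
# Inner loop over i jumps COLS*ELEM bytes per access (large stride).
col_order = [(i * COLS + j) * ELEM for j in range(COLS) for i in range(ROWS)]

unit_stride_fetches = count_line_fetches(row_order)
strided_fetches = count_line_fetches(col_order)
print(unit_stride_fetches, strided_fetches)
```

With these assumed sizes the unit-stride loop touches each line once per 8 elements, while the strided loop fetches a new line on every access.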
Memory/Cache Related Terms (cont.) Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache. [Figure: cache and main memory]
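One common way to realize the "exactly one place" rule is to take the block number modulo the number of cache lines. The sketch below assumes that scheme; the function name and the 32-byte line / 8-line cache in the usage are illustrative choices, not values from the slides.

```python
def direct_mapped_slot(address, line_size_bytes, num_lines):
    """Return (index, tag) for a byte address in a direct mapped cache.

    Assumed mapping: block number modulo number of lines gives the single
    cache location; the remaining high bits form the tag that tells apart
    the different memory blocks sharing that location.
    """
    block = address // line_size_bytes
    index = block % num_lines
    tag = block // num_lines
    return index, tag


# Hypothetical cache: 8 lines of 32 bytes (256 bytes total).
slot_a = direct_mapped_slot(0, 32, 8)    # block 0
slot_b = direct_mapped_slot(256, 32, 8)  # block 8 maps to the same index
slot_c = direct_mapped_slot(40, 32, 8)   # block 1
print(slot_a, slot_b, slot_c)
```

Addresses 0 and 256 land on the same index with different tags, so in this model they evict each other if used alternately.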
Memory/Cache Related Terms (cont.) Fully associative cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache. [Figure: cache and main memory]
Memory/Cache Related Terms (cont.) Set associative cache : The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In an n-way set-associative cache, a block from main memory can go into n (n at least 2) locations in the cache. [Figure: 2-way set-associative cache and main memory]
Memory/Cache Related Terms (cont.) • Common Cache Replacement Strategies • Least Recently Used (LRU) : Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block. • Random Replace : Cache replacement strategy for set associative caches. A cache block is randomly replaced.
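The LRU strategy can be sketched with a small simulator. This is an illustration under assumptions, not a model of any particular processor: sets are selected by block number modulo the number of sets, and the class and function names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy n-way set-associative cache with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered map per set; order runs from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, block):
        """Access a memory block number; return True on a hit."""
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)   # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        self.misses += 1
        return False


def miss_count(num_sets, ways, trace):
    cache = SetAssociativeCache(num_sets, ways)
    for block in trace:
        cache.access(block)
    return cache.misses


# Two blocks that collide in a direct mapped cache of 8 lines.
trace = [0, 8, 0, 8, 0, 8]
direct_mapped_misses = miss_count(num_sets=8, ways=1, trace=trace)
two_way_misses = miss_count(num_sets=4, ways=2, trace=trace)
print(direct_mapped_misses, two_way_misses)
```

Both configurations hold 8 lines in total; the 2-way layout keeps the two conflicting blocks side by side in one set, so only the first two accesses miss.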
Parallel Networks • Common Parallel Networks • Network Terminology
Send Information among CPUs through a Network - System Interconnect The best choice would be a fully connected network in which each processor has a direct link to every other processor. Unfortunately, this type of network would be very expensive and difficult to scale. Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc. [Figure: 2-d mesh, 2-d torus, 3-d hypercube]
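To make the cost argument concrete, the following sketch counts point-to-point links for each arrangement using standard graph formulas. The function names are illustrative, and the 2-d cases assume a square n x n layout (n at least 3 for the torus so that wraparound links are distinct).

```python
def fully_connected_links(p):
    # Every pair of processors gets its own link.
    return p * (p - 1) // 2


def mesh_2d_links(n):
    # n x n mesh: each of the n rows has n-1 links, and likewise each column.
    return 2 * n * (n - 1)


def torus_2d_links(n):
    # n x n torus (n >= 3): every node has 4 links, each shared by 2 nodes.
    return 2 * n * n


def hypercube_links(d):
    # d-dimensional hypercube: 2**d nodes, d links per node, each shared by 2.
    return d * 2 ** (d - 1)


# 64 processors in each arrangement.
links_64 = {
    "fully connected": fully_connected_links(64),
    "2-d mesh (8x8)": mesh_2d_links(8),
    "2-d torus (8x8)": torus_2d_links(8),
    "6-d hypercube": hypercube_links(6),
}
print(links_64)
```

The fully connected count grows quadratically with the number of processors, which is the scaling problem the slide refers to.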
Network Topologies • Bus • shared data path • data requests require exclusive access • not very scalable • Crossbar • non-blocking switching grid between network elements • complexity ~ O(n*n) • Multistage Interconnection Network (MIN) • hierarchy of switching networks • Omega network for p CPUs, p memory banks: number of switching stages ~ O(log(p))
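A rough comparison of hardware cost for the crossbar and the Omega network, assuming the textbook construction of an Omega network from 2x2 switches with p a power of two. The function names are illustrative.

```python
def crossbar_crosspoints(p):
    # A p x p crossbar needs one crosspoint per (input, output) pair.
    return p * p


def omega_network(p):
    """Return (stages, switches) for an Omega network built from 2x2 switches.

    Assumes p is a power of two: log2(p) stages with p/2 switches per stage.
    """
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two (>= 2)")
    stages = p.bit_length() - 1
    switches = (p // 2) * stages
    return stages, switches


crosspoints_64 = crossbar_crosspoints(64)
stages_64, switches_64 = omega_network(64)
print(crosspoints_64, stages_64, switches_64)
```

The Omega network trades the crossbar's non-blocking behaviour for far fewer switching elements, at the price of a path length that grows with the number of stages.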
Network Topologies We won’t worry too much about network topologies, since it is not the topology itself that we are interested in, but rather the effect that it has on important parameters for parallel program performance: • latency • bandwidth.
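A common first-order model combines these two parameters as time = latency + size / bandwidth. The sketch below uses that model; the numbers (20 microseconds, 100 MBytes/sec) are made-up illustrative values, not measurements of any machine discussed here.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    # First-order model (assumption): fixed startup cost plus streaming time.
    return latency_s + message_bytes / bandwidth_bytes_per_s


def effective_bandwidth(message_bytes, latency_s, bandwidth_bytes_per_s):
    # Bytes actually delivered per second once latency is included.
    return message_bytes / transfer_time(
        message_bytes, latency_s, bandwidth_bytes_per_s
    )


LATENCY = 20e-6    # seconds, hypothetical
BANDWIDTH = 100e6  # bytes/sec, hypothetical

small = effective_bandwidth(8, LATENCY, BANDWIDTH)          # latency dominated
half = effective_bandwidth(2000, LATENCY, BANDWIDTH)        # latency == streaming time
large = effective_bandwidth(1_000_000, LATENCY, BANDWIDTH)  # bandwidth dominated
print(small, half, large)
```

In this model short messages see only a small fraction of the peak rate, which is one reason measured bandwidth varies with data size and why parallel codes often aggregate many small messages into fewer large ones.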
Network Terminology • Network Latency : Time taken to begin sending a message. Unit is microsecond, millisecond, etc. Smaller is better. • Network Bandwidth : Rate at which data is transferred from one point to another. Unit is bytes/sec, Mbytes/sec, etc. Larger is better. • May vary with data size [Table: latency and bandwidth values for IBM machines]
Parallel “Architectures” • Control Mechanism: SIMD, MIMD • Memory Model: shared-memory, distributed-memory, Hybrid (SMP cluster) • Programming Model: SPMD, MPMD
Shared and Distributed memory [Figure: four CPUs, each with its own memory (M), joined by a NETWORK; four CPUs joined by a BUS to a single MEMORY] • Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (examples: CRAY T3E, IBM SP2) • Shared memory - single address space. All processors have access to a pool of shared memory. (examples: SUN HPC, CRAY T90) • Methods of memory access: Bus, Crossbar
Types of Shared memory: UMA/SMP and NUMA • Uniform memory access (UMA) - Each processor has uniform access to memory. Also known as Symmetric MultiProcessors (SMP). SMPs don’t scale to large # of CPUs because the memory bus gets saturated and because of cache coherence problems. Example: SUN HPC • Non-uniform memory access (NUMA) - Time for memory access depends on location of data. Local access is faster than non-local access. Global memory access. Easier to scale than SMPs. (example: HP V-Class) [Figure: groups of CPUs, each group sharing a BUS and a MEMORY; in the NUMA case the groups are joined by a Secondary Bus]
Hybrid (SMP Clusters) [Figure: three SMP nodes, each with four CPUs sharing a BUS and a MEMORY, connected to one another by a Network]
Parallel Programming • What is parallel computing? • Why go parallel? • Types of parallel computing • Shared memory • Distributed memory • Hybrid • What are some limits of parallel computing?
What is parallel computing? • Parallel computing: the use of multiple computers or processors working together on a common task. • Each processor works on its section of the problem • Processors are allowed to exchange information (data in local memory) with other processors [Figure: grid of the problem to be solved, divided into four areas; CPU #1 through CPU #4 each work on one area and exchange data with their neighbors in x and y]
Why do parallel computing? • Limits of single-CPU computing • Available memory • Performance - usually “time to solution” • Parallel computing allows: • Solving problems that don’t fit on a single CPU • Solving problems that can’t be solved in a reasonable time • We can run… • Larger problems • Finer resolution • Faster • More cases
Hardware Architecture Models for Design of Parallel Programs Sequential computers - von Neumann model (RAM) is universal computational model Parallel computers - no one model exists • Model must be sufficiently general to encapsulate hardware features of parallel systems • Programs designed from model must execute efficiently on real parallel systems
Categories of Parallel Problems Problem “Architectures” ( after G Fox) • Ideally Parallel (Embarrassingly Parallel, “Job-Level Parallel”) • same application run on different data • could be run on separate machines • Example: Parameter Studies • Almost Ideally Parallel • similar to Ideal case, but with “minimum” coordination required • Example: Linear Monte Carlo calculations • Pipeline Parallelism • problem divided into tasks that have to be completed sequentially • can be transformed into partially sequential tasks • Example: DSP • Synchronous Parallelism • Each operation performed on all/most of data • Operations depend on results of prior operations • All processes must be synchronized at regular points • Example: Atmospheric Dynamics • Loosely Synchronous Parallelism • similar to Synchronous case, but with “minimum” intermittent data sharing • Example: Diffusion of contaminants through groundwater
Generic Parallel Programming Models • Single Program Multiple Data Stream (SPMD) • Each cpu accesses same object code • Same application run on different data • Data exchange may be handled explicitly/implicitly • “Natural” model for SIMD machines • Most commonly used generic programming model • Message-passing • Shared-memory • Multiple Program Multiple Data Stream (MPMD) • Each cpu accesses different object code • Each cpu has only data/instructions needed • “Natural” model for MIMD machines
Generic Parallel Programming Systems • Message-Passing • Local tasks, each encapsulating local data • Explicit data exchange • Supports both SPMD and MPMD • Supports both task and data decomposition • Most commonly used • Example: MPI, PVM • Data Parallel • Usually SPMD • Supports data decomposition • Data mapping to cpus may be either implicit/explicit • Example: HPF • Shared-Memory • Tasks share common address space • No explicit transfer of data - supports both task and data decomposition • Can be SPMD, MPMD • Example: OpenMP, Pthreads • Hybrid - Combination of Message-Passing and Shared-Memory - supports both task and data decomposition • Example: OpenMP + MPI
Methods of Problem Decomposition for Parallel Programming Want to map Problem + Algorithms to Architecture • Data Decomposition - data parallel • Each processor performs the same task on different data • Example - grid problems • Task Decomposition - task parallel • Each processor performs a different task • Example - signal processing • Other Decomposition methods
Programming Methodologies - Practical Aspects Bulk of parallel programs in Fortran, C, or C++ • Generally, best compiler, tool support for parallel development Data and/or tasks are split up onto different processors by: • Distributing the data onto local memory (MPPs,MPI) • Distribute work of each loop to different cpus (SMPs,OpenMP) • Hybrid - distribute data onto SMPs and then within each SMP distribute work of each loop (or task) to different CPUs within the box (SMP-Cluster, MPI&OpenMP)
Typical data parallel decomposition Example: integrate 2-D propagation problem: Original partial differential equation: Finite Difference Approximation: [Figure: x-y grid divided among PE #0 through PE #7]
Basics of Data Parallel Decomposition - SPMD
One code will run on 2 CPUs. The program has an array of data to be operated on by 2 CPUs, so the array is split into two parts.

program.f, MPI version as written:
   …
   if CPU=a then
      LL=1
      UL=50
   elseif CPU=b then
      LL=51
      UL=100
   end if
   do I = LL,UL
      work on A(I)
   end do
   ...
   end program

What each CPU effectively executes (MPI):
   CPU A: LL=1, UL=50;    do I = LL,UL: work on A(I)
   CPU B: LL=51, UL=100;  do I = LL,UL: work on A(I)

program.f, OpenMP version as written:
   …
   !$OMP directive
   do I = 1,100
      work on A(I)
   end do
   ...
   end program

What each CPU effectively executes (OpenMP):
   one CPU:   do I = 1,50:   work on A(I)
   other CPU: do I = 51,100: work on A(I)
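The hard-coded bounds LL and UL generalize to any number of CPUs. The helper below is one possible way to compute them (a sketch; the name `block_range` and the rule for spreading leftover elements over the lowest ranks are assumptions, not taken from the slides).

```python
def block_range(rank, nprocs, n):
    """Return inclusive 1-based (LL, UL) loop bounds for this rank.

    Splits indices 1..n into nprocs contiguous blocks; when n is not a
    multiple of nprocs, the first n % nprocs ranks get one extra element.
    """
    base, extra = divmod(n, nprocs)
    lower = rank * base + min(rank, extra) + 1
    upper = lower + base - 1 + (1 if rank < extra else 0)
    return lower, upper


# The two-CPU case from the slide: A(1:100) split in half.
bounds_cpu_a = block_range(0, 2, 100)
bounds_cpu_b = block_range(1, 2, 100)
print(bounds_cpu_a, bounds_cpu_b)

# An uneven case: 10 elements over 3 ranks.
uneven = [block_range(r, 3, 10) for r in range(3)]
print(uneven)
```

In an MPI code, `rank` and `nprocs` would come from the communicator; here they are plain arguments so the arithmetic can be checked in isolation.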
Typical Task Parallel Decomposition • Signal processing: DATA → FFT Task → Multiply Task → Inverse FFT Task → Normalize Task • Use one processor for each task • Can use more processors if one is overloaded
Basics of Task Parallel Decomposition - SPMD
One code will run on 2 CPUs. The program has 2 tasks (a and b) to be done by 2 CPUs.

program.f as written:
   …
   initialize
   ...
   if CPU=a then
      do task a
   elseif CPU=b then
      do task b
   end if
   ….
   end program

What each CPU effectively executes:
   CPU A: initialize; do task a
   CPU B: initialize; do task b
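The branching pattern above can be mimicked with two worker threads standing in for the two CPUs. This is only a structural sketch: the task bodies are placeholders, the names are hypothetical, and CPython threads illustrate the control flow rather than true CPU-level parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(data):
    # Placeholder for "do task a".
    return sum(data)


def task_b(data):
    # Placeholder for "do task b".
    return max(data)


def spmd_body(cpu, data):
    """The single program: every worker runs it and branches on its id."""
    if cpu == 0:
        return ("a", task_a(data))
    elif cpu == 1:
        return ("b", task_b(data))
    raise ValueError("this sketch only defines tasks for 2 workers")


def run(data):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(spmd_body, cpu, data) for cpu in range(2)]
        return [f.result() for f in futures]  # results in worker-id order


results = run([3, 1, 4, 1, 5])
print(results)
```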
Sending data between CPUs
Finite Difference Approximation:
Sample Pseudo Code
   if (cpu=0) then
      li = 1
      ui = 25
      lj = 1
      uj = 25
      send(1:25)=f(25,1:25)
   elseif (cpu=1) then
      ....
   elseif (cpu=2) then
      ...
   elseif (cpu=3) then
      ...
   end if
   do j = lj,uj
      do i = li,ui
         work on f(i,j)
      end do
   end do
[Figure: x-y grid split into four blocks, one per PE: i=1,25 j=1,25; i=1,25 j=26,50; i=26,50 j=1,25; i=26,50 j=26,50. The boundary strips exchanged between neighboring PEs are labeled i=25, j=1-25; i=25, j=26-50; i=26, j=1-25; i=26, j=26-50; i=1-25, j=26; i=26-50, j=25]
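The per-CPU bounds and the choice of neighbor to send to can be computed rather than spelled out branch by branch. The sketch below assumes PEs are numbered row by row over a pi x pj arrangement of blocks and that the grid divides evenly; the function names are illustrative.

```python
def block_bounds_2d(pe, pi, pj, ni, nj):
    """Return inclusive 1-based (li, ui, lj, uj) for one PE.

    Assumes pi x pj blocks numbered with j varying fastest, and a grid of
    ni x nj points that divides evenly among the blocks.
    """
    if ni % pi or nj % pj:
        raise ValueError("grid must divide evenly among PEs in this sketch")
    bi, bj = divmod(pe, pj)
    rows, cols = ni // pi, nj // pj
    return bi * rows + 1, (bi + 1) * rows, bj * cols + 1, (bj + 1) * cols


def neighbors(pe, pi, pj):
    """Return the neighboring PE in each direction, or None at a boundary."""
    bi, bj = divmod(pe, pj)

    def pe_at(i, j):
        return i * pj + j if 0 <= i < pi and 0 <= j < pj else None

    return {
        "i-1": pe_at(bi - 1, bj),
        "i+1": pe_at(bi + 1, bj),
        "j-1": pe_at(bi, bj - 1),
        "j+1": pe_at(bi, bj + 1),
    }


# 50 x 50 grid on a 2 x 2 arrangement of PEs, as in the pseudocode.
all_bounds = [block_bounds_2d(pe, 2, 2, 50, 50) for pe in range(4)]
pe0_neighbors = neighbors(0, 2, 2)
print(all_bounds, pe0_neighbors)
```

For PE 0 the "i+1" neighbor is the PE that needs the strip f(25,1:25), matching the send in the pseudocode.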
Multi-Level Task Parallelism
Each processor set (Proc set #1 through Proc set #4) runs the same program, with threads inside each set and MPI between sets:
   Program tskpar
   Implicit none
   (declarations)
   Do loop #1 par block
   End task #1
   (serial work)
   Do loop #2 par block
   End task #2
   (serial work)
[Figure: four processor sets, each running tskpar with threads; MPI messages travel between the sets over the network]
Implementation: MPI and OpenMP
Parallel Application Performance Concepts • Parallel Speedup • Parallel Efficiency • Parallel Overhead • Limits on Parallel Performance
Parallel Application Performance Concepts • Parallel Speedup - ratio of best sequential time to parallel execution time • S(n) = ts/tp • Parallel Efficiency - fraction of time processors in use • E(n) = ts/(tp*n) = S(n)/n • Parallel Overhead • Limits on Parallel Performance
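The two definitions translate directly into code. The timings in the usage are made-up numbers for illustration only.

```python
def speedup(t_serial, t_parallel):
    # S(n) = ts / tp
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n):
    # E(n) = ts / (tp * n) = S(n) / n
    return speedup(t_serial, t_parallel) / n


# Hypothetical timings: 100 s on one CPU, 16 s on 8 CPUs.
s8 = speedup(100.0, 16.0)
e8 = efficiency(100.0, 16.0, 8)
print(s8, e8)
```

An efficiency below 1 means the processors were not fully in use, for example because of serial sections or communication.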
Limits of Parallel Computing • Theoretical upper limits • Amdahl’s Law • Gustafson’s Law • Practical limits • communication overhead • synchronization overhead • extra operations necessary for parallel version • Other Considerations • time to re-write (existing) code
Theoretical upper limits • All parallel programs contain: • Parallel sections • Serial sections • Serial sections limit the parallel performance • Amdahl’s Law provides a theoretical upper limit on parallel performance for size-constant problems
Amdahl’s Law • Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. • Effect of multiple processors on run time for size-constant problems: $t_N = \left( f_p / N + f_s \right) t_1$ • Effect of multiple processors on parallel speedup, S: $S = \dfrac{1}{f_s + f_p / N}$ • Where • $f_s$ = serial fraction of code • $f_p$ = parallel fraction of code • $N$ = number of processors • $t_1$ = sequential execution time • $t_N$ = execution time on $N$ processors
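Amdahl's Law in its standard form, S = 1 / (fs + fp/N) with fs = 1 - fp, is easy to evaluate numerically. The sketch below tabulates it at 250 processors for a few parallel fractions; the function name is illustrative.

```python
def amdahl_speedup(fp, n):
    """Speedup on n processors for parallel fraction fp (serial fs = 1 - fp)."""
    fs = 1.0 - fp
    return 1.0 / (fs + fp / n)


# Speedup on 250 processors for several parallel fractions.
table = {fp: amdahl_speedup(fp, 250) for fp in (1.0, 0.999, 0.99, 0.9)}
for fp, s in table.items():
    print(f"fp = {fp:5.3f}  speedup on 250 CPUs = {s:6.1f}")
```

As N grows the speedup approaches 1/fs, so a code that is 99% parallel can never exceed a speedup of 100 under this model, however many processors are used.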
Illustration of Amdahl's Law It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors. [Figure: curves for fp = 1.000, 0.999, 0.990, and 0.900 plotted against number of processors (0 to 250); vertical axis 0 to 250]
Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications will result in further degradation of performance. [Figure: "Amdahl's Law" and "Reality" curves for fp = 0.99 plotted against number of processors (0 to 250); vertical axis 0 to 80]
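One way to see how communication can pull the real curve below Amdahl's bound is to add an overhead term to the run time. The model here is purely an assumption chosen for illustration: overhead grows linearly with the number of processors, with a made-up coefficient; real communication costs depend on the algorithm and the network.

```python
def amdahl_speedup(fp, n):
    return 1.0 / ((1.0 - fp) + fp / n)


def speedup_with_comm(fp, n, comm_coeff):
    """Speedup when run time gains an overhead of comm_coeff * n * t1.

    Hypothetical model: t_N / t1 = fs + fp/n + comm_coeff * n.
    """
    return 1.0 / ((1.0 - fp) + fp / n + comm_coeff * n)


FP = 0.99
COMM = 1e-4  # made-up overhead coefficient

ideal_250 = amdahl_speedup(FP, 250)
real_100 = speedup_with_comm(FP, 100, COMM)
real_250 = speedup_with_comm(FP, 250, COMM)
best_n = max(range(1, 251), key=lambda n: speedup_with_comm(FP, n, COMM))
print(ideal_250, real_100, real_250, best_n)
```

Under this assumed model the speedup peaks at a finite processor count and then falls, so adding processors past that point makes the run slower.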
Some other considerations • Writing an efficient parallel application is usually more difficult than writing a serial application • Communication and synchronization can limit parallel efficiency • Usually want to overlap communication and computation to minimize the ratio of communication to computation time • Serial time can dominate • CPU computational load balance is important • Is it worth your time to rewrite an existing application? • Do the CPU requirements justify parallelization? • Will the code be used “enough” times?
Parallel Programming - Real Life • These are the main models • New approaches are likely to arise • Other hybrids of these models are possible • Large applications might use more than one model • Shared memory model is closest to mathematical model of application
Parallel Computing References • www.npaci.edu/PCOMP • Selected HPC link collection - categorized, updated • Books • Computer Organization and Design, D. Patterson and J. L. Hennessy • Scalable Parallel Computing, K. Hwang, Z. Xu • Parallel Programming, B. Wilkinson, M. Allen