Parallel Computing Overview

CS 524 – High-Performance Computing

Parallel Computing
• Multiple processors working cooperatively to solve a computational problem
• Examples range from specially designed parallel computers and algorithms to geographically distributed networks of workstations cooperating on a task
• Some problems cannot be solved by present-day serial computers, or take an impractically long time to solve
• Parallel computing exploits the concurrency and parallelism inherent in the problem domain
  • Task parallelism
  • Data parallelism
CS 524 (Au 05-06) - Asim Karim @ LUMS

Development Trends
• Advances in IC technology and processor design
  • CPU performance has doubled every 18 months for the past 20+ years (Moore’s Law)
  • Clock rates increased from 4.77 MHz for the 8088 (1979) to 3.6 GHz for the Pentium 4 (2004)
  • FLOPS increased from a handful (1945) to 35.86 TFLOPS (Earth Simulator by NEC, 2002 to date)
  • Decrease in cost and size
• Advances in computer networking
  • Bandwidth increased from a few bits per second to > 10 Gb/s
  • Decrease in size and cost, and increase in reliability
• Need
  • Solution of larger and more complex problems

Issues in Parallel Computing
• Parallel architectures
  • Design of bottleneck-free hardware components
• Parallel programming models
  • Parallel view of the problem domain for effective partitioning and distribution of work among processors
• Parallel algorithms
  • Efficient algorithms that take advantage of parallel architectures
• Parallel programming environments
  • Programming languages, compilers, portable libraries, development tools, etc.

Two Key Algorithm Design Issues
• Load balancing
  • Execution time of a parallel program is the time elapsed from the start of processing by the first processor to the end of processing by the last processor
  • Partitioning of computational load among processors
• Communication overhead
  • Processors are much faster than communication links
  • Partitioning of data among processors
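
The load-balancing point above can be made concrete with a small sketch. It assumes every work item has a known cost, that a processor's time is the sum of the costs of its items, and that communication is ignored. The names `block_partition` and `parallel_time` are illustrative and do not come from the course material.

```python
def block_partition(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_items, n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # the first `extra` blocks get one more item
        blocks.append(range(start, start + size))
        start += size
    return blocks


def parallel_time(costs, blocks):
    """Elapsed time is set by the most heavily loaded processor (communication ignored)."""
    return max(sum(costs[i] for i in block) for block in blocks)


blocks = block_partition(8, 4)

# Uniform costs: splitting by item count also balances the load.
uniform = [1.0] * 8
t_uniform = parallel_time(uniform, blocks)

# Item i costs i + 1: the same split by count leaves the last processor overloaded.
skewed = [float(i + 1) for i in range(8)]
t_skewed = parallel_time(skewed, blocks)

# Lower bound on the elapsed time if the skewed work could be spread perfectly.
ideal_skewed = sum(skewed) / 4
```

With the skewed costs, the gap between `t_skewed` and `ideal_skewed` is the time lost to load imbalance under this simple model.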

Parallel MVM: Row-Block Partition
      do i = 1, N
        do j = 1, N
          y(i) = y(i) + A(i,j)*x(j)
        end do
      end do
[Figure: matrix A (rows i, columns j) split into row blocks P0 to P3; vectors x and y distributed among P0 to P3]
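
A minimal sketch of the row-block idea, simulated in a single Python process. The loop over `p` stands in for the processors, and the function name `matvec_row_block` is an assumption made for illustration. Each simulated processor owns a contiguous block of rows of A and the matching entries of y, but needs every entry of x.

```python
def matvec_row_block(A, x, n_procs):
    """Compute y = A x with the rows of A split into contiguous blocks, one per processor."""
    n = len(A)
    y = [0.0] * n
    rows_per_proc = (n + n_procs - 1) // n_procs  # ceiling division
    for p in range(n_procs):  # each iteration stands in for one processor
        lo = p * rows_per_proc
        hi = min(lo + rows_per_proc, n)
        for i in range(lo, hi):
            # Reads all of x, but writes only y[i] inside this processor's own block.
            y[i] = sum(A[i][j] * x[j] for j in range(len(x)))
    return y
```

Because no two processors write the same entry of y, the only data a processor is missing in this layout is the part of x held by the others.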

Parallel MVM: Column-Block Partition
      do j = 1, N
        do i = 1, N
          y(i) = y(i) + A(i,j)*x(j)
        end do
      end do
[Figure: matrix A (rows i, columns j) split into column blocks P0 to P3; vectors x and y distributed among P0 to P3]
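
A comparable sketch for the column-block case, again simulated in one process with the hypothetical name `matvec_col_block`. Here each simulated processor needs only its own slice of x, but produces a full-length partial result, so the partial vectors have to be summed at the end.

```python
def matvec_col_block(A, x, n_procs):
    """Compute y = A x with the columns of A split into contiguous blocks, one per processor."""
    n_rows, n_cols = len(A), len(x)
    cols_per_proc = (n_cols + n_procs - 1) // n_procs  # ceiling division
    partials = []
    for p in range(n_procs):  # each iteration stands in for one processor
        lo = p * cols_per_proc
        hi = min(lo + cols_per_proc, n_cols)
        # Uses only x[lo:hi], but contributes to every entry of y.
        partials.append(
            [sum(A[i][j] * x[j] for j in range(lo, hi)) for i in range(n_rows)]
        )
    # Reduction step: add the partial vectors from all processors.
    return [sum(part[i] for part in partials) for i in range(n_rows)]
```

The final reduction is where processors would have to communicate in this layout, in contrast to the row-block version where the missing data is x.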

Parallel MVM: Block Partition
• Can we do any better?
• Assume same distribution of x and y
• Can A be partitioned to reduce communication?
[Figure: matrix A (rows i, columns j) split into a 2-D grid of blocks assigned to P0 to P3; vectors x and y distributed among P0 to P3]
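
One way to explore the question is to simulate a 2-D block partition and count how many entries of x each processor touches. The sketch below assumes a q-by-q processor grid with n divisible by q; the name `matvec_2d_block` and the `x_reads` bookkeeping are illustrative. It does not model how x and y are actually distributed, so it only hints at the communication cost.

```python
def matvec_2d_block(A, x, q):
    """Compute y = A x with A split into a q-by-q grid of square blocks.

    Returns y and a dict mapping each grid position (pr, pc) to the number of
    entries of x that block reads. Assumes len(A) is divisible by q.
    """
    n = len(A)
    b = n // q  # block edge length
    y = [0.0] * n
    x_reads = {}
    for pr in range(q):      # processor row in the grid
        for pc in range(q):  # processor column in the grid
            rows = range(pr * b, (pr + 1) * b)
            cols = range(pc * b, (pc + 1) * b)
            for i in rows:
                # Partial sum over this block's columns; adding into y[i] stands in
                # for a reduction across the processors of one grid row.
                y[i] += sum(A[i][j] * x[j] for j in cols)
            x_reads[(pr, pc)] = len(cols)
    return y, x_reads
```

For n = 4 and q = 2, each of the four blocks reads 2 entries of x and produces 2 partial entries of y, whereas a row block on four processors reads all 4 entries of x.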

Parallel Architecture Models
• Bus-based shared memory or symmetric multiprocessor [SMP] (e.g. suraj, dual/quad processor Xeon machines)
• Network-based distributed-memory (e.g. Cray T3E, our linux cluster)
• Network-based distributed-shared-memory (e.g. SGI Origin 2000)
• Network-based distributed shared-memory (e.g. SMP clusters)

Bus-Based Shared-Memory (SMP)
[Figure: processors (P) connected to a shared memory by a bus]
• Any processor can access any memory location at equal cost (symmetric multiprocessor)
• Tasks “communicate” by writing/reading commonly accessible locations
• Easier to program
• Cannot scale beyond 30 processors (bus bottleneck)
• Examples: most workstation vendors make SMPs (Sun, IBM, Intel-based SMPs), Cray T90, SV1 (uses cross-bar)

Network-Connected Distributed-Memory
[Figure: processors (P), each with its own memory (M), connected by an interconnection network]
• Each processor can only access its own memory
• Explicit communication by sending and receiving messages
• More tedious to program
• Can scale to thousands of processors
• Examples: Cray T3E, clusters

Network-Connected Distributed-Shared-Memory
[Figure: processors (P) with physically distributed memories (M), connected by an interconnection network]
• Each processor can directly access any memory location
• Physically distributed memory
• Non-uniform memory access costs
• Example: SGI Origin 2000

Network-Connected Distributed Shared-Memory
[Figure: SMP nodes, each with processors (P) and a memory (M) on a bus, connected by an interconnection network]
• Network of SMPs
• Each SMP can only access its own memory
• Explicit communication between SMPs
• Can take advantage of both shared-memory and distributed-memory programming models
• Can scale to hundreds of processors
• Examples: SMP clusters

Parallel Programming Models
• Global-address (or shared-address) space model
  • POSIX threads (PThreads)
  • OpenMP
• Message passing (or distributed-address) model
  • MPI (message passing interface)
  • PVM (parallel virtual machine)
• Higher level programming environments
  • High-Performance Fortran (HPF)
  • PETSc (portable extensible toolkit for scientific computation)
  • POOMA (parallel object-oriented methods and applications)

Other Parallel Programming Models
• Task and channel
  • Similar to message passing
  • Instead of communicating between named tasks (as in the message passing model), tasks communicate through named channels
• SPMD (single program multiple data)
  • Each processor executes the same program code, operating on different data
  • Most message passing programs are SPMD
• Data parallel
  • Operations on chunks of data (e.g. arrays) are parallelized
• Grid
  • Problem domain viewed in parcels, with processing for parcel(s) allocated to different processors

Example
      real a(n,n), b(n,n)
      do k = 1, NumIter
c       new interior values are the average of the four neighbours in b
        do i = 2, n-1
          do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          end do
        end do
c       copy the new interior values back into b for the next iteration
        do i = 2, n-1
          do j = 2, n-1
            b(i,j) = a(i,j)
          end do
        end do
      end do
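
For readers who want to run the example, here is a serial Python sketch of the same four-point averaging. It uses 0-based indexing and nested lists, and the function name `relax` is an assumption for illustration. Boundary rows and columns are left unchanged, as in the loops above.

```python
def relax(b, num_iter):
    """Repeatedly replace each interior point of b by the average of its four neighbours."""
    n = len(b)
    b = [row[:] for row in b]  # work on a copy so the caller's grid is untouched
    a = [row[:] for row in b]
    for _ in range(num_iter):
        # Compute new interior values from the current b.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                a[i][j] = (b[i - 1][j] + b[i][j - 1] + b[i + 1][j] + b[i][j + 1]) / 4
        # Copy them back so the next iteration reads the updated values.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                b[i][j] = a[i][j]
    return b


# A 4x4 grid whose top boundary row is held at 4.0 and everything else starts at 0.0.
grid = [[4.0] * 4] + [[0.0] * 4 for _ in range(3)]
result = relax(grid, 2)
```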

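Shared-Address Space Model: OpenMP
      real a(n,n), b(n,n)
c     every thread runs the full k loop, so k is private like i and j
c$omp parallel shared(a,b) private(i,j,k)
      do k = 1, NumIter
c       iterations of the i loop are divided among the threads
c$omp do
        do i = 2, n-1
          do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          end do
        end do
c       the implicit barrier ending the loop above protects this copy
c$omp do
        do i = 2, n-1
          do j = 2, n-1
            b(i,j) = a(i,j)
          end do
        end do
      end do
c$omp end parallel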
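
The structure of the shared-address version can be mimicked with Python threads: all threads see the same `a` and `b`, each thread handles a subset of the interior rows, and an explicit barrier plays the role of the implicit barrier at the end of each work-shared loop. This is only a sketch of the programming model, with the hypothetical name `relax_threads`. It is not an OpenMP equivalent and, because of the interpreter lock in standard CPython, is not expected to run faster than the serial loop.

```python
import threading


def relax_threads(b, num_iter, n_threads):
    """Four-point averaging with interior rows shared out among threads."""
    n = len(b)
    b = [row[:] for row in b]  # shared by all threads
    a = [row[:] for row in b]  # shared by all threads
    barrier = threading.Barrier(n_threads)
    interior = list(range(1, n - 1))

    def worker(t):
        my_rows = interior[t::n_threads]  # cyclic assignment of rows to thread t
        for _ in range(num_iter):
            for i in my_rows:
                for j in range(1, n - 1):
                    a[i][j] = (b[i - 1][j] + b[i][j - 1] + b[i + 1][j] + b[i][j + 1]) / 4
            barrier.wait()  # every thread must finish reading b before b is overwritten
            for i in my_rows:
                for j in range(1, n - 1):
                    b[i][j] = a[i][j]
            barrier.wait()  # every thread must finish updating b before the next sweep

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return b
```

Each thread writes only its own rows, so the two barriers are the only synchronization the sketch needs.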

Message Passing Pseudo-code
      real aLoc(NdivP,n), bLoc(0:NdivP+1,n)
      me = get_my_procnum()
      do k = 1, NumIter
c       exchange boundary rows with the neighbouring processors
        if (me .ne. P-1) send(me+1, bLoc(NdivP, 1:n))
        if (me .ne. 0)   recv(me-1, bLoc(0, 1:n))
        if (me .ne. 0)   send(me-1, bLoc(1, 1:n))
        if (me .ne. P-1) recv(me+1, bLoc(NdivP+1, 1:n))
c       the first and last processors skip the global boundary rows
        if (me .eq. 0) then
          ibeg = 2
        else
          ibeg = 1
        endif
        if (me .eq. P-1) then
          iend = NdivP-1
        else
          iend = NdivP
        endif
        do i = ibeg, iend
          do j = 2, n-1
            aLoc(i,j) = (bLoc(i-1,j) + bLoc(i,j-1)
     &                 + bLoc(i+1,j) + bLoc(i,j+1))/4
          end do
        end do
        do i = ibeg, iend
          do j = 2, n-1
            bLoc(i,j) = aLoc(i,j)
          end do
        end do
      end do
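
The pseudo-code can be checked with a single-process Python simulation in which each "processor" holds its own block of rows plus two ghost rows, and the send/recv pairs are replaced by copying rows between the local arrays. The function name `relax_message_passing` and the assumption that n is divisible by P are illustrative; no real message-passing library is used.

```python
def relax_message_passing(b, num_iter, P):
    """Simulate the row-block message-passing scheme with P processors in one process."""
    n = len(b)
    assert n % P == 0, "sketch assumes n is divisible by P"
    ndivp = n // P

    # b_loc[me] has rows 0..ndivp+1; rows 0 and ndivp+1 are ghost rows.
    b_loc = []
    for me in range(P):
        local = [[0.0] * n]
        local += [row[:] for row in b[me * ndivp:(me + 1) * ndivp]]
        local.append([0.0] * n)
        b_loc.append(local)
    a_loc = [[row[:] for row in local] for local in b_loc]

    for _ in range(num_iter):
        # Exchange phase: copies stand in for the send/recv pairs.
        for me in range(P):
            if me != P - 1:
                b_loc[me + 1][0] = b_loc[me][ndivp][:]      # last owned row goes down
            if me != 0:
                b_loc[me - 1][ndivp + 1] = b_loc[me][1][:]  # first owned row goes up
        # Compute phase: each processor updates only the rows it owns.
        for me in range(P):
            ibeg = 2 if me == 0 else 1               # skip the global top boundary
            iend = ndivp - 1 if me == P - 1 else ndivp  # skip the global bottom boundary
            loc_b, loc_a = b_loc[me], a_loc[me]
            for i in range(ibeg, iend + 1):
                for j in range(1, n - 1):
                    loc_a[i][j] = (loc_b[i - 1][j] + loc_b[i][j - 1]
                                   + loc_b[i + 1][j] + loc_b[i][j + 1]) / 4
            for i in range(ibeg, iend + 1):
                for j in range(1, n - 1):
                    loc_b[i][j] = loc_a[i][j]

    # Gather the owned rows back into one global grid for inspection.
    return [row[:] for local in b_loc for row in local[1:ndivp + 1]]
```

Running it with different values of P on the same grid should give identical results, which is a quick check that the ghost-row exchange and the `ibeg`/`iend` bounds are consistent.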
