
Multicores, Multiprocessors, and Clusters


Presentation Transcript


  1. Multicores, Multiprocessors, and Clusters

  2. Computer Architecture • Applications suggest how to improve technology and provide revenue to fund development • Improved technologies make new applications possible • Cost of software development makes compatibility a major force in the market • (figure: Applications ↔ Technology feedback cycle, plus Compatibility)

  3. Crossroads • First Microprocessor: Intel 4004, 1971 • 4-bit accumulator architecture • 8 μm pMOS • 2,300 transistors • 3 × 4 mm² • 750 kHz clock • 8-16 cycles/inst.

  4. Hardware • Team from IBM building PC prototypes in 1979 • Motorola 68000 chosen initially, but 68000 was late • 8088 is 8-bit bus version of 8086 => allows cheaper system • Estimated sales of 250,000 • 100,000,000s sold [ Personal Computing Ad, 11/81]

  5. Crossroads • DYSEAC, the first mobile computer! Built for the US Army Signal Corps • Carried in two tractor trailers, 12 tons + 8 tons • 900 vacuum tubes • Memory of 512 words of 45 bits each

  6. End of Uniprocessors • Hardware-based exploitation of ILP has run its course; further gains need hardware and software together • Intel cancelled its high-performance uniprocessor and joined IBM and Sun in moving to multiple processors

  7. Trends • Shrinking of transistor sizes: 250nm (1997) → 130nm (2002) → 65nm (2007) → 32nm (2010) • 28nm (2011, AMD GPU, Xilinx FPGA) • 22nm (2011, Intel Ivy Bridge, die shrink of the Sandy Bridge architecture) • Transistor density increases by 35% per year and die size increases by 10-20% per year… more cores!

  8. Trends (2004–2010) • Transistors: 1.43x / year • Cores: 1.2–1.4x • Performance: 1.15x • Frequency: 1.05x • Power: 1.04x • Source: Micron University Symp.

  9. Crossroads (textbook editions) • 1996: when I took this class! • 2002: reduced emphasis on ILP, introduce thread-level parallelism • 2009: reduced ILP to 1 chapter! Shift to multicore! • 2011: request, data, thread, and instruction-level parallelism; introduce GPUs, cloud computing, smart phones, tablets!

  10. Introduction • Goal: connecting multiple computers to get higher performance • Multiprocessors • Scalability, availability, power efficiency • Job-level (process-level) parallelism • High throughput for independent jobs • Parallel processing program • Single program run on multiple processors • Multicore microprocessors • Chips with multiple processors (cores)

  11. Parallel Programming • Parallel software is the problem • Need to get significant performance improvement • Otherwise, just use a faster uniprocessor, since it’s easier! • Difficulties • Partitioning • Coordination • Communications overhead

  12. Parallel Programming • MPI, OpenMP, and Stream Processing are methods of distributing workloads on computers. • Key: mapping the program's structure onto the target hardware architecture

  13. Shared Memory • SMP: shared memory multiprocessor • Small number of cores • Share single memory with uniform memory latency (symmetric) • SGI Altix UV 1000 (ARDC, December 2011 ) • 58 nodes, 928 cores • up to 2,560 cores with architectural support to 327,680 • support for up to 16TB of global shared memory.  • Programming: Parallel OpenMP or Threaded

  14. Example: Sum Reduction • Sum 100,000 numbers on a 100-processor UMA machine • Each processor has ID: 0 ≤ Pn ≤ 99 • Partition 1000 numbers per processor • Initial summation on each processor:
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
  sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums • Reduction: divide and conquer • Half the processors add pairs, then quarter, … • Need to synchronize between reduction steps

  15. Example: Sum Reduction
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1]; /* Conditional sum needed when half is odd; Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
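For reference, the same shared-memory sum can be written with OpenMP, where the reduction clause performs the partition-and-combine steps that the code above spells out by hand. This is a minimal sketch, not part of the original example: the array name A, its initialization, and the sample data are illustrative assumptions.

/* Sketch: sum 100,000 numbers on a shared-memory machine with OpenMP.
   reduction(+: sum) gives each thread a private partial sum and combines
   them at the end, much like the explicit divide-and-conquer loop above. */
#include <stdio.h>

#define N 100000

int main(void)
{
    static double A[N];                     /* illustrative data array */
    double sum = 0.0;

    for (int i = 0; i < N; i++)             /* fill with sample values */
        A[i] = 1.0;

    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++)             /* each thread sums a partition */
        sum += A[i];                        /* partial sums combined at the end */

    printf("sum = %f\n", sum);
    return 0;
}

Built with an OpenMP-aware compiler (e.g., gcc -fopenmp) the loop runs in parallel; a sequential compiler simply ignores the pragma, as slide 28 notes.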

  16. Distributed Memory • Distributed shared memory (DSM) • Memory distributed among processors • Non-uniform memory access/latency (NUMA) • Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks • Hardware sends/receives messages between processors

  17. Distributed Memory • URDC SGI ICE 8400 • 179 nodes, 2112 cores • up to 768 cores in a single rack, • scalable from 32 to tens of thousands of nodes • lower cost than SMP!  • distributed memory system (Cluster), • typically using MPI programming.  

  18. Sum Reduction (Again) • Sum 100,000 on 100 processors • First distribute 1000 numbers to each • Then do partial sums:
sum = 0;
for (i = 0; i < 1000; i = i + 1)
  sum = sum + AN[i];
• Reduction • Half the processors send, other half receive and add • Then a quarter send, a quarter receive and add, …

  19. Sum Reduction (Again) • Given send() and receive() operations
limit = 100; half = 100; /* 100 processors */
repeat
  half = (half+1)/2; /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit) send(Pn - half, sum);
  if (Pn < (limit/2)) sum = sum + receive();
  limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
• Send/receive also provide synchronization • Assumes send/receive take similar time to addition
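A hedged MPI counterpart of this distribute-then-reduce pattern is sketched below; MPI_Reduce performs internally the kind of logarithmic send/receive combine written out above. The local array name AN, the per-process count, and the sample data are assumptions for illustration.

/* Sketch: distributed sum with MPI. Each process sums its local chunk,
   then MPI_Reduce combines the partial sums onto rank 0. */
#include <stdio.h>
#include <mpi.h>

#define LOCAL_N 1000                /* numbers held by each process (illustrative) */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double AN[LOCAL_N];             /* this process's share of the data */
    double local_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < LOCAL_N; i++)   /* sample data */
        AN[i] = 1.0;

    for (int i = 0; i < LOCAL_N; i++)   /* local partial sum */
        local_sum += AN[i];

    /* combine partial sums; the result lands on rank 0 */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}

Typically compiled with mpicc and launched with mpirun (e.g., mpirun -np 100 ./a.out for the 100-process case in the example).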

  20. Matrix Multiplication (figure: result matrix = matrix A × matrix B)

  21. Message Passing Interface • language-independent communications protocol • provides a means to enable communication between different CPUs • point-to-point and collective communication • is a specification, not an implementation • a standard for communication among processes on a distributed-memory system, though its use is not restricted to such systems • processes do not have anything in common; each has its own memory space.

  22. Message Passing Interface • a set of subroutines used explicitly to communicate between processes • MPI programs are truly "multi-processing" • Parallelization cannot be done automatically or semi-automatically as in "multi-threading" programs • function and subroutine calls have to be inserted into the code • and they alter the algorithm of the code with respect to the serial version.

  23. Is it a curse? • The need to include the parallelism explicitly in the program • is a curse • more work and requires more planning than multi-threading, • and a blessing • often leads to more reliable and scalable code • the behavior is in the hands of the programmer. • Well-written MPI codes can be made to scale for thousands of CPUs.

  24. OpenMP • a system of so-called "compiler directives" that are used to express parallelism on a shared-memory machine. • an industry standard • most parallel-enabled compilers used on SMP machines are capable of processing OpenMP directives. • OpenMP is not a "language" • Instead, OpenMP specifies a set of compiler directives and library routines for an existing language (FORTRAN, C) for parallel programming on a shared memory machine
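To make the "directives plus library routines in an existing language" point concrete, here is a minimal sketch in C (not from the slides): one directive forks a team of threads, and two runtime-library calls query each thread's identity.

/* Sketch: OpenMP adds #pragma omp directives and a small runtime library
   (omp_get_thread_num, omp_get_num_threads, ...) on top of ordinary C. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel                          /* fork a team of threads */
    {
        int tid = omp_get_thread_num();           /* runtime library call */
        printf("hello from thread %d of %d\n",
               tid, omp_get_num_threads());
    }                                             /* implicit join at region end */
    return 0;
}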

  25. Systems using OpenMP • SMP (Symmetric Multi-Processor) • designed for shared-memory machines • advantage of not requiring communication between processors • allow multi-threading, • dynamic form of parallelism in which sub-processes are created and destroyed during program execution. • OpenMP will not work on distributed-memory clusters

  26. OpenMP compiler directives are inserted by the programmer, which allows stepwise parallelization of pre-existing serial programs

  27. Multiplication
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    for (kk = 0; kk < nrows; kk++){
      array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
    }
  }
}

  28. Multiplication • unified code: OpenMP constructs are treated as comments when sequential compilers are used.
#pragma omp parallel for shared(array, ncols, nrows) private(ii, jj, kk)
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    for (kk = 0; kk < nrows; kk++){
      array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
    }
  }
}

  29. Why is OpenMP popular? • The simplicity and ease of use • No message passing • data layout and decomposition is handled automatically by directives. • OpenMP directives may be incorporated incrementally. • program can be parallelized one portion after another and thus no dramatic change to code is needed • original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs and helps maintenance as well. • The code is in effect a serial code and more readable • Code size increase is generally smaller.

  30. OpenMP Tradeoffs • Cons • currently only runs efficiently in shared-memory multiprocessor platforms • requires a compiler that supports OpenMP. • scalability is limited by memory architecture. • reliable error handling is missing. • synchronization between subsets of threads is not allowed. • mostly used for loop parallelization

  31. MPI Tradeoffs • Pros of MPI • does not require shared memory architectures, which are more expensive than distributed memory architectures • can be used on a wider range of problems since it exploits both task parallelism and data parallelism • can run on both shared memory and distributed memory architectures • highly portable with specific optimization for the implementation on most hardware • Cons of MPI • requires more programming changes to go from serial to parallel version • can be harder to debug

  32. Another Example • Consider the following code fragment that finds the sum of f(a[i]) for 0 <= i < n:
for (ii = 0; ii < n; ii++){
  sum = sum + some_complex_long_function(a[ii]);
}

  33. Solution • Serial version:
for (ii = 0; ii < n; ii++){
  sum = sum + some_complex_long_function(a[ii]);
}
• Parallel version with a critical section:
#pragma omp parallel for shared(sum, a, n) private(ii, value)
for (ii = 0; ii < n; ii++) {
  value = some_complex_long_function(a[ii]);
  #pragma omp critical
  sum = sum + value;
}
• or better, you can use the reduction clause to get:
#pragma omp parallel for reduction(+: sum)
for (ii = 0; ii < n; ii++){
  sum = sum + some_complex_long_function(a[ii]);
}

  34. Measuring Performance • Two primary metrics: wall clock time (response time for a program) and throughput (jobs performed in unit time) • If we upgrade a machine with a new processor what do we increase? • If we add a new machine to the lab what do we increase? • Performance is measured with benchmark suites: a collection of programs that are likely relevant to the user • SPEC CPU 2006: cpu-oriented (desktops) • SPECweb, TPC: throughput-oriented (servers) • EEMBC: for embedded processors/workloads

  35. Measuring Performance • Elapsed Time • counts everything (disk and memory accesses, I/O, etc.) • a useful number, but often not good for comparison purposes • CPU time • doesn't count I/O or time spent running other programs • can be broken up into system time and user time
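As a hedged illustration of the difference, the sketch below times one loop two ways in C: clock() accumulates CPU time charged to this process, while gettimeofday() measures elapsed wall-clock time, which also includes I/O and time spent running other programs. The loop body is a placeholder workload, not from the slides.

/* Sketch: CPU time vs. elapsed (wall-clock) time for the same work. */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int main(void)
{
    struct timeval wall_start, wall_end;
    clock_t cpu_start, cpu_end;
    volatile double x = 0.0;

    gettimeofday(&wall_start, NULL);        /* wall clock: counts everything */
    cpu_start = clock();                    /* CPU time: this process only */

    for (long i = 0; i < 100000000L; i++)   /* placeholder workload */
        x += 1.0;

    cpu_end = clock();
    gettimeofday(&wall_end, NULL);

    double cpu_s  = (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    double wall_s = (wall_end.tv_sec - wall_start.tv_sec)
                  + (wall_end.tv_usec - wall_start.tv_usec) / 1e6;

    printf("CPU time:     %.3f s\n", cpu_s);
    printf("Elapsed time: %.3f s\n", wall_s);
    return 0;
}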

  36. Benchmark Games • An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola …came in a test known as SPECint92…Intel acknowledged that it had “optimized” its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code… Saturday, January 6, 1996 New York Times

  37. SPEC CPU2000

  38. Problems of Benchmarking • Hard to evaluate real benchmarks: • Machine not built yet, simulators too slow • Benchmarks not ported • Compilers not ready • Benchmark performance is composition of hardware and software (program, input, compiler, OS) performance, which must all be specified

  39. Compiler and Performance

  40. Amdahl's Law • The performance enhancement of an improvement is limited by how much the improved feature is used. In other words: Don’t expect an enhancement proportional to how much you enhanced something. • Example:"Suppose a program runs in 100 seconds on a machine, with multiply operations responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

  41. Amdahl's Law 1. Speedup = 4 2. Old execution time = 100 3. New execution time = 100/4 = 25 4. If 80 seconds is used by the affected part => 5. Unaffected part = 100 - 80 = 20 sec 6. Execution time new = Execution time unaffected + Execution time affected / Improvement 7. 25 = 20 + 80/Improvement 8. Improvement = 16 • How about 5X speedup?

  42. Amdahl's Law • An application is “almost all” parallel: 90%. Speedup using • 10 processors => 5.3x • 100 processors => 9.1x • 1000 processors => 9.9x
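The two Amdahl's Law slides above can be checked with a few lines of C; this is just a sketch of the arithmetic, with the slides' numbers hard-coded for illustration.

/* Sketch: Amdahl's Law arithmetic from the two slides above.
   new_time   = unaffected + affected / improvement
   speedup(N) = 1 / ((1 - f) + f / N) for parallel fraction f */
#include <stdio.h>

int main(void)
{
    /* Slide 41: 100 s total, 80 s in multiply, want 4x overall (25 s). */
    double unaffected = 20.0, affected = 80.0, target = 25.0;
    double improvement = affected / (target - unaffected);   /* = 16 */
    printf("multiply must improve by %.0fx\n", improvement);
    /* A 5x speedup would need a new time of 20 s, which is already taken
       up by the unaffected part, so no multiply improvement can reach it. */

    /* Slide 42: 90% parallel fraction, varying processor counts. */
    double f = 0.9;
    int procs[] = { 10, 100, 1000 };
    for (int i = 0; i < 3; i++) {
        double speedup = 1.0 / ((1.0 - f) + f / procs[i]);
        printf("%4d processors -> %.2fx speedup\n", procs[i], speedup);
    }
    return 0;
}

The printed speedups come out close to the slide's figures (about 5.3x, 9.2x, and 9.9x).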

  43. Stream Processing • streaming data in and out of an execution core without utilizing inter-thread communication, scattered (i.e., random) writes or even reads, or local memory. • hardware is drastically simplified • specialized chips (graphics processing unit)

  44. Parallelism • ILP exploits implicit parallel operations within a loop or straight-line code segment • TLP explicitly represented by the use of multiple threads of execution that are inherently parallel

  45. Multithreaded Categories (figure: issue slots over time for Superscalar, Fine-Grained, Coarse-Grained, Simultaneous Multithreading, and Multiprocessing; shading distinguishes Threads 1–5 and idle slots)

  46. Graphics Processing Units • Few hundred $ = hundreds of parallel FPUs • High performance computing more accessible • Blossomed with easy programming environment • GPUs and CPUs do not go back in computer architecture genealogy to a common ancestor • Primary ancestors of GPUs: Graphics accelerators

  47. History of GPUs • Early video cards • Frame buffer memory with address generation for video output • 3D graphics processing • Originally high-end computers (e.g., SGI) • 3D graphics cards for PCs and game consoles • Graphics Processing Units • Processors oriented to 3D graphics tasks • Vertex/pixel processing, shading, texture mapping, ray tracing

  48. Graphics in the System

  49. Graphics Processing Units • Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications? • Basic idea: • Heterogeneous execution model • CPU is the host, GPU is the device • Develop a C-like programming language for GPU • Unify all forms of GPU parallelism as CUDA thread • Programming model: “Single Instruction Multiple Thread”

  50. Programming the GPU • Compute Unified Device Architecture (CUDA) • Elegant solution to problem of expressing parallelism • Not all algorithms, but enough to matter • Challenge: coordinating the host (CPU) and the device (GPU) • Scheduling of computation • Data transfer • GPU offers every type of parallelism that can be captured by the programming environment
