This chapter focuses on the architecture and organization of parallel processors, exploring key concepts like multiprocessor organizations, hardware multithreading, and various architectures (SISD, SIMD, MIMD). It discusses the two main software strategies to exploit parallel processing: job-level parallelism and process-level parallelism. It emphasizes real-world applications, the implications of Amdahl's Law on performance, and common challenges in balancing computational loads. Additionally, the chapter highlights the importance of efficiently handling memory and communication in multiprocessor systems for optimal performance.
PARALLEL PROCESSOR ORGANIZATIONS Jehan-François Pâris jfparis@uh.edu
Chapter Organization • Overview • Writing parallel programs • Multiprocessor Organizations • Hardware multithreading • Alphabet soup (SISD, SIMD, MIMD, …) • Roofline performance model
The hardware side • Many parallel processing solutions • Multiprocessor architectures • Two or more microprocessor chips • Multiple architectures • Multicore architectures • Several processors on a single chip
The software side • Two ways for software to exploit parallel processing capabilities of hardware • Job-level parallelism • Several sequential processes run in parallel • Easy to implement (OS does the job!) • Process-level parallelism • A single program runs on several processors at the same time
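To make the distinction concrete, here is a minimal sketch of process-level parallelism, assuming a POSIX system with pthreads (the library, thread count, and worker function are illustrative choices, not from the slides): a single program creates several threads so that the OS can run its work on several processors at the same time.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                       /* assumed number of worker threads */

void *worker(void *arg) {
    long id = (long) arg;                /* each thread handles its own share */
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)  /* start the workers */
        pthread_create(&t[i], NULL, worker, (void *) i);
    for (int i = 0; i < NTHREADS; i++)   /* wait for all of them to finish */
        pthread_join(t[i], NULL);
    return 0;
}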
Overview • Some problems are embarrassingly parallel • Many computer graphics tasks • Brute force searches in cryptography or password guessing • Much more difficult for other applications • Communication overhead among sub-tasks • Amdahl's law • Balancing the load
Amdahl's Law • Assume a sequential process takes • tp seconds to perform operations that could be performed in parallel • ts seconds to perform purely sequential operations • The maximum speedup will be (tp + ts)/ts
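A quick numerical sketch (the timings below are hypothetical, not from the slides): with tp = 90 s of parallelizable work and ts = 10 s of sequential work, n processors give a speedup of (tp + ts)/(tp/n + ts), which never exceeds (90 + 10)/10 = 10.

#include <stdio.h>

/* Speedup predicted by Amdahl's law for n processors */
double speedup(double tp, double ts, double n) {
    return (tp + ts) / (tp / n + ts);
}

int main(void) {
    double tp = 90.0, ts = 10.0;         /* hypothetical timings, in seconds */
    printf("16 processors: %.2f\n", speedup(tp, ts, 16));   /* about 6.4 */
    printf("upper bound:   %.2f\n", (tp + ts) / ts);        /* 10.0 */
    return 0;
}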
Balancing the load • Must ensure that workload is equally divided among all the processors • Worst case is when one of the processors does much more work than all others
Example (I) • Computation partitioned among n processors • One of them does 1/m of the work, with m < n • That processor becomes a bottleneck • Maximum expected speedup: n • Actual maximum speedup: m
Example (II) • Computation partitioned among 64 processors • One of them does 1/8 of the work • Maximum expected speedup: 64 • Actual maximum speedup: 8
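The same computation as a tiny sketch (the numbers come from the example above): the achievable speedup is bounded by the processor that carries the largest share of the work.

#include <stdio.h>

int main(void) {
    int n = 64;                          /* processors available */
    double largest_share = 1.0 / 8.0;    /* one processor does 1/8 of the work */
    double actual_speedup = 1.0 / largest_share;    /* 8, not 64 */
    printf("expected speedup: %d, actual speedup: %.0f\n", n, actual_speedup);
    return 0;
}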
A last issue • Humans like to address issues one after the other • We have meeting agendas • We do not like to be interrupted • We write sequential programs
Rene Descartes • Seventeenth-century French philosopher • Invented • Cartesian coordinates • Methodical doubt • [To] never to accept anything for true which I did not clearly know to be such • Proposed a scientific method based on four precepts
Method's third rule • The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
Shared memory multiprocessors (diagram): several processing units (PU), each with its own cache, connected through an interconnection network to a shared RAM and shared I/O.
Shared memory multiprocessor • Can offer • Uniform memory access to all processors(UMA) • Easiest to program • Non-uniform memory access to all processors(NUMA) • Can scale up to larger sizes • Offer faster access to nearby memory
Computer clusters (diagram): several processing units (PU), each with its own cache and its own RAM, connected through an interconnection network.
Computer clusters • Very easy to assemble • Can take advantage of high-speed LANs • Gigabit Ethernet, Myrinet, … • Data exchanges must be done throughmessage passing
Message passing (I) • If processor P wants to access data in the main memory of processor Q, it must • Send a request to Q • Wait for a reply • For this to work, processor Q must have a thread • Waiting for messages from other processors • Sending them replies
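Here is a minimal sketch of that request/reply exchange using MPI (the slides do not name a particular message-passing library; the index, tags, and data array are illustrative): rank 0 plays the role of processor P and rank 1 the role of processor Q.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                     /* processor P: asks Q for a value */
        int index = 42;                  /* hypothetical index of the wanted item */
        MPI_Send(&index, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P received %d\n", value);
    } else if (rank == 1) {              /* processor Q: waits for requests and replies */
        int index, local_data[100] = {0};
        MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value = local_data[index % 100];
        MPI_Send(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}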
Message passing (II) • In a shared memory architecture, each processor can directly access all data • A proposed solution • Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data • Still has performance issues
When things do not add up • Memory capacity is very important for big computing applications • If the data can fit into main memory, the computation will run much faster
A problem • A company replaced • A single shared-memory computer with 32 GB of RAM • By four “clustered” computers with 8 GB of RAM each • The result was more I/O than ever • What happened?
The explanation • Assume the OS occupies one GB of RAM • The old shared-memory computer still had 31 GB of free RAM • Each of the clustered computers has only 7 GB of free RAM • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing • The computers are distributed over a very large network • Sometimes computer time is donated • Volunteer computing • Seti@Home • Works well with embarrassingly parallel workloads • Searches in an n-dimensional space
General idea • Let the processor switch to another thread of computation while the current one is stalled • Motivation: • Increased cost of cache misses
Implementation • Entirely controlled by the hardware • Unlike multiprogramming • Requires a processor capable of • Keeping track of the state of each thread • One set of registers (including the PC) for each concurrent thread • Quickly switching among concurrent threads
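As a purely conceptual sketch (this is C pseudostructure, not an actual hardware description, and the register count and number of contexts are assumptions), the per-thread state that a hardware-multithreaded core must replicate looks like this:

#include <stdint.h>

#define NUM_REGS 32                      /* assumed size of the register file */
#define NUM_HW_THREADS 4                 /* assumed number of hardware thread contexts */

struct thread_context {
    uint64_t regs[NUM_REGS];             /* one full register set per thread */
    uint64_t pc;                         /* that thread's program counter */
};

struct multithreaded_core {
    struct thread_context ctx[NUM_HW_THREADS];
    int active;                          /* thread currently issuing instructions */
};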
Approaches • Fine-grained multithreading: • Switches between threads at each instruction • Provides the highest throughput • Slows down the execution of individual threads
Approaches • Coarse-grained multithreading • Switches between threads whenever a long stall is detected • Easier to implement • Cannot eliminate all stalls
Approaches • Simultaneous multithreading: • Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel • Best solution
Overview • Used to describe processor organizations where • The same instruction can be applied to • Multiple data instances • Encountered in • Vector processors in the past • Graphic processing units (GPU) • x86 multimedia extensions
Classification • SISD: • Single instruction, single data • Conventional uniprocessor architecture • MIMD: • Multiple instructions, multiple data • Conventional multiprocessor architecture
Classification • SIMD: • Single instruction, multiple data • Performs the same operation on a set of similar data • Think of adding two vectors for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
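The same loop written with x86 SSE intrinsics gives a feel for SIMD execution (a minimal sketch; VECSIZE is assumed to be a multiple of 4, and the intrinsic names come from the standard x86 headers, not from the slides):

#include <xmmintrin.h>

#define VECSIZE 1024                     /* assumed to be a multiple of 4 */

void vec_add(float *sum, const float *a, const float *b) {
    for (int i = 0; i < VECSIZE; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb)); /* add and store 4 sums at once */
    }
}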
Vector computing • Kind of SIMD architecture • Used by Cray computers • Pipelines multiple executions of a single instruction with different data (“vectors”) through the ALU • Requires • Vector registers able to store multiple values • Special vector instructions: say lv, addv, …
Benchmarking • Two factors to consider • Memory bandwidth • Depends on interconnection network • Floating-point performance • Best known benchmark is LINPACK
Roofline model • Takes into account • Memory bandwidth • Floating-point performance • Introduces arithmetic intensity • Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory • Measured in floating-point operations per byte (FLOPs/byte)
Roofline model • Attainable GFLOP/s = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
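A small sketch of that formula in code (the peak numbers are hypothetical, chosen only to show the two regimes):

#include <stdio.h>

/* Attainable GFLOP/s under the roofline model */
double attainable_gflops(double peak_bw_gbs, double intensity, double peak_gflops) {
    double bw_bound = peak_bw_gbs * intensity;   /* memory-bandwidth ceiling */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    /* assumed machine: 16 GB/s memory bandwidth, 32 GFLOP/s peak */
    printf("%.1f GFLOP/s\n", attainable_gflops(16.0, 0.5, 32.0));  /*  8.0: memory-bound  */
    printf("%.1f GFLOP/s\n", attainable_gflops(16.0, 4.0, 32.0));  /* 32.0: compute-bound */
    return 0;
}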
Roofline model (diagram): attainable performance rises with arithmetic intensity while floating-point performance is limited by memory bandwidth, then levels off at the peak floating-point performance.