This chapter focuses on the architecture and organization of parallel processors, exploring key concepts like multiprocessor organizations, hardware multithreading, and various architectures (SISD, SIMD, MIMD). It discusses the two main software strategies to exploit parallel processing: job-level parallelism and process-level parallelism. It emphasizes real-world applications, the implications of Amdahl's Law on performance, and common challenges in balancing computational loads. Additionally, the chapter highlights the importance of efficiently handling memory and communication in multiprocessor systems for optimal performance.
PARALLEL PROCESSOR ORGANIZATIONS Jehan-François Pâris jfparis@uh.edu
Chapter Organization • Overview • Writing parallel programs • Multiprocessor Organizations • Hardware multithreading • Alphabet soup (SISD, SIMD, MIMD, …) • Roofline performance model
The hardware side • Many parallel processing solutions • Multiprocessor architectures • Two or more microprocessor chips • Multiple architectures • Multicore architectures • Several processors on a single chip
The software side • Two ways for software to exploit parallel processing capabilities of hardware • Job-level parallelism • Several sequential processes run in parallel • Easy to implement (OS does the job!) • Process-level parallelism • A single program runs on several processors at the same time
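To make the distinction concrete, here is a minimal sketch of process-level parallelism, assuming a POSIX system with pthreads (the library, thread count, and worker function are illustrative choices, not from the slides): a single program creates several threads so that the OS can run its work on several processors at the same time.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                       /* assumed number of worker threads */

void *worker(void *arg) {
    long id = (long) arg;                /* each thread handles its own share */
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)  /* start the workers */
        pthread_create(&t[i], NULL, worker, (void *) i);
    for (int i = 0; i < NTHREADS; i++)   /* wait for all of them to finish */
        pthread_join(t[i], NULL);
    return 0;
}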
Overview • Some problems are embarrassingly parallel • Many computer graphics tasks • Brute force searches in cryptography or password guessing • Much more difficult for other applications • Communication overhead among sub-tasks • Amdahl's law • Balancing the load
Amdahl's Law • Assume a sequential process takes • tp seconds to perform operations that could be performed in parallel • ts seconds to perform purely sequential operations • The maximum speedup will be (tp + ts)/ts
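A quick numerical sketch (the timings below are hypothetical, not from the slides): with tp = 90 s of parallelizable work and ts = 10 s of sequential work, n processors give a speedup of (tp + ts)/(tp/n + ts), which never exceeds (90 + 10)/10 = 10.

#include <stdio.h>

/* Speedup predicted by Amdahl's law for n processors */
double speedup(double tp, double ts, double n) {
    return (tp + ts) / (tp / n + ts);
}

int main(void) {
    double tp = 90.0, ts = 10.0;         /* hypothetical timings, in seconds */
    printf("16 processors: %.2f\n", speedup(tp, ts, 16));   /* about 6.4 */
    printf("upper bound:   %.2f\n", (tp + ts) / ts);        /* 10.0 */
    return 0;
}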
Balancing the load • Must ensure that workload is equally divided among all the processors • Worst case is when one of the processors does much more work than all others
Example (I) • Computation partitioned among n processors • One of them does 1/m of the work, with m < n • That processor becomes a bottleneck • Maximum expected speedup: n • Actual maximum speedup: m
Example (II) • Computation partitioned among 64 processors • One of them does 1/8 of the work • Maximum expected speedup: 64 • Actual maximum speedup: 8
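The same computation as a tiny sketch (the numbers come from the example above): the achievable speedup is bounded by the processor that carries the largest share of the work.

#include <stdio.h>

int main(void) {
    int n = 64;                          /* processors available */
    double largest_share = 1.0 / 8.0;    /* one processor does 1/8 of the work */
    double actual_speedup = 1.0 / largest_share;    /* 8, not 64 */
    printf("expected speedup: %d, actual speedup: %.0f\n", n, actual_speedup);
    return 0;
}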
A last issue • Humans like to address issues one after the other • We have meeting agendas • We do not like to be interrupted • We write sequential programs
Rene Descartes • Seventeenth-century French philosopher • Invented • Cartesian coordinates • Methodical doubt • [To] never to accept anything for true which I did not clearly know to be such • Proposed a scientific method based on four precepts
Method's third rule • The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
Shared memory multiprocessors (diagram): several processing units (PU), each with its own cache, connected through an interconnection network to a shared RAM and shared I/O.
Shared memory multiprocessor • Can offer • Uniform memory access to all processors(UMA) • Easiest to program • Non-uniform memory access to all processors(NUMA) • Can scale up to larger sizes • Offer faster access to nearby memory
Computer clusters (diagram): several processing units (PU), each with its own cache and its own RAM, connected through an interconnection network.
Computer clusters • Very easy to assemble • Can take advantage of high-speed LANs • Gigabit Ethernet, Myrinet, … • Data exchanges must be done throughmessage passing
Message passing (I) • If processor P wants to access data in the main memory of processor Q, it must • Send a request to Q • Wait for a reply • For this to work, processor Q must have a thread • Waiting for messages from other processors • Sending them replies
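Here is a minimal sketch of that request/reply exchange using MPI (the slides do not name a particular message-passing library; the index, tags, and data array are illustrative): rank 0 plays the role of processor P and rank 1 the role of processor Q.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                     /* processor P: asks Q for a value */
        int index = 42;                  /* hypothetical index of the wanted item */
        MPI_Send(&index, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P received %d\n", value);
    } else if (rank == 1) {              /* processor Q: waits for requests and replies */
        int index, local_data[100] = {0};
        MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value = local_data[index % 100];
        MPI_Send(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}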
Message passing (II) • In a shared memory architecture, each processor can directly access all data • A proposed solution • Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data • Still has performance issues
When things do not add up • Memory capacity is very important for big computing applications • If the data can fit into main memory, the computation will run much faster
A problem • A company replaced • A single shared-memory computer with 32 GB of RAM • By four “clustered” computers with 8 GB of RAM each • The result was more I/O than ever • What happened?
The explanation • Assume the OS occupies one GB of RAM • The old shared-memory computer still had 31 GB of free RAM • Each of the clustered computers has only 7 GB of free RAM • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing • The computers are distributed over a very large network • Sometimes computer time is donated • Volunteer computing • Seti@Home • Works well with embarrassingly parallel workloads • Searches in an n-dimensional space
General idea • Let the processor switch to another thread of computation while the current one is stalled • Motivation: • Increased cost of cache misses
Implementation • Entirely controlled by the hardware • Unlike multiprogramming • Requires a processor capable of • Keeping track of the state of each thread • One set of registers (including the PC) for each concurrent thread • Quickly switching among concurrent threads
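As a purely conceptual sketch (this is C pseudostructure, not an actual hardware description, and the register count and number of contexts are assumptions), the per-thread state that a hardware-multithreaded core must replicate looks like this:

#include <stdint.h>

#define NUM_REGS 32                      /* assumed size of the register file */
#define NUM_HW_THREADS 4                 /* assumed number of hardware thread contexts */

struct thread_context {
    uint64_t regs[NUM_REGS];             /* one full register set per thread */
    uint64_t pc;                         /* that thread's program counter */
};

struct multithreaded_core {
    struct thread_context ctx[NUM_HW_THREADS];
    int active;                          /* thread currently issuing instructions */
};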
Approaches • Fine-grained multithreading: • Switches between threads at each instruction • Provides the highest throughput • Slows down the execution of individual threads
Approaches • Coarse-grained multithreading • Switches between threads whenever a long stall is detected • Easier to implement • Cannot eliminate all stalls
Approaches • Simultaneous multithreading: • Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel • Best solution
Overview • Used to describe processor organizations where • The same instruction can be applied to • Multiple data instances • Encountered in • Vector processors in the past • Graphic processing units (GPU) • x86 multimedia extensions
Classification • SISD: • Single instruction, single data • Conventional uniprocessor architecture • MIMD: • Multiple instructions, multiple data • Conventional multiprocessor architecture
Classification • SIMD: • Single instruction, multiple data • Performs the same operation on a set of similar data • Think of adding two vectors for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
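The same loop written with x86 SSE intrinsics gives a feel for SIMD execution (a minimal sketch; VECSIZE is assumed to be a multiple of 4, and the intrinsic names come from the standard x86 headers, not from the slides):

#include <xmmintrin.h>

#define VECSIZE 1024                     /* assumed to be a multiple of 4 */

void vec_add(float *sum, const float *a, const float *b) {
    for (int i = 0; i < VECSIZE; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb)); /* add and store 4 sums at once */
    }
}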
Vector computing • Kind of SIMD architecture • Used by Cray computers • Pipelines multiple executions of a single instruction with different data (“vectors”) through the ALU • Requires • Vector registers able to store multiple values • Special vector instructions: say lv, addv, …
Benchmarking • Two factors to consider • Memory bandwidth • Depends on interconnection network • Floating-point performance • Best known benchmark is LINPACK
Roofline model • Takes into account • Memory bandwidth • Floating-point performance • Introduces arithmetic intensity • Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory • Measured in floating-point operations per byte (FLOPs/byte)
Roofline model • Attainable GFLOP/s = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
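A small sketch of that formula in code (the peak numbers are hypothetical, chosen only to show the two regimes):

#include <stdio.h>

/* Attainable GFLOP/s under the roofline model */
double attainable_gflops(double peak_bw_gbs, double intensity, double peak_gflops) {
    double bw_bound = peak_bw_gbs * intensity;   /* memory-bandwidth ceiling */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    /* assumed machine: 16 GB/s memory bandwidth, 32 GFLOP/s peak */
    printf("%.1f GFLOP/s\n", attainable_gflops(16.0, 0.5, 32.0));  /*  8.0: memory-bound  */
    printf("%.1f GFLOP/s\n", attainable_gflops(16.0, 4.0, 32.0));  /* 32.0: compute-bound */
    return 0;
}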
Roofline model (diagram): attainable performance rises with arithmetic intensity while floating-point performance is limited by memory bandwidth, then levels off at the peak floating-point performance.