PPT - Parallel Scientific Computing: Algorithms and Tools Lecture #3 PowerPoint Presentation

Parallel Scientific Computing: Algorithms and Tools, Lecture #3. APMA 2821A, Spring 2008. Instructors: George Em Karniadakis, Leopold Grinberg

Levels of Parallelism • Job level parallelism: Capacity computing • Goal: run as many jobs as possible on a system in a given time period. The concern is throughput; an individual user’s jobs may not run faster. • Of interest to administrators • Program/Task level parallelism: Capability computing • Use multiple processors to solve a single problem. • Controlled by users. • Instruction level parallelism: • Pipeline, multiple functional units, multiple cores. • Invisible to users. • Bit-level parallelism: • Of concern to hardware designers of arithmetic-logic units

Granularity of Parallel Tasks • Large/coarse grain parallelism: • The amount of work that runs in parallel is fairly large • e.g., on the order of an entire program • Small/fine grain parallelism: • The amount of work that runs in parallel is relatively small • e.g., on the order of a single loop • Coarse/large grains usually result in more favorable parallel performance

Flynn’s Taxonomy of Computers • SISD: Single instruction stream, single data stream • MISD: Multiple instruction streams, single data stream • SIMD: Single instruction stream, multiple data streams • MIMD: Multiple instruction streams, multiple data streams

Classification of Computers • SISD: single instruction single data • Conventional computers • CPU fetches from one instruction stream and works on one data stream. • Instructions may run in parallel (superscalar). • MISD: multiple instruction single data • No real world implementation.

Classification of Computers • SIMD: single instruction multiple data • Controller + processing elements (PE) • Controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data • e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?) • MIMD: multiple instruction multiple data • Processors execute their own instructions on different data streams • Processors communicate with one another directly, or through shared memory. • Usual parallel computers, clusters of workstations

Flynn’s Taxonomy

Programming Model • SPMD: Single program multiple data • MPMD: multiple programs multiple data

Programming Model • SPMD: Single program multiple data • Usual parallel programming model • All processors execute same program, on multiple data sets (domain decomposition) • Processor knows its own ID • if(my_cpu_id == N){} • else {}
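The SPMD pattern on this slide can be sketched in plain Python without a real parallel runtime. The names `block_range` and `spmd_program` are hypothetical, and the four "processors" are simulated by a sequential loop; on a real machine each rank would be a separate process running the same program.

```python
def block_range(rank, nprocs, n):
    """Return the half-open index range [lo, hi) owned by `rank`
    when n items are split into nprocs nearly equal blocks."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def spmd_program(rank, nprocs, data):
    """The single program every processor runs; behaviour depends on rank."""
    lo, hi = block_range(rank, nprocs, len(data))
    partial = sum(data[lo:hi])          # work on this rank's own block only
    if rank == 0:
        role = "root: would also collect the partial sums"
    else:
        role = "worker: would send its partial sum to rank 0"
    return partial, role


# Simulate 4 ranks one after another (stand-in for 4 real processes).
data = list(range(10))
partials = [spmd_program(r, 4, data)[0] for r in range(4)]
total = sum(partials)
```

Every rank executes the same function; only the rank number decides which block of the data it touches and which branch it takes.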

Programming Model • MPMD: Multiple programs multiple data • Different processors execute different programs, on different data • Usually a master-slave model is used. • Master CPU spawns and dispatches computations to slave CPUs running a different program. • Can be converted into SPMD model • if(my_cpu_id==0) run function_containing_program_1; • else run function_containing_program_2;
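As a rough illustration of the MPMD-to-SPMD conversion on this slide, the sketch below wraps two different "programs" in one entry point that branches on the processor ID. The functions `master_program`, `worker_program`, and `main` are hypothetical, and the hand-off of assignments through a function argument stands in for real communication.

```python
def master_program(tasks, nworkers):
    """Hypothetical program 1: deal tasks out to workers round-robin."""
    return {w: tasks[w - 1::nworkers] for w in range(1, nworkers + 1)}


def worker_program(my_tasks):
    """Hypothetical program 2: process the tasks assigned to this worker."""
    return [t * t for t in my_tasks]


def main(my_cpu_id, nprocs, tasks, assignments=None):
    """Single SPMD entry point that selects a 'program' by processor ID."""
    if my_cpu_id == 0:
        return master_program(tasks, nprocs - 1)
    return worker_program(assignments[my_cpu_id])


tasks = [1, 2, 3, 4, 5, 6]
assignments = main(0, 3, tasks)                # ID 0 runs program 1
results = {r: main(r, 3, tasks, assignments)   # IDs 1 and 2 run program 2
           for r in range(1, 3)}
```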

Classification of Parallel Computers • Flynn’s MIMD computers contain a wide variety of parallel computers • Based on memory organization (address space): • Shared-memory parallel computers • Processors can access all memories • Distributed-memory parallel computers • Processor can only access local memory • Remote memory access through explicit communication

Shared-Memory Parallel Computer • Superscalar processors with L2 cache connected to memory modules through a bus or crossbar • All processors have access to all machine resources including memory and I/O devices • SMP (symmetric multiprocessor): the processors are all the same and have equal access to machine resources, i.e. the machine is symmetric. • SMPs are UMA (Uniform Memory Access) machines • e.g., a node of an IBM SP machine; Sun Ultra Enterprise 10000 [Figure: prototype shared-memory parallel computer; processors (P), each with a cache (C), connected through a bus or crossbar to memory modules (M).]

Shared-Memory Parallel Computer • If bus: • Only one processor can access the memory at a time. • Processors contend for the bus to access memory • If crossbar: • Multiple processors can access memory through independent paths • Contention when different processors access the same memory module • A crossbar can be very expensive. • Processor count limited by memory contention and bandwidth • Maximum is usually 64 or 128 processors [Figure: two shared-memory layouts, one connecting processors to memory through a bus, the other through a crossbar.]

Shared-Memory Parallel Computer • Data flows from memory to cache, then to the processors • Performance depends dramatically on reuse of data in cache • Fetching data from memory, with potential memory contention, can be expensive • L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks

Cache Coherency • If a piece of data in one processor’s cache is modified, then every other processor’s cache that contains that data must be updated. • Cache coherency: the state achieved by maintaining consistent values of the same data in all processors’ caches. • Usually hardware maintains cache coherency; system software can also do this, but it is more difficult.
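The following toy model may help make the coherency requirement concrete. The slide does not name a protocol; this sketch assumes a write-through, write-invalidate scheme (stale copies are dropped and refetched on the next read) purely for illustration, and the `Memory` and `Cache` classes are hypothetical.

```python
class Memory:
    """Shared main memory plus a registry of the caches attached to it."""
    def __init__(self):
        self.data = {}
        self.caches = []


class Cache:
    """Private cache of one processor (toy model, no capacity limit)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        memory.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                    # miss: fetch from memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory.data[addr] = value                # write through to memory
        for other in self.memory.caches:
            if other is not self:
                other.lines.pop(addr, None)           # invalidate stale copies


mem = Memory()
mem.data[0] = 1
c1, c2 = Cache(mem), Cache(mem)
c1.read(0)
c2.read(0)            # both caches now hold address 0
c1.write(0, 5)        # c2's copy is invalidated
value_seen_by_c2 = c2.read(0)
```

Without the invalidation loop in `write`, the second cache would keep returning the old value, which is exactly the inconsistency that cache coherency rules out.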

Programming Shared-Memory Parallel Computers • All memory modules have the same global address space. • Closest to single-processor computer • Relatively easy to program. • Multi-threaded programming: • Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically; • Or use OpenMP; • Or use explicit POSIX (portable operating system interface) threads or other thread libraries. • Message passing: • MPI (Message Passing Interface).
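A minimal sketch of the multi-threaded style, using Python's standard `threading` module as a stand-in for POSIX threads or OpenMP. All threads read the same list directly because they share one address space; the function name `parallel_sum` is hypothetical. In standard CPython, threads generally do not speed up CPU-bound work, so this shows the programming model only, not a performance technique.

```python
import threading


def parallel_sum(data, nthreads):
    """Sum `data` with several threads that all see the same memory."""
    total = [0]                      # shared accumulator
    lock = threading.Lock()          # protects the shared accumulator

    def work(tid):
        partial = sum(data[tid::nthreads])   # each thread takes a strided share
        with lock:                           # avoid a race on total[0]
            total[0] += partial

    threads = [threading.Thread(target=work, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


result = parallel_sum(list(range(100)), 4)
```

No data is sent anywhere: the only coordination needed is the lock around the shared update.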

Distributed-Memory Parallel Computer • Superscalar processors with local memory, connected through a communication network. • Each processor can only work on data in its local memory • Access to remote memory requires explicit communication. • Present-day large supercomputers are all some form of distributed-memory machine • e.g., IBM SP, Blue Gene; Cray XT3/XT4 [Figure: prototype distributed-memory computer; processors (P1 ... Pn), each with its own memory (M), connected through a communication network.]

Distributed-Memory Parallel Computer • High scalability • No memory contention such as that in shared-memory machines • Now scaled to > 100,000 processors. • Performance of the network connection is crucial to the performance of applications. • Ideal: low latency, high bandwidth • Communication is much slower than local memory read/write • Data locality is important: frequently used data → local memory
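To see why both latency and bandwidth matter, one common first-order model (an assumption here, not taken from the slides) estimates the time to send a message as latency plus size divided by bandwidth. The function name and the numbers below (1 microsecond latency, 1 GB/s bandwidth) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: fixed latency plus size/bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


LATENCY = 1e-6       # hypothetical: 1 microsecond per message
BANDWIDTH = 1e9      # hypothetical: 1 GB/s

# Sending 8000 bytes as 1000 small messages versus one large message.
many_small = 1000 * transfer_time(8, LATENCY, BANDWIDTH)
one_large = transfer_time(8000, LATENCY, BANDWIDTH)
```

Under this model the many small messages are dominated by latency, while the single large message is dominated by bandwidth, which is one reason to aggregate communication where possible.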

Programming Distributed-Memory Parallel Computer • “Owner computes” rule • Problem needs to be broken up into independent tasks with independent memory • Each task assigned to a processor • Naturally matches data based decomposition such as a domain decomposition • Message passing: tasks explicitly exchange data by message passing. • Transfers all data using explicit send/receive instructions • User must optimize communications • Usually MPI (used to be PVM), portable, high performance • Parallelization mostly at large granularity level controlled by user • Difficult for compilers/auto-parallelization tools
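The "owner computes" rule with explicit send/receive can be sketched as follows. This is not MPI: threads and `queue.Queue` mailboxes from the Python standard library simulate processes and the network, and the names `run`, `send`, and `recv` are hypothetical. Each task works only on its own copy of its share of the data and exchanges results through messages.

```python
import queue
import threading


def run(nprocs, data):
    """Sum of squares where each rank owns a slice and sends its partial to rank 0."""
    mailboxes = [queue.Queue() for _ in range(nprocs)]   # one inbox per rank
    result = {}

    def send(dest, msg):
        mailboxes[dest].put(msg)

    def recv(rank):
        return mailboxes[rank].get()                     # blocks until a message arrives

    def task(rank):
        local = list(data[rank::nprocs])                 # private copy: "local memory"
        partial = sum(x * x for x in local)              # owner computes on its own data
        if rank == 0:
            total = partial
            for _ in range(nprocs - 1):
                total += recv(0)                         # explicit receive
            result["total"] = total
        else:
            send(0, partial)                             # explicit send

    threads = [threading.Thread(target=task, args=(r,)) for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


total = run(4, list(range(10)))
```

The user decides what is sent and when; nothing is shared implicitly, which is what makes the communication pattern something the programmer must optimize.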

Programming Distributed-Memory Parallel Computer • A global address space is provided on some distributed-memory machines • Memory is physically distributed, but globally addressable; the machine can be treated as a “shared-memory” machine; so-called distributed shared-memory. • Cray T3E; SGI Altix, Origin. • Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines • User accesses remote memory as if it were local; OS/compilers translate such accesses to fetch/store over the communication network. • But it is difficult to control data locality; performance may suffer. • NUMA (non-uniform memory access); ccNUMA (cache coherent non-uniform memory access); overhead

Hybrid Parallel Computer • Overall distributed memory, SMP nodes • Most modern supercomputers and workstation clusters are of this type • Message passing; or hybrid message passing/threading. • e.g., IBM SP, Cray XT3 [Figure: hybrid parallel computer; SMP nodes, each with processors (P) sharing memory (M) over a bus or crossbar, connected through a communication network.]

Interconnection Network/Topology • Nodes, links • Neighbors: nodes with a link between them • Degree of a node: number of neighbors it has • Scalability: increase in complexity when more nodes are added. [Figures: ring; fully connected network.]
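The definitions above can be checked on the two example topologies, ring and fully connected network, with a small sketch. Each topology is represented as a dictionary mapping a node to its set of neighbors; the function names are hypothetical.

```python
def ring(n):
    """Ring of n >= 3 nodes: each node links to its two adjacent nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def fully_connected(n):
    """Every node links directly to every other node."""
    return {i: set(range(n)) - {i} for i in range(n)}


def degree(topology, node):
    """Degree of a node: the number of neighbors it has."""
    return len(topology[node])


def link_count(topology):
    """Total number of links; each link is seen from both of its endpoints."""
    return sum(len(nbrs) for nbrs in topology.values()) // 2


ring_links = [link_count(ring(n)) for n in (4, 8, 16)]
full_links = [link_count(fully_connected(n)) for n in (4, 8, 16)]
```

Comparing `ring_links` with `full_links` illustrates the scalability point: the ring's link count grows linearly with the number of nodes, while the fully connected network's grows quadratically.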

Topology [Figure: hypercube.]

Topology [Figures: 1D/2D mesh/torus; 3D mesh/torus.]

Topology [Figures: tree; star.]

Topology • Bisection width: minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus/minus one node) • Bisection bandwidth: communication bandwidth across the links that are cut in defining the bisection width • Larger bisection bandwidth → better
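For small networks, the bisection width can be found by brute force directly from the definition: try every split into two halves and count the links crossing the split. The sketch below does this for a ring, a hypercube, and a fully connected network; the function names are hypothetical, and the exhaustive search is only practical for a handful of nodes.

```python
from itertools import combinations


def ring(n):
    """Ring of n >= 3 nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def hypercube(dim):
    """Hypercube with 2**dim nodes: neighbors differ in exactly one bit."""
    return {i: {i ^ (1 << b) for b in range(dim)} for i in range(1 << dim)}


def fully_connected(n):
    """Every node links directly to every other node."""
    return {i: set(range(n)) - {i} for i in range(n)}


def bisection_width(topology):
    """Minimum number of links cut over all splits into halves (sizes differ by at most one)."""
    nodes = sorted(topology)
    half = len(nodes) // 2
    best = None
    for group in combinations(nodes, half):
        side = set(group)
        # Count links with one endpoint inside `side` and the other outside.
        cut = sum(1 for u in side for v in topology[u] if v not in side)
        if best is None or cut < best:
            best = cut
    return best


widths = {
    "ring(8)": bisection_width(ring(8)),
    "hypercube(3)": bisection_width(hypercube(3)),
    "fully_connected(8)": bisection_width(fully_connected(8)),
}
```

All three networks have eight nodes, so the results can be compared directly as a rough measure of how much traffic each topology can carry between its two halves.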
