Multiprocessor Systems Performance Lecture: Speedup and Execution Time

Lecture 6: Performance of Multiprocessor Systems

Speedup
$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$
$t_s$: time for the serial part of the algorithm
$t_p$: time for the parallelizable part of the algorithm
$$T_1 = t_s + t_p, \qquad T_p = t_s + \frac{t_p}{p}$$
$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p/p}$$
[Figure: Speedup versus p, with the ideal speedup shown for comparison.]
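The speedup model on this slide can be evaluated directly. The following is a minimal sketch in C, assuming ts and tp are given in the same time unit; the function name `speedup` is chosen here for illustration and does not come from the lecture.

```c
#include <assert.h>

/* Speedup(p) = (ts + tp) / (ts + tp/p), the model from the slide.
 * ts: serial time, tp: parallelizable time, p: number of processors. */
double speedup(double ts, double tp, int p)
{
    assert(p >= 1 && ts >= 0.0 && tp >= 0.0 && ts + tp > 0.0);
    return (ts + tp) / (ts + tp / p);
}
```

For example, with ts = 1 and tp = 9 time units, `speedup(1.0, 9.0, 4)` evaluates to 10/3.25, roughly 3.08, well below the ideal value of 4.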

Amdahl’s Law
If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.
$$t_s = \frac{1}{s}\,T_1, \qquad t_p = \left(1 - \frac{1}{s}\right) T_1$$
$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$
$$\lim_{p \to \infty} \text{Speedup}(p) = s$$
[Figure: Speedup versus p, bounded by s.]
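Dividing the Amdahl expression through by T1 gives Speedup(p) = 1 / (1/s + (1 - 1/s)/p), which depends only on s and p. A minimal sketch in C follows; the name `amdahl_speedup` is an assumption made for illustration.

```c
#include <assert.h>

/* Amdahl's Law with the serial fraction expressed as 1/s:
 * Speedup(p) = 1 / (1/s + (1 - 1/s)/p). */
double amdahl_speedup(double s, int p)
{
    assert(s >= 1.0 && p >= 1);
    double serial = 1.0 / s;              /* serial fraction of T1 */
    return 1.0 / (serial + (1.0 - serial) / p);
}
```

With s = 10 (a 10% serial fraction), 4 processors give a speedup of about 3.08, and even a million processors stay just under the bound of 10.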

Speedup

Superlinear speedup
Speedup(p) > p → superlinear speedup
Reasons:
• Increased cache size
• Random algorithms
• Parallel algorithm

Speedup
$$\text{Speedup} = \frac{T_1}{T_p}$$
• Relative speedup: $T_1$ is the single-processor execution time of the parallel algorithm
• Absolute speedup: $T_1$ is the execution time of the best sequential algorithm on one processor

Efficiency
$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$
[Figure: Efficiency versus p, bounded above by 1.]
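Efficiency follows from two measured execution times. The sketch below is a minimal C version, assuming both times are in the same unit; the names `efficiency`, `t1`, and `t_par` are illustrative only.

```c
#include <assert.h>

/* Efficiency(p) = T1 / (p * Tp).
 * t1: time on one processor, t_par: time on p processors. */
double efficiency(double t1, double t_par, int p)
{
    assert(p >= 1 && t1 > 0.0 && t_par > 0.0);
    return t1 / (p * t_par);
}
```

For instance, a program that takes 100 s on one processor and 30 s on four has a speedup of about 3.33 and an efficiency of about 0.83.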


Gustafson’s Law
[Figure: work and time versus number of processors p (1 to 4) for the fixed-size and fixed-time models. Labels: ws and wp for serial and parallelizable work, ts, tp, and tp/p for serial and parallel time.]

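Gustafson’s Law
Scaled Speedup (Fixed-time Speedup)
Here $t_p$ is the time spent in the parallel part on p processors, so the same work takes $p \cdot t_p$ on one processor.
$$T_p = t_s + t_p, \qquad T_1 = t_s + p \cdot t_p$$
If the sequential component of an algorithm is 1/s of the program’s execution time on p processors:
$$t_s = \frac{1}{s}\,T_p, \qquad t_p = \left(1 - \frac{1}{s}\right) T_p$$
$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$
$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$
[Figure: Speedup versus p, with the ideal speedup shown for comparison.]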
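The scaled-speedup formula Speedup(p) = 1/s + (1 - 1/s)p grows linearly in p, unlike the Amdahl bound. A minimal sketch in C, with the function name `gustafson_speedup` assumed for illustration:

```c
#include <assert.h>

/* Gustafson's scaled speedup, with the serial fraction 1/s measured
 * on the p-processor run: Speedup(p) = 1/s + (1 - 1/s) * p. */
double gustafson_speedup(double s, int p)
{
    assert(s >= 1.0 && p >= 1);
    double serial = 1.0 / s;              /* serial fraction of Tp */
    return serial + (1.0 - serial) * p;
}
```

With s = 10 and p = 4 this gives 0.1 + 0.9 x 4 = 3.7, and with p = 100 it gives 90.1, so the speedup is not capped at s.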

Sizeup
$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$
ws: serial work
wp: parallelizable work
wp’: scaled parallelizable work
$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p \cdot w_p}{w_s + w_p}$$
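The closing expression on this slide, (ws + p.wp) / (ws + wp), can be computed from the two work amounts. This is a minimal sketch in C, assuming ws and wp are in the same arbitrary work unit; the function name `sizeup` is illustrative.

```c
#include <assert.h>

/* Sizeup = (ws + p * wp) / (ws + wp): the parallelizable work is
 * scaled by p while the serial work stays fixed. */
double sizeup(double ws, double wp, int p)
{
    assert(p >= 1 && ws >= 0.0 && wp >= 0.0 && ws + wp > 0.0);
    return (ws + p * wp) / (ws + wp);
}
```

With ws = 1, wp = 9, and p = 4, the sizeup is 37/10 = 3.7, which matches the scaled speedup for a serial fraction of 1/10.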

Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes the program accesses from main memory.
$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} \quad (\text{FLOPs/byte})$$

Roofline Performance Model
$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$
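The roofline bound is a single `min` over two products. The sketch below is a minimal C version, assuming peak performance in GFLOPs/second, bandwidth in GBytes/second, and intensity in FLOPs/byte; the function and parameter names are illustrative, and the numbers in the note are hypothetical, not measurements of any real machine.

```c
#include <assert.h>

/* Attainable GFLOPs/s = min(peak bandwidth * arithmetic intensity,
 *                           peak floating-point performance). */
double attainable_gflops(double peak_gflops, double peak_gbytes_per_s,
                         double flops_per_byte)
{
    assert(peak_gflops > 0.0 && peak_gbytes_per_s > 0.0
           && flops_per_byte >= 0.0);
    double memory_bound = peak_gbytes_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

On a hypothetical machine with a 100 GFLOPs/s peak and 20 GBytes/s of bandwidth, a kernel with intensity 0.5 FLOPs/byte is limited to 10 GFLOPs/s by memory, while a kernel with intensity 8 reaches the 100 GFLOPs/s compute ceiling.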

Roofline Performance Model
• Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).
• For multicore chips, peak performance is the collective performance of all the cores on the chip. If the system has several chips, multiply the peak per chip by the number of chips.
• Peak memory bandwidth is also given by the hardware specifications of the computer (Mbytes/second).
• The maximum floating-point performance that the memory system can support for a given arithmetic intensity can be plotted as Peak memory bandwidth x Arithmetic intensity: (bytes/second) x (FLOPs/byte) ==> FLOPs/second

Roofline Performance Model
• The roofline sets an upper bound on performance.
• The roofline of a computer does not vary by benchmark kernel.

Stream Benchmark
• A synthetic benchmark
• Measures the performance of long vector operations
• The operations have no temporal locality and access arrays that are larger than the cache size
http://www.cs.virginia.edu/stream/ref.html

    #define N 2000000
    ...
    /* Copy: c = a */
    void tuned_STREAM_Copy() {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            c[j] = a[j];
    }

    /* Scale: b = scalar * c */
    void tuned_STREAM_Scale(double scalar) {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            b[j] = scalar * c[j];
    }

    /* Add: c = a + b */
    void tuned_STREAM_Add() {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            c[j] = a[j] + b[j];
    }

    /* Triad: a = b + scalar * c */
    void tuned_STREAM_Triad(double scalar) {
        int j;
    #pragma omp parallel for
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];
    }
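To relate a STREAM kernel to memory bandwidth, the bytes moved per call are divided by the measured time. The sketch below is a minimal, self-contained C version of the Triad kernel with that bookkeeping. It assumes the usual STREAM accounting of three double-precision arrays touched per iteration (24 bytes) and MB = 10^6 bytes, and it ignores any extra write-allocate traffic. The names `triad`, `triad_bytes`, and `bandwidth_mb_per_s` are illustrative and are not part of the STREAM source.

```c
#include <assert.h>

/* Triad on caller-supplied arrays: a[j] = b[j] + scalar * c[j].
 * The pragma is ignored when the compiler is run without OpenMP. */
void triad(double *a, const double *b, const double *c,
           double scalar, long n)
{
    long j;
    assert(n >= 0);
#pragma omp parallel for
    for (j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Bytes counted per Triad call: two arrays read, one written. */
double triad_bytes(long n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for a measured time. */
double bandwidth_mb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * bytes / seconds;
}
```

Under this accounting each Triad iteration performs 2 floating-point operations on 24 bytes, an arithmetic intensity of about 0.083 FLOPs/byte, which is why the kernel is limited by memory bandwidth rather than by floating-point throughput.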
