
Lecture 6: Performance of Multiprocessor Systems



Presentation Transcript


  1. Lecture 6: Performance of Multiprocessor Systems

  2. Speedup
     $\text{Speedup} = \dfrac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \dfrac{T_1}{T_p}$
     $t_s$: time for the serial part of the algorithm
     $t_p$: time for the parallelizable part of the algorithm
     $T_1 = t_s + t_p$
     $T_p = t_s + t_p / p$
     $\text{Speedup}(p) = \dfrac{t_s + t_p}{t_s + t_p / p}$
     [Figure: Speedup versus p, with the ideal speedup marked]
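As a minimal sketch of the formula on this slide, the C function below evaluates Speedup(p) from the serial time ts and the parallelizable time tp. The function name `speedup` and the numbers in the note that follows are illustrative assumptions, not part of the lecture.

```c
#include <assert.h>

/* Speedup(p) = (ts + tp) / (ts + tp/p).
 * Assumes the parallelizable part divides evenly across p processors
 * with no communication or synchronization overhead.
 * The name `speedup` is illustrative, not taken from the lecture. */
double speedup(double ts, double tp, int p)
{
    assert(p >= 1 && ts >= 0.0 && tp >= 0.0 && ts + tp > 0.0);
    double t1    = ts + tp;      /* T1: execution time on one processor */
    double t_par = ts + tp / p;  /* Tp: execution time on p processors  */
    return t1 / t_par;
}
```

For example, with ts = 1, tp = 9 and p = 3, T1 = 10 and Tp = 4, so the speedup is 2.5. With ts = 0 the function returns p, the ideal speedup.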

  3. Amdahl’s Law
     If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.
     $t_s = (1/s)\, T_1$
     $t_p = (1 - 1/s)\, T_1$
     $\text{Speedup}(p) = \dfrac{T_1}{T_1/s + \dfrac{(1 - 1/s)\, T_1}{p}}$
     $\lim_{p \to \infty} \text{Speedup}(p) = s$
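The limit stated on this slide can be checked by simplifying the speedup expression. The following is a worked sketch using the slide's own symbols; the sample values s = 10, p = 16 and p = 1024 are illustrative assumptions.

```latex
\begin{align*}
\text{Speedup}(p)
  &= \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}
   = \frac{1}{\dfrac{1}{s} + \dfrac{s - 1}{s\,p}}
   = \frac{s\,p}{p + s - 1} \\[4pt]
\lim_{p \to \infty} \frac{s\,p}{p + s - 1} &= s \\[4pt]
s = 10,\ p = 16:   \quad \text{Speedup} &= \frac{160}{25} = 6.4 \\
s = 10,\ p = 1024: \quad \text{Speedup} &= \frac{10240}{1033} \approx 9.91
\end{align*}
```

Even with 1024 processors, a 10% sequential component keeps the speedup below 10.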

  4. Speedup

  5. Speedup

  6. Speedup

  7. Superlinear speedup
     Speedup(p) > p ⇒ superlinear speedup
     Reasons:
     • Increased cache size
     • Randomized algorithms
     • Parallel algorithm

  8. Speedup
     $\text{Speedup} = \dfrac{T_1}{T_p}$
     • Relative speedup: $T_1$ is the single-processor execution time of the parallel algorithm
     • Absolute speedup: $T_1$ is the execution time of the best sequential algorithm on one processor

  9. Efficiency
     $\text{Efficiency}(p) = \dfrac{\text{Speedup}(p)}{p} = \dfrac{T_1}{p \times T_p} \le 1$
     [Figure: Efficiency versus p, bounded above by 1]
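A minimal sketch of the efficiency formula in C follows. The function name `efficiency` and the parameter name `t_par` (standing for Tp, the time on p processors) are illustrative assumptions.

```c
#include <assert.h>

/* Efficiency(p) = T1 / (p * Tp).
 * t1    : execution time on one processor (T1)
 * t_par : execution time on p processors  (Tp)
 * The names are illustrative, not taken from the lecture. */
double efficiency(double t1, double t_par, int p)
{
    assert(p >= 1 && t1 > 0.0 && t_par > 0.0);
    return t1 / (p * t_par);
}
```

For example, T1 = 10 and Tp = 4 on p = 3 processors gives an efficiency of 10/12, about 0.83. A perfectly parallel run (T1 = 8, Tp = 2, p = 4) gives exactly 1.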

  10. Amdahl’s Law
     If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.
     $t_s = (1/s)\, T_1$
     $t_p = (1 - 1/s)\, T_1$
     $\text{Speedup} = \dfrac{T_1}{T_1/s + \dfrac{(1 - 1/s)\, T_1}{p}}$
     $\lim_{p \to \infty} \text{Speedup} = s$

  11. Gustafson’s Law
     [Figure: two pairs of panels, "Fixed size" and "Fixed time", plotting work and time against p = 1, 2, 3, 4. Labels: ws and wp for serial and parallelizable work, ts and tp for serial and parallel time, with the parallel time shown as tp/p in the fixed-size case.]

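  12. Gustafson’s Law
     Scaled speedup (fixed-time speedup)
     $T_p = t_s + t_p$
     $T_1 = t_s + p \cdot t_p$
     If the sequential component of an algorithm is 1/s of the program’s execution time on p processors:
     $t_s = (1/s)\, T_p$
     $t_p = (1 - 1/s)\, T_p$
     $\text{Speedup}(p) = \dfrac{T_1}{T_p} = 1/s + (1 - 1/s)\, p$
     $\lim_{p \to \infty} \text{Speedup}(p) = \infty$
     [Figure: Speedup versus p, with the ideal speedup marked]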
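The scaled speedup can be sketched in C as below. The function name `scaled_speedup` and the parameter `serial_fraction` (the slide's 1/s, measured on the p-processor run) are illustrative assumptions.

```c
#include <assert.h>

/* Gustafson's scaled speedup: Speedup(p) = 1/s + (1 - 1/s) * p,
 * where serial_fraction = 1/s is the share of the p-processor
 * execution time spent in the serial part.
 * The names are illustrative, not taken from the lecture. */
double scaled_speedup(double serial_fraction, int p)
{
    assert(p >= 1 && serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return serial_fraction + (1.0 - serial_fraction) * p;
}
```

With a serial fraction of 0.1 and p = 16 the scaled speedup is 0.1 + 0.9 x 16 = 14.5, and it keeps growing linearly with p. Under Amdahl’s fixed-size assumption the same fraction caps the speedup at s = 10.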

  13. Sizeup
     $\text{Sizeup} = \dfrac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$
     $w_s$: serial work
     $w_p$: parallelizable work
     $w_p'$: scaled parallelizable work
     $\text{Sizeup} = \dfrac{w_s + w_p'}{w_s + w_p} = \dfrac{w_s + p \cdot w_p}{w_s + w_p}$
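As a small sketch of the sizeup formula, assuming the parallelizable work scales linearly with the processor count (wp' = p x wp, as in the formula on this slide), one might write the following. The function name `sizeup` is an illustrative assumption.

```c
#include <assert.h>

/* Sizeup = (ws + p * wp) / (ws + wp).
 * ws : serial work
 * wp : parallelizable work on one processor
 * Assumes the scaled parallelizable work is wp' = p * wp.
 * The name `sizeup` is illustrative, not taken from the lecture. */
double sizeup(double ws, double wp, int p)
{
    assert(p >= 1 && ws >= 0.0 && wp >= 0.0 && ws + wp > 0.0);
    return (ws + p * wp) / (ws + wp);
}
```

For example, ws = 1, wp = 9 and p = 4 gives (1 + 36) / 10 = 3.7: in the same time, four processors handle a problem 3.7 times as large.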

  14. Roofline Performance Model
     Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes the program accesses from main memory.
     $\text{Arithmetic intensity} = \dfrac{\text{floating-point operations}}{\text{number of data bytes}}$ (FLOPs/byte)

  15. Roofline Performance Model
     $\text{Attainable GFLOPs/second} = \min(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance})$
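The roofline bound is a single `min`, which the sketch below spells out in C. The function name `attainable_gflops`, its parameter names, and the machine figures in the note that follows are hypothetical, chosen only to illustrate the two regimes.

```c
#include <assert.h>

/* Roofline bound:
 *   attainable = min(peak memory bandwidth * arithmetic intensity,
 *                    peak floating-point performance)
 * peak_gflops : peak floating-point performance, GFLOPs/second
 * peak_bw_gbs : peak memory bandwidth, GBytes/second
 * intensity   : arithmetic intensity, FLOPs/byte
 * All names are illustrative, not taken from the lecture. */
double attainable_gflops(double peak_gflops, double peak_bw_gbs,
                         double intensity)
{
    assert(peak_gflops > 0.0 && peak_bw_gbs > 0.0 && intensity >= 0.0);
    double memory_bound = peak_bw_gbs * intensity;  /* the slanted part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

On a hypothetical machine with a 16 GFLOPs/second peak and 10 GBytes/second of memory bandwidth, a kernel with intensity 0.5 FLOPs/byte is memory-bound at 5 GFLOPs/second, while a kernel with intensity 4 FLOPs/byte is compute-bound at 16 GFLOPs/second. The two limits meet at 16 / 10 = 1.6 FLOPs/byte.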

  16. Roofline Performance Model
     • Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).
     • For multicore chips, peak performance is the collective performance of all the cores on the chip. For a system with several chips, multiply the peak per chip by the number of chips.
     • Peak memory bandwidth is also given by the hardware specifications of the computer (Mbytes/second).
     • The maximum floating-point performance that the memory system can support for a given arithmetic intensity can be plotted as Peak memory bandwidth x Arithmetic intensity: (bytes/second) x (FLOPs/byte) ==> FLOPs/second

  17. Roofline Performance Model
     • The roofline sets an upper bound on performance.
     • The roofline of a computer does not vary by benchmark kernel.

  18. Stream Benchmark
     A synthetic benchmark that measures the performance of long vector operations.
     The operations have no temporal locality, and they access arrays that are larger than the cache size.
     http://www.cs.virginia.edu/stream/ref.html

         #define N 2000000
         ...
         /* Copy: c = a */
         void tuned_STREAM_Copy()
         {
             int j;
         #pragma omp parallel for
             for (j=0; j<N; j++)
                 c[j] = a[j];
         }

         /* Scale: b = scalar * c */
         void tuned_STREAM_Scale(double scalar)
         {
             int j;
         #pragma omp parallel for
             for (j=0; j<N; j++)
                 b[j] = scalar*c[j];
         }

         /* Add: c = a + b */
         void tuned_STREAM_Add()
         {
             int j;
         #pragma omp parallel for
             for (j=0; j<N; j++)
                 c[j] = a[j]+b[j];
         }

         /* Triad: a = b + scalar * c */
         void tuned_STREAM_Triad(double scalar)
         {
             int j;
         #pragma omp parallel for
             for (j=0; j<N; j++)
                 a[j] = b[j]+scalar*c[j];
         }
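To connect these kernels to the roofline model, the sketch below counts the bytes each loop iteration moves, one double per array touched, and derives a sustained bandwidth and an arithmetic intensity from that count. The helper names, the 8-byte `double` assumption, and the timing value in the note that follows are illustrative. Any extra write-allocate traffic generated by the hardware is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes counted per loop iteration: one double for every array the
 * kernel reads or writes (Copy and Scale touch 2 arrays, Add and
 * Triad touch 3). Helper names are illustrative. */
size_t stream_bytes_per_iter(int arrays_touched)
{
    assert(arrays_touched > 0);
    return (size_t)arrays_touched * sizeof(double);
}

/* Sustained memory bandwidth in bytes/second for n iterations that
 * took `seconds` of wall-clock time. */
double stream_bandwidth(int arrays_touched, size_t n, double seconds)
{
    assert(seconds > 0.0);
    return (double)(stream_bytes_per_iter(arrays_touched) * n) / seconds;
}

/* Arithmetic intensity of a kernel in FLOPs/byte. */
double kernel_intensity(int flops_per_iter, int arrays_touched)
{
    assert(flops_per_iter >= 0);
    return (double)flops_per_iter
         / (double)stream_bytes_per_iter(arrays_touched);
}
```

Triad performs 2 floating-point operations per iteration (one multiply, one add) and touches 3 arrays, so with 8-byte doubles its intensity is 2/24, about 0.083 FLOPs/byte, which places it on the memory-bound side of the roofline. If a Triad pass over N = 2000000 elements took a hypothetical 0.01 seconds, `stream_bandwidth(3, 2000000, 0.01)` would report 4.8e9 bytes/second.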
