
Parallel Processing


Presentation Transcript


  1. Parallel Processing Chapter 9

  2. Problem: • Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available • Solution:

  3. Problem: • Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available • Solution: • Divide the program into parts • Run each part on a separate CPU of a larger machine

  4. Motivations

  5. Motivations • Desktops are incredibly cheap • Custom high-performance uniprocessors are not • Hook up 100 desktops instead • Squeezing out more ILP is difficult

  6. Motivations • Desktops are incredibly cheap • Custom high-performance uniprocessors are not • Hook up 100 desktops instead • Squeezing out more ILP is difficult • More complexity/power required each time • Would require a change in cooling technology

  7. Challenges • Parallelizing code is not easy • Communication can be costly • Requires HW support

  8. Challenges • Parallelizing code is not easy • Languages, software engineering, and software verification issues – beyond the scope of this class • Communication can be costly • Requires HW support

  9. Challenges • Parallelizing code is not easy • Languages, software engineering, and software verification issues – beyond the scope of this class • Communication can be costly • Performance analysis that ignores caches understates these costs – they are much higher in practice • Requires HW support

  10. Challenges • Parallelizing code is not easy • Languages, software engineering, and software verification issues – beyond the scope of this class • Communication can be costly • Performance analysis that ignores caches understates these costs – they are much higher in practice • Requires HW support • Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder things

  11. Performance - Speedup • _____________________ • 70% of the program is parallelizable • What is the highest speedup possible? • What is the speedup with 100 processors?

  12. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • What is the speedup with 100 processors?

  13. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • 1 / (.30 + .70 / ∞) = 1 / .30 = 3.33 • What is the speedup with 100 processors?

  14. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • 1 / (.30 + .70 / ∞) = 1 / .30 = 3.33 • What is the speedup with 100 processors? • 1 / (.30 + .70 / 100) = 1 / .307 = 3.26
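A quick way to check these numbers is to plug the fractions straight into Amdahl's Law, speedup = 1 / ((1 - p) + p / n). A minimal C sketch (the function name and sample values are just for illustration):

      #include <stdio.h>

      /* Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
         p = parallelizable fraction, n = number of processors */
      double speedup(double p, double n) {
          return 1.0 / ((1.0 - p) + p / n);
      }

      int main(void) {
          printf("100 processors: %.2f\n", speedup(0.70, 100.0));   /* ~3.26 */
          printf("upper limit:    %.2f\n", 1.0 / (1.0 - 0.70));     /* ~3.33 */
          return 0;
      }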

  15. Taxonomy • SISD – single instruction, single data • SIMD – single instruction, multiple data • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  16. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  17. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  18. SIMD [diagram: one Controller driving an array of processor/data (P/D) pairs] • Controller fetches instructions • All processors execute the same instruction • Conditional instructions are the only way for behavior to vary
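To make the SIMD idea concrete, the MMX/SSE extensions mentioned on the taxonomy slides let a single instruction operate on several data elements at once. A minimal sketch using SSE2 intrinsics (assumes an x86 compiler with SSE2 support; the array contents are arbitrary):

      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stdio.h>

      int main(void) {
          int a[4] = { 1,  2,  3,  4};
          int b[4] = {10, 20, 30, 40};
          int c[4];

          __m128i va = _mm_loadu_si128((__m128i *)a);
          __m128i vb = _mm_loadu_si128((__m128i *)b);
          __m128i vc = _mm_add_epi32(va, vb);   /* one instruction, four additions */
          _mm_storeu_si128((__m128i *)c, vc);

          printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
          return 0;
      }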

  19. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  20. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • Never built – pipeline architectures?!? • MIMD – multiple instruction, multiple data

  21. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • Streaming apps? • MIMD – multiple instruction, multiple data • Most multiprocessors • Cheap, flexible

  22. Example • Sum the elements in A[] and place the result in sum

      int sum = 0;
      int i;
      for (i = 0; i < n; i++)
          sum = sum + A[i];

  23. Parallel Version: Shared Memory

  24. Parallel Version: Shared Memory

      int A[NUM];
      int numProcs;
      int sum;
      int sumArray[numProcs];

      myFunction( (input arguments) ) {
          int myNum = …….;                  /* this processor's index */
          int mySum = 0;
          for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
              mySum += A[i];
          sumArray[myNum] = mySum;          /* each processor writes its own slot */
          barrier();                        /* wait for all partial sums */
          if (myNum == 0) {
              for (i = 0; i < numProcs; i++)
                  sum += sumArray[i];
          }
      }
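The slide's pseudocode leaves out how the threads are created and how each one learns its index (the "……." in myNum). A minimal POSIX-threads version of the same idea is sketched below; NUM_PROCS, the thread-id array, and the use of pthread barriers are assumptions made for this sketch, not part of the slide:

      #include <pthread.h>
      #include <stdio.h>

      #define NUM        1000
      #define NUM_PROCS  4

      int A[NUM];
      int sumArray[NUM_PROCS];
      int sum = 0;
      pthread_barrier_t barrier;

      void *myFunction(void *arg) {
          int myNum = *(int *)arg;            /* this thread's index */
          int mySum = 0;
          for (int i = (NUM / NUM_PROCS) * myNum;
               i < (NUM / NUM_PROCS) * (myNum + 1); i++)
              mySum += A[i];
          sumArray[myNum] = mySum;            /* each thread writes its own slot */
          pthread_barrier_wait(&barrier);     /* wait until every partial sum is ready */
          if (myNum == 0)
              for (int i = 0; i < NUM_PROCS; i++)
                  sum += sumArray[i];
          return NULL;
      }

      int main(void) {
          pthread_t tid[NUM_PROCS];
          int ids[NUM_PROCS];
          for (int i = 0; i < NUM; i++) A[i] = 1;
          pthread_barrier_init(&barrier, NULL, NUM_PROCS);
          for (int i = 0; i < NUM_PROCS; i++) {
              ids[i] = i;
              pthread_create(&tid[i], NULL, myFunction, &ids[i]);
          }
          for (int i = 0; i < NUM_PROCS; i++)
              pthread_join(tid[i], NULL);
          printf("sum = %d\n", sum);          /* expect 1000 with all elements = 1 */
          return 0;
      }

Compile with -pthread; the barrier is what keeps thread 0 from reading the other partial sums before they are written.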

  25. Why Synchronization? • Why can’t you figure out when proc x will finish work?

  26. Why Synchronization? • Why can’t you figure out when proc x will finish work? • Cache misses • Different control flow • Context switches

  27. Supporting Parallel Programs • Synchronization • Cache Coherence • False Sharing

  28. Synchronization • Sum += A[i]; • Two processors, i = 0, i = 50 • Before the action: • Sum = 5 • A[0] = 10 • A[50] = 33 • What is the proper result?

  29. Synchronization • Sum = Sum + A[i]; • Assembly for this statement, assuming • A[i] is already in $t0 • &Sum is already in $s0

      lw  $t1, 0($s0)
      add $t1, $t1, $t0
      sw  $t1, 0($s0)

  30. Synchronization – Ordering #1 • Each processor executes: lw $t1, 0($s0) / add $t1, $t1, $t0 / sw $t1, 0($s0) • Both load Sum = 5; the i = 0 processor computes 5 + 10 = 15, the i = 50 processor computes 5 + 33 = 38 • The store of 15 happens first, then the store of 38: Sum ends up 38

  31. Synchronization – Ordering #2 • Same instructions, but the stores complete in the opposite order: 38 is stored first, then 15, so Sum ends up 15 • Neither ordering gives the intended 5 + 10 + 33 = 48

  32. Synchronization Problem • Reading and writing memory is a non-atomic operation • You cannot read and write a memory location in a single operation • We need hardware primitives that allow us to read and write without interruption
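Modern C exposes such uninterruptible read-modify-write primitives directly through <stdatomic.h>. This is a different interface from the swap instruction the next slides build on, but it illustrates what "read and write without interruption" buys you; a minimal sketch:

      #include <stdatomic.h>

      atomic_int Sum = 5;

      /* The load, add, and store happen as one indivisible operation,
         so two processors updating Sum can no longer lose an update. */
      void addElement(int value) {
          atomic_fetch_add(&Sum, value);
      }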

  33. Solution • Software • “lock” – function that allows one processor to proceed while all others loop • “unlock” – releases the next looping processor (or resets the lock so the next arriving processor can proceed) • Hardware • Provides primitives that read & write in a single operation, used to implement lock and unlock

  34. Software: Using lock and unlock

      lock(&balancelock)
      Sum += A[i]
      unlock(&balancelock)

  35. Hardware: Implementing lock & unlock • swap $1, 100($2) • Swaps the contents of $1 and M[$2 + 100]

  36. Hardware: Implementing lock & unlock with swap • If lock has 0, it is free • If lock has 1, it is held

      Lock: li   $t0, 1
      Loop: swap $t0, 0($a0)
            bne  $t0, $0, Loop

  37. Hardware: Implementing lock & unlock with swap • If lock has 0, it is free • If lock has 1, it is held

      Lock:   li   $t0, 1
      Loop:   swap $t0, 0($a0)
              bne  $t0, $0, Loop

      Unlock: sw   $0, 0($a0)
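The same spin-lock can be written in C using an atomic exchange, which plays the role of the swap instruction above. A sketch assuming C11 <stdatomic.h> (the names lock, unlock, and balancelock mirror the earlier slides):

      #include <stdatomic.h>

      atomic_int balancelock = 0;             /* 0 = free, 1 = held */

      void lock(atomic_int *l) {
          while (atomic_exchange(l, 1) != 0)  /* keep swapping in 1 until the old value is 0 */
              ;                               /* spin */
      }

      void unlock(atomic_int *l) {
          atomic_store(l, 0);                 /* like sw $0, 0($a0) */
      }

Used exactly as on slide 34: lock(&balancelock); Sum += A[i]; unlock(&balancelock);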

  38. Outline • Synchronization • Cache Coherence • False Sharing

  39.–49. Cache Coherence • P1 and P2 have write-back caches • The table shows where the current value of a lives after each step

      Step              P1$   P2$   DRAM
      Initially           *     *     7
      1. P2: Rd a         *     7     7
      2. P2: Wr a, 5      *     5     7
      3. P1: Rd a         5     5     5
      4. P2: Wr a, 3      5     3     5
      5. P1: Rd a

      AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
      What will P1 receive from its load? 5
      What should P1 receive from its load? 3

  50. Whatever are we to do? • Write-Invalidate • Invalidate that value in all others’ caches • Set the valid bit to 0 • Write-Update • Update the value in all others’ caches
