
Parallel Performance




Presentation Transcript


1. Parallel Performance

2. Performance Considerations
• I parallelized my code and …
• The code is slower than the sequential version, or
• I don't get enough speedup
• What should I do?

3. Performance Considerations
• Algorithmic bottlenecks
• Fine-grained parallelism
• True contention
• False contention
• Other bottlenecks

4. Algorithmic Bottlenecks
• Parallel algorithms might sometimes be completely different from their sequential counterparts
• Might need a different design and implementation

5. Algorithmic Bottlenecks
• Parallel algorithms might sometimes be completely different from their sequential counterparts
• Might need a different design and implementation
• Example: MergeSort (once again)

6. Recall from the Previous Lecture
• Merge was the bottleneck
[diagram: merge-sort recursion tree with per-level merge costs]

7. Most Efficient Sequential Merge is not Parallelizable

    // Each iteration consumes one element and depends on the pointers
    // advanced by the previous iteration, so this loop is inherently serial.
    // (End-pointer parameters are added here to make the slide's
    // <end condition> placeholder concrete.)
    void Merge(int* a, int* aEnd, int* b, int* bEnd, int* result) {
        while (a < aEnd && b < bEnd) {
            if (*a <= *b) { *result++ = *a++; }
            else          { *result++ = *b++; }
        }
        while (a < aEnd) { *result++ = *a++; }  // copy any remainder of a
        while (b < bEnd) { *result++ = *b++; }  // copy any remainder of b
    }

8. Parallel Merge Algorithm
• Merge two sorted arrays A and B using divide and conquer

9. Parallel Merge Algorithm
• Merge two sorted arrays A and B
• Let A be the larger array (else swap A and B), and let n be the size of A
• Split A into A[0…n/2-1], A[n/2], and A[n/2+1…n-1]
• Do a binary search to find the smallest m such that B[m] >= A[n/2]
• Split B into B[0…m-1] and B[m…]
• Return Merge(A[0…n/2-1], B[0…m-1]), A[n/2], Merge(A[n/2+1…n-1], B[m…])
• The two sub-merges write to disjoint output ranges, so they can run in parallel (see the sketch below)
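A minimal sketch of this divide-and-conquer merge, written in C# to match the later slides. The names (ParMerge, SeqMerge, LowerBound) and the 4096-element sequential cutoff are illustrative assumptions, not from the slides; Parallel.Invoke is the standard .NET call used to run the two recursive merges concurrently.

    using System;
    using System.Threading.Tasks;

    static class ParallelMergeSketch {
        // Merge sorted ranges a[aLo..aHi) and b[bLo..bHi) into result[rLo..].
        public static void ParMerge(int[] a, int aLo, int aHi,
                                    int[] b, int bLo, int bHi,
                                    int[] result, int rLo) {
            int n = aHi - aLo, m = bHi - bLo;
            if (n < m) {                        // keep A the larger array
                ParMerge(b, bLo, bHi, a, aLo, aHi, result, rLo);
                return;
            }
            if (n == 0) return;                 // both ranges are empty
            if (n + m <= 4096) {                // assumed cutoff: small merges stay sequential
                SeqMerge(a, aLo, aHi, b, bLo, bHi, result, rLo);
                return;
            }
            int mid = aLo + n / 2;                        // median element of A
            int split = LowerBound(b, bLo, bHi, a[mid]);  // smallest index with B[index] >= A[mid]
            int rMid = rLo + (mid - aLo) + (split - bLo);
            result[rMid] = a[mid];              // its final position is now known
            Parallel.Invoke(                    // the two halves touch disjoint output ranges
                () => ParMerge(a, aLo, mid, b, bLo, split, result, rLo),
                () => ParMerge(a, mid + 1, aHi, b, split, bHi, result, rMid + 1));
        }

        static int LowerBound(int[] b, int lo, int hi, int key) {
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (b[mid] < key) lo = mid + 1; else hi = mid;
            }
            return lo;
        }

        static void SeqMerge(int[] a, int aLo, int aHi,
                             int[] b, int bLo, int bHi,
                             int[] result, int rLo) {
            while (aLo < aHi && bLo < bHi)
                result[rLo++] = a[aLo] <= b[bLo] ? a[aLo++] : b[bLo++];
            while (aLo < aHi) result[rLo++] = a[aLo++];
            while (bLo < bHi) result[rLo++] = b[bLo++];
        }
    }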

10. Assignment 1 Extra Credit
• Implement the parallel merge algorithm and measure the performance improvement with your work-stealing queue implementation

11. Fine-Grained Parallelism
• Overheads of tasks:
• Each task uses some memory (though not as many resources as a thread)
• Work-stealing queue operations
• If the work done in each task is small, the overheads might not justify the improvements in parallelism (see the sketch below)
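A hedged illustration of this granularity trade-off: rather than spawning one task per element (where scheduling overhead dwarfs the single addition each task performs), a range partitioner hands each task a contiguous chunk, so the per-task cost is amortized over many elements. Partitioner.Create and the thread-local Parallel.ForEach overload are standard .NET APIs; the input size and the summing workload are assumed for illustration.

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    class GranularityDemo {
        static void Main() {
            int[] data = Enumerable.Range(0, 10000000).ToArray();
            long total = 0;

            // Chunked version: each task receives a contiguous index range,
            // paying the task overhead once per chunk instead of per element.
            Parallel.ForEach(
                Partitioner.Create(0, data.Length),   // contiguous index ranges
                () => 0L,                             // thread-local partial sum
                (range, state, local) => {
                    for (int i = range.Item1; i < range.Item2; i++)
                        local += data[i];
                    return local;
                },
                local => Interlocked.Add(ref total, local));  // combine once per thread

            Console.WriteLine(total);
        }
    }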

12. False Sharing: Data Locality & Cache Behavior
• Performance of a computation depends hugely on how well the cache is working
• Too many cache misses arise when processors are "fighting" for the same cache lines
• This happens even if they don't access the same data

13. Cache Coherence
• Each cache line, on each processor, has one of these states:
• i (invalid): not cached here
• s (shared): cached, but immutable
• x (exclusive): cached, and can be read or written
• State transitions require communication between caches (the cache coherence protocol)
• If a processor writes to a line, it removes it from all other caches
[diagram: per-processor cache line states (i/s/x) across P1, P2, P3]
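A short worked trace of the protocol above (the i/s/x states are from the slide; the particular sequence of events is an assumed example):

    P1 reads line L:   P1: i -> s                   (line fetched and cached)
    P2 reads line L:   P2: i -> s                   (both caches now share L)
    P1 writes line L:  P1: s -> x,  P2: s -> i      (P1 invalidates P2's copy)
    P2 reads line L:   P1: x -> s,  P2: i -> s      (miss; line shipped back)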

14. Ping-Pong & False Sharing
• Ping-pong: if two processors both keep writing to the same location, the cache line has to go back and forth between their caches
• Very inefficient (lots of cache misses)
• False sharing: two processors writing to two different variables may happen to write to the same cache line
• If both variables are allocated on the same cache line, you get the same ping-pong effect as above, and horrible performance

15. False Sharing Example

    void WithFalseSharing() {
        Random rand1 = new Random(), rand2 = new Random();
        int[] results1 = new int[20000000], results2 = new int[20000000];
        Parallel.Invoke(
            () => {
                for (int i = 0; i < results1.Length; i++)
                    results1[i] = rand1.Next();
            },
            () => {
                for (int i = 0; i < results2.Length; i++)
                    results2[i] = rand2.Next();
            });
    }

16. False Sharing Example

    void WithFalseSharing() {
        // rand1 and rand2 are allocated at the same time
        // => likely on the same cache line
        Random rand1 = new Random(), rand2 = new Random();
        int[] results1 = new int[20000000], results2 = new int[20000000];
        Parallel.Invoke(
            () => {
                for (int i = 0; i < results1.Length; i++)
                    results1[i] = rand1.Next();  // Next() writes to the Random
            },                                   // object => ping-pong effect
            () => {
                for (int i = 0; i < results2.Length; i++)
                    results2[i] = rand2.Next();
            });
    }

17. False Sharing, Eliminated?

    void WithoutFalseSharing() {
        int[] results1 = null, results2 = null;
        Parallel.Invoke(
            () => {
                // rand1 and rand2 are now allocated by different tasks
                // => not likely on the same cache line
                Random rand1 = new Random();
                results1 = new int[20000000];
                for (int i = 0; i < results1.Length; i++)
                    results1[i] = rand1.Next();
            },
            () => {
                Random rand2 = new Random();
                results2 = new int[20000000];
                for (int i = 0; i < results2.Length; i++)
                    results2[i] = rand2.Next();
            });
    }
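A further hedged sketch, not from the slides: relying on allocation order is fragile, so a more explicit remedy is to pad hot per-task state onto its own cache line. StructLayout and FieldOffset are standard .NET attributes; the 64-byte line size and the PaddedCounter name are assumptions for illustration.

    using System;
    using System.Runtime.InteropServices;
    using System.Threading.Tasks;

    // Assumed 64-byte cache lines: Size = 128 with the field at offset 64
    // means no two Value fields can share a line, even inside an array.
    [StructLayout(LayoutKind.Explicit, Size = 128)]
    struct PaddedCounter {
        [FieldOffset(64)] public long Value;
    }

    class PaddingDemo {
        static void Main() {
            var counters = new PaddedCounter[Environment.ProcessorCount];
            // Each worker writes only its own padded slot: no ping-pong.
            Parallel.For(0, counters.Length, w => {
                for (int i = 0; i < 1000000; i++)
                    counters[w].Value++;
            });
            Console.WriteLine(counters[0].Value);
        }
    }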
