Performance Optimizations for NUMA-Multicore Systems

Presentation Transcript

  1. Performance Optimizations for NUMA-Multicore Systems Zoltán Majó, Department of Computer Science, ETH Zurich, Switzerland

  2. About me • ETH Zurich: research assistant • Research: performance optimizations • Assistant: lectures • TUCN • Student • Communications Center: network engineer • Department of Computer Science: assistant

  3. Computing • Unlimited need for performance

  4. Performance optimizations • One goal: make programs run fast • Idea: pick good algorithm • Reduce number of operations executed • Example: sorting
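
A minimal C sketch of the comparison the next slides build up. The array size, the random input, and the use of the C library's qsort as the "good algorithm" are illustrative assumptions, not part of the slides:

    #include <stdio.h>
    #include <stdlib.h>

    /* O(n^2): compares neighbouring pairs, many operations for large n */
    static void bubble_sort(int *a, int n) {
        for (int i = 0; i < n - 1; i++)
            for (int j = 0; j < n - 1 - i; j++)
                if (a[j] > a[j + 1]) {
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
    }

    /* comparison callback for the library's O(n log n) qsort */
    static int cmp_int(const void *p, const void *q) {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }

    int main(void) {
        enum { N = 20000 };                /* illustrative problem size */
        int *a = malloc(N * sizeof *a);
        int *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++)
            a[i] = b[i] = rand();

        bubble_sort(a, N);                 /* ~N*N/2 comparisons      */
        qsort(b, N, sizeof *b, cmp_int);   /* ~N*log2(N) comparisons  */

        printf("results agree: %d\n", a[0] == b[0] && a[N - 1] == b[N - 1]);
        free(a); free(b);
        return 0;
    }

For large n, the O(n log n) algorithm performs far fewer operations than the O(n²) one, which is what the following charts quantify.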

  5. Sorting [chart: execution time [T] vs. number of operations]

  6. Sorting [chart: execution time of bubble sort]

  7. Sorting [chart: bubble sort vs. quicksort]

  8. Sorting [chart: quicksort is 11X faster than bubble sort]

  9. Sorting • We picked a good algorithm, work done • Are we really done? • We still have to make sure our algorithm runs fast • Operations take time • We assumed 1 operation = 1 time unit T

  10. Quicksort performance [chart: execution time [T] with 1 op = 1 T]

  11. Quicksort performance [chart: 1 op = 1 T vs. 1 op = 2 T]

  12. Quicksort performance [chart: 1 op = 1 T, 2 T, 4 T]

  13. Quicksort performance [chart: 1 op = 1 T, 2 T, 4 T, 8 T]

  14. Quicksort performance [chart: quicksort with 1 op = 8 T is only 32% faster than bubble sort with 1 op = 1 T]

  15. Latency of operations • Best algorithm not enough • Operations are executed on hardware [diagram: CPU pipeline with Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

  16. Latency of operations • Best algorithm not enough • Operations are executed on hardware • Hardware must be used efficiently [diagram: CPU pipeline with Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

  17. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusion • ETH scholarship

  18. Memory accesses [diagram: CPU accesses RAM directly, 230-cycle access latency]

  19. Memory accesses [diagram: CPU performs 16 accesses to RAM at 230 cycles each] Total access latency = 16 × 230 cycles = 3680 cycles

  20. Caching [diagram: CPU accesses RAM directly, 230-cycle access latency]

  21.–25. Caching [animation: a cache inserted between CPU and RAM; data is transferred in fixed-size blocks; 30-cycle access latency to the cache, 200-cycle access latency to RAM]

  26. Hits and misses • Cache hit: data in cache = 30 cycles • Cache miss: data not in cache = 230 cycles (30-cycle cache lookup + 200-cycle RAM access)

  27. Total access latency [diagram: the same 16 accesses, now with the cache] Total access latency = 4 misses + 12 hits = 4 × 230 cycles + 12 × 30 cycles = 1280 cycles
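
The slide's arithmetic reproduced in a few lines of C; the hit/miss latencies and the 4-miss/12-hit breakdown come from slides 26 and 27, the rest is just bookkeeping:

    #include <stdio.h>

    int main(void) {
        const int hit_latency  = 30;     /* cache hit: 30 cycles (slide 26)        */
        const int miss_latency = 230;    /* cache miss: 30 + 200 cycles (slide 26) */
        const int hits = 12, misses = 4; /* access pattern from slide 27           */

        int total   = misses * miss_latency + hits * hit_latency;
        double avg  = (double)total / (hits + misses);

        printf("total: %d cycles, average: %.0f cycles per access\n", total, avg);
        /* prints: total: 1280 cycles, average: 80 cycles per access */
        return 0;
    }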

  28. Benefits of caching • Comparison • Architecture w/o cache: T = 230 cycles • Architecture w/ cache: Tavg = 80 cycles → 2.7X improvement • Do caches always help? • Can you think of access pattern with bad cache usage?
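
One possible answer to the question above, sketched in C (the array size and element type are illustrative assumptions): traversing a row-major 2-D array column by column jumps over a whole row between consecutive accesses, so almost every access touches a new cache block, while a row-by-row traversal reuses each fetched block.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];   /* rows are contiguous in memory (row-major layout) */

    int main(void) {
        double sum = 0.0;

        /* cache-friendly: consecutive accesses fall into the same cache block */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* cache-unfriendly: stride of N * sizeof(double) bytes between accesses,
           so for large N nearly every access misses in the cache               */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }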

  29. Caching [animation: an access pattern with poor cache usage; cache and RAM access latencies as on the earlier caching slides]

  30. Cache-aware programming • Today’s example: matrix-matrix multiplication (MMM) • Number of operations: n³ • Compare naïve and optimized implementation • Same number of operations

  31. MMM: naïve implementation [diagram: C = A × B, element (i, j)]
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              sum = 0.0;
              for (k = 0; k < N; k++)
                  sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }

  32.–36. MMM [animation: counting cache hits and total accesses per array for the ijk version; accesses to A[][]: 3 hits out of 4, accesses to B[][]: 0 hits out of 4; cache: 30-cycle access latency, RAM: 200-cycle access latency]

  37. MMM: Cache performance • Hit rate • Accesses to A[][]: 3/4 = 75% • Accesses to B[][]: 0/4 = 0% • All accesses: 3/8 ≈ 38% • Can we do better?

  38. Cache-friendly MMM [diagram: C = A × B, traversed along k]
  Cache-unfriendly MMM (ijk):
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              sum = 0.0;
              for (k = 0; k < N; k++)
                  sum += A[i][k] * B[k][j];
              C[i][j] += sum;
          }
  Cache-friendly MMM (ikj):
      for (i = 0; i < N; i++)
          for (k = 0; k < N; k++) {
              r = A[i][k];
              for (j = 0; j < N; j++)
                  C[i][j] += r * B[k][j];
          }

  39. MMM [animation: counting cache hits and total accesses per array for the ikj version; accesses to C[][]: 3 hits out of 4, accesses to B[][]: 3 hits out of 4; cache: 30-cycle access latency, RAM: 200-cycle access latency]

  40. Cache-friendly MMM
  Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate
  Cache-friendly MMM (ikj): C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate
  Better performance due to cache-friendliness?
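
As a rough estimate of what these hit rates mean, assuming the 30-cycle hit and 230-cycle miss latencies from slide 26 (this back-of-the-envelope calculation is not from the slides): ijk averages about 0.38 × 30 + 0.62 × 230 ≈ 154 cycles per access, while ikj averages 0.75 × 30 + 0.25 × 230 = 80 cycles per access. The gap measured on the next slides is larger still, since a real memory hierarchy has more levels and effects than this two-level model.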

  41. Performance of MMM [chart: execution time [s]]

  42. Performance of MMM [chart: ikj is 20X faster than ijk]

  43. Cache-aware programming • Two versions of MMM: ijk and ikj • Same number of operations (~n³) • ikj is 20X better than ijk • Good performance depends on two aspects • Good algorithm • Implementation that takes the hardware into account • Hardware • Many possibilities for inefficiencies • We consider only the memory system in this lecture
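
A self-contained sketch for reproducing the comparison on your own machine; the matrix size, the timing via clock_gettime, and the initialization are assumptions added for illustration, while the two loop nests are the ones from slides 31 and 38. Compile with optimization (e.g. gcc -O2) and expect the ratio to vary with the machine's cache sizes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1024

    static double A[N][N], B[N][N], C[N][N];

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void mmm_ijk(void) {              /* cache-unfriendly version */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    static void mmm_ikj(void) {              /* cache-friendly version */
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double r = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += r * B[k][j];
            }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = rand() / (double)RAND_MAX;
                B[i][j] = rand() / (double)RAND_MAX;
            }

        double t0 = seconds(); mmm_ijk(); double t1 = seconds();

        /* reset C, since the ikj version accumulates into it */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        double t2 = seconds(); mmm_ikj(); double t3 = seconds();

        printf("ijk: %.2f s   ikj: %.2f s\n", t1 - t0, t3 - t2);
        return 0;
    }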

  44. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusions • ETH scholarship

  45. Cache-based architecture [diagram: CPU, L1 cache (10-cycle access latency), L2 cache (20-cycle access latency), bus controller, memory controller, RAM (200-cycle access latency)]

  46. Multi-core multiprocessor [diagram: two processor packages, each with four cores (CPUs), per-core L1 caches, shared L2 caches, a bus controller and a memory controller, all connected to RAM]

  47. Experiment • Performance of a well-optimized program • soplex from SPEC CPU 2006 • Multicore-multiprocessor systems are parallel • Multiple programs run on the system simultaneously • Contender program: milc from SPEC CPU 2006 • Examine 4 execution scenarios of soplex running together with milc
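
To set up execution scenarios like these, each benchmark can be pinned to a chosen core. Below is a minimal Linux sketch using sched_setaffinity; the wrapper-program approach and the choice of core numbers are assumptions for illustration, not taken from the slides. The same effect is usually achieved from the shell with taskset -c <core> <program>.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Pin the calling process to one core, then exec the benchmark.
       Run once for soplex and once for milc with different core numbers
       to place the two programs on the same or on different packages.  */
    int main(int argc, char **argv) {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <core> <program> [args...]\n", argv[0]);
            return 1;
        }
        int core = atoi(argv[1]);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        execvp(argv[2], &argv[2]);   /* run the benchmark on the chosen core */
        perror("execvp");
        return 1;
    }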

  48.–49. Execution scenarios [diagrams: soplex and milc placed on cores of Processor 0 and Processor 1 in different combinations; per-core L1 caches, shared L2 caches, bus and memory controllers, RAM]

  50. Performance with sharing: soplex [chart: execution time relative to solo execution]