
Performance Optimizations for NUMA-Multicore Systems


Presentation Transcript


  1. Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland

  2. About me • ETH Zurich: research assistant • Research: performance optimizations • Assistant: lectures • TUCN • Student • Communications Center: network engineer • Department of Computer Science: assistant

  3. Computing • Unlimited need for performance

  4. Performance optimizations • One goal: make programs run fast • Idea: pick good algorithm • Reduce number of operations executed • Example: sorting

  5.–8. Sorting [charts: execution time [T] vs. number of operations for bubble sort and quicksort; quicksort executes far fewer operations and runs 11X faster]
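
Note: to make the chart concrete, here is a minimal timing sketch (mine, not from the slides) comparing bubble sort against the C library's qsort; the input size N and the timer are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N 20000  /* arbitrary input size */

    /* O(n^2) comparisons: the "bad algorithm" baseline */
    static void bubble_sort(int *a, int n) {
        for (int i = 0; i < n - 1; i++)
            for (int j = 0; j < n - 1 - i; j++)
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
    }

    static int cmp_int(const void *x, const void *y) {
        return (*(const int *)x > *(const int *)y) -
               (*(const int *)x < *(const int *)y);
    }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) a[i] = rand();
        memcpy(b, a, N * sizeof *a);   /* same input for both sorts */

        double t0 = seconds();
        bubble_sort(a, N);                 /* ~N^2/2 operations   */
        double t1 = seconds();
        qsort(b, N, sizeof *b, cmp_int);   /* ~N log N operations */
        double t2 = seconds();

        printf("bubble sort: %.3f s, qsort: %.3f s\n", t1 - t0, t2 - t1);
        free(a); free(b);
        return 0;
    }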

  9. Sorting • We picked a good algorithm, work done • Are we really done? • Make sure our algorithm runs fast • Operations take time • We assumed 1 operation = 1 time unit T

  10.–14. Quicksort performance [charts: quicksort's execution time [T] as per-operation latency grows from 1 op = 1 T to 2 T, 4 T, and 8 T; at 1 op = 8 T, the chart marks bubble sort (1 op = 1 T) as 32% faster]

  15. Latency of operations • Best algorithm not enough • Operations are executed on hardware [diagram: CPU pipeline with Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

  16. Latency of operations • Best algorithm not enough • Operations are executed on hardware • Hardware must be used efficiently [diagram: same CPU pipeline]

  17. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusion • ETH scholarship

  18. Memory accesses [diagram: CPU accesses RAM directly, 230 cycles access latency]

  19. Memory accesses [diagram: CPU and RAM, 230 cycles access latency] Total access latency of 16 accesses = ? Total access latency = 16 x 230 cycles = 3680 cycles

  20.–25. Caching [animation: a cache (30 cycles access latency) is inserted between the CPU and RAM (200 cycles access latency); data moves from RAM into the cache one block at a time, so subsequent accesses to the same block are served by the cache]

  26. Hits and misses [diagram: CPU, cache (30 cycles), RAM (200 cycles)] • Cache hit: data in cache = 30 cycles • Cache miss: data not in cache = 30 + 200 = 230 cycles

  27. Total access latency [diagram: CPU, cache (30 cycles), RAM (200 cycles)] Total access latency = ? Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles

  28. Benefits of caching • Comparison • Architecture w/o cache: T = 230 cycles • Architecture w/ cache: Tavg = 1280 cycles / 16 accesses = 80 cycles → ~2.9X improvement • Do caches always help? • Can you think of an access pattern with bad cache usage?
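
Note: the 80-cycle average is just the hit and miss latencies weighted by their frequencies. A small sketch (my addition; the latencies are taken from the slides) that computes this average memory access time:

    #include <stdio.h>

    /* Average memory access time: hits cost the cache latency,
       misses additionally pay the RAM latency (30 + 200 = 230). */
    static double avg_latency(double hit_rate) {
        const double hit_cost  = 30.0;           /* cache hit         */
        const double miss_cost = 30.0 + 200.0;   /* check cache + RAM */
        return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
    }

    int main(void) {
        /* Slide 27: 12 hits out of 16 accesses -> 75% hit rate */
        printf("avg latency at 75%% hits: %.0f cycles\n", avg_latency(0.75));
        printf("avg latency at  0%% hits: %.0f cycles\n", avg_latency(0.0));
        return 0;
    }

avg_latency(0.75) reproduces the 80 cycles computed on slide 27; at a 0% hit rate the cache only adds overhead.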

  29. Caching [diagram: same cache hierarchy, cache at 30 cycles and RAM at 200 cycles; illustrates an access pattern with bad cache usage]
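
Note: one classic bad pattern (my illustration; the slides do not give code) is striding through memory by the cache block size, so every access lands in a block that is no longer cached:

    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK 64           /* assumed cache block size in bytes */
    #define N     (1 << 24)    /* 16 MB, larger than typical caches */

    int main(void) {
        char *buf = calloc(N, 1);
        long sum = 0;

        /* Sequential: many accesses per fetched block -> mostly hits */
        for (int i = 0; i < N; i++)
            sum += buf[i];

        /* Strided by the block size: each access touches a different
           block, which is evicted before it is revisited -> all misses */
        for (int off = 0; off < BLOCK; off++)
            for (int i = off; i < N; i += BLOCK)
                sum += buf[i];

        printf("%ld\n", sum);  /* keep the loops from being optimized out */
        free(buf);
        return 0;
    }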

  30. Cache-aware programming • Today’s example: matrix-matrix multiplication (MMM) • Number of operations: n^3 • Compare naïve and optimized implementation • Same number of operations

  31. MMM: naïve implementation [diagram: C = A x B; element C[i][j] combines row i of A with column j of B]

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
          sum += A[i][k] * B[k][j];   /* walks B column-wise */
        C[i][j] = sum;
      }

  32.–36. MMM [animation: counting cache hits over 4 consecutive accesses each to A[][] and B[][], with the cache at 30 cycles and RAM at 200 cycles; A[][], read row-wise, scores 3 hits out of 4; B[][], read column-wise, scores 0 hits out of 4]

  37. MMM: cache performance • Hit rate • Accesses to A[][]: 3/4 = 75% • Accesses to B[][]: 0/4 = 0% • All accesses: 3/8 = 38% • Can we do better?

  38. Cache-friendly MMM

  Cache-unfriendly MMM (ijk):

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
          sum += A[i][k] * B[k][j];   /* column-wise walk of B */
        C[i][j] = sum;
      }

  Cache-friendly MMM (ikj):

    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++) {
        r = A[i][k];
        for (j = 0; j < N; j++)
          C[i][j] += r * B[k][j];     /* row-wise walks of C and B */
      }

  [diagram: in ikj, row k of B updates row i of C, so all arrays are traversed row-wise]
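
Note: a minimal driver (my sketch, not from the lecture) to time the two variants; N and the timing method are arbitrary choices:

    #include <stdio.h>
    #include <time.h>

    #define N 1024  /* arbitrary matrix size */

    static double A[N][N], B[N][N], C[N][N];

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void mmm_ijk(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    static void mmm_ikj(void) {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double r = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += r * B[k][j];
            }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
            }

        double t0 = seconds(); mmm_ijk(); double t1 = seconds();

        /* reset C: the ikj variant accumulates into it */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] = 0.0;

        double t2 = seconds(); mmm_ikj(); double t3 = seconds();
        printf("ijk: %.2f s   ikj: %.2f s\n", t1 - t0, t3 - t2);
        return 0;
    }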

  39. MMM [animation: with ikj, both C[][] and B[][] score 3 hits out of 4 accesses, with the cache at 30 cycles and RAM at 200 cycles]

  40. Cache-friendly MMM • Cache-friendly MMM (ikj): C[][]: 3/4 = 75% hit rate, B[][]: 3/4 = 75% hit rate, all accesses: 75% hit rate • Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate, B[][]: 0/4 = 0% hit rate, all accesses: 38% hit rate • Better performance due to cache-friendliness?

  41.–42. Performance of MMM [charts: execution time [s] of the two variants; ikj is 20X faster than ijk]

  43. Cache-aware programming • Two versions of MMM: ijk and ikj • Same number of operations (~n^3) • ikj is 20X faster than ijk • Good performance depends on two aspects • Good algorithm • Implementation that takes hardware into account • Hardware • Many possibilities for inefficiencies • We consider only the memory system in this lecture

  44. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusions • ETH scholarship

  45. Cache-based architecture [diagram: CPU → L1 cache (10 cycles access latency) → L2 cache (20 cycles access latency) → bus controller → memory controller → RAM (200 cycles access latency)]
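
Note: latencies like these can be estimated with a pointer-chasing microbenchmark; a rough sketch (mine, not the lecture's). Each load depends on the previous one, so the time per iteration approximates the latency of the cache level that holds the working set. The working-set sizes below are guesses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Chase pointers through a random cycle of n slots. Sattolo's
       shuffle (j < i) guarantees a single cycle covering all slots. */
    static double chase(size_t n, size_t iters) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = next[p];               /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile size_t sink = p;      /* keep the chain from being optimized out */
        (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        free(next);
        return ns / iters;
    }

    int main(void) {
        /* Working sets sized to land in L1, L2, and RAM (sizes assumed) */
        printf("4 KB  (L1?):  %.1f ns/access\n", chase(1 << 9,  1 << 24));
        printf("256 KB (L2?): %.1f ns/access\n", chase(1 << 15, 1 << 24));
        printf("32 MB (RAM?): %.1f ns/access\n", chase(1 << 22, 1 << 24));
        return 0;
    }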

  46. Multi-core multiprocessor [diagram: two processor packages with four cores each; every core has a private L1 cache; pairs of cores share an L2 cache; each package has a bus controller and a memory controller connected to RAM]

  47. Experiment • Performance of a well-optimized program: soplex from SPEC CPU2006 • Multicore-multiprocessor systems are parallel • Multiple programs run on the system simultaneously • Contender program: milc from SPEC CPU2006 • Examine 4 execution scenarios

  48.–49. Execution scenarios [diagrams: soplex and milc placed on cores of Processor 0 and Processor 1; the scenarios differ in which cores, and hence which shared resources (L2 cache, package, memory controller), the two programs use]
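
Note: such placements can be forced with CPU affinity. A small sketch (my example; the lecture does not show code) that pins the calling process to one core via Linux's sched_setaffinity; the core numbering is hypothetical:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* run only on core 0 (hypothetical numbering) */

        /* pid 0 = the calling process */
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 0; exec the benchmark from here\n");
        /* e.g. execlp("./soplex", "./soplex", (char *)NULL); */
        return 0;
    }

From a shell, taskset -c 0 ./soplex and taskset -c 4 ./milc achieve the same placements without modifying the programs.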

  50. Performance with sharing: soplex [chart: soplex's execution time relative to solo execution in each scenario]
