# Performance Optimizations for NUMA-Multicore Systems

##### Presentation Transcript

1. Performance Optimizations for NUMA-Multicore Systems Zoltán Majó, Department of Computer Science, ETH Zurich, Switzerland

2. About me • ETH Zurich: research assistant (research: performance optimizations; assistant: lectures) • TUCN: student; network engineer at the Communications Center; assistant at the Department of Computer Science

3. Computing • Unlimited need for performance

4. Performance optimizations • One goal: make programs run fast • Idea: pick a good algorithm • Reduce the number of operations executed • Example: sorting

5.–8. Sorting [chart build: execution time (T) vs. number of operations; quicksort executes far fewer operations than bubble sort and is 11X faster]

9. Sorting • We picked a good algorithm, work done • Are we really done? • Make sure our algorithm runs fast • Operations take time • We assumed 1 operation = 1 time unit (T)

10.–14. Quicksort performance [chart build: execution time as per-operation latency grows from 1 op = 1 T to 1 op = 8 T; even at 1 op = 8 T, quicksort is only 32% faster than bubble sort at 1 op = 1 T]

15. Latency of operations • Best algorithm not enough • Operations are executed on hardware [diagram: CPU pipeline; Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

16. Latency of operations • Best algorithm not enough • Operations are executed on hardware • Hardware must be used efficiently [diagram: CPU pipeline; Stage 1: Dispatch operation, Stage 2: Execute operation, Stage 3: Retire operation]

17. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusion • ETH scholarship

18. Memory accesses [diagram: CPU accesses RAM with 230 cycles access latency]

19. Memory accesses [diagram: CPU accesses RAM, 230 cycles access latency] • Total access latency for 16 accesses = 16 × 230 cycles = 3680 cycles

20.–25. Caching [diagram build: a cache is inserted between CPU and RAM; cache access latency 30 cycles, RAM access latency 200 cycles; data moves from RAM into the cache in blocks of a fixed block size]

26. Hits and misses • Cache hit: data in cache = 30 cycles • Cache miss: data not in cache = 230 cycles (30-cycle cache access + 200-cycle RAM access)

27. Total access latency • Total access latency = 4 misses + 12 hits = 4 × 230 cycles + 12 × 30 cycles = 1280 cycles

28. Benefits of caching • Comparison • Architecture w/o cache: T = 230 cycles • Architecture w/ cache: Tavg = 1280 cycles / 16 accesses = 80 cycles → ~2.9X improvement • Do caches always help? • Can you think of an access pattern with bad cache usage?

29. Caching [diagram: CPU, cache (30 cycles access latency), RAM (200 cycles access latency); an access pattern with poor cache usage]

30. Cache-aware programming • Today's example: matrix-matrix multiplication (MMM) • Number of operations: n³ • Compare naïve and optimized implementations • Same number of operations

31. MMM: naïve implementation [diagram: C = A × B] for (i=0; i<N; i++) for (j=0; j<N; j++) { sum = 0.0; for (k=0; k<N; k++) sum += A[i][k]*B[k][j]; C[i][j] = sum; }

32.–36. MMM [diagram build: cache hits during MMM; per 4 accesses, A[][] scores 3 hits (row-wise accesses stay within a cache block) and B[][] scores 0 hits (column-wise accesses each touch a new block)]

37. MMM: cache performance • Hit rate • Accesses to A[][]: 3/4 = 75% • Accesses to B[][]: 0/4 = 0% • All accesses: 38% • Can we do better?

38. Cache-friendly MMM [diagram: C = A × B] • Cache-unfriendly MMM (ijk): for (i=0; i<N; i++) for (j=0; j<N; j++) { sum = 0.0; for (k=0; k<N; k++) sum += A[i][k]*B[k][j]; C[i][j] = sum; } • Cache-friendly MMM (ikj): for (i=0; i<N; i++) for (k=0; k<N; k++) { r = A[i][k]; for (j=0; j<N; j++) C[i][j] += r*B[k][j]; }

39. MMM [diagram: cache hits during ikj MMM; per 4 accesses, C[][] scores 3 hits and B[][] scores 3 hits]

40. Cache-friendly MMM • Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate • Cache-friendly MMM (ikj): C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate • Better performance due to cache-friendliness?

41.–42. Performance of MMM [chart: execution time in seconds; ikj is 20X faster than ijk]

43. Cache-aware programming • Two versions of MMM: ijk and ikj • Same number of operations (~n³) • ikj is 20X faster than ijk • Good performance depends on two aspects • A good algorithm • An implementation that takes hardware into account • Hardware: many possibilities for inefficiencies • We consider only the memory system in this lecture

44. Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusions • ETH scholarship

45. Cache-based architecture [diagram: CPU, L1 cache (10 cycles access latency), L2 cache (20 cycles access latency), bus controller, memory controller, RAM (200 cycles access latency)]

46. Multi-core multiprocessor [diagram: two processor packages, each with four cores (private L1 caches, shared L2 caches), a bus controller, and a memory controller connected to RAM]

47. Experiment • Performance of a well-optimized program: soplex from SPEC CPU 2006 • Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously • Contender program: milc from SPEC CPU 2006 • Examine 4 execution scenarios

48.–49. Execution scenarios [diagram: soplex and milc mapped onto cores of Processor 0 and Processor 1 in different placements]

50. Performance with sharing: soplex [chart: execution time relative to solo execution]