Optimizing Process and Thread Placement on Zeus for Enhanced Performance

Controlling Process/Thread Placement on Zeus
Raghu Reddy (Raghu.Reddy@noaa.gov)
NCEP MTT Meeting

Placement example
• GFS, T574 case, with 241 MPI ranks
• Nodes=121, PPN=2, NUMTHD=5
• 2 cores left idle
• Artificial example to illustrate the problem
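The core accounting behind this example can be sketched in a few lines of Python. This is only an illustration: the function name `node_layout` is hypothetical, and the 12-cores-per-node figure is taken from the node description in this talk (two sockets, six cores each).

```python
import math

# Assumption: 12 cores per node (two sockets x six cores), per the node slide.
CORES_PER_NODE = 12

def node_layout(ranks, ppn, threads_per_rank, cores_per_node=CORES_PER_NODE):
    """Return (nodes needed, busy cores per node, idle cores per node)."""
    nodes = math.ceil(ranks / ppn)       # round up so every rank gets a slot
    busy = ppn * threads_per_rank        # one core per OpenMP thread
    if busy > cores_per_node:
        raise ValueError("ppn * threads exceeds the cores on a node")
    return nodes, busy, cores_per_node - busy

nodes, busy, idle = node_layout(ranks=241, ppn=2, threads_per_rank=5)
print(nodes, busy, idle)  # 121 10 2
```

With 241 ranks at 2 ranks per node, 121 nodes are needed, and 2 x 5 = 10 of the 12 cores on each node are busy, which leaves the 2 idle cores mentioned above.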

Default or “basic” omplace
qsub -l nodes=121:ppn=2 jobfile
mpiexec_mpt -np 241 omplace -nt 5 ./a.out

One Zeus node
• Two sockets, six cores each (12 cores per node)
• Each socket has its own memory (12 GB each, 24 GB total)
• QPI makes it possible for one socket to access the other’s memory
• Non-Uniform Memory Access (NUMA): different bandwidth and different latency
• Memory bandwidth is shared by the 6 cores on a socket

Omplace with bs and st set

Controlling placement
• bs = (ppn*numthds)/2
• mpiexec_mpt -np 241 omplace -c 0-:bs=$bs+st=6 -nt 5 ./a.out
• -c: CPU list
• bs: block size
• st: stride
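As a rough sketch, the block-size formula and the resulting command line can be assembled like this. The helper names `block_size` and `omplace_command` are hypothetical, the stride default of 6 mirrors the command on the slide, and rejecting an odd ppn*numthds product is an assumption of this sketch, not something the slides specify. The code only builds the command string; it does not launch anything.

```python
def block_size(ppn, numthds):
    """bs = (ppn * numthds) / 2, as given on the slide."""
    total = ppn * numthds
    if total % 2:
        # Assumption of this sketch: an odd total has no even two-way split.
        raise ValueError("ppn * numthds must be even")
    return total // 2

def omplace_command(nranks, ppn, numthds, stride=6, exe="./a.out"):
    """Build the mpiexec_mpt/omplace command line shown on the slide."""
    bs = block_size(ppn, numthds)
    return (f"mpiexec_mpt -np {nranks} omplace "
            f"-c 0-:bs={bs}+st={stride} -nt {numthds} {exe}")

print(omplace_command(nranks=241, ppn=2, numthds=5))
# mpiexec_mpt -np 241 omplace -c 0-:bs=5+st=6 -nt 5 ./a.out
```

For the example in this talk (ppn=2, numthds=5) the block size is 5, so each block of five threads fits within one six-core socket.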

Impact of Proper Placement?
• Depends on the application
• If the application is not memory-bandwidth limited, the impact is minimal (example: DGEMM)
• Memory-bandwidth-limited applications will see a good benefit (example: STREAM)
• Most applications fall in between

Impact of Placement: Kernels
• HPCC Benchmark
• MPI benchmark to characterize performance
• “Single” (we will ignore this for today)
• “Star”: independent copy per core

Controlling placement: GFS
• bs = (ppn*numthds)/2
• mpiexec_mpt -np 241 omplace -c 0-:bs=$bs+st=6 -nt 5 ./a.out

Run times (start, end, elapsed seconds):
fe2% dd 20:13:36 20:17:39 nt-5-ppn-2-noomplace
243 nt-5-ppn-2-noomplace
fe2% dd 16:26:01 16:28:47 nt-5-ppn-2
166 nt-5-ppn-2
fe2% dd 16:30:32 16:33:00 nt-5-ppn-2-bs-5
148 nt-5-ppn-2-bs-5
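The elapsed times above can be reproduced from the start and end timestamps with a short Python sketch. The helper `elapsed_seconds` is hypothetical and assumes each run starts and ends on the same day; the timestamps and run labels are copied from the slide.

```python
from datetime import datetime

def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Start and end times as listed on the slide.
runs = {
    "nt-5-ppn-2-noomplace": ("20:13:36", "20:17:39"),
    "nt-5-ppn-2": ("16:26:01", "16:28:47"),
    "nt-5-ppn-2-bs-5": ("16:30:32", "16:33:00"),
}

times = {name: elapsed_seconds(*span) for name, span in runs.items()}
baseline = times["nt-5-ppn-2-noomplace"]
for name, secs in times.items():
    print(f"{name}: {secs} s, {baseline / secs:.2f}x vs. no omplace")
```

On these numbers, basic omplace brings the run from 243 s to 166 s, and setting bs and st brings it down further to 148 s.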

A More Practical Example
• Your MPI (non-threaded) application needs more memory than what is available per core
• So you have to use ppn=6 instead of ppn=12
• If you run it without omplace, all ranks would be put on 1 socket
• 6 ranks on one socket, 0 ranks on the second!
• Use bs = 3 (ppn*numthd/2)
• This will put 3 ranks on each socket
• Improves memory bandwidth
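To see why bs = 3 with a stride of 6 splits the ranks evenly, the following Python sketch expands a block/stride CPU list and counts CPUs per socket. It rests on two assumptions that are not stated in the slides: that the list means "take bs consecutive CPUs, starting a new block every st CPUs", and that CPUs 0-5 sit on socket 0 and CPUs 6-11 on socket 1. Actual CPU numbering on a given system may differ. The function names are hypothetical.

```python
CORES_PER_SOCKET = 6  # assumption: CPUs 0-5 on socket 0, CPUs 6-11 on socket 1

def expand_cpu_list(bs, st, ncpus=12, start=0):
    """CPUs selected by taking `bs` consecutive CPUs every `st` CPUs."""
    cpus = []
    for block_start in range(start, ncpus, st):
        for cpu in range(block_start, block_start + bs):
            if cpu < ncpus:
                cpus.append(cpu)
    return cpus

def per_socket(cpus):
    """Count how many of the given CPUs fall on each socket."""
    counts = {}
    for cpu in cpus:
        socket = cpu // CORES_PER_SOCKET
        counts[socket] = counts.get(socket, 0) + 1
    return counts

packed = list(range(6))               # assumed layout without omplace: CPUs 0-5
spread = expand_cpu_list(bs=3, st=6)  # layout with bs=3, st=6

print(packed, per_socket(packed))     # [0, 1, 2, 3, 4, 5] {0: 6}
print(spread, per_socket(spread))     # [0, 1, 2, 6, 7, 8] {0: 3, 1: 3}
```

Under these assumptions the six ranks land three per socket, so each socket's memory bandwidth is shared by three ranks instead of six.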

Test run of GFS (no OpenMP)
• GFS, T574 case, with 241 MPI ranks
• Nodes=41, PPN=6, NUMTHD=1

noomp runs
------------
fe2% dd 13:10:50 13:16:35 nt-1-ppn-6
345 nt-1-ppn-6
fe2% dd 13:26:55 13:32:13 nt-1-ppn-6-nt-2
318 nt-1-ppn-6-nt-2
fe2% dd 13:57:15 14:02:27 nt-1-ppn-6-bs-3
312 nt-1-ppn-6-bs-3

GFS: Nodes=41, PPN=6, THDS=2
fe2% dd 14:51:07 14:54:52 nt-2-ppn-6 (no TAU)
225 nt-2-ppn-6
fe2% dd 14:27:57 14:32:17 tau-3files-nt-2-ppn-6
260 tau-3files-nt-2-ppn-6

Summary
• Under certain circumstances, proper placement can be beneficial
• In general, if you’re using all the available cores, placement may not be important
• It can be significant if you are leaving some cores idle
• Especially so if you idle cores and use “remote” memory

Questions?
• Thanks!
