Performance Oriented MPI Jeffrey M. Squyres Andrew Lumsdaine NERSC/LBNL and U. Notre Dame
Overview • Overview and History of MPI • Performance Oriented Point to Point • Collectives, Data Types • Diagnostics and Tuning • Rules of Thumb and Gotchas
Scope of This Talk • Beginning to intermediate user • General principles and rules of thumb • When and where performance might be available • Omit (advanced) low-level issues
Overview and History of MPI • Library (not language) specification • Goals • Portability • Efficiency • Functionality (small and large) • Safety (communicators) • Conservative (current best practices)
Performance in MPI • MPI includes many performance-oriented features • These features are only potentially high-performance • The standard seeks not to preclude performance, but it does not mandate it • Progress might only be made during MPI function calls
(Potential) Performance Features • Non-blocking operations • Persistent operations • Collective operations • MPI Datatypes
Basic Point to Point • “Six function MPI” includes • MPI_Send() • MPI_Recv() • These are useful, but there is more
Basic Point to Point

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
    }
Non-Blocking Operations • MPI_Isend() • MPI_Irecv() • “I” is for immediate • Paired with MPI_Test()/MPI_Wait()
Non-Blocking Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    } else {
        MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    }
Persistent Operations • MPI_Send_init() • MPI_Recv_init() • Creates a request but does not start it • MPI_Start() begins the communication • A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
    else
        MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
    /* … */
    for (i = 0; i < n; i++) {
        MPI_Start(&request);
        /* Do some work */
        MPI_Wait(&request, &status);
    }
Collective Operations • May be layered on point to point • May use tree communication patterns for efficiency • Synchronization! (No non-blocking collectives)
Collective Operations

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

A naive point-to-point reduction takes O(P) steps; a tree-based implementation takes O(log P).
MPI Datatypes • May allow MPI to send a message directly from memory • May avoid copying/packing • (General) high performance implementations not widely available
Quiz: MPI_Send() • After I call MPI_Send() • The recipient has received the message • I have sent the message • I can write to the message buffer without corrupting the message • I can write to the message buffer
Sidenote: MPI_Ssend() • MPI_Ssend() has the (perhaps) expected semantics • When MPI_Ssend() returns, the recipient has received the message • Useful for debugging (replace MPI_Send() with MPI_Ssend())
Quiz: MPI_Isend() • After I call MPI_Isend() • The recipient has started to receive the message • I have started to send the message • I can write to the message buffer without corrupting the message • None of the above (I must call MPI_Test() or MPI_Wait())
Quiz: MPI_Isend() • True or False • I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait() • False (in many/most cases)
Communication is Still Computation • A CPU, usually the main one, must do the communication work • Part of your process (inside MPI calls) • Another process on main CPU • Another thread on main CPU • Another processor
No Free Lunch • Part of your process (most common) • Fast but no overlap • Another process (daemons) • Overlap, but slow (extra copies) • Another thread (rare) • Overlap and fast, but difficult • Another processor (emerging) • Overlap and fast, but more hardware • E.g., Myrinet/GM, VIA
How Do I Get Performance? • Minimize time spent communicating • Minimize data copies • Minimize synchronization • I.e., time waiting for communication
Minimizing Communication Time • Bandwidth • Latency
Minimizing Latency • Collect small messages together (if you can) • One 1024-byte message instead of 1024 one-byte messages • Minimize other overhead (e.g., copying) • Overlap with computation (if you can)
Naïve Approach

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Send(…);
        for (i = 0; i < 4; i++)
            MPI_Recv(…);
    }
Naïve Approach • Deadlock! (Maybe) • Can fix with careful coordination of receiving versus sending on alternate processes • But this can still serialize
MPI_Sendrecv()

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Sendrecv(…);
        }
    }
Immediate Operations

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Isend(…);
            MPI_Irecv(…);
        }
        MPI_Waitall(…);
    }
Receive Before Sending

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Irecv(…);
        for (i = 0; i < 4; i++)
            MPI_Isend(…);
        MPI_Waitall(…);
    }
Persistent Operations

    for (i = 0; i < 4; i++) {
        MPI_Recv_init(…);
        MPI_Send_init(…);
    }
    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        MPI_Startall(…);
        MPI_Waitall(…);
    }
Overlapping

    while (!done) {
        MPI_Startall(…);            /* Start exchanges */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* As information arrives */
            do_received_red(D);     /* Process */
        }
        MPI_Startall(…);
        do_inner_black(D);
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);
            do_received_black(D);
        }
    }
Advanced Overlap

    MPI_Startall(…);                /* Start all receives */
    /* … */
    while (!done) {
        MPI_Startall(…);            /* Start sends */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* Wait on receives */
            if (received) {
                do_received_red(D); /* Process */
                MPI_Start(…);       /* Restart receive */
            }
        }
        /* Repeat for black */
    }
MPI Data Types • MPI_Type_vector • MPI_Type_struct • Etc. • MPI_Pack might be better
Minimizing Synchronization • At synchronization point (e.g., with collective communication) all processes must arrive at collective call • Can spend lots of time waiting • This is often an algorithmic issue • E.g., check for convergence every 5 iterations instead of every iteration
Gotchas • MPI_Probe() • Guarantees an extra memory copy • MPI_ANY_SOURCE • Can cause additional (internal) looping • MPI_Alltoall() • All pairs must communicate • Synchronization (avoid in general)
Diagnostic Tools • TotalView • Prism • Upshot • XMPI
Summary • Receive before sending • Collect small messages together • Overlap (if possible) • Use immediate operations • Use persistent operations • Use diagnostic tools