
CSS434 Group Communication and MPI Textbook Ch4.4-5 and 15.4






Presentation Transcript


1. CSS434 Group Communication and MPI
Textbook Ch4.4-5 and 15.4
Professor: Munehiro Fukuda

2. Outline
• Reliability of group communication
  • IP multicast
  • Atomic multicast
• Ordering of group communication
  • Absolute ordering
  • Total ordering
  • Causal ordering
  • FIFO ordering
• MPI Java
  • Programming, compilation, and invocation
  • Major group communication functions: Bcast( ), Reduce( ), Allreduce( )

3. Group Communication
• Communication types
  • One-to-many: broadcast
  • Many-to-one: synchronization, collective communication
  • Many-to-many: gather and scatter
• Applications
  • Fault tolerance based on replicated services (type: one-to-many)
  • Finding the discovery servers in spontaneous networking (type: a one-to-many request, a many-to-one response)
  • Better performance through replicated data (type: one-to-many, many-to-one)
  • Propagation of event notifications (type: one-to-many)

4. IP (Unreliable) Multicast
Using a special network address: IP class D and UDP.

import java.net.*;
import java.io.*;

public class MulticastPeer {
    public static void main(String args[]) {
        // args give message contents & destination multicast group (e.g. "228.5.6.7")
        MulticastSocket s = null;
        try {
            InetAddress group = InetAddress.getByName(args[1]);
            s = new MulticastSocket(6789);
            s.joinGroup(group);
            byte[] m = args[0].getBytes();
            DatagramPacket messageOut = new DatagramPacket(m, m.length, group, 6789);
            s.send(messageOut);
            // get messages from others in the group
            byte[] buffer = new byte[1000];
            for (int i = 0; i < 3; i++) {
                DatagramPacket messageIn = new DatagramPacket(buffer, buffer.length);
                s.receive(messageIn);
                System.out.println("Received: " + new String(messageIn.getData()));
            }
            s.leaveGroup(group);
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (s != null) s.close();
        }
    }
}
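As a usage sketch (the host names and message text below are made up for illustration), two peers on different machines join the same class D group and see each other's datagrams:

  [mfukuda@UW1-320-00 mfukuda]$ javac MulticastPeer.java
  [mfukuda@UW1-320-00 mfukuda]$ java MulticastPeer "hello from 00" 228.5.6.7
  [mfukuda@UW1-320-01 mfukuda]$ java MulticastPeer "hello from 01" 228.5.6.7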

5. Reliability and Ordering
• Fault tolerance based on replicated services
  • Send-to-all and all-reliable semantics
• Finding the discovery servers in spontaneous networking
  • 1-reliable semantics
• Better performance through replicated data
  • Semantics depend on the application (send-to-all and all- or m-out-of-n-reliable semantics)
• Propagation of event notifications
  • In general, send-to-all and all-reliable semantics

6. Atomic Multicast
• Send-to-all and all-reliable semantics
• Simple emulation:
  • A repetition of one-to-one communication with acknowledgment
• What if a receiver fails?
  • Time-out and retransmission
• What if a sender fails before all receivers receive the message?
  • All receivers forward the message to the same group and thereafter deliver it to themselves.
  • A receiver discards the second and any following copies of the same message. (A sketch of this rule follows.)
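A minimal sketch of the forward-then-deliver rule above; the message-id scheme, multicastToGroup, and deliver are hypothetical placeholders, not part of any particular library:

import java.util.HashSet;
import java.util.Set;

// Receiver side of atomic multicast: on first receipt, re-multicast the
// message to the whole group, then deliver it locally; discard later
// duplicates. Forwarding first ensures the message survives a sender
// that crashed after reaching only some receivers.
class AtomicReceiver {
    private final Set<String> delivered = new HashSet<>();  // ids already seen

    void onReceive(String messageId, byte[] payload) {
        if (!delivered.add(messageId))
            return;                            // duplicate copy: discard silently
        multicastToGroup(messageId, payload);  // forward to the group first
        deliver(payload);                      // then deliver to the local application
    }

    void multicastToGroup(String id, byte[] payload) { /* send to all members */ }
    void deliver(byte[] payload)                     { /* hand to application */ }
}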

7. Message Ordering
• R1 and R2 receive m1 and m2 in different orders!
• Some message ordering is required:
  • Absolute ordering
  • Consistent/total ordering
  • Causal ordering
  • FIFO ordering
[Figure: senders S1 and S2 each multicast a message (m1 and m2) to receivers R1 and R2; R1 receives m1 before m2, while R2 receives m2 before m1.]

8. Absolute Ordering
• Rule:
  • mi must be delivered before mj if Ti < Tj
• Implementation:
  • A clock synchronized among machines
  • A sliding time window used to commit delivery of messages whose timestamps fall within the window (a sketch follows this list)
• Example:
  • Distributed simulation
• Drawbacks:
  • Too strict a constraint
  • No absolutely synchronized clock
  • No guarantee of catching all tardy messages
[Figure: with Ti < Tj, both receivers deliver mi before mj.]
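A sketch of the window-based commit, assuming synchronized clocks and a made-up window size; TimedMessage and deliver are hypothetical placeholders:

import java.util.PriorityQueue;

// Hold incoming messages in timestamp order and deliver only those older
// than the window, in the hope that any tardy message arrives within DELTA.
class AbsoluteOrderBuffer {
    static class TimedMessage {
        final long timestamp; final byte[] payload;
        TimedMessage(long t, byte[] p) { timestamp = t; payload = p; }
    }

    static final long DELTA = 100;  // window size in ms (an assumed value)

    private final PriorityQueue<TimedMessage> pending =
        new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));

    void onReceive(TimedMessage m) { pending.add(m); }

    // Called periodically: commit every message stamped before now - DELTA.
    void commit(long now) {
        while (!pending.isEmpty() && pending.peek().timestamp < now - DELTA)
            deliver(pending.poll());
    }

    void deliver(TimedMessage m) { /* hand to the application in timestamp order */ }
}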

9. Consistent/Total Ordering
• Rule:
  • Messages are received in the same order at all members (regardless of their timestamps).
• Implementation:
  • A message is sent to a sequencer, assigned a sequence number, and finally multicast to receivers.
  • A message is retrieved in incremental sequence-number order at each receiver.
• Example:
  • Replicated database updates
• Drawback:
  • A centralized algorithm
[Figure: even though Ti < Tj, both receivers deliver the two messages in the same order.]

10. Total Ordering Using a Sequencer
• Sender: sends the message to all group members and to the sequencer.
• Sequencer: receives the message, associates it with the next sequence number, multicasts that sequence number to all group members, and increments its counter.
• Receiver: keeps each incoming message in a temporary queue; on receiving a sequence-number message from the sequencer, tags the queued message with that number and delivers it when its local counter reaches the number. (A sketch of both roles follows.)
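A minimal sketch of the two roles above; message ids, multicast, and deliver are hypothetical placeholders:

import java.util.HashMap;
import java.util.Map;

// The sequencer assigns consecutive numbers to messages as it sees them.
class Sequencer {
    private long next = 0;
    void onMessage(String messageId) {
        multicast(messageId, next++);   // announce <id, sequence number> to the group
    }
    void multicast(String id, long seq) { /* send order message to all members */ }
}

// Each receiver holds messages back until the sequencer's number for the
// message equals its own local counter.
class TotalOrderReceiver {
    private long counter = 0;                                  // next seq to deliver
    private final Map<String, byte[]> held = new HashMap<>();  // data without an order yet
    private final Map<Long, String> order = new HashMap<>();   // seq -> message id

    void onDataMessage(String id, byte[] payload) { held.put(id, payload); tryDeliver(); }
    void onOrderMessage(String id, long seq)      { order.put(seq, id);    tryDeliver(); }

    private void tryDeliver() {
        // Deliver while both the order announcement and the data have arrived.
        while (order.containsKey(counter) && held.containsKey(order.get(counter))) {
            deliver(held.remove(order.remove(counter)));
            counter++;   // advance to the next expected sequence number
        }
    }

    void deliver(byte[] payload) { /* hand to application */ }
}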

11. The ISIS System for Total Ordering
[Figure: (1) the sender P multicasts a message to receivers 1-4; (2) each receiver i holds the message back and replies with a proposed sequence number Ppi = max(Api, Ppi) + 1; (3) the sender multicasts the agreed sequence number a = max(Pp1, Pp2, Pp3, Pp4), and each receiver sets Api = max(Api, a), reorders its hold-back queue, and delivers.]
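A sketch of the per-process state in the figure, with A for the largest agreed number seen and P for the largest number this process has proposed; the hold-back queue operations are left as comments:

// ISIS agreed-sequence protocol, per process (message exchange omitted).
class IsisProcess {
    private long A = 0;  // largest agreed sequence number seen so far
    private long P = 0;  // largest sequence number proposed by this process

    // Step 2: on receiving message m, propose a number and hold m back,
    // tagged tentative, until the agreed number arrives.
    long propose(String messageId) {
        P = Math.max(A, P) + 1;
        // holdBack(messageId, P, /*tentative=*/true);
        return P;        // sent back to the sender as this process's proposal
    }

    // Step 3 (sender side): the agreed number is the max of all proposals.
    long agree(long[] proposals) {
        long a = 0;
        for (long p : proposals) a = Math.max(a, p);
        return a;        // multicast to the group as the final number
    }

    // Step 4: on the agreed number, update A, retag m as final, and deliver
    // whatever is now deliverable at the front of the hold-back queue.
    void onAgreed(String messageId, long a) {
        A = Math.max(A, a);
        // retag(messageId, a, /*tentative=*/false); deliverDeliverable();
    }
}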

12. Causal Ordering
• Rule: the happened-before relation
  • If e_i^k and e_i^l are the k-th and l-th events in the history h_i of the same process and k < l, then e_i^k → e_i^l.
  • If e_i = send(m) and e_j = receive(m), then e_i → e_j.
  • If e → e' and e' → e", then e → e".
• Implementation:
  • A vector attached to each message
• Example:
  • Distributed file system
• Drawbacks:
  • The vector is an overhead.
  • Broadcast is assumed.
[Figure: senders S1, S2 and receivers R1-R3 exchange m1-m4; from R2's viewpoint m1 → m2, so m1 must be delivered before m2.]

13. Causal Ordering Using Vector Stamps
• Each process maintains a vector with one element per process.
• Sender: increments its own element of the vector and sends the message with the vector attached.
• Receiver (for a message from process J):
  • Makes sure J's element of the message's vector is exactly one greater than J's element of its own vector (all of J's previous messages have been delivered).
  • Makes sure every other element of the message's vector is <= the corresponding element of its own vector (it has delivered all messages that J had delivered).
  • On delivery, increments the sender's element of its own vector.

14. Vector Stamps
• S[i] = R[i] + 1, where i is the source id
• S[j] ≤ R[j] for all j ≠ i
(S is the vector carried by the message; R is the receiver's local vector. A sketch of this test follows.)
[Figure: four sites A-D exchange messages stamped with vectors such as (1,1,1,0), (2,1,0,0), and (2,1,1,0); a message whose stamp fails the test above is delayed, and is delivered once the missing earlier messages arrive.]
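A minimal sketch of the delivery test, using the S, R, and i of the conditions above; the class name is made up:

// Causal delivery test with vector stamps: a message from process i
// carrying vector S is deliverable iff it is the next message from i
// and we have already delivered everything its sender had delivered.
class VectorClockReceiver {
    private final int[] R;  // one counter per process, all zero initially

    VectorClockReceiver(int numProcesses) { R = new int[numProcesses]; }

    boolean deliverable(int i, int[] S) {
        if (S[i] != R[i] + 1) return false;        // not the next message from i
        for (int j = 0; j < R.length; j++)
            if (j != i && S[j] > R[j]) return false; // sender saw a message we missed
        return true;
    }

    // On delivery, record one more message from sender i.
    void deliver(int i) { R[i]++; }
}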

15. FIFO Ordering
• Rule:
  • Messages from a given sender are received in the order they were sent.
• Implementation:
  • Messages are assigned per-sender sequence numbers. (A sketch follows.)
• Example:
  • TCP
• This is the weakest ordering.
[Figure: messages m1-m4 from S reach R over two different routers and may arrive out of order; sequence numbers restore the send order.]
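A minimal sketch of per-sender FIFO delivery; one instance would be kept per sender, and deliver is a hypothetical placeholder:

import java.util.HashMap;
import java.util.Map;

// Deliver a message only when its sequence number is the next one
// expected from this sender; buffer anything that arrives early.
class FifoChannel {
    private long expected = 0;                          // next seq from this sender
    private final Map<Long, byte[]> held = new HashMap<>();

    void onReceive(long seq, byte[] payload) {
        held.put(seq, payload);
        while (held.containsKey(expected))              // drain any in-order run
            deliver(held.remove(expected++));
    }

    void deliver(byte[] payload) { /* hand to application */ }
}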

16. Why High-Level Message Passing Tools?
• Data formatting
  • Data formatted into appropriate types at user level
• Non-blocking communication
  • Polling and interrupts handled at system-call level
• Process addressing
  • Instead of inflexible hardwired addressing with machine id + local id
• Group communication
  • Group server implemented at user level
  • Broadcasting simulated by a repetition of one-to-one communication

17. PVM and MPI
• PVM: Parallel Virtual Machine
  • Developed in the 1980s
  • The pioneering library providing high-level message passing functions
  • A PVM daemon process takes care of message transfer for user processes in the background.
• MPI: Message Passing Interface
  • Defined in the 1990s
  • A specification of high-level message passing functions
  • Several implementations available: MPICH, LAM/MPI
  • Library functions are linked directly into user programs (no background daemons).
• The detailed differences are described in PVMvsMPI.pdf.

18. Getting Started with MPI Java
• Websites:
  http://www.hpjava.org/courses/arl/lectures/mpi.ppt
  http://www.hpjava.org/reports/mpiJava-spec/mpiJava-spec.pdf
• Create a machines file:
  [mfukuda@UW1-320-00 mfukuda]$ vi machines
  uw1-320-00
  uw1-320-01
  uw1-320-02
  uw1-320-03
• Compile a source program:
  [mfukuda@UW1-320-00 mfukuda]$ javac MyProg.java
• Run the executable:
  [mfukuda@UW1-320-00 mfukuda]$ prunjava 4 MyProg args

19. Program Using MPI

import mpi.*;

class MyProg {
    public static void main( String[] args ) {
        MPI.Init( args );                       // Start MPI computation
        int rank = MPI.COMM_WORLD.Rank( );      // Process id (from 0 to #processes - 1)
        int size = MPI.COMM_WORLD.Size( );      // Number of participating processes
        System.out.println( "Hello World! I am " + rank + " of " + size );
        MPI.Finalize( );                        // Finish MPI computation
    }
}

20. MPI_Send and MPI_Recv

void MPI.COMM_WORLD.Send(
    Object[] message      /* in */,
    int offset            /* in */,
    int count             /* in */,
    MPI.Datatype datatype /* in */,
    int dest              /* in */,
    int tag               /* in */ )

Status MPI.COMM_WORLD.Recv(
    Object[] message      /* out */,
    int offset            /* in */,
    int count             /* in */,
    MPI.Datatype datatype /* in */,
    int source            /* in */,
    int tag               /* in */ )

int Status.Get_count( MPI.Datatype datatype )  /* number of objects received */

MPI.Datatype = BYTE, CHAR, SHORT, INT, LONG, FLOAT, DOUBLE, OBJECT

21. MPI.Send and MPI.Recv

import mpi.*;

class MyProg {
    public static void main( String[] args ) {
        int tag0 = 0;
        MPI.Init( args );                                   // Start MPI computation
        if ( MPI.COMM_WORLD.Rank( ) == 0 ) {                // rank 0 ... sender
            int[] loop = new int[1];
            loop[0] = 3;
            char[] msg = "Hello World!".toCharArray( );
            MPI.COMM_WORLD.Send( msg, 0, 12, MPI.CHAR, 1, tag0 );
            MPI.COMM_WORLD.Send( loop, 0, 1, MPI.INT, 1, tag0 );
        } else {                                            // rank 1 ... receiver
            int[] loop = new int[1];
            char[] msg = new char[12];
            MPI.COMM_WORLD.Recv( msg, 0, 12, MPI.CHAR, 0, tag0 );
            MPI.COMM_WORLD.Recv( loop, 0, 1, MPI.INT, 0, tag0 );
            for ( int i = 0; i < loop[0]; i++ )
                System.out.println( new String( msg ) );
        }
        MPI.Finalize( );                                    // Finish MPI computation
    }
}

22. Message Ordering in MPI
• FIFO ordering within each data type
• Messages can be retrieved out of arrival order by receiving with different tags (a sketch follows)
[Figure: left, messages from a source arrive at the destination in FIFO order; right, the receiver uses tags 1, 2, and 3 to pick messages out of arrival order.]
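A minimal sketch of tag-based matching, using the same mpiJava calls as the surrounding slides (the class name is made up; run with two processes): rank 0 sends tag 1 then tag 2, and rank 1 deliberately receives the tag-2 message first.

import mpi.*;

class TagOrder {
    public static void main( String[] args ) {
        MPI.Init( args );
        int[] first = { 1 }, second = { 2 };
        if ( MPI.COMM_WORLD.Rank( ) == 0 ) {
            MPI.COMM_WORLD.Send( first,  0, 1, MPI.INT, 1, 1 );  // tag 1, sent first
            MPI.COMM_WORLD.Send( second, 0, 1, MPI.INT, 1, 2 );  // tag 2, sent second
        } else if ( MPI.COMM_WORLD.Rank( ) == 1 ) {
            int[] a = new int[1], b = new int[1];
            MPI.COMM_WORLD.Recv( a, 0, 1, MPI.INT, 0, 2 );       // match tag 2 first
            MPI.COMM_WORLD.Recv( b, 0, 1, MPI.INT, 0, 1 );       // then tag 1
            System.out.println( "got " + a[0] + " then " + b[0] ); // prints "got 2 then 1"
        }
        MPI.Finalize( );
    }
}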

23. MPI.Bcast

void MPI.COMM_WORLD.Bcast(
    Object[] message      /* in */,
    int offset            /* in */,
    int count             /* in */,
    MPI.Datatype datatype /* in */,
    int root              /* in */ )

Example: MPI.COMM_WORLD.Bcast( msg, 0, 1, MPI.INT, 2 );
[Figure: rank 2 (the root) broadcasts msg[0] to ranks 0-4.]
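A minimal runnable sketch of the call above (the class name and the value 434 are made up): the root fills the buffer, and after Bcast every rank holds the same value.

import mpi.*;

class BcastExample {
    public static void main( String[] args ) {
        MPI.Init( args );
        int[] msg = new int[1];
        if ( MPI.COMM_WORLD.Rank( ) == 2 )
            msg[0] = 434;                                    // only the root fills the buffer
        MPI.COMM_WORLD.Bcast( msg, 0, 1, MPI.INT, 2 );       // root = rank 2
        System.out.println( "rank " + MPI.COMM_WORLD.Rank( ) + " has " + msg[0] );
        MPI.Finalize( );
    }
}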

24. MPI_Reduce

void MPI.COMM_WORLD.Reduce(
    Object[] sendbuf      /* in */,
    int sendoffset        /* in */,
    Object[] recvbuf      /* out */,
    int recvoffset        /* in */,
    int count             /* in */,
    MPI.Datatype datatype /* in */,
    MPI.Op operator       /* in */,
    int root              /* in */ )

MPI.Op = MPI.MAX (maximum), MPI.MIN (minimum), MPI.SUM (sum), MPI.PROD (product), MPI.LAND (logical and), MPI.BAND (bitwise and), MPI.LOR (logical or), MPI.BOR (bitwise or), MPI.LXOR (logical xor), MPI.BXOR (bitwise xor), MPI.MAXLOC (max location), MPI.MINLOC (min location)

Example: MPI.COMM_WORLD.Reduce( msg, 0, result, 0, 1, MPI.INT, MPI.SUM, 2 );
[Figure: ranks 0-4 contribute 15, 10, 12, 8, and 4; rank 2 (the root) receives their sum, 49.]
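A minimal runnable sketch reproducing the figure, assuming it is launched with exactly five processes (e.g., prunjava 5 ReduceExample; the class name is made up):

import mpi.*;

class ReduceExample {
    public static void main( String[] args ) {
        MPI.Init( args );
        int[] values = { 15, 10, 12, 8, 4 };               // one value per rank, as in the figure
        int[] msg    = { values[ MPI.COMM_WORLD.Rank( ) ] };
        int[] result = new int[1];
        MPI.COMM_WORLD.Reduce( msg, 0, result, 0, 1, MPI.INT, MPI.SUM, 2 );
        if ( MPI.COMM_WORLD.Rank( ) == 2 )
            System.out.println( "sum = " + result[0] );    // prints "sum = 49"
        MPI.Finalize( );
    }
}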

25. MPI_Allreduce

void MPI.COMM_WORLD.Allreduce(
    Object[] sendbuf      /* in */,
    int sendoffset        /* in */,
    Object[] recvbuf      /* out */,
    int recvoffset        /* in */,
    int count             /* in */,
    MPI.Datatype datatype /* in */,
    MPI.Op operator       /* in */ )

[Figure: the values held by ranks 0-7 are combined in stages; at the end, every rank holds the reduced result.]
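Allreduce takes the same arguments as Reduce minus the root; as a sketch, swapping it into the Reduce example above leaves every rank, not just rank 2, holding the sum:

// Replacing the Reduce call in the previous sketch: no root argument,
// and every rank's result[0] ends up as 49.
MPI.COMM_WORLD.Allreduce( msg, 0, result, 0, 1, MPI.INT, MPI.SUM );
System.out.println( "rank " + MPI.COMM_WORLD.Rank( ) + " sum = " + result[0] );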

26. Exercises (no turn-in)
• Explain, in the atomic multicast of page 6, why reversing the order of the operations "all receivers forward the message to the same group and thereafter deliver it to themselves" makes the multicast no longer atomic.
• Assume that four processes communicate with one another in causal ordering. Their current vectors are shown below. If process A sends a message, which processes can receive it immediately?
• Show that, if the basic multicast that we use in the algorithm of page 9 is also FIFO-ordered, then the resulting totally-ordered multicast is also causally ordered.
• Consider the pros and cons of PVM's daemon-based and MPI's library-linking-based message passing.
• Why can MPI maintain FIFO ordering?
