
Send and Receive Based Message-Passing for SCMP



Presentation Transcript


  1. Send and Receive Based Message-Passing for SCMP Charles W. Lewis, Jr. Thesis Defense Virginia Tech April 28th, 2004

  2. [Diagram: thread, data, and sync messages; RTS/CTS handshake] This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system. 1. Overview of SCMP 2. Original Message-Passing System 3. New Message-Passing System 4. Performance Comparisons

  3. Problems with current design trends motivate the SCMP concept. • As transistor sizes shrink, so do communication wires, leading to higher cross-chip communication latencies. • ILP faces diminishing returns. • Large, complex uniprocessors require extensive design and verification effort.

  4. SCMP provides PLP through replication. • Up to 64 identical nodes on-chip • Replicated nodes reduce complexity • 2-D network eliminates cross-chip wires SCMP Network with 64 Nodes

  5. SCMP provides TLP through multi-thread hardware support. • Up to 16 threads • Round-robin thread scheduling by hardware • On every node: • 4-stage RISC pipeline • 8MB memory • Networking hardware SCMP Node

  6. The original messaging system has two message types: the thread message and the data message. Because they contain handling information, these message formats borrow from the Active Messages message-passing system.

  7. Network uses wormhole and dimension-order routing. • Every router multiplexes virtual channel buffers over physical channels. • Head flits claim virtual channel resources as they travel. • If one message blocks, other messages may still continue as long as enough virtual channels are free. • Messages move along the X axis, then the Y axis. • Tail flits release virtual channel resources as they travel.
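The X-then-Y movement described above can be sketched as a small routing function. This is an illustrative Python model, not the router hardware; the function name and coordinate convention are assumptions.

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing: resolve the X offset first, then Y.

    src and dst are (x, y) node coordinates on the 2-D mesh; returns the
    list of nodes visited, including source and destination.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # move along the X axis first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then move along the Y axis
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((1, 1), (3, 2)))
# X hops first: (1,1) -> (2,1) -> (3,1), then the Y hop to (3,2)
```

Because every message between the same pair of nodes takes the same deterministic path, deadlock freedom reduces to ensuring messages eventually drain, which is what the next slide addresses.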

  8. Dimension-order routing is deadlock free as long as messages eventually drain. • Even with VCs, the network can still deadlock if messages don’t drain. • If all contexts are consumed, thread messages block at the NIU. • Threads may not release until a data message is received. • Data messages must not be stopped by congested thread messages. • Data messages must have a separate path through the network. [Diagram: router with separate thread VCs and data VCs on the east/west channels]

  9. The NIU bears most of the messaging load. [Diagram: NIU with thread buffer, per-context buffers, data buffer, receive buffer, and injection/ejection channels between memory and the router]

  10. Messages are built through assembly instructions.

  11. The thread library facilitates thread messages.

  12. The send library facilitates data messages.

  13. The original message-passing system uses requests and replies. Node A requires data held by Node B • Node A creates a thread on Node B • The new thread on Node B sends data to Node A • The new thread on Node B sends a SYNC message when done [Diagram: thread message A→B, then data and SYNC messages B→A]

  14. Dynamic memory is a problem. • The request thread on Node B must know: the source address, source stride, destination address, destination stride, and number of values to send • How can Node A know the source address and stride if Node B allocates the buffer dynamically? • The program must contain global pointers

  15. In-order delivery of messages is a problem. • The SCMP network does not guarantee in-order delivery of messages • The SYNC message may reach Node A before the data message • Node A will read bad values from memory [Diagram: data and SYNC messages from B to A, arriving out of order]
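The race described on this slide can be demonstrated with a toy delivery model. This is a hedged Python sketch; the message names and the `reader_sees_valid_data` check are illustrative, not part of the SCMP system.

```python
import itertools

# Messages sent by Node B after servicing a request from Node A.
# The SCMP network does not guarantee in-order delivery, so any
# permutation of arrivals is possible.
sent = ["DATA", "SYNC"]

def reader_sees_valid_data(arrival_order):
    """Node A's thread releases on SYNC; the subsequent memory read is
    only safe if DATA arrived before SYNC did."""
    delivered = []
    for msg in arrival_order:
        delivered.append(msg)
        if msg == "SYNC":
            return "DATA" in delivered
    return False

unsafe = [order for order in itertools.permutations(sent)
          if not reader_sees_valid_data(order)]
print(unsafe)  # SYNC overtaking DATA is exactly the broken case
```

Send-and-receive messaging sidesteps this race entirely because completion is detected locally at the receiver, with no separate SYNC message to reorder.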

  16. Request threads and finite thread contexts are a problem. • If a node holds highly demanded data, request threads may consume all of its contexts • Additional thread messages will block in the network [Diagram: NIU with every thread context consumed by request threads for the same address]

  17. Send-and-Receive message-passing eliminates all of these problems. • A thread must execute a receive before data will be accepted • Don’t need request messages • Messages are identified abstractly • Don’t need global pointers • Completion notification occurs locally • Don’t need SYNC messages

  18. Rendezvous mode uses an RTS/CTS handshake. Node B holds data required by Node A • Node B sends Node A an RTS message when the send is executed • After the receive is executed, Node A sends Node B a CTS message • Node B sends the data after receiving the CTS [Diagram: RTS from B to A, CTS from A to B, then data from B to A]
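The handshake above can be traced with a small simulation. A minimal Python sketch under assumed names (`rendezvous` and the trace strings); the real protocol is implemented in the NIU hardware.

```python
def rendezvous(send_executed, receive_executed):
    """Trace the rendezvous handshake between sender B and receiver A.
    Data flows only after B's RTS has been answered by A's CTS, so the
    transfer is safe no matter which side reaches its call first."""
    trace = []
    if send_executed:
        trace.append("B->A: RTS")    # sender announces pending data
    if receive_executed and "B->A: RTS" in trace:
        trace.append("A->B: CTS")    # receiver grants the transfer
    if "A->B: CTS" in trace:
        trace.append("B->A: DATA")   # data moves only after the CTS
    return trace

print(rendezvous(True, True))
# ['B->A: RTS', 'A->B: CTS', 'B->A: DATA']
```

Note that if the receive has not executed, no CTS is ever generated and no data is sent, which is the safety property that ready mode (next slide) trades away for lower latency.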

  19. Ready mode foregoes the handshake to reduce message latency. Node B holds data required by Node A • Node B sends the data when the send is executed • The user must ensure that the receive has executed on Node A [Diagram: data message from B to A]

  20. The implementation centers around two tables. Send Table Entry Receive Table Entry

  21. Send Table Entries may be in 4 states, and Receive Table Entries may be in 5 states. Send Table Entry States Receive Table Entry States
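The entry states can be sketched as simple transition tables. The state names below are taken from the operation flowcharts later in the deck (In Use, In Progress, Complete, RTS Rcvd); the exact transition set in the NIU may differ, so treat this as an illustrative Python model.

```python
# Illustrative state machines for Send Table and Receive Table entries.
# State and event names are assumptions drawn from slides 28-29.
SEND_TRANSITIONS = {
    ("In Use", "cts_arrives"): "In Progress",
    ("In Progress", "tail_flit_sent"): "Complete",
}
RECV_TRANSITIONS = {
    ("No Entry", "rts_arrives"): "RTS Rcvd",      # RTS beats the str
    ("RTS Rcvd", "str_executed"): "In Progress",  # str sends the CTS
    ("In Use", "rts_arrives"): "In Progress",     # str beats the RTS
    ("In Progress", "tail_flit_stored"): "Complete",
}

def step(table, state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return table.get((state, event), state)

state = "In Use"
for event in ("cts_arrives", "tail_flit_sent"):
    state = step(SEND_TRANSITIONS, state, event)
print(state)  # Complete
```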

  22. The new messaging system has four message types. Data Message Thread Message CTS Message RTS Message

  23. The NIU now contains a data queue for every context. [Diagram: NIU with thread buffer, per-context data buffers, RTS buffer, CTS buffer, receive buffer, and injection/ejection channels between memory and the router]

  24. Only five new instructions and one modified instruction are needed.

  25. The thread library remains nearly the same.

  26. The new send library is more familiar.

  27. The receive library is all new.
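The flavor of the new send and receive libraries can be suggested with a toy tag-matched mailbox: messages are identified abstractly by (source, tag) rather than by memory address, which is what removes the need for global buffer pointers. The MPI-like API shape here is an assumption, not the actual SCMP library.

```python
import queue

class Node:
    """Toy model of send/receive messaging. Incoming data is matched by
    an abstract (source, tag) pair; the receiver never needs to know
    where the sender's buffer lives."""
    def __init__(self):
        self.mailbox = {}  # (src, tag) -> FIFO of payloads

    def deliver(self, src, tag, data):
        # Called by the "network" when a data message arrives.
        self.mailbox.setdefault((src, tag), queue.Queue()).put(data)

    def receive(self, src, tag):
        # Blocks until a message matching (src, tag) is available.
        return self.mailbox[(src, tag)].get()

def send(dst_node, src_id, tag, data):
    dst_node.deliver(src_id, tag, data)

node_a = Node()
send(node_a, src_id=1, tag=42, data=[3, 1, 4])
print(node_a.receive(1, 42))  # [3, 1, 4]
```

Completion is observed locally (the `receive` call returns), so no separate SYNC message is needed.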

  28. Rendezvous Mode Operation at the Sender [Flowchart: sendh checks for an existing Send Table entry; if none, it creates one (In Use), queues the head flit and tag, and sends an RTS, while a conflicting entry is an ERROR and a waiting queue causes a SUSPEND. When the matching CTS arrives, the entry moves to In Progress and flits are sent from the queue head until the tail flit is sent, after which the entry is marked Complete. A CTS with no matching entry is an ERROR.]

  29. Rendezvous Mode Operation at the Receiver [Flowchart: an arriving RTS with no matching entry is recorded (or blocked if one is in progress); with a matching In Use entry, a CTS is sent and the entry moves to In Progress. When str executes, it records the receive and creates or updates the entry (In Use); if an RTS was already received, it sends the CTS and moves to In Progress, and with no entry and no RTS it SUSPENDs. Arriving data flits with no matching entry are discarded or blocked; otherwise flits are stored until the tail flit, after which the entry is marked Complete.]

  30. RTS and CTS messages also need separate VC paths. • RTS messages can block in the network. • For a given RTS message to leave the network, the RTS messages ahead of it must be satisfied: a CTS message to the source and a data message back. • RTS and CTS messages have their own VC paths. [Diagram: router with separate thread, data, RTS, and CTS VCs on the east/west channels]

  31. Ready Mode Operation at the Sender [Flowchart: sendh checks for an existing Send Table entry; if none, it creates one (In Use), queues the head flit and tag, and the entry moves to In Progress as flits are sent from the queue head until the tail flit is sent, after which the entry is marked Complete. A conflicting entry is an ERROR; a full queue causes a SUSPEND.]

  32. Ready Mode Operation at the Receiver [Flowchart: an arriving data message with no matching entry is discarded or blocked. When str executes, it records the receive and creates an entry (In Use); with no entry and no pending data it SUSPENDs. With a matching entry the transfer moves to In Progress and flits are stored until the tail flit, after which the entry is marked Complete.]

  33. Stressmark testing was used to verify that performance was not hurt. • DIS Stressmark Suite • Neighborhood Stressmark • Matrix Stressmark • Transitive Closure Stressmark • LU Factorization Stressmark

  34. The neighborhood stressmark measures image texture. • Every node owns a portion of the total rows • Every node owns complete sum and difference histograms • Each node determines, and requests, the pairs for pixels in its rows • Each node fills in its sum and difference histograms • Histograms are shared • Each node manages only a portion of each histogram • Only the correct portion is sent to a node
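The per-node histogram work can be sketched in Python. Assumptions: grayscale pixel rows as lists of ints and a single (dx, dy) neighborhood offset; the real stressmark distributes rows and histogram portions across nodes.

```python
from collections import Counter

def sum_diff_histograms(rows, dx, dy):
    """Neighborhood-stressmark core: for every pixel pair separated by
    the offset (dx, dy), accumulate histograms of pixel sums and
    pixel differences."""
    sums, diffs = Counter(), Counter()
    for y in range(len(rows) - dy):
        row, other = rows[y], rows[y + dy]
        for x in range(len(row) - dx):
            a, b = row[x], other[x + dx]
            sums[a + b] += 1
            diffs[a - b] += 1
    return sums, diffs

img = [[1, 2, 3],
       [4, 5, 6]]
s, d = sum_diff_histograms(img, dx=1, dy=1)
print(dict(s), dict(d))  # pairs (1,5) and (2,6): sums {6:1, 8:1}, diffs {-4:2}
```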

  35. Queues with 16 flits perform best.

  36. The new system outperforms the old under the neighborhood stressmark.

  37. The matrix stressmark solves a linear system of equations using the Conjugate Gradient Method. • Additional vectors r and p are used for intermediate steps • Every node has: rows of A, elements of b and r, and complete x and p • After each iteration p must be globally redistributed • Share with columns • Share with rows
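The method on this slide, minus the distribution across nodes, can be sketched as a plain sequential solver. A hedged Python sketch; in the SCMP version rows of A and pieces of the vectors are spread over the nodes, with p redistributed after each iteration.

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive-definite matrix A, using
    the intermediate vectors r (residual) and p (search direction)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # r = b - A x, with x starting at zero
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

For an n-by-n SPD system, CG converges in at most n iterations in exact arithmetic, which is why the per-iteration redistribution of p dominates the communication cost.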

  38. The new system provides marginal improvement over the original under the matrix stressmark.

  39. The transitive closure stressmark solves the all-pairs shortest-path problem. • Floyd-Warshall Algorithm • Adjacency Matrix D[i][j] • Iterative Improvements: D[i][j] = min(D[i][j], D[i][k]+D[k][j]) • Each node owns a sub-block of the adjacency matrix • Each node needs a portion of row k • Each node needs a portion of column k [Diagram: 16 mesh nodes numbered 0–15]
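The iterative improvement above is the Floyd-Warshall recurrence; a sequential Python sketch (the SCMP version blocks D across the nodes and shares the needed parts of row k and column k each iteration):

```python
def floyd_warshall(D):
    """All-pairs shortest paths on an adjacency matrix D, by iterative
    improvement: D[i][j] = min(D[i][j], D[i][k] + D[k][j])."""
    n = len(D)
    D = [row[:] for row in D]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

INF = float("inf")
adj = [[0, 3, INF],
       [3, 0, 1],
       [INF, 1, 0]]
print(floyd_warshall(adj))  # D[0][2] becomes 4 via intermediate node 1
```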

  40. The new system provides a marginal improvement over the original under the transitive closure stressmark.

  41. The LU factorization stressmark is used by linear system solvers. • Factors a matrix into a lower triangular matrix and an upper triangular matrix • The matrix is divided into blocks • The pivot block is factored • Pivot column and row blocks are divided by the pivot • Inner active matrix blocks are modified by the pivot row and column blocks [Diagram: matrix blocked into pivot block, pivot row, pivot column, and inner active matrix]
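The factorization itself, without the blocking, can be sketched as a Doolittle LU decomposition. A minimal Python sketch without pivoting; the SCMP version operates on blocks (pivot block, pivot row/column blocks, inner active matrix) rather than single elements.

```python
def lu_factor(A):
    """Doolittle LU factorization without pivoting: A = L U, with unit
    lower-triangular L and upper-triangular U."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):                        # k is the pivot position
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]       # divide column by pivot
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]  # update the active matrix
    return L, U

L, U = lu_factor([[4.0, 3.0], [6.0, 3.0]])
print(L, U)  # L = [[1.0, 0.0], [1.5, 1.0]], U = [[4.0, 3.0], [0.0, -1.5]]
```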

  42. The new system outperforms the original under the LU factorization stressmark.

  43. Send-and-Receive Messaging for SCMP is worthwhile. • Fixes Problems With Original SCMP Messaging System • Global Buffer Pointers • Races between Data and SYNC messages • Request Thread Storms • Programming Model is more familiar • Performance is better Questions?
