This lecture covers crucial topics in computer architecture, including Flynn's taxonomy, which categorizes parallel processing architectures such as SIMD and MIMD. It explores various processor types, such as bit-serial, vector, and pipelined processors, examining their functionality and examples of each. Additionally, the lecture highlights interconnection networks, focusing on topologies like the mesh and torus, along with routing, embedding, and network bisection. Key concepts like bandwidth, device latency, and network communication are also discussed to provide a comprehensive understanding of networked systems.
Today’s Topics • Flynn’s Taxonomy • Bit-Serial, Vector, Pipelined Processors • Interconnection Networks • Topologies • Routing • Embedding • Network Bisection
Taxonomy • Flynn (1966) classified machines by their data and control streams
SIMD • All processors execute the same program in lockstep • The data each processor sees is different • Single control processor • Individual processors can be turned on/off at each cycle • Illiac IV, CM-2, MasPar are some examples • Silicon Graphics RealityEngine graphics hardware
MIMD • All processors execute their own set of instructions • Processors operate on separate data streams • No centralized clock is implied • SP-2, T3E, clusters, Crays, etc.
SPMD/MPMD • Single/Multiple Program, Multiple Data • In SPMD, all processors run the same program, but they are not necessarily run in lockstep • A very popular and scalable programming style • MPMD is similar except that different processors run different programs • The PVM distribution has some simple examples
Processor Types • Four types • Bit-serial • Vector • Cache-based, pipelined • Custom (e.g. Tera MTA or KSR-1)
Bit Serial • Only seen in SIMD machines like the CM-2 or MasPar • Each clock cycle, one bit of the data is loaded/written • Simplifies the memory system and reduces the memory trace (wire) count • Popular for very dense (64K-processor) arrays
Cache-based, Pipelined • The garden-variety microprocessor • SPARC, Intel x86, MC68xxx, MIPS, … • Register-based ALUs and FPUs • Registers are of scalar type • Pipelined execution to improve performance of individual chips • Splits a basic operation such as addition into stages • More stages allow greater speedup, but cause more problems with branching and data/control hazards • Per-processor caches make it challenging to build SMPs (coherency issues) • Now dominates the high-end market
Vector Processors • Very specialized (e.g. $$$$$) machines • Registers are true vectors with power-of-2 lengths • Designed to efficiently perform matrix-style operations • Ax = b, i.e. b(I) = Σ_J A(I,J)·x(J) • Vector registers V1, V2, V3 • V1 = A(I,*), V2 = x(*) • MULV V3(I), V1, V2 • "Chaining" to efficiently handle vectors larger than the vector registers • Cray, Hitachi, SGI (now the Cray SV-1) are examples
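As a concrete illustration of the operation this hardware is built for, here is a minimal scalar C sketch of b = A·x processed one register-length chunk at a time; the sizes N and VLEN are assumptions chosen only for illustration, not taken from any particular machine.

```c
#include <stdio.h>

#define N     128     /* problem size (made up for illustration)          */
#define VLEN   64     /* assumed vector register length                   */

static double A[N][N], x[N], b[N];

/* Scalar sketch of what a vector unit does for b = A*x: row A(I,*) and
 * x(*) are processed VLEN elements at a time (one register-sized chunk
 * per load), multiplied elementwise, and summed into b(I). */
static void matvec(void)
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j0 = 0; j0 < N; j0 += VLEN)          /* one register-sized chunk */
            for (int j = j0; j < j0 + VLEN && j < N; j++)
                sum += A[i][j] * x[j];                /* elementwise multiply, then accumulate */
        b[i] = sum;
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++) A[i][j] = (i == j);  /* identity matrix */
    }
    matvec();
    printf("b[0] = %g, b[N-1] = %g\n", b[0], b[N - 1]);  /* both print 1 */
    return 0;
}
```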
Some Custom Processors • Denelcor HEP / Tera MTA • Multiple register sets • Stack pointer, instruction pointer, frame pointer, etc. • Facilitates hardware threads • Switches to a different register set each clock cycle • Why? Stalls to the memory subsystem in one thread can be hidden by concurrency • KSR-1 • Cache-only memory architecture • Basically two generations behind standard micros
Going Parallel • In the late 70's, even the vector "monsters" started to go parallel • For parallel processing to work, individual processors must synchronize • SIMD – synchronize every clock cycle • MIMD – explicit synchronization • Message passing • Semaphores, monitors, fetch-and-increment • Focus on interconnection networks for the rest of the lecture
Characterizing Networks • Bandwidth • Device/switch latency • Switching types • Circuit switched (e.g. the telephone network) • Packet switched (e.g. the Internet) • Store and forward • Virtual cut-through • Wormhole routed • Topology • Number of connections • Diameter (how many hops through switches)
Latency • Latency is the time from when a command is issued until any effect is seen • Pushing on the gas pedal before the car goes forward • The time from when you enter a line until the cashier starts on your job • First bit leaves computer A, first bit arrives at computer B, OR • (Message latency) First bit leaves computer A, last bit arrives at computer B • Startup latency is the time to send a zero-length message
Bandwidth • Bits/second that can travel through a connection • A really simple model for calculating the time to send a message of N bytes • Time = latency + N/bandwidth • Bisection is the minimum number of wires that must be cut to divide a network of machines into two equal halves. • Bisection bandwidth is the total bandwidth through the bisection
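A tiny C sketch of this time model; the latency and bandwidth values below are made-up numbers chosen only to show that short messages are latency-dominated while long messages are bandwidth-dominated.

```c
#include <stdio.h>

/* The slide's simple model for the time to send N bytes over a link:
 *   time = latency + N / bandwidth
 * The parameters are illustrative assumptions, not measured values. */
static double message_time(double latency_s, double bandwidth_Bps, double n_bytes)
{
    return latency_s + n_bytes / bandwidth_Bps;
}

int main(void)
{
    double latency   = 20e-6;   /* assumed 20 microsecond startup latency */
    double bandwidth = 100e6;   /* assumed 100 MB/s link                  */

    printf("1 KB message: %g s\n", message_time(latency, bandwidth, 1e3));  /* latency dominated   */
    printf("1 MB message: %g s\n", message_time(latency, bandwidth, 1e6));  /* bandwidth dominated */
    return 0;
}
```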
Interconnection Topologies • Completely connected • Every node has a direct wire connection to every other node • N(N-1)/2 wires, clearly impractical for large N (e.g. 64 nodes already need 2016 wires)
Line/Ring • (Figure: nodes 1 through 7 connected in a line) • A simple interconnection • The first topology where routing is an issue • Routing is needed when no direct connection exists between nodes • To go from node 2 to node 4, you have to pass through node 3 • What happens if node 2 wants to communicate with node 3 at the same time node 1 wants to communicate with node 4? • What is the bisection of a line/ring? • If the links have bandwidth B, what is the bisection bandwidth? • What is the aggregate bandwidth of the network?
Mesh/Torus • Generalization of the line/ring to multiple dimensions • More routes between nodes • What is the bisection of this network? • (Figure: a mesh/torus built from three rows of nodes numbered 1 through 7)
Hop Count • Networks are measured by diameter • The diameter is the minimum number of hops a message must traverse between the two nodes that are furthest apart • Line: diameter = N - 1 • 2-D (N x M) mesh: diameter = N + M - 2
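The slide's two diameter formulas, coded directly as a small C sketch; the network sizes in main are arbitrary examples.

```c
#include <stdio.h>

/* Diameter formulas from the slide:
 *   line of N nodes:   N - 1 hops
 *   N x M 2-D mesh:    N + M - 2 hops (corner to opposite corner) */
static int line_diameter(int n)        { return n - 1; }
static int mesh_diameter(int n, int m) { return n + m - 2; }

int main(void)
{
    printf("line of 8 nodes: diameter %d\n", line_diameter(8));     /* 7  */
    printf("8 x 8 mesh:      diameter %d\n", mesh_diameter(8, 8));  /* 14 */
    return 0;
}
```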
Tree-based Networks • Nodes are organized in a tree fashion (important for some global algorithms) • Diameter of this network? • Bisection, bisection bandwidth?
Hypercubes • (Figure: 1-D, 2-D, 3-D, and 4-D hypercubes)
Hypercubes 2 • A dimension-N hypercube is constructed by connecting the corresponding "corners" of two dimension-(N-1) hypercubes • Relatively low wire count to build large networks • Multiple routes between any pair of nodes • Exercise for the reader: what is the diameter of a K-dimensional hypercube?
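To back up the low-wire-count claim, a small C sketch tabulating the standard counts for a K-dimensional hypercube: 2^K nodes and K·2^(K-1) links (each of the 2^K nodes has K links, and each link is shared by two nodes).

```c
#include <stdio.h>

/* Node and link counts of a K-dimensional hypercube. */
int main(void)
{
    for (int k = 1; k <= 10; k++) {
        unsigned long nodes = 1ul << k;                   /* 2^K           */
        unsigned long links = (unsigned long)k << (k - 1); /* K * 2^(K-1)  */
        printf("K=%2d  nodes=%5lu  links=%6lu\n", k, nodes, links);
    }
    return 0;
}
```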
Labeling/Routing in a Hypercube • Nodes are labeled in Gray code • Connected neighbors have binary labels that differ in exactly one bit • 3-D cube: corners labeled 000, 001, 010, 011, 100, 101, 110, 111
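A minimal C sketch of the one-bit-difference property: the neighbors of any node are found by flipping each bit of its label in turn. The choice of node 5 (binary 101) is just an illustration on the slide's 3-D cube.

```c
#include <stdio.h>

#define DIM 3    /* 3-D cube, as in the slide's figure */

int main(void)
{
    unsigned node = 5;                       /* binary 101 */
    for (int bit = 0; bit < DIM; bit++)
        printf("neighbor across dimension %d: %u\n", bit, node ^ (1u << bit));
    /* prints 4 (100), 7 (111), 1 (001) */
    return 0;
}
```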
The e-cube routing algorithm • Source address S = S0 S1 S2 … Sn • Destination address D = D0 D1 D2 … Dn • Let R = R0 R1 R2 … Rn = S ⊕ D (bitwise XOR) • The number of one bits in R is the distance between S and D • Starting at S, find the first position j with Rj = 1 and go to the neighbor that differs in bit j (i.e. flip Sj) • Continue from this intermediate node: find the next position k > j with Rk = 1 and go to that neighbor, repeating until D is reached
E-cube routing example • 8-dimensional hypercube (256 nodes) • S = 134 = 0x86 = 10000110 • D = 215 = 0xD7 = 11010111 • R = S ⊕ D = 0x51 = 01010001 • Distance = 3 • Route: S = 10000110 (134) → 11000110 (198) → 11010110 (214) → 11010111 (215) = D
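A minimal C sketch of this dimension-ordered routing, correcting differing bits from the most significant end so that it reproduces the 134 → 198 → 214 → 215 route above; the node width DIM is fixed at 8 to match the example.

```c
#include <stdio.h>

#define DIM 8   /* 8-dimensional hypercube, 256 nodes */

/* E-cube (dimension-ordered) routing: XOR the source and destination
 * labels, then flip each differing bit in a fixed order, hopping to the
 * corresponding neighbor at every step. */
static void ecube_route(unsigned src, unsigned dst)
{
    unsigned cur = src;
    unsigned r = src ^ dst;              /* R = S xor D            */
    printf("route: %u", cur);
    for (int bit = DIM - 1; bit >= 0; bit--) {
        if (r & (1u << bit)) {           /* this dimension differs */
            cur ^= (1u << bit);          /* hop to that neighbor   */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void)
{
    ecube_route(134, 215);   /* prints: route: 134 -> 198 -> 214 -> 215 */
    return 0;
}
```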
Embedding • A network is embeddable if its nodes and links can be mapped onto a target network • A mesh is embeddable in a hypercube • That is, there is a mapping of mesh nodes and links onto hypercube nodes and paths • The dilation of an embedding is the maximum number of target-network links needed to represent one link of the embedded network • Perfect embeddings have dilation 1 • Embedding a tree into a mesh has a dilation of 2 (see the example in the book)
Modern Parallel Machines are Packet Switched • A message is broken into smaller blocks and these pieces are sent through the network • Network intermediate points (routers) can be store-and-forward or virtual cut-through • Store-and-forward buffers each packet at every switch, and must queue it if an incoming packet has packets ahead of it on the outgoing port (congestion) • Virtual cut-through avoids the mandatory buffering of store-and-forward by "cutting through" the switch when the output port is free
Wormhole Routing • Wormhole routing is a variation of virtual cut-through • Small units called flow-control digits (flits) pass through the network • When a header flit is allowed to cut through a switch, the original sender is guaranteed a clear path through that switch • A tail flit closes the "connection" • Wormhole routing was defined by Seitz and is used in Myrinet, a very popular cluster interconnect
Latency of Circuit Switched and Virtual Cut-Through • Circuit-switched latency: (Lc/B)·l + L/B • Lc = length of the control packet • B = bandwidth • l = number of links • L = length of the packet • Virtual cut-through latency: (Lh/B)·l + L/B • Lh = length of the header
Store-and-Forward and Wormhole Routing Latency • Wormhole routing latency: (Lf/B)·l + L/B • Lf = length of a flit • Store-and-forward latency: (L/B)·l • Store-and-forward latency can be much worse over many hops • Virtual cut-through, wormhole, and circuit-switched latencies all approach L/B as message length increases
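A small C sketch of the four latency models from these two slides; the bandwidth, message, header, flit, and control-packet sizes in main are assumed values chosen only to show how store-and-forward grows with hop count while the others stay near L/B.

```c
#include <stdio.h>

/* Latency models from the slides.  Units must simply be consistent
 * (here bytes and bytes/second); all example numbers are assumptions. */
static double store_forward(double L, double B, int links)            { return (L / B) * links; }
static double cut_through  (double L, double Lh, double B, int links) { return (Lh / B) * links + L / B; }
static double wormhole     (double L, double Lf, double B, int links) { return (Lf / B) * links + L / B; }
static double circuit      (double L, double Lc, double B, int links) { return (Lc / B) * links + L / B; }

int main(void)
{
    double B = 100e6;      /* assumed 100 MB/s link bandwidth        */
    double L = 64e3;       /* assumed 64 KB message                  */
    int    links = 8;      /* hops between source and destination    */

    printf("store-and-forward: %g s\n", store_forward(L, B, links));
    printf("cut-through:       %g s\n", cut_through(L, 64, B, links));   /* 64-byte header  */
    printf("wormhole:          %g s\n", wormhole(L, 8, B, links));       /* 8-byte flit     */
    printf("circuit-switched:  %g s\n", circuit(L, 32, B, links));       /* 32-byte control */
    return 0;
}
```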
Deadlock/Livelock • Livelock and deadlock are potential problems in any network design • Livelock occurs in adaptive routing algorithms when a packet keeps moving but never reaches its destination • Deadlock occurs when packets cannot be forwarded because each is waiting for other packets to move out of the way: the blocking packet is itself waiting on a blocked packet
Next Time … • All about clusters • Introduction to PVM (and MPI)