

  1. Data-Centric System Design CS 6501 Fundamental Concepts: Computing Models Samira Khan University of Virginia Sep 4, 2019 The content and concept of this course are adapted from CMU ECE 740

  2. Review Set 2 • Due Sep 11 • Choose 2 from a set of four • Dennis and Misunas, “A Preliminary Architecture for a Basic Data Flow Processor,” ISCA 1974. • Arvind and Nikhil, “Executing a Program on the MIT Tagged-Token Dataflow Architecture”, IEEE TC 1990. • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982. • Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986.

  3. AGENDA • Logistics • Review from last lecture • Fundamental concepts • Computing models

  4. THE VON NEUMANN MODEL/ARCHITECTURE • Also called the stored program computer (instructions in memory). Two key properties: • Stored program • Instructions stored in a linear memory array • Memory is unified between instructions and data • The interpretation of a stored value depends on the control signals • Sequential instruction processing • One instruction processed (fetched, executed, and completed) at a time • Program counter (instruction pointer) identifies the current instruction • Program counter is advanced sequentially except for control transfer instructions • Question: When is a value interpreted as an instruction?
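
The stored-program property is easy to make concrete. Below is a minimal sketch in Python (the memory layout and the three-field instruction encoding are invented for illustration, not any real ISA): instructions and data share one memory array, and a value acts as an instruction exactly when the program counter points at it.

# Minimal von Neumann interpreter: one unified memory holds both
# instructions and data; the program counter advances sequentially.
memory = [
    ("LOAD",  0, 100),   # r0 <- mem[100]
    ("ADD",   0, 101),   # r0 <- r0 + mem[101]
    ("STORE", 0, 102),   # mem[102] <- r0
    ("HALT",  0, 0),
] + [None] * 96 + [40, 2, 0]   # data region: mem[100..102]

regs, pc = [0] * 4, 0
while True:
    op, r, addr = memory[pc]   # fetch: interpretation depends on pc
    pc += 1                    # sequential advance (no branches here)
    if op == "LOAD":
        regs[r] = memory[addr]
    elif op == "ADD":
        regs[r] += memory[addr]
    elif op == "STORE":
        memory[addr] = regs[r]
    elif op == "HALT":
        break
print(memory[102])             # prints 42

Answering the slide's question in this sketch: mem[0] is "an instruction" only because pc reaches it; if pc were set to 100, the tuple unpacking would fail, which is the hazard of unified instruction/data memory.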

  5. THE DATA FLOW MODEL (OF A COMPUTER) • Von Neumann model: An instruction is fetched and executed in control flow order • As specified by the instruction pointer • Sequential unless explicit control flow instruction • Dataflow model: An instruction is fetched and executed in data flow order • i.e., when its operands are ready • i.e., there is no instruction pointer • Instruction ordering specified by data flow dependence • Each instruction specifies “who” should receive the result • An instruction can “fire” whenever all operands are received • Potentially many instructions can execute at the same time • Inherently more parallel

  6. VON NEUMANN VS DATAFLOW • Consider a Von Neumann program • What is the significance of the program order? • What is the significance of the storage locations? • Which model is more natural to you as a programmer? • Example program: v <= a + b; w <= b * 2; x <= v - w; y <= v + w; z <= x * y • [Figure: the same program shown both ways, labeled "Sequential" and "Dataflow": inputs a and b feed a + node and a *2 node, whose outputs feed a - node and a + node, which in turn feed a final * node producing z]
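
To make the contrast executable, here is a small sketch (the node table and names are invented) that runs the program above in dataflow order: each operation fires as soon as its operands have values; there is no program counter.

# Dataflow-order evaluation of the slide's example program.
ops = {                        # node: (function, source operands)
    "v": (lambda e: e["a"] + e["b"], ("a", "b")),
    "w": (lambda e: e["b"] * 2,      ("b",)),
    "x": (lambda e: e["v"] - e["w"], ("v", "w")),
    "y": (lambda e: e["v"] + e["w"], ("v", "w")),
    "z": (lambda e: e["x"] * e["y"], ("x", "y")),
}
env = {"a": 3, "b": 4}         # input tokens
pending = dict(ops)
while pending:                 # fire every node whose operands are ready
    ready = [n for n, (_, srcs) in pending.items()
             if all(s in env for s in srcs)]
    for n in ready:            # v and w (then x and y) could fire in parallel
        env[n] = pending.pop(n)[0](env)
print(env["z"])                # (7 - 8) * (7 + 8) = -15

Note that program order never appears: the ready set is exactly the dataflow firing rule, and the storage locations are just names for arcs.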

  7. MORE ON DATA FLOW • In a data flow machine, a program consists of data flow nodes • A data flow node fires (is fetched and executed) when all its inputs are ready • i.e., when all inputs have tokens • [Figure: a data flow node and its ISA representation]

  8. Dataflow Machine: Instruction Templates • Each arc in the graph has an operand slot in the program, guarded by presence bits • Template fields: Opcode | Operand 1 | Operand 2 | Destination 1 | Destination 2 • Templates for the example graph (inputs a and b enter nodes 1 and 2; node 2 multiplies by the constant 7; x and y name the arcs into node 5): 1: + → 3L, 4L; 2: *7 → 3R, 4R; 3: - → 5L; 4: + → 5R; 5: * → out • Dennis and Misunas, “A Preliminary Architecture for a Basic Data Flow Processor,” ISCA 1974 • One of the papers assigned for review next week

  9. Static Dataflow Machine (Dennis+, ISCA 1974) • Instruction templates numbered 1, 2, … hold (Op, dest1, dest2, p1, src1, p2, src2), where p1 and p2 are presence bits for the two operand slots • A pool of functional units (FUs) receives operand tokens <s1, p1, v1>, <s2, p2, v2> (slot, port, value), fires ready templates, and sends result tokens to the destination slots • Many such processors can be connected together • Programs can be statically divided among the processors • A sketch of this firing discipline follows
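
Here is a sketch of the template-and-presence-bit discipline from the last two slides (the Python encoding is invented; the graph and destinations are the slide's): each two-input node has one operand slot per arc, and a token arriving at a node whose partner slot is already full fires the node and forwards result tokens.

# Static dataflow firing for the 5-node example graph: one operand
# slot per arc, guarded by a presence bit (slot occupancy).
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}
prog = {1: ("+", [(3, "L"), (4, "L")]),   # opcode, destination slots
        2: ("*", [(3, "R"), (4, "R")]),   # node 2 multiplies by 7
        3: ("-", [(5, "L")]),
        4: ("+", [(5, "R")]),
        5: ("*", [("out", "L")])}
slots = {}                                # (node, port) -> waiting value
results = []

def send(node, port, value):
    if node == "out":
        results.append(value)
        return
    other = "R" if port == "L" else "L"
    if (node, other) in slots:            # partner present: fire the node
        partner = slots.pop((node, other))
        left, right = (value, partner) if port == "L" else (partner, value)
        op, dests = prog[node]
        for d_node, d_port in dests:
            send(d_node, d_port, OPS[op](left, right))
    else:
        slots[(node, port)] = value       # wait: set the presence bit

a, b = 3, 4
send(1, "L", a); send(1, "R", b)          # tokens for node 1 (+)
send(2, "L", b); send(2, "R", 7)          # node 2 (*) with constant 7
print(results[0])                         # (7 - 28) * (7 + 28) = -735

The single slots entry per arc is exactly the static machine's limitation: a second token arriving for an occupied slot would overwrite it, which is the re-entrancy hazard of slide 11.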

  10. Static Data Flow Machines • Mismatch between the model and the implementation • The model requires unbounded FIFO token queues per arc but the architecture provides storage for one token per arc • The architecture does not ensure FIFO order in the reuse of an operand slot • The static model does not support • Reentrant code • Function calls • Loops • Data Structures

  11. Problems with Re-entrancy • Assume this graph was in a loop, or in a function • And operations took variable time to execute • How do you ensure the tokens that match are of the same invocation? • [Figure: tokens x and y head for the same node; one path takes a long time while the other is blocked, so the x+1 node can fire with operands from different invocations]

  12. Monsoon Dataflow Processor 1990

  13. Dynamic Dataflow Architectures • Allocate instruction templates, i.e., a frame, dynamically to support each loop iteration and procedure call • Termination detection needed to deallocate frames • The code can be shared if we separate the code and the operand storage • A token becomes <fp, ip, port, data>, where fp is the frame pointer and ip is the instruction pointer (see the sketch below)
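
A sketch of the tagged-token idea (the matching store is invented; the token fields are the slide's): a waiting operand pairs only with a token carrying the same (fp, ip), so two in-flight invocations can never mix their tokens, which resolves the re-entrancy problem of slide 11.

# Tagged-token matching: a token is <fp, ip, port, data>; operands
# match only when BOTH the frame pointer and instruction pointer agree.
wait_store = {}                          # (fp, ip) -> (port, data)

def arrive(fp, ip, port, data):          # assume a two-input "+" node
    key = (fp, ip)
    if key in wait_store:                # partner with identical tag: fire
        _, partner = wait_store.pop(key)
        print(f"frame {fp}, instr {ip} fires: {data} + {partner} = {data + partner}")
    else:
        wait_store[key] = (port, data)   # wait for the matching tag

# Two loop iterations share the code (ip=3) but run in different frames.
arrive(fp=0, ip=3, port="L", data=10)    # iteration 0, first operand
arrive(fp=1, ip=3, port="L", data=99)    # iteration 1 arrives early: no mix-up
arrive(fp=0, ip=3, port="R", data=1)     # fires only against frame 0's 10
arrive(fp=1, ip=3, port="R", data=1)     # fires only against frame 1's 99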

  14. A Frame in Dynamic Dataflow • Program memory holds the shared instruction templates (1: + → 3L, 4L; 2: *7 → 3R, 4R; 3: - → 5L; 4: + → 5R; 5: * → out) • Each invocation gets its own frame for operand storage; a token <fp, ip, p, v> carries the frame pointer, instruction pointer, port, and value • Need to provide storage for only one operand per operator

  15. Monsoon Processor (ISCA 1990) • Instructions have the form: op r d1, d2 • Pipeline: Instruction Fetch (indexed by ip) → Operand Fetch (frame location fp+r) → ALU → Form Token → Network, with a Token Queue and Frame storage feeding back into the pipeline • Gregory M. Papadopoulos and David E. Culler, “Monsoon: An Explicit Token-Store Architecture,” ISCA 1990

  16. Procedure Linkage Operators • A call is built from get-frame, extract-tag, and change-tag operators: arguments a1 ... an have their tags changed from the caller's frame (frame 0) to the callee's (frame 1), fork into the graph for f, and the result's tag is changed back • Tokens in frame 0 (caller) and frame 1 (callee) coexist • Like a standard call/return, but caller and callee can be active simultaneously

  17. Loops and Function Calls Summary

  18. DATA FLOW CHARACTERISTICS • Data-driven execution of instruction-level graphical code • Nodes are operators • Arcs are data (I/O) • As opposed to control-driven execution • Only real dependencies constrain processing • No sequential I-stream • No program counter • Operations execute asynchronously • Execution triggered by the presence of data

  19. DATA FLOW ADVANTAGES/DISADVANTAGES • Advantages • Very good at exploiting irregular parallelism • Only real dependencies constrain processing • Disadvantages • Debugging difficult (no precise state) • Interrupt/exception handling is difficult (what is precise state semantics?) • Too much parallelism? (Parallelism control needed) • High bookkeeping overhead (tag matching, data storage) • Instruction cycle is inefficient (delay between dependent instructions), memory locality is not exploited

  20. DATA FLOW SUMMARY • Availability of data determines order of execution • A data flow node fires when its sources are ready • Programs represented as data flow graphs (of nodes) • Data Flow at the ISA level has not been (as) successful • Data Flow implementations under the hood (while preserving sequential ISA semantics) have been successful • Out of order execution • Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986.

  21. OOO EXECUTION: RESTRICTED DATAFLOW • An out-of-order engine dynamically builds the dataflow graph of a piece of the program • which piece? • The dataflow graph is limited to the instruction window • Instruction window: all decoded but not yet retired instructions • Can we do it for the whole program? • Why would we like to? • In other words, how can we have a large instruction window?
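
A sketch (tables invented) of how the engine recovers restricted dataflow from a sequential stream: a rename table remembers the latest in-window producer of each architectural register, so program order collapses into true data dependences.

# Building the dataflow graph of an instruction window via renaming.
window = [                     # (dest, src1, src2), in program order
    ("r1", "r2", "r3"),        # I0
    ("r4", "r1", "r3"),        # I1: true dependence on I0
    ("r1", "r5", "r6"),        # I2: independent (new value of r1)
    ("r7", "r1", "r4"),        # I3: depends on I2 and I1
]
producer = {}                  # arch register -> index of latest writer
edges = []                     # dataflow edges (producer -> consumer)
for i, (dest, s1, s2) in enumerate(window):
    for s in (s1, s2):
        if s in producer:      # only TRUE dependences become edges
            edges.append((producer[s], i))
    producer[dest] = i         # rename: later readers see this writer
print(edges)                   # [(0, 1), (2, 3), (1, 3)]

I0 and I2 get no edge between them despite both writing r1: renaming removes the false dependence, and the two chains can execute in dataflow order inside the window.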

  22. ANOTHER WAY OF EXPLOITING PARALLELISM • SIMD: • Concurrency arises from performing the same operations on different pieces of data • MIMD: • Concurrency arises from performing different operations on different pieces of data • Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing • Idea: Use multiple processors to solve a problem

  23. FLYNN’S TAXONOMY OF COMPUTERS • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 • SISD: Single instruction operates on single data element • SIMD: Single instruction operates on multiple data elements • Array processor • Vector processor • MISD: Multiple instructions operate on single data element • Closest form: systolic array processor, streaming processor • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) • Multiprocessor • Multithreaded processor

  24. ARRAY VS. VECTOR PROCESSORS • Both execute the same instruction stream: LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR • Array processor: same operation at the same time across all elements (LD0-LD3 together, then AD0-AD3, MU0-MU3, ST0-ST3); different operations occupy the same space at different times • Vector processor: the same operation is applied to successive elements over time in the same space, so different operations execute at the same time (element 1's ADD overlaps element 2's LD) • [Figure: time-space diagram of the two executions]
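
The slide's four-instruction vector program, simulated on plain Python lists (VR and A are stand-ins; no real ISA is modeled): one vector instruction replaces a four-iteration scalar loop, which is where the fetch-bandwidth and branch savings come from.

# The slide's vector program on a 4-element vector register.
A = [10, 20, 30, 40]                  # memory A[3:0]

VR = A[0:4]                           # LD  VR <- A[3:0]
VR = [v + 1 for v in VR]              # ADD VR <- VR, 1
VR = [v * 2 for v in VR]              # MUL VR <- VR, 2
A[0:4] = VR                           # ST  A[3:0] <- VR

print(A)                              # [22, 42, 62, 82]

An array processor would execute each line across four PEs in one step; a vector processor would stream the four elements through one pipelined unit, overlapping operations on successive elements.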

  25. SCALAR PROCESSING • Conventional form of processing (von Neumann model) add r1, r2, r3

  26. SIMD ARRAY PROCESSING • Array processor

  27. VECTOR PROCESSOR ADVANTAGES + No dependencies within a vector • Pipelining, parallelization work well • Can have very deep pipelines, no dependencies! + Each instruction generates a lot of work • Reduces instruction fetch bandwidth + Highly regular memory access pattern • Interleaving multiple banks for higher memory bandwidth • Prefetching + No need to explicitly code loops • Fewer branches in the instruction sequence

  28. VECTOR PROCESSOR DISADVANTAGES -- Works (only) if parallelism is regular (data/SIMD parallelism) ++ Vector operations -- Very inefficient if parallelism is irregular -- How about searching for a key in a linked list? Fisher, “Very Long Instruction Word Architectures and the ELI-512,” ISCA 1983.

  29. VECTOR PROCESSOR LIMITATIONS -- Memory (bandwidth) can easily become a bottleneck, especially if 1. compute/memory operation balance is not maintained 2. data is not mapped appropriately to memory banks

  30. VECTOR MACHINE EXAMPLE: CRAY-1 • CRAY-1 • Russell, “The CRAY-1 computer system,” CACM 1978. • Scalar and vector modes • 8 64-element vector registers • 64 bits per element • 16 memory banks • 8 64-bit scalar registers • 8 24-bit address registers

  31. AMDAHL'S LAW: BOTTLENECK ANALYSIS • Speedup = time_without_enhancement / time_with_enhancement • Suppose an enhancement speeds up a fraction f of a task by a factor of S: time_enhanced = time_original · (1 - f) + time_original · (f / S), so Speedup_overall = 1 / ((1 - f) + f / S) • Focus on bottlenecks with large f (and large S) • [Figure: time_original split into portions (1 - f) and f; time_enhanced split into (1 - f) and f/S]
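
The law in executable form, with evaluations showing why a large f matters more than a large S:

# Amdahl's Law: speed up a fraction f of a task by a factor of S.
def speedup(f, S):
    return 1.0 / ((1.0 - f) + f / S)

print(speedup(f=0.50, S=10))   # ~1.82x: half the task sped up 10x
print(speedup(f=0.95, S=10))   # ~6.90x: the bottleneck fraction dominates
print(speedup(f=0.50, S=1e9))  # ~2.00x: no S can beat 1 / (1 - f)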

  32. FLYNN’S TAXONOMY OF COMPUTERS • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 • SISD: Single instruction operates on single data element • SIMD: Single instruction operates on multiple data elements • Array processor • Vector processor • MISD: Multiple instructions operate on single data element • Closest form: systolic array processor, streaming processor • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) • Multiprocessor • Multithreaded processor

  33. SYSTOLIC ARRAYS

  34. WHY SYSTOLIC ARCHITECTURES? • Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory • Similar to an assembly line of processing elements • Different people work on the same car • Many cars are assembled simultaneously • Why? Special purpose accelerators/architectures need • Simple, regular design (keep # unique parts small and regular) • High concurrency → high performance • Balanced computation and I/O (memory) bandwidth

  35. SYSTOLIC ARRAYS • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982. • Analogy: the memory is the heart and the PEs are the cells; memory “pulses” data through the cells

  36. SYSTOLIC ARCHITECTURES • Basic principle: Replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs • Balance computation and memory bandwidth • Differences from pipelining: • These are individual PEs • Array structure can be non-linear and multi-dimensional • PE connections can be multidirectional (and of different speeds) • PEs can have local memory and execute kernels (rather than a piece of the instruction) • A sketch of a linear systolic FIR array follows
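
Here is a sketch of a linear FIR array in that spirit (sizes and encoding invented). It follows the variant of Kung's FIR designs in which weights stay resident, the current input is fed to every PE, and partial sums pulse one PE per cycle; Kung classifies broadcast-input designs as semi-systolic. Memory supplies one value per cycle, yet every PE does useful work on it.

# Systolic-style FIR: y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2].
w = [1, 2, 3]                       # one weight resident per PE
x = [3, 1, 4, 1, 5, 9]              # input stream from memory
K, N = len(w), len(x)

acc = [0] * K                       # partial sums inside the pipeline
out = []
for t in range(N):                  # one "pulse" of the array per cycle
    xt = x[t]                       # memory supplies ONE value per cycle
    for j in reversed(range(K)):    # right-to-left: acc[j-1] is last cycle's
        prev = acc[j - 1] if j > 0 else 0
        acc[j] = prev + w[j] * xt   # each PE: multiply-accumulate, pass on
    if t >= K - 1:
        out.append(acc[K - 1])      # a finished y[t - K + 1] exits the array
print(out)                          # [17, 12, 21, 38]

This is the balance the slide describes: compute scales with the number of PEs (K multiply-accumulates per cycle) while memory bandwidth stays at one word per cycle.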
