
The GAP Processor: A Processor with a Two-dimensional Execution Unit


Presentation Transcript


  1. The GAP Processor: A Processor with a Two-dimensional Execution Unit. Sascha Uhrig, Basher Shehan, Ralf Jahr, University of Augsburg

  2. Outline • Motivation • Basic Idea and Architecture • Optimizations • Evaluation • Summary & Future Work The GAP Processor

  3. Motivation • Current processor design • focuses on multi- and many-core designs • further increases in clock frequency are no longer feasible • neglects single execution streams • Reconfigurable (processor) systems • increase sequential program performance • but require special knowledge and/or special tools (design time) • and special binary files (runtime) • What about an RA-like processor? • fetch/decode stages like a processor • execution stage like a reconfigurable array (RA)

  4. Basic Idea and Architecture • Starting point: conventional instruction stream • Instructions form a (dynamic) dataflow graph • Many instructions are executed multiple times • Use the program's DFG for execution in an RA • by on-the-fly configuration • no special tools are required (we use GCC) • but performance can be improved by special tools • What we have done: • cycle/signal-accurate simulator • the simulator uses the PISA instruction set architecture, well known from the SimpleScalar simulator • optimization tool for improved mapping • post-linker tool to optimize existing binary files
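The on-the-fly DFG construction described on this slide can be sketched by tracking, for each register, the last instruction that wrote it. This is an illustrative sketch, not the authors' simulator code; the instruction tuple format is an assumption:

```python
# Build a dynamic dataflow graph from a linear instruction stream by
# recording, for every register, the index of its last writing instruction.
# Instruction format (hypothetical): (opcode, dest_reg, [src_regs]).

def build_dfg(instructions):
    last_writer = {}   # register name -> index of last writing instruction
    edges = []         # (producer index, consumer index, register) RAW edges
    for i, (op, dest, srcs) in enumerate(instructions):
        for src in srcs:
            if src in last_writer:               # true (RAW) data dependence
                edges.append((last_writer[src], i, src))
        if dest is not None:
            last_writer[dest] = i                # this instruction now defines dest
    return edges

stream = [
    ("lw",  "r2", ["r4"]),
    ("add", "r3", ["r2", "r5"]),
    ("sub", "r4", ["r3", "r2"]),
]
print(build_dfg(stream))   # [(0, 1, 'r2'), (1, 2, 'r3'), (0, 2, 'r2')]
```

The resulting edges are exactly what the configuration stage needs to place each instruction below its producers in the array.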

  5. Basic Idea and Architecture • The Grid-ALU-Processor (GAP) • superscalar processor frontend • 2D backend of configurable ALUs with simple routing • Frontend: • fetches standard instructions (currently 4 per cycle) • decoding + generation of configuration data incl. timing • goes to idle mode when the array is completely configured • Backend: • dataflow from north to south • each architectural register is represented by a column of ALUs • length of the array is not predefined • Performance: • loops should be configured once and executed multiple times • execution in the array is asynchronous, synchronized with the synchronous pipeline parts

  6. Basic Idea and Architecture [Block diagram: superscalar frontend (Fetch Stage, Decode Stage, Configuration Stage, with Instruction Cache, Data Cache, and Register File) feeding a Reconfigurable Functional Unit: rows of registers above a grid of ALUs with Load/Store units per row, connected by a Mapping Network, Forward Network, Line Network, and Backward Routing Network, plus Branch Control and Data Cache Management]

  7. Basic Instruction Mapping • Column is given by the target register • Row depends on • the last write to the source registers (must be below it) • the last read of the target register (the value in the column must not be overwritten early) • preceding conditional branches • Load/store instructions are enumerated • stores are executed in order • conditional branches preceding a store must be resolved • Timing is determined by counting the delay • instructions are assumed to require a certain number of pico-cycles (a fraction of a pipeline clock cycle)
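The three row constraints above can be sketched as a small placement function: the row is the maximum over "below the last write to each source", "below the last read of the target", and "below the last conditional branch". This is our illustrative reconstruction of the rule, with hypothetical bookkeeping names, not the actual configuration logic:

```python
# Hypothetical sketch of GAP's row placement: column = target register,
# row = one below the deepest of (a) last writes to sources, (b) last read
# of the target, (c) the last preceding conditional branch.

def place(instr, last_write_row, last_read_row, last_branch_row):
    dest, srcs = instr
    row = last_branch_row + 1                            # (c) below cond. branches
    for s in srcs:
        row = max(row, last_write_row.get(s, -1) + 1)    # (a) below source producers
    row = max(row, last_read_row.get(dest, -1) + 1)      # (b) below last read of dest
    last_write_row[dest] = row                           # bookkeeping: dest written here
    for s in srcs:
        last_read_row[s] = max(last_read_row.get(s, -1), row)
    return dest, row                                     # (column, row) in the array

writes, reads = {}, {}
print(place(("r3", ["r2"]), writes, reads, -1))   # ('r3', 0)
print(place(("r2", ["r3"]), writes, reads, -1))   # ('r2', 1): must wait for r3
```

The second instruction lands one row below the first both because it reads r3 (written in row 0) and because r2 was read in row 0 and must not be overwritten there.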

  8. Basic Instruction Mapping [Figure: a dataflow graph of Add, Sub, Slt, Lw, Shift, And, and Bne instructions from the processor frontend mapped onto the array, one row per cycle (cycles 1-6), into columns for registers r2, r3, r4, r16, and r31]

  9. Binary Optimization (Optional) • Better distribution of used registers • Static speculative execution of instructions • moving parts of the instructions of a basic block into the preceding basic block • results are used or discarded, depending on the control flow • the calculation overhead does not matter • Inlining • reduces the number of function calls, each requiring an indirect jump at the return, which leads to an array flush
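The static speculation step can be sketched as hoisting side-effect-free instructions from a branch successor into the preceding block, so they execute unconditionally and their results are simply discarded on the other path. This is our simplified reading of the optimization (it ignores register liveness on the other path for brevity); the SAFE set and tuple format are assumptions:

```python
# Sketch of static speculative code motion: side-effect-free instructions
# at the top of a successor block are moved into the preceding basic block.
# Instruction format (hypothetical): (opcode, dest_reg, [src_regs]).

SAFE = {"add", "sub", "and", "or", "sll", "slt"}   # no memory/control effects

def hoist_speculative(pred_block, succ_block):
    hoisted, remaining = [], []
    defined = set()            # registers defined earlier inside succ_block
    for op, dest, srcs in succ_block:
        # hoistable only if safe and independent of earlier succ-block results
        if op in SAFE and not (set(srcs) & defined):
            hoisted.append((op, dest, srcs))
        else:
            remaining.append((op, dest, srcs))
        defined.add(dest)
    return pred_block + hoisted, remaining
```

For example, an `add` at the top of a successor block moves into the predecessor, while a `lw` (memory side effect) and anything depending on it stay behind.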

  10. Challenges of the Basic GAP Architecture • 32 registers require 32 columns • with 32 rows => 1024 ALUs • long wire delay from the leftmost to the rightmost ALU • 30 ALUs between the border ALUs • wire delay must be taken into account at configuration • high fan-out and fan-in • possible connections from all ALUs in row n-1 to row n • poor utilization (<2%) • only columns whose corresponding register is a target register are occupied • dependencies prevent using all ALUs in a column
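A quick back-of-the-envelope check of the numbers on this slide; the occupied-ALU count is an illustrative assumption chosen to match the stated <2% figure, not a measured value:

```python
# The full 32x32 array and the utilization implied by the slide's <2% figure.
cols, rows = 32, 32
alus = cols * rows
print(alus)                              # 1024 ALUs

occupied = 20                            # hypothetical number of mapped instructions
utilization = 100 * occupied / alus
print(round(utilization, 2))             # 1.95 (percent), i.e. below 2%
```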

  11. Hardware Optimizations • Decoupling of registers and columns • technique similar to register renaming in out-of-order processors • Introduction of configuration layers • compensate for the smaller array size • fast call/return of already configured functions • longer loop bodies can be mapped to multiple layers • Reducing the interconnection network • decrease the number of horizontal busses to 2/3 • Using a single configuration layer multiple times • dynamic horizontal segmentation of the array is possible • increases the number of virtual layers
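Configuration layers behave like a small cache of already-mapped code regions: re-entering a region (e.g. a function call or loop) whose configuration is still resident skips remapping entirely. A hedged sketch of that idea, assuming an LRU replacement policy and a hypothetical `configure` callback (the slide does not specify either):

```python
# Sketch of configuration layers as an LRU cache of mapped code regions,
# keyed by entry address. A hit reuses the stored configuration (fast
# call/return); a miss maps on the fly and evicts the oldest layer.
from collections import OrderedDict

class LayerCache:
    def __init__(self, layers=16):
        self.layers = layers
        self.cache = OrderedDict()       # entry PC -> layer configuration

    def lookup(self, pc, configure):
        if pc in self.cache:             # already configured: reuse instantly
            self.cache.move_to_end(pc)
            return self.cache[pc], True
        config = configure(pc)           # fall back to on-the-fly mapping
        if len(self.cache) >= self.layers:
            self.cache.popitem(last=False)   # evict least recently used layer
        self.cache[pc] = config
        return config, False

c = LayerCache(layers=2)
print(c.lookup(0x400, lambda pc: f"cfg@{pc:#x}"))   # ('cfg@0x400', False)
print(c.lookup(0x400, lambda pc: f"cfg@{pc:#x}"))   # ('cfg@0x400', True)
```

Using a single layer multiple times via horizontal segmentation, as the slide mentions, would effectively raise `layers` without adding storage.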

  12. Evaluation • Based on instructions per cycle (IPC) • Standard benchmarks • selected integer benchmarks from the MiBench suite • Compiled for the SimpleScalar simulator • GCC, PISA instruction set • Directly comparable to SimpleScalar • 8-way superscalar out-of-order processor • Binary optimization optional, for GAP only

  13. Evaluation [Chart] Always 32 rows, only a single configuration layer

  14. Evaluation [Chart] Array size 4x4, different number of layers

  15. Evaluation [Chart: per-benchmark IPC changes between 0% and about -1.9%] Comparison of full interconnect and 2/3 busses only

  16. Evaluation [Chart] Speedup of the static speculation technique

  17. Conclusion • Basic GAP architecture • Challenges of GAP and solutions • Some evaluations • GAP speeds up conventional sequential instruction streams without additional tooling • GAP is scalable and preserves compatibility! • Future work: • evaluation of chip size and power consumption • precise exceptions • interrupts

  18. More Crazy Ideas… • Multi-core/multithreaded GAP with dynamic partitioning [Diagram: two frontends (Fetch, Decode, and Configuration Stages) sharing a single dynamically partitioned grid of registers and ALUs]

  19. Thank you! Questions, discussion….
