
The GAP Processor: A Processor with a Two-dimensional Execution Unit


Presentation Transcript


  1. The GAP Processor: A Processor with a Two-dimensional Execution Unit. Sascha Uhrig, Basher Shehan, Ralf Jahr, University of Augsburg

  2. Outline • Motivation • Basic Idea and Architecture • Optimizations • Evaluation • Summary & Future Work The GAP Processor

  3. Motivation • Current processor design • focuses on multi- and many-core designs • further increases in clock frequency are no longer feasible • neglects single execution streams • Reconfigurable (processor) systems • increase sequential program performance • but require special knowledge and/or special tools (design time) • and special binary files (runtime) • What about an RA-like processor? • fetch/decode stages like a processor • execution stage like a reconfigurable array (RA)

  4. Basic Idea and Architecture • Starting point: conventional instruction stream • Instructions form a (dynamic) dataflow graph • Many instructions are executed multiple times • Use the program's DFG for execution in an RA • by on-the-fly configuration • no special tools are required (we use GCC) • but performance can be improved by special tools • What we have done: • cycle/signal-accurate simulator • the simulator uses the PISA instruction set architecture, well known from the SimpleScalar simulator • optimization tool for improved mapping • post-linker tool to optimize existing binary files
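The on-the-fly DFG construction described on this slide can be sketched by tracking, for each register, the last instruction that wrote it. This is an illustrative sketch, not the authors' simulator code; the instruction tuple format is an assumption:

```python
# Build a dynamic dataflow graph from a linear instruction stream by
# recording, for every register, the index of its last writing instruction.
# Instruction format (hypothetical): (opcode, dest_reg, [src_regs]).

def build_dfg(instructions):
    last_writer = {}   # register name -> index of last writing instruction
    edges = []         # (producer index, consumer index, register) RAW edges
    for i, (op, dest, srcs) in enumerate(instructions):
        for src in srcs:
            if src in last_writer:               # true (RAW) data dependence
                edges.append((last_writer[src], i, src))
        if dest is not None:
            last_writer[dest] = i                # this instruction now defines dest
    return edges

stream = [
    ("lw",  "r2", ["r4"]),
    ("add", "r3", ["r2", "r5"]),
    ("sub", "r4", ["r3", "r2"]),
]
print(build_dfg(stream))   # [(0, 1, 'r2'), (1, 2, 'r3'), (0, 2, 'r2')]
```

The resulting edges are exactly what the configuration stage needs to place each instruction below its producers in the array.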

  5. Basic Idea and Architecture • The Grid-ALU-Processor (GAP) • superscalar processor frontend • 2D backend of configurable ALUs with simple routing • Frontend: • fetches standard instructions (currently 4 per cycle) • decoding + generation of configuration data incl. timing • goes to idle mode when the array is completely configured • Backend: • dataflow from north to south • each architectural register is represented by a column of ALUs • length of the array is not predefined • Performance: • loops should be configured once and executed multiple times • execution in the array is asynchronous, synchronized with the synchronous pipeline parts

  6. Basic Idea and Architecture [Block diagram: superscalar frontend (Fetch Stage, Decode Stage, Configuration Stage, with Instruction Cache, Data Cache, and Register File) feeding a Reconfigurable Functional Unit: rows of registers above a grid of ALUs with Load/Store units per row, connected by a Mapping Network, Forward Network, Line Network, and Backward Routing Network, plus Branch Control and Data Cache Management]

  7. Basic Instruction Mapping • Column is given by the target register • Row depends on • the last write to the source registers (must be below it) • the last read of the target register (the value in the column must not be overwritten early) • preceding conditional branches • Load/store instructions are enumerated • stores are executed in order • conditional branches preceding a store must be resolved • Timing is determined by counting the delay • instructions are assumed to require a certain number of pico-cycles (a fraction of a pipeline clock cycle)
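The three row constraints above can be sketched as a small placement function: the row is the maximum over "below the last write to each source", "below the last read of the target", and "below the last conditional branch". This is our illustrative reconstruction of the rule, with hypothetical bookkeeping names, not the actual configuration logic:

```python
# Hypothetical sketch of GAP's row placement: column = target register,
# row = one below the deepest of (a) last writes to sources, (b) last read
# of the target, (c) the last preceding conditional branch.

def place(instr, last_write_row, last_read_row, last_branch_row):
    dest, srcs = instr
    row = last_branch_row + 1                            # (c) below cond. branches
    for s in srcs:
        row = max(row, last_write_row.get(s, -1) + 1)    # (a) below source producers
    row = max(row, last_read_row.get(dest, -1) + 1)      # (b) below last read of dest
    last_write_row[dest] = row                           # bookkeeping: dest written here
    for s in srcs:
        last_read_row[s] = max(last_read_row.get(s, -1), row)
    return dest, row                                     # (column, row) in the array

writes, reads = {}, {}
print(place(("r3", ["r2"]), writes, reads, -1))   # ('r3', 0)
print(place(("r2", ["r3"]), writes, reads, -1))   # ('r2', 1): must wait for r3
```

The second instruction lands one row below the first both because it reads r3 (written in row 0) and because r2 was read in row 0 and must not be overwritten there.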

  8. Basic Instruction Mapping [Figure: a dataflow graph of Add, Sub, Slt, Lw, Shift, And, and Bne instructions from the processor frontend mapped onto the array, one row per cycle (cycles 1-6), into columns for registers r2, r3, r4, r16, and r31]

  9. Binary Optimization (Optional) • Better distribution of used registers • Static speculative execution of instructions • moving parts of the instructions of a basic block into the preceding basic block • results are used or discarded, depending on the control flow • the calculation overhead does not matter • Inlining • reduces the number of function calls, each requiring an indirect jump at the return, which leads to an array flush
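The static speculation step can be sketched as hoisting side-effect-free instructions from a branch successor into the preceding block, so they execute unconditionally and their results are simply discarded on the other path. This is our simplified reading of the optimization (it ignores register liveness on the other path for brevity); the SAFE set and tuple format are assumptions:

```python
# Sketch of static speculative code motion: side-effect-free instructions
# at the top of a successor block are moved into the preceding basic block.
# Instruction format (hypothetical): (opcode, dest_reg, [src_regs]).

SAFE = {"add", "sub", "and", "or", "sll", "slt"}   # no memory/control effects

def hoist_speculative(pred_block, succ_block):
    hoisted, remaining = [], []
    defined = set()            # registers defined earlier inside succ_block
    for op, dest, srcs in succ_block:
        # hoistable only if safe and independent of earlier succ-block results
        if op in SAFE and not (set(srcs) & defined):
            hoisted.append((op, dest, srcs))
        else:
            remaining.append((op, dest, srcs))
        defined.add(dest)
    return pred_block + hoisted, remaining
```

For example, an `add` at the top of a successor block moves into the predecessor, while a `lw` (memory side effect) and anything depending on it stay behind.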

  10. Challenges of the Basic GAP Architecture • 32 registers require 32 columns • with 32 rows => 1024 ALUs • long wire delay from the leftmost to the rightmost ALU • 30 ALUs between the border ALUs • wire delay must be taken into account at configuration • high fan-out and fan-in • possible connections from all ALUs in row n-1 to row n • poor utilization (<2%) • only columns whose corresponding register is a target register are occupied • dependencies prevent using all ALUs in a column
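A quick back-of-the-envelope check of the numbers on this slide; the occupied-ALU count is an illustrative assumption chosen to match the stated <2% figure, not a measured value:

```python
# The full 32x32 array and the utilization implied by the slide's <2% figure.
cols, rows = 32, 32
alus = cols * rows
print(alus)                              # 1024 ALUs

occupied = 20                            # hypothetical number of mapped instructions
utilization = 100 * occupied / alus
print(round(utilization, 2))             # 1.95 (percent), i.e. below 2%
```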

  11. Hardware Optimizations • Decoupling of registers and columns • technique similar to register renaming in out-of-order processors • Introduction of configuration layers • compensate for the smaller array size • fast call/return of already configured functions • longer loop bodies can be mapped to multiple layers • Reducing the interconnection network • decrease the number of horizontal busses to 2/3 • Using a single configuration layer multiple times • dynamic horizontal segmentation of the array is possible • increases the number of virtual layers
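Configuration layers behave like a small cache of already-mapped code regions: re-entering a region (e.g. a function call or loop) whose configuration is still resident skips remapping entirely. A hedged sketch of that idea, assuming an LRU replacement policy and a hypothetical `configure` callback (the slide does not specify either):

```python
# Sketch of configuration layers as an LRU cache of mapped code regions,
# keyed by entry address. A hit reuses the stored configuration (fast
# call/return); a miss maps on the fly and evicts the oldest layer.
from collections import OrderedDict

class LayerCache:
    def __init__(self, layers=16):
        self.layers = layers
        self.cache = OrderedDict()       # entry PC -> layer configuration

    def lookup(self, pc, configure):
        if pc in self.cache:             # already configured: reuse instantly
            self.cache.move_to_end(pc)
            return self.cache[pc], True
        config = configure(pc)           # fall back to on-the-fly mapping
        if len(self.cache) >= self.layers:
            self.cache.popitem(last=False)   # evict least recently used layer
        self.cache[pc] = config
        return config, False

c = LayerCache(layers=2)
print(c.lookup(0x400, lambda pc: f"cfg@{pc:#x}"))   # ('cfg@0x400', False)
print(c.lookup(0x400, lambda pc: f"cfg@{pc:#x}"))   # ('cfg@0x400', True)
```

Using a single layer multiple times via horizontal segmentation, as the slide mentions, would effectively raise `layers` without adding storage.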

  12. Evaluation • Based on instructions per cycle (IPC) • Standard benchmarks • selected integer benchmarks from the MiBench suite • Compiled for the SimpleScalar simulator • GCC, PISA instruction set • Directly comparable to SimpleScalar • 8-way superscalar out-of-order processor • Binary optimization optional, for GAP only

  13. Evaluation [Chart] Always 32 rows, only a single configuration layer

  14. Evaluation [Chart] Array size 4x4, different number of layers

  15. Evaluation [Chart: per-benchmark IPC changes between 0% and about -1.9%] Comparison of full interconnect and 2/3 busses only

  16. Evaluation [Chart] Speedup of the static speculation technique

  17. Conclusion • Basic GAP architecture • Challenges of GAP and solutions • Some evaluations • GAP speeds up conventional sequential instruction streams without additional tooling • GAP is scalable and preserves compatibility! • Future work: • evaluation of chip size and power consumption • precise exceptions • interrupts

  18. More Crazy Ideas… • Multi-core/multithreaded GAP with dynamic partitioning [Diagram: two frontends (Fetch, Decode, and Configuration Stages) sharing a single dynamically partitioned grid of registers and ALUs]

  19. Thank you! Questions, discussion….
