CGRA Express: Accelerating Execution using Dynamic Operation Fusion

CCCP Research Group, University of Michigan
Yongjun Park, Hyunchul Park, Scott Mahlke

Coarse-Grained Reconfigurable Architecture (CGRA)
• Array of processing elements (PEs) connected by a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost and low power consumption
• High flexibility with dynamic reconfiguration

CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• MorphoSys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW
[Figure: MorphoSys, SiliconHive, and ADRES, with reported figures of Viterbi at 80 Mbps, H.264 at 30 fps, and 50-60 MOps/mW.]

Performance Bottleneck: Acyclic Code
[Figure: application execution time broken down by basic block (Block 0 to Block 5), comparing a normal schedule with a software-pipelined schedule. Original: loop region dominant. Software pipelined: acyclic region dominant.]
• Acyclic region is substantial! It's time to optimize acyclic code.

Key Idea: Chaining Instructions
1. The clock period is set by the longest operation plus register file access.
2. A CGRA is not a VLIW. Register file access is not frequent!
3. This leaves an opportunity for instruction chaining.
4. Register access time is considerable: roughly equal to an arithmetic operation's delay (3.5 ns clock period in IBM 90 nm).
[Figure labels: critical path is slow; non-critical path is fast.]

Dynamic Operation Fusion
• Execute multiple dependent operations in one cycle
• Key benefits
  1. Minimal hardware overhead
  2. Multiple subgraphs can be executed simultaneously
  3. Dynamic merging of FUs
• Assumption: instruction time = RF read time = RF write time
[Figure: a chain of three dependent operations (ADD, ADD, LSR) with inputs A and B and constants 512 and 10, mapped onto a 4x4 CGRA alongside LD and MUL units. Current: 3 cycles. Operation fusion: 1 cycle.]

Hardware Support
• Simple bypass network
• Small overhead: 3.8% (SRAM), 2.3% (MUX)

Compiler Support
• Tick-based scheduling
  • Tick: small time unit based on hardware delay information
  • Clock cycle = # of ticks
• Clock boundary constraint checking
  • Resource conflict
  • Time conflict
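The tick idea can be illustrated with a minimal sketch. It assumes a single chain of dependent operations, ignores register file access and routing delays, and uses made-up tick counts; the names `schedule_chain` and `cycles_used` are hypothetical and this is not the authors' scheduler. It only shows how a clock boundary check decides which dependent operations may share a cycle, using the ADD, ADD, LSR chain from the Dynamic Operation Fusion slide.

```python
def schedule_chain(ops, ticks_per_cycle, fuse=True):
    """Place a chain of dependent ops, given as (name, delay_ticks) pairs.

    Returns a list of (name, cycle, start_tick). With fuse=False every op
    gets its own cycle (conventional scheduling). With fuse=True an op is
    chained into the current cycle unless it would cross the clock boundary.
    """
    placed = []
    cycle, tick = 0, 0
    for index, (name, delay) in enumerate(ops):
        if delay > ticks_per_cycle:
            raise ValueError(f"{name} does not fit in one clock cycle")
        # Clock boundary constraint: an op may not straddle two cycles.
        crosses_boundary = tick + delay > ticks_per_cycle
        if index > 0 and (not fuse or crosses_boundary):
            cycle += 1
            tick = 0
        placed.append((name, cycle, tick))
        tick += delay
    return placed


def cycles_used(placed):
    """Number of clock cycles occupied by a placement."""
    return placed[-1][1] + 1 if placed else 0


# Hypothetical delays in ticks; real values would come from hardware
# delay information for the target design.
chain = [("ADD", 3), ("ADD", 3), ("LSR", 2)]

conventional = schedule_chain(chain, ticks_per_cycle=10, fuse=False)
fused = schedule_chain(chain, ticks_per_cycle=10, fuse=True)
print(cycles_used(conventional), cycles_used(fused))
```

With these assumed delays the chain takes three cycles without fusion and one cycle with it, matching the slide's figure. Shrinking `ticks_per_cycle` to 5 forces the second ADD into a new cycle, which is the boundary check at work.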

Dynamic Operation Fusion Example (1)
1. Conventional scheduling: 5 cycles
[Figure: dataflow graph, schedule table, and CGRA mapping. The graph reads RF[0] and RF[1] plus constants, contains the operations SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), and ADD(5), and writes RF[2]. OP 0 to OP 5 are placed on the array next to the register file.]

Dynamic Operation Fusion Example (2)
2. Dynamic operation fusion: 3 cycles
[Figure: the same dataflow graph (RF[0], RF[1], constants, SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5), RF[2]) with its schedule table and CGRA mapping under operation fusion.]

Experimental Setup
• Benchmarks: multimedia applications for embedded systems
  • Audio decoding (AAC)
  • Video decoding (H.264)
  • 3D graphics (3D)
• Two designs
  • baseline: 4x4 heterogeneous CGRA
  • express: 4x4 heterogeneous CGRA with bypass network

Performance Enhancement
• Express achieves a 7-17% reduction in execution time.
  • Most of the reduction comes from the acyclic code region.
• Express also improves the performance of resource-constrained loops.
  • The bypass network gives the compiler more freedom.

Detailed Result for 3D Graphics
• Target application: 3D graphics
• Power consumption: 3% higher than the baseline
• Performance enhancement: 17% faster than the baseline
• Energy consumption: 15% more efficient
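The energy figure is consistent with the other two numbers if energy is taken as power times execution time. The short check below assumes that "17% faster" means a 17% reduction in execution time (as on the Performance Enhancement slide) and that the 3% power increase is an average over the run; the function name `relative_energy` is hypothetical.

```python
def relative_energy(power_ratio, time_ratio):
    """Energy relative to the baseline, using energy = power * time."""
    return power_ratio * time_ratio


power_ratio = 1.03        # 3% higher power than the baseline
time_ratio = 1.0 - 0.17   # assumed: 17% less execution time
energy_ratio = relative_energy(power_ratio, time_ratio)
energy_saving_percent = (1.0 - energy_ratio) * 100.0

print(f"relative energy: {energy_ratio:.3f}")
print(f"energy saving: {energy_saving_percent:.1f}%")
```

Under these assumptions the express design uses about 0.855 of the baseline energy, a saving of roughly 14.5%, which rounds to the 15% quoted on the slide.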

Conclusion
• The acyclic region becomes the performance bottleneck.
  • The run time for loops decreases by large factors.
• Dynamic operation fusion executes back-to-back dependent operations in a single cycle.
  • Bypass network
  • Tick-based scheduler
• Up to 17% faster and 15% more energy efficient with 3% hardware overhead

Questions?
