Architecture and Compilation for Data Bandwidth Improvement in Configurable Embedded Processors

Jason Cong, Guoling Han, Zhiru Zhang • VLSI CAD Lab, Computer Science Department, University of California, Los Angeles • Supported by NSF, GSRC, Altera, and Xilinx.

Outline • Motivation: limited data bandwidth has become the performance bottleneck of instruction-set extensible processors • Architectural extension: hash-mapped shadow registers • Associated compilation techniques: shadow register binding and hash function generation • Experimental results • Conclusions UCLA VLSICAD LAB

Target Reconfigurable Platform • General-purpose processor (GPP) core + programmable fabric • Loosely coupled, as a coprocessor (e.g., Xilinx MicroBlaze) • Tightly integrated, as extra function units in application-specific instruction-set processors • The GPP can extend its basic instruction set • The programmable fabric implements the customized instructions • Examples: Altera Nios / Nios II, Tensilica Xtensa, etc. • [Figure: custom instruction logic for Nios II; source: www.altera.com]

Target Core Processor Model • Core processor model • Classic single-issue pipelined RISC core (fetch / decode / execute / write-back) • The number of input and output operands of an instruction is predetermined • A custom instruction cannot execute until all of its input operands are available • A custom instruction reads the core register file during the execute stage and commits its result during the write-back stage

Data Bandwidth Problem • Fact: about 60% of the speedup comes from clusters with more than two inputs [P. Ienne et al] • Architecture problem: limited register file bandwidth (two read ports, one write port) • One solution: introduce state registers and move instructions to load the extra operands [F. Sun et al, ICCAD’02] • With the extra move instructions, our previous study observed a 36% drop in speedup on average [Cong et al, FPGA’05] • Example (inputs a, b, c, d, e; * takes 2 clock cycles, + takes 1 clock cycle) • Original code: t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4; • With custom instructions extop1 and extop2 (speedup: 1.8): t1 = extop1(a, b, c); t2 = extop2(b, c, d, e); t3 = t1 + t2; • With the extra move instructions (speedup: 1.125): mov(c); t1 = extop1(a, b, c); mov(d); mov(e); t2 = extop2(b, c, d, e); t3 = t1 + t2;
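
The two speedup figures on this slide can be reproduced with simple cycle counting. The sketch below is only a sanity check: the multiply and add latencies come from the slide, while the 2-cycle latency for each custom instruction and the 1-cycle latency for each mov are assumptions chosen because they match the reported numbers on a single-issue core.

```python
# Latencies for "mul" and "add" are taken from the slide.
# ASSUMPTION (not stated in the source): each custom instruction takes
# 2 cycles and each mov takes 1 cycle.
LATENCY = {"mul": 2, "add": 1, "extop": 2, "mov": 1}


def cycles(ops):
    """Total cycles when instructions execute one after another."""
    return sum(LATENCY[op] for op in ops)


baseline = ["mul", "mul", "mul", "add", "add", "add"]
ideal = ["extop", "extop", "add"]
with_moves = ["mov", "extop", "mov", "mov", "extop", "add"]

speedup_ideal = cycles(baseline) / cycles(ideal)       # 9 / 5
speedup_moves = cycles(baseline) / cycles(with_moves)  # 9 / 8

print(speedup_ideal, speedup_moves)
```

Under these assumptions the three extra moves cost three cycles, which is what pulls the speedup from 1.8 down to 1.125.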

Existing Architecture Solutions • Multiport register file • Low utilization when executing basic instructions • Extra address encoding space in the instruction word • Area and power grow cubically [S. Rixner et al, HPCA’00] • Register file replication • Complete or partial register file copy [Chimaera: S. Hauck et al, TVLSI’04] • Power inefficient • Predetermined one-to-one correspondence • Limited compiler optimization

Previous Approach – Shadow Registers (1) • Core registers are augmented by an extra set of shadow registers [Cong et al, 2005] • Conditionally written • Read only by the custom logic

Previous Approach – Shadow Registers (2) • Controlling three shadow registers • Two bits must be added to or encoded in the instruction format • Advantage • Provides opportunities for compiler optimization • How can the shadow registers be bound effectively to maximize the performance gain? • Limitation • ⌈log2(K+1)⌉ bits are required for K shadow registers • Only a small number of shadow registers is practical

Proposed Approach: Hash-Mapped Shadow Registers Scheme • Shadow registers with a single control bit • Control bit = 1 means copy the data; 0 means skip • A hashing unit determines the mapping between core registers and shadow registers • That is, the execution result written to register R[i] in the core register file is conditionally copied to register SR[j] in the shadow register set, where j = hash(i).
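
The write-back rule on this slide can be modeled in a few lines. This is a minimal behavioral sketch, not the hardware design: the class name `RegisterFiles`, the table-based hash, and the register counts are all hypothetical choices made for illustration.

```python
class RegisterFiles:
    """Toy model of a core register file R plus hash-mapped shadow set SR."""

    def __init__(self, num_core, num_shadow, hash_table):
        # hash_table[i] = j means core register R[i] maps to SR[j].
        # A lookup table is an assumption; the slide only says j = hash(i).
        assert len(hash_table) == num_core
        assert all(0 <= j < num_shadow for j in hash_table)
        self.R = [0] * num_core
        self.SR = [0] * num_shadow
        self.hash_table = list(hash_table)

    def write_back(self, i, value, control_bit):
        """Commit a result to R[i]; copy it to SR[hash(i)] if the bit is 1."""
        self.R[i] = value
        if control_bit == 1:
            self.SR[self.hash_table[i]] = value


# Four core registers sharing two shadow registers (hypothetical mapping).
rf = RegisterFiles(num_core=4, num_shadow=2, hash_table=[0, 1, 0, 1])
rf.write_back(2, 42, control_bit=1)  # copied to SR[0]
rf.write_back(3, 7, control_bit=0)   # core register only
```

Note that the instruction never names the shadow register: the destination index i plus the single control bit is enough, because the hashing unit supplies j.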

Shadow Registers: Single Control Bit vs. Multiple Control Bits • Hash-mapped shadow registers • Advantages: only one additional control bit is needed • Much easier to encode in the 32-bit instruction format • More shadow registers are allowed • The control bit count is always 1, independent of the actual number of shadow registers • The hashing unit is configurable • The hashing scheme is retargetable to different applications • Limitation: less flexibility • Each core register can be mapped to only one shadow register • Less room for compiler optimizations
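
The encoding cost of the two schemes can be compared directly. The sketch below assumes the explicit scheme must encode one "skip" code plus one code per shadow register, which is consistent with the earlier statement that three shadow registers need two bits; the function names are hypothetical.

```python
def control_bits_explicit(k):
    """Bits needed to select 'skip' or one of k shadow registers."""
    if k < 1:
        raise ValueError("need at least one shadow register")
    # k + 1 distinct codes are needed; for k >= 1, k.bit_length()
    # equals ceil(log2(k + 1)).
    return k.bit_length()


def control_bits_hashed(k):
    """Hash-mapped scheme: one copy/skip bit regardless of k."""
    return 1


for k in (1, 3, 5, 8):
    print(k, control_bits_explicit(k), control_bits_hashed(k))
```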

ASIP Compilation Flow with Shadow Register Binding • [Flow diagram] Inputs: C code, architecture constraints → SUIF / CDFG generator → CDFG → 1. Pattern generation → 2. Pattern selection (with pattern library) → 3. Application mapping & code replacement → optimized code → backend compilation with 4. Shadow register binding & hash function generation → implementation

An Example Control Data Flow Graph • Each node represents an instruction • Each edge represents a data transfer, which is associated with a live interval • In the CDFG, a live interval [s, t] runs from the time a data transfer is initiated to the time it is terminated • One variable may correspond to multiple live intervals • Code: 1: r1 = …; 2: r2 = ext1(…, r1, …); 3: r3 = …; 4: r4 = ext2(…, r1, …); 5: r5 = ext3(…, r3, …); 6: r6 = ext4(…, r3, …); • [Figure: variable lifetimes and live intervals for this code, with edges e1–e4 and the live intervals l1 and l2 of r1]
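
One way to read the live-interval definition is as a definition-to-use pair per consumer. The sketch below extracts such intervals from the six-instruction example; the tuple representation of instructions and the function name `live_intervals` are hypothetical, and the operands elided as "…" on the slide are simply left out.

```python
def live_intervals(program):
    """Return (variable, s, t) for every use of a previously defined variable.

    program is a list of (dest, sources) pairs; instruction k (1-based)
    is program[k - 1]. s is the defining instruction, t the using one.
    """
    last_def = {}
    intervals = []
    for t, (dest, sources) in enumerate(program, start=1):
        for src in sources:
            if src in last_def:
                intervals.append((src, last_def[src], t))
        last_def[dest] = t
    return intervals


# The slide's example, keeping only the operands that are shown.
example = [
    ("r1", []),      # 1: r1 = ...
    ("r2", ["r1"]),  # 2: r2 = ext1(..., r1, ...)
    ("r3", []),      # 3: r3 = ...
    ("r4", ["r1"]),  # 4: r4 = ext2(..., r1, ...)
    ("r5", ["r3"]),  # 5: r5 = ext3(..., r3, ...)
    ("r6", ["r3"]),  # 6: r6 = ext4(..., r3, ...)
]
intervals = live_intervals(example)
```

Each of r1 and r3 yields two intervals here, which illustrates the point that one variable may correspond to multiple live intervals.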

Shadow Register Binding  Motivation • 2-read-port register file • 3-input extended instruction • Without shadow register 4 additional moves • Binding for one shadow register • Assume: r1 and r3 are hash-mapped to the same shadow register • It is not necessary to keep a variable in the shadow register for its entire lifetime 1 r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …); e1 l1 2 e2 r1 3 e3 l4 4 e4 5 r3 Binding 1: either r1 or r3 in shadow register saves 2 moves 6 Binding 2: l1 and l4 in shadow register saves 3 moves UCLA VLSICAD LAB

Binding for One Shadow Register  Problem Formulation • Binding problem for one shadow register with predetermined hash function • Problem formulation: • Given: • (i) A shadow register sr • (ii) A hash function h • (iii) An interval set S in which each interval will be hash-mapped to sr • Goal: • Select a subset of non-overlapping live intervals in S and bind them to sr so that the maximum number of move operations can be saved UCLA VLSICAD LAB

Shadow Register Binding  Algorithm • Weighted interval graph G(V’, E’) • Create a vertex v for each live interval [s, t] • Weight on each vertex represents # saves if the interval is bound to the shadow register • Create an edge e(v, v’) ifft < s’ where v = [s, t] and v’ = [s’, t’] • Theorem: • Binding problem is equivalent to find a maximum weighted chain in the compatibility graph • Can be optimally solved in time O(|V’|2) • Extension to K shadow registers • Each live interval can only be mapped to one shadow registers • The algorithm can be extended to handle K shadow-register by independently solving a series of one-shadow-register binding problem UCLA VLSICAD LAB

Hash Function Generation  Motivation • Hash function also affects the performance speedup • 2-read-port register file • 3-input extended instruction • No shadow registers • Four additional moves • Two shadow registers available • If r1 and r3 are hash-mapped to the same shadow register • Three moves can be saved • If r1 and r3 are hash-mapped to different shadow registers • All four moves can be saved 1 r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …); e1 2 e2 3 e3 4 e4 5 6 UCLA VLSICAD LAB

Hash Function Generation  Problem Formulation • Hash function generation • Problem formulation • Given: • (i) A set of core registers R = {r1, … rN} • (ii) A set of shadow registers SR = {sr1, … srK} • Goal: • Find a many to one function h: RSR so that the maximum number of move operations can be saved using h as the hash function UCLA VLSICAD LAB

Hash Function Generation  Algorithm • Hash function generation problem is equivalent to a multi-way set partitioning problem • A two-step approach is used to solve the problem • Reorder the core register indices to obtain a linear permutation • One simple heuristic: use a mod function to derive the permutation • If N=6 and K=2, sequence r1, r2, r3, r4, r5, r6 r1, r3, r5, r2, r4, r6 • Given the permutation, solve a one dimensional K-way partitioning problem • Adopt the algorithm in [Alpert, DAC’94] • Optimally solvable by dynamic programming r1 r2 r3 r4 r5 r6 sr2 sr1 UCLA VLSICAD LAB

Simulation-Based Performance Evaluation Flow • We adopt a SimpleScalar-based simulation flow to estimate the performance speedup • It is difficult to make architectural and compiler extensions on commercial processors • [Flow diagram] Inputs: binary code, architecture constraints → CDFG extractor → CDFG → 1. Pattern generation → 2. Pattern selection (with pattern library) → 3. Application mapping & code replacement → optimized code → backend compilation with 4. Shadow register binding & hash function generation → SimpleScalar → estimated performance

Experimental Setting • SimpleScalar v3.0 • Benchmarks: MediaBench and MiBench • Entire programs, rather than small pieces of code, are used for instruction set generation and simulation • Machine configuration • Single-issue in-order processor • DL1: 8KB, 4-way, 1 cycle • IL1: 8KB, direct mapped, 1 cycle • Unified L2: 256KB, 4-way, 8 cycles • Functional units: 2 IntALU, 1 IntMult, 1 FPALU, 1 FPMult • Reconfigurable units: use the critical path latencies of the collapsed instructions

Speedup under Different Shadow Register Architectures (1) • Under the 3-input constraint • Over 90% of the performance gap can be closed with 5 hash-mapped shadow registers

Speedup under Different Shadow Register Architectures (2) • Under the 4-input constraint • Over 95% of the performance gap can be closed with 8 hash-mapped shadow registers

Speedup Comparison: Shadow Registers vs. Register Replication (1) • Under the 3-input constraint • With the same number of registers, the shadow register architecture consistently outperforms partial register replication

Speedup Comparison: Shadow Registers vs. Register Replication (2) • Under the 4-input constraint • With the same number of registers, the shadow register architecture consistently outperforms partial register replication

Conclusions • A novel low-cost hash-mapped shadow register architecture is proposed • The global shadow register binding and hash function generation problems are formulated and solved • Experiments show encouraging speedup
