
Architecture and Compilation for Data Bandwidth Improvement in Configurable Embedded Processors

Jason Cong, Guoling Han, Zhiru Zhang. VLSI CAD Lab, Computer Science Department, University of California, Los Angeles. Supported by NSF, GSRC, Altera, Xilinx.


Presentation Transcript


  1. Architecture and Compilation for Data Bandwidth Improvement in Configurable Embedded Processors • Jason Cong, Guoling Han, Zhiru Zhang • VLSI CAD Lab, Computer Science Department, University of California, Los Angeles • Supported by NSF, GSRC, Altera, Xilinx.

  2. Outline • Motivation • Limited data bandwidth has become the performance bottleneck of instruction-set extendible processors • Architectural extension • Hash-mapped shadow registers • Associated compilation techniques • Shadow register binding and hash function generation • Experimental results • Conclusions UCLA VLSICAD LAB

  3. Target Reconfigurable Platform • General-purpose processor (GPP) core + programmable fabric • Loosely coupled as a coprocessor • Xilinx MicroBlaze, etc. • Tightly integrated as extra function units in application-specific instruction-set processors • The GPP can extend its basic instruction set • The programmable fabric implements the customized instructions • Examples: Altera Nios / Nios II, Tensilica Xtensa, etc. • [Figure: custom instruction logic for Nios II; source: www.altera.com]

  4. Target Core Processor Model • Core processor model • Classic single-issue pipelined RISC core (fetch / decode / execute / write-back) • The number of input and output operands of an instruction is predetermined • A custom instruction cannot execute until all of its input operands are available • A custom instruction reads the core register file during the execute stage and commits its result during the write-back stage

  5. Data Bandwidth Problem • Fact: about 60% of the speedup comes from clusters with more than two inputs [P. Ienne et al] • Architecture problem: limited register file bandwidth (two read ports, one write port) • One solution: introduce state registers and move instructions to load the extra operands [F. Sun et al, ICCAD’02] • With the extra move instructions, an average 36% drop in speedup was observed in our previous study [Cong et al, FPGA’05] • Example (latencies: * takes 2 clock cycles, + takes 1 clock cycle) • Original code: t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4; • With custom instructions extop1 and extop2 and no operand limit (speedup: 1.8): t1 = extop1(a, b, c); t2 = extop2(b, c, d, e); t3 = t1 + t2; • With extra move instructions (speedup: 1.125): mov(c); t1 = extop1(a, b, c); mov(d); mov(e); t2 = extop2(b, c, d, e); t3 = t1 + t2; • [Figure: data-flow graph over inputs a, b, c, d, e with three multiplications and three additions, grouped into extop1 and extop2]
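The two speedup figures on this slide can be reproduced with simple cycle counting. The sketch below is illustrative only: the 2-cycle latency of each custom instruction and the 1-cycle cost of each move are assumptions inferred from the reported speedups, not values stated on the slide.

```python
# Latencies given on the slide.
MUL_CYCLES = 2
ADD_CYCLES = 1

# Assumptions (not stated on the slide): each custom instruction takes
# 2 cycles and each extra move takes 1 cycle.
EXTOP_CYCLES = 2
MOV_CYCLES = 1

# Original code: three multiplications and three additions, executed serially.
baseline = 3 * MUL_CYCLES + 3 * ADD_CYCLES

# Two custom instructions plus the final addition, no extra moves.
ideal = 2 * EXTOP_CYCLES + ADD_CYCLES

# Same code, but three moves are needed to load the extra operands.
with_moves = ideal + 3 * MOV_CYCLES

speedup_ideal = baseline / ideal
speedup_with_moves = baseline / with_moves
```

Under these assumptions the baseline takes 9 cycles, the ideal version 5 cycles, and the version with moves 8 cycles, which matches the reported speedups of 1.8 and 1.125.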

  6. Existing Architecture Solutions • Multiport Register File • Low utilization when executing basic instructions • Extra address encoding space in the instruction word • Area and power grow cubically [S. Rixner et al, HPCA’00] • Register File Replication • Complete or partial register file copy [Chimaera: S. Hauck et al, TVLSI’04] • Power inefficient • Predetermined one-to-one correspondence • Limited compiler optimization

  7. Previous Approach – Shadow Registers (1) • Core registers are augmented by an extra set of shadow registers [Cong et al, 2005] • Conditionally written • Read only by the custom logic

  8. Previous Approach – Shadow Registers (2) • Controlling three shadow registers • Two bits must be added to, or encoded in, the instruction format • Advantage • Provides opportunities for compiler optimization • How can the shadow registers be bound effectively to maximize the performance gain? • Limitation • ⌈log2(K+1)⌉ bits are required for K shadow registers • Allows only a small number of shadow registers
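The encoding cost can be checked with a few lines of arithmetic. This is a sketch under one assumption: the bit count is read as ⌈log2(K+1)⌉ (K shadow registers plus a "no copy" code), which is the reading that agrees with the slide's own example of two bits for three shadow registers. The function names are hypothetical.

```python
def control_bits_direct(k):
    """Bits needed to name one of k shadow registers or 'no copy'.

    Assumes k + 1 distinct codes; for k >= 0, k.bit_length() equals
    ceil(log2(k + 1)).
    """
    return k.bit_length()


def control_bits_hashed(k):
    """Hash-mapped scheme from the next slide: one copy/skip bit for any k."""
    return 1


# Bit counts for a few shadow register set sizes.
table = {k: (control_bits_direct(k), control_bits_hashed(k)) for k in (1, 3, 7, 8)}
```

For K = 3 the direct scheme needs two bits, and the count grows by one bit each time K + 1 passes a power of two, while the single-bit scheme stays constant.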

  9. Proposed Approach: Hash-Mapped Shadow Registers Scheme • Shadow registers with a single control bit • Control bit = 1 means copy the data, 0 means skip • A hashing unit determines the mapping between core registers and shadow registers • That is, a result written to register R[i] in the core register file is conditionally copied to register SR[j] in the shadow register set, where j = hash(i).
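The write-back behaviour described here can be modelled in a few lines. This is a behavioural sketch, not the hardware design: the class name, the register counts, and the modulo hash in the usage example are all hypothetical choices for illustration.

```python
class HashMappedShadowRegisters:
    """Toy model of a core register file R with a hash-mapped shadow set SR."""

    def __init__(self, num_core, num_shadow, hash_fn):
        self.core = [0] * num_core        # R[0..num_core-1]
        self.shadow = [0] * num_shadow    # SR[0..num_shadow-1]
        self.hash_fn = hash_fn            # configurable mapping i -> j

    def write_back(self, i, value, control_bit):
        """Commit a result to R[i]; copy it to SR[hash(i)] only if the bit is 1."""
        self.core[i] = value
        if control_bit == 1:
            self.shadow[self.hash_fn(i)] = value


# Usage: 8 core registers, 2 shadow registers, hypothetical hash(i) = i mod 2.
regs = HashMappedShadowRegisters(8, 2, lambda i: i % 2)
regs.write_back(3, 42, control_bit=1)   # R[3] = 42, copied to SR[1]
regs.write_back(5, 7, control_bit=0)    # R[5] = 7, SR[1] left untouched
```

Because the destination shadow register is derived from the core register index, the instruction only has to carry the single copy/skip bit.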

  10. Shadow Registers: Single Control Bit vs. Multiple Control Bits • Hash-mapped shadow registers • Advantages: Only one additional control bit is needed • Much easier to encode in the 32-bit instruction format • More shadow registers are allowed • The control bit count is always 1, independent of the actual number of shadow registers • The hashing unit is configurable • The hashing scheme can be retargeted to different applications • Limitation: Less flexibility • Each core register can be mapped to only one shadow register • Less room for compiler optimizations

  11. ASIP Compilation Flow with Shadow Register Binding • [Flow diagram] Inputs: C code and architecture constraints • SUIF / CDFG generator (produces the CDFG) • 1. Pattern generation • 2. Pattern selection (pattern library) • 3. Application mapping & code replacement (optimized code) • Backend compilation • 4. Shadow register binding & hash function generation • Implementation

  12. An Example Control Data Flow Graph • Each node represents an instruction • Each edge represents a data transfer, which is associated with a live interval • In the CDFG, a live interval [s, t] runs from the time a data transfer is initiated to the time it is terminated • One variable might correspond to multiple live intervals • Example code (time steps 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …); • [Figure: the variable lifetime of r1 and its live intervals l1 and l2, with graph labels e1 to e4]
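One way to read the definition above is that every definition-to-use data transfer yields its own live interval. The sketch below extracts intervals from the slide's six-instruction example under that reading; the program encoding and the function name are hypothetical, and the elided operands ("…") are simply left out.

```python
def live_intervals(program):
    """Return (variable, s, t) for each def-to-use transfer.

    program: list of (dest, sources) in issue order; time steps start at 1.
    Assumes one interval per use, starting at the most recent definition.
    """
    def_time = {}
    intervals = []
    for step, (dest, sources) in enumerate(program, start=1):
        for src in sources:
            if src in def_time:
                intervals.append((src, def_time[src], step))
        def_time[dest] = step
    return intervals


# The slide's example, keeping only the operands that are named.
PROGRAM = [
    ("r1", []),        # 1: r1 = ...
    ("r2", ["r1"]),    # 2: r2 = ext1(..., r1, ...)
    ("r3", []),        # 3: r3 = ...
    ("r4", ["r1"]),    # 4: r4 = ext2(..., r1, ...)
    ("r5", ["r3"]),    # 5: r5 = ext3(..., r3, ...)
    ("r6", ["r3"]),    # 6: r6 = ext4(..., r3, ...)
]

INTERVALS = live_intervals(PROGRAM)
```

With this encoding r1 yields two intervals, [1, 2] and [1, 4], which illustrates how one variable can correspond to multiple live intervals.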

  13. Shadow Register Binding – Motivation • 2-read-port register file • 3-input extended instructions • Without shadow registers: 4 additional moves • Binding for one shadow register • Assume r1 and r3 are hash-mapped to the same shadow register • It is not necessary to keep a variable in the shadow register for its entire lifetime • Example code (time steps 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …); • Binding 1: keeping either r1 or r3 in the shadow register saves 2 moves • Binding 2: keeping l1 and l4 in the shadow register saves 3 moves • [Figure: lifetimes of r1 and r3 with live intervals l1 and l4 and graph labels e1 to e4]

  14. Binding for One Shadow Register – Problem Formulation • Binding problem for one shadow register with a predetermined hash function • Problem formulation: • Given: • (i) A shadow register sr • (ii) A hash function h • (iii) An interval set S in which each interval will be hash-mapped to sr • Goal: • Select a subset of non-overlapping live intervals in S and bind them to sr so that the maximum number of move operations can be saved

  15. Shadow Register Binding – Algorithm • Weighted interval graph G(V’, E’) • Create a vertex v for each live interval [s, t] • The weight on each vertex represents the number of moves saved if the interval is bound to the shadow register • Create an edge e(v, v’) iff t < s’, where v = [s, t] and v’ = [s’, t’] • Theorem: • The binding problem is equivalent to finding a maximum-weight chain in the compatibility graph • It can be solved optimally in O(|V’|^2) time • Extension to K shadow registers • Each live interval can be mapped to only one shadow register • The algorithm can be extended to handle K shadow registers by independently solving a series of one-shadow-register binding problems
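The maximum-weight chain can be found with a quadratic dynamic program over intervals sorted by end time. The sketch below is one possible formulation, not necessarily the authors' implementation. The interval endpoints and weights in the example are assumptions derived from the six-instruction example on the earlier slides (one interval per def-to-use transfer, with a weight equal to the number of uses the interval covers).

```python
def bind_one_shadow_register(intervals):
    """Pick non-overlapping intervals of maximum total weight.

    intervals: list of (name, s, t, weight). Interval a may precede b in a
    chain only if a.t < b.s, matching the slide's edge condition.
    Runs in O(n^2) time.
    """
    order = sorted(intervals, key=lambda iv: iv[2])   # by end time t
    n = len(order)
    if n == 0:
        return 0, []
    best = [0] * n      # best[j]: max weight of a chain ending at order[j]
    prev = [-1] * n     # predecessor of order[j] in that chain
    for j in range(n):
        best[j] = order[j][3]
        for i in range(j):
            if order[i][2] < order[j][1] and best[i] + order[j][3] > best[j]:
                best[j] = best[i] + order[j][3]
                prev[j] = i
    j = max(range(n), key=lambda idx: best[idx])
    total = best[j]
    chain = []
    while j != -1:
        chain.append(order[j][0])
        j = prev[j]
    return total, chain[::-1]


# Assumed intervals for r1 (l1, l2) and r3 (l3, l4) sharing one shadow register.
EXAMPLE = [
    ("l1", 1, 2, 1),   # r1 -> ext1
    ("l2", 1, 4, 2),   # r1 -> ext2, also covers ext1
    ("l3", 3, 5, 1),   # r3 -> ext3
    ("l4", 3, 6, 2),   # r3 -> ext4, also covers ext3
]

saved, chosen = bind_one_shadow_register(EXAMPLE)
```

With these assumed weights the program selects l1 and l4 for a total of 3 saved moves, which agrees with the second binding on the motivation slide.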

  16. Hash Function Generation – Motivation • The hash function also affects the speedup • 2-read-port register file • 3-input extended instructions • No shadow registers • Four additional moves • Two shadow registers available • If r1 and r3 are hash-mapped to the same shadow register • Three moves can be saved • If r1 and r3 are hash-mapped to different shadow registers • All four moves can be saved • Example code (time steps 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …); • [Figure: graph labels e1 to e4 for the example code]
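The effect of the hash function can be illustrated by grouping live intervals per shadow register and solving each group independently as a one-register binding problem. In this sketch the interval endpoints and weights are assumptions derived from the example code (weight = number of uses covered), and the function names and the dictionary-based hash mapping are hypothetical.

```python
def max_saved_moves(intervals):
    """Maximum total weight of non-overlapping (s, t, weight) intervals."""
    order = sorted(intervals, key=lambda iv: iv[1])   # by end time t
    best = []
    for j, (s, t, w) in enumerate(order):
        # Chains that end strictly before this interval starts.
        prior = [best[i] for i in range(j) if order[i][1] < s]
        best.append(w + max(prior, default=0))
    return max(best, default=0)


def total_saved(intervals_by_reg, hash_map):
    """Sum the per-shadow-register optima for a given register mapping."""
    groups = {}
    for reg, ivs in intervals_by_reg.items():
        groups.setdefault(hash_map[reg], []).extend(ivs)
    return sum(max_saved_moves(ivs) for ivs in groups.values())


# Assumed live intervals (s, t, weight) for r1 and r3.
INTERVALS_BY_REG = {
    "r1": [(1, 2, 1), (1, 4, 2)],
    "r3": [(3, 5, 1), (3, 6, 2)],
}

same = total_saved(INTERVALS_BY_REG, {"r1": "sr1", "r3": "sr1"})
different = total_saved(INTERVALS_BY_REG, {"r1": "sr1", "r3": "sr2"})
```

Under these assumptions, mapping both registers to the same shadow register saves 3 moves and mapping them to different shadow registers saves 4, in line with the numbers on the slide.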

  17. Hash Function Generation – Problem Formulation • Hash function generation • Problem formulation • Given: • (i) A set of core registers R = {r1, …, rN} • (ii) A set of shadow registers SR = {sr1, …, srK} • Goal: • Find a many-to-one function h: R → SR so that the maximum number of move operations can be saved using h as the hash function

  18. Hash Function Generation – Algorithm • The hash function generation problem is equivalent to a multi-way set partitioning problem • A two-step approach is used to solve the problem • Step 1: reorder the core register indices to obtain a linear permutation • One simple heuristic: use a mod function to derive the permutation • If N = 6 and K = 2, the sequence r1, r2, r3, r4, r5, r6 becomes r1, r3, r5, r2, r4, r6 • Step 2: given the permutation, solve a one-dimensional K-way partitioning problem • Adopt the algorithm in [Alpert, DAC’94] • Optimally solvable by dynamic programming • [Figure: registers r1 to r6 partitioned between shadow registers sr1 and sr2]
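The two steps can be sketched as follows. The mod-based reordering reproduces the slide's N = 6, K = 2 example. The partitioning step is a generic dynamic program over contiguous blocks and is not the cited algorithm itself; its cost function (the number of conflicting register pairs placed in the same block) and the conflict set are hypothetical stand-ins, since the slide does not state the objective in closed form.

```python
from itertools import combinations


def mod_permutation(n, k):
    """Reorder register indices 1..n by (index - 1) mod k, then by index."""
    return sorted(range(1, n + 1), key=lambda i: ((i - 1) % k, i))


def partition_1d(seq, k, block_cost):
    """Split seq into k non-empty contiguous blocks minimizing total cost.

    Requires len(seq) >= k. best[j][m] is the minimum cost of splitting
    seq[:j] into m blocks; cut[j][m] records where the last block starts.
    """
    n = len(seq)
    inf = float("inf")
    best = [[inf] * (k + 1) for _ in range(n + 1)]
    cut = [[None] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for j in range(1, n + 1):
        for m in range(1, min(j, k) + 1):
            for i in range(m - 1, j):
                if best[i][m - 1] == inf:
                    continue
                cost = best[i][m - 1] + block_cost(seq[i:j])
                if cost < best[j][m]:
                    best[j][m] = cost
                    cut[j][m] = i
    blocks, j = [], n
    for m in range(k, 0, -1):
        i = cut[j][m]
        blocks.append(seq[i:j])
        j = i
    return best[n][k], blocks[::-1]


# Hypothetical conflict set: r1 and r3 compete for the same shadow register.
CONFLICTS = {frozenset((1, 3))}


def conflict_cost(block):
    """Count conflicting register pairs that land in the same block."""
    return sum(1 for a, b in combinations(block, 2) if frozenset((a, b)) in CONFLICTS)


perm = mod_permutation(6, 2)
cost, blocks = partition_1d(perm, 2, conflict_cost)
```

Each resulting block corresponds to the set of core registers that hash to one shadow register; with the assumed conflict set, the partition separates r1 from r3 at zero cost.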

  19. Simulation-Based Performance Evaluation Flow • We adopt a SimpleScalar-based simulation flow to estimate the performance speedup • It is difficult to make architectural and compiler extensions on commercial processors • [Flow diagram] Inputs: binary code and architecture constraints • CDFG extractor (produces the CDFG) • 1. Pattern generation • 2. Pattern selection (pattern library) • 3. Application mapping & code replacement (optimized code) • Backend compilation • 4. Shadow register binding & hash function generation • SimpleScalar • Estimated performance

  20. Experimental Setting • SimpleScalar v3.0 • Benchmarks: MediaBench and MiBench • Entire programs, rather than small pieces of code, are used for instruction set generation and simulation • Machine configuration • Single-issue in-order processor • DL1: 8KB, 4-way, 1 cycle • IL1: 8KB, direct mapped, 1 cycle • Unified L2: 256KB, 4-way, 8 cycles • Functional units: 2 IntALU, 1 IntMult, 1 FPALU, 1 FPMult • Reconfigurable units: use the critical path latencies of the collapsed instructions

  21. Speedup under Different Shadow Register Architectures (1) • Under the 3-input constraint • Over 90% of the performance gap can be closed with 5 hash-mapped shadow registers

  22. Speedup under Different Shadow Register Architectures (2) • Under the 4-input constraint • Over 95% of the performance gap can be closed with 8 hash-mapped shadow registers

  23. Speedup Comparison: Shadow Registers vs. Register Replication (1) • Under the 3-input constraint • With the same number of registers, the shadow register architecture consistently outperforms partial register replication

  24. Speedup Comparison: Shadow Registers vs. Register Replication (2) • Under the 4-input constraint • With the same number of registers, the shadow register architecture consistently outperforms partial register replication

  25. Conclusions • A novel low-cost hash-mapped shadow register architecture is proposed • The associated compilation problem, global shadow register binding and hash function generation, is formulated and solved • Experiments show encouraging speedup
