Custom Code Generation for Soft Processors
Custom Code Generation for Soft Processors. Martin Labrecque Peter Yiannacouras Gregory Steffan. ECE Dept. University of Toronto. Presented at RAAW 2006, Orlando, FL. Soft Processor: Processor in FPGA. FPGA. Processor. Compelling solution: software programmable
Custom Code Generation for Soft Processors
E N D
Presentation Transcript
Custom Code Generationfor Soft Processors Martin Labrecque Peter Yiannacouras Gregory Steffan ECE Dept. University of Toronto Presented at RAAW 2006, Orlando, FL
Soft Processor: Processor in FPGA FPGA Processor • Compelling solution: software programmable • Soft processors are end-user customizable • Different application realm than hard ASIC processors • Can add more features: trade area for performance • Well known approach: add custom instructions (ex. A*B+C) Techniques orthogonal to custom instructions Programmable Logic
Application-Specific Code Generation Use default gcc, ISA? Interested in app-specific optimizations Application Compiler Processor Customized for: • Area • Power • Wallclock time • Freq. requirements
SPREE System(Soft Processor Rapid Exploration Environment) ISA Datapath • Verify ISA against datapath • Datapath Instantiation • Control Generation • Multi-cycle/variable-cycle FUs • Multiplexer select signals • Interlocking • Branch handling SPREE RTL • Output: Synthesizable Verilog [CASES 05, FPGA 06] • Input: Processor description • Made of hand-coded components Processor Description • SPREE System
Back-End Infrastructure RTL 20Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc) Stratix 1S40C5 Cycle Count 2. Area 3. Clock Frequency 4. Power We can measure area/performance/energy accurately Modelsim RTL Simulator Quartus II 5.0 CAD Software
Area efficiency #Million Instr. x Frequency # Cycles x Area • A combined metric: MIPS #Million Instr. 1000 LEs WallclockTime x Area • 4 criteria trading-off (power not included) • Want app-specific ( average) improvement
Representative Processors <900 LEs, <70 MHz >1500 LEs, >100 MHz F: Fetch D: Decode R: Register EX: Execute M: Memory WB: Writeback Serial F/D/R/EX/WB Pipe3 F/D R/EX/M WB Pipe5 F D R/EX1 EX2/M WB EX1 WB2 F D R EX2/M EX3/WB1 Pipe7
SPREE vs Nios II Serial faster Pipe7 Pipe5 Pipe3 smaller
Code Generation Options Studied( Outline ) Low-level hardware-software tradeoffs Reducing hardware shift support Removing hazard detection logic Impact of unique ISA features Removing delay slots Hi/Lo registers vs 3-operand multiplies Using unaligned memory load and stores Application-specific register management Operand scheduling and forwarding lines Limiting the use of architected registers Combining these into app-specific optimizations
Reducing Hardware Shift Support Best performance per area: Using hard multiplier for shifting Multiplications and shifts: both in software? Software shifting using additions & subtractions Impact of removing the dedicated LUT-based shifter? Costs ~250LEs, 30% of smallest soft processor Can we have partial hardware support for shifting?
343 LEs 48 LEs 2 fixed-amount shifters is cheap! Area for Various Shift Strategies (Pipe3)
Dynamic Instructions Containing Shifts less than 2% of shift amounts are variable some benchmarks have very few shifts Percentage
How to get rid of the shifter Software-only shifts require an order of magnitude more cycles to compute Measure the cost in cycles for each shift operation Replace shifts by hard shifts and/or software shifts: Srl 8 Srl 8 Srl 4 Srl 4 Srl 4 Srl 4 Srl 3 Srl 3 Srl 3 Srl 3 Srl 3 Shift_left(1) ... Srl 16 or or or Evaluate cost in cycles for all combinations of shifters available
Impact of up to 2 Fixed-Amount Shifters (pipe3) Can improve area efficiency by up to 65% Beneficial for certain applications only Area efficiency (MIPS/1000LE)
PC hazard avoided Instr. in delay slot Branch/ Jump F/D R/EX/M WB Time Time Removing Delay Slots load hazard avoided • Default MIPS has branch and load delay slots • Under what conditions are they worth it? • Load delay slots need no additional hardware support • Because of hazard detection in the processor • Branch delay slots require hardware support • We only have predict-not-taken so far • Are working on better branch prediction Instr. in delay slot Load F/D R/EX/M WB
Removing Load Delay Slots (serial) 3% better performance for Serial, 2% for Pipe3 Normalized Wall-Clock Time
Removing Branch Delay Slots pipe3: 7% performance hit pipe7 improvements: 13% freq, 8% performance
3-Operand Multiplies vs Hi/Lo Registers Default MIPS has Hi/Lo registers Motivated by multi-cycle multiplies Hold multiplication results (Hi and Lo each 32 bits) Two special instructions to access Hi/Lo Which to choose? • 3-operand multiplies (NIOS2 and Microblaze) • Two instructions compute high and low parts • Result is stored in register file Hi/Lo Register file Multiplier MUX
Impact of 3-Operand Multiplies 8% slower clock Saves area, reduces frequency, increases power Normalized Value
Impact of 3-Operand Multiplies Only pipe3 benefits from cycle savings
Forwarding Lines and Code Generation Necessary to forward both operands (A and B)? Simultaneous dependences Non commutative operations r3 = r1 + r2 r4 = r3 + r3 r3 = r1 + r2 r4 = r5 - r3 r3 = r1 + r2 r4 = r3 – r5 • Compiler can reorder commutative operands of instrs • Can compiler compensate when only one forwarding line? • Save ~30 LEs for fwding line and incur more stall cycles? Added 1-2% cycle improvement with 1 fwding line 3-4% short of 2 fwding lines’ performance for 30% of apps, 1 fwding line more area efficient
Soft Processor Customization Techniques • Best overall (general purpose) processor • Best per application (application-tuned) • Reduce processor by reducing the ISA (Subset) SPREE automatically removes • Unused connections • Unused components • Unused parts of the ISA • Apply optimization techniques (Opt)
Average Combined Improvements (pipe3) Subsetting & Opts +25% 36% Opts +12% Efficiency (MIPS/1000LEs) App-Specific: +11% Opt: 2 fixed shifts, no dly slots, 3-op mult, op sched overall 36% improvement in efficiency! Subsetting +8%
Summary Software-only and custom shifters Load delay slots Branch delay slots 3-operand multiply Operand scheduling to save a forwarding line App-specific Useless with hazard detection Useful with poor branch prediction Processor-specific App-specific 12% area efficiency over app-specific processor 17% area eff. over subsetted app-specific proc. without adding complexity! Conclusion
Future Research Integrating branch prediction in SPREE Research on memory hierarchy Automatic selection of app-specific SP features
Architectural Parameters Used in SPREE Multiplication Support Hardware FU or software routine Shifter implementation Flipflops, multiplier, or LUTs Pipelining Depth (2-7 stages) Forwarding lines We focus on core microarchitecture (for now)
No specific evaluation of studied features in SP Related work • Custard [Dimond, Mencer, Luk] • Customizable forwarding lines • Optional delay slots • NIOS II [Altera] • 3-operand multiply • No delay slots • Microblaze [Xilinx] • 3-operand multiply • Branches with and without delay slots
Removing some/all hazard detection logic • Can the compiler compensate with scheduling? • E.g., worst case use no-ops to ensure correctness • Challenge: variable, multi-cycle instructions • What is the cost/benefit of doing so? F/D R/EX/M WB Pipe3 Potential hazard F/D stall R/EX/M WB Time
Up to 10% area and 6% frequency gains Impact of Removing Hazard Detection Logic