Custom Code Generation for Soft Processors

Martin Labrecque, Peter Yiannacouras, Gregory Steffan
ECE Dept., University of Toronto
Presented at RAAW 2006, Orlando, FL

Soft Processor: Processor in FPGA
• Compelling solution: software programmable
• Soft processors are end-user customizable
• Different application realm than hard ASIC processors
• Can add more features: trade area for performance
• Well-known approach: add custom instructions (e.g., A*B+C)
• Techniques orthogonal to custom instructions
[Figure: FPGA containing a processor built in programmable logic]

Application-Specific Code Generation
• Use default gcc, ISA?
• Interested in app-specific optimizations
[Figure: application, compiler, processor]
Customized for:
• Area
• Power
• Wall-clock time
• Frequency requirements

Infrastructure

SPREE System (Soft Processor Rapid Exploration Environment) [CASES 05, FPGA 06]
• Input: processor description (ISA, datapath)
  • Made of hand-coded components
• SPREE system
  • Verify ISA against datapath
  • Datapath instantiation
  • Control generation
    • Multi-cycle/variable-cycle FUs
    • Multiplexer select signals
    • Interlocking
    • Branch handling
• Output: synthesizable Verilog (RTL)

Back-End Infrastructure
• Inputs: RTL, 20 benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
• Tools: Modelsim RTL simulator, Quartus II 5.0 CAD software
• Device: Stratix 1S40C5
• Measurements:
  1. Cycle count
  2. Area
  3. Clock frequency
  4. Power
We can measure area/performance/energy accurately

Area efficiency
• A combined metric:
  MIPS / 1000 LEs = (#Million Instr. × Frequency) / (#Cycles × Area) = #Million Instr. / (Wall-clock Time × Area)
• 4 criteria trading off (power not included)
• Want app-specific (not average) improvement
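
As a worked illustration of the metric above, here is a minimal Python sketch. The function name, argument names, and the sample numbers are hypothetical and are not measurements from the talk; the only thing taken from the slide is the relation between instruction count, frequency, cycle count, and area.

```python
def area_efficiency(instr_count, cycles, freq_hz, area_les):
    """Return MIPS per 1000 LEs.

    instr_count: dynamic instructions executed (hypothetical input)
    cycles:      cycle count from RTL simulation
    freq_hz:     clock frequency in Hz
    area_les:    area in logic elements (LEs)
    """
    wallclock_s = cycles / freq_hz               # wall-clock time = cycles / frequency
    mips = (instr_count / 1e6) / wallclock_s     # million instructions per second
    return mips / (area_les / 1000.0)            # normalize per 1000 LEs


# Hypothetical design point: 50M instructions, 100M cycles, 80 MHz, 1000 LEs
example = area_efficiency(50e6, 100e6, 80e6, 1000)
```

A design that halves area at the same wall-clock time doubles this metric, which is why removing small hardware blocks can pay off even when cycle counts rise slightly.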

Representative Processors
Stages: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
• Serial: F/D/R/EX/WB
• Pipe3: F/D | R/EX/M | WB
• Pipe5: F | D | R/EX1 | EX2/M | WB
• Pipe7: F | D | R | EX1 | EX2/M | EX3/WB1 | WB2
Design points range from <900 LEs, <70 MHz to >1500 LEs, >100 MHz

SPREE vs Nios II
[Figure: Serial, Pipe3, Pipe5, and Pipe7 compared with Nios II; axis annotations: "faster", "smaller"]

Code Generation Options Studied (Outline)
• Low-level hardware-software tradeoffs
  • Reducing hardware shift support
  • Removing hazard detection logic
• Impact of unique ISA features
  • Removing delay slots
  • Hi/Lo registers vs 3-operand multiplies
  • Using unaligned memory load and stores
• Application-specific register management
  • Operand scheduling and forwarding lines
  • Limiting the use of architected registers
• Combining these into app-specific optimizations

Reducing Hardware Shift Support
• Best performance per area: using hard multiplier for shifting
• Multiplications and shifts: both in software?
  • Software shifting using additions & subtractions
• Impact of removing the dedicated LUT-based shifter?
  • Costs ~250 LEs, 30% of smallest soft processor
• Can we have partial hardware support for shifting?
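
To make "software shifting using additions" concrete, the following Python sketch models a 32-bit left shift built only from additions, since x + x doubles x. It covers left shifts only; the helper name and the 32-bit masking are assumptions for illustration, not the routine used in the talk.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def shift_left_by_adds(x, amount):
    """Shift a 32-bit value left by `amount` using one addition per bit."""
    for _ in range(amount):
        x = (x + x) & MASK32  # doubling is a 1-bit left shift; mask models overflow
    return x


result = shift_left_by_adds(3, 4)  # 3 << 4
```

Each bit of shift costs one add in this model, which hints at why software-only shifts by large amounts are expensive in cycles.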

Area for Various Shift Strategies (Pipe3)
[Figure: area per shift strategy; labeled values: 343 LEs, 48 LEs]
• 2 fixed-amount shifters are cheap!

Dynamic Instructions Containing Shifts
[Figure: chart; y-axis: Percentage]
• Less than 2% of shift amounts are variable
• Some benchmarks have very few shifts

How to get rid of the shifter
• Software-only shifts require an order of magnitude more cycles to compute
• Measure the cost in cycles for each shift operation
• Replace shifts by hard shifts and/or software shifts, e.g. Srl 16 becomes:
  • Srl 8, Srl 8
  • or Srl 4, Srl 4, Srl 4, Srl 4
  • or Srl 3, Srl 3, Srl 3, Srl 3, Srl 3, Shift_left(1)
  • or ...
• Evaluate cost in cycles for all combinations of shifters available
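
The search over shifter combinations can be sketched as a small dynamic program. This is a simplified model under stated assumptions, not the authors' tool: each fixed-amount hardware shift is assumed to cost 1 cycle, each software 1-bit shift a hypothetical 10 cycles, and plans that overshoot and then shift back are not considered.

```python
def cheapest_shift_plan(amount, hard_amounts, hard_cost=1, soft_bit_cost=10):
    """Return (cycles, plan) for shifting by `amount` bits.

    hard_amounts:  fixed shift amounts available in hardware (e.g. [8, 4])
    hard_cost:     assumed cycles per fixed hardware shift
    soft_bit_cost: assumed cycles per software 1-bit shift
    """
    best = [(0, [])]  # best[n] = cheapest (cycles, plan) to shift by n bits
    for n in range(1, amount + 1):
        # Fallback: extend the plan for n-1 with one software 1-bit shift.
        cost, plan = best[n - 1]
        candidate = (cost + soft_bit_cost, plan + ["soft 1"])
        # Try finishing with each available fixed-amount hardware shifter.
        for h in hard_amounts:
            if h <= n:
                c, p = best[n - h]
                if c + hard_cost < candidate[0]:
                    candidate = (c + hard_cost, p + [f"srl {h}"])
        best.append(candidate)
    return best[amount]


plan_8 = cheapest_shift_plan(16, [8])   # two srl 8
plan_3 = cheapest_shift_plan(16, [3])   # five srl 3 plus one software bit
```

Running this per shift amount and per candidate shifter set gives a cycle estimate that can be weighed against the area of each configuration.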

Impact of up to 2 Fixed-Amount Shifters (pipe3)
[Figure: chart; y-axis: Area efficiency (MIPS/1000 LEs)]
• Can improve area efficiency by up to 65%
• Beneficial for certain applications only

Removing Delay Slots
• Default MIPS has branch and load delay slots
• Under what conditions are they worth it?
• Load delay slots need no additional hardware support
  • Because of hazard detection in the processor
• Branch delay slots require hardware support
  • We only have predict-not-taken so far
  • Are working on better branch prediction
[Figure: two Pipe3 (F/D, R/EX/M, WB) timing diagrams: a branch/jump followed by the instruction in its delay slot (PC hazard avoided), and a load followed by the instruction in its delay slot (load hazard avoided)]

Removing Load Delay Slots (serial)
[Figure: chart; y-axis: Normalized Wall-Clock Time]
• 3% better performance for Serial, 2% for Pipe3

Removing Branch Delay Slots
• pipe3: 7% performance hit
• pipe7 improvements: 13% frequency, 8% performance

3-Operand Multiplies vs Hi/Lo Registers
• Default MIPS has Hi/Lo registers
  • Motivated by multi-cycle multiplies
  • Hold multiplication results (Hi and Lo each 32 bits)
  • Two special instructions to access Hi/Lo
• Which to choose?
• 3-operand multiplies (Nios II and MicroBlaze)
  • Two instructions compute high and low parts
  • Result is stored in register file
[Figure: datapath with multiplier, Hi/Lo registers, register file, and MUX]
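
The two multiply styles can be contrasted with a small behavioral model in Python. The Hi/Lo side follows the MIPS `multu`/`mfhi`/`mflo` pattern for unsigned operands; the names `mul_lo` and `mul_hi` on the 3-operand side are hypothetical stand-ins, not real Nios II or MicroBlaze mnemonics.

```python
MASK32 = 0xFFFFFFFF


class HiLoMultiplier:
    """Hi/Lo style: one multiply fills two special registers."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def multu(self, a, b):
        product = a * b                      # full 64-bit unsigned product
        self.hi = (product >> 32) & MASK32   # upper 32 bits
        self.lo = product & MASK32           # lower 32 bits

    def mfhi(self):
        return self.hi                       # extra instruction to read Hi

    def mflo(self):
        return self.lo                       # extra instruction to read Lo


def mul_lo(a, b):
    """3-operand style (hypothetical name): low half goes straight to a register."""
    return (a * b) & MASK32


def mul_hi(a, b):
    """3-operand style (hypothetical name): high half goes straight to a register."""
    return ((a * b) >> 32) & MASK32


m = HiLoMultiplier()
m.multu(0xFFFFFFFF, 2)
```

When only the low 32 bits are needed, the 3-operand form takes one instruction where the Hi/Lo form takes a multiply plus a move, which is where the cycle savings come from in this model.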

Impact of 3-Operand Multiplies
[Figure: chart; y-axis: Normalized Value]
• 8% slower clock
• Saves area, reduces frequency, increases power

Impact of 3-Operand Multiplies Only pipe3 benefits from cycle savings

Forwarding Lines and Code Generation
• Necessary to forward both operands (A and B)?
  • Simultaneous dependences: r3 = r1 + r2; r4 = r3 + r3
  • Non-commutative operations: r3 = r1 + r2; r4 = r5 - r3 versus r3 = r1 + r2; r4 = r3 - r5
• Compiler can reorder commutative operands of instructions
• Can compiler compensate when only one forwarding line?
• Save ~30 LEs for forwarding line and incur more stall cycles?
Results:
• Added 1-2% cycle improvement with 1 forwarding line
• 3-4% short of 2 forwarding lines' performance
• For 30% of apps, 1 forwarding line is more area efficient
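
Operand scheduling for a single forwarding line can be sketched as follows. The model is an assumption made for illustration: only operand A is forwarded, an instruction stalls one cycle when operand B depends on the immediately preceding result, and the instruction encoding `(dest, op, a, b)` is hypothetical.

```python
COMMUTATIVE = frozenset({"add", "mul", "and", "or", "xor"})


def schedule_operands(instrs):
    """Swap commutative operands so a fresh result lands on forwarded port A."""
    out = []
    prev_dest = None
    for dest, op, a, b in instrs:
        if op in COMMUTATIVE and b == prev_dest and a != prev_dest:
            a, b = b, a  # move the dependent operand onto the forwarded port
        out.append((dest, op, a, b))
        prev_dest = dest
    return out


def count_stalls(instrs):
    """Count stalls when only operand A has a forwarding line (assumed model)."""
    stalls = 0
    prev_dest = None
    for dest, op, a, b in instrs:
        if prev_dest is not None and b == prev_dest:
            stalls += 1  # operand B must wait for the register file write
        prev_dest = dest
    return stalls


program = [
    ("r3", "add", "r1", "r2"),
    ("r4", "add", "r5", "r3"),  # commutative: can be fixed by swapping
    ("r6", "sub", "r5", "r4"),  # non-commutative: stall remains
    ("r7", "add", "r6", "r6"),  # simultaneous dependence: stall remains
]
before = count_stalls(program)
after = count_stalls(schedule_operands(program))
```

The two remaining stalls correspond to the slide's two problem cases: non-commutative operations and simultaneous dependences on both operands.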

Soft Processor Customization Techniques
• Best overall (general purpose) processor
• Best per application (application-tuned)
• Reduce processor by reducing the ISA (Subset)
  • SPREE automatically removes:
    • Unused connections
    • Unused components
    • Unused parts of the ISA
• Apply optimization techniques (Opt)
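
The idea behind ISA subsetting can be illustrated with a few lines of Python: collect the opcodes an application actually uses and report the rest as candidates for removal. This is only a sketch of the concept; the opcode names and program representation are hypothetical, and it does not reflect how SPREE itself performs the analysis.

```python
def unused_opcodes(supported, program):
    """Return the supported opcodes that never appear in `program`.

    supported: iterable of opcode names the processor implements
    program:   list of instructions, each a tuple whose first item is the opcode
    """
    used = {instr[0] for instr in program}
    return sorted(set(supported) - used)


supported_isa = {"add", "sub", "sll", "srl", "mult", "lw", "sw"}
app = [("lw", "r1"), ("add", "r3"), ("sw", "r3")]
removable = unused_opcodes(supported_isa, app)
```

Hardware that exists only to serve the removable opcodes (for example a shifter when no shifts remain) is then a candidate for elimination.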

Average Combined Improvements (pipe3)
[Figure: bar chart; y-axis: Efficiency (MIPS/1000 LEs)]
• App-specific: +11%
• Subsetting: +8%
• Opts: +12%
• Subsetting & Opts: +25%
• Overall 36% improvement in efficiency!
Opt: 2 fixed shifts, no delay slots, 3-operand multiply, operand scheduling

Summary
• Software-only and custom shifters: app-specific
• Load delay slots: useless with hazard detection
• Branch delay slots: useful with poor branch prediction
• 3-operand multiply: processor-specific
• Operand scheduling to save a forwarding line: app-specific
Conclusion
• 12% area efficiency gain over app-specific processor
• 17% area efficiency gain over subsetted app-specific processor
• Without adding complexity!

Future Research
• Integrating branch prediction in SPREE
• Research on memory hierarchy
• Automatic selection of app-specific soft processor features

Thank you

Architectural Parameters Used in SPREE
• Multiplication support
  • Hardware FU or software routine
• Shifter implementation
  • Flip-flops, multiplier, or LUTs
• Pipelining
  • Depth (2-7 stages)
  • Forwarding lines
We focus on core microarchitecture (for now)

Related work
• Custard [Dimond, Mencer, Luk]
  • Customizable forwarding lines
  • Optional delay slots
• Nios II [Altera]
  • 3-operand multiply
  • No delay slots
• MicroBlaze [Xilinx]
  • 3-operand multiply
  • Branches with and without delay slots
No specific evaluation of studied features in soft processors

Removing some/all hazard detection logic
• Can the compiler compensate with scheduling?
  • E.g., worst case use no-ops to ensure correctness
  • Challenge: variable, multi-cycle instructions
• What is the cost/benefit of doing so?
[Figure: Pipe3 (F/D, R/EX/M, WB) timing diagram showing a potential hazard and the resulting F/D stall]
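
The worst-case fallback of padding with no-ops can be sketched in Python. The model is an assumption for illustration: every result becomes readable `gap` instruction slots after it is issued, with a fixed latency, so it does not address the variable, multi-cycle instructions named above as the challenge. The instruction encoding `(dest, sources)` is hypothetical.

```python
NOP = (None, ())  # a no-op writes nothing and reads nothing


def insert_nops(instrs, gap=1):
    """Pad with no-ops so no instruction reads a result fewer than
    `gap` slots after its producer (gap must be >= 1)."""
    out = []
    for dest, srcs in instrs:
        needed = 0
        # Inspect the last `gap` emitted slots, nearest first.
        for dist, (prev_dest, _) in enumerate(reversed(out[-gap:]), start=1):
            if prev_dest in srcs:
                needed = max(needed, gap - dist + 1)
        out.extend([NOP] * needed)
        out.append((dest, srcs))
    return out


program = [
    ("r3", ("r1", "r2")),
    ("r4", ("r3", "r5")),  # reads r3 right after it is produced
    ("r6", ("r1", "r2")),  # independent of the previous instruction
]
padded = insert_nops(program, gap=1)
```

A real scheduler would first try to fill those slots with independent instructions and fall back to no-ops only when none are available; the code size and cycle cost of the padding is what has to be weighed against the area and frequency gains.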

Impact of Removing Hazard Detection Logic
• Up to 10% area and 6% frequency gains
