
Customizing Microblaze Processors for Increased Performance

An overview of customizing Microblaze soft-core processors for FPGA applications, covering research insights and size/performance tradeoffs aimed at improving speed and efficiency. Topics include synthesis strategies and the impact of application-specific configurations.


Presentation Transcript


  1. Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx Collaborators: David Sheldon (4th yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)

  2. Outline • Two UCR ICCAD’06 papers • Microblaze customization • Microblaze conjoining (and customization) • Current work targeting Microblaze users • “Design of Experiments” paradigm • System-level synthesis for multi-core systems • Related FPGA work • "Warp processing" • Standard binaries for FPGAs Frank Vahid, UC Riverside

  3. Microblaze Customization (ICCAD paper #1) • FPGAs are an increasingly popular software platform • FPGA soft core processor: a microprocessor synthesized onto the FPGA fabric • Soft core customization: cores come with configurable parameters • Xilinx Microblaze comes with several instantiable units: multiplier, barrel shifter, divider, FPU, and cache • Customization: tuning soft core parameters to a specific application [Figure: App1 and App2 each mapped to Microblaze variants with different instantiated units (Mul, BS, Div, FPU, I$)] Frank Vahid, UC Riverside

  4. Instantiable Unit Speedups • Instantiating units can yield significant speedups • “base” – Microblaze without any optional units instantiated Frank Vahid, UC Riverside

  5. Customization Tradeoffs • Data for the aifir EEMBC benchmark on a Xilinx Microblaze synthesized to a Virtex device • Configurations span roughly a 2x performance tradeoff and a 4.5x size tradeoff Frank Vahid, UC Riverside

  6. Key Problem Related to Core Customization • Problem: Synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes • Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of the best one Frank Vahid, UC Riverside

  7. Two Solution Approaches • Traditional CAD approach • Pre-characterize using synthesis and execution/simulation, create an abstract problem model, solve using CAD exploration algorithms • Used a 0-1 knapsack formulation • Synthesis-in-the-loop approach • Run synthesis and execute/simulate the application while exploring • More accurate [Flow diagrams: pre-characterize (5-10 executions of synthesis and execution/simulation) → model (typically some form of graph) → explore, vs. an explore loop with synthesis and execution/simulation in each of 5-10 iterations] Frank Vahid, UC Riverside
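To make the pre-characterized approach concrete, here is a minimal Python sketch of a 0-1 knapsack selection over the optional Microblaze units, assuming each unit has already been profiled once for its per-application benefit (cycles saved) and size cost. The numbers in the table are illustrative placeholders, not measured data, and the paper's actual model may differ.

```python
# Sketch of the "traditional CAD" approach: pre-characterize each optional
# unit once, then pick the subset that maximizes the estimated benefit under
# a size budget using 0-1 knapsack dynamic programming.
# Unit names are real Microblaze options; benefit/size values are placeholders.

units = [
    # (name, estimated cycles saved for this app, size cost in LUTs)
    ("multiplier",     1_200_000,  250),
    ("barrel_shifter", 2_500_000,  400),
    ("divider",          300_000,  500),
    ("fpu",               50_000, 1200),
    ("cache",            900_000,  800),
]

def knapsack(units, size_budget, step=50):
    """Return (benefit, chosen units) for the best subset fitting in size_budget LUTs."""
    n_bins = size_budget // step
    best = [(0, [])] * (n_bins + 1)          # best[b] = (benefit, chosen) at size b*step
    for name, benefit, size in units:
        cost = -(-size // step)              # round the unit's size up to a bin
        for b in range(n_bins, cost - 1, -1):
            cand = best[b - cost][0] + benefit
            if cand > best[b][0]:
                best[b] = (cand, best[b - cost][1] + [name])
    return best[n_bins]

benefit, chosen = knapsack(units, size_budget=2000)
print("chosen units:", chosen, "estimated cycles saved:", benefit)
```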

  8. Synthesis-in-the-Loop Approach • View the solution space as a tree, each level a yes/no decision for a unit, with levels ordered by the unit's speedup/size for the application • 11 synthesis runs may take a few hours • To reduce runtime, can instead use a pre-determined order, determined by the soft core vendor based on averages over many benchmarks [Chart: 1000*speedup/size for multiplier, barrel shifter, FPU, divider, and cache; Figures: application-specific impact-ordered tree vs. fixed impact-ordered tree, with branches such as base vs. base+div at each level] Frank Vahid, UC Riverside
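A sketch of the synthesis-in-the-loop walk down a fixed impact-ordered tree follows. The helper synthesize_and_run() is a hypothetical stand-in for the real Xilinx synthesis plus execution/simulation flow (the roughly 15-minute step noted earlier), and the fixed unit order shown is illustrative.

```python
# Sketch of the synthesis-in-the-loop walk down an impact-ordered tree.
# Units are visited in a fixed, vendor-style order; at each level the
# configuration with the unit added is synthesized and kept only if it
# fits the size constraint and actually speeds up this application.

FIXED_IMPACT_ORDER = ["multiplier", "barrel_shifter", "fpu", "divider", "cache"]

def synthesize_and_run(config, app):
    """Hypothetical wrapper: synthesize `config`, run/simulate `app`,
    return (cycles, size_in_luts). ~15 minutes per call in practice."""
    raise NotImplementedError

def explore(app, size_budget):
    config = set()                                    # start from the base Microblaze
    best_cycles, best_size = synthesize_and_run(config, app)
    for unit in FIXED_IMPACT_ORDER:                   # one yes/no decision per tree level
        trial = config | {unit}
        cycles, size = synthesize_and_run(trial, app)
        if size <= size_budget and cycles < best_cycles:
            config, best_cycles, best_size = trial, cycles, size
    return config, best_cycles, best_size
```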

  9. Synthesis-in-the-Loop Approach • Data for the fixed impact-ordered tree for 11 EEMBC benchmarks [Charts: speedup, size (equivalent LUTs), and speedup/size for multiplier, barrel shifter, FPU, divider, and cache] Frank Vahid, UC Riverside

  10. Customization Results • Fixed-tree approach generally best • Application-specific tree better for certain apps, but about 2x the tool runtime • Results are averages for 11 EEMBC benchmarks • ICCAD'06, David Sheldon et al. [Charts: speedup vs. tool runtime (minutes) for the fixed impact-ordered tree, application-specific impact-ordered tree, random impact-ordered tree, exhaustive search, and knapsack, under four scenarios: no size constraint on Virtex II; size constraint = 80% of full MB size on Virtex II; no size constraint on a Spartan2 device; size constraint = 80% of the application-specific optimal MB configuration (guaranteed to “hurt”) on Virtex II] Frank Vahid, UC Riverside

  11. Conjoined Processors (ICCAD paper #2) • Conjoined processors: two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen, ISCA 2004) • Prior work showed little performance overhead for desktop processors • For desktop processors the only research customer is Intel; for soft core processors, the research customers are every soft core user • How much size savings and performance overhead for conjoined Microblazes? [Figure: Processor 1 and Processor 2 each with a private multiplier vs. the two processors sharing one conjoined multiplier] Frank Vahid, UC Riverside

  12. Conjoined Processors – Size Savings Frank Vahid, UC Riverside

  13. Conjoined Processors – Performance Overhead • We created a trace simulator • Reads two instruction traces output by the MB simulator • Adds a 1-cycle delay for every access to a conjoined unit (pessimistic assumption about the contention detection scheme) • Looks for simultaneous accesses to the shared unit, and stalls one MB entirely until the unit becomes available • Unit cycle latencies in this configuration: Barrel Shifter 2, Divider 34, Multiplier 3, FPU (Add, Sub, Mul) 6, FPU (Div) 30 Frank Vahid, UC Riverside
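A simplified Python sketch of such a trace simulator, under the assumptions stated on the slide: every conjoined-unit access pays one extra cycle, and simultaneous accesses stall one processor until the unit frees. The trace format used here (a list of (cycles, uses_shared_unit) entries per processor) is a simplification of whatever the MB simulator actually emits.

```python
# Sketch of the conjoined-unit trace simulator: two traces are replayed
# together; each access to the shared unit pays +1 cycle (pessimistic
# contention detection), and if both processors want the unit at once,
# one stalls entirely until the unit is free.

# Unit latencies from the slide's table, for reference when building traces.
UNIT_LATENCY = {"barrel_shifter": 2, "divider": 34, "multiplier": 3,
                "fpu_add_sub_mul": 6, "fpu_div": 30}

def simulate(trace_a, trace_b):
    """trace_*: list of (cycles, uses_shared_unit) per instruction."""
    t = [0, 0]                 # current cycle count of each processor
    unit_free_at = 0           # cycle at which the shared unit becomes free
    traces = [list(trace_a), list(trace_b)]
    idx = [0, 0]
    while idx[0] < len(traces[0]) or idx[1] < len(traces[1]):
        # advance whichever processor with work left is earliest in time
        p = min((k for k in (0, 1) if idx[k] < len(traces[k])), key=lambda k: t[k])
        cycles, uses_unit = traces[p][idx[p]]
        if uses_unit:
            start = max(t[p], unit_free_at)   # stall while the other MB holds the unit
            t[p] = start + cycles + 1         # +1 cycle per conjoined-unit access
            unit_free_at = start + cycles
        else:
            t[p] += cycles
        idx[p] += 1
    return max(t)              # cycles for the conjoined pair to finish both traces

# example: simulate([(1, False), (2, True)], [(2, True), (1, False)])
```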

  14. Conjoined Processors – Performance Overhead • Data shown for benchmarks that benefit (>1.3x speedup) from the barrel shifter • Performance overheads are small [Chart: conjoined vs. unconjoined speedups for pairings of the brev, canrdr, and bitmnp benchmarks] Frank Vahid, UC Riverside

  15. Customization Considering Conjoinment • Developed a 0-1 knapsack approach • “Disjunctively-Constrained Knapsack Solution” to accommodate conjoinment • ICCAD'06, David Sheldon et al. • Only 8 pairings shown due to space limits • Note: To avoid exaggerating the benefits of conjoinment, data only considers benchmark pairs that significantly use a shared unit [Charts: speedup and size (equiv. LUTs) for knapsack, exhaustive with conjoinment, and exhaustive without conjoinment, across pairings of BaseFP01, canrdr, tblook, and bitmnp, plus the average] Frank Vahid, UC Riverside
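The paper's disjunctively-constrained knapsack formulation is not reproduced here; the sketch below only illustrates the disjunctive idea: for each unit, the "private per processor" and "conjoined (shared)" variants are mutually exclusive options, and a selection is made under a size budget. A brute-force search stands in for the knapsack solver, and all numbers are placeholders.

```python
# Sketch of customization with conjoinment as a disjunctively-constrained
# selection: for each unit, exactly one of several mutually exclusive
# variants is chosen (that exclusivity is the disjunctive constraint).
from itertools import product

# option -> (size cost in LUTs, benefit for app A, benefit for app B); placeholder data
OPTIONS = {
    "barrel_shifter": {
        "none":        (0,    0,    0),
        "private_A":   (400,  900,  0),
        "private_B":   (400,  0,    700),
        "two_private": (800,  900,  700),
        "conjoined":   (450,  850,  650),   # shared unit: smaller, slight overhead
    },
    "multiplier": {
        "none":        (0,    0,    0),
        "two_private": (500,  400,  300),
        "conjoined":   (300,  380,  280),
    },
}

def best_configuration(size_budget):
    best = (0, None)
    units = list(OPTIONS)
    for choice in product(*(OPTIONS[u] for u in units)):   # one variant per unit
        size = sum(OPTIONS[u][c][0] for u, c in zip(units, choice))
        gain = sum(OPTIONS[u][c][1] + OPTIONS[u][c][2] for u, c in zip(units, choice))
        if size <= size_budget and gain > best[0]:
            best = (gain, dict(zip(units, choice)))
    return best

print(best_configuration(size_budget=900))
```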

  16. Outline • Two UCR ICCAD’06 papers • Microblaze customization • Microblaze conjoining (and customization) • Current work targeting Microblaze users • “Design of Experiments” paradigm • System-level synthesis for multi-core systems • Related FPGA work • "Warp processing" • Standard binaries for FPGAs Frank Vahid, UC Riverside

  17. Ongoing Work – Design of Experiments Paradigm • "Design of Experiments" • Well-established discipline (>80 yrs) for tuning parameters • For factories, crops, management, etc. • Want to set parameter values for best output • But each experiment costly, so can't try all combinations • Clear mapping of soft core customization to DOE problem • Given parameters and # of possible experiments • Generates which experiments to run (parameter values) • Analyzes resulting data • Sound mathematical foundations • Present focus of David Sheldon (4th yr Ph.D.) Frank Vahid, UC Riverside

  18. Ongoing Work – Design of Experiments Paradigm • Suppose there is time for 12 experiments • The DOE tool generates which 12 experiments to run • The user fills in the results (Cycles, Y1) column

Factors: A=BS, B=FPU, C=MUL, D=DIV, E=MSR, F=COMP, G=ICACHE_type, H=ICACHE_size, I=DCACHE_type, J=DCACHE_size

Row #   A B C D E F G H I J
1       0 0 0 0 0 0 0 0 0 0
2       0 0 0 0 0 1 1 1 1 1
3       0 0 1 1 1 0 0 0 1 1
4       0 1 0 1 1 0 1 1 0 0
5       0 1 1 0 1 1 0 1 0 1
6       0 1 1 1 0 1 1 0 1 0
7       1 0 1 1 0 0 1 1 0 1
8       1 0 1 0 1 1 1 0 0 0
9       1 0 0 1 1 1 0 1 1 0
10      1 1 1 0 0 0 0 1 1 0
11      1 1 0 1 0 1 0 0 0 1
12      1 1 0 0 1 0 1 0 1 1

Cycles (Y1) results reported: 12696265, 3544216, 8262214, 3808647, 2860019, 10818171, 2644509, 8046171, 3601399, 3308450, 9208946, 7276392

Frank Vahid, UC Riverside
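The 12-run matrix above has the structure of a two-level screening design; the sketch below generates a 12-run Plackett-Burman design for the ten Microblaze factors using the commonly cited generator row. This is a generic construction, not the output of the specific DOE tool used in this work, and its row/column ordering may differ from the table.

```python
# Sketch: generate a 12-run Plackett-Burman screening design for up to 11
# two-level factors (here the 10 Microblaze parameters from the table).
# Cyclic shifts of the generator row give 11 runs; an all-low row completes 12.

FACTORS = ["BS", "FPU", "MUL", "DIV", "MSR", "COMP",
           "ICACHE_type", "ICACHE_size", "DCACHE_type", "DCACHE_size"]

GENERATOR = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]   # standard PB12 row: 1 = high, 0 = low

def plackett_burman_12(factors):
    rows = []
    for shift in range(11):
        row = [GENERATOR[(i - shift) % 11] for i in range(11)]
        rows.append(row[:len(factors)])          # keep one column per factor
    rows.append([0] * len(factors))              # final all-low run
    return rows

design = plackett_burman_12(FACTORS)
for run, settings in enumerate(design, start=1):
    print(run, dict(zip(FACTORS, settings)))
```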

  19. Ongoing Work – Design of Experiments Paradigm • DOE tool analyzes results • Finds most important factors for given application Frank Vahid, UC Riverside
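A sketch of the kind of main-effects analysis such a tool performs: for each factor, compare the mean cycle count over the runs where the factor was on versus off; the factors with the largest absolute difference matter most for that application. The design and cycles arguments would come from the table two slides back.

```python
# Sketch of DOE main-effects analysis: a factor's effect is the difference
# between the average cycle count when it is "on" and when it is "off".
# `design` is the 12-run matrix (one 0/1 list per run) and `cycles` the
# measured results column filled in by the user.

def main_effects(design, cycles, factors):
    effects = {}
    for j, name in enumerate(factors):
        on  = [c for row, c in zip(design, cycles) if row[j] == 1]
        off = [c for row, c in zip(design, cycles) if row[j] == 0]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    # most important factors = largest absolute effect on cycles
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

# example: ranked = main_effects(design, cycles, FACTORS); print(ranked[:3])
```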

  20. Ongoing Work – Design of Experiments Paradigm • Results for a different application Frank Vahid, UC Riverside

  21. Ongoing Work – Design of Experiments Paradigm • Interactions among parameters also automatically determined Frank Vahid, UC Riverside

  22. Ongoing work – System synthesis • Given N applications (App1, App2, ..., AppN) • Create a customized soft core for each app • Criteria: meet a size constraint, minimize total applications' runtime • Other criteria possible (e.g., meet a runtime constraint, minimize size) • Present focus of Ryan Mannion, 3rd yr Ph.D. [Figure: each application mapped to its own customized soft core (Microblaze or PicoBlaze) with different instantiated units (Mul, Div, I$, FPU)] Frank Vahid, UC Riverside

  23. Ongoing work – System synthesis • Graduate Student: Ryan Mannion, 3rd yr Ph.D. • Presently use an Integer Linear Program (ILP) formulation • Solutions for a large set of Xilinx devices are generated in seconds Frank Vahid, UC Riverside
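One way such an ILP could be written, sketched with the open-source PuLP modeler: a binary variable picks one pre-characterized configuration per application, the selected cores must fit the device, and total runtime is minimized. The configurations, sizes, and runtimes below are placeholders, and the actual formulation in the ongoing work may differ.

```python
# Sketch of a system-synthesis ILP (PuLP): pick one pre-characterized core
# configuration per application, fit the device size, minimize total runtime.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

apps = ["app1", "app2", "app3"]
configs = {                      # config -> (size in LUTs, {app: runtime}); placeholder data
    "base":     (1000, {"app1": 9.0, "app2": 4.0, "app3": 7.0}),
    "base_mul": (1300, {"app1": 5.0, "app2": 3.8, "app3": 6.5}),
    "base_bs":  (1400, {"app1": 8.5, "app2": 1.9, "app3": 3.0}),
}
DEVICE_LUTS = 4000

prob = LpProblem("soft_core_system_synthesis", LpMinimize)
x = {(a, c): LpVariable(f"x_{a}_{c}", cat=LpBinary) for a in apps for c in configs}

# objective: total runtime over all applications
prob += lpSum(configs[c][1][a] * x[a, c] for a in apps for c in configs)
# each application runs on exactly one core configuration
for a in apps:
    prob += lpSum(x[a, c] for c in configs) == 1
# all selected cores must fit on the device
prob += lpSum(configs[c][0] * x[a, c] for a in apps for c in configs) <= DEVICE_LUTS

prob.solve()
for a in apps:
    chosen = [c for c in configs if x[a, c].value() == 1]
    print(a, "->", chosen[0])
```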

  24. Outline • Two UCR ICCAD’06 papers • Microblaze customization • Microblaze conjoining (and customization) • Current work targeting Microblaze users • “Design of Experiments” paradigm • System-level synthesis for multi-core systems • Related FPGA work • Warp processing • Standard binaries for FPGAs Frank Vahid, UC Riverside

  25. Binary-Level Synthesis • Binary-level FPGA compiler developed 2002-2006 (Greg Stitt, Ph.D. UCR 2007) • A source-level FPGA compiler provides a limited solution • A binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information [Figure: C++, Java, and asm sources are compiled/assembled and linked into the microprocessor binary; a source-level FPGA compiler generates the FPGA binary from source, while a binary-level FPGA compiler generates it from the microprocessor binary] Frank Vahid, UC Riverside

  26. Binary Synthesis Competitive with Source Level • Aggressive decompilation recovers most high-level constructs needed for good synthesis – makes binary-level synthesis competitive with source level • Freescale H264 decoder example, from ISSS/CODES 2005 Frank Vahid, UC Riverside

  27. Binary Synthesis Enables Dynamic Hardware/Software Partitioning • Called “Warp Processing” (Vahid/Stitt/Lysecky, 2003-2007) • Direct collaborators: Intel, IBM, and Freescale [Figure: a downloader loads the microprocessor binary onto a chip or board containing a microprocessor, an FPGA, and an on-chip binary-level FPGA compiler that produces the FPGA binary] Frank Vahid, UC Riverside

  28. Warp Processing Idea, Step 1 • Initially, the software binary is loaded into instruction memory • Software binary: Mov reg3, 0 / Mov reg4, 0 / loop: Shl reg1, reg3, 1 / Add reg5, reg2, reg1 / Ld reg6, 0(reg5) / Add reg4, reg4, reg6 / Add reg3, reg3, 1 / Beq reg3, 10, -5 / Ret reg4 [Figure: architecture with profiler, instruction memory, µP, D$, FPGA, and on-chip CAD] Frank Vahid, UC Riverside

  29. Warp Processing Idea, Step 2 • The microprocessor executes the instructions in the software binary [Figure: same architecture and software binary as the previous slide; the µP fetches from instruction memory] Frank Vahid, UC Riverside

  30. Warp Processing Idea, Step 3 • The profiler monitors instructions and detects critical regions in the binary • Critical loop detected [Figure: same architecture; the profiler observes the frequently executed add/beq instructions of the loop] Frank Vahid, UC Riverside

  31. Warp Processing Idea, Step 4 • The on-chip CAD reads in the critical region [Figure: same architecture; the critical loop of the software binary is passed to the on-chip CAD] Frank Vahid, UC Riverside

  32. Warp Processing Idea, Step 5 • The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG): reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4 [Figure: same architecture, with a dynamic partitioning module (DPM) performing the on-chip CAD] Frank Vahid, UC Riverside
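For reference, the decompiled region corresponds roughly to the following high-level loop, rendered in Python for consistency with the other sketches; the shift-by-1 address computation suggests 2-byte array elements, which is an assumption not spelled out on the slide.

```python
# Roughly the high-level view the decompiler recovers from the binary above:
# a loop that sums ten consecutive 2-byte elements starting at the address
# held in reg2, returning the sum in reg4.
def kernel(mem, reg2):
    reg4 = 0
    for reg3 in range(10):
        reg4 += mem[reg2 + (reg3 << 1)]
    return reg4
```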

  33. Warp Processing Idea, Step 6 • The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit [Figure: the CDFG mapped to a tree of adders] Frank Vahid, UC Riverside

  34. Warp Processing Idea, Step 7 • The on-chip CAD maps the circuit onto the FPGA [Figure: the adder circuit placed and routed onto FPGA fabric (switch matrices and CLBs)] Frank Vahid, UC Riverside

  35. Warp Processing Idea, Step 8 • The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more • Updated binary: Mov reg3, 0 / Mov reg4, 0 / loop: // instructions that interact with FPGA / Ret reg4 • DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. Patent Pending [Chart: software-only vs. “warped” execution] Frank Vahid, UC Riverside

  36. Warp Processors: Performance Speedup (Most Frequent Kernel Only) • Average kernel speedup of 41, vs. 21 for Virtex-E • WCLA simplicity results in faster HW circuits [Chart: kernel speedups relative to SW-only execution] Frank Vahid, UC Riverside

  37. Warp Processors: Performance Speedup (Overall, Multiple Kernels) • Average speedup of 7.4 • Energy reduction of 38% - 94% • Assuming a 100 MHz ARM, and fabric clocked at the rate determined by synthesis [Chart: overall speedups relative to SW-only execution] Frank Vahid, UC Riverside

  38. Warp Processors: Speedups Compared with a Digital Signal Processor Frank Vahid, UC Riverside

  39. Warp Processors: Speedups for Multi-Threaded Application Benchmarks • Compelling computing advantage of FPGAs: parallelism from the bit level up to the processor level, and everywhere in between Frank Vahid, UC Riverside

  40. FPGA Ubiquity via Obscurity • Warp processing hides the FPGA from languages and tools • ANY microprocessor platform is extendible with an FPGA • Maintains the “ecosystem” of application, tool, and architecture developers built around standard binaries • New processor platforms with FPGAs are evolving [Figure: applications compiled by a standard compiler to a standard SW binary; a translator with profiling maps the binary onto the processor and FPGA] Frank Vahid, UC Riverside

  41. Standard Binaries? • A microprocessor binary represents one form of a “standard binary for FPGAs” • Missing is explicit concurrency: parallelism, pipelining, queues, etc. • An ecosystem for FPGAs (a standard FPGA binary and standard FPGA compiler, perhaps from SystemC) is presently sorely missing • As FPGAs appear in more platforms, might a more general FPGA binary evolve? [Figure: applications compiled by a standard FPGA compiler to a standard FPGA binary; a translator with profiling maps it onto the processor and FPGA] Frank Vahid, UC Riverside

  42. Standard Binaries? • The translator would make the best use of existing FPGA resources • Could even add FPGA, like adding memory, to improve performance • Add more FPGA to your PDA to implement a compute-intensive application? [Figure: the same FPGA binary translated for a low-end PDA with a small FPGA (100 sec) and a high-end PDA with more FPGA (1 sec)] Frank Vahid, UC Riverside

  43. FPGA Standard Binaries • NSF funding received for 2006-2009 • Xilinx letter of support was helpful • Graduate Student: Scott Sirowy, 2nd year Ph.D. Frank Vahid, UC Riverside

  44. Conclusions • Soft core customization increasingly important to make best use of limited FPGA resources • Good initial automatic customization results • “Design of Experiments” paradigm looks promising • System-level synthesis may yield very useful MB user tool, perhaps web based • Warp processing and standard FPGA binary work can help make FPGAs ubiquitous • Accomplishments made possible by Xilinx donations and interactions • Continued and close collaboration sought Frank Vahid, UC Riverside
