Automatic Application-Specific Customization of Soft Processor Microarchitecture

Shobana Padmanabhan, Roger D. Chamberlain, Ron K. Cytron, John D. Lockwood • Washington University • Funded by NSF under grant 03-13203 • http://www.arl.wustl.edu/~sp3 • Apr 26, 2006

Outline • Motivation • Automatic optimization technique • a novel application of a standard optimization technique • Evaluation & Results

Constrained embedded applications • Embedded applications • Very restrictive FPGA and power constraints • Demanding application performance requirements • Requirement-constraint trade-offs • Soft processors • For application performance improvement • As prototype for custom hardware design

Soft processors • Parameterized general purpose processors • Customization is performance-cost tradeoff • More “knobs” more options for customization • [Figure labels: FPGA resources, Power, App performance; Number of registers, Set size, Set associativity]

Soft processor customization • LEON: 10 reconfigurable subsystems • Instruction cache • Parameters: sets, set size, line size, replacement policy • 4 * 7 * 2 * 3 = 168 configurations (4 parameters; 16 values) • Data cache • sets, set size, line size, replacement, fast read, fast write, local RAM, local RAM size • 168 * 2 * 2 * 2 * 7 = 9,408 configurations (8 parameters; 29 values) • Integer unit • multiplier, registers, fast jump, fast decode, ICC, load delay, FPU enable, co-processor enable, hardware watchpoints • 119,040 configurations (10 parameters; 56 values) • Also floating-point unit, memory controller, peripherals, … • 190 parameter values; 5 × 10^24 configurations!!

Existing approaches • Scaling problems • Runtime measurement problems • Estimation is quick but inaccurate • Simulators are extremely slow

Highlights of our optimization technique • Customize “all” parameters • Parameter independence assumption • Linear with number of parameters • Build only hundreds of configurations instead of 5 × 10^24 • Search space still includes all 5 × 10^24 configurations • Feasible and scalable • Formulate as binary integer nonlinear optimization program • A novel application of a standard technique • Use “actual” costs, to be accurate

Cost measurement • Application runtime • From direct execution • Hardware-based profiler • Non-intrusive, cycle-accurate, in “real-time” • Part of Liquid architecture platform • Runtime cost is application-specific • FPGA resources • In terms of LUTs and BRAM, from actual build • Takes >30 minutes • Harder than traditional optimization problems • Resource cost is processor-specific • Power (energy): future work


Our optimization technique • Out-of-box soft processor; base configuration • Assumes parameter independence • Perturb parameter values one by one, build configuration, track resource cost • Run application on each configuration, track runtime cost • Formulate costs as Binary Integer Nonlinear Program • Near-optimal in practice • Solve using TOMLAB/MATLAB


Processor ICache reconfiguration • xi = 0 or 1 (off or on) • No constraint needed

FPGA resource constraints • LUTs • BRAM • xi = 0 or 1 (off or on) • ri, li, bi: delta costs (runtime, LUTs, BRAM) from base configuration • n is number of configurations

Optimization • Optimize application runtime • Optimize resource utilization also
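The equations on these slides did not survive in the transcript. Using the slide legend (xi binary, ri, li, bi delta costs from the base configuration, n configurations), the program might look roughly like the sketch below. The LUT and BRAM budgets $L$ and $B$, the weights $w_l$ and $w_b$, and the per-parameter index sets $G_p$ are assumed symbols, not taken from the slides. The sketch is linear; the backup slides note that cache size is a product of two parameters, which is what makes the actual program nonlinear.

```latex
\begin{aligned}
\min_{x}\quad
  & \sum_{i=1}^{n} \left( r_i + w_l\, l_i + w_b\, b_i \right) x_i \\
\text{subject to}\quad
  & \sum_{i=1}^{n} l_i\, x_i \le L, \\
  & \sum_{i=1}^{n} b_i\, x_i \le B, \\
  & \sum_{i \in G_p} x_i \le 1 \quad \text{for each parameter } p, \\
  & x_i \in \{0, 1\}, \quad i = 1, \dots, n.
\end{aligned}
```

Setting $w_l = w_b = 0$ optimizes runtime alone; nonzero weights fold resource utilization into the objective as well.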

Problem formulation recapped • Minimize • Subject to … … • Parameter validity constraints • FPGA resource constraints • Binary variables constraint

Evaluation & Results • Evaluate the impact of parameter independence assumption • Compare against exhaustive runs of a small subsystem • Dcache parameters of sets and set size

Evaluation • Our technique selects the same configuration • Despite parameter independence assumption, near-optimal configuration

Distribution of generated configurations

Illustration of all configurations being searched

Highlights of results • 6.2 – 19.4% improvement in application performance • 2 - 3% savings in resources • Solutions customized simultaneously along many parameters • Customization is indeed application-specific

Conclusion • Our optimization technique • Linear with number of parameter values • Assuming parameter independence • Feasible, scalable • Near-optimal results in practice • Actual costs formulated as Binary Integer Nonlinear Program • A novel application of the technique • Only hours for configuration generation; seconds for optimization • Without any knowledge of architecture • Without any changes to application http://www.arl.wustl.edu/~sp3

Backup

Additional LEON-imposed constraints • LRR replacement only with 2 sets • LRU replacement with 2, 3, or 4 sets

Additional FPGA resource constraints • Cache size = (#sets) * (set size) • Non-linear • Terms: icache set size, dcache set size
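One way to see why this constraint is nonlinear: if both the number of sets and the set size are selected by binary variables, their product contains products of those variables. The sketch below uses assumed notation that is not on the slide: $s_0$ and $z_0$ are the base number of sets and base set size, $\Delta s_j$ and $\Delta z_k$ are the changes selected by $x_j$ and $x_k$, and $S$ and $Z$ index the variables for each parameter.

```latex
\begin{aligned}
\text{cache size}
  &= \Bigl( s_0 + \sum_{j \in S} \Delta s_j\, x_j \Bigr)
     \Bigl( z_0 + \sum_{k \in Z} \Delta z_k\, x_k \Bigr) \\
  &= s_0 z_0
     + z_0 \sum_{j \in S} \Delta s_j\, x_j
     + s_0 \sum_{k \in Z} \Delta z_k\, x_k
     + \sum_{j \in S} \sum_{k \in Z} \Delta s_j\, \Delta z_k\, x_j x_k .
\end{aligned}
```

The final double sum contains the bilinear terms $x_j x_k$, so a constraint on cache size cannot be written as a linear function of the binary variables.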

Evaluation

Cost approximations during eval - BLASTN

Cost approximations during eval - Commbench DRR

Cost approximations during eval – Commbench FRAG

Optimization results • Customization is indeed application-specific.

Cost approximations of the solutions We overestimate runtime decrease, underestimate resource increase

Cost approximations of the solutions We overestimate chip resource decrease, underestimate runtime decrease (except Arith, where we match)

BLASTN results

CommBench DRR results

CommBench FRAG results

BYTE Arith results

Evaluation: cost approximation range, summary

Result: cost approximation range, summary • 0 to 19.75% • -2 to 3% • Nonlinear for LUTs is slightly worse • Linear for BRAM is worse

Clock cycles

LEON processor reconfiguration • Icache (Instruction cache) • Dcache (Data cache) • IU (Integer Unit)

Processor DCache reconfiguration • xi = 0 or 1 (off or on)

Processor IU reconfiguration • xi = 0 or 1 (off or on) • Valid parameters; next, fit on chip…

Optimization • Optimize application runtime • Optimize resource utilization also • Similarly, optimize power consumption

Future work • Further analysis • Improve cost approximations • To be optimal for all xi • To match actual costs more closely • Extensions • Power optimization • Energy optimization • For applications with long runtimes, sampling technique • Run applications on an operating system • ISA reconfiguration • “Give back” • Integrate this with LEON… • Evaluate technique on other configuration/feature management problems

Details Backup

Existing approaches • Compiler-directed customization of ASIP cores by Gupta et al. (2002) • Considers only 4 functional units; only DSP benchmarks • Tuning caches to applications for low-energy embedded systems by Ross et al. (2004) • Analytical (hierarchical) searching of parameters in their own dimensions, with some full parameter exploration to avoid local minima • Efficient architecture/compiler co-exploration for ASIPs by Fischer et al. (2002) • Considers only 3 architectural parameters, 4 compiler optimizations • Estimates chip costs • Towards automatic synthesis of a class of application-specific sensor networks by Bakshi et al. (2002) • Analytical model, followed by simulation-based refinement, but no optimization • Automatic generation of application specific processors by Goodwin et al. (2003) • Execution profiles to include/exclude new “instructions” • Shortcomings • Scaling problems • Runtime measurement problems • Estimating application performance through models is quick but inaccurate • Simulators are slow; hence scale down the application or limit to single execution

Applications • BLASTN • computation, memory intensive • Commbench DRR (Deficit Round Robin) • computation, memory intensive • Commbench FRAG • computation, memory intensive • BYTE Arith • computation intensive

main ()
{
    int index = 0, counter = 0, found = 0, matches = 0, *ans;
    unsigned int currentString = 348432612, base = 0, random = 0;
    //currentString above is used as a seed also
    ans = (int*)0x40000004; //memlocation where the # matches are stored

    for (index = 0; index < SIZE; index++) {
        hashTable[index] = 4194304;
    }
    fillQuery(NUM_QUERY); //populates the hashtable

    // the loop below generates random bases for the database
    for (counter = 0; counter < NUM_DATABASE; counter++) {
        random = Rnd(&random);
        if (random <= MINT / 4) {
            base = 0;
        } else if (random <= MINT / 2) {
            base = 1;
        } else if (random <= ((MINT / 2) + (MINT / 4))) {
            base = 2;
        } else {
            base = 3;
        }
        found = findMatch(base, &currentString);
        if (found == 1) {
            matches++;
        }
    }
    //printf ("Total number of matches found = %d\n", matches);
    ans[0] = matches;
}

void fillQuery(int qNum)
{
    int success, index;
    unsigned int currentString = 473246;
    unsigned int random = 782333;
    unsigned int base = 0;

    for (index = 0; index < qNum; index++) {
        random = Rnd(&random);
        if (random <= MINT / 4) {
            base = 0;
        } else if (random <= MINT / 2) {
            base = 1;
        } else if (random <= ((MINT / 2) + (MINT / 4))) {
            base = 2;
        } else {
            base = 3;
        }
        success = addQuery(base, &currentString);
        if (success) {
            success = 0;
        } else {
        }
    }
}

//uses open address, double hashing
unsigned int findMatch(unsigned int base1, unsigned int *currentString)
{
    unsigned int base, step, last, current;

    *currentString = computeKey(base1, *currentString);
    base = computeBase(*currentString);
    step = computeStep(*currentString);
    last = (base + (SIZE - 1) * step) % SIZE;
    if (coreLoop(base, step, last, currentString)) {
        return 1;
    } else {
        return 0;
    }
}

hash_leon_coreLoop_32K_HT 2388B
