Automatically Generating Custom Instruction Set Extensions

Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors

Problem Statement
• There is demand for high-performance, low-power special-purpose systems
  • E.g., cell phones, network routers, PDAs
• One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs)
  • A CFU combines several primitive operations
• We propose an automated method for CFU generation

System Overview

Example
[Figure: dataflow graph with nodes 1 through 8]
Potential CFUs: 1,3; 2,4; 2,6; 3,4; 4,5; 5,8; 6,7; 7,8

Example
[Figure: dataflow graph with nodes 1 through 8]
Potential CFUs: 1,3; 2,4; 2,6; …; 1,3,4; 2,4,5; 2,6,7; …

Example
[Figure: dataflow graph with nodes 1 through 8]
Potential CFUs: 1,3; 2,4; 2,6; …; 1,3,4,5; 2,4,5,8; 2,6,7,8; …; 1,3,4,5,8
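The candidate lists on the three Example slides are consistent with growing each candidate by one operation along a dataflow edge. The sketch below reproduces that growth under two assumptions that the slides do not state: the graph edges are exactly the eight two-node candidates listed on the first Example slide, and candidates are linear chains. The names `EDGES` and `enumerate_chains` are hypothetical.

```python
from collections import defaultdict

# Assumption: the dataflow edges are the eight two-node candidates
# listed on the first Example slide.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def enumerate_chains(edges, max_ops):
    """Grow candidate chains one operation at a time along dataflow edges.

    Returns a dict mapping chain length to a sorted list of chains.
    This only covers linear chains, not general subgraphs.
    """
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    by_size = {2: sorted(edges)}
    size = 2
    while size < max_ops:
        # Extend every chain of the current size by each successor
        # of its last operation.
        grown = sorted(
            chain + (nxt,)
            for chain in by_size[size]
            for nxt in successors[chain[-1]]
        )
        if not grown:
            break
        size += 1
        by_size[size] = grown
    return by_size


candidates = enumerate_chains(EDGES, max_ops=8)
for size, chains in candidates.items():
    print(size, chains)
```

On this graph the growth stops at five operations, where the only remaining chain is 1,3,4,5,8, matching the last candidate shown on the third Example slide.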

Characterization
• Use the macro library to get information on each potential CFU
• Latency is the sum of each primitive's latency
• Area is the sum of each primitive's macrocell area
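The two sums above can be sketched as a lookup into a per-operation table. The table contents below are made-up placeholder values, not figures from the macro library used in the talk, and the names `MACRO_LIBRARY` and `characterize` are hypothetical.

```python
# Hypothetical macro library: operation -> (latency in cycles, area in
# adder equivalents). These numbers are placeholders for illustration.
MACRO_LIBRARY = {
    "ADD": (0.5, 1.0),
    "AND": (0.125, 0.125),
    "XOR": (0.125, 0.25),
    "LSL": (0.125, 0.5),
}


def characterize(ops):
    """Estimate a candidate CFU as the sum of its primitives.

    Latency is the sum of each primitive's latency and area is the sum of
    each primitive's macrocell area, as stated on the slide.
    """
    latency = sum(MACRO_LIBRARY[op][0] for op in ops)
    area = sum(MACRO_LIBRARY[op][1] for op in ops)
    return latency, area


latency, area = characterize(["AND", "ADD", "XOR"])
print(f"latency={latency} cycles, area={area} adders")
```

Summing latencies treats the candidate as a serial chain of operations, which fits the chain-shaped candidates in the Example slides.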

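Issues we consider
• Performance
  • On critical path
  • Cycles saved
• Cost
  • CFU area
  • Control logic: difficult to measure
  • Decode logic: difficult to measure
  • Register file area: can be amortized
[Figure: dataflow graph of LD, AND, ADD, ASL, ADD, XOR, and BR operations; AND, ASL, and XOR are annotated with the values 1 and 0.1, and each ADD with 1 and 0.6]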

More Issues to Consider
• IO: number of input and output operands
• Usability: how well the compiler can use the pattern
[Figure: dataflow graph with OR, LSL, AND, and CMPP operations]

Selection
• Currently use a greedy algorithm
• Pick the best performance gain / area first
• Can yield bad selections
[Figure: dataflow graph with OR, LSL, AND, and CMPP operations]
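A minimal sketch of a greedy pass that ranks candidates by performance gain per unit area, as the Selection slide describes. The area budget as a stopping rule, the candidate values, and the names `Candidate` and `greedy_select` are assumptions made for illustration; the talk does not say how its selection terminates.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    cycles_saved: float  # performance gain
    area: float          # cost in adder equivalents


def greedy_select(candidates, area_budget):
    """Pick candidates in order of gain/area until the budget is used up.

    Assumption: selection is limited by a total area budget.
    """
    ranked = sorted(candidates, key=lambda c: c.cycles_saved / c.area,
                    reverse=True)
    chosen, used = [], 0.0
    for cand in ranked:
        if used + cand.area <= area_budget:
            chosen.append(cand)
            used += cand.area
    return chosen


# Made-up candidates chosen to show a bad greedy selection.
CANDIDATES = [
    Candidate("A", cycles_saved=7, area=6),
    Candidate("B", cycles_saved=5, area=5),
    Candidate("C", cycles_saved=5, area=5),
]

picked = greedy_select(CANDIDATES, area_budget=10)
print([c.name for c in picked], sum(c.cycles_saved for c in picked))
```

With these made-up numbers the greedy pass takes A first because it has the best ratio, after which neither B nor C fits, so it saves 7 cycles. Choosing B and C together would fit the same budget and save 10, which is one way a greedy choice can yield a bad selection.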

Case study 1: Blowfish
• Speedup: 1.24
• 10 cycles can be compressed down to 2
• Cost: ~6 adders
• 6 inputs, 2 outputs
• C code this DFG came from:
  r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
[Figure: dataflow graph of this computation, built from ADD, XOR, AND, LSR, and LSL operations on registers and constants]
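For readers who want to trace the primitive operations in the Blowfish case study, the C expression can be transcribed into a small runnable function. This is only a transcription sketch: the function name `blowfish_round` and the toy table `toy_s` are hypothetical, `t` is assumed to be a 32-bit value, and the table is not a real Blowfish S-box.

```python
MASK32 = 0xFFFFFFFF


def blowfish_round(r, t, s):
    """Transcription of the case-study expression.

    s is a flat table of 4 x 256 entries; t is assumed to fit in 32 bits,
    so t >> 24 indexes the first 256 entries.
    """
    f = (((s[t >> 24] + s[0x0100 + ((t >> 16) & 0xFF)])
          ^ s[0x0200 + ((t >> 8) & 0xFF)])
         + s[0x0300 + (t & 0xFF)]) & MASK32
    return r ^ f


# Toy table where each entry equals its index, so the lookups are easy
# to check by hand. This is not a real S-box.
toy_s = list(range(1024))
result = blowfish_round(0, 0x01020304, toy_s)
print(hex(result))
```

The shifts, masks, adds, and XORs in the body are the kind of primitive operations that a CFU would combine into a single instruction.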

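Case study 2: ADPCM Decode
• Speedup: 1.20
• 3 cycles can be compressed down to 1
• Cost: ~1.5 adders
• 2 inputs, 2 outputs
• C code this DFG came from:
  d = d & 7; if ( d & 4 ) { … }
[Figure: dataflow graph in which r16 is ANDed with #7, the result is ANDed with #4, and that result is compared with #0 by CMPP]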

Experimental Setup
• CFU recognition implemented in the Trimaran research infrastructure
• Speedup shown is with CFUs relative to a baseline machine
  • Four-wide VLIW with predication
  • Can issue at most one integer, one floating-point, one memory, and one branch instruction per cycle
  • 300 MHz clock
• CFU latency is estimated using standard cells from Synopsys' design library

Varying the Number of CFUs
• More CFUs yield more performance
• Weakness in our selection algorithm causes plateaus

Varying the Number of Ops
• Bigger CFUs yield better performance
• If they're too big, they can't be used as often and they expose alternate critical paths

Related Work
• Many people have done this for code size
  • Bose et al., Liao et al.
• Typically done with traces
  • Arnold et al.
• Previous paper used a more enumerative discovery algorithm
• We are unique because:
  • Compiler-based approach
  • Novel analysis of CFUs

Conclusion and Future Work
• CFUs have the potential to offer big performance gain for small cost
• Recognize more complex subgraphs
  • Generalized acyclic/cyclic subgraphs
• Develop our system to automatically synthesize application-tailored coprocessors
