Automatically Generating Custom Instruction Set Extensions
This paper presents a novel automated method for generating Custom Function Units (CFUs) to enhance the performance of general-purpose processors while maintaining low power consumption. As the demand for specialized systems rises—examples include cell phones and network routers—CFUs can combine primitive operations to meet these needs. Our proposed solution leverages a greedy selection algorithm to optimize performance gains relative to area cost. The experimental setup and results indicate significant speedup and cost advantages, emphasizing the potential of CFUs for future application-specific processors.
Presentation Transcript
Automatically Generating Custom Instruction Set Extensions Nathan Clark, Wilkin Tang, Scott Mahlke Workshop on Application Specific Processors 1
Problem Statement • There’s a demand for high performance, low power special purpose systems • E.g. Cell phones, network routers, PDAs • One way to achieve these goals is augmenting a general purpose processor with Custom Function Units (CFUs) • Combine several primitive operations • We propose an automated method for CFU generation 2
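Example • [Figure: DFG with nodes 1 to 8] • Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8} 4
Example • [Figure: DFG with nodes 1 to 8] • Potential CFUs: {1,3}, {2,4}, {2,6}, … {1,3,4}, {2,4,5}, {2,6,7}, … 5
Example • [Figure: DFG with nodes 1 to 8] • Potential CFUs: {1,3}, {2,4}, {2,6}, … {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, … {1,3,4,5,8} 6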
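The growth of candidates across the three Example slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the two-operation candidates listed on the slide correspond to the edges of the DFG, that a candidate is any connected set of operations, and that candidates grow by one neighboring operation at a time. The names EDGES, neighbors, and enumerate_candidates are hypothetical.

```python
# Edges of the example DFG, assumed to be the two-operation candidates
# listed on the slide.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def neighbors(edges):
    """Build an undirected adjacency map; edge direction is ignored here."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def enumerate_candidates(edges, max_ops):
    """Return every connected node set containing 2..max_ops operations."""
    adj = neighbors(edges)
    frontier = {frozenset(e) for e in edges}  # all two-operation candidates
    found = set(frontier)
    for _ in range(max_ops - 2):
        grown = set()
        for cand in frontier:
            # Operations adjacent to the candidate but not yet part of it.
            border = set().union(*(adj[n] for n in cand)) - cand
            for n in border:
                grown.add(cand | {n})
        frontier = grown - found
        found |= frontier
    return found


candidates = enumerate_candidates(EDGES, max_ops=5)
for size in range(2, 6):
    group = sorted(sorted(c) for c in candidates if len(c) == size)
    print(size, "ops:", len(group), "candidates, e.g.", group[0])
```

The number of candidates grows quickly with max_ops, which is one reason a real tool would bound the candidate size or prune candidates that need too many inputs and outputs.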
Characterization • Use the macro library to get information on each potential CFU • Latency is the sum of each primitive’s latency • Area is the sum of each primitive’s macrocell area 7
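The characterization step can be sketched as a pair of sums over a lookup table. The library contents below are hypothetical placeholder values, and the units (fractions of a clock cycle, area relative to one adder) are assumptions made for illustration; the names Macro, MACRO_LIBRARY, and characterize do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Macro:
    latency: float  # assumed unit: fraction of a clock cycle
    area: float     # assumed unit: area relative to one adder


# Hypothetical macro library; the numbers are placeholders, not measured data.
MACRO_LIBRARY = {
    "ADD": Macro(latency=0.5, area=1.0),
    "AND": Macro(latency=0.25, area=0.125),
    "XOR": Macro(latency=0.25, area=0.125),
    "LSL": Macro(latency=0.25, area=0.25),
}


def characterize(ops):
    """Estimate a candidate CFU as the sum of its primitives' latency and area."""
    latency = sum(MACRO_LIBRARY[op].latency for op in ops)
    area = sum(MACRO_LIBRARY[op].area for op in ops)
    return latency, area


latency, area = characterize(["AND", "ADD", "XOR"])
print(f"latency={latency} cycles, area={area} adders")
```

Summing every primitive's latency treats the candidate as a serial chain, so for a candidate whose operations partly run in parallel this estimate is an upper bound on the real delay.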
Issues we consider • Performance: on critical path, cycles saved • Cost: CFU area; control logic (difficult to measure); decode logic (difficult to measure); register file area (can be amortized) • [Figure: DFG with LD, AND, ADD, ASL, ADD, XOR, BR; AND, ASL, and XOR are annotated with the values 1 and 0.1, each ADD with 1 and 0.6] 8
More Issues to Consider • IO: number of input and output operands • Usability: how well the compiler can use the pattern • [Figure: DFG with OR, LSL, AND, CMPP] 9
Selection • Currently use a Greedy Algorithm • Pick the best performance gain / area first • Can yield bad selections • [Figure: DFG with OR, LSL, AND, CMPP] 10
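A minimal sketch of greedy selection by performance gain per unit area follows. The area budget, the Candidate record, and the example numbers are assumptions introduced for illustration; the slide only states that the best gain / area candidate is picked first. The example is built to show how such a heuristic can yield a bad selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cycles_saved: float  # estimated performance gain
    area: float          # estimated cost, must be positive


def greedy_select(candidates, area_budget):
    """Pick candidates in decreasing gain/area order while they still fit."""
    chosen = []
    remaining = area_budget
    ranked = sorted(candidates, key=lambda c: c.cycles_saved / c.area, reverse=True)
    for cand in ranked:
        if cand.area <= remaining:
            chosen.append(cand)
            remaining -= cand.area
    return chosen


# Hypothetical candidates: A has the better ratio, B the better total gain.
pool = [
    Candidate("A", cycles_saved=6, area=2),    # ratio 3.0
    Candidate("B", cycles_saved=20, area=10),  # ratio 2.0
]
picked = greedy_select(pool, area_budget=10)
print([c.name for c in picked], sum(c.cycles_saved for c in picked))
```

With a budget of 10, the heuristic takes A first, after which B no longer fits, so it saves 6 cycles where choosing B alone would have saved 20.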
Case study 1: Blowfish • Speedup: 1.24 • 10 cycles can be compressed down to 2! • Cost: ~6 adders • 6 inputs, 2 outputs • C code this DFG came from: r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff; • [Figure: DFG with operations ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD; registers r65, r70, r76, r81, r891, r91; constants #-1, #16, #255, #256, #2] 11
Case study 2: ADPCM Decode • Speedup: 1.20 • 3 cycles can be compressed down to 1 • Cost: ~1.5 adders • 2 inputs, 2 outputs • C code this DFG came from: d = d & 7; if ( d & 4 ) { … } • [Figure: DFG with operations AND, AND, CMPP; register r16; constants #7, #4, #0] 12
Experimental Setup • CFU recognition implemented in the Trimaran research infrastructure • Speedup shown is with CFUs relative to a baseline machine • Four-wide VLIW with predication • Can issue at most one each of Int, Flt, Mem, Brn instructions per cycle • 300 MHz clock • CFU latency is estimated using standard cells from Synopsys’ design library 13
Varying the Number of CFUs • More CFUs yield more performance • Weakness in our selection algorithm causes plateaus 14
Varying the Number of Ops • Bigger CFUs yield better performance • If they’re too big, they can’t be used as often and they expose alternate critical paths 15
Related Work • Many people have done this for code size: Bose et al., Liao et al. • Typically done with traces: Arnold et al. • Previous paper used a more enumerative discovery algorithm • We are unique because of our compiler-based approach and novel analysis of CFUs 16
Conclusion and Future Work • CFUs have the potential to offer large performance gains for small cost • Recognize more complex subgraphs • Generalized acyclic/cyclic subgraphs • Develop our system to automatically synthesize application-tailored coprocessors 17