Processor Acceleration Through Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
December 3, 2003
Motivation
• Cell phones, PDAs, digital cameras, etc. are everywhere
• High performance yet low power design point
• General core + ASIC solution
• Limited post-programmability
• General core + application-specific instructions (CFUs)
[Figure: a CPU paired with an ASIC, and a CPU paired with a CFU]
What is a CFU?
• Combine multiple primitive operations
• Smaller code size, fewer RF reads
• Increases performance
[Figure: a dataflow graph of primitive operations (+, ^, &, <<, |, *) in which groups of nodes are collapsed into CFU 1 and CFU 2]
Automation is Key
• This is ¼ of the DFG for a single basic block of blowfish
[Figure: DFG excerpt; legible node labels include 159 XOR, 164 SHR, 173 AND]
Related Work
• Tensilica Xtensa
• Commercial example
• MIPS core + manually constructed CFU
• Automatic instruction set synthesis is a mature field
• See paper for comparison of techniques
• Our contributions
• Novel technique for automatic CFU creation
• System to utilize CFUs in multiple applications
• Analysis of how effectively CFUs for one application apply to other applications in the same domain
System Overview
• Synthesis
• Subgraph identification
• Discover candidates for CFUs
• Weed out what shouldn’t be picked
• Selection
• Determine which candidates to use as CFUs
• Compilation
• Subgraph replacement
• Make use of the CFUs in a range of applications
Subgraph Identification
• Grow subgraphs from seed nodes
• All nodes are seeds
• Most directions don’t make sense
• How to decide where to grow?
• Make decisions using factors similar to those an architect would weigh
• Take 4 factors into consideration
• Criticality, Latency, Area, Input/Output
• Sum of these factors determines the value of each direction
• This step is NOT picking CFUs; it only produces CFU candidates
[Figure: a small DFG of +, *, &, ^, %, << and | nodes, with the CFU candidates grown so far listed beside it]
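The growth procedure on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `guide` callback stands in for the sum of the four factors, and the `threshold` and `max_nodes` parameters, the toy graph and its scores are all assumptions made up for the example.

```python
from typing import Callable, Dict, FrozenSet, Set

Graph = Dict[str, Set[str]]                      # node -> neighbouring nodes
Guide = Callable[[FrozenSet[str], str], float]   # (current subgraph, node to add) -> score


def grow_candidates(dfg: Graph, guide: Guide, threshold: float,
                    max_nodes: int = 4) -> Set[FrozenSet[str]]:
    """Grow subgraphs from every seed, following only well-scored directions."""
    candidates: Set[FrozenSet[str]] = set()

    def grow(sub: FrozenSet[str]) -> None:
        candidates.add(sub)
        if len(sub) >= max_nodes:   # hypothetical external size limit
            return
        # Directions of growth: nodes adjacent to the current subgraph.
        frontier = {n for node in sub for n in dfg[node]} - sub
        for nxt in frontier:
            grown = sub | {nxt}
            if grown not in candidates and guide(sub, nxt) >= threshold:
                grow(grown)

    for seed in dfg:                # every node is a seed
        grow(frozenset([seed]))
    return candidates


# Toy DFG: xor - and - mul. The scores are invented stand-ins for the summed
# criticality, latency, area and input/output weights.
toy_dfg: Graph = {"xor": {"and"}, "and": {"xor", "mul"}, "mul": {"and"}}
toy_score = {"xor": 35.0, "and": 30.0, "mul": 5.0}

found = grow_candidates(toy_dfg, lambda sub, nxt: toy_score[nxt],
                        threshold=20.0, max_nodes=2)
print(sorted(sorted(c) for c in found))
```

Pruning low-scoring directions is what keeps the number of explored subgraphs manageable; the result is only a candidate set, and choosing CFUs from it is a separate step.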
Critical Path
• Combining operations on the critical path will shrink the longer dependence chains
• Maximize potential performance gain
• $Wt = \frac{10}{\text{slack} + 1}$
• Slack is the number of cycles off the longest dependence path
• Worked values from the figure: slack 0 gives 10/(0+1) = 10; slack 2 gives 10/(2+1) = 3.33
[Figure: example DFG of ^, &, >>, + and << operations]
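The two worked values on this slide, 10/(0+1) and 10/(2+1), are consistent with a weight of 10/(slack + 1). A short sketch that reproduces them, with the function name chosen here purely for illustration:

```python
def criticality_weight(slack_cycles: int) -> float:
    """Weight for growing toward an operation with the given slack (in cycles)."""
    if slack_cycles < 0:
        raise ValueError("slack cannot be negative")
    return 10 / (slack_cycles + 1)


for slack in (0, 2):
    print(slack, round(criticality_weight(slack), 2))
```

Operations on the critical path (slack 0) get the full weight of 10, and the weight falls off quickly as slack grows.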
Latency
• Growing toward low-latency operations allows more nodes to be combined in a cycle
• Maximize DFG compression
• Wt = [formula not preserved in the extracted text]
• Worked values from the figure: 10*0.3/0.36 = 8.33 and 10*0.3/0.6 = 5
[Figure: example DFG of ^, &, >>, + and << operations]
Area
• Want the most benefit for the least area
• Wt = [formula not preserved in the extracted text]
• Area is the sum of macrocell areas
• Worked values from the figure: 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33
[Figure: example DFG of ^, &, >>, + and << operations]
Input/Output
• Want CFUs to use as few RF ports as possible
• Smaller encoding
• Allow growth of larger candidates
• Wt = [formula not preserved in the extracted text]
• Worked values from the figure: 10*2/(4+1) = 4 and 10*2/(2+1) = 6.67
[Figure: example DFG of ^, &, >>, + and << operations]
Example
[Figure: step-by-step growth of a candidate subgraph across the example DFG of ^, &, >>, + and << operations. Values legible in the first three steps: 28.5, 35, 30.8, 37.5, 37.5, 28.5; then 28.5, 35, 30.8, 40, 28.5, 33.5; then 28.5, 35, 30.8, 28.5, 36, 36.]
Finished: met external constraints
Set of Candidates
[Figure: the candidate subgraphs produced by identification, built from ^, <<, & and + operations]
Avoids Exponential Explosion
[Figure: plot with Speedup on the vertical axis, ticks from 1.00 to 1.50]
Greedy Selection Heuristic
• Use estimates of performance improvement / cost
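One common way to realize a greedy heuristic of this kind is to rank candidates by estimated benefit per unit cost and take them in order while a budget remains. The sketch below reads the slash on the slide as a ratio and uses an area budget as the cost limit; both are assumptions, as are the `Candidate` fields and the example numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    cycles_saved: float   # estimated performance improvement (hypothetical unit)
    area: float           # estimated cost; assumed to be > 0


def greedy_select(candidates: List[Candidate], area_budget: float) -> List[Candidate]:
    """Pick candidates in order of benefit/cost until the budget runs out."""
    chosen: List[Candidate] = []
    remaining = area_budget
    ranked = sorted(candidates, key=lambda c: c.cycles_saved / c.area, reverse=True)
    for cand in ranked:
        if cand.area <= remaining:
            chosen.append(cand)
            remaining -= cand.area
    return chosen


pool = [
    Candidate("A", cycles_saved=100, area=2.0),   # ratio 50
    Candidate("B", cycles_saved=90, area=1.0),    # ratio 90
    Candidate("C", cycles_saved=30, area=1.5),    # ratio 20
]
picked = greedy_select(pool, area_budget=3.0)
print([c.name for c in picked])
```

A fuller version would re-estimate the remaining candidates after each pick, since candidates can overlap in the operations they cover; that step is left out here to keep the sketch short.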
Compiler Replacement
• Multiple applications can utilize CFUs
• Vflib pattern matcher [Cor ’99]
[Figure: flow from Instruction Synthesis through a CFU Description to the Compiler, alongside a numbered DFG shown before and after a subgraph is replaced by a CFU node]
Experimental Setup
• Implemented in the Trimaran toolset
• Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem per cycle
• CFUs use the Int issue slot
• CFU latency and area are computed as the sum over the individual macrocells
• Pipeline latches were added if CFU latency exceeded 1 clock cycle
• 300 MHz clock assumed
• No branch or memory instructions in CFUs
• Four application domains tested: Audio, Encryption, Image, Network
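The cost model described on this slide can be illustrated with a few lines of arithmetic. The macrocell names and numbers below are invented for the example, latencies are assumed to be expressed as fractions of one clock cycle, and the cells are assumed to form a single chain so that their latencies add.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical macrocell library: name -> (latency in clock cycles, relative area).
MACROCELLS: Dict[str, Tuple[float, float]] = {
    "and": (0.06, 0.5),
    "add": (0.30, 1.0),
}


def cfu_cost(cells: List[str]) -> Tuple[float, float, int]:
    """Return (latency, area, cycles) for a chain of macrocells."""
    latency = sum(MACROCELLS[c][0] for c in cells)
    area = sum(MACROCELLS[c][1] for c in cells)
    cycles = max(1, math.ceil(latency))   # more than one cycle implies pipeline latches
    return latency, area, cycles


print(cfu_cost(["add", "and"]))                 # fits in a single cycle
print(cfu_cost(["add", "add", "add", "add"]))   # needs a second cycle
```

Under these assumptions, a CFU whose summed latency exceeds one cycle would be split into `cycles` pipeline stages.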
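Generalizing CFUs
• Subsumed (Multiple Paths)
• Wildcards (Multiple Nodes)
[Figure: an example CFU over inputs IN_1 and IN_2 with a >> node (constant 0x8), a | node (constant 0xF) and a + node. The subsumed version lists 0x0 as an alternative to each constant; the wildcard version lists |,& and +,- as alternative operations.]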
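The two kinds of generalization can be illustrated with a small software model of the example CFU. This is an interpretation of the figure, not a description of the actual hardware: the shape `((IN_1 >> c1) op c2) op IN_2`, the function name and the parameter names are all assumptions.

```python
import operator

OPS = {"|": operator.or_, "&": operator.and_, "+": operator.add, "-": operator.sub}


def run_cfu(in1: int, in2: int, shift_const: int = 0x8, logic_op: str = "|",
            logic_const: int = 0xF, arith_op: str = "+") -> int:
    """Model of a three-node CFU: shift, then logic, then arithmetic."""
    shifted = in1 >> shift_const
    masked = OPS[logic_op](shifted, logic_const)
    return OPS[arith_op](masked, in2)


# The original pattern: ((IN_1 >> 0x8) | 0xF) + IN_2
print(run_cfu(0x1234, 1))

# Subsumed subgraph: identity constants (0x0) bypass the shift and the OR,
# so the same unit computes a plain IN_1 + IN_2.
print(run_cfu(5, 7, shift_const=0x0, logic_const=0x0))

# Wildcards: the same structure with & in place of | and - in place of +.
print(run_cfu(0x1234, 1, logic_op="&", arith_op="-"))
```

Both tricks let one hardware unit cover more subgraphs than the single pattern it was built from, which is why they help when a CFU designed for one application is reused in another.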
Effects of Generalization
[Figure: bar chart of Speedup (axis from 1.0 to 2.0) with series CFUs and Subsumed Subgraphs, for sha, rijn-sha, sha-rijn, rijndael, blowfish, bfish-rijn, rijn-bfish, bfish-sha and sha-bfish]
Conclusions
• Developed a two-phase instruction set synthesis system
• Guide function removes bad candidates
• Greedy selection heuristic
• Substantial speedups can be attained with very little die impact
• Subsumed subgraphs and wildcarding increase cross-application effectiveness
Questions?
http://cccp.eecs.umich.edu
Selection
• Uses estimates of performance improvement
• Greedy heuristic used
[Figure: example DFG of ^, &, >>, + and << operations]