Systematic Register Bypass Customization for Application-Specific Processors
Kevin Fan, Nathan Clark, Michael Chu, K. V. Manjunath, Rajiv Ravindran, Mikhail Smelyanskiy, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Introduction
• Bypass network allows for data forwarding to reduce pipeline stalls
• Full bypass: any FU can bypass from any other FU and from any pipeline stage
• # paths = (issue width)^2 × (bypassable stages) × (input ports per FU) × (output ports per FU)
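The path count for a full bypass network can be worked through numerically. The sketch below assumes the four factors on the slide are multiplied together; the function name and the example machine parameters (2 bypassable stages, 2 input ports, 1 output port) are illustrative and not taken from the talk.

```python
def full_bypass_paths(issue_width, bypassable_stages,
                      input_ports_per_fu, output_ports_per_fu):
    """Number of bypass paths in a fully bypassed machine (product form assumed)."""
    return (issue_width ** 2) * bypassable_stages \
        * input_ports_per_fu * output_ports_per_fu


# Hypothetical machines: only the issue width changes between rows.
for width in (2, 4, 8):
    print(width, full_bypass_paths(width, 2, 2, 1))
```

Doubling the issue width quadruples the count, which is the quadratic growth the next slide refers to.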
Bypass Path Utilization
• As processors get wider and deeper, the cost of the bypass network increases quadratically [Palacharla ’98]
• Only a few bypasses are heavily utilized
Designing a Partial Bypass Network
• Reduce hardware at the cost of runtime
• Design a sparse bypass network while minimizing performance impact
• Challenges:
  • Reconcile different requirements for different program regions
  • Interplay between different bypass paths
  • Huge search space, exponential number of possible configurations
Spacewalking Partial Bypass
• Profile-guided Pareto ascent
• Rank bypass paths by importance
• Remove least important path and evaluate performance impact
• Update rankings with new statistics
• Repeat until performance degrades too far
[Figure: flow diagram. The program yields usage statistics; bypasses are ranked by importance from most useful to least useful; the least useful bypass is removed and the new machine is evaluated; the bypass is replaced if performance drops too much. Output: cost/performance Pareto machines, plotted as performance versus cost.]
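The removal loop on this slide can be sketched as a greedy search. This is a minimal illustration and not the authors' implementation: `spacewalk`, `rank`, `evaluate`, `min_relative_perf`, and the toy usage numbers are all hypothetical, and the choice to put a path back and try the next candidate is one reading of "Replace Bypass If Performance Drops Too Much".

```python
def spacewalk(paths, rank, evaluate, min_relative_perf):
    """Greedily drop bypass paths; return the list of (machine, relative perf)."""
    current = set(paths)
    baseline = evaluate(current)
    explored = [(frozenset(current), 1.0)]
    kept = set()  # paths that were put back because removing them hurt too much
    while True:
        # rank() returns the remaining paths, most useful first (re-ranked each pass)
        candidates = [p for p in rank(current) if p not in kept]
        if not candidates:
            break
        victim = candidates[-1]              # least useful removable path
        trial = current - {victim}
        relative = evaluate(trial) / baseline
        if relative < min_relative_perf:
            kept.add(victim)                 # performance dropped too much: replace it
            continue
        current = trial
        explored.append((frozenset(current), relative))
    return explored


# Toy model (assumed): each missing path costs as many extra cycles as it was used.
usage = {"A->A": 50, "A->B": 5, "B->A": 1, "B->B": 20}

def evaluate(machine):
    stall_cycles = sum(n for p, n in usage.items() if p not in machine)
    return 1.0 / (100 + stall_cycles)        # higher is better

def rank(machine):
    return sorted(machine, key=usage.get, reverse=True)

explored = spacewalk(usage, rank, evaluate, min_relative_perf=0.9)
for machine, relative in explored:
    print(sorted(machine), round(relative, 3))
```

In a real flow `evaluate` would recompile and re-profile the application for the trial machine, which is where the updated usage statistics would come from.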
Ranking Bypass Paths
• Importance is computed from % utilization and offload potential
• % utilization = (cycles bypass was used) / (total cycles)
• Offload potential = (redundant cycles) / (cycles bypass was used)
[Figure: a bypass path and its equivalent bypass paths (+1, +2)]
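The two ranking statistics can be computed directly from profile counts. The sketch below reads the slide's fractions as "cycles bypass was used / total cycles" and "redundant cycles / cycles bypass was used"; the function names and sample counts are made up. How the two values combine into a single importance score is not legible in the slide text, so the sketch stops at the two ratios.

```python
def utilization(cycles_used, total_cycles):
    """Fraction of all cycles in which the bypass path carried a value."""
    return cycles_used / total_cycles


def offload_potential(redundant_cycles, cycles_used):
    """Fraction of the path's uses that were redundant (per the slide's ratio)."""
    return redundant_cycles / cycles_used if cycles_used else 0.0


# Hypothetical profile counts for one bypass path.
print(utilization(200, 1000))      # 0.2
print(offload_potential(50, 200))  # 0.25
```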
A Closer Look
• Uses more bypasses than necessary
• Not all edges require 1-stage bypass
[Figure: DFG with ops Ma, Ib, Ic, Id, Ie, If mapped onto units I1, I2, M3; critical edges highlighted]
Compiling for Partial Bypass
• Difficulties:
  • Latencies between operations vary depending on resource assignments
  • The current assignment will affect future decisions
• A naïve scheduler will place op c arbitrarily
• Need to provide resource hints to the scheduler to break ties
[Figure: DFG (ops Ma, Ib, Ic, Id, Ie, If) whose edges have possible latencies of 1 or 2, mapped onto units I1, I2, M3; the optimal schedule is compared with the scheduler's output]
BUG Preference Algorithm
• Perform pre-scheduling pass over the DFG
• Bottom-Up Greedy algorithm based on [Ellis ’85]
• Traverse DFG, critical paths first
• Select bypass paths to achieve earliest completion time for each operation
• Take into account time to:
  • Get inputs
  • Execute
  • Send outputs to consumers
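The pre-scheduling pass can be illustrated with a much-simplified greedy assignment. This is a sketch under stated assumptions, not the BUG algorithm as published: it visits producers critical-first and picks, for each op, the unit with the earliest start time given bypass-dependent edge latencies, but it ignores the cost of sending outputs to consumers. The DFG, the set of fast bypass paths, and all names are hypothetical.

```python
from functools import lru_cache


def bug_assign(preds, units_for, edge_latency, sinks):
    """Return (unit_of, start): a unit preference and start cycle for each op."""
    @lru_cache(maxsize=None)
    def depth(op):  # length of the longest producer chain ending at op
        return 1 + max((depth(p) for p in preds[op]), default=0)

    unit_of, start, busy = {}, {}, set()  # busy holds (unit, cycle) slots in use

    def visit(op):
        if op in unit_of:
            return
        for p in sorted(preds[op], key=depth, reverse=True):  # critical first
            visit(p)
        best = None
        for u in units_for[op]:
            # Earliest cycle at which every input can reach unit u
            t = max((start[p] + edge_latency(unit_of[p], u) for p in preds[op]),
                    default=0)
            while (u, t) in busy:   # unit already occupied in that cycle
                t += 1
            if best is None or t < best[0]:  # ties go to the first-listed unit
                best = (t, u)
        start[op], unit_of[op] = best
        busy.add((best[1], best[0]))

    for s in sinks:
        visit(s)
    return unit_of, start


# Hypothetical DFG: a feeds b and c, b feeds d, c feeds e, d and e feed f.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"], "f": ["d", "e"]}
units_for = {op: ["I1", "I2"] for op in "bcdef"}
units_for["a"] = ["M3"]
fast = {("M3", "I1"), ("I1", "I1"), ("I2", "I2")}  # assumed 1-cycle bypass paths


def edge_latency(src, dst):
    return 1 if (src, dst) in fast else 2  # 2 cycles when no direct bypass exists


unit_of, start = bug_assign(preds, units_for, edge_latency, sinks=["f"])
print(unit_of)
print(start)
```

The resulting unit preferences would then be passed to the scheduler as tie-breaking hints rather than as hard constraints.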
BUG Example
• Place ops b, d, f on unit 1 since M bypasses to it
• Place ops c, e on unit 2 since resource is free
[Figure: step-by-step BUG walkthrough on the DFG (ops Ma, Ib, Ic, Id, Ie, If) for units I1, I2, M3; each op is annotated with its candidate units, e.g. {1,2}, {1}, {2}, {3}]
Bypass Cost Savings
[Figure; label: Relative Performance]
Pareto-optimal Machines
[Figure: two panels, djpeg (5-wide) and g721dec (9-wide); series: BUG Preferences, ILP Preferences]
Bypass Usage is Variable
[Figure: bypass utilization per benchmark: epic, bfish, rawc, rawd, rasta, cjpeg, djpeg, mesa, unepic, pegenc, pegdec, gsmenc, gsmdec, g721enc, g721dec, mpeg2enc]
Conclusion
• Significant bypass network cost can be saved without much performance loss
• Our approach:
  • Intelligent bypass spacewalking
  • Resource hints allow compiler to schedule code effectively
• 95% of original performance maintained when removing 60% of utilized bypasses
• http://cccp.eecs.umich.edu