
Parallel Instruction Set Extension Identification


Presentation Transcript


  1. Parallel Instruction Set Extension Identification { dshap092, mmont044, mbolic } @ site.uottawa.ca http://carg.site.uottawa.ca/ Daniel Shapiro, Michael Montcalm and Miodrag Bolic

  2. Overview • Introduction • Prior Art • Speedup Results • Speedup Analysis • Task Scheduling • Parallel ISE Enumeration Experiment • Compiler Execution Time Results • Performance Analysis • Conclusion & Future Work

  3. Introduction • Hardware / software partition • COINS compiler (Java) • Control Flow Graph (CFG) • Data Flow Graph (DFG) of each basic block • Three-address code using SSA • Module → function → basic block → statement → node • Now we have directed acyclic graphs for basic blocks • Enumerate convex subgraphs of each SSA DFG and then select the “best” subset • Goal: Perform this compilation procedure faster
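Convexity is the structural requirement on candidate subgraphs: every path between two nodes of a candidate must stay inside it, otherwise it cannot be collapsed into a single custom instruction. A minimal sketch of that check, assuming hypothetical `DfgNode` and graph containers rather than the actual COINS IR classes:

```java
import java.util.*;

// Minimal sketch, assuming a hypothetical DfgNode type (not the COINS IR classes).
// A candidate subgraph S is convex iff no data-flow path leaves S and re-enters it.
class DfgNode {
    final int id;
    final List<DfgNode> succs = new ArrayList<>();
    DfgNode(int id) { this.id = id; }
}

final class Convexity {
    static boolean isConvex(Set<DfgNode> s, Collection<DfgNode> allNodes) {
        // Nodes reachable from S (its descendants, including S itself).
        Set<DfgNode> below = new HashSet<>();
        for (DfgNode n : s) reach(n, below);

        for (DfgNode outside : allNodes) {
            if (s.contains(outside) || !below.contains(outside)) continue;
            // "outside" sits below S; if any node of S is also below "outside",
            // a path leaves S and re-enters it, so S is not convex.
            Set<DfgNode> belowOutside = new HashSet<>();
            reach(outside, belowOutside);
            belowOutside.remove(outside);
            for (DfgNode d : belowOutside) {
                if (s.contains(d)) return false;
            }
        }
        return true;
    }

    private static void reach(DfgNode n, Set<DfgNode> seen) {
        if (!seen.add(n)) return;
        for (DfgNode m : n.succs) reach(m, seen);
    }
}
```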

  4. Introduction • Instruction enumeration • Multicore desktops are common • Adapt the algorithm to run on a multicore • Thread pool of workers • In our case, the Intel Core i7 with 12 GB DDR3 RAM
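A minimal sketch of this inter-basic-block parallelism, assuming a hypothetical per-block routine `enumerateIses(BasicBlock)` and placeholder `BasicBlock`/`IsePattern` types; one enumeration task is submitted per basic block to a fixed pool sized to the available cores, and the results are collected afterwards for global selection:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch only: BasicBlock and IsePattern are placeholders for the compiler's own types.
final class ParallelEnumeration {
    static List<IsePattern> enumerateAll(List<BasicBlock> blocks) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();   // e.g. the Core i7's logical cores
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<List<IsePattern>>> futures = new ArrayList<>();
        for (BasicBlock bb : blocks) {
            futures.add(pool.submit(() -> enumerateIses(bb)));     // one task per basic block
        }
        List<IsePattern> all = new ArrayList<>();
        for (Future<List<IsePattern>> f : futures) {
            all.addAll(f.get());                                   // collect results for global selection
        }
        pool.shutdown();
        return all;
    }

    static List<IsePattern> enumerateIses(BasicBlock bb) {
        // Placeholder for the per-block convex-subgraph enumeration.
        return Collections.emptyList();
    }
}
class BasicBlock { }
class IsePattern { }
```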

  5. Introduction • Always use software optimizations before adding hardware • The enumeration algorithm changes the execution time of the compiler • The I/O constraint (e.g. (8, 8)) and the hardware size constraint (e.g. 10,000 LEs) also change the model execution time • The workload for the benchmark affects the observed speedup, as we will see
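As an illustration of how these constraints prune the search space (and hence the compile time), the sketch below applies the example values from this slide, an (8, 8) I/O constraint and a 10,000 LE area budget, to a hypothetical candidate summary; the class and field names are illustrative only:

```java
// Hypothetical candidate summary; field names are illustrative only.
final class Candidate {
    int inputs;        // live-in operands of the subgraph
    int outputs;       // live-out values of the subgraph
    int logicElements; // estimated FPGA area
}

final class Constraints {
    static final int MAX_IN = 8, MAX_OUT = 8;   // I/O constraint (8, 8)
    static final int MAX_LES = 10_000;          // hardware size constraint

    // A candidate is kept only if it respects both constraints; tighter limits
    // shrink the search space and therefore the compiler's execution time.
    static boolean feasible(Candidate c) {
        return c.inputs <= MAX_IN
            && c.outputs <= MAX_OUT
            && c.logicElements <= MAX_LES;
    }
}
```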

  6. Prior Art • [2] used ISEs plus threads to find maximal convex subgraphs of basic-block data flow graphs • However, [2] applies the inter-basic-block parallelism only to the selection of ISEs, not to their enumeration • We adapt this idea of having a global solver to the approach of [1], which can find ISEs smaller than the maximal subgraphs identified by [5] • Many groups have used Integer Linear Programming (ILP) for ISE identification

  7. Prior Art • [6] used threads to perform parallel ISE enumeration within a basic block • We go further and apply this parallelism at the scope of the control flow graph • Our approach can be executed on as many processors as there are basic blocks • Our work can be combined with the existing approach of using multiple threads to perform ISE enumeration on a single basic block

  8. Speedup Results • [Chart: speedup by workload and hardware constraint]

  9. Speedup Analysis • The algorithm is search-space limited: • Intra-basic-block only • I/O constrained • No pointers in HW (illegal nodes) • A 2x speedup is still worthwhile • If we simply turn everything into hardware, then we cannot update the program without a firmware update
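A sketch of the "illegal node" filtering mentioned above, using a hypothetical operation enum rather than the compiler's real IR; operations that dereference pointers stay in software, so their nodes never enter a candidate subgraph:

```java
import java.util.*;

// Sketch only: Op is a hypothetical operation kind, not the compiler's real IR.
enum Op { ADD, SUB, MUL, SHIFT, AND, OR, LOAD, STORE, CALL }

final class IllegalNodes {
    // Operations that dereference pointers (loads/stores) or leave the block (calls)
    // must stay in software, so their DFG nodes are excluded before enumeration.
    static boolean isLegal(Op op) {
        return op != Op.LOAD && op != Op.STORE && op != Op.CALL;
    }

    static List<Op> legalOnly(List<Op> ops) {
        List<Op> kept = new ArrayList<>();
        for (Op op : ops) if (isLegal(op)) kept.add(op);
        return kept;
    }
}
```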

  10. Task Scheduling • Use well-known thread creation techniques to accelerate the ISE enumeration part of ISE identification • Scheduling is performed with inexact information in order to save compilation time (# statements instead of # nodes) • Scheduling the parallel tasks quickly and intelligently is critical (see right)
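The scheduling idea can be sketched as a longest-processing-time-first assignment that uses each basic block's statement count as a cheap proxy for its DFG size; the class and field names below are placeholders, not the implementation from the paper:

```java
import java.util.*;

// Sketch of a longest-processing-time-first schedule; BasicBlock here is a
// placeholder whose statementCount stands in for the (not yet built) DFG node count.
class BasicBlock {
    int statementCount;
    BasicBlock(int statementCount) { this.statementCount = statementCount; }
}

final class LptScheduler {
    static List<List<BasicBlock>> schedule(List<BasicBlock> blocks, int workers) {
        List<BasicBlock> sorted = new ArrayList<>(blocks);
        sorted.sort((a, b) -> Integer.compare(b.statementCount, a.statementCount));

        List<List<BasicBlock>> bins = new ArrayList<>();
        long[] load = new long[workers];
        for (int i = 0; i < workers; i++) bins.add(new ArrayList<>());

        // Give each block, largest first, to the currently least-loaded worker.
        for (BasicBlock bb : sorted) {
            int least = 0;
            for (int i = 1; i < workers; i++) if (load[i] < load[least]) least = i;
            bins.get(least).add(bb);
            load[least] += bb.statementCount;
        }
        return bins;
    }
}
```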

  11. Parallel ISE Enumeration Experiment • Used a greedy ISE enumeration algorithm • I/O constraint of (8, 8) • Hardware size constraints of 10K LEs and 10M LEs (in practice we gathered much more data) • Compared the sequential and parallel approaches to ISE enumeration • Speedup was observed, but the algorithm execution time data were not as expected
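A rough, illustrative stand-in for a greedy, constraint-bounded enumeration pass over one basic block's DFG (the experiment's actual algorithm may grow and keep candidates differently); `Node`, `neighbours()`, `feasible()` and `isConvex()` are placeholders for the structures and checks sketched earlier:

```java
import java.util.*;

// Illustrative only: grow one candidate per seed node, absorbing neighbours
// while the I/O, area, legality and convexity checks still pass.
final class GreedyEnumerator {
    static List<Set<Node>> enumerate(Collection<Node> dfg) {
        List<Set<Node>> candidates = new ArrayList<>();
        for (Node seed : dfg) {
            if (!seed.legal) continue;
            Set<Node> s = new HashSet<>();
            s.add(seed);
            boolean grew = true;
            while (grew) {                       // greedily absorb neighbours while legal
                grew = false;
                for (Node n : neighbours(s, dfg)) {
                    if (!n.legal || s.contains(n)) continue;
                    s.add(n);
                    if (feasible(s) && isConvex(s, dfg)) { grew = true; break; }
                    s.remove(n);                 // revert: a constraint or convexity failed
                }
            }
            candidates.add(s);                   // one grown candidate per seed node
        }
        return candidates;
    }

    // Placeholder helpers; real versions come from the DFG and the (8,8)/area checks.
    static Set<Node> neighbours(Set<Node> s, Collection<Node> dfg) { return new HashSet<>(); }
    static boolean feasible(Set<Node> s) { return true; }
    static boolean isConvex(Set<Node> s, Collection<Node> dfg) { return true; }
}

class Node { boolean legal = true; }
```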

  12. Compiler Execution Time Results • [Chart: compiler execution time, showing a speedup reversal ranging from -6% to +53%]

  13. Performance Analysis • Compiler execution time • So far, the results are only positive sometimes • We expected much better numbers for such a powerful computer • Additional overhead time was needed for creating, distributing, and then collecting the thread data (but this is not the problem) • There is probably still a memory dependency
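One way to attribute that overhead before reaching for a profiler is coarse per-phase timing; the sketch below is a generic helper with hypothetical phase names and calls, not the paper's instrumentation:

```java
import java.util.function.Supplier;

// Coarse per-phase timing to see where the wall-clock time goes before moving
// to a profiler such as VTune; phase names and the timed calls are hypothetical.
final class PhaseTimer {
    static <T> T timed(String phase, Supplier<T> work) {
        long t0 = System.nanoTime();
        T result = work.get();
        System.out.printf("%-20s %8.1f ms%n", phase, (System.nanoTime() - t0) / 1e6);
        return result;
    }
}
// Usage sketch (hypothetical phases):
//   candidates = PhaseTimer.timed("enumerate", () -> enumerateAllBlocks(blocks));
//   selected   = PhaseTimer.timed("select",    () -> selectBest(candidates));
```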

  14. Conclusion & Future Work • Conclusions: • Using multiple threads for ISE enumeration is beneficial on average • Peak 53.7% faster • To our knowledge this is the first use of this technique in the literature • Approach is applicable to many ISE enumeration algorithms • Future work: • Analyze the source of the overhead using VTune • Reduce the source of the overhead, once identified • Distribute the enumeration of ISEs across multiple computers, perhaps using Microsoft Solver Foundation

  15. References
  [1] K. Atasu, G. Dundar, and C. Ozturan, “An integer linear programming approach for identifying instruction-set extensions,” in Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2005, pp. 172–177.
  [2] C. Galuzzi, E. M. Panainte, Y. Yankova, K. Bertels, and S. Vassiliadis, “Automatic selection of application-specific instruction-set extensions,” in Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, 2006, pp. 160–165.
  [3] K. Atasu, L. Pozzi, and P. Ienne, “Automatic application-specific instruction-set extensions under microarchitectural constraints,” in Design Automation Conference, 2003, pp. 256–261.

  16. Questions?
