
Non-traditional Parallelism






Presentation Transcript


  1. Non-traditional Parallelism • Parallelism – Use multiple contexts to achieve better performance than is possible on a single context. • Traditional parallelism – Extra threads/processors are used to offload computation; the threads divide up the execution stream. • Non-traditional parallelism – Extra threads are used to speed up computation without necessarily offloading any of the original computation. • Primary advantage – nearly any code, no matter how inherently serial, can benefit from parallelization. • Another advantage – threads can be added or removed without significant disruption.

  2. Traditional Parallelism [Figure: four threads (Thread 1–Thread 4), each executing a different portion of the program's work]

  3. Non-Traditional Parallelism [Figure: four threads, with Thread 1 running the original program and the other threads assisting it] • Speculative precomputation, dynamic speculative precomputation, and many others. • Most commonly prefetching, possibly branch pre-calculation. A software caricature of the idea is sketched below.
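The slides describe this idea at the hardware level; as a purely software caricature (not from the presentation), the C++ sketch below uses a second thread that chases the same linked list ahead of the main thread and issues prefetch hints, speeding the main thread up without taking over any of its work. The Node type, the list-summing loop, and the use of __builtin_prefetch are illustrative assumptions.

  #include <atomic>
  #include <thread>

  struct Node { long payload; Node* next; };

  std::atomic<bool> done{false};

  // Helper thread: walks the same list but does none of the "real" work,
  // so it tends to run ahead of the main thread and warm the cache for it.
  void prefetch_helper(Node* head) {
      for (Node* p = head; p != nullptr && !done.load(std::memory_order_relaxed); p = p->next)
          __builtin_prefetch(p->next);             // GCC/Clang cache-prefetch hint
  }

  long sum_list(Node* head) {
      std::thread helper(prefetch_helper, head);   // extra context, no offloaded work
      long sum = 0;
      for (Node* p = head; p != nullptr; p = p->next)
          sum += p->payload;                       // the original, serial computation
      done.store(true, std::memory_order_relaxed);
      helper.join();
      return sum;
  }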

  4. Background – Helper Threads • Chappell, Stark, Kim, Reinhardt, Patt, “Simultaneous Subordinate Microthreading (SSMT),” 1999 – use microcoded threads to manipulate the microarchitecture to improve the performance of the main thread. • Zilles 2001, Collins 2001, Luk 2001 – use a regular SMT thread, with code distilled from the main thread, to support the main thread.

  5. Outline • Speculative Precomputation [Collins, et al 2001 – Intel/UCSD] • Dynamic Speculative Precomputation • Event-Driven Simultaneous Optimization • Value Specialization • Inline Prefetching • Thread Prefetching

  6. Speculative Precomputation – Motivation

  7. Speculative Precomputation (SP) • In SP, a p-slice is a thread derived from a trace of execution between a trigger instruction and the delinquent load. • All instructions upon which the load’s address is not dependent are removed (often 90–95% of the trace). • Live-in register values (typically 2–6) must be explicitly copied from the main thread to the helper thread. A source-level caricature of a p-slice is sketched below.
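The slide describes p-slices at the instruction level; as a source-level caricature (our example, not the paper's), the fragment below shows what distillation leaves behind for a hypothetical hash-table lookup loop: only the computation feeding the delinquent load's address survives, and keys, i, and table are the live-ins that would be copied into the helper thread.

  struct Key   { unsigned raw; };
  struct Entry { long value; };
  static unsigned hash(Key k) { return (k.raw * 2654435761u) & 0xFFFFFu; }

  // Main thread (original code): the load of table[h].value misses frequently.
  long lookup_sum(const Key* keys, int n, const Entry* table) {
      long sum = 0;
      for (int i = 0; i < n; i++) {
          unsigned h = hash(keys[i]);
          sum += table[h].value;            // delinquent load
      }
      return sum;
  }

  // Distilled p-slice for one trigger instance (live-ins: keys, i, table).
  // The accumulation into sum, the loop bookkeeping, and everything else
  // not feeding the load address has been removed.
  void p_slice(const Key* keys, int i, const Entry* table) {
      unsigned h = hash(keys[i]);           // address computation is kept
      __builtin_prefetch(&table[h]);        // warm the line before the main thread needs it
  }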

  8. Speculative Precomputation [Figure: timeline – when the trigger instruction retires in the main thread, a speculative thread is spawned; it executes the p-slice and issues the prefetch, hiding the memory latency before the main thread reaches the delinquent load]

  9. Advantages over Traditional Prefetching • Because SP uses actual program code, it can precompute addresses that fit no predictable pattern. • Because SP runs in a separate thread, it interferes with the main thread much less than software prefetching, and when it isn’t working it can be killed. • Because it is decoupled from the main thread, the prefetcher is not constrained by the control flow of the main thread. • All the applications in this study already had very aggressive software prefetching applied, where possible.

  10. SP Optimizations • On-chip memory for transfer of live-in values. • Chaining triggers – for delinquent loads in loops, a speculative thread can trigger the next p-slice itself (think of this as a looping prefetcher which targets a load within a loop). • Minimizes live-in copy overhead. • Enables SP threads to get arbitrarily far ahead. • Necessitates a mechanism to stop the chaining prefetcher. A sketch of the chaining idea follows below.
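Continuing the same caricature (and reusing the Key/Entry/hash definitions from the sketch above), a chaining slice can be pictured as a helper that advances the loop's induction variable itself instead of waiting to be re-spawned; the MAX_RUNAHEAD throttle stands in for whatever hardware mechanism stops a runaway chaining prefetcher.

  // Chaining p-slice: spawned once by a basic trigger, then keeps itself going,
  // prefetching successive iterations of the targeted loop on its own.
  void chaining_slice(const Key* keys, int i, int n, const Entry* table) {
      const int MAX_RUNAHEAD = 64;                 // illustrative throttle, not a paper value
      for (int ahead = 0; ahead < MAX_RUNAHEAD && i < n; ahead++, i++) {
          unsigned h = hash(keys[i]);
          __builtin_prefetch(&table[h]);           // next iteration's delinquent load
      }
  }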

  11. Advantages from Chaining Triggers • Chaining triggers execute without impacting the main thread • Target delinquent loads arbitrarily far ahead of the non-speculative thread • Speculative threads make progress independent of the main thread • Use basic triggers to initiate precomputation, but chaining triggers to sustain it

  12. SP Performance [Results chart not included in the transcript]

  13. Conclusion • Speculative precomputation uses otherwise idle hardware thread contexts • Pre-computes future memory accesses • Targets the worst-behaving static loads in a program • Chaining triggers enable speculative threads to spawn additional speculative threads • Results in tremendous performance gains, even with conservative hardware assumptions

  14. Outline • Speculative Precomputation • Dynamic Speculative Precomputation [Collins, et al – UCSD/Intel] • Event-Driven Simultaneous Optimization • Value Specialization • Inline Prefetching • Thread Prefetching

  15. Dynamic Speculative Precomputation (DSP) – Motivation • SP, as well as similar techniques proposed at about the same time, requires • Profile support • Heavy user or compiler interaction • It is thus susceptible to profile mismatch, requires recompilation for each machine architecture, and depends on user interaction… • (or, a bit more accurately, we just wanted to see if we could do it all in hardware)

  16. Dynamic Speculative Precomputation • relies on the hardware to • identify delinquent loads • create speculative threads • optimize the threads when they aren’t working quite well enough • eliminate the threads when they aren’t working at all • destroy threads when they are no longer useful…

  17. Dynamic Speculative Precomputation • Like hardware prefetching, works without software support or recompilation, regardless of the machine architecture. • Like SP, works with minimal interference on the main thread. • Like SP, works on highly irregular memory access patterns.

  18. Necessary Analysis Capabilities • Identify delinquent loads • Delinquent Load Identification Table • Construct p-slices and apply optimizations • Retired Instruction Buffer • Spawn and manage p-slices • Slice Information Table • Implemented as back-end instruction analyzers (a rough data-layout sketch follows below)
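A rough data-layout sketch of the three structures in C++; the field names and widths below are guesses chosen to make the roles concrete, not the formats used in the paper.

  #include <cstdint>

  // Delinquent Load Identification Table: per-load-PC cache-miss statistics;
  // a load whose miss rate stays high over a sampling window is declared delinquent.
  struct DLITEntry {
      uint64_t load_pc;
      uint32_t executions;
      uint32_t cache_misses;
  };

  // Retired Instruction Buffer: FIFO of recently committed instructions,
  // filled only while a p-slice is being constructed (normally idle).
  struct RIBEntry {
      uint64_t pc;
      int8_t   dest_reg;         // -1 if the instruction writes no register
      int8_t   src_regs[2];
      bool     is_load;
  };

  // Slice Information Table: one entry per constructed p-slice.
  struct SITEntry {
      uint64_t trigger_pc;        // retirement of this instruction spawns the slice
      uint64_t slice_location;    // where the slice's instructions are stored
      int8_t   live_in_regs[6];   // registers copied from the main thread at spawn
      uint32_t useful_prefetches; // feedback used to optimize or remove the slice
      uint32_t useless_prefetches;
  };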

  19. Example SMT Processor Pipeline [Figure: baseline SMT pipeline – per-thread PCs feeding a shared ICache, register renaming, a centralized instruction queue, a monolithic register file, execution units, the data cache, and per-thread re-order buffers]

  20. Modified Pipeline [Figure: the same SMT pipeline with three back-end structures added off the critical path – the Delinquent Load Identification Table (DLIT), the Retired Instruction Buffer (RIB), and the Slice Information Table (SIT)]

  21. Creating p-slices • Once a delinquent load is identified, the RIB buffers instructions until the delinquent load appears as the newest instruction in the buffer. • Dependence analysis easily identifies the load’s antecedents, a trigger instruction, and the live-ins needed by the slice. • Similar to register live-range analysis • But much easier

  22. Retired Instruction Buffer • Construct p-slices to prefetch delinquent loads • Buffers information on an in-order run of committed instructions • Comparable to trace cache fill unit • FIFO structure • RIB normally idle

  23. P-slice Construction • Analyze the instructions between two instances of the delinquent load • Most recent to oldest • Maintain a partial p-slice and a register live-in set • Add to the p-slice any instruction that produces a register in the live-in set • Update the register live-in set • When analysis terminates, the p-slice has been constructed and the live-in registers identified (a small software model of this pass is sketched below)
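A minimal software model of this backward pass (our sketch, using a simplified instruction record): starting from the most recent instance of the delinquent load, an instruction is pulled into the slice only if it produces a register currently in the live-in set, in which case its own sources join the set. Run on the loop from the next slide, it reproduces the walk shown in the example slides that follow.

  #include <set>
  #include <string>
  #include <vector>

  struct Inst {
      std::string text;          // for display only
      int dest;                  // destination register number, -1 if none
      std::vector<int> srcs;     // source register numbers
  };

  // rib holds the committed instructions between the two instances of the
  // delinquent load, ordered most recent first; rib[0] is the load itself.
  // Returns the p-slice (still most-recent-first; reverse it to get program
  // order) and fills live_in with the registers to copy from the main thread.
  std::vector<Inst> build_pslice(const std::vector<Inst>& rib, std::set<int>& live_in) {
      std::vector<Inst> slice{rib[0]};
      live_in = std::set<int>(rib[0].srcs.begin(), rib[0].srcs.end());
      for (size_t k = 1; k < rib.size(); k++) {
          const Inst& in = rib[k];
          if (in.dest >= 0 && live_in.count(in.dest)) {    // produces a needed register
              slice.push_back(in);
              live_in.erase(in.dest);                      // now produced inside the slice
              live_in.insert(in.srcs.begin(), in.srcs.end());
          }
          // otherwise skip it: the load's address does not depend on it
      }
      return slice;
  }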

  24. Example
  struct DATATYPE { int val[10]; };
  DATATYPE *data[100];

  for (j = 0; j < 10; j++) {
    for (i = 0; i < 100; i++) {
      data[i]->val[j]++;
    }
  }

  loop:
    I1  load  r1 = [r2]
    I2  add   r3 = r3+1
    I3  add   r6 = r3-100
    I4  add   r2 = r2+8
    I5  add   r1 = r4+r1
    I6  load  r5 = [r1]
    I7  add   r5 = r5+1
    I8  store [r1] = r5
    I9  blt   r6, loop

  25. P-slice Construction Example (analysis proceeds from the most recent instruction, at the top, to the oldest, at the bottom)
  Instruction       | Included | Live-in Set
  load  r5 = [r1]   |          |
  add   r1 = r4+r1  |          |
  add   r2 = r2+8   |          |
  add   r6 = r3-100 |          |
  add   r3 = r3+1   |          |
  load  r1 = [r2]   |          |
  blt   r6, loop    |          |
  store [r1] = r5   |          |
  add   r5 = r5+1   |          |
  load  r5 = [r1]   |          |

  26. P-slice Construction Example – the same table as buffered, before any instruction has been processed; analysis is about to begin at the most recent instance of the load

  27. P-slice Construction Example – the delinquent load (load r5 = [r1]) is marked as included and its source register seeds the live-in set: {r1}

  28. P-slice Construction Example – analysis moves to the next-older instruction (add r1 = r4+r1); the live-in set is still {r1}

  29. P-slice Construction Example – add r1 = r4+r1 produces live-in r1, so it is marked as included

  30. P-slice Construction Example – with add r1 = r4+r1 included, r1 is replaced by that instruction’s sources and the live-in set becomes {r1, r4}

  31. P-slice Construction Example – add r2 = r2+8 produces no live-in register, so it is skipped; the live-in set remains {r1, r4}

  32. P-slice Construction Example – completed analysis:
  Instruction       | Included | Live-in Set
  load  r5 = [r1]   |    ✓     | r1
  add   r1 = r4+r1  |    ✓     | r1, r4
  add   r2 = r2+8   |          | r1, r4
  add   r6 = r3-100 |          | r1, r4
  add   r3 = r3+1   |          | r1, r4
  load  r1 = [r2]   |    ✓     | r2, r4
  blt   r6, loop    |          | r2, r4
  store [r1] = r5   |          | r2, r4
  add   r5 = r5+1   |          | r2, r4
  load  r5 = [r1]   |          |

  33. P-slice Construction Example – resulting p-slice (the delinquent load itself serves as the trigger):
  P-slice:      load r1 = [r2];  add r1 = r4+r1;  load r5 = [r1]
  Live-in set:  r2, r4

  34. Optimizations over Basic Slices • If two occurrences of the load are in the buffer (the common case), we’ve identified a loop that can be exploited for better slices. • Can perform additional analysis passes and optimizations • Retain the live-in set from the previous pass • Increases construction latency but keeps the RIB simple • Optimizations • Advanced trigger placement (if dependences allow, move the trigger earlier in the loop) • Induction unrolling (prefetch multiple iterations ahead; see the sketch below) • Chaining (looping) slices – prefetch many loads with a single thread.
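Applied to the slide-24 loop (and restating its DATATYPE declaration), induction unrolling can be pictured at the source level as targeting the load a fixed number of iterations ahead; DIST is an illustrative run-ahead distance, and the bounds check stands in for the fact that the real slice runs speculatively and can simply be wrong near the end of the loop.

  struct DATATYPE { int val[10]; };

  const int DIST = 4;   // illustrative run-ahead distance, not a value from the paper

  // Source-level picture of an induction-unrolled p-slice: the induction
  // update (add r2 = r2+8 in the assembly) is effectively applied DIST times,
  // so the slice prefetches iteration i+DIST's delinquent load.
  void unrolled_slice(DATATYPE** data, int i, int j) {
      if (i + DIST < 100)                               // guard only needed in this C++ version
          __builtin_prefetch(&data[i + DIST]->val[j]);  // loads data[i+DIST], prefetches its field
  }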

  35. [Results figure not included in the transcript]

  36. Conclusion • Dynamic Speculative Precomputation aggressively targets delinquent loads • Thread-based prefetching scheme • Uses back-end (off the critical path) instruction analyzers • P-slices constructed with no external software support • Multi-pass RIB analysis enables aggressive p-slice optimizations

  37. Outline • Speculative Precomputation • Dynamic Speculative Precomputation • Event-Driven Simultaneous Optimization • Value Specialization • Inline Prefetching • Thread Prefetching

  38. Event-Driven Simultaneous Compilation With Weifeng Zhang and Brad Calder

  39. Event-Driven Optimization • Use “helper threads” to recompile/optimize the main thread. • Optimization is triggered by interesting events that are identified in hardware (event-driven).

  40. Event-Driven Optimization [Figure: four threads – the main thread keeps executing while helper threads recompile/optimize it] • Execution and compilation take place in parallel!

  41. Event-driven multithreaded dynamic optimization • A new model of optimization • Computation and optimization occur in parallel • Optimizations are triggered by the program’s runtime behavior • Advantages • Low overhead profiling of runtime behavior • Low overhead optimization by exploiting additional hardware context • Quick response to the program’s changing behavior • Aggressive optimizations

  42. The proposed optimization model [Figure: events from the main thread trigger helper threads, which transform the original code base into optimized and then re-optimized code] • Maintaining only one copy of the optimized code • Recurrent optimization of already-optimized code when the behavior changes • Gradually enabling aggressive optimizations

  43. The proposed Trident framework • Hardware event-driven • Hardware monitors the program’s behavior with no software overhead • Optimization threads are triggered to respond to particular events • Optimization events are handled as soon as possible, to quickly adapt to the program’s changing behavior • Hardware multithreaded • Concurrent, low-overhead helper threads • Gradual re-optimization upon new events [Figure: Trident events flowing from the main thread to the optimization threads]

  44. Trident architecture • Register a given thread to be monitored, and create helper thread contexts • Monitor the main thread to generate events (into the event queue) • A helper thread is triggered to perform the optimization, update the code cache, and patch the main thread (a threads-and-queues caricature of this flow is sketched below)
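A threads-and-queues caricature of that flow in C++ (in Trident the monitoring, the event queue, and the triggering are done in hardware, so none of this runs on the main thread's critical path); the event kinds mirror the next slide's list, while post_event, form_or_reoptimize_trace, and patch_entry_branch are hypothetical placeholders.

  #include <condition_variable>
  #include <cstdint>
  #include <mutex>
  #include <queue>
  #include <thread>

  enum class EventKind { HotBranch, TraceInvalidation, HotValue, DelinquentLoad };
  struct Event { EventKind kind; uint64_t pc; };

  std::queue<Event> event_queue;      // in Trident this queue is filled by hardware monitoring
  std::mutex m;
  std::condition_variable cv;

  // Stand-in for the hardware posting an event about the monitored thread.
  void post_event(Event e) {
      { std::lock_guard<std::mutex> lk(m); event_queue.push(e); }
      cv.notify_one();
  }

  // Placeholder optimization actions (hypothetical names).
  void form_or_reoptimize_trace(uint64_t pc) { /* build the trace, write it to the code cache */ }
  void patch_entry_branch(uint64_t pc)       { /* redirect the main thread into the new trace */ }

  // Optimization (helper) thread: sleeps until an event arrives, then optimizes
  // and patches while the main thread keeps running in parallel.
  void optimization_thread() {
      for (;;) {
          std::unique_lock<std::mutex> lk(m);
          cv.wait(lk, [] { return !event_queue.empty(); });
          Event e = event_queue.front();
          event_queue.pop();
          lk.unlock();
          form_or_reoptimize_trace(e.pc);
          patch_entry_branch(e.pc);
      }
  }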

  45. Trident hardware events • Events • Occurrence of a particular type of runtime behavior • Generic events • Hot branch events • Trace invalidation • Optimization specific events • Hot value events • Delinquent Load events • Other events • ?

  46. Hot Trace Formation • The Trident Framework is built around a fairly traditional dynamic optimization system • => hot trace formation, code cache • Trident captures hot traces in hardware (details omitted) • However, even with its basic optimizations, Trident has key advantages over previous systems • Hardware hot branch events identify hot traces • Zero-overhead monitoring • Low-overhead optimization in another thread • No context switches between these functions

  47. What is a hot trace? • Definitions • Hot trace • A number of basic blocks frequently executed together • Trace formation • Streamlining these blocks for better execution locality • Code cache • Memory buffer that stores hot traces [Figure: example control-flow graph (blocks A through K, with branch probabilities, a call, and a return) illustrating the frequently taken path that forms a hot trace]

  48. Base optimizations during trace formation • Streamlining the instruction sequence • Redundant branch/load removal • Constant propagation • Instruction re-association • Code elimination • Architecture-aware optimizations • Reduction of RAS (return address stack) mispredictions (by orders of magnitude) • I-cache-conscious placement of traces within the code cache • Trace invalidation
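A made-up before/after fragment (ours, not from the slides) to give a flavor of the streamlining: on the hot trace the branch is assumed taken, the second load of the same field is removed, and the multiply is strength-reduced; the Obj type is hypothetical.

  struct Obj { int limit; };

  // Before trace formation: on the hot path the branch is almost always taken
  // and o->limit is loaded twice.
  int before(Obj* o) {
      int a = o->limit;            // load
      if (a > 0) {
          int b = o->limit;        // redundant load of the same field
          return b * 2;
      }
      return 0;
  }

  // After streamlining the hot trace: the redundant load is gone, the branch has
  // become a cheap guard that exits back to the original code if the assumption
  // breaks, and the multiply is strength-reduced.
  int after_trace(Obj* o) {
      int a = o->limit;
      if (a <= 0) return 0;        // guard / exit stub back to the original code
      return a + a;
  }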

  49. Why dynamic value specialization? • Value specialization • Make a special version of the code corresponding to likely live-in values • Advantages over hardware value prediction • Value predictions are made in the background and less frequently • No limits on how many predictions can be made • Allow more sophisticated prediction techniques • Propagate predicted values along the trace • Trigger other optimizations such as strength reduction

  50. Why dynamic value specialization? • Value specialization • Make a special version of the code corresponding to likely live-in values • Advantages over software value specialization • Can adapt to semi-invariant runtime values (e.g., values that change, but only slowly) • Adapts to actual dynamic runtime values • Detects optimizations that are no longer working (a guarded specialization sketch follows below)
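A guarded-specialization sketch (our example, not from the slides): the stride live-in is observed to be 1 for long stretches, so a specialized trace is generated with that value propagated as a constant; the guard both preserves correctness and, by failing repeatedly, signals that the specialization is no longer working and should be redone.

  // General version: 'stride' is a live-in value that is semi-invariant at run time.
  long sum_strided(const long* a, int n, int stride) {
      long s = 0;
      for (int i = 0; i < n; i += stride) s += a[i];
      return s;
  }

  // Specialized trace generated after observing stride == 1.
  long sum_strided_spec1(const long* a, int n, int stride) {
      if (stride != 1)                      // guard: fall back if the value has changed
          return sum_strided(a, n, stride);
      long s = 0;
      for (int i = 0; i < n; i++)           // stride propagated as a constant
          s += a[i];
      return s;
  }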
