
Ocelot: PTX Emulator






Presentation Transcript


  1. Ocelot: PTX Emulator

  2. Overview
  • Ocelot PTX Emulator
  • Multicore Backend
  • NVIDIA GPU Backend
  • AMD GPU Backend

  3. Execution Model
  NVIDIA’s PTX Execution Model
  • Array of multiprocessors, each executing a CTA
  • SIMD multiprocessors
  • Single-instruction, multiple-thread (SIMT) execution
    • Enables hardware to exploit control-flow uniformity and data locality
  • Parallel Thread Execution (PTX)
    • Explicit memory hierarchy
    • Cooperative thread arrays (CTAs) and ordering constraints

  4. The Ocelot PTX Emulator
  Abstract machine model
  • Performs functional simulation of the PTX execution model
  • Access to complete machine state
    • Enables detailed performance evaluation/debugging
      • e.g. program correctness checking: alignment checks, out-of-bounds accesses, etc.
    • Workload characterization and modeling
    • Trace generation to drive architecture simulators
  • Timing information not available
  PTX 3.0 (Fermi) support

  5. Emulator Implementation
  • Serialized execution of CTAs
    • Implements the CUDA execution model semantics
    • CUDA does not guarantee concurrent CTA execution!
  • Implements an abstract reconvergence mechanism
    • Multiple reconvergence mechanisms are implemented: IPDOM, barrier, thread frontiers
  • Software support for special functions:
    • Texture sampling: 1D, 2D, 3D, cube; nearest, linear
  • New instructions may be prototyped

  6. Ocelot Source Code: PTX Emulator Device Backend

  ocelot/
    executive/
      interface/EmulatorDevice.h
      interface/EmulatedKernel.h
      interface/EmulatorCallStack.h
      interface/CooperativeThreadArray.h
      interface/ReconvergenceMechanism.h
      interface/TextureOperations.h
    trace/
      interface/TraceEvent.h
      interface/TraceGenerator.h

  7. Trace Generator Interface
  • Emulator broadcasts events to event trace analyzers during execution
  • Events provide detailed device state: PC, activity mask, operand data, thread ID, etc.
  • Used for error checking, instrumentation, and simulation

  8. Trace Generator Interface

  // ocelot/trace/interface/TraceGenerator.h
  // Base class for generating traces
  class TraceGenerator {
  public:
      TraceGenerator();
      virtual ~TraceGenerator();

      // Called when a traced kernel is launched to
      // retrieve some parameters from the kernel
      virtual void initialize(const executive::ExecutableKernel& kernel);

      // Called whenever an event takes place.
      virtual void event(const TraceEvent& event);

      // Called when an event is committed
      virtual void postEvent(const TraceEvent& event);

      // Called when a kernel is finished. There will
      // be no more events for this kernel.
      virtual void finish();
  };

  9. TraceEvent Object
  • Captures the execution of a dynamic instruction, including:
    • Device pointer to access device state
    • Kernel grid and CTA dimensions, parameter values
    • PC and internal representation of the PTX instruction
    • Set of active threads executing the instruction
    • Memory addresses and size of transfer
    • Branch target(s) and diverging threads

  10. trace::TraceEvent

  class TraceEvent {
  public:
      // ID of the block that generated the event
      ir::Dim3 blockId;
      // PC index into EmulatedKernel's packed instruction sequence
      ir::PTXU64 PC;
      // Depth of call stack [i.e. number of contexts on the runtime stack]
      ir::PTXU32 contextStackSize;
      // Pointer to the instruction pointed to by PC
      const ir::PTXInstruction* instruction;
      // Bit mask of active threads that executed this instruction
      BitMask active;
      // Taken thread mask in case of a branch
      BitMask taken;
      // Fall-through thread mask in case of a branch
      BitMask fallthrough;
      // Vector of memory addresses possibly generated for this instruction
      U64Vector memory_addresses;
      // Size of memory operations possibly issued by this instruction
      ir::PTXU32 memory_size;
      // Dimensions of the kernel grid that generated the event
      ir::Dim3 gridDim;
      // Dimensions of the kernel block that generated the event
      ir::Dim3 blockDim;
      // Captures just events related to thread reconvergence
      ReconvergenceTraceEvent reconvergence;
  };

  11. PTX Emulator – CUDA Debugging – Race Detection

  // file: raceCondition.cu
  __global__ void raceCondition(int *A) {
      __shared__ int SharedMem[64];
      SharedMem[threadIdx.x] = A[threadIdx.x];
      // no synchronization barrier!
      A[threadIdx.x] = SharedMem[64 - threadIdx.x];  // line 9 - faulting load
  }
  . . .
  raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
  . . .

  ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
  ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252] -
  Shared memory race condition, address 0xfc was previously written by thread 63
  without a memory barrier in between.
  ==Ocelot== Near raceCondition.cu:9:0
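  The core idea behind this check can be sketched as follows (a minimal illustrative model, not Ocelot's RaceDetector; the class and method names are invented): remember the last thread to write each shared-memory address, and flag any later access by a different thread unless a barrier intervened.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative sketch of shared-memory race detection (not Ocelot's
// implementation; names are hypothetical). Each shared address maps to
// the id of its last writer; a barrier discharges all pending hazards.
class SharedRaceChecker {
    std::map<uint64_t, int> lastWriter;  // shared address -> thread id
public:
    // bar.sync orders all prior accesses, so hazards are cleared
    void barrier() { lastWriter.clear(); }

    // Returns true if this access races with an earlier write by
    // another thread since the last barrier.
    bool access(uint64_t addr, int thread, bool isWrite) {
        auto it = lastWriter.find(addr);
        bool race = (it != lastWriter.end() && it->second != thread);
        if (isWrite) lastWriter[addr] = thread;
        return race;
    }
};
```

  In the faulting example above, thread 63 writes address 0xfc and thread 0 then reads it with no barrier in between, which this model flags exactly as the emulator does.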

  12. PTX Emulator – CUDA Debugging – Illegal Memory Accesses

  // file: memoryCheck.cu
  __global__ void badMemoryReference(int *A) {
      A[threadIdx.x] = 0;  // line 3 - faulting store
  }

  int main() {
      int *invalidPtr = (int *)0x0234;  // arbitrary pointer does not refer
                                        // to an existing memory allocation
      int *validPtr = 0;
      cudaMalloc((void **)&validPtr, sizeof(int)*64);
      badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
      return 0;
  }

  ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception:
  ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 -
  Global memory access 0x234 is not within any allocated or mapped range.
  ==Ocelot==
  ==Ocelot== Nearby Device Allocations
  ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
  ==Ocelot==
  ==Ocelot== Near memoryCheck.cu:3:0
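  The bounds check behind this diagnostic can be modeled simply (an illustrative sketch, not Ocelot's MemoryChecker; the class name is invented): record every device allocation as a [base, base+size) range and reject any access that does not fall entirely inside one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative sketch of the memory-checker idea (hypothetical names,
// not Ocelot code): device allocations keyed by base address; an
// access is valid only if it lies wholly within some allocation.
class AllocationMap {
    std::map<uint64_t, uint64_t> ranges;  // base address -> size
public:
    void allocate(uint64_t base, uint64_t size) { ranges[base] = size; }

    bool isValid(uint64_t addr, uint64_t bytes) const {
        // find the allocation whose base is at or below addr
        auto it = ranges.upper_bound(addr);
        if (it == ranges.begin()) return false;  // nothing below addr
        --it;
        return addr + bytes <= it->first + it->second;
    }
};
```

  With the 256-byte allocation at 0x12fa2e0 from the error report, a store to 0x234 falls outside every range and is rejected, mirroring the emulator's message.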

  13. Interactive Debugger
  • Interactive command-line debugger implemented using the TraceGenerator interface
  • Enables:
    • Inspection of application state
    • Single-stepping of instructions
    • Breakpoints and watchpoints
  • Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

  $ ./TestCudaSequence
  A_gpu = 0x16dcbe0
  (ocelot-dbg) Attaching debugger to kernel 'sequence'
  (ocelot-dbg) watch global address 0x16dcbe4 s32[3]
  set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes
  (ocelot-dbg) continue
  st.global.s32 [%r11 + 0], %r7
  watchpoint #1 - CTA (0, 0)
  thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
    old value = -1
    new value = 2
  thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
    old value = -1
    new value = 4
  thread (3, 0, 0) - store to 0x16dcbec 4 bytes
    old value = -1
    new value = 6
  break on watchpoint
  (ocelot-dbg)

  14. Performance Tuning

  ....
  for (int offset = 1; offset < n; offset *= 2) {  // line 61
      pout = 1 - pout;
      pin = 1 - pout;
      __syncthreads();
      temp[pout*n+thid] = temp[pin*n+thid];
      if (thid >= offset) {
          temp[pout*n+thid] += temp[pin*n+thid-offset];
      }
  }
  ....

  • Identify critical regions
    • Memory demand
    • Floating-point intensity
    • Shared memory bank conflicts
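  One of the listed metrics, shared-memory bank conflicts, can be estimated directly from the per-warp addresses a trace event carries. This is an illustrative helper (not Ocelot code), assuming 32 banks of 4-byte words, the common NVIDIA configuration; neither constant comes from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch (hypothetical helper, not Ocelot's): estimate
// the bank-conflict degree of one warp-wide shared-memory access,
// assuming 32 banks of 4-byte words. The result is the maximum number
// of threads hitting the same bank, i.e. the serialization factor.
unsigned bankConflictDegree(const std::vector<uint64_t>& addrs,
                            unsigned numBanks = 32,
                            unsigned wordBytes = 4) {
    std::map<uint64_t, unsigned> hits;  // bank index -> thread count
    for (uint64_t a : addrs)
        ++hits[(a / wordBytes) % numBanks];
    unsigned worst = 0;
    for (const auto& kv : hits) worst = std::max(worst, kv.second);
    return worst;
}
```

  A unit-stride access yields degree 1 (conflict-free), while a stride-2 word access doubles up on half the banks and serializes twofold; a trace analyzer can accumulate this per instruction to locate the critical regions.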

  15. Example: Inter-thread Data Flow
  • Which kernels exchange computed results through shared memory?
  • Track the id of the producer thread
  • Ensure threads are well synchronized

  16. Current Trace Generators

  17. Using Trace Generators
  • Implement the TraceGenerator interface
    • Override methods: initialize(), event(), postEvent(), finish()
  • Add to the Ocelot runtime
    • Explicitly: ocelot::addTraceGenerator()
    • Or add to trace::TraceConfiguration
  • Perform online analysis, or serialize event traces
  • Link applications with libocelotTrace.so
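  The interaction between the emulator loop and attached generators can be mocked in a few self-contained lines. These classes mirror the shape of Ocelot's interface only; MockEvent, MockTraceGenerator, and runMockKernel are invented names, not Ocelot's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Self-contained mock of the trace-generator pattern (illustrative,
// not the real Ocelot headers). The "emulator" broadcasts every
// dynamic instruction to each attached generator.
struct MockEvent { int pc; std::vector<bool> active; };

struct MockTraceGenerator {
    virtual ~MockTraceGenerator() {}
    virtual void initialize() {}            // kernel launch
    virtual void event(const MockEvent&) {} // one dynamic instruction
    virtual void finish() {}                // kernel completion
};

// A trivial analyzer: count dynamic instructions.
struct InstructionCounter : MockTraceGenerator {
    std::size_t count = 0;
    void event(const MockEvent&) override { ++count; }
};

// Stand-in for the emulator's main loop over a kernel of n
// instructions, playing the role ocelot::addTraceGenerator() wires up
// in the real runtime.
void runMockKernel(std::vector<MockTraceGenerator*>& gens, int n) {
    for (auto* g : gens) g->initialize();
    for (int pc = 0; pc < n; ++pc) {
        MockEvent e{pc, std::vector<bool>(32, true)};
        for (auto* g : gens) g->event(e);
    }
    for (auto* g : gens) g->finish();
}
```

  Any number of analyzers can be attached at once; each sees the same event stream, which is what makes online analysis and trace serialization interchangeable.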

  18. Configuring & Executing Applications
  • Edit configure.ocelot
    • Controls Ocelot's initial state
    • Located in the application's startup directory
    • trace specifies which trace generators are initially attached
    • executive controls device properties

  trace: {
      memoryChecker: {
          enabled: true,
          checkInitialization: false
      },
      raceDetector: {
          enabled: true,
          ignoreIrrelevantWrites: true
      },
      debugger: {
          enabled: true,
          kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
          alwaysAttach: false
      }
  },
  executive: {
      devices: [ "emulated" ]
  }

  • trace:
    • memoryChecker – checks that every access falls within a memory region allocated on the currently selected device
    • raceDetector – enforces synchronized access to .shared memory
    • debugger – the interactive debugger (ocelot-dbg)
  • executive:
    • devices: emulated – the Ocelot PTX emulator (supports trace generators)

  19. Example: Thread Load Imbalance

  //! Computes number of dynamic instructions for each thread
  class ThreadLoadImbalance : public trace::TraceGenerator {
  public:
      std::vector<size_t> dynamicInstructions;

      // For each dynamic instruction, increment the counter of each
      // thread that executes it
      virtual void event(const TraceEvent& event) {
          if (!dynamicInstructions.size())
              dynamicInstructions.resize(event.active.size(), 0);
          for (int i = 0; i < event.active.size(); i++) {
              if (event.active[i])
                  dynamicInstructions[i]++;
          }
      }
  };

  [Figure: per-thread dynamic instruction count, Mandelbrot (CUDA SDK)]

  20. Example: Control-flow divergence

  21. Control-Flow Divergence in the PTX Emulator
  • The PTX emulator facilitates customizable handlers for control-flow divergence
    • Currently implements immediate post-dominator (ipdom) reconvergence, among many others
  • Abstract handlers implement potentially divergent control instructions
    • e.g. Bra, Bar, Exit, Vote
  • Executing instructions drives TraceGenerators
    • Reconvergence affects active threads, dynamic instruction count, and the instruction trace
  • Analysis tools can group threads into warps of arbitrary size
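  A divergence-measuring analyzer only needs the active and taken masks a branch event carries. This helper is illustrative (not Ocelot code): a branch diverges when, among the active threads, some take it and some fall through, forcing the SIMT machine to serialize both paths.

```cpp
#include <cassert>
#include <vector>

// Illustrative helper (hypothetical, not Ocelot's API): decide from a
// branch's active and taken masks whether the branch diverges.
// Inactive lanes are ignored; divergence requires live threads on
// both sides of the branch.
bool branchDiverges(const std::vector<bool>& active,
                    const std::vector<bool>& taken) {
    bool anyTaken = false, anyFallthrough = false;
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (!active[i]) continue;        // masked-off lane
        if (taken[i]) anyTaken = true;
        else          anyFallthrough = true;
    }
    return anyTaken && anyFallthrough;
}
```

  Accumulating this predicate per branch PC gives the kind of divergence profile the trace generators above produce.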

  22. class ReconvergenceMechanism

  // ocelot/executive/interface/ReconvergenceMechanism.h
  class ReconvergenceMechanism {
  public:
      ReconvergenceMechanism(CooperativeThreadArray *cta);
      virtual ~ReconvergenceMechanism();

      virtual void initialize() = 0;
      virtual void evalPredicate(CTAContext &context) = 0;
      virtual bool eval_Bra(CTAContext &context,
          const ir::PTXInstruction &instr,
          const boost::dynamic_bitset<> &branch,
          const boost::dynamic_bitset<> &fallthrough) = 0;
      virtual void eval_Bar(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Reconverge(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Exit(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Vote(CTAContext &context,
          const ir::PTXInstruction &instr);
      virtual bool nextInstruction(CTAContext &context,
          const ir::PTXInstruction &instr,
          const ir::PTXInstruction::Opcode &) = 0;
      virtual CTAContext& getContext() = 0;
  };

  23. Example: Immediate Post-Dominator Reconvergence

  // ocelot/executive/interface/ReconvergenceMechanism.h
  class ReconvergenceIPDOM : public ReconvergenceMechanism {
  public:
      ReconvergenceIPDOM(CooperativeThreadArray *cta);
      ~ReconvergenceIPDOM();

      void initialize();
      void evalPredicate(CTAContext &context);
      bool eval_Bra(CTAContext &context, PTXInstruction &instr,
          dynamic_bitset<> &branch, dynamic_bitset<> &fallthrough);
      void eval_Bar(CTAContext &context, PTXInstruction &instr);
      void eval_Reconverge(CTAContext &context, PTXInstruction &instr);
      void eval_Exit(CTAContext &context, PTXInstruction &instr);
      bool nextInstruction(CTAContext &context, PTXInstruction &instr,
          PTXInstruction::Opcode &);
      CTAContext& getContext();

      size_t stackSize() const;
      void push(CTAContext&);
      void pop();

      std::vector<CTAContext> runtimeStack;
      std::vector<int> pcStack;
      unsigned int reconvergeEvents;
  };

  24. Example: Immediate Post-Dominator Reconvergence

  // ocelot/executive/implementation/ReconvergenceMechanism.cpp
  void executive::ReconvergenceIPDOM::eval_Bar(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      if (context.active.count() < context.active.size()) {
          // deadlock - not all threads reach the synchronization barrier
          std::stringstream message;
          message << "barrier deadlock:\n";
          throw RuntimeException(message.str(), context.PC, instr);
      }
  }

  25. Example: Immediate Post-Dominator Reconvergence

  void executive::ReconvergenceIPDOM::eval_Reconverge(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      if (runtimeStack.size() > 1) {
          if (pcStack.back() == context.PC) {
              pcStack.pop_back();
              runtimeStack.pop_back();
              ++reconvergeEvents;
          }
          else {
              context.PC++;
          }
      }
      else {
          context.PC++;
      }
  }

  void executive::ReconvergenceIPDOM::eval_Exit(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      eval_Bar(context, instr);
      context.running = false;
  }
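  The stack discipline these handlers implement can be exercised end to end in a self-contained miniature (illustrative only, not Ocelot's implementation; MiniContext and MiniIPDOM are invented names). A divergent branch pushes a merged context plus one context per path, recording the immediate post-dominator PC; each path pops when it reaches that PC, and the merged context resumes with the full mask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Miniature IPDOM reconvergence stack (illustrative sketch).
struct MiniContext { int pc; std::vector<bool> active; };

struct MiniIPDOM {
    std::vector<MiniContext> runtimeStack;
    std::vector<int> pcStack;  // one reconvergence PC per divergent path

    // Divergent branch: split the current context; the taken-path
    // context runs first (it is pushed last).
    void evalBra(const std::vector<bool>& taken, int takenPC,
                 int fallthroughPC, int reconvergePC) {
        MiniContext cur = runtimeStack.back();
        runtimeStack.pop_back();
        MiniContext merged{reconvergePC, cur.active};
        MiniContext ft{fallthroughPC, cur.active};
        MiniContext tk{takenPC, cur.active};
        for (std::size_t i = 0; i < cur.active.size(); ++i) {
            tk.active[i] = cur.active[i] && taken[i];
            ft.active[i] = cur.active[i] && !taken[i];
        }
        runtimeStack.push_back(merged);   // resumes after reconvergence
        runtimeStack.push_back(ft);       // fall-through threads
        runtimeStack.push_back(tk);       // taken threads, execute first
        pcStack.push_back(reconvergePC);
        pcStack.push_back(reconvergePC);
    }

    // Counterpart of eval_Reconverge: pop the top context once it has
    // reached its recorded reconvergence PC, else keep advancing it.
    void evalReconverge() {
        if (runtimeStack.size() > 1 && !pcStack.empty() &&
            runtimeStack.back().pc == pcStack.back()) {
            runtimeStack.pop_back();
            pcStack.pop_back();
        } else {
            ++runtimeStack.back().pc;
        }
    }
};
```

  Running both paths to the post-dominator pops them in turn and leaves a single context holding the original full mask, which is the reconvergence the eval_Reconverge handler above performs.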

  26. Summary: Ocelot PTX Emulator
  • Used for instruction-level analysis
  • Can be attached to trace generators and trace analyzers
    • Simple processing and filtering of machine state
  • Forms the basis for a range of productivity tools
    • Correctness tools
    • Debugging tools
    • Workload characterization
  • Drives instruction and address traces to MacSim
