
Ocelot: PTX Emulator






Presentation Transcript


  1. Ocelot: PTX Emulator

  2. Overview
  • Ocelot PTX Emulator
  • Multicore Backend
  • NVIDIA GPU Backend
  • AMD GPU Backend

  3. Execution Model
  NVIDIA’s PTX Execution Model
  • Array of multiprocessors, each executing a CTA
  • SIMD multiprocessors
  • Single-instruction, multiple-thread (SIMT) execution
    • Enables hardware to exploit control-flow uniformity and data locality
  • Parallel Thread Execution (PTX)
    • Explicit memory hierarchy
    • Cooperative thread arrays (CTAs) and ordering constraints

  4. The Ocelot PTX Emulator
  Abstract machine model
  • Performs functional simulation of the PTX execution model
  • Access to complete machine state
    • Enables detailed performance evaluation/debugging
      • e.g. program correctness checking: alignment checks, out-of-bounds accesses, etc.
    • Workload characterization and modeling
    • Trace generation to drive architecture simulators
  • Timing information not available
  PTX 3.0 (Fermi) support

  5. Emulator Implementation
  • Serialized execution of CTAs
    • Implements the CUDA execution model semantics
    • CUDA does not guarantee concurrent CTA execution!
  • Implements an abstract reconvergence mechanism
    • Multiple reconvergence mechanisms are implemented: IPDOM, barrier, thread frontiers
  • Software support for special functions:
    • Texture sampling: 1D, 2D, 3D, cube; nearest, linear
  • New instructions may be prototyped

  6. Ocelot Source Code: PTX Emulator Device Backend

  ocelot/
    executive/
      interface/EmulatorDevice.h
      interface/EmulatedKernel.h
      interface/EmulatorCallStack.h
      interface/CooperativeThreadArray.h
      interface/ReconvergenceMechanism.h
      interface/TextureOperations.h
    trace/
      interface/TraceEvent.h
      interface/TraceGenerator.h

  7. Trace Generator Interface
  • Emulator broadcasts events to event trace analyzers during execution
  • Events provide detailed device state: PC, activity mask, operand data, thread ID, etc.
  • Used for error checking, instrumentation, and simulation

  8. Trace Generator Interface

  // ocelot/trace/interface/TraceGenerator.h
  // Base class for generating traces
  class TraceGenerator {
  public:
      TraceGenerator();
      virtual ~TraceGenerator();

      // Called when a traced kernel is launched to
      // retrieve some parameters from the kernel
      virtual void initialize(const executive::ExecutableKernel& kernel);

      // Called whenever an event takes place.
      virtual void event(const TraceEvent& event);

      // Called when an event is committed
      virtual void postEvent(const TraceEvent& event);

      // Called when a kernel is finished. There will
      // be no more events for this kernel.
      virtual void finish();
  };

  9. TraceEvent Object
  • Captures the execution of a dynamic instruction, including:
    • Device pointer to access device state
    • Kernel grid and CTA dimensions, parameter values
    • PC and internal representation of the PTX instruction
    • Set of active threads executing the instruction
    • Memory addresses and size of transfer
    • Branch target(s) and diverging threads

  10. trace::TraceEvent

  class TraceEvent {
  public:
      // ID of the block that generated the event
      ir::Dim3 blockId;
      // PC index into EmulatedKernel's packed instruction sequence
      ir::PTXU64 PC;
      // Depth of call stack [i.e. number of contexts on the runtime stack]
      ir::PTXU32 contextStackSize;
      // Pointer to the instruction pointed to by PC
      const ir::PTXInstruction* instruction;
      // Bit mask of active threads that executed this instruction
      BitMask active;
      // Taken thread mask in case of a branch
      BitMask taken;
      // Fall-through thread mask in case of a branch
      BitMask fallthrough;
      // Vector of memory addresses possibly generated for this instruction
      U64Vector memory_addresses;
      // Size of memory operations possibly issued by this instruction
      ir::PTXU32 memory_size;
      // Dimensions of the kernel grid that generated the event
      ir::Dim3 gridDim;
      // Dimensions of the kernel block that generated the event
      ir::Dim3 blockDim;
      // Captures just events related to thread reconvergence
      ReconvergenceTraceEvent reconvergence;
  };

  11. PTX Emulator – CUDA Debugging – Race Detection

  // file: raceCondition.cu
  __global__ void raceCondition(int *A) {
      __shared__ int SharedMem[64];
      SharedMem[threadIdx.x] = A[threadIdx.x];
      // no synchronization barrier!
      A[threadIdx.x] = SharedMem[64 - threadIdx.x];  // line 9 - faulting load
  }
  . . .
  raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
  . . .

  ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
  ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252] -
  Shared memory race condition, address 0xfc was previously written by thread 63
  without a memory barrier in between.
  ==Ocelot== Near raceCondition.cu:9:0
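  The core idea behind this check can be sketched as follows (a minimal illustrative model, not Ocelot's RaceDetector; the class and method names are invented): remember the last thread to write each shared-memory address, and flag any later access by a different thread unless a barrier intervened.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative sketch of shared-memory race detection (not Ocelot's
// implementation; names are hypothetical). Each shared address maps to
// the id of its last writer; a barrier discharges all pending hazards.
class SharedRaceChecker {
    std::map<uint64_t, int> lastWriter;  // shared address -> thread id
public:
    // bar.sync orders all prior accesses, so hazards are cleared
    void barrier() { lastWriter.clear(); }

    // Returns true if this access races with an earlier write by
    // another thread since the last barrier.
    bool access(uint64_t addr, int thread, bool isWrite) {
        auto it = lastWriter.find(addr);
        bool race = (it != lastWriter.end() && it->second != thread);
        if (isWrite) lastWriter[addr] = thread;
        return race;
    }
};
```

  In the faulting example above, thread 63 writes address 0xfc and thread 0 then reads it with no barrier in between, which this model flags exactly as the emulator does.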

  12. PTX Emulator – CUDA Debugging – Illegal Memory Accesses

  // file: memoryCheck.cu
  __global__ void badMemoryReference(int *A) {
      A[threadIdx.x] = 0;  // line 3 - faulting store
  }

  int main() {
      int *invalidPtr = (int *)0x0234;  // arbitrary pointer does not refer
                                        // to an existing memory allocation
      int *validPtr = 0;
      cudaMalloc((void **)&validPtr, sizeof(int)*64);
      badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
      return 0;
  }

  ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception:
  ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 -
  Global memory access 0x234 is not within any allocated or mapped range.
  ==Ocelot==
  ==Ocelot== Nearby Device Allocations
  ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
  ==Ocelot==
  ==Ocelot== Near memoryCheck.cu:3:0
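  The bounds check behind this diagnostic can be modeled simply (an illustrative sketch, not Ocelot's MemoryChecker; the class name is invented): record every device allocation as a [base, base+size) range and reject any access that does not fall entirely inside one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative sketch of the memory-checker idea (hypothetical names,
// not Ocelot code): device allocations keyed by base address; an
// access is valid only if it lies wholly within some allocation.
class AllocationMap {
    std::map<uint64_t, uint64_t> ranges;  // base address -> size
public:
    void allocate(uint64_t base, uint64_t size) { ranges[base] = size; }

    bool isValid(uint64_t addr, uint64_t bytes) const {
        // find the allocation whose base is at or below addr
        auto it = ranges.upper_bound(addr);
        if (it == ranges.begin()) return false;  // nothing below addr
        --it;
        return addr + bytes <= it->first + it->second;
    }
};
```

  With the 256-byte allocation at 0x12fa2e0 from the error report, a store to 0x234 falls outside every range and is rejected, mirroring the emulator's message.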

  13. Interactive Debugger
  • Interactive command-line debugger implemented using the TraceGenerator interface
  • Enables:
    • Inspection of application state
    • Single-stepping of instructions
    • Breakpoints and watchpoints
  • Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

  $ ./TestCudaSequence
  A_gpu = 0x16dcbe0
  (ocelot-dbg) Attaching debugger to kernel 'sequence'
  (ocelot-dbg) watch global address 0x16dcbe4 s32[3]
  set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes
  (ocelot-dbg) continue
  st.global.s32 [%r11 + 0], %r7
  watchpoint #1 - CTA (0, 0)
  thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
    old value = -1
    new value = 2
  thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
    old value = -1
    new value = 4
  thread (3, 0, 0) - store to 0x16dcbec 4 bytes
    old value = -1
    new value = 6
  break on watchpoint
  (ocelot-dbg)

  14. Performance Tuning

  ....
  for (int offset = 1; offset < n; offset *= 2) {  // line 61
      pout = 1 - pout;
      pin = 1 - pout;
      __syncthreads();
      temp[pout*n+thid] = temp[pin*n+thid];
      if (thid >= offset) {
          temp[pout*n+thid] += temp[pin*n+thid-offset];
      }
  }
  ....

  • Identify critical regions
    • Memory demand
    • Floating-point intensity
    • Shared memory bank conflicts
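  One of the listed metrics, shared-memory bank conflicts, can be estimated directly from the per-warp addresses a trace event carries. This is an illustrative helper (not Ocelot code), assuming 32 banks of 4-byte words, the common NVIDIA configuration; neither constant comes from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch (hypothetical helper, not Ocelot's): estimate
// the bank-conflict degree of one warp-wide shared-memory access,
// assuming 32 banks of 4-byte words. The result is the maximum number
// of threads hitting the same bank, i.e. the serialization factor.
unsigned bankConflictDegree(const std::vector<uint64_t>& addrs,
                            unsigned numBanks = 32,
                            unsigned wordBytes = 4) {
    std::map<uint64_t, unsigned> hits;  // bank index -> thread count
    for (uint64_t a : addrs)
        ++hits[(a / wordBytes) % numBanks];
    unsigned worst = 0;
    for (const auto& kv : hits) worst = std::max(worst, kv.second);
    return worst;
}
```

  A unit-stride access yields degree 1 (conflict-free), while a stride-2 word access doubles up on half the banks and serializes twofold; a trace analyzer can accumulate this per instruction to locate the critical regions.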

  15. Example: Inter-thread Data Flow
  • Which kernels exchange computed results through shared memory?
  • Track the id of the producer thread
  • Ensure threads are well synchronized

  16. Current Trace Generators

  17. Using Trace Generators
  • Implement the TraceGenerator interface
    • Override methods: initialize(), event(), postEvent(), finish()
  • Add to the Ocelot runtime
    • Explicitly: ocelot::addTraceGenerator()
    • Or add to trace::TraceConfiguration
  • Perform online analysis, or serialize event traces
  • Link applications with libocelotTrace.so
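  The interaction between the emulator loop and attached generators can be mocked in a few self-contained lines. These classes mirror the shape of Ocelot's interface only; MockEvent, MockTraceGenerator, and runMockKernel are invented names, not Ocelot's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Self-contained mock of the trace-generator pattern (illustrative,
// not the real Ocelot headers). The "emulator" broadcasts every
// dynamic instruction to each attached generator.
struct MockEvent { int pc; std::vector<bool> active; };

struct MockTraceGenerator {
    virtual ~MockTraceGenerator() {}
    virtual void initialize() {}            // kernel launch
    virtual void event(const MockEvent&) {} // one dynamic instruction
    virtual void finish() {}                // kernel completion
};

// A trivial analyzer: count dynamic instructions.
struct InstructionCounter : MockTraceGenerator {
    std::size_t count = 0;
    void event(const MockEvent&) override { ++count; }
};

// Stand-in for the emulator's main loop over a kernel of n
// instructions, playing the role ocelot::addTraceGenerator() wires up
// in the real runtime.
void runMockKernel(std::vector<MockTraceGenerator*>& gens, int n) {
    for (auto* g : gens) g->initialize();
    for (int pc = 0; pc < n; ++pc) {
        MockEvent e{pc, std::vector<bool>(32, true)};
        for (auto* g : gens) g->event(e);
    }
    for (auto* g : gens) g->finish();
}
```

  Any number of analyzers can be attached at once; each sees the same event stream, which is what makes online analysis and trace serialization interchangeable.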

  18. Configuring & Executing Applications
  • Edit configure.ocelot
    • Controls Ocelot's initial state
    • Located in the application's startup directory
    • trace specifies which trace generators are initially attached
    • executive controls device properties

  trace: {
      memoryChecker: {
          enabled: true,
          checkInitialization: false
      },
      raceDetector: {
          enabled: true,
          ignoreIrrelevantWrites: true
      },
      debugger: {
          enabled: true,
          kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
          alwaysAttach: false
      }
  },
  executive: {
      devices: [ "emulated" ]
  }

  • trace:
    • memoryChecker – checks that every access falls within a memory region allocated on the currently selected device
    • raceDetector – enforces synchronized access to .shared memory
    • debugger – the interactive debugger (ocelot-dbg)
  • executive:
    • devices: emulated – the Ocelot PTX emulator (supports trace generators)

  19. Example: Thread Load Imbalance

  //! Computes number of dynamic instructions for each thread
  class ThreadLoadImbalance : public trace::TraceGenerator {
  public:
      std::vector<size_t> dynamicInstructions;

      // For each dynamic instruction, increment the counter of each
      // thread that executes it
      virtual void event(const TraceEvent& event) {
          if (!dynamicInstructions.size())
              dynamicInstructions.resize(event.active.size(), 0);
          for (int i = 0; i < event.active.size(); i++) {
              if (event.active[i])
                  dynamicInstructions[i]++;
          }
      }
  };

  [Figure: per-thread dynamic instruction count, Mandelbrot (CUDA SDK)]

  20. Example: Control-flow divergence

  21. Control-Flow Divergence in the PTX Emulator
  • The PTX emulator facilitates customizable handlers for control-flow divergence
    • Currently implements immediate post-dominator (ipdom) reconvergence, among many others
  • Abstract handlers implement potentially divergent control instructions
    • e.g. Bra, Bar, Exit, Vote
  • Executing instructions drives TraceGenerators
    • Reconvergence affects active threads, dynamic instruction count, and the instruction trace
  • Analysis tools can group threads into warps of arbitrary size
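  A divergence-measuring analyzer only needs the active and taken masks a branch event carries. This helper is illustrative (not Ocelot code): a branch diverges when, among the active threads, some take it and some fall through, forcing the SIMT machine to serialize both paths.

```cpp
#include <cassert>
#include <vector>

// Illustrative helper (hypothetical, not Ocelot's API): decide from a
// branch's active and taken masks whether the branch diverges.
// Inactive lanes are ignored; divergence requires live threads on
// both sides of the branch.
bool branchDiverges(const std::vector<bool>& active,
                    const std::vector<bool>& taken) {
    bool anyTaken = false, anyFallthrough = false;
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (!active[i]) continue;        // masked-off lane
        if (taken[i]) anyTaken = true;
        else          anyFallthrough = true;
    }
    return anyTaken && anyFallthrough;
}
```

  Accumulating this predicate per branch PC gives the kind of divergence profile the trace generators above produce.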

  22. class ReconvergenceMechanism

  // ocelot/executive/interface/ReconvergenceMechanism.h
  class ReconvergenceMechanism {
  public:
      ReconvergenceMechanism(CooperativeThreadArray *cta);
      virtual ~ReconvergenceMechanism();

      virtual void initialize() = 0;
      virtual void evalPredicate(CTAContext &context) = 0;
      virtual bool eval_Bra(CTAContext &context,
          const ir::PTXInstruction &instr,
          const boost::dynamic_bitset<> &branch,
          const boost::dynamic_bitset<> &fallthrough) = 0;
      virtual void eval_Bar(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Reconverge(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Exit(CTAContext &context,
          const ir::PTXInstruction &instr) = 0;
      virtual void eval_Vote(CTAContext &context,
          const ir::PTXInstruction &instr);
      virtual bool nextInstruction(CTAContext &context,
          const ir::PTXInstruction &instr,
          const ir::PTXInstruction::Opcode &) = 0;
      virtual CTAContext& getContext() = 0;
  };

  23. Example: Immediate Post-Dominator Reconvergence

  // ocelot/executive/interface/ReconvergenceMechanism.h
  class ReconvergenceIPDOM : public ReconvergenceMechanism {
  public:
      ReconvergenceIPDOM(CooperativeThreadArray *cta);
      ~ReconvergenceIPDOM();

      void initialize();
      void evalPredicate(CTAContext &context);
      bool eval_Bra(CTAContext &context, PTXInstruction &instr,
          dynamic_bitset<> &branch, dynamic_bitset<> &fallthrough);
      void eval_Bar(CTAContext &context, PTXInstruction &instr);
      void eval_Reconverge(CTAContext &context, PTXInstruction &instr);
      void eval_Exit(CTAContext &context, PTXInstruction &instr);
      bool nextInstruction(CTAContext &context, PTXInstruction &instr,
          PTXInstruction::Opcode &);
      CTAContext& getContext();

      size_t stackSize() const;
      void push(CTAContext&);
      void pop();

      std::vector<CTAContext> runtimeStack;
      std::vector<int> pcStack;
      unsigned int reconvergeEvents;
  };

  24. Example: Immediate Post-Dominator Reconvergence

  // ocelot/executive/implementation/ReconvergenceMechanism.cpp
  void executive::ReconvergenceIPDOM::eval_Bar(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      if (context.active.count() < context.active.size()) {
          // deadlock - not all threads reach the synchronization barrier
          std::stringstream message;
          message << "barrier deadlock:\n";
          throw RuntimeException(message.str(), context.PC, instr);
      }
  }

  25. Example: Immediate Post-Dominator Reconvergence

  void executive::ReconvergenceIPDOM::eval_Reconverge(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      if (runtimeStack.size() > 1) {
          if (pcStack.back() == context.PC) {
              pcStack.pop_back();
              runtimeStack.pop_back();
              ++reconvergeEvents;
          }
          else {
              context.PC++;
          }
      }
      else {
          context.PC++;
      }
  }

  void executive::ReconvergenceIPDOM::eval_Exit(
      executive::CTAContext &context,
      const ir::PTXInstruction &instr) {

      eval_Bar(context, instr);
      context.running = false;
  }
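  The stack discipline these handlers implement can be exercised end to end in a self-contained miniature (illustrative only, not Ocelot's implementation; MiniContext and MiniIPDOM are invented names). A divergent branch pushes a merged context plus one context per path, recording the immediate post-dominator PC; each path pops when it reaches that PC, and the merged context resumes with the full mask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Miniature IPDOM reconvergence stack (illustrative sketch).
struct MiniContext { int pc; std::vector<bool> active; };

struct MiniIPDOM {
    std::vector<MiniContext> runtimeStack;
    std::vector<int> pcStack;  // one reconvergence PC per divergent path

    // Divergent branch: split the current context; the taken-path
    // context runs first (it is pushed last).
    void evalBra(const std::vector<bool>& taken, int takenPC,
                 int fallthroughPC, int reconvergePC) {
        MiniContext cur = runtimeStack.back();
        runtimeStack.pop_back();
        MiniContext merged{reconvergePC, cur.active};
        MiniContext ft{fallthroughPC, cur.active};
        MiniContext tk{takenPC, cur.active};
        for (std::size_t i = 0; i < cur.active.size(); ++i) {
            tk.active[i] = cur.active[i] && taken[i];
            ft.active[i] = cur.active[i] && !taken[i];
        }
        runtimeStack.push_back(merged);   // resumes after reconvergence
        runtimeStack.push_back(ft);       // fall-through threads
        runtimeStack.push_back(tk);       // taken threads, execute first
        pcStack.push_back(reconvergePC);
        pcStack.push_back(reconvergePC);
    }

    // Counterpart of eval_Reconverge: pop the top context once it has
    // reached its recorded reconvergence PC, else keep advancing it.
    void evalReconverge() {
        if (runtimeStack.size() > 1 && !pcStack.empty() &&
            runtimeStack.back().pc == pcStack.back()) {
            runtimeStack.pop_back();
            pcStack.pop_back();
        } else {
            ++runtimeStack.back().pc;
        }
    }
};
```

  Running both paths to the post-dominator pops them in turn and leaves a single context holding the original full mask, which is the reconvergence the eval_Reconverge handler above performs.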

  26. Summary: Ocelot PTX Emulator
  • Used for instruction-level analysis
  • Can be attached to trace generators and trace analyzers
    • Simple processing and filtering of machine state
  • Forms the basis for a range of productivity tools
    • Correctness tools
    • Debugging tools
    • Workload characterization
  • Drives instruction and address traces to MacSim
