
Programming with CUDA WS 08





Presentation Transcript


    1. Programming with CUDA WS 08/09
       Lecture 8
       Thu, 18 Nov, 2008

    2. Previously
       CUDA Runtime Component
       - Common Component: data types, math functions, timing, textures
       - Device Component: math functions, warp voting, atomic functions, synchronization function, texturing
       - Host Component: high-level runtime API, low-level driver API

    3. Previously
       CUDA Runtime Component – Host Component APIs
       - The two APIs are mutually exclusive
       - The runtime API is easier to program and hides some details from the programmer
       - The driver API gives low-level control, but is harder to program
       - Both provide: device initialization; management of devices, streams, and events

    4. Today
       CUDA Runtime Component – Host Component APIs
       - Both provide: management of memory and textures, OpenGL/Direct3D interoperability (NOT covered)
       - The runtime API provides: an emulation mode for debugging
       - The driver API provides: management of contexts and modules, execution control
       Final Projects

    5. Memory Management: Linear Memory
       CUDA Runtime API
       - Declare: TYPE*
       - Allocate: cudaMalloc, cudaMallocPitch
       - Copy: cudaMemcpy, cudaMemcpy2D
       - Free: cudaFree
       CUDA Driver API
       - Declare: CUdeviceptr
       - Allocate: cuMemAlloc, cuMemAllocPitch
       - Copy: cuMemcpyHtoD / cuMemcpyDtoH, cuMemcpy2D
       - Free: cuMemFree
       Host Runtime Component

    6. Memory Management: Linear Memory
       Pitch (stride) – expected:

       // host code
       float *array2D;
       cudaMallocPitch ((void**)&array2D, width * sizeof(float), height);

       // device code
       int size = width * sizeof(float);
       for (int r = 0; r < height; ++r) {
           float *row = (float*)((char*)array2D + r * size);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    7. Memory Management: Linear Memory
       Pitch (stride) – expected, WRONG:

       // host code
       float *array2D;
       cudaMallocPitch ((void**)&array2D, width * sizeof(float), height);

       // device code
       int size = width * sizeof(float);
       for (int r = 0; r < height; ++r) {
           float *row = (float*)((char*)array2D + r * size);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    8. Memory Management: Linear Memory
       Pitch (stride) – CORRECT:

       // host code
       float *array2D;
       size_t pitch;
       cudaMallocPitch ((void**)&array2D, &pitch, width * sizeof(float), height);

       // device code (pitch is passed in from the host)
       for (int r = 0; r < height; ++r) {
           float *row = (float*)((char*)array2D + r * pitch);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    9. Memory Management: Linear Memory
       Pitch (stride) – why?
       - Allocation using the pitch functions appropriately pads rows for efficient transfers and copies
       - The width of an allocated row may therefore exceed width * sizeof(float)
       - The true row width in bytes is given by the pitch
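As a sketch of how the pitch is used together with the copy functions from slide 5 (h_data, width, and height are assumed inputs, not from the slides), a host-to-device copy into pitched memory must pass both pitches:

```cuda
#include <cuda_runtime.h>

// Hedged sketch: copy a width x height float matrix from dense host
// memory into pitched device memory.
void copyToPitched(const float *h_data, int width, int height)
{
    float  *d_array2D;
    size_t  pitch;  // filled in by cudaMallocPitch, in bytes

    cudaMallocPitch((void**)&d_array2D, &pitch, width * sizeof(float), height);

    // cudaMemcpy2D takes destination and source pitches separately;
    // the host rows here are densely packed at width * sizeof(float) bytes.
    cudaMemcpy2D(d_array2D, pitch,
                 h_data, width * sizeof(float),
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);

    cudaFree(d_array2D);
}
```

A kernel reading this memory would receive both d_array2D and pitch as arguments, as in the CORRECT device code on slide 8.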

    10. Memory Management: CUDA Arrays
       CUDA Runtime API
       - Declare: cudaArray*
       - Channel: cudaChannelFormatDesc, cudaCreateChannelDesc<TYPE>
       - Allocate: cudaMallocArray
       - Copy (from linear): cudaMemcpy2DToArray
       - Free: cudaFreeArray
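Putting the runtime-API calls above together, a minimal sketch (h_data, width, and height are assumed inputs) that allocates a CUDA array of floats and fills it from linear host memory:

```cuda
#include <cuda_runtime.h>

// Hedged sketch: allocate a width x height CUDA array of single floats
// and copy dense host data into it.
void makeArray(const float *h_data, int width, int height)
{
    // Channel descriptor for 32-bit float texels
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray *cuArr;
    cudaMallocArray(&cuArr, &desc, width, height);

    // Copy from linear host memory; the source pitch is the dense row size.
    cudaMemcpy2DToArray(cuArr, 0, 0,
                        h_data, width * sizeof(float),
                        width * sizeof(float), height,
                        cudaMemcpyHostToDevice);

    cudaFreeArray(cuArr);
}
```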

    11. Memory Management: CUDA Arrays
       CUDA Driver API
       - Declare: CUarray
       - Channel: CUDA_ARRAY_DESCRIPTOR object
       - Allocate: cuArrayCreate
       - Copy (from linear): CUDA_MEMCPY2D object
       - Free: cuArrayDestroy

    12. Memory Management: various other functions to copy
       - from linear memory to CUDA arrays
       - from the host to constant memory
       See the Reference Manual
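For the host-to-constant-memory case mentioned above, the runtime API provides cudaMemcpyToSymbol; a minimal sketch (coeffs and h_coeffs are hypothetical names, not from the slides):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];  // constant memory, visible to all kernels

// Hedged sketch: upload 16 floats from the host into constant memory.
void uploadCoeffs(const float *h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}
```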

    13. Texture Management
       Runtime API: the texture type is derived from

       struct textureReference {
           int normalized;
           enum cudaTextureFilterMode filterMode;
           enum cudaTextureAddressMode addressMode[3];
           struct cudaChannelFormatDesc channelDesc;
       }

       normalized: 0 means false, any other value means true

    14. Texture Management
       filterMode:
       - cudaFilterModePoint: no filtering; the returned value is that of the nearest texel
       - cudaFilterModeLinear: filters 2/4/8 neighbors for a 1D/2D/3D texture; floats only
       addressMode: one mode per coordinate (x, y, z)
       - cudaAddressModeClamp
       - cudaAddressModeWrap: normalized coordinates only

    15. Texture Management
       channelDesc: the texel type

       struct cudaChannelFormatDesc {
           int x, y, z, w;
           enum cudaChannelFormatKind f;
       }

       x, y, z, w: number of bits per component
       f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat

    16. Texture Management
       Runtime API: the texture type is derived from

       struct textureReference {
           int normalized;
           enum cudaTextureFilterMode filterMode;
           enum cudaTextureAddressMode addressMode[3];
           struct cudaChannelFormatDesc channelDesc;
       }

       These attributes apply only to texture references bound to CUDA arrays

    17. Texture Management
       Binding a texture reference to a texture
       Runtime API:
       - Linear memory: cudaBindTexture
       - CUDA array: cudaBindTextureToArray
       Driver API:
       - Linear memory: cuTexRefSetAddress
       - CUDA array: cuTexRefSetArray
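Combining slides 13–17, a runtime-API sketch that configures a texture reference and binds it to an existing CUDA array (tex, cuArr, and bindTex are hypothetical names; the array is assumed to hold float texels):

```cuda
#include <cuda_runtime.h>

// File-scope texture reference: float texels, 2D, unfiltered element reads.
texture<float, 2, cudaReadModeElementType> tex;

// Hedged sketch: set the textureReference fields from slides 13-14,
// then bind the reference to a CUDA array.
void bindTex(cudaArray *cuArr)
{
    tex.normalized     = 0;                    // unnormalized coordinates
    tex.filterMode     = cudaFilterModePoint;  // nearest-texel lookup
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;

    cudaBindTextureToArray(tex, cuArr);  // channel desc is taken from the array
}
```

Device code would then read texels with tex2D(tex, x, y).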

    18. Runtime API: debugging using the emulation mode
       - No native debug support for device code
       - Code must be compiled either for device emulation or for execution: mixing is not allowed
       - In emulation mode, device code is compiled for the host

    19. Runtime API: debugging using the emulation mode
       Features
       - Each CUDA thread is mapped to a host thread, plus one master thread
       - Each thread gets 256 KB of stack

    20. Runtime API: debugging using the emulation mode
       Advantages
       - Can use host debuggers
       - Can use otherwise disallowed functions in device code, e.g. printf
       - Device and host memory are both readable from either device or host code

    21. Runtime API: debugging using the emulation mode
       Advantages
       - Any device- or host-specific function can be called from either device or host code
       - The runtime detects incorrect use of synchronization functions

    22. Runtime API: debugging using the emulation mode
       Some errors may still remain hidden:
       - Memory access errors
       - Out-of-context pointer operations
       - Incorrect outcomes of warp vote functions, as the warp size is 1 in emulation mode
       - Results of floating-point operations often differ between host and device

    23. Driver API: Context management
       - A context encapsulates all resources and actions performed within the driver API
       - Almost all CUDA functions operate in a context, except those dealing with:
         - device enumeration
         - context management

    24. Driver API: Context management
       - Each host thread can have only one current device context at a time
       - Each host thread maintains a stack of current contexts
       cuCtxCreate():
       - Creates a context
       - Pushes it to the top of the stack
       - Makes it the current context

    25. Driver API: Context management
       cuCtxPopCurrent():
       - Detaches the current context from the host thread – makes it “uncurrent”
       - The context is now floating
       - It can be pushed onto any host thread's stack

    26. Driver API: Context management
       Each context has a usage count:
       - cuCtxCreate creates a context with a usage count of 1
       - cuCtxAttach increments the usage count
       - cuCtxDetach decrements the usage count

    27. Driver API: Context management
       - A context is destroyed when its usage count reaches 0
       - cuCtxDetach, cuCtxDestroy
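The context lifecycle described on slides 23–27 can be sketched for a single host thread as follows (device 0 and the function name contextDemo are assumptions for illustration):

```cuda
#include <cuda.h>

// Hedged sketch of the driver-API context lifecycle on one host thread.
void contextDemo()
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);                  // must precede any other driver API call
    cuDeviceGet(&dev, 0);       // enumeration works without a context
    cuCtxCreate(&ctx, 0, dev);  // usage count = 1; pushed on this thread's
                                // stack and made the current context

    // ... allocate memory, load modules, launch kernels in ctx ...

    cuCtxDetach(ctx);           // usage count drops to 0: context destroyed
}
```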

    28. Driver API: Module management
       - Modules are dynamically loadable packages of device code and data output by nvcc
       - Similar to DLLs

    29. Driver API: Module management
       Dynamically loading a module and accessing its contents:

       CUmodule cuModule;
       cuModuleLoad(&cuModule, "myModule.cubin");
       CUfunction cuFunction;
       cuModuleGetFunction(&cuFunction, cuModule, "myKernel");

    30. Driver API: Execution control
       Set kernel parameters
       cuFuncSetBlockShape():
       - number of threads per block for the function
       - how thread IDs are assigned
       cuFuncSetSharedSize():
       - size of shared memory
       cuParam*():
       - specify other parameters for the next kernel launch

    31. Driver API: Execution control
       Launch kernel: cuLaunch(), cuLaunchGrid()
       Example: Section 4.5.3.5 in the Programming Guide
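A sketch of slides 30–31 combined, launching the cuFunction obtained on slide 29 with one device pointer and one int parameter (N, d_data, gridW, gridH, and the hand-packed offsets are assumptions; this is the pre-CUDA-4.0 parameter-passing style):

```cuda
#include <cuda.h>

// Hedged sketch: configure and launch a kernel via the driver API.
void launch(CUfunction cuFunction, CUdeviceptr d_data, int N,
            int gridW, int gridH)
{
    cuFuncSetBlockShape(cuFunction, 16, 16, 1);  // 256 threads per block

    // Pack kernel parameters by hand, tracking the byte offset.
    int offset = 0;
    cuParamSetv(cuFunction, offset, &d_data, sizeof(d_data));
    offset += sizeof(d_data);
    cuParamSeti(cuFunction, offset, N);
    offset += sizeof(int);
    cuParamSetSize(cuFunction, offset);          // total parameter bytes

    cuLaunchGrid(cuFunction, gridW, gridH);      // launch a gridW x gridH grid
}
```

Note that alignment of each parameter is the programmer's responsibility in this API, which is part of why the runtime API's <<<...>>> launch syntax is easier to use.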

    32. Final Projects
       Ideas?
       - DES cracker
       - Image editor: resize and smooth an image; gamut mapping?
       - 3D shape matching

    33. All for today
       Next time: memory and instruction optimizations

    34. On to exercises!
