This slide deck outlines lab exercises for implementing GPGPU parallel primitives (Map, Reduce, Scan, Histogram, and Compact) as CUDA kernels. The exercises cover squaring the elements of an array, maximum reduction, and histogram construction. They emphasize synchronization between threads and the use of dynamically sized shared memory. Nsight and the Visual Profiler are used to analyze the performance of the histogram kernels and to look for optimization strategies.
Parallel primitives • Map • Reduce • Scan • Histogram • Compact
Map
// TODO
//
// ID := threadIdx.x + blockIdx.x * blockDim.x
// IF ID >= dataSize THEN return
// data[ID] := square(data[ID])
__global__ void mapKernel(int* data, unsigned int dataSize)
{
    // TODO
}
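One possible way to fill in the Map TODO is sketched below. It follows the pseudocode line by line; the bounds check uses `id >= dataSize`, since index `dataSize` is already one past the end of the array. The host helper `squareOnGpu` and the block size of 256 are assumptions made for illustration and are not part of the slides; CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread squares one element; threads past the end of the array exit early.
__global__ void mapKernel(int* data, unsigned int dataSize)
{
    unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id >= dataSize) return;
    data[id] = data[id] * data[id];
}

// Hypothetical host helper (not from the slides): copies the input to the
// device, launches the kernel, and copies the result back.
std::vector<int> squareOnGpu(std::vector<int> values)
{
    const unsigned int n = static_cast<unsigned int>(values.size());
    if (n == 0) return values;

    int* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const unsigned int blockSize = 256;                        // assumed block size
    const unsigned int gridSize = (n + blockSize - 1) / blockSize;  // round up
    mapKernel<<<gridSize, blockSize>>>(dData, n);

    cudaMemcpy(values.data(), dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return values;
}
```

Rounding the grid size up is what makes the bounds check necessary: the last block usually contains threads that have no element to process.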
Reduce
// TODO
//
// FOR s = dataSize / 2 ; s > 0 ; s >>= 1 DO:
//   IF (ID < s)
//     data[ID] = max(data[ID], data[ID + s])
//   SYNCHRONIZE THREADS
//
__global__ void reduceKernel(float* data, int dataSize)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // TODO
}
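The Reduce pseudocode could be completed roughly as follows. This is a sketch under two assumptions that the slide does not state: `dataSize` is a power of two, and the kernel runs in a single thread block, because `__syncthreads()` only synchronizes the threads of one block. The host helper `maxOnGpu` is hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// In-place tree reduction: after the loop, data[0] holds the maximum.
// Assumes a single block and a power-of-two dataSize.
__global__ void reduceKernel(float* data, int dataSize)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int s = dataSize / 2; s > 0; s >>= 1) {
        if (id < s) {
            data[id] = fmaxf(data[id], data[id + s]);
        }
        // The barrier sits outside the branch so that every thread reaches it.
        __syncthreads();
    }
}

// Hypothetical host helper (not from the slides). Expects a non-empty input
// whose length is a power of two and at most 2048 (one block of 1024 threads).
float maxOnGpu(std::vector<float> values)
{
    const int n = static_cast<int>(values.size());
    float* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemcpy(dData, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = std::max(1, n / 2);   // one thread per pair in the first step
    reduceKernel<<<1, threads>>>(dData, n);

    float result = 0.0f;
    cudaMemcpy(&result, dData, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return result;
}
```

In each step a thread writes only `data[id]` with `id < s` and reads `data[id + s]` with `id + s >= s`, so no thread reads a location that another thread writes in the same step.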
Histogram
// TODO
//
// histogram[data[id]] := histogram[data[id]] + 1
// SYNCHRONIZATION! (atomicAdd)
__global__ void histogramGlobalKernel(int* data, int* histogram, int dataSize)
{
    // TODO
}
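A minimal sketch of the global-memory histogram is shown below. It assumes that every value in `data` is a valid bin index, that is, `0 <= data[id] < histogramSize`, and that the histogram has been zeroed before the launch. The host helper `histogramOnGpu` and the block size are assumptions for illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per input element; atomicAdd serializes concurrent updates
// that hit the same bin.
__global__ void histogramGlobalKernel(int* data, int* histogram, int dataSize)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < dataSize) {
        atomicAdd(&histogram[data[id]], 1);
    }
}

// Hypothetical host helper (not from the slides).
std::vector<int> histogramOnGpu(const std::vector<int>& values, int histogramSize)
{
    const int n = static_cast<int>(values.size());
    std::vector<int> result(histogramSize, 0);
    if (n == 0) return result;

    int* dData = nullptr;
    int* dHist = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dHist, histogramSize * sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, histogramSize * sizeof(int));   // bins start at zero

    const int blockSize = 256;                            // assumed block size
    const int gridSize = (n + blockSize - 1) / blockSize;
    histogramGlobalKernel<<<gridSize, blockSize>>>(dData, dHist, n);

    cudaMemcpy(result.data(), dHist, histogramSize * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dHist);
    return result;
}
```

Without `atomicAdd`, two threads that read the same bin, add one, and write back would lose one of the increments.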
Histogram (shared)
// TODO
//
// IF LID < histogramSize DO:
//   lhistogram[LID] := 0
// SYNCHRONIZE THREADS
//
// Add data to local histogram
//
// SYNCHRONIZE THREADS
//
// IF LID < histogramSize DO:
//   histogram[LID] = lhistogram[LID]
__global__ void histogramLocalKernel(int* data, int* histogram, int dataSize, int histogramSize)
{
    extern __shared__ int lhistogram[];
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int lid = threadIdx.x;
    // TODO
}
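The shared-memory variant might look like the sketch below. It departs from the pseudocode in one place: the final write-back uses `atomicAdd` rather than a plain assignment, on the assumption that several blocks run and each contributes a partial histogram. It also assumes `histogramSize <= blockDim.x`, valid bin indices, and a zeroed global histogram. The host helper `histogramSharedOnGpu` is hypothetical; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void histogramLocalKernel(int* data, int* histogram, int dataSize, int histogramSize)
{
    extern __shared__ int lhistogram[];   // histogramSize ints, sized at launch
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int lid = threadIdx.x;

    // Clear the block-local histogram.
    if (lid < histogramSize) lhistogram[lid] = 0;
    __syncthreads();

    // Accumulate into shared memory; atomics here only contend within the block.
    if (id < dataSize) atomicAdd(&lhistogram[data[id]], 1);
    __syncthreads();

    // Merge the block-local result into the global histogram.
    if (lid < histogramSize) atomicAdd(&histogram[lid], lhistogram[lid]);
}

// Hypothetical host helper (not from the slides).
std::vector<int> histogramSharedOnGpu(const std::vector<int>& values, int histogramSize)
{
    const int n = static_cast<int>(values.size());
    std::vector<int> result(histogramSize, 0);
    if (n == 0) return result;

    int* dData = nullptr;
    int* dHist = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dHist, histogramSize * sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, histogramSize * sizeof(int));

    const int blockSize = 256;                            // assumed; must be >= histogramSize
    const int gridSize = (n + blockSize - 1) / blockSize;
    const size_t sharedBytes = histogramSize * sizeof(int);   // dynamic shared memory size
    histogramLocalKernel<<<gridSize, blockSize, sharedBytes>>>(dData, dHist, n, histogramSize);

    cudaMemcpy(result.data(), dHist, histogramSize * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dHist);
    return result;
}
```

The third launch parameter sets the size of the `extern __shared__` array. Each block then issues at most `histogramSize` atomic operations on global memory instead of one per input element.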
Scan (exclusive)
// TODO
//
// IF ID > 0 THEN data[ID] = data[ID - 1]
// ELSE data[ID] = 0
// SYNCHRONIZE THREADS
//
// FOR s = 1; s < dataSize; s *= 2 DO:
//   tmp := data[ID]
//   IF (ID + s < dataSize) THEN
//     data[ID + s] += tmp;
//   SYNCHRONIZE THREADS
//
// IF (ID = 0) THEN data[ID] = 0;
__global__ void exscanKernel(int* data, int dataSize)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // TODO
}
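A sketch of the exclusive scan that follows the pseudocode is given below, with one addition: an extra barrier separates each read from the write that follows it, so that no thread overwrites a value that a neighbour has not read yet. The sketch assumes a single thread block with at least `dataSize` threads. The host helper `exclusiveScanOnGpu` is hypothetical; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// In-place exclusive scan in global memory. Assumes one block, dataSize <= blockDim.x.
__global__ void exscanKernel(int* data, int dataSize)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    // Shift right by one: read first, write only after every thread has read.
    int prev = (id > 0 && id < dataSize) ? data[id - 1] : 0;
    __syncthreads();
    if (id < dataSize) data[id] = prev;
    __syncthreads();

    // Hillis-Steele steps: each element is added to the one s positions ahead.
    for (int s = 1; s < dataSize; s *= 2) {
        int tmp = (id < dataSize) ? data[id] : 0;
        __syncthreads();                       // all reads finish before any write
        if (id + s < dataSize) data[id + s] += tmp;
        __syncthreads();
    }

    // Kept from the pseudocode; data[0] is already 0 after the shift.
    if (id == 0 && dataSize > 0) data[0] = 0;
}

// Hypothetical host helper (not from the slides). Expects at most 1024 elements.
std::vector<int> exclusiveScanOnGpu(std::vector<int> values)
{
    const int n = static_cast<int>(values.size());
    if (n == 0) return values;

    int* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    exscanKernel<<<1, n>>>(dData, n);          // single block, one thread per element
    cudaMemcpy(values.data(), dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return values;
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 the shifted array is 0, 3, 1, 7, 0, 4, 1, 6, and its inclusive scan 0, 3, 4, 11, 11, 15, 16, 22 is the exclusive scan of the original.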
Scan (exclusive) • Modify the scan so that it works on data sets of arbitrary size
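One simple way to lift the size limit, sketched here as an assumption rather than as the intended solution, is to keep a single block and let it walk over the array in tiles of `blockDim.x` elements, carrying the running total from tile to tile. The names `exscanTiledKernel` and `exclusiveScanAnySizeOnGpu` are hypothetical; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Exclusive scan of an array of any length using one thread block.
__global__ void exscanTiledKernel(int* data, int dataSize)
{
    extern __shared__ int tile[];   // blockDim.x ints, sized at launch
    __shared__ int carry;           // sum of all tiles processed so far
    const int lid = threadIdx.x;
    const int blockSize = blockDim.x;

    if (lid == 0) carry = 0;
    __syncthreads();

    // The loop bounds are identical for all threads, so the barriers are safe.
    for (int base = 0; base < dataSize; base += blockSize) {
        const int id = base + lid;
        const int value = (id < dataSize) ? data[id] : 0;   // pad the last tile with 0
        tile[lid] = value;
        __syncthreads();

        // Inclusive Hillis-Steele scan of the tile in shared memory.
        for (int s = 1; s < blockSize; s *= 2) {
            const int tmp = (lid >= s) ? tile[lid - s] : 0;
            __syncthreads();
            tile[lid] += tmp;
            __syncthreads();
        }

        // Exclusive result = carry + inclusive result - own value.
        if (id < dataSize) data[id] = carry + tile[lid] - value;
        __syncthreads();                                     // everyone has read carry
        if (lid == blockSize - 1) carry += tile[lid];        // add the tile total
        __syncthreads();
    }
}

// Hypothetical host helper (not from the slides).
std::vector<int> exclusiveScanAnySizeOnGpu(std::vector<int> values)
{
    const int n = static_cast<int>(values.size());
    if (n == 0) return values;

    int* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int blockSize = 256;                               // assumed tile size
    exscanTiledKernel<<<1, blockSize, blockSize * sizeof(int)>>>(dData, n);

    cudaMemcpy(values.data(), dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return values;
}
```

A single block leaves most of the GPU idle. A multi-block design would typically scan each block separately, scan the per-block totals, and then add each total back to its block; that variant is not shown here.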
Profiling • Examine the histogram kernels with Nsight Performance Analysis. • Examine them with the Visual Profiler as well. • In what ways can the histogram algorithm be optimized?