
Programming Massively Parallel Graphics Processors


Presentation Transcript


  1. Programming Massively Parallel Graphics Processors Andreas Moshovos Winter 2009

  2. Goals:
   • Graphics processors
   • Learn how to program GPUs
   • Learn how to get performance out of GPUs
   • Understand GPU architecture and limitations
   • CUDA: Compute Unified Device Architecture (NVIDIA)
   • How:
   • Weekly assignments for the first few weeks
   • A large team project
   • Ideal scenario: non-ECE/CS people team up with ECE/CS people and attack an interesting problem

  3. What is a GPU?
   • A specialized processor for graphics
   • Embarrassingly parallel: lots of read data, calculate, write
   • Used to be fixed-function
   • Becoming more programmable
  What is CUDA?
   • A C extension for programming NVIDIA GPUs
   • Straightforward to learn
   • The challenge is in getting performance

  4. Sequential Execution Model

   int a[N]; // N is large
   for (i = 0; i < N; i++)
      a[i] = a[i] * fade;

   • One flow of control (one thread): one instruction at a time
   • Optimizations possible at the machine level

  5. Data Parallel Execution Model / SIMD

   int a[N]; // N is large
   for all elements do in parallel
      a[index] = a[index] * fade;

  6. Single Program Multiple Data / SPMD

   int a[N]; // N is large
   for all elements do in parallel
      if (a[i] > threshold) a[i] *= fade;
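
  In CUDA, this SPMD model becomes a kernel that every thread executes on its own element; a minimal sketch (the kernel name and the fade/threshold parameters are illustrative, not from the slides):

   __global__ void fade_spmd (float *a, float fade, float threshold, int N)
   {
      // Each thread computes the index of its own element.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // Threads for which the condition is false simply do nothing.
      if (i < N && a[i] > threshold)
         a[i] *= fade;
   }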

  7. Programmer’s view – Typical System (if you care about performance)
   [Figure: CPU with registers and caches connected to memory]
   • CPU-to-memory bandwidth: 12.8 GB/sec – 31.92 GB/sec, 8B per transfer

  8. Programmer’s view with GPU
   [Figure: CPU connected to the GPU over a 3 GB/sec link; GPU connected to its own memory at 141 GB/sec]
   • CPU-to-memory bandwidth: 12.8 GB/sec – 31.92 GB/sec, 8B per transfer
   • GPU memory: 1GB on our systems

  9. Programmer’s view with GPU
   • Copy input data to GPU memory
   • Launch GPU threads
   • Synchronize with the GPU
   • Copy results from GPU memory
   (A host-side sketch of these steps follows.)
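
  A minimal host-side sketch of the four steps (the variable names, SIZE, kernel, and launch configuration are illustrative assumptions; the API calls match the CUDA 2.x era of these slides):

   cudaMalloc ((void **) &d_a, SIZE);                   // allocate GPU memory
   cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice); // 1. copy to GPU memory
   kernel <<< n_blocks, block_size >>> (d_a, N);        // 2. launch GPU threads
   cudaThreadSynchronize ();                            // 3. wait until the GPU is done
   cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost); // 4. copy from GPU memory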

  10. Structure: CPU vs. GPU

  11. But what about performance?
   • Focus on PEAK performance first: what the manufacturer guarantees you’ll never exceed
   • Two aspects:
   • Data access rate capability: bandwidth
   • Data processing capability: how many ops per second

  12. Data Processing Capability
   • Focus on floating-point data
   • GFLOPS: billion floating-point operations per second
   • Caveat: not all FP operations are equal
   • But today things are not as bad as they used to be
   • High-end CPU today: 3.4 GHz x 8 FLOPs/cycle = 27.2 GFLOPS
   • Assumes SSE
   • High-end GPU today / GTX280: 933.1 GFLOPS, or about 34x the CPU

  13. Data Access Capability
   • High-end CPU today: 31.92 GB/sec (Nehalem) – 12.8 GB/sec (Harpertown)
   • Bus width: 64-bit
   • GPU / GTX280: 141.7 GB/sec
   • Bus width: 512-bit
   • A 4.44x – 11.1x advantage for the GPU

  14. GPU vs. CPU

  15. GPU vs. CPU

  16. What does the programmer need to know?
   • Many details about the architecture
   • But fortunately most of it is simple

  17. Programmer’s view: GPU Architecture

  18. My first CUDA Program

   // Runs on the GPU:
   __global__ void arradd (float *a, float f, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] + f;
   }

   // Runs on the CPU:
   int main()
   {
      float h_a[N];
      float *d_a;
      cudaMalloc ((void **) &d_a, SIZE);
      cudaThreadSynchronize ();
      cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
      arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
      CUDA_SAFE_CALL (cudaFree (d_a));
   }
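
  The launch above works only if n_blocks and block_size together cover all N elements; a common way to set them (a sketch, these lines are not on the slide):

   int block_size = 256;                              // threads per block; illustrative choice
   int n_blocks = (N + block_size - 1) / block_size;  // round up so every element gets a thread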

  19. Threads / Blocks / Grid
   • Block size = 12, #blocks = 5
   • Block 0 covers a[0] … a[11]
   • …
   • Block 4 covers a[48] … a[59]
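
  Working through the index arithmetic from the arradd kernel above: with blockDim.x = 12, the last thread of the last block computes

   i = blockIdx.x * blockDim.x + threadIdx.x = 4 * 12 + 11 = 59

  so it handles a[59], the last element of the array.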

  20. Memory Hierarchy
   • Per-thread local variables: anything declared inside the kernel
   • Per-block shared memory: __shared__ int …
   • Global (device-wide) memory: __device__ int …, declared at file scope
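
  A small sketch showing where each kind of declaration lives (the kernel and variable names are illustrative, not from the slides):

   __device__ int d_flag;              // global (device) memory: visible to all threads in all blocks

   __global__ void memdemo (int *data)
   {
      __shared__ int tile[128];        // shared memory: one copy per block, shared by its threads
      int local = data[threadIdx.x];   // local variable: private to each thread
      tile[threadIdx.x] = local;
      __syncthreads ();                // make the shared-memory writes visible within the block
      if (local > 0) d_flag = 1;       // write to global memory
      data[threadIdx.x] = tile[threadIdx.x];
   }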

  21. Performance: Programmer’s view (Mark Silberstein, Technion)

  22. CUDA keywords, etc.
   • Declspecs: global, device, shared, local, constant
   • Keywords: threadIdx, blockIdx
   • Intrinsics: __syncthreads
   • Runtime API: memory, symbol, and execution management; function launch

   __device__ float filter[N];

   __global__ void convolve (float *image)
   {
      __shared__ float region[M];
      ...
      region[threadIdx.x] = image[i];
      __syncthreads ();
      ...
      image[j] = result;
   }

   // Allocate GPU memory
   float *myimage;
   cudaMalloc ((void **) &myimage, bytes);
   cudaThreadSynchronize ();

   // 100 blocks, 10 threads per block
   convolve <<< 100, 10 >>> (myimage);

  23. Floating-Point Caveats
   • Single-precision floating-point support is not 100% IEEE 754 compliant
   • No denormals, fixed rounding modes
   • Must check that the SNR of the results remains acceptable (see the sketch below)
   • But there are lots of SP FP units
   • The GTX280 supports double precision
   • But there are very few DP units
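
  One simple check is to compare the GPU results against a CPU reference; a minimal sketch (h_ref, h_a, and the error metric are illustrative assumptions, not from the slides):

   // Maximum relative error of GPU results (h_a) vs. a CPU reference (h_ref).
   float max_rel_err = 0.0f;
   for (int i = 0; i < N; i++) {
      float err = fabsf (h_a[i] - h_ref[i]);
      if (h_ref[i] != 0.0f) err /= fabsf (h_ref[i]);   // relative error where possible
      if (err > max_rel_err) max_rel_err = err;
   }
   printf ("max relative error: %g\n", max_rel_err);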

  24. Development Process
   • Course specific: get an account on the eecg network
   • Fill in your name / student ID / current e-mail on the list
   • Wait until confirmation is received
   • Machines: ug51.eecg through ug75.eecg.utoronto.ca
   • SF2204, keycode: _______

  25. Development Process
   • Once you are on a ugxx machine: source /cad1/CUDA/cuda.csh
   • That will create an NVIDIA_CUDA_SDK directory
   • Go in and type “make dbg=1”
   • This builds several examples under bin/linux/debug
   • The source code is in the projects subdir
   • We’ll post a handout soon on the course website

  26. Development Process
   • Create a xxxx.cu file
   • Compile it with nvcc (a Makefile is provided by the SDK); see the example below
   • nvcc is a compiler driver: it separates and preprocesses host and device code
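
  For a single file, nvcc can also be invoked directly (the file and program names are illustrative):

   nvcc -o arradd arradd.cu
   ./arradd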

  27. So, why would parallel processing work now?
   • Parallel processing and programming have been around for a while
   • The golden age was the ’80s
   • It didn’t take hold then:
   • Programming is hard
   • Hardware was expensive
   • Single-processor performance was doubling every 18 months
   • Why would it work now?
   • Cost, and single-processor performance is no longer doubling
   • Not a done deal at all  Programming is still hard

  28. Course Staff
   • Andreas Moshovos
   • EA310, 416-946-7373
   • moshovos@eecg.toronto.edu
   • www.eecg.toronto.edu/~moshovos
   • TA: Hassan Shojania
   • hassan@eecg.toronto

  29. Course Structure
   • Until the end of February: weekly assignments
   • CUDA programming
   • GTX280 architecture
   • CUDA performance
   • Floating point
   • March: project proposal and work
   • Case studies
   • General parallel programming guidelines
   • April: project presentations
   • Make-up lectures?
