
Complete Unified Device Architecture


Presentation Transcript


  1. Complete Unified Device Architecture A Highly Scalable Parallel Programming Framework Submitted in partial fulfillment of the requirements for the Maryland high school diploma Andrew “Shirley” Das Sarma (Calico Cannonballs McMullins), Blair Computational Methods 2009

  2. Background: Why CUDA? Scientific Computing • A large computer market • Arithmetic-intensive • Huge datasets • Distributed • Parallel

  3. Background: Why CUDA? Moore’s Law • Transistors double every 24 months • Slowing down? • New tricks • Multicore • Multi-node • Metrics • Transistors per circuit • Performance per unit cost

  4. Background: Why CUDA? CPU vs. GPU • CPUs optimized for general workload • More instructions per second • Pipelining, lookahead branch prediction, etc. • GPUs optimized for parallel calculations • 1 pixel shader = 1 thread • Lots of pixel shaders • Lots of arithmetic • On-card DRAM

  5. Background: Why CUDA? CPU vs. GPU

  6. Background: Why CUDA? CPU vs. GPU In terms of raw computing power, GPUs surpass CPUs.

  7. What is CUDA? • GPGPU (not just graphics, or no graphics) • Runs on CPU and GPU • High-level language • Extension of C • FORTRAN coming soon • One compiler • Only NVIDIA so far • Tesla • Larrabee • Unfathomably cool

  8. How it works • C language extension • Language constructs • Keywords • Low-overhead threads • Independent blocks • CPU or GPU: choose one • CPU good for sequential or non-numerical tasks • GPU good for highly parallel calculations

  9. GPU block diagram (figure): the host feeds an input assembler and a thread execution manager, which dispatch work to an array of multiprocessors, each with its own parallel data cache and texture unit; load/store units connect the multiprocessors to global memory.

  10. CUDA: A C extension • Declspecs: __host__, __global__, __device__ • Keywords: blockIdx, threadIdx, etc. • Intrinsics: __syncthreads() • Runtime API • cudaMalloc() • cudaMemcpy() • etc. • Kernel launch: kernel<<<blocks,threads>>>()
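A minimal sketch (not from the original slides) tying these constructs together; the kernel and array names are invented for illustration:

//increment.cu: a made-up kernel using the constructs listed above
#include <cuda_runtime.h>

__global__ void incrementKernel(int * A)              //declspec: GPU code launched from the CPU
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;      //built-in keywords identify this thread
    A[i] += 1;
}

__host__ int main()
{
    const int N = 1024;
    size_t sz = N*sizeof(int);
    int * d_A;
    cudaMalloc((void **)&d_A, sz);                    //runtime API: device allocation
    cudaMemset(d_A, 0, sz);                           //runtime API: zero the device array
    incrementKernel<<<N/256, 256>>>(d_A);             //kernel launch: <<<blocks, threads>>>
    cudaThreadSynchronize();                          //wait for the GPU to finish
    cudaFree(d_A);
    return 0;
}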

  11. CUDA: A C extension Compilation flow: integrated source (foo.cu) goes to nvcc/cudacc, whose EDG C/C++ front end separates host and device code; the CPU host code (foo.cpp) is compiled by gcc / cl, while the device code passes through the Open64 global optimizer to GPU assembly (foo.s) and then through OCG to G80 SASS (foo.sass).

  12. Background: Pointers • Pointer: a variable that holds the address of some other data in memory • malloc(size_t sz) returns a pointer to sz bytes of available memory • To declare a 20-element int array: int * A = (int *) malloc(20*sizeof(int));
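A quick plain-C illustration of the bullets above (the names here are invented, not from the slides):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int * A = (int *) malloc(20*sizeof(int));   //A holds the address of 20 ints
    for(int i=0; i<20; i++) A[i] = i*i;         //write through the pointer
    printf("%d\n", A[4]);                       //indexing dereferences: prints 16
    free(A);                                    //give the memory back when done
    return 0;
}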

  13. Background: Threads • A thread is a sequence of instructions • A core runs one thread at a time • Multicore CPUs run a few at once • A desktop computer juggles thousands of threads, usually on fewer than 4 cores • A GPU comfortably runs millions of threads • Hundreds of cores

  14. CUDA execution model • Arrays of parallel threads • Each thread executes the same code • Work determined by threadIdx, blockIdx, blockDim, gridDim • Blocks: collections of threads • Threads in a block can cooperate and share fast local memory • No inter-block cooperation • 1D, 2D, or 3D block/thread numbering
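A small sketch (mine, not the author's) of how a thread turns threadIdx, blockIdx, and blockDim into a unique piece of work; the bounds check covers the case where the array length is not a multiple of the block size:

__global__ void scaleKernel(float * A, float s, int n)
{
    //each thread computes one global index from its block and thread coordinates
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n)            //guard: the last block may run past the end of the array
        A[i] *= s;
}

//launched with enough blocks to cover n elements, e.g.:
//  scaleKernel<<<(n + 255)/256, 256>>>(d_A, 2.0f, n);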

  15. CUDA execution model

  16. CUDA execution model • All functions are declared __host__, __global__, or __device__ • Host: Runs on CPU, called from CPU • Global: Runs on GPU, called from CPU • Device: Runs on GPU, called from GPU
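A sketch of the three qualifiers side by side (function names invented for illustration):

__device__ float square(float x)        //GPU code, callable only from GPU code
{
    return x*x;
}

__global__ void squareAll(float * A)    //GPU code, launched from the CPU
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    A[i] = square(A[i]);
}

__host__ void launchSquareAll(float * d_A, int n)   //ordinary CPU code
{
    squareAll<<<n/256, 256>>>(d_A);     //assumes n is a multiple of 256
}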

  17. CUDA memory model • Global memory • Faster than CPU memory • Slower than cache • Accessible by all threads • Block shared memory • Small-ish, fast, shared by threads in a block • Thread memory • Small, fast, local • Texture memory • Small, fast, global
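A sketch (assumptions mine, texture memory not shown) of where these memory spaces appear in a kernel, here a per-block sum written back to global memory; T is taken to be the threads-per-block constant used later in the deck:

#define T 256                             //threads per block (assumed)

__global__ void blockSum(float * in, float * out)    //in, out point to global memory
{
    __shared__ float buf[T];              //block shared memory: fast, visible to the whole block
    int i = blockIdx.x*blockDim.x + threadIdx.x;     //i lives in thread-local memory (a register)
    buf[threadIdx.x] = in[i];
    __syncthreads();                      //wait until every thread in the block has stored
    if(threadIdx.x == 0) {
        float s = 0;                      //also thread-local
        for(int j=0; j<T; j++) s += buf[j];
        out[blockIdx.x] = s;              //one result per block goes back to global memory
    }
}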

  18. Example: SAXPY (Y = aX + Y)

__host__ void SAXPYCPU(float * X, float * Y, float a, int N)
{
    for(int i=0; i<N; i++)
        Y[i] = a*X[i] + Y[i];
}

__global__ void SAXPYGPU(float * X, float * Y, float a)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    Y[i] = a*X[i] + Y[i];
}

(continued)

  19. Example: SAXPY

__host__ int main()
{
    int N = 1073741824;                    //2^30 ≈ 1 billion
    size_t sz = N * sizeof(float);         //bytes we need
    float * h_X = (float *) malloc(sz);    //allocate the
    float * h_Y = (float *) malloc(sz);    //host memory
    /*some code to fill up h_X and h_Y*/
    float * d_X, * d_Y;
    cudaMalloc((void **)&d_X, sz);         //allocate the
    cudaMalloc((void **)&d_Y, sz);         //device memory
    //move the data onto the GPGPU
    cudaMemcpy(d_X, h_X, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, h_Y, sz, cudaMemcpyHostToDevice);

(continued)

  20. Example: SAXPY

    //data is on the device; time to do some SAXPY
    int threadsPerBlock = 256;
    int blocks = N / threadsPerBlock;
    SAXPYGPU<<<blocks, threadsPerBlock>>>(d_X, d_Y, 2);
    cudaThreadSynchronize();               //wait until done
    cudaMemcpy(h_Y, d_Y, sz, cudaMemcpyDeviceToHost);
    cudaFree(d_X); cudaFree(d_Y);          //we no longer need the device memory
}

  21. Example: SAXPY That was easy.

  22. Example: 2D integration Simpson 2D coefficient matrix (figure) Our function: f(x,y) = e^(xy) (x+y+π)^(-1/2) sin(log(x-y+π)) Want ∫∫ f(x,y) dA over |x|,|y| ≤ 1
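The coefficient matrix itself does not survive in the transcript; the pattern it encodes is presumably the tensor product of the 1D composite Simpson weights, which is what the nested conditionals in the code below select:

∫∫ f dA ≈ (h_x·h_y/9) · Σ_i Σ_j w_i·w_j·f(x_i, y_j),   w = (1, 4, 2, 4, ..., 2, 4, 1),
h_x = (x_f − x_0)/N,   h_y = (y_f − y_0)/N

The products w_i·w_j give 1 at the corners, 2 or 4 on the edges, and 4, 8, or 16 in the interior, matching the coefficients in the code; for the square |x|,|y| ≤ W, the factor h_x·h_y/9 = 4W²/(9N²), matching the normalization on slide 23.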

  23. Example: 2D integration

__host__ int main()
{
    int B = N/T;                     //(N+1)^2 = points, T = threads, B = blocks
    size_t sz = B*N*sizeof(dtyp);    //dtyp is typedef’d
    dtyp * d, * h = (dtyp *) malloc(sz);
    cudaMalloc((void **)&d, sz);
    dim3 Threads(T);
    dim3 Grid(B, N);
    //W = bound of integration
    S2DGPU<<<Grid, Threads>>>(-W, W, -W, W, d);   //INVOKE
    cudaThreadSynchronize();         //wait for it to finish
    cudaMemcpy(h, d, sz, cudaMemcpyDeviceToHost);
    cudaFree(d);
    dtyp u = 0;
    for(int i=0; i<B*N; i++)
        u += h[i];                   //sigma the different results
    u += f2(W, W);                   //algorithm misses last point
    u *= (dtyp)4*W*W/(9*N*N);        //normalize
}

  24. Example: 2D integration

__host__ void S2DCPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp * a)
{
    *a = 0;
    dtyp x = x0, y;
    for(int i=0; i<=N; i++) {
        y = y0;
        for(int j=0; j<=N; j++) {
            bool c1 = i==0||i==N, c2 = j==0||j==N;
            *a += (c1?(c2?1:(j%2==0?2:4)):
                   (i%2==0?(c2?2:(j%2==0?4:8)):
                    (c2?4:(j%2==0?8:16))))*f2(x,y);
            y += (yf-y0)/N;
        }
        x += (xf-x0)/N;
    }
}

  25. Example: 2D integration

__global__ void S2DGPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp * a)
{
    int X = blockIdx.x*blockDim.x + threadIdx.x;
    int Y = blockIdx.y;
    dtyp x = x0 + (xf-x0)*X/(gridDim.x*blockDim.x);
    dtyp y = y0 + (yf-y0)*Y/gridDim.y;
    __shared__ dtyp u[T];
    bool evx = (X&1)==0, evy = (Y&1)==0;
    u[threadIdx.x] = (X==0?(Y==0?1:(evy?2:4)):
                      (evx?(Y==0?2:(evy?4:8)):
                       (Y==0?4:(evy?8:16))))*F(x,y);
    if(threadIdx.x==0)
        if(blockIdx.x==0)
            u[threadIdx.x] += (blockIdx.y==0?1:
                ((blockIdx.y&1)==0?2:4))*F(xf,y);
        else if(blockIdx.x==1)
            u[threadIdx.x] += (blockIdx.y==0?1:
                ((blockIdx.y&1)==0?2:4))*F(x0+(xf-x0)*blockIdx.y/gridDim.y, yf);
    __syncthreads();
    if(threadIdx.x==0) {
        for(int i=1; i<T; i++)
            u[0] += u[i];
        a[blockIdx.x*gridDim.y + Y] = u[0];
    }
}

  26. Next-gen GPGPUs
Intel Xeon 5500: • 4 cores @ 3.2 GHz • Up to 192 GB DRAM (*memory not included) • 64 GB/s memory bandwidth • ~50 GFLOPS • 130 W (2.6 W/GFLOPS) • $2,300 ($46/GFLOPS)
NVIDIA Tesla S1070: • 960 cores @ 1.44 GHz • 16 GB DRAM (no more, no less) • 506 GB/s memory bandwidth • 4000 GFLOPS • 800 W (0.2 W/GFLOPS) • $4,000 ($1/GFLOPS)

  27. Runtime data: 2D integration Note: All times are in milliseconds.
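The measurement code is not part of the transcript; a typical way to collect kernel times in milliseconds is CUDA events, roughly as below, reusing the names from slide 23 (this snippet would sit inside that main):

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);                //timestamp before the launch
    S2DGPU<<<Grid, Threads>>>(-W, W, -W, W, d);
    cudaEventRecord(stop, 0);                 //timestamp after the launch
    cudaEventSynchronize(stop);               //wait for the kernel to finish
    cudaEventElapsedTime(&ms, start, stop);   //elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);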
