AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Wenjing Ma, Gagan Agrawal
The Ohio State University

GPGPU
• General-purpose programming on GPUs (accelerators)
• High performance/price ratio
• High-level language support
  • CUDA
• Performance vs. productivity
  • Hard to program
  • Memory hierarchy to manage
  • ...

Get High Performance from GPU, and Make the Programming Simple!
• Automatic code generation
• Device memory access is expensive
  • Using shared memory
  • Texture and constant memory
  • Coalescing device memory access
  • ...

FEATURES OF SHARED MEMORY
• Small and fast, like a cache
  • 16 KB on each multiprocessor (no more than 48 KB even on the latest GPU)
• Read-write
• Software controlled
  • __shared__ float data[n][n];
• Allocating shared memory
  • Similar to register allocation

Problem Formulation for Shared Memory Arrangement
• Consider variables and basic blocks in a function
  • A variable can be an array element, an array, or a section of an array
• Each variable can have several live ranges in the function
  • Access feature of a live range: read, write, read-write, temp
• Determine in which basic block a variable is allocated to shared memory
  • assign_point[i][k]: variable i, basic block k

Integer Programming Problem
• Integer linear programming
  • Objective function: maximize z = c^T x
  • Constraints
  • Solution: values of x
• Special case of linear programming
  • All the unknown variables are integers (0-1 in our case)
• Solvable for problems of reasonable size
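To make the 0-1 formulation concrete, the following is a minimal C++ sketch that maximizes c^T x by enumerating every 0-1 vector x. The constraint form A x <= b, the names Solution01 and solve01, and the sample numbers in the usage note are illustrative assumptions, not taken from the slides. A production solver would use branch-and-bound or a similar method rather than enumeration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of the exhaustive search: best objective value and the 0-1 vector attaining it.
struct Solution01 {
    long best_value;
    std::vector<int> x;
};

// Hypothetical helper: maximize c^T x subject to A x <= b, with every x[i] in {0, 1}.
// ASSUMPTION: the constraints have the form A x <= b and c is non-negative.
// Enumerates all 2^n assignments, so it is only usable for small n (n < 31).
// Returns best_value = -1 and an empty x when no assignment is feasible.
Solution01 solve01(const std::vector<long>& c,
                   const std::vector<std::vector<long>>& A,
                   const std::vector<long>& b) {
    const std::size_t n = c.size();
    Solution01 best{-1, {}};
    for (std::uint32_t mask = 0; mask < (std::uint32_t{1} << n); ++mask) {
        // Check every constraint row for this candidate assignment.
        bool feasible = true;
        for (std::size_t row = 0; row < A.size() && feasible; ++row) {
            long lhs = 0;
            for (std::size_t i = 0; i < n; ++i) {
                if (mask & (std::uint32_t{1} << i)) lhs += A[row][i];
            }
            feasible = (lhs <= b[row]);
        }
        if (!feasible) continue;

        // Evaluate the objective and decode the bit mask into x.
        long value = 0;
        std::vector<int> x(n, 0);
        for (std::size_t i = 0; i < n; ++i) {
            if (mask & (std::uint32_t{1} << i)) {
                value += c[i];
                x[i] = 1;
            }
        }
        if (value > best.best_value) best = {value, x};
    }
    return best;
}
```

For example, with c = (5, 4, 3), a single constraint row (2, 3, 1) and bound 4, the search selects the first and third variables.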

Integer Programming for Shared Memory Arrangement
• Objective function
  • Maximize shared memory usage
  • Minimize data transfer between levels of the memory hierarchy

Integer Programming for Shared Memory Arrangement (cont'd)
• Objective function

An Example to Show size_alloc

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Integer Programming for Shared Memory Arrangement (cont'd)
• Constraints
  • Total allocation does not exceed the shared memory limit at any time
  • At most one assign_point is 1 in each live range
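The two constraints can be checked for a candidate assignment with a short C++ sketch. The LiveRange struct, the function name satisfies_constraints, and the per-variable size_bytes vector are hypothetical names introduced for illustration. The sketch also assumes that a variable stays in shared memory from its assign point to the end of that live range, which the slides do not state explicitly.

```cpp
#include <cstddef>
#include <vector>

// One live range of a variable, covering basic blocks first_block..last_block (inclusive).
struct LiveRange {
    int var;
    int first_block;
    int last_block;
};

// Returns true when a candidate 0-1 assignment satisfies both constraints.
// assign_point[i][k] == 1 means variable i is placed in shared memory at basic block k.
// ASSUMPTION: once placed, the variable stays resident until its live range ends.
bool satisfies_constraints(const std::vector<std::vector<int>>& assign_point,
                           const std::vector<LiveRange>& live_ranges,
                           const std::vector<std::size_t>& size_bytes,
                           int num_blocks,
                           std::size_t limit_bytes) {
    std::vector<std::size_t> usage(num_blocks, 0);  // bytes resident in each basic block
    for (const LiveRange& lr : live_ranges) {
        int assigned_at = -1;
        for (int k = lr.first_block; k <= lr.last_block; ++k) {
            if (assign_point[lr.var][k] == 1) {
                if (assigned_at != -1) return false;  // second assign point in one live range
                assigned_at = k;
            }
        }
        if (assigned_at == -1) continue;  // this live range stays in device memory
        for (int k = assigned_at; k <= lr.last_block; ++k) {
            usage[k] += size_bytes[lr.var];
        }
    }
    // The total allocation must fit in shared memory in every basic block.
    for (std::size_t bytes : usage) {
        if (bytes > limit_bytes) return false;
    }
    return true;
}
```

With two variables of 10000 and 8000 bytes whose live ranges overlap, placing both in shared memory exceeds a 16 KB limit, so the check rejects that assignment while accepting one that places only the first variable.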

Integer Programming for Shared Memory Arrangement (cont'd)
• Obtaining parameters
  • Using the LLVM compiler framework
  • Pass 1: get access features (read, write, read-write, temp)
  • Pass 2: get live ranges, loop information, indices, and all other parameters

Code Generation
• Follows the shared memory arrangement obtained from the integer programming model
• Built on the framework from previous work
• Moves data to fill the gaps left by data evicted from shared memory

An Example

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Integer Programming Solver

    A: n*r    B: m*r    C: r
    n: 2048   m: 3      r: 3    NUM_THREADS: 256

    assign_point[0][1]=1;
    assign_point[1][0]=1;
    assign_point[2][0]=1;
    /* all other elements of assign_point are 0 */
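The sizes in this example can be checked with a few lines of C++ arithmetic. The sketch assumes 4-byte floats and the 16 KB shared memory limit quoted earlier in the deck. The buffer shapes (a tile of r*NUM_THREADS elements for A, m*r for B, r*NUM_THREADS for the per-thread copies of C) follow the declarations on the generated-code slide, and the constant and function names are hypothetical.

```cpp
#include <cstddef>

// Problem parameters from the example slide.
constexpr std::size_t kN = 2048;
constexpr std::size_t kM = 3;
constexpr std::size_t kR = 3;
constexpr std::size_t kNumThreads = 256;

constexpr std::size_t kFloatBytes = 4;           // ASSUMPTION: 4-byte float
constexpr std::size_t kSharedLimit = 16 * 1024;  // 16 KB per multiprocessor

// Whole array A (n*r floats): too large to keep resident in 16 KB.
constexpr std::size_t whole_A_bytes() { return kN * kR * kFloatBytes; }

// One tile of A: r elements for each of NUM_THREADS threads.
constexpr std::size_t tile_A_bytes() { return kR * kNumThreads * kFloatBytes; }

// All of B (m*r floats).
constexpr std::size_t B_bytes() { return kM * kR * kFloatBytes; }

// A private r-element copy of C for each thread.
constexpr std::size_t private_C_bytes() { return kR * kNumThreads * kFloatBytes; }

// Shared memory used by the three buffers together.
constexpr std::size_t total_shared_bytes() {
    return tile_A_bytes() + B_bytes() + private_C_bytes();
}
```

Under these assumptions the whole of A needs 24576 bytes, which exceeds the limit, while one tile of A, all of B, and the per-thread copies of C together need 3072 + 36 + 3072 = 6180 bytes.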

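An Example (cont'd)
Generated Code:

    __shared__ float s_B[m][r];
    __shared__ float s_C[r*NUM_THREADS];
    __shared__ float s_A[r*NUM_THREADS];

    /* copy all of B into shared memory */
    for (int i = 0; i < m*r; i++)
        s_B[i/r][i%r] = B[i/r][i%r];
    for (int i = 0; i < n; i += NUM_THREADS) {
        /* each thread stages its own row of A */
        for (int j = 0; j < r; j++)
            s_A[tid*r+j] = A[tid+i][j];
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                /* per-thread slice of C, indexed like s_A */
                s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
        ......
    }
    /* Synchronize and combine the per-thread copies of C */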
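The structure of the generated code can be emulated on the host to check that tiling A and keeping a private copy of C per thread gives the same result as the original loop nest. This is a sketch only: threads are simulated by a sequential loop over tid, each thread accumulates into its own r-element slice of s_C (indexed tid*r+k here, an assumption about the intended layout), integers replace floats so the comparison is exact, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// Reference: the original triple loop, run sequentially.
std::vector<long> reference(const Matrix& A, const Matrix& B, int r) {
    std::vector<long> C(r, 0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < B.size(); ++j)
            for (int k = 0; k < r; ++k)
                C[k] += A[i][k] - B[j][k];
    return C;
}

// Host-side emulation of the tiled scheme with num_threads simulated threads.
std::vector<long> emulate_tiled(const Matrix& A, const Matrix& B, int r, int num_threads) {
    const int n = static_cast<int>(A.size());
    const int m = static_cast<int>(B.size());
    std::vector<long> s_A(static_cast<std::size_t>(r) * num_threads, 0);
    std::vector<long> s_C(static_cast<std::size_t>(r) * num_threads, 0);

    for (int i = 0; i < n; i += num_threads) {
        for (int tid = 0; tid < num_threads; ++tid) {
            if (tid + i >= n) continue;  // guard for a partial last tile
            // Stage this thread's row of A.
            for (int j = 0; j < r; ++j) s_A[tid * r + j] = A[tid + i][j];
            // Accumulate into this thread's private slice of C.
            for (int j = 0; j < m; ++j)
                for (int k = 0; k < r; ++k)
                    s_C[tid * r + k] += s_A[tid * r + k] - B[j][k];
        }
    }

    // Combine the per-thread copies of C.
    std::vector<long> C(r, 0);
    for (int tid = 0; tid < num_threads; ++tid)
        for (int k = 0; k < r; ++k)
            C[k] += s_C[tid * r + k];
    return C;
}
```

The bounds guard on tid + i is not on the slide; it is added here so the emulation also works when n is not a multiple of the thread count.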

Suggesting Loop Transformation

Original:

    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

Transformed:

    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
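The interchange can be checked on the host with a small C++ sketch. It assumes that the value of tempDis at the end of each rc iteration is what the elided code consumes, so both versions return one value per rc. The function names are hypothetical, and data_row stands for data[r] at a fixed r.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Original order: rc outermost, a scalar accumulator per rc.
std::vector<double> distances_original(const std::vector<double>& data_row,
                                       const Matrix& Acomp,
                                       const std::vector<int>& colCL) {
    const std::size_t nRowCl = Acomp.size();
    const std::size_t numCol = data_row.size();
    std::vector<double> result(nRowCl, 0.0);
    for (std::size_t rc = 0; rc < nRowCl; ++rc) {
        double tempDis = 0;
        for (std::size_t c = 0; c < numCol; ++c)
            tempDis = tempDis + data_row[c] * Acomp[rc][colCL[c]];
        result[rc] = tempDis;  // ASSUMPTION: this is the value used afterwards
    }
    return result;
}

// Interchanged order: c outermost, tempDis expanded to an array indexed by rc.
// Each data_row[c] and colCL[c] is read once per c and reused for every rc,
// which is what makes staging them in shared memory worthwhile.
std::vector<double> distances_interchanged(const std::vector<double>& data_row,
                                           const Matrix& Acomp,
                                           const std::vector<int>& colCL) {
    const std::size_t nRowCl = Acomp.size();
    const std::size_t numCol = data_row.size();
    std::vector<double> tempDis(nRowCl, 0.0);
    for (std::size_t c = 0; c < numCol; ++c) {
        const double d = data_row[c];
        const int col = colCL[c];
        for (std::size_t rc = 0; rc < nRowCl; ++rc)
            tempDis[rc] += d * Acomp[rc][col];
    }
    return tempDis;
}
```

Both versions add the terms for a given rc in the same order of c, so their results match exactly, not just up to rounding.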

Experiments
• Effectiveness of using shared memory
  • Compared with the intuitive approach used in previous work
  • Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the shared memory limit is reached
• Effectiveness of the loop transformation suggested by the integer programming model
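The greedy sorting baseline can be sketched in C++ as follows. The Variable struct and the function name greedy_allocate are hypothetical. The sketch assumes that allocation stops at the first variable that no longer fits; the slides do not say whether later variables are still tried.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// A candidate for shared memory: a name and its size in bytes.
struct Variable {
    std::string name;
    std::size_t bytes;
};

// Greedy baseline: sort by increasing size, then allocate until the limit is hit.
// ASSUMPTION: stop at the first variable that does not fit.
std::vector<std::string> greedy_allocate(std::vector<Variable> vars,
                                         std::size_t limit_bytes) {
    std::stable_sort(vars.begin(), vars.end(),
                     [](const Variable& a, const Variable& b) { return a.bytes < b.bytes; });
    std::vector<std::string> chosen;
    std::size_t used = 0;
    for (const Variable& v : vars) {
        if (used + v.bytes > limit_bytes) break;
        used += v.bytes;
        chosen.push_back(v.name);
    }
    return chosen;
}
```

For three variables of 24576, 36, and 3072 bytes and a 16 KB limit, the baseline keeps the two smaller variables and leaves the largest in device memory. Unlike the integer programming model, it takes no account of live ranges or access patterns.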

Experiment Results

Experiment Results
• K-means
• EM

Experiment Results (cont'd)
• PCA
• Co-clustering

Effect of Loop Transformation
• PCA
• Co-clustering

Conclusion and Future Work
• Proposed an integer programming model for shared memory arrangement on GPUs
  • Considers numeric variables, arrays, and sections of arrays
• Suggested loop transformations for optimization
• Obtained better results than the intuitive method
• Future work: automate the code generation and the selection of loop transformations

THANK YOU! Questions?
