AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS
Wenjing Ma, Gagan Agrawal
The Ohio State University
GPGPU
• General Purpose Programming on GPUs (accelerators)
• High performance/price ratio
• High language support
• CUDA
• Performance vs Productivity
• Hard to program
• Memory hierarchy to manage
• ...
Get High Performance from GPU, and Make the Programming Simple!
• Automatic code generation
• Device memory access is expensive
• Using shared memory
• Texture and constant memory
• Coalescing device memory access
• ...
FEATURES OF SHARED MEMORY
• Small, fast, like a cache
• 16KB on each multiprocessor (no more than 48KB even on the latest GPU)
• Read-write
• Software controlled
• __shared__ float data[n][n];
• Allocating shared memory:
• Similar to register allocation
Problem Formulation for Shared Memory Arrangement
• Consider variables and basic blocks in a function
• A variable can be an element of an array, an array, or a section of an array
• Each variable can have several live ranges in the function
• Access feature of a live range: read, write, read-write, temp
• Determine in which basic block a variable is allocated to shared memory
• assign_point[i][k]: variable i, basic block k
Integer Programming Problem
• Integer Linear Programming
• Objective function
• Maximize z = c^T x
• Constraints
• Solution
• Values of x
• Special case of linear programming
• All the unknown variables are integers (0-1 in our case)
• Solvable for problems of reasonable size
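To make the 0-1 formulation concrete, here is a minimal sketch in C that maximizes c^T x subject to A x <= b with every x[j] in {0, 1}, by plain enumeration. The function name solve_01_ilp and its signature are hypothetical, and enumeration is only an illustration for tiny instances; it is not the solver used in the talk, which the slides do not name.

```c
/* Hypothetical helper: maximize c^T x subject to A x <= b, x[j] in {0,1}.
   A is row-major with m rows and n columns. Enumerates all 2^n
   assignments, so it assumes n is small (well under 32).
   Returns 1 and fills best_x / best_z if some assignment is feasible,
   otherwise returns 0. */
int solve_01_ilp(int n, int m, const int *c, const int *A, const int *b,
                 int *best_x, int *best_z)
{
    int found = 0;
    for (unsigned long mask = 0; mask < (1UL << n); mask++) {
        int feasible = 1;
        /* Check every constraint row against this assignment. */
        for (int row = 0; row < m && feasible; row++) {
            int lhs = 0;
            for (int j = 0; j < n; j++)
                if (mask & (1UL << j))
                    lhs += A[row * n + j];
            if (lhs > b[row])
                feasible = 0;
        }
        if (!feasible)
            continue;
        /* Evaluate the objective z = c^T x. */
        int z = 0;
        for (int j = 0; j < n; j++)
            if (mask & (1UL << j))
                z += c[j];
        if (!found || z > *best_z) {
            found = 1;
            *best_z = z;
            for (int j = 0; j < n; j++)
                best_x[j] = (int)((mask >> j) & 1UL);
        }
    }
    return found;
}
```

For example, with c = (5, 4, 3) and the three constraints 2x0 + 3x1 + x2 <= 5, 4x0 + x1 + 2x2 <= 11, 3x0 + 4x1 + 2x2 <= 8, the best 0-1 assignment is x = (1, 1, 0) with z = 9.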
Integer Programming for Shared Memory Arrangement
• Objective Function
• Maximize shared memory usage
• Minimize data transfer between memory hierarchies
Integer Programming for Shared Memory Arrangement (cont’d)
• Objective Function
An Example to Show size_alloc
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......
Integer Programming for Shared Memory Arrangement (cont’d)
• Constraints
• Total allocation does not exceed the limit of shared memory at any time
• At most one assign_point is 1 in each live range
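The two constraints can be illustrated with a small feasibility check in C. This is a sketch under simplifying assumptions that are not in the slides: each variable has a single live range given as a basic-block interval [live_start, live_end], and a variable assigned at block k stays in shared memory from k to the end of that live range. The function name and parameters are hypothetical.

```c
/* Hypothetical check of a candidate arrangement.
   assign_point is row-major: assign_point[i * num_blocks + k] is 1 if
   variable i is placed in shared memory at basic block k.
   Assumes 0 <= live_start[i] <= live_end[i] < num_blocks.
   Returns 1 if both constraints hold, 0 otherwise. */
int arrangement_is_feasible(int num_vars, int num_blocks,
                            const int *assign_point,
                            const int *live_start, const int *live_end,
                            const int *size, int limit)
{
    /* At most one assign_point is 1 in each live range,
       and none may lie outside the live range. */
    for (int i = 0; i < num_vars; i++) {
        int count = 0;
        for (int k = 0; k < num_blocks; k++) {
            if (assign_point[i * num_blocks + k]) {
                if (k < live_start[i] || k > live_end[i])
                    return 0;
                count++;
            }
        }
        if (count > 1)
            return 0;
    }
    /* Total allocation must not exceed the limit at any basic block. */
    for (int k = 0; k < num_blocks; k++) {
        int used = 0;
        for (int i = 0; i < num_vars; i++) {
            if (k > live_end[i])
                continue; /* live range already over: variable evicted */
            for (int a = live_start[i]; a <= k; a++)
                if (assign_point[i * num_blocks + a])
                    used += size[i]; /* assigned at or before block k */
        }
        if (used > limit)
            return 0;
    }
    return 1;
}
```

In a full model these checks become linear inequalities over the 0-1 variables, which is what lets an integer programming solver search over arrangements instead of testing them one by one.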
Integer Programming for Shared Memory Arrangement (cont’d)
• Obtaining parameters
• Using LLVM compiler framework
• Pass 1: get access features
• Read, write, read-write, temp
• Pass 2: get live ranges, loop information, indices, and all other parameters
Code Generation
• According to the shared memory arrangement obtained from the integer programming model
• Built on the framework from previous work
• Move data to cover the gap caused by data evicted from shared memory
An Example
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......
Integer Programming Solver
    A: n*r    B: m*r    C: r
    n: 2048   m: 3      r: 3    NUM_THREADS: 256
    assign_point[0][1]=1;
    assign_point[1][0]=1;
    assign_point[2][0]=1;
    /* all other elements of assign_point are 0 */
An Example (cont’d)
Generated Code:
    __shared__ float s_B[m][r];
    __shared__ float s_C[r*NUM_THREADS];
    __shared__ float s_A[r*NUM_THREADS];
    /* copy all of B into shared memory */
    for (int i = 0; i < m*r; i++)
        s_B[i/r][i%r] = B[i/r][i%r];
    for (int i = 0; i < n; i += NUM_THREADS) {
        /* each thread stages its own row of A */
        for (int j = 0; j < r; j++)
            s_A[tid*r+j] = A[tid+i][j];
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
        ......
    }
    /* Synchronize and combine the per-thread copies of C */
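As a quick arithmetic check on this example, the sketch below adds up the shared memory that the three staged arrays would occupy. The helper name shared_bytes is hypothetical, and the element size is passed in explicitly (4 bytes is assumed for float).

```c
#include <stddef.h>

/* Hypothetical helper: bytes of shared memory used by s_A, s_B and s_C
   as declared in the generated code above. */
size_t shared_bytes(size_t m, size_t r, size_t num_threads, size_t elem_size)
{
    size_t s_B = m * r;            /* whole array B             */
    size_t s_C = r * num_threads;  /* one copy of C per thread  */
    size_t s_A = r * num_threads;  /* one row of A per thread   */
    return (s_A + s_B + s_C) * elem_size;
}
```

With m = 3, r = 3, NUM_THREADS = 256 and 4-byte floats this gives (768 + 9 + 768) * 4 = 6180 bytes, which fits within the 16KB of shared memory per multiprocessor mentioned earlier.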
Suggesting Loop Transformation
Original loop nest:
    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }
Suggested transformation:
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
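The following C sketch runs both loop orders on the host to show that they compute the same per-cluster sums. It is an illustration only: the function names, the flattened row-major layout of Acomp, and the output array in the first version are assumptions made so the two versions can be compared, and no shared memory is involved.

```c
/* Original order: outer loop over row clusters, scalar accumulator.
   Acomp is flattened row-major with n_col_cl columns (an assumption). */
void distances_original(int num_col, int n_row_cl, int n_col_cl,
                        const float *data_row, const float *Acomp,
                        const int *colCL, float *out)
{
    for (int rc = 0; rc < n_row_cl; rc++) {
        float tempDis = 0;
        for (int c = 0; c < num_col; c++)
            tempDis = tempDis + data_row[c] * Acomp[rc * n_col_cl + colCL[c]];
        out[rc] = tempDis;
    }
}

/* Interchanged order: outer loop over columns, one accumulator per
   row cluster. Each data_row[c] and colCL[c] is read once per column,
   which is what makes staging them in fast memory worthwhile. */
void distances_interchanged(int num_col, int n_row_cl, int n_col_cl,
                            const float *data_row, const float *Acomp,
                            const int *colCL, float *tempDis)
{
    for (int rc = 0; rc < n_row_cl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < num_col; c++) {
        float d = data_row[c];  /* candidate for shared memory */
        int cl = colCL[c];
        for (int rc = 0; rc < n_row_cl; rc++)
            tempDis[rc] += d * Acomp[rc * n_col_cl + cl];
    }
}
```

Both versions add the terms for a given rc in the same order of c, so the interchange does not change the summation order within each accumulator.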
Experiments
• Effectiveness of using shared memory
• Compare with the intuitive approach in previous work
• Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the shared memory limit is reached
• Effectiveness of the loop transformation suggested by the integer programming model
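The greedy baseline described above can be sketched in a few lines of C. The Var struct, the function name, and the choice to stop at the first variable that does not fit are assumptions for illustration; the slides only describe the approach in one sentence.

```c
#include <stdlib.h>

/* Hypothetical description of a variable: ids are assumed to be 0..n-1. */
typedef struct {
    int id;
    int size; /* bytes needed in shared memory */
} Var;

static int by_size_ascending(const void *a, const void *b)
{
    const Var *x = a;
    const Var *y = b;
    return (x->size > y->size) - (x->size < y->size);
}

/* Sorts vars in place by increasing size and places them in shared
   memory until the next one would exceed the limit.
   in_shared[id] is set to 1 for every variable placed, 0 otherwise.
   Returns the number of bytes used. */
int greedy_allocate(Var *vars, int n, int limit, int *in_shared)
{
    qsort(vars, (size_t)n, sizeof vars[0], by_size_ascending);
    for (int i = 0; i < n; i++)
        in_shared[vars[i].id] = 0;
    int used = 0;
    for (int i = 0; i < n; i++) {
        if (used + vars[i].size > limit)
            break; /* sorted ascending, so nothing later fits either */
        used += vars[i].size;
        in_shared[vars[i].id] = 1;
    }
    return used;
}
```

Unlike the integer programming model, this baseline looks only at sizes: it ignores live ranges, access features, and how often each variable is accessed.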
Experiment Results
• K-means
• EM
Experiment Results (cont’d)
• PCA
• Co-clustering
Effect of Loop Transformation
• PCA
• Co-clustering
Conclusion and Future Work
• Proposed an integer programming model for shared memory arrangement on GPUs
• Considers numeric variables, arrays, and sections of arrays
• Suggested loop transformations for optimization
• Got better results than the intuitive method
• Will automate the code generation and loop transformation selection in the future
THANK YOU! Questions?