This paper presents an analytical model for matrix multiplication in the ATLAS library, focusing on choosing the tile size (NB) to maximize performance. The model is refined through increasingly complex scenarios, starting from the requirement that the entire working set fit in the L1 cache. It details the memory requirements of each iteration of the nested loops and examines the effects of cache associativity, replacement policy (optimal versus LRU), and line size. The goal is to minimize data movement and maximize computational efficiency, offering insights for improving tiled matrix multiplication algorithms.
matmul(c[i:i+NB-1, j:j+NB-1], a[i:i+NB-1, k:k+NB-1], b[k:k+NB-1, j:j+NB-1])

for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            c[i][j] += a[i][k] * b[k][j];
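The mini-kernel above runs inside outer loops that walk the matrices one NB×NB tile at a time. A minimal runnable C sketch of that structure (the sizes N and NB, the helper names, and the test values are illustrative assumptions, not part of ATLAS):

```c
#include <assert.h>

#define N  8    /* hypothetical matrix size (assumption) */
#define NB 4    /* tile size; chosen by the models below (assumes N % NB == 0) */

static double a[N][N], b[N][N], c[N][N], ref[N][N];

/* NB x NB mini-kernel from the slide: jik loop order over one tile */
static void mini_kernel(int i0, int j0, int k0) {
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < NB; i++)
            for (int k = 0; k < NB; k++)
                c[i0 + i][j0 + j] += a[i0 + i][k0 + k] * b[k0 + k][j0 + j];
}

/* Outer loops step over the matrices tile by tile */
static void tiled_matmul(void) {
    for (int i = 0; i < N; i += NB)
        for (int j = 0; j < N; j += NB)
            for (int k = 0; k < N; k += NB)
                mini_kernel(i, j, k);
}

/* Initialize inputs, compare tiled result against a naive triple loop;
   returns 1 on match. Integer-valued doubles keep the sums exact, so the
   different summation order of the tiled version still compares equal. */
static int run_check(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i * N + j;
            b[i][j] = i - j;
            c[i][j] = 0.0;
            ref[i][j] = 0.0;
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                ref[i][j] += a[i][k] * b[k][j];
    tiled_matmul();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c[i][j] != ref[i][j]) return 0;
    return 1;
}
```

Blocking does not change the arithmetic, only the order in which the k-sum is accumulated; the point is that each tile's working set can be sized to stay resident in L1.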
Modeling for Tile Size (NB)
• Models of increasing complexity:
  • 3·NB² ≤ C
    • Whole working set fits in L1
  • NB² + NB + 1 ≤ C
    • Fully associative
    • Optimal replacement
    • Line size: 1 word
  • or: Line size > 1 word
  • or: LRU replacement
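The two capacity conditions above can be evaluated directly. A minimal C sketch that finds the largest NB each model admits for an L1 capacity C measured in words (the 4096-word capacity used below is a hypothetical example, e.g. a 32 KB cache holding 8-byte doubles):

```c
#include <assert.h>

/* Largest NB with 3*NB^2 <= C: the whole working set
   (one tile each of A, B, and C) fits in L1 */
static int nb_simple(int C) {
    int nb = 0;
    while (3 * (nb + 1) * (nb + 1) <= C)
        nb++;
    return nb;
}

/* Largest NB with NB^2 + NB + 1 <= C: fully associative cache,
   optimal replacement, line size of 1 word -- only the tile of A,
   one column of B, and one element of C need to be resident */
static int nb_refined(int C) {
    int nb = 0;
    while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= C)
        nb++;
    return nb;
}
```

For C = 4096 words the simple model admits NB = 36, while the refined model admits NB = 63, showing how much larger a tile the sharper analysis permits.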
Explanation for LRU Model (II)

Each iteration of j requires:
- NB² elements of A
- a column of C (NB elements)
- a column of B (NB elements)

In the middle of iteration j+1, being able to reuse the elements of A requires holding not one, but two columns of B, and one extra element of C. Thus:

NB² + 3·NB + 1 ≤ C
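Summing the counts above (NB² for A, two columns of B, one column of C plus one extra element) gives the LRU bound NB² + 3·NB + 1 ≤ C. A small C sketch of this bound, using the same hypothetical 4096-word capacity as before:

```c
#include <assert.h>

/* Largest NB with NB^2 + 3*NB + 1 <= C: LRU replacement needs
   A (NB^2), two columns of B (2*NB), and a column of C plus one
   extra element (NB + 1) resident at once */
static int nb_lru(int C) {
    int nb = 0;
    while ((nb + 1) * (nb + 1) + 3 * (nb + 1) + 1 <= C)
        nb++;
    return nb;
}
```

At C = 4096 words this admits NB = 62, slightly smaller than the NB = 63 allowed under optimal replacement: LRU's extra resident columns cost one step of tile size.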