This document provides a detailed analysis of cache characteristics in Intel Pentium 4 processors, covering the cache hierarchy, associativity, and the CPUID instruction. It discusses cache attributes including the first- and second-level caches and set-associative cache configurations, and the implications of cache design for performance. Emphasizing the balance between cost, speed, and complexity, it outlines how caching strategies affect data access times and overall system efficiency, helping readers gain insight into optimizing cache performance.
Computer Systems – Cache Characteristics
How Do You Know?

(ow133) cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core speed 2 GHz - 3.06 GHz (bus speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K uops, 8-way set associative
2nd-level cache: 512K bytes, 8-way set associative, sectored cache, 64-byte line size
CPUID Instruction

asm("push %ebx; movl $2,%eax; CPUID;"
    "movl %eax,cache_eax; movl %ebx,cache_ebx;"
    "movl %edx,cache_edx; movl %ecx,cache_ecx;"
    "pop %ebx");
A limited number of answers: http://www.sandpile.org/ia32/cpuid.htm
Direct-Mapped Cache
• Simplest kind of cache
• Characterized by exactly one line per set: E = 1 line per set
(Figure: sets 0 … S-1, each holding a single line with a valid bit, a tag, and one cache block)
Fully Associative Cache
• All E = C/B lines are in the one and only set (set 0)
• Difficult and expensive to build one that is both large and fast
(Figure: a single set containing E lines, each with a valid bit, a tag, and a cache block)
E-way Set Associative Cache
• Characterized by more than one line per set (here E = 2 lines per set)
(Figure: sets 0 … S-1, each containing two lines with a valid bit, a tag, and a cache block)
Accessing Set Associative Caches
• Set selection
• Use the set index bits to determine the set of interest
• s = log2(S), where S = C / (B × E)
(Figure: the m-bit address is split into t tag bits, s set-index bits, and b block-offset bits; set index 00001 selects set 1)
Accessing Set Associative Caches
• Line matching and word selection
• Must compare the tag in each valid line of the selected set
(1) The valid bit must be set (= 1?)
(2) The tag bits in one of the cache lines must match the tag bits in the address (= ?)
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte; b = log2(B)
(Figure: selected set i holds a valid line with tag 0110 containing words w0-w3; the address with tag 0110, set index i, and block offset 100 selects w0 starting at byte 4)
Observations
• A contiguous region of addresses forms a block
• Multiple address blocks share the same set index
• Blocks that map to the same set can be uniquely identified by their tag
Caching in a Memory Hierarchy
• Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1
• Data is copied between levels in block-sized transfer units
• Larger, slower, cheaper storage device at level k+1 is partitioned into blocks
(Figure: level k holds blocks 8, 9, 14, 3; level k+1 holds blocks 0-15; blocks 4 and 10 have the same set index and are copied between the levels)
General Caching Concepts
• Program needs object d, which is stored in some block b
• Cache hit: program finds b in the cache at level k (e.g., block 14)
• Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12)
• If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
(Figure: requests for blocks 14 and 12; block 12 evicts block 4*, and repeated requests for 12 and 4* cause conflict misses)
Intel Processors Cache SRAM
http://www11.brinkster.com/bayup/dodownload.asp?f=37
Intel Processors Cache SRAM
ftp://download.intel.com/design/Pentium4/manuals/25366514.pdf (section 2.4)
Impact
• Associativity: more lines per set decrease vulnerability to thrashing, but require more tag bits and control logic, which increases the hit time (1-2 cycles for L1)
• Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate, but larger blocks also increase the miss penalty (5-100 cycles for L1)
Optimum Block Size
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture15.pdf
Absolute Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf
Relative Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf
Matrix Multiplication Example
• Major cache effects to consider
• Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
• Block size: exploit spatial locality
• Description
• Multiply N x N matrices
• O(N^3) total operations
• Accesses: N reads per source element; N values summed per destination
• Assumption: N not so large that the cache cannot hold multiple rows

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Variable sum held in register
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per inner-loop iteration: A 0.25, B 1.0, C 0.0; total = 1.25
Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per inner-loop iteration: A 0.25, B 1.0, C 0.0; total = 1.25
Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per inner-loop iteration: A 0.0, B 0.25, C 0.25; total = 0.5
Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per inner-loop iteration: A 0.0, B 0.25, C 0.25; total = 0.5
Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per inner-loop iteration: A 1.0, B 0.0, C 1.0; total = 2.0
Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per inner-loop iteration: A 1.0, B 0.0, C 1.0; total = 2.0
Summary of Matrix Multiplication
• ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
• kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
• jki (& kji): 2 loads, 1 store; misses/iter = 2.0

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
Matrix Multiply Performance
• Miss rates are helpful but not perfect predictors.
• Code scheduling matters, too.
Improving Temporal Locality by Blocking
• Example: blocked matrix multiplication
• Here "block" does not mean "cache block"; it means a sub-block within the matrix
• Example: N = 8; sub-block size = 4

C11 C12     A11 A12     B11 B12
C21 C22  =  A21 A22  X  B21 B22

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:
C11 = A11 B11 + A12 B21    C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21    C22 = A21 B12 + A22 B22
Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
Blocked Matrix Multiply Analysis
• Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
• Loop over i steps through n row slivers of A & C, using the same block of B
(Figure: the row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated)
Matrix Multiply Performance
• Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions
• Performance is relatively insensitive to array size
Concluding Observations
• Programmer can optimize for cache performance
• How data structures are organized
• How data are accessed
• Nested loop structure
• Blocking is a general technique
• All systems favor "cache friendly code"
• Getting absolute optimum performance is very platform specific
• Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code
• Keep working set reasonably small (temporal locality)
• Use small strides (spatial locality)
Assignment
• Homework problem 6.28: What percentage of writes will miss in the cache for the blank screen function?