
Computer Systems


shino




Presentation Transcript


  1. Computer Systems: Cache Characteristics
  Computer Systems – cache characteristics

  2. How do you know? (ow133) cpuid
  This system has a Genuine Intel(R) Pentium(R) 4 processor
  Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
  Pentium 4 core C1 (0.13 micron): core speed 2 GHz - 3.06 GHz (bus speed 400/533 MHz)
  Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
  Data TLB: 4K or 4M pages, fully associative, 64 entries
  1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size
  No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
  Trace cache: 12K-uops, 8-way set associative
  2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

  3. CPUID instruction
  asm("push %ebx;"
      "movl $2,%eax;"
      "CPUID;"
      "movl %eax,cache_eax;"
      "movl %ebx,cache_ebx;"
      "movl %edx,cache_edx;"
      "movl %ecx,cache_ecx;"
      "pop %ebx");

  4. A limited number of answers http://www.sandpile.org/ia32/cpuid.htm

  5. Direct-Mapped Cache
  • Simplest kind of cache
  • Characterized by exactly one line per set (E = 1 line per set)
  [Figure: sets 0 .. S-1, each holding a single line: valid bit, tag, cache block]

  6. Fully Associative Cache
  • E = C/B lines in the one and only set (set 0)
  • Difficult and expensive to build one that is both large and fast
  [Figure: a single set whose E lines each hold a valid bit, tag, and cache block]

  7. E-way Set Associative Caches
  • Characterized by more than one line per set (here E = 2 lines per set)
  [Figure: sets 0 .. S-1, each with two lines: valid bit, tag, cache block]

  8. Accessing Set Associative Caches
  • Set selection
  • Use the set index bits to determine the set of interest.
  An m-bit address splits into a tag (t bits), a set index (s bits), and a block offset (b bits), where s = log2(S) and S = C / (B x E).
  [Figure: the s set-index bits select one of the sets 0 .. S-1]

  9. Accessing Set Associative Caches
  • Line matching and word selection
  • Must compare the tag in each valid line in the selected set.
  (1) The valid bit must be set.
  (2) The tag bits in one of the cache lines must match the tag bits in the address.
  (3) If (1) and (2), then cache hit, and the block offset selects the starting byte, where b = log2(B).
  [Figure: selected set i holds two valid lines with tags 1001 and 0110; the address tag 0110 matches the second line, and block offset 100 selects the starting byte among words w0..w3]

  10. Observations
  • A contiguous region of addresses forms a block
  • Multiple address blocks share the same set index
  • Blocks that map to the same set can be uniquely identified by the tag

  11. Caching in a Memory Hierarchy
  • Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1
  • Data is copied between levels in block-sized transfer units
  • Larger, slower, cheaper storage device at level k+1 is partitioned into blocks
  [Figure: level k holds blocks 8, 9, 14, 3; level k+1 holds blocks 0..15; blocks 4 and 10 share the same set index]

  12. General Caching Concepts
  • Program needs object d, which is stored in some block b.
  • Cache hit
  • Program finds b in the cache at level k. E.g., block 14.
  • Cache miss
  • b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  • If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
  [Figure: requests for blocks 14 and 12; blocks 12 and 4* map to the same line, so alternating requests evict each other: a conflict miss]

  13. Intel Processors Cache SRAM http://www11.brinkster.com/bayup/dodownload.asp?f=37

  14. Intel Processors Cache SRAM ftp://download.intel.com/design/Pentium4/manuals/25366514.pdf (section 2.4)

  15. Impact
  • Associativity: more lines per set decrease the vulnerability to thrashing. They require more tag bits and control logic, which increases the hit time (1-2 cycles for L1).
  • Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate, but they also increase the miss penalty (5-100 cycles for L1).

  16. Optimum Block size http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture15.pdf

  17. Absolute Miss Rate http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

  18. Relative Miss Rate http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

  19. Matrix Multiplication Example
  • Major cache effects to consider
  • Total cache size
  • Exploit temporal locality and keep the working set small (e.g., by using blocking)
  • Block size
  • Exploit spatial locality
  • Description:
  • Multiply N x N matrices
  • O(N^3) total operations
  • Accesses
  • N reads per source element
  • N values summed per destination
  • Assumption
  • N is so large that the cache is not big enough to hold multiple rows
  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  Variable sum held in register

  20. Matrix Multiplication (ijk)
  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
  • Misses per inner loop iteration: A 0.25, B 1.0, C 0.0 = 1.25

  21. Matrix Multiplication (jik)
  /* jik */
  for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
  • Misses per inner loop iteration: A 0.25, B 1.0, C 0.0 = 1.25

  22. Matrix Multiplication (kij)
  /* kij */
  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }
  Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
  • Misses per inner loop iteration: A 0.0, B 0.25, C 0.25 = 0.5

  23. Matrix Multiplication (ikj)
  /* ikj */
  for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }
  Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
  • Misses per inner loop iteration: A 0.0, B 0.25, C 0.25 = 0.5

  24. Matrix Multiplication (jki)
  /* jki */
  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }
  Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
  • Misses per inner loop iteration: A 1.0, B 0.0, C 1.0 = 2.0

  25. Matrix Multiplication (kji)
  /* kji */
  for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }
  Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
  • Misses per inner loop iteration: A 1.0, B 0.0, C 1.0 = 2.0

  26. Summary of Matrix Multiplication
  • ijk (& jik): 2 loads, 0 stores, misses/iter = 1.25
  • kij (& ikj): 2 loads, 1 store, misses/iter = 0.5
  • jki (& kji): 2 loads, 1 store, misses/iter = 2.0
  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
  /* kij */
  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }
  /* jki */
  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

  27. Matrix Multiply Performance
  • Miss rates are helpful but not perfect predictors.
  • Code scheduling matters, too.

  28. Improving Temporal Locality by Blocking
  • Example: Blocked matrix multiplication
  • "Block" here does not mean "cache block"; instead, it means a sub-block within the matrix.
  • Example: N = 8; sub-block size = 4
  C11 C12   A11 A12   B11 B12
  C21 C22 = A21 A22 x B21 B22
  Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
  C11 = A11 B11 + A12 B21
  C12 = A11 B12 + A12 B22
  C21 = A21 B11 + A22 B21
  C22 = A21 B12 + A22 B22

  29. Blocked Matrix Multiply (bijk)
  for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
      for (j=jj; j < min(jj+bsize,n); j++)
        c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
      for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
          sum = 0.0;
          for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
          }
          c[i][j] += sum;
        }
      }
    }
  }

  30. Blocked Matrix Multiply Analysis
  • Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
  • Loop over i steps through n row slivers of A & C, using the same block of B
  [Figure: the row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated]

  31. Matrix Multiply Performance
  • Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions
  • Performance is relatively insensitive to array size

  32. Concluding Observations
  • Programmer can optimize for cache performance
  • How data structures are organized
  • How data are accessed
  • Nested loop structure
  • Blocking is a general technique
  • All systems favor "cache friendly code"
  • Getting absolute optimum performance is very platform specific
  • Cache sizes, line sizes, associativities, etc.
  • Can get most of the advantage with generic code
  • Keep working set reasonably small (temporal locality)
  • Use small strides (spatial locality)

  33. Assignment
  • Homework problem 6.28: What percentage of writes will miss in the cache for the blank screen function?
