CS252 Graduate Computer Architecture Lecture 15 3+1 Cs of Caching and many ways Cache Optimizations. John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252 http://www-inst.eecs.berkeley.edu/~cs252.
Review: Computational Predictors
• Last Value Predictors
  • Predict that an instruction will produce the same value as last time
  • Requires some form of hysteresis. Two subtle alternatives:
    • Saturating counter incremented/decremented on success/failure; replace the stored value when the count falls below a threshold
    • Keep the old value until a new value has been seen frequently enough
  • The second version predicts a constant when the value appears temporarily constant
• Stride Predictors
  • Predict the next value by adding the most recent value to the difference of the two most recent values:
    • If v(n-1) and v(n-2) are the two most recent values, then the predicted next value is: v(n-1) + (v(n-1) – v(n-2))
    • The value (v(n-1) – v(n-2)) is called the “stride”
  • Important variations in hysteresis:
    • Change the stride only if a saturating counter falls below a threshold
    • Or the “two-delta” method, in which two strides are maintained:
      • The first (S1) is always updated with the difference between the two most recent values
      • The other (S2) is used for computing predictions
      • When the same S1 is seen twice in a row, S1 is copied into S2
• More complex predictors:
  • Multiple strides for nested loops
  • Complex computations for complex loops (polynomials, etc.)
cs252-S07, Lecture 15
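The two-delta scheme can be sketched in a few lines of C. This is a minimal illustration, not the hardware design from the lecture: the struct and function names are hypothetical, one predictor entry is kept per instruction, and table indexing and confidence counters are left out.

```c
#include <stdint.h>

/* Hypothetical per-instruction state for a two-delta stride predictor. */
typedef struct {
    int64_t last;  /* most recent value produced, v(n-1) */
    int64_t s1;    /* most recently observed stride (always updated) */
    int64_t s2;    /* stride actually used for predictions */
} stride_pred_t;

void stride_init(stride_pred_t *p, int64_t first_value) {
    p->last = first_value;
    p->s1 = 0;
    p->s2 = 0;
}

/* Predict the next value as v(n-1) + S2. */
int64_t stride_predict(const stride_pred_t *p) {
    return p->last + p->s2;
}

/* Train with the value the instruction actually produced. */
void stride_update(stride_pred_t *p, int64_t actual) {
    int64_t delta = actual - p->last;
    if (delta == p->s1)   /* same stride seen twice in a row: S2 <- S1 */
        p->s2 = p->s1;
    p->s1 = delta;        /* S1 always tracks the latest difference */
    p->last = actual;
}
```

Because S2 changes only after a stride repeats, a single irregular value (for example at a loop exit) does not disturb the stride used for prediction.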
Review: Context Based Predictors • Context Based Predictor • Relies on Tables to do trick • Classified according to the order: an “n-th” order model takes last n values and uses this to produce prediction • So – 0th order predictor will be entirely frequency based • Consider sequence: a a a b c a a a b c a a a • Next value is? • “Blending”: Use prediction of highest order available cs252-S07, Lecture 15
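A small software model makes the "order" and "blending" ideas concrete. The sketch below is an assumption-laden illustration: the function names are hypothetical, the history is a plain character string rather than a hardware table, and ties are broken toward the lowest character code.

```c
#include <stddef.h>
#include <string.h>

/* Order-k context prediction: find every earlier occurrence of the last k
 * symbols of hist (length n) and return the symbol that most often
 * followed them. Returns 0 if that context has never been seen before. */
char context_predict(const char *hist, size_t n, size_t k) {
    unsigned count[256] = {0};
    if (k > n) return 0;
    const char *ctx = hist + (n - k);       /* the last k symbols */
    for (size_t i = 0; i + k < n; i++)      /* follower hist[i+k] must exist */
        if (memcmp(hist + i, ctx, k) == 0)
            count[(unsigned char)hist[i + k]]++;
    int best = 0;
    for (int c = 1; c < 256; c++)
        if (count[c] > count[best]) best = c;
    return count[best] ? (char)best : 0;
}

/* Blending: use the prediction of the highest order that has one. */
char blended_predict(const char *hist, size_t n, size_t max_order) {
    for (size_t k = max_order + 1; k-- > 0; ) {
        char p = context_predict(hist, n, k);
        if (p) return p;
    }
    return 0;
}
```

For the slide's sequence `aaabcaaabcaaa`, orders 0 through 2 predict `a` (the most frequent follower), while order 3 sees that `aaa` was followed by `b` both earlier times and predicts `b`.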
Review: Which is better? • Stride-based: • Learns faster • less state • Much cheaper in terms of hardware! • runs into errors for any pattern that is not an infinite stride • Context-based: • Much longer to train • Performs perfectly once trained • Much more expensive hardware cs252-S07, Lecture 15
Correlation of Predicted Sets • Way to interpret: • l = last value • s = stride • f = fcm3 • Combinations: • ls = both l and s • Etc. • Conclusion? • Only 18% not predicted correctly by any model • About 40% captured by all predictors • A significant fraction (over 20%) only captured by fcm • Stride does well! • Over 60% of correct predictions captured • Last-Value seems to have very little added value cs252-S07, Lecture 15
Number of unique values • Data Observations: • Many static instructions (>50%) generate only one value • Majority of static instructions (>90%) generate fewer than 64 values • Majority of dynamic instructions (>50%) correspond to static insts that generate fewer than 64 values • Over 90% of dynamic instructions correspond to static insts that generate fewer than 4096 unique values • Suggests that a relatively small number of values would be required for actual context prediction cs252-S07, Lecture 15
An anti-aliasing predictor: Bi-Mode [Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge]
[Figure: Bi-Mode predictor. Labels: Direction Predictors (Bias: True, Bias: False), History (n bits), Address (s bits), Choice Predictor, Result]
• Two separate Gshare predictors + chooser
  • One for each bias
  • Only one used/updated!
• Sort branches by bias
  • Meta predictor chooses
  • Constructive aliasing helps rather than hinders
An alternative: Genetic Programming for Design
• “A Language for Describing Predictors and its Application to Automatic Synthesis”
  • Paper by Joel Emer and Nikolas Gloy
• Genetic programming has two key aspects:
  • An encoding of the design space.
    • This is a symbolic representation of the result space (genome).
    • Much of the domain-specific knowledge and “art” is involved here.
  • A reproduction strategy
    • Includes a method for generating offspring from parents
      • Mutation: changing random portions of an individual
      • Crossover: merging aspects of two individuals
    • Includes a method for evaluating the effectiveness (“fitness”) of individual solutions.
• Generation of new branch predictors via genetic programming:
  • Everything derived from a “basic” predictor (table) + simple operators.
  • Expressions arranged in a tree
  • Mutation: random modification of a node/replacement of a subtree
  • Crossover: swapping the subtrees of two parents.
Administrivia • It’s official: Midterm I next Wednesday (21st) • Location: 310 Soda Hall (Here) • Time: 5:30 – 8:30 • No class on day of exam • Meet afterwards for pizza and beverages at LaVal’s • Topics: Everything up to prediction mechanisms • Closed-book • but can bring one page of notes (both sides, 8 ½ x11 ) cs252-S07, Lecture 15
Why More on Memory Hierarchy? Processor-Memory Performance Gap Growing cs252-S07, Lecture 15
Generations of Microprocessors • Time of a full cache miss in instructions executed: 1st Alpha: 340 ns/5.0 ns = 68 clks x 2 or 136 2nd Alpha: 266 ns/3.3 ns = 80 clks x 4 or 320 3rd Alpha: 180 ns/1.7 ns =108 clks x 6 or 648 • Why not recompute the value rather than taking the time to fetch it from memory? cs252-S07, Lecture 15
Processor-Memory Performance Gap “Tax”
Processor          % Area (cost)    % Transistors (power)
Alpha 21164        37%              77%
StrongArm SA110    61%              94%
Pentium Pro        64%              88%
  • 2 dies per package: Proc/I$/D$ + L2$
• Caches have no inherent value; they only try to close the performance gap
What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  • Registers: a cache on variables
  • First-level cache: a cache on second-level cache
  • Second-level cache: a cache on memory
  • Memory: a cache on disk (virtual memory)
  • TLB: a cache on page table
  • Branch-prediction: a cache on prediction information?
[Figure: memory hierarchy. Levels: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, Tape, etc.; arrows labeled Bigger and Faster]
What happens on a Cache miss?
• For in-order pipeline, 2 options:
  • Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
      IF ID EX Mem stall stall stall … stall Mem Wr
         IF ID EX  stall stall stall … stall stall Ex Wr
  • Use Full/Empty bits in registers + MSHR queue
    • MSHR = “Miss Status/Handler Registers” (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
      • Per cache line: keep info about memory address.
      • For each word: the register (if any) that is waiting for the result.
      • Used to “merge” multiple requests to one memory line
    • New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.
    • Attempt to use register before result returns causes instruction to block in decode stage.
    • Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.
• Out-of-order pipelines already have this functionality built in… (load queues, etc).
Review: Cache performance • Miss-oriented Approach to Memory Access: • Separating out Memory component entirely • AMAT = Average Memory Access Time cs252-S07, Lecture 15
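The equations for this slide did not survive extraction. As a reconstruction (not the slide's own typesetting), the standard textbook forms that match the three bullet headings, and that the arithmetic on the following slides is consistent with, are:

```latex
\begin{align*}
\text{CPU time} &= IC \times \left( CPI_{\text{Execution}}
    + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right)
    \times \text{CycleTime} \\
  &= IC \times \left( CPI_{\text{Execution}}
    + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right)
    \times \text{CycleTime} \\[4pt]
\text{CPU time} &= IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{\text{AluOps}}
    + \frac{\text{MemAccess}}{\text{Inst}} \times AMAT \right) \times \text{CycleTime} \\[4pt]
AMAT &= \text{HitTime} + \text{MissRate} \times \text{MissPenalty}
\end{align*}
```

The first pair is the miss-oriented view (memory stalls folded into CPI); the third line separates out the memory component entirely and expresses it through AMAT.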
Reducing Misses • Classifying Misses: 3 Cs • Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.(Misses in even an Infinite Cache) • Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.(Misses in Fully Associative Size X Cache) • Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.(Misses in N-way Associative, Size X Cache) • More recent, 4th “C”: • Coherence - Misses caused by cache coherence. cs252-S07, Lecture 15
Impact on Performance
• Suppose a processor executes at
  • Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
• Miss Behavior:
  • 10% of memory operations get 50 cycle miss penalty
  • 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/ins)
  + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
  + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
  = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
• About 65% of the time (2.0 of every 3.1 cycles per instruction) the proc is stalled waiting for memory!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
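The CPI arithmetic on this slide is easy to check mechanically. The helper names below are hypothetical; the sketch assumes exactly one instruction fetch per instruction and a single miss penalty shared by instruction and data misses, as the slide does.

```c
/* CPI including memory stalls: ideal CPI plus data-side and
 * instruction-side stall cycles per instruction. */
double cpi_with_stalls(double ideal_cpi, double data_ops_per_inst,
                       double data_miss_rate, double inst_miss_rate,
                       double miss_penalty) {
    double data_stalls = data_ops_per_inst * data_miss_rate * miss_penalty;
    double inst_stalls = 1.0 * inst_miss_rate * miss_penalty; /* 1 fetch/inst */
    return ideal_cpi + data_stalls + inst_stalls;
}

/* Fraction of all cycles that are memory stall cycles. */
double stall_fraction(double ideal_cpi, double cpi) {
    return (cpi - ideal_cpi) / cpi;
}
```

With the slide's inputs, `cpi_with_stalls(1.1, 0.30, 0.10, 0.01, 50)` gives 3.1, and `stall_fraction(1.1, 3.1)` is 2.0/3.1, or roughly 0.645.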
Example: Harvard Architecture
[Figure: two organizations. Left: Proc with Unified Cache-1 backed by Unified Cache-2. Right: Proc with separate I-Cache-1 and D-Cache-1 backed by Unified Cache-2]
• Unified vs Separate I&D (Harvard)
• Statistics (given in H&P):
  • 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  • 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore L2 cache)?
  • Assume 33% data ops, so 75% of accesses are from instructions (1.0/1.33)
  • hit time = 1, miss time = 50
  • Note that a data hit has 1 stall for the unified cache (only one port)
  AMAT(Harvard) = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT(Unified) = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
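The comparison can be reproduced with two small helpers. The function names are hypothetical; the sketch only restates the slide's weighted-AMAT arithmetic, with the unified cache's single port modeled as one extra cycle of hit time on data accesses.

```c
/* AMAT for one access stream: hit time + miss rate * miss penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Access-weighted AMAT over the instruction and data streams. */
double amat_mixed(double inst_frac, double inst_amat, double data_amat) {
    return inst_frac * inst_amat + (1.0 - inst_frac) * data_amat;
}
```

For the split caches, `amat_mixed(0.75, amat(1, 0.0064, 50), amat(1, 0.0647, 50))` is about 2.05; for the unified cache, `amat_mixed(0.75, amat(1, 0.0199, 50), amat(2, 0.0199, 50))` is about 2.24 to 2.25. The split design wins despite its higher data miss rate because it avoids the structural stall.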
Review: 6 Basic Cache Optimizations
• Reducing hit time
  1. Avoiding Address Translation during Cache Indexing
     • E.g., overlap TLB and cache access
• Reducing Miss Penalty
  2. Giving Reads Priority over Writes
     • E.g., read completes before earlier writes in write buffer
  3. Multilevel Caches
• Reducing Miss Rate
  4. Larger Block size (Compulsory misses)
  5. Larger Cache size (Capacity misses)
  6. Higher Associativity (Conflict misses)
More Detail: Read Priority over Write on Miss
[Figure: Processor and Cache, with a Write Buffer between the Cache and DRAM]
• Write Buffer is needed between the Cache and Memory
  • Processor: writes data into the cache and the write buffer
  • Memory controller: writes contents of the buffer to memory
• Write buffer is just a FIFO:
  • Typical number of entries: 4
  • Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
  • Must handle burst behavior as well!
RAW Hazards from Write Buffer!
[Figure: timing of DRAM accesses (RAS/CAS, Write DATA, Read DATA) for Proc and Processor + DRAM, showing reads and writes interleaved]
• Write-Buffer Issues: could introduce a RAW hazard with memory!
  • Write buffer may contain the only copy of valid data, so reads to memory may get the wrong result if we ignore the write buffer
• Solutions:
  • Simply wait for the write buffer to empty before servicing reads:
    • Might increase read miss penalty (old MIPS 1000 by 50%)
  • Check write buffer contents before read (“fully associative”):
    • If no conflicts, let the memory access continue
    • Else grab data from buffer
• Can Write Buffer help with Write Back?
  • Read miss replacing dirty block
    • Copy dirty block to write buffer while starting read to memory
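The "check write buffer contents before read" solution can be modeled in software as a small FIFO with an associative lookup. This is a sketch under assumptions: the type and function names are hypothetical, entries hold a single word rather than a block, and the 4-entry size is taken from the "typical number of entries" mentioned earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* assumed size: "typical number of entries: 4" */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

typedef struct {
    wb_entry_t e[WB_ENTRIES];
    int head;    /* index of the oldest entry */
    int count;   /* number of valid entries */
} write_buffer_t;

void wb_init(write_buffer_t *wb) { wb->head = 0; wb->count = 0; }

/* Processor side: enqueue a store. Returns false if the buffer is full,
 * in which case the processor would have to stall. */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;
    int tail = (wb->head + wb->count) % WB_ENTRIES;
    wb->e[tail].addr = addr;
    wb->e[tail].data = data;
    wb->count++;
    return true;
}

/* Memory-controller side: drain the oldest entry to memory. */
bool wb_pop(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0) return false;
    *out = wb->e[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}

/* Read-miss path: compare against every entry, newest first, so the
 * most recent store to an address wins. On a match the data is
 * forwarded from the buffer instead of reading stale memory. */
bool wb_lookup(const write_buffer_t *wb, uint32_t addr, uint32_t *data) {
    for (int i = wb->count - 1; i >= 0; i--) {
        const wb_entry_t *ent = &wb->e[(wb->head + i) % WB_ENTRIES];
        if (ent->addr == addr) { *data = ent->data; return true; }
    }
    return false;
}
```

A read that misses in `wb_lookup` has no conflict and can go to memory ahead of the buffered writes, which is what gives reads priority without the RAW hazard.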
12 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Nonblocking caches
  6. Multibanked caches
• Reducing Miss Penalty
  7. Critical word first
  8. Merging write buffers
• Reducing Miss Rate
  9. Victim Cache
  10. Hardware prefetching
  11. Compiler prefetching
  12. Compiler Optimizations
1. Fast Hit times via Small and Simple Caches • Index tag memory and then compare takes time • Small cache can help hit time since smaller memory takes less time to index • E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron • Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip • Simple direct mapping • Can overlap tag check with data transmission since no choice • Access time estimate for 90 nm using CACTI model 4.0 • Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches cs252-S07, Lecture 15
Recall: Set Associative Cache
[Figure: two-way set associative cache. Labels: Cache Index, Valid, Cache Tag, Cache Data (Cache Block 0 in each way), Adr Tag, two Compare units, Sel1/Sel0 driving a Mux, OR producing Hit, Cache Block output]
• N-way set associative: N entries for each Cache Index
  • N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  • Cache Index selects a “set” from the cache
  • The two tags in the set are compared to the input in parallel
  • Data is selected based on the tag result
• Disadvantage: time to set the mux
2. Fast Hit times via Way Prediction
• How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way SA cache?
• Way prediction: keep extra bits in the cache to predict the “way,” or block within the set, of the next cache access.
  • The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  • On a way miss, check the other blocks for matches in the next clock cycle
[Figure: timing bar. Labels: Hit Time, Way-Miss Hit Time, Miss Penalty]
• Accuracy ≈ 85%
• Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  • Used for instruction caches rather than data caches
3. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
• How to find more instruction-level parallelism? How to avoid translation from x86 to micro-ops?
• Trace cache in Pentium 4
  • Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
    • Built-in branch predictor
  • Cache the micro-ops vs. x86 instructions
    • Decode/translate from x86 to micro-ops on trace cache miss
+ Better utilizes long blocks (don’t exit in middle of block, don’t enter at label in middle of block)
- Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of word size
- Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
4: Increasing Cache Bandwidth by Pipelining • Pipeline cache access to maintain bandwidth, but higher latency • Instruction cache access pipeline stages: 1: Pentium 2: Pentium Pro through Pentium III 4: Pentium 4 • greater penalty on mispredicted branches • more clock cycles between the issue of the load and the use of the data cs252-S07, Lecture 15
5. Increasing Cache Bandwidth: Non-Blocking Caches
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  • requires F/E bits on registers or out-of-order execution
  • requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  • Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot support)
  • Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC (old data)
[Figure: bar chart per SPEC 92 benchmark, grouped into Integer and Floating Point; series for “Hit under n Misses”: 0->1, 1->2, 2->64, Base]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92
6: Increasing Cache Bandwidth via Multiple Banks
• Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  • E.g., T1 (“Niagara”) L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks, so the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is “sequential interleaving”
  • Spread block addresses sequentially across banks
  • E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …
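Sequential interleaving reduces to one division and one modulo. The function name and the 64-byte block size in the usage note are assumptions for illustration; the slide specifies only the modulo rule.

```c
#include <stdint.h>

/* Bank holding a given byte address under sequential interleaving:
 * consecutive block addresses go to consecutive banks. */
unsigned bank_of(uint64_t byte_addr, unsigned block_bytes, unsigned num_banks) {
    uint64_t block_addr = byte_addr / block_bytes;  /* drop the block offset */
    return (unsigned)(block_addr % num_banks);
}
```

With 4 banks and an assumed 64-byte block, byte addresses 0, 64, 128, and 192 land in banks 0, 1, 2, and 3, and address 256 wraps back to bank 0, so a sequential sweep keeps all banks busy.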
7. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for the full block before restarting the CPU
• Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Spatial locality means the CPU tends to want the next sequential word, so the size of the benefit of early restart alone is not clear
• Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  • Long blocks are more popular today, so Critical Word First is widely used
[Figure: a cache block with the critical word highlighted]
8. Merging Write Buffer to Reduce Miss Penalty
• Write buffer allows the processor to continue while waiting to write to memory
• If the buffer contains modified blocks, the addresses can be checked to see if the address of new data matches the address of a valid write buffer entry
  • If so, the new data are combined with that entry
• For a write-through cache, this increases the effective block size of writes to sequential words or bytes, since multiword writes use memory more efficiently
• The Sun T1 (Niagara) processor, among many others, uses write merging
9. Reducing Misses: a “Victim Cache”
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
[Figure: direct-mapped cache (TAGS, DATA) with a four-entry victim cache, each entry a tag and comparator plus one cache line of data, connected to the next lower level in the hierarchy]
10. Reducing Misses by Hardware Prefetching of Instructions & Data • Prefetching relies on having extra memory bandwidth that can be used without penalty • Instruction Prefetching • Typically, CPU fetches 2 blocks on a miss: the requested block and the next consecutive block. • Requested block is placed in instruction cache when it returns, and prefetched block is placed into instruction stream buffer • Data Prefetching • Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8 different 4 KB pages • Prefetching invoked if 2 successive L2 cache misses to a page, if distance between those cache blocks is < 256 bytes cs252-S07, Lecture 15
11. Reducing Misses by Software Prefetching Data • Data Prefetch • Load data into register (HP PA-RISC loads) • Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) • Special prefetching instructions cannot cause faults;a form of speculative execution • Issuing Prefetch Instructions takes time • Is cost of prefetch issues < savings in reduced misses? • Higher superscalar reduces difficulty of issue bandwidth cs252-S07, Lecture 15
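As a concrete instance of a non-faulting cache prefetch, GCC and Clang expose `__builtin_prefetch`. The sketch below is illustrative only: the prefetch distance of 16 elements is an arbitrary assumption that would need tuning per machine, and a simple sequential sweep like this one is usually covered by hardware prefetchers anyway.

```c
#include <stddef.h>

#define PREFETCH_DIST 16   /* hypothetical distance, in elements */

/* Sum an array while prefetching a fixed distance ahead.
 * __builtin_prefetch(addr, rw, locality): rw = 0 means read,
 * locality = 1 hints low temporal reuse. It is only a hint and
 * cannot fault. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```

Each prefetch occupies an issue slot, which is the cost the slide asks about: the transformation pays off only if the misses it hides cost more than the extra instructions.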
12. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
• Instructions
  • Reorder procedures in memory so as to reduce conflict misses
  • Profiling to look at conflicts (using tools they developed)
• Data
  • Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  • Loop Interchange: change the nesting of loops to access data in the order stored in memory
  • Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
  • Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];
• Reduces conflicts between val & key; improves spatial locality
Loop Interchange Example
    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];
• Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
• 2 misses per access to a & c vs. one miss per access; improves temporal locality, since a[i][j] and c[i][j] are reused while still in the cache
Blocking Example
    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }
• Two Inner Loops:
  • Read all NxN elements of z[]
  • Read N elements of 1 row of y[] repeatedly
  • Write N elements of 1 row of x[]
• Capacity Misses a function of N & Cache Size:
  • 2N^3 + N^2 (assuming no conflict; otherwise …)
• Idea: compute on BxB submatrix that fits
Blocking Example
    /* After (x[][] must start at zero, since partial sums are accumulated) */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B,N); k = k+1)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }
• B called Blocking Factor
• Capacity Misses from 2N^3 + N^2 to 2N^3/B + N^2
• Conflict Misses Too?
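A self-contained version of the blocked loop nest is useful for checking that blocking preserves the result. The sizes below are hypothetical and deliberately small, with a blocking factor that does not divide N so the `min` clamping is exercised; the sketch bounds the inner loops with `min(jj+B, N)` so each block covers a full B columns, and it zeroes `x` first because the blocked form accumulates partial sums.

```c
#include <stddef.h>

enum { N = 6, B = 4 };   /* hypothetical sizes; B does not divide N */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference: the unblocked triple loop. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double r = 0.0;
            for (size_t k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: works on a B x B submatrix of z at a time. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    for (size_t i = 0; i < N; i++)          /* partial sums accumulate, */
        for (size_t j = 0; j < N; j++)      /* so x must start at zero  */
            x[i][j] = 0.0;
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < min_sz(jj + B, N); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < min_sz(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

In practice B would be chosen so that the B x B block of `z` and a B-element strip of `y` fit in the cache together; the values here only demonstrate correctness.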
Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative vs. blocking size
  • Lam et al [1991]: a blocking factor of 24 had one fifth the misses of a blocking factor of 48, even though both fit in the cache
Summary of Compiler Optimizations to Reduce Cache Misses (by hand) cs252-S07, Lecture 15
Compiler Optimization vs. Memory Hierarchy Search • Compiler tries to figure out memory hierarchy optimizations • New approach: “Auto-tuners” 1st run variations of program on computer to find best combinations of optimizations (blocking, padding, …) and algorithms, then produce C code to be compiled for that computer • “Auto-tuner” targeted to numerical method • E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W cs252-S07, Lecture 15
Sparse Matrix – Search for Blocking for finite element problem [Im, Yelick, Vuduc, 2005]
[Figure: performance in Mflop/s for each block size. Labels: Mflop/s, Best: 4x2, Reference]
Best Sparse Blocking for 8 Computers
[Figure: grid of row block size (r) vs. column block size (c), each taking values 1, 2, 4, 8, marking the best choice for each computer]
• All possible column block sizes selected for 8 computers; how could a compiler know?
Conclusion
• Memory wall inspires optimizations since so much performance is lost there
  • Reducing hit time: Small and simple caches, Way prediction, Trace caches
  • Increasing cache bandwidth: Pipelined caches, Multibanked caches, Nonblocking caches
  • Reducing Miss Penalty: Critical word first, Merging write buffers
  • Reducing Miss Rate: Compiler optimizations
  • Reducing miss penalty or miss rate via parallelism: Hardware prefetching, Compiler prefetching
• “Auto-tuners”: search replacing static compilation to explore the optimization space?