
High-Performance Computer Architecture (高性能计算机系统结构)


Presentation Transcript


  1. High-Performance Computer Architecture 胡伟武 (Hu Weiwu)

  2. This Lecture
  • Basic concepts of the memory hierarchy
  • Cache organization
  • Cache performance
  • Cache optimization techniques
  • Memory hierarchies of common processors
  • 运用之妙、存乎一心 ("mastery lies in how you apply it")

  3. The speed gap between CPU and RAM
  [figure: performance vs. year, 1980-2000; CPU improves ~60%/yr ("Moore's Law") while DRAM improves only ~7%/yr ("Less' Law?"), so the processor-memory performance gap grows ~50% per year]
  • Moore's Law
  • CPU frequency and RAM capacity double roughly every 18 months
  • A memory hierarchy bridges the gap
  • Registers, cache, main memory, I/O

  4. Moore's Law has changed what goes into a CPU
  • The core idea of the von Neumann architecture
  • Stored program: instructions and data both reside in memory
  • The five components of a computer
  • Arithmetic unit, control unit, memory, input, output
  • The arithmetic and control units together form the central processing unit (CPU)
  • To ease the memory bottleneck, part of the memory is placed on chip
  • Today's CPU chip: control unit + arithmetic unit + part of the memory
  • On-chip cache takes up a large fraction of total die area

  5. Organization of a computer hardware system
  [diagram: CPU (arithmetic unit + control unit) with CACHE, connected by data and control lines to main memory, external storage, and input/output devices]

  6. Generations of Microprocessors
  • Time of a full cache miss in instructions executed:
  1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136
  2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648
  • 1/2X latency x 3X clock rate x 3X instr/clock => ~5X

  7. Processor-Memory Performance Gap "Tax"
  Processor            % Area (~cost)   % Transistors (~power)
  • Alpha 21164        37%              77%
  • StrongArm SA110    61%              94%
  • Pentium Pro        64%              88%
  • 2 dies per package: Proc/I$/D$ + L2$
  • Caches have no inherent value; they only try to close the performance gap

  8. The memory hierarchy in computers
  • Basic principles of the memory hierarchy
  • Locality of reference: temporal locality and spatial locality
  • New applications (e.g., media processing) challenge traditional locality
  • Smaller, simpler hardware is faster
  • Faster hardware is more expensive

  9. Cache structure
  [diagram: the processor issues read/write requests; on a hit the cache returns the data, on a miss the request goes on to memory and memory returns the data]
  • Characteristics of a cache
  • The cache contents are a subset of main memory contents
  • The cache has no architectural meaning; it exists only to reduce memory access latency
  • The processor uses the same address to access the cache and main memory
  • Structural features of a cache
  • Stores both data and addresses (tags)
  • An address comparison decides whether the requested data is in the cache
  • Must handle the case where the requested data is not in the cache
  • Replacement policy, write policy, etc.

  10. Cache organization
  • Block placement
  • Fully associative
  • Set associative
  • Direct mapped
  • Replacement policy on a miss?
  • Random, LRU, FIFO
  • Write policy
  • Write hit: write back vs. write through
  • Write miss: write allocate vs. write non-allocate

  11. Fully associative, direct mapped, set associative
  [diagram: where the same memory block may reside in (a) a direct-mapped, (b) a fully associative, and (c) a set-associative cache]

  12. Fully associative
  [diagram: the address splits into Tag and Offset; the tag is compared against every cache entry in parallel and a mux selects the hitting entry's data]
  • High hit rate
  • Complex hardware, long delay

  13. Direct mapped
  [diagram: the address splits into Tag, Index and Offset; the index selects a single entry whose tag is compared; a mux selects the data on a hit]
  • Simplest hardware, lowest delay
  • Lower hit rate

  14. Set associative
  [diagram: the index selects a set; the tags of all ways in the set are compared in parallel and a mux picks the hitting way's data]
  • A compromise between fully associative and direct mapped

  15. Cache replacement algorithms
  • A direct-mapped cache has no replacement decision to make
  • Common replacement algorithms
  • Random replacement
  • LRU
  • FIFO
  • Misses per 1000 instructions
  • 10 SPEC CPU2000 programs: gap, gcc, gzip, mcf, perl, applu, art, equake, lucas, swim
  • Alpha architecture, 64-byte blocks

  16. Write-Through vs Write-Back
  • Policy on a write hit
  • Write-through: update the cache and the underlying memory
  • Memory (or other processors) always has the latest data; cached data can always be discarded
  • Cache control bit: only a valid bit
  • Simpler cache management
  • Write-back: all writes simply update the cache
  • Can't just discard cached data - it may have to be written back to memory
  • Cache control bits: both valid and dirty bits
  • Lower bandwidth demand, since data is often overwritten multiple times
  • Better tolerance of long-latency memory

  17. Write Allocate vs Non-Allocate
  • Policy on a write miss
  • Write allocate: allocate a new cache line in the cache
  • Usually means doing a "read miss" to fill in the rest of the cache line
  • Alternative: per-word valid bits
  • Write-back caches generally use write allocate
  • Write non-allocate (or "write-around"):
  • Simply send the write data through to the underlying memory/cache - don't allocate a new cache line
  • Write-through caches generally use write non-allocate

  18. Cache performance analysis
  • Miss-oriented approach to memory access:
  • CPI_Execution includes ALU and memory instructions
  • Separating out the memory component entirely:
  • AMAT = Average Memory Access Time
  • CPI_ALUOps does not include memory instructions
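The formulas this slide refers to (rendered as images in the original deck) are, assuming the standard Hennessy & Patterson definitions the bullets name:

```latex
% Miss-oriented view: fold memory stalls into CPI
\text{CPUtime} = IC \times \left( CPI_{\text{Execution}}
  + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right)
  \times \text{CycleTime}

% Separating out the memory component entirely
\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}

\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{\text{AluOps}}
  + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right)
  \times \text{CycleTime}
```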

  19. Cache performance optimization
  • Reduce miss rate
  • Three kinds of misses: compulsory (cold), capacity, conflict
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  20. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  21. Reducing Misses
  • Classifying Misses: 3 Cs
  • Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache)
  • Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
  • Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
  • Coherence: misses caused by cache coherence.

  22. 3Cs Absolute Miss Rate (SPEC92)
  [figure: miss rate per type vs. cache size (1KB-128KB) for 1-, 2-, 4- and 8-way associativity; conflict misses shrink as associativity rises; compulsory misses are vanishingly small]

  23. 3Cs Relative Miss Rate
  [figure: the same data normalized to 100%: the relative shares of conflict, capacity and compulsory misses vs. cache size (1KB-128KB) and associativity]
  • Flaws: for fixed block size
  • Good: insight => invention

  24. 2:1 Cache Rule
  [figure: same per-type miss-rate data, illustrating the rule]
  miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2

  25. SPEC CPU2000, Alpha architecture

  26. 1. Reducing miss rate by increasing block size
  • Exploits spatial locality
  • Reduces cold misses; increases conflict and capacity misses
  • SPEC92, DECstation 5000
  • Small caches call for smaller blocks

  27. 2. Increasing cache size to raise the hit rate
  • L1 cache access time directly determines the clock cycle
  • Especially in deep-submicron processes, where wire delay is large
  • The PIII L1 cache is 16KB; the PIV L1 data cache is 8KB
  • A larger on-chip cache means a larger die
  • In some processors, on-chip cache takes more than 80% of total die area
  • Modern L2/L3 caches have reached several MB or even tens of MB
  • L1 cache sizes of modern general-purpose processors:
  • HP PA8700: 1MB + 1.5MB L1, no L2 cache
  • Other RISC processors (Alpha, Power, MIPS, UltraSPARC): 32/64KB + 32/64KB
  • PIV: 12K-uop trace cache + 8KB data cache (PIII: 16KB + 16KB)
  • These choices reflect different designers' trade-offs

  28. 3. Reducing miss rate by increasing associativity
  • Increasing associativity
  • 2:1 rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
  • An 8-way set-associative cache already comes very close to the fully associative miss rate
  • Beware: execution time is the only final measure!
  • Will clock cycle time increase?
  • Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
  • Cache access may be on the processor's critical path

  29. Miss rate and average memory access time
  • Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct-mapped CCT
  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20
  (Red in the original marks cases where A.M.A.T. is not improved by more associativity)

  30. 4. Raising the hit rate with way prediction and pseudo-associativity
  • Combines the short access time of direct mapping with the high hit rate of set associativity
  • In a multi-way cache, each access checks only one way first; on a miss the other ways are checked
  • Two kinds of hits: hit and pseudo-hit
  • Check way 0 first, or use way prediction: the Alpha 21264 ICache way predictor hits 85% of the time; a hit takes 1 cycle, a pseudo-hit takes 3
  • Since the ways are not all accessed in parallel, power drops substantially
  • Drawback: complicates pipeline control; used more often at L2 and below, including the MIPS R10000 and UltraSPARC L2 caches
  [diagram: timeline showing hit time < pseudo-hit time < miss penalty]

  31. 5. Reducing miss rate by software optimization
  • McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
  • Instructions
  • Reorder procedures in memory so as to reduce conflict misses
  • Profiling to look at conflicts (using tools they developed)
  • Data
  • Merging arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
  • Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  • Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  • Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

  32. Merging Arrays Example
  /* Before: 2 sequential arrays */
  int val[SIZE];
  int key[SIZE];
  /* After: 1 array of structures */
  struct merge {
    int val;
    int key;
  };
  struct merge merged_array[SIZE];
  Reduces conflicts between val & key; improves spatial locality

  33. Loop Interchange Example
  /* Before */
  for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
      for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
  /* After */
  for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
      for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
  Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  34. Loop Fusion Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];
  /* After */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
    }
  2 misses per access to a & c vs. one miss per access; improved temporal locality

  35. Blocking Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      r = 0;
      for (k = 0; k < N; k = k+1)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    };
  Two inner loops:
  • Read all NxN elements of z[]
  • Read N elements of 1 row of y[] repeatedly
  • Write N elements of 1 row of x[]
  • Capacity misses are a function of N & cache size:
  • 2N^3 + N^2 words accessed (assuming no conflict; otherwise ...)
  • Idea: compute on a BxB submatrix that fits

  36. Blocking Example
  /* After */
  for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
      for (i = 0; i < N; i = i+1)
        for (j = jj; j < min(jj+B,N); j = j+1) {
          r = 0;
          for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        };
  • B is called the blocking factor
  • Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
  • Conflict misses too?

  37. Reducing Conflict Misses by Blocking
  [figure: conflict misses in caches that are not fully associative vs. blocking size]
  • Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, despite both fitting in the cache

  38. Effect of the software optimizations

  39. Summary: reducing miss rate
  • 3 Cs: compulsory, capacity, conflict misses
  • Reducing miss rate:
  • Reduce misses via larger block size
  • Reduce misses via larger cache size
  • Reduce misses via higher associativity
  • Reduce misses via pseudo-associativity
  • Reduce misses by compiler optimizations
  • Remember the danger of concentrating on just one parameter when evaluating performance

  40. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  41. 1. Reducing miss penalty with critical word first
  • Don't wait for the full block to be loaded before restarting the CPU
  • Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  • Generally useful only for large blocks

  42. 2. Reducing miss penalty with read priority over writes
  • Write-through caches with write buffers can create RAW conflicts between main-memory reads on cache misses and buffered writes
  • Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000)
  • Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
  • Write back?
  • A read miss replacing a dirty block
  • Normal: write the dirty block to memory, then do the read
  • Instead: copy the dirty block to a write buffer, then do the read, then do the write
  • The CPU stalls less, since it restarts as soon as the read is done

  43. 3. Reducing miss penalty with write merging
  • A write-through cache depends on a write buffer (WB)
  • The processor's write completes once the data is in the WB; the WB then writes to the next level of memory
  • Mind consistency: the processor's subsequent reads, device DMA, etc.
  • A write-back cache also uses a WB to hold evicted cache blocks temporarily
  • Merging writes to the same cache block within the write buffer raises buffer utilization and reduces processor stalls
  • Note: I/O writes must not be merged

  44. 4. Reducing miss penalty with a victim cache
  • How to combine the fast hit time of direct mapping yet still avoid conflict misses?
  • Add a small buffer to hold data recently discarded from the cache
  • Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
  • Used in Alpha, HP machines

  45. 5. Reducing miss penalty with a second-level cache
  • L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
  • Definitions:
  • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
  • The global miss rate is what matters

  46. Comparing Local and Global Miss Rates
  [figure: local and global miss rates for a 32 KByte first-level cache with increasing second-level cache size]
  • The global miss rate is close to the single-level-cache miss rate provided L2 >> L1
  • Don't use the local miss rate
  • L2 is not tied to the CPU clock cycle!
  • What matters for L2: cost & A.M.A.T.
  • Generally fast hit times and fewer misses
  • Since hits are few, target miss reduction

  47. L2 cache block size & A.M.A.T.
  [figure: A.M.A.T. vs. L2 block size, for a 32KB L1 and an 8-byte path to memory]

  48. Reducing Miss Penalty Summary
  • Five techniques
  • Early restart and critical word first on a miss
  • Read priority over writes on a miss
  • Merging write buffer
  • Victim cache
  • Second-level cache
  • Can be applied recursively to multilevel caches
  • The danger is that the time to DRAM grows with multiple levels in between
  • First attempts at L2 caches can make things worse, since the increased worst case is worse

  49. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  50. 1. Simplifying cache design
  • Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache + a 96KB second-level cache?
  • A small data cache keeps the clock rate high
  • The PIV data cache shrank from the PIII's 16KB to 8KB
  • Direct mapped, on chip
