
High-Performance Computer Architecture (高性能计算机系统结构)


Presentation Transcript


  1. High-Performance Computer Architecture 胡伟武 (Hu Weiwu)

  2. This Lecture
  • Basic concepts of the memory hierarchy
  • Cache organization
  • Cache performance
  • Cache optimization techniques
  • Memory hierarchies of common processors
  • 运用之妙、存乎一心 ("mastery lies in how you apply it")

  3. The speed gap between CPU and RAM
  [figure: performance vs. year, 1980-2000; CPU improves ~60%/yr ("Moore's Law") while DRAM improves only ~7%/yr ("Less' Law?"), so the processor-memory performance gap grows ~50% per year]
  • Moore's Law
  • CPU frequency and RAM capacity double roughly every 18 months
  • A memory hierarchy bridges the gap
  • Registers, cache, main memory, I/O

  4. Moore's Law has changed what goes into a CPU
  • The core idea of the von Neumann architecture
  • Stored program: instructions and data both reside in memory
  • The five components of a computer
  • Arithmetic unit, control unit, memory, input, output
  • The arithmetic and control units together form the central processing unit (CPU)
  • To ease the memory bottleneck, part of the memory is placed on chip
  • Today's CPU chip: control unit + arithmetic unit + part of the memory
  • On-chip cache takes up a large fraction of total die area

  5. Organization of a computer hardware system
  [diagram: CPU (arithmetic unit + control unit) with CACHE, connected by data and control lines to main memory, external storage, and input/output devices]

  6. Generations of Microprocessors
  • Time of a full cache miss in instructions executed:
  1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136
  2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648
  • 1/2X latency x 3X clock rate x 3X instr/clock => ~5X

  7. Processor-Memory Performance Gap "Tax"
  Processor            % Area (~cost)   % Transistors (~power)
  • Alpha 21164        37%              77%
  • StrongArm SA110    61%              94%
  • Pentium Pro        64%              88%
  • 2 dies per package: Proc/I$/D$ + L2$
  • Caches have no inherent value; they only try to close the performance gap

  8. The memory hierarchy in computers
  • Basic principles of the memory hierarchy
  • Locality of reference: temporal locality and spatial locality
  • New applications (e.g., media processing) challenge traditional locality
  • Smaller, simpler hardware is faster
  • Faster hardware is more expensive

  9. Cache structure
  [diagram: the processor issues read/write requests; on a hit the cache returns the data, on a miss the request goes on to memory and memory returns the data]
  • Characteristics of a cache
  • The cache contents are a subset of main memory contents
  • The cache has no architectural meaning; it exists only to reduce memory access latency
  • The processor uses the same address to access the cache and main memory
  • Structural features of a cache
  • Stores both data and addresses (tags)
  • An address comparison decides whether the requested data is in the cache
  • Must handle the case where the requested data is not in the cache
  • Replacement policy, write policy, etc.

  10. Cache organization
  • Block placement
  • Fully associative
  • Set associative
  • Direct mapped
  • Replacement policy on a miss?
  • Random, LRU, FIFO
  • Write policy
  • Write hit: write back vs. write through
  • Write miss: write allocate vs. write non-allocate

  11. Fully associative, direct mapped, set associative
  [diagram: where the same memory block may reside in (a) a direct-mapped, (b) a fully associative, and (c) a set-associative cache]

  12. Fully associative
  [diagram: the address splits into Tag and Offset; the tag is compared against every cache entry in parallel and a mux selects the hitting entry's data]
  • High hit rate
  • Complex hardware, long delay

  13. Direct mapped
  [diagram: the address splits into Tag, Index and Offset; the index selects a single entry whose tag is compared; a mux selects the data on a hit]
  • Simplest hardware, lowest delay
  • Lower hit rate

  14. Set associative
  [diagram: the index selects a set; the tags of all ways in the set are compared in parallel and a mux picks the hitting way's data]
  • A compromise between fully associative and direct mapped

  15. Cache replacement algorithms
  • A direct-mapped cache has no replacement decision to make
  • Common replacement algorithms
  • Random replacement
  • LRU
  • FIFO
  • Misses per 1000 instructions
  • 10 SPEC CPU2000 programs: gap, gcc, gzip, mcf, perl, applu, art, equake, lucas, swim
  • Alpha architecture, 64-byte blocks

  16. Write-Through vs Write-Back
  • Policy on a write hit
  • Write-through: update the cache and the underlying memory
  • Memory (or other processors) always has the latest data; cached data can always be discarded
  • Cache control bit: only a valid bit
  • Simpler cache management
  • Write-back: all writes simply update the cache
  • Can't just discard cached data - it may have to be written back to memory
  • Cache control bits: both valid and dirty bits
  • Lower bandwidth demand, since data is often overwritten multiple times
  • Better tolerance of long-latency memory

  17. Write Allocate vs Non-Allocate
  • Policy on a write miss
  • Write allocate: allocate a new cache line in the cache
  • Usually means doing a "read miss" to fill in the rest of the cache line
  • Alternative: per-word valid bits
  • Write-back caches generally use write allocate
  • Write non-allocate (or "write-around"):
  • Simply send the write data through to the underlying memory/cache - don't allocate a new cache line
  • Write-through caches generally use write non-allocate

  18. Cache performance analysis
  • Miss-oriented approach to memory access:
  • CPI_Execution includes ALU and memory instructions
  • Separating out the memory component entirely:
  • AMAT = Average Memory Access Time
  • CPI_ALUOps does not include memory instructions
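The formulas this slide refers to (rendered as images in the original deck) are, assuming the standard Hennessy & Patterson definitions the bullets name:

```latex
% Miss-oriented view: fold memory stalls into CPI
\text{CPUtime} = IC \times \left( CPI_{\text{Execution}}
  + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right)
  \times \text{CycleTime}

% Separating out the memory component entirely
\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}

\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{\text{AluOps}}
  + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right)
  \times \text{CycleTime}
```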

  19. Cache performance optimization
  • Reduce miss rate
  • Three kinds of misses: compulsory (cold), capacity, conflict
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  20. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  21. Reducing Misses
  • Classifying Misses: 3 Cs
  • Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache)
  • Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
  • Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
  • Coherence: misses caused by cache coherence.

  22. 3Cs Absolute Miss Rate (SPEC92)
  [figure: miss rate per type vs. cache size (1KB-128KB) for 1-, 2-, 4- and 8-way associativity; conflict misses shrink as associativity rises; compulsory misses are vanishingly small]

  23. 3Cs Relative Miss Rate
  [figure: the same data normalized to 100%: the relative shares of conflict, capacity and compulsory misses vs. cache size (1KB-128KB) and associativity]
  • Flaws: for fixed block size
  • Good: insight => invention

  24. 2:1 Cache Rule
  [figure: same per-type miss-rate data, illustrating the rule]
  miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2

  25. SPEC CPU2000, Alpha architecture

  26. 1. Reducing miss rate by increasing block size
  • Exploits spatial locality
  • Reduces cold misses; increases conflict and capacity misses
  • SPEC92, DECstation 5000
  • Small caches call for smaller blocks

  27. 2. Increasing cache size to raise the hit rate
  • L1 cache access time directly determines the clock cycle
  • Especially in deep-submicron processes, where wire delay is large
  • The PIII L1 cache is 16KB; the PIV L1 data cache is 8KB
  • A larger on-chip cache means a larger die
  • In some processors, on-chip cache takes more than 80% of total die area
  • Modern L2/L3 caches have reached several MB or even tens of MB
  • L1 cache sizes of modern general-purpose processors:
  • HP PA8700: 1MB + 1.5MB L1, no L2 cache
  • Other RISC processors (Alpha, Power, MIPS, UltraSPARC): 32/64KB + 32/64KB
  • PIV: 12K-uop trace cache + 8KB data cache (PIII: 16KB + 16KB)
  • These choices reflect different designers' trade-offs

  28. 3. Reducing miss rate by increasing associativity
  • Increasing associativity
  • 2:1 rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
  • An 8-way set-associative cache already comes very close to the fully associative miss rate
  • Beware: execution time is the only final measure!
  • Will clock cycle time increase?
  • Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
  • Cache access may be on the processor's critical path

  29. Miss rate and average memory access time
  • Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct-mapped CCT
  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20
  (Red in the original marks cases where A.M.A.T. is not improved by more associativity)

  30. 4. Raising the hit rate with way prediction and pseudo-associativity
  • Combines the short access time of direct mapping with the high hit rate of set associativity
  • In a multi-way cache, each access checks only one way first; on a miss the other ways are checked
  • Two kinds of hits: hit and pseudo-hit
  • Check way 0 first, or use way prediction: the Alpha 21264 ICache way predictor hits 85% of the time; a hit takes 1 cycle, a pseudo-hit takes 3
  • Since the ways are not all accessed in parallel, power drops substantially
  • Drawback: complicates pipeline control; used more often at L2 and below, including the MIPS R10000 and UltraSPARC L2 caches
  [diagram: timeline showing hit time < pseudo-hit time < miss penalty]

  31. 5. Reducing miss rate by software optimization
  • McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
  • Instructions
  • Reorder procedures in memory so as to reduce conflict misses
  • Profiling to look at conflicts (using tools they developed)
  • Data
  • Merging arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
  • Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  • Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  • Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

  32. Merging Arrays Example
  /* Before: 2 sequential arrays */
  int val[SIZE];
  int key[SIZE];
  /* After: 1 array of structures */
  struct merge {
    int val;
    int key;
  };
  struct merge merged_array[SIZE];
  Reduces conflicts between val & key; improves spatial locality

  33. Loop Interchange Example
  /* Before */
  for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
      for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
  /* After */
  for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
      for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
  Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  34. Loop Fusion Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];
  /* After */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
    }
  2 misses per access to a & c vs. one miss per access; improved temporal locality

  35. Blocking Example
  /* Before */
  for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
      r = 0;
      for (k = 0; k < N; k = k+1)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    };
  Two inner loops:
  • Read all NxN elements of z[]
  • Read N elements of 1 row of y[] repeatedly
  • Write N elements of 1 row of x[]
  • Capacity misses are a function of N & cache size:
  • 2N^3 + N^2 words accessed (assuming no conflict; otherwise ...)
  • Idea: compute on a BxB submatrix that fits

  36. Blocking Example
  /* After */
  for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
      for (i = 0; i < N; i = i+1)
        for (j = jj; j < min(jj+B,N); j = j+1) {
          r = 0;
          for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        };
  • B is called the blocking factor
  • Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
  • Conflict misses too?

  37. Reducing Conflict Misses by Blocking
  [figure: conflict misses in caches that are not fully associative vs. blocking size]
  • Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, despite both fitting in the cache

  38. Effect of the software optimizations

  39. Summary: reducing miss rate
  • 3 Cs: compulsory, capacity, conflict misses
  • Reducing miss rate:
  • Reduce misses via larger block size
  • Reduce misses via larger cache size
  • Reduce misses via higher associativity
  • Reduce misses via pseudo-associativity
  • Reduce misses by compiler optimizations
  • Remember the danger of concentrating on just one parameter when evaluating performance

  40. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  41. 1. Reducing miss penalty with critical word first
  • Don't wait for the full block to be loaded before restarting the CPU
  • Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  • Generally useful only for large blocks

  42. 2. Reducing miss penalty with read priority over writes
  • Write-through caches with write buffers can create RAW conflicts between main-memory reads on cache misses and buffered writes
  • Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000)
  • Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
  • Write back?
  • A read miss replacing a dirty block
  • Normal: write the dirty block to memory, then do the read
  • Instead: copy the dirty block to a write buffer, then do the read, then do the write
  • The CPU stalls less, since it restarts as soon as the read is done

  43. 3. Reducing miss penalty with write merging
  • A write-through cache depends on a write buffer (WB)
  • The processor's write completes once the data is in the WB; the WB then writes to the next level of memory
  • Mind consistency: the processor's subsequent reads, device DMA, etc.
  • A write-back cache also uses a WB to hold evicted cache blocks temporarily
  • Merging writes to the same cache block within the write buffer raises buffer utilization and reduces processor stalls
  • Note: I/O writes must not be merged

  44. 4. Reducing miss penalty with a victim cache
  • How to combine the fast hit time of direct mapping yet still avoid conflict misses?
  • Add a small buffer to hold data recently discarded from the cache
  • Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
  • Used in Alpha, HP machines

  45. 5. Reducing miss penalty with a second-level cache
  • L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
  • Definitions:
  • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
  • The global miss rate is what matters

  46. Comparing Local and Global Miss Rates
  [figure: local and global miss rates for a 32 KByte first-level cache with increasing second-level cache size]
  • The global miss rate is close to the single-level-cache miss rate provided L2 >> L1
  • Don't use the local miss rate
  • L2 is not tied to the CPU clock cycle!
  • What matters for L2: cost & A.M.A.T.
  • Generally fast hit times and fewer misses
  • Since hits are few, target miss reduction

  47. L2 cache block size & A.M.A.T.
  [figure: A.M.A.T. vs. L2 block size, for a 32KB L1 and an 8-byte path to memory]

  48. Reducing Miss Penalty Summary
  • Five techniques
  • Early restart and critical word first on a miss
  • Read priority over writes on a miss
  • Merging write buffer
  • Victim cache
  • Second-level cache
  • Can be applied recursively to multilevel caches
  • The danger is that the time to DRAM grows with multiple levels in between
  • First attempts at L2 caches can make things worse, since the increased worst case is worse

  49. Cache performance optimization
  • Reduce miss rate
  • Larger blocks, larger caches, higher associativity, way prediction, compiler optimizations
  • Reduce miss penalty
  • Multi-level caches, critical word first, read priority over writes, write merging, victim cache
  • Reduce hit time
  • Small and simple caches, parallel cache and TLB access, more pipeline stages for cache access, trace cache
  • Hide memory latency with parallelism
  • Non-blocking caches, hardware prefetching, software prefetching

  50. 1. Simplifying cache design
  • Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache + a 96KB second-level cache?
  • A small data cache keeps the clock rate high
  • The PIV data cache shrank from the PIII's 16KB to 8KB
  • Direct mapped, on chip
