
Memory


Presentation Transcript


  1. Memory

  2. Computer Performance • Performance depends in large measure on the interface between the processor and memory. • CPI (or IPC) is affected. CPI = cycles per instruction; IPC = instructions per cycle.

  3. Program locality • Temporal locality: recently referenced information is likely to be referenced again soon. • Spatial locality: addresses near a referenced address are likely to be referenced. • Sequential locality (a special case of spatial locality, typical of instruction fetch): the next address is likely to be used.
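As a minimal illustration of these ideas (a sketch, not from the slides), a simple array sum exhibits both temporal and spatial locality:

```python
# Illustrative sketch: summing an array exercises both kinds of locality.
def sum_array(a):
    total = 0      # temporal locality: `total` is reused on every iteration
    for x in a:    # spatial/sequential locality: elements sit at adjacent addresses
        total += x
    return total

print(sum_array(list(range(10))))  # -> 45
```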

  4. Caches • Instruction cache • Data cache • Unified cache • Split cache: I-cache & D-cache

  5. Tradeoffs • Usually the cache (L1 and L2) is within the CPU chip. • Thus, chip area is a concern. • Design alternatives: • Large I-cache • Equal area: I- and D-cache • Large unified cache • Multi-level split

  6. Some terms • Read: any read access (by the processor) to the cache, either instruction (instruction fetch) or data • Write: any write access (by the processor) into the cache • Fetch: a read to main memory to load a block into the cache • Access: any reference to memory

  7. Miss Ratio • Miss ratio = (number of references to cache not found in cache) / (total references to cache) • Per-instruction I-cache miss ratio = (number of I-cache references not found in cache) / (instructions executed)
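The ratio is computed directly from the two counts; the numbers below are hypothetical:

```python
# Miss ratio = misses / total references (counts below are made up)
def miss_ratio(misses, total_references):
    return misses / total_references

print(miss_ratio(40, 1000))  # -> 0.04
```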

  8. Placement Problem [Figure: blocks of main memory must be placed into the smaller cache memory]

  9. Placement Policies • WHERE to put a block in cache • Mapping between main and cache memories. • Main memory has a much larger capacity than cache memory.

  10. Fully Associative Cache [Figure: 32-block main memory, 8-block cache] A block can be placed in any location in the cache.

  11. Direct Mapped Cache [Figure: 32-block main memory, 8-block cache] Cache index = (block address) MOD (number of blocks in cache); e.g., 12 MOD 8 = 4. A block can be placed ONLY in a single location in the cache.
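The slide's MOD computation can be sketched as follows (the function name is my own):

```python
# Direct-mapped placement: each memory block maps to exactly one cache line.
def direct_mapped_index(block_address, num_cache_blocks):
    return block_address % num_cache_blocks

print(direct_mapped_index(12, 8))  # -> 4, as in the slide
```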

  12. Set Associative Cache [Figure: 32-block main memory; 8-block cache organized as 4 sets of 2 blocks] Set index = (block address) MOD (number of sets in cache); e.g., 12 MOD 4 = 0. A block can be placed in one of n locations in an n-way set associative cache.
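The set-index computation from the slide, as a sketch (the function name is my own):

```python
# Set-associative placement: a block maps to one set, then to any way in that set.
def set_index(block_address, num_sets):
    return block_address % num_sets

print(set_index(12, 4))  # -> 0, as in the slide
```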

  13. [Figure: the CPU's memory address undergoes name translation into a cache address]

  14. Cache organization [Figure: the CPU address is split into Tag <21>, Index <8>, and block offset <5>; each cache entry stores Valid <1>, Tag <21>, and Data <256>; the stored tag is compared (=) with the address tag and a MUX selects the requested data]
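Using the field widths from the slide (21-bit tag, 8-bit index, 5-bit block offset), an address can be decomposed like this (a sketch; names are my own):

```python
# Field widths from the slide: tag <21>, index <8>, block offset <5>
TAG_BITS, INDEX_BITS, OFFSET_BITS = 21, 8, 5

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # byte within the 32-byte block
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # selects 1 of 256 cache entries
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                  # compared against the stored tag
    return tag, index, offset

print(split_address(0x12345))  # -> (9, 26, 5)
```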

  15. TAG • Contains the “sector name” of the block. [Figure: the tag field of the main memory address is compared (=) with the tag stored alongside the data block]

  16. Valid bit(s) • Indicates whether the block contains valid data. • After the initial program load into main memory, the cache contains no valid data. • Cache coherency in multiprocessors: data in main memory could have been changed by another processor (or process).

  17. Dirty bit(s) • Indicates whether the block has been written to. • Not needed in I-caches. • Not needed in a write-through D-cache. • A write-back D-cache needs it.

  18. Write back [Figure: the CPU writes into the D-cache; main memory is updated only when a dirty block is replaced]

  19. Write through [Figure: CPU writes go both to the cache and to main memory]

  20. Cache Performance • Effective access time (in cycles): Tea = Hit time + Miss rate × Miss penalty
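The formula in code, with illustrative numbers (the 5% miss rate and 50-cycle penalty are assumptions, not from the slide):

```python
# Tea = Hit time + Miss rate * Miss penalty
def effective_access_time(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(effective_access_time(1, 0.05, 50))  # -> 3.5 cycles
```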

  21. [Figure: Associativity versus D-cache misses per 1000 instructions]

  22. [Figure: memory hierarchy: CPU, cache memory, main memory]

  23. Replacement Policies • FIFO (First-In First-Out) • Random • Least-recently used (LRU) • Replace the block that has been unused for the longest time.
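A minimal sketch of LRU bookkeeping for a single n-way set (illustrative only; the class and method names are my own):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set; tags are kept in least-recently-used-first order."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit, False on a miss (which fills the block)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)      # hit: mark as most recently used
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)   # miss: evict the LRU block
        self.tags[tag] = True
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in "abacb"])  # -> [False, False, True, False, False]
```

The third access hits because 'a' was touched recently; fetching 'c' then evicts 'b', the block unused for the longest time.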

  24. Reducing Cache Miss Penalties • Multilevel cache • Critical word first • Priority to Read over Write • Victim cache

  25. Multilevel caches [Figure: CPU, L1, L2, L3, main memory in a hierarchy]

  26. Multilevel caches • Average access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
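In code, with illustrative numbers (all of the rates and penalties below are assumptions, not from the slides):

```python
# AAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2)
def two_level_aat(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, penalty_l2):
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2)

print(round(two_level_aat(1, 0.04, 10, 0.2, 100), 2))  # -> 2.2 cycles
```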

  27. Critical word first • The missed word is requested to be sent first from the next level. Priority to read over write misses • Writes are not as critical as reads. • The CPU cannot continue if data or an instruction has not been read.

  28. Victim cache • A small fully associative cache. • Contains blocks that have been discarded (“victims”). • A four-entry victim cache removed 20-90% of the conflict misses of a 4 KB direct-mapped cache.

  29. Pseudo Associative Cache • Hit time of a direct-mapped cache. • On a miss: • check the other candidate entry in the cache • (change the most significant bit of the index) • If found there, no long wait (a pseudo hit).

  30. Reducing Cache Miss Rate • Larger block size • Larger caches • Higher Associativity • Pseudo-associative cache • Prefetching

  31. Larger Block Size • Large blocks take advantage of spatial locality. • On the other hand, fewer blocks fit in the cache.

  32. [Figure: Miss rate versus block size]

  33. Example • Memory takes 80 clock cycles of overhead, then delivers 16 bytes every 2 clock cycles (cc): 1 cycle (miss) + 1 cycle (read). • The memory system therefore supplies 16 bytes in 82 cc and 32 bytes in 84 cc.
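The slide's arithmetic, as a small model (the function and parameter names are my own):

```python
# 80-cycle overhead, then 16 bytes delivered every 2 clock cycles
def block_fetch_cycles(block_bytes, overhead=80, bytes_per_beat=16, cycles_per_beat=2):
    beats = -(-block_bytes // bytes_per_beat)  # ceiling division
    return overhead + beats * cycles_per_beat

print(block_fetch_cycles(16))  # -> 82 cc
print(block_fetch_cycles(32))  # -> 84 cc
```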

  34. Example (cont’d) • Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1 • For a 16-byte block in a 4 KB cache: Ave. access time = 1 + (8.57% × 82) = 8.0274

  35. Example (cont’d) • Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1 • For a 64-byte block in a 256 KB cache (miss rate 0.51%): Ave. access time = 1 + (0.51% × 88) = 1.449
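Both of the slide's calculations can be reproduced directly:

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# 16-byte block, 4 KB cache (miss rate 8.57%, penalty 82 cc)
print(round(avg_access_time(1, 0.0857, 82), 4))  # -> 8.0274
# 64-byte block, 256 KB cache (miss rate 0.51%, penalty 88 cc)
print(round(avg_access_time(1, 0.0051, 88), 4))  # -> 1.4488
```

Here the larger block wins: the miss rate falls far more than the penalty grows.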

  36. [Figure: Average memory access time versus block size]

  37. [Figure: Two-way set associative cache (Alpha)]

  38. Higher Associativity • Higher associativity is NOT free: • a slower clock may be required. • Thus, there is a trade-off between associativity (with a higher hit rate) and a faster clock (direct-mapped).

  39. Associativity example • If the clock cycle time (cct) increases as follows: • cct_2-way = 1.10 × cct_1-way • cct_4-way = 1.12 × cct_1-way • cct_8-way = 1.14 × cct_1-way • Miss penalty is 50 cycles • Hit time is 1 cycle
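Putting the slide's numbers together (the clock-stretch factors, penalty, and hit time are from the slide; the miss rates below are assumed for illustration only, since the slide does not give them):

```python
MISS_PENALTY = 50  # cycles (from the slide)
HIT_TIME = 1       # cycle  (from the slide)

cct_factor = {'1-way': 1.00, '2-way': 1.10, '4-way': 1.12, '8-way': 1.14}
miss_rate = {'1-way': 0.098, '2-way': 0.076, '4-way': 0.071, '8-way': 0.068}  # assumed

results = {}
for way in cct_factor:
    results[way] = HIT_TIME * cct_factor[way] + miss_rate[way] * MISS_PENALTY
    print(f"{way}: {results[way]:.2f} cycles")
```

With these assumed miss rates, the higher hit rate of greater associativity more than pays for the slower clock.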

  40. Pseudo Associative Cache • Hit time of a direct-mapped cache. • On a miss: • check the other candidate entry in the cache • (change the most significant bit of the index) • If found there, no long wait (a pseudo hit).

  41. Pseudo associative timing [Figure: hit time < pseudo hit time < miss penalty time]

  42. Hardware prefetching • What about fetching two blocks on a miss? (as done by the Alpha AXP 21064) • Problem: real misses may be delayed due to prefetching.

  43. Fetch Policies • Memory references: • 90% reads • 10% writes • Thus, read has higher priority

  44. Fetch • Fetch on miss • Demand fetching • Prefetching • Instructions (I-Cache)

  45. Hardware prefetching • What about fetching two blocks on a miss? (as done by the Alpha AXP 21064) • Problem: real misses may be delayed due to prefetching.

  46. Compiler Controlled Prefetching • Register prefetching: load values into registers before they are needed. • Cache prefetching: loads data into the cache only. • This is only possible with nonblocking (lockup-free) caches.
