
COMPSYS 304


Presentation Transcript


  1. COMPSYS 304 Computer Architecture: Cache. John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland. [Slide photo: Iolanthe at 13 knots on Cockburn Sound, WA]

  2. Memory Bottleneck • State-of-the-art processor • f = 3 GHz • tclock = 330 ps • 1-2 instructions per cycle • ~25% of instructions reference memory • Memory response: 4 instructions x 330 ps ≈ 1.3 ns needed! • Bulk semiconductor RAM (DRAM): 100 ns+ for a ‘random’ access! • Processor will spend most of its time waiting for memory!
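A quick worked check of the figures above (assuming one instruction per cycle, so a memory reference arrives roughly every 4 instructions):

```latex
t_{\mathrm{needed}} \approx 4 \times t_{\mathrm{clock}} = 4 \times 330\,\mathrm{ps} \approx 1.3\,\mathrm{ns},
\qquad
t_{\mathrm{DRAM}} \gtrsim 100\,\mathrm{ns} \approx 75 \times t_{\mathrm{needed}}
```

so raw DRAM is nearly two orders of magnitude too slow to keep the pipeline fed.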

  3. Cache • Small, fast memory • Typically ~50 kbytes (1998) • 2-cycle access time • Same die as processor • “Off-chip” cache possible • Custom cache chip closely coupled to processor • Use fast static RAM (SRAM) rather than slower dynamic RAM • Several levels possible • 2nd level of the memory hierarchy • “Caches” most recently used memory locations “closer” to the processor • closer = closer in time

  4. Cache • Etymology • cacher (French) = “to hide” • Transparent to a program • Programs simply run slower without it • Modern processors rely on it • Reduces the cost of main memory access • Enables 1-2 instruction/cycle throughput • Typical program • ~25% of instructions are memory accesses

  5. Cache • Relies upon locality of reference • Programs continually use - and re-use - the same locations • Instructions • loops • common subroutines • Data • look-up tables • “working” data sets

  6. Cache - operation • Memory requests checked in cache first • If the word sought is in the cache, it’s read from (or updated in) the cache • Cache hit • If not, the request is passed to main memory and the data is read (written) there • Cache miss [Figure: the CPU issues a virtual address (VA); the MMU translates it to a physical address (PA), which is checked in the cache before going to main memory; both data (D) and instructions (I) take this path]
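The hit/miss flow above, as a minimal C sketch; cache_lookup(), cache_fill() and memory_read() are hypothetical helpers standing in for the hardware, not part of the lecture:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the cache and main memory. */
bool     cache_lookup(uint32_t addr, uint32_t *word);
void     cache_fill(uint32_t addr, uint32_t word);
uint32_t memory_read(uint32_t addr);

/* The flow on the slide: check the cache first, fall back to memory. */
uint32_t read_word(uint32_t addr)
{
    uint32_t word;
    if (cache_lookup(addr, &word))  /* cache hit: served from cache */
        return word;
    word = memory_read(addr);       /* cache miss: go to main memory */
    cache_fill(addr, word);         /* keep a copy for next time */
    return word;
}
```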

  7. Cache - operation • Hit rates of 95% are usual • Cache: 16 kbytes • Effective Memory Access Time • Cache: 2 cycles • Main memory: 10 cycles • Average access: 0.95*2 + 0.05*10 = 2.4 cycles
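The average on the slide is the standard effective-access-time formula with hit rate h, cache time t_c and main-memory time t_m:

```latex
t_{\mathrm{avg}} = h \, t_c + (1-h) \, t_m = 0.95 \times 2 + 0.05 \times 10 = 2.4 \ \text{cycles}
```

(This model charges the full memory time only on a miss; some texts instead add a miss penalty on top of the cache probe time, giving t_c + (1-h) t_penalty.)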

  8. Cache - organisation • Direct-mapped cache • Each word in the cache has a tag • Assume • cache size: 2^k words • machine words: p bits • byte-addressed memory • m = log2(p/8) low-order bits select the byte within a word • m = 2 for 32-bit machines • Address format (p bits): | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
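A minimal sketch of the address split in C, assuming p = 32, k = 11 (a 2-kword cache) and m = 2; the field names follow the slide, the concrete sizes and address are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define K 11                 /* cache holds 2^K = 2048 words     */
#define M 2                  /* 32-bit words -> 4 bytes -> m = 2 */

int main(void)
{
    uint32_t addr = 0x12345678;

    uint32_t byte  = addr & ((1u << M) - 1);         /* m low bits           */
    uint32_t index = (addr >> M) & ((1u << K) - 1);  /* k-bit cache address  */
    uint32_t tag   = addr >> (K + M);                /* remaining p-k-m bits */

    printf("tag=%#x index=%#x byte=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)byte);
    return 0;
}
```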

  9. Cache - organisation • Direct-mapped cache [Figure: a cache line holds a (p-k-m)-bit tag plus a p-bit data word; the k-bit cache address field of the memory address selects one of 2^k lines, and the stored tag is compared with the address’s tag field to produce the Hit? signal]

  10. Cache - Direct Mapped • Conflicts • Two addresses separated by 2^(k+m) bytes will hit the same cache location • 32-bit machine, 64-kbyte (16-kword) cache • m = 2, k = 14 • Any program or data set larger than 64 kbytes will generate conflicts • On a conflict, the ‘old’ word is flushed • An unmodified word • (program, constant data) • is simply overwritten by the new data from memory • Modified data needs to be written back to memory before being overwritten
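To see the conflict concretely: any two addresses that differ only in bits above k+m produce the same cache index. A small check using the slide’s k = 14, m = 2 figures (the sample address is arbitrary):

```c
#include <assert.h>
#include <stdint.h>

#define K 14
#define M 2

static uint32_t line_index(uint32_t addr)
{
    return (addr >> M) & ((1u << K) - 1);
}

int main(void)
{
    uint32_t a = 0x00010123;
    uint32_t b = a + (1u << (K + M));   /* 2^(k+m) = 64 KB apart */

    /* Both map to the same cache line, so they evict each other. */
    assert(line_index(a) == line_index(b));
    return 0;
}
```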

  11. Cache - Conflicts • Modified or dirty words • When a word has been modified in cache • Write-back cache • Only writes data back when needed • A miss then needs two memory accesses • Write the modified word back • Read the new word • Write-through cache • A low-priority write to main memory is queued • Processor is delayed by reads only • Memory write occurs in parallel with other work • Instruction and necessary data fetches take priority
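A sketch of the two write policies in C; memory_write() and queue_write() are hypothetical stand-ins for the bus interface unit:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bus-interface helpers. */
void memory_write(uint32_t addr, uint32_t value);
void queue_write(uint32_t addr, uint32_t value);

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;
} line_t;

/* Write-back: update the cache only and mark the line dirty;
   memory is updated later, when the line is evicted. */
void write_back_store(line_t *line, uint32_t value)
{
    line->data  = value;
    line->dirty = true;
}

/* On eviction, a dirty line costs an extra memory access
   before the new word can be read in. */
void write_back_evict(line_t *line, uint32_t old_addr)
{
    if (line->dirty)
        memory_write(old_addr, line->data);  /* write modified word back */
}

/* Write-through: update the cache and queue a low-priority memory
   write; the processor only stalls for reads. */
void write_through_store(line_t *line, uint32_t addr, uint32_t value)
{
    line->data = value;          /* the line is never 'dirty'     */
    queue_write(addr, value);    /* completes in parallel with other work */
}
```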

  12. Cache - Write-through or write-back? • Write-through • Allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (main memory bus) • Reads (instruction and data) need priority! • They stall the processor • Writes can be delayed • At least until the location is needed! • More on intelligent system interface units later • but ...

  13. Cache - Write-through or write-back? • Write-through • Seems a good idea! • but ... • Multiple writes to the same location waste memory bus bandwidth • Typical programs run better with write-back caches • however • Often you can easily predict which will be best • Some processors (e.g. PowerPC) allow you to classify memory regions as write-back or write-through

  14. Cache - more bits • Cache lines need some status bits • Tag bits plus: • Valid • All set to false on power-up • Set to true as words are loaded into cache • Dirty • Needed by a write-back cache • A write-through cache always queues the write, so lines are never ‘dirty’
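One way to picture the per-line overhead is as a packed struct; the field widths below assume p = 32, k = 12, m = 2 and are purely illustrative:

```c
typedef struct {
    unsigned int tag   : 18;  /* p-k-m address bits stored with the line    */
    unsigned int valid : 1;   /* false at power-up, set when line is loaded */
    unsigned int dirty : 1;   /* write-back only; write-through never sets it */
} line_status_t;
```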

  15. Cache - Improving Performance • Conflicts (addresses 2^(k+m) bytes apart) • Degrade cache performance • Lower hit rate • Murphy’s Law operates • Addresses are never random! • Some locations ‘thrash’ in cache • Continually replaced and restored
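Thrashing is easy to provoke in a direct-mapped cache. A sketch, assuming the 64-kbyte cache above: if the two arrays happen to be laid out 64 KB apart (which the linker may well do here), a[i] and b[i] always collide on the same line, so each access may evict the other’s data:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BYTES (1u << 16)   /* 64 KB direct-mapped cache */

static int32_t a[CACHE_BYTES / sizeof(int32_t)];
static int32_t b[CACHE_BYTES / sizeof(int32_t)];   /* possibly 64 KB after a */

int64_t dot(void)
{
    int64_t sum = 0;
    for (size_t i = 0; i < CACHE_BYTES / sizeof(int32_t); i++)
        sum += (int64_t)a[i] * b[i];   /* a[i] and b[i] share a cache line:
                                          every access can miss            */
    return sum;
}
```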

  16. Cache - Fully Associative • All tags are compared at the same time • Words can use any cache line

  17. Cache - Fully Associative • Associative • Each tag is compared at the same time • Any match ⇒ hit • Avoids ‘unnecessary’ flushing • Replacement • Least Recently Used - LRU • Needs extra status bits • Cycles since last accessed • Hardware cost high • Extra comparators • Wider tags • p-m bits vs p-k-m bits
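A software model of the fully associative lookup with true LRU; in hardware all tags are compared in parallel, here the loop stands in for the comparators. Sizes and names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define NLINES 8

typedef struct {
    bool     valid;
    uint32_t tag;          /* p-m bits: no index field in the address */
    uint64_t last_used;    /* cycle of last access, for LRU           */
} fa_line_t;

static fa_line_t lines[NLINES];
static uint64_t  now;      /* incremented on every access */

/* Returns the hit line, or the LRU victim to replace on a miss. */
int fa_lookup(uint32_t tag, bool *hit)
{
    int victim = 0;
    now++;
    for (int i = 0; i < NLINES; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            lines[i].last_used = now;   /* refresh LRU state on a hit */
            *hit = true;
            return i;
        }
        if (lines[i].last_used < lines[victim].last_used)
            victim = i;                 /* track the least recently used */
    }
    *hit = false;
    return victim;
}
```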

  18. Cache - Set Associative [Figure: 2-way set associative cache; each set holds two tagged words, so only two comparators are needed]

  19. Cache - Set Associative • n-way set associative caches • n can be small: 2, 4, 8 • Best performance • Reasonable hardware cost • Most high performance processors • Replacement policy • LRU choice from n • Reasonable LRU approximation • 1 or 2 bits • Set on access • Cleared / decremented by timer • Choose cleared word for replacement
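A sketch of the 1-bit LRU approximation described above, for a 2-way set associative cache: each access sets the line’s reference bit, a periodic timer clears them, and replacement prefers a cleared line (all names and sizes illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define SETS 256
#define WAYS 2

typedef struct {
    bool     valid, referenced;
    uint32_t tag;
} sa_line_t;

static sa_line_t sets[SETS][WAYS];

/* Called periodically (e.g. by a timer) to age the reference bits. */
void age_reference_bits(void)
{
    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++)
            sets[s][w].referenced = false;
}

/* Only WAYS comparators are needed: one per line in the set. */
int sa_lookup(uint32_t set, uint32_t tag, bool *hit)
{
    for (int w = 0; w < WAYS; w++) {
        if (sets[set][w].valid && sets[set][w].tag == tag) {
            sets[set][w].referenced = true;   /* set on access */
            *hit = true;
            return w;
        }
    }
    *hit = false;
    for (int w = 0; w < WAYS; w++)    /* choose a cleared way as the victim */
        if (!sets[set][w].referenced)
            return w;
    return 0;                         /* all recently used: fall back to way 0 */
}
```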

  20. Cache - Locality of Reference • Temporal Locality • Same location will be referenced again soon • Access same data again • Program loops - access same instruction again • Caches described so far exploit temporal locality • Spatial Locality • Nearby locations will be referenced soon • Next element of an array • Next instruction of a program
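Spatial locality is why loop order matters. A standard example (not from the slides): C stores 2-D arrays row-major, so traversing along rows touches consecutive bytes of each fetched cache line, while traversing down columns jumps a whole row between accesses:

```c
#define N 1024
static double m[N][N];

double sum_rows(void)   /* good spatial locality: consecutive addresses */
{
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_cols(void)   /* poor spatial locality: stride of N*8 bytes */
{
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```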

  21. Cache - Line Length • Spatial Locality • Use very long cache lines • Fetch one datum • Neighbours fetched also • PowerPC 601 (Motorola/Apple/IBM), first of the single-chip Power processors • 64 sets • 8-way set associative • 32 bytes per line • 32 bytes (8 instructions) fetched into instruction buffer in one cycle • 64 x 8 x 32 = 16 kbytes total

  22. Cache - Separate I- and D-caches • Unified cache • Instructions and data in the same cache • Two caches • Instructions • Data • Increases total bandwidth • MIPS R10000 • 32-kbyte instruction; 32-kbyte data • Instruction cache is pre-decoded! (32 → 36 bits) • Data • 8-word (64-byte) line, 2-way set associative • 256 sets • Replacement policy?
