510 likes | 837 Vues
Princess Sumaya University for Technology. Computer Architecture. Dr. Esam Al_Qaralleh. Memory Hierarchy Design. 5.1 Introduction. Memory Hierarchy Design. Motivated by the principle of locality - A 90/10 type of rule Take advantage of 2 forms of locality
E N D
Princess Sumaya University for Technology Computer Architecture Dr. Esam Al_Qaralleh
Memory Hierarchy Design • Motivated by the principle of locality - A 90/10 type of rule • Take advantage of 2 forms of locality • Spatial - nearby references are likely • Temporal - same reference is likely soon • Also motivated by cost/performance structures • Smaller hardware is faster: SRAM, DRAM, Disk, Tape • Access vs. bandwidth variations • Fast memory is more expensive • Goal – Provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level
Memory relevance in Computer Design ? • A computer’s performance is given by the number of instructions executed per time unit • The time for executing an instruction depends on: • The ALU speed (I.e. the data-path cycle duration) • The time it takes for each instruction to load/store its operands/result from/into the memory (in brief, the time to access memory) • The processing speed (CPU speed) grows faster than the memory speed. As a result the CPU speed cannot be fully exploited. This speed gap leads to an Unbalanced System !
DRAM/CPU Gap • CPU performance improves at 55%/year • In 1996 it was a phenomenal 18% per month • DRAM - has improved at 7% per year
Unit of Transfer / Addressable Unit • Unit of Transfer:Number of bits read from, or written into memory at a time • Internal : usually governed by data bus width • External : usually a block of words e.g 512 or more. • Addressable unit:smallest location which can be uniquely addressed • Internal : word or byte • External : device dependent e.g. a disk “cluster”
Access Method • Sequential • Data is stored in records, access is in linear sequence (tape) • Direct • Data blocks have a unique and direct access, data within block is sequential (disk) • Random • Data has unique and direct access (ram) • Associative • Data retrieved based on (partial) match rather than address (cache)
Sequential Access Method • Start at the beginning – read through in order • Access time depends on location of data and previous location • e.g. tape start first location . . . read to here location of interest
Direct Access Method • Individual blocks have unique address • Access is by jumping to vicinity plus sequential search (or waiting! e.g. waiting for disk to rotate) • Access time depends on target location and previous location • e.g. disk . . . jump to here block i read to here
Random Access Method • Individual addresses identify specific locations • Access time independent of location or previous access • e.g. RAM . . . read here
Associative Access Method • Data is located by a comparison with contents of a portion of the store. ( Content Addressable Memory). • Access time is independent of location or previous access • e.g. Cache
Registers • Used for storing data • Register files • Multiple read/write ports possible • Example: 32-bit wide by 16-bit deep, dual-port parallel read, single port parallel write register file 32 bits . . . 16 words 32
Implementing Registers Using Logic Gates • Flip-flops • Simple SR latch: • Flip-flops • JK, D, T • Clocked • Master-slave (edge-triggered) S S R Q Q’ 1 1 Q Q’ 1 0 0 1 0 1 1 0 0 0 x x Q S Q R Q Q R
f f S R Implementing Registers in CMOS • Direct gate implementation too costly • A master-slave JK flip-flop uses 38 CMOS transistors • Directly implement in transistors • Example: clocked SR FF Q Q Q
Internal Organization of Memory Chips b b ¢ b b ¢ b b ¢ 7 7 1 1 0 0 • • • W 0 FF FF A • • • 0 W 1 A 1 Address Memory • • • • • • • • • • • • • • • • • • cells decoder A 2 A 3 • • • W 15 R / W Sense / Write Sense / Write Sense / Write circuit circuit circuit CS Data input /output lines: b b b 7 1 0 Figure 5.2.Organization of bit cells in a memory chip.
A Memory Chip 5-bit row address W 0 W 1 32 ´ 32 5-bit memory cell decoder array W 31 Sense / Write circuitry 10-bit address 32-to-1 R / W output multiplexer and CS input demultiplexer 5-bit column address Data input/output Organization of a 1K 1 memory chip.
Static Memories • The circuits are capable of retaining their state as long as power is applied. b b ¢ T T 1 2 X Y Word line Bit lines A static RAM cell.
Static Memories • CMOS cell: low power consumption
Asynchronous DRAMs • Static RAMs are fast, but they cost more area and are more expensive. • Dynamic RAMs (DRAMs) are cheap and area efficient, but they can not retain their state indefinitely – need to be periodically refreshed. Bit line Word line T C A single-transistor dynamic memory cell
A Dynamic Memory Chip R A S Row Addr. Strobe Row Row 4096 ´ ( 512 ´ 8 ) address decoder cell array latch A ¤ A CS Sense / Write 20 - 9 8 - 0 circuits R / W Column Column address decoder latch C A S D D 7 0 Column Addr. Strobe Internal organization of a 2M ´ 8 dynamic memory chip.
Cache • The first level of the memory hierarchy encountered once the address leaves the CPU • Persistent mismatch between CPU and main-memory speeds • Exploit the principle of locality by providing a small, fast memory between CPU and main memory -- the cache memory • Cache is now applied whenever buffering is employed to reuse commonly occurring terms (ex. file caches) • Caching – copying information into faster storage system • Main memory can be viewed as a cache for secondary storage
General Hierarchy Concepts • At each level - block concept is present (block is the caching unit) • Block size may vary depending on level • Amortize longer access by bringing in larger chunk • Works if locality principle is true • Hit - access where block is present - hit rate is the probability • Miss - access where block is absent (in lower levels) - miss rate • Mirroring and consistency • Data residing in higher level is subset of data in lower level • Changes at higher level must be reflected down - sometime • Policy of sometime is the consistency mechanism • Addressing • Whatever the organization you have to know how to get at it! • Address checking and protection
Physical Address Structure • Key is that you want different block sizes at different levels
Latency and Bandwidth • The time required for the cache miss depends on both latency and bandwidth of the memory (or lower level) • Latency determines the time to retrieve the first word of the block • Bandwidth determines the time to retrieve the rest of this block • A cache miss is handled by hardware and causes processors following in-order execution to pause or stall until the data are available
Predicting Memory Access Times • On a hit: simple access time to the cache • On a miss: access time + miss penalty • Miss penalty = access time of lower + block transfer time • Block transfer time depends on • Block size - bigger blocks mean longer transfers • Bandwidth between the two levels of memory • Bandwidth usually dominated by the slower memory and the bus protocol • Performance • Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty • Memory-stall-cycles = IC * Memory-reference-per-instruction * Miss-Rate * Miss-Penalty
Headaches of Memory Hierarchies • CPU never knows for sure if an access will hit • How deep will a miss be - i. e. miss penalty • If short then the CPU just waits • If long then probably best to work on something else – task switch • Implies that the amount can be predicted with reasonable accuracy • Task switch better be fast or productivity/efficiency will suffer • Implies some new needs • More hardware accounting • Software readable accounting information (address trace)
Four Standard Questions • Block Placement • Where can a block be placed in the upper level? • Block Identification • How is a block found if it is in the upper level? • Block Replacement • Which block should be replaced on a miss? • Write Strategy • What happens on a write? Answer the four questions for the first level of the memory hierarchy
Block Placement Options • Direct Mapped • (Block address) MOD (# of cache blocks) • Fully Associative • Can be placed anywhere • Set Associative • Set is a group of n blocks -- each block is called a way • Block first mapped into a set (Block address) MOD (# of cache sets) • Placed anywhere in the set • Most caches are direct mapped, 2- or 4-way set associative
Block Identification • Each cache block carries tags • Address Tags: which block am I? • Physical address now: address tag## set index## block offset • Note relationship of block size, cache size, and tag size • The smaller the set tag the cheaper it is to find • Status Tags: what state is the block in? • valid, dirty, etc. Many memory blocks may map to the same cache block Physical address =r + m + n bits r (address tag) m (set index) n(block offset) 2m addressable sets in the cache 2n bytesper block
Block Identification (Cont.) Physical address = r + m + n bits r (address tag) m n 2m addressable sets in the cache 2n bytesper block Caches have an address tag on each block frame that gives the block address. A valid bit to say whether or not this entry contains a valid address. The block frame address can be divided into the tag filed and the index field.
Block Replacement • Random: just pick one and chuck it • Simple hash game played on target block frame address • Some use truly random • But lack of reproducibility is a problem at debug time • LRU - least recently used • Need to keep time since each block was last accessed • Expensive if number of blocks is large due to global compare • Hence approximation is often used = Use bit tag and LFU • FIFO Only one choice for direct-mappedplacement
Data Cache Misses Per 1000 Instructions 64 byte blocks on a Alpha using 10 SPEC2000
Short Summaries from the Previous Figure • More-way associative is better for small cache • 2- or 4-way associative perform similar to 8-way associative for larger caches • Larger cache size is better • LRU is the best for small block sizes • Random works fine for large caches • FIFO outperforms random in smaller caches • Little difference between LRU and random for larger caches
Improving Cache Performance • MIPS mix is 10% stores and 37% loads • Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%)=21% of data cache traffic • Make the common case fast • Implies optimizing caches for reads • Read optimizations • Block can be read concurrent with tag comparison • On a hit the read information is passed on • On a miss the - nuke the block and start the miss access • Write optimizations • Can’t modify until after tag check - hence take longer
Write Options • Write through: write posted to cache line and through to next lower level • Incurs write stall (use an intermediate write buffer to reduce the stall) • Write back • Only write to cache not to lower level • Implies that cache and main memory are now inconsistent • Mark the line with a dirty bit • If this block is replaced and dirty then write it back • Pro’s and Con’s both are useful • Write through • No write on read miss, simpler to implement, no inconsistency with main memory • Write back • Uses less main memory bandwidth, write times independent of main memory speeds • Multiple writes within a block require only one write to the main memory
Different Memory-Hierarchy Consideration for Desktop, Server, Embedded System • Servers • More context switches increase compulsory miss rates • Desktops are concerned more with average latency, whereas servers are also concerned about memory bandwidth • The importance of protection escalates • Have greater bandwidth demands • Embedded systems • Worry about worst-case performance: caches improve average-case performance • Power and battery life less HW less HW-intensive optimization • Protection role is diminished • Often no disk storage • Write-back is more attractive
The Alpha AXP 21264 Data Cache • The cache contains 65,536 bytes of data in 64-byte blocks with two-way set associative placement (total 512 sets in the cache), write back, and write allocate on a write miss • The 44-bit physical address is divided into three fields: the 29-bit Tag, 9-bit Index, and 6-bit block offset • Although each block is 64 bytes, 8 bytes within a block is accessed per time • 3 bits from the block offset are used to index the proper 8 bytes