
Computer Architecture


Presentation Transcript


  1. Princess Sumaya University for Technology Computer Architecture Dr. Esam Al_Qaralleh

  2. Memory Hierarchy Design

  3. 5.1 Introduction

  4. Memory Hierarchy Design • Motivated by the principle of locality - a 90/10 type of rule (roughly 90% of accesses go to 10% of the data) • Take advantage of 2 forms of locality • Spatial - nearby references are likely • Temporal - the same reference is likely again soon • Also motivated by cost/performance structure • Smaller hardware is faster: SRAM, DRAM, disk, tape • Access time vs. bandwidth variations • Fast memory is more expensive • Goal - provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level

  5. Memory Relevance in Computer Design • A computer’s performance is given by the number of instructions executed per time unit • The time for executing an instruction depends on: • The ALU speed (i.e. the data-path cycle duration) • The time it takes for each instruction to load/store its operands/result from/into memory (in brief, the time to access memory) • The processing speed (CPU speed) grows faster than the memory speed. As a result the CPU speed cannot be fully exploited. This speed gap leads to an unbalanced system!

  6. DRAM/CPU Gap • CPU performance improves at 55%/year • In 1996 it was a phenomenal 18% per month • DRAM has improved at only 7% per year

  7. Levels in A Typical Memory Hierarchy

  8. Sample Memory Hierarchy

  9. Unit of Transfer / Addressable Unit • Unit of transfer: number of bits read from, or written into, memory at a time • Internal: usually governed by data bus width • External: usually a block of words, e.g. 512 or more • Addressable unit: smallest location which can be uniquely addressed • Internal: word or byte • External: device dependent, e.g. a disk “cluster”

  10. Access Method • Sequential • Data is stored in records; access is in linear sequence (tape) • Direct • Data blocks have unique and direct access; data within a block is sequential (disk) • Random • Data has unique and direct access (RAM) • Associative • Data is retrieved based on a (partial) content match rather than an address (cache)

  11. Sequential Access Method • Start at the beginning – read through in order • Access time depends on the location of the data and the previous location • e.g. tape: start at the first location and read through to the location of interest

  12. Direct Access Method • Individual blocks have unique addresses • Access is by jumping to the vicinity plus a sequential search (or waiting! e.g. waiting for the disk to rotate) • Access time depends on the target location and the previous location • e.g. disk: jump to the vicinity of block i, then read to the location of interest

  13. Random Access Method • Individual addresses identify specific locations • Access time is independent of location and previous access • e.g. RAM: read any location directly

  14. Associative Access Method • Data is located by a comparison with the contents of a portion of the store (Content Addressable Memory) • Access time is independent of location and previous access • e.g. cache

  15. Registers • Used for storing data • Register files • Multiple read/write ports possible • Example: a 32-bit-wide by 16-word-deep register file with dual-port parallel read and a single-port parallel write (sketched below)
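
A minimal C sketch of a register file with that shape. The array layout and function names are illustrative, not a hardware description; the point is the port structure: two values can be read per access while only one can be written.

    #include <stdint.h>

    #define REG_WORDS 16              /* 16 words deep, each 32 bits wide */

    typedef struct { uint32_t regs[REG_WORDS]; } regfile_t;

    /* Dual-port parallel read: two registers are read in one access. */
    static void regfile_read(const regfile_t *rf, unsigned ra, unsigned rb,
                             uint32_t *out_a, uint32_t *out_b)
    {
        *out_a = rf->regs[ra % REG_WORDS];
        *out_b = rf->regs[rb % REG_WORDS];
    }

    /* Single write port: one register written per access. */
    static void regfile_write(regfile_t *rf, unsigned rw, uint32_t value)
    {
        rf->regs[rw % REG_WORDS] = value;
    }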

  16. Implementing Registers Using Logic Gates • Flip-flops • Simple SR latch, with the truth table below (NAND type: S = R = 0 is the forbidden input) • Flip-flops: JK, D, T • Clocked • Master-slave (edge-triggered)

    S R | Q Q'
    1 1 | Q Q'   (hold)
    1 0 | 0  1   (reset)
    0 1 | 1  0   (set)
    0 0 | x  x   (invalid)
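
A small C sketch of the behavior encoded in that truth table; the function name and the use of -1 to flag the forbidden input are illustrative.

    /* One update step of the (active-low, NAND-type) SR latch: q is the
     * currently stored state, the return value is the next state. */
    static int sr_latch_step(int s, int r, int q)
    {
        if (s == 1 && r == 1) return q;   /* hold                       */
        if (s == 0 && r == 1) return 1;   /* set:   Q = 1               */
        if (s == 1 && r == 0) return 0;   /* reset: Q = 0               */
        return -1;                        /* S = R = 0: forbidden input */
    }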

  17. Implementing Registers in CMOS • Direct gate implementation is too costly • A master-slave JK flip-flop uses 38 CMOS transistors • Implement directly in transistors instead • Example: clocked SR flip-flop

  18. Internal Organization of Memory Chips • [Figure: an array of bit cells. An address decoder takes address lines A0-A3 and drives word lines W0-W15; each row of flip-flop (FF) cells connects to bit-line pairs b/b′; one Sense/Write circuit per column connects the array to data input/output lines b7…b0, controlled by R/W and CS.] Figure 5.2. Organization of bit cells in a memory chip.

  19. A Memory Chip • Organization of a 1K × 1 memory chip: a 32 × 32 memory cell array; the 10-bit address splits into a 5-bit row address (driving the row decoder for word lines W0-W31) and a 5-bit column address (driving a 32-to-1 output multiplexer and input demultiplexer); Sense/Write circuitry, R/W and CS inputs, and a single data input/output line.
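
A minimal C sketch of that address split, to make the decode concrete. Which half of the address is the row is an assumption; the slide only fixes the 5 + 5 split.

    /* Split the 10-bit address of a 1K x 1 chip into a 5-bit row
     * (selects one of 32 word lines) and a 5-bit column (selects one
     * of 32 bit lines through the 32-to-1 multiplexer). */
    static void decode_1k_addr(unsigned addr, unsigned *row, unsigned *col)
    {
        *row = (addr >> 5) & 0x1F;   /* upper 5 bits */
        *col = addr & 0x1F;          /* lower 5 bits */
    }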

  20. Static Memories • The circuits are capable of retaining their state as long as power is applied. • [Figure: a static RAM cell - two cross-coupled inverters holding complementary values at nodes X and Y, with access transistors T1 and T2 connecting the nodes to bit lines b and b′ under control of the word line.]

  21. Static Memories • CMOS cell: low power consumption

  22. Asynchronous DRAMs • Static RAMs are fast, but they take more area and are more expensive. • Dynamic RAMs (DRAMs) are cheap and area-efficient, but they cannot retain their state indefinitely - they need to be refreshed periodically. • [Figure: a single-transistor dynamic memory cell - an access transistor T connects a storage capacitor C to the bit line, gated by the word line.]

  23. A Dynamic Memory Chip • Internal organization of a 2M × 8 dynamic memory chip: a 4096 × (512 × 8) cell array; the Row Address Strobe (RAS) latches the 12-bit row address A20-9 into a row latch feeding the row decoder; the Column Address Strobe (CAS) latches the 9-bit column address A8-0 into a column latch feeding the column decoder; Sense/Write circuits, controlled by R/W and CS, connect the array to data lines D7-D0.
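
A minimal C sketch of the two-phase, time-multiplexed addressing this organization implies; in the real chip the two halves arrive on the same pins in separate RAS and CAS cycles.

    /* Split the 21-bit address of a 2M x 8 DRAM into the row presented
     * during RAS and the column presented during CAS. */
    static void dram_addr_phases(unsigned addr, unsigned *row, unsigned *col)
    {
        *row = (addr >> 9) & 0xFFF;  /* A20-9: 12 bits, one of 4096 rows    */
        *col = addr & 0x1FF;         /* A8-0:  9 bits, one of 512 byte cols */
    }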

  24. 5.2 Review of the ABCs of Caches

  25. 36 Basic Terms on Caches

  26. Cache • The first level of the memory hierarchy encountered once the address leaves the CPU • Persistent mismatch between CPU and main-memory speeds • Exploit the principle of locality by providing a small, fast memory between CPU and main memory -- the cache memory • “Cache” is now applied whenever buffering is employed to reuse commonly occurring items (e.g. file caches) • Caching - copying information into a faster storage system • Main memory can be viewed as a cache for secondary storage

  27. General Hierarchy Concepts • At each level the block concept is present (the block is the caching unit) • Block size may vary depending on level • Amortize longer access by bringing in a larger chunk • Works if the locality principle holds • Hit - access where the block is present; hit rate is the probability • Miss - access where the block is absent (fetched from lower levels); miss rate • Mirroring and consistency • Data residing in a higher level is a subset of the data in the lower level • Changes at the higher level must be reflected down - sometime • The policy for “sometime” is the consistency mechanism • Addressing • Whatever the organization, you have to know how to get at it! • Address checking and protection

  28. Physical Address Structure • Key is that you want different block sizes at different levels

  29. Latency and Bandwidth • The time required for the cache miss depends on both latency and bandwidth of the memory (or lower level) • Latency determines the time to retrieve the first word of the block • Bandwidth determines the time to retrieve the rest of this block • A cache miss is handled by hardware and causes processors following in-order execution to pause or stall until the data are available

  30. Predicting Memory Access Times • On a hit: simple access time to the cache • On a miss: access time + miss penalty • Miss penalty = access time of the lower level + block transfer time • Block transfer time depends on • Block size - bigger blocks mean longer transfers • Bandwidth between the two levels of memory • Bandwidth is usually dominated by the slower memory and the bus protocol • Performance (worked example below) • Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty • Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
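
A worked instance of the two formulas in C. All numbers (1-cycle hit, 2% miss rate, 100-cycle miss penalty, 1.4 memory references per instruction, one million instructions) are illustrative assumptions.

    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 1.0;    /* cycles             */
        double miss_rate    = 0.02;
        double miss_penalty = 100.0;  /* cycles             */
        double refs_per_ins = 1.4;
        double ic           = 1e6;    /* instruction count  */

        double amat  = hit_time + miss_rate * miss_penalty;
        double stall = ic * refs_per_ins * miss_rate * miss_penalty;

        printf("AMAT = %.2f cycles\n", amat);            /* 3.00    */
        printf("Memory stall cycles = %.0f\n", stall);   /* 2800000 */
        return 0;
    }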

  31. Block Sizes, Miss Rates & Penalties, Accesses

  32. Headaches of Memory Hierarchies • The CPU never knows for sure if an access will hit • How deep will a miss be - i.e. the miss penalty • If short, then the CPU just waits • If long, then it is probably best to work on something else - a task switch • Implies that the penalty can be predicted with reasonable accuracy • The task switch had better be fast or productivity/efficiency will suffer • Implies some new needs • More hardware accounting • Software-readable accounting information (address trace)

  33. Four Standard Questions • Block Placement • Where can a block be placed in the upper level? • Block Identification • How is a block found if it is in the upper level? • Block Replacement • Which block should be replaced on a miss? • Write Strategy • What happens on a write? Answer the four questions for the first level of the memory hierarchy

  34. Block Placement Options • Direct mapped • (Block address) MOD (# of cache blocks) • Fully associative • Can be placed anywhere • Set associative • A set is a group of n blocks - each block is called a way • The block is first mapped onto a set: (Block address) MOD (# of cache sets) • Placed anywhere in that set • Most caches are direct mapped, or 2- or 4-way set associative (see the sketch below)
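
A minimal C sketch of the three placement rules. The geometry (4 ways, 128 sets) is illustrative; note that direct mapped is just set associative with one-block sets, and fully associative is a single set holding the whole cache.

    /* Cache geometry: NUM_BLOCKS frames as NUM_SETS sets of ASSOC ways. */
    enum { ASSOC = 4, NUM_SETS = 128, NUM_BLOCKS = NUM_SETS * ASSOC };

    /* Direct mapped: exactly one legal frame for a given block address. */
    static unsigned direct_mapped_frame(unsigned block_addr)
    {
        return block_addr % NUM_BLOCKS;
    }

    /* Set associative: one legal set; the block may sit in any of its
     * ASSOC ways, found later by comparing tags within the set. */
    static unsigned set_associative_set(unsigned block_addr)
    {
        return block_addr % NUM_SETS;
    }

    /* Fully associative: nothing to compute - any frame is legal, and
     * the block is found by searching every tag in the cache. */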

  35. Block Placement Options (Cont.)

  36. Block Identification • Each cache block carries tags • Address tags: which block am I? • Physical address now: address tag | set index | block offset • Note the relationship of block size, cache size, and tag size • The smaller the tag, the cheaper it is to find • Status tags: what state is the block in? • valid, dirty, etc. • Many memory blocks may map to the same cache block • Physical address = r + m + n bits: r (address tag), m (set index), n (block offset); 2^m addressable sets in the cache, 2^n bytes per block

  37. Block Identification (Cont.) • Physical address = r + m + n bits: r (address tag), m (set index), n (block offset); 2^m addressable sets in the cache, 2^n bytes per block • Caches have an address tag on each block frame that gives the block address • A valid bit says whether or not the entry contains a valid address • The block frame address is divided into the tag field and the index field (see the sketch below)
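
A minimal C sketch of the r/m/n split. The particular widths (n = 6, m = 9) are illustrative, chosen to match a 64-byte block and 512 sets.

    #include <stdint.h>

    enum { N_BITS = 6, M_BITS = 9 };  /* 2^6 = 64 B/block, 2^9 = 512 sets */

    static void split_address(uint64_t paddr, uint64_t *tag,
                              uint64_t *index, uint64_t *offset)
    {
        *offset = paddr & ((1u << N_BITS) - 1);             /* low n bits  */
        *index  = (paddr >> N_BITS) & ((1u << M_BITS) - 1); /* next m bits */
        *tag    = paddr >> (N_BITS + M_BITS);               /* remaining r */
    }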

  38. Block Replacement • Random: just pick one and chuck it • A simple hash game played on the target block frame address • Some use truly random selection • But the lack of reproducibility is a problem at debug time • LRU - least recently used • Need to track the time since each block was last accessed • Expensive if the number of blocks is large, due to the global compare - hence an approximation (e.g. use bits, or LFU) is often employed; a sketch of true LRU follows • FIFO • Note: direct-mapped placement allows only one choice, so no replacement policy is needed
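
A minimal C sketch of true LRU within one set, using a timestamp per way; this is exactly the global compare the slide calls expensive, which is why real caches approximate it. Names and the 4-way geometry are illustrative.

    #define WAYS 4

    struct way { unsigned long last_used; int valid; };

    /* Pick the victim way in a set: an invalid (free) way if one exists,
     * otherwise the way with the oldest access time. */
    static int lru_victim(const struct way set[WAYS])
    {
        int victim = 0;
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) return w;
        for (int w = 1; w < WAYS; w++)
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        return victim;
    }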

  39. Data Cache Misses Per 1000 Instructions • 64-byte blocks on an Alpha, using 10 SPEC2000 benchmarks

  40. Short Summaries from the Previous Figure • Higher associativity is better for small caches • 2- or 4-way associative performs similarly to 8-way associative for larger caches • A larger cache size is better • LRU is the best for small block sizes • Random works fine for large caches • FIFO outperforms random in smaller caches • Little difference between LRU and random for larger caches

  41. Improving Cache Performance • The MIPS mix is 10% stores and 37% loads • Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%) = 21% of data cache traffic • Make the common case fast • Implies optimizing caches for reads • Read optimizations • A block can be read concurrently with the tag comparison • On a hit, the read data is passed on • On a miss, nuke the block and start the miss access • Write optimizations • Can’t modify until after the tag check - hence writes take longer

  42. Write Options • Write through: the write is posted to the cache line and through to the next lower level • Incurs a write stall (use an intermediate write buffer to reduce the stall) • Write back • Write only to the cache, not to the lower level • Implies that the cache and main memory are now inconsistent • Mark the line with a dirty bit • If a dirty block is replaced, write it back • Pros and cons: both are useful (see the sketch below) • Write through • No write on a read miss, simpler to implement, no inconsistency with main memory • Write back • Uses less main-memory bandwidth; write times are independent of main-memory speed • Multiple writes within a block require only one write to main memory
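
A minimal C sketch contrasting the two policies on a write hit. The line layout and the write-buffer call are illustrative assumptions.

    #include <stdbool.h>

    struct line { unsigned char data[64]; bool dirty; };

    /* Write through: update the line and post the write to the next level
     * (typically via a write buffer so the CPU does not stall). */
    static void write_through(struct line *l, int off, unsigned char v)
    {
        l->data[off] = v;
        /* enqueue_to_next_level(off, v);  -- hypothetical buffer call */
    }

    /* Write back: update only the cache and set the dirty bit; the lower
     * level is updated only when this line is eventually replaced. */
    static void write_back(struct line *l, int off, unsigned char v)
    {
        l->data[off] = v;
        l->dirty = true;
    }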

  43. Different Memory-Hierarchy Considerations for Desktop, Server, and Embedded Systems • Servers • More context switches increase compulsory miss rates • Desktops are concerned more with average latency, whereas servers are also concerned about memory bandwidth • The importance of protection escalates • Greater bandwidth demands • Embedded systems • Worry about worst-case performance: caches improve average-case performance • Power and battery life favor less hardware and less hardware-intensive optimization • The protection role is diminished • Often no disk storage • Write back is more attractive

  44. The Alpha AXP 21264 Data Cache • The cache contains 65,536 bytes of data in 64-byte blocks with two-way set-associative placement (512 sets in total), write back, and write allocate on a write miss • The 44-bit physical address is divided into three fields: the 29-bit tag, 9-bit index, and 6-bit block offset • Although each block is 64 bytes, only 8 bytes within a block are accessed at a time • 3 bits of the block offset are used to index the proper 8 bytes
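
The field widths above can be checked with a short C sketch; the function name is illustrative, the constants come straight from the slide (6 + 9 + 29 = 44 bits).

    #include <stdint.h>

    /* Decompose a 44-bit 21264 physical address into block offset, set
     * index, tag, and the 8-byte word selected by the top offset bits. */
    static void alpha21264_split(uint64_t paddr)
    {
        uint64_t offset = paddr & 0x3F;          /* 6 bits: byte in 64B block */
        uint64_t index  = (paddr >> 6) & 0x1FF;  /* 9 bits: one of 512 sets   */
        uint64_t tag    = (paddr >> 15) & ((1ULL << 29) - 1); /* 29-bit tag   */
        uint64_t word   = offset >> 3;           /* which of eight 8B words   */
        (void)tag; (void)index; (void)word;      /* silence unused warnings   */
    }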

  45. The Alpha AXP 21264 Data Cache (Cont.)
