
CSC 4250 Computer Architectures


Albert_Lan




Presentation Transcript


  1. CSC 4250 Computer Architectures December 8, 2006 Chapter 5. Memory Hierarchy

  2. Main Memory • Assume the following performance: • 4 clock cycles to send address • 56 clock cycles for access time per word • 4 clock cycles to send a word of data • Choose a cache block of 4 words. Then • miss penalty is 4×(4+56+4) or 256 clock cycles • bandwidth is 32/256 or 1/8 byte per clock cycle
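The arithmetic on this slide can be checked with a short script. The bandwidth figure of 1/8 byte per cycle implies 8-byte (64-bit) words, so a 4-word block is 32 bytes; that word size is inferred from the numbers, not stated on the slide:

```python
# Miss penalty and bandwidth for a 4-word cache block fetched
# one word at a time over a 1-word-wide bus.
ADDR_CYCLES = 4     # cycles to send the address
ACCESS_CYCLES = 56  # access time per word
XFER_CYCLES = 4     # cycles to send one word of data
WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 8  # assumed: 64-bit words, so a block is 32 bytes

miss_penalty = WORDS_PER_BLOCK * (ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES)
bandwidth = WORDS_PER_BLOCK * BYTES_PER_WORD / miss_penalty

print(miss_penalty)  # 256 clock cycles
print(bandwidth)     # 0.125 = 1/8 byte per clock cycle
```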

  3. Fig. 5.27. Three examples of bus width, memory width, & memory interleaving to achieve higher memory bandwidth

  4. Techniques for Higher Bandwidth • Wider main memory ─ Quadrupling the width of the cache and the memory will quadruple the memory bandwidth. With a main memory width of 4 words, the miss penalty would drop from 256 cycles to 64 cycles. • Simple interleaved memory ─ Sending an address to four banks permits them all to read simultaneously. The miss penalty is now 4+56+(4×4) or 76 clock cycles.
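A sketch of the two alternatives above, using the same per-access timing as slide 2:

```python
ADDR, ACCESS, XFER = 4, 56, 4  # cycles: address, access, word transfer
WORDS = 4                      # words per cache block

# Wider memory: the whole 4-word block moves in a single access.
wide_penalty = ADDR + ACCESS + XFER

# Four-way interleaving: one address starts all four banks, so the
# accesses overlap, but the words still cross the bus one at a time.
interleaved_penalty = ADDR + ACCESS + WORDS * XFER

print(wide_penalty)         # 64 clock cycles
print(interleaved_penalty)  # 76 clock cycles
```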

  5. Example • What can interleaving and wide memory buy? • Consider the following machine: Block size = 1 word Memory bus width = 1 word Miss rate = 3% Memory accesses per instruction = 1.2 Cache miss penalty = 64 cycles Average CPI (ignoring cache misses) = 2 • If we change the block size to 2 words, the miss rate falls to 2%, and a 4-word block has a miss rate of 1.2%. • What is the improvement in performance of interleaving two ways and four ways versus doubling the width of memory and the bus?

  6. Solution (1) • CPI for computer using 1-word blocks = 2 + (1.2×3%×64) = 4.30 • Since the clock cycle time and instruction time won’t change in this example, we calculate performance improvement by just comparing CPI. • Increasing the block size to 2 words gives these options: • 64-bit bus and memory, no interleaving = 2 + (1.2×2%×2×64) = 5.07 • 64-bit bus and memory, interleaving = 2 + (1.2×2%×(4+56+8)) = 3.63 • 128-bit bus and memory, no interleaving = 2 + (1.2×2%×1×64) = 3.54 • Thus, doubling the block size slows down the straightforward implementation (5.07 versus 4.30), while interleaving or wider memory is 1.19 or 1.22 times faster, respectively.

  7. Solution (2) • Increasing the block size to 4 words gives these options: • 64-bit bus and memory, no interleaving = 2 + (1.2×1.2%×4×64) = 5.69 • 64-bit bus and memory, interleaving = 2 + (1.2×1.2%×(4+56+16)) = 3.09 • 128-bit bus and memory, no interleaving = 2 + (1.2×1.2%×2×64) = 3.84 • Again, the larger block hurts performance for the simple case (5.69 versus 4.30), although the interleaved 64-bit memory is now fastest ─ 1.39 times faster versus 1.12 for the wider memory and bus.
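The CPI figures on slides 6 and 7 can all be reproduced from one helper:

```python
def cpi(miss_rate, miss_penalty, base_cpi=2.0, accesses_per_instr=1.2):
    """Average CPI including memory-stall cycles."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

base    = cpi(0.03, 64)                # 1-word blocks: 4.30
# 2-word blocks (2% miss rate)
narrow2 = cpi(0.02, 2 * 64)            # 64-bit bus, no interleaving: 5.07
inter2  = cpi(0.02, 4 + 56 + 2 * 4)    # 64-bit bus, interleaved:     3.63
wide2   = cpi(0.02, 1 * 64)            # 128-bit bus:                 3.54
# 4-word blocks (1.2% miss rate)
narrow4 = cpi(0.012, 4 * 64)           # 64-bit bus, no interleaving: 5.69
inter4  = cpi(0.012, 4 + 56 + 4 * 4)   # 64-bit bus, interleaved:     3.09
wide4   = cpi(0.012, 2 * 64)           # 128-bit bus:                 3.84

print(round(base / inter4, 2))  # 1.39: interleaved 64-bit memory is fastest
print(round(base / wide4, 2))   # 1.12
```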

  8. Interleaved Memory • Interleaved memory is logically a wide memory, except that accesses to banks are staged over time to share internal resources. • How many banks should be included? • One metric, used in vector computers, is • Number of banks ≥ Number of clock cycles to access word in bank
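The bank-count metric above can be stated as a one-line check. The intuition: with word interleaving, word i lives in bank i mod n, so a sequential stream returns to a given bank only every n accesses; if n is at least the bank access time in cycles, the previous access to that bank has already finished:

```python
def conflict_free(n_banks, access_cycles):
    """True if sequential word accesses never stall on a busy bank."""
    return n_banks >= access_cycles

# Illustrative values only: a 56-cycle bank access time would need
# 56 banks to sustain one word per cycle; 4 banks would not.
print(conflict_free(56, 56))  # True
print(conflict_free(4, 56))   # False
```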

  9. Virtual Memory • At any instant in time, computers are running multiple processes, each with its own address space. • It is too expensive to dedicate a full address space's worth of memory to each process, especially since many processes use only a small part of their address space. • We need a way to share a smaller amount of physical memory among many processes. • One way, virtual memory, divides physical memory into blocks and allocates them to different processes. • There must be a protection scheme that restricts a process to only the blocks belonging to that process.

  10. Fig. 5.31. A program in its contiguous virtual address space

  11. Comparison with Caches • Page or segment is used for block. • Page fault or address fault is used for miss. • The CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access main memory. This process is called memory mapping or address translation. • Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system. • The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size.

  12. Figure 5.32. Typical ranges of parameters for caches and virtual memory

  13. Figure 5.33. How paging and segmentation divide a program

  14. Fig. 5.34. Paging versus segmentation • Why two words per address for segment?

  15. Four Questions • Where can a block be placed in main memory? Miss penalty is high. So, choose direct-mapped, fully associative, or set associative? • How is a block found if it is in main memory? Paging and segmentation (tag, index, offset fields). • Which block should be replaced on a virtual memory miss? Random, LRU, or FIFO. • What is the write policy? Write through, write back, write allocate, or no-write allocate?

  16. Paging • Paging uses a data structure that is indexed by the page number. This structure contains the physical address of the block. The offset is concatenated to the physical page address. • The structure takes the form of a page table. Indexed by the virtual page number, the size of the table equals the number of pages in the virtual address space. • Given a 32-bit virtual address, 4 KB pages, and four bytes per page table entry, the size of the page table would be (2³²/2¹²)×2² = 2²² or 4 MB.
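The page-table-size calculation generalizes to a one-line function:

```python
def page_table_bytes(va_bits, page_bytes, pte_bytes):
    """Size of a flat page table: one PTE per virtual page."""
    n_pages = 2 ** va_bits // page_bytes
    return n_pages * pte_bytes

# The slide's example: 32-bit VA, 4 KB pages, 4-byte PTEs.
print(page_table_bytes(32, 4096, 4))  # 4194304 bytes = 4 MB
```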

  17. Figure 5.35. Mapping of virtual address to physical address via page table • How can we reduce address translation time?

  18. Figure 5.35. Again • Use these values: 64-bit virtual address, 8KB page size. • What is the number of entries in page table? • What is the size of page table?
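One way to work the questions on this slide. The slide leaves the PTE size unspecified; 8 bytes is assumed here for illustration. The point of the exercise is that a flat page table for a 64-bit address space is far too large, motivating multi-level tables like the Alpha's on the next slides:

```python
VA_BITS = 64
PAGE_BYTES = 8 * 1024  # 8 KB pages -> 13 offset bits
PTE_BYTES = 8          # assumed; not fixed by the slide

entries = 2 ** VA_BITS // PAGE_BYTES  # one PTE per virtual page
table_bytes = entries * PTE_BYTES

print(entries)      # 2**51 entries
print(table_bytes)  # 2**54 bytes = 16 PiB: clearly infeasible as a flat table
```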

  19. Fig. 5.39. Mapping of Alpha virtual address

  20. Alpha 21264 Memory Management 1 • 64-bit address space • 43-bit virtual address • Three segments: • seg0: bits 63-43 = 00…0 • seg1: bits 63-43 = 11…1 • kseg • Segment kseg is reserved for operating system • User processes use seg0 • Page tables reside in seg1

  21. Alpha 21264 Memory Management 2 • PTE (page table entry) is 64 bit (8 bytes) • Each page table has 1,024 PTE’s • Page size is thus 8KB • Virtual address is 43 bits (why?) • Physical page number is 28 bits • Physical address is thus 41 bits (why?) • Possible to increase page size to 16, 32, or 64KB • If page size = 64KB, then virtual and physical addresses become 55 and 44 bits, resp. (why?)
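The "why?" questions on this slide all follow from one piece of arithmetic: with a three-level page table whose table pages are one page in size, each level contributes log2(page_bytes / PTE_bytes) index bits, and the page offset contributes log2(page_bytes) bits:

```python
import math

PTE_BYTES = 8   # each page table entry is 64 bits
LEVELS = 3      # three-level page table walk
PHYS_PAGE_BITS = 28

def alpha_addr_bits(page_bytes):
    offset_bits = int(math.log2(page_bytes))
    # PTEs per page-table page give the index bits at each level.
    index_bits = int(math.log2(page_bytes // PTE_BYTES))
    va_bits = LEVELS * index_bits + offset_bits
    pa_bits = PHYS_PAGE_BITS + offset_bits
    return va_bits, pa_bits

print(alpha_addr_bits(8 * 1024))   # (43, 41): 3*10 + 13 and 28 + 13
print(alpha_addr_bits(64 * 1024))  # (55, 44): 3*13 + 16 and 28 + 16
```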

  22. Fig. 5.43. Overview of Alpha 21264 memory hierarchy
