Final Exam Review

Exam Format
• Covers material after the mid-term (caches through multiprocessors)
• Similar in style to the mid-term exam
• 6-7 questions in total
• One question: true/false or short-answer items covering general topics
• The other 5-6 questions require calculation

Memory Systems

Memory Hierarchy - the Big Picture
• Problem: memory is too slow and/or too small
• Solution: memory hierarchy
[Figure: the hierarchy runs from the processor (control, datapath, registers) through the L1 on-chip cache, L2 off-chip cache, and main memory (DRAM) to secondary storage (disk). Speed goes from fastest to slowest, size from smallest to biggest, and cost from highest to lowest.]

Why Hierarchy Works
• The principle of locality: programs access a relatively small portion of the address space at any instant of time
• Temporal locality: a recently accessed instruction or data item is likely to be used again
• Spatial locality: instructions or data near a recently accessed item are likely to be used soon
• Result: the illusion of large, fast memory
[Figure: probability of reference plotted over the address space, from 0 to 2^n - 1.]

Cache Design & Operation Issues
• Q1: Where can a block be placed in the cache? (Block placement strategy & cache organization)
  • Fully associative, set associative, direct mapped
• Q2: How is a block found if it is in the cache? (Block identification)
  • Tag/block
• Q3: Which block should be replaced on a miss? (Block replacement)
  • Random, LRU
• Q4: What happens on a write? (Cache write policy)
  • Write through, write back

Q1: Block Placement
• Where can a block be placed in the cache?
• In one predetermined place: direct-mapped
  • Use a fragment of the address to calculate the block location in the cache
  • Compare the cache block with the tag to test if the block is present
• Anywhere in the cache: fully associative
  • Compare the tag to every block in the cache
• In a limited set of places: set-associative
  • Use an address fragment to calculate the set
  • Place in any block in the set
  • Compare the tag to every block in the set
  • Hybrid of direct mapped and fully associative

Q2: Block Identification
• Every cache block has an address tag and index that identify its location in memory
• Hit when the tag and index of the desired word match (comparison by hardware)
• Q: What happens when a cache block is empty?
• A: Mark this condition with a valid bit
[Example entry: Valid = 1, Tag/index = 0x00001C0, Data = 0xff083c2d]
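As a rough illustration of block identification, the sketch below splits a byte address into tag, index, and block offset for a direct-mapped cache, and checks the valid bit before comparing tags. The cache geometry (64 lines of 16 bytes) and the names `split_address` and `lookup` are assumptions made for this example, not part of the course material.

```python
# Hypothetical direct-mapped cache geometry (assumed for illustration).
BLOCK_BYTES = 16   # bytes per block -> 4 offset bits
NUM_LINES = 64     # lines in the cache -> 6 index bits

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1
INDEX_BITS = NUM_LINES.bit_length() - 1


def split_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def lookup(cache, addr):
    """Hit only if the indexed line is valid and its stored tag matches."""
    tag, index, _ = split_address(addr)
    valid, stored_tag = cache[index]
    return valid and stored_tag == tag


# Every line starts empty: valid bit cleared, tag unused.
cache = [(False, 0)] * NUM_LINES
addr = 0x1C04
first = lookup(cache, addr)        # miss: the line is not valid yet
tag, index, _ = split_address(addr)
cache[index] = (True, tag)         # fill the line on the miss
second = lookup(cache, addr)       # hit: valid bit set and tag matches
```

In this sketch the index selects the line and only the tag is stored, so an empty line can never produce a false hit as long as its valid bit is clear.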

Cache Replacement Policy
• Random
  • Replace a randomly chosen line
• LRU (Least Recently Used)
  • Replace the least recently used line
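A minimal sketch of LRU replacement for a single cache set, assuming a software model in which an ordered dictionary tracks recency. The class name `LRUSet` and the access trace are hypothetical; real hardware typically approximates LRU with a few status bits per set.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (illustrative sketch)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False


s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
```

In this trace the access to "C" evicts "B" (the least recently used block at that point), so the final access to "B" misses again.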

Write-through Policy
[Figure: processor, cache, and memory. When the processor writes 0x5678 over 0x1234, both the cache and memory are updated.]

Write-back Policy
[Figure: processor, cache, and memory. Processor writes (0x5678, 0x9ABC) update only the cache; memory keeps the older value until the modified block is written back.]

Cache Performance: Average Memory Access Time (AMAT), Memory Stall Cycles
• Average Memory Access Time (AMAT): the number of cycles required to complete an average memory access request by the CPU
• Memory stall cycles per memory access: the number of stall cycles added to CPU execution cycles for one memory access
• For an ideal memory, AMAT = 1 cycle, which results in zero memory stall cycles
• Memory stall cycles per average memory access = (AMAT - 1)
• Memory stall cycles per average instruction
  = Memory stall cycles per average memory access x Number of memory accesses per instruction
  = (AMAT - 1) x (1 + fraction of loads/stores), where the 1 is the instruction fetch
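The definitions above can be sketched as two small functions. The single-level formula AMAT = 1 + M x (1 - H1) is the one used elsewhere in this review; the numeric values (95% hit rate, 20-cycle miss penalty, 30% loads/stores) are assumed purely for illustration.

```python
def amat(hit_rate, miss_penalty):
    """AMAT in cycles for one cache level with a 1-cycle hit time."""
    return 1 + miss_penalty * (1 - hit_rate)


def stalls_per_instruction(amat_cycles, load_store_fraction):
    """(AMAT - 1) x (1 + fraction of loads/stores)."""
    return (amat_cycles - 1) * (1 + load_store_fraction)


# Assumed example values, not taken from the course.
a = amat(hit_rate=0.95, miss_penalty=20)
stalls = stalls_per_instruction(a, load_store_fraction=0.3)
print(f"AMAT = {a:.2f} cycles, stalls/instruction = {stalls:.2f}")
```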

Cache Performance
• Unified cache: for a CPU with a single level (L1) of cache for both instructions and data and no stalls for cache hits:
  CPUtime = IC x (CPIexecution + Mem stall cycles per instruction) x Clock cycle time
  CPUtime = IC x [CPIexecution + Memory accesses/instruction x Miss rate x Miss penalty] x Clock cycle time
• Split cache: for a CPU with separate (split) level one (L1) caches for instructions and data and no stalls for cache hits:
  CPUtime = IC x (CPIexecution + Mem stall cycles per instruction) x Clock cycle time
  Mem stall cycles per instruction = Instruction fetch miss rate x Miss penalty + Data memory accesses per instruction x Data miss rate x Miss penalty
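A short sketch of the CPU time equations for the unified and split cases. All numbers (instruction count, CPI, miss rates, 50-cycle penalty, 1 ns clock) are assumed for illustration only.

```python
def cpu_time(ic, cpi_execution, mem_stalls_per_instr, clock_cycle_time):
    """CPUtime = IC x (CPIexecution + stalls per instruction) x cycle time."""
    return ic * (cpi_execution + mem_stalls_per_instr) * clock_cycle_time


def unified_stalls(accesses_per_instr, miss_rate, miss_penalty):
    """Stall cycles per instruction with one unified L1 cache."""
    return accesses_per_instr * miss_rate * miss_penalty


def split_stalls(i_miss_rate, data_accesses_per_instr, d_miss_rate, miss_penalty):
    """Stall cycles per instruction with separate I and D caches."""
    return (i_miss_rate * miss_penalty
            + data_accesses_per_instr * d_miss_rate * miss_penalty)


# Assumed workload: 1 fetch + 0.3 data accesses per instruction.
u = unified_stalls(accesses_per_instr=1.3, miss_rate=0.02, miss_penalty=50)
s = split_stalls(i_miss_rate=0.01, data_accesses_per_instr=0.3,
                 d_miss_rate=0.05, miss_penalty=50)
t_unified = cpu_time(1_000_000, 1.5, u, 1e-9)   # seconds
t_split = cpu_time(1_000_000, 1.5, s, 1e-9)     # seconds
```

With these assumed rates the split organization comes out slightly ahead, but the comparison depends entirely on the miss rates chosen.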

Memory Access Tree for Unified Level 1 Cache
CPU memory access:
• L1 hit: % = hit rate = H1; access time = 1; stalls = H1 x 0 = 0 (no stall)
• L1 miss: % = (1 - hit rate) = (1 - H1); access time = M + 1; stall cycles per access = M x (1 - H1)
AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)
Stall cycles per access = AMAT - 1 = M x (1 - H1)
where M = miss penalty, H1 = level 1 hit rate, 1 - H1 = level 1 miss rate

Memory Access Tree for Separate Level 1 Caches
CPU memory access, split into instruction and data accesses:
• Instruction L1 hit: access time = 1; stalls = 0
• Instruction L1 miss: access time = M + 1; stalls per access = % instructions x (1 - Instruction H1) x M
• Data L1 hit: access time = 1; stalls = 0
• Data L1 miss: access time = M + 1; stalls per access = % data x (1 - Data H1) x M
Stall cycles per access = % instructions x (1 - Instruction H1) x M + % data x (1 - Data H1) x M
AMAT = 1 + Stall cycles per access
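The split-cache tree can be checked numerically with a sketch like the following. The workload (1.3 accesses per instruction, 99% instruction hit rate, 95% data hit rate, M = 50) is an assumption for illustration; the last line converts stalls per access back to stalls per instruction.

```python
def split_stalls_per_access(frac_instr, instr_hit, frac_data, data_hit, miss_penalty):
    """% instructions x (1 - I-H1) x M + % data x (1 - D-H1) x M."""
    return (frac_instr * (1 - instr_hit) * miss_penalty
            + frac_data * (1 - data_hit) * miss_penalty)


# Assumed workload: 1 instruction fetch + 0.3 loads/stores per instruction.
accesses_per_instr = 1.3
frac_instr = 1 / accesses_per_instr      # share of accesses that are fetches
frac_data = 0.3 / accesses_per_instr     # share of accesses that are data

stalls = split_stalls_per_access(frac_instr, 0.99, frac_data, 0.95, 50)
amat = 1 + stalls
stalls_per_instr = stalls * accesses_per_instr
```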

Cache Performance (various factors)
• Cache impact on performance
  • With and without cache
  • Processor clock rate
• Which one performs better: unified or split?
  • Assuming the same size
• What is the effect of cache organization on cache performance: 1-way vs. 8-way set associative
  • Tradeoffs between hit time and hit rate

Cache Performance (various factors)
• What is the effect of write policy on cache performance: write back or write through, write allocate vs. no-write allocate
  • Write through (every write goes to memory): Stall cycles per memory access = % reads x (1 - H1) x M + % writes x M
  • Write back (replacing a dirty block costs an extra M): Stall cycles per memory access = (1 - H1) x (M x % clean + 2M x % dirty)
• What is the effect of cache levels on performance:
  • Two levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M
  • Three levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x H3 x T3 + (1 - H1) x (1 - H2) x (1 - H3) x M
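The stall formulas on this slide can be sketched directly as functions. Here H2 is read as the local hit rate of L2 and T2 as the stall cycles on an L2 hit, which is an interpretation of the slide's symbols; all numeric values are assumed for illustration.

```python
def write_through_stalls(frac_reads, frac_writes, h1, m):
    """% reads x (1 - H1) x M + % writes x M."""
    return frac_reads * (1 - h1) * m + frac_writes * m


def write_back_stalls(h1, m, frac_clean, frac_dirty):
    """(1 - H1) x (M x % clean + 2M x % dirty)."""
    return (1 - h1) * (m * frac_clean + 2 * m * frac_dirty)


def two_level_stalls(h1, h2, t2, m):
    """(1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M."""
    return (1 - h1) * h2 * t2 + (1 - h1) * (1 - h2) * m


# Assumed example values, not taken from the course.
wt = write_through_stalls(frac_reads=0.8, frac_writes=0.2, h1=0.95, m=50)
wb = write_back_stalls(h1=0.95, m=50, frac_clean=0.7, frac_dirty=0.3)
l2 = two_level_stalls(h1=0.95, h2=0.8, t2=5, m=100)
```

Under these assumptions the write-through cache stalls far more per access than the write-back cache, because every write pays the full memory penalty.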

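Performance Equation
• CPUtime = IC x (CPIexecution + Memory accesses/instruction x Miss rate x Miss penalty) x Clock cycle time
• To reduce CPUtime, we need to reduce the cache miss rate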

Reducing Misses (3 Cs)
• Classifying cache misses: the 3 Cs
  • Compulsory: misses that occur even in an infinite-size cache
  • Capacity: misses due to the size of the cache
  • Conflict: misses due to the associativity and size of the cache
• How to reduce the 3 Cs (miss rate)
  • Increase block size
  • Increase associativity
  • Use a victim cache
  • Use a pseudo-associative cache
  • Use a prefetching technique

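Performance Equation
• CPUtime = IC x (CPIexecution + Memory accesses/instruction x Miss rate x Miss penalty) x Clock cycle time
• To reduce CPUtime, we need to reduce the cache miss penalty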

Memory Interleaving - Reduce miss penalty
• Default: must finish accessing one word before starting the next access
  • (1 + 25 + 1) x 4 = 108 cycles
• Interleaving: begin accessing one word, and while waiting, start accessing the other three words (pipelining)
  • 1 + 25 + 4 x 1 = 30 cycles
  • Requires 4 separate memories, each 1/4 size
  • Spread out addresses among the memories
  • Interleaving works perfectly with caches
[Figure: two organizations, each with a CPU, cache, and 4-byte bus. Default: a single memory. Interleaving: four memories (Memory0 to Memory3) on the bus. Timing per word: 1 cycle to send the address, 25 cycles to access, 1 cycle to transfer.]

Memory Interleaving: An Example
Given the following system parameters with a single cache level (L1):
• Block size = 1 word
• Memory bus width = 1 word
• Miss rate = 3%
• Miss penalty = 27 cycles (1 cycle to send the address, 25 cycles access time per word, 1 cycle to send a word)
• Memory accesses/instruction = 1.2
• Ideal CPI (ignoring cache misses) = 2
• Miss rate (block size = 2 words) = 2%
• Miss rate (block size = 4 words) = 1%
The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 27) = 2.97
Increasing the block size to two words gives the following CPI:
• 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 27) = 3.30
• 32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x 28) = 2.67
Increasing the block size to four words gives the following CPI:
• 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 27) = 3.30
• 32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x 30) = 2.36
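The cases in this example can be recomputed with a small sketch. It uses the slide's parameters (1 cycle address, 25 cycles access, 1 cycle transfer, 1.2 accesses per instruction, ideal CPI of 2); the function names `miss_penalty` and `cpi` are hypothetical.

```python
def miss_penalty(block_words, interleaved, addr=1, access=25, transfer=1):
    """Cycles to fetch one block over a 1-word bus."""
    if interleaved:
        # Accesses overlap across banks; only the transfers are serialized.
        return addr + access + block_words * transfer
    # Each word pays the full address + access + transfer cost in turn.
    return block_words * (addr + access + transfer)


def cpi(ideal_cpi, accesses_per_instr, miss_rate, penalty):
    """Ideal CPI plus memory stall cycles per instruction."""
    return ideal_cpi + accesses_per_instr * miss_rate * penalty


cases = {
    "1 word": cpi(2, 1.2, 0.03, miss_penalty(1, False)),
    "2 words, no interleaving": cpi(2, 1.2, 0.02, miss_penalty(2, False)),
    "2 words, interleaved": cpi(2, 1.2, 0.02, miss_penalty(2, True)),
    "4 words, no interleaving": cpi(2, 1.2, 0.01, miss_penalty(4, False)),
    "4 words, interleaved": cpi(2, 1.2, 0.01, miss_penalty(4, True)),
}
for name, value in cases.items():
    print(f"{name}: CPI = {value:.3f}")
```

Without interleaving, the lower miss rate of a larger block is cancelled by the longer miss penalty; with interleaving, the larger block wins.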

Cache vs. Virtual Memory
• Motivation for virtual memory (physical memory size, multiprogramming)
• The concept behind VM is almost identical to the concept behind caches, but the terminology differs:
  • Cache: block; VM: page
  • Cache: cache miss; VM: page fault
• Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU
• Cache speeds up main memory access, while main memory speeds up VM access
• Translation Look-Aside Buffer (TLB)
• How to calculate the size of page tables for a given memory system
• How to calculate the size of pages given the size of the page table
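A minimal sketch of the two page-table calculations, assuming a flat single-level page table with one fixed-size entry per virtual page. The parameters (32-bit virtual addresses, 4 KB pages, 4-byte entries) are assumed for illustration; exam problems may use other values or a multi-level table.

```python
def page_table_bytes(virtual_addr_bits, page_bytes, pte_bytes):
    """Size of a flat page table: one entry per virtual page."""
    num_pages = 2 ** virtual_addr_bits // page_bytes
    return num_pages * pte_bytes


def page_bytes_for_table(virtual_addr_bits, table_bytes, pte_bytes):
    """Page size implied by a given flat page table size."""
    num_entries = table_bytes // pte_bytes
    return 2 ** virtual_addr_bits // num_entries


table = page_table_bytes(virtual_addr_bits=32, page_bytes=4096, pte_bytes=4)
page = page_bytes_for_table(virtual_addr_bits=32, table_bytes=table, pte_bytes=4)
print(f"page table = {table // 2**20} MB, page size = {page} bytes")
```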

Virtual Memory: Definitions
• Key idea: simulate a larger physical memory than is actually available
• General approach:
  • Break the address space up into pages
  • Each program accesses a working set of pages
  • Store pages:
    • In physical memory as space permits
    • On disk when no space is left in physical memory
  • Access pages using virtual addresses
[Figure: individual pages of virtual memory mapped through a memory map to physical memory and disk.]

I/O Systems

I/O concepts
• Disk performance
  • Disk latency = average seek time + average rotational delay + transfer time + controller overhead
• Interrupt-driven I/O
• Memory-mapped I/O
• I/O channels:
  • DMA (Direct Memory Access)
• I/O communication protocols
  • Daisy chaining
  • Polling
• I/O buses
  • Synchronous vs. asynchronous
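The disk latency formula can be sketched as follows, taking the average rotational delay as half a revolution. The drive parameters (8 ms seek, 7200 RPM, 50 MB/s transfer rate, 4 KB request, 0.2 ms controller overhead) are assumed for illustration.

```python
def disk_latency_ms(seek_ms, rpm, request_bytes, transfer_bytes_per_s, controller_ms):
    """Seek + average rotational delay + transfer + controller overhead."""
    rotational_ms = 0.5 * 60_000 / rpm   # half a revolution on average
    transfer_ms = request_bytes / transfer_bytes_per_s * 1000
    return seek_ms + rotational_ms + transfer_ms + controller_ms


# Assumed drive parameters, not taken from the course.
latency = disk_latency_ms(seek_ms=8, rpm=7200, request_bytes=4096,
                          transfer_bytes_per_s=50_000_000, controller_ms=0.2)
print(f"disk latency = {latency:.2f} ms")
```

For small requests like this one, seek time and rotational delay dominate; the transfer itself is well under a tenth of a millisecond.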

RAID Systems
• Examined various RAID architectures, RAID-0 to RAID-5: cost, performance (bandwidth, I/O request rate)
  • RAID-0: No redundancy
  • RAID-1: Mirroring
  • RAID-2: Memory-style ECC
  • RAID-3: Bit-interleaved parity
  • RAID-4: Block-interleaved parity
  • RAID-5: Block-interleaved distributed parity
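The parity-based levels (RAID-3 to RAID-5) rely on XOR parity, which can be sketched as below. This only illustrates how a lost block is rebuilt from the surviving blocks and the parity block; the block contents and the function name `parity` are made up for the example, and striping or parity placement is not modeled.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk (made-up contents).
data = [b"\x12\x34", b"\x56\x78", b"\x9a\xbc"]
p = parity(data)                       # stored on the parity disk

# Suppose the disk holding data[1] fails: XOR the survivors with the parity.
rebuilt = parity([data[0], data[2], p])
```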

Storage Architectures
• Examined various storage architectures (pros and cons):
  • DAS - Directly-Attached Storage
  • NAS - Network Attached Storage
  • SAN - Storage Area Network

Multiprocessors

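Motivation
• Application needs
• Amdahl's law
  • T(n) = 1 / (s + p/n)
  • As n → ∞, T(n) → 1/s
• Gustafson's law
  • T'(n) = s + n*p; T'(∞) → ∞ !!!!
(T is the speedup on n processors, s the serial fraction, and p the parallel fraction of the work.)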
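A small sketch comparing the two laws, assuming their standard forms: speedup 1 / (s + p/n) for Amdahl and s + n*p for Gustafson, with s + p = 1. The 10% serial fraction is an assumed example value.

```python
def amdahl_speedup(s, n):
    """Fixed problem size: speedup = 1 / (s + p/n), with p = 1 - s."""
    p = 1 - s
    return 1 / (s + p / n)


def gustafson_speedup(s, n):
    """Problem size scaled with n: speedup = s + n*p, with p = 1 - s."""
    p = 1 - s
    return s + n * p


s = 0.1  # assumed serial fraction
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(s, n), 2), round(gustafson_speedup(s, n), 2))
```

Amdahl's speedup levels off near 1/s = 10 as n grows, while Gustafson's scaled speedup keeps growing linearly with n.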

Flynn's Taxonomy of Computing
• SISD (Single Instruction, Single Data):
  • Typical uniprocessor systems that we've studied throughout this course
• SIMD (Single Instruction, Multiple Data):
  • Multiple processors simultaneously executing the same instruction on different data
  • Specialized applications (e.g., image processing)
• MIMD (Multiple Instruction, Multiple Data):
  • Multiple processors autonomously executing different instructions on different data

Shared Memory Multiprocessors
[Figure: P/C nodes, each with a cache, attached through MB and NIC to a bus or custom-designed network that connects them to a shared memory.]

MPP (Massively Parallel Processing): Distributed Memory Multiprocessors
• MB: Memory Bus
• NIC: Network Interface Circuitry
[Figure: P/C nodes, each with its own LM on an MB, attached through NICs to a custom-designed network.]

Cluster
• LD: Local Disk
• IOB: I/O Bus
[Figure: nodes each consisting of P/C and M on an MB, with a bridge to an IOB that carries the LD and NIC; the NICs attach to a commodity network (Ethernet, ATM, Myrinet).]

Grid
[Figure: nodes built from P/C, SM, IOC, LD, and NIC components, grouped on hubs/LANs that are connected through the Internet.]

Multiprocessor concepts
• SIMD
  • Applications (image processing)
• MIMD
  • Shared memory
    • Cache coherence problems
    • Bus scalability problems
  • Distributed memory
    • Interconnection networks
  • Cluster of workstations

Preparation Strategy
• Read this review to focus your preparation
  • 1 general question
  • 5-6 other questions
  • Around 50% for memory systems
  • Around 50% for I/O and multiprocessors
• Go through the lecture notes
• Go through the "training problems"
• We will have more office hours for help
• Good luck
