Memory Hierarchy - POWER5 Processor

Presentation Transcript


  1. Memory Hierarchy - POWER5 Processor By Ajay Mathews Cheriyan, Jian Wang, Shoaib Akram

  2. Topics • Memory Hierarchy Overview • Memory Subsystem • Virtual Memory and Prefetching

  3. POWER-X Introduction • IBM POWER: Performance Optimization With Enhanced RISC • Used in many of IBM's servers, workstations, and supercomputers • POWER1 to POWER3 (1990-1998) • POWER4 (2001): dual-core chips, up to 8 cores, the most powerful at the time • POWER5 (2004): memory-system improvements; supports simultaneous multithreading (SMT) • POWER6 (May 2007): advanced interchip communication technology; double the performance of POWER5 • POWER7: currently in development

  4. POWER5 Chip Overview [die diagram: two processor cores (P1, P2), the shared L2, the memory controller (MC), and the on-chip L3 directory/control (L3 Ctrl)]

  5. L1 Cache L1 caches are not shared between processors: each core has its own, with LRU replacement (vs. FIFO in POWER4). L1 I-cache: size 64 KB per processor; associativity 2-way (vs. direct-mapped in POWER4); line size 128 bytes; write policy N/A (instructions are read-only). L1 D-cache: size 32 KB per processor; associativity 4-way (vs. 2-way in POWER4); line size 128 bytes; write policy write-through; transfer rate 4 words/cycle. (A geometry sketch follows below.)
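
The set counts of both L1 caches follow from the slide's parameters. A minimal sketch in C, assuming the usual relation sets = size / (ways × line size):

```c
#include <stdio.h>

/* Derive the number of cache sets from size, associativity,
   and line size (parameters taken from the slide). */
static unsigned num_sets(unsigned size_bytes, unsigned ways, unsigned line)
{
    return size_bytes / (ways * line);
}

int main(void)
{
    /* POWER5 L1 I-cache: 64 KB, 2-way, 128-byte lines */
    printf("I-cache sets: %u\n", num_sets(64 * 1024, 2, 128)); /* 256 */
    /* POWER5 L1 D-cache: 32 KB, 4-way, 128-byte lines */
    printf("D-cache sets: %u\n", num_sets(32 * 1024, 4, 128)); /* 64 */
    return 0;
}
```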

  6. L2 Cache • Three identical slices, shared between the processors • (Memory address) mod 3 = slice ID (see the sketch below) • Three independent controllers, one per slice; each can deliver 32 B/cycle (60.8 GB/s) to L1 • Total size: 1.9 MB (vs. 1.4 MB in POWER4) • Associativity: 10-way (vs. 8-way) • Latency: 13 cycles, 6.8 ns (vs. 12 cycles, 7.1 ns) • Line size: 128 bytes • Write policy: write-back • Transfer rate: 4 words/cycle
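
A minimal sketch of the slide's slice-selection rule, assuming the modulo-3 hash is applied to the 128-byte line address so that a whole line stays in one slice (the exact address bits used by the real hardware are not given on the slide):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 128u  /* L2 line size from the slide */

/* (Memory address) mod 3 = slice ID, applied per cache line. */
static unsigned l2_slice(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % 3);
}

int main(void)
{
    /* Consecutive lines rotate across the three slices. */
    for (uint64_t a = 0; a < 5 * LINE_BYTES; a += LINE_BYTES)
        printf("line @%#llx -> slice %u\n",
               (unsigned long long)a, l2_slice(a));
    return 0;
}
```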

  7. L3 Cache • Three identical slices, each attached to one L2 slice • Off-chip, shared between the processors • The L3 directory/control is on-chip, which avoids an off-chip delay just to check the directory • Total size: 36 MB (vs. 32 MB in POWER4) • Associativity: 12-way (vs. 8-way) • Latency: 87 cycles, 45.8 ns (vs. 123 cycles, 72.3 ns) • Line size: 256 bytes • Write policy: write-back • Transfer rate: <1 word/cycle

  8. Structural Optimizations over POWER4 (1) • L3 removed from the path between the chip and the memory controller. Why? Heavy traffic on the FBC (16 chips) • This reduces latency to the L3, which is now physically closer to the CPUs • L3 acts as a victim cache for L2 • L3 now operates at 1/2 the processor clock rate, versus 1/3 in POWER4 • This optimization increases bandwidth by about half and reduces latency by roughly 1/3 * FBC = Fabric Bus Controller

  9. Structural Optimizations over POWER4 (2) • Memory controller (MC) integrated into the chip: separate paths from processor to L3 and from processor to memory • Increases potential operational parallelism and bandwidth • Significantly reduces latency by eliminating communication delays to an external controller • The benefit of these optimizations (POWER4 vs. POWER5): L3 latency 72.3 ns vs. 45.8 ns; memory latency 206 ns vs. 116 ns; bandwidth (4 processors) 8.37 GB/s vs. 17.9 GB/s

  10. Next Topic • Memory Hierarchy Overview • Memory Subsystem • Virtual Memory and Prefetching

  11. Memory Subsystem Path followed by requests: L2 controller → fabric controller → memory controller [diagram: four processors feed two L2s, then pass through the FBCs and MCs to physical memory]

  12. Memory Controller [block diagram] • Read/write reorder queues • Scheduler: selects operations from the queues • FIFO-based arbiter queue

  13. Memory Controller • Separate read and write reorder queues, eight entries each; increasing their capacity would lengthen the clock cycle • Read reordering and write reordering are done differently • A centralized FIFO arbiter queue prevents CPU stalls when the memory controller is under stress • The scheduler selects operations from the queues (a toy model follows below)
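
A toy C model of the structures named above: two eight-entry queues and a scheduler that picks the next operation. The slide does not describe the actual reordering heuristics, so the read-before-write policy here is purely an assumption:

```c
#include <stdbool.h>
#include <stdio.h>

#define QDEPTH 8  /* eight entries per reorder queue, per the slide */

struct queue { unsigned long addr[QDEPTH]; int head, count; };

static bool q_push(struct queue *q, unsigned long a)
{
    if (q->count == QDEPTH)
        return false;                              /* queue full */
    q->addr[(q->head + q->count++) % QDEPTH] = a;  /* circular buffer */
    return true;
}

static bool q_pop(struct queue *q, unsigned long *a)
{
    if (q->count == 0)
        return false;
    *a = q->addr[q->head];
    q->head = (q->head + 1) % QDEPTH;
    q->count--;
    return true;
}

/* Toy scheduler: drain reads before writes. The slide says only
   that a scheduler selects operations from the queues; the real
   selection policy is not described. */
static bool schedule(struct queue *rd, struct queue *wr, unsigned long *a)
{
    return q_pop(rd, a) || q_pop(wr, a);
}

int main(void)
{
    struct queue rd = {0}, wr = {0};
    q_push(&wr, 0x1000);
    q_push(&rd, 0x2000);
    unsigned long a;
    while (schedule(&rd, &wr, &a))
        printf("issue %#lx\n", a);  /* the read issues first */
    return 0;
}
```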

  14. Interface to Physical Memory • Buses operate at twice the DRAM speed • Memory protection: ECC and memory scrubbing • SMI chips match the bus width between the memory controller and the DIMMs [diagram: memory controller read/write/command buses feed 4 SMI chips, each connected to 2 + 2 DIMMs; DIMM = Dual In-line Memory Module]

  15. Memory Requests & Multiple Chips • Ring topology • Snooping mechanism • Combined response • Fabric buses [diagram: processors linked in a ring by fabric buses labeled AB, SB, RB, DB] (a combined-response sketch follows below)
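
A small sketch of how a combined response can be formed from per-chip snoop responses. The response names and the priority ordering are illustrative assumptions; the slide states only that individual snoop responses are merged into one combined response:

```c
#include <stdio.h>

/* Hypothetical snoop responses, ordered so a larger value "wins"
   when responses from all chips on the ring are combined. */
enum snoop_resp { RESP_NULL = 0, RESP_SHARED = 1,
                  RESP_MODIFIED = 2, RESP_RETRY = 3 };

static enum snoop_resp combine(const enum snoop_resp *r, int nchips)
{
    enum snoop_resp c = RESP_NULL;
    for (int i = 0; i < nchips; i++)
        if (r[i] > c)
            c = r[i];  /* highest-priority response wins */
    return c;
}

int main(void)
{
    enum snoop_resp ring[4] = { RESP_NULL, RESP_SHARED,
                                RESP_NULL, RESP_MODIFIED };
    printf("combined response: %d\n", combine(ring, 4)); /* 2: MODIFIED */
    return 0;
}
```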

  16. Next Topic • Memory Hierarchy Overview • Memory Subsystem • Virtual Memory and Prefetching

  17. Virtual Memory • 64-bit virtual addresses and 50-bit real addresses • Two steps to address translation • An effective address is translated to a virtual address using a 64-entry segment lookaside buffer (SLB) • The virtual address is translated to a real address using the page table • The page table is cached in a 1024-entry, 4-way set-associative TLB (a compilable sketch of this lookup path follows below)
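
A compilable sketch of the two-step lookup path, using the SLB and TLB sizes from the slide. The segment and page sizes (256 MB segments, 4 KB pages) and all field layouts are assumptions for illustration only:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12        /* assume 4 KB base pages */
#define SEG_SHIFT  28        /* assume 256 MB segments */
#define SLB_ENTRIES 64       /* per the slide */
#define TLB_WAYS    4        /* 1024-entry, 4-way TLB per the slide */
#define TLB_SETS    (1024 / TLB_WAYS)

struct slb_entry { uint64_t esid, vsid; bool valid; };
struct tlb_entry { uint64_t vpn, rpn; bool valid; };

static struct slb_entry slb[SLB_ENTRIES];
static struct tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* Step 1: effective -> virtual, via the segment lookaside buffer. */
static bool ea_to_va(uint64_t ea, uint64_t *va)
{
    uint64_t esid = ea >> SEG_SHIFT;
    for (int i = 0; i < SLB_ENTRIES; i++)
        if (slb[i].valid && slb[i].esid == esid) {
            *va = (slb[i].vsid << SEG_SHIFT) | (ea & ((1ull << SEG_SHIFT) - 1));
            return true;
        }
    return false;            /* SLB miss */
}

/* Step 2: virtual -> real, via the TLB (a miss would walk the
   page table in memory, not modeled here). */
static bool va_to_ra(uint64_t va, uint64_t *ra)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tlb_entry *set = tlb[vpn % TLB_SETS];
    for (int w = 0; w < TLB_WAYS; w++)
        if (set[w].valid && set[w].vpn == vpn) {
            *ra = (set[w].rpn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
            return true;
        }
    return false;            /* TLB miss */
}

int main(void)
{
    /* Install one SLB entry and the matching TLB entry, then translate. */
    slb[0] = (struct slb_entry){ .esid = 0x1, .vsid = 0xABC, .valid = true };
    uint64_t ea = (0x1ull << SEG_SHIFT) | 0x2345, va, ra;
    uint64_t vpn = ((0xABCull << SEG_SHIFT) | 0x2345) >> PAGE_SHIFT;
    tlb[vpn % TLB_SETS][0] =
        (struct tlb_entry){ .vpn = vpn, .rpn = 0x777, .valid = true };
    if (ea_to_va(ea, &va) && va_to_ra(va, &ra))
        printf("EA %#llx -> RA %#llx\n",
               (unsigned long long)ea, (unsigned long long)ra);
    return 0;
}
```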

  18. Virtual Memory (Contd.) • For fast translation, two first-level translation tables (the ERATs) are used: one for instructions and one for data • They provide fast effective-to-real address translation; the SLB and TLB are looked up only on a miss in the first-level tables • Data translation table: 128-entry fully associative array • Instruction translation table: 64-entry, 2-way set-associative array • Entries in both tables are tagged with the thread number and are not shared between threads • TLB entries can be shared between threads

  19. Prefetching When load instructions miss sequential cache lines, the prefetch engine initiates accesses to the following cache lines before they are referenced by future load instructions. An L1 data-cache prefetch is initiated when a load references data from a new cache line, and a new line is then loaded into L2 from memory. Because the latency for retrieving data from memory is high, the prefetch engine requests data from memory 12 lines ahead of the line being referenced by the load (see the sketch below).
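
A minimal sketch of the 12-lines-ahead rule, assuming 128-byte lines and an already-established ascending stream:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE 128u          /* cache line size from the slides */
#define MEM_AHEAD 12       /* prefetch 12 lines ahead, per the slide */

/* On a demand load that advances an established ascending stream,
   issue a memory prefetch 12 lines ahead of the demand line. */
static void stream_advance(uint64_t demand_addr)
{
    uint64_t line = demand_addr / LINE;
    printf("demand line %llu -> prefetch line %llu from memory\n",
           (unsigned long long)line,
           (unsigned long long)(line + MEM_AHEAD));
}

int main(void)
{
    for (uint64_t a = 0; a < 4 * LINE; a += LINE)
        stream_advance(a);
    return 0;
}
```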

  20. Prefetching (Contd.) • The hardware ramps up prefetching slowly, requiring 2 additional sequential misses before a stream reaches steady-state prefetch sequencing • Software prefetching is also supported, letting software indicate to the hardware the number of lines to prefetch • Advantages: improves performance by eliminating the initial ramp-up; only the required number of lines are prefetched • Eight software prefetch streams are supported per processor (a portable illustration follows below)
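
A portable illustration of software-initiated prefetch using GCC's __builtin_prefetch, which lowers to a cache-touch instruction such as dcbt on Power. The POWER5-specific way of encoding the number of lines to fetch is not shown, and the 16-element prefetch distance is an arbitrary illustrative choice:

```c
#include <stddef.h>

/* Sum an array, touching data 16 elements ahead of the loop.
   __builtin_prefetch(addr, rw, locality): rw=0 means read,
   locality=1 hints at low temporal reuse. */
static long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}

int main(void)
{
    long v[1024];
    for (size_t i = 0; i < 1024; i++)
        v[i] = (long)i;
    return sum(v, 1024) == 523776 ? 0 : 1;  /* sanity check */
}
```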

  21. Prefetching (Contd.) • Upon a cache miss, a biased guess is made as to the direction of the stream • The guess is based on where within the cache line the address of the miss falls • If it falls in the first 3/4 of the line, the direction is guessed ascending • If in the last 1/4, the direction is guessed descending (see the sketch below) • Instruction prefetching is also present in the POWER5, with 4 instruction prefetch buffers (2 per thread)
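
The direction heuristic maps directly to a test on the miss address's offset within its line. A sketch, assuming 128-byte lines:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE 128u

/* First 3/4 of the line (offsets 0..95): guess ascending.
   Last 1/4 (offsets 96..127): guess descending. */
static int guess_ascending(uint64_t miss_addr)
{
    return (miss_addr % LINE) < (LINE * 3) / 4;
}

int main(void)
{
    printf("offset 0   -> %s\n", guess_ascending(0)   ? "ascending" : "descending");
    printf("offset 95  -> %s\n", guess_ascending(95)  ? "ascending" : "descending");
    printf("offset 96  -> %s\n", guess_ascending(96)  ? "ascending" : "descending");
    printf("offset 127 -> %s\n", guess_ascending(127) ? "ascending" : "descending");
    return 0;
}
```

The 3/4 : 1/4 split biases the guess toward ascending streams, which matches the more common forward sequential access pattern.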

  22. Main Memory Both DDR and DDR2 DIMMs can be used with the POWER5. SMI chips connect the DIMMs to the processor; 2 chips (2-SMI mode) or 4 chips (4-SMI mode) are supported. Each SMI chip has two ports, and each port can support up to 2 DIMMs.

  23. THANK YOU
