
Presentation Transcript


  1. www-inst.eecs.berkeley.edu/~cs152/ CS 152 Computer Architecture and Engineering. Lecture 10 -- Cache I. 2014-2-20. John Lazzaro (not a prof - “John” is always OK). TA: Eric Love.

  2. Today: Caches and the Memory System. [Figure: the five classic components -- Processor (Control, Datapath), Memory, Input, Output.] Static Memory: Used in cache designs. Short break. Memory Hierarchy: Technology motivation for caching.

  3. Static Memory Circuits. Static Memory: Circuit remembers as long as the power is on. Dynamic Memory: Circuit remembers for a fraction of a second. Non-volatile Memory: Circuit remembers for many years, even if power is off.

  4. Preliminaries

  5. Inverters: Building block for SRAM. [Figure: CMOS inverter schematic and its logic symbol, with input Vin, output Vout, and supply Vdd.]

  6. Inverter: Die Cross Section. [Figure: nFET built from n+ diffusions in the p- substrate, pFET from p+ diffusions in an n-well, with gate oxide above the channels; Vin drives both gates, Vout is taken from the drains.]

  7. Recall: Our simple inverter model ... nFET: a switch, “on” if its gate is at Vdd. pFET: a switch, “on” if its gate is grounded. The model correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well ...

  8. When the 0/1 model is too simple ... We wire the output of the inverter to drive its input. What happens? Logic simulators based on our too-simple model predict this circuit will oscillate! This prediction is incorrect. In reality, Vin = Vout settles to a stable value, defined as Vth, where the nFET current Ids and the pFET current Isd match. Can we figure out Vth without solving tedious equations?

  9. Graphical equation solving ... Recall the current-vs-voltage graphs from the power and energy lecture: along the line Vin = Vout, plot the nFET Ids curve and the pFET Isd curve on the same axes. The intersection defines Vth. (Note: ignores second-order effects.)
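
The same intersection can also be found numerically. Below is a minimal sketch assuming simple square-law (saturation-region) FET models; the parameters k_n, k_p and the 0.4 V device thresholds are illustrative values, not numbers from the lecture.

```python
# A sketch, not the lecture's method: find Vth numerically with simple
# square-law (saturation) FET models. All parameter values are illustrative.

def i_ds_nfet(v, k_n=200e-6, v_tn=0.4):
    """nFET drain current with its gate (and drain) at v."""
    return k_n * max(v - v_tn, 0.0) ** 2

def i_sd_pfet(v, vdd=1.0, k_p=100e-6, v_tp=0.4):
    """pFET source-drain current with its gate (and drain) at v."""
    return k_p * max(vdd - v - v_tp, 0.0) ** 2

def find_vth(vdd=1.0, iters=50):
    """Bisect for the Vin = Vout point where the two currents match --
    the intersection the slide finds graphically."""
    lo, hi = 0.0, vdd
    for _ in range(iters):
        mid = (lo + hi) / 2
        if i_ds_nfet(mid) < i_sd_pfet(mid):   # pFET still stronger: go up
            lo = mid
        else:                                 # nFET stronger: go down
            hi = mid
    return (lo + hi) / 2

print(f"Vth ≈ {find_vth():.3f} V")   # ≈ 0.483 V for these parameters
```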

  10. Recall: Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ... An “on” p-FET fills up the capacitor with charge: the water level rises toward “1” over time. An “on” n-FET empties the bucket: the water level falls toward “0” over time.
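
To make the bucket analogy concrete, here is a small sketch treating an “on” transistor as a resistor charging a capacitor (a stronger W/L means a smaller R, a wider pipe); the component values and time step are illustrative assumptions.

```python
# A sketch of the analogy: an "on" transistor as a resistor R charging
# capacitor C (the bucket). R, C, and dt are illustrative values.

def fill_bucket(vdd=1.0, r=10e3, c=10e-15, dt=1e-12, steps=500):
    """Forward-Euler charge-up of C through R toward Vdd."""
    v = 0.0
    for _ in range(steps):
        i = (vdd - v) / r      # current through the "pipe"
        v += i * dt / c        # water level rises in the bucket
    return v

weak   = fill_bucket(r=10e3)   # narrow pipe (small W/L)
strong = fill_bucket(r=5e3)    # wider pipe (bigger W/L) fills faster
print(f"after 500 ps: weak {weak:.3f} V, strong {strong:.3f} V")
```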

  11. What happens when we break the tie wire? Vin is left free to float. Small amounts of noise on Vin cause Ids > Isd or Isd > Ids ... and the output bucket randomly fills and empties. Result: Vout randomly flips between logic 0 and logic 1.

  12. SRAM. 1971 state of the art: the Intel 2102, a 1 kb, 1 MHz static RAM chip with 6000 nFET transistors in a 10 μm process.

  13. Recall the DRAM cell: 1 T + 1 C. [Figure: one transistor and one capacitor per cell; the “word line” selects a “row”, the “bit line” reads a “column”.]

  14. Idea: Store each bit (x) together with its complement (y). Why? We can use the redundant representation to compensate for noise and leakage. [Figure: two cells on one “row”, one holding x at Vdd or Gnd, the other holding y at the opposite value.]

  15. Case #1: y = Gnd, x = Vdd ... [Figure: inverter currents Ids and Isd restore the two nodes toward Gnd and Vdd for this case.]

  16. Case #2: y = Vdd, x = Gnd ... [Figure: inverter currents Ids and Isd restore the two nodes toward Vdd and Gnd for this case.]

  17. Combine both cases to complete the circuit: “cross-coupled inverters”. [Figure: two inverters in a loop holding x and y; noise on either node, up to roughly Vth, is restored toward Gnd or Vdd.]

  18. SRAM Challenge #1: It’s so big! SRAM cell area is 6-10X DRAM cell area in the same process generation ... The cell has both transistor types, needs both Vdd AND Gnd, and has more contacts, more devices, and two bit lines ... Capacitors are usually “parasitic” capacitance of wires and transistors.

  19. Intel SRAM core cell (45 nm). [Die photo, with bit lines and word lines labeled.]

  20. Challenge #2: Writing is a “fight”. When the word line goes high, the bit lines “fight” with the cell inverters to “flip the bit” -- and must win quickly! [Figure: one node starts at Vdd while its bit line drives Gnd; the other starts at Gnd while its bit line drives Vdd.] Solution: tune the W/L of the cell & driver transistors.

  21. Challenge #3: Preserving state on read. When the word line goes high on a read, the cell inverters must drive the large bit-line capacitance quickly, while preserving the state held on the cell’s small capacitances. [Figure: a cell holding Vdd or Gnd drives a bit line that is a big capacitor.]

  22. SRAM array: like DRAM, but non-destructive. Architects specify the number of rows and columns. Muxes select the subset of bits routed to the parallel data I/O lines, and a write driver sits on each column. Word and bit lines slow down as the array grows larger! For large SRAMs: tile small arrays, and connect them with muxes and decoders, as in the sketch below.
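
As a sketch of the tiling idea: a large SRAM address can be split into fields that pick a tile, a word line within the tile, and a column-mux setting. The field widths below are illustrative assumptions, not the lecture's numbers.

```python
# A sketch: splitting a word address across tiles, rows, and columns.
# Field widths are illustrative assumptions.

TILE_BITS, ROW_BITS, COL_BITS = 2, 8, 6   # 4 tiles, 256 rows, 64 columns

def decode(addr: int):
    """Split an address into (tile, word line, column-mux select)."""
    col  = addr & ((1 << COL_BITS) - 1)
    row  = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    tile = (addr >> (COL_BITS + ROW_BITS)) & ((1 << TILE_BITS) - 1)
    return tile, row, col

# Keeping each tile small keeps its word and bit lines short (and fast):
# a decoder picks the tile, the row decoder a word line, and the column
# mux one of 64 bits per data line.
print(decode(0xBEEF & 0xFFFF))   # -> (2, 251, 47)
```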

  23. SRAM vs DRAM, pros and cons. Big win for DRAM: a 6-10X density advantage at the same technology generation. SRAM advantages: SRAM is much faster -- transistors drive the bit lines on reads. SRAM has deterministic latency: its cells do not need to be refreshed. SRAM is easy to design in a logic fabrication process (and premium logic processes have SRAM add-ons).

  24. RAM Compilers. On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.

  25. Flip Flops Revisited

  26. Recall: Static RAM cell (6 transistors). [Figure: “cross-coupled inverters” holding x and x!; noise on either node is restored toward Gnd or Vdd, up to roughly Vth.]

  27. Recall: Positive edge-triggered flip-flop. A flip-flop “samples” right before the edge, and then “holds” the value. [Figure: a sampling circuit feeding a circuit that holds the value; D in, Q out.] 16 transistors: makes an SRAM cell look compact! What do we get for the 10 extra transistors? Clocked logic semantics.

  28. Sensing: When the clock is low (clk = 0, clk’ = 1), the sampling circuit senses D, while the hold circuit outputs the last value captured. The flip-flop will capture a new value on the posedge.

  29. Capture: When the clock goes high (clk = 1, clk’ = 0), the flip-flop captures the sensed value; it outputs the value just captured, and remembers it.

  30. Flip-flop delays: clk-to-Q, setup, and hold. When CLK == 0: sense D, but Q outputs the old value. On CLK 0->1: capture D and pass the value to Q, which changes after the clk-to-Q delay. For a correct capture, D must be stable for the setup time before the edge and for the hold time after it. [Timing diagram: setup, hold, and clk-to-Q around the clock edge.]
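
Here is a minimal sketch of how these three delays constrain a register-to-register path; the numeric delay values are illustrative assumptions, not numbers from the lecture.

```python
# A sketch: the three flip-flop delays as timing constraints.
# All delay values below are illustrative assumptions.

CLK_TO_Q = 0.10   # ns: edge until Q changes
SETUP    = 0.05   # ns: D must be stable this long before the edge
HOLD     = 0.03   # ns: D must be stable this long after the edge

def min_clock_period(max_logic_delay: float) -> float:
    """Setup check: the slowest path between flip-flops limits the clock."""
    return CLK_TO_Q + max_logic_delay + SETUP

def hold_ok(min_logic_delay: float) -> bool:
    """Hold check: the fastest path must not race the same clock edge."""
    return CLK_TO_Q + min_logic_delay >= HOLD

print(min_clock_period(0.85))   # -> 1.0 ns, i.e. a 1 GHz ceiling
print(hold_ok(0.0))             # True: clk-to-Q alone exceeds hold here
```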

  31. From flip-flops to latches ... The flip-flop is a sampling circuit followed by a circuit that holds the value; each half is a latch. Latch-based design: break the flip-flop circuit into two latch state elements, then add combinational logic between the latches. Latches are good for making small memories: they save half the area over using D flip-flops.

  32. Break.

  33. The Memory Hierarchy

  34. 60% of the area of this CPU is devoted to SRAM cache. But the role of cache in computer design has varied widely over time.

  35. 1977: DRAM faster than microprocessors. Apple ][ (1977): CPU cycle time 1000 ns; DRAM access time 400 ns. [Photo: Steve Wozniak and Steve Jobs.]

  36. Since then: Technology scaling ... A circuit L nanometers long in 250 nm technology (introduced in 2000) becomes 0.7 x L in 180 nm technology (introduced in 2003): each dimension is 30% smaller, and area is 50% smaller. Logic circuits use smaller C’s, lower Vdd, and higher kn and kp to speed up clock rates.

  37. DRAM scaled for more bits, not more MHz. Assume Ccell = 1 fF. A bit line may have 2000 nFET drains; assume a bit-line capacitance of 100 fF, or 100*Ccell. Ccell holds charge Q = Ccell*(Vdd - Vth). When we dump this charge onto the bit line, what voltage do we see? dV = [Ccell*(Vdd - Vth)] / [100*Ccell] = (Vdd - Vth) / 100 ≈ tens of millivolts! In practice, scale the array to get a 60 mV signal.
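
The slide's charge-sharing arithmetic, as a few lines of Python; the slide leaves Vdd and Vth symbolic, so the supply and threshold values below are assumptions for illustration.

```python
# A sketch of the slide's charge-sharing arithmetic. Vdd and Vth are
# assumed values; the capacitances are the slide's.

C_CELL = 1e-15          # 1 fF storage capacitor
C_BIT  = 100 * C_CELL   # bit line: ~2000 nFET drains -> ~100 fF
VDD, VTH = 2.5, 0.5     # assumed supply and access-device threshold

q  = C_CELL * (VDD - VTH)   # charge held in the cell
dv = q / C_BIT              # voltage swing seen on the bit line
print(f"dV = {dv * 1e3:.1f} mV")   # -> 20.0 mV: tens of millivolts
```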

  38. 1980-2003, CPU speed outpaced DRAM ... CPU performance grew 60% per year (2X in 1.5 years) until the power wall; DRAM grew 9% per year (2X in 10 years). The gap grew 50% per year. [Plot: performance (1/latency), log scale from 10 to 10000, vs. year from 1980 to 2005; the CPU curve pulls away from the DRAM curve.] Q. How do architects address this gap? A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.
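
A quick sketch of how those growth rates compound over 1980-2003 (the rates are the slide's; the loop is just compounding arithmetic):

```python
# The slide's rates, compounded: CPU 60%/yr vs DRAM 9%/yr, 1980-2003.

cpu = dram = 1.0
for year in range(1980, 2004):        # 24 years
    cpu *= 1.60
    dram *= 1.09
# The gap ratio compounds at 1.60/1.09 - 1 ≈ 47%/yr -- the slide's ~50%.
print(f"CPU {cpu:,.0f}x, DRAM {dram:.0f}x, gap {cpu / dram:,.0f}x")
```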

  39. Caches: Variable-latency memory ports. [Figure: the CPU sends an address and data to a small, fast upper memory backed by a large, slow lower memory.] Data in the upper level is returned with lower latency; data in the lower level is returned with higher latency.

  40. Queues as a building block for memory systems. Avoid blocking by using a queue (a First-In, First-Out buffer, or FIFO) to communicate between two sub-systems.

  41. A variable-latency port that doesn’t stall on a miss. The CPU makes a request by placing the following items in Queue 1 (from CPU to cache): CMD: read, write, etc ... MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit. TAG: 9-bit number identifying the request. MADDR: memory address of the first byte. STORE-DATA: for stores, the data to store.

  42. When the request is ready, the cache places the following items in Queue 2 (from cache to CPU): TAG: identity of the completed command. LOAD-DATA: for loads, the requested data. The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order. This cache is used in an ASPIRE CPU (Rocket).
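
A minimal sketch of this tagged-queue protocol: a FIFO in each direction plus a CPU-side table indexed by TAG, so responses may return out of order. The class and field names are illustrative, not taken from the Rocket sources.

```python
# A sketch of the tagged-queue protocol the slides describe.
# Names are illustrative assumptions.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    cmd: str                        # "read", "write", etc ...
    mtype: int                      # access size: 8, 16, 32, or 64 bits
    tag: int                        # 9-bit id, 0..511
    maddr: int                      # memory address of the first byte
    store_data: int | None = None   # for stores only

queue1 = deque()    # CPU -> cache: Request records
queue2 = deque()    # cache -> CPU: (TAG, LOAD-DATA) pairs
pending = {}        # CPU-side info about requests, indexed by TAG

def cpu_issue(req):
    pending[req.tag] = req          # save info, indexed by TAG
    queue1.append(req)              # (cache side elided in this sketch)

def cpu_retire():
    tag, data = queue2.popleft()
    req = pending.pop(tag)          # match response to request, any order
    print(f"tag {tag}: {req.cmd} {req.maddr:#x} -> {data:#x}")

# Two misses proceed in parallel; the second completes first:
cpu_issue(Request("read", 32, tag=5, maddr=0x1000))
cpu_issue(Request("read", 32, tag=6, maddr=0x2000))
queue2.append((6, 0xBEEF))          # loads return out of order
queue2.append((5, 0xCAFE))
cpu_retire()                        # retires tag 6 first -- and that's fine
cpu_retire()
```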

  43. Cache replaces the data and instruction memories. [Pipeline figure: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages; the instruction and data memories are replaced with an Instruction Cache and a Data Cache backed by DRAM main memory.]

  44. Recall: Intel ARM XScale CPU (PocketPC), 180 nm process (introduced 2003). [Die photo: 32 KB Instruction Cache and 32 KB Data Cache.]

  45. ARM CPU: the 32 KB instruction cache uses 3 million transistors. Typical miss rate: 1.5%. The DRAM interface uses 61 pins that toggle at 100 MHz.

  46. Memory Hierarchy: Apple iMac G5 (1.6 GHz, $1299.00, 2005). [Figure: the levels of the hierarchy -- registers managed by the compiler, caches managed by hardware, main memory and disk managed by the OS, hardware, and application.] Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

  47. PowerPC 970 FX: 90 nm, 58 M transistors. [Die photo: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2.]

  48. Latency: A closer look. Read latency: time to return the first byte of a random access. Architect’s latency toolkit: (1) Parallelism. Request data from N 1-bit-wide memories at the same time. Overlaps the latency cost for all N bits, and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth. (2) Pipeline memory. If the memory has N cycles of latency, issue a request each cycle and receive it N cycles later, as in the sketch below.
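
A minimal sketch of toolkit item (2): a memory with N cycles of latency that still accepts one request per cycle, so one result returns per cycle once the pipeline fills. N and the toy contents are assumptions.

```python
# A sketch of a pipelined memory: N cycles of latency, one new request
# accepted per cycle. N and the memory contents are illustrative.

from collections import deque

N = 4                                  # assumed read latency in cycles
mem = {a: a * 10 for a in range(8)}    # toy memory contents
pipe = deque([None] * N)               # requests in flight

for cycle in range(12):
    addr = cycle if cycle < 8 else None   # issue addresses 0..7, then drain
    pipe.append(addr)
    done = pipe.popleft()                 # the request issued N cycles ago
    if done is not None:
        print(f"cycle {cycle}: addr {done} -> {mem[done]}")
# After the fill (cycles 0-3), one result returns every cycle.
```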

  49. Recall: Adding pipeline stages to memory. Before we pipelined: slow! Only read behavior is shown. A7-A0 is an 8-bit read address, and OE enables the tri-state Q outputs. [Figure: A7-A5 feed a 3-bit DEMUX and MUX that select one of eight 256-bit registers (Byte 0-31, Byte 32-63, ..., Byte 224-255), each holding 32 bytes (256 bits); A4-A2 select the 32-bit data output D0-D31, i.e. 4 bytes.] Can we add two pipeline stages?

  50. Recall: Reading an entire row for later use. [Figure: a 13-bit row address input drives a 1-of-8192 decoder across 8192 rows and 16384 columns (134,217,728 usable bits; the tester found good bits in a bigger array); the sense amps deliver 16384 bits, and the requested bits are selected and sent off the chip.] What if we want all of the 16384 bits? In one row access time (55 ns) we can do 22 transfers at 400 MT/s. With a 16-bit chip bus, 22 x 16 = 352 bits << 16384. Now the row access time looks fast! Thus, the push to faster DRAM interfaces.
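
The slide's bus arithmetic as a short sketch (all of the constants here come from the slide itself):

```python
# A sketch of the slide's arithmetic: how much of a 16384-bit DRAM row
# a 16-bit, 400 MT/s interface can move within one row access time.

ROW_BITS   = 16384        # bits delivered by the sense amps
ROW_ACCESS = 55e-9        # row access time, seconds
RATE       = 400e6        # transfers per second
BUS_WIDTH  = 16           # bits per transfer

transfers = int(ROW_ACCESS * RATE)   # 22 transfers fit in 55 ns
moved     = transfers * BUS_WIDTH    # 352 bits
print(f"{transfers} transfers x {BUS_WIDTH} bits = {moved} bits "
      f"of {ROW_BITS} -- hence the push to faster interfaces")
```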
