
LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems

Presentation Transcript


  1. LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems Ani Udipi§ Naveen Muralimanohar* Rajeev Balasubramonian Al Davis Norm Jouppi* University of Utah and *HP Labs §Currently with ARM

  2. Memory Reliability • Datacenters are the backbone of the web-connected infrastructure • Reliability is essential • Memory reliability is a major concern [Schroeder et al., SIGMETRICS ‘09] • among the most error-prone parts of a server • Even a few uncorrectable errors will require DIMM replacement • ranks near the top of component replacements in datacenters • Increases downtime • Increases operational cost [Image source: Nagios]

  3. Some Numbers • A single server blade: 2 billion DRAM cells per chip × 36 DRAM chips per DIMM × 2 DIMMs per channel × 4 channels per processor × 4 processors per blade = ~2.5 × 10^12 DRAM cells • A datacenter: 16 blades per enclosure × 4 enclosures per rack × 10 racks per container × 40 containers per datacenter = ~64 × 10^15 DRAM cells • Assume the MTTF per cell is the age of the universe, ~14 billion years • Blade DRAM MTTF = 2 days • Datacenter DRAM MTTF = 7 seconds
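
As a quick sanity check of the arithmetic above, the sketch below reproduces the MTTF estimates under the slide's simplifying assumption that the aggregate MTTF of N independent cells is the per-cell MTTF divided by N; all numeric constants are taken from the slide.

```python
# Back-of-the-envelope check of the slide's MTTF arithmetic.
# Assumption (from the slide): aggregate MTTF ~= per-cell MTTF / number of cells.

CELL_MTTF_YEARS = 14e9                 # age of the universe
SECONDS_PER_YEAR = 365 * 24 * 3600

# Cells per blade: 2B cells/chip x 36 chips/DIMM x 2 DIMMs/channel
#                  x 4 channels/processor x 4 processors/blade
blade_cells = 2e9 * 36 * 2 * 4 * 4          # ~2.3e12 (rounded to ~2.5e12 on the slide)
# Cells per datacenter: 16 blades x 4 enclosures x 10 racks x 40 containers
dc_cells = blade_cells * 16 * 4 * 10 * 40   # ~6e16 (~64e15 on the slide)

blade_mttf_days = CELL_MTTF_YEARS / blade_cells * 365
dc_mttf_seconds = CELL_MTTF_YEARS / dc_cells * SECONDS_PER_YEAR

print(f"Blade DRAM MTTF      ~ {blade_mttf_days:.1f} days")     # ~2 days
print(f"Datacenter DRAM MTTF ~ {dc_mttf_seconds:.1f} seconds")  # ~7 seconds
```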

  4. Target Reliability • High-end servers commonly have high reliability expectations • Single Symbol Correct, Double Symbol Detect (SSC-DSD) • One symbol == one DRAM chip (“Chipkill”) • Today’s systems employ symbol-based ECC codes

  5. Problems with Existing Solutions • Increased access granularity • Every data access is spread across 36 DRAM chips • JEDEC standards define minimum access granularity per chip • Massive overfetch of data at multiple levels • Wastes energy • Wastes bandwidth • Reduced rank-level parallelism • x4 device width restriction • fewer ranks for given DIMM real estate • Reliability level: 1 failed chip out of 36
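
To make the overfetch concrete, here is a worked example under the assumptions implied by the slides: a DDR3 minimum burst length of 8, a 64-byte cache line, 36 x4 chips accessed in lockstep for the baseline, and a single rank of nine x8 chips for comparison.

```python
# Worked example of per-access granularity (illustrative, DDR3 burst length 8).

BURST_LENGTH = 8          # beats per DDR3 access
CACHE_LINE_BYTES = 64     # what the processor actually needs

# Baseline chipkill: 36 x4 chips activated in lockstep (32 data + 4 ECC)
baseline_bytes = 36 * 4 * BURST_LENGTH // 8    # 144 bytes per access
# Single rank of 9 x8 chips on a standard ECC DIMM (8 data + 1 ECC)
single_rank_bytes = 9 * 8 * BURST_LENGTH // 8  # 72 bytes per access

print(f"Baseline fetches    {baseline_bytes} B to serve a {CACHE_LINE_BYTES} B line")
print(f"9-chip rank fetches {single_rank_bytes} B to serve a {CACHE_LINE_BYTES} B line")
```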

  6. A New Approach: LOT-ECC • Operate on a single rank of x8 memory: 9 chips • and support 1 failed chip out of 9 • Multiple tiers of localized protection • Tier 1: Local Error Detection (checksum) • Tier 2: Global Error Correction (parity) • T3 & T4 to handle specific failure cases • Data mapping handled by memory controller with firmware support • Transparent to OS, caches, etc. • Strictly commodity DRAM used • Significant power and performance benefits

  7. Tier 1 – Local Error Detection (LED) [Figure: per-chip layout across Chips 0–8] • Standard x72 DIMM (Nine x8 parts): Eight data + One ECC • We use all 9 chips for both data and ECC • 64 bits per chip per burst – 57 data + 7 checksum

  8. Tier 1 – Local Error Detection (LED) • 57 bits × 9 chips = 513 bits, but only 1 cache line (512 bits) is read at a time • 57 bits/chip on the first 8 chips; 56 bits on the 9th chip • 1 bit left over on the 9th chip • Used in a different tier of protection • No performance impact on memory reads or writes • LED ops occur in parallel with data ops • Note that LED is local to each chip • Need to pinpoint the exact failed chip, not simply detect an error in the rank
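
A minimal sketch of the Tier-1 layout described above, assuming a 512-bit cache line is striped as 57-bit segments on chips 0–7 and a 56-bit segment (plus the one surplus bit) on chip 8, with a 7-bit checksum completing each chip's 64-bit share of the burst. Helper names are illustrative; the checksum itself is sketched under slide 13.

```python
# Sketch of the Tier-1 data layout across nine x8 chips (names illustrative).

CACHE_LINE_BITS = 512
SEGMENT_BITS = [57] * 8 + [56]          # 8 x 57 + 56 = 512 bits

def split_cache_line(line):
    """line: 512-bit integer; returns the 9 per-chip data segments."""
    assert line < (1 << CACHE_LINE_BITS)
    segments, shift = [], 0
    for width in SEGMENT_BITS:
        segments.append((line >> shift) & ((1 << width) - 1))
        shift += width
    return segments   # segments[i] is stored in chip i, followed by its
                      # 7-bit LED checksum (and the surplus bit on chip 8)
```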

  9. Tier 2 – Global Error Correction (GEC) [Figure: layout of cache lines across Chips 0–8, showing 57-bit data segments, 7-bit LED checksums, and the GEC parity segments PA0-6 … PA49-55, PA56, and PPA] • A, B, C, D, E, F, G, H – cache lines, each comprised of segments X0 through X8 (Data) • LXN – Tier-1 Local Error Detection (LED) for cache line X, segment N • PX0 through PX56 – Tier-2 Global Error Correction (GEC) across segments X0 through X8 • PPX – parity across GEC segments PX0-6 through PX49-55
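
A minimal sketch of the Tier-2 idea, assuming the GEC word is a bitwise parity (XOR) across the nine per-chip data segments of a cache line, with the 56-bit segment treated as zero-padded to 57 bits; once Tier-1 LED names the failed chip, correction is a second XOR (see also backup slide 25).

```python
# Sketch of Tier-2 GEC as a bitwise XOR parity over the nine data segments.
from functools import reduce

def gec_parity(segments):
    """segments: 9 ints, one per chip (56-bit segment zero-padded to 57 bits)."""
    return reduce(lambda a, b: a ^ b, segments, 0)

def reconstruct(segments, parity, failed_chip):
    """Rebuild the segment of the chip flagged by the Tier-1 LED checksum."""
    survivors = (s for i, s in enumerate(segments) if i != failed_chip)
    return parity ^ reduce(lambda a, b: a ^ b, survivors, 0)
```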

  10. The Devil is in the Details.. • ..and the details are in the paper! • Need to detect and correct additional errors in the GEC region • Parity is 57 bits; write granularity is 72 bits • Use the remaining 15 bits wisely; add two more tiers of protection [Figure: per-chip GEC storage – 7 parity bits (PA0-6 … PA49-55, PA56/PPA) plus 1 surplus T4 bit borrowed from the data + LED region on each chip]

  11. Optimizing Write Behavior • Every write has to update its GEC bits • Already borrowing one bit from [data + LED] to use in the GEC • Put them all in the same DRAM row! • Guaranteed row-buffer hit • Data mapping handled by the memory controller (see the sketch below) [Figure: a single DRAM row holding cache lines A–H – each chip stores the lines’ 57-bit data segments and 7-bit LED checksums, with the corresponding GEC words (PX0-6 … PX49-55, PPX) co-located in the same row]
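
One way to picture the co-location constraint is as an address-mapping rule in the memory controller: the GEC words for the cache lines stored in a DRAM row live at the tail of that same row, so a data write and its GEC update hit the same row buffer. The row size and per-line footprints below are illustrative assumptions, not the paper's exact layout.

```python
# Rough sketch of co-locating data and GEC in one DRAM row (illustrative sizes).

ROW_BYTES  = 8192   # assumed DRAM page size seen by the rank
LINE_BYTES = 72     # 64 B data + LED, one burst across nine x8 chips
GEC_BYTES  = 9      # 72-bit GEC word per cache line

LINES_PER_ROW = ROW_BYTES // (LINE_BYTES + GEC_BYTES)   # data + GEC must fit

def gec_offset_in_row(line_index):
    """Byte offset, within the row, of the GEC word for the line_index-th line."""
    assert 0 <= line_index < LINES_PER_ROW
    gec_region_base = LINES_PER_ROW * LINE_BYTES   # GEC region follows the data
    return gec_region_base + line_index * GEC_BYTES
```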

  12. GEC Coalescing • DDR3 burst of 8 forces 72 bytes per access • GEC per cache line is only 72 bits • With sufficient locality, one GEC write can potentially cover 8 data writes • In reality, each write becomes 1 + δ writes (for 0.125 ≤ δ ≤ 1) • Note that even with δ = 1, the benefits of the row-buffer hit remain • Writes are typically buffered at the memory controller to avoid bus turnaround overheads • The controller can re-order accesses to accommodate coalescing • Results show three cases: Basic design (δ = 1), Simple coalescing (measured δ), and Oracular design (δ = 0.125)
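
A sketch of the coalescing idea at the write queue, assuming (per the slide) that one 72-byte GEC burst covers the GEC words of 8 consecutive cache lines; the grouping function and queue format are illustrative.

```python
# Sketch of GEC coalescing: buffered writes that share a GEC burst are drained
# together so the GEC is written once per group (illustrative grouping).
from collections import defaultdict

LINES_PER_GEC_BURST = 8

def drain_writes(dirty_lines):
    """dirty_lines: cache-line indices buffered at the memory controller."""
    groups = defaultdict(list)
    for line in dirty_lines:
        groups[line // LINES_PER_GEC_BURST].append(line)
    data_writes = len(dirty_lines)
    gec_writes = len(groups)               # one GEC write per touched group
    delta = gec_writes / data_writes       # 0.125 <= delta <= 1
    return data_writes, gec_writes, delta

# Example: 8 writes to one aligned group -> delta = 0.125 (oracular best case)
print(drain_writes(list(range(8))))        # (8, 1, 0.125)
```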

  13. Constructing the LED Code • Use a 7-bit ECC code to detect errors in 57 data bits • We choose a 7-bit 1’s complement checksum • Paper details code operation and computes FIT • single-bit, double-bit, row, column, row-column, pin, chip, multiple random, combinations • Very small rate of undetected errors • Caused by very specific, uncommon bit-flip combinations • Less than 5E-5 FIT! • Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in >2 chips/rank, etc.)
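
A minimal sketch of a 7-bit one's-complement checksum over a 57-bit segment, assuming the segment is zero-padded to a multiple of 7 bits; the paper's exact formulation and the FIT analysis are in the paper, so this is only illustrative.

```python
# Sketch of a 7-bit one's-complement checksum over a 57-bit data segment.

WORD_BITS = 7
MASK = (1 << WORD_BITS) - 1    # 0x7F

def led_checksum(segment, nbits=57):
    """segment: integer holding the data bits; returns a 7-bit checksum."""
    total = 0
    for shift in range(0, nbits, WORD_BITS):
        total += (segment >> shift) & MASK
        total = (total & MASK) + (total >> WORD_BITS)   # end-around carry
    return (~total) & MASK                              # one's complement

def led_detects_change(segment, stored_checksum, nbits=57):
    """Tier-1 detection: recompute and compare against the stored checksum."""
    return led_checksum(segment, nbits) != stored_checksum
```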

  14. Checksum Design • Not all error combinations actually occur in DRAM • Small number of failure modes with specific root causes • Code’s effectiveness under those failures is important • Current symbol-based codes guarantee capturing 100% of SSC-DSD errors • At huge power and performance penalties • Likely overkill • Not scalable as error rates increase • Use strong yet practical codes + RAS features • Example: Proactive patrol scrubbing will capture a majority of soft errors; may not coincide with hard errors

  15. Evaluation Methodology • Performance analysis: In-house DRAM simulator • Models refresh, address/command bus, data bus, contention for banks/ranks/channels, read/write queues • Power analysis: Micron power calculator spreadsheet • Reflects timing parameters assumed for performance simulations • Bus utilization and bank utilization numbers obtained from performance simulations • Accounts for activation power, read/write power, termination power, and background power • Includes low-power sleep modes

  16. Evaluation Platforms • Xeon 7500-like system • 8 DDR3 channels, 2 DIMMs/channel • Dual-ranked x4 or Quad-ranked x8 DIMMs • “Lockstep mode” is the only supported mode • Two ranks operate together to provide a 144-bit bus • Wasted bandwidth by masking out half the burst, OR • Forced prefetching • Also evaluate Xeon 5500-like systems • 3 DDR3 channels, 3 DIMMs/channel • “Lockstep mode” wastes one channel entirely, gangs the other two • Evaluate five design points each • Baseline symbol-based SSC-DSD • Virtualized ECC (Yoon & Erez, ASPLOS ’10) • LOT-ECC with no coalescing, simple coalescing, and oracular coalescing

  17. Power Results – Xeon 7500-like system [Chart] Memory power reduced by 43% relative to the baseline

  18. Power Results – Xeon 5500-like system [Chart] Memory power reduced by 32% relative to the baseline

  19. Performance Results – Xeon 7500-like system [Chart] Memory latency reduction: LOT-ECC 4.6%, + GEC Coalescing 7.7%, Oracular 16.2%

  20. Performance Results – Xeon 5500-like system [Chart] Memory latency reduction: LOT-ECC 42.9%, + GEC Coalescing 46.9%, Oracular 57.3%

  21. Storage Overhead • For each 64-byte cache line • 63 bits of LED checksum • 57 bits of GEC parity • 7 bits of T3 code • 9 bits of T4 code • Total storage overhead of 26.5% • Current ECC implementations and DIMMs already accept 12.5% through the extra chip • Additional 14% held in data memory via firmware • Memory capacity is cheap as long as it stays commodity • Better to spend it on this than on power/performance
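
As a quick tally of the numbers above (512 data bits per 64-byte cache line; the 12.5% figure is the extra ECC chip on a standard x72 DIMM):

```python
# Tally of LOT-ECC storage overhead per 64-byte (512-bit) cache line.

DATA_BITS = 64 * 8
overhead_bits = 63 + 57 + 7 + 9            # LED + GEC + T3 + T4 = 136 bits

total_overhead = overhead_bits / DATA_BITS     # 26.56% (~26.5% on the slide)
in_ecc_chip    = 64 / DATA_BITS                # 12.5% already on the DIMM
in_data_memory = total_overhead - in_ecc_chip  # ~14%, carved out by firmware

print(f"total {total_overhead:.1%}, ECC chip {in_ecc_chip:.1%}, "
      f"data memory {in_data_memory:.1%}")
```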

  22. Key Contributions • Multi-tiered protection design to keep fault tolerance contained to fewer chips • Unique data layout tailored to the access mechanism of commodity DRAM systems • Exploit row-buffer efficiency • co-locate data and all tiers of fault-tolerance codes • Mitigates overheads of additional writes typical in parity-based systems • Coalescing optimization to further minimize impact of parity writes

  23. Key Benefits • Power Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes (43% memory power savings) • Performance Gains: More rank-level parallelism, reduced access granularity (7.7% memory latency reduction) • Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently • Flexibility: Works with a single rank of x4 DRAMs or more efficient x8 DRAMs • Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS

  24. BACKUP SLIDES

  25. Tier 2 – Global Error Correction (GEC) • GEC is a parity computed across the cache-line segments stored in each chip • LED has already pinpointed the erroneous segment • Error correction is trivial • Storing the parity • A portion of memory is set aside to hold GEC • Handled by memory controller + firmware • No impact on reads unless an error is detected • GEC is also self-contained (single cache line) • No read-before-write
