
ERD/ERA Status Report



Presentation Transcript


  1. ERD/ERA Status Report Ralph K. Cavin, III March 18, 2009 Brussels

  2. Outline • Is there a Carnot-like theorem for computation? • e.g., a limit on the rate of information throughput per unit of power consumed? • The MIND architecture benchmarking activity for novel devices • Memory Architectures • Inference Architectures

  3. Idealized Study of MIPS/Watt Limits • Chose a simple one-bit, four-instruction processor • All transistors operate at ~kT switching energy • Interconnects dissipate energy at ~kT per gate length • Average transistor fan-out is three
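
For scale, here is a short Python sketch of the thermal-energy budget these assumptions imply. Room temperature (T = 300 K) is my assumption; the 314-device count comes from the next slide:

```python
# Thermal energy scale for near-kT switching (a sketch; T = 300 K assumed).
k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0           # room temperature, K

kT = k_B * T                      # ~4.14e-21 J per switching event
devices = 314                     # transistor count of the Minimal Turing Machine
logic_energy = devices * kT       # logic-only lower bound, one switch per device
print(f"kT at {T:.0f} K: {kT:.3e} J")
print(f"~kT switching, all {devices} devices: {logic_energy:.3e} J per cycle")
```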

  4. Minimal Turing Machine (block diagram). Blocks: ALU (inputs X, Y; output Z), Memory, Program Counter, 2-bit Counter, and a 2-4 decoder, with instruction inputs I1, I2 and control/select signals S1-S6, C0-C1. Red numbers in the figure give per-block transistor counts. CPU total: 314 devices.

  5. Turing Machine Implementation: generic floorplan and energetics (figure). n = 314 devices; Joyner tiling with minimum pitch a_min = 1.5 nm; a von Neumann threshold is marked. The figure gives the operational energy of the Minimal Turing Machine per full CPU operation.

  6. Minimal Turing Machine: a summary • Devices: 314 • Device density: 5.6×10^12 cm^-2 • Energy per cycle: ~4×10^-18 J • Time per cycle: ~2 ps • Power: ~2 µW • Power density: ~30 kW/cm^2 • BITS = density × freq. = 10^14 bit/s • MIPS: 2×10^5
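
A quick consistency check of these figures (a sketch; it takes the ~2 ps cycle time and 314-device count above as given, and reads "density × freq." as device count times clock frequency):

```python
# Consistency check of the Minimal Turing Machine summary numbers (a sketch).
devices = 314
density = 5.6e12          # devices per cm^2
t_cycle = 2e-12           # seconds per cycle (~2 ps)
power = 2e-6              # watts (~2 uW)

freq = 1.0 / t_cycle                  # ~5e11 cycles/s
bits_per_s = devices * freq           # ~1.6e14 bit/s, matching the ~1e14 figure
area = devices / density              # ~5.6e-11 cm^2 occupied by the CPU
power_density = power / area          # ~3.6e4 W/cm^2, i.e. ~30 kW/cm^2
energy_per_cycle = power * t_cycle    # ~4e-18 J per full CPU operation
print(f"throughput: {bits_per_s:.2e} bit/s")
print(f"power density: {power_density:.2e} W/cm^2")
print(f"energy/cycle: {energy_per_cycle:.2e} J")
```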

  7. Computing Power: MIPS (m) vs. BIT (b) (chart). Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores. The chart plots instructions per second against bit throughput for microprocessors; annotations include 10^6 W/cm^2, 10^19 bit/s, 10^8 MIPS, and the human brain at ~30 W.

  8. Summary • The Minimal Turing Machine lies on a different performance trajectory from conventional computers • Its trajectory has the slope needed to meet brain performance • More detailed physics-based analysis is needed • System thermodynamics of computation • A Carnot equivalent for the computational engine? • Lessons from biological computation? • Candidates for beyond-CMOS nanoelectronics should be evaluated in the context of system scaling • e.g., a spintronic Minimal Turing Machine?

  9. Post-CMOS Switch Assessment using Architectural Criteria NRI Focus Centers Kerry Bernstein, IBM, February 2009 Update

  10. Two Perspectives on post-CMOS Devices • Short Term – Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration • Long Term – Switches that replace CMOS for general-purpose, high-performance compute applications

  11. Four Premises • CMOS is not going away anytime soon. Charge (the state variable) and the MOSFET (the fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement in 10-20 years. • Hardware accelerators execute selected functions faster than software performing them on the CPU. Accelerators are responsible for substantial improvements in throughput. • Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit it. Certain physical behaviors may emulate selected HPC instruction sequences, and some operations may be superior to digital solutions. • New switches may improve high-utility accelerators. The shorter-term supplemental solution (5-15 years) improves or replaces accelerators "built in CMOS and designed for CMOS", either on-chip, on-planar, or on a 3D stack.

  12. Hierarchical Benchmarking

  13. Deliverables for NRI Researchers • Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1. • Derive quantitative values and qualitative entries for the architecture benchmarks shown in Benchmarks 2a and 2b. • Identify specific logic operations performed elegantly by your switch, where physical device behavior complements the desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in the Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed in software on the CPU.
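
For the last deliverable, a minimal sketch of the bookkeeping involved; every number below is an illustrative placeholder, not measured data:

```python
# Equivalent-performance comparison for a hardware-accelerated function
# (a sketch; all values are hypothetical placeholders).
def ops_per_watt(ops_per_s, watts):
    return ops_per_s / watts

sw_cpu   = ops_per_watt(ops_per_s=2e8, watts=20.0)  # function run in CPU software
hw_accel = ops_per_watt(ops_per_s=1e9, watts=0.5)   # same function in the new switch
print(f"software on CPU : {sw_cpu:.2e} ops/W")
print(f"new-switch accel: {hw_accel:.2e} ops/W  ({hw_accel/sw_cpu:.0f}x)")
```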

  14. Benchmark 1: Device Metrics. Defined by the ITRS ERD Working Group; captures fundamental device properties.

  15. Benchmark 2: Architectural Metrics 2a. Quantitative 2b. Qualitative

  16. Hierarchy of Limits for Communicating Various Computational State Variables Azad Naeemi, Georgia Tech. Figure: Delay versus Length for Various Transport Mechanisms. New state variables will impact communication and fan-out.
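
To make the delay-versus-length comparison concrete, here is a small illustrative model of my own (not Naeemi's analysis): a ballistic/wave-like mechanism with delay linear in length versus a diffusive state variable, such as spin diffusion, with delay ~ L^2/2D. The velocity and diffusion coefficient are assumed, round-number values:

```python
# Illustrative delay-vs-length scaling for two transport mechanisms (a sketch).
v_ballistic = 1e5      # m/s, assumed carrier velocity for a ballistic mechanism
D_diffusive = 1e-3     # m^2/s, assumed diffusion coefficient

for L_nm in (10, 100, 1000, 10000):
    L = L_nm * 1e-9
    t_ballistic = L / v_ballistic           # linear in length
    t_diffusive = L**2 / (2 * D_diffusive)  # quadratic in length
    print(f"L = {L_nm:>5} nm: ballistic {t_ballistic:.2e} s, "
          f"diffusive {t_diffusive:.2e} s")
```

The quadratic term dominates at interconnect-relevant lengths, which is why new state variables are expected to impact communication and fan-out.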

  17. Benchmark 3: Accelerator Equivalent Performance. Goal: extend performance after scaling by (a) moving more function into accelerators and (b) finding new switches that match function effectively. Metrics: equivalent IPC, MIPS/Watt, and Ops/Joule of the switch in the application (a matrix of switches such as Quantum, BTBT-FET, MQCA against accelerators such as H.264 compression, ..., crypto). Compare apples-to-apples, independent of any particular strength.

  18. Matching Logic Functions & New Switch Behaviors • New switch ideas: Single Spin, Spin Domain, Tunnel-FETs, NEMS, MQCA, Molecular, Bio-inspired, CMOL, Excitonics • Popular accelerators: Encrypt/Decrypt, Compress/Decompress, Regular Expression Scan, Discrete Cosine Transform, Bit-Serial Operations, H.264 Standard Filtering, DSP, A/D, D/A, Viterbi Algorithms, Image, Graphics • Example: cryptography hardware acceleration. Operations required: rotate, byte alignment, XORs, multiply, table lookup. Circuits used in the accelerator: transmission gates ("T-gates"). New switch opportunity: a number of new switches (e.g., T-FETs) don't have thermionic barriers, so they won't suffer from the CMOS pass-gate VT drop, body effect, or source-follower delay. Potential opportunity: replace 4 T-gate MOSFETs with 1 low-power switch.
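
As a reference point for the operation mix the crypto example names, here is a minimal Python sketch of some of those primitives (rotate, byte-wise table lookup, XOR) arranged AES-style; the lookup table is a stand-in, not the real AES S-box:

```python
# The primitive operations behind crypto acceleration (a sketch, AES-style).
SBOX = [(17 * i + 1) % 256 for i in range(256)]  # stand-in table, not AES's S-box

def rotate_word(w: int, n: int) -> int:
    """Rotate a 32-bit word left by n bytes (byte alignment + rotate)."""
    n_bits = 8 * n
    return ((w << n_bits) | (w >> (32 - n_bits))) & 0xFFFFFFFF

def sub_word(w: int) -> int:
    """Table lookup on each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def mix_step(w: int, key: int) -> int:
    """One toy round step: lookup, rotate, XOR with a round key."""
    return rotate_word(sub_word(w), 1) ^ key

print(hex(mix_step(0x3243F6A8, 0xDEADBEEF)))
```

Each of these steps maps onto wide, shallow logic, which is why pass-gate-heavy circuits, and any switch that improves on them, pay off here.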

  19. Hardware Accelerator Example - AES Crypto (PETE value: 2.8E-4; Bernstein, 1/25/09) • Example of an HPC hardware accelerator's contribution to power, area, instruction retirement rate, and energy-efficiency improvement. • The Purdue Emerging Technology Evaluator (PETE) metric is a composite of power/energy, delay, and area. • IPC and Ops/nJ provide an apples-to-apples comparison of new switches.
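
A hedged sketch of how such a composite metric could be formed: the slide says only that PETE combines power/energy, delay, and area, so the product of CMOS-normalized ratios below is an assumption on my part, and the numbers are placeholders, not the 2.8E-4 datum:

```python
# A PETE-like composite figure of merit (a sketch; the exact PETE formula is
# not given on the slide, so this product of normalized ratios is an assumption).
def composite_metric(energy, delay, area, energy_ref, delay_ref, area_ref):
    """Lower is better: each term is the candidate switch normalized to CMOS."""
    return (energy / energy_ref) * (delay / delay_ref) * (area / area_ref)

# Placeholder values for a hypothetical candidate vs. a CMOS reference.
score = composite_metric(energy=0.2e-15, delay=5e-12, area=1e-14,
                         energy_ref=1e-15, delay_ref=10e-12, area_ref=4e-14)
print(f"composite (relative to CMOS = 1.0): {score:.3f}")
```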

  20. ERD and Memory Architectures Paul Franzon Department of Electrical and Computer Engineering, North Carolina State University paulf@ncsu.edu 919.515.7351

  21. DARPA Exascale Architecture Study • Goal: • Determine research needs for a ~2015 1,000-petaflop computer and smaller equivalents • Major conclusions: • Major challenge #1: Power efficiency • Communication • Overhead in computation • Major challenge #2: Resiliency • Completing computation in the presence of permanent and transient faults • Major challenge #3: Performance scaling • Performance scaling is limited by software, communications bisection bandwidth, and memory speed

  22. Improving Power Efficiency • Critical needs: • Reduced-power SRAM replacements • 45 nm L1 cache: 3.6 pJ/bit • Note: re-architecting in 3D can save ~50% • What is the potential for an ERD to reduce this to 0.3 pJ/bit? • Note: would require low-swing bit lines while retaining speed and a low SET rate • Reduced-power switched interconnect • Esp. packet-routed interconnect (NoC) • What is the potential for a memory-style ERD to be used as fast switchable interconnect? • Flash devices can do this for static reconfiguration, BUT faster-switching devices will be needed for dynamic reconfiguration
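
To see what the 3.6 pJ/bit versus 0.3 pJ/bit target means at the system level, a small sketch; the access rate and line width are assumptions chosen for illustration:

```python
# L1-cache energy at 3.6 pJ/bit vs. a 0.3 pJ/bit ERD target (a sketch).
accesses_per_s = 1e9        # assumed L1 accesses per second per core
bits_per_access = 512       # assumed cache-line width in bits
for label, e_bit in (("45 nm SRAM", 3.6e-12), ("ERD target", 0.3e-12)):
    watts = accesses_per_s * bits_per_access * e_bit
    print(f"{label}: {watts:.2f} W per core for L1 traffic")
```

Under these assumptions the L1 energy budget drops from roughly 1.8 W to 0.15 W per core, which is why the 0.3 pJ/bit question matters at exascale.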

  23. Resiliency • Blue Gene system reliability: • Most of the DRAM failures are due to DIMM socket failures, not device failures • Critical need: Sub-system level checkpointing and roll-back

  24. Resiliency • ERD requirement: • Tightly embedded Flash-like state-"capture" memory for checkpointing • Requirements: • Tightly embedded, e.g., shadow registers, with minimal process change • Slow read/write OK • ~10 M writes as the minimum extrinsic reliability (endurance) requirement
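
A minimal software model of the shadow-register checkpoint/rollback pattern the slide asks for (a sketch; a plain dict stands in for the non-volatile capture cells):

```python
# Shadow-register checkpointing with rollback (a software sketch; a dict
# stands in for the tightly embedded non-volatile capture cells).
class CheckpointedRegisters:
    def __init__(self, names):
        self.regs = {n: 0 for n in names}   # working registers
        self.shadow = dict(self.regs)       # NV shadow copy (slow write is OK)

    def checkpoint(self):
        self.shadow = dict(self.regs)       # capture state into the "NV" cells

    def rollback(self):
        self.regs = dict(self.shadow)       # restore the last good state

r = CheckpointedRegisters(["pc", "acc"])
r.regs["pc"], r.regs["acc"] = 42, 7
r.checkpoint()
r.regs["acc"] = 999                         # a fault corrupts state...
r.rollback()                                # ...and we roll back
print(r.regs)                               # {'pc': 42, 'acc': 7}
```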

  25. ERD for Computation 1. Metrics for cache replacement

  26. ERD for Computation 2. Metrics for programmed routability

  27. ERD for Computation 3. Metrics for Local Check-pointing memory

  28. Conclusions • In future computing, both general-purpose and application-specific, the bottleneck is not in logic operations but in memory, communications, and reliability • Opportunities arise for memory-style devices to solve these bottlenecks: • Low-power SRAM replacement • Ultra-low-swing, routable interconnect replacement • Local non-volatile memory as an aid to resiliency

  29. Late News - 1 • The memory wall for multi-core • In general-purpose multi-core processors, the tradeoff for L1-L3 between memory bandwidth and memory size is dramatic: • At constant bandwidth, two cores may require as much as 8x the memory of one core • At 2x bandwidth, two cores require only about 2x the memory of a single-core system • Kerry Bernstein, "New Dimensions in Performance," Feb. 2009

  30. Late News - 2 • Workshop: "Technology Maturity for Adaptive, Massively Parallel Computing," March 2009, Portland, Oregon: http://www.technologydashboard.com/adaptivecomputing/ • General theme: inference architectures and technology • Karlheinz Meier, U. Heidelberg, "VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes" • Progress in architectures is being made, but many technology challenges remain (complexity) • Can Emerging Research Devices accelerate the realization of inference architectures?

  31. ERA Looking Ahead • Continue work on ERD architectural benchmarking • Work with the NRI MIND benchmarking effort • Develop a section on memory architectures for Emerging Research Memories • Look at the role of ERD/ERM in novel architectures where unique properties can provide substantial leverage, e.g., inference architectures
