Intel

Intel Labs
Microprocessor and DSP Technologies for the Nanoscale Era
Seminar 2
Ram Kumar Krishnamurthy
Microprocessor Research Labs, Intel Corporation, Hillsboro, OR
ram.krishnamurthy@intel.com
July 11, 2005

Outline
• General Technology and Circuit Challenges Beyond 65nm:
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in wireless and embedded DSP industry
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • HW accelerators for specialized DSP applications

Technology Scaling 101
[Figure: a 1 x 1 feature scaled to 0.7 x 0.7 per generation, giving 0.49x the area]
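
The 0.7x rule compounds from one node to the next. The following is a minimal sketch, assuming ideal scaling in which every linear dimension shrinks by exactly 0.7x per generation; real node names are rounded, so the match to 250/180/130/90/65nm is only approximate. The function name `scale_generations` is hypothetical.

```python
def scale_generations(start_nm, generations, factor=0.7):
    """Return (feature_nm, relative_area) for each generation under ideal scaling."""
    rows = []
    for g in range(generations + 1):
        linear = factor ** g                       # relative linear dimension
        rows.append((start_nm * linear, linear ** 2))  # area scales as the square
    return rows


for feature_nm, area in scale_generations(250, 4):
    print(f"{feature_nm:6.1f} nm  relative area {area:.3f}")
```

After one generation the relative area is 0.7 x 0.7 = 0.49, which is the number shown on the slide.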

Leakage vs. Switching Power
[Figure: active power and active leakage power in watts (0-250) across the 250nm, 180nm, 130nm, 90nm and 65nm technology nodes]
• From a µP perspective, but true for DSPs too
• Ioff increase 3-5X per generation
• Active leakage power > 50% of total power
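
To see why a 3-5X increase per generation matters, it helps to compound it across the nodes on the chart. This is a rough sketch, assuming the quoted rate holds uniformly for every step from 250nm to 65nm, which the slide does not claim; `ioff_growth` is a hypothetical helper.

```python
NODES_NM = [250, 180, 130, 90, 65]  # technology nodes on the chart's x-axis


def ioff_growth(per_generation, generations):
    """Cumulative Ioff multiplier after a number of generations."""
    return per_generation ** generations


steps = len(NODES_NM) - 1            # four generation steps
low = ioff_growth(3, steps)          # lower bound of the quoted range
high = ioff_growth(5, steps)         # upper bound of the quoted range
print(f"Ioff grows between {low}x and {high}x over {steps} generations")
```

Under that assumption the cumulative increase is between 81x and 625x over four generations.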

On-chip Interconnect Performance
[Figure: delay in ns (log scale, 0.001-1.0) vs. technology node (250-50 nm) for interconnect delay and typical gate delay; annotations: 30% per generation]
• RC/mm increases 40-60% per generation
• Local inter-gate wires dominate critical-path delays
• Global wire lengths not scaling by 0.7x

Process Variation Tolerance
[Figure: normalized frequency (0.9-1.4) vs. normalized leakage (0-20), with the fast corner marked, and number of dies (0-200) vs. normalized IOFF, for 180nm CMOS at 110°C; annotations: 30% frequency spread, 20X leakage spread]
• Significant variation in IOFF (hence Fmax spread)
• Worsening with process scaling
• Excess leakage dies: lack in robustness
• Low leakage dies: over-designed for robustness
Process parameter variation tolerant design techniques

DSP Application Demands
[Figure: capability (< 50 MIPS, > 100 MIPS, > 200 MIPS) vs. time (2001, 2003, 2005), moving from voice to data and applications; generations shown: 2G (GSM, PDC, IS-95), 2.5G (GPRS, EDGE, IS-95B), 3G (WCDMA); implementation labels: ASIC, DSP, hardware assist]
• Smart cell-phones: $2B in ’02 → $15B in ’06!
• Huge demand for high-performance DSPs

Multimedia, Graphics, Enterprise…
[Figure: capability vs. time (2001, 2003, 2005), with the labels below]
  • 200+ MIPS, 64+ MB Flash, 16+ MB RAM: speech recognition, multimedia, large files and applications
  • > 100 MIPS, 16+ MB Flash, 8+ MB RAM
  • < 50 MIPS, 4+ MB Flash, 0.5+ MB RAM
  • Multimedia, Graphics: color screen, audio, graphics, secure remote access
  • Enterprise: email, full OS and user interface, browser, suite of apps
  • OS, Services and Apps: full OS, GUI, browser, suite of apps; simple user interface, calendar, notepad
• Market is hungry for DSP MIPS (if you deliver, they will use it!)

Typical Performance Requirements
[Figure: MHz per task (log axis: 10, 100, 1000) against total required memory (< 4MB, 8MB, 16MB, 32 MB, 64MB; Pocket PC marked)]
• 200 - 300 MHz: MPEG 4 Playback
• 150 - 200 MHz: MP3 encode
• 80 - 100 MHz: MPEG 4 Playback
• 40 - 80 MHz: MP3 Playback
• 25 - 50 MHz: Robust handwriting recognition
• 8 - 12 MHz: Voice 128-bit encryption and decryption
• Graphical Browser - small screen
• ASCII Browser

So, How Do We Meet This Surging Demand Within Given Power Envelope?
[Figure: energy efficiency in MOPS/mW (log axis, 0.1-1000) vs. flexibility (coverage)]
  • Dedicated HW, ASIC
  • Configurable Processor/Logic (Berkeley’s Pleiades: 10-80 MOPS/mW)
  • Digital Signal Processors or other ASIPs: 1-2 MOPS/mW
  • Embedded Processors (SA110: 0.4 MOPS/mW)
• Energy vs. Flexibility Trade-off
Courtesy: Prof. J. Rabaey, UC Berkeley
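
MOPS/mW converts directly into energy per operation: 1 MOPS/mW is 10^6 operations per second per 10^-3 W, i.e. 10^9 operations per joule, or 1 nJ per operation. The sketch below applies that unit conversion to the figures quoted on the slide; the function and dictionary names are illustrative only.

```python
def energy_per_op_nj(mops_per_mw):
    """Energy per operation in nanojoules for a given MOPS/mW figure."""
    if mops_per_mw <= 0:
        raise ValueError("efficiency must be positive")
    return 1.0 / mops_per_mw  # 1 MOPS/mW == 1 nJ per operation


# Efficiency figures quoted on the slide, in MOPS/mW.
SLIDE_POINTS = {
    "SA110 embedded processor": 0.4,
    "DSP/ASIP (low end)": 1.0,
    "DSP/ASIP (high end)": 2.0,
    "Pleiades (low end)": 10.0,
    "Pleiades (high end)": 80.0,
}

for name, efficiency in SLIDE_POINTS.items():
    print(f"{name:26s} {energy_per_op_nj(efficiency) * 1000:8.1f} pJ/op")
```

On these numbers the embedded processor spends about 2.5 nJ per operation, while the configurable fabric spends between 100 pJ and 12.5 pJ.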

Energy and Area Efficiency Courtesy: Prof. Teresa Meng, Stanford

MOPS/mW Distinction: General-purpose vs. Dedicated
• DSP functions are more throughput-oriented
• Amenable for parallelism and pipelining (better power-performance optimization)
Courtesy: Prof. B. Brodersen, UC Berkeley, ISSCC’02

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log axis, 1-1000) vs. flexibility; regions: specialized hardware, programmable DSP, embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
Prof. L. Clark, CICC 2002 [2]

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log axis, 1-1000) vs. flexibility; regions: specialized hardware, programmable DSP, embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
• Microprocessors add specialized HW and coprocessors with DSP functionality

Example Case Study
IBM 32b PowerPC Processor, Nowka et al, ISSCC 2002 [3]
• 153-380MHz, 53-500mW in 180nm CMOS, 1.0-1.8V
• 5.84M transistors, 36 mm²
• Dedicated DES-Encryption and Speech processing accelerators
[Figure: die photo; highlighted: encryption and speech processing specialized HW]

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log axis, 1-1000) vs. flexibility; regions: specialized hardware, programmable DSP, embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
• DSPs add microcontroller functionality and specialized HW accelerators

Example Case Study
TI VLIW DSP with 1MB L2 cache, Agarwala et al, ISSCC 2002 [4]
• 600MHz, 4.8GOPS, 718mW in 130nm CMOS, 1.2V
• Dedicated Viterbi and Turbo decoding co-processor HW
• 64M transistors
• Integrated DMA controller, PCI, 1MB L2, 16K I$ & D$
[Figure: die photo; highlighted: Viterbi and Turbo co-processors]

Specialized Hardware Accelerators
• Specialized (fixed function) hardware is 10-100x more efficient than general purpose processors: Why?
  • Trades hardware for power
  • Allows very low clock rates
• Essential for some wireless functions
  • Viterbi and Turbo decoding, speech recognition, encryption
• Allows custom algorithms and coefficients to limit power
  • Use shifts instead of multiplies
• Cost is flexibility
  • Fixed algorithms and coefficients
  • As new applications and wireless standards emerge, is this enough?
  • How does this cover the application space?
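
The "shifts instead of multiplies" point can be illustrated in software. When a coefficient is fixed at design time, the multiplication reduces to a handful of shifted copies of the input that are summed. This is a minimal sketch, assuming non-negative integer coefficients and a plain binary decomposition; it is not taken from any of the cited designs, and the function names are hypothetical.

```python
def shift_add_terms(coeff):
    """Bit positions that are set in a non-negative integer coefficient."""
    if coeff < 0:
        raise ValueError("this sketch handles non-negative coefficients only")
    return [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]


def multiply_by_constant(x, coeff):
    """Compute x * coeff using only shifts and additions."""
    return sum(x << shift for shift in shift_add_terms(coeff))


# A coefficient of 10 (binary 1010) needs two shifted copies: (x << 1) + (x << 3).
print(shift_add_terms(10), multiply_by_constant(13, 10))
```

In hardware the shifts are just wiring, so a fixed coefficient costs one adder per extra set bit instead of a full multiplier. The price is the one the slide names: the coefficient can no longer change.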

Reconfigurable Processors
FPGA – Fine Grain Reconfigurable Fabric
• Fine-grain gate-level functions
• Array of MUXes to implement any N-input boolean function
• Speed sacrificed for generality
Coarse Grain Reconfigurable Fabric
• Moderate grain function blocks
• Collections of Add, Mpy, Mux, …
• Interconnect overhead is moderate to low
• If functions and connectivity are known, can be highly optimized
Courtesy: Prof. F. Kurdahi, UCI
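
The fine-grain case can be modelled as a lookup table: 2^N configuration bits hold the truth table, and a tree of 2:1 multiplexers driven by the inputs selects one of them. The sketch below assumes that `inputs[0]` is the least significant bit of the table index; the names `mux2` and `lut_eval` are illustrative and do not correspond to any vendor's primitives.

```python
def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def lut_eval(truth_table, inputs):
    """Evaluate an N-input boolean function stored as 2**N configuration bits."""
    level = list(truth_table)
    if len(level) != 2 ** len(inputs):
        raise ValueError("truth table must have 2**N entries")
    for sel in inputs:  # each input halves the candidates via one mux layer
        level = [mux2(sel, level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


XOR2 = [0, 1, 1, 0]  # configuration bits for a 2-input XOR
print([lut_eval(XOR2, [a, b]) for b in (0, 1) for a in (0, 1)])
```

Loading a different bit pattern into the same structure gives a different function, which is the generality the slide refers to. Every evaluation still passes through N mux layers, which is where the speed penalty comes from.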

Generic Reconfigurable Architecture
[Figure: array of fine/coarse grain datapath tiles and registers, with configuration control]

How Do Reconfigurable Processors Work?
• Execute one algorithm/protocol at any given time
• Each algorithm is ‘configured’ from the building blocks
• Time between subsequent configurations: ~1-10ms
• Configuration Control unit decides which algorithm to execute when
[Figure: timeline in three steps; Protocol 1, then Protocol 2, then Protocol 3 occupy the fabric in successive ~1-10 ms slots]
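
The time-multiplexing described above can be sketched as a small model of the configuration control unit. The round-robin policy, the fixed 5 ms slot, and the building-block names are all assumptions made for illustration; the slides only say that one protocol runs at a time and that configurations change every ~1-10 ms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    name: str
    blocks: tuple  # building blocks this algorithm is configured from (hypothetical)


class ConfigurationControl:
    """Decides which configuration occupies the fabric at a given time."""

    def __init__(self, configurations, slot_ms):
        if not configurations or slot_ms <= 0:
            raise ValueError("need at least one configuration and a positive slot")
        self.configurations = list(configurations)
        self.slot_ms = slot_ms

    def active_at(self, t_ms):
        """Return the single configuration active at time t_ms (round-robin)."""
        slot = int(t_ms // self.slot_ms)
        return self.configurations[slot % len(self.configurations)]


control = ConfigurationControl(
    [
        Configuration("Protocol 1", ("add", "mpy")),
        Configuration("Protocol 2", ("add", "mux")),
        Configuration("Protocol 3", ("mpy", "mux")),
    ],
    slot_ms=5,  # assumed slot length inside the ~1-10 ms range
)
print([control.active_at(t).name for t in (0, 5, 12, 15)])
```

A real controller would choose the next configuration from protocol demand, not from a fixed rotation. The model only shows that exactly one algorithm owns the fabric in each slot.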

Standby Leakage Reduction: Sleep Transistor design
• Motivation: Cut off power supply in sleep-mode
• Insert “sleep” transistor between main supply and functional unit’s supply rails
• Latches tied to main supply rails: retain state
[Figure: functional unit between Virtual Vcc and Virtual Vss rails, each connected through a sleep transistor]
Standby leakage benefit for 5% delay penalty

Switching + Leakage Reduction: Forward Body Bias
[Figure: body-bias connections; labels: Vbp, Vdd + Ve, Vbn, - Ve]
• 20% power reduction at 1GHz
• 8% higher frequency at iso-power
• 20X lower idle-mode leakage
A. Keshavarzi et al, 2002 Symp. VLSI Circuits [6]

Multi-Vcc Usage Model
[Figure: four core supplies, VCCcore1 to VCCcore4, with VRM1 to VRM4]
• Optimize performance and power with parallelism and voltage

Switching + Leakage Reduction: Multi Supply Design
• Active leakage benefit with lower supply voltage
• Exponential subthreshold and gate leakage reduction
[Figure: normalized leakage (0-100) vs. voltage (0-1.5 V); 130nm L1 cache leakage]
[Figure: measured leakage in 1.2V, 130nm process; normalized leakage energy (0-12) vs. VCC (0-1.4 V), showing subthreshold and gate leakage at the worst-case (w.c.) and nominal corners; annotation: 79%]
R. Krishnamurthy et al, 2002 Symp. VLSI Circuits [7]
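
The switching half of the multi-supply benefit follows from the standard CMOS relation P = a·C·V²·f, which is a textbook formula and is not stated on the slide. The sketch below estimates the relative switching power when a fraction of the switched capacitance moves to a lower supply. It deliberately ignores leakage, level-converter overhead and any frequency change. The 0.9 V low supply is a hypothetical value chosen for the example.

```python
def relative_switching_power(low_fraction, v_high, v_low):
    """Switching power relative to running everything at v_high.

    low_fraction is the share of switched capacitance moved to v_low.
    """
    if not 0.0 <= low_fraction <= 1.0:
        raise ValueError("low_fraction must be between 0 and 1")
    if not 0.0 < v_low <= v_high:
        raise ValueError("need 0 < v_low <= v_high")
    # Switching power scales with V squared at fixed capacitance and frequency.
    return (1.0 - low_fraction) + low_fraction * (v_low / v_high) ** 2


for fraction in (0.0, 0.5, 1.0):
    ratio = relative_switching_power(fraction, v_high=1.2, v_low=0.9)
    print(f"{fraction:.0%} of capacitance on the low supply -> {ratio:.3f}x switching power")
```

Because the dependence is quadratic, dropping the supply by 25% cuts the switching power of the affected logic by almost 44%. The leakage savings the slide describes come on top of that.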

Adaptive Vcc: Variation-tolerant Circuits
• Motive: change Vcc adaptively to reduce impact of parameter variations
• Large Fmax vs. leakage spread (worsening with scaling)
• Lower Vdd on leakage-limited circuits (subject to stability limits)
• Higher Vdd on speed-limited circuits (subject to reliability limits)
[Figure: normalized frequency (0.9-1.4) vs. normalized leakage (0-20); annotations: 30% frequency spread, 20X leakage spread]
[Figure: die count (0-100%) per frequency bin (0.85-1.05) for fixed Vdd (1.05V) and adaptive Vdd (20mV resolution)]
[Figure: die count (0-100%) per frequency bin (0.85-1.05) for adaptive Vdd + body bias and adaptive Vdd + WID body bias]
J. Tschanz et al, 2002 Symp. VLSI Circuits

Viterbi Decoder Organization
[Figure: Encoded Bits → Branch Metric Unit (BMU) → Branch Error → Path Metric Unit (PMU) → Transitions → Traceback Unit (TBU) → Decoded Bits]
• BMU calculates errors for all branches
• PMU accumulates errors and outputs transitions with minimum error
• TBU traces minimum error path back to get best estimate of original input
One of the most performance and power critical algorithms in wireless baseband DSP
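
The three units map onto three steps of a software decoder. The sketch below is a small hard-decision Viterbi decoder for an assumed rate-1/2, constraint-length-3 convolutional code with generators 7 and 5 (octal), which has only 4 states. It is an illustration of the BMU/PMU/TBU split, not the 64-state design discussed in this seminar, and all names in it are hypothetical.

```python
K = 3                      # constraint length (assumed example code)
G = (0b111, 0b101)         # generator polynomials 7 and 5 (octal)
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def encode_step(state, bit):
    """Return (next_state, output_symbol) for one input bit."""
    reg = (bit << (K - 1)) | state           # newest bit sits in the MSB
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, symbols = 0, []
    for bit in bits:
        state, symbol = encode_step(state, bit)
        symbols.append(symbol)
    return symbols


def viterbi_decode(received):
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)   # encoder starts in state 0
    decisions = []                           # per step: (previous_state, bit) per state
    for symbol in received:
        new_metrics = [inf] * N_STATES
        step = [None] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == inf:
                continue
            for bit in (0, 1):
                nxt, expected = encode_step(state, bit)
                # BMU: Hamming distance between received and expected symbol
                branch = sum(r != e for r, e in zip(symbol, expected))
                # PMU: add-compare-select keeps the minimum-error transition
                candidate = metrics[state] + branch
                if candidate < new_metrics[nxt]:
                    new_metrics[nxt] = candidate
                    step[nxt] = (state, bit)
        metrics = new_metrics
        decisions.append(step)
    # TBU: trace the minimum-error path back from the best final state
    state = min(range(N_STATES), key=metrics.__getitem__)
    bits = []
    for step in reversed(decisions):
        state, bit = step[state]
        bits.append(bit)
    return bits[::-1]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0]
assert viterbi_decode(encode(message)) == message
```

The inner add-compare-select loop runs for every state on every received symbol. That is why the path metric unit dominates both the speed and the power of a hardware decoder.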

90nm CMOS Implementation
• 90nm dual-Vt 7-metal CMOS technology
• 64-state radix-2 design: 40mW at 500Mbps, 1.2V
[Figure: layout; labels: PM memory, BMU, ACS, path memory, TB memory, TB control, traceback; dimensions shown: 230 µm x 210 µm and 260 µm x 510 µm]
M. Anders et al, 2004 VLSI Circuits Symp. [10]

Summary
• Fastest reported 64-state Viterbi accelerator
  • Total power at 2 GHz (500Mbps) is 40mW (1.2V)
• Lowest power 802.11a implementation
  • Total power at 216 MHz (54Mbps) is 5mW (0.7V)
M. Anders et al, 2004 VLSI Circuits Symp. [10]
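
The two operating points can be compared with simple arithmetic on the quoted figures: clock cycles per decoded bit and energy per decoded bit. This is only a sketch using the numbers on the slide; the helper names are hypothetical, and nothing beyond the quoted clock, data rate and power is assumed.

```python
def cycles_per_bit(clock_hz, rate_bps):
    """Clock cycles spent per decoded bit."""
    return clock_hz / rate_bps


def energy_per_bit_pj(power_w, rate_bps):
    """Energy per decoded bit in picojoules."""
    return power_w / rate_bps * 1e12


# (clock in Hz, data rate in bit/s, total power in W) as quoted on the slide
OPERATING_POINTS = {
    "1.2V": (2e9, 500e6, 40e-3),
    "0.7V": (216e6, 54e6, 5e-3),
}

for label, (clock, rate, power) in OPERATING_POINTS.items():
    print(
        f"{label}: {cycles_per_bit(clock, rate):.1f} cycles/bit, "
        f"{energy_per_bit_pj(power, rate):.1f} pJ/bit"
    )
```

Both points work out to 4 clock cycles per bit. The energy per bit is about 80 pJ at 1.2V and about 92.6 pJ at 0.7V.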

Streaming Media Accelerators: 32-bit MAC [ISSCC’03]
• 5GHz 32-bit multiply-accumulate unit
• Targeted for special purpose streaming processors/graphics
S. Vangal et al, ISSCC’03 [11]

32-bit MAC Architecture Overview
[Figure: block diagram; labels: FIFO A, FIFO B, FIFO C (32 bits wide, 32 vectors deep), MAC, FIFO control, scan registers, scan in, scan out]
• Single-cycle 5GHz 32-bit MAC loop
• New Multiplier and Accumulator ALU circuit techniques

TCP/IP Off-load Accelerator [ISSCC’03]
• 10GHz TCP/IP offload accelerator unit
• Targeted for 10Gbps Ethernet packet processing accel.
Y. Hoskote et al, ISSCC’03 [12]

10GHz TCP/IP Execution Core
[Figure: execution core block diagram; labels: ROB, Key, TCB, CLB, ALU, index, input, working register, Rcv buffer, scratch registers, PC, IR, decode, Instr ROM, pipelined ALU, ALU output, next address, branch address, start address; bus widths shown: 6, 9, 32, 96, 112, 264]
• At-speed packet processing execution core for 10Gbps
• 10GHz sparse-tree ALU

Conclusions
• Several Technology and Circuit Challenges Beyond 65nm
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in DSP industry
  • Specialized hardware accelerators and co-processors
  • Reconfigurable engines
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • DSP HW accelerators for Viterbi, Streaming media, TCP/IP

References
[1] R. Krishnamurthy et al., “High-performance and low-power challenges for sub-70nm microprocessor circuits”, IEEE Custom Integrated Circuits Conference 2002, pp. 125-128.
[2] L. Clark et al., “Trends and challenges for wireless embedded DSPs”, IEEE Custom Integrated Circuits Conference 2003, pp. 171-176.
[3] K. Nowka et al., “A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor”, ISSCC 2002, pp. 340-341.
[4] S. Agarwala et al., “A 600 MHz VLIW DSP”, ISSCC 2002, pp. 56-57.
Reconfigurable processors:
• http://brass.cs.berkeley.edu/
• http://www.eng.uci.edu/comp.arch/
• http://www.pactcorp.com/xneu/px_xpphw.html
[5] J. Tschanz et al., “Design optimizations of a high performance microprocessor using combinations of dual-Vt allocation and transistor sizing”, Symposium on VLSI Circuits 2002, pp. 218-219.
[6] A. Keshavarzi et al., “Forward body bias for microprocessors in 130nm technology generation and beyond”, Symposium on VLSI Circuits 2002, pp. 312-315.
[7] R. Krishnamurthy et al., “Dual supply voltage clocking for 5 GHz 130 nm integer execution core”, Symposium on VLSI Circuits 2002, pp. 128-129.
[8] S. Mathew et al., “A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core”, Symposium on VLSI Circuits 2002, pp. 126-127.
[9] S. Mathew et al., “A 4GHz 300mW 64b integer execution ALU with dual supply voltages in 90nm CMOS”, ISSCC 2004, pp. 162-163.
[10] M. Anders et al., “A 64-state 2GHz 500Mbps 40mW Viterbi accelerator in 90nm CMOS”, Symposium on VLSI Circuits 2004, pp. 174-175.
[11] S. Vangal et al., “A 5 GHz floating point multiply-accumulator in 90 nm dual Vt CMOS”, ISSCC 2003, pp. 334-335.
[12] Y. Hoskote et al., “A 10GHz TCP/IP offload accelerator for 10Gb/s Ethernet in 90nm CMOS”, ISSCC 2003, pp. 258-259.
