Optimizing Power @ Design Time Interconnect and Clocks

Optimizing Power @ Design TimeInterconnect and Clocks

Chapter Outline • Trends and bounds • An OSI approach to interconnect optimization • Physical layer • Data link and MAC • Network • Application • Clock distribution

ITRS Projections [Source: ITRS Roadmap, 2004, 2005]

Increasing Impact of Interconnect • Interconnect is now exceeding transistors in • Latency • Power dissipation • Manufacturing complexity • Direct consequence of scaling

Communication Dominant Part of Power Budget CLB Control I/O Drivers I/O Interconnect 5% 15% 10% 9% Execution 15% Units 21% Clock 65% 40% Clocks 20% Caches FPGA mProcessor Clock I/O Logic Memory Signal processor

Idealized Wire Scaling Model

Distribution of Wire Lengths on Chip © IEEE 1998 [Ref: J. Davis, C&S’98]

Technology Innovations Reduce resistivity (e.g. Copper) Reduce dielectric permittivity (e.g. Aerogels or air) © IEEE 1998 Novel interconnect media (carbon nanotubes, optical) Reduce wirelengths through 3D-integration (Pictures courtesy of IBM and IFC FCRP)

-6 10 J -9 10 J -12 10 J -15 10 J -18 10 J 3 Pt ~ 1/S p Logic Scaling 0 10 -3 10 -6 10 Power [W], P -9 10 -12 10 -15 10 -12 -9 -6 -3 0 10 10 10 10 10 Delay [s], tp [Ref: J. Davis, Proc’01]

-2 2 L ~ S t -18 -15 -12 -9 -6 -3 10 10 10 10 10 10 Interconnect Scaling 10 -5 10 10 -4 8 10 10 6 -3 10 10 L-2t = 10-5 [s/cm-2] (F = 0.1µ) -7 10 4 -2 10 10 (1µ) -9 10 (Length)[cm], L (Length)-2 [cm-2], L-2 -11 (10µ) 10 2 -1 10 10 (100µ) -13 10 (1000µ) -0 0 10 10 -2 10 10 2 10 -4 10 Delay [s], t [Ref: J. Davis, Proc’01]

Lower Bounds on Interconnect Energy Shannon’s theorem on maximum capacity of communication channel C: capacity in bits/sec B: bandwidth Ps:average signal power Claude Shannon Valid for an “infinitely long” bit transition (C/B→0)Equals 4.10-21J/bit at room temperature [Ref: J. Davis, Proc’01]

Reducing Interconnect Power/Energy • Same philosophy as with logic: reduce capacitance, voltage (or voltage swing) and/or activity • A major difference: sending a bit(s) from one point to another is fundamentally a communications /networking problem, and it helps to consider it as such. • Abstraction layers are different: • For computation: device, gate, logic, micro-architecture • For communication: wire, link, network, transport • Helps to organize along abstraction layers, well understood in the networking world: the OSI protocol stack

OSI Protocol Stack Presentation/Application • Reference model for wired and wireless protocol design — Also useful guide for conception and optimization of on-chip communication • Layered approach allows for orthogonalization of concerns and decomposition of constraints • No requirement to implement all layers of the stack • Layered structure must not necessarily be maintained in final implementation Session Transport Network Data Link Physical [Ref: M. Sgroi, DAC’01]

The Physical Layer Presentation/Application Transmit bits over physical interconnect medium (wire) • Physical medium • Material choice, repeater insertion • Signal waveform • Discrete levels, pulses, modulated sinusoids • Voltages • Reduced swing • Timing, synchronization Session Transport Network Data Link Physical So far, on-chip communication almost uniquely “level-based”

Repeater Insertion Optimal receiver insertion results in wire delay linear with L with RdCd and rwcw intrinsic delays of inverter and wire, respectively But: At major energy cost!

Repeater Insertion ─ Example • 1 cm Cu wire in 90 nm technology (on intermediate layers) • rw = 250 W/mm; cw = 200 fF/mm • tp = 0.69rwcwL2 = 3.45 nsec • Optimal driver insertion: • tpopt= 0.5 nsec • Requires insertion of 13 repeaters • Energy per transition 8 times larger than just charging the wire (6 pJverus 0.75 pJ)! • It pays to back off!

Wire Energy-Delay Trade-off (dMin, eMax) 1 0.9 L = 1cm (Cu) 90 nm CMOS 0.8 0.7 Repeater overhead 0.6 eNorm 0.5 0.4 wire energy only 0.3 0.2 0.1 1 2 3 4 5 6 7 8 dNorm

Multi-dimensional Optimization 1.2 1.1 • Design parameters: Voltage, number of stages, buffer sizes • Voltage scaling has largest impact, followed by selection of number of repeaters • Transistor sizing secondary. 1 0.9 VDD(V) 0.8 0.7 0.6 0.5 12 10 8 Number of stages 6 4 2 0 1 2 3 4 5 6 7 8 dNorm

Reduced Swing • Ebit = CVDDVswing • Concerns: • Overhead (area, delay) • Robustness (supply noise, crosstalk, process variations) • Repeaters? Transmitter (TX) Receiver (RX)

Traditional Level Converter • Requires two discrete voltage levels • Asynchronous level conversion adds extra delay [Ref: H. Zhang, TVLSI’00]

Avoiding Extra References VTC Transient [Ref: H. Zhang, VLSI’00]

Differential (Clocked) Signaling • Allows for very low swings (200 mV) • Robust • Quadratic energy savings • But: doubling the wiring, extra clock signal, complexity [Ref: T. Burd, UCB’01]

Lower Bound on Signal Swing? • Reduction of signal swing translates into higher power dissipation in receiver – trade-off between wire and receiver energy dissipation • Reduced SNR impacts reliability – current on-chip interconnect strategies require Bit Error Rate (BER) of zero (in contrast to communication and network links) • Noise source: power supply noise, crosstalk • Swings as low as 200 mV have been reported [Ref: Burd’00], 100 mV definitely possible • Further reduction requires crosstalk suppression shielding folding GND GND GND

Quasi-Adiabatic Charging • Uses stepwise approximation of adiabatic (dis)charging • Capacitors acting as “charge reservoir” • Energy drawn from supply reduced by factor N V VDD VDD/ N t CT1 CT2 CTN-1 [Ref: L. Svensson, ISLPED’96]

Charge Redistribution Schemes VDD E B1 B1 RX1 B1 = 1 3VDD/4 1 P B1 B1 VDD/2 E B0 0 RX0 B0 P VDD/4 B0 = 0 B0 B0 E GND Precharge Eval Precharge • Charge recycled from top to bottom • Precharge phase equalizes differential lines • Energy/bit = 2C(VDD/N)2 • Challenges: Receiver design, noise margins [Ref: H. Yamauchi, JSSC’95]

Alternative Communication Schemes • Example: Capacitively-driven wires • Offers some compelling advantages • Reduced swing • Swing is VDD/(n+1) without extra supply • Reduced load • Allows for smaller driver • Reduced delay • Capacitor pre-emphasizes edges Pitchfork capacitors exploit sidewall capacitance [Ref: D. Hopkins, ISSCC’07]

Signaling Protocols Network Globally Asynchronous self-timed handshaking protocol din dout reqin reqout ackin ackout Din ProcessorModule (mProc, ALU, MPY, SRAM…) REQ in Allows individual modulesto dynamically trade-off performance for energy-efficiency done

Signaling Protocols Network GloballyAsynchronous din dout reqin reqout ackin ackout Physical Layer Interface Module Din REQ in din dout clk done Clk ProcessorModule (mProc, ALU, MPY, SRAM…) done Locally synchronous

The Data Link /Media Access Layer Reliable transmission over physical link and sharing interconnect medium between multiple sources and destinations (MAC) • Bundling, serialization, packetizing • Error detection and correction • Coding • Multiple-access schemes Presentation/Application Session Transport Network Data Link Physical

Coding Encoder Decoder N N + k N TX Link RX • Adding redundancy to communication link (extra bits) to: • Reduce transitions (activity encoding) • Reduce energy/bit (error-correcting coding)

Activity Reduction Through Coding Example: Bus-Invert Coding N + 1 Encoder Decoder N N Denc D D Invert bit p • Data word D inverted if Hamming distance from previous is larger than N/2. [Ref: M. Stan, TVLSI’95]

Bus-Invert Coding Reg Denc D D p Decode P L Bus Encode • Gain: • 25 % (at best – for random data) • Overhead: • Extra wire (and activity) • Encoder, decoder • Not effective for correlated data [Ref: M. Stan, TVLSI’95]

Other Transition Coding Schemes • Advanced bus-invert coding (e.g. partition bus into sub-components) (e.g. [M.Stan, TVLSI’97]) • Coding for address busses ( which often display sequentiality) (e.g. [L. Benini, DATE’98]) • Full-fledged channel coding, borrowed from communication links (e.g. [S. Ramprasad, TVLSI’99]) Coding to reduce impact of Miller capacitance between neighboring wires [Ref: Sotiriadis, ASPDAC’01] Maximum capacitance transition – can be avoided by coding

Error-Correcting Codes N + k Encoder Decoder N N Denc D D e.g. Example: (4,3,1) Hamming Code P1P2B3P4B5B6B7 B3 wrong • Adding redundancy allows for more aggressive scaling of signal swings and/or timing • Simpler codes such as Hamming prove most effective with P1 + B3 + B5 + B7 = 0 1 P2 + B3 + B6 + B7 = 0 1 = 3 P4 + B5 + B5 + B7 = 0 0

Media Access • Sharing of physical media over multiple data streams increases capacitance and activity (see Chapter 5), but reduces area • Many multi-access schemes known from communications • Time domain:Time-Division Multiple Access (TDMA) • Frequency domain: narrow band, code division multiplexing • Buses based on Arbitration-based TDMA most common in today’s ICs

Arbitration Command Bus Protocols and Energy • Some Lessons from the Communications world: • When utilization is low, simple schemes are more effective • When traffic is intense, reservation of resources minimizes overhead and latency (collisions, resends) • Combining the two leads to energy efficiency • Example : SiliconBackplaneMicroNetwork Current Slot • Independent arbitration for every cycle includes two phases: • Distributed TDMA for guaranteed latency/bandwidth • Round robin for random access [Courtesy: Sonics, Inc]

The Network Layer Presentation/Application Topology-independent end-to-end communication over multiple data links (routing, bridging, repeaters) • Topology • Static versus dynamic configuration / routing Session Transport Network Data Link Physical Becoming more important in today’s complex multi-processor designs “The Network-on-a-Chip (NOC)” [Ref: G. De Micheli, Morgan-Kaufman’06]

Network-on-a-Chip (NoC) • Dedicated networks with reserved links preferable for high traffic channels – but: limited connectivity, area overhead • Flexibility an increasing requirement in multi (many) –core chip implementations or

The Network Trade-off’s Interconnect-oriented architecture trades off flexibility, latency, energy and area-efficiency through the following concepts • Locality - eliminate global structures • Hierarchy - expose locality in communication requirements • Concurrency/Multiplexing Very Similar to Architectural Space Trade-off’s Network-on-a-Chip Dedicated wiring [Courtesy: B. Dally, Stanford]

Networking Topology • Homogeneous • Crossbar, Butterfly, Torus, Mesh,Tree, … • Heterogeneous • Hierarchy Crossbar Tree Mesh (FPGA)

Mesh Energy x Delay Binary Tree Manhattan Distance Network Topology Exploration Short connections in tree are redundant Mesh Energy x Delay Binary Tree Inverse clustering complements mesh Mesh + Inverse Manhattan Distance [Ref: V. George, Springer’01]

Circuit-Switched versus Packet Based • On-Chip Reality: Wires (bandwidth) are relatively cheap, buffering and routing expensive • Packet-switched approach versatile • Preferred approach in large networks • But … routers come with large overhead • Case study Intel: 18% of power in link, 82% in router • Circuit-switched approach attractive for high-data rate quasi-static links • Hierarchical combination often preferred choice C C C C C C Bus Bus R R C C C C Bus C C C C C C R R Bus Bus Hierarchical circuit and packet switched networks for longer connections Bus to connect over short distances C C C C

Example: The Pleiades Network-on-a-Chip • Configurable platformfor low-energy communication and signal-processing applications (See Chapter 5) • Allows for dynamic task-level reconfiguration of process network Configuration Bus ArithmeticModule ArithmeticModule ArithmeticModule Configurable Interconnect ConfigurableLogic ConfigurableLogic mP Network Interface DedicatedArithmetic Energy-efficient flexible network essential to the concept Configuration [Ref: H. Zhang, JSSC’00]

Cluster Cluster Pleiades Network Layer Hierarchical reconfigurable mesh network Level-1 Mesh Level-2 Mesh Universal Switchbox Hierarchical Switchbox • Network statically configured at start of session and ripped up at end • Structured approach reduces interconnect energy with factor 7 • over straightforward cross-bar

Top Layers of the OSI Stack Presentation/Application • Abstracts communication architecture to system and performs data formatting and conversion • Establishes and maintains end-to-end communications • flow control, message reordering, packet segmentation and reassembly Session Transport Network Data Link Physical Example: Establish, maintain and rip-up connections in dynamically reconfigurable Systems-on-a-Chip – Important in power-management

What About Clock Distribution? • Clock easily the most energy-consuming signal of a chip • Largest length • Largest fanout • Most activity (a = 1) • Skew control adding major overhead • Intermediate clock repeaters • De-skewing elements • Opportunities • Reduced swing • Alternative clock distribution schemes • Avoiding a global clock altogether

Reduced-Swing Clock Distribution • Similar to reduced-swing interconnect • Relatively easy to implement • But: Extra-delay in flip-flop’s adds directly to clock period © IEEE 1995 VDD NMOS clock PMOS clock GND Regular 2-phase clock VDD PMOS clock NMOS clock GND Example: half-swing clock distribution scheme Half-swing clock [Ref: H. Kojima, JSSC’95]

Alternative Clock Distribution Schemes Example: Transmission-Line Based Clock Distribution Canceling skew in perfect transmission line scenario © IEEE 2006 [Ref: V. Prodanov, CICC’06]

Summary • Interconnect important component of overall power dissipation • Structured approach with exploration at different abstraction layers most effective • Lot to be learned from communications and networking community – yet, techniques must be applied judiciously • Cost relationship between active and passive components different • Some exciting possibilities for the future: 3D-integration, novel interconnect materials, optical or wireless I/O

Books and Book Chapters T. Burd, “Energy-Efficient Processor System Design,” http://bwrc.eecs.berkeley.edu/Publications/2001/THESES/energ_eff_process-sys_des/index.htm, UCB, 2001. G. De Micheli and L. Benini, “Networks on Chips: Technology and Tools,” Morgan-Kaufman, 2006. V. George and J. Rabaey, “Low-energy FPGAs: Architecture and Design”, Springer 2001. J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective,” 2nded, Prentice Hall 2003. C. Svensson, “Low-Power and Low-Voltage Communication for SoC’s,” in C. Piguet, Low-Power Electronics Design, Ch. 14, CRC Press, 2005. L. Svensson, “Adiabatic and Clock-Powered Circuits,” in C. Piguet, Low-Power Electronics Design, Ch. 15, CRC Press, 2005. G. Yeap, “Special Techniques”, in Practical Low Power Digital VLSI Design, Ch 6., Kluwer Academic Publishers, 1998. Articles L. Benini et al, “Address bus encoding techniques for system-level power optimization,” Proceedings DATE’98, pp. 861-867, Paris, February 1998 T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” IEEE ISSCC Digest of Technical Papers, pp. 294-295, Feb. 2000. M. Chang et al, “CMP Network-on-Chop Overlaid with Multi-Band RF Interconnect”, International Symposium on High-Performance Computer Architecture, Febr. 2008. D.M. Chapiro, “Globally Asynchronous Locally Synchronous Systems,” PhD thesis, Stanford University, 1984. References

Optimizing Power @ Design Time Interconnect and Clocks