On-Chip Interconnect Trend and Design Optimization

Chung-Kuan Cheng, UC San Diego, La Jolla, CA

Outline • Global Interconnect Technologies • RC Trees and Transmission Lines • Prefix Adder Synthesis • Modeling • FPGA Interconnect Architecture • Modeling • Interconnect Architecture • Non-Manhattan Wire Arrangement

Interconnect Technologies • Introduction • On-Chip Global Interconnection • Global Wire Modeling • Performance Comparison

Introduction – Performance Impact • Interconnect delay determines system performance [ITRS08] • 542ps for a 1mm minimum-pitch Cu global wire w/o repeaters @ 45nm • ~150ps for 10 levels of FO4 delay @ 45nm • [Ho2001] "The Future of Wires"
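To put the two numbers above on a common scale, the following minimal sketch expresses the wire delay in FO4 units and extrapolates it to other lengths. It assumes that the delay of an unrepeated RC wire grows with the square of its length (the usual distributed-RC behavior); the function names are illustrative and not taken from the slides.

```python
# Figures quoted on the slide (45nm node).
WIRE_DELAY_PS_AT_1MM = 542.0      # unrepeated minimum-pitch Cu global wire, 1 mm
FO4_PS = 150.0 / 10               # ~150 ps for 10 FO4 levels -> ~15 ps per FO4


def unrepeated_delay_ps(length_mm: float) -> float:
    """Extrapolate the 1 mm figure, ASSUMING delay scales with length squared."""
    return WIRE_DELAY_PS_AT_1MM * length_mm ** 2


def delay_in_fo4(delay_ps: float) -> float:
    """Express a delay as a number of FO4 inverter delays."""
    return delay_ps / FO4_PS


if __name__ == "__main__":
    for length in (0.5, 1.0, 2.0):
        d = unrepeated_delay_ps(length)
        print(f"{length} mm: {d:.1f} ps = {delay_in_fo4(d):.1f} FO4")
```

Under these assumptions a 1mm unrepeated wire already costs roughly 36 FO4 delays, which is why repeaters or transmission lines are needed for global wires.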

Introduction – Power Dissipation • Interconnects consume a significant portion of power • 1-2 orders of magnitude larger than gates • Half of the dynamic power is dissipated in repeaters inserted to minimize latency [Zhang07] • Wires consume 50% of total dynamic power in a 0.13um microprocessor [Magen04] • About 1/3 is burned in the global wires.

Introduction – Technology Trend • On-Chip Interconnect Scaling • Dimensions shrink • Wire resistance increases -> RC delay • Capacitive coupling increases -> delay, power, noise, etc. • Performance of global wires decreases w/ technology scaling. [Figures: scaling trend of per-unit-length (PUL) wire resistance and capacitance; copper resistivity versus wire width]

Organization of On-Chip Global Interconnections

Multi-Dimensional Design Consideration • Preliminary analysis results assuming 65nm CMOS process. • Application-oriented choice • Low Latency: T-TL or UT-TL (single-ended T-lines) • High Throughput: R-RC • Low Power: PE-TL or UE-TL (differential T-lines) • Low Noise: PE-TL or UE-TL (differential T-lines) • Low Area/Cost: R-RC • For each architecture, the more area the pentagon covers, the better the overall performance.

On-Chip Global Interconnect Schemes (1) • Repeated RC wires (R-RC) • Repeater size/length of segments • Adopt previous design methodology [Zhang07] • Un-Terminated and Terminated T-Lines (UT-TL and T-TL) • UT-TL structure • Full swing at wire-end • Tapered inverter chain as TX • T-TL structure • Optimize eye-height at wire-end • Non-tapered inverter chain as TX

On-Chip Global Interconnect Schemes (2) • Un-Equalized and Passive-Equalized T-Lines (UE-TL and PE-TL) • Driver side: tapered differential driver • Receiver side: termination resistance, sense amplifier (SA) + inverter chain • Passive equalizer: parallel RC network • Design constraint: enough eye-opening (50mV) needed at the wire-end

Effects of driver impedance and termination resistance on step response • Larger driver impedance leads to a slower rising edge and a lower saturation voltage • Larger termination resistance gives a sharper rising edge but a larger reflection [Figure annotation: optimal Rload]
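The two trends above can be illustrated with first-order transmission-line formulas. This is a minimal sketch that assumes an ideal lossless line of characteristic impedance z0 driven through a resistance rs and terminated by rl; the slides use an extracted R(f)L(f)C model instead, and all names and numeric values here are hypothetical.

```python
def launched_voltage(vdd: float, rs: float, z0: float) -> float:
    """Initial step launched into the line: divider between driver and Z0."""
    return vdd * z0 / (rs + z0)


def reflection_coefficient(rl: float, z0: float) -> float:
    """Reflection coefficient at the termination (0 when rl == z0)."""
    return (rl - z0) / (rl + z0)


def first_arrival_voltage(vdd: float, rs: float, z0: float, rl: float) -> float:
    """Voltage at the wire end when the first wavefront arrives."""
    return launched_voltage(vdd, rs, z0) * (1.0 + reflection_coefficient(rl, z0))


def saturation_voltage(vdd: float, rs: float, rl: float, r_wire: float = 0.0) -> float:
    """DC level at the wire end: divider of rl against driver and wire resistance."""
    return vdd * rl / (rs + r_wire + rl)


if __name__ == "__main__":
    z0 = 50.0  # hypothetical characteristic impedance in ohms
    for rl in (25.0, 50.0, 100.0, 200.0):
        print(rl, reflection_coefficient(rl, z0), first_arrival_voltage(1.0, 10.0, z0, rl))
```

A larger rs lowers both the launched step and the saturation level, while a larger rl raises the first-arrival voltage at the cost of a larger reflection, which is consistent with an intermediate optimum for the load.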

Bit-rate: 50Gbps; Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm
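To see why a parallel RC network acts as an equalizer, the sketch below evaluates a simple resistive divider using the Rd, Cd and RL values listed above. It assumes the parallel Rd||Cd network sits in series in front of the termination RL and ignores the transmission line and the driver resistance Rs entirely, so it only shows the high-pass shaping, not the full channel response. The function names are illustrative.

```python
import math


def equalizer_impedance(freq_hz: float, rd: float, cd: float) -> complex:
    """Impedance of a resistor rd in parallel with a capacitor cd."""
    return rd / (1.0 + 1j * 2.0 * math.pi * freq_hz * rd * cd)


def divider_gain(freq_hz: float, rd: float, cd: float, rl: float) -> float:
    """|V(RL)/Vin| for the equalizer in series with the termination (ASSUMED topology)."""
    z_eq = equalizer_impedance(freq_hz, rd, cd)
    return abs(rl / (rl + z_eq))


def zero_frequency_hz(rd: float, cd: float) -> float:
    """Frequency above which the capacitor starts to bypass rd."""
    return 1.0 / (2.0 * math.pi * rd * cd)


if __name__ == "__main__":
    RD, CD, RL = 350.0, 0.38e-12, 107.69   # values quoted on the slide
    print("zero at", zero_frequency_hz(RD, CD) / 1e9, "GHz")
    for f in (0.0, 1e9, 10e9, 25e9):
        print(f / 1e9, "GHz ->", divider_gain(f, RD, CD, RL))
```

Low-frequency content is attenuated by the divider RL/(RL+Rd), while high-frequency content passes through Cd almost unattenuated, which counteracts the low-pass behavior of a lossy wire.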

Global Wire Modeling – Single-Ended & Differential On-Chip T-lines • Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high. • Top-layer thick wires used -> dimensions are maintained as technology scales. • LC-mode behavior dominant • Determine the bit rate • Smallest wire dimensions that satisfy the eye constraint • Note that PE-TL needs narrower wires -> equalization helps to increase density.

Global Wire Modeling – RC wires and T-lines • RC wire modeling • Distributed Π model composed of wire resistance and capacitance • Closed-form equations [Sim03] to calculate 2D wire capacitance • T-line modeling • 2D-R(f)L(f)C parameter extraction • R(f)L(f)C tabular model -> transient simulation to estimate eye-height. • Synthesized compact circuit model [Kopcsay02] -> study signal integrity issues. [Figures: 2D-C extraction template; 2D-R(f)L(f) extraction template]
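As a small illustration of the distributed Π model for RC wires, the sketch below builds an N-segment Π ladder and computes its Elmore delay at the far end. It assumes an ideal (zero-resistance) driver and no receiver load; the Elmore metric is used here only as a convenient first-order delay estimate and is not stated in the slides. Names and values are hypothetical.

```python
def pi_ladder(r_total: float, c_total: float, n_segments: int):
    """Split a wire into n Pi segments.

    Returns (resistors, node_caps): resistors[k] connects node k to node k+1,
    and node_caps[k] is the grounded capacitance at node k (node 0 = driver).
    """
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    # Each Pi segment puts half of its capacitance at each end, so interior
    # nodes collect two halves and the two end nodes collect one half each.
    node_caps = [c_seg / 2.0] + [c_seg] * (n_segments - 1) + [c_seg / 2.0]
    return [r_seg] * n_segments, node_caps


def elmore_delay(resistors, node_caps) -> float:
    """Elmore delay at the last node: sum of R_k times the capacitance downstream of R_k."""
    delay = 0.0
    for k, r in enumerate(resistors):
        delay += r * sum(node_caps[k + 1:])
    return delay


if __name__ == "__main__":
    R, C = 1000.0, 200e-15   # hypothetical total wire resistance (ohm) and capacitance (F)
    for n in (1, 4, 16):
        print(n, elmore_delay(*pi_ladder(R, C, n)))
```

For this ladder the Elmore delay equals R*C/2 for any number of segments, which matches the distributed-line limit.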

Performance Analysis – Definitions • Normalized delay (unit: ps/mm) • Propagation delay includes wire delay and gate delay. • Normalized energy per bit (unit: pJ/m) • Bit rate is assumed to be the inverse of propagation delay for RC wires • Normalized throughput (unit: Gbps/um)
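The three metrics can be written down as simple helper functions. This is a sketch under stated assumptions: delay and energy are normalized by wire length, throughput is normalized by wire pitch (inferred from the Gbps/um unit), and for RC wires the bit rate is the inverse of the propagation delay as stated above. Function and parameter names are illustrative.

```python
def normalized_delay_ps_per_mm(propagation_delay_ps: float, length_mm: float) -> float:
    """Propagation delay (wire + gate) per unit length."""
    return propagation_delay_ps / length_mm


def normalized_energy_pj_per_m(energy_per_bit_pj: float, length_mm: float) -> float:
    """Energy per bit per unit length; the length is converted from mm to m."""
    return energy_per_bit_pj / (length_mm * 1e-3)


def rc_bit_rate_gbps(propagation_delay_ps: float) -> float:
    """Bit rate of an RC wire, taken as 1 / propagation delay (1/ps = 1000 Gbps)."""
    return 1e3 / propagation_delay_ps


def normalized_throughput_gbps_per_um(bit_rate_gbps: float, pitch_um: float) -> float:
    """Bit rate per unit of routing width (ASSUMED to be the wire pitch)."""
    return bit_rate_gbps / pitch_um


if __name__ == "__main__":
    delay_ps, length_mm = 275.0, 5.0   # hypothetical 5 mm wire
    print(normalized_delay_ps_per_mm(delay_ps, length_mm), "ps/mm")
    print(rc_bit_rate_gbps(delay_ps), "Gbps")
```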

Performance Analysis – Latency • Variables: technology-defined parameters • Supply voltage: Vdd (unit: V) • Dielectric constant [symbol missing] • Min-sized inverter FO4 delay [symbol missing] (unit: ps) • R-RC structure (min-d) • [symbol missing] is roughly constant • FO4 delay scales w/ scaling factor S • T-line structures • Delay is the sum of wire delay and TX delay • Wire delay • TX delay improves w/ FO4 delay • Callouts on the delay expressions: decreasing w/ technology scaling; increasing w/ technology scaling.

Performance Analysis – Energy per Bit • Same variables as defined before • R-RC structure (min-d) • Vdd reduces as technology scales • [symbol missing] reduces as technology scales • Energy decreases w/ technology scaling! • T-line structures • Sum of the power consumed on the wire and in the TX. • Power of T-line • Power of TX circuit • FO4 delay reduces exponentially • Energy decreases w/ larger slope!! [Callout on one term: constant]

Performance Analysis – Throughput • Same variables as defined before • R-RC structure (min-d) • Assuming wire pitch [expression missing] • FO4 delay reduces exponentially • Throughput increases by 20% per generation! • T-line structures • TX bandwidth • Neglect the minor change of wire pitch • K1 = 0 for UT-TL • FO4 delay reduces exponentially • Throughput increases by 43% per generation!!
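The two growth rates quoted above compound across technology generations. The sketch below assumes that the 20% figure applies to R-RC and the 43% figure to T-lines (consistent with the later result that T-line throughput increases more quickly), and that 90nm, 65nm, 45nm, 32nm and 22nm count as consecutive generations. Names are illustrative.

```python
NODES_NM = [90, 65, 45, 32, 22]   # nodes covered by the study


def compound_growth(rate_per_generation: float, generations: int) -> float:
    """Total multiplicative gain after a number of generations."""
    return (1.0 + rate_per_generation) ** generations


if __name__ == "__main__":
    gens = len(NODES_NM) - 1   # four steps from 90nm to 22nm
    print("R-RC   :", compound_growth(0.20, gens))
    print("T-line :", compound_growth(0.43, gens))
```

Over four generations this gives roughly a 2.1x gain at 20% per generation versus roughly 4.2x at 43% per generation.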

Design Framework for On-Chip T-line Schemes • Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure. • Different optimization routines (LP/ILP/SQP, etc) can be adopted according to the problem formulation.

Experimental Settings • Design objective: min-d • Technology nodes: 90nm-22nm • Five different global interconnection structures • Wire length: 5mm • Parameter extraction • 2D field solver CZ2D from EIP tool suite of IBM • Tabular model or synthesized model • Transistor models • Predictive transistor model from [Uemura06] • Synopsys level 3 MOSFET model tuned according to ITRS roadmap • Simulation • HSPICE 2005 • Modeling and Optimization • Linear or non-linear regression/SQP routine • MATLAB 2007

Performance Metric: Normalized Delay – Results and Comparison • Technology trends • R-RC ↑ • T-line schemes ↓ • T-line structures • Outperform R-RC beyond 90nm • Single-ended: lowest delay • At 22nm node • R-RC: 55ps/mm • T-lines: 8ps/mm (85% reduction) • Speed of light: 5ps/mm • Linear model • < 6% average percent error
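The figures above can be sanity-checked with two small helpers: one for the percentage reduction and one for the time-of-flight limit of a wave traveling in a dielectric. This sketch assumes the speed-of-light bound is sqrt(eps_r)/c per unit length; the relative dielectric constant 2.25 used in the example is a hypothetical value chosen because it gives about 5ps/mm, not a number taken from the slides.

```python
import math

C_M_PER_S = 299_792_458.0   # speed of light in vacuum


def percent_reduction(baseline: float, improved: float) -> float:
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def time_of_flight_ps_per_mm(eps_r: float) -> float:
    """Lower bound on delay per mm for a relative dielectric constant eps_r."""
    seconds_per_meter = math.sqrt(eps_r) / C_M_PER_S
    return seconds_per_meter * 1e9   # s/m -> ps/mm (1e12 ps per s, 1e3 mm per m)


if __name__ == "__main__":
    print(percent_reduction(55.0, 8.0))        # R-RC vs. T-line at 22nm
    print(time_of_flight_ps_per_mm(2.25))      # hypothetical eps_r
```

With these numbers the T-line delay of 8ps/mm is within a factor of two of the time-of-flight bound, while R-RC is about an order of magnitude above it.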

Performance Metric: Normalized Energy per Bit – Results and Comparison • Technology trends • R-RC and T-lines ↓ • T-lines reduce more quickly • T-line structures • Outperform R-RC beyond 45nm • Differential: lowest energy. • Single-ended similar to R-RC. • T-TL > UT-TL • At 22nm node • R-RC: 100pJ/m • Single-ended: 60% reduction • Differential: 96% reduction • Linear model • < 12% average percent error • Error for T-TL and PE-TL • RL and passive equalizers.

Performance Metric: Normalized Throughput – Results and Comparison • Technology trends • R-RC and T-lines ↑ • T-lines increase more quickly • T-line structures • Outperform R-RC beyond 32nm • Differential better than single-ended • At 22nm node • R-RC: 12Gbps/um • T-TL: 30% improvement • UE-TL: 75% improvement • PE-TL: ~ 2X of R-RC • Linear model • < 7% average percent error

Signal Integrity – single-ended T-lines • UT-TL structure • 380mV peak noise at 1V supply voltage w/ 7ps rise time • SI could be a big issue as supply voltage drops • T-TL less sensitive to noise • At the same rise time, ~50% reduction of peak noise • Peak noise ↓ as technology scales [Figures: worst-case switching pattern for peak noise simulation; results using the worst-case (w.c.) pattern and using single or multiple PRBS patterns]

Signal Integrity – differential T-lines • More reliable • Termination resistance • Common-mode noise reduction • Peak noise • Within ~10mV range • Eye-heights • UE-TL • Eye reduces as bit rate ↑ • Harder to meet constraint. • PE-TL • > 70mV eye even at 22nm node • Equalization does help! [Figure: worst-case switching pattern for peak noise simulation]

Summary (cont'd) • Four score tables (schemes vs. tech node): Low-Latency Application (ps/mm), Low-Energy Application (pJ/m), High-Throughput Application (Gbps/um), Low-Noise Application • Each item in the tables: score/value. • Score: the higher, the better in terms of the given metric; max. score is 5. • The best structure in each column is marked in red.

Summary of Global Interconnect • Compared five different global interconnection structures in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm. • A simple linear model is provided to link architecture-level performance metrics to technology-defined parameters. • Some observations from experimental results • T-line structures have the potential to replace R-RC at future nodes • Differential T-lines are better than single-ended ones for low-power/high-throughput/low-noise applications • Equalization could be utilized for on-chip global interconnection • Higher throughput density, improved signal integrity • Even w/ lower energy dissipation (passive equalization)

Prefix Adder Synthesis • Motivation • Prefix Adder Formulation • Area/Timing/Power Models • Mixed-Radix (2,3,4) Adders • ILP Formulation • Experimental Results

Motivation: Prefix Adder • Increasing impact of physical design and concern of power. [Figure: design-space diagram relating area, timing and power; labels: logical levels, fanouts, wire tracks, physical placement, detail routing, gate sizing, buffer insertion, signal slope, input arrival time, output required time, gate cap, wire cap, activity probability, dynamic power, static power, power gating]

Prefix Adder Formulation • Input: two n-bit binary numbers a and b, one-bit carry-in • Output: n-bit sum and one-bit carry-out • Prefix addition: carry generation & propagation

Prefix Addition – Formulation Pre-processing: Prefix Computation: Post-processing:
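The equations for the three stages did not survive in this transcript. For orientation, the textbook prefix-addition formulation is sketched below; it is an assumption that the slide used this form. The convention that position 0 holds the carry-in ($g_0 = c_{in}$, $p_0 = 0$) and bits are numbered $1..n$ is also assumed here.

```latex
% Pre-processing (per bit i = 1..n):
g_i = a_i \, b_i, \qquad p_i = a_i \oplus b_i

% Prefix computation (for i \ge j > k):
G_{[i:k]} = G_{[i:j]} + P_{[i:j]} \, G_{[j-1:k]}, \qquad
P_{[i:k]} = P_{[i:j]} \, P_{[j-1:k]}

% Post-processing (carry out of position i is c_i = G_{[i:0]}):
s_i = p_i \oplus G_{[i-1:0]}, \qquad c_{out} = G_{[n:0]}
```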

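Prefix Adder – Prefix Structure Graph • Pre-processing: gp generator (a_i, b_i -> gp_i) • Prefix computation: GP cell (GP[i, j], GP[j-1, k] -> GP[i, k]) • Post-processing: sum generator (labels: G[i:0], p_i, s_i) [Figure: 4-bit prefix structure graph with inputs 1 to 4 and outputs 1, 2:1, 3:1, 4:1]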
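The three stages named above (gp generator, GP cells, sum generator) can be sketched as a small bit-level model. This is a minimal illustration, not the synthesized structure from the slides: it uses a Kogge-Stone arrangement of GP cells as one well-known prefix network, and it assumes position 0 carries the carry-in. All function names are hypothetical.

```python
def gp_cell(hi, lo):
    """Combine GP[i, j] (hi) with GP[j-1, k] (lo) into GP[i, k]."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a: int, b: int, n: int, carry_in: int = 0):
    """Add two n-bit numbers; returns (n-bit sum, carry-out)."""
    # Pre-processing: position 0 holds the carry-in, positions 1..n the bits.
    gp = [(carry_in, 0)]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gp.append((ai & bi, ai ^ bi))
    p = [pi for _, pi in gp]          # keep p_i for the sum generator

    # Prefix computation (Kogge-Stone): afterwards gp[i] == GP[i, 0].
    span = 1
    while span <= n:
        gp = [gp_cell(gp[i], gp[i - span]) if i >= span else gp[i]
              for i in range(n + 1)]
        span *= 2
    carries = [g for g, _ in gp]      # carries[i] = G[i:0]

    # Post-processing: s_i = p_i XOR carry into position i.
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ carries[i - 1]) << (i - 1)
    return total, carries[n]


if __name__ == "__main__":
    print(prefix_add(0b1011, 0b0110, 4))
```

Any other valid prefix network (for example one found by an optimizer) can replace the Kogge-Stone loop as long as every position i ends up holding GP[i, 0].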

Area Model • Distinguish physical placement from logical structure, but keep the bit-slice structure. [Figure: logical view (bit position 8 to 1 vs. logical level) and physical view (bit position 8 to 1 vs. physical level) with compact placement]

Timing Model • Cell delay calculation: cell delay = effort delay + intrinsic delay • Effort delay = logical effort × electrical effort • Electrical effort = Cout/Cin = (fanouts + wirelength) / size • Logical effort and intrinsic delay are intrinsic properties of the cell
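The cell delay model above follows the logical-effort style and can be written as a short function. This is a sketch under the assumption that fanout load and wire length are both expressed in the same normalized capacitance units, so they can be added directly; the parameter names and the example values are hypothetical.

```python
def electrical_effort(fanouts: float, wirelength: float, size: float) -> float:
    """Cout/Cin, with the load made of fanout gates plus wire capacitance."""
    return (fanouts + wirelength) / size


def cell_delay(logical_effort: float, intrinsic_delay: float,
               fanouts: float, wirelength: float, size: float = 1.0) -> float:
    """Cell delay = logical effort * electrical effort + intrinsic delay."""
    effort_delay = logical_effort * electrical_effort(fanouts, wirelength, size)
    return effort_delay + intrinsic_delay


if __name__ == "__main__":
    # Hypothetical GP cell: logical effort 1.5, intrinsic delay 2.0.
    for size in (1.0, 2.0, 4.0):
        print(size, cell_delay(1.5, 2.0, fanouts=2, wirelength=4, size=size))
```

Upsizing a cell reduces its own effort delay but increases the load seen by the cell driving it, which is one reason gate sizing appears as a separate design concern.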

Power Model • Total power consumption: dynamic power + static power • Static power: leakage current of devices, Psta = (per-cell constant) * #cells • Dynamic power: current switching capacitance, Pdyn = ρ Cload • ρ is the switching probability, ρ = j (j is the logical level*) * Vanichayobon S. et al., "Power-speed Trade-off in Parallel Prefix Circuits"

Interval Adjacency Constraint (column id, logic level)

Linearization for Interval Adjacency Constraint • Left interval bound equal to column index [Figure labels: Linearize, Pseudo Linear]

ILP Formulation Overview • Structure variables: GP cells, connections (wires), physical positions • Capacitance variables: gate cap, vertical wire cap, horizontal wire cap • Timing variables: input arrival time, output arrival time • Flow: ILP w/ power objective -> ILOG CPLEX -> optimal solution

Experiments – 16-bit Uniform Timing

Min-Power Radix-2 Adder (delay = 22, power = 45.5 FO4) [Figure: 16-bit prefix structure over bit positions 16 to 1]

Min-Power Radix-2&4 Adder (delay = 18, power = 29.75 FO4) [Figure: 16-bit prefix structure over bit positions 16 to 1; legend: Radix-2 cell, Radix-4 cell]

Min-Power Mixed-Radix Adder (delay = 20, power = 28.0 FO4) [Figure: 16-bit prefix structure over bit positions 16 to 1; legend: Radix-2 cell, Radix-3 cell, Radix-4 cell]

Experiments – 64-bit Hierarchical Structure (Mixed-Radix) • Handle high bit-width applications • 16x4 and 8x8

FPGA Global Routing Architecture • Synthesis Flow • Formulation • Experimental Results

Synthesis Flow

Formulation

FPGA Global Routing Architecture

Energy Model: Wires • 0.18um tech node, grid length = 0.5mm • 4 types of wires: RC wires with spacing and transmission

Energy and Area Model: Switch Box • Switch Area Model • Fs: number of switches connected to each wire entering a switch box • f: total flow entering a switch box • Ns: per-bit number of switches inside a switch box • Energy Model • Pu: energy of a single switch • Ps: per-bit switch energy
