An On-Chip Global Broadcast Network with Dense Equalized TrAnsmIssion Lines (DETAIL) in the 1024-Core Era • Guang Sun1,2, Shih-Hung Weng2, Chung-Kuan Cheng2, • Bill Lin2, Lieguang Zeng1 • 1Tsinghua University, 2University of California, San Diego • 14th IEEE/ACM International Workshop on System-Level Interconnect Prediction (SLIP 2012), June 3, 2012
Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion
Chip Bottlenecks • Transistor and wire scaling is problematic • Transistor scaling has less and less headroom • Wire scaling does not keep pace • Voltage and threshold scaling is problematic • Leakage power becomes dominant • Single cores are no longer getting faster • Power consumption, heat dissipation • The performance game is now constrained by the power game
Computation Trend • single-core → many-core • Market and application rush: multi-core is beneficial • Why multi-core? • Parallelism is power efficient • Single-core frequency is leveling off around 3 GHz • Investment in making single cores more powerful is no longer paying off
Computation Trend • The number of cores will increase over the next decade (ITRS): • from 25 to 300 for stationary computers • from 10 to 500 for networking applications • from 64 to 900 for consumer portable devices
Interconnect Trend • from bus to Network-on-Chip (NoC) • [Figure: shared bus vs. NoC; processors (p) and memories (m) connected through a grid of switches]
Interconnect Trend • Bus → NoC • Drawbacks of traditional bus-based on-chip interconnect architectures: • limited bandwidth • high energy consumption • long global interconnects • poor scalability • poor reliability • NoC: a new design methodology • overcomes the drawbacks of traditional bus-based architectures • meets the requirements of on-chip communication
Summary: Introduction • Chips are becoming more and more complex • Chip bottlenecks • Transistor and wire scaling • Voltage and threshold scaling • Single-core frequency scaling • Power consumption, heat dissipation • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC
Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion
Conventional Electrical Mesh • Scalability challenge • Efficient for small core counts • Issues for thousands of cores: • many routing hops between distant cores • competing messages over common network paths • substantial contention and queueing delays
Conventional Electrical Mesh • Scalability challenge • Programming challenge • Even a simple function like broadcasting common data to all cores is difficult to perform efficiently • A 1-to-N broadcast is implemented as N unicasts, flooding the network: high contention, large latency, low throughput, high power (see the sketch below)
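To make the scalability point concrete, here is a back-of-the-envelope Python model (an illustration, not from the paper) of a 1-to-N broadcast done as N unicasts on a k × k mesh with XY routing: the worst-case distance for 32 × 32 is the 31 + 31 = 62 hops quoted later in the talk, and a single broadcast injects over 30,000 hop-traversals of traffic.

```python
# Illustrative cost model (assumed, not from the paper) for a
# 1-to-N broadcast realized as N unicasts on a k x k mesh.

def hops(src, dst):
    """Hop count under XY dimension-order routing = Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def broadcast_as_unicasts(k, src=(0, 0)):
    """Total and worst-case hop counts when one corner core unicasts to all others."""
    dests = [(x, y) for x in range(k) for y in range(k) if (x, y) != src]
    per_dest = [hops(src, d) for d in dests]
    return sum(per_dest), max(per_dest)

total, worst = broadcast_as_unicasts(32)   # 32 x 32 = 1024 cores
print(total, worst)                        # 31744 total hop-traversals, 62 worst-case
```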
Broadcast is important • Broadcast and multicast traffic covers up to 52.4% of total traffic [T. Krishna '11] • Broadcast and all-to-all communication operations present great challenges in popular coherence and synchronization protocols • A cheap broadcast communication mechanism can make programming easier [G. Kurian '10]: • enables convenient programming models (e.g., shared memory) • reduces the need to carefully manage locality
Summary: Motivation • Conventional mesh for 1024 cores • scalability challenge • programming challenge • Broadcast is important • central to popular coherence and synchronization protocols • covers a large fraction of total traffic (up to 52.4%) • A cheap broadcast communication mechanism can make programming easier
Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion
Candidate solutions • Electrical mesh interconnect • Optical interconnects [Jason Miller '10]: optical broadcast WDM interconnect • Wireless interconnects [Suk-Bok Lee '10] • 3D IC interconnects [Yuan Xie '10] • [Figure: electrical mesh of switches connecting processors (p) and memories (m)]
Transmission line (T-line) solution • Widely studied recently to tackle global communication challenges • Delivers signals at the speed of light via wave propagation (low latency) • Consumes much less power: eliminates full-swing charging and discharging of the wire and gate capacitance
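A rough numeric illustration of the latency claim (Python; the material constants below are generic assumptions, not values from the paper): time-of-flight on a T-line grows linearly with length, while the delay of an unrepeated RC wire grows quadratically, so the gap widens exactly at global scale.

```python
import math

# Generic, assumed constants -- for illustration only.
C0 = 3e8          # speed of light in vacuum, m/s
ER = 3.0          # assumed relative permittivity of the dielectric
R_PER_M = 2e4     # assumed wire resistance, ohm/m
C_PER_M = 2e-10   # assumed wire capacitance, F/m

def tof_ps(length_mm):
    """Time of flight on a T-line: length / (c / sqrt(er)); linear in length."""
    return length_mm * 1e-3 / (C0 / math.sqrt(ER)) * 1e12

def rc_delay_ps(length_mm):
    """Elmore-style delay of an unrepeated RC wire: 0.38*R*C*L^2; quadratic in length."""
    L = length_mm * 1e-3
    return 0.38 * R_PER_M * C_PER_M * L * L * 1e12

for mm in (1, 10, 20):
    print(mm, round(tof_ps(mm), 1), round(rc_delay_ps(mm), 1))
# at 20 mm: ~115 ps time-of-flight vs ~608 ps RC delay under these assumptions
```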
Comparison of different interconnects • Equalized dense T-lines outperform the other interconnects in latency, power, and throughput
Our Contribution • Implement dense (low-pitch) equalized T-line optimization • Develop a high-performance tree-based global broadcast interconnect • Propose a hierarchical architecture for efficient 1024-core all-to-all communication
Equalized On-Chip Global Link • Overall structure • Tapered current-mode logic (CML) drivers • Terminated differential on-chip T-line • Continuous-time linear equalizer (CTLE) receiver • Sense-amplifier based latch
Low Energy-per-Bit Optimization Flow • Inputs: pre-designed CML driver, pre-designed CTLE receiver • Driver-receiver co-design: starting from an initial solution, vary the design variables [ISS, RT, RL, RD, CD, Vod] • Co-design cost function: Veye/Power • Cost-function estimation: step-response-based eye estimation, using a SPICE-generated T-line step response and the receiver step response from a CTLE model • An internal SQP (Sequential Quadratic Programming) routine searches for the best solution • Output: best set of design variables in terms of overall energy-per-bit
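At its core, the flow is a constrained nonlinear program over the six design variables with objective Veye/Power. A minimal sketch using SciPy's SLSQP solver (a sequential quadratic programming method); the eye/power model below is a toy placeholder standing in for the paper's SPICE-based step-response estimator, and the bounds are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Design variables: [ISS, RT, RL, RD, CD, Vod] (units omitted; the cost
# model and bounds are placeholders, not the paper's values).

def eye_and_power(x):
    """Toy stand-in for the step-response-based eye/power estimator."""
    iss, rt, rl, rd, cd, vod = x
    power = iss * vod + vod**2 / (rt + rl)         # toy power model
    veye = vod * rl / (rl + rd) - 0.05 * cd * iss  # toy eye-height model
    return veye, power

def cost(x):
    veye, power = eye_and_power(x)
    return -veye / power            # maximize Veye/Power

x0 = np.array([5e-3, 50.0, 50.0, 100.0, 1e-1, 0.4])
bounds = [(1e-3, 2e-2), (25, 100), (25, 100), (50, 300), (1e-2, 1), (0.1, 0.8)]
res = minimize(cost, x0, method="SLSQP", bounds=bounds,
               constraints=[{"type": "ineq",       # require a minimum eye opening
                             "fun": lambda x: eye_and_power(x)[0] - 0.05}])
print(res.x, -res.fun)              # best design variables and best Veye/Power
```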
Hierarchical Interconnect Architecture for 1024-core • TBNet: global T-line Broadcasting Network, for long-range, low-latency, power-efficient communication (T-line based) • EBNet: Electrical Broadcast Network • Emesh: electrical mesh network, ideal for predictable, local point-to-point communication within the cluster (RC based)
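A minimal sketch (Python; the cluster geometry, names, and exact role split between the three networks are assumptions, not the paper's specification) of the dispatch decision this hierarchy implies: intra-cluster traffic stays on the RC-based Emesh, while global broadcasts ride the T-line TBNet and fan out locally.

```python
# Hypothetical dispatch logic for the hierarchical interconnect:
# cores tiled into 8 x 8 clusters of 4 x 4 cores each (1024 total).

CLUSTER = 4  # assumed cores per cluster edge

def cluster_of(core_xy):
    x, y = core_xy
    return (x // CLUSTER, y // CLUSTER)

def choose_network(src, dst=None, broadcast=False):
    """Pick a network: Emesh for intra-cluster unicast, TBNet (then a
    local electrical broadcast) for global broadcast."""
    if broadcast:
        return "TBNet -> EBNet"        # global T-line tree, then local distribution (assumed)
    if cluster_of(src) == cluster_of(dst):
        return "Emesh"                 # local RC-based point-to-point mesh
    return "Emesh -> TBNet -> Emesh"   # hypothetical inter-cluster unicast path

print(choose_network((0, 0), (3, 2)))          # Emesh
print(choose_network((0, 0), broadcast=True))  # TBNet -> EBNet
```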
Evaluation • Assume a 20 mm × 20 mm chip (16 nm technology) • Each cluster: 2.5 mm × 2.5 mm • T-line pitch: 2.6 µm • Width of one T-line pair: 7.8 µm (including P/G wire) • Two upper metal layers used for T-lines • Width of 1 TBNet-1 + 1 TBNet-2 in each cluster: 72 × 7.8 µm • Each cluster accommodates 4 TBNet-1 + 4 TBNet-2: 2.5 mm / (72 × 7.8 µm) ≈ 4.45
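A quick check of the area arithmetic above, using only the slide's numbers:

```python
# Geometry check using the slide's numbers.
cluster_mm   = 2.5        # cluster edge
pair_um      = 7.8        # width of one T-line pair, incl. P/G wire
bundle_pairs = 72         # pairs in 1 TBNet-1 + 1 TBNet-2

bundle_mm = bundle_pairs * pair_um / 1000.0    # 0.5616 mm per bundle
print(bundle_mm, int(cluster_mm / bundle_mm))  # 0.5616 -> 4 bundles fit per cluster edge
```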
Evaluation results • Throughput: 20 Gbps per T-line • Each TBNet has 64 sources: 20 Gbps × 64 = 1280 Gbps • Whole chip: 1280 Gbps × 4 × 2 = 10.24 Tbps • Energy: 0.08 pJ/bit per segment • Each TBNet has 63 segments: 0.08 pJ/bit × 63 = 5.04 pJ/bit • Latency: 40.8 ps/mm per segment • Broadcast length of 34 segments: 40.8 × 34 ≈ 1.4 ns • < 2 cycles for a 1 GHz core, vs. > 62 cycles for a 32×32 mesh (needs 31 + 31 = 62 hops)
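The throughput, energy, and latency figures reproduced as plain arithmetic (Python; the per-segment delay of 40.8 ps is read off the slide's 40.8 × 34 ≈ 1.4 ns calculation):

```python
# Throughput, energy, and latency arithmetic from the slide.
line_gbps  = 20
sources    = 64
tbnet_gbps = line_gbps * sources            # 1280 Gbps per TBNet
chip_tbps  = tbnet_gbps * (4 + 4) / 1000.0  # 4 TBNet-1 + 4 TBNet-2 -> 10.24 Tbps

segments  = 63
energy_pj = 0.08 * segments                 # 5.04 pJ/bit per full broadcast

path_segments = 34
latency_ns = 40.8 * path_segments / 1000.0  # ~1.39 ns, i.e. < 2 cycles at 1 GHz
print(chip_tbps, energy_pj, latency_ns)     # 10.24 5.04 1.3872
```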
Summary: our design • T-line • Low latency • Power efficient • High throughput • Hierarchical Interconnect Architecture for 1024-core • Dedicated links for each broadcast source • High performance for both global and local communication
Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion
Future work • Verify the impact of different factors: power supply noise, process variability, and clock jitter • This paper provided an evaluation of architecture capacity; a detailed behavioral-level model should be developed as the next step • An architecture-level simulation with real applications will be implemented
Conclusion • Broadcast is very important in future many-core NoC interconnect architectures • Implemented Dense Equalized Transmission-line (DETAIL) optimization • Global broadcast interconnect: low latency, power efficient, high throughput • Proposed a hierarchical architecture for efficient 1024-core all-to-all communication
Thank You! Q & A