
An On-Chip Global Broadcast Network with Dense Equalized TrAnsmIssion Lines (DETAIL) in the 1024-Core Era

Guang Sun 1,2, Shih-Hung Weng 2, Chung-Kuan Cheng 2, Bill Lin 2, Lieguang Zeng 1
1 Tsinghua University, 2 University of California, San Diego





Presentation Transcript


  1. An On-Chip Global Broadcast Network with Dense Equalized TrAnsmIssion Lines (DETAIL) in the 1024-Core Era • Guang Sun1,2, Shih-Hung Weng2, Chung-Kuan Cheng2, • Bill Lin2, Lieguang Zeng1 • 1Tsinghua University, 2University of California, San Diego • 14th IEEE/ACM International Workshop on System-Level Interconnect Prediction (SLIP 2012), June 3, 2012

  2. Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion

  3. Outline • Introduction • Chip bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion

  4. Chip Bottlenecks • Transistor and wire scaling is problematic • Transistor scaling has less and less room • Wire scaling does not keep pace • Voltage and threshold scaling is problematic • Leakage power becomes dominant • Single cores are no longer getting faster • Power consumption, heat dissipation • The performance game is constrained by the power game

  5. Computation Trend • single-core → many-core • Market Application Rush • multi-core is beneficial • Why multi-core? • Parallelism is power efficient (illustrated below) • Single-core frequency is leveling off around 3 GHz • The investment to make single cores more powerful is not paying off
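The power-efficiency argument for parallelism follows from the standard dynamic-power relation; the figures below are a textbook illustration with an assumed 20% voltage reduction, not numbers from the talk:

$$P_{\text{dyn}} = \alpha C V_{dd}^{2} f, \qquad P_{\text{2 cores @ } f/2} \approx 2 \cdot \alpha C \,(0.8\,V_{dd})^{2} \cdot \tfrac{f}{2} = 0.64\, P_{\text{1 core @ } f}$$

Two cores at half the frequency can sustain the same aggregate throughput, and the lower frequency permits a lower supply voltage, cutting dynamic power by roughly a third in this example.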

  6. Computation Trend • single-core → many-core • Market Application Rush • multi-core is beneficial • Why many-core? • Parallelism is power efficient • Single-core frequency is leveling off around 3 GHz • The investment to make single cores more powerful is not paying off

  7. Computation Trend • the number of cores will increase over the next decade (ITRS) • from 25 to 300 for stationary computers • from 10 to 500 for networking applications • from 64 to 900 for consumer portable devices

  8. Interconnect Trend • from bus to Network-on-Chip (NoC) • [Figure: a shared bus connecting processors (p) and memories (m), versus a NoC mesh of switches linking the same cores]

  9. Interconnect Trend • Bus → NoC • Drawbacks of traditional bus-based on-chip interconnect architectures • bandwidth • energy consumption • global interconnect • scalability • reliability • NoC: a new design methodology • overcomes the drawbacks of traditional bus-based architectures • meets the requirements of on-chip communication

  10. Summary: Introduction • Chips are becoming more and more complex • Chip bottlenecks • Transistor and wire scaling • Voltage and threshold scaling • Single-core frequency scaling • Power consumption, heat dissipation • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC

  11. Outline • Introduction • SoC bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion

  12. Conventional Electrical Mesh • Scalability challenge • Efficient for small core counts • Issues for thousands of cores • Many routing hops between distant cores • Competing messages over common network paths • Generates substantial contention and queueing delays

  13. Conventional Electrical Mesh • Scalability challenge • Programming challenge • Even a simple function like broadcasting common data to all cores is difficult to perform efficiently • 1-to-N broadcast implemented as N unicasts floods the network: high contention, large latency, low throughput, high power (see the hop-count sketch below)
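Why N unicasts flood the mesh can be seen from a quick hop count; this small Python sketch (a hypothetical helper, not from the paper) totals the link traversals when one corner core unicasts to every other core on a k x k mesh with XY routing:

```python
def broadcast_hops_unicast(k: int, src=(0, 0)) -> int:
    """Total link traversals when a 1-to-N broadcast is implemented
    as N-1 unicasts on a k x k mesh with XY (dimension-ordered) routing."""
    sx, sy = src
    return sum(abs(x - sx) + abs(y - sy)
               for x in range(k)
               for y in range(k))

# 32x32 mesh (1024 cores): one broadcast from a corner costs 31744
# link traversals, most funneled through the links near the source,
# which is exactly the contention and queueing delay described above.
print(broadcast_hops_unicast(32))  # 31744
```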

  14. Broadcast is important • Broadcast and multicast traffic • covers up to 52.4% of total traffic [T. Krishna '11] • Broadcast and all-to-all communication operations present great challenges • in popular coherence and synchronization protocols • A cheap broadcast communication mechanism can make programming easier [G. Kurian '10] • Enables convenient programming models (e.g., shared memory) • Reduces the need to carefully manage locality

  15. Summary: Motivation • Conventional Mesh for 1024 cores • Scalability challenge • Programming challenge • Broadcast is important • in popular coherence and synchronization protocols • covers up to 52.4% of total traffic • A cheap broadcast communication mechanism can make programming easier

  16. Outline • Introduction • SoC bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion

  17. Candidate solutions • Electrical Mesh Interconnect • Optical Interconnects [Jason Miller '10] • Optical Broadcast WDM Interconnect • [Figure: electrical mesh of switches, processors (p), and memories (m)]

  18. Candidate solutions • Optical Interconnects [Jason Miller '10] • Wireless Interconnects [Suk-Bok Lee '10]

  19. Candidate solutions • Optical Interconnects [Jason Miller '10] • Wireless Interconnects [Suk-Bok Lee '10] • 3D IC Interconnects [Yuan Xie '10]

  20. Transmission line (T-line) solution • widely studied in recent years to tackle global communication challenges • delivers signals at the speed of light in the dielectric (low latency) • wave propagation rather than RC charging • consumes much less power • eliminates full-swing charging and discharging of the wire and gate capacitance
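The latency claim is easy to sanity-check with time-of-flight; the numbers below are illustrative, assuming a silicon-dioxide dielectric with relative permittivity around 3.9 (an assumption, not a figure from the slides):

$$v = \frac{c}{\sqrt{\varepsilon_r}} \approx \frac{3\times 10^{8}\ \text{m/s}}{\sqrt{3.9}} \approx 1.5\times 10^{8}\ \text{m/s}, \qquad t_{\text{flight}}(20\ \text{mm}) \approx \frac{20\ \text{mm}}{1.5\times 10^{8}\ \text{m/s}} \approx 132\ \text{ps}$$

A repeated RC wire of the same length would take several times longer and burn power in every repeater stage.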

  21. Comparison of different interconnects • Equalized dense T-lines outperform other interconnects • in latency, power, and throughput

  22. Our Contribution • Implement dense (low-pitch) equalized T-line optimization • Develop a high-performance tree-based global broadcast interconnect • Propose a hierarchical architecture for efficient 1024-core all-to-all communication

  23. Equalized On-Chip Global Link • Overall structure • Tapered current-mode logic (CML) drivers • Terminated differential on-chip T-line • Continuous-time linear equalizer (CTLE) receiver • Sense-amplifier based latch
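For background, a CTLE is commonly modeled as a one-zero, two-pole filter; this generic form is standard equalizer theory, not the specific receiver design on the slide:

$$H(s) = A_{0}\,\frac{1 + s/\omega_{z}}{\left(1 + s/\omega_{p1}\right)\left(1 + s/\omega_{p2}\right)}, \qquad \omega_{z} < \omega_{p1} < \omega_{p2}$$

Placing the zero below the first pole boosts high frequencies relative to DC, compensating the low-pass roll-off of the lossy T-line and opening the received eye.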

  24. Low Energy-per-Bit Optimization Flow • Inputs: pre-designed CML driver and pre-designed CTLE receiver • Driver-Receiver Co-Design: start from an initial solution, then change variables [ISS, RT, RL, RD, CD, Vod] • Co-design cost function: Veye/Power • Cost-function estimation: step-response-based eye estimation, using a SPICE-generated T-line step response and the receiver step response from CTLE modeling • Internal SQP (Sequential Quadratic Programming) routine generates the best solution • Output: best set of design variables in terms of overall energy-per-bit
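A minimal sketch of how such a co-design loop might be driven with an off-the-shelf SQP solver (SciPy's SLSQP); the cost model eye_over_power is a hypothetical stand-in for the step-response-based eye estimation above, which in the real flow comes from a SPICE-generated T-line step response:

```python
import numpy as np
from scipy.optimize import minimize

# Design variables from the slide: [ISS, RT, RL, RD, CD, Vod]
x0 = np.array([2e-3, 50.0, 200.0, 100.0, 10e-15, 0.2])  # illustrative start

def eye_over_power(x):
    """Hypothetical cost model returning -(Veye / Power), so that
    minimizing it maximizes eye opening per unit power."""
    iss, rt, rl, rd, cd, vod = x
    veye = vod * np.exp(-cd / 20e-15) * rt / (rt + rl)  # toy eye estimate
    power = iss + vod**2 / rd                           # toy power estimate
    return -(veye / power)

bounds = [(1e-4, 1e-2), (25.0, 100.0), (50.0, 500.0),
          (50.0, 500.0), (1e-15, 50e-15), (0.05, 0.5)]
result = minimize(eye_over_power, x0, method="SLSQP", bounds=bounds)
print(result.x)  # best [ISS, RT, RL, RD, CD, Vod] under the toy model
```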

  25. Optimization Result

  26. Hierarchical Interconnect Architecture for 1024-core

  27. Hierarchical Interconnect Architecture for 1024-core • TBNet: global T-line Broadcasting Network

  28. Hierarchical Interconnect Architecture for 1024-core • EBNet: Electrical Broadcast Network • Emesh: Electrical mesh network

  29. Hierarchical Interconnect Architecture for 1024-core • Emesh: ideal for predictable, local point-to-point communication within the cluster (RC-based)

  30. Hierarchical Interconnect Architecture for 1024-core • TBNet: long-range, low-latency, and power-efficient communication (T-line based)

  31. Physical Structure for TBNet

  32. Evaluation • Suppose a 20 mm x 20 mm chip (in 16 nm technology) • Each cluster: 2.5 mm x 2.5 mm • Pitch of T-line: 2.6 um • Width of T-line pair: 7.8 um (including P/G wire) • Two upper metal layers used for T-lines • Width of 1 TBNet-1 + 1 TBNet-2 in each cluster: 72 x 7.8 um • Each cluster accommodates 4 TBNet-1 + 4 TBNet-2: 2.5 mm / (72 x 7.8 um), worked out below
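The claim that four TBNet-1 plus four TBNet-2 bundles fit per cluster is the slide's own arithmetic made explicit:

$$72 \times 7.8\ \mu\text{m} = 561.6\ \mu\text{m}, \qquad \frac{2.5\ \text{mm}}{561.6\ \mu\text{m}} \approx 4.45$$

so four copies of each network fit across one 2.5 mm cluster edge, with margin left over.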

  33. Evaluation results • 20 Gbps per T-line • Each TBNet has 64 sources: 20 Gbps x 64 = 1280 Gbps • Whole chip: 1280 Gbps x 4 x 2 = 10.24 Tbps • 0.08 pJ/bit per segment • Each TBNet has 63 segments: 0.08 pJ/bit x 63 = 5.04 pJ/bit • 40.8 ps/mm per segment • Broadcast length: 34 segments: 40.8 x 34 ≈ 1.4 ns • < 2 cycles (assuming a 1 GHz core), vs. > 62 cycles for a 32x32 mesh (31 + 31 = 62 hops)
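The slide's throughput, energy, and latency figures reproduce directly; a short check in Python (the 40.8 ps value is treated here as per-segment delay, matching the slide's 40.8 x 34 product):

```python
# Reproduce the evaluation numbers stated on the slide.
line_rate_gbps = 20                                # per T-line
tbnet_gbps = line_rate_gbps * 64                   # 64 sources -> 1280 Gbps
chip_tbps = tbnet_gbps * 4 * 2 / 1000              # 4 TBNet-1 + 4 TBNet-2 -> 10.24 Tbps

energy_pj_per_bit = 0.08 * 63                      # 63 segments -> 5.04 pJ/bit

latency_ns = 40.8 * 34 / 1000                      # 34 segments -> ~1.39 ns

print(chip_tbps, energy_pj_per_bit, latency_ns)
# -> 10.24 Tbps, 5.04 pJ/bit, ~1.4 ns: under 2 cycles at 1 GHz,
#    versus the 62 hops a 32x32 mesh needs corner to corner.
```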

  34. Summary: our design • T-line • Low latency • Power efficient • High throughput • Hierarchical Interconnect Architecture for 1024-core • Dedicated links for each broadcast source • High performance for both global and local communication

  35. Outline • Introduction • SoC bottlenecks • Computation Trend: single-core → many-core • Interconnect Trend: Bus → NoC • Motivation • Mesh not efficient for broadcast • Broadcast is important • Our design • Transmission Line optimization • Hierarchical Interconnect Architecture for 1024-core • Future work and Conclusion

  36. Future work • Verify the impact of different factors • power supply noise, process variability, and clock jitter • This paper provided an evaluation of architecture capacity; a detailed behavior-level model should be developed in the next step • An architecture-level simulation with real applications will be implemented

  37. Conclusion • Broadcast is very important in future many-core NoC interconnect architectures • Implement Dense Equalized Transmission-line (DETAIL) optimization • Global broadcast interconnect • Low latency & power efficient & high throughput • Propose a hierarchical architecture for efficient 1024-core all-to-all communication

  38. Thank You! Q & A
