
Self-Tuned Congestion Control for Multiprocessor Networks




Presentation Transcript


  1. Self-Tuned Congestion Control for Multiprocessor Networks Mithuna Thottethodi Alvin R. Lebeck {mithuna,alvy}@cs.duke.edu Department of Computer Science Duke University, Durham, North Carolina Shubhendu S. Mukherjee Shubu.Mukherjee@compaq.com VSSAD, Alpha Development Group Compaq Computer Corporation Shrewsbury, Massachusetts Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January 2001

  2. Network Saturation

  3. Why Network Saturation? (Figure: router) • Tree saturation • Deadlock cycles • New packets block older packets • Backpressure takes 1000s of cycles to propagate back

  4. Why Do We Care? (Figure: CPUs and router) • Computation power per router is increasing • More aggressive speculation • Simultaneous Multithreading • Chip Multiprocessors • “Unstable” behavior makes designers very nervous

  5. So, what’s the solution? • Throttle • stop injecting packets when you hit a “threshold” • “threshold” = % full network buffers • But • Local estimate of threshold insufficient • Saturation point differs for communication patterns • Questions • How do we collect global estimate of % full network buffers? • How do we “tune” the threshold to different patterns?

  6. Outline • Overview • Multiprocessor Network Basics • Deadlocks & virtual channels • Adaptive routing & Duato’s theory • How to collect global estimate of congestion? • How to “tune” the throttle threshold? • Methodology & Results • Summary, Future Work, & Other Projects

  7. A Multiprocessor Network (Figure: router)

  8. Deadlock Avoidance (Figure: flows 1 → 3, 2 → 4, 3 → 1, and 4 → 2 deadlocked in a cycle; virtual channels (red & yellow) avoid it)

  9. Virtual Channels (VC) (Figure: one buffer per VC; flows 1 → 3, 2 → 4, 3 → 1, 4 → 2) • Logically, red and yellow networks (deadlock-free)

  10. Duato’s Theory • Adaptive network for high performance • deadlock-prone • Deadlock-free network when adaptive network deadlocks • drop down to deadlock-free when router is congested • Implemented with different virtual channels • adaptive virtual channels • deadlock-free virtual channels (escape channels)
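Duato’s scheme above can be sketched as a VC-selection rule: try the adaptive (deadlock-prone) virtual channels first and fall back to the deadlock-free escape channel. This is an illustrative sketch only; the function and parameter names are hypothetical, not from the talk.

```python
# Illustrative sketch of Duato-style VC selection; names are hypothetical.

def select_virtual_channel(adaptive_vcs, escape_vc, free):
    """Prefer any free adaptive VC; fall back to the deadlock-free
    escape VC when the adaptive network is congested."""
    for vc in adaptive_vcs:
        if free(vc):
            return vc        # high-performance, deadlock-prone path
    if free(escape_vc):
        return escape_vc     # deadlock-free fallback (e.g. dimension-order)
    return None              # block until some VC drains

# VCs 0 and 1 adaptive, VC 2 the escape channel; 0 and 1 are occupied
occupied = {0, 1}
chosen = select_virtual_channel([0, 1], 2, lambda v: v not in occupied)
```

Because the escape channel is only used when the adaptive VCs are full, the common case keeps the flexibility of adaptive routing while the escape path guarantees forward progress.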

  11. Outline • Overview • Multiprocessor Network Basics • How to collect global estimate of congestion? • How to “tune” the throttle threshold? • Methodology & Results • Summary, Future Work, & Other Projects

  12. Global Estimate of Congestion • % of full buffers in entire network • more & more buffers occupied when network saturates • throttle network when % full buffers cross threshold • Advantages • simple aggregation • empirical observation: works well • Disadvantages • doesn’t detect localized congestion • threshold differs for communication patterns (we solve this)

  13. Gather Global Information • Global Information • % full network buffers in an “interval” • % packets or flits delivered during an “interval” • Constraint • gather time << backpressure buildup time (1000s of cycles) • Mechanisms • piggybacking • meta-packets • side-band signal

  14. Sideband: Dimension-wise Aggregation • Each hop takes h cycles on the sideband • After 2 hops, aggregation in one dimension is done • 2 such phases, so total gather time = 2 * 2 * h = 4h cycles • For k-ary n-cubes, gather time (g) = n * k * h / 2 • For a 16x16 network, g = 2 * 16 * 2 / 2 = 32 cycles (Figure: Phase 1, Phase 2)
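The gather-time formula on the slide above can be checked in a few lines (assuming, as the slide implies, k/2 sideband hops per dimension and n sequential phases):

```python
def gather_time(n, k, h):
    """Cycles to aggregate congestion info dimension-wise in a k-ary n-cube:
    n phases, k/2 sideband hops per phase, h cycles per hop."""
    return n * (k // 2) * h

# 16x16 torus (16-ary 2-cube), h = 2 cycles per sideband hop
g = gather_time(2, 16, 2)   # 32 cycles, matching the slide
```

The key property is that g stays far below the thousands of cycles that backpressure needs to propagate, so the global estimate is fresh enough to act on.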

  15. Outline • Overview • Multiprocessor Network Basics • How to collect global estimate of congestion? • How to “tune” the throttle threshold? • Methodology & Results • Summary, Future Work, & Other Projects

  16. Dynamic Detection of Threshold (Hill Climbing) • If bandwidth dropped > 25%: decrement the threshold (whether or not currently throttling) • If no large drop and not currently throttling: increment the threshold • If no large drop while throttling: no change (Figure: throughput vs. % full buffers, with operating points A, B, C and the threshold marked) … we may still creep into saturation (later)
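One reading of the hill-climbing rule on the slide above, as a sketch: the increment/decrement sizes come from slide 22, but the function name and the exact mapping of table cells to actions are my interpretation.

```python
def adjust_threshold(threshold, throttling, bandwidth_drop,
                     inc=0.01, dec=0.04):
    """One hill-climbing step per tuning period.
    threshold and bandwidth_drop are fractions (e.g. 0.25 = 25%)."""
    if bandwidth_drop > 0.25:
        return threshold - dec   # big drop: back off, throttle sooner
    if not throttling:
        return threshold + inc   # stable and unthrottled: probe upward
    return threshold             # stable while throttling: no change
```

The asymmetric step sizes (small increment, large decrement) let the tuner creep up the throughput curve but retreat quickly once it overshoots into saturation.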

  17. Summary of Approach • Global Knowledge of a Network • Collect % full network buffers and overall throughput • Dimension-wise aggregation, g-cycle snapshots • Aggregation via sideband signals • Dynamically detect throttling threshold • Threshold = % of full network buffers • Self-tuned using hill climbing • Reset if hill climbing fails

  18. Outline • Overview • Multiprocessor Network Basics • How to collect global estimate of congestion? • How to “tune” the throttle threshold? • Methodology & Results • Summary, Future Work, & Other Projects

  19. Methodology • Flitsim 2.0 Simulator (Pinkston’s group at USC) • warmup for 10k cycles, simulate for 50k cycles • Network architecture • 16x16 two-dimensional torus (16-ary, 2-cube) • Full-duplex links • Packet size = 16 flits • Wormhole routing • Deadlock avoidance (paper has deadlock recovery results) • Router architecture • 3 virtual channels per physical channel • Each virtual channel buffer holds 8 flits • 1 cycle central arbitration, 1 cycle switching

  20. Input Traffic • Packet Generation Frequency • “attempt” to send one packet per packet regeneration interval • Traffic Patterns • Random destination • Perfect Shuffle: a(n-1) a(n-2) ... a1 a0 → a(n-2) a(n-3) ... a0 a(n-1) • Butterfly: a(n-1) a(n-2) ... a1 a0 → a0 a(n-2) ... a1 a(n-1) • Bit Reversal: a(n-1) a(n-2) ... a1 a0 → a0 a1 ... a(n-2) a(n-1)
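The three permutation patterns above map a source address to a destination by rearranging its bits; a sketch on n-bit node addresses (function names are mine):

```python
def shuffle(addr, n):
    """Perfect shuffle: rotate the n-bit address left by one."""
    return ((addr << 1) | (addr >> (n - 1))) & ((1 << n) - 1)

def butterfly(addr, n):
    """Butterfly: swap the most- and least-significant bits."""
    msb, lsb = (addr >> (n - 1)) & 1, addr & 1
    body = addr & ~(1 | (1 << (n - 1)))   # keep the middle bits
    return body | (lsb << (n - 1)) | msb

def bit_reversal(addr, n):
    """Bit reversal: reverse the n-bit address."""
    return int(format(addr, f'0{n}b')[::-1], 2)
```

Such permutations are standard stress tests because they concentrate traffic on specific paths, so each pattern saturates the network at a different load (and hence a different threshold).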

  21. Throttling Algorithms • Base • no throttling • ALO (At Least One) • Lopez, Martinez, and Duato, ICPP, August, 1998 • Throttling based on local estimation of congestion • Inject new packet only if • “useful” physical channel has all virtual channels free, or • at least one virtual channel on every “useful” channel is free • Tune (this work)

  22. Tuning Parameters • Total number of network buffers = 256 * 3 * 4 = 3072 • Gather time (g) = n * k * h / 2 = 32 cycles • Sideband communication latency (h) = 2 cycles • Sideband communication bandwidth = 25 bits (!) • # network buffers = 3072 = 12 bits • max throughput = g * 256 * 1 = 8192 = 13 bits • Tuning frequency = once every 96 cycles • Initial threshold value = 1% ~= 30 buffers • Threshold increment = 1%, decrement = 4%
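The bit-budget arithmetic on the slide above can be verified directly (a quick check; ceil(log2 N) reproduces the slide's way of counting bits for a value up to N):

```python
import math

# 16x16 = 256 routers, 4 physical channels each, 3 VCs per physical channel
buffers = 256 * 4 * 3                 # total network buffers = 3072
max_throughput = 32 * 256 * 1         # g cycles * 256 nodes * 1 flit/cycle

buffer_bits = math.ceil(math.log2(buffers))          # 12 bits
throughput_bits = math.ceil(math.log2(max_throughput))  # 13 bits
sideband_bits = buffer_bits + throughput_bits        # 25-bit sideband
```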

  23. Random Pattern Beyond saturation point, Tune outperforms ALO and Base

  24. Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles) Tune fairly insensitive to delayed collection of information

  25. Static Threshold Choice Optimal thresholds differ for random and butterfly; Tune performs close to the best static threshold

  26. With Bursty Load Tune outperforms ALO (Figure: results for random, bit reversal, shuffle, and butterfly patterns)

  27. Avoiding Local Maxima • What if the steady decrease in bandwidth is < 25%? • potential to “creep” into saturation • Solution: remember the global maximum • max = maximum throughput seen in any tuning period • Nmax = number of full buffers at max • Tmax = threshold at max • Reset threshold to min(Tmax, Nmax) if throughput < 50% of max • If “r” consecutive resets don’t fix the problem, then restart • hypothesis: communication pattern has changed
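A sketch of that reset rule follows. The state layout and function name are hypothetical, the talk does not give a value for "r" (the r_limit default below is an arbitrary placeholder), and all quantities are treated as fractions of their maxima.

```python
# Hypothetical sketch of the global-maximum reset rule; state layout and
# the r_limit default are assumptions, not values from the talk.

def check_reset(state, throughput, threshold, full_buffers,
                r_limit=3, initial=0.01):
    """Remember the best tuning period seen so far; reset the threshold
    when throughput falls below 50% of that maximum."""
    if throughput >= state['max']:
        # new global maximum: remember it and clear the reset counter
        state.update(max=throughput, Nmax=full_buffers, Tmax=threshold,
                     resets=0)
        return threshold
    if throughput < 0.5 * state['max']:
        state['resets'] += 1
        if state['resets'] > r_limit:
            # consecutive resets didn't help: assume the communication
            # pattern changed and restart tuning from scratch
            state.update(max=0.0, resets=0)
            return initial
        return min(state['Tmax'], state['Nmax'])
    return threshold

state = {'max': 0.0, 'Nmax': 0.0, 'Tmax': 0.0, 'resets': 0}
t = check_reset(state, throughput=1.0, threshold=0.10, full_buffers=0.20)
```

Resetting to min(Tmax, Nmax) jumps the threshold back to the most conservative value associated with the best period seen, instead of relying on small hill-climbing steps to escape a slow slide into saturation.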

  28. Threshold Reset Necessary (Figure: Hill Climbing vs. Hill Climbing + Local Maxima; packet regeneration interval = 10 cycles)

  29. Summary • Network saturation is a severe problem • advent of powerful processors, SMT, and CMPs • “unstable” behavior makes designers nervous • We propose throttling based on global knowledge • aggregate global knowledge (% full buffers, throughput) • throttle when % full buffers exceed the threshold • tune threshold for communication patterns & offered load
