Making Parallel Packet Switches Practical Sundar Iyer, Nick McKeown (sundaes,nickm)@stanford.edu Departments of Electrical Engineering & Computer Science, Stanford University http://klamath.stanford.edu/pps
Motivation To design and analyze: • an architecture of a very high capacity packet switch • in which the memories run slower than the line rate [Ref: S. Iyer, A. Awadallah, N. McKeown, “Analysis of a Packet Switch with Memories Running Slower than the Line Rate,” Proc. IEEE Infocom, Tel Aviv, March 2000.] Stanford University
What limits capacity of packet switches today? • Memory bandwidth for packet buffers • Shared memory: B = 2NR • Input queued: B = 2R • Switch Arbitration • At the line rate R • Packet Processing • At the line rate R Stanford University
How can we scale the capacity of switches? What we’d like: a large N×N switch running at line rate R. The building blocks we’d like to use: slower N×N switches. Stanford University
Why this might be a good idea • Larger Capacity • Slower than the line rate • Buffering • Arbitration • Packet Processing • Redundancy Stanford University
Observations and Questions • Random load-balancing: It’s hard to predict system performance. • Flow-by-flow load-balancing: Worst-case performance is very poor. • Can we do better? What if we switch packet by packet? • Can we achieve 100% throughput? • Can we give delay guarantees? Stanford University
Architecture of a PPS Definition: A PPS is comprised of multiple identical lower-speed packet-switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet-switches, then recombined by a multiplexor at the output. We call this “parallel packet switching” Stanford University
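The packet-by-packet spreading the slide describes can be sketched in a few lines of Python. This is a minimal illustration, not the deck's algorithm; the class and method names are ours, and it ignores per-output state (refined later in the talk):

```python
from itertools import cycle

class Demultiplexor:
    """Spreads an incoming packet stream, packet by packet,
    across k slower center-stage switches (minimal sketch)."""
    def __init__(self, k):
        self.k = k
        self.next_layer = cycle(range(k))  # simple round-robin pointer

    def dispatch(self, packet):
        """Return the index of the layer this packet is sent to."""
        return next(self.next_layer)

demux = Demultiplexor(k=3)
layers = [demux.dispatch(p) for p in range(6)]
# packets alternate over the 3 layers: [0, 1, 2, 0, 1, 2]
```

Each layer therefore sees only every k-th packet, which is why its memory can run at roughly R/k instead of R.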
Architecture of a PPS (N=4, k=3): each demultiplexor spreads its input stream at rate R over internal links of rate (sR/k) to the k=3 parallel OQ switches; the multiplexors recombine the streams at rate R at the outputs. Stanford University
Output Queued Switch (internal BW = 2NR) We will compare the PPS to an OQ Switch • Why? • There is no internal contention • No queueing at the inputs • They give the minimum delay • They can give QoS guarantees Stanford University
Definition • Relative Queueing Delay • The increased queueing delay faced by a cell in the PPS relative to the delay it receives in a shadow output queued switch • It counts only the time difference due to queueing • A switch is said to emulate an OQ switch if the relative queueing delay is zero Stanford University
Shadow OQ Switch A cell C arriving at time t enters both the PPS and a shadow OQ switch; it departs the shadow OQ switch at time t’ and the PPS at time t”. A PPS has a bounded relative delay if t” – t’ < constant. Stanford University
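The shadow-switch comparison can be made concrete with a small helper (hypothetical function names; departure times are example values, not measurements): per cell, the relative queueing delay is t” − t’, emulation means it is zero for every cell, and a bounded-delay PPS only requires it to stay below a constant.

```python
def relative_delays(oq_departures, pps_departures):
    """Per-cell relative queueing delay t'' - t' of the PPS
    against the shadow OQ switch (sketch)."""
    return [t_pps - t_oq
            for t_oq, t_pps in zip(oq_departures, pps_departures)]

def emulates_oq(oq, pps):
    """Exact emulation: zero relative delay for every cell."""
    return all(d == 0 for d in relative_delays(oq, pps))

def bounded_relative_delay(oq, pps, bound):
    """Weaker property: every cell leaves at most `bound` slots late."""
    return all(d < bound for d in relative_delays(oq, pps))

oq  = [3, 4, 7, 9]   # departure slots from the shadow OQ switch
pps = [3, 5, 8, 9]   # each cell leaves at most 1 slot later
```

Here the PPS does not emulate the OQ switch exactly, but it does satisfy a relative-delay bound of 2 slots.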
Problem Statement Redefined Motivation: “To design and analyze an architecture of a very high capacity packet switch in which the memories run slower than the line rate, which preserves the good properties of an OQ switch” This talk: Expanding the capacity of a FIFO packet switch, with a bounded relative queueing delay, using the PPS architecture. Stanford University
A Bad Scenario for the PPS (N=4 ports, k=3 layers, internal links at R/3) Stanford University
Parallel Packet Switch Result Theorem: • If S ≥ 2 then a PPS can emulate a FIFO OQ switch for all traffic patterns. Stanford University
Is this Practical? • Load Balancing Algorithm • Is centralized • Requires N² communication complexity • Ideally we want a distributed algorithm • Speedup • A speedup of 2 is required • We would ideally like no speedup Stanford University
Load Balancing in a PPS (N=4 ports, k=3 layers, internal links at R/3) Stanford University
Distribution of Cells from a Demultiplexor Demultiplexor: Input 1 keeps FIFOs for all k=3 layers, each drained at R/3 • Cells from every input to every output are sent to the center stage switches in a round robin manner “No more than 4 consecutive cells can go to the same FIFO, i.e. center stage switch” Stanford University
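The per-output round-robin rule above can be sketched as one pointer per output at each demultiplexor (class and method names are ours): cells destined to the same output are spread over the k layers in strict rotation, independently of cells to other outputs.

```python
from collections import defaultdict

class PerOutputRoundRobin:
    """Demultiplexor pointer state: cells from this input to a
    given output are spread over the k layers round robin
    (sketch of the slide's distribution rule)."""
    def __init__(self, k):
        self.k = k
        self.ptr = defaultdict(int)  # one rotation pointer per output

    def layer_for(self, output):
        layer = self.ptr[output]
        self.ptr[output] = (layer + 1) % self.k
        return layer

rr = PerOutputRoundRobin(k=3)
# four cells to output 0, with one cell to output 1 interleaved
seq = [rr.layer_for(0), rr.layer_for(1),
       rr.layer_for(0), rr.layer_for(0), rr.layer_for(0)]
# output 0's cells visit layers 0, 1, 2, 0 regardless of output 1
```

Because each output's cells rotate over all k layers, no single center-stage FIFO can receive a long run of consecutive cells for one output.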
Modification to the PPS • Relax the relative queueing bound • Allows a distributed load balancing arbiter • Run an independent load balancing algorithm on each demultiplexor • Eliminates N² communication complexity • Keep small & bounded delay buffers at the demultiplexor • Eliminates speedup in the links between the demultiplexor and the center stage switches Stanford University
Cells as seen by the Multiplexor (N=4, k=3 layers, internal links at R/3): cells belonging to one output may arrive from the layers out of order. Stanford University
Solution • Read cells from the corresponding queues (which may be out of order) based on their arrival times at the center stage switches, to maintain throughput • Introduce a small and bounded re-sequencing buffer at the multiplexor to re-order cells and send them in sequence • Tolerate a bounded delay relative to the shadow FIFO OQ switch Stanford University
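The re-sequencing buffer at the multiplexor can be sketched as a small min-heap keyed by cell sequence number (names are ours): out-of-order arrivals are held, and cells are released only once every earlier sequence number has arrived.

```python
import heapq

class ResequencingBuffer:
    """Small bounded buffer at the multiplexor: holds cells
    that arrive out of order and releases them in sequence."""
    def __init__(self):
        self.heap = []       # (sequence number, cell)
        self.next_seq = 0    # next sequence number to release

    def insert(self, seq, cell):
        heapq.heappush(self.heap, (seq, cell))

    def drain(self):
        """Release every cell that is now in order."""
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            _, cell = heapq.heappop(self.heap)
            out.append(cell)
            self.next_seq += 1
        return out

buf = ResequencingBuffer()
buf.insert(1, "b"); buf.insert(2, "c")
first = buf.drain()   # nothing released: cell 0 has not arrived
buf.insert(0, "a")
rest = buf.drain()    # now all three leave in order
```

The bounded relative delay is what keeps this buffer small: a cell can be held back only as long as an earlier cell can be late.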
Properties of the PPS Demultiplexor • Demultiplexors • Cells arrive at a combined rate R over all k FIFOs • Each cell has a property: its output • Cells to the same output are inserted into the k FIFOs in round robin • Cells are written into each FIFO buffer at a leaky bucket rate of R/k with burst size N • Cells are read from each FIFO at a constant service rate R/k • The maximum delay faced by a cell is N internal time slots Stanford University
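Reading the slide's constraint as a leaky bucket of rate R/k with burst size N (our interpretation), conformance can be checked directly from cell arrival times: over any interval of length t, at most rate·t + burst cells may arrive.

```python
def conforms(arrival_slots, rate, burst):
    """Brute-force leaky bucket check (sketch): in every window
    [t_i, t_j] of arrivals, at most rate*(t_j - t_i) + burst
    cells arrive. arrival_slots must be sorted."""
    n = len(arrival_slots)
    for i in range(n):
        for j in range(i, n):
            count = j - i + 1
            window = arrival_slots[j] - arrival_slots[i]
            if count > rate * window + burst:
                return False
    return True

# k = 3 layers, burst N = 4: average rate 1/3 cell per slot,
# with a burst of up to 4 back-to-back cells allowed
slots = [0, 1, 2, 3, 12]
```

Under these (assumed) parameters, a burst of four back-to-back cells conforms, while five cells in the same slot would violate the bound, matching the slide's N-slot maximum delay.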
Relative Queueing Delay faced by a Cell • Demultiplexors • A maximum relative queueing delay of N internal time slots is encountered by a cell • Multiplexors • A maximum relative queueing delay of N internal time slots is encountered by a cell • Total Relative Queueing Delay • 2N time slots Stanford University
Buffered PPS Results • A PPS with a completely distributed algorithm and no speedup, with a buffer of size Nk, can emulate a FIFO output queued switch for all traffic patterns within a relative queueing delay bound of 2N internal time slots, i.e. 2Nk time slots. Stanford University
Conclusion • It’s possible to expand the capacity of a FIFO packet switch using multiple slower speed packet switches. • There remain a couple of open questions: • Making QoS practical. • Making multicasting practical. Stanford University