
Switching and Router Design (2007.10)


Presentation Transcript


  1. Switching and Router Design (2007.10)

  2. A generic switch

  3. Classification • Packet vs. circuit switches • packets carry headers; circuit samples don’t • Connectionless vs. connection-oriented • connection-oriented switches need a call setup • setup is handled in the control plane by the switch controller • connectionless switches deal with self-contained datagrams

  4. Requirements • Capacity of a switch is the maximum rate at which it can move information, assuming all data paths are simultaneously active • Primary goal: maximize capacity • subject to cost and reliability constraints • A circuit switch must reject a call if it can’t find a path for samples from input to output • goal: minimize call blocking • A packet switch must reject a packet if it can’t find a buffer to store it while it awaits access to the output trunk • goal: minimize packet loss • Don’t reorder packets

  5. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement (Architecture) • Schedule the fabric / crossbar / backplane • Routing lookup

  6. Packet switching • In a packet switch, for every packet you must: • Do a routing lookup: where (which port) to send it • Datagram: lookup based on the entire destination address (packets carry a destination field) • Cell: lookup based on the VCI • Schedule the fabric / crossbar / backplane • Maybe buffer, maybe QoS, maybe filtering by ACLs • Back-of-the-envelope numbers • Line cards can be 40 Gbit/s today (OC-768) • To handle minimum-sized packets (~40 B): 125 Mpps, or 8 ns per packet • But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algorithms are still needed.
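
To make these numbers easy to check, here is a minimal Python sketch (function name and defaults are mine, not from the slides) reproducing the back-of-the-envelope lookup budget:

```python
def lookup_budget(line_rate_gbps=40.0, min_pkt_bytes=40):
    """Packets per second and time budget per packet at line rate."""
    pps = line_rate_gbps * 1e9 / (min_pkt_bytes * 8)  # bits/s over bits/packet
    return pps / 1e6, 1e9 / pps                       # (Mpps, ns per packet)

print(lookup_budget())  # -> (125.0, 8.0): 125 Mpps, 8 ns per packet
```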

  7. Router Architecture • Control Plane • How routing protocols establish routes etc. • Port mappers • Data Plane • How packets get forwarded

  8. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement • Schedule the fabric / crossbar / backplane • Routing lookup

  9. First generation switch - shared memory • The line card DMAs the packet into a buffer, the CPU examines the header, and the output line card DMAs it out • The bottleneck can be the CPU, the host adapter, or the I/O bus, depending on the design • Most Ethernet switches and cheap packet routers • Low-cost routers; speed: 300 Mbps – 1 Gbps

  10. Example (first generation switch) • Today a router built with a 1.33 GHz CPU (the bottleneck) • Mean packet size 500 B • a word (4 B) access takes 15 ns • Bus ~ 100 MHz, memory ~ 5 ns • 1) Interrupt takes 2.5 µs per packet • 2) Per-packet processing time (200 instructions) = 0.15 µs • 3) Copying the packet takes 500/4 × 33 ns = 4.1 µs • (4 instructions + 2 memory accesses = 33 ns per 4 B word) • Total time = 2.5 + 0.15 + 4.1 = 6.75 µs • => speed is 500 × 8 / 6.75 ≈ 600 Mbps • Amortized interrupt cost is balanced by routing protocol cost
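
The same estimate in runnable form, a minimal sketch assuming the slide’s constants (all names are mine):

```python
def first_gen_throughput(pkt_bytes=500, interrupt_us=2.5,
                         instructions=200, cpu_ghz=1.33, per_word_ns=33):
    """Reproduce the slide's estimate for a 1.33 GHz shared-memory router."""
    process_us = instructions / (cpu_ghz * 1e3)    # ~0.15 us for 200 instructions
    copy_us = (pkt_bytes / 4) * per_word_ns / 1e3  # 125 words x 33 ns ~ 4.1 us
    total_us = interrupt_us + process_us + copy_us # ~6.75 us per packet
    return total_us, pkt_bytes * 8 / total_us      # bits per us == Mbps

print(first_gen_throughput())  # -> (~6.78, ~590): the slide rounds to 600 Mbps
```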

  11. Second generation switch - shared bus • Port mapping (forwarding decisions) in the line cards • direct transfer over the bus between line cards • if the route is not in the line card --> CPU (“slow” operations) • Bottleneck is the bus • Medium-end routers, switches, ATM switches • Speed: 10 Gbps (> 8× a first-generation switch)

  12. Third generation switches - point-to-point (switched) bus (fabric) [figure: line cards connected through a switched fabric; each packet enters the fabric with a tag prepended]

  13. Third generation (contd.) • The bottleneck in a second generation switch is the bus • A third generation switch provides parallel paths (fabric) • Features • self-routing fabric (+ tag) • the output buffer is a point of contention • unless we arbitrate access to the fabric • potential for unlimited scaling, as long as we can resolve contention for the output buffer • High-end routers, switches • Speed: 1000 Gbps

  14. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement • Schedule the fabric / crossbar / backplane • Routing lookup

  15. Buffered crossbar • What happens if packets at two inputs both want to go to same output? • Can defer one at an input buffer • Or, buffer crosspoints

  16. Broadcast • Packets are tagged with output port # • Each output matches tags • Need to match N addresses in parallel at output • Useful only for small switches, or as a stage in a large switch
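
A toy model of this broadcast-and-select idea; a real fabric does the N tag comparisons in parallel hardware, and all names here are illustrative:

```python
def broadcast_and_select(cells, n_outputs):
    """Every cell is broadcast to all outputs; each output keeps only
    the cells whose tag matches its own port number."""
    return {port: [payload for tag, payload in cells if tag == port]
            for port in range(n_outputs)}

print(broadcast_and_select([(2, "x"), (0, "y"), (2, "z")], 4))
# -> {0: ['y'], 1: [], 2: ['x', 'z'], 3: []}
```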

  17. Switch fabric element • Can build complicated fabrics from a simple element • Self-routing rule: if the tag bit = 0, send the packet to the upper output, else to the lower output • If both packets want the same output, buffer or drop one

  18. Fabrics built with switching elements • An N×N switch built from b×b elements has log_b N stages with N/b elements per stage • ex: an 8×8 switch with 2×2 elements has 3 stages with 4 elements per stage • Fabric is self-routing • Recursive • Can be synchronous or asynchronous • Regular and suitable for VLSI implementation
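
As an illustration, here is a hedged Python sketch of one member of the banyan family, an omega (shuffle-exchange) network, tracing cells with the slide-17 rule and reporting contention; the topology choice and all names are my assumption, not taken from the slides:

```python
def omega_route(n_bits, cells):
    """Trace cells through an omega network: log2(N) stages of 2x2
    elements, each stage preceded by a perfect shuffle of the N lines.

    cells: list of (source_port, dest_port) pairs with distinct sources;
    dest_port doubles as the self-routing tag.
    """
    n = 1 << n_bits
    pos = {src: src for src, _ in cells}   # current line of each cell
    dest = dict(cells)
    for stage in range(n_bits):
        # perfect shuffle = left-rotate the line number's bits
        for src in pos:
            p = pos[src]
            pos[src] = ((p << 1) | (p >> (n_bits - 1))) & (n - 1)
        # element k holds lines 2k and 2k+1; it applies the slide-17
        # rule to one tag bit, MSB first: 0 -> upper, 1 -> lower
        bit = lambda src: (dest[src] >> (n_bits - 1 - stage)) & 1
        elements = {}
        for src in pos:
            elements.setdefault(pos[src] >> 1, []).append(src)
        for srcs in elements.values():
            if len(srcs) == 2 and bit(srcs[0]) == bit(srcs[1]):
                print(f"stage {stage}: contention, buffer or drop one cell")
            for src in srcs:
                pos[src] = (pos[src] & ~1) | bit(src)
    return pos  # pos[src] == dest[src] for every cell that saw no conflict

print(omega_route(3, [(0, 3), (1, 6)]))  # disjoint outputs: routes cleanly
print(omega_route(3, [(0, 5), (1, 5)]))  # same output: contention reported
```

If the cells are first sorted by destination tag and concentrated onto consecutive inputs, which is the Batcher sorter’s job in the Batcher-banyan fabric of the next two slides, cells with distinct destinations traverse the banyan stage without internal conflicts.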

  19. The fabric for a Batcher-banyan switch

  20. An example using a Batcher-banyan switch

  21. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement (Architecture) • Schedule the fabric / crossbar / backplane • Routing lookup

  22. Generic Router Architecture: Speedup • C – input/output link capacity • RI – maximum rate at which an input interface can send data into the backplane • RO – maximum rate at which an output interface can read data from the backplane • B – maximum aggregate backplane transfer rate • Backplane speedup: B/C • Input speedup: RI/C • Output speedup: RO/C [figure: input interface → interconnection medium (backplane) → output interface]

  23. Buffering - three router architectures • Where should we place buffers? • Output queued (OQ) • Input queued (IQ) • Combined Input-Output queued (CIOQ) • Function division • Input interfaces: • Must perform packet forwarding – need to know to which output interface to send packets • May enqueue packets and perform scheduling • Output interfaces: • May enqueue packets and perform scheduling

  24. Output Queued (OQ) Routers • Only output interfaces store packets • Advantages • Easy to design algorithms: only one congestion point • Disadvantages • Requires an output speedup of N, where N is the number of interfaces → not feasible

  25. Input Queueing (IQ) Routers • Only input interfaces store packets • Advantages • Easy to build • Store packets at inputs if there is contention at outputs • Relatively easy to design algorithms • Only one congestion point, but it is not at the output… • need to implement backpressure • Disadvantages • In general, hard to achieve high utilization • However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilization close to 1

  26. Combined Input-Output Queueing (CIOQ) Routers • Both input and output interfaces store packets • Advantages • Easy to build • Utilization 1 can be achieved with an input/output speedup (<= 2) • Disadvantages • Harder to design algorithms • Two congestion points • Need to design flow control • With an input/output speedup of 2, a CIOQ router can emulate any work-conserving OQ router [G+98, SZ98]

  27. Generic Architecture of a High Speed Router Today • CIOQ - Combined Input-Output Queued Architecture • Input/output speedup <= 2 • Input interface • Perform packet forwarding (and classification) • Output interface • Perform packet (classification and) scheduling • Backplane / fabric • Point-to-point (switched) bus; speedup N • Schedule packet transfer from input to output

  28. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement • Schedule the fabric / crossbar / backplane • Routing lookup

  29. Backplane / Fabric / Crossbar • A point-to-point switch allows simultaneous transfer of packets between any disjoint pairs of input-output interfaces • Goal: come up with a schedule that • maximizes router throughput • meets flow QoS requirements • Challenges: • Address head-of-line blocking at inputs • Resolve contention at inputs and outputs within the speedup budget • Avoid packet dropping at outputs if possible • Note: packets are fragmented into fixed-size cells (why?) at the inputs and reassembled at the outputs • In Partridge et al., a cell is 64 B (what are the trade-offs?)

  30. Head-of-line Blocking • The cell at the head of an input queue cannot be transferred, thus blocking the cells behind it [figure: 3×3 switch; one head cell cannot be transferred because it is blocked by another cell bound for the same output, another because the output buffer is full]
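
The ~58% figure quoted on the next slide can be reproduced with a small Monte Carlo simulation of saturated single-FIFO inputs with uniform random destinations (a sketch under those assumptions; names are mine):

```python
import random

def hol_throughput(n=32, slots=100_000, seed=1):
    """Saturated FIFO inputs, one queue per input, uniform destinations.
    Each slot every contended output serves one random head-of-line cell;
    the winner's queue then reveals a fresh random destination behind it."""
    rng = random.Random(seed)
    head = [rng.randrange(n) for _ in range(n)]  # HOL destination per input
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, d in enumerate(head):
            contenders.setdefault(d, []).append(i)
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            head[winner] = rng.randrange(n)
            served += 1
    return served / (slots * n)

print(hol_throughput())  # ~0.59, approaching 2 - sqrt(2) ~ 0.586 as n grows
```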

  31. To Avoid Head-of-line Blocking • Head-of-line blocking with only 1 queue per input • Max throughput <= 2 − sqrt(2) ≈ 58.6% • Solution? Maintain at each input N virtual queues, i.e., one per output • Requires N queues per input; more if QoS • The Way It’s Done Now [figure: 3×3 switch with one queue per output at each input]
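
A minimal sketch of the virtual-output-queue (VOQ) structure, assuming one FIFO per output at each input (class and method names are mine):

```python
from collections import deque

class VOQInput:
    """One input line card with N virtual output queues."""
    def __init__(self, n_outputs):
        self.voq = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, output):
        # a cell waits only behind cells bound for the same output, so a
        # busy output no longer blocks cells bound elsewhere
        self.voq[output].append(cell)

    def requests(self):
        # one request per non-empty VOQ; a scheduler (PIM, iSLIP, ...)
        # grants at most one of them per time slot
        return [j for j, q in enumerate(self.voq) if q]

    def dequeue(self, output):
        return self.voq[output].popleft()
```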

  32. Cell transfer • Schedule: ideally, find the maximum number of input-output pairs such that: • input/output contention is resolved • packet drops at outputs are avoided • packets meet their time constraints (e.g., deadlines), if any • Example: • Use stable matching • Try to emulate an OQ switch

  33. Stable Marriage Problem • Consider N women and N men • Each woman/man ranks each man/woman in order of preference • A stable matching is a matching with no blocking pairs • Blocking pair: let p(i) denote the partner of i; matched pairs (k, p(k)) and (j, p(j)) form a blocking pair if k prefers p(j) to p(k), and p(j) prefers k to j

  34. Gale-Shapley Algorithm (GSA) • As long as there is a free man m • m proposes to the highest-ranked woman w on his list to whom he hasn’t proposed yet • If w is free, m and w become engaged • If w is engaged to m’ and w prefers m to m’, w releases m’ (who becomes free again) • Otherwise m remains free • A stable matching exists for every set of preference lists • Complexity: worst case O(N²)
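
A straightforward Python implementation of GSA (a sketch; here the proposing side plays the role the next slide assigns to outputs, and the algorithm is symmetric):

```python
from collections import deque

def gale_shapley(men_pref, women_pref):
    """men_pref[m]: women in m's preference order;
    women_pref[w]: men in w's preference order."""
    n = len(men_pref)
    # rank[w][m]: position of man m in woman w's list (lower = preferred)
    rank = [{m: r for r, m in enumerate(women_pref[w])} for w in range(n)]
    next_proposal = [0] * n        # next index each man will try on his list
    engaged_to = [None] * n        # engaged_to[w] = current fiance of w
    free_men = deque(range(n))
    while free_men:
        m = free_men.popleft()
        w = men_pref[m][next_proposal[m]]
        next_proposal[m] += 1
        if engaged_to[w] is None:
            engaged_to[w] = m
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])  # w releases her current fiance
            engaged_to[w] = m
        else:
            free_men.append(m)              # w rejects m; he stays free
    return {w: m for w, m in enumerate(engaged_to)}

men = [[0, 1, 2], [1, 0, 2], [0, 2, 1]]
women = [[1, 0, 2], [0, 1, 2], [2, 1, 0]]
print(gale_shapley(men, women))  # -> {0: 0, 1: 1, 2: 2}, a stable matching
```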

  35. OQ Emulation with a Speedup of 2 • Each input and output maintains a preference list • Input preference list: list of cells at that input ordered in the inverse order of their arrival • Output preference list: list of all input cells to be forwarded to that output ordered by the times they would be served in an Output Queueing schedule • Use GSA to match inputs to outputs • Outputs initiate the matching • Can emulate all work-conserving schedulers

  36. Example with a Speedup of 2 [figure: inputs 1–3 holding queued cells (input 1: a.1, b.1, b.2, c.2; input 2: a.2, c.1; input 3: c.3, b.3) matched to outputs a–c; panels (a)–(d) show the queues and the matching after each step, beginning with the state after step 1]

  37. A Case Study [Partridge et al. ’98] • Goal: show that routers can keep pace with improvements in transmission link bandwidth • Architecture • A CIOQ router • 15 (input/output) line cards: C = 2.4 Gbps (3.3 Gbps including packet headers) • Each input card can handle up to 16 (input/output) interfaces • Separate forwarding engines (FEs) to perform routing • Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 Mpps) • B/C = 50/2.4 ≈ 20

  38. Router Architecture [figure: data path; the line card passes each packet header to a forwarding engine]

  39. Router Architecture [figure: input/output interfaces 1–15 joined by the backplane (data in / data out); forwarding engines and the network processor exchange control data (e.g., routing), update routing tables, and set scheduling (QoS) state]

  40. Router Architecture: Data Plane • Line cards • Input processing: can handle input links up to 2.4 Gbps • Output processing: uses a 52 MHz FPGA (Field-Programmable Gate Array); implements QoS • Forwarding engine: • 415 MHz DEC Alpha 21164 processor, three-level cache used to store recent routes • Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate • Entire routing table in the tertiary cache (16 MB, divided into two banks)

  41. Router Architecture: Control Plane • Network processor: 233 MHz Alpha 21064 running NetBSD 1.1 • Updates routing • Manages link status • Implements reservations • Backplane allocator: implemented in an FPGA • The allocator is the heart of the high-speed switch • Schedules transfers between input/output interfaces

  42. Control Plane: Backplane Allocator • Time is divided into epochs • 16 ticks of the data clock (8 allocation clocks) • Transfer unit: 64 B (8 data clock ticks) • Up to 15 simultaneous transfers in an epoch • One transfer: 128 B of data + 176 auxiliary bits • Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined: • 1) the source card signals that it has data to send to the destination card • 2) the switch allocator schedules the transfer • 3) source and destination cards are notified and told to configure themselves • 4) the transfer takes place • Flow control through inhibit pins

  43. Early Crossbar Scheduling Algorithm • Wavefront algorithm • Naive: evaluate the positions of the request matrix one at a time — slow (36 “cycles” for a 6×6 matrix) • Observation: positions (2,1) and (1,2) don’t conflict with each other, so each anti-diagonal wavefront can be evaluated in parallel (11 “cycles”) • Do positions in groups, with groups in parallel (5 “cycles”) • Can find the optimal group size, etc. • Problems: fairness, speed, …
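
A minimal sketch of the wavefront idea: sweep the request matrix by anti-diagonals, since all positions on one anti-diagonal touch distinct rows and columns and can be evaluated simultaneously (function and variable names are mine):

```python
def wavefront_schedule(requests):
    """Grant at most one cell per row (input) and column (output).
    requests[i][j] is True if input i has a cell for output j."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    # 2N - 1 waves: 11 for the 6x6 example above, 29 for the MGR's 15x15
    for wave in range(2 * n - 1):
        for i in range(n):            # all (i, j) with i + j == wave are
            j = wave - i              # conflict-free, so hardware can
            if 0 <= j < n and requests[i][j] \
                    and row_free[i] and col_free[j]:  # evaluate them at once
                grants.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return grants
```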

  44. The Switch Allocator • Disadvantages of the simple allocator • Unfair: a preference for low-numbered sources • Requires evaluating 225 (15 × 15) positions per epoch, which demands a faster evaluation rate than an FPGA can sustain • Solution to the unfairness problem: random shuffling of sources and destinations • Solution to the timing problem: parallel evaluation of multiple locations • pure wavefront allocation requires 29 steps for a 15 × 15 matrix • the MGR allocator uses 3 × 5 groups • Priority to requests from forwarding engines over line cards, to avoid header contention on the line cards

  45. Alternatives to the Wavefront Scheduler • PIM: Parallel Iterative Matching • Request: Each input sends requests to all outputs for which it has packets • Grant: Output selects an input at random and grants • Accept: Input selects from its received grants • Problem: Matching may not be maximal • Solution: Run several times • Problem: Matching may not be “fair” • Solution: Grant/accept in round robin instead of random
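
A hedged sketch of PIM’s three phases with random selection (the names and the iteration driver are mine):

```python
import random

def pim(requests, iterations=3, rng=random):
    """Parallel Iterative Matching.
    requests[i][j]: input i has at least one cell for output j."""
    n = len(requests)
    in_free, out_free = [True] * n, [True] * n
    matching = []
    for _ in range(iterations):
        # Request + Grant: each free output picks a random free requester
        grants = {}                 # input -> outputs that granted it
        for j in range(n):
            if out_free[j]:
                reqs = [i for i in range(n) if requests[i][j] and in_free[i]]
                if reqs:
                    grants.setdefault(rng.choice(reqs), []).append(j)
        # Accept: each input picks one of its grants at random
        for i, outs in grants.items():
            j = rng.choice(outs)
            matching.append((i, j))
            in_free[i] = out_free[j] = False
    return matching
```

Running several iterations grows the matching toward maximal, addressing the first problem on the slide; the second (fairness) motivates the round-robin variant on the next slide.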

  46. iSLIP – Round-robin Parallel Iterative Matching (PIM) • Each input maintains a round-robin list of outputs • Each output maintains a round-robin list of inputs • Request phase: inputs send requests to all desired outputs • Grant phase: each output picks the first requesting input in its round-robin sequence • Accept phase: each input picks the first granting output in its round-robin sequence • An output updates its round-robin sequence only if its grant is accepted • Good fairness in simulation
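
A sketch of iSLIP’s round-robin grant/accept pointers, under my reading of the update rule (pointers advance only for grants accepted in the first iteration; all names are mine):

```python
def islip(requests, grant_ptr, accept_ptr, iterations=3):
    """One time slot of iSLIP. grant_ptr[j] / accept_ptr[i] are the
    persistent round-robin pointers of output j / input i."""
    n = len(requests)
    matched_in = [None] * n   # output matched to each input
    matched_out = [None] * n  # input matched to each output
    for it in range(iterations):
        # Grant: each unmatched output offers to the first requesting
        # unmatched input at or after its round-robin pointer
        grants = [[] for _ in range(n)]  # grants[i] = outputs offering to i
        for j in range(n):
            if matched_out[j] is None:
                for k in range(n):
                    i = (grant_ptr[j] + k) % n
                    if requests[i][j] and matched_in[i] is None:
                        grants[i].append(j)
                        break
        # Accept: each input takes the first grant at or after its pointer
        for i in range(n):
            if matched_in[i] is None and grants[i]:
                for k in range(n):
                    j = (accept_ptr[i] + k) % n
                    if j in grants[i]:
                        matched_in[i], matched_out[j] = j, i
                        if it == 0:  # pointers move only on first-iteration accepts
                            grant_ptr[j] = (i + 1) % n
                            accept_ptr[i] = (j + 1) % n
                        break
    return [(i, matched_in[i]) for i in range(n) if matched_in[i] is not None]
```

Because a pointer advances past an input only when that input’s cell is actually accepted, no VOQ can be starved, which is the source of the fairness the slide mentions.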

  47. 100% Throughput? • Why does it matter? • Guaranteed behavior regardless of load! • The same reason we moved away from cache-based router architectures • Cool result: • Dai & Prabhakar: any maximal matching scheme in a crossbar with 2× speedup gets 100% throughput • Speedup: run the internal crossbar links at 2× the input and output link speeds

  48. Summary: Design Decisions (Innovations) • Each FE has a complete set of routing tables • A switched fabric is used instead of the traditional shared bus • FEs are on boards distinct from the line cards • Use of an abstract link layer header • Include QoS processing in the router

  49. Outline • Circuit switching • Packet switching • Switch generations • Switch fabrics • Buffer placement • Schedule the fabric / crossbar / backplane • Routing lookup
