
Block Design Review: Queue Manager and Scheduler


Presentation Transcript


  1. Block Design Review: Queue Manager and Scheduler
     Amy M. Freestone, Sailesh Kumar

  2. QM/Schd Overview
  [Block diagram: Phy Int Rx -> Key Extract -> Lookup -> QM/Schd -> Hdr Format -> Switch Tx -> Switch.
   Enqueue request format: Buffer Handle (32b) | Rsv (4b) | Port (4b) | QID (20b) | Rsv (4b) | Frame Length (16b) | Stats Index (16b).
   Dequeue/transmit message format: V (valid bit) | Rsv (3b) | Port (4b) | Buffer Handle (24b).]
  • QM/Scheduler function:
    • Enqueue to and dequeue from queues
    • Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
    • Drop policy
    • RR port scheduling, rate controlled
  • Memory accesses (SRAM):
    • Q-array reads and writes
    • Scheduling data structure reads and writes
    • QLength data structure reads and writes
    • Queue weight, discard threshold, and port rate reads
    • Packet length retrieval from the buffer descriptor (reads)
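The ring message layouts above can be sketched as C structs. Field widths come from the slide; the grouping into 32-bit words and the bit ordering are assumptions made here for illustration only.

```c
#include <stdint.h>

/* Enqueue request (Key Extract/Lookup -> QM): assumed to occupy three 32-bit words. */
typedef struct {
    uint32_t buffer_handle;        /* word 0: Buffer Handle (32b) */
    uint32_t rsv_hi    : 4;        /* word 1: Rsv (4b) */
    uint32_t port      : 4;        /*         Port (4b) */
    uint32_t qid       : 20;       /*         QID (20b) */
    uint32_t rsv_lo    : 4;        /*         Rsv (4b) */
    uint32_t frame_len : 16;       /* word 2: Frame Length (16b) */
    uint32_t stats_idx : 16;       /*         Stats Index (16b) */
} enq_request_t;

/* Dequeue/transmit message (QM -> Switch Tx): one 32-bit word. */
typedef struct {
    uint32_t valid         : 1;    /* V: valid bit */
    uint32_t rsv           : 3;    /* Rsv (3b) */
    uint32_t port          : 4;    /* Port (4b) */
    uint32_t buffer_handle : 24;   /* Buffer Handle (24b) */
} tx_msg_t;
```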

  3. Data Structures
  [Diagram of the per-queue SRAM data and the high-level cache architecture. Recoverable content:
   • Q params (per queue): queue length, discard threshold, weight quantum
   • Q descriptor (per queue): head, tail, count
   • Buffer descriptor: Pkt_Size (16b) held in LW2 (LW0-1 and LW3-7 not used by the QM)
   • Cache: local memory (16 entries, Q params), CAM (16 entries, queue id plus Qlen/head/tail valid flags), SRAM Q-array (16 entries, queue head/tail/count)
   • The enqueuer and the dequeuers share these 16 cache entries.]
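A minimal C sketch of the per-queue SRAM records and one cache entry, assuming the field groupings above; the names and exact layouts are illustrative, not the code's actual definitions.

```c
#include <stdint.h>

/* Per-queue parameters in SRAM (16 B per queue, per the resource-usage slide). */
typedef struct {
    uint32_t queue_length;       /* current queue length */
    uint32_t discard_threshold;  /* drop packets once the length reaches this */
    uint32_t weight_quantum;     /* WDRR quantum */
    uint32_t reserved;
} q_params_t;

/* Per-queue descriptor in SRAM (16 B per queue), cached in the Q-array. */
typedef struct {
    uint32_t head;               /* buffer handle of the first packet */
    uint32_t tail;               /* buffer handle of the last packet */
    uint32_t count;              /* packets currently in the queue */
    uint32_t reserved;
} q_descriptor_t;

/* One of the 16 cache entries tracked by the CAM. */
typedef struct {
    uint32_t qid        : 20;    /* queue id held in this entry */
    uint32_t qlen_valid : 1;     /* Q params cached in local memory */
    uint32_t head_valid : 1;     /* dequeue side has cached the descriptor */
    uint32_t tail_valid : 1;     /* enqueue side has cached the descriptor */
    uint32_t reserved   : 9;
} cam_entry_t;
```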

  4. QM/Schd Interface
  [Same block diagram and ring message formats as slide 2.]
  • Scratch ring interface, for both ingress and egress
  • Threads used: 7
    • Thread 0: free list maintenance and initialization
    • Threads 1-5: dequeue for ports 0-4
    • Thread 6: enqueue for all 5 ports
  • Threads are synchronized after each round
    • A round enqueues up to 5 packets
    • and dequeues up to 5 packets, one for each port

  5. Thread Synchronization
  • Note that in the enqueue thread, signal A is not used; it is implemented using a register that is set by thread 0 and reset by the enqueuer.
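A minimal sketch of that register-based handshake, with a C flag standing in for the microengine register; the names and the busy-wait are illustrative assumptions, not the actual code.

```c
#include <stdbool.h>

/* Flag used in place of signal A: set by thread 0, cleared by the enqueuer. */
static volatile bool round_go = false;

/* Thread 0: release the enqueuer for the next round. */
void thread0_start_round(void) {
    round_go = true;
}

/* Enqueue thread: wait for thread 0, then consume ("reset") the flag. */
void enqueuer_wait_for_round(void) {
    while (!round_go) {
        /* spin; on the ME this would be a context swap / voluntary yield */
    }
    round_go = false;
}
```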

  6. Resource Usage
  • Local memory: 1512 bytes
    • #define PAR_CACHE_LM_BASE 0x0
    • #define PORT_DATA_LM_BASE 0x100
    • #define BBUF_FL_LM_BASE 0x1a8
    • #define BBUF_LM_BASE 0x1fc
    • #define FL_LM_BASE 0x598
  • SRAM
    • Queue descriptors (16 B per queue)
    • Queue parameters (16 B per queue)
    • Port rates (4 B per port)
    • Free lists
    • Batch buffers
  • Enqueue: 15 signals, 16 read xfer registers, 10 write xfer registers
  • Dequeue: 9 signals; the QM uses 4 read xfer and 1 write xfer registers; the scheduler uses more xfers
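The base addresses above imply the region sizes noted below (simple differences between consecutive bases, with the 1512-byte total from the slide). The size arithmetic and the 16 x 16 B reading of the parameter cache are derived here, not stated on the slide.

```c
/* Local-memory layout implied by the base addresses on the slide. */
#define PAR_CACHE_LM_BASE  0x000   /* 0x000-0x0ff: 256 B (16 entries x 16 B Q params) */
#define PORT_DATA_LM_BASE  0x100   /* 0x100-0x1a7: 168 B per-port data */
#define BBUF_FL_LM_BASE    0x1a8   /* 0x1a8-0x1fb:  84 B batch-buffer free list */
#define BBUF_LM_BASE       0x1fc   /* 0x1fc-0x597: 924 B local-memory batch buffers */
#define FL_LM_BASE         0x598   /* 0x598-0x5e7:  80 B free list; total 0x5e8 = 1512 B */
```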

  7. Data Consistency Precautions
  • Only one thread (dequeue or enqueue) reads in the queue parameters of a queue
    • Flags ensure that while thread x is reading in the Q params, thread y does not read them
    • Thread y also waits until thread x has stored the data it read into the cache
    • The flags are stored in local memory
  • Three flags are used: head valid, tail valid, and Q param valid
    • Head valid implies the dequeue thread has cached the Q descriptor
    • Tail valid implies the enqueue thread has cached the Q descriptor
    • Both valid means both head and tail are cached
  • Before a thread swaps out
    • Move the relevant register contents (flags, queue length) into local memory
  • After a thread resumes
    • Move the relevant local-memory data back into registers
  • Cache contents are refreshed after every 4K iterations
  • Port rates held in registers are refreshed every 4K iterations
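An illustrative C sketch of the valid-flag protocol from the enqueue side; the flag names follow the slide, while the types, helpers, and exact sequencing are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t qid;
    bool qparam_valid;   /* Q params present in the local-memory cache */
    bool head_valid;     /* dequeue side has cached the Q descriptor */
    bool tail_valid;     /* enqueue side has cached the Q descriptor */
} cache_flags_t;

extern void ctx_swap(void);                     /* yield to another thread */
extern void load_qparams_from_sram(uint32_t);   /* SRAM -> local-memory read */

/* Enqueue side: become the single reader of the Q params, or wait until the
 * other thread has finished caching them. */
void enqueue_acquire_qparams(cache_flags_t *e) {
    if (!e->qparam_valid && !e->head_valid) {
        load_qparams_from_sram(e->qid);   /* no one has them yet: do the read */
        e->qparam_valid = true;
    } else {
        while (!e->qparam_valid)          /* another thread is reading: wait */
            ctx_swap();
    }
    e->tail_valid = true;                 /* enqueue side now has the descriptor cached */
}
```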

  8. Initialization
  • Thread 0 initializes all shared data structures ???
  • CAM and Q-array (cam_clear and Q-array empty)
  • Memory controller variables
    • Set the SRAM channel CSR to ignore the cell count and EOP bit in the buffer handle
  • Local memory
    • Queue parameter cache (all zeroes)
    • Scheduling data structures (set by the scheduler)
  • SRAM
    • Queue parameters (length, weight quantum, discard threshold)
    • Queue descriptors (all zeroes)
    • Port rates (as per token bucket)
    • Free list (set by the free list macro)
    • Scheduling data structure (set by the scheduler)

  9. Enqueue Thread
  • Operates in batch mode (5 packets at a time)
  • Read 5 requests from the scratch ring
  • Check the CAM for the 5 queue ids read
    • If miss:
      • Evict the LRU entry (write back its queue params and descriptor)
      • Read the queue params from SRAM into the cache
      • Read the queue descriptor into the Q-array
      • Update the CAM
  • Check for discard
    • If discard, call dl_drop_buf
    • If admit:
      • Send the enqueue command to the Q-array
      • Check whether the queue was already active; if not, call add_queue_to_tail
      • Update the queue length in the cache
      • Write back the queue length (in future we may want to do this less often)
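A compact C-style sketch of the per-request enqueue path described above. Only dl_drop_buf and add_queue_to_tail are named on the slides; every other helper, type, and the "queue is active if its length is non-zero" test are assumptions, and the real code processes 5 requests per round in microengine threads.

```c
#include <stdint.h>

/* Assumed helpers standing in for the real macros. */
extern int  cam_lookup(uint32_t qid);               /* returns cache entry, or -1 on miss */
extern int  cam_evict_lru_and_load(uint32_t qid);   /* write back LRU, load params + descriptor */
extern void qarray_enqueue(int entry, uint32_t buf_handle);
extern void dl_drop_buf(uint32_t buf_handle);
extern void add_queue_to_tail(uint32_t qid, uint32_t credits, uint32_t port);
extern void write_back_qlen(int entry);

typedef struct { uint32_t qlen, discard_threshold, weight_quantum; } qp_cache_t;
extern qp_cache_t qparam_cache[16];

/* Handle one of the 5 enqueue requests in a round. */
void enqueue_one(uint32_t buf_handle, uint32_t qid, uint32_t port) {
    int entry = cam_lookup(qid);
    if (entry < 0)                                   /* CAM miss */
        entry = cam_evict_lru_and_load(qid);

    qp_cache_t *qp = &qparam_cache[entry];
    if (qp->qlen >= qp->discard_threshold) {         /* drop policy */
        dl_drop_buf(buf_handle);
        return;
    }

    qarray_enqueue(entry, buf_handle);               /* enqueue via the Q-array */
    if (qp->qlen == 0)                               /* queue was idle: make it schedulable */
        add_queue_to_tail(qid, qp->weight_quantum, port);
    qp->qlen++;                                      /* update cached length */
    write_back_qlen(entry);                          /* write back (could be done less often) */
}
```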

  10. Dequeue Thread (per port)
  • One thread handles one port
  • Done for the round if $$tx_q_flow_control is set for the port, the port is inactive (port_active macro), or the rate-control tokens are exhausted
  • If the current batch is done, call the get_head macro
  • If the batch buffer is non-empty, consider its first queue_id
  • Check the CAM for the queue_id
    • If miss:
      • Evict the LRU entry (write back its queue params and descriptor)
      • Read the queue params from SRAM into the cache
      • Read the queue descriptor into the Q-array and update the CAM
  • On a hit, or once the data is ready:
    • Send the dequeue command to the Q-array
    • Call dl_sink_1ME_SCR_1words
    • Read the pkt_length from the buffer descriptor
    • Update the queue length (and write it back) and the credit
    • If credit <= 0 and queue_length > 0, call add_queue_to_tail
    • If queue_length <= 0 OR credit <= 0, increment batch_index
    • If batch_index = 5 OR queue_id = 0, call advance_head
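A C-style sketch of one dequeue step for a port, following the slide. Macros named on the slides (port_active, get_head, advance_head, add_queue_to_tail, dl_sink_1ME_SCR_1words) are modeled as plain functions with simplified signatures; the remaining helpers are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

extern bool     port_active(uint32_t port);
extern uint32_t get_head(uint32_t port);                     /* first queue_id of the batch */
extern void     advance_head(uint32_t port);
extern void     add_queue_to_tail(uint32_t qid, int32_t credits, uint32_t port);
extern void     dl_sink_1ME_SCR_1words(uint32_t tx_msg);

extern bool     flow_controlled(uint32_t port);              /* $$tx_q_flow_control check */
extern bool     tokens_available(uint32_t port);             /* token-bucket state */
extern int      cam_lookup_or_load(uint32_t qid);            /* hit, or evict LRU and load */
extern uint32_t qarray_dequeue(int entry);                   /* returns the buffer handle */
extern uint32_t read_pkt_length(uint32_t buf_handle);
extern uint32_t build_tx_msg(uint32_t port, uint32_t buf_handle);
extern void     update_qlen_and_credit(int entry, uint32_t pkt_len,
                                       int32_t *credit, uint32_t *qlen);

void dequeue_step(uint32_t port, uint32_t *batch_index) {
    if (flow_controlled(port) || !port_active(port) || !tokens_available(port))
        return;                                              /* done for this round */

    uint32_t qid = get_head(port);                           /* first queue in the batch */
    if (qid == 0)
        return;                                              /* batch buffer empty */

    int entry = cam_lookup_or_load(qid);                     /* miss path evicts LRU, loads data */

    uint32_t buf = qarray_dequeue(entry);                    /* dequeue via the Q-array */
    dl_sink_1ME_SCR_1words(build_tx_msg(port, buf));         /* hand off to Switch Tx */
    uint32_t pkt_len = read_pkt_length(buf);

    int32_t credit; uint32_t qlen;
    update_qlen_and_credit(entry, pkt_len, &credit, &qlen);  /* also writes back the queue length */

    if (credit <= 0 && qlen > 0)
        add_queue_to_tail(qid, credit, port);                /* reschedule for a later batch */
    if (qlen == 0 || credit <= 0)
        (*batch_index)++;
    if (*batch_index == 5 || qid == 0)
        advance_head(port);
}
```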

  11. Enqueue Thread (instruction counts)
  [Flow chart with per-stage instruction counts. Recoverable figures:
   • Read 15 words from scratch: 28 inst. (2x5 writes of 1 and 3 words, 2x5 reads of 3 and 2 words)
   • For the 5 q_ids, check the CAM; on a miss, write back the LRU entry and read the queue param/descriptor: 40/31 inst. per queue, 202/157 inst. total
   • Per packet: 41 inst. if discarded (dl_drop_buf); if admitted, 62 + add_queue_to_tail (x inst., x = 18-49), plus 6 inst. for signals; per-packet totals 205 / 310 + 5x
   • Enqueue / update Q params, write back the queue length (2x5 writes of 1 word; 1 word read by the scheduler), check Active?, loop around
   • For all 5 requests: worst case 545 + 5x; all discard 395; all accept/hit 500 + 5x]

  12. Dequeue Thread (per port, instruction counts)
  [Flow chart with per-stage instruction counts. Recoverable figures:
   • Rate_control: 27 inst. (1 read, once per 16K cycles)
   • If curr_queue = 0, get_head(): 27 inst. (2 writes of 1 and 3 words, 2 reads of 3 and 2 words)
   • Check CAM, evict, load: 32/44 inst.
   • Update cache, dequeue: 24+ inst. (1 read)
   • Send tx_msg, read pkt_len: 34 inst. (1 read)
   • Update credit/q_len, write q_len: 13 inst. (1 write)
   • advance_head(): 35-63 inst.; add_queue_to_tail(): 18-49 inst.; overheads: 13 inst.
   • write_old_tail and loop around
   • Worst case: 320 inst.; best case: 170 inst.]

  13. Dequeue Rate Control
  • Token bucket
    • The unit of port_rate is bytes per 4096 clocks
    • curr_time is a count of 16-clock ticks (ME clock / 16)
    • last_time is the time at which the last packet was sent
  • IF PORT IS INACTIVE THEN
    • tokens = 4095
  • ELSE IF (tokens = 4095)
    • SEND PACKET
    • last_time := curr_time
    • tokens = tokens - pkt_length
  • ELSE
    • tokens = min [4095, tokens + {((curr_time - last_time) * port_rate) >> 4}]   // 16x16 mult.
    • IF (tokens > 0)
      • SEND PACKET
      • last_time := curr_time
      • tokens = tokens - pkt_length
  • Port rates
    • Must be specified in the LSB 16 bits: Reserved (16b) | Port rate (16b)
    • 1 unit = 195 KBps
    • Max port rate = 64K units = 12.8 GBps
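A self-contained C sketch of this token-bucket check, following the pseudocode above; the state layout, function name, and return-value convention are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-port rate-control state, as described on the slide. */
typedef struct {
    int32_t  tokens;      /* byte credit, capped at 4095 */
    uint32_t last_time;   /* 16-clock tick count when the last packet was sent */
    uint32_t port_rate;   /* bytes per 4096 clocks, LSB 16 bits */
} rate_state_t;

#define TOKEN_MAX 4095

/* Returns true if a packet of pkt_length bytes may be sent on this port now. */
bool rate_control_send(rate_state_t *rs, bool port_inactive,
                       uint32_t curr_time, uint32_t pkt_length) {
    if (port_inactive) {
        rs->tokens = TOKEN_MAX;              /* an inactive port keeps a full bucket */
        return false;
    }
    if (rs->tokens != TOKEN_MAX) {
        /* Replenish: elapsed 16-clock ticks times port_rate, shifted right by 4
         * (the 16x16 multiply on the ME), capped at the bucket size. */
        rs->tokens += (int32_t)(((curr_time - rs->last_time) * rs->port_rate) >> 4);
        if (rs->tokens > TOKEN_MAX)
            rs->tokens = TOKEN_MAX;
    }
    if (rs->tokens > 0) {                    /* a full bucket also passes this test */
        rs->last_time = curr_time;
        rs->tokens -= (int32_t)pkt_length;   /* charge the packet's bytes */
        return true;
    }
    return false;
}
```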

  14. Performance Analysis
  • The dequeue thread runs much longer than the enqueue thread
  • Dequeue
    • 1273 cycles with a cache miss plus add_queue_to_tail() and advance_head()
    • 867 cycles with a cache hit and no scheduler calls
  • Enqueue
    • 876 cycles with all 5 cache misses
    • 342 cycles for a single enqueue with a cache hit
  • Dequeue takes more time because of memory accesses
    • Read queue_param: 110 cycles
    • Dequeue: 120 cycles
    • Read pkt_len: 110 cycles
  • There are a few idle cycles at present
    • They can be removed by giving higher priority to the dequeue threads

  15. File Locations (in …/IPv4_MR/)
  • Code
    • src/qm/PL/common_macros.uc
    • src/qm/PL/dequeue.uc
    • src/qm/PL/enqueue.uc
    • src/qm/PL/fl_macros.uc
    • src/qm/PL/qm.h
    • src/qm/PL/qm.uc
    • src/qm/PL/sched_macros.h
  • Includes
    • ../dispatch_loop/dl_source_WU.uc
    • dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
    • Also uses the local memory read and write macros (localmem.uc)

  16. Queue Manager Validation
  • Tested
    • Threshold-length discards (set the threshold length to 0 and checked whether packets are enqueued)
    • Enqueue
      • Single port, single queue active
      • Multiple ports/queues active
      • Cache hit/miss (not all scenarios are tested)
    • Dequeue
      • Rate control partially tested (set the port rate to 0 and checked whether packets are dequeued)
      • Partial fairness test (set the quantum to 0 and checked whether packets are dequeued)
      • Multiple active ports/queues
    • Both queue managers enabled
  • There is one bug concerning Q-array contention

  17. Cycle Budget
  • 76 B packet
  • 1.4 GHz clock rate: 1.4 Gcycles/sec
  • 5 Gbps => 170 cycles per packet
  • Dequeue worst case = 320 inst. (best case 170 inst.)
  • Enqueue worst case = 545 + 5x inst. for 5 packets
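The 170-cycle figure follows from the packet size, the clock, and the line rate (read here as 5 Gbps, since the slide's rate value was garbled); a quick check:

```c
/* Quick check of the per-packet cycle budget:
 * at 5 Gbps, a 76-byte packet arrives every 76*8 / 5e9 seconds,
 * which at 1.4 GHz is about 170 cycles. */
#include <stdio.h>

int main(void) {
    const double pkt_bits  = 76.0 * 8.0;    /* 76 B packet */
    const double line_rate = 5e9;           /* 5 Gbps */
    const double clock_hz  = 1.4e9;         /* 1.4 GHz ME clock */
    printf("cycles per packet: %.1f\n", pkt_bits / line_rate * clock_hz);  /* ~170.2 */
    return 0;
}
```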

  18. Scheduling Structure Overview
  [Diagram of the per-port scheduling structure. Recoverable content:
   • Each port (0-4) has a linked list of batch buffers with Head, Next Head, and Tail pointers
   • Each batch buffer holds entries Queue 0-4 with Credits 0-4 and an SRAM next pointer
   • Batch buffers live in SRAM; the buffers at the head are cached in local memory
   • Free lists: a stack for SRAM batch buffers and a stack in local memory for LM batch buffers.]
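A C sketch of what this structure might look like, assuming each batch buffer carries five (queue id, credits) slots and an SRAM next pointer as the diagram suggests; the names and layout are illustrative only.

```c
#include <stdint.h>

#define BATCH_SLOTS 5   /* one slot per entry in the diagram (Queue 0-4 / Credits 0-4) */

/* One batch buffer, stored in SRAM and linked by a next pointer. */
typedef struct {
    uint32_t queue_id[BATCH_SLOTS];   /* queues scheduled in this batch */
    int32_t  credits[BATCH_SLOTS];    /* WDRR credits remaining for each queue */
    uint32_t next;                    /* SRAM address of the next batch buffer */
} batch_buffer_t;

/* Per-port scheduling state: head/next-head/tail of the batch-buffer list. */
typedef struct {
    uint32_t head;        /* batch currently being dequeued (cached in local memory) */
    uint32_t next_head;   /* next batch to load */
    uint32_t tail;        /* batch receiving new add_queue_to_tail entries */
} port_sched_t;

/* Free lists from the diagram, kept as simple stacks: one for SRAM batch
 * buffers and one for local-memory batch buffers. */
typedef struct {
    uint32_t top;         /* index/address of the top free entry */
    uint32_t count;       /* number of free entries */
} bbuf_free_list_t;
```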

  19. Scheduling Structure Interface
  • Scheduling structure macros are contained in \src\qm\PL\sched_macros.uc
    • add_queue_to_tail(queue, credits, port)
    • get_head(port, head_ptr)
    • advance_head(port, sig_a, sig_b)
    • port_active(port, label)
    • write_old_tail(port, sig_a, sig_b)
  • Free list macro is contained in \src\qm\PL\fl_macros.uc
    • maintain_fl()
