This project focuses on developing an NP router for the Open Network Lab with support for 64 datagram queues per port. It includes discussions on QID partitioning, SRAM rings, and the upgrade to IXA SDK 4.3.1.
An NP-Based Router for the Open Network Lab John DeHart
Notes from 3/23/07 ONL Control Mtg • Using the same QID for all copies of a multicast does not work • The QM does not partition QIDs across ports • Do we need to support Datagram queues? • Yes, we will support 64 datagram queues per port • We will use the same Hash Function as in the NSP router • For testing purposes, can users assign the datagram queues to filters/routes? • Proposed partitioning of QIDs: • QID[15:13]: Port Number 0-4 • QID[12]: Reserved by RLI vs XScale • 0: RLI Reserved • 1: XScale Reserved • QID[11: 0] : per port queues • 4096 RLI reserved queues per port • 4032 XScale reserved queues per port • 64 datagram queues per port • yyy1 0000 00xx xxxx: Datagram queues for port <yyy> • IDT XScale software kernel memory issues still need to be resolved.
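The proposed QID partitioning can be sketched as a small encode/decode helper. This is an illustration only: the function names are hypothetical, and reducing a flow hash modulo 64 is an assumption, since the notes only say the NSP router's hash function will be reused.

```python
PORT_SHIFT = 13           # QID[15:13]: port number 0-4
OWNER_SHIFT = 12          # QID[12]: 0 = RLI reserved, 1 = XScale reserved
QUEUE_MASK = 0x0FFF       # QID[11:0]: per-port queue number
NUM_DATAGRAM_QUEUES = 64  # datagram queues live in the XScale half


def make_qid(port, xscale_owned, queue):
    """Build a 16-bit QID from its three fields."""
    if not 0 <= port <= 4:
        raise ValueError("port must be 0-4")
    if not 0 <= queue <= QUEUE_MASK:
        raise ValueError("queue must fit in 12 bits")
    return (port << PORT_SHIFT) | (int(bool(xscale_owned)) << OWNER_SHIFT) | queue


def datagram_qid(port, flow_hash):
    """Datagram queue pattern yyy1 0000 00xx xxxx for port yyy.

    flow_hash is a hypothetical hash value; taking it modulo 64 is an assumption.
    """
    return make_qid(port, True, flow_hash % NUM_DATAGRAM_QUEUES)


def split_qid(qid):
    """Return (port, owner_bit, queue) for a 16-bit QID."""
    return (qid >> PORT_SHIFT) & 0x7, (qid >> OWNER_SHIFT) & 0x1, qid & QUEUE_MASK
```

For example, `datagram_qid(3, 5)` yields `0x7005`: port 3, XScale bit set, queue 5.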
ONL NP Router [Figure: block diagram. Blocks: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugin1 through Plugin5, xScale, TCAM with Assoc. Data ZBT-SRAM. Interconnect: SRAM rings (32KW, 32KW each, 64KW), scratch rings, NN rings. Legend: Mostly Unchanged, New, Needs Some Mod., Needs A Lot Of Mod.]
Project Assignments • XScale daemons, etc: Charlie • With Design and Policy help from Fred and Ken • PLC (Parse, Lookup and Copy): Jing and JohnD • With consulting from Brandon • QM: Dave and JohnD • Rx: Dave • Tx: Dave • Stats: Dave • Header Format: Mike • Mux: Mart? • Freelist_Mgr: JohnD • Plugin Framework: Charlie and Shakir • With consulting from Ken • Dispatch loop and utilities: All • Dl_sink_to_Stats, dl_sink_to_freelist_mgr • These should take in a signal and not wait • Documentation: Ken • With help from All • Test cases and test pkt generation: Brandon
Project Level Stuff • Upgrade to IXA SDK 4.3.1 • Techx/Development/IXP_SDK_4.3/{cd1,cd2,4-3-1_update} • Project Files • We’re working on them right now. • C vs. uc • Probably any new blocks should be written in C • Existing code (Rx, Tx, QM, Stats) can remain as uc. • Freelist Mgr might go either way. • Stubs: • Do we need them this time around? • SRAM rings: • We need to understand the implications of using them. • No way to pre-test for empty/full? • Subversion • Do we want to take this opportunity to upgrade? • Current version: • Cygwin (my laptop): 1.3.0-1 • Linux (bang.arl.wustl.edu): 1.3.2 • Available: • Cygwin: 1.4.2-1 • subversion.tigris.org: 1.4.3
Notes from 3/13/07 • Ethertype needs to be written to buffer descriptor so HF can get it. • Who tags non-IP pkts for being sent to XScale: Parse? • We will not be supporting Ethernet headers with: • VLANs • LLC/SNAP encapsulation • Add In Plugin in data going to a Plugin: • In Plugin: tells the last plugin that had the packet • Plugins can write to other Plugins' SRAM rings • Support for XScale participation in an IP multicast • For use with Control protocols? • Add In Port values for Plugin and XScale generated packets • Include both In Port and In Plugin in lookup key? • Should flag bits also go to Plugins? • For users to use our IP MCast support they must abide by the IP multicast addressing rules. • i.e. Copy will do the translation of IP MCast DAddr to Ethernet MCast DAddr, so if the IP DA does not conform it can’t do it.
Issues and Questions • Upgrade to IXA SDK 4.3.1 • Techx/Development/IXP_SDK_4.3/{cd1,cd2,4-3-1_update} • Which Rx to use? • Intel Rx from IXA SDK 4.3.1 is our base for further work • Which Tx to use? • Three options: • Our current Tx (Intel IXA SDK 4.0, Radisys modifications, WU Modifications) • Among other changes, we removed some code that supported buffer chaining. • Radisys Tx based on SDK 4.0 – we would need to re-do our modifications • This would get the buffer chaining code back if we need/want it • Intel IXA SDK 4.3.1 Tx – no Radisys modifications, we would need to re-do our modifications • How will we write L2 Headers? • When there are >1 copies: • For a copy going to the QM, Copy allocates a buffer and buffer descriptor for the L2 Header • Copy writes the DAddr into the buffer descriptor • Options: • HF writes full L2 header to DRAM buffer and Tx initiates the transfer from DRAM to TBUF • Unicast: to packet DRAM buffer • Multicast: to prepended header DRAM buffer • HF/Tx writes/reads L2 header to/from Scratch ring and Tx writes it directly to TBUF • When there is only one copy of the packet: • No extra buffer and buffer descriptor are allocated • L2 header is given to Tx in same way as it is for the >1 copy case • How should Exceptions be handled? • TTL Expired • IP Options present • No Route • C vs. uc • Probably any new blocks should be written in C • Existing code (Rx, Tx, QM, Stats) can remain as uc. Freelist Mgr? • Continued on next slide…
Issues and Questions • Need to add Global counters • See ONLStats.ppt • Global counters: • Per port Rx and Tx: Pkt and Byte counters • Drop counters: • Rx (out of buffers) • Parse (malformed IP header/pkt) • QM (queue overflow) • Plugin • XScale • Copy (lookup result has Drop bit set, lookup MISS?) • Tx (internal buffer overflow) • What is our performance target? • 5-port Router, full link rates. • How should SRAM banks be allocated? • How many packets should be able to be resident in system at any given time? • How many queues do we need to support? • Etc. • How will lookups be structured? • One operation across multiple DBs vs. multiple operations each on one DB • Will results be stored in Associated Data SRAM or in one of our SRAM banks? • Can we use SRAM Bank0 and still get the throughput we want? • Multicast: • Are we defining how an ONL user should implement multicast? • Or are we just trying to provide some mechanisms to allow ONL users to experiment with multicast? • Do we need to allow a Unicast lookup with one copy going out and one copy going to a plugin? • If so, this would use the NH_MAC field and the copy vector field • Continued on next slide…
Issues and Questions • Plugins: • Can they send pkts directly to the QM instead of always going back through Parse/Lookup/Copy? • Use of NN rings between Plugins to do plugin chaining • Plugins should be able to write to Stats module ring also to utilize stats counters as they want. • XScale: • Can it send pkts directly to the QM instead of always going through Parse/Lookup/Copy path? • ARP request and reply? • What else will it do besides handling ARP? • Do we need to guarantee in-order delivery of packets for a flow that triggers an ARP operation? • Re-injected packet may be behind a recently arrived packet for same flow. • What is the format of our Buffer Descriptor: • Add Reference Count (4 bits) • Add MAC DAddr (48 bits) • Does the Packet Size or Offset ever change once written? • Yes, Plugins can change the packet size and offset. • Other? • Continued on next slide…
Issues and Questions • How will we manage the Free list? • Support for Multicast (ref count in buf desc) makes reclaiming buffers a little trickier. • Scratch ring to Separate ME • Do we want it to batch requests? • Read 5 or 10 from the scratch ring at once, compare the buffer handles and accumulate • Depending on queue, copies of packets will go out close in time to one another… • But vast majority of packets will be unicast so no accumulation will be possible. • Or, use the CAM to accumulate 16 buffer handles • Evict unicast or done multicast from CAM and actually free descriptor • Do we want to put Freelist Mgr ME just ahead of Rx and use NN ring into Rx to feed buffer descriptors when we can? • We might be able to have Mux and Freelist Mgr share an ME (4 threads per or something) • Modify dl_buf_drop() • Performance assumptions of blocks that do drops may have to be changed if we add an SRAM operation to a drop • It will also add a context swap. The drop code will need to do a test_and_decr, wait for the result (i.e. context swap) and then depending on the result perhaps do the drop. • Note: test_and_decr SRAM atomic operation returns pre-modified value • Usage Scenarios: • It would be good to document some typical ONL usage examples. • This might just be extracting some stuff from existing ONL documentation and class projects. • Ken? • It might also be good to document a JST dream sequence for an ONL experiment • Oh my, what I have done now… • Do we need to worry about balancing MEs across the two clusters? • QM and Lookup are probably heaviest SRAM users • Rx and Tx are probably heaviest DRAM users. • Plugins need to be in neighboring MEs • QM and HF need to be in neighboring MEs
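The modified drop path (test_and_decr, then free only if this was the last reference) can be modelled in a few lines. This is a host-side sketch, not microengine code: `SramWord` and `drop_buffer` are hypothetical names, and the rule "free when the pre-modified count was 1" is an assumption about how the reference count would be initialised.

```python
class SramWord:
    """Stand-in for one SRAM location holding a buffer's reference count."""

    def __init__(self, value):
        self.value = value

    def test_and_decr(self):
        # Mirrors the note above: the atomic returns the PRE-modified value.
        old = self.value
        self.value -= 1
        return old


def drop_buffer(ref_cnt, free_list, handle):
    """Release one reference; return True if the buffer was actually freed."""
    old = ref_cnt.test_and_decr()  # on the ME this costs a context swap
    if old == 1:                   # assumption: 1 means "last copy in the system"
        free_list.append(handle)
        return True
    return False
```

With a three-copy multicast packet, only the third call to `drop_buffer` returns the descriptor to the free list.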
Hardware • Promentum™ ATCA-7010 (NP Blade): • Two Intel IXP2850 NPs • 1.4 GHz Core • 700 MHz XScale • Each NPU has: • 3x256MB RDRAM, 533 MHz • 3 Channels • Address space is striped across all three. • 4 QDR II SRAM Channels • Channels 1, 2 and 3 populated with 8MB each running at 200 MHz • 16KB of Scratch Memory • 16 Microengines • Instruction Store: 8K 40-bit wide instructions • Local Memory: 640 32-bit words • TCAM: Network Search Engine (NSE) on SRAM channel 0 • Each NPU has a separate LA-1 Interface • Part Number: IDT75K72234 • 18Mb TCAM • Rear Transition Module (RTM) • Connects via ATCA Zone 3 • 10 1GE Physical Interfaces • Supports Fiber or Copper interfaces using SFP modules.
Hardware [Figure: photos of the ATCA Chassis, NP Blade and RTM]
ONL Router Architecture • Each NPU is one 5-port Router • ONL Chassis has no switch Blade • 1Gb/s Links on RTM connect to external ONL switch(es) [Figure labels: 7010 Blade, NPUA, NPUB, SPI, RTM, 5x1Gb/s per NPU]
Performance • What is our performance target? • To hit 5 Gb rate: • Minimum Ethernet frame: 76B • 64B frame + 12B InterFrame Spacing • 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mpkt/sec • IXP ME processing: • 1.4 GHz clock rate • 1.4 Gcycle/sec * 1 sec/8.22 Mpkt = 170.3 cycles per packet • compute budget: (MEs*170) • 1 ME: 170 cycles • 2 ME: 340 cycles • 3 ME: 510 cycles • 4 ME: 680 cycles • latency budget: (threads*170) • 1 ME: 8 threads: 1360 cycles • 2 ME: 16 threads: 2720 cycles • 3 ME: 24 threads: 4080 cycles • 4 ME: 32 threads: 5440 cycles
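The budget arithmetic above can be reproduced with a short script. It uses the slide's own inputs (76B per minimum frame, 5 Gb/s, 1.4 GHz, 8 threads per ME); the function names are illustrative.

```python
LINK_RATE_BPS = 5e9    # five 1 Gb/s ports
FRAME_BYTES = 64 + 12  # minimum frame + inter-frame spacing, as on the slide
ME_CLOCK_HZ = 1.4e9
THREADS_PER_ME = 8


def packets_per_second(link_bps=LINK_RATE_BPS, frame_bytes=FRAME_BYTES):
    """Worst-case packet rate for back-to-back minimum-size frames."""
    return link_bps / (8 * frame_bytes)


def cycles_per_packet(clock_hz=ME_CLOCK_HZ):
    """ME clock cycles available between packet arrivals."""
    return clock_hz / packets_per_second()


def budgets(num_mes, cycles=170):
    """Return (compute budget, latency budget) in cycles for a block."""
    return num_mes * cycles, num_mes * THREADS_PER_ME * cycles


if __name__ == "__main__":
    print(f"{packets_per_second() / 1e6:.2f} Mpkt/s")
    print(f"{cycles_per_packet():.1f} cycles/pkt")
    for n in range(1, 5):
        print(n, "ME:", budgets(n))
```

Computed without intermediate rounding, the per-packet figure is about 170.2 cycles; the slide's 170.3 comes from rounding the packet rate to 8.22 Mpkt/s first. Either way the working budget is 170 cycles.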
ONL NP Router (Jon’s Original) • Each output has common set of QiDs • Multicast copies use same QiD for all outputs • QiD ignored for plugin copies [Figure: block diagram. Blocks: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QueueManager (1 ME), HdrFmt (1 ME), Tx (2 ME), Stats (1 ME), five Plugins, xScale, TCAM, SRAM. Interconnect: large SRAM rings, with a note to add a large SRAM ring]
ONL NP Router [Figure: block diagram. Blocks: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugin1 through Plugin5, xScale, TCAM with Assoc. Data ZBT-SRAM. Interconnect: SRAM rings (32KW, 32KW each, 64KW), scratch rings, NN rings]
Inter Block Rings • Scratch Rings (sizes in 32b Words: 128, 256, 512, 1024) • XScale → MUX • N Words per pkt • 256 Word Ring • 256/N pkts • PLC → XScale • N Words per pkt • 256 Word Ring • 256/N pkts • MUX → PLC • 1 Word per pkt • 256 Word Ring • 256 pkts • QM • N Words per pkt • 1024 Word Ring • 1024/N Pkts • HF → TX • 1 Word per pkt • 256 Word Ring • 256 pkts • Stats • 1 Word per pkt • 256 Word Ring • 256 pkts • Freelist Mgr • 1 Word per pkt • 256 Word Ring • 256 pkts • Total Scratch Size: 4KW (16KB) • Total Used in Rings: 2.5 KW
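A quick check of the scratch ring budget, using the ring sizes listed above. The dictionary keys are descriptive labels only, and the 3-words-per-packet figure in the usage note is a hypothetical value for N.

```python
SCRATCH_WORDS = 4 * 1024  # 4KW (16KB) of scratch memory

# Ring sizes in 32-bit words, taken from the list above.
SCRATCH_RINGS = {
    "XScale to MUX": 256,
    "PLC to XScale": 256,
    "MUX to PLC": 256,
    "QM": 1024,
    "HF to TX": 256,
    "Stats": 256,
    "Freelist Mgr": 256,
}


def ring_words_used(rings):
    """Total scratch words consumed by all rings."""
    return sum(rings.values())


def packets_in_ring(ring_words, words_per_pkt):
    """How many whole packet entries fit in a ring."""
    return ring_words // words_per_pkt
```

`ring_words_used(SCRATCH_RINGS)` gives 2560 words (2.5 KW), leaving 1.5 KW of scratch free. If N were 3 words per packet, the 1024-word QM ring would hold 341 entries.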
Inter Block Rings • SRAM Rings (sizes in 32b KW: 0.5, 1, 2, 4, 8, 16, 32, 64) • RX → MUX • 2 Words per pkt • 32KW Ring • 16K Pkts • PLC → Plugins (5 of them) • 3 Words per pkt • 32KW Rings • ~10K Pkts • Plugins → MUX (1 serving all plugins) • 3 Words per pkt • 64KW Ring • ~20K Pkts • NN Rings (128 32b words) • QM → HF • 1 Word per pkt • 128 Pkts • Plugin N → Plugin N+1 (for N=1 to N=4) • Words per pkt is plugin dependent
SRAM Buffer Descriptor • Problem: • With the use of Filters, Plugins and recycling back around for reclassification, we can end up with an arbitrary number of copies of one packet in the system at a time. • Each copy of a packet could end up going to an output port and need a different MAC DAddr from all the other packets • Having one Buffer Descriptor per packet regardless of the number of copies will not be sufficient. • Solution: • When there are multiple copies of the packet in the system, each copy will need a separate Header buffer descriptor which will contain the MAC DAddr for that copy. • When the Copy block gets a packet that it only needs to send one copy to QM, it will read the current reference count and if this copy is the ONLY copy in the system, it will not prepend the Header buffer descriptor. • SRAM buffer descriptors are the scarce resource and we want to optimize their use. • Therefore: We do NOT want to always prepend a header buffer descriptor • Otherwise, Copy will prepend a Header buffer descriptor to each copy going to the QM. • Copy does not need to prepend a Header buffer descriptor to copies going to plugins • We have to think some more about the case of copies going to the XScale. • The Header buffer descriptors will come from the same pool (freelist 0) as the PacketPayload buffer descriptors. • There is no advantage to associating these Header buffer descriptors with small DRAM buffers. • DRAM is not the scarce resource • SRAM buffer descriptors are the scarce resource.
ONL Buffer Descriptor • LW0: Buffer_Next (32b) • LW1: Buffer_Size (16b), Offset (16b) • LW2: Packet_Size (16b), Free_list 0000 (4b), Reserved (4b), Ref_Cnt (8b) • LW3: MAC DAddr_47_32 (16b), Stats Index (16b) • LW4: MAC DAddr_31_00 (32b) • LW5: Reserved (16b), EtherType (16b) • LW6: Reserved (32b) • LW7: Packet_Next (32b) • Field writer legend: Written by Rx, Added to by Copy, Decremented by Freelist Mgr (Ref_Cnt); Written by Freelist Mgr; Written by Rx; Written by Copy; Written by Rx and Plugins; Written by QM
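To make the 8-longword ONL descriptor layout concrete, here is a packing sketch. It assumes that the first field listed in each longword occupies the most significant bits and that words are stored big-endian; neither is stated on the slide, and `pack_descriptor` is a hypothetical helper.

```python
import struct


def pack_descriptor(buffer_next, buffer_size, offset, packet_size,
                    free_list, ref_cnt, mac_daddr, stats_index,
                    ethertype, packet_next):
    """Pack the ONL buffer descriptor into 32 bytes (8 x 32-bit words).

    Assumption: fields are laid out high-to-low in the order listed.
    """
    words = [
        buffer_next & 0xFFFFFFFF,                                   # LW0
        ((buffer_size & 0xFFFF) << 16) | (offset & 0xFFFF),         # LW1
        ((packet_size & 0xFFFF) << 16) | ((free_list & 0xF) << 12)  # LW2
        | (ref_cnt & 0xFF),                                         # bits 11:8 reserved
        (((mac_daddr >> 32) & 0xFFFF) << 16) | (stats_index & 0xFFFF),  # LW3
        mac_daddr & 0xFFFFFFFF,                                     # LW4
        ethertype & 0xFFFF,                                         # LW5, upper 16 reserved
        0,                                                          # LW6 reserved
        packet_next & 0xFFFFFFFF,                                   # LW7
    ]
    return struct.pack(">8I", *words)
```

The result is always 32 bytes, which matches the 32 bytes per SRAM buffer descriptor used in the SRAM usage estimate.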
MR Buffer Descriptor Buffer_Next (32b) LW0 Buffer_Size (16b) Offset (16b) LW1 Packet_Size (16b) Free_list 0000 (4b) Reserved (4b) Reserved (8b) LW2 Reserved (16b) Stats Index (16b) LW3 Reserved (16b) Reserved (8b) Reserved (4b) Reserved (4b) LW4 Reserved (32b) Reserved (4b) Reserved (4b) LW5 Reserved (16b) Reserved (16b) LW6 Packet_Next (32b) LW7
Intel Buffer Descriptor Buffer_Next (32b) LW0 Buffer_Size (16b) Offset (16b) LW1 Packet_Size (16b) Free_list (4b) Rx_stat (4b) Hdr_Type (8b) LW2 Input_Port (16b) Output_Port (16b) LW3 Next_Hop_ID (16b) Fabric_Port (8b) Reserved (4b) NHID type (4b) LW4 FlowID (32b) ColorID (4b) Reserved (4b) LW5 Class_ID (16b) Reserved (16b) LW6 Packet_Next (32b) LW7
SRAM Usage • What will be using SRAM? • Buffer descriptors • Current MR supports 229,376 buffers • 32 Bytes per SRAM buffer descriptor • 7 MBytes • Queue Descriptors • Current MR supports 65536 queues • 16 Bytes per Queue Descriptor • 1 MByte • Queue Parameters • 16 Bytes per Queue Params (actually only 12 used in SRAM) • 1 MByte • QM Scheduling structure: • Current MR supports 13109 batch buffers per QM ME • 44 Bytes per batch buffer • 576796 Bytes • QM Port Rates • 4 Bytes per port • Plugin “scratch” memory • How much per plugin? • Large inter-block rings • Rx → Mux • PLC → Plugins • Plugins → Mux • Stats/Counters • Currently 64K sets, 16 bytes per set: 1 MByte • Lookup Results
SRAM Bank Allocation • SRAM Banks: • Bank0: • 4 MB total, 2MB per NPU • Same interface/bus as TCAM • Bank1-3 • 8 MB each • Criteria for how SRAM banks should be allocated? • Size: • SRAM Bandwidth: • How many SRAM accesses per packet are needed for the various SRAM uses? • QM needs buffer desc and queue desc in same bank
SRAM Accesses Per Packet • To support 8.22 M pkts/sec we can have 24 Reads and 24 Writes per pkt (200M/8.22M) • Rx: • SRAM Dequeue (1 Word) • To retrieve a buffer descriptor from free list • Write buffer desc (2 Words) • Parse • Lookup • TCAM Operations • Reading Results • Copy • Write buffer desc (3 Words) • Ref_cnt • MAC DAddr • Stats Index • Pre-Q stats increments • Read: 2 Words • Write: 2 Words • HF • Should not need to read or write any of the buffer descriptor • Tx • Read buffer desc (4 Words) • Freelist Mgr: • SRAM Enqueue – Write 1 Word • To return buffer descriptor to free list.
QM SRAM Accesses Per Packet • QM (Worst case analysis) • Enqueue (assume queue is idle and not loaded in Q-Array) • Write Q-Desc (4 Words) • Eviction of Least Recently Used Queue • Write Q-Params ? • When we evict a Q do we need to write its params back? • The Q-Length is the only thing that the QM is changing. • Looks like it writes it back every time it enqueues or dequeues • AND it writes it back when it evicts (we can probably remove the one when it evicts) • Read Q-Desc (4 Words) • Read Q-Params (3 Words) • Q-Length, Threshold, Quantum • Write Q-Length (1 Word) • SRAM Enqueue -- Write (1 Word) • Scheduling structure accesses? • They are done once every 5 pkts (when running full rate) • Dequeue (assume queue is not loaded in Q-Array) • Write Q-Desc (4 Words) • Write Q-Params ? • See notes in enqueue section • Read Q-Desc (4 Words) • Read Q-Params (3 Words) • Write Q-Length (1 Word) • SRAM Dequeue -- Read (1 Word) • Scheduling structure accesses? • They are done once every 5 pkts (when running full rate) • Post-Q stats increments • 2 Reads • 2 Writes
QM SRAM Accesses Per Packet • QM (Worst case analysis) • Total Per Pkt accesses: • Queue Descriptors and Buffer Enq/Deq: • Write: 9 Words • Read: 9 Words • Queue Params: • Write: 2 Words • Read: 6 Words • Scheduling Structure Accesses Per Iteration (batch of 5 packets): • Advance Head: Read 11 Words • Write Tail: Write 11 Words • Update Freelist • Read 2 Words • OR • Write 5 Words
Proposed SRAM Bank Allocation • SRAM Bank 0: • TCAM • Lookup Results • SRAM Bank 1 (2.5MB/8MB): • QM Queue Params (1MB) • QM Scheduling Struct (0.5 MB) • QM Port Rates (20B) • Large Inter-Block Rings (1MB) • SRAM Rings are of sizes (in Words): 0.5K, 1K, 2K, 4K, 8K, 16K, 32K, 64K • Rx → Mux (2 Words per pkt): 32KW (16K pkts): 128KB • PLC → Plugin (3 Words per pkt): 32KW each (10K Pkts each): 640KB • Plugins → Mux (3 Words per pkt): 64KW (20K Pkts): 256KB • SRAM Bank 2 (8MB/8MB): • Buffer Descriptors (7MB) • Queue Descriptors (1MB) • SRAM Bank 3 (6MB/8MB): • Stats Counters (1MB) • Plugin “scratch” memory (5MB, 1MB per plugin)
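The proposed allocation can be sanity-checked by adding up the per-structure sizes quoted on the SRAM Usage and bank allocation slides. The dictionary names and labels below are illustrative; the counts (229,376 buffers, 65536 queues, 13109 batch buffers of 44 bytes, 5 ports) are the slide values.

```python
KB = 1024
MB = 1024 * KB
WORD_BYTES = 4
BANK_BYTES = 8 * MB  # banks 1-3 are 8 MB each


def ring_bytes(kwords, count=1):
    """Bytes used by `count` SRAM rings of `kwords` K 32-bit words each."""
    return kwords * 1024 * WORD_BYTES * count


BANK1 = {
    "QM queue params": 1 * MB,
    "QM scheduling struct": 13109 * 44,
    "QM port rates": 5 * 4,
    "Rx to Mux ring": ring_bytes(32),
    "PLC to Plugin rings": ring_bytes(32, count=5),
    "Plugins to Mux ring": ring_bytes(64),
}
BANK2 = {
    "buffer descriptors": 229376 * 32,
    "queue descriptors": 65536 * 16,
}
BANK3 = {
    "stats counters": 65536 * 16,
    "plugin scratch": 5 * MB,
}


def bank_total(bank):
    """Total bytes allocated in one bank."""
    return sum(bank.values())
```

Bank 2 comes out at exactly 8 MB, so the buffer and queue descriptors fill it completely; the three large rings in Bank 1 add up to exactly 1 MB.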
Lookups • How will lookups be structured? • Three Databases: • Route Lookup: Containing Unicast and Multicast Entries • Unicast: • Port: Can be wildcarded • Longest Prefix Match on DAddr • Routes should be sorted in the DB with longest prefixes first. • Multicast • Port: Can be wildcarded? • Exact Match on DAddr • Longest Prefix Match on SAddr • Routes should be sorted in the DB with longest prefixes first. • Primary Filter • Filters should be sorted in the DB with higher priority filters first • Auxiliary Filter • Filters should be sorted in the DB with higher priority filters first • Will results be stored in Associated Data SRAM or in one of our external SRAM banks? • Can we use SRAM Bank0 and still get the throughput we want? • Priority between Primary Filter and Route Lookup • A priority will be stored with each Primary Filter • A priority will be assigned to RLs (all routes have same priority) • PF priority and RL priority compared after result is retrieved. • One of them will be selected based on this priority comparison. • Auxiliary Filters: • If matched, cause a copy of packet to be sent out according to the Aux Filter’s result.
TCAM Operations for Lookups • Five TCAM Operations of interest: • Lookup (Direct) • 1 DB, 1 Result • Multi-Hit Lookup (MHL) (Direct) • 1 DB, <= 8 Results • Simultaneous Multi-Database Lookup (SMDL) (Direct) • 2 DB, 1 Result Each • DBs must be consecutive! • Care must be given when assigning segments to DBs that use this operation. There must be a clean separation of even and odd DBs and segments. • Multi-Database Lookup (MDL) (Indirect) • <= 8 DB, 1 Result Each • Simultaneous Multi-Database Lookup (SMDL) (Indirect) • 2 DB, 1 Result Each • Functionally same as Direct version but key presentation and DB selection are different. • DBs need not be consecutive. • Care must be given when assigning segments to DBs that use this operation. There must be a clean separation of even and odd DBs and segments.
Lookups • Route Lookup: • Key (72b) • Port (4b): Can be a wildcard (for Unicast, probably not for Multicast) • Plugin (4b): Can be a wildcard (for Unicast, probably not for Multicast) • DAddr (32b) • Prefixed for Unicast • Exact Match for Multicast • SAddr (32b) • Unicast entries always have this and its mask 0 • Prefixed for Multicast • Result (79b) • CopyVector (11b) • One bit for each of the 5 ports and 5 plugins and one bit for the XScale • QID (16b) • Drop (1b): Drop pkt • NH_IP/NH_MAC (48b) • At most one of NH_IP or NH_MAC should be valid • Valid Bits (3b) • At most one of the following three bits should be set • IP_MCast Valid (1b) • NH_IP_Valid (1b) • NH_MAC_Valid (1b)
Lookups • Filter Lookup • Key (140b) • Port (4b): Can be a wildcard (for Unicast, probably not for Multicast) • Plugin (4b): Can be a wildcard (for Unicast, probably not for Multicast) • DAddr (32b) • SAddr (32b) • Protocol (8b) • DPort (16b) • Sport (16b) • TCP Flags (12b) • Exception Bits (16b) • Allow for directing of packets based on defined exceptions • Result (89b) • CopyVector (11b) • One bit for each of the 5 ports and 5 plugins and one bit for the XScale • NH IP(32b)/MAC(48b) (48b) • At most one of NH_IP or NH_MAC should be valid • QID (16b) • Drop (1b): Drop pkt • Valid Bits (3b) • At most one of the following three bits should be set • NH IP Valid (1b) • NH MAC Valid (1b) • IP_MCast Valid (1b) • Sampling bits (2b) : For Aux Filters only • Priority (8b) : For Primary Filters only
TCAM Core Lookup Performance • Lookup/Core size of 72 or 144 bits, Freq=200MHz • CAM Core can support 100M searches per second • For 1 Router on each of NPUA and NPUB: • 8.22 MPkt/s per Router • 3 Searches per Pkt (Primary Filter, Aux Filter, Route Lookup) • Total Per Router: 24.66 M Searches per second • TCAM Total: 49.32 M Searches per second • So, the CAM Core can keep up • Now let's look at the LA-1 Interfaces…
TCAM LA-1 Interface Lookup Performance • Lookup/Core size of 144 bits (ignore for now that Route size is smaller) • Each LA-1 interface can support 40M searches per second. • For 1 Router on each of NPUA and NPUB (each NPU uses a separate LA-1 Intf): • 8.22 MPkt/s per Router • Maximum of 3 Searches per Pkt (Primary Filter, Aux Filter, Route Lookup) • Max of 3 assumes they are each done as a separate operation • Total Per Interface: 24.66 M Searches per second • So, the LA-1 Interfaces can keep up • Now let's look at the AD SRAM Results …
TCAM Assoc. Data SRAM Results Performance • 8.22M 72b or 144b lookups • 32b results consume 1/12 of AD SRAM capacity • 64b results consume 1/6 • 128b results consume 1/3 • Lookup/Core size of 72 or 144 bits, Freq=200MHz, SRAM Result Size of 128 bits • Associated SRAM can support up to 25M searches per second. • For 1 Router on each of NPUA and NPUB: • 8.22 MPkt/s per Router • 3 Searches per Pkt (Primary Filter, Aux Filter, Route Lookup) • Total Per Router: 24.66 M Searches per second • TCAM Total: 49.32 M Searches per second • So, the Associated Data SRAM can NOT keep up
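The three TCAM capacity checks (CAM core, LA-1 interface, Associated Data SRAM) follow the same pattern and can be compared side by side. Rates are in millions of searches per second and come from the slides; the function names are illustrative.

```python
PKT_RATE_M = 8.22     # Mpkt/s per router
SEARCHES_PER_PKT = 3  # Primary Filter, Aux Filter, Route Lookup


def search_load(routers):
    """Search demand in M searches/s for the given number of routers."""
    return PKT_RATE_M * SEARCHES_PER_PKT * routers


def keeps_up(capacity_m, load_m):
    """True if a resource with capacity_m M searches/s covers the load."""
    return load_m <= capacity_m


if __name__ == "__main__":
    per_router = search_load(1)  # one router per LA-1 interface
    total = search_load(2)       # NPUA + NPUB share the TCAM
    print("CAM core (100M):", keeps_up(100.0, total))
    print("LA-1 interface (40M each):", keeps_up(40.0, per_router))
    print("AD SRAM, 128b results (25M):", keeps_up(25.0, total))
```

Only the last check fails, which is what motivates moving full results out of the Associated Data SRAM.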
Lookups: Proposed Design • Use SRAM Bank 0 (2 MB per NPU) for all Results • B0 Byte Address Range: 0x000000 – 0x3FFFFF • 22 bits • B0 Word Address Range: 0x000000 – 0x3FFFFC • 20 bits • Two trailing 0’s • Use 32-bit Associated Data SRAM result for Address of actual Result: • Done: 1b • Hit: 1b • MHit: 1b • Priority: 8b • Present for Primary Filters, for RL and Aux Filters should be 0 • SRAM B0 Word Address: 21b • 1 spare bit • Use Multi-Database Lookup (MDL) Indirect for searching all 3 DBs • Order of fields in Key is important. • Each thread will need one TCAM context • Route DB: • Lookup Size: 68b (3 32b words transferred across QDR intf) • Core Size: 72b • AD Result Size: 32b • SRAM B0 Result Size: 78b (3 Words) • Primary DB: • Lookup Size: 136b (5 32b words transferred across QDR intf) • Core Size: 144b • AD Result Size: 32b • SRAM B0 Result Size: 82b (3 Words) • Priority not included in SRAM B0 result because it is in AD result
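The 32-bit Associated Data word (Done, Hit, MHit, Priority, 21-bit SRAM B0 word address) can be sketched as a pack/unpack pair. The slide gives field widths but not bit positions, so the layout below (fields packed from the most significant bit in the order listed) is an assumption, as are the function names.

```python
ADDR_BITS = 21  # 20 bits needed for Bank 0, 1 spare
PRIO_BITS = 8


def pack_ad_result(done, hit, mhit, priority, word_addr):
    """Pack flags, priority and SRAM B0 word address into one 32-bit word.

    Assumed layout: [31] Done, [30] Hit, [29] MHit, [28:21] Priority,
    [20:0] word address.
    """
    if not 0 <= priority < (1 << PRIO_BITS):
        raise ValueError("priority must fit in 8 bits")
    if not 0 <= word_addr < (1 << ADDR_BITS):
        raise ValueError("word address must fit in 21 bits")
    return ((done & 1) << 31) | ((hit & 1) << 30) | ((mhit & 1) << 29) \
        | (priority << ADDR_BITS) | word_addr


def unpack_ad_result(word):
    """Return (done, hit, mhit, priority, word_addr)."""
    return ((word >> 31) & 1, (word >> 30) & 1, (word >> 29) & 1,
            (word >> ADDR_BITS) & 0xFF, word & ((1 << ADDR_BITS) - 1))


def byte_address(word_addr):
    """Restore the two trailing zero bits to get the Bank 0 byte address."""
    return word_addr << 2
```

For Route Lookup and Aux Filter entries the priority field would be passed as 0, per the slide.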
Lookups: Latency • Three searches in one MDL Indirect Operation • Latencies for operation • QDR xfer time: 6 clock cycles • 1 for MDL Indirect subinstruction • 5 for 144 bit key transferred across QDR Bus • Instruction Fifo: 2 clock cycles • Synchronizer: 3 clock cycles • Execution Latency: search dependent • Re-Synchronizer: 1 clock cycle • Total: 12 clock cycles
Lookups: Latency • 144 bit DB, 32 bits of AD (two of these) • Instruction Latency: 30 • Core blocking delay: 2 • Backend latency: 8 • 72 bit DB, 32 bits of AD • Instruction Latency: 30 • Core blocking delay: 2 • Backend latency: 8 • Latency of first search (144 bit DB): • 11 + 30 = 41 clock cycles • Latency of subsequent searches: • (previous search latency) – (backend latency of previous search) + (core block delay of previous search) + (backend latency of this search) • Latency of second 144 bit search: • 41 – 8 + 2 + 8 = 43 • Latency of third search (72 bit): • 43 – 8 + 2 + 8 = 45 clock cycles • 45 QDR Clock cycles (200 MHz clock) = 315 IXP Clock cycles (1400 MHz clock) • This is JUST for the TCAM operation, we also need to read the SRAM: • SRAM Read to retrieve TCAM Results Mailbox (3 words – one per search) • TWO SRAM Reads to then retrieve the full results (3 Words each) from SRAM Bank 0 • but we don’t have to wait for one to complete before issuing the second. • About 150 IXP cycles for an SRAM Read: 315 + 150 + 150 = 615 IXP Clock cycles • Let's estimate 650 IXP Clock cycles for issuing, performing and retrieving results for a lookup. (multi-word, two reads, …) • Does not include any lookup block processing
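The chained-search latency rule on this slide is easy to get wrong by hand, so here is a small calculator. It takes the slide's 41-cycle first-search latency as given; the function names and the tuple representation of a search are illustrative.

```python
QDR_MHZ = 200
IXP_MHZ = 1400
SRAM_READ_IXP_CYCLES = 150  # "about 150 IXP cycles for an SRAM Read"


def chained_latencies(first_latency, searches):
    """Latency (QDR cycles) at which each search in one MDL operation completes.

    searches: list of (core_blocking_delay, backend_latency), one per search.
    Each later search costs: previous latency - previous backend latency
    + previous core blocking delay + this search's backend latency.
    """
    latencies = [first_latency]
    for (prev_block, prev_backend), (_, backend) in zip(searches, searches[1:]):
        latencies.append(latencies[-1] - prev_backend + prev_block + backend)
    return latencies


def lookup_ixp_cycles(qdr_cycles, dependent_sram_reads=2):
    """TCAM time converted to IXP cycles plus the serialised SRAM reads."""
    return qdr_cycles * (IXP_MHZ // QDR_MHZ) + dependent_sram_reads * SRAM_READ_IXP_CYCLES
```

With two 144-bit searches followed by one 72-bit search, each with core blocking delay 2 and backend latency 8, `chained_latencies(41, [(2, 8)] * 3)` gives 41, 43 and 45 QDR cycles, and `lookup_ixp_cycles(45)` gives 615. Only two SRAM reads are counted because the two full-result reads overlap.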
Lookups: SRAM Bandwidth • Analysis is PER LA-1 QDR Interface • That is, each of NPUA and NPUB can do the following. • 16-bit QDR SRAM at 200 MHz • Separate read and write bus • Operations on rising and falling edge of each clock • 32 bits of read AND 32 bits of write per clock tick • QDR Write Bus: • 6 32-bit cycles per instruction • Cycle 0: • Write Address bus contains the TCAM Indirect Instruction • Write Data bus contains the TCAM Indirect MDL Sub-Instruction • Cycles 1-5 • Write Data bus contains the 5 words of the Lookup Key • Write Bus can support 200M/6 = 33.33 M searches/sec • QDR Read Bus: • Retrieval of Results Mailbox: • 3 32-bit cycles per instruction • Retrieval of two full results from QDR SRAM Bank 0: • 6 32-bit cycles per instruction • Total of 9 32-bit cycles per instruction • Read Bus can support 200M/9 = 22.22 M searches/sec • Conclusion: • Plenty of SRAM bandwidth to support TCAM operations AND SRAM Bank 0 accesses to perform all aspects of lookups at over 8.22 M searches/sec.
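The bus-cycle accounting above reduces to two divisions. The sketch below recomputes the write-bus and read-bus limits from the cycle counts on the slide; constant and function names are illustrative.

```python
QDR_HZ = 200e6        # one 32-bit read and one 32-bit write per clock
TARGET_RATE = 8.22e6  # lookups per second (one MDL operation per packet)

WRITE_CYCLES = 1 + 5     # instruction/sub-instruction + 5 key words
READ_CYCLES = 3 + 2 * 3  # results mailbox + two 3-word full results


def searches_per_second(bus_cycles_per_search, clock_hz=QDR_HZ):
    """Upper bound on lookups/s for a bus that needs this many cycles each."""
    return clock_hz / bus_cycles_per_search


def bottleneck_rate():
    """The lower of the write-bus and read-bus limits."""
    return min(searches_per_second(WRITE_CYCLES), searches_per_second(READ_CYCLES))
```

The read bus is the tighter limit at about 22.22 M lookups/s, still well above the 8.22 M target.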
Block Interfaces • The next set of slides show the block interfaces • These slides are still very much a work in progress
ONL NP Router: block interface data format • Word 0: Buf Handle (32b) • Word 1: Eth. Frame Len (16b), Reserved (12b), InPort (4b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: Rsv (4b), Out Port (4b), Buffer Handle (24b) • Word 1: In Plugin (4b), In Port (4b), Flags (8b), QID (16b) • Word 2: Stats Index (16b), Frame Length (16b) • Flags: Source (3b): Rx/XScale/Plugin; PassThrough(1)/Classify(0) (1b); Reserved (4b) • Flag bits (7 down to 0): Reserved (4b), PT (1b), Pl (1b), X (1b), Rx (1b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: Reserved (8b), Buffer Handle (24b) • Word 1: Rsv (4b), Out Port (4b), QID (16b), Rsv (8b) • Word 2: Frame Length (16b), Stats Index (16b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b) • Word 1: Ethernet DA[47-16] (32b) • Word 2: Ethernet DA[15-0] (16b), Ethernet SA[47-32] (16b) • Word 3: Ethernet SA[31-0] (32b) • Word 4: Ethernet Type (16b), Reserved (16b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: Reserved (8b), Buffer Handle (24b) • Word 1: In Port (4b), Rsv (8b), In Plugin (4b), QID (16b) • Word 2: Stats Index (16b), Frame Length (16b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: Rsv (4b), Out Port (4b), Buffer Handle (24b) • Word 1: In Plugin (4b), In Port (4b), Flags (8b), QID (16b) • Word 2: Stats Index (16b), Frame Length (16b) • Flags: PassThrough/Classify (1b); Reserved (7b) [Figure: ONL NP Router block diagram]
ONL NP Router: block interface data format • Word 0: Rsv (8b), Out Port (4b), Buffer Handle (24b) • Word 1: In Plugin (4b), In Port (4b), Flags (8b), QID (16b) • Word 2: Stats Index (16b), Frame Length (16b) • Flags: Why pkt is being sent to XScale [Figure: ONL NP Router block diagram]