Utilizing NIC’s enhancements: a look at how driver software needs to change when using newer features of our hardware
‘theory’ versus ‘practice’
• The engineering designs one encounters in computer hardware components tend to ‘evolve’ over successive iterations: they start as a scheme that embodies simplicity, purity, and symmetry, based upon what the designers think will be the device’s likely uses, and end up as a conglomeration of disparate ‘add-ons’ as actual practice dictates accommodations

‘backward compatibility’
• A historically important consideration in the marketing of computer hardware has been the need to maintain past functions in a ‘transparent’ manner (i.e., no change is needed to run older software on newer equipment), while offering enhancements as ‘options’ that can be selectively enabled

Example: Intel’s x86
• The current generation of Intel CPUs will still execute all of the software written for PCs a quarter-century ago (based on a small set of 16-bit registers, a restricted set of instructions, and a one-megabyte memory-space), but is able, as an option, to use more and larger registers (64 bits), richer instruction-sets, and more memory

Gigabit NICs
• Intel’s network controller designs exhibit this same kind of ‘evolution’ over time
• The ‘Legacy’ descriptor-formats are just one example of keeping prior-generation functionality: the format is simple, and it is ‘pure’ (i.e., not tied to any specific network-protocols, emphasizing ‘mechanism’ rather than ‘policy’)
• But now alternatives exist, as options!
‘Legacy’ RX-Descriptors
Descriptor layout:
    Base-address (64-bits)
    Packet-length | Packet-checksum | status | errors | VLAN tag
• The device-driver initializes the ‘base-address’ field with the physical address of a packet-buffer, and the network hardware does not ever modify it
• The network controller later will ‘write-back’ values into all the other fields when it has finished transferring a received packet’s data into that packet-buffer
RxDesc Status-field
    bit:    7     6      5      4     3     2     1    0
    name:  PIF  IPCS  TCPCS  UDPCS   VP   IXSM   EOP   DD
• DD = Descriptor Done (1=yes, 0=no) shows if the NIC is finished with the descriptor
• EOP = End Of Packet (1=yes, 0=no) shows if this descriptor is logically the packet’s last
• IXSM = Ignore Checksum Indications (1=yes, 0=no)
• VP = VLAN Packet match (1=yes, 0=no)
• UDPCS = UDP Checksum calculated on packet (1=yes, 0=no)
• TCPCS = TCP Checksum calculated on packet (1=yes, 0=no)
• IPCS = IPv4 Checksum calculated on packet (1=yes, 0=no)
• PIF = Passed In-exact Filter (1=yes, 0=no) shows if software must check
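The status bits listed above are usually tested with bit-masks. Here is a minimal sketch of how that might look; the macro names and the two helper functions are our own illustrative choices (they do not come from the driver source), and only the bit positions are taken from the layout above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Masks for the Legacy Rx-Descriptor status byte.
   The names are hypothetical; the bit positions follow the slide. */
#define RXD_STAT_DD     (1u << 0)   /* Descriptor Done              */
#define RXD_STAT_EOP    (1u << 1)   /* End Of Packet                */
#define RXD_STAT_IXSM   (1u << 2)   /* Ignore Checksum Indications  */
#define RXD_STAT_VP     (1u << 3)   /* VLAN Packet match            */
#define RXD_STAT_UDPCS  (1u << 4)   /* UDP Checksum calculated      */
#define RXD_STAT_TCPCS  (1u << 5)   /* TCP Checksum calculated      */
#define RXD_STAT_IPCS   (1u << 6)   /* IPv4 Checksum calculated     */
#define RXD_STAT_PIF    (1u << 7)   /* Passed In-exact Filter       */

/* True when the NIC is done with the descriptor AND it ends a packet. */
static inline bool rxd_packet_complete(uint8_t status)
{
    uint8_t need = RXD_STAT_DD | RXD_STAT_EOP;
    return (status & need) == need;
}

/* True when the descriptor is done and the checksum bits may be consulted. */
static inline bool rxd_checksum_info_valid(uint8_t status)
{
    return (status & RXD_STAT_DD) && !(status & RXD_STAT_IXSM);
}
```

A receive routine could call rxd_packet_complete() on each descriptor’s status byte before touching the packet-buffer, and consult the checksum bits only when rxd_checksum_info_valid() returns true.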
RxDesc Errors-field
    bit:    7     6     5        4             3          2    1    0
    name:  RXE   IPE  TCPE  reserved (=0)  reserved (=0)  SEQ   SE   CE
• CE = CRC Error or Alignment Error (check statistics registers to differentiate)
• TCPE = TCP/UDP Checksum Error
• IPE = IPv4 Checksum Error
These bits are relevant only while the NIC is operating in ‘SerDes’ mode:
• SE = Symbol Error
• SEQ = Sequence Error
• RXE = Rx Data Error
‘Extended’ RX-Descriptors
CPU writes this, NIC reads it:
    Base-address (64-bits)
    reserved (=0)
NIC writes this, CPU reads it (fields listed from lowest byte offset upward):
    MRQ (multiple receive queues) | IP identification | Packet-checksum
    Extended status | Extended errors | Packet-length | VLAN tag
• The device-driver initializes the ‘base-address’ field with the physical address of a packet-buffer, and it initializes the ‘reserved’ field with a zero-value; the network hardware will later modify both fields
• The network controller will ‘write-back’ the values for these fields when it has transferred a received packet’s data into the packet-buffer

An alternative option
CPU writes this, NIC reads it:
    Base-address (64-bits)
    reserved (=0)
NIC writes this, CPU reads it (fields listed from lowest byte offset upward):
    MRQ (multiple receive queues) | RSS Hash (Receive Side Scaling)
    Extended status | Extended errors | Packet-length | VLAN tag
• ‘Receive Side Scaling’ refers to an optional capability in the network controller to assist with routing of network packets to various CPUs within a modern multiprocessor system (see Section 3.2.13 in Intel’s Software Developer’s Manual)
Extended Rx-Status (20-bits)
    bits 19..16:  0
    bit  15:      ACK
    bits 14..11:  0
    bit  10:      UDPV
    bit   9:      IPIV
    bit   8:      0
    bits  7..0:   PIF  IPCS  TCPCS  UDPCS  VP  IXSM  EOP  DD
• The low eight bits have the same meanings as in a ‘Legacy’ Rx-Status byte:
    DD = Descriptor Done
    EOP = End Of Packet
    IXSM = Ignore Checksum Indications
    VP = VLAN Packet match
    UDPCS = UDP Checksum calculated
    TCPCS = TCP Checksum calculated
    IPCS = IPv4 Checksum calculated
    PIF = Passed In-exact Filter
• The ‘extra’ status-bits provide additional hardware support to driver software for processing ethernet packets that conform to standard TCP/IP network protocols (with possibilities for future expansion):
    ACK = TCP ACK-Packet identification
    UDPV = Valid UDP checksum
    IPIV = Valid IP Identification

Extended Rx-Errors (12 bits)
    bit  11:      RXE
    bit  10:      IPE
    bit   9:      TCPE
    bits  8..7:   0
    bit   6:      SEQ
    bit   5:      SE
    bit   4:      CE
    bits  3..0:   0
• The upper eight bits (11..4) have the same meanings, and they occupy the same arrangement, as in the ‘Legacy’ Rx-Errors byte
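The two extended fields can also be pulled apart with shifts and masks instead of compiler-dependent bitfields. The sketch below assumes that the 20-bit status occupies bits 0..19 and the 12-bit errors occupy bits 20..31 of a single 32-bit word, as the comments in the ‘bitfield.c’ demo indicate; all macro and function names here are hypothetical.

```c
#include <stdint.h>

/* Assumed packing: status in bits 0..19, errors in bits 20..31. */
#define RXD_EXT_STATUS_MASK   0x000FFFFFu
#define RXD_EXT_ERRORS_SHIFT  20

/* Extra status bits that the Legacy status byte does not have. */
#define RXD_EXT_STAT_IPIV  (1u << 9)    /* Valid IP Identification   */
#define RXD_EXT_STAT_UDPV  (1u << 10)   /* Valid UDP checksum        */
#define RXD_EXT_STAT_ACK   (1u << 15)   /* TCP ACK-Packet identified */

/* Error bits, numbered within the 12-bit errors field. */
#define RXD_EXT_ERR_CE     (1u << 4)
#define RXD_EXT_ERR_SE     (1u << 5)
#define RXD_EXT_ERR_SEQ    (1u << 6)
#define RXD_EXT_ERR_TCPE   (1u << 9)
#define RXD_EXT_ERR_IPE    (1u << 10)
#define RXD_EXT_ERR_RXE    (1u << 11)

/* Extract the 20-bit extended status from the combined word. */
static inline uint32_t rxd_ext_status(uint32_t word)
{
    return word & RXD_EXT_STATUS_MASK;
}

/* Extract the 12-bit extended errors from the combined word. */
static inline uint32_t rxd_ext_errors(uint32_t word)
{
    return word >> RXD_EXT_ERRORS_SHIFT;
}
```

Written this way, the decoding does not depend on how a particular compiler chooses to lay out bitfields.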
Main device-driver changes
• If we want to utilize the NIC’s ‘Extended’ Receive Descriptor format, we will need several significant changes in our driver source-code and data-types:
    • Our module’s initialization of ‘base_address’ fields
    • Our new need for programming register RFCTL
    • Our ‘typedef’ for the ‘RX_DESCRIPTOR’ structs
    • Our ‘get_info_rx()’ function for ‘/proc/nicrx’ display
    • Our interrupt-handler’s treatment of ‘rxring’ entries

Use of C language ‘union’
• Each Receive-Descriptor now has a ‘dual’ identity, as far as the NIC is concerned:
    • one layout during its ‘fetch’ from memory
    • another layout during ‘write-back’ to memory
• The C language provides a special ‘type’ construction for accommodating this kind of programming situation; it’s known as a union, and it requires a special syntax

‘Bitfields’ in C
• Some of the fields in the ‘Extended’ RX Descriptor do not align with the CPU’s natural 8-bit, 16-bit and 32-bit data-sizes: Extended status is 20 bits wide and Extended errors is 12 bits wide
• The C language provides ‘bitfields’ for a situation like this (though their layout in memory is not fully ‘standardized’; it is left to each compiler)
Syntax for Rx-Descriptors

typedef struct {
    unsigned long long  base_address;
    unsigned long long  reserved;
} RX_DESC_FETCH;

typedef struct {
    unsigned int    mrq;
    unsigned short  ip_identification;
    unsigned short  packet_chksum;
    unsigned int    desc_status:20;
    unsigned int    desc_errors:12;
    unsigned short  packet_length;
    unsigned short  vlan_tag;
} RX_DESC_STORE;

typedef union {
    RX_DESC_FETCH   rxf;
    RX_DESC_STORE   rxs;
} RX_DESCRIPTOR;
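Since both views must overlay the same 16 bytes, it can be worth letting the compiler verify the sizes. The following sketch repeats the type definitions above and adds compile-time checks plus a small helper, ‘rx_desc_prepare()’, which is a hypothetical name of ours and not part of the driver. The checks reflect what we would expect from gcc on x86; another compiler could pack the bitfields differently.

```c
#include <stddef.h>

typedef struct {
    unsigned long long  base_address;
    unsigned long long  reserved;
} RX_DESC_FETCH;

typedef struct {
    unsigned int    mrq;
    unsigned short  ip_identification;
    unsigned short  packet_chksum;
    unsigned int    desc_status:20;
    unsigned int    desc_errors:12;
    unsigned short  packet_length;
    unsigned short  vlan_tag;
} RX_DESC_STORE;

typedef union {
    RX_DESC_FETCH   rxf;    /* layout the NIC fetches     */
    RX_DESC_STORE   rxs;    /* layout the NIC writes back */
} RX_DESCRIPTOR;

/* Both layouts must describe exactly the same 16 bytes. */
_Static_assert(sizeof(RX_DESC_FETCH) == 16, "fetch layout must be 16 bytes");
_Static_assert(sizeof(RX_DESC_STORE) == 16, "store layout must be 16 bytes");
_Static_assert(sizeof(RX_DESCRIPTOR) == 16, "descriptor must be 16 bytes");
_Static_assert(offsetof(RX_DESC_STORE, packet_length) == 12,
               "status/errors must fill bytes 8..11");

/* Hypothetical helper: ready a descriptor for the NIC to fetch.
   Zeroing 'reserved' also clears the bytes where the status will land. */
static inline void rx_desc_prepare(RX_DESCRIPTOR *d, unsigned long long phys)
{
    d->rxf.base_address = phys;
    d->rxf.reserved     = 0ULL;
}
```

Because ‘reserved’ overlays the write-back status, a freshly prepared descriptor reads back through the ‘rxs’ view with its Descriptor Done bit clear.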
RFCTL (0x5008): The Receive Filter Control register
    bits 31..16:  reserved (=0)
    bit  15:      EXTEN
    bit  14:      IPFRSP_DIS
    bit  13:      ACKD_DIS
    bit  12:      ACKDIS
    bit  11:      IPv6XSUM_DIS
    bit  10:      IPv6_DIS
    bits  9..8:   NFS_VER
    bit   7:      NFSR_DIS
    bit   6:      NFSW_DIS
    bits  5..1:   iSCSI_DWC
    bit   0:      iSCSI_DIS
• EXTEN (bit 15) = Extended Status Enable (1=yes, 0=no)
• This enables the NIC to write-back the ‘Extended Status’
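Turning on the extended write-back amounts to setting bit 15 of RFCTL while leaving the other bits alone. A minimal sketch follows; the names ‘E1000_RFCTL’, ‘RFCTL_EXTEN’ and ‘rfctl_with_exten()’ are assumptions of ours, chosen to resemble the ‘E1000_ICR’ and ‘E1000_RDT’ names the driver already uses.

```c
#include <stdint.h>

#define E1000_RFCTL   0x5008        /* register offset (hypothetical name)  */
#define RFCTL_EXTEN   (1u << 15)    /* Extended Status Enable               */

/* Return the given RFCTL value with EXTEN set, other bits preserved. */
static inline uint32_t rfctl_with_exten(uint32_t rfctl)
{
    return rfctl | RFCTL_EXTEN;
}

/* Inside a Linux driver, where 'io' is the mapped register base,
   a read-modify-write along these lines would be one way to apply it:

       iowrite32( rfctl_with_exten( ioread32( io + E1000_RFCTL ) ),
                  io + E1000_RFCTL );
*/
```

Keeping the bit manipulation in a pure function makes it easy to check separately from the register access itself.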
Modifying ‘my_read()’
• To implement use of ‘Extended’ Receive Descriptors in our most recent character-mode device-driver (i.e., ‘zerocopy.c’), we need some changes in the ‘read()’ method
• Most obvious example: a packet-buffer’s memory address can no longer be obtained from an Rx-Descriptor’s ‘base_address’ (which now gets ‘overwritten’ by the NIC)
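One way around the overwritten ‘base_address’ is to recompute a buffer’s location from its descriptor index. The sketch below assumes the packet-buffers sit directly after the descriptor ring in one contiguous allocation, which mirrors the address arithmetic in the interrupt handler shown later; the values given for ‘N_RX_DESC’ and ‘RX_BUFSIZ’ and both helper names are placeholders of ours.

```c
#include <stddef.h>

#define N_RX_DESC     8       /* assumed ring size, for illustration   */
#define RX_BUFSIZ     2048    /* assumed buffer size, for illustration */
#define RX_DESC_SIZE  16      /* bytes per Rx-Descriptor               */

/* Byte offset of packet-buffer 'index' from the start of the ring. */
static inline size_t rx_buffer_offset(unsigned int index)
{
    return (size_t)RX_DESC_SIZE * N_RX_DESC + (size_t)index * RX_BUFSIZ;
}

/* Virtual address of packet-buffer 'index', given the ring's base. */
static inline unsigned char *rx_buffer_virt(void *ring_base, unsigned int index)
{
    return (unsigned char *)ring_base + rx_buffer_offset(index);
}
```

With this in place the ‘read()’ method never needs to consult the descriptor for an address, so it does not matter that the NIC has replaced that field with write-back data.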
For our pseudo-file’s sake…
• Also, our driver’s ‘read()’ function shouldn’t prepare the current rx-descriptor for reuse, as it did in earlier drivers, since that would destroy all of the useful information which the NIC has just written into that descriptor
• Instead, the preparation of a descriptor for reuse in a future packet-receive operation should be deferred, at least temporarily

OK, but then when?
• We can reassign the duty to ‘refresh’ some Rx-Descriptors for reuse to our driver’s Interrupt Service Routine; specifically, at the point in time when an ‘RXDMT0’ event (Rx-Descriptor Min-Threshold) is signaled
• It might be best to create a ‘bottom half’ to take care of those re-initializations, but we haven’t yet done that in our new prototype
Handling ‘RXDMT0’ interrupts

irqreturn_t my_isr( int irq, void *dev_id )
{
    int     intr_cause = ioread32( io + E1000_ICR );

    if ( intr_cause & (1<<4) )      // Rx-Descriptors Low (RXDMT0)
    {
        // the packet-buffers follow the descriptor ring in memory
        // (this arithmetic keeps physical addresses in 32 bits)
        unsigned int    rx_buf = virt_to_phys( rxring ) + 16 * N_RX_DESC;
        unsigned int    rxtail = ioread32( io + E1000_RDT ), i, ba;

        // prepare the next eight Rx-Descriptors for reuse by the NIC
        for (i = 0; i < 8; i++)
        {
            ba = rx_buf + rxtail * RX_BUFSIZ;
            rxring[ rxtail ].rxf.base_address = ba;     // 'fetch' view of the union
            rxring[ rxtail ].rxf.reserved = 0LL;
            rxtail = (1 + rxtail) % N_RX_DESC;
        }

        // now give the NIC ownership of these reinitialized descriptors
        iowrite32( rxtail, io + E1000_RDT );
    }

    return IRQ_HANDLED;
}
‘extended.c’
• Here’s our revision of ‘zerocopy.c’, aimed at showing how we can incorporate use of the NIC’s ‘Extended’ Receive Descriptors
• It appears to function exactly as before, until a user attempts to view the driver’s Receive-Descriptor queue:
    $ cat /proc/nicrx
• Then we are shown descriptors having two distinct formats (i.e., FETCH and STORE)
Demo: ‘bitfield.c’
• Because the manner in which ‘bitfields’ are handled in the C language varies with the particular C-compiler being used, we have created a short demo-program that shows us how our GNU C-compiler ‘gcc’ handles the layout of bitfields within a C data-item

typedef struct {
    unsigned int    desc_status:20;     // bits 0..19
    unsigned int    desc_errors:12;     // bits 20..31
} RXD_ELT;
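A check of this kind can be quite short. What follows is our own minimal sketch and not the actual ‘bitfield.c’: it fills one bitfield at a time and copies the struct’s bytes into an ordinary 32-bit integer so that the resulting bit positions can be inspected. The helper name ‘rxd_raw()’ is hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned int    desc_status:20;     /* expected in bits 0..19  */
    unsigned int    desc_errors:12;     /* expected in bits 20..31 */
} RXD_ELT;

/* Copy the struct's storage into a plain 32-bit integer so that the
   compiler's placement of each bitfield becomes visible. */
static inline uint32_t rxd_raw(RXD_ELT e)
{
    uint32_t raw = 0;
    size_t   n = sizeof e < sizeof raw ? sizeof e : sizeof raw;

    memcpy(&raw, &e, n);
    return raw;
}
```

With gcc on a little-endian x86 machine we would expect a struct whose ‘desc_status’ is all ones (and ‘desc_errors’ zero) to produce 0x000FFFFF, and the reverse case to produce 0xFFF00000, in agreement with the comments in the typedef. A different compiler or CPU architecture is free to give another answer, which is the reason for running such a check.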