From 49ca37d0d825d175bc6ae624f584a7502e848ebc Mon Sep 17 00:00:00 2001 From: Magnus Karlsson Date: Wed, 19 Jul 2023 15:24:14 +0200 Subject: [PATCH] xsk: add multi-buffer documentation Add AF_XDP multi-buffer support documentation including two pseudo-code samples. Signed-off-by: Magnus Karlsson Link: https://lore.kernel.org/r/20230719132421.584801-18-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov --- Documentation/networking/af_xdp.rst | 211 +++++++++++++++++++++++++++++++++++- 1 file changed, 210 insertions(+), 1 deletion(-) diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 1cc35de..dceeb0d 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt Gets options from an XDP socket. The only one supported so far is XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. +Multi-Buffer Support +==================== + +With multi-buffer support, programs using AF_XDP sockets can receive +and transmit packets consisting of multiple buffers both in copy and +zero-copy mode. For example, a packet can consist of two +frames/buffers, one with the header and the other one with the data, +or a 9K Ethernet jumbo frame can be constructed by chaining together +three 4K frames. + +Some definitions: + +* A packet consists of one or more frames + +* A descriptor in one of the AF_XDP rings always refers to a single + frame. In the case the packet consists of a single frame, the + descriptor refers to the whole packet. + +To enable multi-buffer support for an AF_XDP socket, use the new bind +flag XDP_USE_SG. If this is not provided, all multi-buffer packets +will be dropped just as before. Note that the XDP program loaded also +needs to be in multi-buffer mode. This can be accomplished by using +"xdp.frags" as the section name of the XDP program used. + +To represent a packet consisting of multiple frames, a new flag called +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx +descriptors. If it is true (1) the packet continues with the next +descriptor and if it is false (0) it means this is the last descriptor +of the packet. Why the reverse logic of end-of-packet (eop) flag found +in many NICs? Just to preserve compatibility with non-multi-buffer +applications that have this bit set to false for all packets on Rx, +and the apps set the options field to zero for Tx, as anything else +will be treated as an invalid descriptor. + +These are the semantics for producing packets onto AF_XDP Tx ring +consisting of multiple frames: + +* When an invalid descriptor is found, all the other + descriptors/frames of this packet are marked as invalid and not + completed. The next descriptor is treated as the start of a new + packet, even if this was not the intent (because we cannot guess + the intent). As before, if your program is producing invalid + descriptors you have a bug that must be fixed. + +* Zero length descriptors are treated as invalid descriptors. + +* For copy mode, the maximum supported number of frames in a packet is + equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all + descriptors accumulated so far are dropped and treated as + invalid. To produce an application that will work on any system + regardless of this config setting, limit the number of frags to 18, + as the minimum value of the config is 17. + +* For zero-copy mode, the limit is up to what the NIC HW + supports. Usually at least five on the NICs we have checked. We + consciously chose to not enforce a rigid limit (such as + CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have + resulted in copy actions under the hood to fit into what limit the + NIC supports. Kind of defeats the purpose of zero-copy mode. How to + probe for this limit is explained in the "probe for multi-buffer + support" section. + +On the Rx path in copy-mode, the xsk core copies the XDP data into +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as +detailed before. Zero-copy mode works the same, though the data is not +copied. When the application gets a descriptor with the XDP_PKT_CONTD +flag set to one, it means that the packet consists of multiple buffers +and it continues with the next buffer in the following +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it +means that this is the last buffer of the packet. AF_XDP guarantees +that only a complete packet (all frames in the packet) is sent to the +application. If there is not enough space in the AF_XDP Rx ring, all +frames of the packet will be dropped. + +If application reads a batch of descriptors, using for example the libxdp +interfaces, it is not guaranteed that the batch will end with a full +packet. It might end in the middle of a packet and the rest of the +buffers of that packet will arrive at the beginning of the next batch, +since the libxdp interface does not read the whole ring (unless you +have an enormous batch size or a very small ring size). + +An example program each for Rx and Tx multi-buffer support can be found +later in this document. + Usage -===== +----- In order to use AF_XDP sockets two parts are needed. The user-space application and the XDP program. For a complete setup and @@ -541,6 +625,131 @@ like this: But please use the libbpf functions as they are optimized and ready to use. Will make your life easier. +Usage Multi-Buffer Rx +--------------------- + +Here is a simple Rx path pseudo-code example (using libxdp interfaces +for simplicity). Error paths have been excluded to keep it short: + +.. code-block:: c + + void rx_packets(struct xsk_socket_info *xsk) + { + static bool new_packet = true; + u32 idx_rx = 0, idx_fq = 0; + static char *pkt; + + int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx); + + xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); + + for (int i = 0; i < rcvd; i++) { + struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); + char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr); + bool eop = !(desc->options & XDP_PKT_CONTD); + + if (new_packet) + pkt = frag; + else + add_frag_to_pkt(pkt, frag); + + if (eop) + process_pkt(pkt); + + new_packet = eop; + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); + xsk_ring_cons__release(&xsk->rx, rcvd); + } + +Usage Multi-Buffer Tx +--------------------- + +Here is an example Tx path pseudo-code (using libxdp interfaces for +simplicity) ignoring that the umem is finite in size, and that we +eventually will run out of packets to send. Also assumes pkts.addr +points to a valid location in the umem. + +.. code-block:: c + + void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts, + int batch_size) + { + u32 idx, i, pkt_nb = 0; + + xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx); + + for (i = 0; i < batch_size;) { + u64 addr = pkts[pkt_nb].addr; + u32 len = pkts[pkt_nb].size; + + do { + struct xdp_desc *tx_desc; + + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++); + tx_desc->addr = addr; + + if (len > xsk_frame_size) { + tx_desc->len = xsk_frame_size; + tx_desc->options = XDP_PKT_CONTD; + } else { + tx_desc->len = len; + tx_desc->options = 0; + pkt_nb++; + } + len -= tx_desc->len; + addr += xsk_frame_size; + + if (i == batch_size) { + /* Remember len, addr, pkt_nb for next iteration. + * Skipped for simplicity. + */ + break; + } + } while (len); + } + + xsk_ring_prod__submit(&xsk->tx, i); + } + +Probing for Multi-Buffer Support +-------------------------------- + +To discover if a driver supports multi-buffer AF_XDP in SKB or DRV +mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to +query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for +querying for XDP multi-buffer support. If XDP supports multi-buffer in +a driver, then AF_XDP will also support that in SKB and DRV mode. + +To discover if a driver supports multi-buffer AF_XDP in zero-copy +mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY +flag. If it is set, it means that at least zero-copy is supported and +you should go and check the netlink attribute +NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer +value will be returned stating the max number of frags that are +supported by this device in zero-copy mode. These are the possible +return values: + +1: Multi-buffer for zero-copy is not supported by this device, as max + one fragment supported means that multi-buffer is not possible. + +>=2: Multi-buffer is supported in zero-copy mode for this device. The + returned number signifies the max number of frags supported. + +For an example on how these are used through libbpf, please take a +look at tools/testing/selftests/bpf/xskxceiver.c. + +Multi-Buffer Support for Zero-Copy Drivers +------------------------------------------ + +Zero-copy drivers usually use the batched APIs for Rx and Tx +processing. Note that the Tx batch API guarantees that it will provide +a batch of Tx descriptors that ends with full packet at the end. This +to facilitate extending a zero-copy driver with multi-buffer support. + Sample application ================== -- 2.7.4