Martin KaFai Lau [Wed, 2 Mar 2022 19:55:57 +0000 (11:55 -0800)]
net: ip: Handle delivery_time in ip defrag
A latter patch will postpone the delivery_time clearing until the stack
knows the skb is being delivered locally. That will allow other kernel
forwarding path (e.g. ip[6]_forward) to keep the delivery_time also.
An earlier attempt was to do skb_clear_delivery_time() in
ip_local_deliver() and ip6_input(). The discussion [0] requested
to move it one step later into ip_local_deliver_finish()
and ip6_input_finish() so that the delivery_time can be kept
for the ip_vs forwarding path also.
To do that, this patch also needs to take care of the (rcv) timestamp
usecase in ip_is_fragment(). It needs to expect delivery_time in
the skb->tstamp, so it needs to save the mono_delivery_time bit in
inet_frag_queue such that the delivery_time (if any) can be restored
in the final defragmented skb.
[Note that it will only happen when the locally generated skb is looping
from egress to ingress over a virtual interface (e.g. veth, loopback...),
skb->tstamp may have the delivery time before it is known that it will
be delivered locally and received by another sk.]
[0]: https://lore.kernel.org/netdev/
ca728d81-80e8-3767-d5e-
d44f6ad96e43@ssi.bg/
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 2 Mar 2022 19:55:50 +0000 (11:55 -0800)]
net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
The previous patches handled the delivery_time before sch_handle_ingress().
This patch can now set the skb->mono_delivery_time to flag the skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
and also clear it with skb_clear_delivery_time() after
sch_handle_ingress(). This will make the bpf_redirect_*()
to keep the mono delivery_time and used by a qdisc (fq) of
the egress-ing interface.
A latter patch will postpone the skb_clear_delivery_time() until the
stack learns that the skb is being delivered locally and that will
make other kernel forwarding paths (ip[6]_forward) able to keep
the delivery_time also. Thus, like the previous patches on using
the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
is not limited within the CONFIG_NET_INGRESS to avoid too many code
churns among this set.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 2 Mar 2022 19:55:44 +0000 (11:55 -0800)]
net: Clear mono_delivery_time bit in __skb_tstamp_tx()
In __skb_tstamp_tx(), it may clone the egress skb and queues the clone to
the sk_error_queue. The outgoing skb may have the mono delivery_time
while the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit needs to be cleared from the clone.
This patch adds the skb->mono_delivery_time clearing to the existing
__net_timestamp() and use it in __skb_tstamp_tx().
The __net_timestamp() fast path usage in dev.c is changed to directly
call ktime_get_real() since the mono_delivery_time bit is not set at
that point.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 2 Mar 2022 19:55:38 +0000 (11:55 -0800)]
net: Handle delivery_time in skb->tstamp during network tapping with af_packet
A latter patch will set the skb->mono_delivery_time to flag the skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
skb_clear_tstamp() will then keep this delivery_time during forwarding.
This patch is to make the network tapping (with af_packet) to handle
the delivery_time stored in skb->tstamp.
Regardless of tapping at the ingress or egress, the tapped skb is
received by the af_packet socket, so it is ingress to the af_packet
socket and it expects the (rcv) timestamp.
When tapping at egress, dev_queue_xmit_nit() is used. It has already
expected skb->tstamp may have delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing to the af_packet sk.
This patch only adds to clear the skb->mono_delivery_time
bit in net_timestamp_set().
When tapping at ingress, it currently expects the skb->tstamp is either 0
or the (rcv) timestamp. Meaning, the tapping at ingress path
has already expected the skb->tstamp could be 0 and it will get
the (rcv) timestamp by ktime_get_real() when needed.
There are two cases for tapping at ingress:
One case is af_packet queues the skb to its sk_receive_queue.
The skb is either not shared or new clone created. The newly
added skb_clear_delivery_time() is called to clear the
delivery_time (if any) and set the (rcv) timestamp if
needed before the skb is queued to the sk_receive_queue.
Another case, the ingress skb is directly copied to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
The newly added skb_tstamp() is used in tpacket_get_timestamp()
to check the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, the tapping@ingress has already expected
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and has handled this case by directly calling ktime_get_real().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 2 Mar 2022 19:55:31 +0000 (11:55 -0800)]
net: Add skb_clear_tstamp() to keep the mono delivery_time
Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
If skb->tstamp has the mono delivery_time, clearing it can hurt
the performance when it finally transmits out to fq@phy-dev.
The earlier patch added a skb->mono_delivery_time bit to
flag the skb->tstamp carrying the mono delivery_time.
This patch adds skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.
The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally. It will be done in a latter patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 2 Mar 2022 19:55:25 +0000 (11:55 -0800)]
net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).
Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.
Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.
While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev. For example, when forwarding from egress to
ingress and then finally back to egress:
tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
^ ^
reset rest
This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq. A latter patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.
In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
[ This patch is a prep work. The following patches will
get the other parts of the stack ready first. Then another patch
after that will finally set the skb->mono_delivery_time. ]
skb_set_delivery_time() function is added. It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.
A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place
holder. It will be enabled in a latter patch.
A similar case in tcp_ipv6 can be done with
skb_set_delivery_time() in tcp_v6_send_response().
[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 14:15:31 +0000 (14:15 +0000)]
Merge branch 'dsa-unicast-filtering'
Vladimir Oltean says:
====================
DSA unicast filtering
This series doesn't attempt anything extremely brave, it just changes
the way in which standalone ports which support FDB isolation work.
Up until now, DSA has recommended that switch drivers configure
standalone ports in a separate VID/FID with learning disabled, and with
the CPU port as the only destination, reached trivially via flooding.
That works, except that standalone ports will deliver all packets to the
CPU. We can leverage the hardware FDB as a MAC DA filter, and disable
flooding towards the CPU port, to force the dropping of packets with
unknown MAC DA.
We handle port promiscuity by re-enabling flooding towards the CPU port.
This is relevant because the bridge puts its automatic (learning +
flooding) ports in promiscuous mode, and this makes some things work
automagically, like for example bridging with a foreign interface.
We don't delve yet into the territory of managing CPU flooding more
aggressively while under a bridge.
The only switch driver that benefits from this work right now is the
NXP LS1028A switch (felix). The others need to implement FDB isolation
first, before DSA is going to install entries to the port's standalone
database. Otherwise, these entries might collide with bridge FDB/MDB
entries.
This work was done mainly to have all the required features in place
before somebody starts seriously architecting DSA support for multiple
CPU ports. Otherwise it is much more difficult to bolt these features on
top of multiple CPU ports.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:17 +0000 (21:14 +0200)]
net: mscc: ocelot: accept configuring bridge port flags on the NPI port
In order for the Felix DSA driver to be able to turn on/off flooding
towards its CPU port, we need to redirect calls on the NPI port to
actually act upon the index in the analyzer block that corresponds to
the CPU port module. This was never necessary until now because DSA
(or the bridge) never called ocelot_port_bridge_flags() for the NPI
port.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:16 +0000 (21:14 +0200)]
net: dsa: felix: stop clearing CPU flooding in felix_setup_tag_8021q
felix_migrate_flood_to_tag_8021q_port() takes care of clearing the
flooding bits on the old CPU port (which was the CPU port module), so
manually clearing this bit from PGID_UC, PGID_MC, PGID_BC is redundant.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:15 +0000 (21:14 +0200)]
net: dsa: felix: start off with flooding disabled on the CPU port
The driver probes with all ports as standalone, and it supports unicast
filtering. So DSA will call port_fdb_add() for all necessary addresses
on the current CPU port. We also handle migrations when the CPU port
hardware resource changes (on tagging protocol change), so there should
not be any unknown address that we have to receive while not promiscuous.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:14 +0000 (21:14 +0200)]
net: dsa: felix: migrate flood settings from NPI to tag_8021q CPU port
When the tagging protocol changes from "ocelot" to "ocelot-8021q" or in
reverse, the DSA promiscuity setting that was applied for the old CPU
port must be transferred to the new one.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:13 +0000 (21:14 +0200)]
net: dsa: felix: migrate host FDB and MDB entries when changing tag proto
The "ocelot" and "ocelot-8021q" tagging protocols make use of different
hardware resources, and host FDB entries have different destination
ports in the switch analyzer module, practically speaking.
So when the user requests a tagging protocol change, the driver must
migrate all host FDB and MDB entries from the NPI port (in fact CPU port
module) towards the same physical port, but this time used as a regular
port.
It is pointless for the felix driver to keep a copy of the host
addresses, when we can create and export DSA helpers for walking through
the addresses that it already needs to keep on the CPU port, for
refcounting purposes.
felix_classify_db() is moved up to avoid a forward declaration.
We pass "bool change" because dp->fdbs and dp->mdbs are uninitialized
lists when felix_setup() first calls felix_set_tag_protocol(), so we
need to avoid calling dsa_port_walk_fdbs() during probe time.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:12 +0000 (21:14 +0200)]
net: dsa: manage flooding on the CPU ports
DSA can treat IFF_PROMISC and IFF_ALLMULTI on standalone user ports as
signifying whether packets with an unknown MAC DA will be received or
not. Since known MAC DAs are handled by FDB/MDB entries, this means that
promiscuity is analogous to including/excluding the CPU port from the
flood domain of those packets.
There are two ways to signal CPU flooding to drivers.
The first (chosen here) is to synthesize a call to
ds->ops->port_bridge_flags() for the CPU port, with a mask of
BR_FLOOD | BR_MCAST_FLOOD. This has the effect of turning on egress
flooding on the CPU port regardless of source.
The alternative would be to create a new ds->ops->port_host_flood()
which is called per user port. Some switches (sja1105) have a flood
domain that is managed per {ingress port, egress port} pair, so it would
make more sense for this kind of switch to not flood the CPU from port A
if just port B requires it. Nonetheless, the sja1105 has other quirks
that prevent it from making use of unicast filtering, and without a
concrete user making use of this feature, I chose not to implement it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:11 +0000 (21:14 +0200)]
net: dsa: install the primary unicast MAC address as standalone port host FDB
To be able to safely turn off CPU flooding for standalone ports, we need
to ensure that the dev_addr of each DSA slave interface is installed as
a standalone host FDB entry for compatible switches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:10 +0000 (21:14 +0200)]
net: dsa: install secondary unicast and multicast addresses as host FDB/MDB
In preparation of disabling flooding towards the CPU in standalone ports
mode, identify the addresses requested by upper interfaces and use the
new API for DSA FDB isolation to request the hardware driver to offload
these as FDB or MDB objects. The objects belong to the user port's
database, and are installed pointing towards the CPU port.
Because dev_uc_add()/dev_mc_add() is VLAN-unaware, we offload to the
port standalone database addresses with VID 0 (also VLAN-unaware).
So this excludes switches with global VLAN filtering from supporting
unicast filtering, because there, it is possible for a port of a switch
to join a VLAN-aware bridge, and this changes the VLAN awareness of
standalone ports, requiring VLAN-aware standalone host FDB entries.
For the same reason, hellcreek, which requires VLAN awareness in
standalone mode, is also exempted from unicast filtering.
We create "standalone" variants of dsa_port_host_fdb_add() and
dsa_port_host_mdb_add() (and the _del coresponding functions).
We also create a separate work item type for handling deferred
standalone host FDB/MDB entries compared to the switchdev one.
This is done for the purpose of clarity - the procedure for offloading a
bridge FDB entry is different than offloading a standalone one, and
the switchdev event work handles only FDBs anyway, not MDBs.
Deferral is needed for standalone entries because ndo_set_rx_mode runs
in atomic context. We could probably optimize things a little by first
queuing up all entries that need to be offloaded, and scheduling the
work item just once, but the data structures that we can pass through
__dev_uc_sync() and __dev_mc_sync() are limiting (there is nothing like
a void *priv), so we'd have to keep the list of queued events somewhere
in struct dsa_switch, and possibly a lock for it. Too complicated for
now.
Adding the address to the master is handled by dev_uc_sync(), adding it
to the hardware is handled by __dev_uc_sync(). So this is the reason why
dsa_port_standalone_host_fdb_add() does not call dev_uc_add(). Not that
it had the rtnl_mutex anyway - ndo_set_rx_mode has it, but is atomic.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:09 +0000 (21:14 +0200)]
net: dsa: rename the host FDB and MDB methods to contain the "bridge" namespace
We are preparing to add API in port.c that adds FDB and MDB entries that
correspond to the port's standalone database. Rename the existing
methods to make it clear that the FDB and MDB entries offloaded come
from the bridge database.
Since the function names lengthen in dsa_slave_switchdev_event_work(),
we place "addr" and "vid" in temporary variables, to shorten those.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Mar 2022 19:14:08 +0000 (21:14 +0200)]
net: dsa: remove workarounds for changing master promisc/allmulti only while up
Lennert Buytenhek explains in commit
df02c6ff2e39 ("dsa: fix master
interface allmulti/promisc handling"), dated Nov 2008, that changing the
promiscuity of interfaces that are down (here the master) is broken.
This fact regarding promisc/allmulti has changed since commit
b6c40d68ff64 ("net: only invoke dev->change_rx_flags when device is UP")
by Vlad Yasevich, dated Nov 2013.
Therefore, DSA now has unnecessary complexity to handle master state
transitions from down to up. In fact, syncing the unicast and multicast
addresses can happen completely asynchronously to the administrative
state changes.
This change reduces that complexity by effectively fully reverting
commit
df02c6ff2e39 ("dsa: fix master interface allmulti/promisc
handling").
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karol Kolacinski [Tue, 1 Mar 2022 18:38:03 +0000 (10:38 -0800)]
ice: add TTY for GNSS module for E810T device
Add a new ice_gnss.c file for holding the basic GNSS module functions.
If the device supports GNSS module, call the new ice_gnss_init and
ice_gnss_release functions where appropriate.
Implement basic functionality for reading the data from GNSS module
using TTY device.
Add I2C read AQ command. It is now required for controlling the external
physical connectors via external I2C port expander on E810-T adapters.
Future changes will introduce write functionality.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Sudhansu Sekhar Mishra <sudhansu.mishra@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 10:43:37 +0000 (10:43 +0000)]
Merge branch 'nfc-llcp-cleanups'
Krzysztof Kozlowski says:
====================
nfc: llcp: few cleanups/improvements
These are improvements, not fixing any experienced issue, just looking correct
to me from the code point of view.
Changes since v1
================
1. Split from the fix.
Testing
=======
Under QEMU only. The NFC/LLCP code was not really tested on a device.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:23 +0000 (20:25 +0100)]
nfc: llcp: Revert "NFC: Keep socket alive until the DISC PDU is actually sent"
This reverts commit
17f7ae16aef1f58bc4af4c7a16b8778a91a30255.
The commit brought a new socket state LLCP_DISCONNECTING, which was
never set, only read, so socket could never set to such state.
Remove the dead code.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:22 +0000 (20:25 +0100)]
nfc: llcp: protect nfc_llcp_sock_unlink() calls
nfc_llcp_sock_link() is called in all paths (bind/connect) as a last
action, still protected with lock_sock(). When cleaning up in
llcp_sock_release(), call nfc_llcp_sock_unlink() in a mirrored way:
earlier and still under the lock_sock().
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:21 +0000 (20:25 +0100)]
nfc: llcp: use test_bit()
Use test_bit() instead of open-coding it, just like in other places
touching the bitmap.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:20 +0000 (20:25 +0100)]
nfc: llcp: use centralized exiting of bind on errors
Coding style encourages centralized exiting of functions, so rewrite
llcp_sock_bind() error paths to use such pattern. This reduces the
duplicated cleanup code, make success path visually shorter and also
cleans up the errors in proper order (in reversed way from
initialization).
No functional impact expected.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:19 +0000 (20:25 +0100)]
nfc: llcp: simplify llcp_sock_connect() error paths
The llcp_sock_connect() error paths were using a mixed way of central
exit (goto) and cleanup
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:18 +0000 (20:25 +0100)]
nfc: llcp: nullify llcp_sock->dev on connect() error paths
Nullify the llcp_sock->dev on llcp_sock_connect() error paths,
symmetrically to the code llcp_sock_bind(). The non-NULL value of
llcp_sock->dev is used in a few places to check whether the socket is
still valid.
There was no particular issue observed with missing NULL assignment in
connect() error path, however a similar case - in the bind() error path
- was triggereable. That one was fixed in commit
4ac06a1e013c ("nfc:
fix NULL ptr dereference in llcp_sock_getname() after failed connect"),
so the change here seems logical as well.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 10:37:23 +0000 (10:37 +0000)]
Merge branch 'net-hw-counters-for-soft-devices'
Ido Schimmel says:
====================
HW counters for soft devices
Petr says:
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as a VLAN. In this patch set, add the necessary
infrastructure to allow exposing these statistics to the offloaded
netdevice in question, and add mlxsw offload.
Across HW platforms, the counter itself very likely constitutes a limited
resource, and the act of counting may have a performance impact. Therefore
this patch set makes the HW statistics collection opt-in and togglable from
userspace on a per-netdevice basis.
Additionally, HW devices may have various limiting conditions under which
they can realize the counter. Therefore it is also possible to query
whether the requested counter is realized by any driver. In TC parlance,
which is to a degree reused in this patch set, two values are recognized:
"request" tracks whether the user enabled collecting HW statistics, and
"used" tracks whether any HW statistics are actually collected.
In the past, this author has expressed the opinion that `a typical user
doing "ip -s l sh", including various scripts, wants to see the full
picture and not worry what's going on where'. While that would be nice,
unfortunately it cannot work:
- Packets that trap from the HW datapath to the SW datapath would be
double counted.
For a given netdevice, some traffic can be purely a SW artifact, and some
may flow through the HW object corresponding to the netdevice. But some
traffic can also get trapped to the SW datapath after bumping the HW
counter. It is not clear how to make sure double-counting does not occur
in the SW datapath in that case, while still making sure that possibly
divergent SW forwarding path gets bumped as appropriate.
So simply adding HW and SW stats may work roughly, most of the time, but
there are scenarios where the result is nonsensical.
- HW devices will have limitations as to what type of traffic they can
count.
In case of mlxsw, which is part of this patch set, there is no reasonable
way to count all traffic going through a certain netdevice, such as a
VLAN netdevice enslaved to a bridge. It is however very simple to count
traffic flowing through an L3 object, such as a VLAN netdevice with an IP
address.
Similarly for physical netdevices, the L3 object at which the counter is
installed is the subport carrying untagged traffic.
These are not "just counters". It is important that the user understands
what is being counted. It would be incorrect to conflate these statistics
with another existing statistics suite.
To that end, this patch set introduces a statistics suite called "L3
stats". This label should make it easy to understand what is being counted,
and to decide whether a given device can or cannot implement this suite for
some type of netdevice. At the same time, the code is written to make
future extensions easy, should a device pop up that can implement a
different flavor of statistics suite (say L2, or an address-family-specific
suite).
For example, using a work-in-progress iproute2[1], to turn on and then list
the counters on a VLAN netdevice:
# ip stats set dev swp1.200 l3_stats on
# ip stats show dev swp1.200 group offload subgroup l3_stats
56: swp1.200: group offload subgroup l3_stats on used on
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
The patchset progresses as follows:
- Patch #1 is a cleanup.
- In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
dev-backed.
The only attribute defined under the nest is currently
IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
driver that supplies the statistics is not the same as the driver that
implements the netdevice. Make the code compatible with this in patch #2.
- In patch #3, add the possibility to filter inside nests.
The filter_mask field of RTM_GETSTATS header determines which
top-level attributes should be included in the netlink response. This
saves processing time by only including the bits that the user cares
about instead of always dumping everything. This is doubly important
for HW-backed statistics that would typically require a trip to the
device to fetch the stats. In this patch, the UAPI is extended to
allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
but the scheme is easily extensible to other nests as well.
- In patch #4, propagate extack where we need it.
In patch #5, make it possible to propagate errors from drivers to the
user.
- In patch #6, add the in-kernel APIs for keeping track of the new stats
suite, and the notifiers that the core uses to communicate with the
drivers.
- In patch #7, add UAPI for obtaining the new stats suite.
- In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
the message to toggle the newly-added stats suite.
In patch #9, add the toggle itself.
At this point the core is ready for drivers to add support for the new
stats suite.
- In patches #10, #11 and #12, apply small tweaks to mlxsw code.
- In patch #13, add support for L3 stats, which are realized as RIF
counters.
- Finally in patch #14, a selftest is added to the net/forwarding
directory. Technically this is a HW-specific test, in that without a HW
implementing the counters, it just will not pass. But devices that
support L3 statistics at all are likely to be able to reuse this
selftest, so it seems appropriate to put it in the general forwarding
directory.
We also have a netdevsim implementation, and a corresponding selftest that
verifies specifically some of the core code. We intend to contribute these
later. Interested parties can take a look at the raw code at [2].
[1] https://github.com/pmachata/iproute2/commits/soft_counters
[2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2
v2:
- Patch #3:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Use NLA_POLICY_NESTED to declare what the nest contents should be
- Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
attribute.
- Patch #6:
- s/monotonous/monotonic/ in commit message
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #7:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #8:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Patch #13:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:28 +0000 (18:31 +0200)]
selftests: forwarding: hw_stats_l3: Add a new test
Add a test that verifies operation of L3 HW statistics.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:27 +0000 (18:31 +0200)]
mlxsw: Add support for IFLA_OFFLOAD_XSTATS_L3_STATS
Spectrum machines support L3 stats by binding a counter to a RIF, a
hardware object representing a router interface. Recognize the netdevice
notifier events, NETDEV_OFFLOAD_XSTATS_*, to support enablement,
disablement, and reporting back to core.
As a netdevice gains a RIF, if L3 stats are enabled, install the counters,
and ping the core so that a userspace notification can be emitted.
Similarly, as a netdevice loses a RIF, push the as-yet-unreported
statistics to the core, so that they are not lost, and ping the core to
emit userspace notification.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:26 +0000 (18:31 +0200)]
mlxsw: Extract classification of router-related events to a helper
Several more events are coming in the following patches, and extending the
if statement is getting awkward. Instead, convert it to a switch.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:25 +0000 (18:31 +0200)]
mlxsw: spectrum_router: Drop mlxsw_sp arg from counter alloc/free functions
The mlxsw_sp reference is carried by the mlxsw_sp_rif object that is passed
to these functions as well. Just deduce the former from the latter,
and drop the explicit mlxsw_sp parameter. Adapt callers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:24 +0000 (18:31 +0200)]
mlxsw: reg: Fix packing of router interface counters
The function mlxsw_reg_ritr_counter_pack() formats a register to configure
a router interface (RIF) counter. The parameter `egress' determines whether
an ingress or egress counter is to be configured. RITR, the register in
question, has two sets of counter-related fields: one for ingress, one for
egress. When setting values of the fields, the function sets the proper
counter index field, but when setting the counter type, it always sets the
egress field. Thus configuration of ingress counters is broken, and in fact
an attempt to configure an ingress counter mangles a previously configured
egress counter.
This was never discovered, because there is currently no way to enable
ingress counters on a router interface, only the egress one.
Fix in an obvious way.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:23 +0000 (18:31 +0200)]
net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. Add an attribute, IFLA_STATS_SET_OFFLOAD_XSTATS_L3_STATS,
which should be carried by the RTM_SETSTATS message, and expresses a desire
to toggle L3 offload xstats on or off.
As part of the above, add an exported function rtnl_offload_xstats_notify()
that drivers can use when they have installed or deinstalled the counters
backing the HW stats.
At this point, it is possible to enable, disable and query L3 offload
xstats on netdevices. (However there is no driver actually implementing
these.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:22 +0000 (18:31 +0200)]
net: rtnetlink: Add RTM_SETSTATS
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. These stats are only accessible through RTM_GETSTATS, and
therefore should be toggled by a RTM_SETSTATS message. Add it, and the
necessary skeleton handler.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:21 +0000 (18:31 +0200)]
net: rtnetlink: Add UAPI for obtaining L3 offload xstats
Add a new IFLA_STATS_LINK_OFFLOAD_XSTATS child attribute,
IFLA_OFFLOAD_XSTATS_L3_STATS, to carry statistics for traffic that takes
place in a HW router.
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. Additionally, as a netdevice is configured, it may become or
cease being suitable for binding of a HW counter. Both of these aspects
need to be communicated to the userspace. To that end, add another child
attribute, IFLA_OFFLOAD_XSTATS_HW_S_INFO:
- attr nest IFLA_OFFLOAD_XSTATS_HW_S_INFO
- attr nest IFLA_OFFLOAD_XSTATS_L3_STATS
- attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_REQUEST
- {0,1} as u8
- attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED
- {0,1} as u8
Thus this one attribute is a nest that can be used to carry information
about various types of HW statistics, and indexing is very simply done by
wrapping the information for a given statistics suite into the attribute
that carries the suite is the RTM_GETSTATS query. At the same time, because
_HW_S_INFO is nested directly below IFLA_STATS_LINK_OFFLOAD_XSTATS, it is
possible through filtering to request only the metadata about individual
statistics suites, without having to hit the HW to get the actual counters.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:20 +0000 (18:31 +0200)]
net: dev: Add hardware stats support
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
these statistics to the offloaded netdevice in question. The API was shaped
by the following considerations:
- Collection of HW statistics is not free: there may be a finite number of
counters, and the act of counting may have a performance impact. It is
therefore necessary to allow toggling whether HW counting should be done
for any particular SW netdevice.
- As the drivers are loaded and removed, a particular device may get
offloaded and unoffloaded again. At the same time, the statistics values
need to stay monotonic (modulo the eventual 64-bit wraparound),
increasing only to reflect traffic measured in the device.
To that end, the netdevice keeps around a lazily-allocated copy of struct
rtnl_link_stats64. Device drivers then contribute to the values kept
therein at various points. Even as the driver goes away, the struct stays
around to maintain the statistics values.
- Different HW devices may be able to count different things. The
motivation behind this patch in particular is exposure of HW counters on
Nvidia Spectrum switches, where the only practical approach to counting
traffic on offloaded soft netdevices currently is to use router interface
counters, and count L3 traffic. Correspondingly that is the statistics
suite added in this patch.
Other devices may be able to measure different kinds of traffic, and for
that reason, the APIs are built to allow uniform access to different
statistics suites.
- Because soft netdevices and offloading drivers are only loosely bound, a
netdevice uses a notifier chain to communicate with the drivers. Several
new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
to the offloading drivers.
- Devices can have various conditions for when a particular counter is
available. As the device is configured and reconfigured, the device
offload may become or cease being suitable for counter binding. A
netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
ping offloading drivers and determine whether anyone currently implements
a given statistics suite. This information can then be propagated to user
space.
When the driver decides to unoffload a netdevice, it can use a
newly-added function, netdev_offload_xstats_report_delta(), to record
outstanding collected statistics, before destroying the HW counter.
This patch adds a helper, call_netdevice_notifiers_info_robust(), for
dispatching a notifier with the possibility of unwind when one of the
consumers bails. Given the wish to eventually get rid of the global
notifier block altogether, this helper only invokes the per-netns notifier
block.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:19 +0000 (18:31 +0200)]
net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns
Obtaining stats for the IFLA_STATS_LINK_OFFLOAD_XSTATS nest involves a HW
access, and can fail for more reasons than just netlink message size
exhaustion. Therefore do not always return -EMSGSIZE on the failure path,
but respect the error code provided by the callee. Set the error explicitly
where it is reasonable to assume -EMSGSIZE as the failure reason.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:18 +0000 (18:31 +0200)]
net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()
Later patches add handlers for more HW-backed statistics. An extack will be
useful when communicating HW / driver errors to the client. Add the
arguments as appropriate.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:17 +0000 (18:31 +0200)]
net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests
The filter_mask field of RTM_GETSTATS header determines which top-level
attributes should be included in the netlink response. This saves
processing time by only including the bits that the user cares about
instead of always dumping everything. This is doubly important for
HW-backed statistics that would typically require a trip to the device to
fetch the stats.
So far there was only one HW-backed stat suite per attribute. However,
IFLA_STATS_LINK_OFFLOAD_XSTATS is a nest, and will gain a new stat suite in
the following patches. It would therefore be advantageous to be able to
filter within that nest, and select just one or the other HW-backed
statistics suite.
Extend rtnetlink so that RTM_GETSTATS permits attributes in the payload.
The scheme is as follows:
- RTM_GETSTATS
- struct if_stats_msg
- attr nest IFLA_STATS_GET_FILTERS
- attr IFLA_STATS_LINK_OFFLOAD_XSTATS
- u32 filter_mask
This scheme reuses the existing enumerators by nesting them in a dedicated
context attribute. This is covered by policies as usual, therefore a
gradual opt-in is possible. Currently only IFLA_STATS_LINK_OFFLOAD_XSTATS
nest has filtering enabled, because for the SW counters the issue does not
seem to be that important.
rtnl_offload_xstats_get_size() and _fill() are extended to observe the
requested filters.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:16 +0000 (18:31 +0200)]
net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed
The IFLA_STATS_LINK_OFFLOAD_XSTATS attribute is a nest whose child
attributes carry various special hardware statistics. The code that handles
this nest was written with the idea that all these statistics would be
exposed by the device driver of a physical netdevice.
In the following patches, a new attribute is added to the abovementioned
nest, which however can be defined for some soft netdevices. The NDO-based
approach to querying these does not work, because it is not the soft
netdevice driver that exposes these statistics, but an offloading NIC
driver that does so.
The current code does not scale well to this usage. Simply rewrite it back
to the pattern seen in other fill-like and get_size-like functions
elsewhere.
Extract to helpers the code that is concerned with handling specifically
NDO-backed statistics so that it can be easily reused should more such
statistics be added.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:15 +0000 (18:31 +0200)]
net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_*
The currently used names rtnl_get_offload_stats() and
rtnl_get_offload_stats_size() do not clearly show the namespace. The former
function additionally seems to have been named this way in accordance with
the NDO name, as opposed to the naming used in the rtnetlink.c file (and
indeed elsewhere in the netlink handling code). As more and
differently-flavored attributes are introduced, a common clear prefix is
needed for all related functions.
Rename the functions to follow the rtnl_offload_xstats_* naming scheme.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Wed, 2 Mar 2022 10:52:22 +0000 (02:52 -0800)]
qed: validate and restrict untrusted VFs vlan promisc mode
Today when VFs are put in promiscuous mode, they can request PF
to configure device for them to receive all VLANs traffic regardless
of what vlan is configured by the PF (via ip link) and PF allows this
config request regardless of whether VF is trusted or not.
From security POV, when VLAN is configured for VF through PF (via ip link),
honour such config requests from VF only when they are configured to be
trusted, otherwise restrict such VFs vlan promisc mode config.
Cc: stable@vger.kernel.org
Fixes:
f990c82c385b ("qed*: Add support for ndo_set_vf_trust")
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Wed, 2 Mar 2022 10:52:21 +0000 (02:52 -0800)]
qed: display VF trust config
Driver does support SR-IOV VFs trust configuration but
it does not display it when queried via ip link utility.
Cc: stable@vger.kernel.org
Fixes:
f990c82c385b ("qed*: Add support for ndo_set_vf_trust")
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 10:14:06 +0000 (10:14 +0000)]
Merge branch 'stmmac-SA8155p-ADP'
@ 2022-03-02 10:39 Bhupesh Sharma
2022-03-02 10:39 ` [PATCH v2 1/2 net-next] net: stmmac: Add support for SM8150 Bhupesh Sharma
2022-03-02 10:39 ` [PATCH v2 2/2 net-next] net: stmmac: dwmac-qcom-ethqos: Adjust rgmii loopback_en per platform Bhupesh Sharma
0 siblings, 2 replies; 3+ messages in thread
Bhupesh Sharma says:
====================
net: stmmac: Enable support for Qualcomm SA8155p-ADP board
Changes since v1:
-----------------
- v1 can be seen here: https://lore.kernel.org/netdev/
20220126221725.710167-1-bhupesh.sharma@linaro.org/t/
- Fixed review comments from Bjorn - broke the v1 series into two
separate series - one each for 'net' tree and 'arm clock/dts' tree
- so as to ease review of the same from the respective maintainers.
- This series is intended for the 'net' tree.
The SA8155p-ADP board supports on-board ethernet (Gibabit Interface),
with support for both RGMII and RMII buses.
This patchset adds the support for the same.
Note that this patchset is based on an earlier sent patchset
for adding PDC controller support on SM8150 (see [1]).
[1]. https://lore.kernel.org/linux-arm-msm/
20220226184028.111566-1-bhupesh.sharma@linaro.org/T/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjorn Andersson [Wed, 2 Mar 2022 10:39:50 +0000 (16:09 +0530)]
net: stmmac: dwmac-qcom-ethqos: Adjust rgmii loopback_en per platform
Not all platforms should have RGMII_CONFIG_LOOPBACK_EN and the result it
about 50% packet loss on incoming messages. So make it possile to
configure this per compatible and enable it for QCS404.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vinod Koul [Wed, 2 Mar 2022 10:39:49 +0000 (16:09 +0530)]
net: stmmac: Add support for SM8150
This adds compatible, POR config & driver data for ethernet controller
found in SM8150 SoC.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
[bhsharma: Massage the commit log and other cosmetic changes]
Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 09:55:28 +0000 (09:55 +0000)]
Merge branch 'page_pool-stats'
Joe Damato says:
====================
page_pool: Add stats counters
Greetings:
Welcome to v9.
This revisions adds a commit which updates the page_pool documentation to
describe the stats API, structures, and fields.
Additionally, this revision contains a minor cosmetic change suggested by
Saeed in page_pool_recycle_in_ring in commit 2: "page_pool: Add recycle
stats", which removes an unnecessary #ifdef.
There are no functional changes in this revision.
Benchmark output from the v7 cover [1] is pasted below, as it is still
relevant since no functional changes have been made in this revision:
Benchmarks have been re-run. As always, results between runs are highly
variable; you'll find results showing that stats disabled are both faster
and slower than stats enabled in back to back benchmark runs.
Raw benchmark output with stats off [2] and stats on [3] are available for
examination.
Test system:
- 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- 2 NUMA zones, with 18 cores per zone and 2 threads per core
bench_page_pool_simple results, loops=
200000000
test name stats enabled stats disabled
cycles nanosec cycles nanosec
for_loop 0 0.335 0 0.336
atomic_inc 14 6.106 13 6.022
lock 30 13.365 32 13.968
no-softirq-page_pool01 75 32.884 74 32.308
no-softirq-page_pool02 79 34.696 74 32.302
no-softirq-page_pool03 110 48.005 105 46.073
tasklet_page_pool01_fast_path 14 6.156 14 6.211
tasklet_page_pool02_ptr_ring 41 18.028 39 17.391
tasklet_page_pool03_slow 107 46.646 105 46.123
bench_page_pool_cross_cpu results, loops=
20000000 returning_cpus=4:
test name stats enabled stats disabled
cycles nanosec cycles nanosec
page_pool_cross_cpu CPU(0) 3973 1731.596 4015 1750.015
page_pool_cross_cpu CPU(1) 3976 1733.217 4022 1752.864
page_pool_cross_cpu CPU(2) 3973 1731.615 4016 1750.433
page_pool_cross_cpu CPU(3) 3976 1733.218 4021 1752.806
page_pool_cross_cpu CPU(4) 994 433.305 1005 438.217
page_pool_cross_cpu average 3378 - 3415 -
bench_page_pool_cross_cpu results, loops=
20000000 returning_cpus=8:
test name stats enabled stats disabled
cycles nanosec cycles nanosec
page_pool_cross_cpu CPU(0) 6969 3037.488 6909 3011.463
page_pool_cross_cpu CPU(1) 6974 3039.469 6913 3012.961
page_pool_cross_cpu CPU(2) 6969 3037.575 6910 3011.585
page_pool_cross_cpu CPU(3) 6974 3039.415 6913 3012.961
page_pool_cross_cpu CPU(4) 6969 3037.288 6909 3011.368
page_pool_cross_cpu CPU(5) 6972 3038.732 6913 3012.920
page_pool_cross_cpu CPU(6) 6969 3037.350 6909 3011.386
page_pool_cross_cpu CPU(7) 6973 3039.356 6913 3012.921
page_pool_cross_cpu CPU(8) 871 379.934 864 376.620
page_pool_cross_cpu average 6293 - 6239 -
Thanks.
[1]: https://lore.kernel.org/all/
1645810914-35485-1-git-send-email-jdamato@fastly.com/
[2]: https://gist.githubusercontent.com/jdamato-fsly/
d7c34b9fa7be1ce132a266b0f2b92aea/raw/
327dcd71d11ece10238fbf19e0472afbcbf22fd4/v7_stats_disabled
[3]: https://gist.githubusercontent.com/jdamato-fsly/
d7c34b9fa7be1ce132a266b0f2b92aea/raw/
327dcd71d11ece10238fbf19e0472afbcbf22fd4/v7_stats_enabled
v8 -> v9:
- Add documentation about the page_pool_get_stats API, stats
structures, and fields to Documentation/networking/page_pool.rst.
- Remove unnecessary #ifdef in page_pool_recycle_in_ring.
v7 -> v8:
- Rename mlx5 ethtool stats so that users have a better idea of
their meaning.
v6 -> v7:
- stats split out into two structs one single per-page pool struct
for allocation path stats and one per-cpu pointer for recycle
path stats.
- page_pool_get_stats updated to use a wrapper struct to gather
stats for allocation and recycle stats with a single argument.
- placement of structs adjusted
- mlx5 driver modified to use page_pool_get_stats API
v5 -> v6:
- Per cpu page_pool_stats struct pointer is now marked as
____cacheline_aligned_in_smp. Placement of the field in the
struct is unchanged; it is the last field.
v4 -> v5:
- Fixed the description of the kernel option in Kconfig.
- Squashed commits 1-10 from v4 into a single commit for easier
review.
- Changed the comment style of the comment for
the this_cpu_inc_alloc_stat macro.
- Changed the return type of page_pool_get_stats from struct
page_pool_stat * to bool.
v3 -> v4:
- Restructured stats to be per-cpu per-pool.
- Global stats and proc file were removed.
- Exposed an API (page_pool_get_stats) for batching the pool stats.
v2 -> v3:
- patch 8/10 ("Add stat tracking cache refill") fixed placement of
counter increment.
- patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
- fix unused label warning from kernel test robot,
- fixed page_pool_seq_show to only display the refill stat
once,
- added a remove_proc_entry for page_pool_stat to
dev_proc_net_exit.
v1 -> v2:
- A new kernel config option has been added, which defaults to N,
preventing this code from being compiled in by default
- The stats structure has been converted to a per-cpu structure
- The stats are now exported via proc (/proc/net/page_pool_stat)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:51 +0000 (23:55 -0800)]
mlx5: add support for page_pool_get_stats
This change adds support for the page_pool_get_stats API to mlx5. If the
user has enabled CONFIG_PAGE_POOL_STATS in their kernel, ethtool will
output page pool stats.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:50 +0000 (23:55 -0800)]
Documentation: update networking/page_pool.rst
Add the new stats API, kernel config parameter, and stats structure
information to the page_pool documentation.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:49 +0000 (23:55 -0800)]
page_pool: Add function to batch and return stats
Adds a function page_pool_get_stats which can be used by drivers to obtain
stats for a specified page_pool.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:48 +0000 (23:55 -0800)]
page_pool: Add recycle stats
Add per-cpu stats tracking page pool recycling events:
- cached: recycling placed page in the page pool cache
- cache_full: page pool cache was full
- ring: page placed into the ptr ring
- ring_full: page released from page pool because the ptr ring was full
- released_refcnt: page released (and not recycled) because refcnt > 1
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:47 +0000 (23:55 -0800)]
page_pool: Add allocation stats
Add per-pool statistics counters for the allocation path of a page pool.
These stats are incremented in softirq context, so no locking or per-cpu
variables are needed.
This code is disabled by default and a kernel config option is provided for
users who wish to enable them.
The statistics added are:
- fast: successful fast path allocations
- slow: slow path order-0 allocations
- slow_high_order: slow path high order allocations
- empty: ptr ring is empty, so a slow path allocation was forced.
- refill: an allocation which triggered a refill of the cache
- waive: pages obtained from the ptr ring that cannot be added to
the cache due to a NUMA mismatch.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tao Chen [Tue, 1 Mar 2022 14:35:42 +0000 (06:35 -0800)]
tcp: Remove the unused api
Last tcp_write_queue_head() use was removed in commit
114f39feab36 ("tcp: restore autocorking"), so remove it.
Signed-off-by: Tao Chen <chentao3@hotmail.com>
Link: https://lore.kernel.org/r/SYZP282MB33317DEE1253B37C0F57231E86029@SYZP282MB3331.AUSP282.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Mon, 28 Feb 2022 19:58:56 +0000 (20:58 +0100)]
flow_dissector: Add support for HSR
Network drivers such as igb or igc call eth_get_headlen() to determine the
header length for their to be constructed skbs in receive path.
When running HSR on top of these drivers, it results in triggering BUG_ON() in
skb_pull(). The reason is the skb headlen is not sufficient for HSR to work
correctly. skb_pull() notices that.
For instance, eth_get_headlen() returns 14 bytes for TCP traffic over HSR which
is not correct. The problem is, the flow dissection code does not take HSR into
account. Therefore, add support for it.
Reported-by: Anthony Harivel <anthony.harivel@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20220228195856.88187-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baruch Siach [Mon, 28 Feb 2022 12:10:03 +0000 (14:10 +0200)]
net: dsa: mv88e6xxx: support RMII cmode
Add support for direct RMII MAC mode. This allows hardware with CPU port
connected in direct 100M fixed link to work properly.
Signed-off-by: Baruch Siach <baruch.siach@siklu.com>
Link: https://lore.kernel.org/r/a962d1ccbeec42daa10dd8aff0e66e31f0faf1eb.1646050203.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baruch Siach [Mon, 28 Feb 2022 12:10:02 +0000 (14:10 +0200)]
net: dsa: mv88e6xxx: don't error out cmode set on missing lane
When the given cmode has no serdes, mv88e6xxx_serdes_get_lane() returns
-NODEV. Earlier in the same function the code skips serdes handing in
this case. Do the same after cmode set.
Signed-off-by: Baruch Siach <baruch.siach@siklu.com>
Link: https://lore.kernel.org/r/cd95cf3422ae8daf297a01fa9ec3931b203cdf45.1646050203.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yang Li [Sun, 27 Feb 2022 13:22:08 +0000 (21:22 +0800)]
net: openvswitch: remove unneeded semicolon
Eliminate the following coccicheck warning:
./net/openvswitch/flow.c:379:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220227132208.24658-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baowen Zheng [Wed, 2 Mar 2022 03:29:29 +0000 (11:29 +0800)]
flow_offload: improve extack msg for user when adding invalid filter
Add extack message to return exact message to user when adding invalid
filter with conflict flags for TC action.
In previous implement we just return EINVAL which is confusing for user.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Link: https://lore.kernel.org/r/1646191769-17761-1-git-send-email-baowen.zheng@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Mar 2022 06:13:06 +0000 (22:13 -0800)]
Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2022-03-01
This series contains updates to iavf driver only.
Mateusz adds support for interrupt moderation for 50G and 100G speeds
as well as support for the driver to specify a request as its primary
MAC address. He also refactors VLAN V2 capability exchange into more
generic extended capabilities to ease the addition of future
capabilities. Finally, he corrects the incorrect return of iavf_status
values and removes non-inclusive language.
Minghao Chi removes unneeded variables, instead returning values
directly.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: Remove non-inclusive language
iavf: Fix incorrect use of assigning iavf_status to int
iavf: stop leaking iavf_status as "errno" values
iavf: remove redundant ret variable
iavf: Add usage of new virtchnl format to set default MAC
iavf: refactor processing of VLAN V2 capability message
iavf: Add support for 50G/100G in AIM algorithm
====================
Link: https://lore.kernel.org/r/20220301185939.3005116-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Christophe JAILLET [Tue, 1 Mar 2022 13:12:12 +0000 (14:12 +0100)]
nfp: flower: Remove usage of the deprecated ida_simple_xxx API
Use ida_alloc_xxx()/ida_free() instead to
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220301131212.26348-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Mar 2022 08:51:39 +0000 (08:51 +0000)]
net: sfp: use %pe for printing errors
Convert sfp to use %pe for printing error codes, which can print them
as errno symbols rather than numbers.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1nOyEN-00BuuE-OB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Mar 2022 08:51:34 +0000 (08:51 +0000)]
net: phylink: use %pe for printing errors
Convert phylink to use %pe for printing error codes, which can print
them as errno symbols rather than numbers.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1nOyEI-00Buu8-K9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Harold Huang [Thu, 3 Mar 2022 02:24:40 +0000 (10:24 +0800)]
tuntap: add sanity checks about msg_controllen in sendmsg
In patch [1], tun_msg_ctl was added to allow pass batched xdp buffers to
tun_sendmsg. Although we donot use msg_controllen in this path, we should
check msg_controllen to make sure the caller pass a valid msg_ctl.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=
fe8dd45bb7556246c6b76277b1ba4296c91c2505
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220303022441.383865-1-baymaxhuang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Mar 2022 05:58:02 +0000 (21:58 -0800)]
Merge tag 'batadv-next-pullrequest-
20220302' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Remove redundant 'flush_workqueue()' calls, by Christophe JAILLET
- Migrate to linux/container_of.h, by Sven Eckelmann
- Demote batadv-on-batadv skip error message, by Sven Eckelmann
* tag 'batadv-next-pullrequest-
20220302' of git://git.open-mesh.org/linux-merge:
batman-adv: Demote batadv-on-batadv skip error message
batman-adv: Migrate to linux/container_of.h
batman-adv: Remove redundant 'flush_workqueue()' calls
batman-adv: Start new development cycle
====================
Link: https://lore.kernel.org/r/20220302163522.102842-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wang Qing [Wed, 2 Mar 2022 06:41:14 +0000 (22:41 -0800)]
net: hamradio: fix compliation error
add missing ")" which caused by previous commit.
Fixes:
61c4fb9c4d09 ("net: hamradio: use time_is_after_jiffies() instead of open coding it")
Link: https://lore.kernel.org/all/1646018012-61129-1-git-send-email-wangqing@vivo.com/
Signed-off-by: Wang Qing <wangqing@vivo.com>
Link: https://lore.kernel.org/r/1646203277-83159-1-git-send-email-wangqing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Eckelmann [Sun, 27 Feb 2022 22:40:40 +0000 (23:40 +0100)]
batman-adv: Demote batadv-on-batadv skip error message
The error message "Cannot find parent device" was shown for users of
macvtap (on batadv devices) whenever the macvtap was moved to a different
netns. This happens because macvtap doesn't provide an implementation for
rtnl_link_ops->get_link_net.
The situation for which this message is printed is actually not an error
but just a warning that the optional sanity check was skipped. So demote
the message from error to warning and adjust the text to better explain
what happened.
Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Sven Eckelmann [Fri, 21 Jan 2022 16:14:44 +0000 (17:14 +0100)]
batman-adv: Migrate to linux/container_of.h
The commit
d2a8ebbf8192 ("kernel.h: split out container_of() and
typeof_member() macros") introduced a new header for the container_of
related macros from (previously) linux/kernel.h.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Jakub Kicinski [Wed, 2 Mar 2022 02:29:35 +0000 (18:29 -0800)]
Merge branch 'if_ether-h-add-industrial-fieldbus-ethertypes'
Daniel Braunwarth says:
====================
if_ether.h: add industrial fieldbus Ethertypes
This set of patches adds the Ethertypes for PROFINET and EtherCAT.
The defines should be used by iproute2 to extend the list of available link
layer protocols.
====================
Link: https://lore.kernel.org/r/20220228133029.100913-1-daniel@braunwarth.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Braunwarth [Mon, 28 Feb 2022 13:30:29 +0000 (14:30 +0100)]
if_ether.h: add EtherCAT Ethertype
Add the Ethertype for EtherCAT protocol.
Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Braunwarth [Mon, 28 Feb 2022 13:30:28 +0000 (14:30 +0100)]
if_ether.h: add PROFINET Ethertype
Add the Ethertype for PROFINET protocol.
Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Eckelmann [Mon, 28 Feb 2022 00:32:40 +0000 (01:32 +0100)]
macvtap: advertise link netns via netlink
Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages. This fixes iproute2 which otherwise resolved
the link interface to an interface in the wrong namespace.
Test commands:
ip netns add nst
ip link add dummy0 type dummy
ip link add link macvtap0 link dummy0 type macvtap
ip link set macvtap0 netns nst
ip -netns nst link show macvtap0
Before:
10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff
After:
10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wan Jiabing [Tue, 1 Mar 2022 11:23:54 +0000 (19:23 +0800)]
nfp: avoid newline at end of message in NL_SET_ERR_MSG_MOD
Fix the following coccicheck warning:
./drivers/net/ethernet/netronome/nfp/flower/qos_conf.c:750:7-55: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220301112356.1820985-1-wanjiabing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Harold Huang [Mon, 28 Feb 2022 03:38:05 +0000 (11:38 +0800)]
tun: support NAPI for packets received from batched XDP buffs
In tun, NAPI is supported and we can also use NAPI in the path of
batched XDP buffs to accelerate packet processing. What is more, after
we use NAPI, GRO is also supported. The iperf shows that the throughput of
single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2
Gbps nearly reachs the line speed of the phy nic and there is still about
15% idle cpu core remaining on the vhost thread.
Test topology:
[iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
Iperf stream:
iperf3 -c 10.0.0.2 -i 1 -t 10
Before:
...
[ 5] 5.00-6.00 sec 558 MBytes 4.68 Gbits/sec 0 1.50 MBytes
[ 5] 6.00-7.00 sec 556 MBytes 4.67 Gbits/sec 1 1.35 MBytes
[ 5] 7.00-8.00 sec 556 MBytes 4.67 Gbits/sec 2 1.18 MBytes
[ 5] 8.00-9.00 sec 559 MBytes 4.69 Gbits/sec 0 1.48 MBytes
[ 5] 9.00-10.00 sec 556 MBytes 4.67 Gbits/sec 1 1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 5.39 GBytes 4.63 Gbits/sec 72 sender
[ 5] 0.00-10.04 sec 5.39 GBytes 4.61 Gbits/sec receiver
After:
...
[ 5] 5.00-6.00 sec 1.07 GBytes 9.19 Gbits/sec 0 1.55 MBytes
[ 5] 6.00-7.00 sec 1.08 GBytes 9.30 Gbits/sec 0 1.63 MBytes
[ 5] 7.00-8.00 sec 1.08 GBytes 9.25 Gbits/sec 0 1.72 MBytes
[ 5] 8.00-9.00 sec 1.08 GBytes 9.25 Gbits/sec 77 1.31 MBytes
[ 5] 9.00-10.00 sec 1.08 GBytes 9.24 Gbits/sec 0 1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.8 GBytes 9.28 Gbits/sec 166 sender
[ 5] 0.00-10.04 sec 10.8 GBytes 9.24 Gbits/sec receiver
Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 2 Mar 2022 01:12:46 +0000 (17:12 -0800)]
Merge branch 'sfc-optimize-rxqs-count-and-affinities'
Íñigo Huguet says:
====================
sfc: optimize RXQs count and affinities
In sfc driver one RX queue per physical core was allocated by default.
Later on, IRQ affinities were set spreading the IRQs in all NUMA local
CPUs.
However, with that default configuration it result in a non very optimal
configuration in many modern systems. Specifically, in systems with hyper
threading and 2 NUMA nodes, affinities are set in a way that IRQs are
handled by all logical cores of one same NUMA node. Handling IRQs from
both hyper threading siblings has no benefit, and setting affinities to one
queue per physical core is neither a very good idea because there is a
performance penalty for moving data across nodes (I was able to check it
with some XDP tests using pktgen).
This patches reduce the default number of channels to one per physical
core in the local NUMA node. Then, they set IRQ affinities to CPUs in
the local NUMA node only. This way we save hardware resources since
channels are limited resources. We also leave more room for XDP_TX
channels without hitting driver's limit of 32 channels per interface.
Running performance tests using iperf with a SFC9140 device showed no
performance penalty for reducing the number of channels.
RX XDP tests showed that performance can go down to less than half if
the IRQ is handled by a CPU in a different NUMA node, which doesn't
happen with the new defaults from this patches.
====================
Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Íñigo Huguet [Mon, 28 Feb 2022 13:22:54 +0000 (14:22 +0100)]
sfc: set affinity hints in local NUMA node only
Affinity hints were being set to CPUs in local NUMA node first, and then
in other CPUs. This was creating 2 unintended issues:
1. Channels created to be assigned each to a different physical core
were assigned to hyperthreading siblings because of being in same
NUMA node.
Since the patch previous to this one, this did not longer happen
with default rss_cpus modparam because less channels are created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
decreasing performance too much (to less than half in some of my
tests).
This patch sets the affinity hints spreading the channels only in local
NUMA node's CPUs. A fallback for the case that no CPU in local NUMA node
is online has been added too.
Example of CPUs being assigned in a non optimal way before this and the
previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64, but num_present_cpus == 32 so they're
never used):
$ lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
CPUs assignments after this and previous patch, so normal channels
created only one per core in NUMA node and affinities set only to local
NUMA node:
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Íñigo Huguet [Mon, 28 Feb 2022 13:22:53 +0000 (14:22 +0100)]
sfc: default config to 1 channel/core in local NUMA node only
Handling channels from CPUs in different NUMA node can penalize
performance, so better configure only one channel per core in the same
NUMA node than the NIC, and not per each core in the system.
Fallback to all other online cores if there are not online CPUs in local
NUMA node.
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 1 Mar 2022 22:24:46 +0000 (14:24 -0800)]
net: smc: fix different types in min()
Fix build:
include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
45 | #define min(x, y) __careful_cmp(x, y, <)
| ^~~~~~~~~~~~~
net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
150 | corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
| ^~~
Fixes:
12bbb0d163a9 ("net/smc: add sysctl for autocorking")
Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mateusz Palczewski [Thu, 3 Feb 2022 10:25:18 +0000 (11:25 +0100)]
iavf: Remove non-inclusive language
Remove non-inclusive language from the iavf driver.
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Thu, 27 Jan 2022 14:16:40 +0000 (15:16 +0100)]
iavf: Fix incorrect use of assigning iavf_status to int
Currently there are functions in iavf_virtchnl.c for polling specific
virtchnl receive events. These are all assigning iavf_status values to
int values. Fix this and explicitly assign int values if iavf_status
is not IAVF_SUCCESS.
Also, refactor a small amount of duplicated code that can be reused by
all of the previously mentioned functions.
Finally, fix some spacing errors for variable assignment and get rid of
all the goto statements in the refactored functions for clarity.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Thu, 27 Jan 2022 14:16:29 +0000 (15:16 +0100)]
iavf: stop leaking iavf_status as "errno" values
Several functions in the iAVF core files take status values of the enum
iavf_status and convert them into integer values. This leads to
confusion as functions return both Linux errno values and status codes
intermixed. Reporting status codes as if they were "errno" values can
lead to confusion when reviewing error logs. Additionally, it can lead
to unexpected behavior if a return value is not interpreted properly.
Fix this by introducing iavf_status_to_errno, a switch that explicitly
converts from the status codes into an appropriate error value. Also
introduce a virtchnl_status_to_errno function for the one case where we
were returning both virtchnl status codes and iavf_status codes in the
same function.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Minghao Chi [Mon, 10 Jan 2022 10:46:56 +0000 (10:46 +0000)]
iavf: remove redundant ret variable
Return value directly instead of taking this in another redundant
variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Wed, 19 Jan 2022 10:15:21 +0000 (11:15 +0100)]
iavf: Add usage of new virtchnl format to set default MAC
Use new type field of VIRTCHNL_OP_ADD_ETH_ADDR and
VIRTCHNL_OP_DEL_ETH_ADDR requests to indicate that
VF wants to change its default MAC address.
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Fri, 14 Jan 2022 09:36:36 +0000 (10:36 +0100)]
iavf: refactor processing of VLAN V2 capability message
In order to handle the capability exchange necessary for
VIRTCHNL_VF_OFFLOAD_VLAN_V2, the driver must send
a VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS message. This must occur prior to
__IAVF_CONFIG_ADAPTER, and the driver must wait for the response from
the PF.
To handle this, the __IAVF_INIT_GET_OFFLOAD_VLAN_V2_CAPS state was
introduced. This state is intended to process the response from the VLAN
V2 caps message. This works ok, but is difficult to extend to adding
more extended capability exchange.
Existing (and future) AVF features are relying more and more on these
sort of extended ops for processing additional capabilities. Just like
VLAN V2, this exchange must happen prior to __IAVF_CONFIG_ADPATER.
Since we only send one outstanding AQ message at a time during init, it
is not clear where to place this state. Adding more capability specific
states becomes a mess. Instead of having the "previous" state send
a message and then transition into a capability-specific state,
introduce __IAVF_EXTENDED_CAPS state. This state will use a list of
extended_caps that determines what messages to send and receive. As long
as there are extended_caps bits still set, the driver will remain in
this state performing one send or one receive per state machine loop.
Refactor the VLAN V2 negotiation to use this new state, and remove the
capability-specific state. This makes it significantly easier to add
a new similar capability exchange going forward.
Extended capabilities are processed by having an associated SEND and
RECV extended capability bit. During __IAVF_EXTENDED_CAPS, the
driver checks these bits in order by feature, first the send bit for
a feature, then the recv bit for a feature. Each send flag will call
a function that sends the necessary response, while each receive flag
will wait for the response from the PF. If a given feature can't be
negotiated with the PF, the associated flags will be cleared in
order to skip processing of that feature.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Mon, 10 Jan 2022 15:06:38 +0000 (16:06 +0100)]
iavf: Add support for 50G/100G in AIM algorithm
Advanced link speed support was added long back, but adding AIM support was
missed. This patch adds AIM support for advanced link speed support, which
allows the algorithm to take into account 50G/100G link speeds. Also, other
previous speeds are taken into consideration when advanced link speeds are
supported.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
David S. Miller [Tue, 1 Mar 2022 14:25:12 +0000 (14:25 +0000)]
Merge branch 'smc-datapath-opts'
Dust Li says:
====================
net/smc: some datapath performance optimizations
This series tries to improve the performance of SMC in datapath.
- patch #1, add sysctl interface to support tuning the behaviour of
SMC in container environment.
- patch #2/#3, add autocorking support which is very efficient for small
messages without trade-off for latency.
- patch #4, send directly on setting TCP_NODELAY, without wake up the
TX worker, this make it consistent with clearing TCP_CORK.
- patch #5, this correct the setting of RMB window update limit, so
we don't send CDC messages to update peer's RMB window too frequently
in some cases.
- patch #6, implemented something like NAPI in SMC, decrease the number
of hardirq when busy.
- patch #7, this moves TX work doing in the BH to the user context when
sock_lock is hold by user.
With this patchset applied, we can get a good performance gain:
- qperf tcp_bw test has shown a great improvement. Other benchmarks like
'netperf TCP_STREAM' or 'sockperf throughput' has similar result.
- In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
better then TCP in most all message size.
Here are some test results with the following testing command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
==== Bandwidth ====
MsgSize Origin SMC TCP SMC with patches
1 0.578 MB/s 2.392 MB/s(313.57%) 2.561 MB/s(342.83%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.162 MB/s(345.46%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.122 MB/s(343.46%)
8 4.668 MB/s 19.040 MB/s(307.86%) 20.521 MB/s(339.59%)
16 9.147 MB/s 38.904 MB/s(325.31%) 40.823 MB/s(346.29%)
32 18.369 MB/s 79.587 MB/s(333.25%) 80.535 MB/s(338.42%)
64 36.562 MB/s 148.668 MB/s(306.61%) 158.170 MB/s(332.60%)
128 72.961 MB/s 274.913 MB/s(276.80%) 316.217 MB/s(333.41%)
256 144.705 MB/s 512.059 MB/s(253.86%) 626.019 MB/s(332.62%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1221.596 MB/s(322.88%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2203.156 MB/s(283.70%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 3036.448 MB/s(177.25%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 3834.271 MB/s( 85.58%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 4904.910 MB/s( 31.95%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 5220.272 MB/s( 10.08%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5321.865 MB/s( -0.52%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5245.021 MB/s( 1.59%)
==== Latency ====
MsgSize Origin SMC TCP SMC with patches
1 10.540 us 11.938 us( 13.26%) 10.356 us( -1.75%)
2 10.996 us 11.992 us( 9.06%) 10.073 us( -8.39%)
4 10.229 us 11.687 us( 14.25%) 9.996 us( -2.28%)
8 10.203 us 11.653 us( 14.21%) 10.063 us( -1.37%)
16 10.530 us 11.313 us( 7.44%) 10.013 us( -4.91%)
32 10.241 us 11.586 us( 13.13%) 10.081 us( -1.56%)
64 10.693 us 11.652 us( 8.97%) 9.986 us( -6.61%)
128 10.597 us 11.579 us( 9.27%) 10.262 us( -3.16%)
256 10.409 us 11.957 us( 14.87%) 10.148 us( -2.51%)
512 11.088 us 12.505 us( 12.78%) 10.206 us( -7.95%)
1024 11.240 us 12.255 us( 9.03%) 10.631 us( -5.42%)
2048 11.485 us 16.970 us( 47.76%) 10.981 us( -4.39%)
4096 12.077 us 13.948 us( 15.49%) 11.847 us( -1.90%)
8192 13.683 us 16.693 us( 22.00%) 13.336 us( -2.54%)
16384 16.470 us 23.615 us( 43.38%) 16.519 us( 0.30%)
32768 22.540 us 40.966 us( 81.75%) 22.452 us( -0.39%)
65536 34.192 us 73.003 us(113.51%) 33.916 us( -0.81%)
------------
Test environment notes:
1. Testing is run on 2 VMs within the same physical host
2. The NIC is ConnectX-4Lx, using SRIOV, and passing through 2 VFs to the
2 VMs respectively.
3. To decrease jitter, VM's vCPU are binded to each physical CPU, and those
physical CPUs are all isolated using boot parameter `isolcpus=xxx`
4. The queue number are set to 1, and interrupt from the queue is binded to
CPU0 in the guest
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:02 +0000 (17:44 +0800)]
net/smc: don't send in the BH context if sock_owned_by_user
Send data all the way down to the RDMA device is a time
consuming operation(get a new slot, maybe do RDMA Write
and send a CDC, etc). Moving those operations from BH
to user context is good for performance.
If the sock_lock is hold by user, we don't try to send
data out in the BH context, but just mark we should
send. Since the user will release the sock_lock soon, we
can do the sending there.
Add smc_release_cb() which will be called in release_sock()
and try send in the callback if needed.
This patch moves the sending part out from BH if sock lock
is hold by user. In my testing environment, this saves about
20% softirq in the qperf 4K tcp_bw test in the sender side
with no noticeable throughput drop.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:01 +0000 (17:44 +0800)]
net/smc: don't req_notify until all CQEs drained
When we are handling softirq workload, enable hardirq may
again interrupt the current routine of softirq, and then
try to raise softirq again. This only wastes CPU cycles
and won't have any real gain.
Since IB_CQ_REPORT_MISSED_EVENTS already make sure if
ib_req_notify_cq() returns 0, it is safe to wait for the
next event, with no need to poll the CQ again in this case.
This patch disables hardirq during the processing of softirq,
and re-arm the CQ after softirq is done. Somehow like NAPI.
Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:00 +0000 (17:44 +0800)]
net/smc: correct settings of RMB window update limit
rmbe_update_limit is used to limit announcing receive
window updating too frequently. RFC7609 request a minimal
increase in the window size of 10% of the receive buffer
space. But current implementation used:
min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
always less then 10% of the receive buffer space.
This causes the receiver always sending CDC message to
update its consumer cursor when it consumes more then 2K
of data. And as a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K message.
This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
With this patch and SMC autocorking enabled, qperf 2K/4K/8K
tcp_bw test shows 45%/75%/40% increase in throughput respectively.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:59 +0000 (17:43 +0800)]
net/smc: send directly on setting TCP_NODELAY
In commit
ea785a1a573b("net/smc: Send directly when
TCP_CORK is cleared"), we don't use delayed work
to implement cork.
This patch use the same algorithm, removes the
delayed work when setting TCP_NODELAY and send
directly in setsockopt(). This also makes the
TCP_NODELAY the same as TCP.
Cc: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:58 +0000 (17:43 +0800)]
net/smc: add sysctl for autocorking
This add a new sysctl: net.smc.autocorking_size
We can dynamically change the behaviour of autocorking
by change the value of autocorking_size.
Setting to 0 disables autocorking in SMC
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:57 +0000 (17:43 +0800)]
net/smc: add autocorking support
This patch adds autocorking support for SMC which could improve
throughput for small message by x3+.
The main idea is borrowed from TCP autocorking with some RDMA
specific modification:
1. The first message should never cork to make sure we won't
bring extra latency
2. If we have posted any Tx WRs to the NIC that have not
completed, cork the new messages until:
a) Receive CQE for the last Tx WR
b) We have corked enough message on the connection
3. Try to push the corked data out when we receive CQE of
the last Tx WR to prevent the corked messages hang in
the send queue.
Both SMC autocorking and TCP autocorking check the TX completion
to decide whether we should cork or not. The difference is
when we got a SMC Tx WR completion, the data have been confirmed
by the RNIC while TCP TX completion just tells us the data
have been sent out by the local NIC.
Add an atomic variable tx_pushing in smc_connection to make
sure only one can send to let it cork more and save CDC slot.
SMC autocorking should not bring extra latency since the first
message will always been sent out immediately.
The qperf tcp_bw test shows more than x4 increase under small
message size with Mellanox connectX4-Lx, same result with other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows SMC autocorking has not increase any
ping-pong latency.
Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
=== Bandwidth ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 0.578 MB/s 2.392 MB/s(313.57%) 2.647 MB/s(357.72%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.153 MB/s(344.71%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.363 MB/s(354.02%)
8 4.668 MB/s 19.040 MB/s(307.86%) 21.215 MB/s(354.45%)
16 9.147 MB/s 38.904 MB/s(325.31%) 41.740 MB/s(356.32%)
32 18.369 MB/s 79.587 MB/s(333.25%) 82.392 MB/s(348.52%)
64 36.562 MB/s 148.668 MB/s(306.61%) 161.564 MB/s(341.89%)
128 72.961 MB/s 274.913 MB/s(276.80%) 325.363 MB/s(345.94%)
256 144.705 MB/s 512.059 MB/s(253.86%) 633.743 MB/s(337.96%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1250.681 MB/s(332.95%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2246.121 MB/s(291.19%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 2057.767 MB/s( 87.89%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 2173.983 MB/s( 5.22%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 3491.223 MB/s( -6.08%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 4637.692 MB/s( -2.20%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5385.796 MB/s( 0.68%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5223.890 MB/s( 1.18%)
==== Latency ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 10.540 us 11.938 us( 13.26%) 10.573 us( 0.31%)
2 10.996 us 11.992 us( 9.06%) 10.269 us( -6.61%)
4 10.229 us 11.687 us( 14.25%) 10.240 us( 0.11%)
8 10.203 us 11.653 us( 14.21%) 10.402 us( 1.95%)
16 10.530 us 11.313 us( 7.44%) 10.599 us( 0.66%)
32 10.241 us 11.586 us( 13.13%) 10.223 us( -0.18%)
64 10.693 us 11.652 us( 8.97%) 10.251 us( -4.13%)
128 10.597 us 11.579 us( 9.27%) 10.494 us( -0.97%)
256 10.409 us 11.957 us( 14.87%) 10.710 us( 2.89%)
512 11.088 us 12.505 us( 12.78%) 10.547 us( -4.88%)
1024 11.240 us 12.255 us( 9.03%) 10.787 us( -4.03%)
2048 11.485 us 16.970 us( 47.76%) 11.256 us( -1.99%)
4096 12.077 us 13.948 us( 15.49%) 12.230 us( 1.27%)
8192 13.683 us 16.693 us( 22.00%) 13.786 us( 0.75%)
16384 16.470 us 23.615 us( 43.38%) 16.459 us( -0.07%)
32768 22.540 us 40.966 us( 81.75%) 23.284 us( 3.30%)
65536 34.192 us 73.003 us(113.51%) 34.233 us( 0.12%)
With SMC autocorking support, we can archive better throughput
than TCP in most message sizes without any latency trade-off.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:56 +0000 (17:43 +0800)]
net/smc: add sysctl interface for SMC
This patch add sysctl interface to support container environment
for SMC as we talk in the mail list.
Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Mar 2022 08:38:02 +0000 (08:38 +0000)]
Merge branch 'vxlan-vnifiltering'
Roopa Prabhu says:
====================
vxlan metadata device vnifiltering support
This series adds vnifiltering support to vxlan collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
This series adds vnifiltering support to collect metadata devices to
terminate only configured vnis. This is similar to vlan filtering in
bridge driver. The vni filtering capability is provided by a new flag on
collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
v2:
- remove stale xstats declarations pointed out by Nikolay Aleksandrov
- squash selinux patch with the tunnel api patch as pointed out by
benjamin poirier
- Fix various build issues:
Reported-by: kernel test robot <lkp@intel.com>
v3:
- incorporate review feedback from Jakub
- move rhashtable declarations to c file
- define and use netlink policy for top level vxlan filter api
- fix unused stats function warning
- pass vninode from vnifilter lookup into stats count function
to avoid another lookup (only applicable to vxlan_rcv)
- fix missing vxlan vni delete notifications in vnifilter uninit
function
- misc cleanups
- remote dev check for multicast groups added via vnifiltering api
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 1 Mar 2022 05:04:39 +0000 (05:04 +0000)]
drivers: vxlan: vnifilter: add support for stats dumping
Add support for VXLAN vni filter entries' stats dumping
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 1 Mar 2022 05:04:38 +0000 (05:04 +0000)]
drivers: vxlan: vnifilter: per vni stats
Add per-vni statistics for vni filter mode. Counting Rx/Tx
bytes/packets/drops/errors at the appropriate places.
This patch changes vxlan_vs_find_vni to also return the
vxlan_vni_node in cases where the vni belongs to a vni
filtering vxlan device
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:37 +0000 (05:04 +0000)]
selftests: add new tests for vxlan vnifiltering
This patch adds a new test script test_vxlan_vnifiltering.sh
with tests for vni filtering api, various datapath tests.
Also has a test with a mix of traditional, metadata and vni
filtering devices inuse at the same time.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:36 +0000 (05:04 +0000)]
vxlan: vni filtering support on collect metadata device
This patch adds vnifiltering support to collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
vnifiltering support in collect metadata devices terminates only configured
vnis. This is similar to vlan filtering in bridge driver. The vni filtering
capability is provided by a new flag on collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
With this implementation, vxlan dst metadata device can
be associated with range of vnis.
struct vxlan_vni_node is introduced to represent
a configured vni. We start with vni and its
associated remote_ip in this structure. This
structure can be extended to bring in other
per vni attributes if there are usecases for it.
A vni inherits an attribute from the base vxlan device
if there is no per vni attributes defined.
struct vxlan_dev gets a new rhashtable for
vnis called vxlan_vni_group. vxlan_vnifilter.c
implements the necessary netlink api, notifications
and helper functions to process and manage lifecycle
of vxlan_vni_node.
This patch also adds new helper functions in vxlan_multicast.c
to handle per vni remote_ip multicast groups which are part
of vxlan_vni_group.
Fix build problems:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:35 +0000 (05:04 +0000)]
vxlan_multicast: Move multicast helpers to a separate file
subsequent patches will add more helpers.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:34 +0000 (05:04 +0000)]
rtnetlink: add new rtm tunnel api for tunnel id filtering
This patch adds new rtm tunnel msg and api for tunnel id
filtering in dst_metadata devices. First dst_metadata
device to use the api is vxlan driver with AF_BRIDGE
family.
This and later changes add ability in vxlan driver to do
tunnel id filtering (or vni filtering) on dst_metadata
devices. This is similar to vlan api in the vlan filtering bridge.
this patch includes selinux nlmsg_route_perms support for RTM_*TUNNEL
api from Benjamin Poirier.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:33 +0000 (05:04 +0000)]
vxlan_core: add helper vxlan_vni_in_use
more users in follow up patches
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:32 +0000 (05:04 +0000)]
vxlan_core: make multicast helper take rip and ifindex explicitly
This patch changes multicast helpers to take rip and ifindex as input.
This is needed in future patches where rip can come from a pervni
structure while the ifindex can come from the vxlan device.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>