platform/kernel/linux-starfive.git
19 months ago net/mlx5: DR, Enable patterns and arguments for supporting devices
Yevgeny Kliteynik [Mon, 29 Aug 2022 22:18:59 +0000 (01:18 +0300)]
net/mlx5: DR, Enable patterns and arguments for supporting devices

Check if patterns and arguments for modify header action
are supported and enable them accordingly.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add support for the pattern/arg parameters in debug dump
Yevgeny Kliteynik [Sat, 31 Dec 2022 22:20:53 +0000 (00:20 +0200)]
net/mlx5: DR, Add support for the pattern/arg parameters in debug dump

Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Modify header action of size 1 optimization
Yevgeny Kliteynik [Tue, 15 Nov 2022 16:19:39 +0000 (18:19 +0200)]
net/mlx5: DR, Modify header action of size 1 optimization

Set modify header action of size 1 directly on the STE for supporting
devices, thus reducing number of hops and cache misses.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Support decap L3 action using pattern / arg mechanism
Yevgeny Kliteynik [Tue, 15 Nov 2022 00:36:07 +0000 (02:36 +0200)]
net/mlx5: DR, Support decap L3 action using pattern / arg mechanism

Use the new accelerated action for decap L3 on the RX side:
it uses the same pattern and argument mechanism as the
modify-header action.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Apply new accelerated modify action and decapl3
Yevgeny Kliteynik [Tue, 8 Nov 2022 09:56:11 +0000 (11:56 +0200)]
net/mlx5: DR, Apply new accelerated modify action and decapl3

If there is support for pattern/args, use the new accelerated modify
header action for modify header and decap L3 actions.
Otherwise fall back to the old modify-header implementation.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add modify header argument pointer to actions attributes
Yevgeny Kliteynik [Mon, 7 Nov 2022 00:49:22 +0000 (02:49 +0200)]
net/mlx5: DR, Add modify header argument pointer to actions attributes

While building the actions, add the pointer of the arguments for
accelerated modify list action into the action's attributes.
This will be used later on while building the specific STE
for this action.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add modify header arg pool mechanism
Yevgeny Kliteynik [Mon, 27 Mar 2023 20:26:55 +0000 (23:26 +0300)]
net/mlx5: DR, Add modify header arg pool mechanism

Add a new mechanism for handling arguments for the modify-header action.
The new "accelerated modify-header" action takes its arguments from an
area separate from the pattern; this area is accessed via general objects.
These objects are handled via the pool-manager struct.

When the new header patterns are supported, a few pools for argument
creation are created while loading the domain. Requests for
allocating/deallocating arg objects are made via the pool manager API.
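
As a rough illustration of the pool-manager idea described above, here is a
minimal userspace sketch. It is not the driver's code: the names
arg_pool_mgr, arg_get and arg_put, the number of pools and the pool depth are
all hypothetical, and real arg objects are FW general objects rather than
array slots.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_POOLS      3   /* hypothetical number of size classes */
#define OBJS_PER_POOL  4   /* hypothetical pool depth */

struct arg_pool {
	bool in_use[OBJS_PER_POOL];
};

/* One pool per argument size class, all created up front. */
struct arg_pool_mgr {
	struct arg_pool pools[NUM_POOLS];
};

/*
 * Hand out a free object from the pool of the given size class.
 * Returns a manager-wide object ID, or -1 when that pool is exhausted.
 */
static int arg_get(struct arg_pool_mgr *mgr, unsigned int size_class)
{
	assert(size_class < NUM_POOLS);
	for (int i = 0; i < OBJS_PER_POOL; i++) {
		if (!mgr->pools[size_class].in_use[i]) {
			mgr->pools[size_class].in_use[i] = true;
			return (int)(size_class * OBJS_PER_POOL) + i;
		}
	}
	return -1;
}

/* Return an object to the pool it came from. */
static void arg_put(struct arg_pool_mgr *mgr, int id)
{
	assert(id >= 0 && id < NUM_POOLS * OBJS_PER_POOL);
	assert(mgr->pools[id / OBJS_PER_POOL].in_use[id % OBJS_PER_POOL]);
	mgr->pools[id / OBJS_PER_POOL].in_use[id % OBJS_PER_POOL] = false;
}
```

Callers only ever see arg_get()/arg_put(); which pool serves a request is
decided by the size class, which is the point of routing all requests through
the manager.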

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Fix QP continuous allocation
Yevgeny Kliteynik [Mon, 14 Nov 2022 22:11:38 +0000 (00:11 +0200)]
net/mlx5: DR, Fix QP continuous allocation

When allocating a QP we allocate an RQ and an SQ; the RQ is stored first
in memory and is followed by the SQ.
This allocation is not physically contiguous - it may span different
physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read,
or 2 contiguous BBs of GTA WQE.

This led to an issue where the RQ allocation was 4x16, which is equal to 1 WQE BB,
causing a 1 BB offset in the page and splitting the GTA WQE between different
physical pages.

The solution was to create the RQ with an even number of BBs and to have the
RQ aligned to a page.
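
The arithmetic behind the split can be checked with a small userspace sketch.
It assumes a 64-byte BB (the 4x16 figure above), a 4096-byte page and a
page-aligned base for the QP buffer; the function names are hypothetical and
do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BB_SIZE  64u    /* one WQE basic block, per the 4x16 figure above */
#define PAGE_SZ  4096u  /* assumed page size */

/* True if a 2-BB GTA WQE starting at byte offset 'off' crosses a page. */
static bool gta_wqe_crosses_page(size_t off)
{
	return (off / PAGE_SZ) != ((off + 2 * BB_SIZE - 1) / PAGE_SZ);
}

/*
 * Count the 2-BB pairs in an SQ of 'sq_bbs' basic blocks that cross a page
 * boundary when the SQ begins 'rq_bytes' after a page-aligned base.
 */
static unsigned int count_split_pairs(size_t rq_bytes, unsigned int sq_bbs)
{
	unsigned int split = 0;

	assert(sq_bbs % 2 == 0);
	for (unsigned int bb = 0; bb + 1 < sq_bbs; bb += 2)
		if (gta_wqe_crosses_page(rq_bytes + (size_t)bb * BB_SIZE))
			split++;
	return split;
}
```

With a 1 BB RQ (64 bytes) every pair starts at an odd BB index, so the pair
that begins on the last BB of each page is split. With an even number of RQ
BBs, pairs start on 128-byte boundaries and never straddle a page.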

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Read ICM memory into dedicated buffer
Yevgeny Kliteynik [Mon, 14 Nov 2022 22:33:25 +0000 (00:33 +0200)]
net/mlx5: DR, Read ICM memory into dedicated buffer

Instead of using the write buffer for reading, use a dedicated
buffer only for reading ICM memory.
Due to the new support for args, pending_wc can be an odd number, and
when reading into the same write buffer it is possible to overwrite
the next write on the same slot.
For example:
pending_wc is 17 so the buffer for write is:
   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
and we have requests as follows:
   r wr wr wr wr wr wr wr wr
Now, the first read will be written into the last write, because we use
the same buffer for read and write, before that write was sent to the HW,
and we will have wrong data in the ICM area.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add support for writing modify header argument
Yevgeny Kliteynik [Tue, 8 Nov 2022 12:27:03 +0000 (14:27 +0200)]
net/mlx5: DR, Add support for writing modify header argument

The accelerated modify header arguments are written to the HW area
with a special WQE and a specific data format.
A new function was added to support writing the new argument type.
Note that a GTA WQE is larger than READ and WRITE WQEs, so the queue
management logic was updated to support this.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add create/destroy for modify-header-argument general object
Yevgeny Kliteynik [Mon, 29 Aug 2022 22:18:29 +0000 (01:18 +0300)]
net/mlx5: DR, Add create/destroy for modify-header-argument general object

Add functions for creation/destruction of the new type of general object.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Check for modify_header_argument device capabilities
Yevgeny Kliteynik [Mon, 29 Aug 2022 22:21:10 +0000 (01:21 +0300)]
net/mlx5: DR, Check for modify_header_argument device capabilities

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Split chunk allocation to HW-dependent ways
Yevgeny Kliteynik [Mon, 29 Aug 2022 22:21:05 +0000 (01:21 +0300)]
net/mlx5: DR, Split chunk allocation to HW-dependent ways

This way we are able to allocate chunks for modify_headers in 2 ways:
STEv0 allocates them from the action area, and STEv1 allocates
them from the special area for patterns.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Add cache for modify header pattern
Yevgeny Kliteynik [Sun, 6 Nov 2022 10:52:23 +0000 (12:52 +0200)]
net/mlx5: DR, Add cache for modify header pattern

Starting with ConnectX-6 Dx, we use a new design of the modify_header FW object.
The current modify_header object allows for only a limited number
of FW objects, so the new design of pattern and argument allows pattern
reuse, saving memory, and allows a large number of modify_header objects.
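
To illustrate what pattern reuse means here, the following is a minimal
userspace sketch of a refcounted pattern cache. It is an assumption-laden
model, not the driver's implementation: the struct and function names, the
fixed-size table and the representation of a pattern as opaque 64-bit action
words are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PATTERNS 8   /* hypothetical cache capacity */
#define MAX_ACTIONS  4   /* hypothetical max actions per pattern */

struct pattern {
	uint64_t actions[MAX_ACTIONS];  /* opaque action words */
	unsigned int num_actions;
	unsigned int refcount;          /* 0 means the slot is free */
};

struct pattern_cache {
	struct pattern slots[MAX_PATTERNS];
};

/*
 * Return the slot of an identical cached pattern (taking a reference),
 * or store the pattern in a free slot. Returns -1 if the cache is full.
 */
static int pattern_get(struct pattern_cache *c, const uint64_t *actions,
		       unsigned int n)
{
	int free_slot = -1;

	assert(n > 0 && n <= MAX_ACTIONS);
	for (int i = 0; i < MAX_PATTERNS; i++) {
		struct pattern *p = &c->slots[i];

		if (p->refcount == 0) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (p->num_actions == n &&
		    memcmp(p->actions, actions, n * sizeof(*actions)) == 0) {
			p->refcount++;
			return i;
		}
	}
	if (free_slot < 0)
		return -1;
	memcpy(c->slots[free_slot].actions, actions, n * sizeof(*actions));
	c->slots[free_slot].num_actions = n;
	c->slots[free_slot].refcount = 1;
	return free_slot;
}

/* Drop one reference; the slot becomes reusable when the count hits 0. */
static void pattern_put(struct pattern_cache *c, int idx)
{
	assert(idx >= 0 && idx < MAX_PATTERNS && c->slots[idx].refcount > 0);
	c->slots[idx].refcount--;
}
```

Two flows that request the same action list end up sharing one slot, which is
the memory saving the commit message refers to; per-flow data would live in
the separate argument objects.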

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago net/mlx5: DR, Move ACTION_CACHE_LINE_SIZE macro to header
Yevgeny Kliteynik [Sun, 6 Nov 2022 10:50:41 +0000 (12:50 +0200)]
net/mlx5: DR, Move ACTION_CACHE_LINE_SIZE macro to header

Move ACTION_CACHE_LINE_SIZE macro to header to be used by
the pattern functions as well.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
19 months ago Merge branch 'msg_control-split'
David S. Miller [Fri, 14 Apr 2023 10:09:27 +0000 (11:09 +0100)]
Merge branch 'msg_control-split'

Kevin Brodsky says:

====================
net: Finish up ->msg_control{,_user} split

Commit 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for
->msg_control") introduced the msg_control_user and
msg_control_is_user fields in struct msghdr, to ensure that user
pointers are represented as such. It also took care of converting most
users of struct msghdr::msg_control where user pointers are involved. It
did however miss a number of cases, and some code using msg_control
inappropriately has also appeared in the meantime.

This series is attempting to complete the split, by eliminating the
remaining cases where msg_control is used when in fact a user
pointer is stored in the union (patch 1).

It also addresses a couple of issues with msg_control_is_user: one where
it is not updated as it should be (patch 2), and one where it is not
initialised (patch 3).

v1..v2:
* Split out the msg_control_is_user fixes into separate patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net/ipv6: Initialise msg_control_is_user
Kevin Brodsky [Thu, 13 Apr 2023 11:47:05 +0000 (12:47 +0100)]
net/ipv6: Initialise msg_control_is_user

do_ipv6_setsockopt() makes use of struct msghdr::msg_control in the
IPV6_2292PKTOPTIONS case. Make sure to initialise
msg_control_is_user accordingly.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net/compat: Update msg_control_is_user when setting a kernel pointer
Kevin Brodsky [Thu, 13 Apr 2023 11:47:04 +0000 (12:47 +0100)]
net/compat: Update msg_control_is_user when setting a kernel pointer

cmsghdr_from_user_compat_to_kern() is an unusual case w.r.t. how
the kmsg->msg_control* fields are used. The input struct msghdr
holds a pointer to a user buffer, i.e. kmsg->msg_control_user is
active. However, upon success, a kernel pointer is stored in
kmsg->msg_control. kmsg->msg_control_is_user should therefore be
updated accordingly.
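
The invariant can be illustrated with a userspace mock. The struct below only
mirrors the fields discussed in this message (it is not the kernel's
struct msghdr, and the __user annotation is dropped), and
mock_set_kernel_control() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mock of the relevant msghdr fields. */
struct mock_msghdr {
	union {
		void *msg_control;       /* valid when msg_control_is_user is false */
		void *msg_control_user;  /* valid when msg_control_is_user is true */
	};
	bool msg_control_is_user;
	size_t msg_controllen;
};

/*
 * Replace whatever the union held with a kernel buffer. The flag is
 * updated together with the pointer so that later readers pick the
 * right union member.
 */
static void mock_set_kernel_control(struct mock_msghdr *kmsg, void *kbuf,
				    size_t len)
{
	assert(kbuf != NULL);
	kmsg->msg_control_is_user = false;
	kmsg->msg_control = kbuf;
	kmsg->msg_controllen = len;
}
```

Because both pointers share storage, the flag is the only record of which
kind of pointer is currently stored; leaving it stale after the assignment is
the bug class this patch addresses.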

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net: Ensure ->msg_control_user is used for user buffers
Kevin Brodsky [Thu, 13 Apr 2023 11:47:03 +0000 (12:47 +0100)]
net: Ensure ->msg_control_user is used for user buffers

Since commit 1f466e1f15cf ("net: cleanly handle kernel vs user
buffers for ->msg_control"), pointers to user buffers should be
stored in struct msghdr::msg_control_user, instead of the
msg_control field.  Most users of msg_control have already been
converted (where user buffers are involved), but not all of them.

This patch attempts to address the remaining cases. An exception is
made for null checks, as it should be safe to use msg_control
unconditionally for that purpose.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago vsock/loopback: don't disable irqs for queue access
Arseniy Krasnov [Thu, 13 Apr 2023 09:17:19 +0000 (12:17 +0300)]
vsock/loopback: don't disable irqs for queue access

This replaces 'skb_queue_tail()' with 'virtio_vsock_skb_queue_tail()'.
The first one uses 'spin_lock_irqsave()', second uses 'spin_lock_bh()'.
There is no need to disable interrupts in the loopback transport as
there is no access to the queue with skbs from interrupt context. Both
virtio and vhost transports work in the same way.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago Merge branch 'mana-jumbo-frames'
David S. Miller [Fri, 14 Apr 2023 07:56:20 +0000 (08:56 +0100)]
Merge branch 'mana-jumbo-frames'

Haiyang Zhang says:

====================
net: mana: Add support for jumbo frame

The set adds support for jumbo frame,
with some optimization for the RX path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net: mana: Add support for jumbo frame
Haiyang Zhang [Wed, 12 Apr 2023 21:16:03 +0000 (14:16 -0700)]
net: mana: Add support for jumbo frame

During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select an MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when the MTU is set to too large a value, XDP is not allowed
to run.
Also, to prevent an MTU change from failing and leaving the NIC in a bad state,
pre-allocate all buffers before starting the change. So in a low memory
condition, it will return an error without affecting the NIC.
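
The pre-allocation strategy can be sketched in userspace as a two-phase
change. This is only an illustration of the idea, not the mana driver's code:
struct nic, nic_change_mtu() and the limited_alloc() failure knob are
hypothetical, and real RX buffers are DMA-mapped rather than malloc'ed.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_BUFS 4  /* hypothetical number of RX buffers */

struct nic {
	unsigned int mtu;
	void *rx_bufs[NUM_BUFS];
};

/* Knob to simulate low memory: allocations allowed before failure. */
static int allocs_left;

static void *limited_alloc(size_t size)
{
	if (allocs_left <= 0)
		return NULL;
	allocs_left--;
	return malloc(size);
}

/* Returns 0 on success, -1 on allocation failure (NIC left untouched). */
static int nic_change_mtu(struct nic *nic, unsigned int new_mtu)
{
	void *fresh[NUM_BUFS];
	int i;

	assert(new_mtu > 0);

	/* Phase 1: allocate everything up front; the NIC is not modified. */
	for (i = 0; i < NUM_BUFS; i++) {
		fresh[i] = limited_alloc(new_mtu);
		if (!fresh[i])
			goto err;
	}

	/* Phase 2: nothing here can fail, so the swap is all-or-nothing. */
	for (i = 0; i < NUM_BUFS; i++) {
		free(nic->rx_bufs[i]);
		nic->rx_bufs[i] = fresh[i];
	}
	nic->mtu = new_mtu;
	return 0;

err:
	/* Release only what phase 1 managed to allocate. */
	while (i-- > 0)
		free(fresh[i]);
	return -1;
}
```

The design choice is that every step that can fail happens before any state
of the running device is changed, so an error simply leaves the old MTU and
buffers in place.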

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net: mana: Enable RX path to handle various MTU sizes
Haiyang Zhang [Wed, 12 Apr 2023 21:16:02 +0000 (14:16 -0700)]
net: mana: Enable RX path to handle various MTU sizes

Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net: mana: Refactor RX buffer allocation code to prepare for various MTU
Haiyang Zhang [Wed, 12 Apr 2023 21:16:01 +0000 (14:16 -0700)]
net: mana: Refactor RX buffer allocation code to prepare for various MTU

Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago net: mana: Use napi_build_skb in RX path
Haiyang Zhang [Wed, 12 Apr 2023 21:16:00 +0000 (14:16 -0700)]
net: mana: Use napi_build_skb in RX path

Use napi_build_skb() instead of build_skb() to take advantage of the
NAPI percpu caches to obtain skbuff_head.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 months ago Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Fri, 14 Apr 2023 05:28:03 +0000 (22:28 -0700)]
Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-04-11

1) Vlad adds support for Linux bridge multicast offload
   Patches #1 through #9
   Synopsis

Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.

Steering architecture

Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
an additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
the egress table. An MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consequently, to several bridge ports:

                                                                                                                            +--------+--+
                                                                                    +---------------------------------------> Port 1 |  |
                                                                                    |                                       +-^------+--+
                                                                                    |                                         |
                                                                                    |                                         |
                                       +-----------------------------------------+  |     +---------------------------+       |
                                       | EGRESS table                            |  |  +--> PORT 1 multicast table    |       |
+----------------------------------+   +-----------------------------------------+  |  |  +---------------------------+       |
| INGRESS table                    |   |                                         |  |  |  |                           |       |
+----------------------------------+   | dst_mac=P1,vlan=X -> pop vlan, goto P1  +--+  |  | FG0:                      |       |
|                                  |   | dst_mac=P1,vlan=Y -> pop vlan, goto P1  |     |  | src_port=dst_port -> drop |       |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2  +--+  |  | FG1:                      |       |
| ...                              |   | dst_mac=P2,vlan=Y -> goto P2            |  |  |  | VLAN X -> pop, goto port  |       |
|                                  |   | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+  | ...                       |       |
+----------------------------------+   |                                         |  |  |  | VLAN Y -> pop, goto port  +-------+
                                       +-----------------------------------------+  |  |  | FG3:                      |
                                                                                    |  |  | matchall -> goto port     |
                                                                                    |  |  |                           |
                                                                                    |  |  +---------------------------+
                                                                                    |  |
                                                                                    |  |
                                                                                    |  |                                    +--------+--+
                                                                                    +---------------------------------------> Port 2 |  |
                                                                                       |                                    +-^------+--+
                                                                                       |                                      |
                                                                                       |                                      |
                                                                                       |  +---------------------------+       |
                                                                                       +--> PORT 2 multicast table    |       |
                                                                                          +---------------------------+       |
                                                                                          |                           |       |
                                                                                          | FG0:                      |       |
                                                                                          | src_port=dst_port -> drop |       |
                                                                                          | FG1:                      |       |
                                                                                          | VLAN X -> pop, goto port  |       |
                                                                                          | ...                       |       |
                                                                                          |                           |       |
                                                                                          | FG3:                      |       |
                                                                                          | matchall -> goto port     +-------+
                                                                                          |                           |
                                                                                          +---------------------------+
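
The replication path in the diagram can be modelled with a short userspace
sketch. It is a simplification under stated assumptions: each per-port table
holds a single "untagged" VLAN (the diagram allows several), ports are small
integers, and all struct and function names are hypothetical rather than
taken from the mlx5 bridge code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PORTS 4
#define NO_VLAN   0

struct pkt {
	unsigned int src_port;
	uint16_t vlan;
};

/* Per-port multicast table, reduced to one "untagged" VLAN. */
struct mcast_port_table {
	uint16_t untagged_vlan;  /* popped on egress; NO_VLAN if none */
};

struct delivery {
	bool sent;
	uint16_t vlan;  /* VLAN tag on the copy, NO_VLAN if popped */
};

static struct delivery port_table_apply(const struct mcast_port_table *t,
					unsigned int port, struct pkt p)
{
	struct delivery d = { .sent = false, .vlan = p.vlan };

	if (p.src_port == port)          /* FG0: drop loopback traffic */
		return d;
	if (t->untagged_vlan != NO_VLAN && p.vlan == t->untagged_vlan)
		d.vlan = NO_VLAN;        /* FG1: pop "untagged" VLAN */
	d.sent = true;                   /* FG1/FG3: forward to the port */
	return d;
}

/* MDB egress rule: replicate to every per-port table in 'port_mask'. */
static unsigned int mdb_replicate(const struct mcast_port_table tables[MAX_PORTS],
				  uint32_t port_mask, struct pkt p,
				  struct delivery out[MAX_PORTS])
{
	unsigned int copies = 0;

	assert(p.src_port < MAX_PORTS);
	for (unsigned int port = 0; port < MAX_PORTS; port++) {
		out[port] = (struct delivery){ .sent = false, .vlan = p.vlan };
		if (!(port_mask & (1u << port)))
			continue;
		out[port] = port_table_apply(&tables[port], port, p);
		if (out[port].sent)
			copies++;
	}
	return copies;
}
```

A packet that arrived on port 1 and matches an MDB entry for ports 1 and 2 is
dropped by port 1's table and delivered once, to port 2.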

Patches overview:

- Patch 1 adds hardware definition bits for capabilities required to replicate
  multicast packets to multiple per-port tables. These bits are used by
  following patches to only attempt multicast offload if firmware and hardware
  provide necessary support.

- Patches 2-4 are preparations and refactoring.

- Patch 5 implements necessary infrastructure to toggle multicast offload
  via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
  This also enables IGMP and MLD snooping.

- Patch 6 implements per-port multicast replication tables. It only supports
  filtering of loopback packets.

- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
  VLANs.

- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
  creates MDB replication rules in egress table that can replicate packets to
  multiple per-port multicast tables.

- Patch 9 adds tracepoints for MDB events.

==============

2) Parav creates a new allocation profile for SFs, to save on memory

3) Yevgeny provides some initial patches for upcoming software steering
   support of the new pattern/arguments type of modify_header actions.

Starting with ConnectX-6 Dx, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.

As a preparation Yevgeny provides the following 4 patches:
 - Patch 1: Add required mlx5_ifc HW bits
 - Patches 2, 3: Add a new WQE type and opcode that are required for pattern/arg
   support and add appropriate support in dr_send.c
 - Patch 4: Add ICM pool for modify-header-pattern objects and implement
   patterns cache, allowing patterns reuse for different flows

* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Add modify-header-pattern ICM pool
  net/mlx5: DR, Prepare sending new WQE type
  net/mlx5: Add new WQE for updating flow table
  net/mlx5: Add mlx5_ifc bits for modify header argument
  net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
  net/mlx5: Create a new profile for SFs
  net/mlx5: Bridge, add tracepoints for multicast
  net/mlx5: Bridge, implement mdb offload
  net/mlx5: Bridge, support multicast VLAN pop
  net/mlx5: Bridge, add per-port multicast replication tables
  net/mlx5: Bridge, snoop igmp/mld packets
  net/mlx5: Bridge, extract code to lookup parent bridge of port
  net/mlx5: Bridge, move additional data structures to priv header
  net/mlx5: Bridge, increase bridge tables sizes
  net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================

Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge branch 'add-kernel-tc-mqprio-and-tc-taprio-support-for-preemptible-traffic...
Jakub Kicinski [Fri, 14 Apr 2023 05:22:12 +0000 (22:22 -0700)]
Merge branch 'add-kernel-tc-mqprio-and-tc-taprio-support-for-preemptible-traffic-classes'

Vladimir Oltean says:

====================
Add kernel tc-mqprio and tc-taprio support for preemptible traffic classes

The last RFC in August 2022 contained a proposal for the UAPI of both
TSN standards which together form Frame Preemption (802.1Q and 802.3):
https://lore.kernel.org/netdev/20220816222920.1952936-1-vladimir.oltean@nxp.com/

It wasn't clear at the time whether the 802.1Q portion of Frame Preemption
should be exposed via the tc qdisc (mqprio, taprio) or via some other
layer (perhaps also ethtool like the 802.3 portion, or dcbnl), even
though the options were discussed extensively, with pros and cons:
https://lore.kernel.org/netdev/20220816222920.1952936-3-vladimir.oltean@nxp.com/

So the 802.3 portion got submitted separately and finally was accepted:
https://lore.kernel.org/netdev/20230119122705.73054-1-vladimir.oltean@nxp.com/

leaving the only remaining question: how do we expose the 802.1Q bits?

This series proposes that we use the Qdisc layer, through separate
(albeit very similar) UAPI in mqprio and taprio, and that both these
Qdiscs pass the information down to the offloading device driver through
the common mqprio offload structure (which taprio also passes).

An implementation is provided for the NXP LS1028A on-board Ethernet
endpoint (enetc). Previous versions also contained support for its
embedded switch (felix), but this needs more work and will be submitted
separately.

v4: https://lore.kernel.org/netdev/20230403103440.2895683-1-vladimir.oltean@nxp.com/
v2: https://lore.kernel.org/netdev/20230219135309.594188-1-vladimir.oltean@nxp.com/
v1: https://lore.kernel.org/netdev/20230216232126.3402975-1-vladimir.oltean@nxp.com/
====================

Link: https://lore.kernel.org/r/20230411180157.1850527-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: enetc: add support for preemptible traffic classes
Vladimir Oltean [Tue, 11 Apr 2023 18:01:57 +0000 (21:01 +0300)]
net: enetc: add support for preemptible traffic classes

PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.
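
The per-TC mapping can be pictured with a mock register file. This is a
userspace sketch under assumptions: the register array stands in for
PTC0FPR - PTC7FPR, the position of the "frame preemption enable" bit
(MOCK_PTCFPR_FPE) is hypothetical, and mock_set_preemptible_tcs() is not the
driver's function.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_NUM_TCS     8
#define MOCK_PTCFPR_FPE  (1u << 31)  /* hypothetical bit position */

/*
 * For each traffic class, set the enable bit if the TC is in the
 * preemptible mask (served by the pMAC) and clear it otherwise
 * (served by the eMAC). Other register bits are preserved.
 */
static void mock_set_preemptible_tcs(uint32_t ptcfpr[MOCK_NUM_TCS],
				     uint32_t preemptible_tcs)
{
	assert(preemptible_tcs < (1u << MOCK_NUM_TCS));
	for (unsigned int tc = 0; tc < MOCK_NUM_TCS; tc++) {
		uint32_t val = ptcfpr[tc];

		if (preemptible_tcs & (1u << tc))
			val |= MOCK_PTCFPR_FPE;
		else
			val &= ~MOCK_PTCFPR_FPE;
		ptcfpr[tc] = val;
	}
}
```

Writing all eight registers on every update keeps the hardware state a pure
function of the mask, so clearing the mask returns every TC to the eMAC.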

There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer becomes
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: enetc: rename "mqprio" to "qopt"
Vladimir Oltean [Tue, 11 Apr 2023 18:01:56 +0000 (21:01 +0300)]
net: enetc: rename "mqprio" to "qopt"

To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename the variable for just the "qopt" field from "mqprio" to "qopt".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: taprio: allow per-TC user input of FP adminStatus
Vladimir Oltean [Tue, 11 Apr 2023 18:01:55 +0000 (21:01 +0300)]
net/sched: taprio: allow per-TC user input of FP adminStatus

This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.

I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.

Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.

Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.

A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: mqprio: allow per-TC user input of FP adminStatus
Vladimir Oltean [Tue, 11 Apr 2023 18:01:54 +0000 (21:01 +0300)]
net/sched: mqprio: allow per-TC user input of FP adminStatus

IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).
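The relationship between the per-TC input and the per-priority view can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: the function name fp_tcs_to_prios(), the 16-priority limit, and the bitmask representation are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIO 16u /* assumption: 16 packet priorities */
#define MAX_TC   16u /* assumption: at most 16 traffic classes */

/*
 * Expand a per-TC "preemptible" bitmask into a per-priority bitmask by
 * walking prio_tc_map. All priorities that map to the same TC get the
 * same status, so the 802.1Q constraint holds by construction.
 */
static uint32_t fp_tcs_to_prios(const uint8_t prio_tc_map[NUM_PRIO],
                                unsigned int num_tc,
                                uint32_t preemptible_tcs)
{
    uint32_t preemptible_prios = 0;

    assert(num_tc <= MAX_TC);

    for (unsigned int prio = 0; prio < NUM_PRIO; prio++) {
        unsigned int tc = prio_tc_map[prio];

        /* Ignore priorities that map outside the configured TCs. */
        if (tc < num_tc && (preemptible_tcs & (1u << tc)))
            preemptible_prios |= 1u << prio;
    }

    return preemptible_prios;
}
```

A user space manager that exposes FP adminStatus per priority could run the inverse check: reject any configuration where two priorities of the same TC disagree.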

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI had been placed in any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: pass netlink extack to mqprio and taprio offload
Vladimir Oltean [Tue, 11 Apr 2023 18:01:53 +0000 (21:01 +0300)]
net/sched: pass netlink extack to mqprio and taprio offload

With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.

Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.

Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.

In fact, the taprio handling is a bit more tricky, for 2 reasons.

The first is that the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().

The second reason is that taprio overwrites the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.
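The two taprio points above can be modelled in plain user-space C. This is a sketch, not the kernel code: struct ext_ack, struct qopt_offload, ext_ack_set_weak(), enable_offload(), the two fake driver callbacks and all message strings are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct netlink_ext_ack: only a message. */
struct ext_ack {
    const char *msg;
};

/* Hypothetical offload structure that outlives the netlink request. */
struct qopt_offload {
    struct ext_ack *extack; /* valid only during the setup call */
};

/* "Weak" set: keep a message that the driver already provided. */
static void ext_ack_set_weak(struct ext_ack *extack, const char *msg)
{
    if (extack && !extack->msg)
        extack->msg = msg;
}

typedef int (*setup_fn)(struct qopt_offload *offload);

static int enable_offload(struct qopt_offload *offload,
                          struct ext_ack *extack, setup_fn setup)
{
    int err;

    assert(offload != NULL && setup != NULL);

    offload->extack = extack;  /* lend the extack to the driver */
    err = setup(offload);      /* driver may fill it in synchronously */
    if (err)
        ext_ack_set_weak(extack, "Device failed to set up offload");
    offload->extack = NULL;    /* never leave a dangling pointer behind */

    return err;
}

/* Fake driver that explains its failure. */
static int driver_setup_verbose(struct qopt_offload *offload)
{
    offload->extack->msg = "Cycle time not supported";
    return -1;
}

/* Fake driver that fails without a message. */
static int driver_setup_silent(struct qopt_offload *offload)
{
    (void)offload;
    return -1;
}
```

With the verbose driver, the caller sees the driver's message; with the silent one, it sees the generic fallback. In both cases the pointer stored in the offload structure is cleared before returning.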

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: mqprio: add an extack message to mqprio_parse_opt()
Vladimir Oltean [Tue, 11 Apr 2023 18:01:52 +0000 (21:01 +0300)]
net/sched: mqprio: add an extack message to mqprio_parse_opt()

Ferenc reports that a combination of poor iproute2 defaults and obscure
cases where the kernel returns -EINVAL makes it difficult to understand
what is wrong with this command:

$ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
$ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
        queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
RTNETLINK answers: Invalid argument

Hopefully with this patch, the cause is clearer:

Error: Device does not support hardware offload.

The kernel was (and still is) rejecting this because iproute2 defaults
to "hw 1" if this command line option is not specified.

Link: https://lore.kernel.org/netdev/ede5e9a2f27bf83bfb86d3e8c4ca7b34093b99e2.camel@inf.elte.hu/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: mqprio: add extack to mqprio_parse_nlattr()
Vladimir Oltean [Tue, 11 Apr 2023 18:01:51 +0000 (21:01 +0300)]
net/sched: mqprio: add extack to mqprio_parse_nlattr()

Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.

Try to make errors less sour by giving user space some information
regarding what went wrong.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS
Vladimir Oltean [Tue, 11 Apr 2023 18:01:50 +0000 (21:01 +0300)]
net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS

In commit 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and
shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
following immediately afterwards.

In commit feb2cf3dcfb9 ("net/sched: mqprio: refactor nlattr parsing to a
separate function"), we've moved the nlattr handling to a smaller
function, yet a small parse_attr() still remains, and the larger
mqprio_parse_nlattr() still does not have access to the beginning, and
the length, of the TCA_OPTIONS region containing these other nlattrs.

In a future change, the mqprio qdisc will need to iterate through this
nlattr region to discover other attributes, so eliminate parse_attr()
and add 2 variables in mqprio_parse_nlattr() which hold the beginning
and the length of the nlattr range.

We avoid the need to memset when nlattr_opt_len has insufficient length
by pre-initializing the table "tb".
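As a rough illustration of the layout described above, the start and length of the nlattr region can be derived from the TCA_OPTIONS payload as follows. This is a user-space sketch: struct fixed_qopt is only modelled on struct tc_mqprio_qopt, and nlattr_region() and struct nlattr_range are hypothetical names, not the variables added by the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NLA_ALIGNTO 4
#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))

/* Stand-in for the fixed portion; modelled on struct tc_mqprio_qopt. */
struct fixed_qopt {
    unsigned char  num_tc;
    unsigned char  prio_tc_map[16];
    unsigned char  hw;
    unsigned short count[16];
    unsigned short offset[16];
};

/* Beginning and length of the variable nlattr portion. */
struct nlattr_range {
    const unsigned char *start;
    size_t len;
};

static struct nlattr_range nlattr_region(const unsigned char *opt,
                                         size_t opt_len)
{
    size_t fixed = NLA_ALIGN(sizeof(struct fixed_qopt));
    struct nlattr_range r = { NULL, 0 };

    assert(opt != NULL);

    /* Attributes exist only if the payload extends past the fixed part. */
    if (opt_len > fixed) {
        r.start = opt + fixed;
        r.len = opt_len - fixed;
    }

    return r;
}
```

An empty range (start NULL, length 0) corresponds to the case where the payload carries only the fixed structure, which is why a pre-initialized attribute table needs no extra memset.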

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: ethtool: create and export ethtool_dev_mm_supported()
Vladimir Oltean [Tue, 11 Apr 2023 18:01:49 +0000 (21:01 +0300)]
net: ethtool: create and export ethtool_dev_mm_supported()

Create a wrapper over __ethtool_dev_mm_supported() which also calls
ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code
layers, such as tc, to make sure that preemptible TCs are supported
(this is true if an underlying MAC Merge layer exists).
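The wrapper pattern described above (begin, query, complete) can be sketched in user-space C. Everything here is a hypothetical model: struct dev_sketch, ops_begin(), ops_complete(), mm_supported_locked() and mm_supported() merely stand in for the kernel helpers named in the message, and returning false when the begin step fails is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model that records how the hooks were called. */
struct dev_sketch {
    int  begin_err;       /* error that ops_begin() should report */
    int  begin_calls;
    int  complete_calls;
    bool has_mac_merge;   /* whether a MAC Merge layer exists */
};

static int ops_begin(struct dev_sketch *dev)
{
    dev->begin_calls++;
    return dev->begin_err;
}

static void ops_complete(struct dev_sketch *dev)
{
    dev->complete_calls++;
}

/* Inner query; assumes the caller already ran ops_begin(). */
static bool mm_supported_locked(const struct dev_sketch *dev)
{
    return dev->has_mac_merge;
}

/* Exported-style wrapper: brackets the inner query with begin/complete. */
static bool mm_supported(struct dev_sketch *dev)
{
    bool supported;

    assert(dev != NULL);

    if (ops_begin(dev) < 0)
        return false;   /* complete must not run if begin failed */

    supported = mm_supported_locked(dev);
    ops_complete(dev);

    return supported;
}
```

The point of the wrapper is that callers outside ethtool never need to know about the bracketing calls; they only ask a yes/no question.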

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago tools: ynl: Rename ethtool to ethtool.py
Rahul Rameshbabu [Thu, 13 Apr 2023 01:22:52 +0000 (18:22 -0700)]
tools: ynl: Rename ethtool to ethtool.py

Make it explicit that this tool is not a drop-in replacement for ethtool.
This tool is intended for testing ethtool functionality implemented in the
kernel and should use a name that differentiates it from the ethtool
utility.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20230413012252.184434-2-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago tools: ynl: Remove absolute paths to yaml files from ethtool testing tool
Rahul Rameshbabu [Thu, 13 Apr 2023 01:22:51 +0000 (18:22 -0700)]
tools: ynl: Remove absolute paths to yaml files from ethtool testing tool

Absolute paths for the spec and schema files make the ethtool testing tool
unusable with freshly checked-out source trees. Replace absolute paths with
relative paths for both files in the Documentation/ directory.

Issue seen before the change

  Traceback (most recent call last):
    File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 424, in <module>
      main()
    File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 158, in main
      ynl = YnlFamily(spec, schema)
    File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/ynl.py", line 342, in __init__
      super().__init__(def_path, schema)
    File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/nlspec.py", line 333, in __init__
      with open(spec_path, "r") as stream:
  FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/sdf/src/linux/Documentation/netlink/specs/ethtool.yaml'

Fixes: f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20230413012252.184434-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge branch 'macb-ptp-minor-updates'
Jakub Kicinski [Fri, 14 Apr 2023 05:16:11 +0000 (22:16 -0700)]
Merge branch 'macb-ptp-minor-updates'

Harini Katakam says:

====================
Macb PTP minor updates

- Enable PTP unicast
- Optimize HW timestamp reading
====================

Link: https://lore.kernel.org/r/20230411123712.11459-1-harini.katakam@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: macb: Optimize reading HW timestamp
Harini Katakam [Tue, 11 Apr 2023 12:37:12 +0000 (18:07 +0530)]
net: macb: Optimize reading HW timestamp

The seconds input from the BD (6 bits) just needs to be ORed with the
upper bits from the timer in this function. Avoid an addition operation
every single time. Seconds rollover handling is left untouched.
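The idea can be illustrated with a small user-space sketch. The names BD_SEC_BITS, BD_SEC_MASK and combine_seconds() are hypothetical, and the rollover correction mentioned above is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define BD_SEC_BITS 6u
#define BD_SEC_MASK ((1u << BD_SEC_BITS) - 1u)

/*
 * Combine the 6 seconds bits from the buffer descriptor with the upper
 * bits of the timer's seconds value. Because the low 6 bits of the timer
 * value are cleared first, OR gives the same result as an addition.
 */
static uint64_t combine_seconds(uint64_t timer_sec, uint32_t bd_sec)
{
    assert((bd_sec & ~BD_SEC_MASK) == 0);

    return (timer_sec & ~(uint64_t)BD_SEC_MASK) | bd_sec;
}
```

For example, a timer value of 0x12345 and descriptor bits 0x2A combine to 0x1236A.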

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: macb: Enable PTP unicast
Harini Katakam [Tue, 11 Apr 2023 12:37:11 +0000 (18:07 +0530)]
net: macb: Enable PTP unicast

Enable transmission and reception of PTP unicast packets by
updating the PTP unicast config bit and setting the current HW MAC
address as the allowed address in the PTP unicast filter registers.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: macb: Update gem PTP support check
Harini Katakam [Tue, 11 Apr 2023 12:37:10 +0000 (18:07 +0530)]
net: macb: Update gem PTP support check

There are currently two checks for PTP functionality - one on GEM
capability and another on the kernel config option. Combine them
into a single function as there's no use case where gem_has_ptp is
true and MACB_USE_HWSTAMP is false.

Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge branch 'ocelot-felix-driver-cleanup'
Jakub Kicinski [Fri, 14 Apr 2023 04:56:08 +0000 (21:56 -0700)]
Merge branch 'ocelot-felix-driver-cleanup'

Vladimir Oltean says:

====================
Ocelot/Felix driver cleanup

The cleanup mostly handles the statistics code path - some issues
regarding understandability became apparent after the series
"Fix trainwreck with Ocelot switch statistics counters":
https://lore.kernel.org/netdev/20230321010325.897817-1-vladimir.oltean@nxp.com/

There is also one patch which cleans up a misleading comment
in the DSA felix_setup().
====================

Link: https://lore.kernel.org/r/20230412124737.2243527-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: fix ineffective WARN_ON() in ocelot_stats.c
Vladimir Oltean [Wed, 12 Apr 2023 12:47:37 +0000 (15:47 +0300)]
net: mscc: ocelot: fix ineffective WARN_ON() in ocelot_stats.c

It is hopefully now clear that, since "last" and "layout[i].reg"
are enum types and not addresses, the existing WARN_ON() is ineffective
in checking that the _addresses_ are sorted in the proper order.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: strengthen type of "int i" in ocelot_stats.c
Vladimir Oltean [Wed, 12 Apr 2023 12:47:36 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "int i" in ocelot_stats.c

The "int i" used to index the struct ocelot_stat_layout array actually
has a specific type: enum ocelot_stat. Use it, so that the WARN()
comment from ocelot_prepare_stats_regions() makes more sense.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: strengthen type of "u32 reg" and "u32 base" in ocelot_stats.c
Vladimir Oltean [Wed, 12 Apr 2023 12:47:35 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "u32 reg" and "u32 base" in ocelot_stats.c

Use the specific enum ocelot_reg to make it clear that the region
registers are encoded and not plain addresses.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: dsa: felix: remove confusing/incorrect comment from felix_setup()
Vladimir Oltean [Wed, 12 Apr 2023 12:47:34 +0000 (15:47 +0300)]
net: dsa: felix: remove confusing/incorrect comment from felix_setup()

That comment was written prior to knowing that what I was actually
seeing was a manifestation of the bug fixed in commit b4024c9e5c57
("felix: Fix initialization of ioremap resources").

There isn't any particular reason now why the hardware initialization is
done in felix_setup(), so just delete that comment to avoid spreading
misinformation.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: remove blank line at the end of ocelot_stats.c
Vladimir Oltean [Wed, 12 Apr 2023 12:47:33 +0000 (15:47 +0300)]
net: mscc: ocelot: remove blank line at the end of ocelot_stats.c

Commit a3bb8f521fd8 ("net: mscc: ocelot: remove unnecessary exposure of
stats structures") made an unnecessary change which was to add a new
line at the end of ocelot_stats.c. Remove it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: debugging print for statistics regions
Vladimir Oltean [Wed, 12 Apr 2023 12:47:32 +0000 (15:47 +0300)]
net: mscc: ocelot: debugging print for statistics regions

To make it easier to debug future issues with statistics counters not
getting aggregated properly into regions, like what happened in commit
6acc72a43eac ("net: mscc: ocelot: fix stats region batching"), add some
dev_dbg() prints which show the regions that were dynamically
determined.
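To give a feel for what "aggregating counters into regions" means, here is a user-space sketch that groups counter addresses into contiguous runs. It rests on assumptions not stated in the message: counters are taken to be 32 bits wide (a stride of 4 bytes), the input is taken to be sorted by address, and struct stats_region and build_regions() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of counters that are contiguous in address space. */
struct stats_region {
    uint32_t base;   /* address of the first counter */
    size_t   count;  /* number of counters in the run */
};

/*
 * Group sorted counter addresses into regions. A counter extends the
 * current region only if its address directly follows the previous one;
 * otherwise a new region is started. Returns the number of regions.
 */
static size_t build_regions(const uint32_t *addr, size_t n,
                            struct stats_region *out, size_t max_out)
{
    size_t nregions = 0;

    for (size_t i = 0; i < n; i++) {
        if (nregions > 0 && addr[i] == addr[i - 1] + 4) {
            out[nregions - 1].count++;
            continue;
        }

        assert(nregions < max_out);
        out[nregions].base = addr[i];
        out[nregions].count = 1;
        nregions++;
    }

    return nregions;
}
```

Printing each resulting base and count is the kind of information a debug print of the regions would expose, which makes a batching mistake visible at a glance.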

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: refactor enum ocelot_reg decoding to helper
Vladimir Oltean [Wed, 12 Apr 2023 12:47:31 +0000 (15:47 +0300)]
net: mscc: ocelot: refactor enum ocelot_reg decoding to helper

ocelot_io.c duplicates the decoding of an enum ocelot_reg (which holds
an enum ocelot_target in the upper bits and an index into a regmap array
in the lower bits) 4 times.

We'd like to reuse that logic once more, from ocelot.c. In order to do
that, let's consolidate the existing 4 instances into a header
accessible both by ocelot.c as well as by ocelot_io.c.
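The kind of decoding being consolidated can be sketched as below. The bit split (target in the top 8 bits, index in the low 24) and the names TARGET_SHIFT, INDEX_MASK, encode_reg() and decode_reg() are assumptions made for illustration, not necessarily what the driver header uses.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_SHIFT 24u                          /* assumed bit split */
#define INDEX_MASK   ((1u << TARGET_SHIFT) - 1u)

/* Result of decoding an encoded register value. */
struct decoded_reg {
    unsigned int target; /* which register block (upper bits) */
    unsigned int index;  /* index into that block's map (lower bits) */
};

/* Build an encoded value from a target and an index. */
static uint32_t encode_reg(unsigned int target, unsigned int index)
{
    assert(target <= 0xFFu);
    assert(index <= INDEX_MASK);

    return ((uint32_t)target << TARGET_SHIFT) | index;
}

/* Single helper that every accessor can share instead of open-coding. */
static struct decoded_reg decode_reg(uint32_t reg)
{
    struct decoded_reg d;

    d.target = reg >> TARGET_SHIFT;
    d.index = reg & INDEX_MASK;

    return d;
}
```

Keeping the split in one helper means a caller cannot accidentally treat the encoded value as a plain address.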

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors
Vladimir Oltean [Wed, 12 Apr 2023 12:47:30 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors

The "u32 reg" argument that is passed to these functions is not a plain
address, but rather a driver-specific encoding of another enum
ocelot_target target in the upper bits, and an index into the
u32 ocelot->map[target][] array in the lower bits. That encoded value
takes the type "enum ocelot_reg" and is what is passed to these I/O
functions, so let's actually use that to prevent type confusion.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Daniel Borkmann says:
Jakub Kicinski [Thu, 13 Apr 2023 23:11:22 +0000 (16:11 -0700)]
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Add support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on their own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 13 Apr 2023 23:04:28 +0000 (16:04 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Conflicts:

tools/testing/selftests/net/config
  62199e3f1658 ("selftests: net: Add VXLAN MDB test")
  3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 13 Apr 2023 22:33:04 +0000 (15:33 -0700)]
Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, and bluetooth.

  Not all that quiet given spring celebrations, but "current" fixes are
  thinning out, which is encouraging. One outstanding regression in the
  mlx5 driver when using old FW, not blocking but we're pushing for a
  fix.

  Current release - new code bugs:

   - eth: enetc: workaround for unresponsive pMAC after receiving
     express traffic

  Previous releases - regressions:

   - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the
     pid/seq fields 0 for backward compatibility

  Previous releases - always broken:

   - sctp: fix a potential overflow in sctp_ifwdtsn_skip

   - mptcp:
      - use mptcp_schedule_work instead of open-coding it and make the
        worker check stricter, to avoid scheduling work on closed
        sockets
      - fix NULL pointer dereference on fastopen early fallback

   - skbuff: fix memory corruption due to a race between skb coalescing
     and releasing clones confusing page_pool reference counting

   - bonding: fix neighbor solicitation validation on backup slaves

   - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp

   - bpf: arm64: fixed a BTI error on returning to patched function

   - openvswitch: fix race on port output leading to inf loop

   - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid
     returning a different errno than expected

   - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove

   - Bluetooth: fix printing errors if LE Connection times out

   - Bluetooth: assorted UaF, deadlock and data race fixes

   - eth: macb: fix memory corruption in extended buffer descriptor mode

  Misc:

   - adjust the XDP Rx flow hash API to also include the protocol layers
     over which the hash was computed"

* tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
  selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
  mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
  veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
  mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
  xdp: rss hash types representation
  selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
  skbuff: Fix a race between coalescing and releasing SKBs
  net: macb: fix a memory corruption in extended buffer descriptor mode
  selftests: add the missing CONFIG_IP_SCTP in net config
  udp6: fix potential access to stale information
  selftests: openvswitch: adjust datapath NL message declaration
  selftests: mptcp: userspace pm: uniform verify events
  mptcp: fix NULL pointer dereference on fastopen early fallback
  mptcp: stricter state check in mptcp_worker
  mptcp: use mptcp_schedule_work instead of open-coding it
  net: enetc: workaround for unresponsive pMAC after receiving express traffic
  sctp: fix a potential overflow in sctp_ifwdtsn_skip
  net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()
  rtnetlink: Restore RTM_NEW/DELLINK notification behavior
  net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes
  ...

19 months ago Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 13 Apr 2023 22:21:56 +0000 (15:21 -0700)]
Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Fix interaction between fw_devlink and DT overlays causing devices to
   not be probed

 - Fix the compatible string for loongson,cpu-interrupt-controller

* tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  treewide: Fix probing of devices in DT overlays
  dt-bindings: interrupt-controller: loongarch: Fix mismatched compatible

19 months ago Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Thu, 13 Apr 2023 22:17:59 +0000 (15:17 -0700)]
Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fix from Linus Walleij:
 "This is just a revert of the AMD fix, because the fix broke some
  laptops. We are working on a proper solution"

* tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  Revert "pinctrl: amd: Disable and mask interrupts on resume"

19 months ago Merge tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Thu, 13 Apr 2023 21:58:55 +0000 (14:58 -0700)]
Merge tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:

 - two fbcon regressions

 - amdgpu: dp mst, smu13

 - i915: dual link dsi for tgl+

 - armada, nouveau, drm/sched, fbmem

* tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm:
  fbcon: set_con2fb_map needs to set con2fb_map!
  fbcon: Fix error paths in set_con2fb_map
  drm/amd/pm: correct the pcie link state check for SMU13
  drm/amd/pm: correct SMU13.0.7 max shader clock reporting
  drm/amd/pm: correct SMU13.0.7 pstate profiling clock settings
  drm/amd/display: Pass the right info to drm_dp_remove_payload
  drm/armada: Fix a potential double free in an error handling path
  fbmem: Reject FB_ACTIVATE_KD_TEXT from userspace
  drm/nouveau/fb: add missing sysmen flush callbacks
  drm/i915/dsi: fix DSS CTL register offsets for TGL+
  drm/scheduler: Fix UAF race in drm_sched_entity_push_job()

19 months ago Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Jakub Kicinski [Thu, 13 Apr 2023 20:04:44 +0000 (13:04 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-04-13

We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 14 files changed, 205 insertions(+), 38 deletions(-).

The main changes are:

1) One late straggler fix on the XDP hints side which fixes
   bpf_xdp_metadata_rx_hash kfunc API before the release goes out
   in order to provide information on the RSS hash type,
   from Jesper Dangaard Brouer.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
  mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
  veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
  mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
  xdp: rss hash types representation
  selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
====================

Link: https://lore.kernel.org/r/20230413192939.10202-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months ago Merge tag 'drm-misc-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-misc...
Daniel Vetter [Thu, 13 Apr 2023 18:47:58 +0000 (20:47 +0200)]
Merge tag 'drm-misc-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Short summary of fixes pull:

 * armada: Fix double free
 * fb: Clear FB_ACTIVATE_KD_TEXT in ioctl
 * nouveau: Add missing callbacks
 * scheduler: Fix use-after-free error

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230413184233.GA8148@linux-uq9g
19 months ago bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap
Daniel Borkmann [Thu, 13 Apr 2023 18:28:42 +0000 (20:28 +0200)]
bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap

syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf,
sockmap: fix deadlocks in the sockhash and sockmap"):

  [...]
  WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
  Modules linked in:
  CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344561d #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
  RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
  [...]
  Call Trace:
  <TASK>
  spin_unlock_bh include/linux/spinlock.h:395 [inline]
  sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165
  sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184
  sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945
  map_delete_elem kernel/bpf/syscall.c:1536 [inline]
  __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053
  __do_sys_bpf kernel/bpf/syscall.c:5166 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:5164 [inline]
  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  RIP: 0033:0x7fe8f7c8c169
  </TASK>
  [...]

Revert for now until we have a proper solution.

Fixes: ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap")
Reported-by: syzbot+49f6cef45247ff249498@syzkaller.appspotmail.com
Cc: Hsin-Wei Hung <hsinweih@uci.edu>
Cc: Xin Liu <liuxin350@huawei.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/000000000000f1db9605f939720e@google.com/
19 months ago Merge branch 'XDP-hints: change RX-hash kfunc bpf_xdp_metadata_rx_hash'
Alexei Starovoitov [Thu, 13 Apr 2023 18:05:49 +0000 (11:05 -0700)]
Merge branch 'XDP-hints: change RX-hash kfunc bpf_xdp_metadata_rx_hash'

Jesper Dangaard Brouer says:

====================

Current API for bpf_xdp_metadata_rx_hash() returns the raw RSS hash value,
but doesn't provide information on the RSS hash type (part of 6.3-rc).

This patchset proposal is to change the function call signature via adding
a pointer value argument for providing the RSS hash type.

Patchset also removes all bpf_printk's from xdp_hw_metadata program
that we expect driver developers to use. Instead counters are introduced
for relaying e.g. skip and fail info.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:49:00 +0000 (21:49 +0200)]
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg

Update BPF selftests to use the new RSS type argument for kfunc
bpf_xdp_metadata_rx_hash.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132894068.340624.8914711185697163690.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agomlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:48:55 +0000 (21:48 +0200)]
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type

Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via matching individual Completion Queue Entry (CQE) status bits.

Fixes: ab46182d0dcb ("net/mlx4_en: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132893562.340624.12779118462402031248.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoveth: bpf_xdp_metadata_rx_hash add xdp rss hash type
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:48:50 +0000 (21:48 +0200)]
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type

Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type.

The veth driver currently only supports XDP-hints based on the SKB code path.
The SKB has lost information about the RSS hash type by compressing
the information down to a single bitfield, skb->l4_hash, which only knows
whether this was an L4 hash value.

In preparation for veth, the xdp_rss_hash_type has an L4 indication
bit that allows us to return a meaningful L4 indication when working
with SKB-based packets.

Fixes: 306531f0249f ("veth: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132893055.340624.16209448340644513469.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agomlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:48:45 +0000 (21:48 +0200)]
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type

Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via mapping table.

The mlx5 hardware can also identify and RSS hash IPSEC.  This indicates
that the hash includes the SPI (Security Parameters Index) as part of the
IPSEC hash.

Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.

Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoxdp: rss hash types representation
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:48:40 +0000 (21:48 +0200)]
xdp: rss hash types representation

The RSS hash type specifies what portion of packet data the NIC hardware used
when calculating the RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often gets hash
value zero and no RSS type. L3 is focused on IPv4 vs. IPv6, and L4 is
primarily TCP vs. UDP, but some hardware supports SCTP.

Hardware RSS types are encoded differently for each hardware NIC. Most
hardware represents the RSS hash type as a number. Determining L3 vs. L4 often
requires a mapping table, as there often isn't a pattern or sorting
according to OSI layer.

The patch introduces an XDP RSS hash type (enum xdp_rss_hash_type) that
contains both bits for the L3/L4 types and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits gets exposed
to BPF via BTF, and it is up to the BPF programmer to match using these
defines.
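
As a rough illustration of the scheme described above (not the kernel's
actual definitions), the sketch below uses hypothetical DEMO_* names:
individual bits for L3/L4 properties, combinations built from those bits,
and a driver-side table that maps a made-up hardware type number to one of
the combinations.

```c
#include <stddef.h>

/* Hypothetical layout; the real enum xdp_rss_hash_type may differ. */
enum demo_rss_hash_type {
	/* Individual bits for L3/L4 properties */
	DEMO_RSS_L3_IPV4 = 1u << 0,
	DEMO_RSS_L3_IPV6 = 1u << 1,
	DEMO_RSS_L4      = 1u << 2,	/* some L4 header was hashed */
	DEMO_RSS_L4_TCP  = 1u << 3,
	DEMO_RSS_L4_UDP  = 1u << 4,

	/* Combinations intended for driver mapping tables */
	DEMO_RSS_TYPE_NONE        = 0,
	DEMO_RSS_TYPE_L3_IPV4     = DEMO_RSS_L3_IPV4,
	DEMO_RSS_TYPE_L3_IPV6     = DEMO_RSS_L3_IPV6,
	DEMO_RSS_TYPE_L4_IPV4_TCP = DEMO_RSS_L3_IPV4 | DEMO_RSS_L4 | DEMO_RSS_L4_TCP,
	DEMO_RSS_TYPE_L4_IPV4_UDP = DEMO_RSS_L3_IPV4 | DEMO_RSS_L4 | DEMO_RSS_L4_UDP,
	DEMO_RSS_TYPE_L4_IPV6_TCP = DEMO_RSS_L3_IPV6 | DEMO_RSS_L4 | DEMO_RSS_L4_TCP,
};

/* Made-up hardware encoding: the index is the number the NIC reports. */
static const enum demo_rss_hash_type demo_hw_to_rss[] = {
	[0] = DEMO_RSS_TYPE_NONE,
	[1] = DEMO_RSS_TYPE_L3_IPV4,
	[2] = DEMO_RSS_TYPE_L4_IPV4_TCP,
	[3] = DEMO_RSS_TYPE_L4_IPV4_UDP,
	[4] = DEMO_RSS_TYPE_L3_IPV6,
	[5] = DEMO_RSS_TYPE_L4_IPV6_TCP,
};

/* Translate a hardware type number; unknown values map to NONE. */
static enum demo_rss_hash_type demo_rss_type(unsigned int hw_type)
{
	size_t n = sizeof(demo_hw_to_rss) / sizeof(demo_hw_to_rss[0]);

	return hw_type < n ? demo_hw_to_rss[hw_type] : DEMO_RSS_TYPE_NONE;
}

/* A consumer can test a single bit without knowing the hardware encoding. */
static int demo_is_l4(enum demo_rss_hash_type type)
{
	return (type & DEMO_RSS_L4) != 0;
}
```

With this layout a consumer only needs the generic L4 bit to decide whether
the hash covered a transport header, which is the kind of matching the
message leaves to the BPF programmer.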

This proposal changes the kfunc API bpf_xdp_metadata_rx_hash() by adding
a pointer value argument for providing the RSS hash type.
Change the signature for all xmo_rx_hash calls in drivers to make it compile.

The RSS type implementations for each driver come as separate patches.

Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
Jesper Dangaard Brouer [Wed, 12 Apr 2023 19:48:35 +0000 (21:48 +0200)]
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters

The tool xdp_hw_metadata can be used by driver developers
implementing XDP-hints metadata kfuncs.

Remove all bpf_printk calls, as the tool already transfers all the
XDP-hints related information via metadata area to AF_XDP
userspace process.

Add counters for providing remaining information about failure and
skipped packet events.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132891533.340624.7313781245316405141.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agofbcon: set_con2fb_map needs to set con2fb_map!
Daniel Vetter [Wed, 12 Apr 2023 15:31:46 +0000 (17:31 +0200)]
fbcon: set_con2fb_map needs to set con2fb_map!

I got really badly confused in d443d9386472 ("fbcon: move more common
code into fb_open()") because we set the con2fb_map before the failure
points, which didn't look good.

But in trying to fix that I moved the assignment into the wrong path -
we need to do it for _all_ vc we take over, not just the first one
(which additionally requires the call to con2fb_acquire_newinfo).

I've figured this out because of a KASAN bug report, where the
fbcon_registered_fb and fbcon_display arrays went out of sync in
fbcon_mode_deleted() because the con2fb_map pointed at the old
fb_info, but the modes and everything was updated for the new one.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Xingyuan Mo <hdthky0@gmail.com>
Fixes: d443d9386472 ("fbcon: move more common code into fb_open()")
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Xingyuan Mo <hdthky0@gmail.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v5.19+
19 months agofbcon: Fix error paths in set_con2fb_map
Daniel Vetter [Wed, 12 Apr 2023 15:23:49 +0000 (17:23 +0200)]
fbcon: Fix error paths in set_con2fb_map

This is a regression introduced in b07db3958485 ("fbcon: Ditch error
handling for con2fb_release_oldinfo"). I failed to realize what the if
(!err) checks. The mentioned commit was dropping the
con2fb_release_oldinfo() return value but the if (!err) was also
checking whether the con2fb_acquire_newinfo() function call above
failed or not.

Fix this with an early return statement.

Note that there's still a difference compared to the original state of
the code; the lines below are now also skipped on error:

	if (!search_fb_in_map(info_idx))
		info_idx = newidx;

These are only needed when we've actually thrown out an old fb_info
from the console mappings, which only happens later on.

Also move the fbcon_add_cursor_work() call into the same if block,
it's all protected by console_lock so doesn't matter when we set up
the blinking cursor delayed work anyway. This further simplifies the
control flow and allows us to ditch the found local variable.

v2: Clarify commit message (Javier)

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Xingyuan Mo <hdthky0@gmail.com>
Fixes: b07db3958485 ("fbcon: Ditch error handling for con2fb_release_oldinfo")
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Xingyuan Mo <hdthky0@gmail.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v5.19+
19 months agoskbuff: Fix a race between coalescing and releasing SKBs
Liang Chen [Thu, 13 Apr 2023 09:03:53 +0000 (17:03 +0800)]
skbuff: Fix a race between coalescing and releasing SKBs

Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
recycling") allowed coalescing to proceed with non page pool page and page
pool page when @from is cloned, i.e.

to->pp_recycle    --> false
from->pp_recycle  --> true
skb_cloned(from)  --> true

However, it actually requires skb_cloned(@from) to hold true until
coalescing finishes in this situation. If the other cloned SKB is
released while the merging is in process, from_shinfo->nr_frags will be
set to 0 toward the end of the function, causing the increment of frag
page _refcount to be unexpectedly skipped resulting in inconsistent
reference counts. Later when SKB(@to) is released, it frees the page
directly even though the page pool page is still in use, leading to
use-after-free or double-free errors. So it should be prohibited.

The double-free error message below prompted us to investigate:
BUG: Bad page state in process swapper/1  pfn:0e0d1
page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
index:0x2 pfn:0xe0d1
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount

CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
Call Trace:
 <IRQ>
dump_stack_lvl+0x32/0x50
bad_page+0x69/0xf0
free_pcp_prepare+0x260/0x2f0
free_unref_page+0x20/0x1c0
skb_release_data+0x10b/0x1a0
napi_consume_skb+0x56/0x150
net_rx_action+0xf0/0x350
? __napi_schedule+0x79/0x90
__do_softirq+0xc8/0x2b1
__irq_exit_rcu+0xb9/0xf0
common_interrupt+0x82/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0xb/0x20

Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agonet: macb: fix a memory corruption in extended buffer descriptor mode
Roman Gushchin [Wed, 12 Apr 2023 23:21:44 +0000 (16:21 -0700)]
net: macb: fix a memory corruption in extended buffer descriptor mode

For quite some time we were chasing a bug which looked like a sudden
permanent failure of networking and mmc on some of our devices.
The bug was very sensitive to any software changes and even more to
any kernel debug options.

Finally we got a setup where the problem was reproducible with
CONFIG_DMA_API_DEBUG=y and it revealed the issue with the rx dma:

[   16.992082] ------------[ cut here ]------------
[   16.996779] DMA-API: macb ff0b0000.ethernet: device driver tries to free DMA memory it has not allocated [device address=0x0000000875e3e244] [size=1536 bytes]
[   17.011049] WARNING: CPU: 0 PID: 85 at kernel/dma/debug.c:1011 check_unmap+0x6a0/0x900
[   17.018977] Modules linked in: xxxxx
[   17.038823] CPU: 0 PID: 85 Comm: irq/55-8000f000 Not tainted 5.4.0 #28
[   17.045345] Hardware name: xxxxx
[   17.049528] pstate: 60000005 (nZCv daif -PAN -UAO)
[   17.054322] pc : check_unmap+0x6a0/0x900
[   17.058243] lr : check_unmap+0x6a0/0x900
[   17.062163] sp : ffffffc010003c40
[   17.065470] x29: ffffffc010003c40 x28: 000000004000c03c
[   17.070783] x27: ffffffc010da7048 x26: ffffff8878e38800
[   17.076095] x25: ffffff8879d22810 x24: ffffffc010003cc8
[   17.081407] x23: 0000000000000000 x22: ffffffc010a08750
[   17.086719] x21: ffffff8878e3c7c0 x20: ffffffc010acb000
[   17.092032] x19: 0000000875e3e244 x18: 0000000000000010
[   17.097343] x17: 0000000000000000 x16: 0000000000000000
[   17.102647] x15: ffffff8879e4a988 x14: 0720072007200720
[   17.107959] x13: 0720072007200720 x12: 0720072007200720
[   17.113261] x11: 0720072007200720 x10: 0720072007200720
[   17.118565] x9 : 0720072007200720 x8 : 000000000000022d
[   17.123869] x7 : 0000000000000015 x6 : 0000000000000098
[   17.129173] x5 : 0000000000000000 x4 : 0000000000000000
[   17.134475] x3 : 00000000ffffffff x2 : ffffffc010a1d370
[   17.139778] x1 : b420c9d75d27bb00 x0 : 0000000000000000
[   17.145082] Call trace:
[   17.147524]  check_unmap+0x6a0/0x900
[   17.151091]  debug_dma_unmap_page+0x88/0x90
[   17.155266]  gem_rx+0x114/0x2f0
[   17.158396]  macb_poll+0x58/0x100
[   17.161705]  net_rx_action+0x118/0x400
[   17.165445]  __do_softirq+0x138/0x36c
[   17.169100]  irq_exit+0x98/0xc0
[   17.172234]  __handle_domain_irq+0x64/0xc0
[   17.176320]  gic_handle_irq+0x5c/0xc0
[   17.179974]  el1_irq+0xb8/0x140
[   17.183109]  xiic_process+0x5c/0xe30
[   17.186677]  irq_thread_fn+0x28/0x90
[   17.190244]  irq_thread+0x208/0x2a0
[   17.193724]  kthread+0x130/0x140
[   17.196945]  ret_from_fork+0x10/0x20
[   17.200510] ---[ end trace 7240980785f81d6f ]---

[  237.021490] ------------[ cut here ]------------
[  237.026129] DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000021d79e7b
[  237.033886] WARNING: CPU: 0 PID: 0 at kernel/dma/debug.c:499 add_dma_entry+0x214/0x240
[  237.041802] Modules linked in: xxxxx
[  237.061637] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W         5.4.0 #28
[  237.068941] Hardware name: xxxxx
[  237.073116] pstate: 80000085 (Nzcv daIf -PAN -UAO)
[  237.077900] pc : add_dma_entry+0x214/0x240
[  237.081986] lr : add_dma_entry+0x214/0x240
[  237.086072] sp : ffffffc010003c30
[  237.089379] x29: ffffffc010003c30 x28: ffffff8878a0be00
[  237.094683] x27: 0000000000000180 x26: ffffff8878e387c0
[  237.099987] x25: 0000000000000002 x24: 0000000000000000
[  237.105290] x23: 000000000000003b x22: ffffffc010a0fa00
[  237.110594] x21: 0000000021d79e7b x20: ffffffc010abe600
[  237.115897] x19: 00000000ffffffef x18: 0000000000000010
[  237.121201] x17: 0000000000000000 x16: 0000000000000000
[  237.126504] x15: ffffffc010a0fdc8 x14: 0720072007200720
[  237.131807] x13: 0720072007200720 x12: 0720072007200720
[  237.137111] x11: 0720072007200720 x10: 0720072007200720
[  237.142415] x9 : 0720072007200720 x8 : 0000000000000259
[  237.147718] x7 : 0000000000000001 x6 : 0000000000000000
[  237.153022] x5 : ffffffc010003a20 x4 : 0000000000000001
[  237.158325] x3 : 0000000000000006 x2 : 0000000000000007
[  237.163628] x1 : 8ac721b3a7dc1c00 x0 : 0000000000000000
[  237.168932] Call trace:
[  237.171373]  add_dma_entry+0x214/0x240
[  237.175115]  debug_dma_map_page+0xf8/0x120
[  237.179203]  gem_rx_refill+0x190/0x280
[  237.182942]  gem_rx+0x224/0x2f0
[  237.186075]  macb_poll+0x58/0x100
[  237.189384]  net_rx_action+0x118/0x400
[  237.193125]  __do_softirq+0x138/0x36c
[  237.196780]  irq_exit+0x98/0xc0
[  237.199914]  __handle_domain_irq+0x64/0xc0
[  237.204000]  gic_handle_irq+0x5c/0xc0
[  237.207654]  el1_irq+0xb8/0x140
[  237.210789]  arch_cpu_idle+0x40/0x200
[  237.214444]  default_idle_call+0x18/0x30
[  237.218359]  do_idle+0x200/0x280
[  237.221578]  cpu_startup_entry+0x20/0x30
[  237.225493]  rest_init+0xe4/0xf0
[  237.228713]  arch_call_rest_init+0xc/0x14
[  237.232714]  start_kernel+0x47c/0x4a8
[  237.236367] ---[ end trace 7240980785f81d70 ]---

Lars was fast to find an explanation: according to the datasheet
bit 2 of the rx buffer descriptor entry has a different meaning in the
extended mode:
  Address [2] of beginning of buffer, or
  in extended buffer descriptor mode (DMA configuration register [28] = 1),
  indicates a valid timestamp in the buffer descriptor entry.

The macb driver didn't mask this bit while getting an address and it
eventually caused a memory corruption and a dma failure.

The problem is resolved by explicitly clearing the problematic bit
if hw timestamping is used.
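
A minimal sketch of the masking idea, assuming a 32-bit descriptor address
word in which bits 0 and 1 carry ownership and wrap flags (an assumption of
this example) and bit 2 is the timestamp-valid flag in extended mode, as the
datasheet excerpt above says. The DEMO_* names and the helper are
hypothetical, not the driver's actual identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for flag bits stored in the RX address word. */
#define DEMO_RX_USED_BIT	(1u << 0)	/* assumed: ownership flag */
#define DEMO_RX_WRAP_BIT	(1u << 1)	/* assumed: last descriptor in ring */
#define DEMO_RX_TS_VALID_BIT	(1u << 2)	/* extended mode: valid timestamp */

/* Recover the buffer address from the descriptor word. */
static uint32_t demo_rx_buf_addr(uint32_t desc_word, bool ext_desc_mode)
{
	uint32_t addr = desc_word & ~(DEMO_RX_USED_BIT | DEMO_RX_WRAP_BIT);

	/* In extended mode bit 2 is a flag, not part of the address. */
	if (ext_desc_mode)
		addr &= ~DEMO_RX_TS_VALID_BIT;

	return addr;
}
```

Without the extra mask, a descriptor word ending in 0x...e244 would be
treated as an address that is 4 bytes off from the mapped buffer, which is
consistent with the kind of mismatch the DMA-API warning above reports.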

Fixes: 7b4296148066 ("net: macb: Add support for PTP timestamps in DMA descriptors")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230412232144.770336-1-roman.gushchin@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoselftests: add the missing CONFIG_IP_SCTP in net config
Xin Long [Wed, 12 Apr 2023 15:13:06 +0000 (11:13 -0400)]
selftests: add the missing CONFIG_IP_SCTP in net config

The selftest sctp_vrf needs CONFIG_IP_SCTP set in config
when building the kernel, so add it.

Fixes: a61bd7b9fef3 ("selftests: add a selftest for sctp vrf")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/61dddebc4d2dd98fe7fb145e24d4b2430e42b572.1681312386.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoudp6: fix potential access to stale information
Eric Dumazet [Wed, 12 Apr 2023 13:03:08 +0000 (13:03 +0000)]
udp6: fix potential access to stale information

lena wang reported an issue caused by udpv6_sendmsg()
mangling msg->msg_name and msg->msg_namelen, which
are later read from ____sys_sendmsg() :

	/*
	 * If this is sendmmsg() and sending to current destination address was
	 * successful, remember it.
	 */
	if (used_address && err >= 0) {
		used_address->name_len = msg_sys->msg_namelen;
		if (msg_sys->msg_name)
			memcpy(&used_address->name, msg_sys->msg_name,
			       used_address->name_len);
	}

udpv6_sendmsg() wants to pretend the remote address family
is AF_INET in order to call udp_sendmsg().

A fix would be to modify the address in-place, instead
of using a local variable, but this could have other side effects.

Instead, restore initial values before we return from udpv6_sendmsg().
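
The save-and-restore approach can be sketched as follows. The struct and
function names (demo_msghdr, demo_send_v4, demo_send_v6_mapped) are
hypothetical stand-ins for illustration, not the kernel's code.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for the message header. */
struct demo_msghdr {
	void *msg_name;
	int msg_namelen;
};

/* Stand-in for the IPv4 send path; it only reads the header here. */
static int demo_send_v4(struct demo_msghdr *msg)
{
	(void)msg;
	return 0;
}

/* Temporarily present an IPv4 address, then restore the caller's view. */
static int demo_send_v6_mapped(struct demo_msghdr *msg, void *v4_addr, int v4_len)
{
	void *saved_name = msg->msg_name;
	int saved_len = msg->msg_namelen;
	int err;

	msg->msg_name = v4_addr;
	msg->msg_namelen = v4_len;
	err = demo_send_v4(msg);

	/* Restore the initial values so later readers see what they passed in. */
	msg->msg_name = saved_name;
	msg->msg_namelen = saved_len;

	return err;
}
```

The point of the pattern is that code running after the call, such as the
sendmmsg() bookkeeping quoted above, never observes a pointer to a local
variable that has gone out of scope.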

Fixes: c71d8ebe7a44 ("net: Fix security_socket_sendmsg() bypass problem.")
Reported-by: lena wang <lena.wang@mediatek.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230412130308.1202254-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoselftests: openvswitch: adjust datapath NL message declaration
Aaron Conole [Wed, 12 Apr 2023 11:58:28 +0000 (07:58 -0400)]
selftests: openvswitch: adjust datapath NL message declaration

The netlink message for creating a new datapath takes an array
of ports for the PID creation.  This shouldn't cause much issue
but correct it for future cases where we need to do decode of
datapath information that could include the per-cpu PID map.

Fixes: 25f16c873fb1 ("selftests: add openvswitch selftest suite")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230412115828.3991806-1-aconole@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoMerge branch 'mptcp-more-fixes-for-6-3'
Jakub Kicinski [Thu, 13 Apr 2023 16:59:00 +0000 (09:59 -0700)]
Merge branch 'mptcp-more-fixes-for-6-3'

Matthieu Baerts says:

====================
mptcp: more fixes for 6.3

Patch 1 avoids scheduling the MPTCP worker on a closed socket on some
edge cases. It fixes issues that can be visible from v5.11.

Patch 2 makes sure the MPTCP worker doesn't try to manipulate
disconnected sockets. This is also a fix for an issue that can be
visible from v5.11.

Patch 3 fixes a NULL pointer dereference when MPTCP FastOpen is used
and an early fallback is done. A fix for v6.2.

Patch 4 improves the stability of the userspace PM selftest for a
subtest added in v6.2.
====================

Link: https://lore.kernel.org/r/20230411-upstream-net-20230411-mptcp-fixes-v1-0-ca540f3ef986@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agoselftests: mptcp: userspace pm: uniform verify events
Matthieu Baerts [Tue, 11 Apr 2023 20:42:12 +0000 (22:42 +0200)]
selftests: mptcp: userspace pm: uniform verify events

Simply adding a "sleep" before checking something is usually not a good
idea because the chosen duration may be too short or too long. It is
better to wait for events with a timeout.

In this selftest, 'sleep 0.5' is used more than 40 times. It is always
used before calling a 'verify_*' function except for this
verify_listener_events which has been added later.

In the end, using all these 'sleep 0.5' seems to work: the slow CIs
don't complain so far. Also, because it doesn't take too much time, we
can just add two more 'sleep 0.5' to make what is done before calling
a 'verify_*' function uniform. For the same reasons, we can also delay a
bigger refactoring to replace all these 'sleep 0.5' by functions waiting for
events instead of waiting for a fixed time and hoping for the best.

Fixes: 6c73008aa301 ("selftests: mptcp: listener test for userspace PM")
Cc: stable@vger.kernel.org
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agomptcp: fix NULL pointer dereference on fastopen early fallback
Paolo Abeni [Tue, 11 Apr 2023 20:42:11 +0000 (22:42 +0200)]
mptcp: fix NULL pointer dereference on fastopen early fallback

In case of early fallback to TCP, subflow_syn_recv_sock() deletes
the subflow context before returning the newly allocated sock to
the caller.

The fastopen path does not cope with the above, unconditionally
dereferencing the subflow context.

Fixes: 36b122baf6a8 ("mptcp: add subflow_v(4,6)_send_synack()")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agomptcp: stricter state check in mptcp_worker
Paolo Abeni [Tue, 11 Apr 2023 20:42:10 +0000 (22:42 +0200)]
mptcp: stricter state check in mptcp_worker

As reported by Christoph, the mptcp protocol can run the
worker when the relevant msk socket is in an unexpected state:

connect()
// incoming reset + fastclose
// the mptcp worker is scheduled
mptcp_disconnect()
// msk is now CLOSED
listen()
mptcp_worker()

Leading to the following splat:

divide error: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21 Comm: kworker/1:0 Not tainted 6.3.0-rc1-gde5e8fd0123c #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: events mptcp_worker
RIP: 0010:__tcp_select_window+0x22c/0x4b0 net/ipv4/tcp_output.c:3018
RSP: 0018:ffffc900000b3c98 EFLAGS: 00010293
RAX: 000000000000ffd7 RBX: 000000000000ffd7 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8214ce97 RDI: 0000000000000004
RBP: 000000000000ffd7 R08: 0000000000000004 R09: 0000000000010000
R10: 000000000000ffd7 R11: ffff888005afa148 R12: 000000000000ffd7
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000405270 CR3: 000000003011e006 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tcp_select_window net/ipv4/tcp_output.c:262 [inline]
 __tcp_transmit_skb+0x356/0x1280 net/ipv4/tcp_output.c:1345
 tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline]
 tcp_send_active_reset+0x13e/0x320 net/ipv4/tcp_output.c:3459
 mptcp_check_fastclose net/mptcp/protocol.c:2530 [inline]
 mptcp_worker+0x6c7/0x800 net/mptcp/protocol.c:2705
 process_one_work+0x3bd/0x950 kernel/workqueue.c:2390
 worker_thread+0x5b/0x610 kernel/workqueue.c:2537
 kthread+0x138/0x170 kernel/kthread.c:376
 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308
 </TASK>

This change addresses the issue by explicitly checking for bad states
before running the mptcp worker.
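
One way such a state check could look is a bitmask test at the top of the
worker. This is a sketch only: the state values, the DEMO_* names, and the
choice of CLOSE and LISTEN as the rejected states are assumptions drawn from
the scenario above, not taken from the patch itself.

```c
#include <stdbool.h>

/* Hypothetical socket states; the numeric values are illustrative. */
enum demo_state {
	DEMO_ESTABLISHED = 1,
	DEMO_CLOSE       = 7,
	DEMO_LISTEN      = 10,
};

/* One flag bit per state, so several states can be tested at once. */
#define DEMO_STATE_FLAG(state)	(1u << (state))

/* Return false when the worker should bail out without touching the socket. */
static bool demo_worker_may_run(enum demo_state state)
{
	unsigned int bad = DEMO_STATE_FLAG(DEMO_CLOSE) |
			   DEMO_STATE_FLAG(DEMO_LISTEN);

	return (DEMO_STATE_FLAG(state) & bad) == 0;
}
```

In the sequence shown earlier, the worker would find the socket in the
listening state after mptcp_disconnect() and listen(), and a check of this
shape would make it return early instead of reaching the fastclose path.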

Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/374
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agomptcp: use mptcp_schedule_work instead of open-coding it
Paolo Abeni [Tue, 11 Apr 2023 20:42:09 +0000 (22:42 +0200)]
mptcp: use mptcp_schedule_work instead of open-coding it

Beyond reducing code duplication this also avoids scheduling
the mptcp_worker on a closed socket on some edge scenarios.

The addressed issue is actually older than the blamed commit
below, but this fix needs it as a pre-requisite.

Fixes: ba8f48f7a4d7 ("mptcp: introduce mptcp_schedule_work")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
19 months agosamples/bpf: Fix fout leak in hbm's run_bpf_prog
Hao Zeng [Tue, 11 Apr 2023 08:43:49 +0000 (16:43 +0800)]
samples/bpf: Fix fout leak in hbm's run_bpf_prog

Fix fout being fopen'ed but then not subsequently fclose'd. In the affected
branch, fout is otherwise going out of scope.

Signed-off-by: Hao Zeng <zenghao@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230411084349.1999628-1-zenghao@kylinos.cn
19 months agoxsk: Elide base_addr comparison in xp_unaligned_validate_desc
Kal Conley [Tue, 11 Apr 2023 13:00:25 +0000 (15:00 +0200)]
xsk: Elide base_addr comparison in xp_unaligned_validate_desc

Remove redundant (base_addr >= pool->addrs_cnt) comparison from the
conditional.

In particular, addr is computed as:

    addr = base_addr + offset

... where base_addr and offset are stored as 48-bit and 16-bit unsigned
integers, respectively. The above sum cannot overflow u64 since base_addr
has a maximum value of 0x0000ffffffffffff and offset has a maximum value
of 0xffff (implying a maximum sum of 0x000100000000fffe). Since overflow
is impossible, it follows that addr >= base_addr.

Now if (base_addr >= pool->addrs_cnt), then clearly:

    addr >= base_addr
         >= pool->addrs_cnt

Thus, (base_addr >= pool->addrs_cnt) implies (addr >= pool->addrs_cnt).
Subsequently, the former comparison is unnecessary in the conditional
since for any boolean expressions A and B, (A || B) && (A -> B) is
equivalent to B.
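
The argument can be checked with a small standalone sketch. The function
names are hypothetical, and base_addr is assumed to already be limited to
48 bits, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound for base_addr assumed by this sketch (48 bits). */
#define DEMO_BASE_ADDR_MAX	0x0000ffffffffffffULL

/* Original form: (A || B) with A = base check, B = addr check. */
static bool demo_out_of_range_old(uint64_t base_addr, uint16_t offset,
				  uint64_t addrs_cnt)
{
	uint64_t addr = base_addr + offset;	/* cannot overflow u64 */

	return base_addr >= addrs_cnt || addr >= addrs_cnt;
}

/* Simplified form: B alone, since A implies B. */
static bool demo_out_of_range_new(uint64_t base_addr, uint16_t offset,
				  uint64_t addrs_cnt)
{
	uint64_t addr = base_addr + offset;

	return addr >= addrs_cnt;
}
```

For any base_addr up to DEMO_BASE_ADDR_MAX the two functions return the
same result, because addr >= base_addr always holds when the sum cannot
wrap around.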

Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230411130025.19704-1-kal.conley@dectris.com
19 months agoselftests/bpf: Fix compiler warnings in bpf_testmod for kfuncs
Andrii Nakryiko [Wed, 12 Apr 2023 03:46:47 +0000 (20:46 -0700)]
selftests/bpf: Fix compiler warnings in bpf_testmod for kfuncs

Add -Wmissing-prototypes ignore in bpf_testmod.c, similarly to what we
do in kernel code proper.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/oe-kbuild-all/202304080951.l14IDv3n-lkp@intel.com/
Link: https://lore.kernel.org/bpf/20230412034647.3968143-1-andrii@kernel.org
19 months agoxsk: Simplify xp_aligned_validate_desc implementation
Kal Conley [Mon, 10 Apr 2023 12:18:41 +0000 (14:18 +0200)]
xsk: Simplify xp_aligned_validate_desc implementation

Perform the chunk boundary check like the page boundary check in
xp_desc_crosses_non_contig_pg(). This simplifies the implementation and
reduces the number of branches.
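
A sketch of a boundary check in that style, assuming the chunk size is a
power of two so that the offset inside the chunk can be obtained with a
mask. The helper name is hypothetical and this is not the function body
from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when [addr, addr + len) stays inside a single chunk.
 * chunk_size is assumed to be a power of two.
 */
static bool demo_fits_in_chunk(uint64_t addr, uint32_t len, uint32_t chunk_size)
{
	uint64_t offset = addr & (chunk_size - 1);	/* position inside the chunk */

	return offset + len <= chunk_size;
}
```

A single comparison on offset + len replaces separate checks on the first
and last byte of the descriptor, which is where the reduction in branches
would come from.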

Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230410121841.643254-1-kal.conley@dectris.com
19 months agoselftests/bpf: Remove stand-along test_verifier_log test binary
Andrii Nakryiko [Wed, 12 Apr 2023 17:06:55 +0000 (10:06 -0700)]
selftests/bpf: Remove stand-along test_verifier_log test binary

test_progs' prog_tests/verifier_log.c supersedes the stand-alone
test_verifier_log test. It covers the same checks and adds more, and is also
integrated into the test_progs test runner.

Just remove test_verifier_log.c.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230412170655.1866831-1-andrii@kernel.org
19 months agoselftests/bpf: Keep the loop in bpf_testmod_loop_test
Song Liu [Wed, 12 Apr 2023 21:04:23 +0000 (14:04 -0700)]
selftests/bpf: Keep the loop in bpf_testmod_loop_test

Some compilers (for example clang-15) optimize bpf_testmod_loop_test and
remove the loop:

gcc version
(gdb) disassemble bpf_testmod_loop_test
Dump of assembler code for function bpf_testmod_loop_test:
   0x0000000000000570 <+0>:     callq  0x575 <bpf_testmod_loop_test+5>
   0x0000000000000575 <+5>:     xor    %eax,%eax
   0x0000000000000577 <+7>:     test   %edi,%edi
   0x0000000000000579 <+9>:     jle    0x587 <bpf_testmod_loop_test+23>
   0x000000000000057b <+11>:    xor    %edx,%edx
   0x000000000000057d <+13>:    add    %edx,%eax
   0x000000000000057f <+15>:    add    $0x1,%edx
   0x0000000000000582 <+18>:    cmp    %edx,%edi
   0x0000000000000584 <+20>:    jne    0x57d <bpf_testmod_loop_test+13>
   0x0000000000000586 <+22>:    retq
   0x0000000000000587 <+23>:    retq

clang-15 version
(gdb) disassemble bpf_testmod_loop_test
Dump of assembler code for function bpf_testmod_loop_test:
   0x0000000000000450 <+0>:     nopl   0x0(%rax,%rax,1)
   0x0000000000000455 <+5>:     test   %edi,%edi
   0x0000000000000457 <+7>:     jle    0x46b <bpf_testmod_loop_test+27>
   0x0000000000000459 <+9>:     lea    -0x1(%rdi),%eax
   0x000000000000045c <+12>:    lea    -0x2(%rdi),%ecx
   0x000000000000045f <+15>:    imul   %rax,%rcx
   0x0000000000000463 <+19>:    shr    %rcx
   0x0000000000000466 <+22>:    lea    -0x1(%rdi,%rcx,1),%eax
   0x000000000000046a <+26>:    retq
   0x000000000000046b <+27>:    xor    %eax,%eax
   0x000000000000046d <+29>:    retq

Note: The jne instruction is removed in clang-15 version.

Force the compiler to keep the loop by making sum volatile.
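
A minimal sketch of the idea, using a hypothetical function name rather
than the test module's own code: with a volatile accumulator the compiler
has to perform the load and store on every iteration, so it cannot replace
the loop with a closed-form expression.

```c
/* Sum 0..iterations-1 while keeping a real loop in the generated code. */
int demo_loop_test(int iterations)
{
	/* volatile forces an actual read-modify-write of sum per iteration */
	volatile int sum = 0;
	int i;

	for (i = 0; i < iterations; i++)
		sum += i;

	return sum;
}
```

The result is unchanged; only the generated code differs, which matters
when a test relies on the loop's branch instructions being present.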

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230412210423.900851-4-song@kernel.org
19 months agoselftests/bpf: Fix leaked bpf_link in get_stackid_cannot_attach
Song Liu [Wed, 12 Apr 2023 21:04:22 +0000 (14:04 -0700)]
selftests/bpf: Fix leaked bpf_link in get_stackid_cannot_attach

skel->links.oncpu is leaked in one case. This causes the perf_branches test
to fail when it runs after get_stackid_cannot_attach:

./test_progs -t get_stackid_cannot_attach,perf_branches
84      get_stackid_cannot_attach:OK
test_perf_branches_common:PASS:test_perf_branches_load 0 nsec
test_perf_branches_common:PASS:attach_perf_event 0 nsec
test_perf_branches_common:PASS:set_affinity 0 nsec
check_good_sample:FAIL:output not valid no valid sample from prog
146/1   perf_branches/perf_branches_hw:FAIL
146/2   perf_branches/perf_branches_no_hw:OK
146     perf_branches:FAIL

All error logs:
test_perf_branches_common:PASS:test_perf_branches_load 0 nsec
test_perf_branches_common:PASS:attach_perf_event 0 nsec
test_perf_branches_common:PASS:set_affinity 0 nsec
check_good_sample:FAIL:output not valid no valid sample from prog
146/1   perf_branches/perf_branches_hw:FAIL
146     perf_branches:FAIL
Summary: 1/1 PASSED, 0 SKIPPED, 1 FAILED

Fix this by adding the missing bpf_link__destroy().

Fixes: 346938e9380c ("selftests/bpf: Add get_stackid_cannot_attach")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230412210423.900851-3-song@kernel.org
19 months ago selftests/bpf: Use read_perf_max_sample_freq() in perf_event_stackmap
Song Liu [Wed, 12 Apr 2023 21:04:21 +0000 (14:04 -0700)]
selftests/bpf: Use read_perf_max_sample_freq() in perf_event_stackmap

Currently, the perf_event sample period in perf_event_stackmap is set so low
that the test fails randomly. Fix this by using the max sample frequency
from read_perf_max_sample_freq().

Move read_perf_max_sample_freq() to testing_helpers.c. Replace the CHECK()
with if-printf, as CHECK is not available in testing_helpers.c.
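
A rough sketch of what such a helper could look like in plain user-space C.
This is an approximation, not the code in testing_helpers.c. The sysctl path
is the standard one; the fallback value of 5000 and the exact messages are
assumptions made for illustration:

```c
#include <stdio.h>

/* Sysctl that holds the maximum perf sampling rate, in samples per second. */
#define MAX_SAMPLE_RATE_PATH "/proc/sys/kernel/perf_event_max_sample_rate"

/* Assumed fallback, used when the sysctl cannot be read or parsed. */
#define DEFAULT_SAMPLE_FREQ 5000ULL

unsigned long long read_perf_max_sample_freq(void)
{
	unsigned long long sample_freq = DEFAULT_SAMPLE_FREQ;
	FILE *f = fopen(MAX_SAMPLE_RATE_PATH, "r");

	/* Plain if + printf instead of a test-framework CHECK() macro. */
	if (!f) {
		printf("Failed to open %s, using default %llu\n",
		       MAX_SAMPLE_RATE_PATH, sample_freq);
		return sample_freq;
	}
	if (fscanf(f, "%llu", &sample_freq) != 1) {
		sample_freq = DEFAULT_SAMPLE_FREQ;
		printf("Failed to parse %s, using default %llu\n",
		       MAX_SAMPLE_RATE_PATH, sample_freq);
	}
	fclose(f);
	return sample_freq;
}
```

A caller would then presumably request frequency-based sampling, setting
freq = 1 and sample_freq to the returned value in struct perf_event_attr,
rather than hard-coding a sample period.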

Fixes: 1da4864c2b20 ("selftests/bpf: Add callchain_stackid")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230412210423.900851-2-song@kernel.org
19 months ago Merge tag 'drm-intel-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-intel...
Daniel Vetter [Thu, 13 Apr 2023 12:24:44 +0000 (14:24 +0200)]
Merge tag 'drm-intel-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v6.3-rc7:
- Fix dual link DSI for TGL+

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/877cugckzu.fsf@intel.com
19 months ago selftests/bpf: Fix use of uninitialized op_name in log tests
Lorenz Bauer [Thu, 13 Apr 2023 09:47:40 +0000 (10:47 +0100)]
selftests/bpf: Fix use of uninitialized op_name in log tests

One of the test assertions uses an uninitialized op_name, which leads
to some headscratching if it fails. Use a string constant instead.

Fixes: b1a7a480a112 ("selftests/bpf: Add fixed vs rotating verifier log tests")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230413094740.18041-1-lmb@isovalent.com
19 months ago Merge branch 'net-use-read_once-write_once-for-ring-index-accesses'
Paolo Abeni [Thu, 13 Apr 2023 11:30:22 +0000 (13:30 +0200)]
Merge branch 'net-use-read_once-write_once-for-ring-index-accesses'

Jakub Kicinski says:

====================
net: use READ_ONCE/WRITE_ONCE for ring index accesses

Small follow up to the lockless ring stop/start macros.
Update the doc and the drivers suggested by Eric:
https://lore.kernel.org/all/CANn89iJrBGSybMX1FqrhCEMWT3Nnz2=2+aStsbbwpWzKHjk51g@mail.gmail.com/

====================

Link: https://lore.kernel.org/r/20230412015038.674023-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago mlx4: use READ_ONCE/WRITE_ONCE for ring indexes
Jakub Kicinski [Wed, 12 Apr 2023 01:50:38 +0000 (18:50 -0700)]
mlx4: use READ_ONCE/WRITE_ONCE for ring indexes

Eric points out that we should make sure that ring index updates
are wrapped in the appropriate READ_ONCE/WRITE_ONCE macros.

Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago bnxt: use READ_ONCE/WRITE_ONCE for ring indexes
Jakub Kicinski [Wed, 12 Apr 2023 01:50:37 +0000 (18:50 -0700)]
bnxt: use READ_ONCE/WRITE_ONCE for ring indexes

Eric points out that we should make sure that ring index updates
are wrapped in the appropriate READ_ONCE/WRITE_ONCE macros.

Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: docs: update the sample code in driver.rst
Jakub Kicinski [Wed, 12 Apr 2023 01:50:36 +0000 (18:50 -0700)]
net: docs: update the sample code in driver.rst

The sample code talks about single-queue devices and uses locks.
Update it to something resembling more modern code.
Make sure we mention use of READ_ONCE() / WRITE_ONCE().

Change the comment which talked about consumer on the xmit side.
AFAIU xmit is the producer and completions are a consumer.
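
To illustrate the producer/consumer split, here is a simplified user-space
sketch, not the sample code from driver.rst. The struct, the function names
and RING_SIZE are hypothetical, and the two macros are reduced stand-ins for
the kernel's READ_ONCE()/WRITE_ONCE(). Memory barriers and queue stop/wake
handling, which a real driver needs, are deliberately left out:

```c
#include <stdint.h>

/* Simplified stand-ins: a volatile access makes the compiler emit exactly
 * one load or store and keeps it from caching or re-reading the value.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define RING_SIZE 8u	/* hypothetical ring size */

struct tx_ring {
	uint32_t prod;	/* written only by the xmit path (producer) */
	uint32_t cons;	/* written only by the completion path (consumer) */
};

/* Free descriptors; unsigned subtraction stays correct across wraparound. */
uint32_t ring_free_slots(struct tx_ring *r)
{
	return RING_SIZE - (READ_ONCE(r->prod) - READ_ONCE(r->cons));
}

/* Producer side: claim one descriptor, or return -1 if the ring is full. */
int ring_xmit(struct tx_ring *r)
{
	if (ring_free_slots(r) == 0)
		return -1;
	/* prod is owned by this path, so a plain read of it is fine here */
	WRITE_ONCE(r->prod, r->prod + 1);
	return 0;
}

/* Consumer side: reclaim up to budget completed descriptors. */
uint32_t ring_complete(struct tx_ring *r, uint32_t budget)
{
	uint32_t pending = READ_ONCE(r->prod) - r->cons;
	uint32_t done = pending < budget ? pending : budget;

	WRITE_ONCE(r->cons, r->cons + done);
	return done;
}
```

Each index has a single writer. The marked accesses are the ones that can
race with the other path: every store to an index, and every load of the
index owned by the other side.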

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago Merge branch 'add-emac3-support-for-sa8540p-ride'
Paolo Abeni [Thu, 13 Apr 2023 10:50:47 +0000 (12:50 +0200)]
Merge branch 'add-emac3-support-for-sa8540p-ride'

Andrew Halaney says:

====================
Add EMAC3 support for sa8540p-ride

This is a forward port / upstream refactor of code delivered
downstream by Qualcomm over at [0] to enable the DWMAC5 based
implementation called EMAC3 on the sa8540p-ride dev board.

From what I can tell with the board schematic in hand,
as well as the code delivered, the main changes needed are:

    1. A new address space layout for dwmac5/EMAC3 MTL/DMA regs
    2. A new programming sequence required for the EMAC3 based platforms

This series makes the changes above as well as other housekeeping items
such as converting dt-bindings to yaml, etc.

As requested[1], it has been split up by compilation deps / maintainer tree.
I will post a link to the associated devicetree changes that together
with this series get the hardware functioning.

Patches 1-3 are clean ups of the currently supported dt-bindings and
IMO could be picked up as is independent of the rest of the series to
improve the current codebase. They've all been reviewed in prior
versions of the series.

Patches 5-7 are also clean ups of the driver and are worth picking up
independently as well. They don't all have explicit reviews but should
be good to go (trivial changes on non-reviewed bits).

The rest of the patches have new changes, lack review, or are specifically
being made to support the new hardware, so they should wait until the
series as a whole is deemed ready to go by the community.

====================

Link: https://lore.kernel.org/r/20230411200409.455355-1-ahalaney@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: dwmac-qcom-ethqos: Add EMAC3 support
Andrew Halaney [Tue, 11 Apr 2023 20:04:09 +0000 (15:04 -0500)]
net: stmmac: dwmac-qcom-ethqos: Add EMAC3 support

Add the new programming sequence needed for EMAC3 based platforms such
as the sc8280xp family.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: dwmac-qcom-ethqos: Use loopback_en for all speeds
Andrew Halaney [Tue, 11 Apr 2023 20:04:08 +0000 (15:04 -0500)]
net: stmmac: dwmac-qcom-ethqos: Use loopback_en for all speeds

It seems that this variable should be used for all speeds, not just
1000/100.

While at it, refactor it slightly to be more readable, including fixing
the typo in the variable name.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: dwmac-qcom-ethqos: Respect phy-mode and TX delay
Andrew Halaney [Tue, 11 Apr 2023 20:04:07 +0000 (15:04 -0500)]
net: stmmac: dwmac-qcom-ethqos: Respect phy-mode and TX delay

The driver currently sets a MAC TX delay of 2 ns no matter what the
phy-mode is. If the phy-mode indicates the phy is in charge of the
TX delay (rgmii-txid, rgmii-id), don't do it in the MAC.
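
The decision described above can be sketched as follows. This is a
standalone illustration rather than the driver code: the enum is a
hypothetical stand-in for the kernel's PHY_INTERFACE_MODE_RGMII* values, and
mac_adds_tx_delay() is an invented name:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the four RGMII phy-mode variants. */
enum rgmii_mode {
	MODE_RGMII,		/* "rgmii" */
	MODE_RGMII_ID,		/* "rgmii-id": PHY handles RX and TX delay */
	MODE_RGMII_RXID,	/* "rgmii-rxid": PHY handles RX delay only */
	MODE_RGMII_TXID,	/* "rgmii-txid": PHY handles TX delay only */
};

/* Returns true when the MAC should insert the TX delay itself. */
bool mac_adds_tx_delay(enum rgmii_mode mode)
{
	switch (mode) {
	case MODE_RGMII_ID:
	case MODE_RGMII_TXID:
		/* The PHY is in charge of the TX delay; do not add it twice. */
		return false;
	case MODE_RGMII:
	case MODE_RGMII_RXID:
	default:
		/* Same as the old unconditional behaviour. */
		return true;
	}
}
```

Only the two modes that hand the TX delay to the PHY change behaviour; the
other modes keep the MAC delay as before.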

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: dwmac4: Allow platforms to specify some DMA/MTL offsets
Andrew Halaney [Tue, 11 Apr 2023 20:04:06 +0000 (15:04 -0500)]
net: stmmac: dwmac4: Allow platforms to specify some DMA/MTL offsets

Some platforms have dwmac4 implementations that have a different
address space layout than the default, resulting in the need to define
their own DMA/MTL offsets.

Extend the functions to allow a platform driver to indicate what its
addresses are, overriding the defaults.
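
One way to picture the override, as a sketch under stated assumptions: the
struct, the helper and the numeric values below are illustrative only and
are not copied from the driver or from any datasheet. A platform that needs
a different layout supplies a table of addresses; everyone else passes NULL
and gets the defaults:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative default layout: base of DMA channel 0 and per-channel stride. */
#define DEFAULT_DMA_CHAN_BASE	0x1100u
#define DEFAULT_DMA_CHAN_STRIDE	0x80u

/* Hypothetical per-platform override table. */
struct dma_addrs {
	uint32_t dma_chan;		/* base address of DMA channel 0 */
	uint32_t dma_chan_offset;	/* distance between channels */
};

/* Register base of DMA channel chan; addrs == NULL selects the defaults. */
uint32_t dma_chan_base_addr(const struct dma_addrs *addrs, uint32_t chan)
{
	if (addrs)
		return addrs->dma_chan + chan * addrs->dma_chan_offset;
	return DEFAULT_DMA_CHAN_BASE + chan * DEFAULT_DMA_CHAN_STRIDE;
}
```

Register accessors that used to add a fixed constant would call a helper of
this shape instead, so existing platforms see no change while a platform
driver can point the core at its own layout.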

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: Pass stmmac_priv in some callbacks
Andrew Halaney [Tue, 11 Apr 2023 20:04:05 +0000 (15:04 -0500)]
net: stmmac: Pass stmmac_priv in some callbacks

Passing stmmac_priv to some of the callbacks allows hwif implementations
to grab some data that platforms can customize. Adjust the callbacks
accordingly in preparation for such a platform customization.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
19 months ago net: stmmac: Remove some unnecessary void pointers
Andrew Halaney [Tue, 11 Apr 2023 20:04:04 +0000 (15:04 -0500)]
net: stmmac: Remove some unnecessary void pointers

There are a few spots in the hardware interface where a void pointer is
used, but what's passed in and later cast out is always the same type.

Just use the proper type directly.
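
A small before/after sketch of this kind of cleanup. The struct and function
names are hypothetical and are not taken from the stmmac hardware interface:

```c
/* Hypothetical feature descriptor standing in for a driver structure. */
struct hw_features {
	unsigned int tx_queues;
};

/* Before: the callback takes void * and casts it back, so the compiler
 * cannot catch a caller passing the wrong object.
 */
unsigned int tx_queues_untyped(void *data)
{
	struct hw_features *feat = data;

	return feat->tx_queues;
}

/* After: the real type is in the signature, so no cast is needed and
 * a mismatched argument is diagnosed at compile time.
 */
unsigned int tx_queues_typed(const struct hw_features *feat)
{
	return feat->tx_queues;
}
```

Both variants return the same value for a valid argument; the difference is
purely in what the compiler can check.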

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>