platform/kernel/linux-rpi.git
11 months ago  virtchnl: fix fake 1-elem arrays for structures allocated as `nents`
Alexander Lobakin [Fri, 28 Jul 2023 15:52:07 +0000 (17:52 +0200)]
virtchnl: fix fake 1-elem arrays for structures allocated as `nents`

Finally, fix 3 structures which are allocated technically correctly,
i.e. the calculated size equals the one that struct_size() would
return, except for sizeof(). For &virtchnl_vlan_filter_list_v2, use
the same approach when there is not enough space as taken previously
for &virtchnl_vlan_filter_list, i.e. let the maximum size be calculated
automatically instead of trying to guesstimate it using maths.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
11 months ago  virtchnl: fix fake 1-elem arrays in structures allocated as `nents + 1`
Alexander Lobakin [Fri, 28 Jul 2023 15:52:06 +0000 (17:52 +0200)]
virtchnl: fix fake 1-elem arrays in structures allocated as `nents + 1`

There are five virtchnl structures which are allocated and checked in
the code as `nents + 1`, meaning that they always have memory for one
extra element regardless of their actual number. This is because
their sizeof() includes space for 1 element and then they get
allocated via struct_size() or its open-coded equivalents, passing
the actual number of elements.
Expand virtchnl_struct_size() to handle such structures and replace
those 1-elem arrays with proper flex ones. Also fix several places
which open-code %IAVF_VIRTCHNL_VF_RESOURCE_SIZE. Finally, let the
virtchnl_ether_addr_list size be computed automatically when there's
not enough space for the whole list, otherwise we have to open-code
reverse struct_size() logic.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
11 months ago  virtchnl: fix fake 1-elem arrays in structs allocated as `nents + 1` - 1
Alexander Lobakin [Fri, 28 Jul 2023 15:52:05 +0000 (17:52 +0200)]
virtchnl: fix fake 1-elem arrays in structs allocated as `nents + 1` - 1

The two most problematic virtchnl structures are virtchnl_rss_key and
virtchnl_rss_lut. Their "flex" arrays have the type of u8, thus, when
allocating / checking, the actual size is calculated as `sizeof +
nents - 1 byte`. But their sizeof() is not 1 byte larger than the size
of such structure with proper flex array, it's two bytes larger due to
the padding. That said, their size is always 1 byte larger unless
there are no tail elements -- then it's +2 bytes.
Add virtchnl_struct_size() macro which will handle this case (and later
other cases as well). Make its calling convention the same as that of
struct_size() to allow it to be drop-in, even though it's unlikely to
become possible to switch to generic API. The macro will calculate a
proper size of a structure with a flex array at the end, so that it
becomes transparent for the compilers, but add the difference from the
old values, so that the real size of sorta-ABI-messages doesn't change.
Use it on the allocation side in IAVF and the receiving side (defined
as static inline in virtchnl.h) for the mentioned two structures.
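
The size arithmetic described above can be checked with a small user-space
sketch. The struct names and the two helpers are made up for illustration;
only the field layout (two u16 fields followed by a u8 array) follows the
description, and the 6-byte vs 4-byte sizes assume a common ABI where u16 is
2-byte aligned. This is not the virtchnl_struct_size() macro itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old-style layout: a "fake" flexible array declared with one element. */
struct rss_key_fake {
	uint16_t vsi_id;
	uint16_t key_len;
	uint8_t key[1];	/* 5 bytes of fields, padded up to 6 */
};

/* Same fields with a proper C99 flexible array member. */
struct rss_key_flex {
	uint16_t vsi_id;
	uint16_t key_len;
	uint8_t key[];
};

/* Assumption: typical ABI, so the fake layout is two bytes larger. */
static_assert(sizeof(struct rss_key_fake) == sizeof(struct rss_key_flex) + 2,
	      "unexpected padding on this ABI");

/* Legacy message size: sizeof() of the fake layout plus nents - 1 bytes. */
static size_t legacy_size(size_t nents)
{
	return sizeof(struct rss_key_fake) + (nents ? nents - 1 : 0);
}

/* Size a struct_size()-style computation gives for the flex layout. */
static size_t flex_size(size_t nents)
{
	return sizeof(struct rss_key_flex) + nents * sizeof(uint8_t);
}
```

With these definitions the legacy size exceeds the flex size by 2 bytes when
there are no elements and by 1 byte otherwise, which is the difference a
compatibility macro has to add back to keep the message size unchanged.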

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
11 months ago  Merge branch 'ipv6-expired-routes'
David S. Miller [Wed, 16 Aug 2023 11:26:44 +0000 (12:26 +0100)]
Merge branch 'ipv6-expired-routes'

Kui-Feng Lee says:

====================
Remove expired routes with a separated list of routes.

FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree
can be expensive if the number of routes in a table is big, even if most of
them are permanent. Checking routes in a separated list of routes having
expiration will avoid this potential issue.

Background
==========

The size of a Linux IPv6 routing table can become a big problem if not
managed appropriately.  Now, Linux has a garbage collector to remove
expired routes periodically.  However, this may lead to a situation in
which the routing path is blocked for a long period due to an
excessive number of routes.

For example, years ago, there was a commit c7bb4b89033b ("ipv6: tcp:
drop silly ICMPv6 packet too big messages").  The root cause is that
malicious ICMPv6 packets were sent back for every small packet sent to
them. These packets add routes with an expiration time that prompts
the GC to periodically check all routes in the tables, including
permanent ones.

Why Route Expires
=================

Users can add IPv6 routes with an expiration time manually. However,
the Neighbor Discovery protocol may also generate routes that can
expire.  For example, Router Advertisement (RA) messages may create a
default route with an expiration time. [RFC 4861] For IPv4, it is not
possible to set an expiration time for a route, and there is no RA, so
there is no need to worry about such issues.

Create Routes with Expires
==========================

You can create routes with expires with the ip command.

For example,

    ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \
        dev enp0s3 expires 30

The route that has been generated will be deleted automatically in 30
seconds.

GC of FIB6
==========

The function called fib6_run_gc() is responsible for performing
garbage collection (GC) for the Linux IPv6 stack. It checks for the
expiration of every route by traversing the trees of routing
tables. The time taken to traverse a routing table increases with its
size. Holding the routing table lock during traversal is particularly
undesirable. Therefore, it is preferable to keep the lock for the
shortest possible duration.

Solution
========

The cause of the issue is keeping the routing table locked during the
traversal of large trees. To solve this problem, we can create a separate
list of routes that have expiration. This will prevent GC from checking
permanent routes.
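
A minimal user-space sketch of this idea follows. The types and function
names (struct route, struct route_table, table_add(), table_gc()) are
hypothetical and much simpler than the kernel's fib6_info and fib6_table;
the point is only that GC visits the list of expiring routes and never
touches permanent ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified route entry (not the kernel's fib6_info). */
struct route {
	unsigned long expires;	/* absolute expiry time, unused if permanent */
	bool has_expiry;
	bool removed;
	struct route *gc_next;	/* link on the list of expiring routes */
};

/* Hypothetical table: only routes that can expire sit on gc_list. */
struct route_table {
	struct route *gc_list;
};

/* Add a route; permanent routes are never linked on gc_list. */
static void table_add(struct route_table *tbl, struct route *rt)
{
	assert(tbl && rt);
	rt->removed = false;
	rt->gc_next = NULL;
	if (rt->has_expiry) {
		rt->gc_next = tbl->gc_list;
		tbl->gc_list = rt;
	}
}

/*
 * Unlink every expired route and return how many entries were visited.
 * The cost depends only on the number of routes that can expire.
 */
static size_t table_gc(struct route_table *tbl, unsigned long now)
{
	struct route **link;
	size_t visited = 0;

	assert(tbl);
	link = &tbl->gc_list;
	while (*link) {
		struct route *rt = *link;

		visited++;
		if (rt->expires <= now) {
			*link = rt->gc_next;	/* unlink the expired route */
			rt->gc_next = NULL;
			rt->removed = true;
		} else {
			link = &rt->gc_next;
		}
	}
	return visited;
}
```

In this sketch a table holding thousands of permanent routes and a single
expiring route makes table_gc() visit exactly one entry per run.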

Result
======

We conducted a test to measure the execution times of fib6_gc_timer_cb()
and observed that the patch speeds up the GC of FIB6. During the test, we added
permanent routes with the following numbers: 1000, 3000, 6000, and
9000. Additionally, we added a route with an expiration time.

Here are the average execution times for the kernel without the patch.
 - 120020 ns with 1000 permanent routes
 - 308920 ns with 3000 ...
 - 581470 ns with 6000 ...
 - 855310 ns with 9000 ...

The kernel with the patch consistently takes around 14000 ns to execute,
regardless of the number of permanent routes that are installed.

Major changes from v7:

 - Fix warnings raised by the patchwork.

Major changes from v6:

 - Remove unnecessary check of tb6 in fib6_clean_expires_locked().

 - Use fib6_clean_expires_locked() instead in fib6_purge_rt().

Major changes from v5:

 - Change the order of adding new routes to the GC list and starting
   GC timer.

 - Remove time measurements from the test case.

 - Stop forcing GC flush.

Major changes from v4:

 - Detect existence of 'strace' in the test case.

Major changes from v3:

 - Fix the type of arg according to feedback.

 - Add 1K temporary routes and 5K permanent routes in the test case.
   Measure the time spent on GC with strace.

Major changes from v2:

 - Remove unnecessary and incorrect sysctl restoring in the test case.

Major changes from v1:

 - Moved gc_link to avoid creating a hole in fib6_info.

 - Moved fib6_set_expires*() and fib6_clean_expires*() to the header
   file and inlined. And removed duplicated lines.

 - Added a test case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  selftests: fib_tests: Add a test case for IPv6 garbage collection
Kui-Feng Lee [Tue, 15 Aug 2023 18:07:06 +0000 (11:07 -0700)]
selftests: fib_tests: Add a test case for IPv6 garbage collection

Add 1000 IPv6 routes with expiration time (w/ and w/o additional 5000
permanent routes in the background.)  Wait for a few seconds to make sure
they are removed correctly.

The expected output of the test looks like the following example.

> Fib6 garbage collection test
>     TEST: ipv6 route garbage collection [ OK ]

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net/ipv6: Remove expired routes with a separated list of routes.
Kui-Feng Lee [Tue, 15 Aug 2023 18:07:05 +0000 (11:07 -0700)]
net/ipv6: Remove expired routes with a separated list of routes.

FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree
can be expensive if the number of routes in a table is big, even if most of
them are permanent. Checking routes in a separated list of routes having
expiration will avoid this potential issue.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  e1000e: Use PME poll to circumvent unreliable ACPI wake
Kai-Heng Feng [Tue, 15 Aug 2023 17:01:11 +0000 (10:01 -0700)]
e1000e: Use PME poll to circumvent unreliable ACPI wake

On some I219 devices, ethernet cable plugging detection only works once
from PCI D3 state. Subsequent cable plugging does set the PME bit correctly,
but the device still doesn't get woken up.

Since I219 connects to the root complex directly, it relies on platform
firmware (ACPI) to wake it up. In this case, the GPE from _PRW only
works for the first cable plugging but fails to notify the driver for
subsequent plugging events.

The issue was originally found on CNP, but the same issue can be found
on ADL too. So work around the issue by continuing to use PME poll after
the first ACPI wake. As PME poll is always used, the runtime suspend
restriction for CNP can also be removed.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net-memcg: Fix scope of sockmem pressure indicators
Abel Wu [Mon, 14 Aug 2023 07:09:11 +0000 (15:09 +0800)]
net-memcg: Fix scope of sockmem pressure indicators

Now there are two indicators of socket memory pressure sitting inside
struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating
memory reclaim pressure in memcg->memory and ->tcpmem respectively.

When in legacy mode (cgroupv1), the socket memory is charged into
->tcpmem which is independent of ->memory, so socket_pressure has
nothing to do with socket's pressure at all. Things could be worse
by taking socket_pressure into consideration in legacy mode, as a
pressure in ->memory can lead to premature reclamation/throttling
in socket.

While for the default mode (cgroupv2), the socket memory is charged
into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used.

So {socket,tcpmem}_pressure are only used in default/legacy mode
respectively for indicating socket memory pressure. This patch fixes
the pieces of code that make mixed use of both.
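
The intended scoping can be sketched in user space as below. struct
memcg_sketch and under_socket_pressure() are hypothetical stand-ins; the
real indicators are not plain booleans and the real check is more involved,
so this only illustrates which indicator applies in which mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two indicators in struct mem_cgroup. */
struct memcg_sketch {
	bool on_default_hierarchy;	/* true: cgroupv2, false: cgroupv1 */
	bool socket_pressure;		/* reclaim pressure in ->memory */
	bool tcpmem_pressure;		/* pressure in ->tcpmem */
};

/* Consult only the indicator that belongs to the active mode. */
static bool under_socket_pressure(const struct memcg_sketch *memcg)
{
	assert(memcg);
	if (!memcg->on_default_hierarchy)
		return memcg->tcpmem_pressure;	/* legacy: skip socket_pressure */
	return memcg->socket_pressure;		/* default: ->tcpmem is unused */
}
```

In legacy mode, pressure in ->memory alone no longer reports the socket as
being under pressure in this sketch, which is the premature throttling case
described above.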

Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  nfp: update maintainer
Louis Peens [Tue, 15 Aug 2023 12:43:25 +0000 (14:43 +0200)]
nfp: update maintainer

Take over maintainership of the nfp driver from Simon as he
is moving away from Corigine.

Signed-off-by: Louis Peens <louis.peens@corigine.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode
Grygorii Strashko [Tue, 15 Aug 2023 08:21:05 +0000 (11:21 +0300)]
net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode

This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows
not only setting up pri:tc mapping, but also configuring TX shapers on
external port FIFOs. The K3 CPSW MQPRIO Qdisc offload is expected to work
with VLAN/priority tagged packets. Non-tagged packets have to be mapped
only to TC0.

- TX traffic classes must be rated starting from TC that has highest
priority and with no gaps
- Traffic classes are used starting from 0, that has highest priority
- min_rate defines Committed Information Rate (guaranteed)
- max_rate defines Excess Information Rate (non guaranteed) and offloaded
as (max_rate[i] - tcX_min_rate[i])
- VLAN/priority tagged packets mapped to TC0 will exit switch with VLAN tag
priority 0

The configuration example:
 ethtool -L eth1 tx 5
 ethtool --set-priv-flags eth1 p0-rx-ptype-rrobin off

 tc qdisc add dev eth1 parent root handle 100: mqprio num_tc 3 \
 map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \
 queues 1@0 1@1 1@2 hw 1 mode channel \
 shaper bw_rlimit min_rate 0 100mbit 200mbit max_rate 0 101mbit 202mbit

 tc qdisc replace dev eth2 handle 100: parent root mqprio num_tc 1 \
 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 hw 1

 ip link add link eth1 name eth1.100 type vlan id 100
 ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7

In the above example two ports share the same TX CPPI queue 0 for low
priority traffic. 3 traffic classes are defined for eth1 and mapped to:
TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit
TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s
TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s
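
The EIR values listed above follow from the rule that the offloaded excess
rate is max_rate minus min_rate. A minimal sketch of that arithmetic, with
a hypothetical helper name and rates expressed in Mbit/s:

```c
#include <assert.h>
#include <stdint.h>

/* Excess Information Rate offloaded for one traffic class, in Mbit/s. */
static uint32_t eir_mbit(uint32_t min_rate_mbit, uint32_t max_rate_mbit)
{
	/* max_rate is expected to be at least the committed rate */
	assert(max_rate_mbit >= min_rate_mbit);
	return max_rate_mbit - min_rate_mbit;
}
```

For the example configuration, eir_mbit(100, 101) gives 1 for TC1 and
eir_mbit(200, 202) gives 2 for TC2, matching the EIR figures above.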

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  Merge branch 'inet-data-races'
David S. Miller [Wed, 16 Aug 2023 10:09:18 +0000 (11:09 +0100)]
Merge branch 'inet-data-races'

Eric Dumazet says:

====================
inet: socket lock and data-races avoidance

In this series, I converted 20 bits in "struct inet_sock" and made
them truly atomic.

This allows implementing many IP_ socket options in a lockless
fashion (no need to acquire socket lock), and fixes data-races
that were showing up in various KCSAN reports.

I also took care of IP_TTL/IP_MINTTL, but left a few other options
for another series.

v4: Rebased after recent mptcp changes.
  Added Reviewed-by: tags from Simon (thanks !)

v3: fixed patch 7, feedback from build bot about ipvs set_mcast_loop()

v2: addressed a feedback from a build bot in patch 9 by removing
 unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
 Added Acked-by: tags from Soheil (thanks !)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: implement lockless IP_MINTTL
Eric Dumazet [Wed, 16 Aug 2023 08:15:47 +0000 (08:15 +0000)]
inet: implement lockless IP_MINTTL

inet->min_ttl is already read with READ_ONCE().

Implementing IP_MINTTL socket option set/read
without holding the socket lock is easy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: implement lockless IP_TTL
Eric Dumazet [Wed, 16 Aug 2023 08:15:46 +0000 (08:15 +0000)]
inet: implement lockless IP_TTL

ip_select_ttl() is racy, because it reads inet->uc_ttl
without proper locking.

Add READ_ONCE()/WRITE_ONCE() annotations while
allowing IP_TTL socket option to be set/read without
holding the socket lock.
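
As a rough user-space analogue, the pattern looks like the sketch below.
It uses C11 relaxed atomics in place of the kernel's READ_ONCE() and
WRITE_ONCE(), which are not the same thing but serve a similar purpose
here: one untorn store on the setter side and one untorn load on the
reader side, with no lock. The struct, the function names, and the
convention that a negative value means "use the default" are assumptions
made for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-socket unicast TTL field. */
struct ttl_sketch {
	_Atomic int uc_ttl;	/* assumption: negative means "use default" */
};

/* Setter side: a single relaxed store, no lock taken. */
static void ttl_set(struct ttl_sketch *s, int ttl)
{
	assert(s);
	atomic_store_explicit(&s->uc_ttl, ttl, memory_order_relaxed);
}

/* Reader side: load the value once, then work on the local copy. */
static int ttl_select(struct ttl_sketch *s, int default_ttl)
{
	int ttl;

	assert(s);
	ttl = atomic_load_explicit(&s->uc_ttl, memory_order_relaxed);
	return ttl < 0 ? default_ttl : ttl;
}
```

Reading the field once into a local variable matters: testing and then
re-reading the shared field could observe two different values if a
setter runs in between.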

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->defer_connect to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:45 +0000 (08:15 +0000)]
inet: move inet->defer_connect to inet->inet_flags

Make room in struct inet_sock by removing this bit field,
using one available bit in inet_flags instead.

Also move local_port_range to fill the resulting hole,
saving 8 bytes on 64bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->bind_address_no_port to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:44 +0000 (08:15 +0000)]
inet: move inet->bind_address_no_port to inet->inet_flags

IP_BIND_ADDRESS_NO_PORT socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->nodefrag to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:43 +0000 (08:15 +0000)]
inet: move inet->nodefrag to inet->inet_flags

IP_NODEFRAG socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->is_icsk to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:42 +0000 (08:15 +0000)]
inet: move inet->is_icsk to inet->inet_flags

We move single bit fields to inet->inet_flags to avoid races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->transparent to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:41 +0000 (08:15 +0000)]
inet: move inet->transparent to inet->inet_flags

IP_TRANSPARENT socket option can now be set/read
without locking the socket.

v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->mc_all to inet->inet_frags
Eric Dumazet [Wed, 16 Aug 2023 08:15:40 +0000 (08:15 +0000)]
inet: move inet->mc_all to inet->inet_frags

IP_MULTICAST_ALL socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->mc_loop to inet->inet_frags
Eric Dumazet [Wed, 16 Aug 2023 08:15:39 +0000 (08:15 +0000)]
inet: move inet->mc_loop to inet->inet_frags

IP_MULTICAST_LOOP socket option can now be set/read
without locking the socket.

v3: fix build bot error reported in ipvs set_mcast_loop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->hdrincl to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:38 +0000 (08:15 +0000)]
inet: move inet->hdrincl to inet->inet_flags

IP_HDRINCL socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->freebind to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:37 +0000 (08:15 +0000)]
inet: move inet->freebind to inet->inet_flags

IP_FREEBIND socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->recverr_rfc4884 to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:36 +0000 (08:15 +0000)]
inet: move inet->recverr_rfc4884 to inet->inet_flags

IP_RECVERR_RFC4884 socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: move inet->recverr to inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:35 +0000 (08:15 +0000)]
inet: move inet->recverr to inet->inet_flags

IP_RECVERR socket option can now be set/get without locking the socket.

This patch potentially avoids data-races around inet->recverr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: set/get simple options locklessly
Eric Dumazet [Wed, 16 Aug 2023 08:15:34 +0000 (08:15 +0000)]
inet: set/get simple options locklessly

Now that we have inet->inet_flags, we can set the following options
without having to hold the socket lock:

IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE.

ip_sock_set_pktinfo() no longer holds the socket lock.

Similarly we can get the following options without holding
the socket lock:

IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  inet: introduce inet->inet_flags
Eric Dumazet [Wed, 16 Aug 2023 08:15:33 +0000 (08:15 +0000)]
inet: introduce inet->inet_flags

Various inet fields are currently racy.

do_ip_setsockopt() and do_ip_getsockopt() are mostly holding
the socket lock, but some (fast) paths do not.

Use a new inet->inet_flags to hold atomic bits in the series.

Remove inet->cmsg_flags, and use instead 9 bits from inet_flags.
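
The reason a dedicated flags word helps is that adjacent C bit fields share
a storage unit, so two unlocked writers doing plain read-modify-write can
lose each other's update. A minimal user-space sketch of the alternative,
using C11 atomics as a stand-in for the kernel's atomic bit operations;
the struct, the helper names and the bit numbers are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel's own enum is not reproduced. */
enum { FLAG_HDRINCL = 0, FLAG_FREEBIND = 1 };

struct sock_sketch {
	atomic_ulong flags;	/* one word holding many boolean options */
};

/* Set or clear one bit without a lock; other bits are left untouched. */
static void flag_assign(struct sock_sketch *sk, int bit, bool val)
{
	unsigned long mask;

	assert(sk && bit >= 0 && bit < 32);
	mask = 1UL << bit;
	if (val)
		atomic_fetch_or(&sk->flags, mask);
	else
		atomic_fetch_and(&sk->flags, ~mask);
}

/* Read one bit without a lock. */
static bool flag_test(struct sock_sketch *sk, int bit)
{
	assert(sk && bit >= 0 && bit < 32);
	return (atomic_load(&sk->flags) & (1UL << bit)) != 0;
}
```

Each update is a single atomic read-modify-write on the whole word, so
concurrent setters of different options cannot overwrite one another.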

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  Merge branch 'redundant-of_match_ptr'
David S. Miller [Wed, 16 Aug 2023 08:59:40 +0000 (09:59 +0100)]
Merge branch 'redundant-of_match_ptr'

Ruan Jinjie says:

====================
net: Remove redundant of_match_ptr() macro

Since these net drivers depend on CONFIG_OF, there is
no need to wrap the macro of_match_ptr() here.
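
To see why the wrapper is redundant, recall that of_match_ptr() expands to
its argument when CONFIG_OF is enabled and to NULL otherwise. The
self-contained sketch below reproduces that conditional definition in user
space; CONFIG_OF is defined by hand, and the table type, table contents and
the two accessor functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CONFIG_OF 1	/* assumption: models a driver that depends on CONFIG_OF */

#ifdef CONFIG_OF
#define of_match_ptr(_ptr)	(_ptr)
#else
#define of_match_ptr(_ptr)	NULL
#endif

/* Hypothetical, reduced match table entry. */
struct of_device_id_sketch {
	const char *compatible;
};

static const struct of_device_id_sketch demo_of_match[] = {
	{ .compatible = "vendor,demo" },	/* hypothetical compatible string */
	{ NULL }				/* sentinel */
};

static_assert(sizeof(demo_of_match) / sizeof(demo_of_match[0]) == 2,
	      "table needs one entry plus a sentinel");

/* With CONFIG_OF always set, the wrapper changes nothing. */
static const struct of_device_id_sketch *wrapped(void)
{
	return of_match_ptr(demo_of_match);
}

static const struct of_device_id_sketch *direct(void)
{
	return demo_of_match;
}
```

Both accessors return the same pointer here, so for a driver that cannot be
built without CONFIG_OF the macro only adds noise.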

Changes in v3:
- Collect responses from v1 and v2.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  wlcore: spi: Remove redundant of_match_ptr()
Ruan Jinjie [Mon, 14 Aug 2023 02:55:19 +0000 (10:55 +0800)]
wlcore: spi: Remove redundant of_match_ptr()

The driver depends on CONFIG_OF, so it is not necessary to use
of_match_ptr() here.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: qualcomm: Remove redundant of_match_ptr()
Ruan Jinjie [Mon, 14 Aug 2023 02:55:18 +0000 (10:55 +0800)]
net: qualcomm: Remove redundant of_match_ptr()

The driver depends on CONFIG_OF, so it is not necessary to use
of_match_ptr() here.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: gemini: Remove redundant of_match_ptr()
Ruan Jinjie [Mon, 14 Aug 2023 02:55:17 +0000 (10:55 +0800)]
net: gemini: Remove redundant of_match_ptr()

The driver depends on CONFIG_OF, so it is not necessary to use
of_match_ptr() here.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: dsa: rzn1-a5psw: Remove redundant of_match_ptr()
Ruan Jinjie [Mon, 14 Aug 2023 02:55:16 +0000 (10:55 +0800)]
net: dsa: rzn1-a5psw: Remove redundant of_match_ptr()

The driver depends on CONFIG_OF, so it is not necessary to use
of_match_ptr() here.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: dsa: realtek: Remove redundant of_match_ptr()
Ruan Jinjie [Mon, 14 Aug 2023 02:55:15 +0000 (10:55 +0800)]
net: dsa: realtek: Remove redundant of_match_ptr()

The driver depends on CONFIG_OF, so it is not necessary to use
of_match_ptr() here.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  nfc: virtual_ncidev: Use module_misc_device macro to simplify the code
Li Zetao [Tue, 15 Aug 2023 07:49:27 +0000 (15:49 +0800)]
nfc: virtual_ncidev: Use module_misc_device macro to simplify the code

Use the module_misc_device macro to simplify the code, which is the
same as declaring with module_init() and module_exit().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  Merge branch 'hns3-ethtool'
David S. Miller [Wed, 16 Aug 2023 07:56:38 +0000 (08:56 +0100)]
Merge branch 'hns3-ethtool'

Jijie Shao says:

====================
hns3: refactor registers information for ethtool -d

refactor registers information for ethtool -d
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: hns3: fix wrong rpu tln reg issue
Jijie Shao [Tue, 15 Aug 2023 06:06:41 +0000 (14:06 +0800)]
net: hns3: fix wrong rpu tln reg issue

In the original RPU query command, the status register values of
multiple RPU tunnels are accumulated by default, which is unreasonable.
This patch fixes it by querying the specified tunnel ID.
The tunnel number of the device can be obtained from firmware
during initialization.

Fixes: ddb54554fa51 ("net: hns3: add DFX registers information for ethtool -d")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: hns3: Support tlv in regs data for HNS3 VF driver
Jijie Shao [Tue, 15 Aug 2023 06:06:40 +0000 (14:06 +0800)]
net: hns3: Support tlv in regs data for HNS3 VF driver

The dump register function is being refactored.
The third step in refactoring is to support tlv info in regs data for
HNS3 VF driver.

Currently, if we use "ethtool -d" to dump regs value,
the output is as follows:
  offset1: 00 01 02 03 04 05 ...
  offset2: 10 11 12 13 14 15 ...
  ......

We can't get the value of a register directly.

This patch deletes the original separator information and
adds tag_len_value information in regs data.
ethtool can then parse register data in key-value format with the -d option.

A patch will be added to ethtool to parse regs data
in the following format:
  reg1 : value1
  reg2 : value2
  ......

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: hns3: Support tlv in regs data for HNS3 PF driver
Jijie Shao [Tue, 15 Aug 2023 06:06:39 +0000 (14:06 +0800)]
net: hns3: Support tlv in regs data for HNS3 PF driver

The dump register function is being refactored.
The second step in refactoring is to support tlv info in regs data for
HNS3 PF driver.

Currently, if we use "ethtool -d" to dump regs value,
the output is as follows:
  offset1: 00 01 02 03 04 05 ...
  offset2: 10 11 12 13 14 15 ...
  ......

We can't get the value of a register directly.

This patch deletes the original separator information and
adds tag_len_value information in regs data.
ethtool can then parse register data in key-value format with the -d option.

A patch will be added to ethtool to parse regs data
in the following format:
  reg1 : value1
  reg2 : value2
  ......
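
How a consumer might walk such tag_len_value data can be sketched as
follows. The record layout used here (a 32-bit tag, a 32-bit length counted
in 32-bit words, then the values) is purely an assumption for illustration
and is not taken from the hns3 driver or from ethtool; tlv_find_first() is
a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a buffer of TLV records laid out as: tag, len, len value words.
 * Store the first value found under wanted_tag in *out.
 * Returns 0 on success, -1 if the tag is missing or a record is malformed.
 */
static int tlv_find_first(const uint32_t *buf, size_t nwords,
			  uint32_t wanted_tag, uint32_t *out)
{
	size_t pos = 0;

	assert(buf && out);
	while (nwords - pos >= 2) {
		uint32_t tag = buf[pos];
		uint32_t len = buf[pos + 1];

		pos += 2;
		if (len > nwords - pos)
			return -1;	/* record runs past the end of the buffer */
		if (tag == wanted_tag && len > 0) {
			*out = buf[pos];
			return 0;
		}
		pos += len;		/* skip this record's values */
	}
	return -1;
}
```

Because every record carries its own tag and length, a parser can jump
straight to the register group it wants instead of guessing from raw
offsets, which is the limitation of the old dump format noted above.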

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: hns3: move dump regs function to a separate file
Jijie Shao [Tue, 15 Aug 2023 06:06:38 +0000 (14:06 +0800)]
net: hns3: move dump regs function to a separate file

The dump register function is being refactored.
The first step in refactoring is to put the dump regs function
into a separate file.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  Merge branch 'fec-XDP_TX'
David S. Miller [Wed, 16 Aug 2023 06:12:40 +0000 (07:12 +0100)]
Merge branch 'fec-XDP_TX'

Wei Fang says:

====================
net: fec: add XDP_TX feature support

This patch set adds support for the XDP_TX feature to the FEC driver. The
first patch adds initial XDP_TX support, and the second patch improves the
performance of XDP_TX by not using xdp_convert_buff_to_frame(). Please
refer to the commit message of each patch for more details.
====================

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: fec: improve XDP_TX performance
Wei Fang [Tue, 15 Aug 2023 05:19:55 +0000 (13:19 +0800)]
net: fec: improve XDP_TX performance

As suggested by Jesper and Alexander, we can avoid converting xdp_buff
to xdp_frame in case of XDP_TX to save a bunch of CPU cycles, so that
we can further improve the XDP_TX performance.

Before this patch on i.MX8MP-EVK board, the performance shows as follows.
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     353918 pkt/s
proto 17:     352923 pkt/s
proto 17:     353900 pkt/s
proto 17:     352672 pkt/s
proto 17:     353912 pkt/s
proto 17:     354219 pkt/s

After applying this patch, the performance is improved.
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     369261 pkt/s
proto 17:     369267 pkt/s
proto 17:     369206 pkt/s
proto 17:     369214 pkt/s
proto 17:     369126 pkt/s
proto 17:     369272 pkt/s

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months ago  net: fec: add XDP_TX feature support
Wei Fang [Tue, 15 Aug 2023 05:19:54 +0000 (13:19 +0800)]
net: fec: add XDP_TX feature support

The XDP_TX feature was not supported before, and all the frames
which were deemed to do the XDP_TX action actually did the XDP_DROP
action. So this patch adds the XDP_TX support to the FEC driver.

I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB
mode respectively on i.MX8MP-EVK platform, and as suggested by
Jesper, I also tested the performance of XDP_REDIRECT on the
same platform. And the test steps and results are as follows.

XDP_TX test:
Step 1: One board is used as the generator and connects to a switch,
and the FEC port of the DUT also connects to the switch. Flow control
is off on both boards. The generator then runs the
pktgen_sample03_burst_single_flow.sh script to generate and send
burst traffic to the DUT. Note that the packet size was set to 64
bytes and the protocol was UDP in my test scenario. In addition, the
SMAC of the packet needs to be different from the MAC of the
generator, because the xdp2 program swaps the DMAC and SMAC of the
packet and sends it back to the generator. If the SMAC of the
generated packet is the MAC of the generator, the generator receives
the returned traffic, which increases its CPU load and significantly
degrades its transmit speed, and this in turn distorts the XDP_TX
performance measurement.
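
The DMAC/SMAC swap mentioned in step 1 can be illustrated in plain
userspace C. This is only a sketch of the idea and not the xdp2 source:
the MAC_LEN constant and the function name are chosen here for
illustration, and the frame is assumed to start with a standard
Ethernet header (6 bytes DMAC, then 6 bytes SMAC).

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6 /* length of an Ethernet MAC address in bytes */

/*
 * Swap the destination MAC (bytes 0..5) and the source MAC (bytes 6..11)
 * at the start of an Ethernet frame, so that the frame is addressed back
 * to whoever appears as its sender.
 */
static void swap_src_dst_mac(uint8_t *frame)
{
	uint8_t tmp[MAC_LEN];

	memcpy(tmp, frame, MAC_LEN);
	memcpy(frame, frame + MAC_LEN, MAC_LEN);
	memcpy(frame + MAC_LEN, tmp, MAC_LEN);
}
```

After the swap the old SMAC is the new DMAC, which is why a generated
SMAC equal to the generator's own MAC makes the traffic come back to
the generator.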

Step 2: The DUT runs the xdp2 program to transmit received UDP
packets back out on the same port where they were received.

root@imx8mpevk:~# ./xdp2 eth0
proto 17:     353918 pkt/s
proto 17:     352923 pkt/s
proto 17:     353900 pkt/s
proto 17:     352672 pkt/s
proto 17:     353912 pkt/s
proto 17:     354219 pkt/s

root@imx8mpevk:~# ./xdp2 -S eth0
proto 17:     160604 pkt/s
proto 17:     160708 pkt/s
proto 17:     160564 pkt/s
proto 17:     160684 pkt/s
proto 17:     160640 pkt/s
proto 17:     160720 pkt/s

The above results show that the XDP_TX performance of XDP_DRV mode
is much better than XDP_SKB mode, more than twice that of XDP_SKB
mode, which is in line with our expectation.

XDP_REDIRECT test:
Step 1: Both the generator and the FEC port of the DUT connect to the
switch. Flow control is off on all ports. The generator then runs the
pktgen script to generate and send burst traffic to the DUT. Note that
the packet size was set to 64 bytes and the protocol was UDP in my
test scenario.

Step 2: The DUT runs the xdp_redirect program to redirect the traffic
from the FEC port to the FEC port itself.

root@imx8mpevk:~# ./xdp_redirect eth0 eth0
Redirecting from eth0 (ifindex 2; driver fec) to eth0
(ifindex 2; driver fec)
Summary        232,302 rx/s        0 err,drop/s      232,344 xmit/s
Summary        234,579 rx/s        0 err,drop/s      234,577 xmit/s
Summary        235,548 rx/s        0 err,drop/s      235,549 xmit/s
Summary        234,704 rx/s        0 err,drop/s      234,703 xmit/s
Summary        235,504 rx/s        0 err,drop/s      235,504 xmit/s
Summary        235,223 rx/s        0 err,drop/s      235,224 xmit/s
Summary        234,509 rx/s        0 err,drop/s      234,507 xmit/s
Summary        235,481 rx/s        0 err,drop/s      235,482 xmit/s
Summary        234,684 rx/s        0 err,drop/s      234,683 xmit/s
Summary        235,520 rx/s        0 err,drop/s      235,520 xmit/s
Summary        235,461 rx/s        0 err,drop/s      235,461 xmit/s
Summary        234,627 rx/s        0 err,drop/s      234,627 xmit/s
Summary        235,611 rx/s        0 err,drop/s      235,611 xmit/s
  Packets received    : 3,053,753
  Average packets/s   : 234,904
  Packets transmitted : 3,053,792
  Average transmit/s  : 234,907

Compared with XDP_REDIRECT, XDP_TX also performs much better, which
is also in line with our expectation.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months agoselftests: bonding: remove redundant delete action of device link1_1
Zhengchao Shao [Sat, 12 Aug 2023 08:40:36 +0000 (16:40 +0800)]
selftests: bonding: remove redundant delete action of device link1_1

Running "ip netns delete client" already deletes device link1_1, so
there is no need to delete link1_1 again. Remove the redundant delete.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months agoMerge tag 'mlx5-updates-2023-08-14' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Wed, 16 Aug 2023 02:21:49 +0000 (19:21 -0700)]
Merge tag 'mlx5-updates-2023-08-14' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-08-14

1) Handle PTP out of order CQEs issue
2) Check FW status before determining reset successful
3) Expose maximum supported SFs via devlink resource
4) MISC cleanups

* tag 'mlx5-updates-2023-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Don't query MAX caps twice
  net/mlx5: Remove unused MAX HCA capabilities
  net/mlx5: Remove unused CAPs
  net/mlx5: Fix error message in mlx5_sf_dev_state_change_handler()
  net/mlx5: Remove redundant check of mlx5_vhca_event_supported()
  net/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling MLX5_CAP_GEN()
  net/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init()
  net/mlx5: Use auxiliary_device_uninit() instead of device_put()
  net/mlx5: E-switch, Add checking for flow rule destinations
  net/mlx5: Check with FW that sync reset completed successfully
  net/mlx5: Expose max possible SFs via devlink resource
  net/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQ
  net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs
  net/mlx5: Consolidate devlink documentation in devlink/mlx5.rst
====================

Link: https://lore.kernel.org/r/20230814214144.159464-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoMerge branch 'net-warn-about-attempts-to-register-negative-ifindex'
Jakub Kicinski [Wed, 16 Aug 2023 02:18:36 +0000 (19:18 -0700)]
Merge branch 'net-warn-about-attempts-to-register-negative-ifindex'

Jakub Kicinski says:

====================
net: warn about attempts to register negative ifindex

Follow up to the recently posted fix for OvS lacking input
validation:
https://lore.kernel.org/all/20230814203840.2908710-1-kuba@kernel.org/

Warn about negative ifindex more explicitly and misc YNL updates.
====================

Link: https://lore.kernel.org/r/20230814205627.2914583-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agotools: ynl: add more info to KeyErrors on missing attrs
Jakub Kicinski [Mon, 14 Aug 2023 20:56:27 +0000 (13:56 -0700)]
tools: ynl: add more info to KeyErrors on missing attrs

When developing specs it's useful to know which attr space
YNL was trying to find an attribute in on key error.

Instead of printing:
 KeyError: 0
add info about the space:
 Exception: Space 'vport' has no attribute with value '0'

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonetlink: specs: add ovs_vport new command
Jakub Kicinski [Mon, 14 Aug 2023 20:56:26 +0000 (13:56 -0700)]
netlink: specs: add ovs_vport new command

Add NEW to the spec, it was useful for testing the fix for OvS
input validation.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet: warn about attempts to register negative ifindex
Jakub Kicinski [Mon, 14 Aug 2023 20:56:25 +0000 (13:56 -0700)]
net: warn about attempts to register negative ifindex

Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().

OvS is susceptible to this problem and lacking validation (fix posted
separately for net).

Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
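
The problem of sharing one int between valid indexes and negative
errno values can be shown with a small userspace model. This is a
sketch under stated assumptions and not the kernel's
dev_index_reserve(): the table size, the index_reserve() name and the
specific error codes are picked for illustration, and the kernel
version additionally emits a warning.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_IFINDEX 8 /* tiny table, for illustration only */

static bool ifindex_used[MAX_IFINDEX + 1];

/*
 * Returns a positive ifindex on success or a negative errno on failure.
 * requested == 0 means "pick any free index".
 */
static int index_reserve(int requested)
{
	/* A negative index would be indistinguishable from an errno. */
	if (requested < 0)
		return -EINVAL;
	if (requested > MAX_IFINDEX)
		return -ERANGE;

	if (requested) {
		if (ifindex_used[requested])
			return -EBUSY;
		ifindex_used[requested] = true;
		return requested;
	}

	for (int i = 1; i <= MAX_IFINDEX; i++) {
		if (!ifindex_used[i]) {
			ifindex_used[i] = true;
			return i;
		}
	}
	return -ENOSPC;
}
```

Callers can only test "ret < 0" for failure because the function never
hands back a negative value as a valid index.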

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoeth: r8152: try to use a normal budget
Jakub Kicinski [Mon, 14 Aug 2023 15:35:21 +0000 (08:35 -0700)]
eth: r8152: try to use a normal budget

Mario reports that loading r8152 on his system leads to a:

  netif_napi_add_weight() called with weight 256

warning getting printed. We don't have any solid data
on why such a high budget was chosen, and it may cause
stalls in processing other softirqs and rt threads.
So try to switch back to the default (64) weight.

If this slows down someone's system we should investigate
which part of stopping and starting the NAPI poll in this
driver is expensive.

Reported-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/all/0bfd445a-81f7-f702-08b0-bd5a72095e49@amd.com/
Acked-by: Hayes Wang <hayeswang@realtek.com>
Link: https://lore.kernel.org/r/20230814153521.2697982-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet: e1000e: Remove unused declarations
Yue Haibing [Mon, 14 Aug 2023 13:58:21 +0000 (21:58 +0800)]
net: e1000e: Remove unused declarations

Commit bdfe2da6aefd ("e1000e: cosmetic move of function prototypes to the new mac.h")
declared but never implemented them.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20230814135821.4808-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoqed: remove unused 'resp_size' calculation
Arnd Bergmann [Mon, 14 Aug 2023 07:45:03 +0000 (09:45 +0200)]
qed: remove unused 'resp_size' calculation

Newer versions of clang warn about this variable being assigned but
never used:

drivers/net/ethernet/qlogic/qed/qed_vf.c:63:67: error: parameter 'resp_size' set but not used [-Werror,-Wunused-but-set-parameter]

There is no indication in the git history on how this was ever
meant to be used, so just remove the entire calculation and argument
passing for it to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814074512.1067715-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet: phy: mediatek-ge-soc: support PHY LEDs
Daniel Golle [Mon, 14 Aug 2023 01:58:14 +0000 (02:58 +0100)]
net: phy: mediatek-ge-soc: support PHY LEDs

Implement netdev trigger and primitive blinking offloading as well as
a simple set_brightness function for both PHY LEDs of the in-SoC PHYs
found in MT7981 and MT7988.

For MT7988, read the boottrap register and apply LED polarities
accordingly to get uniform behavior from all LEDs on MT7988.
This requires the syscon phandle 'mediatek,pio' to be present in the
parent MDIO bus, which should point to the syscon holding the boottrap
register.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/dc324d48c00cd7350f3a506eaa785324cae97372.1691977904.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoMerge branch 'nexthop-various-cleanups'
Jakub Kicinski [Wed, 16 Aug 2023 01:54:54 +0000 (18:54 -0700)]
Merge branch 'nexthop-various-cleanups'

Ido Schimmel says:

====================
nexthop: Various cleanups

Benefit from recent bug fixes and simplify the nexthop dump code.

No regressions in existing tests:

 # ./fib_nexthops.sh
 [...]
 Tests passed: 234
 Tests failed:   0
====================

Link: https://lore.kernel.org/r/20230813164856.2379822-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonexthop: Do not increment dump sentinel at the end of the dump
Ido Schimmel [Sun, 13 Aug 2023 16:48:56 +0000 (19:48 +0300)]
nexthop: Do not increment dump sentinel at the end of the dump

The nexthop and nexthop bucket dump callbacks previously returned a
positive return code even when the dump was complete, prompting the core
netlink code to invoke the callback again, until returning zero.

Zero was only returned by these callbacks when no information was filled
in the provided skb, which was achieved by incrementing the dump
sentinel at the end of the dump beyond the ID of the last nexthop.

This is no longer necessary as when the dump is complete these callbacks
return zero.

Remove the unnecessary increment.
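
The resumable dump pattern described above can be modelled in
userspace. This is a simplified sketch, not the nexthop code: struct
dump_ctx, dump_ids() and the fixed-capacity output array are stand-ins
for the netlink callback context and the skb.

```c
#include <stddef.h>

/* Sentinel: the first ID that has not been dumped yet. */
struct dump_ctx {
	unsigned int idx;
};

/*
 * Copy IDs from ids[] (sorted ascending) into out[], at most cap per call.
 * Returns 1 if it stopped because out[] was full (call again to resume),
 * or 0 when the dump is complete.
 */
static int dump_ids(const unsigned int *ids, size_t n, struct dump_ctx *ctx,
		    unsigned int *out, size_t cap, size_t *written)
{
	*written = 0;

	for (size_t i = 0; i < n; i++) {
		if (ids[i] < ctx->idx)
			continue; /* already dumped by an earlier call */
		if (*written == cap) {
			ctx->idx = ids[i]; /* resume from here next time */
			return 1;
		}
		out[(*written)++] = ids[i];
	}
	return 0;
}
```

Because completion is signalled by the zero return value, the sentinel
does not need to be pushed past the last ID at the end of the dump.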

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonexthop: Simplify nexthop bucket dump
Ido Schimmel [Sun, 13 Aug 2023 16:48:55 +0000 (19:48 +0300)]
nexthop: Simplify nexthop bucket dump

Before commit f10d3d9df49d ("nexthop: Make nexthop bucket dump more
efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return
code for each resilient nexthop group whose buckets it dumped,
regardless of whether it encountered an error.

This meant that the sentinel ('dd->ctx->nh.idx') used by the function
that walked the different nexthops could not be used as a sentinel for
the bucket dump, as otherwise buckets from the same group would be
dumped over and over again.

This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx')
that was incremented by rtm_dump_nexthop_bucket_nh() after successfully
dumping all the buckets from a given group.

After the previously mentioned commit this sentinel is no longer
necessary since the function no longer returns a non-zero return code
when successfully dumping all the buckets from a given group.

Remove this sentinel and simplify the code.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoMerge branch 'seg6-add-next-c-sid-support-for-srv6-end-x-behavior'
Jakub Kicinski [Wed, 16 Aug 2023 01:51:49 +0000 (18:51 -0700)]
Merge branch 'seg6-add-next-c-sid-support-for-srv6-end-x-behavior'

Andrea Mayer says:

====================
seg6: add NEXT-C-SID support for SRv6 End.X behavior

In the Segment Routing (SR) architecture a list of instructions, called
segments, can be added to the packet headers to influence the forwarding and
processing of the packets in an SR enabled network.

Considering the Segment Routing over IPv6 data plane (SRv6) [1], the segment
identifiers (SIDs) are IPv6 addresses (128 bits) and the segment list (SID
List) is carried in the Segment Routing Header (SRH). A segment may correspond
to a "behavior" that is executed by a node when the packet is received.
The Linux kernel currently supports a large subset of the behaviors described
in [2] (e.g., End, End.X, End.T and so on).

In some SRv6 scenarios, the number of segments carried by the SID List may
increase dramatically, reducing the MTU (Maximum Transmission Unit) size and/or
limiting the processing power of legacy hardware devices (due to longer IPv6
headers).

The NEXT-C-SID mechanism [3] extends the SRv6 architecture by providing several
ways to efficiently represent the SID List.
By leveraging the NEXT-C-SID, it is possible to encode several SRv6 segments
within a single 128 bit SID address (also referenced as Compressed SID
Container). In this way, the length of the SID List can be drastically reduced.

The NEXT-C-SID mechanism is built upon the "flavors" framework defined in [2].
This framework is already supported by the Linux SRv6 subsystem and is used to
modify and/or extend a subset of existing behaviors.

In this patchset, we extend the SRv6 End.X behavior in order to support the
NEXT-C-SID mechanism.

In detail, the patchset consists of:
 - patch 1/2: add NEXT-C-SID support for SRv6 End.X behavior;
 - patch 2/2: add selftest for NEXT-C-SID in SRv6 End.X behavior.

From the user space perspective, we do not need to change the iproute2 code to
support the NEXT-C-SID flavor for the SRv6 End.X behavior. However, we will
update the man page considering the NEXT-C-SID flavor applied to the SRv6 End.X
behavior in a separate patch.

[1] - https://datatracker.ietf.org/doc/html/rfc8754
[2] - https://datatracker.ietf.org/doc/html/rfc8986
[3] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
====================

Link: https://lore.kernel.org/r/20230812180926.16689-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoselftests: seg6: add selftest for NEXT-C-SID flavor in SRv6 End.X behavior
Paolo Lungaroni [Sat, 12 Aug 2023 18:09:26 +0000 (20:09 +0200)]
selftests: seg6: add selftest for NEXT-C-SID flavor in SRv6 End.X behavior

This selftest is designed for testing the support of NEXT-C-SID flavor
for SRv6 End.X behavior. It instantiates a virtual network composed of
several nodes: hosts and SRv6 routers. Each node is realized using a
network namespace that is properly interconnected to others through veth
pairs, according to the topology depicted in the selftest script file.
The test considers SRv6 routers implementing IPv4/IPv6 L3 VPNs leveraged
by hosts for communicating with each other. Such routers i) apply
different SRv6 Policies to the traffic received from connected hosts,
considering the IPv4 or IPv6 protocols; ii) use the NEXT-C-SID
compression mechanism for encoding several SRv6 segments within a single
128-bit SID address, referred to as a Compressed SID (C-SID) container.

The NEXT-C-SID is provided as a "flavor" of the SRv6 End.X behavior,
enabling it to properly process the C-SID containers. The correct
execution of the enabled NEXT-C-SID SRv6 End.X behavior is verified
through reachability tests carried out between hosts belonging to the
same VPN.

Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Co-developed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-3-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoseg6: add NEXT-C-SID support for SRv6 End.X behavior
Andrea Mayer [Sat, 12 Aug 2023 18:09:25 +0000 (20:09 +0200)]
seg6: add NEXT-C-SID support for SRv6 End.X behavior

The NEXT-C-SID mechanism described in [1] offers the possibility of
encoding several SRv6 segments within a single 128 bit SID address. Such
a SID address is called a Compressed SID (C-SID) container. In this way,
the length of the SID List can be drastically reduced.

A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
logically structured in three main blocks: i) Locator-Block; ii)
Locator-Node Function; iii) Argument.

                        C-SID container
+------------------------------------------------------------------+
|     Locator-Block      |Loc-Node|            Argument            |
|                        |Function|                                |
+------------------------------------------------------------------+
<--------- B -----------> <- NF -> <------------- A --------------->

   (i) The Locator-Block can be any IPv6 prefix available to the provider;

  (ii) The Locator-Node Function represents the node and the function to
       be triggered when a packet is received on the node;

 (iii) The Argument carries the remaining C-SIDs in the current C-SID
       container.

This patch leverages the NEXT-C-SID mechanism previously introduced in the
Linux SRv6 subsystem [2] to support SID compression capabilities in the
SRv6 End.X behavior [3].
An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior
but it is capable of processing the compressed SID List encoded in C-SID
containers.

An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support
user-provided Locator-Block and Locator-Node Function lengths. In this
implementation, such lengths must be evenly divisible by 8 (i.e. must be
byte-aligned), otherwise the kernel informs the user about invalid
values with a meaningful error code and message through netlink_ext_ack.

If Locator-Block and/or Locator-Node Function lengths are not provided
by the user during configuration of an SRv6 End.X behavior instance with
NEXT-C-SID flavor, the kernel will choose their default values i.e.,
32-bit Locator-Block and 16-bit Locator-Node Function.
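
The container layout and the length check above can be sketched in
userspace C. This is an illustration under stated assumptions, not the
kernel implementation: the function names are hypothetical, only
byte-aligned lengths are handled, and the "advance" step follows the
general NEXT-C-SID idea of shifting the Argument into the Locator-Node
Function position and zero-filling the tail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SID_BYTES 16 /* an SRv6 SID is a 128-bit IPv6 address */

/* Lengths are in bits; accept only byte-aligned values that leave an Argument. */
static bool csid_lens_valid(unsigned int lbl_bits, unsigned int nfl_bits)
{
	if (!lbl_bits || !nfl_bits)
		return false;
	if (lbl_bits % 8 || nfl_bits % 8)
		return false;
	return lbl_bits + nfl_bits < 128;
}

/* True when the Argument is all zero, i.e. no further C-SID is carried. */
static bool csid_arg_is_zero(const uint8_t *sid, unsigned int lbl_bits,
			     unsigned int nfl_bits)
{
	for (size_t i = (lbl_bits + nfl_bits) / 8; i < SID_BYTES; i++)
		if (sid[i])
			return false;
	return true;
}

/*
 * Shift the Argument left by the Locator-Node Function length so that the
 * next C-SID becomes the active one, then zero-fill the freed tail.
 */
static void csid_advance(uint8_t *sid, unsigned int lbl_bits,
			 unsigned int nfl_bits)
{
	size_t lbl = lbl_bits / 8, nfl = nfl_bits / 8;
	size_t arg_len = SID_BYTES - lbl - nfl;

	memmove(sid + lbl, sid + lbl + nfl, arg_len);
	memset(sid + lbl + arg_len, 0, nfl);
}
```

With the default lengths (32-bit Locator-Block, 16-bit Locator-Node
Function), a container fcbb:bb00:0100:0200:0300:: becomes
fcbb:bb00:0200:0300:: after one call to csid_advance().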

[1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
[2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/
[3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoMerge branch 'genetlink-provide-struct-genl_info-to-dumps'
Jakub Kicinski [Tue, 15 Aug 2023 22:01:03 +0000 (15:01 -0700)]
Merge branch 'genetlink-provide-struct-genl_info-to-dumps'

Jakub Kicinski says:

====================
genetlink: provide struct genl_info to dumps

One of the biggest (which is not to say only) annoyances with genetlink
handling today is that doit and dumpit need some of the same information,
but it is passed to them in completely different structs.

The implementations commonly end up writing a _fill() method which
populates a message and has to be passed at least 6 parameters, 3 of
which are extracted manually from request info.

After a lot of umming and ahing I decided to populate struct genl_info
for dumps, without trying to factor out only the common parts.
This makes the adoption easiest.

In the future we may add a new version of dump which takes
struct genl_info *info as the second argument, instead of
struct netlink_callback *cb. For now developers have to call
genl_info_dump(cb) to get the info.

Typical genetlink families no longer get exposed to netlink protocol
internals like pid and seq numbers.
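
The benefit can be pictured with a small userspace model. Everything
below is a simplified stand-in and not the kernel definitions: struct
req_info plays the role of struct genl_info, struct dump_cb the role
of struct netlink_callback, and dump_info() the role of
genl_info_dump(cb). The obj_*() functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct genl_info. */
struct req_info {
	uint32_t portid;
	uint32_t seq;
	uint8_t cmd;
};

/* Simplified stand-in for struct netlink_callback. */
struct dump_cb {
	struct req_info info; /* populated once when the dump starts */
	long pos;             /* next object to dump */
};

/* Model of genl_info_dump(cb): hand out the info stored in the dump state. */
static const struct req_info *dump_info(const struct dump_cb *cb)
{
	return &cb->info;
}

/* One fill helper serves both request types; it only needs the info. */
static int obj_fill(char *buf, size_t len, const struct req_info *info,
		    int obj_id)
{
	return snprintf(buf, len, "portid=%u seq=%u cmd=%u obj=%d",
			(unsigned int)info->portid, (unsigned int)info->seq,
			(unsigned int)info->cmd, obj_id);
}

static int obj_doit(char *buf, size_t len, const struct req_info *info, int id)
{
	return obj_fill(buf, len, info, id);
}

static int obj_dumpit(char *buf, size_t len, struct dump_cb *cb)
{
	int n = obj_fill(buf, len, dump_info(cb), (int)cb->pos);

	cb->pos++;
	return n;
}
```

In this model neither obj_doit() nor obj_dumpit() has to unpack portid
or seq by hand before calling the fill helper.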

v3:
 - correct the condition in ethtool code (patch 10)
v2: https://lore.kernel.org/all/20230810233845.2318049-1-kuba@kernel.org/
 - replace the GENL_INFO_NTF() macro with init helper
 - fix the commit messages
v1: https://lore.kernel.org/all/20230809182648.1816537-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230814214723.2924989-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoethtool: netlink: always pass genl_info to .prepare_data
Jakub Kicinski [Mon, 14 Aug 2023 21:47:23 +0000 (14:47 -0700)]
ethtool: netlink: always pass genl_info to .prepare_data

We had a number of bugs in the past because developers forgot
to fully test dumps, which pass NULL as info to .prepare_data.
.prepare_data implementations would try to access info->extack
leading to a null-deref.

Now that dumps and notifications can access struct genl_info
we can pass it in, and remove the info null checks.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agoethtool: netlink: simplify arguments to ethnl_default_parse()
Jakub Kicinski [Mon, 14 Aug 2023 21:47:22 +0000 (14:47 -0700)]
ethtool: netlink: simplify arguments to ethnl_default_parse()

Pass struct genl_info directly instead of its members.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonetdev-genl: use struct genl_info for reply construction
Jakub Kicinski [Mon, 14 Aug 2023 21:47:21 +0000 (14:47 -0700)]
netdev-genl: use struct genl_info for reply construction

Use the just added APIs to make the code simpler.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: add genlmsg_iput() API
Jakub Kicinski [Mon, 14 Aug 2023 21:47:20 +0000 (14:47 -0700)]
genetlink: add genlmsg_iput() API

Add some APIs and helpers required for convenient construction
of replies and notifications based on struct genl_info.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: add a family pointer to struct genl_info
Jakub Kicinski [Mon, 14 Aug 2023 21:47:19 +0000 (14:47 -0700)]
genetlink: add a family pointer to struct genl_info

Having family in struct genl_info is quite useful. It cuts
down the number of arguments which need to be passed to
helpers which already take struct genl_info.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: use attrs from struct genl_info
Jakub Kicinski [Mon, 14 Aug 2023 21:47:18 +0000 (14:47 -0700)]
genetlink: use attrs from struct genl_info

Since dumps carry struct genl_info now, use the attrs pointer
from genl_info and remove the one in struct genl_dumpit_info.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: add struct genl_info to struct genl_dumpit_info
Jakub Kicinski [Mon, 14 Aug 2023 21:47:17 +0000 (14:47 -0700)]
genetlink: add struct genl_info to struct genl_dumpit_info

Netlink GET implementations must currently juggle struct genl_info
and struct netlink_callback, depending on whether they were called
from doit or dumpit.

Add genl_info to the dump state and populate the fields.
This way implementations can simply pass struct genl_info around.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: remove userhdr from struct genl_info
Jakub Kicinski [Mon, 14 Aug 2023 21:47:16 +0000 (14:47 -0700)]
genetlink: remove userhdr from struct genl_info

Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to user header in struct genl_info
is overkill. Compute the header pointer at runtime.
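
Computing the pointer at runtime is cheap because the family-specific
fixed header, when present, sits directly after the generic netlink
header. A minimal userspace sketch of that pointer arithmetic follows.
struct genl_hdr_model mirrors the 4-byte layout of the generic netlink
header, and userhdr_from_genlhdr() is a hypothetical name used only
here.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the 4-byte generic netlink header. */
struct genl_hdr_model {
	uint8_t cmd;
	uint8_t version;
	uint16_t reserved;
};

/* The user header starts right after the generic netlink header. */
static void *userhdr_from_genlhdr(struct genl_hdr_model *hdr)
{
	return (uint8_t *)hdr + sizeof(*hdr);
}
```

No pointer needs to be cached in the request info, since it can be
derived from the header pointer whenever one of the few users needs it.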

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: make genl_info->nlhdr const
Jakub Kicinski [Mon, 14 Aug 2023 21:47:15 +0000 (14:47 -0700)]
genetlink: make genl_info->nlhdr const

struct netlink_callback has a const nlh pointer, make the
pointer in struct genl_info const as well, to make copying
between the two easier.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agogenetlink: push conditional locking into dumpit/done
Jakub Kicinski [Mon, 14 Aug 2023 21:47:14 +0000 (14:47 -0700)]
genetlink: push conditional locking into dumpit/done

Add helpers which take/release the genl mutex based
on family->parallel_ops. Remove the separation between
handling of ops in locked and parallel families.

Future patches would make the duplicated code grow even more.
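
The shape of such helpers can be sketched in userspace. This is a
model and not the kernel code: a counter stands in for the genl mutex,
and struct family_model, op_lock(), op_unlock() and run_op() are names
chosen for illustration.

```c
#include <stdbool.h>

/* Stand-in for the family; only the flag that matters here. */
struct family_model {
	bool parallel_ops;
};

static int global_lock_depth; /* stands in for the genl mutex */

static void global_lock(void)   { global_lock_depth++; }
static void global_unlock(void) { global_lock_depth--; }

/* Take the global lock only for families that do not run ops in parallel. */
static void op_lock(const struct family_model *f)
{
	if (!f->parallel_ops)
		global_lock();
}

static void op_unlock(const struct family_model *f)
{
	if (!f->parallel_ops)
		global_unlock();
}

/* A single code path runs the op for both kinds of family. */
static int run_op(const struct family_model *f, int (*op)(void))
{
	int ret;

	op_lock(f);
	ret = op();
	op_unlock(f);
	return ret;
}

/* Example op: reports whether it runs under the global lock. */
static int report_lock_depth(void)
{
	return global_lock_depth;
}
```

With the condition folded into the helpers, the caller no longer needs
separate code paths for locked and parallel families.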

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet: dsa: mv88e6060: add phylink_get_caps implementation
Russell King (Oracle) [Sat, 12 Aug 2023 09:30:33 +0000 (10:30 +0100)]
net: dsa: mv88e6060: add phylink_get_caps implementation

Add a phylink_get_caps implementation for Marvell 88e6060 DSA switch.
This is a fast ethernet switch, with internal PHYs for ports 0 through
4. Port 4 also supports MII, REVMII, REVRMII and SNI. Port 5 supports
MII, REVMII, REVRMII and SNI without an internal PHY.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qUkx7-003dMX-9b@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet/mlx5: Don't query MAX caps twice
Shay Drory [Tue, 11 Jul 2023 13:32:05 +0000 (16:32 +0300)]
net/mlx5: Don't query MAX caps twice

Whenever the mlx5 driver is probed or reloaded, it queries some capabilities
in MAX mode via set_hca_cap() API. Afterwards, the driver queries all
capabilities in MAX mode via mlx5_query_hca_caps() API.

Since MAX caps are read only caps, querying them twice is redundant.
Hence, delete the second query.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Remove unused MAX HCA capabilities
Shay Drory [Tue, 11 Jul 2023 12:56:08 +0000 (15:56 +0300)]
net/mlx5: Remove unused MAX HCA capabilities

Each device cap has two modes: MAX and CUR. The driver maintains a
cache of both modes of the capabilities. For most device caps, the MAX
cap mode is never used.

Hence, remove all driver queries of the MAX mode of the said caps as
well as their helper MACROs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Remove unused CAPs
Shay Drory [Sun, 2 Jan 2022 12:57:39 +0000 (14:57 +0200)]
net/mlx5: Remove unused CAPs

The mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
no user requires them.
Likewise, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.

Thus, drop all usages and definitions of the caps mentioned above.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Fix error message in mlx5_sf_dev_state_change_handler()
Jiri Pirko [Fri, 30 Jun 2023 10:45:39 +0000 (12:45 +0200)]
net/mlx5: Fix error message in mlx5_sf_dev_state_change_handler()

sw_function_id contains sfnum, so fix the error message to name the
value properly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Remove redundant check of mlx5_vhca_event_supported()
Jiri Pirko [Fri, 30 Jun 2023 07:41:14 +0000 (09:41 +0200)]
net/mlx5: Remove redundant check of mlx5_vhca_event_supported()

Since mlx5_vhca_event_supported() is called in mlx5_sf_dev_supported(),
remove the redundant call.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling MLX5_CAP...
Jiri Pirko [Fri, 30 Jun 2023 07:37:04 +0000 (09:37 +0200)]
net/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling MLX5_CAP_GEN()

There is a helper called mlx5_sf_start_function_id() that
wraps up a query to get base SF function id. Use that instead of
calling MLX5_CAP_GEN() directly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init()
Jiri Pirko [Fri, 30 Jun 2023 07:32:14 +0000 (09:32 +0200)]
net/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init()

Since the mlx5_sf_supported() check is the first thing done in
mlx5_sf_max_functions(), remove the redundant check.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Use auxiliary_device_uninit() instead of device_put()
Jiri Pirko [Wed, 28 Jun 2023 14:19:52 +0000 (16:19 +0200)]
net/mlx5: Use auxiliary_device_uninit() instead of device_put()

Use auxiliary_device_uninit(), which internally just calls device_put(), to
uninit the auxiliary device instead of calling device_put() directly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: E-switch, Add checking for flow rule destinations
Jianbo Liu [Wed, 19 Apr 2023 03:17:57 +0000 (03:17 +0000)]
net/mlx5: E-switch, Add checking for flow rule destinations

Firmware doesn't allow flow rules in FDB to do header rewrite and send
packets to both internal and uplink vports. The following syndrome
will be generated when trying to offload such rules:

mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 23569): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8c8f08), err(-22)

To avoid this syndrome, add a check before creating the FTE. If a rule
with a header rewrite action forwards packets to both VF and PF, an
error is returned directly.
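
The check described above can be sketched roughly as follows. This is only an
illustration: the types, the function name check_rule_dests() and the choice
of -EINVAL as the returned error are assumptions made for the sketch and are
not taken from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical destination kinds; not the driver's real types. */
enum dest_kind { DEST_INTERNAL_VPORT, DEST_UPLINK_VPORT };

/*
 * Return 0 if the rule may be offloaded, or a negative errno if it combines
 * a header rewrite with both internal and uplink vport destinations.
 * -EINVAL is an assumption chosen for this sketch.
 */
static int check_rule_dests(bool header_rewrite,
                            const enum dest_kind *dests, size_t num_dests)
{
    bool internal = false, uplink = false;

    assert(dests != NULL || num_dests == 0);

    for (size_t i = 0; i < num_dests; i++) {
        if (dests[i] == DEST_UPLINK_VPORT)
            uplink = true;
        else
            internal = true;
    }

    /* Reject early instead of letting firmware fail the command. */
    if (header_rewrite && internal && uplink)
        return -EINVAL;
    return 0;
}
```

Rejecting the rule before the FTE is created keeps the firmware command from
failing with the syndrome shown above.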

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Check with FW that sync reset completed successfully
Moshe Shemesh [Wed, 31 May 2023 10:50:21 +0000 (13:50 +0300)]
net/mlx5: Check with FW that sync reset completed successfully

Even if the PF driver saw no error on its part of the sync reset flow,
the firmware sees the wider picture, as it syncs all the PFs in the flow.
So, at the end of the sync reset flow, check with the firmware, by reading
the MFRL register and the initialization segment, that the flow had no
issue from the firmware's point of view either.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Expose max possible SFs via devlink resource
Shay Drory [Thu, 13 Jul 2023 11:54:57 +0000 (14:54 +0300)]
net/mlx5: Expose max possible SFs via devlink resource

Introduce devlink resource for exposing max possible SFs on mlx5
devices.

For example:
$ devlink resource show pci/0000:00:0b.0
pci/0000:00:0b.0:
  name max_local_SFs size 5 unit entry dpipe_tables none
  name max_external_SFs size 0 unit entry dpipe_tables none

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQ
Rahul Rameshbabu [Wed, 9 Aug 2023 04:10:21 +0000 (21:10 -0700)]
net/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQ

A new check for the tx devlink health reporter is introduced for
determining when the PTP port timestamping SQ is considered unhealthy. If
there are enough CQEs considered never to be delivered, the space that can
be utilized on the SQ decreases significantly, impacting performance and
usability of the SQ. The health reporter is triggered when the number of
likely never delivered port timestamping CQEs that utilize the space of the
PTP SQ is greater than 93.75% of the total capacity of the SQ. A devlink
health reporter recover method is also provided for this specific TX error
context that restarts the PTP SQ.
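
As a worked illustration of the threshold: 93.75% is 15/16, so the limit can
be computed as the capacity minus one sixteenth of it. The helper names below
are hypothetical, and the sketch assumes the SQ capacity is a power of two of
at least 16 entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: 93.75% == 15/16 of the SQ capacity. */
static inline uint32_t ptp_sq_unhealthy_threshold(uint32_t sq_capacity)
{
    /* Assumption: capacity is a power of two >= 16, so the shift is exact. */
    assert(sq_capacity >= 16);
    return sq_capacity - (sq_capacity >> 4);
}

/* True when strictly more than 93.75% of the SQ is held by lost CQEs. */
static inline bool ptp_sq_is_unhealthy(uint32_t undelivered_cqes,
                                       uint32_t sq_capacity)
{
    return undelivered_cqes > ptp_sq_unhealthy_threshold(sq_capacity);
}
```

For a 1024-entry SQ the threshold is 960, so the reporter would trigger at
961 likely-undelivered CQEs.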

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs
Rahul Rameshbabu [Tue, 2 May 2023 23:31:40 +0000 (16:31 -0700)]
net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs

Use a map structure for associating CQEs containing port timestamping
information with the appropriate skb. Track order of WQEs submitted using a
FIFO. Check if the corresponding port timestamping CQEs from the lookup
values in the FIFO are considered dropped due to time elapsed. Return the
lookup value to a freelist after consuming the skb. Reuse the freed lookup
in future WQE submission iterations.

The map structure uses an integer identifier for the key and returns an skb
corresponding to that identifier. Embed the integer identifier in the WQE
submitted to the WQ for the transmit path when the SQ is a PTP (port
timestamping) SQ. The embedded identifier can then be queried using a field
in the CQE of the corresponding port timestamping CQ. In the port
timestamping napi_poll context, the identifier is queried from the CQE
polled from CQ and used to lookup the corresponding skb from the WQE submit
path. The skb reference is removed from map and then embedded with the port
HW timestamp information from the CQE and eventually consumed.

The metadata freelist FIFO is an array containing integer identifiers that
can be pushed and popped in the FIFO. The purpose of this structure is to
track which identifier values can safely be used in a subsequent WQE
submission; it should not contain identifiers that have not yet been
reaped by processing a corresponding CQE completion on the port
timestamping CQ.
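
A minimal sketch of such a freelist, assuming a power-of-two ring indexed by
free-running producer and consumer counters. The structure and function names
are hypothetical and the size is arbitrary; this is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FREELIST_SIZE 8u /* assumption: power of two */

struct id_freelist {
    uint16_t ids[FREELIST_SIZE];
    uint32_t pc; /* producer counter, incremented on push */
    uint32_t cc; /* consumer counter, incremented on pop */
};

static bool freelist_empty(const struct id_freelist *fl)
{
    return fl->pc == fl->cc;
}

/* Return an identifier once its CQE has been reaped. */
static void freelist_push(struct id_freelist *fl, uint16_t id)
{
    assert(fl->pc - fl->cc < FREELIST_SIZE);
    fl->ids[fl->pc++ & (FREELIST_SIZE - 1)] = id;
}

/* Take an identifier to embed in the next WQE. */
static uint16_t freelist_pop(struct id_freelist *fl)
{
    assert(!freelist_empty(fl));
    return fl->ids[fl->cc++ & (FREELIST_SIZE - 1)];
}

/* Initially every identifier is free. */
static void freelist_init(struct id_freelist *fl)
{
    fl->pc = fl->cc = 0;
    for (uint16_t id = 0; id < FREELIST_SIZE; id++)
        freelist_push(fl, id);
}
```

An identifier popped for a WQE stays out of the ring until the matching CQE
is processed, so it cannot be handed out twice.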

The ts_cqe_pending_list structure is a combination of an array and linked
list. The array is pre-populated with the nodes that will be added and
removed from the head of the linked list. Each node contains the unique
identifier value associated with the values submitted in the WQEs and
retrieved in the port timestamping CQEs. When a WQE is submitted, the node
in the array corresponding to the identifier popped from the metadata
freelist is added to the end of the CQE pending list and is marked as
"in-use". The node is removed from the linked list under two conditions.
The first condition is that the corresponding port timestamping CQE is
polled in the PTP napi_poll context. The second condition is that more than
a second has elapsed since the DMA timestamp value corresponding to the WQE
submission. When the first condition occurs, the "in-use" bit in the linked
list node is cleared, and the resources corresponding to the WQE submission
are then released. The second condition, however, indicates that the port
timestamping CQE will likely never be delivered. It is highly improbable,
though not impossible, for the device to post the CQE after an arbitrarily
long delay. In order to be resilient to this improbable case, resources
related to the corresponding WQE submission are still kept, the identifier
value is not returned to the freelist, and the "in-use" bit is cleared on
the node to indicate that it's no longer part of the linked list of "likely
to be delivered" port timestamping CQE identifiers. A count for the number
of port timestamping CQEs considered highly likely to never be delivered by
the device is maintained. This count gets decremented in the unlikely event
a port timestamping CQE considered unlikely to ever be delivered is polled
in the PTP napi_poll context.
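
The pending-list behaviour above can be sketched as follows. All names are
hypothetical, timestamps are plain nanosecond counters rather than DMA
timestamps, and skb and freelist handling are left out; nodes are appended at
the tail on submit and aged out from the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IDS 4
#define TIMEOUT_NS 1000000000ull /* one second */

struct pending_node {
    struct pending_node *prev, *next;
    uint64_t submit_ns;
    bool in_use; /* set while the node is on the linked list */
};

struct pending_list {
    struct pending_node nodes[NUM_IDS]; /* array index == identifier */
    struct pending_node *head, *tail;
    unsigned int lost; /* CQEs presumed never to be delivered */
};

static void pending_unlink(struct pending_list *pl, struct pending_node *n)
{
    if (n->prev) n->prev->next = n->next; else pl->head = n->next;
    if (n->next) n->next->prev = n->prev; else pl->tail = n->prev;
    n->prev = n->next = NULL;
    n->in_use = false;
}

/* WQE submit path: append the node for this identifier. */
static void pending_add(struct pending_list *pl, unsigned int id, uint64_t now_ns)
{
    assert(id < NUM_IDS && !pl->nodes[id].in_use);
    struct pending_node *n = &pl->nodes[id];

    n->submit_ns = now_ns;
    n->in_use = true;
    n->next = NULL;
    n->prev = pl->tail;
    if (pl->tail) pl->tail->next = n; else pl->head = n;
    pl->tail = n;
}

/* Second condition: drop entries older than the timeout from the list. */
static void pending_age(struct pending_list *pl, uint64_t now_ns)
{
    while (pl->head && now_ns - pl->head->submit_ns > TIMEOUT_NS) {
        pending_unlink(pl, pl->head);
        pl->lost++; /* identifier is NOT returned to the freelist */
    }
}

/* First condition: the port timestamping CQE for this id was polled. */
static void pending_complete(struct pending_list *pl, unsigned int id)
{
    assert(id < NUM_IDS);
    if (pl->nodes[id].in_use)
        pending_unlink(pl, &pl->nodes[id]);
    else if (pl->lost > 0)
        pl->lost--; /* a CQE presumed lost arrived after all */
}
```

Because submissions are appended in order, the oldest entry is always at the
head, so aging can stop at the first entry that is still within the timeout.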

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agonet/mlx5: Consolidate devlink documentation in devlink/mlx5.rst
Rahul Rameshbabu [Mon, 27 Feb 2023 21:57:00 +0000 (13:57 -0800)]
net/mlx5: Consolidate devlink documentation in devlink/mlx5.rst

De-duplicate documentation by removing mellanox/mlx5/devlink.rst. Instead,
only use the generic devlink documentation directory to document mlx5
devlink parameters. Avoid providing general devlink tool usage information
in mlx5-specific documentation.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
11 months agoMerge branch 'devlink-introduce-selective-dumps'
Jakub Kicinski [Mon, 14 Aug 2023 18:47:27 +0000 (11:47 -0700)]
Merge branch 'devlink-introduce-selective-dumps'

Jiri Pirko says:

====================
devlink: introduce selective dumps

Motivation:

For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know the port
handle for a specific SF, they need to dump all devlink ports on the
host, which does not scale well.

Solution:

Allow user to pass devlink handle (and possibly other attributes)
alongside the dump command and dump only objects which are matching
the selection.

Use split ops to generate policies for dump callbacks according to
the attributes used for selection.

The userspace can use ctrl genetlink GET_POLICY command to find out if
the selective dumps are supported by kernel for particular command.

Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false

$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

Extension:

Patches #12 and #13 extend the selection attributes by port index
for health reporter dumping.
====================

Link: https://lore.kernel.org/r/20230811155714.1736405-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonetlink: specs: devlink: extend health reporter dump attributes by port index
Jiri Pirko [Fri, 11 Aug 2023 15:57:14 +0000 (17:57 +0200)]
netlink: specs: devlink: extend health reporter dump attributes by port index

Allow user to pass port index for health reporter dump request.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: extend health reporter dump selector by port index
Jiri Pirko [Fri, 11 Aug 2023 15:57:13 +0000 (17:57 +0200)]
devlink: extend health reporter dump selector by port index

Introduce a possibility for devlink object to expose attributes it
supports for selection of dumped objects.

Use this by health reporter to indicate it supports port index based
selection of dump objects. Implement this selection mechanism in
devlink_nl_cmd_health_reporter_get_dump_one().

Example:
$ devlink health
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98304:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98305:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98306:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0

The last command is possible because of this patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonetlink: specs: devlink: extend per-instance dump commands to accept instance attributes
Jiri Pirko [Fri, 11 Aug 2023 15:57:12 +0000 (17:57 +0200)]
netlink: specs: devlink: extend per-instance dump commands to accept instance attributes

Extend per-instance dump command definitions to accept instance
attributes. Allow parsing of devlink handle attributes so they could
be used for instance selection.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: allow user to narrow per-instance dumps by passing handle attrs
Jiri Pirko [Fri, 11 Aug 2023 15:57:11 +0000 (17:57 +0200)]
devlink: allow user to narrow per-instance dumps by passing handle attrs

For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know the port
handle for a specific SF, they need to dump all devlink ports on the
host, which does not scale well.

Allow user to pass devlink handle attributes alongside the dump command
and dump only objects which are under selected devlink instance.

Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false

$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
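
The selection itself amounts to skipping every object whose devlink handle
does not match the one passed with the dump request. A userspace sketch of
that idea, with hypothetical types and function names (this is not the
kernel's netlink code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a devlink handle and a port. */
struct dl_handle { const char *bus; const char *dev; };
struct dl_port { struct dl_handle handle; unsigned int index; };

/* A NULL selector means "no filter", i.e. the old dump-everything case. */
static int handle_matches(const struct dl_handle *sel, const struct dl_handle *h)
{
    return !sel || (strcmp(sel->bus, h->bus) == 0 &&
                    strcmp(sel->dev, h->dev) == 0);
}

/* Copy the indexes of the selected ports into out[]; return how many. */
static size_t dump_ports(const struct dl_port *ports, size_t num_ports,
                         const struct dl_handle *sel,
                         unsigned int *out, size_t out_cap)
{
    size_t n = 0;

    assert(ports != NULL && out != NULL);

    for (size_t i = 0; i < num_ports && n < out_cap; i++) {
        if (!handle_matches(sel, &ports[i].handle))
            continue;
        out[n++] = ports[i].index;
    }
    return n;
}
```

With the two ports from the example, selecting auxiliary/mlx5_core.eth.1
yields only port index 131071.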

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: remove converted commands from small ops
Jiri Pirko [Fri, 11 Aug 2023 15:57:10 +0000 (17:57 +0200)]
devlink: remove converted commands from small ops

As the commands are already defined in split ops, remove them
from small ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: remove duplicate temporary netlink callback prototypes
Jiri Pirko [Fri, 11 Aug 2023 15:57:09 +0000 (17:57 +0200)]
devlink: remove duplicate temporary netlink callback prototypes

Remove the duplicate temporary netlink callback prototype as the
generated ones are already in place.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonetlink: specs: devlink: add commands that do per-instance dump
Jiri Pirko [Fri, 11 Aug 2023 15:57:08 +0000 (17:57 +0200)]
netlink: specs: devlink: add commands that do per-instance dump

Add the definitions for the commands that do per-instance dump
and re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: pass flags as an arg of dump_one() callback
Jiri Pirko [Fri, 11 Aug 2023 15:57:07 +0000 (17:57 +0200)]
devlink: pass flags as an arg of dump_one() callback

In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
flags as an arg of dump_one() callback. Currently, it is always
NLM_F_MULTI.
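
A small sketch of what passing the flags enables. The two flag values mirror
the netlink UAPI definitions; dump_flags() and dump_one() are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirror the netlink UAPI values when <linux/netlink.h> is not included. */
#ifndef NLM_F_MULTI
#define NLM_F_MULTI 0x02
#endif
#ifndef NLM_F_DUMP_FILTERED
#define NLM_F_DUMP_FILTERED 0x20
#endif

/* Hypothetical helper: choose the message flags for one dump request. */
static int dump_flags(bool filtered)
{
    return NLM_F_MULTI | (filtered ? NLM_F_DUMP_FILTERED : 0);
}

/*
 * Hypothetical dump_one()-style callback: it receives the flags as an
 * argument instead of hard-coding NLM_F_MULTI, and returns what it would
 * place in the netlink message header.
 */
static int dump_one(int flags)
{
    assert(flags & NLM_F_MULTI); /* dump replies are always multipart */
    return flags;
}
```

The caller decides once whether the dump is filtered, and every dump_one()
invocation for that request inherits the same flags.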

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: introduce dumpit callbacks for split ops
Jiri Pirko [Fri, 11 Aug 2023 15:57:06 +0000 (17:57 +0200)]
devlink: introduce dumpit callbacks for split ops

Introduce dumpit callbacks for generated split ops. Make them thin
wrappers around the iteration function and allow passing the dump_one()
function pointer directly, without needing to store it in devlink_cmd structs.

Note that the function prototypes are temporary until the generated ones
replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: rename doit callbacks for per-instance dump commands
Jiri Pirko [Fri, 11 Aug 2023 15:57:05 +0000 (17:57 +0200)]
devlink: rename doit callbacks for per-instance dump commands

Rename the netlink doit callback functions for the commands that
implement per-instance dump to match the generated names that are going
to be introduced in the follow-up patch.

Note that the function prototypes are temporary until the generated ones
replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: introduce devlink_nl_pre_doit_port*() helper functions
Jiri Pirko [Fri, 11 Aug 2023 15:57:04 +0000 (17:57 +0200)]
devlink: introduce devlink_nl_pre_doit_port*() helper functions

Define port handling helpers that don't rely on internal_flags.
Have __devlink_nl_pre_doit() accept the flags as a function arg and
make devlink_nl_pre_doit() a wrapper helper function calling it.
Introduce new helpers devlink_nl_pre_doit_port() and
devlink_nl_pre_doit_port_optional() to be used by split ops in a follow-up
patch.

Note that the function prototypes are temporary until the generated ones
replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: parse rate attrs in doit() callbacks
Jiri Pirko [Fri, 11 Aug 2023 15:57:03 +0000 (17:57 +0200)]
devlink: parse rate attrs in doit() callbacks

No need to give the rate any special treatment in netlink attributes
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
by moving the rate attributes parsing to rate_*_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agodevlink: parse linecard attr in doit() callbacks
Jiri Pirko [Fri, 11 Aug 2023 15:57:02 +0000 (17:57 +0200)]
devlink: parse linecard attr in doit() callbacks

No need to give the linecards any special treatment in netlink attribute
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 months agonet: phy: Introduce PSGMII PHY interface mode
Gabor Juhos [Fri, 11 Aug 2023 11:10:07 +0000 (13:10 +0200)]
net: phy: Introduce PSGMII PHY interface mode

The PSGMII interface is similar to QSGMII. The main difference
is that the PSGMII interface combines five SGMII lines into a
single link while in QSGMII only four lines are combined.

Similarly to QSGMII, this interface mode might also need
special handling within the MAC driver.

It is commonly used by Qualcomm with their QCA807x PHY series and
modern WiSoC-s.

Add definitions for the PHY layer to allow expressing this type
of connection between the MAC and PHY.

Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months agodt-bindings: net: ethernet-controller: add PSGMII mode
Robert Marko [Fri, 11 Aug 2023 11:10:06 +0000 (13:10 +0200)]
dt-bindings: net: ethernet-controller: add PSGMII mode

Add a new PSGMII mode which is similar to QSGMII with the difference being
that it combines 5 SGMII lines into a single link compared to 4 on QSGMII.

It is commonly used by Qualcomm on their QCA807x PHY series.

Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 months agoMerge branch 'mlxsw-redirection'
David S. Miller [Mon, 14 Aug 2023 07:11:14 +0000 (08:11 +0100)]
Merge branch 'mlxsw-redirection'

Petr Machata says:

====================
mlxsw: Support traffic redirection from a locked bridge port

Ido Schimmel writes:

It is possible to add a filter that redirects traffic from the ingress
of a bridge port that is locked (i.e., performs security / SMAC lookup)
and has learning enabled. For example:

 # ip link add name br0 type bridge
 # ip link set dev swp1 master br0
 # bridge link set dev swp1 learning on locked on mab on
 # tc qdisc add dev swp1 clsact
 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2

In the kernel's Rx path, this filter is evaluated before the Rx handler
of the bridge, which means that redirected traffic should not be
affected by bridge port configuration such as learning.

However, the hardware data path is a bit different and the redirect
action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
packet, which is later used by the L2 lookup stage to understand how to
forward the packet. Between both stages - ingress ACL and L2 lookup -
learning and security lookup are performed, which means that redirected
traffic is affected by bridge port configuration, unlike in the kernel's
data path.

The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") by simply
ignoring learning notifications generated by the redirected traffic. A
similar solution is not possible for the security / SMAC lookup since
- unlike learning - the CPU is not involved and packets that failed the
lookup are dropped by the device.

Instead, solve this by prepending the ignore action to the redirect
action and use it to instruct the device to disable both learning and
the security / SMAC lookup for redirected traffic.

Patch #1 adds the ignore action.

Patch #2 prepends the action to the redirect action in flower offload
code.

Patch #3 removes the workaround in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") since it is
no longer needed.

Patch #4 adds a test case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>