review.tizen.org Git - platform/kernel/linux-starfive.git/log

netfilter: nft_meta: add inner match support

Add support for inner meta matching on:

- NFT_META_PROTOCOL: to match on the ethertype, this can be used
regardless tunnel protocol provides no link layer header, in that case
nft_inner sets on the ethertype based on the IP header version field.
- NFT_META_L4PROTO: to match on the layer 4 protocol.

These meta expression are usually autogenerated as dependencies by
userspace nftables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_inner: add percpu inner context

Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are
available. Store nft_inner_tun_ctx object in percpu area to cache
existing inner offsets for this skbuff.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_inner: support for inner tunnel header matching

This new expression allows you to match on the inner headers that are
encapsulated by any of the existing tunneling protocols.

This expression parses the inner packet to set the link, network and
transport offsets, so the existing expressions (with a few updates) can
be reused to match on the inner headers.

The inner expression supports for different tunnel combinations such as:

- ethernet frame over IPv4/IPv6 packet, eg. VxLAN.
- IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP.
- IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE.
- transport header (ESP or SCTP) over transport header (usually UDP)

The following fields are used to describe the tunnel protocol:

- flags, which describe how to parse the inner headers:

  NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header.
  NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header.
  NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header.
  NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header.

For example, VxLAN sets on all of these flags. While GRE only sets on
NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH. Then, ESP over
UDP only sets on NFT_PAYLOAD_CTX_INNER_TH.

The tunnel description is composed of the following attributes:

- header size: in case the tunnel comes with its own header, eg. VxLAN.

- type: this provides a hint to userspace on how to delinearize the rule.
  This is useful for VxLAN and Geneve since they run over UDP, since
  transport does not provide a hint. This is also useful in case hardware
  offload is ever supported. The type is not currently interpreted by the
  kernel.

- expression: currently only payload supported. Follow up patch adds
  also inner meta support which is required by autogenerated
  dependencies. The exthdr expression should be supported too
  at some point. There is a new inner_ops operation that needs to be
  set on to allow to use an existing expression from the inner expression.

This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows to match
on the tunnel header fields, eg. vxlan vni.

The payload expression is embedded into nft_inner private area and this
private data area is passed to the payload inner eval function via
direct call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: access ipip payload for inner offset

ipip is an special case, transport and inner header offset are set to
the same offset to use the upcoming inner expression for matching on
inner tunnel headers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: access GRE payload via inner offset

Parse GRE v0 packets to properly set up inner offset, this allow for
matching on inner headers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_objref: make it builtin

nft_objref is needed to reference named objects, it makes
no sense to disable it.

Before:
   text    data     bss     dec filename
  4014     424       0    4438 nft_objref.o
  4174    1128       0    5302 nft_objref.ko
359351   15276     864 375491 nf_tables.ko
After:
  text    data     bss     dec filename
  3815     408       0    4223 nft_objref.o
363161   15692     864 379717 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: reduce nft_pktinfo by 8 bytes

structure is reduced from 32 to 24 bytes. While at it, also check
that iphdrlen is sane, this is guaranteed for NFPROTO_IPV4 but not
for ingress or bridge, so add checks for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: move struct nft_payload_set definition where it belongs

Not required to expose this header in nf_tables_core.h, move it to where
it is used, ie. nft_payload.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

bnx2: Use kmalloc_size_roundup() to match ksize() usage

Round up allocations with kmalloc_size_roundup() so that build_skb()'s
use of ksize() is always accurate and no special handling of the memory
is needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE.

Cc: Rasesh Mody <rmody@marvell.com>
Cc: GR-Linux-NIC-Dev@marvell.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221022021004.gonna.489-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-socket-option-updates'

Mat Martineau says:

====================
mptcp: Socket option updates

Patches 1 and 3 refactor a recent socket option helper function for more
generic use, and make use of it in a couple of places.

Patch 2 adds TCP_FASTOPEN_NO_COOKIE functionality to MPTCP sockets,
similar to TCP_FASTOPEN_CONNECT support recently added in v6.1
====================

Link: https://lore.kernel.org/r/20221022004505.160988-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: sockopt: use new helper for TCP_DEFER_ACCEPT

mptcp_setsockopt_sol_tcp_defer() was doing the same thing as
mptcp_setsockopt_first_sf_only() except for the returned code in case of
error.

Ignoring the error is needed to mimic how TCP_DEFER_ACCEPT is handled
when used with "plain" TCP sockets.

The specific function for TCP_DEFER_ACCEPT can be replaced by the new
mptcp_setsockopt_first_sf_only() helper and errors can be ignored to
stay compatible with TCP. A bit of cleanup.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: add TCP_FASTOPEN_NO_COOKIE support

The goal of this socket option is to configure MPTCP + TFO without
cookie per socket.

It was already possible to enable TFO without a cookie per netns by
setting net.ipv4.tcp_fastopen sysctl knob to the right value. Per route
was also supported by setting 'fastopen_no_cookie' option. This patch
adds a per socket support like it is possible to do with TCP thanks to
TCP_FASTOPEN_NO_COOKIE socket option.

The only thing to do here is to relay the request to the first subflow
like it is already done for TCP_FASTOPEN_CONNECT.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: sockopt: make 'tcp_fastopen_connect' generic

There are other socket options that need to act only on the first
subflow, e.g. all TCP_FASTOPEN* socket options.

This is similar to the getsockopt version.

In the next commit, this new mptcp_setsockopt_first_sf_only() helper is
used by other another option.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'soreuseport-fix-broken-so_incoming_cpu'

Kuniyuki Iwashima says:

====================
soreuseport: Fix broken SO_INCOMING_CPU.

setsockopt(SO_INCOMING_CPU) for UDP/TCP is broken since 4.5/4.6 due to
these commits:

  * e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
  * c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")

These commits introduced the O(1) socket selection algorithm and removed
O(n) iteration over the list, but it ignores the score calculated by
compute_score().  As a result, it caused two misbehaviours:

  * Unconnected sockets receive packets sent to connected sockets
  * SO_INCOMING_CPU does not work

The former is fixed by commit acdcecc61285 ("udp: correct reuseport
selection with connected sockets").  This series fixes the latter and
adds some tests for SO_INCOMING_CPU.
====================

Link: https://lore.kernel.org/r/20221021204435.4259-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftest: Add test for SO_INCOMING_CPU.

Some highly optimised applications use SO_INCOMING_CPU to make them
efficient, but they didn't test if it's working correctly by getsockopt()
to avoid slowing down.  As a result, no one noticed it had been broken
for years, so it's a good time to add a test to catch future regression.

The test does

  1) Create $(nproc) TCP listeners associated with each CPU.

  2) Create 32 child sockets for each listener by calling
     sched_setaffinity() for each CPU.

  3) Check if accept()ed sockets' sk_incoming_cpu matches
     listener's one.

If we see -EAGAIN, SO_INCOMING_CPU is broken.  However, we might not see
any error even if broken; the kernel could miraculously distribute all SYN
to correct listeners.  Not to let that happen, we must increase the number
of clients and CPUs to some extent, so the test requires $(nproc) >= 2 and
creates 64 sockets at least.

Test:
  $ nproc
  96
  $ ./so_incoming_cpu

Before the previous patch:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:191:test1:Expected cpu (5) == i (0)
  # test1: Test terminated by assertion
  #          FAIL  so_incoming_cpu.before_reuseport.test1
  not ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # FAILED: 0 / 12 tests passed.
  # Totals: pass:0 fail:12 xfail:0 xpass:0 skip:0 error:0

After:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:199:test1:SO_INCOMING_CPU is very likely to be working correctly with 3072 sockets.
  #            OK  so_incoming_cpu.before_reuseport.test1
  ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # PASSED: 12 / 12 tests passed.
  # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

soreuseport: Fix socket selection for SO_INCOMING_CPU.

Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow.  However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incomnig_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-ipa-validation-cleanup'

Alex Elder says:

====================
net: ipa: validation cleanup

This series gathers a set of IPA driver cleanups, mostly involving
code that ensures certain things are known to be correct *early*
(either at build or initializatin time), so they can be assumed good
during normal operation.

The first removes three constant symbols, by making a (reasonable)
assumption that a routing table consists of entries for the modem
followed by entries for the AP, with no unused entries between them.

The second removes two checks that are redundant (they verify the
sizes of two memory regions are in range, which will have been done
earlier for all regions).

The third adds some new checks to routing and filter tables that
can be done at "init time" (without requiring any access to IPA
hardware).

The fourth moves a check that routing and filter table addresses can
be encoded within certain IPA immediate commands, so it's performed
earlier; the checks can be done without touching IPA hardware. The
fifth moves some other command-related checks earlier, for the same
reason.

The sixth removes the definition ipa_table_valid(), because what it
does has become redundant. Finally, the last patch moves two more
validation calls so they're done very early in the probe process.
This will be required by some upcoming patches, which will record
the size of the routing and filter tables at this time so they're
available for subsequent initialization.
====================

Link: https://lore.kernel.org/r/20221021191340.4187935-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: check table memory regions earlier

Verify that the sizes of the routing and filter table memory regions
are valid as part of memory initialization, rather than waiting for
table initialization. The main reason to do this is that upcoming
patches use these memory region sizes to determine the number of
entries in these tables, and we'll want to know these sizes are good
sooner.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: kill ipa_table_valid()

What ipa_table_valid() (and ipa_table_valid_one(), which it calls)
does is ensure that the memory regions that hold routing and filter
tables have reasonable size.  Specifically, it checks that the size
of a region is sufficient (or rather, exactly the right size) to
hold the maximum number of entries supported by the driver.  (There
is an additional check that's erroneous, but in practice it is never
reached.)

Recently ipa_table_mem_valid() was added, which is called by
ipa_table_init().  That function verifies that all table memory
regions are of sufficient size, and requires hashed tables to have
zero size if hashing is not supported.  It only ensures the filter
table is large enough to hold the number of endpoints that support
filtering, but that is adequate.

Therefore everything that ipa_table_valid() does is redundant, so
get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: introduce ipa_cmd_init()

Currently, ipa_cmd_data_valid() is called by ipa_mem_config().
Nothing it does requires access to hardware though, so it can be
done during the init phase of IPA driver startup.

Create a new function ipa_cmd_init(), whose purpose is to do early
initialization related to IPA immediate commands. It will call the
build-time validation function, then will make the two calls made
previously by ipa_cmd_data_valid(). This make ipa_cmd_data_valid()
unnecessary, so get rid of it.

Rename ipa_cmd_header_valid() to be ipa_cmd_header_init_local_valid(),
so its name is clearer about which IPA immediate command it is
associated with.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: verify table sizes fit in commands early

We currently verify the table size and offset fit in the immediate
command fields that must encode them in ipa_table_valid_one(). We
can now make this check earlier, in ipa_table_mem_valid().

The non-hashed IPv4 filter and route tables will always exist, and
their sizes will match the IPv6 tables, as well as the hashed tables
(if supported). So it's sufficient to verify the offset and size of
the IPv4 non-hashed tables fit into these fields.

Rename the function ipa_cmd_table_init_valid(), to reinforce that
it is the TABLE_INIT immediate command fields we're checking.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: validate IPA table memory earlier

Add checks in ipa_table_init() to ensure the memory regions defined
for IPA filter and routing tables are valid.

For routing tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 routing tables are defined
  - The non-hashed IPv4 and IPv6 routing tables are the same size
  - The number entries in the non-hashed IPv4 routing table is enough
    to hold the number entries available to the modem, plus at least
    one usable by the AP.

For filter tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 filter tables are defined
  - The non-hashed IPv4 and IPv6 filter tables are the same size
  - The number entries in the non-hashed IPv4 filter table is enough
    to hold the endpoint bitmap, plus an entry for each defined
    endpoint that supports filtering.

In addition, for both routing and filter tables:
  - If hashing isn't supported (IPA v4.2), hashed tables are zero size
  - If hashing *is* supported, all hashed tables are the same size as
    their non-hashed counterparts.

When validating the size of routing tables, require the AP to have
at least one entry (in addition to those used by the modem).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: remove two memory region checks

There's no need to ensure table memory regions fit within the
IPA-local memory range. And there's no need to ensure the modem
header memory region is in range either. These are verified for all
memory regions in ipa_mem_size_valid(), once we have settled on the
size of IPA memory.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: kill two constant symbols

The entries in each IPA routing table are divided between the modem
and the AP.  The modem always gets some number of entries located at
the base of the table; the AP gets all those that follow.

There's no reason to think the modem will use anything different
from the first entries in a routing table, so:
  - Get rid of IPA_ROUTE_MODEM_MIN (just assume it's 0)
  - Get rid of IPA_ROUTE_AP_MIN (just assume it's IPA_ROUTE_MODEM_COUNT)
And finally:
  - Open-code IPA_ROUTE_AP_COUNT and remove its definition

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'extend-action-skbedit-to-rx-queue-mapping'

Amritha Nambiar says:

====================
Extend action skbedit to RX queue mapping

Based on the discussion on
https://lore.kernel.org/netdev/166260012413.81018.8010396115034847972.stgit@anambiarhost.jf.intel.com/ ,
the following series extends skbedit tc action to RX queue mapping.
Currently, skbedit action in tc allows overriding of transmit queue.
Extending this ability of skedit action supports the selection of
receive queue for incoming packets. On the receive side, this action
is supported only in hardware, so the skip_sw flag is enforced.

Enabled ice driver to offload this type of filter into the hardware
for accepting packets to the device's receive queue.
====================

Link: https://lore.kernel.org/r/166633888716.52141.3425659377117969638.stgit@anambiarhost.jf.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Documentation: networking: TC queue based filtering

Add tc-queue-filters.rst with notes on TC filters for
selecting a set of queues and/or a queue.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ice: Enable RX queue selection using skbedit action

This patch uses TC skbedit queue_mapping action to support
forwarding packets to a device queue. Such filters with action
forward to queue will be the highest priority switch filter in
HW.
Example:
$ tc filter add dev ens4f0 protocol ip ingress flower\
dst_ip 192.168.1.12 ip_proto tcp dst_port 5001\
action skbedit queue_mapping 5 skip_sw

The above command adds an ingress filter, incoming packets
qualifying the match will be accepted into queue 5. The queue
number is in decimal format.

Refactored ice_add_tc_flower_adv_fltr() to consolidate code with
action FWD_TO_VSI and FWD_TO QUEUE.

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

act_skbedit: skbedit queue mapping for receive queue

Add support for skbedit queue mapping action on receive
side. This is supported only in hardware, so the skip_sw
flag is enforced. This enables offloading filters for
receive queue selection in the hardware using the
skbedit action. Traffic arrives on the Rx queue requested
in the skbedit action parameter. A new tc action flag
TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
traffic direction the action queue_mapping is requested
on during filter addition. This is used to disallow
offloading the skbedit queue mapping action on transmit
side.

Example:
$tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
action skbedit queue_mapping $rxq_id skip_sw

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sfp-improve-high-power-module-implementation'

Russell King says:

====================
net: sfp: improve high power module implementation

This series aims to improve the power level switching between standard
level 1 and the higher power levels.

The first patch updates the DT binding documentation to include the
minimum and default of 1W, which is the base level that every SFP cage
must support. Hence, it makes sense to document this in the binding.

The second patch enforces a minimum of 1W when parsing the firmware
description, and optimises the code for that case; there's no need to
check for SFF8472 compliance since we will not need to touch the
A2h registers.

Patch 3 validates that the module supports SFF-8472 rev 10.2 before
checking for power level 2 - rev 10.2 is where support for power
levels was introduced, so if the module doesn't support this revision,
it doesn't support power levels. Setting the power level 2 declaration
bit is likely to be spurious.

Patch 4 does the same for power level 3, except this was introduced in
SFF-8472 rev 11.9. The revision code was never updated, so we use the
rev 11.4 to signify this.

Patch 5 cleans up the code - rather than using BIT(0), we now use a
properly named value for the power level select bit.

Patch 6 introduces a read-modify-write helper.

Patch 7 gets rid of the DM7052 hack (which sets a power level
declaration bit but is not compatible with SFF-8472 rev 10.2, and
the module does not implement the A2h I2C address.)

Series tested with my DM7052.

v2: update sff.sfp.yaml with Rob's feedback
====================

Andrew's review tags from v1.

Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: get rid of DM7052 hack when enabling high power

Since we no longer mis-detect high-power mode with the DM7052 module,
we no longer need the hack in sfp_module_enable_high_power(), and can
now switch this to use sfp_modify_u8().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: add sfp_modify_u8() helper

Add a helper to modify bits in a single byte in memory space, and use
it when updating the soft tx-disable flag in the module.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: provide a definition for the power level select bit

Provide a named definition for the power level select bit in the
extended status register, rather than using BIT(0) in the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: ignore power level 3 prior to SFF-8472 Rev 11.4

Power level 3 was included in SFF-8472 revision 11.9, but this does
not have a compliance code. Use revision 11.4 as the minimum
compliance level instead.

This should avoid any spurious indication of 2W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: ignore power level 2 prior to SFF-8472 Rev 10.2

Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
the power declaration bit for modules that are not compliant with
at least this revision.

This should remove any spurious indication of 1.5W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: check firmware provided max power

Check that the firmware provided maximum power is at least 1W, which
is the minimum power level for any SFP module.

Now that we enforce the minimum of 1W, we can exit early from
sfp_module_parse_power() if the module power is 1W or less.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: sff,sfp: update binding

Add a minimum and default for the maximum-power-milliwatt option;
module power levels were originally up to 1W, so this is the default
and the minimum power level we can have for a functional SFP cage.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-driver-updates'

Michael Chan says:

====================
bnxt_en: Driver updates

This patchset adds .get_module_eeprom_by_page() support and adds
an NVRAM resize step to allow larger firmware images to be flashed
to older firmware.
====================

Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: check and resize NVRAM UPDATE entry before flashing

Resize of the UPDATE entry is required if the image to
be flashed is larger than the available space. Add this step,
otherwise flashing larger firmware images by ethtool or devlink
may fail.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: add .get_module_eeprom_by_page() support

Add support for .get_module_eeprom_by_page() callback which
implements generic solution for module`s eeprom access.

v3: Add bnxt_get_module_status() to get a more specific extack error
    string.
    Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
    don't want to fallback to old method.
v2: Simplification suggested by Ido Schimmel

Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Update firmware interface to 1.10.2.118

The main changes are PTM timestamp support, CMIS EEPROM support, and
asymmetric CoS queues support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

include/linux/net.h
a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag")
e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.1-rc3-1' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.

  The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
  I'm biased.

  Current release - regressions:

   - eth: fman: re-expose location of the MAC address to userspace,
     apparently some udev scripts depended on the exact value

  Current release - new code bugs:

   - bpf:
       - wait for busy refill_work when destroying bpf memory allocator
       - allow bpf_user_ringbuf_drain() callbacks to return 1
       - fix dispatcher patchable function entry to 5 bytes nop

  Previous releases - regressions:

   - net-memcg: avoid stalls when under memory pressure

   - tcp: fix indefinite deferral of RTO with SACK reneging

   - tipc: fix a null-ptr-deref in tipc_topsrv_accept

   - eth: macb: specify PHY PM management done by MAC

   - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

  Previous releases - always broken:

   - eth: amd-xgbe: SFP fixes and compatibility improvements

  Misc:

   - docs: netdev: offer performance feedback to contributors"

* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
  net-memcg: avoid stalls when under memory pressure
  tcp: fix indefinite deferral of RTO with SACK reneging
  tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
  net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
  net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
  docs: netdev: offer performance feedback to contributors
  kcm: annotate data-races around kcm->rx_wait
  kcm: annotate data-races around kcm->rx_psock
  net: fman: Use physical address for userspace interfaces
  net/mlx5e: Cleanup MACsec uninitialization routine
  atlantic: fix deadlock at aq_nic_stop
  nfp: only clean `sp_indiff` when application firmware is unloaded
  amd-xgbe: add the bit rate quirk for Molex cables
  amd-xgbe: fix the SFP compliance codes check for DAC cables
  amd-xgbe: enable PLL_CTL for fixed PHY modes only
  amd-xgbe: use enums for mailbox cmd and sub_cmds
  amd-xgbe: Yellow carp devices do not need rrc
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  MAINTAINERS: add keyword match on PTP
  ...

Merge tag 'rcu-urgent.2022.10.20a' of git://git./linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
"Fix a regression caused by commit bf95b2bc3e42 ("rcu: Switch polled
  grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
  interrupts enabled after an early-boot call to synchronize_rcu().

  Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
  to properly interact with polled grace periods, but the code did not
  take into account the possibility of synchronize_rcu() being invoked
  from the portion of the boot sequence during which interrupts are
  disabled.

  This commit therefore switches the lock acquisition and release from
  irq to irqsave/irqrestore"

* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Keep synchronize_rcu() from enabling irqs in early boot

Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull KUnit fixes from Shuah Khan:
"One single fix to update alloc_string_stream() callers to check for
  IS_ERR() instead of NULL to be in sync with alloc_string_stream()
  returning an ERR_PTR()"

* tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: update NULL vs IS_ERR() tests

Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:

- futex, intel_pstate, kexec build fixes

- ftrace dynamic_events dependency check fix

- memory-hotplug fix to remove redundant warning from test report

* tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/ftrace: fix dynamic_events dependency check
  selftests/memory-hotplug: Remove the redundant warning information
  selftests/kexec: fix build for ARCH=x86_64
  selftests/intel_pstate: fix build for ARCH=x86_64
  selftests/futex: fix build for clang

Merge tag 'pinctrl-v6.1-3' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Fix typos in UART1 and MMC in the Ingenic driver

- A really well researched glitch bug fix to the Qualcomm driver that
   was tracked down and fixed by Dough Anderson from Chromium. Hats off
   for this one!

- Revert two patches on the Xilinx ZynqMP driver: this needs a proper
   solution making use of firmware version information to adapt to
   different firmware releases

- Fix interrupt triggers in the Ocelot driver

* tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: ocelot: Fix incorrect trigger of the interrupt.
  Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
  Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
  pinctrl: qcom: Avoid glitching lines when we first mux to output
  pinctrl: Ingenic: JZ4755 bug fixes

net-memcg: avoid stalls when under memory pressure

As Shakeel explains the commit under Fixes had the unintended
side-effect of no longer pre-loading the cached memory allowance.
Even tho we previously dropped the first packet received when
over memory limit - the consecutive ones would get thru by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per one drop.

After the change we no longer force charge, there will be no
cache filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.

Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worst latency).

Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.

Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9fe5 ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: fix indefinite deferral of RTO with SACK reneging

This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.

The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.

This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.

Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.

Fixes: 5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ptp-ocxp-Oroli-ART-CARD'

Vadim Fedorenko says:

====================
ptp: ocp: add support for Orolia ART-CARD

Orolia company created alternative open source TimeCard. The hardware of
the card provides similar to OCP's card functions, that's why the support
is added to current driver.

The first patch in the series changes the way to store information about
serial ports and is more like preparation.

The patches 2 to 4 introduces actual hardware support.

The last patch removes fallback from devlink flashing interface to protect
against flashing wrong image. This became actual now as we have 2 different
boards supported and wrong image can ruin hardware easily.

v2:
  Address comments from Jonathan Lemon

v3:
  Fix issue reported by kernel test robot <lkp@intel.com>

v4:
  Fix clang build issue

v5:
  Fix warnings and per-patch build errors

v6:
  Fix more style issues
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: remove flash image header check fallback

Previously there was a fallback mode to flash firmware image without
proper header. But now we have different supported vendors and flashing
wrong image could destroy the hardware. Remove fallback mode and force
header check. Both vendors have published firmware images with headers.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: expose config and temperature for ART card

Orolia card has disciplining configuration and temperature table
stored in EEPROM. This patch exposes them as binary attributes to
have read and write access.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: add serial port of mRO50 MAC on ART card

ART card provides interface to access to serial port of miniature atomic
clock found on the card. Add support for this device and configure it
during init phase.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: add Orolia timecard support

This brings in the Orolia timecard support from the GitHub repository.
The card uses different drivers to provide access to i2c EEPROM and
firmware SPI flash. And it also has a bit different EEPROM map, but
other parts of the code are the same and could be reused.

Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: upgrade serial line information

Introduce structure to hold serial port line number and the baud rate
it supports.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
in tcp_add_backlog(), the variable limit is caculated by adding
sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value
of int and overflow. This patch reduces the limit budget by
halving the sndbuf to solve this issue since ACK packets are much
smaller than the payload.

Fixes: c9c3321257e1 ("tcp: add tcp_add_backlog()")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: skb: move skb_pp_recycle() to skbuff.c

skb_pp_recycle() is only used by skb_free_head() in
skbuff.c, so move it to skbuff.c.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY

The ndo_start_xmit() method must not free skb when returning
NETDEV_TX_BUSY, since caller is going to requeue freed skb.

Fixes: 504d4721ee8e ("MIPS: Lantiq: Add ethernet driver")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmveth: Always stop tx queues during close

netif_stop_all_queues must be called before calling H_FREE_LOGICAL_LAN.
As a result, we can remove the pool_config field from the ibmveth
adapter structure.

Some device configuration changes call ibmveth_close in order to free
the current resources held by the device. These functions then make
their changes and call ibmveth_open to reallocate and reserve resources
for the device.

Prior to this commit, the flag pool_config was used to tell ibmveth_close
that it should not halt the transmit queue. pool_config was introduced in
commit 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically")
to avoid interrupting the tx flow when making rx config changes. Since
then, other commits adopted this approach, even if making tx config
changes.

The issue with this approach was that the hypervisor freed all of
the devices control structures after the hcall H_FREE_LOGICAL_LAN
was performed but the transmit queues were never stopped. So the higher
layers in the network stack would continue transmission but any
H_SEND_LOGICAL_LAN hcall would fail with H_PARAMETER until the
hypervisor's structures for the device were allocated with the
H_REGISTER_LOGICAL_LAN hcall in ibmveth_open. This resulted in
no real networking harm but did cause several of these error
messages to be logged: "h_send_logical_lan failed with rc=-4"

So, instead of trying to keep the transmit queues alive during network
configuration changes, just stop the queues, make necessary changes then
restart the queues.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove useless parameter of __sock_cmsg_send

The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it
safely.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: Add support for periodic output signal of PPS

This patch adds the support for configuring periodic output
signal of PPS. So the PPS can be output at a specified time
and period.
For developers or testers, they can use the command "echo
<channel> <start.sec> <start.nsec> <period.sec> <period.
nsec> > /sys/class/ptp/ptp0/period" to specify time and
period to output PPS signal.
Notice that, the channel can only be set to 0. In addtion,
the start time must larger than the current PTP clock time.
So users can use the command "phc_ctl /dev/ptp0 -- get" to
get the current PTP clock time before.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed

When the ops_init() interface is invoked to initialize the net, but
ops->init() fails, data is released. However, the ptr pointer in
net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked
to release the net, invalid address access occurs.

The process is as follows:
setup_net()
ops_init()
data = kzalloc(...)   ---> alloc "data"
net_assign_generic()  ---> assign "date" to ptr in net->gen
...
ops->init()           ---> failed
...
kfree(data);          ---> ptr in net->gen is invalid
...
ops_exit_list()
...
nfqnl_nf_hook_drop()
*q = nfnl_queue_pernet(net) ---> q is invalid

The following is the Call Trace information:
BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
Read of size 8 at addr ffff88810396b240 by task ip/15855
Call Trace:
<TASK>
dump_stack_lvl+0x8e/0xd1
print_report+0x155/0x454
kasan_report+0xba/0x1f0
nfqnl_nf_hook_drop+0x264/0x280
nf_queue_nf_hook_drop+0x8b/0x1b0
__nf_unregister_net_hook+0x1ae/0x5a0
nf_unregister_net_hooks+0xde/0x130
ops_exit_list+0xb0/0x170
setup_net+0x7ac/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>

Allocated by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0xa1/0xb0
__kmalloc+0x49/0xb0
ops_init+0xe7/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x155/0x1b0
slab_free_freelist_hook+0x11b/0x220
__kmem_cache_free+0xa4/0x360
ops_init+0xb9/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: f875bae06533 ("net: Automatically allocate per namespace data.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add a refcount tracker for kernel sockets

Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.

We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.

This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.

Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().

This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: netdev: offer performance feedback to contributors

Some of us gotten used to producing large quantities of peer feedback
at work, every 3 or 6 months. Extending the same courtesy to community
members seems like a logical step. It may be hard for some folks to
get validation of how important their work is internally, especially
at smaller companies which don't employ many kernel experts.

The concept of "peer feedback" may be a hyperscaler / silicon valley
thing so YMMV. Hopefully we can build more context as we go.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'kcm-data-races'

Eric Dumazet says:

====================
kcm: annotate data-races

This series address two different syzbot reports for KCM.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

kcm: annotate data-races around kcm->rx_wait

kcm->rx_psock can be read locklessly in kcm_rfree().
Annotate the read and writes accordingly.

syzbot reported:

BUG: KCSAN: data-race in kcm_rcv_strparser / kcm_rfree

write to 0xffff88810784e3d0 of 1 bytes by task 1823 on cpu 1:
reserve_rx_kcm net/kcm/kcmsock.c:283 [inline]
kcm_rcv_strparser+0x250/0x3a0 net/kcm/kcmsock.c:363
__strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
strp_recv+0x6d/0x80 net/strparser/strparser.c:335
tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
strp_read_sock net/strparser/strparser.c:358 [inline]
do_strp_work net/strparser/strparser.c:406 [inline]
strp_work+0xe8/0x180 net/strparser/strparser.c:415
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

read to 0xffff88810784e3d0 of 1 bytes by task 17869 on cpu 0:
kcm_rfree+0x121/0x220 net/kcm/kcmsock.c:181
skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
skb_release_all net/core/skbuff.c:852 [inline]
__kfree_skb net/core/skbuff.c:868 [inline]
kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
kfree_skb include/linux/skbuff.h:1216 [inline]
kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
____sys_recvmsg+0x16c/0x2e0
___sys_recvmsg net/socket.c:2743 [inline]
do_recvmmsg+0x2f1/0x710 net/socket.c:2837
__sys_recvmmsg net/socket.c:2916 [inline]
__do_sys_recvmmsg net/socket.c:2939 [inline]
__se_sys_recvmmsg net/socket.c:2932 [inline]
__x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x01 -> 0x00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 17869 Comm: syz-executor.2 Not tainted 6.1.0-rc1-syzkaller-00010-gbb1a1146467a-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

kcm: annotate data-races around kcm->rx_psock

kcm->rx_psock can be read locklessly in kcm_rfree().
Annotate the read and writes accordingly.

We do the same for kcm->rx_wait in the following patch.

syzbot reported:
BUG: KCSAN: data-race in kcm_rfree / unreserve_rx_kcm

write to 0xffff888123d827b8 of 8 bytes by task 2758 on cpu 1:
unreserve_rx_kcm+0x72/0x1f0 net/kcm/kcmsock.c:313
kcm_rcv_strparser+0x2b5/0x3a0 net/kcm/kcmsock.c:373
__strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
strp_recv+0x6d/0x80 net/strparser/strparser.c:335
tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
strp_read_sock net/strparser/strparser.c:358 [inline]
do_strp_work net/strparser/strparser.c:406 [inline]
strp_work+0xe8/0x180 net/strparser/strparser.c:415
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

read to 0xffff888123d827b8 of 8 bytes by task 5859 on cpu 0:
kcm_rfree+0x14c/0x220 net/kcm/kcmsock.c:181
skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
skb_release_all net/core/skbuff.c:852 [inline]
__kfree_skb net/core/skbuff.c:868 [inline]
kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
kfree_skb include/linux/skbuff.h:1216 [inline]
kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
____sys_recvmsg+0x16c/0x2e0
___sys_recvmsg net/socket.c:2743 [inline]
do_recvmmsg+0x2f1/0x710 net/socket.c:2837
__sys_recvmmsg net/socket.c:2916 [inline]
__do_sys_recvmmsg net/socket.c:2939 [inline]
__se_sys_recvmmsg net/socket.c:2932 [inline]
__x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0xffff88812971ce00 -> 0x0000000000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 5859 Comm: syz-executor.3 Not tainted 6.0.0-syzkaller-12189-g19d17ab7c68b-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'udp-false-sharing'

Paolo Abeni says:

====================
udp: avoid false sharing on receive

Under high UDP load, the BH processing and the user-space receiver can
run on different cores.

The UDP implementation does a lot of effort to avoid false sharing in
the receive path, but recent changes to the struct sock layout moved
the sk_forward_alloc and the sk_rcvbuf fields on the same cacheline:

        /* --- cacheline 4 boundary (256 bytes) --- */
                struct sk_buff *   tail;
        } sk_backlog;
        int                        sk_forward_alloc;
        unsigned int               sk_reserved_mem;
        unsigned int               sk_ll_usec;
        unsigned int               sk_napi_id;
        int                        sk_rcvbuf;

sk_forward_alloc is updated by the BH, while sk_rcvbuf is accessed by
udp_recvmsg(), causing false sharing.

A possible solution would be to re-order the struct sock fields to avoid
the false sharing. Such change is subject to being invalidated by future
changes and could have negative side effects on other workload.

Instead this series uses a different approach, touching only the UDP
socket layout.

The first patch generalizes the custom setsockopt infrastructure, to
allow UDP tracking the buffer size, and the second patch addresses the
issue, copying the relevant buffer information into an already hot
cacheline.

Overall the above gives a 10% peek throughput increase under UDP flood.

v1 -> v2:
- introduce and use a common helper to initialize the UDP v4/v6 sockets
   (Kuniyuki)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

udp: track the forward memory release threshold in an hot cacheline

When the receiver process and the BH runs on different cores,
udp_rmem_release() experience a cache miss while accessing sk_rcvbuf,
as the latter shares the same cacheline with sk_forward_alloc, written
by the BH.

With this patch, UDP tracks the rcvbuf value and its update via custom
SOL_SOCKET socket options, and copies the forward memory threshold value
used by udp_rmem_release() in a different cacheline, already accessed by
the above function and uncontended.

Since the UDP socket init operation grown a bit, factor out the common
code between v4 and v6 in a shared helper.

Overall the above give a 10% peek throughput increase under UDP flood.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce and use custom sockopt socket flag

We will soon introduce custom setsockopt for UDP sockets, too.
Instead of doing even more complex arbitrary checks inside
sock_use_custom_sol_socket(), add a new socket flag and set it
for the relevant socket types (currently only MPTCP).

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fman: Use physical address for userspace interfaces

Before 262f2b782e25 ("net: fman: Map the base address once"), the
physical address of the MAC was exposed to userspace in two places: via
sysfs and via SIOCGIFMAP. While this is not best practice, it is an
external ABI which is in use by userspace software.

The aforementioned commit inadvertently modified these addresses and
made them virtual. This constitutes and ABI break. Additionally, it
leaks the kernel's memory layout to userspace. Partially revert that
commit, reintroducing the resource back into struct mac_device, while
keeping the intended changes (the rework of the address mapping).

Fixes: 262f2b782e25 ("net: fman: Map the base address once")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-800Gbps-support'

Petr Machata says:

====================
net: Add support for 800Gbps speed

Amit Cohen <amcohen@nvidia.com> writes:

The next Nvidia Spectrum ASIC will support 800Gbps speed.
The IEEE 802 LAN/MAN Standards Committee already published standards for
800Gbps, see the last update [1] and the list of approved changes [2].

As first phase, add support for 800Gbps over 8 lanes (100Gbps/lane).
In the future 800Gbps over 4 lanes can be supported also.

Extend ethtool to support the relevant PMDs and extend mlxsw and bonding
drivers to support 800Gbps.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: 3ad: Add support for 800G speed

Add support for 800Gbps speed to allow using 3ad mode with 800G devices.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Add support for 800Gbps link modes

Add support for 800Gbps speed, link modes of 100Gbps per lane.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: Add support for 800Gbps link modes

Add support for 800Gbps speed, link modes of 100Gbps per lane.
As mentioned in slide 21 in IEEE documentation [1], all adopted 802.3df
copper and optical PMDs baselines using 100G/lane will be supported.

Add the relevant PMDs which are mentioned in slide 5 in IEEE
documentation [1] and were approved on 10-2022 [2]:
BP - KR8
Cu Cable - CR8
MMF 50m - VR8
MMF 100m - SR8
SMF 500m - DR8
SMF 2km - DR8-2

[1]: https://www.ieee802.org/3/df/public/22_10/22_1004/shrikhande_3df_01a_221004.pdf
[2]: https://ieee802.org/3/df/KeyMotions_3df_221005.pdf

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sparx5-IS2-VCAP'

Steen Hegelund says:

====================
Add support for Sparx5 IS2 VCAP

This provides initial support for the Sparx5 VCAP functionality via the
'tc' traffic control userspace tool and its flower filter.

Overview:
=========

The supported flower filter keys and actions are:

- source and destination MAC address keys
- trap action
- pass action

The supported Sparx5 VCAPs are: IS2 (see below for more info)

The VCAP (Versatile Content-Aware Processor) feature is essentially a TCAM
with rules consisting of:

- Programmable key fields
- Programmable action fields
- A counter (which may be only one bit wide)

Besides this each VCAP has:

- A number of independent lookups
- A keyset configuration typically per port per lookup

VCAPs are used in many of the TSN features such as PSFP, PTP, FRER as well
as the general shaping, policing and access control, so it is an important
building block for these advanced features.

Functionality:
==============

When a frame is passed to a VCAP the VCAP will generate a set of keys
(keyset) based on the traffic type.  If there is a rule created with this
keyset in the VCAP and the values of the keys matches the values in the
keyset of the frame, the rule is said to match and the actions in the rule
will be executed and the rule counter will be incremented.  No more rules
will be examined in this VCAP lookup.

If there is no match in the current lookup the frame will be matched
against the next lookup (some VCAPs do the processing of the lookups in
parallel).

The Sparx5 SoC has 6 different VCAP types:

- IS0: Ingress Stage 0 (AKA CLM) mostly handles classification
- IS2: Ingress Stage 2 mostly handles access control
- IP6PFX: IPv6 prefix: Provides tables for IPV6 address management
- LPM: Longest Path Match for IP guarding and routing
- ES0: Egress Stage 0 is mostly used for CPU copying and multicast handling
- ES2: Egress Stage 2 is known as the rewriter and mostly updates tags

Design:
=======

The VCAP implementation provides switchcore independent handling of rules
and supports:

- Creating and deleting rules
- Updating and getting rules

The platform specific API implementation as well as the platform specific
model of the VCAP instances are attached to the VCAP API and a client can
then access rules via the API in a platform independent way, with the
limitations that each VCAP has in terms of is supported keys and actions.

The VCAP model is generated from information delivered by the designers of
the VCAP hardware.

Here is an illustration of this:

  +------------------+     +------------------+
  | TC flower filter |     | PTP client       |
  | for Sparx5       |     | for Sparx5       |
  +-------------\----+     +---------/--------+
                 \                  /
                  \                /
                   \              /
                    \            /
                     \          /
                 +----v--------v----+
                 |     VCAP API     |
                 +---------|--------+
                           |
                           |
                           |
                           |
                 +---------v--------+
                 |   VCAP control   |
                 |   instance       |
                 +----/--------|----+
                     /         |
                    /          |
                   /           |
                  /            |
  +--------------v---+    +----v-------------+
  |   Sparx5 VCAP    |    | Sparx5 VCAP API  |
  |   model          |    | Implementation   |
  +------------------+    +---------|--------+
                                    |
                                    |
                                    |
                                    |
                          +---------v--------+
                          | Sparx5 VCAP HW   |
                          +------------------+

Delivery:
=========

For now only the IS2 is supported but later the IS0, ES0 and ES2 will be
added. There are currently no plans to support the IP6PFX and the LPM
VCAPs.

The IS2 VCAP has 4 lookups and they are accessible with a TC chain id:

- chain 8000000: IS2 Lookup 0
- chain 8100000: IS2 Lookup 1
- chain 8200000: IS2 Lookup 2
- chain 8300000: IS2 Lookup 3

These lookups are executed in parallel by the IS2 VCAP but the actions are
executed in series (the datasheet explains what happens if actions
overlap).

The functionality of TC flower as well as TC matchall filters will be
expanded in later submissions as well as the number of VCAPs supported.

This is current plan:

- add support for more TC flower filter keys and extend the Sparx5 port
  keyset configuration
- support for TC protocol all
- debugfs support for inspecting rules
- TC flower filter statistics
- Sparx5 IS0 VCAP support and more TC keys and actions to support this
- add TC policer and drop action support (depends on the Sparx5 QoS support
  upstreamed separately)
- Sparx5 ES0 VCAP support and more TC actions to support this
- TC flower template support
- TC matchall filter support for mirroring and policing ports
- TC flower filter mirror action support
- Sparx5 ES2 VCAP support

The LAN966x switchcore will also be updated to use the VCAP API as well as
future Microchip switches.
The LAN966x has 3 VCAPS (IS1, IS2 and ES0) and a slightly different keyset
and actionset portfolio than Sparx5.

Version History:
================
v3      Moved the sparx5_tc_flower_set_exterr function to the VCAP API and
        renamed it.
        Moved the sparx5_netbytes_copy function to the VCAP_API and renamed
        it (thanks Horatiu Vultur).
        Fixed indentation in the vcap_write_rule function.
        Added a comment mentioning the typegroup table terminator in the
        vcap_iter_skip_tg function.

v2      Made the KUNIT test model a superset of the real model to fix a
        kernel robot build error.

v1      Initial version
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding KUNIT test for the VCAP API

This provides a KUNIT test suite for the VCAP APIs encoding functionality.

The test can be run by adding these settings in a .kunitconfig file

CONFIG_KUNIT=y
CONFIG_NET=y
CONFIG_VCAP_KUNIT_TEST=y

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding KUNIT test VCAP model

This provides a test VCAP model for use in a KUNIT test. The model
provides 3 different VCAP types for better test coverage.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Writing rules to the IS2 VCAP

This adds rule encoding functionality to the VCAP API.

A rule consists of keys and actions in separate cache sections.

The maximum size of the keyset or actionset determines the size of the
rule.

The VCAP hardware need to be able to distinguish different rule sizes from
each other, and for that purpose some extra typegroup bits are added to the
rule when it is encoded.

The API provides a bit stream iterator that allows highlevel encoding
functionality to add key and action value bits independent of typegroup
bits.

This is handled by letting the concrete VCAP model provide the typegroup
table for the different rule sizes.
After the key and action values have been added to the encoding bit streams
the typegroup bits are set to their correct values just before the rule is
written to the VCAP hardware.

The key and action offsets provided in the VCAP model are the offset before
adding the typegroup bits.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Tested-by: Casper Andersson <casper.casan@gmail.com>
Reviewed-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding basic rule management in VCAP API

This provides most of the rule handling needed to add a new rule to a VCAP.
To add a rule a client must follow these steps:

1) Allocate a new rule (provide an id or get one automatically assigned)
2) Add keys to the rule
3) Add actions to the rule
4) Optionally set a keyset on the rule
5) Optionally set an actionset on the rule
6) Validate the rule (this will add keyset and actionset if not specified
in the previous steps)
7) Add the rule (if the validation was successful)
8) Free the rule instance (a copy has been added to the VCAP)

The validation step will fail if there are no keysets with the requested
keys, or there are no actionsets with the requested actions.
The validation will also fail if the keyset is not configured for the port
for the requested protocol).

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Tested-by: Casper Andersson <casper.casan@gmail.com>
Reviewed-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding port keyset config and callback interface

This provides a default port keyset configuration for the Sparx5 IS2 VCAP
where all ports and all lookups in IS2 use the same keyset (MAC_ETYPE) for
all types of traffic.

This means that no matter what frame type is received on any front port it
will generate the MAC_ETYPE keyset in the IS VCAP and any rule in the IS2
VCAP that uses this keyset will be matched against the keys in the
MAC_ETYPE keyset.

The callback interface used by the VCAP API is populated with Sparx5
specific handler functions that takes care of the actual reading and
writing to data to the Sparx5 IS2 VCAP instance.

A few functions are also added to the VCAP API to support addition of rule
fields such as the ingress port mask and the lookup bit.

The IS2 VCAP in Sparx5 is really divided in two instances with lookup 0
and 1 in the first instance and lookup 2 and 3 in the second instance.
The lookup bit selects lookup 0 or 3 in the respective instance when it is
set.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Tested-by: Casper Andersson <casper.casan@gmail.com>
Reviewed-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding initial tc flower support for VCAP API

This adds initial TC flower filter support to Sparx5 for the IS2 VCAP.

The support consists of the source and destination MAC addresses,
and the trap and pass actions.

This is how you can create a rule that test the functionality:

tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress chain 8000000 prio 10 handle 10 \
      protocol all flower skip_sw \
      dst_mac 0a:0b:0c:0d:0e:0f \
      src_mac 2:0:0:0:0:1 \
      action trap

The IS2 chains in Sparx5 are assigned like this:

- chain 8000000: IS2 Lookup 0
- chain 8100000: IS2 Lookup 1
- chain 8200000: IS2 Lookup 2
- chain 8300000: IS2 Lookup 3

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Tested-by: Casper Andersson <casper.casan@gmail.com>
Reviewed-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding IS2 VCAP register interface

This adds the register interface needed to access the Sparx5 Ingress Stage
2 VCAP (IS2).

The Sparx5 Chip Register Model can be browsed at this location:
https://github.com/microchip-ung/sparx-5_reginfo

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding IS2 VCAP model to VCAP API

This provides the Sparx5 Ingress Stage 2 (IS2) model and adds it to the
VCAP control instance that will be provided to the VCAP API.

The Sparx5 IS2 C code model is generated from the Sparx5 RTL design model.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: microchip: sparx5: Adding initial VCAP API support

This provides the initial VCAP API framework and Sparx5 specific VCAP
implementation.

When the Sparx5 Switchdev driver is initialized it will also initialize its
VCAP module, and this hooks up the concrete Sparx5 VCAP model to the VCAP
API, so that the VCAP API knows what VCAP instances are available.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: tunnel neigh support bond offload

Support hardware offload when tunnel neigh out port is bond.
These feature work with the nfp firmware. If the firmware
supports the NFP_FL_FEATS_TUNNEL_NEIGH_LAG feature, nfp driver
write the bond information to the firmware neighbor table or
do nothing for bond. when neighbor MAC changes, nfp driver
need to update the neighbor information too.

Signed-off-by: Yanguo Li <yanguo.li@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Cleanup MACsec uninitialization routine

The mlx5e_macsec_cleanup() routine has NULL pointer dereferencing if mlx5
device doesn't support MACsec (priv->macsec will be NULL).

While at it delete comment line, assignment and extra blank lines, so fix
everything in one patch.

Fixes: 1f53da676439 ("net/mlx5e: Create advanced steering operation (ASO) object for MACsec")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atlantic: fix deadlock at aq_nic_stop

NIC is stopped with rtnl_lock held, and during the stop it cancels the
'service_task' work and free irqs.

However, if CONFIG_MACSEC is set, rtnl_lock is acquired both from
aq_nic_service_task and aq_linkstate_threaded_isr. Then a deadlock
happens if aq_nic_stop tries to cancel/disable them when they've already
started their execution.

As the deadlock is caused by rtnl_lock, it causes many other processes
to stall, not only atlantic related stuff.

Fix it by introducing a mutex that protects each NIC's macsec related
data, and locking it instead of the rtnl_lock from the service task and
the threaded IRQ.

Before this patch, all macsec data was protected with rtnl_lock, but
maybe not all of it needs to be protected. With this new mutex, further
efforts can be made to limit the protected data only to that which
requires it. However, probably it doesn't worth it because all macsec's
data accesses are infrequent, and almost all are done from macsec_ops
or ethtool callbacks, called holding rtnl_lock, so macsec_mutex won't
never be much contended.

The issue appeared repeteadly attaching and deattaching the NIC to a
bond interface. Doing that after this patch I cannot reproduce the bug.

Fixes: 62c1c2e606f6 ("net: atlantic: MACSec offload skeleton")
Reported-by: Li Liang <liali@redhat.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'inet6_destroy_sock-calls-remove'

Kuniyuki Iwashima says:

====================
inet6: Remove inet6_destroy_sock() calls.

This is a follow-up series for commit d38afeec26ed ("tcp/udp: Call
inet6_destroy_sock() in IPv6 sk->sk_destruct().").

This series cleans up unnecessary inet6_destory_sock() calls in
sk->sk_prot->destroy() and call it from sk->sk_destruct() to make
sure we do not leak memory related to IPv6 specific-resources.

Changes:
  v2:
    * patch 1
      * Fix build failure for CONFIG_MPTCP_IPV6=y

  v1: https://lore.kernel.org/netdev/20221018190956.1308-1-kuniyu@amazon.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

inet6: Clean up failure path in do_ipv6_setsockopt().

We can reuse the unlock label above and need not repeat the same code.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet6: Remove inet6_destroy_sock().

The last user of inet6_destroy_sock() is its wrapper inet6_cleanup_sock().
Let's rename inet6_destroy_sock() to inet6_cleanup_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: Call inet6_destroy_sock() via sk->sk_destruct().

After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock()
in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in
sk->sk_destruct() by setting inet6_sock_destruct() to it to make
sure we do not leak inet6-specific resources.

SCTP sets its own sk->sk_destruct() in the sctp_init_sock(), and
SCTPv6 socket reuses it as the init function.

To call inet6_sock_destruct() from SCTPv6 sk->sk_destruct(), we
set sctp_v6_destruct_sock() in a new init function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dccp: Call inet6_destroy_sock() via sk->sk_destruct().

After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock()
in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in
sk->sk_destruct() by setting inet6_sock_destruct() to it to make
sure we do not leak inet6-specific resources.

DCCP sets its own sk->sk_destruct() in the dccp_init_sock(), and
DCCPv6 socket shares it by calling the same init function via
dccp_v6_init_sock().

To call inet6_sock_destruct() from DCCPv6 sk->sk_destruct(), we
export it and set dccp_v6_sk_destruct() in the init function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().

After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock()
in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in
sk->sk_destruct() by setting inet6_sock_destruct() to it to make
sure we do not leak inet6-specific resources.

Now we can remove unnecessary inet6_destroy_sock() calls in
sk->sk_prot->destroy().

DCCP and SCTP have their own sk->sk_destruct() function, so we
change them separately in the following patches.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dpaa2-eth-AF_XDP-zc'

Ioana Ciornei says:

====================
net: dpaa2-eth: AF_XDP zero-copy support

This patch set adds support for AF_XDP zero-copy in the dpaa2-eth
driver. The support is available on the LX2160A SoC and its variants and
only on interfaces (DPNIs) with a maximum of 8 queues (HW limitations
are the root cause).

We are first implementing the .get_channels() callback since this a
dependency for further work.

Patches 2-3 are working on making the necessary changes for multiple
buffer pools on a single interface. By default, without an AF_XDP socket
attached, only a single buffer pool will be used and shared between all
the queues. The changes in the functions are made in this patch, but the
actual allocation and setup of a new BP is done in patch#10.

Patches 4-5 are improving the information exposed in debugfs. We are
exposing a new file to show which buffer pool is used by what channels
and how many buffers it currently has.

The 6th patch updates the dpni_set_pools() firmware API so that we are
capable of setting up a different buffer per queue in later patches.

In the 7th patch the generic dev_open/close APIs are used instead of the
dpaa2-eth internal ones.

Patches 8-9 are rearranging the existing code in dpaa2-eth.c in order to
create new functions which will be used in the XSK implementation in
dpaa2-xsk.c

Finally, the last 3 patches are adding the actual support for both the
Rx and Tx path of AF_XDP zero-copy and some associated tracepoints.
Details on the implementation can be found in the actual patch.

Changes in v2:
- 3/12:  Export dpaa2_eth_allocate_dpbp/dpaa2_eth_free_dpbp in this
   patch to avoid a build warning. The functions will be used in next
   patches.
- 6/12:  Use __le16 instead of u16 for the dpbp_id field.
- 12/12: Use xdp_buff->data_hard_start when tracing the BP seeding.

Changes in v3:
- 3/12: fix leaking of bp on error path
====================

Acked-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dpaa2-eth: add trace points on XSK events

Define the dpaa2_tx_xsk_fd and dpaa2_rx_xsk_fd trace events for the XSK
zero-copy Rx and Tx path. Also, define the dpaa2_eth_buf as an event
class so that both dpaa2_eth_buf_seed and dpaa2_xsk_buf_seed traces can
derive from the same class.

Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dpaa2-eth: AF_XDP TX zero copy support

Add support in dpaa2-eth for packet processing on the Tx path using
AF_XDP zero copy mode.

The newly added dpaa2_xsk_tx() function will handle enqueuing AF_XDP Tx
packets into the appropriate queue and update any necessary statistics.

On a more detailed note, the dpaa2_xsk_tx_build_fd() function handles
creating a Scatter-Gather frame descriptor with only one data buffer.
This is needed because otherwise we would need to impose a headroom in
the Tx buffer to store our software annotation structures.
This tactic is already used on the normal data path of the dpaa2-eth
driver, thus we are reusing the dpaa2_eth_sgt_get/dpaa2_eth_sgt_recycle
functions in order to allocate and recycle the Scatter-Gather table
buffers.

In case we have reached the maximum number of Tx XSK packets to be sent
in a NAPI cycle, we'll exit the dpaa2_eth_poll() and hope to be
rescheduled again.

On the XSK Tx confirmation path, we are just unmapping the SGT buffer
and recycle it for further use.

Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dpaa2-eth: AF_XDP RX zero copy support

This patch adds the support for receiving packets via the AF_XDP
zero-copy mechanism in the dpaa2-eth driver. The support is available
only on the LX2160A SoC and variants because we are relying on the HW
capability to associate a buffer pool to a specific queue (QDBIN), only
available on newer WRIOP versions.

On the control path, the dpaa2_xsk_enable_pool() function is responsible
to allocate a buffer pool (BP), setup this new BP to be used only on the
requested queue and change the consume function to point to the XSK ZC
one.
We are forced to call dev_close() in order to change the queue to buffer
pool association (dpaa2_xsk_set_bp_per_qdbin) . This also works in our
favor since at dev_close() the buffer pools will be drained and at the
later dev_open() call they will be again seeded, this time with buffers
allocated from the XSK pool if needed.

On the data path, a new software annotation type is defined to be used
only for the XSK scenarios. This will enable us to pass keep necessary
information about a packet buffer between the moment in which it was
seeded and when it's received by the driver. In the XSK case, we are
keeping the associated xdp_buff.
Depending on the action returned by the BPF program, we will do the
following:
- XDP_PASS: copy the contents of the packet into a brand new skb,
recycle the initial buffer.
- XDP_TX: just enqueue the same frame descriptor back into the Tx path,
the buffer will get automatically released into the initial BP.
- XDP_REDIRECT: call xdp_do_redirect() and exit.

Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dpaa2-eth: create and export the dpaa2_eth_receive_skb() function

Carve out code from the dpaa2_eth_rx() function in order to create and
export the dpaa2_eth_receive_skb() function. Do this in order to reuse
this code also from the XSK path which will be introduced in a later
patch.

Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dpaa2-eth: create and export the dpaa2_eth_alloc_skb function

The dpaa2_eth_alloc_skb() function is added by moving code from the
dpaa2_eth_copybreak() previously defined function. What the new API does
is to allocate a new skb, copy the frame data from the passed FD to the
new skb and then return the skb.
Export this new function since we'll need the this functionality also
from the XSK code path.

Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>