review.tizen.org Git - platform/kernel/linux-rpi.git/log

net: ipa: define all IPA status mask bits

There is a 16 bit status mask defined in the IPA packet status
structure, of which only one (TAG_VALID) is currently used.

Define all other IPA status mask values in an enumerated type whose
numeric values are bit mask values (in CPU byte order) in the status
mask. Use the TAG_VALID value from that type rather than defining a
separate field mask.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: stop using sizeof(status)

The IPA packet status structure changes in IPA v5.0 in ways that are
difficult to represent cleanly. As a small step toward redefining
it as a parsed block of data, use a constant to define its size,
rather than the size of the IPA status structure type.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: refactor status buffer parsing

The packet length encoded in an IPA packet status buffer is computed
more than once in ipa_endpoint_status_parse(). It is also checked
again in ipa_endpoint_status_skip(), which that function calls.

Compute the length once, and use that computed value later rather
than recomputing it. Check for it being zero in the parse function
rather than in ipa_endpoint_status_skip().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: ocelot: build felix.c into a dedicated kernel module

The build system currently complains:

scripts/Makefile.build:252: drivers/net/dsa/ocelot/Makefile:
felix.o is added to multiple modules: mscc_felix mscc_seville

Since felix.c holds the DSA glue layer, create a mscc_felix_dsa_lib.ko.
This is similar to how mscc_ocelot_switch_lib.ko holds a library for
configuring the hardware.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Colin Foster <colin.foster@in-advantage.com>
Link: https://lore.kernel.org/r/20230125145716.271355-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
virtchnl: update and refactor

Jesse Brandeburg says:

The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.

The whole series should result in no functional change, but hopefully
clearer code.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  virtchnl: i40e/iavf: rename iwarp to rdma
  virtchnl: do structure hardening
  virtchnl: update header and increase header clarity
  virtchnl: remove unused structure declaration
====================

Link: https://lore.kernel.org/r/20230125212441.4030014-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-ynl-prevent-reorder-and-fix-flags'

Jakub Kicinski says:

====================
tools: ynl: prevent reorder and fix flags

Some codegen improvements for YAML specs.

First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.

Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================

Link: https://lore.kernel.org/r/20230126000235.1085551-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: store ops in ordered dict to avoid random ordering

When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: rename ops_list -> msg_list

ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: support kdocs for flags in code generation

Lorenzo reports that after switching from enum to flags netdev
family lost ability to render kdoc (and the enum contents got
generally garbled).

Combine the flags and enum handling in uAPI handling.

Reported-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'convert-drivers-to-return-xfrm-configuration-errors-through-extack'

Leon Romanovsky says:

====================
Convert drivers to return XFRM configuration errors through extack

This series continues effort started by Sabrina to return XFRM configuration
errors through extack. It allows for user space software stack easily present
driver failure reasons to users.

As a note, Intel drivers have a path where extack is equal to NULL, and error
prints won't be available in current patchset. If it is needed, it can be
changed by adding special to Intel macro to print to dmesg in case of
extack == NULL.
====================

Link: https://lore.kernel.org/r/cover.1674560845.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cxgb4: fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbevf: fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: fill IPsec state validation failure reason

Rely on extack to return failure reason.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: Fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fill IPsec state validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xfrm: extend add state callback to set failure reason

Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fill IPsec policy validation failure reason

Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xfrm: extend add policy callback to set failure reason

Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: provide shims for stats aggregation helpers when CONFIG_ETHTOOL_NETLINK=n

ethtool_aggregate_*_stats() are implemented in net/ethtool/stats.c, a
file which is compiled out when CONFIG_ETHTOOL_NETLINK=n. In order to
avoid adding Kbuild dependencies from drivers (which call these helpers)
on CONFIG_ETHTOOL_NETLINK, let's add some shim definitions which simply
make the helpers dead code.

This means the function prototypes should have been located in
include/linux/ethtool_netlink.h rather than include/linux/ethtool.h.

Fixes: 449c5459641a ("net: ethtool: add helpers for aggregate statistics")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230125110214.4127759-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-add-mixed-v4-v6-support-for-the-in-kernel-pm'

Matthieu Baerts says:

====================
mptcp: add mixed v4/v6 support for the in-kernel PM

Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.

MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.

Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.

Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.

Some selftests have also been added for the in-kernel PM (patch 3).

Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:

commit 4656d72c1efa ("selftests: mptcp: userspace: validate v4-v6 subflows mix")
---
====================

Link: https://lore.kernel.org/r/20230123-upstream-net-next-pm-v4-v6-v1-0-43fac502bfbf@tessares.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mptcp: userspace: avoid read errors

During the cleanup phase, the server pids were killed with a SIGTERM
directly, not using a SIGUSR1 first to quit safely. As a result, this
test was often ending with two error messages:

read: Connection reset by peer

While at it, use a for-loop to terminate all the PIDs the same way.

Also the different files are now removed after having killed the PIDs
using them. It makes more sense to do that in this order.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mptcp: userspace: print error details if any

Before, only '[FAIL]' was printed in case of error during the validation
phase.

Now, in case of failure, the variable name, its value and expected one
are displayed to help understand what was wrong.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mptcp: userspace: refactor asserts

Instead of having a long list of conditions to check, it is possible to
give a list of variable names to compare with their 'e_XXX' version.

This will ease the introduction of the following commit which will print
which condition has failed (if any).

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mptcp: userspace: print titles

This script is running a few tests after having setup the environment.

Printing titles helps understand what is being tested.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: userspace pm: use a single point of exit

Like in all other functions in this file, a single point of exit is used
when extra operations are needed: unlock, decrement refcount, etc.

There is no functional change for the moment but it is better to do the
same here to make sure all cleanups are done in case of intermediate
errors.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mptcp: add test-cases for mixed v4/v6 subflows

Note that we can't guess the listener family anymore based on the client
target address: always use IPv6.

The fullmesh flag with endpoints from different families is also
validated here.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: propagate sk_ipv6only to subflows

Usually, attributes are propagated to subflows as well.

Here, if subflows are created by other ways than the MPTCP path-manager,
it is important to make sure they are in v6 if it is asked by the
userspace.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses

Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.

This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.

A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

icmp: Add counters for rate limits

There are multiple ICMP rate limiting mechanisms:

* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask

However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.

Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.

Example output:

# nstat -sz "*RateLimit*"
IcmpOutRateLimitGlobal          134                0.0
IcmpOutRateLimitHost            770                0.0
Icmp6OutRateLimitHost           84                 0.0

Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'adding-sparx5-is0-vcap-support'

Steen Hegelund says:

====================
Adding Sparx5 IS0 VCAP support

This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.

The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.

The IS0 VCAP has 4 lookups and they are accessible with a TC chain id:

- chain 1000000: IS0 Lookup 0
- chain 1100000: IS0 Lookup 1
- chain 1200000: IS0 Lookup 2
- chain 1300000: IS0 Lookup 3
- chain 1400000: IS0 Lookup 4
- chain 1500000: IS0 Lookup 5

Each of these lookups have their own port keyset configuration that decides
which keys will be used for matching on which traffic type.

The IS0 VCAP has these traffic classifications:

- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Other frame types than MPLS, IPv4 and IPv6

The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.

This allow rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.

The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.

As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
====================

Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add support for IS0 VCAP CVLAN TC keys

This adds support for parsing and matching on the CVLAN tags in the Sparx5
IS0 VCAP.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types

This allows the IS0 VCAP to have its own list of supported ethernet
protocol types matching what is supported by the VCAPs port lookup
classification.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add automatic selection of VCAP rule actionset

With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs

This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.

Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add TC support for IS0 VCAP

This enables the TC command to use the Sparx5 IS0 VCAP

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add actionset type id information to rule

This adds the actionset type id to the rule information. This is needed as
we now have more than one actionset in a VCAP instance (IS0).

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5

This adds the IS0 VCAP port keyset configuration for Sparx5 and also
updates the debugFS support to show the keyset configuration.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model

This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
This VCAP provides classification actions for Sparx5.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-ip_local_port_range-socket-option'

Jakub Sitnicki says:

====================
Add IP_LOCAL_PORT_RANGE socket option

This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

       IP_LOCAL_PORT_RANGE (since Linux X.Y)
              Set or get the per-socket default local  port  range.  This
              option  can  be  used  to  clamp down the global local port
              range, defined by the ip_local_port_range  /proc  interface
              described below, for a given socket.

              The  option  takes  an uint32_t value with the high 16 bits
              set to the upper range bound, and the low 16  bits  set  to
              the  lower  range  bound.  Range  bounds are inclusive. The
              16-bit values should be in host byte order.

              The lower bound has to be less than the  upper  bound  when
              both  bounds  are  not  zero. Otherwise, setting the option
              fails with EINVAL.

              If either bound is outside of the global local port  range,
              or is zero, then that bound has no effect.

              To  reset  the setting, pass zero as both the upper and the
              lower bound.

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.

The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

* Allow either the high bound or the low bound, or both, to be zero
* Add getsockopt support
* Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/
====================

Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option

Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Kconfig: fix spellos

Fix spelling in net/ Kconfig files.
(reported by codespell)

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

virtchnl: i40e/iavf: rename iwarp to rdma

Since the latest Intel hardware does both IWARP and ROCE, rename the
term IWARP in the virtchnl header to be RDMA. Do this for both upper and
lower case instances. Many of the non-virtchnl.h changes were done with
regular expression replacements using perl like:
perl -p -i -e 's/_IWARP/_RDMA/' <files>
perl -p -i -e 's/_iwarp/_rdma/' <files>
and I had to pick up a few instances manually.

The virtchnl.h header has some comments and clarity added around when to
use certain defines.

note: had to fix a checkpatch warning for a long line by wrapping one of
the lines I changed.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

virtchnl: do structure hardening

The virtchnl interface can have a bunch of "soft" defined structures
hardened by using explicit sizes for declarations, and then referring to
the enum type that uses them in a comment. None of these changes should
change any of the structure sizes.

Also, remove a duplicate line in a switch statement and let two cases
uses the same code.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

virtchnl: update header and increase header clarity

We already have the SPDX header, so just leave a copyright notice with
an updated year and get rid of the boilerplate header (so 2002!).

In addition, update a couple of comments to clarify how the various
parts of the virtchannel header interaction work.

No functional changes.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

virtchnl: remove unused structure declaration

Nothing uses virtchnl_msg, just remove it.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

net: ethtool: fix NULL pointer dereference in pause_prepare_data()

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> pause_prepare_data

struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: fix NULL pointer dereference in stats_prepare_data()

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> stats_prepare_data

struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 's390-ism-generalized-interface'

Jan Karcher says:

====================
drivers/s390/net/ism: Add generalized interface

Previously, there was no clean separation between SMC-D code and the ISM
device driver.This patch series addresses the situation to make ISM available
for uses outside of SMC-D.
In detail: SMC-D offers an interface via struct smcd_ops, which only the
ISM module implements so far. However, there is no real separation between
the smcd and ism modules, which starts right with the ISM device
initialization, which calls directly into the SMC-D code.
This patch series introduces a new API in the ISM module, which allows
registration of arbitrary clients via include/linux/ism.h: struct ism_client.
Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of
dependencies on SMC-D in the device structure), and adds a number of API
calls for data transfers via ISM (see ism_register_dmb() & friends).
Still, the ISM module implements the SMC-D API, and therefore has a number
of internal helper functions for that matter.
Note that the ISM API is consciously kept thin for now (as compared to the
SMC-D API calls), as a number of API calls are only used with SMC-D and
hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.

v1 -> v2:
Removed s390x dependency which broke config for other archs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: De-tangle ism and smc device initialization

The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/ism: Consolidate SMC-D-related code

The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Separate SMC-D and ISM APIs

We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Register SMC-D as ISM client

Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ism: Add new API for client registration

Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/ism: Introduce struct ism_dmb

Conceptually, a DMB is a structure that belongs to ISM devices. However,
SMC currently 'owns' this structure. So future exploiters of ISM devices
would be forced to include SMC headers to work - which is just weird.
Therefore, we switch ISM to struct ism_dmb, introduce a new public header
with the definition (will be populated with further API calls later on),
and, add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb
are identical, we can simply convert between the two for now.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ism: Add missing calls to disable bus-mastering

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Terminate connections prior to device removal

Removing an ISM device prior to terminating its associated connections
doesn't end well.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: Reduce debug name field size to 16 bytes

virtio queue index can be maximum of 65535. 16 bytes are enough to store
the vq name with the existing string prefix.

With this change, send queue struct saves 24 bytes and receive
queue saves whole cache line worth 64 bytes per structure
due to saving in alignment bytes.

Pahole results before:

pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1112    0
receive_queue   1280    1

Pahole results after:
pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1088    0
receive_queue   1216    1

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: remove a dubious assumption in fmsg dumping

Build bot detects that err may be returned uninitialized in
devlink_fmsg_prepare_skb(). This is not really true because
all fmsgs users should create at least one outer nest, and
therefore fmsg can't be completely empty.

That said the assumption is not trivial to confirm, so let's
follow the bots advice, anyway.

This code does not seem to have changed since its inception in
commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API")

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm()

We don't read the verify_enabled variable from hardware in the MAC Merge
layer state GET operation, instead we always leave it set to "false".
The user may think something is wrong if they set verify_enabled to
true, then read it back and see it's still false, even though the
configuration took place.

Fixes: 6505b6805655 ("net: mscc: ocelot: add MAC Merge layer support for VSC9959")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230123184538.3420098-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: flower: change get/set_eeprom logic and enable for flower reps

The changes in this patch are as follows:

- Alter the logic of get/set_eeprom functions to use the helper function
nfp_app_from_netdev() which handles differentiating between an nfp_net
and a nfp_repr. This allows us to get an agnostic backpointer to the
pdev.

- Enable the various eeprom commands by adding the 'get_eeprom_len',
'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct.
This allows the eeprom commands to work on representor interfaces,
similar to a previous patch which added it to the vnics.
Currently these are being used to configure persistent MAC addresses for
the physical ports on the nfp.

Signed-off-by: James Hershaw <james.hershaw@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230123134135.293278-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Make ip6_route_output_flags_noref() static.

This function is only used in net/ipv6/route.c and has no reason to be
visible outside of it.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: fix spelling mistake in dump size assert

Commit 2c7bc10d0f7b ("netlink: add macro for checking dump ctx size")
misspelled the name of the assert as asset, missing an R.

Reported-by: Ido Schimmel <idosch@idosch.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netlink-protocol-specs'

Jakub Kicinski says:

====================
Netlink protocol specs

I think the Netlink proto specs are far along enough to merge.
Filling in all attribute types and quirks will be an ongoing
effort but we have enough to cover FOU so it's somewhat complete.

I fully intend to continue polishing the code but at the same
time I'd like to start helping others base their work on the
specs (e.g. DPLL) and need to start working on some new families
myself.

That's the progress / motivation for merging. The RFC [1] has more
of a high level blurb, plus I created a lot of documentation, I'm
not going to repeat it here. There was also the talk at LPC [2].

[1] https://lore.kernel.org/all/20220811022304.583300-1-kuba@kernel.org/
[2] https://youtu.be/9QkXIQXkaQk?t=2562
v2: https://lore.kernel.org/all/20220930023418.1346263-1-kuba@kernel.org/
v3: https://lore.kernel.org/all/20230119003613.111778-1-kuba@kernel.org/1

v4:
- spec improvements (patch 2)
- Python cleanup (patch 3)
- rename auto-gen files and use the right comment style
====================

Link: https://lore.kernel.org/r/20230120175041.342573-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tools: ynl: add a completely generic client

Add a CLI sample which can take in arbitrary request
in JSON format, convert it to Netlink and do the inverse
for output.

It's meant as a development tool primarily and perhaps
for selftests which need to tickle netlink in a special way.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: use policy and operation tables generated from the spec

Generate and plug in the spec-based tables.

A little bit of renaming is needed in the FOU code.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: rename the source for linking

We'll need to link two objects together to form the fou module.
This means the source can't be called fou, the build system expects
fou.o to be the combined object.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fou: regenerate the uAPI from the spec

Regenerate the FOU uAPI header from the YAML spec.

The flags now come before attributes which use them,
and the comments for type disappear (coders should look
at the spec instead).

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: add a proto specification for FOU

FOU has a reasonably modern Genetlink family. Add a spec.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add basic C code generators for Netlink

Code generators to turn Netlink specs into C code.
I'm definitely not proud of it.

The main generator is in Python, there's a bash script
to regen all code-gen'ed files in tree after making
spec changes.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: add schemas for YAML specs

Add schemas for Netlink spec files. As described in the docs
we have 4 "protocols" or compatibility levels, and each one
comes with its own schema, but the more general / legacy
schemas are superset of more modern ones: genetlink is
the smallest followed by genetlink-c and genetlink-legacy.
There is no schema for raw netlink, yet, I haven't found the time..

I don't know enough jsonschema to do inheritance or something
but the repetition is not too bad. I hope.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

docs: add more netlink docs (incl. spec docs)

Add documentation about the upcoming Netlink protocol specs.

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-use-the-backlog-for-nested-mirred-ingress'

Davide Caratti says:

====================
net/sched: use the backlog for nested mirred ingress

TC mirred has a protection against excessive stack growth, but that
protection doesn't really guarantee the absence of recursion, nor
it guards against loops. Patch 1/2 rewords "recursion" to "nesting" to
make this more clear.
We can leverage on this existing mechanism to prevent TCP / SCTP from doing
soft lock-up in some specific scenarios that uses mirred egress->ingress:
patch 2 changes mirred so that the networking backlog is used for nested
mirred ingress actions.
====================

Link: https://lore.kernel.org/r/cover.1674233458.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

act_mirred: use the backlog for nested calls to mirred ingress

William reports kernel soft-lockups on some OVS topologies when TC mirred
egress->ingress action is hit by local TCP traffic [1].
The same can also be reproduced with SCTP (thanks Xin for verifying), when
client and server reach themselves through mirred egress to ingress, and
one of the two peers sends a "heartbeat" packet (from within a timer).

Enqueueing to backlog proved to fix this soft lockup; however, as Cong
noticed [2], we should preserve - when possible - the current mirred
behavior that counts as "overlimits" any eventual packet drop subsequent to
the mirred forwarding action [3]. A compromise solution might use the
backlog only when tcf_mirred_act() has a nest level greater than one:
change tcf_mirred_forward() accordingly.

Also, add a kselftest that can reproduce the lockup and verifies TC mirred
ability to account for further packet drops after TC mirred egress->ingress
(when the nest level is 1).

[1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/
[3] such behavior is not guaranteed: for example, if RPS or skb RX
     timestamping is enabled on the mirred target device, the kernel
     can defer receiving the skb and return NET_RX_SUCCESS inside
     tcf_mirred_forward().

Reported-by: William Zhao <wizhao@redhat.com>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_mirred: better wording on protection against excessive stack growth

with commit e2ca070f89ec ("net: sched: protect against stack overflow in
TC act_mirred"), act_mirred protected itself against excessive stack growth
using per_cpu counter of nested calls to tcf_mirred_act(), and capping it
to MIRRED_RECURSION_LIMIT. However, such protection does not detect
recursion/loops in case the packet is enqueued to the backlog (for example,
when the mirred target device has RPS or skb timestamping enabled). Change
the wording from "recursion" to "nesting" to make it more clear to readers.

CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'fix-cpts-release-action-in-am65-cpts-driver'

Siddharth Vadapalli says:

====================
Fix CPTS release action in am65-cpts driver

Delete unreachable code in am65_cpsw_init_cpts() function, which was
Reported-by: Leon Romanovsky <leon@kernel.org>
at:
https://lore.kernel.org/r/Y8aHwSnVK9+sAb24@unreal

Remove the devm action associated with am65_cpts_release() and invoke the
function directly on the cleanup and exit paths.

v4:
https://lore.kernel.org/r/20230120044201.357950-1-s-vadapalli@ti.com/
v3:
https://lore.kernel.org/r/20230118095439.114222-1-s-vadapalli@ti.com/
v2:
https://lore.kernel.org/r/20230116044517.310461-1-s-vadapalli@ti.com/
v1:
https://lore.kernel.org/r/20230113104816.132815-1-s-vadapalli@ti.com/
====================

Link: https://lore.kernel.org/r/20230120070731.383729-1-s-vadapalli@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: ti: am65-cpsw/cpts: Fix CPTS release action

The am65_cpts_release() function is registered as a devm_action in the
am65_cpts_create() function in am65-cpts driver. When the am65-cpsw driver
invokes am65_cpts_create(), am65_cpts_release() is added in the set of devm
actions associated with the am65-cpsw driver's device.

In the event of probe failure or probe deferral, the platform_drv_probe()
function invokes dev_pm_domain_detach() which powers off the CPSW and the
CPSW's CPTS hardware, both of which share the same power domain. Since the
am65_cpts_disable() function invoked by the am65_cpts_release() function
attempts to reset the CPTS hardware by writing to its registers, the CPTS
hardware is assumed to be powered on at this point. However, the hardware
is powered off before the devm actions are executed.

Fix this by getting rid of the devm action for am65_cpts_release() and
invoking it directly on the cleanup and exit paths.

Fixes: f6bd59526ca5 ("net: ethernet: ti: introduce am654 common platform time sync driver")
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: ti: am65-cpsw: Delete unreachable error handling code

The am65_cpts_create() function returns -EOPNOTSUPP only when the config
"CONFIG_TI_K3_AM65_CPTS" is disabled. Also, in the am65_cpsw_init_cpts()
function, am65_cpts_create() can only be invoked if the config
"CONFIG_TI_K3_AM65_CPTS" is enabled. Thus, the error handling code for the
case in which the return value of am65_cpts_create() is -EOPNOTSUPP, is
unreachable. Hence delete it.

Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: microchip: run phy initialization during each link update

PHY initialization is supposed to run on every mode changes.
"lan87xx_config_aneg()" verifies every mode change using
"phy_modify_changed()" function. Earlier code had phy_modify_changed()
followed by genphy_soft_reset. But soft_reset resets all the
pre-configured register values to default state, and lost all the
initialization done. With this reason gen_phy_reset was removed.
But it need to go through init sequence each time the mode changed.
Update lan87xx_config_aneg() to invoke phy_init once successful mode
update is detected.

PHY init sequence added in lan87xx_phy_init() have slave init
commands executed every time. Update the init sequence to run
slave init only if phydev is in slave mode.

Test setup contains LAN9370 EVB connected to SAMA5D3 (Running DSA),
and issue can be reproduced by connecting link to any of the available
ports after SAMA5D3 boot-up. With this issue, port will fail to
update link state. But once the SAMA5D3 is reset with LAN9370 link in
connected state itself, on boot-up link state will be reported as UP. But
Again after some time, if link is moved to DOWN state, it will not get
reported.

Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230120104733.724701-1-rakesh.sankaranarayanan@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-microchip-add-support-for-credit-based-shaper'

Arun Ramadoss says:

====================
net: dsa: microchip: add support for credit based shaper

LAN937x switch family, KSZ9477, KSZ9567, KSZ9563 and KSZ8563 supports
the credit based shaper. But there were few difference between LAN937x and KSZ
switch like
- number of queues for LAN937x is 8 and for others it is 4.
- size of credit increment register for LAN937x is 24 and for other is 16-bit.
This patch series add the credit based shaper with common implementation for
LAN937x and KSZ swithes.
====================

Link: https://lore.kernel.org/r/20230120052135.32120-1-arun.ramadoss@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: add support for credit based shaper

KSZ9477, KSZ9567, KSZ9563, KSZ8563 and LAN937x supports Credit based
shaper. To differentiate the chip supporting cbs, tc_cbs_supported
flag is introduced in ksz_chip_data.
And KSZ series has 16bit Credit increment registers whereas LAN937x has
24bit register. The value to be programmed in the credit increment is
determined using the successive multiplication method to convert decimal
fraction to hexadecimal fraction.
For example: if idleslope is 10000 and sendslope is -90000, then
bandwidth is 10000 - (-90000) = 100000.
The 10% bandwidth of 100Mbps means 10/100 = 0.1(decimal). This value has
to be converted to hexa.
1) 0.1 * 16 = 1.6  --> fraction 0.6 Carry = 1 (MSB)
2) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
3) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
4) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
5) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9
6) 0.6 * 16 = 9.6  --> fraction 0.6 Carry = 9 (LSB)
Now 0.1(decimal) becomes 0.199999(Hex).
If it is LAN937x, 24 bit value will be programmed to Credit Inc
register, 0x199999. For others 16 bit value will be prgrammed, 0x1999.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: enable port queues for tc mqprio

LAN937x family of switches has 8 queues per port where the KSZ switches
has 4 queues per port. By default, only one queue per port is enabled.
The queues are configurable in 2, 4 or 8. This patch add 8 number of
queues for LAN937x and 4 for other switches.
In the tag_ksz.c file, prioirty of the packet is queried using the skb
buffer and the corresponding value is updated in the tag.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: avoid irqsave in skb_defer_free_flush

The spin_lock irqsave/restore API variant in skb_defer_free_flush can
be replaced with the faster spin_lock irq variant, which doesn't need
to read and restore the CPU flags.

Using the unconditional irq "disable/enable" API variant is safe,
because the skb_defer_free_flush() function is only called during
NAPI-RX processing in net_rx_action(), where it is known the IRQs
are enabled.

Expected gain is 14 cycles from avoiding reading and restoring CPU
flags in a spin_lock_irqsave/restore operation, measured via a
microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz.

Microbenchmark overhead of spin_lock+unlock:
- spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns
- spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns

We don't expect to see a measurable packet performance gain, as
skb_defer_free_flush() is called infrequently once per NIC device NAPI
bulk cycle and conditionally only if SKBs have been deferred by other
CPUs via skb_attempt_defer_free().

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Document that max_size sysctl is deprecated

v4: fix deprecated typo.

Document that max_size is deprecated due to:

commit af6d10345ca7 ("ipv6: remove max_size check inline with ipv4")

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://lore.kernel.org/r/20230120232331.1273881-1-jmaxwell37@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fix kfree_skb_list use of skb_mark_not_on_list

A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use
kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via
invoking skb_mark_not_on_list().

In this patch we choose to remove the skb_mark_not_on_list() call as it
isn't necessary. It would be possible and correct to call
skb_mark_not_on_list() only when __kfree_skb_reason() returns true,
meaning the SKB is ready to be free'ed, as it calls/check skb_unref().

This fix is needed as kfree_skb_list() is also invoked on skb_shared_info
frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can
have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(),
which takes a reference on all SKBs in the list. This implies the
invariant that all SKBs in the list must have the same refcnt, when using
kfree_skb_list().

Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: sparx5: Fix uninitialized variable in vcap_path_exist()

The "eport" variable needs to be initialized to NULL for this code to
work.

Fixes: 814e7693207f ("net: microchip: vcap api: Add a storage state to a VCAP rule")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Link: https://lore.kernel.org/r/Y8qbYAb+YSXo1DgR@kili
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: warn once if addr parameter is invalid in mdiobus_get_phy()

If mdiobus_get_phy() is called with an invalid addr parameter, then the
caller has a bug. Print a call trace to help identifying the caller.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/daec3f08-6192-ba79-f74b-5beb436cab6c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2023-01-23' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.3

First set of patches for v6.3. The most important change here is that
the old Wireless Extension user space interface is not supported on
Wi-Fi 7 devices at all. We also added a warning if anyone with modern
drivers (ie. cfg80211 and mac80211 drivers) tries to use Wireless
Extensions, everyone should switch to using nl80211 interface instead.

Static WEP support is removed, there wasn't any driver using that
anyway so there's no user impact. Otherwise it's smaller features and
fixes as usual.

Note: As mt76 had tricky conflicts due to the fixes in wireless tree,
we decided to merge wireless into wireless-next to solve them easily.
There should not be any merge problems anymore.

Major changes:

cfg80211
- remove never used static WEP support
- warn if Wireless Extention interface is used with cfg80211/mac80211 drivers
- stop supporting Wireless Extensions with Wi-Fi 7 devices
- support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting

rfkill
- add GPIO DT support

bitfield
- add FIELD_PREP_CONST()

mt76
- per-PHY LED support

rtw89
- support new Bluetooth co-existance version

rtl8xxxu
- support RTL8188EU

* tag 'wireless-next-2023-01-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (123 commits)
  wifi: wireless: deny wireless extensions on MLO-capable devices
  wifi: wireless: warn on most wireless extension usage
  wifi: mac80211: drop extra 'e' from ieeee80211... name
  wifi: cfg80211: Deduplicate certificate loading
  bitfield: add FIELD_PREP_CONST()
  wifi: mac80211: add kernel-doc for EHT structure
  mac80211: support minimal EHT rate reporting on RX
  wifi: mac80211: Add HE MU-MIMO related flags in ieee80211_bss_conf
  wifi: mac80211: Add VHT MU-MIMO related flags in ieee80211_bss_conf
  wifi: cfg80211: Use MLD address to indicate MLD STA disconnection
  wifi: cfg80211: Support 32 bytes KCK key in GTK rekey offload
  wifi: cfg80211: Fix extended KCK key length check in nl80211_set_rekey_data()
  wifi: cfg80211: remove support for static WEP
  wifi: rtl8xxxu: Dump the efuse only for untested devices
  wifi: rtl8xxxu: Print the ROM version too
  wifi: rtw88: Use non-atomic sta iterator in rtw_ra_mask_info_update()
  wifi: rtw88: Use rtw_iterate_vifs() for rtw_vif_watch_dog_iter()
  wifi: rtw88: Move register access from rtw_bf_assoc() outside the RCU
  wifi: rtl8xxxu: Use a longer retry limit of 48
  wifi: rtl8xxxu: Report the RSSI to the firmware
  ...
====================

Link: https://lore.kernel.org/r/20230123103338.330CBC433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: asix,ax88796c: allow SPI peripheral properties

The AX88796C device node on SPI bus can use SPI peripheral properties in
certain configurations:

exynos3250-artik5-eval.dtb: ethernet@0: 'controller-data' does not match any of the regexes: 'pinctrl-[0-9]+'

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20230120144329.305655-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: tcp_mmap: populate pages in send path

In commit 72653ae5303c ("selftests: net: tcp_mmap:
Use huge pages in send path") I made a change to use hugepages
for the buffer used by the client (tx path)

Today, I understood that the cause for poor zerocopy
performance was that after a mmap() for a 512KB memory
zone, kernel uses a single zeropage, mapped 128 times.

This was really the reason for poor tx path performance
in zero copy mode, because this zero page refcount is
under high pressure, especially when TCP ACK packets
are processed on another cpu.

We need either to force a COW on all the memory range,
or use MAP_POPULATE so that a zero page is not abused.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120181136.3764521-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: mux-meson-g12a: use devm_clk_get_enabled to simplify the code

Use devm_clk_get_enabled() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mdiobus: Convert to use fwnode_device_is_compatible()

Replace open coded fwnode_device_is_compatible() in the driver.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: Add and use ethnl_update_bool.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'enetc-mac-merge-prep'

Vladimir Oltean says:

====================
ENETC MAC Merge cleanup

This is a preparatory patch set for MAC Merge layer support in enetc via
ethtool. It does the following:

- consolidates a software lockstep register write procedure for the pMAC
- detects per-port frame preemption capability and only writes pMAC
registers if a pMAC exists
- stops enabling the pMAC by default

Additionally, I noticed some build warnings in the driver which are new
in this kernel version, so patch 1/6 fixes those.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: stop auto-configuring the port pMAC

The pMAC (ENETC_PFPMR_PMACE) is probably unconditionally enabled in the
enetc driver to allow RX of preemptible packets and not see them as
error frames. I don't know why TX preemption (ENETC_MMCSR_ME) is enabled
though. With no way to say which traffic classes are preemptible (all
are express by default), no preemptible frames would be transmitted
anyway.

Lastly, it may have been believed that the register write lock-step mode
(now deleted) needed the pMAC to be enabled at all times. I don't know
if that's true. However, I've checked that driver writes to PM1
registers do propagate through to the ENETC IP even when the pMAC is
disabled.

With such incomplete support for frame preemption, it's best to just
remove whatever exists right now and come with something more coherent
later.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: implement software lockstep for port MAC registers

Currently the enetc driver duplicates its writes to the PM0 registers
also to PM1, but it doesn't do this consistently - for example we write
to ENETC_PM0_MAXFRM but not to ENETC_PM1_MAXFRM.

Create enetc_port_mac_wr() which writes both the PM0 and PM1 register
with the same value (if frame preemption is supported on this port).
Also create enetc_port_mac_rd() which reads from PM0 - the assumption
being that PM1 contains just the same value.

This will be necessary when we enable the MAC Merge layer properly, and
the pMAC becomes operational.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: stop configuring pMAC in lockstep with eMAC

The MWLM bit (MAC write lock-step mode) allows register writes to the
pMAC to be auto-performed whenever the corresponding eMAC register is
written by the driver. This allows their configuration to remain
in sync.

The driver has set this bit since the initial commit, but it doesn't do
anything, since the hardware feature doesn't work (and the bit has been
removed from more recent versions of the documentation).

The driver does attempt, more or less, to keep those MAC registers in
sync by writing the same value once to e.g. ENETC_PM0_CMD_CFG (eMAC) and
once to ENETC_PM1_CMD_CFG (pMAC). Because the lockstep feature doesn't
work, that's what it will stick to.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: enetc: add definition for offset between eMAC and pMAC regs

This is a preliminary patch which replaces the hardcoded 0x1000 present
in other PM1 (port MAC 1, aka pMAC) register definitions, which is an
offset to the PM0 (port MAC 0, aka eMAC) equivalent register.
This definition will be used in more places by future code.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>