review.tizen.org Git - platform/kernel/linux-starfive.git/log

ice: reduce initial wait for control queue messages

The ice_sq_send_cmd() function is used to send messages to the control
queues used to communicate with firmware, virtual functions, and even some
hardware.

When sending a control queue message, the driver is designed to
synchronously wait for a response from the queue. Currently it waits
between checks for 100 to 150 microseconds.

Commit f86d6f9c49f6 ("ice: sleep, don't busy-wait, for
ICE_CTL_Q_SQ_CMD_TIMEOUT") did recently change the behavior from an
unnecessary delay into a sleep which is a significant improvement over the
old behavior of polling using udelay.

Because of the nature of PCIe transactions, the hardware won't be informed
about a new message until the write to the tail register posts. This is
only guaranteed to occur at the next register read. In ice_sq_send_cmd(),
this happens at the ice_sq_done() call. Because of this, the driver
essentially forces a minimum of one full wait time regardless of how fast
the response is.

For the hardware-based sideband queue, this is especially slow. It is
expected that the hardware will respond within 2 or 3 microseconds, an
order of magnitude faster than the 100-150 microsecond sleep.

Allow such fast completions to occur without delay by introducing a small 5
microsecond delay first before entering the sleeping timeout loop. Ensure
the tail write has been posted by using ice_flush(hw) first.

While at it, lets also remove the ICE_CTL_Q_SQ_CMD_USEC macro as it
obscures the sleep time in the inner loop. It was likely introduced to
avoid "magic numbers", but in practice sleep and delay values are easier to
read and understand when using actual numbers instead of a named constant.

This change should allow the fast hardware based control queue messages to
complete quickly without delay, while slower firmware queue response times
will sleep while waiting for the response.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

Merge branch 'mptcp-expose-more-info-and-small-improvements'

Matthieu Baerts says:

====================
mptcp: expose more info and small improvements

Patch 1-3/9 track and expose some aggregated data counters at the MPTCP
level: the number of retransmissions and the bytes that have been
transferred. The first patch prepares the work by moving where snd_una
is updated for fallback sockets while the last patch adds some tests to
cover the new code.

Patch 4-6/9 introduce a new getsockopt for SOL_MPTCP: MPTCP_FULL_INFO.
This new socket option allows to combine info from MPTCP_INFO,
MPTCP_TCPINFO and MPTCP_SUBFLOW_ADDRS socket options into one. It can be
needed to have all info in one because the path-manager can close and
re-create subflows between getsockopt() and fooling the accounting. The
first patch introduces a unique subflow ID to easily detect when
subflows are being re-created with the same 5-tuple while the last patch
adds some tests to cover the new code.

Please note that patch 5/9 ("mptcp: introduce MPTCP_FULL_INFO getsockopt")
can reveal a bug that were there for a bit of time, see [1]. A fix has
recently been fixed to netdev for the -net tree: "mptcp: ensure listener
is unhashed before updating the sk status", see [2]. There is no
conflicts between the two patches but it might be better to apply this
series after the one for -net and after having merged "net" into
"net-next".

Patch 7/9 is similar to commit 47867f0a7e83 ("selftests: mptcp: join:
skip check if MIB counter not supported") recently applied in the -net
tree but here it adapts the new code that is only in net-next (and it
fixes a merge conflict resolution which didn't have any impact).

Patch 8 and 9/9 are two simple refactoring. One to consolidate the
transition to TCP_CLOSE in mptcp_do_fastclose() and avoid duplicated
code. The other one reduces the scope of an argument passed to
mptcp_pm_alloc_anno_list() function.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/407
Link: https://lore.kernel.org/netdev/20230620-upstream-net-20230620-misc-fixes-for-v6-4-v1-0-f36aa5eae8b9@tessares.net/
====================

Link: https://lore.kernel.org/r/20230620-upstream-net-next-20230620-mptcp-expose-more-info-and-misc-v1-0-62b9444bfd48@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pass addr to mptcp_pm_alloc_anno_list

Pass addr parameter to mptcp_pm_alloc_anno_list() instead of entry. We
can reduce the scope, e.g. in mptcp_pm_alloc_anno_list(), we only access
"entry->addr", we can then restrict to the pointer to "addr" then.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: consolidate transition to TCP_CLOSE in mptcp_do_fastclose()

The MPTCP code always set the msk state to TCP_CLOSE before
calling performing the fast-close. Move such state transition in
mptcp_do_fastclose() to avoid some code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: skip check if MIB counter not supported (part 2)

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the MPTCP MIB counters introduced in commit fc518953bc9c
("mptcp: add and use MIB counter infrastructure") and more later. The
MPTCP Join selftest heavily relies on these counters.

If a counter is not supported by the kernel, it is not displayed when
using 'nstat -z'. We can then detect that and skip the verification. A
new helper (get_counter()) has been added recently in the -net tree to
do the required checks and return an error if the counter is not
available.

This commit is similar to the one with the same title applied in the
-net tree but it modifies code only present in net-next for the moment,
see the Fixes commit below.

While at it, we can also remove the use of ${extra_msg} variable which
is never assigned in chk_rm_tx_nr() function and use 'echo' without '-n'
parameter.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 0639fa230a21 ("selftests: mptcp: add explicit check for new mibs")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add MPTCP_FULL_INFO testcase

Add a testcase explicitly triggering the newly introduce
MPTCP_FULL_INFO getsockopt.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/388
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: introduce MPTCP_FULL_INFO getsockopt

Some user-space applications want to monitor the subflows utilization.

Dumping the per subflow tcp_info is not enough, as the PM could close
and re-create the subflows under-the-hood, fooling the accounting.
Even checking the src/dst addresses used by each subflow could not
be enough, because new subflows could re-use the same address/port of
the just closed one.

This patch introduces a new socket option, allow dumping all the relevant
information all-at-once (everything, everywhere...), in a consistent
manner.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/388
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add subflow unique id

The user-space need to properly account the data received/sent by
individual subflows. When additional subflows are created and/or
closed during the MPTCP socket lifetime, the information currently
exposed via MPTCP_TCPINFO are not enough: subflows are identified only
by the sequential position inside the info dumps, and that will change
with the above mentioned events.

To solve the above problem, this patch introduces a new subflow
identifier that is unique inside the given MPTCP socket scope.

The initial subflow get the id 1 and the other subflows get incremental
values at join time.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/388
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: explicitly tests aggregate counters

Update the existing sockopt test-case to do some basic checks
on the newly added counters.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/385
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: track some aggregate data counters

Currently there are no data transfer counters accounting for all
the subflows used by a given MPTCP socket. The user-space can compute
such figures aggregating the subflow info, but that is inaccurate
if any subflow is closed before the MPTCP socket itself.

Add the new counters in the MPTCP socket itself and expose them
via the existing diag and sockopt. While touching mptcp_diag_fill_info(),
acquire the relevant locks before fetching the msk data, to ensure
better data consistency

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/385
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: move snd_una update earlier for fallback socket

That will avoid an unneeded conditional in both the fast-path
and in the fallback case and will simplify a bit the next
patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Fix rst format issues in readme

This patch fixes a warning in the ena documentation
file identified by the kernel automatic tools.
The patch also adds a missing newline between
sections.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306171804.U7E92zoE-lkp@intel.com/
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230620133544.32584-1-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: Cleanup on charging memory for newly accepted sockets

If there is no net-memcg associated with the sock, don't bother
calculating its memory usage for charge.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lore.kernel.org/r/20230620092712.16217-1-wuyun.abel@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tc-testing: add one test for flushing explicitly created chain

Add the test for additional reference to chains that are explicitly
created by RTM_NEWCHAIN message.

The test result:

1..1
ok 1 c2b4 - soft lockup alarm will be not generated after delete the prio 0
filter of the chain

This is a follow up to commit c9a82bec02c3 ("net/sched: cls_api: Fix lockup on flushing explicitly created chain").

Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
Acked-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/r/20230620014939.2034054-1-renmingshuai@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: micrel,ks8851: allow SPI device properties

The Micrel KS8851 can be attached to SPI or parallel bus and the
difference is expressed in compatibles. Allow common SPI properties
when this is a SPI variant and narrow the parallel memory bus properties
to the second case.

This fixes dtbs_check warning:

qcom-msm8960-cdp.dtb: ethernet@0: Unevaluated properties are not allowed ('spi-max-frequency' was unexpected)

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230619170134.65395-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: bluetooth: qualcomm: document VDD_CH1

WCN3990 comes with two chains - CH0 and CH1 - where each takes VDD
regulator. It seems VDD_CH1 is optional (Linux driver does not care
about it), so document it to fix dtbs_check warnings like:

sdm850-lenovo-yoga-c630.dtb: bluetooth: 'vddch1-supply' does not match any of the regexes: 'pinctrl-[0-9]+'

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230617165716.279857-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hsr: Disable promiscuous mode in offload mode

When port-to-port forwarding for interfaces in HSR node is enabled,
disable promiscuous mode since L2 frame forward happens at the
offloaded hardware.

Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230614114710.31400-1-r-gunasekaran@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'leds-trigger-netdev-add-additional-modes'

Christian Marangi says:

====================
leds: trigger: netdev: add additional modes

This is a continue of [1]. It was decided to take a more gradual
approach to implement LEDs support for switch and phy starting with
basic support and then implementing the hw control part when we have all
the prereq done.

This should be the final part for the netdev trigger.
I added net-next tag and added netdev mailing list since I was informed
that this should be merged with netdev branch.

We collect some info around and we found a good set of modes that are
common in almost all the PHY and Switch.

These modes are:
- Modes for dedicated link speed(10, 100, 1000 mbps). Additional mode
can be added later following this example.
- Modes for half and full duplex.

The original idea was to add hw control only modes.
While the concept makes sense in practice it would results in lots of
additional code and extra check to make sure we are setting correct modes.

With the suggestion from Andrew it was pointed out that using the ethtool
APIs we can actually get the current link speed and duplex and this
effectively removed the problem of having hw control only modes since we
can fallback to software.

Since these modes are supported by software, we can skip providing an
user for this in the LED driver to support hw control for these new modes
(that will come right after this is merged) and prevent this to be another
multi subsystem series.

For link speed and duplex we use ethtool APIs.

To call ethtool APIs, rtnl lock is needed but this can be skipped on
handling netdev events as the lock is already held.

[1] https://lore.kernel.org/lkml/20230216013230.22978-1-ansuelsmth@gmail.com/
====================

Link: https://lore.kernel.org/r/20230619204700.6665-1-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

leds: trigger: netdev: expose hw_control status via sysfs

Expose hw_control status via sysfs for the netdev trigger to give
userspace better understanding of the current state of the trigger and
the LED.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Acked-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

leds: trigger: netdev: add additional specific link duplex mode

Add additional modes for specific link duplex. Use ethtool APIs to get the
current link duplex and enable the LED accordingly. Under netdev event
handler the rtnl lock is already held and is not needed to be set to
access ethtool APIs.

This is especially useful for PHY and Switch that supports LEDs hw
control for specific link duplex.

Add additional modes:
- half_duplex: Turn on LED when link is half duplex
- full_duplex: Turn on LED when link is full duplex

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

leds: trigger: netdev: add additional specific link speed mode

Add additional modes for specific link speed. Use ethtool APIs to get the
current link speed and enable the LED accordingly. Under netdev event
handler the rtnl lock is already held and is not needed to be set to
access ethtool APIs.

This is especially useful for PHY and Switch that supports LEDs hw
control for specific link speed. (example scenario a PHY that have 2 LED
connected one green and one orange where the green is turned on with
1000mbps speed and orange is turned on with 10mpbs speed)

On mode set from sysfs we check if we have enabled split link speed mode
and reject enabling generic link mode to prevent wrong and redundant
configuration.

Rework logic on the set baseline state to support these new modes to
select if we need to turn on or off the LED.

Add additional modes:
- link_10: Turn on LED when link speed is 10mbps
- link_100: Turn on LED when link speed is 100mbps
- link_1000: Turn on LED when link speed is 1000mbps

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Link representors to PCI device

Link VF representors to parent PCI device to benefit from
systemd defined naming scheme.

Without this change the representor is visible as ethN.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20230620144855.288443-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-preparations-for-out-of-order-operations-patches-in-mlxsw'

Petr Machata says:

====================
selftests: Preparations for out-of-order-operations patches in mlxsw

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

Over the course of the following several patchsets, mlxsw code is going to
be adjusted to diminish the space of wrongly offloaded configurations.
Ideally the offload state will reflect the actual state, regardless of the
sequence of operation used to construct that state.

Several selftests build configurations that will not be offloadable in the
future on some systems. The reason is that what will get offloaded is the
actual configuration, not the configuration steps.

For example, when a port is added to a bridge that has an IP address, that
bridge will get a RIF, which it would not have with the current code. But
on Nvidia Spectrum-1 machines, MAC addresses of all RIFs need to have the
same prefix, which the bridge will violate. The RIF thus couldn't be
created, and the enslavement is therefore canceled, because it would lead
to an unoffloadable configuration. This breaks some selftests.

In this patchset, adjust selftests to avoid the configurations that mlxsw
would be incapable of offloading, while maintaining relevance with regards
to the feature that is being tested. There are generally two cases of
fixes:

- Disabling IPv6 autogen on bridges that do not participate in routing,
  either because of the abovementioned requirement to keep the same MAC
  prefix on all in-HW router interfaces, or, on 802.1ad bridges, because
  in-HW router interfaces are not supported at all.

- Setting the bridge MAC address to what it will become after the first
  member port is attached, so that the in-HW router interface is created
  with a supported MAC address.

The patchset is then split thus:

- Patches #1-#7 adjust generic selftests
- Patches #8-#16 adjust mlxsw-specific selftests
====================

Link: https://lore.kernel.org/r/cover.1687265905.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: one_armed_router: Use port MAC for bridge address

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The bridge eventually inherits MAC address from its first member, after the
enslavement is acked. A number of (mainly VXLAN) selftests already work
around the problem by setting the MAC address to whatever it will
eventually be anyway. Do the same for this selftest.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: vxlan: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge (this holds
for all bridges used here), the bridge MAC address does not have the same
prefix as other interfaces in the system. On Nvidia Spectrum-1 machines all
the RIFs have to have the same 38-bit MAC address prefix. Since the bridge
does not obey this limitation, the RIF cannot be created, and the
enslavement attempt is vetoed on the grounds of the configuration not being
offloadable.

The selftest itself however checks various aspects of VXLAN offloading and
the bridges do not need to participate in routing traffic. The IP addresses
or the RIFs are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridges in this selftest, thus exempting them from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: spectrum: q_in_vni_veto: Disable IPv6 autogen on a bridge

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The selftest itself however checks vetoing of a different aspect of the
configuration and the bridge does not need to participate in routing
traffic. The IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: qos_mc_aware: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge (this holds
for both bridges used here), the bridge MAC address does not have the same
prefix as other interfaces in the system. On Nvidia Spectrum-1 machines all
the RIFs have to have the same 38-bit MAC address prefix. Since the bridge
does not obey this limitation, the RIF cannot be created, and the
enslavement attempt is vetoed on the grounds of the configuration not being
offloadable.

The selftest itself however checks traffic prioritization and scheduling,
and the bridges serve for their L2 forwarding capabilities, and do not need
to participate in routing traffic. The IP addresses or the RIFs are
irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridges in this selftest, thus exempting them from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: qos_ets_strict: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge (this holds
for both bridges used here), the bridge MAC address does not have the same
prefix as other interfaces in the system. On Nvidia Spectrum-1 machines all
the RIFs have to have the same 38-bit MAC address prefix. Since the bridge
does not obey this limitation, the RIF cannot be created, and the
enslavement attempt is vetoed on the grounds of the configuration not being
offloadable.

The selftest itself however checks traffic prioritization and scheduling,
and the bridges serve for their L2 forwarding capabilities, and do not need
to participate in routing traffic. The IP addresses or the RIFs are
irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridges in this selftest, thus exempting them from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: qos_dscp_bridge: Disable IPv6 autogen on a bridge

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The selftest itself however checks DCB DSCP-based prioritization, and the
bridge serves for its L2 forwarding capabilities, and does not need to
participate in routing traffic. The IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: mirror_gre_scale: Disable IPv6 autogen on a bridge

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The selftest itself however checks how many mirroring sessions a machine is
capable of offloading. The IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: extack: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge (this holds
for all bridges used here), the bridge MAC address does not have the same
prefix as other interfaces in the system. On Nvidia Spectrum-1 machines all
the RIFs have to have the same 38-bit MAC address prefix. Since the bridge
does not obey this limitation, the RIF cannot be created, and the
enslavement attempt is vetoed on the grounds of the configuration not being
offloadable.

The selftest itself however checks whether a different vetoed aspect of the
configuration provides an extack. The IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridges in this selftest, thus exempting them from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: q_in_q_veto: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

The swp enslavement to the 802.1ad bridge is not allowed, because RIFs are
not allowed to be created for 802.1ad bridges, but the address indicates
one needs to be created. Thus the veto selftests fail already during the
port enslavement. Then the attempt to create a VLAN on top of the same
bridge is not vetoed, because the bridge is not related to mlxsw, and the
selftest fails.

Fix by disabling automatic IPv6 address generation for the bridges in this
selftest, thus exempting them from the mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: router_bridge: Use port MAC for bridge address

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The bridge eventually inherits MAC address from its first member, after the
enslavement is acked. A number of (mainly VXLAN) selftests already work
around the problem by setting the MAC address to whatever it will
eventually be anyway. Do the same here.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: mirror_gre_*: Use port MAC for bridge address

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The bridge eventually inherits MAC address from its first member, after the
enslavement is acked. A number of (mainly VXLAN) selftests already work
around the problem by setting the MAC address to whatever it will
eventually be anyway. Do the same for several mirror_gre selftests.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: mirror_gre_*: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

These two selftests however check mirroring traffic to a gretap netdevice.
The bridge here does not participate in routing traffic and the IP address
or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridges in these selftests, thus exempting them from mlxsw router
attention. Since the bridges are only used for L2 forwarding, this change
should not hinder usefulness of this selftest for testing SW datapath or HW
datapaths in other devices.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: pedit_dsfield: Disable IPv6 autogen on a bridge

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The selftest itself however checks whether skbedit changes packet priority
as appropriate. The bridge thus does not need to participate in routing
traffic and the IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.
Since the bridge is only used for L2 forwarding, this change should not
hinder usefulness of this selftest for testing SW datapath or HW datapaths
in other devices.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: skbedit_priority: Disable IPv6 autogen on a bridge

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

At the time that the front panel port is enslaved to the bridge, the bridge
MAC address does not have the same prefix as other interfaces in the
system. On Nvidia Spectrum-1 machines all the RIFs have to have the same
38-bit MAC address prefix. Since the bridge does not obey this limitation,
the RIF cannot be created, and the enslavement attempt is vetoed on the
grounds of the configuration not being offloadable.

The selftest itself however checks operation of pedit on IPv4 and IPv6
dsfield and its parts. The bridge thus does not need to participate in
routing traffic and the IP address or the RIF are irrelevant.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.
Since the bridge is only used for L2 forwarding, this change should not
hinder usefulness of this selftest for testing SW datapath or HW datapaths
in other devices.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: dual_vxlan_bridge: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

This will cause this selftest to fail spuriously. The swp enslavement to
the 802.1ad bridge is not allowed, because RIFs are not allowed to be
created for 802.1ad bridges, but the address indicates one needs to be
created.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: q_in_vni: Disable IPv6 autogen on bridges

In a future patch, mlxsw will start adding RIFs to uppers of front panel
port netdevices, if they have an IP address.

This will cause this selftest to fail spuriously. The swp enslavement to
the 802.1ad bridge is not allowed, because RIFs are not allowed to be
created for 802.1ad bridges, but the address indicates one needs to be
created.

Fix by disabling automatic IPv6 address generation for the HW-offloaded
bridge in this selftest, thus exempting it from mlxsw router attention.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: micrel: Change to receive timestamp in the frame for lan8841

Currently for each timestamp frame, the SW needs to go and read the
received timestamp over the MDIO bus. But the HW has the capability
to store the received nanoseconds part and the least significant two
bits of the seconds in the reserved field of the PTP header. In this
way we could save few MDIO transactions (actually a little more
transactions because the access to the PTP registers are indirect)
for each received frame.

Instead of reading the rest of seconds part of the timestamp of the
frame using MDIO transactions schedule PTP worker thread to read the
seconds part every 500ms and then for each of the received frames use
this information. Because if for example running with 512 frames per
second, there is no point to read 512 times the second part.

Doing all these changes will give a great CPU usage performance.
Running ptp4l with logSyncInterval of -9 will give a ~60% CPU
improvement.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-stmmac-dwmac-qcom-ethqos-add-support-for-emac4'

Bartosz Golaszewski says:

====================
net: stmmac: dwmac-qcom-ethqos: add support for EMAC4

Extend the dwmac-qcom-ethqos driver to support EMAC4. While at it: rework the
code somewhat. The bindings have been reviewed by DT maintainers.

This is a sub-series of [1] with only the patches targetting the net subsystem
as they can go in independently.

[1] https://lore.kernel.org/lkml/20230617001644.4e093326@kernel.org/T/
====================

Link: https://lore.kernel.org/r/20230619092402.195578-1-brgl@bgdev.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add support for emac4 on sa8775p platforms

sa8775p uses EMAC version 4, add the relevant defines, rename the
has_emac3 switch to has_emac_ge_3 (has emac greater-or-equal than 3)
and add the new compatible.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: qcom,ethqos: add description for sa8775p

Add the compatible for the MAC controller on sa8775p platforms. This MAC
works with a single interrupt so add minItems to the interrupts property.
The fourth clock's name is different here so change it. Enable relevant
PHY properties. Add the relevant compatibles to the binding document for
snps,dwmac as well.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add new switch to struct plat_stmmacenet_data

On some platforms, the PCS can be integrated in the MAC so the driver
will not see any PCS link activity. Add a switch that allows the platform
drivers to let the core code know.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add support for SGMII

On sa8775p the MAC is connected to the external PHY over SGMII so add
support for it to the driver.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: prepare the driver for more PHY modes

In preparation for supporting SGMII, let's make the code a bit more
generic. Add a new callback for MAC configuration so that we can assign
a different variant of it in the future.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add support for the phyaux clock

On sa8775p, the EMAC revision is 4 and we use SGMII instead of RGMII.
There's no "rgmii" clock but there's a fourth clock under a different
name: "phyaux". Add a new field to the chip data struct that specifies
the link clock name. Default to "rgmii" for backward compatibility.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add support for the optional serdes phy

On sa8775p platforms, there's a SGMII SerDes PHY between the MAC and
external PHY that we need to enable and configure.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: remove stray space

There's an unnecessary space in the rgmii_updatel() function, remove it.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add a newline between headers

Typically we use a newline between global and local headers so add it
here as well.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: add missing include

device_get_phy_mode() is declared in linux/property.h but this header
is not included.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: use a helper variable for &pdev->dev

Shrink code and avoid line breaks by using a helper variable for
&pdev->dev.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: tweak the order of local variables

Make sure we follow the reverse-xmas tree convention.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: rename a label in probe()

The err_mem label's name is unclear. It actually should be reached on
any error after stmmac_probe_config_dt() succeeds. Name it after the
cleanup action that needs to be called before exiting.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-qcom-ethqos: shrink clock code with devres

We can use a devm action to completely drop the remove callback and use
stmmac_pltfr_remove() directly for remove. We can also drop one of the
goto labels.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: fix uninitialized variable use

The new efx_bind_neigh() function contains a broken code path when IPV6 is
disabled:

drivers/net/ethernet/sfc/tc_encap_actions.c:144:7: error: variable 'n' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
                if (encap->type & EFX_ENCAP_FLAG_IPV6) {
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/sfc/tc_encap_actions.c:184:8: note: uninitialized use occurs here
                if (!n) {
                     ^
drivers/net/ethernet/sfc/tc_encap_actions.c:144:3: note: remove the 'if' if its condition is always false
                if (encap->type & EFX_ENCAP_FLAG_IPV6) {
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/sfc/tc_encap_actions.c:141:22: note: initialize the variable 'n' to silence this warning
                struct neighbour *n;
                                   ^
                                    = NULL

Change it to use the existing error handling path here.

Fixes: 7e5e7d800011a ("sfc: neighbour lookup for TC encap action offload")
Suggested-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20230619091215.2731541-2-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: add CONFIG_INET dependency for TC offload

The driver now fails to link when CONFIG_INET is disabled, so
add an explicit Kconfig dependency:

ld.lld: error: undefined symbol: ip_route_output_flow
>>> referenced by tc_encap_actions.c
>>>               drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_flower_create_encap_md) in archive vmlinux.a

ld.lld: error: undefined symbol: ip_send_check
>>> referenced by tc_encap_actions.c
>>>               drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_gen_encap_header) in archive vmlinux.a
>>> referenced by tc_encap_actions.c
>>>               drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_gen_encap_header) in archive vmlinux.a

ld.lld: error: undefined symbol: arp_tbl
>>> referenced by tc_encap_actions.c
>>>               drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_netevent_event) in archive vmlinux.a
>>> referenced by tc_encap_actions.c
>>>               drivers/net/ethernet/sfc/tc_encap_actions.o:(efx_tc_netevent_event) in archive vmlinux.a

Fixes: a1e82162af0b8 ("sfc: generate encap headers for TC offload")
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306151656.yttECVTP-lkp@intel.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230619091215.2731541-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy-c45: Fix genphy_c45_ethtool_set_eee description

The text has been cut/paste from genphy_c45_ethtool_get_eee but not
changed to reflect it performs set.

Additionally, extend the comment. This function implements the logic
that eee_enabled has global control over EEE. When eee_enabled is
false, no link modes will be advertised, and as a result, the MAC
should not transmit LPI.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230619220332.4038924-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove sk_is_ipmr() and sk_is_icmpv6() helpers

Blamed commit added these helpers for sake of detecting RAW
sockets specific ioctl.

syzbot complained about it [1].

Issue here is that RAW sockets could pretend there was no need
to call ipmr_sk_ioctl()

Regardless of inet_sk(sk)->inet_num, we must be prepared
for ipmr_ioctl() being called later. This must happen
from ipmr_sk_ioctl() context only.

We could add a safety check in ipmr_ioctl() at the risk of breaking
applications.

Instead, remove sk_is_ipmr() and sk_is_icmpv6() because their
name would be misleading, once we change their implementation.

[1]
BUG: KASAN: stack-out-of-bounds in ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
Read of size 4 at addr ffffc90003aefae4 by task syz-executor105/5004

CPU: 0 PID: 5004 Comm: syz-executor105 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
raw_ioctl+0x4e/0x1e0 net/ipv4/raw.c:881
sock_ioctl_out net/core/sock.c:4186 [inline]
sk_ioctl+0x151/0x440 net/core/sock.c:4214
inet_ioctl+0x18c/0x380 net/ipv4/af_inet.c:1001
sock_do_ioctl+0xcc/0x230 net/socket.c:1189
sock_ioctl+0x1f8/0x680 net/socket.c:1306
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2944bf6ad9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd8897a028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2944bf6ad9
RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003
RBP: 00007f2944bbac80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2944bbad10
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>

The buggy address belongs to stack of task syz-executor105/5004
and is located at offset 36 in frame:
sk_ioctl+0x0/0x440 net/core/sock.c:4172

This frame has 2 objects:
[32, 36) 'karg'
[48, 88) 'buffer'

Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230619124336.651528-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix a typo in ip6mr_sk_ioctl()

SIOCGETSGCNT_IN6 uses a "struct sioc_sg_req6 buffer".

Unfortunately the blamed commit made hard to ensure type safety.

syzbot reported:

BUG: KASAN: stack-out-of-bounds in ip6mr_ioctl+0xba3/0xcb0 net/ipv6/ip6mr.c:1917
Read of size 16 at addr ffffc900039afb68 by task syz-executor937/5008

CPU: 1 PID: 5008 Comm: syz-executor937 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
ip6mr_ioctl+0xba3/0xcb0 net/ipv6/ip6mr.c:1917
rawv6_ioctl+0x4e/0x1e0 net/ipv6/raw.c:1143
sock_ioctl_out net/core/sock.c:4186 [inline]
sk_ioctl+0x151/0x440 net/core/sock.c:4214
inet6_ioctl+0x1b8/0x290 net/ipv6/af_inet6.c:582
sock_do_ioctl+0xcc/0x230 net/socket.c:1189
sock_ioctl+0x1f8/0x680 net/socket.c:1306
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f255849bad9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd06792778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f255849bad9
RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003
RBP: 00007f255845fc80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f255845fd10
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>

The buggy address belongs to stack of task syz-executor937/5008
and is located at offset 40 in frame:
sk_ioctl+0x0/0x440 net/core/sock.c:4172

This frame has 2 objects:
[32, 36) 'karg'
[48, 88) 'buffer'

Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230619072740.464528-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: TC flower offload support for rxqueue mapping

TC rule support to offload rx queue mapping rules.

Eg:
   tc filter add dev eth2 ingress protocol ip flower \
      dst_ip 192.168.8.100  \
      action skbedit queue_mapping 4 skip_sw
      action mirred ingress redirect dev eth5

Packets destined to 192.168.8.100 will be forwarded to rx
queue 4 of eth5 interface.

   tc filter add dev eth2 ingress protocol ip flower \
      dst_ip 192.168.8.100  \
      action skbedit queue_mapping 9 skip_sw

Packets destined to 192.168.8.100 will be forwarded to rx
queue 4 of eth2 interface.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://lore.kernel.org/r/20230619060638.1032304-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlabel: Reorder fields in 'struct netlbl_domaddr6_map'

Group some variables based on their sizes to reduce hole and avoid padding.
On x86_64, this shrinks the size of 'struct netlbl_domaddr6_map'
from 72 to 64 bytes.

It saves a few bytes of memory and is more cache-line friendly.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/aa109847260e51e174c823b6d1441f75be370f01.1687083361.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: Reorder fields in 'struct mptcp_pm_add_entry'

Group some variables based on their sizes to reduce hole and avoid padding.
On x86_64, this shrinks the size of 'struct mptcp_pm_add_entry'
from 136 to 128 bytes.

It saves a few bytes of memory and is more cache-line friendly.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/e47b71de54fd3e580544be56fc1bb2985c77b0f4.1687081558.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mctp: Reorder fields in 'struct mctp_route'

Group some variables based on their sizes to reduce hole and avoid padding.
On x86_64, this shrinks the size of 'struct mctp_route'
from 72 to 64 bytes.

It saves a few bytes of memory and is more cache-line friendly.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/393ad1a5aef0aa28d839eeb3d7477da0e0eeb0b0.1687080803.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fec: allow to build without PAGE_POOL_STATS

Commit 6970ef27ff7f ("net: fec: add xdp and page pool statistics") selected
CONFIG_PAGE_POOL_STATS from the FEC driver symbol, making it impossible
to build without the page pool statistics when this driver is enabled. The
help text of those statistics mentions increased overhead. Allow the user
to choose between usefulness of the statistics and the added overhead.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230616191832.2944130-1-l.stach@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: af_alg/hash: Fix recvmsg() after sendmsg(MSG_MORE)

If an AF_ALG socket bound to a hashing algorithm is sent a zero-length
message with MSG_MORE set and then recvmsg() is called without first
sending another message without MSG_MORE set to end the operation, an oops
will occur because the crypto context and result doesn't now get set up in
advance because hash_sendmsg() now defers that as long as possible in the
hope that it can use crypto_ahash_digest() - and then because the message
is zero-length, it the data wrangling loop is skipped.

Fix this by handling zero-length sends at the top of the hash_sendmsg()
function. If we're not continuing the previous sendmsg(), then just ignore
the send (hash_recvmsg() will invent something when called); if we are
continuing, then we finalise the request at this point if MSG_MORE is not
set to get any error here, otherwise the send is of no effect and can be
ignored.

Whilst we're at it, remove the code to create a kvmalloc'd scatterlist if
we get more than ALG_MAX_PAGES - this shouldn't happen.

Fixes: c662b043cdca ("crypto: af_alg/hash: Support MSG_SPLICE_PAGES")
Reported-by: syzbot+13a08c0bf4d212766c3c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000b928f705fdeb873a@google.com/
Reported-by: syzbot+14234ccf6d0ef629ec1a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000c047db05fdeb8790@google.com/
Reported-by: syzbot+4e2e47f32607d0f72d43@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000bcca3205fdeb87fb@google.com/
Reported-by: syzbot+472626bb5e7c59fb768f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000b55d8805fdeb8385@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-and-tested-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/427646.1686913832@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mediatek: fix compile-test dependencies

The new phy driver attempts to select a driver from another subsystem,
but that fails when the NVMEM subsystem is disabled:

WARNING: unmet direct dependencies detected for NVMEM_MTK_EFUSE
  Depends on [n]: NVMEM [=n] && (ARCH_MEDIATEK [=n] || COMPILE_TEST [=y]) && HAS_IOMEM [=y]
  Selected by [y]:
  - MEDIATEK_GE_SOC_PHY [=y] && NETDEVICES [=y] && PHYLIB [=y] && (ARM64 && ARCH_MEDIATEK [=n] || COMPILE_TEST [=y])

I could not see an actual compile time dependency, so presumably this
is only needed for for working correctly but not technically a dependency
on that particular nvmem driver implementation, so it would likely
be safe to remove the select for compile testing.

To keep the spirit of the original 'select', just replace this with a
'depends on' that ensures that the driver will work but does not get in
the way of build testing.

Fixes: 98c485eaf509b ("net: phy: add driver for MediaTek SoC built-in GE PHYs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/20230616093009.3511692-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ptp-adjphase-cleanups'

Rahul Rameshbabu says:

====================
ptp .adjphase cleanups

The goal of this patch series is to improve documentation of .adjphase, add
a new callback .getmaxphase to enable advertising the max phase offset a
device PHC can support, and support invoking .adjphase from the testptp
kselftest.

Changes:
  v2->v1:
    * Removes arbitrary rule that the PHC servo must restore the frequency
      to the value used in the last .adjfine call if any other PHC
      operation is used after a .adjphase operation.
    * Removes a macro introduced in v1 for adding PTP sysfs device
      attribute nodes using a callback for populating the data.

Link: https://lore.kernel.org/netdev/20230120160609.19160723@kernel.org/
Link: https://lore.kernel.org/netdev/20230510205306.136766-1-rrameshbabu@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: Add .getmaxphase ptp_clock_info callback

Add a function that advertises a maximum offset of zero supported by
ptp_clock_info .adjphase in the OCP null ptp implementation.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: idt82p33: Add .getmaxphase ptp_clock_info callback

Advertise the maximum offset the .adjphase callback is capable of
supporting in nanoseconds for IDT devices.

Refactor the negation of the offset stored in the register to be after the
boundary check of the offset value rather than before. Boundary check based
on the intended value rather than its device-specific representation.
Depend on ptp_clock_adjtime for handling out-of-range offsets.
ptp_clock_adjtime returns -ERANGE instead of clamping out-of-range offsets.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Min Li <min.li.xe@renesas.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ptp_clockmatrix: Add .getmaxphase ptp_clock_info callback

Advertise the maximum offset the .adjphase callback is capable of
supporting in nanoseconds for IDT ClockMatrix devices. Depend on
ptp_clock_adjtime for handling out-of-range offsets. ptp_clock_adjtime
returns -ERANGE instead of clamping out-of-range offsets.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Vincent Cheng <vincent.cheng.xh@renesas.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Add .getmaxphase ptp_clock_info callback

Implement .getmaxphase callback of ptp_clock_info in mlx5 driver. No longer
do a range check in .adjphase callback implementation. Handled by the ptp
stack.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: Add .getmaxphase callback to ptp_clock_info

Enables advertisement of the maximum offset supported by the phase control
functionality of PHCs. The callback is used to return an error if an offset
not supported by the PHC is used in ADJ_OFFSET. The ioctls
PTP_CLOCK_GETCAPS and PTP_CLOCK_GETCAPS2 now advertise the maximum offset a
PHC's phase control functionality is capable of supporting. Introduce new
sysfs node, max_phase_adjustment.

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

testptp: Add support for testing ptp_clock_info .adjphase callback

Invoke clock_adjtime syscall with tx.modes set with ADJ_OFFSET when testptp
is invoked with a phase adjustment offset value. Support seconds and
nanoseconds for the offset value.

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

testptp: Remove magic numbers related to nanosecond to second conversion

Use existing NSEC_PER_SEC declaration in place of hardcoded magic numbers.

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: ptp.rst: Add information about NVIDIA Mellanox devices

The mlx5_core driver has implemented ptp clock driver functionality but
lacked documentation about the PTP devices. This patch adds information
about the Mellanox device family.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: Clarify ptp_clock_info .adjphase expects an internal servo to be used

.adjphase expects a PHC to use an internal servo algorithm to correct the
provided phase offset target in the callback. Implementation of the
internal servo algorithm are defined by the individual devices.

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-random-cleanup-for-extension-header'

Kuniyuki Iwashima says:

====================
ipv6: Random cleanup for Extension Header.

This series (1) cleans up pskb_may_pull() in some functions, where needed
data are already pulled by their caller, (2) removes redundant multicast
test, and (3) optimises reload timing of the header.
====================

Link: https://lore.kernel.org/r/20230614230107.22301-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: Remove redundant skb_headlen() check in ip6_parse_tlv().

ipv6_destopt_rcv() and ipv6_parse_hopopts() pulls these data

- Hop-by-Hop/Destination Options Header : 8
- Hdr Ext Len : skb_transport_header(skb)[1] << 3

and calls ip6_parse_tlv(), so it need not check if skb_headlen() is less
than skb_transport_offset(skb) + (skb_transport_header(skb)[1] << 3).

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: Reload hdr only when needed in ipv6_srh_rcv().

We need not reload hdr in ipv6_srh_rcv() unless we call
pskb_expand_head().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: Replace pskb_pull() with skb_pull() in ipv6_srh_rcv().

ipv6_rthdr_rcv() pulls these data

- Segment Routing Header : 8
- Hdr Ext Len : skb_transport_header(skb)[1] << 3

needed by ipv6_srh_rcv(), so pskb_pull() in ipv6_srh_rcv() never
fails and can be replaced with skb_pull().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: rpl: Remove redundant multicast tests in ipv6_rpl_srh_rcv().

ipv6_rpl_srh_rcv() checks if ipv6_hdr(skb)->daddr or ohdr->rpl_segaddr[i]
is the multicast address with ipv6_addr_type().

We have the same check for ipv6_hdr(skb)->daddr in ipv6_rthdr_rcv(), so we
need not recheck it in ipv6_rpl_srh_rcv().

Also, we should use ipv6_addr_is_multicast() for ohdr->rpl_segaddr[i]
instead of ipv6_addr_type().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: rpl: Remove pskb(_may)?_pull() in ipv6_rpl_srh_rcv().

As Eric Dumazet pointed out [0], ipv6_rthdr_rcv() pulls these data

  - Segment Routing Header : 8
  - Hdr Ext Len            : skb_transport_header(skb)[1] << 3

needed by ipv6_rpl_srh_rcv().  We can remove pskb_may_pull() and
replace pskb_pull() with skb_pull() in ipv6_rpl_srh_rcv().

Link: https://lore.kernel.org/netdev/CANn89iLboLwLrHXeHJucAqBkEL_S0rJFog68t7wwwXO-aNf5Mg@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mlx5-updates-2023-06-16' of git://git./linux/kernel/git/saeed/linux

mlx5-updates-2023-06-16

1) Added a new event handler to firmware sync reset, which is used to
   support firmware sync reset flow on smart NIC. Adding this new stage to
   the flow enables the firmware to ensure host PFs unload before ECPFs
   unload, to avoid race of PFs recovery.

2) Debugfs for mlx5 eswitch bridge offloads

3) Added two new counters for vport stats

4) Minor Fixups and cleanups for net-next branch

Signed-off-by: David S. Miller <davem@davemloft.net>

gro: move the tc_ext comparison to a helper

The double ifdefs (one for the variable declaration and
one around the code) are quite aesthetically displeasing.
Factor this code out into a helper for easier wrapping.

This will become even more ugly when another skb ext
comparison is added in the future.

The resulting machine code looks the same, the compiler
seems to try to use %rax more and some blocks more around
but I haven't spotted minor differences.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: at803x: Use devm_regulator_get_enable_optional()

Use devm_regulator_get_enable_optional() instead of hand writing it. It
saves some line of code.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: phy: gpy2xx: more precise description

Mention that the interrupt line is just asserted for a random period of
time, not the entire time.

Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Walle <mwalle@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: also use netdev_hold() in ip6_route_check_nh()

In blamed commit, we missed the fact that ip6_validate_gw()
could change dev under us from ip6_route_check_nh()

In this fix, I use GFP_ATOMIC in order to not pass too many additional
arguments to ip6_validate_gw() and ip6_route_check_nh() only
for a rarely used debug feature.

syzbot reported:

refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 5006 at lib/refcount.c:31 refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 5006 Comm: syz-executor403 Not tainted 6.4.0-rc5-syzkaller-01229-g97c5209b3d37 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
RIP: 0010:refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
Code: 05 fb 8e 51 0a 01 e8 98 95 38 fd 0f 0b e9 d3 fe ff ff e8 ac d9 70 fd 48 c7 c7 00 d3 a6 8a c6 05 d8 8e 51 0a 01 e8 79 95 38 fd <0f> 0b e9 b4 fe ff ff 48 89 ef e8 1a d7 c3 fd e9 5c fe ff ff 0f 1f
RSP: 0018:ffffc900039df6b8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888026d71dc0 RSI: ffffffff814c03b7 RDI: 0000000000000001
RBP: ffff888146a505fc R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 1ffff9200073bedc
R13: 00000000ffffffef R14: ffff888146a505fc R15: ffff8880284eb5a8
FS: 0000555556c88300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004585c0 CR3: 000000002b1b1000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:344 [inline]
refcount_dec include/linux/refcount.h:359 [inline]
ref_tracker_free+0x539/0x820 lib/ref_tracker.c:236
netdev_tracker_free include/linux/netdevice.h:4097 [inline]
netdev_put include/linux/netdevice.h:4114 [inline]
netdev_put include/linux/netdevice.h:4110 [inline]
fib6_nh_init+0xb96/0x1bd0 net/ipv6/route.c:3624
ip6_route_info_create+0x10f3/0x1980 net/ipv6/route.c:3791
ip6_route_add+0x28/0x150 net/ipv6/route.c:3835
ipv6_route_ioctl+0x3fc/0x570 net/ipv6/route.c:4459
inet6_ioctl+0x246/0x290 net/ipv6/af_inet6.c:569
sock_do_ioctl+0xcc/0x230 net/socket.c:1189
sock_ioctl+0x1f8/0x680 net/socket.c:1306
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]

Fixes: 70f7457ad6d6 ("net: create device lookup API with reference tracking")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

crypto: Fix af_alg_sendmsg(MSG_SPLICE_PAGES) sglist limit

When af_alg_sendmsg() calls extract_iter_to_sg(), it passes MAX_SGL_ENTS as
the maximum number of elements that may be written to, but some of the
elements may already have been used (as recorded in sgl->cur), so
extract_iter_to_sg() may end up overrunning the scatterlist.

Fix this to limit the number of elements to "MAX_SGL_ENTS - sgl->cur".

Note: It probably makes sense in future to alter the behaviour of
extract_iter_to_sg() to stop if "sgtable->nents >= sg_max" instead, but
this is a smaller fix for now.

The bug causes errors looking something like:

BUG: KASAN: slab-out-of-bounds in sg_assign_page include/linux/scatterlist.h:109 [inline]
BUG: KASAN: slab-out-of-bounds in sg_set_page include/linux/scatterlist.h:139 [inline]
BUG: KASAN: slab-out-of-bounds in extract_bvec_to_sg lib/scatterlist.c:1183 [inline]
BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg lib/scatterlist.c:1352 [inline]
BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg+0x17a6/0x1960 lib/scatterlist.c:1339

Fixes: bf63e250c4b1 ("crypto: af_alg: Support MSG_SPLICE_PAGES")
Reported-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000b2585a05fdeb8379@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: netdev@vger.kernel.org
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Use per-vma locking for receive zerocopy

Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.

Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.

Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.

However, with per-vma locking, both of these problems can be avoided.

As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.

When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: report devlink_port_type_warn source device

devlink_port_type_warn is scheduled for port devlink and warning
when the port type is not set. But from this warning it is not easy
found out which device (driver) has no devlink port set.

[ 3709.975552] Type was not set for devlink port.
[ 3709.975579] WARNING: CPU: 1 PID: 13092 at net/devlink/leftover.c:6775 devlink_port_type_warn+0x11/0x20
[ 3709.993967] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs vhost_net vhost vhost_iotlb tap tun bridge stp llc qrtr intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal mlx5_ib intel_powerclamp coretemp dell_wmi ledtrig_audio sparse_keymap ipmi_ssif kvm_intel ib_uverbs rfkill ib_core video kvm iTCO_wdt acpi_ipmi intel_vsec irqbypass ipmi_si iTCO_vendor_support dcdbas ipmi_devintf mei_me ipmi_msghandler rapl mei intel_cstate isst_if_mmio isst_if_mbox_pci dell_smbios intel_uncore isst_if_common i2c_i801 dell_wmi_descriptor wmi_bmof i2c_smbus intel_pch_thermal pcspkr acpi_power_meter xfs libcrc32c sd_mod sg nvme_tcp mgag200 i2c_algo_bit nvme_fabrics drm_shmem_helper drm_kms_helper nvme syscopyarea ahci sysfillrect sysimgblt nvme_core fb_sys_fops crct10dif_pclmul libahci mlx5_core sfc crc32_pclmul nvme_common drm
[ 3709.994030]  crc32c_intel mtd t10_pi mlxfw libata tg3 mdio megaraid_sas psample ghash_clmulni_intel pci_hyperv_intf wmi dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse
[ 3710.108431] CPU: 1 PID: 13092 Comm: kworker/1:1 Kdump: loaded Not tainted 5.14.0-319.el9.x86_64 #1
[ 3710.108435] Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022
[ 3710.108437] Workqueue: events devlink_port_type_warn
[ 3710.108440] RIP: 0010:devlink_port_type_warn+0x11/0x20
[ 3710.108443] Code: 84 76 fe ff ff 48 c7 03 20 0e 1a ad 31 c0 e9 96 fd ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 48 c7 c7 18 24 4e ad e8 ef 71 62 ff <0f> 0b c3 cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f6 87
[ 3710.108445] RSP: 0018:ff3b6d2e8b3c7e90 EFLAGS: 00010282
[ 3710.108447] RAX: 0000000000000000 RBX: ff366d6580127080 RCX: 0000000000000027
[ 3710.108448] RDX: 0000000000000027 RSI: 00000000ffff86de RDI: ff366d753f41f8c8
[ 3710.108449] RBP: ff366d658ff5a0c0 R08: ff366d753f41f8c0 R09: ff3b6d2e8b3c7e18
[ 3710.108450] R10: 0000000000000001 R11: 0000000000000023 R12: ff366d753f430600
[ 3710.108451] R13: ff366d753f436900 R14: 0000000000000000 R15: ff366d753f436905
[ 3710.108452] FS:  0000000000000000(0000) GS:ff366d753f400000(0000) knlGS:0000000000000000
[ 3710.108453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3710.108454] CR2: 00007f1c57bc74e0 CR3: 000000111d26a001 CR4: 0000000000773ee0
[ 3710.108456] PKRU: 55555554
[ 3710.108457] Call Trace:
[ 3710.108458]  <TASK>
[ 3710.108459]  process_one_work+0x1e2/0x3b0
[ 3710.108466]  ? rescuer_thread+0x390/0x390
[ 3710.108468]  worker_thread+0x50/0x3a0
[ 3710.108471]  ? rescuer_thread+0x390/0x390
[ 3710.108473]  kthread+0xdd/0x100
[ 3710.108477]  ? kthread_complete_and_exit+0x20/0x20
[ 3710.108479]  ret_from_fork+0x1f/0x30
[ 3710.108485]  </TASK>
[ 3710.108486] ---[ end trace 1b4b23cd0c65d6a0 ]---

After patch:
[  402.473064] ice 0000:41:00.0: Type was not set for devlink port.
[  402.473064] ice 0000:41:00.1: Type was not set for devlink port.

Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230615095447.8259-1-poros@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mctp: remove redundant RTN_UNICAST check

Current mctp_newroute() contains two exactly same check against
rtm->rtm_type

static int mctp_newroute(...)
{
...
    if (rtm->rtm_type != RTN_UNICAST) { // (1)
        NL_SET_ERR_MSG(extack, "rtm_type must be RTN_UNICAST");
        return -EINVAL;
    }
...
    if (rtm->rtm_type != RTN_UNICAST) // (2)
        return -EINVAL;
...
}

This commits removes the (2) check as it is redundant.

Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20230615152240.1749428-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: fixup openvswitch specs for code generation

Refine the ovs_* specs to align exactly with the ovs netlink UAPI
definitions to enable code generation.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20230615151405.77649-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: Remove unused qdisc_l2t()

This is unused since switch to psched_l2t_ns().

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230615124810.34020-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

kcm: Fix unnecessary psock unreservation.

kcm_write_msgs() calls unreserve_psock() to release its hold on the
underlying TCP socket if it has run out of things to transmit, but if we
have nothing in the write queue on entry (e.g. because someone did a
zero-length sendmsg), we don't actually go into the transmission loop and
as a consequence don't call reserve_psock().

Fix this by skipping the call to unreserve_psock() if we didn't reserve a
psock.

Fixes: c31a25e1db48 ("kcm: Send multiple frags in one sendmsg()")
Reported-by: syzbot+dd1339599f1840e4cc65@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000a61ffe05fe0c3d08@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+dd1339599f1840e4cc65@syzkaller.appspotmail.com
cc: Tom Herbert <tom@herbertland.com>
cc: Tom Herbert <tom@quantonium.net>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/20787.1686828722@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Remove unused ecpu field from struct mlx5_sf_table

"ecpu" field in struct mlx5_sf_table is not used anywhere. Remove it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add header file for events

Separate the event API defined in the generic mlx5.h header into
a dedicated header. And remove the TODO comment in commit
69c1280b1f3b ("net/mlx5: Device events, Use async events chain").

Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, update query of HCA caps for EC VFs

This change is needed to use EC VFs with metadata based steering.

There was an assumption that vport was equal to function ID. That's
not the case for EC VF functions. Adjust to function ID and set the
ec_vf_function bit accordingly.

Fixes: 9ac0b128248e ("net/mlx5: Update vport caps query/set for EC VFs")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Fix the macro for accessing EC VF vports

The last value is not set correctly. This results in representors not
being created for all EC VFs when the base value is higher than 0.

Fixes: a7719b29a821 ("net/mlx5: Add management of EC VF vports")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>