review.tizen.org Git - platform/kernel/linux-starfive.git/log

Merge branch 'mlx4-better-big-tcp-support'

Eric Dumazet says:

====================
mlx4: better BIG-TCP support

mlx4 uses a bounce buffer in TX whenever the tx descriptors
wrap around the right edge of the ring.

Size of this bounce buffer was hard coded and can be
increased if/when needed.
====================

Link: https://lore.kernel.org/r/20221207141237.2575012-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4: small optimization in mlx4_en_xmit()

Test against MLX4_MAX_DESC_TXBBS only matters if the TX
bounce buffer is going to be used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS

Google production kernel has increased MAX_SKB_FRAGS to 45
for BIG-TCP rollout.

Unfortunately mlx4 TX bounce buffer is not big enough whenever
an skb has up to 45 page fragments.

This can happen often with TCP TX zero copy, as one frag usually
holds 4096 bytes of payload (order-0 page).

Tested:
Kernel built with MAX_SKB_FRAGS=45
ip link set dev eth0 gso_max_size 185000
netperf -t TCP_SENDFILE

I made sure that "ethtool -G eth0 tx 64" was properly working,
ring->full_size being set to 15.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wei Wang <weiwan@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4: rename two constants

MAX_DESC_SIZE is really the size of the bounce buffer used
when reaching the right side of TX ring buffer.

MAX_DESC_TXBBS get a MLX4_ prefix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-Support-tc-police-jump-conform-exceed-attribute'

Saeed Mahameed says:

====================
Support tc police jump conform-exceed attribute

The tc police action conform-exceed option defines how to handle
packets which exceed or conform to the configured bandwidth limit.
One of the possible conform-exceed values is jump, which skips over
a specified number of actions.
This series adds support for conform-exceed jump action.

The series adds platform support for branching actions by providing
true/false flow attributes to the branching action.
This is necessary for supporting police jump, as each branch may
execute a different action list.

The first five patches are preparation patches:
- Patches 1 and 2 add support for actions with no destinations (e.g. drop)
- Patch 3 refactor the code for subsequent function reuse
- Patch 4 defines an abstract way for identifying terminating actions
- Patch 5 updates action list validations logic considering branching actions

The following three patches introduce an interface for abstracting branching
actions:
- Patch 6 introduces an abstract api for defining branching actions
- Patch 7 generically instantiates the branching flow attributes using
  the abstract API

Patch 8 adds the platform support for jump actions, by executing the following
sequence:
  a. Store the jumping flow attr
  b. Identify the jump target action while iterating the actions list.
  c. Instantiate a new flow attribute after the jump target action.
     This is the flow attribute that the branching action should jump to.
  d. Set the target post action id on:
    d.1. The jumping attribute, thus realizing the jump functionality.
    d.2. The attribute preceding the target jump attr, if not terminating.

The next patches apply the platform's branching attributes to the police
action:
- Patch 9 is a refactor patch
- Patch 10 initializes the post meter table with the red/green flow attributes,
           as were initialized by the platform
- Patch 11 enables the offload of meter actions using jump conform-exceed
           value.
====================

Link: https://lore.kernel.org/all/20221203221337.29267-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, allow meter jump control action

Separate the matchall police action validation from flower validation.
Isolate the action validation logic in the police action parser.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-12-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, init post meter rules with branching attributes

Instantiate the post meter actions with the platform initialized branching
action attributes.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-11-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, rename post_meter actions

Currently post meter supports only the pipe/drop conform-exceed policy.
This assumption is reflected in several variable names.
Rename the following variables as a pre-step for using the generalized
branching action platform.

Rename fwd_green_rule/drop_red_rule to green_rule/red_rule respectively.
Repurpose red_counter/green_counter to act_counter/drop_counter to allow
police conform-exceed configurations that do not drop.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-10-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, initialize branching action with target attr

Identify the jump target action when iterating the action list.
Initialize the jump target attr with the jumping attribute during the
parsing phase. Initialize the jumping attr post action with the target
during the offload phase.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-9-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, initialize branch flow attributes

Initialize flow attribute for drop, accept, pipe and jump branching actions.

Instantiate a flow attribute instance according to the specified branch
control action. Store the branching attributes on the branching action
flow attribute during the parsing phase. Then, during the offload phase,
allocate the relevant mod header objects to the branching actions.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-8-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, set control params for branching actions

Extend the act tc api to set the branch control params aligning with
the police conform/exceed use case.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-7-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, validate action list per attribute

Currently the entire flow action list is validate for offload limitations.
For example, flow with both forward and drop actions are declared invalid
due to hardware restrictions.
However, a multi-table hardware model changes the limitations from a flow
scope to a single flow attribute scope.

Apply offload limitations to flow attributes instead of the entire flow.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-6-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, add terminating actions

Extend act api to identify actions that terminate action list.
Pre-step for terminating branching actions.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, reuse flow attribute post parser processing

After the tc action parsing phase the flow attribute is initialized with
relevant eswitch offload objects such as tunnel, vlan, header modify and
counter attributes. The post processing is done both for fdb and post-action
attributes.

Reuse the flow attribute post parsing logic by both fdb and post-action
offloads.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-4-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: fs, assert null dest pointer when dest_num is 0

Currently create_flow_handle() assumes a null dest pointer when there
are no destinations.
This might not be the case as the caller may pass an allocated dest
array while setting the dest_num parameter to 0.

Assert null dest array for flow rules that have no destinations (e.g. drop
rule).

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: E-Switch, handle flow attribute with no destinations

Rules with drop action are not required to have a destination.
Currently the destination list is allocated with the maximum number of
destinations and passed to the fs_core layer along with the actual number
of destinations.

Remove redundant passing of dest pointer when count of dest is 0.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221203221337.29267-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: add stats64 support for ksz8 series of switches

Add stats64 support for ksz8xxx series of switches.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20221205052904.2834962-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ethernet-ti-am65-cpsw-fix-set-channel-operation'

Roger Quadros says:

====================
net: ethernet: ti: am65-cpsw: Fix set channel operation

This contains a critical bug fix for the recently merged suspend/resume
support [1] that broke set channel operation. (ethtool -L eth0 tx <n>)

As there were 2 dependent patches on top of the offending commit [1]
first revert them and then apply them back after the correct fix.

[1] fd23df72f2be ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
====================

Link: https://lore.kernel.org/r/20221206094419.19478-1-rogerq@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: Fix hardware switch mode on suspend/resume

On low power during system suspend the ALE table context is lost.
Save the ALE context before suspend and restore it after resume.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: retain PORT_VLAN_REG after suspend/resume

During suspend resume the context of PORT_VLAN_REG is lost so
save it during suspend and restore it during resume for
host port and slave ports.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: Add suspend/resume support

Add PM handlers for System suspend/resume.

As DMA driver doesn't yet support suspend/resume we free up
the DMA channels at suspend and acquire and initialize them
at resume.

In this revised approach we do not free the TX/RX IRQs at
am65_cpsw_nuss_common_stop() as it causes problems.
We will now free them only on .suspend() as we need to release
the DMA channels (as DMA looses context) and re-acquiring
them on .resume() may not necessarily give us the same
IRQs.

To make this easier:
- introduce am65_cpsw_nuss_remove_rx_chns() which is
   similar to am65_cpsw_nuss_remove_tx_chns(). These will
   be invoked in pm.suspend() to release the DMA channels
   and free up the IRQs.
- move napi_add() and request_irq() calls to
   am65_cpsw_nuss_init_rx/tx_chns() so we can invoke them
   in pm.resume() to acquire the DMA channels and IRQs.

As CPTS looses contect during suspend/resume, invoke the
necessary CPTS suspend/resume helpers.

ALE_CLEAR command is issued in cpsw_ale_start() so no need
to issue it before the call to cpsw_ale_start().

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net: ethernet: ti: am65-cpsw: Add suspend/resume support"

This reverts commit fd23df72f2be317d38d9fde0a8996b8e7454fd2a.

This commit broke set channel operation. Revert this and
implement it with a different approach in a separate patch.

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net: ethernet: ti: am65-cpsw: retain PORT_VLAN_REG after suspend/resume"

This reverts commit 643cf0e3ab5ccee37b3c53c018bd476c45c4b70e.

This is to make it easier to revert the offending commit
fd23df72f2be ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net: ethernet: ti: am65-cpsw: Fix hardware switch mode on suspend/resume"

This reverts commit 1af3cb3702d02167926a2bd18580cecb2d64fd94.

This is to make it easier to revert the offending commit
fd23df72f2be ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")

Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'devlink-add-port-function-attribute-to-enable-disable-roce-and-migratable'

Shay Drory says:

====================
devlink: Add port function attribute to enable/disable Roce and migratable

This series is a complete rewrite of the series "devlink: Add port
function attribute to enable/disable roce"
link:
https://lore.kernel.org/netdev/20221102163954.279266-1-danielj@nvidia.com/

Currently mlx5 PCI VF and SF are enabled by default for RoCE
functionality. And mlx5 PCI VF is disable by dafault for migratable
functionality.

Currently a user does not have the ability to disable RoCE for a PCI
VF/SF device before such device is enumerated by the driver.

User is also incapable to do such setting from smartnic scenario for a
VF from the smartnic.

Current 'enable_roce' device knob is limited to do setting only at
driverinit time. By this time device is already created and firmware has
already allocated necessary system memory for supporting RoCE.

Also, Currently a user does not have the ability to enable migratable
for a PCI VF.

The above are a hyper visor level control, to set the functionality of
devices passed through to guests.

This is achieved by extending existing 'port function' object to control
capabilities of a function. This enables users to control capability of
the device before enumeration.

Examples when user prefers to disable RoCE for a VF when using switchdev
mode:

$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
pfnum 0 vfnum 0 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 roce enable

$ devlink port function set pci/0000:06:00.0/1 roce disable

$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
pfnum 0 vfnum 0 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 roce disable

FAQs:
-----
1. What does roce enable/disable do?
Ans: It disables RoCE capability of the function before its enumerated,
so when driver reads the capability from the device firmware, it is
disabled.
At this point RDMA stack will not be able to create UD, QP1, RC, XRC
type of QPs. When RoCE is disabled, the GID table of all ports of the
device is disabled in the device and software stack.

2. How is the roce 'port function' option different from existing
devlink param?
Ans: RoCE attribute at the port function level disables the RoCE
capability at the specific function level; while enable_roce only does
at the software level.

3. Why is this option for disabling only RoCE and not the whole RDMA
device?
Ans: Because user still wants to use the RDMA device for non RoCE
commands in more memory efficient way.
====================

Link: https://lore.kernel.org/r/20221206185119.380138-1-shayd@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, Implement devlink port function cmds to control migratable

Implement devlink port function commands to enable / disable migratable.
This is used to control the migratable capability of the device.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Expose port function commands to control migratable

Expose port function commands to enable / disable migratable
capability, this is used to set the port function as migratable.

Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

In order for a VM to be able to perform LM, all the VM components must
be able to perform migration. e.g.: to be migratable.
In order for VF to be migratable, VF must be bound to VFIO driver with
migration support.

When migratable capability is enabled for a function of the port, the
device is making the necessary preparations for the function to be
migratable, which might include disabling features which cannot be
migrated.

Example of LM with migratable function configuration:
Set migratable of the VF's port function.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 migratable disable

$ devlink port function set pci/0000:06:00.0/2 migratable enable

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 migratable enable

Bind VF to VFIO driver with migration support:
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

Attach VF to the VM.
Start the VM.
Perform LM.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, Implement devlink port function cmds to control RoCE

Implement devlink port function commands to enable / disable RoCE.
This is used to control the RoCE device capabilities.

This patch implement infrastructure which will be used by downstream
patches that will add additional capabilities.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add generic getters for other functions caps

Downstream patch requires to get other function GENERAL2 caps while
mlx5_vport_get_other_func_cap() gets only one type of caps (general).
Rename it to represent this and introduce a generic implementation
of mlx5_vport_get_other_func_cap().

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Expose port function commands to control RoCE

Expose port function commands to enable / disable RoCE, this is used to
control the port RoCE device capabilities.

When RoCE is disabled for a function of the port, function cannot create
any RoCE specific resources (e.g GID table).
It also saves system memory utilization. For example disabling RoCE enable a
VF/SF saves 1 Mbytes of system memory per function.

Example of a PCI VF port which supports function configuration:
Set RoCE of the VF's port function.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 roce enable

$ devlink port function set pci/0000:06:00.0/2 roce disable

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 roce disable

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Move devlink port function hw_addr attr documentation

devlink port function hw_addr attr documentation is in mlx5 specific
file while there is nothing mlx5 specific about it.
Move it to devlink-port.rst.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Validate port function request

In order to avoid partial request processing, validate the request
before processing it.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Introduce IFC bits for migratable

Introduce IFC related capabilities to enable setting VF to be able to
perform live migration. e.g.: to be migratable.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bridge-mcast-preparations-for-evpn-extensions'

Ido Schimmel says:

====================
bridge: mcast: Preparations for EVPN extensions

This patchset was split from [1] and includes non-functional changes
aimed at making it easier to add additional netlink attributes later on.
Future extensions are available here [2].

The idea behind these patches is to create an MDB configuration
structure into which netlink messages are parsed into. The structure is
then passed in the entry creation / deletion call chain instead of
passing the netlink attributes themselves. The same pattern is used by
other rtnetlink objects such as routes and nexthops.

I initially tried to extend the current code, but it proved to be too
difficult, which is why I decided to refactor it to the extensible and
familiar pattern used by other rtnetlink objects.

Tested using existing selftests and using a new selftest that will be
submitted together with the planned extensions.

[1] https://lore.kernel.org/netdev/20221018120420.561846-1-idosch@nvidia.com/
[2] https://github.com/idosch/linux/commits/submit/mdb_v1
====================

Link: https://lore.kernel.org/r/20221206105809.363767-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Constify 'group' argument in br_multicast_new_port_group()

The 'group' argument is not modified, so mark it as 'const'. It will
allow us to constify arguments of the callers of this function in future
patches.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Remove redundant function arguments

Drop the first three arguments and instead extract them from the MDB
configuration structure.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Move checks out of critical section

The checks only require information parsed from the RTM_NEWMDB netlink
message and do not rely on any state stored in the bridge driver.
Therefore, there is no need to perform the checks in the critical
section under the multicast lock.

Move the checks out of the critical section.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Remove br_mdb_parse()

The parsing of the netlink messages and the validity checks are now
performed in br_mdb_config_init() so we can remove br_mdb_parse().

This finally allows us to stop passing netlink attributes deep in the
MDB control path and only use the MDB configuration structure.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Use MDB group key from configuration structure

The MDB group key (i.e., {source, destination, protocol, VID}) is
currently determined under the multicast lock from the netlink
attributes. Instead, use the group key from the MDB configuration
structure that was prepared before acquiring the lock.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Propagate MDB configuration structure further

As an intermediate step towards only using the new MDB configuration
structure, pass it further in the control path instead of passing
individual attributes.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Use MDB configuration structure where possible

The MDB configuration structure (i.e., struct br_mdb_config) now
includes all the necessary information from the parsed RTM_{NEW,DEL}MDB
netlink messages, so use it.

This will later allow us to delete the calls to br_mdb_parse() from
br_mdb_add() and br_mdb_del().

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Remove redundant checks

These checks are now redundant as they are performed by
br_mdb_config_init() while parsing the RTM_{NEW,DEL}MDB messages.

Remove them.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Centralize netlink attribute parsing

Netlink attributes are currently passed deep in the MDB creation call
chain, making it difficult to add new attributes. In addition, some
validity checks are performed under the multicast lock although they can
be performed before it is ever acquired.

As a first step towards solving these issues, parse the RTM_{NEW,DEL}MDB
messages into a configuration structure, relieving other functions from
the need to handle raw netlink attributes.

Subsequent patches will convert the MDB code to use this configuration
structure.

This is consistent with how other rtnetlink objects are handled, such as
routes and nexthops.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: use sysfs_emit() to instead of scnprintf()

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/202212051918564721658@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ieee802154-for-net-next-2022-12-05' of git://git./linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================
ieee802154-next 2022-12-05

Miquel continued his work towards full scanning support. For this,
we now allow the creation of dedicated coordinator interfaces
to allow a PAN coordinator to serve in the network and set
the needed address filters with the hardware.

On top of this we have the first part to allow scanning for available
15.4 networks. A new netlink scan group, within the existing nl802154
API, was added.

In addition Miquel fixed two issues that have been introduced in the former
patches to free an skb correctly and clarifying an expression in the stack.

From David Girault we got tracing support when registering new PANs.

* tag 'ieee802154-for-net-next-2022-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next:
  mac802154: Trace the registration of new PANs
  ieee802154: Advertize coordinators discovery
  mac802154: Allow the creation of coordinator interfaces
  mac802154: Clarify an expression
  mac802154: Move an skb free within the rx path
====================

Link: https://lore.kernel.org/r/20221205131909.1871790-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: asix: add support for the Linux Automation GmbH USB 10Base-T1L

Add ASIX based USB 10Base-T1L adapter support:
https://linux-automation.com/en/products/usb-t1l.html

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20221205132102.2941732-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'cn10kb-mac-block-support'

Hariprasad Kelam says:

====================
CN10KB MAC block support

OcteonTx2's next gen platform the CN10KB has RPM_USX MAC which has a
different serdes when compared to RPM MAC. Though the underlying
HW is different, the CSR interface has been designed largely inline
with RPM MAC, with few exceptions though. So we are using the same
CGX driver for RPM_USX MAC as well and will have a different set of APIs
for RPM_USX where ever necessary.

The RPM and RPM_USX blocks support a different number of LMACS.
RPM_USX support 8 LMACS per MAC block whereas legacy RPM supports only 4
LMACS per MAC. with this RPM_USX support double the number of DMAC filters
and fifo size.

This patchset adds initial support for CN10KB's RPM_USX MAC i.e
registering the driver and defining MAC operations (mac_ops). With these
changes PF and VF netdev packet path will work and PF and VF netdev drivers
are able to configure MAC features like pause frames,PFC and loopback etc.

Also implements FEC stats for CN10K Mac block RPM and CN10KB Mac block
RPM_USX and extends ethtool support for PF and VF drivers by defining
get_fec_stats API to display FEC stats.
====================

Link: https://lore.kernel.org/r/20221205070521.21860-1-hkelam@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-af: Add FEC stats for RPM/RPM_USX block

CN10K silicon MAC block RPM and CN10KB silicon MAC block RPM_USX
both support BASER and RSFEC modes.

Also MAC (CGX) on OcteonTx2 silicon variants and MAC (RPM) on
OcteonTx3 CN10K are different and FEC stats need to be read
differently. CN10KB MAC block (RPM_USX) fec csr offsets are same
as CN10K MAC block (RPM) mac_ops points to same fn(). Upper layer
interface between RVU AF and PF netdev is kept same. Based on
silicon variant appropriate fn() pointer is called to read FEC stats

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-pf: ethtool: Implement get_fec_stats

This patch registers a callback for get_fec_stats such that
FEC stats can be queried from the below command

"ethtool -I --show-fec eth0"

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-af: cn10kb: Add RPM_USX MAC support

OcteonTx2's next gen platform the CN10KB has RPM_USX MAC which has a
different serdes when compared to RPM MAC. Though the underlying
HW is different, the CSR interface has been designed largely inline
with RPM MAC, with few exceptions though. So we are using the same
CGX driver for RPM_USX MAC as well and will have a different set of APIs
for RPM_USX where ever necessary.

The RPM and RPM_USX blocks support a different number of LMACS.
RPM_USX support 8 LMACS per MAC block whereas legacy RPM supports only 4
LMACS per MAC. with this RPM_USX support double the number of DMAC filters
and fifo size.

This patch adds initial support for CN10KB's RPM_USX MAC i.e registering
the driver and defining MAC operations (mac_ops). Adds the logic to
configure internal loopback and pause frames and assign FIFO length to
LMACS.

Kernel reads lmac features like lmac type, autoneg, etc from shared
firmware data this structure only supports 4 lmacs per MAC, this patch
extends this structure to accommodate 8 lmacs.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-af: Support variable number of lmacs

Most of the code in CGX/RPM driver assumes that max lmacs per
given MAC as always, 4 and the number of MAC blocks also as 4.
With this assumption, the max number of interfaces supported is
hardcoded to 16. This creates a problem as next gen CN10KB silicon
MAC supports 8 lmacs per MAC block.

This patch solves the problem by using "max lmac per MAC block"
value from constant csrs and uses cgx_cnt_max value which is
populated based number of MAC blocks supported by silicon.

Signed-off-by: Rakesh Babu Saladi <rsaladi2@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-dsa-microchip-add-mtu-support-for-ksz8-series'

Oleksij Rempel says:

====================
net: dsa: microchip: add MTU support for KSZ8 series
====================

Link: https://lore.kernel.org/r/20221205052232.2834166-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: ksz8: move all DSA configurations to one location

To make the code more comparable to KSZ9477 code, move DSA
configurations to the same location.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: enable MTU normalization for KSZ8795 and KSZ9477 compatible switches

KSZ8795 and KSZ9477 compatible series of switches use global max frame
size configuration register. So, enable MTU normalization for this reason.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: ksz8: add MTU configuration support

Make MTU configurable on KSZ87xx and KSZ88xx series of switches.

Before this patch, pre-configured behavior was different on different
switch series, due to opposite meaning of the same bit:
- KSZ87xx: Reg 4, Bit 1 - if 1, max frame size is 1532; if 0 - 1514
- KSZ88xx: Reg 4, Bit 1 - if 1, max frame size is 1514; if 0 - 1532

Since the code was telling "... SW_LEGAL_PACKET_DISABLE, true)", I
assume, the idea was to set max frame size to 1532.

With this patch, by setting MTU size 1500, both switch series will be
configured to the 1532 frame limit.

This patch was tested on KSZ8873.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: add ksz_rmw8() function

Add ksz_rmw8(), it will be used in the next patch.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: do not store max MTU for all ports

If we have global MTU configuration, it is enough to configure it on CPU
port only.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: move max mtu to one location

There are no HW specific registers, so we can process all of them
in one location.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Arun Ramadoss <arun.ramadoss@microchip.com> (KSZ9893 and LAN937x)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: mtk_wed: Fix missing of_node_put() in mtk_wed_wo_hardware_init()

The np needs to be released through of_node_put() in the error handling
path of mtk_wed_wo_hardware_init().

Fixes: 799684448e3e ("net: ethernet: mtk_wed: introduce wed wo support")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20221205034339.112163-1-yuancan@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: mtk_wed: add reset to rx_ring_setup callback

This patch adds reset parameter to mtk_wed_rx_ring_setup signature
in order to align rx_ring_setup callback to tx_ring_setup one introduced
in 'commit 23dca7a90017 ("net: ethernet: mtk_wed: add reset to
tx_ring_setup callback")'

Co-developed-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/29c6e7a5469e784406cf3e2920351d1207713d05.1670239984.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: microchip: vcap: Remove unneeded semicolons

Semicolons after "}" are not needed.

Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/202212051422158113766@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: use sysfs_emit() to instead of scnprintf()

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://lore.kernel.org/r/202212051021451139126@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: clean up i2c-bus property parsing

We currently have some complicated code in sfp_probe() which gets the
I2C bus depending on whether the sfp node is DT or ACPI, and we use
completely separate lookup functions.

This could do with being in a separate function to make the code more
readable, so move it to a new function, sfp_i2c_get(). We can also use
fwnode_find_reference() to lookup the I2C bus fwnode before then
decending into fwnode-type specific parsing.

A future cleanup would be to move the fwnode-type specific parsing into
the i2c layer, which is where it really should be.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1p1WGJ-0098wS-4w@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/ncsi: Silence runtime memcpy() false positive warning

The memcpy() in ncsi_cmd_handler_oem deserializes nca->data into a
flexible array structure that overlapping with non-flex-array members
(mfr_id) intentionally. Since the mem_to_flex() API is not finished,
temporarily silence this warning, since it is a false positive, using
unsafe_memcpy().

Reported-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/netdev/CACPK8Xdfi=OJKP0x0D1w87fQeFZ4A2DP2qzGCRcuVbpU-9=4sQ@mail.gmail.com/
Cc: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221202212418.never.837-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-lan966x-enable-ptp-on-bridge-interfaces'

Horatiu Vultur says:

====================
net: lan966x: Enable PTP on bridge interfaces

Before it was not allowed to run ptp on ports that are part of a bridge
because in case of transparent clock the HW will still forward the frames
so there would be duplicate frames.
Now that there is VCAP support, it is possible to add entries in the VCAP
to trap frames to the CPU and the CPU will forward these frames.
The first part of the patch series, extends the VCAP support to be able to
modify and get the rule, while the last patch uses the VCAP to trap the ptp
frames.
====================

Link: https://lore.kernel.org/r/20221203104348.1749811-1-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: lan966x: Add ptp trap rules

Currently lan966x, doesn't allow to run PTP over interfaces that are
part of the bridge. The reason is when the lan966x was receiving a
PTP frame (regardless if L2/IPv4/IPv6) the HW it would flood this
frame.
Now that it is possible to add VCAP rules to the HW, such to trap these
frames to the CPU, it is possible to run PTP also over interfaces that
are part of the bridge.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: vcap: Add vcap_rule_get_key_u32

Add the function vcap_rule_get_key_u32 which allows to get the value and
the mask of a key that exist on the rule. If the key doesn't exist,
it would return error.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: vcap: Add vcap_mod_rule

Add the function vcap_mod_rule which allows to update an existing rule
in the vcap. It is required for the rule to exist in the vcap to be able
to modify it.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: vcap: Add vcap_get_rule

Add function vcap_get_rule which returns a rule based on the internal
rule id.
The entire functionality of reading and decoding the rule from the VCAP
was inside vcap_api_debugfs file. So move the entire implementation in
vcap_api as this is used also by vcap_get_rule.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mtk_eth_soc: enable flow offload support for MT7986 SoC

Since Wireless Ethernet Dispatcher is now available for mt7986 in mt76,
enable hw flow support for MT7986 SoC.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/fdcaacd827938e6a8c4aa1ac2c13e46d2c08c821.1670072898.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: add netlink based get rss support

Add netlink based support for "ethtool -x <dev> [context x]"
command by implementing ETHTOOL_MSG_RSS_GET netlink message.
This is equivalent to functionality provided via ETHTOOL_GRSSH
in ioctl path. It sends RSS table, hash key and hash function
of an interface to user space.

This patch implements existing functionality available
in ioctl path and enables addition of new RSS context
based parameters in future.

Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Link: https://lore.kernel.org/r/20221202002555.241580-1-sudheer.mogilappagari@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mxl-gpy: rename MMD_VEND1 macros to match datasheet

Rename the temperature sensors macros to match the names in the
datasheet.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add support for multicast filter

Rewrite nfp_net_set_rx_mode() to implement interface to delivery
mc address and operations to firmware by using general mailbox
for filtering multicast packets.

The operations include add mc address and delete mc address.
And the limitation of mc addresses number is 1024 for each net
device.

User triggers adding mc address by using command below:
ip maddress add <mc address> dev <interface name>

User triggers deleting mc address by using command below:
ip maddress del <mc address> dev <interface name>

Signed-off-by: Diana Wang <na.wang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: use sysfs_emit() to instead of scnprintf()

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rxrpc-next-20221201-b' of git://git./linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Increasing SACK size and moving away from softirq, parts 2 & 3

Here are the second and third parts of patches in the process of moving
rxrpc from doing a lot of its stuff in softirq context to doing it in an
I/O thread in process context and thereby making it easier to support a
larger SACK table.

The full description is in the description for the first part[1] which is
already in net-next.

The second part includes some cleanups, adds some testing and overhauls
some tracing:

(1) Remove declaration of rxrpc_kernel_call_is_complete() as the
     definition is no longer present.

(2) Remove the knet() and kproto() macros in favour of using tracepoints.

(3) Remove handling of duplicate packets from recvmsg.  The input side
     isn't now going to insert overlapping/duplicate packets into the
     recvmsg queue.

(4) Don't use the rxrpc_conn_parameters struct in the rxrpc_connection or
     rxrpc_bundle structs - rather put the members in directly.

(5) Extract the abort code from a received abort packet right up front
     rather than doing it in multiple places later.

(6) Use enums and symbol lists rather than __builtin_return_address() to
     indicate where a tracepoint was triggered for local, peer, conn, call
     and skbuff tracing.

(7) Add a refcount tracepoint for the rxrpc_bundle struct.

(8) Implement an in-kernel server for the AFS rxperf testing program to
     talk to (enabled by a Kconfig option).

This is tagged as rxrpc-next-20221201-a.

The third part introduces the I/O thread and switches various bits over to
running there:

(1) Fix call timers and call and connection workqueues to not hold refs on
     the rxrpc_call and rxrpc_connection structs to thereby avoid messy
     cleanup when the last ref is put in softirq mode.

(2) Split input.c so that the call packet processing bits are separate
     from the received packet distribution bits.  Call packet processing
     gets bumped over to the call event handler.

(3) Create a per-local endpoint I/O thread.  Barring some tiny bits that
     still get done in softirq context, all packet reception, processing
     and transmission is done in this thread.  That will allow a load of
     locking to be removed.

(4) Perform packet processing and error processing from the I/O thread.

(5) Provide a mechanism to process call event notifications in the I/O
     thread rather than queuing a work item for that call.

(6) Move data and ACK transmission into the I/O thread.  ACKs can then be
     transmitted at the point they're generated rather than getting
     delegated from softirq context to some process context somewhere.

(7) Move call and local processor event handling into the I/O thread.

(8) Move cwnd degradation to after packets have been transmitted so that
     they don't shorten the window too quickly.

A bunch of simplifications can then be done:

(1) The input_lock is no longer necessary as exclusion is achieved by
     running the code in the I/O thread only.

(2) Don't need to use sk->sk_receive_queue.lock to guard socket state
     changes as the socket mutex should suffice.

(3) Don't take spinlocks in RCU callback functions as they get run in
     softirq context and thus need _bh annotations.

(4) RCU is then no longer needed for the peer's error_targets list.

(5) Simplify the skbuff handling in the receive path by dropping the ref
     in the basic I/O thread loop and getting an extra ref as and when we
     need to queue the packet for recvmsg or another context.

(6) Get the peer address earlier in the input process and pass it to the
     users so that we only do it once.

This is tagged as rxrpc-next-20221201-b.

Changes:
========
ver #2)
- Added a patch to change four assertions into warnings in rxrpc_read()
   and fixed a checker warning from a __user annotation that should have
   been removed..
- Change a min() to min_t() in rxperf as PAGE_SIZE doesn't seem to match
   type size_t on i386.
- Three error handling issues in rxrpc_new_incoming_call():
   - If not DATA or not seq #1, should drop the packet, not abort.
   - Fix a goto that went to the wrong place, dropping a non-held lock.
   - Fix an rcu_read_lock that should've been an unlock.

Tested-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-144@auristor.com
Link: https://lore.kernel.org/r/166794587113.2389296.16484814996876530222.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/166982725699.621383.2358362793992993374.stgit@warthog.procyon.org.uk/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: tegra: Add MGBE support

Add support for the Multi-Gigabit Ethernet (MGBE/XPCS) IP found on
NVIDIA Tegra234 SoCs.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>
Co-developed-by: Revanth Kumar Uppala <ruppala@nvidia.com>
Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com>
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Power up SERDES after the PHY link

The Tegra MGBE ethernet controller requires that the SERDES link is
powered-up after the PHY link is up, otherwise the link fails to
become ready following a resume from suspend. Add a variable to indicate
that the SERDES link must be powered-up after the PHY link.

Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com>
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-irq-coalesce'

Heiner Kallweit says:

====================
net: add and use netdev_sw_irq_coalesce_default_on()

There are reports about r8169 not reaching full line speed on certain
systems (e.g. SBC's) with a 2.5Gbps link.
There was a time when hardware interrupt coalescing was enabled per
default, but this was changed due to ASPM-related issues on few systems.

Meanwhile we have sysfs attributes for controlling kind of
"software interrupt coalescing" on the GRO level. However most distros
and users don't know about it. So lets set a conservative default for
both involved parameters. Users can still override the defaults via
sysfs. Don't enable these settings on the fast ethernet chip versions,
they are slow enough.

Even with these conservative setting interrupt load on my 1Gbps test
system reduced significantly.

Follow Jakub's suggestion and put this functionality into net core
so that other MAC drivers can reuse it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: enable GRO software interrupt coalescing per default

There are reports about r8169 not reaching full line speed on certain
systems (e.g. SBC's) with a 2.5Gbps link.
There was a time when hardware interrupt coalescing was enabled per
default, but this was changed due to ASPM-related issues on few systems.
So let's use software interrupt coalescing instead and enable it
using new function netdev_sw_irq_coalesce_default_on().

Even with these conservative settings interrupt load on my 1Gbps test
system reduced significantly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add netdev_sw_irq_coalesce_default_on()

Add a helper for drivers wanting to set SW IRQ coalescing
by default. The related sysfs attributes can be used to
override the default values.

Follow Jakub's suggestion and put this functionality into
net core so that drivers wanting to use software interrupt
coalescing per default don't have to open-code it.

Note that this function needs to be called before the
netdevice is registered.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: mtk_wed: fix sleep while atomic in mtk_wed_wo_queue_refill

In order to fix the following sleep while atomic bug always alloc pages
with GFP_ATOMIC in mtk_wed_wo_queue_refill since page_frag_alloc runs in
spin_lock critical section.

[    9.049719] Hardware name: MediaTek MT7986a RFB (DT)
[    9.054665] Call trace:
[    9.057096]  dump_backtrace+0x0/0x154
[    9.060751]  show_stack+0x14/0x1c
[    9.064052]  dump_stack_lvl+0x64/0x7c
[    9.067702]  dump_stack+0x14/0x2c
[    9.071001]  ___might_sleep+0xec/0x120
[    9.074736]  __might_sleep+0x4c/0x9c
[    9.078296]  __alloc_pages+0x184/0x2e4
[    9.082030]  page_frag_alloc_align+0x98/0x1ac
[    9.086369]  mtk_wed_wo_queue_refill+0x134/0x234
[    9.090974]  mtk_wed_wo_init+0x174/0x2c0
[    9.094881]  mtk_wed_attach+0x7c8/0x7e0
[    9.098701]  mt7915_mmio_wed_init+0x1f0/0x3a0 [mt7915e]
[    9.103940]  mt7915_pci_probe+0xec/0x3bc [mt7915e]
[    9.108727]  pci_device_probe+0xac/0x13c
[    9.112638]  really_probe.part.0+0x98/0x2f4
[    9.116807]  __driver_probe_device+0x94/0x13c
[    9.121147]  driver_probe_device+0x40/0x114
[    9.125314]  __driver_attach+0x7c/0x180
[    9.129133]  bus_for_each_dev+0x5c/0x90
[    9.132953]  driver_attach+0x20/0x2c
[    9.136513]  bus_add_driver+0x104/0x1fc
[    9.140333]  driver_register+0x74/0x120
[    9.144153]  __pci_register_driver+0x40/0x50
[    9.148407]  mt7915_init+0x5c/0x1000 [mt7915e]
[    9.152848]  do_one_initcall+0x40/0x25c
[    9.156669]  do_init_module+0x44/0x230
[    9.160403]  load_module+0x1f30/0x2750
[    9.164135]  __do_sys_init_module+0x150/0x200
[    9.168475]  __arm64_sys_init_module+0x18/0x20
[    9.172901]  invoke_syscall.constprop.0+0x4c/0xe0
[    9.177589]  do_el0_svc+0x48/0xe0
[    9.180889]  el0_svc+0x14/0x50
[    9.183929]  el0t_64_sync_handler+0x9c/0x120
[    9.188183]  el0t_64_sync+0x158/0x15c

Fixes: 799684448e3e ("net: ethernet: mtk_wed: introduce wed wo support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/67ca94bdd3d9eaeb86e52b3050fbca0bcf7bb02f.1669908312.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: use 2-arg optimal variant of kfree_rcu()

kfree_rcu(1-arg) should be avoided as much as possible,
since this is only possible from sleepable contexts,
and incurr extra rcu barriers.

I wish the 1-arg variant of kfree_rcu() would
get a distinct name, like kfree_rcu_slow()
to avoid it being abused.

Fixes: 459837b522f7 ("net/tcp: Disable TCP-MD5 static key on tcp_md5sig_info destruction")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Link: https://lore.kernel.org/r/20221202052847.2623997-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2022-12-02' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.2

Third set of patches for v6.2. mt76 has a new driver for mt7996 Wi-Fi 7
devices and iwlwifi also got initial Wi-Fi 7 support. Otherwise
smaller features and fixes.

Major changes:

ath10k
- store WLAN firmware version in SMEM image table

mt76
- mt7996: new driver for MediaTek Wi-Fi 7 (802.11be) devices
- mt7986, mt7915: enable Wireless Ethernet Dispatch (WED) offload support
- mt7915: add ack signal support
- mt7915: enable coredump support
- mt7921: remain_on_channel support
- mt7921: channel context support

iwlwifi
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support

* tag 'wireless-next-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (144 commits)
  wifi: ath10k: fix QCOM_SMEM dependency
  wifi: mt76: mt7921e: add pci .shutdown() support
  wifi: mt76: mt7915: mmio: fix naming convention
  wifi: mt76: mt7996: add support to configure spatial reuse parameter set
  wifi: mt76: mt7996: enable ack signal support
  wifi: mt76: mt7996: enable use_cts_prot support
  wifi: mt76: mt7915: rely on band_idx of mt76_phy
  wifi: mt76: mt7915: enable per bandwidth power limit support
  wifi: mt76: mt7915: introduce mt7915_get_power_bound()
  mt76: mt7915: Fix PCI device refcount leak in mt7915_pci_init_hif2()
  wifi: mt76: do not send firmware FW_FEATURE_NON_DL region
  wifi: mt76: mt7921: Add missing __packed annotation of struct mt7921_clc
  wifi: mt76: fix coverity overrun-call in mt76_get_txpower()
  wifi: mt76: mt7996: add driver for MediaTek Wi-Fi 7 (802.11be) devices
  wifi: mt76: mt76x0: remove dead code in mt76x0_phy_get_target_power
  wifi: mt76: mt7915: fix band_idx usage
  wifi: mt76: mt7915: enable .sta_set_txpwr support
  wifi: mt76: mt7915: add basedband Txpower info into debugfs
  wifi: mt76: mt7915: add support to configure spatial reuse parameter set
  wifi: mt76: mt7915: add missing MODULE_PARM_DESC
  ...
====================

Link: https://lore.kernel.org/r/20221202214254.D0D3DC433C1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wifi: ath10k: fix QCOM_SMEM dependency

Nathan noticed that when HWSPINLOCK is disabled there's a Kconfig warning:

  WARNING: unmet direct dependencies detected for QCOM_SMEM
    Depends on [n]: (ARCH_QCOM [=y] || COMPILE_TEST [=n]) && HWSPINLOCK [=n]
    Selected by [m]:
    - ATH10K_SNOC [=m] && NETDEVICES [=y] && WLAN [=y] && WLAN_VENDOR_ATH [=y] && ATH10K [=m] && (ARCH_QCOM [=y] || COMPILE_TEST [=n])

The problem here is that QCOM_SMEM depends on HWSPINLOCK so we cannot select
QCOM_SMEM and instead we neeed to use 'depends on'.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/all/Y4YsyaIW+CPdHWv3@dev-arch.thelio-3990X/
Fixes: 4d79f6f34bbb ("wifi: ath10k: Store WLAN firmware version in SMEM image table")
Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20221202103027.25974-1-kvalo@kernel.org

tsnep: Rework RX buffer allocation

Refill RX queue in batches of descriptors to improve performance. Refill
is allowed to fail as long as a minimum number of descriptors is active.
Thus, a limited number of failed RX buffer allocations is now allowed
for normal operation. Previously every failed allocation resulted in a
dropped frame.

If the minimum number of active descriptors is reached, then RX buffers
are still reused and frames are dropped. This ensures that the RX queue
never runs empty and always continues to operate.

Prework for future XDP support.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Throttle interrupts

Without interrupt throttling, iperf server mode generates a CPU load of
100% (A53 1.2GHz). Also the throughput suffers with less than 900Mbit/s
on a 1Gbit/s link. The reason is a high interrupt load with interrupts
every ~20us.

Reduce interrupt load by throttling of interrupts. Interrupt delay
default is 64us. For iperf server mode the CPU load is significantly
reduced to ~20% and the throughput reaches the maximum of 941MBit/s.
Interrupts are generated every ~140us.

RX and TX coalesce can be configured with ethtool. RX coalesce has
priority over TX coalesce if the same interrupt is used.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Add ethtool::get_channels support

Allow user space to read number of TX and RX queue. This is useful for
device dependent qdisc configurations like TAPRIO with hardware offload.
Also ethtool::get_per_queue_coalesce / set_per_queue_coalesce requires
that interface.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Consistent naming of struct net_device

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: bonding: correct xmit hash steps

Correct xmit hash steps for layer3+4 as introduced by commit
49aefd131739 ("bonding: do not discard lowest hash bit for non layer3+4
hashing").

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: bonding: update miimon default to 100

With commit c1f897ce186a ("bonding: set default miimon value for non-arp
modes if not set") the miimon default was changed from zero to 100 if
arp_interval is also zero. Document this fact in bonding.rst.

Fixes: c1f897ce186a ("bonding: set default miimon value for non-arp modes if not set")
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderbolt: Use bitwise types in the struct thunderbolt_ip_frame_header

The main usage of the struct thunderbolt_ip_frame_header is to handle
the packets on the media layer. The header is bound to the protocol
in which the byte ordering is crucial. However the data type definition
doesn't use that and sparse is unhappy, for example (17 altogether):

  .../thunderbolt.c:718:23: warning: cast to restricted __le32

  .../thunderbolt.c:966:42: warning: incorrect type in assignment (different base types)
  .../thunderbolt.c:966:42:    expected unsigned int [usertype] frame_count
  .../thunderbolt.c:966:42:    got restricted __le32 [usertype]

Switch to the bitwise types in the struct thunderbolt_ip_frame_header to
reduce this, but not completely solving (9 left), because the same data
type is used for Rx header handled locally (in CPU byte order).

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderbolt: Switch from __maybe_unused to pm_sleep_ptr() etc

Letting the compiler remove these functions when the kernel is built
without CONFIG_PM_SLEEP support is simpler and less heavier for builds
than the use of __maybe_unused attributes.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: devlink: convert port_list into xarray

Some devlink instances may contain thousands of ports. Storing them in
linked list and looking them up is not scalable. Convert the linked list
into xarray.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hsr'

Sebastian Andrzej Siewior says:

====================
I started playing with HSR and run into a problem. Tested latest
upstream -rc and noticed more problems. Now it appears to work.
For testing I have a small three node setup with iperf and ping. While
iperf doesn't complain ping reports missing packets and duplicates.
====================

Link: https://lore.kernel.org/r/20221129164815.128922-1-bigeasy@linutronix.de/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: Add a basic HSR test.

This test adds a basic HSRv0 network with 3 nodes. In its current shape
it sends and forwards packets, announcements and so merges nodes based
on MAC A/B information.
It is able to detect duplicate packets and packetloss should any occur.

Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: Use a single struct for self_node.

self_node_db is a list_head with one entry of struct hsr_node. The
purpose is to hold the two MAC addresses of the node itself.
It is convenient to recycle the structure. However having a list_head
and fetching always the first entry is not really optimal.

Created a new data strucure contaning the two MAC addresses named
hsr_self_node. Access that structure like an RCU protected pointer so
it can be replaced on the fly without blocking the reader.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: Synchronize sequence number updates.

hsr_register_frame_out() compares new sequence_nr vs the old one
recorded in hsr_node::seq_out and if the new sequence_nr is higher then
it will be written to hsr_node::seq_out as the new value.

This operation isn't locked so it is possible that two frames with the
same sequence number arrive (via the two slave devices) and are fed to
hsr_register_frame_out() at the same time. Both will pass the check and
update the sequence counter later to the same value. As a result the
content of the same packet is fed into the stack twice.

This was noticed by running ping and observing DUP being reported from
time to time.

Instead of using the hsr_priv::seqnr_lock for the whole receive path (as
it is for sending in the master node) add an additional lock that is only
used for sequence number checks and updates.

Add a per-node lock that is used during sequence number reads and
updates.

Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: Synchronize sending frames to have always incremented outgoing seq nr.

Sending frames via the hsr (master) device requires a sequence number
which is tracked in hsr_priv::sequence_nr and protected by
hsr_priv::seqnr_lock. Each time a new frame is sent, it will obtain a
new id and then send it via the slave devices.
Each time a packet is sent (via hsr_forward_do()) the sequence number is
checked via hsr_register_frame_out() to ensure that a frame is not
handled twice. This make sense for the receiving side to ensure that the
frame is not injected into the stack twice after it has been received
from both slave ports.

There is no locking to cover the sending path which means the following
scenario is possible:

  CPU0 CPU1
  hsr_dev_xmit(skb1) hsr_dev_xmit(skb2)
   fill_frame_info()             fill_frame_info()
    hsr_fill_frame_info()         hsr_fill_frame_info()
     handle_std_frame()            handle_std_frame()
      skb1's sequence_nr = 1
                                    skb2's sequence_nr = 2
   hsr_forward_do()              hsr_forward_do()

                                   hsr_register_frame_out(, 2)  // okay, send)

    hsr_register_frame_out(, 1) // stop, lower seq duplicate

Both skbs (or their struct hsr_frame_info) received an unique id.
However since skb2 was sent before skb1, the higher sequence number was
recorded in hsr_register_frame_out() and the late arriving skb1 was
dropped and never sent.

This scenario has been observed in a three node HSR setup, with node1 +
node2 having ping and iperf running in parallel. From time to time ping
reported a missing packet. Based on tracing that missing ping packet did
not leave the system.

It might be possible (didn't check) to drop the sequence number check on
the sending side. But if the higher sequence number leaves on wire
before the lower does and the destination receives them in that order
and it will drop the packet with the lower sequence number and never
inject into the stack.
Therefore it seems the only way is to lock the whole path from obtaining
the sequence number and sending via dev_queue_xmit() and assuming the
packets leave on wire in the same order (and don't get reordered by the
NIC).

Cover the whole path for the master interface from obtaining the ID
until after it has been forwarded via hsr_forward_skb() to ensure the
skbs are sent to the NIC in the order of the assigned sequence numbers.

Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: Disable netpoll.

The hsr device is a software device. Its
net_device_ops::ndo_start_xmit() routine will process the packet and
then pass the resulting skb to dev_queue_xmit().
During processing, hsr acquires a lock with spin_lock_bh()
(hsr_add_node()) which needs to be promoted to the _irq() suffix in
order to avoid a potential deadlock.
Then there are the warnings in dev_queue_xmit() (due to
local_bh_disable() with disabled interrupts) left.

Instead trying to address those (there is qdisc and…) for netpoll sake,
just disable netpoll on hsr.

Disable netpoll on hsr and replace the _irqsave() locking with _bh().

Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hsr: Avoid double remove of a node.

Due to the hashed-MAC optimisation one problem become visible:
hsr_handle_sup_frame() walks over the list of available nodes and merges
two node entries into one if based on the information in the supervision
both MAC addresses belong to one node. The list-walk happens on a RCU
protected list and delete operation happens under a lock.

If the supervision arrives on both slave interfaces at the same time
then this delete operation can occur simultaneously on two CPUs. The
result is the first-CPU deletes the from the list and the second CPUs
BUGs while attempting to dereference a poisoned list-entry. This happens
more likely with the optimisation because a new node for the mac_B entry
is created once a packet has been received and removed (merged) once the
supervision frame has been received.

Avoid removing/ cleaning up a hsr_node twice by adding a `removed' field
which is set to true after the removal and checked before the removal.

Fixes: f266a683a4804 ("net/hsr: Better frame dispatch")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>