review.tizen.org Git - platform/kernel/linux-starfive.git/log

net: phy: marvell: Add software control of the LEDs

Add a brightness function, so the LEDs can be controlled from
software using the standard Linux LED infrastructure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: phy_device: Call into the PHY driver to set LED brightness

Linux LEDs can be software controlled via the brightness file in /sys.
LED drivers need to implement a brightness_set function which the core
will call. Implement an intermediary in phy_device, which will call
into the phy driver if it implements the necessary function.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add a binding for PHY LEDs

Define common binding parsing for all PHY drivers with LEDs using
phylib. Parse the DT as part of the phy_probe and add LEDs to the
linux LED class infrastructure. For the moment, provide a dummy
brightness function, which will later be replaced with a call into the
PHY driver. This allows testing since the LED core might otherwise
reject an LED whose brightness cannot be set.

Add a dependency on LED_CLASS. It either needs to be built in, or not
enabled, since a modular build can result in linker errors.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

leds: Provide stubs for when CLASS_LED & NEW_LEDS are disabled

Provide stubs for devm_led_classdev_register_ext() and
led_init_default_state_get() so that LED drivers embedded within other
drivers such as PHYs and Ethernet switches still build when LEDS_CLASS
or NEW_LEDS are disabled. This also helps with Kconfig dependencies,
which are somewhat hairy for phylib and mdio and only get worse when
adding a dependency on LED_CLASS.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: qca8k: add LEDs blink_set() support

Add LEDs blink_set() support to qca8k Switch Family.
These LEDs support hw accellerated blinking at a fixed rate
of 4Hz.

Reject any other value since not supported by the LEDs switch.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: qca8k: add LEDs basic support

Add LEDs basic support for qca8k Switch Family by adding basic
brightness_set() support.

Since these LEDs refelect port status, the default label is set to
":port". DT binding should describe the color and function of the
LEDs using standard LEDs api.
Each LED always have the device name as prefix. The device name is
composed from the mii bus id and the PHY addr resulting in example
names like:
- qca8k-0.0:00:amber:lan
- qca8k-0.0:00:white:lan
- qca8k-0.0:01:amber:lan
- qca8k-0.0:01:white:lan

These LEDs supports only blocking variant of the brightness_set()
function since they can sleep during access of the switch leds to set
the brightness.

While at it add to the qca8k header file each mode defined by the Switch
Documentation for future use.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: qca8k: move qca8k_port_to_phy() to header

Move qca8k_port_to_phy() to qca8k header as it's useful for future
reference in Switch LEDs module since the same logic is applied to get
the right index of the switch port.
Make it inline as it's simple function that just decrease the port.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wwan: Expose secondary AT port on DATA1

Our use-case needs two AT ports available:
One for running a ppp daemon, and another one for management

This patch enables a second AT port on DATA1

Signed-off-by: Jaime Breva <jbreva@nayarsystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx5e-xdp-extend'

Tariq Toukan says:

====================
net/mlx5e: Extend XDP multi-buffer capabilities

This series extends the XDP multi-buffer support in the mlx5e driver.

Patchset breakdown:
- Infrastructural changes and preparations.
- Add XDP multi-buffer support for XDP redirect-in.
- Use TX MPWQE (multi-packet WQE) HW feature for non-linear
  single-segmented XDP frames.
- Add XDP multi-buffer support for striding RQ.

In Striding RQ, we overcome the lack of headroom and tailroom between
the RQ strides by allocating a side page per packet and using it for the
xdp_buff descriptor. We structure the xdp_buff so that it contains
nothing in the linear part, and the whole packet resides in the
fragments.

Performance highlight:

Packet rate test, 64 bytes, 32 channels, MTU 9000 bytes.
CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz.
NIC: ConnectX-6 Dx, at 100 Gbps.

+----------+-------------+-------------+---------+
| Test     | Legacy RQ   | Striding RQ | Speedup |
+----------+-------------+-------------+---------+
| XDP_DROP | 101,615,544 | 117,191,020 | +15%    |
+----------+-------------+-------------+---------+
| XDP_TX   |  95,608,169 | 117,043,422 | +22%    |
+----------+-------------+-------------+---------+

Series generated against net commit:
e61caf04b9f8 Merge branch 'page_pool-allow-caching-from-safely-localized-napi'

I'm submitting this directly as Saeed is traveling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ

Here we add support for multi-buffer XDP handling in Striding RQ, which
is our default out-of-the-box RQ type. Before this series, loading such
an XDP program would fail, until you switch to the legacy RQ (by
unsetting the rx_striding_rq priv-flag).

To overcome the lack of headroom and tailroom between the strides, we
allocate a side page to be used for the descriptor (xdp_buff / skb) and
the linear part. When an XDP program is attached, we structure the
xdp_buff so that it contains no data in the linear part, and the whole
packet resides in the fragments.

In case of XDP_PASS, where an SKB still needs to be created, we copy up
to 256 bytes to its linear part, to match the current behavior, and
satisfy functions that assume finding the packet headers in the SKB
linear part (like eth_type_trans).

Performance testing:

Packet rate test, 64 bytes, 32 channels, MTU 9000 bytes.
CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz.
NIC: ConnectX-6 Dx, at 100 Gbps.

+----------+-------------+-------------+---------+
| Test     | Legacy RQ   | Striding RQ | Speedup |
+----------+-------------+-------------+---------+
| XDP_DROP | 101,615,544 | 117,191,020 | +15%    |
+----------+-------------+-------------+---------+
| XDP_TX   |  95,608,169 | 117,043,422 | +22%    |
+----------+-------------+-------------+---------+

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer support

In preparation for supporting XDP multi-buffer in striding RQ, use
xdp_buff struct to describe the packet. Make its skb_shared_info collide
the one of the allocated SKB, then add the fragments using the xdp_buff
API.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: RX, Generalize mlx5e_fill_mxbuf()

Make the function more generic. Let it get an additional frame_sz
parameter instead of deriving it from the RQ struct.

No functional change here, just a preparation for a downstream patch.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: RX, Take shared info fragment addition into a function

Introduce mlx5e_add_skb_shared_info_frag(), a function dedicated for
adding a fragment into a struct skb_shared_info object.

Use it in the Legacy RQ flow. Similar usage will be added in a
downstream patch by the corresponding Striding RQ flow.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Allow non-linear single-segment frames in XDP TX MPWQE

Under a few restrictions, TX MPWQE feature can serve multiple TX packets
in a single TX descriptor. It requires each of the packets to have a
single scatter entry / segment.

Today we allow only linear frames to use this feature, although there's
no real problem with non-linear ones where the whole packet reside in
the first fragment.

Expand the XDP TX MPWQE feature support to include such frames. This is
in preparation for the downstream patch, in which we will generate such
non-linear frames.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Remove un-established assumptions on XDP buffer

Remove the assumption of non-zero linear length in the XDP xmit
function, used to serve both internal XDP_TX operations as well as
redirected-in requests.

Do not apply the MLX5E_XDP_MIN_INLINE check unless necessary.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Consider large muti-buffer packets in Striding RQ params calculations

Function mlx5e_rx_get_linear_stride_sz() returns PAGE_SIZE immediately
in case an XDP program is attached. The more accurate formula is
ALIGN(sz, PAGE_SIZE), to prevent two packets from residing on the same
page.

The assumption behind the current code is that sz <= PAGE_SIZE holds for
all cases with XDP program set.

This is true because it is being called from:
- 3 times from Striding RQ flows, in which XDP is not supported for such
large packets.
- 1 time from Legacy RQ flow, under the condition
mlx5e_rx_is_linear_skb().

No functional change here, just removing the implied assumption in
preparation for supporting XDP multi-buffer in Striding RQ.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Let XDP checker function get the params as input

Change mlx5e_xdp_allowed() so it gets the params structure with the
xdp_prog applied, rather than creating a local copy based on the current
params in priv.

This reduces the amount of memory on the stack, and acts on the exact
params instance that's about to be applied.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Improve Striding RQ check with XDP

Non-linear mem scheme of Striding RQ does not yet support XDP at this
point. Take the check where it belongs, inside the params validation
function mlx5e_params_validate_xdp().

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Add support for multi-buffer XDP redirect-in

Handle multi-buffer XDP redirect-in requests coming through
mlx5e_xdp_xmit.

Extend struct mlx5e_xmit_data_frags with an additional dma_arr field, to
point to the fragments dma mapping, as they cannot be retrieved via the
page_pool_get_dma_addr() function.

Push a dma_addr xdpi instance per each fragment, and use them in the
completion flow to dma_unmap the frags.

Finally, remove the restriction in mlx5e_open_xdpsq, and set the flag in
xdp_features.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo

Here we fix the current wi->num_pkts abuse, as it was used to indicate
multiple xdpi entries in the xdpi_fifo.

Instead, reduce mlx5e_xdp_info to the size of a single field, making it
a union of unions. Per packet, use as many instances as needed to
provide the information needed at the time of completion.

The sequence of xdpi instances pushed is well defined, derived by the
xmit_mode.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: XDP, Remove doubtful unlikely calls

It is not likely nor unlikely that the xdp buff has fragments, it
depends on the program loaded and size of the packet received.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Introduce extended version for mlx5e_xmit_data

Introduce struct mlx5e_xmit_data_frags to be used for non-linear xmit
buffers. Let it include sinfo pointer.

Take one bit from the len field to indicate if the descriptor has
fragments and can be casted-up into the extended version.

Zero-init to make sure has_frags, and potentially future fields, are
zero when not explicitly assigned.

Another field will be added in a downstream patch to indicate and point
to dma addresses of the different frags, for redirect-in requests.

This simplifies the mlx5e_xmit_xdp_frame/mlx5e_xmit_xdp_frame_mpwqe
functions params.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Move struct mlx5e_xmit_data to datapath header

Move TX datapath struct from the generic en.h to the datapath txrx.h
header, where it belongs.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Move XDP struct and enum to XDP header

Move struct mlx5e_xdp_info and enum mlx5e_xdp_xmit_mode from the generic
en.h to the XDP header, where they belong.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: stmmac: dwmac-sti: remove stih415/stih416/stid127

Remove no more supported platforms (stih415/stih416 and stid127)

Signed-off-by: Alain Volmat <avolmat@me.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230416195523.61075-1-avolmat@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: remove incompatible prototypes

The types for the register argument changed recently, but there are
still incompatible prototypes that got left behind, and gcc-13 warns
about these:

In file included from drivers/net/ethernet/mscc/ocelot.c:13:
drivers/net/ethernet/mscc/ocelot.h:97:5: error: conflicting types for 'ocelot_port_readl' due to enum/integer mismatch; have 'u32(struct ocelot_port *, u32)' {aka 'unsigned int(struct ocelot_port *, unsigned int)'} [-Werror=enum-int-mismatch]
97 | u32 ocelot_port_readl(struct ocelot_port *port, u32 reg);
| ^~~~~~~~~~~~~~~~~

Just remove the two prototypes, and rely on the copy in the global
header.

Fixes: 9ecd05794b8d ("net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230417205531.1880657-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: propagate feature flags to vlan

stmmac_dev_probe doesn't propagate feature flags to VLANs. So features
like offloading don't correspond with the general features and it's not
possible to manipulate features via ethtool -K to affect VLANs.

Propagate feature flags to vlan features. Drop TSO feature because
it does not work on VLANs yet.

Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
Link: https://lore.kernel.org/r/20230417192845.590034-1-vinschen@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: add software tx timestamping support

Currently, bonding only obtain the timestamp (ts) information of
the active slave, which is available only for modes 1, 5, and 6.
For other modes, bonding only has software rx timestamping support.

However, some users who use modes such as LACP also want tx timestamp
support. To address this issue, let's check the ts information of each
slave. If all slaves support tx timestamping, we can enable tx
timestamping support for the bond.

Add a note that the get_ts_info may be called with RCU, or rtnl or
reference on the device in ethtool.h>

Suggested-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20230418034841.2566262-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-ethernet-driver-for-starfive-jh7110-soc'

Samin Guo says:

====================
Add Ethernet driver for StarFive JH7110 SoC

This series adds ethernet support for the StarFive JH7110 RISC-V SoC,
which includes a dwmac-5.20 MAC driver (from Synopsys DesignWare).
This series has been tested and works fine on VisionFive-2 v1.2A and
v1.3B SBC boards.

For more information and support, you can visit RVspace wiki[1].
You can simply review or test the patches at the link [2].
This patchset should be applied after the patchset [3] [4].

[1]: https://wiki.rvspace.org/
[2]: https://github.com/saminGuo/linux/tree/vf2-6.3rc4-gmac-net-next
[3]: https://patchwork.kernel.org/project/linux-riscv/cover/20230401111934.130844-1-hal.feng@starfivetech.com
[4]: https://patchwork.kernel.org/project/linux-riscv/cover/20230315055813.94740-1-william.qiu@starfivetech.com
====================

Link: https://lore.kernel.org/r/20230417100251.11871-1-samin.guo@starfivetech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-starfive: Add phy interface settings

dwmac supports multiple modess. When working under rmii and rgmii,
you need to set different phy interfaces.

According to the dwmac document, when working in rmii, it needs to be
set to 0x4, and rgmii needs to be set to 0x1.

The phy interface needs to be set in syscon, the format is as follows:
starfive,syscon: <&syscon, offset, shift>

Tested-by: Tommaso Merciai <tomm.merciai@gmail.com>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Add glue layer for StarFive JH7110 SoC

This adds StarFive dwmac driver support on the StarFive JH7110 SoC.

Tested-by: Tommaso Merciai <tomm.merciai@gmail.com>
Co-developed-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: Add support StarFive dwmac

Add documentation to describe StarFive dwmac driver(GMAC).

Signed-off-by: Yanhong Wang <yanhong.wang@starfivetech.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: snps,dwmac: Add 'ahb' reset/reset-name

According to:
stmmac_platform.c: stmmac_probe_config_dt
stmmac_main.c: stmmac_dvr_probe

dwmac controller may require one (stmmaceth) or two (stmmaceth+ahb)
reset signals, and the maxItems of resets/reset-names is going to be 2.

The gmac of Starfive Jh7110 SOC must have two resets.
it uses snps,dwmac-5.20 IP.

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: platform: Add snps,dwmac-5.20 IP compatible string

Add "snps,dwmac-5.20" compatible string for 5.20 version that can avoid
to define some platform data in the glue layer.

Tested-by: Tommaso Merciai <tomm.merciai@gmail.com>
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: snps,dwmac: Add dwmac-5.20 version

Add dwmac-5.20 IP version to snps.dwmac.yaml

Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'r8169-use-new-macros-from-netdev_queues-h'

Heiner Kallweit says:

====================
r8169: use new macros from netdev_queues.h

Add one missing subqueue version of the macros, and use the new macros
in r8169 to simplify the code.
====================

Link: https://lore.kernel.org/r/7147a001-3d9c-a48d-d398-a94c666aa65b@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

r8169: use new macro netif_subqueue_completed_wake in the tx cleanup path

Use new net core macro netif_subqueue_completed_wake to simplify
the code of the tx cleanup path.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

r8169: use new macro netif_subqueue_maybe_stop in rtl8169_start_xmit

Use new net core macro netif_subqueue_maybe_stop in the start_xmit path
to simplify the code. Whilst at it, set the tx queue start threshold to
twice the stop threshold. Before values were the same, resulting in
stopping/starting the queue more often than needed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add macro netif_subqueue_completed_wake

Add netif_subqueue_completed_wake, complementing the subqueue versions
netif_subqueue_try_stop and netif_subqueue_maybe_stop.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'ocelot-felix-driver-support-for-preemptible-traffic-classes'

Vladimir Oltean says:

====================
Ocelot/Felix driver support for preemptible traffic classes

The series "Add tc-mqprio and tc-taprio support for preemptible traffic
classes" from:
https://lore.kernel.org/netdev/20230220122343.1156614-1-vladimir.oltean@nxp.com/

was eventually submitted in a form without the support for the
Ocelot/Felix switch driver. This patch set picks up that work again,
and presents a fairly modified form compared to the original.
====================

Link: https://lore.kernel.org/r/20230415170551.3939607-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: add support for preemptible traffic classes

In order to not transmit (preemptible) frames which will be received by
the link partner as corrupted (because it doesn't support FP), the
hardware requires the driver to program the QSYS_PREEMPTION_CFG_P_QUEUES
register only after the MAC Merge layer becomes active (verification
succeeds, or was disabled).

There are some cases when FP is known (through experimentation) to be
broken. Give priority to FP over cut-through switching, and disable FP
for known broken link modes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: act upon the mqprio qopt in taprio offload

The mqprio queue configuration can appear either through
TC_SETUP_QDISC_MQPRIO or through TC_SETUP_QDISC_TAPRIO. Make sure both
are treated in the same way.

Code does nothing new for now (except for rejecting multiple TXQs per
TC, which is a useless concept with DSA switches).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: add support for mqprio offload

This doesn't apply anything to hardware and in general doesn't do
anything that the software variant doesn't do, except for checking that
there isn't more than 1 TXQ per TC (TXQs for a DSA switch are a dubious
concept anyway). The reason we add this is to be able to parse one more
field added to struct tc_mqprio_qopt_offload, namely preemptible_tcs.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: don't rely on cached verify_status in ocelot_port_get_mm()

ocelot_mm_update_port_status() updates mm->verify_status, but when the
verification state of a port changes, an IRQ isn't emitted, but rather,
only when the verification state reaches one of the final states (like
DISABLED, FAILED, SUCCEEDED) - things that would affect mm->tx_active,
which is what the IRQ *is* actually emitted for.

That is to say, user space may miss reports of an intermediary MAC Merge
verification state (like from INITIAL to VERIFYING), unless there was an
IRQ notifying the driver of the change in mm->tx_active as well.

This is not a huge deal, but for reliable reporting to user space, let's
call ocelot_mm_update_port_status() synchronously from
ocelot_port_get_mm(), which makes user space see the current MM status.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: optimize ocelot_mm_irq()

The MAC Merge IRQ of all ports is shared with the PTP TX timestamp IRQ
of all ports, which means that currently, when a PTP TX timestamp is
generated, felix_irq_handler() also polls for the MAC Merge layer status
of all ports, looking for changes. This makes the kernel do more work,
and under certain circumstances may make ptp4l require a
tx_timestamp_timeout argument higher than before.

Changes to the MAC Merge layer status are only to be expected under
certain conditions - its TX direction needs to be enabled - so we can
check early if that is the case, and omit register access otherwise.

Make ocelot_mm_update_port_status() skip register access if
mm->tx_enabled is unset, and also call it once more, outside IRQ
context, from ocelot_port_set_mm(), when mm->tx_enabled transitions from
true to false, because an IRQ is also expected in that case.

Also, a port may have its MAC Merge layer enabled but it may not have
generated the interrupt. In that case, there's no point in writing to
DEV_MM_STATUS to acknowledge that IRQ. We can reduce the number of
register writes per port with MM enabled by keeping an "ack" variable
which writes the "write-one-to-clear" bits. Those are 3 in number:
PRMPT_ACTIVE_STICKY, UNEXP_RX_PFRM_STICKY and UNEXP_TX_PFRM_STICKY.
The other fields in DEV_MM_STATUS are read-only and it doesn't matter
what is written to them, so writing zero is just fine.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: remove struct ocelot_mm_state :: lock

Unfortunately, the workarounds for the hardware bugs make it pointless
to keep fine-grained locking for the MAC Merge state of each port.

Our vsc9959_cut_through_fwd() implementation requires
ocelot->fwd_domain_lock to be held, in order to serialize with changes
to the bridging domains and to port speed changes (which affect which
ports can be cut-through). Simultaneously, the traffic classes which can
be cut-through cannot be preemptible at the same time, and this will
depend on the MAC Merge layer state (which changes from threaded
interrupt context).

Since vsc9959_cut_through_fwd() would have to hold the mm->lock of all
ports for a correct and race-free implementation with respect to
ocelot_mm_irq(), in practice it means that any time a port's mm->lock is
held, it would potentially block holders of ocelot->fwd_domain_lock.

In the interest of simple locking rules, make all MAC Merge layer state
changes (and preemptible traffic class changes) be serialized by the
ocelot->fwd_domain_lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: export a single ocelot_mm_irq()

When the switch emits an IRQ, we don't know what caused it, and we
iterate through all ports to check the MAC Merge status.

Move that iteration inside the ocelot lib; we will change the locking in
a future change and it would be good to encapsulate that lock completely
within the ocelot lib.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'xdp-rx-hwts-metadata-for-stmmac-driver'

Song Yoong Siang says:

====================
XDP Rx HWTS metadata for stmmac driver

Implemented XDP receive hardware timestamp metadata for stmmac driver.

This patchset is tested with tools/testing/selftests/bpf/xdp_hw_metadata.
Below are the test steps and results.

Command on DUT:
sudo ./xdp_hw_metadata <interface name>

Command on Link Partner:
echo -n xdp | nc -u -q1 <destination IPv4 addr> 9091
echo -n skb | nc -u -q1 <destination IPv4 addr> 9092

Result for port 9091:
poll: 1 (0) skip=1 fail=0 redir=1
xsk_ring_cons__peek: 1
0x55f69f65f6d0: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
rx_timestamp: 1677762069053692631
No rx_hash err=-95
0x55f69f65f6d0: complete idx=8 addr=8000

Result for port 9092:
poll: 1 (0) skip=2 fail=0 redir=1
found skb hwtstamp = 1677762071.937207680
====================

Link: https://lore.kernel.org/r/20230415064503.3225835-1-yoong.siang.song@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add Rx HWTS metadata to XDP ZC receive pkt

Add receive hardware timestamp metadata support via kfunc to XDP Zero Copy
receive packets.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add Rx HWTS metadata to XDP receive pkt

Add receive hardware timestamp metadata support via kfunc to XDP receive
packets.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: introduce wrapper for struct xdp_buff

Introduce struct stmmac_xdp_buff as a preparation to support XDP Rx
metadata via kfuncs.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'support-tunnel-mode-in-mlx5-ipsec-packet-offload'

Leon Romanovsky says:

====================
Support tunnel mode in mlx5 IPsec packet offload

This series extends mlx5 to support tunnel mode in its IPsec packet
offload implementation.

v0: https://lore.kernel.org/all/cover.1681106636.git.leonro@nvidia.com
====================

Link: https://lore.kernel.org/r/cover.1681388425.git.leonro@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Accept tunnel mode for IPsec packet offload

Open mlx5 driver to accept IPsec tunnel mode.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Create IPsec table with tunnel support only when encap is disabled

Current hardware doesn't support double encapsulation which is
happening when IPsec packet offload tunnel mode is configured
together with eswitch encap option.

Any user attempt to add new SA/policy after he/she sets encap mode, will
generate the following FW syndrome:

mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 1904): CREATE_FLOW_TABLE(0x930) op_mod(0x0) failed,
status bad parameter(0x3), syndrome (0xa43321), err(-22)

Make sure that we block encap changes before creating flow steering tables.
This is applicable only for packet offload in tunnel mode, while packet
offload in transport mode and crypto offload, don't have such limitation
as they don't perform encapsulation.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Allow blocking encap changes in eswitch

Existing eswitch encap option enables header encapsulation. Unfortunately
currently available hardware isn't able to perform double encapsulation,
which can happen once IPsec packet offload tunnel mode is used together
with encap mode set to BASIC.

So as a solution for misconfiguration, provide an option to block encap
changes, which will be used for IPsec packet offload.

Reviewed-by: Emeel Hakim <ehakim@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode

In IPsec packet offload mode all header manipulations are performed by
hardware, which is responsible to add/remove L2 header with source and
destinations MACs.

CX-7 devices don't support offload of in-kernel routing functionality,
as such HW needs external help to fill other side MAC as it isn't
available for HW.

As a solution, let's listen to neigh ARP updates and reconfigure IPsec
rules on the fly once new MAC data information arrives.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Support IPsec TX packet offload in tunnel mode

Extend mlx5 driver with logic to support IPsec TX packet offload
in tunnel mode.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Support IPsec RX packet offload in tunnel mode

Extend mlx5 driver with logic to support IPsec RX packet offload
in tunnel mode.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Prepare IPsec packet reformat code for tunnel mode

Refactor setup_pkt_reformat() function to accommodate future extension
to support tunnel mode.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Configure IPsec SA tables to support tunnel mode

Create SA flow steering tables both for RX and TX with tunnel reformat
property. This allows to add and delete extra headers needed for tunnel
mode.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Check IPsec packet offload tunnel capabilities

Validate tunnel mode support for IPsec packet offload.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Add IPsec packet offload tunnel bits

Extend packet reformat types and flow table capabilities with
IPsec packet offload tunnel bits.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lan966x: Fix lan966x_ifh_get

From time to time, it was observed that the nanosecond part of the
received timestamp, which is extracted from the IFH, it was actually
bigger than 1 second. So then when actually calculating the full
received timestamp, based on the nanosecond part from IFH and the second
part which is read from HW, it was actually wrong.

The issue seems to be inside the function lan966x_ifh_get, which
extracts information from an IFH(which is an byte array) and returns the
value in a u64. When extracting the timestamp value from the IFH, which
starts at bit 192 and have the size of 32 bits, then if the most
significant bit was set in the timestamp, then this bit was extended
then the return value became 0xffffffff... . And the reason of this is
because constants without any postfix are treated as signed longs and
that is the reason why '1 << 31' becomes 0xffffffff80000000.
This is fixed by adding the postfix 'ULL' to 1.

Fixes: fd7627833ddf ("net: lan966x: Stop using packing library")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sctp-info-dump'

Xin Long says:

====================
sctp: add some missing peer_capables in sctp info dump

The 1st patch removes the unused and obsolete hostname_address from
sctp_association peer and also the bit from sctp_info peer_capables,
and then reuses its bit for reconf_capable and use the higher
available bit for intl_capable in the 2nd patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add intl_capable and reconf_capable in ss peer_capable

There are two new peer capables have been added since sctp_diag was
introduced into SCTP. When dumping the peer capables, these two new
peer capables should also be included. To not break the old capables,
reconf_capable takes the old hostname_address bit, and intl_capable
uses the higher available bit in sctpi_peer_capable.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: delete the obsolete code for the host name address param

In the latest RFC9260, the Host Name Address param has been deprecated.
For INIT chunk:

  Note 3: An INIT chunk MUST NOT contain the Host Name Address
  parameter.  The receiver of an INIT chunk containing a Host Name
  Address parameter MUST send an ABORT chunk and MAY include an
  "Unresolvable Address" error cause.

For Supported Address Types:

  The value indicating the Host Name Address parameter MUST NOT be
  used when sending this parameter and MUST be ignored when receiving
  this parameter.

Currently Linux SCTP doesn't really support Host Name Address param,
but only saves some flag and print debug info, which actually won't
even be triggered due to the verification in sctp_verify_param().
This patch is to delete those dead code.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-cleanups'

Matthieu Baerts says:

====================
mptcp: various small cleanups

Patch 1 makes a function static because it is only used in one file.

Patch 2 adds info about the git trees we use to help occasional devs.

Patch 3 removes an unused variable.

Patch 4 removes duplicated entries from the help menu of a tool used in
MPTCP selftests.

Patch 5 removes some ShellCheck warnings in mptcp_join.sh selftest.

Only very minor improvements then.
====================

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: join: fix ShellCheck warnings

Most of the code had an issue according to ShellCheck.

That's mainly due to the fact it incorrectly believes most of the code
was unreachable because it's invoked by variable name, see how the
"tests" array is used.

Once SC2317 has been ignored, three small warnings were still visible:

- SC2155: Declare and assign separately to avoid masking return values.

- SC2046: Quote this to prevent word splitting: can be ignored because
"ip netns pids" can display more than one pid.

- SC2166: Prefer [ p ] || [ q ] as [ p -o q ] is not well defined.

This probably didn't fix any actual issues but it might help spotting
new interesting warnings reported by ShellCheck as just before,
ShellCheck was reporting issues for most lines making it a bit useless.

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: remove duplicated entries in usage

mptcp_connect tool was printing some duplicated entries when showing how
to use it: -j -l -r

While at it, I also:

- moved the very few entries that were not sorted,

- added -R that was missing since
commit 8a4b910d005d ("mptcp: selftests: add rcvbuf set option"),

- removed the -u parameter that has been removed in
commit f730b65c9d85 ("selftests: mptcp: try to set mptcp ulp mode in different sk states").

No need to backport this, it is just an internal tool used by our
selftests. The help menu is mainly useful for MPTCP kernel devs.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: remove unused 'remaining' variable

In some functions, 'remaining' variable was given in argument and/or set
but never read.

  net/mptcp/options.c:779:3: warning: Value stored to 'remaining' is never
  read [clang-analyzer-deadcode.DeadStores].

  net/mptcp/options.c:547:3: warning: Value stored to 'remaining' is never
  read [clang-analyzer-deadcode.DeadStores].

The issue has been reported internally by Alibaba CI.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Suggested-by: Mat Martineau <martineau@kernel.org>
Co-developed-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: add git trees for MPTCP

This will help occasional developers to find our git repo without having
to look at our wiki.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: make userspace_pm_append_new_local_addr static

mptcp_userspace_pm_append_new_local_addr() has always exclusively been
used in pm_userspace.c since its introduction in
commit 4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs").

So make it static.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-subflow-init'

Matthieu Baerts says:

====================
mptcp: refactor first subflow init

This series refactors the initialisation of the first subflow of a
listen socket. The first subflow allocation is no longer done at the
initialisation of the socket but later, when the connection request is
received or when requested by the userspace.

This is needed not just because Paolo likes to refactor things but
because this simplifies the code and makes the behaviour more consistent
with the rest. Also, this is a prerequisite for future patches adding
proper support of SELinux/LSM labels with MPTCP and accept(2).

In [1], Ondrej Mosnacek explained they discovered the (userspace-facing)
sockets returned by accept(2) when using MPTCP always end up with the
label representing the kernel (typically system_u:system_r:kernel_t:s0),
while it would make more sense to inherit the context from the parent
socket (the one that is passed to accept(2)).

Before being able to properly support that on SELinux/LSM side, patches
2-3/5 prepare the code to simplify the patch 4/5 moving the allocation.

Patch 1/5 is a small clean-up seen while working on the series and patch
5/5 is a small improvement when closing unaccepted sockets.

[1] https://lore.kernel.org/netdev/CAFqZXNs2LF-OoQBUiiSEyranJUXkPLcCfBkMkwFeM6qEwMKCTw@mail.gmail.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: fastclose msk when cleaning unaccepted sockets

When cleaning up unaccepted mptcp socket still laying inside
the listener queue at listener close time, such sockets will
go through a regular close, waiting for a timeout before
shutting down the subflows.

There is no need to keep the kernel resources in use for
such a possibly long time: short-circuit to fast-close.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: move first subflow allocation at mpc access time

In the long run this will simplify the mptcp code and will
allow for more consistent behavior. Move the first subflow
allocation out of the sock->init ops into the __mptcp_nmpc_socket()
helper.

Since the first subflow creation can now happen after the first
setsockopt() we additionally need to invoke mptcp_sockopt_sync()
on it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: move fastopen subflow check inside mptcp_sendmsg_fastopen()

So that we can avoid a bunch of check in fastpath. Additionally we
can specialize such check according to the specific fastopen method
- defer_connect vs MSG_FASTOPEN.

The latter bits will simplify the next patches.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: avoid unneeded __mptcp_nmpc_socket() usage

In a few spots, the mptcp code invokes the __mptcp_nmpc_socket() helper
multiple times under the same socket lock scope. Additionally, in such
places, the socket status ensures that there is no MP capable handshake
running.

Under the above condition we can replace the later __mptcp_nmpc_socket()
helper invocation with direct access to the msk->subflow pointer and
better document such access is not supposed to fail with WARN().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: drop unneeded argument

After commit 3a236aef280e ("mptcp: refactor passive socket initialization"),
every mptcp_pm_fully_established() call is always invoked with a
GFP_ATOMIC argument. We can then drop it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2023-04-14' of git://git./linux/kernel/git/saeed/linux

mlx5-updates-2023-04-14

Yevgeny Kliteynik Says:
=======================

SW Steering: Support pattern/args modify_header actions

The following patch series adds support for a new pattern/arguments type
of modify_header actions.

Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.

The new approach comprises of two types of objects: pattern and argument.
Pattern holds header modification templates, later used with corresponding
argument object to create complete header modification actions.
The pattern indicates which headers are modified, while the arguments
provide the specific values.
Therefore a single pattern can be used with different arguments in different
flows, enabling offloading of large number of modify_header flows.

- Patch 1, 2: Add ICM pool for modify-header-pattern objects and implement
   patterns cache, allowing patterns reuse for different flows
- Patch 3: Allow for chunk allocation separately for STEv0 and STEv1
- Patch 4: Read related device capabilities
- Patch 5: Add create/destroy functions for the new general object type
- Patch 6: Add support for writing modify header argument to ICM
- Patch 7, 8: Some required fixes to support pattern/arg - separate read
   buffer from the write buffer and fix QP continuous allocation
- Patch 9: Add pool for modify header arg objects
- Patch 10, 11, 12: Implement MODIFY_HEADER and TNL_L3_TO_L2 actions with
   the new patterns/args design
- Patch 13: Optimization - set modify header action of size 1 directly on
   the STE instead of separate pattern/args combination
- Patch 14: Adjust debug dump for patterns/args
- Patch 15: Enable patterns and arguments for supporting devices

=======================

Merge branch 'ovs-selftests'

Aaron Conole says:

====================
selftests: openvswitch: add support for testing upcall interface

The existing selftest suite for openvswitch will work for regression
testing the datapath feature bits, but won't test things like adding
interfaces, or the upcall interface.  Here, we add some additional
test facilities.

First, extend the ovs-dpctl.py python module to support the OVS_FLOW
and OVS_PACKET netlink families, with some associated messages.  These
can be extended over time, but the initial support is for more well
known cases (output, userspace, and CT).

Next, extend the test suite to test upcalls by adding a datapath,
monitoring the upcall socket associated with the datapath, and then
dumping any upcalls that are received.  Compare with expected ARP
upcall via arping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: openvswitch: add support for upcall testing

The upcall socket interface can be exercised now to make sure that
future feature adjustments to the field can maintain backwards
compatibility.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: openvswitch: add flow dump support

Add a basic set of fields to print in a 'dpflow' format. This will be
used by future commits to check for flow fields after parsing, as
well as verifying the flow fields pushed into the kernel from
userspace.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: openvswitch: add interface support

Includes an associated test to generate netns and connect
interfaces, with the option to include packet tracing.

This will be used in the future when flow support is added
for additional test cases.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: micrel: Fix PTP_PF_PEROUT for lan8841

If the 1PPS output was enabled and then lan8841 was configured to be a
follower, then target clock which is used to generate the 1PPS was not
configure correctly. The problem was that for each adjustments of the
time, also the nanosecond part of the target clock was changed.
Therefore the initial nanosecond part of the target clock was changed.
The issue can be observed if both the leader and the follower are
generating 1PPS and see that their PPS are not aligned even if the time
is allined.
The fix consists of not modifying the nanosecond part of the target
clock when adjusting the time. In this way the 1PPS get also aligned.

Fixes: e4ed8ba08e3f ("net: phy: micrel: Add support for PTP_PF_PEROUT for lan8841")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'page_pool-allow-caching-from-safely-localized-napi'

Jakub Kicinski says:

====================
page_pool: allow caching from safely localized NAPI

I went back to the explicit "are we in NAPI method", mostly
because I don't like having both around :( (even tho I maintain
that in_softirq() && !in_hardirq() is as safe, as softirqs do
not nest).

Still returning the skbs to a CPU, tho, not to the NAPI instance.
I reckon we could create a small refcounted struct per NAPI instance
which would allow sockets and other users so hold a persisent
and safe reference. But that's a bigger change, and I get 90+%
recycling thru the cache with just these patches (for RR and
streaming tests with 100% CPU use it's almost 100%).

Some numbers for streaming test with 100% CPU use (from previous version,
but really they perform the same):

HW-GRO page=page
before after before after
recycle:
cached: 0 138669686 0 150197505
cache_full: 0    223391 0     74582
ring: 138551933         9997191 149299454 0
ring_full: 0             488      3154    127590
released_refcnt: 0 0 0 0

alloc:
fast: 136491361 148615710 146969587 150322859
slow:      1772      1799       144       105
slow_high_order: 0 0 0 0
empty:      1772      1799       144       105
refill:   2165245    156302   2332880      2128
waive: 0 0 0 0

v1: https://lore.kernel.org/all/20230411201800.596103-1-kuba@kernel.org/
rfcv2: https://lore.kernel.org/all/20230405232100.103392-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230413042605.895677-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt: hook NAPIs to page pools

bnxt has 1:1 mapping of page pools and NAPIs, so it's safe
to hoook them up together.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

page_pool: allow caching from safely localized NAPI

Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: skb: plumb napi state thru skb freeing paths

We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.

Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: DR, Enable patterns and arguments for supporting devices

Check if patterns and arguments for modify header action
are supported and enable them accordingly.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add support for the pattern/arg parameters in debug dump

Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Modify header action of size 1 optimization

Set modify header action of size 1 directly on the STE for supporting
devices, thus reducing number of hops and cache misses.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Support decap L3 action using pattern / arg mechanism

Use the new accelerated action for decap L3 on RX side:
use the mechanism of pattern and argument same as in
modify-header action.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Apply new accelerated modify action and decapl3

If there is support for pattern/args, use the new accelerated modify
header action for modify header and decap L3 actions.
Otherwise fall back to the old modify-header implementation.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add modify header argument pointer to actions attributes

While building the actions, add the pointer of the arguments for
accelerated modify list action into the action's attributes.
This will be used later on while building the specific STE
for this action.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add modify header arg pool mechanism

Added new mechanism for handling arguments for modify-header action.
The new action "accelerated modify-header" asks for the arguments from
separated area from the pattern, this area accessed via general objects.
Handling of these object is done via the pool-manager struct.

When the new header patterns are supported, while loading the domain,
a few pools for argument creations will be created. The requests for
allocating/deallocating arg objects are done via the pool manager API.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Fix QP continuous allocation

When allocating a QP we allocate an RQ and an SQ, the RQ is stored first
in memory and followed by the SQ.
This allocation is not physically continiuos - it may span across different
physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read,
or 2 continuous BBs of GTA WQE.

This lead to an issue where RQ allocation was 4x16 which is equal to 1 WQE BB,
causing 1 BB offset in the page and splitting the GTA WQE between different
physical pages.

The solution was to create the RQ with a even number of BBs and to have the
RQ aligned to a page.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Read ICM memory into dedicated buffer

Instead of using the write buffer for reading we will use a dedicated
buffer only for reading ICM memory.
Due to the new support for args, we can have a case with pending_wc
being odd number, and with reading into the same write buffer, it is
possible to overwrite next write on the same slot.
For example:
pending_wc is 17 so the buffer for write is:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
and we have requests as follows:
r wr wr wr wr wr wr wr wr
Now, the first read will be written into the last write because we use
the same buffer for read and write, before it was written to the HW and
we will have a wrong data in the ICM area.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add support for writing modify header argument

The accelerated modify header arguments are written in the HW area
with special WQE and specific data format.
New function was added to support writing of new argument type.
Note that GTA WQE is larger than READ and WRITE, so the queue
management logic was updated to support this.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add create/destroy for modify-header-argument general object

Add functions for creation/destruction of the new type of general object.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Check for modify_header_argument device capabilities

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>