review.tizen.org Git - platform/kernel/linux-starfive.git/log

net/mlx5e: xsk: Use KSM for unaligned XSK

UMR MTTs used in striding RQ have certain alignment requirements. While
it's guaranteed to work when UMR pages are aligned to the UMR page size,
in practice it works then UMR pages are aligned to 8 bytes. However,
it's still not enough flexibility for the unaligned mode of XSK. This
patch leverages KSM to map UMR pages without alignment requirements,
when unaligned XSK is active. The downside is that KSM entries are twice
as big as MTTs, which limits the maximum WQE size, so regular RQs and
aligned XSK continue using MTTs.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add MLX5_FLEXIBLE_INLEN to safely calculate cmd inlen

Some commands use a flexible array after a common header. Add a macro to
safely calculate the total input length of the command, detecting
overflows and printing errors with specific values when such overflows
happen.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Keep a separate MKey for striding RQ

Currently, rq->mkey_be keeps a big-endian value of either the PA MKey
(for legacy RQ, no address translation) or MTT MKey (for striding RQ,
direct address translation). Striding RQ stores the same value in
rq->umr_mkey in the native endianness.

The next commit will make striding RQ use KSM MKey (indirect address
translation) for the unaligned mode of XSK, which will require storing
both KSM MKey and PA MKey in the RQ struct. This commit optimizes fields
of mlx5e_rq: umr_mkey is removed (it's redundant), mkey_be always points
to the PA MKey, and mpwqe.umr_mkey_be points to the MTT MKey (or to the
KSM MKey, starting from the next commit).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Use XSK frame size as striding RQ page size

XSK RQs support striding RQ linear mode, but the stride size is always
set to PAGE_SIZE. It may be larger than the XSK frame size,
unnecessarily reducing the useful space in a WQE, but more importantly
causing UMEM data corruption in certain cases.

Normally, stride size bigger than XSK frame size is not a problem if the
hardware enforces the MTU. However, traffic between vports skips the
hardware MTU check, and oversized packets may be received.

If an oversized packet is bigger than the XSK frame but not bigger than
the stride, it will cause overwriting of the adjacent UMEM region. If
the packet takes more than one stride, they can be recycled for reuse
so it's not a problem when the XSK frame size matches the stride size.

To reduce the impact of the above issue, attempt to use the MTT page
size for striding RQ that matches the XSK frame size, allowing to safely
use 2048-byte frames on an up-to-date firmware.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use runtime page_shift for striding RQ

This commit allows striding RQ to determine MTT page size at runtime,
instead of sticking to the compile-time PAGE_SIZE. This functionality
will be used by a following commit that adjusts the MTT page size to the
XSK frame size.

Stick with PAGE_SIZE for XSK on legacy RQ, as frag_stride is not used in
data path, it only helps calculate how pages are partitioned into
fragments, and PAGE_SIZE will ensure each fragment starts at the
beginning of a new allocation unit (XSK frame).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xsk: Expose min chunk size to drivers

Drivers should be aware of the range of valid UMEM chunk sizes to be
able to allocate their internal structures of an appropriate size. It
will be used by mlx5e in the following patches.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
CC: "Björn Töpel" <bjorn@kernel.org>
CC: Magnus Karlsson <magnus.karlsson@intel.com>
CC: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip6_vti:Remove the space before the comma

There should be no space before the comma

Signed-off-by: Hongbin Wang <wh_bin@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Mediatek-mt8188'

Jianguo Zhang says:

====================
Mediatek ethernet patches for mt8188

Changes in v7:

v7:
1) Add 'Reviewed-by: AngeloGioacchino Del Regno
<angelogioacchino.delregno@collabora.com>' info in commit message of
patch 'dt-bindings: net: snps,dwmac: add new property snps,clk-csr',
'arm64: dts: mediatek: mt2712e: Update the name of property 'clk_csr''
and 'net: stmmac: add a parse for new property 'snps,clk-csr''.

v6:
1) Update commit message of patch 'dt-bindings: net: snps,dwmac: add new property snps,clk-csr'
2) Add a parse for new property 'snps,clk-csr' in patch
'net: stmmac: add a parse for new property 'snps,clk-csr''

v5:
1) Rename the property 'clk_csr' as 'snps,clk-csr' in binding
file as Krzysztof Kozlowski'comment.
2) Add DTS patch 'arm64: dts: mediatek: mt2712e: Update the name of property 'clk_csr''
as Krzysztof Kozlowski'comment.
3) Add driver patch 'net: stmmac: Update the name of property 'clk_csr''
as Krzysztof Kozlowski'comment.

v4:
1) Update the commit message of patch 'dt-bindings: net: snps,dwmac: add clk_csr property'
as Krzysztof Kozlowski'comment.

v3:
1) List the names of SoCs mt8188 and mt8195 in correct order as
AngeloGioacchino Del Regno's comment.
2) Add patch version info as Krzysztof Kozlowski'comment.

v2:
1) Delete patch 'stmmac: dwmac-mediatek: add support for mt8188' as
Krzysztof Kozlowski's comment.
2) Update patch 'dt-bindings: net: mediatek-dwmac: add support for
mt8188' as Krzysztof Kozlowski's comment.
3) Add clk_csr property to fix warning ('clk_csr' was unexpected) when
runnig 'make dtbs_check'.

v1:
1) Add ethernet driver entry for mt8188.
2) Add binding document for ethernet on mt8188.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: add a parse for new property 'snps,clk-csr'

Parse new property 'snps,clk-csr' firstly because the new property
is documented in binding file, if failed, fall back to old property
'clk_csr' for legacy case

Signed-off-by: Jianguo Zhang <jianguo.zhang@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

arm64: dts: mediatek: mt2712e: Update the name of property 'clk_csr'

Update the name of property 'clk_csr' as 'snps,clk-csr' to align with
the property name in the binding file.

Signed-off-by: Jianguo Zhang <jianguo.zhang@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: snps,dwmac: add new property snps,clk-csr

Add description for new property snps,clk-csr in binding file

Signed-off-by: Jianguo Zhang <jianguo.zhang@mediatek.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: mediatek-dwmac: add support for mt8188

Add binding document for the ethernet on mt8188

Signed-off-by: Jianguo Zhang <jianguo.zhang@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Fix spelling mistake "syndrom" -> "syndrome"

There is a spelling mistake in a devlink_health_report message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bna: Fix spelling mistake "muliple" -> "multiple"

There is a spelling mistake in a literal string in the array
bnad_net_stats_strings. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmveth: Ethtool set queue support

Implement channel management functions to allow dynamic addition and
removal of transmit queues. The `ethtool --show-channels` and
`ethtool --set-channels` commands can be used to get and set the
number of queues, respectively. Allow the ability to add as many
transmit queues as available processors but never allow more than the
hard maximum of 16. The number of receive queues is one and cannot be
modified.

Depending on whether the requested number of queues is larger or
smaller than the current value, either allocate or free long term
buffers. Since long term buffer construction and destruction can
occur in two different areas, from either channel set requests or
device open/close, define functions for performing this work. If
allocation of a new buffer fails, then attempt to revert back to the
previous number of queues.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmveth: Implement multi queue on xmit

The `ndo_start_xmit` function is protected by a spinlock on the tx queue
being used to transmit the skb. Allow concurrent calls to
`ndo_start_xmit` by using more than one tx queue. This allows for
greater throughput when several jobs are trying to transmit data.

Introduce 16 tx queues (leave single rx queue as is) which each
correspond to one DMA mapped long term buffer.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmveth: Copy tx skbs into a premapped buffer

Rather than DMA mapping and unmapping every outgoing skb, copy the skb
into a buffer that was mapped during the drivers open function. Copying
the skb and its frags have proven to be more time efficient than
mapping and unmapping. As an effect, performance increases by 3-5
Gbits/s.

Allocate and DMA map one continuous 64KB buffer at `ndo_open`. This
buffer is maintained until `ibmveth_close` is called. This buffer is
large enough to hold the largest possible linnear skb. During
`ndo_start_xmit`, copy the skb and all of it's frags into the continuous
buffer. By manually linnearizing all the socket buffers, time is saved
during memcpy as well as more efficient handling in FW.
As a result, we no longer need to worry about the firmware limitation
of handling a max of 6 frags. So, we only need to maintain 1 descriptor
instead of 6 and can hardcode 0 for the other 5 descriptors during
h_send_logical_lan.

Since, DMA allocation/mapping issues can no longer arise in xmit
functions, we can further reduce code size by removing the need for a
bounce buffer on DMA errors.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2: Fix spelling mistake "bufferred" -> "buffered"

There are spelling mistakes in two literal strings. Fix these.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: assign path_cost for 2.5G and 5G link speed

As 2.5G, 5G ethernet ports are more common and affordable,
these ports are being used in LAN bridge devices.
STP port_cost() is missing path_cost assignment for these link speeds,
causes highest cost 100 being used.
This result in lower speed port being picked
when there is loop between 5G and 1G ports.

Original path_cost: 10G=2, 1G=4, 100m=19, 10m=100
Adjusted path_cost: 10G=2, 5G=3, 2.5G=4, 1G=5, 100m=19, 10m=100
speed greater than 10G = 1

Signed-off-by: Steven Hsieh <steven.hsieh@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Fix spelling mistake "tarffic" -> "traffic"

There is a spelling mistake in a netdev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: skbuff: refactor pskb_pull

pskb_may_pull already contains all of the checks performed by
pskb_pull.
Use pskb_may_pull for validation in pskb_pull, eliminating the
duplication and making __pskb_pull obsolete.
Replace __pskb_pull with pskb_pull where applicable.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bonding: Convert to use sysfs_emit()/sysfs_emit_at() APIs

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the value
to be returned to user space.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-sysfs: Convert to use sysfs_emit() APIs

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the value
to be returned to user space.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tun: Convert to use sysfs_emit() APIs

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the value
to be returned to user space.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-tsnep-multiqueue'

Gerhard Engleder says:

====================
tsnep: multi queue support and some other improvements

Add support for additional TX/RX queues along with RX flow classification
support.

Binding is extended to allow additional interrupts for additional TX/RX
queues. Also dma-coherent is allowed as minor improvement.

RX path optimisation is done by using page pool as preparations for future
XDP support.

v4:
- rework dma-coherent commit message (Krzysztof Kozlowski)
- fixed order of interrupt-names in binding (Krzysztof Kozlowski)
- add line break between examples in binding (Krzysztof Kozlowski)
- add RX_CLS_LOC_ANY support to RX flow classification

v3:
- now with changes in cover letter

v2:
- use netdev_name() (Jakub Kicinski)
- use ENOENT if RX flow rule is not found (Jakub Kicinski)
- eliminate return code of tsnep_add_rule() (Jakub Kicinski)
- remove commit with lazy refill due to depletion problem (Jakub Kicinski)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Use page pool for RX

Use page pool for RX buffer handling. Makes RX path more efficient and
is required prework for future XDP support.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Add EtherType RX flow classification support

Received Ethernet frames are assigned to first RX queue per default.
Based on EtherType Ethernet frames can be assigned to other RX queues.
This enables processing of real-time Ethernet protocols on dedicated
RX queues.

Add RX flow classification interface for EtherType based RX queue
assignment.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Support multiple TX/RX queue pairs

Support additional TX/RX queue pairs if dedicated interrupt is
available. Interrupts are detected by name in device tree.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tsnep: Move interrupt from device to queue

For multiple queues multiple interrupts shall be used. Therefore, rework
global interrupt to per queue interrupt.

Every interrupt name shall contain interface name and queue information.
To get a valid interface name, the interrupt request needs to by done
during open like in other drivers. Additionally, this allows the removal
of some initialisation checks in the interrupt handler.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: tsnep: Allow additional interrupts

Additional TX/RX queue pairs require dedicated interrupts. Extend
binding with additional interrupts.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: tsnep: Allow dma-coherent

Within SoCs like ZynqMP, FPGA logic can be connected to different kinds
of AXI master ports. Also cache coherent AXI master ports are available.
The property "dma-coherent" is used to signal that DMA is cache
coherent.

Add "dma-coherent" property to allow the configuration of cache coherent
DMA.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2022-09-28 (ice)

Arkadiusz implements a single pin initialization function, checking feature
bits, instead of having separate device functions and updates sub-device
IDs for recognizing E810T devices.

Martyna adds support for switchdev filters on VLAN priority field.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Add support for VLAN priority filters in switchdev
  ice: support features on new E810T variants
  ice: Merge pin initialization of E810 and E810T adapters
====================

Link: https://lore.kernel.org/r/20220928203217.411078-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Convert to use sysfs_emit() APIs

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the value
to be returned to user space.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/1664364860-29153-1-git-send-email-wangyufen@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-tc-taprio-support-for-queuemaxsdu'

Vladimir Oltean says:

====================
Add tc-taprio support for queueMaxSDU

The tc-taprio offload mode supported by the Felix DSA driver has
limitations surrounding its guard bands.

The initial discussion was at:
https://lore.kernel.org/netdev/c7618025da6723418c56a54fe4683bd7@walle.cc/

with the latest status being that we now have a vsc9959_tas_guard_bands_update()
method which makes a best-guess attempt at how much useful space to
reserve for packet scheduling in a taprio interval, and how much to
reserve for guard bands.

IEEE 802.1Q actually does offer a tunable variable (queueMaxSDU) which
can determine the max MTU supported per traffic class. In turn we can
determine the size we need for the guard bands, depending on the
queueMaxSDU. This way we can make the guard band of small taprio
intervals smaller than one full MTU worth of transmission time, if we
know that said traffic class will transport only smaller packets.

As discussed with Gerhard Engleder, the queueMaxSDU may also be useful
in limiting the latency on an endpoint, if some of the TX queues are
outside of the control of the Linux driver.
https://patchwork.kernel.org/project/netdevbpf/patch/20220914153303.1792444-11-vladimir.oltean@nxp.com/

Allow input of queueMaxSDU through netlink into tc-taprio, offload it to
the hardware I have access to (LS1028A), and (implicitly) deny
non-default values to everyone else. Kurt Kanzenbach has also kindly
tested and shared a patch to offload this to hellcreek.

v3 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20220927234746.1823648-1-vladimir.oltean@nxp.com/
v2 at:
https://patchwork.kernel.org/project/netdevbpf/list/?series=679954&state=*
v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20220914153303.1792444-1-vladimir.oltean@nxp.com/
====================

Link: https://lore.kernel.org/r/20220928095204.2093716-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: offload per-tc max SDU from tc-taprio

The driver currently sets the PTCMSDUR register statically to the max
MTU supported by the interface. Keep this logic if tc-taprio is absent
or if the max_sdu for a traffic class is 0, and follow the requested max
SDU size otherwise.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: use common naming scheme for PTGCR and PTGCAPR registers

The Port Time Gating Control Register (PTGCR) and Port Time Gating
Capability Register (PTGCAPR) have definitions in the driver which
aren't in line with the other registers. Rename these.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: cache accesses to &priv->si->hw

The &priv->si->hw construct dereferences 2 pointers and makes lines
longer than they need to be, in turn making the code harder to read.

Replace &priv->si->hw accesses with a "hw" variable when there are 2 or
more accesses within a function that dereference this. This includes
loops, since &priv->si->hw is a loop invariant.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: Offload per-tc max SDU from tc-taprio

Add support for configuring the max SDU per priority and per port. If not
specified, keep the default.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: refactor hellcreek_port_setup_tc() to use switch/case

The following patch will need to make this function also respond to
TC_QUERY_BASE, so make the processing more structured around the
tc_setup_type.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: offload per-tc max SDU from tc-taprio

Our current vsc9959_tas_guard_bands_update() algorithm has a limitation
imposed by the hardware design. To avoid packet overruns between one
gate interval and the next (which would add jitter for scheduled traffic
in the next gate), we configure the switch to use guard bands. These are
as large as the largest packet which is possible to be transmitted.

The problem is that at tc-taprio intervals of sizes comparable to a
guard band, there isn't an obvious place in which to split the interval
between the useful portion (for scheduling) and the guard band portion
(where scheduling is blocked).

For example, a 10 us interval at 1Gbps allows 1225 octets to be
transmitted. We currently split the interval between the bare minimum of
33 ns useful time (required to schedule a single packet) and the rest as
guard band.

But 33 ns of useful scheduling time will only allow a single packet to
be sent, be that packet 1200 octets in size, or 60 octets in size. It is
impossible to send 2 60 octets frames in the 10 us window. Except that
if we reduced the guard band (and therefore the maximum allowable SDU
size) to 5 us, the useful time for scheduling is now also 5 us, so more
packets could be scheduled.

The hardware inflexibility of not scheduling according to individual
packet lengths must unfortunately propagate to the user, who needs to
tune the queueMaxSDU values if he wants to fit more small packets into a
10 us interval, rather than one large packet.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: taprio: allow user input of per-tc max SDU

IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data
types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the
existence of a per traffic class limitation of maximum frame sizes, with
a fallback on the port-based MTU.

As far as I am able to understand, the 802.1Q Service Data Unit (SDU)
represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding
any number of prepended VLAN headers which may be otherwise present in
the MSDU. Therefore, the queueMaxSDU is directly comparable to the
device MTU (1500 means L2 payload sizes are accepted, or frame sizes of
1518 octets, or 1522 plus one VLAN header). Drivers which offload this
are directly responsible of translating into other units of measurement.

To keep the fast path checks optimized, we keep 2 arrays in the qdisc,
one for max_sdu translated into frame length (so that it's comparable to
skb->len), and another for offloading and for dumping back to the user.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: query offload capabilities through ndo_setup_tc()

When adding optional new features to Qdisc offloads, existing drivers
must reject the new configuration until they are coded up to act on it.

Since modifying all drivers in lockstep with the changes in the Qdisc
can create problems of its own, it would be nice if there existed an
automatic opt-in mechanism for offloading optional features.

Jakub proposes that we multiplex one more kind of call through
ndo_setup_tc(): one where the driver populates a Qdisc-specific
capability structure.

First user will be taprio in further changes. Here we are introducing
the definitions for the base functionality.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tipc: Remove unused struct distr_queue_item

After commit 09b5678c778f("tipc: remove dead code in tipc_net and relatives"),
struct distr_queue_item is not used any more and can be removed as well.

Signed-off-by: Yuan Can <yuancan@huawei.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Link: https://lore.kernel.org/r/20220928085636.71749-1-yuancan@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: skb: introduce and use a single page frag cache

After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
for tiny skbs") we are observing 10-20% regressions in performance
tests with small packets. The perf trace points to high pressure on
the slab allocator.

This change tries to improve the allocation schema for small packets
using an idea originally suggested by Eric: a new per CPU page frag is
introduced and used in __napi_alloc_skb to cope with small allocation
requests.

To ensure that the above does not lead to excessive truesize
underestimation, the frag size for small allocation is inflated to 1K
and all the above is restricted to build with 4K page size.

Note that we need to update accordingly the run-time check introduced
with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").

Alex suggested a smart page refcount schema to reduce the number
of atomic operations and deal properly with pfmemalloc pages.

Under small packet UDP flood, I measure a 15% peak tput increases.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: cls_u32: Avoid memcpy() false-positive warning

To work around a misbehavior of the compiler's ability to see into
composite flexible array structs (as detailed in the coming memcpy()
hardening series[1]), use unsafe_memcpy(), as the sizing,
bounds-checking, and allocation are all very tightly coupled here.
This silences the false-positive reported by syzbot:

memcpy: detected field-spanning write (size 80) of single field "&n->sel" at net/sched/cls_u32.c:1043 (size 16)

[1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Reported-by: syzbot+a2c4601efc75848ba321@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/000000000000a96c0b05e97f0444@google.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220927153700.3071688-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: snps,dwmac: Document stmmac-axi-config subnode

The stmmac-axi-config subnode is present in multiple dwmac instance DTs,
document its content per snps,axi-config property description which is
a phandle to this subnode.

Signed-off-by: Marek Vasut <marex@denx.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220927012449.698915-1-marex@denx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: netlink: clarify the historical baggage of Netlink flags

nlmsg_flags are full of historical baggage, inconsistencies and
strangeness. Try to document it more thoroughly. Explain the meaning
of the ECHO flag (and while at it clarify the comment in the uAPI).
Handwave a little about the NEW request flags and how they make
sense on the surface but cater to really old paradigm before commands
were a thing.

I will add more notes on how to make use of ECHO and discouragement
for reuse of flags to the kernel-side documentation.

Link: https://lore.kernel.org/r/20220927212306.823862-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.0-rc8' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from wifi and can.

  Current release - regressions:

   - phy: don't WARN for PHY_UP state in mdio_bus_phy_resume()

   - wifi: fix locking in mac80211 mlme

   - eth:
      - revert "net: mvpp2: debugfs: fix memory leak when using debugfs_lookup()"
      - mlxbf_gige: fix an IS_ERR() vs NULL bug in mlxbf_gige_mdio_probe

  Previous releases - regressions:

   - wifi: fix regression with non-QoS drivers

  Previous releases - always broken:

   - mptcp: fix unreleased socket in accept queue

   - wifi:
      - don't start TX with fq->lock to fix deadlock
      - fix memory corruption in minstrel_ht_update_rates()

   - eth:
      - macb: fix ZynqMP SGMII non-wakeup source resume failure
      - mt7531: only do PLL once after the reset
      - usbnet: fix memory leak in usbnet_disconnect()

  Misc:

   - usb: qmi_wwan: add new usb-id for Dell branded EM7455"

* tag 'net-6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (30 commits)
  mptcp: fix unreleased socket in accept queue
  mptcp: factor out __mptcp_close() without socket lock
  net: ethernet: mtk_eth_soc: fix mask of RX_DMA_GET_SPORT{,_V2}
  net: mscc: ocelot: fix tagged VLAN refusal while under a VLAN-unaware bridge
  can: c_can: don't cache TX messages for C_CAN cores
  ice: xsk: drop power of 2 ring size restriction for AF_XDP
  ice: xsk: change batched Tx descriptor cleaning
  net: usb: qmi_wwan: Add new usb-id for Dell branded EM7455
  selftests: Fix the if conditions of in test_extra_filter()
  net: phy: Don't WARN for PHY_UP state in mdio_bus_phy_resume()
  net: stmmac: power up/down serdes in stmmac_open/release
  wifi: mac80211: mlme: Fix double unlock on assoc success handling
  wifi: mac80211: mlme: Fix missing unlock on beacon RX
  wifi: mac80211: fix memory corruption in minstrel_ht_update_rates()
  wifi: mac80211: fix regression with non-QoS drivers
  wifi: mac80211: ensure vif queues are operational after start
  wifi: mac80211: don't start TX with fq->lock to fix deadlock
  wifi: cfg80211: fix MCS divisor value
  net: hippi: Add missing pci_disable_device() in rr_init_one()
  net/mlxbf_gige: Fix an IS_ERR() vs NULL bug in mlxbf_gige_mdio_probe
  ...

Merge tag 'input-for-v6.0-rc7' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

- small fixes for iqs62x-keys and melfas_mip4 drivers

- corrected register address in snvs_pwrkey driver

- Synaptic driver will stop trying to use intertouch (native) mode on
   some Lenovo AMD devices

* tag 'input-for-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: snvs_pwrkey - fix SNVS_HPVIDR1 register address
  Input: synaptics - disable Intertouch for Lenovo T14 and P14s AMD G1
  Input: iqs62x-keys - drop unused device node references
  Input: melfas_mip4 - fix return value check in mip4_probe()

Merge tag 'ata-6.0-rc7' of git://git./linux/kernel/git/dlemoal/libata

Pull ATA fixes from Damien Le Moal:
"Three late patches to fix problems discovered recently:

   - Add a horkage to disable link power management by default for the
     Pioneer BDR-207M and BDR-205 DVD drives (from Niklas)

   - Two patches to fix setting the maximum queue depth of libsas owned
     ATA devices (from me)"

* tag 'ata-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: libata-sata: Fix device queue depth control
  ata: libata-scsi: Fix initialization of device queue depth
  libata: add ATA_HORKAGE_NOLPM for Pioneer BDR-207M and BDR-205

Merge tag 'loongarch-fixes-6.0-3' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
"Some trivial fixes and cleanup"

* tag 'loongarch-fixes-6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: Clean up loongson3_smp_ops declaration
  LoongArch: Fix and cleanup csr_era handling in do_ri()
  LoongArch: Align the address of kernel_entry to 4KB

net: cpmac: Add __init/__exit annotations to module init/exit funcs

Add __init/__exit annotations to module init/exit funcs

Signed-off-by: ruanjinjie <ruanjinjie@huawei.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220928031708.89120-1-ruanjinjie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ethernet: 8390: remove unnecessary check of mem

The 'mem' returned by platform_get_resource() has been checked in probe
function, so it is no need do this check in remove function.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20220927151406.797800-1-yangyingliang@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

nfp: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is clear.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Link: https://lore.kernel.org/r/20220927141835.19221-1-shangxiaojing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: liquidio: Remove unused struct lio_trusted_vf_ctx

After commit 6870957ed5bc("liquidio: make soft command calls synchronous"), no
one use struct lio_trusted_vf_ctx, so remove it.

Signed-off-by: Yuan Can <yuancan@huawei.com>
Link: https://lore.kernel.org/r/20220927133940.104181-1-yuancan@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: mtk_eth_soc: use DEFINE_SHOW_ATTRIBUTE to simplify code

Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code.
No functional change.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Link: https://lore.kernel.org/r/20220927111925.2424100-1-liushixin2@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

wwan_hwsim: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is clear.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Link: https://lore.kernel.org/r/20220927024511.14665-1-shangxiaojing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ax88796c: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is clear.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Link: https://lore.kernel.org/r/20220927023043.17769-1-shangxiaojing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ethernet: s2io: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is shorter
and clear. Drop the tmp variable that is not needed any more.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Link: https://lore.kernel.org/r/20220927022802.16050-1-shangxiaojing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'mlx5-updates-2022-09-27' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2022-09-27

This is Part #1 of 4 parts series to align mlx5's implementation of
XSK (AF_XDP) RX-Qs indexing and management with other vendors:

Maxim Says:
===========

xsk: Bug fixes for frame mapping on striding RQ

Striding RQ relies on the driver mapping RX buffers into the NIC's
virtual memory space. Currently, regadless of the XSK frame size, mlx5e
maps them using MTT, and each mapping's length is PAGE_SIZE. As the
result, the stride size used by striding RQ is also equal to PAGE_SIZE.

This decision has the following issues:

1. In the XSK aligned mode with frame size smaller than PAGE_SIZE, it's
suboptimal. Using 2K strides and 2K pages allows to post twice as fewer
WQEs.

2. MTT is not suitable for unaligned frames, as it requires natural
alignment theoretically, in practice at least 8-byte alignment.

3. Using mapping and stride bigger than the frame has risk of writing
over the bounds of the XSK frame upon receiving packets bigger than MTU,
which is possible in some specific configurations.

This series addresses issues 1 and 2 and alleviates issue 3. Where
possible, page and stride size will match the XSK frame size (firmware
upgrade may be needed to have effect for 2K frames). Unaligned mode will
use KSM instead of MTT, which allows to drop the partial workaround [1].

[1]: https://lore.kernel.org/netdev/YufYFQ6JN91lQbso@boxer/T/
====================

Link: https://lore.kernel.org/r/20220927203611.244301-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use runtime values of striding RQ parameters in datapath

Some of the parameters of striding RQ are compile-time constants, but
they are going to become dynamically calculated at runtime in a
following commit. This commit prepares the datapath to take cached
runtime parameters, prefilled at queue creation.

New fields added to struct mlx5e_rq fit into an existing 7-byte hole.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Make dma_info array dynamic in struct mlx5e_mpw_info

This commit moves the dma_info array to the end of struct mlx5e_mpw_info
to make it a flexible array. It also removes the intermediate struct
mlx5e_umr_dma_info, which used to contain only this array. The
flexibility of dma_info will allow to choose its size dynamically in a
following commit.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Improve the MTU change shortcut

Normally, the MTU change requires reopening the channels, but it can be
skipped if the new MTU doesn't change any of the queue parameters and if
MTU is not used in the data path.

The shortcut is applicable to the non-linear mode of striding RQ,
because the only thing affected by MTU is the queue length. As ethtool
sets the queue length in packets, but striding RQ length is defined in
strides or bytes, we estimate the RQ length to be at least as big as the
requested number of MTU-sized packets, that's why it depends on MTU.

Improve the shortcut by actually checking whether the RQ length stayed
the same, instead of an intermediate step in the calculation.

As MTU also affects the SHAMPO parameters, skip the shortcut if SHAMPO
is in use.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Fix SKB headroom calculation in validation

In a typical scenario, if an XSK socket is opened first, then an XDP
program is attached, mlx5e_validate_xsk_param will be called twice:
first on XSK bind, second on channel restart caused by enabling XDP. The
validation includes a call to mlx5e_rx_is_linear_skb, which checks the
presence of the XDP program.

The above means that mlx5e_rx_is_linear_skb might return true the first
time, but false the second time, as mlx5e_rx_get_linear_sz_skb's return
value will increase, because of a different headroom used with XDP.

As XSK RQs never exist without XDP, it would make sense to trick
mlx5e_rx_get_linear_sz_skb into thinking XDP is enabled at the first
check as well. This way, if MTU is too big, it would be detected on XSK
bind, without giving false hope to the userspace application.

However, it turns out that this check is too restrictive in the first
place. SKBs created on XDP_PASS on XSK RQs don't have any headroom. That
means that big MTUs filtered out on the first and the second checks
might actually work.

So, address this issue in the proper way, but taking into account the
absence of the SKB headroom on XSK RQs, when calculating the buffer
size.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Remove dead code in validation

One of the checks in mlx5e_rx_is_linear_skb verifies that the RX buffer
fits into the XSK frame size. Remove the duplicating check from
mlx5e_validate_xsk_param. It allows to make mlx5e_rx_get_min_frag_sz
static.

Remove mlx5e_rx_is_xdp altogether, as its only usage is located in a
branch where xsk == NULL.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Simplify stride size calculation for linear RQ

Linear RX buffers must be big enough to fit the MTU-sized packet along
with the headroom. On the other hand, they must be small enough to fit
into a page (or into an XSK frame). A straightforward way to check
whether the linear mode is possible would be comparing the required
buffer size to PAGE_SIZE or XSK frame size.

Stride size in the linear mode is defined by the following constraints:

1. A stride is at least as big as the buffer size, and it's a power of
two.

2. If non-XSK XDP is enabled, the stride size is PAGE_SIZE, because
mlx5e requires each packet to be in its own page when XDP is in use. The
previous constraint is automatically fulfilled, because buffer size
can't be bigger than PAGE_SIZE.

3. XSK uses stride size equal to PAGE_SIZE, but the following commits
will allow it to use roundup_pow_of_two(XSK frame size), by allowing the
NIC's MMU to use page sizes not equal to the CPU page size.

This commit puts the above requirements and constraints straight to the
code in an attempt to simplify it and to prepare it for changes made in
the next patches.

For the reference, the old code uses an equivalent, but trickier
calculation (high-level simplified pseudocode):

    if XDP or XSK:
        mlx5e_rx_get_linear_frag_sz := max(buffer size, PAGE_SIZE)
    else:
        mlx5e_rx_get_linear_frag_sz := buffer size
    mlx5e_rx_is_linear_skb := mlx5e_rx_get_linear_frag_sz <= PAGE_SIZE
    stride size := roundup_pow_of_two(mlx5e_rx_get_linear_frag_sz)

The new code effectively removes mlx5e_rx_get_linear_frag_sz that used
to return either buffer size or stride size, depending on the situation,
making it hard to work with and to make changes:

    if XDP or XSK:
        mlx5e_rx_get_linear_stride_sz := PAGE_SIZE
    else
        mlx5e_rx_get_linear_stride_sz := roundup_pow_of_two(buffer size)
    mlx5e_rx_is_linear_skb := buffer size <= (PAGE_SIZE or XSK frame sz)
    stride size := mlx5e_rx_get_linear_stride_sz

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: kTLS, Check ICOSQ WQE size in advance

Instead of WARNing in runtime when TLS offload WQEs posted to ICOSQ are
over the hardware limit, check their size before enabling TLS RX
offload, and block the offload if the condition fails. It also allows to
drop a u16 field from struct mlx5e_icosq.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use the aligned max TX MPWQE size

TX MPWQE size is limited to the cacheline-aligned maximum. Use the same
value for the stop room and the capability check.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fix a typo in mlx5e_xdp_mpwqe_is_full

Fix a typo in the function name: mpqwe -> mpwqe (stands for multi-packet
work queue element).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use mlx5e_stop_room_for_max_wqe where appropriate

mlx5e_alloc_xdpsq calculates sq->stop_room internally, but there is
already a function for that: mlx5e_stop_room_for_max_wqe. This commit
makes use of this function.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Let mlx5e_get_sw_max_sq_mpw_wqebbs accept mdev

To shorten and simplify code, let mlx5e_get_sw_max_sq_mpw_wqebbs accept
mdev and derive max SQ WQEBBs from it. Also rename the function to a
more generic name mlx5e_get_max_sq_aligned_wqebbs, because the following
patches will use it in non-MPWQE contexts.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Validate striding RQ before enabling XDP

Currently, the driver can silently fall back to legacy RQ after enabling
XDP, even if striding RQ was active before. It happens when PAGE_SIZE is
bigger than the maximum supported stride size. This commit changes this
behavior to more straightforward: if an operation (enabling XDP) doesn't
support the current parameters (striding RQ mode), it fails.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Make mlx5e_verify_rx_mpwqe_strides static

mlx5e_verify_rx_mpwqe_strides is only used in en/params.c, so it can be
made static and removed from en/params.h.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Remove unused fields from datapath structs

No need to keep max_sq_wqebbs in mlx5e_txqsq and mlx5e_xdpsq, as it's
only used when allocating the queues. Removing an extra field reduces
the struct size.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Convert mlx5e_get_max_sq_wqebbs to u8

The return value of mlx5e_get_max_sq_wqebbs is clamped down to
MLX5_SEND_WQE_MAX_WQEBBS = 16, which fits into u8. This commit changes
the return type of this function to u8 for stricter type safety.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add the log_min_mkey_entity_size capability

Add the capability that will allow the driver to determine the minimal
MTT page size to be able to map the smallest possible pages in XSK. The
older firmwares that don't have this capability default to 12 (i.e.
4096-byte pages).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
updates from mlx5-next 2022-09-24

Updates form mlx5-next including[1]:

1) HW definitions and support for NPPS clock settings.

2) various cleanups

3) Enable hash mode by default for all NICs

4) page tracker and advanced virtualization HW definitions for vfio

[1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Remove from FPGA IFC file not-needed definitions
  net/mlx5: Remove unused structs
  net/mlx5: Remove unused functions
  net/mlx5: detect and enable bypass port select flow table
  net/mlx5: Lag, enable hash mode by default for all NICs
  net/mlx5: Lag, set active ports if support bypass port select flow table
  RDMA/mlx5: Don't set tx affinity when lag is in hash mode
  net/mlx5: add IFC bits for bypassing port select flow table
  net/mlx5: Add support for NPPS with real time mode
  net/mlx5: Expose NPPS related registers
  net/mlx5: Query ADV_VIRTUALIZATION capabilities
  net/mlx5: Introduce ifc bits for page tracker
  RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
====================

Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sunhme: Fix undersized zeroing of quattro->happy_meals

Just use kzalloc instead.

Fixes: d6f1e89bdbb8 ("sunhme: Return an ERR_PTR from quattro_pci_find")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Link: https://lore.kernel.org/r/20220928004157.279731-1-seanga2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: iosm: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is clear.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Reviewed-by: M Chetan Kumar <m.chetan.kumar@intel.com>
Link: https://lore.kernel.org/r/20220927023254.30342-1-shangxiaojing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rework-resource-allocation-in-felix-dsa-driver'

Vladimir Oltean says:

====================
Rework resource allocation in Felix DSA driver

The Felix DSA driver controls NXP variations of Microchip switches.
Colin Foster is trying to add support in this driver for "genuine"
Microchip hardware, but some of the NXP-isms in this driver need to go
away before that happens cleanly.
https://patchwork.kernel.org/project/netdevbpf/cover/20220926002928.2744638-1-colin.foster@in-advantage.com/

The starting point was Colin's patch 08/14 "net: dsa: felix: update
init_regmap to be string-based", and this continues to be the central
theme here, but things are done differently.

In short (full explanations are in patches), the goal is for MFD-based
switches like Colin's SPI-controlled VSC7512 to be able to request a
regmap that was created 100% externally (by drivers/mfd/ocelot-core.c)
in a very simple way, that does not create dependencies on other
modules. That is dev_get_regmap(), and as input it wants a string, for
the resource name. So we rework the resource allocation in this driver
to be based on string names provided by the specific instantiation (in
Colin's case, ocelot_ext.c).

Patch set was boot-tested on NXP LS1028A.
====================

Link: https://lore.kernel.org/r/20220927191521.1578084-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: update regmap requests to be string-based

Existing felix DSA drivers (vsc9959, vsc9953) are all switches that were
integrated in NXP SoCs, which makes them a bit unusual compared to the
usual Microchip branded Ocelot switches.

To be precise, looking at
Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml, one can
see 21 memory regions for the "switch" node, and these correspond to the
"targets" of the switch IP, which are spread throughout the guts of that
SoC's memory space.

In NXP integrations, those targets still exist, but they were condensed
within a single memory region, with no other peripheral in between them,
so it made more sense for the driver to ioremap the entire memory space
of the switch, and then find the targets within that memory space via
some offsets hardcoded in the driver.

The effect of this design decision is that now, the felix driver expects
hardware instantiations to provide their own resource definitions, which
is kind of odd when considering a typical device (those are retrieved
from 'reg' properties in the device tree, using platform_get_resource()
or similar).

Allow other hardware instantiations that share the felix driver to not
provide a hardcoded array of resources in the future. Instead, make the
common denominator based on which regmaps are created be just the
resource "names". Each instantiation comes with its own array of names
that are mandatory for it, and with an optional array of resources.

So we split the resources in 2 arrays, one is what's requested and the
other is what's provided. There is one pool of provided resources, in
felix->info->resources (of length felix->info->num_resources). There are
2 different ways of requesting a resource. One is by enum ocelot_target
(this handles the global regmaps), and one is by int port (this handles
the per-port ones).

For the existing vsc9959 and vsc9953, it would be a bit stupid to
request something that's not provided, given that the 2 arrays are both
defined in the same place.

The advantage is that we can now modify felix_request_regmap_by_name()
to make felix->info->resources[] optional, and if absent, the
implementation can call dev_get_regmap() and this is something that is
compatible with MFD.

Co-developed-by: Colin Foster <colin.foster@in-advantage.com>
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

LoongArch: Clean up loongson3_smp_ops declaration

Since loongson3_smp_ops is not used in LoongArch anymore, let's remove
it for cleanup.

Fixes: f2ac457a6138 ("LoongArch: Add CPU definition headers")
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Fix and cleanup csr_era handling in do_ri()

We don't emulate reserved instructions and just send a signal to the
current process now. So we don't need to call compute_return_era() to
add 4 (point to the next instruction) to csr_era in pt_regs. RA/ERA's
backup/restore is cleaned up as well.

Signed-off-by: Jun Yi <yijun@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Align the address of kernel_entry to 4KB

Align the address of kernel_entry to 4KB, to avoid early tlb miss
exception in case the entry code crosses page boundary.

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

net: dsa: felix: use DEFINE_RES_MEM_NAMED for resources

Use less verbose resource definitions in vsc9959 and vsc9953. This also
sets IORESOURCE_MEM in the constant array of resources, so we don't have
to do this from felix_init_structs() - in fact, in the future, we may
even support IORESOURCE_REG resources.

Note that this macro takes start and length as argument, and we had
start and end before. So transform end into length.

While at it, sort the resources according to their offset.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: remove felix_info :: init_regmap

It turns out that the idea of having a customizable implementation of a
regmap creation from a resource is not exactly useful. The idea was for
the new MFD-based VSC7512 driver to use something that creates a SPI
regmap from a resource. But there are problems in actually getting those
resources (it involves getting them from MFD).

To avoid all that, we'll be getting resources by name, so this custom
init_regmap() method won't be needed. Remove it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: remove felix_info :: imdio_base

This address is only relevant for the vsc9959, which is a PCIe device
that holds its switch registers in a different PCIe BAR compared to the
registers for the internal MDIO controller.

Hide this aspect from the common felix driver and move the
pci_resource_start() call to the only place that needs it, which is in
vsc9959_mdio_bus_alloc().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: remove felix_info :: imdio_res

The imdio_res is used only by vsc9959, which references its own
vsc9959_imdio_res through the common felix_info->imdio_res pointer.
Since the common code doesn't care about this resource (and it can't be
part of the common array of resources, either, because it belongs in a
different PCI BAR), just reference it directly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-properly-clean-up-unaccepted-subflows'

Mat Martineau says:

====================
mptcp: Properly clean up unaccepted subflows

Patch 1 factors out part of the mptcp_close() function for use by a caller
that already owns the socket lock. This is a prerequisite for patch 2.

Patch 2 is the fix that fully cleans up the unaccepted subflow sockets.
====================

Link: https://lore.kernel.org/r/20220927193158.195729-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix unreleased socket in accept queue

The mptcp socket and its subflow sockets in accept queue can't be
released after the process exit.

While the release of a mptcp socket in listening state, the
corresponding tcp socket will be released too. Meanwhile, the tcp
socket in the unaccept queue will be released too. However, only init
subflow is in the unaccept queue, and the joined subflow is not in the
unaccept queue, which makes the joined subflow won't be released, and
therefore the corresponding unaccepted mptcp socket will not be released
to.

This can be reproduced easily with following steps:

1. create 2 namespace and veth:
   $ ip netns add mptcp-client
   $ ip netns add mptcp-server
   $ sysctl -w net.ipv4.conf.all.rp_filter=0
   $ ip netns exec mptcp-client sysctl -w net.mptcp.enabled=1
   $ ip netns exec mptcp-server sysctl -w net.mptcp.enabled=1
   $ ip link add red-client netns mptcp-client type veth peer red-server \
     netns mptcp-server
   $ ip -n mptcp-server address add 10.0.0.1/24 dev red-server
   $ ip -n mptcp-server address add 192.168.0.1/24 dev red-server
   $ ip -n mptcp-client address add 10.0.0.2/24 dev red-client
   $ ip -n mptcp-client address add 192.168.0.2/24 dev red-client
   $ ip -n mptcp-server link set red-server up
   $ ip -n mptcp-client link set red-client up

2. configure the endpoint and limit for client and server:
   $ ip -n mptcp-server mptcp endpoint flush
   $ ip -n mptcp-server mptcp limits set subflow 2 add_addr_accepted 2
   $ ip -n mptcp-client mptcp endpoint flush
   $ ip -n mptcp-client mptcp limits set subflow 2 add_addr_accepted 2
   $ ip -n mptcp-client mptcp endpoint add 192.168.0.2 dev red-client id \
     1 subflow

3. listen and accept on a port, such as 9999. The nc command we used
   here is modified, which makes it use mptcp protocol by default.
   $ ip netns exec mptcp-server nc -l -k -p 9999

4. open another *two* terminal and use each of them to connect to the
   server with the following command:
   $ ip netns exec mptcp-client nc 10.0.0.1 9999
   Input something after connect to trigger the connection of the second
   subflow. So that there are two established mptcp connections, with the
   second one still unaccepted.

5. exit all the nc command, and check the tcp socket in server namespace.
   And you will find that there is one tcp socket in CLOSE_WAIT state
   and can't release forever.

Fix this by closing all of the unaccepted mptcp socket in
mptcp_subflow_queue_clean() with __mptcp_close().

Now, we can ensure that all unaccepted mptcp sockets will be cleaned by
__mptcp_close() before they are released, so mptcp_sock_destruct(), which
is used to clean the unaccepted mptcp socket, is not needed anymore.

The selftests for mptcp is ran for this commit, and no new failures.

Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests")
Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets")
Cc: stable@vger.kernel.org
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: factor out __mptcp_close() without socket lock

Factor out __mptcp_close() from mptcp_close(). The caller of
__mptcp_close() should hold the socket lock, and cancel mptcp work when
__mptcp_close() returns true.

This function will be used in the next commit.

Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests")
Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets")
Cc: stable@vger.kernel.org
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
ice: xsk: ZC changes

Maciej Fijalkowski says:

This set consists of two fixes to issues that were either pointed out on
indirectly (John was reviewing AF_XDP selftests that were testing ice's
ZC support) mailing list or were directly reported by customers.

First patch allows user space to see done descriptor in CQ even after a
single frame being transmitted and second patch removes the need for
having HW rings sized to power of 2 number of descriptors when used
against AF_XDP.

I also forgot to mention that due to the current Tx cleaning algorithm,
4k HW ring was broken and these two patches bring it back to life, so we
kill two birds with one stone.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: xsk: drop power of 2 ring size restriction for AF_XDP
ice: xsk: change batched Tx descriptor cleaning
====================

Link: https://lore.kernel.org/r/20220927164112.4011983-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: fix mask of RX_DMA_GET_SPORT{,_V2}

The bitmasks applied in RX_DMA_GET_SPORT and RX_DMA_GET_SPORT_V2 macros
were swapped. Fix that.

Reported-by: Chen Minqiang <ptpt52@gmail.com>
Fixes: 160d3a9b192985 ("net: ethernet: mtk_eth_soc: introduce MTK_NETSYS_V2 support")
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/YzMW+mg9UsaCdKRQ@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: fix tagged VLAN refusal while under a VLAN-unaware bridge

Currently the following set of commands fails:

$ ip link add br0 type bridge # vlan_filtering 0
$ ip link set swp0 master br0
$ bridge vlan
port              vlan-id
swp0              1 PVID Egress Untagged
$ bridge vlan add dev swp0 vid 10
Error: mscc_ocelot_switch_lib: Port with more than one egress-untagged VLAN cannot have egress-tagged VLANs.

Dumping ocelot->vlans, one can see that the 2 egress-untagged VLANs on swp0 are
vid 1 (the bridge PVID) and vid 4094, a PVID used privately by the driver for
VLAN-unaware bridging. So this is why bridge vid 10 is refused, despite
'bridge vlan' showing a single egress untagged VLAN.

As mentioned in the comment added, having this private VLAN does not impose
restrictions to the hardware configuration, yet it is a bookkeeping problem.

There are 2 possible solutions.

One is to make the functions that operate on VLAN-unaware pvids:
- ocelot_add_vlan_unaware_pvid()
- ocelot_del_vlan_unaware_pvid()
- ocelot_port_setup_dsa_8021q_cpu()
- ocelot_port_teardown_dsa_8021q_cpu()
call something different than ocelot_vlan_member_(add|del)(), the latter being
the real problem, because it allocates a struct ocelot_bridge_vlan *vlan which
it adds to ocelot->vlans. We don't really *need* the private VLANs in
ocelot->vlans, it's just that we have the extra convenience of having the
vlan->portmask cached in software (whereas without these structures, we'd have
to create a raw ocelot_vlant_rmw_mask() procedure which reads back the current
port mask from hardware).

The other solution is to filter out the private VLANs from
ocelot_port_num_untagged_vlans(), since they aren't what callers care about.
We only need to do this to the mentioned function and not to
ocelot_port_num_tagged_vlans(), because private VLANs are never egress-tagged.

Nothing else seems to be broken in either solution, but the first one requires
more rework which will conflict with the net-next change  36a0bf443585 ("net:
mscc: ocelot: set up tag_8021q CPU ports independent of user port affinity"),
and I'd like to avoid that. So go with the other one.

Fixes: 54c319846086 ("net: mscc: ocelot: enforce FDB isolation when VLAN-unaware")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220927122042.1100231-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: drop the weight argument from netif_napi_add

We tell driver developers to always pass NAPI_POLL_WEIGHT
as the weight to netif_napi_add(). This may be confusing
to newcomers, drop the weight argument, those who really
need to tweak the weight can use netif_napi_add_weight().

Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN
Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Fix incorrect address comparison when searching for a bind2 bucket

The v6_rcv_saddr and rcv_saddr are inside a union in the
'struct inet_bind2_bucket'. When searching a bucket by following the
bhash2 hashtable chain, eg. inet_bind2_bucket_match, it is only using
the sk->sk_family and there is no way to check if the inet_bind2_bucket
has a v6 or v4 address in the union. This leads to an uninit-value
KMSAN report in [0] and also potentially incorrect matches.

This patch fixes it by adding a family member to the inet_bind2_bucket
and then tests 'sk->sk_family != tb->family' before matching
the sk's address to the tb's address.

Cc: Joanne Koong <joannelkoong@gmail.com>
Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20220927002544.3381205-1-kafai@fb.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-mptcp-support-for-tcp_fastopen_connect'

Mat Martineau says:

====================
mptcp: MPTCP support for TCP_FASTOPEN_CONNECT

RFC 8684 appendix B describes how to use TCP Fast Open with MPTCP. This
series allows TFO use with MPTCP using the TCP_FASTOPEN_CONNECT socket
option. The scope here is limited to the initiator of the connection -
support for MSG_FASTOPEN and the listener side of the connection will be
in a separate series. The preexisting TCP fastopen code does most of the
work, so these changes mostly involve plumbing MPTCP through to those
TCP functions.

Patch 1 changes the MPTCP socket option code to pass the
TCP_FASTOPEN_CONNECT option through to the initial unconnected subflow.

Patch 2 exports the existing tcp_sendmsg_fastopen() function from tcp.c

Patch 3 adds the call to tcp_sendmsg_fastopen() from the MPTCP send
function.

Patch 4 modifies mptcp_poll() to handle the deferred TFO connection.
====================

Link: https://lore.kernel.org/r/20220926232739.76317-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: poll allow write call before actual connect

If fastopen is used, poll must allow a first write that will trigger
the SYN+data

Similar to what is done in tcp_poll().

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: handle defer connect in mptcp_sendmsg

When TCP_FASTOPEN_CONNECT has been set on the socket before a connect,
the defer flag is set and must be handled when sendmsg is called.

This is similar to what is done in tcp_sendmsg_locked().

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>