review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge tag 'mlx5-updates-2022-09-27' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2022-09-27

This is part 1 of a 4-part series to align mlx5's implementation of
XSK (AF_XDP) RX-Qs indexing and management with other vendors:

Maxim Says:
===========

xsk: Bug fixes for frame mapping on striding RQ

Striding RQ relies on the driver mapping RX buffers into the NIC's
virtual memory space. Currently, regardless of the XSK frame size, mlx5e
maps them using MTT, and each mapping's length is PAGE_SIZE. As the
result, the stride size used by striding RQ is also equal to PAGE_SIZE.

This decision has the following issues:

1. In the XSK aligned mode with frame size smaller than PAGE_SIZE, it's
suboptimal. Using 2K strides and 2K pages allows posting half as many
WQEs.

2. MTT is not suitable for unaligned frames, as it requires natural
alignment theoretically, in practice at least 8-byte alignment.

3. Using mapping and stride bigger than the frame has risk of writing
over the bounds of the XSK frame upon receiving packets bigger than MTU,
which is possible in some specific configurations.

This series addresses issues 1 and 2 and alleviates issue 3. Where
possible, page and stride size will match the XSK frame size (firmware
upgrade may be needed to have effect for 2K frames). Unaligned mode will
use KSM instead of MTT, which allows dropping the partial workaround [1].

[1]: https://lore.kernel.org/netdev/YufYFQ6JN91lQbso@boxer/T/
====================

Link: https://lore.kernel.org/r/20220927203611.244301-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
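
As a rough illustration of issue 1 in the cover letter above, the sketch below counts how many WQEs are needed to post a given number of XSK frames when each frame occupies one stride. The fixed WQE size and the function name are assumptions made for this example and are not taken from mlx5e.

```c
#include <stddef.h>

/* Hypothetical: bytes of RX buffer space described by one striding RQ WQE. */
#define SKETCH_WQE_BYTES (256u * 1024u)

/*
 * Number of WQEs needed to post nframes XSK frames when every frame
 * consumes exactly one stride of stride_bytes.
 */
size_t sketch_wqes_needed(size_t nframes, size_t stride_bytes)
{
	size_t frames_per_wqe = SKETCH_WQE_BYTES / stride_bytes;

	/* Round up so that a partially filled WQE is still counted. */
	return (nframes + frames_per_wqe - 1) / frames_per_wqe;
}
```

Under these assumed numbers, 1024 frames need 16 WQEs with 4096-byte strides and 8 WQEs with 2048-byte strides.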

net/mlx5e: Use runtime values of striding RQ parameters in datapath

Some of the parameters of striding RQ are compile-time constants, but
they are going to become dynamically calculated at runtime in a
following commit. This commit prepares the datapath to take cached
runtime parameters, prefilled at queue creation.

New fields added to struct mlx5e_rq fit into an existing 7-byte hole.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
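
The remark about fields fitting into an existing hole can be illustrated with a small standalone sketch. The struct layout and field names below are hypothetical and are not the real struct mlx5e_rq; the point is only that one-byte fields placed in alignment padding do not grow the struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout before the change: padding follows 'flags'. */
struct sketch_rq_before {
	void     *wqe_ring;
	uint8_t   flags;
	/* hole: the compiler pads here so that 'counter' is aligned */
	uint64_t  counter;
};

/* Same struct with three cached one-byte parameters placed in the hole. */
struct sketch_rq_after {
	void     *wqe_ring;
	uint8_t   flags;
	uint8_t   log_stride_sz;    /* hypothetical cached runtime value */
	uint8_t   log_num_strides;  /* hypothetical cached runtime value */
	uint8_t   pages_per_wqe;    /* hypothetical cached runtime value */
	uint64_t  counter;
};

/* Returns 1 if adding the cached fields left size and layout unchanged. */
int sketch_fields_fit_in_hole(void)
{
	return sizeof(struct sketch_rq_after) == sizeof(struct sketch_rq_before) &&
	       offsetof(struct sketch_rq_after, counter) ==
	       offsetof(struct sketch_rq_before, counter);
}
```

In a kernel tree, holes like this are usually located with a layout tool such as pahole rather than by hand.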

net/mlx5e: Make dma_info array dynamic in struct mlx5e_mpw_info

This commit moves the dma_info array to the end of struct mlx5e_mpw_info
to make it a flexible array. It also removes the intermediate struct
mlx5e_umr_dma_info, which used to contain only this array. The
flexibility of dma_info will allow choosing its size dynamically in a
following commit.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
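
A minimal userspace sketch of the flexible-array pattern described above follows. The struct and field names are simplified stand-ins, not the driver's definitions, and calloc() stands in for the kernel allocator; kernel code would normally compute the size with the struct_size() helper, which also guards against overflow.

```c
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_dma_info {
	uint64_t addr;
};

struct sketch_mpw_info {
	uint16_t consumed_strides;
	/* Flexible array member: must be last; its length is chosen at allocation. */
	struct sketch_dma_info dma_info[];
};

/* Allocate one zeroed info struct with room for 'pages' dma_info entries. */
struct sketch_mpw_info *sketch_mpw_info_alloc(size_t pages)
{
	/* Overflow checking is omitted in this sketch. */
	size_t bytes = sizeof(struct sketch_mpw_info) +
		       pages * sizeof(struct sketch_dma_info);

	return calloc(1, bytes);
}
```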

net/mlx5e: Improve the MTU change shortcut

Normally, the MTU change requires reopening the channels, but it can be
skipped if the new MTU doesn't change any of the queue parameters and if
MTU is not used in the data path.

The shortcut is applicable to the non-linear mode of striding RQ,
because the only thing affected by MTU is the queue length. As ethtool
sets the queue length in packets, but striding RQ length is defined in
strides or bytes, we estimate the RQ length to be at least as big as the
requested number of MTU-sized packets, which is why it depends on MTU.

Improve the shortcut by actually checking whether the RQ length stayed
the same, instead of an intermediate step in the calculation.

As MTU also affects the SHAMPO parameters, skip the shortcut if SHAMPO
is in use.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Fix SKB headroom calculation in validation

In a typical scenario, if an XSK socket is opened first, then an XDP
program is attached, mlx5e_validate_xsk_param will be called twice:
first on XSK bind, second on channel restart caused by enabling XDP. The
validation includes a call to mlx5e_rx_is_linear_skb, which checks the
presence of the XDP program.

The above means that mlx5e_rx_is_linear_skb might return true the first
time, but false the second time, as mlx5e_rx_get_linear_sz_skb's return
value will increase, because of a different headroom used with XDP.

As XSK RQs never exist without XDP, it would make sense to trick
mlx5e_rx_get_linear_sz_skb into thinking XDP is enabled at the first
check as well. This way, if MTU is too big, it would be detected on XSK
bind, without giving false hope to the userspace application.

However, it turns out that this check is too restrictive in the first
place. SKBs created on XDP_PASS on XSK RQs don't have any headroom. That
means that big MTUs filtered out on the first and the second checks
might actually work.

So, address this issue in the proper way, by taking into account the
absence of the SKB headroom on XSK RQs when calculating the buffer
size.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: xsk: Remove dead code in validation

One of the checks in mlx5e_rx_is_linear_skb verifies that the RX buffer
fits into the XSK frame size. Remove the duplicating check from
mlx5e_validate_xsk_param. This allows making mlx5e_rx_get_min_frag_sz
static.

Remove mlx5e_rx_is_xdp altogether, as its only usage is located in a
branch where xsk == NULL.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Simplify stride size calculation for linear RQ

Linear RX buffers must be big enough to fit the MTU-sized packet along
with the headroom. On the other hand, they must be small enough to fit
into a page (or into an XSK frame). A straightforward way to check
whether the linear mode is possible would be comparing the required
buffer size to PAGE_SIZE or XSK frame size.

Stride size in the linear mode is defined by the following constraints:

1. A stride is at least as big as the buffer size, and it's a power of
two.

2. If non-XSK XDP is enabled, the stride size is PAGE_SIZE, because
mlx5e requires each packet to be in its own page when XDP is in use. The
previous constraint is automatically fulfilled, because buffer size
can't be bigger than PAGE_SIZE.

3. XSK uses stride size equal to PAGE_SIZE, but the following commits
will allow it to use roundup_pow_of_two(XSK frame size), by allowing the
NIC's MMU to use page sizes not equal to the CPU page size.

This commit puts the above requirements and constraints straight to the
code in an attempt to simplify it and to prepare it for changes made in
the next patches.

For reference, the old code uses an equivalent, but trickier
calculation (high-level simplified pseudocode):

    if XDP or XSK:
        mlx5e_rx_get_linear_frag_sz := max(buffer size, PAGE_SIZE)
    else:
        mlx5e_rx_get_linear_frag_sz := buffer size
    mlx5e_rx_is_linear_skb := mlx5e_rx_get_linear_frag_sz <= PAGE_SIZE
    stride size := roundup_pow_of_two(mlx5e_rx_get_linear_frag_sz)

The new code effectively removes mlx5e_rx_get_linear_frag_sz that used
to return either buffer size or stride size, depending on the situation,
making it hard to work with and to make changes:

    if XDP or XSK:
        mlx5e_rx_get_linear_stride_sz := PAGE_SIZE
    else:
        mlx5e_rx_get_linear_stride_sz := roundup_pow_of_two(buffer size)
    mlx5e_rx_is_linear_skb := buffer size <= (PAGE_SIZE or XSK frame sz)
    stride size := mlx5e_rx_get_linear_stride_sz

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
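
The "new code" pseudocode in the commit above can be rendered as a small standalone C sketch. The function names, the 4096-byte page size and the parameter lists are assumptions made for illustration; the driver's real helpers take different arguments.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed CPU page size */

/* Smallest power of two that is >= n (requires 0 < n <= 2^31). */
uint32_t sketch_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Stride size for the linear mode, following the "new code" pseudocode. */
uint32_t sketch_linear_stride_sz(uint32_t buf_sz, bool xdp, bool xsk)
{
	if (xdp || xsk)
		return SKETCH_PAGE_SIZE;
	return sketch_roundup_pow_of_two(buf_sz);
}

/*
 * Linear mode is possible when the buffer fits into a page, or into the
 * XSK frame when XSK is in use (xsk_frame_sz is ignored otherwise).
 */
bool sketch_is_linear(uint32_t buf_sz, bool xsk, uint32_t xsk_frame_sz)
{
	uint32_t limit = xsk ? xsk_frame_sz : SKETCH_PAGE_SIZE;

	return buf_sz <= limit;
}
```

The sketch keeps the two questions separate, as the commit message describes: one function answers "how big is a stride" and the other answers "does the buffer fit".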

net/mlx5e: kTLS, Check ICOSQ WQE size in advance

Instead of WARNing at runtime when TLS offload WQEs posted to ICOSQ are
over the hardware limit, check their size before enabling TLS RX
offload, and block the offload if the condition fails. This also allows
dropping a u16 field from struct mlx5e_icosq.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
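
The idea of checking WQE sizes before enabling an offload, rather than warning in the datapath, might look like the sketch below. The 64-byte WQEBB size, the function names and the way the limit is passed in are assumptions made for this example, not the driver's code.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_WQE_BB 64u  /* assumed bytes per WQE basic block (WQEBB) */

/* Number of WQEBBs occupied by a WQE of wqe_bytes, rounded up. */
size_t sketch_wqe_bbs(size_t wqe_bytes)
{
	return (wqe_bytes + SKETCH_WQE_BB - 1) / SKETCH_WQE_BB;
}

/*
 * Decide up front whether an offload may be enabled: every WQE type it
 * posts must fit within the device limit, which is given in WQEBBs.
 */
bool sketch_offload_wqes_fit(const size_t *wqe_bytes, size_t n, size_t max_wqebbs)
{
	for (size_t i = 0; i < n; i++)
		if (sketch_wqe_bbs(wqe_bytes[i]) > max_wqebbs)
			return false;
	return true;
}
```

A caller would run this once when the offload is requested and refuse the request on failure, so no per-packet check is needed afterwards.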

net/mlx5e: Use the aligned max TX MPWQE size

TX MPWQE size is limited to the cacheline-aligned maximum. Use the same
value for the stop room and the capability check.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fix a typo in mlx5e_xdp_mpwqe_is_full

Fix a typo in the function name: mpqwe -> mpwqe (stands for multi-packet
work queue element).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use mlx5e_stop_room_for_max_wqe where appropriate

mlx5e_alloc_xdpsq calculates sq->stop_room internally, but there is
already a function for that: mlx5e_stop_room_for_max_wqe. This commit
makes use of this function.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Let mlx5e_get_sw_max_sq_mpw_wqebbs accept mdev

To shorten and simplify code, let mlx5e_get_sw_max_sq_mpw_wqebbs accept
mdev and derive max SQ WQEBBs from it. Also rename the function to a
more generic name mlx5e_get_max_sq_aligned_wqebbs, because the following
patches will use it in non-MPWQE contexts.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Validate striding RQ before enabling XDP

Currently, the driver can silently fall back to legacy RQ after enabling
XDP, even if striding RQ was active before. It happens when PAGE_SIZE is
bigger than the maximum supported stride size. This commit changes this
behavior to a more straightforward one: if an operation (enabling XDP) doesn't
support the current parameters (striding RQ mode), it fails.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Make mlx5e_verify_rx_mpwqe_strides static

mlx5e_verify_rx_mpwqe_strides is only used in en/params.c, so it can be
made static and removed from en/params.h.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Remove unused fields from datapath structs

No need to keep max_sq_wqebbs in mlx5e_txqsq and mlx5e_xdpsq, as it's
only used when allocating the queues. Removing an extra field reduces
the struct size.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Convert mlx5e_get_max_sq_wqebbs to u8

The return value of mlx5e_get_max_sq_wqebbs is clamped down to
MLX5_SEND_WQE_MAX_WQEBBS = 16, which fits into u8. This commit changes
the return type of this function to u8 for stricter type safety.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add the log_min_mkey_entity_size capability

Add the capability that will allow the driver to determine the minimal
MTT page size to be able to map the smallest possible pages in XSK. The
older firmwares that don't have this capability default to 12 (i.e.
4096-byte pages).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
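
As a small worked example of the default mentioned above, the sketch below converts the log2 capability value into a page size in bytes and falls back to 12 when the capability is not reported. The function name and the cap_present flag are hypothetical; how the driver actually detects a missing capability is not shown in this log.

```c
#include <stdint.h>

#define SKETCH_DEFAULT_LOG_MIN_ENTITY 12u  /* 2^12 = 4096-byte pages */

/*
 * Smallest mappable page size in bytes. 'cap_present' models whether the
 * firmware reports the capability; when it does not, fall back to 12.
 */
uint32_t sketch_min_mkey_page_size(int cap_present, uint8_t log_min_mkey_entity_size)
{
	uint32_t log_sz = cap_present ? log_min_mkey_entity_size
				      : SKETCH_DEFAULT_LOG_MIN_ENTITY;

	return (uint32_t)1 << log_sz;
}
```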

Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
updates from mlx5-next 2022-09-24

Updates from mlx5-next including [1]:

1) HW definitions and support for NPPS clock settings.

2) various cleanups

3) Enable hash mode by default for all NICs

4) page tracker and advanced virtualization HW definitions for vfio

[1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Remove from FPGA IFC file not-needed definitions
  net/mlx5: Remove unused structs
  net/mlx5: Remove unused functions
  net/mlx5: detect and enable bypass port select flow table
  net/mlx5: Lag, enable hash mode by default for all NICs
  net/mlx5: Lag, set active ports if support bypass port select flow table
  RDMA/mlx5: Don't set tx affinity when lag is in hash mode
  net/mlx5: add IFC bits for bypassing port select flow table
  net/mlx5: Add support for NPPS with real time mode
  net/mlx5: Expose NPPS related registers
  net/mlx5: Query ADV_VIRTUALIZATION capabilities
  net/mlx5: Introduce ifc bits for page tracker
  RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
====================

Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sunhme: Fix undersized zeroing of quattro->happy_meals

Just use kzalloc instead.

Fixes: d6f1e89bdbb8 ("sunhme: Return an ERR_PTR from quattro_pci_find")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Link: https://lore.kernel.org/r/20220928004157.279731-1-seanga2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
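
The general pattern behind this fix, allocating memory that is already zeroed instead of zeroing part of it afterwards, can be sketched in userspace as follows. The struct layout here is hypothetical and calloc() stands in for kzalloc(); the log above does not show the original buggy code, so this is not a reconstruction of it.

```c
#include <stdlib.h>

#define SKETCH_PORTS 4

/* Hypothetical stand-in for the driver's per-card structure. */
struct sketch_quattro {
	void *happy_meals[SKETCH_PORTS];
	int nranges;
};

/*
 * Zeroing allocation: the whole struct, including every element of the
 * array, starts out zeroed, so no separate sized memset() is needed.
 */
struct sketch_quattro *sketch_quattro_alloc(void)
{
	return calloc(1, sizeof(struct sketch_quattro));
}
```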

net: wwan: iosm: Use skb_put_data() instead of skb_put/memcpy pair

Use skb_put_data() instead of skb_put() and memcpy(), which is clearer.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Reviewed-by: M Chetan Kumar <m.chetan.kumar@intel.com>
Link: https://lore.kernel.org/r/20220927023254.30342-1-shangxiaojing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
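
To show what the conversion above buys, here is a toy userspace model of the two helpers. struct sketch_buf is a made-up stand-in for an skb, and bounds checking is omitted; only the relationship between the "put" and "put data" operations is meant to carry over.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an skb: a fixed buffer plus the number of bytes used. */
struct sketch_buf {
	unsigned char data[64];
	size_t len;
};

/* Like skb_put(): reserve len bytes at the tail and return a pointer to them. */
void *sketch_put(struct sketch_buf *b, size_t len)
{
	void *tail = b->data + b->len;

	b->len += len;
	return tail;
}

/* Like skb_put_data(): reserve the space and copy the payload in one call. */
void *sketch_put_data(struct sketch_buf *b, const void *src, size_t len)
{
	void *tail = sketch_put(b, len);

	memcpy(tail, src, len);
	return tail;
}
```

With the combined helper the length is written once, so the reserve and copy sizes cannot drift apart.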

Merge branch 'rework-resource-allocation-in-felix-dsa-driver'

Vladimir Oltean says:

====================
Rework resource allocation in Felix DSA driver

The Felix DSA driver controls NXP variations of Microchip switches.
Colin Foster is trying to add support in this driver for "genuine"
Microchip hardware, but some of the NXP-isms in this driver need to go
away before that happens cleanly.
https://patchwork.kernel.org/project/netdevbpf/cover/20220926002928.2744638-1-colin.foster@in-advantage.com/

The starting point was Colin's patch 08/14 "net: dsa: felix: update
init_regmap to be string-based", and this continues to be the central
theme here, but things are done differently.

In short (full explanations are in patches), the goal is for MFD-based
switches like Colin's SPI-controlled VSC7512 to be able to request a
regmap that was created 100% externally (by drivers/mfd/ocelot-core.c)
in a very simple way, that does not create dependencies on other
modules. That is dev_get_regmap(), and as input it wants a string, for
the resource name. So we rework the resource allocation in this driver
to be based on string names provided by the specific instantiation (in
Colin's case, ocelot_ext.c).

Patch set was boot-tested on NXP LS1028A.
====================

Link: https://lore.kernel.org/r/20220927191521.1578084-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: update regmap requests to be string-based

Existing felix DSA drivers (vsc9959, vsc9953) are all switches that were
integrated in NXP SoCs, which makes them a bit unusual compared to the
usual Microchip branded Ocelot switches.

To be precise, looking at
Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml, one can
see 21 memory regions for the "switch" node, and these correspond to the
"targets" of the switch IP, which are spread throughout the guts of that
SoC's memory space.

In NXP integrations, those targets still exist, but they were condensed
within a single memory region, with no other peripheral in between them,
so it made more sense for the driver to ioremap the entire memory space
of the switch, and then find the targets within that memory space via
some offsets hardcoded in the driver.

The effect of this design decision is that now, the felix driver expects
hardware instantiations to provide their own resource definitions, which
is kind of odd when considering a typical device (those are retrieved
from 'reg' properties in the device tree, using platform_get_resource()
or similar).

Allow other hardware instantiations that share the felix driver to not
provide a hardcoded array of resources in the future. Instead, make the
common denominator based on which regmaps are created be just the
resource "names". Each instantiation comes with its own array of names
that are mandatory for it, and with an optional array of resources.

So we split the resources in 2 arrays, one is what's requested and the
other is what's provided. There is one pool of provided resources, in
felix->info->resources (of length felix->info->num_resources). There are
2 different ways of requesting a resource. One is by enum ocelot_target
(this handles the global regmaps), and one is by int port (this handles
the per-port ones).

For the existing vsc9959 and vsc9953, it would be a bit stupid to
request something that's not provided, given that the 2 arrays are both
defined in the same place.

The advantage is that we can now modify felix_request_regmap_by_name()
to make felix->info->resources[] optional, and if absent, the
implementation can call dev_get_regmap() and this is something that is
compatible with MFD.

Co-developed-by: Colin Foster <colin.foster@in-advantage.com>
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
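
The split between requested names and an optional pool of provided resources might be sketched as below. All types and names here are simplified and hypothetical; only the lookup-by-name idea and the possibility of an absent pool come from the commit message.

```c
#include <stddef.h>
#include <string.h>

struct sketch_resource {
	const char *name;
	unsigned long start;
	unsigned long size;
};

/* Pool of provided resources; NULL/0 if the instantiation provides none. */
struct sketch_info {
	const struct sketch_resource *resources;
	size_t num_resources;
};

/*
 * Find a provided resource by name. Returns NULL when the pool is absent
 * or has no entry with that name; a caller could then fall back to an
 * externally created regmap (dev_get_regmap() in the kernel).
 */
const struct sketch_resource *
sketch_find_resource(const struct sketch_info *info, const char *name)
{
	for (size_t i = 0; i < info->num_resources; i++)
		if (strcmp(info->resources[i].name, name) == 0)
			return &info->resources[i];
	return NULL;
}
```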

net: dsa: felix: use DEFINE_RES_MEM_NAMED for resources

Use less verbose resource definitions in vsc9959 and vsc9953. This also
sets IORESOURCE_MEM in the constant array of resources, so we don't have
to do this from felix_init_structs() - in fact, in the future, we may
even support IORESOURCE_REG resources.

Note that this macro takes start and length as argument, and we had
start and end before. So transform end into length.

While at it, sort the resources according to their offset.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
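
The "transform end into length" step is a one-line calculation, but it is easy to get off by one, because a resource's end address is inclusive. A minimal sketch, with a hypothetical struct and function name:

```c
#include <stdint.h>

/* Resource described by an inclusive [start, end] range, as before the change. */
struct sketch_range {
	uint32_t start;
	uint32_t end;   /* inclusive last address */
};

/* Length to pass to a start/length style macro: both endpoints are included. */
uint32_t sketch_range_len(struct sketch_range r)
{
	return r.end - r.start + 1;
}
```

For example, a range from 0x0010000 to 0x001ffff has a length of 0x10000.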

net: dsa: felix: remove felix_info :: init_regmap

It turns out that the idea of having a customizable implementation of a
regmap creation from a resource is not exactly useful. The idea was for
the new MFD-based VSC7512 driver to use something that creates a SPI
regmap from a resource. But there are problems in actually getting those
resources (it involves getting them from MFD).

To avoid all that, we'll be getting resources by name, so this custom
init_regmap() method won't be needed. Remove it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: remove felix_info :: imdio_base

This address is only relevant for the vsc9959, which is a PCIe device
that holds its switch registers in a different PCIe BAR compared to the
registers for the internal MDIO controller.

Hide this aspect from the common felix driver and move the
pci_resource_start() call to the only place that needs it, which is in
vsc9959_mdio_bus_alloc().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: remove felix_info :: imdio_res

The imdio_res is used only by vsc9959, which references its own
vsc9959_imdio_res through the common felix_info->imdio_res pointer.
Since the common code doesn't care about this resource (and it can't be
part of the common array of resources, either, because it belongs in a
different PCI BAR), just reference it directly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: drop the weight argument from netif_napi_add

We tell driver developers to always pass NAPI_POLL_WEIGHT
as the weight to netif_napi_add(). This may be confusing
to newcomers, so drop the weight argument. Those who really
need to tweak the weight can use netif_napi_add_weight().

Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN
Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
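
The shape of this API change can be modelled in a few lines. The sketch below uses a toy struct and toy function names rather than the kernel's struct napi_struct and netif_napi_add(); the default of 64 mirrors the kernel's NAPI_POLL_WEIGHT.

```c
#define SKETCH_NAPI_POLL_WEIGHT 64  /* mirrors the kernel's NAPI_POLL_WEIGHT */

struct sketch_napi {
	int (*poll)(struct sketch_napi *napi, int budget);
	int weight;
};

/* Variant for the rare caller that needs a non-default weight. */
void sketch_napi_add_weight(struct sketch_napi *napi,
			    int (*poll)(struct sketch_napi *, int), int weight)
{
	napi->poll = poll;
	napi->weight = weight;
}

/* Common entry point: no weight argument, the default is always used. */
void sketch_napi_add(struct sketch_napi *napi,
		     int (*poll)(struct sketch_napi *, int))
{
	sketch_napi_add_weight(napi, poll, SKETCH_NAPI_POLL_WEIGHT);
}
```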

net: Fix incorrect address comparison when searching for a bind2 bucket

The v6_rcv_saddr and rcv_saddr are inside a union in the
'struct inet_bind2_bucket'. When searching a bucket by following the
bhash2 hashtable chain, eg. inet_bind2_bucket_match, it is only using
the sk->sk_family and there is no way to check if the inet_bind2_bucket
has a v6 or v4 address in the union. This leads to an uninit-value
KMSAN report in [0] and also potentially incorrect matches.

This patch fixes it by adding a family member to the inet_bind2_bucket
and then tests 'sk->sk_family != tb->family' before matching
the sk's address to the tb's address.

Cc: Joanne Koong <joannelkoong@gmail.com>
Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20220927002544.3381205-1-kafai@fb.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
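
The fix described above, recording the address family next to the union and checking it before comparing addresses, can be sketched as follows. The struct is a simplified, hypothetical stand-in for struct inet_bind2_bucket, and the family constants are local to the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_AF_INET  2
#define SKETCH_AF_INET6 10

struct sketch_bind2_bucket {
	unsigned short family;     /* tells which union member is valid */
	union {
		uint8_t  v6_rcv_saddr[16];
		uint32_t rcv_saddr;
	};
};

/*
 * Match a socket address against a bucket. The family is compared first,
 * so the union is only read through the member that was written.
 */
bool sketch_bucket_match(const struct sketch_bind2_bucket *tb,
			 unsigned short sk_family, const void *sk_addr)
{
	if (sk_family != tb->family)
		return false;
	if (sk_family == SKETCH_AF_INET6)
		return memcmp(tb->v6_rcv_saddr, sk_addr, 16) == 0;
	return memcmp(&tb->rcv_saddr, sk_addr, sizeof(tb->rcv_saddr)) == 0;
}
```

Without the family check, an IPv6 lookup against an IPv4 bucket would compare 16 bytes of which only 4 were ever initialized.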

Merge branch 'mptcp-mptcp-support-for-tcp_fastopen_connect'

Mat Martineau says:

====================
mptcp: MPTCP support for TCP_FASTOPEN_CONNECT

RFC 8684 appendix B describes how to use TCP Fast Open with MPTCP. This
series allows TFO use with MPTCP using the TCP_FASTOPEN_CONNECT socket
option. The scope here is limited to the initiator of the connection -
support for MSG_FASTOPEN and the listener side of the connection will be
in a separate series. The preexisting TCP fastopen code does most of the
work, so these changes mostly involve plumbing MPTCP through to those
TCP functions.

Patch 1 changes the MPTCP socket option code to pass the
TCP_FASTOPEN_CONNECT option through to the initial unconnected subflow.

Patch 2 exports the existing tcp_sendmsg_fastopen() function from tcp.c

Patch 3 adds the call to tcp_sendmsg_fastopen() from the MPTCP send
function.

Patch 4 modifies mptcp_poll() to handle the deferred TFO connection.
====================

Link: https://lore.kernel.org/r/20220926232739.76317-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
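
From userspace, the initiator-side feature described in this series would presumably be used along the lines of the sketch below: open an MPTCP socket, set TCP_FASTOPEN_CONNECT, then connect and write as usual. The helper names are hypothetical, and the fallback values for IPPROTO_MPTCP (262) and TCP_FASTOPEN_CONNECT (30) are assumptions for toolchains whose headers lack them.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262        /* assumed value for older libc headers */
#endif
#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30  /* assumed value for older libc headers */
#endif

/* Open an IPv4 MPTCP socket; returns the fd, or -1 if the kernel lacks MPTCP. */
int sketch_mptcp_socket(void)
{
	return socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
}

/*
 * Ask for a deferred connect so that data from the first write can be
 * carried in the SYN. Returns 0 on success and -1 on error.
 */
int sketch_enable_tfo_connect(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, &on, sizeof(on));
}
```

After this call, connect() returns without sending a SYN, and the first send() or write() triggers the handshake with data attached, which is why the series also touches the sendmsg and poll paths.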

mptcp: poll allow write call before actual connect

If fastopen is used, poll must allow a first write that will trigger
the SYN+data.

Similar to what is done in tcp_poll().

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: handle defer connect in mptcp_sendmsg

When TCP_FASTOPEN_CONNECT has been set on the socket before a connect,
the defer flag is set and must be handled when sendmsg is called.

This is similar to what is done in tcp_sendmsg_locked().

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: export tcp_sendmsg_fastopen

It will be used to support TCP FastOpen with MPTCP in the following
commit.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add TCP_FASTOPEN_CONNECT socket option

Set the option for the first subflow only. For the other subflows TFO
can't be used because a mapping would be needed to cover the data in the
SYN.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netns: Replace zero-length array with DECLARE_FLEX_ARRAY() helper

Zero-length arrays are deprecated and we are moving towards adopting
C99 flexible-array members, instead. So, replace zero-length arrays
declarations in anonymous union with the new DECLARE_FLEX_ARRAY()
helper macro.

This helper allows for flexible-array members in unions.

Link: https://github.com/KSPP/linux/issues/193
Link: https://github.com/KSPP/linux/issues/225
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/YzIvfGXxfjdXmIS3@work
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'shrink-struct-ubuf_info'

Pavel Begunkov says:

====================
shrink struct ubuf_info

struct ubuf_info is large but not all fields are needed for all
cases. We have limited space in io_uring for it and large ubuf_info
prevents some struct embedding, even though we use only a subset
of the fields. It's also not very clean trying to use this typeless
extra space.

Shrink struct ubuf_info to only necessary fields used in generic paths,
namely ->callback, ->refcnt and ->flags, which take only 16 bytes. And
make MSG_ZEROCOPY and some other users embed it into a larger struct
ubuf_info_msgzc mimicking the former ubuf_info.

Note, xen/vhost may also have some cleaning on top by creating
new structs containing ubuf_info but with proper types.
====================

Link: https://lore.kernel.org/r/cover.1663892211.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
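
The embedding scheme from the cover letter above can be sketched in standalone C. The generic struct keeps only a callback, a reference count and flags; the extended struct and its extra fields are hypothetical, and the pointer arithmetic is what the kernel's container_of() macro does.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic part: only what the common paths need. */
struct sketch_ubuf_info {
	void (*callback)(struct sketch_ubuf_info *uarg);
	uint32_t refcnt;
	uint8_t  flags;
};

/* Extended version for one user; the extra fields here are hypothetical. */
struct sketch_ubuf_info_msgzc {
	struct sketch_ubuf_info ubuf;
	uint32_t id;
	uint32_t bytelen;
};

/* Recover the enclosing struct from a pointer to its embedded generic part. */
struct sketch_ubuf_info_msgzc *sketch_to_msgzc(struct sketch_ubuf_info *uarg)
{
	return (struct sketch_ubuf_info_msgzc *)
		((char *)uarg - offsetof(struct sketch_ubuf_info_msgzc, ubuf));
}
```

Generic code passes around the small struct, and each user converts back to its own typed wrapper inside its callback.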

net: shrink struct ubuf_info

We can benefit from a smaller struct ubuf_info, so leave only mandatory
fields and let users decide how they want to extend it. Convert
MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
This reduces the size from 48 bytes to just 16.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vhost/net: use struct ubuf_info_msgzc

struct ubuf_info will be changed, use ubuf_info_msgzc instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xen/netback: use struct ubuf_info_msgzc

struct ubuf_info will be changed, use ubuf_info_msgzc instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: introduce struct ubuf_info_msgzc

We're going to split struct ubuf_info and leave there only
mandatory fields. Users are free to extend it. Add struct
ubuf_info_msgzc, which will be an extended version for MSG_ZEROCOPY and
some other users. It duplicates struct ubuf_info for now and will be
removed in a couple of patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'master' of git://git./linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter fix for net-next

This is a late bug fix for the *net-next* tree to make nftables
"fib" expression play nice with VRF devices.

This was broken since day 1 (v4.10) so I don't see a compelling reason
to push this via net at the last minute.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_fib: Fix for rpath check with VRF devices
====================

Link: https://lore.kernel.org/r/20220928113908.4525-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nft_fib: Fix for rpath check with VRF devices

Analogous to commit b575b24b8eee3 ("netfilter: Fix rpfilter
dropping vrf packets by mistake") but for nftables fib expression:
Add special treatment of VRF devices so that typical reverse path
filtering via 'fib saddr . iif oif' expression works as expected.

Fixes: f6d0cbcf09c50 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>

Merge branch 'sfc-tc-offload'

Edward Cree says:

====================
sfc: bare bones TC offload

This series begins the work of supporting TC flower offload on EF100 NICs.
This is the absolute minimum viable TC implementation to get traffic to
VFs and allow them to be tested; it supports no match fields besides
ingress port, no actions besides mirred and drop, and no stats.
More matches, actions, and counters will be added in subsequent patches.

Changed in v2:
- Add missing 'static' on declarations (kernel test robot, sparse)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: bare bones TC offload on EF100

This is the absolute minimum viable TC implementation to get traffic to
VFs and allow them to be tested; it supports no match fields besides
ingress port, no actions besides mirred and drop, and no stats.
Example usage:
    tc filter add dev $PF parent ffff: flower skip_sw \
        action mirred egress mirror dev $VFREP
    tc filter add dev $VFREP parent ffff: flower skip_sw \
        action mirred egress redirect dev $PF
gives a VF unfiltered access to the network out the physical port ($PF
acts here as a physical port representor).
More matches, actions, and counters will be added in subsequent patches.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: interrogate MAE capabilities at probe time

Different versions of EF100 firmware and FPGA bitstreams support different
matching capabilities in the Match-Action Engine. Probe for these at
start of day; subsequent patches will validate TC offload requests
against the reported capabilities.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: add a hashtable for offloaded TC rules

Nothing inserts into this table yet, but we have code to remove rules
on FLOW_CLS_DESTROY or at driver teardown time, in both cases also
attempting to remove the corresponding hardware rules.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: optional logging of TC offload errors

TC offload support will involve complex limitations on what matches and
actions a rule can do, in some cases potentially depending on rules
already offloaded. So add an ethtool private flag "log-tc-errors" which
controls reporting the reasons for un-offloadable TC rules at NETIF_INFO.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: bind indirect blocks for TC offload on EF100

Bind indirect blocks for recognised tunnel netdevices.
Currently these connect to a stub efx_tc_flower() that only returns
-EOPNOTSUPP; subsequent patches will implement flower offloads to the
Match-Action Engine.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: bind blocks for TC offload on EF100

Bind direct blocks for the MAE-admin PF and each VF representor.
Currently these connect to a stub efx_tc_flower() that only returns
-EOPNOTSUPP; subsequent patches will implement flower offloads to the
Match-Action Engine.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: rmnet: Replace zero-length array with DECLARE_FLEX_ARRAY() helper

Zero-length arrays are deprecated and we are moving towards adopting
C99 flexible-array members instead. So, replace the zero-length array
declarations in the anonymous union with the new DECLARE_FLEX_ARRAY()
helper macro.

This helper allows for flexible-array members in unions.

Link: https://github.com/KSPP/linux/issues/193
Link: https://github.com/KSPP/linux/issues/221
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: act_bpf: simplify code logic in tcf_bpf_init()

Both is_bpf and is_ebpf are boolean types, so
(!is_bpf && !is_ebpf) || (is_bpf && is_ebpf) can be reduced to
is_bpf == is_ebpf in tcf_bpf_init().
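
The reduction can be checked over the whole truth table. The following is
a minimal userspace sketch, not code from act_bpf; the function names are
hypothetical and only the two flag names come from the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Long form: true when neither flag or both flags are set. */
static bool flags_conflict_long(bool is_bpf, bool is_ebpf)
{
	return (!is_bpf && !is_ebpf) || (is_bpf && is_ebpf);
}

/* Reduced form: for two booleans, the same condition is plain equality. */
static bool flags_conflict_short(bool is_bpf, bool is_ebpf)
{
	return is_bpf == is_ebpf;
}

/* Walk the full truth table and assert that both forms agree. */
static void check_forms_agree(void)
{
	for (int a = 0; a <= 1; a++)
		for (int b = 0; b <= 1; b++)
			assert(flags_conflict_long(a, b) ==
			       flags_conflict_short(a, b));
}
```

The equivalence relies on both operands being real booleans; with plain
integers that may hold values other than 0 and 1, equality would not be
a safe replacement.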

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'lan966x-qos'

Horatiu Vultur says:

====================
net: lan966x: Add tbf, cbs, ets support

Add support for offloading QoS features to lan966x with the tc command.
The offloaded QoS features are tbf, cbs and ets.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add offload support for ets

Add offload support for the ets qdisc, which allows mixing strict-priority
bands with bandwidth-sharing bands. The ets qdisc must be attached as the
root qdisc.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add offload support for cbs

The lan966x switch supports a credit-based shaper in hardware according to
IEEE Std 802.1Q-2018 Section 8.6.8.2. Add support for cbs configuration
on the egress port of the lan966x switch.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add offload support for tbf

The tbf qdisc allows attaching a shaper to egress traffic on a port or
on a queue. On a port, the shaper is attached directly to the root; on a
queue, it is attached to one of the classes of the parent qdisc.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tc-testing-qdisc'

Zhengchao Shao says:

====================
net: add tc-testing qdisc test cases

This patchset adds test cases for the qdisc modules to the tc-testing
test suite.

Thanks to Victor for testing and suggestions.

With the test cases added locally, the test results are as follows:

./tdc.py -c atm
ok 1 7628 - Create ATM with default setting
ok 2 390a - Delete ATM with valid handle
ok 3 32a0 - Show ATM class
ok 4 6310 - Dump ATM stats

./tdc.py -c choke
ok 1 8937 - Create CHOKE with default setting
ok 2 48c0 - Create CHOKE with min packet setting
ok 3 38c1 - Create CHOKE with max packet setting
ok 4 234a - Create CHOKE with ecn setting
ok 5 4380 - Create CHOKE with burst setting
ok 6 48c7 - Delete CHOKE with valid handle
ok 7 4398 - Replace CHOKE with min setting
ok 8 0301 - Change CHOKE with limit setting

./tdc.py -c codel
ok 1 983a - Create CODEL with default setting
ok 2 38aa - Create CODEL with limit packet setting
ok 3 9178 - Create CODEL with target setting
ok 4 78d1 - Create CODEL with interval setting
ok 5 238a - Create CODEL with ecn setting
ok 6 939c - Create CODEL with ce_threshold setting
ok 7 8380 - Delete CODEL with valid handle
ok 8 289c - Replace CODEL with limit setting
ok 9 0648 - Change CODEL with limit setting

./tdc.py -c etf
ok 1 34ba - Create ETF with default setting
ok 2 438f - Create ETF with delta nanos setting
ok 3 9041 - Create ETF with deadline_mode setting
ok 4 9a0c - Create ETF with skip_sock_check setting
ok 5 2093 - Delete ETF with valid handle

./tdc.py -c fq
ok 1 983b - Create FQ with default setting
ok 2 38a1 - Create FQ with limit packet setting
ok 3 0a18 - Create FQ with flow_limit setting
ok 4 2390 - Create FQ with quantum setting
ok 5 845b - Create FQ with initial_quantum setting
ok 6 9398 - Create FQ with maxrate setting
ok 7 342c - Create FQ with nopacing setting
ok 8 6391 - Create FQ with refill_delay setting
ok 9 238b - Create FQ with low_rate_threshold setting
ok 10 7582 - Create FQ with orphan_mask setting
ok 11 4894 - Create FQ with timer_slack setting
ok 12 324c - Create FQ with ce_threshold setting
ok 13 424a - Create FQ with horizon time setting
ok 14 89e1 - Create FQ with horizon_cap setting
ok 15 32e1 - Delete FQ with valid handle
ok 16 49b0 - Replace FQ with limit setting
ok 17 9478 - Change FQ with limit setting

./tdc.py -c gred
ok 1 8942 - Create GRED with default setting
ok 2 5783 - Create GRED with grio setting
ok 3 8a09 - Create GRED with limit setting
ok 4 48cb - Create GRED with ecn setting
ok 5 763a - Change GRED setting
ok 6 8309 - Show GRED class

./tdc.py -c hhf
ok 1 4812 - Create HHF with default setting
ok 2 8a92 - Create HHF with limit setting
ok 3 3491 - Create HHF with quantum setting
ok 4 ba04 - Create HHF with reset_timeout setting
ok 5 4238 - Create HHF with admit_bytes setting
ok 6 839f - Create HHF with evict_timeout setting
ok 7 a044 - Create HHF with non_hh_weight setting
ok 8 32f9 - Change HHF with limit setting
ok 9 385e - Show HHF class

./tdc.py -c pfifo_fast
ok 1 900c - Create pfifo_fast with default setting
ok 2 7470 - Dump pfifo_fast stats
ok 3 b974 - Replace pfifo_fast with different handle
ok 4 3240 - Delete pfifo_fast with valid handle
ok 5 4385 - Delete pfifo_fast with invalid handle

./tdc.py -c plug
ok 1 3289 - Create PLUG with default setting
ok 2 0917 - Create PLUG with block setting
ok 3 483b - Create PLUG with release setting
ok 4 4995 - Create PLUG with release_indefinite setting
ok 5 389c - Create PLUG with limit setting
ok 6 384a - Delete PLUG with valid handle
ok 7 439a - Replace PLUG with limit setting
ok 8 9831 - Change PLUG with limit setting

./tdc.py -c sfb
ok 1 3294 - Create SFB with default setting
ok 2 430a - Create SFB with rehash setting
ok 3 3410 - Create SFB with db setting
ok 4 49a0 - Create SFB with limit setting
ok 5 1241 - Create SFB with max setting
ok 6 3249 - Create SFB with target setting
ok 7 30a9 - Create SFB with increment setting
ok 8 239a - Create SFB with decrement setting
ok 9 9301 - Create SFB with penalty_rate setting
ok 10 2a01 - Create SFB with penalty_burst setting
ok 11 3209 - Change SFB with rehash setting
ok 12 5447 - Show SFB class

./tdc.py -c sfq
ok 1 7482 - Create SFQ with default setting
ok 2 c186 - Create SFQ with limit setting
ok 3 ae23 - Create SFQ with perturb setting
ok 4 a430 - Create SFQ with quantum setting
ok 5 4539 - Create SFQ with divisor setting
ok 6 b089 - Create SFQ with flows setting
ok 7 99a0 - Create SFQ with depth setting
ok 8 7389 - Create SFQ with headdrop setting
ok 9 6472 - Create SFQ with redflowlimit setting
ok 10 8929 - Show SFQ class

./tdc.py -c skbprio
ok 1 283e - Create skbprio with default setting
ok 2 c086 - Create skbprio with limit setting
ok 3 6733 - Change skbprio with limit setting
ok 4 2958 - Show skbprio class

./tdc.py -c taprio
ok 1 ba39 - Add taprio Qdisc to multi-queue device (8 queues)
ok 2 9462 - Add taprio Qdisc with multiple sched-entry
ok 3 8d92 - Add taprio Qdisc with txtime-delay
ok 4 d092 - Delete taprio Qdisc with valid handle
ok 5 8471 - Show taprio class
ok 6 0a85 - Add taprio Qdisc to single-queue device

./tdc.py -c tbf
ok 1 6430 - Create TBF with default setting
ok 2 0518 - Create TBF with mtu setting
ok 3 320a - Create TBF with peakrate setting
ok 4 239b - Create TBF with latency setting
ok 5 c975 - Create TBF with overhead setting
ok 6 948c - Create TBF with linklayer setting
ok 7 3549 - Replace TBF with mtu
ok 8 f948 - Change TBF with latency time
ok 9 2348 - Show TBF class

./tdc.py -c teql
ok 1 84a0 - Create TEQL with default setting
ok 2 7734 - Create TEQL with multiple device
ok 3 34a9 - Delete TEQL with valid handle
ok 4 6289 - Show TEQL stats

---
v3: add config
v2: modify subject prefix
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for teql qdisc

Test 84a0: Create TEQL with default setting
Test 7734: Create TEQL with multiple device
Test 34a9: Delete TEQL with valid handle
Test 6289: Show TEQL stats

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for tbf qdisc

Test 6430: Create TBF with default setting
Test 0518: Create TBF with mtu setting
Test 320a: Create TBF with peakrate setting
Test 239b: Create TBF with latency setting
Test c975: Create TBF with overhead setting
Test 948c: Create TBF with linklayer setting
Test 3549: Replace TBF with mtu
Test f948: Change TBF with latency time
Test 2348: Show TBF class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for taprio qdisc

Test ba39: Add taprio Qdisc to multi-queue device (8 queues)
Test 9462: Add taprio Qdisc with multiple sched-entry
Test 8d92: Add taprio Qdisc with txtime-delay
Test d092: Delete taprio Qdisc with valid handle
Test 8471: Show taprio class
Test 0a85: Add taprio Qdisc to single-queue device

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for skbprio qdisc

Test 283e: Create skbprio with default setting
Test c086: Create skbprio with limit setting
Test 6733: Change skbprio with limit setting
Test 2958: Show skbprio class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for sfq qdisc

Test 7482: Create SFQ with default setting
Test c186: Create SFQ with limit setting
Test ae23: Create SFQ with perturb setting
Test a430: Create SFQ with quantum setting
Test 4539: Create SFQ with divisor setting
Test b089: Create SFQ with flows setting
Test 99a0: Create SFQ with depth setting
Test 7389: Create SFQ with headdrop setting
Test 6472: Create SFQ with redflowlimit setting
Test 8929: Show SFQ class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for sfb qdisc

Test 3294: Create SFB with default setting
Test 430a: Create SFB with rehash setting
Test 3410: Create SFB with db setting
Test 49a0: Create SFB with limit setting
Test 1241: Create SFB with max setting
Test 3249: Create SFB with target setting
Test 30a9: Create SFB with increment setting
Test 239a: Create SFB with decrement setting
Test 9301: Create SFB with penalty_rate setting
Test 2a01: Create SFB with penalty_burst setting
Test 3209: Change SFB with rehash setting
Test 5447: Show SFB class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for plug qdisc

Test 3289: Create PLUG with default setting
Test 0917: Create PLUG with block setting
Test 483b: Create PLUG with release setting
Test 4995: Create PLUG with release_indefinite setting
Test 389c: Create PLUG with limit setting
Test 384a: Delete PLUG with valid handle
Test 439a: Replace PLUG with limit setting
Test 9831: Change PLUG with limit setting

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for pfifo_fast qdisc

Test 900c: Create pfifo_fast with default setting
Test 7470: Dump pfifo_fast stats
Test b974: Replace pfifo_fast with different handle
Test 3240: Delete pfifo_fast with valid handle
Test 4385: Delete pfifo_fast with invalid handle

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for hhf qdisc

Test 4812: Create HHF with default setting
Test 8a92: Create HHF with limit setting
Test 3491: Create HHF with quantum setting
Test ba04: Create HHF with reset_timeout setting
Test 4238: Create HHF with admit_bytes setting
Test 839f: Create HHF with evict_timeout setting
Test a044: Create HHF with non_hh_weight setting
Test 32f9: Change HHF with limit setting
Test 385e: Show HHF class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for gred qdisc

Test 8942: Create GRED with default setting
Test 5783: Create GRED with grio setting
Test 8a09: Create GRED with limit setting
Test 48cb: Create GRED with ecn setting
Test 763a: Change GRED setting
Test 8309: Show GRED class

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for fq qdisc

Test 983b: Create FQ with default setting
Test 38a1: Create FQ with limit packet setting
Test 0a18: Create FQ with flow_limit setting
Test 2390: Create FQ with quantum setting
Test 845b: Create FQ with initial_quantum setting
Test 9398: Create FQ with maxrate setting
Test 342c: Create FQ with nopacing setting
Test 6391: Create FQ with refill_delay setting
Test 238b: Create FQ with low_rate_threshold setting
Test 7582: Create FQ with orphan_mask setting
Test 4894: Create FQ with timer_slack setting
Test 324c: Create FQ with ce_threshold setting
Test 424a: Create FQ with horizon time setting
Test 89e1: Create FQ with horizon_cap setting
Test 32e1: Delete FQ with valid handle
Test 49b0: Replace FQ with limit setting
Test 9478: Change FQ with limit setting

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for etf qdisc

Test 34ba: Create ETF with default setting
Test 438f: Create ETF with delta nanos setting
Test 9041: Create ETF with deadline_mode setting
Test 9a0c: Create ETF with skip_sock_check setting
Test 2093: Delete ETF with valid handle

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for codel qdisc

Test 983a: Create CODEL with default setting
Test 38aa: Create CODEL with limit packet setting
Test 9178: Create CODEL with target setting
Test 78d1: Create CODEL with interval setting
Test 238a: Create CODEL with ecn setting
Test 939c: Create CODEL with ce_threshold setting
Test 8380: Delete CODEL with valid handle
Test 289c: Replace CODEL with limit setting
Test 0648: Change CODEL with limit setting

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for choke qdisc

Test 8937: Create CHOKE with default setting
Test 48c0: Create CHOKE with min packet setting
Test 38c1: Create CHOKE with max packet setting
Test 234a: Create CHOKE with ecn setting
Test 4380: Create CHOKE with burst setting
Test 48c7: Delete CHOKE with valid handle
Test 4398: Replace CHOKE with min setting
Test 0301: Change CHOKE with limit setting

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tc-testing: add selftests for atm qdisc

Test 7628: Create ATM with default setting
Test 390a: Delete ATM with valid handle
Test 32a0: Show ATM class
Test 6310: Dump ATM stats

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core_acl_flex_actions: Split memcpy() of struct flow_action_cookie flexible array

To work around a misbehavior of the compiler's ability to see into
composite flexible array structs (as detailed in the coming memcpy()
hardening series[1]), split the memcpy() of the header and the payload
so no false positive run-time overflow warning will be generated.

[1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org
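
As an illustration of the pattern only (this is not the mlxsw code), the
sketch below copies a struct that ends in a flexible array in two steps:
first the fixed-size header, then the payload addressed through the array
member itself. The struct cookie type and cookie_dup() are hypothetical
stand-ins chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header followed by a variable-length payload. */
struct cookie {
	unsigned int len;      /* number of valid bytes in data[] */
	unsigned char data[];  /* C99 flexible-array member */
};

/* Duplicate a cookie using two bounded copies instead of one large one. */
static struct cookie *cookie_dup(const struct cookie *src)
{
	struct cookie *dst;

	assert(src != NULL);
	dst = malloc(sizeof(*dst) + src->len);
	if (!dst)
		return NULL;

	/* Step 1: copy only the fixed-size header. */
	memcpy(dst, src, sizeof(*dst));
	/* Step 2: copy the payload into the flexible array itself. */
	memcpy(dst->data, src->data, src->len);

	return dst;
}
```

Each memcpy() now has a destination whose size matches what is being
written, which is the property a bounds-checking memcpy() would need in
order to avoid flagging the copy.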

Cc: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20220927004033.1942992-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-generalized-register-definitions'

Alex Elder says:

====================
net: ipa: generalized register definitions

This series is quite a bit bigger than what I normally like to send,
and I apologize for that.  I would like it to get incorporated in
its entirety this week if possible, and splitting up the series
carries a small risk that this wouldn't happen.

Each IPA register has a defined offset, and in most cases, a set
of masks that define the width and position of fields within the
register.  Most registers currently use the same offset for all
versions of IPA.  Usually fields within registers are also the same
across many versions.  Offsets and fields like this are defined
using preprocessor constants.

When a register has a different offset for different versions of
IPA, an inline function is used to determine its offset.  And in
places where a field differs between versions, an inline function is
used to determine how a value is encoded within the field, depending
on IPA version.

Starting with IPA version 5.0, the number of IPA endpoints supported
is greater than 32.  As a consequence, *many* IPA register offsets
differ considerably from prior versions.  This increase in endpoints
also requires a lot of field sizes and/or positions to change (such
as those that contain an endpoint ID).

Defining these things with constants is no longer simple, and rather
than fill the code with one-off functions to define offsets and
encode field values, this series puts in place a new way of defining
IPA registers and their fields.  Note that this series creates this
new scheme, but does not add IPA v5.0+ support.

An enumerated type will now define a unique ID for each IPA register.
Each defined register will have a structure that contains its offset
and its name (a printable string).  Each version of IPA will have an
array of these register structures, indexed by register ID.

Some "parameterized" registers are duplicated (this is not new).
For example, each endpoint has an INIT_HDR register, and the offset
of a given endpoint's INIT_HDR register is dependent on the endpoint
number (the parameter).  In such cases, the register's "stride" is
defined as the distance between two of these registers.
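
The descriptor scheme described so far can be sketched in plain C as
follows.  This is an illustration under stated assumptions, not the
driver's code: the type and function names, the two register IDs, and
the offset and stride values are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register IDs; a real table would have many more. */
enum reg_id { REG_COMP_CFG, REG_ENDP_INIT_HDR, REG_ID_COUNT };

/* One descriptor per register: offset, optional stride, printable name. */
struct reg {
	uint32_t offset;   /* offset of the first (or only) instance */
	uint32_t stride;   /* distance between instances; 0 if unused */
	const char *name;
};

/* Made-up values, for illustration only. */
static const struct reg reg_comp_cfg = { 0x003c, 0, "COMP_CFG" };
static const struct reg reg_endp_init_hdr = { 0x0810, 0x70, "ENDP_INIT_HDR" };

/* Per-version table indexed by register ID. */
static const struct reg *const regs_example[REG_ID_COUNT] = {
	[REG_COMP_CFG] = &reg_comp_cfg,
	[REG_ENDP_INIT_HDR] = &reg_endp_init_hdr,
};

/* Look up a descriptor by ID; the result may be NULL if undefined. */
static const struct reg *reg_lookup(const struct reg *const *regs,
				    enum reg_id id)
{
	assert((unsigned int)id < REG_ID_COUNT);
	return regs[id];
}

/* Offset of instance n of a parameterized register (0 for a NULL reg). */
static uint32_t reg_n_offset(const struct reg *reg, uint32_t n)
{
	return reg ? reg->offset + n * reg->stride : 0;
}
```

With these example values, instance 2 of the parameterized register sits
at 0x0810 + 2 * 0x70, and a non-parameterized register is simply
instance 0 with a stride of 0.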

If a register contains fields, each field will have a unique ID
that's used as an index into an array of field masks defined for the
register.  The register structure also defines the number of entries
in this field array.

When a register is to be used in code, its register structure will
be fetched using function ipa_reg().  Other functions are then used
to determine the register's offset, or to encode a value into one of
the register's fields, and so on.

Each version of IPA defines the set of registers that are available,
including all fields for these registers.  The array of defined
registers is set up at probe time based on the IPA version, and it
is associated with the main IPA structure.
====================

Link: https://lore.kernel.org/r/20220926220931.3261749-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define remaining IPA register fields

Define the fields for the ENDP_INIT_DEAGGR, ENDP_INIT_RSRC_GRP,
ENDP_INIT_SEQ, ENDP_STATUS, ENDP_FILTER_ROUTER_HSH_CFG, and
IPA_IRQ_UC IPA registers for all supported IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_FIELDS() and IPA_REG_STRIDE_FIELDS() to specify the
field mask values defined for these registers, for each supported
version of IPA.

Use ipa_reg_encode() and ipa_reg_bit() to build up the values to be
written to these registers. Remove an inline function and all the
*_FMASK symbols that are no longer used.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define more IPA endpoint register fields

Define the fields for the ENDP_INIT_MODE, ENDP_INIT_AGGR,
ENDP_INIT_HOL_BLOCK_EN, and ENDP_INIT_HOL_BLOCK_TIMER IPA
registers for all supported IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_STRIDE_FIELDS() to specify the field mask values defined
for these registers, for each supported version of IPA.

Change aggr_time_limit_encode() and hol_block_timer_encode() so they
take an ipa_reg pointer, and use that register's fields to compute
their encoded results. Have aggr_time_limit_encode() take an IPA
pointer rather than version, to match hol_block_timer_encode().

Use ipa_reg_encode(), ipa_reg_bit(), and ipa_reg_field_max() to
manipulate values to be written to these registers. Remove the
definitions of the various inline functions and *_FMASK symbols that
are no longer used.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define some IPA endpoint register fields

Define the fields for the ENDP_INIT_CTRL, ENDP_INIT_CFG, ENDP_INIT_NAT,
ENDP_INIT_HDR, and ENDP_INIT_HDR_EXT IPA registers for all supported
IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_STRIDE_FIELDS() to specify the field mask values defined
for these registers, for each supported version of IPA.

Move ipa_header_size_encoded() and ipa_metadata_offset_encoded() out
of "ipa_reg.h" and into "ipa_endpoint.c". Change them so they take
an additional ipa_reg structure argument, and use ipa_reg_encode()
to encode the parts of the header size and offset prior to writing
to the register. Change their names to be verbs rather than nouns.

Use ipa_reg_encode(), ipa_reg_bit(), and ipa_reg_field_max() to
manipulate values to be written to these registers. Remove the
definition of the no-longer-used *_FMASK symbols.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define resource group/type IPA register fields

Define the fields for the {SRC,DST}_RSRC_GRP_{01,23,45,67}_RSRC_TYPE
IPA registers for all supported IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_STRIDE_FIELDS() to specify the field mask values defined
for these registers, for each supported version of IPA.

Use ipa_reg_encode() to build up the values to be written to these
registers.

Remove the definition of the no-longer-used *_FMASK symbols.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define even more IPA register fields

Define the fields for the FLAVOR_0, IDLE_INDICATION_CFG,
QTIME_TIMESTAMP_CFG, TIMERS_XO_CLK_DIV_CFG and TIMERS_PULSE_GRAN_CFG
IPA registers for all supported IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_FIELDS() to specify the field mask values defined for
these registers, for each supported version of IPA.

Use ipa_reg_bit() and ipa_reg_encode() to build up the values to be
written to these registers. Use ipa_reg_decode() to extract field
values from the FLAVOR_0 register.

Remove the definition of the no-longer-used *_FMASK symbols.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define more IPA register fields

Define the fields for the LOCAL_PKT_PROC_CNTXT, COUNTER_CFG, and
IPA_TX_CFG IPA registers for all supported IPA versions.

Create enumerated types to identify fields for these IPA registers.
Use IPA_REG_FIELDS() to specify the field mask values defined for
these registers, for each supported version of IPA.

Use ipa_reg_bit() and ipa_reg_encode() to build up the values to be
written to these registers. Remove the definition of the *_FMASK
symbols as well as proc_cntxt_base_addr_encoded(), because they are
no longer needed.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define some more IPA register fields

Define the fields for the SHARED_MEM_SIZE, QSB_MAX_WRITES,
QSB_MAX_READS, FILT_ROUT_HASH_EN, and FILT_ROUT_HASH_FLUSH IPA
registers for all supported IPA versions.

Create enumerated types to identify fields for these registers. Use
IPA_REG_FIELDS() to specify the field mask values defined for these
registers, for each supported version of IPA.

Use ipa_reg_bit() and ipa_reg_encode() to build up the values to be
written to these registers rather than using the *_FMASK
preprocessor symbols.

Remove the definition of the now unused *_FMASK symbols.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define CLKON_CFG and ROUTE IPA register fields

Create the ipa_reg_clkon_cfg_field_id enumerated type, which
identifies the fields for the CLKON_CFG IPA register. Add "CLKON_"
to a few short names to try to avoid name conflicts. Create the
ipa_reg_route_field_id enumerated type, which identifies the fields
for the ROUTE IPA register.

Use IPA_REG_FIELDS() to specify the field mask values defined for
these registers, for each supported version of IPA.

Use ipa_reg_bit() and ipa_reg_encode() to build up the values to be
written to these registers rather than using the *_FMASK
preprocessor symbols.

Remove the definition of the now unused *_FMASK symbols.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define COMP_CFG IPA register fields

Create the ipa_reg_comp_cfg_field_id enumerated type, which
identifies the fields for the COMP_CFG IPA register.

Use IPA_REG_FIELDS() to specify the field mask values defined for
this register, for each supported version of IPA.

Use ipa_reg_bit() to build up the value to be written to this
register rather than using the *_FMASK preprocessor symbols.

Remove the definition of the *_FMASK symbols, along with the inline
functions that were used to encode certain fields whose position
and/or width within the register was dependent on IPA version.

Take this opportunity to represent all one-bit fields using BIT(x)
rather than GENMASK(x, x).
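
The two spellings describe the same mask for a one-bit field.  A small
userspace check, assuming 32-bit registers; BIT32() and GENMASK32() are
stand-ins written for this sketch, not the kernel's own macros:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's BIT() and GENMASK(), 32-bit only. */
#define BIT32(n)        ((uint32_t)1 << (n))
#define GENMASK32(h, l) \
	(((~(uint32_t)0) << (l)) & ((~(uint32_t)0) >> (31 - (h))))

/* Assert that both spellings yield the same mask for every bit position. */
static void check_single_bit_masks(void)
{
	for (unsigned int x = 0; x < 32; x++)
		assert(BIT32(x) == GENMASK32(x, x));
}
```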

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: introduce ipa_reg field masks

Add register field descriptors to the ipa_reg structure.  A field in
a register is defined by a field mask, which is a 32-bit mask having
a single contiguous range of bits set.

For each register that has at least one field defined, an enumerated
type will identify the register's fields.  The ipa_reg structure for
that register will include an array fmask[] of field masks, indexed
by that enumerated type.  Each field mask defines the position and
bit width of a field.  An additional "fcount" records how many
fields (masks) are defined for a given register.

Introduce two macros to be used to define registers that have at
least one field.

Introduce a few new functions related to field masks.  The first
simply returns a field mask, given an IPA register pointer and field
mask ID.  A variant of that is meant to be used for the special case
of single-bit field masks.

Next, ipa_reg_encode() identifies a field with an IPA register
pointer and a field ID, and takes a value to represent in that
field.  The result encodes the value in the appropriate place to be
stored in the register.  This is roughly modeled after the bitmask
operations (like u32_encode_bits()).

Another function (ipa_reg_decode()) similarly identifies a register
field, but the value supplied to it represents a full register
value.  The value encoded in the field is extracted from the value
and returned.  This is also roughly modeled after bitmask operations
(such as u32_get_bits()).

Finally, ipa_reg_field_max() returns the maximum value representable
by a field.
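
The field mask operations described above can be sketched in userspace C
as follows.  This is a hedged illustration, not the driver's
implementation: struct reg_fields, the field_*() helpers, the two field
IDs, and the mask values are hypothetical; only the fmask[]/fcount idea
comes from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field IDs for one register. */
enum hdr_field_id { HDR_LEN, HDR_OFST_METADATA, HDR_FIELD_COUNT };

/* Per-register field table: a count plus one contiguous mask per field. */
struct reg_fields {
	uint32_t fcount;
	const uint32_t *fmask;
};

/* Made-up masks: bits 5..0 and bits 12..7. */
static const uint32_t hdr_fmask[HDR_FIELD_COUNT] = {
	[HDR_LEN] = 0x0000003f,
	[HDR_OFST_METADATA] = 0x00001f80,
};

static const struct reg_fields hdr_reg = { HDR_FIELD_COUNT, hdr_fmask };

/* Return the mask for a field, checking the ID against fcount. */
static uint32_t field_mask(const struct reg_fields *reg, uint32_t field_id)
{
	assert(field_id < reg->fcount);
	return reg->fmask[field_id];
}

/* Position of the lowest set bit of a non-zero mask. */
static unsigned int mask_shift(uint32_t mask)
{
	unsigned int shift = 0;

	assert(mask != 0);
	while (!(mask & 1)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Place val into the field's position within a register value. */
static uint32_t field_encode(const struct reg_fields *reg, uint32_t field_id,
			     uint32_t val)
{
	uint32_t mask = field_mask(reg, field_id);

	return (val << mask_shift(mask)) & mask;
}

/* Extract the field's value from a full register value. */
static uint32_t field_decode(const struct reg_fields *reg, uint32_t field_id,
			     uint32_t regval)
{
	uint32_t mask = field_mask(reg, field_id);

	return (regval & mask) >> mask_shift(mask);
}

/* Largest value the field can represent. */
static uint32_t field_max(const struct reg_fields *reg, uint32_t field_id)
{
	uint32_t mask = field_mask(reg, field_id);

	return mask >> mask_shift(mask);
}
```

A register value is then built by OR-ing together the encoded fields, and
a single field is read back by decoding the full value with the same
field ID.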

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: introduce ipa_reg()

Create a new function that returns a register descriptor given its
ID.  Change ipa_reg_offset() and ipa_reg_n_offset() so they take a
register descriptor argument rather than an IPA pointer and register
ID.  Have them accept null pointers (and return an invalid 0 offset),
to avoid the need for excessive error checking.  (A warning is issued
whenever ipa_reg() returns a null pointer.)

Call ipa_reg() or ipa_reg_n() to look up information about the
register before calls to ipa_reg_offset() and ipa_reg_n_offset().
Delay looking up offsets until they're needed to read or write
registers.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: use ipa_reg[] array for register offsets

Use the array of register descriptors assigned at initialization
time to determine the offset (and where used, stride) for IPA
registers. Issue a warning if an offset is requested for a register
that's not valid for the current system.

Remove all IPA_REG_*_OFFSET macros, as well as inline static
functions that returned register offsets.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: add per-version IPA register definition files

Create a new subdirectory "reg", which contains a register
definition file for each supported version of IPA.  Each register
definition contains the register's offset, and for parameterized
registers, the stride (distance between consecutive instances of the
register).  Finally, it includes an all-caps printable register name.

In these files, each IPA version defines an array of IPA register
definition pointers, with unsupported registers defined with a null
pointer.  The array is indexed by the ipa_reg_id enumerated type.

At initialization time, the appropriate register definition array to
use is selected based on the IPA version, and assigned to a new
"regs" field in the IPA structure.

Extend ipa_reg_valid() so it fails if a valid register is not
defined.

This patch simply puts this infrastructure in place; the next will
use it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: use IPA register IDs to determine offsets

Expose two inline functions that return the offset for a register
whose ID is provided; one of them takes an additional argument
that's used for registers that are parameterized. These both use
a common helper function __ipa_reg_offset(), which just uses the
offset symbols already defined.

Replace all references to the offset macros defined for IPA
registers with calls to ipa_reg_offset() or ipa_reg_n_offset().
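
The pattern might look roughly like the sketch below: two thin wrappers
around one helper that maps a register ID to an offset macro. The macro
names, their values and the demo_ prefixed identifiers are assumptions
made for the example; only the overall shape follows the description
above.

```c
#include <stdint.h>

/* Hypothetical offset macros standing in for the already-defined symbols. */
#define DEMO_COMP_CFG_OFFSET		0x003cu
#define DEMO_ENDP_INIT_CTRL_N_OFFSET(n)	(0x0800u + 0x0070u * (n))

/* Returned for an ID the helper does not know about. */
#define DEMO_OFFSET_INVALID		UINT32_MAX

enum demo_reg_id {
	DEMO_COMP_CFG,
	DEMO_ENDP_INIT_CTRL,	/* parameterized by endpoint number */
	DEMO_REG_ID_COUNT,
};

/* Common helper: translate an ID (and parameter) into an offset. */
uint32_t demo_reg_offset_common(enum demo_reg_id id, uint32_t n)
{
	switch (id) {
	case DEMO_COMP_CFG:
		return DEMO_COMP_CFG_OFFSET;
	case DEMO_ENDP_INIT_CTRL:
		return DEMO_ENDP_INIT_CTRL_N_OFFSET(n);
	default:
		return DEMO_OFFSET_INVALID;
	}
}

/* Offset of a register that is not parameterized. */
uint32_t demo_reg_offset(enum demo_reg_id id)
{
	return demo_reg_offset_common(id, 0);
}

/* Offset of instance n of a parameterized register. */
uint32_t demo_reg_n_offset(enum demo_reg_id id, uint32_t n)
{
	return demo_reg_offset_common(id, n);
}
```

Callers then name a register by ID only, so the way offsets are stored
can change later without touching them.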

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: introduce IPA register IDs

Create a new ipa_reg_id enumerated type, which identifies each IPA
register with a symbolic identifier.  Use short names, but in some
cases (such as "BCR") add "IPA_" to the name to help avoid name
conflicts.

Create two functions that indicate register validity.  The first
concisely indicates whether a register is valid for a given version
of IPA, and if so, whether it is defined.  The second indicates
whether a register is valid for TX or RX endpoints.
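
A minimal sketch of the two validity checks follows. The version numbers,
register IDs and the specific rules (a register dropped from one version
onward, a register that applies to TX endpoints only) are assumptions
chosen for the example, not rules copied from the driver.

```c
#include <stdbool.h>

/* Hypothetical hardware versions, in increasing order. */
enum demo_version { DEMO_V3, DEMO_V4, DEMO_V4_5 };

/* Hypothetical register IDs. */
enum demo_reg_id {
	DEMO_COMP_CFG,
	DEMO_BCR,
	DEMO_ENDP_STATUS,
	DEMO_ENDP_INIT_SEQ,
	DEMO_REG_ID_COUNT,
};

/* Is the register valid for this hardware version? */
bool demo_reg_valid(enum demo_version version, enum demo_reg_id id)
{
	switch (id) {
	case DEMO_BCR:
		/* Assumed for the sketch: not present from v4.5 onward. */
		return version < DEMO_V4_5;
	case DEMO_COMP_CFG:
	case DEMO_ENDP_STATUS:
	case DEMO_ENDP_INIT_SEQ:
		return true;
	default:
		return false;	/* unknown ID */
	}
}

/* Is the register valid for an endpoint of the given direction? */
bool demo_reg_valid_for_endpoint(enum demo_reg_id id, bool tx)
{
	switch (id) {
	case DEMO_ENDP_STATUS:
		return true;	/* assumed valid for both TX and RX */
	case DEMO_ENDP_INIT_SEQ:
		return tx;	/* assumed TX only */
	default:
		return false;	/* not an endpoint register */
	}
}
```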

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

s390/qeth: Split memcpy() of struct qeth_ipacmd_addr_change flexible array

To work around a misbehavior of the compiler's ability to see into
composite flexible array structs (as detailed in the coming memcpy()
hardening series[1]), split the memcpy() of the header and the payload
so no false positive run-time overflow warning will be generated.
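
The idea can be sketched in plain C as below: the fixed-size header and
the flexible array payload are copied by two separate memcpy() calls, so
each call has a destination whose size is easy to reason about. The
struct layout and the names demo_addr_change and demo_addr_change_copy()
are assumptions for the example and do not mirror the qeth structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload entry. */
struct demo_addr_entry {
	uint8_t addr[6];
	uint8_t pad[2];
};

/* Hypothetical composite struct: fixed header plus flexible array. */
struct demo_addr_change {
	uint8_t lost_event_mask;
	uint8_t reserved;
	uint16_t num_entries;
	struct demo_addr_entry entries[];	/* flexible array member */
};

/*
 * Copy src to dst in two steps. The caller must have allocated dst with
 * room for src->num_entries entries.
 */
void demo_addr_change_copy(struct demo_addr_change *dst,
			   const struct demo_addr_change *src)
{
	size_t payload = (size_t)src->num_entries * sizeof(src->entries[0]);

	/* Step 1: the fixed-size header, up to the flexible array. */
	memcpy(dst, src, offsetof(struct demo_addr_change, entries));

	/* Step 2: the payload, addressed through the array member itself. */
	memcpy(dst->entries, src->entries, payload);
}
```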

[1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org/

Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://lore.kernel.org/r/20220927003953.1942442-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Add skb drop reasons to IPv6 UDP receive path

Enumerate the skb drop reasons in the receive path for IPv6 UDP packets.

Signed-off-by: Donald Hunter <donald.hunter@redhat.com>
Link: https://lore.kernel.org/r/20220926120350.14928-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: Remove usage of the deprecated ida_simple_xxx API

Use ida_alloc_xxx()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Link: https://lore.kernel.org/r/20220926012744.3363-1-liubo03@inspur.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tls: Add ARIA-GCM algorithm

RFC 6209 describes ARIA for TLS 1.2.
ARIA-128-GCM and ARIA-256-GCM are defined in RFC 6209.

This patch offers a performance improvement and an opportunity for
hardware offload.

Benchmark results:
iperf-ssl was used.
CPU: intel i3-12100.

  TLS(openssl-3.0-dev)
[  3]  0.0- 1.0 sec   185 MBytes  1.55 Gbits/sec
[  3]  1.0- 2.0 sec   186 MBytes  1.56 Gbits/sec
[  3]  2.0- 3.0 sec   186 MBytes  1.56 Gbits/sec
[  3]  3.0- 4.0 sec   186 MBytes  1.56 Gbits/sec
[  3]  4.0- 5.0 sec   186 MBytes  1.56 Gbits/sec
[  3]  0.0- 5.0 sec   927 MBytes  1.56 Gbits/sec
  kTLS(aria-generic)
[  3]  0.0- 1.0 sec   198 MBytes  1.66 Gbits/sec
[  3]  1.0- 2.0 sec   194 MBytes  1.62 Gbits/sec
[  3]  2.0- 3.0 sec   194 MBytes  1.63 Gbits/sec
[  3]  3.0- 4.0 sec   194 MBytes  1.63 Gbits/sec
[  3]  4.0- 5.0 sec   194 MBytes  1.62 Gbits/sec
[  3]  0.0- 5.0 sec   974 MBytes  1.63 Gbits/sec
  kTLS(aria-avx with GFNI)
[  3]  0.0- 1.0 sec   632 MBytes  5.30 Gbits/sec
[  3]  1.0- 2.0 sec   657 MBytes  5.51 Gbits/sec
[  3]  2.0- 3.0 sec   657 MBytes  5.51 Gbits/sec
[  3]  3.0- 4.0 sec   656 MBytes  5.50 Gbits/sec
[  3]  4.0- 5.0 sec   656 MBytes  5.50 Gbits/sec
[  3]  0.0- 5.0 sec  3.18 GBytes  5.47 Gbits/sec

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/20220925150033.24615-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Minor spell fix related to 'stmmac_clk_csr_set()'

Minor spell fix related to 'stmmac_clk_csr_set()' inside a
comment used in the 'stmmac_probe_config_dt()' function.

Cc: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Link: https://lore.kernel.org/r/20220924104514.1666947-1-bhupesh.sharma@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Remove from FPGA IFC file not-needed definitions

Move the IP layout bit definitions close to the code that actually
uses them, and remove extra defines that are not in use.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Remove unused structs

Remove structs which are no longer used in the driver:
  mlx5dr_cmd_qp_create_attr
  mlx5_fs_dr_ns
  mlx5_pas

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Remove unused functions

Remove functions which are no longer used in the driver:
  mlx5e_ipsec_is_tx_flow
  mlx5_health_flush
  get_cqe_enhanced_num_mini_cqes
  get_cqe_l3_hdr_type
  mlx5_fs_is_ipsec_flow
  _mlx5_fs_is_outer_ipproto_flow
  mlx5_fs_is_outer_tcp_flow
  mlx5_fs_is_outer_udp_flow
  mlx5_fs_is_vxlan_flow
  mlx5_fs_is_outer_ipsec_flow

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: detect and enable bypass port select flow table

Use the port_select_flow_table_bypass bit of the port selection
capabilities to detect and enable explicit port affinity, even when
in lag hash mode.

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Lag, enable hash mode by default for all NICs

The firmware supports adding a steering rule to catch egress traffic
of QPs/TISs that have port affinity set explicitly in hash mode.
Enable hash mode for NICs with 2 ports as well.

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Lag, set active ports if support bypass port select flow table

The active_port bitmask indicates which ports are currently active: a set
bit means the corresponding port is active. Pass the updated active ports
to FW so that it redirects QPs/TISs from inactive ports to other ports.
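
A minimal sketch of building such a bitmask is shown below. The function
name, the DEMO_MAX_PORTS limit and the choice of bit 0 for the first port
are assumptions made for the example; the firmware command that consumes
the mask is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_PORTS	4	/* assumed upper bound for the sketch */

/*
 * Build a mask in which bit i is set when port i is active.
 * port_active[] holds one flag per port; ports at index DEMO_MAX_PORTS
 * and above are ignored.
 */
uint8_t demo_active_port_mask(const bool port_active[], unsigned int num_ports)
{
	uint8_t mask = 0;
	unsigned int i;

	for (i = 0; i < num_ports && i < DEMO_MAX_PORTS; i++) {
		if (port_active[i])
			mask |= (uint8_t)(1u << i);
	}

	return mask;
}
```

With two ports where only the second is active, the mask is 0x2, which
tells the consumer that traffic bound to the first port needs another
home.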

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

RDMA/mlx5: Don't set tx affinity when lag is in hash mode

In hash mode, without setting tx affinity explicitly, the port select
flow table decides which port is used for the traffic.
If port_select_flow_table_bypass capability is supported and tx affinity
is set explicitly for QP/TIS, they will be added into the explicit affinity
table in FW to check which port is used for the traffic.
1. The overloaded explicit affinity table may affect performance.
   To avoid this, do not set tx affinity explicitly by default.
2. The packets of the same flow need to be transmitted on the same port.
   Because the packets of the same flow use different QPs in the slow and
   fast paths, tx affinity should not be set explicitly for these QPs.
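
The resulting decision can be sketched as below. The function name, the
round-robin fallback for non-hash modes and the use of 0 to mean "no
explicit affinity" are assumptions made for the example rather than a
copy of the driver logic.

```c
#include <stdbool.h>

/*
 * Pick a tx affinity for a new QP.
 * Returns 0 when no explicit affinity should be set, otherwise a
 * 1-based port number.
 */
unsigned int demo_pick_tx_affinity(bool lag_hash_mode,
				   unsigned int rr_counter,
				   unsigned int num_ports)
{
	/* In hash mode, leave the choice to the port select flow table. */
	if (lag_hash_mode || num_ports == 0)
		return 0;

	/* Otherwise spread QPs across ports in round-robin order. */
	return rr_counter % num_ports + 1;
}
```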

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: add IFC bits for bypassing port select flow table

port_select_flow_table_bypass - When set, the device supports
bypassing the port select flow table.
active_port - Bitmask indicating the currently active ports
in PORT_SELECT_FT LAG.
MLX5_SET_HCA_CAP_OP_MODE_PORT_SELECTION - op_mod used to operate on
PORT_SELECTION_Capabilities.

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>