platform/kernel/linux-starfive.git
17 months agonet/sched: taprio: give higher priority to higher TCs in software dequeue mode
Vladimir Oltean [Tue, 7 Feb 2023 13:54:30 +0000 (15:54 +0200)]
net/sched: taprio: give higher priority to higher TCs in software dequeue mode

Current taprio software implementation is haunted by the shadow of the
igb/igc hardware model. It iterates over child qdiscs in increasing
order of TXQ index, therefore giving higher xmit priority to TXQ 0 and
lower to TXQ N. According to discussions with Vinicius, that is the
default (perhaps even unchangeable) prioritization scheme used for the
NICs that taprio was first written for (igb, igc), and we have a case of
two bugs canceling out, resulting in a functional setup on igb/igc, but
a less sane one on other NICs.

To the best of my understanding, taprio should prioritize based on the
traffic class, so it should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.

TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. We can achieve that by implementing
something very similar to q->curband from multiq_dequeue().

Since igb/igc really do have TXQ 0 of higher hardware priority than
TXQ 1 etc, we need to preserve the behavior for them as well. We really
have no choice, because in txtime-assist mode, taprio is essentially a
software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
says Kurt Kanzenbach).

To preserve the behavior, we need a capability bit so that taprio can
determine if it's running on igb/igc, or on something else. Because igb
doesn't offload taprio at all, we can't piggyback on the
qdisc_offload_query_caps() call from taprio_enable_offload(), but
instead we need a separate call which is also made for software
scheduling.

Introduce two static keys to minimize the performance penalty on systems
which only have igb/igc NICs, and on systems which only have other NICs.
For mixed systems, taprio will have to dynamically check whether to
dequeue using one prioritization algorithm or using the other.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: avoid calling child->ops->dequeue(child) twice
Vladimir Oltean [Tue, 7 Feb 2023 13:54:29 +0000 (15:54 +0200)]
net/sched: taprio: avoid calling child->ops->dequeue(child) twice

Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
earlier than the previous skb_found label. This is possible because
we've unified the treatment of the child->ops->dequeue(child) return
call, we always try other TXQs now, instead of abandoning the root
dequeue completely if we failed in the peek() case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: refactor one skb dequeue from TXQ to separate function
Vladimir Oltean [Tue, 7 Feb 2023 13:54:28 +0000 (15:54 +0200)]
net/sched: taprio: refactor one skb dequeue from TXQ to separate function

Future changes will refactor the TXQ selection procedure, and a lot of
stuff will become messy, the indentation of the bulk of the dequeue
procedure would increase, etc.

Break out the bulk of the function into a new one, which knows the TXQ
(child qdisc) we should perform a dequeue from.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: continue with other TXQs if one dequeue() failed
Vladimir Oltean [Tue, 7 Feb 2023 13:54:27 +0000 (15:54 +0200)]
net/sched: taprio: continue with other TXQs if one dequeue() failed

This changes the handling of an unlikely condition to not stop dequeuing
if taprio failed to dequeue the peeked skb in taprio_dequeue().

I've no idea when this can happen, but the only side effect seems to be
that the atomic_sub_return() call right above will have consumed some
budget. This isn't a big deal, since either that made us remain without
any budget (and therefore, we'd exit on the next peeked skb anyway), or
we could send some packets from other TXQs.

I'm making this change because in a future patch I'll be refactoring the
dequeue procedure to simplify it, and this corner case will have to go
away.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: delete peek() implementation
Vladimir Oltean [Tue, 7 Feb 2023 13:54:26 +0000 (15:54 +0200)]
net/sched: taprio: delete peek() implementation

There isn't any code in the network stack which calls taprio_peek().
We only see qdisc->ops->peek() being called on child qdiscs of other
classful qdiscs, never from the generic qdisc code. Whereas taprio is
never a child qdisc, it is always root.

This snippet of a comment from qdisc_peek_dequeued() seems to confirm:

/* we can reuse ->gso_skb because peek isn't called for root qdiscs */

Since I've been known to be wrong many times though, I'm not completely
removing it, but leaving a stub function in place which emits a warning.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch 'micrel-lan8841-support'
David S. Miller [Wed, 8 Feb 2023 09:16:07 +0000 (09:16 +0000)]
Merge branch 'micrel-lan8841-support'

Horatiu Vultur says:

====================
net: micrel: Add support for lan8841 PHY

Add support for lan8841 PHY.

The first patch add the support for lan8841 PHY which can run at
10/100/1000Mbit. It also has support for other features, but they are not
added in this series.

The second patch updates the documentation for the dt-bindings which is
similar to the ksz9131.

v3->v4:
- add space between defines and function names
- inside lan8841_config_init use only ret variable

v2->v3:
- reuse ksz9131_config_init
- allow only open-drain configuration
- change from single patch to a patch series

v1->v2:
- Remove hardcoded values
- Fix typo in commit message
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agodt-bindings: net: micrel-ksz90x1.txt: Update for lan8841
Horatiu Vultur [Tue, 7 Feb 2023 10:52:12 +0000 (11:52 +0100)]
dt-bindings: net: micrel-ksz90x1.txt: Update for lan8841

The lan8841 has the same bindings as ksz9131, so just reuse the entire
section of ksz9131.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: micrel: Add support for lan8841 PHY
Horatiu Vultur [Tue, 7 Feb 2023 10:52:11 +0000 (11:52 +0100)]
net: micrel: Add support for lan8841 PHY

The LAN8841 is completely integrated triple-speed (10BASE-T/ 100BASE-TX/
1000BASE-T) Ethernet physical layer transceivers for transmission and
reception of data on standard CAT-5, as well as CAT-5e and CAT-6,
unshielded twisted pair (UTP) cables.
The LAN8841 offers the industry-standard GMII/MII as well as the RGMII.
Some of the features of the PHY are:
- Wake on LAN
- Auto-MDIX
- IEEE 1588-2008 (V2)
- LinkMD Capable diagnosis

Currently the patch offers support only for link configuration.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: lan966x: Add support for TC flower filter statistics
Horatiu Vultur [Tue, 7 Feb 2023 10:35:49 +0000 (11:35 +0100)]
net: lan966x: Add support for TC flower filter statistics

Add flower filter packet statistics. This will just read the TCAM
counter of the rule, which mention how many packages were hit by this
rule.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Wed, 8 Feb 2023 05:40:40 +0000 (21:40 -0800)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: various virtualization cleanups

Jacob Keller says:

This series contains a variety of refactors and cleanups in the VF code for
the ice driver. Its primary focus is cleanup and simplification of the VF
operations and addition of a few new operations that will be required by
Scalable IOV, as well as some other refactors needed for the handling of VF
subfunctions.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: remove unnecessary virtchnl_ether_addr struct use
  ice: introduce .irq_close VF operation
  ice: introduce clear_reset_state operation
  ice: convert vf_ops .vsi_rebuild to .create_vsi
  ice: introduce ice_vf_init_host_cfg function
  ice: add a function to initialize vf entry
  ice: Pull common tasks into ice_vf_post_vsi_rebuild
  ice: move ice_vf_vsi_release into ice_vf_lib.c
  ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg
  ice: refactor VSI setup to use parameter structure
  ice: drop unnecessary VF parameter from several VSI functions
  ice: fix function comment referring to ice_vsi_alloc
  ice: Add more usage of existing function ice_get_vf_vsi(vf)
====================

Link: https://lore.kernel.org/r/20230206214813.20107-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agodevlink: Fix memleak in health diagnose callback
Moshe Shemesh [Mon, 6 Feb 2023 15:56:16 +0000 (17:56 +0200)]
devlink: Fix memleak in health diagnose callback

The callback devlink_nl_cmd_health_reporter_diagnose_doit() miss
devlink_fmsg_free(), which leads to memory leak.

Fix it by adding devlink_fmsg_free().

Fixes: e994a75fb7f9 ("devlink: remove reporter reference counting")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/1675698976-45993-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonfp: flower: add check for flower VF netdevs for get/set_eeprom
James Hershaw [Mon, 6 Feb 2023 15:48:36 +0000 (16:48 +0100)]
nfp: flower: add check for flower VF netdevs for get/set_eeprom

Move the nfp_net_get_port_mac_by_hwinfo() check to ahead in the
get/set_eeprom() functions to in order to check for a VF netdev, which
this function does not support.

It is debatable if this is a fix or an enhancement, and we have chosen
to go for the latter. It does address a problem introduced by
commit 74b4f1739d4e ("nfp: flower: change get/set_eeprom logic and enable for flower reps").
However, the ethtool->len == 0 check avoids the problem manifesting as a
run-time bug (NULL pointer dereference of app).

Signed-off-by: James Hershaw <james.hershaw@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230206154836.2803995-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'mlxsw-misc-devlink-changes'
Jakub Kicinski [Wed, 8 Feb 2023 04:18:51 +0000 (20:18 -0800)]
Merge branch 'mlxsw-misc-devlink-changes'

Petr Machata says:

====================
mlxsw: Misc devlink changes

This patchset adjusts mlxsw to recent devlink changes in net-next.

Patch #1 removes a devl_param_driverinit_value_set() call that was
unnecessary, but now additionally triggers a WARN_ON.

Patches #2-#4 are non-functional preparations for the following patches.

Patch #5 fixes a use-after-free that is triggered while changing network
namespaces.

Patch #6 makes mlxsw consistent with netdevsim by having mlxsw register
its devlink instance before its sub-objects. It helps us avoid a warning
described in the commit message.
====================

Link: https://lore.kernel.org/r/cover.1675692666.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: core: Register devlink instance before sub-objects
Ido Schimmel [Mon, 6 Feb 2023 15:39:23 +0000 (16:39 +0100)]
mlxsw: core: Register devlink instance before sub-objects

Recent changes made it possible to register the devlink instance before
its sub-objects and under the instance lock. Among other things, it
allows us to avoid warnings such as this one [1]. The warning is
generated because a buggy firmware is generating a health event during
driver initialization, before the devlink instance is registered.

Move the registration of the devlink instance to the beginning of the
initialization flow to avoid such problems.

A similar change was implemented in netdevsim in commit 82a3aef2e6af
("netdevsim: move devlink registration under the instance lock").

[1]
WARNING: CPU: 3 PID: 49 at net/devlink/leftover.c:7509 devlink_recover_notify.constprop.0+0xaf/0xc0
[...]
Call Trace:
 <TASK>
 devlink_health_report+0x45/0x1d0
 mlxsw_core_health_event_work+0x24/0x30 [mlxsw_core]
 process_one_work+0x1db/0x390
 worker_thread+0x49/0x3b0
 kthread+0xe5/0x110
 ret_from_fork+0x1f/0x30
 </TASK>

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: spectrum_acl_tcam: Move devlink param to TCAM code
Ido Schimmel [Mon, 6 Feb 2023 15:39:22 +0000 (16:39 +0100)]
mlxsw: spectrum_acl_tcam: Move devlink param to TCAM code

Cited commit added 'DEVLINK_CMD_PARAM_DEL' notifications whenever the
network namespace of the devlink instance is changed. Specifically, the
notifications are generated after calling reload_down(), but before
calling reload_up(). At this stage, the data structures accessed while
reading the value of the "acl_region_rehash_interval" devlink parameter
are uninitialized, resulting in a use-after-free [1].

Fix by moving the registration and unregistration of the devlink
parameter to the TCAM code where it is actually used. This means that
the parameter is unregistered during reload_down() and then
re-registered during reload_up(), avoiding the use-after-free between
these two operations.

Reproducer:

 # ip netns add test123
 # devlink dev reload pci/0000:06:00.0 netns test123

[1]
BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
Read of size 4 at addr ffff888162fd37d8 by task devlink/1323
[...]
Call Trace:
 <TASK>
 dump_stack_lvl+0x95/0xbd
 print_report+0x181/0x4a1
 kasan_report+0xdb/0x200
 mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
 mlxsw_sp_params_acl_region_rehash_intrvl_get+0x32/0x80
 devlink_nl_param_fill.constprop.0+0x29a/0x11e0
 devlink_param_notify.constprop.0+0xb9/0x250
 devlink_notify_unregister+0xbc/0x470
 devlink_reload+0x1aa/0x440
 devlink_nl_cmd_reload+0x559/0x11b0
 genl_family_rcv_msg_doit.isra.0+0x1f8/0x2e0
 genl_rcv_msg+0x558/0x7f0
 netlink_rcv_skb+0x170/0x440
 genl_rcv+0x2d/0x40
 netlink_unicast+0x53f/0x810
 netlink_sendmsg+0x961/0xe80
 __sys_sendto+0x2a4/0x420
 __x64_sys_sendto+0xe5/0x1c0
 do_syscall_64+0x38/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 7d7e9169a3ec ("devlink: move devlink reload notifications back in between _down() and _up() calls")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: spectrum_acl_tcam: Reorder functions to avoid forward declarations
Ido Schimmel [Mon, 6 Feb 2023 15:39:21 +0000 (16:39 +0100)]
mlxsw: spectrum_acl_tcam: Reorder functions to avoid forward declarations

Move the initialization and de-initialization code further below in
order to avoid forward declarations in the next patch. No functional
changes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: spectrum_acl_tcam: Make fini symmetric to init
Ido Schimmel [Mon, 6 Feb 2023 15:39:20 +0000 (16:39 +0100)]
mlxsw: spectrum_acl_tcam: Make fini symmetric to init

Move mutex_destroy() to the end to make the function symmetric with
mlxsw_sp_acl_tcam_init(). No functional changes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: spectrum_acl_tcam: Add missing mutex_destroy()
Ido Schimmel [Mon, 6 Feb 2023 15:39:19 +0000 (16:39 +0100)]
mlxsw: spectrum_acl_tcam: Add missing mutex_destroy()

Pair mutex_init() with a mutex_destroy() in the error path. Found during
code review. No functional changes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agomlxsw: spectrum: Remove pointless call to devlink_param_driverinit_value_set()
Danielle Ratson [Mon, 6 Feb 2023 15:39:18 +0000 (16:39 +0100)]
mlxsw: spectrum: Remove pointless call to devlink_param_driverinit_value_set()

The "acl_region_rehash_interval" devlink parameter is a "runtime"
parameter, making the call to devl_param_driverinit_value_set()
pointless. Before cited commit the function simply returned an error
(that was not checked), but now it emits a WARNING [1].

Fix by removing the function call.

[1]
WARNING: CPU: 0 PID: 7 at net/devlink/leftover.c:10974
devl_param_driverinit_value_set+0x8c/0x90
[...]
Call Trace:
 <TASK>
 mlxsw_sp2_params_register+0x83/0xb0 [mlxsw_spectrum]
 __mlxsw_core_bus_device_register+0x5e5/0x990 [mlxsw_core]
 mlxsw_core_bus_device_register+0x42/0x60 [mlxsw_core]
 mlxsw_pci_probe+0x1f0/0x230 [mlxsw_pci]
 local_pci_probe+0x1a/0x40
 work_for_cpu_fn+0xf/0x20
 process_one_work+0x1db/0x390
 worker_thread+0x1d5/0x3b0
 kthread+0xe5/0x110
 ret_from_fork+0x1f/0x30
 </TASK>

Fixes: 85fe0b324c83 ("devlink: make devlink_param_driverinit_value_set() return void")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: enetc: add support for MAC Merge statistics counters
Vladimir Oltean [Mon, 6 Feb 2023 09:45:31 +0000 (11:45 +0200)]
net: enetc: add support for MAC Merge statistics counters

Add PF driver support for the following:

- Viewing the standardized MAC Merge layer counters.

- Viewing the standardized Ethernet MAC and RMON counters associated
  with the pMAC.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230206094531.444988-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: enetc: add support for MAC Merge layer
Vladimir Oltean [Mon, 6 Feb 2023 09:45:30 +0000 (11:45 +0200)]
net: enetc: add support for MAC Merge layer

Add PF driver support for viewing and changing the MAC Merge sublayer
parameters, and seeing the verification state machine's current state.
The verification handshake with the link partner is driven by hardware.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230206094531.444988-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'sched-cpumask-improve-on-cpumask_local_spread-locality'
Jakub Kicinski [Wed, 8 Feb 2023 02:20:02 +0000 (18:20 -0800)]
Merge branch 'sched-cpumask-improve-on-cpumask_local_spread-locality'

Yury Norov says:

====================
sched: cpumask: improve on cpumask_local_spread() locality

cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.

This has significant performance implications on NUMA machines, for example
when using NUMA-aware allocated memory together with NUMA-aware IRQ
affinity hints.

Performance tests from patch 8 of this series for mellanox network
driver show:

  TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
  Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

  +-------------------------+-----------+------------------+------------------+
  |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
  +-------------------------+-----------+------------------+------------------+
  | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
  +-------------------------+-----------+------------------+------------------+

  Bottleneck in RX side is released, reached linerate (~1.8x speedup).
  ~30% less cpu util on TX.
====================

Link: https://lore.kernel.org/r/20230121042436.2661843-1-yury.norov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agolib/cpumask: update comment for cpumask_local_spread()
Yury Norov [Sat, 21 Jan 2023 04:24:36 +0000 (20:24 -0800)]
lib/cpumask: update comment for cpumask_local_spread()

Now that we have an iterator-based alternative for a very common case
of using cpumask_local_spread for all cpus in a row, it's worth to
mention that in comment to cpumask_local_spread().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
Tariq Toukan [Sat, 21 Jan 2023 04:24:35 +0000 (20:24 -0800)]
net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API that minds the
actual distances, so that remote NUMAs with short distance are preferred
over farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
   int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

* CPU util on active cores only.

Setups details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..

$ numactl -H
..
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
  6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
 10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
 14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs):
Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000

After:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
[Tweaked API use]
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agosched/topology: Introduce for_each_numa_hop_mask()
Valentin Schneider [Sat, 21 Jan 2023 04:24:34 +0000 (20:24 -0800)]
sched/topology: Introduce for_each_numa_hop_mask()

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, wrap the logic for iterating over
all (distance, mask) values inside an iterator macro.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agosched/topology: Introduce sched_numa_hop_mask()
Valentin Schneider [Sat, 21 Jan 2023 04:24:33 +0000 (20:24 -0800)]
sched/topology: Introduce sched_numa_hop_mask()

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agolib/cpumask: reorganize cpumask_local_spread() logic
Yury Norov [Sat, 21 Jan 2023 04:24:32 +0000 (20:24 -0800)]
lib/cpumask: reorganize cpumask_local_spread() logic

Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
else-branch of cpumask_local_spread() is just a function call, and
we can simplify logic by using ternary operator.

While here, replace BUG() with WARN_ON().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agocpumask: improve on cpumask_local_spread() locality
Yury Norov [Sat, 21 Jan 2023 04:24:31 +0000 (20:24 -0800)]
cpumask: improve on cpumask_local_spread() locality

Switch cpumask_local_spread() to use newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agosched: add sched_numa_find_nth_cpu()
Yury Norov [Sat, 21 Jan 2023 04:24:30 +0000 (20:24 -0800)]
sched: add sched_numa_find_nth_cpu()

The function finds Nth set CPU in a given cpumask starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of linear walk, which makes the overall complexity
of O(log n) in terms of number of cpumask_weight() calls.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agocpumask: introduce cpumask_nth_and_andnot
Yury Norov [Sat, 21 Jan 2023 04:24:29 +0000 (20:24 -0800)]
cpumask: introduce cpumask_nth_and_andnot

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate result in temporary cpumask.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agolib/find: introduce find_nth_and_andnot_bit
Yury Norov [Sat, 21 Jan 2023 04:24:28 +0000 (20:24 -0800)]
lib/find: introduce find_nth_and_andnot_bit

In the following patches the function is used to implement in-place bitmaps
traversing without storing intermediate result in temporary bitmaps.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'net-core-use-a-dedicated-kmem_cache-for-skb-head-allocs'
Jakub Kicinski [Tue, 7 Feb 2023 18:57:57 +0000 (10:57 -0800)]
Merge branch 'net-core-use-a-dedicated-kmem_cache-for-skb-head-allocs'

Eric Dumazet says:

====================
net: core: use a dedicated kmem_cache for skb head allocs

Our profile data show that using kmalloc(non_const_size)/kfree(ptr)
has a certain cost, because kfree(ptr) has to pull a 'struct page'
in cpu caches.

Using a dedicated kmem_cache for TCP skb->head allocations makes
a difference, both in cpu cycles and memory savings.

This kmem_cache could also be used for GRO skb allocations,
this is left as a future exercise.
====================

Link: https://lore.kernel.org/r/20230206173103.2617121-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: add dedicated kmem_cache for typical/small skb->head
Eric Dumazet [Mon, 6 Feb 2023 17:31:03 +0000 (17:31 +0000)]
net: add dedicated kmem_cache for typical/small skb->head

Recent removal of ksize() in alloc_skb() increased
performance because we no longer read
the associated struct page.

We have an equivalent cost at kfree_skb() time.

kfree(skb->head) has to access a struct page,
often cold in cpu caches to get the owning
struct kmem_cache.

Considering that many allocations are small (at least for TCP ones)
we can have our own kmem_cache to avoid the cache line miss.

This also saves memory because these small heads
are no longer padded to 1024 bytes.

CONFIG_SLUB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head   2907   2907    640   51    8 : tunables    0    0    0 : slabdata     57     57      0

CONFIG_SLAB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head    607    624    640    6    1 : tunables   54   27    8 : slabdata    104    104      5

Notes:

- After Kees Cook patches and this one, we might
  be able to revert commit
  dbae2b062824 ("net: skb: introduce and use a single page frag cache")
  because GRO_MAX_HEAD is also small.

- This patch is a NOP for CONFIG_SLOB=y builds.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: factorize code in kmalloc_reserve()
Eric Dumazet [Mon, 6 Feb 2023 17:31:02 +0000 (17:31 +0000)]
net: factorize code in kmalloc_reserve()

All kmalloc_reserve() callers have to make the same computation,
we can factorize them, to prepare following patch in the series.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: remove osize variable in __alloc_skb()
Eric Dumazet [Mon, 6 Feb 2023 17:31:01 +0000 (17:31 +0000)]
net: remove osize variable in __alloc_skb()

This is a cleanup patch, to prepare following change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: add SKB_HEAD_ALIGN() helper
Eric Dumazet [Mon, 6 Feb 2023 17:31:00 +0000 (17:31 +0000)]
net: add SKB_HEAD_ALIGN() helper

We have many places using this expression:

 SKB_DATA_ALIGN(sizeof(struct skb_shared_info))

Use of SKB_HEAD_ALIGN() will allow to clean them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux...
Paolo Abeni [Tue, 7 Feb 2023 14:54:08 +0000 (15:54 +0100)]
Merge tag 'linux-can-next-for-6.3-20230206' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2023-02-06

this is a pull request of 47 patches for net-next/master.

The first two patch is by Oliver Hartkopp. One adds missing error
checking to the CAN_GW protocol, the other adds a missing CAN address
family check to the CAN ISO TP protocol.

Thomas Kopp contributes a performance optimization to the mcp251xfd
driver.

The next 11 patches are by Geert Uytterhoeven and add support for
R-Car V4H systems to the rcar_canfd driver.

Stephane Grosjean and Lukas Magel contribute 8 patches to the peak_usb
driver, which add support for configurable CAN channel ID.

The last 17 patches are by me and target the CAN bit timing
configuration. The bit timing is cleaned up, error messages are
improved and forwarded to user space via NL_SET_ERR_MSG_FMT() instead
of netdev_err(), and the SJW handling is updated, including the
definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.

* tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (47 commits)
  can: bittiming: can_validate_bitrate(): report error via netlink
  can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
  can: bittiming: can_calc_bittiming(): clean up SJW handling
  can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
  can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment
  can: bittiming: can_sjw_check(): report error via netlink and harmonize error value
  can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value
  can: bittiming: factor out can_sjw_set_default() and can_sjw_check()
  can: bittiming: can_changelink() pass extack down callstack
  can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
  can: netlink: can_validate(): validate sample point for CAN and CAN-FD
  can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided
  can: dev: register_candev(): ensure that bittiming const are valid
  can: bittiming: can_get_bittiming(): use direct return and remove unneeded else
  can: bittiming: can_fixup_bittiming(): set effective tq
  can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
  can: bittiming(): replace open coded variants of can_bit_time()
  can: peak_usb: Reorder include directives alphabetically
  can: peak_usb: align CAN channel ID format in log with sysfs attribute
  can: peak_usb: export PCAN CAN channel ID as sysfs device attribute
  ...
====================

Link: https://lore.kernel.org/r/20230206131620.2758724-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agoethtool: mm: fix get_mm() return code not propagating to user space
Vladimir Oltean [Mon, 6 Feb 2023 09:49:32 +0000 (11:49 +0200)]
ethtool: mm: fix get_mm() return code not propagating to user space

If ops->get_mm() returns a non-zero error code, we goto out_complete,
but there, we return 0. Fix that to propagate the "ret" variable to the
caller. If ops->get_mm() succeeds, it will always return 0.

Fixes: 2b30f8291a30 ("net: ethtool: add support for MAC Merge layer")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230206094932.446379-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
17 months agonet: openvswitch: reduce cpu_used_mask memory
Eddy Tao [Sun, 5 Feb 2023 01:35:37 +0000 (09:35 +0800)]
net: openvswitch: reduce cpu_used_mask memory

Use actual CPU number instead of hardcoded value to decide the size
of 'cpu_used_mask' in 'struct sw_flow'. Below is the reason.

'struct cpumask cpu_used_mask' is embedded in struct sw_flow.
Its size is hardcoded to CONFIG_NR_CPUS bits, which can be
8192 by default, it costs memory and slows down ovs_flow_alloc.

To address this:
 Redefine cpu_used_mask to pointer.
 Append cpumask_size() bytes after 'stat' to hold cpumask.
 Initialization cpu_used_mask right after stats_last_writer.

APIs like cpumask_next and cpumask_set_cpu never access bits
beyond cpu count, cpumask_size() bytes of memory is enough.

Signed-off-by: Eddy Tao <taoyuan_eddy@hotmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/OS3P286MB229570CCED618B20355D227AF5D59@OS3P286MB2295.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoamd-xgbe: fix mismatched prototype
Arnd Bergmann [Fri, 3 Feb 2023 12:15:36 +0000 (13:15 +0100)]
amd-xgbe: fix mismatched prototype

The forward declaration was introduced with a prototype that does
not match the function definition:

drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:2166:13: error: conflicting types for 'xgbe_phy_perform_ratechange' due to enum/integer mismatch; have 'void(struct xgbe_prv_data *, enum xgbe_mb_cmd,  enum xgbe_mb_subcmd)' [-Werror=enum-int-mismatch]
 2166 | static void xgbe_phy_perform_ratechange(struct xgbe_prv_data *pdata,
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:391:13: note: previous declaration of 'xgbe_phy_perform_ratechange' with type 'void(struct xgbe_prv_data *, unsigned int,  unsigned int)'
  391 | static void xgbe_phy_perform_ratechange(struct xgbe_prv_data *pdata,
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~

Ideally there should not be any forward declarations here, which
would make it easier to show that there is no unbounded recursion.
I tried fixing this but could not figure out how to avoid the
recursive call.

As a hotfix, address only the broken prototype to fix the build
problem instead.

Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Link: https://lore.kernel.org/r/20230203121553.2871598-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: mscc: ocelot: un-export unused regmap symbols
Colin Foster [Sat, 4 Feb 2023 18:20:56 +0000 (10:20 -0800)]
net: mscc: ocelot: un-export unused regmap symbols

There are no external users of the vsc7514_*_regmap[] symbols or
vsc7514_vcap_* functions. They were exported in commit 32ecd22ba60b ("net:
mscc: ocelot: split register definitions to a separate file") with the
intention of being used, but the actual structure used in commit
2efaca411c96 ("net: mscc: ocelot: expose vsc7514_regmap definition") ended
up being all that was needed.

Bury these unnecessary symbols.

Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230204182056.25502-1-colin.foster@in-advantage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'aux-bus-v11' of https://github.com/ajitkhaparde1/linux
Jakub Kicinski [Tue, 7 Feb 2023 06:24:33 +0000 (22:24 -0800)]
Merge branch 'aux-bus-v11' of https://github.com/ajitkhaparde1/linux

Ajit Khaparde says:

====================
bnxt: Add Auxiliary driver support

Add auxiliary device driver for Broadcom devices.
The bnxt_en driver will register and initialize an aux device
if RDMA is enabled in the underlying device.
The bnxt_re driver will then probe and initialize the
RoCE interfaces with the infiniband stack.

We got rid of the bnxt_en_ops which the bnxt_re driver used to
communicate with bnxt_en.
Similarly  We have tried to clean up most of the bnxt_ulp_ops.
In most of the cases we used the functions and entry points provided
by the auxiliary bus driver framework.
And now these are the minimal functions needed to support the functionality.

We will try to work on getting rid of the remaining if we find any
other viable option in future.

* 'aux-bus-v11' of https://github.com/ajitkhaparde1/linux:
  bnxt_en: Remove runtime interrupt vector allocation
  RDMA/bnxt_re: Remove the sriov config callback
  bnxt_en: Remove struct bnxt access from RoCE driver
  bnxt_en: Use auxiliary bus calls over proprietary calls
  bnxt_en: Use direct API instead of indirection
  bnxt_en: Remove usage of ulp_id
  RDMA/bnxt_re: Use auxiliary driver interface
  bnxt_en: Add auxiliary driver support
====================

Link: https://lore.kernel.org/r/20230202033809.3989-1-ajit.khaparde@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoice: remove unnecessary virtchnl_ether_addr struct use
Jacob Keller [Thu, 19 Jan 2023 01:16:53 +0000 (17:16 -0800)]
ice: remove unnecessary virtchnl_ether_addr struct use

The dev_lan_addr and hw_lan_addr members of ice_vf are used only to store
the MAC address for the VF. They are defined using virtchnl_ether_addr, but
only the .addr sub-member is actually used. Drop the use of
virtchnl_ether_addr and just use a u8 array of length [ETH_ALEN].

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: introduce .irq_close VF operation
Jacob Keller [Thu, 19 Jan 2023 01:16:52 +0000 (17:16 -0800)]
ice: introduce .irq_close VF operation

The Scalable IOV implementation will require notifying the VDCM driver when
an IRQ must be closed. This allows the VDCM to handle releasing stale IRQ
context values and properly reconfigure.

To handle this, introduce a new optional .irq_close callback to the VF
operations structure. This will be implemented by Scalable IOV to handle
the shutdown of the IRQ context.

Since the SR-IOV implementation does not need this, we must check that its
non-NULL before calling it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: introduce clear_reset_state operation
Jacob Keller [Thu, 19 Jan 2023 01:16:51 +0000 (17:16 -0800)]
ice: introduce clear_reset_state operation

When hardware is reset, the VF relies on the VFGEN_RSTAT register to detect
when the VF is finished resetting. This is a tri-state register where 0
indicates a reset is in progress, 1 indicates the hardware is done
resetting, and 2 indicates that the software is done resetting.

Currently the PF driver relies on the device hardware resetting VFGEN_RSTAT
when a global reset occurs. This works ok, but it does mean that the VF
might not immediately notice a reset when the driver first detects that the
global reset is occurring.

This is also problematic for Scalable IOV, because there is no read/write
equivalent VFGEN_RSTAT register for the Scalable VSI type. Instead, the
Scalable IOV VFs will need to emulate this register.

To support this, introduce a new VF operation, clear_reset_state, which is
called when the PF driver first detects a global reset. The Single Root IOV
implementation can just write to VFGEN_RSTAT to ensure it's cleared
immediately, without waiting for the actual hardware reset to begin. The
Scalable IOV implementation will use this as part of its tracking of the
reset status to allow properly reporting the emulated VFGEN_RSTAT to the VF
driver.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: convert vf_ops .vsi_rebuild to .create_vsi
Jacob Keller [Thu, 19 Jan 2023 01:16:50 +0000 (17:16 -0800)]
ice: convert vf_ops .vsi_rebuild to .create_vsi

The .vsi_rebuild function exists for ice_reset_vf. It is used to release
and re-create the VSI during a single-VF reset.

This function is only called when we need to re-create the VSI, and not
when rebuilding an existing VSI. This makes the single-VF reset process
different from the process used to restore functionality after a
hardware reset such as the PF reset or EMP reset.

When we add support for Scalable IOV VFs, the implementation will be very
similar. The primary difference will be in the fact that each VF type uses
a different underlying VSI type in hardware.

Move the common functionality into a new ice_vf_recreate VSI function. This
will allow the two IOV paths to share this functionality. Rework the
.vsi_rebuild vf_op into .create_vsi, only performing the task of creating a
new VSI.

This creates a nice dichotomy between the ice_vf_rebuild_vsi and
ice_vf_recreate_vsi, and should make it more clear why the two flows atre
distinct.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: introduce ice_vf_init_host_cfg function
Jacob Keller [Thu, 19 Jan 2023 01:16:49 +0000 (17:16 -0800)]
ice: introduce ice_vf_init_host_cfg function

Introduce a new generic helper ice_vf_init_host_cfg which performs common
host configuration initialization tasks that will need to be done for both
Single Root IOV and the new Scalable IOV implementation.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: add a function to initialize vf entry
Jacob Keller [Thu, 19 Jan 2023 01:16:48 +0000 (17:16 -0800)]
ice: add a function to initialize vf entry

Some of the initialization code for Single Root IOV VFs will need to be
reused when we introduce Scalable IOV. Pull this code out into a new
ice_initialize_vf_entry helper function.

Co-developed-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: Pull common tasks into ice_vf_post_vsi_rebuild
Jacob Keller [Thu, 19 Jan 2023 01:16:47 +0000 (17:16 -0800)]
ice: Pull common tasks into ice_vf_post_vsi_rebuild

The Single Root IOV implementation of .post_vsi_rebuild performs some tasks
that will ultimately need to be shared with the Scalable IOV implementation
such as rebuilding the host configuration.

Refactor by introducing a new wrapper function, ice_vf_post_vsi_rebuild
which performs the tasks that will be shared between SR-IOV and Scalable
IOV. Move the ice_vf_rebuild_host_cfg and ice_vf_set_initialized calls into
this wrapper. Then call the implementation specific post_vsi_rebuild
handler afterwards.

This ensures that we will properly re-initialize filters and expected
settings for both SR-IOV and Scalable IOV.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: move ice_vf_vsi_release into ice_vf_lib.c
Jacob Keller [Thu, 19 Jan 2023 01:16:46 +0000 (17:16 -0800)]
ice: move ice_vf_vsi_release into ice_vf_lib.c

The ice_vf_vsi_release function will be used in a future change to
refactor the .vsi_rebuild function. Move this over to ice_vf_lib.c so
that it can be used there.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg
Jacob Keller [Thu, 19 Jan 2023 01:16:44 +0000 (17:16 -0800)]
ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg

The ice_vsi_alloc and ice_vsi_cfg functions are used together to allocate
and configure a new VSI, called as part of the ice_vsi_setup function.

In the future with the addition of the subfunction code the ice driver
will want to be able to allocate a VSI while delaying the configuration to
a later point of the port activation.

Currently this requires that the port code know what type of VSI should
be allocated. This is required because ice_vsi_alloc assigns the VSI type.

Refactor the ice_vsi_alloc and ice_vsi_cfg functions so that VSI type
assignment isn't done until the configuration stage. This will allow the
devlink port addition logic to reserve a VSI as early as possible before
the type of the port is known. In this way, the port add can fail in the
event that all hardware VSI resources are exhausted.

Since the ice_vsi_cfg function already takes the ice_vsi_cfg_params
structure, this is relatively straight forward.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: refactor VSI setup to use parameter structure
Jacob Keller [Thu, 19 Jan 2023 01:16:43 +0000 (17:16 -0800)]
ice: refactor VSI setup to use parameter structure

The ice_vsi_setup function, ice_vsi_alloc, and ice_vsi_cfg functions have
grown a large number of parameters. These parameters are used to initialize
a new VSI, as well as re-configure an existing VSI

Any time we want to add a new parameter to this function chain, even if it
will usually be unset, we have to change many call sites due to changing
the function signature.

A future change is going to refactor ice_vsi_alloc and ice_vsi_cfg to move
the VSI configuration and initialization all into ice_vsi_cfg.

Before this, refactor the VSI setup flow to use a new ice_vsi_cfg_params
structure. This will contain the configuration (mainly pointers) used to
initialize a VSI.

Pass this from ice_vsi_setup into the related functions such as
ice_vsi_alloc, ice_vsi_cfg, and ice_vsi_cfg_def.

Introduce a helper, ice_vsi_to_params to convert an existing VSI to the
parameters used to initialize it. This will aid in the flows where we
rebuild an existing VSI.

Since we also pass the ICE_VSI_FLAG_INIT to more functions which do not
need (or cannot yet have) the VSI parameters, lets make this clear by
renaming the function parameter to vsi_flags and using a u32 instead of a
signed integer. The name vsi_flags also makes it clear that we may extend
the flags in the future.

This change will make it easier to refactor the setup flow in the future,
and will reduce the complexity required to add a new parameter for
configuration in the future.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: drop unnecessary VF parameter from several VSI functions
Jacob Keller [Thu, 19 Jan 2023 01:16:42 +0000 (17:16 -0800)]
ice: drop unnecessary VF parameter from several VSI functions

The vsi->vf pointer gets assigned early on during ice_vsi_alloc. Several
functions currently take a VF pointer, but they can just use the existing
vsi->vf pointer as needed. Modify these functions to drop the unnecessary
VF parameter.

Note that ice_vsi_cfg is not changed as a following change will refactor so
that the VF pointer is assigned during ice_vsi_cfg rather than
ice_vsi_alloc.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: fix function comment referring to ice_vsi_alloc
Jacob Keller [Thu, 19 Jan 2023 01:16:41 +0000 (17:16 -0800)]
ice: fix function comment referring to ice_vsi_alloc

Since commit 1d2e32275de7 ("ice: split ice_vsi_setup into smaller
functions") ice_vsi_alloc has not been responsible for all of the behavior
implied by the comment for ice_vsi_setup_vector_base.

Fix the comment to refer to the new function ice_vsi_alloc_def().

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoice: Add more usage of existing function ice_get_vf_vsi(vf)
Brett Creeley [Thu, 1 Dec 2022 11:58:59 +0000 (12:58 +0100)]
ice: Add more usage of existing function ice_get_vf_vsi(vf)

Extend the usage of function ice_get_vf_vsi(vf) in multiple places
instead of VF's VSI by using a long string of dereferences
(i.e. vf->pf->vsi[vf->lan_vsi_idx]).

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Kalyan Kodamagula <kalyan.kodamagula@intel.com>
Tested-by: Piotr Tyda <piotr.tyda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
17 months agoMerge patch series "can: bittiming: cleanups and rework SJW handling"
Marc Kleine-Budde [Mon, 6 Feb 2023 13:02:37 +0000 (14:02 +0100)]
Merge patch series "can: bittiming: cleanups and rework SJW handling"

Marc Kleine-Budde <mkl@pengutronix.de> says:

several people noticed that on modern CAN controllers with wide bit
timing registers the default SJW of 1 can result in unstable or no
synchronization to the CAN network. See Patch 14/17 for details.

During review of v1 Vincent pointed out that the original code and the
series doesn't always check user provided bit timing parameters,
sometimes silently limits them and the return error values are not
consistent.

This series first cleans up some code in bittiming.c, replacing
open-coded variants by macros or functions (Patches 1, 2).

Patch 3 adds the missing assignment of the effective TQ if the
interface is configured with low level timing parameters.

Patch 4 is another code cleanup.

Patches 5, 6 check the bit timing parameter during interface
registration.

Patch 7 adds a validation of the sample point.

The patches 8-13 convert the error messages from netdev_err() to
NL_SET_ERR_MSG_FMT, factor out the SJW handling from
can_fixup_bittiming(), add checking and error messages for the
individual limits and harmonize the error return values.

Patch 14 changes the default SJW value from 1 to min(Phase Seg1, Phase
Seg2 / 2).

Patch 15 switches can_calc_bittiming() to use the new SJW handling.

Patch 16 converts can_calc_bittiming() to NL_SET_ERR_MSG_FMT().

And patch 16 adds a NL_SET_ERR_MSG_FMT() error message to
can_validate_bitrate().

v1: https://lore.kernel.org/all/20220907103845.3929288-1-mkl@pengutronix.de

Link: https://lore.kernel.org/all/20230202110854.2318594-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_validate_bitrate(): report error via netlink
Marc Kleine-Budde [Wed, 1 Feb 2023 19:27:47 +0000 (20:27 +0100)]
can: bittiming: can_validate_bitrate(): report error via netlink

Report an error to user space via netlink if the requested bit rate is
not supported by the device.

Link: https://lore.kernel.org/all/20230202110854.2318594-18-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
Marc Kleine-Budde [Tue, 31 Jan 2023 16:10:54 +0000 (17:10 +0100)]
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()

Replace the netdev_err() by NL_SET_ERR_MSG_FMT() to better inform the
user about the problem. While there, use %u to print unsigned values
and improve error message a bit.

In case of an error, return -EINVAL instead of -EDOM, this corresponds
better to the actual meaning of the error value.

Link: https://lore.kernel.org/all/20230202110854.2318594-17-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_calc_bittiming(): clean up SJW handling
Marc Kleine-Budde [Tue, 6 Sep 2022 17:15:28 +0000 (19:15 +0200)]
can: bittiming: can_calc_bittiming(): clean up SJW handling

In the current code, if the user configures a bitrate, a default SJW
value of 1 is used. If the user configures both a bitrate and a SJW
value, can_calc_bittiming() silently limits the SJW value to SJW max
and TSEG2.

We came to the conclusion that if the user provided an invalid SJW
value, it's best to bail out and inform the user [1].

[1] https://lore.kernel.org/all/CAMZ6RqKqhmTgUZiwe5uqUjBDnhhC2iOjZ791+Y845btJYwVDKg@mail.gmail.com

Further the ISO 11898-1:2015 standard mandates that "SJW shall be less
than or equal to the minimum of these two items: Phase_Seg1 and
Phase_Seg2." [2] The current code is missing that check.

[2] https://lore.kernel.org/all/BL3PR11MB64844E3FC13C55433CDD0B3DFB449@BL3PR11MB6484.namprd11.prod.outlook.com

The previous patches introduced
1) can_sjw_set_default() - sets a default value for SJW if unset
2) can_sjw_check() - implements a SJW check against SJW max, Phase
   Seg1 and Phase Seg2. In the error case this function reports the error
   to user space via netlink.

Replace both the open-coded SJW default setting and the open-coded and
insufficient checks of SJW with the helper functions
can_sjw_set_default() and can_sjw_check().

Link: https://lore.kernel.org/all/20230202110854.2318594-16-mkl@pengutronix.de
Link: https://lore.kernel.org/all/CAMZ6RqKqhmTgUZiwe5uqUjBDnhhC2iOjZ791+Y845btJYwVDKg@mail.gmail.com
Link: https://lore.kernel.org/all/BL3PR11MB64844E3FC13C55433CDD0B3DFB449@BL3PR11MB6484.namprd11.prod.outlook.com
Suggested-by: Thomas Kopp <Thomas.Kopp@microchip.com>
Suggested-by: Vincent Mailhol <vincent.mailhol@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
Marc Kleine-Budde [Tue, 6 Sep 2022 15:47:58 +0000 (17:47 +0200)]
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW

"The (Re-)Synchronization Jump Width (SJW) defines how far a
 resynchronization may move the Sample Point inside the limits defined
 by the Phase Buffer Segments to compensate for edge phase errors." [1]

In other words, this means that the SJW parameter controls the
tolerance of the CAN controller to frequency errors compared to other
CAN controllers.

If the user space does not provide an SJW parameter, the kernel
chooses a default value of 1. This has proven to be a good default
value for classic CAN controllers, but no longer for modern CAN-FD
controllers.

In the past there were CAN controllers like the sja1000 with a rather
limited range of bit timing parameters. For the standard bit rates
this results in the following bit timing parameters:

| Bit timing parameters for sja1000 with 8.000000 MHz ref clock
|                     _----+--------------=> tseg1: 1 â€¦   16
|                    /    /     _---------=> tseg2: 1 â€¦    8
|                   |    |     /    _-----=> sjw:   1 â€¦    4
|                   |    |    |    /    _-=> brp:   1 â€¦   64 (inc: 1)
|                   |    |    |   |    /
|  nominal          |    |    |   |   |     real  Bitrt    nom   real   SampP
|  Bitrate TQ[ns] PrS PhS1 PhS2 SJW BRP  Bitrate  Error  SampP  SampP   Error  BTR0 BTR1
|  1000000    125   2    3    2   1   1  1000000   0.0%  75.0%  75.0%   0.0%   0x00 0x14
|   800000    125   3    4    2   1   1   800000   0.0%  80.0%  80.0%   0.0%   0x00 0x16
|   666666    125   4    4    3   1   1   666666   0.0%  80.0%  75.0%   6.2%   0x00 0x27
|   500000    125   6    7    2   1   1   500000   0.0%  87.5%  87.5%   0.0%   0x00 0x1c
|   250000    250   6    7    2   1   2   250000   0.0%  87.5%  87.5%   0.0%   0x01 0x1c
|   125000    500   6    7    2   1   4   125000   0.0%  87.5%  87.5%   0.0%   0x03 0x1c
|   100000    625   6    7    2   1   5   100000   0.0%  87.5%  87.5%   0.0%   0x04 0x1c
|    83333    750   6    7    2   1   6    83333   0.0%  87.5%  87.5%   0.0%   0x05 0x1c
|    50000   1250   6    7    2   1  10    50000   0.0%  87.5%  87.5%   0.0%   0x09 0x1c
|    33333   1875   6    7    2   1  15    33333   0.0%  87.5%  87.5%   0.0%   0x0e 0x1c
|    20000   3125   6    7    2   1  25    20000   0.0%  87.5%  87.5%   0.0%   0x18 0x1c
|    10000   6250   6    7    2   1  50    10000   0.0%  87.5%  87.5%   0.0%   0x31 0x1c

The attentive reader will notice that the SJW is 1 in most cases,
while the Seg2 phase is 2. Both values are given in TQ units, which in
turn is a duration in nanoseconds.

For example the 500 kbit/s configuration:

|  nominal                                  real  Bitrt    nom   real   SampP
|  Bitrate TQ[ns] PrS PhS1 PhS2 SJW BRP  Bitrate  Error  SampP  SampP   Error  BTR0 BTR1
|   500000    125   6    7    2   1   1   500000   0.0%  87.5%  87.5%   0.0%   0x00 0x1c

the TQ is 125ns, the Phase Seg2 is "2" (== 250ns), the SJW is "1" (==
125 ns).

Looking at a more modern CAN controller like a mcp2518fd, it has wider
bit timing registers.

| Bit timing parameters for mcp251xfd with 40.000000 MHz ref clock
|                     _----+--------------=> tseg1: 2 â€¦  256
|                    /    /     _---------=> tseg2: 1 â€¦  128
|                   |    |     /    _-----=> sjw:   1 â€¦  128
|                   |    |    |    /    _-=> brp:   1 â€¦  256 (inc: 1)
|                   |    |    |   |    /
|  nominal          |    |    |   |   |     real  Bitrt    nom   real   SampP
|  Bitrate TQ[ns] PrS PhS1 PhS2 SJW BRP  Bitrate  Error  SampP  SampP   Error      NBTCFG
|   500000     25  34   35   10   1   1   500000   0.0%  87.5%  87.5%   0.0%   0x00440900

The TQ is 25ns, the Phase Seg 2 is "10" (== 250ns), the SJW is "1" (==
25ns).

Since the kernel chooses a default SJW of 1 regardless of the TQ, this
leads to a much smaller SJW and thus much smaller tolerances to
frequency errors.

To maintain the same oscillator tolerances on controllers with wide
bit timing registers, select a default SJW value of Phase Seg2 / 2
unless Phase Seg 1 is less. This results in the following bit timing
parameters:

| Bit timing parameters for mcp251xfd with 40.000000 MHz ref clock
|                     _----+--------------=> tseg1: 2 â€¦  256
|                    /    /     _---------=> tseg2: 1 â€¦  128
|                   |    |     /    _-----=> sjw:   1 â€¦  128
|                   |    |    |    /    _-=> brp:   1 â€¦  256 (inc: 1)
|                   |    |    |   |    /
|  nominal          |    |    |   |   |     real  Bitrt    nom   real   SampP
|  Bitrate TQ[ns] PrS PhS1 PhS2 SJW BRP  Bitrate  Error  SampP  SampP   Error      NBTCFG
|   500000     25  34   35   10   5   1   500000   0.0%  87.5%  87.5%   0.0%   0x00440904

The TQ is 25ns, the Phase Seg 2 is "10" (== 250ns), the SJW is "5" (==
125ns). Which is the same as on the sja1000 controller.

[1] http://web.archive.org/http://www.oertel-halle.de/files/cia99paper.pdf

Link: https://lore.kernel.org/all/20230202110854.2318594-15-mkl@pengutronix.de
Cc: Mark Bath <mark@baggywrinkle.co.uk>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffe...
Marc Kleine-Budde [Tue, 6 Sep 2022 17:15:28 +0000 (19:15 +0200)]
can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment

According to "The Configuration of the CAN Bit Timing" [1] the SJW
"may not be longer than either Phase Buffer Segment".

Check SJW against length of both Phase buffers. In case the SJW is
greater, report an error via netlink to user space and bail out.

[1] http://web.archive.org/http://www.oertel-halle.de/files/cia99paper.pdf

Link: https://lore.kernel.org/all/20230202110854.2318594-14-mkl@pengutronix.de
Suggested-by: Vincent Mailhol <vincent.mailhol@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_sjw_check(): report error via netlink and harmonize error value
Marc Kleine-Budde [Tue, 31 Jan 2023 16:43:22 +0000 (17:43 +0100)]
can: bittiming: can_sjw_check(): report error via netlink and harmonize error value

If the user space has supplied an invalid SJW value (greater than the
maximum SJW value), report -EINVAL instead of -ERANGE, this better
matches the actual meaning of the error value.

Additionally report an error message via netlink to the user space.

Link: https://lore.kernel.org/all/20230202110854.2318594-13-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error...
Marc Kleine-Budde [Wed, 1 Feb 2023 16:17:55 +0000 (17:17 +0100)]
can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value

Check each bit timing parameter first individually against their
limits and report a meaningful error message via netlink to the user
space.

In case of an error, return -EINVAL instead of -ERANGE, this
corresponds better to the actual meaning of the error value.

Link: https://lore.kernel.org/all/20230202110854.2318594-12-mkl@pengutronix.de
Suggested-by: Vincent Mailhol <vincent.mailhol@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: factor out can_sjw_set_default() and can_sjw_check()
Marc Kleine-Budde [Tue, 27 Sep 2022 15:07:07 +0000 (17:07 +0200)]
can: bittiming: factor out can_sjw_set_default() and can_sjw_check()

Factor out the functionality of assigning a SJW default value into
can_sjw_set_default() and the checking the SJW limits into
can_sjw_check().

This functions will be improved and called from a different function
in the following patches.

Link: https://lore.kernel.org/all/20230202110854.2318594-11-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_changelink() pass extack down callstack
Marc Kleine-Budde [Tue, 31 Jan 2023 14:42:59 +0000 (15:42 +0100)]
can: bittiming: can_changelink() pass extack down callstack

This is a preparation patch.

In order to pass warning/error messages during netlink calls back to
user space, pass the extack struct down the callstack of
can_changelink(), the actual error messages will be added in the
following ptaches.

Link: https://lore.kernel.org/all/20230202110854.2318594-10-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
Marc Kleine-Budde [Tue, 27 Sep 2022 11:12:35 +0000 (13:12 +0200)]
can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()

Since commit 51c352bdbcd2 ("netlink: add support for formatted extack
messages") formatted extack messages are supported to inform the user
space or warnings/errors during netlink calls.

Replace the netdev_err() by NL_SET_ERR_MSG_FMT() to better inform the
user about the problem. While there, use %u to print unsigned values
and improve error message a bit.

Link: https://lore.kernel.org/all/20230202110854.2318594-9-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: netlink: can_validate(): validate sample point for CAN and CAN-FD
Marc Kleine-Budde [Wed, 1 Feb 2023 16:33:51 +0000 (17:33 +0100)]
can: netlink: can_validate(): validate sample point for CAN and CAN-FD

The sample point is a value in tenths of a percent. Meaningful values
are between 0 and 1000. Invalid values are rejected and an error
message is returned to user space via netlink.

Link: https://lore.kernel.org/all/20230202110854.2318594-8-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: dev: register_candev(): bail out if both fixed bit rates and bit timing constant...
Marc Kleine-Budde [Tue, 27 Sep 2022 11:33:44 +0000 (13:33 +0200)]
can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided

The CAN driver framework supports either fixed bit rates or bit timing
constants. Bail out during driver registration if both are given.

Link: https://lore.kernel.org/all/20230202110854.2318594-7-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: dev: register_candev(): ensure that bittiming const are valid
Marc Kleine-Budde [Mon, 9 Aug 2021 19:07:14 +0000 (21:07 +0200)]
can: dev: register_candev(): ensure that bittiming const are valid

Implement the function can_bittiming_const_valid() to check the
validity of the specified bit timing constant. Call this function from
register_candev() to check the bit timing constants during the
registration of the CAN interface.

Link: https://lore.kernel.org/all/20230202110854.2318594-6-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_get_bittiming(): use direct return and remove unneeded else
Marc Kleine-Budde [Tue, 27 Sep 2022 15:14:09 +0000 (17:14 +0200)]
can: bittiming: can_get_bittiming(): use direct return and remove unneeded else

Clean up the code flow a bit, don't assign err variable but directly
return. Remove the unneeded else, too.

Link: https://lore.kernel.org/all/20230202110854.2318594-5-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_fixup_bittiming(): set effective tq
Marc Kleine-Budde [Wed, 1 Feb 2023 16:18:56 +0000 (17:18 +0100)]
can: bittiming: can_fixup_bittiming(): set effective tq

The can_fixup_bittiming() function is used to validate the
user-supplied low-level bit timing parameters and calculate the
bitrate prescaler (brp) from the requested time quanta (tq) and the
CAN clock of the controller.

can_fixup_bittiming() selects the best matching integer bit rate
prescaler, which may result in a different time quantum than the value
specified by the user.

Calculate the resulting time quantum and assign it so that the user
sees the effective time quantum.

Link: https://lore.kernel.org/all/20230202110854.2318594-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
Marc Kleine-Budde [Tue, 31 Jan 2023 16:37:39 +0000 (17:37 +0100)]
can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1

Commit 1c47fa6b31c2 ("can: dev: add a helper function to calculate the
duration of one bit") made the constant CAN_SYNC_SEG available in a
header file.

The magic number 1 in can_fixup_bittiming() represents the width of
the sync segment, replace it by CAN_SYNC_SEG to make the code more
readable.

Link: https://lore.kernel.org/all/20230202110854.2318594-3-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agocan: bittiming(): replace open coded variants of can_bit_time()
Marc Kleine-Budde [Tue, 31 Jan 2023 16:11:55 +0000 (17:11 +0100)]
can: bittiming(): replace open coded variants of can_bit_time()

Commit 1c47fa6b31c2 ("can: dev: add a helper function to calculate the
duration of one bit") added the helper function can_bit_time().

Replace open coded variants of can_bit_time() by the helper function.

Link: https://lore.kernel.org/all/20230202110854.2318594-2-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
17 months agoMerge branch 'tuntap-socket-uid'
David S. Miller [Mon, 6 Feb 2023 10:16:55 +0000 (10:16 +0000)]
Merge branch 'tuntap-socket-uid'

Pietro Borrello says:

====================
tuntap: correctly initialize socket uid

sock_init_data() assumes that the `struct socket` passed in input is
contained in a `struct socket_alloc` allocated with sock_alloc().
However, tap_open() and tun_chr_open() pass a `struct socket` embedded
in a `struct tap_queue` and `struct tun_file` respectively, both
allocated with sk_alloc().
This causes a type confusion when issuing a container_of() with
SOCK_INODE() in sock_init_data() which results in assigning a wrong
sk_uid to the `struct sock` in input.

Due to the type confusion, both sockets happen to have their uid set
to 0, i.e. root.
While it will be often correct, as tuntap devices require
CAP_NET_ADMIN, it may not always be the case.
Not sure how widespread is the impact of this, it seems the socket uid
may be used for network filtering and routing, thus tuntap sockets may
be incorrectly managed.
Additionally, it seems a socket with an incorrect uid may be returned
to the vhost driver when issuing a get_socket() on a tuntap device in
vhost_net_set_backend().

Fix the bugs by adding and using sock_init_data_uid(), which
explicitly takes a uid as argument.

Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
---
Changes in v3:
- Fix the bug by defining and using sock_init_data_uid()
- Link to v2: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v2-0-29ec15592813@diag.uniroma1.it

Changes in v2:
- Shorten and format comments
- Link to v1: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v1-0-af4f9f40979d@diag.uniroma1.it
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agotap: tap_open(): correctly initialize socket uid
Pietro Borrello [Sat, 4 Feb 2023 17:39:22 +0000 (17:39 +0000)]
tap: tap_open(): correctly initialize socket uid

sock_init_data() assumes that the `struct socket` passed in input is
contained in a `struct socket_alloc` allocated with sock_alloc().
However, tap_open() passes a `struct socket` embedded in a `struct
tap_queue` allocated with sk_alloc().
This causes a type confusion when issuing a container_of() with
SOCK_INODE() in sock_init_data() which results in assigning a wrong
sk_uid to the `struct sock` in input.
On default configuration, the type confused field overlaps with
padding bytes between `int vnet_hdr_sz` and `struct tap_dev __rcu
*tap` in `struct tap_queue`, which makes the uid of all tap sockets 0,
i.e., the root one.
Fix the assignment by using sock_init_data_uid().

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agotun: tun_chr_open(): correctly initialize socket uid
Pietro Borrello [Sat, 4 Feb 2023 17:39:21 +0000 (17:39 +0000)]
tun: tun_chr_open(): correctly initialize socket uid

sock_init_data() assumes that the `struct socket` passed in input is
contained in a `struct socket_alloc` allocated with sock_alloc().
However, tun_chr_open() passes a `struct socket` embedded in a `struct
tun_file` allocated with sk_alloc().
This causes a type confusion when issuing a container_of() with
SOCK_INODE() in sock_init_data() which results in assigning a wrong
sk_uid to the `struct sock` in input.
On default configuration, the type confused field overlaps with the
high 4 bytes of `struct tun_struct __rcu *tun` of `struct tun_file`,
NULL at the time of call, which makes the uid of all tun sockets 0,
i.e., the root one.
Fix the assignment by using sock_init_data_uid().

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: add sock_init_data_uid()
Pietro Borrello [Sat, 4 Feb 2023 17:39:20 +0000 (17:39 +0000)]
net: add sock_init_data_uid()

Add sock_init_data_uid() to explicitly initialize the socket uid.
To initialise the socket uid, sock_init_data() assumes a the struct
socket* sock is always embedded in a struct socket_alloc, used to
access the corresponding inode uid. This may not be true.
Examples are sockets created in tun_chr_open() and tap_open().

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch 'ENETC-mqprio-taprio-cleanup'
David S. Miller [Mon, 6 Feb 2023 10:06:44 +0000 (10:06 +0000)]
Merge branch 'ENETC-mqprio-taprio-cleanup'

Vladimir Oltean says:

====================
net: ENETC mqprio/taprio cleanup

Please excuse the increased patch set size compared to v4's 15 patches,
but Claudiu stirred up the pot :) when he pointed out that the mqprio
TXQ validation procedure is still incorrect, so I had to fix that, and
then do some consolidation work so that taprio doesn't duplicate
mqprio's bugs. Compared to v4, 3 patches are new and 1 was dropped for now
("net/sched: taprio: mask off bits in gate mask that exceed number of TCs"),
since there's not really much to gain from it. Since the previous patch
set has largely been reviewed, I hope that a delta overview will help
and make up for the large size.

v4->v5:
- new patches:
  "[08/17] net/sched: mqprio: allow reverse TC:TXQ mappings"
  "[11/17] net/sched: taprio: centralize mqprio qopt validation"
  "[12/17] net/sched: refactor mqprio qopt reconstruction to a library function"
- changed patches worth revisiting:
  "[09/17] net/sched: mqprio: allow offloading drivers to request queue
  count validation"
v4 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230130173145.475943-1-vladimir.oltean@nxp.com/

v3->v4:
- adjusted patch 07/15 to not remove "#include <net/pkt_sched.h>" from
  ti cpsw
https://patchwork.kernel.org/project/netdevbpf/cover/20230127001516.592984-1-vladimir.oltean@nxp.com/

v2->v3:
- move min_num_stack_tx_queues definition so it doesn't conflict with
  the ethtool mm patches I haven't submitted yet for enetc (and also to
  make use of a 4 byte hole)
- warn and mask off excess TCs in gate mask instead of failing
- finally CC qdisc maintainers
v2 at:
https://patchwork.kernel.org/project/netdevbpf/patch/20230126125308.1199404-16-vladimir.oltean@nxp.com/

v1->v2:
- patches 1->4 are new
- update some header inclusions in drivers
- fix typo (said "taprio" instead of "mqprio")
- better enetc mqprio error handling
- dynamically reconstruct mqprio configuration in taprio offload
- also let stmmac and tsnep use per-TXQ gate_mask
v1 (RFC) at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230120141537.1350744-1-vladimir.oltean@nxp.com/

The main goal of this patch set is to make taprio pass the mqprio queue
configuration structure down to ndo_setup_tc() - patch 13/17. But mqprio
itself is not in the best shape currently, so there are some
consolidation patches on that as well.

Next, there are some consolidation patches in the enetc driver's
handling of TX queues and their traffic class assignment. Then, there is
a consolidation between the TX queue configuration for mqprio and
taprio.

Finally, there is a change in the meaning of the gate_mask passed by
taprio through ndo_setup_tc(). We introduce a capability through which
drivers can request the gate mask to be per TXQ. The default is changed
so that it is per TC.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: enetc: act upon mqprio queue config in taprio offload
Vladimir Oltean [Sat, 4 Feb 2023 13:53:07 +0000 (15:53 +0200)]
net: enetc: act upon mqprio queue config in taprio offload

We assume that the mqprio queue configuration from taprio has a simple
1:1 mapping between prio and traffic class, and one TX queue per TC.
That might not be the case. Actually parse and act upon the mqprio
config.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: enetc: act upon the requested mqprio queue configuration
Vladimir Oltean [Sat, 4 Feb 2023 13:53:06 +0000 (15:53 +0200)]
net: enetc: act upon the requested mqprio queue configuration

Regardless of the requested queue count per traffic class, the enetc
driver allocates a number of TX rings equal to the number of TCs, and
hardcodes a queue configuration of "1@0 1@1 ... 1@max-tc". Other
configurations are silently ignored and treated the same.

Improve that by allowing what the user requests to be actually
fulfilled. This allows more than one TX ring per traffic class.
For example:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 4 \
map 0 0 1 1 2 2 3 3 queues 2@0 2@2 2@4 2@6
[  146.267648] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  146.273451] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
[  146.283280] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 1
[  146.293987] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 1
[  146.300467] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 2
[  146.306866] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 2
[  146.313261] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 3
[  146.319622] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 3
$ tc qdisc del dev eno0 root
[  178.238418] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  178.244369] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
[  178.251486] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 0
[  178.258006] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 0
[  178.265038] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 0
[  178.271557] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 0
[  178.277910] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 0
[  178.284281] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 0
$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
[  186.113162] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  186.118764] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 1
[  186.124374] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 2
[  186.130765] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 3
[  186.136404] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 4
[  186.142049] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 5
[  186.147674] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 6
[  186.153305] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 7

The driver used to set TC_MQPRIO_HW_OFFLOAD_TCS, near which there is
this comment in the UAPI header:

        TC_MQPRIO_HW_OFFLOAD_TCS,       /* offload TCs, no queue counts */

which is what enetc was doing up until now (and no longer is; we offload
queue counts too), remove that assignment.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: enetc: request mqprio to validate the queue counts
Vladimir Oltean [Sat, 4 Feb 2023 13:53:05 +0000 (15:53 +0200)]
net: enetc: request mqprio to validate the queue counts

The enetc driver does not validate the mqprio queue configuration, so it
currently allows things like this:

$ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1

But also things like this, completely omitting the queue configuration:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 hw 1

By requesting validation via the mqprio capability structure, this is no
longer allowed, and we bring what is accepted by hardware in line with
what is accepted by software.

The check that num_tc <= real_num_tx_queues also becomes superfluous and
can be dropped, because mqprio_validate_queue_counts() validates that no
TXQ range exceeds real_num_tx_queues. That is a stronger check, because
there is at least 1 TXQ per TC, so there are at least as many TXQs as TCs.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw
Vladimir Oltean [Sat, 4 Feb 2023 13:53:04 +0000 (15:53 +0200)]
net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw

There are 2 classes of in-tree drivers currently:

- those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
  holds a bit mask of TXQs

- those who act upon the gate_mask as if it holds a bit mask of TCs

When it comes to the standard, IEEE 802.1Q-2018 does say this in the
second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:

| A gate control list associated with each Port contains an ordered list
| of gate operations. Each gate operation changes the transmission gate
| state for the gate associated with each of the Port's traffic class
| queues and allows associated control operations to be scheduled.

In typically obtuse language, it refers to a "traffic class queue"
rather than a "traffic class" or a "queue". But careful reading of
802.1Q clarifies that "traffic class" and "queue" are in fact
synonymous (see 8.6.6 Queuing frames):

| A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified.

i.o.w. their definition of "queue" isn't the Linux TX queue.

The gate_mask really is input into taprio via its UAPI as a mask of
traffic classes, but taprio_sched_to_offload() converts it into a TXQ
mask.

The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:

- hellcreek, felix, sja1105: these are DSA switches, it's not even very
  clear what TXQs correspond to, other than purely software constructs.
  Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
  So it's fine to convert these to a gate mask per TC.

- enetc: I have the hardware and can confirm that the gate mask is per
  TC, and affects all TXQs (BD rings) configured for that priority.

- igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
  to be per-TXQ.

- tsnep: Gerhard Engleder clarifies that even though this hardware
  supports at most 1 TXQ per TC, the TXQ indices may be different from
  the TC values themselves, and it is the TXQ indices that matter to
  this hardware. So keep it per-TXQ as well.

- stmmac: I have a GMAC datasheet, and in the EST section it does
  specify that the gate events are per TXQ rather than per TC.

- lan966x: again, this is a switch, and while not a DSA one, the way in
  which it implements lan966x_mqprio_add() - by only allowing num_tc ==
  NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
  software construct here as well. They seem to map 1:1 with TCs.

- am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
  impression that the fetch_allow variable is treated like a prio_mask.
  This definitely sounds closer to a per-TC gate mask rather than a
  per-TXQ one, and TI documentation does seem to recomment an identity
  mapping between TCs and TXQs. However, Roger Quadros would like to do
  some testing before making changes, so I'm leaving this driver to
  operate as it did before, for now. Link with more details at the end.

Based on this breakdown, we have 5 drivers with a gate mask per TC and
4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
opt-in and the gate mask per TC the default.

Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
query the device driver before calling the proper ndo_setup_tc(), and
figure out if it expects one or the other format.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()
Vladimir Oltean [Sat, 4 Feb 2023 13:53:03 +0000 (15:53 +0200)]
net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()

The taprio qdisc does not currently pass the mqprio queue configuration
down to the offloading device driver. So the driver cannot act upon the
TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
assumed that the driver only wants to offload num_tc (see
TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
but there's clearly more to the mqprio configuration than that.

I've considered 2 mechanisms to remedy that. First is to pass a struct
tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
TC_SETUP_QDISC_TAPRIO.

The difference is that in the first case, existing drivers (offloading
or not) all ignore taprio's mqprio portion currently, whereas in the
second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
based on a new capability. The question is which approach would be
better.

I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
on a taprio capability bit) would risk introducing regressions. For
example, taprio doesn't populate (or validate) qopt->hw, as well as
mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.

In comparison, adding a capability is functionally equivalent to just
passing the mqprio in a way that drivers can ignore it, except it's
slightly more complicated to use it (need to set the capability).

Ultimately, what made me go for the "mqprio in taprio" variant was that
it's easier for offloading drivers to interpret the mqprio qopt slightly
differently when it comes from taprio vs when it comes from mqprio,
should that ever become necessary.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: refactor mqprio qopt reconstruction to a library function
Vladimir Oltean [Sat, 4 Feb 2023 13:53:02 +0000 (15:53 +0200)]
net/sched: refactor mqprio qopt reconstruction to a library function

The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.

Refactor the code to a helper in the common mqprio library.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: taprio: centralize mqprio qopt validation
Vladimir Oltean [Sat, 4 Feb 2023 13:53:01 +0000 (15:53 +0200)]
net/sched: taprio: centralize mqprio qopt validation

There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.

Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.

In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.

The taprio code didn't enforce TXQ overlaps in txtime-assist mode and
that looks intentional, even if I've no idea why that might be. Preserve
that, but add a comment.

There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: mqprio: add extack messages for queue count validation
Vladimir Oltean [Sat, 4 Feb 2023 13:53:00 +0000 (15:53 +0200)]
net/sched: mqprio: add extack messages for queue count validation

To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong about the queue counts. This uses the
new support for printf-formatted extack messages.

Example:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: mqprio: allow offloading drivers to request queue count validation
Vladimir Oltean [Sat, 4 Feb 2023 13:52:59 +0000 (15:52 +0200)]
net/sched: mqprio: allow offloading drivers to request queue count validation

mqprio_parse_opt() proudly has a comment:

/* If hardware offload is requested we will leave it to the device
 * to either populate the queue counts itself or to validate the
 * provided queue counts.
 */

Unfortunately some device drivers did not get this memo, and don't
validate the queue counts, or populate them.

In case drivers don't want to populate the queue counts themselves, just
act upon the requested configuration, it makes sense to introduce a tc
capability, and make mqprio query it, so they don't have to do the
validation themselves.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: mqprio: allow reverse TC:TXQ mappings
Vladimir Oltean [Sat, 4 Feb 2023 13:52:58 +0000 (15:52 +0200)]
net/sched: mqprio: allow reverse TC:TXQ mappings

By imposing that the last TXQ of TC i is smaller than the first TXQ of
any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
the TXQ indices (they must increase as TCs increase).

Claudiu points out that the complexity of the TXQ count validation is
too high for this logic, i.e. instead of iterating over j, it is
sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
will eventually ensure global ordering.

This is true, however it doesn't appear to me that is what the code
really intended to do. Instead, based on the comments, it just wanted to
check for overlaps (and this isn't how one does that).

So the following mqprio configuration, which I had recommended to
Vinicius more than once for igb/igc (to account for the fact that on
this hardware, lower numbered TXQs have higher dequeue priority than
higher ones):

num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0

is in fact denied today by mqprio.

The full story is that in fact, it's only denied with "hw 0"; if
hardware offloading is requested, mqprio defers TXQ range overlap
validation to the device driver (a strange decision in itself).

This is most certainly a bug, but it's not one that has any merit for
being fixed on "stable" as far as I can tell. This is because mqprio
always rejected a configuration which was in fact valid, and this has
shaped the way in which mqprio configuration scripts got built for
various hardware (see igb/igc in the link below). Therefore, one could
consider it to be merely an improvement for mqprio to allow reverse
TC:TXQ mappings.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h
Vladimir Oltean [Sat, 4 Feb 2023 13:52:57 +0000 (15:52 +0200)]
net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h

Since mqprio is a scheduler and not a classifier, move its offload
structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.

Also update some header inclusions in drivers that access this
structure, to the best of my abilities.

Cc: Igor Russkikh <irusskikh@marvell.com>
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Lars Povlsen <lars.povlsen@microchip.com>
Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
Cc: Daniel Machon <daniel.machon@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: mqprio: refactor offloading and unoffloading to dedicated functions
Vladimir Oltean [Sat, 4 Feb 2023 13:52:56 +0000 (15:52 +0200)]
net/sched: mqprio: refactor offloading and unoffloading to dedicated functions

Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/sched: mqprio: refactor nlattr parsing to a separate function
Vladimir Oltean [Sat, 4 Feb 2023 13:52:55 +0000 (15:52 +0200)]
net/sched: mqprio: refactor nlattr parsing to a separate function

mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agogve: Fix gve interrupt names
Praveen Kaligineedi [Fri, 3 Feb 2023 21:20:45 +0000 (13:20 -0800)]
gve: Fix gve interrupt names

IRQs are currently requested before the netdevice is registered
and a proper name is assigned to the device. Changing interrupt
name to avoid using the format string in the name.

Interrupt name before change: eth%d-ntfy-block.<blk_id>
Interrupt name after change: gve-ntfy-blk<blk_id>@pci:<pci_name>

Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Jeroen de Borst <jeroendb@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
David S. Miller [Mon, 6 Feb 2023 10:02:17 +0000 (10:02 +0000)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
net: implement devlink reload in ice

Michal Swiatkowski says:

This is a part of changes done in patchset [0]. Resource management is
kind of controversial part, so I split it into two patchsets.

It is the first one, covering refactor and implement reload API call.
The refactor will unblock some of the patches needed by SIOV or
subfunction.

Most of this patchset is about implementing driver reload mechanism.
Part of code from probe and rebuild is used to not duplicate code.
To allow this reuse probe and rebuild path are split into smaller
functions.

Patch "ice: split ice_vsi_setup into smaller functions" changes
boolean variable in function call to integer and adds define
for it. Instead of having the function called with true/false now it
can be called with readable defines ICE_VSI_FLAG_INIT or
ICE_VSI_FLAG_NO_INIT. It was suggested by Jacob Keller and probably this
mechanism will be implemented across ice driver in follow up patchset.

Previously the code was reviewed here [0].

[0] https://lore.kernel.org/netdev/Y3ckRWtAtZU1BdXm@unreal/T/#m3bb8feba0a62f9b4cd54cd94917b7e2143fc2ecd

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: introduce skb_poison_list and use in kfree_skb_list
Jesper Dangaard Brouer [Fri, 3 Feb 2023 12:59:29 +0000 (13:59 +0100)]
net: introduce skb_poison_list and use in kfree_skb_list

First user of skb_poison_list is in kfree_skb_list_reason, to catch bugs
earlier like introduced in commit eedade12f4cb ("net: kfree_skb_list use
kmem_cache_free_bulk"). For completeness mentioned bug have been fixed in
commit f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list").

In case of a bug like mentioned commit we would have seen OOPS with:
 general protection fault, probably for non-canonical address 0xdead000000000870
And content of one the registers e.g. R13: dead000000000800

In this case skb->len is at offset 112 bytes (0x70) why fault happens at
 0x800+0x70 = 0x870

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agoMerge branch 'wangxun-interrupts'
David S. Miller [Mon, 6 Feb 2023 09:22:49 +0000 (09:22 +0000)]
Merge branch 'wangxun-interrupts'

Jiawen Wu says:

====================
Wangxun interrupt and RxTx support

Configure interrupt, setup RxTx ring, support to receive and transmit
packets.

change log:
v3:
- Use upper_32_bits() to avoid compile warning.
- Remove useless codes.
v2:
- Andrew Lunn: https://lore.kernel.org/netdev/Y86kDphvyHj21IxK@lunn.ch/
- Add a judgment when allocate dma for descriptor.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: ngbe: Support Rx and Tx process path
Mengyuan Lou [Fri, 3 Feb 2023 09:11:35 +0000 (17:11 +0800)]
net: ngbe: Support Rx and Tx process path

Add enable and disable operation process for ngbe open/close.
Clean Rx and Tx ring interrupts, process packets in the data path.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: txgbe: Support Rx and Tx process path
Jiawen Wu [Fri, 3 Feb 2023 09:11:34 +0000 (17:11 +0800)]
net: txgbe: Support Rx and Tx process path

Clean Rx and Tx ring interrupts, process packets in the data path.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: libwx: Add tx path to process packets
Mengyuan Lou [Fri, 3 Feb 2023 09:11:33 +0000 (17:11 +0800)]
net: libwx: Add tx path to process packets

Support to transmit packets without hardware features.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: libwx: Support to receive packets in NAPI
Jiawen Wu [Fri, 3 Feb 2023 09:11:32 +0000 (17:11 +0800)]
net: libwx: Support to receive packets in NAPI

Clean all queues associated with a q_vector, to simple receive packets
without hardware features.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet: txgbe: Setup Rx and Tx ring
Jiawen Wu [Fri, 3 Feb 2023 09:11:31 +0000 (17:11 +0800)]
net: txgbe: Setup Rx and Tx ring

Improve the configuration of Rx and Tx ring, set Rx flags and implement
ndo_set_rx_mode ops.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>