platform/kernel/linux-rpi.git
16 months agoipvs: align inner_mac_header for encapsulation
Terin Stock [Fri, 9 Jun 2023 20:58:42 +0000 (22:58 +0200)]
ipvs: align inner_mac_header for encapsulation

When using encapsulation the original packet's headers are copied to the
inner headers. This preserves the space for an inner mac header, which
is not used by the inner payloads for the encapsulation types supported
by IPVS. If a packet is using GUE or GRE encapsulation and needs to be
segmented, flow can be passed to __skb_udp_tunnel_segment() which
calculates a negative tunnel header length. A negative tunnel header
length causes pskb_may_pull() to fail, dropping the packet.

This can be observed by attaching probes to ip_vs_in_hook(),
__dev_queue_xmit(), and __skb_udp_tunnel_segment():

    perf probe --add '__dev_queue_xmit skb->inner_mac_header \
    skb->inner_network_header skb->mac_header skb->network_header'
    perf probe --add '__skb_udp_tunnel_segment:7 tnl_hlen'
    perf probe -m ip_vs --add 'ip_vs_in_hook skb->inner_mac_header \
    skb->inner_network_header skb->mac_header skb->network_header'

These probes the headers and tunnel header length for packets which
traverse the IPVS encapsulation path. A TCP packet can be forced into
the segmentation path by being smaller than a calculated clamped MSS,
but larger than the advertised MSS.

    probe:ip_vs_in_hook: inner_mac_header=0x0 inner_network_header=0x0 mac_header=0x44 network_header=0x52
    probe:ip_vs_in_hook: inner_mac_header=0x44 inner_network_header=0x52 mac_header=0x44 network_header=0x32
    probe:dev_queue_xmit: inner_mac_header=0x44 inner_network_header=0x52 mac_header=0x44 network_header=0x32
    probe:__skb_udp_tunnel_segment_L7: tnl_hlen=-2

When using veth-based encapsulation, the interfaces are set to be
mac-less, which does not preserve space for an inner mac header. This
prevents this issue from occurring.

In our real-world testing of sending a 32KB file we observed operation
time increasing from ~75ms for veth-based encapsulation to over 1.5s
using IPVS encapsulation due to retries from dropped packets.

This changeset modifies the packet on the encapsulation path in
ip_vs_tunnel_xmit() and ip_vs_tunnel_xmit_v6() to remove the inner mac
header offset. This fixes UDP segmentation for both encapsulation types,
and corrects the inner headers for any IPIP flows that may use it.

Fixes: 84c0d5e96f3a ("ipvs: allow tunneling with gue encapsulation")
Signed-off-by: Terin Stock <terin@cloudflare.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
16 months agoMerge tag 'mlx5-fixes-2023-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Mon, 19 Jun 2023 09:28:56 +0000 (10:28 +0100)]
Merge tag 'mlx5-fixes-2023-06-16' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-fixes-2023-06-16

This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agonet: qca_spi: Avoid high load if QCA7000 is not available
Stefan Wahren [Wed, 14 Jun 2023 21:06:56 +0000 (23:06 +0200)]
net: qca_spi: Avoid high load if QCA7000 is not available

In case the QCA7000 is not available via SPI (e.g. in reset),
the driver will cause a high load. The reason for this is
that the synchronization is never finished and schedule()
is never called. Since the synchronization is not timing
critical, it's safe to drop this from the scheduling condition.

Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Fixes: 291ab06ecf67 ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agonet: phy: Manual remove LEDs to ensure correct ordering
Andrew Lunn [Sat, 17 Jun 2023 15:55:00 +0000 (17:55 +0200)]
net: phy: Manual remove LEDs to ensure correct ordering

If the core is left to remove the LEDs via devm_, it is performed too
late, after the PHY driver is removed from the PHY. This results in
dereferencing a NULL pointer when the LED core tries to turn the LED
off before destroying the LED.

Manually unregister the LEDs at a safe point in phy_remove.

Cc: stable@vger.kernel.org
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agonfc: fdp: Add MODULE_FIRMWARE macros
Juerg Haefliger [Fri, 16 Jun 2023 12:22:18 +0000 (14:22 +0200)]
nfc: fdp: Add MODULE_FIRMWARE macros

The module loads firmware so add MODULE_FIRMWARE macros to provide that
information via modinfo.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agoieee802154/adf7242: Add MODULE_FIRMWARE macro
Juerg Haefliger [Fri, 16 Jun 2023 12:18:07 +0000 (14:18 +0200)]
ieee802154/adf7242: Add MODULE_FIRMWARE macro

The module loads firmware so add a MODULE_FIRMWARE macro to provide that
information via modinfo.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agosfc: use budget for TX completions
Íñigo Huguet [Thu, 15 Jun 2023 08:49:29 +0000 (10:49 +0200)]
sfc: use budget for TX completions

When running workloads heavy unbalanced towards TX (high TX, low RX
traffic), sfc driver can retain the CPU during too long times. Although
in many cases this is not enough to be visible, it can affect
performance and system responsiveness.

A way to reproduce it is to use a debug kernel and run some parallel
netperf TX tests. In some systems, this will lead to this message being
logged:
  kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s!

The reason is that sfc driver doesn't account any NAPI budget for the TX
completion events work. With high-TX/low-RX traffic, this makes that the
CPU is held for long time for NAPI poll.

Documentations says "drivers can process completions for any number of Tx
packets but should only process up to budget number of Rx packets".
However, many drivers do limit the amount of TX completions that they
process in a single NAPI poll.

In the same way, this patch adds a limit for the TX work in sfc. With
the patch applied, the watchdog warning never appears.

Tested with netperf in different combinations: single process / parallel
processes, TCP / UDP and different sizes of UDP messages. Repeated the
tests before and after the patch, without any noticeable difference in
network or CPU performance.

Test hardware:
Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core)
Solarflare Communications XtremeScale X2522-25G Network Adapter

Fixes: 5227ecccea2d ("sfc: remove tx and MCDI handling from NAPI budget consideration")
Fixes: d19a53721863 ("sfc_ef100: TX path for EF100 NICs")
Reported-by: Fei Liu <feliu@redhat.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20230615084929.10506-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/mlx5e: Fix scheduling of IPsec ASO query while in atomic
Leon Romanovsky [Mon, 5 Jun 2023 08:09:52 +0000 (11:09 +0300)]
net/mlx5e: Fix scheduling of IPsec ASO query while in atomic

ASO query can be scheduled in atomic context as such it can't use usleep.
Use udelay as recommended in Documentation/timers/timers-howto.rst.

Fixes: 76e463f6508b ("net/mlx5e: Overcome slow response for first IPsec ASO WQE")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: Drop XFRM state lock when modifying flow steering
Leon Romanovsky [Mon, 5 Jun 2023 08:09:51 +0000 (11:09 +0300)]
net/mlx5e: Drop XFRM state lock when modifying flow steering

XFRM state which is changed to be XFRM_STATE_EXPIRED doesn't really
need to hold lock while modifying flow steering rules to drop traffic.

That state can be deleted only and as such mlx5e_ipsec_handle_tx_limit()
work will be canceled anyway and won't run in parallel.

Fixes: b2f7b01d36a9 ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: Fix ESN update kernel panic
Patrisious Haddad [Mon, 5 Jun 2023 08:09:50 +0000 (11:09 +0300)]
net/mlx5e: Fix ESN update kernel panic

Previously during mlx5e_ipsec_handle_event the driver tried to execute
an operation that could sleep, while holding a spinlock, which caused
the kernel panic mentioned below.

Move the function call that can sleep outside of the spinlock context.

 Call Trace:
 <TASK>
 dump_stack_lvl+0x49/0x6c
 __schedule_bug.cold+0x42/0x4e
 schedule_debug.constprop.0+0xe0/0x118
 __schedule+0x59/0x58a
 ? __mod_timer+0x2a1/0x3ef
 schedule+0x5e/0xd4
 schedule_timeout+0x99/0x164
 ? __pfx_process_timeout+0x10/0x10
 __wait_for_common+0x90/0x1da
 ? __pfx_schedule_timeout+0x10/0x10
 wait_func+0x34/0x142 [mlx5_core]
 mlx5_cmd_invoke+0x1f3/0x313 [mlx5_core]
 cmd_exec+0x1fe/0x325 [mlx5_core]
 mlx5_cmd_do+0x22/0x50 [mlx5_core]
 mlx5_cmd_exec+0x1c/0x40 [mlx5_core]
 mlx5_modify_ipsec_obj+0xb2/0x17f [mlx5_core]
 mlx5e_ipsec_update_esn_state+0x69/0xf0 [mlx5_core]
 ? wake_affine+0x62/0x1f8
 mlx5e_ipsec_handle_event+0xb1/0xc0 [mlx5_core]
 process_one_work+0x1e2/0x3e6
 ? __pfx_worker_thread+0x10/0x10
 worker_thread+0x54/0x3ad
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xda/0x101
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x29/0x37
 </TASK>
 BUG: workqueue leaked lock or atomic: kworker/u256:4/0x7fffffff/189754#012     last function: mlx5e_ipsec_handle_event [mlx5_core]
 CPU: 66 PID: 189754 Comm: kworker/u256:4 Kdump: loaded Tainted: G        W          6.2.0-2596.20230309201517_5.el8uek.rc1.x86_64 #2
 Hardware name: Oracle Corporation ORACLE SERVER X9-2/ASMMBX9-2, BIOS 61070300 08/17/2022
 Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_event [mlx5_core]
 Call Trace:
 <TASK>
 dump_stack_lvl+0x49/0x6c
 process_one_work.cold+0x2b/0x3c
 ? __pfx_worker_thread+0x10/0x10
 worker_thread+0x54/0x3ad
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xda/0x101
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x29/0x37
 </TASK>
 BUG: scheduling while atomic: kworker/u256:4/189754/0x00000000

Fixes: cee137a63431 ("net/mlx5e: Handle ESN update events")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: Don't delay release of hardware objects
Leon Romanovsky [Mon, 5 Jun 2023 08:09:49 +0000 (11:09 +0300)]
net/mlx5e: Don't delay release of hardware objects

XFRM core provides two callbacks to release resources, one is .xdo_dev_policy_delete()
and another is .xdo_dev_policy_free(). This separation allows delayed release so
"ip xfrm policy free" commands won't starve. Unfortunately, mlx5 command interface
can't run in .xdo_dev_policy_free() callbacks as the latter runs in ATOMIC context.

 BUG: scheduling while atomic: swapper/7/0/0x00000100
 Modules linked in: act_mirred act_tunnel_key cls_flower sch_ingress vxlan mlx5_vdpa vringh vhost_iotlb vdpa rpcrdma rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.3.0+ #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x33/0x50
  __schedule_bug+0x4e/0x60
  __schedule+0x5d5/0x780
  ? __mod_timer+0x286/0x3d0
  schedule+0x50/0x90
  schedule_timeout+0x7c/0xf0
  ? __bpf_trace_tick_stop+0x10/0x10
  __wait_for_common+0x88/0x190
  ? usleep_range_state+0x90/0x90
  cmd_exec+0x42e/0xb40 [mlx5_core]
  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
  mlx5_cmd_exec+0x18/0x30 [mlx5_core]
  mlx5_cmd_delete_fte+0xa8/0xd0 [mlx5_core]
  del_hw_fte+0x60/0x120 [mlx5_core]
  mlx5_del_flow_rules+0xec/0x270 [mlx5_core]
  ? default_send_IPI_single_phys+0x26/0x30
  mlx5e_accel_ipsec_fs_del_pol+0x1a/0x60 [mlx5_core]
  mlx5e_xfrm_free_policy+0x15/0x20 [mlx5_core]
  xfrm_policy_destroy+0x5a/0xb0
  xfrm4_dst_destroy+0x7b/0x100
  dst_destroy+0x37/0x120
  rcu_core+0x2d6/0x540
  __do_softirq+0xcd/0x273
  irq_exit_rcu+0x82/0xb0
  sysvec_apic_timer_interrupt+0x72/0x90
  </IRQ>
  <TASK>
  asm_sysvec_apic_timer_interrupt+0x16/0x20
 RIP: 0010:default_idle+0x13/0x20
 Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 8b 05 7a 4d ee 00 85 c0 7e 07 0f 00 2d 2f 98 2e 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 40 b4 02 00
 RSP: 0018:ffff888100843ee0 EFLAGS: 00000242
 RAX: 0000000000000001 RBX: ffff888100812b00 RCX: 4000000000000000
 RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000002d2ec
 RBP: 0000000000000007 R08: 00000021daeded59 R09: 0000000000000001
 R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  default_idle_call+0x30/0xb0
  do_idle+0x1c1/0x1d0
  cpu_startup_entry+0x19/0x20
  start_secondary+0xfe/0x120
  secondary_startup_64_no_verify+0xf3/0xfb
  </TASK>
 bad: scheduling from the idle thread!

Fixes: a5b8ca9471d3 ("net/mlx5e: Add XFRM policy offload logic")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5: Free IRQ rmap and notifier on kernel shutdown
Saeed Mahameed [Thu, 8 Jun 2023 19:00:54 +0000 (12:00 -0700)]
net/mlx5: Free IRQ rmap and notifier on kernel shutdown

The kernel IRQ system needs the irq affinity notifier to be clear
before attempting to free the irq, see WARN_ON log below.

On a normal driver unload we don't have this issue since we do the
complete cleanup of the irq resources.

To fix this, put the important resources cleanup in a helper function
and use it in both normal driver unload and shutdown flows.

[ 4497.498434] ------------[ cut here ]------------
[ 4497.498726] WARNING: CPU: 0 PID: 9 at kernel/irq/manage.c:2034 free_irq+0x295/0x340
[ 4497.499193] Modules linked in:
[ 4497.499386] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.4.0-rc4+ #10
[ 4497.499876] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 4497.500518] Workqueue: events do_poweroff
[ 4497.500849] RIP: 0010:free_irq+0x295/0x340
[ 4497.501132] Code: 85 c0 0f 84 1d ff ff ff 48 89 ef ff d0 0f 1f 00 e9 10 ff ff ff 0f 0b e9 72 ff ff ff 49 8d 7f 28 ff d0 0f 1f 00 e9 df fd ff ff <0f> 0b 48 c7 80 c0 008
[ 4497.502269] RSP: 0018:ffffc90000053da0 EFLAGS: 00010282
[ 4497.502589] RAX: ffff888100949600 RBX: ffff88810330b948 RCX: 0000000000000000
[ 4497.503035] RDX: ffff888100949600 RSI: ffff888100400490 RDI: 0000000000000023
[ 4497.503472] RBP: ffff88810330c7e0 R08: ffff8881004005d0 R09: ffffffff8273a260
[ 4497.503923] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881009ae000
[ 4497.504359] R13: ffff8881009ae148 R14: 0000000000000000 R15: ffff888100949600
[ 4497.504804] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[ 4497.505302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4497.505671] CR2: 00007fce98806298 CR3: 000000000262e005 CR4: 0000000000370ef0
[ 4497.506104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4497.506540] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4497.507002] Call Trace:
[ 4497.507158]  <TASK>
[ 4497.507299]  ? free_irq+0x295/0x340
[ 4497.507522]  ? __warn+0x7c/0x130
[ 4497.507740]  ? free_irq+0x295/0x340
[ 4497.507963]  ? report_bug+0x171/0x1a0
[ 4497.508197]  ? handle_bug+0x3c/0x70
[ 4497.508417]  ? exc_invalid_op+0x17/0x70
[ 4497.508662]  ? asm_exc_invalid_op+0x1a/0x20
[ 4497.508926]  ? free_irq+0x295/0x340
[ 4497.509146]  mlx5_irq_pool_free_irqs+0x48/0x90
[ 4497.509421]  mlx5_irq_table_free_irqs+0x38/0x50
[ 4497.509714]  mlx5_core_eq_free_irqs+0x27/0x40
[ 4497.509984]  shutdown+0x7b/0x100
[ 4497.510184]  pci_device_shutdown+0x30/0x60
[ 4497.510440]  device_shutdown+0x14d/0x240
[ 4497.510698]  kernel_power_off+0x30/0x70
[ 4497.510938]  process_one_work+0x1e6/0x3e0
[ 4497.511183]  worker_thread+0x49/0x3b0
[ 4497.511407]  ? __pfx_worker_thread+0x10/0x10
[ 4497.511679]  kthread+0xe0/0x110
[ 4497.511879]  ? __pfx_kthread+0x10/0x10
[ 4497.512114]  ret_from_fork+0x29/0x50
[ 4497.512342]  </TASK>

Fixes: 9c2d08010963 ("net/mlx5: Free irqs only on shutdown callback")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
16 months agonet/mlx5: DR, Fix wrong action data allocation in decap action
Yevgeny Kliteynik [Sun, 4 Jun 2023 18:07:04 +0000 (21:07 +0300)]
net/mlx5: DR, Fix wrong action data allocation in decap action

When TUNNEL_L3_TO_L2 decap action was created, a pointer to a local
variable was passed as its HW action data, resulting in attempt to
free invalid address:

  BUG: KASAN: invalid-free in mlx5dr_action_destroy+0x318/0x410 [mlx5_core]

Fixes: 4781df92f4da ("net/mlx5: DR, Move STEv0 modify header logic")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5: DR, Support SW created encap actions for FW table
Yevgeny Kliteynik [Sun, 11 Jun 2023 22:05:48 +0000 (01:05 +0300)]
net/mlx5: DR, Support SW created encap actions for FW table

In some cases, steering might need to use SW-created action in
FW table, which results in wrong packet reformat being used:

  mlx5_core 0000:81:00.1: mlx5_cmd_check:756:(pid 1154):
      SET_FLOW_TABLE_ENTRY(0×936) op_mod(0×0) failed,
      status bad resource(0×5), syndrome (0xf2ff71)

This patch adds support for usage of SW-created packet reformat (encap)
actions in FW tables, and adds clear error flow for attempt to use
SW-created modify header on FW tables.

Fixes: 6a48faeeca10 ("net/mlx5: Add direct rule fs_cmd implementation")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: TC, Cleanup ct resources for nic flow
Chris Mi [Thu, 6 Apr 2023 06:38:09 +0000 (09:38 +0300)]
net/mlx5e: TC, Cleanup ct resources for nic flow

The cited commit removes special handling of CT action. But it
removes too much. Pre ct/ct_nat tables and some other resources
are not destroyed due to the cited commit.

Fix it by adding it back.

Fixes: 08fe94ec5f77 ("net/mlx5e: TC, Remove special handling of CT action")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: TC, Add null pointer check for hardware miss support
Chris Mi [Tue, 11 Apr 2023 08:17:35 +0000 (11:17 +0300)]
net/mlx5e: TC, Add null pointer check for hardware miss support

The cited commits add hardware miss support to tc action. But if
the rules can't be offloaded, the pointers are null and system
will panic when accessing them.

Fix it by checking null pointer.

Fixes: 08fe94ec5f77 ("net/mlx5e: TC, Remove special handling of CT action")
Fixes: 6702782845a5 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5: Fix driver load with single msix vector
Eli Cohen [Wed, 3 May 2023 12:10:05 +0000 (15:10 +0300)]
net/mlx5: Fix driver load with single msix vector

When a PCI device has just one msix vector available, we want to share
this vector between async and completion events. Current code fails to
do that assuming it will always have at least one dedicated vector for
completion events. Fix this by detecting when the pool contains just a
single vector.

Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: xsk: Set napi_id to support busy polling on XSK RQ
Maxim Mikityanskiy [Wed, 14 Jun 2023 09:00:06 +0000 (12:00 +0300)]
net/mlx5e: xsk: Set napi_id to support busy polling on XSK RQ

The cited commit missed setting napi_id on XSK RQs, it only affected
regular RQs. Add the missing part to support socket busy polling on XSK
RQs.

Fixes: a2740f529da2 ("net/mlx5e: xsk: Set napi_id to support busy polling")
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agonet/mlx5e: XDP, Allow growing tail for XDP multi buffer
Maxim Mikityanskiy [Wed, 14 Jun 2023 09:00:05 +0000 (12:00 +0300)]
net/mlx5e: XDP, Allow growing tail for XDP multi buffer

The cited commits missed passing frag_size to __xdp_rxq_info_reg, which
is required by bpf_xdp_adjust_tail to support growing the tail pointer
in fragmented packets. Pass the missing parameter when the current RQ
mode allows XDP multi buffer.

Fixes: ea5d49bdae8b ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ")
Fixes: 9cb9482ef10e ("net/mlx5e: Use fragments of the same size in non-linear legacy RQ with XDP")
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
16 months agoMerge branch 'check-if-fips-mode-is-enabled-when-running-selftests'
Jakub Kicinski [Fri, 16 Jun 2023 05:24:03 +0000 (22:24 -0700)]
Merge branch 'check-if-fips-mode-is-enabled-when-running-selftests'

Magali Lemes says:

====================
Check if FIPS mode is enabled when running selftests

Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests
that rely on cryptographic functions to work and use non-compliant FIPS
algorithms fail in FIPS mode.

In order to allow these tests to pass in a wider set of kernels,
 - for net/tls, skip the test variants that use the ChaCha20-Poly1305
and SM4 algorithms, when FIPS mode is enabled;
 - for net/fcnal-test, skip the MD5 tests, when FIPS mode is enabled;
 - for net/vrf-xfrm-tests, replace the algorithms that are not
FIPS-compliant with compliant ones.

v1: https://lore.kernel.org/netdev/20230607174302.19542-1-magali.lemes@canonical.com/
v2: https://lore.kernel.org/netdev/20230609164324.497813-1-magali.lemes@canonical.com/
v3: https://lore.kernel.org/netdev/20230612125107.73795-1-magali.lemes@canonical.com/
====================

Link: https://lore.kernel.org/r/20230613123222.631897-1-magali.lemes@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: net: fcnal-test: check if FIPS mode is enabled
Magali Lemes [Tue, 13 Jun 2023 12:32:22 +0000 (09:32 -0300)]
selftests: net: fcnal-test: check if FIPS mode is enabled

There are some MD5 tests which fail when the kernel is in FIPS mode,
since MD5 is not FIPS compliant. Add a check and only run those tests
if FIPS mode is not enabled.

Fixes: f0bee1ebb5594 ("fcnal-test: Add TCP MD5 tests")
Fixes: 5cad8bce26e01 ("fcnal-test: Add TCP MD5 tests for VRF")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: net: vrf-xfrm-tests: change authentication and encryption algos
Magali Lemes [Tue, 13 Jun 2023 12:32:21 +0000 (09:32 -0300)]
selftests: net: vrf-xfrm-tests: change authentication and encryption algos

The vrf-xfrm-tests tests use the hmac(md5) and cbc(des3_ede)
algorithms for performing authentication and encryption, respectively.
This causes the tests to fail when fips=1 is set, since these algorithms
are not allowed in FIPS mode. Therefore, switch from hmac(md5) and
cbc(des3_ede) to hmac(sha1) and cbc(aes), which are FIPS compliant.

Fixes: 3f251d741150 ("selftests: Add tests for vrf and xfrms")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: net: tls: check if FIPS mode is enabled
Magali Lemes [Tue, 13 Jun 2023 12:32:20 +0000 (09:32 -0300)]
selftests: net: tls: check if FIPS mode is enabled

TLS selftests use the ChaCha20-Poly1305 and SM4 algorithms, which are not
FIPS compliant. When fips=1, this set of tests fails. Add a check and only
run these tests if not in FIPS mode.

Fixes: 4f336e88a870 ("selftests/tls: add CHACHA20-POLY1305 to tls selftests")
Fixes: e506342a03c7 ("selftests/tls: add SM4 GCM/CCM to tls selftests")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/harness: allow tests to be skipped during setup
Magali Lemes [Tue, 13 Jun 2023 12:32:19 +0000 (09:32 -0300)]
selftests/harness: allow tests to be skipped during setup

Before executing each test from a fixture, FIXTURE_SETUP is run once.
When SKIP is used in FIXTURE_SETUP, the setup function returns early
but the test still proceeds to run, unless another SKIP macro is used
within the test definition, leading to some code repetition. Therefore,
allow tests to be skipped directly from the setup function.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoMerge tag 'net-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Fri, 16 Jun 2023 04:11:17 +0000 (21:11 -0700)]
Merge tag 'net-6.4-rc7' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless, and netfilter.

  Selftests excluded - we have 58 patches and diff of +442/-199, which
  isn't really small but perhaps with the exception of the WiFi locking
  change it's old(ish) bugs.

  We have no known problems with v6.4.

  The selftest changes are rather large as MPTCP folks try to apply
  Greg's guidance that selftest from torvalds/linux should be able to
  run against stable kernels.

  Last thing I should call out is the DCCP/UDP-lite deprecation notices.
  We are fairly sure those are dead, but if we're wrong reverting them
  back in won't be fun.

  Current release - regressions:

   - wifi:
      - cfg80211: fix double lock bug in reg_wdev_chan_valid()
      - iwlwifi: mvm: spin_lock_bh() to fix lockdep regression

  Current release - new code bugs:

   - handshake: remove fput() that causes use-after-free

  Previous releases - regressions:

   - sched: cls_u32: fix reference counter leak leading to overflow

   - sched: cls_api: fix lockup on flushing explicitly created chain

  Previous releases - always broken:

   - nf_tables: integrate pipapo into commit protocol

   - nf_tables: incorrect error path handling with NFT_MSG_NEWRULE, fix
     dangling pointer on failure

   - ping6: fix send to link-local addresses with VRF

   - sched: act_pedit: parse L3 header for L4 offset, the skb may not
     have the offset saved

   - sched: act_ct: fix promotion of offloaded unreplied tuple

   - sched: refuse to destroy an ingress and clsact Qdiscs if there are
     lockless change operations in flight

   - wifi: mac80211: fix handful of bugs in multi-link operation

   - ipvlan: fix bound dev checking for IPv6 l3s mode

   - eth: enetc: correct the indexes of highest and 2nd highest TCs

   - eth: ice: fix XDP memory leak when NIC is brought up and down

  Misc:

   - add deprecation notices for UDP-lite and DCCP

   - selftests: mptcp: skip tests not supported by old kernels

   - sctp: handle invalid error codes without calling BUG()"

* tag 'net-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  dccp: Print deprecation notice.
  udplite: Print deprecation notice.
  octeon_ep: Add missing check for ioremap
  selftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET
  net: ethernet: stmicro: stmmac: fix possible memory leak in __stmmac_open
  net: tipc: resize nlattr array to correct size
  sfc: fix XDP queues mode with legacy IRQ
  net: macsec: fix double free of percpu stats
  net: lapbether: only support ethernet devices
  MAINTAINERS: add reviewers for SMC Sockets
  s390/ism: Fix trying to free already-freed IRQ by repeated ism_dev_exit()
  net: dsa: felix: fix taprio guard band overflow at 10Mbps with jumbo frames
  net/sched: cls_api: Fix lockup on flushing explicitly created chain
  ice: Fix ice module unload
  net/handshake: remove fput() that causes use-after-free
  selftests: forwarding: hw_stats_l3: Set addrgenmode in a separate step
  net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
  net/sched: act_ct: Fix promotion of offloaded unreplied tuple
  wifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression
  ...

16 months agoMerge tag 'loongarch-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 16 Jun 2023 03:56:39 +0000 (20:56 -0700)]
Merge tag 'loongarch-fixes-6.4-1' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
 "Some trivial bug fixes for v6.4-rc7"

* tag 'loongarch-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: Fix debugfs_create_dir() error checking
  LoongArch: Avoid uninitialized alignment_mask
  LoongArch: Fix perf event id calculation
  LoongArch: Fix the write_fcsr() macro
  LoongArch: Let pmd_present() return true when splitting pmd

16 months agoMerge tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device...
Linus Torvalds [Fri, 16 Jun 2023 03:19:21 +0000 (20:19 -0700)]
Merge tag 'for-6.4/dm-fixes' of git://git./linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM thinp discard performance regression introduced during this
   merge window where DM core was splitting large discards every 128K
   (max_sectors_kb) rather than every 64M (discard_max_bytes).

 - Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
   between do_mount and dm's do_suspend (in addition to the earlier
   fix's do_mount race with dm's do_resume).

 - Fix DM thin metadata operations to first check if the thin-pool is in
   "fail_io" mode; otherwise UAF can occur.

 - Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
   than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
   from bio_alloc).

* tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: use op specific max_sectors when splitting abnormal io
  dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
  dm thin metadata: check fail_io before using data_sm
  dm: don't lock fs when the map is NULL during suspend or resume

16 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Fri, 16 Jun 2023 03:13:56 +0000 (20:13 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "This is an unusually large bunch of bug fixes for the later rc cycle,
  rxe and mlx5 both dumped a lot of things at once. rxe continues to fix
  itself, and mlx5 is fixing a bunch of "queue counters" related bugs.

  There is one highly notable bug fix regarding the qkey. This small
  security check was missed in the original 2005 implementation and it
  allows some significant issues.

  Summary:

   - Two rtrs bug fixes for error unwind bugs

   - Several rxe bug fixes:
      * Incorrect Rx packet validation
      * Using memory without a refcount
      * Syzkaller found use before initialization
      * Regression fix for missing locking with the tasklet conversion
        from this merge window

   - Have bnxt report the correct link properties to userspace, this was
     a regression in v6.3

   - Several mlx5 bug fixes:
      * Kernel crash triggerable by userspace for the RAW ethernet
        profile
      * Defend against steering refcounting issues created by userspace
      * Incorrect change of QP port affinity parameters in some LAG
        configurations

   - Fix mlx5 Q counters:
      * Do not over allocate Q counters to allow userspace to use the
        full port capacity
      * Kernel crash triggered by eswitch due to mis-use of Q counters
      * Incorrect mlx5_device for Q counters in some LAG configurations

   - Properly implement the IBA spec restricting privileged qkeys to
     root

   - Always an error when reading from a disassociated device's event
     queue

   - isert bug fixes:
      * Avoid a deadlock with the CM handler and CM ID destruction
      * Correct list corruption due to incorrect locking
      * Fix a use after free around connection tear down"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/rxe: Fix rxe_cq_post
  IB/isert: Fix incorrect release of isert connection
  IB/isert: Fix possible list corruption in CMA handler
  IB/isert: Fix dead lock in ib_isert
  RDMA/mlx5: Fix affinity assignment
  IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
  RDMA/uverbs: Restrict usage of privileged QKEYs
  RDMA/cma: Always set static rate to 0 for RoCE
  RDMA/mlx5: Fix Q-counters query in LAG mode
  RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
  RDMA/mlx5: Fix Q-counters per vport allocation
  RDMA/mlx5: Create an indirect flow table for steering anchor
  RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
  RDMA/rxe: Fix the use-before-initialization error of resp_pkts
  RDMA/bnxt_re: Fix reporting active_{speed,width} attributes
  RDMA/rxe: Fix ref count error in check_rkey()
  RDMA/rxe: Fix packet length checks
  RDMA/rtrs: Fix rxe_dealloc_pd warning
  RDMA/rtrs: Fix the last iu->buf leak in err path

16 months agoMerge tag 'spi-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Linus Torvalds [Fri, 16 Jun 2023 03:03:15 +0000 (20:03 -0700)]
Merge tag 'spi-fix-v6.4-rc6' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A few more driver specific fixes.

  The DesignWare fix is for an issue introduced by conversion to the
  chip select accessor functions and is pretty important but the other
  two are less severe"

* tag 'spi-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: dw: Replace incorrect spi_get_chipselect with set
  spi: fsl-dspi: avoid SCK glitches with continuous transfers
  spi: cadence-quadspi: Add missing check for dma_set_mask

16 months agoMerge tag 'regulator-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 16 Jun 2023 02:54:58 +0000 (19:54 -0700)]
Merge tag 'regulator-fix-v6.4-rc6' of git://git./linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "The set of regulators described for the Qualcomm PM8550 just seems to
  have been completely wrong and would likely not have worked at all if
  anything tried to actually configure anything except for enabling and
  disabling at runtime"

* tag 'regulator-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: qcom-rpmh: Fix regulators for PM8550

16 months agoMerge tag 'regmap-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 16 Jun 2023 02:50:57 +0000 (19:50 -0700)]
Merge tag 'regmap-fix-v6.4-rc6' of git://git./linux/kernel/git/broonie/regmap

Pull regmap fix from Mark Brown:
 "Another fix for the maple tree cache, Takashi noticed that unlike
  other caches the maple tree cache didn't check for read only registers
  before trying to sync which would result in spurious syncs for read
  only registers where we don't have a default.

  This was due to the check being open coded in the caches, we now check
  in the shared 'does this register need sync' function so that is fixed
  for this and future caches"

* tag 'regmap-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: regcache: Don't sync read-only registers

16 months agoMerge tag 'media/v6.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Fri, 16 Jun 2023 02:13:45 +0000 (19:13 -0700)]
Merge tag 'media/v6.4-6' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
 "A fix for dvb-core to avoid a race condition during DVB board
  registration"

* tag 'media/v6.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  Revert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"

16 months agoMerge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 15 Jun 2023 22:40:58 +0000 (15:40 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4, one report by syzkaller[1], and reported
  by multiple users (and tracked by regzbot[2])"

[1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
[2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: drop the call to ext4_error() from ext4_get_group_info()
  Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"

16 months agoMerge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Thu, 15 Jun 2023 22:24:33 +0000 (15:24 -0700)]
Merge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Eight, mostly small, smb3 client fixes:

   - important fix for deferred close oops (race with unmount) found
     with xfstest generic/098 to some servers

   - important reconnect fix

   - fix problem with max_credits mount option

   - two multichannel (interface related) fixes

   - one trivial removal of confusing comment

   - two small debugging improvements (to better spot crediting
     problems)"

* tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add a warning when the in-flight count goes negative
  cifs: fix lease break oops in xfstest generic/098
  cifs: fix max_credits implementation
  cifs: fix sockaddr comparison in iface_cmp
  smb/client: print "Unknown" instead of bogus link speed value
  cifs: print all credit counters in DebugData
  cifs: fix status checks in cifs_tree_connect
  smb: remove obsolete comment

16 months agoMerge branch 'udplite-dccp-print-deprecation-notice'
Jakub Kicinski [Thu, 15 Jun 2023 22:09:00 +0000 (15:09 -0700)]
Merge branch 'udplite-dccp-print-deprecation-notice'

Kuniyuki Iwashima says:

====================
udplite/dccp: Print deprecation notice.

UDP-Lite is assumed to have no users for 7 years, and DCCP is
orphaned for 7 years too.

Let's add deprecation notice and see if anyone responds to it.
====================

Link: https://lore.kernel.org/r/20230614194705.90673-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agodccp: Print deprecation notice.
Kuniyuki Iwashima [Wed, 14 Jun 2023 19:47:05 +0000 (12:47 -0700)]
dccp: Print deprecation notice.

DCCP was marked as Orphan in the MAINTAINERS entry 2 years ago in commit
054c4610bd05 ("MAINTAINERS: dccp: move Gerrit Renker to CREDITS").  It says
we haven't heard from the maintainer for five years, so DCCP is not well
maintained for 7 years now.

Recently DCCP only receives updates for bugs, and major distros disable it
by default.

Removing DCCP would allow for better organisation of TCP fields to reduce
the number of cache lines hit in the fast path.

Let's add a deprecation notice when DCCP socket is created and schedule its
removal to 2025.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoudplite: Print deprecation notice.
Kuniyuki Iwashima [Wed, 14 Jun 2023 19:47:04 +0000 (12:47 -0700)]
udplite: Print deprecation notice.

Recently syzkaller reported a 7-year-old null-ptr-deref [0] that occurs
when a UDP-Lite socket tries to allocate a buffer under memory pressure.

Someone should have stumbled on the bug much earlier if UDP-Lite had been
used in a real app.  Also, we do not always need a large UDP-Lite workload
to hit the bug since UDP and UDP-Lite share the same memory accounting
limit.

Removing UDP-Lite would simplify UDP code removing a bunch of conditionals
in fast path.

Let's add a deprecation notice when UDP-Lite socket is created and schedule
its removal to 2025.

Link: https://lore.kernel.org/netdev/20230523163305.66466-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoocteon_ep: Add missing check for ioremap
Jiasheng Jiang [Thu, 15 Jun 2023 03:34:00 +0000 (11:34 +0800)]
octeon_ep: Add missing check for ioremap

Add check for ioremap() and return the error if it fails in order to
guarantee the success of ioremap().

Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20230615033400.2971-1-jiasheng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET
Alex Maftei [Thu, 15 Jun 2023 08:34:04 +0000 (09:34 +0100)]
selftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET

Previously, timestamps were printed using "%lld.%u" which is incorrect
for nanosecond values lower than 100,000,000 as they're fractional
digits, therefore leading zeros are meaningful.

This patch changes the format strings to "%lld.%09u" in order to add
leading zeros to the nanosecond value.

Fixes: 568ebc5985f5 ("ptp: add the PTP_SYS_OFFSET ioctl to the testptp program")
Fixes: 4ec54f95736f ("ptp: Fix compiler warnings in the testptp utility")
Fixes: 6ab0e475f1f3 ("Documentation: fix misc. warnings")
Signed-off-by: Alex Maftei <alex.maftei@amd.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20230615083404.57112-1-alex.maftei@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet: ethernet: stmicro: stmmac: fix possible memory leak in __stmmac_open
Christian Marangi [Wed, 14 Jun 2023 09:17:14 +0000 (11:17 +0200)]
net: ethernet: stmicro: stmmac: fix possible memory leak in __stmmac_open

Fix a possible memory leak in __stmmac_open when stmmac_init_phy fails.
It's also needed to free everything allocated by stmmac_setup_dma_desc
and not just the dma_conf struct.

Drop free_dma_desc_resources from __stmmac_open and correctly call
free_dma_desc_resources on each user of __stmmac_open on error.

Reported-by: Jose Abreu <Jose.Abreu@synopsys.com>
Fixes: ba39b344e924 ("net: ethernet: stmicro: stmmac: generate stmmac dma conf before open")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jose Abreu <Jose.Abreu@synopsys.com>
Link: https://lore.kernel.org/r/20230614091714.15912-1-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet: tipc: resize nlattr array to correct size
Lin Ma [Wed, 14 Jun 2023 12:06:04 +0000 (20:06 +0800)]
net: tipc: resize nlattr array to correct size

According to nla_parse_nested_deprecated(), the tb[] is supposed to the
destination array with maxtype+1 elements. In current
tipc_nl_media_get() and __tipc_nl_media_set(), a larger array is used
which is unnecessary. This patch resize them to a proper size.

Fixes: 1e55417d8fc6 ("tipc: add media set to new netlink api")
Fixes: 46f15c6794fb ("tipc: add media get/dump to new netlink api")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Link: https://lore.kernel.org/r/20230614120604.1196377-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agodm: use op specific max_sectors when splitting abnormal io
Mike Snitzer [Thu, 15 Jun 2023 01:47:46 +0000 (21:47 -0400)]
dm: use op specific max_sectors when splitting abnormal io

Split abnormal IO in terms of the corresponding operation specific
max_sectors (max_discard_sectors, max_secure_erase_sectors or
max_write_zeroes_sectors).

This fixes a significant dm-thinp discard performance regression that
was introduced with commit e2dd8aca2d76 ("dm bio prison v1: improve
concurrent IO performance"). Relative to discard: max_discard_sectors
is used instead of max_sectors; which fixes excessive discard splitting
(e.g. max_sectors=128K vs max_discard_sectors=64M).

Tested by discarding an 1 Petabyte dm-thin device:
lvcreate -V 1125899906842624B -T test/pool -n thin
time blkdiscard /dev/test/thin

Before this fix (splitting discards every 128K): ~116m
 After this fix (splitting discards every 64M) : 0m33.460s

Reported-by: Zorro Lang <zlang@redhat.com>
Fixes: 06961c487a33 ("dm: split discards further if target sets max_discard_granularity")
Requires: 13f6facf3fae ("dm: allow targets to require splitting WRITE_ZEROES and SECURE_ERASE")
Fixes: e2dd8aca2d76 ("dm bio prison v1: improve concurrent IO performance")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
16 months agodm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
Mike Snitzer [Wed, 14 Jun 2023 00:05:34 +0000 (20:05 -0400)]
dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard

issue_discard() passes GFP_NOWAIT to __blkdev_issue_discard() despite
its code assuming bio_alloc() always succeeds.

Commit 3dba53a958a75 ("dm thin: use __blkdev_issue_discard for async
discard support") clearly shows where things went bad:

Before commit 3dba53a958a75, dm-thin.c's open-coded
__blkdev_issue_discard_async() properly handled using GFP_NOWAIT.
Unfortunately __blkdev_issue_discard() doesn't and it was missed
during review.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
16 months agodm thin metadata: check fail_io before using data_sm
Li Lingfeng [Tue, 6 Jun 2023 12:20:24 +0000 (20:20 +0800)]
dm thin metadata: check fail_io before using data_sm

Must check pmd->fail_io before using pmd->data_sm since
pmd->data_sm may be destroyed by other processes.

       P1(kworker)                             P2(message)
do_worker
 process_prepared
  process_prepared_discard_passdown_pt2
   dm_pool_dec_data_range
                                    pool_message
                                     commit
                                      dm_pool_commit_metadata
                                        ↓
                                       // commit failed
                                      metadata_operation_failed
                                       abort_transaction
                                        dm_pool_abort_metadata
                                         __open_or_format_metadata
                                           ↓
                                          dm_sm_disk_open
                                            ↓
                                           // open failed
                                           // pmd->data_sm is NULL
    dm_sm_dec_blocks
      ↓
     // try to access pmd->data_sm --> UAF

As shown above, if dm_pool_commit_metadata() and
dm_pool_abort_metadata() fail in pool_message process, kworker may
trigger UAF.

Fixes: be500ed721a6 ("dm space maps: improve performance with inc/dec on ranges of blocks")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
16 months agodm: don't lock fs when the map is NULL during suspend or resume
Li Lingfeng [Thu, 1 Jun 2023 06:14:23 +0000 (14:14 +0800)]
dm: don't lock fs when the map is NULL during suspend or resume

As described in commit 38d11da522aa ("dm: don't lock fs when the map is
NULL in process of resume"), a deadlock may be triggered between
do_resume() and do_mount().

This commit preserves the fix from commit 38d11da522aa but moves it to
where it also serves to fix a similar deadlock between do_suspend()
and do_mount().  It does so, if the active map is NULL, by clearing
DM_SUSPEND_LOCKFS_FLAG in dm_suspend() which is called by both
do_suspend() and do_resume().

Fixes: 38d11da522aa ("dm: don't lock fs when the map is NULL in process of resume")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
16 months agosfc: fix XDP queues mode with legacy IRQ
Íñigo Huguet [Tue, 13 Jun 2023 13:38:54 +0000 (15:38 +0200)]
sfc: fix XDP queues mode with legacy IRQ

In systems without MSI-X capabilities, xdp_txq_queues_mode is calculated
in efx_allocate_msix_channels, but when enabling MSI-X fails, it was not
changed to a proper default value. This was leading to the driver
thinking that it has dedicated XDP queues, when it didn't.

Fix it by setting xdp_txq_queues_mode to the correct value if the driver
fallbacks to MSI or legacy IRQ mode. The correct value is
EFX_XDP_TX_QUEUES_BORROWED because there are no XDP dedicated queues.

The issue can be easily visible if the kernel is started with pci=nomsi,
then a call trace is shown. It is not shown only with sfc's modparam
interrupt_mode=2. Call trace example:
 WARNING: CPU: 2 PID: 663 at drivers/net/ethernet/sfc/efx_channels.c:828 efx_set_xdp_channels+0x124/0x260 [sfc]
 [...skip...]
 Call Trace:
  <TASK>
  efx_set_channels+0x5c/0xc0 [sfc]
  efx_probe_nic+0x9b/0x15a [sfc]
  efx_probe_all+0x10/0x1a2 [sfc]
  efx_pci_probe_main+0x12/0x156 [sfc]
  efx_pci_probe_post_io+0x18/0x103 [sfc]
  efx_pci_probe.cold+0x154/0x257 [sfc]
  local_pci_probe+0x42/0x80

Fixes: 6215b608a8c4 ("sfc: last resort fallback for lack of xdp tx queues")
Reported-by: Yanghang Liu <yanghliu@redhat.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agonet: macsec: fix double free of percpu stats
Fedor Pchelkin [Tue, 13 Jun 2023 19:22:20 +0000 (22:22 +0300)]
net: macsec: fix double free of percpu stats

Inside macsec_add_dev() we free percpu macsec->secy.tx_sc.stats and
macsec->stats on some of the memory allocation failure paths. However, the
net_device is already registered to that moment: in macsec_newlink(), just
before calling macsec_add_dev(). This means that during unregister process
its priv_destructor - macsec_free_netdev() - will be called and will free
the stats again.

Remove freeing percpu stats inside macsec_add_dev() because
macsec_free_netdev() will correctly free the already allocated ones. The
pointers to unallocated stats stay NULL, and free_percpu() treats that
correctly.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support")
Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agonet: lapbether: only support ethernet devices
Eric Dumazet [Wed, 14 Jun 2023 16:18:02 +0000 (16:18 +0000)]
net: lapbether: only support ethernet devices

It probbaly makes no sense to support arbitrary network devices
for lapbether.

syzbot reported:

skbuff: skb_under_panic: text:ffff80008934c100 len:44 put:40 head:ffff0000d18dd200 data:ffff0000d18dd1ea tail:0x16 end:0x140 dev:bond1
kernel BUG at net/core/skbuff.c:200 !
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 5643 Comm: dhcpcd Not tainted 6.4.0-rc5-syzkaller-g4641cff8e810 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : skb_panic net/core/skbuff.c:196 [inline]
pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:210
lr : skb_panic net/core/skbuff.c:196 [inline]
lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:210
sp : ffff8000973b7260
x29: ffff8000973b7270 x28: ffff8000973b7360 x27: dfff800000000000
x26: ffff0000d85d8150 x25: 0000000000000016 x24: ffff0000d18dd1ea
x23: ffff0000d18dd200 x22: 000000000000002c x21: 0000000000000140
x20: 0000000000000028 x19: ffff80008934c100 x18: ffff8000973b68a0
x17: 0000000000000000 x16: ffff80008a43bfbc x15: 0000000000000202
x14: 0000000000000000 x13: 0000000000000001 x12: 0000000000000001
x11: 0000000000000201 x10: 0000000000000000 x9 : f22f7eb937cced00
x8 : f22f7eb937cced00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000973b6b78 x4 : ffff80008df9ee80 x3 : ffff8000805974f4
x2 : 0000000000000001 x1 : 0000000100000201 x0 : 0000000000000086
Call trace:
skb_panic net/core/skbuff.c:196 [inline]
skb_under_panic+0x13c/0x140 net/core/skbuff.c:210
skb_push+0xf0/0x108 net/core/skbuff.c:2409
ip6gre_header+0xbc/0x738 net/ipv6/ip6_gre.c:1383
dev_hard_header include/linux/netdevice.h:3137 [inline]
lapbeth_data_transmit+0x1c4/0x298 drivers/net/wan/lapbether.c:257
lapb_data_transmit+0x8c/0xb0 net/lapb/lapb_iface.c:447
lapb_transmit_buffer+0x178/0x204 net/lapb/lapb_out.c:149
lapb_send_control+0x220/0x320 net/lapb/lapb_subr.c:251
lapb_establish_data_link+0x94/0xec
lapb_device_event+0x348/0x4e0
notifier_call_chain+0x1a4/0x510 kernel/notifier.c:93
raw_notifier_call_chain+0x3c/0x50 kernel/notifier.c:461
__dev_notify_flags+0x2bc/0x544
dev_change_flags+0xd0/0x15c net/core/dev.c:8643
devinet_ioctl+0x858/0x17e4 net/ipv4/devinet.c:1150
inet_ioctl+0x2ac/0x4d8 net/ipv4/af_inet.c:979
sock_do_ioctl+0x134/0x2dc net/socket.c:1201
sock_ioctl+0x4ec/0x858 net/socket.c:1318
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:856
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52
el0_svc_common+0x138/0x244 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:191
el0_svc+0x4c/0x160 arch/arm64/kernel/entry-common.c:647
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
Code: aa1803e6 aa1903e7 a90023f5 947730f5 (d4210000)

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agoMAINTAINERS: add reviewers for SMC Sockets
Jan Karcher [Wed, 14 Jun 2023 06:54:56 +0000 (08:54 +0200)]
MAINTAINERS: add reviewers for SMC Sockets

adding three people from Alibaba as reviewers for SMC.
They are currently working on improving SMC on other architectures than
s390 and help with reviewing patches on top.

Thank you D. Wythe, Tony Lu and Wen Gu for your contributions and
collaboration and welcome on board as reviewers!

Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Acked-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agos390/ism: Fix trying to free already-freed IRQ by repeated ism_dev_exit()
Julian Ruess [Tue, 13 Jun 2023 12:25:37 +0000 (14:25 +0200)]
s390/ism: Fix trying to free already-freed IRQ by repeated ism_dev_exit()

This patch prevents the system from crashing when unloading the ISM module.

How to reproduce: Attach an ISM device and execute 'rmmod ism'.

Error-Log:
- Trying to free already-free IRQ 0
- WARNING: CPU: 1 PID: 966 at kernel/irq/manage.c:1890 free_irq+0x140/0x540

After calling ism_dev_exit() for each ISM device in the exit routine,
pci_unregister_driver() will execute ism_remove() for each ISM device.
Because ism_remove() also calls ism_dev_exit(),
free_irq(pci_irq_vector(pdev, 0), ism) is called twice for each ISM
device. This results in a crash with the error
'Trying to free already-free IRQ'.

In the exit routine, it is enough to call pci_unregister_driver()
because it ensures that ism_dev_exit() is called once per
ISM device.

Cc: <stable@vger.kernel.org> # 6.3+
Fixes: 89e7d2ba61b7 ("net/ism: Add new API for client registration")
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agoLoongArch: Fix debugfs_create_dir() error checking
Immad Mir [Thu, 15 Jun 2023 06:35:56 +0000 (14:35 +0800)]
LoongArch: Fix debugfs_create_dir() error checking

The debugfs_create_dir() returns ERR_PTR in case of an error and the
correct way of checking it is using the IS_ERR_OR_NULL inline function
rather than the simple null comparision. This patch fixes the issue.

Cc: stable@vger.kernel.org
Suggested-By: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Immad Mir <mirimmad17@gmail.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
16 months agoLoongArch: Avoid uninitialized alignment_mask
Qing Zhang [Thu, 15 Jun 2023 06:35:52 +0000 (14:35 +0800)]
LoongArch: Avoid uninitialized alignment_mask

The hardware monitoring points for instruction fetching and load/store
operations need to align 4 bytes and 1/2/4/8 bytes respectively.

Reported-by: Colin King <colin.i.king@gmail.com>
Signed-off-by: Qing Zhang <zhangqing@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
16 months agoLoongArch: Fix perf event id calculation
Huacai Chen [Thu, 15 Jun 2023 06:35:52 +0000 (14:35 +0800)]
LoongArch: Fix perf event id calculation

LoongArch PMCFG has 10bit event id rather than 8 bit, so fix it.

Cc: stable@vger.kernel.org
Signed-off-by: Jun Yi <yijun@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
16 months agoLoongArch: Fix the write_fcsr() macro
Qi Hu [Thu, 15 Jun 2023 06:35:52 +0000 (14:35 +0800)]
LoongArch: Fix the write_fcsr() macro

The "write_fcsr()" macro uses wrong the positions for val and dest in
asm. Fix it!

Reported-by: Miao HAO <haomiao19@mails.ucas.ac.cn>
Signed-off-by: Qi Hu <huqi@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
16 months agoLoongArch: Let pmd_present() return true when splitting pmd
Hongchen Zhang [Thu, 15 Jun 2023 06:35:52 +0000 (14:35 +0800)]
LoongArch: Let pmd_present() return true when splitting pmd

When we split a pmd into ptes, pmd_present() and pmd_trans_huge() should
return true, otherwise it would be treated as a swap pmd.

This is the same as arm64 does in commit b65399f6111b ("arm64/mm: Change
THP helpers to comply with generic MM semantics"), we also add a new bit
named _PAGE_PRESENT_INVALID for LoongArch.

Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
16 months agonet: dsa: felix: fix taprio guard band overflow at 10Mbps with jumbo frames
Vladimir Oltean [Tue, 13 Jun 2023 17:09:07 +0000 (20:09 +0300)]
net: dsa: felix: fix taprio guard band overflow at 10Mbps with jumbo frames

The DEV_MAC_MAXLEN_CFG register contains a 16-bit value - up to 65535.
Plus 2 * VLAN_HLEN (4), that is up to 65543.

The picos_per_byte variable is the largest when "speed" is lowest -
SPEED_10 = 10. In that case it is (1000000L * 8) / 10 = 800000.

Their product - 52434400000 - exceeds 32 bits, which is a problem,
because apparently, a multiplication between two 32-bit factors is
evaluated as 32-bit before being assigned to a 64-bit variable.
In fact it's a problem for any MTU value larger than 5368.

Cast one of the factors of the multiplication to u64 to force the
multiplication to take place on 64 bits.

Issue found by Coverity.

Fixes: 55a515b1f5a9 ("net: dsa: felix: drop oversized frames with tc-taprio instead of hanging the port")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230613170907.2413559-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/sched: cls_api: Fix lockup on flushing explicitly created chain
Vlad Buslov [Mon, 12 Jun 2023 09:34:26 +0000 (11:34 +0200)]
net/sched: cls_api: Fix lockup on flushing explicitly created chain

Mingshuai Ren reports:

When a new chain is added by using tc, one soft lockup alarm will be
 generated after delete the prio 0 filter of the chain. To reproduce
 the problem, perform the following steps:
(1) tc qdisc add dev eth0 root handle 1: htb default 1
(2) tc chain add dev eth0
(3) tc filter del dev eth0 chain 0 parent 1: prio 0
(4) tc filter add dev eth0 chain 0 parent 1:

Fix the issue by accounting for additional reference to chains that are
explicitly created by RTM_NEWCHAIN message as opposed to implicitly by
RTM_NEWTFILTER message.

Fixes: 726d061286ce ("net: sched: prevent insertion of new classifiers during chain flush")
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/lkml/87legswvi3.fsf@nvidia.com/T/
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20230612093426.2867183-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoice: Fix ice module unload
Jakub Buchocki [Mon, 12 Jun 2023 17:14:21 +0000 (10:14 -0700)]
ice: Fix ice module unload

Clearing the interrupt scheme before PFR reset,
during the removal routine, could cause the hardware
errors and possibly lead to system reboot, as the PF
reset can cause the interrupt to be generated.

Place the call for PFR reset inside ice_deinit_dev(),
wait until reset and all pending transactions are done,
then call ice_clear_interrupt_scheme().

This introduces a PFR reset to multiple error paths.

Additionally, remove the call for the reset from
ice_load() - it will be a part of ice_unload() now.

Error example:
[   75.229328] ice 0000:ca:00.1: Failed to read Tx Scheduler Tree - User Selection data from flash
[   77.571315] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   77.571418] {1}[Hardware Error]: event severity: recoverable
[   77.571459] {1}[Hardware Error]:  Error 0, type: recoverable
[   77.571500] {1}[Hardware Error]:   section_type: PCIe error
[   77.571540] {1}[Hardware Error]:   port_type: 4, root port
[   77.571580] {1}[Hardware Error]:   version: 3.0
[   77.571615] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[   77.571661] {1}[Hardware Error]:   device_id: 0000:c9:02.0
[   77.571703] {1}[Hardware Error]:   slot: 25
[   77.571736] {1}[Hardware Error]:   secondary_bus: 0xca
[   77.571773] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x347a
[   77.571821] {1}[Hardware Error]:   class_code: 060400
[   77.571858] {1}[Hardware Error]:   bridge: secondary_status: 0x2800, control: 0x0013
[   77.572490] pcieport 0000:c9:02.0: AER: aer_status: 0x00200000, aer_mask: 0x00100020
[   77.572870] pcieport 0000:c9:02.0:    [21] ACSViol                (First)
[   77.573222] pcieport 0000:c9:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[   77.573554] pcieport 0000:c9:02.0: AER: aer_uncor_severity: 0x00463010
[   77.691273] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   77.691738] {2}[Hardware Error]: event severity: recoverable
[   77.691971] {2}[Hardware Error]:  Error 0, type: recoverable
[   77.692192] {2}[Hardware Error]:   section_type: PCIe error
[   77.692403] {2}[Hardware Error]:   port_type: 4, root port
[   77.692616] {2}[Hardware Error]:   version: 3.0
[   77.692825] {2}[Hardware Error]:   command: 0x0547, status: 0x4010
[   77.693032] {2}[Hardware Error]:   device_id: 0000:c9:02.0
[   77.693238] {2}[Hardware Error]:   slot: 25
[   77.693440] {2}[Hardware Error]:   secondary_bus: 0xca
[   77.693641] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x347a
[   77.693853] {2}[Hardware Error]:   class_code: 060400
[   77.694054] {2}[Hardware Error]:   bridge: secondary_status: 0x0800, control: 0x0013
[   77.719115] pci 0000:ca:00.1: AER: can't recover (no error_detected callback)
[   77.719140] pcieport 0000:c9:02.0: AER: device recovery failed
[   77.719216] pcieport 0000:c9:02.0: AER: aer_status: 0x00200000, aer_mask: 0x00100020
[   77.719390] pcieport 0000:c9:02.0:    [21] ACSViol                (First)
[   77.719557] pcieport 0000:c9:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[   77.719723] pcieport 0000:c9:02.0: AER: aer_uncor_severity: 0x00463010

Fixes: 5b246e533d01 ("ice: split probe into smaller functions")
Signed-off-by: Jakub Buchocki <jakubx.buchocki@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230612171421.21570-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoMerge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Jakub Kicinski [Thu, 15 Jun 2023 05:36:53 +0000 (22:36 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-06-12 (igc, igb)

This series contains updates to igc and igb drivers.

Husaini clears Tx rings when interface is brought down for igc.

Vinicius disables PTM and PCI busmaster when removing igc driver.

Alex adds error check and path for NVM read error on igb.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igb: fix nvm.ops.read() error handling
  igc: Fix possible system crash when loading module
  igc: Clean the TX buffer and TX descriptor ring
====================

Link: https://lore.kernel.org/r/20230612205208.115292-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/handshake: remove fput() that causes use-after-free
Lin Ma [Wed, 14 Jun 2023 01:52:49 +0000 (09:52 +0800)]
net/handshake: remove fput() that causes use-after-free

A reference underflow is found in TLS handshake subsystem that causes a
direct use-after-free. Part of the crash log is like below:

[    2.022114] ------------[ cut here ]------------
[    2.022193] refcount_t: underflow; use-after-free.
[    2.022288] WARNING: CPU: 0 PID: 60 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
[    2.022432] Modules linked in:
[    2.022848] RIP: 0010:refcount_warn_saturate+0xbe/0x110
[    2.023231] RSP: 0018:ffffc900001bfe18 EFLAGS: 00000286
[    2.023325] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00000000ffffdfff
[    2.023438] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
[    2.023555] RBP: ffff888004c20098 R08: ffffffff82b392c8 R09: 00000000ffffdfff
[    2.023693] R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888004c200d8
[    2.023813] R13: 0000000000000000 R14: ffff888004c20000 R15: ffffc90000013ca8
[    2.023930] FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[    2.024062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.024161] CR2: ffff888003601000 CR3: 0000000002a2e000 CR4: 00000000000006f0
[    2.024275] Call Trace:
[    2.024322]  <TASK>
[    2.024367]  ? __warn+0x7f/0x130
[    2.024430]  ? refcount_warn_saturate+0xbe/0x110
[    2.024513]  ? report_bug+0x199/0x1b0
[    2.024585]  ? handle_bug+0x3c/0x70
[    2.024676]  ? exc_invalid_op+0x18/0x70
[    2.024750]  ? asm_exc_invalid_op+0x1a/0x20
[    2.024830]  ? refcount_warn_saturate+0xbe/0x110
[    2.024916]  ? refcount_warn_saturate+0xbe/0x110
[    2.024998]  __tcp_close+0x2f4/0x3d0
[    2.025065]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[    2.025168]  tcp_close+0x1f/0x70
[    2.025231]  inet_release+0x33/0x60
[    2.025297]  sock_release+0x1f/0x80
[    2.025361]  handshake_req_cancel_test2+0x100/0x2d0
[    2.025457]  kunit_try_run_case+0x4c/0xa0
[    2.025532]  kunit_generic_run_threadfn_adapter+0x15/0x20
[    2.025644]  kthread+0xe1/0x110
[    2.025708]  ? __pfx_kthread+0x10/0x10
[    2.025780]  ret_from_fork+0x2c/0x50

One can enable CONFIG_NET_HANDSHAKE_KUNIT_TEST config to reproduce above
crash.

The root cause of this bug is that the commit 1ce77c998f04
("net/handshake: Unpin sock->file if a handshake is cancelled") adds one
additional fput() function. That patch claims that the fput() is used to
enable sock->file to be freed even when user space never calls DONE.

However, it seems that the intended DONE routine will never give an
additional fput() of ths sock->file. The existing two of them are just
used to balance the reference added in sockfd_lookup().

This patch revert the mentioned commit to avoid the use-after-free. The
patched kernel could successfully pass the KUNIT test and boot to shell.

[    0.733613]     # Subtest: Handshake API tests
[    0.734029]     1..11
[    0.734255]         KTAP version 1
[    0.734542]         # Subtest: req_alloc API fuzzing
[    0.736104]         ok 1 handshake_req_alloc NULL proto
[    0.736114]         ok 2 handshake_req_alloc CLASS_NONE
[    0.736559]         ok 3 handshake_req_alloc CLASS_MAX
[    0.737020]         ok 4 handshake_req_alloc no callbacks
[    0.737488]         ok 5 handshake_req_alloc no done callback
[    0.737988]         ok 6 handshake_req_alloc excessive privsize
[    0.738529]         ok 7 handshake_req_alloc all good
[    0.739036]     # req_alloc API fuzzing: pass:7 fail:0 skip:0 total:7
[    0.739444]     ok 1 req_alloc API fuzzing
[    0.740065]     ok 2 req_submit NULL req arg
[    0.740436]     ok 3 req_submit NULL sock arg
[    0.740834]     ok 4 req_submit NULL sock->file
[    0.741236]     ok 5 req_lookup works
[    0.741621]     ok 6 req_submit max pending
[    0.741974]     ok 7 req_submit multiple
[    0.742382]     ok 8 req_cancel before accept
[    0.742764]     ok 9 req_cancel after accept
[    0.743151]     ok 10 req_cancel after done
[    0.743510]     ok 11 req_destroy works
[    0.743882] # Handshake API tests: pass:11 fail:0 skip:0 total:11
[    0.744205] # Totals: pass:17 fail:0 skip:0 total:17

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 1ce77c998f04 ("net/handshake: Unpin sock->file if a handshake is cancelled")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20230613083204.633896-1-linma@zju.edu.cn
Link: https://lore.kernel.org/r/20230614015249.987448-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoMerge tag 'wireless-2023-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 15 Jun 2023 04:28:59 +0000 (21:28 -0700)]
Merge tag 'wireless-2023-06-14' of git://git./linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
A couple of straggler fixes, mostly in the stack:
 - fix fragmentation for multi-link related elements
 - fix callback copy/paste error
 - fix multi-link locking
 - remove double-locking of wiphy mutex
 - transmit only on active links, not all
 - activate links in the correct order
 - don't remove links that weren't added
 - disable soft-IRQs for LQ lock in iwlwifi

* tag 'wireless-2023-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression
  wifi: mac80211: fragment per STA profile correctly
  wifi: mac80211: Use active_links instead of valid_links in Tx
  wifi: cfg80211: remove links only on AP
  wifi: mac80211: take lock before setting vif links
  wifi: cfg80211: fix link del callback to call correct handler
  wifi: mac80211: fix link activation settings order
  wifi: cfg80211: fix double lock bug in reg_wdev_chan_valid()
====================

Link: https://lore.kernel.org/r/20230614075502.11765-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoext4: drop the call to ext4_error() from ext4_get_group_info()
Fabio M. De Francesco [Wed, 14 Jun 2023 10:02:55 +0000 (12:02 +0200)]
ext4: drop the call to ext4_error() from ext4_get_group_info()

A recent patch added a call to ext4_error() which is problematic since
some callers of the ext4_get_group_info() function may be holding a
spinlock, whereas ext4_error() must never be called in atomic context.

This triggered a report from Syzbot: "BUG: sleeping function called from
invalid context in ext4_update_super" (see the link below).

Therefore, drop the call to ext4_error() from ext4_get_group_info(). In
the meantime use eight characters tabs instead of nine characters ones.

Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/
Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail")
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
16 months agoRevert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
Kemeng Shi [Tue, 13 Jun 2023 22:50:25 +0000 (06:50 +0800)]
Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"

This reverts commit ad3f09be6cfe332be8ff46c78e6ec0f8839107aa.

The reverted commit was intended to simpfy the code to get group
descriptor block number in non-meta block group by assuming
s_gdb_count is block number used for all non-meta block group descriptors.
However s_gdb_count is block number used for all meta *and* non-meta
group descriptors. So s_gdb_group will be > actual group descriptor block
number used for all non-meta block group which should be "total non-meta
block group" / "group descriptors per block", e.g. s_first_meta_bg.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com
Fixes: ad3f09be6cfe ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa")
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoRevert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"
Mauro Carvalho Chehab [Fri, 9 Jun 2023 08:16:21 +0000 (09:16 +0100)]
Revert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"

As reported by Thomas Voegtle <tv@lio96.de>, sometimes a DVB card does
not initialize properly booting Linux 6.4-rc4. This is not always, maybe
in 3 out of 4 attempts.

After double-checking, the root cause seems to be related to the
UAF fix, which is causing a race issue:

[   26.332149] tda10071 7-0005: found a 'NXP TDA10071' in cold state, will try to load a firmware
[   26.340779] tda10071 7-0005: downloading firmware from file 'dvb-fe-tda10071.fw'
[  989.277402] INFO: task vdr:743 blocked for more than 491 seconds.
[  989.283504]       Not tainted 6.4.0-rc5-i5 #249
[  989.288036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  989.295860] task:vdr             state:D stack:0     pid:743   ppid:711    flags:0x00004002
[  989.295865] Call Trace:
[  989.295867]  <TASK>
[  989.295869]  __schedule+0x2ea/0x12d0
[  989.295877]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[  989.295881]  schedule+0x57/0xc0
[  989.295884]  schedule_preempt_disabled+0xc/0x20
[  989.295887]  __mutex_lock.isra.16+0x237/0x480
[  989.295891]  ? dvb_get_property.isra.10+0x1bc/0xa50
[  989.295898]  ? dvb_frontend_stop+0x36/0x180
[  989.338777]  dvb_frontend_stop+0x36/0x180
[  989.338781]  dvb_frontend_open+0x2f1/0x470
[  989.338784]  dvb_device_open+0x81/0xf0
[  989.338804]  ? exact_lock+0x20/0x20
[  989.338808]  chrdev_open+0x7f/0x1c0
[  989.338811]  ? generic_permission+0x1a2/0x230
[  989.338813]  ? link_path_walk.part.63+0x340/0x380
[  989.338815]  ? exact_lock+0x20/0x20
[  989.338817]  do_dentry_open+0x18e/0x450
[  989.374030]  path_openat+0xca5/0xe00
[  989.374031]  ? terminate_walk+0xec/0x100
[  989.374034]  ? path_lookupat+0x93/0x140
[  989.374036]  do_filp_open+0xc0/0x140
[  989.374038]  ? __call_rcu_common.constprop.91+0x92/0x240
[  989.374041]  ? __check_object_size+0x147/0x260
[  989.374043]  ? __check_object_size+0x147/0x260
[  989.374045]  ? alloc_fd+0xbb/0x180
[  989.374048]  ? do_sys_openat2+0x243/0x310
[  989.374050]  do_sys_openat2+0x243/0x310
[  989.374052]  do_sys_open+0x52/0x80
[  989.374055]  do_syscall_64+0x5b/0x80
[  989.421335]  ? __task_pid_nr_ns+0x92/0xa0
[  989.421337]  ? syscall_exit_to_user_mode+0x20/0x40
[  989.421339]  ? do_syscall_64+0x67/0x80
[  989.421341]  ? syscall_exit_to_user_mode+0x20/0x40
[  989.421343]  ? do_syscall_64+0x67/0x80
[  989.421345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  989.421348] RIP: 0033:0x7fe895d067e3
[  989.421349] RSP: 002b:00007fff933c2ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[  989.421351] RAX: ffffffffffffffda RBX: 00007fff933c2c10 RCX: 00007fe895d067e3
[  989.421352] RDX: 0000000000000802 RSI: 00005594acdce160 RDI: 00000000ffffff9c
[  989.421353] RBP: 0000000000000802 R08: 0000000000000000 R09: 0000000000000000
[  989.421353] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[  989.421354] R13: 00007fff933c2ca0 R14: 00000000ffffffff R15: 00007fff933c2c90
[  989.421355]  </TASK>

This reverts commit 6769a0b7ee0c3b31e1b22c3fadff2bfb642de23f.

Fixes: 6769a0b7ee0c ("media: dvb-core: Fix use-after-free on race condition at dvb_frontend")
Link: https://lore.kernel.org/all/da5382ad-09d6-20ac-0d53-611594b30861@lio96.de/
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
16 months agoRDMA/rxe: Fix rxe_cq_post
Bob Pearson [Mon, 12 Jun 2023 15:50:33 +0000 (10:50 -0500)]
RDMA/rxe: Fix rxe_cq_post

A recent patch replaced a tasklet execution of cq->comp_handler by a
direct call. While this made sense it let changes to cq->notify state be
unprotected and assumed that the cq completion machinery and the ulp done
callbacks were reentrant. The result is that in some cases completion
events can be lost. This patch moves the cq->comp_handler call inside of
the spinlock in rxe_cq_post which solves both issues. This is compatible
with the matching code in the request notify verb.

Fixes: 78b26a335310 ("RDMA/rxe: Remove tasklet call from rxe_cq.c")
Link: https://lore.kernel.org/r/20230612155032.17036-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
16 months agocifs: add a warning when the in-flight count goes negative
Shyam Prasad N [Fri, 9 Jun 2023 17:46:56 +0000 (17:46 +0000)]
cifs: add a warning when the in-flight count goes negative

We've seen the in-flight count go into negative with some
internal stress testing in Microsoft.

Adding a WARN when this happens, in hope of understanding
why this happens when it happens.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
16 months agocifs: fix lease break oops in xfstest generic/098
Steve French [Sun, 11 Jun 2023 16:23:32 +0000 (11:23 -0500)]
cifs: fix lease break oops in xfstest generic/098

umount can race with lease break so need to check if
tcon->ses->server is still valid to send the lease
break response.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Fixes: 59a556aebc43 ("SMB3: drop reference to cfile before sending oplock break")
Signed-off-by: Steve French <stfrench@microsoft.com>
16 months agoselftests: forwarding: hw_stats_l3: Set addrgenmode in a separate step
Danielle Ratson [Mon, 12 Jun 2023 14:34:58 +0000 (16:34 +0200)]
selftests: forwarding: hw_stats_l3: Set addrgenmode in a separate step

Setting the IPv6 address generation mode of a net device during its
creation never worked, but after commit b0ad3c179059 ("rtnetlink: call
validate_linkmsg in rtnl_create_link") it explicitly fails [1]. The
failure is caused by the fact that validate_linkmsg() is called before
the net device is registered, when it still does not have an 'inet6_dev'.

Likewise, raising the net device before setting the address generation
mode is meaningless, because by the time the mode is set, the address
has already been generated.

Therefore, fix the test to first create the net device, then set its
IPv6 address generation mode and finally bring it up.

[1]
 # ip link add name mydev addrgenmode eui64 type dummy
 RTNETLINK answers: Address family not supported by protocol

Fixes: ba95e7930957 ("selftests: forwarding: hw_stats_l3: Add a new test")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/f3b05d85b2bc0c3d6168fe8f7207c6c8365703db.1686580046.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoMerge branch 'net-sched-fix-race-conditions-in-mini_qdisc_pair_swap'
Paolo Abeni [Wed, 14 Jun 2023 08:31:42 +0000 (10:31 +0200)]
Merge branch 'net-sched-fix-race-conditions-in-mini_qdisc_pair_swap'

Peilin Ye says:

====================
net/sched: Fix race conditions in mini_qdisc_pair_swap()

These 2 patches fix race conditions for ingress and clsact Qdiscs as
reported [1] by syzbot, split out from another [2] series (last 2 patches
of it).  Per-patch changelog omitted.

Patch 1 hasn't been touched since last version; I just included
everybody's tag.

Patch 2 bases on patch 6 v1 of [2], with comments and commit log slightly
changed.  We also need rtnl_dereference() to load ->qdisc_sleeping since
commit d636fc5dd692 ("net: sched: add rcu annotations around
qdisc->qdisc_sleeping"), so I changed that; please take yet another look,
thanks!

Patch 2 has been tested with the new reproducer Pedro posted [3].

[1] https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
[2] https://lore.kernel.org/r/cover.1684887977.git.peilin.ye@bytedance.com/
[3] https://lore.kernel.org/r/7879f218-c712-e9cc-57ba-665990f5f4c9@mojatatu.com/
====================

Link: https://lore.kernel.org/r/cover.1686355297.git.peilin.ye@bytedance.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Peilin Ye [Sun, 11 Jun 2023 03:30:25 +0000 (20:30 -0700)]
net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized
in ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similar
for clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.

Fixing without changing sch_api.c turned out to be difficult (please
refer to Closes: for discussions).  Instead, make sure @new's first call
always happen after @old's last call (in {ingress,clsact}_destroy()) has
finished:

In qdisc_graft(), return -EBUSY if @old has any ongoing filter requests,
and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for filter requests.  Introduce a
non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and
clsact Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after commit c7cfbd115001 ("net/sched:
sch_ingress: Only create under TC_H_INGRESS")) on eth0 that has 8
transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2),
  then adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to
its mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress
packets on eth0 will not find filter Y in sch_handle_ingress().

This is just one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
Peilin Ye [Sun, 11 Jun 2023 03:30:15 +0000 (20:30 -0700)]
net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs

Grafting ingress and clsact Qdiscs does not need a for-loop in
qdisc_graft().  Refactor it.  No functional changes intended.

Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet/sched: act_ct: Fix promotion of offloaded unreplied tuple
Paul Blakey [Fri, 9 Jun 2023 12:22:59 +0000 (15:22 +0300)]
net/sched: act_ct: Fix promotion of offloaded unreplied tuple

Currently UNREPLIED and UNASSURED connections are added to the nf flow
table. This causes the following connection packets to be processed
by the flow table which then skips conntrack_in(), and thus such the
connections will remain UNREPLIED and UNASSURED even if reply traffic
is then seen. Even still, the unoffloaded reply packets are the ones
triggering hardware update from new to established state, and if
there aren't any to triger an update and/or previous update was
missed, hardware can get out of sync with sw and still mark
packets as new.

Fix the above by:
1) Not skipping conntrack_in() for UNASSURED packets, but still
   refresh for hardware, as before the cited patch.
2) Try and force a refresh by reply-direction packets that update
   the hardware rules from new to established state.
3) Remove any bidirectional flows that didn't failed to update in
   hardware for re-insertion as bidrectional once any new packet
   arrives.

Fixes: 6a9bad0069cf ("net/sched: act_ct: offload UDP NEW connections")
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/1686313379-117663-1-git-send-email-paulb@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agowifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression
Hugh Dickins [Fri, 9 Jun 2023 21:29:39 +0000 (14:29 -0700)]
wifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression

Lockdep on 6.4-rc on ThinkPad X1 Carbon 5th says
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
6.4.0-rc5 #1 Not tainted
-----------------------------------------------------
kworker/3:1/49 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
ffff8881066fa368 (&mvm_sta->deflink.lq_sta.rs_drv.pers.lock){+.+.}-{2:2}, at: rs_drv_get_rate+0x46/0xe7

and this task is already holding:
ffff8881066f80a8 (&sta->rate_ctrl_lock){+.-.}-{2:2}, at: rate_control_get_rate+0xbd/0x126
which would create a new lock dependency:
 (&sta->rate_ctrl_lock){+.-.}-{2:2} -> (&mvm_sta->deflink.lq_sta.rs_drv.pers.lock){+.+.}-{2:2}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&sta->rate_ctrl_lock){+.-.}-{2:2}
etc. etc. etc.

Changing the spin_lock() in rs_drv_get_rate() to spin_lock_bh() was not
enough to pacify lockdep, but changing them all on pers.lock has worked.

Fixes: a8938bc881d2 ("wifi: iwlwifi: mvm: Add locking to the rate read flow")
Signed-off-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/79ffcc22-9775-cb6d-3ffd-1a517c40beef@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
16 months agoMerge branch 'fix-small-bugs-and-annoyances-in-tc-testing'
Jakub Kicinski [Wed, 14 Jun 2023 03:49:16 +0000 (20:49 -0700)]
Merge branch 'fix-small-bugs-and-annoyances-in-tc-testing'

Vlad Buslov says:

====================
Fix small bugs and annoyances in tc-testing
====================

Link: https://lore.kernel.org/r/20230612075712.2861848-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/tc-testing: Remove configs that no longer exist
Vlad Buslov [Mon, 12 Jun 2023 07:57:12 +0000 (09:57 +0200)]
selftests/tc-testing: Remove configs that no longer exist

Some qdiscs and classifiers have recently been retired from kernel.
However, tc-testing config is still cluttered with them which causes noise
when using merge_config.sh script to update existing config for tc-testing
compatibility. Remove the config settings for affected qdiscs and
classifiers.

Fixes: fb38306ceb9e ("net/sched: Retire ATM qdisc")
Fixes: 051d44209842 ("net/sched: Retire CBQ qdisc")
Fixes: bbe77c14ee61 ("net/sched: Retire dsmark qdisc")
Fixes: 265b4da82dbf ("net/sched: Retire rsvp classifier")
Fixes: 8c710f75256b ("net/sched: Retire tcindex classifier")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/tc-testing: Fix SFB db test
Vlad Buslov [Mon, 12 Jun 2023 07:57:11 +0000 (09:57 +0200)]
selftests/tc-testing: Fix SFB db test

Setting very small value of db like 10ms introduces rounding errors when
converting to/from jiffies on some kernel configs. For example, on 250hz
the actual value will be set to 12ms which causes the test to fail:

 # $ sudo ./tdc.py  -d eth2 -e 3410
 #  -- ns/SubPlugin.__init__
 # Test 3410: Create SFB with db setting
 #
 # All test results:
 #
 # 1..1
 # not ok 1 3410 - Create SFB with db setting
 #         Could not match regex pattern. Verify command output:
 # qdisc sfb 1: root refcnt 2 rehash 600s db 12ms limit 1000p max 25p target 20p increment 0.000503548 decrement 4.57771e-05 penalty_rate 10pps penalty_burst 20p

Set the value to 100ms instead which currently seem to work on 100hz,
250hz, 300hz and 1000hz kernel configs.

Fixes: 6ad92dc56fca ("selftests/tc-testing: add selftests for sfb qdisc")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/tc-testing: Fix Error: failed to find target LOG
Vlad Buslov [Mon, 12 Jun 2023 07:57:10 +0000 (09:57 +0200)]
selftests/tc-testing: Fix Error: failed to find target LOG

Add missing netfilter config dependency.

Fixes following example error when running tests via tdc.sh for all XT
tests:

 # $ sudo ./tdc.py -d eth2 -e 2029
 # Test 2029: Add xt action with log-prefix
 # exit: 255
 # exit: 0
 #  failed to find target LOG
 #
 # bad action parsing
 # parse_action: bad value (7:xt)!
 # Illegal "action"
 #
 # -----> teardown stage *** Could not execute: "$TC actions flush action xt"
 #
 # -----> teardown stage *** Error message: "Error: Cannot flush unknown TC action.
 # We have an error flushing
 # "
 # returncode 1; expected [0]
 #
 # -----> teardown stage *** Aborting test run.
 #
 # <_io.BufferedReader name=3> *** stdout ***
 #
 # <_io.BufferedReader name=5> *** stderr ***
 # "-----> teardown stage" did not complete successfully
 # Exception <class '__main__.PluginMgrTestFail'> ('teardown', ' failed to find target LOG\n\nbad action parsing\nparse_action: bad value (7:xt)!\nIllegal "action"\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 2 2029 Add xt action with log-prefix stage teardown)
 # ---------------
 # traceback
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 495, in test_runner
 #     res = run_one_test(pm, args, index, tidx)
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 434, in run_one_test
 #     prepare_env(args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 245, in prepare_env
 #     raise PluginMgrTestFail(
 # ---------------
 # accumulated output for this test:
 #  failed to find target LOG
 #
 # bad action parsing
 # parse_action: bad value (7:xt)!
 # Illegal "action"
 #
 # ---------------
 #
 # All test results:
 #
 # 1..1
 # ok 1 2029 - Add xt action with log-prefix # skipped - "-----> teardown stage" did not complete successfully

Fixes: 910d504bc187 ("selftests/tc-testings: add selftests for xt action")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests/tc-testing: Fix Error: Specified qdisc kind is unknown.
Vlad Buslov [Mon, 12 Jun 2023 07:57:09 +0000 (09:57 +0200)]
selftests/tc-testing: Fix Error: Specified qdisc kind is unknown.

All TEQL tests assume that sch_teql module is loaded. Load module in tdc.sh
before running qdisc tests.

Fixes following example error when running tests via tdc.sh for all TEQL
tests:

 # $ sudo ./tdc.py -d eth2 -e 84a0
 #  -- ns/SubPlugin.__init__
 # Test 84a0: Create TEQL with default setting
 # exit: 2
 # exit: 0
 # Error: Specified qdisc kind is unknown.
 #
 # -----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root"
 #
 # -----> teardown stage *** Error message: "Error: Invalid handle.
 # "
 # returncode 2; expected [0]
 #
 # -----> teardown stage *** Aborting test run.
 #
 # <_io.BufferedReader name=3> *** stdout ***
 #
 # <_io.BufferedReader name=5> *** stderr ***
 # "-----> teardown stage" did not complete successfully
 # Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Error: Specified qdisc kind is unknown.\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 2 84a0 Create TEQL with default setting stage teardown)
 # ---------------
 # traceback
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 495, in test_runner
 #     res = run_one_test(pm, args, index, tidx)
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 434, in run_one_test
 #     prepare_env(args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
 #   File "/images/src/linux/tools/testing/selftests/tc-testing/./tdc.py", line 245, in prepare_env
 #     raise PluginMgrTestFail(
 # ---------------
 # accumulated output for this test:
 # Error: Specified qdisc kind is unknown.
 #
 # ---------------
 #
 # All test results:
 #
 # 1..1
 # ok 1 84a0 - Create TEQL with default setting # skipped - "-----> teardown stage" did not complete successfully

Fixes: cc62fbe114c9 ("selftests/tc-testing: add selftests for teql qdisc")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet: ethernet: ti: am65-cpsw: Call of_node_put() on error path
Dan Carpenter [Mon, 12 Jun 2023 07:18:50 +0000 (10:18 +0300)]
net: ethernet: ti: am65-cpsw: Call of_node_put() on error path

This code returns directly but it should instead call of_node_put()
to drop some reference counts.

Fixes: dab2b265dd23 ("net: ethernet: ti: am65-cpsw: Add support for SERDES configuration")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Link: https://lore.kernel.org/r/e3012f0c-1621-40e6-bf7d-03c276f6e07f@kili.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoMerge tag 'nios2_fix_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen...
Linus Torvalds [Wed, 14 Jun 2023 00:00:33 +0000 (17:00 -0700)]
Merge tag 'nios2_fix_v6.4' of git://git./linux/kernel/git/dinguyen/linux

Pull NIOS2 dts fix from Dinh Nguyen:

 - Fix tse_mac "max-frame-size" property

* tag 'nios2_fix_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
  nios2: dts: Fix tse_mac "max-frame-size" property

16 months agonios2: dts: Fix tse_mac "max-frame-size" property
Janne Grunau [Sun, 12 Feb 2023 12:16:32 +0000 (13:16 +0100)]
nios2: dts: Fix tse_mac "max-frame-size" property

The given value of 1518 seems to refer to the layer 2 ethernet frame
size without 802.1Q tag. Actual use of the "max-frame-size" including in
the consumer of the "altr,tse-1.0" compatible is the MTU.

Fixes: 95acd4c7b69c ("nios2: Device tree support")
Fixes: 61c610ec61bb ("nios2: Add Max10 device tree")
Cc: <stable@vger.kernel.org>
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
16 months agospi: dw: Replace incorrect spi_get_chipselect with set
Abe Kohandel [Tue, 13 Jun 2023 16:21:03 +0000 (09:21 -0700)]
spi: dw: Replace incorrect spi_get_chipselect with set

Commit 445164e8c136 ("spi: dw: Replace spi->chip_select references with
function calls") replaced direct access to spi.chip_select with
spi_*_chipselect calls but incorrectly replaced a set instance with a
get instance, replace the incorrect instance.

Fixes: 445164e8c136 ("spi: dw: Replace spi->chip_select references with function calls")
Signed-off-by: Abe Kohandel <abe.kohandel@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://lore.kernel.org/r/20230613162103.569812-1-abe.kohandel@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
16 months agoMerge tag 'devicetree-fixes-for-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 13 Jun 2023 17:27:56 +0000 (10:27 -0700)]
Merge tag 'devicetree-fixes-for-6.4-3' of git://git./linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Fix missing of_node_put() in init_overlay_changeset()

 - Fix schema for qcom,pmic-mpp "qcom,paired" property

 - Fix 'additionalProperties' in silvaco,i3c-master binding

 - usage-model.rst: Use documented "arm,primecell" compatible string

 - Update Damien Le Moal's email address

 - Fixes in Realtek Bluetooth binding

* tag 'devicetree-fixes-for-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: pinctrl: qcom,pmic-mpp: Fix schema for "qcom,paired"
  dt-bindings: i3c: silvaco,i3c-master: fix missing schema restriction
  of: overlay: Fix missing of_node_put() in error case of init_overlay_changeset()
  docs: zh_CN/devicetree: sync usage-model fix
  docs: dt: fix documented Primecell compatible string
  dt-bindings: Change Damien Le Moal's contact email
  dt-bindings: net: realtek-bluetooth: Fix double RTL8723CS in desc
  dt-bindings: net: realtek-bluetooth: Fix RTL8821CS binding

16 months agodt-bindings: pinctrl: qcom,pmic-mpp: Fix schema for "qcom,paired"
Rob Herring [Tue, 18 Apr 2023 15:06:06 +0000 (10:06 -0500)]
dt-bindings: pinctrl: qcom,pmic-mpp: Fix schema for "qcom,paired"

The "qcom,paired" schema is all wrong. First, it's a list rather than an
object(dictionary). Second, it is missing a required type. The meta-schema
normally catches this, but schemas under "$defs" was not getting checked.
A fix for that is pending.

Fixes: f9a06b810951 ("dt-bindings: pinctrl: qcom,pmic-mpp: Convert qcom pmic mpp bindings to YAML")
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230418150606.1528107-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
16 months agoregmap: regcache: Don't sync read-only registers
Takashi Iwai [Tue, 13 Jun 2023 11:22:40 +0000 (13:22 +0200)]
regmap: regcache: Don't sync read-only registers

regcache_maple_sync() tries to sync all cached values no matter
whether it's writable or not.  OTOH, regache_sync_val() does care the
wrtability and returns -EIO for a read-only register.  This results in
an error message like:
  snd_hda_codec_realtek hdaudioC0D0: Unable to sync register 0x2f0009. -5
and the sync loop is aborted incompletely.

This patch adds the writable register check to regcache_sync_val() for
addressing the bug above.

Note that, although we may add the check in the caller side
(regcache_maple_sync()), here we put in regcache_sync_val(), so that a
similar case like this can be avoided in future.

Fixes: f033c26de5a5 ("regmap: Add maple tree based register cache")
Link: https://lore.kernel.org/r/877cs7g6f1.wl-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://lore.kernel.org/r/20230613112240.3361-1-tiwai@suse.de
Signed-off-by: Mark Brown <broonie@kernel.org>
16 months agoMerge branch 'mptcp-fixes'
Jakub Kicinski [Mon, 12 Jun 2023 23:55:45 +0000 (16:55 -0700)]
Merge branch 'mptcp-fixes'

Matthieu Baerts says:

====================
selftests: mptcp: skip tests not supported by old kernels (part 3)

After a few years of increasing test coverage in the MPTCP selftests, we
realised [1] the last version of the selftests is supposed to run on old
kernels without issues.

Supporting older versions is not that easy for this MPTCP case: these
selftests are often validating the internals by checking packets that
are exchanged, when some MIB counters are incremented after some
actions, how connections are getting opened and closed in some cases,
etc. In other words, it is not limited to the socket interface between
the userspace and the kernelspace.

In addition to that, the current MPTCP selftests run a lot of different
sub-tests but the TAP13 protocol used in the selftests don't support
sub-tests: one failure in sub-tests implies that the whole selftest is
seen as failed at the end because sub-tests are not tracked. It is then
important to skip sub-tests not supported by old kernels.

To minimise the modifications and reduce the complexity to support old
versions, the idea is to look at external signs and skip the whole
selftest or just some sub-tests before starting them. This cannot be
applied in all cases.

Similar to the second part, this third one focuses on marking different
sub-tests as skipped if some MPTCP features are not supported. This
time, only in "mptcp_join.sh" selftest, the remaining one, is modified.
Several techniques are used here to achieve this task:

- Before starting some tests:

  - Check if a file (sysctl knob) is present: that's what patch 12/17 is
    doing for the userspace PM feature.

  - Check if a required kernel symbol is present in /proc/kallsyms:
    patches 9, 10, 14 and 15/17 are using this technique.

  - Check if it is possible to setup a particular network environment
    requiring Netfilter or TC: if the preparation step fail, the linked
    sub-test is marked as skipped. Patch 5/17 is doing that.

  - Check if a MIB counter is available: patches 7 and 13/17 do that.

  - Check if the kernel version is newer than a specific one: patch 1/17
    adds some helpers in mptcp_lib.sh to ease its use. That's not ideal
    and it is only used as last resort but as mentioned above, it is
    important to skip tests if they are not supported not to have the
    whole selftest always being marked as failed on old kernels. Patches
    11 and 17/17 are checking the kernel version. An alternative would
    be to ignore the results for some sub-tests but that's not ideal
    too. Note that SELFTESTS_MPTCP_LIB_NO_KVERSION_CHECK env var can be
    set to 1 not to skip these tests if the running kernel doesn't have
    a supported version.

- After having launched the tests:

  - Adapt the expectations depending on the presence of a kernel symbol
    (patch 6/17) or a kernel version (patch 8/17).

  - Check is a MIB counter is available and skip the verification if
    not. Patch 4/17 is using this technique.

Before skipping tests, SELFTESTS_MPTCP_LIB_EXPECT_ALL_FEATURES env var
value is checked: if it is set to 1, the test is marked as "failed"
instead of "skipped". MPTCP public CI expects to have all features
supported and it sets this env var to 1 to catch regressions in these
new checks.

Patch 2/17 uses 'iptables-legacy' if available because it might be
needed when using an older kernel not supporting iptables-nft.

Patch 3/17 adds some helpers used in the other patches mentioned to
easily mark sub-tests as skipped.

Patch 16/17 uniforms MPTCP Join "listener" tests: it was imported code
from userspace_pm.sh but without using the "code style" and ways of
using tools and printing messages from MPTCP Join selftest.

Link: https://lore.kernel.org/stable/CA+G9fYtDGpgT4dckXD-y-N92nqUxuvue_7AtDdBcHrbOMsDZLg@mail.gmail.com/
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
====================

Link: https://lore.kernel.org/r/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip mixed tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:52 +0000 (18:11 +0200)]
selftests: mptcp: join: skip mixed tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of a mix of subflows in v4 and v6 by the
in-kernel PM introduced by commit b9d69db87fb7 ("mptcp: let the
in-kernel PM use mixed IPv4 and IPv6 addresses").

It looks like there is no external sign we can use to predict the
expected behaviour. Instead of accepting different behaviours and thus
not really checking for the expected behaviour, we are looking here for
a specific kernel version. That's not ideal but it looks better than
removing the test because it cannot support older kernel versions.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: ad3493746ebe ("selftests: mptcp: add test-cases for mixed v4/v6 subflows")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: uniform listener tests
Matthieu Baerts [Sat, 10 Jun 2023 16:11:51 +0000 (18:11 +0200)]
selftests: mptcp: join: uniform listener tests

The alignment was different from the other tests because tabs were used
instead of spaces.

While at it, also use 'echo' instead of 'printf' to print the result to
keep the same style as done in the other sub-tests. And, even if it
should be better with, also remove 'stdbuf' and sed's '--unbuffered'
option because they are not used in the other subtests and they are not
available when using a minimal environment with busybox.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 178d023208eb ("selftests: mptcp: listener test for in-kernel PM")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip PM listener tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:50 +0000 (18:11 +0200)]
selftests: mptcp: join: skip PM listener tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of PM listener events introduced by commit
f8c9dfbd875b ("mptcp: add pm listener events").

It is possible to look for "mptcp_event_pm_listener" in kallsyms to know
in advance if the kernel supports this feature.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 178d023208eb ("selftests: mptcp: listener test for in-kernel PM")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip MPC backups tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:49 +0000 (18:11 +0200)]
selftests: mptcp: join: skip MPC backups tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of sending an MP_PRIO signal for the initial
subflow, introduced by commit c157bbe776b7 ("mptcp: allow the in kernel
PM to set MPC subflow priority").

It is possible to look for "mptcp_subflow_send_ack" in kallsyms because
it was needed to introduce the mentioned feature. So we can know in
advance if the feature is supported instead of trying and accepting any
results.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 914f6a59b10f ("selftests: mptcp: add MPC backup tests")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip fail tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:48 +0000 (18:11 +0200)]
selftests: mptcp: join: skip fail tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of the MP_FAIL / infinite mapping introduced
by commit 1e39e5a32ad7 ("mptcp: infinite mapping sending") and the
following ones.

It is possible to look for one of the infinite mapping counters to know
in advance if the this feature is available.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: b6e074e171bc ("selftests: mptcp: add infinite map testcase")
Cc: stable@vger.kernel.org
Fixes: 2ba18161d407 ("selftests: mptcp: add MP_FAIL reset testcase")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip userspace PM tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:47 +0000 (18:11 +0200)]
selftests: mptcp: join: skip userspace PM tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of the userspace PM introduced by commit
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
and the following ones.

It is possible to look for the MPTCP pm_type's sysctl knob to know in
advance if the userspace PM is available.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip fullmesh flag tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:46 +0000 (18:11 +0200)]
selftests: mptcp: join: skip fullmesh flag tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of the fullmesh flag for the in-kernel PM
introduced by commit 2843ff6f36db ("mptcp: remote addresses fullmesh")
and commit 1a0d6136c5f0 ("mptcp: local addresses fullmesh").

It looks like there is no easy external sign we can use to predict the
expected behaviour. We could add the flag and then check if it has been
added but for that, and for each fullmesh test, we would need to setup a
new environment, do the checks, clean it and then only start the test
from yet another clean environment. To keep it simple and avoid
introducing new issues, we look for a specific kernel version. That's
not ideal but an acceptable solution for this case.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip backup if set flag on ID not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:45 +0000 (18:11 +0200)]
selftests: mptcp: join: skip backup if set flag on ID not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

Commit bccefb762439 ("selftests: mptcp: simplify pm_nl_change_endpoint")
has simplified the way the backup flag is set on an endpoint. Instead of
doing:

  ./pm_nl_ctl set 10.0.2.1 flags backup

Now we do:

  ./pm_nl_ctl set id 1 flags backup

The new way is easier to maintain but it is also incompatible with older
kernels not supporting the implicit endpoints putting in place the
infrastructure to set flags per ID, hence the second Fixes tag.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: bccefb762439 ("selftests: mptcp: simplify pm_nl_change_endpoint")
Cc: stable@vger.kernel.org
Fixes: 4cf86ae84c71 ("mptcp: strict local address ID selection")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip implicit tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:44 +0000 (18:11 +0200)]
selftests: mptcp: join: skip implicit tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of the implicit endpoints introduced by
commit d045b9eb95a9 ("mptcp: introduce implicit endpoints").

It is possible to look for "mptcp_subflow_send_ack" in kallsyms because
it was needed to introduce the mentioned feature. So we can know in
advance if the feature is supported instead of trying and accepting any
results.

Note that here and in the following commits, we re-do the same check for
each sub-test of the same function for a few reasons. The main one is
not to break the ID assign to each test in order to be able to easily
compare results between different kernel versions. Also, we can still
run a specific test even if it is skipped. Another reason is that it
makes it clear during the review that a specific subtest will be skipped
or not under certain conditions. At the end, it looks OK to call the
exact same helper multiple times: it is not a critical path and it is
the same code that is executed, not really more cases to maintain.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: support RM_ADDR for used endpoints or not
Matthieu Baerts [Sat, 10 Jun 2023 16:11:43 +0000 (18:11 +0200)]
selftests: mptcp: join: support RM_ADDR for used endpoints or not

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

At some points, a new feature caused internal behaviour changes we are
verifying in the selftests, see the Fixes tag below. It was not a UAPI
change but because in these selftests, we check some internal
behaviours, it is normal we have to adapt them from time to time after
having added some features.

It looks like there is no external sign we can use to predict the
expected behaviour. Instead of accepting different behaviours and thus
not really checking for the expected behaviour, we are looking here for
a specific kernel version. That's not ideal but it looks better than
removing the test because it cannot support older kernel versions.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip Fastclose tests if not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:42 +0000 (18:11 +0200)]
selftests: mptcp: join: skip Fastclose tests if not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the support of MP_FASTCLOSE introduced in commit
f284c0c77321 ("mptcp: implement fastclose xmit path").

If the MIB counter is not available, the test cannot be verified and the
behaviour will not be the expected one. So we can skip the test if the
counter is missing.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 01542c9bf9ab ("selftests: mptcp: add fastclose testcase")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: support local endpoint being tracked or not
Matthieu Baerts [Sat, 10 Jun 2023 16:11:41 +0000 (18:11 +0200)]
selftests: mptcp: join: support local endpoint being tracked or not

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

At some points, a new feature caused internal behaviour changes we are
verifying in the selftests, see the Fixes tag below. It was not a uAPI
change but because in these selftests, we check some internal
behaviours, it is normal we have to adapt them from time to time after
having added some features.

It is possible to look for "mptcp_pm_subflow_check_next" in kallsyms
because it was needed to introduce the mentioned feature. So we can know
in advance what the behaviour we are expecting here instead of
supporting the two behaviours.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip test if iptables/tc cmds fail
Matthieu Baerts [Sat, 10 Jun 2023 16:11:40 +0000 (18:11 +0200)]
selftests: mptcp: join: skip test if iptables/tc cmds fail

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

Some tests are using IPTables and/or TC commands to force some
behaviours. If one of these commands fails -- likely because some
features are not available due to missing kernel config -- we should
intercept the error and skip the tests requiring these features.

Note that if we expect to have these features available and if
SELFTESTS_MPTCP_LIB_EXPECT_ALL_FEATURES env var is set to 1, the tests
will be marked as failed instead of skipped.

This patch also replaces the 'exit 1' by 'return 1' not to stop the
selftest in the middle without the conclusion if there is an issue with
NF or TC.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 8d014eaa9254 ("selftests: mptcp: add ADD_ADDR timeout test case")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoselftests: mptcp: join: skip check if MIB counter not supported
Matthieu Baerts [Sat, 10 Jun 2023 16:11:39 +0000 (18:11 +0200)]
selftests: mptcp: join: skip check if MIB counter not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting all MPTCP features.

One of them is the MPTCP MIB counters introduced in commit fc518953bc9c
("mptcp: add and use MIB counter infrastructure") and more later. The
MPTCP Join selftest heavily relies on these counters.

If a counter is not supported by the kernel, it is not displayed when
using 'nstat -z'. We can then detect that and skip the verification. A
new helper (get_counter()) has been added to do the required checks and
return an error if the counter is not available.

Note that if we expect to have these features available and if
SELFTESTS_MPTCP_LIB_EXPECT_ALL_FEATURES env var is set to 1, the tests
will be marked as failed instead of skipped.

This new helper also makes sure we get the exact counter we want to
avoid issues we had in the past, e.g. with MPTcpExtRmAddr and
MPTcpExtRmAddrDrop sharing the same prefix. While at it, we uniform the
way we fetch a MIB counter.

Note for the backports: we rarely change these modified blocks so if
there is are conflicts, it is very likely because a counter is not used
in the older kernels and we don't need that chunk.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: b08fbf241064 ("selftests: add test-cases for MPTCP MP_JOIN")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>