platform/kernel/linux-rpi.git
16 months agodrivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU hotplug
K Prateek Nayak [Mon, 8 May 2023 08:41:14 +0000 (14:11 +0530)]
drivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU hotplug

While building the shared_cpu_map, check if the cache level and cache
type matches. On certain systems that build the cache topology based on
the instance ID, there are cases where the same ID may repeat across
multiple cache levels, leading inaccurate topology.

In event of CPU offlining, the cache_shared_cpu_map_remove() does not
consider if IDs at same level are being compared. As a result, when same
IDs repeat across different cache levels, the CPU going offline is not
removed from all the shared_cpu_map.

Below is the output of cache topology of CPU8 and it's SMT sibling after
CPU8 is offlined on a dual socket 3rd Generation AMD EPYC processor
(2 x 64C/128T) running kernel release v6.3:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu136/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu136/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list: 9-15,136-143

CPU8 is removed from index0 (L1i) but remains in the shared_cpu_list of
index1 (L1d) and index2 (L2). Since L1i, L1d, and L2 are shared by the
SMT siblings, and they have the same cache instance ID, CPU 2 is only
removed from the first index with matching ID which is index1 (L1i) in
this case. With this fix, the results are as expected when performing
the same experiment on the same system:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu136/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index1/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index2/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list: 9-15,136-143

When rebuilding topology, the same problem appears as
cache_shared_cpu_map_setup() implements a similar logic. Consider the
same 3rd Generation EPYC processor: CPUs in Core 1, that share the L1
and L2 caches, have L1 and L2 instance ID as 1. For all the CPUs on
the second chiplet, the L3 ID is also 1 leading to grouping on CPUs from
Core 1 (1, 17) and the entire second chiplet (8-15, 24-31) as CPUs
sharing one cache domain. This went undetected since x86 processors
depended on arch specific populate_cache_leaves() method to repopulate
the shared_cpus_map when CPU came back online until kernel release
v6.3-rc5.

Fixes: 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared caches at different levels")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20230508084115.1157-2-kprateek.nayak@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agomailbox: mailbox-test: fix a locking issue in mbox_test_message_write()
Dan Carpenter [Fri, 5 May 2023 09:22:09 +0000 (12:22 +0300)]
mailbox: mailbox-test: fix a locking issue in mbox_test_message_write()

There was a bug where this code forgot to unlock the tdev->mutex if the
kzalloc() failed.  Fix this issue, by moving the allocation outside the
lock.

Fixes: 2d1e952a2b8e ("mailbox: mailbox-test: Fix potential double-free in mbox_test_message_write()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
16 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Wed, 31 May 2023 18:13:09 +0000 (14:13 -0400)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:

 - Fix 64K ARM page size support in bnxt_re and efa

 - bnxt_re fixes for a memory leak, incorrect error handling and a
   remove a bogus FW failure when running on a VF

 - Update MAINTAINERS for hns and efa

 - Fix two rxe regressions added this merge window in error unwind and
   incorrect spinlock primitives

 - hns gets a better algorithm for allocating page tables to avoid
   running out of resources, and a timeout adjustment

 - Fix a text case failure in hns

 - Use after free in irdma and fix incorrect construction of a WQE
   causing mis-execution

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/irdma: Fix Local Invalidate fencing
  RDMA/irdma: Prevent QP use after free
  MAINTAINERS: Update maintainer of Amazon EFA driver
  RDMA/bnxt_re: Do not enable congestion control on VFs
  RDMA/bnxt_re: Fix return value of bnxt_re_process_raw_qp_pkt_rx
  RDMA/bnxt_re: Fix a possible memory leak
  RDMA/hns: Modify the value of long message loopback slice
  RDMA/hns: Fix base address table allocation
  RDMA/hns: Fix timeout attr in query qp for HIP08
  RDMA/efa: Fix unsupported page sizes in device
  RDMA/rxe: Convert spin_{lock_bh,unlock_bh} to spin_{lock_irqsave,unlock_irqrestore}
  RDMA/rxe: Fix double unlock in rxe_qp.c
  MAINTAINERS: Update maintainers of HiSilicon RoCE
  RDMA/bnxt_re: Fix the page_size used during the MR creation

16 months agoMerge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 31 May 2023 18:06:01 +0000 (14:06 -0400)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4 and a number of issues reported by syzbot"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable the lazy init thread when remounting read/write
  ext4: fix fsync for non-directories
  ext4: add lockdep annotations for i_data_sem for ea_inode's
  ext4: disallow ea_inodes with extended attributes
  ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()
  ext4: add EA_INODE checking to ext4_iget()

16 months agonvme: fix the name of Zone Append for verbose logging
Christoph Hellwig [Wed, 31 May 2023 12:54:54 +0000 (14:54 +0200)]
nvme: fix the name of Zone Append for verbose logging

No Management involved in Zone Appened.

Fixes: bd83fe6f2cd2 ("nvme: add verbose error logging")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
16 months agoscsi: stex: Fix gcc 13 warnings
Bart Van Assche [Mon, 29 May 2023 19:50:34 +0000 (12:50 -0700)]
scsi: stex: Fix gcc 13 warnings

gcc 13 may assign another type to enumeration constants than gcc 12. Split
the large enum at the top of source file stex.c such that the type of the
constants used in time expressions is changed back to the same type chosen
by gcc 12. This patch suppresses compiler warnings like this one:

In file included from ./include/linux/bitops.h:7,
                 from ./include/linux/kernel.h:22,
                 from drivers/scsi/stex.c:13:
drivers/scsi/stex.c: In function ‘stex_common_handshake’:
./include/linux/typecheck.h:12:25: error: comparison of distinct pointer types lacks a cast [-Werror]
   12 |         (void)(&__dummy == &__dummy2); \
      |                         ^~
./include/linux/jiffies.h:106:10: note: in expansion of macro ‘typecheck’
  106 |          typecheck(unsigned long, b) && \
      |          ^~~~~~~~~
drivers/scsi/stex.c:1035:29: note: in expansion of macro ‘time_after’
 1035 |                         if (time_after(jiffies, before + MU_MAX_DELAY * HZ)) {
      |                             ^~~~~~~~~~

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107405.

Cc: stable@vger.kernel.org
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230529195034.3077-1-bvanassche@acm.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
16 months agoHID: logitech-hidpp: Handle timeout differently from busy
Bastien Nocera [Wed, 31 May 2023 08:24:28 +0000 (10:24 +0200)]
HID: logitech-hidpp: Handle timeout differently from busy

If an attempt at contacting a receiver or a device fails because the
receiver or device never responds, don't restart the communication, only
restart it if the receiver or device answers that it's busy, as originally
intended.

This was the behaviour on communication timeout before commit 586e8fede795
("HID: logitech-hidpp: Retry commands when device is busy").

This fixes some overly long waits in a critical path on boot, when
checking whether the device is connected by getting its HID++ version.

Signed-off-by: Bastien Nocera <hadess@hadess.net>
Suggested-by: Mark Lord <mlord@pobox.com>
Fixes: 586e8fede795 ("HID: logitech-hidpp: Retry commands when device is busy")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217412
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
16 months agoriscv: Fix relocatable kernels with early alternatives using -fno-pie
Alexandre Ghiti [Fri, 26 May 2023 15:46:30 +0000 (17:46 +0200)]
riscv: Fix relocatable kernels with early alternatives using -fno-pie

Early alternatives are called with the mmu disabled, and then should not
access any global symbols through the GOT since it requires relocations,
relocations that we do before but *virtually*. So only use medany code
model for this early code.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com> # booted on nezha & unmatched
Fixes: 39b33072941f ("riscv: Introduce CONFIG_RELOCATABLE")
Link: https://lore.kernel.org/r/20230526154630.289374-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
16 months agonfsd: fix double fget() bug in __write_ports_addfd()
Dan Carpenter [Mon, 29 May 2023 11:35:55 +0000 (14:35 +0300)]
nfsd: fix double fget() bug in __write_ports_addfd()

The bug here is that you cannot rely on getting the same socket
from multiple calls to fget() because userspace can influence
that.  This is a kind of double fetch bug.

The fix is to delete the svc_alien_sock() function and instead do
the checking inside the svc_addsock() function.

Fixes: 3064639423c4 ("nfsd: check passed socket's net matches NFSd superblock's one")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
16 months agotracing/probe: trace_probe_primary_from_call(): checked list_first_entry
Pietro Borrello [Sat, 28 Jan 2023 16:23:41 +0000 (16:23 +0000)]
tracing/probe: trace_probe_primary_from_call(): checked list_first_entry

All callers of trace_probe_primary_from_call() check the return
value to be non NULL. However, the function returns
list_first_entry(&tpe->probes, ...) which can never be NULL.
Additionally, it does not check for the list being possibly empty,
possibly causing a type confusion on empty lists.
Use list_first_entry_or_null() which solves both problems.

Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/
Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
16 months agoudp6: Fix race condition in udp6_sendmsg & connect
Vladislav Efanov [Tue, 30 May 2023 11:39:41 +0000 (14:39 +0300)]
udp6: Fix race condition in udp6_sendmsg & connect

Syzkaller got the following report:
BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255

The function sk_setup_caps (called by ip6_sk_dst_store_flow->
ip6_dst_store) referenced already freed memory as this memory was
freed by parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
sk_dst_check.

          task1 (connect)              task2 (udp6_sendmsg)
        sk_setup_caps->sk_dst_set |
                                  |  sk_dst_check->
                                  |      sk_dst_set
                                  |      dst_release
        sk_setup_caps references  |
        to already freed dst_entry|

The reason for this race condition is: sk_setup_caps() keeps using
the dst after transferring the ownership to the dst cache.

Found by Linux Verification Center (linuxtesting.org) with syzkaller.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 months agoKVM: arm64: Document default vPMU behavior on heterogeneous systems
Oliver Upton [Thu, 25 May 2023 21:27:22 +0000 (21:27 +0000)]
KVM: arm64: Document default vPMU behavior on heterogeneous systems

KVM maintains a mask of supported CPUs when a vPMU type is explicitly
selected by userspace and is used to reject any attempt to run the vCPU
on an unsupported CPU. This is great, but we're still beholden to the
default behavior where vCPUs can be scheduled anywhere and guest
counters may silently stop working.

Avoid confusing the next poor sod to look at this code and document the
intended behavior.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-3-oliver.upton@linux.dev
16 months agoKVM: arm64: Iterate arm_pmus list to probe for default PMU
Oliver Upton [Thu, 25 May 2023 21:27:21 +0000 (21:27 +0000)]
KVM: arm64: Iterate arm_pmus list to probe for default PMU

To date KVM has relied on using a perf event to probe the core PMU at
the time of vPMU initialization. Behind the scenes perf_event_init()
would iteratively walk the PMUs of the system and return the PMU that
could handle the event. However, an upcoming change in perf core will
drop the iterative walk, thereby breaking the fragile dance we do on the
KVM side.

Avoid the problem altogether by iterating over the list of supported
PMUs maintained in KVM, returning the core PMU that matches the CPU
we were called on.

Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-2-oliver.upton@linux.dev
16 months agonet/netlink: fix NETLINK_LIST_MEMBERSHIPS length report
Pedro Tammela [Mon, 29 May 2023 15:33:35 +0000 (12:33 -0300)]
net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report

The current code for the length calculation wrongly truncates the reported
length of the groups array, causing an under report of the subscribed
groups. To fix this, use 'BITS_TO_BYTES()' which rounds up the
division by 8.

Fixes: b42be38b2778 ("netlink: add API to retrieve all group memberships")
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230529153335.389815-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoKVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()
Oliver Upton [Tue, 30 May 2023 19:32:13 +0000 (19:32 +0000)]
KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()

The reference count on page table allocations is increased for every
'counted' PTE (valid or donated) in the table in addition to the initial
reference from ->zalloc_page(). kvm_pgtable_stage2_free_removed() fails
to drop the last reference on the root of the table walk, meaning we
leak memory.

Fix it by dropping the last reference after the free walker returns,
at which point all references for 'counted' PTEs have been released.

Cc: stable@vger.kernel.org
Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530193213.1663411-1-oliver.upton@linux.dev
16 months agonet: sched: fix NULL pointer dereference in mq_attach
Zhengchao Shao [Sat, 27 May 2023 09:37:47 +0000 (17:37 +0800)]
net: sched: fix NULL pointer dereference in mq_attach

When use the following command to test:
1)ip link add bond0 type bond
2)ip link set bond0 up
3)tc qdisc add dev bond0 root handle ffff: mq
4)tc qdisc replace dev bond0 parent ffff:fff1 handle ffff: mq

The kernel reports NULL pointer dereference issue. The stack information
is as follows:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Internal error: Oops: 0000000096000006 [#1] SMP
Modules linked in:
pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mq_attach+0x44/0xa0
lr : qdisc_graft+0x20c/0x5cc
sp : ffff80000e2236a0
x29: ffff80000e2236a0 x28: ffff0000c0e59d80 x27: ffff0000c0be19c0
x26: ffff0000cae3e800 x25: 0000000000000010 x24: 00000000fffffff1
x23: 0000000000000000 x22: ffff0000cae3e800 x21: ffff0000c9df4000
x20: ffff0000c9df4000 x19: 0000000000000000 x18: ffff80000a934000
x17: ffff8000f5b56000 x16: ffff80000bb08000 x15: 0000000000000000
x14: 0000000000000000 x13: 6b6b6b6b6b6b6b6b x12: 6b6b6b6b00000001
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : ffff0000c0be0730 x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
x5 : ffff0000cae3e864 x4 : 0000000000000000 x3 : 0000000000000001
x2 : 0000000000000001 x1 : ffff8000090bc23c x0 : 0000000000000000
Call trace:
mq_attach+0x44/0xa0
qdisc_graft+0x20c/0x5cc
tc_modify_qdisc+0x1c4/0x664
rtnetlink_rcv_msg+0x354/0x440
netlink_rcv_skb+0x64/0x144
rtnetlink_rcv+0x28/0x34
netlink_unicast+0x1e8/0x2a4
netlink_sendmsg+0x308/0x4a0
sock_sendmsg+0x64/0xac
____sys_sendmsg+0x29c/0x358
___sys_sendmsg+0x90/0xd0
__sys_sendmsg+0x7c/0xd0
__arm64_sys_sendmsg+0x2c/0x38
invoke_syscall+0x54/0x114
el0_svc_common.constprop.1+0x90/0x174
do_el0_svc+0x3c/0xb0
el0_svc+0x24/0xec
el0t_64_sync_handler+0x90/0xb4
el0t_64_sync+0x174/0x178

This is because when mq is added for the first time, qdiscs in mq is set
to NULL in mq_attach(). Therefore, when replacing mq after adding mq, we
need to initialize qdiscs in the mq before continuing to graft. Otherwise,
it will couse NULL pointer dereference issue in mq_attach(). And the same
issue will occur in the attach functions of mqprio, taprio and htb.
ffff:fff1 means that the repalce qdisc is ingress. Ingress does not allow
any qdisc to be attached. Therefore, ffff:fff1 is incorrectly used, and
the command should be dropped.

Fixes: 6ec1c69a8f64 ("net_sched: add classful multiqueue dummy scheduler")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Tested-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230527093747.3583502-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoMerge branch 'net-sched-fixes-for-sch_ingress-and-sch_clsact'
Jakub Kicinski [Wed, 31 May 2023 06:31:06 +0000 (23:31 -0700)]
Merge branch 'net-sched-fixes-for-sch_ingress-and-sch_clsact'

Peilin Ye says:

====================
net/sched: Fixes for sch_ingress and sch_clsact

These are v6 fixes for ingress and clsact Qdiscs, including only first 4
patches (already tested and reviewed) from v5.  Patch 5 and 6 from
previous versions are still under discussion and will be sent separately.

[a] https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b

Link to v5: https://lore.kernel.org/r/cover.1684887977.git.peilin.ye@bytedance.com/
Link to v4: https://lore.kernel.org/r/cover.1684825171.git.peilin.ye@bytedance.com/
Link to v3 (incomplete): https://lore.kernel.org/r/cover.1684821877.git.peilin.ye@bytedance.com/
Link to v2: https://lore.kernel.org/r/cover.1684796705.git.peilin.ye@bytedance.com/
Link to v1: https://lore.kernel.org/r/cover.1683326865.git.peilin.ye@bytedance.com/
====================

Link: https://lore.kernel.org/r/cover.1685388545.git.peilin.ye@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/sched: Prohibit regrafting ingress or clsact Qdiscs
Peilin Ye [Mon, 29 May 2023 19:54:26 +0000 (12:54 -0700)]
net/sched: Prohibit regrafting ingress or clsact Qdiscs

Currently, after creating an ingress (or clsact) Qdisc and grafting it
under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
e.g. a TBF Qdisc:

  $ ip link add ifb0 type ifb
  $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
  $ tc qdisc add dev ifb0 clsact
  $ tc qdisc link dev ifb0 handle ffff: parent 1:1
  $ tc qdisc show dev ifb0
  qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
  qdisc clsact ffff: parent ffff:fff1 refcnt 2
                                      ^^^^^^^^

clsact's refcount has increased: it is now grafted under both
TC_H_CLSACT and 1:1.

ingress and clsact Qdiscs should only be used under TC_H_INGRESS
(TC_H_CLSACT).  Prohibit regrafting them.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
Peilin Ye [Mon, 29 May 2023 19:54:03 +0000 (12:54 -0700)]
net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs

Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
(TC_H_INGRESS, TC_H_CLSACT):

  $ ip link add name ifb0 type ifb
  $ tc qdisc add dev ifb0 parent ffff:fff1 htb
  $ tc qdisc add dev ifb0 clsact
  Error: Exclusivity flag on, cannot modify.
  $ drgn
  ...
  >>> ifb0 = netdev_get_by_name(prog, "ifb0")
  >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
  >>> print(qdisc.ops.id.string_().decode())
  htb
  >>> qdisc.flags.value_() # TCQ_F_INGRESS
  2

Only allow ingress and clsact Qdiscs under ffff:fff1.  Return -EINVAL
for everything else.  Make TCQ_F_INGRESS a static flag of ingress and
clsact Qdiscs.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/sched: sch_clsact: Only create under TC_H_CLSACT
Peilin Ye [Mon, 29 May 2023 19:53:21 +0000 (12:53 -0700)]
net/sched: sch_clsact: Only create under TC_H_CLSACT

clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
equals TC_H_INGRESS).  Return -EOPNOTSUPP if 'parent' is not
TC_H_CLSACT.

Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet/sched: sch_ingress: Only create under TC_H_INGRESS
Peilin Ye [Mon, 29 May 2023 19:52:55 +0000 (12:52 -0700)]
net/sched: sch_ingress: Only create under TC_H_INGRESS

ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
Return -EOPNOTSUPP if 'parent' is not TC_H_INGRESS, similar to
mq_init().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonvme: improve handling of long keep alives
Uday Shankar [Thu, 25 May 2023 18:22:04 +0000 (12:22 -0600)]
nvme: improve handling of long keep alives

Upon keep alive completion, nvme_keep_alive_work is scheduled with the
same delay every time. If keep alive commands are completing slowly,
this may cause a keep alive timeout. The following trace illustrates the
issue, taking KATO = 8 and TBKAS off for simplicity:

1. t = 0: run nvme_keep_alive_work, send keep alive
2. t = ε: keep alive reaches controller, controller restarts its keep
          alive timer
3. t = 4: host receives keep alive completion, schedules
          nvme_keep_alive_work with delay 4
4. t = 8: run nvme_keep_alive_work, send keep alive

Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε
between the controller receiving successive keep alives. With ε small,
the controller is likely to detect a keep alive timeout.

Fix this by calculating the RTT of the keep alive command, and adjusting
the scheduling delay of the next keep alive work accordingly.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
16 months agoMerge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Tue, 30 May 2023 21:23:50 +0000 (17:23 -0400)]
Merge tag 'for-6.4-rc4-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "One bug fix and two build warning fixes:

   - call proper end bio callback for metadata RAID0 in a rare case of
     an unaligned block

   - fix uninitialized variable (reported by gcc 10.2)

   - fix warning about potential access beyond array bounds on mips64
     with 64k pages (runtime check would not allow that)"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
  btrfs: fix an uninitialized variable warning in btrfs_log_inode
  btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work

16 months agoMerge tag 'perf-tools-fixes-for-v6.4-2-2023-05-30' of git://git.kernel.org/pub/scm...
Linus Torvalds [Tue, 30 May 2023 21:12:13 +0000 (17:12 -0400)]
Merge tag 'perf-tools-fixes-for-v6.4-2-2023-05-30' of git://git./linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix BPF CO-RE naming convention for checking the availability of
   fields on 'union perf_mem_data_src' on the running kernel

 - Remove the use of llvm-strip on BPF skel object files, not needed,
   fixes a build breakage when the llvm package, that contains it in
   most distros, isn't installed

 - Fix tools that use both evsel->{bpf_counter_list,bpf_filters},
   removing them from a union

 - Remove extra "--" from the 'perf ftrace latency' --use-nsec option,
   previously it was working only when using the '-n' alternative

 - Don't stop building when both binutils-devel and a C++ compiler isn't
   available to compile the alternative C++ demangle support code,
   disable that feature instead

 - Sync the linux/in.h and coresight-pmu.h header copies with the kernel
   sources

 - Fix relative include path to cs-etm.h

* tag 'perf-tools-fixes-for-v6.4-2-2023-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf evsel: Separate bpf_counter_list and bpf_filters, can be used at the same time
  tools headers UAPI: Sync the linux/in.h with the kernel sources
  perf cs-etm: Copy kernel coresight-pmu.h header
  perf bpf: Do not use llvm-strip on BPF binary
  perf build: Don't compile demangle-cxx.cpp if not necessary
  perf arm: Fix include path to cs-etm.h
  perf bpf filter: Fix a broken perf sample data naming for BPF CO-RE
  perf ftrace latency: Remove unnecessary "--" from --use-nsec option

16 months agoMerge tag 'regmap-fix-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 30 May 2023 21:07:25 +0000 (17:07 -0400)]
Merge tag 'regmap-fix-v6.4-rc4' of git://git./linux/kernel/git/broonie/regmap

Pull regmap fixes from Mark Brown:
 "The most important fix here is for missing dropping of the RCU read
  lock when syncing maple tree register caches, the physical devices I
  have that use the code don't do any syncing so I'd only ever tested
  this with virtual devices and missed the fact that we need to drop the
  lock in order to write to buses that need to sleep.

  Otherwise there's a fix for an edge case when splitting up large batch
  writes which has been lurking for a long time, a check to make sure
  nobody writes new drivers with a bug that was found in several
  SoundWire drivers and a tweak to the way the new kunit tests are
  enabled to ensure they don't cause regmap to be enabled when it
  wouldn't otherwise be"

* tag 'regmap-fix-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: maple: Drop the RCU read lock while syncing registers
  regmap: sdw: check for invalid multi-register writes config
  regmap: Account for register length when chunking
  regmap: REGMAP_KUNIT should not select REGMAP

16 months agoMerge tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof...
Linus Torvalds [Tue, 30 May 2023 19:49:40 +0000 (15:49 -0400)]
Merge tag 'modules-6.4-rc5' of git://git./linux/kernel/git/mcgrof/linux

Pull modules fix from Luis Chamberlain:
 "A fix is provided for ia64. Even though ia64 is on life support it
  helps to fix issues if we can. Thanks to Linus for doing tons of the
  ia64 debugging"

* tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module: fix module load for ia64

16 months agoext4: enable the lazy init thread when remounting read/write
Theodore Ts'o [Sat, 27 May 2023 03:57:29 +0000 (23:57 -0400)]
ext4: enable the lazy init thread when remounting read/write

In commit a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting
r/w until quota is re-enabled") we defer clearing tyhe SB_RDONLY flag
in struct super.  However, we didn't defer when we checked sb_rdonly()
to determine the lazy itable init thread should be enabled, with the
next result that the lazy inode table initialization would not be
properly started.  This can cause generic/231 to fail in ext4's
nojournal mode.

Fix this by moving when we decide to start or stop the lazy itable
init thread to after we clear the SB_RDONLY flag when we are
remounting the file system read/write.

Fixes a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until...")

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230527035729.1001605-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoext4: fix fsync for non-directories
Jan Kara [Wed, 24 May 2023 10:44:53 +0000 (12:44 +0200)]
ext4: fix fsync for non-directories

Commit e360c6ed7274 ("ext4: Drop special handling of journalled data
from ext4_sync_file()") simplified ext4_sync_file() by dropping special
handling of journalled data mode as it was not needed anymore. However
that branch was also used for directories and symlinks and since the
fastcommit code does not track metadata changes to non-regular files, the
change has caused e.g. fsync(2) on directories to not commit transaction
as it should. Fix the problem by adding handling for non-regular files.

Fixes: e360c6ed7274 ("ext4: Drop special handling of journalled data from ext4_sync_file()")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/all/ZFqO3xVnmhL7zv1x@debian-BULLSEYE-live-builder-AMD64
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230524104453.8734-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoext4: add lockdep annotations for i_data_sem for ea_inode's
Theodore Ts'o [Wed, 24 May 2023 03:49:51 +0000 (23:49 -0400)]
ext4: add lockdep annotations for i_data_sem for ea_inode's

Treat i_data_sem for ea_inodes as being in their own lockdep class to
avoid lockdep complaints about ext4_setattr's use of inode_lock() on
normal inodes potentially causing lock ordering with i_data_sem on
ea_inodes in ext4_xattr_inode_write().  However, ea_inodes will be
operated on by ext4_setattr(), so this isn't a problem.

Cc: stable@kernel.org
Link: https://syzkaller.appspot.com/bug?extid=298c5d8fb4a128bc27b0
Reported-by: syzbot+298c5d8fb4a128bc27b0@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-5-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoext4: disallow ea_inodes with extended attributes
Theodore Ts'o [Wed, 24 May 2023 03:49:50 +0000 (23:49 -0400)]
ext4: disallow ea_inodes with extended attributes

An ea_inode stores the value of an extended attribute; it can not have
extended attributes itself, or this will cause recursive nightmares.
Add a check in ext4_iget() to make sure this is the case.

Cc: stable@kernel.org
Reported-by: syzbot+e44749b6ba4d0434cd47@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-4-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()
Theodore Ts'o [Wed, 24 May 2023 03:49:49 +0000 (23:49 -0400)]
ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()

If the ea_inode has been pushed out of the inode cache while there is
still a reference in the mb_cache, the lockdep subclass will not be
set on the inode, which can lead to some lockdep false positives.

Fixes: 33d201e0277b ("ext4: fix lockdep warning about recursive inode locking")
Cc: stable@kernel.org
Reported-by: syzbot+d4b971e744b1f5439336@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-3-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agofbdev: bw2: Convert to platform remove callback returning void
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:43 +0000 (00:53 +0100)]
fbdev: bw2: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agofbdev: broadsheetfb: Convert to platform remove callback returning void
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:42 +0000 (00:53 +0100)]
fbdev: broadsheetfb: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agofbdev: au1200fb: Convert to platform remove callback returning void
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:41 +0000 (00:53 +0100)]
fbdev: au1200fb: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agofbdev: au1100fb: Convert to platform remove callback returning void
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:40 +0000 (00:53 +0100)]
fbdev: au1100fb: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agofbdev: arcfb: Convert to platform remove callback returning void
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:39 +0000 (00:53 +0100)]
fbdev: arcfb: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agofbdev: au1100fb: Drop if with an always false condition
Uwe Kleine-König [Sat, 18 Mar 2023 23:53:38 +0000 (00:53 +0100)]
fbdev: au1100fb: Drop if with an always false condition

The driver core never calls a remove callback with the platform_device
pointer being NULL. So the check for this condition can just be dropped.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>
16 months agomodule: fix module load for ia64
Song Liu [Sun, 28 May 2023 23:00:41 +0000 (16:00 -0700)]
module: fix module load for ia64

Frank reported boot regression in ia64 as:

ELILO v3.16 for EFI/IA-64
..
Uncompressing Linux... done
Loading file AC100221.initrd.img...done
[    0.000000] Linux version 6.4.0-rc3 (root@x4270) (ia64-linux-gcc
(GCC) 12.2.0, GNU ld (GNU Binutils) 2.39) #1 SMP Thu May 25 15:52:20
CEST 2023
[    0.000000] efi: EFI v1.1 by HP
[    0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fe2a000
ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fe28000
[    0.000000] PCDP: v3 at 0x3fe28000
[    0.000000] earlycon: uart8250 at MMIO 0x00000000f4050000 (options
'9600n8')
[    0.000000] printk: bootconsole [uart8250] enabled
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x000000003FE2A000 000028 (v02 HP    )
[    0.000000] ACPI: XSDT 0x000000003FE2A02C 0000CC (v01 HP     rx2620
00000000 HP   00000000)
[...]
[    3.793350] Run /init as init process
Loading, please wait...
Starting systemd-udevd version 252.6-1
[    3.951100] ------------[ cut here ]------------
[    3.951100] WARNING: CPU: 6 PID: 140 at kernel/module/main.c:1547
__layout_sections+0x370/0x3c0
[    3.949512] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.951100] Modules linked in:
[    3.951100] CPU: 6 PID: 140 Comm: (udev-worker) Not tainted 6.4.0-rc3 #1
[    3.956161] (udev-worker)[142]: Oops 11003706212352 [1]
[    3.951774] Hardware name: hp server rx2620                   , BIOS
04.29
11/30/2007
[    3.951774]
[    3.951774] Call Trace:
[    3.958339] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.956161] Modules linked in:
[    3.951774]  [<a0000001000156d0>] show_stack.part.0+0x30/0x60
[    3.951774]                                 sp=e000000183a67b20
bsp=e000000183a61628
[    3.956161]
[    3.956161]

which bisect to module_memory change [1].

Debug showed that ia64 uses some special sections:

__layout_sections: section .got (sh_flags 10000002) matched to MOD_INVALID
__layout_sections: section .sdata (sh_flags 10000003) matched to MOD_INVALID
__layout_sections: section .sbss (sh_flags 10000003) matched to MOD_INVALID

All these sections are loaded to module core memory before [1].

Fix ia64 boot by loading these sections to MOD_DATA (core rw data).

[1] commit ac3b43283923 ("module: replace module_layout with module_memory")

Fixes: ac3b43283923 ("module: replace module_layout with module_memory")
Reported-by: Frank Scheiner <frank.scheiner@web.de>
Closes: https://lists.debian.org/debian-ia64/2023/05/msg00010.html
Closes: https://marc.info/?l=linux-ia64&m=168509859125505
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
16 months agonvme: check IO start time when deciding to defer KA
Uday Shankar [Thu, 25 May 2023 18:22:03 +0000 (12:22 -0600)]
nvme: check IO start time when deciding to defer KA

When a command completes, we set a flag which will skip sending a
keep alive at the next run of nvme_keep_alive_work when TBKAS is on.
However, if the command was submitted long ago, it's possible that
the controller may have also restarted its keep alive timer (as a
result of receiving the command) long ago. The following trace
demonstrates the issue, assuming TBKAS is on and KATO = 8 for
simplicity:

1. t = 0: submit I/O commands A, B, C, D, E
2. t = 0.5: commands A, B, C, D, E reach controller, restart its keep
            alive timer
3. t = 1: A completes
4. t = 2: run nvme_keep_alive_work, see recent completion, do nothing
5. t = 3: B completes
6. t = 4: run nvme_keep_alive_work, see recent completion, do nothing
7. t = 5: C completes
8. t = 6: run nvme_keep_alive_work, see recent completion, do nothing
9. t = 7: D completes
10. t = 8: run nvme_keep_alive_work, see recent completion, do nothing
11. t = 9: E completes

At this point, 8.5 seconds have passed without restarting the
controller's keep alive timer, so the controller will detect a keep
alive timeout.

Fix this by checking the IO start time when deciding to defer sending a
keep alive command. Only set comp_seen if the command started after the
most recent run of nvme_keep_alive_work. With this change, the
completions of B, C, and D will not set comp_seen and the run of
nvme_keep_alive_work at t = 4 will send a keep alive.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
16 months agonvme: double KA polling frequency to avoid KATO with TBKAS on
Uday Shankar [Thu, 25 May 2023 18:22:02 +0000 (12:22 -0600)]
nvme: double KA polling frequency to avoid KATO with TBKAS on

With TBKAS on, the completion of one command can defer sending a
keep alive for up to twice the delay between successive runs of
nvme_keep_alive_work. The current delay of KATO / 2 thus makes it
possible for one command to defer sending a keep alive for up to
KATO, which can result in the controller detecting a KATO. The following
trace demonstrates the issue, taking KATO = 8 for simplicity:

1. t = 0: run nvme_keep_alive_work, no keep-alive sent
2. t = ε: I/O completion seen, set comp_seen = true
3. t = 4: run nvme_keep_alive_work, see comp_seen == true,
          skip sending keep-alive, set comp_seen = false
4. t = 8: run nvme_keep_alive_work, see comp_seen == false,
          send a keep-alive command.

Here, there is a delay of 8 - ε between receiving a command completion
and sending the next command. With ε small, the controller is likely to
detect a keep alive timeout.

Fix this by running nvme_keep_alive_work with a delay of KATO / 4
whenever TBKAS is on. Going through the above trace now gives us a
worst-case delay of 4 - ε, which is in line with the recommendation of
sending a command every KATO / 2 in the NVMe specification.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
16 months agonvme: fix miss command type check
min15.li [Fri, 26 May 2023 17:06:56 +0000 (17:06 +0000)]
nvme: fix miss command type check

In the function nvme_passthru_end(), only the value of the command
opcode is checked, without checking the command type (IO command or
Admin command). When we send a Dataset Management command (The opcode
of the Dataset Management command is the same as the Set Feature
command), kernel thinks it is a set feature command, then sets the
controller's keep alive interval, and calls nvme_keep_alive_work().

Signed-off-by: min15.li <min15.li@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
16 months agousb: typec: tps6598x: Fix broken polling mode after system suspend/resume
Roger Quadros [Tue, 30 May 2023 06:59:26 +0000 (09:59 +0300)]
usb: typec: tps6598x: Fix broken polling mode after system suspend/resume

During system resume we need to resume the polling workqueue
if client->irq is not set else polling will no longer work.

Fixes: 0d6a119cecd7 ("usb: typec: tps6598x: Add support for polling interrupts status")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20230530065926.6161-1-rogerq@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agoserial: cpm_uart: Fix a COMPILE_TEST dependency
Herve Codina [Tue, 23 May 2023 08:59:02 +0000 (10:59 +0200)]
serial: cpm_uart: Fix a COMPILE_TEST dependency

In a COMPILE_TEST configuration, the cpm_uart driver uses symbols from
the cpm_uart_cpm2.c file. This file is compiled only when CONFIG_CPM2 is
set.

Without this dependency, the linker fails with some missing symbols for
COMPILE_TEST configuration that needs SERIAL_CPM without enabling CPM2.

This lead to:
  depends on CPM2 || CPM1 || (PPC32 && CPM2 && COMPILE_TEST)

This dependency does not make sense anymore and can be simplified
removing all the COMPILE_TEST part.

Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305160221.9XgweObz-lkp@intel.com/
Fixes: e3e7b13bffae ("serial: allow COMPILE_TEST for some drivers")
Cc: stable <stable@kernel.org>
Link: https://lore.kernel.org/r/20230523085902.75837-3-herve.codina@bootlin.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agosoc: fsl: cpm1: Fix TSA and QMC dependencies in case of COMPILE_TEST
Herve Codina [Tue, 23 May 2023 08:59:01 +0000 (10:59 +0200)]
soc: fsl: cpm1: Fix TSA and QMC dependencies in case of COMPILE_TEST

In order to compile tsa.c and qmc.c, CONFIG_CPM must be set.

Without this dependency, the linker fails with some missing
symbols for COMPILE_TEST configurations that need QMC without
enabling CPM.

Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305160221.9XgweObz-lkp@intel.com/
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20230523085902.75837-2-herve.codina@bootlin.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agotty: serial: fsl_lpuart: use UARTCTRL_TXINV to send break instead of UARTCTRL_SBK
Sherry Sun [Fri, 19 May 2023 09:47:51 +0000 (17:47 +0800)]
tty: serial: fsl_lpuart: use UARTCTRL_TXINV to send break instead of UARTCTRL_SBK

LPUART IP now has two known bugs, one is that CTS has higher priority
than the break signal, which causes the break signal sending through
UARTCTRL_SBK may impacted by the CTS input if the HW flow control is
enabled. It exists on all platforms we support in this driver.
So we add a workaround patch for this issue: commit c4c81db5cf8b
("tty: serial: fsl_lpuart: disable the CTS when send break signal").

Another IP bug is i.MX8QM LPUART may have an additional break character
being sent after SBK was cleared. It may need to add some delay between
clearing SBK and re-enabling CTS to ensure that the SBK latch are
completely cleared.

But we found that during the delay period before CTS is enabled, there
is still a risk that Bluetooth data in TX FIFO may be sent out during
this period because of break off and CTS disabled(even if BT sets CTS
line deasserted, data is still sent to BT).

Due to this risk, we have to drop the CTS-disabling workaround for SBK
bugs, use TXINV seems to be a better way to replace SBK feature and
avoid above risk. Also need to disable the transmitter to prevent any
data from being sent out during break, then invert the TX line to send
break. Then disable the TXINV when turn off break and re-enable
transmitter.

Fixes: c4c81db5cf8b ("tty: serial: fsl_lpuart: disable the CTS when send break signal")
Cc: stable <stable@kernel.org>
Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Link: https://lore.kernel.org/r/20230519094751.28948-1-sherry.sun@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agoserial: 8250_tegra: Fix an error handling path in tegra_uart_probe()
Christophe JAILLET [Sun, 14 May 2023 11:25:42 +0000 (13:25 +0200)]
serial: 8250_tegra: Fix an error handling path in tegra_uart_probe()

If an error occurs after reset_control_deassert(), it must be re-asserted,
as already done in the .remove() function.

Fixes: c6825c6395b7 ("serial: 8250_tegra: Create Tegra specific 8250 driver")
Cc: stable <stable@kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/f8130f35339cc80edc6b9aac4bb2a60b60a226bf.1684063511.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agoMerge branch 'selftests-mptcp-skip-tests-not-supported-by-old-kernels-part-1'
Paolo Abeni [Tue, 30 May 2023 11:21:05 +0000 (13:21 +0200)]
Merge branch 'selftests-mptcp-skip-tests-not-supported-by-old-kernels-part-1'

Matthieu Baerts says:

====================
selftests: mptcp: skip tests not supported by old kernels (part 1)

After a few years of increasing test coverage in the MPTCP selftests, we
realised [1] the last version of the selftests is supposed to run on old
kernels without issues.

Supporting older versions is not that easy for this MPTCP case: these
selftests are often validating the internals by checking packets that
are exchanged, when some MIB counters are incremented after some
actions, how connections are getting opened and closed in some cases,
etc. In other words, it is not limited to the socket interface between
the userspace and the kernelspace. In addition, the current selftests
run a lot of different sub-tests but the TAP13 protocol used in the
selftests don't support sub-tests: in other words, one failure in
sub-tests implies that the whole selftest is seen as failed at the end
because sub-tests are not tracked. It is then important to skip
sub-tests not supported by old kernels.

To minimise the modifications and reduce the complexity to support old
versions, the idea is to look at external signs and skip the whole
selftests or just some sub-tests before starting them.

This first part focuses on marking the different selftests as skipped
if MPTCP is not even supported. That's what is done in patches 2 to 8.
Patch 2/8 introduces a new file (mptcp_lib.sh) to be able to re-use some
helpers in the different selftests. The first MPTCP selftest has been
introduced in v5.6.

Patch 1/8 is a bit different but still linked: it modifies mptcp_join.sh
selftest not to use 'cmp --bytes' which is not supported by the BusyBox
implementation. It is apparently quite common to use BusyBox in CI
environments. This tool is needed for a subtest introduced in v6.1.

Link: https://lore.kernel.org/stable/CA+G9fYtDGpgT4dckXD-y-N92nqUxuvue_7AtDdBcHrbOMsDZLg@mail.gmail.com/
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
====================

Link: https://lore.kernel.org/r/20230528-upstream-net-20230528-mptcp-selftests-support-old-kernels-part-1-v1-0-a32d85577fc6@tessares.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: userspace pm: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:33 +0000 (19:35 +0200)]
selftests: mptcp: userspace pm: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 259a834fadda ("selftests: mptcp: functional tests for the userspace PM type")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: sockopt: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:32 +0000 (19:35 +0200)]
selftests: mptcp: sockopt: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: dc65fe82fb07 ("selftests: mptcp: add packet mark test case")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: simult flows: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:31 +0000 (19:35 +0200)]
selftests: mptcp: simult flows: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 1a418cb8e888 ("mptcp: simult flow self-tests")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: diag: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:30 +0000 (19:35 +0200)]
selftests: mptcp: diag: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: df62f2ec3df6 ("selftests/mptcp: add diag interface tests")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: join: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:29 +0000 (19:35 +0200)]
selftests: mptcp: join: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: b08fbf241064 ("selftests: add test-cases for MPTCP MP_JOIN")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: pm nl: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:28 +0000 (19:35 +0200)]
selftests: mptcp: pm nl: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped".

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: eedbc685321b ("selftests: add PM netlink functional tests")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: connect: skip if MPTCP is not supported
Matthieu Baerts [Sun, 28 May 2023 17:35:27 +0000 (19:35 +0200)]
selftests: mptcp: connect: skip if MPTCP is not supported

Selftests are supposed to run on any kernels, including the old ones not
supporting MPTCP.

A new check is then added to make sure MPTCP is supported. If not, the
test stops and is marked as "skipped". Note that this check can also
mark the test as failed if 'SELFTESTS_MPTCP_LIB_EXPECT_ALL_FEATURES' env
var is set to 1: by doing that, we can make sure a test is not being
skipped by mistake.

A new shared file is added here to be able to re-used the same check in
the different selftests we have.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Fixes: 048d19d444be ("mptcp: add basic kselftest for mptcp")
Cc: stable@vger.kernel.org
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoselftests: mptcp: join: avoid using 'cmp --bytes'
Matthieu Baerts [Sun, 28 May 2023 17:35:26 +0000 (19:35 +0200)]
selftests: mptcp: join: avoid using 'cmp --bytes'

BusyBox's 'cmp' command doesn't support the '--bytes' parameter.

Some CIs -- i.e. LKFT -- use BusyBox and have the mptcp_join.sh test
failing [1] because their 'cmp' command doesn't support this '--bytes'
option:

    cmp: unrecognized option '--bytes=1024'
    BusyBox v1.35.0 () multi-call binary.

    Usage: cmp [-ls] [-n NUM] FILE1 [FILE2]

Instead, 'head --bytes' can be used as this option is supported by
BusyBox. A temporary file is needed for this operation.

Because it is apparently quite common to use BusyBox, it is certainly
better to backport this fix to impacted kernels.

Fixes: 6bf41020b72b ("selftests: mptcp: update and extend fastclose test-cases")
Cc: stable@vger.kernel.org
Link: https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.3-rc5-5-g148341f0a2f5/testrun/16088933/suite/kselftest-net-mptcp/test/net_mptcp_userspace_pm_sh/log
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet: mana: Fix perf regression: remove rx_cqes, tx_cqes counters
Haiyang Zhang [Fri, 26 May 2023 15:38:57 +0000 (08:38 -0700)]
net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters

The apc->eth_stats.rx_cqes is one per NIC (vport), and it's on the
frequent and parallel code path of all queues. So, r/w into this
single shared variable by many threads on different CPUs creates a
lot caching and memory overhead, hence perf regression. And, it's
not accurate due to the high volume concurrent r/w.

For example, a workload is iperf with 128 threads, and with RPS
enabled. We saw perf regression of 25% with the previous patch
adding the counters. And this patch eliminates the regression.

Since the error path of mana_poll_rx_cq() already has warnings, so
keeping the counter and convert it to a per-queue variable is not
necessary. So, just remove this counter from this high frequency
code path.

Also, remove the tx_cqes counter for the same reason. We have
warnings & other counters for errors on that path, and don't need
to count every normal cqe processing.

Cc: stable@vger.kernel.org
Fixes: bd7fc6e1957c ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1685115537-31675-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoMerge branch 'two-fixes-for-smcrv2'
Paolo Abeni [Tue, 30 May 2023 09:26:33 +0000 (11:26 +0200)]
Merge branch 'two-fixes-for-smcrv2'

Wen Gu says:

====================
Two fixes for SMCRv2

This patch set includes two bugfix for SMCRv2.
====================

Link: https://lore.kernel.org/r/1685101741-74826-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet/smc: Don't use RMBs not mapped to new link in SMCRv2 ADD LINK
Wen Gu [Fri, 26 May 2023 11:49:01 +0000 (19:49 +0800)]
net/smc: Don't use RMBs not mapped to new link in SMCRv2 ADD LINK

We encountered a crash when using SMCRv2. It is caused by a logical
error in smc_llc_fill_ext_v2().

 BUG: kernel NULL pointer dereference, address: 0000000000000014
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 7 PID: 453 Comm: kworker/7:4 Kdump: loaded Tainted: G        W   E      6.4.0-rc3+ #44
 Workqueue: events smc_llc_add_link_work [smc]
 RIP: 0010:smc_llc_fill_ext_v2+0x117/0x280 [smc]
 RSP: 0018:ffffacb5c064bd88 EFLAGS: 00010282
 RAX: ffff9a6bc1c3c02c RBX: ffff9a6be3558000 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000000a
 RBP: ffffacb5c064bdb8 R08: 0000000000000040 R09: 000000000000000c
 R10: ffff9a6bc0910300 R11: 0000000000000002 R12: 0000000000000000
 R13: 0000000000000002 R14: ffff9a6bc1c3c02c R15: ffff9a6be3558250
 FS:  0000000000000000(0000) GS:ffff9a6eefdc0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000014 CR3: 000000010b078003 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  smc_llc_send_add_link+0x1ae/0x2f0 [smc]
  smc_llc_srv_add_link+0x2c9/0x5a0 [smc]
  ? cc_mkenc+0x40/0x60
  smc_llc_add_link_work+0xb8/0x140 [smc]
  process_one_work+0x1e5/0x3f0
  worker_thread+0x4d/0x2f0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xe5/0x120
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2c/0x50
  </TASK>

When an alernate RNIC is available in system, SMC will try to add a new
link based on the RNIC for resilience. All the RMBs in use will be mapped
to the new link. Then the RMBs' MRs corresponding to the new link will be
filled into SMCRv2 LLC ADD LINK messages.

However, smc_llc_fill_ext_v2() mistakenly accesses to unused RMBs which
haven't been mapped to the new link and have no valid MRs, thus causing
a crash. So this patch fixes the logic.

Fixes: b4ba4652b3f8 ("net/smc: extend LLC layer for SMC-Rv2")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agonet/smc: Scan from current RMB list when no position specified
Wen Gu [Fri, 26 May 2023 11:49:00 +0000 (19:49 +0800)]
net/smc: Scan from current RMB list when no position specified

When finding the first RMB of link group, it should start from the
current RMB list whose index is 0. So fix it.

Fixes: b4ba4652b3f8 ("net/smc: extend LLC layer for SMC-Rv2")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agorxrpc: Truncate UTS_RELEASE for rxrpc version
David Howells [Fri, 26 May 2023 11:34:54 +0000 (12:34 +0100)]
rxrpc: Truncate UTS_RELEASE for rxrpc version

UTS_RELEASE has a maximum length of 64 which can cause rxrpc_version to
exceed the 65 byte message limit.

Per the rx spec[1]: "If a server receives a packet with a type value of 13,
and the client-initiated flag set, it should respond with a 65-byte payload
containing a string that identifies the version of AFS software it is
running."

The current implementation causes a compile error when WERROR is turned on
and/or UTS_RELEASE exceeds the length of 49 (making the version string more
than 64 characters).

Fix this by generating the string during module initialisation and limiting
the UTS_RELEASE segment of the string does not exceed 49 chars.  We need to
make sure that the 64 bytes includes "linux-" at the front and " AF_RXRPC"
at the back as this may be used in pattern matching.

Fixes: 44ba06987c0b ("RxRPC: Handle VERSION Rx protocol packets")
Reported-by: Kenny Ho <Kenny.Ho@amd.com>
Link: https://lore.kernel.org/r/20230523223944.691076-1-Kenny.Ho@amd.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Kenny Ho <Kenny.Ho@amd.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Lunn <andrew@lunn.ch>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Link: https://web.mit.edu/kolya/afs/rx/rx-spec
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Link: https://lore.kernel.org/r/654974.1685100894@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
16 months agoKVM: arm64: Populate fault info for watchpoint
Akihiko Odaki [Tue, 30 May 2023 02:46:51 +0000 (11:46 +0900)]
KVM: arm64: Populate fault info for watchpoint

When handling ESR_ELx_EC_WATCHPT_LOW, far_el2 member of struct
kvm_vcpu_fault_info will be copied to far member of struct
kvm_debug_exit_arch and exposed to the userspace. The userspace will
see stale values from older faults if the fault info does not get
populated.

Fixes: 8fb2046180a0 ("KVM: arm64: Move early handlers to per-EC handlers")
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530024651.10014-1-akihiko.odaki@daynix.com
Cc: stable@vger.kernel.org
16 months agopowerpc/xmon: Use KSYM_NAME_LEN in array size
Maninder Singh [Mon, 29 May 2023 11:13:37 +0000 (16:43 +0530)]
powerpc/xmon: Use KSYM_NAME_LEN in array size

kallsyms_lookup() which in turn calls kallsyms_lookup_buildid() writes
to index "KSYM_NAME_LEN - 1".

Thus the array passed as namebuf to kallsyms_lookup() should be
KSYM_NAME_LEN in size.

In xmon.c the array was defined to be "128" bytes directly, without
using KSYM_NAME_LEN. Commit b8a94bfb3395 ("kallsyms: increase maximum
kernel symbol length to 512") changed the value to 512, but missed
updating the xmon code.

Fixes: b8a94bfb3395 ("kallsyms: increase maximum kernel symbol length to 512")
Cc: stable@vger.kernel.org # v6.1+
Co-developed-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
[mpe: Tweak change log wording and fix commit reference]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230529111337.352990-2-maninder1.s@samsung.com
16 months agopowerpc/iommu: Limit number of TCEs to 512 for H_STUFF_TCE hcall
Gaurav Batra [Thu, 25 May 2023 14:34:54 +0000 (09:34 -0500)]
powerpc/iommu: Limit number of TCEs to 512 for H_STUFF_TCE hcall

Currently in tce_freemulti_pSeriesLP() there is no limit on how many
TCEs are passed to the H_STUFF_TCE hcall. This has not caused an issue
until now, but newer firmware releases have started enforcing a limit of
512 TCEs per call.

The limit is correct per the specification (PAPR v2.12 § 14.5.4.2.3).

The code has been in it's current form since it was initially merged.

Cc: stable@vger.kernel.org
Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
[mpe: Tweak change log wording & add PAPR reference]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230525143454.56878-1-gbatra@linux.vnet.ibm.com
16 months agopowerpc/crypto: Fix aes-gcm-p10 link errors
Michael Ellerman [Thu, 25 May 2023 02:43:21 +0000 (12:43 +1000)]
powerpc/crypto: Fix aes-gcm-p10 link errors

The recently added P10 AES/GCM code added some files containing
CRYPTOGAMS perl-asm code which are near duplicates of the p8 files
found in drivers/crypto/vmx.

In particular the newly added files produce functions with identical
names to the existing code.

When the kernel is built with CONFIG_CRYPTO_AES_GCM_P10=y and
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y that leads to link errors, eg:

  ld: drivers/crypto/vmx/aesp8-ppc.o: in function `aes_p8_set_encrypt_key':
  (.text+0xa0): multiple definition of `aes_p8_set_encrypt_key'; arch/powerpc/crypto/aesp8-ppc.o:(.text+0xa0): first defined here
  ...
  ld: drivers/crypto/vmx/ghashp8-ppc.o: in function `gcm_ghash_p8':
  (.text+0x140): multiple definition of `gcm_ghash_p8'; arch/powerpc/crypto/ghashp8-ppc.o:(.text+0x2e4): first defined here

Fix it for now by renaming the newly added files and functions to use
"p10" instead of "p8" in the names.

Fixes: 45a4672b9a6e ("crypto: p10-aes-gcm - Update Kconfig and Makefile")
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230525150501.37081-1-mpe@ellerman.id.au
16 months agotcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set
Cambda Zhu [Sat, 27 May 2023 04:03:17 +0000 (12:03 +0800)]
tcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set

This patch replaces the tp->mss_cache check in getting TCP_MAXSEG
with tp->rx_opt.user_mss check for CLOSE/LISTEN sock. Since
tp->mss_cache is initialized with TCP_MSS_DEFAULT, checking if
it's zero is probably a bug.

With this change, getting TCP_MAXSEG before connecting will return
default MSS normally, and return user_mss if user_mss is set.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Jack Yang <mingliang@linux.alibaba.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89i+3kL9pYtkxkwxwNMzvC_w3LNUum_2=3u+UyLBmGmifHA@mail.gmail.com/#t
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Link: https://lore.kernel.org/netdev/14D45862-36EA-4076-974C-EA67513C92F6@linux.alibaba.com/
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230527040317.68247-1-cambda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agotcp: deny tcp_disconnect() when threads are waiting
Eric Dumazet [Fri, 26 May 2023 16:34:58 +0000 (16:34 +0000)]
tcp: deny tcp_disconnect() when threads are waiting

Historically connect(AF_UNSPEC) has been abused by syzkaller
and other fuzzers to trigger various bugs.

A recent one triggers a divide-by-zero [1], and Paolo Abeni
was able to diagnose the issue.

tcp_recvmsg_locked() has tests about sk_state being not TCP_LISTEN
and TCP REPAIR mode being not used.

Then later if socket lock is released in sk_wait_data(),
another thread can call connect(AF_UNSPEC), then make this
socket a TCP listener.

When recvmsg() is resumed, it can eventually call tcp_cleanup_rbuf()
and attempt a divide by 0 in tcp_rcv_space_adjust() [1]

This patch adds a new socket field, counting number of threads
blocked in sk_wait_event() and inet_wait_for_connect().

If this counter is not zero, tcp_disconnect() returns an error.

This patch adds code in blocking socket system calls, thus should
not hurt performance of non blocking ones.

Note that we probably could revert commit 499350a5a6e7 ("tcp:
initialize rcv_mss to TCP_MIN_MSS instead of 0") to restore
original tcpi_rcv_mss meaning (was 0 if no payload was ever
received on a socket)

[1]
divide error: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 13832 Comm: syz-executor.5 Not tainted 6.3.0-rc4-syzkaller-00224-g00c7b5f4ddc5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
RIP: 0010:tcp_rcv_space_adjust+0x36e/0x9d0 net/ipv4/tcp_input.c:740
Code: 00 00 00 00 fc ff df 4c 89 64 24 48 8b 44 24 04 44 89 f9 41 81 c7 80 03 00 00 c1 e1 04 44 29 f0 48 63 c9 48 01 e9 48 0f af c1 <49> f7 f6 48 8d 04 41 48 89 44 24 40 48 8b 44 24 30 48 c1 e8 03 48
RSP: 0018:ffffc900033af660 EFLAGS: 00010206
RAX: 4a66b76cbade2c48 RBX: ffff888076640cc0 RCX: 00000000c334e4ac
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000001
RBP: 00000000c324e86c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880766417f8
R13: ffff888028fbb980 R14: 0000000000000000 R15: 0000000000010344
FS: 00007f5bffbfe700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32f25000 CR3: 000000007ced0000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tcp_recvmsg_locked+0x100e/0x22e0 net/ipv4/tcp.c:2616
tcp_recvmsg+0x117/0x620 net/ipv4/tcp.c:2681
inet6_recvmsg+0x114/0x640 net/ipv6/af_inet6.c:670
sock_recvmsg_nosec net/socket.c:1017 [inline]
sock_recvmsg+0xe2/0x160 net/socket.c:1038
____sys_recvmsg+0x210/0x5a0 net/socket.c:2720
___sys_recvmsg+0xf2/0x180 net/socket.c:2762
do_recvmmsg+0x25e/0x6e0 net/socket.c:2856
__sys_recvmmsg net/socket.c:2935 [inline]
__do_sys_recvmmsg net/socket.c:2958 [inline]
__se_sys_recvmmsg net/socket.c:2951 [inline]
__x64_sys_recvmmsg+0x20f/0x260 net/socket.c:2951
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f5c0108c0f9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5bffbfe168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00007f5c011ac050 RCX: 00007f5c0108c0f9
RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000003
RBP: 00007f5c010e7b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5c012cfb1f R14: 00007f5bffbfe300 R15: 0000000000022000
</TASK>

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Diagnosed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20230526163458.2880232-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoaf_packet: do not use READ_ONCE() in packet_bind()
Eric Dumazet [Fri, 26 May 2023 15:43:42 +0000 (15:43 +0000)]
af_packet: do not use READ_ONCE() in packet_bind()

A recent patch added READ_ONCE() in packet_bind() and packet_bind_spkt()

This is better handled by reading pkt_sk(sk)->num later
in packet_do_bind() while appropriate lock is held.

READ_ONCE() in writers are often an evidence of something being wrong.

Fixes: 822b5a1c17df ("af_packet: Fix data-races of pkt_sk(sk)->num.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230526154342.2533026-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonetlink: specs: correct types of legacy arrays
Jakub Kicinski [Fri, 26 May 2023 22:06:53 +0000 (15:06 -0700)]
netlink: specs: correct types of legacy arrays

ethtool has some attrs which dump multiple scalars into
an attribute. The spec currently expects one attr per entry.

Fixes: a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230526220653.65538-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agonet: usb: qmi_wwan: Set DTR quirk for BroadMobi BM818
Sebastian Krzyszkowiak [Fri, 26 May 2023 14:38:11 +0000 (16:38 +0200)]
net: usb: qmi_wwan: Set DTR quirk for BroadMobi BM818

BM818 is based on Qualcomm MDM9607 chipset.

Fixes: 9a07406b00cd ("net: usb: qmi_wwan: Add the BroadMobi BM818 card")
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Krzyszkowiak <sebastian.krzyszkowiak@puri.sm>
Acked-by: Bjørn Mork <bjorn@mork.no>
Link: https://lore.kernel.org/r/20230526-bm818-dtr-v1-1-64bbfa6ba8af@puri.sm
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
16 months agoata: libata-scsi: Use correct device no in ata_find_dev()
Damien Le Moal [Mon, 22 May 2023 11:09:57 +0000 (20:09 +0900)]
ata: libata-scsi: Use correct device no in ata_find_dev()

For devices not attached to a port multiplier and managed directly by
libata, the device number passed to ata_find_dev() must always be lower
than the maximum number of devices returned by ata_link_max_devices().
That is 1 for SATA devices or 2 for an IDE link with master+slave
devices. This device number is the SCSI device ID which matches these
constraints as the IDs are generated per port and so never exceed the
maximum number of devices for the link being used.

However, for libsas managed devices, SCSI device IDs are assigned per
struct scsi_host, leading to device IDs for SATA devices that can be
well in excess of libata per-link maximum number of devices. This
results in ata_find_dev() to always return NULL for libsas managed
devices except for the first device of the target scsi_host with ID
(device number) equal to 0. This issue is visible by executing the
hdparm utility, which fails. E.g.:

hdparm -i /dev/sdX
/dev/sdX:
  HDIO_GET_IDENTITY failed: No message of desired type

Fix this by rewriting ata_find_dev() to ignore the device number for
non-PMP attached devices with a link with at most 1 device, that is SATA
devices. For these, the device number 0 is always used to
return the correct pointer to the struct ata_device of the port link.
This change excludes IDE master/slave setups (maximum number of devices
per link is 2) and port-multiplier attached devices. Also, to be
consistant with the fact that SCSI device IDs and channel numbers used
as device numbers are both unsigned int, change the devno argument of
ata_find_dev() to unsigned int.

Reported-by: Xingui Yang <yangxingui@huawei.com>
Fixes: 41bda9c98035 ("libata-link: update hotplug to handle PMP links")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
16 months agoRDMA/irdma: Fix Local Invalidate fencing
Mustafa Ismail [Mon, 22 May 2023 15:56:54 +0000 (10:56 -0500)]
RDMA/irdma: Fix Local Invalidate fencing

If the local invalidate fence is indicated in the WR, only the read fence
is currently being set in WQE. Fix this to set both the read and local
fence in the WQE.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20230522155654.1309-4-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
16 months agoRDMA/irdma: Prevent QP use after free
Mustafa Ismail [Mon, 22 May 2023 15:56:53 +0000 (10:56 -0500)]
RDMA/irdma: Prevent QP use after free

There is a window where the poll cq may use a QP that has been freed.
This can happen if a CQE is polled before irdma_clean_cqes() can clear the
CQE's related to the QP and the destroy QP races to free the QP memory.
then the QP structures are used in irdma_poll_cq.  Fix this by moving the
clearing of CQE's before the reference is removed and the QP is destroyed.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20230522155654.1309-3-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
16 months agoMAINTAINERS: Update maintainer of Amazon EFA driver
Michael Margolin [Thu, 25 May 2023 09:44:44 +0000 (09:44 +0000)]
MAINTAINERS: Update maintainer of Amazon EFA driver

Change EFA driver maintainer from Gal Pressman to myself. Keep Gal as a
reviewer at his request.

Link: https://lore.kernel.org/r/20230525094444.12570-1-mrgolin@amazon.com
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Acked-by: Gal Pressman <gal.pressman@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
16 months agomm: page_table_check: Ensure user pages are not slab pages
Ruihan Li [Mon, 15 May 2023 13:09:58 +0000 (21:09 +0800)]
mm: page_table_check: Ensure user pages are not slab pages

The current uses of PageAnon in page table check functions can lead to
type confusion bugs between struct page and slab [1], if slab pages are
accidentally mapped into the user space. This is because slab reuses the
bits in struct page to store its internal states, which renders PageAnon
ineffective on slab pages.

Since slab pages are not expected to be mapped into the user space, this
patch adds BUG_ON(PageSlab(page)) checks to make sure that slab pages
are not inadvertently mapped. Otherwise, there must be some bugs in the
kernel.

Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
Fixes: df4e817b7108 ("mm: page table check")
Cc: <stable@vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20230515130958.32471-5-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agomm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM
Ruihan Li [Mon, 15 May 2023 13:09:57 +0000 (21:09 +0800)]
mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM

Without EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary
physical memory regions into the userspace via /dev/mem. At the same
time, pages may change their properties (e.g., from anonymous pages to
named pages) while they are still being mapped in the userspace, leading
to "corruption" detected by the page table check.

To avoid these false positives, this patch makes PAGE_TABLE_CHECK
depends on EXCLUSIVE_SYSTEM_RAM. This dependency is understandable
because PAGE_TABLE_CHECK is a hardening technique but /dev/mem without
STRICT_DEVMEM (i.e., !EXCLUSIVE_SYSTEM_RAM) is itself a security
problem.

Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may be still allowed to be
mapped via /dev/mem. However, these pages are always considered as named
pages, so they won't break the logic used in the page table check.

Cc: <stable@vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20230515130958.32471-4-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agousb: usbfs: Use consistent mmap functions
Ruihan Li [Mon, 15 May 2023 13:09:56 +0000 (21:09 +0800)]
usb: usbfs: Use consistent mmap functions

When hcd->localmem_pool is non-null, localmem_pool is used to allocate
DMA memory. In this case, the dma address will be properly returned (in
dma_handle), and dma_mmap_coherent should be used to map this memory
into the user space. However, the current implementation uses
pfn_remap_range, which is supposed to map normal pages.

Instead of repeating the logic in the memory allocation function, this
patch introduces a more robust solution. Here, the type of allocated
memory is checked by testing whether dma_handle is properly set. If
dma_handle is properly returned, it means some DMA pages are allocated
and dma_mmap_coherent should be used to map them. Otherwise, normal
pages are allocated and pfn_remap_range should be called. This ensures
that the correct mmap functions are used consistently, independently
with logic details that determine which type of memory gets allocated.

Fixes: a0e710a7def4 ("USB: usbfs: fix mmap dma mismatch")
Cc: stable@vger.kernel.org
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Link: https://lore.kernel.org/r/20230515130958.32471-3-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agousb: usbfs: Enforce page requirements for mmap
Ruihan Li [Mon, 15 May 2023 13:09:55 +0000 (21:09 +0800)]
usb: usbfs: Enforce page requirements for mmap

The current implementation of usbdev_mmap uses usb_alloc_coherent to
allocate memory pages that will later be mapped into the user space.
Meanwhile, usb_alloc_coherent employs three different methods to
allocate memory, as outlined below:
 * If hcd->localmem_pool is non-null, it uses gen_pool_dma_alloc to
   allocate memory;
 * If DMA is not available, it uses kmalloc to allocate memory;
 * Otherwise, it uses dma_alloc_coherent.

However, it should be noted that gen_pool_dma_alloc does not guarantee
that the resulting memory will be page-aligned. Furthermore, trying to
map slab pages (i.e., memory allocated by kmalloc) into the user space
is not resonable and can lead to problems, such as a type confusion bug
when PAGE_TABLE_CHECK=y [1].

To address these issues, this patch introduces hcd_alloc_coherent_pages,
which addresses the above two problems. Specifically,
hcd_alloc_coherent_pages uses gen_pool_dma_alloc_align instead of
gen_pool_dma_alloc to ensure that the memory is page-aligned. To replace
kmalloc, hcd_alloc_coherent_pages directly allocates pages by calling
__get_free_pages.

Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.comm
Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
Fixes: f7d34b445abc ("USB: Add support for usbfs zerocopy.")
Fixes: ff2437befd8f ("usb: host: Fix excessive alignment restriction for local memory allocations")
Cc: stable@vger.kernel.org
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20230515130958.32471-2-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agodt-bindings: usb: snps,dwc3: Fix "snps,hsphy_interface" type
Marek Vasut [Mon, 15 May 2023 17:24:56 +0000 (19:24 +0200)]
dt-bindings: usb: snps,dwc3: Fix "snps,hsphy_interface" type

The "snps,hsphy_interface" is string, not u8. Fix the type.

Fixes: 389d77658801 ("dt-bindings: usb: Convert DWC USB3 bindings to DT schema")
Cc: stable <stable@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Marek Vasut <marex@denx.de>
Link: https://lore.kernel.org/r/20230515172456.179049-1-marex@denx.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agoblock: fix revalidate performance regression
Damien Le Moal [Mon, 29 May 2023 07:32:37 +0000 (16:32 +0900)]
block: fix revalidate performance regression

The scsi driver function sd_read_block_characteristics() always calls
disk_set_zoned() to a disk zoned model correctly, in case the device
model changed. This is done even for regular disks to set the zoned
model to BLK_ZONED_NONE and free any zone related resources if the drive
previously was zoned.

This behavior significantly impact the time it takes to revalidate disks
on a large system as the call to disk_clear_zone_settings() done from
disk_set_zoned() for the BLK_ZONED_NONE case results in the device
request queued to be frozen, even if there are no zone resources to
free.

Avoid this overhead for non-zoned devices by not calling
disk_clear_zone_settings() in disk_set_zoned() if the device model
was already set to BLK_ZONED_NONE, which is always the case for regular
devices.

Reported by: Brian Bunker <brian@purestorage.com>

Fixes: 508aebb80527 ("block: introduce blk_queue_clear_zone_settings()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230529073237.1339862-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
16 months agousb: gadget: udc: fix NULL dereference in remove()
Dan Carpenter [Thu, 25 May 2023 15:38:37 +0000 (18:38 +0300)]
usb: gadget: udc: fix NULL dereference in remove()

The "udc" pointer was never set in the probe() function so it will
lead to a NULL dereference in udc_pci_remove() when we do:

usb_del_gadget_udc(&udc->gadget);

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/ZG+A/dNpFWAlCChk@kili
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agousb: gadget: f_fs: Add unbind event before functionfs_unbind
Uttkarsh Aggarwal [Thu, 25 May 2023 09:28:54 +0000 (14:58 +0530)]
usb: gadget: f_fs: Add unbind event before functionfs_unbind

While exercising the unbind path, with the current implementation
the functionfs_unbind would be calling which waits for the ffs->mutex
to be available, however within the same time ffs_ep0_read is invoked
& if no setup packets are pending, it will invoke function
wait_event_interruptible_exclusive_locked_irq which by definition waits
for the ev.count to be increased inside the same mutex for which
functionfs_unbind is waiting.
This creates deadlock situation because the functionfs_unbind won't
get the lock until ev.count is increased which can only happen if
the caller ffs_func_unbind can proceed further.

Following is the illustration:

CPU1 CPU2

ffs_func_unbind() ffs_ep0_read()
mutex_lock(ffs->mutex)
wait_event(ffs->ev.count)
functionfs_unbind()
  mutex_lock(ffs->mutex)
  mutex_unlock(ffs->mutex)

ffs_event_add()

<deadlock>

Fix this by moving the event unbind before functionfs_unbind
to ensure the ev.count is incrased properly.

Fixes: 6a19da111057 ("usb: gadget: f_fs: Prevent race during ffs_ep0_queue_wait")
Cc: stable <stable@kernel.org>
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa@quicinc.com>
Link: https://lore.kernel.org/r/20230525092854.7992-1-quic_uaggarwa@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agousb: cdns3: fix NCM gadget RX speed 20x slow than expection at iMX8QM
Frank Li [Thu, 18 May 2023 15:49:45 +0000 (11:49 -0400)]
usb: cdns3: fix NCM gadget RX speed 20x slow than expection at iMX8QM

At iMX8QM platform, enable NCM gadget and run 'iperf3 -s'.
At host, run 'iperf3 -V -c fe80::6863:98ff:feef:3e0%enxc6e147509498'

[  5]   0.00-1.00   sec  1.55 MBytes  13.0 Mbits/sec   90   4.18 KBytes
[  5]   1.00-2.00   sec  1.44 MBytes  12.0 Mbits/sec   75   4.18 KBytes
[  5]   2.00-3.00   sec  1.48 MBytes  12.4 Mbits/sec   75   4.18 KBytes

Expected speed should be bigger than 300Mbits/sec.

The root cause of this performance drop was found to be data corruption
happening at 4K borders in some Ethernet packets, leading to TCP
checksum errors. This corruption occurs from the position
(4K - (address & 0x7F)) to 4K. The u_ether function's allocation of
skb_buff reserves 64B, meaning all RX addresses resemble 0xXXXX0040.

Force trb_burst_size to 16 can fix this problem.

Cc: stable@vger.kernel.org
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20230518154946.3666662-1-Frank.Li@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
16 months agoRISC-V: mark hibernation as nonportable
Conor Dooley [Fri, 26 May 2023 10:59:08 +0000 (11:59 +0100)]
RISC-V: mark hibernation as nonportable

Hibernation support depends on firmware marking its reserved/PMP
protected regions as not accessible from Linux.
The latest versions of the de-facto SBI implementation (OpenSBI) do
not do this, having dropped the no-map property to enable 1 GiB huge
page mappings by the kernel.
This was exposed by commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages
for the linear mapping"), which made the first 2 MiB of DRAM (where SBI
typically resides) accessible by the kernel.
Attempting to hibernate with either OpenSBI, or other implementations
following its lead, will lead to a kernel panic ([1], [2]) as the
hibernation process will attempt to save/restore any mapped regions,
including the PMP protected regions in use by the SBI implementation.

Mark hibernation as depending on "NONPORTABLE", as only a small subset
of systems are capable of supporting it, until such time that an SBI
implementation independent way to communicate what regions are in use
has been agreed on.

As hibernation support landed in v6.4-rc1, disabling it for most
platforms does not constitute a regression. The alternative would have
been reverting commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for
the linear mapping").
Doing so would permit hibernation on platforms with these SBI
implementations, but would limit the options we have to solve the
protection of the region without causing a regression in hibernation
support.

Reported-by: Song Shuai <suagrfillet@gmail.com>
Link: https://lore.kernel.org/all/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/
Reported-by: JeeHeng Sia <jeeheng.sia@starfivetech.com>
Link: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230526-astride-detonator-9ae120051159@wendy
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
16 months agoMerge tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Mon, 29 May 2023 11:20:13 +0000 (07:20 -0400)]
Merge tag 'trace-v6.4-rc3' of git://git./linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "User events:

   - Use long instead of int for storing the enable set/clear bit, as it
     was found that big endian machines could end up using the wrong
     bits.

   - Split allocating mm and attaching it. This keeps the allocation
     separate from the registration and avoids various races.

   - Remove RCU locking around pin_user_pages_remote() as that can
     schedule. The RCU protection is no longer needed with the above
     split of mm allocation and attaching.

   - Rename the "link" fields of the various structs to something more
     meaningful.

   - Add comments around user_event_mm struct usage and locking
     requirements.

  Timerlat tracer:

   - Fix missed wakeup of timerlat thread caused by the timerlat
     interrupt triggering when tracing is off. The timer interrupt
     handler needs to always wake up the timerlat thread regardless if
     tracing is enabled or not, otherwise, it will never wake up.

  Histograms:

   - Fix regression of breaking the "stacktrace" modifier for variables.
     That modifier cannot be used for values, but can be used for
     variables that are passed from one histogram to the next. This was
     broken when adding the restriction to values as the variable logic
     used the same code.

   - Rename the special field "stacktrace" to "common_stacktrace".

     Special fields (that are not actually part of the event, but can
     act just like event fields, like 'comm' and 'timestamp') should be
     prefixed with 'common_' for consistency. To keep backward
     compatibility, 'stacktrace' can still be used (as with the special
     field 'cpu'), but can be overridden if the event has a field called
     'stacktrace'.

   - Update the synthetic event selftests to use the new name (synthetic
     events are created by histograms)

  Tracing bootup selftests:

   - Reorganize the code to keep artifacts of the selftests not compiled
     in when selftests are not configured.

   - Add various cond_resched() around the selftest code, as the
     softlock watchdog was triggering much more often. It appears that
     the kernel runs slower now with full debugging enabled.

   - While debugging ftrace with ftrace (using an instance ring buffer
     instead of the top level one), I found that the selftests were
     disabling prints to the debug instance.

     This should not happen, as the selftests only disable printing to
     the main buffer as the selftests examine the main buffer to see if
     it has what it expects, and prints can make the tests fail.

     Make the selftests only disable printing to the toplevel buffer,
     and leave the instance buffers alone"

* tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Have function_graph selftest call cond_resched()
  tracing: Only make selftest conditionals affect the global_trace
  tracing: Make tracing_selftest_running/delete nops when not used
  tracing: Have tracer selftests call cond_resched() before running
  tracing: Move setting of tracing_selftest_running out of register_tracer()
  tracing/selftests: Update synthetic event selftest to use common_stacktrace
  tracing: Rename stacktrace field to common_stacktrace
  tracing/histograms: Allow variables to have some modifiers
  tracing/user_events: Document user_event_mm one-shot list usage
  tracing/user_events: Rename link fields for clarity
  tracing/user_events: Remove RCU lock while pinning pages
  tracing/user_events: Split up mm alloc and attach
  tracing/timerlat: Always wakeup the timerlat thread
  tracing/user_events: Use long vs int for atomic bit ops

16 months agoMerge tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Mon, 29 May 2023 11:05:49 +0000 (07:05 -0400)]
Merge tag 'v6.4-p3' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
 "Fix an alignment crash in x86/aria"

* tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: x86/aria - Use 16 byte alignment for GFNI constant vectors

16 months agoRevert "module: error out early on concurrent load of the same module file"
Linus Torvalds [Mon, 29 May 2023 10:40:33 +0000 (06:40 -0400)]
Revert "module: error out early on concurrent load of the same module file"

This reverts commit 9828ed3f695a138f7add89fa2a186ababceb8006.

Sadly, it does seem to cause failures to load modules. Johan Hovold reports:

 "This change breaks module loading during boot on the Lenovo Thinkpad
  X13s (aarch64).

  Specifically it results in indefinite probe deferral of the display
  and USB (ethernet) which makes it a pain to debug. Typing in the dark
  to acquire some logs reveals that other modules are missing as well"

Since this was applied late as a "let's try this", I'm reverting it
asap, and we can try to figure out what goes wrong later.  The excessive
parallel module loading problem is annoying, but not noticeable in
normal situations, and this was only meant as an optimistic workaround
for a user-space bug.

One possible solution may be to do the optimistic exclusive open first,
and then use a lock to serialize loading if that fails.

Reported-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/lkml/ZHRpH-JXAxA6DnzR@hovoldconsulting.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 months agotracing: Have function_graph selftest call cond_resched()
Steven Rostedt (Google) [Sun, 28 May 2023 05:17:42 +0000 (01:17 -0400)]
tracing: Have function_graph selftest call cond_resched()

When all kernel debugging is enabled (lockdep, KSAN, etc), the function
graph enabling and disabling can take several seconds to complete. The
function_graph selftest enables and disables function graph tracing
several times. With full debugging enabled, the soft lockup watchdog was
triggering because the selftest was running without ever scheduling.

Add cond_resched() throughout the test to make sure it does not trigger
the soft lockup detector.

Link: https://lkml.kernel.org/r/20230528051742.1325503-6-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
16 months agotracing: Only make selftest conditionals affect the global_trace
Steven Rostedt (Google) [Sun, 28 May 2023 05:17:41 +0000 (01:17 -0400)]
tracing: Only make selftest conditionals affect the global_trace

The tracing_selftest_running and tracing_selftest_disabled variables were
to keep trace_printk() and other writes from affecting the tracing
selftests, as the tracing selftests would examine the ring buffer to see
if it contained what it expected or not. trace_printk() and friends could
add to the ring buffer and cause the selftests to fail (and then disable
the tracer that was being tested). To keep that from happening, these
variables were added and would keep trace_printk() and friends from
writing to the ring buffer while the tests were going on.

But this was only the top level ring buffer (owned by the global_trace
instance). There is no reason to prevent writing into ring buffers of
other instances via the trace_array_printk() and friends. For the
functions that could be used by other instances, check if the global_trace
is the tracer instance that is being written to before deciding to not
allow the write.

Link: https://lkml.kernel.org/r/20230528051742.1325503-5-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
16 months agotracing: Make tracing_selftest_running/delete nops when not used
Steven Rostedt (Google) [Sun, 28 May 2023 05:17:40 +0000 (01:17 -0400)]
tracing: Make tracing_selftest_running/delete nops when not used

There's no reason to test the condition variables tracing_selftest_running
or tracing_selftest_delete when tracing selftests are not enabled. Make
them define 0s when not the selftests are not configured in.

Link: https://lkml.kernel.org/r/20230528051742.1325503-4-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
16 months agotracing: Have tracer selftests call cond_resched() before running
Steven Rostedt (Google) [Sun, 28 May 2023 05:17:39 +0000 (01:17 -0400)]
tracing: Have tracer selftests call cond_resched() before running

As there are more and more internal selftests being added to the Linux
kernel (KSAN, lockdep, etc) the selftests are taking longer to run when
these are enabled. Add a cond_resched() to the calling of
do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set,
otherwise the soft lockup watchdog may trigger on boot up.

Link: https://lkml.kernel.org/r/20230528051742.1325503-3-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
16 months agotracing: Move setting of tracing_selftest_running out of register_tracer()
Steven Rostedt (Google) [Sun, 28 May 2023 05:17:38 +0000 (01:17 -0400)]
tracing: Move setting of tracing_selftest_running out of register_tracer()

The variables tracing_selftest_running and tracing_selftest_disabled are
only used for when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only
visible within the selftest code. The setting of those variables are in
the register_tracer() call, and set in a location where they do not need
to be. Create a wrapper around run_tracer_selftest() called
do_run_tracer_selftest() which sets those variables, and have
register_tracer() call that instead.

Having those variables only set within the CONFIG_FTRACE_STARTUP_TEST
scope gets rid of them (and also the ability to remove testing against
them) when the startup tests are not enabled (most cases).

Link: https://lkml.kernel.org/r/20230528051742.1325503-2-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
16 months agoMerge tag 'phy-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Linus Torvalds [Mon, 29 May 2023 00:05:07 +0000 (20:05 -0400)]
Merge tag 'phy-fixes-6.4' of git://git./linux/kernel/git/phy/linux-phy

Pull phy fixes from Vinod Koul:

 - init count imbalance fix in qcom-qmp-pcie and combo drivers

 - kernel doc header fix for qcom-snps driver

 - mediatek floating point comparison fix

 - amlogic fix register value

* tag 'phy-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
  phy: qcom-snps: correct struct qcom_snps_hsphy kerneldoc
  phy: amlogic: phy-meson-g12a-mipi-dphy-analog: fix CNTL2_DIF_TX_CTL0 value
  phy: mediatek: rework the floating point comparisons to fixed point
  phy: qcom-qmp-pcie-msm8996: fix init-count imbalance
  phy: qcom-qmp-combo: fix init-count imbalance

16 months agoMerge tag 'dmaengine-fix-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul...
Linus Torvalds [Sun, 28 May 2023 23:59:08 +0000 (19:59 -0400)]
Merge tag 'dmaengine-fix-6.4' of git://git./linux/kernel/git/vkoul/dmaengine

Pull dmaengine fixes from Vinod Koul:
 "Driver fixes for the at-hdmac, pl330, TI and IDXD drivers:

   - AT HDMAC driver fixes for Flow Controller bitfield, peripheral ID
     handling and potential NULL dereference check

   - PL330 function rename to avoid conflicts

   - build warning fix for pm function in TI driver

   - IDXD driver fix for passing freed memory"

* tag 'dmaengine-fix-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
  dmaengine: at_hdmac: Extend the Flow Controller bitfield to three bits
  dmaengine: at_hdmac: Repair bitfield macros for peripheral ID handling
  dmaengine: pl330: rename _start to prevent build error
  dmaengine: at_xdmac: fix potential Oops in at_xdmac_prep_interleaved()
  dmaengine: ti: k3-udma: annotate pm function with __maybe_unused
  dmaengine: idxd: Fix passing freed memory in idxd_cdev_open()

16 months agoefi: Bump stub image version for macOS HVF compatibility
Akihiro Suda [Sun, 28 May 2023 17:36:02 +0000 (19:36 +0200)]
efi: Bump stub image version for macOS HVF compatibility

The macOS hypervisor framework includes a host-side VMM called
VZLinuxBootLoader [1] which implements native support for booting the
Linux kernel inside a guest directly (instead of, e.g., via GRUB
installed inside the guest). On x86, it incorporates a BIOS style loader
that does not implement or expose EFI to the loaded kernel. However,
this loader appears to fail when the 'image minor version' field in the
kernel image's PE/COFF header (which is generally only used by EFI based
bootloaders) is set to any value other than 0x0. [2]

Commit e346bebbd36b1576 ("efi: libstub: Always enable initrd command
line loader and bump version") incremented the EFI stub image minor
version to convey that all EFI stub kernels now implement support for
the initrd= command line option, and do so in a way where it can load
initrd images from any filesystem known to the EFI firmware (as opposed
to prior implementations that could only load initrds from the same
volume that the kernel image was loaded from).

Unfortunately, bumping the version to v1.1 triggers this issue in
VZLinuxBootLoader, breaking the boot on x86. So let's keep the image
minor version at 0x0, and bump the image major version instead.

While at it, convert this field to a bit field, so that individual
features are discoverable from it, as suggested by Linus. So let's bump
the major version to v3, and document the initrd= command line loading
feature as being represented by bit 1 in the mask.

Note that, due to the prior interpretation as a monotonically increasing
version field, loaders are still permitted to assume that the LoadFile2
initrd loading feature is supported for any major version value >= 1,
even if bit 0 is not set.

[1] https://developer.apple.com/documentation/virtualization/vzlinuxbootloader
[2] https://lore.kernel.org/linux-efi/CAG8fp8Teu4G9JuenQrqGndFt2Gy+V4YgJ=hN1xX7AD940YKf3A@mail.gmail.com/

Fixes: e346bebbd36b1576 ("efi: libstub: Always enable initrd command ...")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217485
Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
[ardb: rewrite comment and commit log]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
16 months agoext4: add EA_INODE checking to ext4_iget()
Theodore Ts'o [Wed, 24 May 2023 03:49:48 +0000 (23:49 -0400)]
ext4: add EA_INODE checking to ext4_iget()

Add a new flag, EXT4_IGET_EA_INODE which indicates whether the inode
is expected to have the EA_INODE flag or not.  If the flag is not
set/clear as expected, then fail the iget() operation and mark the
file system as corrupted.

This commit also makes the ext4_iget() always perform the
is_bad_inode() check even when the inode is already inode cache.  This
allows us to remove the is_bad_inode() check from the callers of
ext4_iget() in the ea_inode code.

Reported-by: syzbot+cbb68193bdb95af4340a@syzkaller.appspotmail.com
Reported-by: syzbot+62120febbd1ee3c3c860@syzkaller.appspotmail.com
Reported-by: syzbot+edce54daffee36421b4c@syzkaller.appspotmail.com
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
16 months agoLinux 6.4-rc4
Linus Torvalds [Sun, 28 May 2023 11:49:00 +0000 (07:49 -0400)]
Linux 6.4-rc4

16 months agoMerge tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 28 May 2023 11:42:05 +0000 (07:42 -0400)]
Merge tag 'x86-urgent-2023-05-28' of git://git./linux/kernel/git/tip/tip

Pull x86 cpu fix from Thomas Gleixner:
 "A single fix for x86:

   - Prevent a bogus setting for the number of HT siblings, which is
     caused by the CPUID evaluation trainwreck of X86. That recomputes
     the value for each CPU, so the last CPU "wins". That can cause
     completely bogus sibling values"

* tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms

16 months agoMerge tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 28 May 2023 11:37:23 +0000 (07:37 -0400)]
Merge tag 'perf-urgent-2023-05-28' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A small set of perf fixes:

   - Make the MSR-readout based CHA discovery work around broken
     discovery tables in some SPR firmwares.

   - Prevent saving PEBS configuration which has software bits set that
     cause a crash when restored into the relevant MSR"

* tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Correct the number of CHAs on SPR
  perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS

16 months agoMerge tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 May 2023 11:33:29 +0000 (07:33 -0400)]
Merge tag 'objtool-urgent-2023-05-28' of git://git./linux/kernel/git/tip/tip

Pull unwinder fixes from Thomas Gleixner:
 "A set of unwinder and tooling fixes:

   - Ensure that the stack pointer on x86 is aligned again so that the
     unwinder does not read past the end of the stack

   - Discard .note.gnu.property section which has a pointlessly
     different alignment than the other note sections. That confuses
     tooling of all sorts including readelf, libbpf and pahole"

* tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/show_trace_log_lvl: Ensure stack pointer is aligned, again
  vmlinux.lds.h: Discard .note.gnu.property section

16 months agoMerge tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 May 2023 11:15:33 +0000 (07:15 -0400)]
Merge tag 'core-debugobjects-2023-05-28' of git://git./linux/kernel/git/tip/tip

Pull debugobjects fixes from Thomas Gleixner:
 "Two fixes for debugobjects:

   - Prevent the allocation path from waking up kswapd.

     That's a long standing issue due to the GFP_ATOMIC allocation flag.
     As debug objects can be invoked from pretty much any context waking
     kswapd can end up in arbitrary lock chains versus the waitqueue
     lock

   - Correct the explicit lockdep wait-type violation in
     debug_object_fill_pool()"

* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Don't wake up kswapd from fill_pool()
  debugobjects,locking: Annotate debug_object_fill_pool() wait type violation