The kernel occasionally crashes in cpumask_clear_cpu(), which is called
within exit_round_robin(), because when executing clear_bit(nr, addr) with
nr set to 0xffffffff, the address calculation may cause misalignment within
the memory, leading to access to an invalid memory address.
----------
BUG: unable to handle kernel paging request at
ffffffffe0740618
...
CPU: 3 PID:
2919323 Comm: acpi_pad/14 Kdump: loaded Tainted: G OE X --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1
...
RIP: 0010:power_saving_thread+0x313/0x411 [acpi_pad]
Code: 89 cd 48 89 d3 eb d1 48 c7 c7 55 70 72 c0 e8 64 86 b0 e4 c6 05 0d a1 02 00 01 e9 bc fd ff ff 45 89 e4 42 8b 04 a5 20 82 72 c0 <f0> 48 0f b3 05 f4 9c 01 00 42 c7 04 a5 20 82 72 c0 ff ff ff ff 31
RSP: 0018:
ff72a5d51fa77ec8 EFLAGS:
00010202
RAX:
00000000ffffffff RBX:
ff462981e5d8cb80 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
0000000000000246 RDI:
0000000000000246
RBP:
ff46297556959d80 R08:
0000000000000382 R09:
ff46297c8d0f38d8
R10:
0000000000000000 R11:
0000000000000001 R12:
000000000000000e
R13:
0000000000000000 R14:
ffffffffffffffff R15:
000000000000000e
FS:
0000000000000000(0000) GS:
ff46297a800c0000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
ffffffffe0740618 CR3:
0000007e20410004 CR4:
0000000000771ee0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
PKRU:
55555554
Call Trace:
? acpi_pad_add+0x120/0x120 [acpi_pad]
kthread+0x10b/0x130
? set_kthread_struct+0x50/0x50
ret_from_fork+0x1f/0x40
...
CR2:
ffffffffe0740618
crash> dis -lr
ffffffffc0726923
...
/usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 114
0xffffffffc0726918 <power_saving_thread+776>: mov %r12d,%r12d
/usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 325
0xffffffffc072691b <power_saving_thread+779>: mov -0x3f8d7de0(,%r12,4),%eax
/usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./arch/x86/include/asm/bitops.h: 80
0xffffffffc0726923 <power_saving_thread+787>: lock btr %rax,0x19cf4(%rip) # 0xffffffffc0740620 <pad_busy_cpus_bits>
crash> px tsk_in_cpu[14]
$66 = 0xffffffff
crash> px 0xffffffffc072692c+0x19cf4
$99 = 0xffffffffc0740620
crash> sym 0xffffffffc0740620
ffffffffc0740620 (b) pad_busy_cpus_bits [acpi_pad]
crash> px pad_busy_cpus_bits[0]
$42 = 0xfffc0
----------
To fix this, ensure that tsk_in_cpu[tsk_index] != -1 before calling
cpumask_clear_cpu() in exit_round_robin(), just as it is done in
round_robin_cpu().
Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
Link: https://patch.msgid.link/20240825141352.25280-1-snishika@redhat.com
[ rjw: Subject edit, avoid updates to the same value ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>