lib: sbi: Set the state of a hart to START_PENDING after the hart is ready
When a boot hart executes sbi_hsm_hart_start() to start a secondary hart,
next_arg1, next_addr and next_mode for the latter are stored in the scratch
area after the state has been set to SBI_HSM_STATE_START_PENDING.
The secondary hart waits in the loop with wfi() in sbi_hsm_hart_wait() at
that time. However, "wfi" instruction is not guaranteed to wait for an
interrupt to be received by the hart, it is just a hint for the CPU.
According to RISC-V Privileged Architectures spec. v20211203, even an
implementation of "wfi" as "nop" is legal.
So, the secondary might leave the loop in sbi_hsm_hart_wait() as soon as
its state has been set to SBI_HSM_STATE_START_PENDING, even if it got no
IPI or it got an IPI unrelated to sbi_hsm_hart_start(). This could lead to
the following race condition when booting Linux, for example:
Boot hart (#0) Secondary hart (#1)
runs Linux startup code waits in sbi_hsm_hart_wait()
sbi_ecall(SBI_EXT_HSM,
SBI_EXT_HSM_HART_START,
...)
enters sbi_hsm_hart_start()
sets state of hart #1 to START_PENDING
leaves sbi_hsm_hart_wait()
runs to the end of init_warmboot()
returns to scratch->next_addr
(next_addr can be garbage here)
sets next_addr, etc. for hart #1
(no good: hart #1 has already left)
sends IPI to hart #1
(no good either)
If this happens, the secondary hart jumps to a wrong next_addr at the end
of init_warmboot(), which leads to a system hang or crash.
To reproduce the issue more reliably, one could add a delay in
sbi_hsm_hart_start() after setting the hart's state but before sending
IPI to that hart:
hstate = atomic_cmpxchg(&hdata->state, SBI_HSM_STATE_STOPPED,
SBI_HSM_STATE_START_PENDING);
...
+ sbi_timer_mdelay(10);
init_count = sbi_init_count(hartid);
rscratch->next_arg1 = arg1;
rscratch->next_addr = saddr;
The issue can be reproduced, for example, in a QEMU VM with '-machine virt'
and 2 or more CPUs, with Linux as the guest OS.
This patch moves writing of next_arg1, next_addr and next_mode for the
secondary hart before setting its state to SBI_HSM_STATE_START_PENDING.
In theory, it is possible that two or more harts enter sbi_hsm_hart_start()
for the same target hart simultaneously. To make sure the current hart has
exclusive access to the scratch area of the target hart at that point, a
per-hart 'start_ticket' is used. It is initially 0. The current hart tries
to acquire the ticket first (set it to 1) at the beginning of
sbi_hsm_hart_start() and only proceeds if it has successfully acquired it.
The target hart reads next_addr, etc., and then the releases the ticket
(sets it to 0) before calling sbi_hart_switch_mode(). This way, even if
some other hart manages to enter sbi_hsm_hart_start() after the ticket has
been released but before the target hart jumps to next_addr, it will not
cause problems.
atomic_cmpxchg() already has "acquire" semantics, among other things, so
no additional barriers are needed in hsm_start_ticket_acquire(). No hart
can perform or observe the update of *rscratch before setting of
'start_ticket' to 1.
atomic_write() only imposes ordering of writes, so an explicit barrier is
needed in hsm_start_ticket_release() to ensure its "release" semantics.
This guarantees that reads of scratch->next_addr, etc., in
sbi_hsm_hart_start_finish() cannot happen after 'start_ticket' has been
released.
Signed-off-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>