timers/nohz: Add a comment about broken iowait counter update race
The per-cpu iowait task counter is incremented locally upon sleeping.
But since the task can be woken to (and by) another CPU, the counter may
then be decremented remotely. This is the source of a race involving
readers VS writer of idle/iowait sleeptime.
The following scenario shows an example where a /proc/stat reader
observes a pending sleep time as IO whereas that pending sleep time
later eventually gets accounted as non-IO.
CPU 0 CPU 1 CPU 2
----- ----- ------
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
// READ /proc/stat
// See nr_iowait_cpu(0) == 1
return ts->iowait_sleeptime +
ktime_sub(ktime_get(), ts->idle_entrytime)
//try_to_wake_up(TASK A)
rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
As a result subsequent reads on /proc/stat may expose backward progress.
This is unfortunately hardly fixable. Just add a comment about that
condition.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-5-frederic@kernel.org