review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge branch 'lkmm.2020.11.06a' into HEAD

lkmm.2020.11.06a: Linux-kernel memory model (LKMM) updates.

Merge branch 'kcsan.2020.11.06a' into HEAD

kcsan.2020.11.06a: Kernel concurrency sanitizer (KCSAN) updates.

Merge branches 'cpuinfo.2020.11.06a', 'doc.2020.11.06a', 'fixes.2020.11.19b', 'lockdep.2020.11.02a', 'tasks.2020.11.06a' and 'torture.2020.11.06a' into HEAD

cpuinfo.2020.11.06a: Speedups for /proc/cpuinfo.
doc.2020.11.06a: Documentation updates.
fixes.2020.11.19b: Miscellaneous fixes.
lockdep.2020.11.02a: Lockdep-RCU updates to avoid "unused variable".
tasks.2020.11.06a: Tasks-RCU updates.
torture.2020.11.06a': Torture-test updates.

srcu: Take early exit on memory-allocation failure

It turns out that init_srcu_struct() can be invoked from usermode tasks,
and that fatal signals received by these tasks can cause memory-allocation
failures. These failures are not handled well by init_srcu_struct(),
so much so that NULL pointer dereferences can result. This commit
therefore causes init_srcu_struct() to take an early exit upon detection
of memory-allocation failure.

Link: https://lore.kernel.org/lkml/20200908144306.33355-1-aik@ozlabs.ru/
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu/tree: Defer kvfree_rcu() allocation to a clean context

The current memmory-allocation interface causes the following difficulties
for kvfree_rcu():

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
   complain about violation of the nesting rules, as in "BUG: Invalid
   wait context".  This Kconfig option checks for proper raw_spinlock
   vs. spinlock nesting, in particular, it is not legal to acquire a
   spinlock_t while holding a raw_spinlock_t.

   This is a problem because kfree_rcu() uses raw_spinlock_t whereas the
   "page allocator" internally deals with spinlock_t to access to its
   zones. The code also can be broken from higher level of view:
   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into
   sleeplock.  This means that invoking the page allocator from atomic
   contexts results in "BUG: scheduling while atomic".

c) Please note that call_rcu() is already invoked from raw atomic context,
   so it is only reasonable to expaect that kfree_rcu() and kvfree_rcu()
   will also be called from atomic raw context.

This commit therefore defers page allocation to a clean context using the
combination of an hrtimer and a workqueue.  The hrtimer stage is required
in order to avoid deadlocks with the scheduler.  This deferred allocation
is required only when kvfree_rcu()'s per-CPU page cache is empty.

Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Do not report strict GPs for outgoing CPUs

An outgoing CPU is marked offline in a stop-machine handler and most
of that CPU's services stop at that point, including IRQ work queues.
However, that CPU must take another pass through the scheduler and through
a number of CPU-hotplug notifiers, many of which contain RCU readers.
In the past, these readers were not a problem because the outgoing CPU
has interrupts disabled, so that rcu_read_unlock_special() would not
be invoked, and thus RCU would never attempt to queue IRQ work on the
outgoing CPU.

This changed with the advent of the CONFIG_RCU_STRICT_GRACE_PERIOD
Kconfig option, in which rcu_read_unlock_special() is invoked upon exit
from almost all RCU read-side critical sections.  Worse yet, because
interrupts are disabled, rcu_read_unlock_special() cannot immediately
report a quiescent state and will therefore attempt to defer this
reporting, for example, by queueing IRQ work.  Which fails with a splat
because the CPU is already marked as being offline.

But it turns out that there is no need to report this quiescent state
because rcu_report_dead() will do this job shortly after the outgoing
CPU makes its final dive into the idle loop.  This commit therefore
makes rcu_read_unlock_special() refrain from queuing IRQ work onto
outgoing CPUs.

Fixes: 44bad5b3cca2 ("rcu: Do full report for .need_qs for strict GPs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jann Horn <jannh@google.com>

rcu: Fix a typo in rcu_blocking_is_gp() header comment

This commit fixes a typo in the rcu_blocking_is_gp() function's header
comment.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent lockdep-RCU splats on lock acquisition/release

The rcu_cpu_starting() and rcu_report_dead() functions transition the
current CPU between online and offline state from an RCU perspective.
Unfortunately, this means that the rcu_cpu_starting() function's lock
acquisition and the rcu_report_dead() function's lock releases happen
while the CPU is offline from an RCU perspective, which can result
in lockdep-RCU splats about using RCU from an offline CPU.  And this
situation can also result in too-short grace periods, especially in
guest OSes that are subject to vCPU preemption.

This commit therefore uses sequence-count-like synchronization to forgive
use of RCU while RCU thinks a CPU is offline across the full extent of
the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions
and releases.

One approach would have been to use the actual sequence-count primitives
provided by the Linux kernel.  Unfortunately, the resulting code looks
completely broken and wrong, and is likely to result in patches that
break RCU in an attempt to address this appearance of broken wrongness.
Plus there is no net savings in lines of code, given the additional
explicit memory barriers required.

Therefore, this sequence count is instead implemented by a new ->ofl_seq
field in the rcu_node structure.  If this counter's value is an odd
number, RCU forgives RCU read-side critical sections on other CPUs covered
by the same rcu_node structure, even if those CPUs are offline from
an RCU perspective.  In addition, if a given leaf rcu_node structure's
->ofl_seq counter value is an odd number, rcu_gp_init() delays starting
the grace period until that counter value changes.

[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute. This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().

One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.

This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.

This change passed 30 minute tests of TREE01 through TREE09 each.

On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread. This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu,ftrace: Fix ftrace recursion

Kim reported that perf-ftrace made his box unhappy. It turns out that
commit:

ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")

removed one too many notrace qualifiers, probably due to there not being
a helpful comment.

This commit therefore reinstates the notrace and adds a comment to avoid
losing it again.

[ paulmck: Apply Steven Rostedt's feedback on the comment. ]
Fixes: ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")
Reported-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu/tree: Make struct kernel_param_ops definitions const

These should be const, so make it so.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu/tree: Add a warning if CPU being onlined did not report QS already

Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU.  However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed.  Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition.  Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.

This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code.  This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config

This commit clarifies that the "p" and the "s" in the in the RCU_NOCB_CPU
config-option description refer to the "x" in the "rcuox/N" kthread name.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: While in the area, update description and advice. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Fix single-CPU check in rcu_blocking_is_gp()

Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
num_online_cpus() to determine whether there is only one CPU online.  When
there is only a single CPU online, the simple fact that synchronize_rcu()
could be legally called implies that a full grace period has elapsed.
Therefore, in the single-CPU case, synchronize_rcu() simply returns
immediately.  Unfortunately, num_online_cpus() is unreliable while a
CPU-hotplug operation is transitioning to or from single-CPU operation
because:

1. num_online_cpus() uses atomic_read(&__num_online_cpus) to
locklessly sample the number of online CPUs.  The hotplug locks
are not held, which means that an incoming CPU can concurrently
update this count.  This in turn means that an RCU read-side
critical section on the incoming CPU might observe updates
prior to the grace period, but also that this critical section
might extend beyond the end of the optimized synchronize_rcu().
This breaks RCU's fundamental guarantee.

2. In addition, num_online_cpus() does no ordering, thus providing
another way that RCU's fundamental guarantee can be broken by
the current code.

3. The most probable failure mode happens on outgoing CPUs.
The outgoing CPU updates the count of online CPUs in the
CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
and of itself due to preemption being disabled at the call
to num_online_cpus().  Unfortunately, after that stop-machine
handler returns, the CPU takes one last trip through the
scheduler (which has RCU readers) and, after the resulting
context switch, one final dive into the idle loop.  During this
time, RCU needs to keep track of two CPUs, but num_online_cpus()
will say that there is only one, which in turn means that the
surviving CPU will incorrectly ignore the outgoing CPU's RCU
read-side critical sections.

This problem is illustrated by the following litmus test in which P0()
corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
The herd7 tool confirms that the "exists" clause can be satisfied,
thus demonstrating that this breakage can happen according to the Linux
kernel memory model.

   {
     int x = 0;
     atomic_t numonline = ATOMIC_INIT(1);
   }

   P0(int *x, atomic_t *numonline)
   {
     int r0;
     WRITE_ONCE(*x, 1);
     r0 = atomic_read(numonline);
     if (r0 == 1) {
       smp_mb();
     } else {
       synchronize_rcu();
     }
     WRITE_ONCE(*x, 2);
   }

   P1(int *x, atomic_t *numonline)
   {
     int r0; int r1;

     atomic_inc(numonline);
     smp_mb();
     rcu_read_lock();
     r0 = READ_ONCE(*x);
     smp_rmb();
     r1 = READ_ONCE(*x);
     rcu_read_unlock();
   }

   locations [x;numonline;]

   exists (1:r0=0 /\ 1:r1=2)

It is important to note that these problems arise only when the system
is transitioning to or from single-CPU operation.

One solution would be to hold the CPU-hotplug locks while sampling
num_online_cpus(), which was in fact the intent of the (redundant)
preempt_disable() and preempt_enable() surrounding this call to
num_online_cpus().  Actually blocking CPU hotplug would not only result
in excessive overhead, but would also unnecessarily impede CPU-hotplug
operations.

This commit therefore follows long-standing RCU tradition by maintaining
a separate RCU-specific set of CPU-hotplug books.

This separate set of books is implemented by a new ->n_online_cpus field
in the rcu_state structure that maintains RCU's count of the online CPUs.
This count is incremented early in the CPU-online process, so that
the critical transition away from single-CPU operation will occur when
there is only a single CPU.  Similarly for the critical transition to
single-CPU operation, the counter is decremented late in the CPU-offline
process, again while there is only a single CPU.  Because there is only
ever a single CPU when the ->n_online_cpus field undergoes the critical
1->2 and 2->1 transitions, full memory ordering and mutual exclusion is
provided implicitly and, better yet, for free.

In the case where the CPU is coming online, nothing will happen until
the current CPU helps it come online.  Therefore, the new CPU will see
all accesses prior to the optimized grace period, which means that RCU
does not need to further delay this new CPU.  In the case where the CPU
is going offline, the outgoing CPU is totally out of the picture before
the optimized grace period starts, which means that this outgoing CPU
cannot see any of the accesses following that grace period.  Again,
RCU needs no further interaction with the outgoing CPU.

This does mean that synchronize_rcu() will unnecessarily do a few grace
periods the hard way just before the second CPU comes online and just
after the second-to-last CPU goes offline, but it is not worth optimizing
this uncommon case.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Implement rcu_segcblist_is_offloaded() config dependent

This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
Note that rcu_segcblist_is_offloaded() is defined in the header file,
which means that the generated code should be just as efficient as before.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

list.h: Update comment to explicitly note circular lists

The students in the Operating System Lecture Section at the
American University of Sharjah were confused by the header comment
in include/linux/list.h, which says "Simple doubly linked list
implementation". This comment means "simple" as in "not complex",
but "simple" is often used in this context to mean "not circular".
This commit therefore avoids this ambiguity by explicitly calling out
"circular".

Signed-off-by: Asif Rasheed <b00073877@aus.edu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Panic after fixed number of stalls

Some stalls are transient, so that system fully recovers. This commit
therefore allows users to configure the number of stalls that must happen
in order to trigger kernel panic.

Signed-off-by: chao <chao@eero.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

x86/smpboot: Move rcu_cpu_starting() earlier

The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:

=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x77/0x97
__is_insn_slot_addr+0x15d/0x170
kernel_text_address+0xba/0xe0
? get_stack_info+0x22/0xa0
__kernel_text_address+0x9/0x30
show_trace_log_lvl+0x17d/0x380
? dump_stack+0x77/0x97
dump_stack+0x77/0x97
__lock_acquire+0xdf7/0x1bf0
lock_acquire+0x258/0x3d0
? vprintk_emit+0x6d/0x2c0
_raw_spin_lock+0x27/0x40
? vprintk_emit+0x6d/0x2c0
vprintk_emit+0x6d/0x2c0
printk+0x4d/0x69
start_secondary+0x1c/0x100
secondary_startup_64_no_verify+0xb8/0xbb

This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Allow rcu_irq_enter_check_tick() from NMI

Eugenio managed to tickle #PF from NMI context which resulted in
hitting a WARN in RCU through irqentry_enter() ->
__rcu_irq_enter_check_tick().

However, this situation is perfectly sane and does not warrant an
WARN. The #PF will (necessarily) be atomic and not require messing
with the tick state, so early return is correct. This commit
therefore removes the WARN.

Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Reported-by: "Eugenio Pérez" <eupm90@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled

The try_invoke_on_locked_down_task() function requires that
interrupts be enabled, but it is called with interrupts disabled from
rcu_print_task_stall(), resulting in an "IRQs not enabled as expected"
diagnostic. This commit therefore updates rcu_print_task_stall()
to accumulate a list of the first few tasks while holding the current
leaf rcu_node structure's ->lock, then releases that lock and only then
uses try_invoke_on_locked_down_task() to attempt to obtain per-task
detailed information. Of course, as soon as ->lock is released, the
task might exit, so the get_task_struct() function is used to prevent
the task structure from going away in the meantime.

Link: https://lore.kernel.org/lkml/000000000000903d5805ab908fc4@google.com/
Fixes: 5bef8da66a9c ("rcu: Add per-task state to RCU CPU stall warnings")
Reported-by: syzbot+cb3b69ae80afd6535b0e@syzkaller.appspotmail.com
Reported-by: syzbot+f04854e1c5c9e913cc27@syzkaller.appspotmail.com
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Label MP tests' producers and consumers

This commit adds comments that label the MP tests' producer and consumer
processes, and also that label the "exists" clause as the bad outcome.

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Use "buf" and "flag" for message-passing tests

The use of "x" and "y" for message-passing tests is fine for people
familiar with memory models and litmus-test nomenclature, but is a bit
obtuse for others.  This commit therefore substitutes "buf" for "x" and
"flag" for "y" for the MP tests.  There are a few special-case MP tests
that use locks and these are unchanged.  There is another MP test that
uses pointers, and this is changed to name the pointer "p".

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Add types to litmus tests

This commit adds type information for global variables in the litmus
tests in order to allow easier use with klitmus7.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Add a glossary of LKMM terms

[ paulmck: Apply Alan Stern feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

docs/memory-barriers.txt: Fix a typo in CPU MEMORY BARRIERS section

Commit 39323c6 ("smp_mb__{before,after}_atomic(): update Documentation")
has a typo in CPU MEORY BARRIERS section:
"RMW functions that do not imply are memory barrier are ..." should be
"RMW functions that do not imply a memory barrier are ...".

This patch fixes this typo.

Signed-off-by: Fox Chen <foxhlchen@gmail.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Document categories of ordering primitives

The Linux kernel has a number of categories of ordering primitives, which
are recorded in the LKMM implementation and hinted at by cheatsheet.txt.
But there is no overview of these categories, and such an overview
is needed in order to understand multithreaded LKMM litmus tests.
This commit therefore adds an ordering.txt as well as extracting a
control-dependencies.txt from memory-barriers.txt.  It also updates the
README file.

[ paulmck:  Apply Akira Yokosawa file-placement feedback. ]
[ paulmck:  Apply Alan Stern feedback. ]
[ paulmck:  Apply self-review feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

kcsan: Fix encoding masks and regain address bit

The watchpoint encoding masks for size and address were off-by-one bit
each, with the size mask using 1 unnecessary bit and the address mask
missing 1 bit. However, due to the way the size is shifted into the
encoded watchpoint, we were effectively wasting and never using the
extra bit.

For example, on x86 with PAGE_SIZE==4K, we have 1 bit for the is-write
bit, 14 bits for the size bits, and then 49 bits left for the address.
Prior to this fix we would end up with this usage:

[ write<1> | size<14> | wasted<1> | address<48> ]

Fix it by subtracting 1 bit from the GENMASK() end and start ranges of
size and address respectively. The added static_assert()s verify that
the masks are as expected. With the fixed version, we get the expected
usage:

[ write<1> | size<14> | address<49> ]

Functionally no change is expected, since that extra address bit is
insignificant for enabled architectures.

Acked-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Make the units of ->init_fract be jiffies

Currently, the units of ->init_fract are milliseconds while those of
->gp_sleep are jiffies. For consistency with each other and with the
argument of schedule_timeout_idle(), this commit changes the units of
->init_fract to jiffies.

This change does affect the backoff algorithm, but only on systems where
HZ is not 1000, and even there the change makes more sense, given that the
current setup would "back off" to the same number of jiffies repeatedly.
In contrast, with this change, the number of jiffies waited increases
on each pass through the loop in the rcu_tasks_wait_gp() function.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/rcutorture: Fix BUG parsing of console.log

For the rcutorture test summary log file console.log of virtual machines is
parsed. When a console.log contains "DEBUG", BUG counter is incremented
because regular expression does not handle to ignore DEBUG.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/nolibc: Fix a spelling error in a comment

Fix a spelling in the comment line.

s/memry/memory/p

This is on linux-next.

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Make kvm-check-branches.sh use --allcpus

Currently the kvm-check-branches.sh script calculates the number of CPUs
and passes this to the kvm.sh --cpus command-line argument. This works,
but this commit saves a line by instead using the new kvm.sh --allcpus
command-line argument.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture/nolibc: Fix a typo in header file

This fixes a typo. Before this, the AT_FDCWD macro would be defined
regardless of whether or not it's been defined before.

Signed-off-by: Samuel Hernandez <sam.hernandez.amador@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Don't do need_resched() testing if ->sync is NULL

If cur_ops->sync is NULL, rcu_torture_fwd_prog_nr() will nevertheless
attempt to call through it. This commit therefore flags cases where
neither need_resched() nor call_rcu() forward-progress testing
can be performed due to NULL function pointers, and also causes
rcu_torture_fwd_prog_nr() to take an early exit if cur_ops->sync()
is NULL.

Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

locktorture: Invoke percpu_free_rwsem() to do percpu-rwsem cleanup

When executing the LOCK06 locktorture scenario featuring percpu-rwsem,
the RCU callback rcu_sync_func() may still be pending after locktorture
module is removed.  This can in turn lead to the following Oops:

  BUG: unable to handle page fault for address: ffffffffc00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 800000013691c163
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   <IRQ>
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   </IRQ>
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

This commit avoids tis problem by adding an exit hook in lock_torture_ops
and using it to call percpu_free_rwsem() for percpu rwsem torture during
the module-cleanup function, thus ensuring that rcu_sync_func() completes
before module exits.

It is also necessary to call the exit hook if lock_torture_init()
fails half-way, so this commit also adds an ->init_called field in
lock_torture_cxt to indicate that exit hook, if present, must be called.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

scftorture: Add full-test stutter capability

In virtual environments on systems with hardware assist, inter-processor
interrupts must do very different things based on whether the target
vCPU is running or not. This commit therefore enables torture-test
stuttering to better test these running/not-running transitions.

Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Allow alternative forms of kvm.sh command-line arguments

This commit allows --build-only as a synonym for --buildonly, --kconfigs
for --kconfig, and --kmake-args for --kmake-arg.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Small code cleanups

The rcu_torture_cleanup() function fails to NULL out the reader_tasks
pointer after freeing it and its fakewriter_tasks loop has redundant
braces. This commit therefore cleans these up.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Accept time units on kvm.sh --duration argument

The "--duration <minutes>" has worked well for a very long time, but
it can be inconvenient to compute the minutes for (say) a 28-hour run.
It can also be annoying to have to let a simple boot test run for a full
minute. This commit therefore permits an "s" suffix to specify seconds,
"m" to specify minutes (which remains the default), "h" suffix to specify
hours, and "d" to specify days.

With this change, "--duration 5" still specifies that each scenario
run for five minutes, but "--duration 30s" runs for only 30 seconds,
"--duration 8h" runs for eight hours, and "--duration 2d" runs for
two days.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture:  Make stutter_wait() caller restore priority

Currently, stutter_wait() will happily spin waiting for the stutter
interval to end even if the caller is running at a real-time priority
level.  This could starve normal-priority tasks for no good reason.  This
commit therefore drops the calling task's priority to SCHED_OTHER MAX_NICE
if stutter_wait() needs to wait.  But when it waits, stutter_wait()
returns true, which allows the caller to restore the priority if needed.
Callers that were already running at SCHED_OTHER MAX_NICE obviously
do not need any changes, but this commit also restores priority for
higher-priority callers.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Force weak-hashed pointers on console log

Although the rcutorture scripting now deals correctly with full-up
security-induced pointer obfuscation, it is still counter-productive for
kernel hackers who are analyzing console output.  This commit therefore
sets the debug_boot_weak_hash kernel boot parameter, which enables
printing of weak-hashed pointers for torture-test runs.

Please note that this change applies only to runs initiated by the
kvm.sh scripting.  If you are instead using modprobe and rmmod, it is
your responsibility to build and boot the underlying kernel to your taste.

Please note further that this change does not result in a security hole
in normal use.  The rcutorture testing runs with a negligible userspace,
no networking, and no user interaction.  Besides which, there is no data
of value that can be extracted from an rcutorture guest OS that could
not also be extracted from the host that this guest is running on.

Suggested-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Prevent hangs for invalid arguments

If an rcutorture torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good.  What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument.  This commit therefore forces an immediate
kernel shutdown if a rcu_torture_init()-time error occurs, thus avoiding
the appearance of a hang.  It also forces a console splat in this case
to clearly indicate the presence of an error.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Prevent jitter processes from delaying failed run

Even when the kernel panics and qemu dies, runs with jitter enabled will
continue uselessly until the jitter.sh processes terminate. This can
be annoying if a planned one-hour run instead dies during boot.

This commit therefore kills the jitter.sh processes when the run ends
more than one minute prior to the termination time specified by the
kvm.sh --duration argument or its default.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

locktorture: Prevent hangs for invalid arguments

If an locktorture torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good.  What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument.  This commit therefore forces an immediate
kernel shutdown if a lock_torture_init()-time error occurs, thus avoiding
the appearance of a hang.  It also forces a console splat in this case
to clearly indicate the presence of an error.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

locktorture: Ignore nreaders_stress if no readlock support

Exclusive locks do not have readlock support, which means that a
locktorture run with the following module parameters will do nothing:

torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1

This commit therefore rejects this combination for exclusive locks by
returning -EINVAL during module init.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Adjust scenarios SRCU-t and SRCU-u to make kconfig happy

The SRCU-u scenario expects to enable lockdep but to also disable the
CONFIG_PREEMPT_COUNT kconfig option. This no longer works. This commit
therefore instead enables lockdep in SRCU-t, which then allows SRCU-u
to disable CONFIG_PREEMPT_COUNT.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

refscale: Prevent hangs for invalid arguments

If an refscale torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good.  What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument.  This commit therefore forces an immediate
kernel shutdown if a ref_scale_init()-time error occurs, thus avoiding
the appearance of a hang.  It also forces a console splat in this case
to clearly indicate the presence of an error.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcuscale: Prevent hangs for invalid arguments

If an rcuscale torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good.  What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument.  This commit therefore forces an immediate
kernel shutdown if a rcu_scale_init()-time error occurs, thus avoiding
the appearance of a hang.  It also forces a console splat in this case
to clearly indicate the presence of an error.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Exclude "NOHZ tick-stop error" from fatal errors

The "NOHZ tick-stop error: Non-RCU local softirq work is pending"
warning happens frequently and appears to be irrelevant to the various
torture tests. This commit therefore filters it out.

If there proves to be a need to pay attention to it a later commit will
add an "advice" category to allow the user to immediately see that
although something happened, it was not an indictment of the system
being tortured.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcuscale: Avoid divide by zero

The rcuscale test module does not use batches, so there is only
ever one batch. This commit therefore informs the kvm-recheck-rcuscale.sh
script of this fact of life.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcuscale: Add RCU Tasks Trace

This commit adds the ability to test performance and scalability of RCU
Tasks Trace updaters.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

scftorture: Add an alternative IPI vector

The scftorture tests currently use only smp_call_function() and
friends, which means that these tests cannot locate bugs caused by
interactions between different IPI vectors. This commit therefore adds
the rescheduling IPI to the mix.

Note that this commit permits resched_cpus() only when scftorture is
built in. This is a workaround. Longer term, this will use real wakeups
rather than resched_cpu().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Make torture_stutter() use hrtimer

The torture_stutter() function uses schedule_timeout_interruptible()
to time the stutter duration, but this can miss race conditions due to
its being time-synchronized with everything else that is based on the
timer wheels. This commit therefore converts torture_stutter() to use
the high-resolution timers via schedule_hrtimeout(), and also to fuzz
the stutter interval. While in the area, this commit also limits the
spin-loop portion of the stutter_wait() function's wait loop to two
jiffies, down from about one second.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

torture: Periodically pause in stutter_wait()

Running locktorture scenario LOCK05 results in hangs:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05

The lock_torture_writer() kthreads set themselves to MAX_NICE while
running SCHED_OTHER.  Other locktorture kthreads run at default niceness,
also SCHED_OTHER.  This results in these other locktorture kthreads
indefinitely preempting the lock_torture_writer() kthreads.  Note that
the cond_resched() in the stutter_wait() function's loop is ineffective
because this scenario is built with CONFIG_PREEMPT=y.

It is not clear that such indefinite preemption is supposed to happen, but
in the meantime this commit prevents kthreads running in stutter_wait()
from being completely CPU-bound, thus allowing the other threads to get
some CPU in a timely fashion.  This commit also uses hrtimers to provide
very short sleeps to avoid degrading the sudden-on testing that stutter
is supposed to provide.

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

locktorture: Track time of last ->writeunlock()

This commit adds a last_lock_release variable that tracks the time of
the last ->writeunlock() call, which allows easier diagnosing of lock
hangs when using a kernel debugger.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

docs/rcu: Update the call_rcu() API

This commit updates the documented API of call_rcu() to use the
rcu_callback_t typedef instead of the open-coded function definition.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

docs: RCU: Requirements.rst: Fix a list block

As warned by Sphinx:
.../Documentation/RCU/Design/Requirements/Requirements.rst:1959: WARNING: Unexpected indentation.

The list block is missing a space before it, making Sphinx to get
it wrong. This commit therefore adds the missing space characters.

Fixes: 2a721e5f0b2c ("docs: Update RCU's hotplug requirements with a bit about design")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

docs: Update RCU's hotplug requirements with a bit about design

The rcu_barrier() section of the "Hotplug CPU" section discusses
deadlocks, however the description of deadlocks other than those involving
rcu_barrier() is rather incomplete.

This commit therefore continues the section by describing how RCU's
design handles CPU hotplug in a deadlock-free way.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs

Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency. Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.

[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>

x86/cpu: Avoid cpuinfo-induced IPI pileups

The aperfmperf_snapshot_cpu() function is invoked upon access to
/proc/cpuinfo, and it does do an early exit if the specified CPU has
recently done a snapshot.  Unfortunately, the indication that a snapshot
has been completed is set in an IPI handler, and the execution of this
handler can be delayed by any number of unfortunate events.  This means
that a system that starts a number of applications, each of which
parses /proc/cpuinfo, can suffer from an smp_call_function_single()
storm, especially given that each access to /proc/cpuinfo invokes
smp_call_function_single() for all CPUs.  Please note that this is not
theoretical speculation.  Note also that one CPU's pending IPI serves
all requests, so there is no point in ever having more than one IPI
pending to a given CPU.

This commit therefore suppresses duplicate IPIs to a given CPU via a
new ->scfpending field in the aperfmperf_sample structure.  This field
is set to the value one if an IPI is pending to the corresponding CPU
and to zero otherwise.

The aperfmperf_snapshot_cpu() function uses atomic_xchg() to set this
field to the value one and sample the old value.  If this function's
"wait" parameter is zero, smp_call_function_single() is called only if
the old value of the ->scfpending field was zero.  The IPI handler uses
atomic_set_release() to set this new field to zero just before returning,
so that the prior stores into the aperfmperf_sample structure are seen
by future requests that get to the atomic_xchg().  Future requests that
pass the elapsed-time check are ordered by the fact that on x86 loads act
as acquire loads, just as was the case prior to this change.  The return
value is based off of the age of the prior snapshot, just as before.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
[ paulmck: Allow /proc/cpuinfo to take advantage of arch_freq_get_on_cpu(). ]
[ paulmck: Add comment on memory barrier. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>

torture: Don't kill gdb sessions

The rcutorture scripting will do a "kill -9" on any guest OS that exceeds
its --duration by more than a few minutes, which is very valuable when
bugs result in hangs. However, this is a problem when the "hang" was due
to a --gdb debugging session.

This commit therefore refrains from killing the guest OS when a debugging
session is in progress. This means that the user must manually kill the
kvm.sh process group if a hang really does occur.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

refscale: Bounds-check module parameters

The default value for refscale.nreaders is -1, which results in the code
setting the value to three-quarters of the number of CPUs. On single-CPU
systems, this results in three-quarters of the value one, which the C
language's integer arithmetic rounds to zero. This in turn results in
a divide-by-zero error.

This commit therefore adds bounds checking to the refscale module
parameters, so that if they are less than one, they are set to the
value one.

Reported-by: kernel test robot <lkp@intel.com>
Tested-by "Chen, Rong A" <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Make grace-period kthread report match RCU flavor being tested

At the end of the test and after rcu_torture_writer() stalls, rcutorture
invokes show_rcu_gp_kthreads() in order to dump out information on the
RCU grace-period kthread. This makes a lot of sense when testing vanilla
RCU, but not so much for the other flavors. This commit therefore allows
per-flavor kthread-dump functions to be specified.

[ paulmck: Apply feedback from kernel test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Convert rcu_tasks_wait_gp() for-loop to while-loop

The infinite for-loop in rcu_tasks_wait_gp() has its only exit at the
top of the loop, so this commit does the straightforward conversion to
a while-loop, thus saving a few lines.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcutorture: Make preemptible TRACE02 enable lockdep

Currently, the CONFIG_PREEMPT_NONE=y rcutorture TRACE01 rcutorture
scenario enables lockdep. This limits its ability to find bugs due to
non-preemptible sections of code being RCU readers, and pretty much all
code thus appearing to lockdep to be an RCU reader. This commit therefore
moves lockdep testing to the CONFIG_PREEMPT=y rcutorture TRACE02 scenario.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent RCU_LOCKDEP_WARN() from swallowing the condition

We run into a unused variable warning in bridge code when variable is
only used inside the condition of rcu_dereference_protected().

#define mlock_dereference(X, br) \
rcu_dereference_protected(X, lockdep_is_held(&br->multicast_lock))

Since on builds with CONFIG_PROVE_RCU=n rcu_dereference_protected()
compiles to nothing the compiler doesn't see the variable use.

This commit therefore prevents this warning by adding the condition as
dead code.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: paulmck@kernel.org
CC: josh@joshtriplett.org
CC: rostedt@goodmis.org
CC: mathieu.desnoyers@efficios.com
CC: joel@joelfernandes.org
CC: jiangshanlai@gmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

lockdep: Provide dummy forward declaration of *_is_held() helpers

When CONFIG_LOCKDEP is not set, lock_is_held() and lockdep_is_held()
are not declared or defined.  This forces all callers to use #ifdefs
around these checks.

Recent RCU changes added a lot of lockdep_is_held() calls inside
rcu_dereference_protected().  This macro hides its argument on !LOCKDEP
builds, which can lead to false-positive unused-variable warnings.

This commit therefore provides forward declarations of lock_is_held()
and lockdep_is_held() but without defining them.  This way callers
(including those internal to RCU) can keep them visible to the compiler
on !LOCKDEP builds and instead depend on dead code elimination to remove
the references, which in turn prevents the linker from complaining about
the lack of the corresponding function definitions.

[ paulmck: Apply Peter Zijlstra feedback on "extern". ]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: peterz@infradead.org
CC: mingo@redhat.com
CC: will@kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

srcu: Use a more appropriate lockdep helper

The lockdep_is_held() macro is defined as:

#define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)

This hides away the dereference, so that builds with !LOCKDEP don't break.
This works in current kernels because the RCU_LOCKDEP_WARN() eliminates
its condition at preprocessor time in !LOCKDEP kernels. However, later
patches in this series will cause the compiler to see this condition even
in !LOCKDEP kernels. This commit prepares for this upcoming change by
switching from lock_is_held() to lockdep_is_held().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: jiangshanlai@gmail.com
CC: paulmck@kernel.org
CC: josh@joshtriplett.org
CC: rostedt@goodmis.org
CC: mathieu.desnoyers@efficios.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

net: sched: Remove broken definitions and un-hide for !LOCKDEP

Currently, variables used only within lockdep expressions are flagged as
unused, requiring that these variables' declarations be decorated with
either #ifdef or __maybe_unused. This results in ugly code. This commit
therefore causes the full definitions of the lockdep_tcf_chain_is_locked()
and lockdep_tcf_proto_is_locked() functions to be visible even when
lockdep is not enabled, thus removing the need for the previous empty
functions that were provided in non-lockdep kernels. This approach
further relies on dead-code elimination to remove any references to
functions or variables that are not available in non-lockdep kernels.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: jhs@mojatatu.com
CC: xiyou.wangcong@gmail.com
CC: jiri@resnulli.us
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

net: Un-hide lockdep_sock_is_held() for !LOCKDEP

Currently, variables used only within lockdep expressions are flagged
as unused, requiring that these variables' declarations be decorated
with either #ifdef or __maybe_unused. This results in ugly code.
This commit therefore causes the lockdep_sock_is_held() function to be
visible even when lockdep is not enabled, thus removing the need for
these decorations. This approach further relies on dead-code elimination
to remove any references to functions or variables that are not available
in non-lockdep kernels.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Un-hide lockdep maps for !LOCKDEP

Currently, variables used only within lockdep expressions are flagged as
unused, requiring that these variables' declarations be decorated with
either #ifdef or __maybe_unused. This results in ugly code. This commit
therefore causes the RCU lock maps to be visible even when lockdep is not
enabled, thus removing the need for these decorations. This approach
further relies on dead-code elimination to remove any references to
functions or variables that are not available in non-lockdep kernels.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

sched: Un-hide lockdep_tasklist_lock_is_held() for !LOCKDEP

Currently, variables used only within lockdep expressions are flagged as
unused, requiring that these variables' declarations be decorated with
either #ifdef or __maybe_unused. This results in ugly code. This commit
therefore causes the lockdep_tasklist_lock_is_held() function to be
visible even when lockdep is not enabled, thus removing the need for
these decorations. This approach further relies on dead-code elimination
to remove any references to functions or variables that are not available
in non-lockdep kernels.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

kcsan: Never set up watchpoints on NULL pointers

Avoid setting up watchpoints on NULL pointers, as otherwise we would
crash inside the KCSAN runtime (when checking for value changes) instead
of the instrumented code.

Because that may be confusing, skip any address less than PAGE_SIZE.

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

kcsan: selftest: Ensure that address is at least PAGE_SIZE

In preparation of supporting only addresses not within the NULL page,
change the selftest to never use addresses that are less than PAGE_SIZE.

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

doc: Present the role of READ_ONCE()

This commit adds an explanation of the special cases where READ_ONCE()
may be used in place of a member of the rcu_dereference() family.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools/memory-model: Move Documentation description to Documentation/README

This commit moves the descriptions of the files residing in
tools/memory-model/Documentation to a README file in that directory,
leaving behind the description of tools/memory-model/Documentation/README
itself. After this change, tools/memory-model/Documentation/README
provides a guide to the files in the tools/memory-model/Documentation
directory, guiding people with different skills and needs to the most
appropriate starting point.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

tools: memory-model: Document that the LKMM can easily miss control dependencies

Add a small section to the litmus-tests.txt documentation file for
the Linux Kernel Memory Model explaining that the memory model often
fails to recognize certain control dependencies.

Suggested-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Linux 5.10-rc1

treewide: Convert macro and uses of __section(foo) to __section("foo")

Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kernel/sys.c: fix prototype of prctl_get_tid_address()

tid_addr is not a "pointer to (pointer to int in userspace)"; it is in
fact a "pointer to (pointer to int in userspace) in userspace". So
sparse rightfully complains about passing a kernel pointer to
put_user().

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: remove kzfree() compatibility definition

Commit 453431a54934 ("mm, treewide: rename kzfree() to
kfree_sensitive()") renamed kzfree() to kfree_sensitive(),
but it left a compatibility definition of kzfree() to avoid
being too disruptive.

Since then a few more instances of kzfree() have slipped in.

Just get rid of them and remove the compatibility definition
once and for all.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

checkpatch: enable GIT_DIR environment use to set git repository location

If set, use the environment variable GIT_DIR to change the default .git
location of the kernel git tree.

If GIT_DIR is unset, keep using the current ".git" default.

Link: https://lkml.kernel.org/r/c5e23b45562373d632fccb8bc04e563abba4dd1d.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'timers-urgent-2020-10-25' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
"A time namespace fix and a matching selftest. The futex absolute
  timeouts which are based on CLOCK_MONOTONIC require time namespace
  corrected. This was missed in the original time namesapce support"

* tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/timens: Add a test for futex()
  futex: Adjust absolute futex timeouts with per time namespace offset

Merge tag 'sched-urgent-2020-10-25' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
"Two scheduler fixes:

   - A trivial build fix for sched_feat() to compile correctly with
     CONFIG_JUMP_LABEL=n

   - Replace a zero lenght array with a flexible array"

* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/features: Fix !CONFIG_JUMP_LABEL case
  sched: Replace zero-length array with flexible-array

Merge tag 'perf-urgent-2020-10-25' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Thomas Gleixner:
"A single fix to compute the field offset of the SNOOPX bit in the data
source bitmask of perf events correctly"

* tag 'perf-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: correct SNOOPX field offset

Merge tag 'locking-urgent-2020-10-25' of git://git./linux/kernel/git/tip/tip

Pull locking fix from Thomas Gleixner:
"Just a trivial fix for kernel-doc warnings"

* tag 'locking-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/seqlocks: Fix kernel-doc warnings

Merge tag 'ntb-5.10' of git://github.com/jonmason/ntb

Pull NTB fixes from Jon Mason.

* tag 'ntb-5.10' of git://github.com/jonmason/ntb:
  NTB: Use struct_size() helper in devm_kzalloc()
  ntb: intel: Fix memleak in intel_ntb_pci_probe
  NTB: hw: amd: fix an issue about leak system resources

Merge branch 'i2c/for-5.10' of git://git./linux/kernel/git/wsa/linux

Pull i2c fix from Wolfram Sang:
"Regression fix for rc1 and stable kernels as well"

* 'i2c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: core: Restore acpi_walk_dep_device_list() getting called after registering the ACPI i2c devs

Merge tag '5.10-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull more cifs updates from Steve French:
"Add support for stat of various special file types (WSL reparse points
  for char, block, fifo)"

* tag '5.10-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number
  smb3: add some missing definitions from MS-FSCC
  smb3: remove two unused variables
  smb3: add support for stat of WSL reparse points for special file types

Merge branch 'parisc-5.10-2' of git://git./linux/kernel/git/deller/parisc-linux

Pull more parisc updates from Helge Deller:

- During this merge window O_NONBLOCK was changed to become 000200000,
   but we missed that the syscalls timerfd_create(), signalfd4(),
   eventfd2(), pipe2(), inotify_init1() and userfaultfd() do a strict
   bit-wise check of the flags parameter.

   To provide backward compatibility with existing userspace we
   introduce parisc specific wrappers for those syscalls which filter
   out the old O_NONBLOCK value and replaces it with the new one.

- Prevent HIL bus driver to get stuck when keyboard or mouse isn't
   attached

- Improve error return codes when setting rtc time

- Minor documentation fix in pata_ns87415.c

* 'parisc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  ata: pata_ns87415.c: Document support on parisc with superio chip
  parisc: Add wrapper syscalls to fix O_NONBLOCK flag usage
  hil/parisc: Disable HIL driver when it gets stuck
  parisc: Improve error return codes when setting rtc time

Merge tag 'for-linus-5.10b-rc1c-tag' of git://git./linux/kernel/git/xen/tip

Pull more xen updates from Juergen Gross:

- a series for the Xen pv block drivers adding module parameters for
   better control of resource usge

- a cleanup series for the Xen event driver

* tag 'for-linus-5.10b-rc1c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Documentation: add xen.fifo_events kernel parameter description
  xen/events: unmask a fifo event channel only if it was masked
  xen/events: only register debug interrupt for 2-level events
  xen/events: make struct irq_info private to events_base.c
  xen: remove no longer used functions
  xen-blkfront: Apply changed parameter name to the document
  xen-blkfront: add a parameter for disabling of persistent grants
  xen-blkback: add a parameter for disabling of persistent grants

Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linux

Pull SafeSetID updates from Micah Morton:
"The changes are mostly contained to within the SafeSetID LSM, with the
  exception of a few 1-line changes to change some ns_capable() calls to
  ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that
  is examined by SafeSetID code and nothing else in the kernel.

  The changes to SafeSetID internally allow for setting up GID
  transition security policies, as already existed for UIDs"

* tag 'safesetid-5.10' of git://github.com/micah-morton/linux:
  LSM: SafeSetID: Fix warnings reported by test bot
  LSM: SafeSetID: Add GID security policy handling
  LSM: Signal to SafeSetID when setting group IDs

Merge tag '20201024-v4-5.10' of git://git./linux/kernel/git/wtarreau/prandom

Pull random32 updates from Willy Tarreau:
"Make prandom_u32() less predictable.

  This is the cleanup of the latest series of prandom_u32
  experimentations consisting in using SipHash instead of Tausworthe to
  produce the randoms used by the network stack.

  The changes to the files were kept minimal, and the controversial
  commit that used to take noise from the fast_pool (f227e3ec3b5c) was
  reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is
  fed from various sources of activities (networking, scheduling) to
  perturb the SipHash state using fast, non-trivially predictable data,
  instead of keeping it fully deterministic. The goal is essentially to
  make any occasional memory leakage or brute-force attempt useless.

  The resulting code was verified to be very slightly faster on x86_64
  than what is was with the controversial commit above, though this
  remains barely above measurement noise. It was also tested on i386 and
  arm, and build- tested only on arm64"

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
* tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom:
  random32: add a selftest for the prandom32 code
  random32: add noise from network and scheduling activity
  random32: make prandom_u32() output unpredictable

i2c: core: Restore acpi_walk_dep_device_list() getting called after registering the ACPI i2c devs

Commit 21653a4181ff ("i2c: core: Call i2c_acpi_install_space_handler()
before i2c_acpi_register_devices()")'s intention was to only move the
acpi_install_address_space_handler() call to the point before where
the ACPI declared i2c-children of the adapter where instantiated by
i2c_acpi_register_devices().

But i2c_acpi_install_space_handler() had a call to
acpi_walk_dep_device_list() hidden (that is I missed it) at the end
of it, so as an unwanted side-effect now acpi_walk_dep_device_list()
was also being called before i2c_acpi_register_devices().

Move the acpi_walk_dep_device_list() call to the end of
i2c_acpi_register_devices(), so that it is once again called *after*
the i2c_client-s hanging of the adapter have been created.

This fixes the Microsoft Surface Go 2 hanging at boot.

Fixes: 21653a4181ff ("i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209627
Reported-by: Rainer Finke <rainer@finke.cc>
Reported-by: Kieran Bingham <kieran.bingham@ideasonboard.com>
Suggested-by: Maximilian Luz <luzmaximilian@gmail.com>
Tested-by: Kieran Bingham <kieran.bingham@ideasonboard.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>

Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph
     - rdma error handling fixes (Chao Leng)
     - fc error handling and reconnect fixes (James Smart)
     - fix the qid displace when tracing ioctl command (Keith Busch)
     - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
     - fix MTDT for passthru (Logan Gunthorpe)
     - blacklist Write Same on more devices (Kai-Heng Feng)
     - fix an uninitialized work struct (zhenwei pi)"

- lightnvm out-of-bounds fix (Colin)

- SG allocation leak fix (Doug)

- rnbd fixes (Gioh, Guoqing, Jack)

- zone error translation fixes (Keith)

- kerneldoc markup fix (Mauro)

- zram lockdep fix (Peter)

- Kill unused io_context members (Yufen)

- NUMA memory allocation cleanup (Xianting)

- NBD config wakeup fix (Xiubo)

* tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
  block: blk-mq: fix a kernel-doc markup
  nvme-fc: shorten reconnect delay if possible for FC
  nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
  nvme-fc: fix error loop in create_hw_io_queues
  nvme-fc: fix io timeout to abort I/O
  null_blk: use zone status for max active/open
  nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
  nvmet: cleanup nvmet_passthru_map_sg()
  nvmet: limit passthru MTDS by BIO_MAX_PAGES
  nvmet: fix uninitialized work for zero kato
  nvme-pci: disable Write Zeroes on Sandisk Skyhawk
  nvme: use queuedata for nvme_req_qid
  nvme-rdma: fix crash due to incorrect cqe
  nvme-rdma: fix crash when connect rejected
  block: remove unused members for io_context
  blk-mq: remove the calling of local_memory_node()
  zram: Fix __zram_bvec_{read,write}() locking order
  skd_main: remove unused including <linux/version.h>
  sgl_alloc_order: fix memory leak
  lightnvm: fix out-of-bounds write to array devices->info[]
  ...

Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- fsize was missed in previous unification of work flags

- Few fixes cleaning up the flags unification creds cases (Pavel)

- Fix NUMA affinities for completely unplugged/replugged node for io-wq

- Two fallout fixes from the set_fs changes. One local to io_uring, one
   for the splice entry point that io_uring uses.

- Linked timeout fixes (Pavel)

- Removal of ->flush() ->files work-around that we don't need anymore
   with referenced files (Pavel)

- Various cleanups (Pavel)

* tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
  splice: change exported internal do_splice() helper to take kernel offset
  io_uring: make loop_rw_iter() use original user supplied pointers
  io_uring: remove req cancel in ->flush()
  io-wq: re-set NUMA node affinities if CPUs come online
  io_uring: don't reuse linked_timeout
  io_uring: unify fsize with def->work_flags
  io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
  io_uring: do poll's hash_node init in common code
  io_uring: inline io_poll_task_handler()
  io_uring: remove extra ->file check in poll prep
  io_uring: make cached_cq_overflow non atomic_t
  io_uring: inline io_fail_links()
  io_uring: kill ref get/drop in personality init
  io_uring: flags-based creds init in queue

Merge tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block

Pull libata fixes from Jens Axboe:
"Two minor libata fixes:

   - Fix a DMA boundary mask regression for sata_rcar (Geert)

   - kerneldoc markup fix (Mauro)"

* tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
  ata: fix some kernel-doc markups
  ata: sata_rcar: Fix DMA boundary mask

Merge branch 'work.misc' of git://git./linux/kernel/git/viro/vfs

Pull misc vfs updates from Al Viro:
"Assorted stuff all over the place (the largest group here is
  Christoph's stat cleanups)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove KSTAT_QUERY_FLAGS
  fs: remove vfs_stat_set_lookup_flags
  fs: move vfs_fstatat out of line
  fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
  fs: remove vfs_statx_fd
  fs: omfs: use kmemdup() rather than kmalloc+memcpy
  [PATCH] reduce boilerplate in fsid handling
  fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
  selftests: mount: add nosymfollow tests
  Add a "nosymfollow" mount option.

Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

- document the new dma_{alloc,free}_pages() API

- two fixups for the dma-mapping.h split

* tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: document dma_{alloc,free}_pages
  dma-mapping: move more functions to dma-map-ops.h
  ARM/sa1111: add a missing include of dma-map-ops.h

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"Two fixes for this merge window, and an unrelated bugfix for a host
  hang"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: ioapic: break infinite recursion on lazy EOI
  KVM: vmx: rename pi_init to avoid conflict with paride
  KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build

Merge tag 'x86_seves_fixes_for_v5.10_rc1' of git://git./linux/kernel/git/tip/tip

Pull x86 SEV-ES fixes from Borislav Petkov:
"Three fixes to SEV-ES to correct setting up the new early pagetable on
  5-level paging machines, to always map boot_params and the kernel
  cmdline, and disable stack protector for ../compressed/head{32,64}.c.
  (Arvind Sankar)"

* tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Explicitly map boot_params and command line
  x86/head/64: Disable stack protection for head$(BITS).o
  x86/boot/64: Initialize 5-level paging variables earlier