Sebastian Andrzej Siewior [Tue, 3 Aug 2021 14:16:15 +0000 (16:16 +0200)]
sched: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-33-bigeasy@linutronix.de
Quentin Perret [Thu, 5 Aug 2021 10:21:54 +0000 (11:21 +0100)]
sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
the call must not touch scheduling parameters (nice or priority). This
is particularly handy for uclamp when used in conjunction with
SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only
impacts uclamp values.
However, sched_setattr always checks whether the priorities and nice
values passed in sched_attr are valid first, even if those never get
used down the line. This is useless at best since userspace can
trivially bypass this check to set the uclamp values by specifying low
priorities. However, it is cumbersome to do so as there is no single
expression of this that skips both RT and CFS checks at once. As such,
userspace needs to query the task policy first with e.g. sched_getattr
and then set sched_attr.sched_priority accordingly. This is racy and
slower than a single call.
As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
is specified, simply inherit them in this case to match the policy
inheritance of SCHED_FLAG_KEEP_POLICY.
Reported-by: Wei Wang <wvw@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-3-qperret@google.com
Quentin Perret [Thu, 5 Aug 2021 10:21:53 +0000 (11:21 +0100)]
sched: Fix UCLAMP_FLAG_IDLE setting
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.
However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,ind}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.
Fix this by clearing the flag in update_uclamp_active() as well.
Fixes:
e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
Dietmar Eggemann [Wed, 4 Aug 2021 13:59:25 +0000 (15:59 +0200)]
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
A missing clock update is causing the following warning:
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
sub_running_bw.isra.0+0x190/0x1a0
...
CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
Hardware name: WIWYNN Mt.Jade Server System
B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.
20210526 (SCP:
1.06.
20210526) 2021/05/26
...
Call trace:
sub_running_bw.isra.0+0x190/0x1a0
migrate_task_rq_dl+0xf8/0x1e0
set_task_cpu+0xa8/0x1f0
try_to_wake_up+0x150/0x3d4
wake_up_q+0x64/0xc0
__up_write+0xd0/0x1c0
up_write+0x4c/0x2b0
cppc_set_perf+0x120/0x2d0
cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
__cpufreq_driver_target+0x74/0x140
sugov_work+0x64/0x80
kthread_worker_fn+0xe0/0x230
kthread+0x138/0x140
ret_from_fork+0x10/0x18
The task causing this is the `cppc_fie` DL task introduced by
commit
1eb5dde674f5 ("cpufreq: CPPC: Add support for frequency
invariance").
With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on
slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm
Server):
DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
is in `non_contending` state, migrate_task_rq_dl() calls
sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
rq_clock()->assert_clock_updated()
on p.
Fix this by updating the clock for a non_contending task in
migrate_task_rq_dl() before calling sub_running_bw().
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com
Mel Gorman [Wed, 4 Aug 2021 11:58:57 +0000 (12:58 +0100)]
sched/fair: Avoid a second scan of target in select_idle_cpu
When select_idle_cpu starts scanning for an idle CPU, it starts with
a target CPU that has already been checked by select_idle_sibling.
This patch starts with the next CPU instead.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210804115857.6253-3-mgorman@techsingularity.net
Mel Gorman [Wed, 4 Aug 2021 11:58:56 +0000 (12:58 +0100)]
sched/fair: Use prev instead of new target as recent_used_cpu
After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the
last wakeup. It still works, but is less efficient than it could be.
This patch preserves recent_used_cpu for longer.
The impact on SIS efficiency is tiny so the SIS statistic patches were
used to track the hit rate for using recent_used_cpu. With perf bench
pipe on a 2-socket Cascadelake machine, the hit rate went from 57.14%
to 85.32%. For more intensive wakeup loads like hackbench, the hit rate
is almost negligible but rose from 0.21% to 6.64%. For scaling loads
like tbench, the hit rate goes from almost 0% to 25.42% overall. Broadly
speaking, on tbench, the success rate is much higher for lower thread
counts and drops to almost 0 as the workload scales to towards saturation.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210804115857.6253-2-mgorman@techsingularity.net
Quentin Perret [Tue, 27 Jul 2021 10:11:02 +0000 (11:11 +0100)]
sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
SCHED_FLAG_SUGOV is supposed to be a kernel-only flag that userspace
cannot interact with. However, sched_getattr() currently reports it
in sched_flags if called on a sugov worker even though it is not
actually defined in a UAPI header. To avoid this, make sure to
clean-up the sched_flags field in sched_getattr() before returning to
userspace.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-3-qperret@google.com
Quentin Perret [Tue, 27 Jul 2021 10:11:01 +0000 (11:11 +0100)]
sched/deadline: Fix reset_on_fork reporting of DL tasks
It is possible for sched_getattr() to incorrectly report the state of
the reset_on_fork flag when called on a deadline task.
Indeed, if the flag was set on a deadline task using sched_setattr()
with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
p->sched_reset_on_fork will be set, but __setscheduler() will bail out
early, which means that the dl_se->flags will not get updated by
__setscheduler_params()->__setparam_dl(). Consequently, if
sched_getattr() is then called on the task, __getparam_dl() will
override kattr.sched_flags with the now out-of-date copy in dl_se->flags
and report the stale value to userspace.
To fix this, make sure to only copy the flags that are relevant to
sched_deadline to and from the dl_se->flags field.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
Wang Hui [Wed, 21 Jul 2021 09:11:09 +0000 (17:11 +0800)]
sched: remove redundant on_rq status change
activate_task/deactivate_task will change on_rq status,
no need to do it again.
Signed-off-by: Wang Hui <john.wanghui@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210721091109.1406043-1-john.wanghui@huawei.com
Mika Penttilä [Thu, 22 Jul 2021 06:39:46 +0000 (09:39 +0300)]
sched/numa: Fix is_core_idle()
Use the loop variable instead of the function argument to test the
other SMT siblings for idle.
Fixes:
ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Signed-off-by: Mika Penttilä <mika.penttila@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com
Yuan ZhaoXiong [Sun, 6 Jun 2021 13:11:55 +0000 (21:11 +0800)]
sched: Optimize housekeeping_cpumask() in for_each_cpu_and()
On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
the others are used for housekeeping. When many housekeeping cpus are
in idle state, we can observe huge time burn in the loop for searching
nearest busy housekeeper cpu by ftrace.
9) | get_nohz_timer_target() {
9) | housekeeping_test_cpu() {
9) 0.390 us | housekeeping_get_mask.part.1();
9) 0.561 us | }
9) 0.090 us | __rcu_read_lock();
9) 0.090 us | housekeeping_cpumask();
9) 0.521 us | housekeeping_cpumask();
9) 0.140 us | housekeeping_cpumask();
...
9) 0.500 us | housekeeping_cpumask();
9) | housekeeping_any_cpu() {
9) 0.090 us | housekeeping_get_mask.part.1();
9) 0.100 us | sched_numa_find_closest();
9) 0.491 us | }
9) 0.100 us | __rcu_read_unlock();
9) + 76.163 us | }
for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
function the
for_each_cpu_and(i, sched_domain_span(sd),
housekeeping_cpumask(HK_FLAG_TIMER))
equals to below:
for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
That will cause that housekeeping_cpumask() will be invoked many times.
The housekeeping_cpumask() function returns a const value, so it is
unnecessary to invoke it every time. This patch can minimize the worst
searching time from ~76us to ~16us in my testing.
Similarly, the find_new_ilb() function has the same problem.
Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
Hailong Liu [Sun, 6 Jun 2021 11:54:51 +0000 (19:54 +0800)]
sched/sysctl: Move extern sysctl declarations to sched.h
Since commit '
8a99b6833c88(sched: Move SCHED_DEBUG sysctl to debugfs)',
SCHED_DEBUG sysctls are moved to debugfs, so these extern sysctls in
include/linux/sched/sysctl.h are no longer needed for sysctl.c, even
some are no longer needed.
So move those extern sysctls that needed by kernel/sched/debug.c to
kernel/sched/sched.h, and remove others that are no longer needed.
Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210606115451.26745-1-liuhailongg6@163.com
Julian Wiedmann [Tue, 1 Jun 2021 15:11:20 +0000 (17:11 +0200)]
wait: use LIST_HEAD_INIT() to initialize wait_queue_head
Replace the open-coded initialization with the right macro.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210601151120.329223-1-jwi@linux.ibm.com
Valentin Schneider [Tue, 18 May 2021 13:07:25 +0000 (14:07 +0100)]
sched/debug: Don't update sched_domain debug directories before sched_debug_init()
Since CPU capacity asymmetry can stem purely from maximum frequency
differences (e.g. Pixel 1), a rebuild of the scheduler topology can be
issued upon loading cpufreq, see:
arch_topology.c::init_cpu_capacity_callback()
Turns out that if this rebuild happens *before* sched_debug_init() is
run (which is a late initcall), we end up messing up the sched_domain debug
directory: passing a NULL parent to debugfs_create_dir() ends up creating
the directory at the debugfs root, which in this case creates
/sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains).
This currently doesn't happen on asymmetric systems which use cpufreq-scpi
or cpufreq-dt drivers, as those are loaded via
deferred_probe_initcall() (it is also a late initcall, but appears to be
ordered *after* sched_debug_init()).
Ionela has been working on detecting maximum frequency asymmetry via ACPI,
and that actually happens via a *device* initcall, thus before
sched_debug_init(), and causes the aforementionned debugfs mayhem.
One option would be to punt sched_debug_init() down to
fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running
before sched_debug_init() appears to be the safer option.
Fixes:
3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs")
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com
Odin Ugedal [Thu, 24 Jun 2021 11:18:15 +0000 (13:18 +0200)]
sched/fair: Ensure _sum and _avg values stay consistent
The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.
This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.
Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
Beata Michalska [Thu, 3 Jun 2021 14:06:27 +0000 (15:06 +0100)]
sched/doc: Update the CPU capacity asymmetry bits
Update the documentation bits referring to capacity aware scheduling
with regards to newly introduced SD_ASYM_CPUCAPACITY_FULL sched_domain
flag.
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210603140627.8409-4-beata.michalska@arm.com
Beata Michalska [Thu, 3 Jun 2021 14:06:26 +0000 (15:06 +0100)]
sched/topology: Rework CPU capacity asymmetry detection
Currently the CPU capacity asymmetry detection, performed through
asym_cpu_capacity_level, tries to identify the lowest topology level
at which the highest CPU capacity is being observed, not necessarily
finding the level at which all possible capacity values are visible
to all CPUs, which might be bit problematic for some possible/valid
asymmetric topologies i.e.:
DIE [ ]
MC [ ][ ]
CPU [0] [1] [2] [3] [4] [5] [6] [7]
Capacity |.....| |.....| |.....| |.....|
L M B B
Where:
arch_scale_cpu_capacity(L) = 512
arch_scale_cpu_capacity(M) = 871
arch_scale_cpu_capacity(B) = 1024
In this particular case, the asymmetric topology level will point
at MC, as all possible CPU masks for that level do cover the CPU
with the highest capacity. It will work just fine for the first
cluster, not so much for the second one though (consider the
find_energy_efficient_cpu which might end up attempting the energy
aware wake-up for a domain that does not see any asymmetry at all)
Rework the way the capacity asymmetry levels are being detected,
allowing to point to the lowest topology level (for a given CPU), where
full set of available CPU capacities is visible to all CPUs within given
domain. As a result, the per-cpu sd_asym_cpucapacity might differ across
the domains. This will have an impact on EAS wake-up placement in a way
that it might see different range of CPUs to be considered, depending on
the given current and target CPUs.
Additionally, those levels, where any range of asymmetry (not
necessarily full) is being detected will get identified as well.
The selected asymmetric topology level will be denoted by
SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels'
would receive the already used SD_ASYM_CPUCAPACITY flag. This allows
maintaining the current behaviour for asymmetric topologies, with
misfit migration operating correctly on lower levels, if applicable,
as any asymmetry is enough to trigger the misfit migration.
The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
relate to the full asymmetry level denoted by the sd_asym_cpucapacity
pointer.
Detecting the CPU capacity asymmetry is being based on a set of
available CPU capacities for all possible CPUs. This data is being
generated upon init and updated once CPU topology changes are being
detected (through arch_update_cpu_topology). As such, any changes
to identified CPU capacities (like initializing cpufreq) need to be
explicitly advertised by corresponding archs to trigger rebuilding
the data.
Additional -dflags- parameter, used when building sched domains, has
been removed as well, as the asymmetry flags are now being set directly
in sd_init.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
Beata Michalska [Thu, 3 Jun 2021 14:06:25 +0000 (15:06 +0100)]
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
Introducing new, complementary to SD_ASYM_CPUCAPACITY, sched_domain
topology flag, to distinguish between shed_domains where any CPU
capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where
a full set of CPU capacities is visible to all domain members
(SD_ASYM_CPUCAPACITY_FULL).
With the distinction between full and partial CPU capacity asymmetry,
brought in by the newly introduced flag, the scope of the original
SD_ASYM_CPUCAPACITY flag gets shifted, still maintaining the existing
behaviour when one is detected on a given sched domain, allowing
misfit migrations within sched domains that do not observe full range
of CPU capacities but still do have members with different capacity
values. It loses though it's meaning when it comes to the lowest CPU
asymmetry sched_domain level per-cpu pointer, which is to be now
denoted by SD_ASYM_CPUCAPACITY_FULL flag.
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
Zhaoyang Huang [Fri, 11 Jun 2021 00:29:34 +0000 (08:29 +0800)]
psi: Fix race between psi_trigger_create/destroy
Race detected between psi_trigger_destroy/create as shown below, which
cause panic by accessing invalid psi_system->poll_wait->wait_queue_entry
and psi_system->poll_timer->entry->next. Under this modification, the
race window is removed by initialising poll_wait and poll_timer in
group_init which are executed only once at beginning.
psi_trigger_destroy() psi_trigger_create()
mutex_lock(trigger_lock);
rcu_assign_pointer(poll_task, NULL);
mutex_unlock(trigger_lock);
mutex_lock(trigger_lock);
if (!rcu_access_pointer(group->poll_task)) {
timer_setup(poll_timer, poll_timer_fn, 0);
rcu_assign_pointer(poll_task, task);
}
mutex_unlock(trigger_lock);
synchronize_rcu();
del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
psi_trigger_create()
So, trigger_lock/RCU correctly protects destruction of
group->poll_task but misses this race affecting poll_timer and
poll_wait.
Fixes:
461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
Co-developed-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
Huaixin Chang [Mon, 21 Jun 2021 09:27:58 +0000 (17:27 +0800)]
sched/fair: Introduce the burstable CFS controller
The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. And they are latency sensitive at the same time so that
throttling them is undesired.
We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.
Traditional (UP-EDF) bandwidth control is something like:
(U = \Sum u_i) <= 1
This guaranteeds both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.
This work observes that a workload doesn't always executes the full
quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.
At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).
The benefit of burst is seen when testing with schbench. Default value of
kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
The average CPU usage is at 80%. I run this for 10 times, and got long tail
latency for 6 times and got throttled for 8 times.
Tail latencies are shown below, and it wasn't the worst case.
Latency percentiles (usec)
50.0000th: 19872
75.0000th: 21344
90.0000th: 22176
95.0000th: 22496
*99.0000th: 22752
99.5000th: 22752
99.9000th: 22752
min=0, max=22727
rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
The interferenece when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there many cgroups or CPU is under utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/
5371BD36-55AE-4F71-B9D7-
B86DC32E3D2B@linux.alibaba.com/
Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
Qais Yousef [Thu, 17 Jun 2021 16:51:55 +0000 (17:51 +0100)]
sched/uclamp: Fix uclamp_tg_restrict()
Now cpu.uclamp.min acts as a protection, we need to make sure that the
uclamp request of the task is within the allowed range of the cgroup,
that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
tg->uclamp[UCLAMP_MAX].
As reported by Xuewen [1] we can have some corner cases where there's
inversion between uclamp requested by task (p) and the uclamp values of
the taskgroup it's attached to (tg). Following table demonstrates
2 corner cases:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 60%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 20%
-----------+-----+------+-----------
With this fix we get:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 30%
-----------+-----+------+-----------
Additionally uclamp_update_active_tasks() must now unconditionally
update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
instance could have an impact on the effective UCLAMP_MIN of the tasks.
| p | tg | effective
-----------+-----+------+-----------
old
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
*new*
-----------+-----+------+-----------
uclamp_min | 60% | 0% | *60%*
-----------+-----+------+-----------
uclamp_max | 80% |*70%* | *70%*
-----------+-----+------+-----------
[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
Fixes:
0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
Vincent Donnefort [Mon, 21 Jun 2021 10:37:52 +0000 (11:37 +0100)]
sched/rt: Fix Deadline utilization tracking during policy change
DL keeps track of the utilization on a per-rq basis with the structure
avg_dl. This utilization is updated during task_tick_dl(),
put_prev_task_dl() and set_next_task_dl(). However, when the current
running task changes its policy, set_next_task_dl() which would usually
take care of updating the utilization when the rq starts running DL
tasks, will not see a such change, leaving the avg_dl structure outdated.
When that very same task will be dequeued later, put_prev_task_dl() will
then update the utilization, based on a wrong last_update_time, leading to
a huge spike in the DL utilization signal.
The signal would eventually recover from this issue after few ms. Even
if no DL tasks are run, avg_dl is also updated in
__update_blocked_others(). But as the CPU capacity depends partly on the
avg_dl, this issue has nonetheless a significant impact on the scheduler.
Fix this issue by ensuring a load update when a running task changes
its policy to DL.
Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
Vincent Donnefort [Mon, 21 Jun 2021 10:37:51 +0000 (11:37 +0100)]
sched/rt: Fix RT utilization tracking during policy change
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the current
running task changes its policy, set_next_task_rt() which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see a such change, leaving the avg_rt structure outdated. When
that very same task will be dequeued later, put_prev_task_rt() will then
update the utilization, based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.
The signal would eventually recover from this issue after few ms. Even if
no RT tasks are run, avg_rt is also updated in __update_blocked_others().
But as the CPU capacity depends partly on the avg_rt, this issue has
nonetheless a significant impact on the scheduler.
Fix this issue by ensuring a load update when a running task changes
its policy to RT.
Fixes:
371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
Peter Zijlstra [Fri, 11 Jun 2021 08:28:17 +0000 (10:28 +0200)]
sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:16 +0000 (10:28 +0200)]
sched,arch: Remove unused TASK_STATE offsets
All 6 architectures define TASK_STATE in asm-offsets, but then never
actually use it. Remove the definitions to make sure they never will.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210611082838.472811363@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:15 +0000 (10:28 +0200)]
sched,timer: Use __set_current_state()
There's an existing helper for setting TASK_RUNNING; must've gotten
lost last time we did this cleanup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.409696194@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:14 +0000 (10:28 +0200)]
sched: Add get_current_state()
Remove yet another few p->state accesses.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:13 +0000 (10:28 +0200)]
sched,perf,kvm: Fix preemption condition
When ran from the sched-out path (preempt_notifier or perf_event),
p->state is irrelevant to determine preemption. You can get preempted
with !task_is_running() just fine.
The right indicator for preemption is if the task is still on the
runqueue in the sched-out path.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:12 +0000 (10:28 +0200)]
sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
Peter Zijlstra [Fri, 11 Jun 2021 08:28:11 +0000 (10:28 +0200)]
sched: Unbreak wakeups
Remove broken task->state references and let wake_up_process() DTRT.
The anti-pattern in these patches breaks the ordering of ->state vs
COND as described in the comment near set_current_state() and can lead
to missed wakeups:
(OoO load, observes RUNNING)<-.
for (;;) { |
t->state = UNINTERRUPTIBLE; |
smp_mb(); ,-----> | (observes !COND)
| /
if (COND) ---------' | COND = 1;
break; `- if (t->state != RUNNING)
wake_up_process(t); // not done
schedule(); // forever waiting
}
t->state = TASK_RUNNING;
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.160855222@infradead.org
Ingo Molnar [Fri, 18 Jun 2021 09:31:25 +0000 (11:31 +0200)]
Merge branch 'sched/urgent' into sched/core, to resolve conflicts
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 15 Jun 2021 11:16:11 +0000 (12:16 +0100)]
sched/fair: Age the average idle time
This is a partial forward-port of Peter Ziljstra's work first posted
at:
https://lore.kernel.org/lkml/
20180530142236.
667774973@infradead.org/
Currently select_idle_cpu()'s proportional scheme uses the average idle
time *for when we are idle*, that is temporally challenged. When a CPU
is not at all idle, we'll happily continue using whatever value we did
see when the CPU goes idle. To fix this, introduce a separate average
idle and age it (the existing value still makes sense for things like
new-idle balancing, which happens when we do go idle).
The overall goal is to not spend more time scanning for idle CPUs than
we're idle for. Otherwise we're inhibiting work. This means that we need to
consider the cost over all the wake-ups between consecutive idle periods.
To track this, the scan cost is subtracted from the estimated average
idle time.
The impact of this patch is related to workloads that have domains that
are fully busy or overloaded. Without the patch, the scan depth may be
too high because a CPU is not reaching idle.
Due to the nature of the patch, this is a regression magnet. It
potentially wins when domains are almost fully busy or overloaded --
at that point searches are likely to fail but idle is not being aged
as CPUs are active so search depth is too large and useless. It will
potentially show regressions when there are idle CPUs and a deep search is
beneficial. This tbench result on a 2-socket broadwell machine partially
illustates the problem
5.13.0-rc2 5.13.0-rc2
vanilla sched-avgidle-v1r5
Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%*
Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%*
Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%*
Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%*
Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%*
Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%*
Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%*
Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%*
Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%*
Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%*
Note that 8 was a regression point where a deeper search would have helped
but it gains for high thread counts when searches are useless. Hackbench
is a more extreme example although not perfect as the tasks idle rapidly
hackbench-process-pipes
5.13.0-rc2 5.13.0-rc2
vanilla sched-avgidle-v1r5
Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%)
Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%)
Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%)
Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%*
Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%*
Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%*
Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%)
Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%*
Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%*
Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%*
Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%*
Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%*
Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%*
Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%*
Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%*
Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%)
Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%)
Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%)
Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%)
Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%)
Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%)
Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%)
Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%)
Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%)
Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%)
Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%)
Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%)
Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%)
Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%)
Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%)
Again, higher thread counts benefit and the standard deviation
shows that results are also a lot more stable when the idle
time is aged.
The patch potentially matters when a socket was multiple LLCs as the
maximum search depth is lower. However, some of the test results were
suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and
other results were not dramatically different to other mcahines.
Given the nature of the patch, Peter's full series is not being forward
ported as each part should stand on its own. Preferably they would be
merged at different times to reduce the risk of false bisections.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
Lukasz Luba [Mon, 14 Jun 2021 19:12:38 +0000 (20:12 +0100)]
sched/cpufreq: Consider reduced CPU capacity in energy calculation
Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. The map_util_freq() exists to do that.
There are corner cases where the max allowed frequency might be reduced
(due to thermal). SchedUtil as a CPUFreq governor, is aware of that
but EAS is not. This patch aims to address it.
SchedUtil stores the maximum allowed frequency in
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the real used frequency. That value is made after a call to
cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
In the existing code EAS is not able to predict that real frequency.
This leads to energy estimation errors.
To avoid wrong energy estimation in EAS (due to frequency miss prediction)
make sure that the step which calculates Performance Domain frequency,
is also aware of the allowed CPU capacity.
Furthermore, modify map_util_freq() to not extend the frequency value.
Instead, use map_util_perf() to extend the util value in both places:
SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
In the end, we achieve the same desirable behavior for both subsystems
and alignment in regards to the real CPU frequency.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
Lukasz Luba [Mon, 14 Jun 2021 19:11:28 +0000 (20:11 +0100)]
sched/fair: Take thermal pressure into account while estimating energy
Energy Aware Scheduling (EAS) needs to be able to predict the frequency
requests made by the SchedUtil governor to properly estimate energy used
in the future. It has to take into account CPUs utilization and forecast
Performance Domain (PD) frequency. There is a corner case when the max
allowed frequency might be reduced due to thermal. SchedUtil is aware of
that reduced frequency, so it should be taken into account also in EAS
estimations.
SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
to 'policy::max'. SchedUtil is responsible to respect that upper limit
while setting the frequency through CPUFreq drivers. This effective
frequency is stored internally in 'sugov_policy::next_freq' and EAS has
to predict that value.
In the existing code the raw value of arch_scale_cpu_capacity() is used
for clamping the returned CPU utilization from effective_cpu_util().
This patch fixes issue with too big single CPU utilization, by introducing
clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
capacity reduced by thermal pressure raw value.
Thanks to knowledge about allowed CPU capacity, we don't get too big value
for a single CPU utilization, which is then added to the util sum. The
util sum is used as a source of information for estimating whole PD energy.
To avoid wrong energy estimation in EAS (due to capped frequency), make
sure that the calculation of util sum is aware of allowed CPU capacity.
This thermal pressure might be visible in scenarios where the CPUs are not
heavily loaded, but some other component (like GPU) drastically reduced
available power budget and increased the SoC temperature. Thus, we still
use EAS for task placement and CPUs are not over-utilized.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
Lukasz Luba [Mon, 14 Jun 2021 19:10:30 +0000 (20:10 +0100)]
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in
a per-cpu 'thermal_pressure' variable. The online CPUs will get the
new value there, while the offline won't. Unfortunately, when the CPU
is back online, the value read from per-cpu variable might be wrong
(stale data). This might affect the scheduler decisions, since it
sees the CPU capacity differently than what is actually available.
Fix it by making sure that all online+offline CPUs would get the
proper value in their per-cpu variable when thermal framework sets
capping.
Fixes:
f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20210614191030.22241-1-lukasz.luba@arm.com
Dietmar Eggemann [Tue, 1 Jun 2021 08:36:16 +0000 (10:36 +0200)]
sched/fair: Return early from update_tg_cfs_load() if delta == 0
In case the _avg delta is 0 there is no need to update se's _avg
(level n) nor cfs_rq's _avg (level n-1). These values stay the same.
Since cfs_rq's _avg isn't changed, i.e. no load is propagated down,
cfs_rq's _sum should stay the same as well.
So bail out after se's _sum has been updated.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
Vincent Guittot [Tue, 1 Jun 2021 15:53:28 +0000 (17:53 +0200)]
sched/pelt: Check that *_avg are null when *_sum are
Check that we never break the rule that pelt's avg values are null if
pelt's sum are.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Odin Ugedal <odin@uged.al>
Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
Odin Ugedal [Sat, 12 Jun 2021 11:28:15 +0000 (13:28 +0200)]
sched/fair: Correctly insert cfs_rq's to list on unthrottle
Fix an issue where fairness is decreased since cfs_rq's can end up not
being decayed properly. For two sibling control groups with the same
priority, this can often lead to a load ratio of 99/1 (!!).
This happens because when a cfs_rq is throttled, all the descendant
cfs_rq's will be removed from the leaf list. When they initial cfs_rq
is unthrottled, it will currently only re add descendant cfs_rq's if
they have one or more entities enqueued. This is not a perfect
heuristic.
Instead, we insert all cfs_rq's that contain one or more enqueued
entities, or it its load is not completely decayed.
Can often lead to situations like this for equally weighted control
groups:
$ ps u -C stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1
root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1
Fixes:
31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()")
[vingo: !SMP build fix]
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
Linus Torvalds [Sun, 13 Jun 2021 21:43:10 +0000 (14:43 -0700)]
Linux 5.13-rc6
Linus Torvalds [Sun, 13 Jun 2021 19:41:47 +0000 (12:41 -0700)]
Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git./linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Correct buffer copying when peeking events
- Sync cpufeatures/disabled-features.h header with the kernel sources
* tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
tools headers cpufeatures: Sync with the kernel sources
perf session: Correct buffer copying when peeking events
Linus Torvalds [Sun, 13 Jun 2021 19:32:59 +0000 (12:32 -0700)]
Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix use-after-free in nfs4_init_client()
Bugfixes:
- Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
- Fix second deadlock in nfs4_evict_inode()
- nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
- Fix setting of the NFS_CAP_SECURITY_LABEL capability"
* tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix second deadlock in nfs4_evict_inode()
NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
NFS: FMODE_READ and friends are C macros, not enum types
NFS: Fix a potential NULL dereference in nfs_get_client()
NFS: Fix use-after-free in nfs4_init_client()
NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
Linus Torvalds [Sun, 13 Jun 2021 19:25:33 +0000 (12:25 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four reasonably small fixes to the core for scsi host allocation
failure paths.
The root problem is that we're not freeing the memory allocated by
dev_set_name(), which involves a rejig of may of the free on error
paths to do put_device() instead of kfree which, in turn, has several
other knock on ramifications and inspection turned up a few other
lurking bugs"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: Only put parent device if host state differs from SHOST_CREATED
scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
scsi: core: Fix failure handling of scsi_add_host_with_dma()
scsi: core: Fix error handling of scsi_host_alloc()
Linus Torvalds [Sat, 12 Jun 2021 20:57:49 +0000 (13:57 -0700)]
Merge tag 'riscv-for-linus-5.13-rc6' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A pair of XIP fixes: one to fix alternatives, and one to turn off the
rest of the features that require code modification
- A fix to a type that was causing some alternatives to break
- A build fix for BUILTIN_DTB
* tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix BUILTIN_DTB for sifive and microchip soc
riscv: alternative: fix typo in macro name
riscv: code patching only works on !XIP_KERNEL
riscv: xip: support runtime trap patching
Feng Tang [Fri, 11 Jun 2021 01:54:42 +0000 (09:54 +0800)]
mm: relocate 'write_protect_seq' in struct mm_struct
0day robot reported a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit
57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork").
Further debug shows the regression is due to that commit changes the
offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some
cache alignment changes.
From the perf data, the contention for 'mmap_lock' is very severe and
takes around 95% cpu cycles, and it is a rw_semaphore
struct rw_semaphore {
atomic_long_t count; /* 8 bytes */
atomic_long_t owner; /* 8 bytes */
struct optimistic_spin_queue osq; /* spinner MCS lock */
...
Before commit
57efa1fe5957 adds the 'write_protect_seq', it happens to
have a very optimal cache alignment layout, as Linus explained:
"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.
Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.
Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.
Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."
After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, and causes
more cache bouncing.
Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
affect its offset:
CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
The layout above is on 64 bits system with 0day's default kernel config
(similar to RHEL-8.3's config), in which all these 3 options are 'y'.
And the layout can vary with different kernel configs.
Relayouting a structure is usually a double-edged sword, as sometimes it
can helps one case, but hurt other cases. For this case, one solution
is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t
(when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes
hole in 'mm_struct' will not change other fields' alignment, while
restoring the regression.
Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 12 Jun 2021 19:34:49 +0000 (12:34 -0700)]
Merge tag 'usb-5.13-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of tiny USB fixes for 5.13-rc6.
There are more than I would normally like, but there's been a bunch of
people banging on the gadget and dwc3 and typec code recently for I
think an Android release, which has resulted in a number of small
fixes. It's nice to see companies send fixes upstream for this type of
work, a notable change from years ago.
Anyway, fixes in here are:
- usb-serial device id updates
- usb-serial cp210x driver fixes for broken firmware versions
- typec fixes for crazy charging devices and other reported problems
- dwc3 fixes for reported problems found
- gadget fixes for reported problems
- tiny xhci fixes
- other small fixes for reported issues.
- revert of a problem fix found by linux-next testing
All of these have passed 0-day and linux-next testing with no reported
problems (the revert for the found linux-next build problem included)"
* tag 'usb-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (44 commits)
Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs"
usb: typec: mux: Fix copy-paste mistake in typec_mux_match
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
usb: gadget: fsl: Re-enable driver for ARM SoCs
usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
USB: serial: cp210x: fix CP2102N-A01 modem control
USB: serial: cp210x: fix alternate function for CP2102N QFN20
usb: misc: brcmstb-usb-pinmap: check return value after calling platform_get_resource()
usb: dwc3: ep0: fix NULL pointer exception
usb: gadget: eem: fix wrong eem header operation
usb: typec: intel_pmc_mux: Put ACPI device using acpi_dev_put()
usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource()
usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe()
usb: typec: tcpm: Do not finish VDM AMS for retrying Responses
usb: fix various gadget panics on 10gbps cabling
usb: fix various gadgets null ptr deref on 10gbps cabling.
usb: pci-quirks: disable D3cold on xhci suspend for s2idle on AMD Renoir
usb: f_ncm: only first packet of aggregate needs to start timer
USB: f_ncm: ncm_bitrate (speed) is unsigned
MAINTAINERS: usb: add entry for isp1760
...
Linus Torvalds [Sat, 12 Jun 2021 19:27:05 +0000 (12:27 -0700)]
Merge tag 'tty-5.13-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull serial driver fix from Greg KH:
"A single 8250_exar serial driver fix for a reported problem with a
change that happened in 5.13-rc1.
It has been in linux-next with no reported problems"
* tag 'tty-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250_exar: Avoid NULL pointer dereference at ->exit()
Linus Torvalds [Sat, 12 Jun 2021 19:23:54 +0000 (12:23 -0700)]
Merge tag 'staging-5.13-rc6' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Two tiny staging driver fixes:
- ralink-gdma driver authorship information fixed up
- rtl8723bs driver fix for reported regression
Both have been in linux-next for a while with no reported problems"
* tag 'staging-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: ralink-gdma: Remove incorrect author information
staging: rtl8723bs: Fix uninitialized variables
Linus Torvalds [Sat, 12 Jun 2021 19:18:49 +0000 (12:18 -0700)]
Merge tag 'driver-core-5.13-rc6' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"A single debugfs fix for 5.13-rc6, fixing a bug in
debugfs_read_file_str() that showed up in 5.13-rc1.
It has been in linux-next for a full week with no
reported problems"
* tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
debugfs: Fix debugfs_read_file_str()
Linus Torvalds [Sat, 12 Jun 2021 19:13:55 +0000 (12:13 -0700)]
Merge tag 'char-misc-5.13-rc6' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small misc driver fixes for 5.13-rc6 that fix some
reported problems:
- Tiny phy driver fixes for reported issues
- rtsx regression for when the device suspended
- mhi driver fix for a use-after-free
All of these have been in linux-next for a few days with no reported
issues"
* tag 'char-misc-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG
bus: mhi: pci-generic: Fix hibernation
bus: mhi: pci_generic: Fix possible use-after-free in mhi_pci_remove()
bus: mhi: pci_generic: T99W175: update channel name from AT to DUN
phy: Sparx5 Eth SerDes: check return value after calling platform_get_resource()
phy: ralink: phy-mt7621-pci: drop 'of_match_ptr' to fix -Wunused-const-variable
phy: ti: Fix an error code in wiz_probe()
phy: phy-mtk-tphy: Fix some resource leaks in mtk_phy_init()
phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe()
phy: usb: Fix misuse of IS_ENABLED
Linus Torvalds [Sat, 12 Jun 2021 19:06:24 +0000 (12:06 -0700)]
Merge tag 'pinctrl-v5.13-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix some documentation warnings for Allwinner
- Fix duplicated GPIO groups on Qualcomm SDX55
- Fix a double enablement bug in the Ralink driver
- Fix the Qualcomm SC8180x Kconfig so the driver can be selected.
* tag 'pinctrl-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: qcom: Make it possible to select SC8180x TLMM
pinctrl: ralink: rt2880: avoid to error in calls is pin is already enabled
pinctrl: qcom: Fix duplication in gpio_groups
pinctrl: aspeed: Fix minor documentation error
Linus Torvalds [Sat, 12 Jun 2021 18:59:58 +0000 (11:59 -0700)]
Merge tag 'block-5.13-2021-06-12' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should go into 5.13:
- Fix a regression deadlock introduced in this release between open
and remove of a bdev (Christoph)
- Fix an async_xor md regression in this release (Xiao)
- Fix bcache oversized read issue (Coly)"
* tag 'block-5.13-2021-06-12' of git://git.kernel.dk/linux-block:
block: loop: fix deadlock between open and remove
async_xor: check src_offs is not NULL before updating it
bcache: avoid oversized read request in cache missing code path
bcache: remove bcache device self-defined readahead
Linus Torvalds [Sat, 12 Jun 2021 18:53:20 +0000 (11:53 -0700)]
Merge tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just an API change for the registration changes that went into this
release. Better to get it sorted out now than before it's too late"
* tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block:
io_uring: add feature flag for rsrc tags
io_uring: change registration/upd/rsrc tagging ABI
Linus Torvalds [Sat, 12 Jun 2021 18:41:28 +0000 (11:41 -0700)]
Merge tag 'sched-urgent-2021-06-12' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes:
- Fix performance regression caused by lack of intended batching of
RCU callbacks by over-eager NOHZ-full code.
- Fix cgroups related corruption of load_avg and load_sum metrics.
- Three fixes to fix blocked load, util_sum/runnable_sum and util_est
tracking bugs"
* tag 'sched-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling
sched/pelt: Ensure that *_sum is always synced with *_avg
tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed
sched/fair: Make sure to update tg contrib for blocked load
sched/fair: Keep load_avg and load_sum synced
Linus Torvalds [Sat, 12 Jun 2021 18:34:49 +0000 (11:34 -0700)]
Merge tag 'perf-urgent-2021-06-12' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes:
- Fix the NMI watchdog on ancient Intel CPUs
- Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe
irq_work path used by perf.
- Fix uncore events on Ice Lake servers.
- Someone booted maxcpus=1 on an SNB-EP, and the uncore driver
emitted warnings and was probably buggy. Fix it.
- KCSAN found a genuine data race in the core perf code. Somewhat
ironically the bug was introduced through a recent race fix. :-/
In our defense, the new race window was much more narrow. Fix it"
* tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
irq_work: Make irq_work_queue() NMI-safe again
perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1
perf: Fix data race between pin_count increment/decrement
Linus Torvalds [Sat, 12 Jun 2021 18:10:28 +0000 (11:10 -0700)]
Merge tag 'objtool-urgent-2021-06-12' of git://git./linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Two objtool fixes:
- fix a bug that corrupts the code by mistakenly rewriting
conditional jumps
- fix another bug generating an incorrect ELF symbol table
during retpoline rewriting"
* tag 'objtool-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Only rewrite unconditional retpoline thunk calls
objtool: Fix .symtab_shndx handling for elf_create_undef_symbol()
Alexandre Ghiti [Fri, 4 Jun 2021 12:06:39 +0000 (14:06 +0200)]
riscv: Fix BUILTIN_DTB for sifive and microchip soc
Fix BUILTIN_DTB config which resulted in a dtb that was actually not
built into the Linux image: in the same manner as Canaan soc does,
create an object file from the dtb file that will get linked into the
Linux image.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Linus Torvalds [Sat, 12 Jun 2021 00:05:03 +0000 (17:05 -0700)]
Merge tag 'trace-v5.13-rc5-2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the length check in the temp buffer filter
- Fix build failure in bootconfig tools for "fallthrough" macro
- Fix error return of bootconfig apply_xbc() routine
* tag 'trace-v5.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Correct the length check which causes memory corruption
ftrace: Do not blindly read the ip address in ftrace_bug()
tools/bootconfig: Fix a build error accroding to undefined fallthrough
tools/bootconfig: Fix error return code in apply_xbc()
Linus Torvalds [Fri, 11 Jun 2021 23:29:53 +0000 (16:29 -0700)]
Merge tag 'clang-features-v5.13-rc6' of git://git./linux/kernel/git/kees/linux
Pull clang LTO fix from Kees Cook:
"Clang 13 fixed some IR behavior for LTO, but this broke work-arounds
used in the kernel.
Handle changes to needed LTO flags in Clang 13 (Tor Vic)"
* tag 'clang-features-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
x86, lto: Pass -stack-alignment only on LLD < 13.0.0
Linus Torvalds [Fri, 11 Jun 2021 23:27:18 +0000 (16:27 -0700)]
Merge tag 'gpio-fixes-for-v5.13-rc6' of git://git./linux/kernel/git/brgl/linux
Pull gpio fix from Bartosz Golaszewski:
"Fix a shift-out-of-bounds error in gpio-wcd934x"
* tag 'gpio-fixes-for-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: wcd934x: Fix shift-out-of-bounds error
Linus Torvalds [Fri, 11 Jun 2021 19:33:38 +0000 (12:33 -0700)]
Merge tag 'drm-fixes-2021-06-11' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Another week of fixes, nothing too crazy, but a few all over the
place.
Two locking fixes in the core/ttm area, a couple of small driver fixes
(radeon, sun4i, mcde, vc4). Then msm and amdgpu have a set of fixes
each, mostly for smaller things, though the msm has a DSI fix for a
black screen.
I haven't seen any intel fixes this week so they may have a few that
may or may not wait for next week.
drm:
- auth locking fix
ttm:
- locking fix
amdgpu:
- Use kvzmalloc in amdgu_bo_create
- Use drm_dbg_kms for reporting failure to get a GEM FB
- Fix some register offsets for Sienna Cichlid
- Fix fall-through warning
radeon:
- memcpy_to/from_io fixes
msm:
- NULL ptr deref fix
- CP_PROTECT reg programming fix
- incorrect register shift fix
- DSI blank screen fix
sun4i:
- hdmi output probing fix
mcde:
- DSI pipeline calc fix
vc4:
- out of bounds fix"
* tag 'drm-fixes-2021-06-11' of git://anongit.freedesktop.org/drm/drm:
drm/msm/dsi: Stash away calculated vco frequency on recalc
drm: Lock pointer access in drm_master_release()
drm/mcde: Fix off by 10^3 in calculation
drm/msm/a6xx: avoid shadow NULL reference in failure path
drm/msm/a6xx: fix incorrectly set uavflagprd_inv field for A650
drm/msm/a6xx: update/fix CP_PROTECT initialization
radeon: use memcpy_to/fromio for UVD fw upload
drm/amd/pm: Fix fall-through warning for Clang
drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
drm/amdgpu: Use drm_dbg_kms for reporting failure to get a GEM FB
drm/amdgpu: switch kzalloc to kvzalloc in amdgpu_bo_create
drm/msm: Init mm_list before accessing it for use_vram path
drm: Fix use-after-free read in drm_getunique()
drm/vc4: fix vc4_atomic_commit_tail() logic
drm/ttm: fix deref of bo->ttm without holding the lock v2
drm/sun4i: dw-hdmi: Make HDMI PHY into a platform device
Linus Torvalds [Fri, 11 Jun 2021 18:02:56 +0000 (11:02 -0700)]
Merge tag 'devicetree-fixes-for-5.13-3' of git://git./linux/kernel/git/robh/linux
Pull devicetree fix from Rob Herring:
"A single fix for broken media/renesas,drif.yaml binding schema"
* tag 'devicetree-fixes-for-5.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
media: dt-bindings: media: renesas,drif: Fix fck definition
Jens Axboe [Fri, 11 Jun 2021 17:56:08 +0000 (11:56 -0600)]
Merge branch 'md-fixes' of https://git./linux/kernel/git/song/md into block-5.13
Pull MD related fix from Song.
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
async_xor: check src_offs is not NULL before updating it
Linus Torvalds [Fri, 11 Jun 2021 17:53:43 +0000 (10:53 -0700)]
Merge tag 'acpi-5.13-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert a problematic recent commit and fix a regression
introduced during the 5.12 development cycle.
Specifics:
- Revert recent commit that attempted to fix the FACS table reference
counting but introduced a problem with accessing the hardware
signature after hibernation (Zhang Rui).
- Fix regression in the _OSC handling that broke the loading of ACPI
tables on some systems (Mika Westerberg)"
* tag 'acpi-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: Pass the same capabilities to the _OSC regardless of the query flag
Revert "ACPI: sleep: Put the FACS table after using it"
Christoph Hellwig [Sat, 5 Jun 2021 14:09:50 +0000 (17:09 +0300)]
block: loop: fix deadlock between open and remove
Commit
c76f48eb5c08 ("block: take bd_mutex around delete_partitions in
del_gendisk") adds disk->part0->bd_mutex in del_gendisk(), this way
causes the following AB/BA deadlock between removing loop and opening
loop:
1) loop_control_ioctl(LOOP_CTL_REMOVE)
-> mutex_lock(&loop_ctl_mutex)
-> del_gendisk
-> mutex_lock(&disk->part0->bd_mutex)
2) blkdev_get_by_dev
-> mutex_lock(&disk->part0->bd_mutex)
-> lo_open
-> mutex_lock(&loop_ctl_mutex)
Add a new Lo_deleting state to remove the need for clearing
->private_data and thus holding loop_ctl_mutex in the ioctl
LOOP_CTL_REMOVE path.
Based on an analysis and earlier patch from
Ming Lei <ming.lei@redhat.com>.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes:
c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210605140950.5800-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 11 Jun 2021 17:47:10 +0000 (10:47 -0700)]
Merge tag 'sound-5.13-rc6' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A bit more commits than expected at this time, but likely it's the
last shot before the final.
Many of changes are device-specific fix-ups for various ASoC drivers,
while a few usual HD-audio quirks and a FireWire fix, as well as a
couple of ALSA / ASoC core fixes.
All look nice and small, and nothing to scare much"
* tag 'sound-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: seq: Fix race of snd_seq_timer_open()
ALSA: hda/realtek: fix mute/micmute LEDs for HP ZBook Power G8
ALSA: hda/realtek: headphone and mic don't work on an Acer laptop
ASoC: qcom: lpass-cpu: Fix pop noise during audio capture begin
ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun()
ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 840 Aero G8
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP EliteBook x360 1040 G8
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Elite Dragonfly G2
ASoC: rt5682: Fix the fast discharge for headset unplugging in soundwire mode
ASoC: tas2562: Fix TDM_CFG0_SAMPRATE values
ASoC: meson: gx-card: fix sound-dai dt schema
ASoC: AMD Renoir: Remove fix for DMI entry on Lenovo 2020 platforms
ASoC: AMD Renoir - add DMI entry for Lenovo 2020 AMD platforms
ASoC: SOF: reset enabled_cores state at suspend
ASoC: fsl-asoc-card: Set .owner attribute when registering card.
ASoC: topology: Fix spelling mistake "vesion" -> "version"
ASoC: rt5659: Fix the lost powers for the HDA header
ASoC: core: Fix Null-point-dereference in fmt_single_name()
Tor Vic [Thu, 10 Jun 2021 20:58:06 +0000 (20:58 +0000)]
x86, lto: Pass -stack-alignment only on LLD < 13.0.0
Since LLVM commit 3787ee4, the '-stack-alignment' flag has been dropped
[1], leading to the following error message when building a LTO kernel
with Clang-13 and LLD-13:
ld.lld: error: -plugin-opt=-: ld.lld: Unknown command line argument
'-stack-alignment=8'. Try 'ld.lld --help'
ld.lld: Did you mean '--stackrealign=8'?
It also appears that the '-code-model' flag is not necessary anymore
starting with LLVM-9 [2].
Drop '-code-model' and make '-stack-alignment' conditional on LLD < 13.0.0.
These flags were necessary because these flags were not encoded in the
IR properly, so the link would restart optimizations without them. Now
there are properly encoded in the IR, and these flags exposing
implementation details are no longer necessary.
[1] https://reviews.llvm.org/D103048
[2] https://reviews.llvm.org/D52322
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1377
Signed-off-by: Tor Vic <torvic9@mailbox.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/f2c018ee-5999-741e-58d4-e482d5246067@mailbox.org
Linus Torvalds [Fri, 11 Jun 2021 17:07:50 +0000 (10:07 -0700)]
Merge tag 'hwmon-for-v5.13-rc6' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Fixes for tps23861, scpi-hwmon, and corsair-psu drivers, plus a
bindings fix for TI ADS7828"
* tag 'hwmon-for-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (tps23861) correct shunt LSB values
hwmon: (tps23861) set current shunt value
hwmon: (tps23861) define regmap max register
hwmon: (scpi-hwmon) shows the negative temperature properly
hwmon: (corsair-psu) fix suspend behavior
dt-bindings: hwmon: Fix typo in TI ADS7828 bindings
Linus Torvalds [Fri, 11 Jun 2021 17:02:30 +0000 (10:02 -0700)]
Merge tag 'mmc-v5.13-rc3' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"A couple of MMC fixes to the Renesas SDHI driver:
- Fix HS400 on R-Car M3-W+
- Abort tuning when timeout detected"
* tag 'mmc-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: renesas_sdhi: Fix HS400 on R-Car M3-W+
mmc: renesas_sdhi: abort tuning when timeout detected
Rafael J. Wysocki [Fri, 11 Jun 2021 15:57:24 +0000 (17:57 +0200)]
Merge branch 'acpi-bus'
* acpi-bus:
ACPI: Pass the same capabilities to the _OSC regardless of the query flag
Arnaldo Carvalho de Melo [Tue, 8 Jun 2021 16:46:18 +0000 (13:46 -0300)]
tools headers cpufeatures: Sync with the kernel sources
To pick the changes in:
fb35d30fe5b06cc2 ("x86/cpufeatures: Assign dedicated feature word for CPUID_0x8000001F[EAX]")
e7b6385b01d8e9fb ("x86/cpufeatures: Add Intel SGX hardware bits")
1478b99a76534b6c ("x86/cpufeatures: Mark ENQCMD as disabled when configured out")
That don't cause any change in the tools, just silences this perf build
warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h'
diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h
Cc: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Leo Yan [Sat, 5 Jun 2021 05:29:57 +0000 (13:29 +0800)]
perf session: Correct buffer copying when peeking events
When peeking an event, it has a short path and a long path. The short
path uses the session pointer "one_mmap_addr" to directly fetch the
event; and the long path needs to read out the event header and the
following event data from file and fill into the buffer pointer passed
through the argument "buf".
The issue is in the long path that it copies the event header and event
data into the same destination address which pointer "buf", this means
the event header is overwritten. We are just lucky to run into the
short path in most cases, so we don't hit the issue in the long path.
This patch adds the offset "hdr_sz" to the pointer "buf" when copying
the event data, so that it can reserve the event header which can be
used properly by its caller.
Fixes:
5a52f33adf02 ("perf session: Add perf_session__peek_event()")
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210605052957.1070720-1-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Greg Kroah-Hartman [Fri, 11 Jun 2021 10:32:49 +0000 (12:32 +0200)]
Merge tag 'usb-serial-5.13-rc6' of https://git./linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for 5.13-rc6
Here are two fixes for the cp210x driver. The first fixes a regression
with early revisions of the CP2102N which specifically broke some ESP32
development boards. The second makes sure that the pin configuration is
detected properly also for the CP2102N QFN20 package.
Both have been in linux-next over night and with no reported issues.
* tag 'usb-serial-5.13-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: cp210x: fix CP2102N-A01 modem control
USB: serial: cp210x: fix alternate function for CP2102N QFN20
Greg Kroah-Hartman [Fri, 11 Jun 2021 07:18:47 +0000 (09:18 +0200)]
Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs"
This reverts commit
e0e8b6abe8c862229ba00cdd806e8598cdef00bb.
Turns out this breaks the build. We had numerous reports of problems
from linux-next and 0-day about this not working properly, so revert it
for now until it can be figured out properly.
The build errors are:
arm-linux-gnueabi-ld: fsl_udc_core.c:(.text+0x29d4): undefined reference to `fsl_udc_clk_finalize'
arm-linux-gnueabi-ld: fsl_udc_core.c:(.text+0x2ba8): undefined reference to `fsl_udc_clk_release'
fsl_udc_core.c:(.text+0x2848): undefined reference to `fsl_udc_clk_init'
fsl_udc_core.c:(.text+0xe88): undefined reference to `fsl_udc_clk_release'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Fixes:
e0e8b6abe8c8 ("usb: gadget: fsl: Re-enable driver for ARM SoCs")
Cc: stable <stable@vger.kernel.org>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Leo Li <leoyang.li@nxp.com>
Cc: Peter Chen <peter.chen@nxp.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Ran Wang <ran.wang_1@nxp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Peter Zijlstra [Thu, 10 Jun 2021 07:04:29 +0000 (09:04 +0200)]
objtool: Only rewrite unconditional retpoline thunk calls
It turns out that the compilers generate conditional branches to the
retpoline thunks like:
5d5: 0f 85 00 00 00 00 jne 5db <cpuidle_reflect+0x22>
5d7: R_X86_64_PLT32 __x86_indirect_thunk_r11-0x4
while the rewrite can only handle JMP/CALL to the thunks. The result
is the alternative wrecking the code. Make sure to skip writing the
alternatives for conditional branches.
Fixes:
9bc0bb50727c ("objtool/x86: Rewrite retpoline thunk calls")
Reported-by: Lukasz Majczak <lma@semihalf.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Vitaly Wool [Mon, 31 May 2021 09:33:10 +0000 (12:33 +0300)]
riscv: alternative: fix typo in macro name
alternative-macros.h defines ALT_NEW_CONTENT in its assembly part
and ALT_NEW_CONSTENT in the C part. Most likely it is the latter
that is wrong.
Fixes:
6f4eea90465ad
(riscv: Introduce alternative mechanism to apply errata solution)
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Xiao Ni [Fri, 28 May 2021 06:16:38 +0000 (14:16 +0800)]
async_xor: check src_offs is not NULL before updating it
When PAGE_SIZE is greater than 4kB, multiple stripes may share the same
page. Thus, src_offs is added to async_xor_offs() with array of offsets.
However, async_xor() passes NULL src_offs to async_xor_offs(). In such
case, src_offs should not be updated. Add a check before the update.
Fixes:
ceaf2966ab08(async_xor: increase src_offs when dropping destination page)
Cc: stable@vger.kernel.org # v5.10+
Reported-by: Oleksandr Shchirskyi <oleksandr.shchirskyi@linux.intel.com>
Tested-by: Oleksandr Shchirskyi <oleksandr.shchirskyi@intel.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Dave Airlie [Fri, 11 Jun 2021 01:17:09 +0000 (11:17 +1000)]
Merge tag 'amd-drm-fixes-5.13-2021-06-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-5.13-2021-06-09:
amdgpu:
- Use kvzmalloc in amdgu_bo_create
- Use drm_dbg_kms for reporting failure to get a GEM FB
- Fix some register offsets for Sienna Cichlid
- Fix fall-through warning
radeon:
- memcpy_to/from_io fixes
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210610035631.3943-1-alexander.deucher@amd.com
Dave Airlie [Fri, 11 Jun 2021 00:59:49 +0000 (10:59 +1000)]
Merge tag 'drm-misc-fixes-2021-06-10' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
One fix for snu4i that prevents it from probing, two locking fixes for
ttm and drm_auth, one off-by-x1000 fix for mcde and a fix for vc4 to
prevent an out-of-bounds access.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20210610171653.lqsoadxrhdk73cdy@gilmour
Dave Airlie [Fri, 11 Jun 2021 00:45:27 +0000 (10:45 +1000)]
Merge tag 'drm-msm-fixes-2021-06-10' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
- NULL ptr deref fix
- CP_PROTECT reg programming fix
- incorrect register shift fix
- DSI blank screen fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGvbcz0=QxGYnX9u7cD1SCvFSx20dzrZuOccjtRRBTJd5Q@mail.gmail.com
Jisheng Zhang [Mon, 10 May 2021 16:28:38 +0000 (00:28 +0800)]
riscv: code patching only works on !XIP_KERNEL
Some features which need code patching such as KPROBES, DYNAMIC_FTRACE
KGDB can only work on !XIP_KERNEL. Add dependencies for these features
that rely on code patching.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Vitaly Wool [Mon, 31 May 2021 08:53:42 +0000 (11:53 +0300)]
riscv: xip: support runtime trap patching
RISCV_ERRATA_ALTERNATIVE patches text at runtime which is currently
not possible when the kernel is executed from the flash in XIP mode.
Since runtime patching concerns only traps at the moment, let's just
have all the traps reside in RAM anyway if RISCV_ERRATA_ALTERNATIVE
is set. Thus, these functions will be patch-able even when the .text
section is in flash.
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pavel Begunkov [Thu, 10 Jun 2021 15:37:38 +0000 (16:37 +0100)]
io_uring: add feature flag for rsrc tags
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of
new IORING_REGISTER operations, in particular
IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc
tagging, and also indicating implemented dynamic fixed buffer updates.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 10 Jun 2021 15:37:37 +0000 (16:37 +0100)]
io_uring: change registration/upd/rsrc tagging ABI
There are ABI moments about recently added rsrc registration/update and
tagging that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hide different types of resources under it,
so breaks fine control over them by restrictions. It works for now, but
once those are wanted under restrictions it would require a rework.
It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so better
to return to IORING_REGISTER_* top level dispatching.
Second, register/update were considered to accept a type of resource,
however that's not a good idea because there might be several ways of
registration of a single resource type, e.g. we may want to add
non-contig buffers or anything more exquisite as dma mapped memory.
So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them
internally for now to limit changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric W. Biederman [Thu, 10 Jun 2021 20:11:11 +0000 (15:11 -0500)]
coredump: Limit what can interrupt coredumps
Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.
Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
> with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> set_notify_signal()
The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals. Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.
Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports. This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.
The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending. So it should
not develop surprising wake-up reasons either.
v2: Don't remove the now unnecessary code in prepare_signal.
Cc: stable@vger.kernel.org
Fixes:
12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 10 Jun 2021 19:01:22 +0000 (12:01 -0700)]
Merge branch 'for-5.13-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"This is a high priority but low risk fix for a cgroup1 bug where
rename(2) can change a cgroup's name to something which can break
parsing of /proc/PID/cgroup"
* 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: don't allow '\n' in renaming
Bjorn Andersson [Thu, 10 Jun 2021 00:21:32 +0000 (17:21 -0700)]
usb: typec: mux: Fix copy-paste mistake in typec_mux_match
Fix the copy-paste mistake in the return path of typec_mux_match(),
where dev is considered a member of struct typec_switch rather than
struct typec_mux.
The two structs are identical in regards to having the struct device as
the first entry, so this provides no functional change.
Fixes:
3370db35193b ("usb: typec: Registering real device entries for the muxes")
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20210610002132.3088083-1-bjorn.andersson@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mayank Rana [Wed, 9 Jun 2021 07:35:35 +0000 (00:35 -0700)]
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
If ucsi_init() fails for some reason (e.g. ucsi_register_port()
fails or general communication failure to the PPM), particularly at
any point after the GET_CAPABILITY command had been issued, this
results in unwinding the initialization and returning an error.
However the ucsi structure's ucsi_capability member retains its
current value, including likely a non-zero num_connectors.
And because ucsi_init() itself is done in a workqueue a UCSI
interface driver will be unaware that it failed and may think the
ucsi_register() call was completely successful. Later, if
ucsi_unregister() is called, due to this stale ucsi->cap value it
would try to access the items in the ucsi->connector array which
might not be in a proper state or not even allocated at all and
results in NULL or invalid pointer dereference.
Fix this by clearing the ucsi->cap value to 0 during the error
path of ucsi_init() in order to prevent a later ucsi_unregister()
from entering the connector cleanup loop.
Fixes:
c1b0bc2dabfa ("usb: typec: Add support for UCSI interface")
Cc: stable@vger.kernel.org
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Mayank Rana <mrana@codeaurora.org>
Signed-off-by: Jack Pham <jackp@codeaurora.org>
Link: https://lore.kernel.org/r/20210609073535.5094-1-jackp@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joel Stanley [Thu, 10 Jun 2021 03:49:57 +0000 (13:19 +0930)]
usb: gadget: fsl: Re-enable driver for ARM SoCs
The commit
a390bef7db1f ("usb: gadget: fsl_mxc_udc: Remove the driver")
dropped the ARCH_MXC dependency from USB_FSL_USB2, leaving it depending
solely on FSL_SOC.
FSL_SOC is powerpc only; it was briefly available on ARM in 2014 but was
removed by commit
cfd074ad8600 ("ARM: imx: temporarily remove
CONFIG_SOC_FSL from LS1021A"). Therefore the driver can no longer be
enabled on ARM platforms.
This appears to be a mistake as arm64's ARCH_LAYERSCAPE and arm32
SOC_LS1021A SoCs use this symbol. It's enabled in these defconfigs:
arch/arm/configs/imx_v6_v7_defconfig:CONFIG_USB_FSL_USB2=y
arch/arm/configs/multi_v7_defconfig:CONFIG_USB_FSL_USB2=y
arch/powerpc/configs/mgcoge_defconfig:CONFIG_USB_FSL_USB2=y
arch/powerpc/configs/mpc512x_defconfig:CONFIG_USB_FSL_USB2=y
To fix, expand the dependencies so USB_FSL_USB2 can be enabled on the
ARM platforms, and with COMPILE_TEST.
Fixes:
a390bef7db1f ("usb: gadget: fsl_mxc_udc: Remove the driver")
Signed-off-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/20210610034957.93376-1-joel@jms.id.au
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andy Shevchenko [Wed, 9 Jun 2021 17:22:02 +0000 (20:22 +0300)]
usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
As LKP noticed the Sparse is not happy about strict type handling:
.../typec/tcpm/wcove.c:380:50: sparse: expected unsigned short [usertype] header
.../typec/tcpm/wcove.c:380:50: sparse: got restricted __le16 const [usertype] header
Fix this by switching to use pd_header_cnt_le() instead of pd_header_cnt()
in the affected code.
Fixes:
ae8a2ca8a221 ("usb: typec: Group all TCPCI/TCPM code together")
Fixes:
3c4fb9f16921 ("usb: typec: wcove: start using tcpm for USB PD support")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20210609172202.83377-1-andriy.shevchenko@linux.intel.com
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Linus Torvalds [Thu, 10 Jun 2021 17:53:04 +0000 (10:53 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"A mixture of small bug fixes and a small security issue:
- WARN_ON when IPoIB is automatically moved between namespaces
- Long standing bug where mlx5 would use the wrong page for the
doorbell recovery memory if fork is used
- Security fix for mlx4 that disables the timestamp feature
- Several crashers for mlx5
- Plug a recent mlx5 memory leak for the sig_mr"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/mlx5: Fix initializing CQ fragments buffer
RDMA/mlx5: Delete right entry from MR signature database
RDMA: Verify port when creating flow rule
RDMA/mlx5: Block FDB rules when not in switchdev mode
RDMA/mlx4: Do not map the core_clock page to user space unless enabled
RDMA/mlx5: Use different doorbell memory for different processes
RDMA/ipoib: Fix warning caused by destroying non-initial netns
Robert Marko [Wed, 9 Jun 2021 22:07:28 +0000 (00:07 +0200)]
hwmon: (tps23861) correct shunt LSB values
Current shunt LSB values got reversed during in the
original driver commit.
So, correct the current shunt LSB values according to
the datasheet.
This caused reading slightly skewed current values.
Fixes:
fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver")
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Link: https://lore.kernel.org/r/20210609220728.499879-3-robert.marko@sartura.hr
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Robert Marko [Wed, 9 Jun 2021 22:07:27 +0000 (00:07 +0200)]
hwmon: (tps23861) set current shunt value
TPS23861 has a configuration bit for setting of the
current shunt value used on the board.
Its bit 0 of the General Mask 1 register.
According to the datasheet bit values are:
0 for 255 mOhm (Default)
1 for 250 mOhm
So, configure the bit before registering the hwmon
device according to the value passed in the DTS or
default one if none is passed.
This caused potentially reading slightly skewed values
due to max current value being 1.02A when 250mOhm shunt
is used instead of 1.0A when 255mOhm is used.
Fixes:
fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver")
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Link: https://lore.kernel.org/r/20210609220728.499879-2-robert.marko@sartura.hr
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Robert Marko [Wed, 9 Jun 2021 22:07:26 +0000 (00:07 +0200)]
hwmon: (tps23861) define regmap max register
Define the max register address the device supports.
This allows reading the whole register space via
regmap debugfs, without it only register 0x0 is visible.
This was forgotten in the original driver commit.
Fixes:
fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver")
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Link: https://lore.kernel.org/r/20210609220728.499879-1-robert.marko@sartura.hr
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Takashi Iwai [Thu, 10 Jun 2021 15:20:59 +0000 (17:20 +0200)]
ALSA: seq: Fix race of snd_seq_timer_open()
The timer instance per queue is exclusive, and snd_seq_timer_open()
should have managed the concurrent accesses. It looks as if it's
checking the already existing timer instance at the beginning, but
it's not right, because there is no protection, hence any later
concurrent call of snd_seq_timer_open() may override the timer
instance easily. This may result in UAF, as the leftover timer
instance can keep running while the queue itself gets closed, as
spotted by syzkaller recently.
For avoiding the race, add a proper check at the assignment of
tmr->timeri again, and return -EBUSY if it's been already registered.
Reported-by: syzbot+ddc1260a83ed1cbf6fb5@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/000000000000dce34f05c42f110c@google.com
Link: https://lore.kernel.org/r/20210610152059.24633-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Johan Hovold [Wed, 9 Jun 2021 16:15:09 +0000 (18:15 +0200)]
USB: serial: cp210x: fix CP2102N-A01 modem control
CP2102N revision A01 (firmware version <= 1.0.4) has a buggy
flow-control implementation that uses the ulXonLimit instead of
ulFlowReplace field of the flow-control settings structure (erratum
CP2102N_E104).
A recent change that set the input software flow-control limits
incidentally broke RTS control for these devices when CRTSCTS is not set
as the new limits would always enable hardware flow control.
Fix this by explicitly disabling flow control for the buggy firmware
versions and only updating the input software flow-control limits when
IXOFF is requested. This makes sure that the terminal settings matches
the default zero ulXonLimit (ulFlowReplace) for these devices.
Link: https://lore.kernel.org/r/20210609161509.9459-1-johan@kernel.org
Reported-by: David Frey <dpfrey@gmail.com>
Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com>
Tested-by: Alex Villacís Lasso <a_villacis@palosanto.com>
Fixes:
f61309d9c96a ("USB: serial: cp210x: set IXOFF thresholds")
Cc: stable@vger.kernel.org # 5.12
Signed-off-by: Johan Hovold <johan@kernel.org>
Stephen Boyd [Tue, 8 Jun 2021 19:55:19 +0000 (12:55 -0700)]
drm/msm/dsi: Stash away calculated vco frequency on recalc
A problem was reported on CoachZ devices where the display wouldn't come
up, or it would be distorted. It turns out that the PLL code here wasn't
getting called once dsi_pll_10nm_vco_recalc_rate() started returning the
same exact frequency, down to the Hz, that the bootloader was setting
instead of 0 when the clk was registered with the clk framework.
After commit
001d8dc33875 ("drm/msm/dsi: remove temp data from global
pll structure") we use a hardcoded value for the parent clk frequency,
i.e. VCO_REF_CLK_RATE, and we also hardcode the value for FRAC_BITS,
instead of getting it from the config structure. This combination of
changes to the recalc function allows us to properly calculate the
frequency of the PLL regardless of whether or not the PLL has been
clk_prepare()d or clk_set_rate()d. That's a good improvement.
Unfortunately, this means that now we won't call down into the PLL clk
driver when we call clk_set_rate() because the frequency calculated in
the framework matches the frequency that is set in hardware. If the rate
is the same as what we want it should be OK to not call the set_rate PLL
op. The real problem is that the prepare op in this driver uses a
private struct member to stash away the vco frequency so that it can
call the set_rate op directly during prepare. Once the set_rate op is
never called because recalc_rate told us the rate is the same, we don't
set this private struct member before the prepare op runs, so we try to
call the set_rate function directly with a frequency of 0. This
effectively kills the PLL and configures it for a rate that won't work.
Calling set_rate from prepare is really quite bad and will confuse any
downstream clks about what the rate actually is of their parent. Fixing
that will be a rather large change though so we leave that to later.
For now, let's stash away the rate we calculate during recalc so that
the prepare op knows what frequency to set, instead of 0. This way
things keep working and the display can enable the PLL properly. In the
future, we should remove that code from the prepare op so that it
doesn't even try to call the set rate function.
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Abhinav Kumar <abhinavk@codeaurora.org>
Fixes:
001d8dc33875 ("drm/msm/dsi: remove temp data from global pll structure")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20210608195519.125561-1-swboyd@chromium.org
Signed-off-by: Rob Clark <robdclark@chromium.org>
Alexander Kuznetsov [Wed, 9 Jun 2021 07:17:19 +0000 (10:17 +0300)]
cgroup1: don't allow '\n' in renaming
cgroup_mkdir() have restriction on newline usage in names:
$ mkdir $'/sys/fs/cgroup/cpu/test\ntest2'
mkdir: cannot create directory
'/sys/fs/cgroup/cpu/test\ntest2': Invalid argument
But in cgroup1_rename() such check is missed.
This allows us to make /proc/<pid>/cgroup unparsable:
$ mkdir /sys/fs/cgroup/cpu/test
$ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2'
$ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2'
$ cat /proc/self/cgroup
11:pids:/
10:freezer:/
9:hugetlb:/
8:cpuset:/
7:blkio:/user.slice
6:memory:/user.slice
5:net_cls,net_prio:/
4:perf_event:/
3:devices:/user.slice
2:cpu,cpuacct:/test
test2
1:name=systemd:/
0::/
Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru>
Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru>
Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Alaa Hleihel [Thu, 10 Jun 2021 07:34:27 +0000 (10:34 +0300)]
IB/mlx5: Fix initializing CQ fragments buffer
The function init_cq_frag_buf() can be called to initialize the current CQ
fragments buffer cq->buf, or the temporary cq->resize_buf that is filled
during CQ resize operation.
However, the offending commit started to use function get_cqe() for
getting the CQEs, the issue with this change is that get_cqe() always
returns CQEs from cq->buf, which leads us to initialize the wrong buffer,
and in case of enlarging the CQ we try to access elements beyond the size
of the current cq->buf and eventually hit a kernel panic.
[exception RIP: init_cq_frag_buf+103]
[
ffff9f799ddcbcd8] mlx5_ib_resize_cq at
ffffffffc0835d60 [mlx5_ib]
[
ffff9f799ddcbdb0] ib_resize_cq at
ffffffffc05270df [ib_core]
[
ffff9f799ddcbdc0] llt_rdma_setup_qp at
ffffffffc0a6a712 [llt]
[
ffff9f799ddcbe10] llt_rdma_cc_event_action at
ffffffffc0a6b411 [llt]
[
ffff9f799ddcbe98] llt_rdma_client_conn_thread at
ffffffffc0a6bb75 [llt]
[
ffff9f799ddcbec8] kthread at
ffffffffa66c5da1
[
ffff9f799ddcbf50] ret_from_fork_nospec_begin at
ffffffffa6d95ddd
Fix it by getting the needed CQE by calling mlx5_frag_buf_get_wqe() that
takes the correct source buffer as a parameter.
Fixes:
388ca8be0037 ("IB/mlx5: Implement fragmented completion queue (CQ)")
Link: https://lore.kernel.org/r/90a0e8c924093cfa50a482880ad7e7edb73dc19a.1623309971.git.leonro@nvidia.com
Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Aharon Landau [Thu, 10 Jun 2021 07:34:26 +0000 (10:34 +0300)]
RDMA/mlx5: Delete right entry from MR signature database
The value mr->sig is stored in the entry upon mr allocation, however, ibmr
is wrongly entered here as "old", therefore, xa_cmpxchg() does not replace
the entry with NULL, which leads to the following trace:
WARNING: CPU: 28 PID: 2078 at drivers/infiniband/hw/mlx5/main.c:3643 mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib]
Modules linked in: nvme_rdma nvme_fabrics nvme_core 8021q garp mrp bonding bridge stp llc rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_tad
CPU: 28 PID: 2078 Comm: reboot Tainted: G X --------- --- 5.13.0-0.rc2.19.el9.x86_64 #1
Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.9.1 12/07/2018
RIP: 0010:mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib]
Code: 8d bb 70 1f 00 00 be 00 01 00 00 e8 9d 94 ce da 48 3d 00 01 00 00 75 02 5b c3 0f 0b 5b c3 0f 0b 48 83 bb b0 20 00 00 00 74 d5 <0f> 0b eb d1 4
RSP: 0018:
ffffa8db06d33c90 EFLAGS:
00010282
RAX:
0000000000000000 RBX:
ffff97f890a44000 RCX:
ffff97f900ec0160
RDX:
0000000000000000 RSI:
0000000080080001 RDI:
ffff97f890a44000
RBP:
ffffffffc0c189b8 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000001 R11:
0000000000000300 R12:
ffff97f890a44000
R13:
ffffffffc0c36030 R14:
00000000fee1dead R15:
0000000000000000
FS:
00007f0d5a8a3b40(0000) GS:
ffff98077fb80000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000555acbf4f450 CR3:
00000002a6f56002 CR4:
00000000001706e0
Call Trace:
mlx5r_remove+0x39/0x60 [mlx5_ib]
auxiliary_bus_remove+0x1b/0x30
__device_release_driver+0x17a/0x230
device_release_driver+0x24/0x30
bus_remove_device+0xdb/0x140
device_del+0x18b/0x3e0
mlx5_detach_device+0x59/0x90 [mlx5_core]
mlx5_unload_one+0x22/0x60 [mlx5_core]
shutdown+0x31/0x3a [mlx5_core]
pci_device_shutdown+0x34/0x60
device_shutdown+0x15b/0x1c0
__do_sys_reboot.cold+0x2f/0x5b
? vfs_writev+0xc7/0x140
? handle_mm_fault+0xc5/0x290
? do_writev+0x6b/0x110
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes:
e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
Link: https://lore.kernel.org/r/f3f585ea0db59c2a78f94f65eedeafc5a2374993.1623309971.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Maor Gottlieb [Thu, 10 Jun 2021 07:34:25 +0000 (10:34 +0300)]
RDMA: Verify port when creating flow rule
Validate port value provided by the user and with that remove no longer
needed validation by the driver. The missing check in the mlx5_ib driver
could cause to the below oops.
Call trace:
_create_flow_rule+0x2d4/0xf28 [mlx5_ib]
mlx5_ib_create_flow+0x2d0/0x5b0 [mlx5_ib]
ib_uverbs_ex_create_flow+0x4cc/0x624 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd4/0x150 [ib_uverbs]
ib_uverbs_cmd_verbs.isra.7+0xb28/0xc50 [ib_uverbs]
ib_uverbs_ioctl+0x158/0x1d0 [ib_uverbs]
do_vfs_ioctl+0xd0/0xaf0
ksys_ioctl+0x84/0xb4
__arm64_sys_ioctl+0x28/0xc4
el0_svc_common.constprop.3+0xa4/0x254
el0_svc_handler+0x84/0xa0
el0_svc+0x10/0x26c
Code:
b9401260 f9615681 51000400 8b001c20 (
f9403c1a)
Fixes:
436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs")
Link: https://lore.kernel.org/r/faad30dc5219a01727f47db3dc2f029d07c82c00.1623309971.git.leonro@nvidia.com
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>