Lukasz Luba [Thu, 6 Sep 2018 08:17:13 +0000 (10:17 +0200)]
EXPERIMENTAL: sched/fair: modify slow path by adding explicit idle search
The patch adds a search for an idle CPU in the sched domain, preferring the local group.
This should give the scheduler the opportunity to find an idle CPU even when its
capacity is smaller than the sum of spare capacities of another group.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Tue, 4 Sep 2018 12:29:41 +0000 (14:29 +0200)]
EXPERIMENTAL: sched/fair: find idle cpu in the group
In find_idlest_group(), take into account any CPU which is actually idle.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Mon, 3 Sep 2018 16:13:49 +0000 (18:13 +0200)]
EXPERIMENTAL: sched/fair: change load balance utilization checks
The patch makes the load balance target switch decision less conservative.
The workload is spread more widely and there is no idle CPU when the
system is fully loaded.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Mon, 3 Sep 2018 13:55:40 +0000 (15:55 +0200)]
sched/fair: change active balance path and add tracing
The patch changes the default EAS behavior in the active balance path.
It tries to utilize all CPUs in some sysbench workloads.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Mon, 3 Sep 2018 13:53:20 +0000 (15:53 +0200)]
DT: arm64: exynos5433: change LITTLE core capacity ratio
This patch tries to align big and LITTLE CPU performance at the same frequency.
This assumption is true only for some particular workloads.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Wed, 22 Aug 2018 16:56:12 +0000 (18:56 +0200)]
trace: sched: add new trace events for tracking migrations
This patch adds some new trace events which can help during load balance
and/or migration investigations.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Chanwoo Choi [Tue, 19 Sep 2017 09:12:52 +0000 (18:12 +0900)]
LOCAL / clocksource: arm_arch_timer: Don't register clockevent for Per-CPU
The arm64 architecture always enables the arm_arch_timer,
even if the arm_arch_timer is not stable. When Exynos5433 uses the
arm_arch_timer, it fails to enable/disable the secondary CPUs.
To fix this secondary-CPU hotplug issue, the arm_arch_timer does not
register the per-CPU clockevent if Exynos's MCT timer is enabled.
Change-Id: Ia41f9c13889c4637391168dbc828a0bdbc7cdfff
Reported-by: Wook Song <wook16.song@samsung.com>
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
Chanwoo Choi [Fri, 9 Sep 2016 08:22:47 +0000 (17:22 +0900)]
LOCAL / clocksource: arch_timer: Use phy timer instead of virt timer
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
mainline_usb_v4.18-rc6 [Mon, 3 Sep 2018 09:54:03 +0000 (11:54 +0200)]
FIX: drivers/usb: drop changes for hikey and go back to v4.18-rc6
This is a copy of the whole directory, which reverts the HiSilicon changes to usb and gadget back to v4.18-rc6.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 12:59:22 +0000 (14:59 +0200)]
DT: arm64: exynos5433: add cpu-map with cores and clusters constellation
The patch adds the missing topology information. This is required e.g.
for properly building the scheduler topology, groups, and the energy model.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 12:51:37 +0000 (14:51 +0200)]
DT: arm64: exynos5433: add dynamic power coefficient to cpu nodes
The patch adds the dynamic power coefficients for the Energy Model
needed by EAS and/or IPA. The values were captured and calculated
during experiments.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 10:53:46 +0000 (12:53 +0200)]
config: arm64: tm2: enable new energy model
The patch enables the new Energy Model framework for tm2 (Exynos5433 SoC).
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 09:37:50 +0000 (11:37 +0200)]
DT: arm64: exynos5433: add capacity to cpu nodes
The patch adds the CPU capacity values needed by EAS.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 09:26:19 +0000 (11:26 +0200)]
config: arm64: add needed cgroup and trace settings to tm2 config
The patch ports some missing config options from the basic Tizen config to
enable EAS with tracing on top of the basic v4.14 TizenOS tm2 config.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Lukasz Luba [Thu, 30 Aug 2018 09:12:16 +0000 (11:12 +0200)]
config: arm64: port tm2 config from v4.14 from tizen
The patch takes the config from TizenOS v4.14 as a base
starting point for the mainline kernel.
Signed-off-by: Lukasz Luba <l.luba@partner.samsung.com>
Patrick Bellasi [Fri, 27 Oct 2017 15:12:51 +0000 (16:12 +0100)]
sched/events: Introduce util_est trace events
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
Change-Id: I65e294c454369cbc15a29370d8a13ce358a95c39
Dietmar Eggemann [Fri, 17 Mar 2017 21:23:35 +0000 (21:23 +0000)]
sched/events: Introduce task_group load tracking trace event
The trace event key load is mapped to:
(1) load : cfs_rq->tg->load_avg
The cfs_rq owned by the task_group is used as the only parameter for the
trace event because it has a reference to the taskgroup and the cpu.
Using the taskgroup as a parameter instead would require the cpu as a
second parameter. A task_group is global and not per-cpu data. The cpu
key only tells on which cpu the value was gathered.
The following list shows examples of the key=value pairs for:
(1) a task group:
cpu=1 path=/tg1/tg11/tg111 load=517
(2) an autogroup:
cpu=1 path=/autogroup-10 load=1050
We don't maintain a load signal for a root task group.
The trace event is only defined if cfs group scheduling support
(CONFIG_FAIR_GROUP_SCHED) is enabled.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Dietmar Eggemann [Mon, 20 Mar 2017 17:26:47 +0000 (17:26 +0000)]
sched/events: Introduce sched_entity load tracking trace event
The following trace event keys are mapped to:
(1) load : se->avg.load_avg
(2) rbl_load : se->avg.runnable_load_avg
(3) util : se->avg.util_avg
To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
special handling is necessary for non-existent key=value pairs:
path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the
sched_entity represents a task.
comm = "(null)" : In case sched_entity represents a task_group.
pid = -1 : In case sched_entity represents a task_group.
The following list shows examples of the key=value pairs in different
configurations for:
(1) a task:
cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102 util=102
(2) a taskgroup:
cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510
(3) an autogroup:
cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48
(4) w/o CONFIG_FAIR_GROUP_SCHED:
cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265
The trace event is only defined for CONFIG_SMP.
The helper functions __trace_sched_cpu(), __trace_sched_path() and
__trace_sched_id() are extended to deal with sched_entities as well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Dietmar Eggemann [Fri, 17 Mar 2017 20:27:06 +0000 (20:27 +0000)]
sched/events: Introduce cfs_rq load tracking trace event
The following trace event keys are mapped to:
(1) load : cfs_rq->avg.load_avg
(2) rbl_load : cfs_rq->avg.runnable_load_avg
(3) util : cfs_rq->avg.util_avg
To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
special handling is necessary for a non-existent key=value pair:
path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.
The following list shows examples of the key=value pairs in different
configurations for:
(1) a root task_group:
cpu=4 path=/ load=6 rbl_load=6 util=331
(2) a task_group:
cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522
(3) an autogroup:
cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517
(4) w/o CONFIG_FAIR_GROUP_SCHED:
cpu=0 path=(null) load=314 rbl_load=314 util=289
The trace event is only defined for CONFIG_SMP.
The helper function __trace_sched_path() can be used to get the length
parameter of the dynamic array (path == NULL) and to copy the path into
it (path != NULL).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Dietmar Eggemann [Fri, 17 Mar 2017 19:09:03 +0000 (19:09 +0000)]
sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG
Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be
available to be printed in the load tracking trace events provided by
this patch-stack regardless of whether CONFIG_SCHED_DEBUG is set or not.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Quentin Perret [Fri, 18 May 2018 10:10:02 +0000 (11:10 +0100)]
cpufreq: scmi: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scmi-cpufreq driver which uses
the power costs provided by the firmware.
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Fri, 18 May 2018 10:09:40 +0000 (11:09 +0100)]
firmware: arm_scmi: add a getter for power of performance states
The SCMI protocol can be used to get power estimates from firmware
corresponding to each performance state of a device. Although these power
costs are already managed by the SCMI firmware driver, they are not
exposed to any external subsystem yet.
Fix this by adding a new get_power() interface to the existing perf_ops
defined for the SCMI protocol.
Change-Id: I2ba654a59e9633dcb2ce4128dbd99f2317d0ece8
Dietmar Eggemann [Fri, 1 Jun 2018 13:27:24 +0000 (14:27 +0100)]
cpufreq: arm_big_little: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the arm_big_little cpufreq driver,
which can rely on the power estimation helper provided by PM_OPP.
Todo: Check if the driver can handle -EPROBE_DEFER and if the call to
dev_pm_opp_get_opp_count() is really necessary.
Change-Id: Ia808262ef6c9f2cc7819a83e8eb2f602454edfa3
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ Removed the dependency on dev_pm_opp_of_estimate_power() ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Fri, 18 May 2018 10:08:19 +0000 (11:08 +0100)]
cpufreq: scpi: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scpi-cpufreq driver which can
estimate power using the P = C * V^2 * f equation where C, V, and f
respectively are the capacitance of the CPU and the voltage and frequency
of the OPP.
The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding, and the voltage and frequency values are obtained from PM_OPP.
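For illustration, such a callback could be sketched as below (a rough sketch,
not necessarily the exact driver code; the name, units (mW, kHz) and rounding
are assumptions following the Energy Model convention):

    /*
     * Sketch of an Energy Model active_power() callback implementing
     * P = C * V^2 * f from PM_OPP data. Error handling is simplified.
     */
    static int scpi_get_cpu_power(unsigned long *power_mw,
                                  unsigned long *freq_khz, int cpu)
    {
        struct device *cpu_dev = get_cpu_device(cpu);
        unsigned long hz = *freq_khz * 1000;
        unsigned long mv, mhz;
        struct dev_pm_opp *opp;
        u32 cap;

        if (!cpu_dev)
            return -ENODEV;

        /* C: the "dynamic-power-coefficient" DT property */
        if (of_property_read_u32(cpu_dev->of_node,
                                 "dynamic-power-coefficient", &cap))
            return -EINVAL;

        /* Find the OPP at (or above) the requested frequency */
        opp = dev_pm_opp_find_freq_ceil(cpu_dev, &hz);
        if (IS_ERR(opp))
            return PTR_ERR(opp);

        mv = dev_pm_opp_get_voltage(opp) / 1000;    /* uV -> mV */
        mhz = hz / 1000000;
        dev_pm_opp_put(opp);

        /* P[mW] = C * f[MHz] * V[mV]^2 / 1e9 */
        *power_mw = div_u64((u64)cap * mhz * mv * mv, 1000000000);
        *freq_khz = hz / 1000;

        return 0;
    }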
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Dietmar Eggemann [Tue, 5 Jun 2018 09:35:18 +0000 (10:35 +0100)]
arm: dts: vexpress-v2p-ca15_a7: Add dynamic-power-coefficient properties
The values are computed by measuring energy over a 10 second sysbench
workload running at each frequency, affine to 1 or 2 A15s as well as
1, 2 or 3 A7s. The power values for individual cpus are calculated from
the energy values divided by the workload runtime, taking the difference
of the energy values between n+1 and n cpus.
P [mW] = C * freq [MHz] * voltage [mV] * voltage [mV] / 1000000000

C = P [mW] / (freq [MHz] * voltage [mV] * voltage [mV]) * 1000000000
The actual C (dynamic-power-coefficient) value is the mean value out of
all the C values of the OPP's.
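As a cross-check, taking the first A15 OPP from the table below:
C = 534.57550 / (500 * 900 * 900) * 1000000000 ≈ 1319.94
which matches the dyn_pwr_coef column.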
A15:
       freq       power  voltage  dyn_pwr_coef
      [MHz]        [mW]     [mV]
  0   500.0   534.57550      900   1319.939506
  1   600.0   547.15468      900   1125.832675
  2   700.0   572.22060      900   1009.207407
  3   800.0   607.76592      900    937.910370
  4   900.0   648.50552      900    889.582332
  5  1000.0   693.86776      900    856.626864
  6  1100.0   916.51314      975    876.469442
  7  1200.0  1198.57566     1050    905.952880
mean: 990
A7:
       freq       power  voltage  dyn_pwr_coef
      [MHz]        [mW]     [mV]
  0   350.0    40.17430      900    141.708289
  1   400.0    42.68700      900    131.750000
  2   500.0    54.25716      900    133.968296
  3   600.0    64.09914      900    131.891235
  4   700.0    74.09736      900    130.683175
  5   800.0    82.69694      900    127.618735
  6   900.0   113.71386      975    132.911225
  7  1000.0   144.94124     1050    131.465977
mean: 133
The ratio between the A15 and A7 means is 990/133 = 7.44.
This value (7.44) is very close to the mean ratio between the power values of
A15 and A7 in the per sched-domain Energy Model (7.96):
   mV   A7 [MHz]  A15 [MHz]  old EM core power  ratio
  900        350        500          6997/1024   6.83
  900        400        600           5177/761   6.80
  900        500        700           3846/549   7.01
  900        600        800           3524/447   7.88
  900        700        900           3125/407   7.68
  900        800       1000           2756/334   8.25
  975        900       1100           2312/275   8.40
 1050       1000       1200           2021/187  10.80
mean: 7.96
Change-Id: I93ef375d05ff769481a07f2c74f061e307cb14d4
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Dietmar Eggemann [Tue, 22 May 2018 16:01:09 +0000 (17:01 +0100)]
arm64: dts: juno-r2: Add dynamic-power-coefficient properties
Taken from commit
cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" which also sets up SoC thermal zones and binds them to
cpufreq cooling devices. We don't want this functionality right now.
The commit is for example part of:
git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01
Change-Id: Id1a44fb7d222d59f7d44b5f55797d407513eb7e7
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Dietmar Eggemann [Sun, 18 Feb 2018 13:27:49 +0000 (14:27 +0100)]
arm64: dts: juno: Add dynamic-power-coefficient properties
Taken from commit
cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" which also sets up SoC thermal zones and binds them to
cpufreq cooling devices. We don't want this functionality right now.
The commit is for example part of:
git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01
Change-Id: I7c23a58fa49b281ed5df2f60db0514a9b3b50c7b
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Quentin Perret [Fri, 18 May 2018 10:07:59 +0000 (11:07 +0100)]
OPTIONAL: cpufreq: dt: Register an Energy Model
*******************************************************************
* This patch illustrates the usage of the newly introduced Energy *
* Model framework and isn't supposed to be merged as-is. *
*******************************************************************
The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f with C, V, and f respectively the
capacitance of the CPU and the voltage and frequency of the OPP.
The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.
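For illustration only, the registration could be sketched like this (the
active_power callback is assumed to be a driver-provided function computing
P = C * V^2 * f from PM_OPP data, as described above):

    /*
     * Sketch of registering an Energy Model from a cpufreq driver with
     * the API introduced by this series.
     */
    static int dt_cpufreq_active_power(unsigned long *mw, unsigned long *khz,
                                       int cpu);

    static void dt_cpufreq_register_em(struct cpufreq_policy *policy)
    {
        struct em_data_callback em_cb = EM_DATA_CB(dt_cpufreq_active_power);
        int nr_opp = dev_pm_opp_get_opp_count(get_cpu_device(policy->cpu));

        if (nr_opp > 0)
            em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
    }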
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Thu, 26 Apr 2018 13:58:51 +0000 (14:58 +0100)]
OPTIONAL: arch_topology: Start Energy Aware Scheduling
Energy Aware Scheduling (EAS) starts when the scheduling domains are
built if the Energy Model (EM) is present. However, in the typical case
of Arm/Arm64 systems, the EM is provided after the scheduling domains
are first built at boot time, which results in EAS staying disabled.
Fix this issue by re-building the scheduling domain from the arch
topology driver, once CPUfreq is up and running and the asymmetry in CPU
capacities has been detected.
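A minimal sketch of the mechanism (function names are illustrative):

    /*
     * Sketch: once cpufreq is up and asymmetric CPU capacities have been
     * detected, kick a full sched domain rebuild from process context.
     */
    static void update_topology_flags_workfn(struct work_struct *work)
    {
        rebuild_sched_domains();
    }

    static DECLARE_WORK(update_topology_flags_work,
                        update_topology_flags_workfn);

    static void topology_asym_capacity_detected(void)
    {
        schedule_work(&update_topology_flags_work);
    }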
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Dietmar Eggemann [Tue, 3 Jul 2018 09:49:59 +0000 (10:49 +0100)]
arm, arm64: Enable kernel config options required for EAS testing
arm and arm64:
Add Function, Function Graph, Irqsoff, Preempt, Sched Tracer
Add Prove Locking
Add Prove RCU
for arm64:
Add USB Net RTL8152
Add USB Net
Add USB Net AX8817X
Remove Mouse PS2
for arm:
Add kernel .config support and /proc/config.gz
Add ARM Big.Little cpufreq driver
Add ARM Big.Little cpuidle driver
Add Sensor Vexpress
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Dietmar Eggemann [Tue, 3 Jul 2018 09:35:03 +0000 (10:35 +0100)]
arm, arm64: Enable kernel config options required for EAS
arm and arm64:
Add Cgroups support
Add Energy Model
Add CpuFreq governors and make schedutil default
for arm:
Add Cpuset support
Add Scheduler autogroups
Add DIE sched domain level
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Patrick Bellasi [Tue, 10 Jul 2018 13:41:14 +0000 (14:41 +0100)]
sched/core: uclamp: use percentage clamp values
The utilization is a well defined property of tasks and CPUs with an
in-kernel representation based on power-of-two values.
The current representation, in the [0..SCHED_CAPACITY_SCALE] range,
allows efficient computations in hot-paths and a sufficient fixed point
arithmetic precision.
However, the utilization values range is still an implementation detail
which is also possibly subject to changes in the future.
Since we don't want to commit new user-space APIs to any in-kernel
implementation detail, let's add an abstraction layer on top of the APIs
used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's
cpu.util_{min,max} attributes.
We do that by adding a couple of conversion functions which can be used
to conveniently transform utilization/capacity values from/to the internal
SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage
in the standard [0..100] range.
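For instance, the conversion layer could boil down to a pair of helpers
along these lines (a sketch; helper names and rounding are assumptions):

    /* Sketch of the utilization <-> percentage conversion described above */
    static inline unsigned int scale_from_percent(unsigned int pct)
    {
        return DIV_ROUND_CLOSEST(pct * SCHED_FIXEDPOINT_SCALE, 100);
    }

    static inline unsigned int scale_to_percent(unsigned int value)
    {
        return DIV_ROUND_CLOSEST(value * 100, SCHED_FIXEDPOINT_SCALE);
    }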
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- none: this is a new patch
Patrick Bellasi [Thu, 5 Jul 2018 18:08:44 +0000 (19:08 +0100)]
sched/core: uclamp: update CPU's refcount on TG's clamp changes
When a task group refcounts a new clamp group, we need to ensure that
the new clamp values are immediately enforced to all its tasks which are
currently RUNNABLE. This is to ensure that all currently RUNNABLE tasks
are boosted and/or clamped as requested as soon as possible.
Let's ensure that, whenever a new clamp group is refcounted by a task
group, all its RUNNABLE tasks are correctly accounted in their
respective CPUs. We do that by slightly refactoring uclamp_group_get()
to take an additional *cgroup_subsys_state parameter which, when
provided, is used to walk the list of tasks in the corresponding TGs
and update the RUNNABLE ones.
This is a "brute force" solution which allows to reuse the same refcount
update code already used by the per-task API. That's also the only way
to ensure a prompt enforcement of new clamp constraints on RUNNABLE
tasks, as soon as a task group attribute is tweaked.
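A rough sketch of that walk (uclamp_task_update_active() is a hypothetical
stand-in for the refcount update code mentioned above):

    /*
     * Sketch: when a TG's clamp value changes, walk its tasks and re-sync
     * the CPU-side refcounts of the RUNNABLE ones.
     */
    static void uclamp_group_update_tasks(struct cgroup_subsys_state *css,
                                          int clamp_id)
    {
        struct css_task_iter it;
        struct task_struct *p;

        css_task_iter_start(css, 0, &it);
        while ((p = css_task_iter_next(&it)))
            uclamp_task_update_active(p, clamp_id);
        css_task_iter_end(&it);
    }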
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
- rebased on tip/sched/core
- fixed some typos
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
Patrick Bellasi [Fri, 27 Jul 2018 08:48:32 +0000 (09:48 +0100)]
sched/core: uclamp: add system default clamps
Clamp values cannot be tuned at the root cgroup level. Moreover, because
of the delegation model requirements and how the parent clamps
propagation works, if we want to enable subgroups to set a non null
util.min, we need to be able to configure the root group util.min to
allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
Unfortunately this setup will also mean that all tasks running in the
root group will always get a maximum util.min clamp, unless they have a
lower task specific clamp, which is definitely not a desirable default
configuration.
Let's fix this by explicitly adding a system default configuration
(sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
for all tasks running on the root group.
This interface is available independently from cgroups, thus providing a
complete solution for system wide utilization clamping configuration.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Patrick Bellasi [Fri, 16 Mar 2018 17:05:59 +0000 (17:05 +0000)]
sched/core: uclamp: use TG's clamps to restrict Task's clamps
When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated to those of the CPU's
controller's Task Group (TG) in which the task is currently living.
Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG effective value.
This also mimics what already happens for a task's CPU affinity mask when
the task is also living in a cpuset. The overall idea is that cgroup
attributes are always used to restrict the per-task attributes.
Thus, this implementation allows us to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG
For this mechanism to work properly, we add the concept of an "active"
clamp group, which is used to track the currently most restrictive clamp
value each task is subject to.
The active clamp is computed at enqueue time, by using an additional
task_struct::uclamp_group_id
to keep track of the clamp group in which each task is currently
accounted. This allows task constraints to be updated on
demand, only when a task becomes RUNNABLE, thus always using the most
restrictive clamp depending on the current TG's settings.
This solution also allows better decoupling of the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.
For consistency purposes, as well as to properly inform userspace, the
sched_getattr(2) call is updated to always return the properly
aggregated constraints as described above. This will also make
sched_getattr(2) a convenient userspace API to know the utilization
constraints enforced on a task by the cgroup's CPU controller.
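In short, the aggregation rule sketched in C (the real code works on clamp
group indices rather than raw values):

    /*
     * Sketch: a task is accounted in the clamp group mapping the smaller
     * of its own clamp value and its task group's effective value, since
     * cgroup attributes always restrict per-task requests.
     */
    static inline unsigned int uclamp_task_effective(unsigned int task_value,
                                                     unsigned int tg_effective)
    {
        return min(task_value, tg_effective);
    }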
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- fix not required override
- fix typos in changelog
Others:
- clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
clamp group_id/value code into dedicated getter functions:
uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
- rebased on tip/sched/core
Changes in v2:
OSPM discussion:
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values, i.e. tasks running on a
TG are only allowed to demote themselves.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
Patrick Bellasi [Thu, 5 Jul 2018 17:54:30 +0000 (18:54 +0100)]
sched/core: uclamp: map TG's clamp values into CPU's clamp groups
Utilization clamping requires mapping each different clamp value
into one of the available clamp groups used by the scheduler's fast-path
to account for RUNNABLE tasks. Thus, each time a TG's clamp value is
updated we need to get a reference to the new value's clamp group and
release a reference to the previous one.
Let's ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time).
We do that by slightly refactoring uclamp_group_get() to make the
*task_struct parameter optional. This allows re-using the code already
available to support the per-task API.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add explicit calls to uclamp_group_find(), which is no longer
part of uclamp_group_get()
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
Patrick Bellasi [Wed, 25 Jul 2018 16:35:03 +0000 (17:35 +0100)]
sched/core: uclamp: propagate parent clamps
In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are (potentially) constrained based on the parent's assigned
resources. This requires properly propagating and aggregating parent
attributes down to its descendants.
Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This also represents the clamp value which is actually
used to clamp tasks in each task group.
Since it can be interesting for tasks in a cgroup to know exactly what
is the currently propagated/enforced configuration, the effective clamp
values are exposed to user-space by means of a new pair of read-only
attributes: cpu.util.{min,max}.effective.
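The propagation rule can be summarized by the following sketch (field
names are illustrative, not the patch's exact data layout):

    /*
     * Sketch: a group's effective clamp is the smaller of its own
     * requested value and its parent's effective value.
     */
    static void cpu_util_update_effective(struct task_group *tg, int clamp_id)
    {
        unsigned int parent_eff = SCHED_CAPACITY_SCALE;

        if (tg->parent)
            parent_eff = tg->parent->uclamp[clamp_id].effective;

        tg->uclamp[clamp_id].effective =
            min(tg->uclamp[clamp_id].value, parent_eff);
    }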
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com>
- new patch in v3, to implement a suggestion from v1 review
Patrick Bellasi [Wed, 16 Aug 2017 14:50:27 +0000 (15:50 +0100)]
sched/core: uclamp: extend cpu's cgroup controller
The cgroup's CPU controller allows assigning a specified (maximum)
bandwidth to the tasks of a group. However this bandwidth is defined and
enforced only on a temporal basis, without considering the actual
frequency a CPU is running at. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency at which the CPU is running that task.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Given the above mechanisms, it is now possible to extend the cpu
controller to specify what is the minimum (or maximum) utilization which
a task is expected (or allowed) to generate.
Constraints on minimum and maximum utilization allowed for tasks in a
CPU cgroup can improve the control on the actual amount of CPU bandwidth
consumed by tasks.
Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can be
conveniently used to enforce important interactive tasks on more capable
CPUs or to run low priority and background tasks on more energy
efficient CPUs.
The ultimate goal of utilization clamping is thus to enable:
- boosting: by selecting a higher capacity CPU and/or higher execution
frequency for small tasks which are affecting the user
interactive experience.
- capping: by selecting more energy efficient CPUs or lower execution
frequency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.
Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich information hints to the
scheduler regarding the tasks it's going to schedule.
This patch extends the CPU controller by adding a couple of new
attributes, util.min and util.max, which can be used to enforce task's
utilization boosting and capping. Specifically:
- util.min: defines the minimum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run at least at a minimum frequency which
corresponds to the min_util utilization
- util.max: defines the maximum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run up to a maximum frequency which
corresponds to the max_util utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies
b) do not enforce any constraints and/or dependency between the parent
and its child nodes, thus relying on the delegation model and
permission settings defined by the system management software
c) allow (eventually) further restricting task-specific clamps defined
via sched_setattr(2)
This patch provides the basic support to expose the two new attributes
and to validate their run-time updates. However, we do not actually
allocate clamp groups and thus the write calls added by this patch
always return -EINVAL. Following patches will provide the missing bits.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
Others
- rebased on tip/sched/core
Changes in v2:
Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com>
- make attributes available only on non-root nodes
a system wide API seems of no immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies
Message-ID: <20180410200514.GA793541@devbig577.frc2.facebook.com>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Others:
- rebased on v4.18-rc4
- reduced code to simplify the review of this patch
which now provides just the basic code for CGroups integration
- add attributes to the default hierarchy as well as the legacy one
- use -ERANGE as range violation error
These additional bits:
- refcounting of clamp groups
- RUNNABLE tasks refcount updates
- aggregation of per-task and per-task_group utilization constraints
are provided in separate and following patches to make it more clear and
documented how they are performed.
Patrick Bellasi [Fri, 6 Jul 2018 13:39:22 +0000 (14:39 +0100)]
sched/core: uclamp: enforce last task UCLAMP_MAX
When a util_max clamped task sleeps, its clamp constraints are removed
from the CPU. However, the blocked utilization on that CPU can still be
higher than the max clamp value enforced while that task was running.
This max clamp removal when a CPU is going to be idle could thus allow
unwanted CPU frequency increases, right while the task is not running.
This can happen, for example, when there is another (smaller) task
running on a different CPU of the same frequency domain.
In this case, when we aggregate the utilization of all the CPUs in a
shared frequency domain, schedutil can still see the full non clamped
blocked utilization of all the CPUs and thus eventually increase the
frequency.
Let's fix this by using:
uclamp_cpu_put_id(UCLAMP_MAX)
uclamp_cpu_update(last_clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition. Thus, while a CPU is idle, we can still enforce the last used
clamp value for it.
On the contrary, we do not track any UCLAMP_MIN since, while a CPU is
idle, we don't want to enforce any minimum frequency.
Indeed, we rely just on blocked load decay to smoothly reduce the
frequency.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Changes in v2:
- rebased on v4.18-rc4
- new patch to improve a specific issue
Patrick Bellasi [Thu, 9 Feb 2017 15:02:05 +0000 (15:02 +0000)]
sched/cpufreq: uclamp: add utilization clamping for RT tasks
Currently schedutil enforces a maximum frequency when RT tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace, thus allowing, for example, the definition of a max frequency which is
still reasonable for the execution of a specific RT workload. This
will contribute to making the RT class more friendly for power/energy
sensitive use-cases.
This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the util required is
defined by the constraints of the CPU control group the task belongs to.
Since utilization clamping now applies to both CFS and RT tasks, there
can be two alternative approaches:
A) clamp the combined utilization
B) combine the clamped utilizations
which have pros and cons.
Approach A) is more power efficient, since it generally selects lower
frequencies when we have both RT and CFS utilization. However, this
could affect performance of the lower priority CFS class, since the
minimum utilization clamp could be completely eclipsed by the RT
utilization.
Approach B) is more fair to the lower priority CFS class since it always
adds the required minimum utilization to that class too. For that reason
it could be less power efficient and, since we do not distinguish clamp
values based on the scheduling class, it could also end up boosting CFS
tasks more than required (e.g. when the current min utilization of a CPU
is required by an RT task). That's why this approach is masked behind a
sched feature.
The IO wait boost value is thus subject to clamping for RT tasks too.
This is to ensure that RT tasks as well as CFS ones are always subject
to the set of current utilization clamping constraints.
It's worth noticing that, by default, clamp values are
min_util, max_util = (0, SCHED_CAPACITY_SCALE)
and thus, RT tasks always run at the maximum OPP if not otherwise
constrained by userspace.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
Patrick Bellasi [Thu, 9 Feb 2017 22:57:57 +0000 (22:57 +0000)]
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider the requirements suggested by userspace, for
example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "required" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
frequency selection, thus allowing a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
Let's add the required support to clamp the utilization generated by
FAIR tasks within the boundaries defined by their aggregated utilization
clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each task. This max aggregation
responds to the goal of not penalizing, for example, high boosted (i.e.
more important for the user-experience) CFS tasks which happen to be
co-scheduled with high capped (i.e. less important for the
user-experience) CFS tasks.
For FAIR tasks both the utilization as well as the IOWait boost values
are clamped according to the CPU aggregated utilization clamp
constraints.
The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.
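Schematically, the effect on schedutil's input is (a sketch; uclamp_value()
is an illustrative stand-in for the per-CPU aggregation):

    /*
     * Sketch: clamp the CFS utilization seen by schedutil within the
     * CPU's aggregated [util_min, util_max] boundaries.
     */
    static unsigned long sugov_clamp_util(struct rq *rq, unsigned long util_cfs)
    {
        return clamp(util_cfs,
                     uclamp_value(rq, UCLAMP_MIN),
                     uclamp_value(rq, UCLAMP_MAX));
    }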
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
Patrick Bellasi [Thu, 5 Jul 2018 11:57:48 +0000 (12:57 +0100)]
sched/core: uclamp: update CPU's refcount on clamp changes
Utilization clamp values enforced on a CPU by a task can be updated at
run-time, for example via a sched_setattr syscall, while a task is
currently RUNNABLE on that CPU. In these cases, the task may already be
refcounting a clamp group for its CPU and thus we need to update this
reference to ensure the new constraints are immediately enforced.
Since a clamp value change always implies a clamp group refcount update,
this patch hooks into the clamp group refcount getter to trigger a CPU
refcount syncup. Such a syncup is required only by currently RUNNABLE
tasks which are also referencing at least one valid clamp group.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <20180413111900.GF4082@hirez.programming.kicks-ass.net>
- get rid of the group_id back annotation
which is not required at this stage where we have only per-task
clamping support. It will be introduced later when CGroups support is
added.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
Patrick Bellasi [Wed, 16 Aug 2017 18:08:37 +0000 (19:08 +0100)]
sched/core: uclamp: add CPU's clamp groups accounting
Utilization clamping allows clamping the utilization of a CPU within a
[util_min, util_max] range. This range depends on the set of currently
RUNNABLE tasks on a CPU, where each task references two "clamp groups"
defining the util_min and the util_max clamp values to be considered for
that task. The clamp value mapped by a clamp group applies to a CPU only
when there is at least one task RUNNABLE referencing that clamp group.
When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
it can be required to re-compute what is the new "aggregated" clamp
value to apply on that CPU.
Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.
Tasks have a:
task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group in
which they should refcount at enqueue time.
CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index.
The clamp value of each clamp group is tracked by
rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an
unordered array of clamp values. However, the MAX aggregation of the
currently active clamp groups is implemented to minimize the number of
times we need to scan the complete (unordered) clamp group array to
figure out the new max value. This operation indeed happens only when we
dequeue the last task of the clamp group corresponding to the current max
clamp, and thus the CPU is either entering IDLE or going to schedule a
less boosted or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect fewer than
10 different values.
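The MAX aggregation scan can be sketched as follows (UCLAMP_GROUPS and the
destination field are illustrative names; the array layout follows the
description above):

    /*
     * Sketch: scan the (unordered) clamp group array and pick the highest
     * clamp value among groups which currently have RUNNABLE tasks.
     */
    static void uclamp_cpu_update(struct rq *rq, int clamp_id)
    {
        unsigned int max_value = 0;
        int group_id;

        for (group_id = 0; group_id < UCLAMP_GROUPS; group_id++) {
            if (!rq->uclamp.group[clamp_id][group_id].tasks)
                continue;
            max_value = max(max_value,
                            rq->uclamp.group[clamp_id][group_id].value);
        }

        rq->uclamp.value[clamp_id] = max_value;
    }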
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <20180413093822.GM4129@hirez.programming.kicks-ass.net>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <20180413094615.GT4043@hirez.programming.kicks-ass.net>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Other:
- rebased on v4.18-rc4
- improved documentation to make some concepts more explicit.
Patrick Bellasi [Wed, 16 Aug 2017 15:43:30 +0000 (16:43 +0100)]
sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
Utilization clamping requires each CPU to know which clamp values are
assigned to tasks that are currently RUNNABLE on that CPU.
Multiple tasks can be assigned the same clamp value and tasks with
different clamp values can be concurrently active on the same CPU.
Thus, a proper data structure is required to support a fast and
efficient aggregation of the clamp values required by the currently
RUNNABLE tasks.
For this purpose we use a per-CPU array of reference counters,
where each slot is used to account for how many tasks requiring a certain
clamp value are currently RUNNABLE on each CPU.
Each clamp value corresponds to a "clamp index" which identifies the
position within the array of reference counters.
                                 :
       (user-space changes)      :     (kernel space / scheduler)
                                 :
             SLOW PATH           :             FAST PATH
                                 :
    task_struct::uclamp::value   :    sched/core::enqueue/dequeue
                                 :        cpufreq_schedutil
                                 :
  +----------------+    +--------------------+    +-------------------+
  |      TASK      |    |     CLAMP GROUP    |    |    CPU CLAMPS     |
  +----------------+    +--------------------+    +-------------------+
  |                |    |   clamp_{min,max}  |    |   clamp_{min,max} |
  | util_{min,max} |    |      se_count      |    |    tasks count    |
  +----------------+    +--------------------+    +-------------------+
                                 :
           +------------------>  :  +------------------->
  group_id = map(clamp_value)    :  ref_count(group_id)
                                 :
Let's introduce the support to map tasks to "clamp groups".
Specifically we introduce the required functions to translate a
"clamp value" into a clamp's "group index" (group_id).
Only a limited number of (different) clamp values are supported since:
1. there are usually only a few classes of workloads for which it makes
sense to boost/limit to different frequencies,
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path.
The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can result in
a -ENOSPC error in case this would exceed the maximum number of different
clamp values supported.
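The mapping can be sketched as follows ('uclamp_maps' and 'UCLAMP_GROUPS'
are illustrative names; se_count tracks how many scheduling entities
currently reference a slot):

    /* Sketch of the clamp value -> group_id mapping described above */
    static int uclamp_group_find(int clamp_id, unsigned int clamp_value)
    {
        int free_group_id = UCLAMP_NOT_VALID;
        int group_id;

        for (group_id = 0; group_id < UCLAMP_GROUPS; group_id++) {
            /* Remember the first free slot, in case a new one is needed */
            if (free_group_id == UCLAMP_NOT_VALID &&
                !uclamp_maps[clamp_id][group_id].se_count)
                free_group_id = group_id;
            /* Reuse a slot already tracking this clamp value */
            if (uclamp_maps[clamp_id][group_id].value == clamp_value)
                return group_id;
        }

        /* All slots are taken by different clamp values */
        if (free_group_id == UCLAMP_NOT_VALID)
            return -ENOSPC;

        return free_group_id;
    }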
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- remove not necessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() fail if both clamps are required but
there are no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- set UCLAMP_GROUPS_COUNT=2 by default
which allows fitting all the hot-path CPU clamps data, partially
introduced also by the following patches, into a single cache line
while still supporting up to 2 different {min,max}_util clamps.
Patrick Bellasi [Fri, 16 Mar 2018 16:31:10 +0000 (16:31 +0000)]
sched/core: uclamp: extend sched_setattr to support utilization clamping
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can be translated into proper
decisions for both task placement and frequency selection.
Other classes have a more simplified model which is essentially based on
the relatively simple concept of POSIX priorities.
Such a simple priority based model however does not allow exploiting
some of the most advanced features of the Linux scheduler like, for
example, driving frequencies selection via the schedutil cpufreq
governor. However, also for non SCHED_DEADLINE tasks, it's still
interesting to define tasks properties which can be used to better
support certain scheduler decisions.
Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization for a task.
This will allow implementing a more advanced per-task frequency control
mechanism which is not based just on a "passive" measured task
utilization but on a more "active" approach. For example, it could be
possible to boost interactive tasks, thus getting better performance, or
cap background tasks, thus being more energy efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq's
powersave, performance and userspace governors, but with much finer
grained and per-task control.
Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr, a syscall which already
allows defining task specific properties for different scheduling
classes.
Specifically, a new pair of attributes allows specifying a minimum and
maximum utilization which the scheduler should consider for a task.
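From user-space this could then be used along these lines (a sketch; the
attribute and flag names/values are assumptions based on the description
and must match the patched uapi headers):

    /* User-space sketch: request a minimum utilization for a task */
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_FLAG_UTIL_CLAMP
    #define SCHED_FLAG_UTIL_CLAMP 0x20    /* assumed; see the patched headers */
    #endif

    struct sched_attr_uclamp {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime, sched_deadline, sched_period;
        uint32_t sched_util_min, sched_util_max;    /* new attributes */
    };

    int set_task_util_min(pid_t pid, unsigned int util_min)
    {
        struct sched_attr_uclamp attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_flags = SCHED_FLAG_UTIL_CLAMP;
        attr.sched_util_min = util_min;
        attr.sched_util_max = 1024;    /* SCHED_CAPACITY_SCALE */

        return syscall(SYS_sched_setattr, pid, &attr, 0);
    }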
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- move at the head of the series
As discussed at OSPM, using a [0..SCHED_CAPACITY_SCALE] range seems to
be acceptable. However, an additional patch has been added at the end of
the series which introduces a simple abstraction to use a more
generic [0..100] range.
At OSPM we also discarded the idea of "recycling" the usage of
sched_runtime and sched_period, which would have made the API too
complex for limited benefits.
Morten Rasmussen [Fri, 20 Jul 2018 13:32:34 +0000 (14:32 +0100)]
arch/arm: Rebuild sched_domain hierarchy when cpu capacity changes
Asymmetric cpu capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later triggered by the arch_topology driver. A full rebuild requires the
arch-code to implement arch_update_cpu_topology() which isn't yet
implemented for arm. This patch points the arm implementation to
arch_topology driver to ensure that full hierarchy rebuild happens when
needed.
cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Fri, 20 Jul 2018 13:32:33 +0000 (14:32 +0100)]
arch/arm64: Rebuild sched_domain hierarchy when cpu capacity changes
Asymmetric cpu capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later triggered by the arch_topology driver. A full rebuild requires the
arch-code to implement arch_update_cpu_topology() which isn't yet
implemented for arm64. This patch points the arm64 implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.
cc: Catalin Marinas <catalin.marinas@arm.com>
cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Fri, 20 Jul 2018 13:32:32 +0000 (14:32 +0100)]
drivers/base/arch_topology: Rebuild sched_domain hierarchy when capacities change
The setting of SD_ASYM_CPUCAPACITY depends on the per-cpu capacities.
These might not have their final values when the hierarchy is initially
built as the values depend on cpufreq to be initialized or the values
being set through sysfs. To ensure that the flags are set correctly we
need to rebuild the sched_domain hierarchy whenever the reported per-cpu
capacity (arch_scale_cpu_capacity()) changes.
This patch ensures that a full sched_domain rebuild happens when cpu
capacity changes occur.
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Fri, 20 Jul 2018 13:32:31 +0000 (14:32 +0100)]
sched/topology: SD_ASYM_CPUCAPACITY flag detection
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all cpu capacities are visible from
any cpu's point of view on asymmetric cpu capacity systems. The
scheduler can then take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available cpu
capacities as asymmetric systems might often appear symmetric at
smallest level(s) of the sched_domain hierarchy.
The flag has been around for a while but so far has only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all cpus in the system, i.e. when hotplugging
all cpus out except those with one particular cpu capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flag should be set therefore depends not only on
topology information but also on the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.
Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state are already taken care of. All the arch code is
responsible for is implementing arch_scale_cpu_capacity() and forcing a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Dietmar Eggemann [Mon, 26 Jan 2015 19:47:28 +0000 (19:47 +0000)]
sched: Enable idle balance to pull single task towards cpu with higher capacity
We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu. Add an
extra criterion to need_active_balance() to kick off active load balance
if the source cpu is over-utilized and has lower capacity than the
destination cpu.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Morten Rasmussen [Thu, 2 Jul 2015 16:16:34 +0000 (17:16 +0100)]
sched: Prevent unnecessary active balance of single task in sched group
Scenarios where the busiest group has just one task and the local group
is idle, on topologies with sched groups of different sizes, manage to
dodge all load-balance bailout conditions, resulting in the
nr_balance_failed counter being incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().
A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Wed, 4 Jul 2018 10:17:50 +0000 (11:17 +0100)]
sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domains to take advantage of more caches
and cores on SMT systems. It has recently been changed to be set on all
non-NUMA topology levels. However, spreading across domains with cpu
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity cpus even if the high capacity cpus aren't overutilized might
give access to more cache but the cpus will be slower, possibly leading
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Chris Redpath [Wed, 4 Jul 2018 10:17:48 +0000 (11:17 +0100)]
sched/fair: Don't move tasks to lower capacity cpus unless necessary
When lower capacity CPUs are load balancing and considering pulling
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Valentin Schneider [Wed, 4 Jul 2018 10:17:47 +0000 (11:17 +0100)]
sched/fair: Set rq->rd->overload when misfit
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are slower
than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks
_ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
   big_0 | c c c c a a
   big_1 | d d d d b b
                   ^
                   ^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
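A minimal sketch of the idea, with the variable names assumed rather than
taken from the actual patch:

	/* In update_sg_lb_stats(): also report overload when the group
	 * carries a misfit task, so idle balance on the big CPUs runs. */
	if (sgs->group_misfit_task_load)
		*overload = true;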
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Valentin Schneider [Wed, 4 Jul 2018 10:17:46 +0000 (11:17 +0100)]
sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.
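A sketch of the accessor pattern (READ_ONCE()/WRITE_ONCE() are the standard
kernel macros; the surrounding load-balance code is assumed):

	/* Avoid torn or elided accesses to the lockless flag. */
	if (READ_ONCE(env->dst_rq->rd->overload) != overload)
		WRITE_ONCE(env->dst_rq->rd->overload, overload);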
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Valentin Schneider [Wed, 4 Jul 2018 10:17:45 +0000 (11:17 +0100)]
sched: Change root_domain->overload type to int
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Valentin Schneider [Wed, 4 Jul 2018 10:17:44 +0000 (11:17 +0100)]
sched/fair: Change prefer_sibling type to bool
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Valentin Schneider [Wed, 4 Jul 2018 10:17:43 +0000 (11:17 +0100)]
sched/fair: Kick nohz balance if rq->misfit_task_load
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
* rq->nr_running >= 2: Not relevant here because we are running a
misfit task; it needs to be migrated regardless, potentially through
active balance.
* sds->nr_busy_cpus > 1: If only the misfit task is being run on a
group of low capacity cpus, this will evaluate to false.
* rq->cfs.h_nr_running >= 1 && check_cpu_capacity(): Not relevant here,
the misfit task needs to be migrated regardless of rt/IRQ pressure.
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see [1].
[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
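Schematically, the added condition looks like the sketch below (only the
field name comes from the patch; the surrounding nohz-kick logic is assumed):

	/* Kick a nohz balance so the misfit task can be pulled away. */
	if (rq->misfit_task_load)
		return true;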
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Wed, 4 Jul 2018 10:17:42 +0000 (11:17 +0100)]
sched/fair: Consider misfit tasks when load-balancing
On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In these scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and a destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has the lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
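As an illustration of modification 1, a sketch with purely illustrative
helper and variable names (not the actual patch code):

	/* Only treat a misfit group as busiest if the destination group
	 * has both higher per-cpu capacity and some spare capacity. */
	if (sgs->group_type == group_misfit_task &&
	    (!local_has_spare_capacity || local_capacity <= busiest_capacity))
		return false;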
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Wed, 4 Jul 2018 10:17:41 +0000 (11:17 +0100)]
sched: Add sched_group per-cpu max capacity
The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple cpus if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group is definitely
compute capacity constrained, either due to rt/irq pressure on all cpus
or to asymmetric cpu capacities (e.g. big.LITTLE).
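A sketch of the aggregation (field names follow the commit subject;
min()/max() are the kernel macros):

	/* Track both extremes while (re)computing group capacity. */
	sgc->min_capacity = min(sgc->min_capacity, capacity);
	sgc->max_capacity = max(sgc->max_capacity, capacity);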
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Wed, 4 Jul 2018 10:17:40 +0000 (11:17 +0100)]
sched/fair: Add group_misfit_task load-balance type
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity in addition to the current average
load based load-balancing policy. Tasks with high utilization that are
scheduled on a lower capacity cpu need to be identified and migrated to
a higher capacity cpu if possible to maximize throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on each sched_tick.
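A sketch of the per-cpu tracking, assuming an ~80% fit threshold and helper
names that may differ from the actual patch:

	static void update_misfit_status_sketch(struct task_struct *p, struct rq *rq)
	{
		/* Task fits if its utilization stays below ~80% of the cpu's capacity. */
		if (!p || task_util(p) * 1280 < capacity_of(cpu_of(rq)) * 1024) {
			rq->misfit_task = 0;
			return;
		}
		rq->misfit_task = 1;	/* needs a higher-capacity cpu */
	}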
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Morten Rasmussen [Wed, 4 Jul 2018 10:17:39 +0000 (11:17 +0100)]
sched: Add static_key for asymmetric cpu capacity optimizations
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
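A sketch of the guard (the key name and the helper call are assumptions; the
pattern mirrors the SMT sched_smt_present key):

	DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);

	/* In hot paths, symmetric systems only pay a patched-out branch: */
	if (static_branch_unlikely(&sched_asym_cpucapacity))
		update_misfit_status(p, rq);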
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Quentin Perret [Fri, 10 Nov 2017 12:17:34 +0000 (12:17 +0000)]
sched/fair: Select an energy-efficient CPU on task wake-up
If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each frequency domain. This strategy spreads tasks in
a frequency domain and avoids overly aggressive task packing. The best
CPU energy-wise is then selected if it saves a large enough amount of
energy with respect to prev_cpu.
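In rough pseudo-C, the placement loop can be sketched as below (the helper
names and the energy margin are assumptions, not the actual patch code):

	best_cpu = prev_cpu;
	best_energy = compute_energy(p, prev_cpu);
	for_each_freq_domain(fd) {
		cpu = cpu_with_max_spare_capacity(fd, p);
		energy = compute_energy(p, cpu);
		if (energy + margin < best_energy) {
			best_energy = energy;
			best_cpu = cpu;
		}
	}
	return best_cpu;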
Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful as writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Fri, 10 Nov 2017 11:20:03 +0000 (11:20 +0000)]
sched/fair: Introduce an energy estimation helper function
In preparation for the definition of an energy-aware wakeup path,
introduce a helper function to estimate the consequence on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all frequency domains and
estimates the consumption of each online CPU according to its Energy
Model and its percentage of busy time.
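A standalone sketch of the estimation idea for one frequency domain (the
cost model and all names below are illustrative assumptions, not the
helper's real signature):

	/* Each cpu contributes the capacity state's power scaled by its
	 * share of busy time (util / capacity at that state). */
	static unsigned long fd_energy_sketch(const unsigned long *util, int nr_cpus,
					      unsigned long cs_cap, unsigned long cs_power)
	{
		unsigned long energy = 0;
		int cpu;

		for (cpu = 0; cpu < nr_cpus; cpu++)
			energy += cs_power * util[cpu] / cs_cap;

		return energy;
	}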
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Wed, 18 Jul 2018 13:32:48 +0000 (14:32 +0100)]
sched/cpufreq: Refactor the utilization aggregation method
Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
to decide which frequency to request. Energy Aware Scheduling (EAS)
needs to be able to predict those requests to assess the energy impact
of scheduling decisions. However, the PELT signals aggregation is only
done in schedutil for now, hence making it hard to synchronize it with
EAS.
To address this issue, introduce schedutil_freq_util() to perform the
aforementioned aggregation and make it available to other parts of the
scheduler. Since frequency selection and energy estimation still need
to deal with RT and DL signals slightly differently, schedutil_freq_util()
is called with a different 'type' parameter in those two contexts, and
returns an aggregated utilization signal accordingly.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Morten Rasmussen [Sat, 9 May 2015 15:49:57 +0000 (16:49 +0100)]
sched: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority-scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.
The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:
sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.
For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we have 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.
To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:
cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
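As a sketch, with the margin written as an assumed ~20% headroom (the text
above only says "margin"):

	/* A cpu is over-utilized once util exceeds ~80% of its capacity. */
	static inline int cpu_overutilized_sketch(unsigned long util,
						  unsigned long capacity)
	{
		return capacity * 1024 < util * 1280;
	}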
With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.
For systems where some cpus might have reduced capacity (RT pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as just
a single cpu is fully utilized, as it might be one of those with reduced
capacity, in which case we want to migrate the task.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Wed, 18 Jul 2018 11:09:49 +0000 (12:09 +0100)]
sched/fair: Clean-up update_sg_lb_stats parameters
In preparation for the introduction of a new root domain flag which can
be set during load balance (the 'overutilized' flag), clean up the set
of parameters passed to update_sg_lb_stats(). More specifically, the
'local_group' and 'local_idx' parameters can be removed since they can
easily be reconstructed from within the function.
While at it, transform the 'overload' parameter into a flag stored in
the 'sg_status' parameter and change the root_domain's overload field to
an 'int', since sizeof(_Bool) is implementation defined.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Thu, 21 Jun 2018 14:34:57 +0000 (15:34 +0100)]
sched/topology: Introduce sched_energy_present static key
In order to ensure a minimal performance impact on non-energy-aware
systems, introduce a static_key guarding the access to Energy-Aware
Scheduling (EAS) code.
The static key is set iff all the following conditions are met for at
least one root domain:
1. all online CPUs of the root domain are covered by the Energy
Model (EM);
2. the complexity of the root domain's EM is low enough to keep
scheduling overheads low;
3. the root domain has an asymmetric CPU capacity topology (detected
by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
hierarchy).
The static key is checked in the rd_freq_domain() function, which returns
the frequency domains of a root domain when they are available. As EAS
cannot be enabled with CONFIG_ENERGY_MODEL=n, rd_freq_domain() is
stubbed to return NULL so the compiler can remove the unused EAS code by
constant propagation.
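A sketch of what such a stub arrangement typically looks like (the exact
definitions are assumptions; only the function name and the NULL return come
from the text above):

	#ifdef CONFIG_ENERGY_MODEL
	static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
	{
		if (!static_branch_unlikely(&sched_energy_present))
			return NULL;
		return rd->fd;
	}
	#else
	#define rd_freq_domain(rd)	NULL
	#endif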
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Fri, 18 May 2018 09:16:33 +0000 (10:16 +0100)]
sched/topology: Lowest energy aware balancing sched_domain level pointer
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the lowest level at which energy
aware scheduling should be used.
Generally speaking, the largest opportunity to save energy via scheduling
comes from a smarter exploitation of heterogeneous platforms (i.e.
big.LITTLE). Consequently, the sd_ea shortcut is wired to the lowest
scheduling domain at which the SD_ASYM_CPUCAPACITY flag is set. For
example, it is possible to apply Energy-Aware Scheduling within a socket
on a multi-socket system, as long as each socket has an asymmetric
topology. Cross-socket wake-up balancing will only happen when the
system is over-utilized, or this_cpu and prev_cpu are in different
sockets.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Mon, 30 Apr 2018 10:23:39 +0000 (11:23 +0100)]
sched/topology: Reference the Energy Model of CPUs when available
The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Frequency Domains (FD), which do not always align with
caches.
To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the FDs, alongside
the scheduling domains. More specifically, a linked list of FDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the FDs covering the CPUs of its root domain. If
a FD spans CPUs of two different root domains, it will be duplicated in
both lists.
The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the lists are protected by RCU to
ensure safe concurrent updates.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Thu, 17 May 2018 14:50:07 +0000 (15:50 +0100)]
PM / EM: Expose the Energy Model in sysfs
Expose the Energy Model (read-only) of all frequency domains in sysfs
for convenience. To do so, add a kobject to the CPU subsystem under the
umbrella of which a kobject for each frequency domain is attached.
The resulting hierarchy is as follows for a platform with two frequency
domains for example:
/sys/devices/system/cpu/energy_model
├── fd0
│ ├── cost
│ ├── cpus
│ ├── frequency
│ └── power
└── fd4
├── cost
├── cpus
├── frequency
└── power
In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release frequency domains in a more
dynamic way using reference counting.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Wed, 25 Apr 2018 17:34:11 +0000 (18:34 +0100)]
PM: Introduce an Energy Model management framework
Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.
As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each frequency domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but applicable to any architecture):
+---------------+  +-----------------+  +-------------+
| Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
+---------------+  +-----------------+  +-------------+
        |                   | em_fd_energy()   |
        |                   | em_cpu_get()     |
        +-----------+       |       +----------+
                    |       |       |
                    v       v       v
                 +---------------------+
                 |                     |
                 |    Energy Model     |
                 |                     |
                 |      Framework      |
                 |                     |
                 +---------------------+
                    ^       ^       ^
                    |       |       | em_register_freq_domain()
        +-----------+       |       +---------+
        |                   |                 |
+---------------+   +---------------+  +--------------+
|  cpufreq-dt   |   |   arm_scmi    |  |    Other     |
+---------------+   +---------------+  +--------------+
        ^                   ^                 ^
        |                   |                 |
+--------------+    +---------------+  +--------------+
| Device Tree  |    |   Firmware    |  |      ?       |
+--------------+    +---------------+  +--------------+
Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_freq_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the frequency domain. This design should offer a lot of
flexibility to calling drivers, which are free to read information from
any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.
On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.
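A sketch of how a cpufreq driver might hook in (the callback layout and the
argument list are assumptions; only the em_register_freq_domain(),
em_cpu_get() and em_fd_energy() names come from the text above):

	/* Fill *power (e.g. mW) and *freq (e.g. kHz) for the next capacity state. */
	static int my_active_power(unsigned long *power, unsigned long *freq, int cpu)
	{
		/* read DT, firmware tables, or use any other technique */
		return 0;
	}

	static struct em_data_callback my_em_cb = {
		.active_power = my_active_power,
	};

	/* In the driver init path: */
	em_register_freq_domain(policy->cpus, nr_states, &my_em_cb);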
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Fri, 27 Apr 2018 14:05:47 +0000 (15:05 +0100)]
sched/cpufreq: Factor out utilization to frequency mapping
The schedutil governor maps utilization values to frequencies by applying
a 25% margin. Since this sort of mapping mechanism can be needed by other
users (i.e. EAS), factor the utilization-to-frequency mapping code out
of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
duplication. The new map_util_freq() function is inlined to avoid
overheads.
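The mapping itself can be sketched as below (the body mirrors schedutil's
existing 25% margin; treat the exact signature as an assumption):

	static inline unsigned long map_util_freq(unsigned long util,
						  unsigned long freq,
						  unsigned long cap)
	{
		return (freq + (freq >> 2)) * util / cap;	/* freq * 1.25 * util / cap */
	}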
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Quentin Perret [Wed, 25 Apr 2018 13:12:58 +0000 (14:12 +0100)]
sched: Relocate arch_scale_cpu_capacity
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Russell King [Fri, 20 Jul 2018 09:54:59 +0000 (10:54 +0100)]
ARM: per-CPU processor vtables
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Russell King [Fri, 20 Jul 2018 09:54:59 +0000 (10:54 +0100)]
ARM: split out processor lookup
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Russell King [Fri, 20 Jul 2018 09:46:52 +0000 (10:46 +0100)]
ARM: make lookup_processor_type() non-__init
Move lookup_processor_type() out of the __init section.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Russell King [Fri, 20 Jul 2018 09:46:52 +0000 (10:46 +0100)]
ARM: more fixups
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Russell King [Fri, 20 Jul 2018 09:46:52 +0000 (10:46 +0100)]
ARM: check_bugs
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Russell King [Fri, 20 Jul 2018 09:46:52 +0000 (10:46 +0100)]
ARM: add PROC_VTABLE macros
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Sudeep Holla [Mon, 18 Jun 2018 15:54:32 +0000 (16:54 +0100)]
power: vexpress: fix corruption in notifier registration
Vexpress platforms provide two different restart handlers: SYS_REBOOT
restarts the entire system, while DB_RESET only restarts the daughter
board containing the CPU. DB_RESET is overridden by SYS_REBOOT if it
exists.
notifier_chain_register(), used in register_restart_handler(), by design
relies on each notifier being registered only once; however, the vexpress
restart notifier can get registered twice. When this happens it corrupts
the list of notifiers; as a result some notifiers may not be called on
the proper event, traversing the list can loop forever, and a second
unregister can access already freed memory.
So far, since this was the only restart handler in the system, no issue
was observed even if the same notifier was registered twice. However,
commit 6c5c0d48b686 ("watchdog: sp805: add restart handler") added
support for SP805 restart handlers, and since the system under test
contains two vexpress restart and two SP805 watchdog instances, it was
observed that during boot traversing the restart handler list looped
forever because of a cycle in that list, resulting in a boot hang.
This patch fixes the issue by ensuring that the notifier is installed
only once.
Cc: Sebastian Reichel <sre@kernel.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Fixes: 46c99ac66222 ("power/reset: vexpress: Register with kernel restart handler")
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk>
Li Wei [Fri, 25 May 2018 09:17:12 +0000 (17:17 +0800)]
arm64: defconfig: enable f2fs and squashfs
Partitions in HiKey960 are formatted as f2fs and squashfs.
f2fs is for userdata; squashfs is for system. Both partitions are required
by Android.
Signed-off-by: Li Wei <liwei213@huawei.com>
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
[Taken from https://lkml.org/lkml/2018/5/25/142]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Li Wei [Fri, 25 May 2018 09:17:11 +0000 (17:17 +0800)]
arm64: defconfig: enable configs for Hisilicon ufs
This enables configs for the Hisilicon Hixxxx UFS driver.
Signed-off-by: Li Wei <liwei213@huawei.com>
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
[Taken from https://lkml.org/lkml/2018/5/25/143]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Leo Yan [Wed, 28 Jun 2017 06:56:09 +0000 (14:56 +0800)]
arm64: defconfig: add configs for power management on Hi3660
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Fan Ning [Tue, 6 Jun 2017 09:19:54 +0000 (17:19 +0800)]
arm64: defconfig: enable usb for hikey960
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Squashed with:
FIX: arm64: defconfig: update for USB
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
[Squashing]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Valentin Schneider [Fri, 15 Dec 2017 16:59:07 +0000 (16:59 +0000)]
usb: dwc3: Remove unused function
dwc3_gadget_conndone_interrupt was added in e8621f602a6d
("usb: add dwc3 driver for hikey960"), but it's unused.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Ryan Grachek [Wed, 13 Jun 2018 22:48:20 +0000 (15:48 -0700)]
arm64: dts: hikey960: Enable SDIO high-speed mode on WL1837 module
According to hardware documentation, the WL1837MOD WiFi chip
supports SDIO high-speed mode, which increases the clock to
50 MHz. This results in a modest increase in throughput.
Signed-off-by: Ryan Grachek <ryan@edited.us>
[jstultz: Ported from hikey/wl1835 to hikey960/wl1837]
Signed-off-by: John Stultz <john.stultz@linaro.org>
John Stultz [Fri, 18 May 2018 17:08:56 +0000 (10:08 -0700)]
WIP usb fixups
Leo Yan [Thu, 31 May 2018 01:04:50 +0000 (09:04 +0800)]
HACK: mailbox: Hi3660: Fixup for mailbox state machine
Currently we only check the mailbox state machine at the beginning, when
the channel is requested, and we do not run the state machine code when
sending messages. This causes the state machine to malfunction and the
messages cannot actually be handled by the MCU side.
This patch temporarily moves the state machine operations into the send
message flow.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Greg Hackmann [Thu, 9 Nov 2017 19:05:48 +0000 (11:05 -0800)]
drivers: richtek: fix write-out-of-bounds in rt1711_init_alert()
KASAN warns about a write-out-of-bounds in rt1711_init_alert():
len = strlen(chip->tcpc_desc->name);
name = kzalloc(sizeof(len + 5), GFP_KERNEL); <- allocated here
sprintf(name, "%s-IRQ", chip->tcpc_desc->name); <- written here
The stray sizeof() operator means it's allocating 4 bytes rather than
the intended strlen(...) + 5 bytes.
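A minimal fix sketch (the error handling shown here is an assumption):

	len = strlen(chip->tcpc_desc->name);
	name = kzalloc(len + 5, GFP_KERNEL);	/* "-IRQ" plus the NUL terminator */
	if (!name)
		return -ENOMEM;
	sprintf(name, "%s-IRQ", chip->tcpc_desc->name);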
Change-Id: Iaecc36682754948c9fa983ab9a88486690a1358d
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Greg Hackmann [Thu, 9 Nov 2017 18:57:46 +0000 (10:57 -0800)]
drivers: richtek: fix write-out-of-bounds in rt_regmap_cache_init()
KASAN warns about a write-out-of-bounds in rt_regmap_cache_init():
if (!rd->props.group) {
	rd->props.group = devm_kzalloc(&rd->dev,
		sizeof(rd->props.group), GFP_KERNEL);   <- allocated here
	rd->props.group[0].start = 0x00;
	rd->props.group[0].end = 0xffff;
	rd->props.group[0].mode = RT_1BYTE_MODE;        <- written here
}
The devm_kzalloc() call a few lines above is accidentally requesting
enough space to store a pointer type, which isn't enough space to hold
the struct itself.
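A minimal fix sketch: allocate the struct itself rather than the size of a
pointer:

	rd->props.group = devm_kzalloc(&rd->dev,
				       sizeof(*rd->props.group), GFP_KERNEL);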
Change-Id: I0036262b3129bd86d2e8612fb9b67a848bbb4ead
Signed-off-by: Greg Hackmann <ghackmann@google.com>
John Stultz [Wed, 15 Nov 2017 18:38:46 +0000 (10:38 -0800)]
usb: dwc3: Get gadget mode working again
Adapted from the 4.15-rc changes that broke gadget mode.
Should be squashed down with an earlier patch.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Fan Ning [Tue, 1 Aug 2017 08:33:13 +0000 (16:33 +0800)]
usb: support usb gadget for hikey960
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Fan Ning [Tue, 6 Jun 2017 09:19:05 +0000 (17:19 +0800)]
usb: add pd support for hikey960
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Fan Ning [Tue, 6 Jun 2017 09:16:11 +0000 (17:16 +0800)]
usb: add dwc3 driver for hikey960
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Fan Ning [Tue, 6 Jun 2017 08:59:32 +0000 (16:59 +0800)]
usb: add hub_usb5734 driver for hikey960
Signed-off-by: Fan Ning <fanning4@hisilicon.com>
Tao Wang [Thu, 22 Jun 2017 03:42:02 +0000 (11:42 +0800)]
thermal: hisilicon: add thermal sensor driver for Hi3660
This patch adds support for the thermal sensor of the Hi3660 SoC.
It registers sensors with the thermal framework and uses the device
tree to bind cooling devices.
Signed-off-by: Tao Wang <kevin.wangtao@hisilicon.com>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Daniel Lezcano [Thu, 1 Jun 2017 09:28:49 +0000 (11:28 +0200)]
cpuidle: Fix NULL driver checking
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
John Stultz [Thu, 17 May 2018 00:12:43 +0000 (17:12 -0700)]
dts: hikey960: Fix bootwarning on mapping reboot reason syscon
Signed-off-by: John Stultz <john.stultz@linaro.org>
Chen Jun [Mon, 6 Feb 2017 02:51:03 +0000 (10:51 +0800)]
arm64: dts: hi3660: adb reboot node
Add "hisilicon,hi3660-reboot" node for hi3660.
Eventually when we've transitioned to UEFI this can be dropped.
As we can then use syscon-reboot-mode.
Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
Signed-off-by: Chen Jun <chenjun14@huawei.com>