=========================
Capacity Aware Scheduling
=========================

1. CPU capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per second. Putting it all together, CPU capacity can be expressed as::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
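
To make the formula concrete, here is a small Python sketch (illustrative
only: the per-CPU figures are made up, and only the 1024 normalization scale
mirrors the kernel's SCHED_CAPACITY_SCALE):

```python
# Illustrative sketch of capacity(cpu) = work_per_hz(cpu) * max_freq(cpu),
# normalized against the most performant CPU. The 1024 scale mirrors the
# kernel's SCHED_CAPACITY_SCALE; the CPU numbers below are made up.
SCALE = 1024

def raw_capacity(work_per_hz, max_freq):
    return work_per_hz * max_freq

def normalized_capacities(cpus):
    raw = {name: raw_capacity(w, f) for name, (w, f) in cpus.items()}
    best = max(raw.values())
    return {name: r * SCALE // best for name, r in raw.items()}

# Hypothetical big.LITTLE-like system: the big CPU retires twice the work
# per cycle and reaches a 50% higher maximum frequency.
cpus = {
    "big":    (2, 3_000_000_000),  # (work_per_hz, max_freq in Hz)
    "LITTLE": (1, 2_000_000_000),
}
caps = normalized_capacities(cpus)  # {'big': 1024, 'LITTLE': 341}
```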

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regards to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on this.
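
As a toy illustration of the duty-cycle approximation (hypothetical numbers,
not kernel code):

```python
# Task utilization approximated as a duty cycle: the fraction of each
# period the task spends executing. All numbers are illustrative.
def duty_cycle(busy_time, period):
    return busy_time / period

busy_loop_util = duty_cycle(10, 10)  # 100% utilization: a busy loop
small_task_util = duty_cycle(1, 10)  # 10% utilization: mostly sleeping
```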

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.
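
The two traces above can be checked numerically with a small sketch of the
formula (illustrative, not kernel code):

```python
# Frequency-invariant task utilization: scale the observed duty cycle by
# the ratio of current to maximum frequency. The same workload then maps
# to the same value regardless of the OPP it happened to run at.
def task_util_freq_inv(duty_cycle, curr_freq, max_freq):
    return duty_cycle * curr_freq / max_freq

at_f = task_util_freq_inv(0.25, 1.0, 1.0)       # 25% duty cycle at F
at_half_f = task_util_freq_inv(0.50, 0.5, 1.0)  # 50% duty cycle at F/2
# both evaluate to 0.25
```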

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.::

  - capacity(CPU0) = C
  - capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

In other words:

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the above example yields a CPU invariant task
utilization of 25%.
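
The 25%/75% duty cycles above can likewise be checked with a sketch of the
formula (illustrative values, with C arbitrarily set to 1024):

```python
# CPU-invariant task utilization: scale the observed duty cycle by the
# ratio of the CPU's capacity to the highest capacity in the system.
def task_util_cpu_inv(duty_cycle, cpu_capacity, max_capacity):
    return duty_cycle * cpu_capacity / max_capacity

C = 1024
on_cpu0 = task_util_cpu_inv(0.25, C, C)      # 25% duty cycle on CPU0
on_cpu1 = task_util_cpu_inv(0.75, C / 3, C)  # 75% duty cycle on CPU1
# both describe the task as 25% of the most performant CPU
```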

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.
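
Putting both scale factors together, the pseudo-formula can be sketched as
follows (illustrative values only):

```python
# Fully invariant task utilization: apply both the frequency and the CPU
# invariance ratios to the observed duty cycle.
def task_util_inv(duty_cycle, curr_freq, max_freq, cpu_capacity, max_capacity):
    return duty_cycle * (curr_freq / max_freq) * (cpu_capacity / max_capacity)

C, F = 1024, 1.0
a = task_util_inv(0.25, F, F, C, C)      # on CPU0 at max frequency
b = task_util_inv(0.50, F / 2, F, C, C)  # on CPU0 at half frequency
c = task_util_inv(0.75, F, F, C / 3, C)  # on CPU1 at its max frequency
# a, b and c all describe the same task: ~25% of the best CPU at max frequency
```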

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm and arm64 architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/arm/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out the frequency each CPU has
been running at. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
  level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
  CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
             cs0        cs1

Which could be created via::

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy; it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need more work done per
unit of time than its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.
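
The clamped criterion can be sketched as follows (purely illustrative and not
the actual CFS selection logic; the CPU names and capacity values are made
up, on the usual 1024 scale):

```python
# Wakeup CPU selection sketch: pick the smallest-capacity CPU whose
# capacity exceeds the task's clamped utilization.
def clamp(value, lo, hi):
    return max(lo, min(value, hi))

def first_fitting_cpu(task_util, uclamp_min, uclamp_max, cpu_capacities):
    for cpu, cap in sorted(cpu_capacities.items(), key=lambda kv: kv[1]):
        if clamp(task_util, uclamp_min, uclamp_max) < cap:
            return cpu
    return None  # no CPU satisfies the criterion

caps = {"LITTLE": 341, "big": 1024}  # hypothetical capacities

# A busy loop (util == 1024) capped low may run on a LITTLE CPU:
low_capped = first_fitting_cpu(1024, 0, 300, caps)  # "LITTLE"
# A 10% task (util == 102) boosted high is steered to the big CPU:
boosted = first_fitting_cpu(102, 512, 1024, caps)   # "big"
```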

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Misfit task upmigration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

 CPU0 work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any
further wakeup events to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.
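
The misfit condition itself boils down to the fitness criterion of 5.1.1 being
violated while a bigger CPU exists; a minimal sketch (not the kernel's actual
load balancer logic, and with made-up numbers):

```python
# A task is a misfit on its CPU when its utilization exceeds the CPU's
# capacity while a higher-capacity CPU exists to migrate it to.
def is_misfit(task_util, cpu_capacity, max_capacity):
    return task_util > cpu_capacity and cpu_capacity < max_capacity

C = 1024
# The always-running task of the trace above, stuck on CPU1 (capacity C/3):
stuck_on_little = is_misfit(900, C // 3, C)  # True: upmigration candidate
on_biggest = is_misfit(900, C, C)            # False: nothing bigger exists
```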

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(cpu)

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
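
As an illustration, the bandwidth check can be sketched numerically (this is
an assumption of how the inequality would be evaluated, with capacities on
the usual 1024 scale; it is not the kernel's actual admission logic):

```python
# A DL task's bandwidth is its runtime / period reservation; compare it
# against the candidate CPU's capacity, normalized to the biggest CPU.
def task_bandwidth(runtime, period):
    return runtime / period

def dl_cpu_fits(runtime, period, cpu_capacity, max_capacity=1024):
    return task_bandwidth(runtime, period) < cpu_capacity / max_capacity

# A task reserving 5ms of runtime every 10ms period:
fits_big = dl_cpu_fits(5, 10, 1024)    # 0.5 < 1.0  -> fits
fits_little = dl_cpu_fits(5, 10, 341)  # 0.5 < 0.33 -> does not fit
```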