=========================
Capacity Aware Scheduling
=========================

1. CPU capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per second. Putting it all together, CPU capacity can be expressed as::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
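
To make the formula concrete, here is a small Python sketch (illustrative
only: the per-CPU figures are made up, and only the 1024 normalization scale
mirrors the kernel's SCHED_CAPACITY_SCALE):

```python
# Illustrative sketch of capacity(cpu) = work_per_hz(cpu) * max_freq(cpu),
# normalized against the most performant CPU. The 1024 scale mirrors the
# kernel's SCHED_CAPACITY_SCALE; the CPU numbers below are made up.
SCALE = 1024

def raw_capacity(work_per_hz, max_freq):
    return work_per_hz * max_freq

def normalized_capacities(cpus):
    raw = {name: raw_capacity(w, f) for name, (w, f) in cpus.items()}
    best = max(raw.values())
    return {name: r * SCALE // best for name, r in raw.items()}

# Hypothetical big.LITTLE-like system: the big CPU retires twice the work
# per cycle and reaches a 50% higher maximum frequency.
cpus = {
    "big":    (2, 3_000_000_000),  # (work_per_hz, max_freq in Hz)
    "LITTLE": (1, 2_000_000_000),
}
caps = normalized_capacities(cpus)  # {'big': 1024, 'LITTLE': 341}
```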

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regards to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on this.
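
As a toy illustration of the duty-cycle approximation (hypothetical numbers,
not kernel code):

```python
# Task utilization approximated as a duty cycle: the fraction of each
# period the task spends executing. All numbers are illustrative.
def duty_cycle(busy_time, period):
    return busy_time / period

busy_loop_util = duty_cycle(10, 10)  # 100% utilization: a busy loop
small_task_util = duty_cycle(1, 10)  # 10% utilization: mostly sleeping
```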

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.
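
The two traces above can be checked numerically with a small sketch of the
formula (illustrative, not kernel code):

```python
# Frequency-invariant task utilization: scale the observed duty cycle by
# the ratio of current to maximum frequency. The same workload then maps
# to the same value regardless of the OPP it happened to run at.
def task_util_freq_inv(duty_cycle, curr_freq, max_freq):
    return duty_cycle * curr_freq / max_freq

at_f = task_util_freq_inv(0.25, 1.0, 1.0)       # 25% duty cycle at F
at_half_f = task_util_freq_inv(0.50, 0.5, 1.0)  # 50% duty cycle at F/2
# both evaluate to 0.25
```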

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.::

  - capacity(CPU0) = C
  - capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

In other words:

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the above example yields a CPU invariant task
utilization of 25%.
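
The 25%/75% duty cycles above can likewise be checked with a sketch of the
formula (illustrative values, with C arbitrarily set to 1024):

```python
# CPU-invariant task utilization: scale the observed duty cycle by the
# ratio of the CPU's capacity to the highest capacity in the system.
def task_util_cpu_inv(duty_cycle, cpu_capacity, max_capacity):
    return duty_cycle * cpu_capacity / max_capacity

C = 1024
on_cpu0 = task_util_cpu_inv(0.25, C, C)      # 25% duty cycle on CPU0
on_cpu1 = task_util_cpu_inv(0.75, C / 3, C)  # 75% duty cycle on CPU1
# both describe the task as 25% of the most performant CPU
```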

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.
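
Putting both scale factors together, the pseudo-formula can be sketched as
follows (illustrative values only):

```python
# Fully invariant task utilization: apply both the frequency and the CPU
# invariance ratios to the observed duty cycle.
def task_util_inv(duty_cycle, curr_freq, max_freq, cpu_capacity, max_capacity):
    return duty_cycle * (curr_freq / max_freq) * (cpu_capacity / max_capacity)

C, F = 1024, 1.0
a = task_util_inv(0.25, F, F, C, C)      # on CPU0 at max frequency
b = task_util_inv(0.50, F / 2, F, C, C)  # on CPU0 at half frequency
c = task_util_inv(0.75, F, F, C / 3, C)  # on CPU1 at its max frequency
# a, b and c all describe the same task: ~25% of the best CPU at max frequency
```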

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm and arm64 architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/arm/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out the frequency each CPU has
been running at. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
  level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
  CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
             cs0        cs1

Which could be created via::

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy; it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need more work done per
unit of time than its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.
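
The clamped criterion can be sketched as follows (purely illustrative and not
the actual CFS selection logic; the CPU names and capacity values are made
up, on the usual 1024 scale):

```python
# Wakeup CPU selection sketch: pick the smallest-capacity CPU whose
# capacity exceeds the task's clamped utilization.
def clamp(value, lo, hi):
    return max(lo, min(value, hi))

def first_fitting_cpu(task_util, uclamp_min, uclamp_max, cpu_capacities):
    for cpu, cap in sorted(cpu_capacities.items(), key=lambda kv: kv[1]):
        if clamp(task_util, uclamp_min, uclamp_max) < cap:
            return cpu
    return None  # no CPU satisfies the criterion

caps = {"LITTLE": 341, "big": 1024}  # hypothetical capacities

# A busy loop (util == 1024) capped low may run on a LITTLE CPU:
low_capped = first_fitting_cpu(1024, 0, 300, caps)  # "LITTLE"
# A 10% task (util == 102) boosted high is steered to the big CPU:
boosted = first_fitting_cpu(102, 512, 1024, caps)   # "big"
```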

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Misfit task upmigration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

 CPU0 work ^
           |     _________          _________          ____
           |    |         |        |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any
further wakeup events to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.
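
The misfit condition itself boils down to the fitness criterion of 5.1.1 being
violated while a bigger CPU exists; a minimal sketch (not the kernel's actual
load balancer logic, and with made-up numbers):

```python
# A task is a misfit on its CPU when its utilization exceeds the CPU's
# capacity while a higher-capacity CPU exists to migrate it to.
def is_misfit(task_util, cpu_capacity, max_capacity):
    return task_util > cpu_capacity and cpu_capacity < max_capacity

C = 1024
# The always-running task of the trace above, stuck on CPU1 (capacity C/3):
stuck_on_little = is_misfit(900, C // 3, C)  # True: upmigration candidate
on_biggest = is_misfit(900, C, C)            # False: nothing bigger exists
```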

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(cpu)

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
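
As an illustration, the bandwidth check can be sketched numerically (this is
an assumption of how the inequality would be evaluated, with capacities on
the usual 1024 scale; it is not the kernel's actual admission logic):

```python
# A DL task's bandwidth is its runtime / period reservation; compare it
# against the candidate CPU's capacity, normalized to the biggest CPU.
def task_bandwidth(runtime, period):
    return runtime / period

def dl_cpu_fits(runtime, period, cpu_capacity, max_capacity=1024):
    return task_bandwidth(runtime, period) < cpu_capacity / max_capacity

# A task reserving 5ms of runtime every 10ms period:
fits_big = dl_cpu_fits(5, 10, 1024)    # 0.5 < 1.0  -> fits
fits_little = dl_cpu_fits(5, 10, 341)  # 0.5 < 0.33 -> does not fit
```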