Documentation/scheduler/schedutil.txt

   1
   2
   3 NOTE; all this assumes a linear relation between frequency and work capacity,
   4 we know this is flawed, but it is the best workable approximation.
   5
   6
   7 PELT (Per Entity Load Tracking)
   8 -------------------------------
   9
  10 With PELT we track some metrics across the various scheduler entities, from
  11 individual tasks to task-group slices to CPU runqueues. As the basis for this
  12 we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
  13 is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
  14 half, while the rest of history contribute the other half.
  15
  16 Specifically:
  17
  18   ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
  19
  20   ewma(u) = ewma_sum(u) / ewma_sum(1)
  21
  22 Since this is essentially a progression of an infinite geometric series, the
  23 results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
  24 is key, since it gives the ability to recompose the averages when tasks move
  25 around.
  26
  27 Note that blocked tasks still contribute to the aggregates (task-group slices
  28 and CPU runqueues), which reflects their expected contribution when they
  29 resume running.
  30
  31 Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
  32 reflects the time an entity spends on the CPU, while 'runnable' reflects the
  33 time an entity spends on the runqueue. When there is only a single task these
  34 two metrics are the same, but once there is contention for the CPU 'running'
  35 will decrease to reflect the fraction of time each task spends on the CPU
  36 while 'runnable' will increase to reflect the amount of contention.
  37
  38 For more detail see: kernel/sched/pelt.c
  39
  40
  41 Frequency- / CPU Invariance
  42 ---------------------------
  43
  44 Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
  45 for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
  46 a big CPU, we allow architectures to scale the time delta with two ratios, one
  47 Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
  48
  49 For simple DVFS architectures (where software is in full control) we trivially
  50 compute the ratio as:
  51
  52             f_cur
  53   r_dvfs := -----
  54             f_max
  55
  56 For more dynamic systems where the hardware is in control of DVFS we use
  57 hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
  58 For Intel specifically, we use:
  59
  60            APERF
  61   f_cur := ----- * P0
  62            MPERF
  63
  64              4C-turbo;  if available and turbo enabled
  65   f_max := { 1C-turbo;  if turbo enabled
  66              P0;        otherwise
  67
  68                     f_cur
  69   r_dvfs := min( 1, ----- )
  70                     f_max
  71
  72 We pick 4C turbo over 1C turbo to make it slightly more sustainable.
  73
  74 r_cpu is determined as the ratio of highest performance level of the current
  75 CPU vs the highest performance level of any other CPU in the system.
  76
  77   r_tot = r_dvfs * r_cpu
  78
  79 The result is that the above 'running' and 'runnable' metrics become invariant
  80 of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
  81
  82 For more detail see:
  83
  84  - kernel/sched/pelt.h:update_rq_clock_pelt()
  85  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
  86  - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
  87
  88
  89 UTIL_EST / UTIL_EST_FASTUP
  90 --------------------------
  91
  92 Because periodic tasks have their averages decayed while they sleep, even
  93 though when running their expected utilization will be the same, they suffer a
  94 (DVFS) ramp-up after they are running again.
  95
  96 To alleviate this (a default enabled option) UTIL_EST drives an Infinite
  97 Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
  98 highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
  99 filter to instantly increase and only decay on decrease.
 100
 101 A further runqueue wide sum (of runnable tasks) is maintained of:
 102
 103   util_est := \Sum_t max( t_running, t_util_est_ewma )
 104
 105 For more detail see: kernel/sched/fair.c:util_est_dequeue()
 106
 107
 108 UCLAMP
 109 ------
 110
 111 It is possible to set effective u_min and u_max clamps on each CFS or RT task;
 112 the runqueue keeps an max aggregate of these clamps for all running tasks.
 113
 114 For more detail see: include/uapi/linux/sched/types.h
 115
 116
 117 Schedutil / DVFS
 118 ----------------
 119
 120 Every time the scheduler load tracking is updated (task wakeup, task
 121 migration, time progression) we call out to schedutil to update the hardware
 122 DVFS state.
 123
 124 The basis is the CPU runqueue's 'running' metric, which per the above it is
 125 the frequency invariant utilization estimate of the CPU. From this we compute
 126 a desired frequency like:
 127
 128              max( running, util_est );  if UTIL_EST
 129   u_cfs := { running;                   otherwise
 130
 131                clamp( u_cfs + u_rt , u_min, u_max );    if UCLAMP_TASK
 132   u_clamp := { u_cfs + u_rt;                            otherwise
 133
 134   u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]
 135
 136   f_des := min( f_max, 1.25 u * f_max )
 137
 138 XXX IO-wait; when the update is due to a task wakeup from IO-completion we
 139 boost 'u' above.
 140
 141 This frequency is then used to select a P-state/OPP or directly munged into a
 142 CPPC style request to the hardware.
 143
 144 XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
 145 required to satisfy the workload.
 146
 147 Because these callbacks are directly from the scheduler, the DVFS hardware
 148 interaction should be 'fast' and non-blocking. Schedutil supports
 149 rate-limiting DVFS requests for when hardware interaction is slow and
 150 expensive, this reduces effectiveness.
 151
 152 For more information see: kernel/sched/cpufreq_schedutil.c
 153
 154
 155 NOTES
 156 -----
 157
 158  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
 159    will closely reflect utilization.
 160
 161  - In saturated scenarios task movement will cause some transient dips,
 162    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
 163    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
 164    new CPU will gain 0.25. This is inevitable and time progression will
 165    correct this. XXX do we still guarantee f_max due to no idle-time?
 166
 167  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
 168    having to re-learn / ramp-up when load shifts.
 169