   1 =========================================================
   2 Cluster-wide Power-up/power-down race avoidance algorithm
   3 =========================================================
   4
   5 This file documents the algorithm which is used to coordinate CPU and
   6 cluster setup and teardown operations and to manage hardware coherency
   7 controls safely.
   8
   9 The section "Rationale" explains what the algorithm is for and why it is
  10 needed.  "Basic model" explains general concepts using a simplified view
  11 of the system.  The other sections explain the actual details of the
  12 algorithm in use.
  13
  14
  15 Rationale
  16 ---------
  17
  18 In a system containing multiple CPUs, it is desirable to have the
  19 ability to turn off individual CPUs when the system is idle, reducing
  20 power consumption and thermal dissipation.
  21
  22 In a system containing multiple clusters of CPUs, it is also desirable
  23 to have the ability to turn off entire clusters.
  24
  25 Turning entire clusters off and on is a risky business, because it
  26 involves performing potentially destructive operations affecting a group
  27 of independently running CPUs, while the OS continues to run.  This
  28 means that we need some coordination in order to ensure that critical
  29 cluster-level operations are only performed when it is truly safe to do
  30 so.
  31
  32 Simple locking may not be sufficient to solve this problem, because
  33 mechanisms like Linux spinlocks may rely on coherency mechanisms which
  34 are not immediately enabled when a cluster powers up.  Since enabling or
  35 disabling those mechanisms may itself be a non-atomic operation (such as
  36 writing some hardware registers and invalidating large caches), other
  37 methods of coordination are required in order to guarantee safe
  38 power-down and power-up at the cluster level.
  39
   40 This document presents a coherent memory based protocol for performing
   41 the needed coordination.  It aims to be as lightweight as possible,
   42 while providing the required safety properties.
  43
  44
  45 Basic model
  46 -----------
  47
  48 Each cluster and CPU is assigned a state, as follows:
  49
  50         - DOWN
  51         - COMING_UP
  52         - UP
  53         - GOING_DOWN
  54
  55 ::
  56
  57             +---------> UP ----------+
  58             |                        v
  59
  60         COMING_UP                GOING_DOWN
  61
  62             ^                        |
  63             +--------- DOWN <--------+
  64
  65
  66 DOWN:
  67         The CPU or cluster is not coherent, and is either powered off or
  68         suspended, or is ready to be powered off or suspended.
  69
  70 COMING_UP:
  71         The CPU or cluster has committed to moving to the UP state.
  72         It may be part way through the process of initialisation and
  73         enabling coherency.
  74
  75 UP:
  76         The CPU or cluster is active and coherent at the hardware
  77         level.  A CPU in this state is not necessarily being used
  78         actively by the kernel.
  79
  80 GOING_DOWN:
  81         The CPU or cluster has committed to moving to the DOWN
  82         state.  It may be part way through the process of teardown and
  83         coherency exit.
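
The cycle above can be sketched in C as follows.  This is only an
illustrative model: the enum and function names are assumptions made for
this example and are not taken from the kernel sources.

```c
#include <stdbool.h>

/* The four states of the basic model (illustrative names). */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Return the only state that may follow 's' in the basic model. */
enum basic_state basic_next(enum basic_state s)
{
	switch (s) {
	case DOWN:
		return COMING_UP;
	case COMING_UP:
		return UP;
	case UP:
		return GOING_DOWN;
	case GOING_DOWN:
		return DOWN;
	}
	return s;	/* not reached for valid input */
}

/*
 * Only UP guarantees hardware coherency.  COMING_UP and GOING_DOWN may be
 * part way through enabling or exiting coherency, and DOWN is not coherent.
 */
bool basic_is_coherent(enum basic_state s)
{
	return s == UP;
}
```

Because each state has exactly one successor, four applications of
basic_next() always return to the starting state.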
  84
  85
  86 Each CPU has one of these states assigned to it at any point in time.
  87 The CPU states are described in the "CPU state" section, below.
  88
  89 Each cluster is also assigned a state, but it is necessary to split the
  90 state value into two parts (the "cluster" state and "inbound" state) and
  91 to introduce additional states in order to avoid races between different
  92 CPUs in the cluster simultaneously modifying the state.  The cluster-
  93 level states are described in the "Cluster state" section.
  94
  95 To help distinguish the CPU states from cluster states in this
  96 discussion, the state names are given a `CPU_` prefix for the CPU states,
  97 and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.
  98
  99
 100 CPU state
 101 ---------
 102
 103 In this algorithm, each individual core in a multi-core processor is
  104 referred to as a "CPU".  CPUs are assumed to be single-threaded, so a
  105 CPU can be doing only one thing at any point in time.
 106
 107 This means that CPUs fit the basic model closely.
 108
 109 The algorithm defines the following states for each CPU in the system:
 110
 111         - CPU_DOWN
 112         - CPU_COMING_UP
 113         - CPU_UP
 114         - CPU_GOING_DOWN
 115
 116 ::
 117
 118          cluster setup and
 119         CPU setup complete          policy decision
 120               +-----------> CPU_UP ------------+
 121               |                                v
 122
 123         CPU_COMING_UP                   CPU_GOING_DOWN
 124
 125               ^                                |
 126               +----------- CPU_DOWN <----------+
 127          policy decision           CPU teardown complete
 128         or hardware event
 129
 130
 131 The definitions of the four states correspond closely to the states of
 132 the basic model.
 133
 134 Transitions between states occur as follows.
 135
  136 A trigger event of "(spontaneous)" means that the CPU can transition to the
 137 next state as a result of making local progress only, with no
 138 requirement for any external event to happen.
 139
 140
 141 CPU_DOWN:
 142         A CPU reaches the CPU_DOWN state when it is ready for
 143         power-down.  On reaching this state, the CPU will typically
 144         power itself down or suspend itself, via a WFI instruction or a
 145         firmware call.
 146
 147         Next state:
 148                 CPU_COMING_UP
 149         Conditions:
 150                 none
 151
 152         Trigger events:
 153                 a) an explicit hardware power-up operation, resulting
 154                    from a policy decision on another CPU;
 155
 156                 b) a hardware event, such as an interrupt.
 157
 158
 159 CPU_COMING_UP:
 160         A CPU cannot start participating in hardware coherency until the
 161         cluster is set up and coherent.  If the cluster is not ready,
 162         then the CPU will wait in the CPU_COMING_UP state until the
 163         cluster has been set up.
 164
 165         Next state:
 166                 CPU_UP
 167         Conditions:
 168                 The CPU's parent cluster must be in CLUSTER_UP.
 169         Trigger events:
 170                 Transition of the parent cluster to CLUSTER_UP.
 171
 172         Refer to the "Cluster state" section for a description of the
 173         CLUSTER_UP state.
 174
 175
 176 CPU_UP:
 177         When a CPU reaches the CPU_UP state, it is safe for the CPU to
 178         start participating in local coherency.
 179
 180         This is done by jumping to the kernel's CPU resume code.
 181
 182         Note that the definition of this state is slightly different
 183         from the basic model definition: CPU_UP does not mean that the
 184         CPU is coherent yet, but it does mean that it is safe to resume
 185         the kernel.  The kernel handles the rest of the resume
 186         procedure, so the remaining steps are not visible as part of the
 187         race avoidance algorithm.
 188
 189         The CPU remains in this state until an explicit policy decision
 190         is made to shut down or suspend the CPU.
 191
 192         Next state:
 193                 CPU_GOING_DOWN
 194         Conditions:
 195                 none
 196         Trigger events:
 197                 explicit policy decision
 198
 199
 200 CPU_GOING_DOWN:
 201         While in this state, the CPU exits coherency, including any
 202         operations required to achieve this (such as cleaning data
 203         caches).
 204
 205         Next state:
 206                 CPU_DOWN
 207         Conditions:
 208                 local CPU teardown complete
 209         Trigger events:
 210                 (spontaneous)
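
The CPU-level rules above can be summarised as a small step function.
This is a sketch under stated assumptions: the event names and the
cpu_step() helper are hypothetical, and the cluster condition is reduced
to a single boolean supplied by the caller.

```c
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Trigger events, named for this example only. */
enum cpu_event {
	EV_NONE,		/* nothing happened */
	EV_WAKE,		/* explicit power-up or a hardware event */
	EV_POLICY_DOWN,		/* policy decision to shut down or suspend */
	EV_TEARDOWN_DONE,	/* local CPU teardown complete */
};

/*
 * Return the state that follows 's'.  If the trigger or condition for
 * leaving 's' is not met, the CPU stays where it is.
 */
enum cpu_state cpu_step(enum cpu_state s, enum cpu_event ev,
			bool cluster_is_up)
{
	switch (s) {
	case CPU_DOWN:
		return ev == EV_WAKE ? CPU_COMING_UP : s;
	case CPU_COMING_UP:
		/* Condition: the parent cluster must be in CLUSTER_UP. */
		return cluster_is_up ? CPU_UP : s;
	case CPU_UP:
		return ev == EV_POLICY_DOWN ? CPU_GOING_DOWN : s;
	case CPU_GOING_DOWN:
		return ev == EV_TEARDOWN_DONE ? CPU_DOWN : s;
	}
	return s;	/* not reached for valid input */
}
```

Note that CPU_COMING_UP is the only state whose exit depends on something
outside the CPU itself, namely the state of the parent cluster.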
 211
 212
 213 Cluster state
 214 -------------
 215
 216 A cluster is a group of connected CPUs with some common resources.
 217 Because a cluster contains multiple CPUs, it can be doing multiple
 218 things at the same time.  This has some implications.  In particular, a
 219 CPU can start up while another CPU is tearing the cluster down.
 220
 221 In this discussion, the "outbound side" is the view of the cluster state
 222 as seen by a CPU tearing the cluster down.  The "inbound side" is the
  223 view of the cluster state as seen by a CPU setting the cluster up.
 224
 225 In order to enable safe coordination in such situations, it is important
 226 that a CPU which is setting up the cluster can advertise its state
 227 independently of the CPU which is tearing down the cluster.  For this
 228 reason, the cluster state is split into two parts:
 229
 230         "cluster" state: The global state of the cluster; or the state
 231         on the outbound side:
 232
 233                 - CLUSTER_DOWN
 234                 - CLUSTER_UP
 235                 - CLUSTER_GOING_DOWN
 236
 237         "inbound" state: The state of the cluster on the inbound side.
 238
 239                 - INBOUND_NOT_COMING_UP
 240                 - INBOUND_COMING_UP
 241
 242
  243         The different pairings of these states result in six possible
 244         states for the cluster as a whole::
 245
 246                                     CLUSTER_UP
 247                   +==========> INBOUND_NOT_COMING_UP -------------+
 248                   #                                               |
 249                                                                   |
 250              CLUSTER_UP     <----+                                |
 251           INBOUND_COMING_UP      |                                v
 252
 253                   ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
 254                   #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
 255
 256             CLUSTER_DOWN         |                                |
 257           INBOUND_COMING_UP <----+                                |
 258                                                                   |
 259                   ^                                               |
 260                   +===========     CLUSTER_DOWN      <------------+
 261                                INBOUND_NOT_COMING_UP
 262
 263         Transitions -----> can only be made by the outbound CPU, and
 264         only involve changes to the "cluster" state.
 265
 266         Transitions ===##> can only be made by the inbound CPU, and only
 267         involve changes to the "inbound" state, except where there is no
 268         further transition possible on the outbound side (i.e., the
 269         outbound CPU has put the cluster into the CLUSTER_DOWN state).
 270
 271         The race avoidance algorithm does not provide a way to determine
 272         which exact CPUs within the cluster play these roles.  This must
 273         be decided in advance by some other means.  Refer to the section
  274         "Last man and First man selection" for more explanation.
 275
 276
 277         CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
 278         cluster can actually be powered down.
 279
  280         The parallelism of the inbound and outbound CPUs is reflected in
 281         the existence of two different paths from CLUSTER_GOING_DOWN/
 282         INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
 283         model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
 284         COMING_UP in the basic model).  The second path avoids cluster
 285         teardown completely.
 286
 287         CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
 288         model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
 289         is trivial and merely resets the state machine ready for the
 290         next cycle.
 291
 292         Details of the allowable transitions follow.
 293
 294         The next state in each case is notated
 295
 296                 <cluster state>/<inbound state> (<transitioner>)
 297
 298         where the <transitioner> is the side on which the transition
 299         can occur; either the inbound or the outbound side.
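
The split of the cluster state into two independently written halves can
be sketched as below.  The struct layout, the enum side type and the
helper names are assumptions made for this example; they only restate the
ownership rules given above and are not the kernel's data structures.

```c
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { SIDE_INBOUND, SIDE_OUTBOUND };

/* The two halves of the cluster-level state. */
struct cluster_pair {
	enum cluster_state cluster;	/* normally written by the outbound CPU */
	enum inbound_state inbound;	/* written by the inbound CPU */
};

/*
 * May 'who' change the "cluster" half?  The outbound CPU owns it until it
 * has put the cluster into CLUSTER_DOWN; from then on only the inbound CPU
 * can make progress.  This check is necessary, not sufficient: the target
 * state must also be one of the allowed next states.
 */
bool may_write_cluster(enum side who, struct cluster_pair p)
{
	if (who == SIDE_OUTBOUND)
		return p.cluster != CLUSTER_DOWN;
	return p.cluster == CLUSTER_DOWN;
}

/* Only the inbound CPU ever changes the "inbound" half. */
bool may_write_inbound(enum side who)
{
	return who == SIDE_INBOUND;
}

/* The one combined state in which the cluster can really be powered down. */
bool may_power_off(struct cluster_pair p)
{
	return p.cluster == CLUSTER_DOWN &&
	       p.inbound == INBOUND_NOT_COMING_UP;
}
```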
 300
 301
 302 CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
 303         Next state:
 304                 CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
 305         Conditions:
 306                 none
 307
 308         Trigger events:
 309                 a) an explicit hardware power-up operation, resulting
 310                    from a policy decision on another CPU;
 311
 312                 b) a hardware event, such as an interrupt.
 313
 314
 315 CLUSTER_DOWN/INBOUND_COMING_UP:
 316
 317         In this state, an inbound CPU sets up the cluster, including
 318         enabling of hardware coherency at the cluster level and any
 319         other operations (such as cache invalidation) which are required
 320         in order to achieve this.
 321
 322         The purpose of this state is to do sufficient cluster-level
 323         setup to enable other CPUs in the cluster to enter coherency
 324         safely.
 325
 326         Next state:
 327                 CLUSTER_UP/INBOUND_COMING_UP (inbound)
 328         Conditions:
 329                 cluster-level setup and hardware coherency complete
 330         Trigger events:
 331                 (spontaneous)
 332
 333
 334 CLUSTER_UP/INBOUND_COMING_UP:
 335
 336         Cluster-level setup is complete and hardware coherency is
 337         enabled for the cluster.  Other CPUs in the cluster can safely
 338         enter coherency.
 339
 340         This is a transient state, leading immediately to
 341         CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
  342         should treat these two states as equivalent.
 343
 344         Next state:
 345                 CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
 346         Conditions:
 347                 none
 348         Trigger events:
 349                 (spontaneous)
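
Treating the two CLUSTER_UP states as equivalent amounts to ignoring the
"inbound" half when deciding whether a CPU may enter coherency.  A minimal
sketch, using a hypothetical helper name:

```c
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/*
 * A CPU waiting in CPU_COMING_UP only needs the "cluster" half of the
 * state: CLUSTER_UP/INBOUND_COMING_UP and CLUSTER_UP/INBOUND_NOT_COMING_UP
 * both mean that cluster-level coherency is enabled.
 */
bool cpu_may_enter_coherency(enum cluster_state cluster,
			     enum inbound_state inbound)
{
	(void)inbound;	/* deliberately ignored */
	return cluster == CLUSTER_UP;
}
```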
 350
 351
 352 CLUSTER_UP/INBOUND_NOT_COMING_UP:
 353
 354         Cluster-level setup is complete and hardware coherency is
 355         enabled for the cluster.  Other CPUs in the cluster can safely
 356         enter coherency.
 357
 358         The cluster will remain in this state until a policy decision is
 359         made to power the cluster down.
 360
 361         Next state:
 362                 CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
 363         Conditions:
 364                 none
 365         Trigger events:
 366                 policy decision to power down the cluster
 367
 368
 369 CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
 370
 371         An outbound CPU is tearing the cluster down.  The selected CPU
 372         must wait in this state until all CPUs in the cluster are in the
 373         CPU_DOWN state.
 374
 375         When all CPUs are in the CPU_DOWN state, the cluster can be torn
 376         down, for example by cleaning data caches and exiting
 377         cluster-level coherency.
 378
  379         To avoid unnecessary teardown operations, the outbound CPU
  380         should check the inbound cluster state for asynchronous
 381         transitions to INBOUND_COMING_UP.  Alternatively, individual
 382         CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
 383
 384
 385         Next states:
 386
 387         CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
 388                 Conditions:
 389                         cluster torn down and ready to power off
 390                 Trigger events:
 391                         (spontaneous)
 392
 393         CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
 394                 Conditions:
 395                         none
 396
 397                 Trigger events:
 398                         a) an explicit hardware power-up operation,
 399                            resulting from a policy decision on another
 400                            CPU;
 401
 402                         b) a hardware event, such as an interrupt.
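
The checks made by the outbound CPU while it waits in this state might
look like the sketch below.  The verdict enum and outbound_poll() are
hypothetical names.  The sketch also assumes that the outbound CPU skips
its own entry when scanning the CPU states, since it is still running at
that point; the text above does not state this explicitly.

```c
#include <stdbool.h>
#include <stddef.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum outbound_verdict {
	OUTBOUND_KEEP_WAITING,	/* some other CPU is still going down */
	OUTBOUND_TEAR_DOWN,	/* all other CPUs are down: safe to proceed */
	OUTBOUND_INBOUND_SEEN,	/* inbound activity: teardown may be wasted */
};

/* One polling pass by the outbound CPU, whose index is 'self'. */
enum outbound_verdict outbound_poll(const enum cpu_state *cpus, size_t ncpus,
				    size_t self, enum inbound_state inbound)
{
	bool still_going_down = false;
	size_t i;

	/* Check the inbound half of the cluster state first. */
	if (inbound == INBOUND_COMING_UP)
		return OUTBOUND_INBOUND_SEEN;

	for (i = 0; i < ncpus; i++) {
		if (i == self)
			continue;
		switch (cpus[i]) {
		case CPU_COMING_UP:
		case CPU_UP:
			/* Alternative check on individual CPU states. */
			return OUTBOUND_INBOUND_SEEN;
		case CPU_GOING_DOWN:
			still_going_down = true;
			break;
		case CPU_DOWN:
			break;
		}
	}

	return still_going_down ? OUTBOUND_KEEP_WAITING : OUTBOUND_TEAR_DOWN;
}
```

A caller would repeat outbound_poll() until it returns something other
than OUTBOUND_KEEP_WAITING.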
 403
 404
 405 CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
 406
 407         The cluster is (or was) being torn down, but another CPU has
 408         come online in the meantime and is trying to set up the cluster
 409         again.
 410
 411         If the outbound CPU observes this state, it has two choices:
 412
 413                 a) back out of teardown, restoring the cluster to the
 414                    CLUSTER_UP state;
 415
 416                 b) finish tearing the cluster down and put the cluster
 417                    in the CLUSTER_DOWN state; the inbound CPU will
 418                    set up the cluster again from there.
 419
 420         Choice (a) permits the removal of some latency by avoiding
 421         unnecessary teardown and setup operations in situations where
 422         the cluster is not really going to be powered down.
 423
 424
 425         Next states:
 426
 427         CLUSTER_UP/INBOUND_COMING_UP (outbound)
 428                 Conditions:
  429                         cluster-level setup and hardware
  430                         coherency complete
  431
  432                 Trigger events:
  433                         (spontaneous)
 434
 435         CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
 436                 Conditions:
 437                         cluster torn down and ready to power off
 438
 439                 Trigger events:
 440                         (spontaneous)
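
The eight cluster-level transitions listed above can be collected into a
table and checked mechanically.  This is a sketch for illustration only:
the table layout and transition_allowed() are assumptions of this example,
not part of the implementation.

```c
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { SIDE_INBOUND, SIDE_OUTBOUND };

struct transition {
	enum cluster_state from_c;
	enum inbound_state from_i;
	enum cluster_state to_c;
	enum inbound_state to_i;
	enum side who;		/* the side that may perform the transition */
};

/* One row per "Next state" entry in the descriptions above. */
static const struct transition allowed[] = {
	{ CLUSTER_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_DOWN, INBOUND_COMING_UP, SIDE_INBOUND },
	{ CLUSTER_DOWN, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_COMING_UP, SIDE_INBOUND },
	{ CLUSTER_UP, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_NOT_COMING_UP, SIDE_INBOUND },
	{ CLUSTER_UP, INBOUND_NOT_COMING_UP,
	  CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP, SIDE_OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_DOWN, INBOUND_NOT_COMING_UP, SIDE_OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP,
	  CLUSTER_GOING_DOWN, INBOUND_COMING_UP, SIDE_INBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_COMING_UP,
	  CLUSTER_UP, INBOUND_COMING_UP, SIDE_OUTBOUND },
	{ CLUSTER_GOING_DOWN, INBOUND_COMING_UP,
	  CLUSTER_DOWN, INBOUND_COMING_UP, SIDE_OUTBOUND },
};

/* Return true if 'who' may move the cluster from one pair to the other. */
bool transition_allowed(enum cluster_state from_c, enum inbound_state from_i,
			enum cluster_state to_c, enum inbound_state to_i,
			enum side who)
{
	size_t i;

	for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
		const struct transition *t = &allowed[i];

		if (t->from_c == from_c && t->from_i == from_i &&
		    t->to_c == to_c && t->to_i == to_i && t->who == who)
			return true;
	}
	return false;
}
```

Every outbound row leaves the "inbound" half unchanged, and the only
inbound row that changes the "cluster" half starts from CLUSTER_DOWN,
which matches the ownership rule stated earlier.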
 441
 442
 443 Last man and First man selection
 444 --------------------------------
 445
 446 The CPU which performs cluster tear-down operations on the outbound side
 447 is commonly referred to as the "last man".
 448
 449 The CPU which performs cluster setup on the inbound side is commonly
 450 referred to as the "first man".
 451
 452 The race avoidance algorithm documented above does not provide a
 453 mechanism to choose which CPUs should play these roles.
 454
 455
 456 Last man:
 457
 458 When shutting down the cluster, all the CPUs involved are initially
 459 executing Linux and hence coherent.  Therefore, ordinary spinlocks can
 460 be used to select a last man safely, before the CPUs become
 461 non-coherent.
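
A minimal sketch of last man selection is shown below, assuming a per-
cluster count of CPUs still in use that is protected by a lock.  The C11
atomic_flag here is only a portable stand-in for an ordinary kernel
spinlock, and all names are hypothetical.  It is valid only because every
CPU taking the lock is still coherent at this point.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct cluster_users {
	atomic_flag lock;		/* stand-in for a spinlock */
	unsigned int cpus_in_use;	/* CPUs that have not yet left */
};

void cluster_users_init(struct cluster_users *c, unsigned int ncpus)
{
	atomic_flag_clear(&c->lock);
	c->cpus_in_use = ncpus;
}

static void cluster_lock(struct cluster_users *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void cluster_unlock(struct cluster_users *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/*
 * Called by a CPU that has decided to go down.  Returns true if the caller
 * is the last man.  The caller must have been counted in cpus_in_use.
 */
bool cpu_leave_is_last_man(struct cluster_users *c)
{
	bool last;

	cluster_lock(c);
	c->cpus_in_use--;
	last = (c->cpus_in_use == 0);
	cluster_unlock(c);

	return last;
}
```

Exactly one caller observes the count reaching zero, so exactly one CPU
is selected as the last man.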
 462
 463
 464 First man:
 465
 466 Because CPUs may power up asynchronously in response to external wake-up
 467 events, a dynamic mechanism is needed to make sure that only one CPU
 468 attempts to play the first man role and do the cluster-level
 469 initialisation: any other CPUs must wait for this to complete before
 470 proceeding.
 471
 472 Cluster-level initialisation may involve actions such as configuring
 473 coherency controls in the bus fabric.
 474
 475 The current implementation in mcpm_head.S uses a separate mutual exclusion
 476 mechanism to do this arbitration.  This mechanism is documented in
 477 detail in vlocks.txt.
 478
 479
 480 Features and Limitations
 481 ------------------------
 482
 483 Implementation:
 484
 485         The current ARM-based implementation is split between
 486         arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
 487         arch/arm/common/mcpm_entry.c (everything else):
 488
 489         __mcpm_cpu_going_down() signals the transition of a CPU to the
 490         CPU_GOING_DOWN state.
 491
 492         __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
 493         state.
 494
 495         A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
 496         low-level power-up code in mcpm_head.S.  This could
 497         involve CPU-specific setup code, but in the current
 498         implementation it does not.
 499
 500         __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
 501         handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
 502         and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
 503         the case of an aborted cluster power-down).
 504
 505         These functions are more complex than the __mcpm_cpu_*()
 506         functions due to the extra inter-CPU coordination which
 507         is needed for safe transitions at the cluster level.
 508
 509         A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
 510         the low-level power-up code in mcpm_head.S.  This
 511         typically involves platform-specific setup code,
 512         provided by the platform-specific power_up_setup
 513         function registered via mcpm_sync_init.
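
To show how these signalling steps might fit together on a power-down
path, the sketch below uses small stand-in functions that only update a
model of the state.  The model_*() names, the struct and the call order
are assumptions of this example.  They are not the kernel functions named
above, and the actual teardown work is reduced to comments.

```c
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };

/* A one-CPU view of the state, for illustration only. */
struct model {
	enum cpu_state cpu;
	enum cluster_state cluster;
	bool inbound_coming_up;	/* set if an inbound CPU has appeared */
};

void model_cpu_going_down(struct model *m)
{
	m->cpu = CPU_GOING_DOWN;
}

void model_cpu_down(struct model *m)
{
	m->cpu = CPU_DOWN;
}

void model_outbound_leave_critical(struct model *m, enum cluster_state s)
{
	m->cluster = s;
}

/* Returns false, with the cluster back in CLUSTER_UP, on an abort. */
bool model_outbound_enter_critical(struct model *m)
{
	m->cluster = CLUSTER_GOING_DOWN;
	if (m->inbound_coming_up) {
		model_outbound_leave_critical(m, CLUSTER_UP);
		return false;
	}
	return true;
}

/* One possible ordering of the signalling steps on a power-down path. */
void power_down_path(struct model *m, bool last_man)
{
	model_cpu_going_down(m);

	if (last_man && model_outbound_enter_critical(m)) {
		/* cluster-level teardown would happen here */
		model_outbound_leave_critical(m, CLUSTER_DOWN);
	}

	/* CPU-level teardown would happen here */
	model_cpu_down(m);
}
```

In this model the CPU always ends in CPU_DOWN, while the cluster reaches
CLUSTER_DOWN only when the caller is the last man and no inbound CPU has
been seen.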
 514
 515 Deep topologies:
 516
 517         As currently described and implemented, the algorithm does not
 518         support CPU topologies involving more than two levels (i.e.,
 519         clusters of clusters are not supported).  The algorithm could be
 520         extended by replicating the cluster-level states for the
 521         additional topological levels, and modifying the transition
 522         rules for the intermediate (non-outermost) cluster levels.
 523
 524
 525 Colophon
 526 --------
 527
 528 Originally created and documented by Dave Martin for Linaro Limited, in
 529 collaboration with Nicolas Pitre and Achin Gupta.
 530
 531 Copyright (C) 2012-2013  Linaro Limited
 532 Distributed under the terms of Version 2 of the GNU General Public
 533 License, as defined in linux/COPYING.