Documentation/design-docs/longs-on-32bit-arch.md

   1 # Modeling Longs on 32-bit Architectures
   2
   3 The challenge here is to model long operations in a way that reflects register lifetimes for the
   4 lo and hi halves of a 64-bit integer accurately enough that we can achieve good register
   5 allocation.
   6
   7 ## Background
   8
   9 The current liveness model supports:
  10 * Internal registers (interfere with all sources, but no defs)
  11 * Sources that are not last use (interfere with all others)
  12 * Sources that are last use and NOT DelayFree (interfere with all sources, but no defs)
  13 * Sources that are last use AND DelayFree (interfere with all sources and defs)
  14 * Kills (interfere with all uses but no defs)
  15
  16 Previously (eons ago) when we were working on arm32, the model had all but the last def
  17 interfering with the uses and internal registers (i.e. they went live during the first
  18 location for the node).
  19 * This supports the ability to reuse inputs, e.g. for long adds, subs or addresses, after the first def register goes live.
  20 * This doesn’t work for supporting multi-reg call nodes:
  21   * All defs need to occur AFTER the kills, and they do not interfere with any sources.
  22
  23 Considerations for an expanded model:
  24 * Must be “pay for play” (not make LSRA more expensive for non-long code generation)
  25 * It is not practical to precisely model nodes that generate multiple instructions:
  26   * In the case of longs, we may have one source operand that could be overwritten by
  27     the first result, but other operands that need to remain live.
  28   * However, on x86 it will be really important not to waste registers.
  29
  30 ## Decomposition
  31
  32 This is the approach currently being taken. In the initial implementation, the reordering
  33 of nodes to support the required "valid tree traversal" property of the IR was causing
  34 the code to be incorrect because for instructions like `add` the carry bit needs to be
  35 immediately consumed by the computation ofthe hi bits.
  36
  37 ## Decomposition with temps
  38
  39 In order to preserve the current decomposition approach and not change the fundamental
  40 tree ordering properties, it is necessary to evaluate sub-expressions into temps, to keep
  41 the lo and hi computations together.
  42
  43 There are concerns about this, because it requires generating a number of extra temps
  44 in the case of nested expressions. However, mikedn has done some experimentation
  45 [here](https://github.com/mikedn/coreclr/blob/decompose/src/jit/lower.cpp#L424)
  46 that indicaates that this approach may not be as problematic as we feared.
  47
  48 This basic idea is that whenever we need to decompose hi/lo operations but keep them
  49 together (i.e. can’t preserve the tree traversal/linear order invariants), we create a temp.
  50
  51 ## Richer Liveness Model (No Decomposition)
  52
  53 The idea here woudl be to retain `TYP_LONG` nodes in the IR, and to find a way to extend
  54 the liveness model used by Lowering, LSRA and CodeGen to ensure good register allocation.
  55 o
  56
  57 In the table below, there are several operations that have different register lifetime
  58 characteristics.
  59 Each is modeled with a sequence of "Locations" for which the changes in register lifetimes
  60 can be viewed as happening at the same time.
  61 All lifetimes participating in a given Location are considered to interfere.
  62 Note that the simplest model is that all the uses happen at one location, and then the
  63 def(s) happen at the next Location (i.e. the def does not interfere with the uses).
  64 The model becomes more complicated when we have constraints such as RMW (read-modify-write)
  65 operands, and even more so when we are actually modeling multiple target instructions
  66 with a single IR node.
  67
  68 To avoid even more complication than is already inherent in this issue, we will assume
  69 that the evaluation order is fixed during Lowering (and in these examples we are assuming
  70 that the lo operand is evaluated first).
  71
  72 In future we may want to consider the implications of relaxing that constraint, though
  73 note that the Decomposition approach also requires a predetermined evaluation order.
  74
  75 | Operation     | Location 1 | Location 2 | Location 3 |
  76 | --------------|:----------:| ----------:| ----------:|
  77 | x = y         | use y.lo   | def x.lo   |            |
  78 |               | use y.hi   | def x.hi   |            |
  79 |               |            |            |            |
  80 | x = y + z     | use y.lo   | def x.lo   | def x.hi   |
  81 |               | use z.lo   | use y.hi   |            |
  82 |               | y.hi live  | use z.hi   |            |
  83 |               | z.hi live  | use z.hi   |            |
  84 |               |            |            |            |
  85 | x = *p        | use p      | def x.hi   |            |
  86 | (non-ldp)     | def x.lo   |            |            |
  87 |               |            |            |            |
  88 | x = *p        | use p      | def x.lo   |            |
  89 | (ldp)         |            | def x.hi   |            |
  90 |               |            |            |            |
  91 | x = \*(p+i*8) | use p      | def x.hi   |            |
  92 | (non-ldp)     | use i      |            |            |
  93 |               | def x.lo   |            |            |
  94 |               |            |            |            |
  95 | x = call      | use args   | def x.lo   |            |
  96 |               | kill (end) | def x.hi   |            |
  97
  98
  99 Both the non-ldp  (load pair - available on Arm but not x86) cases take
 100 advantage of the fact that all the sources must remain live
 101 with the first def. The same can be done in the non-ldp case of `x = *p`.
 102 However, that approach doesn't reduce the Location requirement for the `x = y + z` case,
 103 where, if we assume that all inputs must be live for the first def, then we keep the lo
 104 registers live longer than necessary, and therefore can't reuse them for the x.lo (or
 105 other) results.
 106
 107 TO DO:
 108 * Extend the above to include more operations, and to show the actual instructions
 109   generated, to determine how well we can approximate the "optimal" liveness model.
 110
 111 ## True Linear IR
 112
 113 It would be really nice to break the “valid tree traversal” restriction, and allow
 114 arbitrary (or at least more flexible) reordering of nodes when in linear form.
 115 * Gets rid of embedded statements!!
 116 * Affects (at least) rationalizer and lsra, which implicitly depend on this property in their use of stacks.
 117 * Probably has many unknown other impacts