Documentation/design-docs/removing-embedded-statements.md

   1 ## Removing Embedded Statements From the RyuJIT Backend IR
   2
   3 Pat Gavlin ([*pagavlin@microsoft.com*](mailto:pagavlin@microsoft.com))
   4
   5 July 2016
   6
   7 ### Abstract
   8
   9 RyuJIT’s IR comes in two forms: that produced and manipulated in the front end and that expected and manipulated by the
  10 back end. The boundary between these two forms of IR is comprised of rationalization and lowering, both of which
  11 transform the front-end IR into back end IR. For the purposes of this paper, the relevant differences between the two
  12 IRs revolve around ordering constructs within basic blocks: top-level statements, trees, comma nodes, and embedded
  13 statements. The latter two constructs are used to represent arbitrary (often side-effecting) code that executes at a
  14 specific point in a tree but does not otherwise participate in the tree’s dataflow. Unfortunately, representational
  15 challenges with embedded statements make them difficult to understand and error-prone to manipulate. This paper proposes
  16 that we remove all statements--embedded and otherwise--from the backend IR by chaining the last linearly-threaded node
  17 within a statement to the first linearly-threaded node within its successor and vice versa as well as removing certain
  18 constraints on the shape of the nodes within a block.
  19
  20 ### Review: IR ordering semantics
  21
  22 As previously mentioned, RyuJIT uses two forms of IR: the front-end IR (referred to hereafter as HIR) and the back-end
  23 IR (referred to hereafter as LIR). Aside from using different representations for operations such as stores, HIR and LIR
  24 differ in their ordering constructs within basic blocks.
  25
  26 Within a basic block, the HIR is ordered first by statements. Each statement consists of a single tree that performs the
  27 computation associated with that statement. The nodes of that tree are executed in the order produced by a left-to-right
  28 post-order visit of the tree’s nodes (with the exception of binary operator nodes that have the GTF\_REVERSE\_OPS flag
  29 set, which reverses the order of their operand trees). As such, the edges in a tree represent both ordering and edges in
  30 a dataflow graph where every definition has a single use, with some exceptions:
  31
  32 -   Edges to nodes that represent defs of storage locations (e.g. edges to nodes on the LHS of assignment operators)
  33     often represent neither ordering or a use-to-def relationship: the def of the location happens as part of the
  34     execution of its parent, and the edge does not represent a use of an SDSU temp.
  35
  36 -   Edges to unused values do not represent a use-to-def relationship: these edges exist only for ordering.
  37
  38 The primary source of the latter are comma nodes, which are used in the HIR specifically to insert arbitrary code into a
  39 tree’s execution that does not otherwise participate in its SDSU dataflow.
  40
  41 Similar to the HIR, the LIR is ordered first by statements. Each statement consists of a single tree and a linear
  42 threading of nodes that represent the SDSU dataflow and computation, respectively, that are associated with that
  43 statement. The linear threading of nodes must contain each node that is present in the tree, and all nodes that make up
  44 the tree must be present in the linear threading in the same order in which they would have been executed by the HIR ’s
  45 tree ordering. Additional nodes may be present in the linear threading in the form of embedded statements, which are
  46 sub-statements whose nodes form a contiguous subsequence of their containing statement’s execution order, but do not
  47 participate in the containing statement’s SDSU dataflow (i.e. the embedded statement’s nodes are not present in the
  48 containing statement’s tree). Embedded statements otherwise have the same ordering semantics as top-level statements,
  49 and are used for the same purpose as comma nodes in the HIR: they allow the compiler to insert arbitrary code into a
  50 statement’s execution that does not otherwise participate in its SDSU dataflow.
  51
  52 As mentioned, both comma nodes and embedded statements are used for the same purpose in the HIR and LIR, respectively.
  53 Each construct represents a contiguous sequence of code that executes within the context of a tree but that does not
  54 participate in its SDSU dataflow. Of the two, however, embedded statements are generally more difficult to work with:
  55 since they are not present in their containing statement’s tree, the developer must take particular care when processing
  56 LIR in order to avoid omitting the nodes that comprise embedded statements from analyses such as those required for code
  57 motion (e.g. the analysis required to make addressing mode “contained”) and to avoid violating the tree order constraint
  58 placed upon the LIR ’s linear threading.
  59
  60 The problems with manipulating embedded statements have two relatively clear solutions: either replace embedded
  61 statements with a construct that is represented in its parent statement’s tree, or remove the tree order constraint and
  62 move the LIR to a linear ordering that is constrained only by the requirements of dataflow and side-effect consistency.
  63 We believe that the latter is preferable to the former, as it is consistent with the overall direction that the LIR has
  64 taken thus far and reduces the number of concepts bound up in tree edges, which would thereafter represent only the SDSU
  65 dataflow for a node. This would clarify the meaning of the IR as well as the analysis required to perform the
  66 transformations required by the backend.
  67
  68 ### Approach
  69
  70 The approach that we propose is outlined below.
  71
  72 #### 1. Add utilities for working with LIR.
  73
  74 Efficiently working with the new IR shape will require the creation of utilities for manipulating, analyzing,
  75 validating, and displaying linearly-ordered LIR. Of these, validation and display of the LIR are particularly
  76 interesting. Validation is likely to check at least the following properties:
  77
  78 1.  The linear threading is loop-free
  79 2.  All SDSU temps (i.e. temps represented by edges) are in fact singly-used
  80 3.  All defs of SDSU temps occur before the corresponding use
  81 4.  All SDSU defs that are used are present in the linear IR
  82
  83 In the short term, we propose that the LIR be displayed using the linear dump that was recently added to the JIT. These
  84 dumps would be configured such that all of the information typically present in today’s tree-style dumps (e.g. node
  85 flags, value numbers, etc.) is also present in the linear dump.
  86
  87 #### 2. Stop Generating embedded statements in the rationalizer.
  88
  89 This can be approached in a few different ways:
  90
  91 1.  Remove commas from the IR before rationalizing. There is already infrastructure in-flight in order to perform this
  92     transformation, so this is attractive from a dev-hours point of view. Unfortunately, this approach carries both
  93     throughput and a code quality risks due to the addition of another pass and the creation of additional local vars.
  94
  95 2.  Change the rationalizer such that it simply does not produce embedded statements. This requires that the
  96     rationalizer is either able to operate on both HIR and LIR or simply linearize commas as it goes.
  97
  98 3.  Remove embedded statements from the IR with a linearization pass between the rationalizer and lowering.
  99
 100 We will move forward with option 2, as it is the most attractive from a throughput and code quality standpoint:
 101 it does not add additional passes (as does option 3), nor does it introduce additional local vars (as does option
 102 1).
 103
 104 #### 3. Refactor decomposition and lowering to work with linear LIR.
 105
 106 The bulk of the work in this step involves moving these passes from statement- and tree-ordered walks to linear walks
 107 and refactoring any code that uses the parent stack to instead use a helper that can calculate the def-to-use edge for a
 108 node. It will also be necessary to replace embedded statement insertion with simple linear IR insertion, which is
 109 straightforward.
 110
 111 ##### 3.i. Decomposition
 112
 113 Transitioning decomposition should be rather simple. This pass walks the nodes of each statement in execution order,
 114 decomposing 64-bit operations into equivalent sequences of 32-bit operations as it goes. Critically, the rewrites
 115 performed by decomposition are universally expansions of a single operator into a contiguous sequence of operators whose
 116 results are combined to form the ultimate result. This sort of transformation is easily performed on linear IR by
 117 inserting the new nodes before the node being replaced. Furthermore, because nodes will appear in the linear walk in the
 118 same relative order that they appear in the execution-order tree walk, rewrites performed in linear order will occur in
 119 the same order in which they were originally performed.
 120
 121 ##### 3.ii. Lowering
 122
 123 The picture for lowering is much the same as the picture for decomposition. Like decomposition, all rewrites are
 124 performed in execution order, and most rewrites are simple expansions of a single node into multiple nodes. There is one
 125 notable exception, however: the lowering of add nodes for the x86 architecture examines the parent stack provided by
 126 tree walks in order to defer lowering until it may be possible to form an address mode. In this case, the use of the
 127 parent stack can be replaced with a helper that finds the node that uses the def produced by the add node.
 128
 129 ##### 3.iii. General issues
 130
 131 Both decomposition and lowering make use of a utility to fix up information present in the per-call-node side table that
 132 tracks extra information about the call’s arguments when necessary. This helper can be replaced by instead fixing up the
 133 side table entries when the call is visited by each pass.
 134
 135 #### 4. Remove uses of statements and the tree-order invariant from LSRA.
 136
 137 LSRA currently depends on the tree order invariant so that it can build a stack to track data associated with the defs
 138 consumed by an operator. Removing the tree order invariant requires that LSRA is able to accommodate situations where
 139 the contents of this stack as produced by a linear walk would no longer contain correct information due to the insertion
 140 of new nodes that produce values that are live across the operator and some subset of its operands. Analysis indicates
 141 that a simple map from defs to the necessary information should be sufficient.
 142
 143 #### 5. Remove uses of statements from the rest of the backend.
 144
 145 The rest of the backend--i.e. codegen--has no known dependencies on tree order, so there are no concerns with respect to
 146 the change in ordering semantics. However, statement nodes are used to derive IL offsets for debug information. There
 147 are a number of alternative methods of tracking IL offsets to consider:
 148
 149 1.  Insert “IL offset” nodes into the linearly-ordered LIR. These nodes would then be noted by code generation and used
 150     to emit the necessary IP-mapping entries in the same way that statements are today. This has the disadvantage of
 151     requiring additional work when performing code motion on LIR if the IL offset for a particular node needs to be
 152     kept up-to-date. However, today’s backend does not perform this sort of code motion and even if it did, optimized
 153     debugging is not currently a scenario, so a loss of debugging fidelity may be acceptable.
 154
 155 2.  Use a side table to map from each node to its IL offset. Although this increases the total working set of the
 156     backend, this has the advantage of making it easy to keep IL offsets correct in the face of code motion.
 157
 158 3.  Add IL offset information directly to each node. This has the disadvantage of increasing the working set of the
 159     entire compiler.
 160
 161 Unless we expect to require correct debug information in the face of code motion in the future, our recommendation is
 162 for the first option, which comes at a minimum size and implementation cost.
 163
 164 #### 6. Remove contained nodes from the LIR’s execution order.
 165
 166 Once statements are removed from the LIR, there is one major issue left to address: the presence of contained nodes in
 167 the LIR’s execution order. The execution of these nodes logically happens as part of the node in which they are
 168 contained, but these nodes remain physically present in the IR at their original locations. As a result, the backend
 169 must often take care to skip contained nodes when manipulating the LIR in ways that must take execution order into
 170 account. Instead of leaving such nodes in the LIR, these node should be removed from execution order and instead be
 171 represented as (probably unordered) trees referenced only by the containing node.
 172
 173 ### Conclusion and Future Directions
 174
 175 The changes suggested in this paper move RyuJIT’s LIR from a linear view of a tree-ordered nodes with certain nodes
 176 represented only in execution order to a linearly ordered sequence of nodes. Furthermore, with this change, tree edges
 177 in LIR would represent either uses of SDSU temps that have a place in execution order (where the edge points from the
 178 use of the temp to the def) or uses of unordered expression trees that execute as part of the parent node. This form
 179 should be easier to work with due to the removal of the tree order constraint and embedded statements and its similarity
 180 to other linear IR designs.