Documentation/design-docs/inlining-plans.md

   1 # Proposed Plans for Inlining
   2
   3 This document outlines a plan for improving the inlining capabilities
   4 of RyuJit as well as adding inlining capabilities to LLILC.
   5
   6 ## Background
   7
   8 Inlining is a key optimization, directly reducing call overhead and
   9 indirectly exposing a wider scope for powerful intraprocedural
  10 optimizations.
  11
  12 From an implementation standpoint, an inliner has the following aspects:
  13
  14 * Machinery: transformations on the compiler's IR that physically
  15 incorporate the code for the callee into the caller.
  16
  17 * Legality: a set of rules that describe when inlining is legal (that
  18 is, preserves program semantics). Here, the legality rules are largely
  19 dictated to us by the CoreCLR runtime.
  20
  21 * Ability: a set of rules that describe whether the machinery is able
  22 to perform a legal inline and whether the result can be handled by
  23 downstream phases.
  24
  25 * Profitability: a set of heuristic rules that describe which legal
  26 and able inlines are desirable.
  27
  28 The general consensus is that the inlining in RyuJit can be improved,
  29 while carefully maintaining RyuJit's current lean compilation
  30 overhead. It is anticipated that this work will encompass Machinery,
  31 Ability, and Profitability.
  32
  33 LLILC does no inlining today. Since we aspire to have LLILC be a
  34 high-performance .Net code generator, we need to enable inlining in
  35 LLILC. LLILC can likely leverage much of LLVM's built-in inlining
  36 Machinery and Ability, but will need new code for Legality and
  37 Profitability.
  38
  39 We envision various scenarios for .Net Code generation that impact
  40 inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a
  41 fast AOT compiler, and an optimizing AOT compiler. Each scenario calls
  42 for inlining, but the tradeoffs are different. For a given scenario,
  43 we may choose to deploy RyuJit or LLILC or both depending on the
  44 scenario requirements and code generator capabilities.
  45
  46 There are also aspects of inlining that are runtime and target machine
  47 specific. Each runtime and target architecture will require some
  48 specialized tuning to accommodate the different costs of operations at
  49 runtime.
  50
  51 Taking this all into account, the proposed scope of the work
  52 encompasses two different compiler code bases, a number of codegen
  53 scenarios, and a number of target architectures and runtime
  54 conventions. The goal is to come up with one conceptual framework that
  55 covers all these cases and avoids relying too much on time-intensive
  56 manual investigation and tuning.
  57
  58 We will measure the quality of inlining along three axes: the time
  59 spent generating code (aka *throughput* -- TP) abbreviate as TP), the
  60 time spent executing the code (aka *code quality* -- CQ), and the size
  61 of the generated code (CS).
  62
  63 We also desire that the resulting decision-making machinery (in particular
  64 the Profitability heuristic) is expressed in terms that are sensible to
  65 compiler writers.
  66
  67 ## The Proposal
  68
  69 ### Machinery
  70
  71 RyuJit's machinery may or may not require updates depending on how many
  72 key inlines are rejected because of lack of ability. This is something
  73 that requires further investigation. We may also decide to pursue
  74 enhancements here to ensure that in cascaded inline cases the deeper
  75 callsites are provided up-to-date information.
  76
  77 LLILC should be able to leverage LLVM's existing inline machinery. In
  78 JIT-like scenarios, LLILC's reader will need some updates to properly
  79 handle cases where a method is being "read" as an inline candidate
  80 rather than a root method to compiled. For AOT scenarios we ultimately
  81 will want to do something more elaborate where all of the methods
  82 being compiled are represented as LLVM IR.
  83
  84 ### Legality
  85
  86 As mentioned above, the legality constraints are generally imposed by
  87 the CoreCLR runtime.
  88
  89 These constraints are already captured in the RyuJit source
  90 code. However, we generally prefer to at least logically perform all
  91 the legality checks before any of ability or profitability checks, so
  92 that we can reason more simply about why a particular inline didn't
  93 happen. This may not be entirely practical, since some of the
  94 profitability or ability early outs may be helping TP by keeping
  95 RyuJit from doing a lot of exploratory work before ultimately
  96 rejecting an inline candidate.
  97
  98 LLILC lacks legality checking, and this needs to be implemented.
  99
 100 ### Ability
 101
 102 RyuJit currently bails out of some legal inlining cases because of
 103 internal restrictions. For instance, if a callee modifies one of its
 104 parameters, RyuJit won't inline it because of limitations in how the
 105 IR is incorporated. We expect to work on reducing or removing key
 106 ability limiters in RyuJit.
 107
 108 Assuming the LLVM inlining framework is robust, LLILC shouldn't have
 109 too many ability constraints.
 110
 111 ### Profitability
 112
 113 Given the wide range of scenarios, targets, and the two distinct code
 114 bases, we propose to heavily rely on machine learning techniques for
 115 building and tuning the Profitability components of the inliners.
 116
 117 The general approach we suggest is based on past experiences we've had
 118 developing inliners for various compilers and is inspired by the CGO
 119 paper [Automatic Construction of Inlining Heuristics using Machine
 120 Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni,
 121 Cavazos, Wimmer, and Simon.
 122
 123 Kulkarni et. al. treat profitability as an unsupervised learning
 124 problem, and create a well-performing heuristic black box (neural
 125 network) using evolutionary programming techniques. They then turn
 126 around and use this black box as an oracle to label inline instances,
 127 and from this guide a supervised machine learning algorithm to produce
 128 a decision tree that expresses the profitability heuristics in terms
 129 sensible to the compiler writer.
 130
 131 The inputs to the heuristic black box are various facts and estimates
 132 about an inline candidate, based on observations of the caller, callee,
 133 and call site (collectively, *features*). These features may be booleans,
 134 enumerates (categories), integers, or floating-point values. For example:
 135
 136 * (boolean) `CalleeIsLeaf`, `CallerSiteInTry`
 137 * (enumerate) `CalleeReturnCorType`
 138 * (int) `CalleeILSize`,`CallSiteProfileWeight`
 139 * (float) `CalleeLoadStoreFrequency`
 140
 141 The output is a yes/no inlining decision for a particular candidate.
 142
 143 The evolutionary process is governed by a fitness metric that expresses the
 144 appropriate balance of TP, CQ, and CS for the scenario. For instance, in a JIT
 145 scenario we likely will want to keep TP and CS close to the current
 146 baseline values. For AOT we may want to allow higher TP and larger CS if we
 147 can demonstrate CQ improvements.
 148
 149 ## Implications
 150
 151 For this approach to work, we will need to make frequent and robust
 152 measurements of the key fitness metrics across a range of
 153 representative benchmarks. So this presumes:
 154
 155 * we can identify and capture these key benchmarks
 156 * we can develop methodology for doing appropriate measurements
 157 * we can do these measurements at appropriate scale
 158
 159 ### The Benchmarks
 160
 161 Clearly in a machine learning approach the actual observations made
 162 are crucial. We need a solid, representative benchmark set. But this
 163 is a general problem with a heuristic based optimization and not an
 164 inherent limitation of this approach -- one can only measure what one
 165 can measure.
 166
 167 We plan on capturing the results in standard schema and hope that with
 168 this and the regularization of the metric data (see next section) that
 169 interested 3rd parties can perhaps supply their own benchmark results.
 170
 171 ### The Metrics
 172
 173 CS is typically very easy to measure and the measurements are "noise
 174 free". So the measurement only needs to be done once.
 175
 176 If TP and CQ are both measured in terms of time (or cycles) then the
 177 measurements will be inherently noisy. To reduce the variance caused
 178 by noise some repeated runs may be necessary, and it might also be
 179 necessary to run on dedicated, matched hardware. This severely limits
 180 scalability.
 181
 182 We instead propose to measure both TP and CQ in terms of instructions
 183 retired. While this may not be entirely noise-free, we expect the
 184 measurements to be low noise and fairly insensitive to other
 185 computations on the machine. Thus this should give us decent scale
 186 (for instance we can run scenarios in the cloud on VMs with varying
 187 underlying hardware, etc).
 188
 189 For TP, instructions retired should be a reasonable proxy for time,
 190 since we expect the CPI for the compiler proper to be relatively
 191 static. For CQ, we realize there may not be as strong a correlation
 192 with time, but typically the inliner operates early in the compilation
 193 process and inlining is mainly focused on reducing instruction count
 194 and less about reducing cycle stalls.
 195
 196 We'll do some experimentation early on to see how well these proxy
 197 metrics hold up.
 198
 199 ### Analysis
 200
 201 To help prioritize work on profitability and ability (and perhaps find
 202 areas where we might discuss trying to remove legality constraints),
 203 we need some mechanism to correlate inlining outcomes with success and
 204 failure reasons.
 205
 206 We propose to develop a textual, machine-readable inline log schema
 207 that describes a set of inlining decisions. This schema can be
 208 produced by a compiler as it makes decisions, and contains sufficient
 209 correlating data so that we can then match it up with runtime
 210 observations. For example, we might end up with the following sort of
 211 per-method output (though more likely it will be embedded in some kind of
 212 xml or json markup):
 213
 214 ```
 215   Inlining for 9811 Lookup
 216      [o] 22825 Lookup@Lattice
 217         [o] 22827 ??0?$interior_ptr@PAVCell
 218         [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
 219      [o] 21728 ?get_CellList
 220
 221 ```
 222
 223 where `[o]` is a successful inline, `[x]` a failed inline, and
 224 indentation shows the inlining tree. For .Net compilation we'll need
 225 some kind of persistent ID for methods, which may not be all that easy
 226 to come by.
 227
 228 This inline log can also be read back by the code generator to enable
 229 *inline replay* and force the inliner to perform a particular pattern
 230 of inlines. This is useful for things like performance investigations,
 231 bug isolation, and for trying to have one compiler emulate the
 232 inlining behavior of another. Since the logs are textual they can also
 233 be edited by hand if necessary to perform experiments.
 234
 235 ### Obtaining Runtime Data
 236
 237 If we can obtain runtime data about the frequency of call
 238 instructions, we can run this through the log to accumulate failure
 239 reasons and create a failure reason table like the following
 240 (categories and data here totally made up):
 241
 242 ```
 243   FailedReason       | Count    | [%]   |  Sites |   [%]
 244 ---------------------|----------|-------|--------|------
 245 SpeculativeNonGround | 79347955 | 37.8% |    417 | 14.1%
 246 NonGroundShared      | 28756634 | 13.7% |    217 | 07.4%
 247 SizeLimit            | 22066033 | 10.5% |    760 | 25.8%
 248 ExternFunction       | 21923402 | 10.5% |    300 | 10.2%
 249 ... etc ...
 250 ```
 251
 252 This can be used to prioritize inlining work. For instance, if
 253 `SpeculativeNonGround` was an ability issue, and we wanted to improve
 254 CQ, we'd give it high priority to fix, since it impacts inlining most
 255 frequently.
 256
 257 ### Robustness
 258
 259 Since we'll likely be altering inlining decisions, we may uncover bugs
 260 or performance blemishes in the compilers.
 261
 262 It may make sense to build up a random inlining methodology (RyuJit
 263 evidently has something like this) to try and stress test the compiler
 264 in the face of inlining changes. Log replay can be used with automatic
 265 bisection tools to try and isolate the particular inline that exposes
 266 a bug.
 267
 268 Performance issues are harder to spot automatically, and will likely
 269 require some amount of manual analysis. However, we may be able to study
 270 inline instances that are not well-predicted by the inlining models -- for
 271 instance, cases where the CS is much greater than predicted, or CQ is lower --
 272 and see if the modelling shortfall is an indicator of problems within the
 273 compiler itself.
 274
 275 ## Development Stages
 276
 277 Given the above, here's a rough draft outline for the how the
 278 functionality could be built.
 279
 280 ### Preliminaries -- Measurement
 281
 282 1. Make some benchmarks available for measurement.
 283
 284 2. Demonstrate we can reliably measure the metrics of interest: CS,
 285 CQ, and TP (as bytes, instructions, and instructions) when jitting and
 286 pre-jitting the benchmarks. Get some idea of the noise levels for CQ
 287 and TP. Hopefully they are small.
 288
 289 ### Preliminaries -- RyuJit
 290
 291 1. Refactor RyuJit code to clearly separate out legality, ability,
 292 and profitability aspects of the inliner. Hopefully this can be done
 293 with no CQ changes and no TP impact.
 294
 295 2. Develop the inline log format.
 296
 297 3. Produce inline logs from RyuJit.
 298
 299 4. Develop methodology for runtime correlation.
 300
 301 5. Produce failure reason analysis for the benchmark set.  Decide
 302 which if any abilities need improving.
 303
 304 6. Implement inline log replay for RyuJit.
 305
 306 7. Modify logs by hand or via a tool and replay them to assess the
 307 landscape of TP/CQ/CS tradeoffs from inliner changes.
 308
 309 ### Preliminaries -- LLILC
 310
 311 1. Add code to capture legality constraints when reading in a method.
 312
 313 2. Generalize LLILC's MSIL reader to read in an inlinee in
 314 the context of the caller.
 315
 316 3. Write a simple LLVM pass (or specialize an existing pass) to
 317 perform inlining. Initially just implement a very simple profitability
 318 heuristic. Verify it all works correctly.
 319
 320 4. Implement inline log generation by LLILC.
 321
 322 5. Implement inline log replay by LLILC.
 323
 324 6. Modify logs by hand or via a tool and replay them to assess the
 325 landscape of TP/CQ/CS tradeoffs from inliner changes.
 326
 327 ### Develop the Heuristics (JIT case)
 328
 329 1. Start by trying to produce code size estimates for a given inline
 330 candidate. Measure actual code size impacts for individual inlining
 331 decisions.
 332
 333 2. Determine features to surface and produce a labelled data set of
 334 (feature*, size impact). Use this to build a machine-learning model
 335 estimator for code size.
 336
 337 3. We probably can't just plug the new size estimate into the existing
 338 heuristic framework and expect good results (generally, better
 339 information in does not always result in better decisions out).  So
 340 instead build up a parallel (off by default) heuristic path where we
 341 use the new size estimator.
 342
 343 4. Surface features for the CQ and TP impacts. Build a pluggable model
 344 for the black box heuristic in the compiler. Likely this means that
 345 the model code is written by a tool from some textual description.
 346
 347 5. Run evolutionary experiments to produce an optimized black box
 348 heuristic. This involves: (a) randomly or otherwise generating an
 349 initial population of models; (b) running each model on the
 350 benchmarks, gathering metrics; (c) combining metrics across benchmarks
 351 to produce fitness; (d) using fitness to create a new set of models
 352 using various genetic approaches; (e) iterating until satisfied with
 353 the result.
 354
 355 All of this setup should be automated and available in the CI system.
 356
 357 6. Use the winning model to produce a labelled data set of (feature*,
 358 inlining decision) for all the cases in the benchmarks. Build a
 359 decision tree from this via standard techniques. Decide if any pruning
 360 is appropriate.
 361
 362 7. Back validate the results via measurement to show (a) final heuristic
 363 works as expected and metrics match expectations on training examples;
 364 cross validate on untrained benchmarks to show results are not over-fitted.
 365 Flight this to various 3rd parties for independent confirmation.
 366
 367 8. Enable new heuristics as defaults.
 368
 369 9. Repeat the above process for each code base, scenario,
 370 architecture, etc.
 371
 372 ### AOT Considerations
 373
 374 In the AOT case (particularly the optimized AOT) we will likely need
 375 to develop a different set of heuristics.
 376
 377 The current frameworks for AOT (NGEN and Ready2Run) give the code
 378 generator the opportunity to preview the methods to be compiled. We'll
 379 take advantage of that to build a call graph and orchestrate
 380 compilation of methods in a "bottom-up" fashion (callees before
 381 callers). Because of that, when inlining, the caller typically has the
 382 opportunity to look at a richer set of data about the callee -- for
 383 instance, detailed usage summary for parameters and similar.
 384
 385 The inliner also has the opportunity to plan inlining over a wider
 386 scope, perhaps adjusting heuristics for the relative importance of
 387 callers or the shape of the call graph or similar.
 388
 389 ## Vetting and Revalidation
 390
 391 With an evaluation framework in place, we can now run carefully
 392 controlled experiments to see if further enhancements to the inliner
 393 are warranted. For example, the order in which candidates are
 394 considered may have an impact -- ideally the inliner would greedily
 395 look at the best candidates first. Other feature data may prove
 396 important, for instance actual or synthetic profile data.
 397
 398 The rest of the compiler also changes over time. Periodically we
 399 should rerun the heuristic development to see if the current set of
 400 heuristics are still good, especially if major new capabilities are
 401 added to the compiler.
 402
 403 We should also continually strive to improve the benchmark set.
 404
 405 ## Risks
 406
 407 The plan calls for the inliner's Profitability logic to be constructed
 408 via machine learning. This exposes us to a number of risks:
 409
 410 * We may not understand how to successfully apply ML techniques.
 411 * We may run into one of the common ML pitfalls, like having too many
 412 features, or strongly correlated features.
 413 * We may not be using the right features.
 414 * Some of the features may themselves be estimates (eg code size).
 415 * The benchmark set may not be sufficiently representative.
 416 * We may overfit the benchmark set. The plan here is to cross-validate and
 417 not use all our benchmarks as training examples.
 418 * The training process may be too slow to be practical.
 419 * There may be too much measurement noise.
 420 * The resulting models may not be stable -- meaning that small changes in
 421 the input programs might trigger large changes in the inlining decisions.
 422 * The supervised learning output may still be black-box like in practice.
 423 * The downstream phases of the compiler might not adapt well to inlining
 424 changes or may offer an uneven performance surface.
 425 * We've found it tricky in the past to produce a good combined fitness metric
 426 from a large set of benchmarks and metrics.
 427 * It's generally not possible to build a new heuristic which is better in
 428 every way than the old one -- some metrics on some benchmarks will inevitably
 429 regress.
 430
 431 However, it's worth pointing out that a manually constructed heuristic faces
 432 many of these same risks.
 433
 434 ## Notes
 435
 436 See this [early
 437 draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c),
 438 now orphaned by a squash, for some interesting comments.