Documentation/performance/JitOptimizerTodoAssessment.md

   1 Optimizer Codebase Status/Investments
   2 =====================================
   3
   4 There are a number of areas in the optimizer that we know we would invest in
   5 improving if resources were unlimited.  This document lists them and some
   6 thoughts about their current state and prioritization, in an effort to capture
   7 the thinking about them that comes up in planning discussions.
   8
   9
  10 Big-Ticket Items
  11 ----------------
  12
  13 ### Improved Struct Handling
  14
  15 This is an area that has received recent attention, with the [first-class structs](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md)
  16 work and the struct promotion improvements that went in for `Span<T>`.  Work here
  17 is expected to continue and can happen incrementally.  Possible next steps:
  18
  19  - Struct promotion stress mode (test mode to improve robustness/reliability)
  20  - Promotion of more structs; relax limits on e.g. field count (should generally
  21    help performance-sensitive code where structs are increasingly used to avoid
  22    heap allocations)
  23  - Improve handling of System V struct passing (I think we currently insert
  24    some unnecessary round-trips through memory at call boundaries due to
  25    internal representation issues)
  26  - Implicit byref parameter promotion w/o shadow copy
  27
  28 We don't have specific benchmarks that we know would jump in response to any of
  29 these.  We may well be able to find some with a little searching, though this
  30 may be an area where current performance-sensitive code avoids structs.
  31
  32 There's also work going on in corefx to use `Span<T>` more broadly.  We should
  33 make sure we are expanding our span benchmarks appropriately to track and
  34 respond to any particular issues that come out of that work.
  35
  36
  37 ### Exception handling
  38
  39 This is increasingly important as C# language constructs like async/await and
  40 certain `foreach` incantations are implemented with EH constructs, making them
  41 difficult to avoid at source level.  The recent work on finally cloning, empty
  42 finally removal, and empty try removal targeted this.  [Writethrough](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md)
  43 is another key optimization enabler here, and we are actively pursuing it.  Other
  44 things we've discussed include inlining methods with EH and computing funclet
  45 callee-save register usage independently of main function callee-save register
  46 usage, but I don't think we have any particular data pointing to either as a
  47 high priority.
  48
  49
  50 ### Loop Optimizations
  51
  52 We haven't been targeting benchmarks that spend a lot of time doing computations
  53 in an inner loop.  Pursuing loop optimizations for the peanut butter effect
  54 would seem odd.  So this simply hasn't bubbled up in priority yet, though it's
  55 bound to eventually.  Obvious candidates include [IV widening](https://github.com/dotnet/coreclr/issues/9179),
  56 [unrolling](https://github.com/dotnet/coreclr/issues/11606), load/store motion,
  57 and strength reduction.
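
As an illustration of one of the candidates listed above, strength reduction replaces a per-iteration multiply on the induction variable with a running add. The sketch below is a generic example of the transformation, not a description of how RyuJIT would implement it; the function names are hypothetical.

```python
def offsets_with_multiply(count: int, stride: int) -> list[int]:
    """Original loop shape: each iteration multiplies the induction variable."""
    return [i * stride for i in range(count)]


def offsets_strength_reduced(count: int, stride: int) -> list[int]:
    """Strength-reduced shape: the multiply becomes an add carried across iterations."""
    offsets = []
    offset = 0  # holds the value of i * stride for the current iteration
    for _ in range(count):
        offsets.append(offset)
        offset += stride  # cheaper than recomputing i * stride
    return offsets
```

Both functions produce the same sequence; the second one only needs an addition per iteration, which is the property a compiler exploits when the multiply feeds an address computation.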
  58
  59
  60 ### High Tier Optimization
  61
  62 We don't have that many knobs we can "crank up" (though we do have the tracked
  63 assertion count and could switch inliner policies), nor do we have any sort of
  64 benchmarking story set up to validate whether tiering changes are helping or
  65 hurting.  We should get that benchmarking story sorted out and at least hook
  66 up those two knobs.
  67
  68 Some of this may depend on register allocation work, as the RA currently has
  69 some issues, particularly around spill placement, that could be exacerbated by
  70 very aggressive upstream optimizations.
  71
  72
  73 Mid-Scale Items
  74 ---------------
  75
  76 ### More Expression Optimizations
  77
  78 We again don't have particular benchmarks pointing to key missing cases, and
  79 balancing code quality (CQ) vs. throughput (TP) will be delicate here, so it would
  80 really help to have an appropriate benchmark suite to evaluate this work against.
  81
  82
  83 ### Forward Substitution
  84
  85 This too needs an appropriate benchmark suite that I don't think we have at
  86 this time.  The tradeoffs against register pressure increase and throughput
  87 need to be evaluated.  This also might make more sense to do if/when we can
  88 handle SSA renames.
  89
  90
  91 ### Async
  92
  93 We've made note of the prevalence of async/await in modern code (and particularly
  94 in web server code such as TechEmpower), and have some opportunities listed in
  95 [#7914](https://github.com/dotnet/coreclr/issues/7914).  Some sort of study of
  96 async peanut butter to find more opportunities is probably in order, but what
  97 would that look like?
  98
  99
 100 ### If-Conversion (cmov formation)
 101
 102 This wins big in the microbenchmarks where it applies.  There's some work in flight
 103 on this (see [#7447](https://github.com/dotnet/coreclr/issues/7447) and
 104 [#10861](https://github.com/dotnet/coreclr/pull/10861)).
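
For readers unfamiliar with the term, if-conversion turns a small control-flow diamond into straight-line data flow, which the hardware can execute with a conditional move instead of a branch. The sketch below mimics that with a mask-based select; it is a generic illustration with hypothetical function names, not the transformation implemented in the linked work.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    """Control-flow form: a conditional branch decides which value is produced."""
    if cond:
        return a
    return b


def select_if_converted(cond: bool, a: int, b: int) -> int:
    """Data-flow form: both inputs are available and a mask picks one, with no branch."""
    mask = -int(cond)  # -1 (all bits set) when cond is true, 0 otherwise
    return (a & mask) | (b & ~mask)
```

The branch-free form pays for evaluating both inputs, so it tends to help most when the branch is hard to predict and both sides are cheap.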
 105
 106
 107 ### Address Mode Building
 108
 109 One opportunity that's frequently visible in asm dumps is that more address
 110 computations could be folded into the address modes of memory operands.  This
 111 would likely give a measurable code size win.  It needs some thought about where
 112 in the phase list to run and how aggressive to be about e.g. analyzing across
 113 statements.
 114
 115
 116 ### Low Tier Back-Off
 117
 118 We have some changes we know we want to make here: morph does more than it needs
 119 to in minopts, and tier 0 should be doing throughput-improving inlines, as
 120 opposed to minopts which does no inlining.  It would be nice to have the
 121 benchmarking story set up to measure the effect of such changes when they go in;
 122 we should do that.
 123
 124
 125 ### Helper Call Register Kill Set Improvements
 126
 127 We have some facility to allocate caller-save registers across calls to runtime
 128 helpers that are known not to trash them, but the information about which
 129 helpers trash which registers is spread across a few places in the codebase,
 130 and has some puzzling quirks like separate "GC" and "NoGC" kill sets for the
 131 same helper.  Unifying the information sources and then refining the recorded
 132 kill sets would help avoid more stack traffic.  See [#12940](https://github.com/dotnet/coreclr/issues/12940).
 133
 134 Low-Hanging Fruit
 135 -----------------
 136
 137 ### Switch Lowering
 138
 139 The MSIL `switch` instruction is actually encoded as a jump table, so (for
 140 better or worse) intelligent optimization of source-level switch statements
 141 largely falls to the MSIL generator (e.g. Roslyn), since encoding sparse
 142 switches as jump tables in MSIL would be impractical.  That said, when the MSIL
 143 has a switch of just a few cases (as in [#12868](https://github.com/dotnet/coreclr/issues/12868)),
 144 or just a few distinct cases that can be efficiently checked (as in [#12477](https://github.com/dotnet/coreclr/issues/12477)),
 145 the JIT needn't blindly emit these as jump tables in the native code.  Work is
 146 underway to address the latter case in [#12552](https://github.com/dotnet/coreclr/pull/12552).
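
To make the "few distinct cases" situation concrete, the sketch below contrasts a jump-table style lookup with a single bit test for a switch whose cases all lead to one of two targets. The case values, names, and two-target shape are assumptions chosen for illustration; the linked issue and pull request should be consulted for what the JIT actually does.

```python
# Hypothetical switch over the values 0..7: cases 1, 4 and 6 share one target
# ("special"); every other value goes to the default target.
SPECIAL_CASES = (1, 4, 6)
TABLE_SIZE = 8

# Jump-table form: one entry per case value.
JUMP_TABLE = ["special" if v in SPECIAL_CASES else "default" for v in range(TABLE_SIZE)]

# Bit-test form: one bit per case value, set when the case goes to "special".
BIT_MASK = sum(1 << v for v in SPECIAL_CASES)


def classify_jump_table(value: int) -> str:
    """Range check followed by an indexed load from the table."""
    if 0 <= value < TABLE_SIZE:
        return JUMP_TABLE[value]
    return "default"


def classify_bit_test(value: int) -> str:
    """Range check followed by a shift-and-test against a constant mask."""
    if 0 <= value < TABLE_SIZE and (BIT_MASK >> value) & 1:
        return "special"
    return "default"
```

The bit-test form avoids the table load and the indirect jump, which is why it is attractive when only two distinct targets exist.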
 147
 148
 149 ### Write Barriers
 150
 151 A number of suggestions have been made for having the JIT recognize certain
 152 patterns and emit specialized write barriers that avoid various overheads --
 153 see [#13006](https://github.com/dotnet/coreclr/issues/13006) and [#12812](https://github.com/dotnet/coreclr/issues/12812).
 154
 155
 156 ### Byref-Exposed Store/Load Value Propagation
 157
 158 There are a few tweaks to our value-numbering for byref-exposed loads and stores
 159 to share some of the machinery we use for heap loads and stores that would
 160 allow better propagation through byref-exposed locals and out parameters --
 161 see [#13457](https://github.com/dotnet/coreclr/issues/13457) and
 162 [#13458](https://github.com/dotnet/coreclr/issues/13458).
 163
 164 Miscellaneous
 165 -------------
 166
 167 ### Value Number Conservatism
 168
 169 We have some frustrating phase-ordering issues resulting from this, but the
 170 opt-repeat experiment indicated that they're not prevalent enough to merit
 171 changing this right now.  Also, using SSA def as the proxy for value
 172 number would require handling SSA renaming, so there's a big dependency chained
 173 to this.
 174 Maybe it's worth reconsidering the priority based on throughput?
 175
 176
 177 ### Mulshift
 178
 179 RyuJIT has an implementation that handles the valuable cases (see [analysis](https://gist.github.com/JosephTremoulet/c1246b17ea2803e93e203b9969ee5a25#file-mulshift-md)
 180 and [follow-up](https://github.com/dotnet/coreclr/pull/13128) for details).
 181 The current implementation is split across Morph and CodeGen; ideally it would
 182 be moved to Lower, which is tracked by [#13150](https://github.com/dotnet/coreclr/issues/13150).
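
The text above does not define the term, so the following sketch assumes that "mulshift" refers to the classic replacement of multiplication by a constant with shift/add sequences (the kind of thing an x86 `lea` can express). The constants, function names, and the 32-bit wraparound are illustrative assumptions, not taken from the RyuJIT implementation.

```python
MASK32 = 0xFFFFFFFF  # assumption: model 32-bit unsigned wraparound


def mul_by_9(x: int) -> int:
    """x * 9 as x + (x << 3), the shape of a single base + index*8 address computation."""
    return ((x << 3) + x) & MASK32


def mul_by_10(x: int) -> int:
    """x * 10 as (x * 5) << 1, i.e. one shift-add followed by one shift."""
    times_five = (x << 2) + x
    return (times_five << 1) & MASK32


def mul_by_7(x: int) -> int:
    """x * 7 as (x << 3) - x, using a subtract instead of two adds."""
    return ((x << 3) - x) & MASK32
```

Whether such a sequence beats a hardware multiply depends on the target and on how many operations the constant needs, which is presumably why only the "valuable cases" are handled.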