Documentation/design-docs/finally-optimizations.md

   1 Finally Optimizations
   2 =====================
   3
   4 In MSIL, a try-finally is a construct where a block of code
   5 (the finally) is guaranteed to be executed after control leaves a
   6 protected region of code (the try) either normally or via an
   7 exception.
   8
   9 In RyuJit a try-finally is currently implemented by transforming the
  10 finally into a local function that is invoked via jitted code at normal
  11 exits from the try block and is invoked via the runtime for exceptional
  12 exits from the try block.
  13
  14 For x86 the local function is simply a part of the method and shares
  15 the same stack frame with the method. For other architectures the
  16 local function is promoted to a potentially separable "funclet"
  17 which is almost like a regular function with a prolog and epilog. A
  18 custom calling convention gives the funclet access to the parent stack
  19 frame.
  20
  21 In this proposal we outline three optimizations for finallys: removing
  22 empty trys, removing empty finallys and finally cloning.
  23
  24 Empty Finally Removal
  25 ---------------------
  26
  27 An empty finally is one that has no observable effect. These often
  28 arise from `foreach` or `using` constructs (which induce a
  29 try-finally) where the cleanup method called in the finally does
  30 nothing. Often, after inlining, the empty finally is readily apparent.
  31
  32 For example, this snippet of C# code
  33 ```C#
  34 static int Sum(List<int> x) {
  35     int sum = 0;
  36     foreach(int i in x) {
  37         sum += i;
  38     }
  39     return sum;
  40 }
  41 ```
  42 produces the following jitted code:
  43 ```asm
  44 ; Successfully inlined Enumerator[Int32][System.Int32]:Dispose():this
  45 ;    (1 IL bytes) (depth 1) [below ALWAYS_INLINE size]
  46 G_M60484_IG01:
  47        55                   push     rbp
  48        57                   push     rdi
  49        56                   push     rsi
  50        4883EC50             sub      rsp, 80
  51        488D6C2460           lea      rbp, [rsp+60H]
  52        488BF1               mov      rsi, rcx
  53        488D7DD0             lea      rdi, [rbp-30H]
  54        B906000000           mov      ecx, 6
  55        33C0                 xor      rax, rax
  56        F3AB                 rep stosd
  57        488BCE               mov      rcx, rsi
  58        488965C0             mov      qword ptr [rbp-40H], rsp
  59
  60 G_M60484_IG02:
  61        33C0                 xor      eax, eax
  62        8945EC               mov      dword ptr [rbp-14H], eax
  63        8B01                 mov      eax, dword ptr [rcx]
  64        8B411C               mov      eax, dword ptr [rcx+28]
  65        33D2                 xor      edx, edx
  66        48894DD0             mov      gword ptr [rbp-30H], rcx
  67        8955D8               mov      dword ptr [rbp-28H], edx
  68        8945DC               mov      dword ptr [rbp-24H], eax
  69        8955E0               mov      dword ptr [rbp-20H], edx
  70
  71 G_M60484_IG03:
  72        488D4DD0             lea      rcx, bword ptr [rbp-30H]
  73        E89B35665B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
  74        85C0                 test     eax, eax
  75        7418                 je       SHORT G_M60484_IG05
  76
  77 ; Body of foreach loop
  78
  79 G_M60484_IG04:
  80        8B4DE0               mov      ecx, dword ptr [rbp-20H]
  81        8B45EC               mov      eax, dword ptr [rbp-14H]
  82        03C1                 add      eax, ecx
  83        8945EC               mov      dword ptr [rbp-14H], eax
  84        488D4DD0             lea      rcx, bword ptr [rbp-30H]
  85        E88335665B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
  86        85C0                 test     eax, eax
  87        75E8                 jne      SHORT G_M60484_IG04
  88
  89 ; Normal exit from the implicit try region created by `foreach`
  90 ; Calls the finally to dispose of the iterator
  91
  92 G_M60484_IG05:
  93        488BCC               mov      rcx, rsp
  94        E80C000000           call     G_M60484_IG09      // call to finally
  95
  96 G_M60484_IG06:
  97        90                   nop
  98
  99 G_M60484_IG07:
 100        8B45EC               mov      eax, dword ptr [rbp-14H]
 101
 102 G_M60484_IG08:
 103        488D65F0             lea      rsp, [rbp-10H]
 104        5E                   pop      rsi
 105        5F                   pop      rdi
 106        5D                   pop      rbp
 107        C3                   ret
 108
 109 ; Finally funclet. Note it simply sets up and then tears down a stack
 110 ; frame. The dispose method was inlined and is empty.
 111
 112 G_M60484_IG09:
 113        55                   push     rbp
 114        57                   push     rdi
 115        56                   push     rsi
 116        4883EC30             sub      rsp, 48
 117        488B6920             mov      rbp, qword ptr [rcx+32]
 118        48896C2420           mov      qword ptr [rsp+20H], rbp
 119        488D6D60             lea      rbp, [rbp+60H]
 120
 121 G_M60484_IG10:
 122        4883C430             add      rsp, 48
 123        5E                   pop      rsi
 124        5F                   pop      rdi
 125        5D                   pop      rbp
 126        C3                   ret
 127 ```
 128
 129 In such cases the try-finally can be removed, leading to code like the following:
 130 ```asm
 131 G_M60484_IG01:
 132        57                   push     rdi
 133        56                   push     rsi
 134        4883EC38             sub      rsp, 56
 135        488BF1               mov      rsi, rcx
 136        488D7C2420           lea      rdi, [rsp+20H]
 137        B906000000           mov      ecx, 6
 138        33C0                 xor      rax, rax
 139        F3AB                 rep stosd
 140        488BCE               mov      rcx, rsi
 141
 142 G_M60484_IG02:
 143        33F6                 xor      esi, esi
 144        8B01                 mov      eax, dword ptr [rcx]
 145        8B411C               mov      eax, dword ptr [rcx+28]
 146        48894C2420           mov      gword ptr [rsp+20H], rcx
 147        89742428             mov      dword ptr [rsp+28H], esi
 148        8944242C             mov      dword ptr [rsp+2CH], eax
 149        89742430             mov      dword ptr [rsp+30H], esi
 150
 151 G_M60484_IG03:
 152        488D4C2420           lea      rcx, bword ptr [rsp+20H]
 153        E8A435685B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
 154        85C0                 test     eax, eax
 155        7414                 je       SHORT G_M60484_IG05
 156
 157 G_M60484_IG04:
 158        8B4C2430             mov      ecx, dword ptr [rsp+30H]
 159        03F1                 add      esi, ecx
 160        488D4C2420           lea      rcx, bword ptr [rsp+20H]
 161        E89035685B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
 162        85C0                 test     eax, eax
 163        75EC                 jne      SHORT G_M60484_IG04
 164
 165 G_M60484_IG05:
 166        8BC6                 mov      eax, esi
 167
 168 G_M60484_IG06:
 169        4883C438             add      rsp, 56
 170        5E                   pop      rsi
 171        5F                   pop      rdi
 172        C3                   ret
 173 ```
 174
 175 Empty finally removal is unconditionally profitable: it should always
 176 reduce code size and improve code speed.
 177
 178 Empty Try Removal
 179 ---------------------
 180
 181 If the try region of a try-finally is empty, and the jitted code will
 182 execute on a runtime that does not protect finally execution from
 183 thread abort, then the try-finally can be replaced with just the
 184 content of the finally.
 185
 186 Empty trys with non-empty finallys often exist in code that must run
 187 under both thread-abort aware and non-thread-abort aware runtimes. In
 188 the former case the placement of cleanup code in the finally ensures
 189 that the cleanup code will execute fully. But if thread abort is not
 190 possible, the extra protection offered by the finally is not needed.
 191
 192 Empty try removal looks for try-finallys where the try region does
 193 nothing except invoke the finally. There are currently two different
 194 EH implementation models, so the try screening has two cases:
 195
 196 * callfinally thunks (x64/arm64): the try must be a single empty
 197 basic block that always jumps to a callfinally that is the first
 198 half of a callfinally/always pair;
 199 * non-callfinally thunks (x86/arm32): the try must be a
 200 callfinally/always pair where the first block is an empty callfinally.
 201
 202 The screening then verifies that the callfinally identified above is
 203 the only callfinally for the try. No other callfinallys are expected
 204 because this try cannot have multiple leaves and its handler cannot be
 205 reached by nested exit paths.
 206
 207 When the empty try is identified, the jit modifies the
 208 callfinally/always pair to branch to the handler, modifies the
 209 handler's return to branch directly to the continuation (the
 210 branch target of the second half of the callfinally/always pair),
 211 updates various status flags on the blocks, and then removes the
 212 try-finally region.
 213
 214 Finally Cloning
 215 ---------------
 216
 217 Finally cloning is an optimization where the jit duplicates the code
 218 in the finally for one or more of the normal exit paths from the try,
 219 and has those exit points branch to the duplicated code directly,
 220 rather than calling the finally.  This transformation allows for
 221 improved performance and optimization of the common case where the try
 222 completes without an exception.
 223
 224 Finally cloning also allows hot/cold splitting of finally bodies: the
 225 cloned finally code covers the normal try exit paths (the hot cases)
 226 and can be placed in the main method region, and the original finally,
 227 now used largely or exclusively for exceptional cases (the cold cases)
 228 spilt off into the cold code region. Without cloning, RyuJit
 229 would always treat the finally as cold code.
 230
 231 Finally cloning will increase code size, though often the size
 232 increase is mitigated somewhat by more compact code generation in the
 233 try body and streamlined invocation of the cloned finallys.
 234
 235 Try-finally regions may have multiple normal exit points. For example
 236 the following `try` has two: one at the `return 3` and one at the try
 237 region end:
 238
 239 ```C#
 240 try {
 241    if (p) return 3;
 242    ...
 243 }
 244 finally {
 245    ...
 246 }
 247 return 4;
 248 ```
 249
 250 Here the finally must be executed no matter how the try exits. So
 251 there are to two normal exit paths from the try, both of which pass
 252 through the finally but which then diverge. The fact that some try
 253 regions can have multiple exits opens the potential for substantial
 254 code growth from finally cloning, and so leads to a choice point in
 255 the implementation:
 256
 257 * Share the clone along all exit paths
 258 * Share the clone along some exit paths
 259 * Clone along all exit paths
 260 * Clone along some exit paths
 261 * Only clone along one exit path
 262 * Only clone when there is one exit path
 263
 264 The shared clone option must essentially recreate or simulate the
 265 local call mechanism for the finally, though likely somewhat more
 266 efficiently. Each exit point must designate where control should
 267 resume once the shared finally has finished.  For instance the jit
 268 could introduce a new local per try-finally to determine where the
 269 cloned finally should resume, and enumerate the possibilities using a
 270 small integer. The end of the cloned finally would then use a switch
 271 to determine what code to execute next. This has the downside of
 272 introducing unrealizable paths into the control flow graph.
 273
 274 Cloning along all exit paths can potentially lead to large amounts of
 275 code growth.
 276
 277 Cloning along some paths or only one path implies that some normal
 278 exit paths won't be as well optimized. Nonetheless cloning along one
 279 path was the choice made by JIT64 and the one we recommend for
 280 implementation. In particular we suggest only cloning along the end of
 281 try region exit path, so that any early exit will continue to invoke
 282 the funclet for finally cleanup (unless that exit happens to have the
 283 same post-finally continuation as the end try region exit, in which
 284 case it can simply jump to the cloned finally).
 285
 286 One can imagine adaptive strategies. The size of the finally can
 287 be roughly estimated and the number of clones needed for full cloning
 288 readily computed. Selective cloning can be based on profile
 289 feedback or other similar mechanisms for choosing the profitable
 290 cases.
 291
 292 The current implementation will clone the finally and retarget the
 293 last (largest IL offset) leave in the try region to the clone. Any
 294 other leave that ultimately transfers control to the same post-finally
 295 offset will also be modified to jump to the clone.
 296
 297 Empirical studies have shown that most finallys are small. Thus to
 298 avoid excessive code growth, a crude size estimate is formed by
 299 counting the number of statements in the blocks that make up the
 300 finally. Any finally larger that 15 statements is not cloned. In our
 301 study this disqualifed about 0.5% of all finallys from cloning.
 302
 303 ### EH Nesting Considerations
 304
 305 Finally cloning is also more complicated when the finally encloses
 306 other EH regions, since the clone will introduce copies of all these
 307 regions. While it is possible to implement cloning in such cases we
 308 propose to defer for now.
 309
 310 Finally cloning is also a bit more complicated if the finally is
 311 enclosed by another finally region, so we likewise propose deferring
 312 support for this.  (Seems like a rare enough thing but maybe not too
 313 hard to handle -- though possibly not worth it if we're not going to
 314 support the enclosing case).
 315
 316 ### Control-Flow and Other Considerations
 317
 318 If the try never exits normally, then the finally can only be invoked
 319 in exceptional cases. There is no benefit to cloning since the cloned
 320 finally would be unreachable. We can detect a subset of such cases
 321 because there will be no call finally blocks.
 322
 323 JIT64 does not clone finallys that contained switch. We propose to
 324 do likewise. (Initially I did not include this restriction but
 325 hit a failing test case where the finally contained a switch. Might be
 326 worth a deeper look, though such cases are presumably rare.)
 327
 328 If the finally never exits normally, then we presume it is cold code,
 329 and so will not clone.
 330
 331 If the finally is marked as run rarely, we will not clone.
 332
 333 Implementation Proposal
 334 -----------------------
 335
 336 We propose that empty finally removal and finally cloning be run back
 337 to back, spliced into the phase list just after fgInline and
 338 fgAddInternal, and just before implicit by-ref and struct
 339 promotion. We want to run these early before a lot of structural
 340 invariants regarding EH are put in place, and before most
 341 other optimization, but run them after inlining
 342 (so empty finallys can be more readily identified) and after the
 343 addition of implicit try-finallys created by the jit.  Empty finallys
 344 may arise later because of optimization, but this seems relatively
 345 uncommon.
 346
 347 We will remove empty finallys first, then clone.
 348
 349 Neither optimization will run when the jit is generating debuggable
 350 code or operating in min opts mode.
 351
 352 ### Empty Finally Removal (Sketch)
 353
 354 Skip over methods that have no EH, are compiled with min opts, or
 355 where the jit is generating debuggable code.
 356
 357 Walk the handler table, looking for try-finally (we could also look
 358 for and remove try-faults with empty faults, but those are presumably
 359 rare).
 360
 361 If the finally is a single block and contains only a `retfilter`
 362 statement, then:
 363
 364 * Retarget the callfinally(s) to jump always to the continuation blocks.
 365 * Remove the paired jump always block(s) (note we expect all finally
 366 calls to be paired since the empty finally returns).
 367 * For funclet EH models with finally target bits, clear the finally
 368 target from the continuations.
 369 * For non-funclet EH models only, clear out the GT_END_LFIN statement
 370 in the finally continuations.
 371 * Remove the handler block.
 372 * Reparent all directly contained try blocks to the enclosing try region
 373 or to the method region if there is no enclosing try.
 374 * Remove the try-finally from the EH table via `fgRemoveEHTableEntry`.
 375
 376 After the walk, if any empty finallys were removed, revalidate the
 377 integrity of the handler table.
 378
 379 ### Finally Cloning (Sketch)
 380
 381 Skip over all methods, if the runtime suports thread abort. More on
 382 this below.
 383
 384 Skip over methods that have no EH, are compiled with min opts, or
 385 where the jit is generating debuggable code.
 386
 387 Walk the handler table, looking for try-finally. If the finally is
 388 enclosed in a handler or encloses another handler, skip.
 389
 390 Walk the finally body blocks. If any is BBJ_SWITCH, or if none
 391 is BBJ_EHFINALLYRET, skip cloning. If all blocks are RunRarely
 392 skip cloning. If the finally has more that 15 statements, skip
 393 cloning.
 394
 395 Walk the try region from back to front (from largest to smallest IL
 396 offset). Find the last block in the try that invokes the finally. That
 397 will be the path that will invoke the clone.
 398
 399 If the EH model requires callfinally thunks, and there are multiple
 400 thunks that invoke the finally, and the callfinally thunk along the
 401 clone path is not the first, move it to the front (this helps avoid
 402 extra jumps).
 403
 404 Set the insertion point to just after the callfinally in the path (for
 405 thunk models) or the end of the try (for non-thunk models).  Set up a
 406 block map. Clone the finally body using `fgNewBBinRegion` and
 407 `fgNewBBafter` to make the first and subsequent blocks, and
 408 `CloneBlockState` to fill in the block contents. Clear the handler
 409 region on the cloned blocks. Bail out if cloning fails. Mark the first
 410 and last cloned blocks with appropriate BBF flags. Patch up inter-clone
 411 branches and convert the returns into jumps to the continuation.
 412
 413 Walk the callfinallys, retargeting the ones that return to the
 414 continuation so that they invoke the clone. Remove the paired always
 415 blocks. Clear the finally target bit and any GT_END_LFIN from the
 416 continuation.
 417
 418 If all call finallys are converted, modify the region to be try/fault
 419 (interally EH_HANDLER_FAULT_WAS_FINALLY, so we can distinguish it
 420 later from "organic" try/faults).  Otherwise leave it as a
 421 try/finally.
 422
 423 Clear the catch type on the clone entry.
 424
 425 ### Thread Abort
 426
 427 For runtimes that support thread abort (desktop), more work is
 428 required:
 429
 430 * The cloned finally must be reported to the runtime. Likely this
 431 can trigger off of the BBF_CLONED_FINALLY_BEGIN/END flags.
 432 * The jit must maintain the integrity of the clone by not losing
 433 track of the blocks involved, and not allowing code to move in our
 434 out of the cloned region
 435
 436 Code Size Impact
 437 ----------------
 438
 439 Code size impact from finally cloning was measured for CoreCLR on
 440 Windows x64.
 441
 442 ```
 443 Total bytes of diff: 16158 (0.12 % of base)
 444     diff is a regression.
 445 Total byte diff includes 0 bytes from reconciling methods
 446         Base had    0 unique methods,        0 unique bytes
 447         Diff had    0 unique methods,        0 unique bytes
 448 Top file regressions by size (bytes):
 449         3518 : Microsoft.CodeAnalysis.CSharp.dasm (0.16 % of base)
 450         1895 : System.Linq.Expressions.dasm (0.32 % of base)
 451         1626 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.07 % of base)
 452         1428 : System.Threading.Tasks.Parallel.dasm (4.66 % of base)
 453         1248 : System.Linq.Parallel.dasm (0.20 % of base)
 454 Top file improvements by size (bytes):
 455        -4529 : System.Private.CoreLib.dasm (-0.14 % of base)
 456         -975 : System.Reflection.Metadata.dasm (-0.28 % of base)
 457         -239 : System.Private.Uri.dasm (-0.27 % of base)
 458         -104 : System.Runtime.InteropServices.RuntimeInformation.dasm (-3.36 % of base)
 459          -99 : System.Security.Cryptography.Encoding.dasm (-0.61 % of base)
 460 57 total files with size differences.
 461 Top method regessions by size (bytes):
 462          645 : System.Diagnostics.Process.dasm - System.Diagnostics.Process:StartCore(ref):bool:this
 463          454 : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:AdjustCallArgumentsForParams(ref,ref,ref,ref,ref,byref):this
 464          447 : System.Threading.Tasks.Dataflow.dasm - System.Threading.Tasks.Dataflow.Internal.SpscTargetCore`1[__Canon][System.__Canon]:ProcessMessagesLoopCore():this
 465          421 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.ImplementsHelper:FindExplicitlyImplementedMember(ref,ref,ref,ref,ref,ref,byref):ref
 466          358 : System.Private.CoreLib.dasm - System.Threading.TimerQueueTimer:Change(int,int):bool:this
 467 Top method improvements by size (bytes):
 468        -2512 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
 469         -824 : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.PeWriter:WriteHeaders(ref,ref,ref,ref,byref):this
 470         -663 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
 471         -627 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.ManifestBuilder:CreateManifestString():ref:this
 472         -546 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long):int:this (67 methods)
 473 3014 total methods with size differences.
 474 ```
 475
 476 The largest growth is seen in `Process:StartCore`, which has 4
 477 try-finally constructs.
 478
 479 Diffs generally show improved codegen in the try bodies with cloned
 480 finallys. However some of this improvement comes from more aggressive
 481 use of callee save registers, and this causes size inflation in the
 482 funclets (note finally cloning does not alter the number of
 483 funclets). So if funclet save/restore could be contained to registers
 484 used in the funclet, the size impact would be slightly smaller.
 485
 486 There are also some instances where cloning relatively small finallys
 487 leads to large code size increases. xxx is one example.