4 In MSIL, a try-finally is a construct where a block of code
5 (the finally) is guaranteed to be executed after control leaves a
6 protected region of code (the try) either normally or via an
9 In RyuJit a try-finally is currently implemented by transforming the
10 finally into a local function that is invoked via jitted code at normal
11 exits from the try block and is invoked via the runtime for exceptional
12 exits from the try block.
14 For x86 the local function is simply a part of the method and shares
15 the same stack frame with the method. For other architectures the
16 local function is promoted to a potentially separable "funclet"
17 which is almost like a regular function with a prolog and epilog. A
18 custom calling convention gives the funclet access to the parent stack
21 In this proposal we outline three optimizations for finallys: removing
22 empty trys, removing empty finallys and finally cloning.
27 An empty finally is one that has no observable effect. These often
28 arise from `foreach` or `using` constructs (which induce a
29 try-finally) where the cleanup method called in the finally does
30 nothing. Often, after inlining, the empty finally is readily apparent.
32 For example, this snippet of C# code
34 static int Sum(List<int> x) {
42 produces the following jitted code:
44 ; Successfully inlined Enumerator[Int32][System.Int32]:Dispose():this
45 ; (1 IL bytes) (depth 1) [below ALWAYS_INLINE size]
51 488D6C2460 lea rbp, [rsp+60H]
53 488D7DD0 lea rdi, [rbp-30H]
58 488965C0 mov qword ptr [rbp-40H], rsp
62 8945EC mov dword ptr [rbp-14H], eax
63 8B01 mov eax, dword ptr [rcx]
64 8B411C mov eax, dword ptr [rcx+28]
66 48894DD0 mov gword ptr [rbp-30H], rcx
67 8955D8 mov dword ptr [rbp-28H], edx
68 8945DC mov dword ptr [rbp-24H], eax
69 8955E0 mov dword ptr [rbp-20H], edx
72 488D4DD0 lea rcx, bword ptr [rbp-30H]
73 E89B35665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
75 7418 je SHORT G_M60484_IG05
77 ; Body of foreach loop
80 8B4DE0 mov ecx, dword ptr [rbp-20H]
81 8B45EC mov eax, dword ptr [rbp-14H]
83 8945EC mov dword ptr [rbp-14H], eax
84 488D4DD0 lea rcx, bword ptr [rbp-30H]
85 E88335665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
87 75E8 jne SHORT G_M60484_IG04
89 ; Normal exit from the implicit try region created by `foreach`
90 ; Calls the finally to dispose of the iterator
94 E80C000000 call G_M60484_IG09 // call to finally
100 8B45EC mov eax, dword ptr [rbp-14H]
103 488D65F0 lea rsp, [rbp-10H]
109 ; Finally funclet. Note it simply sets up and then tears down a stack
110 ; frame. The dispose method was inlined and is empty.
117 488B6920 mov rbp, qword ptr [rcx+32]
118 48896C2420 mov qword ptr [rsp+20H], rbp
119 488D6D60 lea rbp, [rbp+60H]
129 In such cases the try-finally can be removed, leading to code like the following:
136 488D7C2420 lea rdi, [rsp+20H]
137 B906000000 mov ecx, 6
144 8B01 mov eax, dword ptr [rcx]
145 8B411C mov eax, dword ptr [rcx+28]
146 48894C2420 mov gword ptr [rsp+20H], rcx
147 89742428 mov dword ptr [rsp+28H], esi
148 8944242C mov dword ptr [rsp+2CH], eax
149 89742430 mov dword ptr [rsp+30H], esi
152 488D4C2420 lea rcx, bword ptr [rsp+20H]
153 E8A435685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
155 7414 je SHORT G_M60484_IG05
158 8B4C2430 mov ecx, dword ptr [rsp+30H]
160 488D4C2420 lea rcx, bword ptr [rsp+20H]
161 E89035685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
163 75EC jne SHORT G_M60484_IG04
175 Empty finally removal is unconditionally profitable: it should always
176 reduce code size and improve code speed.
179 ---------------------
181 If the try region of a try-finally is empty, and the jitted code will
182 execute on a runtime that does not protect finally execution from
183 thread abort, then the try-finally can be replaced with just the
184 content of the finally.
186 Empty trys with non-empty finallys often exist in code that must run
187 under both thread-abort aware and non-thread-abort aware runtimes. In
188 the former case the placement of cleanup code in the finally ensures
189 that the cleanup code will execute fully. But if thread abort is not
190 possible, the extra protection offered by the finally is not needed.
192 Empty try removal looks for try-finallys where the try region does
193 nothing except invoke the finally. There are currently two different
194 EH implementation models, so the try screening has two cases:
196 * callfinally thunks (x64/arm64): the try must be a single empty
197 basic block that always jumps to a callfinally that is the first
198 half of a callfinally/always pair;
199 * non-callfinally thunks (x86/arm32): the try must be a
200 callfinally/always pair where the first block is an empty callfinally.
202 The screening then verifies that the callfinally identified above is
203 the only callfinally for the try. No other callfinallys are expected
204 because this try cannot have multiple leaves and its handler cannot be
205 reached by nested exit paths.
207 When the empty try is identified, the jit modifies the
208 callfinally/always pair to branch to the handler, modifies the
209 handler's return to branch directly to the continuation (the
210 branch target of the second half of the callfinally/always pair),
211 updates various status flags on the blocks, and then removes the
217 Finally cloning is an optimization where the jit duplicates the code
218 in the finally for one or more of the normal exit paths from the try,
219 and has those exit points branch to the duplicated code directly,
220 rather than calling the finally. This transformation allows for
221 improved performance and optimization of the common case where the try
222 completes without an exception.
224 Finally cloning also allows hot/cold splitting of finally bodies: the
225 cloned finally code covers the normal try exit paths (the hot cases)
226 and can be placed in the main method region, and the original finally,
227 now used largely or exclusively for exceptional cases (the cold cases)
228 spilt off into the cold code region. Without cloning, RyuJit
229 would always treat the finally as cold code.
231 Finally cloning will increase code size, though often the size
232 increase is mitigated somewhat by more compact code generation in the
233 try body and streamlined invocation of the cloned finallys.
235 Try-finally regions may have multiple normal exit points. For example
236 the following `try` has two: one at the `return 3` and one at the try
250 Here the finally must be executed no matter how the try exits. So
251 there are to two normal exit paths from the try, both of which pass
252 through the finally but which then diverge. The fact that some try
253 regions can have multiple exits opens the potential for substantial
254 code growth from finally cloning, and so leads to a choice point in
257 * Share the clone along all exit paths
258 * Share the clone along some exit paths
259 * Clone along all exit paths
260 * Clone along some exit paths
261 * Only clone along one exit path
262 * Only clone when there is one exit path
264 The shared clone option must essentially recreate or simulate the
265 local call mechanism for the finally, though likely somewhat more
266 efficiently. Each exit point must designate where control should
267 resume once the shared finally has finished. For instance the jit
268 could introduce a new local per try-finally to determine where the
269 cloned finally should resume, and enumerate the possibilities using a
270 small integer. The end of the cloned finally would then use a switch
271 to determine what code to execute next. This has the downside of
272 introducing unrealizable paths into the control flow graph.
274 Cloning along all exit paths can potentially lead to large amounts of
277 Cloning along some paths or only one path implies that some normal
278 exit paths won't be as well optimized. Nonetheless cloning along one
279 path was the choice made by JIT64 and the one we recommend for
280 implementation. In particular we suggest only cloning along the end of
281 try region exit path, so that any early exit will continue to invoke
282 the funclet for finally cleanup (unless that exit happens to have the
283 same post-finally continuation as the end try region exit, in which
284 case it can simply jump to the cloned finally).
286 One can imagine adaptive strategies. The size of the finally can
287 be roughly estimated and the number of clones needed for full cloning
288 readily computed. Selective cloning can be based on profile
289 feedback or other similar mechanisms for choosing the profitable
292 The current implementation will clone the finally and retarget the
293 last (largest IL offset) leave in the try region to the clone. Any
294 other leave that ultimately transfers control to the same post-finally
295 offset will also be modified to jump to the clone.
297 Empirical studies have shown that most finallys are small. Thus to
298 avoid excessive code growth, a crude size estimate is formed by
299 counting the number of statements in the blocks that make up the
300 finally. Any finally larger that 15 statements is not cloned. In our
301 study this disqualifed about 0.5% of all finallys from cloning.
303 ### EH Nesting Considerations
305 Finally cloning is also more complicated when the finally encloses
306 other EH regions, since the clone will introduce copies of all these
307 regions. While it is possible to implement cloning in such cases we
308 propose to defer for now.
310 Finally cloning is also a bit more complicated if the finally is
311 enclosed by another finally region, so we likewise propose deferring
312 support for this. (Seems like a rare enough thing but maybe not too
313 hard to handle -- though possibly not worth it if we're not going to
314 support the enclosing case).
316 ### Control-Flow and Other Considerations
318 If the try never exits normally, then the finally can only be invoked
319 in exceptional cases. There is no benefit to cloning since the cloned
320 finally would be unreachable. We can detect a subset of such cases
321 because there will be no call finally blocks.
323 JIT64 does not clone finallys that contained switch. We propose to
324 do likewise. (Initially I did not include this restriction but
325 hit a failing test case where the finally contained a switch. Might be
326 worth a deeper look, though such cases are presumably rare.)
328 If the finally never exits normally, then we presume it is cold code,
329 and so will not clone.
331 If the finally is marked as run rarely, we will not clone.
333 Implementation Proposal
334 -----------------------
336 We propose that empty finally removal and finally cloning be run back
337 to back, spliced into the phase list just after fgInline and
338 fgAddInternal, and just before implicit by-ref and struct
339 promotion. We want to run these early before a lot of structural
340 invariants regarding EH are put in place, and before most
341 other optimization, but run them after inlining
342 (so empty finallys can be more readily identified) and after the
343 addition of implicit try-finallys created by the jit. Empty finallys
344 may arise later because of optimization, but this seems relatively
347 We will remove empty finallys first, then clone.
349 Neither optimization will run when the jit is generating debuggable
350 code or operating in min opts mode.
352 ### Empty Finally Removal (Sketch)
354 Skip over methods that have no EH, are compiled with min opts, or
355 where the jit is generating debuggable code.
357 Walk the handler table, looking for try-finally (we could also look
358 for and remove try-faults with empty faults, but those are presumably
361 If the finally is a single block and contains only a `retfilter`
364 * Retarget the callfinally(s) to jump always to the continuation blocks.
365 * Remove the paired jump always block(s) (note we expect all finally
366 calls to be paired since the empty finally returns).
367 * For funclet EH models with finally target bits, clear the finally
368 target from the continuations.
369 * For non-funclet EH models only, clear out the GT_END_LFIN statement
370 in the finally continuations.
371 * Remove the handler block.
372 * Reparent all directly contained try blocks to the enclosing try region
373 or to the method region if there is no enclosing try.
374 * Remove the try-finally from the EH table via `fgRemoveEHTableEntry`.
376 After the walk, if any empty finallys were removed, revalidate the
377 integrity of the handler table.
379 ### Finally Cloning (Sketch)
381 Skip over all methods, if the runtime suports thread abort. More on
384 Skip over methods that have no EH, are compiled with min opts, or
385 where the jit is generating debuggable code.
387 Walk the handler table, looking for try-finally. If the finally is
388 enclosed in a handler or encloses another handler, skip.
390 Walk the finally body blocks. If any is BBJ_SWITCH, or if none
391 is BBJ_EHFINALLYRET, skip cloning. If all blocks are RunRarely
392 skip cloning. If the finally has more that 15 statements, skip
395 Walk the try region from back to front (from largest to smallest IL
396 offset). Find the last block in the try that invokes the finally. That
397 will be the path that will invoke the clone.
399 If the EH model requires callfinally thunks, and there are multiple
400 thunks that invoke the finally, and the callfinally thunk along the
401 clone path is not the first, move it to the front (this helps avoid
404 Set the insertion point to just after the callfinally in the path (for
405 thunk models) or the end of the try (for non-thunk models). Set up a
406 block map. Clone the finally body using `fgNewBBinRegion` and
407 `fgNewBBafter` to make the first and subsequent blocks, and
408 `CloneBlockState` to fill in the block contents. Clear the handler
409 region on the cloned blocks. Bail out if cloning fails. Mark the first
410 and last cloned blocks with appropriate BBF flags. Patch up inter-clone
411 branches and convert the returns into jumps to the continuation.
413 Walk the callfinallys, retargeting the ones that return to the
414 continuation so that they invoke the clone. Remove the paired always
415 blocks. Clear the finally target bit and any GT_END_LFIN from the
418 If all call finallys are converted, modify the region to be try/fault
419 (interally EH_HANDLER_FAULT_WAS_FINALLY, so we can distinguish it
420 later from "organic" try/faults). Otherwise leave it as a
423 Clear the catch type on the clone entry.
427 For runtimes that support thread abort (desktop), more work is
430 * The cloned finally must be reported to the runtime. Likely this
431 can trigger off of the BBF_CLONED_FINALLY_BEGIN/END flags.
432 * The jit must maintain the integrity of the clone by not losing
433 track of the blocks involved, and not allowing code to move in our
434 out of the cloned region
439 Code size impact from finally cloning was measured for CoreCLR on
443 Total bytes of diff: 16158 (0.12 % of base)
444 diff is a regression.
445 Total byte diff includes 0 bytes from reconciling methods
446 Base had 0 unique methods, 0 unique bytes
447 Diff had 0 unique methods, 0 unique bytes
448 Top file regressions by size (bytes):
449 3518 : Microsoft.CodeAnalysis.CSharp.dasm (0.16 % of base)
450 1895 : System.Linq.Expressions.dasm (0.32 % of base)
451 1626 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.07 % of base)
452 1428 : System.Threading.Tasks.Parallel.dasm (4.66 % of base)
453 1248 : System.Linq.Parallel.dasm (0.20 % of base)
454 Top file improvements by size (bytes):
455 -4529 : System.Private.CoreLib.dasm (-0.14 % of base)
456 -975 : System.Reflection.Metadata.dasm (-0.28 % of base)
457 -239 : System.Private.Uri.dasm (-0.27 % of base)
458 -104 : System.Runtime.InteropServices.RuntimeInformation.dasm (-3.36 % of base)
459 -99 : System.Security.Cryptography.Encoding.dasm (-0.61 % of base)
460 57 total files with size differences.
461 Top method regessions by size (bytes):
462 645 : System.Diagnostics.Process.dasm - System.Diagnostics.Process:StartCore(ref):bool:this
463 454 : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:AdjustCallArgumentsForParams(ref,ref,ref,ref,ref,byref):this
464 447 : System.Threading.Tasks.Dataflow.dasm - System.Threading.Tasks.Dataflow.Internal.SpscTargetCore`1[__Canon][System.__Canon]:ProcessMessagesLoopCore():this
465 421 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.ImplementsHelper:FindExplicitlyImplementedMember(ref,ref,ref,ref,ref,ref,byref):ref
466 358 : System.Private.CoreLib.dasm - System.Threading.TimerQueueTimer:Change(int,int):bool:this
467 Top method improvements by size (bytes):
468 -2512 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
469 -824 : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.PeWriter:WriteHeaders(ref,ref,ref,ref,byref):this
470 -663 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
471 -627 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.ManifestBuilder:CreateManifestString():ref:this
472 -546 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long):int:this (67 methods)
473 3014 total methods with size differences.
476 The largest growth is seen in `Process:StartCore`, which has 4
477 try-finally constructs.
479 Diffs generally show improved codegen in the try bodies with cloned
480 finallys. However some of this improvement comes from more aggressive
481 use of callee save registers, and this causes size inflation in the
482 funclets (note finally cloning does not alter the number of
483 funclets). So if funclet save/restore could be contained to registers
484 used in the funclet, the size impact would be slightly smaller.
486 There are also some instances where cloning relatively small finallys
487 leads to large code size increases. xxx is one example.