# Inline Size Estimates

Note: this is a work in progress....
Inlining is a heuristic-based optimization. There are some cases where
it's obvious that a particular inline is good or bad for performance,
but more typically, things are not that clear. In those cases where
the benefit is in doubt, the compiler will rely on heuristics, typically
based on various estimates of an inline's size impact, speed impact,
or other important factors.

In this writeup we consider approaches to what should be the simplest
of these estimates: the size impact of an inline.
## Background

There are a number of interesting facets to the inline size estimate
problem, but let's start with some generalities. Suppose we have some
size estimate for the caller `C` (ignoring for the moment exactly what
we are measuring), say `CallerSize`, and some size estimate for the
callee `E`, say `CalleeSize`. Now suppose we are contemplating what
would happen if `E` is inlined into `C` to create `C'`. We'd like some
sort of size estimate `CallerSize'`. The simplest estimate is that:

```
(1)  CallerSize' = CallerSize + CalleeSize
```
However, calling conventions impose some additional code overhead on
both the caller and callee. The caller must set up arguments in
registers or on the stack, and if there is a return value, might need
to move it or store it somewhere. It also might need to spill values
that are stored in caller-saved registers around the call.

The callee must set up a stack frame and possibly spill callee-save
registers it intends to use, and then ultimately must undo all this.

When we inline `E` into `C`, none of this calling convention setup is
necessary. So one can imagine that there is some additional size
savings from inlining, one that depends on the calling convention and
the number and kind of arguments passed from caller to callee:
```
(2)  CallerSize' = CallerSize + CalleeSize - Overhead
```
Note that it's entirely possible that `Overhead > CalleeSize`, so
that `CallerSize' < CallerSize`; that is, the inline not only results
in faster code but also in smaller code. Indeed this state of affairs
is increasingly common with the advent of modern programming styles
that emphasize building functionality out of lots of small procedures.
Alternatively, we can compute the size impact as the size change of `C`
because of the inline of `E`:

```
(3)  SizeImpact = CallerSize' - CallerSize
```

or, with a bit of math,

```
(4)  SizeImpact = CalleeSize - Overhead
```

so a negative `SizeImpact` means the inline made the caller smaller.
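
As a quick sanity check on the arithmetic, the simple model can be sketched in a few lines of Python. The sizes and the `overhead` value below are illustrative assumptions, not measured data:

```python
# Sketch of the simple size model. The overhead value is an
# illustrative assumption, not a measured calling-convention cost.

def caller_size_after_inline(caller_size, callee_size, overhead):
    # Equation (2): CallerSize' = CallerSize + CalleeSize - Overhead
    return caller_size + callee_size - overhead

def size_impact(caller_size, callee_size, overhead):
    # Change in caller size due to the inline; a negative value means
    # the caller got smaller.
    return caller_size_after_inline(caller_size, callee_size, overhead) - caller_size

# Hypothetical numbers: a 6-byte callee whose call-site overhead is 15 bytes.
print(size_impact(caller_size=72, callee_size=6, overhead=15))  # -9: caller shrinks
```

Under this model the caller's size after the inline depends only on the callee's size and the fixed per-call overhead, which is exactly the assumption the data below puts to the test.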
Now let's look at some actual data to see how well this simple model
for `SizeImpact` holds up. The data below shows the measured size
impact of a single inline into various calling methods in mscorlib.
This data was obtained by first compiling all the methods without
inlining to obtain a baseline size as measured by the encoded
instruction size in bytes. The inliner was then modified so it would
only do a single inline per caller, with the candidate chosen by a
parameter `k` that could be varied externally. mscorlib was then
repeatedly compiled with varying values of `k`, and the inlines
performed and the subsequent modified caller sizes recorded. This data
was then permuted and histogrammed to show the variety of `SizeImpact`
values for a fixed `E`. Since the calling convention and the number and
kind of parameters are fixed across all observations for a fixed `E`,
the simple model above would predict `SizeImpact` to be the same in
every case.

In this first example, the inlinee `E` is `System.Object:.ctor()`, which is
close to the simplest possible callee -- it takes one argument and has
no return value. In all the observed cases the `SizeImpact` is
negative, but it's far from the same in every case.

```
Inlining for System.Object:.ctor():this (size 6)
Instances 1160 SizeImpact Mean -12.53 Min -52 Max -5

[-50,-41]:    0    1    1    0    0    1    0    0    1    1
[-40,-31]:    0    1    0    0    1    0    0    3    1    2
[-30,-21]:    1    3    2    2    1    8   76    5   22   18
[-20,-11]:   38   23   98    7   29    4    8   27   14   21
[-10, -1]:    1  362  373    0    0    3    0    0    0    0
[  0,  9]:    0    0    0    0    0    0    0    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
It should be evident from this set of observations that `SizeImpact`
cannot be completely characterized by a simple formula like `(2)`. For
this inlinee, inlining always saves at least 5 bytes; roughly half the
time it saves 8 or 9 bytes; on average it saves about 12.5 bytes; and
it often saves considerably more.
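
A sketch of the bucketing used for these distributions, in case the row layout above is unclear. The sample observations here are invented for illustration, not drawn from the mscorlib data:

```python
# Bucket SizeImpact observations into rows of ten one-byte bins,
# mirroring the histogram layout used in this document.
from collections import Counter

def histogram_rows(impacts, lo=-50, hi=49):
    counts = Counter(impacts)
    rows = []
    for row_start in range(lo, hi + 1, 10):
        # one row covers the inclusive range [row_start, row_start + 9]
        bins = [counts.get(v, 0) for v in range(row_start, row_start + 10)]
        rows.append((row_start, row_start + 9, bins))
    return rows

for lo_b, hi_b, bins in histogram_rows([-9, -8, -8, 0, 0, 0]):
    print(f"[{lo_b:3},{hi_b:3}]: " + " ".join(f"{n:4}" for n in bins))
```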
Other inlinees show similar spreads in `SizeImpact`. Here we see a case
where sometimes the `SizeImpact` is negative and other times it's
positive:

```
Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
Instances 42 SizeImpact Mean 11.33 Min -20 Max 28

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    1    0    0    0    0    0    0    1    0    0
[-10, -1]:    0    2    0    0    0    0    0    0    1    2
[  0,  9]:    0    0    1    1    0    2    0    1    4    1
[ 10, 19]:    1    0    1    0    4    1    2    1    2    6
[ 20, 29]:    0    0    2    0    2    0    0    0    3    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
Not all inlinees exhibit such wide distributions of `SizeImpact`. Some
spread over just a few values:

```
Inlining for System.Environment:GetResourceString(ref):ref (size 15)
Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    0    0    0    0    0    0    0    0    0    0
[-10, -1]:    0    0    0    0    0    0    0    7    0    7
[  0,  9]: 2212    3    0    5    0    0    4    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
And some show no variation in `SizeImpact` at all:

```
Inlining for System.DateTime:get_Ticks():long:this (size 15)
Instances 129 SizeImpact Mean 0.00 Min 0 Max 0

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    0    0    0    0    0    0    0    0    0    0
[-10, -1]:    0    0    0    0    0    0    0    0    0    0
[  0,  9]:  129    0    0    0    0    0    0    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
So it appears there must be other -- perhaps many -- factors that
influence the `SizeImpact` of a particular inline. The question is:
what factors are important, how do we measure them, and how do they
combine to influence the overall result? The remainder of this
document will look into some of the ways we can answer this question.
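
One natural shape for an answer is a model that combines measurable call-site features into a single estimate. The sketch below is purely hypothetical: the feature names and weights are invented for illustration, not derived from the data above, and finding good weights is exactly the hard part:

```python
# Hypothetical linear model: SizeImpact estimated as a weighted sum of
# call-site features. All names and weights here are invented.

def estimate_size_impact(features, weights, bias):
    return bias + sum(weights[name] * value for name, value in features.items())

features = {"callee_il_size": 6, "arg_count": 1, "has_return_value": 0}
weights = {"callee_il_size": 0.5, "arg_count": -3.0, "has_return_value": 2.0}
print(estimate_size_impact(features, weights, bias=-10.0))  # -10.0
```

The histograms above make clear that any such model needs features beyond callee size alone, since a fixed callee can produce wildly different `SizeImpact` values.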
Before we move on, however, we should address what kind of information
will actually be available to observe and use in building a heuristic.
First, it's not realistic to feed the heuristic on the actual encoded
size of methods. While compiling a method `C`, obtaining the encoded
size of `C` is problematic. It's also quite likely that the encoded
size of the inlinee `E` is unknown, except possibly in some AOT
scenarios where one could generally arrange for methods to be compiled
in bottom-up order (that is, callees before callers). While one could
imagine spending compile time doing experimental compilations of `C`'s
and `E`'s to see what code size results, this is probably too costly
to be practical.
Second, even if we could obtain the actual size of prospective inline
candidates, we might not want to use this data. The final code
sequence emitted by the compiler depends intimately on details of the
target architecture, runtime conventions (ABIs), and capabilities of
the compiler phases that run after inlining. If we allow feedback into
the heuristics by incorporating data from these "downstream" sources,
we introduce various forms of coupling that have important
consequences. For example, it would be challenging to work on a
downstream phase (say, register allocation) if some trivial change to
allocation in an inlinee `E` perturbs inlining decisions for other
methods like `C`. Likewise we may or may not want to allow runtime
conventions to influence inlining -- if our application can run
cross-platform (say, on both Windows and Linux) we might not want the
various versions to exhibit vastly different inlining patterns.
For that reason, we intend to restrict the information we can use in
forming heuristics to things that are generally true of the caller and
callee, and the general capabilities of the downstream phases. For the
caller and callee this largely means information that can be obtained
by inspecting either the input to the compiler, or the compiler's IR
in the stages leading up to inlining, and perhaps (if available) some
carefully chosen information obtained from actual compilation of the
methods.
At the same time, we still intend for the predictions made by the
heuristics to be indicative of the final binary size. We have no other
reliable source of truth to guide us.

Given all this, we can restate the central question as: how best can
the compiler estimate the ultimate size impact of an inline while
restricting itself to considering features that are generally true of
caller, callee, and the capabilities of the compiler?
## Building a Heuristic Manually

The tried-and-true approach is to build the heuristic manually. There
are two prongs to the approach: the first is case analysis of actual
behavior, and the second is modelling based on the compiler writer's
experience and intuition.
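
To caricature the manual approach before diving into cases: observed behavior gets hard-coded as rules, with the compiler writer's judgment filling in the thresholds. In the sketch below, the `-9` echoes the typical savings observed above for `System.Object:.ctor()`, but everything else (the trivial-ctor rule, the per-argument overhead guess) is invented for illustration:

```python
# Caricature of a hand-built SizeImpact heuristic: rules distilled
# from case analysis plus invented fallback constants.

def manual_size_impact_estimate(callee_byte_size, arg_count, is_trivial_ctor):
    if is_trivial_ctor:
        # Rule from case analysis: tiny constructors reliably shrink
        # the caller (typical savings observed above were 8-9 bytes).
        return -9
    # Fallback to the simple model with an invented per-argument
    # calling-convention overhead guess.
    overhead_guess = 5 + 2 * arg_count
    return callee_byte_size - overhead_guess

print(manual_size_impact_estimate(6, 1, True))    # -9
print(manual_size_impact_estimate(29, 1, False))  # 22
```

The weakness of this style is already visible in the data: a single rule per callee shape cannot explain the wide spread of observed values.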
### Some Case Analysis

#### Case 1: Maximal Savings
It's always instructive to look at actual behavior, so let's consider
some of the cases. From the table above for `System.Object:.ctor()` we
see a number of cases where there were large savings in byte size. The
most significant of these is the following:

```
Inline System.Object:.ctor():this
into   System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
```
where `System.Object:.ctor():this` is simply:

```
; System.Object:.ctor():this
; Total bytes of code 6, prolog size 5
```

As an aside, one might wonder why there are 5 bytes of nops here --
they are put there to support the *rejit* feature used by the
Application Insights framework.
The caller's source code shows it simply delegates immediately to
another method to initialize the object:

```
UnmanagedMemoryAccessor(
   SafeBuffer buffer, Int64 offset,
   Int64 capacity, FileAccess access) {
   Initialize(buffer, offset, capacity, access);
}
```
Here's the code before the inline:

```
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 72, prolog size 10

       E800000000           call     System.Object:.ctor():this
       448B742470           mov      r14d, dword ptr [rsp+70H]
       4489742470           mov      dword ptr [rsp+70H], r14d
       488D0500000000       lea      rax, [(reloc)]
```
and here's the code after the inline:

```
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 24, prolog size 5

       8B442428             mov      eax, dword ptr [rsp+28H]
       89442428             mov      dword ptr [rsp+28H], eax
       488D0500000000       lea      rax, [(reloc)]
```
In both cases this method ends up tail calling to `Initialize` (via
the `rex.jmp`), passing it the exact set of arguments that was passed
to the method.

In the before case, the un-inlined call changes the method from a leaf
to a non-leaf. Leaf functions don't need to set up their own frame,
while non-leaves do. So some of the extra code in the prolog and
epilog is the frame setup. Beyond that, however, the un-inlined callee
forces the compiler to move the arguments that will ultimately be
passed to the tail-called method out of the volatile registers and
into preserved registers before the call, then back again to the
proper argument registers after the call. And in order to use the
callee-saved registers, these registers must be pushed in the prolog
and popped in the epilog.
In the after case, no frame setup is required and no registers need to
be preserved across the call, so no registers need saving and
restoring. The two `mov`s that remain appear to be gratuitous. The
first `nop` is there for rejit padding. Not sure why the second one is
there.
So we can now see why this particular case has such extreme
`SizeImpact`: the un-inlined call triggers frame creation and a fair
amount of shuffling to accommodate the potential side effects of the
call.
#### Case 2: Typical Savings
In the data above, the inline of `System.Object:.ctor()` typically
saves either 8 or 9 bytes of code. Let's look at one such case.

```
Inline System.Object:.ctor():this
into   System.Security.SecurityElement:.ctor():this
```

Here are before and after listings for the caller:
```
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 15, prolog size 5

       488D0500000000       lea      rax, [(reloc)]
```

```
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 6, prolog size 5
```
In this case the call site was initially handled via a tail call that
required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all
that's left is a single byte `ret`. This case covers 154 of the 187
instances where inlining `System.Object:.ctor()` saved 9 bytes.
### Cases 3, 4: Size May Decrease or Increase
Now let's look at a set of examples where an inline can either
decrease or increase the caller size. Here's the good case:

```
Inline System.IntPtr:op_Explicit(long):long
into   System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
```

Here's the callee code:
(Note: the callee size reported above is wrong -- there are several
overloaded versions, and the script picked up the wrong one. The true
callee is only 9 bytes long:)

```
; System.IntPtr:op_Explicit(long):long
; Total bytes of code 9, prolog size 5
```

At any rate, the `SizeImpact` is still correct.
Here's the before caller code:

```
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 50, prolog size 12

       488D6C2430           lea      rbp, [rsp+30H]
       E800000000           call     System.IntPtr:op_Explicit(long):long
       488D0500000000       lea      rax, [(reloc)]
       488D65F0             lea      rsp, [rbp-10H]
       48FFE0               rex.jmp  rax        // tail call WriteInt64
```
and the after caller code:

```
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 22, prolog size 10

       488D6C2420           lea      rbp, [rsp+20H]
       E800000000           call     System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
       488D6500             lea      rsp, [rbp]
```
Somewhat confusingly, the inline has stopped the jit from making a
tail call, and the size savings should be even greater than they
are. There appears to be some kind of code generation issue within the
jit. The IL for the callee is:

```
IL_0000  0f 00             ldarga.s     0x0
IL_0002  7b 8f 02 00 04    ldfld        0x400028F
```

and the JIT rejects the tail call because:

```
Rejecting tail call late for call [000008]: Local address taken
```

so presumably the `ldarga.s` in the inlinee is causing trouble.
Now let's see if that same inlinee can cause a size increase. Here's a
case where it does:

(NOTE: this is calling a different overload..., need to get this
sorted out.)

```
Inline System.IntPtr:op_Explicit(long):long [35]
into   EventData:get_DataPointer():long:this [18] size 35 delta 17
```
```
; EventData:get_DataPointer():long:this
; Total bytes of code 18, prolog size

       488B09               mov      rcx, qword ptr [rcx]
       488D0500000000       lea      rax, [(reloc)]
```

```
; EventData:get_DataPointer():long:this
; Total bytes of code 35, prolog size 5

       488B11               mov      rdx, qword ptr [rcx]
       48894C2420           mov      qword ptr [rsp+20H], rcx
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       E800000000           call     System.IntPtr:.ctor(long):this
       488B442420           mov      rax, qword ptr [rsp+20H]
```
Here the un-inlined case made a tail call. The inlined case was unable
to do the same optimization for some reason, though it looks like it
should have been able to do so as well.