# Inline Size Estimates

Note: this is a work in progress....
Inlining is a heuristic-based optimization. There are some cases where
it's obvious that a particular inline is good or bad for performance,
but more typically, things are not that clear. In those cases where
the benefit is in doubt, the compiler will rely on heuristics, typically
based on various estimates of an inline's size impact, speed impact,
or other important factors.

In this writeup we consider approaches to what should be the simplest
of these estimates: the size impact of an inline.
## Background

There are a number of interesting facets to the inline size estimate
problem, but let's start with some generalities. Suppose we have some
size estimate for the caller `C` (ignoring for the moment exactly what
we are measuring), say `CallerSize`, and some size estimate for the
callee `E`, say `CalleeSize`. Now suppose we are contemplating what
would happen if `E` is inlined into `C` to create `C'`. We'd like some
sort of size estimate `CallerSize'`. The simplest estimate is that:

```
(1)  CallerSize' = CallerSize + CalleeSize
```
However, calling conventions impose some additional code overhead on
both the caller and callee. The caller must set up arguments in
registers or on the stack, and if there is a return value, might need
to move it or store it somewhere. It also might need to spill values
that are stored in caller-saved registers around the call.

The callee must set up a stack frame and possibly spill callee-save
registers it intends to use, and then ultimately must undo all this.

When we inline `E` into `C`, none of this calling convention setup is
necessary. So one can imagine that there is some additional size
savings from inlining, one that depends on the calling convention and
the number and kind of arguments passed from caller to callee:
```
(2)  CallerSize' = CallerSize + CalleeSize - Overhead
```
Note that it's entirely possible that `Overhead > CalleeSize`, so
that `CallerSize' < CallerSize`; that is, the inline not only results
in faster code but also in smaller code. Indeed this state of affairs
is increasingly common with the advent of modern programming styles
that emphasize building functionality out of lots of small procedures.
Alternatively, we can compute the size impact as the size change of `C`
because of the inline of `E`:

```
(3)  SizeImpact = CallerSize' - CallerSize
```

or, with a bit of math,

```
(4)  SizeImpact = CalleeSize - Overhead
```

so a negative `SizeImpact` means the inline made the caller smaller.
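
As a quick sanity check on the arithmetic, the simple model can be sketched in a few lines of Python. The sizes and the `overhead` value below are illustrative assumptions, not measured data:

```python
# Sketch of the simple size model. The overhead value is an
# illustrative assumption, not a measured calling-convention cost.

def caller_size_after_inline(caller_size, callee_size, overhead):
    # Equation (2): CallerSize' = CallerSize + CalleeSize - Overhead
    return caller_size + callee_size - overhead

def size_impact(caller_size, callee_size, overhead):
    # Change in caller size due to the inline; a negative value means
    # the caller got smaller.
    return caller_size_after_inline(caller_size, callee_size, overhead) - caller_size

# Hypothetical numbers: a 6-byte callee whose call-site overhead is 15 bytes.
print(size_impact(caller_size=72, callee_size=6, overhead=15))  # -9: caller shrinks
```

Under this model the caller's size after the inline depends only on the callee's size and the fixed per-call overhead, which is exactly the assumption the data below puts to the test.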
Now let's look at some actual data to see how well this simple model
for `SizeImpact` holds up. The data below shows the measured size
impact of a single inline into various calling methods in mscorlib.
This data was obtained by first compiling all the methods without
inlining to obtain a baseline size as measured by the encoded
instruction size in bytes. The inliner was then modified so it would
only do a single inline per caller, with the candidate chosen by a
parameter `k` that could be varied externally. mscorlib was then
repeatedly compiled with varying values of `k`, and the inlines
performed and the subsequent modified caller sizes recorded. This data
was then permuted and histogrammed to show the variety of `SizeImpact`
values for a fixed `E`. Since the calling convention and the number and
kind of parameters are fixed across all observations for a fixed `E`,
the simple model above would predict `SizeImpact` to be the same in
every case.

In this first example, the inlinee `E` is `System.Object:.ctor()`, which is
close to the simplest possible callee -- it takes one argument and has
no return value. In all the observed cases the `SizeImpact` is
negative, but it's far from the same in every case.

```
Inlining for System.Object:.ctor():this (size 6)
Instances 1160 SizeImpact Mean -12.53 Min -52 Max -5

[-50,-41]:    0    1    1    0    0    1    0    0    1    1
[-40,-31]:    0    1    0    0    1    0    0    3    1    2
[-30,-21]:    1    3    2    2    1    8   76    5   22   18
[-20,-11]:   38   23   98    7   29    4    8   27   14   21
[-10, -1]:    1  362  373    0    0    3    0    0    0    0
[  0,  9]:    0    0    0    0    0    0    0    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
It should be evident from this set of observations that `SizeImpact`
cannot be completely characterized by a simple formula like `(2)`. For
this inlinee, inlining always saves at least 5 bytes; roughly half the
time it saves 8 or 9 bytes; on average it saves about 12.5 bytes; and
it often saves considerably more.
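
A sketch of the bucketing used for these distributions, in case the row layout above is unclear. The sample observations here are invented for illustration, not drawn from the mscorlib data:

```python
# Bucket SizeImpact observations into rows of ten one-byte bins,
# mirroring the histogram layout used in this document.
from collections import Counter

def histogram_rows(impacts, lo=-50, hi=49):
    counts = Counter(impacts)
    rows = []
    for row_start in range(lo, hi + 1, 10):
        # one row covers the inclusive range [row_start, row_start + 9]
        bins = [counts.get(v, 0) for v in range(row_start, row_start + 10)]
        rows.append((row_start, row_start + 9, bins))
    return rows

for lo_b, hi_b, bins in histogram_rows([-9, -8, -8, 0, 0, 0]):
    print(f"[{lo_b:3},{hi_b:3}]: " + " ".join(f"{n:4}" for n in bins))
```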
Other inlinees show similar spreads in `SizeImpact`. Here we see a case
where sometimes the `SizeImpact` is negative and other times it's
positive:

```
Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
Instances 42 SizeImpact Mean 11.33 Min -20 Max 28

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    1    0    0    0    0    0    0    1    0    0
[-10, -1]:    0    2    0    0    0    0    0    0    1    2
[  0,  9]:    0    0    1    1    0    2    0    1    4    1
[ 10, 19]:    1    0    1    0    4    1    2    1    2    6
[ 20, 29]:    0    0    2    0    2    0    0    0    3    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
Not all inlinees exhibit such wide distributions of `SizeImpact`. Some
spread over just a few values:

```
Inlining for System.Environment:GetResourceString(ref):ref (size 15)
Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    0    0    0    0    0    0    0    0    0    0
[-10, -1]:    0    0    0    0    0    0    0    7    0    7
[  0,  9]: 2212    3    0    5    0    0    4    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
And some show no variation in `SizeImpact` at all:

```
Inlining for System.DateTime:get_Ticks():long:this (size 15)
Instances 129 SizeImpact Mean 0.00 Min 0 Max 0

[-50,-41]:    0    0    0    0    0    0    0    0    0    0
[-40,-31]:    0    0    0    0    0    0    0    0    0    0
[-30,-21]:    0    0    0    0    0    0    0    0    0    0
[-20,-11]:    0    0    0    0    0    0    0    0    0    0
[-10, -1]:    0    0    0    0    0    0    0    0    0    0
[  0,  9]:  129    0    0    0    0    0    0    0    0    0
[ 10, 19]:    0    0    0    0    0    0    0    0    0    0
[ 20, 29]:    0    0    0    0    0    0    0    0    0    0
[ 30, 39]:    0    0    0    0    0    0    0    0    0    0
[ 40, 49]:    0    0    0    0    0    0    0    0    0    0
```
So it appears there must be other -- perhaps many -- factors that
influence the `SizeImpact` of a particular inline. The question is:
what factors are important, how do we measure them, and how do they
combine to influence the overall result? The remainder of this
document will look into some of the ways we can answer this question.
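
One natural shape for an answer is a model that combines measurable call-site features into a single estimate. The sketch below is purely hypothetical: the feature names and weights are invented for illustration, not derived from the data above, and finding good weights is exactly the hard part:

```python
# Hypothetical linear model: SizeImpact estimated as a weighted sum of
# call-site features. All names and weights here are invented.

def estimate_size_impact(features, weights, bias):
    return bias + sum(weights[name] * value for name, value in features.items())

features = {"callee_il_size": 6, "arg_count": 1, "has_return_value": 0}
weights = {"callee_il_size": 0.5, "arg_count": -3.0, "has_return_value": 2.0}
print(estimate_size_impact(features, weights, bias=-10.0))  # -10.0
```

The histograms above make clear that any such model needs features beyond callee size alone, since a fixed callee can produce wildly different `SizeImpact` values.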
Before we move on, however, we should address what kind of information
will actually be available to observe and use in building a heuristic.
First, it's not realistic to feed the heuristic on the actual encoded
size of methods. While compiling a method `C`, obtaining the encoded
size of `C` is problematic. It's also quite likely that the encoded
size of the inlinee `E` is unknown, except possibly in some AOT
scenarios where one could generally arrange for methods to be compiled
in bottom-up order (that is, callees before callers). While one could
imagine spending compile time doing experimental compilations of `C`'s
and `E`'s to see what code size results, this is probably too costly
to be practical.
Second, even if we could obtain the actual size of prospective inline
candidates, we might not want to use this data. The final code
sequence emitted by the compiler depends intimately on details of the
target architecture, runtime conventions (ABIs), and capabilities of
the compiler phases that run after inlining. If we allow feedback into
the heuristics by incorporating data from these "downstream" sources,
we introduce various forms of coupling that have important
consequences. For example, it would be challenging to work on a
downstream phase (say, register allocation) if some trivial change to
allocation in an inlinee `E` perturbs inlining decisions for other
methods like `C`. Likewise we may or may not want to allow runtime
conventions to influence inlining -- if our application can run
cross-platform (say, on both Windows and Linux) we might not want the
various versions to exhibit vastly different inlining patterns.
For that reason, we intend to restrict the information we can use in
forming heuristics to things that are generally true of the caller and
callee, and the general capabilities of the downstream phases. For the
caller and callee this largely means information that can be obtained
by inspecting either the input to the compiler, or the compiler's IR
in the stages leading up to inlining, and perhaps (if available) some
carefully chosen information obtained from actual compilation of the
methods.
At the same time, we still intend for the predictions made by the
heuristics to be indicative of the final binary size. We have no other
reliable source of truth to guide us.

Given all this, we can restate the central question as: how best can
the compiler estimate the ultimate size impact of an inline while
restricting itself to considering features that are generally true of
caller, callee, and the capabilities of the compiler?
## Building a Heuristic Manually

The tried-and-true approach is to build the heuristic manually. There
are two prongs to the approach: the first is case analysis of actual
behavior, and the second is modelling based on the compiler writer's
experience and intuition.
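
To caricature the manual approach before diving into cases: observed behavior gets hard-coded as rules, with the compiler writer's judgment filling in the thresholds. In the sketch below, the `-9` echoes the typical savings observed above for `System.Object:.ctor()`, but everything else (the trivial-ctor rule, the per-argument overhead guess) is invented for illustration:

```python
# Caricature of a hand-built SizeImpact heuristic: rules distilled
# from case analysis plus invented fallback constants.

def manual_size_impact_estimate(callee_byte_size, arg_count, is_trivial_ctor):
    if is_trivial_ctor:
        # Rule from case analysis: tiny constructors reliably shrink
        # the caller (typical savings observed above were 8-9 bytes).
        return -9
    # Fallback to the simple model with an invented per-argument
    # calling-convention overhead guess.
    overhead_guess = 5 + 2 * arg_count
    return callee_byte_size - overhead_guess

print(manual_size_impact_estimate(6, 1, True))    # -9
print(manual_size_impact_estimate(29, 1, False))  # 22
```

The weakness of this style is already visible in the data: a single rule per callee shape cannot explain the wide spread of observed values.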
### Some Case Analysis

#### Case 1: Maximal Savings
It's always instructive to look at actual behavior, so let's consider
some of the cases. From the table above for `System.Object:.ctor()` we
see a number of cases where there were large savings in byte size. The
most significant of these is the following:

```
Inline System.Object:.ctor():this
into   System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
```
where `System.Object:.ctor():this` is simply:

```
; System.Object:.ctor():this
; Total bytes of code 6, prolog size 5
```

As an aside, one might wonder why there are 5 bytes of nops here --
they are put there to support the *rejit* feature used by the
Application Insights framework.
The caller's source code shows it simply delegates immediately to
another method to initialize the object:

```
UnmanagedMemoryAccessor(
   SafeBuffer buffer, Int64 offset,
   Int64 capacity, FileAccess access) {
   Initialize(buffer, offset, capacity, access);
}
```
Here's the code before the inline:

```
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 72, prolog size 10

       E800000000           call     System.Object:.ctor():this
       448B742470           mov      r14d, dword ptr [rsp+70H]
       4489742470           mov      dword ptr [rsp+70H], r14d
       488D0500000000       lea      rax, [(reloc)]
```
and here's the code after the inline:

```
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 24, prolog size 5

       8B442428             mov      eax, dword ptr [rsp+28H]
       89442428             mov      dword ptr [rsp+28H], eax
       488D0500000000       lea      rax, [(reloc)]
```
In both cases this method ends up tail calling to `Initialize` (via
the `rex.jmp`), passing it the exact set of arguments that was passed
to the method.

In the before case, the un-inlined call changes the method from a leaf
to a non-leaf. Leaf functions don't need to set up their own frame,
while non-leaves do. So some of the extra code in the prolog and
epilog is the frame setup. Beyond that, however, the un-inlined callee
forces the compiler to move the arguments that will ultimately be
passed to the tail-called method out of the volatile registers and
into preserved registers before the call, then back again to the
proper argument registers after the call. And in order to use the
callee-saved registers, these registers must be pushed in the prolog
and popped in the epilog.
In the after case, no frame setup is required and no registers need to
be preserved across the call, so no registers need saving and
restoring. The two `mov`s that remain appear to be gratuitous. The
first `nop` is there for rejit padding. Not sure why the second one is
there.
So we can now see why this particular case has such extreme
`SizeImpact`: the un-inlined call triggers frame creation and a fair
amount of shuffling to accommodate the potential side effects of the
call.
#### Case 2: Typical Savings
In the data above, the inline of `System.Object:.ctor()` typically
saves either 8 or 9 bytes of code. Let's look at one such case.

```
Inline System.Object:.ctor():this
into   System.Security.SecurityElement:.ctor():this
```

Here are before and after listings for the caller:
```
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 15, prolog size 5

       488D0500000000       lea      rax, [(reloc)]
```

```
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 6, prolog size 5
```
In this case the call site was initially handled via a tail call that
required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all
that's left is a single byte `ret`. This case covers 154 of the 187
instances where inlining `System.Object:.ctor()` saved 9 bytes.
### Cases 3, 4: Size May Decrease or Increase
Now let's look at a set of examples where an inline can either
decrease or increase the caller size. Here's the good case:

```
Inline System.IntPtr:op_Explicit(long):long
into   System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
```

Here's the callee code:
(Note: the callee size reported above is wrong -- there are several
overloaded versions, and the script picked up the wrong one. The true
callee is only 9 bytes long:)

```
; System.IntPtr:op_Explicit(long):long
; Total bytes of code 9, prolog size 5
```

At any rate, the `SizeImpact` is still correct.
Here's the before caller code:

```
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 50, prolog size 12

       488D6C2430           lea      rbp, [rsp+30H]
       E800000000           call     System.IntPtr:op_Explicit(long):long
       488D0500000000       lea      rax, [(reloc)]
       488D65F0             lea      rsp, [rbp-10H]
       48FFE0               rex.jmp  rax        // tail call WriteInt64
```
and the after caller code:

```
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 22, prolog size 10

       488D6C2420           lea      rbp, [rsp+20H]
       E800000000           call     System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
       488D6500             lea      rsp, [rbp]
```
Somewhat confusingly, the inline has stopped the jit from making a
tail call, and the size savings should be even greater than they
are. There appears to be some kind of code generation issue within the
jit. The IL for the callee is:

```
IL_0000  0f 00             ldarga.s     0x0
IL_0002  7b 8f 02 00 04    ldfld        0x400028F
```

and the JIT rejects the tail call because:

```
Rejecting tail call late for call [000008]: Local address taken
```

so presumably the `ldarga.s` in the inlinee is causing trouble.
Now let's see if that same inlinee can cause a size increase. Here's a
case where it does:

(NOTE: this is calling a different overload..., need to get this
sorted out.)

```
Inline System.IntPtr:op_Explicit(long):long [35]
into   EventData:get_DataPointer():long:this [18] size 35 delta 17
```
```
; EventData:get_DataPointer():long:this
; Total bytes of code 18, prolog size

       488B09               mov      rcx, qword ptr [rcx]
       488D0500000000       lea      rax, [(reloc)]
```

```
; EventData:get_DataPointer():long:this
; Total bytes of code 35, prolog size 5

       488B11               mov      rdx, qword ptr [rcx]
       48894C2420           mov      qword ptr [rsp+20H], rcx
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       E800000000           call     System.IntPtr:.ctor(long):this
       488B442420           mov      rax, qword ptr [rsp+20H]
```
Here the un-inlined case made a tail call. The inlined case was unable
to do the same optimization for some reason, though it looks like it
should have been able to do so as well.