Documentation/design-docs/tailcalls-with-helpers.md

   1 # The current way of handling tail-calls
   2 ## Fast tail calls
   3 These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:
   4 * Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive type or reference type fields
   5 * The aligned size of call target arguments is less or equal to aligned size of caller arguments
   6
   7 ## Tail calls using a helper
   8 Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:
   9 * For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.
  10 * At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls JIT_TailCall runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.
  11 * The JIT_TailCall then performs the following:
  12   * It calls RtlVirtualUnwind twice to get the context of the caller of the caller of the tail call to simulate the effect of running epilog of the caller of the tail call and also its return.
  13   * It prepares stack space for the callee stack arguments, a helper explicit TailCallFrame and also a CONTEXT structure where the argument registers of the callee, stack pointer and the target function address are set. In case the tail call caller has enough space for the callee arguments and the TailCallFrame in its stack frame, that space is used directly for the callee arguments. Otherwise the stack arguments area is allocated at the top of the stack. This slightly differs in the case the tail call was called from another tail called function - the TailCallFrame already exists and so it is not recreated. The TailCallFrame also keeps a pointer to the list of gc reference offsets of the arguments and structure return buffer members. The stack walker during GC then uses that to ensure proper GC liveness of those references.
  14   * It calls the copying routine to translate the arguments from the vararg list to the just reserved stack area and the context.
  15   * If the stack arguments and TailCaillFrame didn't fit into the caller's stack frame, these data are now moved to the final location
  16   * RtlRestoreContext is used to start executing the callee.
  17
  18 There are several issues with this approach:
  19 * It is expensive to port to new platforms
  20   * Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored a linear sequence of the parameter data bytes in memory. va_list on Unix is an opaque data type, some of the parameters can be in registers and some in the memory.
  21   * Generating the copying asm routine needs to be done for each target architecture / platform differently. And it is also very complex, error prone and impossible to do on platforms where code generation at runtime is not allowed.
  22 * It is slower than it has to be
  23   * The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller's stack frame.
  24   * RtlRestoreContext restores all registers from the CONTEXT structure, not just a subset of them that is really necessary for the functionality, so it results in another unnecessary memory accesses.
  25 * Stack walking over the stack frames of the tail calls requires runtime assistance.
  26
  27 # The new approach to tail calls using helpers
  28 ## Objectives
  29 The new way of handling tail calls using helpers was designed with the following objectives:
  30 * It should be cheap to port to new platforms, architectures and code generators
  31 * It needs to work in both jitted and AOT compiled scenarios
  32 * It should support platforms where runtime code generation is not possible
  33 * The tail calls should be reasonably fast compared to regular calls with the same arguments
  34 * The tail calls should not be slower than existing mechanism on Windows
  35 * No runtime assistance should be necessary for unwinding stack with tail call frames on it
  36 * The stack should be unwindable at any spot during the tail calls to properly support sampling profilers and similar tools.
  37 * Stack walk during GC must be able to always correctly report GC references.
  38 * It should work in all cases except those where a tail call is not allowed as described in the ECMA 335 standard section III.2.4
  39
  40 ## Requirements
  41 * The code generator needs to be able to compile a tail call to a target as a call to a thunk with the same parameters as the target, but void return, followed by a jump to an assembler helper.
  42
  43 ## Implementation
  44 This section describes the helper functions and data structures that the tail calls use and also describes the tail call sequence step by step.
  45 ### Helper functions
  46 The tail calls use the following thunks and helpers:
  47 * StoreArguments - this thunk stores the arguments into a thread local storage together with the address of the corresponding CallTarget thunk and a descriptor of locations and types of managed references in the stored arguments data. This thunk is generated as IL and compiled by the jitter or AOT compiler. There is one such thunk per tail call target.
  48 It has a signature that is compatible with the tailcall target function except for the return type which is void. But it is not the same. It gets the same arguments as the tail call target function, but it would also get "this" pointer and the generic context as explicit arguments if the tail call target requires them. Arguments of generic reference types are passed as "object" so that the StoreArguments doesn't have to be generic.
  49 * CallTarget - this thunk gets the arguments buffer that was filled by the StoreArguments thunk, loads the arguments from the buffer, releases the buffer and calls the target function using calli. The signature used for the calli would ensure that all arguments including the optional hidden return buffer and the generic context are passed in the right registers / stack slots. Generic reference arguments will be specified as "object" in the signature so that the CallTarget doesn't have to be generic.
  50 The CallTarget is also generated as IL and compiled by the jitter or AOT compiler. There is one such thunk per tail call target. This thunk has the same return type as the tailcall target or return "object" if the return type of the tail call target is a generic reference type.
  51 * TailCallHelper - this is an assembler helper that is responsible for restoring stack pointer to the location where it was when the first function in a tail call chain was entered and then jumping to the CallTarget thunk. This helper is common for all tail call targets.
  52 In a context of each tailcall invocation, the TailCallHelper will be handled by the jitter as if it had the same return type as the tail call target. That means that if the tail call target needs a hidden return buffer for returning structs, the pointer to this buffer will be passed to the TailCallHelper the same way as it would be passed to the tail call target. The TailCallHelper would then pass this hidden argument to the CallTarget helper.
  53 There will be two flavors of this helper, based on whether the tail call target needs a hidden return buffer or not:
  54   * TailCallHelper
  55   * TailCallHelper_RetBuf
  56
  57 ### Helper data structures
  58 The tail calls use the following data structures:
  59 * Thread local storage for arguments. It stores the arguments of a tail call for a short period of time between the StoreArguments and CallTarget calls.
  60 * Arguments GC descriptor - descriptor of locations and types of managed references in the arguments.
  61 * TailCallHelperStack - a per thread stack of helper entries that is used to determine whether a tail call is chained or not. Its entries are allocated as local variables in CallTarget thunks. Each entry contains:
  62   * Stack pointer captured right before a call to a tail call target
  63   * ChainCall flag indicating whether the CallTarget thunk should return after the call to the tail call target or whether it should execute its epilog and jump to TailCallHelper instead. The latter is used by the TailCallHelper to remove the stack frame of the CallTarget before making a tail call from a tail called function.
  64   * Pointer to the next entry on the stack.
  65
  66 ### Tail call sequence
  67 * The caller calls the StoreArguments thunk corresponding to the callee to store the pointer to the tail call target function, its arguments, their GC descriptors, optional "this" and generic context arguments and the corresponding CallTarget thunk address in a thread local storage.
  68 * The caller executes its epilog, restoring stack pointer and callee saved registers to their values when the caller was entered.
  69 * The caller jumps to the TailCallHelper. This function performs the following operations:
  70   * Get the topmost TailCallHelperStack entry for the current thread.
  71   * Check if the previous stack frame is a CallTarget thunk frame by comparing the stack pointer value stored in the TailCallHelperStack entry to the current CFA (call frame address). If it matches, it means that the previous stack frame belongs to a CallTarget thunk and so the tail call caller was also tail called.
  72   * If the previous frame was a CallTarget thunk, its stack frame needs to be removed to ensure that the stack will not grow when tail calls are chained. Set ChainCall flag in the TailCallHelperStack entry and return. That returns control to the CallTarget thunk. It checks the ChainCall flag and since it set, it executes its epilog and jumps to the TailCallHelper again.
  73   * If the previous frame was not a CallTarget thunk, get the address of the CallTarget thunk of the tailcall target from the arguments buffer and jump to it.
  74 * The CallTarget thunk function then does the following operation:
  75   * Create local instance of TailCallHelperStack entry and store the current stack pointer value in it.
  76   * Push the entry to the TailCallHelperStack of the current thread.
  77   * Get the arguments buffer from the thread local storage, extract the regular arguments and the optional "this" and generic context arguments and the target function pointer. Release the buffer and call the target function. The frame of the CallTarget thunk ensures that the arguments of the target are GC protected until the target function returns or tail calls to another function.
  78   * Pop the TailCallHelperStack entry from the TailCallHelperStack of the current thread.
  79   * Check the ChainCall flag in the TailCallHelperStack entry. If it is set, run epilog and jump to the TailCallHelper.
  80   * If the ChainCall flag is clear, it means that the last function in the tail call chain has returned. So return the return value of the target function.
  81
  82 ## Work that needs to be done to implement the new tail calls mechanism
  83 ### JIT (compiler in the AOT scenario)
  84 * Modify compilation of tail calls with helper so that a tail call is compiled as a call to the StoreArguments thunk followed by the jump to the assembler TailCallHelper. In other words, the
  85 ```
  86 tail. call/callvirt <method>
  87 ret
  88 ```
  89 becomes
  90 ```
  91 call/callvirt <StoreArguments thunk>
  92 tail. call <TailCallHelper>
  93 ret
  94 ```
  95 ### Runtime (compiler in the AOT scenario)
  96 * Add generation of the StoreArguments and CallTarget IL thunks to the runtime (compiler tool chain in at AOT scenario). As a possible optimization, In the AOT scenario, the thunks can be generated by the compiler as native code directly without the intermediate IL.
  97 * For the JIT scenario, add a new method to the JIT to EE interface to get the StoreArguments thunk method handle for a given target method and the TailCallHelper address.
  98
  99 ### Runtime in both scenarios
 100 * Add support for the arguments buffer, which means:
 101   * Add functions to create, release and get the buffer for a thread
 102   * Add support for GC scanning the arguments buffers.
 103 * Implement the TailCallHelper asm helper for all architectures
 104 ### Debugging in both scenarios
 105 Ensure that the stepping in a debugger works correctly. In CoreCLR, the TailCallStubManager needs to be updated accordingly.
 106
 107 ## Example code
 108
 109 ```C#
 110 struct S
 111 {
 112     public S(long p1, long p2, long p3)
 113     {
 114         s1 = p1; s2 = p2; s3 = p3;
 115     }
 116
 117     public long s1, s2, s3;
 118 }
 119
 120 struct T
 121 {
 122     public T(S s)
 123     {
 124         t1 = s.s1; t2 = s.s2; t3 = s.s3 t4 = 4;
 125     }
 126     public long t1, t2, t3, t4;
 127 }
 128
 129 struct U
 130 {
 131     public U(T t)
 132     {
 133         u1 = t.t1; u2 = t.t2; u3 = t.t3 u4 = t.t4; u5 = 5;
 134     }
 135     public long u1, u2, u3, u4, u5;
 136 }
 137
 138 int D(U u)
 139 {
 140     int local;
 141     Console.WriteLine("In C, U = [{0}, {1}, {2}, {3}, {4}", u.u1, u.u2, u.u3, u.u4, u.u5);
 142     return 1;
 143 }
 144
 145 int C(T t)
 146 {
 147     int local;
 148     Console.WriteLine("In C");
 149     U args = new U(t);
 150     return tailcall D(args);
 151 }
 152
 153 int B(S s)
 154 {
 155     int local;
 156     Console.WriteLine("In B");
 157     T args = new T(S);
 158     return tailcall C(args);
 159 }
 160
 161 int A()
 162 {
 163     S args = new S(1, 2, 3);
 164     int result = B(args);
 165     Console.WriteLine("Done, result = {0}\n", result);
 166 }
 167 ```
 168
 169 ## Example code execution
 170 This section shows how stack evolves during the execution of the example code above. Execution starts at function A, but the details below start at the interesting point where the first tail call is about to be called.
 171 ### B is about to tail call C
 172 ```
 173 * Return address of A
 174 * Callee saved registers and locals of A
 175 * Stack arguments of B
 176 * Return address of B
 177 * Callee saved registers and locals of B
 178 ```
 179 ### Arguments of C are stored in the thread local buffer, now we are in the TailCallHelper
 180 The callee saved registers and locals of B are not on the stack anymore
 181 ```
 182 * Return address of A
 183 * Callee saved registers and locals of A
 184 * Stack arguments of B
 185 * Return address of B
 186 ```
 187
 188 ### In CallTarget thunk for C, about to call C
 189 The thunk will now extract parameters for C from the thread local storage and call C.
 190 ```
 191 * Return address of A
 192 * Callee saved registers and locals of A
 193 * Stack arguments of B
 194 * Return address of B
 195 * Callee saved registers and locals of CallTarget thunk for C
 196 ```
 197 ### C is about to tail call D
 198 ```
 199 * Return address of A
 200 * Callee saved registers and locals of A
 201 * Stack arguments of B
 202 * Return address of B
 203 * Callee saved registers and locals of CallTarget thunk for C
 204 * Stack arguments of C
 205 * Return address of C
 206 * Callee saved registers and locals of C
 207 ```
 208 ### Arguments of D are stored in the thread local buffer, now we are in the TailCallHelper
 209 The callee saved registers and locals of C are not on the stack anymore.
 210 But we still have the return address of C, stack arguments of C and callee saved registers and locals of CallTarget thunk for C on the stack.
 211 We need to remove them as well to prevent stack growing.
 212 The TailCallHelper detects that the previous stack frame was the frame of the CallTarget thunk for C and so it sets the ChainCall flag in the topmost TailCallHelperStackEntry and returns to CallTarget thunk for C in order to let it cleanup its stack frame.
 213 ```
 214 * Return address of A
 215 * Callee saved registers and locals of A
 216 * Stack arguments of B
 217 * Return address of B
 218 * Callee saved registers and locals of CallTarget thunk for C
 219 * Stack arguments of C
 220 * Return address of C
 221 ```
 222 ### Returned to CallTarget thunk for C with ChainCall flag in the TailCallHelperStackEntry **set**
 223 The thunk checks the ChainCall flag and since it is set, it runs its epilog and then jumps to the TailCallHelper.
 224 ```
 225 * Return address of A
 226 * Callee saved registers and locals of A
 227 * Stack arguments of B
 228 * Return address of B
 229 * Callee saved registers and locals of CallTarget thunk for C
 230 ```
 231 ### Back in TailCallHelper
 232 Now the stack is back in the state where we have made the previous tail call. Since the previous stack frame was not a CallTarget thunk frame, we just jump to the CallTarget thunk for D.
 233 ```
 234 * Return address of A
 235 * Callee saved registers and locals of A
 236 * Stack arguments of B
 237 * Return address of B
 238 ```
 239 ### In CallTarget thunk for D, about to call D
 240 The thunk will now extract parameters for D from the thread local storage and call D.
 241 ```
 242 * Return address of A
 243 * Callee saved registers and locals of A
 244 * Stack arguments of B
 245 * Return address of B
 246 * Callee saved registers and locals of CallTarget thunk for D
 247 ```
 248 ### In D
 249 We are in the last function of the chain, so after it does its work, it returns to its CallTarget thunk.
 250 ```
 251 * Return address of A
 252 * Callee saved registers and locals of A
 253 * Stack arguments of B
 254 * Return address of B
 255 * Callee saved registers and locals of CallTarget thunk for D
 256 * Stack arguments of D
 257 * Return address of D
 258 * Callee saved registers and locals of D
 259 ```
 260 ### Returned to CallTarget thunk for D with ChainCall flag in the TailCallHelperStackEntry **clear**
 261 The thunk checks the ChainCall flag and since it is clear, it recognizes we are now returning from the call chain and so it returns the result of the D.
 262 ```
 263 * Return address of A
 264 * Callee saved registers and locals of A
 265 * Stack arguments of B
 266 * Return address of B
 267 * Callee saved registers and locals of CallTarget thunk for D
 268 ```
 269 ### Returned to A
 270 We are back in A and we have the return value of the call chain.
 271 ```
 272 * Return address of A
 273 * Callee saved registers and locals of A
 274 ```
 275
 276 ## Example of thunks generated for a simple generic method
 277
 278 ```C#
 279 struct Point
 280 {
 281     public int x;
 282     public int y;
 283     public int z;
 284 }
 285
 286 class Foo
 287 {
 288     public Point Test<T>(int x, T t) where T : class
 289     {
 290         Console.WriteLine("T: {0}", typeof(t));
 291         return new Point(x, x, x);
 292     }
 293 }
 294 ```
 295
 296 For tail calling the Test function, the IL helpers could look e.g. as follows:
 297
 298 ```IL
 299 .method public static void StoreArgumentsTest(native int target, object thisObj, native int genericContext, int x, object t) cil managed
 300 {
 301     .maxstack 4
 302     .locals init (
 303        [0] native int buffer,
 304     )
 305
 306     call native int AllocateArgumentBuffer(56)
 307     stloc.0
 308
 309     ldloc.0
 310     ldc.i8 0x12345678 // pointer to the GC descriptor of the arguments buffer
 311     stind.i
 312     ldloc.0
 313     sizeof native int
 314     add
 315     stloc.0
 316
 317     ldloc.0
 318     ldftn Point CallTargetTest()
 319     stind.i
 320     ldloc.0
 321     sizeof native int
 322     add
 323     stloc.0
 324
 325     ldloc.0
 326     ldarg.1 // "thisObj"
 327     stobj object
 328     ldloc.0
 329     sizeof object
 330     add
 331     stloc.0
 332
 333     ldloc.0
 334     ldarg.2 // "genericContext"
 335     stind.i
 336     ldloc.0
 337     sizeof native int
 338     add
 339     stloc.0
 340
 341     ldloc.0
 342     ldarg.3 // "x"
 343     stind.i4
 344     ldloc.0
 345     sizeof native int
 346     add
 347     stloc.0
 348
 349     ldloc.0
 350     ldarg.4 // "t"
 351     stobj object
 352     ldloc.0
 353     sizeof object
 354     add
 355     stloc.0
 356
 357     ldloc.0
 358     ldarg.0 // "target"
 359     stind.i
 360
 361     ret
 362 }
 363
 364 ```
 365
 366 ```IL
 367 .method public static Point CallTargetTest() cil managed
 368 {
 369     .maxstack 4
 370     .locals init (
 371        [0] native int buffer,
 372        [1] valuetype TailCallHelperStackEntry entry,
 373        [2] Point result
 374     )
 375
 376     // Initialize the TailCallHelperStackEntry
 377     // chainCall = false
 378     // sp = current sp
 379
 380     ldloca.s 1
 381     ldc.i4.0
 382     stfld bool TailCallHelperStackEntry::chainCall
 383     ldloca.s 1
 384     call native int GetCurrentSp()
 385     stfld native int TailCallHelperStackEntry::sp
 386
 387     ldloca.s 1 // TailCallHelperStackEntry
 388     call void PushHelperStackEntry(native int)
 389
 390     // Prepare arguments for the tail call target
 391
 392     call native int FetchArgumentBuffer()
 393     sizeof native int
 394     add // skip the pointer to the GC descriptor of the arguments buffer
 395     sizeof native int
 396     add // skip the address of the CallTargetTest in the buffer, it is used by the TailCallHelper only
 397     stloc.0
 398
 399     ldloc.0
 400     ldobj object // this
 401     ldloc.0
 402     sizeof object
 403     add
 404     stloc.0
 405
 406     ldloc.0
 407     ldind.i // generic context
 408     ldloc.0
 409     sizeof native int
 410     add
 411     stloc.0
 412
 413     ldloc.0
 414     ldind.i4 // int x
 415     ldloc.0
 416     sizeof native int
 417     add
 418     stloc.0
 419
 420     ldloc.0
 421     ldobj object // T t
 422     ldloc.0
 423     sizeof object
 424     add
 425
 426     ldobj native int // tailcall target
 427
 428     // The arguments buffer is not needed anymore
 429     call void ReleaseArgumentBuffer()
 430
 431     .try
 432     {
 433         calli Point (object, native int, int32, object) // this, generic context, x, t
 434         stloc.2
 435         leave.s Done
 436     }
 437     finally
 438     {
 439         ldloca.s 1 // TailCallHelperStackEntry
 440         call void PopHelperStackEntry(native int)
 441         endfinally
 442     }
 443
 444 Done:
 445     ldloc.1 // TailCallHelperStackEntry
 446     ldfld bool TailCallHelperStackEntry::chainCall
 447     brfalse.s NotChained
 448
 449     // Jump to the TailCallHelper that will call to the next tail call in the chain.
 450     // The stack frame of the current CallTargetTest is reclaimed and epilog executed
 451     // before the TailCallHelper is entered.
 452     tail.call int32 TailCallHelper()
 453     ret
 454
 455 NotChained:
 456     // Now we are returning from a chain of tail calls to the caller of this chain
 457     ldloc.2
 458     ret
 459 }
 460 ```