From 6112b5c6517449736d5e375df06a0883f1047741 Mon Sep 17 00:00:00 2001
From: Bruce Forstall
Date: Wed, 22 Mar 2017 16:36:53 -0700
Subject: [PATCH] Jump stubs (dotnet/coreclr#10398)

Add document on CLR Jump Stubs

Commit migrated from https://github.com/dotnet/coreclr/commit/56d550d4f8aec2dd40b72a182205d0a2463a1bc9
---
 docs/coreclr/design-docs/jump-stubs.md | 518 +++++++++++++++++++++++++++++++++
 1 file changed, 518 insertions(+)
 create mode 100644 docs/coreclr/design-docs/jump-stubs.md

diff --git a/docs/coreclr/design-docs/jump-stubs.md b/docs/coreclr/design-docs/jump-stubs.md
new file mode 100644
index 0000000..86bf0ac
--- /dev/null
+++ b/docs/coreclr/design-docs/jump-stubs.md
@@ -0,0 +1,518 @@
# Jump Stubs

## Overview

On 64-bit platforms (AMD64 (x64) and ARM64), we have a 64-bit address
space. When the CLR generates code and data addresses, it generally
uses short (less than 64-bit) relative addresses, and attempts to pack all
code and data relatively close together at runtime, to reduce code size. For
example, on x64, the JIT generates 32-bit relative call instruction
sequences, which can refer to a target address +/- 2GB from the source
address, and which are 5 bytes in size: 1 byte for the opcode and 4 bytes
for a 32-bit IP-relative offset (called a rel32 offset). A call sequence
with a full 64-bit target address requires 12 bytes, and in addition
requires a register. Jumps have the same characteristics as calls: there
are rel32 jumps as well.

In case the short relative address is insufficient to address the target
from the source address, we have two options: (1) for data, we must
generate full 64-bit sized addresses; (2) for code, we insert a "jump
stub", so the short relative call or jump targets a "jump stub" which
then jumps directly to the target using a full 64-bit address (and
trashes a register to load that address). Since calls are so common, and
the need for full 64-bit call sequences so rare, this design
drastically improves code size. The need for jump stubs only arises when
jumps of greater than 2GB range (on x64; 128MB on arm64) are required.
This only happens when the amount of code in a process is very large,
such that all the related code can't be packed tightly together, or the
address space is otherwise tightly packed in the range where code is
normally allocated, once again preventing code from being packed
together.

An important issue arises, though: these jump stubs themselves must be
allocated within short relative range of the small call or jump
instruction. If that isn't possible, we encounter a fatal error
condition, since we have no way for the already generated instruction to
reach its intended target.

ARM64 has a similar issue: it has a 28-bit relative branch that is the
preferred branch instruction. The JIT always generates this instruction,
and relies on the VM to generate jump stubs when needed. However, the VM
does not use this form in any of its stubs; it always uses large form
branches. The remainder of this document will only describe the AMD64
case.

This document describes the design and implementation of jump stubs,
their various users, the design of their allocation, and how we can
address the problem of failure to allocate required jump stubs (which in
this document I call "mitigation"), for each case.

## Jump stub creation and management

A jump stub looks like this:
```
mov rax, <8-byte address>
jmp rax
```

It is 12 bytes in size. Note that it trashes the RAX register. Since it
is normally used to interpose on a call instruction, and RAX is a
callee-trashed (volatile) register for amd64 (for both Windows and Linux
/ System V ABI), this is not a problem. For calls with custom calling
conventions, like profiler hooks, the VM is careful not to use jump
stubs that might interfere with those conventions.
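
For illustration, here is a minimal sketch of how such a stub could be
written into a reserved 12-byte slot. The `EmitJumpStub` helper name and the
raw byte emission are assumptions for this sketch, not the actual VM emitter;
the byte layout is simply the standard x64 encoding of the two instructions
above.

```
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the actual VM emitter): write the 12-byte jump stub
// shown above into a reserved, executable 12-byte slot.
static void EmitJumpStub(uint8_t* stub, const void* target)
{
    stub[0] = 0x48;                             // REX.W prefix
    stub[1] = 0xB8;                             // mov rax, imm64
    uint64_t addr = (uint64_t)(uintptr_t)target;
    memcpy(&stub[2], &addr, sizeof(addr));      // bytes 2..9: the 8-byte absolute target
    stub[10] = 0xFF;                            // jmp rax
    stub[11] = 0xE0;
}
```

The real code must also ensure the slot is executable and handle any required
instruction cache maintenance; those details are omitted here.
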
Jump stub creation goes through the function `rel32UsingJumpStub()`. It
takes the address of the rel32 data and the target address, computes the
offset from the source to the target address, and returns this offset.
Note that the source, or "base", address is the address of the rel32
data plus 4 bytes. This follows from the rules of the x86/x64
instruction set: the "base" address for computing a branch offset is the
instruction pointer value (the address) of the following instruction,
which is the rel32 address plus 4.

If the offset doesn't fit, it computes the allowed address range (e.g.,
[low ... high]) where a jump stub must be located to create a legal
rel32 offset, and calls `ExecutionManager::jumpStub()` to create or find
an appropriate jump stub.

Jump stubs are allocated in the loader heap associated with a particular
use: either the `LoaderCodeHeap` for normal code, or the `HostCodeHeap`
for DynamicMethod / LCG functions. Dynamic methods cannot share jump
stubs, to support unloading individual methods and reclaiming their
memory. For normal code, jump stubs are reused. In fact, we maintain a
hash table mapping from jump stub target to the jump stub itself, and
look up in this table to find a jump stub to reuse.

If there is no space left for a jump stub in any existing code heap
in the correct range, the VM attempts to create a new code heap in the
range required by the new jump stub, using the function
`ClrVirtualAllocWithinRange()`. This function walks the acceptable address
space range, using OS virtual memory query/allocation APIs, to find and
allocate a new block of memory in the acceptable range. If this function
can't find and allocate space in the required range, we have, on AMD64,
one more fallback: if an emergency jump stub reserve was created using
the `COMPlus_NGenReserveForjumpStubs` configuration (see below), we
attempt to find an appropriate, in range, allocation from that emergency
pool. If all attempts fail to create an allocation in the appropriate
range, we encounter a fatal error (and tear down the process), with a
distinguished "out of memory within range" message (using the
`ThrowOutOfMemoryWithinRange()` function).
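
The following sketch restates the range computation described above in code.
The function names and the simplifications (no overflow handling at the very
top of the address space, no heap bookkeeping) are illustrative assumptions,
not the actual `rel32UsingJumpStub()` / `ExecutionManager::jumpStub()`
implementation.

```
#include <cstdint>

// "rel32Addr" is the address of the 4-byte rel32 field; the branch base is
// that address plus 4 (the address of the following instruction).
struct StubRange { uint64_t low; uint64_t high; };

static bool FitsInRel32(uint64_t rel32Addr, uint64_t target, int32_t* offset)
{
    uint64_t base  = rel32Addr + 4;
    int64_t  delta = (int64_t)(target - base);   // fine for canonical addresses
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;
    *offset = (int32_t)delta;
    return true;
}

// If the target is out of range, any jump stub we redirect to must itself lie
// within rel32 range of the call site, i.e., inside [low, high].
static StubRange RangeForJumpStub(uint64_t rel32Addr)
{
    uint64_t base = rel32Addr + 4;
    StubRange r;
    r.low  = (base >= 0x80000000ull) ? base - 0x80000000ull : 0;
    r.high = base + 0x7FFFFFFFull;   // ignores wrap at the very top of the address space
    return r;
}
```

If `FitsInRel32` fails, the VM looks for an existing stub, or allocates a new
one, inside the `[low, high]` range and points the rel32 at that stub instead.
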
## Jump stub allocation failure mitigation

Several strategies have already been created to attempt to lessen the
occurrence of jump stub allocation failure. The following CLR
configuration variables are relevant (these can be set in the registry
as well as the environment, as usual):

* `COMPlus_CodeHeapReserveForJumpStubs`. This value specifies a percentage
of every code heap to reserve for jump stubs. When a non-jump stub
allocation in the code heap would eat into the reserved percentage, a
new code heap is allocated instead, leaving some buffer in the existing
code heap. The default value is 2.
* `COMPlus_NGenReserveForjumpStubs`. This value, when non-zero, creates an
"emergency jump stub reserve". For each NGEN image loaded, an emergency
jump stub reserve space is calculated by multiplying this number, as a
percentage, against the loaded native image size. This amount of space
is allocated, within rel32 range of the NGEN image. The allocation
granularity for these emergency code heaps exceeds the specific
requirement, but multiple NGEN images can share the same emergency jump
stub heap if it is in range. If an emergency jump stub space
can't be allocated, the failure is ignored (hopefully in this case any
required jump stub will be able to be allocated somewhere else). When
looking to allocate jump stubs, the normal mechanisms for finding jump
stub space are followed, and only if they fail to find appropriate space
are the emergency jump stub reserve heaps tried. The default value is
zero.
* `COMPlus_BreakOnOutOfMemoryWithinRange`. When set to 1, this breaks into
the debugger when the specific jump stub allocation failure condition
occurs.

The `COMPlus_NGenReserveForjumpStubs` mitigation is described publicly
here:
https://support.microsoft.com/en-us/help/3152158/out-of-memory-exception-in-a-managed-application-that-s-running-on-the-64-bit-.net-framework.
(It also mentions, in passing, `COMPlus_CodeHeapReserveForJumpStubs`, but
only to say not to use it.)

## Jump stubs and the JIT

As the JIT generates code on AMD64, it starts by generating all data and
code addresses as rel32 IP-relative offsets. At the end of code
generation, the JIT determines how much code will be generated, and
requests buffers from the VM to hold the generated artifacts: a buffer
for the "hot" code, a buffer for the "cold" code (only used in the case
of hot/cold splitting during NGEN), and a buffer for the read-only data
(see `ICorJitInfo::allocMem()`). The VM finds allocation space in either
existing code heaps or newly created code heaps to satisfy this
request. It is only at this point that the actual addresses where the
generated code will live are known. Note that the JIT has finalized the
exact generated code sequences in the function before calling
`allocMem()`. Then, the JIT issues (or "emits") the generated instruction
bytes into the provided buffers, as well as telling the VM about
exception handling ranges, GC information, and debug information.

When the JIT emits an instruction that includes a rel32 offset (as well
as for other cases of global pointer references), it calls the VM
function `ICorJitInfo::recordRelocation()` to tell the VM the address of
the rel32 data and the intended target address of the rel32 offset. How
this is handled in the VM depends on whether we are JIT-compiling, or
compiling for NGEN.

For JIT compilation, the function `CEEJitInfo::recordRelocation()`
determines the actual rel32 value to use, and fills in the rel32 data in
the generated code buffer. However, what if the offset doesn't fit in a
32-bit rel32 space?

Up to this point, the VM has allowed the JIT to always generate rel32
addresses. This is controlled by the JIT calling
`ICorJitInfo::getRelocTypeHint()`: if this function returns
`IMAGE_REL_BASED_REL32`, then the JIT generates a rel32 address. The first
time in the lifetime of the process that `recordRelocation()` fails to
compute an offset that fits in a rel32 space, the VM aborts the
compilation, and restarts it in a mode where
`ICorJitInfo::getRelocTypeHint()` never returns `IMAGE_REL_BASED_REL32`.
That is, the VM no longer allows the JIT to generate rel32 addresses. This
is "rel32 overflow" mode. However, this restriction only applies to data
addresses. The JIT will then load up full 64-bit data addresses in the
code (which are also subject to relocation), and use those. These 64-bit
data addresses are guaranteed to reach the entire address space.
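
To make the size difference concrete, here is a hedged sketch contrasting the
two data-address forms. The instruction selection and the helper shape are
assumptions for illustration only, not the JIT's real emitter.

```
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative only: materialize a data address into RAX, either as a 7-byte
// RIP-relative LEA while rel32 addressing is still allowed, or as a 10-byte
// "mov rax, imm64" once the process is in rel32 overflow mode.
static size_t EmitDataAddress(uint8_t* code, uint64_t codeAddr, uint64_t dataAddr,
                              bool rel32Allowed)
{
    int64_t delta = (int64_t)(dataAddr - (codeAddr + 7));   // base is the end of the 7-byte LEA
    if (rel32Allowed && delta >= INT32_MIN && delta <= INT32_MAX)
    {
        code[0] = 0x48; code[1] = 0x8D; code[2] = 0x05;      // lea rax, [rip + rel32]
        int32_t rel32 = (int32_t)delta;
        memcpy(&code[3], &rel32, sizeof(rel32));
        return 7;
    }
    code[0] = 0x48; code[1] = 0xB8;                          // mov rax, imm64
    memcpy(&code[2], &dataAddr, sizeof(dataAddr));
    return 10;
}
```
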
The JIT continues to generate rel32 addresses for call instructions.
After the process is in rel32 overflow mode, if the VM gets a
`ICorJitInfo::recordRelocation()` call that overflows rel32 space, it
assumes the rel32 address is for a call instruction, attempts to build a
jump stub, and patches the rel32 with the offset to the generated jump
stub.

Note that in rel32 overflow mode, most call instructions are likely to
still reach their intended target with a rel32 offset, so jump stubs are
not expected to be required in most cases.

If this attempt to create a jump stub fails, then the generated code
cannot be used, and we hit a fatal error; we have no mechanism currently
to recover from this failure, or to prevent it.

There are several problems with this system:
1. Because the VM doesn't know whether a `IMAGE_REL_BASED_REL32`
relocation is for data or for code, in the normal case (before "rel32
overflow" mode), it assumes the worst, namely that it is for data. If all
rel32 data accesses fit, and only code offsets don't fit, and the VM
could distinguish between code and data references, then we could
generate jump stubs for the too-large code offsets, and never go into
"rel32 overflow" mode that leads to generating 64-bit data addresses.
2. We can't stress jump stub creation functionality for JIT-generated
code because the JIT generates `IMAGE_REL_BASED_REL32` relocations for
intra-function jumps and calls that it expects, and in fact requires,
not to be replaced with jump stubs, because it doesn't expect the register
used by jump stubs (RAX) to be trashed.
3. We don't have any mechanism to recover if a jump stub can't be
allocated.

In the NGEN case, rel32 calls are guaranteed to always reach, as PE
image files are limited to 2GB in size, meaning a rel32 offset is
sufficient to reach from any location in the image to any other
location. In addition, all control transfers to locations outside the
image go through indirection stubs. These stubs themselves might require
jump stubs, as described later.

### Failure mitigation

There are several possible mitigations for JIT failure to allocate jump
stubs.
1. When we get into "rel32 overflow" mode, the JIT could always generate
large calls, and never generate rel32 offsets. This is obviously
somewhat expensive, as every external call, such as every call to a JIT
helper, would increase from 5 to 12 bytes. Since it would only occur
once you are in "rel32 overflow" mode, you already know that the process
is quite large, so this is perhaps justifiable, though it could perhaps
also be optimized somewhat. This is very simple to implement.
2. Note that you get into "rel32 overflow" mode even for data addresses.
It would be useful to verify that the need for large data addresses
doesn't happen much more frequently than the need for large code addresses.
3. An alternative is to have two separate overflow modes: "data rel32
overflow" and "code rel32 overflow", as follows:
    1. "data rel32 overflow" is entered by not being able to generate a
    rel32 offset for a data address. Restart the compile, and all subsequent
    data addresses will be large.
    2. "code rel32 overflow" is entered by not being able to generate a
    rel32 offset or jump stub for a code address. Restart the compile, and
    all subsequent external call/jump sequences will be large.

    These could be independent, which would require distinguishing code and
    data rel32 relocations to the VM (which might be useful for other reasons,
    such as enabling better stress modes). Or, we could layer them: "data rel32
    overflow" would be the current "rel32 overflow" we have today, which we
    must enter before attempting to generate a jump stub. If a jump stub
    fails to be created, we fail and retry the compilation again, enter
    "code rel32 overflow" mode, and all subsequent code (and data) addresses
    would be large. We would need to add the ability to communicate this new
    mode from the VM to the JIT, implement large call/jump generation in the
    JIT, and implement another type of retry in the VM.
4. Another alternative: the JIT could determine the total number of
unique external call/jump targets from a function, and report that to
the VM. Jump stub space for exactly this number would be allocated,
perhaps along with the function itself (such as at the end), and only if
we are in a "rel32 overflow" mode. Any jump stub required would come
from this space (and identical targets would share the same jump stub;
note that sharing is optional). Since jump stubs would not be shared
between functions, this requires more space than the current jump stub
system, but it would be guaranteed to work and would only kick in when we
are already experiencing large-system behavior. A sketch of this idea
follows the list.
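
One way option 4 could be shaped is sketched below: a per-function area of
12-byte stub slots, sized from the JIT-reported count of unique external
targets, with identical targets shared within the function. The structure and
helper names are hypothetical, and `EmitJumpStub` refers to the illustrative
helper sketched earlier in this document, not an existing VM function.

```
#include <cstdint>
#include <cstring>

void EmitJumpStub(uint8_t* stub, const void* target);   // see the earlier sketch

// Hypothetical per-function jump stub reservation (mitigation option 4).
struct FunctionJumpStubArea
{
    uint8_t* slots;      // reserved space: stubCount * 12 bytes, e.g. placed after the code
    uint32_t stubCount;  // number of unique external targets reported by the JIT
    uint32_t stubsUsed;  // slots already populated
};

static uint8_t* GetOrCreateJumpStub(FunctionJumpStubArea* area, const void* target)
{
    // Share a slot if a stub for this target already exists in this function's area.
    for (uint32_t i = 0; i < area->stubsUsed; i++)
    {
        uint8_t* stub = area->slots + i * 12;
        uint64_t existing;
        memcpy(&existing, stub + 2, sizeof(existing));   // imm64 operand of "mov rax, imm64"
        if (existing == (uint64_t)(uintptr_t)target)
            return stub;
    }
    if (area->stubsUsed >= area->stubCount)
        return nullptr;                                  // cannot happen if sized correctly
    uint8_t* stub = area->slots + area->stubsUsed * 12;
    area->stubsUsed++;
    EmitJumpStub(stub, target);                          // 12-byte "mov rax, imm64; jmp rax"
    return stub;
}
```
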
## Other jump stub creation paths

The VM has several other locations that dynamically generate code or
patch previously generated code, not related to the JIT generating code.
These also must use the jump stub mechanism to possibly create jump
stubs for large distance jumps. The following sections describe these
cases.

## ReJIT

ReJIT is a CLR profiler feature, currently only implemented for x86 and
amd64, that allows a profiler to request that a function be re-compiled
with different IL, given by the profiler, and have that newly compiled
code be used instead of the code compiled from the original IL. This
happens within a live process. A single function can be ReJIT compiled
more than once, and in fact, any number of times. The VM currently
implements the transfer of control to the ReJIT compiled function by
replacing the first five bytes of the generated code of the original
function with a "jmp rel32" to the newly generated code. Call this the
"jump patch" space. One fundamental requirement for this to work is that
every function (a) be at least 5 bytes long, and (b) none of the first 5
bytes of a function (except the first byte, which is the address of the
function itself) can be the target of any branch. (As an implementation
detail, the JIT currently pads the function prolog out to 5 bytes with
NOP instructions, if required, even if there is enough code following
the prolog to satisfy the 5-byte requirement if those non-prolog bytes
are also not branch targets.)

If the newly ReJIT generated code is at an address that doesn't fit in a
rel32 in the "jmp rel32" patch, then a jump stub is created.
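
A simplified sketch of this patching logic follows. The names are hypothetical
stand-ins (the real logic lives in the ReJIT infrastructure and the
`ExecutionManager::jumpStub()` path, and must also deal with thread suspension
and atomic patching, which are omitted here).

```
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the VM's jump stub lookup/creation path; returns a
// stub within rel32 range of 'site', or nullptr if none can be allocated.
uint8_t* GetOrCreateJumpStubNear(uint8_t* site, const void* target);

// Overwrite the 5-byte jump patch space at the start of the original method
// with "jmp rel32", going through a jump stub when the new code is out of range.
static bool PatchForReJit(uint8_t* methodStart, const void* rejitCode)
{
    uint8_t* jmpTarget = (uint8_t*)rejitCode;
    int64_t  delta     = jmpTarget - (methodStart + 5);   // rel32 base is the next instruction
    if (delta < INT32_MIN || delta > INT32_MAX)
    {
        jmpTarget = GetOrCreateJumpStubNear(methodStart, rejitCode);
        if (jmpTarget == nullptr)
            return false;                                  // no stub in range: fatal in the real VM
        delta = jmpTarget - (methodStart + 5);
    }
    methodStart[0] = 0xE9;                                 // jmp rel32
    int32_t rel32 = (int32_t)delta;
    memcpy(&methodStart[1], &rel32, sizeof(rel32));
    return true;
}
```
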
The JIT only creates the required jump patch space if the
`CORJIT_FLG_PROF_REJIT_NOPS` flag is passed to the JIT. For dynamic
compilation, this flag is only passed if a profiler is attached and has
also requested ReJIT services. Note that currently, to enable ReJIT, the
profiler must be present from process launch, and must opt in to enable
ReJIT at process launch, meaning that all JIT generated functions will
have the jump patch space under these conditions. There will never be a
mix of functions with and without jump patch space in the process if a
profiler has enabled ReJIT. A desirable future state from the profiler
perspective would be to support profiler attach-to-process and ReJIT
(with function swapping) at any time thereafter. This goal may or may
not be achieved via the jump patch space design.

All NGEN and Ready2Run images are currently built with the
`CORJIT_FLG_PROF_REJIT_NOPS` flag set, to always enable ReJIT using native
images.

A single function can be ReJIT compiled many times. Only the last ReJIT
generated function can be active; the previous compilations consume
address space in the process, but are not collected until the AppDomain
unloads. Each ReJIT event must update the "jmp rel32" patch to point to
the new function, and thus each ReJIT event might require a new jump
stub.

If a situation arises where a single function is ReJIT compiled many
times, and each time requires a new jump stub, it's possible that all
jump stub space near the original function can be consumed simply by the
"leaked" jump stubs created by all the ReJIT compilations for a single
function. The "leaked" ReJIT compiled functions (since they aren't
collected until AppDomain unload) also make it more likely that "close"
code heap address space gets filled up.

### Failure mitigation

A simple mitigation would be to increase the size of the required
function jump patch space from 5 to 12 bytes. This is a two-line change
in the `CodeGen::genPrologPadForReJit()` function in the JIT. However,
this would increase the size of all NGEN and Ready2Run images. Note that
many managed code functions are very small, with very small prologs, so
this could significantly impact code size (the change could easily be
measured). For JIT-generated code, where the additional size would only
be added once a profiler has enabled ReJIT, the additional code size
seems easily justified.

Note that a function has at most one active ReJIT companion function.
When that ReJIT function is retired (and thus never used again), the
associated jump stub is also "leaked", and never used again. We
could reserve space for a single jump stub for each function, to be used
by ReJIT, and then, if a jump stub is required for ReJIT, always use
that space. The JIT could pad the function end by 12 bytes when the
`CORJIT_FLG_PROF_REJIT_NOPS` flag is passed, and the ReJIT patching code
could use this reserved space any time it required a jump stub. This
would require 12 extra bytes to be allocated for every function
generated when the `CORJIT_FLG_PROF_REJIT_NOPS` flag is passed. These 12
bytes could also be allocated at the end of the code heap, so they would
consume address space but not normal working set.

For NGEN and Ready2Run, this would require 12 bytes for every function
in the image. This is quite a bit more space than the suggested
mitigation of increasing the prolog padding to 12 bytes only where
necessary (meaning, only for prologs that aren't already 12 bytes in
size). Alternatively, NGEN could allocate this space itself in the native
image, putting it in some distant jump stub data area or section that
would be guaranteed to be within range (due to the 2GB PE file size
limitation) but wouldn't consume physical memory unless needed. This
option would require more complex logic to allocate and find the
associated jump stub during ReJIT.
This would be similar to the JIT
case, above, of reserving the jump stub in a distant portion of the code
heap.

## NGEN

NGEN images are built with several tables of code addresses that must be
patched when the NGEN image is loaded.

### CLR Helpers

During NGEN, the JIT generates either direct or indirect calls to CLR
helpers. Most are direct calls. When NGEN constructs the PE file, it
causes these all to branch to (or through, in the case of indirect
calls) the helper table. When a native image is loaded, it replaces the
helper number in the table with a 5-byte "jmp rel32" sequence. If the
rel32 doesn't fit, a jump stub is created. Note that each helper table
entry is allocated 8 bytes (only 5 are needed for "jmp rel32", but
presumably 8 bytes are reserved to improve alignment).

The code for filling out the helper table is `Module::LoadHelperTable()`.

#### Failure mitigation

A simple fix is to change NGEN to reserve 12 bytes for each direct call
table entry, to accommodate the 12-byte jump stub sequence. A 5-byte
"jmp rel32" sequence could still be used, if it fits, but the full 12
bytes would be used if necessary.

There are fewer than 200 helpers, so the maximum additional overhead would
be about `200 * (12 - 8) = 800` bytes. That is by far a worst-case
scenario. Mscorlib.ni.dll itself has 72 entries in the helper table.
System.XML.ni.dll has 51 entries, which would lead to 288 and 204 bytes
of additional space, out of 34MB and 12MB total NI file size,
respectively.

An alternative is to change all helper calls in NGEN to be indirect:
```
call [rel32]
```
where the [rel32] offset points to an 8-byte address stored in the
helper table. This method is already used by exactly one helper on
AMD64: `CORINFO_HELP_STOP_FOR_GC`, in particular because this helper
doesn't allow us to trash RAX, as jump stubs require.
Similarly, Ready2Run images use:
```
call [rel32]
```
for "hot" helpers and:
```
call [rel32]
```
to a shared:
```
jmp [rel32]
```
for cold helpers. We could change NGEN to use the Ready2Run scheme.
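
For illustration, here is a hedged sketch of what such an indirect helper call
looks like at the byte level. The helper name is hypothetical; this is not the
actual NGEN or Ready2Run fixup code.

```
#include <cstdint>
#include <cstring>

// Illustrative sketch of the indirection scheme: the call site is
// "call qword ptr [rip + rel32]" (FF 15, 6 bytes), and the rel32 points at an
// 8-byte helper table cell holding the full 64-bit helper address. Because the
// 64-bit address lives in data, no jump stub is ever needed, at the cost of an
// extra indirection on every call.
static void EmitIndirectHelperCall(uint8_t* callSite, uint64_t* tableCell,
                                   const void* helperAddress)
{
    *tableCell = (uint64_t)(uintptr_t)helperAddress;     // cell holds the full address
    callSite[0] = 0xFF;                                  // call qword ptr [rip + rel32]
    callSite[1] = 0x15;
    int32_t rel32 = (int32_t)((uint8_t*)tableCell - (callSite + 6));
    memcpy(&callSite[2], &rel32, sizeof(rel32));         // cell is in the same image, so rel32 reaches
}
```
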
Alternatively, we might handle all NGEN jump stub issues by reserving a
section in the image for jump stubs that reserves virtual address space
but does not increase the size of the image (analogous to a C++ ".bss"
section). The size of this section could be calculated precisely from
all the required possible jump stub contributions to the image. Then,
the jump stub code would allocate jump stubs from this space when
required for an NGEN image.

### Cross-module inherited methods

Per the comments on `VirtualMethodFixupWorker()`, in an NGEN image,
virtual slots inherited from cross-module dependencies point to jump
thunks. The jump thunk invokes code to ensure the method is loaded and
has a stable entry point, at which point the jump thunk is replaced by a
"jmp rel32" to that stable entry point. This is represented by
`CORCOMPILE_VIRTUAL_IMPORT_THUNK`. This can require a jump stub.

Similarly, `CORCOMPILE_EXTERNAL_METHOD_THUNK` represents another kind of
jump thunk in the NGEN image that also can require a jump stub.

#### Failure mitigation

Both kinds of thunks could be changed to reserve 12 bytes instead
of just 5 for the jump thunk, to provide the space required for any
potential jump stub.

## Precode

Precodes are used as temporary entrypoints for functions that will be
JIT compiled. They are also used for temporary entrypoints in NGEN
images for methods that need to be restored (i.e., the method code has
external references that need to be loaded before the code runs). There
are several precode types: `StubPrecode`, `FixupPrecode`, `RemotingPrecode`,
and `ThisPtrRetBufPrecode`. Each of these generates a rel32 jump and/or
call that might require a jump stub.

StubPrecode is the "normal" general case. FixupPrecode is the most
common, and has been heavily size optimized. Each FixupPrecode is 8
bytes. Generated code calls the FixupPrecode address. Initially, the
precode invokes code to generate or fix up the method being called; the
FixupPrecode itself is then "fixed up" to jump directly to the native
code. This final code will be a "jmp rel32", possibly via a jump stub.
DynamicMethod / LCG uses FixupPrecode. This code path has been found to
fail in large customer installations.

### Failure mitigation

An implementation has been made which changes the allocation of
FixupPrecode to pre-allocate space for jump stubs, but only in the case
of DynamicMethod. (See https://github.com/dotnet/coreclr/pull/9883).
Currently, FixupPrecodes are allocated in "chunks" that share a
MethodDesc pointer. For LCG, each chunk will have an additional set of
bytes allocated, to reserve space for one jump stub per FixupPrecode in
the chunk. When the FixupPrecode is patched, for LCG methods it will use
the pre-allocated space if a jump stub is required.

For the non-LCG, non-FixupPrecode cases, we need a different solution.
It would be easy to similarly allocate additional space for each type of
precode along with the precode itself. This might prove expensive. An
alternative would be to ensure, by design, that shared jump stub space
is somehow available, perhaps by reserving it in a shared area when the
precode is allocated, and falling back to a mechanism where the precode
reserves its own jump stub space if shared jump stub space cannot be
allocated.

A possibly better implementation would be to reserve, but not allocate,
jump stub space at the end of the code heap, similar to how
`COMPlus_CodeHeapReserveForJumpStubs` works, but with the reserve amount
computed precisely.

## Ready2Run

There are several `DynamicHelpers` class methods, used by Ready2Run, which
may create jump stubs (not all do, but many do). The helpers are
allocated dynamically when the helper in question is needed.

### Failure mitigation

These helpers could easily be changed to allocate additional, reserved,
unshared space for a potential jump stub, and that space could be used
when creating the rel32 offset.

## Compact entrypoints

The compact entrypoints implementation might create jump stubs. However,
compact entrypoints are not currently enabled for AMD64.

## Stress modes

Setting `COMPlus_ForceRelocs=1` forces jump stubs to be created in all
scenarios except for JIT generated code. As described previously, the
VM doesn't know when the JIT is reporting a rel32 data address or code
address, and in addition the JIT reports relocations for intra-function
jumps and calls for which it doesn't expect the register used by the
jump stub to be trashed; thus, we don't force jump stubs to be created
for all JIT-reported jumps or calls.

We should improve the communication between the JIT and VM such that we
can reliably force jump stub creation for every rel32 call or jump.
In addition, we should make sure to enable code to stress the creation of
jump stubs for every mitigation that is implemented, whether that be via
the existing `COMPlus_ForceRelocs` configuration or via a new
configuration option.
--
2.7.4