# ARM64 JIT frame layout

NOTE: This document was written before the code was written, and hasn't been
verified to match existing code. It refers to some documents that might not be

This document describes the frame layout constraints and options for the ARM64
JIT.

These frame layouts were taken from the "Windows ARM64 Exception Data"
specification, and expanded for use by the JIT.

We will generate chained frames in most cases (where we save the frame pointer on
the stack, and point the frame pointer (x29) at the saved frame pointer),
including all non-leaf frames, to support ETW stack walks. This is recommended
by the "Windows ARM64 ABI" document. See `ETW_EBP_FRAMED` in the JIT code. (We
currently don't set `ETW_EBP_FRAMED` for ARM64.)

For frames with alloca (dynamic stack allocation), we must use a frame pointer
that is fixed after the prolog (and before any alloca), so the stack pointer can
vary. The frame pointer will be used to access locals, parameters, etc., in the
fixed part of the frame.

For non-alloca frames, the stack pointer is set by the end of the prolog and
does not change afterward. In this case, the stack pointer can be used for all
frame member access. If a frame pointer is also created, the frame pointer can
optionally be used to access frame members if it gives an encoding advantage.

We require a frame pointer in several cases: (1) functions with exception
handling, so that handler funclets can use the frame pointer to access parent
function locals, (2) functions with P/Invoke, (3) functions subject to certain
GC encoding limitations or requirements, (4) varargs functions, (5) Edit &
Continue functions, (6) debuggable code, and (7) MinOpts. This list might not be
exhaustive.

On ARM64, the stack pointer must remain 16-byte aligned at all times.

The immediate offset addressing modes for various instructions have different
offset ranges. We want the frames to be designed to efficiently use the
available instruction encodings. Some important offset ranges for immediate
offset addressing include:

* ldrb / ldrsb / strb, unsigned offset: 0 to 4095
* ldrh / ldrsh / strh, unsigned offset: 0 to 8190, multiples of 2 (aligned halfwords)
* ldr / str (32-bit variant) / ldrsw, unsigned offset: 0 to 16380, multiples of 4 (aligned words)
* ldr / str (64-bit variant), unsigned offset: 0 to 32760, multiples of 8 (aligned doublewords)
* ldp / stp (32-bit variant), pre-indexed, post-indexed, and signed offset: -256 to 252, multiples of 4
* ldp / stp (64-bit variant), pre-indexed, post-indexed, and signed offset: -512 to 504, multiples of 8
* ldurb / ldursb / ldurh / ldursh / ldur (32-bit and 64-bit variants) / ldursw / sturb / sturh / stur (32-bit and 64-bit variants): -256 to 255
* ldr / ldrh / ldrb / ldrsw / ldrsh / ldrsb / str / strh / strb pre-indexed/post-indexed: -256 to 255 (unscaled)
* add / sub (immediate): 0 to 4095, or with 12 bit left shift: 4096 to 16773120 (multiples of 4096).
    * Thus, a frame of up to 4095 bytes can be constructed with one "small" / unshifted `sub`. A larger frame needs one "large" / shifted `sub`, followed by a single "small" / unshifted `sub` if the size is not a multiple of 4096. The reverse applies for tearing down the frame.
    * Note that we need to probe the stack for stack overflow when allocating large frames.

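
The shifted / unshifted `sub` split described above can be sketched as follows.
This is a minimal illustration, not the JIT's code: the function names
`split_sub_immediate` and `format_prolog_subs` are hypothetical, and the sketch
ignores the stack probing that large frames require.

```python
def split_sub_immediate(size):
    """Split a stack adjustment into ARM64 add/sub immediates.

    Returns a list of (imm12, shift) pairs, where shift is 0 or 12.
    Hypothetical helper for illustration only.
    """
    if not 0 <= size <= 0xFFFFFF:
        raise ValueError("size does not fit in a shifted plus an unshifted immediate")
    high = size >> 12   # encoded in the shifted (lsl #12) instruction
    low = size & 0xFFF  # encoded in the unshifted instruction
    parts = []
    if high:
        parts.append((high, 12))
    if low or not parts:
        parts.append((low, 0))
    return parts


def format_prolog_subs(size):
    """Render the split as assembly text for a prolog."""
    lines = []
    for imm, shift in split_sub_immediate(size):
        suffix = ", lsl #12" if shift else ""
        lines.append(f"sub sp, sp, #{imm}{suffix}")
    return lines


for line in format_prolog_subs(20000):
    print(line)
```

For a 20000 byte frame, this produces a shifted `sub` of 4 (16384 bytes)
followed by an unshifted `sub` of 3616. Both pieces are multiples of 16, so SP
stays aligned after each instruction.
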
Most of the offset modes (that aren't pre-indexed or post-indexed) are unsigned.
Thus, we want the frame pointer, if it exists, to be at a lower address than the
objects on the frame (with the small caveat that we could use the limited
negative offset addressing capability of the `ldur*` / `stur*` unscaled modes).
The stack pointer will point to the first slot of the outgoing stack argument
area, if any, even for alloca functions (thus, the alloca operation needs to
"move" the outgoing stack argument space down), so filling the outgoing stack
argument space will always use SP.

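
To make the asymmetry between positive and negative offsets concrete, here is a
small sketch that checks whether an offset fits a given addressing form. The
table keys and the `is_encodable` name are made up for this example; the ranges
are the ones listed earlier in this document.

```python
# (minimum, maximum, required multiple) for some of the forms listed above.
# The form names are hypothetical labels, not JIT identifiers.
OFFSET_RANGES = {
    "ldrb_unsigned": (0, 4095, 1),
    "ldrh_unsigned": (0, 8190, 2),
    "ldr32_unsigned": (0, 16380, 4),
    "ldr64_unsigned": (0, 32760, 8),
    "ldp32_signed": (-256, 252, 4),
    "ldp64_signed": (-512, 504, 8),
    "ldur_unscaled": (-256, 255, 1),
}


def is_encodable(form, offset):
    """Return True if `offset` can be encoded directly in `form`."""
    low, high, scale = OFFSET_RANGES[form]
    return low <= offset <= high and offset % scale == 0


# A 64-bit local far above the frame pointer is reachable with one ldr...
print(is_encodable("ldr64_unsigned", 32760))
# ...but a local just below it needs the unscaled form, which only reaches 256 bytes.
print(is_encodable("ldr64_unsigned", -8), is_encodable("ldur_unscaled", -8))
```

With the frame pointer at the low end of the frame, almost everything is at a
positive offset and the wide unsigned ranges apply.
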
For extremely large frames (e.g., frames larger than 32760, certainly, but
probably for any frame larger than 4095), we need to globally reserve and use an
additional register to construct an offset, and then use a register offset mode
(see `compRsvdRegCheck()`). It is unlikely we could accurately allocate a
register for this purpose at all points where it will be actually necessary.

In general, we want to put objects close to the stack or frame pointer, to take
advantage of the limited addressing offsets described above, especially if we
use the ldp/stp instructions. If we do end up using ldp/stp, we will want to
consider pointing the frame pointer somewhere in the middle of the locals (or
other objects) in the frame, to maximize the limited, but signed, offset range.
For example, saved callee-saved registers should be far from the frame/stack
pointer, since they are going to be saved once and loaded once, whereas
locals/temps are expected to be used more frequently.

For variadic (varargs) functions, and possibly for functions with incoming
struct register arguments, it is easier to put the arguments on the stack in the
prolog such that the entire argument list is contiguous in memory, including
both the register and stack arguments. On ARM32, we used the "prespill" concept,
where we used a register mask "push" instruction for the "prespilled" registers.
Note that on ARM32, structs could be split between incoming argument registers
and the stack. On ARM64, this is not true. A struct <=16 bytes is passed in one
or two consecutive registers, or entirely on the stack. Structs >16 bytes are
passed by reference (the caller allocates space for the struct in its frame,
copies the output struct value there, and passes a pointer to that space). On
ARM64, instead of prespill we can just allocate the appropriate stack space, and
use `str` or `stp` to save the incoming register arguments to the stack.

To support GC "return address hijacking", we need to, for all functions, save
the return address to the stack in the prolog, and load it from the stack in the
epilog before returning. We must do this so the VM can change the return address
stored on the stack to cause the function to return to a special location to

Below are some sample frame layouts. In these examples, `#localsz` is the byte
size of the locals/temps area (everything except callee-saved registers and the
outgoing argument space, but including space to save FP and LR), `#outsz` is the
outgoing stack parameter size, and `#framesz` is the size of the entire stack
frame (meaning `#localsz` + `#outsz` + callee-saved register size, but not including

Note that in these frame layouts, the saved `<fp,lr>` pair is not contiguous
with the rest of the callee-saved registers. This is because for chained
functions, the frame pointer must point at the saved frame pointer. Also, if we
are to use the positive immediate offset addressing modes, we need the frame
pointer to be lowest on the stack. In addition, we want the callee-saved
registers to be "far away", especially for large frames where an immediate
offset addressing mode won't be able to reach them, as we want locals to be
closer than the callee-saved registers.

To maintain 16 byte stack alignment, we may need to add alignment padding bytes.
Ideally we design the frame such that we only need at most 15 alignment bytes.
Since our frame objects are minimally 4 bytes (or maybe even 8 bytes?) in size,
we should only need maximally 12 (or 8?) alignment bytes. Note that every time
the stack pointer is changed, it needs to be by a multiple of 16 bytes, so every
stack adjustment might require alignment. (Technically, it might be the case
that you can change the stack pointer by values not a multiple of 16, but you
certainly can't load or store from non-16-byte-aligned SP values. Also, the
ARM64 unwind code `alloc_s` is 16 byte scaled, so it can only handle multiples
of 16 byte changes to SP.) Note that ldp/stp can be given an 8-byte aligned
address when reading/writing 8-byte register pairs, even though the total data
transfer for the instruction is 16 bytes.

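
The padding arithmetic above can be sketched as follows. The names `align_up`
and `alignment_padding` are chosen for this example and are not taken from the
JIT sources.

```python
STACK_ALIGN = 16  # ARM64 requires SP to stay 16-byte aligned


def align_up(size, alignment=STACK_ALIGN):
    """Round `size` up to the next multiple of `alignment` (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def alignment_padding(size, alignment=STACK_ALIGN):
    """Number of padding bytes needed to make `size` a multiple of `alignment`."""
    return align_up(size, alignment) - size


# Worst cases for 1-, 4-, and 8-byte granular frame contents.
for size in (1, 36, 40, 48):
    print(size, align_up(size), alignment_padding(size))
```

A frame whose contents total 36 bytes (4-byte granularity) needs 12 bytes of
padding, one totalling 40 bytes (8-byte granularity) needs 8, matching the
bounds given above.
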
## 1. chained, `#framesz <= 512`, `#outsz = 0`

```
stp fp,lr,[sp,-#framesz]!       // pre-indexed, save <fp,lr> at bottom of frame
mov fp,sp                       // fp points to bottom of stack
stp r19,r20,[sp,#framesz - 96]  // save INT pair
stp d8,d9,[sp,#framesz - 80]    // save FP pair
stp r0,r1,[sp,#framesz - 64]    // home params (optional)
stp r2,r3,[sp,#framesz - 48]
stp r4,r5,[sp,#framesz - 32]
stp r6,r7,[sp,#framesz - 16]
```

8 instructions (for this set of register saves, used in most examples given
here). There is a single SP adjustment, which is folded into the `<fp,lr>`
register pair store. Works with alloca. Frame access is via SP or FP.

We will use this for most frames with no outgoing stack arguments (which is
likely to be the 99% case, since we have 8 integer register arguments and 8
floating-point register arguments).

Here is a similar example, but with an odd number of saved registers:

```
stp fp,lr,[sp,-#framesz]!       // pre-indexed, save <fp,lr> at bottom of frame
mov fp,sp                       // fp points to bottom of stack
stp r19,r20,[sp,#framesz - 24]  // save INT pair
str r21,[sp,#framesz - 8]       // save INT reg
```

Note that the saved registers are "packed" against the "caller SP" value (that
is, they are at the "top" of the downward-growing stack). Any alignment is lower
than the callee-saved registers.

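
As a sketch of this packing, the following computes the SP-relative store
offsets for a list of 8-byte callee-saved registers, pairing them with `stp`
where possible. The helper name `pack_callee_saves` and the tuple format are
assumptions made for this example.

```python
def pack_callee_saves(framesz, regs, slot=8):
    """Place `regs` against the caller SP (offset `framesz` from the new SP).

    Returns (mnemonic, registers, sp_offset) tuples, lowest address first.
    Hypothetical helper; assumes every register occupies `slot` bytes.
    """
    offset = framesz - slot * len(regs)
    stores = []
    i = 0
    while i < len(regs):
        if i + 1 < len(regs):
            # Two registers left: store them as a pair.
            stores.append(("stp", (regs[i], regs[i + 1]), offset))
            offset += 2 * slot
            i += 2
        else:
            # One register left over: store it alone in the top slot.
            stores.append(("str", (regs[i],), offset))
            offset += slot
            i += 1
    return stores


for store in pack_callee_saves(64, ["r19", "r20", "r21"]):
    print(store)
```

With `#framesz` = 64, the pair lands at offset 40 (`#framesz - 24`) and the
single register at offset 56 (`#framesz - 8`), as in the example above. Any
padding falls below offset 40, in the locals area.
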
For leaf functions, we don't need to save the callee-saved registers, so for a
chained function (such as a function with alloca) we will have:

```
stp fp,lr,[sp,-#framesz]!  // pre-indexed, save <fp,lr> at bottom of frame
mov fp,sp                  // fp points to bottom of stack
```

## 2. chained, `#framesz - 16 <= 512`, `#outsz != 0`

```
sub sp,sp,#framesz
stp fp,lr,[sp,#outsz]           // signed offset, save <fp,lr>
add fp,sp,#outsz                // fp points to bottom of local area
stp r19,r20,[sp,#framesz - 96]  // save INT pair
stp d8,d9,[sp,#framesz - 80]    // save FP pair
stp r0,r1,[sp,#framesz - 64]    // home params (optional)
stp r2,r3,[sp,#framesz - 48]
stp r4,r5,[sp,#framesz - 32]
stp r6,r7,[sp,#framesz - 16]
```

9 instructions. There is a single SP adjustment. It isn't folded into the
`<fp,lr>` register pair store because the SP adjustment points the new SP at the
outgoing argument space, and the `<fp,lr>` pair needs to be stored above that.
Works with alloca. Frame access is via SP or FP.

We will use this for most non-leaf frames with outgoing argument stack space.

As for #1, if there is an odd number of callee-saved registers, they can easily
be put adjacent to the caller SP (at the "top" of the stack), so any alignment
bytes will be in the locals area.

## 3. chained, `(#framesz - #outsz) <= 512`, `#outsz != 0`.

Different from #2, as `#framesz` is too big. Might be useful for `#framesz >
512` but `(#framesz - #outsz) <= 512`.

```
stp fp,lr,[sp,-(#localsz + 96)]!  // pre-indexed, save <fp,lr> above outgoing argument space
mov fp,sp                         // fp points to bottom of stack
stp r19,r20,[sp,#localsz + 80]    // save INT pair
stp d8,d9,[sp,#localsz + 64]      // save FP pair
stp r0,r1,[sp,#localsz + 48]      // home params (optional)
stp r2,r3,[sp,#localsz + 32]
stp r4,r5,[sp,#localsz + 16]
stp r6,r7,[sp,#localsz]
sub sp,sp,#outsz
```

9 instructions. There are 2 SP adjustments. Works with alloca. Frame access is
via SP or FP.

We will not use this.

## 4. chained, `#localsz <= 512`

```
stp r19,r20,[sp,#-96]!     // pre-indexed, save incoming 1st FP/INT pair
stp d8,d9,[sp,#16]         // save incoming floating-point regs (optional)
stp r0,r1,[sp,#32]         // home params (optional)
stp r2,r3,[sp,#48]
stp r4,r5,[sp,#64]
stp r6,r7,[sp,#80]
stp fp,lr,[sp,-#localsz]!  // save <fp,lr> at bottom of local area
mov fp,sp                  // fp points to bottom of local area
sub sp,sp,#outsz           // if #outsz != 0
```

9 instructions. There are 3 SP adjustments: to set SP for saving callee-saved
registers, for allocating the local space (and storing `<fp,lr>`), and for
allocating the outgoing argument space. Works with alloca. Frame access is via
SP or FP.

We likely will not use this. Instead, we will use #2 or #5/#6.

## 5. chained, `#localsz > 512`, `#outsz <= 512`.

Another case with an unlikely mix of sizes.

```
stp r19,r20,[sp,#-96]!       // pre-indexed, save incoming 1st FP/INT pair
stp d8,d9,[sp,#16]           // save incoming FP regs (optional)
stp r0,r1,[sp,#32]           // home params (optional)
stp r2,r3,[sp,#48]
stp r4,r5,[sp,#64]
stp r6,r7,[sp,#80]
sub sp,sp,#localsz + #outsz  // allocate remaining frame
stp fp,lr,[sp,#outsz]        // save <fp,lr> at bottom of local area
add fp,sp,#outsz             // fp points to the bottom of local area
```

9 instructions. There are 2 SP adjustments. Works with alloca. Frame access is
via SP or FP.

To handle an odd number of callee-saved registers with this layout, we would
need to insert alignment bytes higher in the stack. E.g.:

```
str r19,[sp,#-16]!           // pre-indexed, save incoming 1st INT reg
sub sp,sp,#localsz + #outsz  // allocate remaining frame
stp fp,lr,[sp,#outsz]        // save <fp,lr> at bottom of local area
add fp,sp,#outsz             // fp points to the bottom of local area
```

This is not ideal, since if `#localsz + #outsz` is not 16 byte aligned, it would
need to be padded, and we would end up with two different paddings that might
not be necessary. An alternative would be:

```
sub sp,sp,#16
str r19,[sp,#8]              // save register at the top
sub sp,sp,#localsz + #outsz  // allocate remaining frame. Note that there are 8 bytes of padding from the first "sub sp" that can be subtracted from "#localsz + #outsz" before padding them up to 16.
stp fp,lr,[sp,#outsz]        // save <fp,lr> at bottom of local area
add fp,sp,#outsz             // fp points to the bottom of local area
```

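
The difference between the two approaches can be estimated with a short sketch.
The function names are hypothetical, and the model assumes a single 8-byte
callee-saved register and that the spare 8 bytes can be used by the locals area;
it says nothing about how the unwind codes would describe the result.

```python
def align_up(size, alignment=16):
    """Round `size` up to the next multiple of `alignment` (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def frame_size_two_paddings(localsz, outsz, saved=8):
    """First approach: pad the saved register area and the rest independently."""
    return align_up(saved) + align_up(localsz + outsz)


def frame_size_shared_padding(localsz, outsz, saved=8):
    """Alternative: let the rest of the frame absorb the spare bytes first."""
    first = align_up(saved)
    spare = first - saved  # unused bytes below the saved register
    rest = align_up(max(localsz + outsz - spare, 0))
    return first + rest


for localsz, outsz in ((32, 8), (48, 0)):
    print(localsz, outsz,
          frame_size_two_paddings(localsz, outsz),
          frame_size_shared_padding(localsz, outsz))
```

When `#localsz + #outsz` is 40, the first approach allocates 64 bytes and the
alternative 48. When the sum is already a multiple of 16, such as 48, both
allocate 64 bytes.
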
## 6. chained, `#localsz > 512`, `#outsz > 512`

The most general case. It is a simple generalization of #5. `sub sp` (or a pair
of `sub sp` for really large sizes) is used for both sizes that might overflow
the pre-indexed addressing mode offset limit.

```
stp r19,r20,[sp,#-96]!  // pre-indexed, save incoming 1st FP/INT pair
stp d8,d9,[sp,#16]      // save incoming FP regs (optional)
stp r0,r1,[sp,#32]      // home params (optional)
stp r2,r3,[sp,#48]
stp r4,r5,[sp,#64]
stp r6,r7,[sp,#80]
sub sp,sp,#localsz      // allocate locals space
stp fp,lr,[sp]          // save <fp,lr> at bottom of local area
mov fp,sp               // fp points to the bottom of local area
sub sp,sp,#outsz        // allocate outgoing argument space
```

10 instructions. There are 3 SP adjustments. Works with alloca. Frame access is
via SP or FP.

## 7. chained, any size frame, but no alloca.

```
stp fp,lr,[sp,#-112]!     // pre-indexed, save <fp,lr>
mov fp,sp                 // fp points to top of local area
stp r19,r20,[sp,#16]      // save INT pair
stp d8,d9,[sp,#32]        // save FP pair
stp r0,r1,[sp,#48]        // home params (optional)
stp r2,r3,[sp,#64]
stp r4,r5,[sp,#80]
stp r6,r7,[sp,#96]
sub sp,sp,#framesz - 112  // allocate the remaining local area
```

9 instructions. There are 2 SP adjustments. The frame pointer FP points to the
top of the local area, which means this is not suitable for frames with alloca.
All frame access will be SP-relative. #1 and #2 are better for small frames, or

## 8. Unchained. No alloca.

```
stp r19,r20,[sp,#-80]!   // pre-indexed, save incoming 1st FP/INT pair
stp r21,r22,[sp,#16]     // ...
stp r23,lr,[sp,#32]      // save last Int reg and lr
stp d8,d9,[sp,#48]       // save FP pair (optional)
stp d10,d11,[sp,#64]     // ...
sub sp,sp,#framesz - 80  // allocate the remaining local area
```

Or, with an even number of saved Int registers. Note that here we leave 8 bytes
of padding at the highest address in the frame. We might choose to use a
different format, to put the padding in the locals area, where it might be
absorbed by the

```
stp r19,r20,[sp,#-80]!   // pre-indexed, save in 1st FP/INT reg-pair
stp r21,r22,[sp,#16]     // ...
str lr,[sp,#32]          // save lr
stp d8,d9,[sp,#40]       // save FP reg-pair (optional)
stp d10,d11,[sp,#56]     // ...
sub sp,sp,#framesz - 80  // allocate the remaining local area
```

All locals are accessed based on SP. FP points to the previous frame.

As an optimization, FP can be placed at any position in the locals area to
provide better coverage for the "reg-pair" and pre-/post-indexed offset
addressing modes. Locals below the frame pointer can be accessed based on SP.

## 9. The minimal leaf frame

```
str lr,[sp,#-16]!  // pre-indexed, save lr, align stack to 16
... function body ...
ldr lr,[sp],#16    // epilog: reverse prolog, load return address
```

Note that in this case, there are 8 bytes of alignment padding above the saved LR.