From 309ffc8dfa616d9469ff3dd76b6daa714792ba37 Mon Sep 17 00:00:00 2001 From: Katelyn Gadd Date: Thu, 18 May 2023 15:20:58 -0700 Subject: [PATCH] [wasm] Update jiterpreter documentation (#86461) * Add missing line breaks * Update jiterpreter documentation with a sample annotated trace * I love github flavored markdown --- docs/design/mono/jiterpreter.md | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/docs/design/mono/jiterpreter.md b/docs/design/mono/jiterpreter.md index 32c4463d..b2eab2b 100644 --- a/docs/design/mono/jiterpreter.md +++ b/docs/design/mono/jiterpreter.md @@ -52,6 +52,7 @@ The first pass scans sequentially through the interpreter opcodes for a method, * A call out to a helper that implements the opcode, written in C * A call out to a libc function that implements the opcode * A "bailout" which returns control to the interpreter at the opcode's location in order to execute it + During this first pass the compiler also records control flow information, keeping track of branches and branch targets in its "CFG" which will be used later to construct the WASM structures necessary for loops and other control flow. During compilation a running estimate is maintained of the trace's size, because web browsers impose an arbitrary 4KB limit on the total size of a synchronously compiled module, including the size of things like function names and type information. If we get too close to the 4KB limit, trace compilation will end at the current location. During this phase all of the generated WebAssembly code is appended into a scratch buffer, with buffer offsets recorded in the CFG. @@ -60,6 +61,7 @@ The second pass generates the final WebAssembly module including metadata like t * Blob segments, containing one or more WASM opcodes that execute sequentially. These can be copied directly into the result module * Branch block header segments, which represent a location targeted by forward or backward branches elsewhere in the code. 
We generate WebAssembly flow control structures at these locations based on the information we have about the entire trace. * Branch segments, which represent a conditional or unconditional branch that occurs after a blob. Conditional branch segments are surrounded by header and footer blobs, used to implement opcodes like conditional branches or null checks. These are translated into WebAssembly branch opcodes targeting a specific branch block header, and for backward branches we also set a dispatch index. + For traces containing backward branches, each trace begins with a small "dispatch table" which performs a forward branch to a specific destination determined by a dispatch index. Upon trace entry the dispatch index points to the top of the trace, but when a backward branch occurs we set a specific dispatch index and always jump to the dispatch table. This is necessary due to WebAssembly's heavily constrained flow control model, which does not allow arbitrary jumps and encodes jumps based on nesting depths instead of as branches targeting specific code offsets. ### Opcode translation patterns @@ -82,6 +84,32 @@ Transforming interpreter opcodes into WebAssembly opcodes comes with a few key p * All major browsers have tiering compilers for WebAssembly, so it is important to ensure that the code we generate will not cause significant performance issues in a given compiler's fast/naive tier(s) (for example, creating an enormous stack frame due to too many locals - this scenario caused stack overflows on iOS at one point). We should also keep in mind that in some corner cases, our WebAssembly code may itself run in an interpreter. * Function pointers also have considerably higher overhead in WebAssembly (due to indirection and type checks), so we should take steps where possible to minimize the amount of indirection through vtables and function pointers, calling functions directly (as WebAssembly imports) where possible. 
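The direct-translation pattern described above can be sketched with a small illustrative emitter. Here we generalize the `add.i4.imm` opcode seen in the trace sample below into a hypothetical three-operand `add.i4 lhs, rhs -> dst`, where each operand is a byte offset into the locals area addressed by the `$pLocals` parameter. This sketch emits WebAssembly text for readability; the real emitter in `jiterpreter.ts` appends raw wasm bytes to a scratch buffer, and the function name here is made up:

```typescript
// Illustrative sketch only: lower a hypothetical interpreter opcode
// `add.i4 lhs, rhs -> dst` into a WebAssembly text sequence. Note that
// the destination base pointer is pushed first, because wasm stores
// expect [address, value] on the stack and there is no dup/swap
// instruction to rearrange operands afterward.
function lowerAddI4(lhs: number, rhs: number, dst: number): string[] {
  return [
    "local.get $pLocals",      // base address for the final i32.store
    "local.get $pLocals",
    `i32.load offset=${lhs}`,  // load the left operand local
    "local.get $pLocals",
    `i32.load offset=${rhs}`,  // load the right operand local
    "i32.add",
    `i32.store offset=${dst}`, // store the sum into the destination local
  ];
}

// Example: add.i4 104, 112 -> 128
console.log(lowerAddI4(104, 112, 128).join("\n"));
```

The same shape (push destination base, compute value, store) recurs throughout the annotated trace below, which is why `pLocals` is loaded more often than one might expect.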
+### Generated trace sample + +To examine the generated WebAssembly for traces inside a specific method, you can add a substring of the method name to `instrumentedMethodNames` in `jiterpreter.ts`. At runtime when traces are generated, the full name of the containing method will be checked against all the substrings in the list and any matches will be generated in 'instrumented mode', where all the processed interpreter opcodes are recorded and dumped to the console along with the raw generated wasm bytes (you can paste those into a hex editor and save the resulting .wasm file, then run it through the analysis tool of your choice.) An example hand-annotated trace is below: + +| Interp Op | Generated Wasm | Notes | +| ------------------ | -------------- | ----- | +| | (module (memory $memory0 (import "m" "h") 1) | The module for each trace has to import the WebAssembly heap to access memory.| +| | (table $table0 (import "f" "f") 1 funcref) | Each trace also needs to import the WebAssembly function pointer table in order to perform indirect calls (though these are rarely used) | +| | (export "SequenceEqual:ae" (func $func0)) | We export traces with a short name to make it easier to identify them in the debugger. | +| | (func $func0 (param $frame i32) (param $pLocals i32) (param $cinfo i32) (result i32) | `ptrdiff_t trace (void *frame, void *pLocals, JiterpreterCallInfo *cinfo);` | +| | (local $var3 i32) (local $var4 i32) (local $cknull_ptr i32) (local $var6 i32) (local $var7 i32) (local $var8 i32) (local $var9 i64) (local $var10 i64) (local $var11 f32) (local $var12 f64) | Traces have a fixed set of locals used for storing temporary state like the back-branch flag or cached lhs/rhs for arithmetic operations. A critical one here is `cknull_ptr`, used for null checks. Names here are added for clarity; there are no names in the .wasm module due to the 4KB size limit. 
| +| | block $label3 | A block starts here at the top of the trace in order to allow us to jump to a branch target later on. WebAssembly branches target a specific block (based on a numeric depth) and either jump to its top (for `loop`s) or its bottom (for all other block types). | +| `ldind_off.i4 0, 104 -> 128` |  block $label0
  local.get $pLocals
  i32.load
  local.tee $cknull_ptr
  br_if $label0
  i32.const 6
  return
 end $label0 | Most memory operations like `ldind_off` begin with a null check that loads the pointer from a local, then performs a bailout (the `i32.const; return` pair here) if the check fails. The resulting known-not-null pointer is stored in `cknull_ptr` for later use. Note that in this case, the `i32.load` has no offset, because we're loading arg0.| | ... |  local.get $pLocals
 local.get $cknull_ptr
 local.get $pLocals
 i32.load offset=104
 i32.add
 i32.load align=1
 i32.store offset=128 | Now that we have a known-non-null base pointer to perform an indirect load, we compute the address dynamically and then load from it, storing the result in the dreg (#128). This is the equivalent of `*(pLocals + 128) = *(*(pLocals + 0) + *(pLocals + 104))`. | +| `ldind_off.i4 8, 104 -> 136` |  block $label1
  local.get $pLocals
  i32.load offset=8
  local.tee $cknull_ptr
  br_if $label1
  i32.const 14
  return
 end $label1
 local.get $pLocals
 local.get $cknull_ptr
 local.get $pLocals
 i32.load offset=104
 i32.add
 i32.load align=1
 i32.store offset=136 | Another `ldind_off`; nothing new is happening here. Note that the `ptrdiff_t` returned in this bailout is different - it's the offset of the opcode that failed. | | `bne.un.i4.s 128, 136` |  block $label2
  local.get $pLocals
  i32.load offset=128
  local.get $pLocals
  i32.load offset=136
  i32.ne
  i32.eqz
  br_if $label2
  br $label3
 end $label2 | The entire conditional branch is contained inside a block - we will branch to the end of the block if the conditional check fails, in order to skip the actual branch operation. The block begins by computing the branch condition (by loading two locals and performing an `ne` comparison), then if the condition value is 0 (`eqz`) we jump to the end of the block, which skips the actual branch. You can see the branch targets `label3`, which will skip the remainder of this basic block and go to our branch target. | +| `add.i4.imm 104 -> 104` |  local.get $pLocals
 local.get $pLocals
 i32.load offset=104
 i32.const 4
 i32.add
 i32.store offset=104 | Note that we load `pLocals` here twice. This is because wasm store operations expect the destination pointer to precede the value, and wasm has no `dup` or `swap` instructions to allow us to rearrange the stack later. Since this is an `imm` arithmetic operation, we inline the immediate as an `i32.const`. | +| `bgt.un.i4.s 112, 104` |  block $label4
  local.get $pLocals
  i32.load offset=112
  local.get $pLocals
  i32.load offset=104
  i32.gt_u
  i32.eqz
  br_if $label4
  i32.const 0
  return
 end $label4 | Here we have another conditional branch, but it targets a location outside of the trace. So instead of a branch forward or backward within the trace, we have a bailout with a displacement of 0. (If everything were working ideally, this would have generated a trace with a backward branch in it - something must have happened to prevent that.) The zero displacement means the interpreter will resume execution at the trace entry point, essentially restarting the trace for another loop iteration. | +| `ldind_off.i4 0, 112 -> 128` |  block $label5
  local.get $pLocals
  i32.load
  local.tee $cknull_ptr
  br_if $label5
  i32.const 46
  return
 end $label5
 local.get $pLocals
 local.get $cknull_ptr
 local.get $pLocals
 i32.load offset=112
 i32.add
 i32.load align=1
 i32.store offset=128 | We've seen `ldind_off` twice already at this point. Note that while we're loading arg0 again, the null check is not eliminated - when we cross a branch the jiterpreter discards any optimization data to avoid introducing errors. | | `ldind_off.i4 8, 112 -> 136` |  block $label6
  local.get $pLocals
  i32.load offset=8
  local.tee $cknull_ptr
  br_if $label6
  i32.const 54
  return
 end $label6
 local.get $pLocals
 local.get $cknull_ptr
 local.get $pLocals
 i32.load offset=112
 i32.add
 i32.load align=1
 i32.store offset=136 | Similarly, a human examining this trace might say 'we don't need to check local 8 for null, we already checked it', but since we crossed a branch afterward that information was discarded. | +| `ceq.i4 128, 136 -> 32` |  local.get $pLocals
 local.get $pLocals
 i32.load offset=128
 local.get $pLocals
 i32.load offset=136
 i32.eq
 i32.store offset=32 | As we approach the end of the trace, we can now recognize the rough shape of what the trace was doing - the first two `ldind_off`s leading into a `bne_un_i4` were the equivalent of `if (pLhs[i] != pRhs[i]) break;`. After that, the `add_i4_imm` increased the loop offset by 4, and the `bgt_un_i4` returned to the top of the loop body as long as the comparison size exceeded the loop offset. But what's going on here? Examining the source for SequenceEqual will reveal the truth. | +| ... | ... | SequenceEqual's scalar search loop ends with `/* Do final compare as sizeof(nuint) from end rather than start */ result = (LoadNUInt(ref first, lengthToExamine) == LoadNUInt(ref second, lengthToExamine));`. We can now see that the previous two `ldind_off` operations are the two `LoadNUint` calls, and the `ceq_i4` is performing the comparison and storing its result into `result`. | +| `br.s` |  i32.const -34
 return | This unconditional branch travels to a point before the trace, so we return a negative displacement and bail out. | | `ret.i4.imm` | end $label3
i32.const 74
return
| We've reached the end of the trace, and we can see a few things: the `label3` block ends, which means this is the branch target for the earlier conditional branch - an early-out exit of the loop when the comparison inside the loop fails. Return opcodes are handled by the interpreter, so this `ret` becomes a bailout with a fixed displacement. | +| | i32.const 78
return | All traces end with a final bailout indicating that execution ran off the end of the trace. In this case we can tell by looking that the final bailout is unreachable, but the jiterpreter currently doesn't perform that sort of analysis (it would only save a few bytes anyway). | + ## Interpreter entry thunks Interpreter entry points that meet some basic criteria (number of arguments, etc.) are instrumented to notify the jiterpreter each time they are hit. After a certain number of hits, they are added to the "jit queue" and will be compiled asynchronously in small batches. If a specific entry point is hit an even larger number of times, the queue will immediately be flushed to compile it. -- 2.7.4
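The hit-counting and queueing behavior of interpreter entry thunks described above can be modeled roughly as follows. The class name and both threshold values here are hypothetical, chosen only for illustration; the real counters and limits live in the runtime's jiterpreter options, and the real queue compiles entries asynchronously rather than synchronously as this model does:

```typescript
// Rough model of the interpreter-entry "jit queue" described above.
// All names and threshold values are hypothetical.
const ENQUEUE_THRESHOLD = 100; // hits before an entry point joins the queue
const FLUSH_THRESHOLD = 1000;  // hits that force an immediate queue flush

class EntryThunkQueue {
  private hitCounts = new Map<string, number>();
  private queue: string[] = [];
  public compiled: string[] = [];

  // Called from an instrumented interpreter entry point on every hit.
  onHit(method: string): void {
    const hits = (this.hitCounts.get(method) ?? 0) + 1;
    this.hitCounts.set(method, hits);
    if (hits === ENQUEUE_THRESHOLD) {
      this.queue.push(method);  // compiled later, in small batches
    } else if (hits >= FLUSH_THRESHOLD && this.queue.includes(method)) {
      this.flush();             // hot enough that waiting isn't worth it
    }
  }

  // Compile everything currently queued (synchronously, in this model).
  flush(): void {
    this.compiled.push(...this.queue);
    this.queue.length = 0;
  }
}
```

The two-threshold design matters: the lower threshold filters out entry points that are only hit a handful of times, while the flush threshold makes sure a very hot entry point doesn't sit in the queue waiting for a batch to fill.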