[wasm-ep] Implement DiagnosticServer and startup sessions for WebAssembly (#72482)
Add a diagnostic server for WebAssembly. Enable by building the runtime with `/p:WasmEnablePerfTracing=true` or `/p:WasmEnableThreads=true`.
To configure a project to start the diagnostic server, add this to the .csproj:
```xml
<WasmExtraConfig Include="diagnostic_options" Value='
{
"server": { "suspend": false, "connect_url": "ws://localhost:8088/diagnostics" }
}' />
```
The `connect_url` should be a WebSocket url serviced by `dotnet-dsrouter server-websocket` **from this branch** https://github.com/lambdageek/diagnostics/tree/wasm-server
Note that setting `"suspend": true` will hang the browser tab until a diagnostic tool such as `dotnet-trace collect` connects to the dsrouter.
---
Implement creating VFS file based sessions at runtime startup. Add the following to a .csproj:
```xml
<WasmExtraConfig Include="diagnostic_options" Value='
{
"sessions": [ { "collectRundownEvents": "true", "providers": "WasmHello::5:EventCounterIntervalSec=1" } ]
}' />
```
That will create and start one or more EventPipe sessions that will store their results into the VFS.
The startup session can be retrieved via `MONO.diagnostics.getStartupSessions()`. Each session `s` should be stopped via `s.stop()` and the data can then be extraced in a `Blob` using `s.getTraceBlob()`.
This is orthogonal to the diagnostic server support. You don't need `dotnet-dsrouter` running on the host. But you do need access to JavaScript on the main thread.
---
Notes/Issues:
* Tree shaking: I verified that if threads are not available, all the TypeScript diagnostics code is removed.
* Right now the server is not very robust to `dotnet-dsrouter` stopping, or starting after the runtime starts. The ideal order is to start `dotnet-dsrouter` first, and then open the browser
* Unrelated housekeeping fixes:
* Tell `wasm.proj` about all the subdirectories with .ts files - makes incremental builds notice changes in subdirectories.
* Add a rollup `dependencies` property to quiet a warning about `node/buffer`
* There's a mock implementation of a "websocket" that was used for protocol testing. I verified that tree-shaking removes this in thread-enabled Release builds.
* Bump `PTHREAD_POOL_SIZE` to `4` and set `PTHREAD_POOL_SIZE_STRICT=2` (returns `EAGAIN` from `pthread_create` if the pool needs to grow). The previous setting `PTHREAD_POOL_SIZE_STRING=1` (warn and try to grow the pool) seemed to lead to hangs. Usually that means the main thread is creating a thread and immediately calling `pthread_join` without returning to JS. We should investigate separately.
* The only implemented diagnostic server commands are `CollectTracing2`, `StopCollecting` and `ResumeRuntime`. None of the `Dump`, `Process` and `Profiler` commands are implemented and the server will crash if it receives them. It should be relatively straightforward to return a "command unsupported" reply (which would allow clients to gracefully disconnect), but it's not done yet.
* In some error states the runtime kills the browser tab with the following in a terminal window (if Chrome is started from a terminal: `FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory`). This probably means we're hitting a loop somewhere that rapidly exhausts JIT memory, but it's difficult to investigate when the JS console dies, too (happens with chrome stable v103 and chrome beta v104).
Fixes https://github.com/dotnet/runtime/issues/69674, contributes to https://github.com/dotnet/runtime/issues/72481
---
* [wasm] Enable the tracing component if threading is supported
* add a way to specify EP sessions in the MonoConfig
Currently not wired up to the runtime
* Add a mechanism to copy startup configs into the runtime and session IDs out
* checkpoint. Do more from JS
The issue is that once we're setting up streaming sessions, we will need to send back a DS IPC reply with the session id before we start streaming. So it's better to just call back to JS when we start up and setup all the EP sessions from JS so that when we return to C everything is all ready.
* checkpoint: starting a session at startup works
* checkpoint add a controller and a webworker for DS
* checkpoint: diagnostic server
* fix eslint
* [diagnostic_server] wasm-specific fn_table
We won't be using the native C version
* [wasm-ep] disable DS connect ports in C, too
we will implement DS in JS
* Start diagnostic server pthread
Clean up some of the old WIP code - we will probably not send configuration strings from the diagnostic server back to the main thread.
* checkpoint: try to start the server
It doesn't work right now because the MessagePort is not created until the server thread attaches to Mono, which doesn't happen because it's started before Mono.
Also it doesn't yet send a resume event, so the main thread just blocks forever
* Add a mock WebSocket connection to simulate the remote end
Start the diagnostic server and have it perform the open/advertise steps with the mock.
* wasm-mt: use a PThreadSelf struct instead of a raw MessagePort
* Move all the EP and diagnostic server modules to one directory
* Refactor; remove dead code; rationalize controller
the flow is now:
```
main -{creates pthread}-> server
. server creates event listener
. <-{sends diagnostic MessagePort}- .
main creates event listener .
. -{posts "start" message}-> .
. begins server loop
```
after the server loop is running, the main thread will get a "resume_startup" message once the diagnostic server receives the right command from the websocket.
next TODO: the runtime needs to send a "attach to runtime" message which will signal the server that it can attach to the runtime (in native) and start calling EP session creation functions.
* checkpoint: start adding queue from streaming thread to DS thread
We can't set up a shared MessagePort very easily (we need to bounce through the main thread but it probably won't be able to process our message until it's too late).
Also Atomics.waitAsync isn't available on many browsers (Chrome only).
So we use emscripten's dispatch mechanism to trigger an event in the diagnostic thread to wake up and service the streaming thread's queue. Right now the queue is dumb so we trigger on every write. and also the write is synchronous.
But it's simple to think about and it's implementable.
* [wasm] Incremental build and rollup warnings cleanups
- Add 'node/buffer' as an extrenal dependency. This doesn't do anything except quiet a rollup warning about the import.
- Add all the .ts files, and the tsconfig files (except node_modules) to the rollup inputs, to make sure we re-run rollup when anything changes.
* WIP: work on wiring up DS protocol commands (mock); resume hack
- start adding commands so that we can strt some sessions from DS
- we can't avoid a busy loop in ds_server_wasm_pause_for_diagnostics_monitor.
we can't make the main thread pause until we get a resume command
until after we're able to start an EP session (DS client won't send
a resume command until we send an EP session ID back). If the DS
pauses until it can attach to the runtime, and the runtime pauses
until DS tells it to resume, the main thread pause has to be after
we get EP and DS initialized. But that means it can't be async. So
we'll just have to busy wait on a condition variable in native.
* WIP: set up a WasmIpcStream, create EP sessions from DS
Seems to create the session, but not seeing write events
yet. possibly due to not flushing?
* WIP: starting to stream works; needs PTHREAD_POOL_SIZE bump
Looks like we can send the initial nettrace header some events.
We're starting more threads, so we need a bigger thread pool.
Also PTHREAD_POOL_SIZE_STRICT=1 (the default - warn if worker pool needs to grow,
but still try to grow it) seems to deadlock the browser-eventpipe
sample.
Set PTHREAD_POOL_SIZE_STRICT=2 (don't try to allocate a worker, make
pthread_create fail with EAGAIN) instead so we get some kind of
exception instead in other circumstances.
Set the pool size to 4.
* cleanup browser-eventpipe sample
* call mono_wasm_event_pipe_early_startup_callback from event_pipe init
instead of from the rundown_execution_checkpoint_2 function
* if diagnostics server isn't enabled, don't try to initialize it
* checkpoint: start parsing binary commands
* checkpoint: Can parse a CollectTracing2 command and attempt to create a
session!
* [wasm-ep] use the new PromiseController<T>
* get back to the server loop quicker by queueing the parsing in the microtask
* update mock for binary ADVR_V1 message
* sample: don't suspend, and use a mock url
* wasm_ipc_stream: wire up close command
Use a sentinal "buf" value (-1) to signal that the writer closed the stream
* Send proper OK messages in replies to binary protocol commands
* (testing) turn off the file session for now
* remove em_asm(console.log); simplify wasm EP init
Just call the EP JS callback directly from native
* remove debug output
* cleanup wasm ipc stream impl
* put diagnostics mocks behind a const flag
* don't build wasm-specific DS if threads are disabled
* refactor and cleanup
- Move the IPC parsing and serialization into separate files
- Try to have one responsibility per class
- update comments and docs
* help treeshaking
verified that all the DS and EP JS code is dropped if monoWasmThreads is false.
* update DS design notes
* use PromiseController in more places
* fix Windows build
* add MONO_WASM prefix to console logging outputs
* improve debug output for DS server
keep track of open/advertise counts and print them when receiving replies
* bugfix: don't confuse buf_addr for the value stored in it
the buf_addr is always the same for a given queue. the value in it is what we need to check to see if it's the sentinel value
* fix bug in queue_push_sync main thread detection
* merge fixup
* fix rollup warning when making the crypto worker
* add MONO_WASM: prefix to logging
* make diagnostic server mocking friendlier
Allow each test project to specify its own mock script.
Also provide TypeScript declarations for the mocking interfaces
Also always use binary protocol commands - don't send json for mocks.
* disable mocking in the sample project by default
* fixup after merge
* review feedback
- improve diagnostics mock README
- note that mocking just uses ES6 modules, testing with CJS is not supported right now.
- fix iteration over listeners when dispatching a one-shot event in the EventTargt polyfill
- use U32 getter in EP session creation