2024-10-06 Pistonight (Michael)
This is a design document/blog for my implementation of spawn in Rust and WebAssembly with wasm-bindgen. Please see the GitHub README or the documentation on crates.io to see how to use this library, or check out the examples.
The dream is to be able to use std::thread::spawn in WebAssembly and have things “just work”. However, this is still far from working for the wasm32-unknown-unknown target. Meanwhile, the underlying features required to implement threads in the browser environment are stable enough that I want to look into implementing this myself.
The backbone of the design is explained in “Multithreading Rust and Wasm”. Essentially:
- Instead of passing data between threads with postMessage, we want to utilize the WebAssembly threads proposal to share memory between threads, using a shared WebAssembly.Memory object, which is backed by a SharedArrayBuffer.
- The Rust code has to be compiled with the atomics target feature enabled.
Since the API is meant to mimic std::thread::spawn, let’s first look at that:
    // spawn a thread, returning a std::thread::JoinHandle for it
    let handle = std::thread::spawn(|| {
        println!("Hello from a thread!");
        42
    });
    // wait for thread to finish
    let result = handle.join().unwrap();
    assert_eq!(result, 42);
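To model this pattern with Web Workers, we need to:
- Create a Worker and send the closure to it with postMessage
- Block the calling thread when join is called
- Have the worker notify the calling thread when the thread is done.

The web standard does not allow the main thread to block. When the above is implemented, we get a TypeError when trying to call join on the main thread.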
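The steps listed above can be sketched on a native target, where a std thread stands in for the Web Worker and a Mutex plus Condvar stand in for the shared-memory wait and notify. This is only a minimal illustration of the pattern; the names SketchJoinHandle and sketch_spawn are hypothetical and are not part of the library.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for a join handle: a shared slot plus a condition
/// variable that the worker signals once the closure has finished.
pub struct SketchJoinHandle<T> {
    slot: Arc<(Mutex<Option<T>>, Condvar)>,
}

impl<T> SketchJoinHandle<T> {
    /// Block the calling thread until the worker has stored its result.
    pub fn join(self) -> T {
        let (lock, cvar) = &*self.slot;
        let mut guard = lock.lock().unwrap();
        loop {
            if let Some(value) = guard.take() {
                return value;
            }
            // Releases the lock while waiting and re-acquires it on wake-up.
            guard = cvar.wait(guard).unwrap();
        }
    }
}

/// Start the closure on another thread and hand back a handle to wait on.
/// A native thread stands in for the Web Worker here (an assumption of
/// this sketch, not how the library starts workers).
pub fn sketch_spawn<T, F>(f: F) -> SketchJoinHandle<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let slot = Arc::new((Mutex::new(None), Condvar::new()));
    let worker_slot = Arc::clone(&slot);
    thread::spawn(move || {
        let value = f();
        let (lock, cvar) = &*worker_slot;
        *lock.lock().unwrap() = Some(value);
        // Wake up the thread blocked in join, if there is one.
        cvar.notify_one();
    });
    SketchJoinHandle { slot }
}
```

Calling sketch_spawn(|| 40 + 2).join() blocks until the closure has run and then returns its result. In the browser, this blocking step is exactly what the main thread is not allowed to do.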
While this is inconvenient, it is not a big problem. The web page’s main thread needs to handle the UI updates, so we probably shouldn’t block it anyway. If multithreading is needed in the WASM module, it makes sense to first initialize it in a Web Worker and use it through a Remote Procedure Call (RPC) pattern from the main thread with async/await.
After fixing the main thread blocking issue, we quickly observe that a deadlock is created when calling join, and the worker is never started.
This is because in most browsers (tested in Chrome/Edge/Firefox), workers don’t start executing immediately after construction, but are queued up in the event loop. Therefore, we must wait until the worker starts executing the closure before we can start blocking.
This requires us to interface with the event loop with a Promise that resolves when the worker is ready, something like:
    ///// main thread
    const promise = new Promise((resolve) => {
        const worker = new Worker('worker.js');
        worker.onmessage = (e) => {
            // the worker posts 1 once its script has started executing
            if (e.data === 1) {
                resolve();
                worker.postMessage(/*...*/);
            }
        };
    });
    promise.then(() => {
        // start blocking
    });

    ///// worker.js
    importScripts(/* wasm_bindgen output */);
    self.onmessage = async (e) => {
        const { /*...*/ } = e.data;
        // initialize wasm module and shared memory
        await wasm_bindgen(/*...*/);
        // calling into wasm to execute the closure
        await wasm_bindgen.__worker_main(/*...*/);
    };
    self.postMessage(1);
As it turns out, it’s not just the Worker constructor that queues up the execution in the event loop. postMessage also doesn’t make the other side receive the message immediately. Essentially, we run into a dilemma.
The solution above to problem 2 has 2 major problems that I don’t like:
1. It requires spawn and join to be async, which propagates and makes everything async in the WASM module. This requires interop with JavaScript’s Promise (for example, using wasm-bindgen-futures), makes the API more cumbersome, and doesn’t feel like std::thread.
2. The way the Worker constructor and postMessage work in the browser defeats multithreading entirely. If everything is properly synchronized, the threads can only run one at a time.

When I realized this, I stopped and went back to the drawing board to rethink the design.
And the solution? Don’t use postMessage!
When the worker is created, we have to use postMessage to initiate the communication. But once the WASM module is initialized, we can start using shared memory to communicate the rest to the worker, which does not have the same restrictions with regard to the event loop.
So, I came up with the Dispatcher. It is a dedicated Web Worker that is just used to spawn threads. A one-time cost is paid to create the dispatcher and wait for it to be ready using the event loop.
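Once the dispatcher is ready, the spawn and join flow is as follows:
1. The spawning thread calls spawn with a closure and can immediately block
2. The dispatcher receives the request, starts a Worker to run the closure, then calls recv and blocks again
3. When the closure is done, the worker signals the JoinHandle in the spawning thread to unblock it.

This is the final design that I went with.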
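The Dispatcher flow can be sketched on a native target with std channels: one long-lived thread blocks on recv, starts a thread per request, and the result travels back over a one-shot channel that plays the role of the JoinHandle. This is a minimal sketch of the idea, not the library's implementation; SketchDispatcher and SketchHandle are hypothetical names, and native threads stand in for Web Workers.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// A spawn request: a boxed closure for the dispatcher to start.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical handle to the dispatcher thread.
pub struct SketchDispatcher {
    requests: Sender<Job>,
}

/// Hypothetical join handle: the receiving end of a one-shot result channel.
pub struct SketchHandle<T> {
    result: Receiver<T>,
}

impl<T> SketchHandle<T> {
    /// Block until the spawned closure has sent its result.
    /// Returns None if the closure panicked before sending.
    pub fn join(self) -> Option<T> {
        self.result.recv().ok()
    }
}

impl SketchDispatcher {
    /// Pay the one-time cost: start the dispatcher and return a handle to it.
    pub fn start() -> Self {
        let (requests, inbox) = channel::<Job>();
        thread::spawn(move || {
            // Block on recv, start one thread per request, then block again.
            // The loop ends when every Sender has been dropped.
            while let Ok(job) = inbox.recv() {
                thread::spawn(job);
            }
        });
        SketchDispatcher { requests }
    }

    /// Send the closure to the dispatcher; the caller may block right away.
    pub fn spawn<T, F>(&self, f: F) -> SketchHandle<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = channel();
        let job: Job = Box::new(move || {
            // The handle may already have been dropped; ignore that case.
            let _ = tx.send(f());
        });
        self.requests.send(job).expect("dispatcher thread has stopped");
        SketchHandle { result: rx }
    }
}
```

With this shape, SketchDispatcher::start() is called once, and every later spawn only touches channels, so the caller never has to yield to an event loop before blocking in join.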
Because each new thread (i.e. Worker) requires initializing the WASM module and asynchronous communication via postMessage, it is VERY slow to spawn a new thread. In my testing, it could take hundreds of milliseconds.
However, after the threads are up and running, sending messages between them using channels is very fast. This is because we no longer rely on postMessage. The speed is dependent on how the Web Workers are scheduled by the browser/runtime.
Fortunately, the same is true for threads on any platform, and a solution already exists.
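One widely used way to amortize spawn cost on any platform is to start a fixed set of threads once and feed them work through a channel. Whether this is the solution the text has in mind is an assumption on my part. The sketch below uses only std, and SketchPool and sum_of_squares are hypothetical names.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Task = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool: threads are spawned once and then reused.
pub struct SketchPool {
    queue: Option<Sender<Task>>,
    workers: Vec<JoinHandle<()>>,
}

impl SketchPool {
    pub fn new(size: usize) -> Self {
        let (queue, inbox) = channel::<Task>();
        // Receiver is not Sync, so the workers share it behind a mutex.
        let inbox = Arc::new(Mutex::new(inbox));
        let workers = (0..size)
            .map(|_| {
                let inbox = Arc::clone(&inbox);
                thread::spawn(move || loop {
                    // The lock is released at the end of this statement.
                    let next = inbox.lock().unwrap().recv();
                    match next {
                        Ok(task) => task(),
                        Err(_) => break, // queue closed: shut down
                    }
                })
            })
            .collect();
        SketchPool { queue: Some(queue), workers }
    }

    /// Queue a task; no new thread is created.
    pub fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.queue.as_ref().unwrap().send(Box::new(f)).unwrap();
    }

    /// Close the queue and wait for the workers to drain it.
    pub fn shutdown(mut self) {
        drop(self.queue.take());
        for worker in self.workers.drain(..) {
            worker.join().unwrap();
        }
    }
}

/// Example: sum the squares of 1..=n on a pool of 4 threads.
pub fn sum_of_squares(n: u64) -> u64 {
    let pool = SketchPool::new(4);
    let (tx, rx) = channel();
    for i in 1..=n {
        let tx = tx.clone();
        pool.execute(move || tx.send(i * i).unwrap());
    }
    drop(tx);
    pool.shutdown();
    rx.iter().sum()
}
```

The expensive part (four spawns) is paid once in SketchPool::new; every call to execute afterwards only sends a boxed closure over a channel.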
It is worth noting that Firefox limits the number of workers per domain to 20 by default, which could be lower than the number of cores. The Dispatcher design allows extra workers to be queued up and started when the previous worker is done. However, if the limit is reached and all workers are blocked by something needed in an extra worker, a deadlock will happen.
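The starvation scenario can be reproduced natively with a capped pool. In this sketch, the first jobs in the queue all wait for a flag that only the last queued job sets, and the pool has exactly as many threads as there are waiters. A timeout is added purely so the sketch terminates; without it, this would be the deadlock described above. The function name starved_waiters and its parameters are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::channel;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Queue `limit` waiting jobs plus one job that would release them, on a
/// pool capped at `limit` threads. Returns how many waiters gave up.
pub fn starved_waiters(limit: usize, patience: Duration) -> usize {
    let (queue, inbox) = channel::<Job>();
    let inbox = Arc::new(Mutex::new(inbox));
    let released = Arc::new((Mutex::new(false), Condvar::new()));
    let gave_up = Arc::new(AtomicUsize::new(0));

    // The waiters are queued first and occupy every thread in the pool.
    for _ in 0..limit {
        let released = Arc::clone(&released);
        let gave_up = Arc::clone(&gave_up);
        let job: Job = Box::new(move || {
            let (flag, cvar) = &*released;
            let guard = flag.lock().unwrap();
            // Wait until the flag is set or the patience runs out.
            let (_guard, timeout) = cvar
                .wait_timeout_while(guard, patience, |done| !*done)
                .unwrap();
            if timeout.timed_out() {
                gave_up.fetch_add(1, Ordering::SeqCst);
            }
        });
        queue.send(job).unwrap();
    }

    // The releasing job is queued last, behind the cap.
    let releaser = Arc::clone(&released);
    let job: Job = Box::new(move || {
        let (flag, cvar) = &*releaser;
        *flag.lock().unwrap() = true;
        cvar.notify_all();
    });
    queue.send(job).unwrap();
    drop(queue);

    // Start exactly `limit` threads; each runs queued jobs in order.
    let workers: Vec<_> = (0..limit)
        .map(|_| {
            let inbox = Arc::clone(&inbox);
            thread::spawn(move || loop {
                let next = inbox.lock().unwrap().recv();
                match next {
                    Ok(job) => job(),
                    Err(_) => break,
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
    gave_up.load(Ordering::SeqCst)
}
```

Because the releasing job cannot start until some waiter has finished, at least one waiter always runs out of patience. A practical takeaway is to avoid having every running thread wait on work that has not been started yet.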
Firefox also appears to report navigator.hardwareConcurrency as the number of physical cores, whereas Chrome reports it as the number of logical cores on CPUs with SMT/Hyperthreading.
Unwinding is Rust’s mechanism for recovering from panics. It’s not supported for the wasm32-unknown-unknown target, so the panic behavior is abort. This means any panic will leave the WASM module in an inconsistent state and it should not be used again.
The implementation puts one thread per worker, and panicking/aborting from a thread will also terminate that worker. So it’s safe to panic.
However, because there’s no unwinding, mutex guards will not poison when the thread panics. Instead, the guard is not dropped and any subsequent access to the mutex will deadlock.
This can be improved when the exception handling proposal becomes stable and enabled by default in most browsers, at which point we could catch the panic with unwinding and even send the panic payload back to the thread that called join.
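On native targets, std::thread already behaves this way: join returns Err with the panic payload. The sketch below shows what receiving that payload looks like in std today, as a picture of the behavior that unwinding would make possible. It says nothing about how the library would implement it, and join_with_message is a hypothetical helper.

```rust
use std::thread;

/// Run a closure on a thread and turn a panic into an error string.
pub fn join_with_message<T, F>(f: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::spawn(f).join().map_err(|payload| {
        // The payload is a Box<dyn Any + Send>; string panics carry
        // either a &'static str or a String.
        if let Some(s) = payload.downcast_ref::<&'static str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            String::from("non-string panic payload")
        }
    })
}
```

A caller can then distinguish a normal result from a panic without the whole process aborting, for example by matching on the returned Result.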
- The wasm-mt project
- The wasm-bindgen-rayon project, which helped me understand some prerequisites
- wasm-bindgen and wasm-bindgen-futures, which made this manageable