Implementing WebAssembly in Rust for fun and education!
If you're looking for an industry-strength Wasm runtime, look at Wasmtime.
A semi-regularly updated dev log.
It's spooky season!
And what could be spookier than ALL THE TESTS PASSING?
Well, plenty of things. But we're at a really interesting turning point: we now have to pay down the debts we incurred getting to this point. Luckily, we can rely on Wasm-R3 to give us apples-to-apples pure-Wasm replays for benchmarking. (For more see this post.)
I've been doing this for a few days now (I'm late to updating the dev log, for shame.)
But here's where we stand today:
This is a stack trace taken from running the `fib.wasm` replay. This takes
about 6 minutes to generate on my M1 Pro Max (2021). We spend 88.5% of the time
in `call_funcidx` (ahem); of that time, a little over half goes to manipulating
the stack. My current tack is to dial that data structure in as much as I can,
focusing on the largest contributors to slowdown first. This has yielded some
nice speedups already.
However! There are a few big changes looming:
- My instruction IR is massively inefficient. Considering that in-place interpretation is possible, I think we're likely to see speedups from refining it.
- My dispatch is pretty inefficient. We're in one big `for` loop with a `match`. I'd like to experiment with tail call dispatch per this blog post and/or treating the bytecode as a table of function pointers.
- Being able to modify the table of function pointers would enable some pretty interesting capabilities if I'm understanding this talk from mraleph.
- My "store" implementation is straight-up wrong. I'm getting away with it for now, but as soon as I start implementing a Real public interface with WASI support, I'll need to have fixed it.
- SIMD support.
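To make the dispatch point above concrete, here's a minimal sketch contrasting the two styles: one big loop with a `match`, versus treating the bytecode as a table of function pointers. The `Instr` enum and handler signatures are invented for illustration; uuasm's real IR is richer than this.

```rust
// Hypothetical, minimal sketch of the two dispatch styles. Names invented.

#[derive(Clone, Copy)]
enum Instr {
    Const(i64),
    Add,
    Halt,
}

// Style 1: one big loop with a `match`.
fn run_match(code: &[Instr]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for instr in code {
        match instr {
            Instr::Const(v) => stack.push(*v),
            Instr::Add => {
                let (rhs, lhs) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(lhs + rhs);
            }
            Instr::Halt => break,
        }
    }
    stack.pop().unwrap_or(0)
}

// Style 2: the bytecode *is* a table of (function pointer, argument) pairs.
type Handler = fn(&mut Vec<i64>, i64);

fn op_const(stack: &mut Vec<i64>, arg: i64) { stack.push(arg); }
fn op_add(stack: &mut Vec<i64>, _arg: i64) {
    let (rhs, lhs) = (stack.pop().unwrap(), stack.pop().unwrap());
    stack.push(lhs + rhs);
}

fn run_table(code: &[(Handler, i64)]) -> i64 {
    let mut stack = Vec::new();
    for (handler, arg) in code {
        handler(&mut stack, *arg);
    }
    stack.pop().unwrap_or(0)
}

fn main() {
    let program = [Instr::Const(2), Instr::Const(3), Instr::Add, Instr::Halt];
    let table = [(op_const as Handler, 2), (op_const as Handler, 3), (op_add as Handler, 0)];
    println!("{} {}", run_match(&program), run_table(&table));
}
```

Since the table is just data, swapping a function pointer at runtime (for tracing, patching, etc.) falls out for free, which is part of the appeal of the second style.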
In the near-term I'm also thinking about further refining the value stack -- moving from Rust's linked list to a flat `[u8; 0x10000]` with a `Box<*mut Stack>` pointer at the lowest position.
One last thing on my mind: I want to track test results and performance progress a la arewefastyet and Porffor's Test262 status displays. Having a display that lists Wasmtime, WAMR, Wasmer, Wazero, V8, et al's results alongside uuasm's would be motivational!
Well, writing a new stack was easier than expected. Integrating it, however...
Something I am noodling on: locals are kind of a copy-on-write view of a value stack (extending the existing stack with a number of non-parameter locals.) On write those values can be copied, but until that point they can be read off of the same stack as everything else.
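That copy-on-write idea might look something like the following sketch. Everything here (the `Frame` struct, field names) is invented to illustrate the shape of the idea, not uuasm code: reads come straight off the shared value stack, and the first write materializes a private copy of the locals.

```rust
// Hypothetical sketch of copy-on-write locals over a shared value stack.

struct Frame {
    stack: Vec<u64>,          // shared value stack; params/locals start at `base`
    base: usize,              // index of the first parameter/local
    copied: Option<Vec<u64>>, // locals copied out on first write
}

impl Frame {
    fn get_local(&self, idx: usize) -> u64 {
        match &self.copied {
            Some(locals) => locals[idx],
            // Until a write happens, locals are just a view of the stack.
            None => self.stack[self.base + idx],
        }
    }

    fn set_local(&mut self, idx: usize, value: u64) {
        // Copy-on-write: materialize the locals the first time we mutate.
        let locals = self
            .copied
            .get_or_insert_with(|| self.stack[self.base..].to_vec());
        locals[idx] = value;
    }
}

fn main() {
    let mut frame = Frame { stack: vec![10, 20, 30], base: 1, copied: None };
    println!("{}", frame.get_local(0)); // read straight off the stack
    frame.set_local(0, 99);             // triggers the copy
    println!("{}", frame.get_local(0));
}
```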
Ok. The time has come to fix our representation of values, and that means
changing how we represent our stack. For reference: today we represent values
as `enum Value`, with variants for `F32`, `I32`, `V128`, et al, each of which
holds the corresponding "native" type: `f32`, `i32`, `u128`, etc. Our "stack"
today is a vector of these values. This was a conscious decision on my part. I
wanted something "usefully wrong": an abstraction that could take uuasm some
distance but perhaps not quite all the way to a complete Wasm implementation.
So why change this now?
Two of the five remaining failing test suites involve floating point values.
One of them involves function references. Both of these have one thing in
common: the representation inside of `Value` is wrong.
In the case of floats, we have to do something called "NaN canonicalization" when floating point operations interact with the stack. This is much easier to do if we're storing floating point values as unsigned integers of corresponding size on the stack. It's a lot easier to trap on invalid operations before native machinery takes over, too. So that's one point.
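Here's a minimal sketch of what NaN canonicalization looks like when floats live on the stack as raw bits. The function name is invented; the canonical bit pattern `0x7fc0_0000` is the Wasm spec's canonical f32 NaN (quiet bit set, zero payload).

```rust
// Hypothetical sketch: canonicalize an f32 stored as its u32 bit pattern.
// Keeping values as bits (not f32) means the payload can't be silently
// rewritten by native float machinery between operations.

fn canonicalize_f32(bits: u32) -> u32 {
    if f32::from_bits(bits).is_nan() {
        0x7fc0_0000 // canonical NaN
    } else {
        bits
    }
}

fn main() {
    // A NaN with a junk payload collapses to the canonical encoding...
    println!("{:#010x}", canonicalize_f32(0x7fa0_dead));
    // ...while ordinary values pass through untouched.
    println!("{:#010x}", canonicalize_f32(1.5f32.to_bits()));
}
```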
For function references, not only do we have to store the function index that's
referenced, but also the originating module index of that function. The `elem`
suite includes a particular test: it checks that a child module can be linked
to a parent module, and that the child module's active `elem` can store
values into a table imported from the parent. When the test invokes exports on
the parent, it expects those functions to dispatch into the child module. A
reasonable test! I'm handwaving a bit about the solution, but suffice it to
say, we're not storing a `Value::RefFunc(FuncIdx(u32))` on the stack. (It might
be something more like a `u64` pointer into a global function table or a pair
of pointers against module and function tables, but I digress.)
Finally, for completeness' sake, it bears mentioning that the current stack representation is about as inefficient as you can get. Since values are a typed enum, we're paying the cost of the largest enum variant for every value on the stack plus the discriminant. Rust reports that each value costs 32 bytes (!!) which is not very cache-friendly.
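The 32-byte figure is easy to reproduce. The `Value` enum below mirrors the shape described above (it is a stand-in, not uuasm's actual definition): the `V128` variant needs 16 bytes, and on typical 64-bit targets `u128` is 16-byte aligned, so the discriminant pads the whole enum out to 32 bytes.

```rust
// Stand-in for the enum described in the text, to demonstrate its cost.
#[allow(dead_code)]
enum Value {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
    V128(u128),
}

fn main() {
    // 16 bytes of payload + 16 bytes of discriminant-plus-padding = 32.
    println!("{}", std::mem::size_of::<Value>());
}
```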
So, what do we do? Pack 'em up, pack 'em in.
Let me begin: instead of a vector of enums, we represent the stack as a byte buffer aligned to the size of the largest type (v128). From validation, we know the maximum stack size and stack height of a given block, so we can expand our byte buffer as we enter blocks. Pushing values becomes a matter of bringing the stack pointer into alignment with the new type, then writing the value into the byte buffer at that location. However, when popping, we have to have enough information about the last items on the stack that we can shift the pointer back past any padding we added for alignment.
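A sketch of the push side of this scheme, with invented names (uuasm's real stack does more bookkeeping): align the write cursor up to the value's natural alignment, copy the little-endian bytes in, and hand the padding count back to the caller so pops can later reclaim it.

```rust
// Hypothetical byte-buffer stack: values of different widths packed with
// alignment padding.

struct ByteStack {
    buf: Vec<u8>,
    ptr: usize, // next free byte
}

impl ByteStack {
    fn push_bytes(&mut self, bytes: &[u8]) -> usize {
        let align = bytes.len(); // 4, 8, or 16: natural alignment == size here
        let aligned = (self.ptr + align - 1) & !(align - 1); // round up
        let padding = aligned - self.ptr;
        if self.buf.len() < aligned + bytes.len() {
            self.buf.resize(aligned + bytes.len(), 0);
        }
        self.buf[aligned..aligned + bytes.len()].copy_from_slice(bytes);
        self.ptr = aligned + bytes.len();
        padding // caller records this so a pop can shift back past it
    }

    fn push_u32(&mut self, v: u32) -> usize { self.push_bytes(&v.to_le_bytes()) }
    fn push_u64(&mut self, v: u64) -> usize { self.push_bytes(&v.to_le_bytes()) }
}

fn main() {
    let mut stack = ByteStack { buf: Vec::new(), ptr: 0 };
    let pad1 = stack.push_u32(7); // ptr: 0 -> 4, no padding
    let pad2 = stack.push_u64(9); // ptr: 4 -> align to 8 -> write to 16; 4 pad bytes
    println!("{} {} {}", pad1, pad2, stack.ptr);
}
```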
There are 7 alignment changes possible:
- `u32` → `u64`
- `u32` → `u128`
- `u64` → `u128`
- `u128` → `u64`
- `u128` → `u32`
- `u64` → `u32`
- (no change)
We can encode this in 3 bits. We have two bits for alignment transitions stepping from a smaller type to a larger type, plus one bit to indicate "reversal." The zero value indicates "no padding adjustment." Whenever we push a value, we also push this alignment change type to a separate stack. When we pop a value, we reverse the alignment change; based on the alignment change and the pointer value, we reclaim a number of alignment bytes.
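One possible concrete encoding of that 3-bit scheme, as a sketch (the constants and bit assignments here are my own invention, not necessarily uuasm's): two bits pick the forward step, one bit flags reversal, and zero means "no padding adjustment," so reversing a change is a single XOR.

```rust
// Hypothetical 3-bit encoding of the seven alignment transitions.
const NO_CHANGE: u8 = 0b000;
const STEP_32_64: u8 = 0b001;  // u32 → u64
const STEP_32_128: u8 = 0b010; // u32 → u128
const STEP_64_128: u8 = 0b011; // u64 → u128
const REVERSED: u8 = 0b100;    // OR'd in for u64→u32, u128→u32, u128→u64

fn reverse(change: u8) -> u8 {
    // Popping reverses whatever change the matching push recorded.
    if change == NO_CHANGE { NO_CHANGE } else { change ^ REVERSED }
}

fn main() {
    println!("{:#05b}", reverse(STEP_32_64));             // u64 → u32
    println!("{:#05b}", reverse(STEP_64_128 | REVERSED)); // back to u64 → u128
    println!("{:#05b}", reverse(NO_CHANGE));
}
```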
Labor day weekend!
Loose focus:
- Finish fixing table tests.
- Make a separate "typechecker" IR generator that passes through to an internal IR generator
- Maybe fix the parser? I'm so tempted.
"They say time is the fire in which we burn."
... which sums up how I'm feeling about getting the `if` opcode tests to 100% passing!
So. If you're keeping score at home, I've been working on implementing type
checking in the `ir` sub-crate for a week or two now. I could have saved myself
some time at the outset: the WebAssembly spec has a useful appendix entry which
describes how to implement the type checker. I spent about a day
or two trying to build my own checker from first principles before
stumbling on this appendix entry. The specification, unlike human anatomy, has
a pretty useful appendix.
In the process of implementing type checks, I learned something new about
constant expressions: I knew that `global.get` was a valid "constant"
instruction, but I did not know that there were additional requirements on
the referent: in particular, the target must be an immutable import. Neat.
(Module validation contains a lot of these little pearls of wisdom baked into
the specification. I feel like "This is actually my second rodeo" might be a
fitting motto for the specification.)
So what does the work look like right now? Well, a lot of `wasm-tools dump`,
`wasm-tools print`, and debug loglines. Since I'm spending a lot of time with
the parser and IR generator, I am (of course) starting to feel like the
abstractions aren't quite right. The IR generator's type checker bleeds into
the default IR generation, the default IR is clunky, and the parser both
communicates with an IR generator (good!) and constructs final parsed
productions (bad!) However: getting the tests to 100% is the first step, and
everything else is secondary.
In other news: in true crustacean fashion, Rust is getting underfoot. I'm feeling the pinch.
As I mentioned, I have a default IR generator, called with a `&mut self`
reference. It owns definitions of the types, locals, globals, and tables. These
are built iteratively, so several of them are behind `Option<Vec<[T]>>`-style
references. (Sometimes they're `Box<[T]>`.) Anyway, the type checker needs
access to this information when handling new instructions. As a result, I had a
lot of `self.<foo>_types.as_ref().map(|xs| xs as &[_]).unwrap_or_default()`
chains inlined as parameters to `type_checker.trace(instr)`. I thought I'd add
a method to get a `&[T]` record for these hairy types -- a little helper.
But you have to pay the crab tax for this. The borrow checker does a bunch of
work to make mutably borrowing one `self` attribute while immutably borrowing
adjacent `self` attributes work -- when inside of a single function. However,
it stops at method call boundaries, so helper functions that return references
make it impossible to also use `self` mutably. There are ways
around this but they're pretty heavyweight right now. (See more
recently on this.)
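A minimal reproduction of that crab tax, with invented names standing in for the real generator and checker. The commented-out call is the one the borrow checker rejects: the helper method's returned reference borrows all of `self`, so `self.checker` can't simultaneously be borrowed mutably. Inlining the field access makes the borrows disjoint and everything compiles.

```rust
// Hypothetical reduction of the helper-method borrow problem.

#[derive(Default)]
struct Checker { traced: usize }

impl Checker {
    fn trace(&mut self, types: &[u32]) { self.traced += types.len(); }
}

#[derive(Default)]
struct Generator {
    func_types: Option<Vec<u32>>,
    checker: Checker,
}

impl Generator {
    // The little helper: hides the Option-unwrapping chain...
    #[allow(dead_code)]
    fn func_types_slice(&self) -> &[u32] {
        self.func_types.as_deref().unwrap_or_default()
    }

    fn step(&mut self) {
        // ...but this does NOT compile: the helper's returned reference
        // borrows all of `self`, conflicting with the `&mut self.checker`:
        // self.checker.trace(self.func_types_slice());

        // Works: inline the field access so the borrows are per-field.
        let types = self.func_types.as_deref().unwrap_or_default();
        self.checker.trace(types);
    }
}

fn main() {
    let mut gen = Generator { func_types: Some(vec![1, 2, 3]), ..Default::default() };
    gen.step();
    println!("{}", gen.checker.traced);
}
```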
Kind of a long form update -- apologies -- but we're at 64 passing tests to 29 failing tests. The tide is turning! Soon we'll break all of the tests again by changing the value stack!
We're still knee-deep in module validation. In particular, type-checking instructions. In exacting detail: we have to make the type checker work with instruction sequences stored outside of functions.
Where might instruction sequences be stored if not in functions? A number of
places, as it turns out! Wasm uses sequences of instructions to represent
element values (arrays of function pointers), data and element offsets, and
global values. To support these constructions, the IR generator needs hooks
that bookend these expressions -- `start_element_value_expr`/`end_element_value_expr`, etc.
This is a long, long yak shave: I started this process more than a month ago. If I knew then what I know now: I'd try to drive more of the parser and validator generation from a table data structure. Something stored in a non-executable format (YAML? Ron? CSV? JSON?) Something that could be processed by a series of macros.
It'd maybe store:
- An instruction "page" and "subcode" -- where subcode is 0 for all single-byte instructions.
- Mnemonics for the instruction: `i32.add`, for example. Maybe add a column for acceptable aliases?
- A sequence of instruction args, to be grouped up into `nullary`, `unary`, `binary`, etc.
- A little mini-language for pre-condition / post-condition types (`i32 -> i32`, `(i32, i32, i32) -> ()`, etc.)
- Maybe inlining some implementation? `i32.add` might include something like `%0 = %1 + %2`? Or some way to include an implementation defined elsewhere in code? (I'm greenspunning something between yacc and LLVM IR, here.)
Anyway, I'm going to stop daydreaming about adding phi values before I talk myself into rewriting this into a pile of SSA-generating macros-- as motivating as I find that. There's a balance that needs to be struck between looking up at where we're going and down at one's feet and --for now, at least-- it's time to continue plodding forward.
I'm backfilling the dev log a bit here: let's talk a bit about this project and its goals.
I have three concrete, personal goals in starting this project:
- Create a webassembly project that doesn't start with "w" (hence the double-"u".)
- Learn more about WebAssembly by building a runtime soup-to-nuts: a parser, interpreter, formatters, etc.
- Learn more about how to structure interpreters in Rust.
My loosely-held hope is to develop this into a runtime that's focused on developer debugging. But that's a ways off.
I have been focused on passing the WebAssembly test suite. The current thrust is to implement WebAssembly validation so that I can restructure how the runtime models WebAssembly values.
I made a simplifying assumption early on that I've found is no longer valid --
representing values naively using a Rust enum (`Value::F32(f32)`.) My thinking
is that implementing validation will allow me to make assertions about the
types on the interpreter stack so that I can represent them as variable-length
bytes. Flattening the representation this way will help us pass floating point
NaN canonicalization tests (and some of the hairier `ref.func` tests.)
I work on this project for about half an hour to an hour most nights; progress is steady but by no means rapid.
Anyway, since I'm going to try to start working on this in the open, it's time to do some spring cleaning:
- I should delete the old `src/` directory. I recently split up the parser, IR, and runtime sections into crates.
- Rename the `nodes` crate to `ir`, because A) it's short and B) it's more descriptive.
- Move all of the `.wat` and `.wasm` that's committed to the repo into a `corpus` directory.
- Consider moving the tests into their own crate.
Still working on the items from yesterday (still! omg!)
Mid-breaking-up the crate into sub-crates, there was a strange set of errors
that suggested I had accidentally blown away my Rust stdlib? But I think
that was illusory. Fixing the improper cross-crate `impl <struct>`'s fixed the
problem.
Other notes:
- I think "nodes" might become an "ir" crate.
- here's the dumb IR idea: instructions are 32-bit; 12-16 bits are instr ids, the remaining 20-18 bits represent an index into an instruction arg table
- I wanna make room for this idea but not implement it yet because well, we still don't pass all tests yet, let alone having any benchmarks for the thing.
- the parser might keep state
- some of that state might be an InternMap (and we might make Name take a Cow/cow-like-struct)
- editor stuff: pressing tab or jumping paragraphs sometimes sends me on a wild goose chase and it's driving me up a wall. it seems lsp-related, and in particular I've read that snippets might cause the problem.
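The packed-instruction note above can be sketched in a few lines. The 14/18 bit split below is one arbitrary point inside the 12-16 / 20-18 range mentioned in the note; nothing about it is final.

```rust
// Hypothetical 32-bit packed instruction: low 14 bits = instr id,
// high 18 bits = index into an instruction arg table.

const ID_BITS: u32 = 14;
const ID_MASK: u32 = (1 << ID_BITS) - 1;

fn pack(instr_id: u32, arg_index: u32) -> u32 {
    debug_assert!(instr_id <= ID_MASK);
    debug_assert!(arg_index < (1 << (32 - ID_BITS)));
    (arg_index << ID_BITS) | instr_id
}

fn unpack(word: u32) -> (u32, u32) {
    (word & ID_MASK, word >> ID_BITS)
}

fn main() {
    let word = pack(0x6a, 1234);
    let (id, arg) = unpack(word);
    println!("{:#x} {}", id, arg);
}
```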
Up next:
- fix the `Take::advance` function (it currently faults on a subtraction)
- finish type section
- break up the workspace into crates: nodes, parser, runtime
- break the parser up into files
- target sub-productions to make it easier to write tests
- look into injecting parser middleware
- ask ourselves: is there an easy way to support "read me a vec of this type"?
Apache 2.0