6

Microvium Status Update – Coder Mike

 2 years ago
source link: https://coder-mike.com/blog/2020/06/08/microvium-status-update/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Microvium Status Update
June 2020

Microvium Status Update
June 2020

I’ve made a lot of progress on Microvium over the past month and will share some of the details here for those who are interested. I think it’s really close to being ready for an alpha release, with the garbage collector being the only remaining piece of work before it can start being used.

Summary

In no particular order, here are some of the recent developments, which I’ll discuss in more detail below for those who are interested.

  • Dynamically-sized arrays!
  • Bytecode disassembly (now you can see what bytecode Microvium is producing!)
  • Code coverage analysis of the native engine implementation
  • Control over integer overflow behavior

Additionally, Microvium now has a lot more “flesh” than it used to — a wide variety of scripts now run and compile correctly which didn’t a month ago, making the engine usable for real-world scripts.

For readers interested in the details, read on! For everyone else, thanks for stopping by, and see you next time!

Dynamic Arrays

Microvium now has the much-needed support for dynamically-sized arrays.

Arrays are simple to create. For example, the following code creates an array of 3 numbers:

arr = [1, 2, 3];

As with full JavaScript, you can change the length of the array simply by assigning to indexes beyond the current upper bound of the array, for example arr[3] = 4 to extend the array to 4 elements. Nice and simple. (You can also assign to the length property of the array, or call arr.push)

Arrays are implemented internally using a structure like the following diagram, where each square is a 2-byte word (this example array consumes 20 bytes of memory to hold 4 numbers):

An array consumes two allocations on the heap: one which has the metadata describing the array, and another allocation for the array elements, with the former pointing to the latter.

Originally, I was planning to use a single allocation, but the issue is that when arrays grow, they need to be reallocated, which involves updating all the pointers to them. This is possible, and I was brainstorming some ways of doing so, but in the end the cleaner solution seemed to be the one above with two distinct allocations, and only one of them needing to grow.

To make the growth of arrays efficient, arrays have both a “capacity” and a “length”. The capacity of an array is the number of available slots in memory before the array will need to be reallocated. The capacity is invisible to the user. The length is the .length property of arrays and accessible to the user.

For array literals, the allocated capacity is exactly the length of the literal array. For example, the literal [1, 2, 3] has both a length and a capacity of 3. Thereafter, every time an array’s capacity needs to grow, it will be reallocated with twice the previous capacity. In our previous example, extending the array from 3 to 4 elements causes the capacity to go from 3 to 6, resulting in 2 words of padding (the 2 gray squares in the diagram). Until adding a 7th element (or higher index), the capacity would not need to grow again.

Advanced readers may observe that the memory structure here has no space for non-index properties. This is an intended design limitation of Microvium, that you can’t add arbitrary properties to arrays as you can in JavaScript. For example, you can’t say arr.p = someValue. I think this is totally acceptable for most applications.

I’m quite happy with this design. Initially, the 8-byte (4-word) overhead really plagued me, but I’ve come to terms with it, and certainly, this design is very clean.

Bytecode Disassembly

A really great feature that I added to Microvium recently is the ability to decode snapshot bytecode and, in particular, to generate disassembly for bytecode files. This is great because it allows us to really get a close look at the structure of a snapshot bytecode file.

Let’s say we have the following script:

vmExport(42, run);
function run() {
  console.log('Hello, World');
}

We can get a bytecode disassembly for this by running the following command in the terminal:

microvium microvium script.js --map-file bytecode.txt 

I called it a “map-file” because it creates a human-readable map of the bytecode. You can see the generated output here if you’re interested.

The main thing is that it allows us to see byte-for-byte what’s occupying a bytecode file. Remember that these bytecode files fully describe the virtual machine state at the time the snapshot was taken, including all the variables, functions, and heap state. This is one of reasons I developed this feature: so I can see inside the working state of a virtual machine, taking snapshots at various points to see a complete breakdown of memory usage.

I think this feature will also help to demystify snapshots for end users, since it makes it easy to see what’s inside them.

Code Coverage Analysis

One of the concerns that was plaguing the back of my mind was that I had done a lot of development work for the native engine but had little idea how much of it was tested and working correctly. For something as critical as a virtual machine, this was unacceptable.

I thought that something like code coverage would probably resolve this. I have a test suite with a bunch of self-testing Microvium scripts, and it would be great for some off-the-shelf tool to tell me how much of my C code was exercised by these tests.

Unfortunately, I couldn’t find any off-the-shelf solutions that were adequate while not being out of my price range. But the solution I came up with is even better, if I may say so myself!

It’s not the prettiest, but you get used to the syntax quite quickly. Here’s a sample from the C code that shows what I’m talking about. This is the code that handles the ObjectSet opcode, which is for setting a property on an object. Take a look at the CODE_COVERAGE macro calls:

Basically, I’ve generously sprinkled these CODE_COVERAGE markers all over the code. When running in production, these marker macros don’t expand to anything and so don’t consume resources. When running in the testing environment, these macros expand to function calls which record that each marker is hit at runtime.

But wait! That’s not all!

I have a script which parses the C code in search of these markers, and automatically fills in information about each:

  • It automatically fills in the unique identifier for each marker (in the selected line in the example, this is 124). I used identifiers instead of just relying on line numbers because they remain consistent during code movement. The script automatically cleans up duplicate or missing identifiers.
  • It automatically appends the // Hit or // Not hit comment, depending on whether the marker is hit or not during the tests (and you can see in the screen capture that I’ve tweaked my IDE to colorize these distinctly).

But even better than typical code coverage tests, we extend the idea to test for value cases as well. For example, there is a bytecode opcode for loading some common literal values, and there are 8 different possible literals that it can produce, such as null, undefined, true, false, etc. (the exact values are not relevant to the point I’m about to make). These options might have been implemented in a switch statement (handling the null case, the true case, etc), in which case a CODE_COVERAGE marker on each switch case would tell us whether the particular literal value has been tested. However, it was more efficient to implement this as a table of literal values, with a simple lookup. To get test coverage data for this kind of implementation, I created a different macro which I call TABLE_COVERAGE:

The details are not important. What’s relevant here is that the test runner has identified that there are 8 values in the lookup table, and that the code is hitting all 8 of them (see the highlighted portion of the image above). This is really valuable information, to know that there exists some code in the test suite which is exercising all these options, even though they don’t appear as distinct code paths.

Integer Overflow Control

Microvium is all about being micro — having a small footprint in terms of RAM and ROM. When it comes to squeezing the most out of the ROM footprint of the engine itself, the overflow behavior of numbers in C and JavaScript is frustrating, to say the least.

In JavaScript, all number operations are conceptually performed on 64-bit floating point numbers. In a C implementation of these semantics, we don’t need to store the full 64-bit float. The Microvium engine represents the number 1.0 with a 14-bit integer and the number value 2_000_000_000 (2 billion) with a 32-bit integer in memory. It’s not until you exceed the signed 32-bit range that Microvium needs to resort to the full floating-point representation in memory, and expectation is that this will almost never happen in practice.

However, the challenge is that most operations on numbers theoretically need to check for overflow in order to correctly do the promotion from int to float in the rare case it occurs, and this takes up precious program space as well as degrading performance.

To make this more concrete with an example, consider that the smallest signed 32-bit integer is -0x80000000. If we negate this using the operator -x, in JavaScript we should get the number 0x80000000, but this is no longer within the signed 32-bit number space, and so in C, we would consider this to be an overflow. The overflow in C is technically undefined behavior, but in practice, -0x80000000 will almost always give us exactly what we started with, which is -0x80000000.

Let me say that again: in C, -(-0x80000000) == -0x80000000, when dealing with signed 32-bit integers.

This is the only integer in C that can overflow an int32 during negation (-). To implement the correct JavaScript semantics in Microvium, we need to have a test for exactly this case.

To make things worse, C does not provide any standard way of checking for overflow. GCC provides some language extensions for such, but a cross-platform engine can’t assume they exist.

If this were the only overflow case, then a few extra checks and branches wouldn’t be worth worrying about. But the reality is that many or most number operations can overflow, and they sometimes do so in cases or ways that you wouldn’t necessarily expect or care about.

I’ll briefly give 3 more examples to drive home the point:

Negating Zero

I said above that 0x80000000 is the only integer that can overflow in negation in C. But in Microvium, this isn’t true. In particular, in Microvium, negative zero is not an integer, because it is distinct from positive zero. The behavior of Microvium in the case of negative zero is best illustrated by the corresponding test case:

  // Without overflow checks enabled, the negation of integer zero is integer zero
  assertEqual(8 / -(1-1), overflowChecks ? -Infinity : Infinity);

As you can see above, if overflow checks are enabled, then negating integer zero results in the floating point -0, which when used in the denominator above will give us the answer -Infinity.

But if overflow checks are disabled, then the result of negating zero is still zero, as most C programmers would probably expect.

The reason for the (1-1) gymnastics is because -0 is treated as a literal with value negative zero, whereas -(1-1) is treated as the negating operator acting on integer zero.

Side note: You might think that division by any kind of zero is an overflow, but in Microvium, the naked division operator / is always a floating-point operator. There is a separate integer division opcode in Microvium which is syntactically represented as x / y | 0. I might discuss this more at a later stage.

Integer Addition

We all know that integer addition can overflow, but I bring this up as an example of how hard it can be to test for integer overflow. After a lot of research, the fastest way I could find to check for overflow on integer addition seemed to be presented in a comment on this blog post. Basically, if a = b + c, then overflow can be tested as:

if ( (a ^ c) & (b ^ c) ) < 0) /* overflow */;

On a 16-bit machine, doing four 32-bit operations and a branch is not the cheapest, when all you wanted to do was add two numbers together.

To make matters worse, this particular solution is determining overflow retroactively, which is technically undefined behavior. But pragmatically, I expect this to work in an overwhelming number of real-world C environments, so I’m using it anyway.

Right Shift

In JavaScript, the left and right shift operators (<<, >> and >>>) always yield a result that fits into the signed 32-bit range, except for one case, which is zero-fill right shift by zero (>>> 0) of a negative number . Take a look at these examples, and see if you can spot the odd one out (hint: it’s the last one!):

// For positive numbers
1 | 0   // = 1          Bitwise OR
1 & 1   // = 1          Bitwise AND
1 << 0  // = 1          Bitwise left shift
1 >> 0  // = 1          Bitwise sign-propagating right shift
1 >>> 0 // = 1          Bitwise zero-fill right shift

// For negative numbers
-1 | 0   // = -1          Bitwise OR
-1 & -1  // = -1          Bitwise AND
-1 << 0  // = -1          Bitwise left shift
-1 >> 0  // = -1          Bitwise sign-propagating right shift
-1 >>> 0 // = 0xFFFFFFFF  Bitwise zero-fill right shift

Since all other cases yield correct results when interpreting the result as if cast to a signed 32-bit integer, Microvium treats the last case an overflow. If overflow checks are disabled, the result of the last case will be -1 instead of 0xFFFFFFFF.

I can understand the reasoning behind the defined behavior in JavaScript, since doing a zero-fill right shift then always consistently produces an unsigned result, no matter how much you’re shifting by. But it breaks my intuition about how bitwise operators work in JavaScript. Why does 0xFFFFFFFF | 0 not also produce 0xFFFFFFFF (it produces -1)? Why is it such that the result of most bitwise operators is interpreted as if the 32nd bit was a sign bit, even when the operator has no special understanding of signedness? And then why break that rule with the zero-fill right shift, which is equally just a bit manipulation operation like bitwise AND and OR?

In Microvium overflow behavior is configurable

My solution in Microvium, if you haven’t already figured it out from some of these examples, is to make overflow checks optional. It’s a compilation flag presented as a #define in the port file, which is accessible specifically to the unit tests for reflection so it can determine the expected result of various operations.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK