A closer look at the interpreter
source link: http://blog.erlang.org/a-closer-look-at-the-interpreter/
In my previous post we had a look at BEAM, and now that we’re more familiar with it, it’s time for us to look at the reference implementation: the interpreter.
The interpreter can be thought of as an endless loop that looks at the current instruction, executes it, and then moves on to the next one.
In normal builds our code is laid out as directly threaded code, where each instruction consists of the machine code address of its handler, followed by its arguments, which are in turn followed by the next instruction:
Instruction address
First argument
Second argument
... and so on.
Instruction address
First argument
Second argument
... and so on.
When an instruction finishes it reads the next instruction address and jumps to it, forever following the “thread” of instructions.
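To make this concrete, here is a minimal sketch of a directly threaded loop (a toy example, not the actual BEAM sources; it relies on the same “labels as values” GCC extension used by the real interpreter):

/* A toy directly threaded interpreter: each "instruction" in the
 * code stream is the machine address of its handler, optionally
 * followed by arguments. Handlers never return; each one jumps
 * straight to the next handler's address. Requires GCC/Clang. */
#include <stdio.h>

typedef void *Word;

int main(void) {
    long acc = 0;

    /* Hypothetical code stream: load 41 into acc, add 1, halt. */
    Word code[] = {
        &&op_load, (Word)41L,
        &&op_inc,
        &&op_halt
    };
    Word *I = code;

    goto **I; /* dispatch the first instruction */

op_load:
    acc = (long)I[1]; /* read the argument from the stream */
    I += 2;           /* skip handler address and argument  */
    goto **I;         /* follow the "thread"                */

op_inc:
    acc += 1;
    I += 1;
    goto **I;

op_halt:
    printf("result: %ld\n", acc); /* prints 42 */
    return 0;
}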
With very few exceptions, the instructions are “pure” in the sense that they always do the same thing with the same input and that they only affect BEAM, either through changing the control flow or writing a result to a register. This makes the instructions very easy to read and reason about in isolation, so for the sake of brevity we’ll only have a look at one of them:
is_nonempty_list(Label, Src) {
    /* Check if our $Src is not a list. */
    if (is_not_list($Src)) {
        /* Invoke the $FAIL macro, jumping to our
         * $Label. */
        $FAIL($Label);
    }

    /* Execute the next instruction. */
}
The above is written in a domain-specific language (DSL) that is a superset of C, where $-prefixed macros are expanded and the rest is kept as-is. The beam_makeops script takes this definition and generates code for the parts that are cumbersome to write by hand, such as jumping to the next instruction. It also hides argument handling, which lets us reduce code size by packing small arguments together behind the scenes, something we’ve explored in an earlier post on the subject. For performance reasons it also generates different variants based on the argument types outlined in ops.tab, reducing the amount of work we need to do at runtime.
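As a rough sketch of the packing idea (the names and exact layout here are assumptions for illustration; the real scheme is generated by beam_makeops), two 32-bit arguments can share one 64-bit word so a single memory read fetches both:

#include <stdint.h>

/* Sketch: the source register offset lives in the upper 32 bits and
 * the (possibly negative) fail-label offset in the lower 32 bits,
 * matching the unpacking done by the generated code shown below. */
typedef uint64_t ArgWord;

static ArgWord pack_args(uint32_t src_offset, int32_t fail_offset) {
    return ((ArgWord)src_offset << 32) | (uint32_t)fail_offset;
}

static uint32_t unpack_src(ArgWord w)  { return (uint32_t)(w >> 32); }
static int32_t  unpack_fail(ArgWord w) { return (int32_t)w; }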
Let’s have a look at the generated code for this instruction, specifically the variant for X registers:
/* (This has been modified slightly for readability) */
case is_nonempty_list_fx:
{
    Eterm arg_word, term;

    /* Read the argument word from the instruction
     * stream. */
    arg_word = I[1];

    /* Unpack the offset of our source register (upper
     * 32 bits) and then read its contents.
     *
     * Note that we address the X registers directly;
     * had this instruction not been specialized, we
     * would first need to determine whether the
     * argument was an X or a Y register. */
    term = x_registers[arg_word >> 32];

    /* is_not_list(term) */
    if (term & (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)) {
        /* Unpack our fail label (lower 32 bits) and add
         * it to our current instruction pointer. This
         * is an offset and may be negative for backward
         * jumps. */
        I += (Sint32)arg_word;

        /* Jump to the instruction at the fail label using
         * the "labels as values" GCC extension. */
        goto *I;
    }

    /* Skip the current label address and argument word,
     * then jump to the next instruction. This was
     * automatically generated by beam_makeops. */
    I += 2;
    goto *I;
}
There’s a bit more to it, and those who would like to know more can read the documentation for the beam_makeops script, but the above covers the gist of it.
While the above is quite efficient, there’s significant overhead from dispatch and argument handling. Here’s a slightly altered assembly listing of the above:
; Read the argument word from the instruction
; stream.
mov rdx, [rbx + 8]

; Unpack the offset of our source register (upper
; 32 bits).
mov rcx, rdx
shr rcx, 32

; X registers live in machine register r15, and we
; placed our offset in rcx, so we can find our term at
; [r15 + rcx].
;
; Perform a non-destructive bitwise AND on the term
; using the `test` instruction, and jump to the fail
; label if the result is non-zero.
test byte [r15 + rcx], (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)
jne jump_to_fail_label

; Skip the current label address and argument word,
; then jump to the next instruction.
add rbx, 16
jmp [rbx]

jump_to_fail_label:
; Unpack our fail label (lower 32 bits) and add
; it to our current instruction pointer.
movsxd rdx, edx
lea rbx, [rbx + rdx * 8]

; Jump to the instruction at the fail label.
jmp [rbx]
The test and the conditional jump that follows it are the meat of the instruction; the rest is argument unpacking and instruction dispatch. While this is not much of a problem for large instructions, its effect on short ones like this is very large, and when looking through a profiler (e.g. perf) it’s not unusual for the final jmp to dominate the rest.
To lessen this effect, the loader combines commonly used instruction sequences into a single instruction. For example, two independent moves may fuse into move2_par, and is_nonempty_list followed by get_list might be fused into is_nonempty_list_get_list.
This reduces the cost of short instructions, but it only works to the extent that we’re able to identify common patterns, and it comes at a significant maintenance cost since each combination must be implemented by hand. Even so, the effect tends to be moderate, and the dispatch overhead remains significant.
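As a rough illustration of what such fusion buys (the tag scheme and names below are simplified assumptions for this sketch, not the actual loader output), a fused handler performs the type check and the destructuring with a single dispatch instead of two:

#include <stdint.h>

typedef uint64_t Eterm;

/* Simplified tag scheme for this sketch, mirroring the test used in
 * the listing above; the real definitions live in the emulator. */
#define _TAG_PRIMARY_MASK 0x3
#define TAG_PRIMARY_LIST  0x1

/* A cons cell is a head/tail pair; a list term is a tagged pointer
 * to one. (Assumed layout, for illustration only.) */
typedef struct { Eterm head, tail; } Cons;

/* Fused is_nonempty_list + get_list: one type check and one
 * destructure in a single handler, saving a dispatch and an
 * argument fetch. Returns 0 when the caller should take the
 * fail label. */
static int is_nonempty_list_get_list(Eterm term, Eterm *hd, Eterm *tl) {
    if (term & (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)) {
        return 0; /* not a list: jump to the fail label */
    }

    Cons *cell = (Cons *)(term - TAG_PRIMARY_LIST); /* remove the tag */
    *hd = cell->head;
    *tl = cell->tail;
    return 1;
}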
Another, albeit lesser, downside of the interpreter is that modern processors are heavily optimized for patterns commonly found in “ordinary” native code. For example, nearly all of them have a special branch predictor just for calls and returns: assuming that every call has a corresponding return lets it predict returns perfectly unless an exception is thrown. Since the interpreter does not use native calls and returns, it cannot benefit from this optimization.
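To illustrate with a toy sketch (not the interpreter’s actual calling convention), an interpreted call saves its continuation in a variable and “returns” via an indirect jump, so the processor never sees the matched call/ret pair that its return predictor relies on:

/* Toy interpreted call/return (GCC/Clang "labels as values"):
 * the "return" below is an indirect jump through a saved
 * continuation, invisible to the CPU's return-stack predictor. */
#include <stdio.h>

int main(void) {
    void *cp; /* interpreted continuation "register" */
    long result = 0;

    /* "Call" the subroutine: remember where to continue, then jump. */
    cp = &&after_call;
    goto subroutine;

after_call:
    printf("result: %ld\n", result);
    return 0;

subroutine:
    result = 42;
    goto *cp; /* interpreted "return": an indirect jump, not a ret */
}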
Unfortunately there’s not a whole lot that can be done about this, and after more than two decades of refinement it has become increasingly difficult to optimize the interpreter in meaningful ways.
Because of that our quest to improve performance has instead focused on two areas: improving the compiler and implementing a JIT. We’ve made great strides with both as of late, and are very proud to have finally merged the latter.
Stay tuned for our next post, where we’ll have a look at the JIT and see how it avoids these issues.