Why does an extraneous build step make my Zig app 10x faster?

March 19, 2024

11-minute read

For the past few months, I’ve been curious about two technologies: the Zig programming language and Ethereum cryptocurrency. To learn more about both, I’ve been using Zig to write a bytecode interpreter for the Ethereum Virtual Machine.

Zig is a great language for performance optimization, as it gives you fine-grained control over memory and control flow. To motivate myself, I’ve been benchmarking my Ethereum implementation against the official Go implementation.

At the beginning of this process, my hobby Ethereum Zig implementation underperformed the official Go implementation by about 40%.

Recently, I made what I thought was a simple refactoring to my benchmarking script, but my app’s performance tanked. I identified the relevant change as the difference between these two commands:

$ echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast
execution time:  58.808µs
$ echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/eth-zvm
execution time:  438.059µs

zig build run is just a shortcut command for compiling a binary and executing it. It should be equivalent to the following two commands:

zig build
./zig-out/bin/eth-zvm

How could an additional build step cause my program to run almost 10x faster?

Creating a minimal reproduction of the phenomenon

To debug the performance mystery, I tried simplifying my app until it was no longer a bytecode interpreter and was just a program that counted the number of bytes it read from stdin:

// src/main.zig

const std = @import("std");

pub fn countBytes(reader: anytype) !u32 {
    var count: u32 = 0;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => {
                return count;
            },
            else => {
                return err;
            },
        };
        count += 1;
    }
}

pub fn main() !void {
    var reader = std.io.getStdIn().reader();

    var timer = try std.time.Timer.start();
    const start = timer.lap();
    const count = try countBytes(&reader);
    const end = timer.read();
    const elapsed_micros = @as(f64, @floatFromInt(end - start)) / std.time.ns_per_us;

    const output = std.io.getStdOut().writer();
    try output.print("bytes:           {}\n", .{count});
    try output.print("execution time:  {d:.3}µs\n", .{elapsed_micros});
}

With the simplified app, I could still see the performance difference. When I ran the byte counter with zig build run, it ran in 13 microseconds:

$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes:           10
execution time:  13.549µs

When I ran the compiled binary directly, it took 12x as long to run, completing in 162 microseconds:

$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes
bytes:           10
execution time:  162.195µs

My test consisted of three commands in a bash pipeline:

  1. echo prints a sequence of ten hex-encoded bytes (0x00, 0x01, …).
  2. xxd converts echo’s hex-encoded bytes to binary-encoded bytes.
  3. zig build run compiles and executes my byte counter program, counting the number of binary-encoded bytes that xxd emitted.

The only difference between zig build run and ./zig-out/bin/count-bytes is that the first command recompiles the app before running it, whereas the second runs the already-compiled binary.

Again, I was dumbfounded.

How could an extra compilation step make the program run faster? Does a Zig app somehow run quicker when it’s fresh out of the oven?

Asking the Zig community for help

At this point, I was stumped. I had read my source code over and over, and I couldn’t understand how compiling and running an application could be faster than running the already-compiled binary.

Zig is still a new language, so there had to be something about Zig I’d misunderstood. Surely, if experienced Zig programmers looked at my code, they’d spot my error instantly.

I posted my question on Ziggit, a discussion forum for Zig. The first few responses said I had a problem with “input buffering,” but they didn’t offer concrete suggestions for fixing it or investigating further.

Andrew Kelly, Zig’s founder and lead developer, made a surprise appearance in the thread. He couldn’t explain the phenomenon I was seeing, but he pointed out that I was making a different performance mistake: I was doing a syscall for every byte I read from stdin.

Finally, my friend Andrew Ayer saw my post about this on Mastodon and solved the mystery: every command in a bash pipeline starts at the same time, so my benchmark was also measuring the time my program spent waiting for input to appear on stdin.

Andrew Ayer got it exactly right, and I’ll break it down below.

Sidenote: Andrew Ayer also had the key insight that solved my last performance mystery.

My mental model of bash pipelines is wrong

I had never thought too carefully about bash pipelines, but Andrew’s comment made me realize my mental model was wrong.

Imagine a simple bash pipeline like the following:

./jobA | ./jobB

My mental model was that jobA would start and run to completion and then jobB would start with jobA’s output as its input.

My incorrect mental model of how jobs in a bash pipeline work

It turns out that all commands in a bash pipeline start at the same time.

The actual way that jobs in a bash pipeline work

To demonstrate parallel execution in a bash pipeline, I wrote a proof of concept with two simple bash scripts.

jobA starts, sleeps for three seconds, prints to stdout, sleeps for two more seconds, then exits:

#!/usr/bin/env bash

function print_status() {
    local message="$1"
    local timestamp=$(date +"%T.%3N")
    echo "$timestamp $message" >&2
}

print_status 'jobA is starting'

sleep 3

echo 'result of jobA is...'

sleep 2

echo '42'

print_status 'jobA is terminating'

jobB starts, waits for input on stdin, then prints everything it can read from stdin until stdin closes:

#!/usr/bin/env bash

function print_status() {
    local message="$1"
    local timestamp=$(date +"%T.%3N")
    echo "$timestamp $message" >&2
}

print_status 'jobB is starting'

print_status 'jobB is waiting on input'
while read line; do
  print_status "jobB read '${line}' from input"
done < /dev/stdin
print_status 'jobB is done reading input'

print_status 'jobB is terminating'

If I run jobA and jobB in a bash pipeline, exactly 5.009 seconds elapse between the jobB is starting and jobB is terminating messages:

$ ./jobA | ./jobB
09:11:53.326 jobA is starting
09:11:53.326 jobB is starting
09:11:53.328 jobB is waiting on input
09:11:56.330 jobB read 'result of jobA is...' from input
09:11:58.331 jobA is terminating
09:11:58.331 jobB read '42' from input
09:11:58.333 jobB is done reading input
09:11:58.335 jobB is terminating

If I adjust the execution so that jobA and jobB run in sequence instead of in a pipeline, only 0.008 seconds elapse between jobB’s starting and terminating messages:

$ ./jobA > /tmp/output && ./jobB < /tmp/output
16:52:10.406 jobA is starting
16:52:15.410 jobA is terminating
16:52:15.415 jobB is starting
16:52:15.417 jobB is waiting on input
16:52:15.418 jobB read 'result of jobA is...' from input
16:52:15.420 jobB read '42' from input
16:52:15.421 jobB is done reading input
16:52:15.423 jobB is terminating

Revisiting my byte counter

Once I understood that all commands in a bash pipeline run in parallel, the behavior I was seeing in my byte counter made sense:

$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes:           10
execution time:  13.549µs

$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes
bytes:           10
execution time:  162.195µs

It looks like the echo '00010203040506070809' | xxd -r -p part of the pipeline takes about 150 microseconds to deliver its output, and the compilation in zig build run takes at least that long.

By the time the count-bytes application actually begins in the zig build run version, it doesn’t have to wait for the previous jobs to complete: the input is already waiting on stdin.

With zig build run, there’s a delay before my application executes, so previous jobs in the pipeline have already completed by the time count-bytes starts.

When I skip the zig build step and run the compiled binary directly, count-bytes starts immediately and the timer begins. The problem is that count-bytes has to sit around waiting ~150 microseconds for the echo and xxd commands to deliver input to stdin.

When I run count-bytes directly, it has to wait around for ~150 microseconds until echo and xxd feed input to stdin.
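
One rough way to see this skew directly (my own check, not part of the original benchmarks; exact numbers will vary by machine) is to time both commands against an empty stdin, so that neither run waits on pipeline input:

# With /dev/null as stdin, neither command reads any bytes, so the
# difference between the two is pure compile-and-launch overhead.
$ time ./zig-out/bin/count-bytes < /dev/null
$ time zig build run -Doptimize=ReleaseFast < /dev/null

If zig build run’s wall-clock time exceeds the ~150 microseconds that echo and xxd need, the earlier pipeline stages will always finish before count-bytes starts its timer.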

Fixing my benchmark

Fixing my benchmark was simple. Instead of running my application as part of a bash pipeline, I split the preparation stage and the execution stage into separate commands:

# Convert the hex-encoded input to binary encoding.
$ INPUT_FILE_BINARY="$(mktemp)"
$ echo '60016000526001601ff3' | xxd -r -p > "${INPUT_FILE_BINARY}"

# Read the binary-encoded input into the virtual machine.
$ ./zig-out/bin/eth-zvm < "${INPUT_FILE_BINARY}"
execution time:  67.378µs

My benchmark dropped from the 438 microseconds I was seeing before down to just 67 microseconds.

Difference in measured performance of my Zig app after I fixed my benchmarking script

Applying Andrew Kelly’s performance fix

Recall that Andrew Kelly pointed out that I was doing one syscall for every byte I read.

var reader = std.io.getStdIn().reader();
...
while (true) {
    _ = reader.readByte() catch |err| switch (err) { // Slow! One syscall per byte
        ...
    };
    ...
}

So, every time my application called readByte in the loop, it had to halt execution, request an input read from the OS, and then resume when the OS delivered the single byte.
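
On Linux, one way to watch this happen (my own suggestion, not from the original investigation; it assumes strace is installed) is to trace the read syscalls while running the byte counter:

# With the unbuffered reader, every input byte appears as its own
# one-byte read(2) call in strace's output.
$ echo '00010203040506070809' | xxd -r -p | strace -e trace=read ./zig-out/bin/count-bytes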

The fix was simple. I had to use a buffered reader. Instead of reading a single byte at a time from the OS, I’d use Zig’s built-in std.io.bufferedReader, which causes my application to read large chunks of data from the OS. That way, I only have to make a fraction of the syscalls.

Here’s the entire change:

diff --git a/src/main.zig b/src/main.zig
index d6e50b2..a46f8fa 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -7,7 +7,9 @@ pub fn main() !void {
     const allocator = gpa.allocator();
     defer _ = gpa.deinit();

-    var reader = std.io.getStdIn().reader();
+    const in = std.io.getStdIn();
+    var buf = std.io.bufferedReader(in.reader());
+    var reader = buf.reader();

     var evm = vm.VM{};
     evm.init(allocator);
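
Applied to the standalone byte counter from earlier, the same fix looks roughly like this (a minimal sketch; in this version of Zig, std.io.bufferedReader wraps the reader in a 4096-byte in-memory buffer):

const std = @import("std");

pub fn main() !void {
    const in = std.io.getStdIn();
    // readByte() now pulls from an in-memory buffer, and the OS is
    // only consulted when the buffer runs dry.
    var buf = std.io.bufferedReader(in.reader());
    var reader = buf.reader();

    var count: u32 = 0;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        count += 1;
    }

    try std.io.getStdOut().writer().print("bytes: {}\n", .{count});
}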

I re-ran my example, and it sped up performance by another 11 microseconds, a modest 16% speedup.

$ zig build -Doptimize=ReleaseFast && ./zig-out/bin/eth-zvm < "${INPUT_FILE_BINARY}"
execution time:  56.602µs

Buffering input reads increased performance by another 16%.

Benchmarking a larger input

My Ethereum interpreter currently only supports a small subset of Ethereum’s opcodes. The most complex computation my interpreter can do at this point is add numbers together.

For example, here’s an Ethereum application that counts to three by pushing 1 to the stack three times and then adding the values together:

PUSH1 1    # Stack now contains [1]
PUSH1 1    # Stack now contains [1, 1]
PUSH1 1    # Stack now contains [1, 1, 1]
ADD        # Stack now contains [2, 1]
ADD        # Stack now contains [3]
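
Since PUSH1 is opcode 0x60 and ADD is 0x01 (standard EVM opcode values; this encoding is mine, not from the original post), the count-to-three program above corresponds to the hex string 6001600160010101, and I can run it through the interpreter the same way as the earlier inputs:

$ echo '6001600160010101' | xxd -r -p > "${INPUT_FILE_BINARY}"
$ ./zig-out/bin/eth-zvm < "${INPUT_FILE_BINARY}"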

The largest application I tested in my benchmarks was Ethereum bytecode that counted to 1,000 by adding 1s together.

After Andrew Kelly’s tip helped me reduce syscalls, my “count to 1,000” application’s runtime dropped from 2,024 microseconds to just 58 microseconds, a 35x speedup. I was now beating the official Ethereum implementation by almost a factor of two.

Buffering my input reads allowed my Zig implementation to run about 2x faster than the official Ethereum implementation on the largest Ethereum application in my test set.

Cheating my way to maximum performance

I was excited to see my Zig implementation finally outperforming the official Go version, but I wanted to see just how much I could leverage Zig to improve performance.

One common bottleneck in software is memory allocation. The program has to stop and wait for the OS to allocate RAM, which may involve shuffling around data to find enough contiguous space.

Zig has a memory allocator called the fixed buffer allocator. Instead of the memory allocator requesting memory from the OS, you provide the allocator a fixed buffer of bytes, and it uses only those bytes to allocate memory.
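
Here’s a minimal self-contained sketch of the pattern (my own example, separate from the interpreter; requests that exceed the buffer fail with error.OutOfMemory rather than falling back to the OS):

const std = @import("std");

pub fn main() !void {
    // All allocations come from this 2 KB stack buffer; the allocator
    // never makes a syscall to request memory from the OS.
    var buffer: [2048]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buffer);
    const allocator = fba.allocator();

    const values = try allocator.alloc(u32, 100); // 400 bytes: fits
    defer allocator.free(values);

    // A request larger than the remaining buffer fails immediately.
    if (allocator.alloc(u8, 4096)) |_| {
        unreachable; // a 4 KB request can never fit in a 2 KB buffer
    } else |err| {
        std.debug.print("allocation failed: {s}\n", .{@errorName(err)});
    }
}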

I can cheat my benchmarks by compiling a version of my Ethereum interpreter that’s limited to 2 KB of memory allocated from the stack:

diff --git a/src/main.zig b/src/main.zig
index a46f8fa..9e462fe 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -3,9 +3,9 @@ const stack = @import("stack.zig");
 const vm = @import("vm.zig");

 pub fn main() !void {
-    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
-    const allocator = gpa.allocator();
-    defer _ = gpa.deinit();
+    var buffer: [2000]u8 = undefined;
+    var fba = std.heap.FixedBufferAllocator.init(&buffer);
+    const allocator = fba.allocator();

     const in = std.io.getStdIn();
     var buf = std.io.bufferedReader(in.reader());

I call this a “cheat” because I’m optimizing for my specific benchmarks. There are certainly valid Ethereum programs that require more than 2 KB of memory, but I’m curious how fast I can go with this optimization.

Let’s see what performance looks like if I know my max memory requirement at compile time:

$ ./zig-out/bin/eth-zvm < "${COUNT_TO_1000_INPUT_BYTECODE_FILE}"
execution time:  34.4578µs

Cool! With a fixed memory buffer, my Ethereum implementation runs my “count to 1,000” bytecode in 34 microseconds, nearly 3x faster than the official Go implementation.

If I know the maximum memory requirements of my Ethereum interpreter at compile time, I can outperform the official implementation by 3x.

Conclusion

My takeaway from this experience is to benchmark performance early and often.

By adding a benchmarking script to my continuous integration and archiving the results, it was easy for me to identify when my measurements changed. Had I relegated benchmarking to a manual, periodic task, it would have been difficult for me to identify exactly what caused the difference in my measurements.

This experience also underscores the importance of understanding your metrics. Before hitting this bug, I hadn’t considered that my benchmark included the time waiting for other processes to fill stdin.

Source code

  • eth-zvm: My hobby Ethereum Virtual Machine, implemented in Zig
