LLVM is Smarter Than Me

Published on 2024-04-19

I was reading Algorithms for Modern Hardware recently, specifically the chapter on SIMD, and was impressed by auto-vectorization. The basic idea is that modern CPUs have so-called SIMD1 instructions that allow applying an operation to several values at once, which is much faster than performing the equivalent scalar instructions one at a time. Modern compilers can detect certain programming patterns where an operation is performed repeatedly on scalar values and group the operations to make use of the faster instructions.2

The book uses primarily C++ for its examples, but I was curious if the same pattern would work in Rust as well, without having to manually annotate anything. So I opened up the compiler explorer and typed in my first test, which looked like this:

#[no_mangle]
fn sum() -> u32 {
    (0..1000).sum()
}

To make sure I the compiler would feel safe to use SIMD instructions, I used -C opt-level=3 -C target-cpu=skylake, telling it that the target CPU supports them. But instead of SIMD instructions I got this assembly:

sum:
        mov     eax, 499500
        ret

Turns out I was outsmarted by the compiler, which used constant folding to directly return the final value it computed at compile time. To avoid this optimization from happening we can pass an argument into the function, so that the compiler cannot know the final result:

#[no_mangle]
fn sum(n: u32) -> u32 {
    (0..n).sum()
}

Which generates this assembly:

sum:
        test    edi, edi
        je      .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        shr     rcx
        lea     eax, [rdi + rcx]
        dec     eax
        ret
.LBB0_1:
        xor     eax, eax
        ret

Still there are no SIMD instructions in here, which would usually start with the letter v, such as vpaddd which adds integers in parallel. To compare, I wrote the equivalent program in C++:

unsigned int sum(unsigned int n) {
    unsigned int rv = 0;
    for (unsigned int i = 0; i < n; i++) {
        rv += i;
    }
    return rv;
}

Compiler explorer defaults to GCC to compile C++. I used the equivalent -O3 -march=skylake, and sure enough, I got back 106 lines of assembly (omitted for brevity), including the desired SIMD instructions. I then switched to Clang, the LLVM-based C/C++ compiler that is also used by Rust, and it produced the exact same assembly as it did for the Rust version. This tells us the difference is not one of languages, but one of compilers.3

A quick comparative benchmark revealed the LLVM version to be significantly faster than GCC's vectorized loop, especially for large values of n, which is unsurprising when looking at the instructions, about 10 scalar instructions will beat hundreds of loops over vector instructions.

The key here is that the sum of consecutive integers from zero to n has a closed form solution, which I only realized after looking more closely at the assembly:

∑i=0ni=n(n−1)2∑i=0ni=n(n−1)2

LLVM apparently detects that that is exactly what we are trying to do, and as a result it can do away with the looping altogether and directly calculate the result in one step, changing sum from O(n) to O(1). Colour me impressed.

Finally, replicating the book's example more closely, I managed to get LLVM to automatically vectorize a loop for me, using the following code:

#[no_mangle]
fn sum(arr: &[u32]) -> u32 {
    arr.iter().sum()
}

This works because LLVM has no idea about the contents of the array, so it cannot deduce a more efficient way of calculating their sum than to actually add them all up.

My takeaway here is that LLVM is even better at generating optimal code than I would have expected, at least for trivial examples. I am happy that I can continue writing idiomatic code without feeling like I am trading off performance for readability.

LLVM is Smarter Than Me

LLVM is Smarter Than Me

Recommend

Navigating Big Data’s Impact on College Education and Student Life - Big Data An...

5000字 AIGC 爆肝测评！最新最全 AI 资讯及实用测评合集

How I Revamped Min/Max’s Combat Interface | UX Planet

Android开发中，SparseArray的高效存储与查找机制详解

Topic Clusters for SEO: What They Are & How to Create Them

探秘Android系统：dumpsys命令获取系统服务详细信息

LTX Studio 最新测评! 颠覆传统的一站式 AI 视频创作神器

How do I become a UX Designer?

One Big Thought

On authoring tools in EN 301 549

About Joyk