LLVM is Smarter Than Me
source link: https://blog.sulami.xyz/posts/llvm-is-smarter-than-me/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
LLVM is Smarter Than Me
I was reading Algorithms for Modern Hardware recently, specifically the chapter on SIMD, and was impressed by auto-vectorization. The basic idea is that modern CPUs have so-called SIMD1 instructions that allow applying an operation to several values at once, which is much faster than performing the equivalent scalar instructions one at a time. Modern compilers can detect certain programming patterns where an operation is performed repeatedly on scalar values and group the operations to make use of the faster instructions.2
The book uses primarily C++ for its examples, but I was curious if the same pattern would work in Rust as well, without having to manually annotate anything. So I opened up the compiler explorer and typed in my first test, which looked like this:
#[no_mangle] fn sum() -> u32 { (0..1000).sum() }
To make sure I the compiler would feel safe to use SIMD instructions, I used -C opt-level=3 -C target-cpu=skylake
, telling it that the target CPU supports
them. But instead of SIMD instructions I got this assembly:
sum: mov eax, 499500 ret
Turns out I was outsmarted by the compiler, which used constant folding to directly return the final value it computed at compile time. To avoid this optimization from happening we can pass an argument into the function, so that the compiler cannot know the final result:
#[no_mangle] fn sum(n: u32) -> u32 { (0..n).sum() }
Which generates this assembly:
sum: test edi, edi je .LBB0_1 lea eax, [rdi - 1] lea ecx, [rdi - 2] imul rcx, rax shr rcx lea eax, [rdi + rcx] dec eax ret .LBB0_1: xor eax, eax ret
Still there are no SIMD instructions in here, which would usually start with the
letter v
, such as
vpaddd
which adds
integers in parallel. To compare, I wrote the equivalent program in C++:
unsigned int sum(unsigned int n) { unsigned int rv = 0; for (unsigned int i = 0; i < n; i++) { rv += i; } return rv; }
Compiler explorer defaults to GCC to compile C++. I used the equivalent -O3 -march=skylake
, and sure enough, I got back 106 lines of assembly (omitted for
brevity), including the desired SIMD instructions. I then switched to Clang, the
LLVM-based C/C++ compiler that is also used by Rust, and it produced the exact
same assembly as it did for the Rust version. This tells us the difference is
not one of languages, but one of compilers.3
A quick comparative benchmark revealed the LLVM version to be significantly
faster than GCC's vectorized loop, especially for large values of n
, which is
unsurprising when looking at the instructions, about 10 scalar instructions will
beat hundreds of loops over vector instructions.
The key here is that the sum of consecutive integers from zero to n has a closed form solution, which I only realized after looking more closely at the assembly:
LLVM apparently detects that that is exactly what we are trying to do, and as a
result it can do away with the looping altogether and directly calculate the
result in one step, changing sum
from O(n) to O(1). Colour me impressed.
Finally, replicating the book's example more closely, I managed to get LLVM to automatically vectorize a loop for me, using the following code:
#[no_mangle] fn sum(arr: &[u32]) -> u32 { arr.iter().sum() }
This works because LLVM has no idea about the contents of the array, so it cannot deduce a more efficient way of calculating their sum than to actually add them all up.
My takeaway here is that LLVM is even better at generating optimal code than I would have expected, at least for trivial examples. I am happy that I can continue writing idiomatic code without feeling like I am trading off performance for readability.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK