
Learning SIMD with Rust by Finding Planets (2018)


Rust 1.27.0 has brought SIMD (Single Instruction Multiple Data), also known as vectorization, to stable Rust. If you read the announcement, you will see that SIMD should bring performance improvements to our applications, if we learn how to use it properly. But for that, let's first dive into how SIMD works.

If you feel fairly comfortable with Rust but are still having issues following this text, you might want to read my book about improving the performance of your Rust applications. It should give you all the background knowledge required for this read.

Imagine you have the following code:

let a = 10;
let b = 20;
let c = 30;
let d = 40;

let ab = a * b;
let cd = c * d;

As you can see, there are only 2 operations being performed with these variables: 2 multiplications (ab and cd). This will take at least 2 CPU instructions, depending on the CPU model, but with SIMD, we can do it in only one instruction.
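To make this concrete, here is a minimal sketch of those two multiplications done with a single SSE2 instruction (SSE2 is part of the x86_64 base instruction set, so no runtime check is needed; note that _mm_set_pd() puts its first argument in the high lane):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_mul_pd, _mm_set_pd};

fn main() {
    // Safe to call on x86_64: SSE2 is always available there.
    unsafe {
        // Pack (a, c) and (b, d) into two 128-bit registers.
        let left = _mm_set_pd(10.0, 30.0); // a = 10, c = 30
        let right = _mm_set_pd(20.0, 40.0); // b = 20, d = 40
        // One MULPD instruction computes both products at once.
        let products = _mm_mul_pd(left, right);
        // Unpacking yields the low lane first.
        let (cd, ab): (f64, f64) = std::mem::transmute(products);
        assert_eq!((ab, cd), (200.0, 1200.0));
    }
}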

You might be wondering why we should care about one or two instructions in our whole program, right? Well, we usually don't have only one multiplication in our code; most of the time we perform these operations in iterations, so it would be nice to run them in parallel, with one instruction for every 2, 4, 8 or even more of them.

Also, we sometimes have time or money constraints on our code, and we need to be able to run a high-performance implementation. Different kinds of SIMD instructions allow us to do that for our various operations.

Let’s learn how to do this with an example.

VSOP87 algorithm

If we want to know where the planets of the solar system are at a given point in time (around 2000 AD ± 4000 years, depending on the planet), a great tool we can use is the VSOP87 algorithm. This algorithm has 6 variants, but for our example here, we will just take the main variant into account.

The algorithm just computes a series of polynomials for each of the orbital elements of the planet in question. For each planet, we will get 6 orbital parameters that can easily be converted later to Keplerian orbital elements. We won't go into much detail about what parameters get generated, but you can learn more in the documentation for the parameters in my Rust VSOP87 library.

The 6 parameters are named after the variables given in the algorithm paper: a, l, k, h, q and p. We don't really need to know what they mean, but let's see how they get calculated:

a = a₀ + a₁t + a₂t² + a₃t³ + a₄t⁴
l = l₀ + l₁t + l₂t² + l₃t³ + l₄t⁴
k = k₀ + k₁t + k₂t² + k₃t³ + k₄t⁴
h = h₀ + h₁t + h₂t² + h₃t³ + h₄t⁴
q = q₀ + q₁t + q₂t² + q₃t³ + q₄t⁴
p = p₀ + p₁t + p₂t²

Many new things here, I know, but as you can see, the calculation is simple: only a few polynomials, computed only once. So even if we can optimize these calculations with SIMD, it doesn't make much of a difference, at least if we don't need to calculate the position of the planet many times per second, which could be a real use case in a simulation, for example.

In any case, let's see what those variables are. The t variable is the time variable: it tells the algorithm for what moment it needs to calculate the position of the planet. It's a Julian year, and can be calculated from a Julian date.

Then, for each variable we have from 3 to 5 coefficients, depending on the variable being calculated, and each coefficient gets multiplied by a different power of t. Those coefficients themselves also depend on the t variable, as we will see now.
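Since each variable is, in the end, just a polynomial in t, here is a minimal sketch of how that final evaluation could look, using Horner's rule (eval_poly is a hypothetical helper, and the coefficient values are made up; the real work in VSOP87 is computing the coefficients themselves):

/// Evaluates v₀ + v₁·t + v₂·t² + … using Horner's rule.
fn eval_poly(coefficients: &[f64], t: f64) -> f64 {
    coefficients.iter().rev().fold(0.0, |acc, &c| acc * t + c)
}

fn main() {
    // Made-up coefficients, for illustration only.
    let a = eval_poly(&[1.52, 0.001, -0.0002, 0.00003, -0.000004], 0.5);
    println!("a = {}", a);
}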

First, we should know that the VSOP87 algorithm provides some huge data sets of constants that are used in the calculation of those variables. For each coefficient (a₀, a₁, a₂, a₃, a₄, l₀, l₁, l₂…) we have one bi-dimensional matrix or array per planet. Each matrix has 3 columns and n rows. For example, here you can see the ones for Mars:

[Figure: part of the VSOP87 coefficient tables for Mars]

Then, to calculate each variable, we need to apply the following formula:

v = Σᵢ₌₁ⁿ Vᵢ₀ · cos(Vᵢ₁ + Vᵢ₂ · t)

Where v is one of a₀, a₁, a₂, l₀, l₁… and n is the number of rows in the matrix / array.

This formula might look a bit complex, but let's see what it's doing. For each row of each matrix / array, with its 3 elements (we call them Vᵢ₀, Vᵢ₁ and Vᵢ₂, or simply a, b and c in the code), we calculate a * (b + c * t).cos() (note that this is Rust notation), and then we just sum all of them. And this is where what we saw before comes in handy: this function can be optimized with SIMD, since we are performing multiple operations that could be done in parallel. Let's learn how to do it.

Optimizing variable generation with SIMD

SIMD is the general name for the parallel computing extensions implemented by different CPUs. In the case of Intel, we have the SSE and AVX implementations, each of them with different versions (SSE, SSE2, SSE3, SSE4, AVX, AVX2 and AVX-512), ARM has NEON instructions, and so on.

Rust enables SSE and SSE2 optimizations for x86 and x86_64 targets by default. These are pretty old, and any x86 processor in use today should handle them properly. In any case, these optimizations are done automatically by the compiler, which is not always as good at it as we programmers can be.

With Rust 1.27, we can use SSE3, SSE4, AVX and AVX2 manually in the stable channel. AVX-512 is not yet included in the standard library, but it should come soon enough. In any case, only specialized processors, and processors coming later this year, bring that instruction set.

If we want to use vectorization in our Rust code, we have to use the std::arch or core::arch modules (depending on whether we are using std or not). In them, we have modules for the different architectures. For this example, we will be using the AVX instruction set from the x86 and x86_64 sub-modules.

Why AVX, you might ask? Well, it has all the instructions we need to do the calculations 4 by 4 (we will be working with 64-bit floating point numbers), and we don't have access to AVX-512, which would allow 8 by 8 computations.

AVX has 256-bit registers that can perform 4 64-bit computations at the same time, or 8 32-bit computations, or 16 16-bit computations, or even 32 8-bit computations. We will be using 2 operations: multiplication and addition. AVX function names start with _mm256_, then the name of the operation (add, mul or abs, for example), and then the type they operate on (_pd for doubles, or 64-bit floats, _ps for 32-bit floats, _epi32 for 32-bit integers and so on).

We will therefore be using the _mm256_add_pd() and _mm256_mul_pd() functions in this example. We will also use a Rust macro that lets us compile the code for CPUs that don't support AVX, deciding at runtime whether to use AVX if it's supported. Let's start by defining the equation above with a nice Rust iterator:

#[inline]
fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    var.iter()
        .fold(0_f64, |term, &(a, b, c)| term + a * (b + c * t).cos())
}

I added the #[inline] attribute to ask the compiler to inline the function whenever possible, since it's just one expression. This will iterate through the V array, called var, and for each row it will add the result of a * (b + c * t).cos(), just what we need.
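As a quick sanity check, this is how the function could be called with a made-up two-row table (real VSOP87 tables contain hundreds or thousands of rows):

fn main() {
    // Hypothetical (a, b, c) rows, for illustration only.
    let table = [(1.5, 0.3, 2.1), (0.7, 1.2, 0.4)];
    let t = 0.5;

    // 1.5 * (0.3 + 2.1 * 0.5).cos() + 0.7 * (1.2 + 0.4 * 0.5).cos()
    let v = calculate_var(t, &table);
    println!("v = {}", v);
}

This will be compiled with some SSE2 optimizations, but we want to do more if AVX is detected. Let's see how to do it: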

#[inline]
#[allow(unsafe_code)]
fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    if is_x86_feature_detected!("avx") {
        // Safe because we already checked that we have
        // AVX instruction set.
        unsafe { calculate_var_avx(t, var) }
    } else {
        var.iter()
            .fold(0_f64, |term, &(a, b, c)| {
                term + a * (b + c * t).cos()
            })
    }
}

The is_x86_feature_detected!() macro will check at runtime whether the current CPU has the AVX instruction set. If it does, it will execute the unsafe calculate_var_avx() function. If not, it will just fall back to the default, non-AVX implementation. This makes the code portable: compile once, run everywhere.

Beware, using SIMD in Rust is unsafe, so make sure you check every line of code for safety, like you do in C++, right? ;)

Now, let’s first import some functions we will use. Note that some of this code will be much nicer once stdsimd gets stabilized.

use std::{f64, mem};

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;

Now, let’s define the SIMD function that will be called for each 4 elements:

unsafe fn vector_term(
    (a1, b1, c1): (f64, f64, f64),
    (a2, b2, c2): (f64, f64, f64),
    (a3, b3, c3): (f64, f64, f64),
    (a4, b4, c4): (f64, f64, f64),
    t: f64,
) -> (f64, f64, f64, f64) {
    unimplemented!()
}

This function, as you can see, receives 4 (aᵢ, bᵢ, cᵢ) tuples and the t variable. It will return the 4 intermediate terms after computing aᵢ * (bᵢ + cᵢ * t).cos() for each of the tuples. For that, we will follow the strategy of first computing cᵢ * t for the 4 tuples, then bᵢ + cᵢ * t, then (bᵢ + cᵢ * t).cos(), and finally multiplying aᵢ by the result of the cosine.

We will need to use core::arch::x86_64::__m256d as the type holding 4 f64, since the _mm256_add_pd() and _mm256_mul_pd() functions only understand that type. Let's see how to create those values:

let a = _mm256_set_pd(a1, a2, a3, a4);
let b = _mm256_set_pd(b1, b2, b3, b4);
let c = _mm256_set_pd(c1, c2, c3, c4);
let t = _mm256_set1_pd(t);

The _mm256_set_pd() function receives 4 f64 values and creates one __m256d. The _mm256_set1_pd() function just repeats the given f64 in the 4 positions of a newly created __m256d, so it's the same as _mm256_set_pd(t, t, t, t). Note that _mm256_set_pd() stores its first argument in the highest lane of the register.
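Here is a minimal sketch showing that lane order, since it will matter later when we unpack the vectors (the assert only runs on AVX-capable CPUs):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256d, _mm256_set_pd};

fn main() {
    if is_x86_feature_detected!("avx") {
        unsafe {
            let v: __m256d = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
            // Unpacking yields the lowest lane first, so the tuple is
            // reversed with respect to the argument order.
            let unpacked: (f64, f64, f64, f64) = std::mem::transmute(v);
            assert_eq!(unpacked, (4.0, 3.0, 2.0, 1.0));
        }
    }
}

So, now that we have the 4 vectors, let's start the computation: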

// Safe because both values are created properly and checked.
let ct = _mm256_mul_pd(c, t);
// Safe because both values are created properly and checked.
let bct = _mm256_add_pd(b, ct);

Here, ct will be the vector containing:

ct = (c₁·t, c₂·t, c₃·t, c₄·t)

Then, bct will add the 4 b variables to the vector, so bct will be this:

bct = (b₁ + c₁·t, b₂ + c₂·t, b₃ + c₃·t, b₄ + c₄·t)

Then, we need to compute the cosine of the 4 results, but Rust does not provide Intel's _mm256_cos_pd() intrinsic yet. This means that we will need to unpack the vector, calculate the 4 cosines one by one and then pack them again into a vector, so that we can multiply them by the a variables. Let's do it:

// Safe because bct_unpacked is 4 f64 long.
let bct_unpacked: (f64, f64, f64, f64) = mem::transmute(bct);

// Safe because bct_unpacked is 4 f64 long, and x86/x86_64 is little endian.
let bct = _mm256_set_pd(
    bct_unpacked.3.cos(),
    bct_unpacked.2.cos(),
    bct_unpacked.1.cos(),
    bct_unpacked.0.cos(),
);

Here, we have to take something into account: x86/x86_64 is a little endian architecture, and _mm256_set_pd() stores its first argument in the highest lane. This means that when we unpack the bct vector, the first element will be b₄ + c₄ * t, instead of b₁ + c₁ * t. That's why we take the cosines in reverse order when packing them back.

Finally, we can compute the terms:

// Safe because both values are created properly and checked.
let term = _mm256_mul_pd(a, bct);
let term_unpacked: (f64, f64, f64, f64) = mem::transmute(term);

And we just return that tuple. Let's now define the calculate_var_avx() function. This function should receive the whole matrix and return the value of the given variable. The way to do it is to use the chunks() iterator on the array, so that we can get 4 rows at a time. Let's first see what the definition of the function would look like:

#[target_feature(enable = "avx")]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[allow(unsafe_code)]
unsafe fn calculate_var_avx(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    unimplemented!()
}

We are asking the Rust compiler to enable the AVX feature for this particular function. This means that the function must be an unsafe function: we need to check that the current CPU supports AVX before we can call it safely. If you remember from before, we were already doing that.

Then, we can iterate through the var array:

var.chunks(4)
    .map(|vec| match vec {
       &[(a1, b1, c1), (a2, b2, c2), (a3, b3, c3), (a4, b4, c4)] => {
            // The result is little endian in x86/x86_64.
            let (term4, term3, term2, term1) =
                vector_term((a1, b1, c1), (a2, b2, c2),
                            (a3, b3, c3), (a4, b4, c4), t);

            term1 + term2 + term3 + term4
        }
        _ => unimplemented!(),
    })
    .sum::<f64>()

As you can see, using the chunks() iterator, we get slices that we can pattern-match, which is possible since Rust 1.26. The first and obvious pattern is a chunk of 4 tuples that we can directly use in the vector_term() function we defined earlier. The issue with the chunks() iterator is that it will return non-complete chunks if the array length is not a multiple of the chunk size (4 in this case). That wouldn't happen with the exact_chunks() iterator, but that one would discard the extra tuples. At the end of the iterator, which yields the intermediate terms, we call sum() to add everything into an f64. Note that this final sum could also be SIMD-optimized by taking elements 8 by 8, adding them 4 by 4, then 2 by 2 and so on, but that's out of the scope of this explanation.
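A quick self-contained illustration of that behavior:

fn main() {
    let data = [1, 2, 3, 4, 5, 6, 7];
    // chunks(4) yields one full chunk of 4 and then a short chunk of 3.
    let lengths: Vec<usize> = data.chunks(4).map(|chunk| chunk.len()).collect();
    assert_eq!(lengths, vec![4, 3]);
}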

To manage those non-even chunks cases, we can do something like this:

&[(a1, b1, c1), (a2, b2, c2), (a3, b3, c3)] => {
    // The result is little endian in x86/x86_64.
    let (_term4, term3, term2, term1) = vector_term(
        (a1, b1, c1),
        (a2, b2, c2),
        (a3, b3, c3),
        (f64::NAN, f64::NAN, f64::NAN),
        t,
    );

    term1 + term2 + term3
}
&[(a1, b1, c1), (a2, b2, c2)] => {
    a1 * (b1 + c1 * t).cos() + a2 * (b2 + c2 * t).cos()
},
&[(a, b, c)] => a * (b + c * t).cos(),

For the case of 3 tuples, we can just pass NaN as the fourth tuple when calling vector_term() and discard the corresponding result. For the case of 2 tuples, we just compute the terms directly and let the compiler try to optimize it, and for one tuple, we compute it right away.

Benchmarks

We have learned how to use SIMD in our code, but is it worth it? We cannot improve what we cannot measure, so let's dive into benchmarking. Using criterion, we can compare the before and after of the change. We will need these lines in our Cargo.toml file:

[dev-dependencies]
rand = "0.5.2"
criterion = "0.2.3"[[bench]]
name = "vsop87"
harness = false

And then, in benches/vsop87.rs :

#[macro_use]
extern crate criterion;
extern crate rand;
extern crate vsop87;

use criterion::Criterion;
use rand::{thread_rng, Rng};

fn vsop87_mars(c: &mut Criterion) {
    let mut rng = thread_rng();
    c.bench_function("VSOP87 Mars", move |b| {
        b.iter(|| vsop87::mars(rng.gen_range(990930.5, 3912521.5)))
    });
}

criterion_group!(
    vsop87_benches,
    vsop87_mars
);
criterion_main!(vsop87_benches);
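Since we disabled the default harness, Criterion takes over, and the benchmarks run with the usual cargo bench command.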

Then, running the benchmarks first without AVX optimizations and later with them, I measured the difference on my i7-5600U.

After running benchmarks with all the variants and planets, the improvement is about 9% to 12%. And this was only optimizing part of the loop, and only with some AVX functions. AVX-512 should clearly improve this benchmark, and being able to compute the cosine in AVX should also help.

There are libraries such as faster and simdeez that can help you develop this kind of code for different situations. In the case of faster, though, it will generate SIMD code for the processor doing the compilation, which makes the code run fast on the machine it was compiled on, but can cause portability issues on other processors.

