Updated guide for using Neon intrinsics in Unity Burst

Unity recently released Burst 1.5, with a focus on the addition of Arm’s Neon intrinsics. Neon intrinsics let you specify the exact vector instructions you want, so you can get the most efficient code possible for processing workloads on Arm CPUs. While intrinsics are normally used from C/C++, Unity has brought them through to C#.

Figure 1: Example of Neon intrinsic, Multiply Accumulate

Even so, working out the best instructions for your aims can be tricky, so Arm produced a guide for using Neon intrinsics in Unity, along with an accompanying Unity project whose code is available. The guide also helps you structure Burst code so that the compiler uses Neon automatically, which gives you a large performance boost without having to get into the nitty-gritty of writing intrinsics yourself. Let’s take a look at some best practices to make the most of Neon intrinsics.

Auto-vectorization

With the Burst compiler, you don’t need to write the intrinsics yourself to get a performance gain. However, there are ways to help Burst improve performance further. For instance, you can adjust your data and loop structure to facilitate auto-vectorization and the large performance gain that comes with it.

To encourage the compiler to turn four scalar instructions into one Neon SIMD instruction, write short, simple loops with no breaks and with any function calls inlined. Additionally, use the [NoAlias] attribute on any pointers passed to a Burst function; in our case study on physics collisions, this made the function 4x faster.
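As a minimal sketch of such a loop (the class and method names are hypothetical, not taken from the guide, and the code assumes a Unity project with the Burst package installed and unsafe code enabled), a multiply-add over separate float arrays might look like this:

```csharp
using Unity.Burst;

[BurstCompile]
public static unsafe class VectorizableLoops
{
    // Hypothetical example: result[i] = a[i] * b[i] + c[i].
    // [NoAlias] promises the compiler that the pointers never overlap,
    // so it is free to load and store several elements at once.
    [BurstCompile]
    public static void MultiplyAdd(
        [NoAlias] float* a, [NoAlias] float* b, [NoAlias] float* c,
        [NoAlias] float* result, int count)
    {
        // Short body, no break, no calls that cannot be inlined.
        for (int i = 0; i < count; i++)
        {
            result[i] = a[i] * b[i] + c[i];
        }
    }
}
```

Whether Burst actually vectorizes a given loop depends on the target and compiler version, so it is worth confirming in the Burst Inspector rather than assuming it.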

Figure 2: Demo scene

A developer story

This sample case study zeroes in on physics collisions – hence the plain capsule and cuboid graphics above. Two different collision types were optimized here: Axis-Aligned Bounding Boxes (AABB) for character-wall collisions, and Radius-based for character-character collisions.

Handcrafting intrinsics that beat the compiler isn’t simple, but this case study demonstrates various approaches to do just that. Performance improvement is more than a target; it’s a process. In the measure-optimize cycle, you profile to see how long the routine takes, then make adjustments and time it again. Use Profile Analyzer, or add your own timing, to accomplish this.
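If you add your own timing, one possible shape for it is a small helper built on System.Diagnostics.Stopwatch. This is only a sketch: RoutineTimer and BestOfMilliseconds are hypothetical names, and taking the best of several runs is one choice among many for reducing noise.

```csharp
using System;
using System.Diagnostics;

public static class RoutineTimer
{
    // Runs the routine once to warm up, then `runs` more times,
    // and returns the shortest wall-clock time in milliseconds.
    public static double BestOfMilliseconds(Action routine, int runs = 10)
    {
        routine(); // warm-up so one-off compilation cost is not measured

        double best = double.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            routine();
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
        }
        return best;
    }
}
```

Timing the same routine before and after each change, with the same input data, gives you the numbers to compare in each pass of the cycle.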

Now you can turn your attention to making adjustments. In the case study, we moved from Burst jobs to Burst static functions, which made timing easier. In a final game, the asynchronicity of jobs is a great asset, even though it adds a layer of complexity to performance timing. For a real game, you’d use ProfilerMarker, ProfilerRecorder, and Profile Analyzer to time within jobs.

Here, though, the move to Burst static functions actually helped force the changes needed for auto-vectorization. Where jobs are typically set up to use NativeArrays of structs, Burst static functions make it simpler to use pointers to basic types. This breaks up the data into more easily vectorizable pieces. And once the [NoAlias] attribute is added to the pointers, it tells the compiler that the data the pointers refer to does not overlap.

In our case study, the performance of the normal Burst code was so strong that it required some very good Neon coding to beat it. To fully leverage Neon, the two different collision types each required proper structuring of data and logic.

Vectorization works best when four or eight objects can be compared simultaneously, so that the same operation is completed for all of them at once (with the appropriate Neon instruction). The updated guide takes you through examples for maintaining maximum performance.

Take a look at code from the wall collision example, with AABB comparison:

In plain code, directly:
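A minimal sketch of the plain approach follows. The Aabb struct, the PlainCollision class, and the 2D layout are illustrative assumptions, not the guide's actual code:

```csharp
public struct Aabb
{
    public float MinX, MinY, MaxX, MaxY;
}

public static class PlainCollision
{
    // Two axis-aligned boxes overlap only if they overlap on both axes.
    public static bool Overlaps(Aabb a, Aabb b)
    {
        return a.MinX <= b.MaxX && a.MaxX >= b.MinX
            && a.MinY <= b.MaxY && a.MaxY >= b.MinY;
    }

    // Flags every character that touches at least one wall.
    public static void CharactersVsWalls(Aabb[] characters, Aabb[] walls, bool[] hit)
    {
        for (int c = 0; c < characters.Length; c++)
        {
            hit[c] = false;
            for (int w = 0; w < walls.Length; w++)
            {
                if (Overlaps(characters[c], walls[w]))
                {
                    hit[c] = true;
                    break; // early exit is natural here, but it blocks vectorization
                }
            }
        }
    }
}
```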

From being called with a Burst static function:
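As a sketch of what a Burst static function for this test could look like, the version below takes one wall and a set of characters whose coordinates are stored in separate float arrays. BurstCollision, WallVsCharacters, and the parameter layout are hypothetical, and the code assumes the Burst package with unsafe code enabled:

```csharp
using Unity.Burst;

[BurstCompile]
public static unsafe class BurstCollision
{
    // Tests one wall against `count` characters. Each coordinate lives in
    // its own array, so the loop reads plain floats at a fixed stride.
    // The caller is assumed to have zeroed `hit` before the first wall.
    [BurstCompile]
    public static void WallVsCharacters(
        float wallMinX, float wallMinY, float wallMaxX, float wallMaxY,
        [NoAlias] float* minX, [NoAlias] float* minY,
        [NoAlias] float* maxX, [NoAlias] float* maxY,
        [NoAlias] uint* hit, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Non-short-circuit '&' keeps the body branch-free.
            bool overlap = (minX[i] <= wallMaxX) & (maxX[i] >= wallMinX)
                         & (minY[i] <= wallMaxY) & (maxY[i] >= wallMinY);
            hit[i] |= overlap ? 1u : 0u;
        }
    }
}
```

Calling this once per wall accumulates the flags without needing a break inside the loop.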

Or, through a Burst static function of Neon intrinsics’ instructions:
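A hedged sketch of the intrinsics route is shown below. It compares four characters against one wall per iteration using intrinsics from Unity.Burst.Intrinsics.Arm.Neon, with a scalar loop for the tail and for targets without Neon. The class name, method name, and data layout are assumptions for illustration and are not the guide's implementation:

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;
using static Unity.Burst.Intrinsics.Arm.Neon;

[BurstCompile]
public static unsafe class NeonCollision
{
    // Sets hit[i] to a nonzero value when character i overlaps the wall.
    // The caller is assumed to have zeroed `hit` before the first wall.
    [BurstCompile]
    public static void WallVsCharacters(
        float wallMinX, float wallMinY, float wallMaxX, float wallMaxY,
        [NoAlias] float* minX, [NoAlias] float* minY,
        [NoAlias] float* maxX, [NoAlias] float* maxY,
        [NoAlias] uint* hit, int count)
    {
        int i = 0;
        if (IsNeonSupported)
        {
            // Broadcast each wall bound into all four lanes.
            v128 wMinX = vdupq_n_f32(wallMinX);
            v128 wMinY = vdupq_n_f32(wallMinY);
            v128 wMaxX = vdupq_n_f32(wallMaxX);
            v128 wMaxY = vdupq_n_f32(wallMaxY);

            for (; i + 4 <= count; i += 4)
            {
                // Each comparison yields all-ones in a lane where it holds.
                v128 overlapX = vandq_u32(
                    vcleq_f32(vld1q_f32(minX + i), wMaxX),
                    vcgeq_f32(vld1q_f32(maxX + i), wMinX));
                v128 overlapY = vandq_u32(
                    vcleq_f32(vld1q_f32(minY + i), wMaxY),
                    vcgeq_f32(vld1q_f32(maxY + i), wMinY));
                v128 mask = vandq_u32(overlapX, overlapY);

                // Accumulate into the existing flags for these four characters.
                vst1q_u32(hit + i, vorrq_u32(vld1q_u32(hit + i), mask));
            }
        }

        // Scalar tail, and the whole range when Neon is unavailable.
        for (; i < count; i++)
        {
            bool overlap = (minX[i] <= wallMaxX) & (maxX[i] >= wallMinX)
                         & (minY[i] <= wallMaxY) & (maxY[i] >= wallMinY);
            hit[i] |= overlap ? 0xFFFFFFFFu : 0u;
        }
    }
}
```

The scalar path writes the same all-ones value as the vector path, so callers only need to check for nonzero.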

Check out the complete guide

To go through the full process, see an example in action, and determine how to apply it to your own project, please read the guide and look through the accompanying code. It can make a difference, with one of the optimizations even getting Burst code 6x faster than well-written, non-Burst code, and accelerating the handcrafted Neon code 10x.

For further information on Neon, visit Arm’s Neon microsite. While most of it is aimed at C/C++ intrinsics, the same principles apply. Additionally, be sure to take a look at the Unity list of implemented Neon intrinsics and Arm’s Neon intrinsics search engine.

This blog was co-authored by Ben Clark, a Developer Advocate at Arm.
