

Updated guide for using Neon intrinsics in Unity Burst
source link: https://blog.unity.com/games/updated-guide-for-using-neon-intrinsics-in-unity-burst


Unity recently released Burst 1.5, with a focus on the addition of Arm’s Neon intrinsics. Neon intrinsics let you specify precise vector commands to get the most efficient code possible for processing workloads on Arm CPUs. While they’re normally in C/C++, Unity has brought them through to C#.

Nonetheless, understanding the best commands for your aims can be tricky, so Arm produced a guide for using Neon intrinsics in Unity, and an accompanying Unity project with code available. This guide serves to help you structure Burst code for the automatic use of Neon, which gives you a large performance boost without having to get into the nitty-gritty of writing intrinsics yourself. Let’s take a look at some best practices to make the most of Neon intrinsics.
With the Burst compiler, you don’t need to put in the intrinsics yourself to maximize performance gain. However, there are ways that you can support Burst to further improve performance. For instance, you can adjust data and loop structure to facilitate auto-vectorization, along with the large performance gain that comes with it.
To encourage the compiler to fold four scalar operations into one Neon SIMD instruction, write short, simple loops with no breaks, and keep any function calls inside them inlined. Additionally, apply the [NoAlias] attribute to any pointers passed to a Burst function; in our case study on physics collisions, doing so made the function 4x faster.
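As a rough illustration of that loop shape, here is a minimal sketch in plain C rather than Burst C#. It assumes that C's `restrict` qualifier is a fair stand-in for [NoAlias]; the function name `add_scaled` and its parameters are hypothetical and do not come from the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not taken from the Arm guide.
   `restrict` promises the compiler that out, a and b never overlap,
   which is the same promise [NoAlias] makes for Burst pointers. */
void add_scaled(float *restrict out,
                const float *restrict a,
                const float *restrict b,
                float scale,
                size_t count)
{
    assert(out != NULL && a != NULL && b != NULL);

    /* Short, simple loop: no break, no early return, no opaque calls.
       Each iteration is independent, so an optimizing compiler is free
       to process several elements per SIMD instruction. */
    for (size_t i = 0; i < count; ++i) {
        out[i] = a[i] + scale * b[i];
    }
}
```

Without the no-overlap promise, the compiler has to assume that writing `out[i]` might change a later `a[i]` or `b[i]`, which can block vectorization or force extra runtime checks.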
This sample case study zeroes in on physics collisions – hence the plain capsule and cuboid graphics above. Two different collision types were optimized here: Axis-Aligned Bounding Boxes (AABB) for character-wall collisions, and Radius-based for character-character collisions.
Handcrafting intrinsics that will beat the compiler isn’t simple, but this case study demonstrates various approaches to do just that. Performance improvement is more than a target – it’s a process. Once you reach the measure-optimize cycle, you can profile to see how long the routine takes, then make adjustments and time it again. Use Profile Analyzer, or put in your own timing, to accomplish this.
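If you put in your own timing, the core of it is a small harness like the following sketch. It is written in plain C for illustration, not with Unity's profiling APIs; `time_routine_ms`, `workload_state`, and `sample_workload` are hypothetical names, and `clock()` measures CPU time at a coarse resolution, so treat the numbers as relative rather than absolute.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical harness: any routine that takes an opaque context pointer. */
typedef void (*routine_fn)(void *context);

/* Runs the routine `repeats` times and returns the average CPU time
   per call in milliseconds. */
double time_routine_ms(routine_fn routine, void *context, int repeats)
{
    assert(routine != NULL && repeats > 0);

    clock_t start = clock();
    for (int i = 0; i < repeats; ++i) {
        routine(context);
    }
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC
           / (double)repeats;
}

/* Stand-in workload so the harness has something to measure. */
typedef struct {
    int calls;    /* how many times the workload ran */
    double sink;  /* keeps the result alive so the loop is not optimized away */
} workload_state;

void sample_workload(void *context)
{
    workload_state *state = (workload_state *)context;
    double total = 0.0;
    for (int i = 0; i < 1000; ++i) {
        total += (double)i * 0.5;
    }
    state->sink = total;
    state->calls += 1;
}
```

Time the routine, change one thing, and time it again with the same repeat count; averaging over many repeats smooths out noise from a single run.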
Now you can turn your attention to making adjustments. In the case study, we moved from Burst jobs to Burst static functions, which made timing easier. In a shipping game, the asynchronicity of jobs is a great asset, even though it adds a layer of complexity to performance timing. For a real game, you'd use ProfilerMarker, ProfilerRecorder, and Profile Analyzer to time within jobs.
Here, though, the move to Burst static functions actually helped force the changes needed for auto-vectorization. Jobs tend to be set up around NativeArrays of structs, whereas with Burst static functions it is simpler to pass pointers to basic types. This breaks the data up into more easily vectorizable pieces. Adding the [NoAlias] attribute to those pointers then tells the compiler that the data they point to does not overlap.
In our case study, the performance of plain Burst was so strong that it took some very good Neon coding to beat it. To fully leverage Neon, the two collision types each required proper structuring of data and logic.
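To make the idea of pointers to basic types concrete, here is a minimal sketch of a character-versus-walls AABB test in plain C. The layout is an assumption for illustration: each wall field lives in its own float array, the boxes are 2D (x and z only), and `box2d` and `overlap_box_with_walls` are hypothetical names rather than code from the case study. `restrict` stands in for [NoAlias].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2D axis-aligned box for the moving character. */
typedef struct {
    float min_x, max_x, min_z, max_z;
} box2d;

/* Walls are passed as one array per field (structure of arrays) instead
   of an array of box structs, so consecutive walls sit next to each
   other in memory for every field. */
void overlap_box_with_walls(box2d character,
                            const float *restrict wall_min_x,
                            const float *restrict wall_max_x,
                            const float *restrict wall_min_z,
                            const float *restrict wall_max_z,
                            uint8_t *restrict hits,
                            size_t count)
{
    assert(hits != NULL);

    for (size_t i = 0; i < count; ++i) {
        /* Bitwise & instead of && keeps the loop body branch-free.
           Boxes that merely touch count as overlapping. */
        hits[i] = (uint8_t)((character.min_x <= wall_max_x[i]) &
                            (character.max_x >= wall_min_x[i]) &
                            (character.min_z <= wall_max_z[i]) &
                            (character.max_z >= wall_min_z[i]));
    }
}
```

With this layout, four consecutive walls can be loaded into one vector register per field, which is much harder when the fields of each wall are interleaved inside a struct.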
The vectorization works best when four or eight objects can be compared simultaneously, so that it completes the same operation for them at once (with the appropriate Neon command). The updated guide takes you through examples for maintaining maximum performance.
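Comparing four objects at once might look like the sketch below for the radius-based check. It is written in C with the standard intrinsics from Arm's `arm_neon.h`, not in Burst C#, and it falls back to an equivalent scalar loop when Neon is not available. The function name `radius_hits_x4`, the 2D positions, and the requirement that the count be a multiple of four are all assumptions made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Tests one circle (cx, cz, cr) against `count` other circles, four per
   iteration. hit_mask[i] is 0xFFFFFFFF when circle i overlaps, else 0. */
void radius_hits_x4(float cx, float cz, float cr,
                    const float *restrict other_x,
                    const float *restrict other_z,
                    const float *restrict other_r,
                    uint32_t *restrict hit_mask,
                    size_t count)
{
    assert(count % 4 == 0); /* simplification: no leftover elements */

    for (size_t i = 0; i < count; i += 4) {
#if defined(__ARM_NEON)
        /* Load four neighbours and subtract the broadcast centre. */
        float32x4_t dx = vsubq_f32(vld1q_f32(other_x + i), vdupq_n_f32(cx));
        float32x4_t dz = vsubq_f32(vld1q_f32(other_z + i), vdupq_n_f32(cz));
        float32x4_t reach = vaddq_f32(vld1q_f32(other_r + i), vdupq_n_f32(cr));

        /* Squared distance versus squared combined radius avoids a sqrt. */
        float32x4_t dist_sq = vaddq_f32(vmulq_f32(dx, dx), vmulq_f32(dz, dz));
        uint32x4_t hit = vcltq_f32(dist_sq, vmulq_f32(reach, reach));

        vst1q_u32(hit_mask + i, hit);
#else
        /* Scalar fallback with the same result per lane. */
        for (size_t lane = 0; lane < 4; ++lane) {
            float dx = other_x[i + lane] - cx;
            float dz = other_z[i + lane] - cz;
            float reach = other_r[i + lane] + cr;
            float dist_sq = dx * dx + dz * dz;
            hit_mask[i + lane] = (dist_sq < reach * reach) ? 0xFFFFFFFFu : 0u;
        }
#endif
    }
}
```

Each Neon line does for four neighbours what the scalar fallback does for one, and the comparison yields a per-lane mask rather than a branch, so the loop stays free of early exits.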
To go through the full process, see an example in action, and work out how to apply it to your own project, please read the guide and look through the accompanying code. It can make a real difference: one of the optimizations made Burst code 6x faster than well-written, non-Burst code, and the handcrafted Neon code 10x faster.
For further information on Neon, visit Arm’s Neon microsite. While most of it is aimed at C/C++ intrinsics, the same principles apply. Additionally, be sure to take a look at the Unity list of implemented Neon intrinsics and Arm’s Neon intrinsics search engine.
This blog was co-authored by Ben Clark, a Developer Advocate at Arm.