Three fundamental flaws of SIMD

According to Flynn’s taxonomy, SIMD refers to a computer architecture that can process multiple data streams with a single instruction (i.e. “Single Instruction stream, Multiple Data streams”). There are different taxonomies, and within those several different sub-categories and architectures that classify as “SIMD”. In this post, however, I refer to the type of SIMD that is most common in contemporary consumer-grade instruction set architectures, and which most people think of when they hear the term “SIMD”: packed SIMD.

Packed SIMD

The common trait of packed SIMD architectures is that several data elements are packed into a single register of a fixed width. A 128-bit packed SIMD register can be configured in several ways.

For instance, a 128-bit register can hold sixteen 8-bit integers or four single-precision floating-point values.
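To make the packing concrete, here is a minimal sketch in portable C that models a 128-bit register as a union. The type `packed128_t` and the helpers `add_f32x4` and `add_u8x16` are hypothetical names made up for this illustration, and the inner loops stand in for what real hardware would do in a single instruction.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit packed SIMD register: the same
   16 bytes can be viewed as lanes of different element types. */
typedef union {
    uint8_t  u8[16];  /* sixteen 8-bit integers */
    uint16_t u16[8];  /* eight 16-bit integers */
    uint32_t u32[4];  /* four 32-bit integers */
    float    f32[4];  /* four single-precision floats */
} packed128_t;

/* Lane-wise add of four floats: one "instruction", four operations. */
packed128_t add_f32x4(packed128_t a, packed128_t b) {
    packed128_t r;
    for (int lane = 0; lane < 4; ++lane)
        r.f32[lane] = a.f32[lane] + b.f32[lane];
    return r;
}

/* Lane-wise add of sixteen bytes; each lane wraps around on overflow
   independently of its neighbours. */
packed128_t add_u8x16(packed128_t a, packed128_t b) {
    packed128_t r;
    for (int lane = 0; lane < 16; ++lane)
        r.u8[lane] = (uint8_t)(a.u8[lane] + b.u8[lane]);
    return r;
}
```

Note that the element type is not a property of the register itself. It is chosen by the instruction that operates on it, which is one reason packed SIMD ISAs need so many opcodes.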

This type of SIMD architecture has been wildly popular since the mid-1990s. Some packed SIMD ISAs are:

x86: MMX, 3DNow!, SSE, SSE2, …, AVX, AVX2, AVX-512
ARM: ARMv6 SIMD, NEON
POWER: AltiVec (a.k.a. VMX and Velocity Engine)
MIPS: MDMX, MIPS-3D, MSA, DSP
SPARC: VIS
Alpha: MVI

The promise of all those ISAs is increased data processing performance, since each instruction executes several operations in parallel. However, there are problems with this model.

Flaw 1: Fixed register width

Since the register size is fixed, there is no way to scale the ISA to new levels of hardware parallelism without adding new instructions and registers. Case in point: MMX (64 bits) vs SSE (128 bits) vs AVX (256 bits) vs AVX-512 (512 bits).

Adding new registers and instructions has many implications. For instance, the ABI must be updated, and support must be added to operating system kernels, compilers and debuggers.

Another problem is that each new SIMD generation requires new instruction opcodes and encodings. In fixed-width instruction sets (e.g. ARM) this may prohibit any new extensions, since there may not be enough opcode slots left for adding the new instructions. In variable-width instruction sets (e.g. x86) the effect is typically that instructions get longer and longer (effectively hurting code density). Paradoxically, each new SIMD generation essentially renders the previous generations redundant (except for supporting binary backwards compatibility), so a large number of instructions are wasted without adding much value.

Finally, any software that wants to use the new instruction set needs to be rewritten (or at least recompiled). What is worse, software developers often have to target several SIMD generations, and add mechanisms to their programs that dynamically select the optimal code paths depending on which SIMD generation is supported.
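The dynamic selection mechanism typically boils down to a function pointer that is chosen once at startup. The following is only a sketch under stated assumptions: `simd_level_t`, `select_add` and the kernels are hypothetical names, the detected level is passed in by the caller rather than queried from the CPU, and the kernels are plain C stand-ins that differ only in block size (a real program would write each one with the intrinsics of one SIMD generation).

```c
#include <stddef.h>

/* Hypothetical SIMD generations, from least to most capable. */
typedef enum { SIMD_NONE, SIMD_128, SIMD_256, SIMD_512 } simd_level_t;

typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Shared stand-in body: processes `lanes` elements per step. */
static void add_blocked(float *dst, const float *a, const float *b,
                        size_t n, size_t lanes) {
    size_t i = 0;
    for (; i + lanes <= n; i += lanes)       /* full blocks */
        for (size_t k = 0; k < lanes; ++k)
            dst[i + k] = a[i + k] + b[i + k];
    for (; i < n; ++i)                       /* leftover elements */
        dst[i] = a[i] + b[i];
}

/* One kernel per generation: every new register width means one more
   copy of every hot loop to write, test and ship. */
void add_scalar(float *dst, const float *a, const float *b, size_t n) {
    add_blocked(dst, a, b, n, 1);
}
void add_128(float *dst, const float *a, const float *b, size_t n) {
    add_blocked(dst, a, b, n, 4);
}
void add_256(float *dst, const float *a, const float *b, size_t n) {
    add_blocked(dst, a, b, n, 8);
}
void add_512(float *dst, const float *a, const float *b, size_t n) {
    add_blocked(dst, a, b, n, 16);
}

/* Pick the widest kernel that the detected level allows. */
add_fn select_add(simd_level_t level) {
    switch (level) {
    case SIMD_512: return add_512;
    case SIMD_256: return add_256;
    case SIMD_128: return add_128;
    default:       return add_scalar;
    }
}
```

A program would call `select_add` once with whatever level its CPU detection reports, and then call the returned function pointer in its hot path.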

Flaw 2: Pipelining

In the packed SIMD paradigm there is a 1:1 mapping between the register width and the execution unit width (e.g. 128 bits for NEON and SSE). At the same time many SIMD operations are pipelined and require several clock cycles to complete (e.g. floating-point arithmetic and memory load instructions). The side effect of this is that the result of one SIMD instruction is not ready to be used until several instructions later in the instruction stream.

Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy. Advanced (power-hungry) hardware implementations can do this dynamically with register renaming and speculative out-of-order execution, but for simpler (usually more power-efficient) hardware implementations loops have to be unrolled in software. Many software developers and compilers aiming to support both in-order and out-of-order processors simply unroll all SIMD loops in software.

However, loop unrolling hurts code density (i.e. makes the program binary larger), which in turn hurts instruction cache performance (fewer program segments fit in the instruction cache, which reduces the cache hit ratio).

Loop unrolling also increases register pressure (i.e. more registers must be used in order to keep the state of multiple loop iterations in registers), so the architecture must provide enough SIMD registers to avoid register spilling.
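As an illustration, consider a sum reduction. The sketch below is plain C with hypothetical function names, and each scalar accumulator stands in for one SIMD register. In the first version every addition depends on the previous one, so a pipelined adder would mostly wait. In the second version the four additions in the loop body are independent and can overlap in the pipeline, at the cost of more code and four live accumulators instead of one.

```c
#include <stddef.h>

/* One accumulator: each add must wait for the result of the previous
   add, so a multi-cycle adder is idle most of the time. */
float sum_single(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Unrolled four times with independent accumulators: the four adds in
   the loop body do not depend on each other. */
float sum_unrolled4(const float *x, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements */
        acc0 += x[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Since floating-point addition is not associative, the two versions may round differently for general inputs, which is one reason compilers do not always perform this transformation on their own.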

Flaw 3: Tail handling

When the number of array elements that are to be processed in a loop is not a multiple of the number of elements in the SIMD register, special loop tail handling needs to be implemented in software. For instance, if an array contains 99 32-bit elements, and the SIMD architecture is 128 bits wide (i.e. a SIMD register contains four 32-bit elements), 4*24=96 elements can be processed in the main SIMD loop, and 99-96=3 elements need to be processed after the main loop.

This requires extra code after the loop for handling the tail. Some architectures support masked load/store that makes it possible to use SIMD instructions to process the tail, but a more common scenario is that you have to use scalar (non-SIMD) instructions to implement the tail. In the latter case there may be problems if scalar and SIMD instructions have different capabilities and/or semantics, but that is not an issue with packed SIMD per se, just with how some ISAs are designed.

Usually you also need extra control logic before the loop. For instance, if the array contains fewer elements than fit in a single SIMD register, the main SIMD loop should be skipped.

The added control logic and tail handling code hurt code density (again reducing the instruction cache efficiency), add extra overhead, and are generally awkward to write.
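The pattern can be sketched in portable C as follows. The names `scale_array` and `loop_stats_t` are made up for this example, the inner loop over `LANES` stands in for a single packed instruction, and the returned counters exist only to make the split between main loop and tail visible.

```c
#include <stddef.h>

#define LANES 4  /* 32-bit elements per 128-bit register */

typedef struct {
    size_t main_iterations;  /* full-width iterations of the main loop */
    size_t tail_elements;    /* elements handled by the scalar tail */
} loop_stats_t;

loop_stats_t scale_array(float *x, size_t n, float factor) {
    loop_stats_t stats = {0, 0};
    size_t i = 0;

    /* Main loop. The condition is checked before the first iteration,
       so the loop is skipped entirely when n < LANES. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane)  /* one packed op */
            x[i + lane] *= factor;
        ++stats.main_iterations;
    }

    /* Scalar tail: between 0 and LANES-1 remaining elements. */
    for (; i < n; ++i) {
        x[i] *= factor;
        ++stats.tail_elements;
    }
    return stats;
}
```

For an array of 99 elements this gives 24 main loop iterations and 3 tail elements, matching the arithmetic above. Even in this trivial case the loop body exists twice in the program.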

Alternatives

One alternative to packed SIMD that addresses all of the flaws mentioned above is a Vector Processor. Perhaps the most notable vector processor is the Cray-1 (released 1975), and it has served as an inspiration for a new generation of instruction set architectures, including ARM SVE and RISC-V RVV.
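To give a feel for how a vector loop differs from a packed SIMD loop, here is a simplified model in portable C. It is not the instruction set of any particular architecture: `set_vl` is a hypothetical stand-in for a "set vector length" instruction, `hw_max_vl` is an assumed parameter for the number of elements the hardware can handle per vector instruction (it must be at least 1), and the inner loop stands in for one vector instruction.

```c
#include <stddef.h>

/* Hypothetical "set vector length" operation: the hardware reports how
   many elements it will process in this iteration. */
size_t set_vl(size_t remaining, size_t hw_max_vl) {
    return remaining < hw_max_vl ? remaining : hw_max_vl;
}

/* The same loop works for any hardware vector length, and the last
   iteration simply runs with a shorter vl, so no separate tail code is
   needed. Returns the number of loop iterations, for illustration. */
size_t vector_add(float *dst, const float *a, const float *b,
                  size_t n, size_t hw_max_vl) {
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = set_vl(n, hw_max_vl);
        for (size_t k = 0; k < vl; ++k)  /* one vector instruction */
            dst[k] = a[k] + b[k];
        dst += vl;
        a += vl;
        b += vl;
        n -= vl;
        ++iterations;
    }
    return iterations;
}
```

With 99 elements the loop runs 25 times if `hw_max_vl` is 4 and 7 times if it is 16, without any change to the code. In this model, a wider implementation needs no new instructions and no recompilation.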

Several other (perhaps less known) projects are pursuing a similar vector model, including Agner Fog’s ForwardCom and my own MRISC32. An interesting variant is Libre-SOC (based on OpenPOWER) and its Simple-V extension that maps vectors onto the scalar register files (which are extended to include some 128 registers each).

A completely different approach is taken by Mitch Alsup’s My 66000 and its Virtual Vector Method (VVM), which transforms scalar loops into vectorized loops in hardware with the aid of special loop decoration instructions. That way it does not even have to have a vector register file.

Another interesting architecture is the Mill, which also has support for vectors without packed SIMD.
