43

Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

 5 years ago
source link: https://www.tuicool.com/articles/hit/YF3QnuY
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

When receiving bytes from the network, we often assume that they are unicode strings, encoded using something called UTF-8. Sadly, not all streams of bytes are valid UTF-8. So we need to check the strings. It is probably a good idea to optimize this problem as much as possible.

In earlier work, we showed that you could validate a string using a little as 0.7 cycles per byte, using commonly available 128-bit SIMD registers (in C). SIMD stands for Single-Instruction-Multiple-Data, it is a way to parallelize the processing on a single core.

What if we use 256-byte registers instead?

Reference naive function 10 cycles per byte fast SIMD version (128-bit) 0.7 cycles per byte new SIMD version (256-bit) 0.45 cycles per byte

That’s good, almost twice as fast.

A common problem is that you receive as inputs ASCII characters. That’s a common scenario. It is much faster to check that a string in made of ASCII characters than to check that it is made of valid UTF-8 characters. Indeed, to check that it is made of ASCII characters, you only have to check that one bit per byte is zero (since ASCII uses only 7 bits per byte).

It turns out that only about 0.05 cycles are needed to check that a string is made of ASCII characters. Maybe up to 0.08 cycles. That makes us look bad.

You could start checking the file for ASCII characters and then switch to our function when non-ASCII characters are found, but this has a problem: what if the string starts with a non-ASCII character followed by a long stream of ASCII characters?

A quick solution is to add an ASCII path. Each time we read a block of 32 bytes, we check whether it is made of 32 ASCII characters, and if so, we take a different (fast) path. Thus if it happens frequently that we have long streams of ASCII characters, we will be quite fast.

The new numbers are quite appealing when running benchmarks on ASCII characters:

new SIMD version (256-bit) 0.45 cycles per byte new SIMD version (256-bit), w. ASCII path 0.088 cycles per byte ASCII check (SIMD + 256-bit) 0.051 cycles per byte

My code is available .


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK