
The simple beauty of XOR floating point compression

source link: https://news.ycombinator.com/item?id=39980345


I did something similar in a messaging application... We had a data structure consisting of several tables describing various message queues in 4GB of persistent storage. The initialization of those tables was quite regular, as they were just pointers into blocks in the persistent memory. When the application was ported to run on AWS instances, people decided to try to run the application on instances with the crappiest 20MB/s storage. I had specified during implementation that the journalling layer would not work well on crappy storage, as the actual persistent memory the system was originally designed for had gobs of bandwidth (it was supercap backed DRAM on a PCIe card built using an FPGA and a pair of 10Gbps SFP+ ports to mirror to another host -- you had a couple of GB/s of write throughput). Customers ignored the requirement, and opened a bug saying the system couldn't reinitialize the table within the 30 second limit for commands on the CLI. To fix the "bug", I ended up delta encoding the persistent memory tables to make the data more regular, ran it through libz at the lowest compression level, then wrote the compressed data out to disk. 4GB of data ended up taking less than 100MB in the journal. gzip compression on its own was not able to get under the 600MB requirement. It was a total hack, but it worked and took a few hours to implement. Domain specific knowledge really helps improve compression ratios!
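
To make the trick concrete, here is a minimal Go sketch of the same idea, delta-encoding a regular table before running it through zlib at the lowest compression level. The function name and the use of the standard compress/zlib package are illustrative assumptions, not the poster's actual code.

    package deltajournal

    import (
        "bytes"
        "compress/zlib"
        "encoding/binary"
    )

    // compressTable delta-encodes a table of 64-bit pointers and then
    // compresses the deltas with zlib at its cheapest level. Regular tables
    // (pointers into fixed-size blocks) turn into long runs of identical
    // deltas, which even BestSpeed squeezes down dramatically.
    func compressTable(ptrs []uint64) ([]byte, error) {
        deltas := make([]uint64, len(ptrs))
        var prev uint64
        for i, p := range ptrs {
            deltas[i] = p - prev // wraps on underflow, which is fine here
            prev = p
        }

        var buf bytes.Buffer
        w, err := zlib.NewWriterLevel(&buf, zlib.BestSpeed)
        if err != nil {
            return nil, err
        }
        if err := binary.Write(w, binary.LittleEndian, deltas); err != nil {
            return nil, err
        }
        if err := w.Close(); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }
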
If you zig-zag encode the deltas you might be able to get it down to 10MB.

δzd is the best. https://justine.lol/sizetricks/#dzd

For anyone wondering, zig-zag encoding is just encoding integers with the sign bit in the LSB instead of two's complement, hence the name. See also https://news.ycombinator.com/item?id=31382907
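
For example, a minimal Go sketch of the mapping (the same one protobuf varints use; the helper names are mine):

    // zigzagEncode maps ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ...,
    // so small deltas of either sign become small unsigned values whose
    // high bits compress away.
    func zigzagEncode(v int64) uint64 { return uint64((v << 1) ^ (v >> 63)) }

    // zigzagDecode inverts the mapping.
    func zigzagDecode(u uint64) int64 { return int64(u>>1) ^ -int64(u&1) }
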
People in the HPC/classical supercomputing space have done this sort of thing for a while. There's a fair amount of literature on lossless floating point compression, such as Martin Burtscher's work or stuff out of LLNL (fpzip):

https://userweb.cs.txstate.edu/~burtscher/
https://computing.llnl.gov/projects/floating-point-compressi...

but it tends to be very application specific, relying on high correlation / small deltas between neighboring values in a 2D/3D/4D/etc. floating point array (e.g., you are compressing neighboring temperature grid points in a PDE weather simulation model; temperatures in neighboring cells won't differ by that much).

In a lot of other cases (e.g., machine learning) the floating point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only thing that is really compressible, and the XOR trick does not help as much because neighboring values can still vary a bit in their exponents. An entropy encoder works well there instead (it encodes closer to the actual underlying data distribution/entropy), and it doesn't depend on neighboring floats having similar exponents.

In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system encoder/decoder on GPU (the first such implementation of general ANS on GPU, predating nvCOMP) for exponent compression.

We have used this to losslessly compress floating point data between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in training massive ML models to speed up overall wall clock time of training across 100s/1000s of GPUs without changing anything about how the training works (it's lossless compression, it computes the same thing that it did before).

https://github.com/facebookresearch/dietgpu

I've talked to several world class computer scientists/bioinformaticians who realized, on their own, that decompression is the fastest way to fill the cache with data (because the amount of data streaming in is smaller, and CPUs are VERY fast for decompression).

One of the key innovations in the AMBER MD engine that made it work OK on cheaper systems was lossless floating point compression. It still impresses me that you can compress floats, send them over MPI, and decompress them, all faster/lower latency than the transport can send the uncompressed data.

Not just MPI over a network. We can compress floats, send them over NVLink or PCIe to another GPU in the same host, and decompress them, and it can be faster than sending the raw data between GPUs. That's the premise behind dietgpu: it's cheap compression, not a great compression ratio (around 0.6-0.9x of the original size), but it's extremely fast (100s of GB/s throughput), with the idea that you're trying to race something that is similarly fast. General floating point data can be quite incompressible or highly compressible; it really just depends on what is being passed around.

Interconnects are in general improving at a slower rate than CPU/GPU compute, and that gap can be exploited.

I'm drawing my entire VRAM's worth of Morton-sorted points. Do you think this compression technique lends itself well to streaming decompression, i.e. decompressing one warp of points at a time right before drawing? Or do I need larger batches to make it viable?
I'm not exceptionally good with compression, but I did a few experiments.

I've noticed that array-of-structs form is harder to compress than struct-of-arrays form. This is relevant because a floating-point number (float32) is really a struct containing 3 values: a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa.

--------

If you have 1000 floats, I would expect the 1000 sign bits in a row to be more compressible (whatever compression algorithm you use, it can better pick up the pattern of signs).

Then the 8000 bits of exponent would also be more compressible, because the exponents would likely follow their own pattern.

Finally, the 23-bit mantissas would probably be difficult or impossible to compress. But "extracting them out" before running a compression algorithm gives the signs + exponents more opportunity for compression.

--------

So yeah, just think about your data. And favor structure-of-arrays form for better compression.
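
A rough Go sketch of that struct-of-arrays decomposition (names and layout are illustrative, not from the thread):

    package soafloat

    import "math"

    // splitFloats separates a float32 slice into three parallel streams:
    // sign bits, exponent bytes and mantissa words. Each stream can then be
    // handed to a generic compressor on its own.
    func splitFloats(xs []float32) (signs, exps []byte, mants []uint32) {
        signs = make([]byte, len(xs))
        exps = make([]byte, len(xs))
        mants = make([]uint32, len(xs))
        for i, x := range xs {
            b := math.Float32bits(x)
            signs[i] = byte(b >> 31) // 1-bit sign
            exps[i] = byte(b >> 23)  // 8-bit exponent (truncates the sign bit)
            mants[i] = b & 0x7fffff  // 23-bit mantissa
        }
        return signs, exps, mants
    }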

Totally, SoAs are often easier to compress. I hadn’t thought about floats as being AoSs, and then unpacking the signs, exponents and mantissas of a batch of floats into separate arrays; that’s an interesting way to look at it. You could probably take that one step further with floats or ints, and treat an array of numbers as an array of structures of 32 or 64 fields where every field is 1 bit. So to compress a batch of 1000 32-bit numbers, take all bit 31s in a row, then all bit 30s, etc. Or another way to look at it is to stack the numbers in a column, transpose it, and read out the bits.

It’s worth noting that the algorithm in the article, and its cousin, delta encoding, both already take advantage of any cases where the sign & exponent bits don’t change with every sample, and/or where the mantissa ULPs are constant for a while at a time. It’d be interesting to compare the explicit SoA-of-floats compression idea to XOR or delta compression, but I think it’s also interesting in a meta kind of way that these two seemingly different ideas have similar outcomes.

One of my fun projects was compressing all community levels for SRB2Kart (based on the Doom II engine) into a single archive suitable for WebGL rendering of map previews.

I was able to get all 220MB of the default SRB2kart maps down into 12.9MB, and could even get down to 5MB with heavy texture compression. LZMA alone could only do 37MB, so I was pretty pumped.

One of the harder parts is compressing the vertex data and adjacency information (which linedefs connect which vertices). I never figured out a satisfying way of doing that, but this makes me want to try again...

Waves across I-35 to Reality Labs.

Sounds like something Numba should include perhaps.

I retired from HPC around the era of iPython, Open MPI, Infiniband, OFED, GPFS/Panasas/Lustre, Rocks Cluster, and Condor/PBS/SGE.

Wow, cool to see Dr. Burtscher's name here! I was a student in his undergraduate parallel computing course and was invited to work in his lab. Very smart guy.
The biggest issue with this approach is that it's bit-aligned, and the alignment is determined by fields within the data. It's hard to make this run fast on most CPUs; a typical implementation ends up having to maintain a bit buffer, and shifts bits in/out of it a couple times for every codeword.

I bet you could improve the performance of this algorithm significantly, without harming its compression too badly, by one or both of:

A) Adjusting the sizes of some bit fields to make them evenly divide the system word size, e.g. requiring the run length of the 11-prefix to be a multiple of 4, so that it can be encoded in 4 bits instead of 6. (This is a net space savings of half a bit; you have to include an average of 1.5 more bits of nonzero data, but the length is two bits shorter.)

B) Writing the streams of bits for e.g. "mode", "number of leading zeroes", and "bit subsequences" separately for each block of compressed data, rather than combining them into a single stream. As a bonus, each of those streams will now be more amenable to compression by generic compressors.

One of the things we've learned from columnar data stores and compression formats is that choosing the correct stride order really matters.

I noticed this a decade ago when I stumbled on a voxel engine that used some complex height map compression scheme for game map data. It seemed gratuitous to me, so I tried the simplest thing I could come up with: gzipping a 3D array of voxel colors. It didn't really work until I chose the right stride order (xyz, zyx, yzx, etc) and then suddenly it was 5x better.

The parquet format has the BYTE_STREAM_SPLIT encoding strategy which basically splits an array of floats/doubles into 4/8 arrays of bytes so they can be compressed independently. https://parquet.apache.org/docs/file-format/data-pages/encod...

Just choosing the correct stride order before compression is still underutilized. It's also undersupported in our tools; e.g. that voxel engine used numpy, but it was a pain to get numpy to transparently change the stride order in the memory representation without also affecting the indexing order.
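
As an illustration of the BYTE_STREAM_SPLIT idea mentioned above, a minimal Go sketch (my own, not parquet's implementation) that transposes an array of float64s into eight byte streams before compression:

    package bytesplit

    import (
        "encoding/binary"
        "math"
    )

    // byteStreamSplit writes byte 0 of every value, then byte 1 of every
    // value, and so on, so that bytes with similar statistics (e.g. the
    // exponent-heavy high bytes) end up next to each other.
    func byteStreamSplit(xs []float64) []byte {
        n := len(xs)
        out := make([]byte, 8*n)
        var tmp [8]byte
        for i, x := range xs {
            binary.LittleEndian.PutUint64(tmp[:], math.Float64bits(x))
            for b := 0; b < 8; b++ {
                out[b*n+i] = tmp[b] // stream b holds byte b of every value
            }
        }
        return out
    }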

The BYTE_STREAM_SPLIT is a Z-order (also known as Morton encoding). As it is better at preserving locality, it usually performs really well compared to classical "orthogonal" orders. It is also really simple to implement, as it boils down to interleaving bits or bytes.
It means the order in which you serialise a multi-dimensional array. So for this 2D table:

    1 2 6 2
    3 4 1 8
    9 0 7 6

Because it’s 2D it has two stride orders, xy and yx, i.e. row-major and column-major, which would be 126234189076 and 139240617286 respectively.
It's largely a variation of delta encoding (https://en.wikipedia.org/wiki/Delta_encoding) which uses XOR to determine the variation between values. Both exploit the fact that in certain kinds of series/samples the range (relative entropy) between values is far less than the range of the underlying value representation.

Whether the resulting ranges are encoded in a space-saving manner (as this algorithm does) or used as input to a BW transform/Huffman/etc. sort of depends on the resulting value sequences.

Much of this, though, comes down to the observation that understanding the data being compressed can frequently lead to data-specific preprocessing that significantly increases the compression ratio over generic encoders. Just taking the output of this routine and running it through deflate might yield another couple of percent if it were tuned. Or, put another way, all the bit prefix/shifting decisions might be redundant in certain cases, because the 0s would end up as single-bit tokens in a Huffman encoder.
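
A minimal Go sketch of that kind of preprocessing, XORing consecutive float64s and then handing the result to a generic deflate pass (the function is illustrative; a tuned encoder would do better):

    package xorfilter

    import (
        "bytes"
        "compress/flate"
        "encoding/binary"
        "math"
    )

    // xorThenDeflate reinterprets each float64 as a uint64, XORs it with the
    // previous value (so slowly changing series become mostly zero bytes),
    // and compresses the byte stream with deflate.
    func xorThenDeflate(xs []float64) ([]byte, error) {
        raw := make([]byte, 8*len(xs))
        var prev uint64
        for i, x := range xs {
            b := math.Float64bits(x)
            binary.LittleEndian.PutUint64(raw[8*i:], b^prev)
            prev = b
        }
        var buf bytes.Buffer
        w, err := flate.NewWriter(&buf, flate.DefaultCompression)
        if err != nil {
            return nil, err
        }
        if _, err := w.Write(raw); err != nil {
            return nil, err
        }
        if err := w.Close(); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }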

> It's largely a variation of delta encoding

Amiga famously used a version of hardware backed delta encoding [1] to display more colours on the screen than normally possible within its memory limits.

[1] https://en.wikipedia.org/wiki/Hold-And-Modify

HAM would have worked even better if it had been in the hue-saturation-value (HSV) colour space, as Jay Miner originally intended. Sadly, HSV support was dropped, but due to schedule constraints there wasn't enough time to rework the design and remove HAM too. I found it fascinating to learn these things long after I was no longer using my Amiga for day to day work.
As the article states, you need to apply some heuristics to choose between cases (2) and (3). I'm not a big fan of that, because it means the encoding is not deterministic (as in, it leaves room for interpretation/optimization by encoders).

Just for fun I implemented this algorithm: https://go.dev/play/p/72GyYmVnwyc

The algorithm described in the article uses prefixes 0, 10, and 11. My algorithm uses different prefixes: 0, 10, 110, 1110, 11110, [...]. By default, the subsequence that is used is identical to the one containing differing bits during the previous iteration. The prefixes described previously can be used to widen the subsequence:

- 0: Keep the subsequence the same width.

- 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the left and right side.

- 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the left and right side.

- 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the left and right side.

For the sequence of numbers presented under "Tweaking the algorithm", the algorithm in the article produces 969 bits of data. My algorithm produces just 823.

Edit: Tiny bug in original code.

Edit 2: maybe it’s smarter to widen the subsequence using triangular numbers. Doing it exponentially is a bit too aggressive, it seems.

What you're working towards here is essentially Golomb coding. The optimal divisor will depend on how the data is distributed.

https://en.wikipedia.org/wiki/Golomb_coding
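
For reference, a minimal Go sketch of Rice coding (Golomb with a power-of-two divisor 2^k); it returns the code as a string of '0'/'1' characters purely for illustration, where a real encoder would pack bits:

    // riceCode encodes n with Rice parameter k: the quotient n>>k in unary
    // (q ones followed by a zero), then the remainder in k fixed bits.
    func riceCode(n uint64, k uint) string {
        q := n >> k
        code := make([]byte, 0, q+1+uint64(k))
        for i := uint64(0); i < q; i++ {
            code = append(code, '1') // quotient in unary
        }
        code = append(code, '0') // unary terminator
        for i := int(k) - 1; i >= 0; i-- {
            code = append(code, '0'+byte((n>>uint(i))&1)) // remainder, MSB first
        }
        return string(code)
    }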

As I'm not allowed to edit the post above: using a unary number system instead of exponential/triangular makes the implementation even simpler, and cuts down the output for the example even further, to 798 bits.
> The prefixes describe previously can be used to widen the subsequence

So you can only widen the subsequence, never shrink it. Shrinking would address the issue raised by TFA:

> If a series contains an outlier value that requires a very large window to encode, and all subsequent values have nonzero values only in a much smaller subwindow, the inefficient larger window will get locked in because we never hit the condition that would trigger a reset of the window size.

Shrinking happens automatically, because every iteration takes the actually meaningful width of the previous iteration.
> By default, the subsequence that is used is identical to that containing differing bits during the previous iteration

If I'm understanding correctly, this means that if the first choice of subsequence size is oversized, this will carry through to future samples (unless the encoder chooses to re-size it explicitly). I wonder if, with your cheaper "loosen" encoding, it is worth automatically tightening the bit delta. For example, if I go to an initial window of "5 zeros, the 7 bits 1001101, remainder zeros" (which I'll call 5;7), and the next encoding is "same, 0101110", the leading and trailing zeros adjust the window to (in this example) 6;5, "tightening" around the non-zero bits. With your low-cost "loosen", doing this aggressively (instead of, say, only after it happens for multiple samples in a row) seems sane. It also means that the fact that your loosen encoding is symmetric (which makes sense in terms of bit efficiency) is somewhat mitigated, because the implicit tighten can break the symmetry.

> > By default, the subsequence that is used is identical to that containing differing bits during the previous iteration

> If I'm understanding correctly, this means that if the first choice of subsequence size is oversized, this will carry through to future samples (unless the encoder chooses to re-size it explicitly).

And that's exactly what I do: I resize it. The subsequence that is picked during round n is the one that actually contained the differing bits in round n-1.

I wonder how it compares to traditional compression?

I've tried their "Integer that increments by 1 at every timestep" case on 10,000 integers. The algorithm described gave "6.98x compression".

bzip2 gave 80000/8642 = 9.2x compression

zstd gave 80000/11316 = 7.1x compression

I'm not sure that's a very fair comparison, as this algorithm does point-wise streaming compression/decompression. I could see myself using something like this in an FPGA context or within a tight acquisition inner loop, for example.
Here's discussion of an article from 2020 that explains this and other time series compression approaches: https://news.ycombinator.com/item?id=31379344
For integer values, just subtracting bytes is enough to allow much better compression ratios. See the table at https://opensource.quarq.com/zfit/zfit.pdf
Very handy to do with 7-Zip: -mf=Delta:4 (or just f=Delta:4 in the GUI config window). This is for 4 byte little endian integers, which works pretty well with 16-bit stereo PCM even if there's some unwanted overflow between the channels, if you can't use flac in some context (raw data, etc).
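
For illustration, a rough Go sketch of what a Delta:N byte filter like this does (the helper is mine, not 7-Zip's implementation):

    // deltaFilter replaces each byte with its difference (mod 256) from the
    // byte dist positions earlier, which lines up corresponding bytes of
    // fixed-width samples such as 4-byte PCM frames when dist == 4.
    func deltaFilter(data []byte, dist int) []byte {
        out := make([]byte, len(data))
        for i := range data {
            if i < dist {
                out[i] = data[i]
            } else {
                out[i] = data[i] - data[i-dist]
            }
        }
        return out
    }
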
For those interested in the topic, I highly recommend the series of posts from Charles Bloom:

http://cbloomrants.blogspot.com/2023/07/notes-on-float-and-m...

> TLDR:

> For S16 deltas, bias by 0x8080 , for S32 deltas, bias by 0x80808080.

> De-interleaving multi-byte integers into homogenous streams can help, particularly with weaker back-end compressors.

> TLDR:

> For floats, just reinterpret F32 as U32.

> If the floats have a mix of positive and negative, convert the sign bit to two's-complement signed S32.

> Consider lossy elimination of negative zero -0.f

> If you didn't actually want the huge precision for floats near zero, a simple lossy encoding is just to do float += constant, which works for non-negative floats where you don't know the high end of the range so you can't just use fixed point.

> TLDR:

> The best way to compress numeric data that is larger than bytes (F32,F16,S32,S16) is usually to delta them in their original size integer, then de-interleave after the delta.

> Sometimes no filter or no deinterleave is best, particularly with stronger compressors, so being able to select filter on/off per-file can give big wins.
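
As a sketch of the "convert the sign bit to two's-complement" point quoted above (my own Go, under the assumption that you want an integer whose ordering matches the float ordering; note that -0.0 and +0.0 both map to 0, matching the negative-zero note):

    package floatmap

    import "math"

    // floatToOrderedInt maps a float32 (sign-magnitude) to an int32 whose
    // integer ordering matches the float ordering, so delta coding behaves
    // sensibly across sign changes.
    func floatToOrderedInt(f float32) int32 {
        b := math.Float32bits(f)
        if b&0x80000000 != 0 {
            return -int32(b & 0x7fffffff) // negative: drop sign bit, negate magnitude
        }
        return int32(b)
    }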

What if the floating point number is, for example, a number of seconds coming from a millisecond time source? Binary floating point numbers notoriously can't represent decimals precisely.

You would obviously be better off timing in the proper unit so that you have an integer number of milliseconds, but that's the kind of situation where performance starts to drop and no one understands why, until some dev digs deeply into the encoding and maybe even writes a blog post about how they got a massive speedup by multiplying numbers by 1000.

In the first several examples, the input seems to have a mantissa with all zeros beyond the first 15 bits. Why isn't the author using float32 for this instead of float64?
Not having to make that kind of decision in the first place is inherently valuable in and of itself, which I believe is what's being showcased here.
I guess, but most compression algorithms like DEFLATE or zstd would compress those big sequences of zeros basically to nothing.
This appears to be working because the data are highly correlated, i.e. similar magnitudes across the vector. I wonder how well it would do with completely random input, not just uniform floats in the [0, 1] interval, but floats across all exponent combinations.
XOR'ing consecutive elements is the mark of people used to sequential, rather than parallel, processing.

Most of the approaches described in that paper should work well enough with a base + offset scheme, where the base is RLE-compressed or just set once every N elements.

This algorithm is pretty neat. I wonder if for the use case outlined by the post they could get even better compression through the use of fewer significant digits. For the purposes of recording times, 0.00768 is likely just as useful as 0.0076792240142822266
That's already (one of) the main properties the algorithm is exploiting: that f64 is a horribly overkill representation for the data, since it was generated using a microsecond clock, leading to a lot of trailing zeros.

So yeah, dropping more significant digits should lead to even better compression.

But if we are going to be massaging the data based on what we know/need, there are better ways to go about compressing it. e.g. I'd expect that simple fixed-point quantization with a delta encoding would compress better than this. What I find cool about this is that it dynamically figures things out without having to deal with any priors.
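
A minimal Go sketch of that kind of lossy massaging, truncating the float64 mantissa so the XOR/delta of neighbouring values ends in a long run of zeros (the helper is illustrative):

    package truncate

    import "math"

    // truncateMantissa keeps only the top `keep` bits of the 52-bit mantissa
    // (keep <= 52) and clears the rest, trading precision for compressibility.
    func truncateMantissa(x float64, keep uint) float64 {
        mask := ^uint64(0) << (52 - keep)
        return math.Float64frombits(math.Float64bits(x) & mask)
    }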

Yes! The very last example in the post shows what happens if you truncate the mantissa to just the 4 most significant bits.
One question: it is possible for the XOR of two consecutive floating-point numbers to have 32-63 leading zeros; the numbers 32-63 do not fit in 5 bits. I imagine this is treated by Gorilla like 31 leading zeros?
> We then immediately write out the subsequence of the XORed value that has the same starting index and length as the last subsequence we wrote.

I don't really understand this. Is there an example?

IIRC, there was a WAV-era audio format that used PCM delta compression by recording differences rather than absolute values.
Differential PCM basically did this against a predicted value; delta modulation is the other scheme.
That's probably it. Thanks for easing my perpetual forgetfulness!
How does this compare to regular dictionary compression?
Dictionary compression isn't really applicable to floating-point data, because it isn't made up of commonly repeated "words" like text is.
Dictionary compression, combined with a form of entropy coding, can work. As shown in the article, the exponents often repeat themselves, and the low order bits of the mantissa are often zero.

Advanced context modeling compression algorithms can work even better than the scheme presented in the article. These are like mini-AIs: they can recognize patterns more complex than just a string of unchanged bits, and with the appropriate entropy coder they give excellent compression. Of course the processing cost is huge compared to a few bit manipulations, and wouldn't make much sense unless you need extreme compression.

But context modelling isn't dictionary compression. Parent is correct that dictionary compression itself (e.g. LZ4) isn't very useful for floating point.
A normal network runs at 10Gbps, or 1.25GB/s.

200MiB/s is too slow by a factor of 7. Though obviously some people have optimised it.

According to https://www.speedtest.net/global-index, 200MiB/s is over five times faster than Singapore's broadband speed – and that's the country with the fastest broadband in the world.

200Mbps puts you in the top 15 countries by broadband speed (above Japan) and the top 4 by mobile speed (above South Korea). Not that that's relevant to anything. (I haven't edited out any mistakes. The child comment is lying.)

You're conflating MiB/s with Mbps. Though I believe the GP was actually talking about local server networks, not internet speeds.

If you need to stream sequential floating point data over an internet link, something more space-efficient (and probably chunked instead) would probably be worthwhile over the simplicity.

I'd expect an actually optimized implementation to be at least that much faster.
Uh no? Most non-corporate networks are gigabit to 2.5Gbit at best. Higher end home users may be doing 5Gbit and 10Gbit+.

Even top end wifi tends to only be around 1.2Gbit.

Higher end home users are doing 25, 50 and 100 Gbps. Hardware is actually rather affordable now.

Although there's just not enough PCIe bandwidth in current consumer hardware for 100 Gbps...

How many homes have fibre runs everywhere? Most homes don't even have Cat 5.
Your high end is like the top 0.0001% of home users.
You can get to this under $1k nowadays. This Mikrotik switch changed everything: https://mikrotik.com/product/crs504_4xq_in

You can buy second hand 100g networking cards pretty cheaply. The caveat is that they only support older PCIe standards, so you need PCIe 16x to use them at full speed.

Please also don't forget typical NVMe drives are pretty fast now. Just two of them can nearly saturate a 100g link.

Personally I know a lot of people in the IT field who have single mode fiber SFP28 networks. I'd say 0.001% - 0.01% is closer to reality.

Remember we were talking about high end.

Like 99% of the people I know who work in IT don’t have anything near what you’re mentioning. They spend $100 on all their networking equipment, not just a switch. They’re not laying fiber in their house. And outside of that group the number is basically 0.
I guess the typical gateway drug is to get a 10G internet connection or a 10G NAS at home and find out copper-based Cat 6/7 is a lousy solution to actually make use of it.

Surprisingly high latencies (compared to fiber, that is) and noisy fans due to a lot of heat generated.

Shrug. I guess your mileage may vary.

It's hard for me to picture doing something with an internet/NAS connection and caring about the extra percents of a millisecond copper gives you. And all the 10g cards amazon is showing me are fanless. Even if you need a fan, it should be enough at whisper-quiet speeds.
I say this because I am almost at this point (I run copper through my house) but was scared off on going full fiber. I know some who have, but it’s definitely really, really rare.
I had (well, still have) cat 7 in the house, while venturing into fiber as well. 10GBASE-T switches are noisy and surprisingly power hungry.

One big reason for fiber is a home lab. Transferring large VM images etc. is so much faster. Ability to run iSCSI fast is also amazing.

Single mode fiber is pretty nice. It's fast, has low power requirements and no need to worry about electrical issues. Prices have declined significantly. The only minus is that you can't have PoE, so the best option is to wire both.

(Ok, there's actually Power over Fiber, but that's just crazy stuff https://en.wikipedia.org/wiki/Power-over-fiber. A huge fire risk.)

Yeah, but that is the extreme minority. Claiming it isn't is absurd.
x86 motherboards usually come with a 10Gbps onboard NIC.

It's the standard. I don't know what kind of world you live in to convince yourself it's exclusive.

100Gbps is also available pretty cheap, but typically only as a separate card.
