A Time Series Storage for Coordinates
source link: https://blog.nobugware.com/post/2019/time_series_storage_for_coordinates/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
TL;DR;Knowing your data helps compress them better than common algorithms.
Problem
For one of my side projects, an IoT database , I wanted a specialized time series to store timestamps coupled with coordinates.
I needed a simple solution which allows live and cold compressed storage with gaps in it: IoT devices can be off for days then reappear.
But couldn’t find anything fitting my needs, the Gorilla Paper from Facebook is really nice but expects a 4 hours maximum gap between time events.
Solution
Time series for geospacial data have particular properties:
- events are occurring often regularly
- object is not moving at all or
- object is often moving at a constant speed
It means we could store deltas between events to reduce the space needed.
Storage
-
One
uint32
is needed to encode an uncompressed time using Unix ts -
Two
float32
are needed to encode an uncompressed latitude/longitude
(stored as uint32: float32 * 10e5, 0.00001 degree represents ~1 meter)
By storing the delta between two events, we need less space:
1201986030
then 10 seconds later 1201986040
, the delta is 10
, which can be stored in an uint8
, or in an uint16
for a bigger delta value but still reducing the original needed uint32
.
The next event will probably occur 10 seconds later at 1201986050
, by storing deltas of deltas (so we need a signed integer), we can store 0
(zero is nice since it takes zero space) or the delta +/- !
Many time series, Prometheus included, are using this trick.
Same applies to the coordinates, a moving car or a not moving car, can be compressed by this technic.
Because we may have large gap between times or between coordinates, larger than a uint16
, we may have to store a fully uncompressed value.
We still have to store something for every events indicating if it’s a zero, uint8
, uint16
delta or a full value.
Delta0 DeltaEncoding = 0b00 Delta8 DeltaEncoding = 0b01 Delta16 DeltaEncoding = 0b10 Full32 DeltaEncoding = 0b11
Using two bits we can tell the timestamp encoding, same goes for latitude and longitude, reading it back with &
mask.
func (d DeltaEncoding) TSDelta() DeltaEncoding { return d & 0b000011 } func (d DeltaEncoding) LatDelta() DeltaEncoding { return d & 0b001100 >> 2 } func (d DeltaEncoding) LngDelta() DeltaEncoding { return d & 0b110000 >> 4 }
Compression
Let’s recap with an example:
# timestamp lat lng encoded 11201986030
48.22222
2.33333
47a4d9ee004994ce00038f75
2
1201986040
48.22222
2.33333
010a
3
1201986050
48.22222
2.33333
00
4
1201986060
48.22221
2.33334
14ff01
5
1201986070
48.22220
2.33335
00
6
1201986080
48.22219
2.33336
00
From #1 to #3, you can see a not moving device, with a constant timestamp will end consume only 1 byte.
From #4 to #6, a moving device with a constant speed will take 1 byte per coordinate, then end consume and share the exact same 1 control byte.
By using these simple solutions on real data , it’s compressing better than common compressors, being very easy on CPU consumption:
Let’s compare the compressed size to other algorithms:
Size: 64776 Snappy 61007 LZ4 57225 TSD 23948 Size: 18864 Snappy 18870 LZ4 18267 TSD 9883 Size: 15768 Snappy 12721 LZ4 11860 TSD 4775
Conclusion
There is nothing revolutionary here and it can be improved a lot (6 bits used on the control byte, XORing values, RLE…) but it demonstrates a simple specialized piece of software could perform better than someone else compression library.
The code for tsd is up on Github !
Hiring
Looking for a remote full time employee or contractor, contact me.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK