Building columnar compression in a row-oriented database

How we achieved 91%-96% compression in the latest version of TimescaleDB

Today, we’re excited to announce a new native compression capability for TimescaleDB, a time-series database on PostgreSQL. This new feature, which has been in private beta for a number of months, uses best-in-class compression algorithms along with a novel method to create hybrid row/columnar storage. During our beta period, we invited community members to try it and give us feedback -- and as a result, we’re now seeing up to 96% lossless compression rates for various real-world and simulated time-series workloads.

With this release, TimescaleDB’s storage footprint is now on par with custom-built and more limited NoSQL stores – without sacrificing any of our unique capabilities. TimescaleDB still offers full SQL, relational JOINs and functions, powerful automation capabilities, and the reliability and huge ecosystem that come from leveraging PostgreSQL’s foundation. We know that storage may have been a limiting factor for some people interested in TimescaleDB in the past, but we encourage you to try out native compression and let us know what you think.

TimescaleDB achieves these compression rates by deploying best-in-class algorithms for compressing various types of data. Users can specify which algorithms to use themselves, but by default we employ the following:

  • Gorilla compression for floats
  • Delta-of-delta + Simple-8b with run-length encoding compression for timestamps and other integer-like types
  • Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)
  • LZ-based array compression for all other types

We extended Gorilla and Simple-8b in order to handle decompressing data in reverse order, which allows us to speed up queries that use backwards scans. For the super technical details, please see our compression PR.

(We have found this type-specific compression quite powerful: In addition to higher compressibility, some of the techniques like Gorilla and delta-of-delta can be up to 40x faster than LZ-based compression during decoding, leading to much improved query performance.)
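
To make the delta-of-delta idea concrete, here is a toy sketch (not TimescaleDB’s internal implementation) that uses plain PostgreSQL window functions on a few hypothetical, regularly spaced timestamps to show the values the encoder would actually store:

-- Toy delta-of-delta example over hypothetical timestamps (epoch seconds).
-- TimescaleDB performs this encoding internally in C; this just shows the idea.
WITH samples(i, ts) AS (
	VALUES (1, 1000), (2, 1010), (3, 1020), (4, 1030), (5, 1041)
),
deltas AS (
	SELECT i, ts, ts - lag(ts) OVER (ORDER BY i) AS delta
	FROM samples
)
SELECT i, ts, delta,
	delta - lag(delta) OVER (ORDER BY i) AS delta_of_delta
FROM deltas
ORDER BY i;
-- With a regular 10-second interval, nearly every delta_of_delta is 0 (the late
-- sample at i = 5 yields 1), and long runs of small values are exactly what
-- Simple-8b with run-length encoding packs most densely.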

We plan to provide advanced algorithms for other native types, such as JSON data, in the future; but even today, using the approaches above, all PostgreSQL data types can be used in TimescaleDB’s native compression.

Native compression (and TimescaleDB 1.5) is widely available today for download across all our distribution channels, and will be available in about one week via Timescale Cloud. This capability is released under our Timescale Community license (so it’s fully free to use).

Row-oriented vs Columnar databases

Traditionally, databases fall into one of two categories: row-oriented and column-oriented (aka “columnar”) databases.

Here is an example: Say we have a table that stores the following data for 1M users: user_id, name, # logins, last_login. So we effectively have 1M rows and 4 columns. A row-oriented data store will physically store each user’s data (i.e., each row) contiguously on disk. By contrast, a columnar store will store all of the user_ids together, all of the names together, and so forth, so that each column’s data is stored contiguously on disk.

As a result, shallow-and-wide queries will be faster on a row store (e.g., “fetch all data for user X”), while deep-and-narrow queries will be faster on a column store (e.g., “calculate the average number of logins for all users”).
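
In SQL terms, against a hypothetical users table with the columns above (num_logins standing in for “# logins”), the two query shapes look like this:

-- Shallow-and-wide: every column for one user; a row store reads one contiguous row.
SELECT user_id, name, num_logins, last_login
FROM users
WHERE user_id = 42;

-- Deep-and-narrow: one column aggregated over all users; a column store reads
-- only the num_logins column from disk.
SELECT avg(num_logins)
FROM users;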

In particular, columnar stores do really well with narrow queries over very wide data. With such storage, only the designated columns need to be read from disk (rather than bringing in pages of data from disk with the entire rows, then selecting one or a few columns just in memory).

Additionally, because individual columns of data are typically the same type and are often drawn from a more limited domain or range, they typically compress better than an entire wide row of data comprising many different data types and ranges. For example, our column of number of logins would all be of an integer type and may cover a small range of numeric values.

Yet columnar stores are not without trade-offs. First of all, inserts take much longer: the system needs to split each record into the appropriate columns and write it to disk accordingly. Second, it is easier for row-based stores to take advantage of an index (e.g., B-tree) to quickly find the appropriate records. Third, with a row-store it is easier to normalize your dataset, such that you can more efficiently store related datasets in other tables.

As a result, the choice of row-oriented vs. columnar database greatly depends on your workload. Typically, row-oriented stores are used with transactional (OLTP) workloads, while columnar stores are used with analytical (OLAP) workloads.

But time-series workloads are unique

If you’ve worked with time-series data before, you know the workloads are unique in many ways:

  • Time-series queries can be shallow-and-wide, where an individual query accesses many columns of data, as well as data across many different devices/servers/items. For example, “What’s happening across my deployment in the last K minutes?”
  • Time-series queries can also be deep-and-narrow, where an individual query selects a smaller number of columns for a specific device/server/item across a longer time period. For example, “What is the average CPU usage for this server over the last 24 hours?” (Both query shapes are sketched in SQL after this list.)
  • Time-series workloads are usually insert-heavy. Insert rates of hundreds of thousands of writes per second are normal.
  • Time-series datasets are also very granular, effectively collecting data at a higher resolution than either OLTP or OLAP, leading to much larger datasets. Terabytes of time-series data are also quite normal.
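
Against a hypothetical metrics hypertable (column names are illustrative), those two query shapes might look like:

-- Shallow-and-wide: many columns, many devices, but only the last few minutes.
SELECT *
FROM metrics
WHERE time > now() - interval '5 minutes';

-- Deep-and-narrow: one column for one server over the last 24 hours.
SELECT avg(cpu_usage)
FROM metrics
WHERE hostname = 'host-42'
  AND time > now() - interval '24 hours';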

As a result, the optimal time-series store needs to:

  • Sustain high-insert rates, easily in the hundreds of thousands of writes per second
  • Efficiently process both shallow-and-wide and deep-and-narrow queries across that large dataset
  • Efficiently store, i.e. compress, that large dataset so that it is manageable and cost-effective

That is what we have done with the latest version of TimescaleDB.

Combining the best of both worlds

TimescaleDB is architected as a time-series database built on top of PostgreSQL. In doing so, it inherits everything that’s great about PostgreSQL: full SQL, huge query and data model flexibility, battle-tested reliability, an active and ardent developer and user base, and one of the largest database ecosystems around.

But TimescaleDB’s low-level storage uses PostgreSQL’s row-oriented storage format, which adds modest per-row overhead and reduces compressibility, because contiguous data values are of many different types -- strings, integers, floats, etc. -- and are drawn from different ranges. And by itself, PostgreSQL to date doesn’t offer any native compression (except for very large objects stored in their own “TOASTed” pages, which isn’t applicable for most content).

Alternatively, some users run TimescaleDB on a compressed file system like ZFS or BTRFS for storage savings, often in the 3x-9x range. But this leads to some deployment challenges given that it is an external dependency, and its compressibility is still impacted by the row-oriented nature of the underlying database (as data is mapped to disk pages).

Now, with TimescaleDB 1.5, we have been able to combine the best of both worlds: (1) all of the benefits of PostgreSQL, including the insert performance and shallow-and-wide query performance for recent data from a row store, combined with (2) the compression and additional query performance -- to ensure we only read the compressed columns specified in a query -- for deep-and-narrow queries of a columnar store.

Here are the results.

Results: 91-96% storage savings (from independent beta testing)

Prior to release, we asked some members of the community and existing TimescaleDB customers to beta test the new compression features with some of their actual datasets, and we also tested compression against Time Series Benchmark Suite (TSBS) datasets.

Below are the results, which include the type of workload, the total uncompressed size, the compressed size (i.e., the size seen after compression), and the resulting storage savings. These savings are achieved with lossless encodings only.

Workload | Uncompressed | Compressed | Storage Savings
IT metrics (from Telco beta tester) | 1396 GB | 77.0 GB | 94% savings
Industrial IoT monitoring data (from beta tester) | 1.445 GB | 0.077 GB | 95% savings
IT metrics (DevOps dataset from TSBS) | 125 GB | 5.5 GB | 96% savings
IoT monitoring data (IoT dataset from TSBS) | 251 GB | 23.8 GB | 91% savings

“Compression ratio is jaw-droppingly high :)” - Tamihiro Lee, Network Engineer, Sakura Internet

More results: Cost savings and faster queries

But such compression is not just academic; it leads to two real benefits:

  • Cost. Storage at scale is expensive. A 10TB disk volume in the cloud costs more than $12,000 per year by itself (at $0.10/GB/month for AWS EBS storage), and additional HA replicas and backups can grow this number by another 2-3x. Achieving 95% storage savings can therefore save you $10K-$25K per year in storage costs alone (for example, $12K/10TB * 10TB/machine * 2 machines [one master and one replica] * 95% savings = $22.8K).
  • Query performance. Compression leads to immediate performance improvements for many types of queries. As more data fits in less space, fewer disk pages (with compressed data) need to be read to answer queries. (A benchmarking sneak peek is below, with a deeper dive in an upcoming post.)

Next Steps

Native compression is widely available in TimescaleDB 1.5 today. You can either install TimescaleDB or upgrade your current TimescaleDB deployment. (Within the next week, this feature will be available via Timescale Cloud and we will update this post when that happens.)

We also encourage you to sign up for our upcoming webinar “How to Reduce Your Database Total Cost of Ownership with TimescaleDB” to learn more.

Now, if you would like to learn more about the fun technical bits -- about building columnar storage on a row-based system, indexing and querying on compressed data, and some benchmarks -- please continue reading.

All credit for these results goes to some of our great engineers and PMs: Josh Lockerman, Gayathri Ayyapan, Sven Klemm, David Kohn, Ante Krešić, Mat Arye, Diana Hsieh, and Bob Boule. (And yes, we’re hiring worldwide.)

Building columnar storage on a row-based system

Recognizing that time-series workloads access data in temporal order, our high-level approach to building columnar storage is to convert many wide rows of data (say, 1000) into a single row of data. But now, each field (column) of that new row stores an ordered set of data comprising the entire column of the 1000 rows.

So, let’s consider a simplified example using a table with the following schema:

Timestamp | Device ID | Status Code | Temperature
12:00:01 | A | 0 | 70.11
12:00:01 | B | 0 | 69.70
12:00:02 | A | 0 | 70.12
12:00:02 | B | 0 | 69.69
12:00:03 | A | 0 | 70.14
12:00:03 | B | 4 | 69.70

After converting this data to a single row, the data is stored in “array” form:

Timestamp | Device ID | Status Code | Temperature
[12:00:01, 12:00:01, 12:00:02, 12:00:02, 12:00:03, 12:00:03] | [A, B, A, B, A, B] | [0, 0, 0, 0, 0, 4] | [70.11, 69.70, 70.12, 69.69, 70.14, 69.70]

Even before employing data compression, this format immediately saves storage by greatly reducing our internal per-row overhead. PostgreSQL typically adds ~27 bytes of overhead per row (e.g., for MVCC versioning). So even without any compression, if our schema above is, say, 32 bytes wide, then 1,000 rows of data which previously took [1000 * (32 + 27)] ~= 59 kilobytes now take [1000 * 32 + 27] ~= 32 kilobytes in this format.

But given a format where similar data (timestamps, device IDs, temperature readings, etc.) is stored contiguously, we can apply type-specific compression algorithms to it, so that each array is compressed separately.

Then, if a query asks for a subset of these columns:

SELECT time_bucket('1 minute', timestamp) AS minute,
	AVG(temperature)
FROM table
WHERE timestamp > now() - interval '1 day'
GROUP BY minute
ORDER BY minute DESC

The query engine can fetch (and decompress at query time) only the timestamp and temperature columns to compute and return this aggregation.

But given that Postgres’s MVCC-style storage format can write multiple rows on the same disk page, how can we help ensure that we only fetch the desired compressed arrays from disk, rather than a broader set of surrounding data? Here we leverage non-inline disk pages to store these compressed arrays, i.e., they are TOASTed so that the in-row data now points to a secondary disk page that stores the compressed array (the actual row in the main heap table becomes very small, because it’s just pointers to the TOASTed data). In such a way, only the compressed arrays for the required columns are brought in from disk, further improving query performance by reducing disk I/O. (Remember that each array might have 100s-1000s of data items, rather than 6 as shown.)

Indexing and querying on compressed data

However, this format by itself has a major issue: which rows should the database fetch and decompress in order to resolve a query? In the above schema, the database can’t easily determine which rows contain data from the past day, as the timestamp itself is in a compressed column. Do we need to decompress all data in a chunk (or even the entire hypertable) to determine which rows correspond to the latest day? And similarly, user queries might typically filter or group by a specific device (e.g., SELECT temperature … WHERE device_id = 'A').

Decompressing all data would be very inefficient. But because we are optimizing this table for time-series queries, we can do more, and automatically include more information in this row to improve query performance.

TimescaleDB does this by automatically building data hints and including additional groupings when converting data to this columnar format. When performing compression on the uncompressed hypertable (either via a specific command or using an asynchronous policy), the user specifies “order by” columns and optionally “segment by” columns. ORDER BY columns specify how the rows that are part of a compressed batch are ordered. Typically this is by timestamp, as in our running example, although it can also be a composite, e.g., ORDER BY time then location.
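
As a rough sketch of how this is specified (the table and column names are hypothetical, and exact function signatures may differ by TimescaleDB version), compression is configured per hypertable and can then be applied chunk by chunk:

-- Configure compression: segment by device, order rows within each batch by time.
-- (Hypothetical table/column names; check the docs for your TimescaleDB version.)
ALTER TABLE measurements SET (
	timescaledb.compress,
	timescaledb.compress_segmentby = 'device_id',
	timescaledb.compress_orderby = 'timestamp DESC'
);

-- Manually compress all chunks older than seven days.
SELECT compress_chunk(c)
FROM show_chunks('measurements', older_than => INTERVAL '7 days') c;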

For each “ORDER BY” column, TimescaleDB automatically creates additional columns that store the minimum and maximum value of that column. This way, the query planner can look at these special columns that specify the range of timestamps in the compressed row -- without first performing any decompression -- in order to determine whether the row could possibly match a time predicate specified by a user’s SQL query.

We can also segment compressed rows by a specific column, so that each compressed row corresponds to data about a single item, e.g., a specific device_id. In the following example, TimescaleDB segments by device_id, so that separate compressed rows exist for devices A and B, and each compressed row contains data from 1000 uncompressed rows about that device.

Device ID | Timestamp | Status Code | Temperature | Min Timestamp | Max Timestamp
A | [12:00:01, 12:00:02, 12:00:03] | [0, 0, 0] | [70.11, 70.12, 70.14] | 12:00:01 | 12:00:03
B | [12:00:01, 12:00:02, 12:00:03] | [0, 0, 4] | [69.70, 69.69, 69.70] | 12:00:01 | 12:00:03

Now, a query for device ‘A’ over a time interval is quite fast: The query planner can use an index to find those rows for ‘A’ that contain at least some timestamps corresponding to the specified interval, and even a sequential scan is quite fast since evaluating predicates on device IDs or min/max timestamps does not require decompression. Then, the query executor only decompresses the timestamp and temperature columns corresponding to those selected rows.
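
Concretely, a query of that shape (hypothetical names, matching the sketch above) only decompresses the arrays it actually needs:

-- The device_id segment column and the min/max timestamp columns are evaluated
-- without decompression; only the timestamp and temperature arrays of matching
-- compressed rows are then decompressed.
SELECT timestamp, temperature
FROM measurements
WHERE device_id = 'A'
  AND timestamp > now() - interval '1 day';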

This capability is powered by TimescaleDB’s built-in job scheduler framework. We’ve previously used it for various data lifecycle management tasks like data retention policies, data reordering, and continuous aggregations. Now, we leverage it to asynchronously convert data from its uncompressed row-based form to this compressed columnar form across chunks of TimescaleDB hypertables: once a chunk is old enough, it is transactionally converted from the row to the columnar form.
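
For example, an asynchronous compression policy looks roughly like this (hypothetical table name; TimescaleDB 1.5 names the function add_compress_chunks_policy, and later releases have since renamed it, so check the docs for your version):

-- Ask the background scheduler to compress chunks once they are older than 7 days.
SELECT add_compress_chunks_policy('measurements', INTERVAL '7 days');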

Query performance sneak peek

At this point, one logical question to ask would be, “How does compression impact query performance?”

What we find is that compression also leads to immediate performance improvements for many types of queries. As more data fits into less space, fewer disk pages (with compressed data) need to be read to answer queries.

Given the length of this post so far, we’ll cover query performance in depth in another, upcoming blog post, including performance for queries that touch disk as well as those that access in-memory data, for both DevOps and IoT workloads.

But for now, we thought we’d provide a sneak peek on our results.

Query performance benchmarks

We use the open-source Time Series Benchmark Suite (TSBS) with TimescaleDB running on cloud VMs with remote SSD storage (specifically, Google Cloud n1-highmem-8 instance types with 8 vCPUs and 52 GB of memory, using local NVMe SSD disk).

In this set of queries, we focus specifically on disk-bound performance, which one often encounters when performing more ad-hoc or randomized queries against large datasets; in some sense, these results serve as a “worst case” compared to warm data which may already be cached in memory. To do so, we ensured that all queries were made against data residing on disk, so that the OS virtual memory subsystem had not already cached disk pages into memory.

As you can see from the below table (which reports the average of 10 trials), virtually all of the TSBS queries are faster with native compression.

Cold Queries (from TSBS) | Uncompressed (ms/query) | Compressed (ms/query) | Improvement
cpu-max-all-1 | 42.517 | 42.314 | 1.00
cpu-max-all-8 | 46.657 | 40.342 | 1.16
groupby-orderby-limit | 1373.309 | 6065.812 | 0.23
high-cpu-1 | 46.657 | 40.342 | 1.16
high-cpu-all | 3551.953 | 8084.623 | 0.44
single-groupby-1-1-12 | 49.546 | 38.46 | 1.29
single-groupby-1-1-1 | 33.54 | 25.695 | 1.31
single-groupby-1-8-1 | 50.805 | 40.495 | 1.25
single-groupby-5-1-12 | 49.406 | 42.013 | 1.18
single-groupby-5-1-1 | 30.734 | 27.674 | 1.11
single-groupby-5-8-1 | 45.91 | 43.002 | 1.07
double-groupby-1 | 5925.591 | 1823.033 | 3.25
double-groupby-5 | 7568.038 | 2980.089 | 2.54
double-groupby-all | 9286.914 | 4399.367 | 2.11
lastpoint | 1674.194 | 264.666 | 6.33

The table above shows the latency of “cold” TSBS DevOps queries to TimescaleDB, with all data residing on disk, for both uncompressed and compressed data. “Improvement” is defined as “uncompressed query latency / compressed query latency.”

That said, one can construct queries that perform slower on compressed data. In particular, TimescaleDB’s compression currently limits the types of indexes that can be built on the compressed data; notably, B-trees can only be built on segment-by columns. But in practice, we find that queries which would be faster with these indexes tend to be rare (e.g., they also require a large number of distinct indexed items so that any one item is not present in most disk pages).

Limitations and future work

The initial release of TimescaleDB native compression is quite powerful, with custom advanced compression algorithms for various data types, delivered through our continuous asynchronous scheduling framework. In addition, we already have some improvements planned, e.g., better compression for JSON data.

One of the main limitations of our initial release in v1.5 is that once chunks are converted into compressed column form, we do not currently allow any further modifications of the data (e.g., inserts, updates, deletes) without manual decompression. In other words, chunks are immutable in compressed form. Attempts to modify the chunks’ data will either error or fail silently (as preferred by users).

That said, given that time-series workloads primarily insert (or less commonly update) recent data, this is much less of a limitation for time-series use cases than it would be for non-time-series use cases. Further, users can configure the age of chunks before they are converted to the compressed columnar form, which allows flexibility for moderately out-of-order data or during a planned backfill. Users can also expressly decompress chunks before modifying them. We also plan to weaken or remove this limitation in future releases.
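
For instance, a backfill into an already compressed chunk can be handled by decompressing the chunk, modifying it, and recompressing it; a minimal sketch (the chunk name is hypothetical; show_chunks() lists a hypertable’s real chunk names):

-- Decompress a chunk, modify its now row-oriented data, then recompress it.
SELECT decompress_chunk('_timescaledb_internal._hyper_1_2_chunk');
-- ... run the INSERT / UPDATE / DELETE statements for the backfill here ...
SELECT compress_chunk('_timescaledb_internal._hyper_1_2_chunk');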

Summary

We’re very excited about this new capability and how it will bring greater cost savings, better query performance, and improved storage scalability to TimescaleDB and our community.

As we mentioned above, if you are interested in trying out native compression today, you can install TimescaleDB or upgrade your current TimescaleDB deployment. (Within the next week, this feature will be available via Timescale Cloud.) You can also sign up for our upcoming webinar “How to Reduce Your Database Total Cost of Ownership with TimescaleDB” to learn more.

In the past couple months, we’ve announced both scale-out clustering and native compression for TimescaleDB. Taken together, they help realize our vision of TimescaleDB as a powerful, performant, and cost efficient platform for time-series data, from the small scale to the very large, from the edge to the cloud.

We’ve all repeatedly heard the misconception that one needs to sacrifice SQL, relational capabilities, query and data model flexibility, and battle-hardened dependability and reliability in time-series databases in order to achieve the needed scale, performance, and efficiency. Similarly, we've all heard skepticism around PostgreSQL: that while PostgreSQL is an amazing and reliable database foundation, it can’t possibly work for time-series data.

With TimescaleDB 1.5, we’re continuing to disprove those notions, and demonstrate that through dedicated focus and engineering for time-series data problems, one doesn’t have to make these tradeoffs.

If you have time-series data, please give the latest version of TimescaleDB a try. We welcome your feedback. And together, let's build the only time-series database that doesn't force you to make difficult trade-offs. Come, have your cake and eat it too.

