
A Bunch of Money on AWS and Some Benchmark Results


Editor’s note: Running these benchmarks and documenting the results is truly a team effort. In addition to John Sherwood, Eric Hanson and Nick Kline authored this blog post.

If you’re anything like us, you probably roll your eyes at “company benchmarks its own software, writes a post about it.” This of course raises a question: if we’re cynical industry veterans, numb from the constant deluge of benchmarketing … why are we writing this? Simple: we wanted to prove to ourselves that we’re the only modern, scalable database that can do a great job on the three well-known database benchmarks, TPC-C, TPC-H, and TPC-DS, which cover both OLTP and data warehousing. Now, we hope to get you to suspend your skepticism long enough for us to prove our capabilities to you.

MemSQL’s primary focus is on what we call operational analytics (you might have heard this referred to as HTAP, translytical, or HOAP), running analytical queries across large, constantly changing datasets with consistently high performance. This performance comes from scale-out, compilation of queries to machine code, vectorized query execution, and single instruction, multiple data (SIMD) instructions. Our operational analytics capability blurs OLTP and data warehouse (DW) functionality, and different use cases take advantage of different slices of our spectrum of capabilities.

To test some of the new features in our upcoming 7.0 release, we used multiple TPC benchmarks to push MemSQL beyond what our normal everyday stress tests can reach. Our results show that we can do both transaction processing and data warehousing well, and we scale well as workload size increases. No other scale-out database running on industry-standard hardware can do this.

Playing Fair

TPC benchmarks have a certification process, and we did our best to follow the specifications, but we did not work with TPC to certify the benchmarks, so these are informal results. But we hope it goes without saying that we didn’t cheat. In fact, MemSQL does not have any “benchmark specials” (features designed just to make the benchmark result better, but that nobody would ever use for a real application) built into the code.

The TL;DR

This is a long post because we wanted to write an in-depth overview of how we ran the benchmarks so you can see how we achieved all these results. But if you want just a quick summary, here’s how we did on the benchmarks.

  • TPC-C: MemSQL gave up to 8 times more throughput than other databases on comparable amounts of hardware.
  • TPC-H: MemSQL’s performance against other databases varied on this benchmark, but was faster than multiple modern scale-out database products that only support data warehousing.
  • TPC-DS: MemSQL’s performance ranged from comparable to as much as two times faster than other databases.

What we accomplished was solid performance across three distinct benchmarks, establishing MemSQL as a top choice database for operational analytics and cloud-native applications. Now let’s move on to the details of the benchmarks.

TPC-C: Scale-out OLTP

On the massively transactional side of the spectrum, we have the TPC-C benchmark. To quote the 1997 SIGMOD presentation announcing it, a benchmark is a “distillation of the essential attributes of a workload,” and TPC-C distills the absolute hell out of a sharded transactional workflow. As the scale (defined by the number of warehouse facilities being emulated) increases, the potential parallelism increases as well. From our perspective, this is ideal for discovering any bottlenecks in our own code.

Our TPC-C database design used single tables for the large data sets, allowing us to use an existing driver targeting MySQL. Unlike some other official results published by major database vendors, we did not artificially split the data into multiple separate tables to reduce locking and contention. The way we ran the benchmark is much simpler, and shows how our scale-out architecture can make application development easier.

While experimenting with smaller datasets, we quickly discovered that driving significant load required us to nearly match the CPU of the aggregators with the leaves in the cluster. This configuration, strong all over (“very Schwarzenegger,” as one of our engineers put it), is quite unusual; it’s rare that customers require such a ratio of aggregators to leaves.

In part, this is because our internal benchmarking service colocates drivers on aggregators, which required us to size up the boxes to have the extra CPU. Additionally, under the pure OLTP TPC-C workload, aggregators are primarily coordinating transaction state and are effectively concurrency bound.

As we would recommend for a real workload of this nature, we had redundancy and synchronous replication enabled. This means that there are two copies of every data partition on two different nodes for high availability (HA), and transaction commits are not acknowledged to the client until both the primary and secondary replicas are updated. This of course requires twice the memory compared to running without HA, and adds to transaction overhead, since row updates must be applied on the replica node before the commit. This cost is to be expected if HA with fast failover is required. If you want HA, you need to pay.

After experimenting with smaller datasets, we drove just over 800,000 transactions per minute (tpmC) against 10,000 warehouses, using a cluster with 18 leaves and 18 aggregators, all r3.8xlarge instances. This led us to scale up, moving onto TPC-C at the 50,000 scale. This required what we would term a formidable cluster; storing a single replica of the data (including indexes) required more than 9 TB of RAM.

For HA, we stored two copies of the data, which, after some fairly straightforward math, took us to almost 20 TB. We decided to use r5.metal instances, using 2x the leaf instances, for a total of 6x the cores compared to the 10,000 warehouse cluster. Notably, the only change we made in running the benchmark itself was in tuning the partition count and hash index buckets; from the perspective of the open source benchmark driver we were using, originally written for MySQL, everything just worked.

The last notable thing about our 50,000 warehouse run was that we ran out of hardware from AWS while trying to reach the limit of the cluster; as we had taken 36 hosts for leaves and 21 for aggregators, we suppose that’s pretty reasonable.

With 1,728 physical cores on the leaf nodes for the megacluster (to use the scientific term), it’s hard to argue with calling this “throwing hardware at the problem.” However, as discussed above, we were able to easily meet the same per-core performance as at smaller scales. We feel that our overall results are a strong validation of the performance improvements to our new intra-cluster replication implementation in MemSQL 7.0.


Figure 1 – TPC-C comparison of MemSQL, CockroachDB, and Amazon Aurora.

Scale (warehouses) | MemSQL vCPUs | CockroachDB vCPUs | Amazon Aurora vCPUs
1,000              | 48           | 48                | 32 (single host)
10,000             | 576          | 480               | N/A
50,000             | 3,456        | not disclosed     | N/A

Table 1 – Leaf Cores, normalized to vCPUs. This table only includes leaf vCPUs for MemSQL, as we also had to use aggregator CPU to run the benchmark driver instances. Full details are included in Appendix A.

Some comparisons of our results to other scale-out TPC-C results are shown in Figure 1. Table 1 compares the hardware used to obtain the results in Figure 1, to the extent that the hardware behind each result is publicly known. We use the term vCPU to refer to the notion of a CPU and its hyperthreading partner, similar to how AWS refers to vCPU counts.

The results for CockroachDB are taken from [CR18]; regarding Figure 1, the amount of hardware CockroachDB used for the 50,000 warehouse result was not fully specified, so it’s not possible to directly compare performance per unit of hardware. Additionally, the Aurora figure is from [Ver17], and the only comparable run described was for the 1,000 warehouse scale factor. Furthermore, given the scale-out architecture of the Aurora storage layer, presumably, additional vCPUs are required for log and commit processing. It’s not possible to discern how many vCPUs were needed for this from [Ver17].

In Table 1, for MemSQL, we show both the number of leaf cores and the total number of cores, including aggregator cores. The way our benchmark kit is designed, it was easiest for us to run the benchmark with this configuration. In addition, our benchmark setup ran the application threads driving the transaction work on the aggregator machines. Normally that is not accounted for in the benchmark results.

The aggregator nodes were significantly over-provisioned and not fully utilized while driving this test, though we were not able to determine the minimum amount of equipment needed to obtain performance at the level shown in Figure 1. Despite the caveats, these results show the dramatic benefits of our distributed architecture, in-memory skip list and hash storage structures, and compilation of queries to machine code.

Per-Core Throughput for OLTP

It’s comforting to know that you’re getting your money’s worth as you scale a cluster and workload to larger sizes. Figure 2 illustrates that MemSQL’s per-core throughput for TPC-C, measured in transactions per minute (tpmC) per core, actually increases as we go from 1,000 to 50,000 warehouses.


Figure 2 — Per vCPU transaction throughput (includes aggregator vCPUs). When increasing the dataset size and the hardware to match, MemSQL TPC-C throughput per vCPU held within a 30% band. We had plenty of headroom on the leaves at the 50,000 scale, and could have driven more throughput, but AWS ran out of r5.metal hosts for us to use as benchmark drivers, which was a pretty cool consolation prize in our opinion.

Figure 2 illustrates the ability to maintain throughput per core for a transaction processing workload as data and cluster size increase. This translates to roughly linear scalability in throughput as the size of the transaction processing workload increases. Yet you can still run analytical queries across all the data, unlike the common approach of manually sharding large data sets across multiple database instances with a single-node DBMS, such as Postgres, MySQL, or SQL Server. The nature of MemSQL allows for running analytic-style queries on this data in the same instance that is also handling transaction processing.
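
As an illustration of what we mean (not part of the benchmark itself), a reporting query like the one below can run directly against the live TPC-C tables while the transactional workload continues. The table and column names come from the standard TPC-C schema; the query itself is ours, not one of the benchmark transactions.

-- Illustrative only: revenue by district over recently delivered orders,
-- run on the same tables the OLTP workload is updating.
SELECT ol_w_id AS warehouse,
       ol_d_id AS district,
       SUM(ol_amount) AS revenue
FROM order_line
WHERE ol_delivery_d >= NOW() - INTERVAL 1 HOUR
GROUP BY ol_w_id, ol_d_id
ORDER BY revenue DESC
LIMIT 10;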

TPC-H and TPC-DS: DW Scalability and Then Some

On the opposite end of the spectrum, we have TPC-H (H) and TPC-DS (DS), categorized by the TPC as “decision support” benchmarks. H and DS use similar datasets, and DS is effectively the next-generation version of H. While H generates fairly straightforward queries and tends to be shard-friendly, DS thrills in its use of advanced SQL features and functions and exhilarates in its lopsided filters. Running DS is notoriously, purposefully difficult, and MemSQL can run all 99 DS queries, while many decision support-oriented databases cannot.

TPC-DS

At scale factor 10,000, the largest TPC-DS table contains just shy of 29 billion rows, with some 24 billion others spread out across the rest of the tables.

Figure 3 — comparative TPC-DS runtimes at 10 TB scale factor — lower is better. For TPC-DS, we performed a ‘power run,’ where the sequential runtime of each query is summed. Lower is better, unless “the benchmark is still running” is an excuse to goof off at work.

Product             | Physical Cores (not vCPUs) | Power Run Time (s)
MemSQL              | 640                        | 6,495
Cloud DW Product    | 32 nodes, unknown size     | 4,481
Alibaba AnalyticsDB | 1024                       | 3,935
Alibaba Hadoop      | 520                        | 16,207
Transwarp Data Hub  | 1024                       | 21,615

Table 3 – comparative hardware used for TPC-DS runs. We normalized to physical cores rather than vCPUs to match TPC usage.

As mentioned, DS involves a formidable array of operators and features. There are not many TPC-DS results to compare to. We scoured the internet to find comparable results for DS, and found some full runs for Alibaba Hadoop [AL19, AL19b] and Cisco UCS [Cis19]. Figure 3 shows the comparison of MemSQL, which is capable of running all 99 queries, with the other systems for which we were able to find full results – including one whose vendor we unfortunately cannot name.

For the result shown in the figure, where we spent about 6,500 seconds to run the queries, we used a cluster consisting of 19 i3.16xlarge leaves and a single i3.16xlarge aggregator instance. With over 600 leaf CPUs, this was certainly a sizable (and, we have been informed by our finance team, expensive) cluster, but still similar in scale to other clusters that successfully completed the benchmark.

TPC-H

Figure 4 — TPC-H total runtimes comparing MemSQL and SQL Server — smaller is better. We performed the same ‘power run’ with TPC-H as for DS, where the result is the sum of the run times of the sequentially executed queries that constitute the benchmark. This figure compares MemSQL, SQL Server 2017 at scale factor 10,000, and SQL Server 2016 at scale factor 30,000 (10 TB and 30 TB).

Compared to TPC-DS, TPC-H is a much friendlier (and older) benchmark. We ran H at the 10 TB scale factor on a cluster of 8 i3.8xlarge leaf nodes, and then scaled to run it at 30 TB using 24 i3.8xlarge leaf nodes. The 30 TB dataset contains a table with 180 billion rows, compressed to about 11 TB on disk in our columnstore, and took about three and a half hours at an average of 2.4 GB/s to load from S3 using our Pipelines feature.

While researching the benchmark, we found that every run submitted to the TPC at these larger scales was for SQL Server, the best of which we’ve graphed next to our results above. As shown in Figure 4, at a lower scale, we are outperformed by a quad-socket host running SQL Server with 224 vCPUs versus our 256. But at a larger size, scale-out catches up to scale-up – and running on commodity hardware versus specially-crafted quad socket monsters is a virtue of its own, as far as data center operators are concerned. Details of the SQL Server results are given in benchmark results published by Microsoft [MS16, MS17].


Figure 5 – Data volume processed per second per core for TPC-H. This chart shows a normalized throughput measure (DB_size_MB / cores / sum_of_query_runtimes_in_seconds) for each of TPC-H 10,000 and TPC-H 30,000. This is to illustrate how MemSQL scales for the benchmark as data size increases.

What we’re most looking for here is validation of our scale-out architecture. A fantasy, “perfectly scalable” database would process 3x the data with 3x the hardware in the same amount of time; as illustrated in Figure 5, we took about 30% longer than that ideal. Given that queries often require a final aggregation step, some falloff is expected. Moreover, adding hardware keeps delivering substantial returns and doesn’t hit a ceiling in these tests: per-core throughput drops by roughly 30% while the total work done roughly triples.
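
To make the normalization concrete, here is the same arithmetic done with the power-run numbers from Appendix A (approximate, and not the exact values plotted in Figure 5):

  Ideal 30 TB power run:  3x the data on 3x the cores = the 10 TB time of 1,789 s
  Actual 30 TB power run: 2,371 s, about 1.33x the ideal (roughly 30% longer)

  10 TB: 10,000,000 MB / (128 cores x 1,789 s) ≈ 43.7 MB/s per core
  30 TB: 30,000,000 MB / (384 cores x 2,371 s) ≈ 33.0 MB/s per core (about 75% of the 10 TB figure)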

Figure 6 — comparative TPC-H results at 30 TB scale factor.

We again scoured the web for TPC-H results and found several for well-known cloud data warehouse products, including Amazon Redshift, Microsoft Azure SQL Data Warehouse, and Snowflake Data Warehouse in reports from GigaOm [GO18, GO19], as well as certified single-box results on the TPC web site. As shown in Figure 6, MemSQL provides competitive results compared to these products.

Cost and Value

At this point you might think the results are pretty impressive, but you’re probably also thinking there is some special-purpose hardware helping make these benchmarks run smoothly. Well, there isn’t. Just as we didn’t add any code tricks to make the benchmark run better, we didn’t use anything other than industry-standard hardware. We really wanted to make these benchmarks replicable by anyone who wanted to test them, and you can. We explain how to do that below.

The enterprise edition of MemSQL will cost you dramatically less than legacy vendors such as Oracle, plus we have a free version running on up to a 4 node cluster that supports all enterprise features. Beyond just price, we believe MemSQL offers new value to customers by enabling them to build operational analytics applications, mixing OLTP and analytics operations, in a way that has not been feasible before. This enables real-time dashboards and other analytics on just-born data, driving new business value.

By showing results for TPC-C, TPC-H, and TPC-DS running on one scale-out-capable SQL DBMS, we believe we demonstrate that MemSQL delivers new value. In addition, given the breadth MemSQL covers, once a group of application developers in an organization are trained to use it, they can apply their skills to a broad range of applications, giving flexibility to them and their managers. This of course is also valuable to the business.

Conclusion

We’ve shown that MemSQL can deliver strong results for the TPC-H and TPC-DS analytical benchmarks and the TPC-C transaction processing benchmark. To our knowledge, we’re the only SQL DBMS supporting true shared-nothing scale-out that can do this. Our hardware needs are comparable to competitors when running these benchmarks at similar scale factors, and our software licensing costs are significantly less than those of the established database vendors. MemSQL skills can be used to build a variety of applications, including systems for transaction processing, operational analytics, operational data stores, and data warehousing, without scale limits. This will give you and your team the ability to solve multiple problems at once, in less time, with one powerful tool.

Appendix A: More Detailed Results

TPC-C

The hardware configurations we used for TPC-C for 1,000, 10,000, and 50,000 warehouse scale factors are given in the following table. The node types given are for AWS.

Warehouses | Leaf Nodes      | Sum Leaf vCPUs | Sum Leaf RAM | Aggregator Nodes | tpmC
1,000      | 3 x r3.4xlarge  | 48             | 366 GB       | 3 x r3.4xlarge   | 94,900
10,000     | 18 x r3.8xlarge | 1,152          | 4.29 TB      | 18 x r3.8xlarge  | 800,400
50,000     | 36 x r5.metal   | 3,456          | 27 TB        | 21 x r5.metal    | 5,300,000

We used the open-source MySQL driver from Percona for TPC-C, with slight modifications to the DDL. Specifically, we made the item table into a reference table, the history table into a columnstore table, changed indexes where appropriate to be hashes instead of skip lists, and sized the bucket count appropriately for the size of the dataset. Our cluster was configured to use two replicas for data, replicated synchronously across paired hosts, with asynchronous writes to disk.
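
As an abridged illustration of those DDL changes, here is the general shape of what we mean; the column lists are heavily trimmed, the bucket counts are omitted, and the exact definitions live in our forked repo.

-- Abridged sketch only; see github.com/memsql/tpcc-mysql for the real DDL.
-- item is small and read-mostly, so it becomes a reference table that is
-- fully replicated to every node.
CREATE REFERENCE TABLE item (
  i_id INT NOT NULL,
  i_name VARCHAR(24),
  i_price DECIMAL(5,2),
  PRIMARY KEY (i_id)
);

-- history is insert-only, so it becomes a columnstore table sharded by warehouse.
CREATE TABLE history (
  h_c_id INT,
  h_w_id INT,
  h_date DATETIME,
  h_amount DECIMAL(6,2),
  SHARD KEY (h_w_id),
  KEY (h_date) USING CLUSTERED COLUMNSTORE
);

-- Hot rowstore tables keep a primary key that covers the shard key, and get
-- hash indexes in place of skip lists where lookups are point lookups.
CREATE TABLE new_orders (
  no_o_id INT NOT NULL,
  no_d_id TINYINT NOT NULL,
  no_w_id INT NOT NULL,
  SHARD KEY (no_w_id),
  PRIMARY KEY (no_w_id, no_d_id, no_o_id),
  KEY (no_w_id, no_d_id) USING HASH
);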

Aggregator nodes ran the benchmark driver alongside the MemSQL aggregator instances, with the driver consuming roughly half of the CPU used. We used the same instance type for leaves and aggregators out of convenience, but in the TPC-C workload the aggregators required very little RAM, so it should be easy to cut costs by running them on lower-end hosts.

We can’t speak highly enough of the driver from Percona and how useful it was in enabling us to run the benchmark. We generated our datasets using the utility from the same repo, albeit modified to generate pipe-separated value files. We then loaded these files in parallel with MemSQL Pipelines, instead of doing direct database inserts.

TPC-H

The hardware configurations we used on AWS for TPC-H are given in the following table.

Scale Factor | Leaf Nodes      | Sum Leaf Phys. Cores | Sum Leaf RAM | Aggregator Nodes | Power Run (s)
10 TB        | 8 x i3.8xlarge  | 128                  | 1.9 TB       | 1 x m4.10xlarge  | 1,789
30 TB        | 24 x i3.8xlarge | 384                  | 5.72 TB      | 1 x m4.10xlarge  | 2,371

We generated our datasets using the open-source dataset generator available from tpc.org. In calculating the power run, we performed a warmup query, then averaged the result of 1-5 executions. The total number of runs was sometimes limited due to the total runtime of all the queries and the time available to finish the runs.

TPC-DS

The hardware configuration we used on AWS for TPC-DS is shown in the table below.

Scale Factor | Leaf Nodes       | Sum Leaf Phys. Cores | Sum Leaf RAM | Aggregator Nodes | Power Run (s)
10 TB        | 19 x i3.16xlarge | 608                  | 9.05 TB      | 1 x i3.16xlarge  | 6,495

We generated our datasets using the dataset generator available from tpc.org. In calculating the power run, we performed a warmup query, then averaged the result of 1-5 executions, depending on the runtime of query execution, in the same fashion as for our TPC-H runs.

Appendix B: How to Play Along at Home

If you’re interested in replicating our results, or experimenting with MemSQL and these benchmarks, we’ve included instructions on how we went about running C, H, and DS here.

TPC-C

To run the TPC-C benchmark, we relied heavily on Percona’s excellent work, which we’ve forked to github.com/memsql/tpcc-mysql. This driver allows both generating the dataset and driving load. As MemSQL is wire compatible with MySQL, we were able to use the driver binary in an effectively unaltered format; in fact, the only changes we made to the repository were as follows:

  • Fixed a units bug that caused queries to be judged as failing if they took more than 5 milliseconds, rather than the intended 5 seconds.
  • Changed the dataset generator to create pipe-separated value files rather than inserting directly into the database.
  • Altered the schema to have shard keys and to take advantage of MemSQL features such as reference tables and hash indices.

If you’d like to use this driver to verify our results, the README covers how to both generate data and run the benchmark.

Cluster Setup

For the purposes of the TPC-C benchmark, there are several configuration settings that will drastically impact your results. To match what we used, do the following:

  • MemSQL 7.0: While we did recently make our latest and greatest 6.x release available, one of the major improvements in MemSQL 7.0 (download here) is an entirely new system for handling intra-cluster replication more efficiently. While 6.x releases support replication, performance is dramatically improved in 7.0.
  • Enable synchronous replication.
  • Enable asynchronous durability.
  • Once MemSQL has the correct configuration, you will need to create a database, per the Percona README (a configuration sketch follows this list).
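
The sketch below shows roughly what that configuration looks like. The redundancy_level engine variable is real, but treat the CREATE DATABASE clause, the partition count, and whether the variable can be set with SET versus in the node configuration as assumptions to verify against the MemSQL 7.0 documentation for your version.

-- On the master aggregator: keep two replicas of every partition (HA).
-- Depending on your version, this may need to go in the node config instead.
SET @@GLOBAL.redundancy_level = 2;

-- Assumed syntax sketch: synchronous replication with asynchronous disk
-- writes, matching the settings above. The partition count (144 here) is a
-- placeholder; tune it to your cluster.
CREATE DATABASE tpcc PARTITIONS 144
  WITH ASYNC DURABILITY SYNC REPLICATION;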

Benchmark Execution

While running the benchmark is exceedingly simple, to fully stress a large MemSQL cluster you will need to spread the load across multiple aggregators. To get relevant results, the driver instances need to run in parallel, and only data from when all drivers are running is a valid representation of the cluster’s performance. We wrote a harness that used the database to coordinate across hosts; however, as it hooks in pretty deeply to our internal benchmarking system, we’re unable to meaningfully open source this extension.

TPC-H & DS

While the TPC-C specification is highly descriptive and gives a wide latitude in how results are achieved, TPC-H and DS are explicitly about executing SQL queries. As such, we were able to leverage the dataset and query generation tools available for free download from tpc.org: [H], [DS].

At a high level, running both benchmarks on MemSQL consists of the following steps:

  1. Data generation
  2. Create the table schema using create-tables.sql, so that you get properly sized MemSQL data types along with our shard key and index definitions (an abridged example follows these steps)
  3. Load the data, using MemSQL Pipelines or a tool of your choice
  4. After loading the data, create statistics and optimize the tables; the commands are found in after-load.sql
  5. Run the queries by executing queries.sql
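
To give a sense of what those definitions look like, here is an abridged, illustrative sketch of one TPC-H fact table; the create-tables.sql files in the repo are authoritative, and the columns, types, and keys below are trimmed placeholders rather than our exact schema.

-- Abridged sketch of a TPC-H fact table as a sharded columnstore table.
-- See benchmarks-tpc/tpch/create-tables.sql for the actual definitions.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_partkey INT NOT NULL,
  l_suppkey INT NOT NULL,
  l_quantity DECIMAL(15,2) NOT NULL,
  l_extendedprice DECIMAL(15,2) NOT NULL,
  l_discount DECIMAL(15,2) NOT NULL,
  l_shipdate DATE NOT NULL,
  -- Shard on the join key so lineitem-orders joins stay local to a partition.
  SHARD KEY (l_orderkey),
  -- Clustered columnstore key provides on-disk ordering and segment elimination.
  KEY (l_shipdate) USING CLUSTERED COLUMNSTORE
);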

Sizing the MemSQL database cluster

We ran the benchmarks on AWS. For the DS benchmark at the 10 TB size, we used a 20-node cluster: one master aggregator and 19 leaves. For the H benchmark at 30 TB, we used a cluster with one m4.10xlarge aggregator and 24 i3.8xlarge leaves.

Generating the data

Use the standard data generator for each benchmark, found at http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp . Look under H and DS, register, download the latest version of the tools, and compile the C-based generator. Data generation can take a long time, especially at larger scales like the 10 TB and 30 TB sizes we used. The data size of the generated CSV files ends up quite close to the nominal scale factor, so we highly recommend compressing them and uploading them to cloud storage asynchronously as they are produced; we stored the generated data files in AWS S3.

Creating the schema

You can find the table schema and queries at the MemSQL github repo at memsql/benchmarks-tpc, looking under the ‘h’ or ‘ds’ subdirectories as appropriate. You can clone or copy them to your local machine by using the command

git clone git@github.com:memsql/benchmarks-tpc.git

In the benchmarks-tpc/tpch and benchmarks-tpc/tpcds directories you can find the files we used to create and run the benchmark: the H and DS versions, respectively, of create-tables.sql, after-load.sql, and queries.sql. Two common ways to submit queries to MemSQL are the command-line MySQL client and the MemSQL tools. The MemSQL Studio tools, including the SQL editor, are described at https://docs.memsql.com/memsql-studio/latest/memsql-studio-overview/ .

Loading the data

Depending on the system your test cluster is running on, there are various alternatives for loading data. The MemSQL Pipelines feature loads data in parallel; see https://docs.memsql.com/memsql-pipelines/v6.7/pipelines-overview/ . We used MemSQL Pipelines to load data from S3 for our tests. An example pipeline command follows.

CREATE PIPELINE catalog_sales AS LOAD DATA S3
"<memsqlpath>/sf_10000/catalog_sales."
CONFIG '{"region":"us-east-1","disable_gunzip":false}'
CREDENTIALS '{ "aws_access_key_id": "<key>",
"aws_secret_access_key": "<secret>"}'
INTO TABLE `catalog_sales`
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '|\n';
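
A pipeline is created in the stopped state, so after defining one per table you start it and can watch its progress; the statements below simply continue the catalog_sales example.

-- Kick off the parallel load and check on it.
START PIPELINE catalog_sales;
SHOW PIPELINES;
SELECT COUNT(*) FROM `catalog_sales`;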

After loading

The background flusher in MemSQL automatically and asynchronously reorganizes columnstore segment files (the internal files used to store columnstore tables) for compression and to create a total ordering across all segment files in a table. To make query execution faster and avoid asynchronous processes running during query execution, we manually ran OPTIMIZE TABLE on each table, as in OPTIMIZE TABLE `call_center`; see https://docs.memsql.com/sql-reference/v6.7/optimize-table/ .

MemSQL uses manually created histogram statistics. They can be created by running ANALYZE TABLE on each table, such as ANALYZE TABLE `call_center` COLUMNS ALL ENABLE.

We did this before running the queries. These two commands are provided in after-load.sql.

Running the queries

We ran the queries multiple times and use the average run time, not counting the initial run where the query was compiled. We used columnstore tables for this benchmark, which are stored on disk, not in memory. The difference between the warm-up run and subsequent runs is just the time to compile a query, which was generally less than a second, but might be as long as a couple of seconds for very large and complex queries. The average run time varies a little on multiple runs of a query in a shared system like AWS, but we never saw that much variation.

Many TPC benchmarks, including TPC-DS, have templatized queries that replace “placeholders” in an incomplete query text with specific values, such as numeric or string constants. We didn’t find much variation with different parameters, so we used a fixed parameterization for the ease of tracking performance across time. We include the query text with the parameters we used when running the queries in queries.sql .
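
As a concrete illustration of a fixed parameterization, here is an abridged version of TPC-H query 1 with its one substitution parameter set to a 90-day offset, used here purely for illustration; the full, fixed-parameter text of every query we ran is in queries.sql, and the column list below is trimmed for brevity.

-- Abridged TPC-H Q1 (pricing summary report) with the date offset fixed at 90 days.
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity) AS sum_qty,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE_SUB('1998-12-01', INTERVAL 90 DAY)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;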

References

