

Optimize ClickHouse performance using AWS Graviton3 - Infrastructure Solutions b...
source link: https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/improve-clickhouse-performance-up-to-26-by-using-aws-graviton3

Co-authors: Martin Ma and Zaiping Bie
Introduction
ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP) of queries. It offers industry-leading query performance while significantly reducing storage requirements through its use of columnar storage and compression. It has been very popular in the OLAP field for the past several years and is widely used by many enterprises.
In this blog, we compare the query latency (processing time) and throughput of ClickHouse on two Amazon EC2 instance families over a range of instance sizes. These instance families are the Amazon EC2 C7g (based on Arm Neoverse-powered AWS Graviton3 processors) and C6i (based on 3rd Generation Intel Xeon Scalable processors). Our findings demonstrate that ClickHouse deployments on C7g instances can achieve up to 26% performance advantage over C6i instances. The following sections cover the details of our testing methodology and results.
Performance benchmark setup and result
For the benchmark setup, the ClickHouse server and client are deployed on separate instances. We connect the ClickHouse client to the ClickHouse server and repeatedly send preset queries. We then collect query processing time and throughput to compare performance between C7g and C6i instances.
Build Config
To achieve the best performance, besides using the latest Clang to build ClickHouse per the official procedure, we also apply the CMake NATIVE and AVX-related flags as follows.
Architecture | ClickHouse CMake flags
AArch64 | -DARCH_NATIVE=ON
x86_64 | -DARCH_NATIVE=ON -DENABLE_AVX2=ON -DENABLE_AVX2_FOR_SPEC_OP=ON -DENABLE_AVX512=ON -DENABLE_AVX512_FOR_SPEC_OP=ON
To align jemalloc behavior on C7g and C6i, the following jemalloc parameters are configured in jemalloc_internal_defs.h.in.
jemalloc parameter | Value
LG_PAGE | 12 (one page is 2^LG_PAGE bytes)
LG_HUGEPAGE | 21 (one huge page is 2^LG_HUGEPAGE bytes)
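As a quick check of what these two values mean in bytes, the following minimal sketch evaluates the powers of two from the table. The variable names are chosen for illustration only and are not taken from the jemalloc sources.

```python
# Page sizes implied by the jemalloc settings listed above.
LG_PAGE = 12      # one page is 2^LG_PAGE bytes
LG_HUGEPAGE = 21  # one huge page is 2^LG_HUGEPAGE bytes

page_bytes = 1 << LG_PAGE
hugepage_bytes = 1 << LG_HUGEPAGE

print(f"page: {page_bytes} bytes ({page_bytes // 1024} KiB)")
print(f"huge page: {hugepage_bytes} bytes ({hugepage_bytes // (1024 * 1024)} MiB)")
```

In other words, both instance families are built with 4 KiB pages and 2 MiB huge pages from jemalloc's point of view.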
Server Config
The ClickHouse server runs on C7g/C6i instance families across a range of instance sizes.
The benchmark client runs on a single C7g.4xlarge instance.
The following table summarizes the tested instance types.
Instance Type | Instance Size (vCPU) | Memory (GiB) | Storage
C7g / C6i | 2xlarge (8) | | 50GB (EBS gp3)
C7g / C6i | 4xlarge (16) | | 50GB (EBS gp3)
C7g / C6i | 8xlarge (32) | | 50GB (EBS gp3)
C7g / C6i | 16xlarge (64) | | 50GB (EBS gp3)
The software versions and test parameters are as follows:
Software | Version
ClickHouse | v22.5.1.2079-stable
Operating System | Amazon Linux 2
Kernel | 5.10.112-108.499.amzn2.aarch64
ClickHouse server parameter | Value
max_threads | vCPU number
Note: the max_threads parameter specifies the number of worker threads for parallel query processing on the ClickHouse server; its default value is the number of physical CPU cores. With this default setting, C7g instances outperform C6i instances by 40%, but up to half of the CPU resources sit idle on C6i instances while C7g instances are fully utilized. This is because each C6i vCPU is a hardware thread (two per physical core), whereas each C7g vCPU is a physical core. To fully utilize the CPU resources on C6i, we set max_threads to the vCPU count on both C7g and C6i instances in this comparison.
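One way to pin max_threads to the vCPU count is sketched below. This is an assumption for illustration: the post sets max_threads as a server parameter and does not show how, whereas this sketch uses ClickHouse's query-level SETTINGS clause. The helper name with_max_threads is hypothetical.

```python
import os


def with_max_threads(query: str, threads: int) -> str:
    """Append a query-level SETTINGS clause that pins max_threads (hypothetical helper)."""
    return f"{query.rstrip().rstrip(';')} SETTINGS max_threads = {threads}"


# os.cpu_count() reports logical CPUs, which on an EC2 instance matches the vCPU count.
vcpus = os.cpu_count() or 1
print(with_max_threads("SELECT count() FROM hits_100m_obfuscated;", vcpus))
```

Setting the value per query keeps the sketch self-contained; a server-wide profile setting would have the same effect for every query.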
Query Time Test
We use the web analytics dataset (the "hits" table, containing 100 million rows) and the 43 typical queries provided by the official ClickHouse benchmark method to collect query processing times.
For each of these 43 queries, the average query time is the arithmetic mean of 10 consecutive runs after one warmup run. The total query time, shown in the following table, is the sum of the 43 average times. We observed a performance uplift of up to 25.8% when running ClickHouse on C7g instances compared with C6i instances.
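The aggregation described above can be sketched as follows. The function names and the sample timings are made up for illustration (two queries with three timed runs each, rather than 43 queries with 10 runs); they are not data from the benchmark.

```python
from statistics import mean


def average_query_time(timings, warmup=1):
    """Arithmetic mean of the timed runs, discarding the first `warmup` runs."""
    measured = timings[warmup:]
    if not measured:
        raise ValueError("no timed runs left after warmup")
    return mean(measured)


def total_query_time(per_query_timings):
    """Sum of the per-query average times."""
    return sum(average_query_time(t) for t in per_query_timings)


# Illustrative values only: each inner list is one warmup run followed by timed runs.
sample = [
    [0.90, 0.50, 0.50, 0.50],
    [2.00, 1.00, 1.50, 2.00],
]
print(total_query_time(sample))
```

Discarding the warmup run keeps cold-cache effects out of the average, so the total reflects steady-state query time.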
The following table shows total query processing time (lower is better) comparison between C7g and C6i.
Instance Size | C7g (Sec) | C6i (Sec) | Performance gain
2xlarge | 34.95 | 42.77 | 18.3%
4xlarge | 18.91 | 24.57 | 23.0%
8xlarge | 11.72 | 15.57 | 24.8%
16xlarge | | 12.16 | 25.8%
Table 1. ClickHouse query processing time benchmark results on C7g vs C6i
Figure 1. Query time Performance gains for C7g vs. C6i
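The post does not spell out how the "Performance gain" column is derived. The sketch below assumes it is the relative reduction in processing time with C6i as the baseline, which reproduces the 2xlarge and 4xlarge rows of Table 1; other rows may differ in the last digit if the published times were rounded. The function name is hypothetical.

```python
def time_gain_percent(c7g_sec, c6i_sec):
    """Relative reduction in processing time versus C6i, in percent (assumed formula)."""
    return (c6i_sec - c7g_sec) / c6i_sec * 100


# Total query times from Table 1.
for size, c7g, c6i in [("2xlarge", 34.95, 42.77), ("4xlarge", 18.91, 24.57)]:
    print(f"{size}: {time_gain_percent(c7g, c6i):.1f}%")
```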
We also selected the three queries that consume the most processing time (Query 19, Query 33, and Query 34) to observe the performance uplift on C7g instances compared with C6i instances.
Query 19 | SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM hits_100m_obfuscated GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;
Query 33 | SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_obfuscated GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
Query 34 | SELECT URL, count() AS c FROM hits_100m_obfuscated GROUP BY URL ORDER BY c DESC LIMIT 10;
The following tables show the results for these three queries on C7g and C6i instances (lower is better).
Instance Size | C7g (Sec) | C6i (Sec) | Performance gain
2xlarge | 3.995 | 4.918 | 18.8%
4xlarge | 2.002 | 2.736 | 26.8%
8xlarge | 1.101 | 1.558 | 29.3%
16xlarge | 0.690 | 1.010 | 31.7%
Table 2. Query 19 results on C7g vs C6i
Figure 2. Query 19 Performance gains for C7g vs. C6i instances
Instance Size | C7g (Sec) | C6i (Sec) | Performance gain
2xlarge | 4.562 | 4.947 |
4xlarge | 2.351 | 2.816 | 16.5%
8xlarge | 1.578 | 2.107 | 25.1%
16xlarge | 1.137 | 1.608 | 29.3%
Table 3. Query 33 results on C7g vs C6i
Figure 3. Query 33 Performance gains for C7g vs. C6i instances
Instance Size | C7g (Sec) | C6i (Sec) | Performance gain
2xlarge | 3.225 | 3.766 | 14.4%
4xlarge | 1.793 | 2.171 | 17.4%
8xlarge | 1.066 | 1.325 | 19.6%
16xlarge | 0.774 | 1.036 | 25.4%
Table 4. Query 34 results on C7g vs C6i
Figure 4. Query 34 Performance gains for C7g vs. C6i instances
Throughput Test
We used the official ClickHouse benchmark tool to collect throughput data with the same dataset and queries. After a warmup phase, each test uses the benchmark tool to continuously send all 43 typical queries to the server and reports queries per second (QPS) at the end of the test. With a single connection, we observed a performance uplift of up to 31.6% when running ClickHouse on C7g instances compared with C6i instances.
The following table shows the QPS (higher is better) comparison for the default single connection scenario (clickhouse-benchmark --concurrency=1) on C7g and C6i.
Instance Size | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain
2xlarge | 0.684 | 0.581 | 17.7%
4xlarge | 2.249 | 1.738 | 29.4%
8xlarge | 3.529 | 2.709 | 30.3%
16xlarge | 4.536 | 3.446 | 31.6%
Table 5. ClickHouse throughput performance results (single connection) on C7g vs C6i

Figure 5. ClickHouse throughput performance gain (single connection) for C7g vs. C6i instances
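For a higher-is-better metric such as QPS, the gain is naturally expressed as the relative increase over the C6i baseline. The post does not state the formula; the sketch below assumes that definition, which reproduces the "Performance gain" column of Table 5. The function name is hypothetical.

```python
def qps_gain_percent(c7g_qps, c6i_qps):
    """Relative increase in queries per second versus C6i, in percent (assumed formula)."""
    return (c7g_qps / c6i_qps - 1) * 100


# Single-connection QPS values from Table 5.
table5 = [
    ("2xlarge", 0.684, 0.581),
    ("4xlarge", 2.249, 1.738),
    ("8xlarge", 3.529, 2.709),
    ("16xlarge", 4.536, 3.446),
]
for size, c7g, c6i in table5:
    print(f"{size}: {qps_gain_percent(c7g, c6i):.1f}%")
```

Note that this is not the same number as a reduction in processing time: a 31.6% increase in QPS corresponds to roughly a 24% reduction in time per query.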
The following table shows the QPS comparison for the multi-connection scenario (clickhouse-benchmark --concurrency=N) on C7g and C6i. Note: the xlarge, 2xlarge, and 4xlarge instances cannot support multiple connections because of their memory capacity limit.
Instance Size | Concurrency | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain
8xlarge | | 4.125 | 2.968 | 39.0%
8xlarge | | 4.138 | 2.931 | 41.2%
8xlarge | | 4.182 | 2.947 | 41.9%
8xlarge | | 4.108 | 2.914 | 41.0%
16xlarge | | 5.847 | 4.003 | 46.1%
16xlarge | | 6.195 | 4.071 | 52.2%
16xlarge | | 6.329 | 4.093 | 54.6%
16xlarge | | 6.290 | 4.112 | 53.0%
Table 6. ClickHouse throughput performance results (multi connection) on C7g vs C6i

Figure 6. ClickHouse throughput performance gain (multi connection) for C7g vs. C6i instances
Conclusion
Deployed on AWS Graviton3-based C7g instances, ClickHouse reduced query latency (processing time) by up to 26% and increased single-connection throughput by up to 32%, in addition to a 20% saving on instance price. These comparisons are against equally configured C6i instances based on 3rd Generation Intel Xeon Scalable processors.
Visit the AWS Graviton3 page for customer stories on adoption of Arm-based processors. For details on how to migrate existing applications to AWS Graviton, please check this GitHub page. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at [email protected].