
No, QuestDB is not Faster than ClickHouse



Geoff Genz, Principal Engineer, ClickHouse Inc.
June 15, 2022

QuestDB created a minor stir in the small but active world of analytics database benchmarking with this blog article:   https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale.  This is the money quote:  “QuestDB is an order of magnitude faster than both TimescaleDB and ClickHouse in this specific query”

We at ClickHouse were . . . skeptical and intrigued.  We believe that when results are unexpected and surprising there is always an opportunity to learn. So, we set out to learn.

It turns out that the benchmark suite that QuestDB uses for the comparison is constructed in such a way that the ClickHouse query is doing a complete table scan on 1.2 billion rows while QuestDB utilizes its full indexing strategy to read just a tiny fraction of the actual data.  After adding just a few key indexes to the ClickHouse tables, ClickHouse dramatically outperforms QuestDB.

We’ll use the same AWS instance specifications as the QuestDB post and (starting out) load the data the same way into QuestDB and ClickHouse. For the timings below, “cold” means no data is preloaded in the Linux file system cache, and “hot” means the query is run right after a “cold” run has completed. Queries are run three times. Times for QuestDB come from the Web UI (excluding network); for ClickHouse we use clickhouse-client.

Here's the headline query:

SELECT * FROM readings WHERE velocity > 90.0 AND latitude >= 7.75 AND latitude <= 7.80 AND longitude >= 14.90 AND longitude <= 14.95

And the results from both QuestDB and ClickHouse:

QuestDB (Cold):  46.64s  46.62s   46.63s   – 46.63s average

QuestDB (Hot):   436.05ms  366.51ms  399.10ms  – 400.55ms average

(Hot times are a little higher than in the blog post, but close enough.)

ClickHouse (Cold):  69.233s  69.227s  69.222s – 69.227s average

ClickHouse (Hot):  4.776s  4.662s 4.657s – 4.698s average

So as reported in the blog post, an order of magnitude advantage for QuestDB on “hot” queries, and even some advantage on cold queries.  

(A quick aside – what is the storage cost of that extra performance? The QuestDB “readings” directory on disk is 118.685 GB. The ClickHouse “readings” directory on disk is 39.074 GB. That's almost 3x, before tuning ClickHouse compression.)
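
(If you want to check that footprint on your own setup, ClickHouse keeps per-part sizes in the system.parts table. A minimal sketch, assuming the data was loaded into benchmark.readings as in the DDL below:)

-- Sketch: compressed vs. uncompressed size of the benchmark table on disk
SELECT
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'benchmark' AND table = 'readings' AND active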

What's going on here? Why is ClickHouse struggling a bit with this query? The culprit is the table definition for our ClickHouse table as generated by the TimescaleDB benchmark suite:

CREATE TABLE benchmark.readings (
    `created_date` Date DEFAULT today(),
    `created_at` DateTime DEFAULT now(),
    `time` String,
    `tags_id` UInt32,
    `latitude` Nullable(Float64),
    `longitude` Nullable(Float64),
    `elevation` Nullable(Float64),
    `velocity` Nullable(Float64),
    `heading` Nullable(Float64),
    `grade` Nullable(Float64),
    `fuel_consumption` Nullable(Float64),
    `additional_tags` String DEFAULT ''
)
ENGINE = MergeTree(created_date, (tags_id, created_at), 8192)

The partition key is “created_date” – that means we’re creating a partition for every day.  For this amount of data we at ClickHouse would probably recommend partitioning by month, but that’s not the end of the world.
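
(To see how the existing table is actually laid out on disk, a minimal sketch against system.parts lists the active parts and rows per partition; the database and table names are the ones from the DDL above.)

-- Sketch: active parts and row counts per partition for the original table
SELECT
    partition,
    count() AS parts,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'benchmark' AND table = 'readings' AND active
GROUP BY partition
ORDER BY partition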

The primary key is the tuple “tags_id, created_at”.  What?  That’s a really strange choice.  Why are we sorting by tags_id?  Maybe it makes sense with some other set of queries, but let’s tweak how our ClickHouse table is sorted in a more realistic way.  We’ll partition by month, sort by day, and add latitude to the primary/sorting key.

CREATE TABLE benchmark.readings_lat
(
    `created_at` DateTime,
    `latitude` Nullable(Float64),
    `longitude` Nullable(Float64),
    `elevation` Nullable(Float64),
    `velocity` Nullable(Float64),
    `heading` Nullable(Float64),
    `grade` Nullable(Float64),
    `tags_id` UInt32,
    `additional_tags` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (toDate(created_at), latitude, created_at)
SETTINGS index_granularity = 8192

(Another quick aside -- this table took only 103.788 seconds to load nearly 1.2 billion rows from the other ClickHouse readings table.  ClickHouse ingest speed is really fast. At some point maybe we’ll do an ingest comparison of ClickHouse vs QuestDB and some other competitors.)
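
(The load itself is just an INSERT ... SELECT from the original table. Here is a minimal sketch of that statement, with the column order matching the readings_lat definition above; the exact statement used for the timing above may have differed slightly.)

-- Sketch: repopulate the re-sorted table from the original benchmark table
INSERT INTO benchmark.readings_lat
SELECT
    created_at,
    latitude,
    longitude,
    elevation,
    velocity,
    heading,
    grade,
    tags_id,
    additional_tags
FROM benchmark.readings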

Now that we’ve assumed ‘latitude’ (that is, the location of these artificial data points) is important enough to query on that we include it in our primary key, let’s look at the performance:

ClickHouse (cold):  1.689s  1.705s  1.699s  – 1.698s average

That’s 27x faster than QuestDB assuming your users aren’t always querying exactly the same cached data on your servers.  But if they are, let’s look at the “hot” times:

ClickHouse (hot):  0.100s  0.107s  0.109s – 0.105s average

Almost 4x faster than QuestDB with the only difference being adding “latitude” to the ClickHouse sorting/primary key.
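
(If you want to see the pruning directly, recent ClickHouse versions can report which parts and granules the primary key lets a query skip via EXPLAIN with indexes = 1. A minimal sketch; run it against both readings and readings_lat and compare the selected parts/granules reported for each.)

-- Sketch: show how much data the primary key lets ClickHouse skip for the headline query
EXPLAIN indexes = 1
SELECT *
FROM benchmark.readings_lat
WHERE velocity > 90.0
  AND latitude >= 7.75 AND latitude <= 7.80
  AND longitude >= 14.90 AND longitude <= 14.95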

To recap – QuestDB found one edge case in an obsolete benchmark suite where a narrowly tailored query outperforms ClickHouse – if ClickHouse is handicapped by having no usable indexes.  By simply adding an appropriate key/index, ClickHouse dominates QuestDB.  This performance advantage holds in almost any real-world scenario.

Whenever a surprising or unexpected benchmark is posted, we encourage you to try to reproduce it.  While we were able to reproduce portions of the results, the testing methodology itself can lead to varying outcomes.  Substantially varying outcomes.  Happy benchmarking!

