Quickwit 0.2 brings full-text search to ClickHouse and Kafka!

source link: https://quickwit.io/blog/quickwit-0.2/

January 11, 2022

Quickwit wishes you a happy new year 2022! We are starting this new year with the release of Quickwit 0.2! Don't be mistaken by the tiny version increment. Quickwit 0.2 packs a lot of exciting features, and the point of this blog post is precisely to go through the main ones:

But before going through them one by one, here are two links for the impatient:

Without further ado, here are the features!

Ingesting Kafka natively with exactly-once semantics

From day 1, we have been designing Quickwit with two things in mind: cost-efficiency and scalability. One design choice that makes Quickwit more cost-efficient is our replication model.

Elasticsearch relies on document replication. When you push a document to Elasticsearch, that document is appended to an internal distributed queue and will eventually get indexed on all of the replicas. With the default setting of 1 primary + 1 replica, the indexing is done twice. For large datasets, this wastes a lot of CPU.

Quickwit, on the other hand, uses a different approach. It indexes your documents on a single node and leaves it to object storage to replicate the resulting index files (also called splits). We call this split replication. This approach also exists in the Lucene world, under the name of "segment replication". Mike McCandless has an excellent blog post on the subject with the pros and cons of the two approaches.

Unfortunately, split replication does not free us from the need to replicate documents. Between the ingestion of a document and its successful indexing, the single node in charge of indexing that document could catch fire. Such an event would result in the unjustifiable loss of the document.

We need to replicate documents somewhere, to consider them ingested. So did we include a distributed queue in Quickwit? Actually, we did not. Instead, we make it possible for users to plug Quickwit straight into their favorite distributed queue. Right now Quickwit only supports Kafka.

It might sound like a hassle to deploy Kafka to use Quickwit in production. Why doesn't this come with batteries included? I hear you.[1] We will eventually add a push API and embed a simple distributed queue within Quickwit to cater to light use cases.

However, for sizeable mission-critical use cases, using the Kafka source to feed Quickwit will remain the best practice. Note that using Kafka as a buffer is also generally recommended in front of Elasticsearch. You can read more about this in this blog post from Elastic.

The solutions to connect Kafka to Elasticsearch involve a connector actively pushing documents to Elasticsearch. In other words, your document is first replicated in Kafka, then replicated into Elasticsearch's internal queue, then indexed independently in several replicas.

On the other hand, Quickwit supports Kafka natively! By natively, I mean that Quickwit's indexing pipeline connects itself directly to a Kafka topic. Quickwit guarantees the exactly-once semantics by publishing splits together with the Kafka offsets in an atomic manner. If an indexing node fails, it will resume indexing the Kafka topic at the offset corresponding to the last successfully published split.
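
To make the idea concrete, here is a minimal sketch of publishing a split together with its offsets. It is not Quickwit's code: the `Metastore` class, the `index_topic` function, and the list standing in for a single Kafka partition are all hypothetical, and a real metastore would rely on a transaction rather than on two assignments in a row.

```python
from dataclasses import dataclass, field


@dataclass
class Metastore:
    """Toy in-memory metastore (hypothetical): splits and checkpoint live together."""
    splits: list = field(default_factory=list)
    checkpoint: int = 0  # next offset to index

    def publish(self, split_docs, from_offset, to_offset):
        # A publish must start exactly at the current checkpoint;
        # otherwise it would duplicate or skip messages.
        if from_offset != self.checkpoint:
            raise ValueError("checkpoint mismatch")
        # The split and the new checkpoint are recorded in the same step.
        self.splits.append(list(split_docs))
        self.checkpoint = to_offset


def index_topic(topic, metastore, batch_size, crash_at=None):
    """Index `topic` starting from the last published checkpoint."""
    offset = metastore.checkpoint
    while offset < len(topic):
        end = min(offset + batch_size, len(topic))
        if crash_at is not None and offset <= crash_at < end:
            # Simulated failure: the batch was read but never published.
            raise RuntimeError("indexer crashed before publishing")
        metastore.publish(topic[offset:end], offset, end)
        offset = end


topic = [f"doc-{i}" for i in range(10)]
metastore = Metastore()
try:
    index_topic(topic, metastore, batch_size=4, crash_at=5)
except RuntimeError:
    pass  # only the first split (offsets 0 to 3) was published
index_topic(topic, metastore, batch_size=4)  # a new indexer resumes at offset 4
indexed = [doc for split in metastore.splits for doc in split]
```

After the simulated crash and the restart, `indexed` contains every document of the topic exactly once, because the work lost in the crash was never reflected in the checkpoint.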

But maybe you are not using Kafka? Perhaps your heart belongs to Pulsar, Amazon Kinesis, Google PubSub, Microsoft EventHub, RedPanda, or Fluvio? Quickwit is not married to Kafka, either. Implementing support for another distributed queue is a relatively light process. Curious Rustaceans can have a peek at our Source trait.

Currently, only Kafka is supported. Still, if you kindly let us know which distributed queue you prefer, it will help us prioritize its support in our roadmap.

A search stream API

In Quickwit 0.2, we added a search stream API. Instead of returning a traditional search result page, Quickwit's streaming API returns an HTTP stream with the values of a given field in all the documents matching the query.

While it is not specific to ClickHouse, we implemented this feature for a company that wanted to get full-text search in ClickHouse. In that use case, Quickwit streams a primary id for ClickHouse to perform a join on. Plugging Quickwit into ClickHouse will be discussed in a future blog post. Meanwhile, if you also need to add search to your ClickHouse, we have a step-by-step tutorial for you.

Stream search is not just a superficial new API. Because hundreds of millions of documents could match a query, a search stream request could require streaming gigabytes of data.

For this reason, when running a stream search, Quickwit does not materialize the result set before streaming it. Internally, the root node dispatches the search stream request as a streaming gRPC request amongst the available leaf nodes. Every time a leaf node has computed a new chunk of ids, the data is streamed straight away to the client.
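
The shape of that data flow can be sketched with Python generators. This is only an illustration under simplifying assumptions: `leaf_search_stream` and `root_search_stream` are hypothetical names, the leaves are polled round-robin in a single thread, and the gRPC and HTTP layers are left out entirely.

```python
from typing import Iterable, Iterator, List


def leaf_search_stream(doc_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield matching ids chunk by chunk instead of building the full result set."""
    for start in range(0, len(doc_ids), chunk_size):
        yield doc_ids[start:start + chunk_size]


def root_search_stream(leaves: Iterable[Iterator[List[int]]]) -> Iterator[List[int]]:
    """Forward each chunk to the caller as soon as a leaf produces it."""
    pending = list(leaves)
    while pending:
        still_running = []
        for leaf in pending:
            chunk = next(leaf, None)  # None means this leaf is exhausted
            if chunk is not None:
                yield chunk
                still_running.append(leaf)
        pending = still_running


leaf_a = leaf_search_stream([1, 2, 3, 4, 5], chunk_size=2)
leaf_b = leaf_search_stream([10, 11, 12], chunk_size=2)
chunks = list(root_search_stream([leaf_a, leaf_b]))
```

At no point does the root hold more than one chunk per leaf, which is the property that matters when a query matches hundreds of millions of documents.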

Our piping is quite good. On queries matching many documents, stream search averages a throughput of 500MB/s on a 2-node cluster.[2] Not too shabby, eh? Remember the index is stored on Amazon S3.

PostgreSQL metastore

In addition to the index splits, Quickwit needs a place to store metadata. That metadata includes the current index schema, the list of splits, the metadata associated with each split, etc.

Quickwit 0.1 only offered to store this metadata as flat files (one file per index). The files themselves can be stored on the file system or in an object store like Amazon S3.

While it is a very appealing solution for many use cases, this solution has several shortcomings. For instance, it does not support concurrent writes (concurrent reads, on the other hand, are supported).[3] Also, readers will cache the index metadata to shave off some search latency and poll for updates periodically (typically every 30s). The data freshness is, as a result, mildly impacted.

Quickwit 0.2 offers a PostgreSQL implementation that might appeal to most businesses. It solves all of the shortcomings described above, and in addition, it provides the extra comfort of running all kinds of SQL queries over Quickwit's metadata: this can be very useful for operations.

Tag pruning

Quickwit 0.1 already offered the possibility to do time pruning. If a user query targets a specific time range, Quickwit can use the splits' metadata to avoid searching into the splits that do not overlap with the query time range.
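
As a rough illustration, time pruning boils down to an interval overlap test on each split's metadata. The `SplitMetadata` fields and the `prune_by_time` helper below are assumptions made for this sketch, not Quickwit's actual data model, and both ranges are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitMetadata:
    split_id: str
    min_timestamp: int  # smallest timestamp in the split (inclusive)
    max_timestamp: int  # largest timestamp in the split (inclusive)


def prune_by_time(splits, query_start, query_end):
    """Keep only the splits whose time range overlaps [query_start, query_end]."""
    return [
        split
        for split in splits
        if split.min_timestamp <= query_end and split.max_timestamp >= query_start
    ]


splits = [
    SplitMetadata("split-a", 0, 99),
    SplitMetadata("split-b", 100, 199),
    SplitMetadata("split-c", 200, 299),
]
selected = [s.split_id for s in prune_by_time(splits, query_start=150, query_end=220)]
```

Here `split-a` is skipped without being opened, since none of its documents can fall in the requested range.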

In Quickwit 0.2, we also make it possible to declare a field as tagged. For instance, a good candidate for such a field could be a tenant id or an application id. Quickwit will then store the set of values taken by this field in the split metadata.

Upon receiving a boolean user query, Quickwit will infer from the user query a filtering predicate upon the set of tag fields.

Let's take the example of the following query:

tenant_id=57 AND text = "hello happy taxpayer"

Let's assume that tenant_id is a tag field while text is not. All documents matching this query must also match the query tenant_id=57. Quickwit will only search into the splits with 57 within their tenant_id values.
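
The same reasoning can be sketched in a few lines. The sketch assumes the query has already been parsed into a plain conjunction of `(field, value)` clauses and that each split's metadata exposes its tags as a set of pairs; both representations, as well as the `required_tags` and `prune_by_tags` names, are hypothetical. Queries containing OR or NOT would need a more careful derivation of the predicate.

```python
def required_tags(conjunction, tag_fields):
    """Extract the clauses of an AND query that bear on tag fields."""
    return {(name, value) for name, value in conjunction if name in tag_fields}


def prune_by_tags(splits, conjunction, tag_fields):
    """Keep the splits whose stored tag values contain every required tag."""
    required = required_tags(conjunction, tag_fields)
    return [split_id for split_id, tags in splits.items() if required <= tags]


# Tag values recorded in each split's metadata (illustrative data).
splits = {
    "split-1": {("tenant_id", "57"), ("tenant_id", "12")},
    "split-2": {("tenant_id", "12")},
    "split-3": {("tenant_id", "57")},
}
query = [("tenant_id", "57"), ("text", "hello happy taxpayer")]
to_search = prune_by_tags(splits, query, tag_fields={"tenant_id"})
```

The `text` clause plays no role in pruning because `text` is not a tag field; it is only evaluated inside the splits that survive.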

A proper indexing pipeline

Finally, we entirely revamped our indexing pipeline.

By default, it now emits splits every sixty seconds. The pipeline also transparently takes care of merging splits.

Your mileage will vary depending on your hardware, your schema, and your data, but a 4-core server typically indexes at a steady rate between 20MB/s and 60MB/s.

Building an indexer that is stable, robust, and easy to reason about is a fascinating software engineering problem. It will be the subject of its own dedicated blog post.



  1. I was screaming the same thing one week ago at Christmas.
  2. 2 x c5n.2xlarge instances.
  3. This is a limitation that is well known in the data lake world. For instance, the Delta Lake documentation contains the same disclaimer.
