A Basic understanding of Kafka Connect

Let us discuss Kafka Connect and some of its fundamentals. Before starting, you should have a basic knowledge of Kafka; if you do not, you can go through this document first.

Apache Kafka is a distributed, resilient, fault-tolerant platform and a well-known name in the world of Big Data. It is one of the most widely used distributed streaming platforms. Kafka is not just a messaging queue but a full-fledged event streaming platform.

It is a framework for storing, reading, and analyzing streaming data. It is a durable, publish-subscribe messaging system for exchanging data between processes, applications, and servers.

Table of contents

What is Kafka Connect?
Architecture of Kafka Connect
Connectors and Tasks
Sources and Sinks
Workers
Standalone vs Distributed Mode
Features
Alternatives
Conclusion

What is Kafka Connect?

Apache Kafka is a distributed streaming platform, and Kafka Connect is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems, using so-called connectors. Kafka Connect is used only to copy streamed data between Kafka and those systems, so its scope is deliberately narrow. It can run as a single standalone process for development and testing, or as a distributed, scalable service supporting an entire organization.

Kafka Connect makes it much easier to connect Kafka to other systems, without having to write all the glue code yourself.

Common Kafka use cases:

Source -> Kafka: Producer API, or Kafka Connect Source built on top of it
Kafka <-> Kafka: Consumer API and Producer API, or Kafka Streams built on top of them
Kafka -> Sink: Consumer API, or Kafka Connect Sink built on top of it

Architecture of Kafka Connect

[Figure: Kafka Connect architecture diagram. Source: "AMQ Streams With Kafka Connect on OpenShift", DZone Big Data]

Let's discuss the architecture diagram above:

Kafka Connect runs as its own cluster, separate from the Kafka brokers.
Each worker runs one or many connector tasks.
A cluster can have multiple workers, and workers run only within the cluster.
Tasks are automatically rebalanced across the workers if there is a failure, as shown in the diagram.
Tasks in Kafka Connect act as producers or consumers, depending on the type of connector.
A Kafka Connect cluster can have multiple connectors loaded.

Connectors and Tasks

Connectors are responsible for managing the tasks that will run. They decide how the data will be split among tasks and provide each task with the specific configuration it needs to do its job.

Tasks are responsible for getting data into and out of Kafka. They get their context from the worker. Once initialized, they are started with a set of properties containing the connector's configuration. Once started, a source task polls the external system and returns a list of records, which the worker then sends to a Kafka broker.
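To make the split between connector and tasks concrete, here is a minimal sketch in Python. It is a toy model, not the Kafka Connect API: in the real Java API the corresponding hook is Connector.taskConfigs(int maxTasks), which returns one configuration map per task. The function name task_configs, the "tables" key, and the table names are assumptions made up for illustration.

```python
from typing import Dict, List


def task_configs(tables: List[str], max_tasks: int) -> List[Dict[str, str]]:
    """Split a list of tables into at most max_tasks task configurations."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more groups than there are tables to hand out.
    num_groups = min(max_tasks, len(tables))
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    # Deal the tables out round-robin so the groups stay roughly even.
    for index, table in enumerate(tables):
        groups[index % num_groups].append(table)
    # Each task receives its own configuration (hypothetical "tables" key).
    return [{"tables": ",".join(group)} for group in groups]


configs = task_configs(["orders", "customers", "payments"], max_tasks=2)
print(configs)
```

With three tables and a limit of two tasks, one task gets two tables and the other gets one. A real connector may split work on any other basis, such as files or partitions.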

Sources and Sinks

Kafka Connect focuses on streaming data to and from Kafka. Depending on the direction in which the data moves, a connector is classified as:

Source connector – Ingests entire databases and streams table updates to Kafka topics. A source connector can also collect metrics from all your application servers and store these in Kafka topics, making the data available for stream processing with low latency.
Sink connector – Delivers data from Kafka topics into secondary indexes such as Elasticsearch, or batch systems such as Hadoop for offline analysis.
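The two directions can be sketched with a small in-memory model. Nothing here talks to Kafka: the topics dictionary stands in for Kafka topics, and run_source, run_sink, the sample rows, and the search_index dictionary are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Stand-in for Kafka: topic name -> list of records (in memory only).
topics: Dict[str, List[dict]] = defaultdict(list)


def run_source(table_rows: List[dict], topic: str) -> None:
    """Source direction: copy rows from an external system into a topic."""
    for row in table_rows:
        topics[topic].append(row)


def run_sink(topic: str, search_index: Dict[int, dict]) -> None:
    """Sink direction: copy records from a topic into an external system."""
    for record in topics[topic]:
        search_index[record["id"]] = record


database_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
search_index: Dict[int, dict] = {}

run_source(database_rows, "customers")  # external system -> Kafka
run_sink("customers", search_index)     # Kafka -> external system
print(sorted(search_index))             # ids that reached the sink system
```

The point of the sketch is only the direction of flow: a source writes into a topic, a sink reads out of one, and the topic in the middle decouples the two systems.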

Workers

Tasks are executed by Kafka Connect workers:

A worker is a single Java process.
Workers run connectors (each connector is a class inside a JAR file).
A worker can run in standalone mode or distributed mode.
If a worker crashes, a rebalance occurs (the heartbeat mechanism of the Kafka consumer protocol is used here).
If a worker joins a Connect cluster, the other workers notice it, and connectors or tasks are assigned to the new worker in order to balance the cluster. To join a cluster, a worker must have the same group.id property as the other workers.
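The effect of a rebalance can be illustrated with a toy assignment function. This is a simplified sketch, not the actual Connect protocol: it recomputes a round-robin assignment from scratch every time membership changes, whereas the real group coordination is more involved. The worker and task names are hypothetical.

```python
from typing import Dict, List


def assign(tasks: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Spread tasks over the live workers round-robin."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


tasks = ["jdbc-0", "jdbc-1", "jdbc-2", "es-0"]

before = assign(tasks, ["worker-a", "worker-b", "worker-c"])
# worker-b crashes: its tasks move to the surviving workers.
after_crash = assign(tasks, ["worker-a", "worker-c"])
# worker-d joins with the same group.id: it takes over a share of the tasks.
after_join = assign(tasks, ["worker-a", "worker-c", "worker-d"])

print(before)
print(after_crash)
print(after_join)
```

In every case all four tasks stay assigned; only their placement changes as workers leave or join.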

Standalone vs Distributed Mode

Standalone

A single process runs both connectors and tasks.
Configuration uses .properties files.
Very easy to get started with; useful for development and testing.
Not fault tolerant, not scalable, and hard to monitor.
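As a rough illustration of the .properties files used in standalone mode, the sketch below renders a worker configuration and a connector configuration as text. The keys are standard Kafka Connect settings, but the values (broker address, file paths, connector and topic names) are assumptions for a local test setup, the helper to_properties is hypothetical, and whether the FileStreamSource connector is available depends on the installation.

```python
from typing import Dict


def to_properties(settings: Dict[str, str]) -> str:
    """Render key/value settings as .properties text (no escaping of special characters)."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Worker-level settings (assumed values for a local broker).
worker_settings = {
    "bootstrap.servers": "localhost:9092",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "offset.storage.file.filename": "/tmp/connect.offsets",
}

# Connector-level settings (assumed file path and topic name).
connector_settings = {
    "name": "local-file-source",
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "connect-test",
}

print(to_properties(worker_settings))
print(to_properties(connector_settings))
```

Saved as two files, these would typically be passed to the standalone launcher shipped with Kafka, for example bin/connect-standalone.sh worker.properties connector.properties.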

Distributed

Multiple workers run connectors and tasks.
Connector configuration is submitted through a REST API.
Easy to scale and fault tolerant (rebalancing in case a worker dies).
Useful for production deployment of connectors.
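In distributed mode a connector is created by sending its configuration as JSON to a worker's REST endpoint. The sketch below only builds such a request with the Python standard library; it does not send it. The base URL assumes a worker listening on the default REST port 8083 on localhost, and the connector name, topic, and file path are hypothetical.

```python
import json
import urllib.request
from typing import Dict


def build_create_request(base_url: str, name: str, config: Dict[str, str]) -> urllib.request.Request:
    """Build a POST /connectors request that would create a connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "http://localhost:8083",  # assumed worker address
    "local-file-sink",        # hypothetical connector name
    {
        "connector.class": "FileStreamSink",
        "tasks.max": "1",
        "topics": "connect-test",
        "file": "/tmp/output.txt",
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit it to a running worker:
# urllib.request.urlopen(request)
```

Because any worker in the cluster accepts the request and the configuration is shared across the cluster, no per-worker connector files are needed.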

Features

Kafka Connect features include:

Common Framework for Kafka Connectors – standardizes how connectors are built and makes them easy to deploy.
REST Interface – we can manage connectors using a REST API.
Automatic Offset Management – Kafka Connect handles the offset commit process for us, which saves us the trouble of implementing this error-prone part of connector development manually.
Distributed and Standalone Modes – scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing, and small production deployments.
Distributed and Scalable by Default – Kafka Connect builds on Kafka's existing group management protocol, and we can scale up a Connect cluster by adding more workers.
Streaming/Batch Integration – combined with Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems.
Transformations – these allow us to make simple and lightweight modifications to individual messages.
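The idea behind transformations, a chain of small functions applied to each message on its way through, can be sketched as follows. This is a toy model in Python, loosely inspired by built-in transformations such as InsertField and MaskField; it is not their actual behavior or configuration, and the record fields and values are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Transform = Callable[[Record], Record]


def insert_field(name: str, value: object) -> Transform:
    """Return a transform that adds a static field to each record."""
    def apply(record: Record) -> Record:
        updated = dict(record)  # copy so the input record is left untouched
        updated[name] = value
        return updated
    return apply


def mask_field(name: str) -> Transform:
    """Return a transform that hides the value of one field, if present."""
    def apply(record: Record) -> Record:
        updated = dict(record)
        if name in updated:
            updated[name] = "****"
        return updated
    return apply


def apply_chain(record: Record, transforms: List[Transform]) -> Record:
    """Apply the transforms to a single record, in order."""
    for transform in transforms:
        record = transform(record)
    return record


chain = [insert_field("source", "crm"), mask_field("email")]
original = {"id": 7, "email": "ada@example.com"}
result = apply_chain(original, chain)
print(result)
```

Each step sees one message at a time and keeps no state, which is what keeps such modifications simple and lightweight; anything heavier is usually better done in a stream processing layer.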

Alternatives

If you don't want to use Kafka Connect to integrate Kafka with your other apps and databases, you can write your own code using the Producer and Consumer APIs, or use the Streams API.

Or you could even use an integration framework that supports Kafka, like Apache Camel or Spring Integration.

Conclusion

In this blog, we covered the basics of Kafka Connect, including its features, use cases, and architecture. In the next blog, we will see how to set up and launch a Kafka connector.

If you want to know more about Apache Kafka, Streams and Connect, then I recommend these articles:
