Kafka to Google Big Query persistence service

Beast

Kafka to BigQuery Sink

Architecture

Consumer : Consumes messages from kafka in batches, and pushes these batches to Read & Commit queues. These queues are blocking queues, i.e, no more messages will be consumed if the queue is full. (This is configurable based on poll timeout)
BigQuery Worker : Polls messages from the read queue, and pushes them to BigQuery. If the push operation was successful, BQ worker sends an acknowledgement to the Committer.
Committer : Committer receives the acknowledgements of successful push to BigQuery from BQ Workers. All these acknowledgements are stored in a set within the committer. Committer polls the commit queue for message batches. If that batch is present in the set, i.e., the batch has been successfully pushed to BQ, then it commits the max offset for that batch, back to Kafka, and pops it from the commit queue & set.

z63uu2b.png!web

Building & Running

Prerequisite

A kafka cluster which has messages pushed in proto format, which beast can consume
should have BigQuery project which has streaming permission
create a table for the message proto
create configuration with column mapping for the above table and configure in env file
env file should be updated with bigquery, kafka, and application parameters

Run locally:

git clone https://github.com/gojekfarm/beast
export $(cat ./env/sample.properties | xargs -L1) && gradle clean runConsumer

Run with Docker

The image is available in gojektech dockerhub.

export TAG=80076c77dc8504e7c758865602aca1b05259e5d3
docker run --env-file beast.env -v ./local_dir/project-secret.json:/var/bq-secret.json -it gojektech/beast:$TAG

-v mounts local secret file project-sercret.json to the docker mentioned location, and GOOGLE_CREDENTIALS should match the same /var/bq-secret.json which is used for BQ authentication.
TAG You could update the tag if you want the latest image, the mentioned tag is tested well.

Running on Kubernetes

Create a beast deployment for a topic in kafka, which needs to be pushed to BigQuery.

Deploymet can have multiple instance of beast
A beast container consists of the following threads:
- A kafka consumer
- Multiple BQ workers
- A committer
Deployment also includes telegraf container which pushes stats metrics Follow the instructions in chart for helm deployment

BQ Setup:

Given a TestMessage proto file, you can create bigtable with schema

# create new table from schema
bq mk --table <project_name>:dataset_name.test_messages ./docs/test_messages.schema.json

# query total records
bq query --nouse_legacy_sql 'SELECT count(*) FROM `<project_name>:dataset_name.test_messages LIMIT 10'

#  update bq schema from local schema json file
bq update --format=prettyjson <project_name>:dataset_name.test_messages  booking.schema

# dump the schema of table to file
bq show --schema --format=prettyjson <project_name>:dataset_name.test_messages > test_messages.schema.json

Produce messages to Kafka

You can generate messages with TestMessage.proto with sample-kafka-producer , which pushes N messages

Running Stencil Server

run shell script ./run_descriptor_server.sh to build descriptor in build directory, and python server on :8000
stencil url can be configured to curl http://localhost:8000/messages.desc

Contribution

You could raise issues or clarify the questions
You could raise a PR for any feature/issues To run and test locally:

git clone https://github.com/gojekfarm/beast
export $(cat ./env/sample.properties | xargs -L1) && gradlew test

You could help us with documentation

Beast

Architecture

Building & Running

Prerequisite

Run locally:

Run with Docker

Running on Kubernetes

BQ Setup:

Produce messages to Kafka

Running Stencil Server

Contribution

Recommend

人脸识别不到位，兄弟秒变父与子！AI 研习社「人脸年龄识别挑战赛」火热进行中

搏击码农：一旦让我开始，我就不会停止！丨二叉树视频

GitHub - google/honggfuzz: Security oriented fuzzer with powerful analysis optio...

36氪首发 |“人造肉”创企「Future Meat Technologies」获1400万美元A轮融资，将从实验...

从营销实战来看，为什么说Costco可能会凉？

最前线 | 诺贝尔文学奖揭晓：中国作家残雪未获奖，村上春树再次陪跑

Siri从不生气

GitHub - golang/crypto: [mirror] Go supplementary cryptography libraries

这几天比较忙，没看群没逛社交，今天看到这个，我沉默了。

node.js含有%百分号时，发送get请求时浏览器地址自动编码的问题

About Joyk