
Real-Time Processing of Data using Kafka and Spark


Reading Time: 3 minutes

Before starting, you should know about Kafka, Spark, and what real-time processing of data is, so let's go through a brief introduction to each.

Real-Time Processing – Processing data as soon as it appears, instead of first storing the data and then processing it, or processing data that is stored somewhere else.

Kafka – Kafka is a high-throughput messaging system for moving data from one end to another. It uses the concept of producers and consumers for producing and consuming data: a producer sends data to topics on a Kafka cluster, and a consumer subscribes to those topics to read the data. You can read more about Kafka here.
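To make the producer/consumer idea concrete, here is a minimal sketch of a Kafka producer in Scala. It assumes a broker running locally on localhost:9092 and uses the topic mytopic that appears later in this post; the object name and message text are just placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducer extends App {
  // Assumes a Kafka broker is running locally on the default port 9092
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // The producer writes to the topic "mytopic"; a consumer (here, Spark Streaming)
  // subscribes to the same topic to read the data.
  producer.send(new ProducerRecord[String, String]("mytopic", "hello from the producer"))
  producer.close()
}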

Spark – Spark is an open-source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing, then Spark is the way to go. You can read more about Spark here.

Spark Streaming – Spark Streaming is an extension of the core Spark API that enables processing of live data streams. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

[Figure: Spark Streaming architecture (streaming-arch.png)]

A Quick Example –  

For ingesting data from Kafka, you will have to add the corresponding dependencies.

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"
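For completeness, here is a minimal build.sbt sketch these lines could live in. The project name and Scala version are assumptions; the Scala version must match the _2.11 suffix of the artifacts, and spark-streaming itself is listed explicitly because the snippets below use the core streaming API.

// build.sbt – a minimal sketch, not from the original post
name := "kafka-spark-streaming-example"

scalaVersion := "2.11.8" // must match the _2.11 artifact suffix

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.1",
  "org.apache.spark" % "spark-streaming_2.11" % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"
)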

First, we import the names of the Spark Streaming classes and some implicit conversions from StreamingContext into our environment in order to add useful methods to other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 10 seconds.

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[2]").setAppName("simpleApp")
val ssc = new StreamingContext(conf, Seconds(10))

Now we will give the topic name and assign the number of receiver threads to it, for example –

val topic = Map("mytopic"->2)

So here the topic name is mytopic and the number of threads assigned to it is 2.

The KafkaUtils API is used to connect the Kafka cluster to Spark Streaming. This API provides the key method createStream, whose signature is defined below.

def createStream(
ssc: StreamingContext,
zkQuorum: String,
groupId: String,
topics: Map[String, Int],
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[(String, String)]

This method is used to create an input stream that pulls messages from Kafka brokers.

ssc − StreamingContext object.

zkQuorum − Zookeeper quorum.

groupId − The group id for this consumer.

topics − Map of (topic name → number of threads) to consume.

storageLevel − Storage level to use for storing the received objects.

Now we will create a DStream using KafkaUtils:

val lines = KafkaUtils.createStream(ssc , "localhost:2181","defaultId",topic).map(_._2)

Here we get a DStream called lines, and we can perform transformations and actions on it. For example, a word count:

lines.flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_ + _).print
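DStreams also support windowed operations. As a variation that is not in the original example, the same count can be computed over a sliding window; the 30-second window and 10-second slide are illustrative choices, and both must be multiples of the 10-second batch interval.

// Word counts over the last 30 seconds, recomputed every 10 seconds
lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
  .print()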

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform when it is started, and no real processing has started yet. To start the processing after all the transformations have been set up, we finally call:

ssc.start()
ssc.awaitTermination()
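Putting the snippets together, here is a sketch of the complete program. The object name SimpleApp is an assumption; ZooKeeper is expected at localhost:2181 and the consumer group id defaultId is used, as in the createStream call above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Local mode with two threads: one for the Kafka receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("simpleApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Topic "mytopic" consumed with 2 receiver threads
    val topic = Map("mytopic" -> 2)

    // Receiver-based stream: ZooKeeper quorum, consumer group id, topic map
    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "defaultId", topic).map(_._2)

    // Word count over each 10-second batch
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

To try it out, publish a few lines to mytopic with the kafka-console-producer shipped with Kafka and watch the counts printed for each batch.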

You can find the whole example here.
