
Do big data stream processing in the stream way

source link: https://www.tuicool.com/articles/hit/YfUZVb3

Reading: Years in Big Data. Months with Apache Flink. 5 Early Observations With Stream Processing: https://data-artisans.com/blog/early-observations-apache-flink

The article suggests adopting the right solution, Flink, for big data stream processing. Flink is interesting because it was built for stream processing from the start.

The broader takeaway may be to solve problems with the right solution. History, and current practice still, offer many painful attempts: handling huge, large-scale data in traditional databases, processing unstructured data in relational databases, doing graph processing with tables, doing stream processing with micro-batches, and so on. A specific problem should be handled by a solution built for that problem; such a solution can be the most efficient and convenient one.


Some good examples and points from the article:

“In reality, however, processing data with as low latency as possible has been a challenge for a long time….a customer asked me how to produce an up-to-date aggregation over a tumbling five-minute window of a growing table using Hive.”
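To make the customer's request concrete, here is a minimal, framework-free sketch of a tumbling five-minute window aggregation (plain Python, not Hive or Flink code; the event tuples and function name are illustrative assumptions). Each event is assigned to exactly one fixed-size, non-overlapping window based on its timestamp:

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # tumbling five-minute windows

def tumbling_window_sum(events, window=WINDOW_SECONDS):
    """Aggregate (timestamp, key, value) events into tumbling windows.

    Returns {(window_start, key): sum_of_values}. Each event falls into
    exactly one window: the one starting at (timestamp // window) * window.
    """
    totals = defaultdict(float)
    for ts, key, value in events:
        window_start = (ts // window) * window
        totals[(window_start, key)] += value
    return dict(totals)

events = [
    (10, "page_a", 1.0),   # window starting at t=0
    (290, "page_a", 2.0),  # still the first window
    (301, "page_a", 4.0),  # next window, starting at t=300
    (450, "page_b", 8.0),
]
print(tumbling_window_sum(events))
# {(0, 'page_a'): 3.0, (300, 'page_a'): 4.0, (300, 'page_b'): 8.0}
```

A batch system like Hive would have to re-scan the growing table on every query to produce this; a stream processor keeps the running totals as state and updates them as each event arrives.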

“the customer and business user really need: a representation of data as a stream and the ability to do in-stream complex/stateful analytics.”

“Customers and end-users wrangle with the latency gap in all kinds of interesting and expensive ways.”

“it’s refreshing to be given constructs of stream, state, time and snapshots as the building blocks of event processing rather than incomplete concepts of keys, values, and execution phases.”

“The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead–hence the proportion of the overhead increases as you try to reduce latency.”
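The fixed-overhead argument is easy to quantify. In this sketch the 100 ms per-batch overhead is an illustrative assumption, not a figure from the article: as the batch interval shrinks to chase lower latency, the overhead's share of each cycle grows.

```python
def overhead_fraction(fixed_overhead_ms, batch_interval_ms):
    """Fraction of each micro-batch cycle spent on fixed overhead
    (scheduling, task launch) rather than on processing data."""
    return fixed_overhead_ms / (fixed_overhead_ms + batch_interval_ms)

# Assumed 100 ms of fixed per-batch overhead (illustrative number only):
for interval in (10_000, 1_000, 100):
    frac = overhead_fraction(100, interval)
    print(f"batch interval {interval:>6} ms -> overhead {frac:.0%}")
# batch interval  10000 ms -> overhead 1%
# batch interval   1000 ms -> overhead 9%
# batch interval    100 ms -> overhead 50%
```

At a 100 ms batch interval, half of each cycle is overhead; a true streaming engine avoids the per-batch cost entirely by processing events as they arrive.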

“However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real-time?”

“requirements around low latency processing and complex analysis cannot be met in an inexpensive, scalable and fault-tolerant way.”

