
Do big data stream processing in the stream way

source link: https://www.tuicool.com/articles/hit/YfUZVb3

Reading: Years in Big Data. Months with Apache Flink. 5 Early Observations With Stream Processing: https://data-artisans.com/blog/early-observations-apache-flink

The article suggests adopting the right solution, Flink, for big data stream processing. Flink is interesting because it was built for stream processing from the start.

The broader takeaway may be to solve problems with the right solution. History, and current practice still, offer many painful attempts: handling huge, large-scale data in traditional databases, processing unstructured data in relational databases, doing graph processing with tables, doing stream processing with micro-batches, and so on. A specific problem should be handled by a solution built for that problem; such a solution can be the most efficient and convenient one.


Some good examples and points from the article:

“In reality, however, processing data with as low latency as possible has been a challenge for a long time….a customer asked me how to produce an up-to-date aggregation over a tumbling five-minute window of a growing table using Hive.”
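To make the customer's request concrete, here is a minimal, framework-free sketch of a tumbling five-minute window aggregation (plain Python, not Hive or Flink code; the event tuples and function name are illustrative assumptions). Each event is assigned to exactly one fixed-size, non-overlapping window based on its timestamp:

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # tumbling five-minute windows

def tumbling_window_sum(events, window=WINDOW_SECONDS):
    """Aggregate (timestamp, key, value) events into tumbling windows.

    Returns {(window_start, key): sum_of_values}. Each event falls into
    exactly one window: the one starting at (timestamp // window) * window.
    """
    totals = defaultdict(float)
    for ts, key, value in events:
        window_start = (ts // window) * window
        totals[(window_start, key)] += value
    return dict(totals)

events = [
    (10, "page_a", 1.0),   # window starting at t=0
    (290, "page_a", 2.0),  # still the first window
    (301, "page_a", 4.0),  # next window, starting at t=300
    (450, "page_b", 8.0),
]
print(tumbling_window_sum(events))
# {(0, 'page_a'): 3.0, (300, 'page_a'): 4.0, (300, 'page_b'): 8.0}
```

A batch system like Hive would have to re-scan the growing table on every query to produce this; a stream processor keeps the running totals as state and updates them as each event arrives.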

“the customer and business user really need: a representation of data as a stream and the ability to do in-stream complex/stateful analytics.”

“Customers and end-users wrangle with the latency gap in all kinds of interesting and expensive ways.”

“it’s refreshing to be given constructs of stream, state, time and snapshots as the building blocks of event processing rather than incomplete concepts of keys, values, and execution phases.”

“The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead–hence the proportion of the overhead increases as you try to reduce latency.”
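The fixed-overhead argument is easy to quantify. In this sketch the 100 ms per-batch overhead is an illustrative assumption, not a figure from the article: as the batch interval shrinks to chase lower latency, the overhead's share of each cycle grows.

```python
def overhead_fraction(fixed_overhead_ms, batch_interval_ms):
    """Fraction of each micro-batch cycle spent on fixed overhead
    (scheduling, task launch) rather than on processing data."""
    return fixed_overhead_ms / (fixed_overhead_ms + batch_interval_ms)

# Assumed 100 ms of fixed per-batch overhead (illustrative number only):
for interval in (10_000, 1_000, 100):
    frac = overhead_fraction(100, interval)
    print(f"batch interval {interval:>6} ms -> overhead {frac:.0%}")
# batch interval  10000 ms -> overhead 1%
# batch interval   1000 ms -> overhead 9%
# batch interval    100 ms -> overhead 50%
```

At a 100 ms batch interval, half of each cycle is overhead; a true streaming engine avoids the per-batch cost entirely by processing events as they arrive.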

“However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real-time?”

“requirements around low latency processing and complex analysis cannot be met in an inexpensive, scalable and fault-tolerant way.”

