Lesser-Known Tips on Apache Oozie

 4 years ago
source link: https://towardsdatascience.com/lesser-known-tips-on-apache-oozie-1e9bee9169da?gi=cd7dcdd3dc23

Tips and best practices for job scheduling using Apache Oozie

[Image. Source: Apache Oozie]

At work, I build automated data pipelines that perform ETL/ELT on millions of rows of data every day, and one of the job schedulers my team uses most is Apache Oozie. Oozie makes it easy to schedule and coordinate Hadoop jobs (such as MapReduce, Sqoop, and Hive jobs), track job progress, and recover from failures. Most importantly, Oozie is very scalable: it can run hundreds or even thousands of jobs concurrently!

I have had a few painful debugging experiences with Oozie, and I found that job scheduling can be very tricky if you don't know the mechanism behind Oozie's scheduling system, which the official documentation does not explain in much detail. In this blog post, I will demonstrate how to schedule Hadoop jobs with data dependencies using Oozie, provide solutions to problems you may run into, and explain the underlying mechanisms to help you understand how Oozie works behind the scenes.

