
Data Observability and Pipelines: OpenLineage and Marquez

source link: https://mattturck.com/datakin/


There’s an inherent tension at the heart of modern data infrastructure. On the one hand, it’s becoming more mission-critical every day, as companies around the world rely on it to run their business. On the other hand, it’s more complex, and potentially brittle, than ever, an “assembly chain” involving multiple tools and repositories.

This tension has led to the emergence of DataOps as a distinct and very active segment. One particularly important area is known as “data lineage”. The idea is to monitor data pipelines and understand the journey of data through its various transformations and usages. This makes it possible to fix issues that arise along the way and to get to the root of data quality issues, and potentially fairness issues.
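As a rough illustration of the idea, lineage can be pictured as a graph that records which datasets each dataset was derived from. The sketch below is a minimal toy model, not how OpenLineage or Marquez represent lineage; the dataset names and the `trace_upstream` helper are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets it was derived from.
UPSTREAM = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "orders_enriched": ["clean_orders", "raw_customers"],
    "revenue_report": ["orders_enriched"],
}


def trace_upstream(dataset, upstream=UPSTREAM):
    """Return every dataset that feeds into `dataset`, nearest ancestors first."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


if __name__ == "__main__":
    # If revenue_report looks wrong, these are the candidates to inspect.
    print(trace_upstream("revenue_report"))
```

When a report looks wrong, walking the graph upstream in this way narrows down which earlier datasets or transformations could have introduced the problem.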

Because data lineage involves many different tools, platforms and companies, it makes sense for those different parts of the ecosystem to collaborate around standard definitions. This is the concept behind OpenLineage, a cross-industry effort involving creators and contributors from key data projects (dbt, Spark, Pandas, etc.), gathered together at the initiative of the founders of Datakin, an SF startup behind the open source data lineage project Marquez (originally started at WeWork).

At our most recent Data Driven NYC, we had the pleasure of hosting Julien Le Dem, CTO of Datakin. His talk (video below) is very approachable and educational.

In addition to co-founding Datakin, Julien is a well-known open source contributor. He is the co-author of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Arrow. Prior to Datakin, Julien was a Senior Principal Engineer at WeWork, an architect at Dremio and the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. He notes in his bio that “His French accent makes his talks particularly attractive.”

Posted on February 1, 2021. Categories: Big Data, Data Driven NYC
