Launch HN: Dataform (YC W18) – Build Reliable SQL Data Pipelines as a Team

Hi HN!

We’re Guillaume and Lewis, founders of Dataform, and we're excited (and nervous) to be posting this on HN.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery, Amazon Redshift or Snowflake. With our open source framework and our web app, analysts can develop and schedule reliable pipelines that turn raw data into the datasets they need for analytics.

Before starting Dataform, we managed engineering teams in AdSense and led product analytics for publisher ads. We relied heavily on data (and data pipelines!) to generate insights, drive better decisions and build better products. Companies like Google invest a lot in internal data tools that let analysts manage data and build data pipelines: in 5 minutes I could define a new dataset in SQL, have it updated every day and use it in my reports.

Most businesses today are centralising their raw data into cloud data warehouses but lack the tools to manage it efficiently. Pipelines are run manually or via custom scripts that break often, or the company invests engineering resources to set up, maintain and debug a framework like Airflow. But that only solves scheduling, and the technical bar is often too high for analysts to contribute.

We saw a need for a self-service solution for data teams to manage data efficiently, so that analysts can own the entire workflow from raw data to analytics. We built Dataform with two core principles in mind:

1. Bring engineering best practices to data management. In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts. You can write tests against your raw and transformed data to ensure data quality across your analytics (there's a short example sketch after this list). Lastly, our development environment makes these practices easy to adopt: analysts can work with version control, code review and sandboxed environments.

2. Let data teams focus on data, not infrastructure. We want to offer a better, faster and cheaper alternative to what businesses have to build and maintain in-house today. Our web app comes with a collaborative SQL editor where teams develop and push their changes to GitHub. You can then orchestrate your data pipelines without having to maintain any infrastructure.
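To make the first point concrete, here's a rough sketch of what a single dataset definition can look like in a SQLX file (say, definitions/daily_orders.sqlx). The table, column and file names here are made up for illustration; the docs linked below cover the exact API.

    config {
      type: "incremental",
      description: "Daily order counts per customer",
      assertions: {
        nonNull: ["customer_id"],
        uniqueKey: ["customer_id", "order_date"]
      }
    }

    select
      customer_id,
      date(order_timestamp) as order_date,
      count(*) as order_count
    -- ref() declares a dependency, so this table is rebuilt after raw_orders
    from ${ref("raw_orders")}
    -- on incremental runs, only process rows newer than what's already in the table
    ${when(incremental(),
      `where date(order_timestamp) > (select max(order_date) from ${self()})`
    )}
    group by 1, 2

The assertions become data quality tests that run alongside the pipeline, and because the file is just SQL plus a little templating, it fits naturally into version control and code review.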

Here's a short video demo where we develop two new datasets, push the code to GitHub and schedule their execution, all in under 5 minutes.

https://www.youtube.com/watch?v=axDKf0_FhYU

You can sign up at https://dataform.co. If you're curious how it works, here are the docs: https://docs.dataform.co and our open source framework: https://github.com/dataform-co/dataform

We would love to hear your feedback and answer any questions you might have!

Lewis and Guillaume

