Databricks Open Sources Delta Lake to Make Data Lakes More Reliable


Databricks recently announced the open sourcing of Delta Lake, their previously proprietary storage layer that brings ACID transactions to Apache Spark and big data workloads. Databricks is the company founded by the creators of Apache Spark, and Delta Lake is already in use at companies such as McGraw Hill, McAfee, Upwork and Booz Allen Hamilton.

Delta Lake addresses the heterogeneous data problem that data lakes often have. When data is ingested from multiple pipelines, engineers have to enforce data integrity manually across all the data sources. Delta Lake brings ACID transactions to the data lake, with the strongest isolation level applied, serializability.
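As a minimal sketch of what a transactional write looks like from Spark (the table path and column names below are illustrative, and a Spark session with the Delta Lake package on the classpath is assumed):

    from pyspark.sql import SparkSession

    # Assumes the Delta Lake package is available, e.g. a session started with
    # --packages io.delta:delta-core_2.12:<version>; the path is a placeholder.
    spark = SparkSession.builder.appName("delta-acid-sketch").getOrCreate()

    # Writing a DataFrame as a Delta table is an atomic commit: readers see
    # either all of the write or none of it.
    events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
    events.write.format("delta").save("/data/delta/events")

    # A later append from another job is serialized through the Delta transaction log.
    spark.range(1000, 2000).withColumnRenamed("id", "event_id") \
        .write.format("delta").mode("append").save("/data/delta/events")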

Delta Lake provides time travel: every version of a file can be fetched as it existed at a point in time, a feature quite useful for GDPR and other audit-related requests. Metadata for files is stored using the exact same process as the data, enabling the same level of processing and feature richness.
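For illustration, reading an older snapshot of the hypothetical table above only requires an extra read option; the version number and timestamp shown are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table as of an earlier version number...
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/delta/events")

    # ...or as of a point in time, e.g. to answer an audit or GDPR request.
    past = spark.read.format("delta") \
        .option("timestampAsOf", "2019-04-01") \
        .load("/data/delta/events")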

Delta Lake provides schema enforcement capabilities. Data types and the presence of fields can be checked and enforced, making sure that the data stays clean. Intentional schema changes, on the other hand, don't require DDL and can be applied automatically.
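A sketch of both behaviours on the same hypothetical table: an append with a mismatched schema is rejected, while an intentional schema change is merged in through a write option rather than a DDL statement.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # An append whose schema does not match the table is rejected by Delta Lake,
    # which keeps malformed records out of the lake.
    bad = spark.range(5).withColumnRenamed("id", "some_other_column")
    # bad.write.format("delta").mode("append").save("/data/delta/events")  # raises an analysis error

    # An intentional change (here, an extra column) is evolved automatically.
    extended = spark.range(5).withColumnRenamed("id", "event_id") \
        .withColumn("source", lit("backfill"))
    extended.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/data/delta/events")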

Delta Lake is deployed on top of an existing data lake, is compatible with both batch and streaming data, and can be plugged into an existing Spark job as a new data source. Data is stored in the familiar Apache Parquet format.
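As a sketch, the same table path can serve as the sink of a Structured Streaming job while batch jobs keep querying it; the input directory, schema and checkpoint location below are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([StructField("event_id", LongType())])

    # Continuously ingest JSON files into the same Delta table used by batch jobs.
    query = (spark.readStream.format("json").schema(schema).load("/data/incoming")
             .writeStream.format("delta")
             .option("checkpointLocation", "/data/checkpoints/events")
             .start("/data/delta/events"))

    # Batch readers see consistent snapshots of the table while the stream runs.
    spark.read.format("delta").load("/data/delta/events").count()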

Delta Lake is also compatible with MLflow, Databricks' newest open source platform, which was launched last year. The code is available on GitHub.
