Databricks Open Sources Delta Lake to Make Data Lakes More Reliable


Databricks recently announced the open sourcing of Delta Lake, their previously proprietary storage layer that brings ACID transactions to Apache Spark and big data workloads. Databricks is the company founded by the creators of Apache Spark, and Delta Lake is already in use at companies such as McGraw Hill, McAfee, Upwork and Booz Allen Hamilton.

Delta Lake addresses the heterogeneous data problem that data lakes often have. When data is ingested from multiple pipelines, engineers have to enforce data integrity manually across all the data sources. Delta Lake brings ACID transactions to the data lake, with the strongest isolation level applied, serializability.
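As a minimal sketch of what a transactional write looks like from Spark (the table path and column names below are illustrative, and a Spark session with the Delta Lake package on the classpath is assumed):

    from pyspark.sql import SparkSession

    # Assumes the Delta Lake package is available, e.g. a session started with
    # --packages io.delta:delta-core_2.12:<version>; the path is a placeholder.
    spark = SparkSession.builder.appName("delta-acid-sketch").getOrCreate()

    # Writing a DataFrame as a Delta table is an atomic commit: readers see
    # either all of the write or none of it.
    events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
    events.write.format("delta").save("/data/delta/events")

    # A later append from another job is serialized through the Delta transaction log.
    spark.range(1000, 2000).withColumnRenamed("id", "event_id") \
        .write.format("delta").mode("append").save("/data/delta/events")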

Delta Lake provides time travel: every version of a file can be fetched as it existed at a point in time, a feature quite useful for GDPR and other audit-related requests. Metadata for files is stored using the exact same process as the data, enabling the same level of processing and feature richness.
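For illustration, reading an older snapshot of the hypothetical table above only requires an extra read option; the version number and timestamp shown are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table as of an earlier version number...
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/delta/events")

    # ...or as of a point in time, e.g. to answer an audit or GDPR request.
    past = spark.read.format("delta") \
        .option("timestampAsOf", "2019-04-01") \
        .load("/data/delta/events")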

Delta Lake provides schema enforcement capabilities. Data types and the presence of fields can be checked and enforced, making sure that the data stays clean. Intentional schema changes, on the other hand, don't require DDL and can be applied automatically.
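A sketch of both behaviours on the same hypothetical table: an append with a mismatched schema is rejected, while an intentional schema change is merged in through a write option rather than a DDL statement.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # An append whose schema does not match the table is rejected by Delta Lake,
    # which keeps malformed records out of the lake.
    bad = spark.range(5).withColumnRenamed("id", "some_other_column")
    # bad.write.format("delta").mode("append").save("/data/delta/events")  # raises an analysis error

    # An intentional change (here, an extra column) is evolved automatically.
    extended = spark.range(5).withColumnRenamed("id", "event_id") \
        .withColumn("source", lit("backfill"))
    extended.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/data/delta/events")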

Delta Lake is deployed on top of an existing data lake, is compatible with both batch and streaming data, and can be plugged into an existing Spark job as a new data source. Data is stored in the familiar Apache Parquet format.
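As a sketch, the same table path can serve as the sink of a Structured Streaming job while batch jobs keep querying it; the input directory, schema and checkpoint location below are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([StructField("event_id", LongType())])

    # Continuously ingest JSON files into the same Delta table used by batch jobs.
    query = (spark.readStream.format("json").schema(schema).load("/data/incoming")
             .writeStream.format("delta")
             .option("checkpointLocation", "/data/checkpoints/events")
             .start("/data/delta/events"))

    # Batch readers see consistent snapshots of the table while the stream runs.
    spark.read.format("delta").load("/data/delta/events").count()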

Delta Lake is also compatible with MLflow, Databricks' newest open source platform, which was launched last year. The code is available on GitHub.
