
Spark: ACID compliant or not



Reading Time: 4 minutes

Spark, one of the most successful projects in the Apache Software Foundation, is evolving day by day into a market leader for Big Data processing. It is an open-source cluster computing framework designed for fast computation. Its efficiency and cool features make it a preferred choice among data scientists and tech giants like Amazon, eBay, and Yahoo. Along with the many advantages and features that make it a powerful Big Data tool, there are some essential features missing from Spark. ACID transactions are one of them.

Hold on! Do we know what ACID transactions mean? We will see what ACID means for Spark and whether Spark really accommodates it or not.

Atomicity & Consistency

Atomicity states that when using the Spark DataFrame writer, either the full data should be written to the data source or nothing at all. Consistency, on the other hand, ensures that the data is always in a valid state.

As evident from the Spark documentation, when saving a DataFrame to a data source in overwrite mode, the existing data is deleted before the new data is written. So if the job fails midway, the original data is lost or corrupted and no new data is written.

Let’s understand this with an example.
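
The snippets that follow assume a SparkSession named spark, an slf4j logger LOGGER, and an output directory FILE_PATH. These names are mine, not from the original post; a minimal setup sketch could look like this:

      import org.apache.spark.api.java.function.MapFunction;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Encoders;
      import org.apache.spark.sql.SparkSession;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;

      public class AcidDemo {
          // Names assumed by the snippets below
          static final Logger LOGGER = LoggerFactory.getLogger(AcidDemo.class);
          static final String FILE_PATH = "/tmp/acid-demo-output";
          static final SparkSession spark = SparkSession.builder()
                  .appName("acid-demo")
                  .master("local[*]")
                  .getOrCreate();
      }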

      //Job-1: create 100 records and write them to FILE_PATH
      Dataset<Long> data = spark.range(100);
      data.write().csv(FILE_PATH);

In the above code snippet, we have a job which creates 100 records and writes them to the source. When we execute it, we get records consisting of the integers 0 to 99 in the FILE_PATH directory. We can count the records, which gives 100 as expected:

spark.read().csv(FILE_PATH).count()

Now, let’s create another job to do the same task, but with an exception thrown partway through.

      //Job-2: overwrite FILE_PATH, but fail in the middle of the write
      try {
          spark.range(50, 150).map((MapFunction<Long, Integer>) num -> {
              if (num > 50) {
                  // Simulate a failure partway through the write
                  throw new RuntimeException("Atomicity failed");
              }
              return Math.toIntExact(num);
          }, Encoders.INT())
          .write().mode("overwrite").csv(FILE_PATH);
      } catch (Exception e) {
          LOGGER.info("failed while OverWriteData");
          throw new RuntimeException("Oops! Atomicity failed");
      }

In the above code snippet, Job-2 overwrites the set of 100 records created by Job-1, but fails in the middle of the write operation. When we execute Job-2, a RuntimeException is thrown as expected. Now let’s go to the FILE_PATH directory and look for the records created by Job-1. They are gone: not only did Job-2 fail, it deleted the records created by Job-1 as well. This leaves our system in an inconsistent state and violates the Consistency property of transactions.

Also, the overwrite operation of writing a DataFrame to a source is two-fold. First, it deletes the existing data from the data source, and then it writes the new data into it. As seen from the above code, the existing data is successfully deleted, but no new data is written. If the operation were atomic, the state would have been rolled back to the original data in the source, but that is not what happened. Thus, the Atomicity principle of the ACID properties is violated too. And when we count the records again, we don’t get the expected output of 100.
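
As a quick check, we can re-run the count after the failed Job-2. This sketch is mine, not from the original post; depending on timing, the read either fails because the directory was emptied or returns a partial, incorrect count:

      // Verify the damage after Job-2’s failure (illustrative sketch)
      try {
          long count = spark.read().csv(FILE_PATH).count();
          LOGGER.info("records found after failed overwrite: {}", count); // not 100
      } catch (Exception e) {
          // The path may no longer contain readable data at all
          LOGGER.info("read failed after overwrite: {}", e.getMessage());
      }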

Isolation & Durability

We know that while a transaction is in progress and not yet committed, it must remain isolated from any other transaction. This is called the Isolation property. It means that writing to a data set shouldn’t impact another concurrent read/write on the same data set. Apache Spark does not have a strict notion of a commit. What I mean to say is, there are task-level commits and, finally, a job-level commit that Spark implements. But this implementation is broken due to the lack of atomicity in write operations. Hence, Spark does not offer the Isolation property either.
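
To make this concrete, here is an illustrative sketch (my own, under the same assumptions as the setup above): a reader polling FILE_PATH while an overwrite is in progress may observe an empty or partially written directory rather than either the old or the new data set.

      // Illustrative sketch: a concurrent reader can see intermediate state
      Thread reader = new Thread(() -> {
          for (int i = 0; i < 10; i++) {
              try {
                  // May report 0 rows, a partial count, or fail entirely
                  LOGGER.info("concurrent count: {}", spark.read().csv(FILE_PATH).count());
              } catch (Exception e) {
                  LOGGER.info("concurrent read failed: {}", e.getMessage());
              }
          }
      });
      reader.start();
      spark.range(1000).write().mode("overwrite").csv(FILE_PATH); // concurrent writer
      try { reader.join(); } catch (InterruptedException ignored) { }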

Finally, Durability. It is the ACID property which guarantees that transactions that have committed will survive permanently. However, since Spark doesn’t correctly implement commits, all the durability guarantees offered by the underlying storage go for a toss as well.

Conclusion

With the above discussion, we can conclude that Spark is not ACID compliant. I hope this will be helpful for you.

You can find the source code here.

As discussed above, Spark doesn’t provide some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions. Therefore, Spark welcomes a solution to the problem by working with Delta Lake. Delta Lake acts as an intermediary layer between Apache Spark and the storage system. Instead of interacting with the storage layer directly, our programs talk to Delta Lake for reading and writing data. Thus, Delta Lake takes on the responsibility of complying with the ACID properties. Explore more here >>
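
For a flavor of what this looks like, here is a sketch (mine, assuming the delta-core dependency and any required Spark session configuration are in place): the only change on the write path is the format, and a failed overwrite leaves the previously committed version intact.

      // Sketch: the same write through Delta Lake instead of plain CSV
      Dataset<Long> data = spark.range(100);
      data.write().format("delta").mode("overwrite").save(FILE_PATH);
      // If a later overwrite fails midway, Delta’s transaction log is never
      // committed, so readers keep seeing the last committed version:
      spark.read().format("delta").load(FILE_PATH).count(); // still 100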

Happy Blogging !!!

