 3 years ago
source link: https://towardsdatascience.com/accelerating-spark-3-0-google-dataproc-project-with-nvidia-gpus-in-6-simple-steps-ab8c26d38957?gi=6688271d6e43
Accelerating Spark 3.0 Google DataProc Project with NVIDIA GPUs in 6 simple steps

Spark 3.0 + GPU is here, and it is a game changer.

Data Exploration is a key part of Data Science. And does it take long? Ahh. Don’t even ask. Preparing a data set for ML not only requires understanding the data set, cleaning, and creating new features, it also involves doing these steps repeatedly until we have a fine-tuned system.

As we moved towards bigger datasets, Apache Spark came as a ray of hope. It gave us a scalable, distributed in-memory system for working with Big Data. Along the way, we also saw frameworks like PyTorch and TensorFlow that inherently parallelize matrix computations across thousands of GPU cores.

But we never saw these two systems working in tandem. We continued to use Spark for Big Data ETL tasks and GPUs for matrix-intensive problems in Deep Learning.

And that is where Spark 3.0 comes in. It gives us a way to add NVIDIA GPUs to our Spark cluster nodes. The work done by these nodes can now be parallelized across both CPUs and GPUs with RAPIDS, the software platform for GPU computing.

Spark + GPU + RAPIDS = Spark 3.0
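
To make the idea concrete, here is a minimal sketch of the Spark properties that typically ask the scheduler for GPUs and switch on the RAPIDS Accelerator. The property names (`spark.plugins`, `spark.rapids.sql.enabled`, `spark.executor.resource.gpu.amount`, `spark.task.resource.gpu.amount`) come from the Spark 3.0 and RAPIDS Accelerator documentation. The helper names `gpu_spark_conf` and `to_submit_args`, and the default values, are hypothetical and only for illustration. The sketch also assumes the RAPIDS Accelerator jar is already on the cluster's classpath.

```python
from typing import Dict, List


def gpu_spark_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> Dict[str, str]:
    """Build Spark 3.0 properties that request GPUs and enable the RAPIDS plugin.

    Hypothetical helper: the defaults are illustrative, not tuned recommendations.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Load the RAPIDS Accelerator so supported SQL/DataFrame operations run on the GPU.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Number of GPUs each executor asks the cluster manager for.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks can share one GPU.
        "spark.task.resource.gpu.amount": f"{1 / tasks_per_gpu:g}",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a property dict into `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Calling `to_submit_args(gpu_spark_conf())` yields a list of arguments that could be appended to a `spark-submit` command. The same dictionary could also be passed property by property to a `SparkSession` builder. Depending on the cluster manager, a GPU discovery script may be needed as well, which this sketch leaves out.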

According to NVIDIA, early adopters of Spark 3.0 already see significantly faster performance on their current data loads. Such reductions in processing time let Data Scientists run more iterations on much bigger datasets, which helps retailers improve their forecasting, finance companies enhance their credit models, and ad tech firms better predict click-through rates.

Excited yet? So how can you start using Spark 3.0? Luckily, Google Cloud, Spark, and NVIDIA have come together and simplified the cluster creation process for us. With Dataproc on Google Cloud, we can have a fully managed Apache Spark cluster with GPUs in a few minutes.

This post is about setting up your own Dataproc Spark Cluster with NVIDIA GPUs on Google Cloud.

