
Install Apache Spark on Ubuntu 20.04

source link: https://www.vultr.com/docs/install-apache-spark-on-ubuntu-20-04

Introduction

Apache Spark is an open-source, general-purpose, multi-language analytics engine for large-scale data processing. It runs on a single node or across a cluster, using the RAM of the cluster nodes to perform fast queries on large amounts of data. It offers batch data processing and real-time streaming, with support for high-level APIs in languages such as Python, SQL, Scala, Java, and R. Its in-memory design allows it to keep queries and data directly in the main memory of the cluster nodes.

This article explains how to install Apache Spark on an Ubuntu 20.04 server.

Prerequisites

A server running Ubuntu 20.04 and a non-root user with sudo privileges.

1. Install Java

Update system packages.

$ sudo apt update

Install Java.

$ sudo apt install default-jdk -y

Verify Java installation.

$ java -version
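Spark 3.x requires Java 8 or later. If you want to check the version in a script, the banner printed by `java -version` can be parsed. The snippet below is a minimal sketch that uses a sample banner string in place of real output, since the exact format varies by vendor and version:

```shell
# Sample banner standing in for real `java -version 2>&1` output.
banner='openjdk version "11.0.20" 2023-07-18'

# Extract the quoted version string (the second quote-delimited field).
ver=$(printf '%s\n' "$banner" | awk -F '"' '/version/ {print $2}')
echo "$ver"   # 11.0.20
```

On a live system, replace the sample banner with `java -version 2>&1` itself.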

2. Install Apache Spark

Install required packages.

$ sudo apt install curl mlocate git scala -y

Download Apache Spark. This guide uses version 3.2.0; you can find the latest release on the downloads page.

$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
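If you prefer to parameterize the release instead of hard-coding it, a small sketch like the following builds the same download URL from version variables (the values shown match the tarball used in this guide):

```shell
# Version variables make it easy to swap in a different release later.
SPARK_VERSION="3.2.0"
HADOOP_VERSION="3.2"
TARBALL="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"

# Assemble the archive URL used by the curl command above.
echo "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${TARBALL}"
```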

Extract the Spark tarball.

$ sudo tar xvf spark-3.2.0-bin-hadoop3.2.tgz

Create an installation directory /opt/spark.

$ sudo mkdir /opt/spark

Move the extracted files to the installation directory.

$ sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark

Change the permission of the directory.

$ sudo chmod -R 777 /opt/spark

Edit the bashrc configuration file to add Apache Spark installation directory to the system path.

$ nano ~/.bashrc

Add the code below at the end of the file, save and exit the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes to the current session.

$ source ~/.bashrc
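To confirm the new variables are active in the current shell, a quick check like the one below can help; the exports are repeated here only so the snippet is self-contained:

```shell
# Same two lines added to ~/.bashrc above.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Verify that the Spark bin directory is on the PATH.
case ":$PATH:" in
    *":$SPARK_HOME/bin:"*) echo "Spark is on the PATH" ;;
    *)                     echo "Spark is NOT on the PATH" ;;
esac
```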

Start the standalone master server.

$ start-master.sh

Find the Spark master URL by visiting http://ServerIPaddress:8080 in a browser; it appears as the URL value on the dashboard. It might look like this:

URL: spark://my-server-development:7077
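When the master runs with default settings, this URL follows the pattern spark://&lt;hostname&gt;:7077, so it can also be assembled from the machine's hostname (a sketch, assuming the default master port):

```shell
# Build the standalone master URL from the local hostname,
# assuming the default Spark master port 7077.
MASTER_URL="spark://$(hostname):7077"
echo "$MASTER_URL"
```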

Start the Apache Spark worker process. In Spark 3.1 and later the script is named start-worker.sh (start-slave.sh remains as a deprecated alias). Replace spark://ubuntu:7077 with your master URL.

$ start-worker.sh spark://ubuntu:7077
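Once the worker has registered, you can exercise the cluster by submitting the SparkPi example that ships with Spark. The function below is a sketch: the default master URL is a placeholder, and the jar path assumes the /opt/spark layout installed above.

```shell
# Sketch: submit the bundled SparkPi example to the standalone cluster.
# The default master URL is a placeholder; pass your real one as $1.
run_spark_pi() {
    local master_url="${1:-spark://ubuntu:7077}"
    /opt/spark/bin/spark-submit \
        --master "$master_url" \
        --class org.apache.spark.examples.SparkPi \
        /opt/spark/examples/jars/spark-examples_*.jar 10
}
```

Call it with your actual master URL, for example `run_spark_pi spark://my-server-development:7077`; the driver output includes an approximation of Pi.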

3. Access Apache Spark Web Interface

To access the Apache Spark web interface, type http://ServerIPaddress:8080 in your browser's address bar. For example:

http://192.0.2.10:8080

Conclusion

You have installed Apache Spark on your server. You can now access the main dashboard and begin managing your cluster.

More Information

For more information about Apache Spark, please see the official documentation.
