
PySpark in Google Colab

Photo by Ashim D’Silva on  Unsplash

With broadening sources of data, the topic of Big Data has received an increasing amount of attention in the past few years. Besides the need to handle gigantic data of all kinds and shapes, the target turnaround time for Big Data analysis has shrunk significantly. This speed and efficiency have helped not only in the immediate analysis of Big Data but also in identifying new opportunities. This, in turn, has led to smarter business moves, more efficient operations, higher profits, and happier customers.

Apache Spark was built to analyze Big Data faster. One of the important features that Apache Spark offers is the ability to run computations in memory. It is also considered to be more efficient than MapReduce for complex applications running on disk.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.

PySpark is the interface that gives access to Spark using the Python programming language. It is an API developed in Python for writing Spark applications in a Python style, although the underlying execution model is the same for all the API languages.

In this tutorial, we will mostly deal with PySpark's machine learning library MLlib, which can be used to import the Linear Regression model or other machine learning models.

Yes, but why Google Colab?

Colab by Google is based on Jupyter Notebook, an incredibly powerful tool that leverages Google Docs features. Since it runs on Google's servers, we don't need to install anything locally, be it Spark or a deep learning model. The most attractive features of Colab are the free GPU and TPU support! Since the GPU support runs on Google's own servers, it is, in fact, faster than some commercially available GPUs such as the Nvidia 1050Ti. The general system information allocated to a user looks like the following:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 11439MB | Used: 0MB | Util  0% | Total 11439MB
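
The exact snippet used to produce such a readout is not shown here, but a minimal way to check the memory and the GPU allocated to a session (assuming a GPU runtime has been enabled) is:

# Available system memory
!cat /proc/meminfo | grep MemAvailable
# GPU model, memory, and utilization (requires a GPU runtime)
!nvidia-smi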

If you are interested in knowing more about Colab, this article by Anna Bonner points out some of its outstanding benefits.

Enough of the small talk. Let's create a simple linear regression model with PySpark in Google Colab.

To open a Colab Jupyter Notebook, click on this link.

Running PySpark in Colab

To run Spark in Colab, we first need to install all the dependencies in the Colab environment, namely Apache Spark 2.3.2 with Hadoop 2.7, Java 8, and Findspark, which locates Spark in the system. This installation can be carried out inside the Colab Jupyter Notebook itself.

Follow the steps to install the dependencies:

# Install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack Spark 2.3.2 built for Hadoop 2.7
!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
!tar xf spark-2.3.2-bin-hadoop2.7.tgz
# Install findspark, which locates the Spark installation for Python
!pip install -q findspark

Now that we have installed Spark and Java in Colab, it is time to set the environment variables that enable us to run PySpark in the Colab environment. Set the location of Java and Spark by running the following code:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.2-bin-hadoop2.7"

We can run a local Spark session to test our installation:

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
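
As an optional sanity check, we can confirm that the session is up by printing the Spark version, which should match the 2.3.2 build we installed:

# Verify that the SparkSession was created successfully
print(spark.version)   # should print 2.3.2, matching the build we downloaded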

Our Colab is ready to run PySpark. Let's build a simple Linear Regression model.

Linear Regression Model

The linear regression model is one of the oldest and most widely used machine learning approaches; it assumes a linear relationship between the dependent and independent variables. For example, a modeler might want to forecast rainfall based on the humidity ratio. In its simplest form, the model predicts the target as a weighted sum of the inputs plus an intercept, which amounts to fitting the best line through the scattered points on the graph; this best-fitting line is known as the regression line. Details about linear regression can be found here.

For our purpose of getting started with PySpark in Colab, and to keep things simple, we will use the famous Boston Housing dataset. A full description of this dataset can be found at this link.


The goal of this exercise is to predict the housing prices from the given features. Let’s predict the prices of the Boston Housing dataset by considering MEDV as the target variable and all other variables as input features.

We can download the dataset from this link and keep it somewhere accessible on our local drive. The dataset can then be uploaded into the Colab directory from the local drive using the following command:

from google.colab import files
files.upload()
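
Alternatively, if the CSV is saved in Google Drive, the Drive can be mounted and the file read directly from there; a minimal sketch, assuming the file sits at the top level of the Drive (the exact path below is hypothetical):

from google.colab import drive
drive.mount('/content/drive')
# Hypothetical path; adjust to wherever BostonHousing.csv was saved in the Drive
drive_path = '/content/drive/My Drive/BostonHousing.csv'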

We can now check the contents of the Colab directory:

!ls

We should see a file named BostonHousing.csv saved. Now that we have uploaded the dataset successfully, we can start analyzing.

For our linear regression model, we need to import the VectorAssembler and LinearRegression modules from the PySpark API. VectorAssembler is a transformer that assembles the features from multiple columns of type double into a single vector column. We would have to use a StringIndexer if any of our columns contained string values, to convert them into numeric values. Luckily, the BostonHousing dataset only contains columns of type double, so we can skip StringIndexer for now.
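
For reference, if one of the columns did contain strings, the StringIndexer step would look roughly like the following sketch, which assumes a hypothetical categorical column named 'town':

from pyspark.ml.feature import StringIndexer
# Map each distinct string in the hypothetical 'town' column to a numeric index
indexer = StringIndexer(inputCol='town', outputCol='town_index')
# dataset = indexer.fit(dataset).transform(dataset)  # not needed for BostonHousing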

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

Notice that we used inferSchema inside read.csv(). It automatically infers the data type of each column.
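
Without inferSchema, every column would be read as a string unless we declared an explicit schema ourselves. A minimal sketch of that manual alternative (showing only two of the fourteen columns) could look like this:

from pyspark.sql.types import StructType, StructField, DoubleType
# Manually declared schema instead of relying on inferSchema
manual_schema = StructType([
    StructField('crim', DoubleType(), True),
    StructField('medv', DoubleType(), True),
    # ... the remaining columns would be declared the same way
])
# dataset = spark.read.csv('BostonHousing.csv', schema=manual_schema, header=True)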

Let us print the schema of the dataset to see the data type of each column:

dataset.printSchema()

It should print the data types as follows:

[printSchema() output listing each column and its inferred data type]

In the next step, we will combine the features from all the different columns into a single vector column, which we name 'Attributes' via the outputCol parameter.

#Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat'], outputCol = 'Attributes')
output = assembler.transform(dataset)
#Input vs Output
finalized_data = output.select("Attributes","medv")
finalized_data.show()
[show() output: the assembled 'Attributes' vector column alongside the 'medv' target column]

Here, ‘Attributes’ are the input features from all the columns and ‘medv’ is the target column.

Next, we split the data into training and testing sets according to the chosen proportions (0.8 and 0.2 in this case).

#Split training and testing data
train_data,test_data = finalized_data.randomSplit([0.8,0.2])
regressor = LinearRegression(featuresCol = 'Attributes', labelCol = 'medv')
#Fit the model on the training set
regressor = regressor.fit(train_data)
#Evaluate the model on the testing set
pred = regressor.evaluate(test_data)
#Display the predictions
pred.predictions.show()

The predicted values appear in the 'prediction' column of the output:

[show() output with the 'Attributes', 'medv', and 'prediction' columns]
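
Note that evaluate() on the fitted model returns an evaluation summary, whose predictions attribute we displayed above. If we only need the predictions themselves, calling transform() on the fitted model gives the same 'prediction' column directly:

# transform() appends a 'prediction' column to the test DataFrame
predictions = regressor.transform(test_data)
predictions.select("Attributes", "medv", "prediction").show(5)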

We can also print the coefficients and intercept of the regression model by using the following commands:

#Coefficients of the regression model
coeff = regressor.coefficients
#Intercept of the regression model
intr = regressor.intercept
print("The coefficients of the model are : %a" % coeff)
print("The intercept of the model is : %f" % intr)

Once we are done with the basic linear regression operation, we can go a bit further and analyze our model statistically by importing the RegressionEvaluator module from PySpark.

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")
# Root Mean Square Error
rmse = evaluator.evaluate(pred.predictions)
print("RMSE: %.3f" % rmse)
# Mean Square Error
mse = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mse"})
print("MSE: %.3f" % mse)
# Mean Absolute Error
mae = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mae"})
print("MAE: %.3f" % mae)
# r2 - coefficient of determination
r2 = evaluator.evaluate(pred.predictions, {evaluator.metricName: "r2"})
print("r2: %.3f" % r2)

That’s it. You have created your first machine learning model using PySpark in Google Colab.

You can access the full code on GitHub here.

Please let me know if you run into any other newbie problems that I might be able to help you with. I’d love to help you if I can!

