
Finding Needle in Haystack with Apache Spark


TL;DR: Customer churn is a real problem for businesses, and predicting which users are likely to churn can be difficult with ever-growing (Big) data. Apache Spark lets data scientists do data cleaning, modelling and prediction at scale, right on the Big Data, without fuss.

Most data scientists know that working with data is not always straightforward. Processes like cleaning, imputing missing values, feature engineering, modelling and predicting can each be a big beast of their own, even when the data is small enough to fit in a laptop’s memory. Things can easily get more complicated when the data is much bigger than that. One very common way to tackle this problem is to put the data in a SQL or NoSQL database, do most of the wrangling/cleaning there, and then move the summarised data to a local workstation for modelling.

However, there are times when data scientists need to feed a lot of data into their models and train and predict on the big data itself. That is not easy with conventional libraries like Python Pandas, scikit-learn or R dplyr, since we only have a limited amount of memory to fit the data into.

Apache Spark is one of the biggest stars in the Big Data ecosystem. It lets data scientists work with familiar tools while Spark does the heavy lifting like parallelisation and task scaling. It provides tools like Spark DataFrames, which are similar to R data frames or Pandas DataFrames. If you prefer traditional SQL, you can use SQL to wrangle the data rather than the DataFrame API. Spark supports many machine learning algorithms out of the box via the MLlib library, supports streaming for data engineers via Spark Streaming, and natively supports graph processing via GraphX. Spark is a kind of Swiss army knife for Data Scientists/Engineers dealing with Big Data.
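To make the DataFrame-vs-SQL point concrete, here is a small, self-contained toy example; the `events` DataFrame and its contents are made up purely for illustration and are not part of the Sparkify analysis below.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# A tiny, made-up events DataFrame just to illustrate the two styles
events = spark.createDataFrame(
    [("30", "Rockpools"), ("30", "Harder Better Faster Stronger"), ("9", "Canada")],
    ["userId", "song"],
)

# DataFrame API
events.groupBy("userId").agg(F.count("song").alias("songs_played")).show()

# Equivalent Spark SQL
events.createOrReplaceTempView("events")
spark.sql("SELECT userId, COUNT(song) AS songs_played FROM events GROUP BY userId").show()

Both versions produce the same result; which one you use is mostly a matter of taste.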

In this post we’ll work on a business case that is a rather common task for many Data Scientists, as it has a very direct impact on marketing and strategy efforts: we will try to predict which users of a music streaming platform are likely to churn. Efforts to keep users engaged, for example via promotions or discounts targeted at users who are about to churn, rely on good predictions.

In this example we use a small subset (128MB) of the full dataset (12GB). As I’m working on a local cluster (meaning just my laptop rather than a set of servers), I will run the analysis on the small dataset, but everything we explore below holds for bigger data: nothing is fundamentally different, Spark simply handles the parallelism in most cases.

The data we use comes from the Udacity Data Scientist Nanodegree Program’s Apache Spark Capstone Project.

Import the necessary libraries. We will import some of them again later for clarity.
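The exact import list from the original notebook isn’t shown here, but a minimal set for this walkthrough would look something like this:

# Spark entry point and SQL functions (we import functions again later for clarity)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Plotting libraries used for the missing-value heatmaps further down
import matplotlib.pyplot as plt
import seaborn as sns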

Spark needs something called a Spark Session, which is the driver between your code and the master node. Let’s create one if there isn’t one already, or get the existing one if there is.

Creating/getting Spark session. Note that our cluster is a local one.
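A minimal sketch of that call, assuming a local master and an arbitrary app name:

# getOrCreate() returns the existing session if one is running, otherwise it creates a new one
spark = (SparkSession.builder
         .master("local[*]")          # local "cluster": all cores of this laptop
         .appName("sparkify-churn")   # hypothetical app name
         .getOrCreate())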

Loading and Cleaning the Sparkify Dataset

We import the user activity dataset mini_sparkify_event_data.json into Spark and then clean some missing values.

Read the data into Spark and do a few simple checks on it
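Something along these lines, assuming the JSON file sits in the working directory:

# Load the event log; Spark infers the schema from the JSON records
data = spark.read.json("mini_sparkify_event_data.json")

# A couple of quick sanity checks
data.printSchema()
print(data.count())   # number of event rows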

Let’s have a glimpse at the first few rows to get a feel for our data

data.take(5)

[Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30'),
 Row(artist='Five Iron Frenzy', auth='Logged In', firstName='Micah', gender='M', itemInSession=79, lastName='Long', length=236.09424, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Canada', status=200, ts=1538352180000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9'),
 Row(artist='Adam Lambert', auth='Logged In', firstName='Colin', gender='M', itemInSession=51, lastName='Freeman', length=282.8273, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Time For Miracles', status=200, ts=1538352394000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30'),
 Row(artist='Enigma', auth='Logged In', firstName='Micah', gender='M', itemInSession=80, lastName='Long', length=262.71302, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Knocking On Forbidden Doors', status=200, ts=1538352416000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9'),
 Row(artist='Daft Punk', auth='Logged In', firstName='Colin', gender='M', itemInSession=52, lastName='Freeman', length=223.60771, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Harder Better Faster Stronger', status=200, ts=1538352676000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')]

Each row identifies a user activity event: the current artist, song, sessionId, userId, the device the user is listening from (userAgent), a unix timestamp, gender, whether the user is logged in and on the paid tier, and other user details.

Data Cleaning

We should check for NULL values in our dataset. Based on the structure, we may omit some columns or impute values where necessary for our analysis. Let’s have a look at the missing values both statistically and visually.

Checking Null and NaN values in our dataset
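One way to compute those counts, shown here as a sketch rather than the exact notebook code, is a conditional count per column (the NaN check only makes sense for numeric columns, hence the type test):

from pyspark.sql import functions as F

# Count NULLs (and NaNs for numeric columns) per column
null_counts = data.select([
    F.count(F.when(F.col(c).isNull() | F.isnan(F.col(c)), c)).alias(c)
    if t in ("double", "float")
    else F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c, t in data.dtypes
])
null_counts.show()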

We have plenty of NULL values, particularly in certain columns such as artist, length or song.

We also observe that there seems to be a correlation in the NULL counts among groups of columns: artist, length and song all have the same number of NULL values, and the same is true for firstName, gender, lastName, location, registration and userAgent.

None of the other columns have any missing values.

Let’s also check the missing values visually to see whether our claim of correlation is supported. If it is true, not only should the NULL counts be the same, but the row indices where the NULL values appear should be the same too.

Since PySpark does not have a visualisation library, we sample the Spark DataFrame, convert it to a pandas DataFrame, and visualise it there using Seaborn.

# Sample roughly 10% of rows (without replacement, seed 42), convert to pandas, plot missing values
plt.figure(figsize=(12, 12))
sns.heatmap(data.sample(False, 0.1, 42).toPandas().isnull())


Using a Python Seaborn heatmap we can visualise the NULL values in the dataset

The heatmap plot above supports our claim: the columns are correlated both in the number of missing values and in where they occur.

One last thing I’d like to check: when firstName and the other fields in its group are NULL, are artist and the fields in its group also NULL?

# Keep only rows where firstName is NULL and check which other columns are NULL there
sns.heatmap(data.filter(data.firstName.isNull()).toPandas().isnull())


Fields like `artist`, `location`, etc. are all null when `firstName` is null

Yes, they are. So not only is there correlation within each group where the NULL counts are equal, the NULL pattern across groups is correlated as well. In plain English: if firstName is NULL then gender is also NULL, since their NULL counts are equal, and the plot above shows that the artist column will be NULL too.

