Accelerate your Exploratory Data Analysis with Pandas Profiling

Exploratory Data Analysis is tedious. Automate the process and generate detailed interactive reports with a single line of code using Pandas-Profiling

Sukanta Roy

Apr 19 · 8 min read

Photo by Lukas Blazek on Unsplash

When starting a new data science project, the first step after getting your hands on the data set is to understand it. We achieve this by performing Exploratory Data Analysis (EDA). This includes finding out the data type of each variable, the distribution of the target variable, the number of distinct values for each predictor variable, whether there are any duplicate or missing values in the data set, and so on.

If you have ever done EDA on a data set (and I assume you have, as you are reading this article), I don’t need to tell you how time-consuming this process can be. And if you have been part of many data science projects (be it in your job or through personal projects), you know how repetitive all these steps can be. But with the open source library pandas-profiling, that doesn’t have to be the case anymore.

What is Pandas-Profiling?

Photo by Juan Rumimpunu on Unsplash

Pandas-profiling is an open source library that can generate beautiful interactive reports for any data set with just a single line of code. Sounds interesting? Let’s take a look at the documentation to get a better understanding of what it does.

Pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:

Type inference: detect the types of columns in a data frame.
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, inter-quartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables (Spearman, Pearson and Kendall matrices)
Missing values: matrix, count, heatmap and dendrogram of missing values
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
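
To make the list above a little more concrete, here is a rough sketch of how a handful of these per-column statistics could be computed by hand with plain pandas. It is only an illustration: the helper name summarize_numeric and the sample values are made up, and this is not how pandas-profiling computes its report internally.

```python
import pandas as pd


def summarize_numeric(series: pd.Series) -> dict:
    """Hand-computed subset of the per-column statistics (illustrative helper)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    return {
        "type": str(series.dtype),
        "distinct": int(series.nunique()),    # missing values are not counted
        "missing": int(series.isna().sum()),
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": q3 - q1,
        "mean": series.mean(),
        "std": series.std(),                  # sample standard deviation (ddof=1)
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }


# Tiny made-up column, used only for illustration.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, None, 54.0], name="Age")
summary = summarize_numeric(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Doing this for every column, and then adding histograms, correlations and missing-value plots on top, is exactly the repetitive work the report is meant to take off your hands.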

Now that we know what pandas-profiling is all about, let’s see how to install it and use it in a Jupyter Notebook or in Google Colab in the following section.

Install Pandas-profiling:

Using pip

You can install pandas-profiling very easily using the pip package manager with the following command:

pip install pandas-profiling[notebook,html]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using Conda

If you are using conda, you can use the following command to install it:

conda install -c conda-forge pandas-profiling

Installation in Google Colab

Google Colab comes with pandas-profiling pre-installed, but unfortunately it is an older version (v1.4). If you are following this article or the GitHub documentation, the code will not run on Google Colab unless you install the latest version of the library (v2.6 at the time of writing).

To do that, you need to first uninstall the existing library and install the latest one as follows:

# To uninstall (-y skips the confirmation prompt)
!pip uninstall -y pandas-profiling

Now to install, we need to run the pip install command.

!pip install pandas-profiling[notebook,html]
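
After reinstalling, it can be worth checking which version the environment actually sees. The snippet below is a small sketch that uses only the Python standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is my own. If the old version was already imported in the session, you may also need to restart the runtime before the new one is picked up.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pandas-profiling")
if version is None:
    print("pandas-profiling is not installed")
else:
    print(f"pandas-profiling {version} is installed")
```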

Generate Reports:

Photo by Kevin Ku on Unsplash

Now that we are done with the prerequisites, let’s get into the fun part: analyzing a data set.

The data set I will be using for this example is the Titanic data set.

Load the libraries:

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

Import the data

file = cache_file("titanic.csv",
                  "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data = pd.read_csv(file)

[Figure: Loading the dataset]

Generate report:

To generate the report, run the following code in the notebook.

profile = ProfileReport(data, title="Titanic Dataset", html={'style': {'full_width': True}}, sort="None")

[Figure: Generate report]

That’s it. With a single line of code you have generated a detailed profile report. Now let us see the results by including the report in the notebook.

Include the report in Notebook as IFrame

profile.to_notebook_iframe()

This will include the interactive report as an HTML iframe in the notebook.

Saving the report

Save the report as an HTML file using the following code:

profile.to_file(output_file="your_report.html")

Or obtain the data as JSON using:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file(output_file="your_report.json")
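
Once you have the JSON string, you can parse it with Python’s built-in json module and explore it like any other dictionary. The exact layout of the JSON depends on the library version, so the sketch below makes no assumption about it and only lists the top-level keys. The helper name top_level_keys and the stand-in string are made up for illustration; with a real report you would pass profile.to_json() instead.

```python
import json


def top_level_keys(json_text: str) -> list:
    """Parse a JSON string and return its top-level keys, sorted."""
    parsed = json.loads(json_text)
    if not isinstance(parsed, dict):
        raise TypeError("expected a JSON object at the top level")
    return sorted(parsed)


# Stand-in string so the sketch runs on its own; these keys are placeholders,
# not the real structure of a pandas-profiling report.
sample_json = '{"b": 1, "a": {"c": 2}}'
keys = top_level_keys(sample_json)
print(keys)
```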

The Results:

Now that we know how to generate reports using pandas-profiling, let’s look at the result.

Overview:

[Figure: Overview]

[Figure: Warnings]

Pandas-profiling creates a very descriptive overview of the data set by calculating the total number of missing cells and duplicate rows, along with the number of distinct values, missing values, and zeros for each variable. It also flags variables that have high cardinality or missing values in the Warnings section, as you can see in the image above.
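
To get a feel for what these overview numbers mean, here is a minimal sketch that computes similar quantities with plain pandas. It is an illustration under my own assumptions, not the library’s implementation: the helper name dataset_overview, the sample data, and the cardinality threshold are all made up, and the threshold pandas-profiling actually uses may differ.

```python
import pandas as pd


def dataset_overview(df: pd.DataFrame, high_cardinality_threshold: int = 50) -> dict:
    """Compute a few overview figures by hand (illustrative helper)."""
    distinct = df.nunique()
    numeric = df.select_dtypes(include="number")
    zeros = (numeric == 0).sum()              # zeros only make sense for numeric columns
    non_numeric = df.select_dtypes(exclude="number").columns
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "distinct": {col: int(n) for col, n in distinct.items()},
        "zeros": {col: int(n) for col, n in zeros.items()},
        # Hypothetical rule: flag non-numeric columns with many distinct values.
        "high_cardinality": [
            col for col in non_numeric if distinct[col] > high_cardinality_threshold
        ],
    }


# Small made-up frame; the last row repeats the second one.
passengers = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Bob"],
    "Age": [22.0, 30.0, None, 30.0],
    "Fare": [7.25, 0.0, 8.05, 0.0],
})
overview = dataset_overview(passengers, high_cardinality_threshold=2)
print(overview)
```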

Besides all this, it generates a detailed analysis for each variable. I will go through some of them in this article; to see the full report with all the code, find the Colab link at the end of the article.

Class distribution:

[Figure: class distribution]

Numerical Features:

For the numerical features, besides detailed statistics like mean, standard deviation, min, max, and interquartile range (IQR), it also plots a histogram and lists the common and extreme values.

Categorical Features:

As with the numerical features, the report calculates common values, lengths, characters, and so on for the categorical features.

Interactions:

One of the most interesting parts of the report is the interactions and correlations sections. In the interactions section, the pandas_profiling library automatically generates interaction plots for every pair of variables. You can get the interaction plot of any pair by selecting the specific variables from the two headers (in this example, I have selected passengerId and Age).

Correlation Matrix:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people’s weights is related to their heights.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, there is little or no linear relationship between the variables (they may still be related in a non-linear way). If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).

When it comes to generating the correlation matrix for all the numerical features, the pandas_profiling library gives us all the popular options to choose from, including Pearson’s r, Spearman’s ρ, and so on.

[Figure: Correlations]
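
If you want to see how r behaves on a tiny example, the sketch below computes Pearson and Spearman correlation matrices with pandas’ own DataFrame.corr method. The numbers are made up for illustration. The height_squared column grows with height but not in a straight line, so Spearman (which looks at rank order) reports a perfect relationship while Pearson does not; the shortfall column falls as height rises, giving an inverse correlation.

```python
import pandas as pd

# Made-up measurements, used only for illustration.
people = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [52.0, 58.0, 69.0, 75.0, 90.0],
})
people["height_squared"] = people["height"] ** 2   # monotonic, but not linear
people["shortfall"] = 200.0 - people["height"]     # falls as height rises

pearson = people.corr(method="pearson")
spearman = people.corr(method="spearman")

print("Pearson:")
print(pearson.round(4))
print("Spearman:")
print(spearman.round(4))
```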

Now that we know the advantages of using pandas_profiling, it is also useful to note the disadvantage of this library.

Disadvantage:

The main disadvantage of pandas-profiling is its performance on large data sets. As the size of the data grows, the time needed to generate the report grows considerably as well.

One way to solve this problem is to generate the profile report for a part of the data set. But while doing this, it is very important to make sure that the data is randomly sampled so that it is representative of all the data we have. We can do this by:

from pandas_profiling import ProfileReport
# Generate report for 10000 randomly sampled data points
# (data.sample raises a ValueError if data has fewer than 10000 rows)
profile = ProfileReport(data.sample(n=10000), title="Titanic Data set", html={'style': {'full_width': True}}, sort="None")
# Save to file
profile.to_file(output_file='10000datapoints.html')
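
The sampling step itself can be made a little more robust. The sketch below uses only pandas: it caps the sample size at the number of rows available, so it does not fail on a small data set, and it fixes the random seed so the same rows are drawn on every run. The helper name sample_rows, the seed value, and the stand-in frame are my own choices for illustration.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Randomly sample up to n rows without replacement, reproducibly."""
    n = min(n, len(df))   # never ask for more rows than the frame has
    return df.sample(n=n, random_state=seed)


# Stand-in frame with 891 rows, used only for illustration.
small = pd.DataFrame({"value": range(891)})

capped = sample_rows(small, n=10000)   # falls back to all 891 rows, shuffled
subset = sample_rows(small, n=100)     # 100 randomly chosen rows
print(len(capped), len(subset))
```

The result of sample_rows can then be passed to ProfileReport in place of data.sample(n=10000).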

Alternatively, if you insist on getting the report for the whole data set, you can use the minimal mode. In minimal mode, a simplified report is generated with less information than the full one, but it can be generated relatively quickly for a large data set. The code is given below:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")

Conclusion:

Now that you know what pandas-profiling is and how to use it, I hope it will save you a ton of time, which you can use for more advanced analysis specific to the problem at hand.

If you want to get the full report with working code, you can take a look at the following notebook. And if you would like to read some of my other articles then you can find the links below.

Demo

Demo on Titanic Data set

colab.research.google.com

Pandas-Profiling GitHub repo:

pandas-profiling/pandas-profiling

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for…

github.com

If you loved this article, you may also like some of my other articles.

The Trap of tutorials and online courses

How tutorials and online courses can create an illusion of competence, and how not to fall into this trap

towardsdatascience.com

Machine Learning Case Study: A data-driven approach to predict the success of bank telemarketing

Predicting whether a customer will subscribe a term deposit or not given customer relationship data

towardsdatascience.com

What is ACM ICPC and how to prepare for it (the beginner’s guide)

What is ACM ICPC?

codeburst.io

About Me:

Hi, I am Sukanta Roy. A software developer, an aspiring Machine Learning Engineer, a former Google Summer of Code 2018 student and a huge psychology buff. If any of these things interest you, you can follow me on Medium or connect with me on LinkedIn.