
Interactive Distribution Plots with Plotly

 3 years ago
source link: https://towardsdatascience.com/interactive-distribution-plots-with-plotly-ea58efc78885?gi=ad92428cbf66

How to create informative distribution plots


Photo by Kym Ellis on Unsplash

Plotly Python (plotly.py) is an open-source plotting library built on Plotly JavaScript (plotly.js). Plotly Express is a high-level interface to plotly.py that allows us to create many interactive and informative visualizations.

In this post, we will create different types of distribution plots using Plotly Express. Distribution plots are informative tools that are widely used in statistical analysis as well as in the exploratory data analysis part of data science projects. As the name suggests, distribution plots show the distribution of values: which values are more likely to be observed, how much the values are spread out, which ranges are more densely populated, and so on.

We will first use a dataset that includes basic information about the courses on Coursera. The dataset is available on Kaggle. Let’s start by reading the data into a pandas dataframe.

import numpy as np
import pandas as pd
import plotly.express as px

coursera = pd.read_csv("coursera_data.csv")
print(coursera.shape)
coursera.head()


We need to do some data cleaning and manipulation. The “Unnamed: 0” column is just an ID, so we can drop it.

coursera.drop('Unnamed: 0', axis=1, inplace=True)

The “course_students_enrolled” column is not in a proper numerical format. For instance, 5.3k should be 5300. There are different ways to accomplish this task. Here we separate the last character of each value and convert it to a numerical multiplier (k for thousand, m for million), then multiply it with the numeric part to get the actual enrollment numbers.

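# The last character of each value is the unit suffix ("k" or "m")
coursera['unit'] = [x.strip()[-1] for x in coursera['course_students_enrolled']]

# Map each suffix to its multiplier (any other suffix becomes NaN)
unit_values = {'k': 1000, 'm': 1000000}
coursera['unit'] = coursera['unit'].map(unit_values)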


We also need to remove the letters “k” and “m” from the course_students_enrolled column, which can be done by slicing each string to exclude the last character.

coursera['course_students_enrolled'] = coursera['course_students_enrolled'].str.slice(0, -1)

Then we can multiply these two columns to create an “enrollment” column. To be able to multiply, both columns must have a numeric data type.

coursera['course_students_enrolled'] = coursera['course_students_enrolled'].astype("float")
coursera['enrollment'] = coursera['course_students_enrolled'] * coursera['unit']
coursera.drop(['course_students_enrolled','unit'], axis=1, inplace=True)
coursera.head()
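
The steps above (split off the suffix, map it to a multiplier, multiply) could also be wrapped in one small function. The following is only a minimal sketch of that idea: parse_enrollment and UNIT_VALUES are hypothetical names that do not appear in the original post, and the function assumes non-empty values that look like "5.3k", "1.2m" or a plain number.

```python
# Hypothetical helper, not part of the original post
UNIT_VALUES = {"k": 1_000, "m": 1_000_000}


def parse_enrollment(text):
    """Convert strings such as '5.3k' or '1.2m' to an integer count."""
    text = text.strip().lower()
    multiplier = UNIT_VALUES.get(text[-1])
    if multiplier is None:
        # No recognised suffix: treat the whole string as a plain number
        return int(round(float(text)))
    # Drop the suffix, scale, and round away floating point noise
    return int(round(float(text[:-1]) * multiplier))


print(parse_enrollment("5.3k"))
print(parse_enrollment("1.2m"))
```

Under the same assumptions, calling apply(parse_enrollment) on the raw course_students_enrolled column (before it is sliced and dropped) would produce the enrollment numbers without the intermediate unit column.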

Let’s first create a basic histogram of the enrollment column. To keep the plots readable, I removed the outliers, that is, the courses with 500k or more enrollments.

df = coursera[coursera.enrollment < 500000]
fig = px.histogram(df, x="enrollment")
fig.show()


Most of the courses have fewer than 100k enrollments. We can also check if course difficulty has any effect on enrollment.
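
A reading taken from a histogram can be backed up with a quick numeric check. The sketch below uses a handful of made-up enrollment values rather than the Coursera data, and the names toy, toy_df and share_below_100k are chosen only for illustration. It applies the same 500k filter and then computes the share of remaining courses below 100k.

```python
import pandas as pd

# Toy enrollment numbers (made up for illustration, not the Coursera data)
toy = pd.DataFrame({"enrollment": [5300, 17000, 48000, 130000, 320000, 830000]})

# Same outlier filter as in the post: keep courses below 500k enrollments
toy_df = toy[toy["enrollment"] < 500_000]

# A boolean mask averages to the fraction of True values
share_below_100k = (toy_df["enrollment"] < 100_000).mean()
print(f"{share_below_100k:.0%} of the remaining courses are below 100k")
```

The same two lines should work on the real df, assuming its enrollment column is numeric as prepared earlier.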

fig = px.histogram(df, x="enrollment", color="course_difficulty", 
                   facet_col="course_difficulty",
                   title="Course Enrollment Numbers")
fig.show()


Beginner-level courses have the highest enrollment numbers, and enrollment decreases as the difficulty increases. However, the shape of the distribution is similar within each difficulty level.

We can create a similar plot to visualize course ratings. Let’s also use the y-axis for a different purpose this time. By passing the enrollment column to the y parameter, each bar shows the total enrollment of the courses in that rating bin instead of the number of courses, so we can see the total number of enrollments in addition to the distribution of course ratings.

fig = px.histogram(df, x="course_rating", y="enrollment",
                   color="course_difficulty", 
                   facet_col="course_difficulty",
                   title="Course Ratings")
fig.show()
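
To make the difference between counting courses and summing enrollments concrete, the aggregation can be reproduced with a pandas groupby. This is only a sketch on made-up values (not the Coursera data), and it assumes each distinct rating falls into its own bin, which the actual plot does not guarantee.

```python
import pandas as pd

# Toy data (made up for illustration; column names follow the post)
toy = pd.DataFrame({
    "course_rating": [4.5, 4.5, 4.7, 4.7, 4.7, 4.8],
    "enrollment": [10000, 30000, 5000, 15000, 40000, 20000],
})

# Plain histogram of ratings: how many courses have each rating
course_counts = toy.groupby("course_rating")["enrollment"].count()

# With y="enrollment": total enrollment of the courses at each rating
total_enrollment = toy.groupby("course_rating")["enrollment"].sum()

summary = pd.DataFrame({"courses": course_counts,
                        "total_enrollment": total_enrollment})
print(summary)
```

px.histogram also accepts a histfunc argument (for example "avg"), which may be worth trying when an average per bin is more informative than a total.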


