
Valuable Data Analysis with Pandas Value Counts

 4 years ago
source link: https://mc.ai/valuable-data-analysis-with-pandas-value-counts/

The data

In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.

It can be downloaded from Kaggle.

In the code below I have imported the data and the libraries that I will be using throughout the article.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('KaggleV2-May-2016.csv')
data.head()
The first few rows of the Medical Appointments No-Show data set from Kaggle.com

Basic counts

The value_counts() function can be used in the following way to get a count of unique values for one column in the data set. The code below gives a count of each value in the Gender column.

data['Gender'].value_counts()
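To make the shape of the result concrete, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its values are hypothetical stand-ins for the Kaggle file, not real rows from it.

```python
import pandas as pd

# Hypothetical toy data standing in for the Gender column of the Kaggle file.
toy = pd.DataFrame({"Gender": ["F", "M", "F", "F", "M", "F"]})

# value_counts() returns a Series indexed by the unique values,
# with the most frequent value first.
gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```

Because the result is an ordinary Series, individual counts can be looked up by label, for example `gender_counts["F"]`.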

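By default, value_counts() sorts the counts in descending order, which is what the sort argument controls (sort=True is the default). In the code below I have passed sort=True explicitly to display the counts in the Age column in descending order. To get ascending order instead, pass ascending=True.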

data['Age'].value_counts(sort=True)
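The ordering options are easier to compare side by side. The sketch below uses a short, made-up list of ages (an assumption, not data from the Kaggle file) to show descending counts, ascending counts, and ordering by the value itself.

```python
import pandas as pd

# Hypothetical ages; not taken from the Kaggle data set.
ages = pd.Series([30, 45, 45, 62, 45, 30], name="Age")

# Default behaviour: most frequent value first (sort=True, ascending=False).
desc = ages.value_counts()

# Least frequent value first.
asc = ages.value_counts(ascending=True)

# Ordered by the age itself rather than by its count.
by_age = ages.value_counts().sort_index()

print(desc, asc, by_age, sep="\n\n")
```

Sorting by index is often the more readable choice for a numeric column such as Age, since the rows then follow the natural order of the values.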

Combine with groupby()

The value_counts() function can be combined with other Pandas functions for richer analysis. One example is to combine it with the groupby() function. In the example below I am grouping by the Gender column and counting the values in the No-show column to understand the number of no-shows in each group.

data['No-show'].groupby(data['Gender']).value_counts(sort=True)
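The grouped result is a Series with a two-level index (group, then value). A small sketch on invented rows shows how to read it; the `toy` frame is hypothetical and only mimics the column names of the Kaggle file.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the article.
toy = pd.DataFrame({
    "Gender":  ["F", "F", "F", "F", "M", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Count No-show values separately within each Gender group.
grouped_counts = toy["No-show"].groupby(toy["Gender"]).value_counts()
print(grouped_counts)

# Optional: pivot the inner index level into columns for a compact table.
table = grouped_counts.unstack()
print(table)
```

Individual cells can be read with a tuple label, for example `grouped_counts.loc[("F", "Yes")]`.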

Normalize

In the above example, displaying the absolute counts makes it hard to compare the two groups, because the groups differ in size. A better solution is to show the relative frequencies of the unique values in each group.

We can add the normalize argument to value_counts() to display the values in this way.

data['No-show'].groupby(data['Gender']).value_counts(normalize=True)
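With groupby(), normalize=True computes the shares within each group, so each group sums to 1. The sketch below checks this on a made-up frame (the rows are hypothetical, not from the Kaggle file).

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the article.
toy = pd.DataFrame({
    "Gender":  ["F", "F", "F", "F", "M", "M", "M"],
    "No-show": ["No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Relative frequency of each No-show value within each Gender group.
shares = toy["No-show"].groupby(toy["Gender"]).value_counts(normalize=True)
print(shares)

# Sanity check: the shares inside every group add up to 1.
group_totals = shares.groupby(level=0).sum()
print(group_totals)
```

Multiplying by 100 and rounding, for example `(shares * 100).round(1)`, turns the output into percentages.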

Binning

For columns where there are a large number of unique values the output of the value_counts() function is not always particularly useful. A good example of this would be the Age column which we displayed value counts for earlier in this post.

Fortunately, value_counts() has a bins argument. This parameter allows us to specify the number of bins (the groups we want to split the data into) as an integer. In the example below I have added bins=5 to split the Age values into 5 equal-width groups. We now have a count of values in each of these bins.

data['Age'].value_counts(bins=5)

Once again, showing absolute numbers is not particularly useful, so let’s add the normalize=True argument as well. Now we have the share of appointments that falls into each age band.

data['Age'].value_counts(bins=5, normalize=True)
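A small sketch of binning on invented ages follows. The values and the choice of three bins are assumptions made for illustration; the result is sorted by index so the bins appear in age order rather than by count.

```python
import pandas as pd

# Hypothetical ages; not taken from the Kaggle data set.
ages = pd.Series([1, 5, 12, 25, 33, 41, 58, 64, 77, 90], name="Age")

# Split the range of ages into 3 equal-width intervals and count each one.
# sort_index() lists the intervals from youngest to oldest.
binned = ages.value_counts(bins=3).sort_index()
print(binned)

# The same bins expressed as a share of all rows.
binned_share = ages.value_counts(bins=3, normalize=True).sort_index()
print(binned_share)
```

The index of the result is made of intervals, so each row reads as "ages in this range". Every value lands in exactly one bin, which means the counts add up to the number of rows.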

Combine with nlargest()

There are other columns in our data set which have a large number of unique values where binning is still not going to provide us with a useful piece of analysis. A good example of this would be the Neighbourhood column.

If we simply run value_counts() against this we get an output that is not particularly insightful.

data['Neighbourhood'].value_counts(sort=True)

A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining with another Pandas function called nlargest() as shown below.

data['Neighbourhood'].value_counts(sort=True).nlargest(10)

We can also use nsmallest() to display the bottom 10 neighbourhoods which might also prove useful.

data['Neighbourhood'].value_counts(sort=True).nsmallest(10)
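Here is a minimal sketch of both calls on made-up labels. The single-letter neighbourhood names are hypothetical, and the example asks for 3 and 2 rows rather than 10 only because the toy data is small.

```python
import pandas as pd

# Hypothetical neighbourhood labels with distinct frequencies.
hoods = pd.Series(
    ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"] * 1,
    name="Neighbourhood",
)

counts = hoods.value_counts()

# The three most common neighbourhoods, largest count first.
top3 = counts.nlargest(3)

# The two least common neighbourhoods, smallest count first.
bottom2 = counts.nsmallest(2)

print(top3)
print(bottom2)
```

Since value_counts() already returns the counts in descending order by default, `counts.head(3)` selects the same rows as `counts.nlargest(3)` here. nlargest() makes the intent explicit and still works if the counts were sorted differently beforehand.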

Plotting

Another handy combination is the Pandas plotting functionality together with value_counts(). Having the ability to display the analyses we get from value_counts() as visualisations can make it far easier to view trends and patterns.

We can display all of the above examples and more with most plot types available in the Pandas library. A full list of available options can be found in the Pandas plotting documentation.

Let’s look at a few examples.

We can use a bar plot to view the top 10 neighbourhoods.

data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()

We can make a pie chart to better visualise the Gender column.

data['Gender'].value_counts().plot.pie()
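Outside a notebook, the same plotting calls return a Matplotlib Axes object that can be labelled further. The sketch below assumes Matplotlib is installed, uses made-up neighbourhood labels, and selects the non-interactive Agg backend so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for script use

import pandas as pd

# Hypothetical neighbourhood labels; not taken from the Kaggle data set.
hoods = pd.Series(
    ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"] * 1,
    name="Neighbourhood",
)

# Bar chart of the three most common neighbourhoods.
ax = hoods.value_counts().nlargest(3).plot.bar()
ax.set_xlabel("Neighbourhood")
ax.set_ylabel("Number of appointments")
ax.set_title("Most common neighbourhoods (toy data)")
```

Each bar corresponds to one row of the value_counts() output, so anything that changes the Series (normalize=True, bins, nlargest) changes the chart in the same way.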
