Data Summarization Using Pandas In Python
source link: https://www.journaldev.com/54444/data-summarization-python-pandas
Pandas, Pandas and Pandas. When it comes to data manipulation and analysis, nothing can serve the purpose better than Pandas. In previous stories, we have learned many data operations using pandas. Today is another day where we are going to explore the data summarization topic using pandas in python. So, without wasting much time on the intro, let’s roll!
Data Summarization
Data summarization means extracting raw data and presenting it as a summary. Raw data on its own rarely makes sense to your audience, so breaking the data into subsets and then summarizing the insights can craft a neat story any day.
Pandas offers many functions, such as count, value_counts, crosstab, groupby, and more, to present raw data in an informative way.
Well, in this story, we are going to explore all the data summarization techniques using pandas in python.
Pandas Count
Pandas count is a very simple function that returns the number of data points. Its applications are limited compared to crosstab and groupby, but it is still useful in many situations.
Before we move forward, let’s install all the required libraries for data summarization in python.
#pandas
import pandas as pd
#numpy
import numpy as np
#matplotlib
import matplotlib.pyplot as plt
#seaborn
import seaborn as sns
Now, let’s load our Titanic data. The reason I am using this dataset is that its attributes make data summarization pretty easy to understand. So, whether you are a beginner or a pro, it will suit the purpose.
#titanic data
import pandas as pd
data = pd.read_csv('titanic.csv')
We can dig deep to understand the basic information about the data.
#data columns
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
#datatypes
data.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Well, we have both numerical and categorical data types in our data and it will spice up things for sure.
Now, it’s time to count the values present in both rows and columns.
#count of values in columns
data.count(axis=0)
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
You can see that most of the columns have 891 values, but columns such as Cabin and Age have fewer. That indicates the presence of null values or missing data. Let’s look at the rows for the same.
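A quick way to confirm exactly where those gaps are is isna().sum(), which counts missing entries per column. A minimal sketch on a small frame mirroring two of the Titanic columns (the values here are illustrative, not the real dataset):

```python
import pandas as pd
import numpy as np

# Small frame mirroring two Titanic columns, with deliberate gaps
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': ['C85', None, None, 'C123'],
})

# isna().sum() counts the missing entries in each column
missing = sample.isna().sum()
print(missing)  # Age: 1 missing, Cabin: 2 missing
```

On the real data, data.isna().sum() would show the same 177 missing Age values and 687 missing Cabin values implied by the counts above.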
#count of values in rows
data.count(axis=1)
0 11
1 12
2 11
3 12
4 11
..
886 11
887 12
888 10
889 12
890 11
Length: 891, dtype: int64
You can observe that not all the rows have the same number of values. An ideal row of this data should have 12 values.
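To identify exactly which rows are incomplete, you can compare each row’s count against the number of columns. A minimal sketch on a toy frame (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in different rows
df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [4, np.nan, np.nan],
    'c': [7, 8, 9],
})

row_counts = df.count(axis=1)              # non-null values per row
incomplete = df[row_counts < df.shape[1]]  # rows with at least one gap
print(row_counts.tolist())                 # [3, 2, 1]
print(len(incomplete))                     # 2 incomplete rows
```

The same filter on the Titanic data, data[data.count(axis=1) < 12], would surface every row with missing values.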
Index
You can observe or inspect the data by index level as well. Let’s use the set_index function for the same.
#set index
data = data.set_index(['Sex', 'Pclass'])
data.head(2)
That’s our data viewed at the index level!
Now, we have 2 attributes as our data index. So, let’s set the count level to ‘Sex’ to get that particular slice of the data.
#count level
data.count(level='Sex')
Similarly for ‘Pclass’
#count level
data.count(level='Pclass')
Note that recent pandas releases deprecate the level argument of count; the equivalent is data.groupby(level='Pclass').count().
That’s useful information to have before you move on to data modeling.
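Keep in mind that set_index moved ‘Sex’ and ‘Pclass’ out of the columns and into the index. To use them as ordinary columns again in the examples that follow, you can undo the operation with reset_index; a minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with the same columns used in the article's index
df = pd.DataFrame({
    'Sex': ['male', 'female'],
    'Pclass': [3, 1],
    'Fare': [7.25, 71.28],
})

indexed = df.set_index(['Sex', 'Pclass'])  # multi-index, as above
restored = indexed.reset_index()           # index levels back to columns
print(list(restored.columns))              # ['Sex', 'Pclass', 'Fare']
```

On the Titanic data, data = data.reset_index() restores the original flat layout.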
Pandas Value_counts
The value_counts function packs more functionality than count into one or two lines of code. It will definitely earn more respect in your eyes, as it can perform many groupby-style operations more seamlessly.
#value counts
data.value_counts(['Pclass'])
Pclass
3    491
1    216
2    184
dtype: int64
That’s cool. We now have information about all three classes and the values that belong to each of them.
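value_counts also accepts several columns at once and counts each combination. A small sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns
df = pd.DataFrame({
    'Pclass': [3, 3, 1, 2, 3],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
})

# Counts per (Pclass, Sex) combination, sorted descending by default
combo = df.value_counts(['Pclass', 'Sex'])
print(combo)  # (3, 'male') appears twice, every other pair once
```

On the Titanic data, data.value_counts(['Pclass', 'Sex']) would break each class down by gender in one call.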
One of the best features of the value_counts function is that you can even normalize the data.
#normalization
data.value_counts(['Pclass'], normalize=True, sort=True, ascending=True)
Pclass
2    0.206510
1    0.242424
3    0.551066
dtype: float64
Here, we have not only normalized the values but also sorted them in ascending order, which makes the table easier to read.
For a data attribute that has no levels in it, such as “Fare”, we can create bins. Let’s see how it works.
#bins
data['Fare'].value_counts(bins=5)
(-0.513, 102.466]     838
(102.466, 204.932]     33
(204.932, 307.398]     17
(409.863, 512.329]      3
(307.398, 409.863]      0
Name: Fare, dtype: int64
Well, we have created 5 bin ranges for “Fare”. Most of the ticket prices fall in the lowest bin, roughly 0 – 100.
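Under the hood, value_counts(bins=5) cuts the column into five equal-width intervals with pd.cut and counts each one. A sketch on a few illustrative fares (not the real Titanic values):

```python
import pandas as pd

# A handful of illustrative fares
fares = pd.Series([7.25, 71.28, 8.05, 53.1, 120.0, 512.33])

# Five equal-width intervals across the observed range
binned = pd.cut(fares, bins=5)
counts = binned.value_counts().sort_index()
print(counts)  # same result as fares.value_counts(bins=5, sort=False)
```

Because the bins are equal-width rather than equal-frequency, a skewed column like Fare piles most observations into the first interval, exactly as we saw above.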
Pandas Crosstab
A crosstab is a simple function that shows the relationship between two variables. It is very handy to quickly analyze two variables.
Now, let’s see the relationship between Sex and the Survivability of the passengers in the data.
#crosstab
pd.crosstab(data['Sex'], data['Survived'])
Survived    0    1
Sex
female     81  233
male      468  109
You can see a clear relationship between Sex and survivability. Plotting this table makes the pattern even more visible.
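One simple way to plot the crosstab is the DataFrame’s own bar chart. A sketch using illustrative counts matching the table above (the output filename is an assumption; on the real data you would pass the crosstab result directly):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import pandas as pd

# Counts matching the Sex x Survived table above
table = pd.DataFrame({0: [81, 468], 1: [233, 109]},
                     index=['female', 'male'])

ax = table.plot(kind='bar')  # grouped bars: one group per Sex
ax.set_xlabel('Sex')
ax.set_ylabel('Count')
plt.savefig('sex_survived.png')
```

With the live data, pd.crosstab(data['Sex'], data['Survived']).plot(kind='bar') produces the same chart in one line.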
The crosstab can do much more: we can add multiple data layers to it and even visualize the result.
#multiple layers crosstab
pd.crosstab([data['Pclass'], data['Sex']],
            [data['Embarked'], data['Survived']],
            rownames=['Pclass', 'gender'],
            colnames=['Embarked', 'Survived'],
            dropna=False)
There is a lot of information in just one table. That’s crosstab for you! Finally, let’s plot a heatmap of this table of counts and see how it looks.
#heatmap of counts
import seaborn as sns
sns.heatmap(pd.crosstab([data['Pclass'], data['Sex']],
                        [data['Embarked'], data['Survived']]),
            annot=True)
We get an annotated heatmap showing key information about the data at a glance.
Data Summarization – Conclusion
Data manipulation and analysis matter because they reveal key insights and hidden patterns in your data. In this regard, data summarization is one of the best techniques you can use to dig into your data for analysis.
That’s all for now and I hope this story helps you in your analysis. Happy Python!!!
Further reading: Data manipulation and statistical analysis