Data Summarization Using Pandas In Python
source link: https://www.journaldev.com/54444/data-summarization-python-pandas
Pandas, Pandas and Pandas. When it comes to data manipulation and analysis, nothing can serve the purpose better than Pandas. In previous stories, we have learned many data operations using pandas. Today is another day where we are going to explore the data summarization topic using pandas in python. So, without wasting much time on the intro, let’s roll!
Data Summarization
Data summarization means extracting raw data and presenting it as a summary. Raw data on its own rarely makes sense to your audience, so breaking the data into subsets and then summarizing the insights can craft a neat story any day.
Pandas offers many functions, such as count, value_counts, crosstab, groupby, and more, to present raw data in an informative way.
Well, in this story, we are going to explore all the data summarization techniques using pandas in python.
Pandas Count
Pandas count is a very simple function that returns the number of data points. Its applications are limited compared to crosstab and groupby, but it is still useful in many situations.
Before we move forward, let’s install all the required libraries for data summarization in python.
#pandas
import pandas as pd
#numpy
import numpy as np
#matplotlib
import matplotlib.pyplot as plt
#seaborn
import seaborn as sns
Now, let’s load our Titanic data. The reason I am using this dataset is that its attributes make data summarization pretty easy to understand. So, whether you are a beginner or a pro, it will suit the purpose.
#titanic data
import pandas as pd
data = pd.read_csv('titanic.csv')
We can dig deep to understand the basic information about the data.
#data columns
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
#datatypes
data.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Well, we have both numerical and categorical data types in our data and it will spice up things for sure.
Now, it’s time to count the values present in both rows and columns.
#count of values in columns
data.count(axis=0)
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
You can see that most of the columns have 891 values, but columns such as Cabin and Age have fewer. That indicates the presence of null values or missing data. Let’s look at the rows for the same.
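A quick way to confirm exactly where those gaps are is isna().sum(), which counts missing entries per column. A minimal sketch on a small frame mirroring two of the Titanic columns (the values here are illustrative, not the real dataset):

```python
import pandas as pd
import numpy as np

# Small frame mirroring two Titanic columns, with deliberate gaps
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': ['C85', None, None, 'C123'],
})

# isna().sum() counts the missing entries in each column
missing = sample.isna().sum()
print(missing)  # Age: 1 missing, Cabin: 2 missing
```

On the real data, data.isna().sum() would show the same 177 missing Age values and 687 missing Cabin values implied by the counts above.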
#count of values in rows
data.count(axis=1)
0 11
1 12
2 11
3 12
4 11
..
886 11
887 12
888 10
889 12
890 11
Length: 891, dtype: int64
You can observe that not all the rows have the same number of values. An ideal row of this data should have 12 values.
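To identify exactly which rows are incomplete, you can compare each row’s count against the number of columns. A minimal sketch on a toy frame (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in different rows
df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [4, np.nan, np.nan],
    'c': [7, 8, 9],
})

row_counts = df.count(axis=1)              # non-null values per row
incomplete = df[row_counts < df.shape[1]]  # rows with at least one gap
print(row_counts.tolist())                 # [3, 2, 1]
print(len(incomplete))                     # 2 incomplete rows
```

The same filter on the Titanic data, data[data.count(axis=1) < 12], would surface every row with missing values.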
Index
You can observe or inspect the data by index level as well. Let’s use the set_index function for the same.
#set index
data = data.set_index(['Sex', 'Pclass'])
data.head(2)
That’s our data viewed at the index level!
Now, we have 2 attributes as our data index. So, let’s set the count level to ‘Sex’ to get that particular slice of the data.
#count level
data.count(level='Sex')
Similarly for ‘Pclass’
#count level
data.count(level='Pclass')
Note that recent pandas releases deprecate the level argument of count; the equivalent is data.groupby(level='Pclass').count().
That’s useful information to have before you move on to data modeling.
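Keep in mind that set_index moved ‘Sex’ and ‘Pclass’ out of the columns and into the index. To use them as ordinary columns again in the examples that follow, you can undo the operation with reset_index; a minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with the same columns used in the article's index
df = pd.DataFrame({
    'Sex': ['male', 'female'],
    'Pclass': [3, 1],
    'Fare': [7.25, 71.28],
})

indexed = df.set_index(['Sex', 'Pclass'])  # multi-index, as above
restored = indexed.reset_index()           # index levels back to columns
print(list(restored.columns))              # ['Sex', 'Pclass', 'Fare']
```

On the Titanic data, data = data.reset_index() restores the original flat layout.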
Pandas Value_counts
The value_counts function packs more functionality than count into one or two lines of code. It will definitely earn more respect in your eyes, as it can perform many groupby-style operations more seamlessly.
#value counts
data.value_counts(['Pclass'])
Pclass
3    491
1    216
2    184
dtype: int64
That’s cool. We now have information about all three classes and the values that belong to each of them.
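value_counts also accepts several columns at once and counts each combination. A small sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns
df = pd.DataFrame({
    'Pclass': [3, 3, 1, 2, 3],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
})

# Counts per (Pclass, Sex) combination, sorted descending by default
combo = df.value_counts(['Pclass', 'Sex'])
print(combo)  # (3, 'male') appears twice, every other pair once
```

On the Titanic data, data.value_counts(['Pclass', 'Sex']) would break each class down by gender in one call.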
One of the best features of the value_counts function is that you can even normalize the data.
#normalization
data.value_counts(['Pclass'], normalize=True, sort=True, ascending=True)
Pclass
2    0.206510
1    0.242424
3    0.551066
dtype: float64
Here, we have not only normalized the values but also sorted them in ascending order, which makes the table easier to read.
For a data attribute that has no levels in it, such as “Fare”, we can create bins. Let’s see how it works.
#bins
data['Fare'].value_counts(bins=5)
(-0.513, 102.466]     838
(102.466, 204.932]     33
(204.932, 307.398]     17
(409.863, 512.329]      3
(307.398, 409.863]      0
Name: Fare, dtype: int64
Well, we have created 5 bin ranges for “Fare”. Most of the ticket prices fall in the lowest bin, roughly 0 – 100.
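Under the hood, value_counts(bins=5) cuts the column into five equal-width intervals with pd.cut and counts each one. A sketch on a few illustrative fares (not the real Titanic values):

```python
import pandas as pd

# A handful of illustrative fares
fares = pd.Series([7.25, 71.28, 8.05, 53.1, 120.0, 512.33])

# Five equal-width intervals across the observed range
binned = pd.cut(fares, bins=5)
counts = binned.value_counts().sort_index()
print(counts)  # same result as fares.value_counts(bins=5, sort=False)
```

Because the bins are equal-width rather than equal-frequency, a skewed column like Fare piles most observations into the first interval, exactly as we saw above.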
Pandas Crosstab
A crosstab is a simple function that shows the relationship between two variables. It is very handy to quickly analyze two variables.
Now, let’s see the relationship between Sex and the Survivability of the passengers in the data.
#crosstab
pd.crosstab(data['Sex'], data['Survived'])
Survived    0    1
Sex
female     81  233
male      468  109
You can see a clear relationship between Sex and survivability. Plotting this table makes the pattern even more visible.
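One simple way to plot the crosstab is the DataFrame’s own bar chart. A sketch using illustrative counts matching the table above (the output filename is an assumption; on the real data you would pass the crosstab result directly):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import pandas as pd

# Counts matching the Sex x Survived table above
table = pd.DataFrame({0: [81, 468], 1: [233, 109]},
                     index=['female', 'male'])

ax = table.plot(kind='bar')  # grouped bars: one group per Sex
ax.set_xlabel('Sex')
ax.set_ylabel('Count')
plt.savefig('sex_survived.png')
```

With the live data, pd.crosstab(data['Sex'], data['Survived']).plot(kind='bar') produces the same chart in one line.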
The crosstab can do much more: we can add multiple data layers to it and even visualize the result.
#multiple layers crosstab
pd.crosstab([data['Pclass'], data['Sex']],
            [data['Embarked'], data['Survived']],
            rownames=['Pclass', 'gender'],
            colnames=['Embarked', 'Survived'],
            dropna=False)
There is a lot of information in just one table. That’s crosstab for you! Finally, let’s plot a heatmap of this table of counts and see how it looks.
#heatmap of counts
import seaborn as sns
sns.heatmap(pd.crosstab([data['Pclass'], data['Sex']],
                        [data['Embarked'], data['Survived']]),
            annot=True)
We get an annotated heatmap showing key information about the data at a glance.
Data Summarization – Conclusion
Data manipulation and analysis matter because they reveal key insights and hidden patterns in your data. In this regard, data summarization is one of the best techniques you can use to dig into your data for analysis.
That’s all for now and I hope this story helps you in your analysis. Happy Python!!!
Further reading: Data manipulation and statistical analysis