Exploratory Data Analysis using Pandas
source link: https://towardsdatascience.com/exploratory-data-analysis-using-pandas-4f97de631456?gi=91c9fd99b198
In this article we will focus on the ‘Brazil’s Amazon Forest Fires Dataset’, perform some basic analysis using the Pandas library, and visualise the data using the Matplotlib and Seaborn libraries.
Pandas is a powerful Python library that lets us do almost anything with datasets: analysing, organising, cleaning, sorting, filtering, aggregating, calculating and more, which makes data analysis much easier.
Pandas — Python’s most powerful library for Data Analysis…
In this post, we will go over the ‘Amazon fires dataset’ (downloaded from Kaggle.com) and explore the pandas functionality that helps with Exploratory Data Analysis (EDA) through a few exercises, then visualise the data using Python’s visualisation libraries.
First, let’s dig into our Kaggle dataset, Forest Fires in Brazil (amazon.csv), which can be downloaded from Kaggle.
For those who are not familiar with Kaggle.com, here is a quick introduction.
Kaggle is a popular online community for data scientists and machine learning practitioners, who can take part in analytical competitions and build predictive models. It is also a great place to look for interesting datasets. It hosts all kinds of data, including image datasets, CSVs and time-series datasets, which are free to download.
What is this dataset about?
This dataset contains 20 years of data: the total number of forest fires reported in the Amazon rainforest (by Brazilian state) for the period 1998 to 2017.
The Amazon is a vast region that spans eight rapidly developing countries (Brazil, Bolivia, Peru, Ecuador, Colombia, Venezuela, Guyana and Suriname) as well as French Guiana, an overseas territory of France. The majority of the forest lies within Brazil, which holds about 60% of the rainforest.
Amazon rainforest wildfires
Now, let’s start our pandas exercises to explore this dataset and draw some insights.
Let’s follow the steps below:
- First we will import all the required Python libraries.
- Download the dataset from Kaggle, save it to our desktop and load it into a Jupyter notebook using Python.
- Analyse a few questions and visualise the data.
Before importing our libraries, let’s install them with the pip install command, either from a Jupyter notebook or from whichever Python environment we use for coding.
- Pandas: pip install pandas
- Numpy: pip install numpy
- Matplotlib: pip install matplotlib
- Seaborn: pip install seaborn
Let’s read the csv file using pandas and save it to a dataframe.
Now, let’s quickly check the top 10 rows of our dataset.
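As a minimal sketch of these two steps: the snippet below reads a tiny in-memory stand-in for amazon.csv so that it runs without the download. The sample rows are hypothetical, and only the column names come from the dataset description. With the real file you would pass its path to pd.read_csv instead; if pandas raises a UnicodeDecodeError, passing encoding="latin1" is a common workaround.

```python
import io

import pandas as pd

# Hypothetical stand-in for amazon.csv (made-up rows, real column names).
# With the downloaded file you would write something like:
#     amazon = pd.read_csv("amazon.csv")
sample_csv = io.StringIO(
    "year,state,month,number,date\n"
    "1998,Acre,January,0.0,1998-01-01\n"
    "1999,Acre,January,0.0,1999-01-01\n"
    "2000,Acre,January,0.0,2000-01-01\n"
)
amazon = pd.read_csv(sample_csv)

# head(10) returns the first 10 rows, or fewer if the dataframe is shorter
top_rows = amazon.head(10)
print(top_rows)
```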
Dataset description:
- Column 1- ‘year’ : Year when forest fires happened
- Column 2- ‘state’ : Brazilian State
- Column 3- ‘month’ : Month when forest fires happened
- Column 4- ‘number’ : Number of forest fires reported
- Column 5- ‘date’ : Date when forest fires were reported
The info() method gives us a quick overview of our dataset: the total number of rows and columns, the datatypes, and the number of non-null values in each column.
From the above output it is clear that we have a total of 6454 rows (the header line is not counted), 5 columns and no null fields, which is a good sign for our analysis.
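A small sketch of this check, using a hypothetical three-row miniature of the dataset rather than the real file. The shape attribute and isnull().sum() are shown as a complementary way to confirm the row, column and null counts.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "state": ["Acre", "Acre", "Acre"],
    "month": ["January", "January", "January"],
    "number": [18.566, 14.59, 11.068],
    "date": ["1998-01-01", "1999-01-01", "2000-01-01"],
})

# Prints the index range, column names, non-null counts and dtypes
amazon.info()

# (rows, columns); the header line is not counted as a row
print(amazon.shape)

# Number of missing values per column
null_counts = amazon.isnull().sum()
print(null_counts)
```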
Let’s have a look at the statistical summary of our dataset (for numeric values) using the describe() method.
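For illustration, here is describe() applied to a hypothetical miniature of the dataset. By default it summarises only the numeric columns, so here it covers ‘year’ and ‘number’.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "state": ["Acre", "Acre", "Acre"],
    "number": [18.566, 14.59, 11.068],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = amazon.describe()
print(summary)
```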
Data Cleaning :
Cleaning the data is the first and most important step, as it ensures the data is of good enough quality to visualise. After a thorough check of the dataset above, we see that the ‘number’ column (number of forest fires reported), which is of float type, contains decimal values such as 18.566, 14.59, 11.068 … These values are not rounded, which does not make sense for a count of forest fires. So let’s clean up this data using the round function and store the result back in our main dataframe.
First let’s apply the round() function to some sample data…
Before :
After :
Now we will apply round() to the entire dataset using numpy.
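One way this rounding step might look, sketched on a hypothetical three-row frame that reuses the decimal values quoted above. Using np.round on the whole ‘number’ column and assigning it back is an assumption about how the step was done; amazon["number"].round() would give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, using the values quoted in the text
amazon = pd.DataFrame({
    "state": ["Acre", "Acre", "Acre"],
    "number": [18.566, 14.59, 11.068],
})

# Round every value in the column and store it back in the main dataframe
amazon["number"] = np.round(amazon["number"])
print(amazon)
```

The column stays of float type after rounding; appending .astype(int) would convert it to integers if there are no missing values.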
Exercise 1: To check the minimum and maximum of the ‘year’ column
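A short sketch of this exercise on hypothetical data; the min() and max() methods of the column do the work.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "year": [1998, 2005, 2017, 2010],
    "number": [10.0, 20.0, 30.0, 40.0],
})

first_year = amazon["year"].min()
last_year = amazon["year"].max()
print(first_year, last_year)
```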
Exercise 2: To find the total number of fires in ‘Acre’ state and visualise the data by ‘year’
Before jumping into this exercise, I would like to look closely at an important Pandas concept called ‘Boolean Indexing’, which is very helpful when selecting subsets of data based on the actual values in the data.
Boolean Indexing:
Boolean indexing, as the name suggests, is used when we want to extract subsets of data from the dataframe based on some conditions. We can also combine multiple conditions, each wrapped in brackets, and apply them to the dataframe.
Let’s look at an example of how boolean indexing works in pandas. In our example, we’ll work with a dataframe of employees and their salaries:
import pandas as pd
df = pd.DataFrame({'EmployeeName': ['John','Sam','Sara','Nick','Bob','Julie'],
                   'Salary': [5000,8000,7000,10000,3000,5000]})
Let’s check which employees have a salary of 5000. First, we will perform a vectorised boolean operation that produces a boolean series:
salary_bool = df['Salary'] == 5000
Now, we can use this series to index the whole dataframe, leaving us with the rows that correspond only to employees whose salary is 5000
salary_5000 = df[salary_bool]
# that means it returns only those rows which are 'True'
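To illustrate the multiple-condition case mentioned earlier, here is a small sketch on the same employee dataframe. The salary range is chosen only for the example. Each condition sits in its own brackets and the conditions are combined with & (and) or | (or), because Python's plain and/or do not work element-wise on a series.

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeName": ["John", "Sam", "Sara", "Nick", "Bob", "Julie"],
    "Salary": [5000, 8000, 7000, 10000, 3000, 5000],
})

# Two conditions, each in brackets, combined element-wise with &
in_range = (df["Salary"] >= 5000) & (df["Salary"] < 8000)

# Keep only the rows where the combined condition is True
mid_salary = df[in_range]
print(mid_salary)
```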
I hope you now have some idea of how boolean indexing works. The pandas user guide has more on boolean indexing.
Now, let’s get started with our actual exercise, i.e. finding the total number of forest fires in ‘Acre’ state:
It’s clear from the above output that the total number of fires reported in ‘Acre’ state is 18463. Yes, this is a lot to take in the first time :). So let me break this code down and explain it step by step.
Step 1: Let’s use boolean indexing to build a condition that selects only the ‘Acre’ state and assign it to a variable called ‘amazon_acre’
amazon_acre = amazon['state'] == 'Acre'
‘amazon_acre’ will generate a series for us which shows True or False for each row based on our condition.
Step 2: Let’s use this series to index the entire dataset and assign the result to a variable called ‘amazon_acre_data’
amazon_acre_data = amazon[amazon_acre] # total 239 entries
First five rows of amazon_acre_data
Step 3: Next let’s select only the ‘number’ column from the above dataset
amazon_acre_number = amazon_acre_data['number']
# this will display only ‘number’ column values.
Step 4: Now we can apply the sum() function to the amazon_acre_number variable to find the total number of fires.
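The four steps can be sketched end to end as follows. The data is hypothetical, so the total differs from the real one. The last line shows one possible compact form using .loc, which is an assumption and may differ from the single-line version used earlier in the article.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "state": ["Acre", "Bahia", "Acre", "Bahia"],
    "number": [10.0, 5.0, 20.0, 7.0],
})

# Step 1: boolean series, True where the state is 'Acre'
amazon_acre = amazon["state"] == "Acre"

# Step 2: keep only the rows where the series is True
amazon_acre_data = amazon[amazon_acre]

# Step 3: select the 'number' column
amazon_acre_number = amazon_acre_data["number"]

# Step 4: add up the values
total_step_by_step = amazon_acre_number.sum()

# Compact equivalent: filter rows and pick the column in one .loc call
total_one_line = amazon.loc[amazon["state"] == "Acre", "number"].sum()

print(total_step_by_step, total_one_line)
```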
There you go! It’s always good practice to break the code into steps when we are exploring the data for the first time. Eventually, though, as our programming skills improve, we should work on reducing the number of lines of code.
Now let’s use the groupby() method on the ‘year’ column and get the total number of fires for each year.
In the output, we see that ‘year’ is marked as the index and ‘number’ as a column. To make the data simpler to visualise, we can use the reset_index() method to turn the index back into a regular column.
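A sketch of the grouping and the index reset, on a hypothetical ‘Acre’ subset. The name acre_fires_year matches the one used for the visualisation; selecting [["number"]] before sum() is an assumption made so that only that column is aggregated.

```python
import pandas as pd

# Hypothetical 'Acre' subset (made-up rows)
amazon_acre_data = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999, 1999],
    "state": ["Acre"] * 5,
    "number": [10.0, 15.0, 20.0, 5.0, 1.0],
})

# Total fires per year; 'year' becomes the index of the result
acre_fires_year = amazon_acre_data.groupby("year")[["number"]].sum()
print(acre_fires_year)

# Turn the 'year' index back into an ordinary column
acre_fires_year = acre_fires_year.reset_index()
print(acre_fires_year)
```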
Visualisation of above dataset:
Importing libraries
Matplotlib is Python’s data visualisation library. It lets us visualise data and is very easy to get started with for simple plots. Matplotlib offers several plot types such as line, bar, scatter and histogram. I recommend exploring the official Matplotlib webpage for more info.
Seaborn is a popular Python statistical visualisation library built on top of Matplotlib. Check out the official seaborn webpage for all the different types of seaborn plots.
Let’s visualise the ‘acre_fires_year’ dataset using matplotlib and seaborn (barplot).
The highest number of fires was reported in 2002.
Note: plt.figure() creates a figure object, which we use here to customise the size of the chart.
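A minimal sketch of such a bar plot. The yearly totals, figure size, colour and labels are all assumptions made for the example, not values from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical yearly totals (made-up numbers)
acre_fires_year = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "number": [120, 300, 210],
})

# plt.figure() creates the figure; figsize sets its width and height in inches
fig = plt.figure(figsize=(12, 4))

# seaborn draws the bars on the current matplotlib axes and returns them
ax = sns.barplot(x="year", y="number", data=acre_fires_year, color="steelblue")
ax.set_title("Fires reported in Acre per year (hypothetical data)")
ax.set_xlabel("Year")
ax.set_ylabel("Number of fires")
```

In a Jupyter notebook the chart is rendered at the end of the cell; in a plain script, call plt.show() afterwards.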
Exercise 3: To find the total number of fires in all states
For this, let’s use groupby() on the ‘state’ column and find the total number of fires for each state.
It’s clear from the above plot that most forest fires occurred in ‘Mato Grosso’, followed by ‘Paraiba’ and ‘Sao Paulo’ states.
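A sketch of the per-state totals on hypothetical data. Sorting in descending order is an addition here, included so that the states with the most fires come first.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "state": ["Mato Grosso", "Acre", "Mato Grosso", "Bahia"],
    "number": [50.0, 10.0, 40.0, 25.0],
})

# Total fires per state, largest first, with 'state' as a regular column
state_fires = (
    amazon.groupby("state")["number"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(state_fires)
```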
Exercise 4: To find the total number of fires in 2017 and visualise the data by ‘month’
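One way this exercise could be approached, sketched on hypothetical data: filter the rows for 2017 with boolean indexing, then group by ‘month’. Passing sort=False keeps the months in the order they first appear, which is calendar order only if the rows are already ordered that way. The resulting frame can then be passed to a bar plot.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "year": [2016, 2017, 2017, 2017, 2017],
    "state": ["Acre", "Acre", "Bahia", "Acre", "Bahia"],
    "month": ["January", "January", "January", "February", "February"],
    "number": [4.0, 10.0, 5.0, 20.0, 7.0],
})

# Keep only the rows for 2017
fires_2017 = amazon[amazon["year"] == 2017]
total_2017 = fires_2017["number"].sum()

# Total per month, keeping first-appearance order instead of alphabetical
fires_2017_month = (
    fires_2017.groupby("month", sort=False)["number"].sum().reset_index()
)
print(total_2017)
print(fires_2017_month)
```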
Exercise 5: To find the average number of fires that occurred
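A short sketch on hypothetical data. It assumes the exercise means the average of the ‘number’ column over all rows; a per-state average is shown as well, in case that reading is the intended one.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "state": ["Acre", "Acre", "Bahia", "Bahia"],
    "number": [10.0, 20.0, 30.0, 40.0],
})

# Average over every row of the 'number' column
average_fires = amazon["number"].mean()

# Average per state, as an alternative reading of the exercise
average_by_state = amazon.groupby("state")["number"].mean()

print(average_fires)
print(average_by_state)
```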
Exercise 6: To find the names of the states where fires occurred in the month of ‘December’
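A possible sketch on hypothetical data. It assumes that "fires occurred" means a ‘number’ greater than zero, and that the month is spelled ‘December’ in the ‘month’ column. If the file spells months differently, for example in Portuguese, the string needs adjusting.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up rows)
amazon = pd.DataFrame({
    "state": ["Acre", "Bahia", "Acre", "Ceara"],
    "month": ["December", "December", "November", "December"],
    "number": [3.0, 0.0, 8.0, 12.0],
})

# Two conditions combined: December rows with at least one fire reported
december_fires = amazon[(amazon["month"] == "December") & (amazon["number"] > 0)]

# unique() drops repeated state names
december_states = december_fires["state"].unique()
print(december_states)
```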