Exploratory Data Analysis using Pandas

 4 years ago
source link: https://towardsdatascience.com/exploratory-data-analysis-using-pandas-4f97de631456?gi=91c9fd99b198

In this article we will focus on the ‘Brazil’s Amazon Forest Fires’ dataset, perform some basic analysis using the Pandas library, and visualise the data using the Matplotlib and Seaborn libraries.

Pandas is a powerful Python library that lets us do almost anything with datasets: analysing, organising, cleaning, sorting, filtering, aggregating, calculating and more, which makes data analysis much easier.

[Image: Pandas, Python’s most powerful library for data analysis]

In this post, we will go over the ‘Amazon fires dataset’ (downloaded from Kaggle.com) and explore the pandas functionality that helps with Exploratory Data Analysis (EDA). We will work through a few exercises and then visualise the data using Python’s visualisation libraries.

First, let’s dig into our Kaggle dataset, Forest Fires in Brazil (amazon.csv), which is available for download on Kaggle.

For those who are not familiar with Kaggle.com, here is a quick introduction.

Kaggle is a popular online community where data scientists and machine learning practitioners take part in analytics competitions and build predictive models. It is also a great place to look for interesting datasets: it hosts all kinds of data, including image datasets, CSVs and time-series datasets, which are free to download.

What is this dataset about?

This dataset contains roughly 20 years of data, from 1998 to 2017, recording the total number of forest fires reported across Brazilian states.

The Amazon is a vast region that spans eight countries (Brazil, Bolivia, Peru, Ecuador, Colombia, Venezuela, Guyana and Suriname) as well as French Guiana, an overseas territory of France. The majority of the Amazon rainforest, about 60%, lies within Brazil.

[Image: Amazon rainforest wildfires]

Now, let’s start our pandas exercises to explore this dataset and draw some insights.

Let’s follow the steps below:

  • First, import all the required Python libraries.
  • Download the dataset from Kaggle, save it locally and load it into a Jupyter notebook using Python.
  • Work through a few questions and visualise the data.

Before importing our libraries, let’s install them with the pip install command, either from the Jupyter notebook or from whichever Python environment we use for coding.

Pandas     : pip install pandas
Numpy      : pip install numpy
Matplotlib : pip install matplotlib
Seaborn    : pip install seaborn

Let’s read the CSV file using pandas and save it to a dataframe.

Now, let’s quickly check the top 10 rows of our dataset.
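
The original code for these two steps was not preserved, so here is a minimal sketch of what it could look like. The in-memory sample_csv text is a hypothetical stand-in for amazon.csv (its values are made up) so that the snippet runs anywhere; the dataframe name amazon matches the name used later in this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for amazon.csv; the rows below are made-up values.
sample_csv = io.StringIO(
    "year,state,month,number,date\n"
    "1998,Acre,January,0.0,1998-01-01\n"
    "1999,Acre,January,0.0,1999-01-01\n"
    "2000,Acre,January,0.0,2000-01-01\n"
    "2001,Acre,January,0.0,2001-01-01\n"
)

# With the real file, pass its path instead, e.g. pd.read_csv("amazon.csv")
amazon = pd.read_csv(sample_csv)

# head(n) returns the first n rows (or fewer if the frame is shorter)
print(amazon.head(10))
```

If pandas raises a UnicodeDecodeError on the downloaded file, passing an explicit encoding such as encoding="latin1" to read_csv is a common workaround.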

Dataset description:

  • Column 1- ‘year’ : Year when forest fires happened
  • Column 2- ‘state’ : Brazilian State
  • Column 3- ‘month’ : Month when forest fires happened
  • Column 4- ‘number’ : Number of forest fires reported
  • Column 5- ‘date’ : Date when forest fires were reported

The info() method gives us a quick overview of our dataset: the total number of rows and columns, the data type of each column, and the number of non-null values per column.
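
As a rough sketch (the small dataframe below is hypothetical sample data, not the real file), calling info() and then counting nulls explicitly could look like this:

```python
import pandas as pd

# Hypothetical sample with the same columns as the article's dataset
amazon = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "state": ["Acre", "Acre", "Bahia"],
    "month": ["January", "January", "January"],
    "number": [0.0, 3.0, 12.5],
    "date": ["1998-01-01", "1999-01-01", "2000-01-01"],
})

# Prints the row count, column names, non-null counts and dtypes
amazon.info()

# info() reports non-null counts; isnull().sum() gives the nulls per column
null_counts = amazon.isnull().sum()
print(null_counts)
```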

From the above output it is clear that we have a total of 6454 rows (info() counts data rows only, not the header), 5 columns and no null fields, which is a good sign for our analysis.

Let’s have a look at the statistical summary of our dataset (for numeric columns) using the describe() method.
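
A minimal sketch of describe() on hypothetical sample values (the numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample; only numeric columns appear in describe() by default
amazon = pd.DataFrame({
    "year": [1998, 1999, 2000, 2001],
    "state": ["Acre", "Acre", "Bahia", "Bahia"],
    "number": [0.0, 10.0, 20.0, 30.0],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = amazon.describe()
print(summary)
```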

Data Cleaning:

Cleaning up data is the first and most important step, as it ensures the data is of good enough quality to analyse and visualise. After a thorough check of the dataset above, we see that the ‘number’ column (number of forest fires reported), which is of float type, contains decimal values such as 18.566, 14.59, 11.068 … These values are not rounded, and a fractional value makes no sense for a count of reported forest fires. So let’s clean up this data using the round function and store the result back in our main dataframe.

First, let’s apply the round() function to a sample of the data and compare the values before and after rounding.

Now we will apply rounding to the entire ‘number’ column using numpy.
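
The original snippet is not available, so the following is only a sketch of one way to do it with np.round. The three sample values are the ones quoted above; the state names are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample values taken from the article; state names are made up
amazon = pd.DataFrame({
    "state": ["Acre", "Bahia", "Ceara"],
    "number": [18.566, 14.59, 11.068],
})

before = amazon["number"].tolist()

# Round every value to the nearest whole number and store it back
amazon["number"] = np.round(amazon["number"])

after = amazon["number"].tolist()
print("Before:", before)
print("After: ", after)
```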

The ‘number’ column values are now corrected.

Exercise 1: Check the minimum and maximum of the ‘year’ column
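
A minimal sketch on hypothetical sample years (on the real dataframe the same two calls would be applied to amazon['year']):

```python
import pandas as pd

# Hypothetical sample of the 'year' column
amazon = pd.DataFrame({"year": [1998, 2005, 2017, 2010]})

first_year = amazon["year"].min()
last_year = amazon["year"].max()
print(first_year, last_year)
```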

Exercise 2: Find the total number of fires in ‘Acre’ state and visualise the data by ‘year’

Before jumping into this exercise, I would like to take a closer look at an important pandas concept called ‘Boolean Indexing’, which is very helpful when selecting subsets of data based on the actual values in the data.

Boolean Indexing:

Boolean indexing, as the name suggests, is used when we want to extract subsets of data from a dataframe based on some conditions. We can also combine multiple conditions, each wrapped in brackets, and apply them to the dataframe.

Let’s look at an example of how boolean indexing works in pandas. In this example, we’ll work with a dataframe of employees and their salaries:

import pandas as pd
df = pd.DataFrame({'EmployeeName':  ['John','Sam','Sara','Nick','Bob','Julie'],
                   'Salary':[5000,8000,7000,10000,3000,5000]})

Let’s check which employees have a salary of 5000. First, we will perform a vectorised boolean operation that produces a boolean series:

salary_bool = df['Salary'] == 5000

Now, we can use this series to index the whole dataframe, leaving us with only the rows for employees whose salary is 5000:

salary_5000 = df[salary_bool]
# returns only the rows where salary_bool is True

I hope you now have some idea of how boolean indexing works. The pandas user guide covers boolean indexing in more detail.
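
To illustrate the point about multiple conditions, here is a small sketch that reuses the employee dataframe from above. The salary range and the name mid_range are my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'EmployeeName': ['John', 'Sam', 'Sara', 'Nick', 'Bob', 'Julie'],
                   'Salary': [5000, 8000, 7000, 10000, 3000, 5000]})

# Each condition goes in its own brackets; & means "and", | means "or"
mid_range = df[(df['Salary'] >= 5000) & (df['Salary'] < 8000)]
print(mid_range)
```

The brackets matter because & binds more tightly than the comparison operators in Python, so leaving them out changes the meaning of the expression.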

Now, let’s get started with our actual exercise, i.e. finding the total number of forest fires in ‘Acre’ state:
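
The original one-line solution was not preserved; a sketch of what it might look like is shown here. The sample rows are hypothetical, so the printed total will not match the figure for the real dataset.

```python
import pandas as pd

# Hypothetical sample rows, not the real data
amazon = pd.DataFrame({
    "state": ["Acre", "Acre", "Bahia", "Acre"],
    "number": [10.0, 5.0, 7.0, 3.0],
})

# Filter the 'Acre' rows, select 'number' and add it up, all in one line
acre_total = amazon[amazon["state"] == "Acre"]["number"].sum()
print(acre_total)
```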

The output shows that the total number of fires reported in ‘Acre’ state is 18463. That is a lot to take in at first, so let me break this code down and explain it step by step.

Step 1: Let’s build a boolean condition that is True only for ‘Acre’ rows and assign it to a variable called ‘amazon_acre’

amazon_acre = amazon['state'] == 'Acre' 

‘amazon_acre’ is a series that holds True or False for each row, based on our condition.

Step 2: Let’s use this series to index the entire dataset and assign the result to a variable called ‘amazon_acre_data’

amazon_acre_data = amazon[amazon_acre] # total 239 entries

[Image: first five rows of amazon_acre_data]

Step 3: Next, let’s select only the ‘number’ column from the above dataset

amazon_acre_number = amazon_acre_data['number'] # selects only the 'number' column values

Step 4: Now we can apply the sum() function to the amazon_acre_number variable to find the total number of fires.
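
Putting the four steps together on a small hypothetical sample (the variable names follow the steps above; total_fires_acre is a name I introduce for illustration):

```python
import pandas as pd

# Hypothetical sample rows, not the real data
amazon = pd.DataFrame({
    "state": ["Acre", "Acre", "Bahia", "Acre"],
    "number": [10.0, 5.0, 7.0, 3.0],
})

amazon_acre = amazon['state'] == 'Acre'          # Step 1: boolean series
amazon_acre_data = amazon[amazon_acre]           # Step 2: rows where it is True
amazon_acre_number = amazon_acre_data['number']  # Step 3: only the 'number' column
total_fires_acre = amazon_acre_number.sum()      # Step 4: add the values up
print(total_fires_acre)
```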

There you go! It is good practice to break the code into steps when we are exploring data for the first time. Eventually, though, as our programming skills improve, we should work on writing the same logic in fewer lines.

Now let’s use the groupby() method on the ‘year’ column to get the total number of fires for each year.

In the output, ‘year’ is the index and ‘number’ is a column. To make the data simpler to visualise, we can use the reset_index() method to turn the index back into a regular column.
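
A sketch of both steps on hypothetical ‘Acre’ rows. The name acre_fires_year is the one the article uses for the result; acre_by_year is an intermediate name I introduce here.

```python
import pandas as pd

# Hypothetical subset for 'Acre'; values are made up
amazon_acre_data = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "state": ["Acre", "Acre", "Acre", "Acre"],
    "number": [2.0, 3.0, 10.0, 1.0],
})

# Total fires per year; 'year' becomes the index of the result
acre_by_year = amazon_acre_data.groupby("year")[["number"]].sum()
print(acre_by_year)

# reset_index() moves 'year' back into a regular column
acre_fires_year = acre_by_year.reset_index()
print(acre_fires_year)
```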

Visualisation of the above dataset:

Importing libraries

Matplotlib is Python’s data visualisation library. It lets us visualise data and is very easy to get started with for simple plots. Matplotlib offers several plot types such as line, bar, scatter and histogram. I recommend exploring the official Matplotlib webpage for more info.

Seaborn is a popular statistical visualisation library for Python, built on top of Matplotlib. Check out the official seaborn webpage for all the different types of seaborn plots.

Let’s visualise the ‘acre_fires_year’ dataset using matplotlib and a seaborn barplot.
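
The plotting code itself was not preserved, so this is only a sketch under a few assumptions: the values are hypothetical, the figure size and title are my own choices, and the snippet falls back to a plain Matplotlib bar chart if seaborn is not installed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

try:
    import seaborn as sns
except ImportError:  # fall back to plain Matplotlib if seaborn is missing
    sns = None

# Hypothetical yearly totals for 'Acre'
acre_fires_year = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "number": [5.0, 11.0, 8.0],
})

fig = plt.figure(figsize=(12, 4))  # figure object, used here to set the size
ax = fig.add_subplot(111)

if sns is not None:
    sns.barplot(x="year", y="number", data=acre_fires_year, ax=ax)
else:
    ax.bar(acre_fires_year["year"].astype(str), acre_fires_year["number"])

ax.set_title("Fires reported in Acre per year (hypothetical values)")
ax.set_xlabel("year")
ax.set_ylabel("number")
fig.savefig("acre_fires_year.png")
```

Inside a notebook, the savefig call can be replaced with plt.show() to display the chart inline.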

The most fires were reported in 2002.

Note: plt.figure() creates a figure object, which we use here to customise the size of the chart.

Exercise 3: Find the total number of fires in all states

For this, let’s use groupby() on the ‘state’ column and find the total number of fires for each state.
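
A sketch on hypothetical rows; sorting the totals in descending order is my own addition to make the ranking easy to read.

```python
import pandas as pd

# Hypothetical sample rows; the totals are made up
amazon = pd.DataFrame({
    "state": ["Mato Grosso", "Acre", "Mato Grosso", "Paraiba", "Acre"],
    "number": [100.0, 5.0, 50.0, 60.0, 10.0],
})

# Total fires per state, as a regular dataframe sorted from highest to lowest
fires_by_state = (
    amazon.groupby("state")[["number"]]
    .sum()
    .reset_index()
    .sort_values("number", ascending=False)
)
print(fires_by_state)
```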

It’s clear from the plot that most forest fires occurred in ‘Mato Grosso’, followed by ‘Paraiba’ and ‘Sao Paulo’.

Exercise 4: Find the total number of fires in 2017 and visualise the data by ‘month’
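
One possible approach, sketched on hypothetical rows: filter the 2017 rows with boolean indexing, then group them by month. The names amazon_2017, total_2017 and fires_2017_month are mine, and the month labels are assumed to be in English as in the exercises.

```python
import pandas as pd

# Hypothetical sample rows; values are made up
amazon = pd.DataFrame({
    "year": [2016, 2017, 2017, 2017, 2017],
    "month": ["January", "January", "February", "January", "February"],
    "number": [9.0, 4.0, 6.0, 1.0, 2.0],
})

amazon_2017 = amazon[amazon["year"] == 2017]
total_2017 = amazon_2017["number"].sum()

# sort=False keeps months in the order they first appear in the data
fires_2017_month = (
    amazon_2017.groupby("month", sort=False)[["number"]].sum().reset_index()
)
print(total_2017)
print(fires_2017_month)
```

The resulting fires_2017_month dataframe can be passed to a bar plot in the same way as the yearly totals in Exercise 2.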

Exercise 5: Find the average number of fires reported
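
A minimal sketch, assuming the average is taken over every row of the ‘number’ column (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample of the 'number' column
amazon = pd.DataFrame({"number": [0.0, 10.0, 20.0, 30.0]})

# mean() averages all values in the column
average_fires = amazon["number"].mean()
print(average_fires)
```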

Exercise 6: Find the names of the states where fires occurred in the month of ‘December’
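
No solution survived for this exercise, so the following is just one way to approach it. It makes two assumptions: the month is labelled ‘December’ (adjust the label if your copy of the file names months differently), and a state counts only if it reported more than zero fires in a December row. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; values are made up
amazon = pd.DataFrame({
    "state": ["Acre", "Bahia", "Acre", "Ceara", "Bahia"],
    "month": ["December", "December", "January", "December", "December"],
    "number": [3.0, 0.0, 7.0, 5.0, 2.0],
})

# Two conditions combined with &: December rows with at least one fire
december_fires = amazon[(amazon["month"] == "December") & (amazon["number"] > 0)]

# unique() drops repeated state names and keeps first-seen order
december_states = december_fires["state"].unique()
print(december_states)
```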

